How Safe are Internet Website Directories?

January 23rd, 2010

Recently, we told you that Dmoz.org, one of the largest user-edited directories on the Internet, is also one of the safest directories. Directories such as Dmoz.org contain links to anywhere from hundreds of thousands to millions of sites, categorized by volunteers or through automated means. Many search engines, including Google, Hotbot and others, potentially use data from these directories. The directories also serve as efficient lookup services for thousands of web surfers who want to locate sites that belong to a very specific category.

Given the important role that these directories play in the Internet, one would expect that they would make an attempt to point only to websites which are “safe.” By “safe,” we mean sites which have not been injected with malware, via code-injection attacks or other attack vectors.

We are not picking on Dmoz.org here. We were very impressed to see that none of the 2.8 million sites we profiled were present on the Google Safe Browsing List. This could indicate that the owners of sites listed on Dmoz.org are concerned about their image, hence care about their visitors, and take appropriate precautions against malware.

To follow up on our previous article, we have further analyzed 10,000 sites, randomly chosen from the Dmoz.org corpus of nearly 2.8 million websites. Each of the 10,000 sites was tested against each of the website reputation services listed below.
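As a rough illustration of drawing such a sample, the sketch below picks 10,000 distinct entries uniformly at random from a list of site names. The article does not say how the sample was actually drawn; `sample_sites`, the seed, and the placeholder corpus are hypothetical, and the demo corpus is kept much smaller than the real 2.8 million entries.

```python
import random


def sample_sites(sites, k=10_000, seed=None):
    """Return k distinct sites drawn uniformly at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(sites, k)


# Hypothetical stand-in for the Dmoz.org corpus (the real one has ~2.8 million sites).
corpus = [f"site{i}.example" for i in range(50_000)]
sample = sample_sites(corpus, seed=2010)
print(len(sample), len(set(sample)))
```

Sampling without replacement guarantees that no site is tested twice, which keeps the percentages reported for the 10,000 sites easy to interpret.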

Note: When analyzing a domain name or URL for verification with the Google Safe Browsing List, we calculated the hash of the website name and matched it against the list. The test was conducted between January 19th and January 21st, 2010. The list of domain names tested is presented at the end of this article.

We identify the most interesting results below:

  1. McAfee SiteAdvisor marked 0.39% of domains as Unsafe, 84.23% as Safe, 15.08% as Untested and 0.3% as Potentially-Unsafe.
  2. Norton Safe Web marked 0.39% of domains as Unsafe, 59.02% as Safe, 39.79% as Untested and 0.8% as Potentially-Unsafe.
  3. Google Safe Browsing marked 0.02% of domains as Unsafe, 99.98% as Safe.
    Note: The presence of the hash of the domain name being tested, on the Google Safe Browsing List, is interpreted as “Unsafe” while its absence is interpreted as “Safe.”
  4. Microsoft Bing marked 0.06% of domains as Unsafe, 93.2% as Safe, and 6.74% as Untested.
  5. Comodo Site Inspector marked 0.08% of domains as Unsafe, 99.46% as Safe, and 0.44% as Unreachable.
    Note: We were only able to test the first 5000 URLs with Comodo Site Inspector.

McAfee SiteAdvisor and Norton Safe Web each seem to flag nearly 19 times more websites as “Unsafe to Visit” than Google (0.39% versus 0.02%), and nearly 6 times more than Bing (0.39% versus 0.06%). In other words, there is roughly an order-of-magnitude difference in the number of websites marked as “Unsafe to Visit” by these competing services.
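The arithmetic behind these comparisons can be checked with a short sketch. The percentages are the ones reported above; the helper names `ratio` and `unsafe_count` are hypothetical, and the counts assume the full 10,000-site sample (Comodo, which only covered the first 5,000 URLs, is left out of the count example).

```python
# "Unsafe" percentages as reported in the results list.
unsafe_pct = {
    "McAfee SiteAdvisor": 0.39,
    "Norton Safe Web": 0.39,
    "Google Safe Browsing": 0.02,
    "Microsoft Bing": 0.06,
    "Comodo Site Inspector": 0.08,
}

SAMPLE_SIZE = 10_000


def ratio(service_a, service_b):
    """How many times more sites service_a flags as Unsafe than service_b."""
    return unsafe_pct[service_a] / unsafe_pct[service_b]


def unsafe_count(service, sample_size=SAMPLE_SIZE):
    """Approximate number of sites flagged, given the reported percentage."""
    return round(unsafe_pct[service] / 100 * sample_size)


print(round(ratio("McAfee SiteAdvisor", "Google Safe Browsing"), 1))  # 19.5
print(round(ratio("McAfee SiteAdvisor", "Microsoft Bing"), 1))        # 6.5
print(unsafe_count("McAfee SiteAdvisor"), unsafe_count("Google Safe Browsing"))  # 39 2
```

Seen as counts, the gap is 39 flagged sites against 2 for Google and 6 for Bing out of 10,000.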

We would like to know how long McAfee, Norton or Bing cache results for a particular site. Google allows webmasters to request a review when they believe their site has been disinfected, and Comodo’s service seems to be an on-demand service. This makes an interesting starting point for a future experiment. Further, it would be interesting to see how these services classify sites listed in the Yahoo Directory and other directories.

Report, Security

Where Can You Find (2.8 million) Safe Websites?

January 19th, 2010

Hackers are hitting websites hard and fast. Every day, upwards of 6,000 new websites are compromised by malware due to code injection, FTP credential compromise, weak server security, web-application flaws and the full gamut of other security issues.

In this vein, any system used to determine whether a website is clean or infected needs to be able to handle large numbers of sites for analysis. This ability ensures a high throughput rate when analyzing “suspect” sites.

One of our goals at StopTheHacker.com is to reach throughput rates in excess of 1,000,000 sites per day. This obviously necessitates an automated process with high reliability and accuracy (we have it). To develop such an automated process, we focus heavily on advanced Machine Learning and Artificial Intelligence techniques that learn on the fly from compromised websites and update themselves to catch even more bad websites.

In order to develop training sets for machine-learning based automated solutions, one needs to get hold of a massive dataset. We recently profiled over 2.8 million websites (2,800,560 to be exact). What dataset is this? All these profiled sites were sourced from DMOZ. Surprisingly, none of these websites are listed in the Google Safe Browsing List as of January 19, 2010.
Note: DMOZ is a user-edited directory of sites (which provided a good starting point for this experiment).

Each website is classified according to a categorization scheme described here. We used the description to download and analyze around 2.8 million sites. Each site name was fed to a program that calculated a hash of the site name and looked it up on the Google Safe Browsing List to determine whether the website was on the malware list.
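The lookup step can be sketched roughly as follows. This is not the actual program or the real Safe Browsing protocol: the choice of SHA-256, the simple `normalize` step, and the in-memory `blocklist` set are all assumptions made for illustration, and the real list has its own canonicalization and hashing rules.

```python
import hashlib


def normalize(site):
    """Lowercase the name and strip the scheme and trailing slash (simplified)."""
    site = site.strip().lower()
    for prefix in ("http://", "https://"):
        if site.startswith(prefix):
            site = site[len(prefix):]
    return site.rstrip("/")


def site_hash(site):
    """Hex digest of the normalized site name (SHA-256 is an assumption here)."""
    return hashlib.sha256(normalize(site).encode("utf-8")).hexdigest()


def classify(site, blocklist_hashes):
    """Presence of the hash on the list means Unsafe; absence means Safe."""
    return "Unsafe" if site_hash(site) in blocklist_hashes else "Safe"


# Hypothetical blocklist holding the hash of a single made-up site.
blocklist = {site_hash("malware.example")}
print(classify("http://Malware.example/", blocklist))
print(classify("clean.example", blocklist))
```

Storing the hashes in a set keeps each lookup at constant time on average, which matters when millions of site names are checked.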

Interestingly, we did not find any of the sites on the Google Safe Browsing List. This definitely adds a feather of sorts to DMOZ Directory’s proverbial hat. I think they might just be able to claim that they are the “largest and safest human-edited directory on the web”!

A graphical representation of the top 50 categories, sorted by the number of websites they contain, is presented, followed by a list of the top 100 categories.

News, Report, Security