Where Can You Find (2.8 million) Safe Websites?

January 19th, 2010

Hackers are hitting websites hard and fast. Every day, upwards of 6,000 new websites are compromised by malware through code injection, FTP credential compromise, weak server security, web-application flaws, and the full gamut of other security issues.

Given this volume, any system used to determine whether a website is clean or infected needs to be able to handle large numbers of sites. This ability ensures a high throughput rate when analyzing “suspect” sites.

One of our goals at StopTheHacker.com is to reach throughput rates in excess of 1,000,000 sites per day. This obviously necessitates an automated process with high reliability and accuracy (we have it). To develop such a process, we focus heavily on advanced Machine Learning and Artificial Intelligence techniques that learn on the fly from compromised websites and update themselves to catch even more bad websites.
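To put the 1,000,000-sites-per-day target in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch: the function names and the worker count of 100 are hypothetical and are not taken from our actual system.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def required_rate(sites_per_day: int) -> float:
    """Average number of sites that must be analyzed per second."""
    return sites_per_day / SECONDS_PER_DAY

def seconds_per_site(sites_per_day: int, workers: int) -> float:
    """Time budget per site for each worker, assuming `workers`
    analyzers run in parallel (the worker count is hypothetical)."""
    return workers * SECONDS_PER_DAY / sites_per_day

rate = required_rate(1_000_000)
budget = seconds_per_site(1_000_000, workers=100)
print(f"{rate:.2f} sites/second overall")
print(f"{budget:.2f} seconds per site with 100 parallel workers")
```

In other words, the target works out to roughly a dozen sites every second, around the clock, which is why manual review is not an option.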

To develop training sets for machine-learning-based automated solutions, one needs to get hold of a massive dataset. We recently profiled over 2.8 million websites (2,800,560 to be exact), all sourced from DMOZ. Surprisingly, none of these websites were listed in the Google Safe Browsing List as of January 19, 2010.
Note: DMOZ is a user-edited directory of sites, which provided a good starting point for this experiment.

Each website is classified according to a categorization scheme described here. We used the description to download and analyze around 2.8 million sites. Each site name was entered into a program that calculated a hash of the site name and looked it up in the Google Safe Browsing List to determine whether the website was on the malware list.
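The hash-and-lookup step can be sketched roughly as follows. This is a minimal illustration, not our production code: the choice of SHA-256, the normalization rules, the helper names, and the in-memory set standing in for the downloaded list are all assumptions made for the example, and the real Safe Browsing list has its own canonicalization and hashing rules.

```python
import hashlib

def site_hash(site_name: str) -> str:
    """Hash a normalized site name. SHA-256 and this normalization
    are assumptions for illustration only."""
    normalized = site_name.strip().lower().rstrip(".")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_listed_sites(site_names, blocklist_hashes):
    """Return the site names whose hash appears in the blocklist set."""
    return [name for name in site_names if site_hash(name) in blocklist_hashes]

# Hypothetical stand-in for a locally stored list of malware-site hashes.
blocklist_hashes = {site_hash("malware.example")}

# Hypothetical stand-in for site names pulled from the directory.
sites = ["example.org", "Malware.Example", "example.net"]

listed = find_listed_sites(sites, blocklist_hashes)
print(f"{len(listed)} of {len(sites)} sites found on the list: {listed}")
```

Keeping the hashes in a set makes each lookup a constant-time operation on average, which matters when the same check is repeated for millions of site names.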

Interestingly, we did not find any of the sites on the Google Safe Browsing List. This definitely adds a feather of sorts to DMOZ Directory’s proverbial hat. I think they might just be able to claim that they are the “largest and safest human-edited directory on the web”!

A graphical representation of the top 50 categories, sorted by number of websites, is presented, followed by a list of the top 100 categories.

News, Report, Security