• Where Can You Find (2.8 million) Safe Websites?

    Hackers are hitting websites hard and fast. Every day, upwards of 6,000 new websites are compromised by malware through code injection, FTP credential compromise, weak server security, web-application flaws and the full gamut of other security issues.

    Given this, any system used to determine whether a website is clean or infected needs to be able to handle large numbers of sites, so that it sustains a high throughput rate when analyzing “suspect” sites.

    One of our goals at StopTheHacker.com is to reach throughput rates in excess of 1,000,000 sites per day. This obviously necessitates an automated process with high reliability and accuracy (we have it). To develop such a process, we focus heavily on advanced Machine Learning and Artificial Intelligence techniques that learn on the fly from compromised websites and update themselves to catch even more bad websites.
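
    To put the 1,000,000 sites per day target in perspective, here is a rough back-of-the-envelope sketch. The function names and the 5-second per-site analysis time are illustrative assumptions, not measurements from our system.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def required_rate(sites_per_day: float) -> float:
    """Average number of sites that must be finished each second."""
    return sites_per_day / SECONDS_PER_DAY


def required_workers(sites_per_day: float, seconds_per_site: float) -> int:
    """Parallel analyses needed if one site takes seconds_per_site to analyze.

    seconds_per_site is a hypothetical figure chosen for illustration.
    """
    return math.ceil(required_rate(sites_per_day) * seconds_per_site)


if __name__ == "__main__":
    print(f"{required_rate(1_000_000):.2f} sites per second")
    print(required_workers(1_000_000, seconds_per_site=5.0), "parallel analyses")
```

    Under these assumptions the target works out to roughly 11.6 sites per second, which is why a manual or semi-manual process is not an option.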

    In order to develop training sets for machine-learning-based automated solutions, one needs to get hold of a massive dataset. We recently profiled over 2.8 million websites (2,800,560 to be exact), all of them sourced from DMOZ. Surprisingly, none of these websites was listed in the Google Safe Browsing List as of January 19, 2010.
    Note: DMOZ is a user-edited directory of sites (which provided a good starting point for this experiment).

    Each website is classified according to a categorization scheme described here. We used the description to download and analyze around 2.8 million sites. Each site name was fed to a program that calculated a hash of the name and looked that hash up in the Google Safe Browsing List to determine whether the website was on the malware list.
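
    The lookup step described above can be sketched roughly as follows. This is a simplified illustration and not our production code: the choice of SHA-256, the hostname-only normalization, the helper names and the in-memory set of listed hashes are all assumptions made for the example. The real Google Safe Browsing service defines its own canonicalization and hashing rules, which this sketch does not reproduce.

```python
import hashlib
from urllib.parse import urlsplit


def normalize_site(name: str) -> str:
    """Reduce a site name or URL to a lowercase hostname (assumed normalization)."""
    name = name.strip()
    if "://" not in name:
        name = "http://" + name  # urlsplit needs a scheme to find the host
    host = urlsplit(name).hostname or ""
    return host.rstrip(".")


def site_hash(name: str) -> str:
    """Hash the normalized site name. SHA-256 is an assumption for this sketch."""
    return hashlib.sha256(normalize_site(name).encode("utf-8")).hexdigest()


def is_listed(name: str, listed_hashes: set) -> bool:
    """Return True if the site's hash appears in the local copy of the list."""
    return site_hash(name) in listed_hashes


if __name__ == "__main__":
    # Hypothetical local list containing a single made-up entry.
    listed = {site_hash("malware.example")}
    for site in ["http://Malware.Example/index.html", "dmoz.org"]:
        print(site, "listed" if is_listed(site, listed) else "not listed")
```

    Storing the listed hashes in a set keeps each lookup at constant time on average, which matters when the same check runs across millions of site names.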

    Interestingly, we did not find any of the sites on the Google Safe Browsing List. This definitely adds a feather of sorts to DMOZ Directory’s proverbial hat. I think they might just be able to claim that they are the “largest and safest human-edited directory on the web”!

    A graphical representation of the top 50 categories, sorted by number of websites, is presented, followed by a list of the top 100 categories.

    DMOZ categories ranked by the number of associated sites:

    Sites	Category
    654667	Regional/N
    654420	Regional/North_America
    569801	Regional/E
    265961	Regional/Europe
    120813	Regional/A
    120813	Regional/A
    88487	Society/Religion_and_Spirituality
    66790	Arts/Music
    66790	Arts/Music
    49708	Regional/O
    49268	Regional/Oceania
    48641	Regional/Asia
    39457	Reference/Education
    35682	Games/Video_Games
    34671	Arts/Movies
    31420	Science/B
    31403	Science/Biology
    29998	Computers/Software
    28296	Computers/Internet
    24826	Regional/C
    23427	Sports/S
    22570	Arts/Literature
    22570	Arts/Literature
    22274	Recreation/Pets
    20796	Science/S
    20016	Society/Law
    19946	Arts/Performing_Arts
    19651	Science/Social_Sciences
    18652	Business/Industrial_Goods_and_Services
    17764	Sports/B
    17753	Society/Issues
    17493	Business/Textiles_and_Nonwovens
    17248	Business/Business_Services
    17247	Business/Construction_and_Maintenance
    17033	Business/Arts_and_Entertainment
    16948	Regional/Africa
    16010	Computers/Programming
    15703	Business/Consumer_Goods_and_Services
    15418	Sports/Soccer
    14186	Sports/M
    13701	Society/People
    13696	Sports/E
    13672	Society/Organizations
    13348	Science/E
    13063	Arts/Visual_Arts
    13001	Recreation/Outdoors
    12976	Health/Conditions_and_Diseases
    12696	Sports/F
    11815	Science/T
    11653	Home/Cooking
    11383	Shopping/Home_and_Garden
    11332	Regional/M
    11332	Regional/M
    11190	Regional/Middle_East
    11075	Science/Technology
    10589	Science/M
    10456	Sports/Equestrian
    10236	Society/History
    10157	Health/Medicine
    10030	Arts/Television
    9875	Science/Math
    9429	Sports/G
    8948	Arts/Animation
    8581	Sports/W
    8216	Science/A
    8117	Business/Financial_Services
    8117	Business/Financial_Services
    8070	Business/Electronics_and_Electrical
    8049	Arts/People
    7920	Shopping/Crafts
    7920	Shopping/Crafts
    7771	Business/Transportation_and_Logistics
    7764	Business/Agriculture_and_Forestry
    7597	Sports/Golf
    7597	Sports/Golf
    7511	Sports/Football
    7481	Shopping/Clothing
    7481	Shopping/Clothing
    7466	Shopping/Food
    7265	Recreation/Food
    7189	Sports/C
    7072	Sports/Basketball
    7028	Shopping/Health
    6606	Science/Environment
    6398	Regional/U
    6390	Computers/Hardware
    6390	Computers/Hardware
    6372	Business/Food_and_Related_Products
    6271	Sports/H
    6222	Recreation/Autos
    6100	Science/Earth_Sciences
    6100	Science/Earth_Sciences
    6003	Arts/Crafts
    6001	Sports/Martial_Arts
    5938	Regional/Caribbean
    5904	Sports/Motorsports
    5880	Recreation/Travel
    5590	Business/Healthcare
    5560	Society/Genealogy
    5437	Society/Philosophy
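
    A ranking like the one above could be produced along these lines. This is a minimal sketch under stated assumptions: each profiled site is taken to carry one slash-separated DMOZ category path, and paths are truncated to their first two levels before counting. The function name, the `depth` parameter and the sample paths are hypothetical.

```python
from collections import Counter


def rank_categories(category_paths, depth=2):
    """Count sites per category prefix and return (category, count) pairs,
    largest count first."""
    counts = Counter(
        "/".join(path.split("/")[:depth]) for path in category_paths
    )
    return counts.most_common()


if __name__ == "__main__":
    # Made-up sample: one category path per profiled site.
    sample = [
        "Arts/Music/Bands_and_Artists",
        "Arts/Music/Styles",
        "Regional/Europe/France",
        "Arts/Movies/Titles",
    ]
    print("Sites\tCategory")
    for category, sites in rank_categories(sample):
        print(f"{sites}\t{category}")
```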