<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>stopthehacker.com &#187; safe sites</title>
	<atom:link href="http://www.stopthehacker.com/tag/safe-sites/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.stopthehacker.com</link>
	<description>Jaal, LLC</description>
	<lastBuildDate>Tue, 07 Feb 2012 14:00:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Where Can You Find (2.8 million) Safe Websites?</title>
		<link>http://www.stopthehacker.com/2010/01/19/where-can-you-find-millions-of-safe-websites/</link>
		<comments>http://www.stopthehacker.com/2010/01/19/where-can-you-find-millions-of-safe-websites/#comments</comments>
		<pubDate>Wed, 20 Jan 2010 03:46:18 +0000</pubDate>
		<dc:creator>anirban</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[Report]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[dmoz]]></category>
		<category><![CDATA[safe sites]]></category>

		<guid isPermaLink="false">http://www.stopthehacker.com/?p=1009</guid>
		<description><![CDATA[Hackers are hitting websites hard and fast. Every day, upwards of 6,000 new websites are compromised by malware due to code injection, FTP credential compromise, weak server security, web-application flaws and the full gamut of other security issues. In this vein, any system used to determine whether a website is clean or infected needs to be [...]]]></description>
			<content:encoded><![CDATA[<p>Hackers are hitting websites hard and fast. Every day, upwards of 6,000 new websites are compromised by malware due to code injection, FTP credential compromise, weak server security, web-application flaws and the full gamut of other security issues.</p>
<p>In this vein, any system used to determine whether a website is clean or infected needs to be able to handle large numbers of sites for analysis. This ability ensures a high throughput rate when analyzing &#8220;suspect&#8221; sites.</p>
<p>One of our goals at StopTheHacker.com is to reach throughput rates in excess of 1,000,000 sites per day. This requires an automated process with high reliability and accuracy, which we have. To develop such a process, we rely heavily on advanced machine learning and artificial intelligence techniques that learn on the fly from compromised websites and update themselves to catch even more bad websites.</p>
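<p>As a rough back-of-the-envelope check on what a target of 1,000,000 sites per day implies, the arithmetic can be sketched as follows. The function names are made up for illustration, and the calculation assumes a steady rate around the clock, which a real pipeline would not necessarily have.</p>

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def required_rate(sites_per_day):
    """Average number of sites that must be analyzed each second."""
    return sites_per_day / SECONDS_PER_DAY


def days_to_scan(total_sites, sites_per_day):
    """Days needed to analyze total_sites at a steady daily throughput."""
    return total_sites / sites_per_day


# Target throughput quoted in the post.
rate = required_rate(1_000_000)
# Size of the DMOZ-derived dataset quoted in the post.
days = days_to_scan(2_800_560, 1_000_000)

print(f"{rate:.1f} sites per second")
print(f"{days:.1f} days to cover the DMOZ set")
```

<p>At that pace the system has to finish roughly a dozen sites every second, which is why manual review is not an option.</p>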
<p>In order to develop training sets for machine-learning-based automated solutions, one needs a massive dataset. We recently profiled over 2.8 million websites (2,800,560 to be exact), all sourced from <a href="http://dmoz.org" target="_self">DMOZ</a>. Surprisingly, none of these websites were listed in the <a href="http://code.google.com/apis/safebrowsing/" target="_blank">Google Safe Browsing List</a> as of January 19, 2010.<br />
<em>Note: DMOZ is a user-edited directory of sites (which provided a good starting point for this experiment).</em></p>
<p>Each website is classified according to a categorization scheme described <a href="http://rdf.dmoz.org/" target="_blank">here</a>. We used that description to download and analyze around 2.8 million sites. Each site name was fed to a program that calculated a hash of the site name and looked it up in the <a href="http://code.google.com/apis/safebrowsing/" target="_blank">Google Safe Browsing List</a> to determine whether the website was on the malware list.</p>
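<p>The hash-and-lookup step described above might look something like the sketch below. This is not our production code. It assumes SHA-256 with a 4-byte prefix, a very simplified normalization of the site name, and a local set of hash prefixes standing in for the list; the names <code>site_hash_prefix</code>, <code>is_listed</code> and <code>blocklist_prefixes</code> are hypothetical. The real Safe Browsing protocol also requires full URL canonicalization and several host and path combinations per URL, which are omitted here.</p>

```python
import hashlib


def site_hash_prefix(site_name, prefix_len=4):
    """Return the first prefix_len bytes of the SHA-256 digest of a site name."""
    # Simplified normalization (assumption): trim, lowercase, one trailing slash.
    expression = site_name.strip().lower().rstrip("/") + "/"
    return hashlib.sha256(expression.encode("utf-8")).digest()[:prefix_len]


def is_listed(site_name, blocklist_prefixes):
    """Check whether the site's hash prefix appears in the local blocklist set."""
    return site_hash_prefix(site_name) in blocklist_prefixes


# Hypothetical local copy of the list, holding one made-up bad site.
blocklist_prefixes = {site_hash_prefix("malware.example")}

sites = ["dmoz.org", "Malware.Example/", "example.com"]
flagged = [site for site in sites if is_listed(site, blocklist_prefixes)]
print(flagged)
```

<p>Keeping only hash prefixes in a set makes each lookup a constant-time membership test, which matters when millions of site names have to be checked.</p>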
<p>Interestingly, we did not find any of the sites on the <a href="http://code.google.com/apis/safebrowsing/" target="_blank">Google Safe Browsing List</a>. This is definitely a feather in the DMOZ directory&#8217;s proverbial cap. I think they might just be able to claim that they are the &#8220;largest <em>and safest</em> human-edited directory on the web&#8221;!</p>
<div class="gallery">
<div id="attachment_1039" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.stopthehacker.com/wp-content/uploads/2010/01/top-50-dmoz-categories.jpeg" rel="attachment wp-att-1039" title="Top 50 DMOZ categories (by number of websites)."><img src="http://www.stopthehacker.com/wp-content/uploads/2010/01/top-50-dmoz-categories-300x123.jpg" alt="Top 50 DMOZ categories (by number of websites)." title="Top 50 DMOZ categories (by number of websites)." width="300" height="123" class="size-medium wp-image-1039" /></a><p class="wp-caption-text">Top 50 DMOZ categories (by number of websites).</p></div>
</div>
<p>The chart above shows the top 50 categories, ranked by the number of websites in each. The list below gives the top 100 categories.<br />
<span id="more-1009"></span><br />
<strong>DMOZ categories ranked by the number of associated sites:</strong></p>
<pre class="brush: plain; title: ; notranslate">
Sites	Category

654667	Regional/N
654420	Regional/North_America
569801	Regional/E
265961	Regional/Europe
120813	Regional/A
120813	Regional/A
88487	Society/Religion_and_Spirituality
66790	Arts/Music
66790	Arts/Music
49708	Regional/O
49268	Regional/Oceania
48641	Regional/Asia
39457	Reference/Education
35682	Games/Video_Games
34671	Arts/Movies
31420	Science/B
31403	Science/Biology
29998	Computers/Software
28296	Computers/Internet
24826	Regional/C
23427	Sports/S
22570	Arts/Literature
22570	Arts/Literature
22274	Recreation/Pets
20796	Science/S
20016	Society/Law
19946	Arts/Performing_Arts
19651	Science/Social_Sciences
18652	Business/Industrial_Goods_and_Services
17764	Sports/B
17753	Society/Issues
17493	Business/Textiles_and_Nonwovens
17248	Business/Business_Services
17247	Business/Construction_and_Maintenance
17033	Business/Arts_and_Entertainment
16948	Regional/Africa
16010	Computers/Programming
15703	Business/Consumer_Goods_and_Services
15418	Sports/Soccer
14186	Sports/M
13701	Society/People
13696	Sports/E
13672	Society/Organizations
13348	Science/E
13063	Arts/Visual_Arts
13001	Recreation/Outdoors
12976	Health/Conditions_and_Diseases
12696	Sports/F
11815	Science/T
11653	Home/Cooking
11383	Shopping/Home_and_Garden
11332	Regional/M
11332	Regional/M
11190	Regional/Middle_East
11075	Science/Technology
10589	Science/M
10456	Sports/Equestrian
10236	Society/History
10157	Health/Medicine
10030	Arts/Television
9875	Science/Math
9429	Sports/G
8948	Arts/Animation
8581	Sports/W
8216	Science/A
8117	Business/Financial_Services
8117	Business/Financial_Services
8070	Business/Electronics_and_Electrical
8049	Arts/People
7920	Shopping/Crafts
7920	Shopping/Crafts
7771	Business/Transportation_and_Logistics
7764	Business/Agriculture_and_Forestry
7597	Sports/Golf
7597	Sports/Golf
7511	Sports/Football
7481	Shopping/Clothing
7481	Shopping/Clothing
7466	Shopping/Food
7265	Recreation/Food
7189	Sports/C
7072	Sports/Basketball
7028	Shopping/Health
6606	Science/Environment
6398	Regional/U
6390	Computers/Hardware
6390	Computers/Hardware
6372	Business/Food_and_Related_Products
6271	Sports/H
6222	Recreation/Autos
6100	Science/Earth_Sciences
6100	Science/Earth_Sciences
6003	Arts/Crafts
6001	Sports/Martial_Arts
5938	Regional/Caribbean
5904	Sports/Motorsports
5880	Recreation/Travel
5590	Business/Healthcare
5560	Society/Genealogy
5437	Society/Philosophy
</pre>
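<p>A ranking like the one above can be produced with a few lines of code. The sketch below assumes each site is counted under the first two levels of its DMOZ category path, as the two-level names in the list suggest. The sample records and the function names are made up for illustration.</p>

```python
from collections import Counter


def truncate_category(path, depth=2):
    """Cut a category path such as 'Arts/Music/Bands' down to its first levels."""
    return "/".join(path.split("/")[:depth])


def rank_categories(site_categories, depth=2):
    """Count sites per truncated category, largest count first."""
    counts = Counter(
        truncate_category(path, depth) for _site, path in site_categories
    )
    return counts.most_common()


# Made-up (site, category path) records standing in for the DMOZ data.
sample = [
    ("a.example", "Arts/Music/Bands"),
    ("b.example", "Arts/Music/Styles"),
    ("c.example", "Science/Biology/Botany"),
]

ranking = rank_categories(sample)
for category, n_sites in ranking:
    print(f"{n_sites}\t{category}")
```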
]]></content:encoded>
			<wfw:commentRss>http://www.stopthehacker.com/2010/01/19/where-can-you-find-millions-of-safe-websites/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

