• Analyzing the Google Blacklist, Part 1

    Google’s efforts to clean up the Internet and provide a useful advisory to Internet users have been very successful. Nearly every modern browser now incorporates Google’s Safe Browsing list to prevent users from inadvertently visiting malware-infested and phishing websites.

    Motivation
    In this article we will be analyzing the Google malware hash lists that have been published over the past few months in order to answer these important questions:

    • How many websites get blacklisted each day?
    • How many websites manage to get off the blacklist?
    • How soon do websites get off the blacklist?
    • How many never get off the blacklist?

    These are practical questions that frustrated, and sometimes confused or angry, website owners raise time and again at help forums and via our contact page.

    Resources
    Google has done a good job of creating detailed help content that describes the blacklisting process, as well as a group where website owners can ask for help. Additionally, there are excellent resources like BadwareBusters where users can find volunteers to help them. We also participate in these groups.

    Yet there is still demand for clear-cut answers to basic questions like the ones listed above. In this vein, we want to offer a scientifically sound and statistically significant analysis of freely available information that answers these questions. A small FAQ is also available on our site to answer questions from website owners and admins.

    Goals
    This series of experiments is split into multiple parts. This article presents a first look (part 1) at openly available data. The goal of the experiment is to understand:

    • How many websites get blacklisted each day?
    • How many websites manage to get off the blacklist?
    • How soon do websites get off the blacklist?
    • How many never get off the blacklist?
    • How many websites fall back onto the blacklist?
    • How much time elapses before a website falls back into the blacklist?

    Methodology
    For this experiment, Google malware hash lists were collected every 30 minutes from March 3, 2010 to June 1, 2010 (113 days). Each malware hash list contains the information described in the Google malware hash specification. Every list was parsed, and each unique hash was extracted, timestamped, and correlated with the version of the malware hash list in which it appeared.
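    The collection step above can be turned into per-hash listing episodes. The following is a minimal sketch, not the code used for this experiment: it assumes each collected list has already been parsed into a `(timestamp, set_of_hashes)` pair, and the function name `listing_intervals` is hypothetical.

```python
from datetime import datetime, timedelta


def listing_intervals(snapshots):
    """Group snapshot sightings into continuous listing episodes.

    snapshots: iterable of (timestamp, set_of_hashes), sorted by time.
    Returns {hash: [(first_seen, last_seen), ...]} with one tuple per
    uninterrupted run of snapshots that contained the hash.
    (Assumed data layout, for illustration only.)
    """
    intervals = {}
    previous = set()
    for timestamp, hashes in snapshots:
        for h in hashes:
            if h in previous:
                # Hash was also in the prior snapshot: extend the open episode.
                start, _ = intervals[h][-1]
                intervals[h][-1] = (start, timestamp)
            else:
                # Hash is new or has reappeared: open a new episode.
                intervals.setdefault(h, []).append((timestamp, timestamp))
        previous = set(hashes)
    return intervals


# Tiny usage example with three snapshots taken 30 minutes apart.
t0 = datetime(2010, 3, 3)
step = timedelta(minutes=30)
example = listing_intervals([
    (t0, {"a", "b"}),
    (t0 + step, {"a"}),
    (t0 + 2 * step, {"a", "b"}),
])
```

    Here hash `a` has a single episode spanning all three snapshots, while hash `b` has two episodes because it was missing from the middle snapshot.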

    Subsequently, an analysis was conducted to answer the questions posed above. At no point was an attempt made to identify a website name from the hashes. Also note that a single website can have more than one unique hash. For example, “www.abcd.com”, “abcd.com”, and “www.abcd.com/infected/” can all generate different hashes.
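    To see why one website can account for several hashes, consider the short sketch below. It hashes the three example strings with MD5 purely as an illustrative assumption; the actual hash function and URL canonicalization rules are defined by the Google malware hash specification, and the helper name `url_hash` is hypothetical.

```python
import hashlib


def url_hash(expression: str) -> str:
    """Return a hex digest for a host or host/path expression.

    MD5 is an assumption made for illustration; consult the
    specification for the real hashing and canonicalization rules.
    """
    return hashlib.md5(expression.encode("utf-8")).hexdigest()


expressions = ["www.abcd.com", "abcd.com", "www.abcd.com/infected/"]
digests = {expr: url_hash(expr) for expr in expressions}

for expr, digest in digests.items():
    print(f"{expr:24} {digest}")
```

    Each expression yields a different digest, so counting unique hashes tends to overcount websites.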

    Brief Highlights

    • Total number of unique hashes tracked: 688,602.
    • Average number of unique hashes per day (over the 113-day period): 6,093.
    • 25.8% of hashes never got off the Google blacklist.
      Each of these unique hashes remained listed for the entire 113-day collection period, that is, for over 3 months.
    • 43% of hashes were listed exactly once as infected and managed to get off the Google blacklist.
      The average time each of these hashes was blacklisted was 13 days (89 days max).
    • 2% of hashes were blacklisted exactly twice.
      Each of these hashes was blacklisted, removed from the blacklist, and then blacklisted again (the sites were hacked again). These sites remained infected for an average of 19 days (89 days max) and stayed clean for an average of 17 days before being hacked again.
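    Figures like the ones above can be derived from per-hash listing episodes. The sketch below is one possible way to do it and is not the code used for this experiment. It assumes a mapping of hash to a list of `(first_seen, last_seen)` episodes, and it assumes that a hash "never got off" when its single episode is still open at the last snapshot. The name `summarize` and the result keys are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean

DAY = timedelta(days=1)


def summarize(intervals, last_snapshot):
    """Classify hashes by listing history and compute summary statistics.

    intervals: {hash: [(first_seen, last_seen), ...]} (assumed layout).
    last_snapshot: timestamp of the final collected list.
    """
    never_off, once, twice = [], [], []
    for episodes in intervals.values():
        still_listed = episodes[-1][1] >= last_snapshot
        if len(episodes) == 1 and still_listed:
            never_off.append(episodes)   # listed once, never removed
        elif len(episodes) == 1:
            once.append(episodes)        # listed once, then removed
        elif len(episodes) == 2:
            twice.append(episodes)       # removed, then listed again

    total = len(intervals)

    def pct(group):
        return 100.0 * len(group) / total if total else 0.0

    def days(start, end):
        return (end - start) / DAY

    return {
        "pct_never_off": pct(never_off),
        "pct_listed_once": pct(once),
        "pct_listed_twice": pct(twice),
        # Mean length of the single episode for hashes that got off.
        "avg_days_listed_once": mean(days(*e[0]) for e in once) if once else None,
        # Mean gap between the end of episode 1 and the start of episode 2.
        "avg_days_clean_between": (
            mean(days(e[0][1], e[1][0]) for e in twice) if twice else None
        ),
    }


# Made-up sample data, for demonstration only.
start, end = datetime(2010, 3, 3), datetime(2010, 6, 1)
sample = {
    "a": [(start, end)],
    "b": [(start, start + 13 * DAY)],
    "c": [(start, start + 10 * DAY), (start + 27 * DAY, start + 40 * DAY)],
    "d": [(start + DAY, start + 14 * DAY)],
}
stats = summarize(sample, end)
```

    The sample data is invented to exercise each category; it is not drawn from the collected hash lists.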

    Analysis
    It is clear from these initial results that a very large share of blacklisted entries, roughly one quarter (25.8%) of the approximately 6,000 hashes tracked per day, never make it off the Google blacklist. There are a number of reasons for this. One is that many webmasters, who may be good at website design and layout, do not have the technical skills required to clean websites infected by malware and code injection attacks. We have also met website owners who are extremely business savvy but lack the technical expertise to recover from a blacklisting event. The income lost due to business interruption in these cases is considerable.

    We see that 43% of blacklisted hashes make it off the blacklist, but the corresponding websites remain blacklisted for an average of 13 days.

    Some websites manage to get off the blacklist and then fall back onto it. These “repeat offenders” spend longer on the blacklist than sites listed only once: an average of 19 days versus 13 days. They also do not stay clean for long, an average of just 17 days.

    Conclusion
    These numbers clearly show the current sorry state of website security. It is unfortunate that thousands of websites are affected every day. At stopthehacker.com, we strive to help combat this trend. These issues need to be addressed by services that are currently not readily available to the masses. To fill this gap in the service space and disrupt the security market, stopthehacker.com provides advanced Health Monitoring and Vulnerability Assessment services for website owners. Our services take away the anguish that business owners face when their websites are attacked. Please visit our services page to find out how we can help you. You can even sign up for free services.

    The second part of this series will present a more detailed analysis of the data and more insight into the implications of these observations.

    Stay tuned for Part 2!
