Shostack + Friends Blog Archive

 

AOL search records 'research'

By now most readers will have read that America Online publicly released a large sample of search records.
From the README supplied with the data:

The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
AnonID - an anonymous user ID number.
Query  - the query issued by the user, case shifted with
most punctuation removed.
QueryTime - the time at which the query was submitted for search.
ItemRank  - if the user clicked on a search result, the rank of the
item on which they clicked is listed.
ClickURL  - if the user clicked on a search result, the domain portion of
the URL in the clicked result is listed.
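For anyone who wants to poke at the data, here is a minimal sketch of reading records with those five fields. It assumes the files are tab-separated text with a header row naming the columns, which the README excerpt above does not actually state. The `SearchRecord` type, the `read_records` helper, and the sample rows are invented for illustration.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple, Optional


class SearchRecord(NamedTuple):
    anon_id: str
    query: str
    query_time: str            # kept as text; the README excerpt gives no format
    item_rank: Optional[int]   # None when the user did not click a result
    click_url: Optional[str]   # None when the user did not click a result


def read_records(lines: Iterable[str]) -> Iterator[SearchRecord]:
    """Yield one SearchRecord per row (assumed layout: tab-separated, header row)."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        rank = row.get("ItemRank") or ""
        url = row.get("ClickURL") or ""
        yield SearchRecord(
            anon_id=row["AnonID"],
            query=row["Query"],
            query_time=row["QueryTime"],
            item_rank=int(rank) if rank else None,
            click_url=url or None,
        )


# Invented sample rows, not taken from the released data.
sample = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1001\texample query\t2006-03-01 12:00:00\t\t\n"
    "1001\texample query\t2006-03-01 12:00:05\t1\thttp://www.example.com\n"
)

for record in read_records(io.StringIO(sample)):
    print(record)
```

Rows without a click simply carry empty ItemRank and ClickURL fields, which the sketch maps to `None`.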

There are about 20 million queries according to AOL, from about 650 thousand sources.
Some fun facts:
260 records match the SSN regex:

/(?!000)([0-6]\d{2}|7([0-6]\d|7[012]))([ -]+?)(?!00)\d\d\3(?!0000)\d{4}/

A contributor to the interesting-people list reports somewhat fewer matches (the original link, http://www.interesting-people.org/archives/interesting-people/200608/msg00032.html, no longer works), but perhaps (s)he has a more discriminating regex, or cleaned the results.
Of the ‘SSN matches’, one also contained what appeared to be a person’s full name, address, date of birth, and driver’s license number (with state of issue).
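A rough sketch of how such a count could be reproduced in Python, assuming the SSN pattern is written with its `\d` digit classes and its `\3` backreference to the separator group. The `count_ssn_like` helper and the sample queries are made up for illustration; this is not the script that produced the 260 figure.

```python
import re

# The SSN pattern from the post, assuming \d digit classes and a \3
# backreference so that both separators must be identical.
SSN_RE = re.compile(
    r"(?!000)([0-6]\d{2}|7([0-6]\d|7[012]))([ -]+?)(?!00)\d\d\3(?!0000)\d{4}"
)


def count_ssn_like(queries):
    """Count the queries containing at least one SSN-shaped substring."""
    return sum(1 for q in queries if SSN_RE.search(q))


# Invented queries, not taken from the released data.
queries = [
    "123-45-6789 lookup",   # matches
    "call 555-1234",        # too short
    "123-45 6789",          # separators differ, so \3 fails
    "000-12-3456",          # area 000 is excluded
    "772 12 3456",          # matches, space as separator
]
print(count_ssn_like(queries))
```

Note that the pattern is unanchored, so it will also fire inside a longer run of digits (for instance, "9123-45-6789" matches starting at the second character). That alone could account for differing counts between people running slightly different patterns.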
OTOH, an extremely primitive “credit card number” regex yielded only four hits. I’m having some issues with the Regexp::Common::CC Perl package, so I rolled my own regex and I know it is terrible.
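For comparison, one common way to make a crude card-number pattern less terrible is to pair it with the Luhn checksum. The sketch below is not the regex used for the four hits above (the post does not show it). The `CARD_RE` pattern, the `luhn_ok` and `card_like` helpers, and the 13-to-19-digit length range are all assumptions chosen for illustration.

```python
import re

# Assumed pattern: 13 to 19 digits, optionally separated by single spaces
# or hyphens, and not embedded in a longer run of digits.
CARD_RE = re.compile(r"(?<!\d)(?:\d[ -]?){12,18}\d(?!\d)")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def card_like(query: str):
    """Return the digit strings in a query that look like card numbers."""
    hits = []
    for m in CARD_RE.finditer(query):
        digits = re.sub(r"\D", "", m.group())
        if luhn_ok(digits):
            hits.append(digits)
    return hits


# 4111 1111 1111 1111 is a widely published test number, not a real card.
print(card_like("4111 1111 1111 1111"))
print(card_like("4111 1111 1111 1112"))
```

The checksum filters out most random digit strings, but a pass still only means "shaped like a card number", not that a real account exists.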

2 comments on "AOL search records 'research'"

  • Adam says:

    There’s a word for naming anything “anonID.” Perhaps it’s irony, perhaps it’s less complimentary.

  • head says:

    Yes.. try out the AOL search database yourself.. It is just fun to look at some of the search data..
    http://data.aolsearchlogs.com/log/random.cgi
