AOL search records 'research'
Most readers will have read by now that America Online publicly released a large sample of search records.
From the README supplied with the data:
The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}. AnonID - an anonymous user ID number. Query - the query issued by the user, case shifted with most punctuation removed. QueryTime - the time at which the query was submitted for search. ItemRank - if the user clicked on a search result, the rank of the item on which they clicked is listed. ClickURL - if the user clicked on a search result, the domain portion of the URL in the clicked result is listed.
There are about 20 million queries, according to AOL, from about 650 thousand sources.
Some fun facts:
260 records match the SSN regex
/(?!000)([0-6]\d{2}|7([0-6]\d|7[012]))([ -]+?)(?!00)\d\d\3(?!0000)\d{4}/
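For anyone who wants to repeat the count, here is a minimal sketch of how the SSN pattern quoted above might be applied to the data. It is not the script used for the numbers in this post. It assumes the pattern's digit classes and backreference are \d and \3, and that each record is one tab-separated line in the field order the README gives; the function name and the sample rows are made up for illustration.

```python
import re

# The SSN-style pattern quoted above, with backslashes written out (assumed).
# \3 refers back to the separator group, so both separators must be the same.
SSN_RE = re.compile(
    r"(?!000)([0-6]\d{2}|7([0-6]\d|7[012]))([ -]+?)(?!00)\d\d\3(?!0000)\d{4}"
)


def ssn_like_queries(lines):
    """Yield (anon_id, query) for records whose Query field matches SSN_RE.

    Assumption: one record per line, tab-separated, in README field order:
    AnonID, Query, QueryTime, ItemRank, ClickURL.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        anon_id, query = fields[0], fields[1]
        if SSN_RE.search(query):
            yield anon_id, query


# Made-up sample rows, not taken from the released data.
sample = [
    "1001\t123-45-6789 lookup\t2006-03-01 10:00:00\t\t",
    "1002\tcheap flights\t2006-03-01 10:05:00\t1\thttp://example.com",
    "1003\t000-12-3456\t2006-03-01 10:06:00\t\t",
]
hits = list(ssn_like_queries(sample))
```

Because the pattern has no word boundaries, it will also fire inside longer runs of digits and separators, so a raw count like this is an upper bound until the hits are checked by eye.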
A contributor to the interesting-people list reports somewhat fewer matches (the link, http://www.interesting-people.org/archives/interesting-people/200608/msg00032.html, no longer works), but perhaps (s)he has a more discriminating regex, or cleaned the results.
Of the ‘SSN matches’, one also contained what appeared to be a person’s full name, address, date of birth, and driver’s license number (with state of issue).
OTOH, an extremely primitive “credit card number” regex yielded only four hits. I’m having some issues with the Regexp::Common::CC Perl package, so I rolled my own regex, and I know it is terrible.
There’s a word for naming anything “anonID.” Perhaps it’s irony, perhaps it’s less complimentary.
Yes.. try out the AOL search database yourself.. It is just fun to look at some of the search data..
http://data.aolsearchlogs.com/log/random.cgi