
Re: Scroogle and Tor

Thus spake scroogle@xxxxxxxxxxx (scroogle@xxxxxxxxxxx):

> My efforts to counter abuse occasionally cause some
> programmers to consider using Tor to get Scroogle's
> results. About a year ago I began requiring any and all
> Tor searches at Scroogle to use SSL. Using SSL is always
> a good idea, but the main reason I did this is that the
> SSL requirement discouraged script writers who didn't
> know how to add this to their scripts. This policy
> helped immensely in cutting back on the abuse I was
> seeing from Tor.
> Now I'm seeing script writers who have solved the SSL
> problem. This leaves me with the user-agent, the search
> terms, and as a last resort, blocking Tor exit nodes.
> If they vary their search terms and user-agents, it can
> take hours to analyze patterns and accurately block them
> by returning a blank page. That's the way I prefer to do
> it, because I don't like to block Tor exit nodes. Those
> who are most sympathetic with what Tor is doing are also
> sympathetic with what Scroogle is doing. There's a lot of
> collateral damage associated with blocking Tor exit nodes,
> and I don't want to alienate the Tor community except as
> a last resort.

Great, now that we know the motivations of the scrapers and the history
of the arms race so far, it becomes a bit easier to try to do some
things to mitigate their efforts. I particularly like the idea of
feeding them random, incorrect search results when you can fingerprint
them.

If you want my suggestions for next steps in this arms race (having
written some benevolent scrapers and web scanners myself), they would
actually be to do things that require your adversary to implement and
load more and more bits of a proper web browser into their crawlers
for them to succeed in properly issuing queries to you.

Some examples:

1. A couple layers of crazy CSS.

If you use CSS style sheets that fetch other randomly generated and
programmatically controlled style elements that are also keyed to the
form submit for the search query (via an extra hidden parameter or
something that is their hash), then you can verify on your server side
that a given query also loaded sufficient CSS to be genuine. 

The problem with this is it will mess with people who use your search
plugin or search keywords, but you could also do it in a brief landing
page that is displayed *after* the query, but before a 302 or
meta-refresh to actual results, for problem IPs.

2. Storing identifiers in the cache

http://crypto.stanford.edu/sameorigin/safecachetest.html has some PoC
of this. Torbutton protects against long-term cache identifiers, but
for performance reasons the memory cache is enabled by default, so you
could use this to differentiate crawlers that do not properly obey all
browser caching semantics. Caching is actually pretty darn hard to get
right, so there's probably quite a bit more room here than just this
one trick.
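One minimal way to exploit caching semantics (my own sketch, not the
safecachetest PoC itself): serve a tiny probe resource with a
per-client ETag. A real browser revalidates it with If-None-Match and
gets a 304; a crawler that ignores cache headers keeps fetching it
fresh, which is a signal by itself.

```python
import secrets

# Hypothetical set of ETags we have handed out; a real server would
# store these with timestamps and evict old entries.
known_tags = set()

def serve_cache_probe(if_none_match=None):
    """Return (status, etag) for a request to the probe resource.

    A client obeying caching semantics sends back the ETag it was
    given via If-None-Match and receives 304 Not Modified; anything
    else gets a fresh 200 with a brand-new identifying tag.
    """
    if if_none_match in known_tags:
        return 304, if_none_match      # proper revalidation
    tag = secrets.token_hex(8)         # new identifier for this client
    known_tags.add(tag)
    return 200, tag
```

The tag doubles as a short-term identifier: repeated 200s from the
same IP mean the client is not running a normal browser cache.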

3. Javascript "proof of work"

If the client supports javascript, you can have them factor some
medium-sized integers and post the factorization with the query
string, to prove some level of periodic work. The factors could be
stored in cookies and given a lifetime. The obvious downside of this
is that I bet a fair share of your users are running NoScript, or
prefer to disable js and cookies.
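The server side of this could look something like the sketch below
(parameter sizes and function names are illustrative only): hand the
client a product of two random medium-sized primes, and accept the
query only if the posted factors are nontrivial and multiply back to
the challenge.

```python
import random

def make_challenge(rng, bits=20):
    """Pick two random primes of the given size and return their
    product; the client must recover the factors in javascript."""
    def random_prime():
        while True:
            c = rng.randrange(2 ** (bits - 1), 2 ** bits) | 1  # odd candidate
            # trial division is fine at these tiny sizes
            if all(c % d for d in range(3, int(c ** 0.5) + 1, 2)):
                return c
    return random_prime() * random_prime()

def verify_work(n, factors):
    """Check a posted factorization: every factor must be nontrivial
    (not 1, not n itself) and the product must equal n."""
    product = 1
    for f in factors:
        if f <= 1 or f >= n or n % f != 0:
            return False
        product *= f
    return product == n
```

Verification is a couple of divisions, while the client has to do real
(if small) work; storing verified factors in a cookie with a lifetime,
as suggested above, amortizes that cost over a browsing session.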

Anyways, thanks for your efforts with Scroogle. Hopefully the above
ideas are actually easy enough to implement on your infrastructure to
make it worth your while to use for all problem IPs, not just Tor.

Mike Perry
Mad Computer Scientist
fscked.org evil labs
