
Re: Scroogle and Tor

Some have wondered why anyone would want to abuse Scroogle
using Tor. Apart from some malicious types that may be
doing it for their own amusement, it looks to me like they
are trying to datamine Google -- arguably the largest,
most diverse database on the planet.

If you can manage to run a script 24/7 that datamines
Google, you can monetize your results. Search engine
optimizers would like to be able to do this. So would
various directory builders.

Doing it by scraping google.com directly is not easy.
Scroogle provides 100 links of organic results per
request, with less than one-half the byte-bloat that
Google delivers for the same links and snippets. It is
also much easier to parse Scroogle's simple output page
than it is to parse Google's output page.

I spend a couple of hours per day blocking abusers. Much
of this is done through a couple dozen monitoring
programs I've written, but for the most part these
programs only provide candidates for blocking, and
my wetware is needed to make the final determination.
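
As a rough illustration of what such a monitoring program
might look like (a hypothetical sketch, not Scroogle's
actual code), the following counts requests per IP address
and lists the heavy hitters for a human to review. The
log format, the function name, and the threshold are all
assumptions made for the example.

```python
from collections import Counter

def blocking_candidates(log_entries, threshold=100):
    """Return (ip, count) pairs whose request count meets the threshold.

    log_entries is assumed to be an iterable of (ip, user_agent, query)
    tuples. The result is only a list of candidates, most active first;
    a person still makes the final blocking decision.
    """
    counts = Counter(ip for ip, _user_agent, _query in log_entries)
    return [(ip, n) for ip, n in counts.most_common() if n >= threshold]
```

Nothing is blocked automatically here; the output is meant
to be read by a person.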

My efforts to counter abuse occasionally cause some
programmers to consider using Tor to get Scroogle's
results. About a year ago I began requiring any and all
Tor searches at Scroogle to use SSL. Using SSL is always
a good idea, but the main reason I did this is that the
SSL requirement discouraged script writers who didn't
know how to add this to their scripts. This policy
helped immensely in cutting back on the abuse I was
seeing from Tor.
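
The policy itself is simple to state in code. This is a
minimal sketch under stated assumptions, not the actual
implementation: the function name is made up, and
tor_exit_ips stands for a set of known Tor exit addresses
that is assumed to be loaded and refreshed elsewhere.

```python
def allow_request(client_ip, is_ssl, tor_exit_ips):
    """Apply the rule: searches arriving from Tor must use SSL.

    tor_exit_ips is an assumed, pre-loaded set of exit node addresses.
    Requests from non-Tor addresses are not affected by this rule.
    """
    if client_ip in tor_exit_ips and not is_ssl:
        return False  # plain-HTTP request from a Tor exit: refuse
    return True
```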

Now I'm seeing script writers who have solved the SSL
problem. This leaves me with the user-agent, the search
terms, and as a last resort, blocking Tor exit nodes.
If they vary their search terms and user-agents, it can
take hours to analyze patterns and accurately block them
by returning a blank page. That's the way I prefer to do
it, because I don't like to block Tor exit nodes. Those
who are most sympathetic to what Tor is doing are also
sympathetic to what Scroogle is doing. There's a lot of
collateral damage associated with blocking Tor exit nodes,
and I don't want to alienate the Tor community except as
a last resort.
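
Once a pattern has been identified by hand, blocking on
user-agent and search terms rather than on the exit node
could be sketched as below. This is an illustration only:
the signature list, the "ExampleBot" user-agent, and the
run_search callback are hypothetical, and real abuse
patterns take far more analysis to pin down.

```python
import re

# Hypothetical signatures, each a (user-agent pattern, query pattern)
# pair that a person has already judged to be abusive.
BLOCKED_SIGNATURES = [
    (re.compile(r"^ExampleBot/"), re.compile(r"\bsite:")),
]

def respond(user_agent, query, run_search):
    """Return a blank page for blocked signatures, else the search results.

    run_search is an assumed callback that forwards the query upstream.
    It is never called for a blocked request, so the abuser's search
    terms are not passed along.
    """
    for ua_pattern, query_pattern in BLOCKED_SIGNATURES:
        if ua_pattern.search(user_agent) and query_pattern.search(query):
            return ""  # blank page
    return run_search(query)
```

Other users on the same Tor exit node are unaffected,
because the match is on the request and not on the IP
address.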

One reason why Scroogle has lasted for more than six
years is that we are nonprofit, and Google knows by now
that I don't tolerate abuse. My job is to stop the abuser
before Scroogle passes their search terms to Google.
Abusers who use Tor make this more difficult for me.
Blocking an IP address is easy, but blocking Tor abusers
without alienating other Tor users is more complex.

-- Daniel Brandt

To unsubscribe, send an e-mail to majordomo@xxxxxxxxxxxxxx with
unsubscribe or-talk    in the body. http://archives.seul.org/or/talk/