
Re: [tor-talk] Tor and Google error / CAPTCHAs.



On 25 September 2016 at 17:54, <blobby@xxxxxxxxxxxxxxx> wrote:

> Hi Alec,
>
> Thanks for your detailed and informative response. I had never heard of
> "scraping".


Scraping comes in many forms and with many motives and intentions - in the
previous email I managed to outline a couple, but that is no more than a
sketch of one aspect of the topic.

Scraping also raises interesting legal arguments, both pro-and-con - for
instance:

* https://en.wikipedia.org/wiki/Facebook,_Inc._v._Power_Ventures,_Inc.
* http://blog.icreon.us/web-scraping-and-you-a-legal-primer-for-one-of-its-most-useful-tools/

...and Weev:

* http://arstechnica.com/tech-policy/2012/11/internet-troll-who-exploited-att-security-flaw-faces-5-years-in-jail/

...and of course, Aaron Swartz:

* http://www.newyorker.com/tech/elements/when-programmers-scrape-by

...so when I say "many forms and with many motives and intentions", I must
acknowledge "dual use" - that some forms of scraping are benign, or are
protest, or are sharing that which perhaps should be shared.

But here, primarily, I am discussing the forms of scraping which are
third-party-based and exploitative of user data with intent to defraud; or
similar.


> BTW: are you the Alec Muffett name-checked in Kevin Mitnick's
> autobiography? I assume so.
>

Yeah, that was a long time ago. :-)



> It may be of note that when I got the Google error, Amazon also required a
> CAPTCHA in order for me to login to my account. Whomever was using the exit
> node maliciously, was obviously affecting non-Google organizations too.
>

Indeed, that's possible; in fact I should amend my previous post to point
out that "scrapers" - people who scrape - do so through many different
proxy networks, not only Tor, and also that some forms of scraping utilise
(e.g.) malicious browser plugins that are installed by otherwise entirely
blameless people: victims who don't realise that their web browser is now
part of some scraping outfit's infrastructure.

You ask an interesting question about "badness" of IP addresses; long story
short, what you are referring to are "IP reputation databases", which are
used by many people. For instance:


https://github.com/botherder/targetedthreats/blob/master/targetedthreats.rules

…from Claudio Guarnieri (@botherder) is a list of IP-based Snort IOC
(Indicator of Compromise) rules for civil society organisations to use.
tldr: If your organisation sees network traffic matching the list of IOCs
on your network, bad shit may be happening to you.

Speaking generally about industry rather than specifically about FB or any
other company: there are only (worst-case) ~4.3 billion IPv4 addresses in
the world (and a few more v6) and since the average hard drive is ~1TB
nowadays it's pretty trivial to build & share databases of how much
"badness" is measured to be emanating from any given IP address.
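
As a rough back-of-envelope check of that claim, the sketch below works out
how much storage a single drive leaves per IPv4 address. It assumes a "1TB"
drive means 10^12 bytes, and the record sizes are purely illustrative; none
of these names or figures come from any real reputation database.

```python
# Back-of-envelope: how much room is there per IPv4 address on one drive?

IPV4_ADDRESSES = 2 ** 32   # 4,294,967,296: every possible IPv4 address
DRIVE_BYTES = 10 ** 12     # assumption: a "1TB" drive holds 10^12 bytes


def bytes_per_address(drive_bytes: int = DRIVE_BYTES) -> int:
    """Whole bytes available per address if the drive stores nothing else."""
    return drive_bytes // IPV4_ADDRESSES


def table_size_gib(record_bytes: int) -> float:
    """Size in GiB of a flat table with one fixed-size record per address."""
    return IPV4_ADDRESSES * record_bytes / 2 ** 30


if __name__ == "__main__":
    print("bytes available per address:", bytes_per_address())
    # Hypothetical record sizes: a 1-byte score, or a 16-byte record
    # holding a few counters and a timestamp.
    for record_bytes in (1, 16):
        print(record_bytes, "byte record ->", table_size_gib(record_bytes), "GiB")
```

Even a fairly generous fixed-size record per address fits comfortably on
commodity hardware, which is the point being made above.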

So that's what tends to happen: it's not (necessarily) a matter of what
kind of software the computer is running (though that is helpful to know) -
nor does it entirely matter what country the computer appears to be in
(though some countries _are_ more lax about quenching bad network
neighbourliness).

Instead it's more (though not exclusively) a matter of measuring actual
observed behaviour emanating from given IP addresses.

What happens *after* such information gets collected is more interesting;
some organisations call for network "shunning" à la redlining (
https://en.wikipedia.org/wiki/Redlining) - others enforce CAPTCHAs on IP
addresses which are known to enable scrapers. Yet more do rate-limiting or
temporary bans.
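
To make the "measure behaviour, then respond" idea concrete, here is a
minimal sketch of a per-IP reputation table. Everything in it is an
assumption made for illustration: the class and function names, the
bad-request ratio as the "badness" measure, and the thresholds are all
hypothetical, and no real service is claimed to work this way.

```python
from collections import defaultdict
from enum import Enum


class Response(Enum):
    ALLOW = "allow"
    CAPTCHA = "captcha"
    RATE_LIMIT = "rate-limit"
    TEMP_BAN = "temporary ban"


# Hypothetical thresholds on the fraction of requests judged "bad".
CAPTCHA_AT = 0.2
RATE_LIMIT_AT = 0.5
TEMP_BAN_AT = 0.9
MIN_SAMPLES = 20  # don't judge an address on a handful of requests


class ReputationTable:
    """Tracks observed behaviour per IP address, not software or country."""

    def __init__(self) -> None:
        self._counts = defaultdict(lambda: [0, 0])  # ip -> [bad, total]

    def observe(self, ip: str, bad: bool) -> None:
        """Record one request from `ip`, flagged bad or not by some detector."""
        counts = self._counts[ip]
        counts[0] += int(bad)
        counts[1] += 1

    def badness(self, ip: str) -> float:
        """Fraction of observed requests from `ip` that were flagged bad."""
        bad, total = self._counts.get(ip, (0, 0))
        return bad / total if total else 0.0

    def respond(self, ip: str) -> Response:
        """Pick a response from the measured badness of `ip`."""
        _, total = self._counts.get(ip, (0, 0))
        if total < MIN_SAMPLES:
            return Response.ALLOW
        score = self.badness(ip)
        if score >= TEMP_BAN_AT:
            return Response.TEMP_BAN
        if score >= RATE_LIMIT_AT:
            return Response.RATE_LIMIT
        if score >= CAPTCHA_AT:
            return Response.CAPTCHA
        return Response.ALLOW


if __name__ == "__main__":
    table = ReputationTable()
    # 192.0.2.1 is a documentation-only address, used here as a stand-in
    # for a shared exit node: 30 of 100 requests are flagged as scraping.
    for i in range(100):
        table.observe("192.0.2.1", bad=(i < 30))
    print(table.badness("192.0.2.1"), table.respond("192.0.2.1"))
```

Note how a shared address such as a Tor exit node inherits the score of
whoever else is using it, which is why blameless users end up seeing the
CAPTCHA.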

An organisation's response to scraping seems typically the product of:

1) the technical resources at its disposal
2) its ability to distinguish scraping from non-scraping traffic
3) the benefit to the organisation of sieving-out and handling the
non-scraping traffic, rather than ignoring it all

I would argue that Facebook was the first to launch a really large onion
site because it scores highly (HHH/HMH) in all three of these categories:
big brains, actual high-signal login credentials, and a million normal
people who want to use Facebook over Tor (especially "at need").

By comparison I would estimate Google as HMM (or HML) and Cloudflare as
HLL; both companies with great people (I know many of them) but with Medium
or Low abilities to sort scraping from non-scraping, and Medium or Low
impetus to do so.

This is why corporate outreach is so important for Tor: to build awareness
and raise perception so that the third factor - the benefit of handling
non-scraping traffic - becomes more important for other companies to
address.

    - alec

-- 
http://dropsafe.crypticide.com/aboutalecm