Re: [tor-talk] Tor and Google error / CAPTCHAs.

On 27 September 2016 at 06:42, grarpamp <grarpamp@xxxxxxxxx> wrote:

> On Sat, Sep 24, 2016 at 10:21 AM, Alec Muffett <alec.muffett@xxxxxxxxx>
> wrote:
> > [scraping]
> For some reason I view that as a copout.

You know, I would never phrase it that way, but in some respects I agree
with you.  I'll explain...

> I mean, provide real data showing that it's intolerable and
> I'll say yes with you. Otherwise google [et al's] infrastructure
> can surely handle it (the load), and even possibly intelligently
> defend against it.

It's not right to conflate:

    "their infrastructure can surely handle it!"

with:

    "they cannot be bothered to sort the wheat from the chaff!"

...but the latter is a lot closer to the truth than the former, and I find
it regrettable.

Let's do some back-of-the-envelope maths: I have no idea of Google's
statistics but if 1 million people use Facebook over Tor, and Facebook
serves 1.7 billion people, then the Tor-using population of Facebook is

  ( 1 million / 1.7 billion ) * 100 = 0.06% (rounded up)

...of the userbase.

To put this into context, imagine a vacuum cleaner, and the bag of dust in
it is about 1.5kg / 3.3lbs; then put a single grain of rice into the bag
(1/64g) -

  ( 1 / ( 64 * 1500 ) ) * 100 = 0.001%

So, proportionally, the people who use Facebook over Tor would amount to
about 60 grains of rice in that bag.
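
For anyone who wants to check the arithmetic, here is a minimal Python
sketch. The constants are the rough figures used in this message (1 million
Tor users, 1.7 billion users in total, a 1.5kg bag, 64 grains of rice per
gram); the function names are made up for illustration. Without the
intermediate rounding the ratio comes out nearer 56 grains than 60, which
does not change the picture.

```python
# Rough figures from this message; none of them are measured statistics.
TOR_USERS = 1_000_000        # assumed number of people using Facebook over Tor
ALL_USERS = 1_700_000_000    # assumed total Facebook userbase
BAG_GRAMS = 1500             # vacuum cleaner bag of dust, about 1.5kg
GRAINS_PER_GRAM = 64         # one grain of rice weighs about 1/64g


def tor_share_percent(tor_users=TOR_USERS, all_users=ALL_USERS):
    """Percentage of the userbase that arrives over Tor."""
    return tor_users / all_users * 100


def grain_share_percent(bag_grams=BAG_GRAMS, grains_per_gram=GRAINS_PER_GRAM):
    """Percentage of the bag's weight taken up by one grain of rice."""
    return 1 / (grains_per_gram * bag_grams) * 100


def grains_equivalent():
    """How many grains of rice in the bag match the Tor share of users."""
    return tor_share_percent() / grain_share_percent()


if __name__ == "__main__":
    print(f"Tor share of userbase: {tor_share_percent():.2f}%")
    print(f"One grain of rice:     {grain_share_percent():.3f}%")
    print(f"Equivalent grains:     {grains_equivalent():.1f}")
```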

That's about a teaspoonful of rice in a vacuum cleaner.  Have you ever
vacuumed-up a teaspoonful of dropped rice and not bothered to pick it out
of the bag?

You have to really _care_ about that rice, care about those users in order
to want to do that.  It's not economical behaviour.

But the situation is actually _worse_ than this, because the vast majority
of "legitimate" traffic does not pass through Tor en-route to Facebook or
Google, most of it is via apps, or via direct browsing.

When you're dealing with the traffic which emanates from Tor's exit nodes
the relative percentage of dust (scraping & spam) to rice (legit people)
increases greatly.

I don't know the numbers - 10x, 100x ? - it will vary from platform to
platform, and (as stated before) FB will have a slightly easier time of it
because of the richer signals from login credentials.

It might be 6 grains of rice in a vacuum cleaner, or 1 grain. Or less,
depending on the platform.
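
As a rough sketch of that dilution in Python: the 60-grain baseline is the
estimate from earlier in this message, and the 10x / 100x factors are the
guesses above, not measured figures. The function name is hypothetical.

```python
BASELINE_GRAINS = 60  # "about 60 grains" estimate from earlier in this message


def grains_after_dilution(baseline_grains, dust_factor):
    """Grains of rice left per bag if the dust-to-rice ratio is
    dust_factor times worse than the whole-userbase baseline."""
    return baseline_grains / dust_factor


if __name__ == "__main__":
    # 10x and 100x are guesses; the real factor varies by platform.
    for factor in (10, 100):
        grains = grains_after_dilution(BASELINE_GRAINS, factor)
        print(f"{factor}x more dust: about {grains:g} grains per bag")
```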

So to convince people who work at companies of the value of hunting for and
recovering these grains of rice, you have got to make them _care_.

> So sorry... when I search 'keyboard controllers' and get
> captcha'd, so far I'm thinking, "really?, such low tolerance?,
> you're full of shit".

I understand that perspective, but again that's looking at the "tail
wagging the dog".

In such circumstances they are not actually looking at you / what you are
searching for. They are looking at the behaviour of all traffic, of
everyone and everything else which emanates from that exit node.

They are mostly looking at a bag of dust, not at your rice-grain legitimate

And if you want to make them care about that, and if you would like them to
do better, my first tip is not to go around telling the (say: Google)
engineers that they are "full of shit".

It's a human thing.  It tends to make people upset and not listen.

I would love for Google and CloudFlare to do better in this space.  CF did
at least _try_ with a crazy proof-of-work scheme (which is a popular way of
identifying scrapers, btw) but that's a category error because Tor is a
network stack not a browser-access-solution.  But the Tor activist
community just totally savaged CF, with the entirely predictable result of
both sides hunkering down into a war of attrition.

Let's not repeat that?


