
Re: Tor client performance (was Re: URGENT: patch needed ASAP for authority bug)

On Thu, Apr 22, 2010 at 03:30:06AM -0500, Scott Bennett wrote:
> >Thanks for the brilliant patch.
> >
>      Well, Roger's patch may provide some relief until the next tor release
> comes out, but lest anyone get too excited, it would be well to keep in mind
> that it is a patch to treat symptoms.  It is not at all clear yet, AFAIK,
> what the cause of the recent troubles has been.  Wiping out connections that
> might otherwise remain available for use, thereby making it necessary for
> clients to make new SSL connections sooner than they might otherwise have
> needed does have a cost.  At some point, I hope that cause will be found and
> dealt with.

Actually, you're in luck -- we do have a pretty good handle on at least
part of the problem.

A few years ago, we switched clients over to tunneling their directory
requests via ordinary TLS (aka "OR") conns rather than sending them over
plaintext http, so the requests would be harder to censor.

But while clients use a small set of first hops (called "entry guards")
when building three-hop circuits -- the ones they use for anonymity --
they just pick any relay when they want to do a directory fetch. They
weight these choices by relay bandwidth, so we give more attention to
the relays that can handle more attention.
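
To make the bandwidth weighting concrete, here is a rough sketch in
Python. It is an illustration only, not Tor's actual selection code:
the function name and the (name, bandwidth) pairs are hypothetical, and
it assumes the weight is simply proportional to the listed bandwidth.

```python
import random

def pick_directory_mirror(relays, rng=random):
    """Pick one relay at random, weighted by bandwidth.

    `relays` is a hypothetical list of (name, bandwidth) pairs; the
    real relay data structures look nothing like this.
    """
    names = [name for name, _ in relays]
    weights = [bandwidth for _, bandwidth in relays]
    # random.choices draws with probability proportional to each weight.
    return rng.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    relays = [("fast", 900), ("slow", 100)]
    print(pick_directory_mirror(relays))
```

With these weights a relay listing nine times the bandwidth gets picked
about nine times as often, which is why the fast relays see most of the
directory fetches.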

We've been meaning for a while to switch to a "directory guard" design,
where we only ask our entry guards for directory information. This
approach would be better for privacy, because it would be harder for
a directory mirror to enumerate users (he can't learn what they do,
but he can learn that they use Tor). It would be better for scalability
too, since we wouldn't be spewing out so many new TLS connections (sound
familiar?). But it would put additional load on just the entry guards,
and worse, it would screw up all our user count statistics including the
per-country graphs that are helping us understand where Tor is seeing use.

So the problem in this thread was that Tor clients weren't hanging up
quickly enough once they'd done a directory fetch. Unlike when they're
using their entry guards, Tor clients are quite unlikely to get back to
the same relay for the next directory fetch. So it's quite reasonable for
them to hang up a lot faster than they do for TLS conns to their guards.

We should change clients to do this faster hanging up. In the meantime
(and in addition), we should teach directory mirrors to defend themselves
from over-zealous clients.

The approach in the patch is to close only the circuits which behave
like directory fetches that are finished: they came from a client,
don't extend anywhere else, don't have any streams on them right now,
and haven't answered any queries in a while (in the patch, "a while" is
60 seconds). Once we close those circuits, our earlier logic to close
TLS conns that aren't in use (don't have any circuits) and have been
idle "a while" kicks in. In this case "a while" used to be 15 minutes,
and it's 1 minute in the patch here.
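
The two-stage logic described above can be sketched roughly like this.
It is a toy model in Python, not the patch itself (which is C inside
Tor): the class names, field names and function names are all made up
for illustration, and only the two 60-second values come from the
description above.

```python
from dataclasses import dataclass, field

CIRCUIT_IDLE_TIMEOUT = 60  # seconds; "a while" for circuits in the patch
CONN_IDLE_TIMEOUT = 60     # seconds; used to be 15 * 60

@dataclass
class Circuit:  # hypothetical stand-in for a relay-side circuit
    from_client: bool
    extends_further: bool
    open_streams: int
    last_activity: float  # time of the last answered query

@dataclass
class Connection:  # hypothetical stand-in for a TLS (OR) conn
    circuits: list = field(default_factory=list)
    last_activity: float = 0.0

def looks_like_finished_dir_fetch(circ, now):
    """True if the circuit matches all four conditions listed above."""
    return (circ.from_client
            and not circ.extends_further
            and circ.open_streams == 0
            and now - circ.last_activity >= CIRCUIT_IDLE_TIMEOUT)

def expire_idle(conns, now):
    """Drop finished fetch circuits, then conns left unused and idle."""
    survivors = []
    for conn in conns:
        # Stage 1: close circuits that look like finished dir fetches.
        conn.circuits = [c for c in conn.circuits
                         if not looks_like_finished_dir_fetch(c, now)]
        # Stage 2: keep the conn only if it still carries a circuit
        # or has not yet been idle for the conn timeout.
        if conn.circuits or now - conn.last_activity < CONN_IDLE_TIMEOUT:
            survivors.append(conn)
    return survivors
```

The point of doing it in two stages is that the second check already
existed; the patch only has to make these circuits go away so that the
existing idle-conn logic gets a chance to fire.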

So that leaves us two questions.

First: what timeout should we actually pick for these circuits and
connections? I'm inclined to be pretty aggressive -- first because
everybody is freaking out about this huge influx of connections, second
because there *is* a huge influx of connections, and third because these
circs and conns really are unlikely to be reused again if they aren't
reused quite quickly.

Second: why is there a huge influx of connections? I think there are
three answers here. A) Mike's new bandwidth weightings are putting more
attention on the fast relays than before. B) There actually are a lot of
new Tor clients that can reach the Tor network, now that China opened
up a bit. And C) We had some other bugs lately that reduced the number
of available relays, and people have been turning off their DirPort,
leading to even more concentrated pain on the directory mirrors that
remain. My intuition is that "B" is the heaviest factor.

>  Having a way to close idle OR connections based upon a timeout
> specified by the authorities in the consensus, but overridable by a torrc
> line by individual relay operators, looks to me like a good thing to have
> henceforward.  That way attentive relay operators can decrease or increase
> the timeout period according to their needs, but the authorities would still
> have a possibility of adjusting the timeout period on NORDO relays on all
> the other relays.

Yes, maybe. I'm not sure. The broader question here is how many
connections are too many, and how aggressive should we be at killing them.
It's a shame to be killing them at the server end at all -- in an ideal
world, clients should be realizing they won't need them and hanging up
early. But the parameters we picked for clients a while ago didn't take
into account that there would be half a million Tor clients all clamoring
for attention at once.

>      Thanks for making the patch available, Roger.  Circuit build times have
> climbed here from ~26 s a few days ago to 97 s at the moment.  It will be
> good to see those times fall again.

Yeah, things sure have gone to hell lately:

I should get this patch into git so more people can upgrade.

And then at some point think about whether to get something into a new
stable release.