
Tor client performance (was Re: URGENT: patch needed ASAP for authority bug)



On Fri, Apr 16, 2010 at 04:06:08AM -0500, Scott Bennett wrote:
> > We cut the latency of the Tor
> >network in half by rebalancing the network to send more traffic to the
> >relays that could handle it.
> 
>      Please produce the evidence in support of such a claim.  Like I wrote
> in response to Sebastian earlier, there has certainly been no such effect
> visible from here.

Take a look at
http://metrics.torproject.org/torperf-graphs.html
In particular,
http://metrics.torproject.org/graphs/torperf/torperf-50kb-moria-12m.png

This is a graph of the average time it takes to fetch a 50KB file over
Tor. We have a process running on moria that fetches this 50KB file
every 5 minutes (with the exception of a period in January where I was
cleaning up from the break-in).

You see the part at the beginning of September 2009 where it drops
from about 8-9 seconds to about 4-5 seconds? That's when we switched
from using the bandwidths in the descriptors as weights to using the
authority-measured bandwidths as weights.

The reason for the new approach was that relays with lots of capacity
were not seeing as much attention from Tor clients as they "should" in
a better balanced network. This conclusion is based in part on Steven's
PETS 2008 paper:
http://freehaven.net/anonbib/#murdoch-pet2008
plus Steven's graphs in Section 4 of the performance paper:
https://blog.torproject.org/blog/why-tor-is-slow
and in part on Mike's results from measuring the live network -- see
Figure 10 in http://fscked.org/talks/TorFlow-HotPETS-final.pdf

Mike's theory was that we could rebalance the network by *over*emphasizing
the attention that fast relays get, until the measurements in Figure 10
look flatter. You can read an overview of the actual measurement approach,
plus the calculations it uses, here:
https://gitweb.torproject.org//tor.git?a=blob_plain;f=doc/spec/proposals/161-computing-bandwidth-adjustments.txt;hb=HEAD

The basic idea is to compare each node against its peers, meaning the
other nodes that advertise a similar bandwidth. The ones that measure
faster than their peers get the weight that clients give them bumped
up proportionally; similarly, the ones that measure slower get bumped
down. Relays that are given higher weights attract more traffic load.
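
To make that concrete, here is a rough sketch of the comparison in
Python. This is my own simplification, not the actual bwauthority
code, and the names are made up; see proposal 161 above for the real
calculations.

    # Compare relays that advertise similar bandwidth ("peers"), and
    # scale each one's weight by how fast it measured relative to the
    # group average. Illustration only, not the bwauthority code.
    def adjust_slice(slice_relays):
        """slice_relays: list of (advertised_bw, measured_bw) pairs
        for relays that advertise roughly the same bandwidth."""
        avg = sum(m for _, m in slice_relays) / len(slice_relays)
        new_weights = []
        for advertised, measured in slice_relays:
            # Faster than your peers: weight goes up proportionally.
            # Slower than your peers: weight goes down.
            new_weights.append(advertised * (measured / avg))
        return new_weights

So a relay that advertises 1000 but measures twice as fast as its
peers ends up with weight 2000, and attracts correspondingly more
client traffic.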

There's actually a brief drop in the torperf graph at the beginning of
September right before the longer-lasting drop. That's where we tried out
the feature. Mike's original idea was that we would use the bandwidths
in a feedback style, meaning that in the next round of comparisons, your
peers would be the relays that have similar weights in the *consensus*,
rather than similar weights in their *descriptors*. But the problem was
that it
never stabilized. The really fast relays never attracted enough traffic,
so they were always faster than their peers, so their weights just grew
at each round. So we scrapped that notion.
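
Here is the divergence in toy form (again my simplification, not
Mike's code). A fast relay that never attracts enough traffic to slow
down keeps measuring faster than its peer group, so every comparison
round multiplies its weight by a ratio above 1:

    weight = 100.0
    for i in range(5):
        weight *= 1.5  # still faster than its peers, bumped up again
        print(i, weight)
    # prints 150.0, 225.0, 337.5, 506.25, 759.375: the weight grows
    # without bound instead of settling at a fixed point

Comparing against the fixed descriptor bandwidths instead gives each
round a stable reference point.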

The result is that we cut the latency of the Tor network almost in half.

You can see two spikes in November -- that's when the bwauthority scripts
failed to achieve consensus, and we reverted to using the values in
descriptors. I don't remember for sure, but I think a few of the other
latency spikes can be explained by that too. Somebody could go check
the consensus archives at
http://archive.torproject.org/tor-directory-authority-archive/
if we want to learn more.

How come we're back up at 7 seconds? I think the load on the network is
continuing to increase, cancelling out some of the gains we made. There
could be other possible explanations too of course. I have a nagging
feeling in the back of my mind that Mike's changes were extra impressive
when he was the only client using them, but they became less striking
when the rest of the clients followed suit. We saw an example of this
phenomenon in the paper by Snader and Borisov, where they propose an
alternate path weighting algorithm:
http://freehaven.net/anonbib/#snader08
Their proposed algorithm works really well when they're the only ones
doing it, and my theory is that it works far worse than the current
Tor algorithm when everybody starts doing it. It's an open research
question in practice, though; I could well be wrong.

Incidentally, you can see a horrible spike in latency in the mid-April
2010 part of the torperf graph -- up to like 15 seconds for an average
50KB fetch:
http://metrics.torproject.org/graphs/torperf/torperf-50kb-moria-6m.png
That's because we tried turning on the new EWMA design by Ian Goldberg
and his grad students at Waterloo, to give higher priority to quieter
circuits:
http://archives.seul.org/or/cvs/Dec-2009/msg00322.html
http://archives.seul.org/or/cvs/Dec-2009/msg00324.html
Changes in version 0.2.2.7-alpha - 2010-01-19
[...]
    - When choosing which cells to relay first, relays can now favor
      circuits that have been quiet recently, to provide lower latency
      for low-volume circuits. By default, relays enable or disable this
      feature based on a setting in the consensus. You can override
      this default by using the new "CircuitPriorityHalflife" config
      option. Design and code by Ian Goldberg, Can Tang, and Chris
      Alexander.
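
The core idea is easy to sketch. This is my own Python rendering of
the design, not the actual C implementation (see the commits above
for that): each circuit keeps an exponentially-decaying count of
recent cells, and the relay flushes cells from the quietest circuit
first.

    import math, time

    class Circuit:
        def __init__(self, halflife=30.0):
            self.count = 0.0          # decaying count of recent cells
            self.last = time.time()
            self.halflife = halflife  # seconds for the count to halve

        def decayed_count(self, now):
            elapsed = now - self.last
            return self.count * math.pow(0.5, elapsed / self.halflife)

        def note_cell(self):
            # Called whenever this circuit relays a cell.
            now = time.time()
            self.count = self.decayed_count(now) + 1.0
            self.last = now

    def pick_next(circuits):
        # Favor the circuit that has been quietest recently.
        now = time.time()
        return min(circuits, key=lambda c: c.decayed_count(now))

The CircuitPriorityHalflife option controls how fast the count
decays; setting it to 0 falls back to the old round-robin behavior.
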
I'll try to dig up a more traditional design document for those interested
in learning more; I believe the grad student who was writing it up vanished
abruptly though, alas. In any case, long story short, we had a bug in our
implementation that made it go really wrong:
http://archives.seul.org/or/cvs/Apr-2010/msg00179.html
We stopped the experiment last week, and the torperf results should be
back to normal by now (it takes a few days before the graphs update).

>      That was an interesting read, Roger.  Thank you for the reference.
> My assessments (after a single reading) are that the first section looked
> very good, especially the analysis of circuit construction failure rates.
> Those results are quite valuable and immediately applicable information
> to have.  The third section, which describes the exit node security-scanning
> project also looks quite good, but it is obvious that that will be an
> ongoing project for many years because the potential for further enhancement
> seems almost endless.  The middle section, though, while describing a fairly
> fancy processing approach, is nevertheless built upon a choice of a useless
> source of data, as I continue to point out.

Well, I wouldn't call it useless. It sure is better than the old
approach. I would encourage you to write up more details about what
your better approach is, and why. You and Mike can fight it out. But
just telling us we're doing it theoretically wrong doesn't really
accomplish anything.

> >One downside of this new feature is that the old trick we had,
> >MaxAdvertisedBandwidth, doesn't as much do what you want.
> 
>      Does it do *anything* at all anymore besides adjust what a node
> publishes in its descriptor?

It does: it influences which peers the bandwidth authority scripts
compare you to. So (simplifying things) if you cut your advertised
bandwidth in half, but the bwauthorities measure you as twice as
impressive as your peers compared to before, then it's a no-op. In my
experience if you set maxadvertisedbandwidth to something quite low,
then the bwauthorities won't ever be terribly impressed with you.
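
In back-of-the-envelope terms (my framing, not the exact formula),
the consensus weight is roughly advertised * (your measured speed /
your peer group's average). Halving your advertised bandwidth drops
you into a slower peer group; if the scanners then find you twice as
impressive relative to those peers, the two changes cancel:

    old_weight = 1000 * 1.2  # advertised 1000, measured 1.2x your peers
    new_weight = 500 * 2.4   # advertised 500, now 2.4x the slower peers
    assert old_weight == new_weight  # 1200.0 either way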

That sure doesn't solve our problem though.

>  It not only doesn't do what we want or
> what the documentation still says it does, it also removes some important
> operational control from the person most responsible for the node's
> operation and its impact upon the system and network in which the node
> runs.  That flies in the face of the philosophy prior to this change
> of leaving such control of operational characteristics in the hands of
> those we wish to encourage to run relays.  If relay operators discover
> that they can't limit consumption by tor of one or more resources anymore,
> they may choose, or even be forced to choose, to stop running them at all.
>      What will be in later versions of tor?  Allocation of exit policies
> by the authorities?

You are conflating issues here. We've never had a good way to limit
memory use, limit incoming connections, limit the number of relays your
relay will hold connections open to, and so on. As the number of Tor clients
continues to grow -- especially with some apparently huge spikes lately --
it is becoming clearer that we need to come up with some better answers
here. But the maxadvertisedbandwidth thing was never the silver bullet
that you're talking about. All the relays are getting overwhelmed
currently.

That said, we could imagine some solutions. One solution is that your
Tor relay can put an extra line in its descriptor specifying a maximum
weighting it wants for itself in the consensus. Then the directory
authorities take that into account in their bandwidth votes. This approach
has the downside that you are limiting an artificial unitless number, so
you may have to keep adjusting your number as the network changes around
you. But let's take a step back -- you *already* were in that position
before, when the unitless number was slightly more like bandwidth. By
setting maxadvertisedbandwidth, you were reducing your share of the
total network weighting that clients use when allocating their traffic,
and hoping that would somehow regulate your memory and socket load. The
fraction of the clients you attracted was based on how many other relays
there were and what weights they published, not (just) on the values you
chose for bandwidthrate or maxadvertisedbandwidth.
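
A minimal sketch of that proposal, assuming a hypothetical
MaxConsensusWeight descriptor line (no such line exists today):

    def vote_for(measured_weight, max_consensus_weight=None):
        # The authority clamps its measured vote to the relay's
        # self-declared ceiling, if the relay declared one.
        if max_consensus_weight is not None:
            return min(measured_weight, max_consensus_weight)
        return measured_weight

    # vote_for(4200, max_consensus_weight=3000) -> 3000

And the caveat from above still applies: that 3000 is a unitless
consensus number, so the fraction of traffic it attracts drifts as
the rest of the network changes around you.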

(The reason we can't just let people choose an upper bound on sockets
is the unclear anonymity implications of a non-clique topology,
and the difficulty of telling every client what links work and what
links don't. See e.g. http://freehaven.net/anonbib/#danezis-pet2008
for details.)

Is my "putting an extra line in the descriptor" solution good enough? It's
certainly a start. I'd love to hear a better idea though -- especially
if it's either as easy to implement, or if not, comes with design and
implementation help. :)

For example, it's possible we should flatten the high end of relays as
ordered by bandwidth. Right now the bwauthorities never vote a single
relay's weight in the consensus to be more than 5% of the network. We
could lower that to 3% or something. It would slow down the network as a
whole, but it would provide more breathing room to the really fast relays.
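
Sketched out (with made-up names), the cap is a simple clip against
the network total:

    def cap_weights(weights, cap=0.05):
        # Clip any single relay to at most `cap` of the network total.
        # One pass only: after clipping, the clipped relays hold a bit
        # less than `cap` of the new, smaller total. Fine for a sketch.
        total = sum(weights)
        return [min(w, cap * total) for w in weights]

Lowering cap from 0.05 to 0.03 would shave weight off the very
fastest relays and spread that traffic across the rest of the
network.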

I should also note that having the bwauthorities measure the relays
resolves a huge anonymity vulnerability ("relays can lie about their
bandwidth") as described here:
http://freehaven.net/anonbib/#bauer:wpes2007

> >But that said, there does seem to be a problem here: we're seeing way
> >more directory requests than we were a few weeks ago. That's translating
> >into more connections seen by relays.
> >
> >Is that because we're seeing way more users? Or are Tor clients generating
> >more directory requests than they "should"? Hm.
> >
>      More bad news, I guess, but I don't see any direct connection to
> the problems central to this thread so far.  That doesn't mean there isn't
> such a connection, but none appears very obvious to me at the moment.

Well, the problem central to this thread is "Lately, certain relays are
receiving way more *connections* than they can handle, and it's not
only the relays at the very top of the bandwidth charts." So I think
it's very relevant.

--Roger