
Re: URGENT: patch needed ASAP for authority bug



Hi Scott,

No reason to panic currently. I've cc'ed Mike and Nick here, in case they
can better explain what is going on.

On Apr 15, 2010, at 2:42 PM, Scott Bennett wrote:
I believe I spotted an authority bug with pretty severe consequences this
a.m. It is having a seriously bad effect on the star heavyweight node of
the Tor network, Olaf Selke's blutmagie. I can't submit a PR for it due to
the flyspray web page's problems with letting me log in, and Olaf wrote me
that he's at work at the moment and can't submit a PR until he gets home
after work.  So please read on, and if someone would please submit an
urgent PR for this, we (and probably others) would appreciate it. If you
do, please shoot a note off to Olaf <olaf.selke@xxxxxxxxxxxx> to let him
know about it, so he won't submit a duplicate PR. I don't think a fix for
this one should wait for the next release. Instead, patches for both
"stable" and "alpha" branches should be made available to authority
operators as soon as someone can come up with them. (Only the authorities
need to be fixed right away because the bug is somewhere in the authority
code for generating consensus entries.)
    Here's what I found.  blutmagie's torrc is set up for a target
throughput rate of 18000 KB/s and a maximum burst rate of 24000 KB/s.
Olaf noticed that blutmagie was being swamped by a horrendous load of
incoming connections nearly all the time, so he tried using
MaxAdvertisedBandwidth to reduce the frequency of inbound connections.
He repeatedly lowered the maximum advertised rate, and blutmagie's
descriptor correctly reflects that, now showing a target rate of 2000 KB/s,
but the connection rate showed no apparent change.  He recently began
reporting this trouble on OR-TALK, IIRC, but no one seemed to know why the
limit on the advertised target rate, even when set so low compared to the
actual rate and to the rates published by other heavyweight nodes, didn't
reduce the load.
    The problem lies in the consensus document, where it shows (or did
an hour or so ago),

w Bandwidth=27900

Note that 27900 KB/s is considerably higher than the maximum burst rate in
the descriptor and is 13.95 times the supposed maximum advertised rate.
That means that, while old client versions that use the values in the
descriptors in their route selection process will probably honor the
maximum advertised rate of 2000 KB/s, newer clients use the rate in the
consensus, 27900 KB/s, in theirs, thus continuing to drown blutmagie in an
ongoing flood of incoming connections.
The authorities are currently disregarding the limit published in every
node's descriptor and instead are conjuring up their own numbers. This
needs to stop and right away.

The value in the consensus is not an actual bandwidth, but rather it is a
bandwidth weight, used by clients to do load balancing. This value is
automatically determined by directory authorities doing active
measurements of nodes' capacity, to more evenly distribute the load.
Blutmagie, having huge capacity and a lot of unused bandwidth, gets a big
share of the network's load. I have warned that this might lead to
sad consequences, as available bandwidth is not the only factor to
determine how much traffic a node can handle, but rather there are other
things to take into account (number of circuits you need to establish,
higher memory requirements to service lots of connections compared to
only one connection that the bandwidth scanner uses, higher overhead
when more connections need to be handled).
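
To make the weighting idea concrete, here is a rough Python sketch of
weight-proportional relay selection. It is only an illustration, not Tor's
actual path selection code, which applies further constraints (flags,
position in the circuit, exit policies and so on). Only blutmagie's value
of 27900 comes from this thread; the other relay names and weights, and the
function names, are made up.

```python
import random

# Hypothetical consensus weights. Only blutmagie's 27900 is taken from
# the thread; relay_a and relay_b are invented for illustration.
consensus_weights = {
    "blutmagie": 27900,
    "relay_a": 5000,
    "relay_b": 1200,
}


def selection_share(weights):
    """Expected fraction of selections per relay when clients choose
    relays with probability proportional to their weight."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def pick_relay(weights, rng=random):
    """Pick a single relay, weighted by its consensus weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


shares = selection_share(consensus_weights)
for name, share in shares.items():
    print(f"{name}: {share:.1%} of selections")
```

In this toy network the relay with the largest weight ends up with most of
the selections, regardless of whether it can cope with the resulting number
of connections.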

Another side-effect is that limiting your bandwidth via MaxAdvertised*
options is no longer viable, because the active measurements are
affecting circuit building, not the passive advertised values. This has
bad consequences for everyone who tries to attract few clients, but
has lots of bandwidth (we're seeing the problem on a few vservers as
well).
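
As a small sketch of why the MaxAdvertised* setting stops mattering: the
two numbers below are the ones quoted in this thread, and the function
name and the boolean switch are hypothetical, standing in for "does this
client use the consensus weights or the descriptor values".

```python
# Numbers quoted in the thread for blutmagie.
DESCRIPTOR_ADVERTISED = 2000   # capped via MaxAdvertisedBandwidth
CONSENSUS_WEIGHT = 27900       # from the "w Bandwidth=" consensus line


def weight_used_by_client(uses_consensus_weights):
    """Return the value a client would feed into relay selection.

    Assumption: clients that use consensus weights ignore the descriptor
    value entirely, while older clients use only the descriptor value.
    """
    if uses_consensus_weights:
        return CONSENSUS_WEIGHT
    return DESCRIPTOR_ADVERTISED


ratio = weight_used_by_client(True) / weight_used_by_client(False)
print(f"consensus weight is {ratio:.2f}x the advertised value")
```

Lowering the advertised value changes only the second number, so clients
that read the consensus weight never see the reduction.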

I'm not sure what can be done about this, because measuring
bandwidth is easy and has led to dramatic speed increases in the
network for people running the 0.2.2.x versions (only those use the
bandwidth weights currently, afaik); whereas measuring a node's
capacity to deal with massive amounts of connections is not trivial.

Something that might or might not figure into this is that newly started
Tor clients do active speed tests, building test circuits for the first ~hour
and a half to find a good value for timing out slow circuits. These
additional circuits might explain a generally higher load on the relays,
but I'm not sure about this here.

So, to summarize: There is currently no bug in the authority code; the
authorities are working as intended. I'm waiting for Mike's further input
here to
see if we need or can do something about the trouble it seems to
create for blutmagie.

Sebastian