Re: URGENT: patch needed ASAP for authority bug
Hi Scott,
no reason to panic currently. I've cc'ed Mike and Nick here, in case
they can better explain what is going on.
On Apr 15, 2010, at 2:42 PM, Scott Bennett wrote:
I believe I spotted an authority bug with pretty severe consequences
this a.m. It is having a seriously bad effect on the star heavyweight
node of the tor network, Olaf Selke's blutmagie. I can't submit a PR
for it due to the flyspray web page's problems with letting me log in,
and Olaf wrote me that he's at work at the moment and can't submit a PR
until he gets home after work. So please read on, and if someone would
please submit an urgent PR for this, we (and probably others) would
appreciate it. If you do, please shoot a note off to Olaf
<olaf.selke@xxxxxxxxxxxx> to let him know about it, so he won't submit
a duplicate PR. I don't think a fix for this one should wait for the
next release. Instead, patches for both "stable" and "alpha" branches
should be made available to authority operators as soon as someone can
come up with them. (Only the authorities need to be fixed right away
because the bug is somewhere in the authority code for generating
consensus entries.)
Here's what I found. blutmagie's torrc is set up for a target
throughput rate of 18000 KB/s and a maximum burst rate of 24000 KB/s.
Olaf noticed that blutmagie was being swamped by a horrendous load of
incoming connections nearly all the time, so he tried using
MaxAdvertisedBandwidth to reduce the frequency of inbound connections.
He repeatedly lowered the maximum advertised rate, and blutmagie's
descriptor correctly reflects that, now showing a target rate of
2000 KB/s, but the connection rate showed no apparent change. He
recently began reporting this trouble on OR-TALK, IIRC, but no one
seemed to know why the limit on the advertised target rate, even when
set so low compared to the actual rate and also compared to the rates
published by other heavyweight nodes, didn't reduce the load.
The problem lies in the consensus document, where it shows (or did
an hour or so ago),
w Bandwidth=27900
Note that 27900 KB/s is considerably higher than the maximum burst rate
in the descriptor and is 13.95 times the supposed maximum advertised
rate. That means that, while old client versions that use the values in
the descriptors in their route selection process will probably honor
the maximum advertised rate of 2000 KB/s, newer clients use the rate in
the consensus, 27900 KB/s, in theirs, thus continuing to drown
blutmagie in an ongoing flood of incoming connections.
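The figures quoted above can be checked with a few lines of Python. This is only a sketch: the helper name parse_consensus_bandwidth is made up for illustration, and it handles just the "w Bandwidth=..." form shown in this message, not the full consensus format.

```python
def parse_consensus_bandwidth(w_line):
    """Return the Bandwidth value from a consensus 'w' line.

    Hypothetical helper: it only understands space-separated key=value
    fields after a leading 'w', as in the line quoted in this thread.
    """
    fields = w_line.split()
    if not fields or fields[0] != "w":
        raise ValueError("not a 'w' line: %r" % w_line)
    for field in fields[1:]:
        key, _, value = field.partition("=")
        if key == "Bandwidth":
            return int(value)
    raise ValueError("no Bandwidth field in %r" % w_line)


# Numbers taken from the message above.
consensus_value = parse_consensus_bandwidth("w Bandwidth=27900")
advertised_cap = 2000   # rate shown in blutmagie's descriptor
burst_rate = 24000      # maximum burst rate from the torrc

ratio = consensus_value / advertised_cap
exceeds_burst = consensus_value > burst_rate

print("consensus/advertised ratio:", ratio)
print("consensus value above burst rate:", exceeds_burst)
```

The ratio matches the 13.95 figure given above, and the consensus value is indeed larger than the configured burst rate.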
The authorities are currently disregarding the limit published in
every node's descriptor and instead are conjuring up their own numbers.
This needs to stop and right away.
The value in the consensus is not an actual bandwidth, but rather a
bandwidth weight, used by clients to do load balancing. This value is
automatically determined by directory authorities doing active
measurements of nodes' capacity, to more evenly distribute the load.
Blutmagie, due to having huge capacity, gets a big share of the network
by having a lot of unused bandwidth. I have warned that this might lead
to sad consequences, as available bandwidth is not the only factor that
determines how much traffic a node can handle; there are other things
to take into account (number of circuits you need to establish, higher
memory requirements to service lots of connections compared to the
single connection that the bandwidth scanner uses, higher overhead
when more connections need to be handled).
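To illustrate why the weight matters more than the descriptor value, here is a minimal sketch of weight-proportional selection. It is a deliberate simplification and not Tor's actual path selection code: the relay names relay_a, relay_b and relay_c and their weights are invented, and only blutmagie's two figures (27900 and 2000) come from this thread.

```python
import random
from collections import Counter


def expected_share(weights, name):
    """Fraction of selections a relay gets under pure weighted choice."""
    return weights[name] / sum(weights.values())


def pick_relay(weights, rng):
    """Pick one relay with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical rest of the network (assumed values, for illustration only).
others = {"relay_a": 5000, "relay_b": 3000, "relay_c": 2000}

by_consensus = dict(others, blutmagie=27900)   # measured consensus weight
by_descriptor = dict(others, blutmagie=2000)   # advertised cap

share_consensus = expected_share(by_consensus, "blutmagie")
share_descriptor = expected_share(by_descriptor, "blutmagie")
print("share using consensus weight: %.3f" % share_consensus)
print("share using advertised cap:   %.3f" % share_descriptor)

# Simulate clients that select by consensus weight.
rng = random.Random(0)
picks = Counter(pick_relay(by_consensus, rng) for _ in range(10000))
print(picks.most_common())
```

In this toy network, a client that selects by the consensus weight sends most of its picks to blutmagie, while one that selects by the advertised cap would not. The real share depends on the weights of every other relay in the consensus.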
Another side-effect is that limiting your bandwidth via MaxAdvertised*
options is no longer viable, because the active measurements are
affecting circuit building, not the passive advertised values. This has
bad consequences for everyone who tries to attract few clients, but
has lots of bandwidth (we're seeing the problem on a few vservers as
well).
I'm not sure what can be done about this, because measuring
bandwidth is easy and has led to dramatic speed increases in the
network for people running the 0.2.2.x versions (only those use the
bandwidth weights currently, afaik); whereas measuring a node's
capacity to deal with massive amounts of connections is not trivial.
Something that might or might not figure into this is that newly started
Tor clients do active speed tests, building test circuits for the
first ~hour and a half to find a good value for timing out slow
circuits. These additional circuits might explain a generally higher
load on the relays, but I'm not sure about this here.
So, to summarize: There is currently no bug in the authority code; the
authorities are working as intended. I'm waiting for Mike's further
input here to see if we need or can do something about the trouble it
seems to create for blutmagie.
Sebastian