Re: Analysis of the problems many relay operators are currently facing



On Wed, Apr 21, 2010 at 12:04 PM, Sebastian Hahn <mail@xxxxxxxxxxxxxxxxx> wrote:
> I'll try to summarize here what I've learned in the past weeks about
> the problems we are currently having with the Tor network as a whole,
> and the issues that individual relay operators have, as well as
> describing the issues we have identified (some of which have been
> addressed already). As the information comes from #tor-dev on OFTC,
> bug reports and mailing lists, but no overview exists, it seems
> worthwhile to collect what we know.

Thanks for collecting all this info, Sebastian!

 [...]
>    Another issue exists that has not been identified yet, where a relay is
>    only reachable from outside sporadically, even though there is no load.
>    This issue is rare and has not been reproduced reliably.

I assume if they had anything in their logs, you would have said so.

> Another class of problems exists which affects some/many relays: The
> relay attracts a huge number of connections, affecting the stability
> of network equipment and the operating system. These problems might occur:
>
>    The Tor process runs out of memory, because it has too many open
>    connections. Tor is then killed by the OS's OOM-killer.

Where is the memory going, exactly?  Is it all in buffers, or is it
somewhere else?  Time to profile memory usage yet again.
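
For a first look on glibc systems, something along these lines would
tell us whether the heap is mostly live allocations or mostly
free-but-unreturned space (an illustrative snippet, not code that's in
Tor; mallinfo() is glibc-specific and its counters are plain ints, so
they can wrap on large heaps):

#include <malloc.h>
#include <stdio.h>

/* Log a rough breakdown of where glibc malloc's memory is going:
 * live allocations vs. freed-but-unreleased space (fragmentation)
 * vs. large mmap'd blocks. */
static void
log_malloc_usage(void)
{
  struct mallinfo mi = mallinfo();
  fprintf(stderr, "malloc: %d bytes in use, %d bytes free in heap, "
          "%d bytes in mmap'd blocks\n",
          mi.uordblks, mi.fordblks, mi.hblkhd);
}

Calling that every few minutes on a relay whose memory keeps growing
would show quickly whether fragmentation is the culprit.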

If this is mostly Linux (and the mention of an OOM-killer suggests
that it is), we might want to see if using the new-ish standalone
Linux jemalloc helps people.  (We keep hitting fragmentation issues
with glibc malloc.)  To try it out, build the code from
http://www.canonware.com/jemalloc/ , stick the resulting .so file in
your LD_PRELOAD before you start Tor (something like
"LD_PRELOAD=/path/to/libjemalloc.so tor", with the path adjusted to
wherever the build put the library), and see if the memory usage is
much better.

>    Tor exhausts the ulimit -n that applies to it, meaning that
>    unrelated operations like opening logfiles, establishing new
>    connections, or gathering more entropy fail, often creating many
>    warnings in Tor's logfile. In some cases it appears that Tor spins
>    until a file descriptor becomes available, burning all available CPU.

Some of our earlier discussion related to bug 925 is probably relevant
here.  We should solve that one so we can stop spinning when we're out
of fds.  What's more, when we run out of fds, it generally means that
the estimated number of connections we could handle was too high; if
we revise our estimate downwards, we can go back to keeping a few fds
in reserve for disk IO.
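
To sketch what I mean by revising downward (the function and constant
names here are invented for illustration, not what's in the tree):
derive the cap from the rlimit minus a reserve, and when a socket call
fails with EMFILE or ENFILE, shrink the cap rather than retrying in a
tight loop:

#include <sys/resource.h>

#define FD_RESERVE 32  /* fds held back for logs, disk IO, DNS, etc. */

static unsigned max_connections = 256;  /* conservative default */

/* Derive an initial connection cap from the process's fd limit. */
static void
init_max_connections(void)
{
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) == 0 && rl.rlim_cur > FD_RESERVE)
    max_connections = (unsigned)(rl.rlim_cur - FD_RESERVE);
}

/* Call when accept()/connect() fails with EMFILE/ENFILE: the estimate
 * was too high, so revise it downward instead of spinning. */
static void
note_fd_exhaustion(unsigned n_open_conns)
{
  if (n_open_conns > FD_RESERVE &&
      n_open_conns - FD_RESERVE < max_connections)
    max_connections = n_open_conns - FD_RESERVE;
}

The exact reserve size matters less than consistently keeping one.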

>    Tor makes a home router/DSL modem/kernel lock up, because it cannot
>    handle the load. Symptoms include that internet access is completely
>    nonfunctional even after the relay is stopped, or that it is extremely
>    slow. These symptoms might last until the relevant piece of equipment is
>    restarted.

Do we know what kind of load is needed for this?  Too many
simultaneous connections, too many simultaneous incoming connections,
too much bw, or what?

>    All these share the same underlying problem: Tor is getting more
>    connections than it can handle. One way to help would be to make sure
>    unused connections are closed more quickly, so that relays don't need
>    to maintain as many active connections concurrently as they need to do
>    now. A Tor patch that logs what state current connections have [0] shows
>    that on some systems, around 10% of all connections were used for a
>    begindir operation before, but now don't have a circuit attached anymore.

I think this number could be a bit on the low side: the numbers I'm
seeing people report on IRC seem higher than 10%.  Also, the patch
counts the number of connections that have only a single circuit: if
this circuit is the same circuit that was used for the begindir (and I
think it typically is), then those connections too are lying unused.
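
To make the "close unused connections sooner" idea concrete, here's
the shape of the sweep I have in mind; the types, field names, and the
five-minute timeout are all made up for illustration:

#include <stddef.h>
#include <time.h>

#define IDLE_OR_CONN_TIMEOUT (5*60)  /* seconds; needs real tuning */

typedef struct or_conn_t {
  int n_circuits;        /* circuits currently attached */
  time_t last_activity;  /* last time a cell moved on this connection */
  int marked_for_close;  /* set when the connection should be torn down */
} or_conn_t;

/* Periodically mark OR connections that carry no circuits and have
 * been idle past the timeout, so relays stop holding thousands of
 * dead begindir connections open indefinitely. */
static void
sweep_idle_or_conns(or_conn_t **conns, size_t n_conns, time_t now)
{
  size_t i;
  for (i = 0; i < n_conns; ++i) {
    or_conn_t *conn = conns[i];
    if (conn->n_circuits == 0 &&
        now - conn->last_activity >= IDLE_OR_CONN_TIMEOUT)
      conn->marked_for_close = 1;
  }
}

Whatever the right timeout turns out to be, a connection whose only
circuit was a one-shot begindir fetch has no reason to stay open for
hours.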
 [...]
>        As many relay operators are forced to turn off their relay
>        because they don't have the resources to keep their relay up
>        anymore, the problem only gets worse for the other operators,
>        who need to deal with an unchanged number of clients.

If this is what's happening, then our resource-limiting code isn't
working as well as it should be and we need to fix it.  In the past,
we've reached equilibrium in the face of large surges of clients by


>        One last concern is that we're seeing scalability problems
>        with our current design. Lots of Chinese users are back on the
>        network, as many relays have been unblocked by the GFW. Some
>        relays are seeing more than 40k active connections, while
>        being far away from reaching their bw limits.

This is fishy; we should also try to rule out broken clients and/or
DoS attempts.
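
One cheap diagnostic would be to histogram open connections by source
address and see whether those 40k connections come from many distinct
clients or from a handful of heavy ones. Illustrative only; the
threshold and names below are invented:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdio.h>

#define SUSPICIOUS_CONNS 32  /* arbitrary cutoff for one source address */

struct addr_count {
  struct in_addr addr;  /* source address */
  unsigned n_conns;     /* open connections from that address */
};

/* Print every source address holding a suspicious number of
 * connections; many hits here would point at broken clients or a DoS
 * rather than organic load. */
static void
report_heavy_sources(const struct addr_count *tab, size_t n)
{
  size_t i;
  for (i = 0; i < n; ++i) {
    if (tab[i].n_conns >= SUSPICIOUS_CONNS)
      printf("%s: %u open connections\n",
             inet_ntoa(tab[i].addr), tab[i].n_conns);
  }
}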

One more thing I saw suggested in IRC: it's possible that the new
bandwidth authority code is causing a problem here, since it uses a
node's bandwidth capacity to determine what fraction of the network's
connections it can handle.  But bandwidth isn't the only resource:
just because a router can handle twice as many bytes as it's pushing,
doesn't mean we should send it twice as many connections as it's
getting.  I'm not sure of the right lesson here.  Perhaps we need to
limit the amount that we up-rate any router's bandwidth, or perhaps we
need to find a way for a router to signal that it's running into load
troubles and get downrated.
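
If we go the "limit the up-rating" route, the mechanism could be as
simple as clipping the measured value against a multiple of the
relay's self-reported capacity. A sketch, where the 2x factor is a
placeholder and not a number anyone has tested:

/* Clip the measured bandwidth so a relay is never weighted at more
 * than max_uprate times what it claims it can handle. */
static double
clipped_bw_weight(double self_reported_bw, double measured_bw)
{
  const double max_uprate = 2.0;  /* placeholder; needs measurement */
  const double cap = self_reported_bw * max_uprate;
  return measured_bw > cap ? cap : measured_bw;
}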

-- 
Nick