
Re: Analysis of the problems many relay operators are currently facing



On Wed, Apr 21, 2010 at 02:44:03PM -0400, Nick Mathewson wrote:
> > Another class of problems exists which affects some/many relays: The relay
> > attracts a huge amount of connections, affecting stability of network
> > equipment and operating system. These problems might occur:
> >
> >    The Tor process runs out of memory, because it has too many open
> >    connections. Tor is then killed by the OS's OOM-killer.
> 
> Where is the memory going, exactly?  Is it all in buffers, or is it
> somewhere else?  Time to profile memory usage yet again.

My intuition is that a huge amount of it was going to openssl buffers.
When you have 20k TLS connections open, at roughly 37k of buffer per
connection, that's 37k*20k = 740 megs of RAM just sitting idle in openssl.
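
As a quick sanity check on that arithmetic, here is a minimal Python
sketch. The 37k-per-connection figure is the rough estimate above; the
function name and the choice of decimal units (1k = 1000 bytes) are
assumptions made only for illustration.

```python
def idle_tls_buffer_bytes(connections, per_connection_bytes=37_000):
    """Estimate memory held by idle TLS buffers across all open connections."""
    return connections * per_connection_bytes

# 20k open TLS connections at about 37k of buffer each.
total = idle_tls_buffer_bytes(20_000)
print(f"{total / 1_000_000:.0f} MB")  # 37k * 20k = 740 MB (decimal units)
```

Plugging in a relay's own connection count gives a rough floor on the
memory it needs before any traffic is buffered at all.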

We got swisstorexit to run his non-exit relay under valgrind for a few
hours, and it didn't find any outright leaks.

> If this is mostly Linux (and the mention of an OOM-killer suggests
> that it is),

I think it's mostly Linux only because most of our fast relays are Linux.

> we might want to see if using the new-ish standalone
> Linux jemalloc helps people.   (We keep hitting fragmentation issues
> with glibc malloc.)  To try it out, build the code from
> http://www.canonware.com/jemalloc/ , stick the resulting .so file in
> your LD_PRELOAD before you start Tor, and see if the memory usage is
> much better.

Somebody should try this and see how it goes. You might suggest it to
the folks on tor-relays -- I bet not many of them read this list.

> >    Tor exhausts the ulimit -n that is affecting it, meaning random things
> >    like opening logfiles, establishing new connections or gathering more
> >    entropy fail, often creating many warnings in Tor's logfile. In some
> >    cases it appears that Tor is spinning until a file descriptor becomes
> >    available, burning all cpu.
> 
> Some of our earlier discussion related to bug 925 is probably relevant
> here.  We should solve that one so we can stop spinning when we're out
> of fds.  What's more, when we run out of fds, it generally means that
> the estimated number of connections we could handle was too high; if
> we revise our estimate downwards, we can go back to keeping a few fds
> in reserve for disk IO.
> 
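
The "keep a few fds in reserve" idea can be sketched roughly as follows,
in Python rather than Tor's C. The function names and the reserve of 32
descriptors are hypothetical and are not what Tor actually uses;
`resource.getrlimit` is the standard library call that reads the same
limit `ulimit -n` reports (Unix only).

```python
import resource

def budget_from_limit(soft_limit, reserve_fds=32):
    """Sockets we may open while leaving reserve_fds free for logs and disk I/O."""
    return max(soft_limit - reserve_fds, 0)

def connection_budget(reserve_fds=32):
    """Read the process's soft RLIMIT_NOFILE and derive a connection budget."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return None  # no descriptor limit, so no budget to enforce
    return budget_from_limit(soft, reserve_fds)

print(connection_budget())
```

If the process still runs out of descriptors under such a budget, that
suggests the estimate was too high and the reserve should grow.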
> >    Tor makes a home router/DSL modem/kernel lock up, because it cannot
> >    handle the load. Symptoms include that internet access is completely
> >    nonfunctional even after the relay is stopped, or that it is extremely
> >    slow. These symptoms might last until the relevant piece of equipment is
> >    restarted.
> 
> Do we know what kind of load is needed for this?  Too many
> simultaneous connections, too many simultaneous incoming connections,
> too much bw, or what?

I think the main cause is typically too many open connections. There's
a state table inside the router, and once the table fills up, things
start to fail.

Too many simultaneous incoming connections could be another problem,
but I think it's rarer. It's certainly rarer to find good evidence of it.

> >    All these share the same underlying problem: Tor is getting more
> >    connections than it can handle. One way to help would be to make sure
> >    unused connections are closed more quickly, so that relays don't need
> >    to maintain as many active connections concurrently as they need to do
> >    now. A Tor patch that logs what state current connections have [0] shows
> >    that on some systems, around 10% of all connections were used for a
> >    begindir operation before, but now don't have a circuit attached anymore.
> 
> I think this number could be a bit on the low side: the numbers I'm
> seeing people report on IRC seem higher than 10%.  Also, the patch
> counts the number of connections that have only a single circuit: if
> this circuit is the same circuit that was used for the begindir (and I
> think it typically is), then those connections too are lying unused.

I think I've solved this particular variant of the issue:
http://archives.seul.org/tor/relays/Apr-2010/msg00066.html
http://archives.seul.org/tor/relays/Apr-2010/msg00073.html
http://archives.seul.org/tor/relays/Apr-2010/msg00078.html

The next question is how much of a backport we should try to squeeze
into 0.2.1.x, on the theory that if stable Tor can't run a relay, our
network is going to continue to decay:
http://metrics.torproject.org/graphs/exit/exit-30d.png
http://metrics.torproject.org/torperf-graphs.html

Oh, and the other next question is what actual cutoff parameters to use
in the relays (and in the clients, when we get around to writing the
patch for them).

>  [...]
> >        As many relay operators are forced to turn off their relay because
> >        they don't have the resources to keep their relay up anymore, the
> >        problem only gets worse for the other operators, who need to deal
> >        with an unchanged number of clients.
> 
> If this is what's happening, then our resource-limiting code isn't
> working as well as it should be and we need to fix it.  In the past,
> we've reached equilibrium in the face of large surges of clients by

...by?

> >        One last concern is that we're seeing scalability problems with our
> >        current design. Lots of Chinese users are back on the network, as
> >        many relays have been unblocked by the GFW. Some relays are seeing
> >        more than 40k active connections, while being far away from
> >        reaching their bw limits.
> 
> This is fishy; we should also try to rule out broken clients and/or
> DoS attempts.

I think they're real clients.

> One more thing I saw suggested in IRC: it's possible that the new
> bandwidth authority code is causing a problem here, since it uses a
> node's bandwidth capacity to determine what fraction of the network's
> connections it can handle.  But bandwidth isn't the only resource:
> just because a router can handle twice as many bytes as it's pushing,
> doesn't mean we should send it twice as many connections as it's
> getting.  I'm not sure of the right lesson here.

Yep.

>  Perhaps we need to
> limit the amount that we up-rate any router's bandwidth, or perhaps we
> need to find a way for a router to signal that it's running into load
> troubles and get downrated.

Mike pushed back in the tor-relays thread about my suggestion of
capping the amount of attention we give to any router. Apparently that
approach would significantly reduce the performance gain we can get.

It would be great to reduce the weightings for relays that are failing --
but it's hard to remotely detect "about to fail", and "actually failed"
usually comes in the form of a down relay.

--Roger