
Analysis of the problems many relay operators are currently facing



I'll try to summarize here what I've learned over the past weeks about the
problems we are currently having with the Tor network as a whole and the
issues that individual relay operators face, and describe the causes we have
identified (some of which have been addressed already). The information is
scattered across #tor-dev on OFTC, bug reports and mailing lists, and no
overview exists, so it seems worthwhile to collect what we know.

For the past months, quite a few relays have been sporadically dropping from
the consensus, either until they published their next descriptor or for
longer periods of time. For this we have identified the following problems:

Some vendors have backported openssl features to older versions. This
    rendered the affected relays either completely useless, because they
    could not establish connections and so would not even bootstrap, or
    useless as relays to people using certain openssl versions. As a result,
    the directory authorities couldn't establish connections to them and
    marked them offline.

We believe this is now fixed as of 0.2.2.11-alpha. A fix for the stable
    series of Tor has not been released yet.


Authorities only downloaded descriptors for relays from V2 directory
    authorities if they didn't have them available themselves. As only two
    V2 auths remain, one of which probably disallows most relays from
    publishing descriptors, authorities ended up knowing about only a part
    of the network. Some relays were thus unreachable by the majority of
    dir authorities, meaning they dropped out of the consensus.

We believe this is now fixed as of 0.2.2.12-alpha. Not all authorities
    have upgraded yet.


Relays (and authorities) running 0.2.2.11-alpha crash 24 hours after start
    if they have the statistic gathering functionality enabled.

We believe this is now fixed as of 0.2.2.12-alpha. A workaround is to
    disable statistic gathering.


Another issue exists that has not been identified yet, where a relay is
    only sporadically reachable from outside, even though there is no load.
    This issue is rare and has not been reproduced reliably.

Another class of problems affects some/many relays: the relay attracts a
huge number of connections, affecting the stability of network equipment and
the operating system. These problems might occur:

    The Tor process runs out of memory, because it has too many open
    connections. Tor is then killed by the OS's OOM-killer.

    Tor exhausts the ulimit -n that applies to it, so that unrelated
    operations such as opening logfiles, establishing new connections or
    gathering more entropy fail, often creating many warnings in Tor's
    logfile. In some cases it appears that Tor spins until a file
    descriptor becomes available, burning all CPU.

    Tor makes a home router/DSL modem/kernel lock up, because that piece of
    equipment cannot handle the load. Symptoms include internet access
    being completely nonfunctional even after the relay is stopped, or
    being extremely slow. These symptoms might last until the relevant
    piece of equipment is restarted.
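
To judge how close a process is to the file descriptor exhaustion described
above, something like the following sketch could be used. It is not part of
Tor; the helper names fd_headroom and count_open_fds are made up for this
illustration, and the /proc lookup only works on Linux. Run directly, it
reports the limit and usage of the Python process itself, so it only
approximates Tor's situation when started as the same user with the same
ulimit settings.

```python
import os
import resource


def fd_headroom(open_fds, soft_limit):
    """Return the unused fraction of the descriptor limit (0.0 to 1.0)."""
    if soft_limit <= 0:
        raise ValueError("soft_limit must be positive")
    return max(0.0, 1.0 - open_fds / soft_limit)


def count_open_fds(pid="self"):
    """Count open descriptors via /proc (Linux only); None if unavailable."""
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


if __name__ == "__main__":
    # RLIMIT_NOFILE is the limit that `ulimit -n` shows and sets.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    used = count_open_fds()
    print(f"soft limit {soft}, hard limit {hard}, open {used}")
    if used is not None and soft != resource.RLIM_INFINITY:
        print(f"headroom: {fd_headroom(used, soft):.1%}")
```

Passing a pid to count_open_fds inspects another process instead, provided
the caller has permission to read that process's /proc entry.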


    All these share the same underlying problem: Tor is getting more
    connections than it can handle. One way to help would be to make sure
    unused connections are closed more quickly, so that relays don't need
    to maintain as many concurrent connections as they do now. A Tor patch
    that logs the state of current connections [0] shows that on some
    systems, around 10% of all connections had been used for a begindir
    operation but no longer have a circuit attached. Generally, the
    fraction of connections used exclusively for begindir operations
    appears to be high, so it might be worthwhile to close the circuits on
    them more quickly and not keep them around for possible later
    cannibalization.
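
The kind of tally discussed above can be sketched as follows. This is not
the patch from [0] and does not use Tor's internal data structures; the Conn
record, its fields and the state names are hypothetical, chosen only to show
how the share of connections left idle after a begindir request would be
computed from a per-connection log.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Conn:
    # Hypothetical per-connection record, not Tor's connection struct.
    n_circuits: int          # circuits currently attached
    used_for_begindir: bool  # has carried a begindir request before


def classify(conn):
    """Assign a connection to one of three illustrative states."""
    if conn.n_circuits > 0:
        return "active"
    if conn.used_for_begindir:
        return "idle_after_begindir"
    return "idle_other"


def state_fractions(conns):
    """Return the fraction of connections in each state."""
    total = len(conns)
    if total == 0:
        return {}
    counts = Counter(classify(c) for c in conns)
    return {state: n / total for state, n in counts.items()}


if __name__ == "__main__":
    # Made-up sample: 7 active, 1 idle after begindir, 2 idle otherwise.
    sample = [Conn(1, False)] * 7 + [Conn(0, True)] + [Conn(0, False)] * 2
    for state, share in sorted(state_fractions(sample).items()):
        print(f"{state}: {share:.0%}")
```

Connections in the idle_after_begindir state are the candidates for earlier
closing.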
	
	
    Another theory is that the fastest relays (by consensus weights) are
    used by a large proportion of users. This means that almost every Tor
    user will make a connection to those few relays, massively increasing
    the number of connections each of them has to handle at the same time.
    Some evidence supporting this: even after the bw authorities voted
    blutmagie's bw weight down a lot, following the operator lowering the
    BandwidthRate considerably, the relay was still seeing many concurrent
    connections, while the number of new connections/s was dropping a lot.
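
A rough way to see why a heavily weighted relay ends up talking to almost
every client is the simplified model below. It assumes each relay pick is an
independent draw proportional to consensus weight, which ignores how Tor
actually treats different path positions and entry guards; the function name
and the example numbers are made up for illustration.

```python
def p_connects_to_relay(weight_fraction, picks):
    """Chance that at least one of `picks` independent weighted draws
    selects a relay holding `weight_fraction` of the total weight."""
    if not 0.0 <= weight_fraction <= 1.0:
        raise ValueError("weight_fraction must be between 0 and 1")
    if picks < 0:
        raise ValueError("picks must not be negative")
    return 1.0 - (1.0 - weight_fraction) ** picks


if __name__ == "__main__":
    # Hypothetical relay with 1% of the total weight.
    for picks in (10, 100, 500):
        p = p_connects_to_relay(0.01, picks)
        print(f"{picks} picks: {p:.1%}")
```

Under these assumptions, a relay with 1% of the weight is contacted by
roughly 63% of the clients that make 100 picks, and by nearly all clients
that make several hundred.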
	
	
    As many relay operators are forced to turn off their relay because they
    don't have the resources to keep their relay up anymore, the problem
    only gets worse for the other operators, who need to deal with an
    unchanged number of clients.
	
	
    One last concern is that we're seeing scalability problems with our
    current design. Lots of Chinese users are back on the network, as many
    relays have been unblocked by the GFW. Some relays are seeing more than
    40k active connections while being far from reaching their bw limits.
    If usage continues to grow, no clear bug can be identified that causes
    the massive number of connections, and it can be determined that this
    is just Tor's popularity growing, alternative designs that don't
    require tcp connections might become a necessity very quickly.
	
I hope I didn't forget any problem/solution/analysis here. If I did, please
add it so we can all track this down as quickly as possible.

Thanks
Sebastian
	
[0] http://archives.seul.org/or/relays/Apr-2010/msg00066.html