
Analysis of the problems many relay operators are currently facing

To: or-dev@xxxxxxxxxxxxx
Subject: Analysis of the problems many relay operators are currently facing
From: Sebastian Hahn <mail@xxxxxxxxxxxxxxxxx>
Date: Wed, 21 Apr 2010 18:04:24 +0200
Delivered-to: archiver@xxxxxxxx
Delivered-to: or-dev-outgoing@xxxxxxxx
Delivered-to: or-dev@xxxxxxxx
Delivery-date: Wed, 21 Apr 2010 12:14:01 -0400
Reply-to: or-dev@xxxxxxxxxxxxx
Sender: owner-or-dev@xxxxxxxxxxxxx

I'll try to summarize here what I've learned in the past weeks about the
problems we are currently having with the Tor network as a whole, and the
issues that individual relay operators have, as well as describing the issues
we have identified (some of which have been addressed already). As the
information comes from #tor-dev on OFTC, bug reports and mailing lists, but no
overview exists, it seems worthwhile to collect what we know.

For the past months, quite a few relays have been sporadically dropping from
the consensus; either until they published the next descriptor again or for
longer periods of time. For this we have identified the following problems:

    Some vendors have backported OpenSSL features to older versions,
    rendering the affected relays either completely useless, as they are
    unable to establish connections and so won't even bootstrap; or useless
    as relays to people using certain OpenSSL versions. Thus, the directory
    authorities couldn't establish connections to them, meaning they marked
    them offline.
    We believe this is now fixed as of 0.2.2.11-alpha. A fix for the stable
    series of Tor has not been released yet.

    Authorities only downloaded descriptors for relays from V2 directory
    authorities if they didn't have them available themselves. As only two
    V2 auths remain, one of which probably disallows most relays from
    publishing descriptors, this led to authorities knowing only about a
    part of the network. Some relays were thus unreachable by the majority
    of dir authorities, meaning they dropped out of the consensus.
    We believe this is now fixed as of 0.2.2.12-alpha. Not all authorities
    have upgraded yet.

    Relays (and authorities) running 0.2.2.11-alpha crash 24 hours after
    start if they have the statistic gathering functionality enabled.
    We believe this is now fixed as of 0.2.2.12-alpha. A workaround is to
    disable statistic gathering.

    Another issue exists whose cause has not been identified yet, where a
    relay is only sporadically reachable from outside, even though there is
    no load. This issue is rare and has not been reproduced reliably.

Another class of problems exists which affects some/many relays: The relay
attracts a huge number of connections, affecting the stability of network
equipment and the operating system. These problems might occur:

    The Tor process runs out of memory, because it has too many open
    connections. Tor is then killed by the OS's OOM-killer.

    Tor exhausts the ulimit -n that is affecting it, meaning random things
    like opening logfiles, establishing new connections or gathering more
    entropy fail, often creating many warnings in Tor's logfile. In some
    cases it appears that Tor is spinning until a file descriptor becomes
    available, burning all CPU.

    Tor makes a home router/DSL modem/kernel lock up, because it cannot
    handle the load. Symptoms include that internet access is completely
    nonfunctional even after the relay is stopped, or that it is extremely
    slow. These symptoms might last until the relevant piece of equipment is
    restarted.
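
    As a rough illustration of the ulimit -n problem above, here is a
    minimal Python sketch that reports how close a process is to its file
    descriptor limit. It is only a sketch under assumptions: the helper
    names fd_headroom and count_open_fds are made up for this example, the
    /proc lookup is Linux-only, and when run as-is it inspects the Python
    process itself, not a running Tor (pass the Tor pid to count_open_fds
    and compare against the limit Tor was started with to get something
    meaningful).

```python
import os
import resource


def fd_headroom(open_fds: int, soft_limit: int) -> float:
    """Return the unused fraction of the descriptor limit (0.0 to 1.0)."""
    if soft_limit <= 0:
        raise ValueError("soft_limit must be positive")
    return max(0.0, 1.0 - open_fds / soft_limit)


def count_open_fds(pid: int):
    """Count open descriptors of a process via /proc (Linux only).

    Returns None when /proc is unavailable or the process is not readable.
    """
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


if __name__ == "__main__":
    # RLIMIT_NOFILE is the limit that `ulimit -n` shows and sets.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    used = count_open_fds(os.getpid())
    print(f"soft limit: {soft}, hard limit: {hard}, open now: {used}")
    if used is not None and soft != resource.RLIM_INFINITY:
        print(f"headroom: {fd_headroom(used, soft):.1%}")
```

    A relay whose headroom stays near zero would be a candidate for the
    failures described above.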


    All these share the same underlying problem: Tor is getting more
    connections than it can handle. One way to help would be to make sure
    unused connections are closed more quickly, so that relays don't need
    to maintain as many active connections concurrently as they need to do
    now. A Tor patch that logs what state current connections have [0] shows
    that on some systems, around 10% of all connections were used for a
    begindir operation before, but now don't have a circuit attached anymore.
    Generally, the fraction of connections used exclusively for begindir
    operations appears to be high, so it might be worthwhile to close the
    circuits on them more quickly and not keep them around for possible later
    cannibalization.
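
    To make the idea concrete, here is a small Python sketch of the kind of
    tally described above, plus a simple idle sweep. Everything in it is an
    assumption for illustration: the Conn record, its fields and the two
    helper functions are hypothetical and do not correspond to Tor's
    internal data structures or to the patch in [0].

```python
from dataclasses import dataclass


@dataclass
class Conn:
    """Hypothetical record for one connection; not a Tor data structure."""
    conn_id: int
    n_circuits: int          # circuits currently attached
    used_for_begindir: bool  # carried a begindir request at some point
    idle_seconds: float      # time since the connection last saw traffic


def orphaned_begindir_fraction(conns):
    """Fraction of connections that served begindir and now have no circuit."""
    if not conns:
        return 0.0
    orphaned = sum(1 for c in conns
                   if c.used_for_begindir and c.n_circuits == 0)
    return orphaned / len(conns)


def close_candidates(conns, max_idle_seconds):
    """Ids of circuit-less connections that have idled past the threshold."""
    return [c.conn_id for c in conns
            if c.n_circuits == 0 and c.idle_seconds > max_idle_seconds]
```

    Lowering max_idle_seconds in such a sweep trades fewer concurrent
    connections against more reconnects; the right value is an open
    question here.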

    Another theory is that the fastest relays (by consensus weights) are used
    by a large proportion of users. This means that almost every Tor user
    will make a connection to those few relays, massively increasing the
    number of connections the relay has to handle at the same time. Some
    evidence supporting this is that even after the bw authorities voted
    blutmagie's bw weight down a lot after the operator lowered the
    BandwidthRate considerably, it was still seeing many concurrent
    connections, while the number of new connections/s was dropping a lot.
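
    A back-of-the-envelope Python sketch of this theory: if relays are
    picked with probability proportional to their weight, the chance that a
    client touches a given relay at least once grows quickly with the
    number of picks. This is a deliberately simplified model, not Tor's
    path selection: it assumes independent picks with replacement and
    ignores guards, exit policies and other constraints. The function names
    and any numbers fed into it are illustrative assumptions.

```python
def hit_probability(weight_fraction: float, picks: int) -> float:
    """Chance that at least one of `picks` independent weighted selections
    lands on a relay holding `weight_fraction` of the total weight."""
    if not 0.0 <= weight_fraction <= 1.0:
        raise ValueError("weight_fraction must be between 0 and 1")
    if picks < 0:
        raise ValueError("picks must not be negative")
    return 1.0 - (1.0 - weight_fraction) ** picks


def expected_clients(weights, relay, n_clients, picks):
    """Expected number of clients that touch `relay` at least once.

    `weights` maps relay names to (hypothetical) consensus weights.
    """
    total = sum(weights.values())
    return n_clients * hit_probability(weights[relay] / total, picks)
```

    Under this model, a relay's share of clients depends on how many picks
    each client makes over time, not just on the relay's current weight.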
	
	
    As many relay operators are forced to turn off their relay because they
    don't have the resources to keep their relay up anymore, the problem only
    gets worse for the other operators, who need to deal with an unchanged
    number of clients.

    One last concern is that we're seeing scalability problems with our
    current design. Lots of Chinese users are back on the network, as many
    relays have been unblocked by the GFW. Some relays are seeing more than
    40k active connections, while being far away from reaching their bw
    limits. If usage continues to grow, and no clear bug can be identified
    that causes the massive number of connections, and it can be determined
    that this is just Tor's popularity growing, alternative designs that
    don't require TCP connections might become a necessity very quickly.

I hope I didn't forget any problem/solution/analysis here; if so, please add
it so we can all track this down as quickly as possible.

Thanks
Sebastian
	
[0] http://archives.seul.org/or/relays/Apr-2010/msg00066.html

Follow-Ups:
- Re: Analysis of the problems many relay operators are currently facing
  - From: Nick Mathewson
