Re: How to Run High Capacity Tor Relays
On Tue, 24 Aug 2010 08:27:53 -0700 Mike Perry <mikeperry@xxxxxxxxxx>
wrote:
>After talking to Moritz and Olaf privately and asking them about their
>nodes, and after running some experiments with some high capacity
>relays, I've begun to realize that running a fast Tor relay is a
>pretty black art, with a lot of ad-hoc practice. Only a few people
>know how to do it, and if you just use Linux and Tor out of the box,
>your relay will likely underperform on 100Mbit links and above.
>
>In the interest of trying to help grow and distribute the network, my
>ultimate plan is to try to collect all of this lore, use Science to
>divine out what actually matters, and then write a more succinct blog
>post about it.
>
>However, that is a lot of work. It's also not totally necessary to do
>all this work, when you can get a pretty good setup with a rough
>superset of all of the ad-hoc voodoo. This post is thus about that
>voodoo.
>
>Hopefully others will spring forth from the darkness to dump their own
>voodoo in this thread, as I suspect there is one hell of a lot of it
>out there, some (much?) of which I don't yet know. Likewise, if any
>blasphemous heretic wishes to apply Science to this voodoo, they
>should yell out, "Stand Back, I'm Doing Science!" (at home please, not
>on this list) and run some experiments to try to eliminate options
>that are useless to Tor performance. Or cite academic research papers.
>(But that's not Science, that's computerscience - which is a religion
>like voodoo, but with cathedrals).
I think it might also be worthwhile to compile similar collections
of notes for each of the BSDs, OS X, and maybe Solaris, too. Although
LINUX has the largest following among the high-throughput tor nodes, it
does not have a 100% monopoly on them. "desync" on FreeBSD (amd64),
for example, is often listed in the top 20 nodes (sorted by throughput
capacities), and "SEC" on Solaris (i86pc) is usually in the top 20 or 30.
Two other FreeBSD nodes, "doom" and "dannenberg", are also very fast and
usually in the top 30 or 40.
>
>Anyway, on with the draft:
>
>
>== Machine Specs ==
>
>First, you want to run your OS in x64 mode because openssl should do
>crypto faster in 64bit.
>
>Tor is currently not fully multithreaded, and tends not to benefit
>beyond 2 cores per process. Even then, the benefit is still marginal
>beyond just 1 core. 64bit Tor nodes require about one 2Ghz Xeon/Core2
>core per 100Mbit of capacity.
However, better utilization of multicored systems can be realized
by running one tor instance per core. See, for example, Olaf Selke's
blutmagie{,2,3,4} nodes, which run on a Core 2 Quad CPU.
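
As a rough illustration of the one-instance-per-core idea, here is a
minimal Python sketch that writes out one torrc per core. The nicknames,
port numbers, and directory layout are hypothetical placeholders of mine,
not Olaf's actual configuration; only the option names (Nickname, ORPort,
DataDirectory, PidFile, SocksPort, NumCPUs) are real torrc options.

```python
def torrc_for_instance(index, base_or_port=9001,
                       base_dir="/var/lib/tor-instances"):
    """Return torrc text for one relay instance (index starts at 0).

    The nickname pattern, ports, and paths are illustrative assumptions.
    """
    name = f"examplerelay{index + 1}"
    lines = [
        f"Nickname {name}",
        f"ORPort {base_or_port + index}",      # each instance needs its own port
        f"DataDirectory {base_dir}/{name}",    # and its own data directory
        f"PidFile {base_dir}/{name}/tor.pid",
        "SocksPort 0",                         # relay only, no local client port
        "NumCPUs 1",                           # one instance per core
    ]
    return "\n".join(lines) + "\n"


def torrcs_for_cores(core_count):
    """Build one torrc text per core."""
    return [torrc_for_instance(i) for i in range(core_count)]


if __name__ == "__main__":
    for text in torrcs_for_cores(4):
        print(text)
```

Each generated file would then be passed to its own tor process with
"tor -f <file>". Operators running several relays on one machine would
normally also declare them in MyFamily, which this sketch leaves out.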
>
>Thus, to fill an 800Mbit link, you need at least a dual socket, quad
>core cpu config. You may be able to squeeze a full gigabit out of one
>of these machines. As far as I know, no one has ever done this with
>Tor, on any one machine.
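
The sizing arithmetic quoted above can be sketched in a few lines of
Python. It only encodes the rule of thumb of one 2GHz core per 100Mbit
from the quoted draft; the function name and default are my own, and real
throughput per core will vary with the CPU and the traffic mix.

```python
import math

# Rule of thumb from the quoted draft: about one 2GHz Xeon/Core2 core
# per 100Mbit of relay capacity on a 64-bit system.
MBIT_PER_CORE = 100


def cores_needed(link_mbit, mbit_per_core=MBIT_PER_CORE):
    """Estimate how many cores are needed to fill a link of link_mbit."""
    return math.ceil(link_mbit / mbit_per_core)


if __name__ == "__main__":
    for link in (100, 800, 1000):
        print(f"{link} Mbit -> {cores_needed(link)} cores")
```

For 800Mbit this gives 8 cores, which matches the dual-socket quad-core
configuration mentioned in the draft.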
>
>The i7's also just came out in this form factor, and can do
>hyperthreading (previous models may list 'ht' in cpuinfo, but actually
>don't support it). This should give you a decent bonus if you set
That's not true. Although many non-HTT-capable CPUs do as you say,
the P4 Prescotts and some Xeons of that time were the first HTT-enabled
CPUs. The HTT-enabled Core i? CPUs, however, have a set of pipelines
better configured to get the most from HTT operations than the older
CPUs did.
>NumCPUs to 2, since ht tends to work better with pure integer math
My node runs on a 3.4 GHz Prescott, and I set NumCPUs 2. That chip
only has one FP pipe, so really good overlap happens when one thread is
doing mainly FP instructions, while another does mostly other things.
Running two FPU-bound threads simultaneously gives the same performance
as a non-hyperthreading CPU. My understanding is that the same is true
for the Core i? CPUs, but that the other pipes are where the gains over
the old version of HTT come into play, e.g., an extra integer arithmetic
pipe allows both threads to do the same kind of integer operations in
parallel, whereas the old version required, say, slots in two pipes in
the same clock cycle for an integer add, but only three such pipes were
present, thus making it only possible for one CPU thread to proceed on
the cycle and forcing a hardware delay on the other until the next cycle.
The newer processors in an example like this would have four pipes, so
each logical CPU could have two in the same cycle. Also, IIRC, there
are add/subtract pipes, multiply pipes, and divide pipes on the integer
side of things, but a single floating point pipe handles all floating
point instructions.
>(like crypto). We have not benchmarked this config yet though, but I
>suspect it should fill a gigabit link fairly easily, possibly
>approaching 2Gbit.
>
>At full capacity, exit node Tor processes running at this rate consume
>about 500M of ram. You want to ensure your ram speed is sufficient,
>but most newish hardware is good. Using this chart:
>https://secure.wikimedia.org/wikipedia/en/wiki/List_of_device_bandwidths#Memory_Interconnect.2FRAM_buses
>you can do the math and see that with a dozen memcpys in each
>direction, you come out needing DDR2 to be able to push 1Gbit full
>duplex.
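
The memory arithmetic quoted above can be written out as a short Python
sketch. The dozen copies per direction comes from the draft; counting
each memcpy as two passes over memory (one read, one write) is my own
assumption, and the function and parameter names are hypothetical. The
result is a rough upper bound to compare against the chart, not a
measurement.

```python
def memory_traffic_mbytes(link_mbit, copies_per_direction=12,
                          full_duplex=True, touches_per_copy=2):
    """Estimate memory traffic in MB/s caused by copying relay data.

    touches_per_copy=2 assumes every memcpy reads the data once and
    writes it once; set it to 1 to count each copy as a single pass.
    """
    mbytes_per_direction = link_mbit / 8          # Mbit/s -> MB/s
    directions = 2 if full_duplex else 1
    return (mbytes_per_direction * directions
            * copies_per_direction * touches_per_copy)


if __name__ == "__main__":
    print(memory_traffic_mbytes(1000))                      # read + write
    print(memory_traffic_mbytes(1000, touches_per_copy=1))  # single pass
```

Under these assumptions a gigabit link at full duplex generates about
6000 MB/s of memory traffic, or about 3000 MB/s if each copy is counted
as a single pass.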
Olaf's original problem, when trying to run all his traffic through
a single tor instance, was an overloaded core. He experimented a bit
with LINUX's hugepages facility in an attempt to relieve a big chunk
of that CPU load, but because tor for LINUX uses the OpenBSD malloc() and
free(), the libhugetlbfs method didn't work. When he tried building tor
without the OpenBSD version, the leak in the native LINUX library revived
the old problem of tor on LINUX, wherein the allocated memory continues
to grow slowly until the process dies for exceeding the limit on allocated
memory.
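
For anyone who wants to watch for that kind of slow growth, here is a
minimal Python sketch that reads the resident set size of a process from
/proc/<pid>/status on LINUX and estimates a growth rate from two or more
samples. The function names and the sampling approach are my own; this
only observes the symptom and says nothing about where a leak is.

```python
import re


def rss_kib(status_text):
    """Extract VmRSS (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current VmRSS of a running process (LINUX only)."""
    with open(f"/proc/{pid}/status") as handle:
        return rss_kib(handle.read())


def growth_kib_per_hour(samples):
    """Average growth between the first and last (seconds, kib) samples."""
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    return (rss1 - rss0) / ((t1 - t0) / 3600)
```

Sampling read_rss_kib() for the tor process every few minutes and feeding
the (time, value) pairs to growth_kib_per_hour() gives a crude trend that
can be compared between a build with the OpenBSD malloc() and one without.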
If someone were to take a close look at either a) the native LINUX
version of free() and malloc() to find and plug the leak or b) the OpenBSD
version and the libhugetlbfs code to make them work with each other, then
operators of fast tor nodes on LINUX systems could most likely get fairly
big performance enhancements by using the libhugetlbfs support.
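
Before experimenting with libhugetlbfs, it may help to check whether the
kernel has any huge pages reserved at all. The following Python sketch
parses the HugePages_Total, HugePages_Free, and Hugepagesize fields from
/proc/meminfo; the function name is my own, and the sketch does not touch
libhugetlbfs or tor in any way.

```python
def hugepage_info(meminfo_text):
    """Return the huge page counters found in /proc/meminfo text."""
    wanted = ("HugePages_Total", "HugePages_Free", "Hugepagesize")
    info = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # The first token after the colon is the number; Hugepagesize
            # is followed by a unit (kB), which is ignored here.
            info[key] = int(rest.split()[0])
    return info


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        print(hugepage_info(handle.read()))
```

If HugePages_Total is 0, no huge pages have been reserved, and any test
of the libhugetlbfs approach would have to start by reserving some.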
[no comment on the remainder of Mike's suggestions --SB]
Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet: bennett at cs.niu.edu *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good *
* objection to the introduction of that bane of all free governments *
* -- a standing army." *
* -- Gov. John Hancock, New York Journal, 28 January 1790 *
**********************************************************************