
Effect of Tor window size on performance



Dear All,

I'm new to the or-dev list, so please excuse me if I mention something that was already discussed. We did a study of Tor's flow-control mechanism, and what came out of it is that reducing the flow-control window size (CIRCWINDOW) would be beneficial not just for the individual client, but for the whole Tor network. You can find a tech report describing our results (still work in progress) at

http://disi.unitn.it/locigno/preprints/TR-DISI-08-041.pdf

It is also about other topics (namely measurements with datagram-based solutions, in our case using IPsec, and a theoretical evaluation of TCP-based solutions), but I think what is most interesting for immediate use in development is how much the flow-control window size can influence application-level delay in an overload situation. In my understanding, overload is the natural state of the Tor network, so it is important to make the system behave well in these conditions.

I've started to discuss this topic with Roger and he suggested that I extend the discussion to the or-dev list. See our conversation below and my original mail at the very end of this long mail.
Any feedback/discussion of the ideas is welcome,
Csaba


Roger Dingledine wrote:
On Sat, Nov 01, 2008 at 07:03:36PM +0100, Csaba Kiraly wrote:
One thing that could be of immediate use for Tor is the study of Tor's
flow-control mechanism, and its behavior with different window sizes. In
short, by reducing the window size from the current 1000 to e.g. 200 in
the Tor network, there would be a huge decrease in application level
round trip times. It would also mean much smaller buffers in OR nodes,
reducing their memory consumption significantly. At the same time,
throughput would remain the same. Figure 6 of the tech report shows
delay and throughput with some window sizes.

...

I think these are great ideas. You're right that the initial choice of
window size was just a random number that I picked that seemed large
enough to provide good throughput. And on thinking about it more,
I think you're right that a larger number actually just congests the
network more.

I've had an item on the 'volunteer' page related to window size for a
few years now: Item #8 on https://www.torproject.org/volunteer#Research

It seems that variable window sizes would be even more flexible than what
you propose, since then they could adapt to different levels of congestion.
In particular, I worry about:

Your throughput is limited to cell_size * CIRCWINDOW / D (without
considering the size of CIRCWINDOW_INCREMENT and delay of the SENDME
cell on the return path)
With typical data (cell_size = 0.5 kbytes, CIRCWINDOW = 1000, D = 1 sec),
the throughput is limited to 500 kbytes/sec.

What if D=5 because there's still congestion? If we then cut the circuit
window size by another factor of 5, we end up with an expected throughput
of 20 kbytes/sec, which is a bit on the 'uncomfortably low' end. (What if D=10?)
As a first attempt we are trying a fixed, reduced window size. If it becomes variable, what we introduce instead of the current end-to-end flow control is some kind of congestion control with an end-to-end feedback loop. To avoid strange interactions between the various levels of congestion control, I think this variability should be kept really slow (and obviously it should be done per circuit).
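
For reference, here is the arithmetic behind the numbers above as a quick Python sketch (nothing Tor-specific, just the cap formula cell_size * CIRCWINDOW / D):

    # With a fixed window, at most CIRCWINDOW cells can be in flight per
    # round-trip time D, so throughput is capped at cell_size*CIRCWINDOW/D.
    def throughput_cap(circwindow, d_sec, cell_size_kbytes=0.5):
        return cell_size_kbytes * circwindow / d_sec

    for circwindow, d in [(1000, 1), (200, 1), (200, 5), (200, 10)]:
        print(circwindow, d, throughput_cap(circwindow, d), "kbytes/sec")
    # 1000/1s -> 500, 200/1s -> 100, 200/5s -> 20, 200/10s -> 10 kbytes/sec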

Let's concentrate now on 5-7.
5, This is studied in Fig. 2, comparing TCP-in-TCP with normal TCP on
the same path (TCP-in-IP). With one stream, delays are almost the same.
TCP is good at adapting to the congestion and keeping its delay low.

This introduces the recent discussion about using UDP for transport. See
e.g. http://archives.seul.org/or/talk/Oct-2008/msg00095.html

One of the big reasons they concluded Tor is slow is that it puts
both loud and quiet streams together into one TCP connection. When TCP's
back-off kicks in, all the streams get punished, not just the loud ones.

Ultimately we'd like to resolve this whole TCP flow control thing by
switching to UDP transport between each relay, and then either have a
user-space TCP stack sitting at each relay to reconstruct the packets,
or have one sitting at each end of the circuit to reconstruct them.
Alas, there is still a big pile of open development problems there
-- as a first example, there appears to be no free-software stable
platform-independent secure user-space TCP stack out there.

In the long run, I'm also more of a fan of a datagram-based solution with only end-to-end congestion control. What we've been using in the tech report is IPsec tunnels, and at that point the kernel-level TCP stack comes for free. With this solution, performance is almost equivalent to native IP tunneling on the same path. Obviously IPsec and kernel TCP have all their well-known drawbacks, but that's off-topic here. We had our reasons to use IPsec, but what's important from Tor's perspective is that a UDP-based transport would produce almost the same performance, if OR nodes are strong enough and delay factors 5-7 are handled well (see the delay factors at the end of my original mail).

Another thing that comes to mind is that a better separation of signaling (directory lookup, circuit setup, etc.) and the data path would be beneficial. This would make it easier to experiment with DTLS or other transport methods on the overlay tunnel, and with different solutions end-to-end.
6, I've noticed this limit (MIN_TLS_FLUSHLEN) in the code. Its size is
about 32 cells, so it doesn't seem to be an important delay component.

Right. I added that in because in some cases we were reading in hundreds
of kilobytes without initiating any writes until all the reads were
done. I have absolutely no idea whether that was a good idea. It would
be great to be able to have some data about these things.

Speaking of which, Tor's round-robin when we run low on our bandwidth
buckets has to matter here. We try to round-robin 16KB at a time, but
have smaller allocations when our token bucket is getting low. Again,
I just made up some numbers, and it would be great to get real data there.
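
Just to fix ideas about the behaviour you describe, here is a toy model (in Python, not Tor's actual code; the 16 KB per round and the way the bucket is divided are only the rough picture I got from your description):

    # Toy model: each connection gets up to 16 KB of write quota per
    # round-robin pass, less when the shared token bucket is running low.
    def write_quota(bucket_tokens, n_conns, per_conn_max=16 * 1024):
        return min(per_conn_max, bucket_tokens // max(n_conns, 1))

    print(write_quota(1024 * 1024, 10))  # bucket full: 16384 bytes each
    print(write_quota(40 * 1024, 10))    # bucket low: only 4096 bytes each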

7, The bulk of the delay is here: hundreds if not thousands of
cells waiting in a FIFO queue (the OR's output buffer towards another
OR) before getting into the TLS tunnel.

So the next question is an implementation one. Right now the window sizes
are hard-coded at both ends. I've been meaning to extend the protocol
so sendme cells have a number in them, and so the initial window sizes
are specified in the 'create' and 'created' cells for circuits and the
'begin' and 'connected' cells for streams. But we haven't really fleshed
out the details of those designs, or how they could be phased in and still
handle clients and relays that don't use (or know about) the numbers.
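
(Purely to fix ideas on the sendme-with-a-number part: a hypothetical encoding could look roughly like the sketch below. The field layout and the fallback rule are made up here, not an actual or proposed Tor cell format.)

    import struct

    # Hypothetical: a SENDME payload optionally carrying the number of
    # cells it acknowledges; an empty payload means "legacy peer, use the
    # hard-coded increment". Layout and fallback rule are made up.
    def encode_sendme(acked_cells=None):
        return b"" if acked_cells is None else struct.pack("!H", acked_cells)

    def decode_sendme(payload, legacy_increment=100):
        if len(payload) < 2:
            return legacy_increment       # old-style sendme
        return struct.unpack("!H", payload[:2])[0]

    print(decode_sendme(encode_sendme()))    # 100 (legacy behaviour)
    print(decode_sendme(encode_sendme(50)))  # 50  (explicit value)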

So the big deployment question is: is it worth it to work on a design
for the above, and then either shrink the default window sizes or do
something smarter like variable window sizes, or should we just be
patient until a UDP-based solution is more within reach?

One answer is that if you were interested in working on a design proposal
and patch, it would be much more likely to get implemented. :)
We are doing verification on this. Our lab experiments (the ones in the tech report) show that there is a huge gain in delay on the user side, while throughput is untouched. Throughput is capped by a static window size, but I think the cap can be chosen better than it is now. There should also be a big gain in the memory consumption of ORs, although we haven't measured it yet. Since the Tor network is more or less overloaded all the time, memory usage should decrease almost linearly with the window size.

Currently we are verifying one-sided modification of the circuit, i.e. whether one side of the connection can reduce the window size on its own, without explicitly notifying the other side. From the code it seems to me that this will work, and if so, phasing in a smaller window size in a new release should not be a problem.
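
To illustrate why I believe it works, here is a toy model of the flow-control loop as I read it from the code (the structure and the zero-latency assumption are mine; only the constants mirror CIRCWINDOW=1000 and CIRCWINDOW_INCREMENT=100): the sender caps its own package window at 200, while the receiving edge keeps its unmodified behaviour of returning one SENDME per 100 delivered cells.

    SENDER_WINDOW = 200   # locally reduced window; other side not notified
    INCREMENT = 100       # receiver's unchanged SENDME granularity

    package_window = SENDER_WINDOW
    delivered_since_sendme = sent = 0

    for _ in range(5000):
        if package_window > 0:                   # sender: window permitting
            package_window -= 1
            sent += 1
            delivered_since_sendme += 1          # cell reaches the exit edge
        if delivered_since_sendme >= INCREMENT:  # receiver: per-100 SENDME
            delivered_since_sendme -= INCREMENT
            package_window += INCREMENT          # SENDME back at the sender

    print(sent)   # 5000: data keeps flowing; nothing on the receiving side
                  # depends on the sender's private window choice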

One negative effect which could be considered is the following: with
fewer cells in the buffers, there is less of a "mixing" effect. I know
that there is no real mixing in Tor; what I mean is that even if cells
are not re-ordered, the introduced delays make it quite hard to find the
correspondence between an entering cell and the corresponding cell
leaving later on. I don't think reducing this delay is a problem, but it
should at least be mentioned.

This doesn't bother me either. I think we're quite vulnerable already
to an adversary trying to match up flows.

In the report, we also analyze the effect of external TCP tunnels when
several streams are multiplexed over them. In this case, a TCP tunnel
becomes a real bottleneck.

Indeed. Is this the same statement that I made above, from Joel's
thesis?
Well, we've had different ways of analysing it, but some of the conclusions are similar, even if not the same :) We did not try analysing the effect of mixing different (loud and quiet) kinds of traffic. What we say is that all streams suffer even if they are of the same kind.

Finally, we also discuss how the combination of 1) datagram based
operation, 2) allowing for cell/packet drops in overlay nodes, and 3)
relying on only one end-to-end congestion control loop increases
performance. We do this through an example of an IPsec based
architecture, but the results partially apply to DTLS as well.

Sounds great.

Are you able to send this to the or-dev list too? I think it would be
more useful to have other people see the thread too.

Thanks!
--Roger

Csaba Kiraly wrote:
Dear Roger,

As Giuseppe said, the tech report is a work in progress; however, I think
there are already some parts that might be interesting for you and for
others involved in Tor development.

One thing that could be of immediate use for Tor is the study of Tor's
flow-control mechanism, and its behavior with different window sizes. In
short, by reducing the window size from the current 1000 to e.g. 200 in
the Tor network, there would be a huge decrease in application level
round trip times. It would also mean much smaller buffers in OR nodes,
reducing their memory consumption significantly. At the same time,
throughput would remain the same. Figure 6 of the tech report shows
delay and throughput with some window sizes.

I hope you don't mind if I give a more detailed Tor-specific explanation
of this aspect below, since it is not (yet) discussed in enough detail
in the tech report.

As you know, Tor breaks up the end-to-end TCP loop, and replaces it with
its own reliable delivery service. It implements reliable delivery by
using TCP on the overlay links, and by not dropping cells in OR nodes
(it can't drop cells in ORs, because there is no other end-to-end
reliability mechanism that would resend a cell).

If you also consider that bottlenecks are typically inside the Tor
network (and not after the exit node), you get the following phenomenon:
the source sends cells as fast as its CIRCWINDOW (or STREAMWINDOW)
permits. These cells all get buffered at the OR before the bottleneck
overlay link. When some (CIRCWINDOW_INCREMENT) data gets through the
bottleneck and arrives at the exit node, a SENDME cell is sent back. The
source pushes again, and as a result, the buffer before the bottleneck
link is almost always full. This introduces two problems: large delays
for end nodes and huge memory consumption for ORs.
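
A toy time-stepped model of this (in Python; the model and the rates are mine, only the constants mirror CIRCWINDOW and CIRCWINDOW_INCREMENT) shows the pre-bottleneck buffer settling close to a full window:

    CIRCWINDOW, INCREMENT = 1000, 100
    SOURCE_RATE, BOTTLENECK_RATE = 1000, 100   # cells per step (made-up rates)

    window, buffered, delivered = CIRCWINDOW, 0, 0
    for step in range(50):
        pushed = min(SOURCE_RATE, window)      # source sends as window permits
        window -= pushed
        buffered += pushed                     # queues before the bottleneck
        drained = min(BOTTLENECK_RATE, buffered)
        buffered -= drained
        delivered += drained
        while delivered >= INCREMENT:          # exit node returns SENDMEs
            delivered -= INCREMENT
            window += INCREMENT

    print(buffered)   # settles around CIRCWINDOW - INCREMENT = 900 cells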

Large delay:
The delay introduced by the FIFO buffer in the OR before the bottleneck
can be estimated as:
cell_size * CIRCWINDOW / bottleneck_bandwidth

With typical data (cell_size = 0.5 kbytes, CIRCWINDOW = 1000,
bottleneck_bandwidth = 50 kbytes/sec), you get a potential delay of 10
seconds. Fortunately, applications are typically not eager enough to push
that much data in, but still, the accumulated delay is huge.

Large memory consumption at OR nodes:
an OR node potentially has to buffer cell_size * CIRCWINDOW, i.e. 500
kbytes, for each circuit.
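
The two estimates above, as a small Python sketch (using the same typical values as in the text):

    # Queueing delay at the pre-bottleneck FIFO and worst-case per-circuit
    # buffer memory at an OR, for the typical values used above.
    def fifo_delay_sec(circwindow, bottleneck_kbytes_sec, cell_kbytes=0.5):
        return cell_kbytes * circwindow / bottleneck_kbytes_sec

    def circuit_buffer_kbytes(circwindow, cell_kbytes=0.5):
        return cell_kbytes * circwindow

    print(fifo_delay_sec(1000, 50))        # 10.0 seconds
    print(fifo_delay_sec(200, 50))         # 2.0 seconds with CIRCWINDOW=200
    print(circuit_buffer_kbytes(1000))     # 500.0 kbytes per circuit
    print(circuit_buffer_kbytes(200))      # 100.0 kbytes per circuit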

With a smaller window, both the delay and memory consumption would be
smaller. Thus the important question is: why not reduce the window? One
would say throughput, but the important thing to note is that a smaller
(but not too small) window would not reduce throughput. This can be seen
on the right side of fig.4: all the Tor curves are the same, independent
of the window size (and the number of streams).

The reason is simple: if there is a bottleneck on the path taken by the
circuit, the throughput depends on how the bandwidth of the TCP
connection of the bottlenecked OR link is shared among circuits.
Buffering more cells before this TCP connection does not make it more
effective.

Delay and throughput can be analyzed in more detail.

In Tor, delay comes from the following factors (I hope I've managed to
list them all):
0, circuit setup delay
1, delay outside the Tor network: TCP from application to SOCKS proxy
2, delay outside the Tor network: TCP from exit node to destination node
3, cell forming delay at the edge of the circuit (either at the OP or at
the exit OR)
4, IP level delay on each overlay link (due to physical distance, IP
hops, etc.)
5, delay introduced by TCP on each overlay link
6, delay introduced by TLS record forming on each overlay link
7, delay of FIFO cell buffers in ORs for each overlay link

With CIRCWINDOW=1000, it seems that factor 7 is responsible for a large
part of the overall delay:
0, is important at the beginning, but not during transmission
1, is almost negligible
2, is in the range of normal unprotected operation. There is not too
much to do to reduce it, except for selecting the exit node near the
destination, but this could compromise privacy.
3, you already have a mechanism (based on port numbers) to mitigate this
4, there is not too much to do with this, except changing the OR
selection strategy, and thus compromising privacy. There are numerous
proposals on that, as far as I remember.

These factors add up to some hundreds of milliseconds, depending on the
choice of the OR nodes, and on how they are connected to the Internet.

Let's concentrate now on 5-7.
5, This is studied in Fig. 2, comparing TCP-in-TCP with normal TCP on
the same path (TCP-in-IP). With one stream, delays are almost the same.
TCP is good at adapting to the congestion and keeping its delay low.

6, I've noticed this limit (MIN_TLS_FLUSHLEN) in the code. Its size is
about 32 cells, so it doesn't seem to be an important delay component.

7, The bulk of the delay is here: hundreds if not thousands of
cells waiting in a FIFO queue (the OR's output buffer towards another
OR) before getting into the TLS tunnel.

This suggests reducing CIRCWINDOW as much as possible. To understand
what 'possible' means, throughput should be analyzed.
Suppose for a minute that the Tor network is not overloaded, and there
is no overlay link bottleneck that would limit your throughput. In this
case factor 7 becomes negligible, while CIRCWINDOW limits your cells in
flight, and thus your throughput. Let D be the total delay due to
factors 1-6.

Your throughput is limited to cell_size * CIRCWINDOW / D (without
considering the size of CIRCWINDOW_INCREMENT and the delay of the SENDME
cell on the return path).
With typical data (cell_size = 0.5 kbytes, CIRCWINDOW = 1000, D = 1 sec),
the throughput is limited to 500 kbytes/sec.

Reducing CIRCWINDOW to 200, this goes down to 100 kbytes/sec (if D=1
sec), which is still well above what is needed. Moreover, the Tor
network IS more congested than that (at least that was my impression),
so reducing CIRCWINDOW would have absolutely no negative effect on
throughput.
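
Putting the two limits together (the bottleneck bandwidth from the congested case and the window cap from the unloaded case), the window only matters when it is the smaller of the two; a quick sketch:

    # Effective throughput is bounded both by the bottleneck link and by
    # the window cap; reducing CIRCWINDOW only hurts when the window cap
    # becomes the smaller of the two bounds.
    def effective_throughput(circwindow, d_sec, bottleneck, cell_kbytes=0.5):
        return min(bottleneck, cell_kbytes * circwindow / d_sec)

    # congested path, 50 kbytes/sec bottleneck: the window size is irrelevant
    print(effective_throughput(1000, 1, 50))    # 50 kbytes/sec
    print(effective_throughput(200, 1, 50))     # 50 kbytes/sec
    # unloaded path: CIRCWINDOW=200 caps throughput at 100 kbytes/sec
    print(effective_throughput(200, 1, 10**9))  # 100.0 kbytes/sec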

One negative effect which could be considered is the following: with
fewer cells in the buffers, there is less of a "mixing" effect. I know
that there is no real mixing in Tor; what I mean is that even if cells
are not re-ordered, the introduced delays make it quite hard to find the
correspondence between an entering cell and the corresponding cell
leaving later on. I don't think reducing this delay is a problem, but it
should at least be mentioned.

In the report, we also analyze the effect of external TCP tunnels when
several streams are multiplexed over them. In this case, a TCP tunnel
becomes a real bottleneck.

Finally, we also discuss how the combination of 1) datagram based
operation, 2) allowing for cell/packet drops in overlay nodes, and 3)
relying on only one end-to-end congestion control loop increases
performance. We do this through an example of an IPsec based
architecture, but the results partially apply to DTLS as well.

I hope you don't mind this long mail!
Best regards,
Csaba