Dear All,

I'm new to the or-dev list, so please excuse me if I mention something that has already been discussed. We did a study of Tor's flow-control mechanism, and what came out of it is that reducing the flow-control window size (CIRCWINDOW) would be beneficial not just for the individual client, but for the whole Tor network. You can find a tech report describing our results (still work in progress) at
http://disi.unitn.it/locigno/preprints/TR-DISI-08-041.pdf

It is also about other topics (namely measurements with datagram-based solutions, in our case using IPsec, and a theoretical evaluation of TCP-based solutions), but I think what is most interesting for immediate use in development is how much the flow-control window size can influence application-level delay in an overload situation. In my understanding, overload is the natural state of the Tor network, so it is important to make the system behave well in these conditions.
I've started to discuss this topic with Roger, and he suggested that I extend the discussion to the or-dev list. See our conversation below, and my original mail at the very end of this long mail.
Any feedback/discussion of the ideas is welcome,
Csaba

Roger Dingledine wrote:
As a first attempt we are trying a fixed, reduced window size. If it becomes variable, what we would introduce instead of the current end-to-end flow control is some kind of congestion control with an end-to-end feedback loop. To avoid strange interactions between the various levels of congestion control, I think this variability should be kept really slow (and obviously it should be done per circuit).

> On Sat, Nov 01, 2008 at 07:03:36PM +0100, Csaba Kiraly wrote:
>> One thing that could be of immediate use for Tor is the study of Tor's flow-control mechanism, and its behavior with different window sizes. In short, by reducing the window size from the current 1000 to e.g. 200 in the Tor network, there would be a huge decrease in application-level round-trip times. It would also mean much smaller buffers in OR nodes, reducing their memory consumption significantly. At the same time, throughput would remain the same. Figure 6 of the tech report shows delay and throughput with some window sizes.
>
> I think these are great ideas. You're right that the initial choice of window size was just a random number that I picked that seemed large enough to provide good throughput. And on thinking about it more, I think you're right that a larger number actually just congests the network more. I've had an item on the 'volunteer' page related to window size for a few years now: Item #8 on https://www.torproject.org/volunteer#Research
>
> It seems that variable window sizes would be even more flexible than what you propose, since then it could adapt to different levels of congestion. In particular, I worry about:
>
>> Your throughput is limited to cell_size * CIRCWINDOW / D (without considering the size of CIRCWINDOW_INCREMENT and the delay of the SENDME cell on the return path). With typical data (cell_size = 0.5 kbytes; CIRCWINDOW = 1000; D = 1 sec), the throughput is limited to 500 kbytes/sec.
>
> What if D=5 because there's still congestion? If we then cut the circuit window size by another factor of 5, we end up with an expected throughput of 20KB, which is a bit on the 'uncomfortably low' end. (What if D=10?)
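Just to put numbers on that worry, here is the same cap written out as a tiny script (same formula and example values as above; D is the delay from factors 1-6, and nothing here is measured):

CELL_KB = 0.5  # kbytes per cell

def throughput_cap_kb_per_sec(circwindow, d_sec):
    # Throughput cap implied by the window, ignoring CIRCWINDOW_INCREMENT
    # and the SENDME's return-path delay.
    return CELL_KB * circwindow / d_sec

for circwindow in (1000, 200):
    for d in (1, 5, 10):
        cap = throughput_cap_kb_per_sec(circwindow, d)
        print(f"CIRCWINDOW={circwindow:4d}, D={d:2d}s -> cap {cap:6.1f} KB/s")

# CIRCWINDOW=1000, D=1 gives the 500 KB/s above; CIRCWINDOW=200, D=5 gives
# the 20 KB/s 'uncomfortably low' case; D=10 halves that again.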
In the long run, I'm also more of a fan of a datagram-based solution with only end-to-end congestion control. What we've been using in the tech report is IPsec tunnels, and at that point the kernel-level TCP stack comes for free. With this solution, performance is almost equivalent to native IP tunneling on the same path. Obviously IPsec and kernel TCP have all of their well-known drawbacks, but that's off-topic here. We had our reasons to use IPsec, but what's important from Tor's perspective is that a UDP-based transport would produce almost the same performance, if OR nodes are strong enough and delay factors 5-7 are handled well. (See the delay factors near the end of my original mail.)

>> Lets concentrate now on 5-7. 5, This is studied in Fig. 2, confronting TCP-in-TCP with normal TCP on the same path (TCP-in-IP). With one stream, delays are almost the same. TCP is good at adapting to the congestion and keeping its delay low.
>
> This introduces the recent discussion about using UDP for transport. See e.g. http://archives.seul.org/or/talk/Oct-2008/msg00095.html
>
> One of the big reasons they concluded Tor is slow is that it puts both loud and quiet streams together into one TCP connection. When TCP's back-off kicks in, all the streams get punished, not just the loud ones. Ultimately we'd like to resolve this whole TCP flow control thing by switching to UDP transport between each relay, and then either have a user-space TCP stack sitting at each relay to reconstruct the packets, or have one sitting at each end of the circuit to reconstruct them. Alas, there is still a big pile of open development problems there -- as a first example, there appears to be no free-software, stable, platform-independent, secure user-space TCP stack out there.
Another thing that comes to mind is that a better separation of signaling (directory lookup, circuit setup, etc.) and the data path would be beneficial. This would ease things such as experimenting with DTLS or other transport methods on the overlay tunnel, and experimenting with different solutions end-to-end.
We are doing verifications on this. Our lab experiments (the ones in the tech report) show that there is a huge gain on the user side in delays, while throughput is untouched. Throughput is capped with a static window size, but I think the cap can be chosen better than it is now. There should also be a big gain in the memory consumption of ORs, although we haven't measured it yet. Since the Tor network is kind of overloaded all the time, memory usage should decrease almost linearly with the window size.

>> 6, I've noticed this limit (MIN_TLS_FLUSHLEN) in the code. Its size is about 32 cells; it doesn't seem to be an important delay component.
>
> Right. I added that in because in some cases we were reading in hundreds of kilobytes without initiating any writes until all the reads were done. I have absolutely no idea whether that was a good idea. It would be great to be able to have some data about these things. Speaking of which, Tor's round-robin when we run low on our bandwidth buckets has to matter here. We try to round-robin 16KB at a time, but have smaller allocations when our token bucket is getting low. Again, I just made up some numbers, and it would be great to get real data there.
>
>> 7, The biggest part of the delay is here: hundreds if not thousands of cells waiting in a FIFO queue (the OR's output buffer towards another OR) before getting to the TLS tunnel.
>
> So the next question is an implementation one. Right now the window sizes are hard-coded at both ends. I've been meaning to extend the protocol so sendme cells have a number in them, and so the initial window sizes are specified in the 'create' and 'created' cells for circuits and the 'begin' and 'connected' cells for streams. But we haven't really fleshed out the details of those designs, or how they could be phased in and still handle clients and relays that don't use (or know about) the numbers. So the big deployment question is: is it worth it to work on a design for the above, and then either shrink the default window sizes or do something smarter like variable window sizes, or should we just be patient until a UDP-based solution is more within reach? One answer is that if you were interested in working on a design proposal and patch, it would be much more likely to get implemented. :)
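Just to illustrate the kind of extension Roger mentions, a purely hypothetical encoding of "a number in the sendme cell" could look something like this (this is not the actual Tor cell format, and the field layout is made up; the point is only that old, empty SENDMEs can keep their hard-coded meaning, so old and new relays can coexist):

import struct

def pack_sendme_payload(window_increment: int) -> bytes:
    # Hypothetical: the sender of the SENDME states the increment it grants.
    return struct.pack("!H", window_increment)   # 2-byte big-endian value

def unpack_sendme_payload(payload: bytes, legacy_default: int = 100) -> int:
    # Old-style empty SENDMEs fall back to the hard-coded increment, so a
    # relay that doesn't know about the field behaves exactly as today.
    if len(payload) < 2:
        return legacy_default
    return struct.unpack("!H", payload[:2])[0]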
Currently we are verifying a one-sided modification of the circuit, i.e. whether one side of the connection can reduce the window size on its own, without explicitly notifying the other side. From the code it seems to me that this will work, and if so, phasing in a smaller window size in a new release should not be a problem.
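The intuition, as a minimal sketch (not Tor's actual code), is that only the sending edge's bookkeeping changes: it simply keeps fewer cells outstanding, while the other side keeps sending one SENDME per 100 delivered cells exactly as before:

CIRCWINDOW_INCREMENT = 100   # cells acknowledged by each circuit-level SENDME

class SendingEdge:
    def __init__(self, local_window=200):        # reduced unilaterally from 1000
        # Keeping the reduced window a multiple of the increment keeps the
        # bookkeeping exact even though the other side is never told.
        assert local_window % CIRCWINDOW_INCREMENT == 0
        self.package_window = local_window        # cells we may still send

    def can_send_cell(self) -> bool:
        return self.package_window > 0

    def note_cell_sent(self):
        self.package_window -= 1

    def note_sendme_received(self):
        # The other end still acknowledges every 100 delivered cells as before.
        self.package_window += CIRCWINDOW_INCREMENT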
Well, we've had different ways of analysing it, but some of the conclusions are similar, even if not the same :) We did not try analysing the effect of mixing different (loud and quiet) kinds of traffic. What we say is that all streams suffer even if they are of the same kind.

>> One negative effect which could be considered is the following: with fewer cells in buffers, the "mixing" effect is less. I know that there is no real mixing in Tor; what I mean is that even if cells are not re-ordered, due to the introduced delays, finding the correspondence between an entering cell and the corresponding cell leaving later on is quite hard. I don't think reducing this delay is a problem, but it should be mentioned at least.
>
> This doesn't bother me either. I think we're quite vulnerable already to an adversary trying to match up flows.
>
>> In the report, we also analyze the effect of external TCP tunnels when several streams are multiplexed over them. In this case, a TCP tunnel becomes a real bottleneck.
>
> Indeed. Is this the same statement that I made above, from Joel's thesis?
>> Finally, we also discuss how the combination of 1) datagram-based operation, 2) allowing for cell/packet drops in overlay nodes, and 3) relying on only one end-to-end congestion-control loop increases performance. We do this through an example of an IPsec-based architecture, but the results partially apply to DTLS as well.
>
> Sounds great. Are you able to send this to the or-dev list too? I think it would be more useful to have other people see the thread too. Thanks!
> --Roger
Csaba Kiraly wrote:
Dear Roger,

As Giuseppe said, the tech report is work in progress; however, I think there are already some parts that might be interesting for you and for others involved in Tor development. One thing that could be of immediate use for Tor is the study of Tor's flow-control mechanism, and its behavior with different window sizes. In short, by reducing the window size from the current 1000 to e.g. 200 in the Tor network, there would be a huge decrease in application-level round-trip times. It would also mean much smaller buffers in OR nodes, reducing their memory consumption significantly. At the same time, throughput would remain the same. Figure 6 of the tech report shows delay and throughput with some window sizes.

I hope you don't mind if I give a more detailed Tor-specific explanation of this aspect here below, since it was not (yet) discussed in the tech report in enough detail.

As you know, Tor breaks up the end-to-end TCP loop and replaces it with its own reliable delivery service. It implements reliable delivery by using TCP on the overlay links, and by not dropping cells in OR nodes (it can't drop cells in ORs, because there is no other end-to-end reliability mechanism that would resend a cell). If you also consider that bottlenecks are typically inside the Tor network (and not after the exit node), you get the following phenomenon: the source sends cells as fast as it can, as long as its CIRCWINDOW (or STREAMWINDOW) permits. These cells all get buffered at the OR before the bottleneck overlay link. When some (CIRCWINDOW_INCREMENT) data gets through the bottleneck and arrives at the exit node, a SENDME cell is sent back. The source pushes again, and as a result, the buffer before the bottleneck link is always almost full. This introduces two problems: large delays for end nodes and huge memory consumption for ORs.

Large delay: the delay introduced by the FIFO buffer in the OR before the bottleneck can be estimated as cell_size * CIRCWINDOW / bottleneck_bandwidth. With typical data (cell_size = 0.5 Kbytes; CIRCWINDOW = 1000; bottleneck_bandwidth = 50 Kbytes/sec) you get a potential delay of 10 seconds. Fortunately, applications are typically not that eager to push all that information inside; still, the accumulated delay is huge.

Large memory consumption at OR nodes: an OR node should potentially buffer cell_size * CIRCWINDOW, i.e. 500 Kbytes for each circuit.

With a smaller window, both the delay and the memory consumption would be smaller. Thus the important question is: why not reduce the window? One would say throughput, but the important thing to note is that a smaller (but not too small) window would not reduce throughput. This can be seen on the right side of fig. 4: all the Tor curves are the same, independent of the window size (and the number of streams). The reason is simple: if there is a bottleneck on the path taken by the circuit, the throughput depends on how the bandwidth of the TCP connection of the bottlenecked OR link is shared among circuits. Buffering more cells before this TCP connection does not make it more effective. Delay and throughput can be analyzed in more detail.
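As a quick sanity check on these two formulas, here they are written out with the example numbers above (just the formulas, not measurements):

CELL_SIZE_KB = 0.5            # one Tor cell is roughly 0.5 Kbytes
BOTTLENECK_KB_PER_SEC = 50.0  # assumed bottleneck overlay-link bandwidth

def fifo_delay_sec(circwindow, bottleneck_kb_per_sec):
    # Worst-case queueing delay of a full window sitting in one OR's FIFO.
    return CELL_SIZE_KB * circwindow / bottleneck_kb_per_sec

def buffer_kb(circwindow):
    # Worst-case per-circuit buffer memory in the OR before the bottleneck.
    return CELL_SIZE_KB * circwindow

for w in (1000, 500, 200, 100):
    print(f"CIRCWINDOW={w:4d}: delay up to {fifo_delay_sec(w, BOTTLENECK_KB_PER_SEC):5.1f} s, "
          f"buffer up to {buffer_kb(w):5.0f} Kbytes per circuit")

# CIRCWINDOW=1000 gives the 10-second / 500-Kbyte figures above;
# CIRCWINDOW=200 cuts both to 2 seconds and 100 Kbytes.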
In Tor, delay comes from the following factors (I hope I've managed to list them all):

0, circuit setup delay
1, delay outside the Tor network: TCP from the application to the SOCKS proxy
2, delay outside the Tor network: TCP from the exit node to the destination node
3, cell-forming delay at the edge of the circuit (either at the OP or at the exit OR)
4, IP-level delay on each overlay link (due to physical distance, IP hops, etc.)
5, delay introduced by TCP on each overlay link
6, delay introduced by TLS record forming on each overlay link
7, delay of FIFO cell buffers in ORs on each overlay link

With CIRCWINDOW=1000, it seems that point 7 is responsible for a large part of the overall delay:

0, is important at the beginning, but not during transmission
1, is almost negligible
2, is in the range of normal unprotected operation. There is not much to do to reduce it, except selecting an exit node near the destination, but this could compromise privacy.
3, you already have a mechanism (based on port numbers) to mitigate this
4, there is not much to do with this, except changing the OR selection strategy, and thus compromising privacy. There are numerous proposals on that, as far as I remember.

These factors add up to some hundreds of milliseconds, depending on the choice of the OR nodes and how they are connected to the Internet. Let's concentrate now on 5-7.

5, This is studied in Fig. 2, confronting TCP-in-TCP with normal TCP on the same path (TCP-in-IP). With one stream, delays are almost the same. TCP is good at adapting to the congestion and keeping its delay low.
6, I've noticed this limit (MIN_TLS_FLUSHLEN) in the code. Its size is about 32 cells; it doesn't seem to be an important delay component.
7, The biggest part of the delay is here: hundreds if not thousands of cells waiting in a FIFO queue (the OR's output buffer towards another OR) before getting to the TLS tunnel.

This suggests reducing CIRCWINDOW as much as possible. To understand what "possible" means, throughput should be analyzed. Suppose for a minute that the Tor network is not overloaded, and there is no overlay-link bottleneck that would limit your throughput. In this case 7) becomes negligible, while CIRCWINDOW limits your cells in flight, and thus your throughput. Let D be the delay due to delay factors 1-6. Your throughput is limited to cell_size * CIRCWINDOW / D (without considering the size of CIRCWINDOW_INCREMENT and the delay of the SENDME cell on the return path). With typical data (cell_size = 0.5 kbytes; CIRCWINDOW = 1000; D = 1 sec), the throughput is limited to 500 kbytes/sec. Reducing CIRCWINDOW to 200, this goes down to 100 kbytes/sec (if D = 1 sec), which is still much above what is needed. Moreover, the Tor network IS more congested than that (at least this was my impression), so reducing CIRCWINDOW would have absolutely no negative effect on the throughput.

One negative effect which could be considered is the following: with fewer cells in buffers, the "mixing" effect is less. I know that there is no real mixing in Tor; what I mean is that even if cells are not re-ordered, due to the introduced delays, finding the correspondence between an entering cell and the corresponding cell leaving later on is quite hard. I don't think reducing this delay is a problem, but it should be mentioned at least.

In the report, we also analyze the effect of external TCP tunnels when several streams are multiplexed over them. In this case, a TCP tunnel becomes a real bottleneck.
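To make the "a smaller window does not cost throughput under a bottleneck" argument concrete, here is a small toy simulation of the window/SENDME loop over a single bottleneck link (made-up bandwidth and propagation delay; a sketch of the mechanism described above, not Tor's implementation):

from collections import deque

CELL_KB = 0.5
INCREMENT = 100          # cells acknowledged per SENDME (CIRCWINDOW_INCREMENT)
BOTTLENECK_CPS = 100     # bottleneck link speed in cells/sec (= 50 KB/s)
PROP_DELAY = 0.3         # assumed one-way delay of the rest of the path (s)
DT = 0.01                # simulation step (s)
SIM_TIME = 120.0         # simulated seconds

def simulate(circwindow):
    window = circwindow          # cells the greedy source may still send
    queue = deque()              # FIFO at the OR before the bottleneck: enqueue times
    to_exit = []                 # arrival times at the exit node
    sendmes = []                 # times at which SENDMEs reach the source
    delivered = since_sendme = 0
    delay_sum = delay_n = 0
    credit = 0.0                 # fractional cell transmissions on the bottleneck
    t = 0.0
    while t < SIM_TIME:
        while window > 0:                       # bulk source uses the whole window
            queue.append(t)
            window -= 1
        credit += BOTTLENECK_CPS * DT           # bottleneck drains the FIFO
        while credit >= 1.0 and queue:
            enq_t = queue.popleft()
            credit -= 1.0
            delay_sum += t - enq_t              # time the cell sat in the FIFO
            delay_n += 1
            to_exit.append(t + PROP_DELAY)
        arrived = [a for a in to_exit if a <= t]
        to_exit = [a for a in to_exit if a > t]
        for _ in arrived:                       # every INCREMENT arrivals -> SENDME
            delivered += 1
            since_sendme += 1
            if since_sendme == INCREMENT:
                since_sendme = 0
                sendmes.append(t + PROP_DELAY)
        back = [s for s in sendmes if s <= t]
        sendmes = [s for s in sendmes if s > t]
        window += INCREMENT * len(back)         # SENDMEs re-open the window
        t += DT
    throughput = delivered * CELL_KB / SIM_TIME
    avg_delay = delay_sum / delay_n if delay_n else 0.0
    return throughput, avg_delay

for w in (1000, 500, 200):
    tp, d = simulate(w)
    print(f"CIRCWINDOW={w:4d}: throughput ~{tp:5.1f} KB/s, avg FIFO delay ~{d:5.2f} s")

In this toy model the throughput stays at the bottleneck rate for all three window sizes, while the FIFO delay shrinks roughly in proportion to the window, which is exactly the effect argued for above.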
Finally, we also discuss how the combination of 1) datagram-based operation, 2) allowing for cell/packet drops in overlay nodes, and 3) relying on only one end-to-end congestion-control loop increases performance. We do this through an example of an IPsec-based architecture, but the results partially apply to DTLS as well.

I hope you don't mind this long mail!

Best regards,
Csaba