This novelette-length mail is meant to provide the background information necessary to understand choices in congestion control deployment for Tor. Much of this background information won't be included in my planned Tor Proposal, which will hopefully be much shorter, but the choices made there will be informed by this material.

This material has been enhanced through review and conversations with Toke Høiland-Jørgensen, g burdell, Rob Jansen, and George Kadianakis. Immense credit goes to Toke in particular, for detailed and thoughtful review, and input on queue management.

Like my last mail on this topic, this mail comes with a playlist. Fair warning: it is not exactly easy listening. Still, hopefully it gives you strength and inspiration in these trying times:
https://people.torproject.org/~mikeperry/borders/everybody-loves-them/TakeAKneeAndMeditateInThePaint.txt

Section -I: Recap [RECAP]

https://wizardzines.com/zines/networking

I strongly believe that congestion control will provide the single largest performance win for the Tor network. Specifically, it will considerably improve both latency and throughput metrics in the best, average, and worst case. In addition to performance, congestion control will also provide the second-largest scalability win on our research horizon (behind Walking Onions), by making the queue length memory requirements of Tor relays independent of the number of circuits carried on that relay.

I also believe we can achieve the vast majority of the benefits by deploying an RTT-based system that only clients and Exits need to support, specifically BOOTLEG_RTT_TOR. After this, we might consider more advanced options based on ECN, but their benefits are far less clear, and fraught with peril.

This mail assumes that you have read "Options for Congestion Control in Tor-like Networks":
https://lists.torproject.org/pipermail/tor-dev/2020-January/014140.html

Even if you have already read the Options mail, I recommend reviewing the SENDME section and the sections marked with TCP_HISTORY, BOOTLEG_RTT_TOR, BACKWARD_ECN_TOR, and perhaps also FORWARD_ECN_TOR, as understanding this mail requires an understanding of those sections. The summary of the other sections from the Options mail is basically this: adding ECN signals to Tor seems straightforward, but is actually very tricky.

Section @: Introduction and summary [INTRO_SUMMARY]

Deep understanding of the congestion control problem space is essential, because shitty congestion control will not be a huge win, and will reduce the benefit of other performance improvements. Worse still, it may introduce side channels and other anonymity attacks.

In particular, our level of congestion latency (and associated relay queue lengths) will be very sensitive to congestion signal response time, to our congestion window update rules, and also to our choice of queue management algorithm (ie: which circuits send cells and/or congestion signals). Our average circuit bandwidth on the network will similarly be very sensitive to the ability of the congestion window to quickly update and converge on the available Bandwidth-Delay Product (BDP) of its path.

In other words, we need to make good design choices in the following three different areas:

  I. Which congestion signal to use [CONGESTION_SIGNALS]
  II. How to update the congestion window [CONTROL_ALGORITHMS]
  III. Which circuits to send the signal on [QUEUE_MANAGEMENT]

The next three sections of this mail correspond to those three areas.
As before, I have tagged the specific sub-sections that will be used in the Tor Proposal with [PROPOSAL_CANDIDATE]. I have also given tags to sections in this document. References to things in this document are enclosed in brackets. References to the Options mail are simply all caps, without brackets.

The high-level plan is that we use RTT as a primary congestion signal, as per BOOTLEG_RTT_TOR from the Options mail, and try it out with a few different congestion algorithms, such as [TCP_VEGAS_TOR] and [TCP_WESTWOOD_TOR].

It turns out that when RTT is used as a congestion signal, our current Circuit-EWMA is sufficient to serve as [QUEUE_MANAGEMENT]. It also turns out that Circuit-EWMA actually should protect us from fairness issues in the control algorithms, and from cheaters who try to run different congestion control algorithms. Circuit-EWMA will also deliver RTT congestion signals more heavily to bulk circuits (because it prioritizes their cells below light circuits), causing them to back off in favor of lighter traffic, even as we relay fewer of their cells. Thanks goes to Toke Høiland-Jørgensen for pointing this out.

The combination of RTT-based congestion signaling, a decent but simple control algorithm, and Circuit-EWMA will get us most, if not all, of the benefits we seek, and only requires clients and Exits to upgrade to use it. Once it is deployed, circuit bandwidth will no longer be capped at ~500kb/sec by the fixed window sizes of SENDME; queue latency will fall significantly; memory requirements at relays should plummet; and bottlenecks in the network should dissipate.

However, we can still improve on this further, after this initial deployment. Because the "slow start" phase of TCP can take a long time, we will also want to play around with various initial window sizes, window growth rate parameters, and experiment with using a single BACKWARD_ECN_TOR cell_t.command ECN-style signal. This signal will be sent on circuits as determined by [CODEL_TOR]. An instance of [CODEL_TOR] can be added to each TLS connection, after Circuit-EWMA is applied. While BACKWARD_ECN_TOR is complex and slow to roll out (all relays must support this new cell_t.command), this optimization will likely be worth it, because BACKWARD_ECN_TOR can send a signal on a circuit to stop the exponential growth of slow start in *less than* RTT/2 time, which is significantly faster than implicitly measuring an RTT-based signal (which takes RTT plus some measurement lag to act upon). This faster signal means faster termination of slow start, which means less buffer bloat.

For stream flow control, on the client side, we should just ack all stream cells with a circuit-level SENDME even though they were not delivered to their blocked stream. The local Tor client should keep queuing them until the application reads them. Thus, a blocked client application stream will not block other streams, but will cause Tor client stream-level queues to build up. For exit-side streams, we should *not* send SENDMEs until the cell has been sent on its exiting upstream connection. This allows TCP pushback on one stream to block the entire circuit, but this is desirable to avoid queuing, and it will only happen in the upload direction.
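To make that asymmetric stream flow control plan a bit more concrete, here is a minimal sketch in Python. None of these names (Circuit, Stream, count_cell_toward_sendme, etc.) correspond to actual tor code; this is just an illustration of the accounting under the assumptions above, with the bookkeeping reduced to one counter per circuit:

    # Hypothetical, simplified sketch of the asymmetric stream SENDME policy
    # described above. None of these names correspond to actual tor code.

    SENDME_RATE = 100  # cells per circuit-level SENDME

    class Circuit:
        def __init__(self):
            self.cells_since_sendme = 0
            self.sendmes_sent = 0

        def count_cell_toward_sendme(self):
            self.cells_since_sendme += 1
            if self.cells_since_sendme >= SENDME_RATE:
                self.cells_since_sendme = 0
                self.sendmes_sent += 1        # stands in for sending the SENDME cell

    class Stream:
        def __init__(self, blocked=False):
            self.blocked = blocked            # app not reading / upstream TCP full
            self.queued = []                  # local queue (client) or pending upstream (exit)
            self.delivered = []

    def client_recv_stream_cell(circ, stream, cell):
        # Client: always give SENDME credit; a blocked app stream only grows
        # its own local queue, it does not stall the circuit.
        (stream.queued if stream.blocked else stream.delivered).append(cell)
        circ.count_cell_toward_sendme()

    def exit_recv_stream_cell(circ, stream, cell):
        # Exit: only give SENDME credit once the cell is flushed upstream, so
        # TCP pushback on one stream withholds SENDMEs for the whole circuit.
        if stream.blocked:
            stream.queued.append(cell)
        else:
            stream.delivered.append(cell)
            circ.count_cell_toward_sendme()

The asymmetry is the point: client-side blocking only grows a local stream queue, while exit-side blocking withholds SENDME credit and pushes back on the whole circuit (upload direction only).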
Section I: Selecting Congestion Signals [CONGESTION_SIGNALS]

The "Options for Congestion Control in Tor-like Networks" mail probably should have been titled "Options for Congestion Signals in Tor-like Networks", as its primary focus was on the ways Tor endpoints can receive a signal about the congestion status of a circuit on bottleneck relays. In that post, I postponed discussion of queue management, and deliberately under-specified how to react to a congestion signal (going no further than saying "use SlowStart and AIMD").

I chose that scoping because the most difficult barrier we have faced in the past decade of work on this topic is the side channels enabled by a congestion signal (and information leaks in general). However, I should have made our constraints more explicit. Our choice of acceptable congestion signal mechanism is restricted by these design constraints:

  1. No side channels between Guard and Exit, or between circuits
  2. Incremental upgrade path
  3. Fast response time
  4. Cheater prevention

These design constraints restrict our choice of congestion signal, but they are not our only design constraints in this problem space. When we get to the later sections on the congestion window control algorithm and the queue management algorithm, I will explicitly enumerate additional design constraints there. Hopefully this will provide more clarity than the uncategorized list of "causes of death" in the Options post.

For your convenience, here's the summary table of all the options for congestion signal again:

 ______________________________________________________________________
|        MIX           | CHEATING  | RESPONSE|    SIDE    | UPGRADE  |
|       TRACK          | DETECTION |  TIME   |  CHANNELS  |  PATH?   |
|~~~~~~~~~~~~~~~~~~~~~~|~~~~~~~~~~~|~~~~~~~~~|~~~~~~~~~~~~|~~~~~~~~~~|
|BOOTLEG_RTT_TOR       | Exit RTT  |100c+RTT | RTT        | Exits    |
|FORWARD_ECN_TOR       | Circ OOM  | RTT     | None?      | Full Net |
|FORWARD_ECN_RTT_TOR   | Exit RTT  | RTT     | RTT        |Exits;Full|
|BACKWARD_ECN_TOR      |Middles;OOM| RTT/2   | Guard->Exit| Full Net |
|BACKWARD_ECN_RTT_TOR  |Middles;RTT|RTT/2;RTT| Guard->Exit|Exits;Full|
|BACKWARD_ECN_TRAP_TOR | Middles   | RTT/2   |Guard<->Exit| Full Net |
|EDGE_FORWARD_ECN_TOR  |Middles;OOM| 2*RTT/3 | None?      | Full Net |
|START_BACKWARD_ECN_TOR|Middles;OOM|RTT/2;RTT| Low/None?  | Full Net |
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

My short-term favorite is BOOTLEG_RTT_TOR, because it only requires Exit relay support to use. Recall that BOOTLEG_RTT_TOR defines an RTT latency threshold as its congestion signal, and was first described in section 4.1 of the N23 Defenestrator paper. Even though the N23 paper didn't bother to name this system, RTT use as a congestion signal in this way also resembles TCP Vegas, as we shall see.

https://www.cypherpunks.ca/~iang/pubs/defenestrator.pdf

Note that because Circuit-EWMA will add additional delay to loud circuits, "cheaters" who use alternate congestion control algorithms should end up with more delay than those who do not. This means that if we use RTT as a congestion signal, Circuit-EWMA should provide *more* RTT congestion signal to cheaters, and cheaters would only be punishing themselves with more Circuit-EWMA delay. So cheater resilience for an RTT signal is actually better handled than I previously thought. Again, thanks goes to Toke Høiland-Jørgensen for pointing this out. For completeness, we will still explore some examples of alternate "cheating" control algorithms in Section II, though.
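As a reminder of what the BOOTLEG_RTT_TOR signal itself looks like (the exact formula reappears in Section II.C below), here is a minimal sketch. The 'a' blend parameter is the tunable from the Options mail; the default of 0.25 here is only a placeholder, not a recommended value:

    # Minimal sketch of the BOOTLEG_RTT_TOR congestion signal: an RTT threshold
    # blended between the smallest and largest RTTs observed on the circuit.
    # a=0.25 is a placeholder default, not a tuned or recommended value.

    def rtt_congestion_signal(rtt_current, rtt_min, rtt_max, a=0.25):
        """Return True if the latest SENDME RTT counts as a congestion signal."""
        threshold = (1 - a) * rtt_min + a * rtt_max
        return rtt_current >= threshold

An Exit (or client) would run this check once per SENDME ack, and feed the result into whatever window update rule we pick in Section II.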
Unfortunately, RTT is itself a side channel, as per:
https://www.freehaven.net/anonbib/cache/ccs07-latency-leak.pdf
https://www.robgjansen.com/publications/howlow-pets2013.pdf

There is also literature on shaping circuit bandwidth to create a side channel. This can be done regardless of the use of congestion control, and is not an argument against using congestion control. In fact, the Backlit defense may be an argument in favor of endpoints monitoring circuit bandwidth and latency more closely, as a defense:
https://www.freehaven.net/anonbib/cache/ndss09-rainbow.pdf
https://www.freehaven.net/anonbib/cache/ndss11-swirl.pdf
https://www.freehaven.net/anonbib/cache/acsac11-backlit.pdf

Recall that BACKWARD_ECN_TOR uses the circuit-level cell_t.command to signal congestion. This allows all relays in the path to signal congestion in under RTT/2 in either direction, and it can be flipped on existing relay cells already in transit, without introducing any overhead. However, because cell_t.command is visible and malleable to all relays, it can also be used as a side channel. So we must limit its use to a couple of cells per circuit, at most.

Recall that FORWARD_ECN_TOR adds encrypted end-to-end relay cells that explicitly signal congestion to either the client or the Exit, but requires the client to mediate and echo this cell to protect against side channels. Because congestion signals must be relayed off the client, it still takes the Exit a full RTT to learn about congestion at earlier nodes in the circuit. Additionally, because this cell itself must travel the full path in each direction, it is further delayed by congestion on the path, and it also adds yet more congestion to congested paths. Congestion in the Exit-to-client direction is the most common, as this is the download direction, and it will be particularly expensive and slow to bounce the congestion signal off the client to the Exit. For this reason, I think the FORWARD_ECN_TOR variants are not likely to be as efficient as RTT signaling, despite the FORWARD_ECN_RTT_TOR variant initially being flagged as a PROPOSAL_CANDIDATE in the Options mail. However, we should still keep it in mind, since RTT may turn out to be an unreliable signal by itself.

Therefore, since RTT is already measurable with our current SENDME flow control, I propose that we use it via BOOTLEG_RTT_TOR (or TCP Vegas), and then add START_BACKWARD_ECN_TOR exactly once (to exit slow start). If RTT proves unreliable, then we can experiment with the more heavyweight FORWARD_ECN_TOR full-cell signal. As we shall see in the following control algorithm section, RTT-based systems like TCP Vegas largely avoid the need to send any congestion signal (drop or ECN). In other words: if we can rely on RTT, we should not need any explicit signals at all (though they would improve responsiveness).

Section II: Congestion Window Control Algorithms [CONTROL_ALGORITHMS]

Recall that the congestion window is the number of packets TCP will allow to be sent on a connection without any acknowledgment. The congestion window is meant to match the bandwidth-delay product of the connection as closely as possible. The bandwidth-delay product can be thought of as the total amount of data that can be in transit on a link at one time without using any queue memory (bytes per second of bandwidth multiplied by seconds of transit time delay = total bytes in transit on the wire at one time).
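As a toy example of this definition in Tor's units (the bandwidth and RTT figures below are invented purely for illustration; only the 512-byte cell size comes from Tor):

    # Toy bandwidth-delay product computation in cell units. The bandwidth and
    # RTT numbers are made up for illustration.

    CELL_SIZE = 512                    # bytes per Tor cell
    bottleneck_bw = 1_000_000          # bytes/sec of the slowest hop (assumed)
    circuit_rtt = 0.300                # seconds of round-trip time (assumed)

    bdp_bytes = bottleneck_bw * circuit_rtt    # ~300,000 bytes in flight
    bdp_cells = bdp_bytes / CELL_SIZE          # ~586 cells
    # An ideal congestion window for this circuit hovers around 586 cells:
    # large enough to keep the slow hop busy, small enough to build no queue.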
On the Internet, if the congestion window exceeds the bandwidth-delay product of a transmission path, then packets must build up in router queues, and eventually get dropped. If the congestion window is below the bandwidth-delay product of the path, then routers on the path sit partially idle, and thus utilization suffers, even if users are trying to use full bandwidth. TCP variants basically compete on their ability to collectively track the bandwidth-delay product of a path under a wide variety of network conditions and other competing traffic.

In the interest of brevity, my "Options for Congestion Control in Tor-like Networks" mail suggested that we update the congestion window according to Slow Start with AIMD, in response to a congestion signal (or lack thereof). However, that is not the only option, and if we retain the ability to measure RTT, it may not even be the most efficient one. So to help us make this decision, I did a review of historical TCP and TCP-like systems in the research literature, and will present the highlights here.

But first, let's define the control requirements that we want to satisfy:

  1. Full bandwidth utilization of the slowest TLS link in every circuit
  2. Queue length minimization at the relay sending on this TLS link
  3. Fairness between circuits (esp partially overlapping ones)
  4. Rapid convergence to full circuit capacity

These requirements mirror the requirements used in TCP evaluation. TCP implementations compete over metrics in each of these areas, and algorithm design is far more art than formal proof of optimality or even correctness. That does not mean that formalism is impossible, however. For a great formal breakdown of this design space without all of TCP's baggage, see Section 3 of the XCP paper. Note that we can't use XCP as-is because its congestion signal fields provide way too much information for use in potential side channels, but it is worth a read because it clarifies the control theory behind congestion control:

http://conferences.sigcomm.org/sigcomm/2002/papers/xcp.pdf

In terms of how this design space is approached in TCP: utilization and queue minimization are typically handled simultaneously - sometimes explicitly and sometimes implicitly. Similarly, fairness is sometimes explicitly designed for, but more often fairness is an emergent property that is analyzed when competing against a benchmark (historically: TCP Reno). Fairness abuse is also an area where cheating can occur, enabling cheaters to get more than their fair share of bandwidth, and such cheating can be very hard to detect and prevent on the Internet.

Interestingly, many of the design goals of TCP are *also* improved via queue management algorithms at routers (Section III of this mail). In fact, our Circuit-EWMA circuit scheduler will effectively hand out RTT congestion signals (delay) to loud circuits before quiet ones, and will *also* enforce fairness and impede cheating, by delaying a cheater's cells even more. In this sense, we do not need to worry quite so much about fairness and cheating as security properties, but a lack of fairness in the TCP algorithm will *still* increase memory use in relays to queue these unfair/loud circuits. So we should still be mindful of these properties in selecting our congestion algorithm, to minimize relay memory use, if nothing else.
For a good summary of several TCP congestion control algorithms, see:

http://intronetworks.cs.luc.edu/1/html/reno.html ("Slow Start+AIMD")
http://intronetworks.cs.luc.edu/1/html/newtcps.html (modern TCPs)

As you can see, the TCP design space is huge. A good way to break down the TCP space is to separate the drop-based algorithms from the delay-based algorithms. TCP drop-based algorithms require either packet drops or ECN packet marking to signal congestion. Delay-based algorithms infer congestion from RTT measurements only. But as we shall see, this distinction is artificial for us. BOOTLEG_RTT_TOR effectively converts delay *into* an explicit congestion signal, as far as congestion window control algorithms are concerned. Similarly, either of our ECN signals can be used in place of the drop signal in the algorithms below.

I'm going to review four options that strike me as most relevant to our needs. Two are drop-based: TCP Reno and TCP Cubic. Two are delay-based: TCP Vegas and TCP Westwood. It should be noted that there are also hybrid algorithms that use drops/ECN in addition to delay as congestion signals, but since we are focusing primarily on using RTT as our signal, I won't be covering them. For a good overview of those options, see:

https://arxiv.org/pdf/1903.03852.pdf

Toke Høiland-Jørgensen suggested that we also investigate BBR, which is a hybrid algorithm currently being evaluated by Dropbox and Google. However, BBR involves packet send-time scheduling, which will require replacement of Circuit-EWMA and other extensive changes. Furthermore, BBR is still under active development, experimentation, and tuning. In fact, while BBRv2 has been tested by Dropbox, no IETF draft specification exists for it yet. Still, for completeness:

https://tools.ietf.org/html/draft-cardwell-iccrg-bbr-congestion-control-00
https://queue.acm.org/detail.cfm?id=3022184
https://datatracker.ietf.org/doc/slides-106-iccrg-update-on-bbrv2/
https://dropbox.tech/infrastructure/evaluating-bbrv2-on-the-dropbox-edge-network

I am also going to simplify and omit treatment of packet loss and duplicate acks, since we will not be dropping any packets. Additionally, I have normalized all formulas below to be in terms of Tor cells, rather than bytes. The hope is that these simplifications help us get to the essence of what these algorithms would look like when deployed on Tor, without getting distracted by irrelevant drop and ack synchronization management details from TCP/IP. However, as we move closer to specifying an actual design proposal, we should double-check my simplifications below, for the algorithm(s) that we actually specify.

Section II.A: TCP Reno [TCP_RENO_TOR]

https://tools.ietf.org/html/rfc5681
http://intronetworks.cs.luc.edu/1/html/reno.html

Until recently, TCP Reno was the most popular TCP on the Internet, and it is still used as a fairness benchmark. While we won't be using it directly, it is thus important to understand. Like all TCPs, TCP Reno satisfies its utilization and queue minimization design goals by trying to make its congestion window closely match the bandwidth-delay product of the link. More importantly, when multiple TCP connections are competing for the bandwidth of a link, TCP Reno fairly distributes the bandwidth-delay product of the link equally among all competing TCP Reno connections.
http://pbg.cs.illinois.edu/courses/cs598fa09/slides/05-Ashish.pdf
https://www.cse.wustl.edu/~jain/papers/ftp/cong_av.pdf

TCP Reno probes the bandwidth-delay product by increasing its congestion window until it overflows the queue capacity of the slowest router on a path, causing a packet drop or ECN signal, upon which TCP Reno reduces its congestion window.

The initial phase of TCP Reno is called "Slow Start". For our purposes, Slow Start in TCP Reno means that on every SENDME ack, the congestion window cwnd becomes:

  cwnd = cwnd + SENDME_RATE    (recall Tor's SENDME_RATE is 100 cells)

Because increasing the congestion window by 100 every SENDME allows the sender to send 100 *more* 512-byte cells every 100 cells, this is an exponential function that causes cwnd to double every cwnd cells. TCP Reno keeps on doubling until it experiences a congestion signal.

Once a congestion signal is experienced, Slow Start is exited, and the Additive-Increase-Multiplicative-Decrease (AIMD) steady-state phase begins. AIMD algorithms can be generalized with two parameters: the add quantity, and the multiply quantity. This is written as AIMD(a,m). TCP Reno uses AIMD(1, 0.5). It should be noted that most later TCPs include an AIMD component of some kind, but tend to use 0.7 for m instead of 0.5. This allows them to recover faster from congestion.

After slow start exits, in steady state, after every ack (SENDME) response without a congestion signal, the window is updated as:

  cwnd = cwnd + SENDME_RATE/cwnd

This comes out to increasing cwnd by 1 every time cwnd cells are successfully sent without a congestion signal occurring. Thus this is additive linear growth, not exponential growth.

If there was a congestion signal, cwnd is updated as:

  cwnd = cwnd * m    (m=0.5 or 0.7)

This is called "Fast Recovery". If you dig into the RFCs, actual TCP Reno has some additional details for responding to multiple packet losses that in some cases can fall back into slow start. However, for our purposes, the congestion window should only be updated once per cwnd cells, in either the congestion case or the non-congestion case.

TCP Reno (like all TCPs) also reverts to slow start when the connection has been idle, in case bottleneck capacity has changed in this time. We may or may not want to do this, as slow start has a dramatic impact on page load times.

TCP Reno is still used as the benchmark to determine fairness. If an algorithm is "TCP Friendly", this basically means that when competing with TCP Reno to use spare path capacity, it does not get its ass kicked, and it does not kick Reno's ass. Several algorithms contain hacks or special cases to preserve this property. If this fairness property holds, a bunch of math can prove convergence rates, utilization, congestion window sizes, and even queue sizes. Modern queue management also relies on this math to auto-tune queue parameters, as we will see in Section III.

http://pbg.cs.illinois.edu/courses/cs598fa09/slides/05-Ashish.pdf
https://www.icir.org/tfrc/aimd.pdf
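Putting the Reno pieces above together, here is a minimal sketch of the update rules, clocked on SENDME acks. This glosses over the bookkeeping of when the multiplicative decrease may be applied again; it is only meant to show the shape of the algorithm:

    # Sketch of Reno-style window updates, applied once per SENDME ack
    # (i.e. once per SENDME_RATE cells). 'm' is the multiplicative decrease
    # factor: 0.5 for classic Reno, 0.7 in many later TCPs.

    SENDME_RATE = 100  # cells per SENDME ack

    def reno_on_sendme(cwnd, in_slow_start, congestion_signal, m=0.5):
        """Return (new_cwnd, still_in_slow_start)."""
        if congestion_signal:
            # Multiplicative decrease ("fast recovery"); also ends slow start.
            # A real implementation applies this at most once per cwnd cells.
            return cwnd * m, False
        if in_slow_start:
            # Exponential growth: +100 cells per 100 cells acked (doubling per cwnd).
            return cwnd + SENDME_RATE, True
        # Additive increase: +1 cell per cwnd cells acked.
        return cwnd + SENDME_RATE / cwnd, False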
Section II.B: TCP Cubic [TCP_CUBIC_TOR]

http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-cubic
https://pandorafms.com/blog/tcp-congestion-control/
https://tools.ietf.org/html/rfc8312

TCP Reno unfortunately does not perform optimally for networks where the natural packet loss rate starts to become frequent, which happens for both high bandwidth and high latency networks. This shortcoming has led to a huge explosion of competition among replacements and improvements for this use case, and TCP Cubic has won out (for now).

TCP Cubic basically takes TCP Reno and tweaks the additive portion of AIMD behavior to be a cubic function (y=x^3) instead of linearly additive. The cubic function was chosen because it is concave, then flat, and then convex. The concave piece allows for quickly recovering after congestion (or non-congestion-related packet loss), and the convex piece allows for gently probing for fresh spare link capacity. Because of adoption by both the Linux kernel and the Windows kernel as the default TCP, TCP Cubic is now the most popular TCP on the Internet.

Unfortunately, TCP Cubic is complicated. It has multiple magic numbers, and is defined piecewise depending on how its congestion window would compare to the equivalent TCP Reno window (to ensure the "TCP Friendly" property -- without this hack, TCP Cubic is "too polite" and TCP Reno eats its lunch). It additionally has time-based components that must be clocked relative to the current measured RTT (though it does not use RTT as a congestion signal).

The cubic congestion window is defined as:

  W_max = previous window size before last congestion signal
  W_cubic(t) = C*(t-K)^3 + W_max

With magic number constants:

  C = 0.4
  beta_cubic = 0.7
  K = cubic_root(W_max*(1-beta_cubic)/C)

To maintain competitiveness with TCP Reno, Cubic keeps a running estimate of an equivalent TCP Reno's window size, using the AIMD math cited for Reno above:

  W_est(t) = W_max*beta_cubic + [3*(1-beta_cubic)/(1+beta_cubic)] * (t/RTT)

  if W_est(t) > W_cubic(t):
    cwnd = W_est(t)
  else:
    cwnd = cwnd + (W_cubic(t+RTT) - cwnd)/cwnd

Upon a congestion signal, Cubic uses its beta_cubic parameter (0.7), and thus does not back off quite as far as Reno:

  W_max = cwnd
  cwnd = cwnd * beta_cubic

It is pretty easy to convince yourself that Cubic will do at least as well as TCP Reno (thanks to the AIMD Reno window comparison hack), but it is very hard to conceptualize the above and convince yourself that it is the best we could possibly do for our use case, especially since Tor is *not* affected by an increasing inherent drop rate as bandwidth increases, unlike the Internet. For these reasons, the complexity of TCP Cubic does not strike me as worthwhile, as compared to other, simpler candidates.

Section II.C: TCP Vegas and [TCP_VEGAS_TOR] [PROPOSAL_CANDIDATE]

http://pages.cs.wisc.edu/~akella/CS740/F08/740-Papers/BOP94.pdf
ftp://ftp.cs.princeton.edu/techreports/2000/628.pdf (lol ftp, sorry)

Similar to BOOTLEG_RTT_TOR from my Options mail, TCP Vegas was designed to make use of RTT and bandwidth estimation to avoid queue congestion, and thus avoid any dropped packets or ECN signals. However, TCP Vegas actually directly estimates the queue length at the bottleneck router, and aims to minimize the number of packets in that queue.

To accomplish this, TCP Vegas maintains two RTT measurements: RTT_current and RTT_min. It also maintains a bandwidth estimate. The bandwidth estimate is the current congestion window size divided by the RTT estimate:

  BWE = cwnd/RTT_current

(Note: In later TCP Vegas relatives like TCP Westwood, this bandwidth estimate was improved to use an explicit count of the unacked packets in transit, instead of cwnd.)

Recall that in order to utilize the full capacity of the link, all TCPs attempt to keep the congestion window as close to the bandwidth-delay product (ie: the "memory capacity") of the transmission path as possible.
The extra queue use along a path can thus be estimated by first estimating the path's bandwidth-delay product:

  BDP = BWE*RTT_min

TCP Vegas then estimates the queue use caused by congestion as:

  queue_use = cwnd - BDP
            = cwnd - cwnd*RTT_min/RTT_current
            = cwnd * (1 - RTT_min/RTT_current)

If queue_use drops below a threshold alpha (typically 2-3 packets for TCP, but perhaps double or triple that for our smaller cells), then the congestion window is increased. If queue_use exceeds a threshold beta (typically 4-6 packets, but again we should probably double or triple this), then the congestion window is decreased. This update happens only once per cwnd packets:

  if queue_use < alpha:
    cwnd = cwnd + 1
  elif queue_use > beta:
    cwnd = cwnd - 1

As a backstop, TCP Vegas will still halve its congestion window in the event that a congestion signal (packet drop or ECN) does occur. However, such packet drops should not actually even happen, unless Vegas is fighting with a less fair control algorithm that is causing drops instead of backing off based on queue delay.

Vegas also uses this queue_use estimate to govern its initial slow start behavior. During slow start, the Vegas congestion window grows exponentially, but only half as fast as Reno's (to avoid excessive congestion). For us, this means that cwnd is updated on every SENDME ack by:

  if queue_use < gamma:
    cwnd = cwnd + SENDME_RATE/2

This exponential increase continues until the queue_use estimate exceeds "gamma" (which is determined experimentally, and sadly not listed in the TCP Vegas paper). If I had to guess, gamma is probably close to beta (4-6 packets).

Because Vegas tries much harder to avoid congestion than TCP Reno, Vegas is vulnerable to cheating through fairness abuse. When TCP Vegas competes against a TCP Reno-style TCP that responds only to packet drops (or ECN signals in our case) and ignores RTT, TCP Vegas detects congestion earlier than TCP Reno would, and ends up yielding bandwidth to TCP Reno. This means that Vegas does *not* have the "TCP Friendly" property that modern AQM algorithms depend on for their parameterization.

Note that we should expect the same thing to happen if TCP Vegas competed with the BOOTLEG_RTT_TOR algorithm run by cheaters. However, since RTT is being used as the congestion signal in both cases, Circuit-EWMA should still balance this out, but we should experiment with this to confirm. There may still be additional queuing memory costs in the presence of cheaters, compared to either pure [TCP_VEGAS_TOR] or pure BOOTLEG_RTT_TOR. Even so: this form of cheating is also pretty easy for honest Exit nodes to detect, and cheating clients only get benefit in the upload direction, even if they did cheat.

To see this a bit more concretely, if we sub TCP Reno into BOOTLEG_RTT_TOR, during slow start we have:

  T = (1-a)*rtt_min + a*rtt_max

  if rtt < T:
    cwnd = cwnd + SENDME_RATE
  else:
    exit slow start

And then for steady state:

  if rtt < T:
    cwnd = cwnd + SENDME_RATE/cwnd
  else:
    cwnd = cwnd * 0.5

The 'a' parameter from BOOTLEG_RTT_TOR is operating similarly to the queue_use criterion of [TCP_VEGAS_TOR], but the window update rules are [TCP_RENO_TOR], which are more aggressive than TCP Vegas. So effectively, this is still [TCP_VEGAS_TOR] with a different target condition for queue_use.
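Putting the Vegas-style pieces above together, here is a minimal sketch, normalized to cells and clocked on SENDME acks. The alpha/beta/gamma defaults are placeholders along the lines of the "double or triple the TCP values" guess above, not tuned numbers:

    # Sketch of a Vegas-style update, applied once per cwnd worth of acked
    # cells. Thresholds are illustrative placeholders, not tuned values.

    SENDME_RATE = 100  # cells per SENDME ack

    def vegas_update(cwnd, rtt_current, rtt_min, in_slow_start,
                     alpha=6, beta=12, gamma=12):
        """Return (new_cwnd, still_in_slow_start)."""
        bwe = cwnd / rtt_current          # bandwidth estimate
        bdp = bwe * rtt_min               # bandwidth-delay product estimate
        queue_use = cwnd - bdp            # estimated cells sitting in queues
        if in_slow_start:
            if queue_use < gamma:
                return cwnd + SENDME_RATE / 2, True   # grow half as fast as Reno
            return cwnd, False                        # too much queue: leave slow start
        if queue_use < alpha:
            return cwnd + 1, False
        if queue_use > beta:
            return cwnd - 1, False
        return cwnd, False

Note that the BOOTLEG_RTT_TOR variant sketched just above has the same overall shape, with the queue_use thresholds replaced by the RTT threshold T, and with the more aggressive Reno window updates.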
This means we could expect BOOTLEG_RTT_TOR to drive circuits to a higher rate of queue use than [TCP_VEGAS_TOR], and senders who use it may out-compete Vegas, much like TCP Reno out-competes Vegas, until Circuit-EWMA starts punishing them (at the expense of more queue memory waste).

This also shows us a crucial testing consideration: [TCP_VEGAS_TOR] may not compete well with our current SENDME congestion control, since its low queue_use target may cause it to de-prioritize itself below competing SENDME flows. BOOTLEG_RTT_TOR, on the other hand, will allow us to set its 'a' parameter at whatever point makes it competitive with SENDME. For this reason, BOOTLEG_RTT_TOR may yield better performance characteristics until all clients upgrade.

Section II.D: TCP Westwood [TCP_WESTWOOD_TOR] [PROPOSAL_CANDIDATE]

https://tools.ietf.org/html/rfc8290#section-4.1
http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-westwood

TCP Westwood is essentially TCP Reno, but if the current BDP estimate is greater than half the current congestion window, TCP Westwood uses that estimate instead of halving the congestion window when a congestion signal occurs. In other words, upon a congestion signal, TCP Westwood does:

  cwnd = max(cwnd * 0.5, BDP_estimate)

This has the effect of recovering from congestion events much more quickly than TCP Reno, and thus utilizing more of the path capacity. Westwood+ also has some improvements on TCP Vegas's BDP_estimate, in order to smooth it in the presence of delayed and/or dropped acks:

  BWE_k = a*BWE_(k-1) + (1-a)*BWM_k

We may not need this smoothing as much, because we do not have any dropped acks, and dropped acks were far more damaging to TCP Vegas than slightly delayed acks.

Section III: Queue Management Algorithms [QUEUE_MANAGEMENT]

On the Internet, queue management algorithms are used by routers to decide how long their queues should be, which packets they should drop and when, and which connections to send ECN signals on and when. Queue management is almost as large a research area as congestion control. Thanks to bittorrent and video streaming, queue management also grew to consider fairness or QoS between flows. This is done by separating queues, or by imposing a scheduling algorithm on packet ordering. For an excellent overview of fairness vs queue management, see RFC 7806:

https://tools.ietf.org/html/rfc7806

Since Tor relays cannot drop packets, our design goals are slightly different from those of Internet queue management. We care about the following design goals:

  1. Prioritize well-behaved/light circuits over loud/heavy ones
  2. Decide when to send ECN signals
  3. Decide which circuits to send ECN signals on
  4. Detect which circuits are queuing too much (or otherwise cheating)

Recall that Tor has already deployed Circuit-EWMA in order to meet goal #1. Circuit-EWMA tries to give priority to less loud circuits in the circuit scheduler, so that interactive applications experience less latency than bulk transfers.

https://www.cypherpunks.ca/~iang/pubs/ewma-ccs.pdf

Circuit-EWMA can be viewed as a fairness algorithm as per RFC 7806 Section 2, since it schedules flows independently, with the target fairness property of prioritizing quiet circuits' cells over loud ones. When Circuit-EWMA is used in combination with RTT as a congestion signal, it is effectively *also* a marking algorithm.
This is because it *also* distributes RTT congestion signals more frequently to loud circuits:

https://tools.ietf.org/html/rfc7806#section-2
https://tools.ietf.org/html/rfc7806#section-3

Recall that Tor also had to deploy KIST, to move queues from the Linux kernel into Tor, so that Circuit-EWMA could properly prioritize them. See Sections 4 and 5 of the full-length KIST paper for details on this interaction, and see page 5 for a good graphical depiction of all the queuing points in Tor and the kernel:

https://matt.traudt.xyz/static/papers/kist-tops2018.pdf

It is likely the case that Circuit-EWMA and KIST are all we need if we only use RTT signaling. But if we add in ECN, at minimum we will need to update Circuit-EWMA to inform us which circuit(s) to send ECN signals on. In fact, thanks to RFC 7806's distinction between fairness and management, modern Internet queue management algorithms compose well with Circuit-EWMA, allowing us to still use Circuit-EWMA for the fairness portion, but use an existing Internet algorithm for the ECN marking. So it is useful to review Internet history here, as well as the current state of the art.

On the Internet, queue management algorithms arose fairly early, for one primary reason: early routers performed what is called "tail dropping". Tail dropping means that routers allowed their queues to get full, and then started dropping all packets that arrived while the queue remained full. This has all kinds of negative consequences, but the easiest to intuitively understand is that if you wait until your queue is full before you start dropping packets, and then drop all of the packets that arrive from then on, competing TCP connections will all back off at the same time, leading to lulls in network utilization.

The key innovation to address tail dropping was to randomly drop (or ECN mark) packets early, before the queue was full, with a per-packet drop probability. This probability was computed based on many factors of the link, and sometimes based on individual TCP connection tracking. Until very recently, both queue management and fair queuing algorithms tended to have many knobs that must be tuned for specific kinds of links and load levels.

The Internet seems to be converging on three main contenders for "knobless" queue management: PIE, ARED, and CoDel. Here's some recent review material:

http://content.ikr.uni-stuttgart.de/Content/Publications/Archive/Wa_EUNICE_14_40255.pdf
https://datatracker.ietf.org/meeting/88/materials/slides-88-iccrg-4/

However, be aware that these knobless systems make assumptions about the congestion control algorithms and the latency of the connections that may not apply to us. The assumptions are most clearly specified in Section 3.2 of the CoDel RFC, and are closely related to the "TCP Fairness vs TCP Reno" assumption from our congestion control review:

https://tools.ietf.org/html/rfc8289#section-3.2

Ok, so let's review the top "knobless" contenders.

Section III.A: PIE and ARED [ARED_PIE_TOR]

https://tools.ietf.org/html/rfc8033
http://www.icir.org/floyd/papers/adaptiveRed.pdf

As a quick summary, both PIE and ARED manage their queues by dropping packets probabilistically. In both systems, the per-packet drop probability is updated in proportion to average queue transit time -- longer queues mean the drop probability is increased. Both systems decide whether to drop individual packets from the queue using this probability. This has the effect that louder flows (with more packets) are also more likely to experience drops (or ECN congestion marking).
But otherwise, they have no explicit notion of fairness. This makes them marking algorithms as per RFC 7806's classification. If explicit QoS or fairness is desired, they have to be deployed in combination with a fairness algorithm to prioritize between separate queues, again as per RFC 7806 Section 3.3:

https://tools.ietf.org/html/rfc7806#section-3.3

Recall that Tor has multiple levels of queues: circuit queues, TLS connection buffers, and kernel buffers, as per Figure 1 on page 5 of the KIST paper:

https://matt.traudt.xyz/static/papers/kist-tops2018.pdf

Even if we keep Circuit-EWMA for scheduling (and thus fairness), we can still add PIE or ARED on each TLS connection. If the average total queue length for all circuits on a TLS connection grows beyond the marking threshold for PIE/ARED, we could then begin sending congestion signals on circuits, as they send cells, as per the cell drop probability of these algorithms. However, because we only want to send one BACKWARD_ECN_TOR signal per circuit, we will have to save these signals for circuits that have not yet gotten one. It is an open question how we should manage the mark probability if we have to pass over a circuit because we already sent a backward ECN signal on it.

Section III.B: CoDel and FQ_CoDel [CODEL_TOR] [PROPOSAL_CANDIDATE]

https://queue.acm.org/detail.cfm?id=2209336
https://tools.ietf.org/html/rfc8289
https://tools.ietf.org/html/rfc8290

CoDel is a relatively recent algorithm that is gaining traction because of its simple elegance. Like PIE and ARED, CoDel itself is a marking algorithm, and can be combined with any fairness/QoS algorithm. It also comes with its own fairness algorithm, called FQ_CoDel.

https://tools.ietf.org/html/rfc7806#section-3.2

Despite being touted as parameterless, CoDel does have two parameters:

  inspection_interval <- set above max RTT (but not too high)
  min_queue_target    <- set to 5% of min RTT

CoDel tags packets with timestamps as soon as they enter the queue. Every inspection_interval, it examines the queue to see if any packets have waited longer than min_queue_target. If they have, it begins dropping packets (sending ECN signals in our case).

Because CoDel accurately tracks the time *each* packet spends in the queue, it is able to differentiate between "good" queues (not congested) and "bad" queues (congested) on a packet-by-packet basis. This good vs bad distinction is determined by comparing the packet transit time to min_queue_target. If a packet exceeds this transit time, then the queue is bad, and the packet must be dropped or ECN marked.

It is this per-packet distinction that significantly separates CoDel from PIE and ARED, which both use average queue times. This was the source of some controversy, as per-packet timestamping was initially argued to be too expensive. But on modern commodity hardware, this per-packet timestamping is actually incredibly efficient. It also allows CoDel to converge on the min_queue_target very rapidly, with high link utilization.

However, it would seem that because we can't drop packets, sojourn times will *not* decrease down to min_queue_target immediately upon congestion for Tor. We will have to wait for the actual backoff signal to propagate (1 RTT) before the queue size changes due to backoff. According to Toke Høiland-Jørgensen (one of the FQ_CoDel spec authors), this should not be that significant, but it should be tested.
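As a rough sketch of the core CoDel sojourn-time check, adapted to our no-drop setting (marking instead of dropping), consider the following. The 5 ms target is a placeholder, send_backward_ecn() is a stub standing in for the actual signal, and real CoDel's inspection_interval state machine is omitted for brevity:

    # Rough sketch of a CoDel-style sojourn check that marks a flow (at most
    # once) instead of dropping. Values and names are placeholders only.

    import time

    MIN_QUEUE_TARGET = 0.005   # seconds; "5% of min RTT" is the CoDel guidance

    def send_backward_ecn(circuit):
        # Stub standing in for emitting the one-shot congestion signal.
        circuit["ecn_marked"] = True

    def on_cell_enqueue(cell):
        cell["enqueue_time"] = time.monotonic()

    def on_cell_dequeue(cell, circuit):
        sojourn = time.monotonic() - cell["enqueue_time"]
        if sojourn > MIN_QUEUE_TARGET and not circuit.get("ecn_marked"):
            send_backward_ecn(circuit)     # at most one signal per circuit

Real CoDel also de-escalates once sojourn times fall back below the target; that state machine is left out here.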
FQ_CoDel extends CoDel to provide fairness in a very similar way to Circuit-EWMA, by separating out TCP connections into "generated a queue recently" and "did not generate a queue recently". This allows packets from flows that are not causing congestion to be sent more quickly than packets from flows that are. This also has the side effect that when connections build long FQ queues, their packets spend more time in queues, and have an increased probability of being ECN marked or dropped.

https://ieeexplore.ieee.org/document/8469111

Like CoDel, FQ_CoDel relies on the assumption that it can drop a lot of packets at once to correct queue sizes instantly:

https://tools.ietf.org/html/rfc8290#section-4.1

This makes me think we should not rush to replace Circuit-EWMA with FQ_CoDel, but we may get benefit from doing CoDel on a per-TLS-connection basis, just as described for PIE and ARED above. In this case, it is more clear which circuits to ECN mark: whenever a cell has a Circuit-EWMA circuitmux queue transit time longer than the target queue time threshold (min_queue_target), that circuit gets an ECN mark. If it has already been ECN marked, no biggie. In other words: the decision to ECN mark circuits depends only on how long their cells spend traversing our queues, and does not require any capacity estimates or other measurements. Very clean.

So if we decide we want to try ECN, I think Circuit-EWMA with CoDel is a smart first choice. We would then use CoDel to deliver exactly one BACKWARD_ECN_TOR signal when it detects that a circuit's queues are too long. This RTT/2 signal should cause the circuit to exit slow start more quickly than RTT measurement by endpoints will.

Section IV: Next Steps [NEXT_STEPS]

After this extended review, I am convinced that an RTT-based approach can provide a comprehensive solution to all of our design requirements, and can be enhanced based on experimentation and further additions. Such an approach also only requires clients and Exits (or just clients and onion services) to upgrade to initially deploy.

However, we can still improve on this with a few enhancements. In particular, we can compare [TCP_VEGAS_TOR] to [TCP_WESTWOOD_TOR] to BOOTLEG_RTT_TOR, and see how they perform in competition with each other, and in competition with Tor's current fixed-size SENDME windows. We can also implement CoDel in circuitmux to track transit times and determine which circuits to send a single BACKWARD_ECN_TOR signal on, to exit slow start on circuits more quickly, and we can play around with higher initial window sizes. We can do this experimentation first in Shadow, and also on the live network via consensus parameters.

The immediate next steps here are to create a proper Tor Proposal from the sections tagged [PROPOSAL_CANDIDATE] above.

--
Mike Perry