This near-book-length mail is a pre-proposal to help us review previous congestion control ideas for Tor, and brainstorm new ones. My goal is to turn this material into one or more proposals and provide ways of evaluating these options. With proper voodoo, maybe we can resurrect a couple oldies from the graveyard of mixnet lore, and remix them into some Vaporwave bangers. If we're real lucky with some fresh ideas, maybe we can even make a hip hop Internet streaming chart buster from this mix. But long mail is looong, ok? It's sooo long it has its own soundtrack. Grab your favorite links from the references at the end, maybe print everything out, find the nearest sofa or bed, and curl up with it all. Flip on some Pretty Lights. Let's get that flow Finally Moving. Turn off the tv! The flow is More Important Than Michael Jordan. Listen: the flow is Hot Like Sauce. I hope you're an Organ Donor - Extended Overhaul might be necessary. We're going on a mental Midnight Voyage by Ghostland Observatory. But don't worry, we'll make it by 4 AM, without being trapped there, I swear. Hopefully that's not Pushing Time with ya Ample Mammal(s). Relax: there will still be time for some Codeine(g) Dreaming. If you finish reading all of it before you go ROCKABYE BABY, you're definitely a High Roller. All of this is almost exactly an hour of Bumpin' In The Voodoo with Manic Focus. If you'd rather take your time and chillax with something Muy Tranquilo but less Gramatik, just put on some Blank Banshee. But don't fall asleep, or George Clanton will Kill You In Bed. Ready? Ok, back to serious business. Motivation: Load balancing (TorFlow/sbws), circuit build timeout cutoffs (CBT), and QoS (KIST+EWMA) have provided huge perf wins for Tor. Tuning and improving these systems will continue to provide latency improvements in the average case, but congestion control is the missing piece we need for the network to operate properly at high utilization levels. Congestion control is also necessary to increase the throughput of the network at even at low utilization levels. Historically, Tor performance shows high correlation to network utilization levels, and I believe this is largely due to the effects of ephemeral congestion: https://lists.torproject.org/pipermail/tor-scaling/2019-June/000051.html In fact, Section 6.3 of https://www.freehaven.net/anonbib/cache/murdoch-pet2008.pdf uses Pollaczek-Khinchin queuing theory to show that expected Tor queue latency is proportional to network utilization divided by spare capacity, if queues are not otherwise bounded somehow (by congestion control). Summary: The rest of this post is organized as follows: First, I go over TCP and ECN, because I build upon those for a couple new ideas. Then, I summarize Tor's current SENDME flow control and review its many failings. Then, I review past attempts at congestion control for Tor in the research literature. Then, I propose four new candidate ideas. Finally, I conclude the post with some ideas for evaluation and further analysis. Unless you are a congestion control expert who is also deeply familiar with Tor's many failed attempts in this area, I strongly recommend wading through this post in order. The summaries are meant to get you up to speed without having to do quite as much reading as I did, and they are filled with in-line URL references in case you want to dig deeper. If you still want to skip ahead to the highlights, search this mail for [PROPOSAL_CANDIDATE]. There are four such sections. Those sections build on other ideas, though. For those, search for [TCP_HISTORY], [BOOTLEG_RTT_TOR], [FORWARD_ECN_TOR], and [BACKWARD_ECN_TOR]. Search for [TRACK_LISTING] to find a summary table of all the new ideas from this post, with their associated key properties. Crucially absent from this pre-proposal post is the closely related discussion of QoS/scheduling, which we need to make decisions about which circuit(s) to select for delivering congestion control signals, when congestion occurs. For now, this post just handwaves this piece as "use either EWMA or something RED-like". There is much literature on these mechanisms, enough to warrant separate treatment, but thankfully the choice of which to use is orthogonal to the choice of congestion control signal delivery mechanism. -I. History of Internet Congestion Control [TCP_HISTORY] In TCP, each endpoint maintains send and receive buffers of packets, called windows. The receive window holds packets until a contiguous chunk is received, at which point data is delivered to the application and an acknowledgement packet is sent to the other end. The send window size is doubled every ack until the first packet is dropped, after which it is increased by 1 for each acked packet, and halved for each drop. This is called Slow Start with AIMD (Additive-Increase Multiplicative-Decrease). Packets that are not acked are retransmitted from the send window buffer after a timeout of one RTT (Round Trip Time). Drops are caused by intermediate routers' fixed-size queues being full, but can also be caused by poor network conditions (which leads to sub-optimal throughput). In this way, TCP responds to congestion anywhere on the path, and takes around one RTT to detect said congestion. Even this detailed summary is a simplified model. In reality, TCP is way more complicated than that. The real thing has like 23 states, and a whole bunch of kruft from the 80s and 90s that we'll want to cut out of our mix. We just need the key hooks. Decades later, Explicit Congestion Notification (ECN) was proposed to allow routers to explicitly set a flag on TCP packets to signal that their queue is getting full, instead of dropping packets. For abstruse compatibility and reliability reasons, when a router adds this flag to a packet, it is sent all the way to one endpoint. That endpoint *then* re-adds the flag to acks going in the other direction, all the way to the other endpoint, who then backs off as if it detected a packet drop. This is called Forward ECN: https://tools.ietf.org/html/rfc3168#section-6.1 Unfortunately, because of the requirement to bounce the congestion notification flag all the way off the other endpoint, it *still* takes Forward ECN at least one RTT to respond to congestion, and so almost no intermediate routers have bothered to deploy it -- the gains are only marginal. Also interesting for our purposes is the vastly simplified Backwards ECN that uses separate ICMP signaling to eliminate the endpoint echo RTT for much faster responsiveness to congestion: https://tools.ietf.org/html/draft-salim-jhsbnns-ecn-00 Backwards ECN was never widely deployed either, because ICMP is easy to spoof+spam, requires extra packet overhead, many routers filter it, and no strong authentication binds it to the TCP connection. But enough history! Onward! @. Status quo of the Tor flow: SENDME sadness and pain Tor has a window-based flow control system called SENDMEs. This system does not provide congestion control, as window sizes are hard-coded constants. I believe it is important to review this system in detail, so we can avoid making any of the same mistakes again. Bear with me. Recall that Tor circuits are 3 hops long when contacting the Internet (Guard, Middle, Exit), and 7 hops long when a client contacts an onion service. TCP is used for connection in between these hops. One or more reliable streams are multiplexed inside each circuit. Each endpoint of a circuit (client and Exit, or client and onion service) maintains a remaining-to-send cell count for each circuit and stream, which is refilled when it receives a SENDME from the other end. Each time it sends a cell, it decrements this count. When this count reaches 0 without receiving corresponding SENDMEs, it will stop sending data. The initial value of these counts is 1000 cells for circuits and 500 cells for streams, and each SENDME refills 100 cell counts for circuits, and 50 cell counts for streams. Crucially: no backpressure happens on interior relay connections for this windowing system. Instead, we just queue. If you have a TCP background, you may find it easier to mentally replace "SENDME" with "ACK", and you won't be far off. Its just that our acks always ack a fixed amount of cells, and the window size is also fixed. In the steady state stage, window updates of 50 or 100 cells per SENDME results in an upper bound of performance inversely proportional to half the circuit RTT. This is 2*CELL_SIZE*50/RTT for streams, and 2*CELL_SIZE*100/RTT for circuits. For streams with a ~100msec RTT, this is ~500Kbytes/sec. This means that once the Tor network has enough capacity for RTT to be close to link transit time (ie: no queue delay), adding more Tor relays will *not* make Tor faster. This also means that onion service throughput will almost always be much lower than Exit throughput. The RTT is much much higher for 7 hop onion circuits than for 3 hop exits, so even if there is plenty of spare network capacity, Onion service sites will *never* download as fast as their clearweb equivalents with our current flow control. All of this is done just to provide flow control, so that data flow can stop in the event of endpoint stalls and/or interior relay failure. It does not actually provide fairness or limit congestion when multiple clients are used. Additionally, because nothing proves the endpoint has read the data, a client that does not read data but keeps sending SENDMEs to the Exit can deliberately force queues to build up in the Guard node (due to the client-blocked TCP layer), to induce OOM conditions and crashes. We need to upgrade all clients and all Exit relays to fix this problem, which will take quite some time: https://gitweb.torproject.org/torspec.git/tree/proposals/289-authenticated-sendmes.txt Even with honest or authenticated behavior, the SENDME protocol means that each circuit can queue up to 1000 cells at an overloaded bottleneck Tor router, which load balancing can help alleviate, but can't correct in transient and degenerate cases. This is why high capacity Exits have obscene memory requirements when Exits are scarce (ie: below ~1/3 total network throughput), despite there being no memory leaks. This also tells us that Tor relays *cannot* scale to support larger numbers of users without a corresponding linear increase in memory requirements. As a side effect of this protocol, it is possible for the client and the Exit to compute the RTT of a circuit by measuring time between every 50th or 100th sent cell and the associated SENDME arrival. Some previous work makes use of this to try to detect congestion, but it is not a property we want to keep, if we can avoid it. No deployed code uses it, and its presence is worrisome: http://people.cs.ksu.edu/~eyv/papers/latency_leak-ccs07.pdf https://www.robgjansen.com/publications/howlow-pets2013.pdf To further complicate things, these SENDME windows only apply to end-to-end data cells on the circuit. Tor also supports other non-data end-to-end cells, as well as non-end-to-end data cells (so-called "leaky pipe topology", which is used for circuit padding), which are not counted against SENDME windows. Both of these details have also caused us problems in the SENDME design. Still, for now, we will ignore both of these details. Full proposal treatment will need to consider them, though. I. Vaporwaving to ghosts in the anonymity graveyard Ok, so let's party like it's 2010 and go chase some ghosts of past anonymity literature and lore. In my review of this area, I have identified the following causes of death for previous congestion control attempts in Tor: * Side channels and other anonymity risks * Slow or limited responsiveness to queue pressure * Poor fairness properties * Poor throughput * Lack of stream flow control * Endpoint fairness cheating ability * Deployment costs Ghost 1: Drop Signaling Cause of Death: Side channel issues; deployment cost Early proposals for congestion control for Tor attempted to simply tunnel OS TCP streams. This uses the OS's native implementation of TCP Slow Start and AIMD as congestion control methods. These attempts all quickly died, due to OS TCP stack fingerprinting problems (ie: nmap): https://murdoch.is/papers/tor11datagramcomparison.pdf Two years ago, I tried to make the case for using QUIC as a drop-in replacement for Tor circuits, and use QUIC's streams to carry Tor streams. Such a network could leverage QUIC's TCP-like drop-signaled congestion control algorithm internally, and terminate QUIC at the Exit node, transforming the QUIC streams into TCP connections initiated from the Exit. This avoids the TCP fingerprinting issues. In fact, even without a full datagram network, Tor could still deploy relay-only drop signaling if we added windowing and retransmission at the circuit layer: https://lists.torproject.org/pipermail/tor-dev/2018-March/013026.html Unfortunately, the present/absent bit vector of missing packets in this window is a communication channel between malicious Tor relays or Internet routers and the Exit node. This effect was an obscure piece of ancient mix network lore until Nick Mathewson publicly documented it on tor-dev a little over a year ago: https://lists.torproject.org/pipermail/tor-dev/2018-November/013562.html I was the voice of optimism in that grim post, but there is not a lot of hope to be had. It still seems to me that adding cover traffic will reduce the bandwidth of this side channel, but it is a deeply unsolved traffic analysis problem to quantify the level of cover traffic needed, on top of the engineering expense of protocol and cryptographic revamping needed to support cell drops at intermediate relays. It may also be possible to add a MAC scheme that prevents excessive drops by intermediate relays and/or internet routers, but then we may lose a lot of congestion signaling capability and responsiveness. Aside: This drop side channel is much worse for VPN systems that blindly tunnel OS-level TCP. Intermediate Internet routers between the VPN client and the VPN server can use this side channel to send information to any Internet router after the VPN server, via exposed TCP sequence numbers. In a Tor-like system with Exit stream termination, the Exit relay must be one of the side channel participants. For more fun VPN TCP side channels, see also https://seclists.org/oss-sec/2019/q4/122 and https://tools.ietf.org/html/rfc6040. Ghost 2: Datagram Tor with uTP/LEDBAT Signaling Cause of Death: Responsiveness; fairness/cheating; side channels uTP/LEDBAT is the bittorrent transport that measures the RTT of a connection, and backs off if high latency is detected. The theory is that latency above the maximum acceptable value (100ms by spec) indicates that router queues have formed, due to competing traffic causing a bottleneck. Typically these queues form at poor-quality consumer edge routers that queue way too much for their link capacity and user counts. https://tools.ietf.org/html/rfc6817 https://research.torproject.org/techreports/libutp-2013-10-30.pdf Because LEDBAT's goal is to use the link fully, and yield capacity in presence of competing traffic, LEDBAT congestion control requires that target latency upper bounds be known, and also expects some queuing to occur to cause this latency. If network drops still occur, it also uses TCP-like behavior, with similar side channel issues for our usecase. If inherent path latency ever exceeds the LEDBAT target max, throughput will plummet to near-zero. This is bad for Tor, as our path latency is highly variable, especially between 7 hop onion circuits vs 3 hop exit circuits. Additionally, malicious LEDBAT clients can cheat by tolerating a *larger* max delay than other clients. They will thus accept larger queue sizes, which allow larger windows, which allow marginally better throughput, at the expense of more congestion latency for everyone. Ghost 3: Congestion Aware Tor Cause of Death: Hazy anonymity analysis; responsiveness; throughput Congestion Aware Tor uses more detailed per-node latency estimates to measure transient congestion-induced delay and respond to it by altering path selection: https://www.cypherpunks.ca/~iang/pubs/Congestion_Aware_FC12.pdf The downside of Congestion Aware Tor is that rather than managing the congestion window, it suggests simply migrating circuits and changing path weights. This has hazy effects on anonymity, as the set of viable paths in the network is reduced by some hard-to-predict amount. Our Circuit Build Timeout code (CBT) is able to do this kind of path pruning in a significantly more precise manner by setting the circuit timeout such that nearly exactly 80% of all possible network paths are used, but it does not continually perform this timing measurement once the circuit is built. We could build a hybrid system with CBT-style math, with continual measurements, with circuit migration, and with conflux multipath, but this will still require retransmission buffers for reliable circuit migration, will be slower to respond than true window control, and can't actually cap congestion. Worse, circuit migration won't reduce congestion at Exit nodes (you have to keep the same Exit or streams will die). Even with these changes, it also will not remove the throughput cap, or improve onion service throughput. Ghost 4: N23 Tor (aka DefenestraTor) Cause of Death: Stream control; side channels; poor throughput N23 Tor is described in Section 4.2 of the DefenestraTor paper: https://cseweb.ucsd.edu/~savage/papers/PETS11.pdf Basically, each router limits queue length for each circuit to N2 + N3, which are consensus and circuit parameters, respectively. N3 is tuned per circuit by latency measures similar to Congestion Aware Tor, but is still capped at 500 per circuit. This system died primarily because we could not figure out how to bolt stream flow control back on to it. But that is a solvable problem (and any circuit-level congestion control system we deploy must solve it): https://lists.torproject.org/pipermail/tor-dev/2012-November/004138.html https://lists.torproject.org/pipermail/tor-dev/2012-November/004143.html I was not involved in the evaluation of N23 because I was designing and building Tor Browser at the time, but upon diving into it now, I have many additional concerns. First: Because queues are per-circuit but backpressure is per-connection, the backpressure suffers from multiplexing information leak issues. If a circuit manages to fill its N23 limit, the only way for the system to have backpressure to enforce N23 queue sizes is to stop reading entirely on that circuit's upstream TCP connection, stalling *all* other circuits on *only* that particular connection. This property is extremely worrisome from an anonymity standpoint, as it gives a client adversary a mechanism to experimentally stall pairwise router connections to probe for the path taken by a long-lived active target circuit (such as an onion service intro point). Second: It's not clear to me how credits are sent in both directions on a circuit, so that the congestion control can be applied in both directions. Does anyone remember? (Does anyone care?) Third: At the end of the day, the system still has total queue lengths that scale in proportion to the number of circuits on the relay, rather than being globally fixed and controlled through fairness properties. In my view, this is the final nail in the coffin: it's not an improvement for scalability, actually decreases throughput (due to the new lower window cap), and can't control congestion enough to reduce the long-tail latency. Indeed, the perf CDFs from the paper show all of these effects, if you look closely. Ghost 5: RTT Tor [BOOTLEG_RTT_TOR] Cause of Death: Sender cheating; Small max window size Buried in the Defenestrator paper is an unnamed system that uses RTT to estimate congestion and update windows accordingly, much like LEDBAT. It is described in Section 4.1 of https://cseweb.ucsd.edu/~savage/papers/PETS11.pdf I'm going to call this system RTT Tor. RTT Tor works by tracking the min and max RTT observed on a circuit, using the SENDME cell counting side effect. It then picks a threshold timeout T, at a tuneable point in between the min and max observed RTT. If the most recently measured RTT exceeds T, the circuit is declared congested. Unlike LEDBAT, because RTT Tor dynamically chooses T rather than using a hardcoded target RTT, it is possible to use on Tor. The window starts at 100 cells. So long as the RTT stays below T, the window grows by an additional 100 cells every SENDME. If the RTT exceeds T, the window is cut in half. Window size max is capped at 1000 cells. The window cannot go below 100 cells. This system depends upon the Exit node's ability to measure RTT to the client in order to have a downstream window. This capability has been shown to impact client anonymity: http://people.cs.ksu.edu/~eyv/papers/latency_leak-ccs07.pdf https://www.robgjansen.com/publications/howlow-pets2013.pdf Furthermore, endpoints can cheat by choosing a higher T value or otherwise acting as if their window size is increasing when it is not. However, while the paper does not describe a solution for such cheating, it is possible to detect if one endpoint is honest. Because both endpoints can measure the RTT, both endpoints can track what their peer's window should be growing to based on this algorithm, and compare that to an estimate of their peer's window size. The peer's window size can be estimated by counting the number of cells that arrive in RTT/2 amount of time. If the RTT is asymmetric or highly variable, this detection mechanism will have false positives, but we can at least reduce the ability and incentive to cheat so long as one of the endpoints is honest. But if both endpoints cheat, intermediate routers have no way of limiting the data that is in flight or can queue up, except for closing the circuit through the circuit OOM killer. Hence the system still must impose a safe cell window max. It must also impose this max size because some circuits may have an RTT_min that is not much different from the RTT_max, even though the circuit is very congested (because it is continually very congested). In this scenario, the system would keep adding more congestion until the circuit OOM killer kicks in. But, as far as options go, this is not a bad one. Plus, it only requires the client and the Exit to support it. II. Fresh Remixes and New Ideas So what if we took some hits from the 90s and remixed their hooks, Tor style? Let's start with the simplest, most TCP-like scheme we can come up with, and then discuss potential variants and improvements to them, and see which of these ideas mix together well. Remix 1: Forward ECN [FORWARD_ECN_TOR] What if we tried to make something as close to TCP ECN as possible for Tor? Let's use an initial window size of 100 cells, and make our SENDMEs into 100 cell ACKs. Let's create a RELAY_COMMAND_QUEUE_PRESSURE relay cell command that can be sent by a relay towards a client whenever the sum total of queued cells exceeds some limit. Which circuit gets the cell can be chosen by EWMA or some other QoS algorithm (ie: RED-like weighted random choice, as per TCP ECN). The cell would indicate the direction that the circuit was loudest on (ie: client-bound or exit-bound). When the client gets the cell, if the direction is exit-bound, the client halves its outbound window. If the direction is client-bound, the client sends the cell back to the Exit node, who then halves its client-bound SENDME send window. (Recall that Tor's "relay commands" are encrypted to/from the client, with Tor's per-hop circuit keys. This means that extra relay cells can be injected by any hop in the circuit towards the client, and intermediate relays cannot read these cells' relay commands. As we will soon see in [BACKWARD_ECN_TOR] and related ideas, using cleartext "cell commands" instead can help make things more efficient by allowing signaling of congestion without requiring extra cell injection, but this also introduces more side channel risk). We can use Slow Start with AIMD here: Before the first RELAY_COMMAND_QUEUE_PRESSURE, outbound windows would double every window. After the first RELAY_COMMAND_QUEUE_PRESSURE, they would increase by say 100, every window. Or something similar. To reduce the potential for side channel injection, the client can ensure that RELAY_COMMAND_QUEUE_PRESSURE do not arrive more often than once per window. To prevent patterns from trivially being encoded on these cells, the client can delay relaying them until they are on a K % M == 0 boundary, for M=5 or 10 or so (high M will reduce responsiveness to congestion). The client can also refuse to relay these cells if they would cause the Exit's window to drop below some minimum size (say 128). Intuition tells me the relays should use total outbound queue size for deciding *when* to send these cells, and use specific circuit queue length and/or activity to decide *which* circuit to send them on. In this way, we globally limit queue size (per hop latency) regardless of the number of circuits, and still enforce fairness among these circuits. However, picking parameters and algorithms for all of this is a research problem, or at least requires much more QoS RFC and research literature review. There may be additional optimizations too, like sending these cells on circuits of any TCP connection that starts blocking or exceeding per-connection queue limits. While this will work if clients are well-behaved, this system has a few drawbacks. First, it will be slow to respond to client-induced download congestion at middle nodes, since we have to wait a full circuit RTT for the client to relay cells to the Exit. Since the Exit is responsible for the download window for web-like clients, this is also the direction that needs the most congestion control, which is unfortunate. Second, clients can easily cheat, even by malfunction: all they have to do is refuse or forget to relay cells to an Exit, and they get faster downloads at the expense of more network congestion for everyone. We could do what RTT Tor did and cap the window size at 1000, but then we're back in the situation where throughput is artificially capped at some bizarre function of circuit RTT, and worst-case latency will not improve as much as we want. We could rely more heavily on the circuit OOM killer, and kill circuits that queue too much on the assumption that they are cheating/malfunctioning, but tuning this to avoid false positives may be tricky. Third, while this system could technically be deployed even if some nodes do not support it yet, this is risky, as nodes who do not support it will experience excessive congestion at worst, and no improvement at best. Fourth, onion services are tricky to handle. Because of the way rendezvous circuits are joined, this system will send RELAY_COMMAND_QUEUE_PRESSURE towards the client for the client-side of the rend circuit, and towards the service for the service end of the circuit. Then, either of these sides would echo that signal as a RELAY_COMMAND_QUEUE_PRESSURE *all* the way to the other side. Either or both could cheat in this case, and again, the only recourse we have is for each relay to enforce strict circuit queue length limits via the circuit OOM killer. Remix 2: Forward ECN mixed with RTT Tor [FORWARD_ECN_RTT_TOR] [PROPOSAL_CANDIDATE] Ok, [FORWARD_ECN_TOR] sounds decent, but because the client mediates everything, it will be slow to respond to client-destined download congestion, and clients can cheat. Also, if some relays don't support the signal yet, the window may become too large for their congested status. What if we *also* mix in [BOOTLEG_RTT_TOR], so that an endpoint's send window can only grow if *both* conditions hold: the measured circuit RTT stays below RTT Tor's T threshold, *and* there are no congestion signals coming from [FORWARD_ECN_TOR]? The addition of the signaling cell from [FORWARD_ECN_TOR] also allows us to be more robust in the cheating detection described in [BOOTLEG_RTT_TOR], and eliminate false positives. If either endpoint detects a window that is growing too large for their measured circuit RTT (by counting the number of cells arriving in that RTT, and comparing that to what the window should be based on the RTT window growth algorithm), it can send a warning shot RELAY_COMMAND_QUEUE_PRESSURE, explicitly telling the cheater to back off. If the window still grows because the other endpoint ignores this warning shot, it can close the circuit. This system is fully incrementally deployable: it can be used if only the client and Exit support it. It is resilient to cheating so long as Exits are honest, without false positives, and even without intermediate relay support. While we still aren't doing better than 1 RTT response to congestion, we now have a good answer for cheating that doesn't have any false positive risk for good actors. Furthermore, because we also have explicit congestion signaling, we no longer have to worry as much about imposing a max window size, to protect circuits for which RTT_min is close to RTT_max due to persistent congestion. Good actors on these circuits will back off, and for actors acting badly enough to add additional congestion, RTT_max will increase enough for them to eventually be detected. Downside: Since onion services have no Exit node, if we want to use the RTT mechanism to mitigate cheating for them, we must terminate congestion control at the Rendezvous point for this proposal idea, so that onion service clients and services can't collude to get faster service. The RP would then be responsible for examining each half of the spliced circuit's congestion windows, and sending ECN signals down whichever side had a larger window. This may or may not be worth the additional complexity, as opposed to just employing the circuit OOM killer if circuit queues get too long (due to cheating). The only remaining downside I see is that we are relying on baking the ability to make RTT measurements into the protocol. Still, I think this is worth turning into its own proposal. Does anyone see any other issues, or have any suggestions? Remix 3: Backward ECN [BACKWARD_ECN_TOR] Ok so the [FORWARD_ECN_RTT_TOR] mix is pretty tight. We've got an incrementally deployable system that is somewhat resistant to cheaters, and responds to congestion within an RTT. Can we do better? Yes. Remember that Backward ECN IETF draft that we covered in [TCP_HISTORY] earlier? It had that RTT/2 response with a hyphy hook. Let's sample that. First: just like the other proposals, we still need to use SENDMEs, to ack received packets. However, since we are not measuring RTT here, it does not have to be exactly every 100 cells. The SENDME can be sent at a randomized cell count to obscure RTT, with a different randomized ack cell count value included in the SENDME. In addition to RELAY_COMMAND_QUEUE_PRESSURE, which is sent to the client and can't be read by any other intermediate relays, let's make a new *cell* command type, CELL_COMMAND_RELAY_QUEUE_PRESSURE. Because this is a cell command, it can be read by intermediate relays, and can be set by an intermediate relay on any cell heading towards the Exit, simply by flipping cell_t.command from CELL_COMMAND_RELAY to CELL_COMMAND_RELAY_QUEUE_PRESSURE, so long as all relays in the circuit understand this new command. This also avoids sending any additional empty cells when congestion occurs in the common Web download direction, which is nice. We should still respond to congestion with Slow Start AIMD: when the client or exit gets this relay cell or cell command, it cuts its send window in half in that direction. Otherwise windows grow during transmission similar to TCP (ie double the window size each window until the first congestion cell, then linear). The downside of these cells being a new end-to-end cell_t.command is that it opens up side channel attacks we saw with the RELAY_EARLY attack: https://blog.torproject.org/tor-security-advisory-relay-early-traffic-confirmation-attack The attack is to flip this cell_t.command field on a sequence of cells to encode a bitstring, to enable the Exit to send data to the Guard. So in this mix, let's still use RELAY_COMMAND_QUEUE_PRESSURE towards the client. We will only use the relay-visible cell_t.command CELL_COMMAND_RELAY_QUEUE_PRESSURE towards the Exit. This means we only need to worry about Guard to Exit side channels, which require much more information to be useful (ie: they must encode client IP address or some other unique client identifier). Note that because all intermediate relays can see the cell_t.command, they can enforce similar properties as clients did in [FORWARD_ECN_TOR], to limit side channels in the Guard to Exit direction. For instance, they can ensure that these commands do not get sent more often than once per window of client-bound incoming cells. They can also enforce that the Exit-bound cell_t.command = CELL_COMMAND_RELAY_QUEUE_PRESSURE in the outgoing direction is only set on K % M == 0 cell count boundaries, for low M=5 or 10. Note that this K % M spacing actually works better for longer circuits, since there is more chance for other relays to flip the congestion bit at each M spacing position, which damages the reliability of the side channel signal. When middle relays enforce that the window is not allowed to drop below win_min=128, a malicious Guard can only inject log2(win_size)-log2(win_min) of these cells in a burst. Once the middle node detects that the window size would be below the legal minimum, it knows either cheating is happening, or a side channel is being used. For a typical win_size of ~8k (ie: 16X faster throughput than current Tor) and a win_min=128, this detection will be triggered in ~6 properly spaced fake CELL_COMMAND_RELAY_QUEUE_PRESSURE commands. However, in the AIMD steady state for this side channel, a malicious guard can attempt to keep the window size as close to win_min as possible, without going below it. Since the window size grows linearly after the first ECN signal, the guard gets to send an additional CELL_COMMAND_RELAY_PRESSURE cell_t.command around once every win_min client-bound incoming cells. So long as no other relay is also sending congestion control signals, the guard can keep flipping bits roughly this often (though the K % M == 0 requirement would restrict the placement of these cells). Downsides: Even just 6 bits of reliable information can be damaging if target client anonymity sets are already small, and so long as client-bound incoming cells keep arriving, there are still more opportunities to flip the cell_t.commands from the Guard to the Exit. The big question here is can these 6+ cell_t.command flips be turned into a large side channel, despite K % M == 0 placement enforcement? And can this side channel be made reliable? If so (and it seems likely that this is so), then this whole idea needs revision (see [EDGE_FORWARD_ECN_TOR] and [START_BACKWARD_ECN_TOR] for possible revisions). We also need all intermediate relays to upgrade to properly forward the CELL_COMMAND_RELAY_QUEUE_PRESSURE cell command just like CELL_COMMAND_RELAY. Also, clients can still cheat by ignoring the encrypted RELAY_COMMAND_QUEUE_PRESSURE. On the plus side, clients who cheat can at best get faster upload speeds. They cannot get faster download speeds unless the Exits also decide to cheat. Finally, the story for cheating onion service endpoints is similarly complicated: Do we have the RP mediate, or fall back to the circuit OOM killer yet again? Remix 4: Backward ECN mixed with RTT Tor [BACKWARD_ECN_RTT_TOR] If we really are worried about upload cheating, we can mix RTT Tor back in, just like we did for [FORWARD_ECN_RTT_TOR], and use its cheating detection. An endpoint that detects abnormally large window sizes (based on arrival cell counts per RTT) could send the RELAY_COMMAND_QUEUE_PRESSURE warning shot to reduce the window. So now we have a system that responds in RTT/2 when supported by all intermediate relays, and falls back to RTT response in the presence of one cheater endpoint or lack of support by intermediate relays. Unfortunately, this still has the side channel issues of [BACKWARD_ECN_TOR], and so I have not tagged it as a proposal candidate. Remix 5: Backward ECN with Transparent Trap [BACKWARD_ECN_TRAP_TOR] Ok so the "SENDME ack exactly 100 cells" 2-step beat is getting a little tired. Can we devise any ways to trap cheaters without having a predicable RTT measurement baked into the protocol, and still respond to congestion in RTT/2? There is a way to defeat the wizard. But there be dragons in this trap house, and they really want to soul bond using side channels while everybody is watching. Hence I have not flagged this as a proposal candidate. Plus, as these mixed metaphors and obscure references suggest, this remix is very busy and complicated. Still, let's consider it because it may be useful to remix later. Let's take [BACKWARD_ECN_TOR], and instead of using RELAY_COMMAND_QUEUE_PRESSURE towards the client, let's use cell_t.command=CELL_COMMAND_RELAY_QUEUE_PRESSURE in both directions. This allows us to perform the same kinds of side channel and cheating checks at the middle node in the Exit to Guard direction, and also avoids sending any extra relay cells during congestion. Just as in [BACKWARD_ECN_TOR], let's randomize the cell count at which we send each SENDME, and include a different randomized ack count in the SENDME, so that each endpoint can add cells back to the window for that SENDME, but cannot compute RTT. Let's also still use Slow Start with AIMD (double the window if there are no congestion signals, half the window on congestion signals, and grow it linear after that). However, this is still not sufficient to stop Exit to Guard side channels! The Exit to Guard direction is different than the Guard to Exit direction: relays can inject arbitrary amounts of full relay cells for the client in the Guard direction, whereas they could not do so in the Guard to Exit direction. In Tor (and other mixnets), this property is called "leaky pipe topology", and is what allows us to send RELAY_COMMAND_DROP padding cells from the middle relay to the client (and vice-versa). This means that the combination of these injected cells plus congestion control can be used to encode a pattern from Exit to Guard, and subvert the checks at the middle node, because relays can inject dummy cells sent towards the client to meet the win_size and M spacing requirements. The good news is that the client-side circuit padding system only allows RELAY_COMMAND_DROP from a hop for which the client has negotiated a padding machine. This means that Exit nodes cannot inject padding for assistance in any side channel usage, regardless of the congestion control scheme we choose. The bad news is that vanilla Tor allows the presence of invalid protocol-violating cells, and does not close the circuit for these cases. Best case, it issues a log message. Worst case: it silently drops them. Thankfully, the vanguards addon is designed such that any cell that is not specifically marked as valid by Tor counts as a dropped cell, after which the circuit will be closed. This is a fail-closed defense. If new code is added for cell processing that forgets to flag cells as properly processed by calling circuit_read_valid_data(), those cells count as false positive dropped cells, rather than false negatives. This impacts reliability, but has the advantage that we won't shoot ourselves in the foot by adding new code that is too permissive without noticing that this code simply does not work properly because circuits get closed when it is exercised. All this means we will need to port this defense into Tor itself, though, or dummy cells can be injected by the Exit for side channel assistance in congestion control schemes or other cases: https://github.com/mikeperry-tor/vanguards/blob/master/README_TECHNICAL.md#the-bandguards-subsystem But what about legit padding causing the Guard to think that win_size and M spacing requirement are violated? Well, so long as any relay with an active padding machine stores the congestion signals so that it can delay them such that they are sent on the next K % M == 0 boundary after padding is applied, these limits can still be enforced, without revealing the quantity of padding. Additionally, any guard node congestion caused by the sum of real+padding traffic can be controlled under this scheme, because the middle relay can observe the guard's congestion signals and decrease padding approperiately. Downsides: If preventing *any* side channel bits from flowing from the Exit to the Guard is important, this scheme likely does not accomplish that. It does seem that Exit to Guard side channels are far more dangerous than Guard to Exit, since all the Exit has to say is "hey this client looked at something I think is interesting", where as the Guard has to communicate a unique identifier for the anonymity set, so that concern alone may kill this idea. This scheme is also complicated, and any implementation errors will bring back the RELAY_EARLY side channel attack in the Exit to Guard direction, even if the math might claim otherwise. Remix 6: Edge-Forward ECN [EDGE_FORWARD_ECN_TOR] [PROPOSAL_CANDIDATE] The common problem with both [BACKWARD_ECN_TOR] and [BACKWARD_ECN_TRAP_TOR] is that they enable side channels by one or both of the edges of the circuit (Guard to Exit, or Exit to Guard). Forward ECN mitigates these side channels, because the edge no longer gets to encode its congestion signal at specific points in a traffic pattern. This works better than win_size and M position limiting by itself, and it can be used in combination with client-side window size checks and M position limiting, too. If middle nodes could somehow reliably detect the Guard and Exit endpoints of a circuit, then the middle node(s) could force one or both endpoints to use Forward ECN [FORWARD_ECN_TOR]. Then the circuit edges would not be able to communicate with each other by directly encoding congestion signals in deterministic traffic patterns. This is easier said than done. In practice, middles can tell that they are middles because they do not get any authentication cells from the client, and they are not asked to open any streams. But what we really need is for middles to be able to tell if their *neighbors* are the circuit edges, because these edges are the only ones that we actually need to forbid from using cell_t.command from BACKWARD_ECN_TOR; it is fine if multiple middles use cell_t.command (for example, in onion service circuits). The only way I can think of to reliably detect circuit edges from the middle position is to forbid either Exits or Guards nodes from being used as middles, so that middles could know when they were next to one of these node types, and therefore next to the edge of the circuit. Then middles could forbid these edges' use of the cleartext CELL_COMMAND_RELAY_QUEUE_PRESSURE. This only-middles-as-middles stratification requirement has been proposed before for several other reasons, so maybe it is not a horrible idea? This is more tricky for onion service circuits, though. For those, we could require that sensitive positions (such as HSDIR, RP, and IP) also be Exit-flagged nodes? Or maybe there is another clever way to detect just the edges of a circuit from the middle position? Even if there is, this proposal still has some cheating risk, as clients can refuse to relay these edge congestion signals to get marginally better throughput. Again, RTT measurements can be preserved here as a backstop against cheating, but for the additional anonymity risk. Otherwise we have to fall back to the circuit OOM killer. Remix 7: Start Backward ECN [START_BACKWARD_ECN_TOR] [PROPOSAL_CANDIDATE] The Slow Start with AIMD window management for all of these systems means that we most need the quick RTT/2 response to the congestion early on in the Slow Start phase, while the window size is growing exponentially. This is also when the network is most vulnerable to cheaters. So let's allow only 1 (or maybe 2) CELL_COMMAND_RELAY_QUEUE_PRESSURE cell_t.commands per circuit in the Guard to Exit direction as per [BACKWARD_ECN_TOR], and maybe even 1 CELL_COMMAND_RELAY_QUEUE_PRESSURE cell_t.command in the Exit to Guard direction from [BACKWARD_ECN_TRAP_TOR]. After these cell_t.command limits are hit, middle nodes forbid any further use of this cell type on that circuit, and enforce that all circuit member relays switch to [FORWARD_ECN_TOR] or [FORWARD_ECN_RTT_TOR] from then on (depending on how concerned we are about cheating vs disclosing circuit RTT). This is my all-around favorite of the options. The only downside is that we have to choose between disclosing RTT, or risking some cheating, but since the cheating can now only happen after slow start has finished, it will provide less speed advantages. If clients cheat with the goal of causing memory exhaustion at relays, the circuit OOM killer can kick in. Remix 8: Your idea here! Anyone have any other old lore to resurrect? Any new ideas? Bonus B-Side Track: Stream Flow Control [PROPOSAL_CANDIDATE] All of the systems discussed in this post address the problem of circuit-level congestion control, but do not perform any stream-level flow control. While we were prototyping N23, Andreas Krey pointed out that a single blocked stream could consume the entire circuit window with unread packets, causing the surprising result that other active streams on the same circuit will also stall: https://lists.torproject.org/pipermail/tor-dev/2012-November/004138.html He then went on to propose some alternatives: XON/XOFF, bundling stream window updates in circuit SENDMEs, or just killing any blocked streams that have too many unread cells: https://lists.torproject.org/pipermail/tor-dev/2012-November/004143.html XON/XOFF is the simplest option, but we will need heuristics to avoid excessive XON/XOFF chatter on application streams that frequently block in normal operation. It also will have RTT/2 delay before it actually causes the other end to stop sending. If the blocked stream is sending cells at the full bandwidth of the circuit, this could very well be a full circuit window worth of cells before the XOFF makes it across. We can just ack those cells with a circuit-level SENDME even though they were not delivered to their blocked stream, but we have to be careful to authenticate what we read in this case, so we do not re-introduce the unauthenticated SENDME attack: https://gitweb.torproject.org/torspec.git/tree/proposals/289-authenticated-sendmes.txt Can we learn anything from the layered flow control in QUIC? It is excessively complicated: https://docs.google.com/document/d/1F2YfdDXKpy20WVKJueEf4abn_LVZHhMUMS5gX6Pgjl4/edit III. Evaluation First, does anyone have any strong objections to any of the remixes tagged with PROPOSAL_CANDIDATE? It will save time if we can decide right from the start that any are completely unacceptable. Bonus points if we can mod them back into being acceptable, or mod one of the remixes that are not tagged with PROPOSAL_CANDIDATE into being acceptable. We also need some criteria to help us determine how we're going to compare the remaining ones. Here is a condensed [TRACK_LISTING], for quick review/comparison and criteria brainstorming: ______________________________________________________________________ | MIX | CHEATING | RESPONSE| SIDE | UPGRADE | | TRACK | DETECTION | TIME | CHANNELS | PATH? | |~~~~~~~~~~~~~~~~~~~~~~|~~~~~~~~~~~|~~~~~~~~~|~~~~~~~~~~~~|~~~~~~~~~~| |BOOTLEG_RTT_TOR | Exit RTT |100c+RTT | RTT | Exits | |FORWARD_ECN_TOR | Circ OOM | RTT | None? | Full Net | |FORWARD_ECN_RTT_TOR | Exit RTT | RTT | RTT |Exits;Full| |BACKWARD_ECN_TOR |Middles;OOM| RTT/2 | Guard->Exit| Full Net | |BACKWARD_ECN_RTT_TOR |Middles;RTT|RTT/2;RTT| Guard->Exit|Exits;Full| |BACKWARD_ECN_TRAP_TOR | Middles | RTT/2 |Guard<->Exit| Full Net | |EDGE_FORWARD_ECN_TOR |Middles;OOM| 2*RTT/3 | None? | Full Net | |START_BACKWARD_ECN_TOR|Middles;OOM|RTT/2;RTT| Low/None? | Full Net | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ What are the things we need to decide if a scheme is acceptable? Here's a laundry list. Please do add any others you think of: - Side channel accounting - Interaction with non-data traffic, padding traffic, and leaky-pipes - Anonymity analysis as per https://www.robgjansen.com/publications/howlow-pets2013.pdf - No/minimal cheating - No throughput cap - Quick responsiveness (at least RTT; RTT/2 is better) - Fairness analysis (may be a separate proposal on QoS/EWMA) - Onion service deployment - Simulation methodology that will reflects live deployment - Deployment costs - Upgrade path Must-Read References: Cultural etymology of Tor's end-of-decade donations campaign aesthetic: https://en.wikipedia.org/wiki/Vaporwave https://blog.torproject.org/better-internet-possible-ive-seen-it Tor congestion latency is proportional to network utilization divided by spare_capacity: Section 6.3 of https://www.freehaven.net/anonbib/cache/murdoch-pet2008.pdf RTT-based Congestion Control for Tor: Section 4.1 of https://cseweb.ucsd.edu/~savage/papers/PETS11.pdf TCP ECN: https://tools.ietf.org/html/rfc3168#section-6.1 Backward ECN: https://tools.ietf.org/html/draft-salim-jhsbnns-ecn-00 Stream Flow Control Issues and Ideas: https://lists.torproject.org/pipermail/tor-dev/2012-November/004138.html https://lists.torproject.org/pipermail/tor-dev/2012-November/004143.html Side channels in circ command cells: https://blog.torproject.org/tor-security-advisory-relay-early-traffic-confirmation-attack -- Mike Perry
Attachment:
signature.asc
Description: OpenPGP digital signature
_______________________________________________ tor-dev mailing list tor-dev@xxxxxxxxxxxxxxxxxxxx https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev