# [or-cvs] migrate stuff from section 4 to 5 and vice versa

Update of /home2/or/cvsroot/tor/doc/design-paper
In directory moria.mit.edu:/home2/arma/work/onion/cvs/tor/doc/design-paper

Modified Files:
challenges.tex
Log Message:
migrate stuff from section 4 to 5 and vice versa

Index: challenges.tex
===================================================================
RCS file: /home2/or/cvsroot/tor/doc/design-paper/challenges.tex,v
retrieving revision 1.52
retrieving revision 1.53
diff -u -d -r1.52 -r1.53
--- challenges.tex	8 Feb 2005 07:37:30 -0000	1.52
+++ challenges.tex	8 Feb 2005 07:54:28 -0000	1.53
@@ -423,7 +423,7 @@
% this para should probably move to the scalability / directory system. -RD
% Nope. Cut for space, except for small comment added above -PFS

-\section{Policy issues}
+\section{Social challenges}

Many of the issues the Tor project needs to address extend beyond
system design and technology development. In particular, the
@@ -498,7 +498,7 @@

On the other hand, while the number of active concurrent users may not
matter as much as we'd like, it still helps to have some other users
-who use the network. We investigate this issue in the next section.
+on the network. We investigate this issue next.

\subsection{Reputability and perceived social value}
Another factor impacting the network's security is its reputability:
@@ -803,8 +803,8 @@

\section{Design choices}

-In addition to social issues, Tor also faces some design challenges that must
-be addressed as the network develops.
+In addition to social issues, Tor also faces some design tradeoffs that must
+be investigated as the network develops.

\subsection{Transporting the stream vs transporting the packets}
\label{subsec:stream-vs-packet}
@@ -915,54 +915,6 @@
mid-latency as they are constructed, we could handle both types of traffic
on the same network, giving users a choice between speed and security.

-\subsection{Measuring performance and capacity}
-\label{subsec:performance}
-
-One of the paradoxes with engineering an anonymity network is that we'd like
-to learn as much as we can about how traffic flows so we can improve the
-network, but we want to prevent others from learning how traffic flows in
-order to trace users' connections through the network.  Furthermore, many
-mechanisms that help Tor run efficiently require measurements about the network.
-
-Currently, nodes try to deduce their own available bandwidth (based on how
-much traffic they have been able to transfer recently) and include this
-information in the descriptors they upload to the directory. Clients
-choose servers weighted by their bandwidth, neglecting really slow
-servers and capping the influence of really fast ones.
-
-This is, of course, eminently cheatable.  A malicious node can get a
-disproportionate amount of traffic simply by claiming to have more bandwidth
-than it does.  But better mechanisms have their problems.  If bandwidth data
-is to be measured rather than self-reported, it is usually possible for
-nodes to selectively provide better service for the measuring party, or
-sabotage the measured value of other nodes.  Complex solutions for
-mix networks have been proposed, but do not address the issues
-completely~\cite{mix-acc,casc-rep}.
-
-Even with no cheating, network measurement is complex.  It is common
-for views of a node's latency and/or bandwidth to vary wildly between
-observers.  Further, it is unclear whether total bandwidth is really
-the right measure; perhaps clients should instead be considering nodes
-based on unused bandwidth or observed throughput.
-% XXXX say more here?
-
-%How to measure performance without letting people selectively deny service
-%by distinguishing pings. Heck, just how to measure performance at all. In
-%practice people have funny firewalls that don't match up to their exit
-%policies and Tor doesn't deal.
-
-%Network investigation: Is all this bandwidth publishing thing a good idea?
-%How can we collect stats better? Note weasel's smokeping, at
-%which probably gives george and steven enough info to break tor?
-
-Even if we can collect and use this network information effectively, we need
-to make sure that it is not more useful to attackers than to us.  While it
-seems plausible that bandwidth data alone is not enough to reveal
-sender-recipient connections under most circumstances, it could certainly
-reveal the path taken by large traffic flows under low-usage circumstances.
-
\subsection{Running a Tor node, path length, and helper nodes}
\label{subsec:helper-nodes}

@@ -1111,79 +1063,119 @@
a way for their users, using unmodified software, to get end-to-end
encryption and end-to-end authentication to their website.

-\subsection{Trust and discovery}
-\label{subsec:trust-and-discovery}
+\label{subsec:routing-zones}

-The published Tor design adopted a deliberately simplistic approach to
-authorizing new nodes and informing clients about Tor nodes and their status.
-In the early Tor designs, all nodes periodically uploaded a signed description
-of their locations, keys, and capabilities to each of several well-known {\it
-  directory servers}.  These directory servers constructed a signed summary
-of all known Tor nodes (a ``directory''), and a signed statement of which
-nodes they
-believed to be operational at any given time (a ``network status'').  Clients
-periodically downloaded a directory in order to learn the latest nodes and
-keys, and more frequently downloaded a network status to learn which nodes are
-likely to be running.  Tor nodes also operate as directory caches, in order to
-lighten the bandwidth on the authoritative directory servers.
+Anonymity networks have long relied on diversity of node location for
+protection against attacks---typically an adversary who can observe a
+larger fraction of the network can launch a more effective attack. One
+way to achieve dispersal involves growing the network so a given adversary
+sees less. Alternately, we can arrange the topology so traffic can enter
+or exit at many places (for example, by using a free-route network
+like Tor rather than a cascade network like JAP). Lastly, we can use
+distributed trust to spread each transaction over multiple jurisdictions.
+But how do we decide whether two nodes are in related locations?

-In order to prevent Sybil attacks (wherein an adversary signs up many
-purportedly independent nodes in order to increase her chances of observing
-a stream as it enters and leaves the network), the early Tor directory design
-required the operators of the authoritative directory servers to manually
-approve new nodes.  Unapproved nodes were included in the directory,
-but clients
-did not use them at the start or end of their circuits.  In practice,
-directory administrators performed little actual verification, and tended to
-approve any Tor node whose operator could compose a coherent email.
-This procedure
-may have prevented trivial automated Sybil attacks, but would do little
-against a clever attacker.
+Feamster and Dingledine defined a \emph{location diversity} metric
+in \cite{feamster:wpes2004}, and began investigating a variant of location
+diversity based on the fact that the Internet is divided into thousands of
+independently operated networks called {\em autonomous systems} (ASes).
+The key insight from their paper is that while we typically think of a
+connection as going directly from the Tor client to her first Tor node,
+actually it traverses many different ASes on each hop. An adversary at
+any of these ASes can monitor or influence traffic. Specifically, given
+plausible initiators and recipients and random path selection,
+some ASes in the simulation were able to observe 10\% to 30\% of the
+transactions (that is, learn both the origin and the destination) on
+the deployed Tor network (33 nodes as of June 2004).
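[Editor's note: the AS-level observation described above can be sketched as a simple set computation: any AS that appears on both the client-to-entry path and the exit-to-destination path sees both ends of the transaction. All AS numbers and paths below are hypothetical illustrations, not data from the paper.]

```python
# Sketch: which ASes can observe both ends of a Tor transaction?
# AS numbers and paths are hypothetical illustrations.

def observers(entry_path: set, exit_path: set) -> set:
    """ASes that see both the origin and the destination."""
    return entry_path & exit_path

# Hypothetical AS-level paths (client -> entry node, exit node -> website).
client_to_entry = {7018, 701, 3356}   # client ISP, transit, entry's ISP
exit_to_site = {3356, 174, 2914}      # exit's ISP, transit, site's ISP

print(observers(client_to_entry, exit_to_site))  # AS 3356 sees both ends
```

An AS in this intersection can correlate the two flows and learn both origin and destination, which is exactly the 10%-30% observation figure measured in the simulation.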

-There are a number of flaws in this system that need to be addressed as we
-move forward.  They include:
-\begin{tightlist}
-\item Each directory server represents an independent point of failure; if
-  any one were compromised, it could immediately compromise all of its users
-  by recommending only compromised nodes.
-\item The more nodes join the network, the more unreasonable it
-  becomes to expect clients to know about them all.  Directories become
-  infeasibly large, and downloading them becomes burdensome.
-\item The validation scheme may do as much harm as it does good.  It is not
-  only incapable of preventing clever attackers from mounting Sybil attacks,
-  but may deter node operators from joining the network.  (For instance, if
-  they expect the validation process to be difficult, or if they do not share
-  any languages in common with the directory server operators.)
-\end{tightlist}
+The paper concludes that for best protection against the AS-level
+adversary, nodes should be in ASes that have the most links to other ASes:
+Tier-1 ISPs such as AT\&T and Abovenet. Further, a given transaction
+is safest when it starts or ends in a Tier-1 ISP. Therefore, assuming
+initiator and responder are both in the U.S., it actually \emph{hurts}
+our location diversity to add far-flung nodes in continents like Asia
+or South America.

-We could try to move the system in several directions, depending on our
-choice of threat model and requirements.  If we did not need to increase
-network capacity in order to support more users, we could simply
- adopt even stricter validation requirements, and reduce the number of
-nodes in the network to a trusted minimum.
-But we can only do that if we can simultaneously make node capacity
-scale much more than we anticipate feasible soon, and if we can find
-entities willing to run such nodes, an equally daunting prospect.
+Many open questions remain. First, it will be an immense engineering
+challenge to get an entire BGP routing table to each Tor client, or to
+summarize it sufficiently. Without a local copy, clients won't be
+able to safely predict what ASes will be traversed on the various paths
+through the Tor network to the final destination. Tarzan~\cite{tarzan:ccs02}
+and MorphMix~\cite{morphmix:fc04} suggest that we compare IP prefixes to
+determine location diversity; but the above paper showed that in practice
+many of the Mixmaster nodes that share a single AS have entirely different
+IP prefixes. When the network has scaled to thousands of nodes, does IP
+prefix comparison become a more useful approximation?
+%
+Second, we can take advantage of caching certain content at the
+exit nodes, to limit the number of requests that need to leave the
+network at all. What about taking advantage of caches like Akamai or
+Google~\cite{shsm03}? (Note that they're also well-positioned as global
+adversaries.)
+%
+Third, if we follow the paper's recommendations and tailor path selection
+to avoid choosing endpoints in similar locations, how much are we hurting
+anonymity against an adversary who can take advantage
+of knowing our algorithm?
+%
+Lastly, can we use this knowledge to figure out which gaps in our network
+would most improve our robustness to this class of attack, and go recruit
+new nodes with those ASes in mind?
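[Editor's note: one way to act on these open questions might look like the sketch below, which rejects entry/exit pairs in the same AS and falls back to the Tarzan/MorphMix-style IP-prefix heuristic when AS data is unavailable. The node records and values are hypothetical; this is not Tor's actual path-selection code.]

```python
# Sketch: AS-aware endpoint selection with a /16-prefix fallback
# (the Tarzan/MorphMix heuristic). All node data is hypothetical.

def same_location(a, b):
    """Treat two nodes as related if they share an AS, or, lacking
    AS data, if their IPs share a /16 prefix."""
    if a.get("asn") is not None and b.get("asn") is not None:
        return a["asn"] == b["asn"]
    return a["ip"].split(".")[:2] == b["ip"].split(".")[:2]

def diverse_pair(entries, exits):
    """Return the first entry/exit pair in unrelated locations."""
    for en in entries:
        for ex in exits:
            if not same_location(en, ex):
                return en, ex
    return None  # no location-diverse pair available

entries = [{"ip": "18.0.0.1", "asn": 3}]
exits = [{"ip": "18.5.0.2", "asn": 3}, {"ip": "128.30.0.9", "asn": 26}]
print(diverse_pair(entries, exits))  # skips the same-AS exit, picks AS 26
```

As the paper notes, the prefix fallback is a weak approximation: nodes in one AS often have entirely different prefixes.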

+%Tor's security relies in large part on the dispersal properties of its
+%network. We need to be more aware of the anonymity properties of various
+%approaches so we can make better design decisions in the future.

-In order to address the first two issues, it seems wise to move to a system
-including a number of semi-trusted directory servers, no one of which can
-compromise a user on its own.  Ultimately, of course, we cannot escape the
-problem of a first introducer: since most users will run Tor in whatever
-configuration the software ships with, the Tor distribution itself will
-remain a potential single point of failure so long as it includes the seed
-keys for directory servers, a list of directory servers, or any other means
-to learn which nodes are on the network.  But omitting this information
-from the Tor distribution would only delegate the trust problem to the
-individual users, most of whom are presumably less informed about how to make
-trust decisions than the Tor developers.
+\subsection{The China problem}
+\label{subsec:china}

-%Network discovery, sybil, node admission, scaling. It seems that the code
-%will ship with something and that's our trust root. We could try to get
-%people to build a web of trust, but no. Where we go from here depends
-%on what threats we have in mind. Really decentralized if your threat is
-%RIAA; less so if threat is to application data or individuals or...
+Citizens in a variety of countries, such as most recently China and
+Iran, are periodically blocked from accessing various sites outside
+their country. These users try to find any tools available to allow
+them to get around these firewalls. Some anonymity networks, such as
+Six-Four~\cite{six-four}, are designed specifically with this goal in
+mind; others like the Anonymizer~\cite{anonymizer} are paid by sponsors
+such as Voice of America to set up a network to encourage Internet
+freedom. Even though Tor wasn't designed with this goal in mind,
+users across the world are trying to use it for exactly this purpose.
+% Academic and NGO organizations, peacefire, \cite{berkman}, etc
+
+Anti-censorship networks hoping to bridge country-level blocks face
+a variety of challenges. One of these is that they need to find enough
+exit nodes---servers on the `free' side that are willing to relay
+arbitrary traffic from users to their final destinations. Anonymizing
+networks including Tor are well-suited to this task, since we have
+already gathered a set of exit nodes that are willing to tolerate some
+political heat.
+
+The other main challenge is to distribute a list of reachable relays
+to the users inside the country, and give them software to use them,
+without letting the authorities also enumerate this list and block each
+relay. Anonymizer solves this by buying lots of seemingly-unrelated IP
+addresses, abandoning old addresses as they are `used up', and telling
+a few users about the new ones. Distributed anonymizing networks already
+have tens of thousands of separate IP addresses whose users might
+volunteer to provide this service since they've already installed and use
+the software for their own privacy~\cite{koepsell:wpes2004}. Because
+the Tor protocol separates routing from network discovery \cite{tor-design},
+volunteers could configure their Tor clients
+to generate node descriptors and send them to a special directory
+server that gives them out to dissidents who need to get around blocks.
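[Editor's note: the volunteer-relay distribution idea might be sketched as follows. Every name and mechanism here is hypothetical; this is a speculative design, not an implemented Tor feature.]

```python
import hashlib

# Sketch of a special directory for censored users: volunteers register
# descriptors, and each user is deterministically shown only a small
# subset, so repeated queries leak no additional relay addresses to a
# blocking adversary. Hypothetical design, not an implemented mechanism.

POOL = ["relay%d" % i for i in range(30)]  # registered volunteer relays

def relays_for(user_id: str, k: int = 3) -> list:
    """Map a user to the same k relays on every query."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    start = h % len(POOL)
    return [POOL[(start + i) % len(POOL)] for i in range(k)]

print(relays_for("alice"))  # the same 3 relays every time for this user
```

Pinning each user to a fixed subset bounds how many addresses any single adversary-controlled account can enumerate, though a Sybil of many accounts still defeats it.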
+
+Of course, this still doesn't prevent the adversary
+from enumerating all the volunteer relays and blocking them preemptively.
+Perhaps a tiered-trust system could be built where a few individuals are
+given relays' locations, and they recommend other individuals by telling them
+those addresses, thus providing a built-in incentive to avoid letting the
+adversary intercept them. Such a scheme might help to bound the number of
+IP addresses leaked to the adversary. Groups
+like the W3C are looking into using Tor as a component in an overall system to
+help address censorship; we wish them luck.
+
+%\cite{infranet}

\section{Scaling}
\label{sec:scaling}
@@ -1282,119 +1274,127 @@
%efficiency over baseline, and also to determine how far we are from
%optimal efficiency (what we could get if we ignored the anonymity goals).

-\label{subsec:routing-zones}
+\subsection{Trust and discovery}
+\label{subsec:trust-and-discovery}

-Anonymity networks have long relied on diversity of node location for
-protection against attacks---typically an adversary who can observe a
-larger fraction of the network can launch a more effective attack. One
-way to achieve dispersal involves growing the network so a given adversary
-sees less. Alternately, we can arrange the topology so traffic can enter
-or exit at many places (for example, by using a free-route network
-like Tor rather than a cascade network like JAP). Lastly, we can use
-distributed trust to spread each transaction over multiple jurisdictions.
-But how do we decide whether two nodes are in related locations?
+The published Tor design adopted a deliberately simplistic approach to
+authorizing new nodes and informing clients about Tor nodes and their status.
+In the early Tor designs, all nodes periodically uploaded a signed description
+of their locations, keys, and capabilities to each of several well-known {\it
+  directory servers}.  These directory servers constructed a signed summary
+of all known Tor nodes (a ``directory''), and a signed statement of which
+nodes they
+believed to be operational at any given time (a ``network status'').  Clients
+periodically downloaded a directory in order to learn the latest nodes and
+keys, and more frequently downloaded a network status to learn which nodes are
+likely to be running.  Tor nodes also operate as directory caches, in order to
+lighten the bandwidth on the authoritative directory servers.
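[Editor's note: the two-document scheme above can be summarized in a small sketch: a rarely fetched signed directory carrying node descriptors and keys, plus a frequently fetched network status listing which nodes appear up. Field names are illustrative, not Tor's actual document formats.]

```python
# Sketch of the early directory design: full signed directory (fetched
# rarely) plus lightweight network status (fetched often). Field names
# are illustrative, not Tor's actual formats.

directory = {
    "nodes": {
        "nodeA": {"address": "10.0.0.1", "key": "<identity key>"},
        "nodeB": {"address": "10.0.0.2", "key": "<identity key>"},
    },
    "signature": "<directory server signature>",
}

network_status = {"running": ["nodeA"], "signature": "<signature>"}

def usable_nodes(directory, network_status):
    """Clients combine both documents: known keys + believed-running."""
    return [n for n in directory["nodes"] if n in network_status["running"]]

print(usable_nodes(directory, network_status))  # ['nodeA']
```

Splitting the documents lets clients refresh liveness information often without re-downloading every descriptor.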

-Feamster and Dingledine defined a \emph{location diversity} metric
-in \cite{feamster:wpes2004}, and began investigating a variant of location
-diversity based on the fact that the Internet is divided into thousands of
-independently operated networks called {\em autonomous systems} (ASes).
-The key insight from their paper is that while we typically think of a
-connection as going directly from the Tor client to her first Tor node,
-actually it traverses many different ASes on each hop. An adversary at
-any of these ASes can monitor or influence traffic. Specifically, given
-plausible initiators and recipients and random path selection,
-some ASes in the simulation were able to observe 10\% to 30\% of the
-transactions (that is, learn both the origin and the destination) on
-the deployed Tor network (33 nodes as of June 2004).
+In order to prevent Sybil attacks (wherein an adversary signs up many
+purportedly independent nodes in order to increase her chances of observing
+a stream as it enters and leaves the network), the early Tor directory design
+required the operators of the authoritative directory servers to manually
+approve new nodes.  Unapproved nodes were included in the directory,
+but clients
+did not use them at the start or end of their circuits.  In practice,
+directory administrators performed little actual verification, and tended to
+approve any Tor node whose operator could compose a coherent email.
+This procedure
+may have prevented trivial automated Sybil attacks, but would do little
+against a clever attacker.

-The paper concludes that for best protection against the AS-level
-adversary, nodes should be in ASes that have the most links to other ASes:
-Tier-1 ISPs such as AT\&T and Abovenet. Further, a given transaction
-is safest when it starts or ends in a Tier-1 ISP. Therefore, assuming
-initiator and responder are both in the U.S., it actually \emph{hurts}
-our location diversity to add far-flung nodes in continents like Asia
-or South America.
+There are a number of flaws in this system that need to be addressed as we
+move forward.  They include:
+\begin{tightlist}
+\item Each directory server represents an independent point of failure; if
+  any one were compromised, it could immediately compromise all of its users
+  by recommending only compromised nodes.
+\item The more nodes join the network, the more unreasonable it
+  becomes to expect clients to know about them all.  Directories become
+  infeasibly large, and downloading them becomes burdensome.
+\item The validation scheme may do as much harm as it does good.  It is not
+  only incapable of preventing clever attackers from mounting Sybil attacks,
+  but may deter node operators from joining the network.  (For instance, if
+  they expect the validation process to be difficult, or if they do not share
+  any languages in common with the directory server operators.)
+\end{tightlist}

-Many open questions remain. First, it will be an immense engineering
-challenge to get an entire BGP routing table to each Tor client, or to
-summarize it sufficiently. Without a local copy, clients won't be
-able to safely predict what ASes will be traversed on the various paths
-through the Tor network to the final destination. Tarzan~\cite{tarzan:ccs02}
-and MorphMix~\cite{morphmix:fc04} suggest that we compare IP prefixes to
-determine location diversity; but the above paper showed that in practice
-many of the Mixmaster nodes that share a single AS have entirely different
-IP prefixes. When the network has scaled to thousands of nodes, does IP
-prefix comparison become a more useful approximation?
-%
-Second, we can take advantage of caching certain content at the
-exit nodes, to limit the number of requests that need to leave the
-network at all. What about taking advantage of caches like Akamai or
-Google~\cite{shsm03}? (Note that they're also well-positioned as global
-adversaries.)
-%
-Third, if we follow the paper's recommendations and tailor path selection
-to avoid choosing endpoints in similar locations, how much are we hurting
-anonymity against an adversary who can take advantage
-of knowing our algorithm?
-%
-Lastly, can we use this knowledge to figure out which gaps in our network
-would most improve our robustness to this class of attack, and go recruit
-new nodes with those ASes in mind?
+We could try to move the system in several directions, depending on our
+choice of threat model and requirements.  If we did not need to increase
+network capacity in order to support more users, we could simply
+ adopt even stricter validation requirements, and reduce the number of
+nodes in the network to a trusted minimum.
+But we can only do that if we can simultaneously make node capacity
+scale much more than we anticipate feasible soon, and if we can find
+entities willing to run such nodes, an equally daunting prospect.

-%Tor's security relies in large part on the dispersal properties of its
-%network. We need to be more aware of the anonymity properties of various
-%approaches so we can make better design decisions in the future.

-\subsection{The China problem}
-\label{subsec:china}
+In order to address the first two issues, it seems wise to move to a system
+including a number of semi-trusted directory servers, no one of which can
+compromise a user on its own.  Ultimately, of course, we cannot escape the
+problem of a first introducer: since most users will run Tor in whatever
+configuration the software ships with, the Tor distribution itself will
+remain a potential single point of failure so long as it includes the seed
+keys for directory servers, a list of directory servers, or any other means
+to learn which nodes are on the network.  But omitting this information
+from the Tor distribution would only delegate the trust problem to the
+individual users, most of whom are presumably less informed about how to make
+trust decisions than the Tor developers.

-Citizens in a variety of countries, such as most recently China and
-Iran, are periodically blocked from accessing various sites outside
-their country. These users try to find any tools available to allow
-them to get around these firewalls. Some anonymity networks, such as
-Six-Four~\cite{six-four}, are designed specifically with this goal in
-mind; others like the Anonymizer~\cite{anonymizer} are paid by sponsors
-such as Voice of America to set up a network to encourage Internet
-freedom. Even though Tor wasn't designed with this goal in mind,
-users across the world are trying to use it for exactly this purpose.
-% Academic and NGO organizations, peacefire, \cite{berkman}, etc
+%Network discovery, sybil, node admission, scaling. It seems that the code
+%will ship with something and that's our trust root. We could try to get
+%people to build a web of trust, but no. Where we go from here depends
+%on what threats we have in mind. Really decentralized if your threat is
+%RIAA; less so if threat is to application data or individuals or...

-Anti-censorship networks hoping to bridge country-level blocks face
-a variety of challenges. One of these is that they need to find enough
-exit nodes---servers on the `free' side that are willing to relay
-arbitrary traffic from users to their final destinations. Anonymizing
-networks including Tor are well-suited to this task, since we have
-already gathered a set of exit nodes that are willing to tolerate some
-political heat.
+\subsection{Measuring performance and capacity}
+\label{subsec:performance}

-The other main challenge is to distribute a list of reachable relays
-to the users inside the country, and give them software to use them,
-without letting the authorities also enumerate this list and block each
-relay. Anonymizer solves this by buying lots of seemingly-unrelated IP
-addresses, abandoning old addresses as they are `used up', and telling
-a few users about the new ones. Distributed anonymizing networks already
-have tens of thousands of separate IP addresses whose users might
-volunteer to provide this service since they've already installed and use
-the software for their own privacy~\cite{koepsell:wpes2004}. Because
-the Tor protocol separates routing from network discovery \cite{tor-design},
-volunteers could configure their Tor clients
-to generate node descriptors and send them to a special directory
-server that gives them out to dissidents who need to get around blocks.
+One of the paradoxes with engineering an anonymity network is that we'd like
+to learn as much as we can about how traffic flows so we can improve the
+network, but we want to prevent others from learning how traffic flows in
+order to trace users' connections through the network.  Furthermore, many
+mechanisms that help Tor run efficiently require measurements about the network.

-Of course, this still doesn't prevent the adversary
-from enumerating all the volunteer relays and blocking them preemptively.
-Perhaps a tiered-trust system could be built where a few individuals are
-given relays' locations, and they recommend other individuals by telling them
-those addresses, thus providing a built-in incentive to avoid letting the
-adversary intercept them. Such a scheme might help to bound the number of
-IP addresses leaked to the adversary. Groups
-like the W3C are looking into using Tor as a component in an overall system to
-help address censorship; we wish them luck.
+Currently, nodes try to deduce their own available bandwidth (based on how
+much traffic they have been able to transfer recently) and include this
+information in the descriptors they upload to the directory. Clients
+choose servers weighted by their bandwidth, neglecting really slow
+servers and capping the influence of really fast ones.
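[Editor's note: the weighting scheme described above might be sketched as follows. The cutoff and cap values are invented for illustration; they are not Tor's actual parameters.]

```python
import random

# Sketch of bandwidth-weighted server selection: neglect very slow
# servers, cap the weight of very fast ones. Thresholds are invented
# for illustration, not Tor's actual parameters.

SLOW_CUTOFF = 20   # claimed KB/s below which a server is neglected
FAST_CAP = 800     # claimed KB/s above which extra bandwidth earns no weight

def pick_server(servers):
    """servers: list of (name, claimed_bandwidth_kbps) pairs."""
    usable = [(n, min(bw, FAST_CAP)) for n, bw in servers if bw >= SLOW_CUTOFF]
    names = [n for n, _ in usable]
    weights = [bw for _, bw in usable]
    return random.choices(names, weights=weights, k=1)[0]

servers = [("slow", 5), ("mid", 100), ("fast", 5000)]
print(pick_server(servers))  # 'mid' or 'fast'; 'slow' is never chosen
```

The cap is what limits the damage a node can do by over-claiming bandwidth, though (as the next paragraph notes) a malicious node can still attract far more than its fair share of traffic.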

-%\cite{infranet}
+This is, of course, eminently cheatable.  A malicious node can get a
+disproportionate amount of traffic simply by claiming to have more bandwidth
+than it does.  But better mechanisms have their problems.  If bandwidth data
+is to be measured rather than self-reported, it is usually possible for
+nodes to selectively provide better service for the measuring party, or
+sabotage the measured value of other nodes.  Complex solutions for
+mix networks have been proposed, but do not address the issues
+completely~\cite{mix-acc,casc-rep}.
+
+Even with no cheating, network measurement is complex.  It is common
+for views of a node's latency and/or bandwidth to vary wildly between
+observers.  Further, it is unclear whether total bandwidth is really
+the right measure; perhaps clients should instead be considering nodes
+based on unused bandwidth or observed throughput.
+% XXXX say more here?
+
+%How to measure performance without letting people selectively deny service
+%by distinguishing pings. Heck, just how to measure performance at all. In
+%practice people have funny firewalls that don't match up to their exit
+%policies and Tor doesn't deal.
+
+%Network investigation: Is all this bandwidth publishing thing a good idea?
+%How can we collect stats better? Note weasel's smokeping, at
+%which probably gives george and steven enough info to break tor?
+
+Even if we can collect and use this network information effectively, we need
+to make sure that it is not more useful to attackers than to us.  While it
+seems plausible that bandwidth data alone is not enough to reveal
+sender-recipient connections under most circumstances, it could certainly
+reveal the path taken by large traffic flows under low-usage circumstances.

\subsection{Non-clique topologies}

@@ -1493,7 +1493,7 @@
authentication mechanisms. We can't just keep escalating the blacklist
standoff forever.
%
-Fourth, as described in Section~\ref{sec:scaling}, the current Tor
+Fourth, the current Tor
architecture does not scale even to handle current user demand. We must
find designs and incentives to let clients relay traffic too, without
sacrificing too much anonymity.