[freehaven-cvs] r1761: Initial musings on what to do for an experiment. (in doc/trunk: . correlation07)
Author: nickm
Date: 2007-02-09 22:37:53 -0500 (Fri, 09 Feb 2007)
New Revision: 1761
Added:
doc/trunk/correlation07/experiment.txt
Modified:
doc/trunk/
Log:
r12203@Kushana: nickm | 2007-02-09 22:37:40 -0500
Initial musings on what to do for an experiment.
Property changes on: doc/trunk
___________________________________________________________________
svk:merge ticket from /freehaven-doc/trunk [r12203] on c95137ef-5f19-0410-b913-86e773d04f59
Added: doc/trunk/correlation07/experiment.txt
===================================================================
--- doc/trunk/correlation07/experiment.txt 2007-02-09 14:19:21 UTC (rev 1760)
+++ doc/trunk/correlation07/experiment.txt 2007-02-10 03:37:53 UTC (rev 1761)
@@ -0,0 +1,163 @@
+BRAIN-DUMP ON END-TO-END CORRELATION
+
+Passive end-to-end correlation attacks all follow the same general format:
+
+ You're watching some traffic enter and leave an anonymity system, and you
+ want to link traffic streams coming in to traffic streams going out. Let
+ A_i be all data about the streams entering, and B_j be all data about the
+ streams exiting. You compute feature-set functions f_in(A_i), f_out(B_j)
+ on each stream, and then a similarity function c on each <f_in(A_i),
+ f_out(B_j)> pair. The pairs giving high values of c probably represent
+ the same stream as it passes through the network.
+
+ If instead of observing exit streams at the same time as entry streams,
+ you pre-compute expected f_out(B_j) for a large number of possible target
+ responders, and then later observe A_i, you're doing a fingerprinting
+ attack. The math is the same.
+
+The usual way to transform this into an active attack is to introduce
+patterns into one or more target A_i or B_j, so that the corresponding
+B_j/A_i will not resemble the background traffic.
+
+There are many experiments all following the same general design: launch a
+correlation or fingerprinting attack; see how fast it works. The results are
+pretty uniformly: "fast".
+
+Here are some independent variables and experiment parameters:
+
+ - What pattern does the traffic on the streams follow?
+ - How many simultaneous streams are there? How many are under
+ observation?
+ - What transformation does the network apply to the streams? This can
+ include delay, padding, aggregation, etc.
+ - What data can the attacker measure about streams, and to what accuracy?
+ - What method does the attacker use for f_in, f_out, and c?
+ - If a fingerprinting attack, how many sites has the attacker
+ fingerprinted, and how often do they change, and by how much?
+
+Here are some methodological options:
+ - Do you use an actual deployed anonymity system, an ersatz relaying
+ system, or just simulation?
+ - Do you use live traffic, simulated traffic, or replayed traffic?
+ - For network effects, do you use the real internet, a network with
+ simulated internet-like delays, or do you just do everything on one
+ computer?
+
+Here are some possible dependent variables:
+ - How long on average does it take the attacker to correctly identify N (or
+ N%) of target streams?
+ - How much traffic on average does it take the attacker to correctly
+ identify N (or N%) of target streams?
+
+Here are some other considerations:
+ - How linkable are different B_j streams coming from the same user? If the
+ user is connecting to a website that uses a "user id" cookie or other
+ identifier, an intersection attack should work very well, even if we
+ somehow beat end-to-end correlation.
+
+ ***
+
+I want answers for the following questions.
+
+Evaluation of existing attack methods:
+
+ - In the domains where they work, do any end-to-end attack methods work
+ better than any others? Does this difference matter in practice?
+
+ - How much information, in practice, does the attacker need to win? For
+ example, is microsecond timing info really helpful, or can an attacker do
+ just as well with access to the more loosely synchronized, less accurate
+ clocks you'd find in an existing network of off-the-shelf IDS systems?
+
+Evaluation of existing attack methods against Tor:
+
+ - Against a Tor-like system, do any of the end-to-end attack methods fail
+ to work? What works best?
+
+ - To what extent if any does Tor's existing cell padding confuse end-to-end
+ attacks?
+
+ - To what extent if any do Tor's existing stream multiplexing over
+ circuits and circuit multiplexing over connections confuse these
+ attacks?
+
+Evaluation of Tor modifications against end-to-end attacks:
+
+ - Do any of these help?
+ - Long-range padding cells
+ - Revised cell-packing methods
+ - Mid-latency design
+ - Mixed-latency design (alpha-mixing)
+
+Evaluation of experimental reliability:
+
+ - How dependent are the above results on user behavior models, number of
+ users, etc.?
+
+ ***
+
+Rough experimental design
+
+First, we'll need data. This will consist of some traffic patterns, and of
+the results of transmitting those traffic patterns through Tor or a Tor-like
+network.
+
+For the traffic, I'd like to consider a few patterns:
+ * To model the report-aggregation case, we can use a simulated model, or
+ traces from actual series of reports.
+
+ * Lots of people have done work with HTTP: we can ask for an existing
+ corpus of data there (which will be an inaccurate reflection of
+ HTTP in the wild today, but which will at least be repeatable) or we can
+ solicit a few volunteers to have their browsing behavior recorded, or we
+ can come up with some kind of dopey simulation.
+
+ * I bet IRC, IM, and SSH follow pretty noticeable patterns: if we record
+ them, we'll have a pretty good profile for such patterns in general.
+
+ * We should consider the above patterns in isolation and in combination
+ (as they'd exist on a real network).
+
+For the network, we've got a few options:
+ * We could simulate something. This seems kind of useless given how
+ differently real low-latency anonymity networks behave from simulated
+ ones, but it would at least be quite repeatable.
+
+ * We could rig up a private Tor network on a LAN, and maybe add some
+ artificial delay or congestion on some links. This would be better in
+ realism, and still somewhat repeatable.
+
+ * As above, but we could put the private Tor network on hosts distributed
+ around the internet. We could even encourage people who don't mind
+ having their traffic captured to use it in order to generate background
+ traffic.
+
+ * We could add a few private entries and exits to the live Tor network,
+ and capture traffic on them.
+
+ * We could just use the actual live Tor network. This would be the most
+ realistic approach, but I don't believe it's possible to do this without
+ risking exposure of actual user traffic. Thus, we must utterly reject
+ this approach on ethical grounds. (We _could_ use the live Tor network
+ to spot-check the realism of our experimental network by comparing their
+ delay characteristics on streams generated by us. Since we would only
+ be observing our own streams, there would be no risk of compromising
+ real users' traffic.)
+
+I think the third option above presents the best compromise between
+repeatability and realism. Also, it'll be far easier later to experiment
+with protocol modifications if we're on a separate, private Tor network.
+
+The experiment doesn't need to last too long: getting a few hours with each
+configuration would be fine to start. (After all, it will be a publishable
+result if any of the attack methods we're considering turns out to take
+longer than a few minutes to succeed.)
+
+To evaluate how much our deviations from reality are hurting us, we should
+make sure to measure along axes relating to those deviations: size of
+network, network congestion, etc.
+
+Getting the data will probably take longer than analyzing it, and it's likely
+that in analyzing it, we'll think of more experiments we want to do.
+Therefore, I'm not going to discuss analysis in too much detail right now.
+
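The general attack format described at the top of experiment.txt (feature
functions f_in and f_out, plus a similarity function c over every pair) can
be sketched roughly as below. This is only an illustration under stated
assumptions, not the method the experiment will use: the features are
packet counts per fixed time window, c is a plain Pearson correlation, and
the traffic model (bursty_stream) and network transformation
(through_network) are hypothetical stand-ins invented for the example.

```python
import random
from statistics import mean


def window_counts(timestamps, window, duration):
    """Stand-in for f_in / f_out: number of packets seen in each time window."""
    n_bins = int(duration / window)
    counts = [0] * n_bins
    for t in timestamps:
        idx = int(t / window)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def pearson(x, y):
    """Stand-in for c: Pearson correlation of two equal-length vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant vector carries no timing pattern to match
    return cov / (vx * vy) ** 0.5


def best_matches(entries, exits, window=1.0, duration=60.0):
    """For each entry stream A_i, return the index j of the exit stream B_j
    whose features give the highest value of c."""
    f_in = [window_counts(a, window, duration) for a in entries]
    f_out = [window_counts(b, window, duration) for b in exits]
    return [
        max(range(len(f_out)), key=lambda j: pearson(fa, f_out[j]))
        for fa in f_in
    ]


def bursty_stream(rng, duration=60.0):
    """Hypothetical traffic model: a handful of short packet bursts."""
    ts = []
    for _ in range(15):
        start = rng.uniform(0, duration - 2)
        ts.extend(start + rng.uniform(0, 1.0) for _ in range(rng.randint(5, 30)))
    return sorted(ts)


def through_network(rng, stream, max_delay=0.3):
    """Hypothetical network transformation: a small independent delay per packet."""
    return sorted(t + rng.uniform(0.05, max_delay) for t in stream)


if __name__ == "__main__":
    rng = random.Random(7)
    entries = [bursty_stream(rng) for _ in range(10)]
    order = list(range(10))
    rng.shuffle(order)  # exits[k] is the delayed copy of entries[order[k]]
    exits = [through_network(rng, entries[i]) for i in order]
    guesses = best_matches(entries, exits)
    correct = sum(order[g] == i for i, g in enumerate(guesses))
    print(f"{correct} of {len(entries)} streams linked correctly")
```

Swapping in a different window size, a different similarity function, or a
harsher through_network (more delay, padding, aggregation) corresponds to
varying the independent variables listed in the document.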