
[freehaven-cvs] r1761: Initial musings on what to do for an experiment. (in doc/trunk: . correlation07)



Author: nickm
Date: 2007-02-09 22:37:53 -0500 (Fri, 09 Feb 2007)
New Revision: 1761

Added:
   doc/trunk/correlation07/experiment.txt
Modified:
   doc/trunk/
Log:
 r12203@Kushana:  nickm | 2007-02-09 22:37:40 -0500
 Initial musings on what to do for an experiment.



Property changes on: doc/trunk
___________________________________________________________________
 svk:merge ticket from /freehaven-doc/trunk [r12203] on c95137ef-5f19-0410-b913-86e773d04f59

Added: doc/trunk/correlation07/experiment.txt
===================================================================
--- doc/trunk/correlation07/experiment.txt	2007-02-09 14:19:21 UTC (rev 1760)
+++ doc/trunk/correlation07/experiment.txt	2007-02-10 03:37:53 UTC (rev 1761)
@@ -0,0 +1,163 @@
+BRAIN-DUMP ON END-TO-END CORRELATION
+
+Passive end-to-end correlation attacks all follow the same general format:
+
+    You're watching some traffic enter and leave an anonymity system, and you
+    want to link traffic streams coming in to traffic streams going out.  Let
+    A_i be all data about the streams entering, and B_j be all data about the
+    streams exiting.  You compute feature-set functions f_in(A_i), f_out(B_j)
+    on each stream, and then a similarity function c on each <f_in(A_i),
+    f_out(B_j)> pair.  The pairs giving high values of c probably represent
+    the same stream as it passes through the network.
+
+    If instead of observing exit streams at the same time as entry streams,
+    you pre-compute expected f_out(B_j) for a large number of possible target
+    responders, and then later observe A_i, you're doing a fingerprinting
+    attack.  The math is the same.
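The framework above can be sketched concretely.  In this toy, the feature function bins packet timestamps into fixed windows and the similarity function is the Pearson correlation of the count vectors; the window size, bin count, and traffic traces are all invented for illustration, and real attacks use much richer features.

```python
import math

def f(timestamps, window=1.0, n_windows=10):
    """Feature vector: packet counts per fixed time window."""
    counts = [0] * n_windows
    for t in timestamps:
        i = int(t // window)
        if i < n_windows:
            counts[i] += 1
    return counts

def c(u, v):
    """Similarity: Pearson correlation of two count vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0

# Entry stream A and exit stream B carry the same bursty pattern, with B
# delayed slightly by the network; X is an unrelated background stream.
A = [0.1, 0.2, 0.3, 3.1, 3.2, 6.5, 6.6, 6.7]
B = [t + 0.4 for t in A]        # same stream after network delay
X = [1.0, 2.0, 4.0, 5.0, 8.0]   # unrelated background stream

# The matching pair scores higher than the unrelated pair.
print(c(f(A), f(B)), c(f(A), f(X)))
```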
+
+The usual way to transform this into an active attack is to introduce
+patterns into one or more target A_i or B_j, so that the corresponding
+B_j/A_i will not resemble the background traffic.
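The active variant can be sketched in the same toy setting: the attacker delays packets on a target entry stream so that traffic appears only in chosen windows, then scores candidate exit streams against that on/off pattern.  The pattern, window size, and traces below are illustrative assumptions, not any published watermarking scheme.

```python
def windowed_counts(timestamps, window=1.0, n_windows=8):
    """Packet counts per fixed time window."""
    counts = [0] * n_windows
    for t in timestamps:
        i = int(t // window)
        if i < n_windows:
            counts[i] += 1
    return counts

def detect(timestamps, pattern, window=1.0):
    """Fraction of windows whose on/off activity matches the injected pattern."""
    counts = windowed_counts(timestamps, window, len(pattern))
    return sum(1 for p, n in zip(pattern, counts) if (n > 0) == bool(p)) / len(pattern)

pattern = [1, 0, 1, 1, 0, 1, 0, 0]   # the watermark: send only in "1" windows
marked = [0.1, 0.5, 2.2, 3.3, 3.7, 5.1]        # stream shaped to the pattern
unmarked = [0.4, 1.2, 2.5, 4.1, 4.9, 6.3, 7.0]  # background stream

# The watermarked stream matches perfectly; the background stream doesn't.
print(detect(marked, pattern), detect(unmarked, pattern))
```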
+
+There are many experiments all following the same general design: launch a
+correlation or fingerprinting attack; see how fast it works.  The results are
+pretty uniform: "fast".
+
+Here are some independent variables and experiment parameters:
+
+  - What pattern does the traffic on the streams follow?
+  - How many simultaneous streams are there?  How many are under
+    observation?
+  - What transformation does the network apply to the streams?  This can
+    include delay, padding, aggregation, etc.
+  - What data can the attacker measure about streams, and to what accuracy?
+  - What method does the attacker use for f_in, f_out, and c?
+  - If a fingerprinting attack, how many sites has the attacker
+    fingerprinted, how often do they change, and by how much?
+
+Here are some methodological options:
+  - Do you use an actual deployed anonymity system, an ersatz relaying
+    system, or just simulation?
+  - Do you use live traffic, simulated traffic, or replayed traffic?
+  - For network effects, do you use the real internet, a network with
+    simulated internet-like delays, or do you just do everything on one
+    computer?
+
+Here are some possible dependent variables:
+  - How long on average does it take the attacker to correctly identify N (or
+    N%) of target streams?
+  - How much traffic on average does it take the attacker to correctly
+    identify N (or N%) of target streams?
+
+Here are some other considerations:
+  - How linkable are different B_j streams coming from the same user?  If the
+    user is connecting to a website that uses a "user id" cookie or other
+    identifier, an intersection attack should work very well, even if we
+    somehow beat end-to-end correlation.
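The intersection attack mentioned above is easy to make concrete: if an exit-side identifier links several sessions, the attacker intersects the sets of entry-side users who were online during each one.  The session data here is invented purely for illustration.

```python
def intersect(sessions):
    """sessions: list of sets of users seen entering during each linked session.
    Returns the users present in every session."""
    candidates = set.union(*sessions)
    for online in sessions:
        candidates &= online
    return candidates

sessions = [
    {"alice", "bob", "carol"},   # users online during linked session 1
    {"alice", "carol", "dave"},  # session 2
    {"alice", "eve"},            # session 3
]
print(intersect(sessions))   # only "alice" appears every time
```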
+
+                                     ***
+
+I want answers to the following questions.
+
+Evaluation of existing attack methods:
+
+  - In the domains where they work, do any end-to-end attack methods work
+    better than any others?  Does this difference matter in practice?
+
+  - How much information, in practice, does the attacker need to win?  For
+    example, is microsecond timing info really helpful, or can an attacker do
+    just as well with access to the more loosely synchronized, less accurate
+    clocks you'd find in an existing network of off-the-shelf IDSes?
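One self-contained way to probe the clock-accuracy question: quantize a stream's timestamps to progressively coarser clock resolutions and check when a simple windowed-count feature stops matching the exact-clock version.  The trace, window size, and resolutions are illustrative assumptions.

```python
def features(timestamps, window=1.0, n_windows=8):
    """Packet counts per fixed time window."""
    counts = [0] * n_windows
    for t in timestamps:
        i = int(t // window)
        if i < n_windows:
            counts[i] += 1
    return counts

def quantize(timestamps, resolution):
    """Model a coarse clock: truncate each timestamp to its resolution."""
    return [resolution * int(t / resolution) for t in timestamps]

stream = [0.123456, 0.234567, 2.345678, 2.456789, 5.567890, 5.678901]
exact = features(stream)
for res in (1e-6, 1e-3, 0.1):   # microsecond, millisecond, 100 ms clocks
    print(res, features(quantize(stream, res)) == exact)
```

With one-second windows, even a 100 ms clock preserves the feature here; the interesting experiment is where, for a given attack, this stops being true.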
+
+Evaluation of existing attack methods against Tor:
+
+  - Against a Tor-like system, do any of the end-to-end attack methods fail
+    to work?  What works best?
+
+  - To what extent if any does Tor's existing cell padding confuse end-to-end
+    attacks?
+
+  - To what extent if any do Tor's existing stream multiplexing over
+    circuits and circuit multiplexing over connections confuse these
+    attacks?
+
+Evaluation of Tor modifications against end-to-end attacks:
+
+  - Do any of these help?
+    - Long-range padding cells
+    - Revised cell-packing methods
+    - Mid-latency design
+    - Mixed-latency design (alpha-mixing)
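As a rough toy of the alpha-mixing idea: each message carries an "alpha" chosen by its sender; at every mix round all queued alphas decrement, and messages reaching zero are flushed together, so latency-insensitive senders buy more mixing with larger alphas.  This is a sketch of the concept only, not any deployed protocol.

```python
def alpha_mix(arrivals):
    """arrivals: list of rounds, each a list of (message, alpha) pairs.
    Returns the batch of messages flushed at each round."""
    queue = []      # (message, remaining_alpha)
    flushed = []
    for batch in arrivals:
        queue.extend(batch)
        queue = [(m, a - 1) for m, a in queue]          # one round passes
        flushed.append([m for m, a in queue if a <= 0])  # alpha hit zero: flush
        queue = [(m, a) for m, a in queue if a > 0]
    return flushed

# "a" (alpha 1) leaves in the round it arrives; "b" (alpha 3) waits two
# extra rounds and ends up mixed in a batch with the later message "d".
rounds = [[("a", 1), ("b", 3)], [("c", 1)], [("d", 1)]]
print(alpha_mix(rounds))
```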
+
+Evaluation of experimental reliability:
+
+  - How dependent are the above results on user behavior models, number of
+    users, and so on?
+
+                                     ***
+
+Rough experimental design
+
+First, we'll need data.  This will consist of some traffic patterns, and of
+the result of transmitting those traffic patterns through Tor or a Tor-like
+network.
+
+For the traffic, I'd like to consider a few patterns:
+   * To model the report-aggregation case, we can use a simulated model, or
+     traces from actual series of reports.
+
+   * Lots of people have done work with HTTP: we can ask for an existing
+     corpus of data there (which will be an inaccurate reflection of
+     HTTP in the wild today, but which will at least be repeatable) or we can
+     solicit a few volunteers to have their browsing behavior recorded, or we
+     can come up with some kind of dopey simulation.
+
+   * I bet IRC, IM, and SSH follow pretty noticeable patterns: if we record
+     them, we'll have a pretty good profile for such patterns in general.
+
+   * We should consider the above patterns in isolation and in combination
+     (as they'd exist on a real network).
+
+For the network, we've got a few options:
+   * We could simulate something.  This seems kind of useless given how
+     differently real low-latency anonymity networks behave from simulated
+     ones, but it would at least be quite repeatable.
+
+   * We could rig up a private Tor network on a LAN, and maybe add some
+     artificial delay or congestion on some links.  This would be more
+     realistic, and still somewhat repeatable.
+
+   * As above, but we could put the private Tor network on hosts distributed
+     around the internet.  We could even encourage people who don't mind
+     having their traffic captured to use it in order to generate background
+     traffic.
+
+   * We could add a few private entries and exits to the live Tor network,
+     and capture traffic on them.
+
+   * We could just use the actual live Tor network.  This would be the most
+     realistic approach, but I don't believe it's possible to do this without
+     risking exposure of actual user traffic.  Thus, we must utterly reject
+     this approach on ethical grounds.  (We _could_ use the live Tor network
+     to spot-check the realism of our experimental network by comparing their
+     delay characteristics on streams generated by us.  Since we would only
+     be observing our own streams, there would be no risk of compromising
+     real users' traffic.)
+
+I think the third option above presents the best compromise between
+repeatability and realism.  Also, it'll be far easier later to experiment
+with protocol modifications if we're on a separate, private Tor network.
+
+The experiment doesn't need to last too long: getting a few hours with each
+configuration would be fine to start.  (After all, it will be a publishable
+result if any of the attack methods we're considering turns out to take
+longer than a few minutes to succeed.)
+
+To evaluate how much our deviations from reality are hurting us, we should
+make sure to measure along axes relating to those deviations: size of
+network, network congestion, etc.
+
+Getting the data will probably take longer than analyzing it, and it's likely
+that in analyzing it, we'll think of more experiments we want to do.
+Therefore, I'm not going to discuss analysis in too much detail right now.
+
