[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Link padding and the intersection attack



Hi all,

I decided if I waited to have time to read through the discussion
and respond, I'd be too far behind. So, I'll just focus on
Andrei's most recent messsage to force myself to stay engaged,
and then have to leave this till at least tomorrow.

> Date: Fri, 16 Aug 2002 11:12:57 +0100 (BST)
> From: Andrei Serjantov <aas23@hermes.cam.ac.uk>
> 
> More on the intersection attack...
> 
> > > > Scenario 1 also covers "Alice has an account on slashdot.org from which
> > > > she writes comments", "Alice uses Freedom to check her nym's mail at
> > > > the popserver", and "Alice mails a bunch of reply blocks to a Mixminion
> > > > nymserver so it can deliver her mail".
> > >
> > > In my opinion, this is too hard. Pseudonymity requires an entirely
> > > different infrastructure than anonymity (end to end padding). Sorry.
> >
> > Really? But this is going to be a common use of the system. Or at least,
> > it will be unless we explicitly tell people they won't get anonymity if
> > they do it. (And even then...)
> 
> Yes. My opinion is that you have to do something like Paul proposed to
> protect against it. A real time anonymity infrastructure like OR or Tarzan
> will not do -- you need to introduce time delay to give everyone a chance
> to put themselves into the anonymity set.
> 
What can I say, umm, I agree.

> > > Scenario 2 -- I agree with. Here is my view on the way the attacker can
> > > analyse his observations:
> > >
> > > He watches Bob and computes anonymity sets every time there is a request
> > > coming to her. Suppose they are as follows:
> > > {1,2,3,4,5};{1,6,7,8,9};{1,2,6,10},{1,3,10}
> > >
> > > Now he comuptes all the possible combinations of what could have happened.
> > > There are many, eg:
> > >
> > > 1->Bob;
> > > 1->Bob;
> > > 1->Bob;
> > > 1->Bob; (call this 1^4)
> > >
> > > Or, 1^2;2;3, or 1^2;10^2. Etc.
> > >
> > > Now we need to build a probability distribution over these traces.  If you
> > > assume that P(X^2) = P(X;Y) where X \= Y, then basically all the possible
> > > traces will have identical probabilities. Now you need to convince me that
> > > P(X^2) = k*P(X;Y) (X \= Y \and k = *large* constant).
> >
> > Translating this back into English, you want me to convince you that the
> > the chance of Alice going to Bob twice is much higher than the chance
> > of Alice going to Bob and Arnold going to Bob. (Is that accurate?)
> 
> Yes, that is correct.
> 
> > Is this the right way to look at it? I've already argued above that in
> > some scenarios, Alice *will* make multiple connections to Bob.
> 
> There is a difference. If you *know* that two connections from the same
> guy was made, you can just do the intersection of anonymity sets. If you
> don't, the above is the way to go.
> 

This is an important distinction between systems like Crowds that
make the connection anonymity depend on the data anonymity and ones
like mixes or OR that don't. It's also why the analysis of Wright et
al. at NDSS last year was misleading. 

> > Whether there are a bunch of other people talking to Bob, and thus perhaps
> > it's much more common to find two unrelated connections to Bob than two
> > connections from the same Alice ... I'm not sure that matters. I guess my
> 
> ????
> 
> > point is that we can look at the chances that Arnold (an ordinary guy) is
> > in the set of suspects, based on some assumed distribution of recipients
> > (I think randomly choosing recipients is a fine first model), and if we
> > find an Alice that's in the set more often than she should be, she is more
> > suspicious. The more often she's in the set, the more suspicious she is.
> 
> She could be talking to *someone else* at the same time. Let me produce a
> refinement of the model to try to convince you:
> Lets' say we have 4 senders on the internet (A-D) and 4 receivers (1-4)
> 
> 4 connections occur involving these guys. (no other connection which
> occurs beforehand involves A-D, 1-4)
> 
> Here are some anonymity sets. They have constraints on them.
> Round 1:
> 1:{ABCD}
> 2:{ABCD}
> 3:{BCD}
> 4:{CD}
> 
> Round 2:
> 3:{ABCD}
> 4:{ABCD}
> 1:{BCD}
> 2:{CD}
> 
> Round 3:
> 1:{ABCD}
> 2:{ABCD}
> 3:{BCD}
> 4:{CD}
> 
> Round 4:
> 2:{ABCD}
> 4:{ABCD}
> 1:{BCD}
> 3:{CD}
> 
> Now we compute all the possibilities of who was communicating to whom at
> each round, eg:
> 
> Round 1:
> 1->A
> 2->B
> 3->C
> 4->D
> 
> Round 2:
> 1->B
> 2->C
> 3->A
> 4->D
> 
> Round 3:
> 1->A
> 2->C
> 3->B
> 4->D
> 
> Round 4:
> 1->B
> 2->A
> 3->C
> 4->D
> 
> In summary, our hypothesis is:
> 1:A;B;A;B
> 2:B;C;C;A
> 3:C;A;B;C
> 4:D;D;D;D
> 
> Now we are going to obtain a probability distribution over those
> hypotheses. If we assume that everyone is as likely to communicate to
> everyone, then all we need to do is multiply the k_i factors corresponding
> to:
> 1:k_1*k_1*k_2*k_2
> 2:k_1*k_1*k_2*k_1
> 3:k_1*k_1*k_1*k_2
> 4:k_1*k_2*k_3*k_4
> 
> Now multiply those number together to obtain how likely your hypothesis
> was to have happened. Now, compute this for all possible hypothesis and
> make a probability distrubution out of them. (i.e. divide each one by the
> sum of all numbers). Now the probability of the above scenario is the
> probability that (4->D)^4 or that 4 sends 4 messages to D. Still with me?
> 
> Ok, The above uses anonymity sets, you can do it using anonymity
> probability distributions as well, as in [SD02]. I demonstrated it here
> for simplicity.
> 

I agree with this analysis, in the context where all senders send one
message per round and all receivers receive one per round. I think
that will be the basic building block. If I wanted to hide my tracks I
would do something more. What you want to do for further analysis is
to build to something more where one entity is several of the senders
and/or several of the receivers as described above, sort of like in my
group logic stuff.

> > > Note that Scenario 1 is basically knowing that the trace we are looking
> > > for is X^4 and deducing that 1^4 is the only possible trace.
> >
> > It's a bit more complex than that -- we may not see all of the connections
> > from Alice, so a 1^3 may be her if there's no 1^4.
> 
> The above assumes that you do. (But yes, you can include the null sender
> and receiver which means you do not see who Alice sent a message to or Bob
> received a message from)
> 
> > Does this mean you agree that this is a strong attack? :)
> 
> I'll leave you to think about that :-))
> 

Wait, which is the strong attack?

> > Can you tie the k_i values into something real so we can have a better
> > intuition about them?
> 
> Well -- if you had a log file for a bunch of servers, you would be able to
> estimate k_i's. For email, intuitively, k_i is large -- you are much more
> liekly to send an email to someone you have sent an email to before than
> just find someone random and sedn them an email. But yes, estimating these
> is a slight difficulty -- you need lots of logs.
>

This is where some automated analysis would be useful. Perhaps
something like the probabilistic model checking that Shmatikov
described at CSFW in looking at Crowds. In general, 
who knows what these k_i's are, but you could either run a bunch
of simulations to generate some, or probably better at this stage,
run a bunch of analyses for different k_i values.

aloha,
Paul