[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]

Re: Safely collecting data to estimate the number of Tor users



On Sun, 29 Aug 2010 13:17:35 +0200
BjÃrn Scheuermann <scheuermann@xxxxxxxxxxxxxxxxxxxxx> wrote:

> Steven J. Murdoch wrote:
> > Another thing to do is to shrink the hash size. If we assume that
> > there are going to be, say, no more than 1 million distinct IP
> > addresses, we could use a 40 bit hash with only a small number of
> > collisions (due to the Birthday Paradox). However, someone who tries
> > to reverse the full 32 bit IPv4 space will get many collisions.
> 
> with a 40 bit hash, the vast majority of 2^32 IP addresses will not
> collide with any other IP address, despite the birthday paradox. So this
> would not really be a significant additional level of protection. It
> would make a difference only for the small fraction of the IP addresses
> that actually do collide.
> 
> However, you don't need that many bits anyway, as collisions are not a
> big issue when your goal is to only count how many different addresses
> you have seen. If you proceed as you suggested, you can, as detailed
> below, do well with a 21-bit hash. With additional tweaks, you can
> further reduce that to 12-bit hash values.

And then it's more efficient to use FM sketches rather than a list of
hash values.  The *only* problem I had with your first message in this
thread was your suggestion that the FM sketches could be exchanged
between the operators; the FM sketches may still be quite sensitive
(only $AGENCY knows exactly how sensitive they are; we really can't
tell).

Other than the fact that the FM sketches must remain secret, your
approach is also safer than Dr. Murdoch's idea.


There is a slight improvement, though:

* Each instrumented Tor instance should generate an FM sketch in memory
  as it is contacted by directory clients.  This loses most of the
  sensitive information instantly.

* Every N minutes, the Tor instance should encrypt its current FM
  sketch with Dr. Loesing's public key for this purpose, base64-encode
  it, and write it to the log on a single line.  As long as the
  encryption algorithm is randomized, this packages up the rest of the
  sensitive information so only one trusted person can recover it.  The
  Tor instance then erases its FM sketch and starts accumulating a new
  one.

* The Tor operators use one or more invocations of grep to extract
  *only* the encrypted FM sketches from the log.  Sort the resulting
  list, and delete the original log files.  This loses any potentially
  sensitive timing information, so that Dr. Loesing can't recover it.
  Each operator should then merge the encrypted-sketch lists produced
  by all of his/her/its Tor relays into one sorted list before sending
  it in for analysis.

* Dr. Loesing decrypts all of the FM sketches, ORs them together, and
  publishes only the resulting count.  Then he erases the decrypted FM
  sketches and sets his private key on fire.


Robert Ransom

Attachment: signature.asc
Description: PGP signature