Re: [tor-dev] Two protocols to measure relay-sensitive hidden-service statistics
"A. Johnson" <aaron.m.johnson@xxxxxxxxxxxx> writes:
> Hello tor-dev,
>
> <snip>
>
> Two HS statistics that we (i.e. people working on Sponsor R) are interested in collecting are:
> 1. The number of descriptor fetches received by a hidden-service directory (HSDir)
> 2. The number of client introduction requests at introduction points (IPs)
> The privacy issue with #1 is that the set of HSDirs is (likely) unique to an HS, and so
> the number of descriptor fetches at its HSDirs could reveal the number of clients it had during a
> measurement period. Similarly, the privacy issue with #2 is that the set of IPs is (likely)
> unique to an HS, and so the number of client introductions at its IPs could reveal the number of
> client connections it received.
>
> <snip>
>
> The AnonStats1 protocol to privately publish both statistics if we trust relays not to pollute the
> statistics (i.e. #2 is not a problem) is as follows:
> 1. Each StatAuth provides 2k partially-blind signatures on authentication tokens to each relay in
> a consensus during the measurement period. The blind part of a signed token is simply a random
> number chosen by the relay. The non-blind part of a token consists of the start time of the
> current measurement period. The 2k tokens will allow the relay to submit k values to the
> StatAuths. Note that we could avoid using partially-blind signatures by changing keys at the
> StatAuths every measurement period and then simply providing blind signatures on random numbers.
> 2. At the end of the measurement period, for each statistic, each relay uploads the following
> each on its own Tor circuit and accompanied by a unique token from each StatAuth:
> 1. The count
> 2. The ``statistical weight'' of the relay (1/(# HSDirs) for statistic #1 and the probability of
> selection as an IP for statistic #2)
> 3. The StatAuths verify that each uploaded value is accompanied by a unique token from each
> StatAuth that is valid for the current measurement period. To infer the global statistic from
> the anonymous per-relay statistic, the StatAuths add the counts, add the weights, and divide
> the former by the latter.
>
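For reference, the aggregation in step 3 above can be sketched as follows. This is a minimal illustration of the arithmetic only (the function name and numbers are made up, and token verification is omitted):

```python
# Sketch of the AnonStats1 aggregation step: each relay anonymously
# submits a (count, weight) pair, and the StatAuths estimate the global
# statistic as sum(counts) / sum(weights).

def infer_global_statistic(submissions):
    """submissions: list of (count, weight) pairs, where weight is e.g.
    1/(# HSDirs) for statistic #1. Returns the inferred global count."""
    total_count = sum(c for c, _ in submissions)
    total_weight = sum(w for _, w in submissions)
    return total_count / total_weight

# E.g. three of 3000 HSDirs report their descriptor-fetch counts:
subs = [(120, 1 / 3000), (90, 1 / 3000), (150, 1 / 3000)]
print(infer_global_statistic(subs))  # roughly (120+90+150) / (3/3000)
```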
Some more thoughts on AnonStats1:
- The two statistics proposed here are not independent. I suspect that
the two numbers will actually be quite close to each other, since to
do an intro request you need to first fetch a descriptor.
(In practice, the numbers *will be* different because a user might do
multiple intro requests without fetching the descriptor multiple
times. Or maybe a descriptor fetch failed so the client could not
   follow up with an introduction request.)
My worry is that the numbers might be quite close most of the
time. This means that about 9 relays (6 HSDirs + 3 IPs) will include
that number -- the popularity of the HS -- in their result in the
end. Of course, that number will get smudged along with all the
other measurements that the reporting relay sees, but if the number
is big enough then it will dominate the other measurements and the
actual value might be visible in the results.
The above might sound stupid. Here are some brief calculations:
There are 30000 hidden services and 3000 HSDirs. The recent tech
report shows that each HSDir is responsible for about 150 hidden
services. This means that there are about 150 numbers that get
smudged together every time. If *most* of those 30k hidden services
   are tiny, unpopular ones, there is a non-negligible chance that
most of those 150 numbers are also going to be tiny, which means
that any moderately big number will stand out. And for every
measurement period, there are 9 relays that have a chance of making
this number stand out.
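  A toy simulation of this worry, under assumed (illustrative) numbers:
  an HSDir aggregates ~150 services, 149 of them tiny, plus one popular
  one. The popular service's count then dominates the reported total:

  ```python
  import random

  # Toy model of one HSDir's smudged total: 149 tiny services plus one
  # popular one. All numbers are illustrative, not measured.
  random.seed(1)

  tiny = [random.randint(0, 5) for _ in range(149)]  # unpopular services
  popular = 5000                                     # one popular service
  reported_total = sum(tiny) + popular

  print(reported_total)
  print(popular / reported_total)  # fraction contributed by the big one
  ```

  With numbers like these the popular service contributes the vast
  majority of the total, so its value is nearly readable off the report.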
Another issue here is that if you assume that the popularity of
hidden services doesn't change drastically overnight, and you
believe in the above paragraph, it's even possible to track the
popularity of hidden services even if you don't know their actual
  popularity value. To do that, every day you check the reported
  measurements for numbers close to yesterday's
  numbers. If this happens consistently over a few days, you can be
pretty confident that you have found the popularity of a hidden
service.
To take this stupidity one step further, you can model this whole
thing as a system of 3000 equations with 150 unknown variables
each. Each day you get a new system of equations. It wouldn't
surprise me if the value of most variables is negligible (tiny
  hidden services) and can be ignored. Every time you find the
popularity of a hidden service, you learn the value of another
variable. If you assume that only 300 hidden services generate a
substantial amount of HSDir requests, how many days do you need to
find the value of those 300 variables?
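  A sketch of how such a system could be "peeled": each HSDir's daily
  total is a sum over the (unknown) popularities of the services it
  hosts, and whenever an equation has only one still-unknown variable,
  you can solve it, which may unlock further equations. The function
  and the tiny example below are hypothetical illustrations:

  ```python
  # Peeling solver sketch: equations are sets of variable indices whose
  # values sum to the observed totals; 'known' holds already-learned
  # popularity values. Solve any equation with a single unknown,
  # repeating until no more progress is possible.

  def peel(equations, totals, known):
      progress = True
      while progress:
          progress = False
          for vars_, total in zip(equations, totals):
              unknown = [v for v in vars_ if v not in known]
              if len(unknown) == 1:
                  known[unknown[0]] = total - sum(
                      known[v] for v in vars_ if v != unknown[0])
                  progress = True
      return known

  # Three services; var 0's popularity was learned on an earlier day:
  true_pop = {0: 100, 1: 3, 2: 2}
  eqs = [{0, 1}, {1, 2}]
  tot = [sum(true_pop[v] for v in e) for e in eqs]
  solved = peel(eqs, tot, known={0: 100})
  print(solved)  # recovers vars 1 and 2 as well
  ```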
  Unfortunately, this is one of those worries that is hard to address
  without first building the whole thing and seeing what the actual
  numbers look like...
- And even if all the above is garbage, I'm still a bit concerned
about the fact that the popularity of the *most popular* hidden
service will be trackable using the above scheme. That's because the
most popular hidden service will almost always dominate the other
measurements.
- Also, the measurement period will have to change. Currently, each
relay sends its extrainfo descriptor every 24 hours. For the
AnonStats1 scheme to work, the measurement period needs to be
  non-deterministic; otherwise the StatAuths can link relay
measurements over different days based on when they reported stats.
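  One simple way to do this (a hypothetical sketch, not an existing Tor
  mechanism): each relay delays its upload by a random amount within the
  next period, so timing no longer identifies it across days.

  ```python
  import random

  # Hypothetical randomized reporting delay: instead of uploading at a
  # fixed 24h mark, a relay picks a uniformly random time within the
  # next period, breaking the timing link between consecutive reports.
  PERIOD = 24 * 60 * 60  # measurement period, in seconds

  def next_report_time(period_start):
      return period_start + random.uniform(0, PERIOD)
  ```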
_______________________________________________
tor-dev mailing list
tor-dev@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev