Re: [tor-dev] Two protocols to measure relay-sensitive hidden-service statistics
"A. Johnson" <aaron.m.johnson@xxxxxxxxxxxx> writes:
> Hello tor-dev,
>
> <snip>
>
> Two HS statistics that we (i.e. people working on Sponsor R) are interested in collecting are:
> 1. The number of descriptor fetches received by a hidden-service directory (HSDir)
> 2. The number of client introduction requests at introduction points (IPs)
> The privacy issue with #1 is that the set of HSDirs is (likely) unique to an HS, and so
> the number of descriptor fetches at its HSDirs could reveal the number of clients it had during a
> measurement period. Similarly, the privacy issue with #2 is that the set of IPs is (likely)
> unique to an HS, and so the number of client introductions at its IPs could reveal the number of
> client connections it received.
>
> <snip>
>
> The AnonStats1 protocol to privately publish both statistics if we trust relays not to pollute the
> statistics (i.e. #2 is not a problem) is as follows:
> 1. Each StatAuth provides 2k partially-blind signatures on authentication tokens to each relay in
> a consensus during the measurement period. The blind part of a signed token is simply a random
> number chosen by the relay. The non-blind part of a token consists of the start time of the
> current measurement period. The 2k tokens will allow the relay to submit k values to the
> StatAuths. Note that we could avoid using partially-blind signatures by changing keys at the
> StatAuths every measurement period and then simply providing blind signatures on random numbers.
> 2. At the end of the measurement period, for each statistic, each relay uploads the following
> each on its own Tor circuit and accompanied by a unique token from each StatAuth:
> 1. The count
> 2. The ``statistical weight'' of the relay (1/(# HSDirs) for statistic #1 and the probability of
> selection as an IP for statistic #2)
> 3. The StatAuths verify that each uploaded value is accompanied by a unique token from each
> StatAuth that is valid for the current measurement period. To infer the global statistic from
> the anonymous per-relay statistic, the StatAuths add the counts, add the weights, and divide
> the former by the latter.
>
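For reference, the aggregation in step 3 above can be sketched as follows. This is a minimal illustration of the arithmetic only (the function name and numbers are made up, and token verification is omitted):

```python
# Sketch of the AnonStats1 aggregation step: each relay anonymously
# submits a (count, weight) pair, and the StatAuths estimate the global
# statistic as sum(counts) / sum(weights).

def infer_global_statistic(submissions):
    """submissions: list of (count, weight) pairs, where weight is e.g.
    1/(# HSDirs) for statistic #1. Returns the inferred global count."""
    total_count = sum(c for c, _ in submissions)
    total_weight = sum(w for _, w in submissions)
    return total_count / total_weight

# E.g. three of 3000 HSDirs report their descriptor-fetch counts:
subs = [(120, 1 / 3000), (90, 1 / 3000), (150, 1 / 3000)]
print(infer_global_statistic(subs))  # roughly (120+90+150) / (3/3000)
```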
Some more thoughts on AnonStats1:
- The two statistics proposed here are not independent. I suspect that
the two numbers will actually be quite close to each other, since to
do an intro request you need to first fetch a descriptor.
(In practice, the numbers *will be* different because a user might do
multiple intro requests without fetching the descriptor multiple
times. Or maybe a descriptor fetch failed so the client could not
   follow up with an introduction request.)
My worry is that the numbers might be quite close most of the
time. This means that about 9 relays (6 HSDirs + 3 IPs) will include
that number -- the popularity of the HS -- in their result in the
end. Of course, that number will get smudged along with all the
other measurements that the reporting relay sees, but if the number
is big enough then it will dominate the other measurements and the
actual value might be visible in the results.
The above might sound stupid. Here are some brief calculations:
There are 30000 hidden services and 3000 HSDirs. The recent tech
report shows that each HSDir is responsible for about 150 hidden
services. This means that there are about 150 numbers that get
smudged together every time. If *most* of those 30k hidden services
   are tiny, unpopular ones, there is a non-negligible chance that
most of those 150 numbers are also going to be tiny, which means
that any moderately big number will stand out. And for every
measurement period, there are 9 relays that have a chance of making
this number stand out.
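  A toy simulation of this worry, under assumed (illustrative) numbers:
  an HSDir aggregates ~150 services, 149 of them tiny, plus one popular
  one. The popular service's count then dominates the reported total:

  ```python
  import random

  # Toy model of one HSDir's smudged total: 149 tiny services plus one
  # popular one. All numbers are illustrative, not measured.
  random.seed(1)

  tiny = [random.randint(0, 5) for _ in range(149)]  # unpopular services
  popular = 5000                                     # one popular service
  reported_total = sum(tiny) + popular

  print(reported_total)
  print(popular / reported_total)  # fraction contributed by the big one
  ```

  With numbers like these the popular service contributes the vast
  majority of the total, so its value is nearly readable off the report.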
Another issue here is that if you assume that the popularity of
hidden services doesn't change drastically overnight, and you
believe in the above paragraph, it's even possible to track the
popularity of hidden services even if you don't know their actual
  popularity value. To do that, every day you check the reported
  measurements for numbers close to yesterday's
  numbers. If this happens consistently over a few days, you can be
pretty confident that you have found the popularity of a hidden
service.
To take this stupidity one step further, you can model this whole
thing as a system of 3000 equations with 150 unknown variables
each. Each day you get a new system of equations. It wouldn't
surprise me if the value of most variables is negligible (tiny
  hidden services) and can be ignored. Every time you find the
popularity of a hidden service, you learn the value of another
variable. If you assume that only 300 hidden services generate a
substantial amount of HSDir requests, how many days do you need to
find the value of those 300 variables?
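  A sketch of how such a system could be "peeled": each HSDir's daily
  total is a sum over the (unknown) popularities of the services it
  hosts, and whenever an equation has only one still-unknown variable,
  you can solve it, which may unlock further equations. The function
  and the tiny example below are hypothetical illustrations:

  ```python
  # Peeling solver sketch: equations are sets of variable indices whose
  # values sum to the observed totals; 'known' holds already-learned
  # popularity values. Solve any equation with a single unknown,
  # repeating until no more progress is possible.

  def peel(equations, totals, known):
      progress = True
      while progress:
          progress = False
          for vars_, total in zip(equations, totals):
              unknown = [v for v in vars_ if v not in known]
              if len(unknown) == 1:
                  known[unknown[0]] = total - sum(
                      known[v] for v in vars_ if v != unknown[0])
                  progress = True
      return known

  # Three services; var 0's popularity was learned on an earlier day:
  true_pop = {0: 100, 1: 3, 2: 2}
  eqs = [{0, 1}, {1, 2}]
  tot = [sum(true_pop[v] for v in e) for e in eqs]
  solved = peel(eqs, tot, known={0: 100})
  print(solved)  # recovers vars 1 and 2 as well
  ```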
  Unfortunately, this is one of those worries that is hard to address
  without first building the whole thing and seeing what the actual
  numbers look like...
- And even if all the above is garbage, I'm still a bit concerned
about the fact that the popularity of the *most popular* hidden
service will be trackable using the above scheme. That's because the
most popular hidden service will almost always dominate the other
measurements.
- Also, the measurement period will have to change. Currently, each
relay sends its extrainfo descriptor every 24 hours. For the
AnonStats1 scheme to work, the measurement period needs to be
  non-deterministic; otherwise the StatAuths can link relay
measurements over different days based on when they reported stats.
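  One simple way to do this (a hypothetical sketch, not an existing Tor
  mechanism): each relay delays its upload by a random amount within the
  next period, so timing no longer identifies it across days.

  ```python
  import random

  # Hypothetical randomized reporting delay: instead of uploading at a
  # fixed 24h mark, a relay picks a uniformly random time within the
  # next period, breaking the timing link between consecutive reports.
  PERIOD = 24 * 60 * 60  # measurement period, in seconds

  def next_report_time(period_start):
      return period_start + random.uniform(0, PERIOD)
  ```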
_______________________________________________
tor-dev mailing list
tor-dev@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev