
Re: Anonymity-preserving collection of usage data of a hidden service authoritative directory




Hi Nick,

>> 1. One of the directory server operators could temporarily add a handful
>> of logging statements to the code, writing publish and fetch requests
>> for hidden service descriptors, including the requested onion addresses,
>> to a log file. He could then anonymize the onion addresses by
>> consistently replacing them with something like hash(onion address +
>> "some random number to forget afterwards") and publish the result on the
>> Tor homepage. Anyone could compute statistics from the data, but nobody
>> would be able to identify a particular hidden service.
> 
> This is vulnerable to some kinds of attacks.  For instance, if I
> wanted to see the statistics for foo.onion, I could make a bunch of
> requests for foo.onion at 12:03, then a bunch at 12:19, then a bunch
> at 12:42, and then look through the published statistics to see which
> "anonymized" address had a lot of requests at those times.  Not so
> good.
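To check that I understand this: with published per-service counts, the
attacker only needs to look for the pseudonym whose counts spike in
exactly the intervals where she sent her bursts. A small sketch with
made-up pseudonyms and numbers (nothing here is real data):

  # Published "anonymized" request counts per 15-minute interval,
  # keyed by pseudonym; all values are invented for illustration.
  published = {
      "3f2a": [2, 1, 3, 2, 1, 2],
      "9c4e": [1, 52, 2, 49, 1, 47],  # spikes in the probed intervals
      "b7d1": [4, 3, 5, 2, 4, 3],
  }
  probe_intervals = [1, 3, 5]  # when the attacker sent her bursts

  # The pseudonym with the largest total in the probed intervals is
  # almost certainly the targeted foo.onion.
  suspect = max(published,
                key=lambda p: sum(published[p][i] for i in probe_intervals))
  print(suspect)  # prints "9c4e"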

I agree with you that such attacks against this solution are possible.
My idea was to perform this analysis only once or twice (e.g. next week
and in two months), without announcing the exact time beforehand.
Someone performing the attack you describe would need to access the
hidden service continuously, from the day we announce the analysis
until the publication of the results. But what would she have gained?
She knows when the service was available anyway, because she accessed
it all day and night. :)

But I see your point. There might be anonymity implications we have not
thought of. And it's not the most transparent way to collect the data...
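Still, for concreteness, this is roughly the consistent replacement I
had in mind in solution 1. It is only a sketch: the SHA-1-over-
concatenation construction mirrors the hash(onion address + random
number) idea above, and all names are made up, not actual Tor code:

  import hashlib
  import os

  # Secret generated once for the analysis and deleted afterwards;
  # without it, pseudonyms cannot be mapped back to onion addresses.
  secret = os.urandom(32)

  def pseudonymize(onion_address):
      """Consistently replace an onion address with a keyed hash."""
      return hashlib.sha1(secret + onion_address.encode("ascii")).hexdigest()

  # The same (hypothetical) address always maps to the same pseudonym,
  # so per-service statistics remain possible without revealing the
  # address itself.
  assert pseudonymize("2a3b4c5d6e7f2a3b.onion") == \
         pseudonymize("2a3b4c5d6e7f2a3b.onion")

A keyed construction like HMAC would be a more careful choice than
plain concatenation, but the property is the same: equal addresses map
to equal pseudonyms, and without the secret nobody can reverse the
mapping or test candidate addresses against it.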

>> 2. We extend the code permanently to create a new status page containing
>> hidden service directory activity. This could include _aggregated_
>> values, e.g. number of fetches in 15-minute intervals of the last 24
>> hours (comparable to bandwidth measurement).
> 
> I think the second-safest thing would be to do a combination of the
> two approaches: collect information in RAM, and dump totals on a
> 12-hour basis, with actual addresses hidden by hashing them
> concatenated to a random value that the code never actually discloses.
> This is still vulnerable to some statistical attacks as above, but
> less so.

In the second approach I would not include entries for individual
hidden services at all, only aggregated values. Otherwise this would
just be solution 1 again, with the same anonymity issues.

> The single-safest thing would be not to collect this information at
> all.  Maybe we should look carefully at what we hope to learn from it,
> and try to collect the statistics we want, rather than the actual raw
> data.  (In other words, if you want to know how many hidden services
> are active at a given time, it's far safer to have the code output the
> total, rather than having the code output a list of hidden services
> which you then proceed to count.)
> 
> So, I'll start by asking: what statistics about hidden services are
> you hoping to collect?

Bad news first: the disadvantage of fixing the aggregation before
seeing the real data is that you cannot discover unexpected patterns
through exploratory data analysis. It's not that I haven't thought
about what I expect to find, but I cannot claim that the following list
is complete. :(

These are some questions I would like to have answered by the data:

1.) I would like to find out how frequently a hidden service (HS)
authoritative directory is asked to publish or fetch rendezvous service
descriptors (RSDs). This affects the required network traffic and CPU
cycles. A special case of this question is whether most traffic is
produced for a small set of HSs, or whether all HSs are about equally
active and accessed.

2.) I would like to know how many RSDs exist at any given time. This
determines how much storage is needed and indicates how large
replication messages containing these RSDs can become.

This is my specification for changing Tor to provide aggregated data to
answer these questions:

My proposal is to add a new status page comparable to the network status
with entries that are built like write-history and read-history in
server descriptors. For each entry there is one aggregated value per
interval of 900 seconds (15 minutes) for a total of 96 intervals (1 day):

--- begin of specification ---

"publish-total-history" YYYY-MM-DD HH:MM:SS (NSEC s) NUM... NL
   total number of valid publish requests observed in the interval

"publish-novel-history" YYYY-MM-DD HH:MM:SS (NSEC s) NUM... NL
   total number of valid publish requests that contain a novel RSD, i.e.
   one with a currently unknown ID

"publish-top-5-percent-history" YYYY-MM-DD HH:MM:SS (NSEC s) NUM... NL
"publish-top-10-percent-history" YYYY-MM-DD HH:MM:SS (NSEC s) NUM... NL
"publish-top-20-percent-history" YYYY-MM-DD HH:MM:SS (NSEC s) NUM... NL
   total number of valid publish requests containing an RSD for one of
   the top 5 (10, 20) percent of all HSs (ordered by number of publish
   requests); can help to figure out which share of publish requests
   (probably the non-novel ones) comes from the most active HSs; see
   the sketch after the specification for one way to compute these
   shares

"fetch-total-history" YYYY-MM-DD HH:MM:SS (NSEC s) NUM... NL
   total number of valid fetch requests observed in the interval

"fetch-successful-history" YYYY-MM-DD HH:MM:SS (NSEC s) NUM... NL
   total number of valid and successful fetch requests observed in the
   interval

"fetch-top-5-percent-history" YYYY-MM-DD HH:MM:SS (NSEC s) NUM... NL
"fetch-top-10-percent-history" YYYY-MM-DD HH:MM:SS (NSEC s) NUM... NL
"fetch-top-20-percent-history" YYYY-MM-DD HH:MM:SS (NSEC s) NUM... NL
   total number of valid fetch requests asking for one of the top 5 (10,
   20) percent of all HSs (ordered by number of fetch requests); can
   help to figure out whether there are hot spots among the HSs

"desc-total-history" YYYY-MM-DD HH:MM:SS (NSEC s) NUM... NL
   total number of current RSDs at the end of the interval

--- end of specification ---
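As a sketch of how this could be implemented (a Python sketch; the
class and function names are mine, not Tor's, and I read the
top-x-percent entries as per-interval shares, which is only one
possible interpretation): raw onion addresses would be counted in RAM
for the current interval only, and nothing but the aggregated totals
would ever be written out:

  from collections import Counter
  from datetime import datetime

  INTERVAL_SECS = 900  # 15 minutes; 96 intervals make up the 1-day history

  class HsDirStats:
      """Per-interval counters for one HS authoritative directory."""

      def __init__(self):
          # Per-address request counts, kept in RAM for the current
          # interval only; never written out as individual entries.
          self.publishes = Counter()
          self.fetches = Counter()
          self.novel = 0
          self.fetch_ok = 0

      def note_publish(self, onion_address, is_novel):
          self.publishes[onion_address] += 1
          if is_novel:
              self.novel += 1

      def note_fetch(self, onion_address, successful):
          self.fetches[onion_address] += 1
          if successful:
              self.fetch_ok += 1

      def top_percent_total(self, counter, percent):
          """Requests that went to the top `percent` of services,
          ordered by request count, within this interval."""
          counts = sorted(counter.values(), reverse=True)
          if not counts:
              return 0
          k = max(1, len(counts) * percent // 100)
          return sum(counts[:k])

  def format_history(keyword, interval_end, values):
      """Render one history line in the write-history style."""
      timestamp = interval_end.strftime("%Y-%m-%d %H:%M:%S")
      nums = ",".join(str(v) for v in values)
      return "%s %s (%d s) %s" % (keyword, timestamp, INTERVAL_SECS, nums)

  # Example with a hypothetical address; a real implementation would
  # keep the last 96 per-interval totals and drop the raw counters at
  # each interval boundary.
  stats = HsDirStats()
  stats.note_publish("2a3b4c5d6e7f2a3b.onion", is_novel=True)
  stats.note_publish("2a3b4c5d6e7f2a3b.onion", is_novel=False)
  print(format_history("publish-total-history",
                       datetime(2007, 4, 20, 12, 0, 0),
                       [sum(stats.publishes.values())]))
  # publish-total-history 2007-04-20 12:00:00 (900 s) 2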

These are my ideas. What do you think: is this data too revealing? Or
did I miss an important measurement?

And what do you think: should I rather keep trying to convince you that
solution 1 is not that bad after all, or start coding solution 2? ;)
Seriously, though, I am not afraid of the implementation, but I would
not want to put too much work into it if you would not include it in
your code anyway.

Karsten