Publishing sanitized bridge descriptors



Hi everyone,

I'm planning to publish a sanitized version of the bridge descriptors that our bridge authority Tonga gathers. The general idea is to make public all the data that we gather for statistical purposes. There are several reasons for doing so: transparency towards our community, restricting ourselves to gathering only those statistics that we think are safe to make public, allowing others to do the same research as we do, etc.

The bridge descriptors contain IP addresses and other contact information of bridges that we don't want to give away. Doing so would defeat the purpose of bridges, after all.

Here are the steps that we're taking to remove all potentially sensitive information from bridge descriptors before publication:

1. Replace the bridge identity with its SHA1 value

Clients can request a bridge's current descriptor by sending its identity string to the bridge authority. This is a feature to make bridges on dynamic IP addresses useful. Therefore, the original identities (and anything that could be used to derive them) need to be removed from the descriptors. The bridge identity is replaced with its SHA1 hash value. The idea is to have a consistent replacement that remains stable over months or even years (without keeping a secret for a keyed hash function).
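To illustrate step 1, here is a minimal sketch of an unkeyed, stable replacement. It assumes the identity is available as a hex-encoded fingerprint and that the hash is taken over the decoded bytes; the class and method names are made up for this example and are not taken from the sanitizer's source.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the actual sanitizer.
final class IdentityHasher {
    private IdentityHasher() {}

    /** Decodes a hex string into raw bytes, ignoring spaces between groups. */
    static byte[] hexToBytes(String hex) {
        String h = hex.replace(" ", "");
        if (h.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex string");
        }
        byte[] out = new byte[h.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(h.charAt(2 * i), 16);
            int lo = Character.digit(h.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex string");
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    /** Returns the upper-case hex SHA-1 digest of the decoded identity bytes. */
    static String hashIdentity(String hexIdentity) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(hexToBytes(hexIdentity));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02X", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a standard algorithm that Java platforms must provide.
            throw new IllegalStateException(e);
        }
    }
}
```

Because no secret is involved, the same identity always maps to the same replacement, which is what keeps the sanitized descriptors linkable over months or years.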

2. Remove all cryptographic keys and signatures

It would be straightforward to derive the bridge identity from the bridge's public key. Replacing keys with newly generated ones seemed unnecessary (and would require keeping state over months or years), so all cryptographic objects are simply removed.
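A rough sketch of step 2 follows. It assumes the descriptor has already been split into lines, that key and signature material appears as PEM-style "-----BEGIN ...-----" / "-----END ...-----" blocks, and that the keyword list shown is sufficient; that list is illustrative only, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the actual sanitizer.
final class CryptoStripper {
    private CryptoStripper() {}

    // Keywords that introduce key or signature material (illustrative, not exhaustive).
    private static final Set<String> KEY_KEYWORDS = new HashSet<>(
            Arrays.asList("onion-key", "signing-key", "router-signature"));

    /** Returns the descriptor lines without key/signature keywords and PEM-style blocks. */
    static List<String> stripCrypto(List<String> lines) {
        List<String> out = new ArrayList<>();
        boolean inPemBlock = false;
        for (String line : lines) {
            if (inPemBlock) {
                // Drop everything up to and including the END marker.
                if (line.startsWith("-----END ")) {
                    inPemBlock = false;
                }
                continue;
            }
            if (line.startsWith("-----BEGIN ")) {
                inPemBlock = true;
                continue;
            }
            String keyword = line.split(" ", 2)[0];
            if (KEY_KEYWORDS.contains(keyword)) {
                continue;
            }
            out.add(line);
        }
        return out;
    }
}
```

Dropping the objects outright means there is no mapping from old keys to replacement keys that would have to be stored between runs.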

3. Replace IP address with 127.0.0.1

Of course, the IP address needs to be removed, too. Before it is removed, however, the IP address is resolved to a country code, and the result is written to the contact line, e.g. "somebody at example dot de" for Germany. The ports are kept unchanged.

4. Replace contact information

If there is contact information in a descriptor, the contact line is changed to "somebody at ...". If there is none, a contact line is added saying "nobody at ..." in order to put in the country code.
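The contact rewriting could look roughly like the sketch below. It assumes the country code has already been looked up from the IP address (the lookup itself is not shown), that the "nobody" variant uses the same "example dot <cc>" form as the "somebody" example, and that a missing contact line is simply appended at the end. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the actual sanitizer.
final class ContactSanitizer {
    private ContactSanitizer() {}

    /**
     * Replaces an existing contact line with "somebody at ...", or appends a
     * "nobody at ..." line if the descriptor has none, so that the country
     * code is always present.
     */
    static List<String> sanitizeContact(List<String> lines, String countryCode) {
        String domain = "example dot " + countryCode.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        boolean hadContact = false;
        for (String line : lines) {
            if (line.equals("contact") || line.startsWith("contact ")) {
                if (!hadContact) {
                    out.add("contact somebody at " + domain);
                }
                hadContact = true;
            } else {
                out.add(line);
            }
        }
        if (!hadContact) {
            // Assumption: the position of the added line does not matter here.
            out.add("contact nobody at " + domain);
        }
        return out;
    }
}
```

The "somebody"/"nobody" distinction preserves one bit of the original information: whether the operator had set any contact details at all.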

5. Replace nickname with Unnamed

The bridge nicknames might give hints on the location of the bridge if chosen without care; e.g. a bridge nickname might be very similar to the nicknames of the operator's relays, which might be located on adjacent IP addresses. All bridge nicknames are therefore replaced with the string Unnamed.
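Steps 3 and 5 both affect the descriptor's router line, so they can be sketched together. This assumes the line has the form "router <nickname> <address> <ports...>" with whitespace-separated fields; the class name is hypothetical and the error handling is minimal.

```java
// Hypothetical helper, not part of the actual sanitizer.
final class RouterLineSanitizer {
    private RouterLineSanitizer() {}

    /** Replaces nickname and address on a router line; ports stay unchanged. */
    static String sanitizeRouterLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 3 || !parts[0].equals("router")) {
            throw new IllegalArgumentException("not a router line: " + line);
        }
        parts[1] = "Unnamed";    // step 5: nickname
        parts[2] = "127.0.0.1";  // step 3: IP address
        return String.join(" ", parts);
    }
}
```

For example, under these assumptions the line "router MyBridge 192.0.2.17 443 0 0" would become "router Unnamed 127.0.0.1 443 0 0".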

Note that these processing steps only prevent people from learning about new bridge locations. People who already know a bridge identity or location can easily learn more about this bridge from the sanitized descriptors. This is useful for statistical analysis, e.g. to filter out bridges that have been running as relays before.

The Java application that does all the parsing, replacing, and rewriting can be found here:

https://tor-svn.freehaven.net/svn/projects/archives/trunk/bridge-desc-sanitizer/

Here is a sample of the bridge descriptors of October 2008 (not 2009, in case there turn out to be sensitive parts in there):

http://freehaven.net/~karsten/volatile/bridges-2008-10.tar.bz2 (4.6 MB)

Are there any sensitive parts in that tarball that we don't want to publish?

Thanks,
--Karsten