Publishing sanitized bridge descriptors
Hi everyone,
I'm planning to publish a sanitized version of the bridge descriptors
that our bridge authority Tonga gathers. The general idea behind this
is to make all data public that we gather for statistical purposes.
There are several reasons for doing so: transparency towards our
community, restricting ourselves to gathering only those statistics
that we think are safe to make public, allowing others to do the same
research as we do, etc.
The bridge descriptors contain IP addresses and other contact
information of bridges that we don't want to give away. Doing so would
defeat the purpose of bridges, after all.
Here are the steps that we're taking to remove all potentially
sensitive information from bridge descriptors before publication:
1. Replace the bridge identity with its SHA1 value
Clients can request a bridge's current descriptor by sending its
identity string to the bridge authority. This is a feature to make
bridges on dynamic IP addresses useful. Therefore, the original
identities (and anything that could be used to derive them) need to be
removed from the descriptors. The bridge identity is replaced with its
SHA1 hash value. The idea is to have a consistent replacement that
remains stable over months or even years (without keeping a secret for
a keyed hash function).
2. Remove all cryptographic keys and signatures
It would be straightforward to learn the bridge identity from the
bridge's public key. Replacing keys with newly generated ones seemed
unnecessary (and would involve keeping state over months or years), so
all cryptographic objects are simply removed.
3. Replace IP address with 127.0.0.1
Of course, the IP address needs to be removed, too. However, the IP
address is resolved to a country code first and the result written to
the contact line as "somebody at example dot de" for Germany, etc. The
ports are kept unchanged though.
4. Replace contact information
If there is contact information in a descriptor, the contact line is
changed to "somebody at ...". If there is none, a contact line is
added saying "nobody at ..." in order to put in the country code.
5. Replace nickname with Unnamed
Bridge nicknames might give hints about the location of the bridge if
chosen without care; e.g. a bridge nickname might be very similar to
the nickname of a relay run by the same operator, which might be
located on an adjacent IP address. All bridge nicknames are therefore
replaced with the string Unnamed.
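To make steps 2 to 5 more concrete, here is a rough sketch in Java. It
is not the code from the sanitizer itself, only an illustration under
some assumptions: the descriptor is given as a list of lines, the
"router" line has the nickname and address as its second and third
fields, keys and signatures appear as keyword lines followed by
"-----BEGIN ...-----" / "-----END ...-----" blocks, and the country code
has already been looked up elsewhere and is passed in. The class and
method names are made up, the "nobody at example dot ..." wording is
inferred from the "somebody" case, and appending a missing contact line
at the end is an arbitrary choice. Step 1 (the identity hash) is left
out here, so fingerprint lines pass through untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DescriptorSanitizerSketch {

    /**
     * Applies steps 2 to 5 to one descriptor. countryCode is assumed to be
     * the already-resolved lower-case country code of the bridge's IP address.
     */
    static List<String> sanitize(List<String> lines, String countryCode) {
        List<String> out = new ArrayList<>();
        boolean inCryptoBlock = false;
        boolean sawContact = false;
        for (String line : lines) {
            // Step 2: skip everything inside a BEGIN/END block.
            if (inCryptoBlock) {
                if (line.startsWith("-----END")) {
                    inCryptoBlock = false;
                }
                continue;
            }
            if (line.startsWith("-----BEGIN")) {
                inCryptoBlock = true;
                continue;
            }
            // Step 2: skip the keyword lines that introduce keys and signatures.
            if (line.startsWith("onion-key") || line.startsWith("signing-key")
                    || line.startsWith("router-signature")) {
                continue;
            }
            if (line.startsWith("router ")) {
                String[] parts = line.split(" ");
                if (parts.length >= 3) {
                    parts[1] = "Unnamed";    // step 5: replace nickname
                    parts[2] = "127.0.0.1";  // step 3: replace IP, keep ports
                    out.add(String.join(" ", parts));
                }
                // A malformed router line is dropped rather than copied.
            } else if (line.startsWith("contact ")) {
                // Step 4: replace existing contact information.
                out.add("contact somebody at example dot " + countryCode);
                sawContact = true;
            } else {
                out.add(line);
            }
        }
        if (!sawContact) {
            // Step 4: add a contact line so the country code has a place to go.
            out.add("contact nobody at example dot " + countryCode);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> descriptor = Arrays.asList(
                "router MyBridge 198.51.100.7 443 0 0",
                "contact Alice <alice@example.org>",
                "onion-key",
                "-----BEGIN RSA PUBLIC KEY-----",
                "AAAA",
                "-----END RSA PUBLIC KEY-----",
                "bandwidth 1 2 3");
        for (String line : sanitize(descriptor, "de")) {
            System.out.println(line);
        }
    }
}
```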
Note that these processing steps only prevent people from learning
about new bridge locations. People who already know a bridge identity
or location can easily learn more about this bridge from the sanitized
descriptors. This is useful for statistical analysis, e.g. to filter
out bridges that have been running as relays before.
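As an illustration of that last point, someone who already knows a
bridge identity could compute the replacement value from step 1
themselves and look it up in the sanitized descriptors. The sketch below
assumes that the identity is available as a hex-encoded fingerprint and
that the SHA-1 is taken over the decoded bytes, with the result written
as upper-case hex. Both are assumptions for the sake of the example, not
a statement about what the sanitizer does internally, and the class and
method names are made up.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BridgeIdentityHash {

    /** Decodes a hex string (assumed valid and of even length) into bytes. */
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Encodes bytes as upper-case hex. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    /** Returns the SHA-1 of the decoded fingerprint bytes as hex. */
    static String hashIdentity(String fingerprintHex) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            return toHex(sha1.digest(fromHex(fingerprintHex)));
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up fingerprint, for illustration only.
        String known = "0123456789ABCDEF0123456789ABCDEF01234567";
        System.out.println(hashIdentity(known));
    }
}
```

Because the hash is unkeyed, the same input always gives the same
output, which is what keeps the replacement stable over time without
any stored secret.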
The Java application that does all the parsing, replacing, and
rewriting can be found here:
https://tor-svn.freehaven.net/svn/projects/archives/trunk/bridge-desc-sanitizer/
Here is a sample of the bridge descriptors of October 2008 (not 2009,
in case there turn out to be sensitive parts in there):
http://freehaven.net/~karsten/volatile/bridges-2008-10.tar.bz2 (4.6
MB)
Are there any sensitive parts in that tarball that we don't want to
publish?
Thanks,
--Karsten