
Safely collecting data to estimate the number of Tor users



Hi everyone,

in the past year or so, we have put some effort into finding out how many
people use the Tor network every day. We estimate that there are 500,000
daily users, but we have no good data to support this estimate. We'd
like to be more certain about the user count in order to understand the
Tor network better and hopefully improve it.

We have started writing down the current state of counting users in a
privacy-preserving way. Note that this is just a draft that is going to
change over time:


https://gitweb.torproject.org/karsten/metrics.git/blob_plain/refs/heads/counting-users:/report/counting-users/countingusers.pdf

One of the more promising approaches to count Tor users is to count
unique client IP addresses on a fast directory mirror (see Section 3.2
"Count unique IP addresses of connecting clients..."). We make use of
the fact that clients send out 20 to 80 directory requests per day and
very likely contact every fast directory mirror at least once. This is
going to change with the directory guard design, though. We'll need to
come up with a way to combine the findings of multiple directory guards.
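To get a feel for the "very likely contact every fast directory mirror"
claim, here is a back-of-the-envelope sketch in Python. The independence
assumption and the 5% traffic share are illustrative only; real clients
pick directory mirrors weighted by bandwidth, so a fast mirror's actual
share may differ:

```python
def miss_probability(mirror_share, requests_per_day):
    """Probability that a client sends none of its directory requests to
    a given mirror, assuming each request independently reaches that
    mirror with probability mirror_share (the mirror's share of
    directory traffic)."""
    return (1.0 - mirror_share) ** requests_per_day

# Assumed numbers: a fast mirror seeing 5% of directory traffic, and the
# 20-to-80 requests/day range from above.
low = miss_probability(0.05, 20)    # client on the low end of requests
high = miss_probability(0.05, 80)   # client on the high end of requests
```

With 80 requests per day the miss probability drops below 2% under these
assumptions, which is why a single fast mirror already sees most clients.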

So, here's my plan for researching this more: I'd like to run an
experiment with multiple fast directory mirrors run by the same operator
on the same host (like Jake's trusted and Pandora*, Olaf's blutmagie*,
Moritz's torserversNet*, etc.). I'm going to write a patch for Tor to
accept some key string in its torrc and extend SafeLogging to accept the
value 'encrypt'. Tor will then pass all client IP addresses through a
keyed hash function using the provided key string and write the result
to its logs. I'm also going to implement #1668 to make log granularity
configurable. The operators would configure the same key string for all
their relays and run them with the new SafeLogging option and a logging
granularity of 15 minutes for, say, a week. They would then delete the
key string and keep only the logs. The operators would not give out
these logs to me or anyone else. I'm going to write Python scripts to
analyze the logs and publish the scripts for the operators and others to
review. The operators would run these scripts and publish the results.
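As a minimal sketch of the keyed-hash step in Python (the function name
and key handling here are illustrative, not the actual Tor patch): with
the same key configured on all of an operator's relays, one client maps
to the same digest everywhere, and once the key is deleted the digests
can no longer be linked back to IP addresses.

```python
import hashlib
import hmac

def scrub_address(key, ip_address):
    """Replace a client IP address with a keyed hash before logging."""
    mac = hmac.new(key, ip_address.encode("ascii"), hashlib.sha256)
    return mac.hexdigest()

# Hypothetical key string shared across one operator's relays via torrc.
key = b"operator-chosen key string"
digest = scrub_address(key, "198.51.100.7")
```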

I hope to learn more about the overlap of unique IP address sets seen by
fast directory mirrors and also about client uptime sessions. I'd like
to try out different schemes to safely combine unique IP address sets to
come up with a better user count.
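One simple combination scheme, sketched in Python on toy data: because
all of an operator's relays share one key, a client seen by two mirrors
produces the same digest on both, so the union of the per-mirror digest
sets deduplicates clients across mirrors. (The digest values below are
placeholders, not real log entries.)

```python
def combine_unique_clients(per_mirror_digests):
    """Union the hashed-address sets seen by each mirror; a client seen
    by several mirrors collapses to a single digest in the union."""
    combined = set()
    for digests in per_mirror_digests:
        combined |= set(digests)
    return combined

a = {"d1", "d2", "d3"}    # digests logged by mirror A
b = {"d2", "d3", "d4"}    # digests logged by mirror B
overlap = a & b           # clients both mirrors saw: {"d2", "d3"}
total = combine_unique_clients([a, b])   # 4 unique clients overall
```

Comparing len(overlap) to the per-mirror set sizes is one way to measure
the overlap mentioned above before trying fancier combination schemes.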

Before writing code, what questions or concerns are there about this
experiment? Are there better ways to achieve what I'm trying to achieve?

Thanks,
--Karsten