
Questions about gathering information and statistics about the tor-network

I'm currently writing a tool to gather long-term statistics about the Tor network. I want to plot this hourly collected information (e.g. with gnuplot) and offer plots of the Tor network on a daily/weekly/monthly/yearly basis.

I think it is useful (for Tor development and for interested users) to observe how the Tor network develops over time, for example: is the number of nodes growing or shrinking; are router locations spreading around the world over time or concentrating even more in a few countries like the US and Germany; the number of exit nodes relative to entry/middle nodes; the average uptime of the nodes; which ports are blocked by the nodes and how that changes; whether the average bandwidth of the network is growing or shrinking; and so on.

Some information can easily be collected from the individual server descriptors by simply asking the control port: the number of nodes, their country (using geoiplookup on their IPs), the uptime, the blocked ports, and so on.
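A minimal sketch of this kind of extraction, assuming the descriptor text has already been fetched somehow. The sample descriptor is made up for illustration, `parse_descriptor` is a hypothetical helper and not part of Tor, and the field layout follows my reading of the router descriptor format in dir-spec.txt section 2.1:

```python
# Sketch only: SAMPLE is a made-up, heavily shortened descriptor, and
# parse_descriptor is a hypothetical helper, not part of Tor.
SAMPLE = """\
router ExampleNode 192.0.2.10 9001 0 9030
uptime 86400
bandwidth 51200 102400 20480
reject *:25
reject *:6881-6889
accept *:*
"""


def parse_descriptor(text):
    """Pull nickname, address, uptime and exit policy out of one descriptor."""
    info = {"reject": [], "accept": []}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        keyword, args = parts[0], parts[1:]
        if keyword == "router" and len(args) >= 5:
            # router <nickname> <address> <ORPort> <SOCKSPort> <DirPort>
            info["nickname"] = args[0]
            info["address"] = args[1]
            info["or_port"] = int(args[2])
        elif keyword == "uptime" and args:
            info["uptime"] = int(args[0])  # seconds
        elif keyword in ("reject", "accept") and args:
            info[keyword].append(args[0])  # exit policy pattern, e.g. *:25
    return info


node = parse_descriptor(SAMPLE)
print(node["nickname"], node["address"], node["uptime"], node["reject"])
```

The address could then be handed to geoiplookup to get the country, and the `reject` patterns counted per port across all nodes.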

But some information that would be interesting too is not as easy to gather:

1.) the number of users: this would be nice to have, but I don't know whether there is currently any way to estimate the number of users even roughly. In my opinion there are just two places where such information could be gathered somewhat reliably, but both are ruled out because the current implementation is designed to offer good security. There is also one place that gives a rough estimate, not of the number of users itself, but of whether that number is growing or shrinking.

a.) the entry nodes: every entry node knows (or can know) how many individual users (at least individual IPs) are connected to it right now. But because we don't know how many different circuits a user has open at any one moment, we can't say how many users we have in total, even if every entry node reported its current number of individual connections. The only workaround would be to throw the information from all entry nodes, with all IPs of all users, into one pot. But this would be a very, very bad idea. So counting users based on entry nodes is not going to work (at least not if we want the network to stay as safe as it is at the moment).

b.) the directory servers: if all clients asked the directory servers for new information at a constant interval, we could count the number of requests per dir-server per 24 hours and divide it by the number of intervals that fit into 24 hours. But this has two problems. One is that not every client is online 24 hours a day, so the result would be pretty unreliable even if we guessed an average time a client is online within 24 hours. The other is that the implementation (https://svn.torproject.org/svn/tor/trunk/doc/spec/dir-spec.txt, section 5.1) does not use a static interval for all clients but a more randomly chosen one. So this is no option either, because we know neither how long each client is up nor its random interval.
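To make the arithmetic (and its weakness) concrete, here is a small sketch. All numbers are made up, `estimate_clients` is a hypothetical helper, and the fixed fetch interval is exactly the assumption that does not hold in the real implementation:

```python
def estimate_clients(requests_per_day, fetch_interval_hours, avg_hours_online=24.0):
    """Estimate clients from directory requests, assuming a FIXED fetch interval.

    Each client is assumed to fetch once per interval while it is online,
    so it causes avg_hours_online / fetch_interval_hours requests per day.
    """
    if fetch_interval_hours <= 0 or avg_hours_online <= 0:
        raise ValueError("interval and online time must be positive")
    fetches_per_client = avg_hours_online / fetch_interval_hours
    return requests_per_day / fetches_per_client


# Made-up numbers: 240,000 requests per day, one fetch per hour.
always_on = estimate_clients(240_000, fetch_interval_hours=1.0)
part_time = estimate_clients(240_000, fetch_interval_hours=1.0, avg_hours_online=4.0)
print(always_on, part_time)
```

The same request count gives a six times larger estimate once clients are assumed to be online only 4 hours a day instead of 24, which is why the guess for the average online time dominates the result.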

c.) the number of downloads of a newly released Tor version: the number of downloads of a new stable release of the Tor client could give a hint as to whether the number of users is growing or shrinking. Of course this could only be collected on the Tor project page and would thus be just a fraction of all downloads/users, because there are e.g. many users of modern operating systems (yes, a small jab at MS/MacOS/Sun ;) ) which offer a package management system and who don't compile by hand. Those downloads and updates can't be counted. But even this fraction of the downloads of a new stable version (maybe within one week after its release) could give some impression of whether the average number of users is growing or shrinking, if we compare the number to prior releases.

2.) the network health: network health can be understood in many different ways. One aspect I thought of is comparing the bandwidth all nodes are offering with the bandwidth that is actually used, under the premise that we have enough users to consume all the bandwidth the nodes offer (and I think we can safely make this premise). Under this condition, good network health would mean that the bandwidth actually used is nearly the same as the bandwidth the nodes offer. This gives an estimate of how good Tor is at building circuits. If some nodes don't use all the bandwidth they have to offer while other nodes are nearly breaking under the bandwidth they are asked for, Tor isn't doing well at assembling circuits. Also interesting here would be the number of connections each node has compared with the bandwidth it offers, but the number of connections isn't exported at all. At least I couldn't find it in the server descriptor. I came to think about this through simple tests. Building a circuit with three really fast nodes gives you more bandwidth than building a circuit with three really slow nodes. But on a healthy network you would get the same bandwidth in either case, because the number of connections through the slow nodes would be lowered and the number through the fast nodes increased until they offer the same bandwidth to their users.
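A rough sketch of such a health measure, assuming we somehow had both an offered and a used bandwidth figure per node. The node names and numbers are made up, and `utilisation` is a hypothetical helper:

```python
from statistics import pstdev

# Made-up (offered, used) bandwidth in bytes per second for three nodes.
nodes = {
    "fast": (1_000_000, 300_000),
    "medium": (200_000, 150_000),
    "slow": (50_000, 49_000),
}


def utilisation(nodes):
    """Fraction of its offered bandwidth that each node actually uses."""
    return {
        name: used / offered
        for name, (offered, used) in nodes.items()
        if offered > 0
    }


per_node = utilisation(nodes)

# Network-wide ratio: total used bandwidth over total offered bandwidth.
network = sum(used for _, used in nodes.values()) / sum(
    offered for offered, _ in nodes.values()
)

# Spread of the per-node ratios: near 0 would mean the load is well balanced.
spread = pstdev(list(per_node.values()))

print(per_node, round(network, 4), round(spread, 4))
```

In this made-up example the slow node is almost saturated while the fast one sits at 30%, which is the imbalance described above: a network-wide ratio well below 1 together with a large spread between nodes.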

But even simply checking the bandwidth has some limitations (at least as I understand the specs: https://svn.torproject.org/svn/tor/trunk/doc/spec/dir-spec.txt, section 2.1). We have bandwidth-avg and bandwidth-observed (burst is kind of useless for us here, I think). I don't know how these values are gathered; the specs are a bit imprecise here, but the two values are pretty different when I look at them. Sometimes the observed value is less than 10% of the avg, so I don't know whether this value is useful/accurate. It would be nice if a router told us how much it is willing to share and how much it is actually sharing, but afaik we don't have the bandwidth a router is willing to share, just how much it is sharing, which is bandwidth-avg, right? Am I interpreting this correctly?
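For reference, a small sketch of the comparison I mean, assuming the three numbers on the descriptor's `bandwidth` line are in the order bandwidth-avg, bandwidth-burst, bandwidth-observed (bytes per second), as listed in dir-spec.txt section 2.1. The sample line is made up and `observed_ratio` is a hypothetical helper:

```python
def observed_ratio(bandwidth_line):
    """Return bandwidth-observed / bandwidth-avg for one 'bandwidth' line.

    Returns None if bandwidth-avg is 0, since the ratio is undefined then.
    """
    parts = bandwidth_line.split()
    if len(parts) != 4 or parts[0] != "bandwidth":
        raise ValueError("expected: bandwidth <avg> <burst> <observed>")
    avg, burst, observed = (int(p) for p in parts[1:])
    if avg == 0:
        return None
    return observed / avg


# Made-up line: observed is 8% of avg, like the "less than 10%" cases above.
ratio = observed_ratio("bandwidth 512000 1024000 40960")
print(f"observed is {ratio:.0%} of avg")
```

Collected hourly for every node, this ratio would at least show how often and on which nodes the two values drift apart.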

I wanted to ask what you think of the idea of creating such statistics at all. And do you have any better ideas or thoughts about the number of users and the network health?

