Re: [tor-bugs] #2680 [Metrics]: present bridge usage data so researchers can focus on the math
#2680: present bridge usage data so researchers can focus on the math
---------------------+------------------------------------------------------
Reporter: arma | Owner: karsten
Type: task | Status: assigned
Priority: normal | Milestone:
Component: Metrics | Version:
Keywords: | Parent:
Points: | Actualpoints:
---------------------+------------------------------------------------------
Comment(by karsten):
Replying to [comment:3 arma]:
> The "fingerprint" and "descriptor" in statuses.csv are always the same.
> I think you're printing "fingerprint" for both of them?
Oops, fixed.
> I think the next step is to write a short overview of how to reconstruct
> these files to answer some research question.
See the new Section 3 of the README and the new R file analysis.R in
task-2680.
> For example, say I want to get a list of all the countries that a given
> bridge has seen over time. I guess I want to iterate over all bridge
> fingerprints -- should I use the list of all fingerprints I find in
> statuses.csv or in descriptors.csv -- should they be the same?
If you want to learn about usage by country, you should only look at
descriptors.csv, not at statuses.csv. The data in bridge network statuses
and the data in extra-info descriptors are not tightly connected (even
though one can link them via the bridge's descriptor identifier). A
bridge is free to write anything in its extra-info descriptor, including
bridge statistics that are a few days old. Those statistics are in no way
related to whether the bridge authority considers the bridge to be running
at a later time.
I added a note to the README.
> So step zero, given a fingerprint, is to look it up in relays.csv and
> make sure it's not there. If it is, either ignore it or if we want to get
> fancier, ignore data from it close to the time it's in the relay list.
Correct. For the metrics graphs, we remove all bridges that have been seen
as relays, because even when we kept a distance of 1 week from the time a
bridge was listed as a relay, we got unrealistic usage numbers that I
couldn't explain otherwise. If someone wants to investigate this further,
I'd be happy to learn whether we can do something smarter.
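A minimal Python sketch of this "step zero" filter, for illustration only.
It is not the code behind the metrics graphs or the example in analysis.R.
The column names (fingerprint, published, descriptor) and the sample rows
are assumptions; the real layouts of relays.csv and descriptors.csv are
described in the README.

```python
import csv
import io

# Hypothetical sample data standing in for relays.csv and descriptors.csv.
# The column names are assumptions, not the documented file layout.
RELAYS_CSV = """fingerprint,published
AAAA,2011-03-01 00:00:00
"""

DESCRIPTORS_CSV = """fingerprint,descriptor
AAAA,d1
BBBB,d2
BBBB,d3
"""


def relay_fingerprints(relays_file):
    """Return the set of fingerprints that appear in the relay list."""
    return {row["fingerprint"] for row in csv.DictReader(relays_file)}


def drop_relay_bridges(descriptors_file, seen_as_relay):
    """Keep only descriptor rows whose bridge never appeared as a relay."""
    return [row for row in csv.DictReader(descriptors_file)
            if row["fingerprint"] not in seen_as_relay]


seen = relay_fingerprints(io.StringIO(RELAYS_CSV))
kept = drop_relay_bridges(io.StringIO(DESCRIPTORS_CSV), seen)
print([row["descriptor"] for row in kept])  # prints ['d2', 'd3']
```

This drops a bridge entirely once it shows up as a relay, which is the
conservative choice described above. A time-window variant would also need
to compare timestamps.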
> Step one is to look it up in statuses.csv, get a set of descriptor
> hashes, discard all the ones whose third-to-last value is not TRUE, and
> skip duplicate hashes.
See above. Removing descriptors of non-running bridges is not meaningful
here.
> Then step two is to take those remaining descriptor hashes and look them
> up in descriptors.csv, at which point I can learn which countries they saw
> unless the countries are all NA in which case we don't have data?
NA means no data, right.
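A rough Python sketch of collecting the countries a bridge has seen, using
only descriptors.csv and skipping NA values. The layout is an assumption
made for illustration: one column per country code, next to hypothetical
fingerprint and descriptor columns. Check the README for the real columns.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of descriptors.csv. The per-country columns and the
# values are made up; "NA" marks missing data as discussed above.
DESCRIPTORS_CSV = """fingerprint,descriptor,cn,de,us
BBBB,d2,8,NA,16
BBBB,d3,NA,24,8
CCCC,d4,NA,NA,NA
"""

# Columns that identify a row rather than report usage (assumed names).
ID_COLUMNS = {"fingerprint", "descriptor"}


def countries_by_bridge(descriptors_file):
    """Map each fingerprint to the sorted country codes it reported."""
    seen = defaultdict(set)
    for row in csv.DictReader(descriptors_file):
        for column, value in row.items():
            if column in ID_COLUMNS or value == "NA":
                continue  # NA means no data for this country
            if float(value) > 0:
                seen[row["fingerprint"]].add(column)
    return {fp: sorted(codes) for fp, codes in seen.items()}


result = countries_by_bridge(io.StringIO(DESCRIPTORS_CSV))
print(result)  # prints {'BBBB': ['cn', 'de', 'us']}
```

A bridge whose country values are all NA, like CCCC here, does not show up
in the result at all, which matches "no data" rather than "no users".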
> And the optional step three is to take the timestamp from the status
> file and look up the fingerprint in assignments.csv to decide if it's
> http, email, or unassigned?
The timestamps of the assignments and the timestamps of the bridge network
statuses do not necessarily match precisely. But BridgeDB does not
reassign bridges between distributors (yet), so there's no need to compare
timestamps here.
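Because bridges are not reassigned, the distributor can be looked up by
fingerprint alone. A small Python sketch under that assumption follows. The
column names (assignment, fingerprint, type) and the distributor values are
hypothetical placeholders, not the documented layout of assignments.csv.

```python
import csv
import io

# Hypothetical excerpt of assignments.csv with assumed column names.
ASSIGNMENTS_CSV = """assignment,fingerprint,type
2011-03-01 00:00:00,BBBB,http
2011-03-02 00:00:00,BBBB,http
2011-03-01 00:00:00,CCCC,email
"""


def distributor_by_fingerprint(assignments_file):
    """Map each fingerprint to its distributor, ignoring timestamps."""
    result = {}
    for row in csv.DictReader(assignments_file):
        previous = result.setdefault(row["fingerprint"], row["type"])
        if previous != row["type"]:
            # Would mean the no-reassignment assumption no longer holds.
            raise ValueError(f"bridge {row['fingerprint']} was reassigned")
    return result


distributors = distributor_by_fingerprint(io.StringIO(ASSIGNMENTS_CSV))
print(distributors.get("BBBB", "not in assignments"))  # prints http
```

The ValueError is a guard only: if BridgeDB starts reassigning bridges, the
lookup would have to take timestamps into account after all.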
I think that the example in analysis.R helps clarify things a bit.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/2680#comment:4>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online