[tor-commits] [metrics-tasks/master] Add an example analysis and fix a minor bug.
commit 22700d31144c1b8f5c3cc954634f4db9ceffec30
Author: Karsten Loesing <karsten.loesing@xxxxxxx>
Date: Tue Mar 15 14:49:20 2011 +0100
Add an example analysis and fix a minor bug.
---
task-2680/ProcessSanitizedBridges.java | 2 +-
task-2680/README | 138 +++++++++++++++++++++++++++-----
task-2680/analysis.R | 50 ++++++++++++
3 files changed, 169 insertions(+), 21 deletions(-)
diff --git a/task-2680/ProcessSanitizedBridges.java b/task-2680/ProcessSanitizedBridges.java
index 1f0e00e..c3ab6c8 100644
--- a/task-2680/ProcessSanitizedBridges.java
+++ b/task-2680/ProcessSanitizedBridges.java
@@ -84,7 +84,7 @@ public class ProcessSanitizedBridges {
String fingerprint = Hex.encodeHexString(Base64.decodeBase64(
parts[2] + "="));
String descriptor = Hex.encodeHexString(Base64.decodeBase64(
- parts[2] + "="));
+ parts[3] + "="));
String published = parts[4] + " " + parts[5];
String address = parts[6];
String orPort = parts[7];
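
The one-line fix above matters because both hex strings were previously derived from parts[2], so the descriptor identifier silently duplicated the fingerprint. Below is a minimal, self-contained sketch of the same decoding step. It assumes a status entry whose space-separated fields follow the indices used in the hunk (parts[2] fingerprint, parts[3] descriptor identifier, both unpadded base64 of 20 bytes). It uses java.util.Base64 instead of the Commons Codec calls in the original, and the class name, helper name, nickname, address, and sample values are all hypothetical.

```java
import java.util.Base64;

class DecodeStatusLine {

  /** Decodes unpadded base64 of a 20-byte value to lower-case hex. */
  static String base64ToHex(String unpadded) {
    // 20 bytes encode to 27 base64 characters plus one "=" of padding,
    // which status lines omit, so it is re-added before decoding.
    byte[] bytes = Base64.getDecoder().decode(unpadded + "=");
    StringBuilder hex = new StringBuilder();
    for (byte b : bytes) {
      hex.append(String.format("%02x", b & 0xff));
    }
    return hex.toString();
  }

  public static void main(String[] args) {
    // Hypothetical sample values: 20 zero bytes and 20 0xff bytes.
    String fingerprintB64 = "AAAAAAAAAA" + "AAAAAAAAAA" + "AAAAAAA";
    String descriptorB64 = "//////////" + "//////////" + "//////" + "8";
    String line = "r Unnamed " + fingerprintB64 + " " + descriptorB64
        + " 2011-01-01 00:00:00 10.0.0.1 443 0";

    String[] parts = line.split(" ");
    String fingerprint = base64ToHex(parts[2]);
    String descriptor = base64ToHex(parts[3]); // was parts[2] before the fix
    String published = parts[4] + " " + parts[5];

    System.out.println("fingerprint: " + fingerprint);
    System.out.println("descriptor:  " + descriptor);
    System.out.println("published:   " + published);
  }
}
```

With the pre-fix code, the two printed hex strings would be identical for every entry, which is an easy symptom to check for in an existing statuses.csv.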
diff --git a/task-2680/README b/task-2680/README
index 65d8b85..69aec70 100644
--- a/task-2680/README
+++ b/task-2680/README
@@ -1,3 +1,22 @@
+Presenting bridge usage data so that researchers can focus on the math
+======================================================================
+
+ "Right now the process of learning how to parse bridge consensus files,
+ bridge descriptor files, match up which descriptors go with which
+ consensus line, which bridges were Running when, etc is too
+ burdensome -- researchers who want to analyze bridge reachability are
+ giving up before they even get to the part they tried to sign up for."
+ (from arma's description of this ticket in Trac)
+
+This ticket contains the code to process the data tarballs from the
+metrics website and convert them to a format that is more useful for
+researchers. This README also contains instructions for working with the
+new data formats.
+
+
+1 Processing data tarballs from metrics.tpo
+--------------------------------------------
+
This ticket contains Java and R code to
a) process bridge and relay data to convert them to a format that is more
@@ -6,13 +25,9 @@ This ticket contains Java and R code to
This README has a separate section for each Java or R code snippet.
-The Java applications produce four output formats containing bridge
-descriptors, bridge status lines, bridge pool assignments, and hashed
-relay identities. The data formats are described below.
-
---------------------------------------------------------------------------
-ProcessSanitizedBridges.java
+1.1 ProcessSanitizedBridges.java
+---------------------------------
- Download sanitized bridge descriptors from the metrics website, e.g.,
https://metrics.torproject.org/data/bridge-descriptors-2011-01.tar.bz2,
@@ -31,9 +46,9 @@ ProcessSanitizedBridges.java
- Once the Java application is done, you'll find the two files
statuses.csv and descriptors.csv in this directory.
---------------------------------------------------------------------------
-ProcessSanitizedAssignments.java
+1.2 ProcessSanitizedAssignments.java
+-------------------------------------
- Download sanitized bridge pool assignments from the metrics website,
e.g., https://metrics.torproject.org/data/bridge-pool-assignments-2011-01.tar.bz2
@@ -48,9 +63,9 @@ ProcessSanitizedAssignments.java
- Once the Java application is done, you'll find a file assignments.csv
in this directory.
---------------------------------------------------------------------------
-ProcessRelayConsensuses.java
+1.3 ProcessRelayConsensuses.java
+---------------------------------
- Download v3 relay consensuses from the metrics website, e.g.,
https://metrics.torproject.org/data/consensuses-2011-01.tar.bz2, and
@@ -69,16 +84,24 @@ ProcessRelayConsensuses.java
- Once the Java application is done, you'll find a file relays.csv in
this directory.
---------------------------------------------------------------------------
-verify.R
+1.4 verify.R
+-------------
- Run the R verification script like this:
$ R --slave -f verify.R
---------------------------------------------------------------------------
-descriptors.csv
+2 New data formats
+-------------------
+
+The Java applications produce four output formats containing bridge
+descriptors, bridge status lines, bridge pool assignments, and hashed
+relay identities. The data formats are described below.
+
+
+2.1 descriptors.csv
+--------------------
The descriptors.csv file contains one line for each bridge descriptor that
a bridge has published. This descriptor consists of fields coming from
@@ -115,9 +138,9 @@ Bridges running early 0.2.2.x versions published faulty stats and are
therefore removed from descriptors.csv. Bridges running 0.2.2.x or higher
(except the faulty 0.2.2.x versions) collect stats in 24-hour intervals.
---------------------------------------------------------------------------
-statuses.csv
+2.2 statuses.csv
+-----------------
The statuses.csv file contains one line for every bridge that is
referenced in a bridge network status. Note that if a bridge is running
@@ -145,9 +168,16 @@ The columns in statuses.csv are:
- valid: TRUE if bridge has the Valid flag, FALSE otherwise
- v2dir: TRUE if bridge has the V2Dir flag, FALSE otherwise
---------------------------------------------------------------------------
+Note that there is no tight relation between statuses.csv and
+descriptors.csv when it comes to bridge usage statistics (even though
+one can link them via the bridge's server descriptor identifier). A
+bridge is free to write anything in its extra-info descriptor, including
+bridge statistics that are a few days old. That is in no way related to
+the bridge authority thinking that a bridge is running at a later time.
+
-assignments.csv
+2.3 assignments.csv
+--------------------
The assignments.csv file contains one line for every running bridge and
the rings, subrings, and buckets that BridgeDB assigned it to.
@@ -162,9 +192,9 @@ The columns in assignments.csv are:
- flag: Flag subring
- bucket: File bucket, only for distributor "unallocated"
---------------------------------------------------------------------------
-relays.csv
+2.4 relays.csv
+---------------
The relays.csv file contains SHA-1 hashes of identity fingerprints of
normal relays. If a bridge uses the same identity key that it also used
@@ -177,3 +207,71 @@ The columns in relays.csv are:
- consensus: ISO-formatted consensus publication time
- fingerprint: Hex-formatted SHA-1 hash of identity fingerprint
+
+3 Working with the new data formats
+------------------------------------
+
+The new data formats are plain CSV files that can be processed by many
+statistics tools, including R. For some analyses it may be sufficient to
+evaluate a single CSV file and be done. But most analyses would require
+combining two or more of the CSV files.
+
+See analysis.R for an example analysis. Run it like this:
+
+ $ R --slave -f analysis.R
+
+Below is the output in case you don't have R installed but want to know
+what kind of results to expect:
+
+Reading descriptors.csv.
+Read 97394 rows from descriptors.csv.
+28429 of these rows have bridge stats.
+Here are the first 10 rows, sorted by fingerprint and bridge stats
+interval end, and only displaying German and French users:
+ fingerprint bridgestatsend de fr
+45933 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 11:32:47 0 0
+21782 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 11:33:53 0 0
+18869 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 11:53:07 0 0
+5182 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 19:23:52 0 0
+48686 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-03 09:38:20 0 0
+33774 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-03 19:30:08 0 0
+67666 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-03 22:11:47 0 0
+31329 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-06 09:14:07 0 0
+31668 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-07 11:23:26 0 0
+16943 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-08 11:49:26 0 0
+Reading relays.csv
+Read 1606208 rows from relays.csv.
+Filtering out bridges that have been seen as relays.
+26425 descriptors remain. Again, here are the first 10 rows, sorted by
+fingerprint and bridge stats interval end, and only displaying German
+and French users:
+ fingerprint bridgestatsend de fr
+45933 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 11:32:47 0 0
+21782 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 11:33:53 0 0
+18869 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 11:53:07 0 0
+5182 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 19:23:52 0 0
+48686 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-03 09:38:20 0 0
+33774 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-03 19:30:08 0 0
+67666 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-03 22:11:47 0 0
+31329 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-06 09:14:07 0 0
+31668 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-07 11:23:26 0 0
+16943 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-08 11:49:26 0 0
+Reading assignments.csv
+Read 778561 rows from assignments.csv.
+Filtering out bridges that have not been distributed via email.
+14684 descriptors remain. Again, here are the first 10 rows, sorted by
+fingerprint and bridge stats interval end, and only displaying German
+and French users:
+ fingerprint bridgestatsend de fr
+66036 003817328def77002ff276a9af54bc4326a86d1c 2011-01-01 05:53:12 32 8
+61891 003817328def77002ff276a9af54bc4326a86d1c 2011-01-01 11:46:58 32 8
+54391 003817328def77002ff276a9af54bc4326a86d1c 2011-01-02 03:32:30 40 8
+73165 003817328def77002ff276a9af54bc4326a86d1c 2011-01-02 21:33:14 48 8
+82707 003817328def77002ff276a9af54bc4326a86d1c 2011-01-03 03:47:23 48 8
+5300 003817328def77002ff276a9af54bc4326a86d1c 2011-01-03 21:48:10 32 8
+23940 003817328def77002ff276a9af54bc4326a86d1c 2011-01-04 15:48:56 32 8
+2706 003817328def77002ff276a9af54bc4326a86d1c 2011-01-05 09:49:39 40 8
+17273 003817328def77002ff276a9af54bc4326a86d1c 2011-01-06 03:50:23 24 8
+72380 003817328def77002ff276a9af54bc4326a86d1c 2011-01-06 21:51:09 24 8
+Terminating.
+
diff --git a/task-2680/analysis.R b/task-2680/analysis.R
new file mode 100644
index 0000000..fbe3199
--- /dev/null
+++ b/task-2680/analysis.R
@@ -0,0 +1,50 @@
+# Read descriptors.csv.
+cat("Reading descriptors.csv.\n")
+data <- read.csv("descriptors.csv", stringsAsFactors = FALSE)
+cat("Read", length(data$fingerprint), "rows from descriptors.csv.\n")
+
+# We're interested in bridge stats. Let's filter out all descriptors that
+# don't have any bridge stats.
+data <- data[!is.na(data$bridgestatsend), ]
+cat(length(data$fingerprint), "of these rows have bridge stats.\n")
+
+# Sort data first by bridge fingerprint, then by bridge stats interval end.
+data <- data[order(data$fingerprint, data$bridgestatsend), ]
+cat("Here are the first 10 rows, sorted by fingerprint and bridge",
+ "stats\ninterval end, and only displaying German and French users:\n")
+data[1:10, c("fingerprint", "bridgestatsend", "de", "fr")]
+
+# Looks good, but we should exclude all bridges that have been seen as
+# relays, or they will skew our results. Read relays.csv.
+cat("Reading relays.csv\n")
+relays <- read.csv("relays.csv", stringsAsFactors = FALSE)
+cat("Read", length(relays$fingerprint), "rows from relays.csv.\n")
+
+# Filter out all descriptors of bridges that have been seen as relays.
+cat("Filtering out bridges that have been seen as relays.\n")
+data <- data[!data$fingerprint %in% relays$fingerprint, ]
+cat(length(data$fingerprint), "descriptors remain. Again, here are the",
+ "first 10 rows, sorted by\nfingerprint and bridge stats interval",
+ "end, and only displaying German\nand French users:\n")
+data[1:10, c("fingerprint", "bridgestatsend", "de", "fr")]
+
+# And finally, we only want to know bridge statistics of the bridges that
+# were distributed via email. Read assignments.csv.
+cat("Reading assignments.csv\n")
+assignments <- read.csv("assignments.csv", stringsAsFactors = FALSE)
+cat("Read", length(assignments$fingerprint), "rows from",
+ "assignments.csv.\n")
+
+# Filter out all descriptors of bridges that were not assigned to the
+# email distributor.
+cat("Filtering out bridges that have not been distributed via email.\n")
+data <- data[data$fingerprint %in%
+ assignments[assignments$type == 'email', "fingerprint"], ]
+cat(length(data$fingerprint), "descriptors remain. Again, here are the",
+ "first 10 rows, sorted by\nfingerprint and bridge stats interval",
+ "end, and only displaying German\nand French users:\n")
+data[1:10, c("fingerprint", "bridgestatsend", "de", "fr")]
+
+# That's it.
+cat("Terminating.\n")
+
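
For readers without R, the two set-membership filters that analysis.R describes in its comments can be sketched in Java as follows. This is only an illustration under stated assumptions: it works on small in-memory collections rather than the CSV files, it reads the third step as keeping only bridges assigned to the email distributor (as the script's comments and messages say), and the class name, method name, and sample fingerprints are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BridgeFilterSketch {

  /**
   * Keeps the fingerprints of descriptors whose bridge was never seen
   * as a relay and was assigned to the email distributor.
   */
  static List<String> filter(List<String> descriptorFingerprints,
      Set<String> relayFingerprints, Set<String> emailFingerprints) {
    List<String> kept = new ArrayList<>();
    for (String fingerprint : descriptorFingerprints) {
      boolean seenAsRelay = relayFingerprints.contains(fingerprint);
      boolean viaEmail = emailFingerprints.contains(fingerprint);
      if (!seenAsRelay && viaEmail) {
        kept.add(fingerprint);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Hypothetical, shortened fingerprints; one entry per descriptor.
    List<String> descriptors = Arrays.asList("aa", "aa", "bb", "cc");
    Set<String> relays = new HashSet<>(Arrays.asList("bb"));
    Set<String> email = new HashSet<>(Arrays.asList("aa", "bb"));

    // "bb" is dropped as a relay, "cc" was not distributed via email.
    System.out.println(filter(descriptors, relays, email));
  }
}
```

Using hash sets for the relay and email fingerprints keeps each lookup constant-time, which matters at the row counts shown in the README output (over a million rows in relays.csv).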