Re: [tor-bugs] #32135 [Metrics/Statistics]: Write BridgeDB metrics parser and analyse existing data
#32135: Write BridgeDB metrics parser and analyse existing data
--------------------------------+--------------------------------
Reporter: phw | Owner: phw
Type: task | Status: needs_revision
Priority: Medium | Milestone:
Component: Metrics/Statistics | Version:
Severity: Normal | Resolution:
Keywords: s30-o21a1 | Actual Points:
Parent ID: #31274 | Points: 2
Reviewer: | Sponsor:
--------------------------------+--------------------------------
Changes (by phw):
* status: needs_review => needs_revision
Comment:
Thanks for your work on this!
Replying to [comment:6 karsten]:
> Okay, I finished a first
[https://gitweb.torproject.org/user/karsten/metrics-web.git/commit/?h=task-32135&id=93f2500cabf22fdf03d109bb7855445b18afd62d patch]
that processes BridgeDB metrics once per day to produce a .csv file
and that adds two graphs to Tor Metrics. Can you please take a look at
that patch, not regarding the Java/R code, but regarding user-facing
documentation of the two new graphs? In particular, please take a look at
the `TODO`s in that patch. (irl, I'll ask you to review a revised branch
for the code portions once the documentation parts are all set.)
[[br]]
[https://gitweb.torproject.org/user/karsten/metrics-web.git/diff/src/main/resources/web/json/metrics.json?h=task-32135&id=93f2500cabf22fdf03d109bb7855445b18afd62d Commit 93f2500c]:
For bridgedb-transport, I would change the title to:
{{{
"BridgeDB requests for each bridge type"
}}}
...and the description to:
{{{
"<p>This graph shows the number of BridgeDB requests for each bridge type.
BridgeDB requests over Tor and unsuccessful requests (e.g., invalid emails
or incorrect CAPTCHAs) are not included in these numbers.</p>"
}}}
For bridgedb-distribution, I would change the title to:
{{{
"BridgeDB requests for each distribution method"
}}}
...and the description to:
{{{
"<p>This graph shows the number of BridgeDB requests for each distribution
method. HTTPS requests over Tor and unsuccessful requests (e.g., invalid
emails or incorrect CAPTCHAs) are not included in these numbers.</p>"
}}}
Here are my changes to
[https://gitweb.torproject.org/user/karsten/metrics-web.git/diff/src/main/resources/web/jsps/reproducible-metrics.jsp?h=task-32135&id=93f2500cabf22fdf03d109bb7855445b18afd62d commit 93f2500c]:
{{{
<h3 id="bridgedb-stats" class="hover">BridgeDB requests
<a href="#bridgedb-stats" class="anchor">#</a>
</h3>
<p>BridgeDB metrics contain aggregated information about requests to the
BridgeDB service. BridgeDB keeps track of each request per distribution method
(HTTPS, moat, email), per bridge type (e.g., vanilla or obfs4), per country
code or email provider (e.g., "ru" or "gmail"), and per request success
("success" or "fail"). Every 24 hours, BridgeDB writes these metrics to disk
and then begins a new measurement interval.</p>
<p>The following description applies to the following graphs:</p>
<ul>
<li>BridgeDB requests by bridge type <a href="/bridgedb-transport.html" class="btn btn-primary btn-xs"><i class="fa fa-chevron-right" aria-hidden="true"></i> graph</a></li>
<li>BridgeDB requests by distribution method <a href="/bridgedb-distribution.html" class="btn btn-primary btn-xs"><i class="fa fa-chevron-right" aria-hidden="true"></i> graph</a></li>
</ul>
<h4>Step 1: Parse BridgeDB metrics to obtain reported request numbers</h4>
<p>Obtain BridgeDB metrics from <a href="/collector.html#type-bridgedb-metrics">CollecTor</a>.
Refer to the <a href="https://gitweb.torproject.org/bridgedb.git/tree/doc/bridgedb-metrics-spec.txt">BridgeDB metrics specification</a>
for details on the descriptor format.</p>
<h4>Step 2: Skip requests coming in over Tor exits</h4>
<p>Skip any request counts with <code>zz</code> as their
<code>CC/EMAIL</code> metrics key part. We use the <code>zz</code> pseudo
country code for requests originating from Tor exit relays. We're discarding
these requests because <a href="https://bugs.torproject.org/32117">bots use the
Tor network to crawl BridgeDB</a> and including bot requests would provide a
false sense of how users interact with BridgeDB. Note that BridgeDB maintains
a separate distribution pool for requests coming from Tor exit relays.</p>
<h4>Step 3: Aggregate requests by date, distribution method, and bridge type</h4>
<p>BridgeDB metrics contain request numbers broken down by distribution method,
bridge type, and a few more dimensions. For our purposes we only care about
total request numbers by date and either distribution method or bridge type.
We're using request sums by these three dimensions as aggregates. As the date,
we're using the date of the BridgeDB metrics interval end. If we encounter
more than one BridgeDB metrics interval end on the same UTC date (which
shouldn't be possible with an interval length of 24 hours), we arbitrarily
keep whichever we process first.</p>
</div>
<div class="container">
}}}
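To double-check that the three steps above read unambiguously, here is a minimal Python sketch of them. It is not the Java code from the patch. It assumes descriptor lines of the form `bridgedb-metrics-end YYYY-MM-DD HH:MM:SS (86400 s)` and `bridgedb-metric-count DISTRIBUTION.TRANSPORT.CC/EMAIL.SUCCESS.RESERVED COUNT`, which is my reading of the metrics spec. The function names `parse_descriptor` and `aggregate` are made up for illustration. Dropping `fail` counts is also an assumption: the graph descriptions say unsuccessful requests are excluded, but the steps don't spell that out.

```python
from collections import defaultdict


def parse_descriptor(text):
    """Step 1: return (interval-end date, {metric key: count}) for one descriptor."""
    date, counts = None, {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "bridgedb-metrics-end":
            date = parts[1]  # YYYY-MM-DD of the interval end (assumed line format)
        elif parts[0] == "bridgedb-metric-count":
            counts[parts[1]] = int(parts[2])
    return date, counts


def aggregate(descriptors):
    """Steps 2 and 3: sum requests by (date, distribution) and (date, bridge type)."""
    by_distribution = defaultdict(int)
    by_bridge_type = defaultdict(int)
    seen_dates = set()
    for text in descriptors:
        date, counts = parse_descriptor(text)
        if date is None or date in seen_dates:
            continue  # same UTC date seen before: keep whichever we processed first
        seen_dates.add(date)
        for key, count in counts.items():
            distribution, bridge_type, cc_or_email, outcome = key.split(".")[:4]
            if cc_or_email == "zz":
                continue  # Step 2: request came in over a Tor exit
            if outcome != "success":
                continue  # assumption: unsuccessful requests are not graphed
            by_distribution[(date, distribution)] += count
            by_bridge_type[(date, bridge_type)] += count
    return dict(by_distribution), dict(by_bridge_type)
```

Calling `aggregate()` on a list of descriptor strings returns two dictionaries, one per graph, keyed by `(date, distribution method)` and `(date, bridge type)` respectively.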
I wasn't sure what `TODO If we're supposed to "unbin" numbers, this is
probably where we should say that.` meant, so I deleted the line. Is this
about the `bin_size/2` modification you mentioned above?
In [https://gitweb.torproject.org/user/karsten/metrics-web.git/diff/src/main/resources/web/jsps/stats.jsp?h=task-32135&id=93f2500cabf22fdf03d109bb7855445b18afd62d commit 93f2500c],
I would replace "transport" with "bridge type" (because
we include vanilla, which is technically the absence of a transport
protocol) and "distribution" with "distribution method". I would also
change:
{{{
<li><b>transport:</b> Name of the pluggable transport protocol, which
includes <code>"obfs2"</code>, <code>"obfs3"</code>, <code>"obfs4"</code>,
<code>"scramblesuit"</code>, and <code>"fte"</code>, and which will change
in the future.</li>
}}}
to
{{{
<li><b>transport:</b> Name of the bridge type, which includes
<code>"vanilla"</code>, <code>"obfs2"</code>, <code>"obfs3"</code>,
<code>"obfs4"</code>, <code>"scramblesuit"</code>, and <code>"fte"</code>,
and which will change in the future.</li>
}}}
We may want to change the column's name to something like "bridge_type",
but I think it's also OK to keep it.
[[br]]
> By the way, while reading your code, I found that you're only looking at
BridgeDB metrics files in CollecTor's `recent/` directory. There's
currently a (minor) bug in CollecTor where we never remove files from that
directory. I'm going to fix that at some point, and then your script will
only provide the latest three files. A possible fix would be to also
process files in CollecTor's `archive/` directory. Not sure how much of an
issue that is when these graphs exist on Tor Metrics, but I thought I
should let you know.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32135#comment:8>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online