[tor-dev] Metrics: Estimating fraction of reported directory-request statistics
I am trying to reproduce the "frac" computation from the Reproducible
Metrics instructions:
https://metrics.torproject.org/reproducible-metrics.html#relay-users
The same computation appears in Section 3 of the tech report on counting bridge users:
https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf#page=4
       h(R∧H) * n(H) + h(H) * n(R\H)
frac = -----------------------------
                h(H) * n(N)
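As a minimal sketch, the formula can be written as a small Python function. The function and parameter names are illustrative assumptions, not names taken from the metrics code, and the sample values are made up.

```python
def frac(h_rh, n_h, h_h, n_r_minus_h, n_n):
    # h_rh        : h(R∧H), bytes written while both statistics were reported
    # n_h         : n(H), hours with directory write histories
    # h_h         : h(H), written directory bytes
    # n_r_minus_h : n(R\H), hours with request counts but no write history
    # n_n         : n(N), hours for all relays in the network
    return (h_rh * n_h + h_h * n_r_minus_h) / (h_h * n_n)


# Made-up sanity check: every relay reports both statistics for all 24 hours.
print(frac(h_rh=100.0, n_h=24.0, h_h=100.0, n_r_minus_h=0.0, n_n=24.0))
```

With these made-up inputs every term cancels and the result is 1.0, which is the expected value when all relays report everything.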
My minor goal is to reproduce the "frac" column from the Metrics web
site (which I assume is the same as the frac above, expressed as a
percentage):
https://metrics.torproject.org/userstats-relay-country.csv?start=2022-04-01&end=2022-04-08&country=all&events=off
date,country,users,lower,upper,frac
2022-04-01,,2262557,,,92
2022-04-02,,2181639,,,92
2022-04-03,,2179544,,,93
2022-04-04,,2350360,,,93
2022-04-05,,2388772,,,93
2022-04-06,,2356170,,,93
2022-04-07,,2323184,,,93
2022-04-08,,2310170,,,91
I'm having trouble with the computation of n(R\H) and h(R∧H). I
understand that R is the subset of relays that report directory request
counts (i.e. that have dirreq-stats-end in their extra-info descriptors)
and H is the subset of relays that report directory request byte counts
(i.e. that have dirreq-write-history in their extra-info descriptors).
R and H partially overlap: there are relays that are in R but not H,
others that are in H but not R, and others that are in both.
The computations depend on some values that are directly from
descriptors:
n(R) = sum of hours, for relays with directory request counts
n(H) = sum of hours, for relays with directory write histories
h(H) = sum of written bytes, for relays with directory write histories
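For illustration, here is a rough Python sketch of how hours and bytes could be tallied from one dirreq-write-history line. It assumes the line has the form "dirreq-write-history YYYY-MM-DD HH:MM:SS (NSEC s) NUM,NUM,...". It counts every listed interval and does not restrict the history to a single calendar day, which a real computation would probably need to do. The helper name and the sample line are made up.

```python
import re

# Assumed line layout: keyword, end timestamp, "(NSEC s)", comma-separated byte counts.
LINE_RE = re.compile(
    r"^dirreq-write-history \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} "
    r"\((\d+) s\)(?: ([\d,]+))?$"
)


def hours_and_bytes(line):
    """Return (hours covered, bytes written) for one history line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError("not a dirreq-write-history line")
    interval_s = int(m.group(1))
    values = [int(v) for v in m.group(2).split(",") if v] if m.group(2) else []
    # Each value covers one interval of interval_s seconds.
    return len(values) * interval_s / 3600, sum(values)


# Made-up sample: four 900-second intervals, i.e. one hour in total.
sample = "dirreq-write-history 2022-04-01 12:00:00 (900 s) 1024,2048,4096,8192"
print(hours_and_bytes(sample))
```

Summing the first element over all relays would give an estimate of n(H), and summing the second would give h(H), under the assumptions stated above.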
> Compute n(R\H) as the number of hours for which responses have been
> reported but no written directory bytes. This fraction is determined
> by summing up all interval lengths and then subtracting the written
> directory bytes interval length from the directory response interval
> length. Negative results are discarded.
I interpret this to mean: add up all the dirreq-stats-end intervals
(this is n(R)), add up all the dirreq-write-history intervals
(this is n(H)), and compute n(R\H) as n(R) − n(H). This seems wrong: it
would only be true when H is a subset of R.
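The problem with subtracting network-wide totals can be shown with a small Python sketch. The first function is the literal reading described above. The second, which subtracts per relay before summing, is only one possible alternative and is an assumption, not something confirmed to be what the metrics code does. The relay names and hours are made up.

```python
def n_r_minus_h_totals(n_r, n_h):
    # Literal reading: subtract the network-wide totals, discard negatives.
    return max(0.0, n_r - n_h)


def n_r_minus_h_per_relay(hours_by_relay):
    # Hypothetical alternative: subtract per relay, discard negatives, then sum.
    return sum(max(0.0, r - h) for r, h in hours_by_relay.values())


# Made-up network: relay A is in R only, relay B is in H only.
hours_by_relay = {"A": (24.0, 0.0), "B": (0.0, 24.0)}  # (hours in R, hours in H)
n_r = sum(r for r, _ in hours_by_relay.values())
n_h = sum(h for _, h in hours_by_relay.values())

print(n_r_minus_h_totals(n_r, n_h))           # totals cancel out
print(n_r_minus_h_per_relay(hours_by_relay))  # relay A's 24 hours survive
```

Here the totals give 0 hours even though relay A contributes 24 hours to R\H, which is the failure mode when H is not a subset of R.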
> Compute h(R∧H) as the number of written directory bytes for the
> fraction of time when a server was reporting both written directory
> bytes and directory responses. As above, this fraction is determined
> by first summing up all interval lengths and then computing the
> minimum of both sums divided by the sum of reported written directory
> bytes.
This seems to be saying to compute h(R∧H) (a count of bytes) as
min(n(R), n(H)) / h(H). This is dimensionally wrong: the units are
hours / bytes. What would be more natural to me is
min(n(R), n(H)) / max(n(R), n(H)) × h(H); i.e., divide the smaller of
n(R) and n(H) by the larger, then multiply this ratio by the observable
byte count. But this, too, only works when H is a subset of R.
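The "more natural" reading proposed above can be sketched in Python as follows. This is the interpretation from this message, not a confirmed description of the metrics code, and the function name is hypothetical. The sample inputs are the rounded 2022-04-01 values from the table further down.

```python
def h_rh_estimate(n_r, n_h, h_h):
    # Scale the written bytes h(H) by the ratio of the smaller hour total
    # to the larger one, so the result keeps the unit of bytes.
    return min(n_r, n_h) / max(n_r, n_h) * h_h


# Rounded values for 2022-04-01: n(R) = 125491, n(H) = 177638, h(H) = 2.24e13.
print(f"{h_rh_estimate(125491, 177638, 2.24e13):.3g}")
```

Because h(H) is rounded to three significant figures here, the result can differ in the last digit from a value computed on unrounded data.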
Where is this computation done in the metrics code? I would like to
refer to it, but I could not find it.
Using the formulas and assumptions above, here's my attempt at computing
recent "frac" values:
date        n(N)    n(H)    h(H)     n(R)    n(R\H)  h(R∧H)   frac
2022-04-01  166584  177638  2.24e13  125491  0       1.59e13  0.753
2022-04-02  166951  177466  2.18e13  125686  0       1.54e13  0.753
2022-04-03  167100  177718  2.27e13  127008  0       1.62e13  0.760
2022-04-04  166970  177559  2.43e13  126412  0       1.73e13  0.757
2022-04-05  166729  177585  2.44e13  125389  0       1.72e13  0.752
2022-04-06  166832  177470  2.39e13  127077  0       1.71e13  0.762
2022-04-07  166532  177210  2.48e13  127815  0       1.79e13  0.768
2022-04-08  167695  176879  2.52e13  127697  0       1.82e13  0.761
The "frac" column does not match the CSV. Also notice that n(N) < n(H),
which should be impossible because H is supposed to be a subset of N
(N is the set of all relays). But this is what I get when I estimate
n(N) from a network-status-consensus-3 and n(H) from extra-info
documents. Also notice that n(R) < n(H), which means that H cannot be a
subset of R, contrary to the observations above.
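As a check, the following Python sketch recomputes the "frac" column from the rounded n(N), n(H), h(H), and n(R) values in the table, using the interpretation described in this message (n(R\H) as the clamped difference of totals, h(R∧H) as the min/max ratio times h(H)). The function name is hypothetical, and this is not claimed to be how the metrics code works.

```python
# Rows copied from the table above: date, n(N), n(H), h(H), n(R), frac as listed.
ROWS = [
    ("2022-04-01", 166584, 177638, 2.24e13, 125491, 0.753),
    ("2022-04-02", 166951, 177466, 2.18e13, 125686, 0.753),
    ("2022-04-03", 167100, 177718, 2.27e13, 127008, 0.760),
    ("2022-04-04", 166970, 177559, 2.43e13, 126412, 0.757),
    ("2022-04-05", 166729, 177585, 2.44e13, 125389, 0.752),
    ("2022-04-06", 166832, 177470, 2.39e13, 127077, 0.762),
    ("2022-04-07", 166532, 177210, 2.48e13, 127815, 0.768),
    ("2022-04-08", 167695, 176879, 2.52e13, 127697, 0.761),
]


def frac_under_reading(n_n, n_h, h_h, n_r):
    n_r_minus_h = max(0.0, n_r - n_h)                 # n(R\H), negatives discarded
    h_rh = min(n_r, n_h) / max(n_r, n_h) * h_h        # h(R∧H), min/max reading
    return (h_rh * n_h + h_h * n_r_minus_h) / (h_h * n_n)


for date, n_n, n_h, h_h, n_r, listed in ROWS:
    computed = frac_under_reading(n_n, n_h, h_h, n_r)
    flag = "n(N) < n(H)" if n_n < n_h else ""
    print(date, f"{computed:.3f}", listed, flag)
```

Whenever n(R) < n(H), as in every row here, the expression collapses algebraically to n(R) / n(N): h(H) cancels, and the byte counts have no influence on the result under this reading.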