
[tor-dev] Metrics: Estimating fraction of reported directory-request statistics

To: tor-dev@xxxxxxxxxxxxxxxxxxxx
Subject: [tor-dev] Metrics: Estimating fraction of reported directory-request statistics
From: David Fifield <david@xxxxxxxxxxxxxxx>
Date: Sat, 16 Apr 2022 18:16:23 -0600

I am trying to reproduce the "frac" computation from the Reproducible
Metrics instructions:
https://metrics.torproject.org/reproducible-metrics.html#relay-users
Which is also Section 3 in the tech report on counting bridge users:
https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf#page=4

       h(R∧H) * n(H) + h(H) * n(R\H)
frac = -----------------------------
                h(H) * n(N)
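
For reference, the formula above can be written as a small function. This is
only a sketch: the function and parameter names are made up for illustration
and do not come from the metrics code.

```python
def frac(h_r_and_h, n_h, h_h, n_r_minus_h, n_n):
    """Evaluate frac = (h(R∧H) * n(H) + h(H) * n(R\\H)) / (h(H) * n(N)).

    h_r_and_h   -- h(R∧H): written directory bytes while reporting both
    n_h         -- n(H): hours with directory write histories
    h_h         -- h(H): written directory bytes
    n_r_minus_h -- n(R\\H): hours with request counts but no write history
    n_n         -- n(N): hours for all relays in the network
    """
    if h_h <= 0 or n_n <= 0:
        raise ValueError("h(H) and n(N) must be positive")
    return (h_r_and_h * n_h + h_h * n_r_minus_h) / (h_h * n_n)


# Toy numbers (not real data): every relay reports everything.
print(frac(10.0, 5.0, 10.0, 0.0, 5.0))  # prints 1.0
```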

My minor goal is to reproduce the "frac" column from the Metrics web
site (which I assume is the same as the frac above, expressed as a
percentage):

https://metrics.torproject.org/userstats-relay-country.csv?start=2022-04-01&end=2022-04-08&country=all&events=off
date,country,users,lower,upper,frac
2022-04-01,,2262557,,,92
2022-04-02,,2181639,,,92
2022-04-03,,2179544,,,93
2022-04-04,,2350360,,,93
2022-04-05,,2388772,,,93
2022-04-06,,2356170,,,93
2022-04-07,,2323184,,,93
2022-04-08,,2310170,,,91
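
To compare against these published values, the CSV can be read with the
standard csv module. The sketch below embeds the rows shown above instead of
fetching them, and it relies on the assumption already stated: that the
"frac" column is a percentage.

```python
import csv
import io

CSV_TEXT = """\
date,country,users,lower,upper,frac
2022-04-01,,2262557,,,92
2022-04-02,,2181639,,,92
2022-04-03,,2179544,,,93
2022-04-04,,2350360,,,93
2022-04-05,,2388772,,,93
2022-04-06,,2356170,,,93
2022-04-07,,2323184,,,93
2022-04-08,,2310170,,,91
"""


def read_published_frac(csv_text):
    """Map date -> frac as a fraction, assuming the column is a percentage."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["date"]: int(row["frac"]) / 100 for row in reader}


published = read_published_frac(CSV_TEXT)
for date, value in published.items():
    print(date, value)
```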

I'm having trouble with the computation of n(R\H) and h(R∧H). I
understand that R is the subset of relays that report directory request
counts (i.e. that have dirreq-stats-end in their extra-info descriptors)
and H is the subset of relays that report directory request byte counts
(i.e. that have dirreq-write-history in their extra-info descriptors).
R and H partially overlap: there are relays that are in R but not H,
others that are in H but not R, and others that are in both.

The computations depend on some values that are directly from
descriptors:
n(R) = sum of hours, for relays with directory request counts
n(H) = sum of hours, for relays with directory write histories
h(H) = sum of written bytes, for relays with directory write histories
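
As a sketch of these three sums: the record type below is hypothetical (it is
not a format produced by any Tor tool) and stands for one relay's extra-info
data already reduced to hours and bytes for the day in question. The values
in the example are toy data.

```python
from dataclasses import dataclass


@dataclass
class RelayDay:
    """Hypothetical per-relay summary for one day."""
    dirreq_stats_hours: float   # hours covered by dirreq-stats-end
    write_history_hours: float  # hours covered by dirreq-write-history
    write_history_bytes: int    # bytes listed in dirreq-write-history


def totals(records):
    """Return (n(R), n(H), h(H)) summed over all records."""
    n_r = sum(r.dirreq_stats_hours for r in records)
    n_h = sum(r.write_history_hours for r in records)
    h_h = sum(r.write_history_bytes for r in records)
    return n_r, n_h, h_h


toy = [
    RelayDay(24.0, 24.0, 1000),  # in both R and H
    RelayDay(24.0, 0.0, 0),      # in R only
    RelayDay(0.0, 12.0, 500),    # in H only
]
print(totals(toy))  # prints (48.0, 36.0, 1500)
```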

> Compute n(R\H) as the number of hours for which responses have been
> reported but no written directory bytes. This fraction is determined
> by summing up all interval lengths and then subtracting the written
> directory bytes interval length from the directory response interval
> length. Negative results are discarded.

I interpret this to mean: add up all the dirreq-stats-end intervals
(this is n(R)), add up all the dirreq-write-history intervals
(this is n(H)), and compute n(R\H) as n(R) − n(H). This seems wrong: it
would only be true when H is a subset of R.
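
A toy example makes the problem concrete. The first function is the
aggregate reading described above. The second is one possible per-relay
alternative; it is an assumption for the sake of comparison, not something
taken from the metrics documentation. The relay names and hours are made up.

```python
def n_r_minus_h_aggregate(n_r, n_h):
    """Aggregate reading: n(R) - n(H), with negative results discarded."""
    return max(0.0, n_r - n_h)


def n_r_minus_h_per_relay(hours):
    """Alternative: subtract per relay, then sum the non-negative parts.

    hours maps a relay name to (dirreq-stats-end hours,
    dirreq-write-history hours).
    """
    return sum(max(0.0, r - h) for r, h in hours.values())


# relayA reports only request counts; relayB reports only write histories.
toy = {"relayA": (24.0, 0.0), "relayB": (0.0, 24.0)}
n_r = sum(r for r, _ in toy.values())
n_h = sum(h for _, h in toy.values())

print(n_r_minus_h_aggregate(n_r, n_h))  # prints 0.0
print(n_r_minus_h_per_relay(toy))       # prints 24.0
```

Here relayA contributes 24 hours to R\H, but the aggregate subtraction
reports 0 because relayB's hours in H cancel them out.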

> Compute h(R∧H) as the number of written directory bytes for the
> fraction of time when a server was reporting both written directory
> bytes and directory responses. As above, this fraction is determined
> by first summing up all interval lengths and then computing the
> minimum of both sums divided by the sum of reported written directory
> bytes.

This seems to be saying to compute h(R∧H) (a count of bytes) as
min(n(R), n(H)) / h(H). This is dimensionally wrong: the units are
hours / bytes. What would be more natural to me is
min(n(R), n(H)) / max(n(R), n(H)) × h(H); i.e., divide the smaller of
n(R) and n(H) by the larger, then multiply this ratio by the observable
byte count. But this, too, only works when H is a subset of R.
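
The "more natural" version above, as a sketch. This is the interpretation
proposed in this message, not necessarily what the metrics code does; the
function name is made up.

```python
def h_r_and_h(n_r, n_h, h_h):
    """Estimate h(R∧H) as min(n(R), n(H)) / max(n(R), n(H)) * h(H).

    The ratio of hours is dimensionless, so the result is in bytes.
    """
    larger = max(n_r, n_h)
    if larger <= 0:
        return 0.0
    return min(n_r, n_h) / larger * h_h


# Toy numbers: half as many request-count hours as write-history hours.
print(h_r_and_h(10, 20, 100.0))  # prints 50.0
```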

Where is this computation done in the metrics code? I would like to
refer to it, but I could not find it.

Using the formulas and assumptions above, here's my attempt at computing
recent "frac" values:

date         n(N)    n(H)     h(H)    n(R)  n(R\H)   h(R∧H)   frac
2022-04-01 166584  177638  2.24e13  125491       0  1.59e13  0.753
2022-04-02 166951  177466  2.18e13  125686       0  1.54e13  0.753
2022-04-03 167100  177718  2.27e13  127008       0  1.62e13  0.760
2022-04-04 166970  177559  2.43e13  126412       0  1.73e13  0.757
2022-04-05 166729  177585  2.44e13  125389       0  1.72e13  0.752
2022-04-06 166832  177470  2.39e13  127077       0  1.71e13  0.762
2022-04-07 166532  177210  2.48e13  127815       0  1.79e13  0.768
2022-04-08 167695  176879  2.52e13  127697       0  1.82e13  0.761

The "frac" column does not match the CSV. Also notice that n(N) < n(H),
which should be impossible because H is supposed to be a subset of N
(N is the set of all relays). But this is what I get when I estimate
n(N) from a network-status-consensus-3 and n(H) from extra-info
documents. Also notice that n(R) < n(H), which means that H cannot be a
subset of R, contrary to what the formulas above require.
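
A consistency check on the table, under the interpretations described in
this message: whenever n(R) ≤ n(H), n(R\H) is 0 and h(R∧H) is
n(R) / n(H) × h(H), so h(H) and n(H) cancel and frac reduces to n(R) / n(N).
The sketch below recomputes both forms from the table's own (rounded)
values. The function name is made up for illustration.

```python
# (date, n(N), n(H), h(H), n(R), frac as listed in the table)
ROWS = [
    ("2022-04-01", 166584, 177638, 2.24e13, 125491, 0.753),
    ("2022-04-02", 166951, 177466, 2.18e13, 125686, 0.753),
    ("2022-04-03", 167100, 177718, 2.27e13, 127008, 0.760),
    ("2022-04-04", 166970, 177559, 2.43e13, 126412, 0.757),
    ("2022-04-05", 166729, 177585, 2.44e13, 125389, 0.752),
    ("2022-04-06", 166832, 177470, 2.39e13, 127077, 0.762),
    ("2022-04-07", 166532, 177210, 2.48e13, 127815, 0.768),
    ("2022-04-08", 167695, 176879, 2.52e13, 127697, 0.761),
]


def estimate_frac(n_n, n_h, h_h, n_r):
    """frac under the interpretations of n(R\\H) and h(R∧H) used here."""
    n_r_minus_h = max(0.0, n_r - n_h)
    h_r_and_h = min(n_r, n_h) / max(n_r, n_h) * h_h
    return (h_r_and_h * n_h + h_h * n_r_minus_h) / (h_h * n_n)


for date, n_n, n_h, h_h, n_r, table_frac in ROWS:
    full = estimate_frac(n_n, n_h, h_h, n_r)
    ratio = n_r / n_n
    print(date, f"{full:.3f}", f"{ratio:.3f}", f"{table_frac:.3f}")
```

All three printed columns agree for every row, so with these inputs the
estimate carries no information from the byte counts at all.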