[tor-dev] Proposal draft: Better hidden service stats from Tor relays
Hello there,
I inline a copy of a proposal we've been working on lately. Discussion
can be found in the "Feedback on obfuscating hidden-service statistics"
thread.
The proposal suggests that Tor relays add some stats about hidden
service usage. We believe that these stats are not dangerous and can
be useful to Tor developers and to people who want to understand
hidden services and the onionspace better.
Any feedback is appreciated :)
======
Filename: 238-hs-relay-stats.txt
Title: Better hidden service stats from Tor relays
Author: George Kadianakis, David Goulet, Karsten Loesing, Aaron Johnson
Created: 2014-11-17
Status: Draft
0. Motivation
Hidden services are one of the least understood parts of the Tor
network. We don't really know how many hidden services there are
and how much they are used.
This proposal suggests that Tor relays include some hidden-service
related stats in their extra-info descriptors. No stats are
collected from Tor hidden services or clients.
While uncertainty might be a good thing in a hidden network,
learning more information about the usage of hidden services can be
helpful.
For example, learning how many cells are sent for hidden service
purposes tells us whether hidden service traffic is 2% of the Tor
network traffic or 90% of the Tor network traffic. This info can
also help us during load balancing, for example if we change the
path building of hidden services to mitigate guard discovery
attacks [0].
Also, learning the number of hidden services can give us an
understanding of how widespread hidden services are. It will also
help us understand approximately how much load is put on the
network by hidden service logistics, like introduction point
circuits etc.
1. Design
Tor relays will add some fields related to hidden service
statistics in their extra-info descriptors.
Tor relays collect these statistics by keeping track of their
hidden service directory or rendezvous point activities, slightly
obfuscating the numbers and posting them to the directory
authorities. Extra-info descriptors are posted to directory
authorities every 24 hours.
2. Implementation
2.1. Hidden service statistics interval
We want relays to report hidden-service statistics over a long-enough
time period to not put users at risk. Similar to other statistics, we
suggest a 24-hour statistics interval. All related statistics are
collected at the end of that interval and included in the next
extra-info descriptor published by the relay.
Tor relays will add the following line to their extra-info descriptor:
"hidserv-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL
[At most once.]
YYYY-MM-DD HH:MM:SS defines the end of the included measurement
interval of length NSEC seconds (86400 seconds by default).
A "hidserv-stats-end" line, as well as any other "hidserv-*" line,
is first added after the relay has been running for at least 24
hours.
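For readers who want to consume this line, here is a minimal parsing sketch. It assumes the line looks exactly like the format above; the function name and the sample timestamp are made up for illustration and are not part of the proposal.

```python
import re
from datetime import datetime, timedelta

# Pattern for the "hidserv-stats-end" line format given in section 2.1.
STATS_END_RE = re.compile(
    r"^hidserv-stats-end (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+) s\)$"
)

def parse_hidserv_stats_end(line):
    """Return (interval_start, interval_end, nsec), or None if no match.

    Hypothetical helper, not taken from any Tor code base.
    """
    match = STATS_END_RE.match(line.strip())
    if match is None:
        return None
    end = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    nsec = int(match.group(2))
    return end - timedelta(seconds=nsec), end, nsec

# Illustrative input only; the timestamp is invented.
start, end, nsec = parse_hidserv_stats_end(
    "hidserv-stats-end 2014-11-18 12:00:00 (86400 s)"
)
```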
2.2. Hidden service traffic statistics
We want to learn how much of the total Tor network traffic is caused by
hidden service usage. There are three phases in the rendezvous
protocol where traffic is generated: (1) when hidden services make
themselves available in the network, (2) when clients open connections
to hidden services, and (3) when clients exchange application data with
hidden services. We expect (3) to consume most bytes here, so we're
focusing on this only. More precisely, we measure hidden service
traffic by counting RELAY cells seen on a rendezvous point after
receiving a RENDEZVOUS1 cell. These RELAY cells include commands to
open or close application streams, and they include application data.
Tor relays will add the following line to their extra-info descriptor:
"hidserv-rend-relayed-cells" SP num NL
[At most once.]
Approximate number of RELAY cells seen in either direction on
a circuit after receiving and successfully processing a
RENDEZVOUS1 cell. The actual number observed by the relay
is multiplied by a random number in [0.9, 1.1] and then gets
floored before being reported.
The keyword indicates that this line is part of hidden-service
statistics ("hidserv") and contains aggregate data from the relay
acting as rendezvous point ("rend").
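The obfuscation step described above can be sketched as follows. The proposal only says "a random number in [0.9, 1.1]"; drawing it uniformly is an assumption made here, and the function name is hypothetical.

```python
import math
import random

def obfuscate_count(actual, rng=random):
    """Multiply a count by a random factor in [0.9, 1.1], then floor it.

    Assumption: the factor is drawn uniformly; the draft does not
    name a distribution.
    """
    factor = rng.uniform(0.9, 1.1)
    return math.floor(actual * factor)

# Example with a seeded generator so that runs are repeatable.
reported = obfuscate_count(1000, random.Random(2014))
```

The same helper would apply to any other count that is obfuscated with the same rule.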
2.3. HSDir hidden service counting
We also want to learn how many hidden services exist in the network.
The best place to learn this is at hidden service directories where
hidden services publish their descriptors.
Tor relays will add the following line to their extra-info descriptor:
"hidserv-dir-published-ids" SP num NL
[At most once.]
Approximate number of unique hidden-service identities seen in
descriptors published to and accepted by this hidden-service
directory. The actual number observed by the directory is
multiplied by a random number in [0.9, 1.1] and then gets
floored before being reported.
This statistic requires keeping a separate data structure with unique
identities seen during the current statistics interval. We could, in
theory, have relays iterate over their descriptor caches when producing
the daily hidden-service statistics blurb. But it's unclear how
caching would affect results from such an approach, because descriptors
published at the start of the current statistics interval could already
have been removed, and descriptors published in the last statistics
interval could still be present. Keeping a separate data structure,
possibly even a probabilistic one, seems like the more accurate
approach.
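A minimal sketch of the separate data structure, assuming an exact in-memory set rather than a probabilistic structure. The class and method names are hypothetical, and the identity strings in the example are placeholders.

```python
class PublishedIdCounter:
    """Tracks unique hidden-service identities for one statistics interval.

    Hypothetical sketch: an exact set is used here; a probabilistic
    structure could be swapped in to avoid storing identities.
    """

    def __init__(self):
        self._seen = set()

    def note_descriptor(self, service_id):
        # Called once for every descriptor that was published and accepted.
        self._seen.add(service_id)

    def finish_interval(self):
        # Return the exact count and start a fresh interval. The result
        # would still need to be obfuscated before it is reported.
        count = len(self._seen)
        self._seen = set()
        return count

counter = PublishedIdCounter()
for service_id in ["id-a", "id-b", "id-a"]:  # placeholder identities
    counter.note_descriptor(service_id)
unique_ids = counter.finish_interval()
```

Resetting the set at the end of each interval is what keeps descriptors from the previous interval out of the current count.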
3. Security
The main security considerations that need discussion are what an
adversary could do with reported statistics that they couldn't do
without them. In the following, we're going through things the
adversary could learn, how plausible that is, and how much we care.
(All these things refer to hidden-service traffic, not to
hidden-service counting. We should think about the latter, too.)
3.1. Identify rendezvous point of high-volume and long-lived connection
The adversary could identify the rendezvous point of a very large and
very long-lived HS connection by observing a relay with unexpectedly
large relay cell count.
3.2. Identify number of users of a hidden service
The adversary may be able to identify the number of users
of an HS if he knows the amount of traffic on a connection to that HS
(which he potentially can determine himself) and knows when that
service goes up or down. He can look at the change in the total
reported RP traffic to determine about how many fewer HS users there
are when that HS is down.
4. Discussion
4.1. Why count only RP cells? Why not also count IP cells?
As discussed on IRC, counting only RP cells should be fine for now.
Everything else is protocol overhead, which includes HSDir traffic,
introduction point traffic, or rendezvous point traffic before the
first RELAY cell, etc.
Furthermore, introduction points correspond to specific HSes, so
publishing IP cell stats could reveal the popularity of specific
HSes.
4.2. How to use these stats?
4.2.1. How to use RP Cell statistics
We plan to extrapolate reported values to network totals by dividing
values by the probability of clients picking relays as rendezvous
point. This approach should become more precise for faster relays and
as more relays report these statistics.
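The extrapolation could look roughly like this. Combining several relays by summing their reported values and dividing by their summed probabilities is one possible choice, not something the proposal prescribes; the function name and the numbers in the example are invented.

```python
def extrapolate_network_total(reports):
    """Estimate a network-wide cell total from per-relay reports.

    reports: iterable of (reported_cells, rend_probability) pairs, where
    rend_probability is the chance that a client picks that relay as
    rendezvous point. Assumption: reports are pooled before dividing.
    """
    reports = list(reports)
    total_cells = sum(cells for cells, _ in reports)
    total_prob = sum(prob for _, prob in reports)
    if total_prob <= 0:
        raise ValueError("summed rendezvous probability must be positive")
    return total_cells / total_prob

# Two made-up relays that together cover 4% of rendezvous selections.
estimate = extrapolate_network_total([(1000, 0.01), (3000, 0.03)])
```

Pooling makes the estimate less sensitive to the noise on any single relay, which matches the expectation that precision improves as more relays report.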
We also plan to compare reported values with "cell-*" statistics to
learn what fraction of traffic can be attributed to hidden services.
Ideally, we'd be able to compare values to "write-history" and
"read-history" lines to compute similar fractions of traffic used for
hidden services. The goal would be to avoid enabling "cell-*"
statistics by default. In order for this to work we'll have to
multiply reported cell numbers by the default cell size of 512 bytes
(we cannot infer the actual number of bytes, because cells are
end-to-end encrypted between client and service).
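As a worked example of the cell-to-byte conversion, here is a small sketch. How exactly the byte total is derived from "write-history" and "read-history" lines is left open above, so total_bytes is simply a caller-supplied figure here, and the function name is hypothetical.

```python
CELL_SIZE_BYTES = 512  # default cell size named in the text above

def hidserv_traffic_fraction(reported_cells, total_bytes):
    """Approximate share of a relay's traffic attributable to hidden services.

    Assumption: total_bytes was derived by the caller from the relay's
    bandwidth history for the same statistics interval.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return reported_cells * CELL_SIZE_BYTES / total_bytes

# 2000 reported cells against 10,240,000 bytes of history (made-up numbers).
fraction = hidserv_traffic_fraction(2000, 10_240_000)
```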
4.2.2. How to use HSDir HS statistics
We plan to extrapolate this value to network totals by calculating what
fraction of hidden-service identities this relay was supposed to see.
This extrapolation will be very rough, because each hidden-service
directory is only responsible for a tiny share of hidden-service
descriptors, and there is no way to increase that share significantly.
Here are some numbers: there are about 3000 directories, and each
descriptor is stored on three directories. So, each directory is
responsible for roughly 1/1000 of descriptor identifiers. There are
two replicas for each descriptor (that is, each descriptor is stored
under two descriptor identifiers), and descriptor identifiers change
once per day (which means that, during a 24-hour period, there are two
opportunities for each directory to see a descriptor). Hence, each
descriptor is stored to four places in identifier space throughout a
24-hour period. The probability that any given directory sees a given
hidden-service identity is 1-(1-1/1000)^4 = 0.00399, or roughly 1/250.
This approximation constitutes an upper bound, because it assumes that
services are running all day. An extrapolation based on this formula
will therefore tend to undercount the total number of hidden services.
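The arithmetic above can be reproduced with a short sketch. The default parameter values are the rough numbers quoted in the text; the function names and the example count are hypothetical.

```python
def hsdir_seen_probability(num_hsdirs=3000, copies=3, placements=4):
    """Probability that one directory sees a given service within 24 hours.

    copies / num_hsdirs is the share of identifier space per directory
    (roughly 1/1000); placements is the number of places in identifier
    space a descriptor is stored to per day (2 replicas x 2 identifiers).
    """
    share = copies / num_hsdirs
    return 1 - (1 - share) ** placements

def extrapolate_service_count(reported_ids, probability):
    """Scale one directory's unique-identity count to a network estimate."""
    return reported_ids / probability

p = hsdir_seen_probability()
# A directory reporting 120 identities (made-up number).
estimated_services = extrapolate_service_count(120, p)
```

Because the probability is an upper bound, the resulting estimate should be read as a lower bound on the number of services.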
A possible inaccuracy in the estimation algorithm comes from the fact
that a relay may not be acting as hidden-service directory during the
full statistics interval. We'll have to look at consensuses to
determine when the relay first received the "HSDir" flag, and only
consider the part of the statistics interval following the valid-after
time of that consensus.
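One way to account for a partial interval is to compute the fraction of the statistics interval during which the relay had the "HSDir" flag. This is a sketch under the assumption that the relay kept the flag until the end of the interval; the function name and the timestamps are invented.

```python
from datetime import datetime, timedelta

def hsdir_interval_fraction(interval_end, interval_seconds, hsdir_since):
    """Fraction of the statistics interval that follows hsdir_since.

    hsdir_since is the valid-after time of the first consensus that
    listed the relay with the "HSDir" flag. Assumption: the flag was
    not lost again before interval_end.
    """
    interval_start = interval_end - timedelta(seconds=interval_seconds)
    effective_start = max(interval_start, hsdir_since)
    covered = (interval_end - effective_start).total_seconds()
    return max(0.0, covered) / interval_seconds

# The relay gained the flag halfway through a 24-hour interval.
fraction_as_hsdir = hsdir_interval_fraction(
    datetime(2014, 11, 18, 12, 0, 0), 86400, datetime(2014, 11, 18, 0, 0, 0)
)
```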
4.3. Multiplicative or additive noise?
A possible alternative to multiplying the number of cells with a random
factor is to introduce additive noise. Let's suppose that we would
like to obscure any individual connection that contains C cells or
fewer (obscuring extremely and unusually large connections seems
hopeless but unnecessary). That is, we don't want the (distribution
of the) cell count from any relay to change by much whether or not C
cells are removed. The standard differential privacy approach would be
to *add* noise from the Laplace distribution Lap(\epsilon/C) (written
here with its rate \epsilon/C, which corresponds to a scale of
C/\epsilon), where \epsilon controls how much the statistics
*distribution* can multiplicatively differ. This is not to say that we
need to add noise
exactly from that distribution (maybe we weaken the guarantee slightly
to get better accuracy), but the same idea applies. This would apply
the same to both large and small relays. We *want* to learn roughly
how much hidden-service traffic each relay has - we just want to
obscure the exact number within some tolerance. We'll probably want to
include the algorithm and parameters used for adding noise in the
"hidserv-rend-relayed-cells" line, as in, "lap=x" with x being
\epsilon/C.
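For comparison with the multiplicative approach, here is a minimal sketch of additive Laplace noise. It reads \epsilon/C as the rate of the distribution (scale C/\epsilon), which is the usual differential-privacy calibration; the function name and the parameter values in the example are illustrative, not proposed defaults.

```python
import random

def add_laplace_noise(actual_cells, epsilon, c, rng=random):
    """Add zero-mean Laplace noise with rate epsilon / c (scale c / epsilon).

    The difference of two independent exponential variates with the same
    rate follows a Laplace distribution centered at zero.
    """
    if epsilon <= 0 or c <= 0:
        raise ValueError("epsilon and c must be positive")
    rate = epsilon / c
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return actual_cells + noise

# Illustrative values only: hide connections of up to 10 cells, epsilon 1.0.
noisy = add_laplace_noise(1000, epsilon=1.0, c=10, rng=random.Random(1))
```

Unlike the multiplicative factor, the noise magnitude here does not grow with the relay's cell count, which is why it applies the same to large and small relays.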
[0]: guard discovery: https://lists.torproject.org/pipermail/tor-dev/2014-September/007474.html