
Re: Proposal 161: Computing Bandwidth Adjustments



On Tue, May 12, 2009 at 07:48:28PM -0700, Mike Perry wrote:
> Title: Computing Bandwidth Adjustments
> Author: Mike Perry
> Created: 12-May-2009
> Target: 0.2.2.x

(Checked in to the main repo as Proposal 161.  I added a "filename"
and a "status" field here.)

The proposal looks good for a start, but I'd like to see more
discussion of some points.  I have some questions below.

 [...]
> 2. Average Stream Bandwidth Calculation
>  
>   The average stream bandwidths are obtained by dividing the network
>   into 3% slices according to advertised node bandwidth, yielding 
>   about 45 nodes per slice in the current network.
> 
>   Two hop circuits are built using nodes from the same slice, and a large
>   file is downloaded via these circuits. This process is repeated
>   several hundred times, and average stream capacities are assigned to
>   each node from these results.

There are some magic numbers here: Why 3%?  How large are these files
and how often are they downloaded?  Why "several hundred"?  Is the
"several hundred" a function of the "45" above?

I want to see a calculation for what fraction of the network capacity
we expect to use for this testing.  I wouldn't expect it to be over a
percent, but I'd like to see the math to prove it.
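
For what it's worth, here is the shape of the math I have in mind.
Every concrete number in the sketch below is an assumption of mine
(file size, stream count, scan duration, aggregate capacity), not a
value taken from the proposal:

    # Rough, illustrative estimate of the scan's share of network capacity.
    # Every concrete number below is an assumption for the sake of example,
    # not a value taken from the proposal.
    num_slices = 33              # 3% slices => roughly 33 slices
    streams_per_slice = 300      # "several hundred" downloads per slice
    file_size_bytes = 1.0e6      # assumed size of the test file
    scan_seconds = 7 * 86400     # assumed time to scan the whole network

    scan_rate = num_slices * streams_per_slice * file_size_bytes / scan_seconds
    network_capacity = 100.0e6   # assumed aggregate network capacity, bytes/s

    print("scan uses about %.3f%% of capacity"
          % (100.0 * scan_rate / network_capacity))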

Also, how long does it take to run these tests over the network?  If
it takes much longer than a consensus interval, we should think about
what happens if a node's advertised capacity changes greatly, and
whether we want to treat newer measurements as more accurate.

Another possibility I want to consider: what if, instead of putting
nodes into slices based on their declared bandwidths, we put them into
slices based on the most recent bandwidth for them in the consensus?
That way, if a node advertises 100x as much bandwidth as it really
has, we will soon start measuring it along with the other really slow
nodes, rather than measuring it along with the high-capacity nodes
forever.

[Ah.  I see on IRC that the above paragraph is what you already have
in mind.]
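
For concreteness, the slicing change itself is small.  A sketch, where
get_consensus_bandwidth() is a hypothetical helper and not anything in
the existing scanner:

    # Sketch: partition relays into 3% slices keyed on their most recent
    # consensus bandwidth rather than their self-declared bandwidth.
    # get_consensus_bandwidth() is a hypothetical helper.
    def make_slices(relays, slice_fraction=0.03):
        ordered = sorted(relays, key=get_consensus_bandwidth, reverse=True)
        slice_size = max(1, int(len(ordered) * slice_fraction))
        return [ordered[i:i + slice_size]
                for i in range(0, len(ordered), slice_size)]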

 [...]
> 3. Ratio Filtering
> 
>   After the base ratios are calculated, a second pass is performed
>   to remove any streams with nodes of ratios less than X=0.5 from
>   the results of other nodes.   In addition, all outlying streams
>   with capacity of one standard deviation below a node's average
>   are also removed.
> 
>   The final ratio result will be calculated as the maximum of 
>   these two resulting ratios if both are less than 1.0, the minimum
>   if both are greater than 1.0, and the mean if one is greater
>   and one is less than 1.0.

I am a little confused about what happens here.  Can you clarify this
with some pseudocode for the whole algorithm?  Based on what you've
written, I _think_ it would go something like:

     1. Partition nodes into equal-sized slices of approximately 45
        nodes each, dividing slices by declared bandwidth.
     2. For each slice, build about 100 two-hop circuits, and check
        the bandwidth on each.
     3. For each slice S, let BW_avg(S) = MEAN(b | b is the bandwidth
        of a stream built through S).  For each node N in S, let
        BW_measured(N) = MEAN(b | b is the bandwidth of a stream that
        went through N), and Ratio(N) = BW_measured(N) / BW_avg(S).
     4. For each slice S, let BW_stddev(S) = the standard deviation of
        all stream bandwidths in S.  Let Normal_Streams(S) = { all
        streams in S whose bandwidth b is at least 0.5 * BW_avg(S),
        and such that |b - BW_avg(S)| <= BW_stddev(S) }.
     5. For each node N, let Ratio2(N) be computed as in step 3, but
        using only the streams in Normal_Streams(S).
     6. For each node N, if Ratio(N) and Ratio2(N) are on the same
        side of 1.0, let Ratio_final(N) be whichever of the two is
        closer to 1.0.  Otherwise, let Ratio_final(N) be the mean of
        Ratio(N) and Ratio2(N).

Am I close?  Let's get the right pseudocode into the proposal; this is
a complicated enough algorithm that it could use a mathy exposition.
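
To make sure I'm reading it the same way, here is a rough Python
rendering of that interpretation for a single slice.  The structure
and names are mine, not TorFlow's:

    # Rough rendering of the interpretation above, for one slice.
    # "streams" is a list of (bandwidth, nodes) pairs, where nodes is the
    # set of relays the stream went through.  Names are mine, not TorFlow's.
    from statistics import mean, pstdev

    def final_ratios(streams):
        bw_avg = mean(bw for bw, _ in streams)
        bw_dev = pstdev([bw for bw, _ in streams])

        def ratios(stream_set):
            # NOTE: both passes use the unfiltered slice average as denominator.
            out = {}
            for node in {n for _, nodes in stream_set for n in nodes}:
                node_bws = [bw for bw, nodes in stream_set if node in nodes]
                out[node] = mean(node_bws) / bw_avg
            return out

        ratio1 = ratios(streams)                               # step 3
        normal = [(bw, nodes) for bw, nodes in streams         # step 4
                  if bw >= 0.5 * bw_avg and abs(bw - bw_avg) <= bw_dev]
        ratio2 = ratios(normal)                                # step 5

        final = {}                                             # step 6
        for node, r1 in ratio1.items():
            r2 = ratio2.get(node, r1)   # node may lose all streams to filtering
            if (r1 - 1.0) * (r2 - 1.0) >= 0:        # same side of 1.0
                final[node] = min((r1, r2), key=lambda r: abs(r - 1.0))
            else:
                final[node] = (r1 + r2) / 2.0
        return final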

Also, I'm not sure what the motivation is.  Have you tried it without
this step, and found that it didn't work so well?

> 4. Security implications
> 
>   The ratio filtering will deal with cases of sabotage by dropping
>   both very slow outliers in stream average calculations, as well
>   as dropping streams that used very slow nodes from the calculation
>   of other nodes.
> 
>   This scheme will not address nodes that try to game the system by
>   providing better service to scanners. The scanners can be detected
>   at the entry by IP address, and at the exit by the destination fetch.
> 
>   Measures can be taken to obfuscate and separate the scanners' source
>   IP address from the directory authority IP address. For instance,
>   scans can happen offsite and the results can be rsynced into the
>   authorities.  The destination fetch can also be obscured by using SSL
>   and periodically changing the large document that is fetched. 
> 
>   Neither of these methods are foolproof, but such nodes can already
>   lie about their bandwidth to attract more traffic, so this solution
>   does not set us back any in that regard.

Yeah.  I suppose we can carry on an arms-race here if somebody really
wants to start one.

> 4. Integration with Proposal 160
> 
>   The final results will be produced for the voting mechanism
>   described in Proposal 160 by multiplying the derived ratio by 
>   the average observed advertised bandwidth during the course of the
>   scan. This will produce a new bandwidth value that will be 
>   output into a file consisting of lines of the form:
> 
>   <node-idhex> SP new_bandwidth NL
> 
>   This file can be either copied or rsynced into a directory readable
>   by the directory authority.

Let's make this a little more extensible, so we can add new fields in
the future without breaking things.  Let's say that the format is

    "node id=<idhex> SP bw=<new_bandwidth> NL"

and specify that implementations must ignore any line that starts with
a keyword they don't recognize, and any key=value entry within a line
whose key they don't recognize.

That way, if we want to tell Tor more in the future, we can.
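
A tolerant parser along those lines could be quite small.  A sketch,
illustrative only, with the keyword set limited to the two fields
above:

    # Sketch of a forward-compatible parser for the measurement file.
    # Lines look like "node id=<idhex> bw=<new_bandwidth>"; unknown line
    # keywords and unknown key=value entries are ignored, as suggested above.
    def parse_measurements(lines):
        results = {}
        for line in lines:
            parts = line.split()
            if not parts or parts[0] != "node":
                continue                  # unrecognized line keyword: skip it
            fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
            if "id" in fields and "bw" in fields:
                results[fields["id"]] = int(fields["bw"])
            # any extra key=value entries are silently ignored
        return results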


* *

So, something that didn't come up: I think it's important for the
weight values to change somewhat slowly in response to a node's
measured speed or slowness.  Otherwise, we can wind up with a
situation where we notice
a node is slow, so everybody leaves, so we notice it is fast, so
everybody hammers it.  I _think_ that the above algorithm has this
property--does it?

(Crazy idea 1: If the values change too fast or too slowly, we could
exponentiate the ratios that come out of the algorithm before using them.
Slightly less crazy idea 2: if the values change too fast, we could do
a standard time-weighted average, where we compute the new declared
bandwidth BW' as a weighted average of the old declared bandwidth and
the new measured bandwidth.)
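
For idea 2, the damping itself is a one-liner; in this sketch, alpha
is a made-up smoothing parameter, not anything from the proposal:

    # Sketch of idea 2: blend the old declared bandwidth with the new
    # measurement so that weights move gradually.  alpha is an assumed
    # smoothing parameter between 0 and 1.
    def updated_bandwidth(old_bw, measured_bw, alpha=0.25):
        return (1.0 - alpha) * old_bw + alpha * measured_bw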


yrs,
-- 
Nick