
Re: [or-cvs] r9993: Describe a simpler implementation for proposal 108, and note (in tor/trunk: . doc/spec/proposals)



On Wed, Apr 25, 2007 at 03:16:28PM -0400, Roger Dingledine wrote:
> On Fri, Apr 20, 2007 at 01:17:15PM -0400, nickm@xxxxxxxx wrote:
> >    tor/trunk/doc/spec/proposals/108-mtbf-based-stability.txt
> [snip]
> > +Alternative:
> > +
> > +   "A router's Stability shall be defined as the sum of $alpha ^ d$ for every
> > +   $d$ such that the router was not observed to be unavailable $d$ days ago."
> > +
> > +   This allows a simpler implementation: every day, we multiply yesterday's
> > +   Stability by alpha, and if the router was running for all of today, we add
> > +   1.
> 
> I don't think you mean quite that. For a server that just appeared,
> there are an infinite number of previous days where it was not observed
> to be unavailable. Do you mean 'was observed to be available'?

Ah, you're right.

> And by available, do we mean available for the entire day?

I think so, for arbitrary values of "day".
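
To make the corrected definition concrete, here is a minimal sketch of
the daily update next to the direct sum, assuming "available" means
"observed up for the whole day".  The function names and the default
alpha of 0.9 are placeholders for illustration, not anything in the
tree.

```python
def update_stability(prev, up_all_day, alpha=0.9):
    """One daily step: decay yesterday's value, add 1 for a full day up."""
    return alpha * prev + (1.0 if up_all_day else 0.0)


def stability_from_history(history, alpha=0.9):
    """Apply the daily step over history; history[-1] is the most recent day."""
    stability = 0.0  # a router we have never observed starts at zero
    for up_all_day in history:
        stability = update_stability(stability, up_all_day, alpha)
    return stability


def stability_direct(history, alpha=0.9):
    """Sum alpha**d over every d (0 = most recent day) observed available."""
    n = len(history)
    return sum(alpha ** (n - 1 - i) for i, up in enumerate(history) if up)
```

Both functions should agree on any history, and a router that just
appeared (an empty history) gets 0 rather than an infinite sum.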

> 
> What are some ways we can choose \alpha?

We should probably decide how much we'd like to discount the distant
past.  Something between .80 and .95 is probably about right.
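
One way to compare candidate values is to look at how quickly a day's
contribution decays and at the ceiling a permanently-up router
converges to (the geometric series 1/(1 - alpha)).  A rough sketch;
the helper names are made up for this message.

```python
import math


def half_life_days(alpha):
    """Days until a single day's contribution has decayed to one half."""
    return math.log(0.5) / math.log(alpha)


def max_stability(alpha):
    """Limit of 1 + alpha + alpha**2 + ... for a router that is always up."""
    return 1.0 / (1.0 - alpha)


for alpha in (0.80, 0.90, 0.95):
    print(f"alpha={alpha:.2f}  "
          f"half-life={half_life_days(alpha):5.1f} days  "
          f"ceiling={max_stability(alpha):5.1f}")
```

At .80 a day's weight halves in about 3 days and the ceiling is 5; at
.95 it halves in about 13.5 days and the ceiling is 20.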

> 
> > +Limitations:
> > +
> > +   Authorities can have false positives and false negatives when trying to
> > +   tell whether a router is up or down.  So long as these aren't terribly
> > +   wrong, and so long as they aren't significantly biased, we should be able
> > +   to use them to estimate stability pretty well.
> 
> I haven't seen any discussion about how the router's declared uptime fits
> into this. If a router goes down and then comes up again in between
> measurements, the proposed approach will treat it as being up the
> whole time -- yet connections through it will be broken. One approach
> to handling this would be to notice if the uptime decreases from one
> descriptor to the next. This would indicate a self-declared downtime
> for the router, and we can just figure that into the calculations.

This would be a good thing, but it _would_ give routers incentive to
lie about uptime.

> 
> I'm not sure how we should compute the length of the downtime though:
> in some cases it will be just a split second as for a reboot or upgrade,
> but in others maybe the computer, network, or Tor process went down
> and then came back a long time later. I guess since our computations
> are just rough approximations anyway, we can just assume a zero-length
> downtime unless our active testing also noticed it.

Actually, I chose "up for an entire day" as a minimum quantum for a
reason.  The main problem with router instability isn't the fraction
of time it's down; if you try to connect to a router that isn't there,
that's not a big deal.  The problem with router instability is the
likelihood that it will _go_ down and drop all your circuits.
Remember, a router that goes down for 5 minutes out of every hour
has a _higher_ fractional uptime than a router that goes down for one
day out of every week... but the latter router is far more stable, and
far more useful if your goal is long-lived circuits.

(That's why I originally chose MTBF rather than uptime percentage.
I'm _trying_ to approximate the same insight by requiring you to be up
for the entirety of a day rather than a fraction of it, but there may
be better ways to approximate it.)
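
The two routers above can be compared with a small simulation.  This
is only a sketch under invented assumptions: per-minute observations,
a 28-day window, alpha of 0.9, and two hypothetical outage patterns.

```python
MINUTES_PER_DAY = 24 * 60


def flaky_up(minute):
    """Hypothetical router A: down for the first 5 minutes of every hour."""
    return minute % 60 >= 5


def weekly_up(minute):
    """Hypothetical router B: down for the whole first day of every week."""
    return (minute // MINUTES_PER_DAY) % 7 != 0


def summarize(is_up, days=28, alpha=0.9):
    """Return (fractional uptime, whole-day stability) over the window."""
    up_minutes = 0
    stability = 0.0
    for day in range(days):
        minutes = range(day * MINUTES_PER_DAY, (day + 1) * MINUTES_PER_DAY)
        up_today = [is_up(m) for m in minutes]
        up_minutes += sum(up_today)
        # Only a day with no downtime at all earns credit.
        stability = alpha * stability + (1.0 if all(up_today) else 0.0)
    return up_minutes / (days * MINUTES_PER_DAY), stability


print("A:", summarize(flaky_up))
print("B:", summarize(weekly_up))
```

Router A comes out with the higher fractional uptime (about 91.7%
versus about 85.7%) but a stability of zero, since it never completes
a day without dropping circuits; router B earns credit on six days out
of seven.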

> 
> Speaking of the active testing, here's what we do right now:
> 
> Every 10 seconds, we call dirserv_test_reachability(), and it tries making
> connections to a different 1/128 of the router list. So a given router
> gets tried every 1280 seconds, or a bit over 21 minutes. We declare a
> router to be unreachable if it has not been successfully found reachable
> within the past 45 minutes. So at least two testing periods need to go
> by before a running router is considered to be no longer running.
> 
> So our measurements won't be perfect, but I think this approach is a
> much better one than just blindly believing the uptime entry in the
> router descriptor.
> 
> What is our plan for storing (and publishing?) the observed uptime
> periods for each router?

I don't think publishing is necessary; there's nothing to stop us from
doing it later if we choose.

To store the uptime, I was thinking of a flat file written
periodically; it would probably be something like 64K at the moment,
which wouldn't be a big problem for authorities to flush every 10
minutes or so.  If we wanted to be fancier, we could keep an
append-only events journal, and periodically use it to rebuild a
status file, but that doesn't seem necessary.
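
A minimal sketch of the flat-file option, assuming one
"fingerprint stability" line per router and a write-to-temp-then-rename
so that a crash mid-flush never leaves a truncated file.  The record
format and function names are hypothetical; nothing here is specified
by the proposal.

```python
import os
import tempfile


def write_status_file(path, stability_by_router):
    """Atomically replace path with one 'fingerprint stability' line per router."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".stability-")
    try:
        with os.fdopen(fd, "w") as f:
            for fingerprint in sorted(stability_by_router):
                f.write(f"{fingerprint} {stability_by_router[fingerprint]:.6f}\n")
            f.flush()
            os.fsync(f.fileno())      # make sure the data is on disk first
        os.replace(tmp_path, path)    # then swap it into place in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_status_file(path):
    """Load the file written by write_status_file back into a dict."""
    result = {}
    with open(path) as f:
        for line in f:
            fingerprint, value = line.split()
            result[fingerprint] = float(value)
    return result
```

An authority could call write_status_file() from whatever periodic
callback does the 10-minute flush and read_status_file() once at
startup.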

We could also start poking at the dark sad world of Berkeley DB and
friends, I guess.  The annoyances of that are well known, but it won't
be too bad if we only require it on authorities.

yrs,
-- 
Nick Mathewson
