On Wed, Apr 25, 2007 at 03:16:28PM -0400, Roger Dingledine wrote:
> On Fri, Apr 20, 2007 at 01:17:15PM -0400, nickm@xxxxxxxx wrote:
> > tor/trunk/doc/spec/proposals/108-mtbf-based-stability.txt
> [snip]
> > +Alternative:
> > +
> > +   "A router's Stability shall be defined as the sum of $alpha ^ d$ for every
> > +   $d$ such that the router was not observed to be unavailable $d$ days ago."
> > +
> > +   This allows a simpler implementation: every day, we multiply yesterday's
> > +   Stability by alpha, and if the router was running for all of today, we add
> > +   1.
>
> I don't think you mean quite that. For a server that just appeared,
> there are an infinite number of previous days where it was not observed
> to be unavailable. Do you mean 'was observed to be available'?

Ah, you're right.

> And by available, do we mean available for the entire day?

I think so, for arbitrary values of "day".

>
> What are some ways we can choose \alpha?

We should probably decide how much we'd like to discount the distant
past. Something between .80 and .95 is probably around right.
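To make that concrete, here's a small sketch in C of the daily update
step; the names (STABILITY_ALPHA, stability_daily_update()) are
invented for the example, not taken from the Tor source. One way to
think about the .80-.95 range: with alpha = 0.9, the weight of a given
day halves after about 6.6 days; with 0.8 it halves in about 3.1 days,
and with 0.95 in about 13.5 days.

    /* Sketch only; none of these names exist in Tor. */
    #include <stdio.h>

    #define STABILITY_ALPHA 0.9  /* how quickly the past is discounted */

    /* Decay yesterday's score by alpha; a full day of uptime adds 1. */
    static double
    stability_daily_update(double stability, int was_up_all_day)
    {
      return STABILITY_ALPHA * stability + (was_up_all_day ? 1.0 : 0.0);
    }

    int
    main(void)
    {
      double s = 0.0;
      int day;
      /* A router that is up six days out of every seven: */
      for (day = 0; day < 60; day++)
        s = stability_daily_update(s, (day % 7) != 6);
      /* Settles into a cycle somewhat below the always-up limit of
       * 1/(1 - STABILITY_ALPHA) = 10. */
      printf("stability after 60 days: %f\n", s);
      return 0;
    }

A nice property of this form is that an authority only has to remember
one number per router, not the whole history of observations.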
>
> > +Limitations:
> > +
> > +   Authorities can have false positives and false negatives when trying to
> > +   tell whether a router is up or down. So long as these aren't terribly
> > +   wrong, and so long as they aren't significantly biased, we should be able
> > +   to use them to estimate stability pretty well.
>
> I haven't seen any discussion about how the router's declared uptime fits
> into this. If a router goes down and then comes up again in between
> measurements, the proposed approach will treat it as being up the
> whole time -- yet connections through it will be broken. One approach
> to handling this would be to notice if the uptime decreases from one
> descriptor to the next. This would indicate a self-declared downtime
> for the router, and we can just figure that into the calculations.

This would be a good thing, but it _would_ give routers an incentive
to lie about their uptime.

>
> I'm not sure how we should compute the length of the downtime though:
> in some cases it will be just a split second, as for a reboot or upgrade,
> but in others maybe the computer, network, or Tor process went down
> and then came back a long time later. I guess since our computations
> are just rough approximations anyway, we can just assume a zero-length
> downtime unless our active testing also noticed it.

Actually, I chose "up for an entire day" as a minimum quantum for a
reason. The main problem with router instability isn't the fraction
of time it's down; if you try to connect to a router that isn't there,
that's not a big deal. The problem with router instability is the
likelihood that it will _go_ down and drop all your circuits.

Remember, a router that goes down for 5 minutes out of every hour has
a _higher_ fractional uptime than a router that goes down for one day
out of every week... but the latter router is far more stable, and far
more useful if your goal is long-lived circuits. (The first is up
55/60 ~= 92% of the time, but never runs for more than 55 minutes
straight; the second is up only 6/7 ~= 86% of the time, but runs for
six days between failures.)

(That's why I originally chose MTBF rather than uptime percentage.
I'm _trying_ to approximate the same insight by requiring you to be
up for the entirety of a day rather than a fraction of it, but there
may be better ways to approximate it.)

>
> Speaking of the active testing, here's what we do right now:
>
> Every 10 seconds, we call dirserv_test_reachability(), and it tries making
> connections to a different 1/128 of the router list. So a given router
> gets tried every 1280 seconds, or a bit over 21 minutes. We declare a
> router to be unreachable if it has not been successfully found reachable
> within the past 45 minutes. So at least two testing periods must go
> by before a running router is considered to be no longer running.
>
> So our measurements won't be perfect, but I think this approach is a
> much better one than just blindly believing the uptime entry in the
> router descriptor.
>
> What is our plan for storing (and publishing?) the observed uptime
> periods for each router?

I don't think publishing is necessary; there's nothing to stop us from
doing it later if we choose.

To store the uptime, I was thinking of a flat file written
periodically; it would probably be something like 64K at the moment,
which wouldn't be a big problem for authorities to flush every 10
minutes or so. If we wanted to be fancier, we could keep an
append-only events journal, and periodically use it to rebuild a
status file, but that doesn't seem necessary. We could also start
poking at the dark sad world of Berkeley DB and friends, I guess.
The annoyances of that are well known, but it won't be too bad if we
only require it on authorities.
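For concreteness, here's one way the flat-file flush might look; the
type and function names are invented for the sketch, and a real
version would live in the authority code rather than stand alone.
Writing to a temporary file and renaming it into place means a crash
in mid-flush can't clobber the previous state.

    /* Sketch only; none of these names exist in Tor. */
    #include <stdio.h>
    #include <time.h>

    typedef struct stability_entry_t {
      char identity_hex[41];  /* hex-encoded router identity digest */
      double stability;       /* current alpha-weighted score */
      time_t up_since;        /* 0 if the router is believed down */
    } stability_entry_t;

    /* Write every entry to fname.tmp, then rename it over fname so
     * that readers never see a half-written file. */
    static int
    save_stability(const stability_entry_t *entries, int n,
                   const char *fname)
    {
      char tmpname[512];
      FILE *f;
      int i;
      snprintf(tmpname, sizeof(tmpname), "%s.tmp", fname);
      if (!(f = fopen(tmpname, "w")))
        return -1;
      for (i = 0; i < n; i++)
        fprintf(f, "%s %f %ld\n", entries[i].identity_hex,
                entries[i].stability, (long)entries[i].up_since);
      if (fclose(f) != 0)
        return -1;
      return rename(tmpname, fname);
    }

    int
    main(void)
    {
      /* "DEADBEEF" stands in for a real 40-character digest. */
      stability_entry_t example = { "DEADBEEF", 8.5, 0 };
      return save_stability(&example, 1, "router-stability") ? 1 : 0;
    }

The journal variant would instead append a line per up/down transition
as it happens, and compact the journal into a file like this one every
so often.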
yrs,
--
Nick Mathewson