Re: Writing geoip stats to disk on directories



On Mon, May 25, 2009 at 11:45:51PM +0200, Karsten Loesing wrote:
> My first idea is to synchronize request history periods with writing
> down stats. This basically means writing down stats only when periods
> end. The main reason is that we should ensure that only requests are
> written to disk that have been measured over exactly 24 hours. Writing
> down stats earlier might be problematic from an anonymity point of view.
> And after a restart we don't pick up these values anyway. Longer times
> (or in general different times than 24 hours) would complicate the
> analysis to a certain extent. In terms of code that means dropping
> DUMP_GEOIP_STATS_INTERVAL in main.c and dumping stats whenever we change
> the request period.

That sounds like a great idea. More generally, we have three main
constraints here:

A) We don't want to output any numbers until we have a sufficient number
of hours of data. Otherwise we risk providing an aggregate over too
small a time window.

B) We want to avoid the same problem on the other end. That is, if we
print the rolling 24 hour aggregate every 5 minutes, then people could
learn too much about what happened exactly 24 hours earlier.

C) We want to know the size of the interval that our statistics are over
(as opposed to "between 24 and 48 hours and I won't tell you which").

C') Optionally, we might also want the interval to be over exactly 24
hours, so we don't have to deal with time zone questions. (We'll never
totally get away from this issue, since we'll still wonder about weekends,
holidays, summers, etc.)

So it sounds like the new plan is "output nothing until you have 24
hours of data. Then output the 24 hours of data. Then 8 hours later,
add in the new 8 hours, drop the old 8 hours, and output the new last
24 hours of data. Repeat."

That solves (A) because we wait until we have enough data before outputting
anything; solves (B) because we drop data only in blocks of 8 hours,
which we hope is large enough; and solves (C) because every number we
get is for a 24-hour window.
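
To make the plan concrete, here is a rough sketch of the bookkeeping in Python. It is not Tor code: the class and method names are made up for illustration, and the only values taken from this thread are the 8-hour period (REQUEST_HIST_PERIOD) and the 24-hour window. It assumes something else calls end_period() once every 8 hours.

```python
from collections import Counter, deque

# Values taken from the thread; the names themselves are hypothetical.
PERIOD_HOURS = 8    # length of one request history period
WINDOW_HOURS = 24   # every published aggregate covers exactly this long
PERIODS_PER_WINDOW = WINDOW_HOURS // PERIOD_HOURS  # 3 blocks of 8 hours


class RollingGeoipStats:
    """Hypothetical helper: per-country request counts over a rolling window."""

    def __init__(self):
        # Counts for the period that is still being measured.
        self.current = Counter()
        # Completed periods; the deque silently drops the oldest block
        # once it holds PERIODS_PER_WINDOW of them.
        self.completed = deque(maxlen=PERIODS_PER_WINDOW)

    def note_request(self, country):
        """Record one request from the given country code."""
        self.current[country] += 1

    def end_period(self):
        """Close the current period (assumed to be called every 8 hours).

        Returns None until a full 24 hours of data exists, and after that
        the aggregate over the last three completed periods.
        """
        self.completed.append(self.current)
        self.current = Counter()
        if len(self.completed) < PERIODS_PER_WINDOW:
            return None  # constraint (A): not enough data yet
        total = Counter()
        for block in self.completed:
            total.update(block)  # Counter.update adds counts together
        return dict(total)
```

Because old data only ever leaves the window one whole 8-hour block at a time, two consecutive outputs differ by a full block rather than by a few minutes of requests, which is the point of (B).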

Did I get it right?

If so, this approach sounds like what we should be doing both for the
geoip aggregate stats you write to the file and analyze, and also for
the aggregate stats that bridges collect.

> Also, stats should be appended to the geoip-stats
> file rather than replacing that file.

Also sounds good.

> My next thought is whether or not we want to make the period length
> configurable. From earlier measurements I found that the period length
> of 8 hours (as defined in REQUEST_HIST_PERIOD) works fine. Also,
> configurable period lengths might complicate analysis, too. If we want
> to make the period length configurable, we should define a lower limit
> of, say, 2 hours. Otherwise, people could compare subsequent
> observations to learn more details about requests. Possible values for
> period lengths would then be 2, 3, 4, 6, 8, 12, or 24 hours. But again,
> do we need to make this configurable? And if so, should we use the
> config option DirRecordUsageSaveInterval instead of REQUEST_HIST_PERIOD?

Don't make it configurable until you actually need it to be. It sounds
like you don't need it to be.

> The next step would be to add the geoip-stats lines (or a subset of
> them) to extra-info documents. I think a proposal for that would be in
> place, but first I think it's fine to start with measuring on a few
> nodes and working with files.

You mean for bridges, or for normal relays? I don't think we're at the
point yet where we want normal relays to turn on the ENABLE_GEOIP_STATS
compile-time flag. I want to be much more sure that we've got all the
details right first. (Like, have run it for several months at a test
relay and have a good handle on exactly what it's getting.)

Thanks!
--Roger