Re: [tor-dev] Searchable metrics archive - Onionoo-like API available online for probing

On Mon, Sep 2, 2013 at 2:20 PM, Karsten Loesing <karsten@xxxxxxxxxxxxxx> wrote:

On 8/23/13 3:12 PM, Kostas Jakeliunas wrote:
> [snip]

Hi Kostas,

I finally managed to test your service and take a look at the
specification document.

Hey Karsten!

Awesome, thanks a bunch!

The few tests I tried ran pretty fast! I didn't hammer the service, so
maybe there are still bottlenecks that I didn't find. But AFAICS, you
did a great job there!

Thanks for doing some poking! There is probably room for quite a bit more concurrent-load benchmarking, but in principle (and from what I've observed / benchmarked so far), if a single query runs in good time, it's fairly safe to assume that running multiple queries at the same time will not be a big problem. There is always a limit, of course, which I haven't hit yet (and which I would do well to find, ideally). This is one of the strengths of PostgreSQL in any case: it scales very nicely across concurrent queries. That said, since the queries are more or less always disk I/O-bound, there could still be hidden bottlenecks, that is very true.

Thanks for writing down the specification.

So, would it be accurate to say that you're mostly not touching summary,
status, bandwidth, and weights resources, but that you're adding a new
fifth resource statuses?

In other words, does the attached diagram visualize what you're going to
add to Onionoo? Some explanations:

- summary and details documents contain only the last known information
about a relay or bridge, but those are on a pretty high detail level (at
least for details documents). In contrast to the current Onionoo, your
service returns summary and details documents for relays that didn't run
in the last week, so basically since 2007. However, you're not going to
provide summary or details for arbitrary points in time, right? (Which
is okay, I'm just asking if I understood this correctly.)

(Nice diagram, useful!) Responding to particular points / nuances:

summary and details documents contain only the last known information
about a relay or bridge, but those are on a pretty high detail level (at
least for details documents)

This is true: the summary/details documents (just like in Onionoo proper) deal with the *last* known info about relays. That is how it works now, anyway.

As per our subsequent IRC chat, we will now assume this is how it is intended to be. The way I see it from the perspective of my original project goals etc., the summary and details (+ bandwidth and weights) documents are meant for Onionoo {near-, full-}compatibility; they must stay Onionoo-like. The new network status document is the "olden archive browse and info extract" part: it is one of the ways of exposing an interface to the whole database (after all, we do store all the flags and nicknames and IP addresses for *all* the network statuses.)

However, you're not going to
provide summary or details for arbitrary points in time, right? (Which
is okay, I'm just asking if I understood this correctly.)

There is no reason why this wouldn't be possible. (I experimented with new search parameters, but haven't pushed them to master / changed the backend instance that is currently running.)

A query involving date ranges could, for example, be something akin to,

"get a listing of details documents for relays which match this $nickname / $address / $fingerprint, and which have run (been listed in consensuses dated) from $startDate to $endDate." (would use new ?from=.., ?to=.. parameters, which you've mentioned / clarified earlier.)

As per our IRC chat, I will add these parameters / query options not only to the network status document, but also to the summary and details documents.
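A minimal sketch of how a client might build such a date-range query. The base URL, the `search` parameter name, and the `YYYY-MM-DD` date format are assumptions made for illustration; only the `from` and `to` parameter names come from the discussion above.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the real service address is not given in this thread.
BASE_URL = "https://example.invalid/details"


def build_details_query(search, start_date=None, end_date=None):
    """Build a details query, optionally limited to a consensus date range.

    `search` stands for a nickname, address or fingerprint (assumed
    parameter name). Dates are passed through as strings unchanged.
    """
    params = {"search": search}
    if start_date is not None:
        params["from"] = start_date
    if end_date is not None:
        params["to"] = end_date
    return BASE_URL + "?" + urlencode(params)


url = build_details_query("ExampleRelay", "2013-01-01", "2013-06-30")
print(url)
```

Leaving out both dates yields a plain search query, which would keep the "last known info" behaviour as the default.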

- bandwidth and weights documents always contain information covering
the whole lifetime of a relay or bridge, where recent events have higher
detail level. Again, you're not going to change anything here besides
providing these documents for relays and bridges that are offline for
more than a week.

- statuses have the same level of detail for any time in the past.
These documents are new. They're designed for the relay search service
and for a simplified version of ExoneraTor (which doesn't care about
exit policies and doesn't provide original descriptor contents). There
are no statuses documents for bridges, right?

Yes & yes. No documents for bridges, for now. I'm not sure of the priority of the task of including bridges - it would sure be awesome to have bridges as well. For now, I assume that everything else should be finished (the protocol, the final scalable database schema/setup, etc.) before embarking on this point.

The status entry API point is indeed about getting info from the whole archives, at the same detail level for any portion of the archives.

(I should have articulated this / put into a design doc before, but this important nuance is still fresh in my mind. It seems that now it's all finally coming into place (including my mind.))

[The new network status documents are] designed for the relay search service
and for a simplified version of ExoneraTor (which doesn't care about
exit policies and doesn't provide original descriptor contents).

By the way, just as a general note, it is always possible to reconstruct any descriptor, and any network status entry, in principle. I point this out because, for one, I recall Damian mentioning that it would be nice if the torsearch system could be used as part of other apps - it would be able to reconstruct original Stem instances/objects for any descriptor / network status entry in question. (The focus for now, though, is Onionoo and database, of course.)

If this is correct (and please tell me if it's not), this seems like a
plausible extension of Onionoo.

Thanks for taking a close look at the protocol description and thanks for the feedback, everything is correct as far as I can see!

A few ideas on statuses documents: how about you change the format of
statuses, so that there's no more one document per relay and valid-after
time, but exactly one document per relay? That document could then
contain an array of status objects saying when the relay was contained
in the network status, together with information about its addresses.

This makes a lot of sense (I've been juggling these ideas as well, but at the end of the day, I'm not sure. So I will do this instead.)

The nickname for a given relay (identified by a fingerprint) can change through time as well. So the status object would ideally include the date of containment in network status / consensus, addresses, and nickname. (This is where a listing of flags would go in as well, I suppose.) I think that would make sense?

Since we know that there will only be one relay document, its fields could be made top-level, so not {relays: [ {"fingerprint" : "$fingerprint", ..., "entries": [ { ... }, { ... }, ... ]} ]} but rather (hopefully the indentation is not garbled),

{
  "fingerprint": "$fingerprint",
  ... # first_seen, last_seen, for example
  "entries": [
    { ... },
    ...
  ]
}

It might be useful to group consecutive valid-after times when all
addresses and other relevant information about a relay stayed the same.
So, rather than adding "valid_after", put in "valid_after_from" and
"valid_after_to".

Yes, thought about this as well! This would be ideal. It would indeed I think require that we

[...] could even generate these statuses documents in advance once
per hour and store them as JSON documents in the database, similar to
what's the plan for the other document types? That might reduce
database load a lot, though you'll still need most of your database foo
for the search part.

Some kind of caching at some level would be needed for sure, inevitably. Preprocessing/preparing JSON documents (the way Onionoo does it, I suppose) makes sense.

I'm not sure about the scale, however. Ideally torsearch would be able to keep track of which JSON documents are outdated and need regenerating. Again, there are already around 170K unique fingerprints in the current online database.

I'll think about this. Lots of things can be done at the postgres level (you're probably thinking about this as well.)

Also:

If it was OK (it would be a bit queer maybe) to involve result pagination at this level as well, the API could be told to, say,

"group the last `min(limit, UPPER_LIMIT)` [e.g. 500] status entries for this fingerprint into a status object / valid-after range summary." => produce status entry objects, each featuring addresses, nickname, valid_after_from, and valid_after_to.

As a rule of thumb, the count of status objects returned would be (much) less than (say) 500, of course. A client would then append the parameters ?offset=500[&limit=500] (or whatnot) to get a status entry summary (a summary in the sense that it does not reduce the amount of actual useful information returned) for the next 500 network statuses of this relay.

It would be great if this kind of protocol querying approach made sense. But if it's a bit strange / suboptimal (from the perspective of a client querying the DB), let me know.
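To make the grouping-plus-pagination idea concrete, here is a rough sketch in Python. It assumes the status entries for one fingerprint arrive newest first, that consensuses are one hour apart, and that an entry is a dict with `valid_after`, `nickname` and `addresses`; the function name and `UPPER_LIMIT` value are illustrative only, not what torsearch does.

```python
from datetime import datetime, timedelta

UPPER_LIMIT = 500                        # assumed server-side cap ("e.g. 500")
CONSENSUS_INTERVAL = timedelta(hours=1)  # assumed spacing between consensuses


def group_status_entries(entries, offset=0, limit=UPPER_LIMIT):
    """Collapse one page of status entries (newest first) into ranges.

    An entry extends the current range only if it is exactly one
    consensus interval older and has the same nickname and addresses.
    """
    page = entries[offset:offset + min(limit, UPPER_LIMIT)]
    ranges = []
    for entry in page:
        last = ranges[-1] if ranges else None
        if (last is not None
                and last["nickname"] == entry["nickname"]
                and last["addresses"] == entry["addresses"]
                and last["valid_after_from"] - entry["valid_after"]
                == CONSENSUS_INTERVAL):
            # Same relay info, adjacent consensus: stretch the range back.
            last["valid_after_from"] = entry["valid_after"]
        else:
            ranges.append({
                "nickname": entry["nickname"],
                "addresses": list(entry["addresses"]),
                "valid_after_from": entry["valid_after"],
                "valid_after_to": entry["valid_after"],
            })
    return ranges


# Made-up sample data: three hourly entries, then an older address.
sample = [
    {"valid_after": datetime(2013, 9, 1, 3), "nickname": "A", "addresses": ["192.0.2.1"]},
    {"valid_after": datetime(2013, 9, 1, 2), "nickname": "A", "addresses": ["192.0.2.1"]},
    {"valid_after": datetime(2013, 9, 1, 1), "nickname": "A", "addresses": ["192.0.2.1"]},
    {"valid_after": datetime(2013, 9, 1, 0), "nickname": "A", "addresses": ["198.51.100.7"]},
]
grouped = group_status_entries(sample)
print(len(grouped))
```

One caveat with paging over raw entries: a range that straddles a page boundary comes back split in two, so a client would need to merge the last range of one page with the first range of the next.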

And maybe you can compress information even more by
putting all relevant IP addresses in a list and refer to them by list
index. Compare this to bandwidth and weights documents which are
optimized for size, too.

Yeah, this would be great, actually. I'll think about all these & practical caching / JSON document generation options. I'm unsure of feasibility (it's definitely doable in the end, but not sure of scope), but I hope to be able to accomplish all this. Might follow up later on / tomorrow, etc.
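For illustration, the address-list idea could look roughly like the sketch below. The field names `address_indexes` and `entries`, and the shape of the input ranges, are assumptions for this example and not part of any agreed document format.

```python
def compress_addresses(status_ranges):
    """Move addresses into one shared list and refer to them by index."""
    address_list = []   # each distinct address, in first-seen order
    index_of = {}       # address -> position in address_list
    compact = []
    for rng in status_ranges:
        indexes = []
        for addr in rng["addresses"]:
            if addr not in index_of:
                index_of[addr] = len(address_list)
                address_list.append(addr)
            indexes.append(index_of[addr])
        # Copy every other field unchanged, swap addresses for indexes.
        item = {k: v for k, v in rng.items() if k != "addresses"}
        item["address_indexes"] = indexes
        compact.append(item)
    return {"addresses": address_list, "entries": compact}


# Made-up sample ranges for one relay.
doc = compress_addresses([
    {"nickname": "A", "addresses": ["192.0.2.1"]},
    {"nickname": "A", "addresses": ["198.51.100.7"]},
    {"nickname": "B", "addresses": ["192.0.2.1"]},
])
print(doc["addresses"])
```

The saving only shows up for relays that return to the same addresses repeatedly; for a relay with a single address it adds a little overhead.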

Happy to chat more about these ideas on IRC.

> Please report any inconsistencies / errors / time-outs / anything that
> takes a few seconds or more to execute. I'm logging the queries (together
> with IP addresses for now - for shame!), so will be able to later correlate
> activity with database load, which will hopefully provide some realistic
> semi-benchmark-like data.

I could imagine that you'll get more testers if you provide instructions
for using your service as relay search or ExoneraTor replacement. Maybe
you could write down the five most common searches that people could
perform to search for a relay or find out whether an IP address was a
Tor relay at a given time? If you want, I can link to such a page from
the relay search and the ExoneraTor page.

Indeed, I was thinking lately that it should be made more explicit that, for example, this present system already encompasses ExoneraTor use cases, and so on. I was planning to eventually write up something of the kind (with lots of examples and clearly articulated use cases, etc.) of course, but maybe I should do this sooner. OK.

I also already have a way of constantly updating the database (using cron -> rsync & torsearch import), but it's a bit of a hack, still. Hopefully soon I will ramp up the DB to actually have the latest consensuses in Reality(tm).

Once I have the latter running nicely,

> If you want, I can link to such a page from
> the relay search and the ExoneraTor page.

we can think of doing this!

All in all, great work! Nice!

Thanks,
Karsten

Thanks for your as always great feedback, Karsten :)

Kostas.