
[tor-dev] [GSoC '13] Status report - Searchable metrics archive



Hi all,

TL;DR of TL;DR: all good as far as GSoC scope goes; work will continue; the live torsearch backend has been updated; will update API doc today.

TL;DR:

 - updated database / imported data up until current date

 - hourly cronjob continuously rsyncs the latest archival data and imports it into the DB (tested and deployed):
  * had some trouble ensuring no duplication while also avoiding excessive DB querying (the intermediary 'fingerprint' table, which basically solves the DB bottleneck problem, was previously updated semi-manually); solved with a kind of hybrid 'upsert'; works well;
  * made switching from the archival data import process to the rsync'ed ::metrics-recent import seamless (data can overlap; the importer will take care of it; import speed is good)

 - Onionoo API improvements:
  * date range queries for all three document types currently provided
  * the network status entry document was expanded/improved, as per previous discussion and Karsten's feedback
  * yet to finish off IP address and nickname summarization - for now, providing two document versions - original and 'condensed' (valid-after ranges)
  * updating the Onionoo API doc soon (sorry - had meant to update it by now.)

 - caching:
  * tried some things; settled on a semi-vanilla Python-level caching solution for the time being; some testing was done
  * still have some snags to work out, so not deploying for now
  * more (pure JSON response dumping to disk, etc.) can be done later on

 - other things:
  * wrote a working stub / minimal version of a DB-maintenance-from-within-torsearch component (e.g. for a full VACUUM when there is no direct DB server access)
  * other small things I may have forgotten to mention


General picture: the metrics archive / torsearch is in decent shape and at an appropriate (for GSoC) stage of development. Development will continue, as we're not done here yet, but the archive is in a functioning state, delivering the minimum functionality required. I will finish cleaning up before the hard pencils-down date, and then continue development (at whatever pace). :) The things that were blocking before are no longer blocking, so, ironically, I think I'll now be able to write more code for torsearch than over the summer. Anyway -

I've been working on things within the GSoC scope, finishing the most urgent matters left for the GSoC milestone. Quoting from my previous status report re: what to do next (see the bullet point list above for a concise, no-ramble-mode version of all this):

> update the database to the latest consensuses and descriptors

Done. However -

> import 2008 year data, etc.

I had parts of the 2009 data in previously while working on DB bottlenecks, but at present the statuses still start at 2010; this got pushed down the priority lane, and the import takes a lot of (passive) time. Will be able to batch-import now and confirm.

> turn on the cronjob for rsync and import of newest archives

Done. [0, 1] What kept me a bit busy was integrating the third, mediating 'fingerprint' table (we query it before extracting things from the massive 'statusentry' table, which contains all network status entries) into the batch_import_consensuses process used for import.

<details>
I had some problems making sure the data in the fingerprint table is not duplicated and contains the latest info (last valid-after, etc.), while minimizing the amount of work and the number of queries needed to test for row/entry existence. Solved with an actually simple hybrid 'upsert' approach [2]; all is well (did some testing - OK, but not extensive.)
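
The gist of the approach, as a rough sketch (table and column names here are illustrative, not necessarily the actual torsearch schema; PostgreSQL of this era has no native upsert, hence update-then-insert):

    # Hybrid 'upsert' sketch for the fingerprint table, using a psycopg2
    # cursor. Names (fingerprint, fp, last_va) are made up for illustration.
    def upsert_fingerprint(cur, fp, nickname, valid_after):
        # Try to bump an existing row first; only move last_va forward.
        cur.execute("""
            UPDATE fingerprint
               SET nickname = %s, last_va = %s
             WHERE fp = %s AND last_va < %s
        """, (nickname, valid_after, fp, valid_after))
        if cur.rowcount == 0:
            # Nothing updated: either the fingerprint is new, or the
            # incoming entry is older than what we already store. The
            # NOT EXISTS guard prevents duplicates in the latter case.
            cur.execute("""
                INSERT INTO fingerprint (fp, nickname, last_va)
                SELECT %s, %s, %s
                 WHERE NOT EXISTS
                       (SELECT 1 FROM fingerprint WHERE fp = %s)
            """, (fp, nickname, valid_after, fp))

(Not fully race-free on its own; fine for a single importer process.)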

Improved the archival importer so it can process duplicate consensus documents coming from different directories (these bypass Stem's persistence-file check, so we need to take care of duplicate processing ourselves). This can happen in production when, after massively importing data from the downloaded archives, we switch to rsync's ::metrics-recent/relay-descriptors folder for import; consensuses may overlap (they overlapped when I did the switch.) When Stem passes down a consensus document (i.e. after its persistence-file duplication check), we simply check whether we already store this consensus, using a separate 'consensus' table which only stores consensus document validity ranges (this is fast and works nicely). [2]
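
In pseudo-real code, the check is no more than (again, names are illustrative):

    # Skip consensuses we already store; the small 'consensus' table only
    # holds one row per consensus document (its valid-after), so this
    # lookup is cheap compared to touching 'statusentry'.
    def consensus_already_imported(cur, valid_after):
        cur.execute("SELECT 1 FROM consensus WHERE valid_after = %s",
                    (valid_after,))
        return cur.fetchone() is not None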
</details>

> [from an older report] Onionoo date range queries

Done. [4] Works for all three document types (details, summary, statuses); can be used together with offset+limit, etc.
Datetime parsing is fairly smart/lenient - you can pass '2011-07-28 12:00:00', or '2011-07', or '2011', etc.

Example: http://ts.mkj.lt:5555/details?search=200.&from=2010&to=2012
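
The lenient parsing is essentially a matter of trying progressively coarser datetime formats; something along these lines (a sketch, not the verbatim torsearch code):

    from datetime import datetime

    # Try the most specific format first, fall back to coarser ones.
    FORMATS = ('%Y-%m-%d %H:%M:%S', '%Y-%m-%d %H:%M', '%Y-%m-%d',
               '%Y-%m', '%Y')

    def parse_when(value):
        for fmt in FORMATS:
            try:
                return datetime.strptime(value.strip(), fmt)
            except ValueError:
                continue
        return None  # caller decides what to do with unparsable input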

> caching

Tried all kinds of options, and since I wanted to have something decent and stable working for now, opted for a very simplistic, stock Python-level caching approach (no Onionoo-like 'JSON documents to disk' thing yet.) But I've yet to work out some things, so I delayed deploying the code and updating the live version.
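
For the curious, 'simplistic Python-level caching' means roughly the following kind of thing (a sketch only - not the exact code, and deliberately naive: no eviction beyond TTL expiry, not thread-safe as written):

    import time
    import functools

    def ttl_cache(seconds=300):
        """Cache a function's results in-process for `seconds`."""
        def decorator(func):
            store = {}
            @functools.wraps(func)
            def wrapper(*args):
                now = time.time()
                hit = store.get(args)
                if hit is not None and now - hit[0] < seconds:
                    return hit[1]  # fresh enough: serve from cache
                result = func(*args)
                store[args] = (now, result)
                return result
            return wrapper
        return decorator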

> hopefully later integration with the bandwidth and weights documents in Onionoo proper

Out of current scope; hopefully will be able to work on this next / soon.

> [from an older report] Documentation, specification/implementation document

Not finished yet, which is not great. I should try more of this "publish early and often" thing.

> expand the list of fields contained in the three [...] documents

Tried some things, but e.g. providing assigned-flag data was pushed to later. The current live backend is unchanged in this regard.

> [from an older report] rewrite/improve the network status entry document / improve the 'statuses' API point

Done. [7] (Deployed the changes; will update the API doc. Deploying while the doc is not yet updated is not a nice thing to do, but since this is not production yet, I deployed early.)

For now, we are providing two network status document versions - the original one (+ 'relays_published'), and a condensed one (?condensed=true).
The latter basically zips up all the valid-after values into ranges, as per Karsten's suggestion, telling the client where any gaps in consensuses were present (which may turn out to be a rather useful thing, by the way.) This works well together with from..to date ranges, offset, and limit.
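
Since consensuses are published hourly, condensing amounts to folding a sorted list of valid-after timestamps into (first, last) ranges, where anything other than an exactly-one-hour step starts a new range (a sketch; the deployed code may differ in details):

    from datetime import timedelta

    HOUR = timedelta(hours=1)

    def condense(valid_afters):
        # valid_afters: sorted, de-duplicated list of datetimes.
        ranges = []
        for va in valid_afters:
            if ranges and va - ranges[-1][1] == HOUR:
                ranges[-1][1] = va           # contiguous: extend range
            else:
                ranges.append([va, va])      # gap (or first): new range
        return [tuple(r) for r in ranges]

Every boundary between two returned ranges is, by construction, a gap in the consensuses.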

I've yet to finish off IP address and nickname summarization - for now, when in 'condensed' presentation mode, each range contains its last addresses and nickname.

Example: http://ts.mkj.lt:5555/statuses?lookup=F397038ADC51336135E7B80BD99CA3844360292B&condensed=true

Also, as torsearch might one day be deployed on a machine where the maintainer won't have direct access to the DB server, we'll need to do VACUUM FULL et al. from within torsearch. It is advisable to do a full VACUUM or even a REINDEX now and then, especially after a ton of entries is added (say, after a massive bulk batch-import.) Wrote a working stub for a 'torsearch/DB maintenance' component (for VACUUM FULL, for now.) [8] Other maintenance things may be added to this separate file/component later on.
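
The one non-obvious bit is that VACUUM cannot run inside a transaction block, so the connection has to be switched to autocommit first. Roughly (a sketch with placeholder dsn/table arguments; the table name must come from a trusted whitelist, never from user input, since identifiers can't be passed as query parameters):

    import psycopg2

    def vacuum_full(dsn, table=None):
        conn = psycopg2.connect(dsn)
        try:
            conn.autocommit = True   # VACUUM can't run in a transaction
            cur = conn.cursor()
            # table is interpolated as an identifier; validate it against
            # a whitelist beforehand.
            cur.execute('VACUUM FULL %s' % (table or ''))
            cur.close()
        finally:
            conn.close()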

I think that's all for now. Hopefully I'm past the 'experiment with all the things!' stage (I've still been trying and tinkering with stuff / different approaches to problems; but for what it's worth, the current codebase and the live backend are in decent shape.)

Comments / suggestions / barbaric yawps?

[0]: https://github.com/wfn/torsearch/commit/b9a31fdabc52bca1baa00064359c6f691aff58e1
[1]: see e.g. http://ts.mkj.lt:5555/details
[2]: https://github.com/wfn/torsearch/commit/74cb2b19e1a5b47fa95b62df8d89e0c9fb2c7e6c
# there is no 3. It's.. gone!
[4]: https://github.com/wfn/torsearch/commit/5f862930b81ad34e04d6315c6c57bfd3cb499e10
# more numbers have mysteriously disappeared! [API doc commit would go here]
[7]: https://github.com/wfn/torsearch/commit/fa2183bb9fb58bff5bc1469fa947367ecd0e3268