Hello!
An update on my searchable Tor metrics archive project. As is probably evident, I'm very open to naming suggestions. :)
To the best of my understanding, I have solved the database bottlenecks - or at least I am, as of now, satisfied with the output of my benchmarking utility. Things may change, but I am confident (and have the numbers to argue) that the whole thing runs well, at least on Amazon m2.2xlarge instances.
For fun and profit, part of the database is now available online (not on EC2, though) - namely, the part currently used by the Onionoo-like API; for now it contains status entries only, in the range [2010-01-01 00:00:00, 2013-05-31 23:00:00]. I will write a separate email so that everyone can inspect it.
I should now move on to implementing/extending the Onionoo API: in particular, working on date range queries, and refining/rewriting the "list status entries" API point (see below). I need to carefully plan some things, and always keep an updated API document. (I also need to update and publish a separate, more detailed specification document.)
More concrete report points:
- re-examined my benchmarking approach and wrote a rather simple but effective set of benchmarking tools (more like a simple script) [1] that can hopefully be used outside this project as well; at the very least, together with the profiling and query_info tools, it is powerful (yet simple) enough to test all kinds of bottlenecks in ORMs and elsewhere. The gist of it is sketched below.
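To give an idea, here is a minimal sketch of the kind of timing harness the script implements (the names here are illustrative, not the actual API in [1]):

    import time
    from contextlib import contextmanager

    @contextmanager
    def benchmark(label):
        """Time a block of statements; print elapsed wall-clock time."""
        start = time.time()
        yield
        print('%s: %.3f s' % (label, time.time() - start))

    # Hypothetical usage, timing a raw query against some engine:
    # with benchmark('distinct fingerprints'):
    #     engine.execute('SELECT DISTINCT fingerprint FROM statusentry')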
- used this tool to generate benchmark reports on EC2 and on the (less powerful) dev server, and with different schema settings (usually rather minor schema changes that do not require re-importing all the data)
- came up with a three-table schema that turns out to answer our queries quickly: we first run the search (using whatever criteria are given, e.g. nickname, fingerprint, address, running) on a table that has a column of unique fingerprints; extract the relevant fingerprints; JOIN with the main status entry table, which is much larger; and get the final results. Benchmarked using this schema.
Details: if we are only extracting a list of the latest status entries (with DISTINCT ON fingerprint), we can apply LIMITs and OFFSETs already on the fingerprint table, before the JOIN, which helps us quite a bit. On the other hand, nickname searches etc. are also efficient. As of now, I have re-enabled nickname+address+fingerprint substring search (not from the middle (LIKE '%substring%'), but from the beginning of a substring (LIKE 'substring%'), which is still nice), and all is well. I have updated the higher-level ORM to reflect the new tables [2] (I've yet to change some column names, but that is cosmetic.)
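Roughly, the resulting two-step query looks like this (a minimal SQLAlchemy sketch; the table and column definitions are illustrative stand-ins, not the actual ones in models.py [2]):

    from sqlalchemy import (MetaData, Table, Column, String, Boolean,
                            DateTime, select)

    meta = MetaData()
    # Illustrative stand-ins for the real tables in models.py:
    fingerprints = Table('fingerprint', meta,
        Column('fingerprint', String(40), primary_key=True),
        Column('nickname', String(19)))
    statusentries = Table('statusentry', meta,
        Column('fingerprint', String(40)),
        Column('nickname', String(19)),
        Column('running', Boolean),
        Column('validafter', DateTime))

    # 1. Search + LIMIT/OFFSET on the small unique-fingerprint table...
    fps = (select([fingerprints.c.fingerprint])
           .where(fingerprints.c.nickname.like('moria%'))
           .limit(50)
           .alias('fps'))

    # 2. ...and only then JOIN against the much larger status entry table.
    stmt = select([statusentries]).select_from(
        statusentries.join(fps,
            statusentries.c.fingerprint == fps.c.fingerprint))
    print(stmt)  # inspect the SQL that will be emitted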
- found a way to generate the SQL queries I need using the higher-level SQLAlchemy SQL expression API and its primitives, while always observing the resulting query statements. This is good because everything becomes more modular: it is much easier to shape the query depending on the parameters received, etc. (while still keeping the clauses in a sane order); see the sketch below.
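For instance, shaping the query from optional parameters becomes as simple as this (parameter names are hypothetical; the tables are the stand-ins from the previous sketch):

    # Each clause is appended only when the relevant parameter is given.
    def build_query(search=None, running=None, limit=50, offset=0):
        q = select([statusentries])
        if search:  # prefix search, as described above
            q = q.where(statusentries.c.nickname.like(search + '%'))
        if running is not None:
            q = q.where(statusentries.c.running == running)
        q = q.order_by(statusentries.c.validafter.desc())
        return q.limit(limit).offset(offset)

    print(build_query(search='moria', running=True))  # observe the SQL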
- hence (re)wrote a part of the Onionoo-like API that uses the new schema and the SQLAlchemy primitives. Extended the API a bit. [3]
- wrote a very hacky API point for getting a list of status entries for a given fingerprint. I simply wanted a way (for myself and others) to query this kind of relation easily and externally. It now works as part of the API; a rough sketch of its shape follows below. This part will probably need some discussion.
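Very roughly, it has this shape (assuming a Flask-style app; the route, DSN, and field names here are placeholders rather than the actual code in [3]):

    from flask import Flask, jsonify
    from sqlalchemy import create_engine, text

    app = Flask(__name__)
    engine = create_engine('postgresql:///tordir')  # placeholder DSN

    @app.route('/statusentries/<fingerprint>')      # placeholder route
    def status_entries(fingerprint):
        rows = engine.execute(
            text('SELECT validafter, nickname FROM statusentry '
                 'WHERE fingerprint = :fp '
                 'ORDER BY validafter DESC LIMIT 500'), fp=fingerprint)
        return jsonify(entries=[dict(row) for row in rows])

    if __name__ == '__main__':
        app.run()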
- wrote a (still stub-like) document explaining the current Onionoo-like API: what can be queried, what can be returned, and what kinds of parameters work. [4] Will extend this later on.
- while writing the doc and rewriting part of the API, I stumbled upon a few things which make it clear that I've taken some shortcuts that may hurt later on. I'll be happy to elaborate on them later / separately. I need to carefully plan a few things, and then try rewriting the Onionoo API yet again, this time including more parameters and returned fields.
TL;DR yay, a working database backend!
I might send *one* more update soon with things I may have forgotten re: this report - I don't want to make a habit of delaying reports (which I have consistently done so far), so I'm reporting what I have now.
[1]: https://github.com/wfn/torsearch/blob/master/torsearch/benchmark.py
[2]: https://github.com/wfn/torsearch/blob/master/torsearch/models.py
[3]: https://github.com/wfn/torsearch/blob/master/torsearch/onionoo_api.py
[4]: https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md