[tor-dev] Onionoo protocol/implementation nuances / Onionoo-connected metrics project stuff
Hi Karsten,
(I'm not sure whom to CC and whom not to; I have a couple of fairly specific technical questions / comments, which, again, I should have delved into earlier - but then again, maybe the scope of the tor-dev mailing list includes such cases..)
@tor-dev: This is in regard to the Searchable metrics archive project, intersected with Onionoo stuff.
I originally wanted to ask two questions, but by the time I reached the middle of the email, I was no longer sure they were questions, so I think these are simply my comments / observations about a couple of specific points, and I just want to make sure I'm making general sense. :)
..So it turns out that it is very much possible to avoid Postgres sequential scans for the metrics search backend/database - that is, to construct queries/cases which are best executed using indexes (the query planner thinks so, and it makes sense as far as I can tell) and whatever efficient (hm, sub-O(n)?) search algorithms are deemed best; the two points that seem potentially problematic to me are:
 - when doing ORDER BY, making sure that the resulting SELECT covers / would potentially return a relatively small number of rows. From what I've been reading / trying out, whenever a SELECT may cover more than roughly 10-15% of a table's rows, a sequential scan is preferred, since using indexes would result in even more disk I/O, whereas the sequential scan can read bigger chunks of data from disk because it is, well, sequential. This means that doing "ORDER BY nickname" (or fingerprint, or any other non-unique column whose matching rows cover a relatively small part of the network status table) is much faster than doing e.g. "ORDER BY validafter" (where validafter refers to the consensus document's "valid-after" field) - which makes sense, of course, when you think about it (ordering a massive number of rows, even with a proper LIMIT etc., is insane). We construct separate indexes (e.g. on 'fingerprint', even though it is already part of the composite (validafter + fingerprint) primary key), and all seems to be well; a rough sketch of what I mean follows below.
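Roughly what I mean, as a sketch only (index names are invented, and the exact index set isn't settled yet):

-- separate single-column indexes, alongside the composite primary key:
create index statusentry_nickname_idx on statusentry (nickname);
create index statusentry_fingerprint_idx on statusentry (fingerprint);

-- checking what the planner actually picks for a restrictive filter:
explain analyze
select fingerprint, validafter, nickname
from statusentry
where nickname = 'moria1'
order by validafter desc
limit 100;

The expectation, per the 10-15% rule of thumb above, is an index scan on the nickname index plus a sort of the few matching rows; make the filter match a large share of the table (or drop it) and the plan should flip back to a sequential scan.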
I've been looking into the Onionoo implementation, particularly into ResourceServlet.java [1], to see how ordering etc. is done there. I'm new to that code, but am I right in saying that as of now, the results (at least for /summary and /details) are generally fairly unsorted (except for the possibility of "order by consensus_weight"), with "valid_after" fields appearing in an unordered manner? This of course makes sense in this case, as Onionoo is able to return *all* the results for given search criteria (or, if none given, all available results) at once.
The obvious problem with the archival metrics search project (argh, we are in need of a decent name, I daresay :) maybe I'll think of something.. no matter) is then, of course, the fact that we can't return all results at once. So far I've been assuming that it would therefore make sense to return them, whenever possible (and unless requested otherwise, if an "order" parameter is later implemented), in "valid_after" descending order. I suppose this makes sense? This would be ideal, methinks. So far, it seems that we can do that, as long as we have a WHERE clause that is restrictive enough.
select * from (
    select distinct on (fingerprint) fingerprint, validafter, nickname
    from statusentry
    where nickname like 'moria1'
    order by fingerprint, validafter desc
) as subq
order by validafter desc
limit 100;
This, for example, works out nicely in terms of efficiency / query plan, Postgres tells me. (The double ORDER BY is needed: Postgres requires the DISTINCT ON (fingerprint) expression to match the leftmost ORDER BY expression at that query level, so the re-ordering by validafter has to happen in the outer query.)
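A side note on what backs that plan: I assume the relevant index is a composite one on (fingerprint, validafter DESC) - something along these lines (index name invented); prefixing the query above with EXPLAIN ANALYZE is how I've been sanity-checking that the planner actually uses it:

-- composite index matching the inner "order by fingerprint, validafter desc":
create index statusentry_fpr_validafter_idx
    on statusentry (fingerprint, validafter desc);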
Now, you mentioned / we talked about what the (large, archival metrics) backend is to do about limiting/cutoff - especially if no search criteria are specified, for example. Ideally, it would return a top-most list of statuses (which in /details include info from server descriptors), sorted by "last seen" / "valid after"? Thing is, querying a database for 100 (or 1000) items with no ORDER BY is really cheap; introducing ORDER BYs which would still produce tons of results is considerably less so. I'm now looking into this (and you did tell me as much, i.e. that a large part of DB/backend robustness gets tested at exactly these points; this should have been more obvious to me).
But in any case, the question remains: what *should* the backend return when no search parameters/filters are specified, or only very loose ones (nickname LIKE "a") are given?
What would be cheap / doable: ORDER BY fingerprint (or digest (== hashed descriptor), etc.). It *might* (well, should) work to order by fingerprint, limit the results, and *then* reorder them by validafter - with no guarantee that the topmost results would be the ones with the highest absolute validafter values. Onionoo does this kind of reordering / limiting itself, but there it makes sense, as it can handle all the data at once (or maybe I'm missing something, i.e. the file I linked to may already interact with only a subset of the data; granted, I haven't read it thoroughly). In our case, it makes sense to try and outsource all relational logic to the ORM (though of course, if it turns out we can cheaply get 100 arbitrary results and more easily juggle them ourselves on the (direct) Python side, then sure).
But would such arbitrarily chosen results make sense? They would look just like Onionoo results, only a (small) subset of them.
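For concreteness, the "order by fingerprint, limit, then reorder" idea would look roughly like this (a sketch; as said, no guarantee that the top rows carry the globally newest validafter values):

select * from (
    -- an arbitrary-but-cheap slice: the first 100 rows in fingerprint order
    select fingerprint, validafter, nickname
    from statusentry
    order by fingerprint
    limit 100
) as subq
-- then reorder just those 100 rows by validafter for presentation
order by validafter desc;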
Ramble #2: Onionoo is doing more or less direct (well, compiled, so efficient) regexps on fingerprint, nickname, etc. By default, "LIKE %given_nickname_part%" is again (sequentially) expensive; Postgres does offer full text search extensions (they would need additional tables to be built, etc.), and I think doing this makes sense; it would cover all our "can supply a subset of :param" bases. I'll look into this. (It's also possible to construct functional indexes, e.g. regexp-based ones, but then the pattern has to be known in advance - I wonder if reusing Onionoo's regular expressions would work / make sense - will see.) For now, LIKE/= exact_string is very nice, but of course the whole idea is that it'd be possible to supply substrings. Of note is the fact that, in this case, LIKE %substring% is O(n) in the sense that query time correlates with row count, afaict. As of now, full text search extensions would solve the problem, I think, even if they may look a bit like overkill at first.
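One concrete candidate here (an assumption on my part, nothing benchmarked yet): the pg_trgm extension, which backs LIKE '%substring%' lookups with a trigram GIN index and might cover the nickname/fingerprint substring cases without a full-blown text search setup. Index name invented:

-- trigram-based GIN index to back substring matches on nickname:
create extension if not exists pg_trgm;
create index statusentry_nickname_trgm_idx
    on statusentry using gin (nickname gin_trgm_ops);

-- the planner can then consider the trigram index for queries like:
explain
select fingerprint, validafter, nickname
from statusentry
where nickname like '%oria%';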
End of an email without a proper direction/point. :)