
Re: [tor-dev] Onionoo protocol/implementation nuances / Onionoo-connected metrics project stuff



On 7/29/13 9:45 PM, Kostas Jakeliunas wrote:
> Hi Karsten,
> 
> (not sure whom to CC and whom not to; I have a couple of fairly
> specific technical questions / comments (which, again, I should have
> delved into earlier), but then again, maybe the scope of the tor-dev
> mailing list includes such cases..)
> 
> @tor-dev: This is in regard to the Searchable metrics archive project,
> intersected with Onionoo stuff.
> 
> I originally wanted to ask two questions, but by the time I reached the
> middle of the email, I wasn't sure anymore whether they were questions,
> so I think these are simply my comments / observations about a couple of
> specific points, and I just want to make sure I'm making general sense. :)
> 
> ..So it turns out that it is very much possible to avoid Postgres
> sequential scans for the metrics search backend/database - that is, to
> construct queries which are best executed using indexes (the query
> planner thinks so, and it makes sense as far as I can tell) and
> whatever efficient (hm, sub-O(n)?) search algorithms are deemed best.
> The two points that seem potentially problematic to me are:
> 
>   - when doing ORDER BY, making sure that the resulting SELECT covers /
> would potentially return a relatively small number of rows (from what
> I've been reading / trying out, whenever a SELECT may cover more than
> ~10-15% of all the table's rows, a sequential scan is preferred, as
> using indexes would result in even more disk i/o, whereas the seq. scan
> can read in massive(r) chunks of data from disk because it's, well,
> sequential). This means that doing "ORDER BY nickname" (or fingerprint,
> or any other non-unique column whose matches cover a relatively small
> part of the network status table) is much faster than doing e.g. "ORDER
> BY validafter" (where validafter refers to the consensus document's
> "valid after" field) - which makes sense, of course, when you think
> about it (ordering a massive number of rows, even with a proper LIMIT
> etc., is insane). We construct separate indexes (e.g. on 'fingerprint'
> even though it is part of the composite (validafter + fingerprint)
> primary key), and all seems to be well.
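>
> (For illustration - untested, and the index name is made up:)
>
>   -- separate single-column index on fingerprint, even though
>   -- fingerprint is already part of the composite primary key
>   CREATE INDEX statusentry_fingerprint_idx
>     ON statusentry (fingerprint);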
> 
> I've been looking into the Onionoo implementation, particularly into
> ResourceServlet.java [1], to see how ordering etc. is done there. I'm
> new to that code, but am I right in saying that as of now, the results
> (at least for /summary and /details) are returned mostly unsorted
> (except for the possibility of ordering by consensus_weight), with
> "valid_after" fields appearing in no particular order? This of course
> makes sense in this case, as Onionoo is able to return *all* the
> results for the given search criteria (or, if none are given, all
> available results) at once.

This is correct.  Onionoo only allows ordering by consensus_weight,
because I only added features when we came up with new use cases.

> The obvious problem with the archival metrics search project (argh, we
> are in need of a decent name, I daresay :) maybe I'll think of
> something.. no matter) is then, of course, the fact that we can't
> return all results at once. So far I've been assuming that it would
> therefore make sense to return them, whenever possible (and if not
> requested otherwise, once an "order" parameter is implemented), in
> "valid_after" descending order. I suppose this makes sense? This would
> be ideal, methinks. So far, it seems that we can do that, as long as we
> have a WHERE clause that is restrictive enough.
> 
> select * from (
>     select distinct on (fingerprint) fingerprint, validafter, nickname
>     from statusentry
>     where nickname like 'moria1'
>     order by fingerprint, validafter desc
> ) as subq
> order by validafter desc limit 100;
> 
> 
> This, for example, works out nicely in terms of efficiency / query
> plan, Postgres tells me. (The double ORDER BY is needed, as Postgres'
> DISTINCT ON requires an ORDER BY whose leftmost expressions match the
> DISTINCT ON (x) expressions.)
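>
> (An index along these lines should let the planner satisfy the inner
> ORDER BY with an index scan - untested sketch, index name made up:)
>
>   CREATE INDEX statusentry_fpr_validafter_idx
>     ON statusentry (fingerprint, validafter DESC);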
> 
> Now, you mentioned / we talked about what the (large, archival
> metrics) backend should do about limiting / cutoff - especially if no
> search criteria are specified, for example. Ideally, it would return a
> top-most list of statuses (which in /details include info from server
> descriptors), sorted by "last seen" / "valid after"? Thing is, querying
> a database for 100 (or 1000) items with no ORDER BY is really cheap;
> introducing ORDER BYs which would still produce tons of results is
> considerably less so. I'm now looking into this (and you did tell me
> this, i.e. that a large part of DB/backend robustness gets tested at
> exactly these points; this should have been more obvious to me).
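>
> (Concretely, the kind of query I mean - untested; I suspect its cost
> hinges on whether the planner can walk an index on validafter
> backwards for the LIMIT, instead of sorting the whole table first:)
>
>   SELECT fingerprint, nickname, validafter
>   FROM statusentry
>   ORDER BY validafter DESC
>   LIMIT 100;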
> 
> But in any case, I have the question: what *should* the backend return
> when no search parameters/filters are specified, or only very loose
> ones (nickname LIKE "a") are?
> 
> What would be cheap / doable: ORDER BY fingerprint (or digest (==
> hashed descriptor), etc.). It *might* (well, should) work to order by
> fingerprint, limit the results, and *then* reorder them by validafter -
> with no guarantee that the topmost results would have the highest
> validafter values overall. I mean, Onionoo does this kind of reordering
> / limiting itself, but there it makes sense as Onionoo can handle all
> the data at once (or I'm missing something, i.e. the file I linked to
> already interacts with a subset of data; granted, I haven't thoroughly
> read through it). In our case, it makes sense to try and outsource all
> relational logic to the ORM (but of course if it turns out we can
> cheaply get 100 arbitrary results and more easily juggle them ourselves
> / on the (direct) Python side, then sure).
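>
> (Roughly like this - untested sketch; the inner LIMIT picks the 100
> rows with the lowest fingerprints, and the outer ORDER BY only reorders
> those 100 rows, so there's no guarantee they carry the most recent
> validafter values:)
>
>   SELECT * FROM (
>       SELECT fingerprint, nickname, validafter
>       FROM statusentry
>       ORDER BY fingerprint
>       LIMIT 100
>   ) AS subq
>   ORDER BY validafter DESC;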
> 
> But would such arbitrarily selected results make sense? They would look
> just like Onionoo results, only a (small) subset of them.

Can you rephrase this question, say, in a sentence or two?  I'm having a
hard time getting what you're asking for.

Making a few random statements here, in case one of them addresses the
issue you're having:

- Ordering is important even if no WHERE clause is given or if the
WHERE clause matches a lot of rows.  Think of the relay-search
application when you look for all relays with nickname U* (as in
"Unnamed", which is very frequent).  You want to be sure to find the
latest relays in the database, not a random subset.

- You might want to look into multi-column indexes, e.g., an index on
(validafter, nickname).  It could be that PostgreSQL uses them instead
of a sequential scan.
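
For example, on your statusentry table from above (the index name is
just a placeholder; verify with EXPLAIN that the planner actually picks
it up):

  -- column order matters: this index helps queries that filter or sort
  -- on validafter first and then on nickname
  CREATE INDEX statusentry_validafter_nickname_idx
    ON statusentry (validafter, nickname);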

- It might make sense to write a set of test SELECTs and check them
against your expected query strategies.  Maybe there are tools for this,
but if there are none, simply run each query using EXPLAIN ANALYZE
and parse the output for keywords indicating a sequential scan or an
index scan.  Then create indexes and re-run your test script to find out
whether they improve things or not.
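
Something like this for each test query (output details will of course
vary):

  EXPLAIN ANALYZE
    SELECT fingerprint, nickname, validafter
    FROM statusentry
    WHERE nickname = 'moria1'
    ORDER BY validafter DESC
    LIMIT 100;
  -- grep the plan for "Seq Scan" vs. "Index Scan" / "Bitmap Heap Scan"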

- Also look into insertion speed when you create lots of indexes.  That
is, once you're happy with query speed, import another month of data to
see whether the import still completes in reasonable time.

> Ramble #2: Onionoo does more or less direct (well, compiled, so
> efficient) regexps on fingerprint, nickname, etc. By default, "LIKE
> %given_nickname_part%" is again (sequentially) expensive; Postgres does
> offer full text search extensions (one would need to build additional
> tables, etc.), and I think it makes sense to use them; that would cover
> all our "can supply a subset of :param" bases. I'll look into this.
> (It's also possible to construct functional indexes, e.g. with a
> regexp, but one needs to know the pattern in advance - I wonder if
> using Onionoo's regular expressions would work / make sense - will
> see.) For now, LIKE / = exact_string is very nice, but of course the
> whole idea is that it'd be possible to supply substrings. Of note is
> the fact that in this case, LIKE %substring% is O(n) in the sense that
> query time correlates with row count, afaict. As of now, I think full
> text search extensions would solve the problem, even if they may look a
> bit like overkill at first.
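>
> (A concrete option here might be the pg_trgm extension, which can back
> LIKE '%substring%' queries with a trigram GIN index - untested sketch,
> index name made up:)
>
>   CREATE EXTENSION IF NOT EXISTS pg_trgm;
>   CREATE INDEX statusentry_nickname_trgm_idx
>     ON statusentry USING gin (nickname gin_trgm_ops);
>   -- after this, e.g. WHERE nickname LIKE '%oria%' can use the index
>   -- instead of a sequential scan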

Again, not sure if I understand the question.  A few possible answers:

- Feel free to restrict searches to things you can implement
efficiently.  It's better to exclude an expensive query and force the
user to think about a better way to query, than to offer that search
option and let the user sit waiting for a long time.

- You can prepare columns for searching to make queries much faster.
For example, you could store LOWER(nickname) in addition to nickname if
you're going to do lower-case searches for nickname anyway; that's what
relay search does.  Or you can store the first 16 bits of IP addresses
in a separate column to narrow down search results quickly; that's what
ExoneraTor in metrics-web.git does.  Or you can store descriptors in
monthly tables to keep indexes small; that's what relay search does.
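
A rough sketch of that kind of preparation (column and index names are
only examples, and I'm assuming an "address" column exists; this is not
literally what relay search or ExoneraTor do):

  -- lower-cased copy of the nickname, so case-insensitive searches can
  -- hit a plain b-tree index
  ALTER TABLE statusentry ADD COLUMN nickname_lc TEXT;
  UPDATE statusentry SET nickname_lc = LOWER(nickname);
  CREATE INDEX statusentry_nickname_lc_idx
    ON statusentry (nickname_lc);

  -- first 16 bits (two octets) of the IPv4 address in a separate
  -- column, to narrow down address searches before exact matching
  ALTER TABLE statusentry ADD COLUMN address_prefix TEXT;
  UPDATE statusentry SET address_prefix =
    split_part(address, '.', 1) || '.' || split_part(address, '.', 2);
  CREATE INDEX statusentry_address_prefix_idx
    ON statusentry (address_prefix);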

> End of an email without a proper direction/point. :)

Well, now looking into your other mails in this thread.

(Hint: in general, if you want me to reply quickly, please write a
single mail with clear comments or questions, not one that you need to
correct/extend three times within an hour.  Maybe simply leave it
sitting in your drafts folder for an hour until you're sure it's what
you want to say.  Thanks!)

> 
> [1]:
> https://gitweb.torproject.org/onionoo.git/blob/HEAD:/src/org/torproject/onionoo/ResourceServlet.java
> 

_______________________________________________
tor-dev mailing list
tor-dev@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev