
Re: [tor-dev] [GSoC 2013] Status report - Searchable metrics archive



On Tue, Aug 13, 2013 at 2:15 PM, Karsten Loesing <karsten@xxxxxxxxxxxxxx> wrote:
I suggest putting pg_prewarm on the future work list. I sense there's a
lot of unused potential in stock PostgreSQL. Tweaking the database at
this point has the words "premature optimization" written on it in big
letters for me.

Also, to be very clear here, a tool that requires custom tweaks to
PostgreSQL has minimal chances of running on torproject.org machines in
the future. The current plan is that we'll have a dedicated database
machine operated by our sysadmins that not even the service operator
will have shell access to.

Oh, understood. OK, no extensions, at least for now.

Apropos: to my current (limited) understanding, it might be difficult to support, for example, nickname substring searches without a (supported, official) extension. One such extension is pg_trgm [1], which lives in the contrib/ directory in 9.1 and is just one make install away. But for now, I'll assume this is not possible / that we should avoid it.
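
For future reference, the gist of what pg_trgm does can be sketched in plain Python. This is a toy illustration of the trigram idea, not pg_trgm's actual on-disk format or API:

```python
def trigrams(s, pad=True):
    """Return the set of 3-grams of s; padding (two leading blanks, one
    trailing) roughly follows pg_trgm's convention for whole words."""
    s = s.lower()
    if pad:
        s = "  " + s + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def substring_candidates(sub, nicknames):
    """Nicknames that may contain sub: a trigram index lets the database
    narrow a LIKE '%sub%' search to rows sharing the pattern's trigrams,
    then verify each candidate, instead of scanning every row."""
    needed = trigrams(sub, pad=False)
    if not needed:  # patterns under 3 chars give no trigrams to filter on
        return [n for n in nicknames if sub.lower() in n.lower()]
    candidates = [n for n in nicknames if needed <= trigrams(n)]
    # trigram overlap is only a necessary condition, so re-check properly
    return [n for n in candidates if sub.lower() in n.lower()]
```

This is exactly why substring search is hard without the extension: absent a trigram (or similar) index, LIKE '%foo%' has no B-tree to use and degenerates to a sequential scan.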

So, why do you join descriptors and network statuses in the search
process? At the Munich dev meeting I suggested joining the tables
already in the import process. What do you think about that idea?

Yes, some time ago I made a half-hearted attempt to normalize the two tables, for a small set of descriptors and status entries. I'll try that scheme in full after I try one other thing first; I'll need to re-import a major part of the data (which I didn't do then) to see whether it scales well. The other idea is a third table of unique fingerprints and (non-unique) nicknames for truly quick fingerprint lookup and nickname search: the statusentry table currently holds only ~170K unique fingerprints versus ~67M rows in total. I experimented with this as well, but again only on a small subset of the data, and I think I can do a better job now.
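
To make the third-table idea concrete, here's a toy sketch (illustrative table and column names only, using SQLite in place of PostgreSQL; this is not the actual torsearch schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE relay (                 -- ~170K rows: one per unique fingerprint
        id          INTEGER PRIMARY KEY,
        fingerprint TEXT UNIQUE NOT NULL,
        nickname    TEXT NOT NULL
    );
    CREATE INDEX relay_nickname ON relay (nickname);
    CREATE TABLE statusentry (           -- ~67M rows, referencing relays by id
        relay_id    INTEGER NOT NULL REFERENCES relay (id),
        validafter  TEXT NOT NULL
    );
    CREATE INDEX statusentry_relay ON statusentry (relay_id);
""")
conn.execute("INSERT INTO relay VALUES (1, ?, 'ExampleRelay')", ("A" * 40,))
conn.executemany("INSERT INTO statusentry VALUES (1, ?)",
                 [("2013-08-01 00:00:00",), ("2013-08-01 01:00:00",)])

# A nickname search now touches the small relay table first, then fans
# out to statusentry by integer id instead of joining on long strings.
rows = conn.execute("""
    SELECT s.validafter
    FROM relay r JOIN statusentry s ON s.relay_id = r.id
    WHERE r.nickname = 'ExampleRelay'
""").fetchall()
```

The point of the design is that the expensive predicates (nickname, fingerprint) only ever run against the ~170K-row table, and the ~67M-row table is only reached through a cheap integer index.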

It had seemed to me that the bottleneck was having to sort too large a number of rows, but I now understand the 'explain analyze' output (if only just a bit) better: the 'Nested Loop' node, which performs the join in the query discussed, is expensive and is part of the bottleneck, so to speak. So I'll look into that after properly benchmarking things with the third table. (By the way, for future reference: we do have to test ideas on a substantial subset of the overall data, as performance does not scale linearly.) :)

> https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
>
> We use the following indexes while executing that query:
>
> * lower(nickname) on descriptor
>
> * (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on statusentry

Using only the first 12 characters sounds like a fine approach to speed
up things. But why 12? Why not 10 or 14? This is probably something
you should annotate as a parameter to find a good value for later in the
process. (I'm not saying that 12 is a bad number. It's perfectly fine
for now, but it might not be the best number.)

Yes, this is as unscientific as it gets. As of now we're using a raw SQL query, but I'll be encapsulating the queries properly soon (so we can easily attach different WHERE clauses, etc.), at which point I'll turn the prefix length into a parameter. I did run some tests, but nothing extensive; I mainly made sure the indexes can fit into memory whole, which was the main constraint. Will do more tests.

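One way to make the choice less unscientific: treat prefix collisions as a birthday problem over the ~170K distinct fingerprints currently in statusentry. A quick back-of-the-envelope sketch:

```python
import math

def prefix_collision_prob(n, hex_chars):
    """Birthday-bound estimate of the chance that at least two of n
    distinct fingerprints share the same hex prefix of this length."""
    buckets = 16 ** hex_chars
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * buckets))

# ~170K unique fingerprints, as in the current statusentry table
for k in (8, 10, 12, 14):
    print(k, prefix_collision_prob(170_000, k))
```

Roughly: an 8-char prefix collides almost surely, 10 chars gives a chance on the order of a percent, and at 12 chars the probability of any collision is down around 10^-5, which is why 12 feels comfortable even before extensive testing.
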
Also, would it keep indexes smaller if you took something else than
base16 encoding for fingerprints? What about base64? Or is there a
binary type in PostgreSQL that works fine for indexes?

Re: the latter: it looks like the default B-tree index type can in fact handle the bytea binary type, so storing fingerprints as raw 20-byte values instead of 40 hex characters is worth looking into, thanks! On the whole, though, as long as all the indexes occupy only a subset of pgsql's internal buffers, index size shouldn't be the problem, afaik. But if we're aiming for a well-researched ORM/database design, I should look into it.

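For reference, the raw sizes involved (a relay fingerprint is a 20-byte SHA-1 value; the encodings below are just stdlib illustrations):

```python
import base64
import os

fp_raw = os.urandom(20)                        # 20-byte fingerprint (SHA-1 sized)
hex_form = fp_raw.hex().upper()                # base16: 40 characters
b64_form = base64.b64encode(fp_raw).decode()   # base64: 28 characters w/ padding

print(len(fp_raw), len(hex_form), len(b64_form))  # 20 40 28
```

So base64 keys would be ~30% smaller than hex, and raw binary half the size; whether that translates into a meaningfully smaller B-tree is something to benchmark.
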
Do you have a list of searches you're planning to support?

These are the ones that should *really* be supported:
  • ?search=nickname
  • ?search=fingerprint
  • ?lookup=fingerprint
  • ?search=address [done some limited testing, currently not focusing on this]
  • ?running=<boolean>
  • ?flag=flag [every kind of clause that further narrows down the query is welcome; the current db model supports all the flags that Stem does, and each flag has its own column]
  • ?first_seen_days=range
  • ?last_seen_days=range
As per the plan, the db should be able to return a list of status entries / valid-after ranges (which can be used for {first,last}_seen_days) given some fingerprint.
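
Once the raw queries are encapsulated, the parameters above could map onto WHERE clauses roughly like this (a hypothetical sketch; the column and flag names are made up for illustration, not the actual torsearch model):

```python
KNOWN_FLAGS = {"running", "exit", "guard", "fast", "stable", "valid"}

def build_where(params):
    """Translate request parameters into a WHERE clause plus bound values.
    Column names go through a whitelist; user input only ever travels
    through bound parameters, never string interpolation."""
    clauses, args = [], []
    term = params.get("search")
    if term:
        if len(term) == 40 and all(c in "0123456789abcdefABCDEF" for c in term):
            clauses.append("fingerprint = ?")
            args.append(term.upper())
        else:
            clauses.append("lower(nickname) = ?")
            args.append(term.lower())
    if "running" in params:
        clauses.append("is_running = ?")
        args.append(params["running"].lower() == "true")
    flag = params.get("flag", "").lower()
    if flag in KNOWN_FLAGS:   # whitelist keeps the column name safe
        clauses.append("flag_%s = ?" % flag)
        args.append(True)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    return where, args
```

The nice property is that every extra parameter just appends one clause, so ?running, ?flag, and the date ranges all compose into a single indexed query.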

Thanks for your feedback and reply!

Kostas.


[1]: http://www.postgresql.org/docs/9.1/static/pgtrgm.html
_______________________________________________
tor-dev mailing list
tor-dev@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev