Re: [tor-bugs] #15844 [Onionoo]: Develop database schema to support Onionoo's search parameter efficiently
#15844: Develop database schema to support Onionoo's search parameter efficiently
-----------------------------+-----------------
Reporter: karsten | Owner:
Type: enhancement | Status: new
Priority: normal | Milestone:
Component: Onionoo | Version:
Resolution: | Keywords:
Actual Points: | Parent ID:
Points: |
-----------------------------+-----------------
Comment (by teor):
I like leeroy's suggestion of a `substring(foo)` (or, perhaps more
efficiently, a `position(foo)`) over a `LIKE %foo%`. Some database engines
can recognize their equivalence in the simple case of `%text%`, but you
don't want to rely on that. And you don't really need people to be able
to do `%foo%bar%`, do you?
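To make the comparison concrete, here is a sketch of the two equivalent queries. The table and column names (`relays`, `fingerprint`) are assumptions for illustration, not Onionoo's actual schema:

```sql
-- Substring search with LIKE: the search term must be escaped, because
-- % and _ are pattern metacharacters.
SELECT fingerprint FROM relays
 WHERE fingerprint LIKE '%' || 'abcd' || '%';

-- Substring search with position(): no metacharacters at all; returns
-- the 1-based offset of the first match, or 0 if not found.
SELECT fingerprint FROM relays
 WHERE position('abcd' IN fingerprint) > 0;
```

The `position()` form takes the search term literally, so user input cannot smuggle extra wildcards into the pattern.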
Allowing people to insert unquoted metacharacters, even into a LIKE
string, can open you up to DoS issues. How bad that gets depends on how
powerful Postgres LIKE patterns are (they're not full regexes, but
wildcard-heavy patterns can still be expensive to evaluate).
For comparing the starts of strings, sometimes `left(identity,
length(search_string)) = search_string` can be faster than `identity LIKE
search_string || '%'`, for similar reasons.
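As a sketch, assuming the same hypothetical `relays` table as above, the two prefix-match forms look like this:

```sql
-- Prefix match by comparing the leading characters directly:
SELECT fingerprint FROM relays
 WHERE left(fingerprint, length('abcd')) = 'abcd';

-- Equivalent LIKE form; the search term must not contain % or _
-- unless they are escaped:
SELECT fingerprint FROM relays
 WHERE fingerprint LIKE 'abcd' || '%';
```

One caveat: with the right operator class, Postgres can use a b-tree index for the anchored LIKE form, whereas the `left()` form would need an expression index, so it's worth measuring both against the real data.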
The database may be more efficient if its collation is case-insensitive
overall, and it will be easier to program if that is the default.
Individual columns and comparisons can still be case-sensitive (or, even
more efficiently, compared exactly, which is equivalent for search
purposes here).
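If changing the collation globally is too invasive, one common Postgres alternative (again using the hypothetical `relays` table) is to normalize with `lower()` and index the normalized expression; the `citext` extension is another option:

```sql
-- Case-insensitive matching without touching the collation:
-- index the lower-cased expression, then compare lower-cased values.
CREATE INDEX relays_fingerprint_lower_idx ON relays (lower(fingerprint));

SELECT fingerprint FROM relays
 WHERE lower(fingerprint) = lower('ABCD1234');
```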
You might also want to consider storing the base64 encoded string in its
decoded binary form in a `bit varying` column, and then doing the decoding
for each search term before comparing it using `position()`. The combined
decoding and comparison might be faster, and the storage and indexing will
definitely be faster and smaller. This strategy might apply to some of the
other fingerprints (store in binary form, convert before search). This
would also avoid some of your case-insensitive matching. It would also
give you a canonical form for storage and comparison. You'd have to encode
them for display, or, as a speed-space tradeoff, store a lookup table of
binary->encoded for each fingerprint in the database.
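A minimal sketch of that storage scheme, with hypothetical names (`relay_identities`, `identity`), might look like this; the application would decode each search term into a bit string before querying:

```sql
-- Store the fingerprint as raw bits rather than base64/hex text.
-- bit varying (rather than bytea) allows matching at sub-byte
-- granularity, which base64's 6-bit characters need.
CREATE TABLE relay_identities (
    identity bit varying NOT NULL  -- decoded fingerprint bits
);

-- The application decodes the user's search term to bits, then:
SELECT identity FROM relay_identities
 WHERE position(B'101101' IN identity) > 0;
```

Postgres's `position()` works on bit strings as well as text, so the search itself stays a simple substring test over the compact binary form.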
You can even compare partial bytes, which may be a feature, or a bug if
you're looking at the middle of strings. You might need to restrict
matches to `position() > 0 AND (position() - 1) % 4 = 0` for hex, and
`position() > 0 AND (position() - 1) % 6 = 0` for base64 (Postgres bit
positions are 1-based), so you only compare whole encoded character
positions, and not sub-character bit positions. Otherwise users might
complain that they put certain characters in, and don't get any out.
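The alignment check could be sketched like this, continuing the hypothetical `relay_identities` table. Since `position()` is 1-based, a match that starts on an encoded-character boundary sits at bit 1, 5, 9, ... for hex (4 bits per character):

```sql
-- Only accept matches that begin on a whole hex-character boundary;
-- for base64, replace the modulus 4 with 6.
SELECT identity FROM relay_identities
 WHERE position(B'1011' IN identity) > 0
   AND (position(B'1011' IN identity) - 1) % 4 = 0;
```

Note one limitation of this sketch: `position()` only reports the first occurrence, so an unaligned first match would hide a later aligned one; a production query would need to scan subsequent occurrences too.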
By the way, full-text search won't help, because in Postgres it indexes
words, not substrings.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/15844#comment:7>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
_______________________________________________
tor-bugs mailing list
tor-bugs@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs