[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]
Re: [tor-bugs] #15844 [Onionoo]: Develop database schema to support Onionoo's search parameter efficiently
#15844: Develop database schema to support Onionoo's search parameter efficiently
-----------------------------+-----------------
Reporter: karsten | Owner:
Type: enhancement | Status: new
Priority: normal | Milestone:
Component: Onionoo | Version:
Resolution: | Keywords:
Actual Points: | Parent ID:
Points: |
-----------------------------+-----------------
Comment (by leeroy):
A database is an important transition to remove the restrictions on
searchable period-of-time. You've mentioned you only get a week with the
onionoo resources as provisioned. It's reasonable to expect a database to
scale to the entirety of available history. It's the substring search on
the entire history that'll get you. Still, it would be also be epic for
analysis/vis. Your limit would be the storage space used by the database,
and the i/o of requests. It might be useful to bin the data, similar to
!CollecTor, into recent, and longer (to be decided) interval. Storage
isn't so much an issue, and neither is transaction i/o which can scale
with pgpool.
About your simplifications:
1a) If you're only interested in substring search on nickname and contact
then an alternative to `LIKE '%foo%'` must be considered. Some real
substring performance tests should be performed. What are the candidates?
There's the position function of postgreSQL, as teor mentioned, user-
defined functions, and a comparison with trigram-match indexing? The
testing would only need to apply to substring since you defined the rest
of the queries as prefix search capable (best case).
1b) I can think of ways around the inherent limits imposed by an unbounded
data set. A simplification of this thinking is:
* spawn n-helpers
* give each helper a range of data to work with
* schedule helpers
* for each helper apply substring search
* return results until all data processed
1c) A thought about introducing describing notation to the search pattern
itself. A marker of sorts to encode the search type. In the absence of any
marker perform a most general search. `@foo@`, or something like it, could
transform into a contact-specific search. This would avoid changes to
parametrization and keep backwards compatibility.
2) I do agree it would simplify things to have a case insensitive search
of base64 fingerprints. However, you've specified this type of search as
prefix search capable. So it would be best case performance. A query that
spans multiple tables can use a combination of substring-search and
prefix-search to maximize performance of the query. It's even conceivable
to perform the query in separate transactions rather than one.
3) A general change to a minimum of 3 characters would cause problems in
nickname searches won't it? A nickname is allowed to be a single
character. Lets suppose for an moment that the minimum is 1. Well then
only one character needs to match in order to return a result. Compare
this to 3 characters where all 3 need to match. I do, however, think
there's value in the careful consideration of a "single-unit" with respect
to an entity-type. It influences the returned data for a particular query.
A single character for nicknames, a single octet for an ip, a hex-value
for fingerprint, etc. This may produce more relevant results for a given
pattern.
I don't think there's any doubt a clean and efficient implementation is
doable. As long as it's understood that it's unlikely to ever be as fast
as Java. The Java VM offers runtime optimization, postgreSQL offers a
transactional data store. Even in a best case: postgreSQL, n-workers,
compiled user-defined substring search, and Java, there's still an
overhead from the transaction.
The Java deployment is a pain to scale, postgreSQL is designed to scale
independently from the Java powered engine behind the scenes. It's a
compromise.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/15844#comment:8>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
_______________________________________________
tor-bugs mailing list
tor-bugs@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs