[tor-bugs] #13600 [Onionoo]: Improve bulk imports of descriptor archives
#13600: Improve bulk imports of descriptor archives
-------------------------+---------------------
Reporter: karsten | Owner:
Type: enhancement | Status: new
Priority: normal | Milestone:
Component: Onionoo | Version:
Keywords: | Actual Points:
Parent ID: | Points:
-------------------------+---------------------
We need to improve bulk imports of descriptor archives. Whenever somebody
wants to initialize Onionoo with existing data, they'll need to process
years of descriptors. The current code is not at all optimized for that;
instead, it's designed to run once per hour and update things as quickly
as possible. Let's fix that and support bulk imports better.
Here's what we should do:
- We define a new directory `in/archive/` where operators can put
descriptor archives fetched from CollecTor. Whenever there are files in
that directory we import them first (before descriptors in `in/recent/`).
In particular, we iterate over files twice: in the first iteration we look
at the first contained descriptor to determine its type, and in the second
iteration we parse files containing server descriptors and then files
containing other descriptors. (This order is important for computing
advertised bandwidth fractions, which only works if we parse server
descriptors before consensuses.) This process will take a very long time,
so we should log whenever we complete a tarball, and ideally we'd print out
how many tarballs we have already parsed and how many more remain.
- We add a new command-line switch `--update-only` for only updating
status files and not downloading descriptors or writing document files.
Operators could then import archives, which would take days or even weeks,
and then switch to downloading and processing recent descriptors. My
branch task-12651-2 is a major improvement here, because it ensures that
''all'' documents will be written once the bulk import is done, not just
the ones for relays and bridges that were contained in recent descriptors.
Future command-line options would be `--download-only` and `--write-only`
for the other two phases, and `--single-run`, which would do what is the
current default, once we switch from being called by cron every hour to
scheduling our own hourly runs internally.
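
To make the import order from the first item concrete, here is a minimal
sketch in Python (Onionoo itself is Java, so this is illustration only, not
a proposed patch). All function names are made up for this example. It
assumes each archived descriptor file starts with a CollecTor-style
annotation line such as `@type server-descriptor 1.0`, and that `parse` is
whatever routine actually processes one tarball.

```python
import tarfile


def first_descriptor_type(tarball_path):
    """Return the type named in the @type line of the first contained file.

    Assumption: descriptor files begin with a line like
    "@type server-descriptor 1.0". Returns None if that line is missing.
    """
    with tarfile.open(tarball_path) as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directory entries
            line = tar.extractfile(member).readline()
            parts = line.decode("utf-8", "replace").split()
            if len(parts) >= 2 and parts[0] == "@type":
                return parts[1]
            return None  # only the first regular file is inspected
    return None


def plan_import(tarball_paths, detect=first_descriptor_type):
    """First iteration: classify tarballs, server descriptors go first."""
    server, other = [], []
    for path in sorted(tarball_paths):
        if detect(path) == "server-descriptor":
            server.append(path)
        else:
            other.append(path)
    return server + other


def run_import(tarball_paths, parse, detect=first_descriptor_type, log=print):
    """Second iteration: parse in planned order, log progress per tarball."""
    ordered = plan_import(tarball_paths, detect)
    total = len(ordered)
    for done, path in enumerate(ordered, start=1):
        parse(path)
        log(f"Parsed {path} ({done} of {total} tarballs, "
            f"{total - done} remaining)")
```

Looking only at the first descriptor keeps the first iteration cheap, but
it relies on each tarball containing a single descriptor type.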
I somewhat expect us to run into memory problems when importing months or
even years of data at once. So, part of the challenge here will be to
keep an eye on memory usage and fix any memory issues.
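
One cheap way to keep an eye on memory is to log it after every tarball, so
that growth over a long import shows up in the logs. The sketch below is
again Python for illustration only and uses the standard `tracemalloc`
module; `import_with_memory_log` and `parse` are hypothetical names. In
Onionoo's Java code the equivalent numbers would have to come from the JVM,
for example from `Runtime`.

```python
import tracemalloc


def import_with_memory_log(tarball_paths, parse, log=print):
    """Parse tarballs one at a time and log traced memory after each one."""
    tracemalloc.start()
    try:
        for path in tarball_paths:
            parse(path)
            # get_traced_memory() returns (current, peak) in bytes.
            current, peak = tracemalloc.get_traced_memory()
            log(f"{path}: current={current / 2**20:.1f} MiB, "
                f"peak={peak / 2**20:.1f} MiB")
    finally:
        tracemalloc.stop()
```

If the "current" figure keeps climbing from one tarball to the next, some
state is being retained across tarballs and is worth a closer look.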
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/13600>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online