Re: [tor-bugs] #13600 [Onionoo]: Improve bulk imports of descriptor archives
#13600: Improve bulk imports of descriptor archives
-----------------------------+-----------------
     Reporter:  karsten      |       Owner:
         Type:  enhancement  |      Status:  new
     Priority:  normal       |   Milestone:
    Component:  Onionoo      |     Version:
   Resolution:               |    Keywords:
Actual Points:               |   Parent ID:
       Points:               |
-----------------------------+-----------------
Comment (by karsten):
Replying to [comment:9 leeroy]:
> No problem. It looks like this branch and deployed Onionoo produce
> slightly different results when processing the same data set (recent 73h).
> I attach a sample (onionoo_k is this branch). I'll test some multiple
> archive imports on this branch.
>
> In ''status'':
>
> * The timestamp (?) after the country code is sometimes set to -1.

I think this one is harmless. If you're curious, you can read more about
it in the comment in NodeStatus starting with "This is a (possibly
surprising) hack...".

> In ''out'':
>
> * Some bandwidth documents have an extra value.

This one should be harmless, too. This has to do with running the hourly
updater at a later time and compressing bandwidth intervals lying farther
in the past. We simply don't need the 15-minute precision anymore when
we're outside of the 3-day graph interval. There would be similar
compressions once we're outside the 1-week, 1-month, etc. interval.
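
To illustrate the compression, here is a minimal sketch in Python (not Onionoo's actual Java code). Only the 3-day/15-minute pair comes from the explanation above; the coarser resolutions, the tuple layout, and all function names are assumptions made up for this example.

```python
from typing import List, Tuple

HOUR = 3600
DAY = 24 * HOUR

# (maximum age in seconds, resolution in seconds).
# Only the first pair is taken from the discussion; the rest are assumed.
RESOLUTIONS = [
    (3 * DAY, 15 * 60),
    (7 * DAY, HOUR),       # assumption
    (31 * DAY, 4 * HOUR),  # assumption
]
COARSEST = 12 * HOUR       # assumption for anything older

# One bandwidth interval: (start, end, bytes), times in seconds.
Interval = Tuple[int, int, int]


def resolution_for(age: int) -> int:
    """Return the resolution we still want to keep for data of this age."""
    for max_age, resolution in RESOLUTIONS:
        if age <= max_age:
            return resolution
    return COARSEST


def compress(history: List[Interval], now: int) -> List[Interval]:
    """Merge adjacent intervals that fall into the same coarse time slot."""
    result: List[Interval] = []
    for start, end, value in sorted(history):
        resolution = resolution_for(now - end)
        if result:
            prev_start, prev_end, prev_value = result[-1]
            same_slot = prev_start // resolution == (end - 1) // resolution
            if prev_end == start and same_slot:
                # Extend the previous interval instead of adding a new one.
                result[-1] = (prev_start, end, prev_value + value)
                continue
        result.append((start, end, value))
    return result
```

Because the result depends on `now`, running the same data through `compress` at two different times can yield a different number of values, which is the kind of difference described above.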
> __Importing multiple months:__ I know Onionoo can, because I tested it
> (testing it on this branch), but should it be encouraged? The current load
> on memory is rather high. If someone tries to import a year of archives at
> once, can the current heap dependency be guaranteed not to induce a
> failure. Maybe this won't be that big a deal. Just warn the operator to
> limit the number of months at a time until other tickets deal with the
> heap load. Something to add to the documentation?

Yes, this is something we could add to the documentation. Unfortunately,
reducing memory requirements enough to import multiple months or even
years of descriptors is tough, because that's a very different use case
from running the updater once per hour with only one hour of descriptors.
When in doubt, I optimized the process in favor of the hourly update
process. That's why I'd prefer to add a warning to the documentation.
> __Input validation:__ I saw metrics-lib included some packages for
> compressed file handling so I tried importing from .xz instead of tarball.
> Some validation of the input archives might be worthwhile. Bad things will
> happen to the log when this is attempted.

True! I just created #16424 to support importing .xz-compressed
tarballs. In general, Onionoo is not very robust against invalid input
provided by the ''service operator'', because so far the service operator
was also the main developer. But let's try to fix that and make it
more robust, if we can.
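
One cheap form of input validation would be to look at a file's first bytes before handing it to the parser. The following Python sketch shows the idea; it is not what #16424 implements, and the function names are hypothetical. The magic numbers themselves are the standard ones for xz, gzip, bzip2, and tar.

```python
from pathlib import Path

XZ_MAGIC = b"\xfd7zXZ\x00"   # bytes FD 37 7A 58 5A 00
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"
TAR_MAGIC = b"ustar"         # found at offset 257 of a tar header block


def classify_header(header: bytes) -> str:
    """Guess the container type from the first 512 bytes of a file."""
    if header.startswith(XZ_MAGIC):
        return "xz"
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    if header.startswith(BZIP2_MAGIC):
        return "bzip2"
    if header[257:262] == TAR_MAGIC:
        return "tar"
    return "unknown"


def detect_archive_type(path: Path) -> str:
    """Read just enough of the file to classify it."""
    with open(path, "rb") as archive:
        return classify_header(archive.read(512))
```

An importer could then skip files classified as "unknown" and log a single warning per file, rather than failing somewhere deep inside the parsing code.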
> __Parsing archives:__ Parse history doesn't include archives, and
> archives aren't removed after parsing. DescriptorDownloader cannot now
> remove the archives (current behavior) because it only considers the
> recent folder.

Oh, I don't think Onionoo should remove tarballs from the archive
directory after parsing them, because it didn't place them there
beforehand. What we could do, however, is add a parse history for files
in the archive directory; see the newly created #16426.
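
As a rough idea of what such a parse history could look like, here is a Python sketch that remembers each archive file's name and last-modified time and only returns files that are new or changed. This is not the design of #16426; the JSON history file and all function names are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_history(history_file: Path) -> Dict[str, int]:
    """Load the file-name -> last-modified mapping, or start empty."""
    if not history_file.exists():
        return {}
    return json.loads(history_file.read_text())


def files_to_parse(archive_dir: Path, history: Dict[str, int]) -> List[Path]:
    """Return archive files that are new or changed since the last run."""
    pending = []
    for path in sorted(archive_dir.iterdir()):
        if not path.is_file():
            continue
        if history.get(path.name) != path.stat().st_mtime_ns:
            pending.append(path)
    return pending


def record_parsed(history: Dict[str, int], path: Path) -> None:
    """Remember that this file was parsed in its current state."""
    history[path.name] = path.stat().st_mtime_ns


def save_history(history_file: Path, history: Dict[str, int]) -> None:
    """Write the history back to disk for the next run."""
    history_file.write_text(json.dumps(history, sort_keys=True))
```

The tarballs themselves stay untouched in the archive directory; only the history file, kept outside that directory, changes between runs.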
> __Parsing archives:__ If --single-run or --update-only is used with
> archives that have ''already been parsed'', they will be parsed again.
> This leads to a change in the size of the status folder. It becomes
> smaller for the same number of archive-sourced files. I didn't try to
> determine the reason for this change at the time. I intend to revisit this
> potential problem to see if the same thing happens, and why. It might be
> interesting if the change also happens during re-processing of recent data
> (which may happen when restoring a backup of data).

It would be interesting to learn more about that directory becoming
smaller. For now, I'll assume it's related to the differences stated
above. But if you spot an actual bug there, please mention it here or
open a new ticket.

Thanks for trying this out and sending feedback here!
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/13600#comment:10>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online