[tor-bugs] #19778 [Metrics/CollecTor]: Bridge descriptor sanitizer runs out of memory after 13.5 days
#19778: Bridge descriptor sanitizer runs out of memory after 13.5 days
-----------------------------------+-----------------------------
Reporter: karsten | Owner:
Type: defect | Status: new
Priority: High | Milestone: CollecTor 1.1.0
Component: Metrics/CollecTor | Version:
Severity: Normal | Keywords:
Actual Points: | Parent ID:
Points: | Reviewer:
Sponsor: |
-----------------------------------+-----------------------------
I'm currently reprocessing the bridge descriptor archive for #19317. The
process, started with `-Xmx6g` on a machine with 8 GB of RAM, ran out of
memory after 13.5 days. I uploaded the custom log with additional debug
lines for the currently processed tarball here:
https://people.torproject.org/~karsten/volatile/collector-bridgedescs.log.xz (556K).
While writing tests for #19755, I noticed a possible explanation, though I
don't have facts to prove it yet: `BridgeSnapshotReader` contains a
`Set<String> descriptorImportHistory` that stores SHA-1 digests of files
and single descriptors in order to skip duplicates as early as possible.
Its effect can be seen in log lines like the following one, which comes
from reprocessing one day of tarballs:
{{{
2016-07-28 11:54:31,206 DEBUG o.t.c.b.BridgeSnapshotReader:215 Finished
importing files in directory in/bridge-descriptors/. In total, we parsed
87 files (skipped 9) containing 24 statuses, 33984 server descriptors
(skipped 168368), and 29618 extra-info descriptors (skipped 50027).
}}}
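For context, the deduplication pattern works roughly like the following
minimal sketch; this is not the actual `BridgeSnapshotReader` code, and
everything except the field name `descriptorImportHistory` is made up for
illustration:
{{{
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/* Sketch of the digest-based deduplication pattern as I understand it;
 * not the actual CollecTor code. */
public class DedupSketch {

  /* Grows by one 40-character hex string per file and per contained
   * descriptor and is never cleared during a run. */
  private Set<String> descriptorImportHistory = new HashSet<>();

  /* Returns true if we already imported a file or descriptor with the
   * same bytes earlier in this run. */
  boolean seenBefore(byte[] fileOrDescriptorBytes)
      throws NoSuchAlgorithmException {
    byte[] digest = MessageDigest.getInstance("SHA-1")
        .digest(fileOrDescriptorBytes);
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }
    /* Set#add returns false if the digest was already contained. */
    return !descriptorImportHistory.add(hex.toString());
  }
}
}}}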
I don't know a good way to confirm this theory other than running the
process once again for a few days and logging the size of that set. I
also tried attaching `jvisualvm` last time, but for some reason it
detached and froze after 90 hours.
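Logging the size could look roughly like this (a minimal sketch, assuming
the class's existing slf4j-style `logger` field; the wording of the
message is made up):
{{{
/* Hypothetical debug line after each parsed file, so the growth of the
 * history set can be correlated with heap usage over a multi-day run. */
logger.debug("descriptorImportHistory now contains {} entries.",
    descriptorImportHistory.size());
}}}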
Possible fixes:
- Use some kind of least-recently-used (or maybe least-recently-inserted
if that's easier to implement) cache that allows us to skip duplicates in
tarballs written on the same day or so. There's no harm in reprocessing a
duplicate; it just takes more time than skipping it. Needs some testing
to get the size right, though it seems from the log above that 100k
entries might be enough (see the sketch after this list).
- Avoid keeping a set and instead run the sanitizing process just far
enough to know whether we wrote a given descriptor before. That would
mean computing the SHA-1 digest and parsing up to the publication time.
In early tests this increased processing time by a factor of 1.2 or 1.3,
and even more processing time is not exactly what I'm looking for.
- Are there other options, ideally ones that are easy to implement and
maintain?
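To make the first option more concrete, here is a minimal sketch of a
bounded, least-recently-inserted cache built on
`LinkedHashMap#removeEldestEntry`; the class name is made up, and the
100k limit is just the rough number from the log excerpt above:
{{{
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/* Sketch of a bounded duplicate cache that could replace the unbounded
 * HashSet.  Evicting old digests only means we might re-sanitize a
 * duplicate, which is harmless. */
public class BoundedDigestCache {

  /* Rough size taken from the one-day log excerpt above; needs tuning. */
  private static final int MAX_ENTRIES = 100_000;

  /* Insertion-ordered LinkedHashMap gives least-recently-inserted
   * eviction; passing true as the third constructor argument would turn
   * this into a least-recently-used cache instead. */
  private final Set<String> digests = Collections.newSetFromMap(
      new LinkedHashMap<String, Boolean>(MAX_ENTRIES * 4 / 3, 0.75f,
          false) {
        @Override
        protected boolean removeEldestEntry(
            Map.Entry<String, Boolean> eldest) {
          return size() > MAX_ENTRIES;
        }
      });

  /* Returns true if the digest is still in the cache; otherwise
   * remembers it and returns false. */
  public boolean seenRecently(String digest) {
    return !digests.add(digest);
  }
}
}}}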
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/19778>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online