Re: [tor-bugs] #6471 [Metrics Utilities]: Design file format and Python library for multiple GeoIP or AS databases
#6471: Design file format and Python library for multiple GeoIP or AS databases
-------------------------------+--------------------------------------------
Reporter: karsten | Owner:
Type: enhancement | Status: new
Priority: normal | Milestone:
Component: Metrics Utilities | Version:
Keywords: | Parent:
Points: | Actualpoints:
-------------------------------+--------------------------------------------
Comment(by karsten):
Replying to [comment:1 gsathya]:
> Possible first step would be to figure out if there are any additional
> info that we don't need/use in maxminds db.
> A naive solution then would be to -
> Step 0) Remove unnecessary data
> Step 1) Diff the old csv with the new csv
> Step 2.1) Add a human readable(?) line to the old csv - explaining the
> date of change, no of lines changed and possibly other details that might
> become obvious once we actually try to diff
> Step 2.2) Modify the diff to make more parseable since we know that we
> are only diff-ing csv's - i bet we can optimize this a bit
> Step 3) Append the modified diff to the old csv
> Step 4) Write a library that can parse added human readable line and the
> modified diff
>
> Another solution would be to go all out and write our own spec and a
> parser that converts every newly generated GeoIP db into something that
> conforms with our spec. (And write a library to parse such a file)
>
> The second approach would be a lot more useful in the long run but a lot
> more time consuming to write. If we pick either approaches(or an
> alternative one) I'd be happy to write the python code for it!
Your first approach above already sounds like a design for a file format,
and I admit that the second approach requires a lot of work before seeing
any results.
Hmm. How about a third approach: write a library that a) reads unmodified
database files into memory, maybe together with a mapping file containing
the dates when these files became valid, and b) resolves IP addresses and
dates to country codes or ASes. We wouldn't want to keep the full file
contents in memory, only the information needed to look up an IP address
and date. We can still think about a compact file format for that later
on.
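To make the idea concrete, here is a minimal sketch of such a lookup
library. It is only an illustration under assumptions: the names
GeoHistory, add_snapshot and lookup are hypothetical, the ranges are
passed in directly instead of being parsed from real database files and a
mapping file, only IPv4 is handled, and the addresses (TEST-NET-1) and
codes (AA, ZZ) in the example are made up.

```python
import bisect
import ipaddress
from datetime import date


class GeoHistory:
    """Resolve (IP address, date) pairs against several database snapshots.

    Hypothetical sketch: IPv4 only, ranges within a snapshot must not overlap.
    """

    def __init__(self):
        self._valid_from = []  # sorted dates when each snapshot became valid
        self._snapshots = []   # parallel list of (starts, ends, codes) tuples

    def add_snapshot(self, valid_from, ranges):
        """Register one database; ranges is an iterable of (first, last, code)."""
        rows = sorted(
            (int(ipaddress.ip_address(first)),
             int(ipaddress.ip_address(last)),
             code)
            for first, last, code in ranges
        )
        snapshot = ([r[0] for r in rows], [r[1] for r in rows],
                    [r[2] for r in rows])
        # Keep snapshots ordered by validity date, whatever the insertion order.
        pos = bisect.bisect_left(self._valid_from, valid_from)
        self._valid_from.insert(pos, valid_from)
        self._snapshots.insert(pos, snapshot)

    def lookup(self, address, when):
        """Return the code valid for address on date when, or None."""
        # Latest snapshot that became valid on or before the requested date.
        idx = bisect.bisect_right(self._valid_from, when) - 1
        if idx < 0:
            return None
        starts, ends, codes = self._snapshots[idx]
        num = int(ipaddress.ip_address(address))
        # Last range starting at or before the address, if it also covers it.
        i = bisect.bisect_right(starts, num) - 1
        if i >= 0 and num <= ends[i]:
            return codes[i]
        return None


# Made-up example data: the same range maps to different codes over time.
history = GeoHistory()
history.add_snapshot(date(2012, 6, 1), [("192.0.2.0", "192.0.2.255", "AA")])
history.add_snapshot(date(2012, 7, 1), [("192.0.2.0", "192.0.2.255", "ZZ")])
```

Both the choice of snapshot and the choice of range are binary searches, so
lookups stay cheap even if initialization is slow. Only start, end and code
are kept per range, not the rest of each line.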
This third approach has the disadvantage that initializing the lookup
library may take a while (tens of seconds, maybe minutes). But it reduces
development time a lot at the beginning. Also, we may learn a lot about
compact representations of address ranges, dates, country codes, and ASes
which we can use to design a good file format later on.
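As one example of what a more compact in-memory representation might look
like, the sketch below merges numerically adjacent ranges that resolve to
the same code. This is just one possible technique, not something already
decided; the function name coalesce and the integer ranges in the test
data are assumptions made for illustration.

```python
def coalesce(ranges):
    """Merge adjacent (start, end, code) ranges that share the same code.

    Hypothetical helper: start and end are integer addresses, inclusive,
    and the input ranges are assumed not to overlap.
    """
    merged = []
    for start, end, code in sorted(ranges):
        previous = merged[-1] if merged else None
        if previous and previous[2] == code and previous[1] + 1 == start:
            # Extend the previous range instead of storing a new one.
            merged[-1] = (previous[0], end, code)
        else:
            merged.append((start, end, code))
    return merged
```

Ranges separated by a gap are left alone, so addresses that are not in the
database still resolve to nothing after merging.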
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/6471#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online