[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]
Re: [tor-bugs] #9380 [BridgeDB]: BridgeDB should use Stem for parsing descriptors
#9380: BridgeDB should use Stem for parsing descriptors
-------------------------+-------------------------------------------------
Reporter: sysrqb | Owner: isis
Type: | Status: needs_review
enhancement | Milestone:
Priority: blocker | Version:
Component: | Keywords: stem, bridgedb-0.3.0, bridgedb-
BridgeDB | parsers, isis2014Q3Q4, isisExB
Resolution: | Parent ID:
Actual Points: |
Points: |
-------------------------+-------------------------------------------------
Comment (by atagar):
Hi Isis, here's some quick feedback...
-----
{{{
from stem.descriptor import extrainfo_descriptor
from stem.descriptor import server_descriptor
}}}
Unused.
-----
{{{
def parseServerDescriptorsFile(filename, validate=True):
"""Parse a file which contains ``@type bridge-server-descriptor``s.
.. note:: ``validate`` defaults to ``False`` because there appears to
be a
bug in Leekspin, the fake descriptor generator, where Stem thinks
the
fingerprint doesn't match the keyâ
.. note:: We have to lie to Stem, pretending that these are ``@type
server-descriptor``s, **not** ``@type bridge-server-descriptor``s.
See ticket `#11257`_.
.. _`#11257`: https://trac.torproject.org/projects/tor/ticket/11257
:param str filename: The file to parse descriptors from.
:param bool validate: Whether or not to validate descriptor
contents. (default: ``False``)
:rtype: list
:returns: A list of
:api:`stem.descriptor.server_descriptor.RelayDescriptor`s.
"""
logging.info("Parsing server descriptors with Stem: %s" % filename)
descriptorType = 'server-descriptor 1.0'
document = parse_file(filename, descriptorType, validate=validate)
routers = list(document)
return routers
}}}
This whole function is just 25 lines for doing...
{{{
routers = list(parse_file(filename, 'server-descriptor 1.0', validate =
validate))
}}}
Really doesn't seem like it's providing any value.
----
{{{
lib/bridgedb/configure.py:
def loadConfig(configFile=None, configCls=None):
"""Load configuration settings on top of the current settings.
...
}}}
You might want to take a look at...
https://stem.torproject.org/api/util/conf.html
This is a module used by stem and arm so they can have nice config
files...
https://gitweb.torproject.org/stem.git/tree/test/settings.cfg
https://gitweb.torproject.org/arm.git/tree/armrc.sample
It provides type inferencing, defaults, etc without doing exec() as you
presently are (... which is almost never a great idea). You can then use
config_dict() to specify the config values each file cares about and their
default values. For example...
https://gitweb.torproject.org/arm.git/tree/arm/header_panel.py#n27
Then when the config for that singleton is read all the configurations get
the updated values.
----
{{{
document = parse_file(filename, descriptorType, validate=validate)
try:
for router in document:
descriptors.append(router)
except (ValueError, ProtocolError) as error:
logging.error(
("Stem exception while parsing extrainfo descriptor from "
"file '%s':\n%s") % (filename, str(error)))
_copyUnparseableDescriptorFile(filename)
}}}
No reason for the for loop, same as...
{{{
descriptors = list(parse_file(filename, descriptorType,
validate=validate))
}}}
----
{{{
def parseNetworkStatusFile(filename, validate=True, skipAnnotations=True,
descriptorClass=RouterStatusEntryV3):
"""Parse a file which contains an ``@type bridge-networkstatus``
document.
See `ticket #12254 <https://bugs.torproject.org/12254>`__ for why
networkstatus-bridges documents don't look anything like the
networkstatus
v2 documents that they are purported to look like. They are missing
all
headers, and the entire footer including authority signatures.
}}}
Actually, you could probably simply parse it like a normal network status
document but without validation.
----
{{{
def deduplicate(descriptors):
"""Deduplicate some descriptors, returning only the newest for each
router.
.. note:: If two descriptors for the same router are discovered, AND
both
descriptors have the **same** published timestamp, then the
router's
fingerprint WILL BE LOGGED ON PURPOSE, because we assume that
router
to be malicious (deliberately, or unintentionally).
:param list descriptors: A list of
:api:`stem.descriptor.server_descriptor.RelayDescriptor`s,
:api:`stem.descriptor.extrainfo_descriptor.BridgeExtraInfoDescriptor`s,
:api:`stem.descriptor.router_status_entry.RouterStatusEntryV2`s.
"""
duplicates = {}
nonDuplicates = {}
for descriptor in descriptors:
fingerprint = descriptor.fingerprint
logging.debug("Deduplicating %s descriptor for router %s"
% (str(descriptor.__class__).rsplit('.', 1)[1],
safelog.logSafely(fingerprint)))
if fingerprint in nonDuplicates.keys():
# We already found a descriptor for this fingerprint:
conflict = nonDuplicates[fingerprint]
# If the descriptor we are currently parsing is newer than the
# last one we found:
if descriptor.published > conflict.published:
# And if this is the first duplicate we've found for this
# router, then create a list in the ``duplicates``
dictionary
# for the router:
if not fingerprint in duplicates.keys():
duplicates[fingerprint] = list()
# Add this to the list of duplicates for this router:
duplicates[fingerprint].append(conflict)
# Finally, put the newest descriptor in the
``nonDuplicates``
# dictionary:
nonDuplicates[fingerprint] = descriptor
# Same thing, but this time the one we're parsing is older:
elif descriptor.published < conflict.published:
if not fingerprint in duplicates.keys():
duplicates[fingerprint] = list()
duplicates[fingerprint].append(descriptor)
# This *shouldn't* happen. It would mean that two descriptors
for
# the same router had the same timestamps, probably meaning
there
# is a severely-messed up OR implementation out there. Let's
log
# its fingerprint (no matter what!) so that we can look up its
# ``platform`` line in its server-descriptor and tell whoever
# wrote that code that they're probably (D)DOSing the Tor
network.
else:
raise DescriptorWarning(
("Duplicate descriptor with identical timestamp (%s)
for "
"router with fingerprint '%s'!")
% (descriptor.published, fingerprint))
# Hoorah! No duplicates! (yet...)
else:
nonDuplicates[fingerprint] = descriptor
logging.info("Descriptor deduplication finished.")
logging.info("Number of duplicates: %d" % len(duplicates))
for (fingerprint, dittos) in duplicates.items():
logging.info(" For %s: %d duplicates"
% (safelog.logSafely(fingerprint), len(dittos)))
logging.info("Number of non-duplicates: %d" % len(nonDuplicates))
return nonDuplicates
}}}
Off the top of my head think the following does the same as all the
above...
{{{
# makes a map of {fingerprint => [all descriptors with that fingerprint]}
fingerprint_to_descriptors = {}
for desc in descriptors:
fingerprint_to_descriptors.setdefault(desc.fingerprint,
[]).append(descriptor)
# provide back the latest
return [sorted(descriptors, key = lambda d: d.published, reversed =
True)[0] for descriptors in fingerprint_to_descriptors.values()]
}}}
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/9380#comment:40>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
_______________________________________________
tor-bugs mailing list
tor-bugs@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs