[tor-dev] An ANTLR 4 grammar for Tor bridge network statuses

To: "tor-dev@xxxxxxxxxxxxxxxxxxxx" <tor-dev@xxxxxxxxxxxxxxxxxxxx>
Subject: [tor-dev] An ANTLR 4 grammar for Tor bridge network statuses
From: Karsten Loesing <karsten@xxxxxxxxxxxxxx>
Date: Wed, 4 Nov 2015 10:06:06 +0100

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello developers,

In the past few days I have been working on a grammar to parse Tor
bridge network statuses and hopefully other Tor descriptors in the
future.  It's working, for some definition of working, but some issues
remain and I need some help.

I just uploaded my sources, consisting only of the grammar with a fair
amount of documentation:

https://people.torproject.org/~karsten/volatile/BridgeNetworkStatus.g4

Quoting from that file to facilitate discussion here:


Having a grammar for the Tor descriptors available on CollecTor serves
multiple goals:

1. Translate descriptors to JSON for statistical analysis: Some tools
and databases require Tor descriptors in a standard format like JSON.
This grammar and a parser generated from it can make that translation
as easy as possible and keep future maintenance effort low.

2. Provide a basis for descriptor-parsing libraries: As of late 2015,
there are three libraries for parsing Tor descriptors: metrics-lib for
Java, Stem for Python, and Zoossh for Go.  It would be beneficial to
place as much knowledge about the descriptor format as possible into a
grammar shared by all those libraries and then generate parsers for
different languages from that grammar.

3. Serve as documentation for the Tor directory protocol
specification: Tor descriptors are already documented using a
hand-written grammar, but that may contain slight inaccuracies because
it's not verified.  This grammar could fix that either by exposing
inaccuracies while the hand-written grammar is rewritten into
executable form, or by replacing the grammar in the specification
with this executable grammar.

Open issues and questions:

 - Was it smart to explicitly include all those SP tokens in the
rules, or should those be discarded right away by the lexer?  The main
reason for keeping them was to stay as close to the specification as
possible, but maybe that has downsides on the other goals.

 - If a bridge uses a nickname (or another token that's supposed to be
a STRING) that is also a keyword like "r" or "published", things get
confusing.  Try editing the input bridge network status and observe
the result.  But those are perfectly valid nicknames, so what can we do?

 - It would be really nice to use regular expressions in the grammar
to match input more thoroughly than just ~[ \n]+, if only we can fix
the lexer troubles.  Otherwise, all that verification work would need
to be duplicated in each of the language-dependent parsers, which
kind of defeats the purpose.

 - Is it easy to walk the parse tree and output a JSON format
*without* having to write code for each of the rules?  Ideally, the
translator would be 20 lines of code and not grow at all if we add 10
more descriptor types.  Do we need to change the grammar for that?

 - The following may turn out to be a non-issue, but some descriptors
require lines to be ordered, e.g., "accept" and "reject" lines in
server descriptors, and we'll have to retain that order in the parse
tree.  This should be similar to how we parse entries, starting with
"r" lines, but who knows.

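To make the keyword, ordering, and JSON questions above a bit more
concrete, here is a minimal hand-written sketch in Python.  It is not
ANTLR and not generated from the grammar.  It only illustrates one
possible approach: treat the first word on a line as the keyword and
everything after it as plain arguments, keep lines in input order, and
start a new entry at each "r" line.  The function name parse_status,
the output layout, and the sample input are all made up for
illustration, and the sample lines are simplified rather than taken
from a real bridge network status.

```python
import json


def parse_status(text):
    """Split a status into header lines and entries starting at "r" lines.

    Hypothetical helper: the output layout is an assumption, not a format
    defined by the grammar or the specification.
    """
    header, entries = [], []
    current = None  # entry being filled, or None while still in the header
    for line in text.splitlines():
        if not line:
            continue
        # Only the first word of a line is treated as a keyword.  Whatever
        # follows is a plain argument, even if it spells "r" or "published".
        keyword, _, rest = line.partition(" ")
        item = {"keyword": keyword, "args": rest.split(" ") if rest else []}
        if keyword == "r":
            # An "r" line opens a new entry; later lines are appended in
            # input order, so the original line order is retained.
            current = [item]
            entries.append(current)
        elif current is not None:
            current.append(item)
        else:
            header.append(item)
    return {"header": header, "entries": entries}


# Made-up, simplified input with nicknames that collide with keywords.
SAMPLE = (
    "published 2015-11-04 09:00:00\n"
    "r published AAAA BBBB 2015-11-04 08:00:00 10.0.0.1 443 0\n"
    "s Running Valid\n"
    "r r CCCC DDDD 2015-11-04 08:30:00 10.0.0.2 9001 0\n"
    "s Running\n"
)

if __name__ == "__main__":
    print(json.dumps(parse_status(SAMPLE), indent=2))
```

The only keyword this sketch knows about is "r", so adding more line
types would not grow the code.  The price is that it verifies nothing
about the arguments, which is exactly the part a grammar should cover.
Whether the same position-based distinction can be expressed cleanly in
the ANTLR lexer is still an open question for me.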

Feedback much appreciated!

All the best,
Karsten
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJWOcp+AAoJEJD5dJfVqbCrnQUH/2dp8ER6ZcEGBtHP+dPb/4p0
0tKb4eZobQhZNx3oQOc08nCJl3AEa+Vedep5Caa9MSNycopf7mBFEGtw2V5J3mKN
w6D6cvSbSBoFhuh/+Q8oVj+6h0KkUaCVVMaTHefb63usM0EmjsEXvDjBXr+g5nhn
q0RqM1Id3V38rs3pKi1JDwGU4w5X45gzUPOXbiNGig6wJuLN1e2cxfF4RdDmGzST
JvjlH/KRV59NjMRvUAeTxZIXlz6fKwjTWWQ2PXUnuXAXNPVxYakzHNhiT7qXGro0
7ZFfIr7gwk9kZlF0oy6ltFC1mGgL4xk0vqlrOjvwrh+oAzIciurMcOddXEHwr3E=
=DZ0Z
-----END PGP SIGNATURE-----