
Re: [tor-dev] An ANTLR 4 grammar for Tor bridge network statuses

To: tor-dev@xxxxxxxxxxxxxxxxxxxx
Subject: Re: [tor-dev] An ANTLR 4 grammar for Tor bridge network statuses
From: Nick Mathewson <nickm@xxxxxxxxxxxxxx>
Date: Wed, 4 Nov 2015 11:43:28 -0500

On Wed, Nov 4, 2015 at 4:06 AM, Karsten Loesing <karsten@xxxxxxxxxxxxxx> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hello developers,
>
> in the past few days I have been working on a grammar to parse Tor
> bridge network statuses and hopefully other Tor descriptors in the
> future.  It's working, for some definition of working, but some issues
> remain and I need some help.
>
> I just uploaded my sources, consisting only of the grammar with a fair
> amount of documentation:
>
> https://people.torproject.org/~karsten/volatile/BridgeNetworkStatus.g4


Nice work, Karsten!  I'm hoping we move towards some kind of
machine-readable grammar/schema for all our data formats, and that we
have our actual parsing/encoding code generated from it.

(When I did a survey of where all our crash/assertion bugs for the
last few years were, they seemed to have a higher-than-usual
concentration in our parsing code.)

One thing about this grammar in particular, though: It is over-strict.
It matches only the formats we use today, and not the formats we are
allowed to use in the future.  For one example, a flag on an 's' line
can be any non-space string - but this grammar will fail to parse
unrecognized flags.
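
As a rough illustration of the lenient behavior described above, here is
a minimal Python sketch that accepts any non-space string as a flag and
merely reports which ones it does not recognize.  The function name
parse_s_line and the small known_flags set are hypothetical and exist
only for this example; they are not taken from the grammar or from any
existing library.

```python
def parse_s_line(line, known_flags=frozenset({"Running", "Stable", "Valid"})):
    """Split an 's' line into recognized and unrecognized flags.

    known_flags is an illustrative subset, not an authoritative list.
    Unrecognized flags are returned instead of causing a parse failure.
    """
    keyword, _, rest = line.partition(" ")
    if keyword != "s":
        raise ValueError("not an 's' line: %r" % line)
    # An 's' line with no flags at all yields an empty list.
    flags = rest.split(" ") if rest else []
    if any(flag == "" for flag in flags):
        raise ValueError("empty flag (repeated space?) in %r" % line)
    known = [f for f in flags if f in known_flags]
    unknown = [f for f in flags if f not in known_flags]
    return known, unknown


known, unknown = parse_s_line("s Running Stable SomeFutureFlag")
print(known, unknown)
```

The point of the design is that a flag nobody has invented yet still
parses; the caller decides what to do with it.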

On the other hand, while we specify the order of the r, s, w, p, and a
lines in a generated consensus, clients are required to accept the s,
w, p, and a lines in any order, but must not allow two s lines in a
single 'r' entry.

I think that because of the free-ordering and multiplicity-restriction
rules for our data formats, a context-free grammar simply isn't going
to match our spec very well.
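
To make the free-ordering and multiplicity point concrete, here is a
small hand-written Python sketch of the check a client has to perform.
Everything in it is an assumption for illustration: the parse_entry
name, the MAX_COUNT table (in particular the limits chosen for w, p,
and a), and the choice to skip unrecognized keywords.  It only shows
that the rule is easy to state with a counter, which is exactly the
part that is awkward to express in a context-free grammar.

```python
# Assumed per-keyword limits for illustration; None means "any number".
MAX_COUNT = {"s": 1, "w": 1, "p": 1, "a": None}


def parse_entry(lines):
    """Parse one 'r' entry whose following lines may come in any order."""
    if not lines or not lines[0].startswith("r "):
        raise ValueError("entry must start with an 'r' line")
    entry = {"r": lines[0][2:]}
    seen = {}
    for line in lines[1:]:
        keyword, _, rest = line.partition(" ")
        if keyword not in MAX_COUNT:
            continue  # assumption: unrecognized keyword lines are skipped
        seen[keyword] = seen.get(keyword, 0) + 1
        limit = MAX_COUNT[keyword]
        if limit is not None and seen[keyword] > limit:
            raise ValueError("too many %r lines in one entry" % keyword)
        entry.setdefault(keyword, []).append(rest)
    return entry


# The same entry with its lines in two different orders parses identically.
first = parse_entry(["r nick AAAA", "s Running", "w Bandwidth=10"])
second = parse_entry(["r nick AAAA", "w Bandwidth=10", "s Running"])
print(first == second)
```

A grammar can approximate this by enumerating permutations, but the
number of alternatives grows quickly with each additional keyword.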

> Quoting from that file to facilitate discussion here:
>
> There are multiple goals of having a grammar for Tor descriptors
> available on CollecTor:
>
> 1. Translate descriptors to JSON for statistical analysis: Some tools
> and databases require Tor descriptors in a standard format like JSON.
>  This grammar and a parser generated from it can help making that
> translation as easy as possible, also to keep future maintenance as
> low as possible.
>
> 2. Provide a basis for descriptor-parsing libraries: As of late 2015,
> there are three libraries for parsing Tor descriptors: metrics-lib for
> Java, Stem for Python, and Zoossh for Go.  It would be beneficial to
> place as much knowledge about the descriptor format into a grammar
> shared by all those libraries and then generate parsers for different
> languages from that grammar.
>
> 3. Serve as documentation for the Tor directory protocol
> specification: Tor descriptors are already documented using a
> hand-written grammar, but that may contain slight inaccuracies because
> it's not verified.  This grammar could fix that by either detecting
> inaccuracies while trying to rewrite it to an executable grammar form
> or by replacing the grammar in the specification documentation with
> this executable grammar.
>
> Open issues and questions:
>
>  - Was it smart to explicitly include all those SP tokens in the
> rules, or should those be discarded right away by the lexer?  The main
> reason for keeping them was to stay as close to the specification as
> possible, but maybe that has downsides on the other goals.

IMO, once we have a grammar that is truly correct, that grammar should
_be_ the spec, and we should revise the main spec to reference the
grammar.

>  - If a bridge uses a nickname (or other token that's supposed to be a
> STRING) that is also a keyword like "r" or "published", things get
> confusing.  Try editing the input bridge network status and observe
> the result.  But those are perfectly valid nicknames, so what can we do?

Change the lexing rules so that keywords are only recognized as such
at position 0 on the line, outside of a base64 block?
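
A hand-rolled Python sketch of that lexing rule, to show the idea
outside of ANTLR: only the first word of a line is ever considered a
keyword, and lines between "-----BEGIN" and "-----END" markers are
passed through as opaque data.  The lex function, the token names, and
the KEYWORDS subset are all hypothetical; expressing the same thing in
an ANTLR 4 lexer would need its own mechanism, which this sketch does
not attempt.

```python
KEYWORDS = {"r", "s", "w", "p", "a", "published"}  # illustrative subset


def lex(text):
    """Tokenize text, recognizing keywords only at the start of a line."""
    tokens = []
    in_block = False
    for line in text.splitlines():
        if in_block:
            # Inside a base64 block nothing is treated as a keyword.
            if line.startswith("-----END "):
                in_block = False
                tokens.append(("END", line))
            else:
                tokens.append(("BLOCK_DATA", line))
            continue
        if line.startswith("-----BEGIN "):
            in_block = True
            tokens.append(("BEGIN", line))
            continue
        words = line.split(" ")
        first = words[0]
        kind = "KEYWORD" if first in KEYWORDS else "UNKNOWN_KEYWORD"
        tokens.append((kind, first))
        # Everything after position 0 is a plain string, even "r" or "published".
        tokens.extend(("STRING", word) for word in words[1:])
        tokens.append(("NL", ""))
    return tokens


for token in lex("r published AAAA\ns Running"):
    print(token)
```

With this rule a bridge nicknamed "published" or "r" lexes as an
ordinary STRING, because it never appears at position 0.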

best wishes,
-- 
Nick
