
Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"




On 13 Feb 2018, at 21:55, Iain Learmonth <irl@xxxxxxxxxxxxxx> wrote:

Hi,

On 12/02/18 23:55, isis agora lovecruft wrote:
1. What passes for "canonicalised" "utf-8" in C will be different to
   what passes for "canonicalised" "utf-8" in Rust.  In C, the
   following will not be allowed (whereas they are allowed in Rust):
       - NUL (0x00)
       - Byte Order Mark (U+FEFF, encoded in UTF-8 as 0xEF 0xBB 0xBF)

Much of the metrics software is written in Java. Java strings allow for
NUL to appear, but assume that there is no BOM. If a BOM appears, then
this would be interpreted as data and, I assume, parsing would probably
fail. Should the whole document be rejected if it contains a NUL or BOM,
or should these values be stripped and then carry on parsing as if it
never happened?

Directory authorities and bridge clients already reject descriptors that
contain NUL. (This is an artefact of the C implementation: the descriptor
is seen as truncated, so it won't parse.)

We should specify rejection for BOM as well.
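
A minimal sketch of that rejection rule in Python, assuming the whole document is available as bytes before parsing. The names `DocumentRejected` and `validate_document_bytes` are hypothetical and not part of tor, stem, or the proposal, and rejecting U+FEFF anywhere in the document (not just at the start) is an assumption:

```python
class DocumentRejected(ValueError):
    """Hypothetical error type: the whole document must be rejected."""


UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF as it appears when encoded in UTF-8


def validate_document_bytes(raw: bytes) -> str:
    """Reject NUL, BOM, and invalid UTF-8; otherwise return the decoded text."""
    if b"\x00" in raw:
        raise DocumentRejected("document contains NUL")
    # Assumption: a BOM is rejected wherever it appears, not only at offset 0.
    if UTF8_BOM in raw:
        raise DocumentRejected("document contains a byte order mark")
    try:
        # Strict decoding: any malformed sequence rejects the document.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise DocumentRejected(f"document is not valid UTF-8: {exc}") from exc
```

Nothing is stripped or repaired here; the document is either accepted unchanged or rejected as a whole.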

2. Directory document keywords MUST be printable ASCII.

This can be validated. Should a single document keyword containing
printable non-ASCII be enough to reject the document, or should a parser
try to recover?

If parsers want to be consistent with the Tor implementation, they should
reject.
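
As a rough sketch of what "reject" could look like in Python: the function name `check_keywords` is hypothetical, and treating the first space- or tab-delimited token of every non-empty line as the keyword is a simplifying assumption, not the directory specification's grammar:

```python
import re


def check_keywords(document: str) -> None:
    """Raise ValueError if any line's leading token is not printable ASCII."""
    for lineno, line in enumerate(document.split("\n"), start=1):
        if not line:
            continue  # skip empty lines, including the one after a final newline
        # Assumption: the keyword is everything before the first space or tab.
        keyword = re.split(r"[ \t]", line, maxsplit=1)[0]
        # Printable ASCII without the space character: 0x21 through 0x7E.
        if not keyword or not all(0x21 <= ord(ch) <= 0x7E for ch in keyword):
            raise ValueError(f"line {lineno}: keyword is not printable ASCII")
```

Only the keyword is checked, so non-ASCII UTF-8 in the rest of a line (a contact line, for example) still passes.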

I'd really like to see a section in the proposal about how parsers
should react when they find something unexpected, otherwise all the
parsers may end up doing different things.

+1

3. This change may break some descriptor/consensus/document parsers.
   If you are the maintainer of a parser, you may want to start
   thinking about this now.

For the metrics tools there are some guidelines on this we can follow:
https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other
language would be Python (for stem), but Python developers have probably
got a good understanding of unicode/str/bytes by now. (In Python 3: when
using UTF-8, BOM will not be stripped and will be interpreted as data,
and you can have a NUL in a str).
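
To illustrate the Python 3 behaviour described above, here is a small sketch using only the standard codecs; the sample bytes are made up for the example:

```python
# Made-up input: a UTF-8 BOM, some text, and an embedded NUL.
raw = b"\xef\xbb\xbfrouter example\x00\n"

plain = raw.decode("utf-8")     # the BOM survives as the character U+FEFF
sig = raw.decode("utf-8-sig")   # this codec strips one leading BOM

assert plain.startswith("\ufeff")
assert not sig.startswith("\ufeff")
assert "\x00" in plain          # a str can hold NUL without truncation
assert len(plain) == len(sig) + 1
```

So a parser that decodes with "utf-8" has to check for U+FEFF itself if it wants to reject the document.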

There is also Python for txtorcon, and Rust for Tor's experimental
protover implementation.

And perhaps others:
https://stem.torproject.org/faq.html#are-there-any-other-controller-libraries
https://trac.torproject.org/projects/tor/wiki/doc/ListOfTorImplementations

T
_______________________________________________
tor-dev mailing list
tor-dev@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev