Hi,

On 12/02/18 23:55, isis agora lovecruft wrote:
> 1. What passes for "canonicalised" "utf-8" in C will be different to
>    what passes for "canonicalised" "utf-8" in Rust. In C, the
>    following will not be allowed (whereas they are allowed in Rust):
>      - NUL (0x00)
>      - Byte Order Mark (0xFEFF)

Much of the metrics software is written in Java. Java strings allow NUL
to appear, but assume that there is no BOM. If a BOM appears, it would
be interpreted as data and, I assume, parsing would probably fail.

Should the whole document be rejected if it contains a NUL or BOM, or
should these values be stripped and parsing carry on as if nothing had
happened?

> 2. Directory document keywords MUST be printable ASCII.

This can be validated. Should a single document keyword containing
printable non-ASCII characters be enough to reject the document, or
should a parser try to recover?

I'd really like to see a section in the proposal about how parsers
should react when they find something unexpected, otherwise all the
parsers may end up doing different things.

> 3. This change may break some descriptor/consensus/document parsers.
>    If you are the maintainer of a parser, you may want to start
>    thinking about this now.

For the metrics tools there are some guidelines on this we can follow:

https://docs.oracle.com/javase/tutorial/i18n/text/design.html

The other language would be Python (for stem), but Python developers
have probably got a good understanding of unicode/str/bytes by now. (In
Python 3, when decoding as UTF-8, a BOM is not stripped and is
interpreted as data, and a str can contain a NUL.)

Thanks,
Iain.
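
P.S. To make the Python 3 behaviour mentioned above concrete, here is a rough sketch. It is illustrative only: `find_unexpected` is a hypothetical helper, not something that exists in stem or the metrics tools, and it only reports what it finds. It does not decide whether to reject or strip.

```python
BOM = "\ufeff"  # U+FEFF, encoded in UTF-8 as the bytes EF BB BF
NUL = "\x00"


def find_unexpected(raw):
    """Decode bytes as UTF-8 and list which of BOM/NUL appear in the text.

    Hypothetical helper for illustration; not part of any Tor tool.
    """
    # The plain "utf-8" codec does not strip a BOM; it stays in the str
    # as the code point U+FEFF, i.e. it is treated as data.
    text = raw.decode("utf-8")
    found = []
    if BOM in text:
        found.append("BOM")
    if NUL in text:
        found.append("NUL")
    return found


sample = b"\xef\xbb\xbfrouter example\x00\n"
print(find_unexpected(sample))

# By contrast, the "utf-8-sig" codec drops a single leading BOM.
print(sample.decode("utf-8-sig").startswith(BOM))
```

Whether a parser should then reject the document or strip the code points and continue is exactly the open question for the proposal.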
_______________________________________________ tor-dev mailing list tor-dev@xxxxxxxxxxxxxxxxxxxx https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev