Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"

To: tor-dev@xxxxxxxxxxxxxxxxxxxx
Subject: Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"
From: Iain Learmonth <irl@xxxxxxxxxxxxxx>
Date: Tue, 13 Feb 2018 10:55:30 +0000
In-reply-to: <20180212235522.GA28876@patternsinthevoid.net>
Organization: Tor Project

Hi,

On 12/02/18 23:55, isis agora lovecruft wrote:
>  1. What passes for "canonicalised" "utf-8" in C will be different to
>     what passes for "canonicalised" "utf-8" in Rust.  In C, the
>     following will not be allowed (whereas they are allowed in Rust):
>         - NUL (0x00)
>         - Byte Order Mark (0xFEFF)

Much of the metrics software is written in Java. Java strings allow
NUL to appear, but assume that there is no BOM. If a BOM appears, it
would be interpreted as data and, I assume, parsing would probably
fail. Should the whole document be rejected if it contains a NUL or
BOM, or should these values be stripped, with parsing then carrying on
as if they had never been there?
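
To make the two options concrete, here is a rough sketch in Python (not
taken from any existing parser). The names decode_strict, decode_lenient
and RejectedDocument are made up for illustration, and which behaviour
is right is exactly the open question above.

```python
BOM = "\ufeff"  # U+FEFF, encoded in UTF-8 as the bytes EF BB BF
NUL = "\x00"


class RejectedDocument(ValueError):
    """Hypothetical error type for a document that is refused outright."""


def decode_strict(raw: bytes) -> str:
    """Option 1: reject the whole document if it contains a NUL or BOM."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    if NUL in text or BOM in text:
        raise RejectedDocument("document contains NUL or BOM")
    return text


def decode_lenient(raw: bytes) -> str:
    """Option 2: strip every NUL and BOM, then parse what is left."""
    text = raw.decode("utf-8")
    return text.replace(NUL, "").replace(BOM, "")
```

With the input b"\xef\xbb\xbfrouter example\x00\n", decode_strict raises
and decode_lenient returns "router example\n". Two parsers that pick
different options would disagree about the same document.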

>  2. Directory document keywords MUST be printable ASCII.

This can be validated. Should a single document keyword containing
printable non-ASCII be enough to reject the document, or should a parser
try to recover?
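
As a sketch of what that validation could look like (assumptions: the
keyword is the first space-delimited token on a line, "printable ASCII"
means 0x21 to 0x7E, and object blocks such as signatures are not
treated specially; first_bad_keyword is a made-up name):

```python
from typing import Optional, Tuple


def is_printable_ascii(keyword: str) -> bool:
    """True if keyword is non-empty and every character is in 0x21..0x7E."""
    return len(keyword) > 0 and all(0x21 <= ord(ch) <= 0x7E for ch in keyword)


def first_bad_keyword(document: str) -> Optional[Tuple[int, str]]:
    """Return (line number, keyword) for the first offending keyword.

    Returns None if every keyword passes. Whether the caller then rejects
    the document or skips the line is the policy question, not decided here.
    """
    for lineno, line in enumerate(document.split("\n"), start=1):
        if not line:
            continue  # skip empty lines, including the one after a final "\n"
        keyword = line.split(" ", 1)[0]
        if not is_printable_ascii(keyword):
            return lineno, keyword
    return None
```

Finding the bad keyword is the easy part. What to do with the return
value is what the proposal would need to specify.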

I'd really like to see a section in the proposal about how parsers
should react when they find something unexpected, otherwise all the
parsers may end up doing different things.

>  3. This change may break some descriptor/consensus/document parsers.
>     If you are the maintainer of a parser, you may want to start
>     thinking about this now.

For the metrics tools there are some guidelines on this we can follow:
https://docs.oracle.com/javase/tutorial/i18n/text/design.html
The other language would be Python (for stem), but Python developers
have probably got a good understanding of unicode/str/bytes by now.
(In Python 3, when decoding as UTF-8, a BOM is not stripped and is
interpreted as data, and a str can contain a NUL.)
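
A quick way to see the Python 3 behaviour, using only the standard
codecs (the sample bytes are invented, and the helper name is just for
illustration):

```python
def decode_both_ways(raw: bytes):
    """Decode raw with the plain "utf-8" codec and with "utf-8-sig"."""
    # "utf-8" keeps a leading BOM as the character U+FEFF;
    # "utf-8-sig" removes one leading BOM if it is present.
    return raw.decode("utf-8"), raw.decode("utf-8-sig")


# Invented sample: a UTF-8 BOM, some text, and a trailing NUL byte.
sample = b"\xef\xbb\xbfrouter example\x00"
plain, sig = decode_both_ways(sample)

assert plain[0] == "\ufeff"       # BOM survives as data
assert sig.startswith("router")   # BOM removed by utf-8-sig
assert "\x00" in plain            # NUL is a legal character in a str
```

So a parser that uses the plain "utf-8" codec sees the BOM as part of
the first keyword unless it checks for it explicitly.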

Thanks,
Iain.

