How many current descriptors will be rejected as non-UTF-8?
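One rough way to get a number is to run a strict UTF-8 decoder over a file of existing descriptors and count the failures. The sketch below is only an illustration: `split_descriptors` and `count_non_utf8` are hypothetical helper names, and it assumes the input is a concatenation of descriptors that each begin with a `router ` line, which may not match every archive format.

```python
def split_descriptors(blob):
    """Split a bytes blob into descriptors.

    Assumption: every descriptor starts with a line beginning "router ".
    Anything before the first such line (e.g. annotations) is ignored.
    """
    descriptors, current = [], None
    for line in blob.split(b"\n"):
        if line.startswith(b"router "):
            if current is not None:
                descriptors.append(b"\n".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        descriptors.append(b"\n".join(current))
    return descriptors


def count_non_utf8(blob):
    """Return (total, rejected) using Python's strict UTF-8 decoder."""
    descriptors = split_descriptors(blob)
    rejected = 0
    for desc in descriptors:
        try:
            desc.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            rejected += 1
    return len(descriptors), rejected


sample = (
    b"router a 192.0.2.1 9001 0 0\ncontact Alice\n"
    b"router b 192.0.2.2 9001 0 0\ncontact Bj\xf6rn\n"  # Latin-1 byte, not UTF-8
)
total, rejected = count_non_utf8(sample)
```

Python's strict decoder rejects overlong encodings and encoded surrogates, but accepts unassigned and private use code points, so the count depends on which of the policy questions below we answer "yes" to.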
Do we exclude all invalid byte sequences? Do we exclude all invalid code points (some libraries don't)? Do we reject unassigned or reserved code points? Do we reject private use code points? How do we avoid tying ourselves to a particular version of Unicode? (By accepting reserved code points? Some libraries don't do this.)

Will we allow a byte order mark? (We can't during the transition, it doesn't parse as ASCII. And we probably shouldn't for any verbatim lines, because they are copied into the middle of the descriptor.)

We will need to update the directory spec to acknowledge that contact and platform lines may be parsed as UTF-8 or ASCII-including-arbitrary-bytes-except-NUL, and that they are terminated by single-byte newlines regardless.

How do we deal with format confusion attacks? UTF-8 has a few alternative whitespace characters. These could be used in an attack that confuses either humans viewing the file, or automated software:

If a human uses a UTF-8 compatible viewer or editor, it likely shows Unicode newlines and ASCII newlines in an identical way. Similarly, it may show Unicode spaces and ASCII spaces in the same way. This may confuse the human reader.

Similarly, if automated software parses using a Unicode whitespace or newline character class, it will mis-parse directory documents. (Our Rust protover code looks for ASCII spaces, so it appears to be fine.)

Note that we already have this issue with line feeds and carriage returns, which I thought we had solved by banning carriage returns in directory documents. But it appears we allow "any printing ASCII character". (We will have to edit this to include Unicode.)
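To make the questions above concrete, here is a minimal sketch of one possible strict policy for a single verbatim line (for example, a contact line). It is not what tor does or what the proposal specifies: `check_verbatim_line` is a hypothetical name, and the choices to reject a byte order mark, NUL, and any non-ASCII space or line separator are assumptions picked for illustration.

```python
import unicodedata

BOM = "\ufeff"
NEL = "\u0085"  # "next line" control; many Unicode-aware tools treat it as a newline


def check_verbatim_line(raw):
    """Return a list of reasons to reject `raw` (bytes); empty if acceptable.

    Assumed policy: strict UTF-8, no NUL, no byte order mark, and no
    whitespace or line separators other than the ASCII ones.
    """
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return ["invalid UTF-8 byte sequence"]

    problems = []
    if "\x00" in text:
        problems.append("NUL")
    if BOM in text:
        problems.append("byte order mark")
    for ch in text:
        if ord(ch) < 128:
            continue  # ASCII is handled by the existing dir-spec rules
        # Zs = space separator, Zl = line separator, Zp = paragraph separator
        if unicodedata.category(ch) in ("Zs", "Zl", "Zp") or ch == NEL:
            problems.append("non-ASCII whitespace U+%04X" % ord(ch))
    return problems
```

The newline confusion is easy to reproduce: splitting the bytes on `b"\n"` leaves a line containing U+2028 intact, while Python's `str.splitlines()` on the decoded text breaks it in two. A parser that uses the second approach would see a different document than one that uses the first.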
If we apply the existing restrictions in dir-spec, which require non-directory-descriptor directory documents to be ASCII, they will also be UTF-8. Isn't it confusing to say "UTF-8", when what we really mean is "ASCII"? Do we expect to migrate these to non-ASCII UTF-8 at some point? Also, does "non-directory-descriptor directory documents" mean we can reject non-UTF-8 microdescriptors? I think we should. Does the NS consensus contain any lines that are copied verbatim from descriptors?
typo: plaintexts
We also can't reject bridge descriptors at the authority level. (Bridge clients download bridge descriptors directly from bridges.) Do we need bridge clients to also use this consensus parameter?

T
_______________________________________________
tor-dev mailing list
tor-dev@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev