[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]
protocol spec updates
Just wanted to summarize some of the design changes we talked about in a
late-night session at PET the other week. Warning: this is pretty
disorganized.
Each email (or ack or error) that arrives at the nymserver is handled
separately. Although it would be nice (from a compression point of view) to
aggregate multiple pieces of email together, doing that makes the synposis
and message identification more difficult, and increases the window during
which the nymserver is holding on to plaintext. The goal is for the
nymserver to irrevocably encrypt user data as quickly as possible.
Each user's "bucket set" (i.e. the data you get when you concatenate all the
user's buckets for a given cycle together) is a series of concatenated
"message packets". Each packet is encrypted, so that without the right keys,
the bucket set is indistinguishable from random data.
The plaintext form of the message packets are serialized s-expressions,
using the "SEXP" encoding (as used in SPKI?), defined as follows:
the s-expressions are list in which the elements are either strings or lists
to serialize a list, emit a "(", then serialize each element, then emit a ")"
to serialize a string, emit "%d:%s" % (len(str), str). Note that this is just
like DJB netstrings but without the trailing comma
All message packets are defined to be a list that starts with a
one-character packet type, such as "I" for the INDEX packet, or "M" for a
MAIL packet. The interpretation of the rest of the list depends upon the
packet type. The client is obligated to ignore extra list elements beyond
those it expects, to provide a measure of extendability.
As email arrives, it is split into three processing paths:
the email is summarized by selecting a subset of the headers. This summary
is compressed and encrypted using SynopKey(j,i), then placed in a holding
directory
the full email is compressed and encrypted using MsgKey(j,i), then placed
in the same holding directory.
the user's current message-priority criteria is applied to the plaintext to
determine a ranking (messages matching particular regexps may be favored
over others, messages with a high spamassassin score may be lower-priority,
etc). This priority is also put into the holding directory.
The holding directory is renamed using MsgID(j,i) and moved into a
per-cycle + per-user directory.
At the end of the cycle, this per-cycle+per-user directory is scanned for
messages, and the highest-priority messages that will fit are selected.
Their MsgKey-encrypted contents are set aside for inclusion in this
bucketset. All remaining messages are deferred for a later cycle, and their
SynopKey-encrypted summaries are set aside for inclusion in this bucketset.
MsgId0/MsgKey0 is reserved for the INDEX packet, therefore MsgId0 never
appears anywhere in the bucketset. MsgId1/MsgKey1 is reserved for the
SUMMARY packet, so MsgId1 may appear in the INDEX, and MsgKey1 is used to
encrypt the SUMMARY packet. The first MAIL packet in the bucketset will be
encrypted with MsgKey2 even if there is no SUMMARY packet.
To encrypt any packet, first SEXP-serialize the s-expressions, then append
0x00-byte padding until the data is a multiple of 4 bytes (maybe 8 bytes?
This is intended to make decryption more efficient on typical 32-bit
processors). Then take the length of the padded data (plus the anticipated
SHA-256 hash) as a 4-byte big-endian value and prepend it to the data. Then
take the SHA-256 hash of everything from the LENGTH prefix to the padding
and append the hash. Then encrypt the whole thing.
s = sexp_serialize(s_expressions)
while len(s) % 8:
s = s + "\x00"
length = struct.pack(">L", len(s) + 32) # 32 bytes for the sha256
s = length + s
hash = sha256.new(s).digest()
return encrypt(key, s + hash)
When decrypting and unserializing, the SEXP parser is required to extract
exactly one list from the serialized data, and stop processing as soon as it
sees the top-most closing ")" character (this prevents it from trying to
interpret the 0x00 padding bytes). The decryption process involves first
decrypting four bytes, from which it can determine the length of the rest of
the packet (up to and including the hash), then decrypt that many bytes,
then check the hash, then unserialize the data into the original
s-expressions.
The INDEX packet is constructed by taking the list of all MsgIDs that appear
in this bucketset and putting them in-order as the second element of the
INDEX packet:
INDEX = ["I", [MsgId1, MsgId2, ...]]
If there were messages that didn't make it into this cycle, their encrypted
synopses are put together into a list which makes up the body of the SUMMARY
packet. Each encrypted synopsis is attached to its corresponding MsgId. If a
MsgId appears in the INDEX, it should not appear in the SUMMARY, and vice
versa, but the protocol is defined to say that should a MsgId appear in
both, the SUMMARY entry will be ignored.
SUMMARY = ["S", [ [MsgId1, EncSynopsis1], [MsgId2, EncSynopsis2] ] ]
After the INDEX and SUMMARY packets, the rest of the bucketset contains
MAIL, ACK, and ERROR packets. ACK and ERROR packets are encrypted and
indexed the same way as MAIL packets. All such MAIL/ACK/ERROR packets MUST
appear in the bucketset in exactly the same order as their corresponding
MsgID values in the INDEX packet (otherwise the client will be unable to
decrypt them properly).
The MAIL packet is simply an s-expression with the type character "M" and the
encrypted+compressed email message:
MAIL = ["M", EncEmail]
The remainder of the last bucket is filled with random data.
Check Buckets:
To implement byzantine server detection, we define the first UserID to be a
special one (with a UserID of all zeros), and this user's buckets are all
called "Check Buckets". Each client will retrieve an equal number of real
buckets and check buckets, using queries that cannot be distinguished from
each other.
TODO: we could either define the check buckets to be at a fixed, well-known
location (like the first three buckets) and have well-known contents (like
all zeros), or we could put them at a fixed location and put the hashes of
their (random) contents in the metadata, or we could use this UserID=0
trick. The first results in some buckets looking very much different than
the others (although I don't think this loses anything, since everybody
knows those are check buckets anyways). The second results in
variable-length data in the metadata, which makes me nervous. The third
requires an extra index-bucket PIR operation to find out where the check
buckets are located.
For each cycle, all clients start by retrieving the metadata and full
metaindex. They then retrieve two index buckets: the first one (which
contains the UserID=0), and whichever index bucket contains their real
UserID. Then they make up a list of buckets to fetch: assuming their
maxbuckets value is 3, their list will contain all three check buckets plus
the three sequential buckets for their UserID. Then the client chooses a
number of distributors (perhaps 5), builds the bit vectors for each bucket
query, shuffles the vectors, then sends a message to each distributor asking
it to return the XOR for all the vectors.
We define the PIR request to contain multiple vectors because it is far
more efficient to do all the vectors in parallel on the same pass over the
dataset. We shuffle the vectors to avoid exposing which vectors are for the
real buckets and which are for the check buckets.
We're going to keep the SHORT_PIR request in place (which passes a PRNG seed
instead of a full-length bit vector) to support clients who wish to value
bandwidth over information-theoretic security.
We didn't spend a lot of time on the distributor API, but the plan is to use
SEXP encoding for its messages too. The PIR_REQUEST operation will take as
arguments a serialized list of:
["PIR_REQUEST", nymserverid, cyclenum,
[ ["LONG", bitvector0],
["LONG", bitvector1],
["SHORT", prng_seed2],
["LONG", bitvector3],
...
] ]
and will eventually provoke a response of:
["PIR_RESPONSE", [xor0, xor1, xor2, xor3, ...] ]
One other note, the client should avoid doing CPU-intense operations on the
buckets after download, in case an attacker is pinging the client and looking
for evidence of slowdown to determine which code path the client has taken.
those are the notes that I wrote down.. please feel free to amend/correct
them.
cheers,
-Brian