[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]

protocol spec updates



Just wanted to summarize some of the design changes we talked about in a
late-night session at PET the other week. Warning: this is pretty
disorganized.

 Each email (or ack or error) that arrives at the nymserver is handled
 separately. Although it would be nice (from a compression point of view) to
 aggregate multiple pieces of email together, doing that makes the synposis
 and message identification more difficult, and increases the window during
 which the nymserver is holding on to plaintext. The goal is for the
 nymserver to irrevocably encrypt user data as quickly as possible.

 Each user's "bucket set" (i.e. the data you get when you concatenate all the
 user's buckets for a given cycle together) is a series of concatenated
 "message packets". Each packet is encrypted, so that without the right keys,
 the bucket set is indistinguishable from random data.

 The plaintext form of the message packets are serialized s-expressions,
 using the "SEXP" encoding (as used in SPKI?), defined as follows:

  the s-expressions are list in which the elements are either strings or lists

  to serialize a list, emit a "(", then serialize each element, then emit a ")"

  to serialize a string, emit "%d:%s" % (len(str), str). Note that this is just
  like DJB netstrings but without the trailing comma

 All message packets are defined to be a list that starts with a
 one-character packet type, such as "I" for the INDEX packet, or "M" for a
 MAIL packet. The interpretation of the rest of the list depends upon the
 packet type. The client is obligated to ignore extra list elements beyond
 those it expects, to provide a measure of extendability.

 As email arrives, it is split into three processing paths:

  the email is summarized by selecting a subset of the headers. This summary
  is compressed and encrypted using SynopKey(j,i), then placed in a holding
  directory

  the full email is compressed and encrypted using MsgKey(j,i), then placed
  in the same holding directory.

  the user's current message-priority criteria is applied to the plaintext to
  determine a ranking (messages matching particular regexps may be favored
  over others, messages with a high spamassassin score may be lower-priority,
  etc). This priority is also put into the holding directory.

  The holding directory is renamed using MsgID(j,i) and moved into a
  per-cycle + per-user directory.

 At the end of the cycle, this per-cycle+per-user directory is scanned for
 messages, and the highest-priority messages that will fit are selected.
 Their MsgKey-encrypted contents are set aside for inclusion in this
 bucketset. All remaining messages are deferred for a later cycle, and their
 SynopKey-encrypted summaries are set aside for inclusion in this bucketset.

 MsgId0/MsgKey0 is reserved for the INDEX packet, therefore MsgId0 never
 appears anywhere in the bucketset. MsgId1/MsgKey1 is reserved for the
 SUMMARY packet, so MsgId1 may appear in the INDEX, and MsgKey1 is used to
 encrypt the SUMMARY packet. The first MAIL packet in the bucketset will be
 encrypted with MsgKey2 even if there is no SUMMARY packet.

 To encrypt any packet, first SEXP-serialize the s-expressions, then append
 0x00-byte padding until the data is a multiple of 4 bytes (maybe 8 bytes?
 This is intended to make decryption more efficient on typical 32-bit
 processors). Then take the length of the padded data (plus the anticipated
 SHA-256 hash) as a 4-byte big-endian value and prepend it to the data. Then
 take the SHA-256 hash of everything from the LENGTH prefix to the padding
 and append the hash. Then encrypt the whole thing.

  s = sexp_serialize(s_expressions)
  while len(s) % 8:
    s = s + "\x00"
  length = struct.pack(">L", len(s) + 32) # 32 bytes for the sha256
  s = length + s
  hash = sha256.new(s).digest()
  return encrypt(key, s + hash)

 When decrypting and unserializing, the SEXP parser is required to extract
 exactly one list from the serialized data, and stop processing as soon as it
 sees the top-most closing ")" character (this prevents it from trying to
 interpret the 0x00 padding bytes). The decryption process involves first
 decrypting four bytes, from which it can determine the length of the rest of
 the packet (up to and including the hash), then decrypt that many bytes,
 then check the hash, then unserialize the data into the original
 s-expressions.

 The INDEX packet is constructed by taking the list of all MsgIDs that appear
 in this bucketset and putting them in-order as the second element of the
 INDEX packet:

   INDEX = ["I", [MsgId1, MsgId2, ...]]

 If there were messages that didn't make it into this cycle, their encrypted
 synopses are put together into a list which makes up the body of the SUMMARY
 packet. Each encrypted synopsis is attached to its corresponding MsgId. If a
 MsgId appears in the INDEX, it should not appear in the SUMMARY, and vice
 versa, but the protocol is defined to say that should a MsgId appear in
 both, the SUMMARY entry will be ignored.

   SUMMARY = ["S", [ [MsgId1, EncSynopsis1], [MsgId2, EncSynopsis2] ] ]

 After the INDEX and SUMMARY packets, the rest of the bucketset contains
 MAIL, ACK, and ERROR packets. ACK and ERROR packets are encrypted and
 indexed the same way as MAIL packets. All such MAIL/ACK/ERROR packets MUST
 appear in the bucketset in exactly the same order as their corresponding
 MsgID values in the INDEX packet (otherwise the client will be unable to
 decrypt them properly).

 The MAIL packet is simply an s-expression with the type character "M" and the
 encrypted+compressed email message:

   MAIL = ["M", EncEmail]

 The remainder of the last bucket is filled with random data.



Check Buckets:

 To implement byzantine server detection, we define the first UserID to be a
 special one (with a UserID of all zeros), and this user's buckets are all
 called "Check Buckets". Each client will retrieve an equal number of real
 buckets and check buckets, using queries that cannot be distinguished from
 each other.

  TODO: we could either define the check buckets to be at a fixed, well-known
  location (like the first three buckets) and have well-known contents (like
  all zeros), or we could put them at a fixed location and put the hashes of
  their (random) contents in the metadata, or we could use this UserID=0
  trick. The first results in some buckets looking very much different than
  the others (although I don't think this loses anything, since everybody
  knows those are check buckets anyways). The second results in
  variable-length data in the metadata, which makes me nervous. The third
  requires an extra index-bucket PIR operation to find out where the check
  buckets are located.

 For each cycle, all clients start by retrieving the metadata and full
 metaindex. They then retrieve two index buckets: the first one (which
 contains the UserID=0), and whichever index bucket contains their real
 UserID. Then they make up a list of buckets to fetch: assuming their
 maxbuckets value is 3, their list will contain all three check buckets plus
 the three sequential buckets for their UserID. Then the client chooses a
 number of distributors (perhaps 5), builds the bit vectors for each bucket
 query, shuffles the vectors, then sends a message to each distributor asking
 it to return the XOR for all the vectors.

  We define the PIR request to contain multiple vectors because it is far
  more efficient to do all the vectors in parallel on the same pass over the
  dataset. We shuffle the vectors to avoid exposing which vectors are for the
  real buckets and which are for the check buckets.


 We're going to keep the SHORT_PIR request in place (which passes a PRNG seed
 instead of a full-length bit vector) to support clients who wish to value
 bandwidth over information-theoretic security.

 We didn't spend a lot of time on the distributor API, but the plan is to use
 SEXP encoding for its messages too. The PIR_REQUEST operation will take as
 arguments a serialized list of:

  ["PIR_REQUEST", nymserverid, cyclenum,
    [ ["LONG", bitvector0],
      ["LONG", bitvector1],
      ["SHORT", prng_seed2],
      ["LONG", bitvector3],
      ...
    ] ]

 and will eventually provoke a response of:

  ["PIR_RESPONSE", [xor0, xor1, xor2, xor3, ...] ]




One other note, the client should avoid doing CPU-intense operations on the
buckets after download, in case an attacker is pinging the client and looking
for evidence of slowdown to determine which code path the client has taken.


those are the notes that I wrote down.. please feel free to amend/correct
them.

cheers,
 -Brian