
[freehaven-dev] Defining Anonymity, next draft

I've added subsections on document-anonymity, query-anonymity,
publisher-anonymity, and I fleshed out the rest of the subsections.
It's worth reading through in its entirety again, I think (if you
read the previous versions). Comments appreciated, especially on:

a) "no, roger, you're wrong. that's not what that should mean."
b) "this guy over there wrote a paper that talks about that idea."
c) "...and he describes it more thoroughly/better than you do."


\section{Defining Anonymity}

So far, we've been throwing around the term `anonymous' without ever
actually specifying exactly what we mean by it.  Indeed, many of the
projects in our related works section claim an anonymous network or
some other catchphrase involving anonymity, but they generally fail
to specify exactly which protections users and operators do and do not
receive from their system.

In general, there are four agents in an anonymous publication or
storage system: the author, the reader, the server, and the document itself.
The following sections address anonymity for each
of these agents separately. We then present some intuitive notions of
types or degrees of anonymity, as well as some models of adversaries that
help in thinking about how anonymous a system as a whole might be. Finally,
we attempt to assess real-world projects in the face of these new
definitions and models.


\subsection{Author-Anonymity}

Author-anonymity means that the original author of a given document
should not be known. This characteristic is an integral part of
nearly any anonymous network or service.
Even so-called `anonymous remailers', which are simply anonymous
forwarders and don't support persistence or storage of the data, provide
author-anonymity. Indeed, anonymous remailers can be combined with public
storage and distribution systems such as Usenet to offer a rudimentary
but easily constructed and deployed service that provides persistently
available data along with author-anonymity.


\subsection{Publisher-Anonymity}

While author-anonymity addresses the original author of the document
itself, publisher-anonymity addresses the agent that originally introduces
the document into the system. In some cases the author and the publisher
may be the same entity, but in the general case the author may make use of
a separate individual, either a third party or a server in the system,
to introduce the document. Separating these two notions of `origin'
makes it clearer what protections the system provides.


\subsection{Reader-Anonymity}

Reader-anonymity means that readers requesting a document should not
have to identify themselves to anyone. In particular, this means that
when a reader performs a document request at a given server, that server
is unable to determine the identity or location of the reader.

This class of anonymity is crucial for protecting people from disclosing
that they are interested in or accessing certain types of material.
For instance, a user of the system might not want it known whether she is
downloading material from the Democrats' web page or from the
Republicans'. Reader-anonymity ensures the privacy of the vast majority of
the system's {\em users}, a concern which is often ignored.


\subsection{Server-Anonymity}

Server-anonymity means that the location of the document should not
be known or knowable. Specifically, given a document's name or other
identifier, an adversary is no closer to knowing which server or servers
on the network currently possess the document than he was before. This
implies that the
retrieved documents do not provably pass through any given server that
receives a request. This protection is crucial for materials where mere
possession presents a danger to the possessor, such as documents speaking
out against Scientology practices.

Many services rely on sheer volume of servers, each containing the data,
to dissuade organizations from attacking any given server for possessing
the data. However, this approach itself dissuades large corporations
from participating in these networks, due to liability and reputation
concerns. Also, there may be some circumstances, such as the OpenDVD
suits, where adversaries are willing to expend the energy to track down
all servers which offer a given document. Indeed, making an example out
of even a few high profile server operators can go a long way towards
reducing the availability of a document.


\subsection{Document-Anonymity}

Document-anonymity means that the server does not know the contents
of the document that it is storing or helping to store. This is broken
down into two scenarios. Isolated-server document-anonymity means that
if the server is allowed to look only at the data that it is storing,
it is unable to figure out the contents of the document. Generally this
is achieved via some sort of secret sharing mechanism, either sharing
the document or sharing the key for recreating the document (or both)
across servers.  On the other hand, connected-server document-anonymity
means that the server is allowed to communicate and compare data with all
other servers in the system, but is still unable to determine the contents
of the document. Since a connected server may well be able to act as a
reader and do a document request itself, it seems that connected-server
document-anonymity is difficult to achieve without some trusted party
acting as intermediary and authenticating and authorizing readers.
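The isolated-server case can be made concrete with a toy $n$-of-$n$ XOR
splitting scheme. This is an illustrative sketch, not Free Haven's actual
sharing mechanism (which would more likely be a threshold scheme); the
function names are ours. The key property: any $n-1$ shares are uniformly
random bits and reveal nothing about the document.

```python
import os
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split(document: bytes, n: int) -> list[bytes]:
    # n-1 shares are pure randomness; the last is the document XORed
    # with all of them, so any n-1 shares together are still uniformly
    # random and tell an isolated server nothing about the contents.
    shares = [os.urandom(len(document)) for _ in range(n - 1)]
    shares.append(reduce(xor_bytes, shares, document))
    return shares

def combine(shares: list[bytes]) -> bytes:
    return reduce(xor_bytes, shares)

doc = b"forbidden document"
shares = split(doc, 3)
assert all(s != doc for s in shares)   # no single server holds the plaintext
assert combine(shares) == doc          # all servers together recreate it
```

Note that this sketch only gives isolated-server document-anonymity: any
party who gathers all the shares (for example, a connected server acting
as a reader) recreates the document, which is exactly the difficulty
described above.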

Notice that merely encrypting the document before publishing it into the
system is not sufficient to achieve document-anonymity: we are concerned
here not with confidentiality of the published document, but instead
with whether the given server can recreate the bits that were inserted
into the system.


\subsection{Query-Anonymity}

Query-anonymity refers to the notion that over the course of a given
document query or request, the `identity' of the document itself is not
revealed to the server. In short, this means that although a server may
have many different documents stored, which document was served for a
given request is not knowable by the server. This process of private
information retrieval (PIR) is typically done by a computationally
intensive (or bandwidth-intensive) process of downloading lots of other
documents at the same time to mask which document is being retrieved,
or perhaps by using lots of servers in the transaction.
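The multi-server flavor of PIR can be sketched with the classic two-server
XOR scheme. This is a hedged illustration assuming two non-colluding
servers holding identical databases of fixed-size records; the names are
ours. Each server sees only a uniformly random subset of indices, so
neither learns which record was actually requested.

```python
import random
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def answer(db: list[bytes], query: list[bool]) -> bytes:
    # A server XORs together every record selected by the query bits.
    zero = bytes(len(db[0]))
    return reduce(xor_bytes, (rec for rec, bit in zip(db, query) if bit), zero)

def fetch(db: list[bytes], i: int) -> bytes:
    # The client sends one uniformly random query to each server; the two
    # queries differ only at position i, so each query in isolation is
    # just random noise.
    q1 = [random.choice([True, False]) for _ in db]
    q2 = q1.copy()
    q2[i] = not q2[i]
    # XORing the answers cancels every record except record i.
    return xor_bytes(answer(db, q1), answer(db, q2))

db = [b"alpha", b"bravo", b"charl", b"delta"]
assert fetch(db, 2) == b"charl"
```

The bandwidth cost the text mentions is visible here: each query is as
long as the database index, and each server touches about half its
records per request.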

%Query-anonymity is sort of a special case, in that it addresses anonymity
%for only one of many operations that are generally expected to be supported
%by an anonymous publishing or storage service. 

\subsection{Anonymity vs. Pseudonymity}

The above classes of anonymity describe the issues regarding each of the
agents and operations in the system. However, there are some other, broader
characteristics of anonymity to consider.

Anonymity, when compared to pseudonymity, means that the agent performing
the operation effectively has no observable persistent `location'.
% Often this
%is characterized by a `pull' type of operation.
For instance, turning on a radio is an excellent way of receiving
information anonymously; reading the advertisements in the New York Times
is another good example, though it's slightly less secure because it seems
slightly more conceivable that somebody could track who purchases a given
newspaper. In general, broadcast media like this are good ways to achieve
anonymity. A Usenet feed is a bad example here: if the user is
not the systems administrator of that node, the administrator can
monitor him; if he is the administrator of that node,
then in some sense his upstream provider can `prove' that he received
the information that he read (even though the upstream provider has no
idea which articles he actually physically read).

Pseudonymity, on the other hand, means there is some location associated
with the agent for that transaction, whether that location is used
only once (so-called one-time pseudonymity) or across multiple
transactions (so-called persistent pseudonymity). The concept of
location need not imply a physical location -- it could as easily be
some identifier that provides linkability between transactions.

Both anonymity and pseudonymity in this context retain the notion of
privacy of location. Anonymity is in some sense `more private' than
pseudonymity, because there's less to trace, but having a pseudonym
does not necessarily imply that your location is public -- the pseudonym
could well be a reply block on a mixnet, or even simply a keypair which
an author uses to sign all of his documents.
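The keypair-as-pseudonym idea reduces to linkability. A toy sketch (the
fingerprint scheme here is illustrative, not a real signature protocol)
shows how a persistent pseudonym links an author's documents while
one-time pseudonyms leave them unlinkable:

```python
import hashlib
import secrets

def pseudonym(public_key: bytes) -> str:
    # A pseudonym need not be a physical location: here it is just a
    # stable fingerprint of the key an author signs documents with.
    return hashlib.sha256(public_key).hexdigest()[:16]

# Persistent pseudonymity: one key reused across documents, so an
# observer can link them all to a single (still unlocated) identity.
key = secrets.token_bytes(32)
persistent = [pseudonym(key), pseudonym(key)]
assert persistent[0] == persistent[1]      # transactions are linkable

# One-time pseudonymity: a fresh key per document, so nothing links them.
one_time = [pseudonym(secrets.token_bytes(32)) for _ in range(2)]
assert one_time[0] != one_time[1]
```

In both cases the author's location stays private; the only difference
is how much of his activity an observer can tie together.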

%free haven uses pseudonymity for readers and for servers. we could
%theoretically support anonymity for authors. that really isn't speced
%out yet. (we should fix that.)

\subsection{Partial Anonymity vs. Full Anonymity}

The notion of full anonymity is really only defined over an ideal world,
where the adversary has an infinite set of indistinguishable suspects.
In reality, every set of candidates is limited in size, and indeed
the adversary often has partial information about the suspect -- for
instance, `he or she has a high-bandwidth Internet connection', or
`he or she probably lives in California based on activity patterns and
routing analysis'.

So in this clearer context, the question shifts from `is it anonymous?' to
`is it anonymous enough?' If the original set of suspects has $n$ members,
then for sufficiently large $n$ a system which leaks no information that
might reduce the set of suspects seems to be `anonymous enough'. However,
we have to bear in mind that we may well be trying to protect the
identities of all the users of a given service -- that is, even evidence
implying that a given user is one of $n$ users of the service may be
sufficient to make him suspicious, thus compromising his anonymity. This
may discourage corporations and persons who are particularly concerned
about their reputation from participating in a given anonymous service,
and indeed it may put ordinary users at risk as well.
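One way to make `anonymous enough' quantitative is to count the bits of
identity a uniform suspect set hides. The clue sizes below are entirely
hypothetical, chosen only to illustrate how each piece of partial
information erodes the set:

```python
import math

def anonymity_bits(n_suspects: int) -> float:
    # A uniform set of n suspects hides log2(n) bits of identity;
    # full deanonymization means driving this to zero.
    return math.log2(n_suspects)

# Hypothetical erosion: each leaked clue shrinks the suspect set.
for label, n in [("all users of the service", 1_000_000),
                 ("probably lives in California", 100_000),
                 ("high-bandwidth Internet connection", 10_000)]:
    print(f"{label}: {n} suspects, {anonymity_bits(n):.1f} bits left")
```

This framing also captures the membership problem above: if being one of
the $n$ users is itself incriminating, then even a system that leaks
zero bits within the set has already disclosed too much.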

\subsection{Computational vs. Information-Theoretic Anonymity}

Another issue to consider is the notion of how protected a given address
is: does it rely on computational complexity to protect its anonymity
(e.g., a reply block address on a conventional mixnet), or does it use
some other technique to make the address unknowable even in the face of
a computationally powerful adversary?

Actually, we've glossed over a very important technical point here:
there are really two alternatives to computational anonymity. The
first alternative is that the adversary has the transcript of the
communication, but is still unable to break its anonymity -- this is
what we call information-theoretic anonymity. This might be modeled by
a trusted third party (with trusted channels) acting as moderator, or it
might perhaps be implemented in some mechanism similar to a one-time pad.
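The one-time-pad analogy can be made concrete. Under a truly random,
never-reused pad, a captured transcript is consistent with every
plaintext of the same length, so even a computationally unbounded
adversary learns nothing from it. (This sketches the secrecy property
being analogized, not an anonymity protocol itself.)

```python
import os

def otp(message: bytes, pad: bytes) -> bytes:
    # XOR with a truly random, never-reused pad of equal length.
    assert len(pad) == len(message)
    return bytes(m ^ p for m, p in zip(message, pad))

msg = b"meet at dawn"
pad = os.urandom(len(msg))
transcript = otp(msg, pad)

# For ANY candidate plaintext there exists a pad yielding this exact
# transcript, so the transcript by itself carries no information.
fake = b"meet at dusk"
fake_pad = otp(transcript, fake)       # the pad an adversary could posit
assert otp(fake, fake_pad) == transcript
assert otp(transcript, pad) == msg     # the real recipient still decrypts
```

An information-theoretically anonymous channel would need the analogous
property for identities: every candidate sender must be equally
consistent with the observed transcript.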

The second option is that the adversary is simply unable to obtain
a transcript of the communication, or perhaps the usefulness of the
transcript `expires' quickly enough to be equivalent to no transcript
at all. This might happen in the case of some physical medium which is
untappable (perhaps quantum channels give us some starting point for
this), or in the case of an improved mixnet where a given address no
longer maps to or is traceable to its destination after the transaction.
This also implies that there is no mechanism for the adversary to take
a snapshot of the entire network at that point -- if he could, then he
might be able to go offline with that snapshot and break the anonymity
of the channels.

\subsection{Perfect Forward Anonymity}

Perfect forward secrecy means that after a given transaction is done,
there is nothing new that the adversary can `get' to help him decrypt
the transcript. Similarly, perfect forward anonymity is the notion
that after a given transaction is done, there is nothing new that the
adversary can get that can help him identify the location or identity
of either of the communicating parties.

In effect, this might be achieved by negotiating a `session location'
(the anonymity analog of a session key) between the parties for the
purposes of that transaction. For instance, this session location might be
a double-blinded virtual location in a high-entropy onion-routed network,
where the transaction takes place effectively instantaneously, and then
all records of paths to and from the virtual location are destroyed.

In this case, a snapshot could in fact be taken of the system at that
point, but this falls under the realm of computational anonymity as
described above.

\subsection{What does it mean to `break' an anonymous system?}

This question comes up in the context of encryption: what does it mean
to break an encryption system? Goldreich argues that since we don't
know which parts of the plaintext might be useful to a given adversary
or in a given situation, we must protect every aspect of the plaintext.
Indeed, he further argues that we must preserve the secrecy of all
partial information that the adversary may have a priori about the
plaintext (due to a non-uniform distribution, for example); that is, we
must prevent the adversary from determining any new information beyond
what he starts with.

It is tempting to claim that even this characterization is insufficient
for anonymous systems, as we describe above regarding Partial Anonymity:
even knowing that an individual is among the group of users may be
sufficient to make him suspicious, and thus sufficient to break the
anonymity of the system. On the other hand, even an ideal channel could
not prevent this disclosure: if there is a method of enumerating the
users of a service, then it seems that there is no means of preventing
some amount of information from leaking. Similarly, if there is no
method of enumerating any users of the service, then it seems that we
have achieved ideal anonymity.

There are issues to bear in mind from the social engineering perspective
as well: no matter how powerful or effective the anonymity
protection is, if a user signs his name to his document, his anonymity
is broken. Also, more subtle attacks such as word-choice correlation or
writing-style analysis might yield clues that allow better-than-even
chances of guessing the author. All of the above models could be based
on the assumption
that a given document is a random distribution of bits. However, we
might instead assume that there is some amount of a priori information
that the adversary can guess or assume about the document, and as long
as our communications don't leak {\em more than} this information,
our system is anonymous.

We can picture the ideal anonymous system as a trusted third party
(TTP) with secure channels to each party in the system. This TTP would
receive confidential messages, strip off the source information, and
confidentially forward the messages to their destination in an unlinkable
fashion. Therefore, our goal is to come up with a decentralized system
that is able to simulate this TTP for each operation. Equivalently,
if Alice is communicating through Ted to Bob, a set of protocols which
allows Alice to communicate directly to Bob without Ted is said to be
anonymous if the transcripts of communication are indistinguishable
from those which include Ted. If they are distinguishable, then that
difference is exactly the `break' that causes the system to leak privacy.
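This ideal functionality can be phrased as a small simulation. Here
`Ted' is the hypothetical trusted forwarder described above, not part
of any real protocol: every message the destination sees has had its
source stripped, so transcripts at the destination are unlinkable to
the sender, and a decentralized system is anonymous exactly when its
transcripts are indistinguishable from Ted's.

```python
from collections import deque

class Ted:
    """Ideal trusted third party: strips source info, forwards the rest."""

    def __init__(self) -> None:
        self.pending: deque = deque()

    def submit(self, source: str, destination: str, message: bytes) -> None:
        # The source is dropped on the floor; only (destination, message)
        # survives into any transcript an adversary might later obtain.
        self.pending.append((destination, message))

    def deliver(self) -> tuple[str, bytes]:
        return self.pending.popleft()

ted = Ted()
ted.submit("alice", "bob", b"hello")
assert ted.deliver() == ("bob", b"hello")   # no trace of "alice" remains
```

Anything a real protocol's transcripts reveal beyond what this
simulation reveals is precisely the leak that breaks its anonymity.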

\subsection{Modeling the Adversary}

[this section is only barely sketched. needs a lot more thought.]

This can be phrased as a game: how?

From here, we can define classes of adversaries such that some classes
can beat us in some circumstances, and some cannot.

For instance, some adversaries might be able to watch the link between
the communicating parties, and some might be able to tamper with the
link as well. Some might be able to observe the private state of A
and/or B (that is, data which is internal to the nodes but doesn't get
sent over the network). Maybe some could even modify private state.

A subtle attack to watch out for: since we're talking about anonymity,
there are issues we have to address that we don't consider when
dealing with encryption. For instance, if B is running a server at his
company, what about the cached data in the NIDS down the hall? Is that
part of B? It can certainly help to distinguish him. What if one
of the packets in the transmission got sent off course, and 120 seconds
later an ICMP destination-unreachable packet returns to B, with its
data segment an exact copy of one of the packets that was sent to A?
Is 120 seconds negligible? What isn't negligible?

\subsection{Application to real-world projects}

[again, this subsection is only barely sketched out]