[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [freehaven-dev] retrieval issues



(Sorry for the delay in reply; we've been working hard on the mix-acc
paper over the past few days.)

On Wed, Mar 21, 2001 at 12:26:45PM +0100, Thomas Strauss wrote:
> I'm currently doing research in existing P2P System's methods of information 
> retrieval for my diploma. While it seems that most of the systems take the 
> most simple approach of using a string or substring-match - No system I've 
> seen so far supports Meta-Data - it seems that freehaven has a potential 

I've generally ignored the meta-data issue lately, because I find the
infrastructure problems much more interesting. But I could have sworn
I found a number of p2p systems (perhaps only fledgling systems, it's
true) at the SF conference that handle meta-data. There's a chapter
about it in the p2p book, but I haven't read it in detail. Perhaps try
the bluesky or p2p-hackers list for more info about metadata systems?

http://www.transarc.ibm.com/~ota/bluesky/
http://zgp.org/mailman/listinfo/p2p-hackers

http://www.infoanarchy.org/ is a good place to hunt, too.

> design flaw that is related to the retrieval technique.
> 
> Freehaven does not forward search requests to avoid the 
> gnutella-broadcast-flooding of a network. So it depends on a fully connected 
> network to be sure, that your shares will be traded everywhere and more 
> important, can be retrieved immediatly after injection.

Or a mostly connected network -- you only need a fraction of the shares to
reconstruct the file. But yes.
 
> The protocol has no clear way how to make sure full connectivity is enforced. 
> It suggests servers to become "Introducers", but these are free to trade 
> their knowledge of network topology - if they dont, you are left alone in the 
> dark until someone decides to trade with you. A very unlikely issue, since 
> you are unknown. The automatic traders will almost never choose you as a 
> primary trade-partner, so your career as a well behaved freehaven stalls from 
> the beginning.

I disagree. The high-reputation nodes may not be very likely to choose
you, but you can choose them. If you want to join the network and gain a
reputation, then you need to initiate trades on your own. Note that you'll
be getting the bad end of the deal on most of these trades, since you are
a new (unreliable) entity.
 
> While this might be ok - it will take much more time to attack the network 
> through fake-servers - it starts concentrating the network around the 
> introducers. 
> This concentration raises the chance for an introducer to do successful 
> buddy-attacks and it gives a potential attacker the chance that shutting down 
> only a small number of introducers will hurt content, since many shares are 
> traded between the introducers.
 
More precisely, shutting down the high-reputation nodes will damage the
network, since they're the ones who are most useful to it. But given
our notion of server-anonymity (they are merely mix-net addresses),
it is hard to locate these servers.

One assumption you're making is that all reputation is the
same. Specifically, that a good reputation for storing files is identical
to a good reputation for introducing well (or assessing reliability).

A node can be very good at introducing people accurately even if it stores
no shares of its own. So the assumption that the introducers will be
the only real nodes on the network is flawed.

(Oops. Skimming over Chapter 12, it looks like we made this point in Chapter
16 but not in Chapter 12?)

A further point: none of the other systems out there solve the introducer
problem either. One of the toughest parts of using gnutella these days
is getting yourself inserted in a group of live nodes. From a practical
standpoint, it's tough to find an active list of currently-connected
machines. And those lists are generally good targets for adversaries.

Freenet solves it by having a few IP addresses listed on its webpage
(they are the freenet nodes which joined most recently). What would
happen if I started to flood that list with bogus IPs?

I haven't found anybody actively working on this problem. (It's a
hard problem, both in terms of actually solving it and in terms of
characterizing the problem.)

> Another point that made me think is the potential resource issue: If the 
> network needs to be fully connected, every machine needs to know every other. 
> Since the network needs a high number of servers in the servnet to fulfill 
> the anonymity claim, this leads to a high storage capacity on the server to 
> keep the servnet in memory.
 
The flaw with the current Free Haven design is not that it requires
a lot of memory on each node (memory is cheap these days, and keeping
trust data on a given node is quite cheap). The flaw is that we need
to broadcast to every participating node. From a practical standpoint,
this is suicide -- node bandwidth simply cannot handle this. We *must*
use a search system which queries at most log n machines (where n is
the number of participating servers). Otherwise we're sunk.

Possible systems to use for this include the Oceanstore lookup algorithm
(it seems difficult to disentangle from the rest of Oceanstore,
unfortunately) and the Chord system from MIT (not yet published,
unfortunately). But it can be done. The next problem is that doing
this requires another piece of infrastructure: the "where does this
share live?"  system. It needs to handle updates, deletes, etc. One
of the beautiful things about Chord is that it really is a separate
system. Thus when I say log n above, I mean where n is the number of
servers participating in Chord -- it can be entirely independent of the
number of servers participating in Free Haven.

The hard part about making Free Haven work currently is that we assume a
secure anonymous reasonably efficient MIX-net to protect the anonymity of
each participant. No such MIX-net exists right now. This is the current
problem that we're tackling; stay tuned for a clearer Tarzan design as
soon as I get around to publishing it.

Speaking of which: "Since the network needs a high number of servers in
the servnet to fulfill the anonymity claim" is not at all clear to me.
The mixnet needs to have a lot of participants, yes, but if Free Haven
had a "good enough" mixnet, then even if it had only one server, that server
would be anonymous. (Free Haven would not be very robust in this case; but
that's a different matter.)

> The routing issue might be solved, if the server himself decides whom to 
> broadcast next. Only change in the protocol would be to return a message: 
> "Sorry dude, what you need is not here, but you may ask these servers." The 
> data part of the reply contains a number of servers, the queried server 
> knows. The reader can then decide which servers on the list have already been 
> queried and which are missing.
 
Yes, that could work. We tried to separate the trust broadcasts (which
are a superset of introducing messages) from the document queries.

I think that by tweaking the frequency of the trust announcements, you can
maintain a connected network. Another thing to consider is that Free Haven
nodes, because they must build a reputation before they are trusted, are
inherently more stable or reliable than, say, Freenet or Gnutella nodes.

Thus there will be far fewer "real" Free Haven nodes, and the network will
be far less dynamic in terms of actually new nodes joining.

> Regards,
>     Thomas
> 
> 
> PS: If you know any papers about p2p information retrieval - please let me 
> know, thanks....
 
There are many. Take a look at our bibliographies and related works
sections, as well as the 'related' page on the website.

Let me know if you have any questions or are confused about anything,
--Roger