
Re: Anonymity-preserving collection of usage data of a hidden service authoritative directory



Hi,

They also republish whenever they consider their descriptor to be "dirty", which happens when they establish a new introduction point (rend_service_intro_established()) or give up on and drop an
introduction point (rend_services_introduce()). This 'dirty' part is
what I meant when I was pondering if a few hidden services have
unstable connections, and thus change their intro points a lot.

Maybe it's just a personal feeling (because I have not measured it yet), but don't you think that introduction points change quite often? I always thought that RSDs are republished so often because it's unlikely anyway that the set of IPos stays the same for more than one hour. If that is true, a 23-hour-old RSD simply cannot have any working IPos any more.


So it's very unlikely that there will be
many novel publications after the shown intervals.

Yep. They will be people creating a new hidden service, or people turning on their Tor after it's been off for a while. As we see
above, there are at most a handful in each 15 minute period.

However, the mean number of novel publications decreased from 3.72 to 0.81 between the two statistics. Maybe this comes from hidden services that were offline for less than 3 days and then "republish" their descriptor. In that case it would be an artifact of the 3-day rule, because such a publication is really a novel publication with novel IPos rather than a republication.


But hey, at least we remove old ones sometime, rather than just
collecting them forever. :)

In German we call people who keep everything because they don't dare to throw anything away "Messies"... But maybe we can "heal" that in Tor. ;)


(Remember that this same logic is used by *clients* to discard old
service descriptors, and we have many fewer guarantees that their
clocks are at all correct. That's what the MAX_SKEW business is
about.)

Why would a client
expect that a hidden service with a 23-hour old descriptor is
online if it knows that it should have republished every hour?

Well, if the client's clock is wrong by 23 hours, ...

But you're right, the servers storing the descriptors should be
assumed to have better clocks, and they could just dump old ones to
save clients the trouble.

I am not sure I understand your argument about clock skew correctly. Doesn't clock skew only matter when two *different* clocks are involved, e.g. a client's and a directory node's clock? In that case I agree that there should be some tolerance.


But when a directory node receives an RSD, it can record the time of receipt and discard the RSD 1.5 hours later, using only its own clock. Regardless of a client's clock, the descriptor is then 1.5 hours old and possibly useless. Whether it really is useless depends on how often IPos change. (I think this would be the next thing to measure...)
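
To make the receipt-time idea concrete, here is a minimal sketch in Python. It is not Tor code: the class name DescriptorStore, its methods, and the 1.5-hour lease are only assumptions taken from this discussion. The point is that the node compares its own "now" with its own receipt time, so no second clock is involved.

```python
LEASE_SECONDS = 90 * 60  # 1.5 hours, the value floated in this thread (assumption)


class DescriptorStore:
    """Hypothetical in-memory store that expires descriptors by receipt time."""

    def __init__(self, lease_seconds=LEASE_SECONDS):
        self.lease_seconds = lease_seconds
        self._entries = {}  # service_id -> (received_at, descriptor)

    def store(self, service_id, descriptor, now):
        # A republication simply overwrites the entry and restarts the lease.
        self._entries[service_id] = (now, descriptor)

    def expire(self, now):
        # Only the node's own clock is used: now minus local receipt time.
        stale = [sid for sid, (received_at, _) in self._entries.items()
                 if now - received_at >= self.lease_seconds]
        for sid in stale:
            del self._entries[sid]
        return stale

    def lookup(self, service_id, now):
        self.expire(now)
        entry = self._entries.get(service_id)
        return entry[1] if entry else None
```

A service that republishes every hour would never hit the lease; one that stops republishing disappears after 1.5 hours.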

Of course, the real reason hidden services republish every hour is because the directory authorities don't store anything to disk and
don't share service descriptors among each other -- so every time we
restart a directory authority it forgets about all hidden services.
This means they need to republish frequently just in case an
authority restarts. If we made some way for service descriptors to
survive a restart (e.g. by storing them to disk, replicating them, or
both), then it seems to me we would reduce the need to republish
dramatically.

The question is whether it is more likely that a directory node restarts or that an introduction point changes.


In a decentralized design I suggest to cut down the lease time to
one hour (or maybe 1.5 hours). This saves resources for replicating
descriptors in case of leaving/joining routers.

This is an interesting tradeoff. I'm not sure if it's better to demand frequent "I'm still here" messages from the hidden services, so you can quickly drop the ones that don't send one, or to be more flexible and let them go long periods with the same intro points and never need to send an update.

Maybe 1 hour is too short. 4 hours? 12 hours? We can negotiate that. ;) No, to be serious: how long do you think a set of introduction points stays the same, after a stabilization phase of, say, 15 minutes after starting the service?
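
If we had a log of the descriptors published by one service, the lifetime of each set of introduction points could be computed roughly like this. This is only a sketch in Python under assumptions: the function name, the input format (time-sorted pairs of timestamp in seconds and the IPos listed in the descriptor), and the 15-minute warm-up are all hypothetical.

```python
def intro_set_lifetimes(observations, warmup_seconds=15 * 60):
    """Return how long each set of introduction points stayed unchanged.

    observations: list of (timestamp_seconds, iterable_of_intro_points),
    sorted by time. Observations during the warm-up phase are ignored.
    The last, still-unchanged set is not counted, because we do not
    know yet how long it will last.
    """
    if not observations:
        return []
    start = observations[0][0] + warmup_seconds
    lifetimes = []
    current_set = None
    current_since = None
    for timestamp, intro_points in observations:
        if timestamp < start:
            continue  # still in the stabilization phase
        intro_set = frozenset(intro_points)
        if current_set is None:
            current_set, current_since = intro_set, timestamp
        elif intro_set != current_set:
            lifetimes.append(timestamp - current_since)
            current_set, current_since = intro_set, timestamp
    return lifetimes
```

Since a directory only sees a descriptor when it is published, the resolution of such a measurement is limited by the republication interval.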


I guess if we want to get extra complex then somebody could try
connecting to the hidden service and only dump the descriptor if it's
unreachable -- but that probably doesn't play well with our
authentication or authorization tricks, nor with the valet node and
related designs.

Maybe we can postpone this extension? My first thought would be to register 1000 fake hidden services at one directory node and wait for it to establish 1000 connections to them. :(


Actually, three. Only "v1" directory authorities handle hidden
service stuff, and that's just moria1, moria2, and tor26 right now.

Whoops. Yes, you wrote that in an earlier mail that I had not read in full before writing my last mail...


Yep. This number seems to represent the total count of people
interacting with a given hidden service, but remember that it doesn't
represent the total number of rendezvous attempts -- since clients
cache the descriptors.

Sure.

Though note in connection_ap_handshake_rewrite_and_attach() that
clients try to refetch a newer descriptor if the one they have cached
is more than 15 minutes old. Are you following all the details so
far? :)

Now that you ask... :) Why 15 minutes? So, clients consider RSDs to be old after 15 minutes, servers after 60 minutes, but directories keep them for 3 days?...
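
Just to put the three numbers side by side, here is a small sketch in Python. The constant and function names are made up; only the values (15 minutes, 60 minutes, 3 days) come from this thread, and clock-skew tolerance is deliberately left out.

```python
CLIENT_REFETCH_AFTER = 15 * 60          # clients try to refetch after 15 minutes
SERVICE_REPUBLISH_EVERY = 60 * 60       # services republish every hour
DIRECTORY_KEEP_FOR = 3 * 24 * 60 * 60   # directories keep descriptors for 3 days


def descriptor_views(age_seconds):
    """Show how each party would judge a descriptor of the given age."""
    return {
        "client_would_refetch": age_seconds > CLIENT_REFETCH_AFTER,
        "service_should_have_republished": age_seconds > SERVICE_REPUBLISH_EVERY,
        "directory_would_discard": age_seconds > DIRECTORY_KEEP_FOR,
    }


# The 23-hour-old descriptor from above: stale for client and service,
# but still stored and served by the directory.
print(descriptor_views(23 * 60 * 60))
```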


There is something that is making the
rendezvous itself be very slow. I'm not sure what it is. There's no
need for it to be as slow as it is. And I think it really reduces the
set of people who think hidden services are neat.

Then this might be one of the next things to investigate. My guess is that some timeout is too long, or that some operations could/should be attempted two or three times in parallel.


Also, scaling questions aside, there are other reasons to distribute hidden service descriptors and improve their availability.

Right.

So what more data might we want to collect about current usage
patterns? Or is this enough to move on to the next steps which are to
think about an ascii format for descriptors (rather than the awful
binary format I was dumb enough to use back when we started), think
about the implications of letting strangers see and serve all the
descriptors, and think about a protocol for receiving, serving, and
replicating descriptors?

These are the possible next tasks (arbitrary order):
- Find out why connection establishment is that damn slow.
- Measure how often IPos change.
- Think about the format of RSDs (ASCII vs. binary), encryption of contents and the related security implications.
- Describe the protocol to receive/serve/replicate RSDs.


But enough measurements for the moment. I should work on some concepts now, so I will start with the RSD format.

--Karsten