
Re: [tor-dev] Load Balancing in 2.7 series - incompatible with OnionBalance ?




info@xxxxxxx wrote:

Hi Alec,

Hi Tom! I love your proposal, BTW. :-)

Most of what you said sounds right, and I agree that caching needs TTLs (not just here, all caches need to have them, always).

Thank you!

However, you mention that one DC going down could cause a bad experience for users. In most HA/DR setups I've seen there should be enough capacity if something fails, is that not the case for you? Can a single data center not serve all Tor traffic?

It's not the datacentre which worries me - we already know how to deal with those - it's the failure-based resource contention for the limited introduction-point space afforded by a maximum (?) of six descriptors, each of which cites 10 introduction points.

A cap of 60 introduction points is a clear protocol bottleneck which - even with your excellent idea - could break a service deployment.

Yes, in the meantime the proper solution is to split the service three ways, or even four, but that's an administrative burden which less well-resourced organisations might struggle with.

Many (most?) will have a primary site and a single failover site, and it seems perverse that they could bounce just ONE of those sites and automatically lose 50% of their Onion capacity for up to 24 hours UNLESS they also take down the OTHER site for long enough to invalidate the OnionBalance descriptors. 

That is not a description of a high-availability (HA) service, and it might put people off.

If that is a problem, I would suggest adding more data centers to the pool. That way if one fails, you don't lose half of the capacity, but a third (if N=3) or even a tenth (if N=10).

...but you lose it for 1..24 hours, even if you simply reboot the Tor daemon.
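
To put rough numbers on that - a back-of-the-envelope Python sketch; the 6 x 10 = 60 cap and 24-hour worst case are the figures discussed above, while the even per-site split of intro points is my assumption for illustration:

    # Capacity left when one of N backend sites is bounced and its share
    # of the descriptor space goes stale for up to 24 hours.
    # 6 descriptors x 10 intro points = 60 is the cap discussed above;
    # the even split across sites is an illustrative assumption.

    MAX_DESCRIPTORS = 6
    INTRO_POINTS_PER_DESC = 10
    MAX_INTRO_POINTS = MAX_DESCRIPTORS * INTRO_POINTS_PER_DESC  # 60
    DESCRIPTOR_LIFETIME_H = 24  # worst case before stale descriptors expire

    def capacity_after_bounce(n_sites):
        """Fraction of intro-point capacity still usable after one site restarts."""
        lost = MAX_INTRO_POINTS / n_sites
        return (MAX_INTRO_POINTS - lost) / MAX_INTRO_POINTS

    for n in (2, 3, 4, 10):
        print("N=%2d sites: %3.0f%% capacity for up to %d hours"
              % (n, 100 * capacity_after_bounce(n), DESCRIPTOR_LIFETIME_H))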

Anyway, such a thing is probably off-topic. To get back to the point about TTLs, I just want to note that retrying failed nodes until all fail is scary: 

I find that worrying, also. I'm not sure what I think about it yet, though.

what will happen if all ten nodes get a 'rolling restart' throughout the day? Wouldn't you eventually end up with all the traffic on a single node, as it's the only one that hadn't been restarted yet?

Precisely.

As far as I can see, the only thing that can avoid holes like that is a TTL, either hard-coded to something like an hour, or just specified in the descriptor. Then, if you do a rolling restart, make sure you don't do it all within one TTL length, but spread it over at least two or three, depending on capacity.

Concur.
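
A minimal Python sketch of that TTL rule, assuming a hypothetical client-side cache entry; the one-hour default and the factor-of-two stagger are just the rules of thumb from the paragraph above, not anything in the spec:

    import time

    DEFAULT_TTL = 3600  # e.g. hard-coded to an hour, per the suggestion above

    class CachedDescriptor:
        def __init__(self, fetched_at, ttl=DEFAULT_TTL):
            self.fetched_at = fetched_at
            self.ttl = ttl  # or taken from a TTL field in the descriptor

        def is_stale(self, now=None):
            """True once the cached descriptor has outlived its TTL."""
            now = time.time() if now is None else now
            return now - self.fetched_at > self.ttl

    def minimum_restart_window(ttl=DEFAULT_TTL, safety_factor=2):
        """Spread a rolling restart over at least `safety_factor` TTLs."""
        return ttl * safety_factor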


desnacked@xxxxxxxxxx wrote:

Please see rend_client_get_random_intro_impl(). Clients will pick a random intro point from the descriptor which seems to be the proper behavior here.

That looks great!
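
(For anyone following along: a rough Python paraphrase of the behaviour George describes - a uniform random pick among the intro points that haven't already failed. This is an illustration, not the actual C in rend_client_get_random_intro_impl().)

    import random

    def pick_intro_point(intro_points, failed):
        """Uniform random choice among intro points not yet marked as failed."""
        candidates = [ip for ip in intro_points if ip not in failed]
        if not candidates:
            return None  # everything failed: time to refetch the descriptor
        return random.choice(candidates)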

I can see how a TTL might be useful in high availability scenarios like the one you described. However, it does seem like something with potential security implications (like, set TTL to 1 second for all your descriptors, and now you have your clients keep on making directory circuits to fetch your descs).

Okay, so, how about:

IDEA: if ANY descriptor introduction point connection fails AND the descriptor's TTL has been exceeded THEN refetch the descriptor before trying again?

It strikes me (though I may be wrong?) that the degenerate case for this would be someone with an onion killing their own introduction point in order to force the user to refetch a descriptor - which is what I think would happen anyway?

At the very least this proposal would add a work factor.
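
In code terms, the rule I have in mind is roughly this - a Python sketch of the decision logic only, with hypothetical field names, not tor's actual codepath:

    import time

    def should_refetch(descriptor, intro_connection_failed, now=None):
        """Refetch only when an intro-point connection failed AND the TTL has expired."""
        now = time.time() if now is None else now
        ttl_expired = now - descriptor["fetched_at"] > descriptor["ttl"]
        return intro_connection_failed and ttl_expired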

For this reason I'd be interested to see this specified in a formal Tor proposal (or even as a patch to prop224). It shouldn't be too big! :)

I would hesitate to add it to Prop 224, which strikes me as rather large and distant. I'd love to see this by Christmas :-P


teor2345@xxxxxxxxx wrote:

Do we connect to introduction points in the order they are listed in the descriptor? If so, that's not ideal, there are surely benefits to a random choice (such as load balancing).

Apparently not (re: George) :-)

That said, we believe that rendezvous points are the bottleneck in the rendezvous protocol, not introduction points.

Currently, and in most deployments, yes.

However, if you were to use proposal #255 to split the introduction and rendezvous to separate tor instances, you would then be limited to:
- 6*10*N tor introduction points, where there are 6 HSDirs, each receiving 10 different introduction points from different tor instances, and N failover instances of this infrastructure competing to post descriptors. (Where N = 1, 2, 3.)
- a virtually unlimited number of tor servers doing the rendezvous and exchanging data (say 1 server per M clients, where M is perhaps 100 or so, but ideally dynamically determined based on load/response time).
In this scenario, you could potentially overload the introduction points.

Exactly my concern, especially when combined with overlong lifetimes of mostly-zombie descriptors.
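
To make those numbers concrete - a quick arithmetic sketch in Python of the figures quoted above (6 HSDirs x 10 intro points x N failover instances, versus roughly one rendezvous tor per M clients); the client counts are purely illustrative:

    HSDIRS = 6
    INTRO_POINTS_PER_DESCRIPTOR = 10

    def intro_point_budget(n_failover):
        """Hard cap on introduction points: 6 * 10 * N, per the figures above."""
        return HSDIRS * INTRO_POINTS_PER_DESCRIPTOR * n_failover

    def rendezvous_servers_needed(clients, clients_per_server=100):
        """'Virtually unlimited': rendezvous capacity scales with client load."""
        return -(-clients // clients_per_server)  # ceiling division

    print(intro_point_budget(3))              # 180 intro points at N=3
    print(rendezvous_servers_needed(25000))   # 250 rendezvous tors for 25k clients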

- alec
