[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]

Re: Parallelizing (some of) Tor's AES Operations

To: or-dev@xxxxxxxxxxxxx
Subject: Re: Parallelizing (some of) Tor's AES Operations
From: Matt Edman <edmanm@xxxxxxxxxxxxxxxxxxx>
Date: Fri, 2 May 2008 00:05:11 -0400
Delivered-to: archiver@xxxxxxxx
Delivered-to: or-dev-outgoing@xxxxxxxx
Delivered-to: or-dev@xxxxxxxx
Delivery-date: Fri, 02 May 2008 00:05:20 -0400
In-reply-to: <20080421163404.GE4147@xxxxxxxxxxxxxx>
References: <20080421062905.GA26123@xxxxxxxxxxxxxxxxxxx> <20080421163404.GE4147@xxxxxxxxxxxxxx>
Reply-to: or-dev@xxxxxxxxxxxxx
Sender: owner-or-dev@xxxxxxxxxxxxx
User-agent: Mutt/1.5.13 (2006-08-11)

On Mon, Apr 21, 2008 at 12:34:04PM -0400, Nick Mathewson wrote:
> On Mon, Apr 21, 2008 at 02:29:05AM -0400, Matt Edman wrote:
> >  1) The most straightforward approach I can think of is to launch
> > $NumCpus threads when Tor starts up. When aes_crypt() or
> > aes_crypt_inplace() get called from the main thread, the input gets
> > split into $length/$NumCpus chunks (or maybe fewer depending on
> > input length), which get farmed out to the worker threads.  The main
> > thread then collects the results and returns.

[...]

> Right.  Personally, I don't expect that this will wind up working very
> well in practice, what with the locking but if you want to benchmark
> it, it might work out well after all.  Having the main thread wait
> while all the subthreads are doing AES seems like a losing proposition.

Yep, this approach does indeed seem to perform quite poorly and scales even
worse when increasing the number of processors/chunks. Granted my little
benchmarking application probably left plenty of room for optimizing, but I
don't suspect much would change when you're getting near-linear _slowdown_ for
Tor's small cell sizes. On the other hand, this looks like it could be a win
if Tor ever moves to, say, 64KB cells. ;)

> >  4) A fourth approach would be to farm entire cells out to
> > threads. 

[...]

> Note that there's exactly one place in the entire code where
> relay_crypt() is called, marked [**] above.  To get things to
> parallelize effectively, I think you'd want to replace the call to
> relay_crypt, and have it instead place the cell in a work queue for
> the worker threads to process.  When the worker threads were done with
> one or more cell, you'd need a way for them to tell the main thread
> about them, and which circuit's cell queue to put them in.

Looking through the code, it seems this is approximately what cpuworkers.c
does now for onion skins. Roughly, the work queue is simply a socket. The
workers receive tasks from the main thread over the socket, do some
processing, and then write the result back to the main thread.

A simple and straightforward implementation would be to extend the existing
CPU worker code with an additional task type for processing relay cells.  This
seems to have the same drawback as the first approach above, though, in that
you spend too much time bookkeeping and not enough doing actual work.  Not
reusing the existing CPU workers means Tor will actually launch 2*NumCpus
workers total, but that's probably not so bad.

> Alternatively, you could give each circuit _two_ cell queues for each
> direction: one for cells that need to be crypted, and one for cells
> that are already crypted and are ready to transmit.  All you'd need to
> tell the workers about is which circuits have to be processed; all
> you'd need to tell the main thread about is which circuits now have
> more data.  I like this approach a little better because it doesn't
> require quite so much bookkeeping to tell the workers about what keys
> to use to crypt what, or to tell the main thread where to put cells.

I like this approach better too. Separate cell queues definitely seems like
the way to go.

> It will definitely take some thinking to figure out exactly what kind
> of locking and notification mechanisms you'd want to use here.  If you
> have any questions about waking the main thread, or what kind of
> integrity Tor's data structures need, just ask.

In terms of locking and notification mechanisms, is it reasonable to assume a
threaded environment? I notice CPU workers are typically threads, but may be
fork()ed processes on some platforms. If TOR_IS_MULTITHREADED is not defined,
though, we don't get to use the handy locking constructs that already exist in
compat.h. Are there really many platforms that don't handle threads very well?

It also looks like compat.h already has wrappers for thread signaling, but are
just '#if 0'ed out right now. I assume that's simply because they're not
currently used anywhere in Tor though (rather than because it's broken).

> This sounds like a really fun project in any case, and one I'd very
> much like to include in the next Tor release.  What timeframe will you
> be doing this on?

No particular rush, but I'm still shooting for a fairly short timeframe on
this.

--Matt

Follow-Ups:
- Re: Parallelizing (some of) Tor's AES Operations
  - From: Nick Mathewson

Prev by Author: Auto-Update -Feature - Was: Draft schedule for Tor 0.2.1.x.
Next by Author: Re: Parallelizing (some of) Tor's AES Operations
Previous by thread: Re: Proposal: More robust consensus voting with diverse authority sets
Next by thread: Re: Parallelizing (some of) Tor's AES Operations
Index(es):
- Author
- Thread