Re: [tor-dev] Parallel relaycrypt data structures for review
Hi everyone,
This is quite interesting. I wrote an initial proposal for this kind of
work in Tor a while back and I'm happy to see some similarities :).
I've attached to this email the ideas I had, based on some discussions
with Nick and on reading ticket #1749. (Sorry for any obvious English
mistakes; it's not my primary language :)
I remember a discussion with Nick about parallel relay cell crypto
where one important point came up: prioritizing quiet circuits. This
implies that once a cell is scheduled for crypto processing, the
dispatcher should somehow know how "noisy" the circuit is and dispatch
accordingly. It introduces the concept of priority for cells/circuits
in a parallel environment, which makes dispatching a bit trickier.
(More inline.)
Nick Mathewson:
> On Nov 29, 2012 9:30 PM, "Andrea Shepard" <andrea@xxxxxxxxxxxxxx>
> wrote:
>>
>> Please review first draft proposed parallel relaycrypt
>> structures in my parallel_relay_crypt branch.
>>
>
>
> Hi! Here are some initial thoughts:
>
> * If we're going to do it like this, maybe we need to make cell_t
> packed or something eventually. It's got a fair amount of padding
> overhead right now.
>
> * Maybe we'll need a next pointer in cells if we're queueing them?
>
> * Why is there only an rc_job for outgoing cells on a circuit? It
> seems for symmetry we'd need to have one for inbound cells and one
> for outbound cells. It looks like that code isn't there right
> now?
>
> * Maybe I'm confused by these queues. The system of cell queues
> is going to get a little confusing, maybe. Putting cells on the
> outgoing queue isn't always right, since some cells (e.g.,
> relay_data cells at an exit node) need to be handled locally rather
> than relaying them. So we need more new queues?
What if the main thread, when dequeuing processed cells, checked that
and decided whether each cell goes outbound or stays local for more
processing?
>
> * Should the jobs be in some data structure other than an
> smartlist_t? A queue would seem to make more sense, since jobs are
> getting added and pulled off. (Yes, protecting the data structure
> there with a lock makes sense.)
I would also propose a priority queue, if only because of the
quiet-circuit use case mentioned above.
>
> * If you're going to have separate locks, it's important to
> document how they nest, to prevent deadlock conditions.
>
> * Presumably relaycrypt_job_t would need to have a pointer to the
> actual circuit that needs work, and a note about whether it's a
> job for outbound or inbound cells.
It looks like it already has one? ..
>
> * In the non-threaded-relaycrypt case, presumably the intention is
> that there's a function that would otherwise queue a cell for
> crypto but instead just put it directly on the appropriate circuit
> queue?
>
> Thanks again! I'll let you know if I think of anything else.
I have a question/remark about the circuit_t pointer in the
relaycrypt_job_s structure. I wonder if a lock might be needed: if the
circuit is closed while some cells are still being processed by a
relaycrypt worker thread, simply setting the circ pointer to NULL will
not tell the thread to stop spending CPU on this invalid circuit.
I don't know whether Tor has some kind of refcount mechanism for
circuits, but in my experience this is a classic "time trap": setting a
pointer to NULL while some thread is still using objects associated
with it. Cleanup becomes a nightmare if a thread still holds references
to a circuit's objects that are invalidated while it holds them.
Would it be safer to simply have a STOP/CANCEL mechanism for worker
threads, and avoid nullifying references that might be in use?
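Here is a rough C11 sketch of what a STOP/CANCEL mechanism combined with
refcounting could look like. relaycrypt_job_t and every function name
below are hypothetical stand-ins, not the structures from Andrea's
branch:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical sketch (not Tor code): instead of setting the job's
 * circuit pointer to NULL under the worker's feet, the main thread
 * flips a cancel flag; the worker checks it, stops working, and drops
 * its reference.  The last reference holder frees the job. */

typedef struct {
  atomic_int refcount;
  atomic_bool cancelled;
  int cells_done;
} relaycrypt_job_t;

static relaycrypt_job_t *
job_new(void)
{
  relaycrypt_job_t *j = calloc(1, sizeof(*j));
  atomic_init(&j->refcount, 1);
  atomic_init(&j->cancelled, false);
  return j;
}

static void
job_ref(relaycrypt_job_t *j)
{
  atomic_fetch_add(&j->refcount, 1);
}

/* Returns 1 if this call dropped the last reference and freed the job. */
static int
job_unref(relaycrypt_job_t *j)
{
  if (atomic_fetch_sub(&j->refcount, 1) == 1) {
    free(j);
    return 1;
  }
  return 0;
}

/* Main thread: the circuit is closing; ask workers to stop. */
static void
job_cancel(relaycrypt_job_t *j)
{
  atomic_store(&j->cancelled, true);
}

/* Worker: process up to n cells, bailing out early if cancelled. */
static int
job_work(relaycrypt_job_t *j, int n)
{
  int done = 0;
  while (done < n && !atomic_load(&j->cancelled)) {
    j->cells_done++;   /* stand-in for the actual AES work */
    done++;
  }
  return done;
}
```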
That's it for now. I'm quite happy to see this work going forward! :)
Cheers!
David
>
> yrs,
Title: Multithreaded cryptographic scheme for Tor relay cells
Author: David Goulet <dgoulet@xxxxxxxxx>
Created: 18 September 2012
Status: Open
The master bug for this proposal is #1749.
Overview:
This document proposes a multithreaded design for cell cryptographic
jobs. The goal is to take advantage of SMP machines, especially the
powerful servers running Tor exit nodes today. Moreover, almost all
OSes running Tor nodes nowadays support multithreading, which should be
used to increase Tor's performance.
TODO: OUTLINE
1. Background
1.1 Overview of current cell crypto
As stated in [WikiThreadCrypto], three main kinds of crypto occur:
1) Onion skin processing (already parallelized on the server side).
2) SSL handshakes and crypto
3) AES-crypto on cells
This proposal focuses on the third kind: direct AES cell cryptography
jobs (encryption and decryption).
The relay cell crypto is done in relay_crypt_one_payload(), which gets
called in a few places, all of which are fortunately in relay.c, making
things easier to change.
* circuit_package_relay_cell(): it gets called 1..N times for a cell
that we're packaging from an edge connection and are about to stick on
a cell queue.
* relay_crypt(): we just got a relay cell on a circuit and are about to
decrypt it, see whether it should be handled by us, and then either
pass it on or process it. If we handle it, we pass it to
connection_edge_process_relay_cell(). If we pass it on, we append it to
a cell_queue_t. This function is called by circuit_receive_relay_cell(),
which runs when we have just received a relay cell on a circuit.
For pure relays, the relay_crypt() case dominates. For exit nodes, the
circuit_package_relay_cell() case matters too.
The whole point of the new threading model is to offload the crypto
part from the main loop and, as stated, take advantage of SMP systems.
2. Design
Since cell order matters on a circuit, it makes sense to create two new
cell queues in the circuit.
* Crypto cell queue (Cq)
The cell has NOT yet been through its encryption/decryption task.
* Transport cell queue (Tq)
The cell has passed through the crypto cell queue and is ready to be
sent.
The following schema shows the cell movement once it arrives on circuit
A, where Cq is the crypto queue and Tq is the transport queue. This is
as simple as it gets. The tricky and fun part of this scheme is
described in detail in section 3, Threading and Synchronization.
                         circuit A
            +----------------------------------+
 +--------+ |  Cq: +---+---+      +---+        |
 | cell c |------->|   |   | ...  | c |        |
 +--------+ |      +---+---+      +---+        |
  enqueue   +----------------+-----------------+
                             |
                             | dispatch(Cq, Tq)
                             v
                        Thread pool
            +-------+-------+     +--------+--------+
            |       |       | ... |        | circ A |
            +-------+-------+     +--------+--------+
                                               |
                          exec crypto job() <--+
                                               |
                             +-----------------+
                             v
                         circuit A
            +----------------------------------+
 +--------+ |  Tq: +---+      +---+---+        |
 | cell c |<-------| c |  ... |   |   |        |
 +--------+ |      +---+      +---+---+        |
  dequeue   +----------------------------------+
It goes like this: enqueue cell c of circuit A into the circuit's
crypto queue (Cq). Each circuit is dispatched to a thread pool, where
it is assigned to an available thread that processes N cells (value
TBD) from Cq and enqueues them in Tq once done.
The main loop of Tor only needs to empty out the transport queue of
each circuit at each round it does. The rest is handled by the crypto
thread pool. This thread pool model is described in detail in the next
section.
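A minimal C sketch of the Cq/Tq pipeline above. The queue and circuit
types here are simplified stand-ins for Tor's real structures, and
circuit_crypt_some() is an invented name:

```c
#include <assert.h>

#define QUEUE_CAP 32

/* Minimal FIFO of cells; a stand-in for Tor's cell_queue_t. */
typedef struct {
  int cells[QUEUE_CAP];   /* stand-in for cell_t entries */
  int head, tail;
} cell_queue_t;

static int  queue_len(const cell_queue_t *q) { return q->tail - q->head; }
static void queue_push(cell_queue_t *q, int cell) { q->cells[q->tail++] = cell; }
static int  queue_pop(cell_queue_t *q) { return q->cells[q->head++]; }

typedef struct {
  cell_queue_t crypto_q;     /* Cq: cells awaiting encryption/decryption */
  cell_queue_t transport_q;  /* Tq: cells ready to be sent */
} circuit_t;

/* Worker: process at most n cells from Cq into Tq, preserving order. */
static int
circuit_crypt_some(circuit_t *circ, int n)
{
  int moved = 0;
  while (moved < n && queue_len(&circ->crypto_q) > 0) {
    int cell = queue_pop(&circ->crypto_q);
    /* relay_crypt_one_payload() would run on the cell here */
    queue_push(&circ->transport_q, cell);
    moved++;
  }
  return moved;
}
```

The main loop then only ever touches transport_q, and the worker only
moves cells from crypto_q to transport_q, which is what keeps the cell
order on the circuit intact.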
3. Threading and Synchronization
The basic idea is to dispatch relay cell cryptographic jobs to a thread
pool upon reception, to offload the Tor main loop. The thread pool
should handle circuits rather than cells: reordering cells after
encryption is an unnecessary step that we absolutely want to avoid, for
both complexity and latency reasons.
The next section describes the threading model and dynamics between the
dispatcher and the pool (worker threads).
3.1 Threading model
This section first explains how the dispatcher mechanism works; the
worker threads of the pool are described in the following subsection.
3.1.1 Dispatcher
The dispatch mechanism is part of the main loop and has two roles.
First, it assigns received cells to the corresponding crypto queue of
their circuit. We have to make sure that the dispatcher is the one and
only writer of that queue, so we can avoid complex synchronization
between threads and keep the design simple for this proposal.
New circuits are also managed by the dispatcher, which assigns each one
to a worker in the thread pool.
The second role is to keep a priority queue of all worker threads,
ordered by the number of cells they are currently handling across all
of their crypto queues. After adding a cell or a new circuit to a
thread, this queue is rebalanced. The rebalancing process is tricky,
since we might need to steal circuits from some threads and push them
onto others. This process might be hard on lock contention, but for
this first proposal it should be good enough.
Section 3.3 describes the priority queue data structure proposed for
the dispatcher.
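To illustrate the first part of the dispatcher's second role, here is a
C sketch that assigns a new circuit to the least-loaded worker. The
names are invented, and a linear scan stands in for the priority queue
of section 3.3:

```c
#include <assert.h>

/* Hypothetical sketch (names invented, not Tor code): the dispatcher
 * tracks how many cells each worker has queued and assigns each new
 * circuit to the least-loaded worker.  A real implementation would keep
 * the workers in the priority queue described in section 3.3. */

#define N_WORKERS 4

typedef struct {
  int queued_cells;   /* cells across all this worker's crypto queues */
} worker_t;

static int
pick_least_loaded(const worker_t *workers, int n)
{
  int best = 0;
  for (int i = 1; i < n; i++)
    if (workers[i].queued_cells < workers[best].queued_cells)
      best = i;
  return best;
}

/* Assign a circuit with `pending` queued cells; returns the worker index. */
static int
dispatch_circuit(worker_t *workers, int n, int pending)
{
  int w = pick_least_loaded(workers, n);
  workers[w].queued_cells += pending;
  return w;
}
```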
3.1.2 Worker thread
This thread is quite a bit simpler. Each worker contains a priority
queue of circuits ordered by EWMA value, so that the quietest circuit
gets handled first, which is an important design requirement for the
current state of Tor.
As proposed by Nick Mathewson, it might also be okay to apply the EWMA
logic strictly when adding cells to the transport queue (outbound).
Tests should be done to see if that could make things faster.
Once a thread gets CPU time, it chooses N cells from the crypto
priority queue, depending on each circuit's load, processes them one at
a time, and moves them to the transport queue (Tq) of the corresponding
circuit once done. The value of N has to be determined by testing.
Another possibility is to make this value vary with the rate of the
relay or the CPU power available.
Since cell insertion is done by the dispatcher, and we want to put as
little work on it as we can, the priority queue rebalancing operation
should be done in the worker thread. So, after each round of N cells of
work, the queue should be rebalanced, since the dispatcher could have
added more. This can be checked with a simple flag set in the priority
queue.
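A C sketch of that worker round: process a batch of N cells, then
rebalance only if the dispatcher has set the flag. All names are
invented for illustration:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch: the worker processes a batch of at most BATCH_N
 * cells, then checks a flag the dispatcher sets when it has inserted
 * new cells or circuits, and rebalances its priority queue only then. */

#define BATCH_N 8

typedef struct {
  atomic_bool needs_rebalance;  /* set by dispatcher, cleared by worker */
  int pending_cells;
  int rebalances;
} worker_state_t;

static void
worker_rebalance(worker_state_t *w)
{
  w->rebalances++;   /* would re-heapify the circuit priority queue */
}

/* One round of work; returns the number of cells processed. */
static int
worker_round(worker_state_t *w)
{
  int done = 0;
  while (done < BATCH_N && w->pending_cells > 0) {
    w->pending_cells--;   /* stand-in for crypting one cell */
    done++;
  }
  /* Atomically read-and-clear the flag so a concurrent set is not lost. */
  if (atomic_exchange(&w->needs_rebalance, false))
    worker_rebalance(w);
  return done;
}
```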
3.2 Locking scheme
For this first draft I will not propose a lockless mechanism, since
that kind of change is non-trivial and maybe unnecessary for a first
implementation. So the locking scheme is a simple lock for each
proposed queue. We can improve it over time, and also during the
stabilization process.
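The lock-per-queue scheme is straightforward; here is a minimal C
sketch using a pthread mutex. The types and function names are
invented, and the queue body is reduced to a counter for brevity:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical sketch of the section 3.2 scheme: one mutex per queue,
 * taken around every enqueue/dequeue.  Not Tor code. */

typedef struct {
  pthread_mutex_t lock;
  int len;               /* stand-in for the actual cell list */
} locked_queue_t;

static void
locked_queue_init(locked_queue_t *q)
{
  pthread_mutex_init(&q->lock, NULL);
  q->len = 0;
}

static void
locked_enqueue(locked_queue_t *q)
{
  pthread_mutex_lock(&q->lock);
  q->len++;              /* stand-in for appending a cell */
  pthread_mutex_unlock(&q->lock);
}

/* Returns 1 if a cell was dequeued, 0 if the queue was empty. */
static int
locked_dequeue(locked_queue_t *q)
{
  int ok;
  pthread_mutex_lock(&q->lock);
  ok = (q->len > 0);
  if (ok)
    q->len--;
  pthread_mutex_unlock(&q->lock);
  return ok;
}
```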
3.3 Data structures
For the priority queue, the minimum value of the queue is the most
important node, since we have to process the quietest circuit first,
ordered by EWMA value.
A binary min-heap makes sense, since the minimum value can be looked up
in O(1). This structure is simple and well known, and it can be
replaced afterwards by all sorts of funky structures such as the
Fibonacci heap or the Brodal queue [BrodalQueue].
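A C sketch of such a min-heap keyed by EWMA value, with only insertion
and the O(1) minimum lookup shown. The type and function names are
invented for illustration:

```c
#include <assert.h>

/* Hypothetical sketch of the section 3.3 min-heap: circuits keyed by
 * their EWMA value, so the quietest circuit (minimum key) sits at the
 * root and can be read in O(1); insertion is O(log n). */

#define HEAP_CAP 64

typedef struct {
  double ewma[HEAP_CAP];   /* each entry stands in for a circuit's key */
  int n;
} ewma_heap_t;

static void
heap_insert(ewma_heap_t *h, double key)
{
  int i = h->n++;
  h->ewma[i] = key;
  while (i > 0) {                  /* bubble the new key up */
    int parent = (i - 1) / 2;
    if (h->ewma[parent] <= h->ewma[i])
      break;
    double tmp = h->ewma[parent];
    h->ewma[parent] = h->ewma[i];
    h->ewma[i] = tmp;
    i = parent;
  }
}

/* The quietest circuit: O(1) peek at the minimum key. */
static double
heap_min(const ewma_heap_t *h)
{
  return h->ewma[0];
}
```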
4. Conclusion
In a nutshell, this proposal is probably not the most efficient way to
do multithreaded crypto jobs in Tor, but its goal is to provide a
simple and scalable implementation that can be improved over time with
testing, measurements, and better algorithms, such as more efficient
data structures or lockless mechanisms.
5. Acknowledgments
TODO
References:
----------------
[WikiThreadCrypto] Multithreaded crypto on Tor
https://trac.torproject.org/projects/tor/wiki/org/projects/Tor/MultithreadedCrypto
[BrodalQueue] Gerth Stølting Brodal (1996) - Worst-Case Efficient Priority Queues
https://users.info.unicaen.fr/~karczma/TEACH/Doc/brodal_okasaki.pdf