
Re: [tor-dev] Parallel relaycrypt data structures for review




Hi everyone,

This is quite interesting. A while back I wrote up an initial proposal
for this kind of work in Tor, and I'm happy to see some similarities :).

I've attached to this email the ideas I had, based on some discussions
with Nick and on reading ticket #1749. (Sorry for any obvious huge
English mistakes; it's not my primary language. :)

I remember a discussion about parallel relay cell crypto with Nick in
which one important point was to prioritize quiet circuits. This
implies that once a cell is scheduled for crypto processing, the
dispatcher should somehow know how "noisy" the circuit is and dispatch
accordingly. That brings in the concept of priority for cells/circuits
in a parallel environment, which makes dispatching a bit trickier.

(More inline).

Nick Mathewson:
> On Nov 29, 2012 9:30 PM, "Andrea Shepard" <andrea@xxxxxxxxxxxxxx>
> wrote:
>> 
>> Please review first draft proposed parallel relaycrypt
>> structures in my parallel_relay_crypt branch.
>> 
> 
> 
> Hi!  Here are some initial thoughts:
> 
> * If we're going to do it like this, maybe we need to make cell_t 
> packed or something eventually.  It's got a fair amount of padding 
> overhead right now.
> 
> * Maybe we'll need a next pointer in cells if we're queueing them?
> 
> * Why is there  only an rc_job for outgoing cells on a circuit? It 
> seems for symmetry we'd need to have one for inbound cells and one
> for outbound cells.  It looks like that code isn't there right
> now?
> 
> * Maybe I'm confused by these queues.  The system of cell queues
> is going to get a little confusing, maybe.  Putting cells on the
> outgoing queue isn't always right, since some cells (e.g.,
> relay_data cells at an exit node) need to be handled locally rather
> than relaying them. So we need more new queues?

What if the main thread, when dequeuing encrypted cells, checked that
and decided whether each cell goes outbound or stays local for more
processing?

> 
> * Should the jobs be in some data structure other than an
> smartlist_t? A queue would seem to make more sense, since jobs are
> getting added and pulled off.  (Yes, protecting the data structure
> there with a lock makes sense.)

I would also propose using a priority queue, if only for the
aforementioned use case.

> 
> * If you're going to have separate locks, it's important to
> document how they nest, to prevent deadlock conditions.
> 
> * Presumably relaycrypt_job_t would need to have a pointer to the 
> actual circuit that needs work, and a note about whether it's a
> job for outbound or inbound cells.

Looks like there already is one? ..

> 
> * In the non-threaded-relaycrypt case, presumably the intention is 
> that there's a function that would otherwise queue a cell for
> crypto but instead just put it directly on the appropriate circuit
> queue?
> 
> Thanks again! I'll let you know if I think of anything else.

I have a question/remark about the circuit_t pointer in the
relaycrypt_job_s structure. I wonder if a lock might be needed: if the
circuit is closed while some cells are still being processed by a
relaycrypt worker thread, simply setting circ to NULL will not tell the
thread to stop spending CPU on this now-invalid circuit.

I don't really know whether Tor has some kind of refcount mechanism for
circuits, but in my experience setting a pointer to NULL while still
using objects associated with it is a "time trap". Cleanup becomes a
nightmare if a thread still holds references to a circuit's objects
that get invalidated while it holds them.

Would it be safer to simply have a STOP/CANCEL mechanism for worker
threads and avoid nullifying references that might still be in use?
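To make the idea concrete, here is a minimal sketch of such a cancel
flag. The struct layout and names are mine, purely illustrative, not
Tor's actual relaycrypt structures:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: instead of NULLing job->circ, the main thread
 * flips a cancel flag; the worker checks it between cells and drops
 * the job cleanly.  All names here are illustrative. */
typedef struct relaycrypt_job_s {
    void *circ;        /* owning circuit; stays valid until the job is freed */
    int cancelled;     /* set by the main thread when the circuit closes */
    int cells_done;    /* cells processed so far */
} relaycrypt_job_t;

/* Worker-side step: process one cell unless the job was cancelled.
 * Returns 1 if a cell was processed, 0 if the job should be dropped. */
int worker_step(relaycrypt_job_t *job)
{
    if (job->cancelled)
        return 0;          /* stop cleanly; the main thread frees the job */
    job->cells_done++;     /* stand-in for the real crypto work */
    return 1;
}
```

The circuit pointer stays valid until the job itself is torn down, so
the worker never dereferences freed memory.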

That's it for now. I'm quite happy to see this work going forward! :)

Cheers!
David

> 
> yrs,
Title: Multithreaded cryptographic scheme for Tor relay cells
Author: David Goulet <dgoulet@xxxxxxxxx>
Created: 18 September 2012
Status: Open

The master bug for this proposal is #1749.

Overview:

  This document proposes a multithreaded design for cell cryptographic jobs.
  The goal is to take advantage of SMP machines, especially the powerful
  servers running Tor exit nodes today. Moreover, almost all OSes running Tor
  nodes now support multithreading, which should be used to increase Tor
  performance.

  TODO: OUTLINE

1. Background

1.1 Overview of current cell crypto

  As stated in [WikiThreadCrypto], three main kinds of crypto occur:

  1) Onion skin processing (already parallelized on the server side).
  2) SSL handshakes and crypto.
  3) AES crypto on cells.

  This proposal focuses on the third kind: direct AES cell cryptography
  (encryption and decryption).

  The relay cell crypto is done in relay_crypt_one_payload(), which gets called
  at a few places, all of which are fortunately in relay.c making things easier
  to change.

  * circuit_package_relay_cell() gets called 1..N times for a cell that
    we're packaging from an edge connection and are about to put on a cell
    queue.

  * relay_crypt(...) runs when we have just received a relay cell on a
    circuit and are about to decrypt it, see whether we should handle it
    ourselves, and either process it or pass it on. If we handle it, we
    pass it to connection_edge_process_relay_cell(...). If we pass it on,
    we append it to a cell_queue_t.

  This second function is called by circuit_receive_relay_cell(...), which
  runs when we have just received a relay cell on a circuit.

  For pure relays, the relay_crypt() case dominates. For exit nodes, the
  circuit_package_relay_cell() case matters too.

  The whole point of using a new threading model is to offload the crypto
  work from the main loop and, as stated, take advantage of SMP systems.

2. Design

  Since cell order matters on a circuit, it makes sense to create two new
  queues in the circuit:

  * Crypto cell queue (Cq)
    The cell has NOT yet been through its encryption/decryption task.

  * Transport cell queue (Tq)
    The cell has passed through the crypto cell queue and is ready to be
    sent.

  The following diagram shows the cell movement once it arrives on circuit
  A, where Cq is the crypto queue and Tq is the transport queue. This is as
  simple as it gets; the tricky and fun part of this scheme is described in
  detail in Section 3, Threading and Synchronization.

                        circuit A
                       +----------------------------------+
  +--------+  enqueue  |    +------+------+     +-------+ |
  | cell c |---------->| Cq |      |      | ... |   c   | |
  +--------+           |    +------+------+     +-------+ |
                       +----------------------------------+
                                                 | dispatch(Cq, Tq)
                                                 |
               Thread pool                       v
               +-------+-------+     +--------+--------+
               |       |       | ... |        | circ A |
               +-------+-------+     +--------+--------+
                                                 |
                          exec crypto job() <--- |
                                                 v
                       +----------------------------------+
  +--------+   dequeue | +-------+     +------+------+    |
  | cell c |<----------| |   c   | ... |      |      | Tq |
  +--------+           | +-------+     +------+------+    |
                       +----------------------------------+
                        circuit A

  It goes like this: cell c arriving on circuit A is enqueued on A's
  crypto queue (Cq). Each circuit is dispatched to the thread pool, where
  it is assigned to an available thread that processes N (value TBD) cells
  from the Cq and enqueues them on the Tq once done.

  The main loop of Tor then only needs to empty the transport queue of
  each circuit on each of its rounds; the rest is handled by the crypto
  thread pool.

  This thread pool model is described in detail in the next section.
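  The two per-circuit queues can be sketched roughly as below. The struct
  names and int stand-ins for cell_t are hypothetical, chosen only to show
  the Cq-to-Tq movement with cell order preserved:

```c
#include <assert.h>
#include <string.h>

#define QUEUE_MAX 64

/* Stand-in cell queue; real cells would be cell_t, not int. */
typedef struct cell_queue_s {
    int cells[QUEUE_MAX];
    int len;
} cell_queue_t;

typedef struct circuit_s {
    cell_queue_t crypto_q;     /* Cq: cells awaiting encryption/decryption */
    cell_queue_t transport_q;  /* Tq: cells ready to be written out */
} circuit_t;

/* Worker step: move up to n cells from Cq to Tq, oldest first, so the
 * cell order on the circuit is preserved. */
void process_crypto_queue(circuit_t *circ, int n)
{
    int moved = 0;
    while (moved < n && circ->crypto_q.len > 0) {
        int cell = circ->crypto_q.cells[0];
        memmove(circ->crypto_q.cells, circ->crypto_q.cells + 1,
                (circ->crypto_q.len - 1) * sizeof(int));
        circ->crypto_q.len--;
        /* real code would run relay_crypt_one_payload() here */
        circ->transport_q.cells[circ->transport_q.len++] = cell;
        moved++;
    }
}
```

  Because each worker drains one circuit's Cq in order, no reordering
  step is ever needed after the crypto is done.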

3. Threading and Synchronization

  The basic idea is to dispatch relay cells' cryptographic jobs to a
  thread pool upon reception, offloading the Tor main loop. The thread
  pool should handle circuits rather than cells: reordering cells after
  encryption is an unnecessary step that we absolutely want to avoid, for
  both complexity and latency reasons.

  The next section describes the threading model and dynamics between the
  dispatcher and the pool (worker threads).

3.1 Threading model

  This section first explains how the dispatcher works; the following
  section describes the pool's worker threads.

3.1.1 Dispatcher

  The dispatch mechanism is part of the main loop and has two roles.
  First, it assigns received cells to the crypto queue of the
  corresponding circuit. We have to make sure that the dispatcher is the
  one and only writer of that queue, so we can avoid complex
  synchronization between threads and keep the design simple for this
  proposal.

  New circuits are also managed by the dispatcher, which assigns each of
  them to a worker in the thread pool.

  The second role is to keep a priority queue of all worker threads,
  ordered by the number of cells they are currently handling across all
  of their crypto queues. After adding a cell or a new circuit to a
  thread, this queue is rebalanced. The rebalancing is tricky, since we
  might need to steal circuits from some threads and push them onto
  others. This process might be hard on lock contention, but for a first
  proposal it should be good enough.

  Section 3.3 describes the priority queue data structure proposed for the
  dispatcher.
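  The load-balancing half of the dispatcher can be sketched as follows.
  The array-based bookkeeping and all names here are hypothetical; a real
  implementation would use the priority queue from Section 3.3:

```c
#include <assert.h>

#define N_WORKERS 4

/* Total cells across each worker's crypto queues (illustrative). */
static int worker_load[N_WORKERS];

/* Dispatcher-side choice: a new circuit goes to the worker currently
 * holding the fewest cells. */
int pick_least_loaded_worker(void)
{
    int best = 0, i;
    for (i = 1; i < N_WORKERS; i++)
        if (worker_load[i] < worker_load[best])
            best = i;
    return best;
}

/* Called by the dispatcher when it queues a cell on a circuit owned by
 * worker w, so the ordering stays current for the next decision. */
void account_cell(int w)
{
    worker_load[w]++;
}
```

  A linear scan is O(n) in the number of workers; the min-heap of
  Section 3.3 would make the same lookup O(1).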

3.1.2 Worker thread

  The worker threads are much simpler. Each contains a priority queue of
  circuits ordered by EWMA value, so the quietest circuit gets handled
  first, which is an important design requirement for the current state
  of Tor.

  As proposed by Nick Mathewson, it might also be okay to apply the EWMA
  accounting strictly when adding cells to the transport queue (outbound).
  Tests should be done to see whether that makes things faster.

  Once the thread gets CPU time, it chooses N cells from the crypto
  priority queue depending on each circuit's load, processes them one at
  a time, and moves each to the transport queue (Tq) of the corresponding
  circuit once done. The value of N has to be determined by testing;
  another possibility is to make it vary with the relay's traffic rate or
  the available CPU power.

  Since cell insertion is done by the dispatcher and we want to put as
  little work on it as possible, the priority queue rebalancing should be
  done in the worker thread. So after each round of N cells, the queue
  should be rebalanced, since the dispatcher could have added more; this
  can be detected with a simple flag set in the priority queue.
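  The circuit selection step can be sketched as below. The structure and
  names are illustrative only, standing in for the worker's per-circuit
  priority queue:

```c
#include <assert.h>

/* Illustrative per-circuit state as seen by one worker thread. */
typedef struct worker_circ_s {
    double ewma;      /* EWMA of this circuit's recent cell rate */
    int pending;      /* cells waiting in this circuit's crypto queue */
} worker_circ_t;

/* Pick the quietest circuit that actually has work: lowest EWMA among
 * circuits with pending cells.  Returns -1 if nothing is pending. */
int pick_quietest(const worker_circ_t *circs, int n)
{
    int best = -1, i;
    for (i = 0; i < n; i++) {
        if (circs[i].pending == 0)
            continue;                     /* nothing to do on this one */
        if (best < 0 || circs[i].ewma < circs[best].ewma)
            best = i;
    }
    return best;
}
```

  With the min-heap of Section 3.3 the same choice would be an O(1) peek
  rather than a scan.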

3.2 Locking scheme

  For this first draft I will not propose a lockless mechanism, since
  that kind of change is non-trivial and maybe unnecessary for a first
  implementation. The locking scheme is therefore a simple lock for each
  proposed queue. We can improve it over time, including during the
  stabilization process.
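  As a sketch, each queue would carry its own plain mutex, taken only
  around that queue's operations so the nesting stays trivial (names
  hypothetical):

```c
#include <assert.h>
#include <pthread.h>

/* One lock per queue: an enqueue or dequeue takes only this queue's
 * mutex and nothing else, so no lock ordering issues can arise. */
typedef struct locked_queue_s {
    pthread_mutex_t lock;
    int len;              /* stand-in for the real queue contents */
} locked_queue_t;

void locked_queue_init(locked_queue_t *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->len = 0;
}

void locked_enqueue(locked_queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    q->len++;                       /* stand-in for the real insert */
    pthread_mutex_unlock(&q->lock);
}
```

  If profiling later shows contention on a hot queue, that single lock is
  the natural place to substitute a finer-grained or lockless scheme.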

3.3 Data structures

  For the priority queue, the minimum node is the most important one,
  since we have to process the quietest circuit first, as ordered by
  EWMA value.

  A binary heap makes sense, since the minimum value of a min-heap can be
  looked up in O(1). This structure is simple and well known, and it can
  be replaced later by all sorts of fancier structures such as a
  Fibonacci heap or a Brodal queue [BrodalQueue].
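  An array-backed binary min-heap keyed on EWMA might look like the
  following sketch (sizes and names illustrative): the minimum sits at
  index 0 for the O(1) peek, and insertion sifts up in O(log n).

```c
#include <assert.h>

#define HEAP_MAX 128

typedef struct minheap_s {
    double keys[HEAP_MAX];  /* EWMA values; smallest always at index 0 */
    int len;
} minheap_t;

void heap_push(minheap_t *h, double key)
{
    int i = h->len++;
    h->keys[i] = key;
    /* Sift the new key up while it is smaller than its parent. */
    while (i > 0 && h->keys[(i - 1) / 2] > h->keys[i]) {
        double tmp = h->keys[i];
        h->keys[i] = h->keys[(i - 1) / 2];
        h->keys[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

/* O(1) lookup of the quietest circuit's key. */
double heap_min(const minheap_t *h)
{
    return h->keys[0];
}
```

  A real version would store circuit pointers alongside the keys and add
  a pop operation, but the O(1) minimum is the property the design needs.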

4. Conclusion

  In a nutshell, this proposal is probably not the most efficient way to
  multithread crypto jobs in Tor, but its goal is to provide a simple and
  scalable implementation that can be improved over time with testing,
  measurement, and better algorithms, such as more efficient data
  structures or lockless mechanisms.

5. Acknowledgments

TODO

References:
----------------
[WikiThreadCrypto] Multithreaded crypto on Tor
	https://trac.torproject.org/projects/tor/wiki/org/projects/Tor/MultithreadedCrypto
[BrodalQueue] Gerth Stølting Brodal (1996) - Worst-Case Efficient Priority Queues
    https://users.info.unicaen.fr/~karczma/TEACH/Doc/brodal_okasaki.pdf