On Fri, Feb 16, 2007 at 05:35:50PM -0800, Christopher Layne wrote:
> On Fri, Feb 16, 2007 at 02:00:00PM -0800, Christopher Layne wrote:
> > Thought you guys might find this interesting. I did a couple of callgrind
> > runs on 2 different tor builds, 1 using -Os and the other using -O3. The
>
> So did a bit more research on spec'ing which cost models are default in
> callgrind and now have it logging jumps, asm instructions, and l1/l2/dram
> performance counters in the simulator. If anyone is interested on the
> machine specifically it's a 2.1 ghz Celeron-D (Prescott) running under
> Linux 2.6.20. I've rebuilt openssl, libz, and libevent with cranked up
> optimization/debug on, so more interesting things to look at.

Hi, Chris!

This is pretty neat stuff! If you can do more of this, it could help the
development team know how to improve speed. (Sorry about the delay in
answering; compiling kcachegrind took me way longer than it should have.)

A few questions.

1. What version of Tor is this? Performance data on 0.1.2.7-alpha or on
   svn trunk would help a lot more than data for 0.1.1.x, which I think
   this is. (I think this is the 0.1.1.x series because all the
   compression seems to be happening in tor_gzip_compress, whereas
   0.1.2.x does compression incrementally in tor_zlib_process.) There
   are already a lot of performance improvements (I think) in
   0.1.2.7-alpha, but there might be possible regressions too, and I'd
   like to catch them before we release... whereas it is not likely
   that we'll do anything besides security and stability fixes to
   0.1.1.x, since it's supposed to be a stable series.

2. How is this server configured? A complete torrc would help.

3. To what extent does -O3 help over -O2? Most users seem to compile
   with -O2, so we should probably change our flags if the difference
   is nontrivial.

4. Supposedly, KCachegrind can also visualize oprofile output. If this
   is true, and you could get it working, it might give more accurate
   information as to actual timing patterns, with fewer Heisenberg
   effects. (Even raw oprofile output would help, actually; I've put a
   sketch of the commands I mean further down, after the AES notes.)

Now, some notes on the actual data. Again, I'm guessing this is for Tor
0.1.1.x, so some of the results could be quite different for the
development series, especially if we fixed some stuff (which I think we
did) and especially if we introduced some stupid stuff (which happens
more than I'd like).

* It looks like most of our time is being spent, as an OR and directory
  server, in compression, AES, and RSA. To improve speed, our options
  are basically "make it faster" or "do it less" for each of these.

* AES isn't going to get used much less: a relay server still needs to
  AES-ctr-crypt each cell it gets three times: once for TLS for link
  secrecy on the inbound link, once with a circuit key for long-range
  secrecy, and once for TLS for link secrecy on the outbound link.
  This explains the pretty even breakdown between rijndaelEncrypt,
  _X86_AES_decrypt, and _X86_AES_encrypt in the results. (If you're not
  following me, read the design paper, or just trust me. ;) ) [We could
  _maybe_ save the middle encryption in some cases by a trick similar
  to what we use for CREATE_FAST cells, but it would only get rid of
  1/8 of the AES done by servers in toto, thus reducing the average
  server's AES load by at most 1/8.]

* Making AES faster would be pretty neat; the right way to go about it
  is probably to look hard at how OpenSSL is doing it, and see whether
  it can't be improved. Then again, the OpenSSL team is pretty clever,
  and it's not likely that there is a lot of low-hanging fruit to
  exploit here.
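  To make the "three times per cell" bookkeeping concrete, here's a
  minimal sketch of one AES-CTR pass using OpenSSL's pre-1.1 low-level
  interface (AES_ctr128_encrypt). This is illustration only: the keys
  and counters are dummies, Tor's real code lives in src/common/aes.c,
  and the TLS passes actually happen inside OpenSSL's record layer.

    /* Illustrative sketch: one AES-CTR pass per layer of crypto on a
     * relayed cell. Compile with: gcc aes_sketch.c -lcrypto */
    #include <openssl/aes.h>
    #include <string.h>

    #define CELL_LEN 512 /* Tor cell size in this era */

    static void ctr_pass(const unsigned char key[16], unsigned char *cell)
    {
        AES_KEY ks;
        unsigned char ctr[AES_BLOCK_SIZE] = {0};    /* counter state */
        unsigned char ecount[AES_BLOCK_SIZE] = {0}; /* keystream cache */
        unsigned int num = 0;
        AES_set_encrypt_key(key, 128, &ks);
        /* In CTR mode "encrypt" and "decrypt" are the same keystream
         * XOR, so this stands in for both directions. */
        AES_ctr128_encrypt(cell, cell, CELL_LEN, &ks, ctr, ecount, &num);
    }

    int main(void)
    {
        unsigned char k1[16] = {1}, k2[16] = {2}, k3[16] = {3}; /* dummies */
        unsigned char cell[CELL_LEN] = {0};

        ctr_pass(k1, cell); /* TLS layer, inbound link */
        ctr_pass(k2, cell); /* circuit key, long-range secrecy */
        ctr_pass(k3, cell); /* TLS layer, outbound link */
        return 0;
    }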
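  As an aside, on questions 3 and 4 above: something like the following
  is roughly what I have in mind. Exact flag names vary by valgrind and
  oprofile version (and I'm assuming a ./tor binary and torrc path), so
  treat this as a sketch rather than a recipe:

    # callgrind with cache/branch simulation turned on (valgrind 3.x;
    # older releases called the tool "calltree" and spelled the flags
    # differently):
    valgrind --tool=callgrind --cache-sim=yes --branch-sim=yes \
        ./tor -f ./torrc

    # system-wide oprofile run, 0.9.x-era opcontrol interface:
    opcontrol --init
    opcontrol --start
    # ... let the server carry traffic for a while ...
    opcontrol --shutdown
    opreport -l ./tor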
* So here's how RSA is getting used on my server right now:

    0 directory objects signed, 1643 directory objects verified,
    8 routerdescs signed, 20554 routerdescs verified,
    38 onionskins encrypted, 37631 onionskins decrypted,
    35148 client-side TLS handshakes, 29866 server-side TLS handshakes,
    0 rendezvous client operations, 70 rendezvous middle operations,
    0 rendezvous server operations.

  So it looks like verifying routerdescs, decrypting onionskins, and
  doing TLS handshakes are the big offenders for RSA. We've already cut
  down onionskin decryption as much as we can, except by having clients
  build circuits less often. To cut down on routerdesc verification, we
  need to have routers upload their descriptors, and have authorities
  replace descriptors, less often; there's already a lot of work in
  that direction, but I don't know if I've seen any numbers recently.
  We could cut down on TLS handshakes by using sessions, but that could
  hurt forward secrecy badly if we did it in a naive way. We could be
  smarter and use sessions with a very short expiration window (there's
  a sketch of that knob below, after the compression notes), but it's
  not clear whether that would actually help: somebody would need to
  find out how frequent TLS disconnect/reconnects are in comparison
  to ...

* Making RSA faster could also be fun for somebody. The core
  multiplication functions in openssl (bn_mul_add_words and
  bn_sq_comba8) are already in assembly, but it's conceivable that
  somebody could squeeze a little more out of them, especially on newer
  platforms. (Again, though, this is an area that smart people have
  already spent a lot of time in.)

* Finally, compression. Zlib is pretty tunable in how it makes the
  CPU/compression tradeoff, so it wouldn't be so hard to fine-tune the
  compression settings more thoroughly. Every admin I've asked, though,
  has said that they'd rather spend CPU to save bandwidth than vice
  versa. Another way to do less compression would be to make directory
  objects smaller and have them get fetched less often: there are some
  design proposals to do that in the next series, and I hope that
  people help beat them into some semblance of workability.
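  To show the zlib knobs I mean, here's a minimal sketch; the level,
  windowBits, and memLevel values below are placeholders for
  illustration, not what tor_gzip_compress actually passes:

    /* Sketch of zlib's CPU/ratio knobs.
     * Compile with: gcc zlib_sketch.c -lz */
    #include <zlib.h>
    #include <string.h>

    int compress_buf(const char *in, size_t inlen,
                     char *out, size_t outlen)
    {
        z_stream s;
        memset(&s, 0, sizeof(s));
        /* Level runs 1 (fast, larger output) to 9 (slow, smaller
         * output); 15+16 in windowBits requests a gzip wrapper;
         * memLevel trades memory for speed. */
        if (deflateInit2(&s, 6, Z_DEFLATED, 15+16, 8,
                         Z_DEFAULT_STRATEGY) != Z_OK)
            return -1;
        s.next_in = (Bytef *)in;    s.avail_in = (uInt)inlen;
        s.next_out = (Bytef *)out;  s.avail_out = (uInt)outlen;
        if (deflate(&s, Z_FINISH) != Z_STREAM_END) {
            deflateEnd(&s);
            return -1;
        }
        deflateEnd(&s);
        return (int)(outlen - s.avail_out); /* compressed length */
    }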
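  And here's the kind of short-expiration session caching I was
  alluding to in the TLS handshake discussion above: a sketch against
  OpenSSL's SSL_CTX interface, with a made-up 120-second window.
  Whether any window is short enough for our forward-secrecy needs is
  exactly the open question.

    /* Sketch of short-lived TLS session caching with OpenSSL; the
     * 120s window is an arbitrary example. */
    #include <openssl/ssl.h>

    void enable_short_session_cache(SSL_CTX *ctx)
    {
        /* Cache sessions on the server side... */
        SSL_CTX_set_session_cache_mode(ctx, SSL_SESS_CACHE_SERVER);
        /* ...but expire them quickly, so a resumed session is only
         * possible for reconnects within the window. */
        SSL_CTX_set_timeout(ctx, 120);
    }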
Again, many thanks for this information; I hope we'll see more like it
in the future!

peace,
--
Nick Mathewson