Because a well-behaved Tor spends most of its time in AES, and because our last AES benchmarks gave surprising results (basically, "OpenSSL 0.9.7 AES isn't very fast"), I thought it would be a good idea to benchmark again before the next 0.1.2.x stable release.

SUMMARY:

I found that OpenSSL 0.9.8[be] AES is uniformly faster than OpenSSL 0.9.7[bf] AES. I also found that, on the x86 hardware I have, 0.9.8e's AES implementation is significantly faster than our current implementation, whereas our current implementation seems to be slightly faster on PPC. I also found that -O3 helps a little everywhere, and a lot in some places. That's the next thing I'll look into.

METHODOLOGY:

I wrote a stupid benchmark function in aes.c to encrypt a million cell-sized chunks using our aes_crypt function, and timed it with the unix "time" command. I did this three times for each (computer, code) pair and took the median of the three runs. (A rough sketch of the shape of this benchmark appears after the conclusions.) Hardware, OpenSSL versions, and gcc versions are as noted. Everything was built with -O2 except as noted.

The optimizations considered were as follows:

{Not using OpenSSL}

  builtin:
    Use the reference "fast" copy of rijndaelEncrypt from
    rijndael-alg-fst.c version 3.

  use_rijndael_counter_optimization:
    As "builtin", but skip an encode/decode step when filling the AES
    buffer. (AES considers a 128-bit block as 4 32-bit integers;
    counter mode begins by encoding a 128-bit integer into a 128-bit
    block.) [This is what Tor does now.]

  <full unroll>:
    As use_rijndael_counter_optimization, but also define the
    FULL_UNROLL macro in order to enable some loop unrolling.

{Using OpenSSL}

  use_openssl_evp:
    Define the USE_OPENSSL_EVP macro in Tor's aes.c so that all crypto
    is handled by OpenSSL's EVP_EncryptUpdate() function.

  use_openssl_aes:
    Define the USE_OPENSSL_AES macro in Tor's aes.c so that all crypto
    is handled by OpenSSL's AES_encrypt() function.

RESULTS:

On Catbus, an Intel Core 2 Duo E6700, OpenSSL 0.9.8b, gcc 4.1:
  builtin:                             7.4s
  use_rijndael_counter_optimization:   7.3s
    + <full unroll>:                   6.8s
    + <full unroll, -O3>:              6.2s
  use_openssl_evp:                     5.3s
  use_openssl_aes:                     4.6s
    + <-O3>:                           4.4s

On Totoro, an Athlon XP 1700+ with OpenSSL 0.9.7f, gcc 4.0:
  builtin:                            17.5s
  use_rijndael_counter_optimization:  17.3s
    + <-O3>:                          17.3s
    + <full unroll>:                  18.6s
    + <full unroll, -O3>:             18.2s
  use_openssl_evp:                    23.0s
  use_openssl_aes:                    21.2s
    + <-O3>:                          20.2s
  use_openssl_aes, with 0.9.8e:       10.9s
    + <-O3>:                          10.0s

On Kushana, a 1.33 GHz G4 with OpenSSL 0.9.7b, gcc 4.0:
  builtin:                            11.9s
  use_rijndael_counter_optimization:  11.1s
    + <full unroll>:                  10.7s
    + <full unroll, -O3>:             10.7s
  use_openssl_evp:                    17.2s
  use_openssl_aes:                    13.3s
    + <-O3>:                          12.9s
  use_openssl_aes, with 0.9.8e:       12.0s
    + <-O3>:                          11.6s

CONCLUSIONS:

With OpenSSL 0.9.7f or earlier, it is a good idea to continue with our current approach. FULL_UNROLL helps in some places, but not in others. -O3 helps a little. Our current approach does around 15% better than the fastest OpenSSL-0.9.7f-based approach.

With OpenSSL 0.9.8b or later, on x86 platforms, it is a big win to use OpenSSL's AES_encrypt; it is about 37% faster than what we're doing now. Using -O3 helps a little.

On PPC G4, our current approach is still faster than OpenSSL, but only by about 8%, as opposed to 16% with OpenSSL 0.9.7. FULL_UNROLL is a good idea here.
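For concreteness, here is a minimal, self-contained sketch of the kind of loop being measured: a counter-mode keystream generator built on OpenSSL's AES_encrypt(), plus a driver that runs it over a million 512-byte cell-sized buffers so that "time" can report the cost. The ctr_state_t/ctr_crypt names are illustrative stand-ins, not the actual aes.c code or the actual benchmark hack.

/* Illustrative only: a CTR keystream loop over AES_encrypt() plus a
 * driver that encrypts a million 512-byte "cells", roughly the shape
 * of the throwaway benchmark described above.  Build against OpenSSL
 * and run under "time". */
#include <stdint.h>
#include <string.h>
#include <openssl/aes.h>

#define CELL_LEN 512

typedef struct ctr_state_t {
  AES_KEY key;            /* expanded 128-bit AES key */
  uint32_t counter[4];    /* 128-bit counter kept as four 32-bit words */
} ctr_state_t;

/* Encrypt (or decrypt; CTR is symmetric) len bytes of in into out. */
static void
ctr_crypt(ctr_state_t *st, const unsigned char *in, unsigned char *out,
          size_t len)
{
  unsigned char block[16], keystream[16];
  size_t i, j;
  for (i = 0; i < len; i += 16) {
    /* The reference code packs the counter words into a byte block
     * before every encryption; the "counter optimization" above hands
     * the 32-bit words straight to a modified rijndaelEncrypt and
     * skips this step. */
    for (j = 0; j < 4; ++j) {
      block[4*j+0] = (unsigned char)(st->counter[j] >> 24);
      block[4*j+1] = (unsigned char)(st->counter[j] >> 16);
      block[4*j+2] = (unsigned char)(st->counter[j] >> 8);
      block[4*j+3] = (unsigned char)(st->counter[j]);
    }
    AES_encrypt(block, keystream, &st->key);
    for (j = 0; j < 16 && i + j < len; ++j)
      out[i+j] = in[i+j] ^ keystream[j];
    /* Increment the big-endian 128-bit counter, starting at the
     * low-order word (index 3) and carrying upward. */
    for (j = 4; j-- > 0; )
      if (++st->counter[j] != 0)
        break;
  }
}

int
main(void)
{
  static unsigned char in[CELL_LEN], out[CELL_LEN];
  unsigned char key[16];
  ctr_state_t st;
  long i;

  memset(key, 7, sizeof(key));
  memset(&st, 0, sizeof(st));
  AES_set_encrypt_key(key, 128, &st.key);

  for (i = 0; i < 1000000; ++i)      /* a million cell-sized chunks */
    ctr_crypt(&st, in, out, CELL_LEN);
  return out[0];  /* keep the compiler from discarding the work */
}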
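And this is roughly what the three compile-time variants switch between when turning one counter block into keystream. The dispatcher and struct are hypothetical stand-ins for Tor's aes.c internals; the calls themselves (AES_encrypt(), EVP_EncryptUpdate(), the reference rijndaelEncrypt()) are the functions named above. Note that EVP_CIPHER_CTX was still a public struct in OpenSSL 0.9.x, and the builtin path needs rijndael-alg-fst.c at link time.

/* Illustrative dispatcher for the three variants benchmarked above. */
#include <stdint.h>
#include <openssl/aes.h>
#include <openssl/evp.h>

/* Prototype adapted from the reference rijndael-alg-fst.h (version 3.0),
 * which uses its own u32/u8 typedefs. */
void rijndaelEncrypt(const uint32_t rk[/*4*(Nr + 1)*/], int Nr,
                     const unsigned char pt[16], unsigned char ct[16]);

typedef struct aes_ctx_t {
  AES_KEY openssl_key;   /* USE_OPENSSL_AES: expanded by AES_set_encrypt_key() */
  EVP_CIPHER_CTX evp;    /* USE_OPENSSL_EVP: initialized for AES-128-ECB */
  uint32_t rk[44];       /* builtin: AES-128 round keys from the reference code */
  int nrounds;           /* 10 for AES-128 */
} aes_ctx_t;

/* Turn one 16-byte counter block into 16 bytes of keystream. */
static void
encrypt_one_block(aes_ctx_t *ctx, const unsigned char ctr[16],
                  unsigned char out[16])
{
#if defined(USE_OPENSSL_AES)
  AES_encrypt(ctr, out, &ctx->openssl_key);
#elif defined(USE_OPENSSL_EVP)
  int outl = 16;
  EVP_EncryptUpdate(&ctx->evp, out, &outl, ctr, 16);
#else /* builtin / use_rijndael_counter_optimization */
  rijndaelEncrypt(ctx->rk, ctx->nrounds, ctr, out);
#endif
}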
So, the code should basically do:

#if (recent openssl) && (x86 || x86_64)
# define USE_OPENSSL_AES
#elif (PPC)
# define USE_RIJNDAEL_COUNTER_OPTIMIZATION
# define FULL_UNROLL        <-- maybe
#else
# define USE_RIJNDAEL_COUNTER_OPTIMIZATION
#endif

Depending on what profiling method and what workload you use, we spend between 8% and 20% of our time in aes_crypt; if these results hold in the field, taking this approach will save us between 5% and 12% of our CPU time. Not bad.

THANKS:

To Ben Laurie, for confirming that I'm not nuts here. To Andy Polyakov, who, Ben tells me, is to thank for OpenSSL's asm AES implementations. And to the people who've been writing profiling-related mail to the list.

peace,
--
Nick Mathewson