Because a well-behaved Tor spends most of its time in AES, and because our last AES benchmarks gave surprising results (basically, "OpenSSL 0.9.7 AES isn't very fast"), I thought it would be a good idea to benchmark again before the next 0.1.2.x stable release.

SUMMARY:

I found that OpenSSL 0.9.8[be] AES is uniformly faster than OpenSSL 0.9.7[bf] AES. I also found that, on the x86 hardware I have, 0.9.8e's AES implementation is significantly faster than our current implementation, whereas our current implementation seems to be slightly faster on PPC. I also found that -O3 helps a little everywhere, and a lot in some places. That's the next thing I'll look into.

METHODOLOGY:

I wrote a stupid benchmark function in aes.c to encrypt a million cell-sized chunks using our aes_crypt function, and timed it with the unix "time" command. I did this three times for each (computer, code) pair and took the median of the three runs. (A rough sketch of the shape of this benchmark appears after the conclusions.) Hardware, OpenSSL versions, and gcc versions are as noted. Everything was built with -O2 except as noted.

The optimizations considered were as follows:

{Not using OpenSSL}

  builtin:
    Use the reference "fast" copy of rijndaelEncrypt from
    rijndael-alg-fst.c version 3.

  use_rijndael_counter_optimization:
    As "builtin", but skip an encode/decode step when filling the AES
    buffer. (AES considers a 128-bit block as 4 32-bit integers;
    counter mode begins by encoding a 128-bit integer into a 128-bit
    block.) [This is what Tor does now.]

  <full unroll>:
    As use_rijndael_counter_optimization, but also define the
    FULL_UNROLL macro in order to enable some loop unrolling.

{Using OpenSSL}

  use_openssl_evp:
    Define the USE_OPENSSL_EVP macro in Tor's aes.c so that all crypto
    is handled by OpenSSL's EVP_EncryptUpdate() function.

  use_openssl_aes:
    Define the USE_OPENSSL_AES macro in Tor's aes.c so that all crypto
    is handled by OpenSSL's AES_encrypt() function.

RESULTS:

On Catbus, an Intel Core 2 Duo E6700, OpenSSL 0.9.8b, gcc 4.1:
  builtin:                             7.4s
  use_rijndael_counter_optimization:   7.3s
    + <full unroll>:                   6.8s
    + <full unroll, -O3>:              6.2s
  use_openssl_evp:                     5.3s
  use_openssl_aes:                     4.6s
    + <-O3>:                           4.4s

On Totoro, an Athlon XP 1700+ with OpenSSL 0.9.7f, gcc 4.0:
  builtin:                            17.5s
  use_rijndael_counter_optimization:  17.3s
    + <-O3>:                          17.3s
    + <full unroll>:                  18.6s
    + <full unroll, -O3>:             18.2s
  use_openssl_evp:                    23.0s
  use_openssl_aes:                    21.2s
    + <-O3>:                          20.2s
  use_openssl_aes, with 0.9.8e:       10.9s
    + <-O3>:                          10.0s

On Kushana, a 1.33 GHz G4 with OpenSSL 0.9.7b, gcc 4.0:
  builtin:                            11.9s
  use_rijndael_counter_optimization:  11.1s
    + <full unroll>:                  10.7s
    + <full unroll, -O3>:             10.7s
  use_openssl_evp:                    17.2s
  use_openssl_aes:                    13.3s
    + <-O3>:                          12.9s
  use_openssl_aes, with 0.9.8e:       12.0s
    + <-O3>:                          11.6s

CONCLUSIONS:

With OpenSSL 0.9.7f or earlier, it is a good idea to continue with our current approach. FULL_UNROLL helps in some places, but not in others. -O3 helps a little. Our current approach does around 15% better than the fastest OpenSSL-0.9.7f-based approach.

With OpenSSL 0.9.8b or later, on x86 platforms, it is a big win to use OpenSSL's AES_encrypt; it is about 37% faster than what we're doing now. Using -O3 helps a little.

On PPC G4, our current approach is still faster than OpenSSL, but only by about 8%, as opposed to 16% with OpenSSL 0.9.7. FULL_UNROLL is a good idea here.
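For concreteness, here is a minimal, self-contained sketch of the kind of loop being measured: a counter-mode keystream generator built on OpenSSL's AES_encrypt(), plus a driver that runs it over a million 512-byte cell-sized buffers so that "time" can report the cost. The ctr_state_t/ctr_crypt names are illustrative stand-ins, not the actual aes.c code or the actual benchmark hack.

/* Illustrative only: a CTR keystream loop over AES_encrypt() plus a
 * driver that encrypts a million 512-byte "cells", roughly the shape
 * of the throwaway benchmark described above.  Build against OpenSSL
 * and run under "time". */
#include <stdint.h>
#include <string.h>
#include <openssl/aes.h>

#define CELL_LEN 512

typedef struct ctr_state_t {
  AES_KEY key;            /* expanded 128-bit AES key */
  uint32_t counter[4];    /* 128-bit counter kept as four 32-bit words */
} ctr_state_t;

/* Encrypt (or decrypt; CTR is symmetric) len bytes of in into out. */
static void
ctr_crypt(ctr_state_t *st, const unsigned char *in, unsigned char *out,
          size_t len)
{
  unsigned char block[16], keystream[16];
  size_t i, j;
  for (i = 0; i < len; i += 16) {
    /* The reference code packs the counter words into a byte block
     * before every encryption; the "counter optimization" above hands
     * the 32-bit words straight to a modified rijndaelEncrypt and
     * skips this step. */
    for (j = 0; j < 4; ++j) {
      block[4*j+0] = (unsigned char)(st->counter[j] >> 24);
      block[4*j+1] = (unsigned char)(st->counter[j] >> 16);
      block[4*j+2] = (unsigned char)(st->counter[j] >> 8);
      block[4*j+3] = (unsigned char)(st->counter[j]);
    }
    AES_encrypt(block, keystream, &st->key);
    for (j = 0; j < 16 && i + j < len; ++j)
      out[i+j] = in[i+j] ^ keystream[j];
    /* Increment the big-endian 128-bit counter, starting at the
     * low-order word (index 3) and carrying upward. */
    for (j = 4; j-- > 0; )
      if (++st->counter[j] != 0)
        break;
  }
}

int
main(void)
{
  static unsigned char in[CELL_LEN], out[CELL_LEN];
  unsigned char key[16];
  ctr_state_t st;
  long i;

  memset(key, 7, sizeof(key));
  memset(&st, 0, sizeof(st));
  AES_set_encrypt_key(key, 128, &st.key);

  for (i = 0; i < 1000000; ++i)      /* a million cell-sized chunks */
    ctr_crypt(&st, in, out, CELL_LEN);
  return out[0];  /* keep the compiler from discarding the work */
}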
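And this is roughly what the three compile-time variants switch between when turning one counter block into keystream. The dispatcher and struct are hypothetical stand-ins for Tor's aes.c internals; the calls themselves (AES_encrypt(), EVP_EncryptUpdate(), the reference rijndaelEncrypt()) are the functions named above. Note that EVP_CIPHER_CTX was still a public struct in OpenSSL 0.9.x, and the builtin path needs rijndael-alg-fst.c at link time.

/* Illustrative dispatcher for the three variants benchmarked above. */
#include <stdint.h>
#include <openssl/aes.h>
#include <openssl/evp.h>

/* Prototype adapted from the reference rijndael-alg-fst.h (version 3.0),
 * which uses its own u32/u8 typedefs. */
void rijndaelEncrypt(const uint32_t rk[/*4*(Nr + 1)*/], int Nr,
                     const unsigned char pt[16], unsigned char ct[16]);

typedef struct aes_ctx_t {
  AES_KEY openssl_key;   /* USE_OPENSSL_AES: expanded by AES_set_encrypt_key() */
  EVP_CIPHER_CTX evp;    /* USE_OPENSSL_EVP: initialized for AES-128-ECB */
  uint32_t rk[44];       /* builtin: AES-128 round keys from the reference code */
  int nrounds;           /* 10 for AES-128 */
} aes_ctx_t;

/* Turn one 16-byte counter block into 16 bytes of keystream. */
static void
encrypt_one_block(aes_ctx_t *ctx, const unsigned char ctr[16],
                  unsigned char out[16])
{
#if defined(USE_OPENSSL_AES)
  AES_encrypt(ctr, out, &ctx->openssl_key);
#elif defined(USE_OPENSSL_EVP)
  int outl = 16;
  EVP_EncryptUpdate(&ctx->evp, out, &outl, ctr, 16);
#else /* builtin / use_rijndael_counter_optimization */
  rijndaelEncrypt(ctx->rk, ctx->nrounds, ctr, out);
#endif
}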
So, the code should basically do:

#if (recent openssl) && (x86 || x86_64)
# define USE_OPENSSL_AES
#elif (PPC)
# define USE_RIJNDAEL_COUNTER_OPTIMIZATION
# define FULL_UNROLL        <-- maybe
#else
# define USE_RIJNDAEL_COUNTER_OPTIMIZATION
#endif

Depending on what profiling method and what workload you use, we spend between 8% and 20% of our time in aes_crypt; if these results hold in the field, taking this approach will save us between 5% and 12% of our CPU time. Not bad.

THANKS:

To Ben Laurie, for confirming that I'm not nuts here. To Andy Polyakov, who, Ben tells me, is to thank for OpenSSL's asm AES implementations. And to the people who've been writing profiling-related mail to the list.

peace,
--
Nick Mathewson