On Fri, Feb 16, 2007 at 05:35:50PM -0800, Christopher Layne wrote:
> On Fri, Feb 16, 2007 at 02:00:00PM -0800, Christopher Layne wrote:
> > Thought you guys might find this interesting. I did a couple of callgrind
> > runs on 2 different tor builds, 1 using -Os and the other using -O3. The
>
> So did a bit more research on spec'ing which cost models are default in
> callgrind and now have it logging jumps, asm instructions, and l1/l2/dram
> performance counters in the simulator. If anyone is interested on the
> machine specifically it's a 2.1 ghz Celeron-D (Prescott) running under
> Linux 2.6.20. I've rebuilt openssl, libz, and libevent with cranked up
> optimization/debug on, so more interesting things to look at.

Hi, Chris!

This is pretty neat stuff! If you can do more of this, it could help the
development team know how to improve speed. (Sorry about the delay in
answering; compiling kcachegrind took me way longer than it should have.)

A few questions.

1. What version of Tor is this? Performance data on 0.1.2.7-alpha or on
   svn trunk would help a lot more than data for 0.1.1.x, which I think
   this is. (I think this is the 0.1.1.x series because all the
   compression seems to be happening in tor_gzip_compress, whereas
   0.1.2.x does compression incrementally in tor_zlib_process.) There
   are already a lot of performance improvements (I think) in
   0.1.2.7-alpha, but there might be possible regressions too, and I'd
   like to catch them before we release... whereas it is not likely
   that we'll do anything besides security and stability fixes to
   0.1.1.x, since it's supposed to be a stable series.

2. How is this server configured? A complete torrc would help.

3. To what extent does -O3 help over -O2? Most users seem to compile
   with -O2, so we should probably change our flags if the difference
   is nontrivial.

4. Supposedly, KCachegrind can also visualize oprofile output. If this
   is true, and you could get it working, it might give more accurate
   information as to actual timing patterns, with fewer Heisenberg
   effects. (Even raw oprofile output would help, actually; I've put a
   sketch of the commands I mean further down, after the AES notes.)

Now, some notes on the actual data. Again, I'm guessing this is for Tor
0.1.1.x, so some of the results could be quite different for the
development series, especially if we fixed some stuff (which I think we
did) and especially if we introduced some stupid stuff (which happens
more than I'd like).

* It looks like most of our time is being spent, as an OR and directory
  server, in compression, AES, and RSA. To improve speed, our options
  are basically "make it faster" or "do it less" for each of these.

* AES isn't going to get used much less: a relay server still needs to
  AES-ctr-crypt each cell it gets three times: once for TLS for link
  secrecy on the inbound link, once with a circuit key for long-range
  secrecy, and once for TLS for link secrecy on the outbound link.
  This explains the pretty even breakdown between rijndaelEncrypt,
  _X86_AES_decrypt, and _X86_AES_encrypt in the results. (If you're not
  following me, read the design paper, or just trust me. ;) ) [We could
  _maybe_ save the middle encryption in some cases by a trick similar
  to what we use for CREATE_FAST cells, but it would only get rid of
  1/8 of the AES done by servers in toto, thus reducing the average
  server's AES load by at most 1/8.]

* Making AES faster would be pretty neat; the right way to go about it
  is probably to look hard at how OpenSSL is doing it, and see whether
  it can't be improved. Then again, the OpenSSL team is pretty clever,
  and it's not likely that there is a lot of low-hanging fruit to
  exploit here.
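  To make the "three times per cell" bookkeeping concrete, here's a
  minimal sketch of one AES-CTR pass using OpenSSL's pre-1.1 low-level
  interface (AES_ctr128_encrypt). This is illustration only: the keys
  and counters are dummies, Tor's real code lives in src/common/aes.c,
  and the TLS passes actually happen inside OpenSSL's record layer.

    /* Illustrative sketch: one AES-CTR pass per layer of crypto on a
     * relayed cell. Compile with: gcc aes_sketch.c -lcrypto */
    #include <openssl/aes.h>
    #include <string.h>

    #define CELL_LEN 512 /* Tor cell size in this era */

    static void ctr_pass(const unsigned char key[16], unsigned char *cell)
    {
        AES_KEY ks;
        unsigned char ctr[AES_BLOCK_SIZE] = {0};    /* counter state */
        unsigned char ecount[AES_BLOCK_SIZE] = {0}; /* keystream cache */
        unsigned int num = 0;
        AES_set_encrypt_key(key, 128, &ks);
        /* In CTR mode "encrypt" and "decrypt" are the same keystream
         * XOR, so this stands in for both directions. */
        AES_ctr128_encrypt(cell, cell, CELL_LEN, &ks, ctr, ecount, &num);
    }

    int main(void)
    {
        unsigned char k1[16] = {1}, k2[16] = {2}, k3[16] = {3}; /* dummies */
        unsigned char cell[CELL_LEN] = {0};

        ctr_pass(k1, cell); /* TLS layer, inbound link */
        ctr_pass(k2, cell); /* circuit key, long-range secrecy */
        ctr_pass(k3, cell); /* TLS layer, outbound link */
        return 0;
    }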
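  As an aside, on questions 3 and 4 above: something like the following
  is roughly what I have in mind. Exact flag names vary by valgrind and
  oprofile version (and I'm assuming a ./tor binary and torrc path), so
  treat this as a sketch rather than a recipe:

    # callgrind with cache/branch simulation turned on (valgrind 3.x;
    # older releases called the tool "calltree" and spelled the flags
    # differently):
    valgrind --tool=callgrind --cache-sim=yes --branch-sim=yes \
        ./tor -f ./torrc

    # system-wide oprofile run, 0.9.x-era opcontrol interface:
    opcontrol --init
    opcontrol --start
    # ... let the server carry traffic for a while ...
    opcontrol --shutdown
    opreport -l ./tor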
* So here's how RSA is getting used on my server right now:

    0 directory objects signed, 1643 directory objects verified,
    8 routerdescs signed, 20554 routerdescs verified,
    38 onionskins encrypted, 37631 onionskins decrypted,
    35148 client-side TLS handshakes, 29866 server-side TLS handshakes,
    0 rendezvous client operations, 70 rendezvous middle operations,
    0 rendezvous server operations.

  So it looks like verifying routerdescs, decrypting onionskins, and
  doing TLS handshakes are the big offenders for RSA. We've already cut
  down onionskin decryption as much as we can, except by having clients
  build circuits less often. To cut down on routerdesc verification, we
  need to have routers upload their descriptors, and have authorities
  replace descriptors, less often; there's already a lot of work in
  that direction, but I don't know if I've seen any numbers recently.
  We could cut down on TLS handshakes by using sessions, but that could
  hurt forward secrecy badly if we did it in a naive way. We could be
  smarter and use sessions with a very short expiration window (there's
  a sketch of that knob below, after the compression notes), but it's
  not clear whether that would actually help: somebody would need to
  find out how frequent TLS disconnect/reconnects are in comparison
  to ...

* Making RSA faster could also be fun for somebody. The core
  multiplication functions in openssl (bn_mul_add_words and
  bn_sq_comba8) are already in assembly, but it's conceivable that
  somebody could squeeze a little more out of them, especially on newer
  platforms. (Again, though, this is an area that smart people have
  already spent a lot of time in.)

* Finally, compression. Zlib is pretty tunable in how it makes the
  CPU/compression tradeoff, so it wouldn't be so hard to fine-tune the
  compression settings more thoroughly. Every admin I've asked, though,
  has said that they'd rather spend CPU to save bandwidth than vice
  versa. Another way to do less compression would be to make directory
  objects smaller and have them get fetched less often: there are some
  design proposals to do that in the next series, and I hope that
  people help beat them into some semblance of workability.
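  To show the zlib knobs I mean, here's a minimal sketch; the level,
  windowBits, and memLevel values below are placeholders for
  illustration, not what tor_gzip_compress actually passes:

    /* Sketch of zlib's CPU/ratio knobs.
     * Compile with: gcc zlib_sketch.c -lz */
    #include <zlib.h>
    #include <string.h>

    int compress_buf(const char *in, size_t inlen,
                     char *out, size_t outlen)
    {
        z_stream s;
        memset(&s, 0, sizeof(s));
        /* Level runs 1 (fast, larger output) to 9 (slow, smaller
         * output); 15+16 in windowBits requests a gzip wrapper;
         * memLevel trades memory for speed. */
        if (deflateInit2(&s, 6, Z_DEFLATED, 15+16, 8,
                         Z_DEFAULT_STRATEGY) != Z_OK)
            return -1;
        s.next_in = (Bytef *)in;    s.avail_in = (uInt)inlen;
        s.next_out = (Bytef *)out;  s.avail_out = (uInt)outlen;
        if (deflate(&s, Z_FINISH) != Z_STREAM_END) {
            deflateEnd(&s);
            return -1;
        }
        deflateEnd(&s);
        return (int)(outlen - s.avail_out); /* compressed length */
    }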
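  And here's the kind of short-expiration session caching I was
  alluding to in the TLS handshake discussion above: a sketch against
  OpenSSL's SSL_CTX interface, with a made-up 120-second window.
  Whether any window is short enough for our forward-secrecy needs is
  exactly the open question.

    /* Sketch of short-lived TLS session caching with OpenSSL; the
     * 120s window is an arbitrary example. */
    #include <openssl/ssl.h>

    void enable_short_session_cache(SSL_CTX *ctx)
    {
        /* Cache sessions on the server side... */
        SSL_CTX_set_session_cache_mode(ctx, SSL_SESS_CACHE_SERVER);
        /* ...but expire them quickly, so a resumed session is only
         * possible for reconnects within the window. */
        SSL_CTX_set_timeout(ctx, 120);
    }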
Again, many thanks for this information; I hope we'll see more like it
in the future!

peace,
--
Nick Mathewson