[f-cpu] Instruction census



Hi F-gang,

I compiled a set of C files (among them libelf and fctools) with fcpu-gcc
and started to count instructions.  The most frequently used single
instruction was an ordinary unconditional `move':

  98104 instructions total
  13079 move            13.3% ( 13.3% total)
  10846 load            11.0% ( 24.3% total)
   9741 loadcons         9.9% ( 34.3% total)
   9188 loadaddri        9.3% ( 43.6% total)
   8574 jmp              8.7% ( 52.4% total)
   8188 addi             8.3% ( 60.7% total)
   7386 jmp{cc}          7.5% ( 68.2% total)
   6169 loadconsx        6.2% ( 74.5% total)
   3545 storei           3.6% ( 78.1% total)
   3533 loadi            3.6% ( 81.7% total)
   3459 store            3.5% ( 85.3% total)
   2076 add              2.1% ( 87.4% total)
   1668 shiftli          1.7% ( 89.1% total)
   1523 xori             1.5% ( 90.6% total)
[...]
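
For anyone who wants to reproduce the percentage columns from the raw
counts, here is a minimal sketch.  It assumes the shares are truncated
to one decimal place rather than rounded, which is what the figures above
suggest (10846 of 98104 is 11.06%, listed as 11.0%).  The function names
are made up for illustration; this is not the script that produced the
census.

```c
#include <stdio.h>

/* Share of `count` in `total`, in tenths of a percent.  Integer division
 * truncates, so 11.06% becomes 110 (printed as 11.0), not 111.
 * count * 1000 stays well below 2^32 for totals of this size. */
unsigned long tenths_of_percent(unsigned long count, unsigned long total)
{
    return count * 1000UL / total;
}

/* Print one row with its own share and the running total.
 * *running accumulates the counts of all rows printed so far. */
void print_row(const char *name, unsigned long count,
               unsigned long total, unsigned long *running)
{
    unsigned long share, cumul;

    *running += count;
    share = tenths_of_percent(count, total);
    cumul = tenths_of_percent(*running, total);
    printf("%7lu %-15s %2lu.%lu%% (%3lu.%lu%% total)\n",
           count, name, share / 10, share % 10, cumul / 10, cumul % 10);
}
```

Calling print_row() once per instruction, in descending order of count and
with the same `running' variable (initially 0), gives a listing laid out
like the one above.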

The vast majority of instructions are load/store, add/sub/loadaddr[i],
loadcons[x], move and jmp - together, they make up 89% of the code.
Add the ROP2, SHL, CMP and INC functions and you'll reach 99.7% - that's
almost everything.  Here's a detailed list of execution unit (EU) usage:

  21383 lsu             21.7% ( 21.7% total)
  21031 asu             21.4% ( 43.2% total)
  15910 loadcons        16.2% ( 59.4% total)
  13079 move            13.3% ( 72.7% total)
   8574 jmp              8.7% ( 81.5% total)
   7386 jmp{cc}          7.5% ( 89.0% total)
   4095 rop2             4.1% ( 93.2% total)
   3496 shl              3.5% ( 96.7% total)
   1781 cmp              1.8% ( 98.6% total)
   1111 inc              1.1% ( 99.7% total)
    112 move{cc}         0.1% ( 99.8% total)
     86 idu              0.0% ( 99.9% total)
     60 imu              0.0% (100.0% total)
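
The EU figures are the per-instruction counts summed by execution unit.
A rough sketch of that grouping follows.  The opcode-to-EU mapping in it
is an assumption inferred from the numbers (the lsu and loadcons sums
match exactly), and the names `census_row' and `eu_total' are
hypothetical.  Only rows visible in the first listing are included, so
the asu sum comes out at 19452 rather than 21031; the rest presumably
sits in the elided rows (sub, loadaddr and friends).

```c
#include <string.h>

struct census_row {
    const char *opcode;   /* mnemonic as listed in the first table */
    const char *eu;       /* ASSUMED execution unit, inferred from the sums */
    unsigned long count;  /* count copied from the first table */
};

static const struct census_row rows[] = {
    { "load",      "lsu",      10846 },
    { "loadi",     "lsu",       3533 },
    { "store",     "lsu",       3459 },
    { "storei",    "lsu",       3545 },
    { "loadcons",  "loadcons",  9741 },
    { "loadconsx", "loadcons",  6169 },
    { "loadaddri", "asu",       9188 },
    { "addi",      "asu",       8188 },
    { "add",       "asu",       2076 },
};

/* Sum the counts of all listed opcodes assigned to the given EU. */
unsigned long eu_total(const char *eu)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i < sizeof rows / sizeof rows[0]; i++)
        if (strcmp(rows[i].eu, eu) == 0)
            sum += rows[i].count;
    return sum;
}
```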

This is mostly the profile I expected from standard software.  Note that
I compiled with `-O -fomit-frame-pointer'; otherwise, the result would
have been even worse (without optimization, the code is a hell of a mess).
With `-O3', the number of instructions increases slightly, while `-Os'
reduces it by approximately 5%; the profile is rather similar in either
case.

It's interesting to see that integer division (div/rem) is used more
often than multiplication: 86 idu versus 60 imu instructions (I didn't
expect that).  It's probably due to the heavy use of shift-and-add
sequences that gcc substitutes for simple multiplications by constants.
That's going to change, because a mul (or mac) instruction is often faster.
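
To illustrate what such a substitution looks like at the C level, here
is a small sketch.  The constant 10 and the function names are my own
choice for the example; which form fcpu-gcc actually emits for a given
constant depends on its cost tables and is not shown here.

```c
/* Multiply by 10 the way a shift-and-add expansion does it:
 * 10*x = 8*x + 2*x = (x << 3) + (x << 1).
 * That is two shifts and one add instead of a single multiply. */
unsigned long mul10_shift_add(unsigned long x)
{
    return (x << 3) + (x << 1);
}

/* The same computation written as a plain multiplication.  Unsigned
 * arithmetic wraps modulo 2^N, so both functions agree for every x. */
unsigned long mul10_mul(unsigned long x)
{
    return x * 10UL;
}
```

The expanded form accounts for shift and add instructions in the census
but for no multiplication, which would explain the low imu count.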

Another interesting fact is that 1/4 of the multiplications are actually
`mac' operations (most of them of the kind where all operands have the
same size).  One can also observe that add, sub, xor and shift[lr] are
used mostly in their immediate form, while `or' isn't.  Signed compares
are used less often than unsigned ones, and left shifts occur twice as
often as right shifts (again due to shift-and-add, I guess).

Goals for optimization (IMHO):

	- reduce the number of load/store instructions
	- increase the number of conditional moves (in favor of jmp{cc})
	- avoid shift-and-add where mul/mac is faster
	- make use of the divrem[s] instruction
	- make use of SIMD instructions
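
Two of these goals can be sketched in C source terms.  This is only an
illustration under assumptions: whether fcpu-gcc turns the select form
into a move{cc}, or a quotient/remainder pair into a single divrem, is
up to the compiler and not verified here.  The function names are
invented for the example; div() is the standard C library function.

```c
#include <stdlib.h>

/* Maximum written as control flow: a natural candidate for jmp{cc}. */
int max_branch(int a, int b)
{
    if (a < b)
        return b;
    return a;
}

/* Maximum written as a select: both values are already computed and one
 * is picked, which is the pattern a conditional move implements. */
int max_select(int a, int b)
{
    return a < b ? b : a;
}

/* Quotient and remainder of the same operands, obtained together.
 * div() returns both in one div_t; a combined divide/remainder
 * instruction could deliver them in one operation. */
void split_minutes(int total, int *hours, int *minutes)
{
    div_t qr = div(total, 60);

    *hours = qr.quot;
    *minutes = qr.rem;
}
```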

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"