[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[f-cpu] reg. rotation [Was: New suggestion about call convention]

I read a discussion sometimes and just read about call

Michael's way was silently accepted - transposing calee saved
registers at link time and generating prolog/epilogs for external

While it is sexy idea, when you look at apache or linux kernel
source code for example you will see that in many functions
there are indirect calls to other functions. For these you
can't use the optimalisation.
For GTK, GLIBC and others you have many functions in library,
often with wery short code (where prolog/epilog really counts)
which you can't optimize in this way too.

Also it don't address parameter registers - assuming that most
of functions uses 3 arguments then the first three argument registers
will be very exposed - you will need often to copy them to another
registers to free space for nested subroutine call.

Add register rotation. I'll try to explain my idea even if I know
that rotation/renaming will probably eat next pipeline stage before
decode. Maybe performance saving of it will be greater than looses
from additional 1 cycle latency of jumps.
And maybe someone clever will find way how to implement the idea
without next stage.

Suppose that r32...r63 (for now) can be rotated with granularity
2 regs (because of register pairing) by adding 5 bit constant ROFF
somewhere in fetch, decode or new stage.
We could manipulate the constant ROFF by instruction

circ n;  n is even int between 0...30 and instruction performs
         ROFF=ROFF+n in unsigned unsatureated (wrapped) arithmetic

Note that 0th bit of ROFF is always 0 so that adder is 4bit in
reality. I think (however I'm not HW expert) that HW needed is trivial
and without impact on other parts of f-cpu. It is like instruction
stream register renaming before it hits f-cpu as we know is currently.

r0..31 are system and scratch (caller saved) registers. I don't want
to consider fine grained use just now.

Rotated r32..r63 are parameter and calee-saved (local) registers.
We have several calling conventions attached to function by keyword
__poolsize(N) in GCC (for example) where N is even number from 0 to
32. Default would be 8 for example.

A function with __poolsize(N) is guaranted that is will be called with
r{64-N}..r{63} saved (not used).
If such function needs L registers as locals is calls "circ L" and
if L>N then it needs to save these L-N unsaved registers by itself.
function __poolsize(4) B(x,y,z);
function __poolsize(8) A(a,b,c) {
leads to:
A:       ; call with r32=a,r33=b,r34=c
add r33,r34,r62
sub r33,r32,r63
circ 2   ; now r32=b+c,r33=b-a, r34=a, r35=b, r36=c
call to B  (not sure with syntax)
circ -2
ret       (dtto)

If A would be __poolsize(4) too then we would have to save some other
registers before:
A:       ; call with r32=a,r33=b,r34=c
add r33,r34,r62
sub r33,r32,r63
store [sp],r58
store [sp],r59    ; to assure that B will get poolsize 4 too
circ 2   ; now r32=b+c,r33=b-a, r34=a, r35=b, r36=c
call to B  (not sure with syntax)
circ -2
load [sp],r59     ; restore
load [sp],r58

I make some assumptions about LSU (because I found no description
of it unfortunately). Example 1 shows the best case (probably
return address could be saved in rotated local reg. too) and you
can see that is can be really fast.

Secons example (more common) is still good IMHO because stores
and loads (could be done by SRB better) operates on registers
which are not in use just now. So that if LSU can post read
operation (assuming TLB is ok) and block only if registed is
mentioned in other instruction then we could have a lot of time
for (uncached) load.

Additionaly compiler knows poolsize of all subroutines it will
call so that it can iteratively start to prepare that size
during its own progress at places where its own code has low

It is simpler than IA64 like auto-spilled set but should
still allow CPU to adapt to current code and use almost
all registers.

One could do sw pipelining with that. Like:

store [r21++],r33
add r32,r22,r33
load [r20++],r34
circ 2
loop here

the snippet is wrong probably I only wanted you catch
the idea - it adds r22 to all nums at r21=r20. Additinaly
is uses registers r32..r63 as cache to hide potentional
latency of load (up to 15 cycles).
I'm not sure what would say current LSU however...

best regards,

To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/