Re: [f-cpu] reg. rotation [Was: New suggestion about call convention]

> Add register rotation. I'll try to explain my idea even though I know
> that rotation/renaming will probably need an extra pipeline stage before
> decode. Maybe the performance gain will be greater than the losses
> from the additional 1-cycle latency of jumps.
> And maybe someone clever will find a way to implement the idea
> without an extra stage.

> Suppose that r32...r63 (for now) can be rotated with a granularity
> of 2 regs (because of register pairing) by adding a 5-bit constant ROFF
> somewhere in fetch, decode or a new stage.
> We could manipulate the constant ROFF with the instruction

> circ n;  n is an even int in 0...30 and the instruction performs
>          ROFF=ROFF+n in unsigned unsaturated (wrapping) arithmetic

> Note that bit 0 of ROFF is always 0, so the adder is really only
> 4 bits wide. I think (however, I'm not a HW expert) that the HW needed
> is trivial and has no impact on other parts of f-cpu. It is like register
> renaming of the instruction stream before it hits f-cpu as we currently know it.

It looks like a complex instruction that seems to have a lot of effect on the
CPU, I think. But I currently don't see where you can expect better performance.
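
To check my reading of the proposal, here is a minimal Python sketch of the
renaming as I understand it. The mapping formula (offset within r32...r63 plus
ROFF, modulo 32) and the names Renamer, circ and physical are assumptions of
this sketch; the proposal only says that ROFF is added "somewhere in fetch,
decode or new stage".

```python
WINDOW_BASE = 32  # first rotating register (r32), per the proposal
WINDOW_SIZE = 32  # r32..r63 rotate; r0..r31 are left untouched


class Renamer:
    """Hypothetical model of the ROFF register-rotation idea."""

    def __init__(self):
        self.roff = 0  # 5-bit offset; bit 0 stays 0 because n is always even

    def circ(self, n):
        """Model of 'circ n': ROFF = ROFF + n with 5-bit wrap-around."""
        if n % 2 != 0 or not 0 <= n <= 30:
            raise ValueError("n must be an even integer in 0..30")
        self.roff = (self.roff + n) % WINDOW_SIZE

    def physical(self, reg):
        """Map a register number from the instruction stream to a physical one.

        Assumption: the rotation is (reg - 32 + ROFF) mod 32 inside the window.
        """
        if not 0 <= reg <= 63:
            raise ValueError("register number must be in 0..63")
        if reg < WINDOW_BASE:
            return reg  # fixed registers are not renamed
        return WINDOW_BASE + (reg - WINDOW_BASE + self.roff) % WINDOW_SIZE


renamer = Renamer()
renamer.circ(4)
print(renamer.physical(32), renamer.physical(62), renamer.physical(5))
```

Because ROFF is always even, an even/odd register pair stays a pair after
renaming, which is the point of the 2-register granularity.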

> I make some assumptions about the LSU (because unfortunately I found
> no description of it). Example 1 shows the best case (the return
> address could probably be saved in a rotated local reg. too) and you
> can see that it can be really fast.

If I understand your example correctly, you want to avoid having to move
parameters into a saved register, right? But you still do the memory
transfer and spend one cycle in the decoder to do this, plus a cycle for the
circ instruction. I don't see where you expect to get better performance.

> The second example (more common) is still good IMHO because the stores
> and loads (which could be done better by the SRB) operate on registers
> that are not in use just now. So if the LSU can post a read
> operation (assuming the TLB is OK) and block only if the register is
> mentioned in another instruction, then we could have a lot of time
> for an (uncached) load.

Same remark for this example.

> Additionally, the compiler knows the pool size of all subroutines it
> will call, so it can iteratively start to prepare that size
> during its own progress, at places where its own code has low
> ILP.

If I understand correctly, you are thinking that parameters will often be
saved and then restored after a call, right? I am not sure that's the most
common case. As you explained previously, the only place where it can be
interesting is in libraries. But they have short code, and most of them
will only use temporaries. (I think we need some statistics about software on
RISC CPUs. Does anybody know a tool or something for that purpose?)

> It is simpler than an IA64-like auto-spilled set but should
> still allow the CPU to adapt to the current code and use almost
> all registers.

From my point of view, the idea isn't to use all of them, but to use them to
reduce memory operations and pipeline bubbles. And I think, but
perhaps I am wrong, that a store doesn't cost anything (only a load introduces
a bubble, from what I remember about whygee's LSU).

I think that I don't understand your idea very well. Can you please give us a
timing sample with multiple calls, showing what results you expect and where
you expect them?


PS: A macro that does a call will certainly look like this:
	.macro call name
		loadaddri name, t0
		store [sp], ra
		jmp [t0], ra
	.endm
