
Re: [f-cpu] Register set revised



> >I modified GCC to handle split set of two 1r1w
> >register sets.
> >
> why not two 2r1w ?

Hmm... it's much more complex to simulate write-port
contention at the GCC level. Restricting ops to two separate
banks was simple.
My reason for the whole experiment was efficiency. I have
several uses for F-CPU (I'll probably have to downgrade it to
32 bits) which I'd like to instantiate in a lower-end
FPGA (Xilinx Spartan).
Say I have a budget of 5,000 LCs and each 16 bits
of 1r1w storage uses one LC. Then a 2r1w regset
(even partitioned) with 64x64 bits needs 512 LCs, but two
1r1w banks (32x64 each) need only half of that (256 LCs).
The 256 LCs saved can then be used to implement a
pipelined adder.
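
The LC arithmetic above can be checked with a small sketch. It
assumes, as the estimate does, that one LC holds 16 bits of 1r1w
storage and that each extra read port costs a full second copy of
the storage; the function and variable names are made up for
illustration.

```python
# Rough LC (logic cell) cost of a register file built from LUT RAM.
# Assumptions (illustrative, following the estimate in the text):
#   - one LC stores 16 bits with one read port and one write port (1r1w)
#   - every additional read port needs a full copy of the storage

BITS_PER_LC = 16


def regfile_lcs(num_regs, width_bits, read_ports):
    """LCs needed for one bank with `read_ports` read ports and one write port."""
    bits = num_regs * width_bits
    return (bits // BITS_PER_LC) * read_ports


# One unified 64x64 register set with two read ports.
unified_2r1w = regfile_lcs(64, 64, read_ports=2)

# Two separate 32x64 banks with one read port each.
split_1r1w = 2 * regfile_lcs(32, 64, read_ports=1)

# LCs freed by the split, e.g. for a pipelined adder.
saved = unified_2r1w - split_1r1w

print(unified_2r1w, split_1r1w, saved)
```

With these assumptions the split set costs half as many LCs as the
unified one, which matches the 512 vs. 256 figures.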

But I agree that a full 2r will always be at least a few
percent faster (no moves in function prologs). But prologs
are the only place where the compiler is not able
to satisfy the constraints exactly (!). That, and one other
case (which I forgot): when a value is used several times and
is needed in both halves of the split.

> By restricting pointers to certain locations,
> coding SW is more complex and less flexible
> for example.

Are you sure? In some weird way, by restricting registers
you add orthogonality ;) because there are fewer ways to do
a single thing.
And for the compiler it is simpler to cope with a few "pointer"
registers at a fixed place than to check for usage of up
to 8 pointers within 64 registers (as needed by the regno-cacheline
mapping, if that proposal is still alive)! We will still
have to tell GCC which registers are usable as pointers
(say 28..35) because there is no other way to ensure that
the compiler will not allocate more than 8 pointers.

If one would at least mandate that pointers must be allocated
from the end of the register set backward, for example, it could
simplify that mapping, am I right?

> >I've done it in testing mode for binary ops and
> >stores and it seems that 70% of ops are ok.
> >
> the remaining 30% are what hurts most when you need them.
> If this increases code size by 10% it's already a hit.
> A more complex register set could give better results
> and still avoid the problems of its size.

You are right. Well, it seems that for smaller regsets (2r2w)
one can use a partitioned 2r1w, or use 2x 1r1w with a penalty
for two reads from a single bank, for use in embedded designs.
We can at least hint GCC to try to allocate registers
from opposite banks whenever that is possible without generating
any extra insns (GCC can already do that). It will not
hurt a full 2r design and will help 1r implementations.
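
A minimal sketch of how the "ops are ok" percentage could be
measured for two 1r1w banks. The bank layout (r0..r31 in bank 0,
r32..r63 in bank 1), the helper names and the sample operand pairs
are all assumptions for illustration, not what the modified GCC
actually does.

```python
# Assumption: 64 registers, bank 0 = r0..r31, bank 1 = r32..r63.
REGS_PER_BANK = 32


def bank(regno):
    """Bank index of a register number."""
    return regno // REGS_PER_BANK


def single_cycle_ok(src_a, src_b):
    """True when a two-source op reads one operand from each 1r1w bank."""
    return bank(src_a) != bank(src_b)


def fraction_ok(ops):
    """Fraction of (src_a, src_b) pairs that need no extra read cycle or move."""
    ops = list(ops)
    if not ops:
        return 1.0
    return sum(single_cycle_ok(a, b) for a, b in ops) / len(ops)


# Made-up operand pairs; (2, 3) has both sources in bank 0.
sample = [(1, 40), (2, 3), (33, 5), (10, 50)]
print(fraction_ok(sample))
```

On a full 2r design the same allocation is simply always legal, so
the hint costs nothing there.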

> >When we will spend more time on it I believe that
> >we can reach about 90%.
> >
> i hope that a 4-bank 2r1w register set can give better results.
> 90% efficiency is not enough. Don't forget that FC0
> is a scalar CPU and increase of code size has a direct impact
> on performance. This is why the instruction set is so rich now.

By the way, are you still convinced of the usefulness of the
3rd read port? 2w is definitely OK for oooc, but in all the
compiled code I've seen, I saw a 3r insn exactly once, in a loop like:
    while(p->valid) { p+=off; ...use p here... }
which led to a postincrement by register with "off".

I also saw mac a few times. An MMX-like pmadd might be better
because it is 2r1w and imposes no RAW dependency between
subsequent ones. But yes, it might be seen as "different"
because it changes the chunk size on the fly. On the other hand,
it supports widening multiply.
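
To make the pmadd point concrete, here is a small model of a packed
multiply-add in the style of MMX pmaddwd: four signed 16-bit lanes
per source, adjacent products summed into two 32-bit lanes. It reads
two registers and writes one, with no accumulator input. The helper
names are made up, and this is only a sketch of the semantics, not a
proposed F-CPU encoding.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def pack16(lanes):
    """Pack 16-bit lanes into one integer, lane 0 in the lowest bits."""
    out = 0
    for i, v in enumerate(lanes):
        out |= (v & 0xFFFF) << (16 * i)
    return out


def pmadd(a, b):
    """64-bit packed multiply-add: 4x16-bit signed in, 2x32-bit out."""
    result = 0
    for pair in range(2):
        acc = 0
        for k in (0, 1):
            shift = 16 * (2 * pair + k)
            x = to_signed(a >> shift, 16)
            y = to_signed(b >> shift, 16)
            acc += x * y                      # 16x16 -> 32-bit widening product
        result |= (acc & 0xFFFFFFFF) << (32 * pair)
    return result


r = pmadd(pack16([1, 2, 3, 4]), pack16([5, 6, 7, 8]))
print(r & 0xFFFFFFFF, r >> 32)                # low and high 32-bit lanes
```

Two such ops in a row do not depend on each other's result, unlike
a chain of mac ops that all read and write the same accumulator.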

> but it breaks a lot of expected behaviours.
> For example, the register allocator has much more pressure.
> The unified register set is an important feature from the SW point of
> view, even though it can be implemented in more or less smart ways ...

I agree that a unified regset is simpler. On the other hand, in
light of the fact that 90% of insn results are consumed only once,
you can utilize data-flow information. In:
    add r1,r2,r3
    xor r3,r4,r5

you know the DF is tied to r3. Thus with a split set of 4x 1r1w
banks, with ROP2 available only from two banks, the compiler can
always rename r3 to whichever register is available as a source
for xor. The trick is that reads can be limited to a register bank
while writes need to reach the whole set. The only real problem is
with function calls: you can't accurately place results
because DF is broken at the call boundary.
Thus here one needs a unified set :-\
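
The renaming idea can be sketched as follows. This assumes four
banks of 16 registers, one read port per bank, writes that can reach
any bank, and the destination-last operand order that the add/xor
example implies; the function names and the free-register set are
hypothetical.

```python
NUM_BANKS = 4
REGS_PER_BANK = 16   # assumption: 64 registers split evenly over 4 banks


def bank(regno):
    """Bank index of a register number."""
    return regno // REGS_PER_BANK


def pick_dest(other_src, free_regs):
    """Pick a destination register for a result that is consumed once.

    The result will be read together with `other_src`, and each bank has
    a single read port, so the result must live in a different bank.
    Returns None when no free register qualifies.
    """
    for reg in sorted(free_regs):
        if bank(reg) != bank(other_src):
            return reg
    return None


# add r1,r2,rX ; xor rX,r4,r5  -> rX must not share a bank with r4.
new_r3 = pick_dest(other_src=4, free_regs={3, 6, 17, 40})
print(new_r3)
```

Across a function call the consumer's other operand is not known
when the result register is chosen, which is where this breaks down.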

For a superscalar design a more finely split regset seems more
attractive to me because you can do 4r4w with the same complexity
as 2r2w...

But well, stall detection is still missing in the latest
MR emulator. I'll do that first and then back my claims with
some SPECint numbers...

nice day,
devik
