
Re: [f-cpu] Register set revised



hi !

devik wrote:

I modified GCC to handle split set of two 1r1w
register sets.

why not two 2r1w ?

Humm .. it's much more complex to simulate write port
contention at the gcc level. Restricting ops to 2 separate
banks was simple.
And my reason for the whole experiment was efficiency.

"efficiency" is not a clear explanation, because one might want efficiency
in consumed power, power/price, power/performance, coding, ease of
implementation, etc....

I have several uses for fcpu (I'll probably have to downgrade
it to 32 bits) which I'd like to instantiate in a lower-end
FPGA (Xilinx Spartan).

if you want to make small CPUs, F-CPU is not a good target.
Furthermore there exists already a looot of 32-bit CPUs,
which are better suited for low power and small footprints.
The fpgacpu.org site is a good place to look at, for example,
because it gives several good tricks.
In this particular case, "efficiency" is about reduced FPGA size
(hence price) and consumed power, as a hit in code footprint
AND in execution time is tolerated (memory is cheap compared to FPGA).
F-CPU is not meant to compete in this branch.

When I have a budget of, say, 5,000 LCs and each 16 bits
of 1r1w storage uses one LC, I need 512 LCs for a 2r1w regset
(even partitioned) with 64x64 bits, but only half of that
(256 LCs) for two 1r1w banks (32x64 each). The 256 LCs saved
can then be used to implement a pipelined adder.
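
The LC figures above can be checked with a short sketch. It assumes,
as the post does, one LC per 16 bits of 1r1w storage, and it further
assumes that a 2r1w file is built by duplicating the 1r1w storage once
per read port (the function and constant names are illustrative only):

```python
BITS_PER_LC = 16  # assumption from the post: one LC holds 16 bits of 1r1w storage

def regfile_lcs(registers, width_bits, read_ports, banks=1):
    """LC count for `banks` register files, each copied once per read port."""
    bits_per_copy = registers * width_bits
    return banks * read_ports * bits_per_copy // BITS_PER_LC

unified = regfile_lcs(64, 64, read_ports=2)            # one 2r1w set, 64x64
split = regfile_lcs(32, 64, read_ports=1, banks=2)     # two 1r1w banks, 32x64 each

print(unified, split, unified - split)  # 512 256 256
```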

i fear that you're going to bump into haunted dark corners,
because the idea of downsizing F-CPU was abandoned
long ago. What would be the point when better suited
and functioning CPUs already exist ? On top of that, your
software will not be compatible with "regular" F-CPU cores.

But I agree that full 2r will always be at least a few
percent faster (no moves in function prologs). But this
is the only place (prologs) where the compiler is not able
to satisfy the constraints exactly (!).

everything counts !

This and another
one (I forgot): when a value is used several times and is
needed in both halves of the split.

in this case, it is like having a 2-address processor....

By restricting pointers to certain locations,
coding SW is more complex and less flexible
for example.

are you sure ? In some weird way by restricting registers
you add orthogonality ;) because there are fewer ways to do
a single thing.

i think that it is a wrong conclusion.
adding ways to do the same thing differently has an impact on large
instruction blocks : if one way is not possible, the other is still
practicable. Unfortunately, gcc might not be able to grasp all
these nuances.
Orthogonality goes along with reducing the number of coding rules.

And for the compiler it is simpler to cope with a few "pointer"
registers at fixed places than to check for usage of up
to 8 pointers within 64 registers (as needed by the
regno-cacheline mapping, if the proposal is still alive) !
We will still have to tell GCC which registers are usable as
pointers (say 28..35) because there is no other way to ensure
that the compiler will not allocate more than 8 pointers.

note that the number of pointer registers can evolve in the future...
that is why restricting the number of pointer registers is not a good idea.
you had better try to rely on pointer reuse and the LRU mechanism.

If one would at least mandate that pointers must be allocated
from end of register set backward for example it could simplify
that mapping, am I right ?

it would simplify mapping in some cases but not all the time.

I've done it in testing mode for binary ops and
stores and it seems that 70% of ops are ok.

the remaining 30% are what hurts most when you need them.
If this increases code size by 10% it's already a hit.
A more complex register set could give better results
and still avoid the problems of its size.

you are right. well, it seems that for smaller regsets (2r2w)
one can use partitioned 2r1w, or use 2x1r1w with a penalty
for two reads from a single bank, in embedded designs.

F-CPU has not been created for very small footprints,
but more for getting every small performance increase
by any reasonable means.

We can at least hint gcc to try to allocate registers
from opposite banks whenever it is possible without any
other insn generation (yet gcc can do it). It will not
hurt a full 2r design and will help 1r implementations.

i would rather split the register set in more sub-banks,
in order to increase the associativity and reduce the hit
to a smaller fraction. Maybe you should try with 16 or 8 banks,
and give us comparative results. In this case, my gut feeling
is that the hit is only marginal.
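
A rough way to see why more banks should shrink the hit is sketched
below. It is only a back-of-the-envelope model, not a measurement: it
assumes 64 registers interleaved over the banks (bank = register number
modulo the bank count), one read port per bank, and two source
operands picked uniformly among distinct registers. Real compiled code
is not uniform, and a bank-aware allocator should do better than this.

```python
from fractions import Fraction
from itertools import permutations

NUM_REGS = 64  # F-CPU register count, as discussed in the thread

def conflict_fraction(num_banks):
    """Fraction of ordered pairs of distinct source registers that
    land in the same bank (assumed mapping: bank = reg % num_banks)."""
    pairs = list(permutations(range(NUM_REGS), 2))
    same_bank = sum(1 for a, b in pairs if a % num_banks == b % num_banks)
    return Fraction(same_bank, len(pairs))

for banks in (2, 4, 8, 16):
    f = conflict_fraction(banks)
    print(banks, "banks:", f, "=", round(float(f), 3))
```

Under these assumptions the same-bank fraction drops from about one
half with 2 banks to under 5% with 16 banks.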

When we spend more time on it I believe that
we can reach about 90%.

i hope that a 4-bank 2r1w register set can give better results.
90% efficiency is not enough. Don't forget that FC0
is a scalar CPU and an increase in code size has a direct impact
on performance. This is why the instruction set is so rich now.

By the way, are you still convinced of the 3rd read port's usefulness ?

yup, and particularly when SRB is implemented.
but if you don't need it, you can leave it alone in the beginning.

2w are definitely ok for oooc, but even in all the compiled code
I've looked at, I saw a 3r insn exactly once, in a loop like:
while(p->valid) { p+=off; ...use p here... }
which led to a post-increment by register with "off".

I also saw mac a few times. An MMX-like pmadd might be better
because it is 2r1w and imposes no RAW interdependency between
subsequent ones. But yes, it might be seen as "different"
because it changes the chunk size on the fly. On the other
hand, it supports widening multiply.

widening multiply means that there is no need to schedule a couple of MAC
instructions AND later combine the results back into a single register,
hence at least 2 clock cycles are saved. it's particularly important
in these small computation-intensive loops for 3D, sound and video...

but it breaks a lot of expected behaviours.
For example, the register allocator has much more pressure.
The unified register set is an important feature from the SW point of
view, even though it can be implemented in more or less smart ways ...

I agree that a unified regset is simpler. On the other hand, in light
of the fact that 90% of insn results are consumed only once,
you can utilize data flow information. In:
add r1,r2,r3
xor r3,r4,r5

you know the DF is tied to r3. Thus with a split set of 4x 1r1w
banks, with ROP2 available only from two banks, the compiler can
always rename r3 to whichever register is available as a source
for the xor.
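
The renaming idea can be sketched as follows. This is a simplified
illustration, not the GCC patch discussed here: it assumes 4 banks of
16 registers (bank = register number // 16), a single read port per
bank, and it ignores the extra ROP2 restriction. `bank` and `pick_temp`
are hypothetical names.

```python
REGS_PER_BANK = 16  # assumed layout: 64 registers in 4 banks of 16

def bank(reg):
    """Bank index of a register under the assumed layout."""
    return reg // REGS_PER_BANK

def pick_temp(other_source, free_regs):
    """Pick a free register for a single-use result so that it does not
    share a bank with the consumer's other source; None if impossible."""
    for reg in sorted(free_regs):
        if bank(reg) != bank(other_source):
            return reg
    return None

# add r1,r2,rT ; xor rT,r4,r5 : rT is produced once and consumed once,
# so any free register outside r4's bank can take the place of r3.
free = {3, 6, 17, 40}
print(pick_temp(other_source=4, free_regs=free))  # 17
```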

ouch, that's ugly ....
it's too specific to a particular implementation : if it is ever implemented,
porting to another architecture will become a nightmare.

my opinion is to perform this "transparently" with the core,
using more banks to reduce contentions and inserting "penalty cycles"
automatically to ensure that other cores can implement the register
set in a way that is more suitable to their particular case.

The trick is that reads could be limited to one register bank
while writes need to reach the whole set. The only real problem
is with function calls - you can't accurately place results
because the DF is broken at the call boundary.
Thus here one needs a unified set :-\

as long as you're doing a forked subset of F-CPU, that's ok for you
because you can't expect to get screaming performance out of 5K LCs.
the SW performance goes along....

For a superscalar design a more finely split regset seems more
attractive to me because you can do 4r4w with the same complexity
as 2r2w...

But well - stall detection is still missing in the latest
MR emulator. I'll do that first and then back my claims with
some specINT numbers ..

are you really going to run SPEC2K ? .... :-)

nice day,
devik

YG

*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/