
Re: [f-cpu] Register set revised



hi !

devik wrote:

if you want to make small CPUs, F-CPU is not a good target.
Furthermore, there already exist a lot of 32-bit CPUs
that are better suited to low power and small footprints.
The fpgacpu.org site is a good place to look, for example,
because it gives several good tricks.

oh well. As a strongly software-oriented guy I started to realize
such consequences just now. The architecture seemed rather
simple to me, but recently I started to see how complex it
can be at the logic level.

heh :-)

we did some serious work to make people believe that "F-CPU is simple",
just to get them interested and induce them to do some work ;-)

i would rather split the register set in more sub-banks,
in order to increase the associativity and reduce the hit
to a smaller fraction. Maybe you should try with 16 or 8 banks,
and give us comparative results. In this case, my gut feeling
is that the hit is only marginal.
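
As a rough feel for the "more banks, smaller hit" argument, here is a minimal sketch. It assumes 64 registers split evenly into banks and two distinct destination registers picked uniformly at random, which real code will not do, so treat the numbers as a ballpark only. The function name is made up for this example.

```python
from fractions import Fraction

NUM_REGS = 64  # e.g. 8 banks of 8 registers


def same_bank_probability(num_banks, num_regs=NUM_REGS):
    """Chance that two distinct, uniformly chosen registers share a bank."""
    regs_per_bank = num_regs // num_banks
    # Fix the first register: regs_per_bank - 1 of the remaining
    # num_regs - 1 registers sit in the same bank.
    return Fraction(regs_per_bank - 1, num_regs - 1)


if __name__ == "__main__":
    for banks in (2, 8, 16):
        p = same_bank_probability(banks)
        print(f"{banks:2d} banks: {p} (~{float(p):.1%})")
```

Under these assumptions the conflict chance drops from about one in two with 2 banks to one in nine with 8 banks. Measured instruction traces would be needed for real comparative results.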

it would be interesting. unfortunately it is far from being
simple to test more than 2 split sources ...

really ?
i think that for our case, that is, detecting whether 2 writes go to the same
bank, it is as simple as comparing the bank-select bits of the 2 write addresses
and seeing if they match. if there are 8 banks, then it is a 3-bit equality
comparator (3 XNORs and a 3-input AND).
This way it would be possible to have 8 banks of 8 registers with 2/3 read
outputs and 1 write input. welcome to the HW world ;-)
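
The same-bank check can be sketched in software, gate by gate. This is only an illustration: it assumes the low 3 bits of a 6-bit register number select the bank (the thread does not say which bits are used), and `bank_of` and `write_conflict` are names invented here.

```python
BANK_BITS = 3  # 8 banks -> 3 bank-select bits


def bank_of(reg):
    # Assumption: the low 3 bits of the register number select the bank.
    return reg & ((1 << BANK_BITS) - 1)


def write_conflict(reg_a, reg_b):
    """True when both writes target the same bank."""
    a, b = bank_of(reg_a), bank_of(reg_b)
    # One XNOR per bit position: 1 where the two bits agree.
    xnor_bits = [1 - (((a >> i) ^ (b >> i)) & 1) for i in range(BANK_BITS)]
    # 3-input AND: a conflict only if every bit pair agrees.
    return all(xnor_bits)


if __name__ == "__main__":
    print(write_conflict(3, 11))  # r3 and r11 both map to bank 3
    print(write_conflict(3, 4))   # banks 3 and 4 differ
```

A core could use such a signal to insert a penalty cycle when two results want the single write port of one bank.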

I also saw MAC a few times. An MMX-like pmadd might be better
because it is 2r1w and imposes no RAW dependency between
subsequent ones. But yes, it might be seen as "different"
because it changes the chunk size on the fly. On the other hand, it
supports widening multiply.

widening multiply means that there is no need to schedule a couple of MAC
instructions AND later combine the results back into a single register,
hence at least 2 clock cycles are saved. it's particularly important
in these small, computation-intensive loops for 3D, sound and video...
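
To make "widening" concrete, here is a small software model of the MMX pmaddwd behaviour: four signed 16-bit lanes from each of two sources are multiplied, and adjacent 32-bit products are summed into two 32-bit lanes, so it stays 2r1w. This is a sketch of the instruction's semantics, not of any F-CPU opcode, and the helper `to_signed` is made up for the example.

```python
def to_signed(value, bits):
    """Interpret the low `bits` of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def pmaddwd(a_words, b_words):
    """Four signed 16-bit lanes in, two signed 32-bit lanes out."""
    assert len(a_words) == len(b_words) == 4
    products = [to_signed(a, 16) * to_signed(b, 16)
                for a, b in zip(a_words, b_words)]
    # Adjacent products are summed; the sum wraps to 32 bits.
    return [to_signed(products[i] + products[i + 1], 32) for i in (0, 2)]


if __name__ == "__main__":
    print(pmaddwd([1, 2, 3, 4], [5, 6, 7, 8]))
```

Because the products are already widened and pair-summed in one instruction, no separate combine step is needed afterwards.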

yes, I agree, I overlooked that MAC is widening. I had
concluded earlier that it is not.

i did quite a bit of Pentium MMX coding and even discussed with
one of the architects. They had to make some early design decisions
and ISA definitions based on the then-available resources and limitations.
This is why the speed increase of MMX (only 2x on average compared
to normal "scalar" code) is so marginal. The widening MAC
is only one example : it takes 3 cycles to compute and it can be pipelined,
but the "butterfly" eats up more cycles. that's dumb !
This is one of the reasons why 3r2w is desirable.

with ROP2 available only from two banks, the compiler can always
rename r3 to whichever is available as source for xor.

ouch, that's ugly ....

yes, isn't it ? ;-)

it makes me think too much of "barebone VLIW" architectures ....
without even the advantages.

Ok back to be serious. I sometimes try
to think in non-conventional ways and often reinvent wheels ...

don't worry about that.

my opinion is to perform this "transparently" with the core,
using more banks to reduce contentions and inserting "penalty cycles"
automatically to ensure that other cores can implement the register
set in a way that is more suitable to their particular case.

you are right. My latest screams into the dark were caused
by my interest in having an ultra-simple design with reasonable
performance for SoC design. OpenRisc 1200 is still too
complex for me :-\ I didn't even find a suitable one on
fpgacpu.org.

i have seen an incredible 32-bit core on fpgacpu.org,
i think it's the XSOC. It is very small and the instruction
set is really limited, but it can compile some things.
it has no protection at all, but it fits into a few hundred
cells....

I'd like to have small linux servers in fpga to act as boundary
routers, net-connected video grabbers etc.

i have found some cheap Europa-format i486-DX66 boards in Berlin
and that's probably what you need (like a PC104 board or something
like that). And because it's almost a PC, development is much faster.

MR emulator. I'll first do it and then back my claims with
some specINT numbers ..

are you really going to run SPEC2K ? .... :-)

if someone is willing to privately donate a copy ;-)
But I'd probably measure a mix of some utils like gcc, grep
and compare with an Intel system with a known specint.

good luck ...

devik

YG
