
Re: Tr:[f-cpu] usage of 64 registers & ILP



Hello Yann,

First, I'd like to thank you for your detailed and
helpful replies.

> > terms (I have to learn Tomasu... - can't remember - and other
> > similar algorithms to keep track).
> 
> FC0 doesn't use Tomasulo (?) but a scoreboard system:
> a resource is free or not, and the instruction is "issued"
> (enters the "execution pipeline" and executes in a straight line)
> if all the necessary resources are available.

Ohh, here was one of my misunderstandings :o) So a RAW/WAW
dependency causes a stall just before issue .. so all my
fears were void ;)
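
A minimal sketch of that issue rule as I understand it from the
description above. This is a toy model, not the real FC0 logic: the
fixed latencies, the Instr record and the simulate() helper are all
made up for illustration.

```python
# Toy in-order scoreboard: illustrates "stall just before issue".
# Latencies and names are hypothetical, not taken from FC0.
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    srcs: tuple   # source register numbers
    dst: int      # destination register number
    latency: int  # assumed fixed latency in cycles


def simulate(program):
    """Return the issue cycle of each instruction, issuing in order."""
    busy_until = {}   # register -> first cycle at which it is free again
    cycle = 0
    issue_cycles = []
    for ins in program:
        # Wait while any source (RAW) or the destination (WAW)
        # is still marked busy by an earlier instruction.
        regs = list(ins.srcs) + [ins.dst]
        ready = max([busy_until.get(r, 0) for r in regs] + [cycle])
        issue_cycles.append(ready)
        busy_until[ins.dst] = ready + ins.latency
        cycle = ready + 1   # at most one issue per cycle
    return issue_cycles


program = [
    Instr("load r4", srcs=(1,), dst=4, latency=3),
    Instr("add r5", srcs=(4, 2), dst=5, latency=1),   # RAW on r4
    Instr("add r6", srcs=(7, 8), dst=6, latency=1),   # independent
]
issue_cycles = simulate(program)
```

In this model the second instruction waits at issue until r4 is
free, and the independent third one waits behind it because issue
is strictly in order.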

> The problem comes from making a simple, extensible, stable
> and context-switch safe programming model. Adding another "status"
> bit can create problems further than you'd think (browse the
> mailing list archive from 3 years ago ;-D)

I'd like to, but the archive is not available ;( It
says: Oops... There is no group called f-cpu. 
 
> After a bit of head scratching, i just understood that your code
> is in fact a vector add loop, which is in fact simple to do.
> I am however puzzled by your register allocation.

;) I'm not experienced in asm coding conventions .. In asm I have
coded only for IA32, IA64, PIC and AVR CPUs, and that was a year ago ..
 
> // copy/paste/interleave :

heh? Maybe I just missed the next cpu-techie term :)

>   loadi 32, [r1], r4;
>   loadi 32, [r17], r20;
>   loadi 32, [r33], r36;
>   loadif 32, [r49], r52;
>   add r4, r2, r5;

Would the CPU benefit from a prefetch instruction
here:
    load [r1],r0 
so that the loadi in the next round would not stall?
BTW, when the quad of loadi is encountered and the data
is not in L1, will the first loadi stall, or will
the load continue in the "background" until r4 is
really needed (accessed)?

> It should do the job nicely for FC0, but might cause problems in a superscalar
> CPU. In that case, the interleaving must be modified :

That leads me to another question: isn't FC0 in fact superscalar too?
It has several execution pipelines .. well, these are not complete ..
the final xbar + register write is singular only.
Do you think it would be possible to change the F-CPU scheduler to
decode/issue two instructions per cycle? F-CPU already has multiple
EUs, so it might be simple to reorder instructions to exploit them:
when two (or more) consecutive instructions are independent, issue
both (assuming there is a big enough xbar)...
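
To make the idea concrete, here is a rough sketch of the pairing
check such a two-way issue stage would need. Everything in it is an
assumption for illustration: the Instr tuple, the unit labels, and
the choice to model loadi as also writing its pointer register. It
ignores the xbar and write-port limits mentioned above.

```python
# Hypothetical pairing check for a 2-way in-order issue stage.
from collections import namedtuple

# unit: label of the execution unit the instruction needs (made up)
Instr = namedtuple("Instr", "unit srcs dsts")


def can_dual_issue(first, second):
    """True if `second` may issue in the same cycle as `first`."""
    raw = bool(set(first.dsts) & set(second.srcs))  # second reads first's result
    waw = bool(set(first.dsts) & set(second.dsts))  # both write one register
    same_unit = first.unit == second.unit           # structural hazard
    return not (raw or waw or same_unit)


# Loads are assumed to update their pointer register as well.
load_a = Instr("lsu", srcs=(1,), dsts=(4, 1))
load_b = Instr("lsu", srcs=(17,), dsts=(20, 17))
add_a = Instr("asu", srcs=(4, 2), dsts=(5,))

pairs = {
    "load_a+add_a": can_dual_issue(load_a, add_a),    # RAW on r4
    "load_a+load_b": can_dual_issue(load_a, load_b),  # same unit
    "load_b+add_a": can_dual_issue(load_b, add_a),    # independent
}
```

Only the last pair passes, which is why the order of the
instructions in the loop would matter for such a scheduler.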
 
> > but I was interested in information on circuit complexity
> > like depth of ten transistors in fcpu ...
> 
> a very important issue today. But as rule-of-thumb,
> the 6 gates limit of the critical datapath remains as a central
> guideline. Of course if there is a specific implementation issue,
> this can be locally modified (it also depends on the technology
> and the synthesiser and all the other options) but the general design
> is still simple with this rule (despite all the bypass nightmares).

Seems reasonable. That is why I asked the original question.
I was interested in how complex one pipeline stage could be in, for
example, a Pentium 4, to compare it with F-CPU's 6 gates and get a rough
approximation of what frequency it could reach given Intel's technology ;)
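
The kind of back-of-the-envelope estimate I mean can be sketched as
below. The gate delay and latch overhead are placeholder numbers
chosen only to show the arithmetic; they are not measurements of any
real process or CPU, and max_clock_hz() is a made-up helper.

```python
def max_clock_hz(gates_per_stage, gate_delay_s, stage_overhead_s=0.0):
    """Rough clock estimate: one cycle must cover the slowest stage."""
    cycle = gates_per_stage * gate_delay_s + stage_overhead_s
    return 1.0 / cycle


# Placeholder numbers only, NOT data for any real technology.
GATE_DELAY = 100e-12      # hypothetical 100 ps per gate
LATCH_OVERHEAD = 200e-12  # hypothetical flip-flop + skew cost per stage

f6 = max_clock_hz(6, GATE_DELAY, LATCH_OVERHEAD)    # 6-gate stage
f12 = max_clock_hz(12, GATE_DELAY, LATCH_OVERHEAD)  # stage twice as deep
```

With real per-gate delays plugged in, the same formula would give
the rough comparison between a 6-gate stage and a deeper one.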

> > uhh again it seems like if I have had some Vodka or so ;)
> don't drink and code ;-)
> 
> i don't have this problem, because i'm only chocoholic ;-)

Me too ;-) Well, the vodka bit was overkill - I can drink
only tea/milk/beer/blue-bols/tonic ;)
devik

*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/