
Re: [f-cpu] Re: FC0 XBAR



hi ! (again... i gotta sleep soon, so don't worry !)

Michael Riepe wrote:
> On Thu, Aug 02, 2001 at 08:31:06PM +0200, Yann Guidon wrote:
> [...]
> > > IMHO, reading and writing "in the same cycle" means that *by the end
> > > of the cycle* (when the clock signal rises) the new data is stored in
> > > the register, while the reader gets a copy of its *previous* contents
> > > at the same time.
> >
> > that's what i feared :-(
> 
> I thought that's what the Xbar bypass is good for -- avoid the delay?

"avoid the delay" from one unit to another.
however, it seems that i had overlooked the
register set's inherent latency.

> [...]
> > Then there is another problem ! since the result is really written
> > during the next cycle but available only the cycle after, people who
> > want to optimise the instruction flow but who don't do it "perfectly"
> > are hit by a 1-cycle delay. Obviously, delaying is not a good thing :
> > people will try to schedule the instructions close to the pipeline
> > features, but there is a large chance that, under the pressure of
> > the program, available resources and other stuff, the delay is hit.
> 
> Well, let's try an example to make this clearer.  Let's say we want to
> add 3 numbers:
> 
>         add r5, r1, r2  ; temporary result in r5
>         add r4, r3, r5  ; final result in r4

this is written in 'classical' risc / x86 fashion, it seems :-)

> Now, the time-table (without bypassing) is:
> 
>         cycle 0: read r1 and r2 (pass values through Xbar)
>         cycle 1: stage 1 of ASU is working
>         cycle 2: stage 2 of ASU is working
>         cycle 3: pass result through Xbar (write r5 at end of cycle)
>         cycle 4: read r3 and r5 (pass values through Xbar)
>         cycle 5: stage 1 of ASU is working
>         cycle 6: stage 2 of ASU is working
>         cycle 7: pass result through Xbar (write r4 at end of cycle)
> 
> Note that r5 is written at the end of cycle 3, but read in cycle 4; that
> is, the new value is read (and passes the Xbar again).  With bypassing,
> cycle 3 and 4 will overlap, resulting in a 1-cycle speed-up.
> 
> Or did I miss something?

i think that we agree about what bypassing the RegSet in the 'Xbar'
(a set of muxes) means.
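
To make the quoted time-table concrete, here is a minimal sketch that
reproduces it for a chain of dependent adds. The model is an assumption
taken only from the table above (1 cycle to read through the Xbar, a
2-stage ASU, 1 cycle to pass the result through the Xbar); the names
`ASU_STAGES` and `schedule` are made up for illustration and are not
part of any FC0 tool.

```python
# Assumed model, taken from the quoted time-table (not an FC0 specification):
# cycle c: read operands, c+1 and c+2: ASU stages, c+3: result on the Xbar.
ASU_STAGES = 2

def schedule(n_dependent_adds, bypass):
    """Return (read_cycle, writeback_cycle) for each add in a dependent chain."""
    rows = []
    read = 0
    for _ in range(n_dependent_adds):
        writeback = read + ASU_STAGES + 1
        rows.append((read, writeback))
        # Without bypass the consumer reads the register in the cycle after
        # the write; with bypass it takes the value off the Xbar during the
        # producer's writeback cycle, so the two cycles overlap.
        read = writeback if bypass else writeback + 1
    return rows

print(schedule(2, bypass=False))  # [(0, 3), (4, 7)]
print(schedule(2, bypass=True))   # [(0, 3), (3, 6)]
```

With bypassing the final result crosses the Xbar in cycle 6 instead of
cycle 7, which is the 1-cycle speed-up mentioned above.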

however, there is another case that worries me :
suppose that you want to add more than 5 registers, for example 20.
The same applies to any combination of other operations, of course
(this is not specific to additions; what i care about is the latency).

So we have a burst of register values all over the place. The scheduler
will take care to organise that cleanly. In order to have the fastest
execution possible, one will "organise" the instruction ordering
so independent operations are interleaved. That's the "usual job"
when one optimises for RISC.

Now imagine that the registers are exhausted, or some pressure
like that. Imagine that the instruction is issued one cycle after
the necessary source data is present on the Xbar for bypass. The instruction
will have to wait yet another cycle, until the register set memorises and
gives the new value.
For me, this situation (waiting) is not tolerable because i guess that, most
of the time that matters (when the code is optimised to a decent level),
the 1-cycle penalty might occur often enough that the optimisation is not
worth the effort. If your optimisation yields a poor speedup, you'll drop it
and i don't want to encourage that...
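
The case described above can be sketched as a small function. This is
only one reading of the problem, under a stated assumption: the result
is on the Xbar in cycle `on_xbar`, the register set is written during
the following cycle, and the value can be read back from the register
set only from `on_xbar + 2` onwards. The function name and the cycle
numbers are hypothetical.

```python
def operand_ready_cycle(issue, on_xbar):
    """Cycle in which a consumer issued at `issue` can really get its operand.

    Assumption (not a confirmed FC0 timing): the register set is written
    during on_xbar + 1 and is readable from on_xbar + 2.
    """
    if issue <= on_xbar:
        return on_xbar        # catch the value on the Xbar (bypass)
    if issue == on_xbar + 1:
        return on_xbar + 2    # bypass missed, register not readable yet
    return issue              # register set already holds the value

# Producer result on the Xbar in cycle 3, consumer issued 0..3 cycles later.
for gap in range(4):
    issue = 3 + gap
    print(gap, operand_ready_cycle(issue, 3) - issue)
```

Under that assumption only the consumer issued exactly one cycle after
the bypass window stalls (by 1 cycle); issuing it on time or two or more
cycles late costs nothing extra.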

yes, i know, it's a bit complex. i hope i'll find a decent solution.

>  Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~