
Re: [f-cpu] Re: FC0 XBAR



On Fri, Aug 03, 2001 at 03:55:11AM +0200, Yann Guidon wrote:
[...]
> > Well, let's try an example to make this clearer.  Let's say we want to
> > add 3 numbers:
> > 
> >         add r5, r1, r2  ; temporary result in r5
> >         add r4, r3, r5  ; final result in r4
> 
> this is written in 'classical' risc / x86 fashion, it seems :-)

If you find a way to encode instructions like

	r5 = r1 * r3 - r2 * r4
	r6 = r1 * r4 + r2 * r3

directly, without using temporary registers, I will use it! :)

> > Now, the time-table (without bypassing) is:
> > 
> >         cycle 0: read r1 and r2 (pass values through Xbar)
> >         cycle 1: stage 1 of ASU is working
> >         cycle 2: stage 2 of ASU is working
> >         cycle 3: pass result through Xbar (write r5 at end of cycle)
> >         cycle 4: read r3 and r5 (pass values through Xbar)
> >         cycle 5: stage 1 of ASU is working
> >         cycle 6: stage 2 of ASU is working
> >         cycle 7: pass result through Xbar (write r4 at end of cycle)
> > 
> > Note that r5 is written at the end of cycle 3, but read in cycle 4; that
> > is, the new value is read (and passes the Xbar again).  With bypassing,
> > cycle 3 and 4 will overlap, resulting in a 1-cycle speed-up.
> > 
> > Or did I miss something?
> 
> i think that we agree about what bypassing the RegSet in the 'Xbar'
> (a set of mux) means.

I think so...
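
To make the time-table above reproducible, here is a minimal Python sketch
that generates it for a chain of dependent instructions.  The function name
`timetable` and its parameters are made up for illustration; the only
assumptions are the ones from the example itself (1 cycle per Xbar pass,
a 2-stage ASU, and an EU that is never busy).

```python
def timetable(instrs, latency=2, bypass=False):
    """List (cycle, event) pairs for a chain of dependent instructions.

    instrs:  list of (dest, src_a, src_b) register names; each instruction
             is assumed to depend on the result of the previous one.
    latency: number of ASU stages (2 in the example above).
    bypass:  if True, the operand fetch of the next instruction overlaps
             with the result write of the previous one.
    """
    events = []
    cycle = 0
    for dest, src_a, src_b in instrs:
        events.append((cycle, f"read {src_a} and {src_b} (pass values through Xbar)"))
        for stage in range(1, latency + 1):
            events.append((cycle + stage, f"stage {stage} of ASU is working"))
        write = cycle + latency + 1
        events.append((write, f"pass result through Xbar (write {dest})"))
        # Without bypass the dependent read happens in the following cycle;
        # with bypass it shares the write cycle.
        cycle = write if bypass else write + 1
    return events


adds = [("r5", "r1", "r2"), ("r4", "r3", "r5")]
for cycle, event in timetable(adds, bypass=False):
    print(f"cycle {cycle}: {event}")
```

Without bypass the last event lands in cycle 7, with `bypass=True` in
cycle 6, which is the 1-cycle speed-up mentioned above.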

> however, there is another case that worries me :
> suppose that you want to add more than 5 registers, for example 20.
> This could work for any combination of other operations, of course
> (this is not specific to additions, i care about the latency).

Such operations can probably be parallelized (if there are no
read-after-write dependencies).  E.g. for the complex multiplication
above, we'll calculate the four products first, one after another,
and then add/subtract them.
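
As a rough illustration of that ordering (plain Python, not F-CPU code;
the temporaries t1..t4 and the function name `complex_mul` are invented
for this sketch):

```python
def complex_mul(r1, r2, r3, r4):
    """Compute (r1 + i*r2) * (r3 + i*r4) as (real, imag).

    The four products do not depend on each other, so they can be
    issued back to back; only the final subtract/add has to wait
    for its two operands.
    """
    # Independent products: no read-after-write dependencies among them.
    t1 = r1 * r3
    t2 = r2 * r4
    t3 = r1 * r4
    t4 = r2 * r3
    # Dependent step: each line needs two of the products above.
    r5 = t1 - t2  # real part
    r6 = t3 + t4  # imaginary part
    return r5, r6


print(complex_mul(1, 2, 3, 4))  # (-5, 10)
```

The temporaries are still there, of course; the point is only that the
operations that need them come last, after all four products are under way.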

> So we have a burst of register values all over the place. The scheduler
> will take care to organise that cleanly. In order to have the fastest
> execution possible, one will "organise" the instruction ordering
> so independent operations are interleaved. That's the "usual job"
> when one optimises for RISC.

Yep.

> Now imagine that the register number is exhausted, or some pressure
> like that. imagine that the instruction is issued one cycle after
> the necessary source data is present on the Xbar for bypass. The instruction
> will have to wait yet another cycle, until the register set memorises and
> gives the new value.

If the instruction is issued at the beginning of the `result write'
cycle of the previous instruction, the value can be bypassed.  If it is
issued one or more cycles later, it can read the result from the register.
(note that I use the term "instruction is issued" for the point in time
when the scheduler starts reading the operands -- the EU starts working
1 cycle later).
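
Spelled out as a small Python sketch (the function `operand_source` is
hypothetical, and the "not ready" case is my own addition for the
situation where the producer has not reached its write cycle yet):

```python
def operand_source(issue_cycle, write_cycle):
    """Where does a dependent instruction get its operand from?

    issue_cycle: cycle in which the scheduler starts reading operands.
    write_cycle: `result write' cycle of the producing instruction.
    """
    if issue_cycle < write_cycle:
        # Assumption: the result is not on the Xbar yet, so the
        # instruction cannot be issued at this point.
        return "not ready"
    if issue_cycle == write_cycle:
        # Result is passing through the Xbar right now: take it from there.
        return "bypass"
    # One or more cycles later the register already holds the new value.
    return "register"


# In the add example, r5 is written in cycle 3.
for issue in (2, 3, 4, 5):
    print(issue, operand_source(issue, write_cycle=3))
```

In this model there is no issue cycle in which a ready result is neither on
the Xbar nor in the register.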

> For me, this situation (wait) is not tolerable because i guess that most
> of the "desirable" time (when the code will be optimised at a decent level)
> the 1-cycle penalty might occur often enough that optimisation might not
> be worth it. If your optimisation yields poor speedup, you'll drop it and
> i don't want to encourage that...

Maybe I'm too "vernagelt" (thick-headed) to see your point.

An instruction has 3 phases, right?  Phase 1 is operand fetch (1 cycle),
phase 2 is calculation (n cycles) and phase 3 is result write (1 cycle
again).  When a second instruction depends on the result of the first
one, it can start its phase 1 immediately after phase 3 of the previous
instruction is finished, that is, in the next cycle, no matter what
happens (assuming the EU is not busy, of course).

Let me draw you a picture... first, the worst case:

   reg                 reg                 reg
    +    +----+----+    +    +----+----+    +
    |----| instr 1 |----|----| instr 2 |----|
    +    +----+----+    +    +----+----+    +
     Xbar           Xbar Xbar           Xbar

The Xbar behaves just like another pipeline stage, and the register
bank like a pipeline register.  Now, let's add the bypass:

   reg                 reg
    +    +----+----+    +
    |----| instr 1 |----|
    +    +----+----+\   +
     Xbar            \
                  reg \               reg
                   +   \+----+----+    +
                   |----| instr 2 |----|
                   +    +----+----+    +
                    Xbar           Xbar

The result register is moved out of the data path, so we save 1 cycle.
We cannot save more because there's a 1-cycle delay each time data
passes the Xbar.
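
The cycle count for a whole chain of dependent instructions follows
directly from the two pictures.  A small Python sketch, assuming every
instruction in the chain has the same EU latency and the EU is never busy
(`chain_cycles` is just a name I made up):

```python
def chain_cycles(n, latency, bypass):
    """Total cycles for n instructions, each depending on the previous one.

    Without bypass every instruction costs a full
    Xbar + EU + Xbar round trip: latency + 2 cycles.
    With bypass the result write of one instruction overlaps with the
    operand fetch of the next, so only the last write is counted extra.
    """
    if n == 0:
        return 0
    if not bypass:
        return n * (latency + 2)
    return n * (latency + 1) + 1


# The two dependent adds from the example (2-stage ASU):
print(chain_cycles(2, 2, bypass=False))  # 8
print(chain_cycles(2, 2, bypass=True))   # 7

# Worst case for summing 20 registers: a chain of 19 dependent adds.
print(chain_cycles(19, 2, bypass=False))  # 76
print(chain_cycles(19, 2, bypass=True))   # 58
```

So the bypass saves one cycle per dependency in the chain, and never more
than that.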

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"