
Re: [f-cpu] Re: FC0 XBAR



On Fri, 3 Aug 2001, Michael Riepe wrote:

> On Fri, Aug 03, 2001 at 03:55:11AM +0200, Yann Guidon wrote:
> [...]
> > > Well, let's try an example to make this clearer.  Let's say we want to
> > > add 3 numbers:
> > > 
> > >         add r5, r1, r2  ; temporary result in r5
> > >         add r4, r3, r5  ; final result in r4
> > 
> > this is written in 'classical' risc / x86 fashion, it seems :-)
> 
> If you find a way to encode instructions like
> 
> 	r5 = r1 * r3 - r2 * r4
> 	r6 = r1 * r4 + r2 * r3
> 
> directly, without using temporary registers, I will use it! :)

Hi Yann, that reminds me of something complex...
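
Just to make the cost explicit, here is a rough C sketch of the data
flow that a plain 2-source / 1-destination encoding forces (my own
scribble, not F-CPU code; t1..t4 are the temporaries in question):

    #include <stdio.h>

    int main(void) {
        int r1 = 1, r2 = 2, r3 = 3, r4 = 4;   /* arbitrary example values */

        int t1 = r1 * r3;    /* product 1 */
        int t2 = r2 * r4;    /* product 2 */
        int t3 = r1 * r4;    /* product 3 */
        int t4 = r2 * r3;    /* product 4 */

        int r5 = t1 - t2;    /* real part */
        int r6 = t3 + t4;    /* imaginary part */

        printf("r5 = %d, r6 = %d\n", r5, r6);  /* prints r5 = -5, r6 = 10 */
        return 0;
    }

Four independent products first, then one subtract and one add; that
independence is exactly what makes the scheduling question below
interesting.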

> > > Now, the time-table (without bypassing) is:
> > > 
> > >         cycle 0: read r1 and r2 (pass values through Xbar)
> > >         cycle 1: stage 1 of ASU is working
> > >         cycle 2: stage 2 of ASU is working
> > >         cycle 3: pass result through Xbar (write r5 at end of cycle)
> > >         cycle 4: read r3 and r5 (pass values through Xbar)
> > >         cycle 5: stage 1 of ASU is working
> > >         cycle 6: stage 2 of ASU is working
> > >         cycle 7: pass result through Xbar (write r4 at end of cycle)
> > > 
> > > Note that r5 is written at the end of cycle 3, but read in cycle 4; that
> > > is, the new value is read (and passes the Xbar again).  With bypassing,
> > > cycle 3 and 4 will overlap, resulting in a 1-cycle speed-up.
> > > 
> > > Or did I miss something?
> > 
> > i think we agree about what bypassing the RegSet in the 'Xbar' (a set
> > of muxes) means.
> 
> I think so...
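
To be sure we put the same numbers on it, here is a tiny C model of the
time-table above (only a sketch of mine; the 2-cycle ASU depth is the
one from the example):

    #include <stdio.h>

    /* Cycle in which an instruction drives its result through the Xbar:
     * one cycle for the operand read, `exec' cycles in the EU, then the
     * result-write cycle itself. */
    static int write_cycle(int issue, int exec) {
        return issue + exec + 1;
    }

    int main(void) {
        const int exec = 2;               /* 2-stage ASU, as in the example */

        int wr1 = write_cycle(0, exec);   /* first add: r5 written in cycle 3 */

        /* no bypass: the 2nd add issues after r5 is in the register set */
        int wr2_plain  = write_cycle(wr1 + 1, exec);

        /* bypass: the 2nd add issues during the 1st one's write cycle */
        int wr2_bypass = write_cycle(wr1, exec);

        printf("r5 written in cycle %d\n", wr1);                      /* 3 */
        printf("r4 written in cycle %d without bypass\n", wr2_plain); /* 7 */
        printf("r4 written in cycle %d with bypass\n", wr2_bypass);   /* 6 */
        return 0;
    }

That gives r5 in cycle 3 and r4 in cycle 7 without the bypass, 6 with
it: the single cycle mentioned above.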
> 
> > however, there is another case that worries me:
> > suppose that you want to add more than 5 registers, for example 20.
> > The same goes for any combination of other operations, of course
> > (this is not specific to additions; i care about the latency).
> 
> Such operations can probably be parallelized (if there are no
> read-after-write dependencies).  E.g. for the complex multiplication
> above, we'll calculate the four products first, one after another,
> and then add/subtract them.
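
To put a figure on that (with a multiplier depth of 3 which is only an
assumption of mine, the real FC0 value may differ), a quick sketch:

    #include <stdio.h>

    int main(void) {
        const int mul_exec   = 3;   /* assumed multiplier depth */
        const int n_products = 4;   /* the four partial products above */
        const int latency    = 1 + mul_exec + 1;   /* read + exec + write */

        /* dependent chain: each one waits for the previous write-back */
        int chained = n_products * latency;

        /* independent products: one issue per cycle, latencies overlap */
        int overlapped = (n_products - 1) + latency;

        printf("as a dependent chain: %d cycles\n", chained);     /* 20 */
        printf("issued back to back:  %d cycles\n", overlapped);  /*  8 */
        return 0;
    }

With independent operations the issue rate dominates, not the
per-instruction latency.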
> 
> > So we have a burst of register values all over the place. The scheduler
> > will take care to organise that cleanly. In order to have the fastest
> > execution possible, one will "organise" the instruction ordering
> > so independent operations are interleaved. That's the "usual job"
> > when one optimises for RISC.
> 
> Yep.
> 
> > Now imagine that the registers are exhausted, or some pressure
> > like that. Imagine that the instruction is issued one cycle after
> > the necessary source data was present on the Xbar for bypass. The
> > instruction will then have to wait yet another cycle, until the
> > register set has stored the new value and can deliver it.
> 
> If the instruction is issued at the beginning of the `result write'
> cycle of the previous instruction, the value can be bypassed.  If it is
> issued one or more cycles later, it can read the result from the register.
> (note that I use the term "instruction is issued" for the point in time
> when the scheduler starts reading the operands -- the EU starts working
> 1 cycle later).
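
If I read that correctly, the operand source boils down to a single
comparison of cycle numbers. A small sketch of my reading (not the real
scheduler):

    #include <stdio.h>

    typedef enum { FROM_BYPASS, FROM_REGSET } operand_source;

    /* Where an operand comes from, given the cycle in which the consumer
     * is issued and the cycle in which the producer writes its result.
     * (Assumes the scheduler never issues the consumer earlier.) */
    static operand_source pick_source(int consumer_issue, int producer_write) {
        return (consumer_issue == producer_write) ? FROM_BYPASS : FROM_REGSET;
    }

    int main(void) {
        /* producer writes in cycle 3, as in the earlier time-table */
        printf("issued in cycle 3: %s\n",
               pick_source(3, 3) == FROM_BYPASS ? "bypass" : "register set");
        printf("issued in cycle 4: %s\n",
               pick_source(4, 3) == FROM_BYPASS ? "bypass" : "register set");
        return 0;
    }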
> 
> > For me, this situation (waiting) is not tolerable, because i guess
> > that in the "desirable" case (when the code is optimised at a decent
> > level) the 1-cycle penalty might occur often enough that the
> > optimisation is no longer worth it. If your optimisation yields a
> > poor speedup, you'll drop it, and i don't want to encourage that...
> 
> Maybe I'm too "vernagelt" (thick-headed) to see your point.
> 
> An instruction has 3 phases, right?  Phase 1 is operand fetch (1 cycle),
> phase 2 is calculation (n cycles), and phase 3 is result write (1 cycle
> again).  When a second instruction depends on the result of the first
> one, it can start its phase 1 immediately after phase 3 of the previous
> instruction is finished, that is, in the next cycle, no matter what
> happens (assuming the EU is not busy, of course).
> 
> Let me draw you a picture... first, the worst case:
> 
>    reg                 reg                 reg
>     +    +----+----+    +    +----+----+    +
>     |----| instr 1 |----|----| instr 2 |----|
>     +    +----+----+    +    +----+----+    +
>      Xbar           Xbar Xbar           Xbar
> 
> The Xbar behaves just like another pipeline stage, and the register
> bank like a pipeline register.  Now, let's add the bypass:
> 
>    reg                 reg
>     +    +----+----+    +
>     |----| instr 1 |----|
>     +    +----+----+\   +
>      Xbar            \
>                   reg \               reg
>                    +   \+----+----+    +
>                    |----| instr 2 |----|
>                    +    +----+----+    +
>                     Xbar           Xbar
> 
> The result register is moved out of the data path, so we save 1 cycle.
> We cannot save more because there's a 1-cycle delay each time data
> passes the Xbar.
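
Structurally, the bypass in that picture is nothing more than one extra
input on the operand mux in front of the EU. A behavioural sketch in C
(my simplification, not the real Xbar; the register count is arbitrary
here):

    #include <stdint.h>
    #include <stdio.h>

    #define NREGS 64   /* register count; the exact number does not matter */

    static uint64_t regset[NREGS];

    /* Operand multiplexer: take the value straight off the result bus
     * when the register being read is the one written back in this very
     * cycle, otherwise read it from the register set. */
    static uint64_t read_operand(int reg, int wb_reg,
                                 uint64_t wb_value, int wb_valid) {
        if (wb_valid && reg == wb_reg)
            return wb_value;             /* bypass path */
        return regset[reg];              /* normal path via the register set */
    }

    int main(void) {
        regset[5] = 111;                           /* stale r5 */
        uint64_t v = read_operand(5, 5, 222, 1);   /* r5 written this cycle */
        printf("operand read for r5: %llu\n", (unsigned long long)v); /* 222 */
        return 0;
    }

The price, of course, is the extra mux delay in the read path, which is
exactly the timing question raised below.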

Can you guarantee any timing here? Since you have to add the time for
execution and for the bypass path, it may get very tight. I think this
kind of hack should NEVER be adopted for a core that is meant to be
extensible to larger bit widths!!!
Just my opinion though ;)

JG
