[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [f-cpu] Re: FC0 XBAR



Yann Guidon a écrit :
> 
> hi !
> 
> now that you have seen the picture, we'll be able to
> speak about the same thing :-) [at last, but i'm such dog slow now...]
> 
> Michael Riepe wrote:
> > > > Or did I miss something?
> > > i think that we agree about what means bypassing the RegSet in the 'Xbar'
> > > (a set of mux).
> > I think so...
> 
> however you probably did not understand that the Xbar took its own cycle.
> This "pseudo-unit", which is in fact a set of MUX, consumes wires and
> loads the transistors so it is a good thing that it takes his cycle.
> in the small example that i have posted (ASU+SHL+ROP2), we have
> to multiplex 7*64 bits !
> 
> hence : an instruction such as ADD is :
> 
>  fetch - decode - Xbar/issue - ASU1 - ASU2 - Xbar - Reg.
> 
> The role of the Xbar is obvious when bypass occurs in the usual
> way (bad scheduling) :
> 
> (1) ADD R1, R2, R3  ; r3=r1+r2
> (2) ADD R3, R4, R5  ; r5=r3+r4 <- direct dependency -> stall
> 
> cycle FTCH - DEC - XBAR - ASU1 - ASU2 - Xbar - Reg.
> 1      (1)
> 2      (2)   (1)
> 3            (2)    (1)
> 4                   (2)    (1)
> 5                   (2)           (1)
> 6                   (2)-------bypass-----(1)
> 7                          (2)                  (1)
> 8                                 (2)
> 9                                        (2)
> 10                                              (2)
> 
> At cycle 6, the values are now available either on Xbar or
> from the register set.
> 
> > > however, there is another case that worries me :
> > > suppose that you want to add more than 5 registers, for example 20.
> > > This could work for any combination of other operations, of course
> > > (this is not specific to additions, i care about the latency).
> >
> > Such operations can probably be parallelized (if there are no
> > read-after-write dependencies).  E.g. for the complex multiplication
> > above, we'll calculate the four products first, one after another,
> > and then add/subtract them.
> 
> now imagine even more operations going at the same time...
> and you try to schedule the operations correctly.
> 
> > > So we have a burst of register values all over the place. The scheduler
> > > will take care to organise that cleanly. In order to have the fastest
> > > execution possible, one will "organise" the instruction ordering
> > > so independent operations are interleaved. That's the "usual job"
> > > when one optimises for RISC.
> > Yep.
> glad you agree, because today people are so lazy ;-)
> 
> > > Now imagine that the register number is exhausted, or some pressure
> > > like that. imagine that the instruction is issued one cycle after
> > > the necessary source data is present on the Xbar for bypass. The instruction
> > > will have to wait yet another cycle, until the register set memorises and
> > > gives the new value.
> >
> > If the instruction is issued at the beginning of the `result write'
> > cycle of the previous instruction, the value can be bypassed.  If it is
> > issued one or more cycles later, it can read the result from the register.
> > (note that I use the term "instruction is issued" for the point in time
> > when the scheduler starts reading the operands -- the EU starts working
> > 1 cycle later).
> >
> > > For me, this situation (wait) is not tolerable because i guess that most
> > > of the "desirable" time (when the code will be optimised at a decent level)
> > > the 1-cycle penalty might occur often enough that optimisation might not
> > > be worth. If your optimisation yeilds poor speedup, you'll drop it and
> > > i don't want to encourage that...
> >
> > Maybe I'm too "vernagelt" to see your point.
> >
> > An instruction has 3 phases, right?  Phase 1 is operand fetch (1 cycle),
> > phase2 is calculation (n cycles) and phase 3 is result write (1 cycle
> > again).  When a second instruction depends on the result of the first
> > one, it can start its phase 1 immediately after phase 3 of the previous
> > instruction is finished, that is, in the next cycle, no matter what
> > happens (assuming the EU is not busy, of course).
> >
> > Let me draw you a picture... first, the worst case:
> >
> >    reg                 reg                 reg
> >     +    +----+----+    +    +----+----+    +
> >     |----| instr 1 |----|----| instr 2 |----|
> >     +    +----+----+    +    +----+----+    +
> >      Xbar           Xbar Xbar           Xbar
> 
> i think that you overlooked the latency of the register set :
> 
>    reg                regW  RegR               reg
>     +    +----+----+    +    +    +----+----+    +
>     |----| instr 1 |----|----|----| instr 2 |----|
>     +    +----+----+    +    +    +----+----+    +
>      Xbar           Xbar      Xbar           Xbar
> 
> So :
> Not only we have to care about the "direct" bypass that you
> have described below, but also the "indirect" bypass when the
> instruction arrives one cycle too late to benefit from the
> direct bypass. One cycle more and the register set is properly
> updated so there is no danger.
> 
> > The Xbar behaves just like another pipeline stage, and the register
> > bank like a pipeline register.  Now, let's add the bypass:
> >
> >    reg                 reg
> >     +    +----+----+    +
> >     |----| instr 1 |----|
> >     +    +----+----+\   +
> >      Xbar            \
> >                   reg \               reg
> >                    +   \+----+----+    +
> >                    |----| instr 2 |----|
> >                    +    +----+----+    +
> >                     Xbar           Xbar
> >
> > The result register is moved out of the data path, we save 1 cycle.
> i think that it saves 2 cycles in fact.
> 
> > We cannot save more because there's a 1-cycle delay each time data
> > passes the Xbar.
> 
> as you know it's more complex than that in FC0.
> it inherits from strategies that are more than 10 years old (ie ALPHA)
> but are still difficult/uneasy to use in the real world (of lazy coders).
> 
> what bothers me is illustrated below :
> imagine that you take the latency into account.
> The ideal case is when you schedule the instructions so that
> results are available on the Xbar when you send the instruction.
> This perfect case is illustrated below :
> 
> (1) ADD R1, R2, R3  ; r3=r1+r2
> (nop)
> (nop)
> (2) ADD R3, R4, R5  ; r5=r3+r4 <- data available on the Xbar
> 
> cycle FTCH - DEC - XBAR - ASU1 - ASU2 - Xbar - Reg.
> 1      (1)
> 2     (nop)   (1)
> 3     (nop)  (nop)  (1)
> 4      (2)   (nop) (nop)   (1)
> 5             (2)  (nop)          (1)
> 6                   (2)-------bypass-----(1)
> 7                          (2)                  (1)
> 8                                 (2)
> 9                                        (2)
> 10                                              (2)
> 
> Now, imagine that instead of the nop, we have interleaved instructions
> AND there are too much instructions, exactly ONE MORE than in the ideal case :
> 
> (1) ADD R1, R2, R3  ; r3=r1+r2
> (nop)
> (nop)
> (nop) ; -> the "killing" instruction
> (2) ADD R3, R4, R5  ; r5=r3+r4 <- data available on the Xbar
> 
> cycle FTCH - DEC - XBAR - ASU1 - ASU2 - Xbar - Reg.
> 1      (1)
> 2     (nop)   (1)
> 3     (nop)  (nop)  (1)
> 4     (nop)  (nop)  (nop)  (1)
> 5      (2)   (nop)  (nop) (nop)   (1)
> 6             (2)--R3 is the old value!--(1)          <--- this is where the clash is !
> 7                    (2)                        (1)   <--- the situation here is not better.
> 8                          (2)
> 9                                 (2)
> 10                                       (2)
> 11                                              (2)
> 
> When the instruction arrives 1 cycle too late for the Xbar bypass,
> it is not possible to issue it, because the Register cycle
> requires 1 cycle to update. Emitting it without proper measures
> would be a catastrophe.
> 
> There are 2 solutions : either delay the instruction or add another
> small bypass. I think that i found the solution for the bypass :
> it's just another register, and we have to insert a few MUX entries
> in the Xbar-read part.
> 

Hum, strange, it became to look like my own bypass/variable latency
system... ;p Be carefull whygee you will seem to be okay with me !

nicO
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/