Re: [f-cpu] Re: FC0 XBAR
Yann Guidon wrote:
>
> hi !
>
> now that you have seen the picture, we'll be able to
> speak about the same thing :-) [at last, but i'm so dog-slow now...]
>
> Michael Riepe wrote:
> > > > Or did I miss something?
> > > i think that we agree about what bypassing the RegSet in the "Xbar"
> > > (a set of MUX) means.
> > I think so...
>
> however you probably did not notice that the Xbar takes its own cycle.
> This "pseudo-unit", which is in fact a set of MUX, consumes wires and
> loads the transistors, so it is a good thing that it gets its own cycle.
> in the small example that i posted (ASU+SHL+ROP2), we have
> to multiplex 7*64 bits !
>
> hence an instruction such as ADD goes through :
>
> fetch - decode - Xbar/issue - ASU1 - ASU2 - Xbar - Reg.
>
> The role of the Xbar is obvious when bypass occurs in the usual
> way (bad scheduling) :
>
> (1) ADD R1, R2, R3 ; r3=r1+r2
> (2) ADD R3, R4, R5 ; r5=r3+r4 <- direct dependency -> stall
>
> cycle FTCH - DEC - XBAR - ASU1 - ASU2 - Xbar - Reg.
> 1 (1)
> 2 (2) (1)
> 3 (2) (1)
> 4 (2) (1)
> 5 (2) (1)
> 6 (2)-------bypass-----(1)
> 7 (2) (1)
> 8 (2)
> 9 (2)
> 10 (2)
>
> At cycle 6, the result is available on the Xbar (bypass) ; once the
> register set has been updated, it can be read from there as well.
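The bypass timing in the table above can be sketched as a tiny model (hypothetical Python, not FC0 sources; stage indices and names are taken from the pipeline diagram):

```python
# A tiny timing model of the pipeline above (illustration only).
# Stage indices follow the diagram:
#   FTCH(0) - DEC(1) - XBAR(2) - ASU1(3) - ASU2(4) - Xbar(5) - Reg(6)
XBAR_READ = 2    # stage where operands are read / bypassed in
XBAR_WRITE = 5   # stage where the result appears on the Xbar

def result_on_xbar(fetch_cycle: int) -> int:
    """Cycle at which an instruction's result reaches the Xbar."""
    return fetch_cycle + XBAR_WRITE

def earliest_dependent_fetch(fetch_cycle: int) -> int:
    """Earliest fetch cycle for a dependent instruction so that its
    XBAR (read) stage coincides with the producer's Xbar (write) stage."""
    return result_on_xbar(fetch_cycle) - XBAR_READ

print(result_on_xbar(1))            # 6 : matches the table above
print(earliest_dependent_fetch(1))  # 4 : fetch (2) at cycle 4 -> no stall
```

With (1) fetched at cycle 1, the model reproduces the table: the result hits the Xbar at cycle 6, so (2) must be fetched no earlier than cycle 4 to catch the bypass there.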
>
> > > however, there is another case that worries me :
> > > suppose that you want to add more than 5 registers, for example 20.
> > > This could work for any combination of other operations, of course
> > > (this is not specific to additions, i care about the latency).
> >
> > Such operations can probably be parallelized (if there are no
> > read-after-write dependencies). E.g. for the complex multiplication
> > above, we'll calculate the four products first, one after another,
> > and then add/subtract them.
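The scheduling idea Michael describes can be sketched like this (plain illustrative Python, not FC0 code): in (a+bi)*(c+di), the four products are mutually independent, so they can be issued back-to-back and fill the pipeline; only the final add/subtract depend on their results.

```python
# Illustrative sketch of the schedule: independent multiplies first,
# dependent add/sub last (no read-after-write hazards between products).
def complex_mul(a: float, b: float, c: float, d: float):
    # four independent multiplies -- issue them back-to-back
    ac, bd, ad, bc = a * c, b * d, a * d, b * c
    # the add/sub only start once the products are available
    return (ac - bd, ad + bc)   # (real part, imaginary part)

print(complex_mul(1, 2, 3, 4))  # (-5, 10)
```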
>
> now imagine even more operations going at the same time...
> and you try to schedule the operations correctly.
>
> > > So we have a burst of register values all over the place. The scheduler
> > > will take care to organise that cleanly. In order to have the fastest
> > > execution possible, one will "organise" the instruction ordering
> > > so independent operations are interleaved. That's the "usual job"
> > > when one optimises for RISC.
> > Yep.
> glad you agree, because today people are so lazy ;-)
>
> > > Now imagine that the registers are exhausted, or some similar
> > > pressure. Imagine that the instruction is issued one cycle after
> > > the necessary source data is present on the Xbar for bypass. The instruction
> > > will have to wait yet another cycle, until the register set stores and
> > > delivers the new value.
> >
> > If the instruction is issued at the beginning of the `result write'
> > cycle of the previous instruction, the value can be bypassed. If it is
> > issued one or more cycles later, it can read the result from the register.
> > (note that I use the term "instruction is issued" for the point in time
> > when the scheduler starts reading the operands -- the EU starts working
> > 1 cycle later).
> >
> > > For me, this situation (wait) is not tolerable : i guess that most
> > > of the time (once the code is optimised to a decent level)
> > > the 1-cycle penalty will occur often enough that the optimisation
> > > might not be worth it. If your optimisation yields poor speedup,
> > > you'll drop it, and i don't want to encourage that...
> >
> > Maybe I'm too "vernagelt" (thick-headed) to see your point.
> >
> > An instruction has 3 phases, right? Phase 1 is operand fetch (1 cycle),
> > phase 2 is calculation (n cycles) and phase 3 is result write (1 cycle
> > again). When a second instruction depends on the result of the first
> > one, it can start its phase 1 immediately after phase 3 of the previous
> > instruction is finished, that is, in the next cycle, no matter what
> > happens (assuming the EU is not busy, of course).
> >
> > Let me draw you a picture... first, the worst case:
> >
> > reg reg reg
> > + +----+----+ + +----+----+ +
> > |----| instr 1 |----|----| instr 2 |----|
> > + +----+----+ + +----+----+ +
> > Xbar Xbar Xbar Xbar
>
> i think that you overlooked the latency of the register set :
>
> reg regW RegR reg
> + +----+----+ + + +----+----+ +
> |----| instr 1 |----|----|----| instr 2 |----|
> + +----+----+ + + +----+----+ +
> Xbar Xbar Xbar Xbar
>
> So :
> not only do we have to care about the "direct" bypass that you
> have described below, but also about the "indirect" bypass, when the
> instruction arrives one cycle too late to benefit from the
> direct bypass. One cycle more and the register set is properly
> updated, so there is no danger.
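The direct/indirect distinction boils down to a small decision rule on operand sources; here is a hedged sketch (function and label names are invented, cycle arithmetic follows the diagrams):

```python
# Hypothetical sketch of the scheduler's operand-source decision,
# assuming the regW/RegR latency discussed above: the register set
# needs one extra cycle after write-back before it can be read.
def operand_source(producer_wb_cycle: int, consumer_read_cycle: int) -> str:
    delta = consumer_read_cycle - producer_wb_cycle
    if delta < 0:
        return "stall"            # result not on the Xbar yet
    if delta == 0:
        return "direct bypass"    # caught on the Xbar in the same cycle
    if delta == 1:
        return "indirect bypass"  # one cycle late: regfile not yet readable
    return "register set"         # the value has been committed

print(operand_source(6, 6))  # direct bypass
print(operand_source(6, 7))  # indirect bypass
print(operand_source(6, 8))  # register set
```

The `delta == 1` branch is exactly the troublesome window: without an extra bypass, the consumer can read neither the Xbar nor the register set in that cycle.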
>
> > The Xbar behaves just like another pipeline stage, and the register
> > bank like a pipeline register. Now, let's add the bypass:
> >
> > reg reg
> > + +----+----+ +
> > |----| instr 1 |----|
> > + +----+----+\ +
> > Xbar \
> > reg \ reg
> > + \+----+----+ +
> > |----| instr 2 |----|
> > + +----+----+ +
> > Xbar Xbar
> >
> > The result register is moved out of the data path, we save 1 cycle.
> i think that it saves 2 cycles in fact.
>
> > We cannot save more because there's a 1-cycle delay each time data
> > passes the Xbar.
>
> as you know, it's more complex than that in FC0.
> It inherits from strategies that are more than 10 years old (i.e. the ALPHA)
> but that are still difficult to use in the real world (of lazy coders).
>
> what bothers me is illustrated below :
> imagine that you take the latency into account.
> The ideal case is when you schedule the instructions so that
> results are available on the Xbar when you send the instruction.
> This perfect case is illustrated below :
>
> (1) ADD R1, R2, R3 ; r3=r1+r2
> (nop)
> (nop)
> (2) ADD R3, R4, R5 ; r5=r3+r4 <- data available on the Xbar
>
> cycle FTCH - DEC - XBAR - ASU1 - ASU2 - Xbar - Reg.
> 1 (1)
> 2 (nop) (1)
> 3 (nop) (nop) (1)
> 4 (2) (nop) (nop) (1)
> 5 (2) (nop) (1)
> 6 (2)-------bypass-----(1)
> 7 (2) (1)
> 8 (2)
> 9 (2)
> 10 (2)
>
> Now, imagine that instead of the nop, we have interleaved instructions
> AND there are too much instructions, exactly ONE MORE than in the ideal case :
>
> (1) ADD R1, R2, R3 ; r3=r1+r2
> (nop)
> (nop)
> (nop) ; -> the "killing" instruction
> (2) ADD R3, R4, R5 ; r5=r3+r4 <- data available on the Xbar
>
> cycle FTCH - DEC - XBAR - ASU1 - ASU2 - Xbar - Reg.
> 1 (1)
> 2 (nop) (1)
> 3 (nop) (nop) (1)
> 4 (nop) (nop) (nop) (1)
> 5 (2) (nop) (nop) (nop) (1)
> 6 (2)--R3 is the old value!--(1) <--- this is where the clash is !
> 7 (2) (1) <--- the situation here is not better.
> 8 (2)
> 9 (2)
> 10 (2)
> 11 (2)
>
> When the instruction arrives 1 cycle too late for the Xbar bypass,
> it is not possible to issue it, because the register set
> requires 1 more cycle to update. Issuing it without proper measures
> would be a catastrophe.
>
> There are 2 solutions : either delay the instruction, or add another
> small bypass. I think i have found a solution for the bypass :
> it's just another register, and we have to insert a few MUX entries
> in the Xbar-read part.
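A minimal sketch of that extra bypass register (all names are invented for illustration; a real implementation would be one 64-bit register plus a few extra MUX inputs in the Xbar-read stage, not a Python dict):

```python
# Behavioural model of the proposed fix: latch last cycle's write-back
# result, and let the Xbar-read MUX pick from it when the register set
# has not been updated yet.
class XbarRead:
    def __init__(self):
        self.held = {}   # reg number -> value written back last cycle

    def clock(self, wb_reg: int, wb_value: int) -> None:
        """End of cycle: latch what was just written back."""
        self.held = {wb_reg: wb_value}

    def read(self, reg: int, regfile: dict,
             wb_reg: int = None, wb_value: int = None) -> int:
        if reg == wb_reg:
            return wb_value        # direct bypass, straight from the Xbar
        if reg in self.held:
            return self.held[reg]  # indirect bypass: regfile not updated yet
        return regfile[reg]        # normal register read

xr = XbarRead()
regfile = {3: 0}            # stale value of R3 in the register set
xr.clock(3, 42)             # instruction (1) just wrote R3=42
print(xr.read(3, regfile))  # 42, via the extra bypass register
```

One cycle later the register set has committed the value, the held copy is overwritten by the next write-back, and reads fall through to the normal path again.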
>
Hum, strange, it's starting to look like my own bypass/variable-latency
system... ;p Be careful, whygee, you'll end up agreeing with me !
nicO
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu in the body. http://f-cpu.seul.org/