
Re: [f-cpu] Re: FC0 XBAR



On Wed, Aug 08, 2001 at 10:12:58PM -0400, nicO wrote:
> Yann Guidon wrote:
> > 
> > hi !
> > 
> > now that you have seen the picture, we'll be able to
> > speak about the same thing :-) [at last, but i'm such dog slow now...]
> > 
> > Michael Riepe wrote:
> > > > > Or did I miss something?
> > > > i think that we agree about what bypassing the RegSet in the 'Xbar'
> > > > (a set of mux) means.
> > > I think so...
> > 
> > however you probably did not understand that the Xbar took its own cycle.
> > This "pseudo-unit", which is in fact a set of MUX, consumes wires and
> > loads the transistors so it is a good thing that it takes its own cycle.
> > in the small example that i have posted (ASU+SHL+ROP2), we have
> > to multiplex 7*64 bits !

Huh?  I never got this reply to my mail.

Anyway - Yes, I'm aware of the nature of the Xbar.

> > hence : an instruction such as ADD is :
> > 
> >  fetch - decode - Xbar/issue - ASU1 - ASU2 - Xbar - Reg.

A more detailed version would be:

	fetch
	-- register --
	decode
	-- register --
	Xbar/issue
	-- register --
	ASU1
	-- register --
	ASU2
	-- register --
	Xbar
	-- register --

and the final register shown should be the one in the register bank.

> > The role of the Xbar is obvious when bypass occurs in the usual
> > way (bad scheduling) :
> > 
> > (1) ADD R1, R2, R3  ; r3=r1+r2
> > (2) ADD R3, R4, R5  ; r5=r3+r4 <- direct dependency -> stall
> > 
> > cycle FTCH - DEC - XBAR - ASU1 - ASU2 - Xbar - Reg.
> > 1      (1)
> > 2      (2)   (1)
> > 3            (2)    (1)
> > 4                   (2)    (1)
> > 5                   (2)           (1)
> > 6                   (2)-------bypass-----(1)
> > 7                          (2)                  (1)
> > 8                                 (2)
> > 9                                        (2)
> > 10                                              (2)

In fact, the scheduler should detect the dependency and delay (2),
like this:

	cycle FTCH - DEC - XBAR - ASU1 - ASU2 - Xbar - Reg.
	1      (1)
	2            (1)
	3                   (1)
	4      (2)                 (1)
	5            (2)                  (1)
	6                   (2)-------bypass-----(1)
	7                          (2)                  (1)
	8                                 (2)
	9                                        (2)
	10                                              (2)

(and of course the compiler should avoid generating code like this ;)
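
The delay rule can be sketched in a few lines of Python.  This is only a
toy model, not F-CPU code: it assumes one instruction is fetched per cycle,
every instruction goes through the two-cycle ASU, and the stage offsets are
the ones read off the diagram above (issue Xbar two cycles after fetch,
result on the Xbar five cycles after fetch).  The names `fetch_cycles`,
`ISSUE_OFFSET` and `RESULT_OFFSET` are made up for illustration.

```python
ISSUE_OFFSET = 2    # FTCH -> DEC -> XBAR (issue): two cycles after fetch
RESULT_OFFSET = 5   # FTCH -> ... -> ASU2 -> Xbar: result visible on the Xbar


def fetch_cycles(instrs):
    """Return the fetch cycle of each (src1, src2, dst) instruction.

    An instruction is delayed until its issue cycle is no earlier than the
    cycle in which the newest value of each source is on the Xbar (bypass).
    """
    ready = {}        # register name -> cycle its newest value is on the Xbar
    cycles = []
    next_fetch = 1    # in-order, at most one fetch per cycle
    for src1, src2, dst in instrs:
        need = max(ready.get(src1, 0), ready.get(src2, 0))
        fetch = max(next_fetch, need - ISSUE_OFFSET)
        cycles.append(fetch)
        ready[dst] = fetch + RESULT_OFFSET
        next_fetch = fetch + 1
    return cycles


# (1) ADD R1, R2, R3 ; (2) ADD R3, R4, R5
program = [("R1", "R2", "R3"), ("R3", "R4", "R5")]
print(fetch_cycles(program))   # prints [1, 4]
```

With the two dependent ADDs this gives fetch cycles 1 and 4, i.e. (2)
reaches the issue Xbar in cycle 6, exactly when (1)'s result is on the Xbar.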

[...]
> > > Let me draw you a picture... first, the worst case:
> > >
> > >    reg                 reg                 reg
> > >     +    +----+----+    +    +----+----+    +
> > >     |----| instr 1 |----|----| instr 2 |----|
> > >     +    +----+----+    +    +----+----+    +
> > >      Xbar           Xbar Xbar           Xbar
> > 
> > i think that you overlooked the latency of the register set :
> > 
> >    reg                regW  RegR               reg
> >     +    +----+----+    +    +    +----+----+    +
> >     |----| instr 1 |----|----|----| instr 2 |----|
> >     +    +----+----+    +    +    +----+----+    +
> >      Xbar           Xbar      Xbar           Xbar

What's the big difference between a pipeline register and a `regular'
CPU register?  Both can be read and written "at the same time".  If the
read happens 1 or more cycles after the write, the new value is returned,
otherwise the old value.
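
To make the read/write rule concrete, here is a minimal Python sketch of
such a clocked register.  It is a behavioural toy, not a description of the
actual register bank; the class name `ClockedRegister` and its methods are
assumptions for illustration, and writes are assumed to arrive in cycle
order.

```python
class ClockedRegister:
    """Register whose written value becomes visible one cycle later."""

    def __init__(self, initial):
        self.initial = initial
        self.writes = []                  # (cycle, value), in cycle order

    def write(self, cycle, value):
        self.writes.append((cycle, value))

    def read(self, cycle):
        value = self.initial
        for written_at, written_value in self.writes:
            if written_at < cycle:        # write was 1 or more cycles earlier
                value = written_value
        return value


r3 = ClockedRegister(initial=10)
r3.write(cycle=6, value=42)
print(r3.read(cycle=6))   # same cycle as the write: old value, prints 10
print(r3.read(cycle=7))   # one cycle later: new value, prints 42
```

The same class describes a pipeline register and a register-bank entry
equally well, which is the point of the question above.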

If the register bank's read mux is so "heavy" that it needs another cycle
(in addition to the Xbar cycle), the picture is wrong; in that case,
it looks like this:

        RegR               regW  RegR               regW
    +    +    +----+----+    +    +    +----+----+    +
    |----|----| instr 1 |----|----|----| instr 2 |----|
    +    +    +----+----+    +    +    +----+----+    +
     RMux Xbar           Xbar RMux Xbar           Xbar

But we don't want that, do we?

> > what bothers me is illustrated below :
> > imagine that you take the latency into account.
> > The ideal case is when you schedule the instructions so that
> > results are available on the Xbar when you send the instruction.

Yep.

[...]
> > Now, imagine that instead of the nop, we have interleaved instructions
> > AND there are too many instructions, exactly ONE MORE than in the ideal case :
> > 
> > (1) ADD R1, R2, R3  ; r3=r1+r2
> > (nop)
> > (nop)
> > (nop) ; -> the "killing" instruction
> > (2) ADD R3, R4, R5  ; r5=r3+r4 <- data available on the Xbar
> > 
> > cycle FTCH - DEC - XBAR - ASU1 - ASU2 - Xbar - Reg.
> > 1      (1)
> > 2     (nop)   (1)
> > 3     (nop)  (nop)  (1)
> > 4     (nop)  (nop)  (nop)  (1)
> > 5      (2)   (nop)  (nop) (nop)   (1)
> > 6             (2)--R3 is the old value!--(1)          <--- this is where the clash is !
> > 7                    (2)                        (1)   <--- the situation here is not better.
> > 8                          (2)
> > 9                                 (2)
> > 10                                       (2)
> > 11                                              (2)
> > 
> > When the instruction arrives 1 cycle too late for the Xbar bypass,
> > it is not possible to issue it, because the Register cycle
> > requires 1 cycle to update. Emitting it without proper measures
> > would be a catastrophe.

Assuming that the register write requires one additional cycle, you're
perfectly right.  But I think it shouldn't.
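
The two positions differ in a single parameter, which a small Python sketch
can show.  This is my reading of the diagrams, not an agreed specification:
`result_xbar_cycle` is the cycle in which the producer's result is on the
Xbar, `issue_cycle` is the consumer's Xbar/issue cycle, and
`reg_write_cycles` is the hypothetical number of extra cycles the register
bank needs before the new value can be read back.

```python
def operand_source(result_xbar_cycle, issue_cycle, reg_write_cycles):
    """Say where a dependent instruction would get its operand from."""
    if issue_cycle < result_xbar_cycle:
        return "stall"       # result not computed yet
    if issue_cycle == result_xbar_cycle:
        return "bypass"      # result is on the Xbar right now
    if issue_cycle > result_xbar_cycle + reg_write_cycles:
        return "register"    # new value can be read from the register bank
    return "hazard"          # too late for bypass, too early for the register


# Result of (1) on the Xbar in cycle 6, (2) issued in cycle 7 (one cycle late):
print(operand_source(6, 7, reg_write_cycles=1))   # prints hazard
print(operand_source(6, 7, reg_write_cycles=0))   # prints register
```

With one extra write cycle the late instruction falls into the gap and must
be held back; without it, the gap does not exist.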

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"