
Re: [f-cpu] Re: FC0 XBAR



hi !

now that you have seen the picture, we'll be able to
speak about the same thing :-) [at last, but i'm awfully slow these days...]

Michael Riepe wrote:
> > > Or did I miss something?
> > i think that we agree on what bypassing the RegSet in the 'Xbar'
> > (a set of MUXes) means.
> I think so...

however, you probably did not realise that the Xbar takes its own cycle.
This "pseudo-unit", which is in fact a set of MUXes, consumes wires and
loads the transistors that drive them, so it is a good thing that it
gets its own cycle.
in the small example that i have posted (ASU+SHL+ROP2), we have
to multiplex 7*64 bits !

hence, an instruction such as ADD goes through :

 fetch - decode - Xbar/issue - ASU1 - ASU2 - Xbar - Reg.
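
if you prefer code to ASCII art, here is a tiny Python sketch (purely
illustrative, not taken from the FC0 sources) that just restates the
stage list above :

  # the seven stages an ADD goes through, one per clock cycle
  ADD_PIPELINE = ["fetch", "decode", "Xbar/issue", "ASU1", "ASU2", "Xbar", "reg"]

  print(len(ADD_PIPELINE))               # 7 cycles from fetch to write-back
  # exact string match -> index 5, i.e. the result is driven on the Xbar
  # during the 6th cycle of the instruction's life
  print(ADD_PIPELINE.index("Xbar") + 1)  # -> 6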

The role of the Xbar is obvious when bypass occurs in the usual
way (bad scheduling) :

(1) ADD R1, R2, R3  ; r3=r1+r2
(2) ADD R3, R4, R5  ; r5=r3+r4 <- direct dependency -> stall

cycle FTCH - DEC - XBAR - ASU1 - ASU2 - Xbar - Reg.
1      (1)
2      (2)   (1)
3            (2)    (1)
4                   (2)    (1)
5                   (2)           (1)
6                   (2)-------bypass-----(1)
7                          (2)                  (1)
8                                 (2)
9                                        (2)
10                                              (2)

At cycle 6, the value finally becomes available on the Xbar, so (2)
can pick it up through the bypass; shortly afterwards it can also be
read from the register set.
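
the same table as a toy timing model (Python again, a rough sketch that
only mimics the diagram above, not the real issue logic) : instruction
(2) sits in the XBAR/issue stage until its operand shows up :

  STAGES = ["FTCH", "DEC", "XBAR", "ASU1", "ASU2", "Xbar", "Reg"]

  def schedule(fetch_cycle, operands_ready):
      # returns {cycle: stage} for one instruction; operands_ready(c)
      # says whether the source data can be picked up (from the Xbar
      # bypass or from the register set) at cycle c
      timeline, cycle, stage = {}, fetch_cycle, 0
      while stage < len(STAGES):
          timeline[cycle] = STAGES[stage]
          # an instruction stalls in XBAR/issue until its operands arrive
          if STAGES[stage] != "XBAR" or operands_ready(cycle):
              stage += 1
          cycle += 1
      return timeline

  i1 = schedule(1, lambda c: True)      # (1) never waits
  i2 = schedule(2, lambda c: c >= 6)    # (2) waits for the bypass at cycle 6
  for c in range(1, 11):
      print(c, i1.get(c, ""), i2.get(c, ""))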

> > however, there is another case that worries me :
> > suppose that you want to add more than 5 registers, for example 20.
> > This could work for any combination of other operations, of course
> > (this is not specific to additions, i care about the latency).
> 
> Such operations can probably be parallelized (if there are no
> read-after-write dependencies).  E.g. for the complex multiplication
> above, we'll calculate the four products first, one after another,
> and then add/subtract them.

now imagine even more operations going on at the same time...
while you try to schedule them correctly.

> > So we have a burst of register values all over the place. The scheduler
> > will take care to organise that cleanly. In order to have the fastest
> > execution possible, one will "organise" the instruction ordering
> > so independent operations are interleaved. That's the "usual job"
> > when one optimises for RISC.
> Yep.
glad you agree, because today people are so lazy ;-)

> > Now imagine that the register number is exhausted, or some pressure
> > like that. imagine that the instruction is issued one cycle after
> > the necessary source data is present on the Xbar for bypass. The instruction
> > will have to wait yet another cycle, until the register set memorises and
> > gives the new value.
> 
> If the instruction is issued at the beginning of the `result write'
> cycle of the previous instruction, the value can be bypassed.  If it is
> issued one or more cycles later, it can read the result from the register.
> (note that I use the term "instruction is issued" for the point in time
> when the scheduler starts reading the operands -- the EU starts working
> 1 cycle later).
> 
> > For me, this situation (wait) is not tolerable because i guess that most
> > of the "desirable" time (when the code will be optimised at a decent level)
> > the 1-cycle penalty might occur often enough that optimisation might not
> > be worth. If your optimisation yeilds poor speedup, you'll drop it and
> > i don't want to encourage that...
> 
> Maybe I'm too "vernagelt" (thick-headed) to see your point.
> 
> An instruction has 3 phases, right?  Phase 1 is operand fetch (1 cycle),
> phase2 is calculation (n cycles) and phase 3 is result write (1 cycle
> again).  When a second instruction depends on the result of the first
> one, it can start its phase 1 immediately after phase 3 of the previous
> instruction is finished, that is, in the next cycle, no matter what
> happens (assuming the EU is not busy, of course).
> 
> Let me draw you a picture... first, the worst case:
> 
>    reg                 reg                 reg
>     +    +----+----+    +    +----+----+    +
>     |----| instr 1 |----|----| instr 2 |----|
>     +    +----+----+    +    +----+----+    +
>      Xbar           Xbar Xbar           Xbar

i think that you overlooked the latency of the register set :

   reg                regW  RegR               reg
    +    +----+----+    +    +    +----+----+    +
    |----| instr 1 |----|----|----| instr 2 |----|
    +    +----+----+    +    +    +----+----+    +
     Xbar           Xbar      Xbar           Xbar

So :
Not only do we have to care about the "direct" bypass that you
have described below, but also about the "indirect" bypass, when the
instruction arrives one cycle too late to benefit from the
direct bypass. One cycle later still, the register set is properly
updated, so there is no danger.
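
in code, the operand-source decision looks roughly like this (a
hypothetical helper that only restates the three cases, not the actual
MUX control) :

  def operand_source(late_by):
      # late_by = cycles between the producer's Xbar (write) cycle and
      # the consumer's XBAR (issue) cycle
      if late_by == 0:
          return "direct bypass on the Xbar"
      if late_by == 1:
          return "indirect case : regW not finished yet, needs help"
      return "register set (regW and RegR are done)"

  for d in range(3):
      print(d, "->", operand_source(d))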

> The Xbar behaves just like another pipeline stage, and the register
> bank like a pipeline register.  Now, let's add the bypass:
> 
>    reg                 reg
>     +    +----+----+    +
>     |----| instr 1 |----|
>     +    +----+----+\   +
>      Xbar            \
>                   reg \               reg
>                    +   \+----+----+    +
>                    |----| instr 2 |----|
>                    +    +----+----+    +
>                     Xbar           Xbar
> 
> The result register is moved out of the data path, we save 1 cycle.
i think that it saves 2 cycles in fact.

> We cannot save more because there's a 1-cycle delay each time data
> passes the Xbar.

as you know, it's more complex than that in FC0.
it inherits from strategies that are more than 10 years old (i.e. the ALPHA)
but that are still difficult to use well in the real world (of lazy coders).

what bothers me is the following :
imagine that you take the latency into account.
The ideal case is when you schedule the instructions so that
a result is already on the Xbar when the instruction that needs
it is issued. This perfect case is illustrated below :

(1) ADD R1, R2, R3  ; r3=r1+r2
(nop)
(nop)
(2) ADD R3, R4, R5  ; r5=r3+r4 <- data available on the Xbar

cycle FTCH - DEC - XBAR - ASU1 - ASU2 - Xbar - Reg.
1      (1)
2     (nop)   (1)
3     (nop)  (nop)  (1)
4      (2)   (nop) (nop)   (1)
5             (2)  (nop)          (1)
6                   (2)-------bypass-----(1)
7                          (2)                  (1)
8                                 (2)
9                                        (2)
10                                              (2)
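
(the two nops are not arbitrary : a throw-away Python helper, assuming
a 2-cycle EU like the ASU, gives the distance needed between the two
fetches)

  def independent_insns_needed(eu_latency):
      # the producer drives the Xbar 'eu_latency + 3' cycles after its
      # fetch (FTCH, DEC, XBAR, the EU, then the Xbar write); the
      # consumer reads in its own XBAR stage, 2 cycles after its fetch,
      # so the two fetches must be 'eu_latency + 1' cycles apart
      fetch_gap = eu_latency + 1
      return fetch_gap - 1             # instructions in between

  print(independent_insns_needed(2))   # ASU1+ASU2 -> 2 (the two nops above)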


Now, imagine that instead of the nops, we have interleaved instructions
AND there are too many of them, exactly ONE MORE than in the ideal case :

(1) ADD R1, R2, R3  ; r3=r1+r2
(nop)
(nop)
(nop) ; -> the "killing" instruction
(2) ADD R3, R4, R5  ; r5=r3+r4 <- data available on the Xbar

cycle FTCH - DEC - XBAR - ASU1 - ASU2 - Xbar - Reg.
1      (1)
2     (nop)   (1)
3     (nop)  (nop)  (1)
4     (nop)  (nop)  (nop)  (1)
5      (2)   (nop)  (nop) (nop)   (1)
6             (2)--R3 is the old value!--(1)          <--- this is where the clash is !
7                    (2)                        (1)   <--- the situation here is not better.
8                          (2)
9                                 (2)
10                                       (2)
11                                              (2)

When the instruction arrives 1 cycle too late for the Xbar bypass,
it is not possible to issue it, because the register set
requires 1 more cycle to be updated. Issuing it without proper
measures would be a catastrophe (it would read the old value of R3).

There are 2 solutions : either delay the instruction or add another
small bypass. I think i have found the solution for the bypass :
it's just another register, plus a few more MUX entries
in the Xbar-read part.
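
in (pseudo-)code, the idea is something like this (only a sketch of the
intent, the class and the names are invented, nothing here comes from
the actual sources) :

  class XbarReadPort:
      # one extra register holds what was driven on the Xbar last cycle,
      # and the read MUX gets one more input to select it
      def __init__(self):
          self.held = {}                # reg number -> value of last cycle

      def clock(self, xbar_now):
          self.held = dict(xbar_now)    # the "just another register"

      def read(self, reg, xbar_now, regfile):
          if reg in xbar_now:
              return xbar_now[reg]      # direct bypass (right on time)
          if reg in self.held:
              return self.held[reg]     # indirect bypass (1 cycle late)
          return regfile[reg]           # normal read, register set updated

of course the real thing is a MUX select computed by the scheduler, not
a Python dict, but the priority order is the point : live Xbar value
first, then the held copy, then the register set.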


FYI (and for those who don't follow the thread easily) :
* the Decode stage is when the opcode is decoded and the registers
are read.
* the Xbar stage is when the data are selected and combined with the
decoded information, so we know whether we "issue" the instruction.

In the last stage, where the register is written back,
the data takes a full cycle before it is available again
for reading in the Decode stage.



When writing this post, i have discovered several flaws
in my previous strategies. The descriptions are informational only,
but i presume that they are flawed in one way or another.
For example, in my 2nd example, i am not sure that the "ideal"
latency is 2 cycles. Same for the others.

it will take some time to repair the code and the model,
but i think it is only a matter of a few hours or days.
The fixes will require a few more registers and MUXes but i'll
keep them out of the critical datapath.


>  Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~