Re: [f-cpu] Re: FC0 XBAR



hi !

Juergen Goeritz wrote:
> On Fri, 3 Aug 2001, Michael Riepe wrote:
> > On Fri, Aug 03, 2001 at 03:55:11AM +0200, Yann Guidon wrote:
> > > > Well, let's try an example to make this clearer.  Let's say we want to
> > > > add 3 numbers:
> > > >         add r5, r1, r2  ; temporary result in r5
> > > >         add r4, r3, r5  ; final result in r4
> > > this is written in 'classical' risc / x86 fashion, it seems :-)
> > If you find a way to encode instructions like
> >
> >       r5 = r1 * r3 - r2 * r4
> >       r6 = r1 * r4 + r2 * r3
> >
> > directly, without using temporary registers, I will use it! :)
> Hi Yann, that reminds me of something complex...

how many microprocessors do you know that have 2 parallel multipliers ? :-)


> > Maybe I'm too "vernagelt" to see your point.
> >
> > An instruction has 3 phases, right?  Phase 1 is operand fetch (1 cycle),
> > phase2 is calculation (n cycles) and phase 3 is result write (1 cycle
> > again).  When a second instruction depends on the result of the first
> > one, it can start its phase 1 immediately after phase 3 of the previous
> > instruction is finished, that is, in the next cycle, no matter what
> > happens (assuming the EU is not busy, of course).
> >
> > Let me draw you a picture... first, the worst case:
> >
> >    reg                 reg                 reg
> >     +    +----+----+    +    +----+----+    +
> >     |----| instr 1 |----|----| instr 2 |----|
> >     +    +----+----+    +    +----+----+    +
> >      Xbar           Xbar Xbar           Xbar
> >
> > The Xbar behaves just like another pipeline stage, and the register
> > bank like a pipeline register.  Now, let's add the bypass:
> >
> >    reg                 reg
> >     +    +----+----+    +
> >     |----| instr 1 |----|
> >     +    +----+----+\   +
> >      Xbar            \
> >                   reg \               reg
> >                    +   \+----+----+    +
> >                    |----| instr 2 |----|
> >                    +    +----+----+    +
> >                     Xbar           Xbar
> >
> > The result register is moved out of the data path, we save 1 cycle.
> > We cannot save more because there's a 1-cycle delay each time data
> > passes the Xbar.
> 
> Can you guarantee any timing here? Since you have to
> add the time for execution and bypassing it may get
> very tight.

in the FC0 pipeline, everything is very tight.
the solution is to do one single and simple thing per stage.
This is why the usual bypass will take one full cycle.
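
To put numbers on the two pictures above, here is a minimal Python sketch that
counts cycles for two dependent instructions. It assumes, as in Michael's
description, that every Xbar crossing costs exactly 1 cycle and that the
register bank acts as a plain pipeline register with no extra cost. The
function and parameter names are made up for illustration only.

```python
XBAR_CYCLES = 1  # assumption taken from the thread: one cycle per Xbar crossing


def dependent_pair_cycles(exec_cycles_1, exec_cycles_2, bypass):
    """Total cycles for instr 1 followed by an instr 2 that needs its result."""
    # instr 1: operand fetch through the Xbar, then n cycles of calculation
    total = XBAR_CYCLES + exec_cycles_1
    if bypass:
        # the result crosses the Xbar once and lands directly on instr 2's inputs
        total += XBAR_CYCLES
    else:
        # the result is written to the register bank, then read back for instr 2
        total += XBAR_CYCLES + XBAR_CYCLES
    # instr 2: n cycles of calculation, then result write through the Xbar
    total += exec_cycles_2 + XBAR_CYCLES
    return total


if __name__ == "__main__":
    for n in (1, 2, 6):
        worst = dependent_pair_cycles(n, n, bypass=False)
        best = dependent_pair_cycles(n, n, bypass=True)
        print(f"n={n}: without bypass {worst} cycles, with bypass {best} cycles")
```

Whatever the execution length, the bypass saves exactly one cycle in this
model, and the remaining Xbar crossing between the two instructions cannot be
removed.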

> I think this kind of a hack should NEVER
> be taken for a bit-width upward-extensible core!!!
> Just my opinion though ;)

don't worry : as soon as FC0 is running, FC1 will be created.
then FC2, FC3... these new cores (yet to be done) will hopefully
take our errors into account :-)

> JG

Juergen Goeritz also wrote:
> On Fri, 3 Aug 2001, Yann Guidon wrote:
> > hi,
> >
> > Juergen Goeritz wrote:
<>
> > i'll have to make a "little drawing", now ...
> > <2 hours later ...>
> > ok i've done a little sketch.
> It's good that you made the drawing! ;)
it's so rare, because it usually takes so long to do !
doing a hand drawing takes less than one minute,
a computer drawing is 100x slower.

> > you will understand why it is called "superpipeline" because execution occurs
> > 2 cycles after the register read has started. As i told you one day, F-CPU
> > is not your "average CPU core" :-)
> 
> Yes, I remember :-) But then I didn't raise the question
> of branching, did I?
I did :-)

> How much delay will the branch issue
> bring in when execution is 2 cycles delayed?
here are some particular points :
 - a conditional jump is taken when a condition is met. the condition is
    based on the current value of a specified register : either
    the register is =0, its LSB is set, or its MSB is set.
 - The value of these "flags" is updated at register writeback time.
    it uses a special mechanism so you don't have to explicitly
    read the register in order to know if the flag is true.
    (it "caches" the useful information).
 - at decode stage, the condition is examined when the jump (or move)
    instruction is found in the instruction stream. The load and jump
    instructions work with one register that points to the data to
    load/store or the location to jump to. These data are cached too
    (at least : as much as possible). So within the decode cycle,
    we can know if we can jump or not.
 - The decision is taken at issue cycle (now it overlaps the Xbar read
    cycle because the decode stage would be too long otherwise).
    This means that in any case, the jump latency is 1 cycle when taken.
 - whether the condition is "ready" or not (that is : how many cycles
    between the time you issued the instruction that writes the register,
    and the time you issue the jump instruction) is [...]

[...] the registers are read. Following 2 branches would
double the complexity of the chip and an average of 1/2 would be used in the
end.

For example, the fetcher could provide 2 instructions corresponding to
the current stream, and the one where we can jump to. This would mean that
we require 6 read ports for the register set, as many Xbar inputs, and in the
end only one value would be used.

Maybe FC1 will "solve" this issue but this trick keeps the pipeline short,
which balances the loss. nicolas proposed to use delayed branches but this is
not a recommended practice for something that could go superscalar !
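
The condition "caching" described in the points above (flags refreshed at
writeback, consulted at decode without reading the register) can be sketched
in a few lines of Python. This is only a rough model: the 64-bit width, the
64-entry register set and all names (FlaggedRegisterFile, writeback,
condition) are assumptions chosen for the example, not taken from the FC0
sources.

```python
WIDTH = 64                 # assumed register width
MASK = (1 << WIDTH) - 1


class FlaggedRegisterFile:
    """Register set that keeps zero/LSB/MSB flags next to each value."""

    def __init__(self, nregs=64):  # assumed register count
        self.values = [0] * nregs
        self.flags = [self._flags_of(0) for _ in range(nregs)]

    @staticmethod
    def _flags_of(value):
        return {
            "zero": value == 0,
            "lsb": bool(value & 1),
            "msb": bool((value >> (WIDTH - 1)) & 1),
        }

    def writeback(self, reg, value):
        # flags are recomputed only here, at register writeback time
        value &= MASK
        self.values[reg] = value
        self.flags[reg] = self._flags_of(value)

    def condition(self, reg, cond, negate=False):
        # the decoder looks at the cached flag, it never reads self.values
        return self.flags[reg][cond] != negate


if __name__ == "__main__":
    rf = FlaggedRegisterFile()
    rf.writeback(5, 0)
    rf.writeback(6, -1)    # wraps to all ones after masking
    print(rf.condition(5, "zero"), rf.condition(6, "msb"))
```

The point of the design is that the jump decision needs no register read
port: the three bits per register are small enough to sit next to the decoder.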

the solution that i propose is focused on the "fetcher".
In the drawing that you have seen, only the "execution pipeline" is represented.
The memory part will be added later, as well as the instruction fetch mechanics.
If we design an "intelligent" fetcher (or, should i say, "smarter"), we can probably
reduce the jump latency (which is only one cycle, so why bother).
We can resort to "trace caching" and on-the-fly instruction swapping
to perform the delayed branch in hardware for example.


Proposition for a HW-based instruction swapper to perform "delayed branch"
without software help :
  if there is no dependency between the jump instruction and the instruction
  that precedes it (if it is not a jump instruction too, and if it doesn't use
  the same registers), swap the two instructions in the instruction cache.
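
As a rough software model of this proposition, the Python sketch below walks
over one cache line and swaps a jump with the instruction before it when the
rule allows it. The Instr record, its fields and the register names are
hypothetical. The dependency test is deliberately conservative: any register
shared between the two instructions blocks the swap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    op: str
    regs: frozenset        # every register the instruction reads or writes
    is_jump: bool = False


def can_swap(prev, jump):
    """The rule from the proposition: prev is not a jump and shares no register."""
    return jump.is_jump and not prev.is_jump and prev.regs.isdisjoint(jump.regs)


def swap_delayed_branches(line):
    """Return a copy of the cache line with eligible (instr, jump) pairs swapped."""
    out = list(line)
    i = 1
    while i < len(out):
        if can_swap(out[i - 1], out[i]):
            out[i - 1], out[i] = out[i], out[i - 1]
            i += 2         # the moved instruction now sits in the delay slot
        else:
            i += 1
    return out


if __name__ == "__main__":
    add = Instr("add", frozenset({"r1", "r2", "r3"}))
    jmp = Instr("jmp", frozenset({"r4", "r5"}), is_jump=True)
    print([i.op for i in swap_delayed_branches([add, jmp])])
```

A real fetcher would also have to handle a pair that straddles two cache
lines; this sketch ignores that case.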

I hope that it makes nicolas happy :*)

> > Due to the already complex drawing, i have not included the scheduler
> > and other control signals. Even the wire names are not accurate.
> > I have included 3 execution units only. However it gives a rough
> > idea about how it is designed. You can now read the QDCPOC source code
> > with the (partial) map under your eyes.
> Would be interesting to see the complete picture...
it WILL :-) but i am not sure that my screen can display everything ...

there's still a lot of work to do.

WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~