
Re: Tr:[f-cpu] usage of 64 registers & ILP


Martin Devera wrote:
> > We are currently working on a single-issue superpipeline core
> > where each operation (except a few exceptions) can be pipelined.
> > If most units have 2 cycles of latency (as is the case now), it's a
> > bit like working with a 3-issue superscalar CPU.
> >
> > In FC0, the ILP depends on the kind of operations to perform.
> > Fortunately, most code is a mix of different operation types.
> >
> > Currently, there are only integer arithmetic operations,
> > so an addition requires 2 cycles and a multiply up to 8 cycles.
> > The average necessary ILP is around 3 or 4, to be safe.
> ohh yes I read all docs and all old mailing list issues about
> f-cpu ;-) Sometimes I had a hard time orienting myself among
> the terms (I had to learn Tomasu... - can't remember the name -
> and other similar algorithms to keep track).

FC0 doesn't use Tomasulo's algorithm but a scoreboard system :
a resource is either free or busy, and an instruction is "issued"
(enters the "execution pipeline" and executes in a straight line)
only when all the necessary resources are available.

it's like sending a letter by mail : you can't post it
until you have the stamp, so you have to prepare the stamp first...
it's static scheduling, which doesn't pull the most out of the
machine but is simple, RISC-like and eases compiler design.
And the latency is mostly predictable (as long as memory
doesn't interfere too much, but there are techniques...)
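for illustration, the "free or busy" rule can be sketched in a few
lines of Python (a toy model only, not the actual FC0 logic ; the
class and method names are made up) :

```python
# toy scoreboard : one busy bit per register. An instruction may
# issue only when none of its resources (sources and destination)
# is still in flight ; issuing marks the destination busy until
# write-back clears it.
class Scoreboard:
    def __init__(self, num_regs=64):
        self.busy = [False] * num_regs

    def can_issue(self, srcs, dest):
        # stall while any needed register is still in flight
        return not any(self.busy[r] for r in (*srcs, dest))

    def issue(self, dest):
        self.busy[dest] = True      # result pending

    def writeback(self, dest):
        self.busy[dest] = False     # result available again

sb = Scoreboard()
sb.issue(1)                       # i1 : add r3,r2,r1 -> r1 in flight
print(sb.can_issue((3, 1), 4))    # i2 reads r1 -> must stall : False
sb.writeback(1)
print(sb.can_issue((3, 1), 4))    # r1 written back -> can issue : True
```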

> Maybe I misunderstand the FC0 scheduler - I thought that the decode part
> of pipeline can stall simply when there is RAW/WAW - scoreboard bit
> of source register is set. So that when you do
> i1:add r1,r2,r3
> i2:add r4,r1,r3
> it would produce:
> cycle:  1       2       3       4       5       6       7       8
> i1:     fetch   decd    xbar    asu1    asu2    xbar    rwrt
> i2:             fetch   ------- stall---------------------      decd

i presume you want to write :
 i1:add r3,r2,r1
 i2:add r3,r1,r4
so there is a dependence on R1
(this is because the destination register is at the last position)

if so, the actual stall is somewhat reduced for several reasons :

- fetch is done automatically by the fetcher. it's like a collection
of instruction FIFOs. Compared to a Pentium with a 16-byte FIFO
(and a FIFO between decode and fetch), there are 8 independent 32-byte lines
which automatically fetch the next instruction and also form a user-manageable
return stack. So you can mostly forget the "fetch" stage unless you jump
(there is a 1-cycle stall on a taken branch when all data is ready).

- decode : all fields are examined in parallel, and all the resources
are deduced. The register set is read and all the other flags are
fetched. Only this stage and the fetch stage are allowed to stall
(otherwise the amount of logic would bloat the whole CPU).
Though the register's value or the opcode doesn't change during a stall,
different flags can be updated : pointer, zero flags... so it just
sits here until the necessary flags are ready.
This means that when there is a stall, decoding has already started (fortunately,
otherwise we wouldn't know whether to stall or not ;-D).
This removes yet another cycle from your stall scenario.

- xbar (read) is also an "issue" stage : all the flags are gathered
and the CPU decides whether to fetch the next opcode, stall or trap.
The data is put "on hold" on the Xbar until the instruction can issue.

- the last Xbar stage provides the result to the other units, it's a kind
of bypass net. So the next instruction doesn't have to wait too long before
being able to start : the Xbar and register write stages can be bypassed.

So from the code point of view, this is equivalent to :

 i1:add r3,r2,r1
// asu1 stall
// asu2 stall
 i2:add r3,r1,r4
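that timing can be replayed with a toy single-issue model (Python ;
the "+1" models the issue cycle itself and ASU_LATENCY the asu1/asu2
stages - an illustrative sketch, not the real pipeline) :

```python
# toy issue-time model : a consumer stalls until its sources have
# reached the bypass Xbar, i.e. (issue cycle + 1 + ASU latency).
ASU_LATENCY = 2   # asu1 + asu2

def schedule(instrs):
    """instrs : list of (dest, srcs) ; returns each issue cycle."""
    ready = {}        # register -> cycle when bypassable
    cycles = []
    cycle = 0
    for dest, srcs in instrs:
        cycle = max([cycle] + [ready.get(r, 0) for r in srcs])
        cycles.append(cycle)
        ready[dest] = cycle + 1 + ASU_LATENCY
        cycle += 1    # single issue : one instruction per cycle
    return cycles

# i1 : add r3,r2,r1   i2 : add r3,r1,r4  (RAW dependence on r1)
print(schedule([(1, (3, 2)), (4, (3, 1))]))   # [0, 3] : 2 stall cycles
```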

i made a PS about that, i should integrate it in the next manual
revision, but i'm so tired ... and i prefer to code...

> so that there would be latency 6 and you will have to find
> appropriate ILP in code. Or am I totally wrong ?
> I'd expect the stall to occur later say just after xbar at
> cycle 3 ..

This is not documented much (except during some threads on this list)
but you now get the picture.
It would have been suicidal to do what you first describe.
However, we can't avoid the memory latency.

> > pipeline units, if the latency of a single operation does not fit inside
> > a simple loop, you can "software pipeline" the loop :
> >  duplicate each instruction and rename each register of the copy
> > (something like adding 32 to each register number).
> > The loop size increases but the stalls are filled with useful
> > operations. This is one very good reason for having a large register set.
> like: (assume that r1 == 8)
> loadi 8,r2,r10
> add r9,r5,r20
> storei 8,r3,r19
> loadi 8,r2,r11
> add r10,r5,r21
> storei 8,r3,r20
> ..... ?

is it a copy loop ?
i don't understand it well.

> Would it be very complex to add a special 5-bit register and add
> its value to register numbers >32 in the decode stage ? Like:

it would add a "status bit" and make the programming model more complex...
it's better done by external software.

> -- initialize prolog manually --
> l1:     loadi 8,r2,r32
>         add r33,r5,r34
>         storei 8,r3,r35
>         loop.c r3,r4 ; r3==l1 and r4 is loop kernel counter
> -- unrolled loop epilog here ---
> where loop.c would simply increment the register add number
> with overflow (no saturation) ?
> With a simple circuit it could also be used to create function
> call prologs/epilogs by testing the add number for overflow and
> calling a spilling handler ...
> The added register number could be computed in parallel during
> the decode stage and would affect registers > 32 only.
> The result is support for sw pipelining without the need to
> unroll it - thus less pressure on instruction fetch.
> Does it make sense ?

usually, instruction fetch should not be a problem,
except for the first execution (because the loop body
must be fetched) but it's not relevant after you loop several times.

The problem comes from making a simple, extensible, stable
and context-switch-safe programming model. Adding another "status"
bit can create problems that reach further than you'd think (browse the
mailing list archive from 3 years ago ;-D)

If you look at today's coding rules and compiler outputs, you'll
see that in fact loop unrolling is not such a problem for small
loop kernels. The loop speed might be limited by the load and store
instructions' latency but in fact, if we unroll more than once,
there is almost no return because the memory system can't keep up.
After a bit of head scratching, i just understood that your code
is in fact a vector add loop, which is in fact simple to do.
I am however puzzled by your register allocation.

Here is a code that performs [r1] + r2 -> [r3] :

  load [r1], r4;
  add r4, r2, r5;
  store [r3], r5;
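(in plain terms the body is just an element-wise "add a constant" ;
a one-line Python equivalent, for reference only :)

```python
# what the three-instruction loop body computes over a whole array :
# dst[i] = src[i] + k, with the constant k held in r2.
def vector_add(src, k):
    return [x + k for x in src]

print(vector_add([1, 2, 3, 4], 10))   # [11, 12, 13, 14]
```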

However :
 - load has a 1-cycle latency (due to data alignment).
 - ASU (>8 bits) has 2 cycles of latency
 - no pointer update is done yet
So the ASU forces us to unroll the loop at least 3x, but it's done
simply with copy/paste/update.
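the "at least 3x" figure can be derived from the latencies : when the
unrolled copies are grouped by type, a consumer group starts N cycles
after its producer group, so the stalls disappear once
N >= latency + 1. A back-of-the-envelope check (Python, using the
latencies quoted above ; a model of the argument, not of the CPU) :

```python
# minimum unroll factor for a type-grouped schedule : each producer
# needs its consumer to issue at least (latency + 1) cycles later.
LOAD_LATENCY = 1   # alignment cycle
ASU_LATENCY = 2    # >8-bit add

def min_unroll(*latencies):
    return max(lat + 1 for lat in latencies)

print(min_unroll(LOAD_LATENCY, ASU_LATENCY))   # 3 -> unroll 3x
```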

1) Transformation into loop :

  loadimm loopcount,r7;
loopentry r6; // loop starts in the next instruction
  loadi 8, [r1], r4;
  add r4, r2, r5;
  storei 8, [r3], r5;
  loop r6,r7;    // when r7 is already zero, exit the loop.

2) loop unrolling :
for convenience, i'll unroll 4x, so we deal with (hopefully)
aligned lines of 256 bits.

  loadimm loopcount>>2,r7;
loopentry r6; // loop starts in the next instruction

// copy/paste/interleave :
  loadi [r1], r4;
  loadi [r1+16], r4+16;
  loadi [r1+32], r4+32;
  loadi [r1+48], r4+48;
  add r4, r2, r5;
  add r4+16, r2, r5+16;
  add r4+32, r2, r5+32;
  add r4+48, r2, r5+48;
  storei [r3], r5;
  storei [r3+16], r5+16;
  storei [r3+32], r5+32;
  storei [r3+48], r5+48;

  loop r6,r7;    // when r7 is already zero, exit the loop.

In fact, i just used 16, 32 and 48 for convenience, the register
numbers should be as consecutive as possible. The loop body uses
4 registers per thread so it's possible to do better but i don't have
enough neurons tonight.

I also use distinct source and result registers. It's not mandatory
but can be critical in some situations, for example if FP were used
(it eases FP trap handling, for example).

3) version with renamed registers :
  loadimm loopcount>>2,r7;
loopentry r6; // loop starts in the next instruction

// copy/paste/interleave :
  loadi 32, [r1], r4;
  loadi 32, [r17], r20;
  loadi 32, [r33], r36;
  loadif 32, [r49], r52;
  add r4, r2, r5;
  add r20, r2, r21;
  add r36, r2, r37;
  add r52, r2, r53;
  storei 32, [r3], r5;
  storei 32, [r19], r21;
  storei 32, [r35], r37;
  storeif 32, [r51], r53;
  loop r6,r7;    // when r7 is already zero, exit the loop.

I have added an 'f' suffix to the last loads and store to indicate that
the value of the corresponding line is not needed anymore, so the line
can be flushed from the LSU and the L1 cache.
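the renaming itself is purely mechanical : copy the body and offset
every register number (except the loop-invariant r2) by 16 per copy,
as in the example above. A small Python sketch of the transform (the
tuple encoding of instructions is made up for illustration) :

```python
# software-pipelining rename : duplicate the loop body and add a
# fixed offset (16 per copy here) to every register number, keeping
# the loop-invariant constant register r2 untouched.
BODY = [("loadi",  "r1", None, "r4"),
        ("add",    "r4", "r2", "r5"),
        ("storei", "r3", None, "r5")]

def rename(instr, offset, keep=("r2",)):
    def off(r):
        if r is None or r in keep:
            return r
        return "r%d" % (int(r[1:]) + offset)
    op, a, b, d = instr
    return (op, off(a), off(b), off(d))

unrolled = [rename(i, 16 * n) for n in range(4) for i in BODY]
print(unrolled[3])   # copy #1's load : ('loadi', 'r17', None, 'r20')
```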

It should do the job nicely for FC0, but might cause problems in a superscalar
CPU. In that case, the interleaving must be modified : the type-grouped
sequence above (all the loads, then all the adds, then all the stores)
should become something like an iteration-interleaved sequence
(load ; add ; store ; load ; add ; store ; ...)
because the execution units might not all be available at the same time...
This means that a prolog and epilog must be added.

And don't forget that it's just a short loop.

> > > If so, does it mean that binary tree or linked list handling
> > > will cause about 4 cycles big bubbles in the pipeline ? :-0
> > not exactly.
> > In reality, it will take even more : today's memory latencies
> > are huge because the core speed increases much faster than the
> > memory speed.
> yes it is true
> > I hope that you understand that it is unavoidable : if you think that
> > the number of bubbles is critical, then you force the core
> > to decrease its working speed and it become as slow as the memory
> no I only wanted to kill avoidable bubbles - those which result
> from register interdependency without much ILP in the algorithm.
> If it is possible of course :) Parking lots for instructions seem
> to limit these latencies to the shortest possible time.
> But maybe I don't understand the problem correctly ;)

i think we're on the same wavelength now.

> > but the latency does not increase as fast. pipelined memories are
> > a means to compensate, but you have to adapt your algos.
> the distributed tree is a nice idea ;) By the way for a splay tree
> you will often have what you want in some cache (but as you said
> even L1 is slow)...

at least in FC0.

When the data is properly prefetched, there is 1 alignment cycle
of latency from the L0 (the Load/Store Unit's buffer of 8*32 bytes).

it requires a few more cycles for fetching from L1 (256 bits
per cycle) and even more from L2 (might be limited to 64 or 128
bits per cycle).

> > > By the way anybody knows granularity of IA32,IA64 amd 21256
> > > pipeline ?
> > bad question :
> >  IA32 and IA64 are programming models, not "architectures".
> yes, sorry you are right.
argh, again :-/

> > each implementation has radically differing strategies :
> >  Merced and Itanium have different issue widths (6 vs 9, IIRC)
> [snip]
> but I was interested in information on circuit complexity,
> like the depth of ten transistors in fcpu ...

this number is not valid these days, because the gates tend to
increase the transistor count. I based my early approximations
on very old technologies where pass transistors were usual, and
now they are forbidden... ultra-low voltage and noise are becoming
very important issues today. But as a rule of thumb,
the 6 gates limit of the critical datapath remains as a central
guideline. Of course if there is a specific implementation issue,
this can be locally modified (it also depends on the technology
and the synthesiser and all the other options) but the general design
is still simple with this rule (despite all the bypass nightmares).
It ensures that we don't get caught in endless "what if we add this and
that and other stuff", where the CDP increases and the overall
performance decreases...

> > Concerning 21256... are you referring to Dec/Compaq/HP(?) Alpha 21264 ?
> uhh again it seems like if I have had some Vodka or so ;)
don't drink and code ;-)

i don't have this problem, because i'm only chocoholic ;-)

> devik