Re: Tr:[f-cpu] usage of 64 registers & ILP
> We are currently working on a single-issue superpipeline core
> where each operation (except a few exceptions) can be pipelined.
> If most units have 2 cycles of latency (for example now), it's a
> bit like working with a 3-issue superscalar CPU.
>
> In FC0, the ILP depends on the kind of operations to perform.
> Fortunately, most code is a mix of different operation types.
>
> Currently, there are only integer arithmetic operations,
> so an addition requires 2 cycles and a multiply up to 8 cycles.
> On average, an ILP of around 3 or 4 is needed to be safe.
Oh yes, I have read all the docs and all the old mailing list archives
about f-cpu ;-) Sometimes I have a hard time getting oriented in some of
the terms (I have to learn Tomasu... - can't remember - and other
similar algorithms to keep up).
Maybe I misunderstand the FC0 scheduler - I thought that the decode stage
of the pipeline simply stalls when there is a RAW/WAW hazard, i.e. when
the scoreboard bit of a source register is set. So when you do
i1:add r1,r2,r3
i2:add r4,r1,r3
it would produce:
cycle:  1      2      3      4      5      6      7      8
i1:     fetch  decd   xbar   asu1   asu2   xbar   rwrt
i2:            fetch  ------------- stall -------------  decd
so the latency would be 6 cycles and you would have to find enough
ILP in the code to cover it. Or am I totally wrong?
I'd expect the stall to occur later, say just after xbar at
cycle 3 ..
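
To make the stall arithmetic above concrete, here is a minimal sketch of the model I have in mind. Only the stage names come from the diagram; the function name decode_cycles, the single-issue in-order behaviour, the absence of a bypass and the rule that decode waits for write-back are all assumptions of mine, not the real FC0 scheduler.

```python
# Stage names come from the diagram above; the rest is an assumed model.
STAGES = ["fetch", "decd", "xbar", "asu1", "asu2", "xbar", "rwrt"]
# Cycles between an instruction's decode stage and its register write-back.
DECODE_TO_WRITEBACK = len(STAGES) - 1 - STAGES.index("decd")


def decode_cycles(program):
    """Return the cycle in which each instruction is decoded.

    program is a list of (sources, dest) pairs of register names.
    Assumptions: single issue, in order, no bypass, and decode waits
    until no pending write targets a source (RAW) or the dest (WAW).
    """
    pending = {}          # register -> cycle in which it is written back
    result = []
    previous = 1          # the first instruction is fetched in cycle 1
    for sources, dest in program:
        cycle = previous + 1              # in-order: at best one cycle later
        for reg in (*sources, dest):
            cycle = max(cycle, pending.get(reg, 0) + 1)
        pending[dest] = cycle + DECODE_TO_WRITEBACK
        result.append(cycle)
        previous = cycle
    return result


# i1: add r1,r2,r3 ; i2: add r4,r1,r3 (first operand read as the destination)
print(decode_cycles([(("r2", "r3"), "r1"), (("r1", "r3"), "r4")]))  # prints [2, 8]
```

If the destination is the last operand instead, the same pair becomes a WAW hazard on r3 and this model stalls i2 until cycle 8 as well. Two independent adds decode in cycles 2 and 3.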
> pipeline units, if the latency of a single operation does not fit inside
> a simple loop, you can "software pipeline" the loop :
> duplicate each instruction and rename each register of the copy
> (something like adding 32 to each register number).
> The loop size increases but the stalls are filled with useful
> operations. This is one very good reason for having a large register set.
like: (assume that r1 == 8)
loadi 8,r2,r10
add r9,r5,r20
storei 8,r3,r19
loadi 8,r2,r11
add r10,r5,r21
storei 8,r3,r20
..... ?
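
The "duplicate each instruction and rename each register of the copy" recipe quoted above can be sketched mechanically. This is only an illustration: the helper names rename_registers and duplicate_body, the keep set for registers that must stay shared (pointers, loop invariants) and the three-line loop body are my assumptions, and the mnemonics are treated as opaque text.

```python
import re


def rename_registers(instruction, offset=32, keep=frozenset(), num_regs=64):
    """Add offset to every register number in one instruction.

    Registers listed in keep are left alone (assumed to be shared
    pointers or loop invariants). Numbers wrap modulo num_regs.
    """
    def bump(match):
        name = match.group(0)
        if name in keep:
            return name
        return f"r{(int(match.group(1)) + offset) % num_regs}"

    return re.sub(r"\br(\d+)\b", bump, instruction)


def duplicate_body(body, offset=32, keep=frozenset()):
    """Interleave each instruction with a renamed copy of itself."""
    result = []
    for instruction in body:
        result.append(instruction)
        result.append(rename_registers(instruction, offset, keep))
    return result


# Hypothetical loop body in the style of the example above.
body = ["loadi 8,r2,r10", "add r10,r5,r20", "storei 8,r3,r20"]
for line in duplicate_body(body, keep={"r2", "r3", "r5"}):
    print(line)
```

The copy works on r42 and r52 while the original works on r10 and r20, so each instruction is followed by an independent one that can fill the stall. The loop gets twice as long, which is the cost mentioned in the quoted text.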
Would it be very complex to add a special 5-bit register and add
its value to register numbers >32 in the decode stage? Like:
-- initialize prolog manually --
l1: loadi 8,r2,r32
add r33,r5,r34
storei 8,r3,r35
loop.c r3,r4 ; r3==l1 and r4 is loop kernel counter
-- unrolled loop epilog here ---
where loop.c would simply increment that 5-bit add value,
wrapping on overflow (no saturation)?
With a simple circuit it could also be used to create a function
call prolog/epilog by testing the add value for overflow and
calling a spilling handler ...
The added register number could be computed in parallel during
the decode stage and would affect registers >32 only.
The result is support for sw pipelining without the need to
unroll the loop - thus less pressure on instruction fetch.
Does it make sense?
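
A minimal sketch of the remapping I am proposing, to show that it is only a small adder in the decode stage. The names physical_register and loop_c, the choice that r32..r63 rotate (the example uses r32, so I treat ">32" as "32 and up") and the step parameter are assumptions for illustration, not an existing F-CPU feature.

```python
REG_COUNT = 64
WINDOW_START = 32                        # assumed: r32..r63 rotate, r0..r31 are fixed
WINDOW_SIZE = REG_COUNT - WINDOW_START   # 32 registers, so a 5-bit base wraps naturally


def physical_register(reg, base):
    """Map an architectural register number to a physical one at decode."""
    if reg < WINDOW_START:
        return reg                       # low registers are never remapped
    return WINDOW_START + ((reg - WINDOW_START + base) % WINDOW_SIZE)


def loop_c(base, step=1):
    """Side effect of the proposed loop.c: bump the 5-bit base, no saturation."""
    return (base + step) & 0x1F


base = 0
for iteration in range(3):
    print(iteration, [physical_register(r, base) for r in (5, 32, 33, 63)])
    base = loop_c(base)
```

One thing this model shows: with an incrementing base, a value written to r32 in one iteration is seen as r63 in the next. For it to show up as r33, as in my loop above, the base would have to step the other way (step=-1 here).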
> > If so, does it mean that binary tree or linked list handling
> > will cause bubbles of about 4 cycles in the pipeline ? :-0
> not exactly.
> In reality, it will take even more : today's memory latencies
> are huge because the core speed increases much faster than the
> memory speed.
yes it is true
> I hope that you understand that it is unavoidable : if you think that
> the number of bubbles is critical, then you force the core
> to decrease its working speed and it becomes as slow as the memory
No, I only wanted to kill the avoidable bubbles - those which result
from register interdependency when the algorithm has little ILP.
If it is possible, of course :) Parking lots for instructions seem
to limit these latencies to the shortest possible time.
But maybe I don't understand the problem correctly ;)
> but the latency does not increase as fast. pipelined memories are
> a means to compensate, but you have to adapt your algos.
The distributed tree is a nice idea ;) By the way, for a splay tree
you will often have what you want in some cache (but as you said,
even L1 is slow)...
> > By the way, does anybody know the granularity of the IA32, IA64 and 21256
> > pipelines ?
> bad question :
> IA32 and IA64 are programming models, not "architectures".
Yes, sorry, you are right.
> each implementation has radically differing strategies :
> Merced and Itanium have different issue widths (6 vs 9, IIRC)
[snip]
but I was interested in information on circuit complexity,
like the depth of ten transistors in f-cpu ...
> Concerning 21256... are you referring to Dec/Compaq/HP(?) Alpha 21264 ?
uhh, again it seems as if I have had some Vodka or so ;)
devik