
Re: Tr:[f-cpu] usage of 64 registers & ILP

To: f-cpu@seul.org
Subject: Re: Tr:[f-cpu] usage of 64 registers & ILP
From: Martin Devera <devik@cdi.cz>
Date: Wed, 3 Apr 2002 23:40:04 +0200 (CEST)
Delivery-Date: Wed, 03 Apr 2002 16:40:12 -0500
In-Reply-To: <3CAB5D38.8784B2B4@f-cpu.org>
Reply-To: f-cpu@seul.org
Sender: owner-f-cpu@seul.org

> We are currently working on a single-issue superpipeline core
> where each operation (except a few exceptions) can be pipelined.
> If most units have 2 cycles of latency (for example now), it's a
> bit like working with a 3-issue superscalar CPU.
> 
> In FC0, the ILP depends on the kind of operations to perform.
> Fortunately, most code is a mix of different operation types.
> 
> Currently, there is only integer arithmetic operations,
> so an addition requires 2 cycles and a multiply up to 8 cycles.
> An average necessary ILP is around 3 or 4 for safety.

Oh yes, I have read all the docs and all the old mailing list posts about
f-cpu ;-) Sometimes I have had a hard time finding my way around some of
the terms (I have to learn Tomasu... - can't remember - and other
similar algorithms to keep track).

Maybe I misunderstand the FC0 scheduler - I thought that the decode part
of the pipeline can simply stall when there is a RAW/WAW hazard (the
scoreboard bit of a source register is set). So when you do
i1:add r1,r2,r3
i2:add r4,r1,r3

it would produce:

cycle:  1      2      3      4      5      6      7      8
i1:     fetch  decd   xbar   asu1   asu2   xbar   rwrt
i2:            fetch  ------------- stall -------------  decd

so there would be a latency of 6 cycles and you would have to find
enough ILP in the code. Or am I totally wrong?
I'd have expected the stall to occur later, say just after xbar at
cycle 3 ..
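
To make the arithmetic behind the diagram explicit, here is a minimal Python sketch of the stall-in-decode model described above. It is only a toy: the stage offsets are read off the diagram, and the rules (single issue, in order, no bypass, decode waits until the scoreboard bit is cleared after rwrt) are assumptions for illustration, not confirmed FC0 behaviour. The function and constant names are made up.

```python
# Offsets read off the diagram: decd at cycle 2, rwrt at cycle 7,
# so the result is written back 5 cycles after decode.
DECODE_TO_WRITEBACK = 5

def decode_cycles(program):
    """Return the decode cycle of each instruction.

    program: list of (dest, src1, src2) register names.
    Assumed model: single issue, in order, no bypass; decode waits
    until every register it touches (sources for RAW, destination
    for WAW) has been written back.
    """
    busy_until = {}      # register -> first cycle it is free again
    cycles = []
    earliest = 2         # first instruction: fetch in cycle 1, decode in 2
    for dest, *srcs in program:
        d = max([earliest] + [busy_until.get(r, 0) for r in (dest, *srcs)])
        busy_until[dest] = d + DECODE_TO_WRITEBACK + 1
        cycles.append(d)
        earliest = d + 1  # in order: the next decode is at least one cycle later
    return cycles

# i1: add r1,r2,r3   i2: add r4,r1,r3
pair = [("r1", "r2", "r3"), ("r4", "r1", "r3")]
print(decode_cycles(pair))  # [2, 8], as in the diagram
```

With an independent second instruction the same model gives decode cycles 2 and 3, which is the case where enough ILP hides the latency.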

> pipeline units, if the latency of a single operation does not fit inside
> a simple loop, you can "software pipeline" the loop :
>  duplicate each instruction and rename each register of the copy
> (something like adding 32 to each register number).
> The loop size increases but the stalls are filled with useful
> operations. This is one very good reason for having a large register set.

like: (assume that r1 == 8)
loadi 8,r2,r10
add r9,r5,r20
storei 8,r3,r19
loadi 8,r2,r11
add r10,r5,r21
storei 8,r3,r20
..... ?
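
For comparison, the quoted description ("duplicate each instruction and rename each register of the copy", adding 32 to each register number) can be sketched literally in Python. The helper names, the interleaving of original and copy, and the optional `keep` set for registers that should not be renamed are assumptions made for illustration; the instruction text is treated as plain strings, not as real F-CPU assembly.

```python
import re

def rename_copy(body, offset=32, keep=()):
    """Return a copy of the loop body with every rN renamed to r(N+offset).

    Registers listed in `keep` (hypothetical parameter) stay unchanged.
    """
    def shift(match):
        n = int(match.group(1))
        return match.group(0) if n in keep else f"r{n + offset}"
    return [re.sub(r"\br(\d+)\b", shift, ins) for ins in body]

def duplicate_loop(body, offset=32, keep=()):
    """Interleave each instruction with its renamed copy."""
    out = []
    for original, copy in zip(body, rename_copy(body, offset, keep)):
        out += [original, copy]
    return out

body = ["loadi 8,r2,r10", "add r10,r5,r20"]
for line in duplicate_loop(body):
    print(line)
```

The copy is independent of the original because it touches a disjoint set of registers, so it can fill the stall slots between dependent instructions.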

Would it be very complex to add a special 5-bit register and add
its value to register numbers >32 in the decode stage? Like:

-- initialize prolog manually --
l1:	loadi 8,r2,r32
	add r33,r5,r34
	storei 8,r3,r35
	loop.c r3,r4 ; r3==l1 and r4 is loop kernel counter
-- unrolled loop epilog here ---

where loop.c would simply increment the number that is added to the
register number, wrapping on overflow (no saturation)?
With a simple circuit it could also be used to create a function
call prolog/epilog by testing the added number for overflow and
calling a spilling handler ...
The added register number could be computed in parallel during
the decode stage and would affect registers >32 only.
The result is support for sw pipelining without the need to
unroll the loop - thus less pressure on instruction fetch.
Does it make sense?
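
A small Python sketch of the register mapping proposed here, to show what the decode stage would compute. Everything in it is hypothetical: the class and method names are invented, and it assumes the rotated window is r32..r63 (the example uses r32, so 32 is treated as part of the window) with a 5-bit offset that wraps around.

```python
class RotatingDecoder:
    """Toy model of a decode stage with a 5-bit rotating register offset."""

    def __init__(self):
        self.base = 0  # the proposed special 5-bit register

    def map_reg(self, n):
        """Map an architectural register number to a physical one."""
        if n < 32:
            return n                              # lower half is never rotated
        return 32 + ((n - 32 + self.base) & 31)   # wrap inside r32..r63

    def loop_c(self):
        """Model the offset update done by loop.c.

        Returns True when the offset wraps to 0, the point where a
        spill handler could be called.
        """
        self.base = (self.base + 1) & 31
        return self.base == 0

dec = RotatingDecoder()
print([dec.map_reg(r) for r in (5, 32, 33, 63)])  # [5, 32, 33, 63]
dec.loop_c()
print([dec.map_reg(r) for r in (5, 32, 33, 63)])  # [5, 33, 34, 32]
```

After one loop.c, the value written through r32 in the previous iteration is visible under a different register name, which is what lets the same loop body act on a different pipeline stage's data each iteration.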
 
> > If so, does it mean that binary tree or linked list handling
> > will cause about 4 cycles big bubbles in the pipeline ? :-0
> not exactly.
> In reality, it will take even more : today's memory latencies
> are huge because the core speed increases much faster than the
> memory speed.

yes it is true

> I hope that you understand that it is unavoidable : if you think that
> the number of bubbles is critical, then you force the core
> to decrease its working speed and it become as slow as the memory

No, I only wanted to kill the avoidable bubbles - those which result
from register interdependency without much ILP in the algorithm.
If that is possible of course :) Parking lots for instructions seem
to limit these latencies to the shortest possible time.
But maybe I don't understand the problem correctly ;)

> but the latency does not increase as fast. pipelined memories is
> a means to compensate, but you have to adapt your algos.

The distributed tree is a nice idea ;) By the way, for a splay tree
you will often have what you want in some cache (but as you said
even L1 is slow)...
 
> > By the way anybody knows granularity of IA32,IA64 amd 21256
> > pipeline ?
> bad question :
>  IA32 and IA64 are programming models, not "architectures".

yes, sorry you are right.

> each implementation has radically differing strategies :
>  Merced and Itanium have different issue widths (6 vs 9, IIRC)
[snip]

but I was interested in information on circuit complexity,
like the depth of ten transistors in fcpu ...

> Concerning 21256... are you referring to Dec/Compaq/HP(?) Alpha 21264 ?

uhh, again it seems as if I have had some Vodka or so ;)

devik

