
Re: Tr:[f-cpu] usage of 64 registers & ILP

To: f-cpu@seul.org
Subject: Re: Tr:[f-cpu] usage of 64 registers & ILP
From: Yann Guidon <whygee@f-cpu.org>
Date: Wed, 03 Apr 2002 21:51:20 +0200
Delivered-To: archiver@seul.org
Delivered-To: f-cpu-outgoing@seul.org
Delivered-To: f-cpu@seul.org
Delivery-Date: Wed, 03 Apr 2002 14:45:22 -0500
Organization: http://www.f-cpu.org
References: <Pine.LNX.4.10.10204031930010.15162-100000@luxik.cdi.cz>
Reply-To: f-cpu@seul.org
Sender: owner-f-cpu@seul.org

hello,

Martin Devera wrote:
> > First during my developement days I never seen algorithm
> > (except unrolled loops) which can use 64 regs in one stack
> > frame range.
> >
> > >>> Most of the time, if you want to put n (pipeline depth) cycles
> > between the write of a register and it's read and if you use cmove trick
> > (to avoid jump), you will have great pressure on the register set, so
> > you will need so much register (there is no OOO in fcpu).
> 
> Ok so that is assumption below true ?
> - To keep pipeline full the program must exhibit ILP at
>   least 5 at each place.

not exactly.

We are currently working on a single-issue superpipelined core
where each operation (with a few exceptions) can be pipelined.
If most units have 2 cycles of latency (as is the case now), it's a
bit like working with a 3-issue superscalar CPU.

In FC0, the ILP depends on the kind of operations to perform.
Fortunately, most code is a mix of different operation types.

Currently, there are only integer arithmetic operations:
an addition requires 2 cycles and a multiply up to 8 cycles.
The average ILP needed is around 3 or 4, to be safe.

For FP, this is going to increase (2 or 3x ?) but let's concentrate on
what works. And thanks to the large number of registers and the
pipeline units, if the latency of a single operation does not fit inside
a simple loop, you can "software pipeline" the loop:
duplicate each instruction and rename each register of the copy
(for example by adding 32 to each register number).
The loop size increases but the stalls are filled with useful
operations. This is one very good reason for having a large register set.
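
A minimal sketch in C of that duplicate-and-rename idea. The kernel
(out[i] = a[i] * k + b[i]) and the function names are hypothetical, chosen
only for illustration; this is not F-CPU code, and a real compiler is free
to reorder the statements anyway. The second copy of each statement uses
its own temporaries (t1, s1), which plays the role of the renamed registers:

```c
#include <stddef.h>
#include <stdint.h>

/* Simple loop: the addition has to wait for the multi-cycle multiply. */
void scale_add_simple(uint64_t *out, const uint64_t *a, const uint64_t *b,
                      uint64_t k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t t = a[i] * k;   /* long-latency multiply */
        out[i] = t + b[i];       /* depends on t */
    }
}

/* Interleaved loop: two iterations per pass, with renamed temporaries.
   The second multiply is independent of the first, so it can be issued
   while the first one is still in flight. */
void scale_add_interleaved(uint64_t *out, const uint64_t *a, const uint64_t *b,
                           uint64_t k, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint64_t t0 = a[i] * k;       /* copy 0 */
        uint64_t t1 = a[i + 1] * k;   /* copy 1, renamed temporary */
        uint64_t s0 = t0 + b[i];
        uint64_t s1 = t1 + b[i + 1];
        out[i] = s0;
        out[i + 1] = s1;
    }
    if (i < n) {                      /* leftover element when n is odd */
        out[i] = a[i] * k + b[i];
    }
}
```

The loop body is twice as large and needs twice as many temporaries, which
is exactly where the large register set pays off.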

> If so, does it mean that binary tree or linked list handling
> will cause about 4 cycles big bubbles in the pipeline ? :-0
not exactly.
In reality, it will take even more cycles: today's memory latencies
are huge because core speed increases much faster than
memory speed.
However, you can handle up to 8 data pointers and 8 jump locations
at a time (leaving us with 48 "data" registers, but there is no
restriction on their allocation) and you can pipeline data accesses !
So instead of having a single linked list (where you have a miss on
most accesses), you can maybe manage several interleaved lists and
trees at a time, a bit like in the previous example of
loop unrolling/interleaving.
I hope that you understand that this is unavoidable: if you insist
on having no bubbles at all, then you force the core
to lower its working speed until it becomes as slow as the memory
(even the L1 cache must be considered as "slow").
However, today's memories can sustain pipelined (or "transactional")
requests, so you can issue several requests and get the results later.
The peak bandwidth remains proportional to the core's speed,
but the latency does not improve as fast. Pipelined memories are
a means to compensate, but you have to adapt your algorithms.
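
A rough sketch in C of what walking interleaved lists could look like. The
node layout and the function names are hypothetical, and the example
assumes two independent lists that can be processed together. The point is
only that the two pointer loads do not depend on each other, so both
memory requests can be in flight at the same time:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node layout, for illustration only. */
struct node {
    struct node *next;
    uint64_t value;
};

/* One list at a time: each load of ->next must complete before the
   next node can even be addressed. */
uint64_t sum_list(const struct node *p)
{
    uint64_t sum = 0;
    while (p) {
        sum += p->value;
        p = p->next;
    }
    return sum;
}

/* Two independent lists walked together: the loads for list 0 and
   list 1 are independent, so their latencies can overlap. */
void sum_two_lists(const struct node *p0, const struct node *p1,
                   uint64_t *sum0, uint64_t *sum1)
{
    uint64_t s0 = 0, s1 = 0;
    while (p0 && p1) {
        const struct node *n0 = p0->next;   /* request for list 0 */
        const struct node *n1 = p1->next;   /* request for list 1 */
        s0 += p0->value;
        s1 += p1->value;
        p0 = n0;
        p1 = n1;
    }
    s0 += sum_list(p0);   /* finish whichever list is longer */
    s1 += sum_list(p1);
    *sum0 = s0;
    *sum1 = s1;
}
```

This only helps when the algorithm really has several independent
traversals to do, which is the "adapt your algorithms" part.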

> I have had hard time optimizing QoS queue in linux kernel
> for gigabit flow eth the code is full of list and tree searches ...

argh... I don't know much about this subject (I guess Jürgen can
add his two cents ;-D) but interleaved trees are probably
possible...

> By the way anybody knows granularity of IA32,IA64 amd 21256
> pipeline ?
bad question :
 IA32 and IA64 are programming models, not implementations.
Each implementation has a radically different strategy:
 Merced and Itanium have different issue widths (6 vs 9, IIRC)
and the Pentium 3 can issue up to 3 instructions per cycle
with very specific coding rules, while Pentium 4's rules are
totally different and it decodes 1 instruction per cycle only
(but can execute up to 4 after that). You see, it's not that
simple (and even worse when we deal with Intel's idiosyncrasies).

Concerning 21256... are you referring to Dec/Compaq/HP(?) Alpha 21264 ?
if so, it can decode up to 4 instructions (128 bits) at a time (with specific
rules) and execute up to 6 operations (2 FP, 2 int and 2 memory, with
one branch or something like that). Check each CPU's doc.

> have a nice day,
you too, and good luck for your code !

> devik
WHYGEE