
Re: [f-cpu] Taking decision on the project

To: f-cpu@seul.org
Subject: Re: [f-cpu] Taking decision on the project
From: nico <nicolas.boulay@ifrance.com>
Date: Thu, 12 Dec 2002 21:24:30 +0100
Delivered-to: archiver@seul.org
Delivered-to: f-cpu-outgoing@seul.org
Delivered-to: f-cpu@seul.org
Delivery-date: Wed, 11 Dec 2002 15:46:37 -0500
In-reply-to: <Pine.LNX.4.33.0212111013480.518-100000@devix>
References: <20021211202711.066d1b50.nicolas.boulay@ifrance.com><Pine.LNX.4.33.0212111013480.518-100000@devix>
Reply-to: f-cpu@seul.org
Sender: owner-f-cpu@seul.org

On Wed, 11 Dec 2002 11:23:21 +0100 (CET)
devik <devik@cdi.cz> wrote:

> > It lacks multi-CPU support; that's a big gap for a 2004 CPU! (ll/sc
> > are for single-CPU systems!)
> 
> I just read IA64 system manual, there are many MP issues discussed
> along with examples ... Maybe it could help ..
> 
> > Whygee proposes to detect 2 aliases, maybe 3. Besides the fact that
> > it will be a huge piece of silicon, I believe that's not acceptable
> > for the compiler. Finding all memory aliases in C code will be such
> > a mess! Our recent GCC expert could say more about that.
> 
> hmm if it is about me, I'm not a gcc expert ;-) But I have to say that
> gcc marks some registers as containing non-aliased data (if it knows),
> but for the majority of pointers gcc is "not sure".
> There is aliasing data attached to each memory reference, which
> contains the type and parent (structure) of the datum. Unfortunately,
> when I tested with some code snippets, many pointers simply had the
> [0 S8 A64] alias set, so they can't be distinguished by gcc.
> 
> > My proposal was to use the stream hints. That's what they are
> > supposed to do: split memory streams, to avoid checking
> > read-after-write memory hazards. If the compiler doesn't use them
> > cleanly: shame on it :) That introduces 7 cache lines instead of 64;
> > that's fewer but much easier to handle.
> 
> hummm, what problem do these things address? Memory RAW? In a
> UP environment these should not occur AFAIK, because all caches
> are controlled by a single CPU and cache lines should not be aliased.
> For MP, why not use acquire/release semantics or at least a memory
> fence?
> 
> > But it doesn't work any more for the floating-point unit. NaN and
> > infinity must be implemented. Those exceptions come at the end of the
> > unit's pipeline. We could not hide latency any more. Each instruction
> > must wait for the previous one to complete. It will be so slow!
> 
> couldn't the FPU store these flags into the output register (or some
> flag attached to it) and trap when the value is about to be used?
> 
> > The reorder buffers of today's CPUs are not the stupid bad dream of
> > some engineer who smokes too much! They permit register bypass,
> 
> is there a ROB in IA64!? From the slides I remember there was not one.
> They claimed that stop flags delimit insn groups which can be
> executed and made visible immediately ...
> 

There is no ROB in IA64, but there is a mechanism to find where a trap
came from, using specific instructions to locate where things went wrong.

> >  [2r1w->3r2w]
> > Besides that there are "some choices" that I dislike in FC0. For
> > example, the trick of being 3r2w while using 2r1w instructions. We use
> > 3r1w register ports but 90% of our instruction set is 2r1w (register
> > access with 3r2w is at least 30% slower than with 2r1w, at least!
> > That's the difference between a 2 GHz and a 1.4 GHz CPU). And the
> > compiler risks scheduling only 32 registers.
> 
> The only "problem" seems to be that "3r". I played with my little
> emulator and most of the time (>50%) both 2w were used. So that one

I believe it was more :)

> read port is unused most of the time. There are some insns which need
> 3r and are useful, like MUX or postincremented store (however I can
> imagine it in imm form only - then it is 2r).
> If MUX = (A andn M) or (B and M), it can be done with 3 insns with a
> one-cycle stall (but the compiler can use the cycle). If 10% of the
> whole program is muxes, it will slow it down by 20%. But if the CPU
> is 20% faster with 2r2w, the timing is the same.
> The next candidate is cstore - it reads condition, address and data.
> Well, someone wanted it to be dropped (MR?). On the other hand (I know
> it is crazy) if only z/nz were implemented then there is no need for
> the third read port.
> Any other 3r insns waiting for comments ;-)) ?
> 
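
The MUX decomposition and the 10% / 20% figures quoted above can be
checked with a small sketch. It is Python rather than F-CPU assembly;
the 64-bit width and the function names are assumptions made for
illustration, and the instruction-count model ignores the stall that the
compiler may or may not manage to fill.

```python
MASK64 = (1 << 64) - 1  # assumed register width for this sketch


def mux(a, b, m):
    """Bitwise select: bits of b where m is 1, bits of a where m is 0."""
    t1 = a & ~m & MASK64   # insn 1: A andn M (2 reads, 1 write)
    t2 = b & m             # insn 2: B and M  (independent of insn 1)
    return t1 | t2         # insn 3: or       (waits for both results)


def relative_insn_count(mux_fraction, insns_per_mux=3):
    """Instruction count after replacing every 3r MUX by 2r insns,
    relative to the original program (1.0 means unchanged)."""
    return (1 - mux_fraction) + mux_fraction * insns_per_mux


count = relative_insn_count(0.10)   # 0.9 + 0.1 * 3 = 1.2, i.e. 20% more insns
rel_time = count / 1.20             # clock assumed 20% faster with 2r2w
```

With 10% muxes the program grows to 1.2 times its original instruction
count, which a 20% faster clock brings back to the same run time.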

There are far more problems with 2r instead of 1r than between 2r and
3r. Imagine what such a memory array looks like!

> > (That's why I propose my "little VLIW": 4r2w but with 2 instructions
> > issued per cycle (or we could also use 2 separate register banks (and
> > double the number of registers :) to get the 2r1w speed per
> > instruction slot, so at least 1.3*2=2.6 times the original speed).
> > I don't think that introduces so much change to the manual.)
> 
> it will introduce many stalls for common sequences like:
> add
> xor
> ; here both add & xor write the result

??? They write to the same register?

> and what about 2w insns ? Some are very useful, add with carry,
> multiply with hi part. Well both problems above could be solved

Stop ! :)

A 64-bit decoder decodes either one 64-bit instruction or two 32-bit
instructions. So 2 typical 2r1w instructions could be handled at the
same time (in different register sets?). All the current 3r2w, 3r1w and
2r2w instructions are mapped into the 4r2w 64-bit instruction space,
which accesses both register sets. Such a 64-bit instruction could encode
a complex operation that replaces 3 or 4 32-bit instructions, or at
least reduces the RAW pipeline dependencies. That's how an OoO engine
works:
a <- b + a
c <- d + a

becomes
a <- b + a 
c <- d + b + a

For example.
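
A minimal sketch of that rewrite, in Python for illustration only (the
function names are hypothetical, and the values are plain integers
rather than 64-bit registers). Both forms leave the same values in a and
c, but in the second form c is computed from the old a, so it no longer
has to wait for the result of the first addition.

```python
def sequential(a, b, d):
    """Original pair: the second add has a RAW dependency on the first."""
    a = b + a          # a <- b + a
    c = d + a          # c <- d + a   (needs the NEW a)
    return a, c


def fused(a, b, d):
    """Rewritten pair: the second add takes three inputs and the OLD a."""
    a_new = b + a      # a <- b + a   (unchanged)
    c = d + b + a      # c <- d + b + a   (no wait on a_new)
    return a_new, c
```

The price is a 3-input addition in the second slot, which is the kind of
operation the wider 4r2w instruction format is meant to make room for.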

> by delay FFs and a more clever scheduler - it could postpone
> the second write of an insn to the next cycle. If the compiler is
> aware of it, it will schedule like:
> mulh ; 6c latency
> insn2
> insn3
> insn4
> mul
> add1  ; mul wrt1
> add2  ; mul wrt2
> add3  ; add1 wrt
> But for generic code there will probably still be many stalls.
> 

Are you trying to use 3r2w instructions on a 2r1w instruction set?

> >  [.call mess]
> > The last point I dislike is the hundred clock cycles needed to make a
> > "typical" function call. The example given by Cedric was scary! I
> 
> eh!? Are you talking about register saves? Hmm. I was hoping that
> the SRB would really be used to implement mstore/mload. It would
> be nice to have a prolog with one mstore and the storing done in the
> background. mstore could be limited by the OS to store up to
> N registers at once so as not to kill interrupt latency. N can be
> different for desktop, server, RTOS ...

I am talking about lost clock cycles, not the number of instructions.
Even with the SRB, you lose clock cycles. With register windows you lose
them only on a trap!

> 
> > I believe more and more that using register windows should
> > be considered. It's the compiler's work to do the right job! (Maybe
> > 24 rotating registers + 40 fixed ones, otherwise the window will be
> > too small?) SPARC always reserves one window for interrupt handling,
> > so 8 fresh registers are always ready for it (no stack manipulation
> > for simple handlers, no complex SRB, no use of SR).
> 
> hmm YG convinced me that spilling of rotating registers is a hard
> task and rotating itself adds at least one pipeline stage.

That is true. You add a 3-bit adder to the pipeline.

> Link time register allocation would do better in any case and
> with mstore ... it could be even faster than rotating sets.

mstore... Which version? Never forget that mstore uses memory: more
than 150 clock cycles in the worst case!

An mstore for 4 registers could be great. There is an easy way to do it,
but there are a lot of alignment constraints. We use a register bank
that is 4 times wider, 256->1024 bits, so 4 registers are accessed at
the same time and you can fill a cache line in 1 clock cycle! But it
must be aligned to the cache line, and registers are packed by 4!! There
are some issues with SIMD/scalar handling, resolved with muxes.
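
A rough sketch of those two alignment constraints, assuming 256-bit
registers and a 1024-bit (128-byte) cache line. These sizes are only one
reading of the figures above, and the function name is hypothetical.

```python
REG_BITS = 256                        # assumed register width
GROUP = 4                             # registers read from the bank at once
LINE_BYTES = GROUP * REG_BITS // 8    # 1024 bits = 128 bytes (assumed line size)


def can_mstore4(first_reg, addr):
    """True if registers first_reg..first_reg+3 can be stored at addr
    in a single bank access under the assumed constraints."""
    reg_aligned = first_reg % GROUP == 0     # registers are packed by 4
    addr_aligned = addr % LINE_BYTES == 0    # store must start on a cache line
    return reg_aligned and addr_aligned
```

For instance, storing r4..r7 to a line-aligned address passes, while
starting at r5 or at an address in the middle of a line does not.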

nicO
> But! SRM will probably need to be revised - for example why
> save ALL registers on interrupt? I'd imagine saving
> only a small amount of regs (4 f.ex.) per interrupt and if more
> are needed then the first insn in the trap handler would be an
> mstore to make more room (IF needed).
> 
> devik
> 
> *************************************************************
> To unsubscribe, send an e-mail to majordomo@seul.org with
> unsubscribe f-cpu       in the body. http://f-cpu.seul.org/
