
Re: [f-cpu] Taking decision on the project



> It lacks multi-CPU support; that's a big gap for a 2004 CPU! (ll/sc are
> for single-CPU systems!)

I just read the IA64 system manual; it discusses many MP issues,
along with examples ... Maybe it could help.

> Whygee proposes to detect 2 aliases, maybe 3. Besides the fact that it
> will be a huge piece of silicon, I believe that's not acceptable for
> the compiler. Finding all memory aliases in C code would be such a
> mess! Our recent GCC expert could say more about that.

hmm, if that is about me, I'm no gcc expert ;-) But I have to say that
gcc marks some registers as containing non-aliased data (if it knows),
but for the majority of pointers gcc is "not sure".
There is aliasing data attached to each memory reference, which
contains the type and parent (structure) of the datum. Unfortunately,
when I tested with some code snippets, many pointers simply had the
[0 S8 A64] alias set, so gcc can't distinguish them.
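
A minimal C sketch of why the compiler ends up "not sure". The function
names are hypothetical and this is not gcc internals, only the
source-level situation: two pointers of the same type may refer to the
same object unless the programmer promises otherwise (here with C99
`restrict`).

```c
#include <assert.h>

/* Both pointers refer to int, so the compiler must assume that
   *dst and *src may be the same object and re-read *src after
   every store through dst. */
int add_twice(int *dst, const int *src)
{
    *dst += *src;
    *dst += *src;   /* *src may have been changed by the line above */
    return *dst;
}

/* With restrict the caller promises that the two objects do not
   overlap, so *src may be loaded once and kept in a register.
   Calling this with dst == src is undefined behaviour. */
int add_twice_restrict(int *restrict dst, const int *restrict src)
{
    assert(dst != src);   /* debug check of the caller's promise */
    *dst += *src;
    *dst += *src;
    return *dst;
}
```

With distinct objects both versions give the same result; when the two
pointers alias, only the first version is defined, and it has to do the
extra load to be correct.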

> My proposal was to use the stream hints. That's what they are supposed
> to do: split memory streams, to avoid checking read-after-write memory
> hazards. If the compiler doesn't use them cleanly: shame on it :) That
> introduces 7 cache lines instead of 64; that's fewer but much easier to
> handle.

hummm, what problem do these things address? Memory RAW? In a
UP environment these should not occur AFAIK, because all caches
are controlled by a single CPU and cache lines should not be aliased.
For MP, why not use acquire/release semantics, or at least a memory
fence?
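
To make the acquire/release idea concrete, here is a rough sketch in
C11 terms. The names `payload`, `ready`, `publish` and `try_consume`
are hypothetical, and nothing here says how F-CPU would encode such
semantics; it only shows the ordering guarantee being asked for.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static int payload;                 /* ordinary, non-atomic data      */
static atomic_bool ready = false;   /* flag that publishes the data   */

/* Writer: the release store guarantees that the write to payload is
   visible before any other CPU can observe ready == true. */
void publish(int value)
{
    /* single-shot mailbox in this sketch: publishing twice is a bug */
    assert(!atomic_load_explicit(&ready, memory_order_relaxed));
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: the acquire load guarantees that, once ready is seen as
   true, the following read of payload sees the published value. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

No per-access alias check is needed for this: ordering is enforced only
at the two marked accesses.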

> But it doesn't work any more for the floating point unit. NaN and
> infinity must be implemented. Those exceptions come at the end of the
> unit's pipeline. We could not hide latency any more. Each instruction
> must wait for the complete end of the previous one. It will be so slow!

couldn't the FPU store these flags in the output register (or in some
flag attached to it) and trap only when the value is about to be used?
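
A small software model of that idea, as a sketch only: each FP register
carries a hypothetical `poisoned` flag, operations propagate it without
stalling, and the "trap" (modelled here as a false return value) happens
at the point of use. The type and function names are made up for
illustration.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical FP register: the value plus one attached flag. */
typedef struct {
    double value;
    bool   poisoned;   /* set when the result was NaN or infinite */
} fpreg;

/* The operation never traps; it only records the exceptional result
   and propagates poison from its inputs. */
fpreg fp_mul(fpreg a, fpreg b)
{
    fpreg r;
    r.value    = a.value * b.value;
    r.poisoned = a.poisoned || b.poisoned || !isfinite(r.value);
    return r;
}

/* Point of use (store, compare, convert ...): this is where the
   deferred exception would be raised. Returns false for "trap". */
bool fp_use(fpreg r, double *out)
{
    assert(out != NULL);
    if (r.poisoned)
        return false;
    *out = r.value;
    return true;
}
```

In this model a dependent instruction can issue without waiting for the
exception status of the previous one; only the final consumer checks.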

> The reorder buffers of today's CPUs are not a stupid bad dream of some
> engineer who smokes too much! They permit register bypass,

is there a ROB in IA64!? From the slides I remember there wasn't one.
They claimed that stop flags delimit insn groups which can be
executed and made visible immediately ...

>  [2r1w->3r2w]
> Besides that, there are "some choices" that I dislike in FC0. For
> example, the trick to be 3r2w by using 2r1w instructions. We use 3r1w
> register ports but 90% of our instruction set is 2r1w (register access
> with 3r2w is at least 30% slower than with 2r1w, at least! That's the
> difference between a 2 GHz and a 1.4 GHz CPU). And the compiler risks
> scheduling only 32 registers.

The only "problem" seems to be that "3r". I played with my little emulator
and most of the time (>50%) both write ports were used. So it is one read
port that is unused most of the time. There are some insns which need 3r
and are useful, like MUX or postincremented store (however, I can imagine
the latter in imm form only; then it is 2r).
If MUX = (A andn M) or (B and M), it can be done with 3 insns with a
one-cycle stall (but the compiler can use that cycle). If the whole
program is made of 10% muxes, it will slow down by 20%. But if the CPU
is 20% faster with 2r2w, the timing is the same.
The next candidate is cstore: it reads condition, address and data. Well,
someone wanted it to be dropped (MR?). On the other hand (I know it is
crazy), if only z/nz were implemented then there would be no need for
the third read port.
Any other 3r insns waiting for comments ;-)) ?
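
The MUX decomposition and the 10%/20% estimate can be checked with a few
lines of C. This is only a sketch: `mux` and `expanded_count_percent`
are hypothetical helper names, and the estimate assumes each 3r MUX
simply becomes three issue slots with nothing else scheduled into them.

```c
#include <assert.h>
#include <stdint.h>

/* MUX = (A andn M) or (B and M), written as three 2r1w operations. */
uint64_t mux(uint64_t a, uint64_t b, uint64_t m)
{
    uint64_t t0 = a & ~m;   /* andn: bits of A where M is 0            */
    uint64_t t1 = b & m;    /* and:  bits of B where M is 1            */
    return t0 | t1;         /* or:   needs both temporaries, hence the
                               stall unless other work fills the slot  */
}

/* Instruction count after expansion, relative to 100 original insns,
   when mux_percent of them grow from 1 insn to 3 insns. */
unsigned expanded_count_percent(unsigned mux_percent)
{
    assert(mux_percent <= 100);
    return 100 + 2 * mux_percent;
}
```

For 10% muxes this gives 120 insns per original 100, i.e. the 20%
slowdown, which a 20% faster clock (120 / 1.2 = 100) would cancel out.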

> (That's why I propose my "little VLIW": 4r2w but with 2 instructions
> issued per cycle (or we could also use 2 separate register banks (and
> double the register count :) to get the 2r1w speed per instruction
> slot, so at least 1.3*2=2.6 times the original speed). I don't think
> that introduces so much change to the manual.)

it will introduce many stalls for common sequences like:
add
xor
; here both add & xor write their results
and what about 2w insns? Some are very useful: add with carry,
multiply with hi part. Well, both problems above could be solved
by delay FFs and a more clever scheduler: it could postpone
the second write of an insn to the next cycle. If the compiler is
aware of it, it will schedule like:
mulh ; 6c latency
insn2
insn3
insn4
mul
add1  ; mul wrt1
add2  ; mul wrt2
add3  ; add1 wrt
But for generic code there will probably still be many stalls.

>  [.call mess]
> The last point I dislike is the hundred clock cycles needed to make a
> "typical" function call. The example given by Cedric was scary! I

eh !? Speaking about register saves? Hmm. I was hoping that the
SRB would really be used to implement mstore/mload. It would
be nice to have a prolog with one mstore, with the storing done in
the background. mstore could be limited by the OS to store up to
N registers at once, so as not to kill interrupt latency. N can be
different for desktop, server, RTOS ...
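
A rough software model of the "up to N registers at once" limit, as a
sketch under loud assumptions: 64 registers, an mstore that saves a
contiguous register range, and an interrupt window after every burst.
None of this is a defined F-CPU instruction format; `mstore_model` and
its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_REGS = 64 };   /* assumed register count */

/* Copies regs[first .. first+count-1] to mem in bursts of at most
   max_burst registers. Returns the number of bursts, i.e. how many
   times a pending interrupt could have been taken along the way. */
unsigned mstore_model(const uint64_t *regs, unsigned first, unsigned count,
                      unsigned max_burst, uint64_t *mem)
{
    unsigned bursts = 0;
    unsigned done = 0;

    assert(max_burst > 0);
    assert(first <= NUM_REGS && count <= NUM_REGS - first);

    while (done < count) {
        unsigned n = count - done;
        if (n > max_burst)
            n = max_burst;            /* the OS-imposed limit N */
        for (unsigned i = 0; i < n; i++)
            mem[done + i] = regs[first + done + i];
        done += n;
        bursts++;                     /* interrupt window here  */
    }
    return bursts;
}
```

Saving 10 registers with N = 4 takes three bursts (4 + 4 + 2), so the
worst-case interrupt latency is bounded by one burst of 4 stores rather
than by the whole save; an RTOS would pick a small N, a server a larger
one.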

> I believe more and more that using register windows should
> be considered. It's the compiler's work to do the right job! (maybe 24
> rotating registers + 40 fixed ones, otherwise the window will be too
> small?). SPARC always reserves one window for interrupt handling, so 8
> fresh registers are always ready for it (no stack manipulation for
> simple handlers, no complex SRB, no use of SR).

hmm, YG convinced me that spilling rotating registers is a hard
task and that the rotation itself adds at least one pipeline stage.
Link-time register allocation would do better in any case, and
with mstore ... it could be even faster than rotating sets.
But! SRM will probably need to be revised: for example, why
save ALL registers on an interrupt? I'd imagine saving
only a small number of regs (4, for example) on interrupt, and if
more are needed, the first insn in the trap handler would be an
mstore to make more room (IF needed).

devik
