
Re: [f-cpu] F-CPU vs. Itanium



Hello Yann,

> sorry for the delay... i have glibc problems ;-)

;-] recently I started to use uClibc (for our embedded apps)
it is simple, complete and only a 250kb .so (with a 30k dynamic
loader)

> > in docs there is stated that f-cpu was at first meant to
> > be itanium killer.
> hell, that was loooong loooooong ago ;-)
> before we can whip Intel's ass, we have to make a first proof-of
> concept for a lot of things, build a name, design a complete
> working workflow and user base... FC0 is not yet the ia64 basher

Yes, it is the best way - I know something about hw but not as
much as you and the other guys here. But from my software engineering
experience the biggest problem of many projects is that they are
too large from the beginning and so they are never finished ..
However, on the other side there are issues which should be addressed
in early development ...

> people expected, but it's a nice core anyway :-)

Yes, I like it a lot ;) KISS approach.

> However, as you can read in the document i just released, 64 registers is
> not too much in some critical circumstances...

Are you speaking about DCT? Yes, from a coder's point of view it is
good to have a lot of registers.

> I have not looked deeply into IA64 and though there are 128 physical int and
> fp regs, i am still unsure whether the opcode is limited to 32 registers
> (then the "window" moves through user-controlled register renaming).
> F-CPU can't afford all the latency and the huge hardware it requires.

The opcode uses a full 7 bits to address registers. Each instruction is
41 bits long minus 6 predication bits, so there are 35 bits left for the
instruction itself, with at most 3*7 bits to specify registers.
3*41=123 bits plus another 5-bit template code forms an instruction
bundle. These five bits select the instruction decoder for the different
instruction formats, so you have almost the same expressiveness as in
f-cpu - but at the cost of a 1/4 longer opcode.
So there is no real renaming AFAIK - the first 32 regs are global (can't
be rotated) and regs 32..127 can be ROTATED - probably there is a 7-bit
adder in the path and you can change the constant.
You can rotate all 96 regs (circularly) or only 8, 16, 32 or 64 of them
(simple logic). I've looked at string ops written for IA64 and with sw
pipelining you can do for example strlen in circa 10 instructions, hiding
a 32-cycle memory latency here - with no unrolling (!)
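
Just to make the bundle layout concrete, here is a tiny C sketch (my own
illustration - the field positions are how I understand them, so treat
them as an assumption) that pulls the 5-bit template and the three
41-bit slots out of a 128-bit bundle:

/* Sketch: extract the template and the three 41-bit slots from a
 * 128-bit bundle, assuming the 5-bit template sits in the low bits
 * followed by slot0..slot2.  For illustration only. */
#include <stdint.h>
#include <stdio.h>

struct bundle { uint64_t lo, hi; };   /* 128 bits as two 64-bit words */

static uint64_t slot(const struct bundle *b, int n)
{
    int start = 5 + 41 * n;           /* slot n starts at bit 5 + 41*n */
    uint64_t v;
    if (start + 41 <= 64)
        v = b->lo >> start;           /* entirely in the low word      */
    else if (start >= 64)
        v = b->hi >> (start - 64);    /* entirely in the high word     */
    else
        v = (b->lo >> start) | (b->hi << (64 - start));  /* straddles  */
    return v & ((1ULL << 41) - 1);
}

int main(void)
{
    struct bundle b = { 0x123456789abcdef0ULL, 0x0fedcba987654321ULL };
    printf("template = %u\n", (unsigned)(b.lo & 0x1f));
    for (int i = 0; i < 3; i++)
        printf("slot%d = %011llx\n", i, (unsigned long long)slot(&b, i));
    return 0;
}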

IMHO the adder for register rotation could be added to f-cpu without
adding another stage - if I understand the f-cpu pipeline, then at the
start of the decode stage we have all register addresses prepared - these
are propagated directly to the register file address ports where we need
them at the next stage (xbar). I'm not sure, but could there be
transistor space in the decode stage for, say, a 4-bit adder?
Then you could control rotation of blocks of 16 regs to do sw
pipelining (and as you say in your previous post there would be a problem
with the state during trap/context switch).
But some symmetric algorithms could benefit from it - but take it
just as my braindump ..
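
To show what I mean by that small adder, here is a toy C model of the
remapping (block placement and widths are invented, it is just the
braindump in code form):

/* Toy model: logical regs 48..63 form a rotating block of 16; a small
 * rotation base is added to the low bits of the register number before
 * it reaches the register file.  All sizes here are hypothetical. */
#include <stdio.h>

#define ROT_FIRST 48             /* first rotated register (assumption) */
#define ROT_SIZE  16             /* size of the rotating block          */

static unsigned remap(unsigned logical, unsigned rot_base)
{
    if (logical < ROT_FIRST)     /* r0..r47 stay static                 */
        return logical;
    /* the "4-bit adder": rotate within the 16-register block */
    return ROT_FIRST + ((logical - ROT_FIRST + rot_base) % ROT_SIZE);
}

int main(void)
{
    /* after each software-pipelined iteration the base advances by one,
     * so "r48" in the code names a different physical register */
    for (unsigned iter = 0; iter < 4; iter++)
        printf("iter %u: r48 -> r%u\n", iter, remap(48, iter));
    return 0;
}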

> OTOH 64 registers is equivalent to the agregated number of int and fp registers
> in most RISC architectures, so it's realistic. But the number of ports
> is already a problem for us.

hmm, I've heard that Itanium (which has a 14-ported reg. file) had to add
an extra pipeline stage only due to the slowness of these ports - it seems
that they have private ports per EU: Itanium has 2 MU, 2 ALU and 3 BR
(I ignore the FPU - it has its own bank), so you have 6 ports for the ALUs,
4 for the mem units, 3 for branching and 1 for ... god knows.

> > = register renaming
> * register renaming adds at least a pipeline stage so the jumps
>   are slower. We can't afford that now.

as above .. is the stage necessary?

>   allowing the core to reach a higher clock frequency than a traditional design.
>   It also prepares the rest of the project to very-high performance design habits,
>   for example it creates a pressure on the compiler from the start.

Do you think that given a 0.18u process f-cpu could go 2GHz? I
unfortunately have no idea of the speed of an inverter loop at 0.18u.

> * function call/returns are often a big deal, but the large number of registers
>   and modern compilers should help avoid unnecessary work. One ongoing discussion
>   deals with global-wise optimisations that analyse the call tree and keep only
>   the most important things in the registers, avoiding the silly spills on the stack.
>   The object code is probably larger but it should execute pretty fast.

You can do it on code which forms one module with its sources. The big
advantage of separate compile/link is compile speed. I can't imagine
developers waiting several hours when recompiling a moderate project
linked to glibc - you would have to compile glibc along with it to do
global optimization..
But if it is the only way to go for speed .. well.

> > = multi issue & groups
>
> yep but FC0 is single-issue now, we will examine the issue logic problems
> for a later core (FC1 ? FC2 ?). There were several threads in the past about this...

maybe - unfortunately the old archives at yahoo are gone. I was only
thinking that if we later need one bit to demarcate groups, it will
be hard to add it without breaking the current opcode format (see the
sketch below).
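
Just to illustrate - a hypothetical sketch where the group mark would be,
say, the top bit of each 32-bit opcode (this is NOT the current f-cpu
format, which is exactly the problem: there is no spare bit to steal):

/* Hypothetical "stop bit" ending an issue group, stored in bit 31 of
 * each 32-bit opcode.  Purely illustrative encoding. */
#include <stdint.h>
#include <stdio.h>

#define GROUP_END(op)  (((op) >> 31) & 1u)

/* how many consecutive opcodes could be issued as one group */
static unsigned group_length(const uint32_t *ops, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        if (GROUP_END(ops[i]))
            return i + 1;       /* the marked instruction ends the group */
    return n;
}

int main(void)
{
    uint32_t code[4] = { 0x00000001, 0x00000002, 0x80000003, 0x00000004 };
    printf("first group: %u instructions\n", group_length(code, 4)); /* 3 */
    return 0;
}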

> that's pretty far fetched... and i doubt that there will be such a large
> interconnect between the execution units. The plans i have for designing
> FC0 don't use such a method because the units have a very specific form factor.

I'm just curious - what is the on-chip area ratio between the f-cpu
register file : xbar interconnects : adder?
Is it possible to guess something like 1:1:1? ;-)

> > = address disambiguation
>
> The memory system that i have designed (i know that nicO is not
> very hot about it and if he feels angry enough, he'll design his own ;-D)
> is very unusual. it keeps everything coherent at the cost of a bit more
> latency, which can be hidden by the usual methods (such as software pipelining).

Is there a more complete description? I've read docs 0.2 IIRC and there
was not much about the mem subsystem. In this example:

*p = 1;
a = *q;
p += a;

you could do a = *q first to hide the [q] fetch latency, but if p == q
it will result in a wrong a. You can prefetch it but there is still the
L0->register latency (2 cycles). You can use a conditional move to repair
a if p == q, but that is too expensive.
So do you have another method to do it (which Alpha & IA64 do
through associative memory)?
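
Here is the conditional-repair variant spelled out in C, just so we are
talking about the same thing - it emulates in software what the IA64
advanced load + check does in hardware (names are mine):

/* Software "address disambiguation": hoist the load of *q above the
 * store to *p, then check whether the store aliased the load and redo
 * it if so.  IA64 does the check with chk.a backed by the ALAT; here
 * it is an explicit compare, i.e. exactly the conditional repair that
 * gets expensive in the general case. */
#include <stdio.h>

static int *kernel(int *p, const int *q)
{
    int a = *q;          /* speculative load, moved up to hide latency */
    *p = 1;              /* store that may alias q                     */
    if (p == q)          /* disambiguation check                       */
        a = *q;          /* recovery: reload the just-stored value     */
    return p + a;        /* the original 'p += a'                      */
}

int main(void)
{
    int x = 5;
    int *p = &x;
    const int *q = &x;                               /* worst case: p == q */
    printf("p advanced by %td\n", kernel(p, q) - p); /* prints 1, not 5    */
    return 0;
}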

> because the way loads and stores are performed is very different.

I'm eager to read more about it ;)

BTW: are there any proofs of the necessity of an orthogonal instruction
     set? It seems that implementing something like a tree between
     registers would make the interconnects much cheaper - something
     like Alpha's split 16/16 with a 1-cycle penalty.

best regards,
devik

*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/