
Re: [f-cpu] F-CPU vs. Itanium



Martin Devera wrote:
> Hello Yann,
Hi Martin !

> > sorry for the delay... i have glibc problems ;-)
> 
> ;-] recently I started to use uClibc (for our embedded apps)
> it is simple, complete and only 250kb SO (with 30k dynamic
> loader)

arghflk ! <drooling>
and how fast does this compile (compared to Glibc) ? :-)

> > > in docs there is stated that f-cpu was at first meant to
> > > be itanium killer.
> > hell, that was loooong loooooong ago ;-)
> > before we can whip Intel's ass, we have to make a first proof-of-concept
> > for a lot of things, build a name, design a complete
> > working workflow and user base... FC0 is not yet the ia64 basher
> 
> Yes it is the best way - I know something about hw but not so
> much as you and other guys here.
some of your later comments show that you're not clueless either.

> But from my software engineering
> experience the biggest problem of many projects is that they are
> too large from the beginning, so they are never finished ..
F-CPU might well be among them. But OTOH when a project is "finished"
it means that it is dead...

> However, on the other hand there are issues which should be addressed
> in early development ...
and F-CPU spends most of its time trying to find compromises...

> > people expected, but it's a nice core anyway :-)
> Yes I like it much ;) KISS approach.
i wouldn't say it's KISS otherwise it would already be finished ;-P

However, it would be cool if it became the equivalent of the "MIPS R2000"
of the new century. A lot of people have contributed to this project
and shared their experience. In some places it looks like a worthless
compromise and in other places, where no solution existed yet,
new things are imagined.

> > However, as you can read in the document i just released, 64 registers is
> > not too much in some critical circumstances...
> are you speaking about DCT ? Yes from coder's point of view it is
> good to have a lot of registers.
DCT is only a simple example, of course, a proof of concept.
There was a wish list of code examples, and only the complete implementation
of these examples can give a good overview of WHAT F-CPU is and how it should
be programmed properly. If you learnt MIPS and x86, you have to forget
some things and learn new things, and think more globally. it's not easy.

> > I have not looked deeply into IA64 and though there are 128 physical int and
> > fp regs, i am still unsure whether the opcode is limited to 32 registers
> > (then the "window" moves through user-controlled register renaming).
> > F-CPU can't afford all the latency and the huge hardware it requires.
> 
> The opcode uses full 7 bit to address registers.
at one time, i wondered whether they implemented only 5 bits for the register
addresses.

> The opcode is 41 bits long, minus 6 predication bits, so there are 35 bits
> for the instruction, with at most 3*7 bits to specify registers.
> 3*41=123, and another 5 bits of group code complete the instr. group. These five
> bits select the instruction decoder for the different instruction formats, so
> you have almost the same expressiveness as in f-cpu - but at the cost of
> a 1/4 longer opcode.
the code size bloat is not that important, depending on your goal
and application.
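
The bundle arithmetic quoted above (3*41 + 5 = 128) can be checked with a
small sketch. It only packs three 41-bit slots and a 5-bit group code into
one 128-bit word; putting the group code in the low bits is an assumption
made for illustration, and the function names are hypothetical.

```python
SLOT_BITS = 41       # one instruction slot, as quoted in the thread
TEMPLATE_BITS = 5    # the "group code" that selects the decoder
SLOTS = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS * SLOT_BITS  # 5 + 3*41

def pack_bundle(template, slots):
    """Pack a group code and three slots into one integer (low bits first)."""
    if len(slots) != SLOTS:
        raise ValueError("a bundle holds exactly three slots")
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("group code does not fit in 5 bits")
    word = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("slot does not fit in 41 bits")
        word |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return word

def unpack_bundle(word):
    """Inverse of pack_bundle: return (template, [slot0, slot1, slot2])."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(word >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
             for i in range(SLOTS)]
    return template, slots

def bits_per_instruction():
    """Average storage cost of one instruction, group code included."""
    return BUNDLE_BITS / SLOTS
```

Compared with a 32-bit opcode, the slot alone is 41/32 (about 1.28x) and
the slot plus its share of the group code is 128/96 (about 1.33x).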

> So there is no real renaming AFAIK - the first 32 regs are global (can't
> be rotated) and regs 32...127 can be ROTATED - probably there is a 7 bit
> adder in the path. You can change the constant.
IIRC there IS an adder and renaming takes one cycle (in the old implementations).

> You can rotate all 96 regs (circularly) or only 8,16,32 or 64 of them
> (simple logic).
forget the concept of "simple logic" :-) at that scale, anything takes
time. Of course now, transmission takes more time than computation, but
even that has a "cost".
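
A minimal sketch of the rotation described above, assuming 32 static
registers followed by a rotating region, and a rotation base (called rrb
here, a name chosen for this sketch) that is added modulo the region size.
It models the mapping only, not the timing cost discussed in this thread.

```python
STATIC_REGS = 32    # registers 0..31 never rotate
TOTAL_REGS = 128    # 7-bit register addresses

def physical_reg(virtual, rrb, rot_size=96):
    """Map a 7-bit register number to a physical register.

    virtual  -- register number found in the opcode
    rrb      -- rotation base (hypothetical name), the "constant" that changes
    rot_size -- how many registers above the static ones rotate
    """
    if not 0 <= virtual < TOTAL_REGS:
        raise ValueError("register number out of range")
    if not 0 < rot_size <= TOTAL_REGS - STATIC_REGS:
        raise ValueError("rotating region does not fit")
    first_fixed_above = STATIC_REGS + rot_size
    if virtual < STATIC_REGS or virtual >= first_fixed_above:
        return virtual  # outside the rotating region: no adder involved
    # The adder plus wrap-around. For power-of-two sizes the modulo is a
    # simple bit mask; for 96 it needs a compare and subtract.
    return STATIC_REGS + (virtual - STATIC_REGS + rrb) % rot_size
```

For example, with rot_size=16 and rrb=1, register 47 wraps around to 32
while register 48 is left alone.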

> I've looked at string ops written for IA64 and the sw
> pipelining can do for example strlen in about 10 instructions while hiding
> a 32 cycle memory latency - with no unrolling (!)
nice.

> IMHO the adder for register rotation could be added in f-cpu without
> adding another stage
it can't : it's already too tight and it would break FC0's architecture.

> - if I understand the f-cpu pipeline, then at the start of the
> decode stage we have prepared all register addresses - these are
> directly propagated to the register file address ports where we need
> them at the next stage (xbar).
that's accurate, even though now i'm less optimistic.
if i participate in FC1, i'll take more precautions
because it seems that FC0's register set will be the slowest
part of the whole core.

> I'm not sure but there could be
> transistor space in decode stage for say 4 bit adder ?
not even, because as noted before, it's probably the critical
datapath.

> Then you could control rotating of blocks of 16 regs to do sw
> pipelining (and as you say in the prev post there would be a problem
> with state during trap/context switch).
> But some symmetric algorithms could benefit from it - but take it
> just as my braindump ..
F-CPU was not designed with these techniques in mind, so it looks
a bit handicapped in this respect. Maybe this must be considered
for a "F-CPU 2" project because the programming model is already
advanced. Furthermore, optimising compilers for this kind of computer
are not easy to do (look at Intel's pain :-P) and this requires
predicates (which were dropped because they wouldn't fit in 32-bit instructions).

As a conclusion, the state of the art in computer science and technology
is not yet advanced enough for doing an ia64-like design in F-CPU fashion.
we already have to deal with different kinds of problems, such as the
VHDL toolchain and the portability...

> > OTOH 64 registers is equivalent to the aggregated number of int and fp registers
> > in most RISC architectures, so it's realistic. But the number of ports
> > is already a problem for us.
> hmm I've heard that Itanium (which has a 14-ported reg.file) had to add
> another pipeline stage only due to the slowness of these ports - it seems that
> they have private ports per EU: Itanium has 2 MU, 2 ALU and 3 BR
> (I ignore FPU - it has its own bank) so you have 6 ports for ALUs,
> 4 for memunit, 3 for branching and 1 for ... god knows.
maybe only He knows, indeed ;-P

The size of the register set is a big issue and it is certainly one
factor of Itanium's "slowness". already with 5 ports, FC0 will be
relatively "slow". Adding ports reduces the control logic's complexity
(because temporary buffers must be managed otherwise) and allows more
accesses, but completely bloats the silicon surface. it's a hard job.

For example, nicO wants to reduce the number of ports and i thought
about a way to achieve that. However, making a 3r2w register set
with 1r1w blocks does not reduce the surface and the complexity because
a lot of things must be duplicated :
 - you can do a 2r1w bank by using 2*1r1w blocks, so having 3 read ports
   requires 3 identical blocks.
 - you can't do a Xr2w bank as simply as a 2r1w bank because the data
   must be written to all the sub-blocks. One solution is to use
   some kind of FIFO that serializes the writes but it's not a suitable
   solution in this case. Another solution, which i proposed, was to ensure
   that two simultaneous writes would not write the same sub-block, but this
   can reduce the overall CPU efficiency.

(i'll describe the trick later)
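
As a rough illustration of the two points above (and not necessarily the
trick promised here), this sketch builds a 3r2w register set out of 1r1w
blocks: each bank is copied once per read port, and two writes in the same
cycle are accepted only when they fall in different banks. The even/odd
bank split and all names are assumptions of this sketch.

```python
class RegFile3R2W:
    """Toy model of a 3-read/2-write register set made of 1r1w blocks."""

    READ_PORTS = 3
    BANKS = 2  # one write per bank per cycle

    def __init__(self, nregs=64):
        per_bank = nregs // self.BANKS
        # blocks[bank][read_port] stands for one 1r1w block
        self.blocks = [[[0] * per_bank for _ in range(self.READ_PORTS)]
                       for _ in range(self.BANKS)]

    @staticmethod
    def bank_of(reg):
        return reg % 2  # even/odd split, chosen arbitrarily

    def write_cycle(self, writes):
        """Apply up to two (reg, value) writes; return False on a bank conflict.

        On a conflict nothing is written, which models the stall that
        costs CPU efficiency.
        """
        if len(writes) > self.BANKS:
            raise ValueError("only two write ports")
        banks = [self.bank_of(reg) for reg, _ in writes]
        if len(set(banks)) != len(banks):
            return False
        for reg, value in writes:
            # every copy of the bank receives the same data
            for block in self.blocks[self.bank_of(reg)]:
                block[reg // self.BANKS] = value
        return True

    def read(self, port, reg):
        """Read a register through one of the three read ports."""
        return self.blocks[self.bank_of(reg)][port][reg // self.BANKS]

    def block_count(self):
        return self.BANKS * self.READ_PORTS
```

The model needs 6 half-size blocks, which is the same amount of storage as
3 full-size copies, so the surface is indeed not reduced.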


> > > = register renaming
> > * register renaming adds at least a pipeline stage so the jumps
> >   are slower. We can't afford that now.
> as above .. is the stage necessary ?
absolutely.

> >   allowing the core to reach a higher clock frequency than a traditional design.
> >   It also prepares the rest of the project to very-high performance design habits,
> >   for example it creates a pressure on the compiler from the start.
> Do you think that given a 0.18u process f-cpu could go 2GHz ? I have
> unfortunately no idea of the speed of an inverter loop at 0.18u.
the speed of the inverter is not directly linked to the frequency of the whole chip
because a lot of parameters have recently become prominent :
 - the wires are relatively "slow", and the propagation time increases
     as the square of the distance
 - ===> a 64-bit computer then requires not 2x but 4x more time than a
     32-bit one to propagate data from bit 0 to bit 63
 - logic gates have increased in complexity because the old transistor constructs
     (back in 1985 when pass transistors were so handy) do not hold anymore,
     because of the much lower core voltage and the reduced noise immunity.
 - ===> as a consequence, the actual surface for a given function doesn't decrease
     as quickly as the transistors... if you need more transistors to perform the
     same thing...
 - memory cells become relatively slower and larger
This explains why a rough estimate of the complexity and speed of FC0
gives that one half of the surface and time is spent in the pipeline flip-flops.
The core will spend one half of its time memorizing things between two stages...
that's the point of diminishing returns and all of FC0 is calibrated around that.

> > * function call/returns are often a big deal, but the large number of registers
> >   and modern compilers should help avoid unnecessary work. One ongoing discussion
> >   deals with global-wise optimisations that analyse the call tree and keep only
> >   the most important things in the registers, avoiding the silly spills on the stack.
> >   The object code is probably larger but it should execute pretty fast.
> 
> You can do it on code which is one module with sources. The big advantage
> of compile/link is compile speed. I can't imagine developers waiting
> several hours when recompiling a moderate project linked to glibc - you
> would have to compile glibc along with it to do global optimization..
> But if it is the only way to go for speed .. well.

This depends on the local definition of speed.
When you develop an algorithm, compile time is moderately crucial.
Often, only small parts of the code base are modified, so incremental
compilation is possible (unless there's an avalanche effect). Then when
you are finished, you can let your computer run when you sleep for some
deep optimisations, if you want.

> > > = multi issue & groups
> > yep but FC0 is single-issue now, we will examine the issue logic problems
> > for a later core (FC1 ? FC2 ?). There were several threads in the past about this...
> maybe, but unfortunately the old archives at yahoo are gone.
There's an archive in Mexico (i forgot the URL).
it's a pretty large archive (around 20MB of files and attachments)

> I was only
> thinking - if we would need later one bit to demark groups it will
> be hard to add it without breaking current opcode format.

one of the ideas i proposed was to use a specific opcode for this.
For example, a cache line holds 8 instructions of 32 bits.
if the opcode itself is 8 bits, there remain 24 bits to describe
what the remaining 7 instructions of the line do.
With an average of 3 bits per instruction, it could "hint" the
instruction decoder and allow up to 7 parallel pipelines to be fed
in one cycle.
what do you think about this ? there's only a 1/8 overhead/bloat
and it's pretty portable across different implementations
(recognize this opcode as NOP if not useful, extract only the needed
information otherwise...)
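
A possible encoding of such a hint word, as a sketch only: the opcode
value 0xA5, the fixed 3 bits per instruction (the text speaks of an
average) and the position of the fields are all assumptions made here,
not part of any F-CPU specification.

```python
HINT_OPCODE = 0xA5        # hypothetical 8-bit opcode value
HINTS_PER_LINE = 7        # the other 7 instructions of the cache line
BITS_PER_HINT = 3         # fixed width, 7*3 = 21 of the 24 free bits

def encode_hint_word(hints):
    """Build the 32-bit hint instruction from seven 3-bit hints."""
    if len(hints) != HINTS_PER_LINE:
        raise ValueError("one hint per remaining instruction is expected")
    word = HINT_OPCODE << 24
    for i, hint in enumerate(hints):
        if not 0 <= hint < (1 << BITS_PER_HINT):
            raise ValueError("hint does not fit in 3 bits")
        word |= hint << (BITS_PER_HINT * i)
    return word

def decode_hint_word(word):
    """Return the seven hints, or None if the word is not a hint opcode.

    A core that does not use the hints can simply treat the word as a NOP.
    """
    if (word >> 24) & 0xFF != HINT_OPCODE:
        return None
    mask = (1 << BITS_PER_HINT) - 1
    return [(word >> (BITS_PER_HINT * i)) & mask
            for i in range(HINTS_PER_LINE)]
```

Three of the 24 bits stay unused in this layout, and the cost is one
instruction slot out of eight, which matches the 1/8 overhead above.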

> > that's pretty far fetched... and i doubt that there will be such a large
> > interconnect between the execution units. The plans i have for designing
> > FC0 don't use such a method because the units have a very specific form factor.
> I'm just curious - what is on-chip area difference between f-cpu register
> file:xbar interconnects:adder ?
> Is it possible to guess something like 1:1:1 ? ;-)

in today's technology, the Xbar won't have a specific "area" because it will
be routed "over" the other layers. With a 5-metal technology, the "Xbar"
will use the topmost 1 or 2 layers, for example.

> > > = address disambiguation
> > The memory system that i have designed (i know that nicO is not
> > very hot about it and if he feels angry enough, he'll design his own ;-D)
> > is very unusual. it keeps everything coherent at the cost of a bit more
> > latency, which can be hidden by the usual methods (such as software pipelining).
> Is there a more complete description ? I've read docs 0.2 IIRC and there
> was not much about the mem subsys. In this example:
> 
> 1: *p = 1;
> 2: a = *q;
> 3: p += a;
>
> you could do a = *q first to hide the [q] fetch latency, but if p == q
> it will result in a wrong a. You can prefetch it but there is still the
> L0->register latency (2 cycles). You can use a conditional move to repair
> a if p==q but it is too expensive.
> So do you have another method to do it (which Alpha & IA64 do
> through associative memory) ?

My (personal) point of view about this kind of problem is :
if you mess with the usual way things are done, it will be done
anyway (if it's legal enough) but certainly slower. If you code
correctly (99% of the code out there) there won't be such a problem.

now for the remaining 1%, let's see what happens with some translation to asm
and explanation of what happens behind the curtains ...

1:
 loadimm  1, r1 ; // or something like that.
 store   r2, r1 ; // here, r2 already points to the location p, the address is
                  // disambiguated at execution (otherwise it would trap
                  // if a TLB miss occurred).
2:
 load    r3, r4 ; // like above, r3 already points to the location q. If q==p,
                  // the value of r4 becomes that of r1. there might be a small
                  // delay (1 or 2 cycles at most) if no bypass is designed in the LSU.
3:
 add r2, r4, r2 ; // there's a bypass on the Xbar. The only bad surprise comes from
                  // the future reuse of r2 as a pointer without a previous explicit
                  // prefetch, a few cycles of penalty. my first estimates are
                  // lower than i expected (1 or 2 cycles) but this will not be true
                  // in the future or in far-from-ideal cases.
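
The aliasing hazard in the three-line C example quoted above can be
reproduced with a toy model where memory is a list and p, q are indexes.
This only shows why moving the load above the store is unsafe when p == q
and what the "repair" costs; it says nothing about how the F-CPU LSU is
actually built, and the function names are made up for the sketch.

```python
def in_order(mem, p, q):
    """Reference behaviour: store, then load, then add."""
    mem[p] = 1
    a = mem[q]
    return a, p + a

def load_hoisted(mem, p, q):
    """Load moved above the store to hide its latency (unsafe if p == q)."""
    a = mem[q]
    mem[p] = 1
    return a, p + a

def load_hoisted_checked(mem, p, q):
    """Hoisted load plus the address compare and conditional repair."""
    a = mem[q]
    mem[p] = 1
    if p == q:      # the extra compare + conditional move
        a = 1
    return a, p + a

def results_for(p, q):
    """Run the three variants on fresh copies of the same memory image."""
    image = [7, 9]
    return (in_order(list(image), p, q),
            load_hoisted(list(image), p, q),
            load_hoisted_checked(list(image), p, q))
```

With p != q all three variants agree; with p == q the plain hoisted load
returns the stale value while the checked one matches the in-order result.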

> > because the way loads and stores are performed is very different.
> I'm eager to read more about it ;)

there will be an outline in the next paper (2D DCT).

> BTW: are there some proofs of the necessity of an orthogonal instruction
>      set ? It seems that implementing something like a tree between
>      registers would make interconnects much cheaper - something
>      like Alpha's split 16/16 with 1 cycle penalty.

split register sets will probably become more and more necessary...
However, i am scratching some ideas for a hypothetical FC1, which
would be even more strange but very interesting anyway
(on-the-fly RISC->TTA translator with large instruction buffer,
TTA core, potentially superscalar execution with TTA's inherent
OOO capabilities... yeah...) but that's only for the time
when FC0 will tick in more than one prototype.

> best regards,
> devik
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~