
Re: [f-cpu] F-CPU vs. Itanium


sorry for the delay... i have glibc problems ;-)

Martin Devera wrote:
> Hello,
> in docs there is stated that f-cpu was at first meant to
> be itanium killer.
hell, that was loooong loooooong ago ;-)
before we can whip Intel's ass, we have to make a first proof of
concept for a lot of things, build a name, set up a complete
working workflow and gather a user base... FC0 is not yet the ia64 basher
people expected, but it's a nice core anyway :-)

> I was thinking what makes it better
> than Itanium in terms of performance:
> - = better for Itanium
> + = better for f-cpu
> = register count (let's ignore FPU splitted bank)
> - IA64 defines twice more registers which are used
>   as cache during subroutine calls -> less push/pop
>   to memory
> + f-cpu saves 3bits here ->shorter opcode
> + maybe less registers -> less expensive to add next
>   ports to reg-file (due to fanout) ?

the register set is still a big issue. And it is already huge.
I had plans for a 128-register core with some particular "tricks"
so 64 registers would be implemented as a normal set.
Other people wanted the usual 32 int regs. So we found a compromise
with a unified 64-register bank.

A 64-bit 64-register bank is 4x larger than what is found in a MIPS R3000
and this can cause electronic problems if the technology is not suited.
However, as you can read in the document i just released, 64 registers are
not too many in some critical circumstances...
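
To put rough numbers on this, here is a small back-of-the-envelope sketch
in Python. It only counts storage bits and register-number bits. The
three-operand instruction format and the function names are assumptions
made for illustration, they are not taken from the F-CPU manual.

```python
def regfile_bits(num_regs, width_bits):
    """Raw storage bits of a register bank (ports and decoders are ignored)."""
    return num_regs * width_bits

def operand_field_bits(num_regs, operands=3):
    """Opcode bits spent on register numbers.

    Assumes `operands` register fields per instruction (hypothetical format).
    """
    bits_per_field = (num_regs - 1).bit_length()  # 64 regs -> 6, 128 regs -> 7
    return operands * bits_per_field

r3000 = regfile_bits(32, 32)   # MIPS R3000: 32 integer registers of 32 bits
fc0 = regfile_bits(64, 64)     # FC0: 64 unified registers of 64 bits
ratio = fc0 // r3000           # how many times more storage bits

# Bits saved per instruction by addressing 64 instead of 128 registers.
saved = operand_field_bits(128) - operand_field_bits(64)

print(r3000, fc0, ratio, saved)
```

Under these assumptions the FC0 bank holds 4096 bits against 1024 for the
R3000 integer bank, and a 64-register encoding is 3 bits shorter per
instruction than a 128-register one. The count says nothing about ports,
which is where the real cost is.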

I have not looked deeply into IA64 and though there are 128 physical int and
fp regs, i am still unsure whether the opcode is limited to 32 registers
(then the "window" moves through user-controlled register renaming).
F-CPU can't afford all the latency and the huge hardware it requires.

OTOH 64 registers is equivalent to the aggregate number of int and fp registers
in most RISC architectures, so it's realistic. But the number of ports
is already a problem for us.

> = register renaming
> - f-cpu has to do more register saves during calls
>   because of fixed register allocation (call-presistent
>   vs. call-clobbered)
> - itanium can do sw pipelining with less code size
> + f-cpu is simpler -> higher clock ?

* register renaming adds at least a pipeline stage so the jumps
  are slower. We can't afford that now.

* FC0 was designed from the beginning to have "very short pipeline stages",
  allowing the core to reach a higher clock frequency than a traditional design.
  It also gets the rest of the project used to very-high-performance design habits,
  for example it puts pressure on the compiler from the start.
  We also try not to "bloat" it and to avoid hidden critical datapaths, at the cost
  of more software latency, but this keeps the overall frequency and performance high.
* function calls/returns are often a big deal, but the large number of registers
  and modern compilers should help avoid unnecessary work. One ongoing discussion
  deals with global optimisations that analyse the call tree and keep only
  the most important things in the registers, avoiding the silly spills to the stack.
  The object code is probably larger but it should execute pretty fast.

> = multi issue & groups
> - Itanium uses stop-marks to denote parts of machine code
>   where is no RAW->WAW between regs -> simpler multiissue logic
> - 6/9 issue at this time for Itanium/Merced
> + f-cpu saves 1bit per op here

yep but FC0 is single-issue now, we will examine the issue logic problems
for a later core (FC1 ? FC2 ?). There were several threads in the past about this...

> = simd
> + f-cpu can do simd on every op while Itanium has dedicated ops

experience with MMX coding has helped here ;-)
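
For readers who have not coded for MMX: the idea is that one 64-bit
register is treated as several independent lanes. Below is a minimal
sketch of a packed 8-bit add, emulated in Python on plain integers.
simd_add8 is a hypothetical name, and this is the classic "SIMD within a
register" masking trick, not a description of how the FC0 adder is built.

```python
HIGH = 0x8080808080808080   # top bit of each of the eight 8-bit lanes
LOW = 0x7F7F7F7F7F7F7F7F    # the remaining seven bits of each lane

def simd_add8(a, b):
    """Add eight packed 8-bit lanes; each lane wraps around on overflow.

    a and b are assumed to be unsigned 64-bit values.
    """
    # Add the low 7 bits of every lane: the sum fits in 8 bits per lane,
    # so nothing can carry into the neighbouring lane.
    partial = (a & LOW) + (b & LOW)
    # Patch in the top bit of every lane without generating a carry out.
    return partial ^ ((a ^ b) & HIGH)

# Low lane: 0xFF + 0x01 wraps to 0x00; next lane: 0x01 + 0x01 = 0x02.
print(hex(simd_add8(0x01FF, 0x0101)))
```

The overflow of the low lane does not leak into the lane above it, which
is the whole point of a SIMD flag on an ordinary add.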

> = pipeline depth
> + shorter pipeline is always better
> - if f-cpu will want to be multiissue maybe xbar stage will have
>   to be larger ? like using 5x 4-stage omega network to do full
>   mesh between each register and each of 63 EU's ;-]

that's pretty far-fetched... and i doubt that there will be such a large
interconnect between the execution units. The plans i have for designing
FC0 don't use such a method because the units have a very specific form factor.

> = address disambiguation
> - Itanium can do it (explicitly) so that you can exploit more
>   ILP - move loads before potential overwrites of the same address
> + Itanium spends instruction slot for it (while Alpha does it
>   automaticaly AFAIK)

The memory system that i have designed (i know that nicO is not
very hot about it and if he feels angry enough, he'll design his own ;-D)
is very unusual. it keeps everything coherent at the cost of a bit more
latency, which can be hidden by the usual methods (such as software pipelining).
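
A rough sketch of what software pipelining means here, written in Python
only for readability. The load for the next iteration is issued before
the current values are consumed, so on a real core the load latency
overlaps with the computation. The function names are made up, Python
does not model latency at all, and this is not the F-CPU load/store
mechanism itself: only the loop restructuring is shown.

```python
def dot_naive(a, b):
    """Load and use in the same iteration: the core waits for every load."""
    total = 0
    for i in range(len(a)):
        x, y = a[i], b[i]      # load ...
        total += x * y         # ... and consume immediately
    return total

def dot_pipelined(a, b):
    """Same result, but each load is issued one iteration ahead of its use."""
    n = len(a)
    if n == 0:
        return 0
    total = 0
    x, y = a[0], b[0]          # prologue: first load
    for i in range(1, n):
        nx, ny = a[i], b[i]    # issue the next load early
        total += x * y         # work on the values loaded one iteration ago
        x, y = nx, ny
    total += x * y             # epilogue: consume the last loaded values
    return total

print(dot_naive([1, 2, 3], [4, 5, 6]), dot_pipelined([1, 2, 3], [4, 5, 6]))
```

The price is a prologue and an epilogue around the loop, i.e. larger
object code, plus extra registers to hold the values in flight.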

the F-CPU programming model is simpler and maybe more naive than IA64's
but it makes its implementation much easier, as well as code generation.
OTOH the coding guidelines are pretty strange, too, for the inexperienced,
because the way loads and stores are performed is very different.

> So where is the key for f-cpu to be faster ? Will it run at higher
> clocks ? Or did I miss something ?

F-CPU is designed to be and remain fast, rather simple (or at least
manageable by a small team), predictable (that is: allow static scheduling
during compilation and guarantee that software can run in a given time) and long-lasting.
It is much simpler than IA64 (just look at the Instruction Set and the registers)
so F-CPU will suffer less from enhancements in the coming decades.

did i forget something ? :-)

> devik
