alpha_vs_f-cpu.txt
created Tue Aug 14 07:20:01 2001 by Yann Guidon <whygee@f-cpu.org>
Original file : courtesy of Digital/Compaq (no Copyright notice was found)
current version : aug. 15th, 2001


       ALPHA vs. F-CPU : Architectural comparison.


-----------------------------------------------------------
Disclaimer : This document expresses personal opinions and
may contain judgement errors and/or typos.
Some assertions come from things i've heard or read but
i can't always give references (hence the word "opinion").
I may sometimes speak in the name of the F-CPU project
("we") but i am so schizophrenic that i never know when.
I tried to give credit wherever i can ; please forgive
any misleading or wrong information (but report it to me
anyway). All brands are registered and vice versa.
-----------------------------------------------------------

Introduction : Some "facts" about the Alpha
(and Whygee's expression of his sadness)

Alpha is dead. It's really too dumb and unfair. Apart from the F-CPU,
it is the only architecture that i ever felt like programming
for. I have programmed enough on stinking platforms, and Alpha
was for me not only a "high end" platform, it was also the
cleanest and "sexiest" one. Imagine : no marketoid compromise,
high bandwidth, longevity, (some) third parties, and at one
time it was even able to take a share of the PC market through
low-end versions. Read the Alpha-HOWTO for more information on
Digital's clone motherboards.

Alpha was started in 1988 by Digital, which became part of
CPQ 10 years later, and now CPQ drops it in favour of IA64. What
about the Alpha engineers ? the third parties ? the
products and the support ? Alpha seems to be a failure
despite the highest qualities, which were proved by its integration in
the Cray Research T3D and T3E. Curiously enough, the same
sad fate happened to Cray : no T3F ever
came because it would have competed with MPP computers
built by SGI (which was more or less forced to buy Cray,
for fear that it would be bought by foreign companies...).

And don't say that ALPHA was too high-end : Digital designed
and built low-end, cheaper versions of the DecChips
with an integrated PCI bridge and memory controller (such as
the 21068, which was designed for X terminals).

A good architecture is obviously not enough for long-lasting
success. 10 years of thrilling SPEC ratings is already
a cool thing, but we see now that when the company fails,
the architecture goes away with it. Third parties leave too,
because once the leader's moral backing is gone, the market
suddenly disappears. Independence from
the manufacturer's economic/management/market variations becomes
more than necessary. I don't want to invest any effort in
something that will last less than 30 years. Anything less
is not worth the countless sleepless nights and nonexistent
social life. I think that a lot of people understand what
"personal investment" means.

I recently found this document while browsing the web in
search of ALPHA architectural material. Curiously, it seems
that no up-to-date or meaningful information can be found
about the latest chips. However, I thought that it would
be interesting to compare the father and the child.
Yes, F-CPU is more or less an Alpha successor : Alpha,
and some of Seymour Cray's designs such as the CDC6600, have deeply
influenced the early stages of the F-CPU definition
(through my pressure on the mailing list, as some may
remember). So you will find in this document some
apple-to-apple comparisons ! The original document from
Digital contains a lot of condensed information that
can easily be discussed.

By the way, why compare the F-CPU to the Alpha ?
Besides the fact that Alpha (which was a real industrial
project) died before the F-CPU (which is only a "utopia"
and hence has no "physical" existence), the F-CPU contributors
had examined the known architectures of that time (1Q99).
 * PPC (IBM/Motorola) is far too complex. We wanted
   something simple and OOO was being criticized at that time
   for consuming too much power and silicon with the control logic.
   On top of that, we would not have been able to understand and/or debug it.
 * MMIX (Knuth) : it is well known not to be a "real world"
   architecture. Despite the fact that Dick Sites himself
   came to help, it was not designed to be a commercial computer.
 * EPIC/IA64 : we could not afford the large dies and
   it is still currently being matured (VLIW is not yet
   really accepted as a ready and proved concept).
 * I certainly forget a lot of other things.
The fact is that we started a new architecture "from scratch"
around March/May 1999, with a classical MIPS R2000 pipeline
in mind.

Predicates were also examined for the F-CPU but soon removed
(guess by who ;-D) because :
    - a 32-bit instruction word with 6-bit fields for each register
      does not leave enough space for the predicate bits
    - when not enough predicates are available, they act just like
      "condition registers" : a bottleneck when going superscalar
      (and it's "anti-RISC") [Mathias had included "only" 3 predicates...]
    - predicates encourage code bloat and poor relative performance :
      a lot of operations are issued but most are useless, so we get
      a very high "peak" MFLOPS rating with poor sustained performance.
      Software pipelining (advanced compiler techniques favoring smart
      instruction scheduling) avoids this problem in the absence of predicates.
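A toy model (in Python, with made-up numbers, not F-CPU code) of the last point : with classic if-conversion, both arms of an if/else are issued as predicated operations, but only one arm per iteration does useful work, so the sustained/peak ratio collapses.

```python
def issued_vs_useful(conds):
    """Count issued vs useful operations when both arms of an
    if/else are predicated (one op per arm): every op issues,
    but exactly one arm per iteration is useful."""
    issued = 2 * len(conds)   # both arms issue every time
    useful = len(conds)       # only the taken arm matters
    return issued, useful

# 8 iterations: 16 ops issued, 8 useful -> 50% sustained/peak
print(issued_vs_useful([True, False] * 4))  # (16, 8)
```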

Alpha represented the State of the Art in RISC architecture :
it was the most evolved (yet rather simple and clean) generation of RISC
designs after the first-generation milestones like ARM, MIPS, SPARC...
A few cores come rather close to F-CPU, such as the SH-5, but
F-CPU still contains a unique combination of features.
I believe that SH-5 will never be able to follow the F-CPU
when it scales up :-P

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

note : the original text lines start with C++-style comments
(that's all i was able to do within EMACS, shame on me) to
distinguish the original text from my comments.

//  ALPHA ARCHITECTURE TECHNICAL SUMMARY 
//  Dick Sites, Rich Witek
(posted on comp.arch by jg@crl.dec.com (Jim Gettys), Tue, 25 Feb 1992 17:17:38 GMT)

ALPHA vs F-CPU comparison
Yann Guidon (august 2001)

//  [NOTE: "Alpha" is an internal code name. An official name will be announced
//   soon.]

In the end, the "ALPHA" name has remained.
I wonder what the legal team thought about this,
they certainly had to register more brand names than expected :-)
The "AXP" name never caught on : its meaning was never
clear and the similarity with "VAX" was... not clear either.
Maybe it was one victory of the engineers over the marketoids :-)

"F-CPU" is now a registered brand that is co-owned by several contributors.
The team owns the associated domain names and the minimum stuff
to make some serious work.

//  WHAT IS ALPHA?

If you can answer that, then you might still have trouble defining what
F-CPU really is. Of course, it is a computer architecture ; it's also a project,
a team, a certain way of life too. There is a strong independent culture and
mentality associated with this "concept". But it remains what people do with it.
ALPHA was an industrial project, F-CPU is more of an engineering research
project. Finally, I remind you that the goal of F-CPU is not to "build and sell",
but to "design" a "standard" for a new family of microprocessors. It is closer to the
"fabless" industry.

One note about "SoC" : this (m*rk*t*ng) term appeared later in the F-CPU history.
F-CPU was never meant to become a "SoC core". LEON was. Hence LEON (a subset
thereof) is distributed under the LGPL, while F-CPU is completely under the GPL.
On one side, this doesn't allow F-CPU to be integrated inside "proprietary"
circuits. On the other side, the GPL'd IP code base is growing, so it should
not be a limiting factor. But the real goal is still not to design an "IP core"
but rather a completely unobfuscated alternative to the x86 processors.

//  Alpha is a 64-bit RISC architecture, designed with particular emphasis on 
//  speed, multiple instruction issue, multiple processors, software migration 
//  from VAX VMS and MIPS ULTRIX, and long lifetime. The architects rejected 
//  any feature that did not appear to be usable for at least 25 years.

F-CPU has much more ambitious goals than that (read the introduction again).
Alpha is well designed for
today but it is more than 10 years old now (it was introduced in early 1992
but the early work started in 1988). Today, Alpha has lived half of its
expected lifespan and nothing has caught on after the 21164 (as far as i can see).
Starting the F-CPU project with the same goals as the ALPHA in mind would not
be enough, so let's double the claims ! A 50-year lifespan would be fine.
Of course it is not realistic, but you won't go far if you always gaze at
your shoes. Now that the project is 3 years old, i think that this strategy
of "utopia" has started to work.

F-CPU is a superpipelined, SIMD architecture that is somewhat influenced
by the early ALPHA and CDC architectures. Because we arrive 10 years later,
it is easy to see which features and choices were made correctly (or not).
There's no magic : all we have
to do is look at the design choices and see if they still hold.
Reading a French book ("Les microprocesseurs ALPHA" by Bernard
Ourghanlian, 1995, Intereditions, ISBN 2 7296 0565 7) has helped in parts.
It is a good "key" to "decipher" the microarchitectural features of the
ALPHA. Of course, newbies are encouraged to read the Patterson & Hennessy
books, the "Alpha Architecture Reference Manual" by Dick Sites, etc.

Rejecting delayed branches and the hardware stack were good ideas
because the constraints of superscalar designs were too
strong. Windowed registers brought too many problems, and OOO cores
(starting with the 21264) removed the question definitively. From that point
of view, ALPHA appeared as a "clean" architecture that learnt
the lessons taught by others' experiences.

However, the early choice of a split register set was not that bright,
because later implementations had to include specific communication
instructions (otherwise, you had to go through temporary memory locations).
On top of that, the 264 is OOO and contains many more physical registers
than can be architecturally accessed, so a split set was not interesting for F-CPU,
especially if you remember that FC0 is single-issue.

F-CPU comes with the tide of the Linux world. The GNU project is
gaining so much momentum that we are now free from the Wintel legacy
dilemma (performance or compatibility ?). No need to emulate or simulate,
no Transmeta bag of tricks : the performance is not a fake measurement :-)
No "PALcode" was necessary, only a kernel/user mode bit (and a lot of other
stuff behind it, but that's not the point yet).

So F-CPU can be considered as one of ALPHA's sons, but with broader
goals and lifetime. It reuses a lot of software and microarchitectural
expertise and adds its own. In fact, when i joined the F-CPU project
in 1999, i wanted to see if F-CPU could compete with ALPHA but even
though it has helped, the architecture of the FC0 has nothing in
common with any CPU implementation that i know of.


//  The first chip implementation runs at up to 200 MHz.  The speed of Alpha 
//  implementations is expected to scale up from this by at least a factor of 
//  1000 over the next 25 years. 

The "speed" (clock frequency) of the F-CPU is not the only important stuff.
Notice the misleading use of the word "speed", in fact it was meant to be
"performance" (it is explained later in this text).

Of course, F-CPU MUST be fast, very fast, otherwise why use it ? But the
architecture must also be able to scale up and down at will. It will certainly
be implemented by several independent funders/makers (a bit like MIPS
or SPARC cores for examples) so it must be flexible (on top of everything)
while keeping the performance and compatibility for granted.
All the recent computer architectures have to deal with that.
I foresee that in the not-so-distant-future, most architectures that
are not flexible enough will not survive outside their niche.
The computer industry is so unpredictable, today...

One lesson learnt from the ALPHA is that even though the designers had
taken into account the necessity of providing as many address bits
as possible ("A lot of architectures died from running out of address bits"),
they ran out of address bits anyway when the chip was used in the CRAY T3.
The address bus had to be expanded externally, isn't that a shame ?
It reminds me of my Sinclair PC-200 with its 8086 inside ;-P
And this happened even though Cray was working with the DEC team. Wow.
So today the other lesson can be : don't mistake the addressing
capability of the CPU for the system's memory capacity.
MPP is slowly burying parallel vector systems...

However, a lot of people still make the same mistake : "an N-bit CPU
is useful only if you want to access 2^N bytes". It is false :
- the width of the registers is principally used for integer or
  bitfield computations, not so much for pointers. With the rise
  of SIMD instruction set extensions, i thought that people would
  understand that the wider the register, the more operations/cycle
  it can yield (with an increasing MOPS/decode ratio). I've heard that one
  CPU (PPC ?) has 256-bit "media" registers, so i don't think
  that it is really silly.
- the need for address bits increases by 1 or 2 bits per year.
  Do you think that PPC wants to access 2^256 bytes ?
  No : in fact, it is because SIMD is slowly taking over vector computing.

In the ALPHA architectural goals, one tenfold performance increase was
meant to come from CPU parallelism, another tenfold from clock speed, and the
last tenfold from instruction-level parallelism (hence the 1000-fold increase in 25 years).
 * clock : almost reached at half of the expected lifetime (150 MHz vs 1001 MHz)
 * parallelism : one CRAY T3 system can top out at 2K CPUs (the 10x factor blown up by 20480% !)
   and the CEA (French Atomic Energy Center) has a 3K-Alpha system. That is several
   orders of magnitude more than projected.
 * ILP : this seems to suck, even though there are some interesting perspectives
   with Simultaneous MultiThreading (364 ? 464 ?). 4 instructions/cycle (sustained)
   seems to be what compilers can do today (given the crappy programs they are given).
One conclusion is that some goals were reached well before time was up,
despite all the precautions. OK, ALPHA is still much better than the cranky
Intel craps. So the strategy is to define REALLY UNREACHABLE GOALS,
so we're sure we won't reach the ceiling for a long, long time.

But in the end, the "trick" is to not define specific or quantitative goals at
all. Qualitative goals are more at stake in the F-CPU project. No compromise,
no schedule, extreme scalability, complete freedom within the F-CPU compatibility frame,
that is something that i expect from any decent CPU. Alpha showed the way,
so let's go :-)


//  FORMATS

//  Data Formats

//  Alpha is a load/store RISC architecture with all operations done between
//  registers. Alpha has 32 integer registers and 32 floating registers, each
//  64 bits. Integer register R31 and floating register F31 are always zero.
//  Longword (32-bit) and quadword (64-bit) integers are supported. Four
//  floating datatypes are supported: VAX F-float, VAX G-float, IEEE single
//  (32-bit), and IEEE double (64-bit). Memory is accessed via 64-bit virtual
//  little-endian byte addresses. 

One point is missing : ALPHA is a "pure clean RISC architecture", it does not
support anything other than 64-bit data. Well, there are some reserved extensions
for 128-bit operations, but not much more. This plagued the architecture
for a while, mainly when porting SW from other platforms.

F-CPU is meant to support any data size that is a power of two, down to
the byte. The instructions operate on byte, word, dword and qword as a minimal
requirement. But it is also a SIMD architecture : in order to increase
the MOPS/MHz ratio, we use data parallelism, and F-CPU must be able to apply
almost all operations to packets of data with a single instruction
(whenever it makes sense).

As a consequence, the F-CPU register size is _not_defined_ (it is
"implementation-dependent", chosen case by case by the vendor). It is 64-bit
by default (to provide a minimal compatibility failsafe), but the physical
implementation could just as well be 128 bits, or 256, or 512... There is no
upper bound. All you (would) have to do is change a single line in
the F-CPU configuration file, recompile the design and make the chips.
By the way, a 32-bit F-CPU core could be designed, but i don't see the point
because the LEON core (a SPARC clone under LGPL) is already working well.
So all the RTL design files are focused on a 64-bit core with replication
of the units when wider data are defined.

Data "elements" (arithmetic integer quantities) are 8, 16, 32 or 64 bit wide
and fit by default in a register. Then, wider register "duplicate" the
pattern and we can process 16 "chunks" of 16 bits in a 256-bit register for example.
It functions almost like a "scalable" vector processor and unlike all the
existing SIMD instructions found in the current CPUs, the SIMD "feature"
is not an addition here : it is taken into account from the start and it won't
affect the balance of the architecture in the future. It will however
provide a very valuable way to increase the computing performance without
increasing the clock speed. The F-CPU is ready to take over the world when
Moore's law will be invalid :*)
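A tiny Python sketch (an illustration, not F-CPU code) of this chunk-wise behaviour : one "instruction" adds all the chunks of two wide registers at once, and no carry crosses a chunk boundary.

```python
def simd_add(a, b, chunk=16, width=256):
    """Add two packed registers chunk by chunk: each chunk wraps
    independently, carries never cross a chunk boundary (the
    chunk/width values mirror the 16x16-bit example in the text)."""
    mask = (1 << chunk) - 1
    out = 0
    for i in range(0, width, chunk):
        s = (((a >> i) & mask) + ((b >> i) & mask)) & mask
        out |= s << i
    return out

# 0xFFFF + 1 wraps to 0 inside its own 16-bit chunk,
# while the neighbouring chunk computes 1 + 1 = 2 independently
print(hex(simd_add(0x0001FFFF, 0x00010001)))  # 0x20000
```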

The register set, as noted before, is not split. The 64 physical registers
contain pointers and data, in integer form or possibly in floating-point
or logarithmic number system ("LNS") form, depending on the implementation goals
and means. All data can be packed in SIMD fashion, except pointers (all their bits
must be valid, even if the implementation is a 256-bit version). There is no
reason to consider that 64-bit pointers must be the norm.

Oh, and register #0 is hardwired to zero, by the way.
I'm not totally convinced it is really necessary (it makes the decoding a bit more complex),
but it adds a bit of "RISC taste". 95% of the time it is used as "condition true" in jumps
and cmov, so why bother ; but other contributors insisted on having this feature.


//  Instruction Formats

//  Alpha instructions are all 32 bits, in four different instruction formats
//  specifying 0, 1, 2, or 3 register fields. All formats have a 6-bit opcode. 

//  	+-----+-------------------------+
//  	| OP  |         number          | PALcall
//  	+-----+----+--------------------+
//  	| OP  | RA |        disp        | Branch
//  	+-----+----+----+---------------+
//  	| OP  | RA | RB |    disp       | Memory
//  	+-----+----+----+----------+----+
//  	| OP  | RA | RB |  func.   | RC | Operate
//  	+-----+----+----+----------+----+

The F-CPU instructions are both similar to and different from the ALPHA
(MIPSy) standard.
They are similar because they come from the same MIPS/RISC ground : 32-bit,
perfectly aligned instructions with 3-operand capability and reduced decoding
constraints. This is an old and proven way to decode instructions at full CPU speed.

First change : the OP field is 8 bits, not 6.
Because we drop the function fields, the OP field must encode more information.

Second : the other fields are inverted. It looks like this :

                src1    src2   dest
+------+-------+------+------+------+
|  OP  | flags |  RA  |  RB  |  RC  | Operate
+------+-----+--------+------+------+
|  OP  |flags| imm8   |  RB  |  RC  | Operate + 8-bit immediate
+------+-+-------------------+------+
|  OP  |f|  16-bit immediate |  RC  | load constant + SR get/put
+------+-+-------------------+------+


Mathias Brossard had this idea,
which simplifies the design of the instruction set : we don't have a
"function field" in the middle of nowhere. It lets us place the optional
bit fields easily and choose between immediate data bits or flag bits.
This way, we were able to merge forms and simplify the instruction layout
(ie : dropping the 6-bit immediate form for the more flexible 8-bit form
reduces the decoder complexity by another small amount).
However, even if the F-CPU instruction layout is simple, the
opcodes can sometimes have a lot of options. Most of these option bits
can be sent directly to the execution units, but not always.


Third : because there are 64 registers, we need 3*6=18 bits in the instruction.
14 bits remain for the 3-operand operations, which is often enough. The rest
is often "padded" with optional flags.

Fourth : concerning "complex" operations (i'm thinking about some
really elaborate instructions in the PPC and
HP PA-RISC architectures), since we have less room for the opcode,
we have to reuse the registers when doing complex stuff. The "usual" operations
read one or two registers and write to a third one ; these are called
"2R1W" and "1R1W". Some operations, like load and store with pointer update,
require 3R1W and 2R2W instructions. The fourth register number comes
from the third register field, with the LSB negated.
[warning : this part is not definitive, there are some exceptions
 that are not yet "standard"]
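As a sketch, the bit budget above (8-bit opcode, 6 flag bits, three 6-bit register fields : 8+6+6+6+6 = 32) can be packed like this. The exact bit positions are my assumption (the text only gives the field widths and order), and the fourth-register rule is the LSB negation described above :

```python
def encode_op(op, flags, ra, rb, rc):
    """Pack a 3-operand 'Operate' instruction: 8-bit opcode,
    6 flag bits, three 6-bit register fields (8+6+6+6+6 = 32).
    Putting the opcode in the MSBs is an assumption here;
    only the widths come from the text."""
    assert 0 <= op < 256 and 0 <= flags < 64
    for r in (ra, rb, rc):
        assert 0 <= r < 64          # 64 registers -> 6 bits each
    return (op << 24) | (flags << 18) | (ra << 12) | (rb << 6) | rc

def paired_reg(rc):
    """3R1W/2R2W forms get their fourth register 'for free':
    the third register field with its LSB negated."""
    return rc ^ 1

print(hex(encode_op(0x12, 0, 1, 2, 3)))  # 0x12001083
print(paired_reg(6), paired_reg(7))      # 7 6
```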

Fifth : 3 bits are often used to specify the operand sizes :
one bit selects the SIMD mode and two bits specify the "chunk" size
(the size of the independent data inside the register) : 8, 16, 32 or 64 bits.
A mechanism is reserved to extend this range through a user-configurable
lookup table so we can handle 128-bit (for example) data without modifying the
instruction set or existing binaries.
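The size-bit scheme just described can be sketched as a table lookup (a toy model ; the default 8/16/32/64 decode comes from the text, the function name and calling convention are mine) :

```python
SIZE_LUT = [8, 16, 32, 64]   # default decode of the 2 size bits

def operand_size(simd_bit, size_bits, lut=SIZE_LUT):
    """Decode the 3 size bits described in the text: one SIMD
    flag plus a 2-bit index into a (user-reconfigurable) size
    table, so e.g. 128-bit chunks could be added by changing
    the table, without new encodings or recompiled binaries."""
    return lut[size_bits], bool(simd_bit)

print(operand_size(1, 2))  # (32, True): 32-bit chunks, SIMD mode
```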


//  PALcalls specify one of a few dozen complex operations to be performed.

The F-CPU uses another mechanism, closer to the Intel world, for accessing
hardware features : Special Registers (a bit like the MSRs introduced with the Pentium chips).
These become critical because when the pipeline doesn't allow us to design
or schedule an operation safely (such as semaphores, virtual memory management
or dangerous configuration things), it is left to the Special Registers ("SR").
First advantage : it isolates the SW from the HW ; second : it isolates
the pipeline from any hazard or complex scheduling (ie : semaphores),
because the get and put instructions (which access the SRs) block the pipeline.
The rest is managed the good old way, through protection bits.

//  Conditional branches test register RA and specify a signed 21-bit
//  PC-relative longword target displacement. Subroutine calls put the return
//  address in RA. 

Well, this is not exactly the way it is done, IIRC, and the F-CPU treats memory
accesses (instructions + data) in an even more cautious way ! A strange mix
of laziness and fear of dangerous/complex scheduling. And the best way to
optimize something is to not do it ;-)

The branches in the F-CPU seem restrictive : check a condition,
save the CIP (Current Instruction Pointer, "PC" for some people), jump
to the target. However, the fear is that during a loop, for example,
the branch overhead becomes ridiculously useless ! In a usual computer,
one computes the jump target with a small immediate field in the instruction,
thus adding to the latency of the jump, and i don't even speak about the virtual
memory issues. The Alpha proposed some nice solutions but i was still
unsatisfied. The cherry on the cake is that the addition (immediate+CIP)
is performed for every loop iteration even though this value does not change !

F-CPU only branches to registers. Of course, the real game is to prefetch
the register as soon as possible, which adds one more spice to the art of
programming for Y2K CPUs. But at least, now, the address is not uselessly
recomputed on every loop iteration.

For the record, the F-CPU team had explored some features found in DSPs
and particularly loop stacks (see the Address Generators of the Analog Devices
SHARC DSP). Unfortunately these features increased the complexity of the core
and new cores would have a hard time trying to emulate these features.
Finally, the simplest way is the best, at the cost of some tiny latency
for each loop in the FC0. Future complex cores will (certainly) remove
this one-cycle overhead.

//  Loads and stores move longwords or quadwords between RA and memory, using 
//  RB plus a signed 16-bit displacement as the memory address.

Nice, but they had to add word and byte loads and stores later.

F-CPU loads and stores any data size, in big or little endian
(there's a flag in the instruction). These instructions also add
"stream hint" flags so the data flows can be separated in SW,
removing some overhead in the HW. Checking the pointers can be
awful when a very wide superscalar core must be designed, and these
hints will help the memory system.

And again, the only supported mode is "direct post-incremented" :
one pointer register points to the data, with
a source/destination register and an increment register.
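A minimal sketch of this "direct post-incremented" mode (illustrative Python, not F-CPU semantics in full detail ; the register names are positional and purely hypothetical) :

```python
def load_post_inc(mem, regs, dest, ptr, inc):
    """Direct post-incremented load, as described in the text:
    read memory at regs[ptr] into regs[dest], then add regs[inc]
    to the pointer for the next access."""
    regs[dest] = mem[regs[ptr]]
    regs[ptr] += regs[inc]

mem  = {0x100: 11, 0x108: 22}
regs = {"r1": 0, "r2": 0x100, "r3": 8}   # dest, pointer, stride
load_post_inc(mem, regs, "r1", "r2", "r3")
print(regs["r1"], hex(regs["r2"]))  # 11 0x108
```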

//  Operates use source registers RA and RB, writing result register RC. There 
//  is an extended opcode in the 11-bit function field. Integer operates can use 
//  the RB field and part of the function field to specify an 8-bit 
//  zero-extended literal.

in F-CPU :

RA is a source data, it can also be used as a condition.
RB is a source data, it can also be used as a pointer.
RC is the destination register, it can also be used as a third implicit source.

8-bit immediates overlap RA (and 2 other bits) (for load and store, for example),
16-bit immediates overlap RA and RB, RC is used as source or destination
(for storing or retrieving Special Registers, for example).

//  INSTRUCTIONS

//  PALcall Instructions

//  The Privileged Architecture Library call instructions specify one of a few
//  dozen complex functions to be performed. These functions deal with
//  interrupts and exceptions, task switching, virtual memory, and other
//  complex operations that must be done atomically. PALcall instructions
//  vector to a privileged library of software subroutines (using the same Alpha 
//  instruction set) that implement an operating-system-specific set of these 
//  complex operations. 

All this is done through SRs in the F-CPU. Special Registers are defined
within a private, unlimited address space inside the CPU. The SR map
will evolve as the F-CPU does. They control the protections, the virtual
memory, the task contexts and the semaphores ; they contain hardwired
information about the chip's capabilities ; they ensure compatibility and
scalability with a reduced SW overhead.

All the instructions in the F-CPU are atomic and execute in a predictable
number of clock cycles : there is no complex scheduling,
otherwise the function is mapped to the Special Registers.
There, complex operations are performed directly in HW with
implementation-specific and optimized circuits (particularly
for semaphores and integrated HW) just like a coprocessor.

//  Branch Instructions

//  Conditional branch instructions can test a register for positive/negative
//  or for zero/nonzero. They can also test integer registers for even/odd. 
//  Unconditional branch instructions can write a return address into a 
//  register. There is also a calculated jump instruction the branches to an 
//  arbitrary 64-bit address in a register.

Branch instructions in the F-CPU test (with reduced overhead, that is :
without explicitly reading the specified registers) the MSB, the LSB,
and whether the register is zero or not (this is almost exactly like
the ALPHA). A fourth condition is still reserved,
potentially testing NaN (error) when FP is implemented.
This kind of instruction looks very similar to other RISC CPUs,
but the way it is done is very different in the F-CPU.
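The three cheap tests can be sketched like this (a Python illustration of what the branch hardware checks, not actual F-CPU code) :

```python
def branch_conditions(reg, width=64):
    """The three cheap tests a conditional branch can do without
    a full compare: MSB (sign), LSB (even/odd), and zero."""
    msb  = (reg >> (width - 1)) & 1
    lsb  = reg & 1
    zero = reg == 0
    return msb, lsb, zero

# negative (MSB set) and odd (LSB set), not zero
print(branch_conditions(0x8000000000000001))  # (1, 1, False)
```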


//  Load/Store Instructions

//  Load and store instructions can move either 32- or 64-bit aligned
//  quantities. The VAX floating-point load/store instructions swap words to
//  give a consistent register format for floats. Memory addresses are flat
//  64-bit virtual addresses, with no segmentation. A 32-bit integer datum is
//  placed in a register in a canonical form that makes 33 copies of the high
//  bit of the datum. A 32-bit floating datum is placed in a register in a
//  canonical form that extends the exponent by 3 bits and extends the fraction
//  with 29 low-order zeros. 32-bit operates preserve these canonical forms. 

These canonical forms have advantages and drawbacks. In the end, they are not
used in the F-CPU. We are too used to representing bytes as 8-bit values,
after all, so we don't perform sign extension. And a pointer uses a whole
register (whatever its size).
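For reference, the Alpha canonical form quoted above (33 copies of the high bit of a 32-bit datum in a 64-bit register) is just sign extension, as this small sketch shows :

```python
def alpha_canonical32(x):
    """Alpha's canonical form for a 32-bit integer in a 64-bit
    register: bits 63..31 are all copies of bit 31, i.e. 33
    copies of the sign bit (F-CPU skips this entirely)."""
    x &= 0xFFFFFFFF
    if x & 0x80000000:              # sign bit set: extend with ones
        x |= 0xFFFFFFFF00000000
    return x

print(hex(alpha_canonical32(0x80000000)))  # 0xffffffff80000000
print(hex(alpha_canonical32(0x7FFFFFFF)))  # 0x7fffffff
```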

//  There are no 8- or 16-bit load/store instructions, but there are facilities 
//  for doing byte manipulation in registers.

The second is often really nice, but not the first. The missing
load/store sizes were finally added later.

//  Alpha has no 32/64 mode bit or other such device. Compilers, as directed by 
//  user declarations, can generate any mixture of 32- and 64-bit operations.

but they had performance hits with Windows NT :-P
F-CPU handles ALL data sizes equal to or above one byte.

//  Integer Operate Instructions

//  The integer operate instructions manipulate full 64-bit values, and include
//  the usual assortment of arithmetic, compare, logical, and shift
//  instructions. There are just three 32-bit integer operates: add, subtract,
//  and multiply. These differ from their 64-bit counterparts ONLY in overflow
//  detection and in producing 32-bit canonical results. 

geez. F-CPU operates on 8, 16, 32 and 64 bit data "chunks". These "chunks" can
be "packed" in SIMD words of any width (a power of two, >=64 bits).
This applies to all significant instructions, except bit-to-bit operations
(ROP2 and MUX) and move, of course.

//  There is no integer divide instruction.

Michael is (still) trying to make a divide unit, but the particular pipeline
of the current implementation (FC0) makes his work difficult.
The instruction is defined in the opcode map anyway, even if it must
be emulated.

//  In addition to the operations found in conventional RISC architectures,
//  there are scaled add/subtract for quick subscript calculation, 128-bit
//  multiply for division by a constant and multiprecision arithmetic,
//  conditional moves for avoiding branches, and an extensive set of
//  in-register byte manipulation instructions for avoiding single-byte writes.

One nice feature present in the ALPHA is the "scaled addition" :
addition with shift. However, it does not scale well...
F-CPU takes a small hit on constant multiplies, but the revenge
comes from the integer multiply unit, which is optimized as much as possible
(and might eat up some silicon).

//  Rather then keeping a global state bit for integer overflow trap enable,
//  the enable is encoded in the function field of each instruction. Thus, both
//  ADDQ/V and ADDQ opcodes exist for specifying 64-bit add with and without
//  overflow checking. This makes pipelined implementations easier.

F-CPU has no flag at all. Nada. Niet.
If you want to check for overflow on add/sub, use the 2R2W form
which writes the "high" part of the result (carry/borrow bit) to a fourth register
and test it. The ALPHA designers themselves would agree :-P
(if they had had the possibility to add enough register ports)
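A sketch of the 2R2W flagless add described above (Python illustration ; F-CPU mnemonics are not shown here) : the wrapped sum goes to one destination register and the carry-out to the other, so there is no global flags register to serialize on.

```python
def add_2r2w(a, b, width=64):
    """2R2W add: the low part of the result goes to one register
    and the carry/borrow bit to a second one, replacing a global
    overflow flag."""
    mask = (1 << width) - 1
    full = (a & mask) + (b & mask)
    return full & mask, full >> width   # (sum, carry)

s, c = add_2r2w(2**64 - 1, 1)   # max unsigned + 1 wraps to 0, carry 1
print(s, c)  # 0 1
```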

//  Floating-point Operate Instructions

Not deeply examined yet. We have other problems to solve first.

//  The floating operate instructions include four complete sets of VAX and
//  IEEE arithmetic, plus conversions between float and integer. 

Only IEEE is going to be supported, in 32- and 64-bit modes, SIMD if possible.

//  There is no floating square root instruction.

We intend to provide "seed" generation for accelerating Newton-Raphson
computations of FDIV and FSQRT.

//  In addition to the operations found in conventional RISC architectures, 
//  there are conditional moves for avoiding branches, and merge sign/exponent 
//  instructions for simple field manipulation.

Merge sign is done in the (merged) integer core,
most likely with the bitwise MUX instruction.

//  Rather then keeping global state bits for arithmetic trap enables and
//  rounding mode, these enable and mode bits are encoded in the function field
//  of each instruction. 

The F-CPU FP instructions contain an IEEE compliance flag,
as a compromise between compliance and speed. This flag is
set to "compliant" by default, and this impacts the pipeline,
which cannot issue another operation before it is sure that no
trap has occurred. When the flag is cleared, IEEE compliance
is not guaranteed, traps are ignored and numerical stability
may suffer, but the pipeline remains "clean".
Errors can then be tested with a conditional jump on NaN.

//  SIGNIFICANT DIFFERENCES BETWEEN ALPHA AND CONVENTIONAL RISC PROCESSORS

//  First, Alpha is a true 64-bit architecture, with a minimal number of 32-bit 
//  instructions. It is not a 32-bit architecture that was later expanded to 64
//  bits. 

We can almost say that about the F-CPU, except ... that it was not
designed to remain a 64-bit CPU. 64 bits is the base, standard width,
and wider extensions will be transparent (as long as the coding rules are respected).

//  Second, Alpha was designed to allow very high-speed implementations.

F-CPU even more :-)
But we have a handicap : because we don't have our own foundries and
a large development team, we can't tailor "full custom" chips. We take
a clock "hit" of at least 30%, so we have to be REALLY fast if we want to
compensate for the handicap.

//  The instructions are very simple (no load-four-registers-unaligned-and-check-
//  for-bytes-of-zero). There are no special registers that would prevent
//  pipelining multiple instances of the same operations (no MQ register and no
//  condition codes). The instructions interact with each other ONLY by one
//  instruction writing a register or memory, and another one reading from the
//  same place. This makes it particularly easy to build implementations that
//  issue multiple instructions every CPU cycle. (The first implementation
//  in fact issues two instructions every cycle.)

Notice : like every superscalar CPU, ALPHA must respect "pairing rules"
in order to "emit" 2 instructions per cycle.
We start with 1 instruction/cycle in FC0, but F-CPU can scale as much as,
and probably more than, ALPHA.

//  There are no
//  implementation-specific pipeline timing hazards, no load-delay slots, and
//  no branch-delay slots. These features would make it difficult to maintain
//  binary compatibility across multiple implementations and difficult to
//  maintain full speed on multiple-issue implementations. 

There is no secret to compatibility ;-)
 
//  Alpha is unconventional in the approach to byte manipulation. Single-byte
//  stores found in conventional RISC architectures force cache and memory
//  implementations to include byte shift-and-mask logic, and sequencer logic
//  to perform read-modify-write on memory words.

Not in the F-CPU.
The FC0 takes the cache OUT of the "execution pipeline",
so we can add shifts and more to the instruction without compromising
either bandwidth or data integrity.

//  This approach is awkward to
//  implement quickly, and tends to slow down cache access to normal 32- or
//  64-bit aligned quantities. It also makes it awkward to build a high-speed
//  error-correcting write-back cache, which is often needed to keep a very
//  fast RISC implementation busy. It also can make it difficult to pipeline
//  multiple byte operations. 

Not in FC0.
Using the register addressing mode, it becomes straightforward to take
the cache out of the critical datapath and to shield the pipeline
from page fault traps.

//  Instead, the byte shifting and masking is done in Alpha with normal 64-bit
//  register-to-register instructions, crafted to keep the sequences short.

hmmm, this family of instructions is often very useful anyway !

//  Alpha is also unconventional in the approach to arithmetic traps. In
//  contrast to conventional RISC architectures, Alpha arithmetic traps
//  (overflow, underflow, etc.) are imprecise -- they can be delivered an
//  arbitrary number of instructions after the instruction that triggered the
//  trap, and traps from many different instructions can be reported at once.
//  This makes implementations that use pipelining and multiple issue
//  substantially easier to build. 

FC0 issues in order all the instructions that are considered valid.
An instruction that may trap must NEVER enter the pipeline (with the OOO
completion, rewinding the pipeline would become a nightmare !!!).

All exceptions are precise in FC0 (and hopefully in the F-CPU).
All the instructions are designed to never trigger a trap after issue.

//  If precise arithmetic exceptions are desired, trap barrier instructions can
//  be explicitly inserted in the program to force traps to be delivered at
//  specific points. 

No need for them. But there is a "barrier" instruction that can do that anyway
(it flushes the pipeline, "serialises" the issue and waits until all pending operations are completed...).

//  Alpha is also unconventional in the approach to multiprocessor shared
//  memory. As viewed from a second processor (including an I/O device), a 
//  sequence of reads and writes issued by one processor may be arbitrarily 
//  reordered by an implementation. This allows implementations to use 
//  multi-bank caches, bypassed write buffers, write merging, pipelined writes 
//  with retry on error, etc. If strict ordering between two accesses must be
//  maintained, memory barrier instructions can be explicitly inserted in the
//  program. 

Currently, FC0 uses two address spaces : a "private" space where the CPU
is the only master (cache coherency is straightforward) and the "outside world"
where all accesses are in-order and uncacheable. The mapping of these
address spaces is configurable and remapped by the OS through virtual
memory pages. This is the easiest way to design a fast but scalable circuit
that can be assembled like LEGO blocks into cheap parallel systems (as long
as you care about the interconnection bandwidth and latency :-/).
Other implementations might use a different approach,
but this one is preferred if one wants to build scalable multi-CPU systems.

//  The basic multiprocessor interlocking primitive is a RISC-style
//  load_locked, modify, store_conditional sequence.

Don't do that in the F-CPU ! The load/store instructions are designed
to handle data in packets. Individual word access is allowed, but
semaphores can have a bad impact on the memory system. Use semaphores
that are mapped in the SRs ! This way, any kind of specialized semaphore
device can be optimized for the system and the application, without
triggering spurious and costly page flushes etc...

//  If the sequence runs
//  without interrupt, exception, or an interfering write from another
//  processor, then the conditional store succeeds. Otherwise, the store fails
//  and the program eventually must branch back and retry the sequence. This
//  style of interlocking scales well with very fast caches, and makes Alpha an
//  especially attractive architecture for building multiple-processor systems.

However this involves one "hidden" flag, and only one semaphore can be tested
at a time. How short-sighted ! :-/

//  Alpha includes a number of HINTS for implementations, all aimed at allowing 
//  higher speed. Calculated jumps have a target hint that can allow much 
//  faster subroutine calls and returns.

F-CPU does that in an even simpler way. I'm proud of this part too ;-)

//  There are prefetching hints for the 
//  memory system that can allow much higher cache hit rates.

F-CPU only works with prefetching and by "associating" a cache line to a register.
If your compiler "understands" jump targets and data prefetching, you have already
unlocked one half of the performance.

//  There are also
//  granularity hints for the virtual-address mapping that can allow much more 
//  effective use of translation lookaside buffers for big contiguous 
//  structures.

This issue is also addressed in the F-CPU. Though it is not yet completely
defined, page table entries can gather statistics that allow the OS to choose
the best page size for each location.

//  Alpha includes a very flexible privileged library of software for operating-
//  system-specific operations, invoked with PALcalls. This library allows Alpha
//  to run full VMS using one version of this software library that mirrors many
//  of the VAX operating-system features, and to run OSF/1 using a different
//  version that mirrors many of the MIPS operating-system features, and
//  similarly for NT. Other versions could be tailored for real-time, teaching,
//  etc. The PALcalls allow Alpha to run VMS with hardly more hardware than
//  a conventional RISC machine has (the PAL mode bit itself, plus 4 extra
//  protection bits in each TB entry). This library makes Alpha an especially
//  attractive architecture for multiple operating systems. 

I am still wondering whether PALcode is such a good idea.
We have long been used to rewriting the trap handlers for each new computer.
Maybe this idea came from the VMS transition constraint, but there is
no need for this stuff in the F-CPU.

(to this, Michael Riepe answered :)
> We can implement a `PALcall' SR if we have to.

and what would we do with it ? :-)

The PALcode is probably a good idea for the Alpha, but from an F-CPU
point of view, it goes against the logic.

In the DEC world, you buy one expensive ALPHA computer and every 6 months or so,
the maintenance service sends you a pile of CDs with updates, including PALcode
changes/patches. Nice, because there are several ALPHA families and even more
members (i.e. 21064, 21064A, 21066, 21068, ...), so Digital can manage to make
one PALcode revision for every member. It controls everything (R&D, market) so it can.

In the F-CPU world, the CPU itself is a "commodity" : you can buy a bunch of
chips (cheap or not, according to what you can access, want, have, and can pay),
which can have different versions and come from a lot of different
vendors/funders. In the current PC industry, a new CPU version comes every
6 months on average, and maybe the F-CPU will shorten this even more (thanks to
increased coopetition and open sourcing). Under these conditions, it is not
realistic to have PALcode : every chip would require an "individual" library.
There would be too many of them and nobody could manage them all.

Furthermore, if one vendor goes out of business and takes the PALcode
away (even though it should release the source code under the GPL in the "ideal
case"), you won't be able to use the chip. The PALcode becomes like a key :
if you don't have it, you won't be able to make your computer work.
On top of that, i imagine that it's the place where companies that don't
want to play the "open" game would put "proprietary features" in order
to make others captive.

Maybe some points are wrong, but i believe that PALcode is not a good
idea here. Maybe i don't completely understand the PALcode philosophy,
but IMHO it does what the OS should do. PALcode "hides" things from the
OS and enables the vendor to include "undocumented" or "chip-specific"
features which break the architecture standards where it hurts.
What appeared like a nice feature could become a weakness
in a world of great misunderstanding of the GPL.

And again, i still don't see the point of PALcode.
The Operating System (GNU/Linux/Hurd/whatever) kernel is meant to
do what PALcode does. Maybe it was an idiom inherited
from the old DEC age (think of the microcode update diskette) ?

//  Finally, Alpha is not strongly
(notice the "strongly" :-D)
//  biased toward only one or two programming 
//  languages. It is an attractive architecture for compiling at least a dozen 
//  different languages.

F-CPU is not "biased" toward anything other than the algorithm and performance.
It puts an even heavier burden on the compiler, but crappy SW can still work.

The question of the language is still burning : it is so difficult to create an
"optimising compiler" that either the CPU is biased toward the language (like
the early RISC computers) or the language must be biased toward the architecture
(we think about adding one or two keywords to the C syntax). However
this discussion makes us forget that a CPU does not execute a language
but an algorithm.

//  SUMMARY

//  Alpha is designed to be a leadership 64-bit architecture.

but it is dying :-(

//  --------------------
//      Specifications (150MHz version).
//      Process Technology          .75 micron CMOS 
//      Cycle Time                   150 MHz (6.6 ns)
//      Die Size                     13.9mm x 16.8mm
//      Transistor Count             1.68 million
//      Package                      431 pin PGA
//      Number of Signal Pins        291
//      Power Dissipation            23 W at 6.6 ns cycle
//      Power Supply                 3.3 volts
//      Clocking Input               300 MHz differential 
//      On-chip D-cache              8 Kbyte, physical, direct-mapped,
//                                   write-through, 32-byte line, 32-byte fill
//      On-chip I-cache              8 Kbyte, physical, direct-mapped,
//                                   32-byte line, 32-byte fill, 64 ASNs
//      On-chip DTB                  32-entry; fully-associative; 8-Kbyte,
//                                   64-Kbyte, 256-Kbyte, 4-Mbyte page sizes
//      On-chip ITB                  8-entry, fully associative, 8-Kbyte page
//                                   plus 4-entry, fully-associative, 4-Mbyte page
//      Floating Point Unit          On-chip FPU supports both IEEE and VAX
//                                   floating point
//      Bus                          Separate data and address bus.
//                                   128-bit/64-bit data bus
//      Serial ROM Interface         Allows the chip to directly
//                                   access serial ROM
//      Virtual Address Size         64 bits checked; 43 bits
//                                   implemented
//      Physical Address Size        34 bits implemented
//      Page Size                    8 Kbytes
//      Issue Rate                   2 instructions per cycle to A-box,
//                                   E-box, or F-box
//      Integer Pipeline             7-stage pipeline
//      Floating Pipeline            10-stage pipeline

It is not realistic to compare the 21064 with an architecture that has not yet
been implemented.


Conclusion :

If the Alpha had not existed, the F-CPU would be a bit different.
Its experience helped the F-CPU become an even better RISC architecture.
Each new architecture brings its load of enhancements, features and concepts.
However, since ALPHA and F-CPU both aim at the same kinds of goals (performance
and longevity), one must admit that there are not many solutions
("all roads lead to Rome").
A lot of experience about performance comes from other specific branches such as
DSP, ASIC and supercomputing. Longevity, however, is not an exact science.
Let's hope that the experience of the Alpha will help the F-CPU become
a worthy successor.


PS :
If you think that i have only scratched the surface, it's true.
The F-CPU manual is constantly being written and can be downloaded from
http://f-cpu.gaos.org ; it already contains more than 170 pages and it is
only the beginning !

PS2 :
i have presented here general F-CPU architectural features; the
FC0 ("F-CPU core #0") introduces a lot of techniques that are too
complex for a superficial text like this. Read the introduction in the
F-CPU manual for more information, or have a look at a wannabe-paper
at http://www.f-cpu.de/epf2001/

PS3 :
For more information about the ALPHA, read the Alpha-HOWTO
or the references cited in the text, or search through google.com.
