
[f-cpu] some technical stuff (yeah :-D)



Hi !

i got 4 technical mails from Andreas. You know,
as a technician, i'm getting bored of doing almost
everything (fortunately, it has been getting better recently).
Here are some answers, off the top of my (tired) head.

Note that after some editing, these answers can be
put into the manual.

> Subject:  [f-cpu] Why we use a scoreboard?
>
> Hello,
> 
> During my lecture about F-CPU there were some questions (see other mails)...
> 
> In preparation I also read the "Taschenbuch
> Mikroprozessortechnik" (Beierlein & Hagenbruch, Fachbuchverlag Leipzig,
> ISBN: 3-446-21049-0) and on page 100 I found an explanation of
> scoreboards.
> 
> It claims that scoreboards are an outdated technology and that their
> extension or evolution is the "Tomasulo" algorithm, which
> combines a reorder buffer with elements of a scoreboard into
> "reservation stations": the list elements of a ring buffer
> store the necessary addresses, data and commit attributes:
> 
> operation || source1 | data1 | valid1 || source2| data2 | valid2 || destination
> 
> Instructions in reservation stations (list elements) whose data are all
> valid can be executed by the scheduler/dispatcher.
> 
> Could you explain whether there are significant differences, and explain the
> Tomasulo algorithm a little more clearly?
> 
> Bye Andreas

i don't know if i can "explain" Tomasulo. Maybe other books (e.g. P&H)
could answer better than me.
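
As a rough illustration only, the reservation-station entry quoted above
(operation, source/data/valid for each operand, destination) could be
modelled as below. The class names, the tag "mul0" and the broadcast()
helper are hypothetical, and this is a sketch of the general idea, not of
any particular IBM implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Operand:
    source: Optional[str]  # tag of the unit still producing the value, or None
    data: Optional[int]    # the value itself, once it is known
    valid: bool            # True when data can be used


@dataclass
class Station:
    operation: str
    op1: Operand
    op2: Operand
    destination: str

    def ready(self) -> bool:
        # An entry may be dispatched once both operands are valid.
        return self.op1.valid and self.op2.valid


def broadcast(stations: List[Station], tag: str, value: int) -> None:
    """A finished unit publishes (tag, value); waiting operands capture it."""
    for st in stations:
        for op in (st.op1, st.op2):
            if not op.valid and op.source == tag:
                op.data, op.valid, op.source = value, True, None


# Hypothetical entry: "add" waits for the result of a multiplier tagged "mul0".
stations = [
    Station("add", Operand(None, 5, True), Operand("mul0", None, False), "r3"),
]
before = stations[0].ready()
broadcast(stations, "mul0", 42)
after = stations[0].ready()
```

The point of the sketch is that each entry waits on its own operands, so
the bookkeeping is spread over the stations instead of sitting in one table.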

The differences between Tomasulo and a scoreboard are these :
 - Tomasulo is not "centralized" as a scoreboard is.
    -> Tomasulo is the name of the engineer (@IBM) who came up with
       this scheduling technique.
  You know that IBM's computers are not masterpieces of simplicity.
  A scoreboard, however, is well suited to a simple RISC machine, such
  as the MIPS R2000. Scoreboards were used in the 60's in Seymour Cray's
  designs such as the CDC6600, which is a really interesting machine.
  Since FC0 is a "simple RISC machine", Tomasulo is not needed.
  Tomasulo is interesting when there are several execution units and
  several instructions with unknown latencies. FC0 has predictable
  latencies (which makes F-CPU easier to design).
 - AFAIK Tomasulo is centered around the execution units, while the
  scoreboard is register-centric. FC0 is in the second case, even though
  it can finish instructions out of order. Because
  there is only one instruction flow, a scoreboard is easy to understand,
  create and update, and it simplifies the scheduling in a compiler.
 - The exception handling technique developed for FC0 requires
  a very predictable, simple and accurate method to track the registers.
  The scoreboard appeared to be the most suitable technique in this case,
  because the information is handled in only one place.
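
To make "register-centric" a bit more concrete, here is a minimal sketch
of a scoreboard with fixed, known latencies. It is not the FC0 design:
the register count (64), the latencies and the in-order, one-per-cycle
issue policy are all assumptions made for the example.

```python
class Scoreboard:
    """One counter per register: cycles left until its pending write lands."""

    def __init__(self, num_regs=64):  # 64 registers is an assumption
        self.pending = [0] * num_regs

    def can_issue(self, srcs, dst):
        # Issue only if no source and no destination is still being written.
        return all(self.pending[r] == 0 for r in (*srcs, dst))

    def issue(self, dst, latency):
        self.pending[dst] = latency

    def tick(self):
        self.pending = [max(0, c - 1) for c in self.pending]


def run(program, num_regs=64):
    """program: list of (dst, srcs, latency). Returns each issue cycle."""
    sb = Scoreboard(num_regs)
    cycle = 0
    issued = []
    for dst, srcs, latency in program:
        while not sb.can_issue(srcs, dst):  # stall until registers are free
            sb.tick()
            cycle += 1
        sb.issue(dst, latency)
        issued.append(cycle)
        sb.tick()
        cycle += 1
    return issued


# Hypothetical program: r1 = f(r2, r3) takes 3 cycles, r4 = g(r1, r5) needs r1,
# r6 = h(r7, r8) is independent.
issue_cycles = run([(1, (2, 3), 3), (4, (1, 5), 1), (6, (7, 8), 1)])
```

All the state lives in one table indexed by register number, which is
why it is easy to reason about when the latencies are predictable.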

These points are personal opinions. They are not a bible and they will
probably change when other F-CPU cores appear.

> Hello,
> 
> IMHO it is possible to simply include power-saving functions in F-CPU.
> We only need a special instruction and a programmable register/flag which
> switches units on or off (put them in tristate or decouple them from the
> bus and vcc)...
> 
> All units should be easy to extend with this feature, shouldn't they?
> 
> What is your opinion about this?
> 
> Bye Andreas

Power saving is not the #1 priority. The #1 priority is to make it work ;-)
The next priority is to make it work fast and not too expensive, then
we make it less power hungry.
Concerning the hardware means, i am not a specialist. I count on a
programmable clock divider to reduce the speed, and hence the power
consumption (in a CMOS process). This divider is controlled through
a Special Register that is accessible only in kernel mode.

However i can affirm that "relative power consumption"
is reduced when the CPU is correctly programmed. It is a fact that
when the binary is well scheduled and efficient, the CPU spends fewer
clock cycles performing the task, so the energy spent per task is reduced.
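
A back-of-the-envelope sketch of both effects, using the usual first-order
model for dynamic CMOS power, P = C * V^2 * f. All the numbers below
(capacitance, voltage, base clock, cycle counts) are made up for the
example, and leakage is ignored.

```python
def dynamic_power(c_eff, vdd, freq_hz):
    """First-order dynamic CMOS power in watts: P = C * V^2 * f."""
    return c_eff * vdd ** 2 * freq_hz


def task_energy(c_eff, vdd, base_freq_hz, divider, cycles):
    """Energy in joules for a task of `cycles` clock cycles."""
    freq = base_freq_hz / divider           # the clock divider lowers f
    power = dynamic_power(c_eff, vdd, freq)
    seconds = cycles / freq
    return power * seconds


# Hypothetical chip: 1 nF switched capacitance, 2 V supply, 100 MHz base clock.
C_EFF, VDD, F_BASE = 1e-9, 2.0, 100e6

p_full = dynamic_power(C_EFF, VDD, F_BASE)        # divider = 1
p_div4 = dynamic_power(C_EFF, VDD, F_BASE / 4)    # divider = 4

e_naive = task_energy(C_EFF, VDD, F_BASE, 1, 1_000_000)  # sloppy binary
e_tuned = task_energy(C_EFF, VDD, F_BASE, 1, 500_000)    # well-scheduled binary
e_slow = task_energy(C_EFF, VDD, F_BASE, 4, 1_000_000)   # sloppy binary, slow clock
```

In this simple model the divider cuts the power by 4 but not the energy
of a given task (it just takes 4 times longer at the same voltage), while
halving the cycle count halves the energy.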


> Hello,
> 
> is it right that we target only 6 logic layers (stages) of gates
> and that a gate must not have more than 4 inputs (fan-in)?
This is an approximation. The number is reduced further when we expect
more fanout and/or long wires.

> What about the fan-out?
A rough estimate for ASIC/full custom technologies is rather low : 2 to 4,
but when required, the output transistor can be resized to match the current
and slew rate.
FPGAs and similar devices can have a fan-out of 8, but the more fanout,
the less speed.
So the upper limit is 8 gates. When more fanout is needed, a balanced tree
of buffers is used, each buffer level counting as 1 gate.
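
To get a feeling for what the buffer tree costs, here is a small sketch
that counts how many levels of buffers are needed for a given number of
loads. The function name and the idea of subtracting the levels from the
6-gate budget are my own assumptions for the example.

```python
def buffer_levels(loads, max_fanout=8):
    """Levels of buffers needed so that no output drives more than max_fanout."""
    if loads < 1 or max_fanout < 2:
        raise ValueError("need loads >= 1 and max_fanout >= 2")
    levels = 0
    reach = max_fanout        # loads one driver can reach without buffers
    while reach < loads:
        reach *= max_fanout   # each buffer level multiplies the reach
        levels += 1
    return levels


GATE_BUDGET = 6  # logic depth per stage, taken from the question above


def remaining_depth(loads, max_fanout=8):
    """Gates left for real logic once the buffer tree is paid for."""
    return GATE_BUDGET - buffer_levels(loads, max_fanout)
```

For example, driving 64 loads with a fan-out of 8 costs one buffer level,
but with a fan-out of 4 it costs two, which leaves only 4 gates for logic.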

> What other limitations do we have?
common sense :-)
it applies to gate count, documentation/readability etc.

> Background: I am trying again to create the shifting unit.
good luck :-) i hope that a solution will be found.

> Bye Andreas

> Hello,
> 
> first, can anybody explain to me how the division-by-zero exception
> works in a SIMD division? Will all the data in, e.g., the 8-bit chunks
> be trashed? Or is this a problem for the exception handler routine?

When the handler routine runs, it's already too late : other instructions
have probably already been sent down the pipeline, and it could double-fault
or something like that.

Even though i missed the problem of SIMD idiv at first,
the principle is still the same : we have to prepare the appropriate
information in advance.
For the scalar division, the register set holds a local copy of the
OR of all the bits, so that a condition code can tell whether the register
is zero or not. Depending on the operand size (8,16,32,64)
we select one result. Because we can have partial results,
we have to store partial OR results of 8-bit chunks.
What SIDIV (SIMD idiv) changes is that we must add some other gates to select
and OR the results of the partial selections.
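
In software terms, the partial-OR idea looks roughly like the sketch
below, assuming a 64-bit register. The function names are hypothetical
and the real thing is a handful of gates next to the register set, not
a loop.

```python
def chunk_flags(value):
    """Partial ORs: one flag per 8-bit chunk of a 64-bit register (True = nonzero)."""
    return [((value >> (8 * i)) & 0xFF) != 0 for i in range(8)]


def lane_is_zero(flags, size_bits):
    """Combine the chunk flags per lane for an operand size of 8/16/32/64 bits."""
    if size_bits not in (8, 16, 32, 64):
        raise ValueError("unsupported operand size")
    per_lane = size_bits // 8
    return [not any(flags[i:i + per_lane]) for i in range(0, 8, per_lane)]


def simd_div_by_zero(divisor, size_bits):
    """True if at least one lane of the divisor is zero (OR of the lane results)."""
    return any(lane_is_zero(chunk_flags(divisor), size_bits))


# Example divisor: only bytes 1 and 6 are nonzero.
divisor = 0x0001_0000_0000_0100
```

With this divisor, the 64-bit and 32-bit views contain no zero lane, but
the 16-bit and 8-bit views do, so only the narrower SIMD divisions would
raise the exception. The flags are available before the division is
issued, which is the "prepare in advance" part.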

> second, I read that the F-CPU can be scaled up to 128 bits and so on, and
> that the units are doubled... What about the Xbar? Will it handle 128 bits,
> or is it also doubled? If it is 128 bits, do we need 4 cycles to handle it?
> Does a scaled CPU also mean two or more pipelines? Will the register size
> scale up or will the number of registers double? Or have I misread something?

If the data are 128-bit, it is natural that the Xbar bus is also 128-bit wide.
The same goes for 32 bits or 256 bits... Xbar_width = Register_width = EU_width .

However, if you want to make a 4096-bit CPU, chances are that
the chip will not allow such wide registers. You'll be forced to use a "vector"
model where the register is sent to the narrower Xbar and EU in a pipelined
fashion, but the scheduler must then be completely redesigned.
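
A toy model of that "vector" approach, assuming a bitwise operation (where
the slices are independent of each other). The widths and function names
are hypothetical; this only shows the slicing, not the scheduling problem.

```python
def transfer_cycles(register_bits, xbar_bits):
    """Slices needed to push one wide register through a narrower Xbar."""
    return -(-register_bits // xbar_bits)  # ceiling division


def chunked_and(a, b, register_bits, xbar_bits):
    """Bitwise AND of two wide registers, one Xbar-wide slice per step."""
    mask = (1 << xbar_bits) - 1
    result = 0
    for i in range(transfer_cycles(register_bits, xbar_bits)):
        shift = i * xbar_bits
        part = ((a >> shift) & mask) & ((b >> shift) & mask)  # one slice in the EU
        result |= part << shift
    return result


# Hypothetical 4096-bit registers on a 256-bit Xbar/EU.
REG_BITS, XBAR_BITS = 4096, 256
a = (1 << 4095) | 0xF0F0
b = (1 << 4095) | 0xFF00
wide_and = chunked_and(a, b, REG_BITS, XBAR_BITS)
```

Here one register needs 16 slices, so a single instruction occupies the
EU for many cycles, which is why the scoreboard-style scheduling with
short fixed latencies no longer fits.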

> Hello,
> 
> I changed some links in manual at CVS, please check the log/history.
> I added also the french brochure into CVS (only the html-part) at
> fcpu/doc/brochure/french.

Since "nobody reads the manual", we need to translate this brochure
and keep it up to date. If we can do that before HAL in August, it will be
easier and faster to present the F-CPU to newcomers.

> I think we should not split the project goals across too many
> documents; we should concentrate on the manual.
> 
> What do you think?

The F-CPU manual is the cornerstone of the project.
We are lucky that the project started with the documentation,
before any code was written. This is a good sign.

However the brochure is necessary because newcomers are often afraid
of the complexity of our project. It is difficult to understand how
it is organised, and i fail miserably to update the websites on time.


> Bye Andreas

Note for later : we have to add an instruction for the ROP2 unit :
SELECT, which is a 2-input multiplexer (3r1w instruction). This is
optional but can help reduce the pressure on 'predicate-hungry' code.
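
One possible reading of SELECT, sketched in software : since ROP2 is a
bitwise unit, i assume here that the third source register is a bit mask
choosing between the two data inputs. The operand order, the 64-bit width
and the function names are assumptions, nothing is decided yet.

```python
MASK64 = (1 << 64) - 1  # assumed 64-bit registers


def select(mask, a, b):
    """Bitwise 2-input mux: take bits of a where mask is 1, bits of b elsewhere."""
    return ((a & mask) | (b & ~mask)) & MASK64


def branchless_max(a, b):
    """Example use: pick the larger value without a conditional jump."""
    mask = MASK64 if a > b else 0  # in hardware, a compare would produce this mask
    return select(mask, a, b)


mixed = select(0xFF00, 0x1234, 0xABCD)
```

Three registers are read (mask, a, b) and one is written, which matches
the 3r1w note above.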

WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~