
Re: [f-cpu] How to increase the mip/mhz ratio.



hi nicO !

welcome to the English list :-)

nicO wrote:
> After the IEEE conference on today's processors, I have understood the huge
> total amount of control needed to make a superscalar processor. You need
> to verify coherency with a scoreboard and many comparators, you need a
> retirement unit to put the flow back in program sequence (you could even
> add register renaming if you want).

nothing forces you to do _all_ that.
btw, the _best_ MOPS/MHz computers are vector computers :
1 issue, many identical operations per instruction,
low architectural complexity. still unbeaten for the "heavy" stuff.
i don't say that we must go vector, it's just a "cultural"
reference and a counter-example : a simple architecture can
do great things. Yet, if you run LISP or JAVA stuff, or something
silly like that, vectors won't work and someone will want to
make a stack machine... it's endless.

btw, let's make a good 1-issue CPU first.
we'll care about multiple issue "later", at least
if we ever get one FC0 working.

Do you remember what you said and agreed to ?
"put SH-5 out of the market" :-)

> But how could it be possible to add more MIPS per clock cycle without all
> those wasted resources ? Intel and TI chose VLIW, AMD chose SIMD (as we did).
> VLIW sometimes adds "false" parallelism (duplicated code -> more code
> space...) and makes exceptions very difficult to handle (which slot has
> failed ?). And SIMD isn't always possible.
> 
> Then I had the idea of increasing the number of registers accessed. Break
> the habit of 2 register reads and 1 write. For example, 4 reads and one
> write could be possible.
physically, it is possible. However it is NEVER used.

3-input additions can probably be used in indexing arrays,
but a good recoding (pointer linearisation) can remove that
and we can always end up with 1 index. it's an old bag of tricks.
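
A minimal sketch of what "pointer linearisation" can look like in C, assuming a strided array walk. The function names and the stride/offset parameters are hypothetical, chosen only for illustration. The first version recomputes base + i*stride + offset on every access (the multi-input address add), the second keeps a single running index that is bumped by a constant:

```c
#include <assert.h>
#include <stddef.h>

/* Indexed form: every access computes base + i*stride + offset,
   i.e. an address built from three inputs. */
long sum_indexed(const long *base, size_t n, size_t stride, size_t offset)
{
    assert(base != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += base[i * stride + offset];
    return sum;
}

/* Linearised form: the offset is folded in once, then a single index
   is advanced by a constant, so each address is base + idx only. */
long sum_linearised(const long *base, size_t n, size_t stride, size_t offset)
{
    assert(base != NULL);
    long sum = 0;
    size_t idx = offset;              /* fold the offset in once */
    for (size_t i = 0; i < n; i++) {
        sum += base[idx];
        idx += stride;                /* one 2-input add per iteration */
    }
    return sum;
}
```

Both functions visit the same elements; only the address arithmetic per iteration differs. Compilers often do this rewrite themselves (strength reduction), which is in line with the "old bag of tricks" remark.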

now, a 4-input op is never used. for addition and subtraction : maybe.
multiply ? division ? square root ? do you get it ?
On top of that, the number of combinations/permutations of all the
possibilities is too high, it's better to code that with "atomic"
instructions.

> It costs much more for the access to the register
> bank and the instruction word must be wider, but with the software
> pipelining used to avoid bubbles in the pipeline, a lot of registers are
> used.
what you propose is a wider bank and fewer registers, when it
does the same thing as more registers and a narrower bank.
that gets you no further, and there is nothing thrilling here ;-)

> So we need more space in the instruction word, more access ports in the
> register bank, and a redesign of all execution units (but without really
> increasing their latency).

if at least you had good examples to back up your assertion :-P

now what if your brilliant idea goes superscalar ? (i know you want
to avoid that, but let's imagine ...)

> 2IW (not VLIW) could be another good idea. We just manage the memory
> alongside the usual instruction. We need a read register port, and 2 more
> bits, so a total of 8 bits in the instruction word.

are you speaking about adding 8 bits to the existing instruction ?
40-bit instructions are very difficult to handle. e.g. the ADi DSPs
have 24- and 48-bit wide instructions and it is a REAL PAIN (TM).
don't do that at home, kids, you could get burnt.
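
One commonly cited reason why odd instruction widths hurt (this is my illustration, the post itself does not spell out the reason) is that instructions stop lining up with power-of-two fetch lines. The sketch below counts how many back-to-back instructions straddle a line boundary. The 32-byte line size used in the note afterwards and the function name are assumptions made only for this example:

```c
#include <assert.h>

/* Count how many of the first `count` back-to-back instructions of
   `insn_bytes` bytes each cross a fetch-line boundary of `line_bytes`. */
unsigned straddling_insns(unsigned insn_bytes, unsigned line_bytes,
                          unsigned count)
{
    assert(insn_bytes > 0 && line_bytes >= insn_bytes);
    unsigned straddles = 0;
    for (unsigned i = 0; i < count; i++) {
        unsigned first = i * insn_bytes;          /* first byte of insn i */
        unsigned last  = first + insn_bytes - 1;  /* last byte of insn i  */
        if (first / line_bytes != last / line_bytes)
            straddles++;                          /* spans two lines */
    }
    return straddles;
}
```

With 4-byte (32-bit) instructions and 32-byte lines no instruction ever straddles a line. With 5-byte (40-bit) instructions, 4 out of every 32 do, and each of those needs two line fetches or extra alignment hardware.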

> There are 4 instructions : load/store/no operation/sync. A memory load
> could have a very large latency, so it's often better to preload
> (load data before you need it), so this part of the instruction is
> decorrelated from the other part (asynchronous system). When you absolutely
> need a datum, you just issue a sync memory instruction, to finish all
> current memory operations. There's nothing to change in the units because
> load units are already present. Prefetch could be done with nop and a
> register != R0.

you are missing something : load and store instructions require a full
instruction, not only 8 bits. we need a pointer, a destination, and a
pointer increment. this is already handled well as it stands.
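
The bit budget can be checked with a little arithmetic. This is only a sketch: the figure of 64 registers is an assumption made for the illustration (the post does not give the register count), and the helper names are hypothetical:

```c
#include <assert.h>

/* Number of bits needed to name one register out of `num_regs`
   (ceiling of log2). */
unsigned reg_field_bits(unsigned num_regs)
{
    assert(num_regs > 0 && num_regs <= (1u << 31));
    unsigned bits = 0;
    while ((1u << bits) < num_regs)
        bits++;
    return bits;
}

/* Bits taken by `fields` register operands, before any opcode bits. */
unsigned operand_bits(unsigned num_regs, unsigned fields)
{
    return reg_field_bits(num_regs) * fields;
}
```

Under that assumption, a pointer, a destination and a pointer increment take 3 register fields of 6 bits each, that is 18 bits before any opcode bits are counted, so they cannot fit in an 8-bit side slot.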

> So we could break the 1.1 VAX MIPS/MHz of the Leon without making the
> number of gates per MIPS too high.

not at "any price".
the existing instruction encoding is already flexible enough, i think.
there's still a lot to do at that point.

> But we should add a lock instruction to perform atomic operations to
> manage a semaphore.

remember : semaphores are managed at the SR level.
Load and Store are for data only, otherwise you mess up
whole cache lines :-/


however, it's good to be back in the architectural discussion mode :-)
Thank you nicO for coming back :-)

you also wrote :
> It would be great to have a VHDL code parser that could generate, in LaTeX
> (?), graphics with the inputs and outputs for the documentation (entity and
> FSM).
LaTeX cannot describe graphics. it simply "contains" (E)PS data
in this case.

> It would create a document with each entity and all its inputs and outputs.
> It would extract the FSM to produce a graphic (it's much easier to
> debug).

you know the slogan ? "just do it" :-) i am currently busy with FFTs and
arithmetic compressors (as well as other stuff).

however i notice that you do almost the opposite of Renoir,
which starts from a graphical representation to make VHDL/Verilog.
Maybe you can talk with the guys who write Matisse (a GPL variant).

Anyway : check the OpenCollector.org database and see whether it is
already done, or almost. this little bit of browsing will spare you
wasted effort :-)

(troll alert : beware of the old schematic vs RTL design war)

> nicO
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~