
Re: [f-cpu] 15 MIPS FC0 emulator



hi,

> just one question : what about making the cycle counter and the hazard
> detection "optional" ?
> this would ease the work for other architectures. As you know, FC0 is
> only the first implementation of F-CPU and other cores will probably
> have other execution and decoding rules.

Yes, I plan to (hazard detection is still a bit slow, so you
may want to turn it off). I plan to put these checks into
one inlined function which can be swapped for another one.
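
A minimal sketch of what such a replaceable inlined hook could look like. The names (emu_state, emu_timing_hook, EMU_TIMING) and the very simple scoreboard model are assumptions made for illustration, not the emulator's actual code:

```c
#include <stdint.h>

/* Hypothetical emulator timing state; the field names are illustrative. */
typedef struct {
    uint64_t cycles;        /* simulated cycle counter                  */
    uint64_t ready_at[64];  /* cycle at which each register is valid    */
} emu_state;

/* Build with -DEMU_TIMING=0 to drop cycle counting and hazard checks. */
#ifndef EMU_TIMING
#define EMU_TIMING 1
#endif

/* Called once per instruction. Another core could supply its own
 * version of this function with different issue rules. */
static inline void emu_timing_hook(emu_state *s, unsigned src1,
                                   unsigned src2, unsigned dst,
                                   unsigned latency)
{
#if EMU_TIMING
    uint64_t issue = s->cycles;
    /* Stall until both source operands are available. */
    if (s->ready_at[src1] > issue) issue = s->ready_at[src1];
    if (s->ready_at[src2] > issue) issue = s->ready_at[src2];
    s->ready_at[dst] = issue + latency;
    s->cycles = issue + 1;
#else
    (void)s; (void)src1; (void)src2; (void)dst; (void)latency;
#endif
}
```

With EMU_TIMING set to 0 the hook has an empty body, so the compiler can remove the call entirely from the interpreter loop.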

> >I tried to do it. Unfortunately f-cpu_opcodes.m4 isn't exactly
> >what I was expecting. I'd expect something like:
> >OP_DEFINE(`MUX',`m4_eval(....)') ...
> >
> well, i don't understand very well what you mean here ...

It means that when you create a new opcode you need to add
it manually in three places (currently): .m4, .h.in and
.vhdl.in. But it is possible to have one .m4 and generate
the .vhdl and .h from scratch.
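
The same "define once, generate everything" idea can be sketched inside C itself with an X-macro list instead of m4. This is a different technique from the m4 approach discussed here, and the opcode values below are made up for the example:

```c
/* Hypothetical single opcode list; the numeric values are invented. */
#define FCPU_OPCODES(X) \
    X(ADD, 0x01)        \
    X(XOR, 0x02)        \
    X(MUX, 0x03)

/* First expansion: the enum that a generated .h would contain. */
enum fcpu_opcode {
#define X(name, code) OP_##name = code,
    FCPU_OPCODES(X)
#undef X
};

/* Second expansion: a name table, e.g. for a disassembler. */
static const char *fcpu_opcode_name(int opcode)
{
    switch (opcode) {
#define X(name, code) case code: return #name;
    FCPU_OPCODES(X)
#undef X
    default: return "?";
    }
}
```

Adding an opcode then means touching one line of the list; every expansion picks it up automatically.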

> did you run the "configuration" process ? It should generate the .h
> and you simply have to include it ...

I didn't run it (shame on me). I had only 2 hours to write
the emulator, so I decided to start from scratch and planned
to use the .m4 if the emulator proves itself useful.

> but think about it : i may have temporary access to an Alpha server (DS20)
> and there is no point of being fast if it can't run at all :-)

Sounds attractive :) Yes, I agree about the code - I'll use C only,
with optional hand-coded MMX/SSE parts under #ifdef.

> AFAIK, GMP is not well suited.
> what would be adapted is macros for doing the C operations in "optimised
> ways" (here you could even do MMX macros, that would be used when
> correctly #define'd) on the F-CPU data types.
> So then you can recompile the source and specify a register width,
> and the right code would be selected (there is no point in looping
> if only 64-bit data are used, but it becomes necessary for 256-bit and +,
> and you can even inline and unroll some versions).
> Macros i think are necessary : copy, clear, assign constant,
> and +, -, and, or, not, xor, shift, all in scalar.
> SIMD operations are localised to only a part of the code,
> so they are locally optimised for dealing with "chunks".

Let me disagree with you here. For example, a "scalar add" macro
will probably be used only in scalar add - vector add has
too many problems with carry, and from 16-bit chunks upward it is faster
to loop (or to use native SIMD like MMX/SSE - btw SSE is 128-bit).
I'd suggest defining SIMD ADD with carry for all chunk sizes, plus SIMD
shift, and, or, xor and not. These can then be used by "generic" code to
define saturated add and sub (if not defined explicitly), cand/cor, compare,
neg, inc, dec, lshift, abs, max, min, scan, popc, mix, expand and
mux (I learned some tricks from the GMP sources).
Then, as a platform optimization, you could redefine any macro to give
better code for that op.
The most challenging part is a fast SIMD add with carry indication ...
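
One possible portable version of such a SIMD add with carry indication, sketched for a single 64-bit word. The names SIMD_ADD, simd_add_generic and the MSB* masks are assumptions for illustration. The low bits of every chunk are added with the top bit masked off, so no carry can cross a chunk boundary, and the top bits are then patched in with xor:

```c
#include <stdint.h>

/* Mask with only the top bit of each chunk set. */
#define MSB8  UINT64_C(0x8080808080808080)
#define MSB16 UINT64_C(0x8000800080008000)
#define MSB32 UINT64_C(0x8000000080000000)
#define MSB64 UINT64_C(0x8000000000000000)

/* Generic C version: returns the per-chunk sum and stores the
 * carry-out of each chunk at that chunk's top bit in *carry. */
static inline uint64_t simd_add_generic(uint64_t a, uint64_t b,
                                        uint64_t msb, uint64_t *carry)
{
    uint64_t low = (a & ~msb) + (b & ~msb);  /* no carry leaves a chunk */
    uint64_t sum = low ^ ((a ^ b) & msb);    /* add the top bits        */
    /* Carry out of the top bit: majority of (a, b, carry-in). */
    *carry = ((a & b) | ((a | b) & ~sum)) & msb;
    return sum;
}

/* A platform file may define SIMD_ADD first (e.g. with MMX/SSE). */
#ifndef SIMD_ADD
#define SIMD_ADD(a, b, msb, carry) simd_add_generic((a), (b), (msb), (carry))
#endif
```

The chunk size is selected purely by the mask, so the same routine serves 8-, 16-, 32- and 64-bit chunks; wider registers would loop over words.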

Regarding GMP - do you really want to code a 64x64->128 bit multiply
by hand on 32-bit CPUs!? And divides? I don't ... GMP handles
them pretty fast (Karatsuba multiplication) and has optimized
code for the vast majority of CPUs. I planned to use it exclusively
for MUL, DIV and REM or large chunks.
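
For a sense of what the hand-coded route involves, here is a sketch of a 64x64->128 multiply built from four 32x32->64 partial products. The function name mul64x64_128 is made up, and it assumes the compiler provides uint64_t (which it would emulate on a 32-bit host). Division and remainder are considerably more work than this:

```c
#include <stdint.h>

/* Schoolbook multiply: split both operands into 32-bit halves and
 * combine the four partial products. Result is returned as hi:lo. */
static void mul64x64_128(uint64_t a, uint64_t b,
                         uint64_t *hi, uint64_t *lo)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t p00 = a0 * b0;
    uint64_t p01 = a0 * b1;
    uint64_t p10 = a1 * b0;
    uint64_t p11 = a1 * b1;

    /* Middle column: at most three 32-bit values, cannot overflow. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```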

> register size, of course !
> Up to now, and AFAIK, the chunk size is still limited to 64 bits.
> Some implementations could play with it (like, for a cheap 32-bit version)
> but 64-bit is already a very wide stuff.

I agree.

> >As cedric said it probably doesn't make much sense to have a cpu with
> >a 256bit adder, multiplier, shifter ... It will slow it down (either
> >in clock or latency sense).
> >
> yup. But there is still the misunderstanding that register size would
> correspond to integer size. Look at SSE2, then.

It is clear enough that register != integer ... that is the essence
of SIMD. I like to treat registers as collections of bits.

> >Also where do you expect "zero attribute" to be computed for
> >256 or 1024 bit reg ?
> >
> huh i guess it's still at the register write ports.
> But usually, tests for 0 are critical in loops
> or conditional code that deals with small scalar numbers.
> in this last situation, we know that only a small fraction
> of the 1024 bits have to be tested (the others being 0)

Well, but what about bypasses? Will the zero attribute propagate
in parallel with the data going through the bypass & xbar toward the EU
which needs it?

> > It will take one whole cycle ...
> >
> well, it may happen that a full test for zero on 1024 bits
> would take a few cycles. But this is not completely illogical
> so i bear with it ...

So it will stall the pipeline?

> >Maybe there really is a need for some inter-chunk ops - like computing
> >the zero flag for only the LSB MAX_CHUNK_BITS and having an op
> >which can for example do (MAX_CHUNK_BITS=64):
> >bit0  -> bit0
> >bit64 -> bit1
> >bit128-> bit2
> >bit192-> bit3
> >bit1  -> bit4
> >etc.
> >
> maybe... but moving bits is the job of the SHL unit
> which is not fully and definitively finished (though
> MR's code already works)

Hmm, yes - but isn't the SHL unit expected to work only on
chunks (so it doesn't need to cope with more than 64 bits)?
And there is no operation to "escape" from a chunk except cshift.
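
For the emulator side, the inter-chunk bit shuffle quoted above (bit0 -> bit0, bit64 -> bit1, bit128 -> bit2, bit192 -> bit3, bit1 -> bit4, ...) could be modelled as below. This is only a sketch: it assumes a 256-bit register stored as four 64-bit chunks, and the function name interleave_chunks is invented:

```c
#include <stdint.h>

#define CHUNKS     4   /* assumed: 256-bit register = 4 chunks */
#define CHUNK_BITS 64  /* MAX_CHUNK_BITS from the example      */

/* Bit i of chunk c moves to bit (i * CHUNKS + c) of the result.
 * 'in' and 'out' must not point to the same array. */
static void interleave_chunks(const uint64_t in[CHUNKS],
                              uint64_t out[CHUNKS])
{
    for (unsigned c = 0; c < CHUNKS; c++)
        out[c] = 0;

    for (unsigned i = 0; i < CHUNK_BITS; i++) {
        for (unsigned c = 0; c < CHUNKS; c++) {
            unsigned dst = i * CHUNKS + c;
            uint64_t bit = (in[c] >> i) & 1;
            out[dst / CHUNK_BITS] |= bit << (dst % CHUNK_BITS);
        }
    }
}
```

This bit-at-a-time loop is slow but easy to check; it says nothing about how cheaply the hardware could do the same thing.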

> As MR explained, it's really a matter of implementing it
> with realistic HW constraints.

I completely agree.

> >regarding combine, the manual states that it will be a problem to design
> >combine for chunks larger than 8 bits. Does it still hold ?
> >
> yup. Look at the drawing of the ROP2 byte.
> However, other implementations can probably implement larger ORs and ANDs,

Probably pipelined (a stage after ROP2)? That would cost
another xbar EU port - BTW, are these xbar ports
expensive in (current) HW?
And should FC0 emulate them?

> But the function of the combine operations is also performed
> (in a slower way) by the ASU with saturation mode : it requires
> a bit more time and some more instructions but it does it.

Like adding/subtracting 0xFE in saturated byte mode? Bit 0 is
the result ... Hmm! It is wonderful!!! So cor.nxor is:
xor.8 ,,r1
addsi.8 0xfe,r1,r1
inc.8 r1,r1
Because add.8 is a 1-cycle op, it is then very similar to the
8-bit limited cand/cor (well, cand/cor combines it with
ROP2). But ... what does your sense of orthogonality say? :)
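
The three-instruction trick above can be checked in plain C. The sketch below assumes unsigned saturation for the add and a wrapping byte increment; the helper names sat_add8 and byte_eq_mask are invented. After the saturating add of 0xFE, each byte is 0xFE if it was zero and 0xFF otherwise, so bit 0 is the OR of all its bits; the increment then turns that into 0xFF / 0x00:

```c
#include <stdint.h>

/* Unsigned saturating 8-bit add (assumed semantics of addsi.8). */
static uint8_t sat_add8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return s > 0xFF ? 0xFF : (uint8_t)s;
}

/* Per byte: 0xFF where a and b are equal, 0x00 where they differ. */
static uint64_t byte_eq_mask(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;                  /* xor.8: zero byte = equal   */
    uint64_t out = 0;

    for (unsigned i = 0; i < 8; i++) {
        uint8_t v = (uint8_t)(x >> (8 * i));
        v = sat_add8(v, 0xFE);           /* 0xFE if v == 0, else 0xFF  */
        v = (uint8_t)(v + 1);            /* inc.8 wraps: 0xFF or 0x00  */
        out |= (uint64_t)v << (8 * i);
    }
    return out;
}
```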

BTW: do you know other excellent tricks like the one above?

devik
