
Re: [f-cpu] 15 MIPS FC0 emulator



hi !

devik wrote:

today I've written part of my own emulator. Why ? Well, the simulator
is better but not complete yet, and probably slow.


well, "probably" must be measured before undertaking this ...
My goal and plan is to make it scalable and work well before making it
fast. Otherwise, it ends up with kludges and workarounds when it could
be solved by simply changing the original design.

yes. As somebody already said, there is a need for VHDL, a simulator
and an emulator. The VHDL code can probably also be used as/for the simulator.
But when testing GCC with real code, I can't spend a whole day
waiting for the result ...
My rationale is different - not to make it as clean as possible but
as simple as possible - thus making it less painful to rewrite.

just one question : what about making the cycle counter and the hazard detection "optional" ?
this would ease the work for other architectures. As you know, FC0 is only the first
implementation of F-CPU and other cores will probably have other execution and decoding rules.
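
for example, something like this (just a rough sketch ; FC0_MODEL_TIMING,
fc0_state and account() are names i invent here, they are not taken from
your code) :

#include <stdint.h>

/* FC0 timing/hazard model compiled in or out ; another F-CPU core with   */
/* different execution and decoding rules would plug in its own model.    */
struct fc0_state {
    uint64_t cycles;              /* elapsed cycle count                  */
    uint64_t reg_busy_until[64];  /* per-register scoreboard (FC0 only)   */
};

#ifdef FC0_MODEL_TIMING
static inline void account(struct fc0_state *s, unsigned latency, unsigned dst)
{
    if (s->reg_busy_until[dst] > s->cycles)   /* hazard : wait for the result */
        s->cycles = s->reg_busy_until[dst];
    s->cycles += 1;
    s->reg_busy_until[dst] = s->cycles + latency;
}
#else
static inline void account(struct fc0_state *s, unsigned latency, unsigned dst)
{
    (void)s; (void)latency; (void)dst;        /* purely functional emulation  */
}
#endif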

did you do a copy and paste from the snapshot's source code ?
it contains a lot of "standard definitions" ....

I tried to do it. Unfortunately f-cpu_opcodes.m4 is not exactly
what I was expecting. I'd expect something like:
OP_DEFINE(`MUX',`m4_eval(....)') ...

well, i don't understand very well what you mean here ...

so that I'd define OP_DEFINE and pass f-cpu_opcodes.m4 to m4.
It could directly generate #define OPCODE_MUX ... for me. As it is now,
it seems I'd have to create (duplicate) all the #defines first
WITHOUT real values, and m4 would only replace them with the actual values.
So for now I created my own .h with the opcodes, but I expect
to integrate it better later.
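
The same idea can be shown with the plain C preprocessor (X-macro style) -
just a sketch, the opcode values below are made up:

/* single source of truth : the opcode list, written once */
#define FCPU_OPCODE_LIST \
    OP_DEFINE(MUX,  0x12) \
    OP_DEFINE(ADD,  0x01) \
    OP_DEFINE(ROP2, 0x08)

/* consumer 1 : numeric constants (what I'd want generated for the emulator) */
#define OP_DEFINE(name, val) OPCODE_##name = (val),
enum fcpu_opcode { FCPU_OPCODE_LIST };
#undef OP_DEFINE

/* consumer 2 : mnemonic table, e.g. for a disassembler */
#define OP_DEFINE(name, val) [val] = #name,
static const char *fcpu_mnemonic[256] = { FCPU_OPCODE_LIST };
#undef OP_DEFINE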

did you run the "configuration" process ? It should generate the .h
and you simply have to include it ...


SIMD ADD.W gives 12 MIPS, and ADD.W with saturation and carry
store gives 7 MIPS. It can still be optimized a bit.


before making it fast, what about making it accurate ?
what about the problem with signed saturations ?

I did the unsigned one and yes, it is accurate. I tested with a variety
of possible inputs and all flag combinations of "add" (it revealed
a few early bugs).

:-)

It uses MMX where possible (it helps especially with .B and .W
SIMD ops). SSE would help, but there are far fewer machines
with SSE than with MMX (mine, for example).


I hereby grant you the prize for the "Least Portable Code" of the F-CPU
Project, congratulations ;-)

Thanks :) I wanted to do that every time I had to write portable
code in the Linux kernel :)

heh heh heh ;-)

but think about it : i may have temporary access to an Alpha server (DS20)
and there is no point in being fast if it can't run at all :-)

by "portable", i mean that people with a Mac, SPARC or other machines
can still compile and execute it. I sometimes use Pentium computers
(of the first generation) and i know that x86 is not the only architecture
on Earth.

Yup :) I probably did it because I had never coded for MMX and wanted to
try it :-)

don't worry : coding asm for F-CPU is much more interesting :-)


hmm, when I think about it, add.b on a 32-bit register could be done as
add = A + B; oldovr = 0; ovr = 1;
while (oldovr != ovr) {
    oldovr = ovr;
    ovr = (((A & B) | ((A ^ B) & ~add)) >> 7) & 0x01010101; /* partial carries */
    add -= ovr << 8;                                        /* fix carries */
} /* loop (unlikely) if a carry fix propagated to the upper byte */
if (saturate) add |= ((ovr ^ 0x7f7f7f7f) + 1) ^ 0x80808080;

it is typically 5 ops per byte (when perfectly optimized), which is
similar to a per-byte unrolled loop (but the above would win on a
64-bit cpu). For add.w it'd be better to stick with a per-word loop.
All these ways are unfortunately at least 5 times (sometimes
40 times) slower than MMX or similar SIMD code.
But well ... I'll probably have to use loops & the GMP library.
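
For reference, the usual branch-free way to do this in portable C (a sketch
only, using the standard trick of masking off the MSBs so carries cannot
cross byte boundaries - this is not the code from my emulator):

#include <stdint.h>

/* per-byte add of four packed bytes, with optional unsigned saturation   */
static uint32_t add_b(uint32_t a, uint32_t b, int saturate)
{
    /* add the low 7 bits of each byte, then put the MSBs back with XOR   */
    uint32_t sum   = ((a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu))
                     ^ ((a ^ b) & 0x80808080u);
    /* carry out of each byte : bit 7 set where a+b >= 0x100              */
    uint32_t carry = ((a & b) | ((a | b) & ~sum)) & 0x80808080u;
    if (saturate)
        sum |= (carry >> 7) * 0xffu;  /* overflowed bytes saturate to 0xff */
    return sum;
}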

AFAIK, GMP is not well suited.
What would be better adapted is macros for doing the C operations in "optimised ways"
(here you could even do MMX macros, that would be used when correctly #define'd)
on the F-CPU data types.
So then you can recompile the source and specify a register width,
and the right code would be selected (there is no point in looping
if only 64-bit data are used, but it becomes necessary for 256 bits and more),
and you can even inline and unroll some versions.

Macros i think are necessary : copy, clear, assign constant,
and +, -, and, or, not, xor, shift, all in scalar.
SIMD operations are localised to only a part of the code,
so they are locally optimised for dealing with "chunks".
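
something like this, roughly (just a sketch ; FCPU_REG_BITS and the names
below are invented for the example, they are not in the emulator) :

#include <stdint.h>
#include <string.h>

/* an F-CPU register as an array of 64-bit chunks, width chosen at build time ;
   with FCPU_REG_BITS=64 the loops have one iteration and can be flattened away */
#ifndef FCPU_REG_BITS
#define FCPU_REG_BITS 64
#endif
#define FCPU_CHUNKS (FCPU_REG_BITS / 64)

typedef struct { uint64_t c[FCPU_CHUNKS]; } fcpu_reg;

#define REG_CLEAR(r)    memset((r).c, 0, sizeof((r).c))
#define REG_COPY(d, s)  memcpy((d).c, (s).c, sizeof((d).c))
#define REG_XOR(d, a, b) \
    do { \
        for (int _i = 0; _i < FCPU_CHUNKS; _i++) \
            (d).c[_i] = (a).c[_i] ^ (b).c[_i]; \
    } while (0)

/* "zero attribute" : OR all the chunks together, test once */
static inline int reg_is_zero(const fcpu_reg *r)
{
    uint64_t acc = 0;
    for (int i = 0; i < FCPU_CHUNKS; i++)
        acc |= r->c[i];
    return acc == 0;
}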

Well, that's my usual rant. But you said you contribute to the Linux kernel
so such a hack does not surprise me :-) there might even be a few things

hey :) Look at my HTB code in 2.4.20 - it is pretty portable. In fact
the majority of the Linux kernel is portable (except the "arch/cpu" subdirs).

huh, i'm still stuck with 19-pre7 because pre8 screwed up the Wacom tablet support ...


to learn. And if you don't hit the walls a few times, you won't understand
why i rant ;-) For example, now, it would be funny to have a 256-bit
version,
or even an emulator where you can indicate the register size as a command-line
parameter .......

nice ..

that's the goal ;-)

but .. is 256 bits the MAX_REGISTER_BITS or the MAX_CHUNK_BITS ?

register size, of course !
Up to now, and AFAIK, the chunk size is still limited to 64 bits.
Some implementations could play with it (like, for a cheap 32-bit version)
but 64 bits is already very wide.

As Cedric said, it probably doesn't make much sense to have a cpu with a
256-bit adder, multiplier, shifter ... It would slow it down (in either
clock speed or latency terms).

yup. But there is still the misunderstanding that register size would correspond to
integer size. Look at SSE2, then.

Also, where do you expect the "zero attribute" to be computed for a
256- or 1024-bit register ?

huh, i guess it's still at the register write ports.
But usually, tests for 0 are critical in loops
or conditional code that deals with small scalar numbers.
In this last situation, we know that only a small fraction
of the 1024 bits has to be tested (the others being 0).

It will take one whole cycle ...

well, it may happen that a full test for zero on 1024 bits
would take a few cycles. But this is not completely illogical
so i bear with it ...

Maybe there is really a need for some inter-chunk ops - like computing the
zero flag for only the LSB MAX_CHUNK_BITS, plus an op which can for
example do the following (MAX_CHUNK_BITS=64 ; a C sketch of the mapping
follows the list):
bit0 -> bit0
bit64 -> bit1
bit128-> bit2
bit192-> bit3
bit1 -> bit4
etc.
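
Something like this in C (just to show the mapping ; the names are invented):

#include <stdint.h>

/* gather bit j of every chunk i to bit (j*NCHUNKS + i) of the result :   */
/* bit0->bit0, bit64->bit1, bit128->bit2, bit192->bit3, bit1->bit4, ...   */
#define NCHUNKS 4   /* 256-bit register, 64-bit chunks */

static void chunk_bit_gather(const uint64_t src[NCHUNKS], uint64_t dst[NCHUNKS])
{
    for (int i = 0; i < NCHUNKS; i++)
        dst[i] = 0;
    for (int i = 0; i < NCHUNKS; i++)          /* source chunk            */
        for (int j = 0; j < 64; j++) {         /* bit within the chunk    */
            unsigned pos  = (unsigned)j * NCHUNKS + i;
            uint64_t  bit = (src[i] >> j) & 1u;
            dst[pos / 64] |= bit << (pos % 64);
        }
}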

maybe... but moving bits is the job of the SHL unit
which is not fully and definitively finished (though
MR's code already works)

As far as i remember, it is unsigned.
A signed version is however interesting and useful,
but designing it might be a problem.

ahh ok - it is not stated in the manual, but it is clear
from the example there.

As MR explained, it's really a matter of implementing it
with realistic HW constraints.

version of the ROP2 code, with MUX and combine
included. Yes i know it's not as fast (and it can be

regarding combine, the manual states that it would be a problem to design
combine for chunks larger than 8 bits. Does that still hold ?

yup. Look at the drawing of the ROP2 byte.
However, other implementations can probably implement larger ORs and ANDs,
with a different timing. The current byte limit comes from a HW limitation
under a specific HW design rule, but you can play with larger combines :
that would certainly help certain codes. For example, i think that the
RC5 codes deal with 32-bit data, and ROP2 can help make 32-bit masks.

But the function of the combine operations is also performed
(in a slower way) by the ASU with saturation mode : it requires
a bit more time and a few more instructions, but it does the job.

devik

YG
