
Re: [f-cpu] 15 MIPS FC0 emulator



> >today I've written part of my own emulator. Why ? Well simulator
> >is better but not complete yet and probably slow.
> >
> well, "probably" must be measured before undertaking this ...
> My goal and plan is to make it scalable and work well, before making it
> fast.
> otherwise, it ends up with kludges and workarounds when it could
> be simply solved by simply changing the original design.

Yes. As somebody already said, there is a need for VHDL, a simulator
and an emulator. The VHDL code can probably also be used as (or for)
a simulator. But when testing GCC with real code I can't spend a whole
day waiting for the result ...
My rationale is different: not to make it as clean as possible, but
to make it as simple as possible, so that rewriting it later is less
painful.

> did you do a copy and paste from the snapshot's source code ?
> it contains a lot of "standard definitions" ....

I tried to do it. Unfortunately f-cpu_opcodes.m4 is not exactly
what I was expecting. I'd expect something like:
OP_DEFINE(`MUX',`m4_eval(....)') ...

so that I could define OP_DEFINE and pass f-cpu_opcodes.m4 to m4.
It could then directly generate #define OPCODE_MUX ... for me. As it
is, it seems I'd have to create (duplicate) all the #defines first
WITHOUT real values, and m4 would only replace them with the actual
values. So for now I created my own .h with the opcodes, but I expect
to integrate it better later.
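
To show the kind of single opcode list I mean, here is a minimal sketch
using the C preprocessor (an "X-macro") instead of m4. It is not taken
from f-cpu_opcodes.m4: the list macro FCPU_OPCODE_LIST, the helper names,
the opcode names other than MUX and all numeric values are made up for
the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical opcode list. Names and values are placeholders,
   NOT the real F-CPU encodings. */
#define FCPU_OPCODE_LIST(OP_DEFINE) \
    OP_DEFINE(ADD, 0x01)            \
    OP_DEFINE(SUB, 0x02)            \
    OP_DEFINE(MUX, 0x03)

/* First expansion of the list: OPCODE_xxx constants. */
#define AS_ENUM(name, value) OPCODE_##name = (value),
enum fcpu_opcode { FCPU_OPCODE_LIST(AS_ENUM) OPCODE_END_MARKER };

/* Second expansion of the same list: value-to-name table. */
#define AS_NAME(name, value) { (value), #name },
static const struct { int value; const char *name; } opcode_names[] = {
    FCPU_OPCODE_LIST(AS_NAME)
};

/* Return the mnemonic for an opcode value, or "?" if it is unknown. */
static const char *opcode_name(int value)
{
    for (size_t i = 0; i < sizeof opcode_names / sizeof opcode_names[0]; i++)
        if (opcode_names[i].value == value)
            return opcode_names[i].name;
    return "?";
}

/* Sanity check: every entry must map back to its own name, which
   fails if two opcodes were accidentally given the same value. */
static void check_opcode_table(void)
{
    for (size_t i = 0; i < sizeof opcode_names / sizeof opcode_names[0]; i++)
        assert(strcmp(opcode_name(opcode_names[i].value),
                      opcode_names[i].name) == 0);
}
```

The point is only that the list is written once and each consumer
supplies its own OP_DEFINE; an m4 version would work the same way.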

> >SIMD ADD.W gives 12MIPS and ADD.W with saturation and carry
> >store 7MIPS. It can be still optimized a bit.
> >
> before making it fast, what about making it acurate ?
> what about the problem with signed saturations ?

I did the unsigned one, and yes, it is accurate. I tested it with a
variety of possible inputs and all flag combinations of "add" (which
revealed a few early bugs).

> >It uses MMX where possible (it helps especially with .B and .W
> >SIMD ops). SSE would help but there is much less machines
> >with SSE than with MMX (mine for example).
> >
> I hereby grant you with the prize of "Least Portables Code" of the F-CPU
> Project, congratulations ;-)

Thanks :) I wished for it every time I had to write portable
code in the Linux kernel :)

> by "portable", it means that people with a Mac, SPARC or other machines
> can still compile and execute it. I sometimes use Pentium computers
> (of the first generation) and i know that x86 is not the only architecture
> on Earth.

Yup :) I probably did it because I had never coded for MMX and wanted
to try it :-)
Hmm, when I think about it, add.b on a 32-bit register could be done as:
/* add the low 7 bits of each byte, then patch bit 7 back in,
   so no carry ever crosses a byte boundary */
add = ((A & 0x7f7f7f7f) + (B & 0x7f7f7f7f)) ^ ((A ^ B) & 0x80808080);
/* carry out of bit 7 of each byte */
ovr = (((A & B) | ((A ^ B) & ~add)) >> 7) & 0x01010101;
if (saturate) add |= ovr * 0xff; /* 0xff in every byte that overflowed */

It is typically 5 ops per byte (when perfectly optimized), which is
similar to a per-byte unrolled loop (but the above would win on a
64-bit CPU). For add.w it would be better to stick with a per-word loop.
All these approaches are unfortunately at least 5 times (sometimes
40 times) slower than MMX or similar SIMD code.
But well ... I'll probably have to use loops and the GMP library.
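
To make the 64-bit remark concrete, here is a minimal sketch of an
unsigned byte-wise add on a 64-bit word, checked against a plain
per-byte loop. It is not the emulator's code: it masks the top bit of
every byte so that carries never cross lanes, and the names add_b64,
add_b64_ref and check_add_b64 are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define LO7  UINT64_C(0x7f7f7f7f7f7f7f7f)  /* low 7 bits of every byte */
#define HI1  UINT64_C(0x8080808080808080)  /* top bit of every byte */
#define ONES UINT64_C(0x0101010101010101)  /* bit 0 of every byte */

/* Eight unsigned byte adds in one 64-bit word, optionally saturating. */
static uint64_t add_b64(uint64_t a, uint64_t b, int saturate)
{
    /* Add the low 7 bits per byte, then XOR in the top bits. */
    uint64_t sum = ((a & LO7) + (b & LO7)) ^ ((a ^ b) & HI1);
    /* Carry out of bit 7 of each byte, moved down to bit 0. */
    uint64_t ovr = (((a & b) | ((a ^ b) & ~sum)) >> 7) & ONES;
    if (saturate)
        sum |= ovr * 0xff;  /* 0xff in every byte that overflowed */
    return sum;
}

/* Reference version: one byte at a time. */
static uint64_t add_b64_ref(uint64_t a, uint64_t b, int saturate)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i += 8) {
        unsigned s = (unsigned)((a >> i) & 0xff) + (unsigned)((b >> i) & 0xff);
        if (s > 0xff)
            s = saturate ? 0xff : (s & 0xff);
        r |= (uint64_t)s << i;
    }
    return r;
}

/* Compare both versions on pseudo-random inputs (simple LCG). */
static void check_add_b64(int rounds)
{
    uint64_t x = 1;
    for (int i = 0; i < rounds; i++) {
        x = x * UINT64_C(6364136223846793005) + UINT64_C(1442695040888963407);
        uint64_t a = x ^ (x >> 29);
        x = x * UINT64_C(6364136223846793005) + UINT64_C(1442695040888963407);
        uint64_t b = x ^ (x >> 29);
        assert(add_b64(a, b, 0) == add_b64_ref(a, b, 0));
        assert(add_b64(a, b, 1) == add_b64_ref(a, b, 1));
    }
}
```

Calling check_add_b64 with a few thousand rounds is a cheap way to
catch lane-carry mistakes before trusting the fast path.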

> Well, that's my usual rant. But you said you contribute to the Linux kernel
> so such a hack does not surprise me :-) there might even be a few things

Hey :) Look at my HTB code in 2.4.20; it is pretty portable. In fact,
the majority of the Linux kernel is portable (except the "arch/cpu" subdirs).

> to learn. And if you don't hit the walls a few times, you won't understand
> why i rant ;-) For example, now, it would be funny to have a 256 bit
> version,
> or even, an emulator where you can indicate the register size as a command
> line parameter .......

Nice .. but .. is 256 bits MAX_REGISTER_BITS or MAX_CHUNK_BITS? As
Cedric said, it probably doesn't make much sense to have a CPU with
a 256-bit adder, multiplier, shifter ... It would slow it down (in
either clock rate or latency).
Also, where do you expect the "zero attribute" to be computed for
a 256- or 1024-bit register? It will take one whole cycle ...
Maybe there really is a need for some inter-chunk ops: for example,
compute the zero flag only for the least significant MAX_CHUNK_BITS,
and have an op which can, for example, do (MAX_CHUNK_BITS=64):
bit0  -> bit0
bit64 -> bit1
bit128-> bit2
bit192-> bit3
bit1  -> bit4
etc.
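
The mapping listed above can be written down as a small sketch. It
assumes a 256-bit register stored as four 64-bit chunks (chunk 0 holding
bits 0..63); the names interleaved_bit_index and chunk_interleave are
made up, and this is not an existing F-CPU instruction.

```c
#include <assert.h>
#include <stdint.h>

enum { CHUNK_BITS = 64, NUM_CHUNKS = 4 };  /* assumed register layout */

/* Destination bit for a given source bit:
   bit b of chunk c goes to bit b * NUM_CHUNKS + c. */
static int interleaved_bit_index(int src_bit)
{
    assert(src_bit >= 0 && src_bit < CHUNK_BITS * NUM_CHUNKS);
    int chunk = src_bit / CHUNK_BITS;
    int bit = src_bit % CHUNK_BITS;
    return bit * NUM_CHUNKS + chunk;
}

/* Apply the mapping to a whole register, one bit at a time. */
static void chunk_interleave(const uint64_t src[NUM_CHUNKS],
                             uint64_t dst[NUM_CHUNKS])
{
    for (int i = 0; i < NUM_CHUNKS; i++)
        dst[i] = 0;
    for (int s = 0; s < CHUNK_BITS * NUM_CHUNKS; s++) {
        uint64_t bit = (src[s / CHUNK_BITS] >> (s % CHUNK_BITS)) & 1;
        int d = interleaved_bit_index(s);
        dst[d / CHUNK_BITS] |= bit << (d % CHUNK_BITS);
    }
}
```

The bit-by-bit loop is only there to pin down the mapping; in hardware
it would be pure wiring.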

> As far as i remember, it is unsigned.
> a signed version is however interesting and useful,
> but designing it might be a problem.

Ahh, OK. It is not stated in the manual, but it is clear
from the example there.

> version of the ROP2 code, with MUX and combine
> included. Yes i know it's not as fast (and it can be

Regarding combine, the manual states that it will be a problem to design
combine for chunks larger than 8 bits. Does that still hold?

devik
