
Re: (m) Re: [f-cpu] Re: Floating-Point?



On Thu, Aug 16, 2001 at 06:24:09PM +0200, Yann Guidon wrote:
[...]
> > I mean: I think it's unreasonable to build *variable-size* FP units.
> > There are too many special cases to consider -- rounding, exceptions,
> > infinities and NANs, ... (ok, go blame IEEE for it ;)
> 
> come on, pipelined/vector FP and SIMD FP are not new.
> On top of that i have added a new condition for jumps : NaN.

Didn't we already decide that months ago?

[...]
> >         - The manual doesn't state whether `modi' is a signed operation
> >           I suggest it should be signed (like `divi')
> i think that it goes along with the divide unit that you are doing.

The IDU will be able to perform both signed and unsigned
division/remainder.  The signed division will be symmetric (FORTH people
probably know this as SM/REM).  I still have to elaborate whether
asymmetric `floored' division and modulus (FM/MOD) is possible.
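
For those who don't speak FORTH, here is a minimal sketch of the
difference between the two flavours.  It is written in Python rather
than F-CPU code, the function names are only borrowed from the FORTH
words (the real words take a double-cell dividend and return the
remainder first), and it says nothing about how the IDU will do it:

```python
def sm_rem(a, b):
    """Symmetric (truncating) division: the quotient rounds toward zero,
    the remainder takes the sign of the dividend."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return q, a - q * b

def fm_mod(a, b):
    """Floored division: the quotient rounds toward minus infinity,
    the remainder takes the sign of the divisor."""
    q = a // b              # Python's // already floors
    return q, a - q * b

# Print (quotient, remainder) for all four sign combinations.
for a, b in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
    print(a, b, sm_rem(a, b), fm_mod(a, b))
```

The two only disagree when the operands have different signs and the
division is not exact.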

> >         - We need a level-1 floating-point compare instruction;
> >           `cmpl'/`cmple' may work with LNS (if there are no NANs),
> >           but not with FP.
> 
> IEEE FP defines FP comparison with Integer operations.

Yes, but unfortunately in sign-magnitude (not 2's complement) form.

> the format has been designed specifically for this purpose
> (however i don't remember what happens with NaNs etc)

IIRC, if at least one of the operands is a NAN, the result of all
comparisons (except `!=') is always false.

Mapped to (2's complement signed) integer, the order is as follows
(assuming IEEE `single' format):

	s  eeeee  fff  meaning
	=====================
	0     ff   >0  NAN
	0     ff   =0  +INF
	0  01-fe  any  positive normal
	0     00   >0  positive subnormal
	0     00   =0  +0
	1     ff   >0  NAN
	1     ff   =0  -INF
	1  01-fe  any  negative normal
	1     00   >0  negative subnormal
	1     00   =0  -0

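That's obviously not correct: the negative half is in reverse order (-INF
ranks above every finite negative number, and -0 ends up as the smallest
value of all), +0 and -0 compare unequal, and NANs are ordered like
ordinary numbers.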
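
To make the problem concrete, here is a small Python sketch (not F-CPU
code) that compares IEEE `single' bit patterns as integers.  The helper
names are made up for this example, and `order_key' is just one possible
fix-up that ignores NANs; it is not something the manual specifies:

```python
import struct

def bits(x):
    """Raw IEEE single bit pattern of x, as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def as_signed(u):
    """Reinterpret a 32-bit pattern as a 2's complement integer."""
    return u - (1 << 32) if u & 0x80000000 else u

def order_key(u):
    """Map a sign-magnitude pattern to an integer that sorts like the
    float (NANs excluded): negative values become negated magnitudes,
    so -0 and +0 both map to 0."""
    return -(u & 0x7FFFFFFF) if u & 0x80000000 else u

a, b = bits(-2.0), bits(-1.0)
print(as_signed(a) < as_signed(b))   # plain integer compare of -2.0 < -1.0
print(order_key(a) < order_key(b))   # compare after the fix-up
```

The plain integer compare gets the negative pair the wrong way round,
which is why `cmpl'/`cmple' alone are not enough for FP.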

> >         - The arguments of `store[f]' are reversed (dest, src).  It's
> >           ok that way (because it mirrors the `load' instruction) but
> >           there should be a BIG FAT WARNING in the manual.
> 
> there will certainly be a change in the L/S instruction format !
> the pointer that gets updated must be written somewhere and the
> current fields don't match the expected behaviour. ie :
> load r1, r2, r3 does : load [r2] into r3, and add r1 to r2
> this means that the r2 field must be written to. IT IS NOT POSSIBLE yet.
> so it will become : "load [r2] into r3, and r1 + r2 => r3^1"

I like the current form better.  This one is too crazy.

[...]
> >         - We need to clarify the syntax of the `condition' suffixes for
> >           `move' and `jmpa'.  I suggest
> > 
> >                 000  -z   (zero)
> >                 001       (unassigned)
> >                 010  -m   (msb == 1)
> >                 011  -l   (lsb == 1)
> >                 100  -nz  (not zero)
> >                 101       (unassigned)
> >                 110  -nm  (msb == 0)
> >                 111  -nl  (lsb == 0)
> 
> in the assembler that i have written, it's written differently
> and more verbosely (less confusing when you don't know the meanings).

Yep, I saw the .lsb/.msb/.and/.or suffixes.  I didn't find .zero
or a negation suffix, though.  I suggest we allow both forms -- the
long one for humans, the short one for compilers (or people like me,
who don't like writing a novel to flip a single bit ;).

Summarized:

	000  -z   .zero, .z, .null, ...
	001  -x   .nan, .not.number, ...
	010  -m   .msb, .msb1, ...
	011  -l   .lsb, .lsb1, ...
	100  -nz  .notzero, .not.zero, .nz, ...
	101  -nx  .notnan, .not.nan, .number, ...
	110  -nm  .notmsb, .not.msb, .msb0, ...
	111  -nl  .notlsb, .not.lsb, .lsb0, ...

I choose the short names, you the long names ;)

In case you wonder, the `x' stands for `eXtra' or `eXception'.  I could
also use `n' for `NAN', but that makes the instruction encoder a little
more complex and is probably misleading anyway.  But I'll think it over.
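
A rough sketch of how the short forms could be encoded, assuming the
table above is final.  This is Python for illustration only, not the
flex source; `BASE', `NEGATE' and `encode_condition' are names invented
for this example:

```python
# Hypothetical encoder table: the base condition sits in the low two bits.
BASE = {"z": 0b000, "x": 0b001, "m": 0b010, "l": 0b011}
NEGATE = 0b100  # bit 2 selects the negated condition

def encode_condition(suffix):
    """Translate a short suffix such as '-z' or '-nm' into its 3-bit code."""
    name = suffix.lstrip("-")
    code = 0
    # A leading 'n' means negation only if a known base name follows it.
    if name.startswith("n") and name[1:] in BASE:
        code, name = NEGATE, name[1:]
    return code | BASE[name]

for s in ["-z", "-x", "-m", "-l", "-nz", "-nx", "-nm", "-nl"]:
    print(s, format(encode_condition(s), "03b"))
```

With `n' as a base name as well, the prefix test would have to tell `-n'
(NAN) from the `n' in `-nz', which is the extra complexity mentioned
above.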

[...]
> >         - Can we please drop the `a' from `jmpa'?
> probably. i don't remember where it comes from, probably from Mathias.

It was meant to indicate an (A)bsolute jump.  Since that's the only
one the F-CPU knows, the suffix is redundant (and it looks too much
like `jump always').

> >         - When calling functions through pointers, it would be nice to
> >           be able to tell the F-CPU *a priori* that a register contains a
> >           code address.  While this can be done with an explicit prefetch
> >           (load to r0) for data pointers, there is no way to specify that
> >           a register contains a code address that the CPU will have to
> >           visit soon.
> what about loadaddr(i) ?

Not useful.  Imagine a C++ `member' function -- the first (hidden)
argument is a pointer to the class, the class contains a pointer to
the virtual method table, and the VMT contains pointers to all the
members.  To call another member, you have to

	// let r1 point to the current instance
	load r1, r2				// get pointer to VMT (usually stored at offset 0)
	add $offset, r2, r3		// VMT slot address
	load r3, r4				// get member's address
	// argument passing omitted
	jmp r4, r5				// call member

Both r2 (data pointer) and r4 (code pointer) are loaded from memory,
and r3 (also a data pointer) is calculated from r2 and a constant
(which probably has to be loadcons'd if it is too large).  But there
is no loadaddr[i] in that sequence, and the CPU has no way to tell
that r4 points to a function that is going to be called real soon
(that is, its code should be prefetched to avoid a stall).

> >           The same is true when an absolute code address is
> >           obtained via loadcons (which will probably be the common idiom
> >           when a function in another object file is called, unless jump
> >           tables are used -- which points us back to the `code pointer
> >           in register' problem, again).
> if the data/code is not explicitly prefetched, the code will still work,
> but with the "late fetch" penalty : the CPU will perform the "fetch"
> operation automatically while stalling the decode stage.

The point is that one cannot prefetch code.  `load r4, r0' will prefetch
the code into the D-cache, not the I-cache.

> >           To cut a long story short: I'd like to have an instruction
> >           that explicitly `tags' a register as a pointer, and probably
> >           initiates a prefetch cycle (for code or data, depending on
> >           the instruction's flags).  It may or may not move data from
> >           one register to another (one idea I had was a `pointer move'
> >           instruction); if it does, it might be a good idea to let it
> >           participate in address calculation (i.e. let it be able to
> >           add two operands, like the `lea' instruction on Intel CPUs).
> 
> this is what loadaddr is meant to do.

But it only works with PC-relative addressing.  While that's fine for
conditional branches, loops and local function calls, inter-module calls
cannot use it because the target address is resolved at link time (and
what's worse: it may be too far away for a 16-bit displacement unless
you limit the text segment size to 64 KB -- which is not a realistic
value at all).

> >         - Let's clarify the suffix order, e.g. like this (? means the
> >           suffix is currently unused, and its name is unassigned):
[...]
> wow, what a work :-)

You should see the complete flex source for the encoder; this is only
a small snippet ;)

[...]
> >         - The loadm/storem has a surprising operand order
> >           (start,src/dest,count), and it's not clear whether the
> >           register *numbers* or the register *contents* serve as the
> >           start/count values.  I suggest the former, and I would also
> >           change the operands to (firstreg, lastreg, memaddr) which is
> >           much easier to grok for humans.
> 
> some remarks :
>  - it is optional and conditioned by the presence of a SRB mechanism
>  - the 2nd register field is always the address. It must be pre-validated if possible.
>  - whether it is the contents or the value of the address does not change much
>    except that the value is known 2 cycles before or after. i'd prefer to use
>    the register number than its value, though, if possible.
>    though using the register contents might also help.

No, the register number is perfect.  After all, a programmer should know
which registers he's going to use -- at least most of the time ;)

> >           Since there are some unused flags, another variant might be
> >           interesting: `storem r2, r1', where r2 is used as a mask
> >           (bit <n> == 1 means "load/store register <n>"), and r1 is the
> >           address of the source/destination memory area (which must be
> >           big enough to hold all registers, just like the CMB).
> 
> this mask idea is interesting. It reminds me of the 6809 by the way :-)
> however it means that 4x loadcons might be necessary (in arbitrary cases)
> to backup the whole (non-contiguous) register set.

You can still use loadm/storem if you have only two or three contiguous
register blocks to save/restore.  The mask is useful when a) the registers
to save are too scattered or b) not known at compile time (emulators,
debuggers, ...), and you don't want to loop over the whole register bank
(that is, 63 times) and loadm/storem a single register each time.
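
A toy model of the masked variant, in Python.  It assumes that every
register has a fixed slot in the memory area (slot <n> for register <n>),
which is my reading of "big enough to hold all registers", not a
settled definition; the function names and the word-indexed `memory'
dict are made up for this sketch:

```python
def storem_mask(regs, mask, memory, addr):
    """Store every register whose bit is set in `mask`.

    Assumption: register n always lands at word address addr + n."""
    for n in range(len(regs)):
        if (mask >> n) & 1:
            memory[addr + n] = regs[n]

def loadm_mask(regs, mask, memory, addr):
    """Reload every register whose bit is set in `mask`."""
    for n in range(len(regs)):
        if (mask >> n) & 1:
            regs[n] = memory[addr + n]

regs = list(range(100, 164))                 # 64 dummy register values
mask = (1 << 3) | (1 << 17) | (1 << 42)      # scattered registers
memory = {}
storem_mask(regs, mask, memory, 0x1000)
print(sorted(hex(a) for a in memory))
```

One instruction and one mask replace three single-register
loadm/storem operations here.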

> >           Maybe it would be wiser to put the memory address into the
> >           rightmost operand in *all* memory operations (load, store,
> >           cachemm, loadm and storem).  Some instructions will always
> >           have the wrong operand order, though.
> right. but i still prefer to leave the "pointer" field in the middle,
> because it is the most usual case where it makes sense (at least for myself).

Ok, then let it stay that way.  After all, it's a matter of taste, and
it's MUCH easier to create machine code if the order of arguments
in an assembler instruction corresponds to the order of slots in the
instruction word.

[...]
> >  The result is a flex-based instruction encoder
> > that recognizes almost any instruction the F-CPU will have (with the
> > exceptions mentioned above).  I'll probably also build an assembler
> > around it. (I finally found a real use for my libelf library! Yeah! ;)
> 
> where's the source code ? :-)

Locked in my safe?

Just kidding ;)  But I wanted to eliminate the ambiguities first
before I release something that might be a reference implementation
for others.

> btw, please provide a "raw mode" so emulators don't need complex load functions...

You mean, relocated and linked ready-to-execute output?  I'd rather not
bloat the assembler with it but create a separate tool (do you remember
the good(?) old EXE2BIN? ;)

A simulator/debugger can really benefit from an elaborate binary format
like ELF.  It will have access to symbol names, line numbers and symbolic
debugging information (if the compiler/assembler supports it)...

Hexdumps? Nein danke :)

[...]
> expansion/reduction is another problem but i think that the SHL unit can do this,
> too.

FP expansion is trivial, but FP reduction may trigger exceptions (or at
least need rounding), and therefore has to be handled separately.

> Another proposition : make a signed and unsigned version of the integer expansion
> so we can extend the sign of the datum. This removes the "sign extension" flag
> from the move instruction and it removes one funky operation from the Xbar.

Ok, go ahead with that.  It also has the nice side effect that I can
now reclaim the -s suffix as a synonym for -m ;)

Most `unsigned' widening operations (with zero extension) can be done
with a single and[i] and/or loadcons[x] instruction (loadcons[x] needs
an additional `move' if you want to keep the old value as well):

    andi.w      $0xff, reg, reg     //  8->16
    sandi.w     $0xff, reg, reg     //  8->16 SIMD
    andi.q      $0xff, reg, reg     //  8->32
    sandi.q     $0xff, reg, reg     //  8->32 SIMD
    andi        $0xff, reg, reg     //  8->64
    loadcons.1          $0, reg     // 16->32
    loadconsx.1         $0, reg     // 16->64
    loadconsx.2         $0, reg     // 32->64

There is only one operation that always needs 2 instructions:

    sshiftli.q $16, r1, r2 ; sshiftri.q  $16, r2, r3    // 16->32 SIMD

Note that SIMD widening doesn't work with `move' at all (move doesn't
take a SIMD flag -- another reason to handle that operation in the SHL
unit, or explicitly).

`signed' widening can also be done with two shifts:

    shiftli.w   $8, r1, r2 ; shiftrai.w   $8, r2, r3    //  8->16
    sshiftli.w  $8, r1, r2 ; sshiftrai.w  $8, r2, r3    //  8->16 SIMD
    shiftli.q  $24, r1, r2 ; shiftrai.q  $24, r2, r3    //  8->32
    sshiftli.q $24, r1, r2 ; sshiftrai.q $24, r2, r3    //  8->32 SIMD
    shiftli    $56, r1, r2 ; shiftrai    $56, r2, r3    //  8->64
    shiftli.q  $16, r1, r2 ; shiftrai.q  $16, r2, r3    // 16->32
    sshiftli.q $16, r1, r2 ; sshiftrai.q $16, r2, r3    // 16->32 SIMD
    shiftli    $48, r1, r2 ; shiftrai    $48, r2, r3    // 16->64
    shiftli    $32, r1, r2 ; shiftrai    $32, r2, r3    // 32->64
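
For anyone who wants to check the shift counts, here is a Python model
of the two-shift trick (plus the and-mask for the unsigned case).  It
only models single registers, not the SIMD forms, and the function
names are invented for this sketch:

```python
def shiftl(x, n, width=64):
    """Logical left shift within a `width`-bit register."""
    return (x << n) & ((1 << width) - 1)

def shiftra(x, n, width=64):
    """Arithmetic right shift: the msb is copied into the vacated bits."""
    sign = x >> (width - 1)
    x >>= n
    if sign:
        x |= ((1 << n) - 1) << (width - n)
    return x

def sign_extend(x, src_bits, width=64):
    """Shift the source msb up to the register msb, then shift it back."""
    n = width - src_bits
    return shiftra(shiftl(x, n, width), n, width)

def zero_extend(x, src_bits):
    """Unsigned widening is a plain and with a mask."""
    return x & ((1 << src_bits) - 1)

print(hex(sign_extend(0x80, 8)))        # 8->64, negative byte
print(hex(sign_extend(0x7F, 8)))        # 8->64, positive byte
print(hex(sign_extend(0x8000, 16, 32))) # 16->32
```

The shift count is always the destination width minus the source width,
which matches the table above.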

If there is enough room in the SHL unit, we can add a little logic that
does it in one operation.  I suggest we define the `widen' instruction
as follows:

	[s]widenb[s][.b|.d|.q] r2, r1	//  8->xx
	[s]widenw[s][.b|.d|.q] r2, r1	// 16->xx
	[s]widenq[s][.b|.d|.q] r2, r1	// 32->xx
	[s]widen[s][.b|.d|.q]  r2, r1	// 64->xx

that is, [.b|.d|.q] refers to the new size, `s-' means SIMD (as usual),
and `-s' activates sign extension.  We need only a single opcode (the
source size can be encoded in the flag bits -- since the instruction
uses only two registers and no immediate operand, we have plenty of them).

Whether e.g. `widenq.b' actually truncates 32-bit values to 8-bit, and
what the result looks like when the value is not representable in the
destination size, needs to be defined.  The default (and probably the
only option for FC0) should be `chop' -- discard the upper bits --,
but signed/unsigned saturation (depending on the -s suffix) would be
nice, too.
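
To illustrate the options, a short Python sketch.  The values are plain
integers rather than register bit patterns, and the function names are
made up; how (or whether) `widen' will do any of this is still open:

```python
def chop(x, bits):
    """Discard the upper bits (modular truncation)."""
    return x & ((1 << bits) - 1)

def saturate_unsigned(x, bits):
    """Clamp a value to the unsigned range [0, 2**bits - 1]."""
    return min(max(x, 0), (1 << bits) - 1)

def saturate_signed(x, bits):
    """Clamp a value to the signed range [-2**(bits-1), 2**(bits-1) - 1]."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return min(max(x, lo), hi)

# Narrow 300 and -300 to 8 bits with each policy.
for v in (300, -300):
    print(v, chop(v, 8), saturate_unsigned(v, 8), saturate_signed(v, 8))
```

`chop' needs no extra logic at all, which is why it is the natural
default; the saturating forms need a range check on the discarded bits.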

FP conversions are a different beast.  We should have at least these:

	// FP -> FP
	32-bit FP  -> 64-bit FP		// trivial
	64-bit FP  -> 32-bit FP		// non-trivial (exceptions & rounding)

	// mandatory FP -> INT
	32-bit FP  -> 32-bit INT	// non-trivial (exceptions & rounding)
	32-bit FP  -> 64-bit INT	// non-trivial (exceptions & rounding)
	64-bit FP  -> 32-bit INT	// non-trivial (exceptions & rounding)
	64-bit FP  -> 64-bit INT	// non-trivial (exceptions & rounding)

	// mandatory INT -> FP
	64-bit INT -> 32-bit FP		// non-trivial (rounding)
	64-bit INT -> 64-bit FP		// non-trivial (rounding)

	// optional INT -> FP
	32-bit INT -> 32-bit FP		// non-trivial (rounding)
	32-bit INT -> 64-bit FP		// trivial

Note: the optional conversions can be replaced with an integer conversion
to 64-bit and one of the mandatory INT -> FP conversions.  Smaller integers
must be converted to 32-bit or larger before FP'izing them; this decision
was of course influenced by C's default integer promotion rules.
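
Why "trivial" vs. "rounding": a 64-bit FP number has a 53-bit
significand and holds every 32-bit integer exactly, while a 32-bit FP
number has only 24 bits.  A quick Python check (Python floats are IEEE
doubles; `to_single' is a helper written for this example that
round-trips through the `single' format):

```python
import struct

def to_single(x):
    """Round a Python float (IEEE double) to IEEE single and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

n32 = (1 << 24) + 1          # needs 25 significant bits
n64 = (1 << 53) + 1          # needs 54 significant bits

print(float(n32) == n32)             # 32-bit INT -> 64-bit FP
print(to_single(float(n32)) == n32)  # 32-bit INT -> 32-bit FP
print(float(n64) == n64)             # 64-bit INT -> 64-bit FP
```

Only the first conversion is exact; the other two lose the low bit and
therefore need a rounding mode.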

The INT -> FP and FP -> INT conversions should come in two flavors: one
for signed and one for unsigned integers.  That results in 8 variants of
`f2int', 4...8 variants of `int2f', plus `f2d' and `d2f' or whatever
they're going to be called.

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"