Re: (m) Re: [f-cpu] Re: Floating-Point?

To: f-cpu@seul.org
Subject: Re: (m) Re: [f-cpu] Re: Floating-Point?
From: Yann Guidon <whygee@f-cpu.org>
Date: Fri, 17 Aug 2001 04:24:25 +0200
Delivery-Date: Thu, 16 Aug 2001 22:26:19 -0400
Organization: http://www.f-cpu.org
References: <Pine.LNX.3.96.1010814132809.3201B-100000@redwood.oekomm.de> <3B7926B0.C1FC207D@f-cpu.org> <20010814210034.54796@thrai.stud.uni-hannover.de> <3B7A3133.744063F0@f-cpu.org> <20010815231227.63958@thrai.stud.uni-hannover.de> <3B7BF3A9.5B0E4DD4@f-cpu.org> <20010817003553.36482@thrai.stud.uni-hannover.de>
Reply-To: f-cpu@seul.org
Sender: owner-f-cpu@seul.org
what a pity !
there are so many interesting and useful things here,
with a lot of insight, and the manual is still a beast to
update. maybe we should use the system of RFCs, so we don't
lose the ideas if we can't update the manual. This mail would
contain several RFCs at least.

on top of that, i didn't do anything at all today,
except answering mail. that's impossible...

Michael Riepe wrote:
> On Thu, Aug 16, 2001 at 06:24:09PM +0200, Yann Guidon wrote:
> [...]
> > > I mean: I think it's unreasonable to build *variable-size* FP units.
> > > There are too many special cases to consider -- rounding, exceptions,
> > > infinities and NANs, ... (ok, go blame IEEE for it ;)
> > come on, pipelined/vector FP and SIMD FP are not new.
> > On top of that i have added a new condition for jumps : NaN.
> Didn't we already decide that months ago?
i rediscovered it recently :-)

> [...]
> > >         - The manual doesn't state whether `modi' is a signed operation
> > >           I suggest it should be signed (like `divi')
> > i think that it goes along with the divide unit that you are doing.
> The IDU will be able to perform both signed and unsigned
> division/remainder.  The signed division will be symmetric (FORTH people
> probably know this as SM/REM).  I still have to elaborate whether
> asymmetric `floored' division and modulus (FM/MOD) is possible.
good luck !
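
For reference, the difference between the two conventions can be sketched in a few lines of Python. This only illustrates the arithmetic, not the IDU hardware; the function names `sm_rem` and `fm_mod` are borrowed from the FORTH words mentioned above and are otherwise made up for this example.

```python
def sm_rem(a, b):
    """Symmetric (truncating) division: the quotient rounds toward zero,
    so the remainder takes the sign of the dividend."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return q, a - q * b


def fm_mod(a, b):
    """Floored division: the quotient rounds toward minus infinity,
    so the remainder takes the sign of the divisor."""
    # Python's built-in divmod already uses floored division.
    return divmod(a, b)


for a, b in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
    print(a, b, "symmetric:", sm_rem(a, b), "floored:", fm_mod(a, b))
```

The two forms only disagree when the operands have different signs, which is why the unsigned case needs no such distinction.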

> > >         - We need a level-1 floating-point compare instruction;
> > >           `cmpl'/`cmple' may work with LNS (if there are no NANs),
> > >           but not with FP.
> > IEEE FP defines FP comparison with Integer operations.
> Yes, but unfortunately in sign-magnitude (not 2's complement) form.
??? i'm surprised.
but i'm not an FP expert either.

> > the format has been designed specifically for this purpose
> > (however i don't remember what happens with NaNs etc)
> IIRC, if at least one of the operands is a NAN, the result of all
> comparisons (except `!=') is always false.
it sounds logical.
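
Python floats follow the same IEEE comparison rules, so the rule quoted above is easy to check. A minimal sketch (the `results` dictionary is only there for illustration):

```python
nan = float("nan")

# Every ordered comparison and equality involving a NaN is false;
# only "not equal" is true.
results = {
    "<": nan < 1.0,
    "<=": nan <= 1.0,
    ">": nan > 1.0,
    ">=": nan >= 1.0,
    "==": nan == nan,
    "!=": nan != nan,
}

for op, value in results.items():
    print(op, value)
```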

> Mapped to (2's complement signed) integer, the order is as follows
> (assuming IEEE `single' format):
> 
>         s  eeeee  fff  meaning
>         =====================
>         0     ff   >0  NAN
>         0     ff   =0  +INF
>         0  01-fe  any  positive normal
>         0     00   >0  positive subnormal
>         0     00   =0  +0
>         1     ff   >0  NAN
>         1     ff   =0  -INF
>         1  01-fe  any  negative normal
>         1     00   >0  negative subnormal
>         1     00   =0  -0
> 
> That's obviously not correct.
where ? negative normal/subnormal ? +0/-0 ?
we'll have to ask some experts...
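
A small Python experiment suggests where the order breaks: the positive half sorts correctly, but the negative half comes out reversed (larger magnitude means larger integer), and -0 lands at the very bottom. The sketch below reinterprets IEEE `single' bit patterns as signed 32-bit integers; `bits_as_int32` and `ordered_key` are names invented for this example, and the bit-flip in `ordered_key` is one well-known software fix, not something the F-CPU manual defines. NaNs are left out on purpose.

```python
import struct


def bits_as_int32(x):
    """Reinterpret the IEEE single bit pattern of x as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<f", x))[0]


def ordered_key(x):
    """Integer key whose signed order matches the FP order for non-NaN values.

    Negative numbers are stored as sign + magnitude, so their 31 magnitude
    bits are inverted to turn them into a two's-complement-like order.
    Note that this key still ranks -0 below +0, while IEEE treats them as equal.
    """
    i = bits_as_int32(x)
    return i if i >= 0 else i ^ 0x7FFFFFFF


values = [-2.0, -1.0, -0.0, 0.0, 1.0, 2.0]
naive = sorted(values, key=bits_as_int32)
fixed = sorted(values, key=ordered_key)
print("plain integer compare:", naive)
print("with magnitude flip:  ", fixed)
```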

> > >         - The arguments of `store[f]' are reversed (dest, src).  It's
> > >           ok that way (because it mirrors the `load' instruction) but
> > >           there should be a BIG FAT WARNING in the manual.
> >
> > there will certainly be a change in the L/S instruction format !
> > the pointer that gets updated must be written somewhere and the
> > current fields don't match the expected behaviour. ie :
> > load r1, r2, r3 does : load [r2] into r3, and add r1 to r2
> > this means that the r2 field must be written to. IT IS NOT POSSIBLE yet.
> > so it will become : "load [r2] into r3, and r1 + r2 => r3^1"
> I like the current form better.  This one is too crazy.
probably but r2 can't be written to !
i think...

i'll have to re-check the instruction decoder/issue. there may be something that
i missed along the way. i was worried about the comparators that detect the
unavailable registers and i feared that a 5th would be necessary. of course
it is not the case.
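
To make the two load variants discussed above concrete, here is a toy Python model. It is a sketch under assumptions: registers are a plain list, memory is a dictionary, the function names are invented, and "r3^1" is read as "the register whose number is r3 with its lowest bit flipped".

```python
def load_current(regs, mem, r1, r2, r3):
    """Current form: load [r2] into r3, then add r1 to r2 (r2 is written)."""
    addr = regs[r2]
    regs[r3] = mem[addr]
    regs[r2] = addr + regs[r1]


def load_proposed(regs, mem, r1, r2, r3):
    """Proposed form: load [r2] into r3, and write r1 + r2 into register r3^1."""
    addr = regs[r2]
    regs[r3] = mem[addr]
    regs[r3 ^ 1] = addr + regs[r1]


regs = [0] * 64
regs[1], regs[2] = 8, 100          # increment and pointer
mem = {100: 42}
load_proposed(regs, mem, 1, 2, 4)  # data goes to r4, updated pointer to r5
print(regs[2], regs[4], regs[5])
```

Under this reading, choosing the destination so that r3^1 equals r2 (for example pointer in r2, data in r3) gives the same result as the current form.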

> [...]
> > >         - We need to clarify the syntax of the `condition' suffixes for
> > >           `move' and `jmpa'.  I suggest
> > >
> > >                 000  -z   (zero)
> > >                 001       (unassigned)
> > >                 010  -m   (msb == 1)
> > >                 011  -l   (lsb == 1)
> > >                 100  -nz  (not zero)
> > >                 101       (unassigned)
> > >                 110  -nm  (msb == 0)
> > >                 111  -nl  (lsb == 0)
> >
> > in the assembler that i have written, it's written differently
> > and more verbosely (less confusing when you don't know the meanings).
> Yep, I saw the .lsb/.msb/.and/.or suffixes.
the .and and .or suffixes are for ROP2.

>  I didn't find .zero or a negation suffix, though.
it's a higher-level trick :-)
the syntax defines the usual "move src, dest"
and "cmove cond.y == x, src, dest" for conditional moves.
i have chosen to explicitly use "cmove" because i feared
a shift-reduce problem. "y" can be nothing, ".lsb" or ".msb"
and "x" can be 0 or 1. There is an inconsistency for
"cmove r==1" (i'll fix that by inverting everything :*D)
but otherwise it is written "r.lsb==0", "r==0", "r.msb==0"
and idem for 1. the x value gives the negation directly ;-)
and the absence of specifier means the whole register, hence
zero or not.
i know, it's lame, but it works :-) and it's less confusing than
-nm for example.

>  I suggest we allow both forms -- the
> long one for humans, the short one for compilers (or people like me,
> who don't like writing a novel to flip a single bit ;).
why not.

> Summarized:

>         000  -z   .zero, .z, .null, ...
>         001  -x   .nan, .not.number, ...
>         010  -m   .msb, .msb1, ...
>         011  -l   .lsb, .lsb1, ...
>         100  -nz  .notzero, .not.zero, .nz, ...
>         101  -nx  .notnan, .not.nan, .number, ...
>         110  -nm  .notmsb, .not.msb, .msb0, ...
>         111  -nl  .notlsb, .not.lsb, .lsb0, ...
> I choose the short names, you the long names ;)
all right.
gimme some time however :-)
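
The summarized table has a regular structure: the top bit negates, the two low bits select the test. A minimal Python sketch of that decoding follows. It assumes a 64-bit register and, for the NaN test, an IEEE double bit pattern; both are assumptions of this example, as are the names `condition`, `is_nan64` and `SHORT`.

```python
MASK64 = (1 << 64) - 1

# Short suffixes from the table above, mapped to their 3-bit codes.
SHORT = {"-z": 0, "-x": 1, "-m": 2, "-l": 3,
         "-nz": 4, "-nx": 5, "-nm": 6, "-nl": 7}


def is_nan64(bits):
    """True if bits is an IEEE double NaN (exponent all ones, fraction non-zero)."""
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return exponent == 0x7FF and fraction != 0


def condition(code, value):
    """Evaluate a 3-bit condition code against a 64-bit register value."""
    value &= MASK64
    test = code & 0b011
    if test == 0b00:
        result = value == 0            # zero
    elif test == 0b01:
        result = is_nan64(value)       # NaN
    elif test == 0b10:
        result = (value >> 63) == 1    # msb set
    else:
        result = (value & 1) == 1      # lsb set
    return (not result) if code & 0b100 else result


print(condition(SHORT["-z"], 0), condition(SHORT["-nm"], 1 << 63))
```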

> In case you wonder, the `x' stands for `eXtra' or `eXception'.
thanks for this kind explanation :*)

>  I could
> also use `n' for `NAN', but that makes the instruction encoder a little
> more complex and is probably misleading anyway.  But I'll think it over.
> [...]

'i' for 'i'nvalid ?

> > >         - Can we please drop the `a' from `jmpa'?
> > probably. i don't remember where it comes from, probably from Mathias.
> It was meant to indicate an (A)bsolute jump.  Since that's the only
> one the F-CPU knows, the suffix is redundant (and it looks too much
> like `jump always').
ok let's go.

> > >         - When calling functions through pointers, it would be nice to
> > >           be able to tell the F-CPU *a priori* that a register contains a
> > >           code address.  While this can be done with an explicit prefetch
> > >           (load to r0) for data pointers, there is no way to specify that
> > >           a register contains a code address that the CPU will have to
> > >           visit soon.
> > what about loadaddr(i) ?
> Not useful.
that's sad.

Maybe... we can use one or two bits from the add/sub instructions
so they validate and prefetch the resulting pointer ?

>  Imagine a C++ `member' function -- the first (hidden)
> argument is a pointer to the class, the class contains a pointer to
> the virtual method table, and the VMT contains pointers to all the
> members.
i can understand english, french and german,
i know a few words in several european languages such as
italian, spanish, portuguese, czech, russian, polish, latvian...
but i still believe that C++ comes from Mars.

>  To call another member, you have to
> 
>         // let r1 point to the current instance
>         load r1, r2                             // get pointer to VMT (usually stored at offset 0)
>         add $offset, r2, r3             // VMT slot address
>         load r3, r4                             // get member's address
>         // argument passing omitted
>         jmp r4, r5                              // call member

what an ugly code :-(
is all this necessary ???

There is one "programming trick" in FC0 :
you use a "software barrel of 8 registers" for data and another
"barrel" for instructions. when you want to access data,
use the post-incremented form whenever possible, with an increment
such that it points to the data that you will access 8 L/S
instructions later. the code above is extremely short-sighted and
utterly inefficient.

> Both r2 (data pointer) and r4 (code pointer) are loaded from memory,
> and r3 (also a data pointer) is calculated from r2 and a constant
> (which probably has to be loadcons'd if it is too large).  But there
> is no loadaddr[i] in that sequence, and the CPU has no way to tell
> that r4 points to a function that is going to be called real soon
> (that is, its code should be prefetched to avoid a stall).
maybe we need a sort of loadaddr(i) without PC ?
i have the feeling that i miss something here...

> > >           The same is true when an absolute code address is
> > >           obtained via loadcons (which will probably be the common idiom
> > >           when a function in another object file is called, unless jump
> > >           tables are used -- which points us back to the `code pointer
> > >           in register' problem, again).
> > if the data/code is not explicitly prefetched, the code will still work,
> > but with the "late fetch" penalty : the CPU will perform the "fetch"
> > operation automatically while stalling the decode stage.
> The point is that one cannot prefetch code.  `load r4, r0' will prefetch
> the code into the D-cache, not the I-cache.
something seems broken here.

> > >           To cut a long story short: I'd like to have an instruction
> > >           that explicitly `tags' a register as a pointer, and probably
> > >           initiates a prefetch cycle (for code or data, depending on
> > >           the instruction's flags).  It may or may not move data from
> > >           one register to another (one idea I had was a `pointer move'
> > >           instruction); if it does, it might be a good idea to let it
> > >           participate in address calculation (i.e. let it be able to
> > >           add two operands, like the `lea' instruction on Intel CPUs).
> > this is what loadaddr is meant to do.
> 
> But it only works with PC-relative addressing.  While that's fine for
> conditional branches, loops and local function calls, inter-module calls
> cannot use it because the target address is resolved at link time (and
> what's worse: it may be too far away for a 16-bit displacement unless
> you limit the text segment size to 64 KB -- which is not a realistic
> value at all).

what's your conclusion, doctor ?

> > >         - Let's clarify the suffix order, e.g. like this (? means the
> > >           suffix is currently unused, and its name is unassigned):
> [...]
> > wow, what a work :-)
> 
> You should see the complete flex source for the encoder; this is only
> a small snippet ;)

i WANT to see it ;-)

> [...]
> > >         - The loadm/storem has a surprising operand order
> > >           (start,src/dest,count), and it's not clear whether the
> > >           register *numbers* or the register *contents* serve as the
> > >           start/count values.  I suggest the former, and I would also
> > >           change the operands to (firstreg, lastreg, memaddr) which is
> > >           much easier to grok for humans.
> >
> > some remarks :
> >  - it is optional and conditioned by the presence of a SRB mechanism
> >  - the 2nd register field is always the address. It must be pre-validated if possible.
> >  - whether it is the contents or the number of the register does not change much
> >    except that the value is known 2 cycles before or after. i'd prefer to use
> >    the register number than its value, though, if possible.
> >    though using the register contents might also help.
> 
> No, the register number is perfect.  After all, a programmer should know
> which registers he's going to use -- at least most of the time ;)

sure, that's why i did not care before.

> > >           Since there are some unused flags, another variant might be
> > >           interesting: `storem r2, r1', where r2 is used as a mask
> > >           (bit <n> == 1 means "load/store register <n>"), and r1 is the
> > >           address of the source/destination memory area (which must be
> > >           big enough to hold all registers, just like the CMB).
> >
> > this mask idea is interesting. It reminds me of the 6809 by the way :-)
> > however it means that 4x loadcons might be necessary (in arbitrary cases)
> > to backup the whole (non-contiguous) register set.
> 
> You can still use loadm/storem if you have only two or three contiguous
> register blocks to save/restore.  The mask is useful when a) the registers
> to save are too scattered or b) not known at compile time (emulators,
> debuggers, ...), and you don't want to loop over the whole register bank
> (that is, 63 times) and loadm/storem a single register each time.

in such a 'scattered' case, why not use a loop with a get/put
to the register set ?
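
For comparison, the mask variant boils down to the loop below. This is a toy Python model, not a hardware description: `storem_mask` and `loadm_mask` are invented names, memory is a dictionary, and each register is assumed to have its own fixed slot at `addr + n` (counted in register-sized units), in line with the requirement that the area be big enough for all registers.

```python
def storem_mask(regs, mask, mem, addr):
    """Store every register n whose mask bit n is set into slot addr + n."""
    for n in range(len(regs)):
        if (mask >> n) & 1:
            mem[addr + n] = regs[n]


def loadm_mask(regs, mask, mem, addr):
    """Reload every register n whose mask bit n is set from slot addr + n."""
    for n in range(len(regs)):
        if (mask >> n) & 1:
            regs[n] = mem[addr + n]


regs = list(range(64))               # dummy contents: register n holds n
mem = {}
mask = (1 << 3) | (1 << 17)          # save r3 and r17 only
storem_mask(regs, mask, mem, 1000)
print(sorted(mem.items()))
```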

> > >           Maybe it would be wiser to put the memory address into the
> > >           rightmost operand in *all* memory operations (load, store,
> > >           cachemm, loadm and storem).  Some instructions will always
> > >           have the wrong operand order, though.
> > right. but i still prefer to leave the "pointer" field in the middle,
> > because it is the most usual case where it makes sense (at least for myself).
> Ok, then let it stay that way.  After all, it's a matter of taste, and
> it's MUCH easier to create machine code if the order of arguments
> in an assembler instruction corresponds to the order of slots in the
> instruction word.
right. But using BISON, one can easily swap operand orders.
replace "$2" with "$4" and vice versa.

> [...]
> > >  The result is a flex-based instruction encoder
> > > that recognizes almost any instruction the F-CPU will have (with the
> > > exceptions mentioned above).  I'll probably also build an assembler
> > > around it. (I finally found a real use for my libelf library! Yeah! ;)
> >
> > where's the source code ? :-)
> 
> Locked in my safe?
> 
> Just kidding ;)
Pfiew ! i prefer that ;-)

>  But I wanted to eliminate the ambiguities first
> before I release something that might be a reference implementation
> for others.

yup.

> > btw, please provide a "raw mode" so emulators don't need complex load functions...
> 
> You mean, relocated and linked ready-to-execute output?  I'd rather not
> bloat the assembler with it but create a separate tool (do you remember
> the good(?) old EXE2BIN? ;)

i think i did not use it much.
i preferred to write my own EXE from scratch.
i have contributed an EXE header generator to NASM :-)
this way, no need of any external tool.

> A simulator/debugger can really benefit from an elaborated binary format
> like ELF.  It will have access to symbol names, line numbers and symbolic
> debugging information (if the compiler/assembler supports it)...
> 
> Hexdumps? Nein danke :)

"elaborated" sometimes becomes "utterly complex"...

> [...]
> > expansion/reduction is another problem but i think that the SHL unit can do this,
> > too.
> 
> FP expansion is trivial, but FP reduction may trigger exceptions (or at
> least need rounding), and therefore has to be handled separately.
couldn't reduction be handled in a FP unit such as fadd ?
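
The asymmetry between expansion and reduction is easy to see in Python, whose floats are IEEE doubles. The sketch below uses the standard `struct` module to round a double to single precision; `to_single` is a helper name made up for this example, and the `OverflowError` is how CPython's `struct.pack` reports a value that does not fit, not how the F-CPU would signal it.

```python
import struct


def to_single(x):
    """Round a double to IEEE single precision and widen it back to double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Expansion is exact: any single value survives the trip through double.
exact = to_single(1.5) == 1.5

# Reduction may round: 0.1 has more significant bits than a single can hold.
rounded = to_single(0.1) != 0.1

# Reduction may overflow: 1e40 is finite as a double but too large for a single.
try:
    to_single(1e40)
    overflowed = False
except OverflowError:
    overflowed = True

print(exact, rounded, overflowed)
```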

> > Another proposition : make a signed and unsigned version of the integer expansion
> > so we can extend the sign of the datum. This removes the "sign extension" flag
> > from the move instruction and it removes one funky operation from the Xbar.
> Ok, go ahead with that.
it's done in my simulator.

>  It also has the nice side effect that I can
> now reclaim the -s suffix as a synonym for -m ;)
what is "-m" ?

> Most `unsigned' widening operations (with zero extension) can be done
> with a single and[i] and/or loadcons[x] instruction (loadcons[x] needs
> an additional `move' if you want to keep the old value as well):
> 
>     andi.w      $0xff, reg, reg     //  8->16
>     sandi.w     $0xff, reg, reg     //  8->16 SIMD
>     andi.q      $0xff, reg, reg     //  8->32
>     sandi.q     $0xff, reg, reg     //  8->32 SIMD
>     andi        $0xff, reg, reg     //  8->64
>     loadcons.1          $0, reg     // 16->32
>     loadconsx.1         $0, reg     // 16->64
>     loadconsx.2         $0, reg     // 32->64
> 
> There is only one operation that always needs 2 instructions:
> 
>     sshiftli.q $16, r1, r2 ; sshiftri.q  $16, r2, r3    // 16->32 SIMD

nice snippet collection !
they certainly have a reserved place in the manual.


> Note that SIMD widening doesn't work with `move' at all (move doesn't
> take a SIMD flag -- another reason to handle that operation in the SHL
> unit, or explicitly).
right.
but the main reason is that i was worried about adding
"operation" functionalities in the Xbar.

> `signed' widening can also be done with two shifts:
> 
>     shiftli.w   $8, r1, r2 ; shiftrai.w   $8, r2, r3    //  8->16
>     sshiftli.w  $8, r1, r2 ; sshiftrai.w  $8, r2, r3    //  8->16 SIMD
>     shiftli.q  $24, r1, r2 ; shiftrai.q  $24, r2, r3    //  8->32
>     sshiftli.q $24, r1, r2 ; sshiftrai.q $24, r2, r3    //  8->32 SIMD
>     shiftli    $56, r1, r2 ; shiftrai    $56, r2, r3    //  8->64
>     shiftli.q  $16, r1, r2 ; shiftrai.q  $16, r2, r3    // 16->32
>     sshiftli.q $16, r1, r2 ; sshiftrai.q $16, r2, r3    // 16->32 SIMD
>     shiftli    $48, r1, r2 ; shiftrai    $48, r2, r3    // 16->64
>     shiftli    $32, r1, r2 ; shiftrai    $32, r2, r3    // 32->64

2 shifts ? i would prefer 1 simple instruction instead.
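
For the record, this is what the shift pair computes, modelled in Python on unsigned bit patterns. It is only a sketch: `shiftl`, `shiftra` and `sign_widen` are illustrative names that mirror the mnemonics above, and `width` stands for the operand size selected by the size suffix.

```python
def shiftl(value, n, width):
    """Logical shift left inside a width-bit operand."""
    return (value << n) & ((1 << width) - 1)


def shiftra(value, n, width):
    """Arithmetic shift right inside a width-bit operand (sign bit is replicated)."""
    sign = (value >> (width - 1)) & 1
    value >>= n
    if sign:
        value |= ((1 << n) - 1) << (width - n)
    return value


def sign_widen(value, src_bits, dst_bits):
    """Sign-extend the low src_bits of value to dst_bits using two shifts."""
    n = dst_bits - src_bits
    return shiftra(shiftl(value, n, dst_bits), n, dst_bits)


print(hex(sign_widen(0x80, 8, 16)), hex(sign_widen(0x7F, 8, 16)))
```

The left shift also discards whatever sits above the source field, so the result does not depend on stale upper bits.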

> If there is enough room in the SHL unit, we can add a little logic that
> does it in one operation.  I suggest we define the `widen' instruction
> as follows:

the 6809 defined a funny opcode : "sex" for "sign extension" :*)
now i understand much more about the meaning of Life :-D

>         [s]widenb[s][.b|.d|.q] r2, r1   //  8->xx
>         [s]widenw[s][.b|.d|.q] r2, r1   // 16->xx
>         [s]widenq[s][.b|.d|.q] r2, r1   // 32->xx
>         [s]widen[s][.b|.d|.q]  r2, r1   // 64->xx
> 
> that is, [.b|.d|.q] refers to the new size, `s-' means SIMD (as usual),
> and `-s' activates sign extension.  We need only a single opcode (the
> source size can be encoded in the flag bits -- since the instruction
> uses only two registers and no immediate operand, we have plenty of them).
exactly.

> Whether e.g. `widenq.b' actually truncates 32-bit values to 8-bit, and
> what the result looks like when the value is not representable with the
> destination size, needs to be defined.  The default (and probably the
> only option for FC0) should be `chop' -- discard the upper bits --,
> but signed/unsigned saturation (depending on the -s suffix) would be
> nice, too.

i'm too tired tonight ... i can't concentrate on this.
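
The three narrowing behaviours mentioned above differ only in what happens to values that do not fit. A minimal Python sketch on unsigned bit patterns follows; the function names are invented for this example and nothing here is a definition of the `widen' instruction.

```python
def chop(value, dst_bits):
    """Discard the upper bits (the proposed default)."""
    return value & ((1 << dst_bits) - 1)


def saturate_unsigned(value, dst_bits):
    """Clamp an unsigned value to the largest unsigned dst_bits value."""
    return min(value, (1 << dst_bits) - 1)


def saturate_signed(value, src_bits, dst_bits):
    """Clamp a signed src_bits value to the signed dst_bits range.

    The input and the result are bit patterns; the sign is decoded internally.
    """
    if (value >> (src_bits - 1)) & 1:
        value -= 1 << src_bits
    low, high = -(1 << (dst_bits - 1)), (1 << (dst_bits - 1)) - 1
    clamped = max(low, min(high, value))
    return clamped & ((1 << dst_bits) - 1)


x = 0x12345678
print(hex(chop(x, 8)), hex(saturate_unsigned(x, 8)), hex(saturate_signed(x, 32, 8)))
```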

> FP conversions are a different beast.  We should have at least these:
> 
>         // FP -> FP
>         32-bit FP  -> 64-bit FP         // trivial
>         64-bit FP  -> 32-bit FP         // non-trivial (exceptions & rounding)
> 
>         // mandatory FP -> INT
>         32-bit FP  -> 32-bit INT        // non-trivial (exceptions & rounding)
>         32-bit FP  -> 64-bit INT        // non-trivial (exceptions & rounding)
>         64-bit FP  -> 32-bit INT        // non-trivial (exceptions & rounding)
>         64-bit FP  -> 64-bit INT        // non-trivial (exceptions & rounding)
> 
>         // mandatory INT -> FP
>         64-bit INT -> 32-bit FP         // non-trivial (rounding)
>         64-bit INT -> 64-bit FP         // non-trivial (rounding)
> 
>         // optional INT -> FP
>         32-bit INT -> 32-bit FP         // non-trivial (rounding)
>         32-bit INT -> 64-bit FP         // trivial
> 
> Note: the optional conversions can be replaced with an integer conversion
> to 64-bit and one of the mandatory INT -> FP conversions.  Smaller integers
> must be converted to 32-bit or larger before FP'izing them; this decision
> was of course influenced by C's default integer promotion rules.
i didn't know about this detail.
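
The "trivial" versus "non-trivial (rounding)" labels in the INT -> FP list can be checked with a few numbers. In the Python sketch below, `int_to_double` and `int_to_single` are illustrative helpers; the single-precision one goes through a double first, which is only safe for 32-bit inputs because those are exactly representable as doubles.

```python
import struct


def int_to_double(n):
    """Convert an integer to IEEE double (Python's float), rounding to nearest even."""
    return float(n)


def int_to_single(n):
    """Convert a 32-bit integer to IEEE single, returned as a Python float."""
    return struct.unpack("<f", struct.pack("<f", float(n)))[0]


# 32-bit INT -> 64-bit FP is exact: a double has 53 significant bits.
exact_32_to_64 = int(int_to_double(2**31 - 1)) == 2**31 - 1

# 64-bit INT -> 64-bit FP must round once the value needs more than 53 bits.
rounds_64_to_64 = int_to_double(2**53 + 1) == 2**53

# 32-bit INT -> 32-bit FP must round once the value needs more than 24 bits.
rounds_32_to_32 = int_to_single(2**24 + 1) == 2**24

print(exact_32_to_64, rounds_64_to_64, rounds_32_to_32)
```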

> The INT -> FP and FP -> INT conversions should come in two flavors: one
> for signed and one for unsigned integers.  That results in 8 variants of
> `f2int', 4...8 variants of `int2f', plus `f2d' and `d2f' or whatever
> they're going to be called.

this is certainly going to increase the size of the manual...

>  Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~