[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

(m) Re: [f-cpu] Re: Floating-Point?



hi,

i finally take some time to answer this long and interesting mail.
blame it on the pillow.

Michael Riepe wrote:
> On Wed, Aug 15, 2001 at 10:22:11AM +0200, Yann Guidon wrote:
> [...]
> > > SIMD is IMHO not reasonable for the FP units.
> > in what context are you speaking ?
> I mean: I think it's unreasonable to build *variable-size* FP units.
> There are too many special cases to consider -- rounding, exceptions,
> infinities and NANs, ... (ok, go blame IEEE for it ;)

come on, pipelined/vector FP and SIMD FP are not new.
On top of that i have added a new condition for jumps : NaN.

> > > A reasonable approach is
> > > to build a set of pipelined 64-bit FP units, and then issue the 32-bit
> > > operations in two consecutive cycles.
> > that's vectoring, then. Scheduling might become more complex,
> > in situations such as chaining for example.
> Not if it's "hidden" inside the EU.
i fear the worse now ;-)

> > I have nothing to object to that, but
> >  - 1) currently we have no FP unit
> >  - 2) SIMD already works well (when it does)
> >  - 3) vectoring will be used in another core because FC0 would require too much changes
> >  - 4) if you have 1 FP unit, the hardest is done : you can duplicate it :-P
> If you have enough room.  Do you have an idea how big the FP unit will be?
i know that others have done that. Ok, it takes "some room" but you get what you pay for.

> > > BTW: I think we need another instruction that converts 32-bit FP to 64-bit
> > > and vice versa (and maybe also does the mix/expand/sdup thingy for FP).
> > geez, the instruction set in the current version of the manual needs a big rework...
> Yep.  There are a handful of inconsistencies, typing errors, missing
> parts etc. in it.
thank you for remembering me that :-P

>  Major things I've found so far:
> 
>         - The manual doesn't state whether `modi' is a signed operation
>           I suggest it should be signed (like `divi')
i think that it goes along with the divide unit that you are doing.

>         - Complement `abs' with `nabs' (negative absolute) for
>           symmetry, and to avoid the `sign surprise' when the argument
>           is -2**(chunksize-1)

negation should be no worry, but the INC unit is still oncomplete.
it is optional, though (currently).

>         - The syntax for the rounding mode (`l2int', `f2int') is not
>           specified. I suggest to use the following syntax:
> 
>                 l2int[r|t|f|c]
>                 f2int[r|t|f|c][x]
> 
>           with these meanings:
> 
>                 -r (round)      round to nearest (default)
>                 -t (trunc)      round towards zero
>                 -f (floor)      round towards -infinity
>                 -c (ceil)       round towards +infinity

it looks ok.

>         - `int2f' and `int2l' also need rounding modes because both
>           conversions may result in precision loss if the integer operand
>           has a large value.
> 
>         - `bitop[s|c|x|t]i' should be `bitopi[s|c|x|t]' (`i' is NOT a suffix!)

i agree.

>         - Assign four opcodes for bitop[i] and increase the imm6 operand
>           to imm8 (for consistency with the rop2, shift, rot, bitrevi and
>           loadcons[x] instructions).
right.

>           Since bitop[i] is a ROP2 instruction,
>           change the function encoding to match that of rop2, that is:
> 
>                 fun  rop2  bitop
>                 ================
>                 000  and   btst
>                 001  andn  bclr
>                 010  xor   bchg
>                 011  or    bset
>                 100  nor   --
>                 101  xnor  --
>                 110  orn   --
>                 111  nand  --
> 
>           I guess we can get the missing four instructions for free,
>           but they aren't really useful.

it seems interesting but i have done something else.
i've dropped the "mode bits" instead so we have all functions available.
check my latest snapshot : f-cpu/ contains some text files which describe the
changes.

>         - The description of the ROP2 is obsolete (and the syntax for
>           combine/mux is unspecified) I suggest -o and -a suffixes for
>           combine, and a new `mux' instruction.
> 
>         - For the `andn' and `orn' instructions, the manual must
>           clearly state which operand is inverted.  IMHO, `andni' and
>           `orni' will be almost useless if we invert the leftmost
>           (== immediate) operand (but not completely useless, because
>           the upper bits differ when the chunk size is 16 or more).
> 
>           On the other hand, we could add a flag for sign extension of the
>           immediate operand and invert the middle (== register) operand.
>           Since the function bits have moved to the opcode field, there
>           should be a free flag.
> 
>         - There is no explicit `not' instruction, but users can write
>           `nor r0, r2, r1', `xnor r0, r2, r1' or similar.  Since this
>           may not be obvious, F-CPU assemblers should recognize `not
>           r2, r1' and convert it to one of the other forms internally.
>           The `not' instruction should, however, be documented in the
>           Instruction Set Manual.
> 
>         - In `bitrev[i]', use the formula `r1 = bit_reverse(r2) >> (size-r3-1)'.
>           That will change the useful range for r3 to [size-1;0].  In the
>           current version, it's [size;1] which is pretty ugly.
> 
>           Another possible variant is `r1 = bit_reverse(r2) >> r3', with
>           the same useful range but a nicer default (r3 == r0) which
>           makes the 2-operand short form `bitrev r2, r1' meaningful,
>           but that may cause trouble when the register size is increased
>           beyond 64 bits :(
> 
>         - `flog' and `fexp' should both take only two operands.
>           Remember that (a**b)**c = a**(b*c) = a**(c*b) = (a**c)**b.
>           That is, with a simple multiplication (before fexp / after
>           flog) you get any base you want, and the FP unit probably
>           works better with a fixed base.
> 
>         - We need a level-1 floating-point compare instruction;
>           `cmpl'/`cmple' may work with LNS (if there are no NANs),
>           but not with FP.

IEEE FP defines FP comparison with Integer operations.
the format has be designed specifically for this purpose
(however i don't remember what happens with NaNs etc)

>         - The arguments of `store[f]' are reversed (dest, src).  It's
>           ok that way (because it mirrors the `load' instruction) but
>           there should be a BIG FAT WARNING in the manual.

there will certainly be a change in the L/S instruction format !
the pointer that gets updated must be written somewhere and the
current fields don't match the expected behaviour. ie :
load r1, r2, r3 does : load [r2] into r3, and add r1 to r2
this means that the r2 field must be written to. IT IS NOT POSSIBLE yet.
so it will become : "load [r2] into r3, and r1 + r2 => r3^1"

>         - Some immediate instructions may benefit from a non-linear
>           encoding of the immediate operand (for example, 6 bits value +
>           2 bits left-shift).  At least this is an option for `loadi'
>           and `storei'.
maybe but it makes the decoder more complex.

>         - The naming of the memory hierarchies in the `cachemm'
>           instruction is ambiguous (in particular, the -c and -l suffixes).
>           We can still use numeric suffixes [0-7], however.
> 
>           Again, the arguments are reversed (`cachemm addr,count').
> 
>         - In the description of `move', remove the reference to `nop'.

ASAP.

>           BTW: there is no need to give `cmove' a separate name and
>           opcode.
maybe, maybe not. you are certainly right but i'll verify with some real code.

>	  If there is a condition suffix, it's a conditional move
>           (3-operand form), otherwise it's unconditional (2 operands):
> 
>                 move[s]{cond} r3, r2, r1
>                 move[s]           r2, r1
yo.

>         - We need to clarify the syntax of the `condition' suffixes for
>           `move' and `jmpa'.  I suggest
> 
>                 000  -z   (zero)
>                 001       (unassigned)
>                 010  -m   (msb == 1)
>                 011  -l   (lsb == 0)
>                 100  -nz  (not zero)
>                 101       (unassigned)
>                 110  -nm  (msb == 0)
>                 111  -nl  (lsb == 0)

in the assembler that i have written, it's written differently
and more verbosely (less confusing when you don't know the meanings).

>         - Assemblers must accept `loadcons[x] large-number' and emit a
>           suitable series of loadcons.n (or loadconsx.n) instructions
>           instead.  This is necessary for external symbol references
>           (which are resolved at link time).  Assemble-time constants
>           may be shortened to less than 64 bits, however, and if the
>           user explicitly requests `loadcons.0' or `loadconsx.0', the
>           assembler should of course do what (s)he wants (and complain
>           if the value is too large).
> 
>         - Can we please drop the `a' from `jmpa'?
probably. i don't remember where it comes from, probably from Mathias.

>           As with `move', the presence of the condition suffix indicates
>           the form of the instruction:
> 
>                 jmp[a]{cond} r3, r2 [, r1]
>                 jmp[a]           r2 [, r1]
looks ok.

>         - When calling functions through pointers, it would be nice to
>           be able to tell the F-CPU *a priori* that a register contains a
>           code address.  While this can be done with an explicit prefetch
>           (load to r0) for data pointers, there is no way to specify that
>           a register contains a code address that the CPU will have to
>           visit soon.
what about loadaddr(i) ?

>           The same is true when an absolute code address is
>           obtained via loadcons (which will probably be the common idiom
>           when a function in another object file is called, unless jump
>           tables are used -- which points us back to the `code pointer
>           in register' problem, again).
if the data/code is not explicitely prefetched, the code will still work,
but with the "late fetch" penalty : the CPU will perform the "fetch"
operation automatically while stalling the decode stage.

>           To cut a long story short: I'd like to have an instruction
>           that explicitly `tags' a register as a pointer, and probably
>           initiates a prefetch cycle (for code or data, depending on
>           the instruction's flags).  It may or may not move data from
>           one register to another (one idea I had was a `pointer move'
>           instruction); if it does, it might be a good idea to let it
>           participate in address calculation (i.e. let it be able to
>           add two operands, like the `lea' instruction on Intel CPUs).

this is what loadaddr is meant to do.

>         - Let's clarify the suffix order, e.g. like this (? means the
>           suffix is currently unused, and its name is unassigned):
> 
>                 add[c|s|?]
>                 sub[b|f|?]
>                 mul[h][s]
>                 div[m][s]
>                 mac[l|h][s]             # I suggest to allow `macl' as an alias for `mac'.
>                 scan[n][r]
>                 bitop[s|c|x|t]
>                 bitopi[s|c|x|t]
>                 mix[l|h]
>                 expand[l|h]
>                 {rop2}[a|o]
>                 {rop2i}[a|o]
>                 load[f][e][0-7]
>                 loadi[f][e][0-7]
>                 store[f][e][0-7]
>                 storei[f][e][0-7]
>                 cachemm[f|p][l][c][0-7]
>                 move[s][n][z|?|m|l]
>                 jmpa[n][z|?|m|l]
>                 serialize[s][x][m]

wow, what a work :-)

>         - Some instructions (e.g. `mac' and `addsub') could have
>           variants with an immediate operand.
glups. maybe.

>         - The loadm/storem has a surprising operand order
>           (start,src/dest,count), and it's not clear whether the
>           register *numbers* or the register *contents* serve as the
>           start/count values.  I suggest the former, and I would also
>           change the operands to (firstreg, lastreg, memaddr) which is
>           much easier to grok for humans.

some remarks :
 - it is optional and conditioned by the presence of a SRB mechanism
 - the 2nd register field is always the address. It must be pre-validated if possible.
 - whether it is the contents or the value of the address does not change much
   except that the value is know 2 cycles before or after. i'd prefer to use
   the register number than its value, though, if possible.
   though using the register contents might also help.

>           Since there are some unused flags, another variant might be
>           interesting: `storem r2, r1', where r2 is used as a mask
>           (bit <n> == 1 means "load/store register <n>"), and r1 is the
>           address of the source/destination memory area (which must be
>           big enough to hold all registers, just like the CMB).

this mask idea is interesting. It remembers me of the 6809 by the way :-)
however it means that 4x loadcons might be necessary (in arbitrary cases)
to backup the whole (non-contiguous) register set.

>           Maybe it would be wiser to put the memory address into the
>           rightmost operand in *all* memory operations (load, store,
>           cachemm, loadm and storem).  Some instructions will always
>           have the wrong operand order, though.
right. but i still prefer to leave the "pointer" field in the middle,
because it is the most usual case where it makes sense (at least for myself).

>         - And finally, the most important point: the new `nop' instruction
>           is still undocumented ;)

yes, yes i know.

> In case you wonder: I needed a break from VHDL coding (I couldn't
> even write C any more!),
all my sympathy !

> so I decided to play with something totally
> different for a while.
:-)

>  The result is a flex-based instruction encoder
> that recognizes almost any instruction the F-CPU will have (with the
> exceptions mentioned above).  I'll probably also build an assembler
> around it. (I finally found a real use for my libelf library! Yeah! ;)

where's the source code ? :-)
btw, please provide a "raw mode" so emulators don't need clomplex load functions...

> > Sure, there needs to be an expansion/reduction code for FP
> > but SDUP works for SIMD FP if the packets have the same boundaries.
> That's a different kind of operation.

?

SDUP (IIRC) "duplicates" one item of a specified size into all the rest
of the destination register. whether the data is FP or int or whatever
doesn't change anything (as long as the data size is valid).

expansion/reduction is another problem but i think that the SHL unit can do this,
too.


Another proposition : make a signed and unsigned version of the integer expansion
so we can extend the sign of the datum. This removes the "sign extension" flag
from the move instruction and it removes one funky operation from the Xbar.


wow, all these things to do remember me that the summer is soon over...
i don't think i'll be able to do all this before october begins.

>  Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/