
Re: [f-cpu] Re: Floating-Point?



On Wed, Aug 15, 2001 at 10:22:11AM +0200, Yann Guidon wrote:
[...]
> > SIMD is IMHO not reasonable for the FP units.
> in what context are you speaking ?

I mean: I think it's unreasonable to build *variable-size* FP units.
There are too many special cases to consider -- rounding, exceptions,
infinities and NaNs, ... (ok, go blame IEEE for it ;)

> > A reasonable approach is
> > to build a set of pipelined 64-bit FP units, and then issue the 32-bit
> > operations in two consecutive cycles.
> that's vectoring, then. Scheduling might become more complex,
> in situations such as chaining for example.

Not if it's "hidden" inside the EU.

> I have nothing to object to that, but
>  - 1) currently we have no FP unit
>  - 2) SIMD already works well (when it does)
>  - 3) vectoring will be used in another core because FC0 would require too many changes
>  - 4) if you have 1 FP unit, the hardest is done : you can duplicate it :-P

If you have enough room.  Do you have an idea how big the FP unit will be?

> > BTW: I think we need another instruction that converts 32-bit FP to 64-bit
> > and vice versa (and maybe also does the mix/expand/sdup thingy for FP).
> 
> geez, the instruction set in the current version of the manual needs a big rework...

Yep.  There are a handful of inconsistencies, typos, missing
parts etc. in it.  Major things I've found so far:

	- The manual doesn't state whether `modi' is a signed operation.
	  I suggest it should be signed (like `divi').
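
	  To make the signed/unsigned question concrete, here is a small C
	  sketch (the function names are made up for illustration).  The
	  same register bits give different remainders depending on the
	  interpretation.  The signed version assumes that `divi' truncates
	  towards zero, as C99 division does, so that
	  (a / b) * b + (a % b) == a holds.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical signed `modi': C99 `%' truncates towards zero,
   so the result has the sign of the dividend: -7 % 3 == -1. */
int64_t modi_signed(int64_t r2, int64_t imm)
{
    assert(imm != 0);
    if (imm == -1)
        return 0;   /* avoid the INT64_MIN % -1 overflow; remainder is 0 */
    return r2 % imm;
}

/* Hypothetical unsigned `modi' on the very same bits. */
uint64_t modi_unsigned(uint64_t r2, uint64_t imm)
{
    assert(imm != 0);
    return r2 % imm;
}
```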

	- Complement `abs' with `nabs' (negative absolute) for
	  symmetry, and to avoid the `sign surprise' when the argument
	  is -2**(chunksize-1)
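
	  A quick C model of the `sign surprise' on 8-bit chunks (abs8 and
	  nabs8 are illustrative names, and the conversion back to int8_t
	  assumes the usual two's complement behaviour).  abs of -128 wraps
	  back to -128, while nabs always has a representable result because
	  the negative range is one element larger than the positive one.

```c
#include <assert.h>
#include <stdint.h>

/* Hardware-style abs on an 8-bit chunk: negate if negative, keep 8 bits. */
int8_t abs8(int8_t x)
{
    uint8_t u = (uint8_t)x;
    if (x < 0)
        u = (uint8_t)(0u - u);   /* wrap-around negation, no signed overflow */
    return (int8_t)u;
}

/* Negative absolute value: -|x| is representable for every 8-bit x. */
int8_t nabs8(int8_t x)
{
    uint8_t u = (uint8_t)x;
    if (x > 0)
        u = (uint8_t)(0u - u);
    return (int8_t)u;
}
```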

	- The syntax for the rounding mode (`l2int', `f2int') is not
	  specified. I suggest the following syntax:

		l2int[r|t|f|c]
		f2int[r|t|f|c][x]

	  with these meanings:

		-r (round)	round to nearest (default)
		-t (trunc)	round towards zero
		-f (floor)	round towards -infinity
		-c (ceil)	round towards +infinity
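
	  The four proposed modes, sketched in C on doubles without
	  <math.h>.  `f2int' and `enum round_mode' are hypothetical names.
	  The sketch assumes that -r means IEEE round-to-nearest with ties
	  to even (the proposal only says "round to nearest"), and that x
	  lies within the int64_t range.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical codes for the proposed f2int[r|t|f|c] suffixes. */
enum round_mode { RND_R, RND_T, RND_F, RND_C };

int64_t f2int(double x, enum round_mode m)
{
    int64_t t  = (int64_t)x;                   /* C casts truncate (-t) */
    int64_t fl = (x < (double)t) ? t - 1 : t;  /* floor */
    double frac = x - (double)fl;              /* in [0, 1), exact here */

    switch (m) {
    case RND_T: return t;
    case RND_F: return fl;
    case RND_C: return frac > 0.0 ? fl + 1 : fl;
    case RND_R:                                /* nearest, ties to even */
    default:
        if (frac > 0.5 || (frac == 0.5 && (fl & 1)))
            return fl + 1;
        return fl;
    }
}
```

	  Only -r has to care about ties: 2.5 rounds to 2 and 3.5 to 4.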

	- `int2f' and `int2l' also need rounding modes because both
	  conversions may result in precision loss if the integer operand
	  has a large value.
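
	  An illustration of the precision loss, assuming `f' stands for
	  32-bit IEEE single precision with its 24-bit significand: 2**24 + 1
	  lies exactly between two floats, so a conversion has to pick one.
	  C leaves that choice implementation-defined (in practice it is
	  round-to-nearest-even), which is the kind of decision a
	  rounding-mode suffix would make explicit.  int2f_exact is a
	  made-up helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if x survives a round trip through a 32-bit float unchanged. */
bool int2f_exact(int32_t x)
{
    float f = (float)x;   /* may round; nearest-even on common compilers */
    return (int64_t)f == (int64_t)x;
}
```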

	- `bitop[s|c|x|t]i' should be `bitopi[s|c|x|t]' (`i' is NOT a suffix!)

	- Assign four opcodes for bitop[i] and increase the imm6 operand
	  to imm8 (for consistency with the rop2, shift, rot, bitrevi and
	  loadcons[x] instructions).  Since bitop[i] is a ROP2 instruction,
	  change the function encoding to match that of rop2, that is:

		fun  rop2  bitop
		================
		000  and   btst
		001  andn  bclr
		010  xor   bchg
		011  or    bset
		100  nor   --
		101  xnor  --
		110  orn   --
		111  nand  --

	  I guess we can get the missing four instructions for free,
	  but they aren't really useful.
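
	  The table suggests that bitop can reuse the ROP2 logic with a
	  one-hot mask.  A C sketch of that idea; rop2() and bitop() are
	  hypothetical names, and for andn/orn it assumes the *second*
	  operand is the inverted one (which is exactly what the manual
	  leaves open, see below).  btst is modelled as returning the
	  masked bit.

```c
#include <assert.h>
#include <stdint.h>

/* The eight ROP2 functions, selected by the 3-bit `fun' field. */
uint64_t rop2(unsigned fun, uint64_t a, uint64_t b)
{
    switch (fun & 7) {
    case 0:  return a & b;      /* and  */
    case 1:  return a & ~b;     /* andn (assumed: b inverted) */
    case 2:  return a ^ b;      /* xor  */
    case 3:  return a | b;      /* or   */
    case 4:  return ~(a | b);   /* nor  */
    case 5:  return ~(a ^ b);   /* xnor */
    case 6:  return a | ~b;     /* orn  (assumed: b inverted) */
    default: return ~(a & b);   /* nand */
    }
}

/* bitop = rop2 with mask (1 << n): fun 0..3 = btst, bclr, bchg, bset.
   The shift count is taken modulo 64 (an assumption). */
uint64_t bitop(unsigned fun, uint64_t r2, unsigned n)
{
    return rop2(fun, r2, (uint64_t)1 << (n & 63));
}
```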

	- The description of the ROP2 instructions is obsolete, and the
	  syntax for combine/mux is unspecified.  I suggest -o and -a
	  suffixes for combine, and a new `mux' instruction.
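
	  One plausible reading of `mux' is a bitwise select, sketched
	  below in C.  The name and the operand order (mask, a, b) are
	  guesses, not taken from the manual.  Without a dedicated
	  instruction the same result needs three ROP2 operations (and,
	  andn, or).

```c
#include <assert.h>
#include <stdint.h>

/* Take each bit from `a' where the mask bit is 1, from `b' where it is 0. */
uint64_t mux(uint64_t mask, uint64_t a, uint64_t b)
{
    return (a & mask) | (b & ~mask);
}
```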

	- For the `andn' and `orn' instructions, the manual must
	  clearly state which operand is inverted.  IMHO, `andni' and
	  `orni' will be almost useless if we invert the leftmost
	  (== immediate) operand (but not completely useless, because
	  the upper bits differ when the chunk size is 16 or more).

	  On the other hand, we could add a flag for sign extension of the
	  immediate operand and invert the middle (== register) operand.
	  Since the function bits have moved to the opcode field, there
	  should be a free flag.
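
	  The `almost useless' argument in C, with a zero-extended 8-bit
	  immediate and the immediate taken as the inverted operand (andi,
	  andni_inv_imm and chunk_mask are hypothetical helpers).  On 8-bit
	  chunks, inverting the immediate is the same as andi with ~imm; on
	  16-bit chunks it sets the upper byte of the mask, which a
	  zero-extended andi cannot do.

```c
#include <assert.h>
#include <stdint.h>

/* Mask for a chunk of `bits' width; only 8- and 16-bit chunks modelled. */
static uint16_t chunk_mask(unsigned bits)
{
    assert(bits == 8 || bits == 16);
    return bits == 8 ? 0x00FF : 0xFFFF;
}

/* andi: r1 = r2 & zext(imm8) */
uint16_t andi(uint16_t r2, uint8_t imm, unsigned bits)
{
    return r2 & imm & chunk_mask(bits);
}

/* andni with the immediate inverted: r1 = r2 & ~zext(imm8) */
uint16_t andni_inv_imm(uint16_t r2, uint8_t imm, unsigned bits)
{
    return r2 & (uint16_t)~(uint16_t)imm & chunk_mask(bits);
}
```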

	- There is no explicit `not' instruction, but users can write
	  `nor r0, r2, r1', `xnor r0, r2, r1' or similar.  Since this
	  may not be obvious, F-CPU assemblers should recognize `not
	  r2, r1' and convert it to one of the other forms internally.
	  The `not' instruction should, however, be documented in the
	  Instruction Set Manual.
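
	  A sketch of both halves of this point in C: the identities that
	  make the aliases work (r0 reads as zero), and a hypothetical
	  assembler helper that rewrites `not r2, r1' as `nor r0, r2, r1'.
	  expand_not is an illustrative name, not part of any existing tool.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* The identities behind the alias, with r0 == 0. */
uint64_t not_via_nor(uint64_t x)  { return ~(0 | x); }
uint64_t not_via_xnor(uint64_t x) { return ~(0 ^ x); }

/* Expand `not rS, rD' into `nor r0, rS, rD'.
   Returns the length of the generated text, as snprintf does. */
int expand_not(char *out, size_t len, unsigned src, unsigned dst)
{
    return snprintf(out, len, "nor r0, r%u, r%u", src, dst);
}
```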

	- In `bitrev[i]', use the formula `r1 = bit_reverse(r2) >> (size-r3-1)'.
	  That will change the useful range for r3 to [size-1;0].  In the
	  current version, it's [size;1] which is pretty ugly.

	  Another possible variant is `r1 = bit_reverse(r2) >> r3', with
	  the same useful range but a nicer default (r3 == r0) which
	  makes the 2-operand short form `bitrev r2, r1' meaningful,
	  but that may cause trouble when the register size is increased
	  beyond 64 bits :(
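
	  Both bitrev variants, sketched in C for a fixed 64-bit register
	  (size == 64 is an assumption; bit_reverse, bitrev and bitrev_alt
	  are illustrative names).  With the first formula, r3 selects how
	  many low bits (r3 + 1) are reversed; with the second, r3 == 0
	  reverses the whole register.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse all 64 bits of x. */
uint64_t bit_reverse(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

/* Proposal 1: r1 = bit_reverse(r2) >> (size - r3 - 1), r3 in [63;0]. */
uint64_t bitrev(uint64_t r2, unsigned r3)
{
    assert(r3 < 64);
    return bit_reverse(r2) >> (64 - r3 - 1);
}

/* Proposal 2: r1 = bit_reverse(r2) >> r3; r3 == 0 reverses everything. */
uint64_t bitrev_alt(uint64_t r2, unsigned r3)
{
    assert(r3 < 64);
    return bit_reverse(r2) >> r3;
}
```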

	- `flog' and `fexp' should both take only two operands.
	  Remember that (a**b)**c = a**(b*c) = a**(c*b) = (a**c)**b.
	  That is, with a simple multiplication (before fexp / after
	  flog) you get any base you want, and the FP unit probably
	  works better with a fixed base.
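
	  The base change written out, assuming the fixed base is 2 (the
	  proposal does not say which base the FP unit would use) and
	  b > 0, x > 0 where logarithms are taken.  The factor log2(b), or
	  its reciprocal, is a constant that can be computed once:

```latex
b^{x} = \left(2^{\log_2 b}\right)^{x} = 2^{\,x \log_2 b},
\qquad
\log_b x = \frac{\log_2 x}{\log_2 b} = \log_2 x \cdot \frac{1}{\log_2 b}
```

	  For example, 10**3 = 2**(3 * log2(10)), with 3 * log2(10) being
	  approximately 9.966.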

	- We need a level-1 floating-point compare instruction;
	  `cmpl'/`cmple' may work with LNS (if there are no NaNs),
	  but not with FP.
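
	  Why a plain integer compare is not enough, sketched in C
	  (float_bits and float_key are illustrative names).  IEEE numbers
	  are stored as sign and magnitude, so for two negative values a
	  signed integer compare of the bit patterns gives the reverse
	  order.  The key function uses a well-known remapping that
	  restores the numeric order for non-NaN values; note that it
	  orders -0.0 below +0.0, whereas IEEE says they compare equal.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The bits an integer compare such as `cmpl' would see. */
int32_t float_bits(float f)
{
    int32_t i;
    memcpy(&i, &f, sizeof i);
    return i;
}

/* Map sign-magnitude to an unsigned key in numeric order (no NaNs):
   negative values get all bits flipped, positive ones get the sign set. */
uint32_t float_key(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}
```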

	- The arguments of `store[f]' are reversed (dest, src).  It's
	  ok that way (because it mirrors the `load' instruction) but
	  there should be a BIG FAT WARNING in the manual.

	- Some immediate instructions may benefit from a non-linear
	  encoding of the immediate operand (for example, a 6-bit value
	  plus a 2-bit left shift).  At least this is an option for `loadi'
	  and `storei'.
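
	  A sketch of the 6+2 bit encoding in C.  The bit layout (shift in
	  bits 7..6, value in bits 5..0) and the function names are
	  assumptions.  The decoded range grows from 0..255 for a plain
	  unsigned imm8 to 0..504, at the cost of holes: above 63, only
	  multiples of 2, 4 or 8 are representable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* imm8 = [shift:2][value:6]; decoded value = value << shift. */
uint32_t imm_decode(uint8_t imm8)
{
    return (uint32_t)(imm8 & 0x3F) << (imm8 >> 6);
}

/* Encode x using the smallest possible shift; false if not encodable. */
bool imm_encode(uint32_t x, uint8_t *imm8)
{
    for (unsigned s = 0; s < 4; s++) {
        uint32_t low = (1u << s) - 1;    /* bits that the shift drops */
        if ((x & low) == 0 && (x >> s) < 64) {
            *imm8 = (uint8_t)((s << 6) | (x >> s));
            return true;
        }
    }
    return false;
}
```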

	- The naming of the memory hierarchies in the `cachemm'
	  instruction is ambiguous (in particular, the -c and -l suffixes).
	  We can still use numeric suffixes [0-7], however.

	  Again, the arguments are reversed (`cachemm addr,count').

	- In the description of `move', remove the reference to `nop'.
	  BTW: there is no need to give `cmove' a separate name and
	  opcode.  If there is a condition suffix, it's a conditional move
	  (3-operand form), otherwise it's unconditional (2 operands):

		move[s]{cond} r3, r2, r1
		move[s]           r2, r1

	- We need to clarify the syntax of the `condition' suffixes for
	  `move' and `jmpa'.  I suggest

		000  -z   (zero)
		001       (unassigned)
		010  -m   (msb == 1)
		011  -l   (lsb == 1)
		100  -nz  (not zero)
		101       (unassigned)
		110  -nm  (msb == 0)
		111  -nl  (lsb == 0)
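
	  The encoding reads naturally as `bit 2 negates, bits 1..0 select
	  the test'.  A C sketch of that decoding (cond_eval is an
	  illustrative name), assuming a 64-bit register without SIMD
	  chunking and reading -l as lsb == 1, the counterpart of -nl
	  (lsb == 0):

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the condition holds, 0 if not, -1 for unassigned codes. */
int cond_eval(unsigned cond, uint64_t r)
{
    int t;

    switch (cond & 3) {
    case 0:  t = (r == 0);       break;  /* -z */
    case 2:  t = (int)(r >> 63); break;  /* -m: msb == 1 */
    case 3:  t = (int)(r & 1);   break;  /* -l: lsb == 1 */
    default: return -1;                  /* 001 and 101 are unassigned */
    }
    return (cond & 4) ? !t : t;          /* 1xx: -nz, -nm, -nl */
}
```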

	- Assemblers must accept `loadcons[x] large-number' and emit a
	  suitable series of loadcons.n (or loadconsx.n) instructions
	  instead.  This is necessary for external symbol references
	  (which are resolved at link time).  Assemble-time constants
	  may be shortened to less than 64 bits, however, and if the
	  user explicitly requests `loadcons.0' or `loadconsx.0', the
	  assembler should of course do what (s)he wants (and complain
	  if the value is too large).
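
	  A sketch of the splitting step in C.  loadcons_split is a
	  hypothetical helper; it assumes loadcons.n writes the 16-bit chunk
	  at bit position 16*n and that any chunk the assembler does not
	  emit ends up zero (for example because loadcons.0 clears the upper
	  bits).  Sign-extending loadconsx sequences for negative constants
	  are not covered.  Link-time symbols always get all four pieces,
	  since their value is unknown when the code is assembled.

```c
#include <assert.h>
#include <stdint.h>

/* Split value into the operands of loadcons.0 .. loadcons.3 and return
   how many instructions are needed (4 for link-time relocations). */
int loadcons_split(uint64_t value, int is_reloc, uint16_t part[4])
{
    int n = 1;                       /* at least loadcons.0 */

    for (int i = 0; i < 4; i++) {
        part[i] = (uint16_t)(value >> (16 * i));
        if (part[i] != 0)
            n = i + 1;               /* highest non-zero chunk so far */
    }
    return is_reloc ? 4 : n;
}
```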

	- Can we please drop the `a' from `jmpa'?

	  As with `move', the presence of the condition suffix indicates
	  the form of the instruction:

		jmp[a]{cond} r3, r2 [, r1]
		jmp[a]           r2 [, r1]

	- When calling functions through pointers, it would be nice to
	  be able to tell the F-CPU *a priori* that a register contains a
	  code address.  While this can be done with an explicit prefetch
	  (load to r0) for data pointers, there is no way to specify that
	  a register contains a code address that the CPU will have to
	  visit soon.  The same is true when an absolute code address is
	  obtained via loadcons (which will probably be the common idiom
	  when a function in another object file is called, unless jump
	  tables are used -- which points us back to the `code pointer
	  in register' problem, again).

	  To cut a long story short: I'd like to have an instruction
	  that explicitly `tags' a register as a pointer, and probably
	  initiates a prefetch cycle (for code or data, depending on
	  the instruction's flags).  It may or may not move data from
	  one register to another (one idea I had was a `pointer move'
	  instruction); if it does, it might be a good idea to let it
	  participate in address calculation (i.e. let it be able to
	  add two operands, like the `lea' instruction on Intel CPUs).

	- Let's clarify the suffix order, e.g. like this (? means the
	  suffix is currently unused, and its name is unassigned):

		add[c|s|?]
		sub[b|f|?]
		mul[h][s]
		div[m][s]
		mac[l|h][s]		# I suggest allowing `macl' as an alias for `mac'.
		scan[n][r]
		bitop[s|c|x|t]
		bitopi[s|c|x|t]
		mix[l|h]
		expand[l|h]
		{rop2}[a|o]
		{rop2i}[a|o]
		load[f][e][0-7]
		loadi[f][e][0-7]
		store[f][e][0-7]
		storei[f][e][0-7]
		cachemm[f|p][l][c][0-7]
		move[s][n][z|?|m|l]
		jmpa[n][z|?|m|l]
		serialize[s][x][m]

	- Some instructions (e.g. `mac' and `addsub') could have
	  variants with an immediate operand.

	- The loadm/storem instructions have a surprising operand order
	  (start,src/dest,count), and it's not clear whether the
	  register *numbers* or the register *contents* serve as the
	  start/count values.  I suggest the former, and I would also
	  change the operands to (firstreg, lastreg, memaddr) which is
	  much easier to grok for humans.

	  Since there are some unused flags, another variant might be
	  interesting: `storem r2, r1', where r2 is used as a mask
	  (bit <n> == 1 means "load/store register <n>"), and r1 is the
	  address of the source/destination memory area (which must be
	  big enough to hold all registers, just like the CMB).

	  Maybe it would be wiser to put the memory address into the
	  rightmost operand in *all* memory operations (load, store,
	  cachemm, loadm and storem).  Some instructions will always
	  have the wrong operand order, though.
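
	  The mask variant, sketched in C.  storem_mask and loadm_mask are
	  hypothetical names; the sketch assumes 64 registers, a save area
	  with one slot per register (like the CMB), and that writes to r0
	  are discarded.

```c
#include <assert.h>
#include <stdint.h>

#define NREGS 64

/* storem r2, r1: store register n to slot n if bit n of the mask is set. */
void storem_mask(const uint64_t regs[NREGS], uint64_t mask, uint64_t mem[NREGS])
{
    for (unsigned n = 0; n < NREGS; n++)
        if (mask & ((uint64_t)1 << n))
            mem[n] = regs[n];
}

/* loadm r2, r1: the reverse direction; r0 is never written. */
void loadm_mask(uint64_t regs[NREGS], uint64_t mask, const uint64_t mem[NREGS])
{
    for (unsigned n = 1; n < NREGS; n++)
        if (mask & ((uint64_t)1 << n))
            regs[n] = mem[n];
}
```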

	- And finally, the most important point: the new `nop' instruction
	  is still undocumented ;)

In case you wonder: I needed a break from VHDL coding (I couldn't
even write C any more!), so I decided to play with something totally
different for a while.  The result is a flex-based instruction encoder
that recognizes almost any instruction the F-CPU will have (with the
exceptions mentioned above).  I'll probably also build an assembler
around it. (I finally found a real use for my libelf library! Yeah! ;)

> Sure, there needs to be an expansion/reduction code for FP
> but SDUP works for SIMD FP if the packets have the same boundaries.

That's a different kind of operation.

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/