[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [f-cpu] Re: Floating-Point?
On Wed, Aug 15, 2001 at 10:22:11AM +0200, Yann Guidon wrote:
[...]
> > SIMD is IMHO not reasonable for the FP units.
> in what context are you speaking ?
I mean: I think it's unreasonable to build *variable-size* FP units.
There are too many special cases to consider -- rounding, exceptions,
infinities and NANs, ... (ok, go blame IEEE for it ;)
> > A reasonable approach is
> > to build a set of pipelined 64-bit FP units, and then issue the 32-bit
> > operations in two consecutive cycles.
> that's vectoring, then. Scheduling might become more complex,
> in situations such as chaining for example.
Not if it's "hidden" inside the EU.
> I have nothing to object to that, but
> - 1) currently we have no FP unit
> - 2) SIMD already works well (when it does)
> - 3) vectoring will be used in another core because FC0 would require too much changes
> - 4) if you have 1 FP unit, the hardest is done : you can duplicate it :-P
If you have enough room. Do you have an idea how big the FP unit will be?
> > BTW: I think we need another instruction that converts 32-bit FP to 64-bit
> > and vice versa (and maybe also does the mix/expand/sdup thingy for FP).
>
> geez, the instruction set in the current version of the manual needs a big rework...
Yep. There are a handful of inconsistencies, typing errors, missing
parts etc. in it. Major things I've found so far:
- The manual doesn't state whether `modi' is a signed operation
I suggest it should be signed (like `divi')
- Complement `abs' with `nabs' (negative absolute) for
symmetry, and to avoid the `sign surprise' when the argument
is -2**(chunksize-1)
- The syntax for the rounding mode (`l2int', `f2int') is not
specified. I suggest to use the following syntax:
l2int[r|t|f|c]
f2int[r|t|f|c][x]
with these meanings:
-r (round) round to nearest (default)
-t (trunc) round towards zero
-f (floor) round towards -infinity
-c (ceil) round towards +infinity
- `int2f' and `int2l' also need rounding modes because both
conversions may result in precision loss if the integer operand
has a large value.
- `bitop[s|c|x|t]i' should be `bitopi[s|c|x|t]' (`i' is NOT a suffix!)
- Assign four opcodes for bitop[i] and increase the imm6 operand
to imm8 (for consistency with the rop2, shift, rot, bitrevi and
loadcons[x] instructions). Since bitop[i] is a ROP2 instruction,
change the function encoding to match that of rop2, that is:
fun rop2 bitop
================
000 and btst
001 andn bclr
010 xor bchg
011 or bset
100 nor --
101 xnor --
110 orn --
111 nand --
I guess we can get the missing four instructions for free,
but they aren't really useful.
- The description of the ROP2 is obsolete (and the syntax for
combine/mux is unspecified) I suggest -o and -a suffixes for
combine, and a new `mux' instruction.
- For the `andn' and `orn' instructions, the manual must
clearly state which operand is inverted. IMHO, `andni' and
`orni' will be almost useless if we invert the leftmost
(== immediate) operand (but not completely useless, because
the upper bits differ when the chunk size is 16 or more).
On the other hand, we could add a flag for sign extension of the
immediate operand and invert the middle (== register) operand.
Since the function bits have moved to the opcode field, there
should be a free flag.
- There is no explicit `not' instruction, but users can write
`nor r0, r2, r1', `xnor r0, r2, r1' or similar. Since this
may not be obvious, F-CPU assemblers should recognize `not
r2, r1' and convert it to one of the other forms internally.
The `not' instruction should, however, be documented in the
Instruction Set Manual.
- In `bitrev[i]', use the formula `r1 = bit_reverse(r2) >> (size-r3-1)'.
That will change the useful range for r3 to [size-1;0]. In the
current version, it's [size;1] which is pretty ugly.
Another possible variant is `r1 = bit_reverse(r2) >> r3', with
the same useful range but a nicer default (r3 == r0) which
makes the 2-operand short form `bitrev r2, r1' meaningful,
but that may cause trouble when the register size is increased
beyond 64 bits :(
- `flog' and `fexp' should both take only two operands.
Remember that (a**b)**c = a**(b*c) = a**(c*b) = (a**c)**b.
That is, with a simple multiplication (before fexp / after
flog) you get any base you want, and the FP unit probably
works better with a fixed base.
- We need a level-1 floating-point compare instruction;
`cmpl'/`cmple' may work with LNS (if there are no NANs),
but not with FP.
- The arguments of `store[f]' are reversed (dest, src). It's
ok that way (because it mirrors the `load' instruction) but
there should be a BIG FAT WARNING in the manual.
- Some immediate instructions may benefit from a non-linear
encoding of the immediate operand (for example, 6 bits value +
2 bits left-shift). At least this is an option for `loadi'
and `storei'.
- The naming of the memory hierarchies in the `cachemm'
instruction is ambiguous (in particular, the -c and -l suffixes).
We can still use numeric suffixes [0-7], however.
Again, the arguments are reversed (`cachemm addr,count').
- In the description of `move', remove the reference to `nop'.
BTW: there is no need to give `cmove' a separate name and
opcode. If there is a condition suffix, it's a conditional move
(3-operand form), otherwise it's unconditional (2 operands):
move[s]{cond} r3, r2, r1
move[s] r2, r1
- We need to clarify the syntax of the `condition' suffixes for
`move' and `jmpa'. I suggest
000 -z (zero)
001 (unassigned)
010 -m (msb == 1)
011 -l (lsb == 0)
100 -nz (not zero)
101 (unassigned)
110 -nm (msb == 0)
111 -nl (lsb == 0)
- Assemblers must accept `loadcons[x] large-number' and emit a
suitable series of loadcons.n (or loadconsx.n) instructions
instead. This is necessary for external symbol references
(which are resolved at link time). Assemble-time constants
may be shortened to less than 64 bits, however, and if the
user explicitly requests `loadcons.0' or `loadconsx.0', the
assembler should of course do what (s)he wants (and complain
if the value is too large).
- Can we please drop the `a' from `jmpa'?
As with `move', the presence of the condition suffix indicates
the form of the instruction:
jmp[a]{cond} r3, r2 [, r1]
jmp[a] r2 [, r1]
- When calling functions through pointers, it would be nice to
be able to tell the F-CPU *a priori* that a register contains a
code address. While this can be done with an explicit prefetch
(load to r0) for data pointers, there is no way to specify that
a register contains a code address that the CPU will have to
visit soon. The same is true when an absolute code address is
obtained via loadcons (which will probably be the common idiom
when a function in another object file is called, unless jump
tables are used -- which points us back to the `code pointer
in register' problem, again).
To cut a long story short: I'd like to have an instruction
that explicitly `tags' a register as a pointer, and probably
initiates a prefetch cycle (for code or data, depending on
the instruction's flags). It may or may not move data from
one register to another (one idea I had was a `pointer move'
instruction); if it does, it might be a good idea to let it
participate in address calculation (i.e. let it be able to
add two operands, like the `lea' instruction on Intel CPUs).
- Let's clarify the suffix order, e.g. like this (? means the
suffix is currently unused, and its name is unassigned):
add[c|s|?]
sub[b|f|?]
mul[h][s]
div[m][s]
mac[l|h][s] # I suggest to allow `macl' as an alias for `mac'.
scan[n][r]
bitop[s|c|x|t]
bitopi[s|c|x|t]
mix[l|h]
expand[l|h]
{rop2}[a|o]
{rop2i}[a|o]
load[f][e][0-7]
loadi[f][e][0-7]
store[f][e][0-7]
storei[f][e][0-7]
cachemm[f|p][l][c][0-7]
move[s][n][z|?|m|l]
jmpa[n][z|?|m|l]
serialize[s][x][m]
- Some instructions (e.g. `mac' and `addsub') could have
variants with an immediate operand.
- The loadm/storem has a surprising operand order
(start,src/dest,count), and it's not clear whether the
register *numbers* or the register *contents* serve as the
start/count values. I suggest the former, and I would also
change the operands to (firstreg, lastreg, memaddr) which is
much easier to grok for humans.
Since there are some unused flags, another variant might be
interesting: `storem r2, r1', where r2 is used as a mask
(bit <n> == 1 means "load/store register <n>"), and r1 is the
address of the source/destination memory area (which must be
big enough to hold all registers, just like the CMB).
Maybe it would be wiser to put the memory address into the
rightmost operand in *all* memory operations (load, store,
cachemm, loadm and storem). Some instructions will always
have the wrong operand order, though.
- And finally, the most important point: the new `nop' instruction
is still undocumented ;)
In case you wonder: I needed a break from VHDL coding (I couldn't
even write C any more!), so I decided to play with something totally
different for a while. The result is a flex-based instruction encoder
that recognizes almost any instruction the F-CPU will have (with the
exceptions mentioned above). I'll probably also build an assembler
around it. (I finally found a real use for my libelf library! Yeah! ;)
> Sure, there needs to be an expansion/reduction code for FP
> but SDUP works for SIMD FP if the packets have the same boundaries.
That's a different kind of operation.
--
Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
"All I wanna do is have a little fun before I die"
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu in the body. http://f-cpu.seul.org/