
Re: [f-cpu] New EU_SHL Instruction

To: f-cpu@seul.org
Subject: Re: [f-cpu] New EU_SHL Instruction
From: Yann Guidon <whygee@f-cpu.org>
Date: Thu, 09 Jan 2003 01:59:11 +0100
Delivered-to: archiver@seul.org
Delivered-to: f-cpu-outgoing@seul.org
Delivered-to: f-cpu@seul.org
Delivery-date: Wed, 08 Jan 2003 19:44:19 -0500
Organization: Freedom CPU Project
References: <20030107064710.02532@thrai.stud.uni-hannover.de> <3E1B7B25.3000509@f-cpu.org> <20030108143629.49856@thrai.stud.uni-hannover.de>
Reply-to: f-cpu@seul.org
Sender: owner-f-cpu@seul.org
User-agent: Mozilla/5.0 (Windows; U; Win95; en-US; rv:1.0.0) Gecko/20020530

hi !

Michael Riepe wrote:

On Wed, Jan 08, 2003 at 02:13:09AM +0100, Yann Guidon wrote:

While skimming through some AltiVec documentation the other day I
noticed that they have nice `permute' and `select' instructions that
let you shuffle the chunks inside a vector any way you like. It's a
general instruction that can replace both `sdup' and `sbyterev' - and
can be easily implemented in the current SHL execution unit.

i don't like "replace", i'd rather say "complement".
and sdup is useful in other contexts.

It rather extends/generalizes them.

hmmmmgrmblgrml ....
even though "permute" generalises sdup,
it does not replace it.

IIRC, this instruction has already been discussed a long time ago.
And, AFAIK, there is an inherent limitation in the ALTIVEC version :
the word size is limited (256 bits).

It should be limited by the register size.

i don't remember the discussion about this, but there was some kind of inherent
limitation related to the index size ...
If we require the index to be as wide as the chunk, that should be ok.

I therefore suggest the following ISA update:

  vperm.size r3, r2, r1       // new instruction: `vector permute'

      performs the `permute' function described above, with r3 being
      the selector (B), r2 being the source (A) and r1 the result
      (Y).  Only the required bits of the selector chunks are used
      (e.g. bits 2...0 if there are 8 chunks).
btw, is there a notion of SIMD here ?
that is : is the SIMD flag used ?
That's answered below. This instruction is the SIMD variant of `vsel'.
That is, the SIMD flag is always set.

then it is not really a different instruction ....

And is the index size as large as the chunk ?
so if there are 16-bit chunks, we could address 64K chunks
(so the register size would not be realistically limited)

For 8-bit chunks, there's a limit at a register size of 256*8 = 2048 bits.

huh, that should be enough... :-)
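
A minimal sketch of the `vperm' semantics discussed above, in Python rather than F-CPU code. It assumes registers are modelled as plain integers, that chunk 0 is the least significant chunk, and that the number of chunks is a power of two; the helper names (`chunks`, `join`, `max_register_bits`) are made up for illustration and do not come from the F-CPU sources.

```python
def chunks(value, size, width):
    """Split a register value into chunks, least significant chunk first."""
    mask = (1 << size) - 1
    return [(value >> (i * size)) & mask for i in range(width // size)]


def join(parts, size):
    """Reassemble chunks (least significant first) into one register value."""
    out = 0
    for i, part in enumerate(parts):
        out |= part << (i * size)
    return out


def vperm(selector, source, size=8, width=64):
    """Result chunk i is the source chunk named by selector chunk i."""
    n = width // size                      # number of chunks in the register
    src = chunks(source, size, width)
    sel = chunks(selector, size, width)
    # Only the required selector bits are used; for a power-of-two n,
    # taking the index modulo n is the same as keeping the low bits.
    return join([src[s % n] for s in sel], size)


def max_register_bits(size):
    """Widest register whose chunks can all be named by a size-bit index."""
    return (1 << size) * size
```

With 8-bit chunks, `max_register_bits(8)` gives the 256*8 = 2048 bits quoted above; an all-zero selector reproduces the two-operand `sdup'.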

[...]
  vsel.size r3, r2, r1        // new instruction: `vector select'

      same as `vperm', but only the least significant chunk of
      the result is returned (with zero extension). Again, only the
      required bits of the selector are used. This instruction lets
      you read any chunk of a register with minimal effort.
well, it's the "reverse" of sdup, right ?
and vsel can be emulated with a shift.
Shift + mask, to be precise, and only if the register size is limited
to 64 bits. `vsel' is mainly directed towards wider registers.

well, it seems very close to sdup.

The only gain i see here is the mask.
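
A small Python sketch of the comparison above, under the same assumptions as before (registers as integers, chunk 0 least significant, function names invented for illustration). It only shows that `vsel' and shift + mask return the same value; it says nothing about how wide a real shifter is.

```python
def vsel(selector, source, size=8, width=64):
    """Return the source chunk picked by the lowest selector chunk, zero-extended."""
    n = width // size
    mask = (1 << size) - 1
    index = (selector & mask) % n      # only the required selector bits are used
    return (source >> (index * size)) & mask


def vsel_by_shift(index, source, size=8):
    """The same result built from a plain right shift followed by a mask."""
    shifted = source >> (index * size)     # shiftr by index * chunk size
    return shifted & ((1 << size) - 1)     # the mask is the extra step
```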
      Note that `vperm' is the SIMD variant of `vsel', but I think
      that the name is more intuitive than `svsel'. We can keep the
      latter as an alias, however.
or simply "sel" :-)
I wanted to indicate that the instruction belongs to a `vector' family
of instructions.

well, here, the word SIMD is used in this meaning, so we can drop the "v" ...

{grrrrr now we have to find an instruction that can be named "poivre" ....}

Huh? I guess this is some french pun, right?

in french, "sel" = "salt", and "poivre" = you guess what :-)

vseli.size $imm8, r2, r1 // new instruction: `vector select immediate'

same as `vsel', but with an 8-bit unsigned immediate
selector. The SIMD variant `svseli' (or `vpermi') is
probably less useful, but one never knows...

sdup.size r2, r1 // changed instruction

will survive as an alias for `vperm.size r0, r2, r1' which has
exactly the same effect.

In the later versions (i guess it didn't make it into the manual),
the 3rd operand (that was left to zero in previous versions) indicated
the number of the chunk to duplicate. So the alias you made will not be
enough : you would have to duplicate the first index first into a temporary
register. Funny that the instruction that does that is the instruction
you want to emulate :-)

I do not remember an `sdup' instruction that uses three operands, nor
did I implement one. That would have been another logical extension;
but I guess that vperm is more useful. BTW: if you're satisfied with
an immediate operand, `svseli $n, r2, r1' will do what you want
(duplicate chunk <n> of r2).

it's not doing sdup completely. The 3rd register operand is useful to avoid a shift
in front of a "bare" sdup :
sdup r1, r2, r3
replaces
shiftr r1*8, r2, r2
sdup r2, r3

this saves around 2 cycles, if i'm not mistaken and if there are no other
instructions to fill in the gap.

The typical example is when a bitmap is expanded to a framebuffer.
There is a sdup that duplicates a byte to a whole register, but when the sdup
is in a loop, the original bitmask has to be rotated before sdup.
Now, with the 3-operand sdup, there is no need to shift, the loop counter
addresses the byte in the register :-)
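
The bitmap loop can be sketched in Python as follows. This is an illustration only: `sdup2` and `sdup3` are hypothetical names for the two- and three-operand forms, registers are modelled as 64-bit integers with byte 0 least significant, and the Python shift inside `sdup3` stands in for what the hardware would do by selecting the byte directly.

```python
def sdup2(source, size=8, width=64):
    """Two-operand sdup: copy chunk 0 of the source into every chunk."""
    chunk = source & ((1 << size) - 1)
    out = 0
    for i in range(width // size):
        out |= chunk << (i * size)
    return out


def sdup3(index, source, size=8, width=64):
    """Three-operand sdup: copy chunk `index` of the source into every chunk."""
    return sdup2(source >> (index * size), size, width)


def expand_with_shift(bitmap):
    """Loop body needs a shift before every two-operand sdup."""
    rows = []
    for _ in range(8):
        rows.append(sdup2(bitmap))
        bitmap >>= 8                   # the extra instruction per iteration
    return rows


def expand_with_index(bitmap):
    """The loop counter names the byte directly, so no shift is needed."""
    return [sdup3(i, bitmap) for i in range(8)]
```

Both loops produce the same rows; the second one simply has one instruction less per iteration.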

[s]byterev.size r2, r1 // unchanged

will stay the same. Note that `sbyterev' can be emulated
with `vperm', but the non-SIMD `byterev' can't without an
additional zero-extension instruction.

This change is so useful and so cheap to implement that I consider it
a must-have. Any objections?

i want to keep my "sdup(i)", it's very very useful in most code (SIMD or not).
It's still there:

   `sdupi $imm8, r2, r1' corresponds to `vpermi $imm8, r2, r1'.
   `sdup r2, r1' corresponds to `vperm r0, r2, r1' or `vpermi $0, r2, r1'.

If you insist, I will investigate whether `sdup r3, r2, r1' can be added.
I guess it's possible (there are some unused flags).

i'm sure it can :-)

Oh BTW, I forgot the encoding (for the manual):

vsel:
    31-24  OP_VSEL (replaces OP_SDUP)
    23-22  size
    21     s- prefix
    20-18  unused
    17-12  r3
    11-6   r2
    5-0    r1

vseli:
    31-24  OP_VSELI (new)
    23-22  size
    21     s- prefix
    20     unused
    19-12  unsigned 8-bit immediate
    11-6   r2
    5-0    r1
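
As a cross-check of the bit positions, the two layouts can be packed with a few shifts. This Python sketch takes the opcode as a parameter because the numeric values of OP_VSEL and OP_VSELI are not given in this message; the function names are invented, and the unused bits are assumed to be left at zero.

```python
def encode_vsel(opcode, size, simd, r3, r2, r1):
    """Pack the register form: opcode, size, s- prefix, r3, r2, r1."""
    word = (opcode & 0xFF) << 24       # bits 31-24
    word |= (size & 0x3) << 22         # bits 23-22
    word |= (simd & 0x1) << 21         # bit 21, the s- prefix
    # bits 20-18 are unused and assumed zero
    word |= (r3 & 0x3F) << 12          # bits 17-12
    word |= (r2 & 0x3F) << 6           # bits 11-6
    word |= r1 & 0x3F                  # bits 5-0
    return word


def encode_vseli(opcode, size, simd, imm8, r2, r1):
    """Pack the immediate form: opcode, size, s- prefix, imm8, r2, r1."""
    word = (opcode & 0xFF) << 24       # bits 31-24
    word |= (size & 0x3) << 22         # bits 23-22
    word |= (simd & 0x1) << 21         # bit 21, the s- prefix
    # bit 20 is unused and assumed zero
    word |= (imm8 & 0xFF) << 12        # bits 19-12, unsigned immediate
    word |= (r2 & 0x3F) << 6           # bits 11-6
    word |= r1 & 0x3F                  # bits 5-0
    return word
```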

The "HW cost" of the permutation seems to be more than byterev, but not much more.

In the current implementation, the cost is almost 0. Byterev & friends use
a set of byte-wide muxes. Their select inputs are driven with constant
values, depending on the operating mode. `vsel' uses the same muxes,
but takes the select inputs from the second operand instead - that's all.
It doesn't even increase the latency.
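
The idea of shared muxes can be modelled in a few lines of Python. This is a behavioural sketch, not the actual VHDL: it assumes a 64-bit register, one byte-wide mux per output byte, and byte 0 as the least significant byte. The constant select tables for `byterev' and `sdup' are what one would expect from their definitions, not values copied from the implementation.

```python
def byte_shuffler(source, selects):
    """One byte-wide mux per output byte; selects[i] picks the input byte."""
    out = 0
    for i, s in enumerate(selects):
        byte = (source >> (8 * s)) & 0xFF
        out |= byte << (8 * i)
    return out


def byterev(source, width=64):
    """Select inputs driven by constants: output byte i takes byte n-1-i."""
    n = width // 8
    return byte_shuffler(source, [n - 1 - i for i in range(n)])


def sdup(source, width=64):
    """Select inputs driven by constants: every output byte takes byte 0."""
    return byte_shuffler(source, [0] * (width // 8))


def vperm(selector, source, width=64):
    """Same muxes, but the select inputs come from the selector operand."""
    n = width // 8
    selects = [((selector >> (8 * i)) & 0xFF) % n for i in range(n)]
    return byte_shuffler(source, selects)
```

In this model the datapath is identical in all three functions; only the source of the select values changes.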

it works more or less the same for sdup ...

Now the question is :
do we separate the SIMD instructions (permutation, selection, duplication... on chunks)
from the SHL unit (which only deals with bits) ?
The question would be best answered by Michael but this is something that i would certainly do normally.
They are already separated (kind of).

then why is there still one unit ?

Bitwise ops use the omega
shifter, while bytewise ops use the mux-based `byte shuffler'.
They just share input and output ports.

if routing becomes too complex, then a split must be considered....

There is another instruction that would be cool but not easy to define or implement :
a shuffle instruction that puts the Nth bit of the Mth chunk
into the Mth bit of the Nth chunk.

Usually, it is done with 64-bit registers : bit 1 of byte 8 goes to bit 8 of byte 1.
This is useful for example for 'interpreting masks', when a character has been
detected and the resulting bytewide mask must be turned into a bitfield....

This is ok for 64-bit CPU but our case makes it difficult : how would this
be in a 128-bit or 256-bit CPU ? One solution would be to allow different
chunk sizes to accommodate the cases where the chunk width is not equal
to the number of chunks.
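
A reference model of this bit shuffle, as a Python sketch. It counts bits and chunks from 0 (the text above counts from 1) and treats chunk 0 as least significant. For the non-square case it assumes one possible reading of the "different chunk sizes" idea: the result has as many chunks as the input chunk has bits, each as wide as the input has chunks. That reading is an assumption, not something settled in this thread.

```python
def bit_transpose(value, chunk_bits=8, n_chunks=8):
    """Move bit n of input chunk m to bit m of output chunk n.

    Input:  n_chunks chunks of chunk_bits bits each.
    Output: chunk_bits chunks of n_chunks bits each (same total width).
    """
    out = 0
    for m in range(n_chunks):
        for n in range(chunk_bits):
            bit = (value >> (m * chunk_bits + n)) & 1
            out |= bit << (n * n_chunks + m)
    return out


# A byte-wide comparison mask (0xFF where a character matched) ...
mask = 0x00FF00FFFF0000FF
# ... becomes a bitfield: bit m of each result byte says whether byte m matched.
bitfield = bit_transpose(mask) & 0xFF
```

For a 64-bit register with 8-bit chunks the operation is its own inverse; with 16 chunks of 8 bits (128 bits) the model returns 8 chunks of 16 bits.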

I guess that this instruction can be implemented with Michael's Omega shifter
but the control logic that deals with the chunk sizes etc is not the easiest part.
I'm not sure. An omega shifter is limited to a subset of all possible
shuffling operations, and this one looks pretty complicated. You'd
better not count on it.

i'm pretty sure that an omega network can do this, but it probably depends on the number
of stages. This operation is probably mentioned somewhere, and it reminds me of
something similar in Knuth's MMIX.

YG


