Re: [f-cpu] New EU_SHL Instruction



hi !

Michael Riepe wrote:

Hi guys!

While skimming through some AltiVec documentation the other day I
noticed that they have nice `permute' and `select' instructions that
let you shuffle the chunks inside a vector any way you like. It's a
general instruction that can replace both `sdup' and `sbyterev' - and
can be easily implemented in the current SHL execution unit.

i don't like "replace", i'd rather say "complement".
and sdup is useful in other contexts.

The basic function works as follows (beware, pseudo-VHDL!):

function permute (A, B : in F_VECTOR) return F_VECTOR is
    variable Y : F_VECTOR;
begin
    for i in 0 to NUMBER_OF_CHUNKS - 1 loop
        chunk(Y, i) := chunk(A, to_integer(chunk(B, i)));
    end loop;
    return Y;
end permute;
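
For readers who want to try the function outside a VHDL simulator, here is a minimal Python sketch of the same loop. It assumes a register is modelled as a plain list of chunk values with chunk 0 first; that representation is an illustration and not part of the proposal.

```python
def permute(a, b):
    """Chunk-wise permute: result chunk i is a[b[i]], as in the pseudo-VHDL."""
    if len(a) != len(b):
        raise ValueError("A and B must hold the same number of chunks")
    return [a[selector] for selector in b]


# Four chunks, chunk 0 listed first (an arbitrary example register).
source = [0x10, 0x11, 0x12, 0x13]

reversed_chunks = permute(source, [3, 2, 1, 0])   # reverses the chunk order
duplicated = permute(source, [0, 0, 0, 0])        # copies chunk 0 everywhere
```

The second call hints at why a permute can stand in for a duplicate: a selector that names the same chunk everywhere replicates that chunk.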

IIRC, this instruction has already been discussed a long time ago.
And, AFAIK, there is an inherent limitation in the ALTIVEC version :
the word size is limited (256 bits).

If there is a simple way to break this limit, and if implementing it
is not too heavy, that would be a good thing. But it was not successful in the past.

I therefore suggest the following ISA update:

vperm.size r3, r2, r1 // new instruction: `vector permute'

performs the `permute' function described above, with r3 being
the selector (B), r2 being the source (A) and r1 the result
(Y). Only the required bits of the selector chunks are used
(e.g. bits 2...0 if there are 8 chunks).
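
The selector masking described above can be sketched at register level as follows. This is only a model under stated assumptions: a 64-bit register (`REG_BITS` is a hypothetical constant, F-CPU registers may be wider), chunk 0 in the least significant bits, and a chunk count that is a power of two.

```python
REG_BITS = 64  # assumed register width for this sketch


def split_chunks(reg, size):
    """Split a register value into size-bit chunks, chunk 0 first."""
    mask = (1 << size) - 1
    return [(reg >> (i * size)) & mask for i in range(REG_BITS // size)]


def join_chunks(chunks, size):
    """Pack a list of chunks (chunk 0 first) back into one register value."""
    reg = 0
    for i, chunk in enumerate(chunks):
        reg |= chunk << (i * size)
    return reg


def vperm(selector, source, size):
    """Model of `vperm.size selector, source, result`."""
    num_chunks = REG_BITS // size
    index_mask = num_chunks - 1  # e.g. bits 2...0 when there are 8 chunks
    src = split_chunks(source, size)
    sel = split_chunks(selector, size)
    return join_chunks([src[s & index_mask] for s in sel], size)


source = 0x8877665544332211
# Selector chunks 7, 6, ..., 0 (chunk 0 holds 7): reverses the byte order.
result = vperm(0x0001020304050607, source, 8)
# High selector bits are ignored, so this selector gives the same result.
same_result = vperm(0xF8F9FAFBFCFDFEFF, source, 8)
```

In this model the usable index width depends only on the number of chunks, not on the chunk size, so the upper bits of each selector chunk are free.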

btw, is there a notion of SIMD here ?
that is : is the SIMD flag used ?
And is the index size as large as the chunk ?
so if there are 16-bit chunks, we could address 64K chunks
(so the register size would not be realistically limited)


`vperm' can perform chunk-wise shifts. It's not suitable
as a replacement for `cshiftl', however - you have to
set up the selector register somehow, and you'll need
cshiftl to do that. `cshiftr', on the other hand, may
be emulated by `vperm' (in a less efficient manner).

vsel.size r3, r2, r1 // new instruction: `vector select'

same as `vperm', but only the least significant chunk of
the result is returned (with zero extension). Again, only the
required bits of the selector are used. This instruction lets
you read any chunk of a register with minimal effort.

well, it's the "reverse" of sdup, right ?
and vsel can be emulated with a shift.
The only gain i see here is the mask.
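
To make the comparison concrete, here is a small Python sketch that writes `vsel` both ways: once as "permute, then keep chunk 0", and once as a plain shift followed by a mask. The 64-bit width and the function names are assumptions made for the illustration.

```python
REG_BITS = 64  # assumed register width for this sketch


def vsel_by_permute(selector, source, size):
    """vsel as defined in the proposal: permute, keep only result chunk 0."""
    num_chunks = REG_BITS // size
    mask = (1 << size) - 1
    chunks = [(source >> (i * size)) & mask for i in range(num_chunks)]
    sel = [(selector >> (i * size)) & (num_chunks - 1) for i in range(num_chunks)]
    permuted = [chunks[s] for s in sel]
    return permuted[0]  # zero-extended: all upper chunks are dropped


def vsel_by_shift(selector, source, size):
    """The same result with a shift right and a mask."""
    num_chunks = REG_BITS // size
    index = selector & (num_chunks - 1)  # only the required selector bits
    return (source >> (index * size)) & ((1 << size) - 1)


source = 0x8877665544332211
byte_2 = vsel_by_shift(2, source, 8)      # 0x33
word_3 = vsel_by_shift(3, source, 16)     # 0x8877
```

In this model the shift version needs a shift amount of index times chunk size plus a separate masking step, which is the part `vsel` folds into one instruction.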

Note that `vperm' is the SIMD variant of `vsel', but I think
that the name is more intuitive than `svsel'. We can keep the
latter as an alias, however.

or simply "sel" :-)

{grrrrr now we have to find an instruction that can be named "poivre" ....}

vseli.size $imm8, r2, r1 // new instruction: `vector select immediate'

same as `vsel', but with an 8-bit unsigned immediate
selector. The SIMD variant `svseli' (or `vpermi') is
probably less useful, but one never knows...

sdup.size r2, r1 // changed instruction

will survive as an alias for `vperm.size r0, r2, r1' which has
exactly the same effect.

In the later versions (i guess it didn't make it into the manual),
the 3rd operand (which was left at zero in previous versions) indicates
the number of the chunk to duplicate. So the alias you made will not be
enough : you would first have to duplicate the index into every chunk of
a temporary register. Funny that the instruction that does that is the
instruction you want to emulate :-)
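
The circularity can be seen in a short Python sketch. It assumes a 64-bit register, that `r0` reads as zero, and that the later `sdup` takes the chunk number as an extra operand as described above; the function names are hypothetical.

```python
REG_BITS = 64  # assumed register width for this sketch


def sdup(source, size, index=0):
    """Reference model: replicate chunk `index` of source into every chunk."""
    num_chunks = REG_BITS // size
    chunk = (source >> (index * size)) & ((1 << size) - 1)
    result = 0
    for i in range(num_chunks):
        result |= chunk << (i * size)
    return result


def vperm(selector, source, size):
    """Model of vperm: result chunk i is source chunk selector[i]."""
    num_chunks = REG_BITS // size
    mask = (1 << size) - 1
    result = 0
    for i in range(num_chunks):
        index = (selector >> (i * size)) & (num_chunks - 1)
        result |= ((source >> (index * size)) & mask) << (i * size)
    return result


source = 0x8877665544332211

# With an all-zero selector (r0), vperm duplicates chunk 0 only.
dup_chunk_0 = vperm(0, source, 8)

# To duplicate chunk 2, the selector must hold 2 in every chunk,
# and building that selector is itself a duplicate operation.
selector = sdup(2, 8)            # 0x0202020202020202
dup_chunk_2 = vperm(selector, source, 8)
```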


[s]byterev.size r2, r1 // unchanged

will stay the same. Note that `sbyterev' can be emulated
with `vperm', but the non-SIMD `byterev' can't without an
additional zero-extension instruction.
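
A possible emulation, sketched in Python: a byte-wise `vperm` with a selector that reverses the byte order inside each chunk. The 64-bit width, the helper names, and the reading of non-SIMD `byterev` as "reverse the low chunk and zero-extend" are assumptions taken from the description above.

```python
REG_BITS = 64                # assumed register width for this sketch
NUM_BYTES = REG_BITS // 8


def vperm8(selector, source):
    """Byte-wise vperm: result byte i is source byte selector[i]."""
    result = 0
    for i in range(NUM_BYTES):
        index = (selector >> (8 * i)) & (NUM_BYTES - 1)
        result |= ((source >> (8 * index)) & 0xFF) << (8 * i)
    return result


def byterev_selector(size):
    """Selector that reverses the bytes inside every size-bit chunk."""
    bytes_per_chunk = size // 8
    selector = 0
    for i in range(NUM_BYTES):
        base = i - i % bytes_per_chunk            # first byte of this chunk
        mirrored = base + bytes_per_chunk - 1 - i % bytes_per_chunk
        selector |= mirrored << (8 * i)
    return selector


def sbyterev(source, size):
    """SIMD byte reversal emulated with one byte-wise permute."""
    return vperm8(byterev_selector(size), source)


def byterev(source, size):
    """Non-SIMD variant: the permute plus an extra zero-extension step."""
    return sbyterev(source, size) & ((1 << size) - 1)


source = 0x8877665544332211
swapped_halfwords = sbyterev(source, 16)
low_word_reversed = byterev(source, 32)
```

The extra mask in `byterev` is the zero-extension step mentioned above; the permute alone cannot produce zero bytes.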

This change is so useful and so cheap to implement that I consider it
a must-have. Any objections?

i want to keep my "sdup(i)", it's very very useful in most code (SIMD or not).

The "HW cost" of the permutation seems to be more than byterev, but not much more.

Now the question is :
do we separate the SIMD instructions (permutation, selection, duplication... on chunks)
from the SHL unit (which only deals with bits) ?
This question would be best answered by Michael, but it is something that i would
certainly do normally.

There is another instruction that would be cool but not easy to define or implement :
a shuffle instruction that puts the Nth bit of the Mth chunk
into the Mth bit of the Nth chunk.

Usually, it is done with 64-bit registers : bit 1 of byte 8 goes to bit 8 of byte 1.
This is useful for example for 'interpreting masks', when a character has been
detected and the resulting bytewide mask must be turned into a bitfield....
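
For the 64-bit case, the shuffle is an 8x8 bit-matrix transpose. The Python sketch below is a bit-by-bit reference model, not a hardware proposal; it numbers bits and bytes from 0 (the text above counts from 1), and the function name is hypothetical.

```python
def bit_transpose_8x8(reg):
    """Move bit n of byte m to bit m of byte n in a 64-bit value."""
    result = 0
    for m in range(8):          # byte index
        for n in range(8):      # bit index inside the byte
            bit = (reg >> (8 * m + n)) & 1
            result |= bit << (8 * n + m)
    return result


# Byte-wide compare mask: bytes 1, 3 and 6 matched (0xFF), the others did not.
byte_mask = 0x00FF0000FF00FF00

transposed = bit_transpose_8x8(byte_mask)
# Every byte of the result now holds one bit per source byte,
# so the low byte is the wanted bitfield (bits 1, 3 and 6 set).
bitfield = transposed & 0xFF
```

Applying the transpose twice returns the original value, which is a convenient sanity check for any faster implementation.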

This is ok for 64-bit CPU but our case makes it difficult : how would this
be in a 128-bit or 256-bit CPU ? One solution would be to allow different
chunk sizes to accommodate the cases where the chunk width is not equal
to the number of chunks.

I guess that this instruction can be implemented with Michael's Omega shifter
but the control logic that deals with the chunk sizes etc is not the easiest part.

YG
