Re: [f-cpu] New EU_SHL Instruction
On Wed, Jan 08, 2003 at 02:13:09AM +0100, Yann Guidon wrote:
> >While skimming through some AltiVec documentation the other day I
> >noticed that they have nice `permute' and `select' instructions that
> >let you shuffle the chunks inside a vector any way you like. It's a
> >general instruction that can replace both `sdup' and `sbyterev' - and
> >can be easily implemented in the current SHL execution unit.
> >
> i don't like "replace", i'd rather say "complement".
> and sdup is useful in other contexts.
It rather extends/generalizes them.
> >The basic function works as follows (beware, pseudo-VHDL!):
> >
> > function permute (A, B : in F_VECTOR) return F_VECTOR is
> > variable Y : F_VECTOR;
> > begin
> > for i in 0 to NUMBER_OF_CHUNKS - 1 loop
> > chunk(Y, i) := chunk(A, to_integer(chunk(B, i)));
> > end loop;
> > return Y;
> > end permute;
> >
> >
>
> IIRC, this instruction has already been discussed a long time ago.
> And, AFAIK, there is an inherent limitation in the ALTIVEC version :
> the word size is limited (256 bits).
It should be limited by the register size.
> If there is a simple way to break this limit, and if implementing it
> is not too heavy, that would be a good thing. But it was not successful
> in the past.
It uses the same mechanism that is used for the other bytewise
shuffling operations (byterev, sdup, mix, expand, cshift).
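
For readers who prefer something runnable over pseudo-VHDL, here is a
rough software model of the `permute' function quoted above. It is only
a sketch: the helper names, the 64-bit default register width and the
convention that chunk 0 is the least significant chunk are assumptions
made for illustration, not part of the proposal. The number of chunks is
assumed to be a power of two.

```python
def split_chunks(value, reg_bits, chunk_bits):
    """Split a register value into chunks, chunk 0 = least significant (assumed)."""
    mask = (1 << chunk_bits) - 1
    count = reg_bits // chunk_bits
    return [(value >> (i * chunk_bits)) & mask for i in range(count)]


def join_chunks(chunks, chunk_bits):
    """Inverse of split_chunks: pack a list of chunks back into one integer."""
    value = 0
    for i, chunk in enumerate(chunks):
        value |= chunk << (i * chunk_bits)
    return value


def vperm(a, b, reg_bits=64, chunk_bits=8):
    """Model of permute: chunk i of the result is chunk B[i] of A."""
    src = split_chunks(a, reg_bits, chunk_bits)
    sel = split_chunks(b, reg_bits, chunk_bits)
    n = len(src)  # assumed to be a power of two
    # Only the required selector bits are used (bits 2..0 for 8 chunks).
    out = [src[s & (n - 1)] for s in sel]
    return join_chunks(out, chunk_bits)
```

With a selector of 0x0001020304050607 this reverses the eight bytes of
the source, and with an all-zero selector it duplicates chunk 0.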
> >I therefore suggest the following ISA update:
> >
> > vperm.size r3, r2, r1 // new instruction: `vector permute'
> >
> > performs the `permute' function described above, with r3 being
> > the selector (B), r2 being the source (A) and r1 the result
> > (Y). Only the required bits of the selector chunks are used
> > (e.g. bits 2...0 if there are 8 chunks).
> >
> >
> btw, is there a notion of SIMD here ?
> that is : is the SIMD flag used ?
That's answered below. This instruction is the SIMD variant of `vsel'.
That is, the SIMD flag is always set.
> And is the index size as large as the chunk ?
> so if there are 16-bit chunks, we could address 64K chunks
> (so the register size would not be realistically limited)
For 8-bit chunks, there's a limit at a register size of 256*8 = 2048 bits.
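
The arithmetic behind that limit, as a small sketch. It assumes the
selector for each chunk is as wide as the chunk itself, so a chunk of
k bits can name 2**k different source chunks; the function name is made
up for illustration.

```python
def max_register_bits(chunk_bits):
    """Largest register width whose chunks can all be addressed by one selector chunk."""
    # A selector of chunk_bits bits can name 2**chunk_bits distinct chunks,
    # and each of those chunks is chunk_bits wide.
    return (1 << chunk_bits) * chunk_bits
```

For 8-bit chunks this gives the 2048 bits mentioned above; for 16-bit
chunks the limit is far beyond any realistic register size.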
[...]
> > vsel.size r3, r2, r1 // new instruction: `vector select'
> >
> > same as `vperm', but only the least significant chunk of
> > the result is returned (with zero extension). Again, only the
> > required bits of the selector are used. This instruction lets
> > you read any chunk of a register with minimal effort.
> >
> >
> well, it's the "reverse" of sdup, right ?
> and vsel can be emulated with a shift.
Shift + mask, to be precise, and only if the register size is limited
to 64 bits. `vsel' is mainly directed towards wider registers.
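
A sketch of what `vsel' computes, written as the shift + mask it would
take to emulate it in software. The function name, the 64-bit default
width and the "chunk 0 is least significant" numbering are assumptions
for illustration; the selector is taken from the least significant
chunk of the second operand, using only the required bits.

```python
def vsel(a, b, reg_bits=64, chunk_bits=8):
    """Return chunk number B[0] of A, zero-extended."""
    n = reg_bits // chunk_bits           # number of chunks, assumed power of two
    mask = (1 << chunk_bits) - 1
    index = (b & mask) & (n - 1)         # least significant selector chunk, required bits only
    return (a >> (index * chunk_bits)) & mask
```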
> The only gain i see here is the mask.
>
> > Note that `vperm' is the SIMD variant of `vsel', but I think
> > that the name is more intuitive than `svsel'. We can keep the
> > latter as an alias, however.
> >
> >
> or simply "sel" :-)
I wanted to indicate that the instruction belongs to a `vector' family
of instructions.
> {grrrrr now we have to find an instruction that can be named "poivre" ....}
Huh? I guess this is some French pun, right?
> > vseli.size $imm8, r2, r1 // new instruction: `vector select immediate'
> >
> > same as `vsel', but with an 8-bit unsigned immediate
> > selector. The SIMD variant `svseli' (or `vpermi') is
> > probably less useful, but one never knows...
> >
> > sdup.size r2, r1 // changed instruction
> >
> > will survive as an alias for `vperm.size r0, r2, r1' which has
> > exactly the same effect.
> >
> >
> In the later versions (i guess it didn't make it into the manual),
> the 3rd operand (that was left to zero in previous versions) indicated
> the number of the chunk to duplicate. So the alias you made will not be
> enough : you would have to duplicate the first index first into a temporary
> register. Funny that the instruction that does that is the instruction
> you want to emulate :-)
I do not remember an `sdup' instruction that uses three operands, nor
did I implement one. That would have been another logical extension;
but I guess that vperm is more useful. BTW: if you're satisfied with
an immediate operand, `svseli $n, r2, r1' will do what you want
(duplicate chunk <n> of r2).
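
To illustrate, a minimal model of `svseli' (duplicate chunk <n> of the
source into every chunk of the result), with `sdup' expressed as the
n = 0 case. Function names, the 64-bit default width and the chunk
numbering (chunk 0 = least significant) are assumptions for this
sketch.

```python
def svseli(imm8, a, reg_bits=64, chunk_bits=8):
    """Duplicate chunk number imm8 of A into all chunks of the result."""
    n = reg_bits // chunk_bits           # number of chunks, assumed power of two
    mask = (1 << chunk_bits) - 1
    index = imm8 & (n - 1)               # only the required selector bits are used
    chunk = (a >> (index * chunk_bits)) & mask
    result = 0
    for i in range(n):
        result |= chunk << (i * chunk_bits)
    return result


def sdup(a, reg_bits=64, chunk_bits=8):
    """Two-operand sdup: duplicate chunk 0, i.e. svseli with selector 0."""
    return svseli(0, a, reg_bits, chunk_bits)
```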
> > [s]byterev.size r2, r1 // unchanged
> >
> > will stay the same. Note that `sbyterev' can be emulated
> > with `vperm', but the non-SIMD `byterev' can't without an
> > additional zero-extension instruction.
> >
> >This change is so useful and so cheap to implement that I consider it
> >a must-have. Any objections?
> >
> i want to keep my "sdup(i)", it's very very useful in most code (SIMD or
> not).
It's still there:
`sdupi $imm8, r2, r1' corresponds to `vpermi $imm8, r2, r1'.
`sdup r2, r1' corresponds to `vperm r0, r2, r1' or `vpermi $0, r2, r1'.
If you insist, I will investigate whether `sdup r3, r2, r1' can be added.
I guess it's possible (there are some unused flags). Oh BTW, I forgot
the encoding (for the manual):
vsel:
31-24 OP_VSEL (replaces OP_SDUP)
23-22 size
21 s- prefix
20-18 unused
17-12 r3
11- 6 r2
5- 0 r1
vseli:
31-24 OP_VSELI (new)
23-22 size
21 s- prefix
20 unused
19-12 unsigned 8-bit immediate
11- 6 r2
5- 0 r1
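
The two layouts above, packed into 32-bit words as a sketch. The opcode
values are not given in this mail, so they are passed in as parameters;
the function names and the range checks are illustrative assumptions.
Unused bits are left at zero.

```python
def _check(name, value, bits):
    """Reject field values that do not fit into the given number of bits."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} does not fit in {bits} bits: {value}")
    return value


def encode_vsel(opcode, size, simd, r3, r2, r1):
    """31-24 opcode, 23-22 size, 21 s- prefix, 20-18 unused, 17-12 r3, 11-6 r2, 5-0 r1."""
    return (_check("opcode", opcode, 8) << 24
            | _check("size", size, 2) << 22
            | _check("simd", simd, 1) << 21
            | _check("r3", r3, 6) << 12
            | _check("r2", r2, 6) << 6
            | _check("r1", r1, 6))


def encode_vseli(opcode, size, simd, imm8, r2, r1):
    """31-24 opcode, 23-22 size, 21 s- prefix, 20 unused, 19-12 imm8, 11-6 r2, 5-0 r1."""
    return (_check("opcode", opcode, 8) << 24
            | _check("size", size, 2) << 22
            | _check("simd", simd, 1) << 21
            | _check("imm8", imm8, 8) << 12
            | _check("r2", r2, 6) << 6
            | _check("r1", r1, 6))
```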
> The "HW cost" of the permutation seems to be more than byterev, but not
> much more.
In the current implementation, the cost is almost 0. Byterev & friends use
a set of byte-wide muxes. Their select inputs are driven with constant
values, depending on the operating mode. `vsel' uses the same muxes,
but takes the select inputs from the second operand instead - that's all.
It doesn't even increase the latency.
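
A toy model of that idea, working on lists of chunks rather than on the
real VHDL signals. One mux per output chunk picks an input chunk; the
fixed operations drive the selects with constants, while `vperm' takes
them from the second operand. All names here are made up for the
sketch, and the full-register byte reversal is just one example of a
constant select pattern.

```python
def shuffle(src, select):
    """One mux per output chunk: output i is input chunk select[i]."""
    return [src[s] for s in select]


def byterev_select(n):
    """Constant selects for reversing n chunks."""
    return list(range(n - 1, -1, -1))


def sdup_select(n):
    """Constant selects for duplicating chunk 0 into all n chunks."""
    return [0] * n


def vperm_select(b_chunks, n):
    """Selects taken from the second operand, reduced to the required bits."""
    return [s % n for s in b_chunks]
```

The data path (`shuffle') is the same in all three cases; only the
source of the select values differs.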
> Now the question is :
> do we separate the SIMD instructions (permutation, selection,
> duplication... on chunks)
> from the SHL unit (which only deals with bits) ?
> The question would be best answered by Michael but this is something that
> i would
> certainly do normally.
They are already separated (kind of). Bitwise ops use the omega
shifter, while bytewise ops use the mux-based `byte shuffler'.
They just share input and output ports.
> There is another instruction that would be cool but not easy to define
> or implement :
> a shuffle instruction that puts the Nth bit of the Mth chunk
> into the Mth bit of the Nth chunk.
>
> Usually, it is done with 64-bit registers : bit 1 of byte 8 goes to bit 8
> of byte 1.
> This is useful for example for 'interpreting masks', when a character
> has been
> detected and the resulting bytewide mask must be turned into a bitfield....
>
> This is ok for 64-bit CPU but our case makes it difficult : how would this
> be in a 128-bit or 256-bit CPU ? One solution would be to allow different
> chunk sizes to accommodate the cases where the chunk width is not equal
> to the number of chunks.
>
> I guess that this instruction can be implemented with Michael's Omega
> shifter
> but the control logic that deals with the chunk sizes etc is not the
> easiest part.
I'm not sure. An omega shifter is limited to a subset of all possible
shuffling operations, and this one looks pretty complicated. You'd
better not count on it.
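
For reference, the 64-bit case of the shuffle described in the quoted
text is an 8x8 bit-matrix transpose. A plain software sketch, using
0-based numbering (bit n of byte m goes to bit m of byte n) and a
made-up function name; it says nothing about how, or whether, this
could be done in the omega shifter.

```python
def bit_transpose_8x8(x):
    """Move bit n of byte m to bit m of byte n in a 64-bit value."""
    y = 0
    for m in range(8):              # byte index
        for n in range(8):          # bit index within the byte
            if (x >> (8 * m + n)) & 1:
                y |= 1 << (8 * n + m)
    return y
```

Applied to a bytewise mask (each byte 0x00 or 0xFF), the least
significant byte of the result is the corresponding bitfield.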
--
Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
"All I wanna do is have a little fun before I die"