
Re: [f-cpu] New EU_SHL Instruction



On Wed, Jan 08, 2003 at 02:13:09AM +0100, Yann Guidon wrote:

> >While skimming through some AltiVec documentation the other day I
> >noticed that they have nice `permute' and `select' instructions that
> >let you shuffle the chunks inside a vector any way you like. It's a
> >general instruction that can replace both `sdup' and `sbyterev' - and
> >can be easily implemented in the current SHL execution unit.
> >
> i don't like "replace", i'd rather say "complement".
> and sdup is useful in other contexts.

It rather extends/generalizes them.

> >The basic function works as follows (beware, pseudo-VHDL!):
> >
> >    function permute (A, B : in F_VECTOR) return F_VECTOR is
> >        variable Y : F_VECTOR;
> >    begin
> >        for i in 0 to NUMBER_OF_CHUNKS - 1 loop
> >            chunk(Y, i) := chunk(A, to_integer(chunk(B, i)));
> >        end loop;
> >        return Y;
> >    end permute;
> >  
> >
> 
> IIRC, this instruction has already been discussed a long time ago.
> And, AFAIK, there is an inherent limitation in the ALTIVEC version :
> the word size is limited (256 bits).

It should be limited by the register size.

> If there is a simple way to break this limit, and if implementing it
> is not too heavy, that would be a good thing. But it was not successful
> in the past.

It uses the same mechanism that is used for the other bytewise
shuffling operations (byterev, sdup, mix, expand, cshift).

> >I therefore suggest the following ISA update:
> >
> >    vperm.size r3, r2, r1       // new instruction: `vector permute'
> >
> >        performs the `permute' function described above, with r3 being
> >        the selector (B), r2 being the source (A) and r1 the result
> >        (Y).  Only the required bits of the selector chunks are used
> >        (e.g. bits 2...0 if there are 8 chunks).
> >  
> >
> btw, is there a notion of SIMD here ?
> that is : is the SIMD flag used ?

That's answered below. This instruction is the SIMD variant of `vsel'.
That is, the SIMD flag is always set.
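To make the selector masking concrete, here is a small behavioural
model of `vperm' in Python. It is only a sketch, not taken from the
F-CPU sources: the helper names, the 64-bit default register width and
the convention that chunk 0 is the least significant chunk are all
assumptions made for illustration.

```python
def chunks(value, chunk_bits, reg_bits):
    """Split a register value into chunks, least significant chunk first (assumed order)."""
    mask = (1 << chunk_bits) - 1
    return [(value >> (i * chunk_bits)) & mask for i in range(reg_bits // chunk_bits)]

def pack(chunk_list, chunk_bits):
    """Inverse of chunks(): reassemble a register value from its chunks."""
    value = 0
    for i, c in enumerate(chunk_list):
        value |= c << (i * chunk_bits)
    return value

def vperm(selector, source, chunk_bits=8, reg_bits=64):
    """Model of `vperm': Y[i] = A[B[i]], using only the required selector bits."""
    n = reg_bits // chunk_bits          # number of chunks, assumed to be a power of two
    a = chunks(source, chunk_bits, reg_bits)
    b = chunks(selector, chunk_bits, reg_bits)
    # With 8 chunks only bits 2...0 of each selector chunk matter.
    return pack([a[sel & (n - 1)] for sel in b], chunk_bits)
```

With 8-bit chunks in a 64-bit register, a selector that counts down
from 7 to 0 reverses the bytes of the source, and any selector bits
above bit 2 are ignored.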

> And is the index size as large as the chunk ?
> so if there are 16-bit chunks, we could address 64K chunks
> (so the register size would not be realistically limited)

For 8-bit chunks, there's a limit at a register size of 256*8 = 2048 bits.
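The same arithmetic for other chunk sizes, as a tiny Python sketch. The
function name is made up for this example; it only restates the
reasoning that a selector chunk of w bits can name at most 2**w chunks.

```python
def max_register_bits(chunk_bits):
    """Largest register width whose chunks can all be addressed by one selector chunk."""
    addressable_chunks = 1 << chunk_bits      # a w-bit index names 2**w chunks
    return addressable_chunks * chunk_bits    # each of those chunks is w bits wide

for width in (8, 16):
    print(width, max_register_bits(width))
```

For 16-bit chunks this gives 65536 * 16 bits, which is why the limit is
not a practical concern there.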

[...]
> >    vsel.size r3, r2, r1        // new instruction: `vector select'
> >
> >        same as `vperm', but only the least significant chunk of
> >        the result is returned (with zero extension). Again, only the
> >        required bits of the selector are used. This instruction lets
> >        you read any chunk of a register with minimal effort.
> >  
> >
> well, it's the "reverse" of sdup, right ?
> and vsel can be emulated with a shift.

Shift + mask, to be precise, and only if the register size is limited
to 64 bits. `vsel' is mainly directed towards wider registers.
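For the 64-bit case, the equivalence can be checked with a short Python
model. This is a sketch under assumptions: chunk 0 is taken to be the
least significant chunk, and both function names are invented for the
example. The emulation needs the shift amount (index times chunk
width) to be computed first, which `vsel' does implicitly.

```python
def vsel(selector, source, chunk_bits=8, reg_bits=64):
    """Model of `vsel': return chunk B[0] of A, zero-extended."""
    n = reg_bits // chunk_bits
    mask = (1 << chunk_bits) - 1
    # Only the least significant selector chunk is used, and only its required bits.
    index = (selector & mask) & (n - 1)
    source_chunks = [(source >> (i * chunk_bits)) & mask for i in range(n)]
    return source_chunks[index]

def shift_and_mask(index, source, chunk_bits=8):
    """Emulation: shift the wanted chunk down to position 0, then mask off the rest."""
    shift_amount = index * chunk_bits
    return (source >> shift_amount) & ((1 << chunk_bits) - 1)
```

Both functions return the same value for every valid chunk index of a
64-bit source.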

> The only gain i see here is the mask.
> 
> >        Note that `vperm' is the SIMD variant of `vsel', but I think
> >        that the name is more intuitive than `svsel'. We can keep the
> >        latter as an alias, however.
> >  
> >
> or simply "sel" :-)

I wanted to indicate that the instruction belongs to a `vector' family
of instructions.

> {grrrrr now we have to find an instruction that can be named "poivre" ....}

Huh? I guess this is some French pun, right?

> >    vseli.size $imm8, r2, r1    // new instruction: `vector select immediate'
> >
> >        same as `vsel', but with an 8-bit unsigned immediate
> >        selector. The SIMD variant `svseli' (or `vpermi') is
> >        probably less useful, but one never knows...
> >
> >    sdup.size r2, r1            // changed instruction
> >
> >        will survive as an alias for `vperm.size r0, r2, r1' which has
> >        exactly the same effect.
> >  
> >
> In the later versions (i guess it didn't make it into the manual),
> the 3rd operand (that was left to zero in previous versions) indicated
> the number of the chunk to duplicate. So the alias you made will not be
> enough : you would have to duplicate the first index first into a temporary
> register. Funny that the instruction that does that is the instruction
> you want to emulate :-)

I do not remember an `sdup' instruction that uses three operands, nor
did I implement one. That would have been another logical extension;
but I guess that vperm is more useful. BTW: if you're satisfied with
an immediate operand, `svseli $n, r2, r1' will do what you want
(duplicate chunk <n> of r2).

> >    [s]byterev.size r2, r1      // unchanged
> >
> >        will stay the same. Note that `sbyterev' can be emulated
> >        with `vperm', but the non-SIMD `byterev' can't without an
> >        additional zero-extension instruction.
> >
> >This change is so useful and so cheap to implement that I consider it
> >a must-have. Any objections?
> >
> i want to keep my "sdup(i)", it's very very useful in most code (SIMD or
> not).

It's still there:

    `sdupi $imm8, r2, r1' corresponds to `vpermi $imm8, r2, r1'.
    `sdup r2, r1' corresponds to `vperm r0, r2, r1' or `vpermi $0, r2, r1'.
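Here is a Python sketch of why these aliases work. It assumes that r0
always reads as zero, that chunk 0 is the least significant chunk, and
that `vpermi' uses the immediate as the selector for every chunk; the
function names simply mirror the mnemonics and are not real code from
the project.

```python
def vpermi(imm8, source, chunk_bits=8, reg_bits=64):
    """Model of `vpermi': every result chunk is chunk <imm8> of the source."""
    n = reg_bits // chunk_bits
    mask = (1 << chunk_bits) - 1
    picked = (source >> ((imm8 & (n - 1)) * chunk_bits)) & mask
    result = 0
    for i in range(n):
        result |= picked << (i * chunk_bits)
    return result

def vperm(selector, source, chunk_bits=8, reg_bits=64):
    """Model of `vperm': result chunk i is source chunk selector[i]."""
    n = reg_bits // chunk_bits
    mask = (1 << chunk_bits) - 1
    result = 0
    for i in range(n):
        index = (selector >> (i * chunk_bits)) & mask & (n - 1)
        result |= ((source >> (index * chunk_bits)) & mask) << (i * chunk_bits)
    return result

def sdup(source, chunk_bits=8, reg_bits=64):
    """`sdup r2, r1' as `vperm r0, r2, r1': an all-zero selector picks chunk 0 everywhere."""
    return vperm(0, source, chunk_bits, reg_bits)

def sdupi(imm8, source, chunk_bits=8, reg_bits=64):
    """`sdupi $imm8, r2, r1' as `vpermi $imm8, r2, r1'."""
    return vpermi(imm8, source, chunk_bits, reg_bits)
```

In this model `sdup' and `sdupi $0' give the same result, which matches
the two equivalent forms listed above.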

If you insist, I will investigate whether `sdup r3, r2, r1' can be added.
I guess it's possible (there are some unused flags).  Oh BTW, I forgot
the encoding (for the manual):

    vsel:
        31-24   OP_VSEL (replaces OP_SDUP)
        23-22   size
        21      s- prefix
        20-18   unused
        17-12   r3
        11- 6   r2
         5- 0   r1

    vseli:
        31-24   OP_VSELI (new)
        23-22   size
        21      s- prefix
        20      unused
        19-12   unsigned 8-bit immediate
        11- 6   r2
         5- 0   r1
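As a cross-check of the two field layouts, here is a Python sketch that
packs the fields into a 32-bit instruction word. The numeric values of
OP_VSEL and OP_VSELI are not given in this mail, so the opcode is
passed in as a parameter, and the values used in any call are
placeholders only.

```python
def _check(name, value, bits):
    """Reject field values that do not fit into the given number of bits."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} does not fit in {bits} bits: {value}")
    return value

def encode_vsel(opcode, size, simd, r3, r2, r1):
    """Register form: opcode 31-24, size 23-22, s- prefix 21, r3 17-12, r2 11-6, r1 5-0."""
    return (_check("opcode", opcode, 8) << 24
            | _check("size", size, 2) << 22
            | (1 if simd else 0) << 21       # bits 20-18 stay zero (unused)
            | _check("r3", r3, 6) << 12
            | _check("r2", r2, 6) << 6
            | _check("r1", r1, 6))

def encode_vseli(opcode, size, simd, imm8, r2, r1):
    """Immediate form: as above, but bits 19-12 hold the unsigned 8-bit immediate."""
    return (_check("opcode", opcode, 8) << 24
            | _check("size", size, 2) << 22
            | (1 if simd else 0) << 21       # bit 20 stays zero (unused)
            | _check("imm8", imm8, 8) << 12
            | _check("r2", r2, 6) << 6
            | _check("r1", r1, 6))
```

Setting the s- prefix bit in the register form corresponds to `vperm'
(alias `svsel'), and in the immediate form to `vpermi' (alias `svseli').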

> The "HW cost" of the permutation seems to be more than byterev, but not
> much more.

In the current implementation, the cost is almost 0. Byterev & friends use
a set of byte-wide muxes. Their select inputs are driven with constant
values, depending on the operating mode. `vsel' uses the same muxes,
but takes the select inputs from the second operand instead - that's all.
It doesn't even increase the latency.
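The idea can be modelled in a few lines of Python. This is a sketch of
the principle, not of the actual VHDL: it assumes 8-bit chunks, one mux
per output byte, and a whole-register `byterev'; the function names are
invented for the example.

```python
def byte_shuffler(source_bytes, selects):
    """One byte-wide mux per output position; selects[i] picks the input byte for output i."""
    return [source_bytes[s] for s in selects]

def byterev_selects(n):
    """Constant select values for a byte reversal across n bytes."""
    return list(range(n - 1, -1, -1))

def sdup_selects(n):
    """Constant select values that copy byte 0 to every position."""
    return [0] * n

def vperm_selects(selector_bytes):
    """Select values taken from an operand instead of a constant table."""
    n = len(selector_bytes)
    return [s % n for s in selector_bytes]   # keep only the required bits (n a power of two)

data = [0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08]   # byte 0 first
print(byte_shuffler(data, byterev_selects(8)))
print(byte_shuffler(data, vperm_selects([7, 6, 5, 4, 3, 2, 1, 0])))
```

The mux array (`byte_shuffler') is identical in all three cases; only
the source of the select values differs.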

> Now the question is :
>  do we separate the SIMD instructions (permutation, selection,
> duplication... on chunks)
> from the SHL unit (which only deals with bits) ?
> The answer would be best answered by Michael but this is something that
> i would certainly do normally.

They are already separated (kind of). Bitwise ops use the omega
shifter, while bytewise ops use the mux-based `byte shuffler'.
They just share input and output ports.

> There is another instruction that would be cool but not easy to define
> or implement :
>  a shuffle instruction that puts the Nth bit of the Mth chunk
> into the Mth bit of the Nth chunk.
> 
> Usually, it is done with 64-bit registers : bit 1 of byte 8 goes to bit 8
> of byte 1.
> This is useful for example for 'interpreting masks', when a character
> has been detected and the resulting bytewide mask must be turned into a
> bitfield....
> 
> This is ok for 64-bit CPU but our case makes it difficult : how would this
> be in a 128-bit or 256-bit CPU ? One solution would be to allow different
> chunk sizes to accommodate the cases where the chunk width is not equal
> to the number of chunks.
> 
> I guess that this instruction can be implemented with Michael's Omega
> shifter but the control logic that deals with the chunk sizes etc is not
> the easiest part.

I'm not sure. An omega shifter is limited to a subset of all possible
shuffling operations, and this one looks pretty complicated. You'd
better not count on it.
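For reference, the 64-bit case (8 chunks of 8 bits) is a plain 8x8 bit
matrix transpose. Here is a Python sketch of what the operation
computes; it says nothing about how it might be built in hardware, and
the function name and the bit/byte numbering from 0 (least significant
first) are assumptions for the example.

```python
def bit_transpose_8x8(x):
    """Move bit n of byte m to bit m of byte n, for a 64-bit value."""
    y = 0
    for m in range(8):                        # byte (chunk) index
        for n in range(8):                    # bit index within the byte
            bit = (x >> (8 * m + n)) & 1
            y |= bit << (8 * n + m)
    return y

# A bytewide mask with bytes 0 and 2 set, as a compare instruction might produce.
mask = 0x0000000000FF00FF
print(hex(bit_transpose_8x8(mask)))           # every byte now holds the bitfield 0b101
```

Applying the function twice returns the original value. For wider
registers the matrix is no longer square unless the chunk size changes
too, which is exactly the difficulty mentioned above.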

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"