scatter/gather op ( was:Re: [f-cpu] New EU_SHL Instruction)
On Thu, 9 Jan 2003 14:01:51 +0100
Michael Riepe <michael@stud.uni-hannover.de> wrote:
> On Thu, Jan 09, 2003 at 01:59:35AM +0100, Yann Guidon wrote:
> [...]
> > and_reduce (or "combine" as written in ROP2) is not possible
> > for very wide data.
> >
> > Furthermore, the xorn.and trick is useful for "detecting" that a
> > byte corresponds, but if you need to find the index of the
> > character, the "obvious" answer is to loop over the register.
> > if you have a result of 0x00FF000000000000, it's not a good
> > solution. So the idea is to "transpose" the bits in the word, that
> > would become 0x4040404040404040 and the last byte can then be
> > binary encoded in INC (if it's implemented).
>
> Wouldn't it be sufficient to `collapse' each chunk into a single bit?
That's an intra-chunk gather operation. (Gather ops like this are missing
from the whole f-cpu ISA, because inter-chunk operations were designed with
a 64-bit CPU in mind instead of a 256-bit version.)
An add gather could be useful too!
gather.add.64 V1 V2 R3
R3 = V1[0]+V1[1]+V1[2]+V1[3]
+V2[0]+V2[1]+V2[2]+V2[3]
(big tree adder ?)
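To make the intended semantics concrete, here is a minimal scalar model in C. It is only a sketch: the 256-bit register is assumed to hold four 64-bit chunks, the type vreg256 and the function name gather_add_64 are hypothetical, and wrap-around on overflow is my assumption, not something decided above.

```c
#include <stdint.h>

#define CHUNKS 4  /* assumption: 256-bit register = four 64-bit chunks */

/* Hypothetical software model of one vector register. */
typedef struct { uint64_t c[CHUNKS]; } vreg256;

/* Models "gather.add.64 V1 V2 R3": add every chunk of both operands
   into a single scalar. Unsigned arithmetic wraps on overflow. */
static uint64_t gather_add_64(vreg256 v1, vreg256 v2)
{
    uint64_t sum = 0;
    for (int i = 0; i < CHUNKS; i++)
        sum += v1.c[i] + v2.c[i];
    return sum;
}
```

In hardware the loop would be an adder tree rather than a sequential chain; the model only fixes what the result should be.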
This avoids the awkward end-of-loop code in many mathematical operations
(imagine an unrolled MAC loop for a digital filter):
int X[100], Coeff[100], out;
init(Coeff);
out=0;
for(int i=0; i<100; i++)
{
out+=X[i]*Coeff[i];
}
Such loops are a dream for SIMD (8*32 = 256-bit registers):
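V8i X[(100+7)/8], Coeff[(100+7)/8], Vout1, Vout2; /* 13 vectors, the last one half used */
int out;
init(Coeff);
out=0;
Vout1=Vout2=0; /* clear both vector accumulators */
for(int i=0; i<100/8; i+=2) /* 12 full vectors = elements 0..95 */
{
Vout1+=X[i]*Coeff[i];
Vout2+=X[i+1]*Coeff[i+1]; /* second accumulator, to hide the internal
dependencies of the MAC op! */
}
for(int i=100-100%8; i<100; i++) /* the remaining 100%8 = 4 elements */
{
out+=((int*)X)[i]*((int*)Coeff)[i];
}
out+=gather_add(Vout1,Vout2);
return out;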
This kind of gather saves you from doing strange manipulations on the
vectors in registers. These are Vector-Vector->Scalar or
Vector-Scalar->Scalar operations. The inverse could be useful too
(scatter): Scalar-Scalar->Vector.
Add is the most obvious op for this kind of thing, but maybe other ops
could be useful too?
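The SIMD loop can be emulated in portable C to check the idea. This is a sketch under assumptions: the struct v8i stands in for a 256-bit register of eight 32-bit lanes, and v8i_mac, gather_add and dot are hypothetical helper names, not f-cpu instructions.

```c
#include <stddef.h>

#define LANES 8  /* assumption: 8 x 32-bit lanes = one 256-bit register */

typedef struct { int lane[LANES]; } v8i;  /* hypothetical register model */

/* Lane-wise multiply-accumulate: acc += x * c, lane by lane. */
static void v8i_mac(v8i *acc, const int *x, const int *c)
{
    for (int l = 0; l < LANES; l++)
        acc->lane[l] += x[l] * c[l];
}

/* Horizontal add of two accumulators into one scalar. */
static int gather_add(const v8i *a, const v8i *b)
{
    int sum = 0;
    for (int l = 0; l < LANES; l++)
        sum += a->lane[l] + b->lane[l];
    return sum;
}

/* Dot product: two independent accumulators in the main loop,
   one horizontal add at the end, then a scalar loop for the tail. */
static int dot(const int *x, const int *coeff, size_t n)
{
    v8i acc1 = {{0}}, acc2 = {{0}};
    size_t i = 0;
    for (; i + 2 * LANES <= n; i += 2 * LANES) {
        v8i_mac(&acc1, x + i, coeff + i);
        v8i_mac(&acc2, x + i + LANES, coeff + i + LANES);
    }
    int out = gather_add(&acc1, &acc2);
    for (; i < n; i++)  /* leftover elements, fewer than 2*LANES */
        out += x[i] * coeff[i];
    return out;
}
```

With n = 100 the main loop covers elements 0..95 and the tail loop the last 4, so the only reduction across lanes happens once, after the loop.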
For bit-wise operations like and/or_reduce, this is an intra-chunk op,
because bit-wise ops are just SIMD on 1-bit integers :)
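For reference, the collapse/uncollapse pair quoted below can be modelled in a few lines of C. This is only a sketch limited to one 64-bit register; the function names and the chunk_bits parameter (8 for ".b", 16 for ".d", as the quoted example values imply) are my assumptions.

```c
#include <stdint.h>

/* collapse: bit i of the result is set iff chunk i of x is non-zero.
   chunk_bits must divide 64 and be smaller than 64 (8, 16 or 32). */
static uint64_t collapse(uint64_t x, unsigned chunk_bits)
{
    uint64_t mask = (1ULL << chunk_bits) - 1;
    uint64_t r = 0;
    for (unsigned i = 0; i < 64 / chunk_bits; i++)
        if ((x >> (i * chunk_bits)) & mask)
            r |= 1ULL << i;
    return r;
}

/* uncollapse: chunk i of the result is all ones iff bit i of bits is set.
   Only the chunks that fit in 64 bits are produced. */
static uint64_t uncollapse(uint64_t bits, unsigned chunk_bits)
{
    uint64_t mask = (1ULL << chunk_bits) - 1;
    uint64_t r = 0;
    for (unsigned i = 0; i < 64 / chunk_bits; i++)
        if ((bits >> i) & 1)
            r |= mask << (i * chunk_bits);
    return r;
}
```

collapse is an or_reduce inside each chunk followed by a gather of the resulting bits, which is why it sits between the intra-chunk and inter-chunk cases discussed above.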
nicO
> That is, if the chunk's value is not zero, the corresponding bit will
> be set, otherwise it will be zero:
>
> r2 = 0xab00cd00ef0000
> collapse.b r2, r1
> r1 <= 0x54
> collapse.d r2, r1
> r1 <= 0x0e
>
> and so on. A complementary `uncollapse' instruction would be nice,
> too (it would allow you to generate chunk masks more easily):
>
> r2 = 0x5a
> uncollapse.b r2, r1
> r1 <= 0x00ff00ffff00ff00
> uncollapse.d r2, r1
> r1 <= 0x0000ffff0000ffffffff0000ffff0000 // yes, that's 128 bits ;)
>
> --
> Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
> "All I wanna do is have a little fun before I die"
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu in the body. http://f-cpu.seul.org/