
scatter/gather op ( was:Re: [f-cpu] New EU_SHL Instruction)



On Thu, 9 Jan 2003 14:01:51 +0100
Michael Riepe <michael@stud.uni-hannover.de> wrote:

> On Thu, Jan 09, 2003 at 01:59:35AM +0100, Yann Guidon wrote:
> [...]
> > and_reduce (or "combine" as written in ROP2) is not possible
> > for very wide data.
> > 
> > Furthermore, the xorn.and trick is useful for "detecting" that a
> > byte corresponds, but if you need to find the index of the
> > character, the "obvious" answer is to loop over the register.
> > if you have a result of 0x00FF000000000000, it's not a good
> > solution. So the idea is to "transpose" the bits in the word, that
> > would become 0x4040404040404040 and the last byte can then be
> > binary encoded in INC (if it's implemented).
> 
> Wouldn't it be sufficient to `collapse' each chunk into a single bit?

That's an intra-chunk gather operation. (Such gather ops are missing
from the whole F-CPU ISA, because the inter-chunk operations were designed
for the 64-bit CPU instead of thinking about a 256-bit version.)

An add gather could be useful too!

gather.add.64 V1 V2 R3

R3 = V1[0]+V1[1]+V1[2]+V1[3]
    +V2[0]+V2[1]+V2[2]+V2[3]

(big tree adder ?)
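
To make the intended semantics concrete, here is a minimal C sketch of what gather.add.64 would compute, assuming a 256-bit register that holds four 64-bit chunks. The names vec256, tree_sum and gather_add_64 are hypothetical and only model the proposal; they are not part of any F-CPU definition. The pairwise sums mirror the "tree adder" idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* assumption: 256-bit register = four 64-bit chunks */

/* Hypothetical software model of one vector register. */
typedef struct { uint64_t lane[LANES]; } vec256;

/* Pairwise (tree) reduction of one register:
   two independent adds in the first level, one add in the second. */
static uint64_t tree_sum(const vec256 *v)
{
    uint64_t lo = v->lane[0] + v->lane[1];
    uint64_t hi = v->lane[2] + v->lane[3];
    return lo + hi;
}

/* Models "gather.add.64 V1 V2 R3": R3 = sum of all chunks of V1 and V2.
   Unsigned arithmetic, so the result wraps modulo 2^64. */
static uint64_t gather_add_64(const vec256 *v1, const vec256 *v2)
{
    assert(v1 != NULL && v2 != NULL);
    return tree_sum(v1) + tree_sum(v2);
}
```

With V1 = {1, 2, 3, 4} and V2 = {10, 20, 30, 40} the model returns 110. Whether real hardware would wrap or saturate on overflow is left open here.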

This avoids a clumsy end-of-loop reduction in many mathematical operations
(imagine an unrolled MAC loop for a digital filter):

int X[100], Coeff[100], out;

init(Coeff);

out=0;
for(int i = 0; i < 100; i++)
{
 out+=X[i]*Coeff[i];
}

Such loops are a dream for SIMD (8*32 = 256-bit registers):

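V8i X[13], Coeff[13], Vout1, Vout2; /* 100 ints, rounded up to 13 vectors */
int out;
init(Coeff);

out = 0;
Vout1 = Vout2 = 0;
for(int i = 0; i < 100/8; i += 2) /* 100/8 = 12 full vectors = 96 elements */
{
 Vout1 += X[i]*Coeff[i];
 Vout2 += X[i+1]*Coeff[i+1]; /* second accumulator, to hide the internal
dependencies of the MAC op! */
}

for(int i = 96; i < 100; i++) /* the remaining 100 % 8 = 4 elements */
{
	out += ((int *)X)[i] * ((int *)Coeff)[i];
}

out += gather_add(Vout1, Vout2);

return out;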
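
The loop above can be checked against a plain scalar version with an ordinary C emulation. This is only a sketch under stated assumptions: v8i, mac, gather_add, dot_simd and dot_scalar are hypothetical names, the "register" is a struct of eight 32-bit lanes, and gather_add stands in for the proposed instruction. Inputs are assumed small enough that the 32-bit sums do not overflow.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8  /* 8 x 32 bits = one 256-bit register */

/* Hypothetical software model of one vector register. */
typedef struct { int32_t lane[LANES]; } v8i;

/* Lane-wise multiply-accumulate: acc += x * c for LANES elements. */
static void mac(v8i *acc, const int32_t *x, const int32_t *c)
{
    for (int l = 0; l < LANES; l++)
        acc->lane[l] += x[l] * c[l];
}

/* Stand-in for the proposed gather add: sum of all lanes of both registers. */
static int32_t gather_add(const v8i *a, const v8i *b)
{
    int32_t sum = 0;
    for (int l = 0; l < LANES; l++)
        sum += a->lane[l] + b->lane[l];
    return sum;
}

/* Dot product with two independent vector accumulators and a scalar tail. */
static int32_t dot_simd(const int32_t *x, const int32_t *c, int n)
{
    assert(n >= 0);
    v8i acc1 = {{0}}, acc2 = {{0}};
    int i = 0;

    /* Main loop: two registers (16 elements) per iteration. */
    for (; i + 2 * LANES <= n; i += 2 * LANES) {
        mac(&acc1, x + i, c + i);
        mac(&acc2, x + i + LANES, c + i + LANES);
    }

    /* Scalar tail for the elements that do not fill a register pair. */
    int32_t out = 0;
    for (; i < n; i++)
        out += x[i] * c[i];

    /* One gather add replaces the lane-by-lane reduction at the end. */
    return out + gather_add(&acc1, &acc2);
}

/* Reference: the original scalar loop. */
static int32_t dot_scalar(const int32_t *x, const int32_t *c, int n)
{
    int32_t out = 0;
    for (int i = 0; i < n; i++)
        out += x[i] * c[i];
    return out;
}
```

For n = 100 the main loop covers 96 elements in six iterations and the tail handles the last four, as in the pseudo-code above.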

This kind of gather saves you from doing strange manipulations on the
vectors in registers. It is a Vector-Vector->Scalar or
Vector-Scalar->Scalar operation. The inverse could be useful too
(scatter): Scalar-Scalar->Vector.

Add is the most obvious op for such a thing, but maybe other ops could be
useful too?

Bit-wise operations like and/or_reduce are intra-chunk ops,
because bit-wise ops are just SIMD on 1-bit integers :)

nicO


> That is, if the chunk's value is not zero, the corresponding bit will
> be set, otherwise it will be zero:
> 
> 	r2 = 0xab00cd00ef0000
> 	collapse.b r2, r1
> 	r1 <= 0x54
> 	collapse.d r2, r1
> 	r1 <= 0x0e
> 
> and so on. A complementary `uncollapse' instruction would be nice,
> too (it would allow you to generate chunk masks more easily):
> 
> 	r2 = 0x5a
> 	uncollapse.b r2, r1
> 	r1 <= 0x00ff00ffff00ff00
> 	uncollapse.d r2, r1
> 	r1 <= 0x0000ffff0000ffffffff0000ffff0000	// yes, that's 128 bits ;)
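
The collapse/uncollapse examples above can be reproduced with a small C model. This is a sketch, not a definition of the instructions: the function names and the explicit width argument (8 for .b, 16 for .d) are assumptions, chunks are numbered from the least significant end, and everything is limited to a 64-bit register, so the 128-bit uncollapse.d result is only reproduced in its low 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* collapse: one result bit per chunk of `width` bits.
   Bit k is set iff chunk k (counted from the LSB end) is non-zero. */
static uint64_t collapse(uint64_t src, unsigned width)
{
    assert(width >= 1 && width < 64 && 64 % width == 0);
    uint64_t chunk_mask = (UINT64_C(1) << width) - 1;
    uint64_t result = 0;
    for (unsigned k = 0; k < 64 / width; k++)
        if ((src >> (k * width)) & chunk_mask)
            result |= UINT64_C(1) << k;
    return result;
}

/* uncollapse: bit k of `bits` becomes an all-ones or all-zeros chunk k.
   Only the low 64/width bits of `bits` fit into the 64-bit result. */
static uint64_t uncollapse(uint64_t bits, unsigned width)
{
    assert(width >= 1 && width < 64 && 64 % width == 0);
    uint64_t chunk_mask = (UINT64_C(1) << width) - 1;
    uint64_t result = 0;
    for (unsigned k = 0; k < 64 / width; k++)
        if ((bits >> k) & 1)
            result |= chunk_mask << (k * width);
    return result;
}
```

Under these assumptions collapse(0x00ab00cd00ef0000, 8) gives 0x54, collapse(0x00ab00cd00ef0000, 16) gives 0x0e, and uncollapse(0x5a, 8) gives 0x00ff00ffff00ff00, matching the values quoted above.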
> 
> -- 
>  Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
>  "All I wanna do is have a little fun before I die"
> *************************************************************
> To unsubscribe, send an e-mail to majordomo@seul.org with
> unsubscribe f-cpu       in the body. http://f-cpu.seul.org/