[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [f-cpu] GCC and jmpz vs. jmpl



On Thu, Jan 09, 2003 at 11:52:17PM +0100, Yann Guidon wrote:

[2-stage INC/CMP]
> >Not my fault. Whoever designed the original INC unit underestimated the
> >cost of a SIMD-enabled `and' reduce tree.
> >
> is it caused by the SIMD ?
> without SIMD, i think it's ok.
> and i seemed to rememer that
> adding SIMD was not that expensive.

Let's count gates... for 64-bit increment, you need a 4:1 tree of depth 3,
plus a row of XORs -> that fits. For SIMD, you can use separate trees and
a big MUX behind them -> still fits. But now comes the input selector
for inc/dec/neg, and another MUX at the output that selects between
negated and original operand (for `abs'), and the stage is already too
full. Finally, there is the 64:8 binary encoder for `lsb' (or rather,
one for each chunk size), and the second stage is full, too. If we would
only perform inc/dec, a single stage would probably do the trick.

> But if INC is going to last 2 cycles,
> then it should be possible to shove a binary
> encoder at the end of the unit....

It's already there.

> that would be cool and this would help
> for finding the first char in a strcmp, for example :
>   ; we get here because the loop exited with a non-zero match
>   ; in r1 ( bytes either 00 or FF)
> LSB1 r1, r2 ; the mask is first priority encoded, to remove trailing bits,
>                     ; and the position of the LSB that is set is put into r2
> shri 3, r2, r2 ; shift 3 LSB out (because we deal with bytes)
> or r2, r3, r3 ; the pointer in r3 (where the match occured) is adjusted
>                   ; it is normally 64-bit (or more) aligned, and the OR 
> added the offset.

You haven't looked at my assembler version of strchr(), have you? It
does exactly that.

> i just realized that this nice code seems to be independent from the 
> register size :-)

Not if `lsb1' and `shiftri' are restricted to 64 bits (as they are now).

> BTW, the binary encoding is already in my mind because the
> LSB0/1 and MSB0/1 instructions were not always precise about this :
> is the output the rank of the bit, or simply a single bit ?

The output is the number (or index) of the bit, plus 1 (it will return
0 if it didn't find anything).

> maybe the binary encoder can be bypassed with a flag bit
> in the instruction ? So we can have the raw priority encoded
> result, or the binary encoding, on demand. Of course,
> because the binary encoding is just a bunch of ORs,
> the priority encoding is necessary to avoid that more than
> one bit is set at the input of the binary encoder.

The scheme I use is a little different.  The reduce operation will
give me a bit vector that contains 0's on one side and 1's on the
other:

	00000000001111111111111111111111

That simplifies the encoder pretty much; e.g. for a 32-bit operation,

	result(5) := vector(31);
	result(4) := vector(31) xor vector(15);
	result(3) := (vector(31) xor vector(23)) or (vector(15) xor vector(7));

and so on.

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/