[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [f-cpu] GCC and jmpz vs. jmpl
On Thu, Jan 09, 2003 at 11:52:17PM +0100, Yann Guidon wrote:
[2-stage INC/CMP]
> >Not my fault. Whoever designed the original INC unit underestimated the
> >cost of a SIMD-enabled `and' reduce tree.
> >
> is it caused by the SIMD ?
> without SIMD, i think it's ok.
> and i seemed to rememer that
> adding SIMD was not that expensive.
Let's count gates... for 64-bit increment, you need a 4:1 tree of depth 3,
plus a row of XORs -> that fits. For SIMD, you can use separate trees and
a big MUX behind them -> still fits. But now comes the input selector
for inc/dec/neg, and another MUX at the output that selects between
negated and original operand (for `abs'), and the stage is already too
full. Finally, there is the 64:8 binary encoder for `lsb' (or rather,
one for each chunk size), and the second stage is full, too. If we would
only perform inc/dec, a single stage would probably do the trick.
> But if INC is going to last 2 cycles,
> then it should be possible to shove a binary
> encoder at the end of the unit....
It's already there.
> that would be cool and this would help
> for finding the first char in a strcmp, for example :
> ; we get here because the loop exited with a non-zero match
> ; in r1 ( bytes either 00 or FF)
> LSB1 r1, r2 ; the mask is first priority encoded, to remove trailing bits,
> ; and the position of the LSB that is set is put into r2
> shri 3, r2, r2 ; shift 3 LSB out (because we deal with bytes)
> or r2, r3, r3 ; the pointer in r3 (where the match occured) is adjusted
> ; it is normally 64-bit (or more) aligned, and the OR
> added the offset.
You haven't looked at my assembler version of strchr(), have you? It
does exactly that.
> i just realized that this nice code seems to be independent from the
> register size :-)
Not if `lsb1' and `shiftri' are restricted to 64 bits (as they are now).
> BTW, the binary encoding is already in my mind because the
> LSB0/1 and MSB0/1 instructions were not always precise about this :
> is the output the rank of the bit, or simply a single bit ?
The output is the number (or index) of the bit, plus 1 (it will return
0 if it didn't find anything).
> maybe the binary encoder can be bypassed with a flag bit
> in the instruction ? So we can have the raw priority encoded
> result, or the binary encoding, on demand. Of course,
> because the binary encoding is just a bunch of ORs,
> the priority encoding is necessary to avoid that more than
> one bit is set at the input of the binary encoder.
The scheme I use is a little different. The reduce operation will
give me a bit vector that contains 0's on one side and 1's on the
other:
00000000001111111111111111111111
That simplifies the encoder pretty much; e.g. for a 32-bit operation,
result(5) := vector(31);
result(4) := vector(31) xor vector(15);
result(3) := (vector(31) xor vector(23)) or (vector(15) xor vector(7));
and so on.
--
Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
"All I wanna do is have a little fun before I die"
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu in the body. http://f-cpu.seul.org/