[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[f-cpu] Zen and the art of F-CPU assembler coding



On Mon, Dec 09, 2002 at 12:13:13PM +0100, devik wrote:
[...]
> > But the function of the combine operations is also performed
> > (in a slower way) by the ASU with saturation mode : it requires
> > a bit more time and some more instruction but it does it.
> 
> like add/subing 0xFE in saturated byte mode ? The bit 0 is
> the result .. Hmm ! It is wonderfull !!! So cor.nxor is:
> xor.8 ,,r1
> addsi.8 0xfe,r1,r1
> inc.8 r1,r1
[...]

.o( What an ugly mess! )

It's more wonderful than you think: Both `cor' and `cand' can be
substituted with only two other F-CPU instructions. No additional
registers are required. And in contrast to your version, it's a general
solution that will also work with larger chunks. Just add an unsigned
compare-with-zero to a normal logic operation, and you've got it.

cor.<op>.<sz> is pretty obvious:

    <op> b,a,r1
    scmpl.<sz> r1,r0,r1     // in C: `-((a OP b) > 0u)'

cand.<op>.<sz> is a little more complex: it requires that the result of
the logic operation <op> is inverted: (*)

    n<op> b,a,r1
    scmple.<sz> r0,r1,r1    // in C: `-(~(a OP b) <= 0u)'

(*) All F-CPU logic operations are invertible:

     <op>    <->   n<op>
    ======================
    a and b  <->  a nand b
    a  or b  <->  a  nor b
    a xor b  <->  a xnor b
    a orn b  <->  b andn a  // DeMorgans law

Q.E.D. :)

With this substitution, the main loop of `strchr' may look like this:

    setup:
        loadi $8, r1, r3        // load first word early
        sdup.b r2, r2
        loadaddri $end_of_loop-.-4, r7
        loopentry r8
    start_of_loop:
        xor r2, r3, r4          // t=0 if data in LSU
        scmple.b r0, r3, r5     // t=1
        scmple.b r0, r4, r6     // t=2
        loadi $8, r1, r3        // t=3
        jmpnz r5, r7            // t=4
        jmpz r6, r8             // t=5
    end_of_loop:

It takes 6 cycles for an 8-byte loop (additional cycles for the backward
jump not counted), which is as fast as your version. This is the perfect
code for testing the scheduler and the bypassing mechanism: there are RAW
dependencies for r4, r5 and r6, and the timing is as tight as possible :)

Note that we could also use combine:

    setup:
        loadi $8, r1, r3        // load first word early
        sdup.b r2, r2
        loadaddri $end_of_loop-.-4, r7
        loopentry r8
    start_of_loop:
        xnor.and.b r0, r3, r4   // t=0 if data in LSU
        xnor.and.b r2, r3, r5   // t=1
        loadi $8, r1, r3        // t=2
        jmpnz r4, r7            // t=3
        jmpz r5, r8             // t=4
    end_of_loop:

This is the fastest version I can imagine: 5 cycles (1.6 bytes/cycle with
64-bit registers). Instruction scheduling is optimal unless `xnor.and.b'
takes more than two cycles (which is very unlikely).

On the other hand, the examples show that we can live without combine.
If we drop it (or maybe move it to a second pipeline stage and make it
support *all* chunk sizes), there will be enough room in the ROP2 unit to
add `n -> 2**n' decoders in front of it for 1-cycle btst/bchg/bset/bclr
instructions that are more useful than combine, IMHO.

.o( Yann probably wants to kill me now )

Greetings from your friendly bit mangler,
-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"
 ... now this *was* fun, wasn't it? :)
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/