[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[f-cpu] strchr() again



Hi F-gang,

here is my version of strchr(), in `mr-as' syntax as usual. It's similar
to deviks version, but I use my optimized 5-cycle main loop, and I handle
some things differently. As a result, the prologue is slightly shorter.

I hope that I got the matching conditions right - there may be several
matches with both the target character and 0, and only the very first
one is valid. If there only is a match with zero, the function will take
the emergency exit (note that the sequence of the jump instructions is
important). The epilogue contains a little speculative execution, in
order to avoid stalls (if you can still find some, or you find a bug,
please tell me).

If there is no match, the routine takes approximately 13 + 5/8 * n
instructions (n = string length). An empty string is scanned in 16
cycles. If a match is found, there are approximately 20 + 5/8 * n
instructions to execute (n = relative position of the match). Compared
to a bytewise search, which will take at least 3 cycles per byte (load,
compare and jump), the overhead amortizes for n >= 6...9 bytes, depending
on whether there's a match and where it is located. At n >= 35...54 bytes,
speed increases above 1.0 bytes/cycle, asymptotically approaching 1.6
bytes/cycle (for a 64-bit CPU - double width, double throughput!).

    .align 32               // align to cache line boundary
.global strchr
strchr:
    // prologue
    andni $7, r1, r3        // aligned `long*' pointer
    andi $7, r1, r4         // byte offset
    loadi $8, r3, r5        // load first data word
    shiftli $3, r4, r7      // 8 * byte offset
    orn r0, r0, r8          // -1L
    sdup.b r2, r6           // duplicate character value
    shiftl r7, r8, r9       // mask
    loadaddri $found-.-4, r31
    move r0, r1             // preload NULL result

    // first compare
    xnor.and.b r6, r5, r10  // compare with character
    xnor.and.b r0, r5, r11  // compare with 0
    loadi $8, r3, r5        // load second data word
    and r9, r10, r12        // mask off unused bytes
    and r9, r11, r13        // mask off unused bytes
    jmpnz r12, r31          // found character
    jmpnz r13, r63          // found 0, terminate

    // main loop
    loopentry r30
    xnor.and.b r6, r5, r12  // compare with character
    xnor.and.b r0, r5, r13  // compare with 0
    loadi $8, r3, r5        // load next data word
    jmpnz r12, r31          // found character
    jmpz r13, r30           // loop unless a 0 occurred
    jmp r63                 // terminate

    // XXX: align the epilogue? A single NOP would do it.

    // epilogue
found:
    // Found a matching character, but there may also be a matching
    // zero. Check which one comes first (if both are in the same
    // place, we must have been looking for the 0).
    lsb1 r12, r14           // position of first matching byte
    lsb1 r13, r15           // position of first zero byte (if any)
    subi $16, r3, r18       // roll back pointer (it's 2 words ahead)
    shiftri $3, r14, r17    // offset of matching byte
    cmple r15, r14, r16     // match is valid if r14 <= r15 (or r15 == 0)
    add r17, r18, r1        // pointer to matching byte
    jmpz r15, r63           // return pointer if there was no matching 0
    movez r16, r0, r1       // if terminating 0 came before the match
    jmp r63

For ultra-fast execution, we can define a variant that requires strict
8-byte alignment. We may also combine both versions (check the alignment
at the start of the general routine, and branch to the slow version if
it's not ok). The epilogue is the same, therefore I omit it here (the
epilogue code may even be shared by both versions). The main loop also
is the same, but I wanted to avoid the unnecessary loadaddr and jmp
instructions required for jumping to the existing loop :)

aligned_strchr:
    // prologue
    loadi $8, r1, r5        // load first data word
    sdup.b r2, r6           // duplicate character value
    loadaddri $found-.-4, r31
    move r1, r3             // move pointer away
    move r0, r1             // preload NULL result

    // XXX: better align the main loop
    // (can be done by moving the entry point above)

    // main loop
    loopentry r30
    xnor.and.b r6, r5, r12  // compare with character
    xnor.and.b r0, r5, r13  // compare with 0
    loadi $8, r3, r5        // load next data word
    jmpnz r12, r31          // found character
    jmpz r13, r30           // loop unless a 0 occurred
    jmp r63                 // terminate

Guess what? This was fun! :)

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/