
Re: EU Report (was: Re: [f-cpu] Register set revised)



On Fri, Mar 21, 2003 at 08:40:07AM +0100, Yann Guidon wrote:
[...]
> currently, for FC0, the computation time is directly proportional to the 
> instruction count
> (if there is no stall). in this case, "abs(a-b)" takes 2 instructions 
> and "max(a,b)-min(a,b)"
> takes 3 instructions (and cycles) even if the second case has a parallel 
> execution for the min/max.

Why is that?  Because we have only a single bypass/feedback path?
BTW: parallel execution of min/max is already implemented.

Note that there is an important difference between abs(a-b) and
max(a,b)-min(a,b): Due to restricted operand ranges and intermediate
overflows, the former may deliver the wrong result.  With min/max,
the result will always be correct (provided you choose the appropriate
signed/unsigned mode):

    // signed operation:
    // r1 := +127
    // r2 := -128
    // expected result is 255

    sub.8 r2, r1, r3        ; r3 = +255 (aka -1)
    abs.8 r3, r4            ; r4 = +1 (WRONG)

    minmaxs.8 r2, r1, r4    ; r4 = -128, r5 = +127
    sub.8 r4, r5, r6        ; r6 = 255 (CORRECT)

    // unsigned operation:
    // r1 := 255
    // r2 := 0
    // expected result is 255

    sub.8 r2, r1, r3        ; r3 = 255 (aka -1)
    abs.8 r3, r4            ; r4 = 1 (WRONG)

    minmax.8 r2, r1, r4     ; r4 = 0, r5 = 255
    sub.8 r4, r5, r6        ; r6 = 255 (CORRECT)

Surprised?

> The SAD is in fact doing 3 things : minus, absolute and finally the 
> parallel addition.
> BTW, i have seen instruction sets with the 2 first operations but NEVER 
> with the parallel
> addition.

AltiVec has `vector sum' instructions (although they're limited to
saturated adds, AFAIK).

[...]
> >The byte (or chunk) adder will also be useful in vector computations.
> >
> for what kinds of algorithms ???

All of them.  The three most common vector operations are:

    - Add/Sub: Y(i) = A(i) ± B(i)           ; sadd / ssub
    - Scale: Y(i) = r * A(i)                ; sdup + smul
    - Scalar Product: y = sum(A(i) * B(i))  ; smul + chunk-add

Additionally, the chunk-add instruction will be used several times
in every matrix and matrix/vector multiplication:  Remember that the
product of two n-order square matrices is just a collection of n*n
scalar products.

[...]
> >In any case, the chunks of a word can be combined by using `mix':
> >
> > mix.8 r0, r1, r2    // distributes the chunks across r2 and r3
> > add.16 r2, r3, r1
> > mix.16 r0, r1, r2
> > add.32 r2, r3, r1
> > mix.32 r0, r1, r2
> > add.64 r2, r3, r1   // gotcha!
> >  
> >
> :-)

Unfortunately, there are a lot of stalls. With a double feedback path,
this computation takes 15 cycles, or 18 cycles without it.  If there is
nothing you can interleave the code with, you'll waste a lot of time.
On the other hand, a chunk add (maybe we'd better call it `sum') can be
done in approximately 2-3 cycles.

> >This will also work with other commutative operations, e.g. mul.  A `chunk
> >add' insn may be more convenient, however (and will also be much faster).
> >  
> >
> probably but not used often enough.

As I tried to point out above, it's much more common than a specialized
SAD instruction, and also much more general.  IMHO it's much better to
provide a handful of powerful primitives that can be used to implement
a larger set of functions, rather than implementing the functions
themselves.

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/