# Re: EU Report (was: Re: [f-cpu] Register set revised)

```On Fri, Mar 21, 2003 at 08:40:07AM +0100, Yann Guidon wrote:
[...]
> currently, for FC0, the computation time is directly proportional to the
> instruction count
> (if there is no stall). in this case, "abs(a-b)" takes 2 instructions
> and "max(a,b)-min(a,b)"
> takes 3 instructions (and cycles) even if the second case has a parallel
> execution for the min/max.

Why that?  Because we have only a single bypass/feedback path?
BTW: parallel execution of min/max is already implemented.

Note that there is an important difference between abs(a-b) and
max(a,b)-min(a,b): Due to restricted operand ranges and intermediate
overflows, the former may deliver the wrong result.  With min/max,
the result will always be correct (provided you choose the appropriate
signed/unsigned mode):

// signed operation:
// r1 := +127
// r2 := -128
// expected result is 255

sub.8 r2, r1, r3        ; r3 = +255 (aka -1)
abs.8 r3, r4            ; r4 = +1 (WRONG)

minmaxs.8 r2, r1, r4    ; r4 = -128, r5 = +127
sub.8 r4, r5, r6        ; r6 = 255 (CORRECT)

// unsigned operation:
// r1 := 255
// r2 := 0
// expected result is 255

sub.8 r2, r1, r3        ; r3 = 255 (aka -1)
abs.8 r3, r4            ; r4 = 1 (WRONG)

minmax.8 r2, r1, r4     ; r4 = 0, r5 = 255
sub.8 r4, r5, r6        ; r6 = 255 (CORRECT)

Surprised?

> The SAD is in fact doing 3 things : minus, absolute and finally the
> BTW, i have seen instruction sets with the 2 first operations but NEVER
> with the parallel

AltiVec has `vector sum' instructions (although they're limited to

[...]
> >The byte (or chunk) adder will also be useful in vector computations.
> >
> for what kinds of algorithms ???

All of them.  The four most common vector operations are:

- Scale: Y(i) = r * A(i)                ; sdup + smul
- Scalar Product: y = sum(A(i) * B(i))  ; smul + chunk-add

in every matrix and matrix/vector multiplication:  Remember that the
product of two n-order square matrices is just a collection of n*n
scalar products.

[...]
> >In any case, the chunks of a word can be combined by using `mix':
> >
> > mix.8 r0, r1, r2    // distributes the chunks across r2 and r3
> > add.16 r2, r3, r1
> > mix.16 r0, r1, r2
> > add.32 r2, r3, r1
> > mix.32 r0, r1, r2
> > add.64 r2, r3, r1   // gotcha!
> >
> >
> :-)

Unfortunately, there are a lot of stalls. With a double feedback path,
this computation takes 15 cycles, or 18 cycles without it.  If there is
nothing you can interleave the code with, you'll waste a lot of time.
On the other hand, a chunk add (maybe we'd better call it `sum') can be
done in approximately 2-3 cycles.

> >This will also work with other commutative operations, e.g. mul.  A `chunk
> >add' insn may be more convenient, however (and will also be much faster).
> >
> >
> probably but not used often enough.

As I tried to point out above, it's much more common than a specialized
SAD instruction, and also much more general.  IMHO it's much better to
provide a handful of powerful primitives that can be used to implement
a larger set of functions, rather than implementing the functions
themselves.

--
Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
"All I wanna do is have a little fun before I die"
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/
```