[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: EU Report (was: Re: [f-cpu] Register set revised)
On Fri, Mar 21, 2003 at 08:40:07AM +0100, Yann Guidon wrote:
[...]
> currently, for FC0, the computation time is directly proportional to the
> instruction count
> (if there is no stall). in this case, "abs(a-b)" takes 2 instructions
> and "max(a,b)-min(a,b)"
> takes 3 instructions (and cycles) even if the second case has a parallel
> execution for the min/max.
Why that? Because we have only a single bypass/feedback path?
BTW: parallel execution of min/max is already implemented.
Note that there is an important difference between abs(a-b) and
max(a,b)-min(a,b): Due to restricted operand ranges and intermediate
overflows, the former may deliver the wrong result. With min/max,
the result will always be correct (provided you choose the appropriate
signed/unsigned mode):
// signed operation:
// r1 := +127
// r2 := -128
// expected result is 255
sub.8 r2, r1, r3 ; r3 = +255 (aka -1)
abs.8 r3, r4 ; r4 = +1 (WRONG)
minmaxs.8 r2, r1, r4 ; r4 = -128, r5 = +127
sub.8 r4, r5, r6 ; r6 = 255 (CORRECT)
// unsigned operation:
// r1 := 255
// r2 := 0
// expected result is 255
sub.8 r2, r1, r3 ; r3 = 255 (aka -1)
abs.8 r3, r4 ; r4 = 1 (WRONG)
minmax.8 r2, r1, r4 ; r4 = 0, r5 = 255
sub.8 r4, r5, r6 ; r6 = 255 (CORRECT)
Surprised?
> The SAD is in fact doing 3 things : minus, absolute and finally the
> parallel addition.
> BTW, i have seen instruction sets with the 2 first operations but NEVER
> with the parallel
> addition.
AltiVec has `vector sum' instructions (although they're limited to
saturated adds, AFAIK).
[...]
> >The byte (or chunk) adder will also be useful in vector computations.
> >
> for what kinds of algorithms ???
All of them. The four most common vector operations are:
- Add/Sub: Y(i) = A(i) ± B(i) ; sadd / ssub
- Scale: Y(i) = r * A(i) ; sdup + smul
- Scalar Product: y = sum(A(i) * B(i)) ; smul + chunk-add
Additionally, the chunk-add instruction will be used several times
in every matrix and matrix/vector multiplication: Remember that the
product of two n-order square matrices is just a collection of n*n
scalar products.
[...]
> >In any case, the chunks of a word can be combined by using `mix':
> >
> > mix.8 r0, r1, r2 // distributes the chunks across r2 and r3
> > add.16 r2, r3, r1
> > mix.16 r0, r1, r2
> > add.32 r2, r3, r1
> > mix.32 r0, r1, r2
> > add.64 r2, r3, r1 // gotcha!
> >
> >
> :-)
Unfortunately, there are a lot of stalls. With a double feedback path,
this computation takes 15 cycles, or 18 cycles without it. If there is
nothing you can interleave the code with, you'll waste a lot of time.
On the other hand, a chunk add (maybe we'd better call it `sum') can be
done in approximately 2-3 cycles.
> >This will also work with other commutative operations, e.g. mul. A `chunk
> >add' insn may be more convenient, however (and will also be much faster).
> >
> >
> probably but not used often enough.
As I tried to point out above, it's much more common than a specialized
SAD instruction, and also much more general. IMHO it's much better to
provide a handful of powerful primitives that can be used to implement
a larger set of functions, rather than implementing the functions
themselves.
--
Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
"All I wanna do is have a little fun before I die"
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu in the body. http://f-cpu.seul.org/