[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [f-cpu] F-CPU vs ALPHA
On Wed, Aug 15, 2001 at 05:44:34AM +0200, Yann Guidon wrote:
[...]
> 3R4W ? can you explain, or write a draft, about this unit ?
> i'm surprised.
That's only the SRT divider core. It takes a 128-bit numerator and a
64-bit denominator as inputs (3R) and produces a 64-bit quotient and
a 64-bit remainder, using a redundant encoding (that is, *two* 64-bit
vectors per number).
You already have an early draft of the IDU core -- search your mailing list
archive for the message id <20010302162502.24372@thrai.stud.uni-hannover.de>
or the subject "Divider News" (or look for the "Blah2" entity, probably
in a file named blah2.vhdl).
> if there is a painful problem, maybe we can go the route of half-emulation,
> like with the emulation of multiplies with a sequence of partial
> instructions ?
Dividers don't scale up, like multipliers and adders do.
I have to think that over; maybe there's a better (and smaller) solution;
I really don't want to waste so many transistors for an operation that
is rarely used.
[...]
> > and I'd rather add a `tiny float' data type (1+4+11 bits)
> ?
>
> > for low-precision operations than implement 40-bit FP (which is IMHO
> > rather useless, and a waste of space).
> are tiny floats really useful ?
In some cases, yes (especially if you can process 4 of them in a row ;).
Like audio/video processing where integers introduce too much rounding
errors but 32-bit floats would be overkill. Whenever you need an FP
type with 3-digit precision and values close to zero (-100...100, or
-10000...10000 if we use 1+5+10 form).
> > > // There is no floating square root instruction.
> > > We intend to provide "seed" generation for accelerating Newton-Raphson
> > > computations of divide and SQRT.
> > In the FC0, we can probably get away with some bit-shuffling (hardwired).
> > For sqrt(x), divide the exponent by 2 (right shift); for 1/x, invert it.
>
> believe me, if this was so simples, others would do it already.
> The problems is that this trick "only" does not provide enough precision.
>
> The strategy in most computers is not to run the NR iterations inside
> a loop, but to emit a fixed number of unrolled instructions.
> The "seed" must be constructed in such a way that when the specified number
> of iteration is executed, we get at least a certain precision/acuracy.
>
> For example, imagine we have a table that yields a "seed" for 8 acurate
> bits (in addition to a correct exponenent). A single run will give 16 acurate
> bits then another will give 32 bits of precision. if you work with IEEE floats,
> the compiler will generate 2 optimized unrolled iteration bodies.
> If you work with IEEE long reals, it will generate 3 iterations.
>
> If you "only" do an exponent adjustment,
> 1) you won't get a precise garanteed minimum acuracy of the approximation
Of course I will.
Let x be 1.m * 2**e (with 1.0 <= 1.m < 2.0)
Then y = 1/x is 1/1.m * 2**(-e), with 0.5 < 1/1.m <= 1.0
Let y0 = 0.75 * 2**(-e) be the first approximation
Then y - y0 becomes (1/1.m - 0.75) * 2**(-e),
and the maximum error is |y - y0| <= 0.25 * 2**(-e)
Q.E.D.
(I agree that it's not a *good* approximation ;)
> 2) you will thus have to put one NR iteration in a while(){} loop
> and if you work with SIMD float, the slow convergence of one number
> will keep the whole SIMD packet from being used even though other
> numbers are "ready" (converged).
That problem won't go away anyway.
> of course, if you don't mind, you can "forget" or "simplify" the seed LUT.
> but the program/compiler must be adapted in consequence.
With simple approximation (y0 = 0.75 * 2**(-exp(b))), it takes ~6
iterations to produce a `double' quotient. In order to reduce that
to ~3 iterations, we need to consider ~6 bits of `b' (that means we
need a lookup-table with at least 2**6 = 64 entries). 4 iterations
cost ~2**3 entries, 2 iterations ~2**13 entries (which is too much).
See the output of the attached C program (I don't have numbers for sqrt
approximation, sorry).
I think 4 iterations is good enough, 3 is already too expensive.
[...]
> > > I am still wondering if PALcode is such a good idea.
> > > We're used for a long time to rewrite the trap handlers for each new computer.
> > > Maybe this idea came from the VMS transition constraint, but there is
> > > no need of this stuff in the F-CPU.
> > We can implement a `PALcall' SR if we have to.
> and what would we do with it ? :-)
Play soccer? ;)
--
Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
"All I wanna do is have a little fun before I die"
/*
* Copyright (C) 2001 Michael Riepe <michael@stud.uni-hannover.de>
* All Rights Reserved.
*/
#include <stdio.h>
#define MAX_ITERATIONS 8
typedef unsigned long long ullong;
/* type conversion unit */
union blah {
double d;
ullong i;
};
/* `double' characteristics */
#define S_SHIFT 63
#define S_MASK (1ull << S_SHIFT)
#define E_SHIFT 52
#define E_MASK ((1ull << S_SHIFT) - (1ull << E_SHIFT))
#define M_SHIFT 0
#define M_MASK ((1ull << E_SHIFT) - (1ull << M_SHIFT))
#define BIAS 0x3ff
double epsilon = 0.0; /* accuracy */
int approx_bits = 0; /* approximation quality */
double
divide(double a, double b, unsigned *iter) {
union blah blah;
double x, y;
int asign, bsign;
short aexp, bexp;
ullong am, bm;
unsigned n;
/* decompose a */
blah = a;
asign = blah.i >> S_SHIFT;
aexp = (blah.i >> E_SHIFT) & 0x7ff;
am = blah.i & M_MASK;
aexp -= BIAS;
/* decompose b */
blah = b;
bsign = blah.i >> S_SHIFT;
bexp = (blah.i >> E_SHIFT) & 0x7ff;
bm = blah.i & M_MASK;
bexp -= BIAS;
/* normalize */
asign ^= bsign;
aexp -= bexp; /* over/underflow checking omitted */
bsign = 0;
bexp = 0; /* that is, 1.0 <= b < 2.0 */
/* restore a and b */
blah.i = am | ((ullong)(aexp + BIAS) << E_SHIFT) | ((ullong)asign << S_SHIFT);
a = blah.d;
blah.i = bm | ((ullong)(bexp + BIAS) << E_SHIFT) | ((ullong)bsign << S_SHIFT);
b = blah.d;
/* first approximation (i.e. the F-CPU `fiaprx' instruction) */
blah.i = bm & ((1ull << E_SHIFT) - (1ull << (E_SHIFT-approx_bits)));
blah.i |= 1ull << (E_SHIFT-approx_bits-1);
blah.i |= (ullong)BIAS << E_SHIFT;
x = 1.0 / blah.d;
/* iteration */
for (n = 0; n < MAX_ITERATIONS; n++) {
y = b * x;
if (1.0 - epsilon <= y && y <= 1.0 + epsilon) {
/* finished! */
break;
}
x *= (2.0 - y);
}
iter && (*iter = n);
return a * x;
}
int
main(int argc, char **argv) {
union blah blah;
unsigned n, maxn;
double q;
int a, b;
/* epsilon = 2**(-52) */
blah.i = (ullong)(BIAS - 52) << E_SHIFT;
epsilon = blah.d;
for (approx_bits = 0; approx_bits <= 16; approx_bits++) {
maxn = 0;
a = 1 << 20; /* some reasonable number range */
for (b = a; b < (a << 1); b++) {
q = divide((double)a, (double)b, &n); if (maxn < n) maxn = n;
/*
printf("%d / %d = %f after %u iterations\n", a, b, q, n);
*/
}
/* result */
printf("approx_bits = %2d -> %u iterations max\n", approx_bits, maxn);
}
exit(0);
}