[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [f-cpu] F-CPU vs ALPHA



On Wed, Aug 15, 2001 at 05:44:34AM +0200, Yann Guidon wrote:
[...]
> 3R4W ? can you explain, or write a draft, about this unit ?
> i'm surprised.

That's only the SRT divider core.  It takes a 128-bit numerator and a
64-bit denominator as inputs (3R) and produces a 64-bit quotient and
a 64-bit remainder, using a redundant encoding (that is, *two* 64-bit
vectors per number).

You already have an early draft of the IDU core -- search your mailing list
archive for the message id <20010302162502.24372@thrai.stud.uni-hannover.de>
or the subject "Divider News" (or look for the "Blah2" entity, probably
in a file named blah2.vhdl).

> if there is a painful problem, maybe we can go the route of half-emulation,
> like with the emulation of multiplies with a sequence of partial
> instructions ?

Dividers don't scale up, like multipliers and adders do.

I have to think that over; maybe there's a better (and smaller) solution;
I really don't want to waste so many transistors for an operation that
is rarely used.

[...]
> > and I'd rather add a `tiny float' data type (1+4+11 bits)
> ?
> 
> > for low-precision operations than implement 40-bit FP (which is IMHO
> > rather useless, and a waste of space).
> are tiny floats really useful ?

In some cases, yes (especially if you can process 4 of them in a row ;).
Like audio/video processing where integers introduce too much rounding
errors but 32-bit floats would be overkill.  Whenever you need an FP
type with 3-digit precision and values close to zero (-100...100, or
-10000...10000 if we use 1+5+10 form).

> > > //  There is no floating square root instruction.
> > > We intend to provide "seed" generation for accelerating Newton-Raphson
> > > computations of divide and SQRT.
> > In the FC0, we can probably get away with some bit-shuffling (hardwired).
> > For sqrt(x), divide the exponent by 2 (right shift); for 1/x, invert it.
> 
> believe me, if this was so simples, others would do it already.
> The problems is that this trick "only" does not provide enough precision.
> 
> The strategy in most computers is not to run the NR iterations inside
> a loop, but to emit a fixed number of unrolled instructions.
> The "seed" must be constructed in such a way that when the specified number
> of iteration is executed, we get at least a certain precision/acuracy.
> 
> For example, imagine we have a table that yields a "seed" for 8 acurate
> bits (in addition to a correct exponenent). A single run will give 16 acurate
> bits then another will give 32 bits of precision. if you work with IEEE floats,
> the compiler will generate 2 optimized unrolled iteration bodies.
> If you work with IEEE long reals, it will generate 3 iterations.
> 
> If you "only" do an exponent adjustment,
>  1) you won't get a precise garanteed minimum acuracy of the approximation

Of course I will.

Let x be 1.m * 2**e (with 1.0 <= 1.m < 2.0)
Then y = 1/x is 1/1.m * 2**(-e), with 0.5 < 1/1.m <= 1.0
Let y0 = 0.75 * 2**(-e) be the first approximation
Then y - y0 becomes (1/1.m - 0.75) * 2**(-e),
and the maximum error is |y - y0| <= 0.25 * 2**(-e)

Q.E.D.

(I agree that it's not a *good* approximation ;)

>  2) you will thus have to put one NR iteration in a while(){} loop
>     and if you work with SIMD float, the slow convergence of one number
>     will keep  the whole SIMD packet from being used even though other
>     numbers are "ready" (converged).

That problem won't go away anyway.

> of course, if you don't mind, you can "forget" or "simplify" the seed LUT.
> but the program/compiler must be adapted in consequence.

With simple approximation (y0 = 0.75 * 2**(-exp(b))), it takes ~6
iterations to produce a `double' quotient.  In order to reduce that
to ~3 iterations, we need to consider ~6 bits of `b' (that means we
need a lookup-table with at least 2**6 = 64 entries).  4 iterations
cost ~2**3 entries, 2 iterations ~2**13 entries (which is too much).
See the output of the attached C program (I don't have numbers for sqrt
approximation, sorry).

I think 4 iterations is good enough, 3 is already too expensive.

[...]
> > > I am still wondering if PALcode is such a good idea.
> > > We're used for a long time to rewrite the trap handlers for each new computer.
> > > Maybe this idea came from the VMS transition constraint, but there is
> > > no need of this stuff in the F-CPU.
> > We can implement a `PALcall' SR if we have to.
> and what would we do with it ? :-)

Play soccer? ;)

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"
/*
 * Copyright (C) 2001 Michael Riepe <michael@stud.uni-hannover.de>
 * All Rights Reserved.
 */

#include <stdio.h>

#define MAX_ITERATIONS	8

typedef unsigned long long ullong;

/* type conversion unit */
union blah {
	double d;
	ullong i;
};

/* `double' characteristics */
#define S_SHIFT		63
#define S_MASK		(1ull << S_SHIFT)
#define E_SHIFT		52
#define E_MASK		((1ull << S_SHIFT) - (1ull << E_SHIFT))
#define M_SHIFT		0
#define M_MASK		((1ull << E_SHIFT) - (1ull << M_SHIFT))
#define BIAS		0x3ff

double epsilon = 0.0;		/* accuracy */
int approx_bits = 0;	/* approximation quality */

double
divide(double a, double b, unsigned *iter) {
	union blah blah;
	double x, y;
	int asign, bsign;
	short aexp, bexp;
	ullong am, bm;
	unsigned n;

	/* decompose a */
	blah = a;
	asign = blah.i >> S_SHIFT;
	aexp = (blah.i >> E_SHIFT) & 0x7ff;
	am = blah.i & M_MASK;
	aexp -= BIAS;

	/* decompose b */
	blah = b;
	bsign = blah.i >> S_SHIFT;
	bexp = (blah.i >> E_SHIFT) & 0x7ff;
	bm = blah.i & M_MASK;
	bexp -= BIAS;

	/* normalize */
	asign ^= bsign;
	aexp -= bexp;	/* over/underflow checking omitted */
	bsign = 0;
	bexp = 0;		/* that is, 1.0 <= b < 2.0 */

	/* restore a and b */
	blah.i = am | ((ullong)(aexp + BIAS) << E_SHIFT) | ((ullong)asign << S_SHIFT);
	a = blah.d;
	blah.i = bm | ((ullong)(bexp + BIAS) << E_SHIFT) | ((ullong)bsign << S_SHIFT);
	b = blah.d;

	/* first approximation (i.e. the F-CPU `fiaprx' instruction) */
	blah.i = bm & ((1ull << E_SHIFT) - (1ull << (E_SHIFT-approx_bits)));
	blah.i |= 1ull << (E_SHIFT-approx_bits-1);
	blah.i |= (ullong)BIAS << E_SHIFT;
	x = 1.0 / blah.d;

	/* iteration */
	for (n = 0; n < MAX_ITERATIONS; n++) {
		y = b * x;
		if (1.0 - epsilon <= y && y <= 1.0 + epsilon) {
			/* finished! */
			break;
		}
		x *= (2.0 - y);
	}
	iter && (*iter = n);

	return a * x;
}

int
main(int argc, char **argv) {
	union blah blah;
	unsigned n, maxn;
	double q;
	int a, b;

	/* epsilon = 2**(-52) */
	blah.i = (ullong)(BIAS - 52) << E_SHIFT;
	epsilon = blah.d;

	for (approx_bits = 0; approx_bits <= 16; approx_bits++) {
		maxn = 0;
		a = 1 << 20;	/* some reasonable number range */
		for (b = a; b < (a << 1); b++) {
			q = divide((double)a, (double)b, &n); if (maxn < n) maxn = n;
			/*
			printf("%d / %d = %f after %u iterations\n", a, b, q, n);
			*/
		}
		/* result */
		printf("approx_bits = %2d -> %u iterations max\n", approx_bits, maxn);
	}
	exit(0);
}