[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [f-cpu] DivMod unit



On Sat, Dec 01, 2001 at 04:22:16PM +0100, Cedric BAIL wrote:
> So as Yann say last week, i was writing a divide/modulo unit for the F-CPU.
> I concentrate me on the creation of a component that can do all size of 
> division. The principe of it is easy :
>  For i form WIDTH-1 to 0
>    TmpB := B << i
>    TmpResult := substraction(A, TmpB);
>    if carry(TmpResult) = true then  (test negativ result)
>       Yl(i) := 0;
>    else
>       Yl(i) := 1;
>       A := TmpResult;
>    endif
>  next i;
>  Yh := A;
> 
> So you have, in the same pass :
> 	Yl := A / B;
> 	Yh := A % B;

Yep... standard "restoring" division.
Minor improvement:

	tmpB := B shl (WIDTH - 1);
	for i in WIDTH-1 downto 0 loop
		-- calculate result bit here
		tmpB := tmpB shr 1;
	end loop;

This avoids the variable-size shift inside the loop (which would take
at least 1 extra clock cycle).

> As you can see all this unit need is a really good substraction function.
> The one that I want to write will normally work for all size of data (if
> it's a power of 4). The problem is that it doesn't work actually for a
> other size than 8 (I have some porblem with index, it will be solved next
> week).

There's a general problem here: when the word size grows, the delay will
increase as well (O(log(n))).  A 64-bit divider will need 2 cycles for
each result bit, larger units will be even slower.  Another drawback is
that this kind of algorithm doesn't work with signed operands.

> I think, that i will probably rewrite my substraction function. Because 
> all the perfomance of this unit depend on it. I read a lot about the way
> to jump over 0, do SRT divide, and I think that for F-CPU this type of
> algorithm are perhaps not a good idea. Because if some case are accelerated
> a lot of others are really slow down. And this type of algorithm are not
> previsible, and the average speed of this unit is slower than this unit.
> It's why i think it's not a good solution to use them.

If operating on 8-bit bytes, the simple approach is faster -- it can
perform an 8-bit division in 8 cycles --, but for larger word sizes,
you need additional cycles for the subtract operation, which makes the
SRT algorithm faster (1 cycle per bit, plus a few cycles for input
preparation -- operand normalization -- and output postprocessing).

> Of course this component is not pipelined. And i am sure that we divide
> sometimes by something that take 64/32/16 bits, so only in really few case
> we need to have a pipelined unit (And in this case they will perhaps write
> a new div unit). But for 8bits it's not really the same problem, we use
> them in all conversion from changing base, in crypto and in some 2d graphical 
> filter. In this case, that only represent a really few part of program, we will
> have better result, if we have a pipelined 8bits unit i think. So the question
> is : "Did we want to accelerate this application ?"

Since we can calculate eight 8-bit quotients at a time (SIMD),
pipelining doesn't seem really useful to me.

> Finally were division are needed, it's in game. But actually all this game
> (OpenGl and Direct3D games) use the FPU for all their number. So if we wan't
> a good software implementation of OpenGl, we will need a good FPU (i mean
> pipelined). And that's perhaps an other discussion ;-)

Right, that's an entirely different playground.

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/