[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [f-cpu] DivMod unit
On Sat, Dec 01, 2001 at 04:22:16PM +0100, Cedric BAIL wrote:
> So as Yann say last week, i was writing a divide/modulo unit for the F-CPU.
> I concentrate me on the creation of a component that can do all size of
> division. The principe of it is easy :
> For i form WIDTH-1 to 0
> TmpB := B << i
> TmpResult := substraction(A, TmpB);
> if carry(TmpResult) = true then (test negativ result)
> Yl(i) := 0;
> else
> Yl(i) := 1;
> A := TmpResult;
> endif
> next i;
> Yh := A;
>
> So you have, in the same pass :
> Yl := A / B;
> Yh := A % B;
Yep... standard "restoring" division.
Minor improvement:
tmpB := B shl (WIDTH - 1);
for i in WIDTH-1 downto 0 loop
-- calculate result bit here
tmpB := tmpB shr 1;
end loop;
This avoids the variable-size shift inside the loop (which would take
at least 1 extra clock cycle).
> As you can see all this unit need is a really good substraction function.
> The one that I want to write will normally work for all size of data (if
> it's a power of 4). The problem is that it doesn't work actually for a
> other size than 8 (I have some porblem with index, it will be solved next
> week).
There's a general problem here: when the word size grows, the delay will
increase as well (O(log(n))). A 64-bit divider will need 2 cycles for
each result bit, larger units will be even slower. Another drawback is
that this kind of algorithm doesn't work with signed operands.
> I think, that i will probably rewrite my substraction function. Because
> all the perfomance of this unit depend on it. I read a lot about the way
> to jump over 0, do SRT divide, and I think that for F-CPU this type of
> algorithm are perhaps not a good idea. Because if some case are accelerated
> a lot of others are really slow down. And this type of algorithm are not
> previsible, and the average speed of this unit is slower than this unit.
> It's why i think it's not a good solution to use them.
If operating on 8-bit bytes, the simple approach is faster -- it can
perform an 8-bit division in 8 cycles --, but for larger word sizes,
you need additional cycles for the subtract operation, which makes the
SRT algorithm faster (1 cycle per bit, plus a few cycles for input
preparation -- operand normalization -- and output postprocessing).
> Of course this component is not pipelined. And i am sure that we divide
> sometimes by something that take 64/32/16 bits, so only in really few case
> we need to have a pipelined unit (And in this case they will perhaps write
> a new div unit). But for 8bits it's not really the same problem, we use
> them in all conversion from changing base, in crypto and in some 2d graphical
> filter. In this case, that only represent a really few part of program, we will
> have better result, if we have a pipelined 8bits unit i think. So the question
> is : "Did we want to accelerate this application ?"
Since we can calculate eight 8-bit quotients at a time (SIMD),
pipelining doesn't seem really useful to me.
> Finally were division are needed, it's in game. But actually all this game
> (OpenGl and Direct3D games) use the FPU for all their number. So if we wan't
> a good software implementation of OpenGl, we will need a good FPU (i mean
> pipelined). And that's perhaps an other discussion ;-)
Right, that's an entirely different playground.
--
Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
"All I wanna do is have a little fun before I die"
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu in the body. http://f-cpu.seul.org/