
Re: [f-cpu] magnetude comparison



Hi F-gang,

On Mon, Mar 01, 2004 at 01:49:26AM +0100, gaetan@xeberon.net wrote:
[...]
> what do you think about this new version:
> 
>     -- estimated delay
>     --   2 < L <=  4  : d=4/t=5
>     --   5 < L <=  8  : d=5/t=6
>     --   9 < L <= 16  : d=6/t=7
>     --  17 < L <= 32  : d=7/t=8
>     --  33 < L <= 64  : d=8/t=9
>     function compare_vector(a, b : std_ulogic_vector) return std_ulogic is
>       constant L : natural := a'length;
>       constant aa : std_ulogic_vector(L-1 downto 0) := a;
>       constant bb : std_ulogic_vector(L-1 downto 0) := b;
>       variable pp, vv: std_ulogic_vector(L-1 downto 0);
>       variable p, v, swap : std_ulogic;
>       variable step, level, left : natural;
>     begin
> 
>         -- (d=0/t=0)
>         for i in  L-1 downto 0 loop
>             pp(i) := b(i);
>             vv(i) := a(i) xor b(i);
>         end loop;
>        
>         -- (d=1/t=2)
>        
>         for level in 1 to 15 loop
>             step := 2**level;
>             left := L/step;
>             for i in 0 to left-1 loop
>                 if (vv(2*i)='0') then
>                     pp(i) := pp(2*i+1);
>                 else
>                   pp(i) := pp(2*i);
>                 end if;              -- d=1/t=1

Shouldn't this be the other way round?

	if vv(2*i+1) = '1' then
		pp(i) := pp(2*i+1);
	else
		pp(i) := pp(2*i);
	end if;

The most significant bit is on the left.

>                 vv(i) := vv(2*i+1) or vv(2*i); -- d=1/t=1
>             end loop;
>             exit when step >= L;
>             -- cost for each loop : d=1/t=1
>         end loop;
>                
>         swap := pp(0) and vv(0);     -- d=1/t=1
> 
> --        print_vector("a", a);   
> --        print_vector("b", b);   
> --        print_stdval("swap?", swap);   
>         return swap;   
>     end;
> 
> I estimate the delay at around d=7 or d=8 for 32-bit or 64-bit
> operands. It's easy to cut between the first and second stages...

Yep.  I once considered a similar circuit for the EU_CMP unit, too.
I finally dropped it, but I don't remember why :(  But I still have
the source file :)

I still see a minor delay problem.  It's true that I count a MUX
as d=1/t=1 but only for the datapath -- that is, the control signal
should arrive early.  In this circuit, it's always late.  You may get
better latency using 4:1 MUXes (at least in the first two stages):

	for level in 1 to 15 loop
		step := 4**level;
		left := L / step;
		for i in 0 to left-1 loop
			if vv(4*i+3) = '1' then
				pp(i) := pp(4*i+3);
			elsif vv(4*i+2) = '1' then
				pp(i) := pp(4*i+2);
			elsif vv(4*i+1) = '1' then
				pp(i) := pp(4*i+1);
			else
				pp(i) := pp(4*i+0);
			end if;
			vv(i) := vv(4*i+3) or vv(4*i+2) or vv(4*i+1) or vv(4*i+0);
		end loop;
		exit when step >= L;
	end loop;

Each loop counts as d=2/t=2 now (which is realistic), but you'll
need only half the number of loops.  I used this kind of stage in
my alternate EU_CMP version.  In fact, my stage was a little more
complex because it not only extracted the "leading" bit but also
calculated its index and bit mask (for the msb instruction).

The drawback is that a 4:1 stage can't be realized in most FPGAs'
cells because the core element has too many inputs.  On the other
hand, one may extract the 4:1 core and put it in a function:

	-- returns a single bit, since the result is assigned to pp(i)
	function compare4 (pp, vv : in std_ulogic_vector)
		return std_ulogic;

	for level in 1 to 15 loop
		step := 4**level;
		left := L / step;
		for i in 0 to left-1 loop
			pp(i) := compare4(pp(4*i+3 downto 4*i), vv(4*i+3 downto 4*i));
			vv(i) := vv(4*i+3) or vv(4*i+2) or vv(4*i+1) or vv(4*i+0);
		end loop;
		exit when step >= L;
	end loop;

and the function could use 2:1 stages internally if necessary.
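
A minimal sketch of what such a compare4 body might look like when built
from 2:1 stages only.  The package name compare4_sketch and the local
names p, v, p_hi, p_lo and v_hi are made up for illustration, and the
function is assumed to return a single std_ulogic because its result is
assigned to pp(i).  It is meant to select the same bit as the if/elsif
chain of the 4:1 stage, not to be a tuned implementation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical package, for illustration only.
package compare4_sketch is
    -- Returns the pp bit at the most significant position where vv = '1',
    -- or pp bit 0 if no vv bit is set.  Both arguments must be 4 bits wide.
    function compare4 (pp, vv : in std_ulogic_vector) return std_ulogic;
end package compare4_sketch;

package body compare4_sketch is
    function compare4 (pp, vv : in std_ulogic_vector) return std_ulogic is
        -- Normalise the slice bounds to 3 downto 0.
        constant p : std_ulogic_vector(3 downto 0) := pp;
        constant v : std_ulogic_vector(3 downto 0) := vv;
        variable p_hi, p_lo, v_hi : std_ulogic;
    begin
        -- First 2:1 stage, upper pair (bits 3 and 2): bit 3 has priority.
        if v(3) = '1' then
            p_hi := p(3);
        else
            p_hi := p(2);
        end if;
        v_hi := v(3) or v(2);

        -- First 2:1 stage, lower pair (bits 1 and 0): bit 1 has priority.
        if v(1) = '1' then
            p_lo := p(1);
        else
            p_lo := p(0);
        end if;

        -- Second 2:1 stage: the upper pair wins if it saw a difference.
        if v_hi = '1' then
            return p_hi;
        else
            return p_lo;
        end if;
    end function compare4;
end package body compare4_sketch;
```

The two first-stage MUXes work in parallel, so the control signal of the
second stage (v_hi) is still only one OR gate behind the inputs.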

> It's not that I don't want to use your approach with the compound
> adder, but I have a method using a Leading One Predictor (which
> assumes that mantissa A is greater than mantissa B, so the problem
> arises when the exponents are equal; the document I have uses a
> comparator for that case. I propose to put it in the first stage).

The LOP needs its operands in a particular order?
Maybe that can be changed.

> I use a compound adder
> anyway (for rounding, not for normalization).
> With a leading one predictor, it's really fast to get the shift
> count to apply to the mantissa in the normalisation process.

Yep.  Without prediction of the shift count, you would have to
calculate it from the result, which takes at least half a stage.
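
For reference, a behavioural sketch of "calculating it from the result":
the shift count is the number of leading zeros of the sum.  The names
lzc_sketch and leading_zeros are hypothetical, and the loop only
describes the function.  A real implementation would more likely be a
tree, and this sketch says nothing about its delay.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical package, for illustration only.
package lzc_sketch is
    -- Number of '0' bits above the most significant '1' of x;
    -- returns x'length if no bit of x is '1'.
    function leading_zeros (x : std_ulogic_vector) return natural;
end package lzc_sketch;

package body lzc_sketch is
    function leading_zeros (x : std_ulogic_vector) return natural is
        constant L  : natural := x'length;
        -- Normalise the bounds so that the msb is at index L-1.
        constant xx : std_ulogic_vector(L-1 downto 0) := x;
    begin
        -- Scan from the msb down to the lsb.
        for i in L-1 downto 0 loop
            if xx(i) = '1' then
                return L-1 - i;
            end if;
        end loop;
        return L;
    end function leading_zeros;
end package body lzc_sketch;
```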

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"