```Here we go!

First, let me make a general remark.  I found it much easier to build a
working circuit without pipelining first, and add the registers later.
In particular, it's much easier to move things around if you don't
have to deal with registers, and the resulting entity will be easier
to test.

It's also easier to work with larger building blocks -- like a complete
adder or shifter -- during the design phase, instead of fiddling with
hundreds or thousands of signals, gates and flip-flops.  The code also
becomes more readable if you extract complex or frequently used parts
into a procedure or function.  See below for an example.

Ok, let's look at the code...

On Thu, Feb 19, 2004 at 11:30:36PM +0100, gaetan@xeberon.net wrote:
[...]
> so here is the first stage:
[...]
>   -- stage 1 : start DE calculation and // trivial case checks
>   stage_1 : process (Clk, Rst)
>
>         -- first part of a integer adder
>         -- deliberately inspirated by M. Riepe's CSAdd function
>         -- estimated delay : d=3, t=4
>       procedure fasu_ExpAdder_PartOne(a, b : in std_ulogic_vector;
>                                     g, p, s, t : out std_ulogic_vector) is
>       constant L : natural := a'length;
>       variable aa, bb : std_ulogic_vector(L-1 downto 0);
>       variable sv, tv : std_ulogic_vector(L-1 downto 0);
>       variable gm, pm : std_ulogic_vector(L-1 downto 0);
>     begin
>       assert a'length = L;
>       assert b'length = L;
>       assert g'length = L;
>       assert p'length = L;
>       assert s'length = L;
>       assert t'length = L;

Wrap the assertions in synthesis pragmas:

--pragma synthesis_off
assert blah;
--pragma synthesis_on

Otherwise the synthesizer will barf.

>       -- normalize inputs
>       aa := A;
>       bb := B;
>
>       -- a row of 4-bit adders
>       -- (d=0)
>       gm := aa and bb; -- d=1, t=1
>       pm := aa xor bb; -- d=1, t=2
>       -- ( d=1 t=2 )
>
>       CSV(gm, pm, sv, tv); -- d=2, t=2
>       -- (d=3, t=4)

I'm missing a call to CLA here.  Don't you need the 4-bit
carry/propagate outputs?  They can be calculated in parallel with
sv/tv, and in the same time.

>       g := gm;
>       p := pm;
>       s := sv;
>       t := tv;
>
>     end;
>
>
>
>
>
>     -- input signals as single floats
>     variable Fa      : std_ulogic_vector(CHUNK_SIZE-1 downto 0);
>     variable Fb      : std_ulogic_vector(CHUNK_SIZE-1 downto 0);
>
>      variable gm00, gm01, pm00, pm01 : std_ulogic_vector(SGL_E_SIZE-1
> downto 0);
>     variable sv00, sv01, tv00, tv01 : std_ulogic_vector(SGL_E_SIZE-1
> downto 0);
>
>      variable gm10, gm11, pm10, pm11 : std_ulogic_vector(DBL_E_SIZE-1
> downto 0);
>     variable sv10, sv11, tv10, tv11 : std_ulogic_vector(DBL_E_SIZE-1
> downto 0);
>
>     variable Ea0, Eb0 : std_ulogic_vector(SGL_E_SIZE-1 downto 0);
>     variable Ea1, Eb1 : std_ulogic_vector(DBL_E_SIZE-1 downto 0);
>
>     variable EffSubDBL, EffSubSGL : std_ulogic;
>
>     variable Substract : std_ulogic;
>
>     -- only for representation detection, there is 3 datapaths:
>     -- see operands as DOUBLE
>     --              as 2 SINGLES
>     variable Fa_raE_DBL, Fa_roM_DBL, Fa_roE_DBL,
>              Fb_raE_DBL, Fb_roM_DBL, Fb_roE_DBL  : std_ulogic;
>     variable Fa_raE_hSGL, Fa_roM_hSGL, Fa_roE_hSGL,
>              Fb_raE_hSGL, Fb_roM_hSGL, Fb_roE_hSGL  : std_ulogic;
>     variable Fa_raE_lSGL, Fa_roM_lSGL, Fa_roE_lSGL,
>              Fb_raE_lSGL, Fb_roM_lSGL, Fb_roE_lSGL  : std_ulogic;
>
>     variable OpSize : std_ulogic;
>
>   begin
>     if (Rst = '1') then
>       r2_Enable <= '0';
>     else
>        if (rising_edge(Clk)) then
>         if (r1_Enable = '1') then
>
>
>           for i in 0 to BLOCK64_NBR-1 loop -- for each 64bit chunk
>
>
>             Fa(CHUNK_SIZE-1 downto 0)    := r1_Fa((i+1)*CHUNK_SIZE-1
> downto i*CHUNK_SIZE);
>             Fb(CHUNK_SIZE-1 downto 0)    := r1_Fb((i+1)*CHUNK_SIZE-1
> downto i*CHUNK_SIZE);
>
>             OpSize := r1_SuperMode(Super_Mode_OpSize0);
>             -- note : r1_SuperMode(Super_Mode_OpSize1) = 0 assumed
>
>             Substract := r1_SuperMode(Super_Mode_Sub);
>
>             -- (d=0)
>             -- TODO : enable/disable chunk following SIMD_flag

Don't bother.  Just operate on all chunks in parallel.

>             -- (d=1)
>
>             if (OpSize = '1') then -- SIMD mode = SINGLE

Actually, "00" is single and "01" is double.  Therefore, OpSize = '1'
means "double".

>               -- First part Exponent Substraction
>               -- (d=2)
>               Ea0(SGL_E_SIZE-1 downto 0) := Fa(SGL_E_END downto
> SGL_E_START); -- d=1
>               Eb0(SGL_E_SIZE-1 downto 0) := not(Fb(SGL_E_END downto
> SGL_E_START)); -- d=2
>
>               -- (d=2)
>               Ea1(SGL_E_SIZE-1 downto 0) := Fa(SGL_E_END + SGL_SIZE
> downto SGL_E_START + SGL_SIZE); -- d=1
>               Ea1(SGL_E_SIZE-1 downto 0) := not (Fb(SGL_E_END + SGL_SIZE
> downto SGL_E_START + SGL_SIZE)); -- d=2

The second assignment should go to Eb1, not Ea1.

>               Ea1(DBL_SIZE-1 downto SGL_SIZE) := (others => '0'); -- d=0
>               Eb1(DBL_SIZE-1 downto SGL_SIZE) := (others => '0'); -- d=0

That will probably not work: not all tools grok an `others' aggregate
assigned to a slice.  Better use something like

Ea1(DBL_SIZE-1 downto SGL_SIZE) := (DBL_SIZE-1 downto SGL_SIZE => '0');

instead.  Alternatively, preset the variables and then replace the
appropriate slice:

Ea1 := (others => '0');
Ea1(SGL_E_SIZE-1 downto 0) := Fa(SGL_E_END + SGL_SIZE
downto SGL_E_START + SGL_SIZE);

>             else -- SIMD mode = double
>               -- (d=1)
>               Ea0(SGL_E_SIZE-1 downto 0) := (others => '0'); -- d=0
>               Eb0(SGL_E_SIZE-1 downto 0) := (others => '0'); -- d=0

Could be set to (others => 'X') since we don't care about it.

>               -- (d=2)
>               Ea1(DBL_E_SIZE-1 downto 0) := Fa(DBL_E_END downto
> DBL_E_START); -- d=1
>               Ea1(DBL_E_SIZE-1 downto 0) := not (Fb(DBL_E_END downto
> DBL_E_START)); -- d=2

Eb1, again.

>             end if;
>             -- (d=3)

It may be cheaper to left-align the "single" values in Ea1/Eb1.
In your version, the wires to the MUXes cross:

Ea1 := (others => '0');
Eb1 := (others => '0');
if mode = double then
    Ea1(10 downto 0) := Fa(62 downto 52);
    Eb1(10 downto 0) := not Fb(62 downto 52);
else
    Ea1(7 downto 0) := Fa(62 downto 55);
    Eb1(7 downto 0) := not Fb(62 downto 55);
end if;

With left alignment, you get:

Ea1 := (others => '0');
Eb1 := (others => '1');  -- beware!
if mode = double then
    Ea1(10 downto 0) := Fa(62 downto 52);
    Eb1(10 downto 0) := not Fb(62 downto 52);
else
    Ea1(10 downto 3) := Fa(62 downto 55);
    Eb1(10 downto 3) := not Fb(62 downto 55);
end if;

or even cheaper:

Ea1 := (others => '0');
Eb1 := (others => '1');  -- beware!
Ea1(10 downto 3) := Fa(62 downto 55);
Eb1(10 downto 3) := not Fb(62 downto 55);
if mode = double then
    Ea1(2 downto 0) := Fa(54 downto 52);
    Eb1(2 downto 0) := not Fb(54 downto 52);
end if;

That is, the MUX can be made smaller.

>               -- First part 11-bits adders ('DOUBLE or highest single'
> datapath)
>               -- (d=3)
>               -- computing Ea-Eb
>               fasu_ExpAdder_PartOne(Ea1, Eb1, gm10, pm10, sv10, tv10);
> -- d=3
>               -- computing Eb-Ea
>               fasu_ExpAdder_PartOne(Eb1, Ea1, gm11, pm11, sv11, tv11);
> -- d=3
>               -- (d=6)

Not necessary to calculate the difference both ways.  Remember that
CSAdd is a compound adder.  You get `Ea - Eb' at the incremented output
and `not (Eb - Ea)' at the normal output.  Thus, calculating `Eb - Ea'
requires only a single row of inverters, not a complete second adder.
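
If you want to convince yourself, here is a minimal simulation-only
sketch that checks both identities exhaustively.  The entity name
compound_check and the 4-bit width are made up for illustration, and
it uses plain numeric_std arithmetic instead of CSAdd:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity compound_check is
end compound_check;

architecture sim of compound_check is
begin
    process
        constant W : natural := 4;  -- small width, so all pairs can be tried
        variable a, b : unsigned(W-1 downto 0);
        variable sum0, sum1 : unsigned(W-1 downto 0);
    begin
        for i in 0 to 2**W-1 loop
            for j in 0 to 2**W-1 loop
                a := to_unsigned(i, W);
                b := to_unsigned(j, W);
                sum0 := a + (not b);      -- normal output
                sum1 := a + (not b) + 1;  -- incremented output
                -- incremented output is a - b (modulo 2**W)
                assert sum1 = (a - b)
                    report "a - b mismatch" severity failure;
                -- inverted normal output is b - a (modulo 2**W)
                assert (not sum0) = (b - a)
                    report "b - a mismatch" severity failure;
            end loop;
        end loop;
        wait;  -- done, stop the process
    end process;
end sim;

The reason is simply that `a + not b' equals `a - b - 1', and
inverting a two's complement value x gives `-x - 1'.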

>
>               -- First part 8-bits adders ('lowest single' datapath)
>               -- (d=3)
>               -- computing Ea-Eb
>               fasu_ExpAdder_PartOne(Ea0, Eb0, gm00, pm00, sv00, tv00);
> -- d=3
>               -- computing Eb-Ea
>               fasu_ExpAdder_PartOne(Eb0, Ea0, gm01, pm01, sv01, tv01);
> -- d=3
>               -- (d=6)

Same here.

>               -- First part 8-bits adders ('lowest single' datapath)
>               -- (d=3)
>               -- computing Ea-Eb
>               fasu_ExpAdder_PartOne(Ea1, Eb1, gm10, pm10, sv10, tv10);
> -- d=3
>               -- computing Eb-Ea
>               fasu_ExpAdder_PartOne(Eb1, Ea1, gm11, pm11, sv11, tv11);
> -- d=3
>               -- (d=6)

These two are duplicates (already calculated above).

>               -- Special value detection, part one: DOUBLE datapath
>               -- (!!ONLY DOUBLE, NOT 'DOUBLE or highest SINGLE datapath !!)
>               --(d=1)

Fa/Fb should be at d=0, shouldn't they?

>               Fa_roE_DBL := reduce_or (Fa(DBL_E_End downto
> DBL_E_Start)); -- d=3
>               Fa_roM_DBL := reduce_or (Fa(DBL_M_End downto
> DBL_M_Start)); -- d=3
>               Fa_raE_DBL := reduce_and(Fa(DBL_E_End downto
> DBL_E_Start)); -- d=3
>               Fb_roE_DBL := reduce_or (Fb(DBL_E_End downto
> DBL_E_Start)); -- d=3
>               Fb_roM_DBL := reduce_or (Fb(DBL_M_End downto
> DBL_M_Start)); -- d=3
>               Fb_raE_DBL := reduce_and(Fb(DBL_E_End downto
> DBL_E_Start)); -- d=3
>               -- (d=4)

And d=3 here.

>               -- Special value detection, part one: highest SINGLE datapath
>               --(d=1)
>               Fa_roE_hSGL := reduce_or (Fa(32+SGL_E_End downto
> 32+SGL_E_Start)); -- d=3
>               Fa_roM_hSGL := reduce_or (Fa(32+SGL_M_End downto
> 32+SGL_M_Start)); -- d=3
>               Fa_raE_hSGL := reduce_and(Fa(32+SGL_E_End downto
> 32+SGL_E_Start)); -- d=3
>               Fb_roE_hSGL := reduce_or (Fb(32+SGL_E_End downto
> 32+SGL_E_Start)); -- d=3
>               Fb_roM_hSGL := reduce_or (Fb(32+SGL_M_End downto
> 32+SGL_M_Start)); -- d=3
>               Fb_raE_hSGL := reduce_and(Fb(32+SGL_E_End downto
> 32+SGL_E_Start)); -- d=3
>               -- (d=4)

Likewise.

>               -- Special value detection, part one: lowest SINGLE datapath
>               --(d=1)
>               Fa_roE_lSGL := reduce_or (Fa(SGL_E_End downto
> SGL_E_Start)); -- d=3
>               Fa_roM_lSGL := reduce_or (Fa(SGL_M_End downto
> SGL_M_Start)); -- d=3
>               Fa_raE_lSGL := reduce_and(Fa(SGL_E_End downto
> SGL_E_Start)); -- d=3
>               Fb_roE_lSGL := reduce_or (Fb(SGL_E_End downto
> SGL_E_Start)); -- d=3
>               Fb_roM_lSGL := reduce_or (Fb(SGL_M_End downto
> SGL_M_Start)); -- d=3
>               Fb_raE_lSGL := reduce_and(Fb(SGL_E_End downto
> SGL_E_Start)); -- d=3
>               -- (d=4)

Likewise.

By the way: With the help of some generic functions, this code could
be simplified considerably:

-- value classes
subtype FP_CLASS is std_ulogic_vector(2 downto 0);

-- possible values
-- Note: "10x" is impossible
constant CLASS_ZERO     : FP_CLASS := "000";
constant CLASS_DENORMAL : FP_CLASS := "001";
constant CLASS_NORMAL_1 : FP_CLASS := "010";
constant CLASS_NORMAL_2 : FP_CLASS := "011";
constant CLASS_INF      : FP_CLASS := "110";
constant CLASS_NAN      : FP_CLASS := "111";

-- number of exponent bits in <format>
function expt_bits (format : natural) return natural is
begin
    case format is
        when 32 => return  8;   -- IEEE single
        when 64 => return 11;   -- IEEE double
        when others => null;
    end case;
    --pragma synthesis_off
    assert false report "unsupported FP format" severity failure;
    --pragma synthesis_on
    return 0;   -- dummy value; simulation stops at the assertion above
end expt_bits;

-- return a value's class
function classify (X : std_ulogic_vector) return FP_CLASS is
    constant L : natural := X'length;
    constant E : natural := expt_bits(L);
    constant E_high : natural := L-2;
    constant E_low  : natural := L-E-1;
    constant M_high : natural := L-E-2;
    constant M_low  : natural := 0;
    constant xx : std_ulogic_vector(L-1 downto 0) := to_X01(X);
    variable yy : FP_CLASS;
begin
    yy(0) := reduce_or(xx(M_high downto M_low));
    yy(1) := reduce_or(xx(E_high downto E_low));
    yy(2) := reduce_and(xx(E_high downto E_low));
    return yy;
end classify;

With that, you can do something like

-- declarations:
-- variable Class_a_DBL, Class_a_hSGL, Class_a_lSGL : FP_CLASS;
-- variable Class_b_DBL, Class_b_hSGL, Class_b_lSGL : FP_CLASS;

Class_a_DBL  := classify(Fa(63 downto  0));
Class_a_hSGL := classify(Fa(63 downto 32));
Class_a_lSGL := classify(Fa(31 downto  0));
Class_b_DBL  := classify(Fb(63 downto  0));
Class_b_hSGL := classify(Fb(63 downto 32));
Class_b_lSGL := classify(Fb(31 downto  0));

and you're done.
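
Decoding a class is then a simple comparison.  A rough sketch, with
hypothetical helper names (is_zero, is_normal and is_nan are not part
of the code above; they only rely on FP_CLASS and the CLASS_*
constants):

-- hypothetical helpers built on FP_CLASS (names are illustrative)
function is_zero (c : FP_CLASS) return boolean is
begin
    return c = CLASS_ZERO;
end is_zero;

-- normal numbers: exponent neither all-zero nor all-one ("01x")
function is_normal (c : FP_CLASS) return boolean is
begin
    return c(2) = '0' and c(1) = '1';
end is_normal;

function is_nan (c : FP_CLASS) return boolean is
begin
    return c = CLASS_NAN;
end is_nan;

For example, `is_nan(Class_a_DBL) or is_nan(Class_b_DBL)' flags a NaN
operand on the double datapath.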

>               -- Effective Subtraction detection for 'DOUBLE and highest
> SINGLE' datapath
>               -- (d=1)
>               EffSubDBL := Fa(DBL_S_POS) xor Fb(DBL_S_POS) xor
> Substract; -- d=2
>               -- rem: DBL_S_POS = SGL_S_POS + SGL_SIZE = 63
>               -- (d=3)
>
>               -- Effective Subtraction detection for 'lowest SINGLE'
> datapath
>               -- (d=1)
>               EffSubSGL := Fa(SGL_S_POS) xor Fb(SGL_S_POS) xor
> Substract; -- d=2
>               -- (d=3)

Again, this should be d=2.

>               -- storing results to next stage registers
>               r2_gm00((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= gm00;
>               r2_gm01((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= gm01;
>               r2_pm00((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= pm00;
>               r2_pm01((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= pm01;
>               r2_sv00((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= sv00;
>               r2_sv01((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= sv01;
>               r2_tv00((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= tv00;
>               r2_tv01((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= tv01;
>
>               r2_gm10((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= gm10;
>               r2_gm11((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= gm11;
>               r2_pm10((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= pm10;
>               r2_pm11((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= pm11;
>               r2_sv10((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= sv10;
>               r2_sv11((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= sv11;
>               r2_tv10((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= tv10;
>               r2_tv11((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= tv11;

If you have the next level's gm/pm, only pm00/10, sv00/10 and tv00/10
are needed in stage 2.  If you manage to squeeze another row of XORs
into stage 1 (which is likely, considering the delay calculation),
you can also calculate

ym00 := pm00 xor sv00;
zm00 := pm00 xor tv00;
ym10 := pm10 xor sv10;
zm10 := pm10 xor tv10;

in stage 1 and pass those instead of pm/sv/tv (saves some more
registers).  On the other hand, some required signals seem to be
missing (e.g. Fa_raE_DBL & friends, and also OpSize).

>               r2_Fa((i+1)*CHUNK_SIZE-1 downto i*CHUNK_SIZE) <= Fa;
>               r2_Fb((i+1)*CHUNK_SIZE-1 downto i*CHUNK_SIZE) <= Fb;
>
>               r2_EffSubDBL(i) <= EffSubDBL;
>               r2_EffSubSGL(i) <= EffSubSGL;
>
>           end loop;
>           -- enable stage 2
>           r2_Enable <= '1';
>         else
>           r2_Enable <= '0';
>         end if;
>       end if;
>     end if;
>   end process;

--
Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
"All I wanna do is have a little fun before I die"

```