
Re: [f-cpu] fadder forever



Hello Michael!
Sorry for keeping you working at 4am!

Michael Riepe wrote:

Here we go!

First, let me make a general remark. I found it much easier to build a
working circuit without pipelining first, and add the registers later.
In particular, it's much easier to move things around if you don't
have to deal with registers, and the resulting entity will be easier
to test.

It's also easier to work with larger building blocks -- like a complete
adder or shifter -- during the design phase, instead of fiddling with
hundreds or thousands of signals, gates and flip-flops. And it makes
the code more readable if you extract complex or often used parts
and put them into a procedure or function. See below for an example.

Ok, let's look at the code...

On Thu, Feb 19, 2004 at 11:30:36PM +0100, gaetan@xeberon.net wrote:
[...]

so here is the first stage:

[...]




-- input signals as single floats
variable Fa : std_ulogic_vector(CHUNK_SIZE-1 downto 0);
variable Fb : std_ulogic_vector(CHUNK_SIZE-1 downto 0);

variable gm00, gm01, pm00, pm01 : std_ulogic_vector(SGL_E_SIZE-1 downto 0);
variable sv00, sv01, tv00, tv01 : std_ulogic_vector(SGL_E_SIZE-1 downto 0);
variable gm10, gm11, pm10, pm11 : std_ulogic_vector(DBL_E_SIZE-1 downto 0);
variable sv10, sv11, tv10, tv11 : std_ulogic_vector(DBL_E_SIZE-1 downto 0);

variable Ea0, Eb0 : std_ulogic_vector(SGL_E_SIZE-1 downto 0);
variable Ea1, Eb1 : std_ulogic_vector(DBL_E_SIZE-1 downto 0);
variable EffSubDBL, EffSubSGL : std_ulogic;
variable Substract : std_ulogic;

-- only for representation detection, there is 3 datapaths:
-- see operands as DOUBLE
-- as 2 SINGLES
variable Fa_raE_DBL, Fa_roM_DBL, Fa_roE_DBL,
Fb_raE_DBL, Fb_roM_DBL, Fb_roE_DBL : std_ulogic;
variable Fa_raE_hSGL, Fa_roM_hSGL, Fa_roE_hSGL,
Fb_raE_hSGL, Fb_roM_hSGL, Fb_roE_hSGL : std_ulogic;
variable Fa_raE_lSGL, Fa_roM_lSGL, Fa_roE_lSGL,
Fb_raE_lSGL, Fb_roM_lSGL, Fb_roE_lSGL : std_ulogic;
variable OpSize : std_ulogic;

begin
if (Rst = '1') then
r2_Enable <= '0';
else
if (rising_edge(Clk)) then
if (r1_Enable = '1') then
for i in 0 to BLOCK64_NBR-1 loop -- for each 64bit chunk

Fa(CHUNK_SIZE-1 downto 0) := r1_Fa((i+1)*CHUNK_SIZE-1 downto i*CHUNK_SIZE);
Fb(CHUNK_SIZE-1 downto 0) := r1_Fb((i+1)*CHUNK_SIZE-1 downto i*CHUNK_SIZE);
OpSize := r1_SuperMode(Super_Mode_OpSize0);
-- note : r1_SuperMode(Super_Mode_OpSize1) = 0 assumed
Substract := r1_SuperMode(Super_Mode_Sub);
-- (d=0)
-- TODO : enable/disable chunk following SIMD_flag


Don't bother. Just operate on all chunks in parallel.


ok
I thought it would make unneeded transistors switch, so I just wanted to disable the unused part...

-- (d=1)
if (OpSize = '1') then -- SIMD mode = SINGLE

Actually, "00" is single and "01" is double. Therefore, OpSize = '1'
means "double".


indeed, that's true..

Ea1(DBL_SIZE-1 downto SGL_SIZE) := (others => '0'); -- d=0
Eb1(DBL_SIZE-1 downto SGL_SIZE) := (others => '0'); -- d=0

That will probably not work (not all tools grok it). Better use
something like

Ea1(DBL_SIZE-1 downto SGL_SIZE) := (DBL_SIZE-1 downto SGL_SIZE => '0');

instead. Alternatively, preset the variables and then replace the
appropriate slice:

Ea1 := (others => '0');
Ea1(SGL_E_SIZE-1 downto 0) := Fa(SGL_E_END + SGL_SIZE
downto SGL_E_START + SGL_SIZE);
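
The two portable forms can be compared side by side in a small self-checking sketch. The entity name `slice_clear_demo` and the widths `W` and `LO` are hypothetical and chosen only for illustration; they are not the F-CPU constants.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity slice_clear_demo is
end slice_clear_demo;

architecture sim of slice_clear_demo is
begin
  process
    -- hypothetical widths, for illustration only
    constant W  : natural := 11;
    constant LO : natural := 8;
    variable v1, v2 : std_ulogic_vector(W-1 downto 0);
  begin
    -- form 1: the aggregate names the slice range explicitly
    v1 := (others => '1');
    v1(W-1 downto LO) := (W-1 downto LO => '0');

    -- form 2: preset the whole variable, then overwrite the low slice
    v2 := (others => '0');
    v2(LO-1 downto 0) := (LO-1 downto 0 => '1');

    assert v1 = v2
      report "the two forms disagree" severity failure;
    assert v1 = "00011111111"
      report "unexpected value" severity failure;
    wait;
  end process;
end sim;
```

Both forms avoid `others` in an aggregate whose target is a slice, which is the construct that may not be accepted everywhere.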

I don't understand why

Ea1 := (others => '0');
should work but not Ea1(DBL_SIZE-1 downto SGL_SIZE) := (others => '0').
Isn't that in the VHDL standard?


else -- SIMD mode = double
-- (d=1)
Ea0(SGL_E_SIZE-1 downto 0) := (others => '0'); -- d=0
Eb0(SGL_E_SIZE-1 downto 0) := (others => '0'); -- d=0

Could be set to (others => 'X') since we don't care about it.


Won't the 'X' state be rejected by the synthesizer?

end if;
-- (d=3)

It may be cheaper to left-align the "single" values in Ea1/Eb1.
In your version, the wires to the MUXes cross:

Ea1 := (others => '0');
Eb1 := (others => '0');
if mode = double then
Ea1(10 downto 0) := Fa(62 downto 52);
Eb1(10 downto 0) := not Fb(62 downto 52);
else
Ea1(7 downto 0) := Fa(62 downto 55);
Eb1(7 downto 0) := not Fb(62 downto 55);
end if;

With left alignment, you get:

Ea1 := (others => '0');
Eb1 := (others => '1'); -- beware!
if mode = double then
Ea1(10 downto 0) := Fa(62 downto 52);
Eb1(10 downto 0) := not Fb(62 downto 52);
else
Ea1(10 downto 3) := Fa(62 downto 55);
Eb1(10 downto 3) := not Fb(62 downto 55);
end if;

or even cheaper:

Ea1 := (others => '0');
Eb1 := (others => '1'); -- beware!
Ea1(10 downto 3) := Fa(62 downto 55);
Eb1(10 downto 3) := not Fb(62 downto 55);
if mode = double then
Ea1(2 downto 0) := Fa(54 downto 52);
Eb1(2 downto 0) := not Fb(54 downto 52);
end if;

That is, the MUX can be made smaller.
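
The cheapest variant can be wrapped into a pair of functions, as a rough sketch. The package name `expt_align_sketch`, the subtype `expt_t`, the function names and the `is_double` flag are all hypothetical; the bit positions are the ones used in the fragments above (high single or double in a 64-bit chunk).

```vhdl
library ieee;
use ieee.std_logic_1164.all;

package expt_align_sketch is
  subtype expt_t is std_ulogic_vector(10 downto 0);
  -- exponent of operand A, left-aligned in 11 bits
  function align_a (F : std_ulogic_vector(63 downto 0);
                    is_double : std_ulogic) return expt_t;
  -- inverted exponent of operand B, left-aligned in 11 bits
  function align_b_inv (F : std_ulogic_vector(63 downto 0);
                        is_double : std_ulogic) return expt_t;
end expt_align_sketch;

package body expt_align_sketch is
  function align_a (F : std_ulogic_vector(63 downto 0);
                    is_double : std_ulogic) return expt_t is
    variable E : expt_t := (others => '0');
  begin
    E(10 downto 3) := F(62 downto 55);    -- common to both formats, no MUX
    if is_double = '1' then
      E(2 downto 0) := F(54 downto 52);   -- only these 3 bits are muxed
    end if;
    return E;
  end align_a;

  function align_b_inv (F : std_ulogic_vector(63 downto 0);
                        is_double : std_ulogic) return expt_t is
    -- padding is '1' because it is the inverse of the zero padding
    variable E : expt_t := (others => '1');
  begin
    E(10 downto 3) := not F(62 downto 55);
    if is_double = '1' then
      E(2 downto 0) := not F(54 downto 52);
    end if;
    return E;
  end align_b_inv;
end expt_align_sketch;

library ieee;
use ieee.std_logic_1164.all;
use work.expt_align_sketch.all;

entity expt_align_tb is
end expt_align_tb;

architecture sim of expt_align_tb is
begin
  process
    variable F : std_ulogic_vector(63 downto 0) := (others => '0');
  begin
    F(62 downto 52) := "10000000101";
    assert align_a(F, '1')     = "10000000101" severity failure;
    assert align_a(F, '0')     = "10000000000" severity failure;
    assert align_b_inv(F, '1') = "01111111010" severity failure;
    assert align_b_inv(F, '0') = "01111111111" severity failure;
    wait;
  end process;
end sim;
```

In single mode the three low bits are padding: zeros for `Ea1` and ones for the inverted `Eb1`, which is why the preset values differ.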


I thought the synthesizer would be smart enough to do it by itself.

-- First part 11-bits adders ('DOUBLE or highest single' datapath)
-- (d=3)
-- computing Ea-Eb
fasu_ExpAdder_PartOne(Ea1, Eb1, gm10, pm10, sv10, tv10); -- d=3
-- computing Eb-Ea
fasu_ExpAdder_PartOne(Eb1, Ea1, gm11, pm11, sv11, tv11); -- d=3
-- (d=6)

Not necessary to calculate the difference both ways. Remember that
CSAdd is a compound adder. You get `Ea - Eb' at the incremented output
and `not (Eb - Ea)' at the normal output. Thus, calculating `Eb - Ea'
requires only a single row of inverters, not a complete second adder.
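
The arithmetic behind this can be checked with a small behavioral model. This is only a sketch using `ieee.numeric_std`, not the F-CPU CSAdd itself; the entity name `compound_diff_demo` and the width `N` are hypothetical. The "normal" output is modelled as `a + (not b)` and the "incremented" output as that value plus one.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity compound_diff_demo is
end compound_diff_demo;

architecture sim of compound_diff_demo is
begin
  process
    -- small width so the sweep is exhaustive;
    -- the identity does not depend on N
    constant N : natural := 6;
    variable a, b, s0, s1 : unsigned(N-1 downto 0);
  begin
    for ia in 0 to 2**N - 1 loop
      for ib in 0 to 2**N - 1 loop
        a  := to_unsigned(ia, N);
        b  := to_unsigned(ib, N);
        s0 := a + (not b);   -- normal output:      a - b - 1 (mod 2**N)
        s1 := s0 + 1;        -- incremented output: a - b     (mod 2**N)
        assert s1 = a - b
          report "incremented output is not a - b" severity failure;
        assert (not s0) = b - a
          report "inverted normal output is not b - a" severity failure;
      end loop;
    end loop;
    wait;
  end process;
end sim;
```

Since `a + (not b)` equals `a - b - 1` modulo 2**N, inverting it gives `b - a`, so one adder plus a row of inverters yields both differences.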


Yes, that's true... it adds a 1-gate delay.
It causes a small problem in stage 3:
I had a 4-bit shifter (a shifter that shifts a vector according to another 4-bit 'driver' vector, so
it can shift by 0 to 2^4 positions). So d=4, t=4.
I also had a conditional bit inverter (d=2), and it fitted into the stage.
But now I have a 5 bit adder, so there is not enough room for the inverter...
Am I allowed to violate the 6-gate delay?
I have 2 solutions and I don't know how to choose between them:
- I can put the whole conditional inverter in the 4th stage,
- or I can precompute the inverted vector in the 3rd stage (so I need an additional register vector).




Fa/Fb should be at d=0, shouldn't they?


If we don't care about chunk selection according to simd_flag, yes.


-- Special value detection, part one: lowest SINGLE datapath
--(d=1)
Fa_roE_lSGL := reduce_or (Fa(SGL_E_End downto SGL_E_Start)); -- d=3
Fa_roM_lSGL := reduce_or (Fa(SGL_M_End downto SGL_M_Start)); -- d=3
Fa_raE_lSGL := reduce_and(Fa(SGL_E_End downto SGL_E_Start)); -- d=3
Fb_roE_lSGL := reduce_or (Fb(SGL_E_End downto SGL_E_Start)); -- d=3
Fb_roM_lSGL := reduce_or (Fb(SGL_M_End downto SGL_M_Start)); -- d=3
Fb_raE_lSGL := reduce_and(Fb(SGL_E_End downto SGL_E_Start)); -- d=3
-- (d=4)

Likewise.

By the way: With the help of some generic functions, this code could
be more readable. Consider

	-- value classes
	subtype FP_CLASS is std_ulogic_vector(2 downto 0);

	-- possible values
	-- Note: "10x" is impossible

constant CLASS_ZERO : FP_CLASS := "000";
constant CLASS_DENORMAL : FP_CLASS := "001";
constant CLASS_NORMAL_1 : FP_CLASS := "010";
constant CLASS_NORMAL_2 : FP_CLASS := "011";
constant CLASS_INF : FP_CLASS := "110";
constant CLASS_NAN : FP_CLASS := "111";

-- number of exponent bits in <format>
function expt_bits (format : natural) return natural is
begin
case format is
when 32 => return 8; -- IEEE single
when 64 => return 11; -- IEEE double
when others => null;
end case;
--pragma synthesis_off
assert false report "unsupported FP format" severity failure;
--pragma synthesis_on
end expt_bits;

-- return a value's class
function classify (X : std_ulogic_vector) return FP_CLASS is
constant L : natural := X'length;
constant E : natural := expt_bits(L);
constant E_high : natural := L-2;
constant E_low : natural := L-E-1;
constant M_high : natural := L-E-2;
constant M_low : natural := 0;
constant xx : std_ulogic_vector(L-1 downto 0) := to_X01(X);
variable yy : FP_CLASS;
begin
yy(0) := reduce_or(xx(M_high downto M_low));
yy(1) := reduce_or(xx(E_high downto E_low));
yy(2) := reduce_and(xx(E_high downto E_low));
return yy;
end classify;

With that, you can do something like

-- declarations:
-- variable Class_a_DBL, Class_a_hSGL, Class_a_lSGL : FP_CLASS;
-- variable Class_b_DBL, Class_b_hSGL, Class_b_lSGL : FP_CLASS;

Class_a_DBL := classify(Fa(63 downto 0));
Class_a_hSGL := classify(Fa(63 downto 32));
Class_a_lSGL := classify(Fa(31 downto 0));
Class_b_DBL := classify(Fb(63 downto 0));
Class_b_hSGL := classify(Fb(63 downto 32));
Class_b_lSGL := classify(Fb(31 downto 0));

and you're done.


-- Effective Subtraction detection for 'DOUBLE and highest SINGLE' datapath
-- (d=1)
EffSubDBL := Fa(DBL_S_POS) xor Fb(DBL_S_POS) xor Substract; -- d=2
-- rem: DBL_S_POS = SGL_S_POS + SGL_SIZE = 63
-- (d=3)
-- Effective Subtraction detection for 'lowest SINGLE' datapath
-- (d=1)
EffSubSGL := Fa(SGL_S_POS) xor Fb(SGL_S_POS) xor Substract; -- d=2
-- (d=3)

Again, d=2


-- storing results to next stage registers
r2_gm00((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= gm00;
r2_gm01((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= gm01;
r2_pm00((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= pm00;
r2_pm01((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= pm01;
r2_sv00((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= sv00;
r2_sv01((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= sv01;
r2_tv00((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= tv00;
r2_tv01((i+1)*SGL_E_SIZE-1 downto i*SGL_E_SIZE) <= tv01;
r2_gm10((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= gm10;
r2_gm11((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= gm11;
r2_pm10((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= pm10;
r2_pm11((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= pm11;
r2_sv10((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= sv10;
r2_sv11((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= sv11;
r2_tv10((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= tv10;
r2_tv11((i+1)*DBL_E_SIZE-1 downto i*DBL_E_SIZE) <= tv11;

If you have the next level's gm/pm, only pm, sv00/10 and tv00/10
are needed in stage 2. If you manage to squeeze another row of XORs
into stage 1 (which is likely, considering the delay calculation),
you can also calculate

ym00 := pm00 xor sv00;
zm00 := pm00 xor tv00;
ym10 := pm10 xor sv10;
zm10 := pm10 xor tv10;

in stage 1 and pass those instead of pm/sv/tv (saves some more
registers).

OK, here I don't understand anymore...
How can I build the results without sv and tv?
Isn't it as if I only had 4-bit adders?


On the other hand, some required signals seem to be
missing (e.g. Fa_raE_DBL & friends, and also OpSize).


yes i added them after sending the mail :)

r2_Fa((i+1)*CHUNK_SIZE-1 downto i*CHUNK_SIZE) <= Fa;
r2_Fb((i+1)*CHUNK_SIZE-1 downto i*CHUNK_SIZE) <= Fb;

r2_EffSubDBL(i) <= EffSubDBL;
r2_EffSubSGL(i) <= EffSubSGL;

end loop;
-- enable stage 2
r2_Enable <= '1';
else
r2_Enable <= '0';
end if;
end if;
end if;
end process;


! thx a lot !

--

~~ Gaetan ~~
http://www.xeberon.net

