
Re: [f-cpu] SIMD and exception



Hi #3,

gaetan@xeberon.net wrote:
hello
I got some interesting answers about exceptions, but what about SIMD? (This is blocking me for the moment.)
Well, in general, you'll have to do what you'll have to do...

gaetan@xeberon.net wrote:

Hello F-World

Two other questions today :)
1) I think my current way of handling SIMD is not right. For the moment, in each stage, I check whether the operation is done in 32 bits or in 64 bits. The problem is that this adds a MUX to each stage... and it's at least 2 gates deep.
I count MUXes as d=1/t=1 (provided that the select line has less delay than the datapath).

In fact, in each stage I do:
if (OpSize = single) then
make operation op1 with operands as single float
else -- opsize = double
make operation op1' with operands as double
end if;
and the comparison is done on a 2-bit vector (00=32bit, 01=double), so it's huge for each stage...
I guess you can ignore the other values (for now). The decoder should identify them as invalid.

So I am thinking of two other methods:
- select the right path in the first stage (with a decoder, to prevent the unused path from switching transistors), and select the right result in the last stage. The problem is that it will use twice as many components (for instance, I currently use only one 64-bit generic_adder for both SIMD formats, but then I would need one adder for single and another one for double floats). It will also need twice as many registers. The problem here is the area.
Well, some parts have to be duplicated - e.g. the exponent subtractor. You'll also need separate shifters and rounding logic for each chunk.
The generic adder, however, can be split in the middle without increasing its delay, by using custom versions of CLA and CSV (that's what I did in the IAdd unit).
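As a rough illustration of what "split in the middle" means at the behavioural level, here is a minimal Python sketch. It is only a functional model: it says nothing about the CLA/CSV structure used in the IAdd unit, and the names `simd_add` and `split` are invented for this example.

```python
MASK32 = (1 << 32) - 1

def simd_add(a, b, split):
    """Add two 64-bit words.

    split=False: one 64-bit addition.
    split=True:  two independent 32-bit additions (carry out of bit 31 is dropped).
    """
    lo = (a & MASK32) + (b & MASK32)
    # The carry between the two halves is the only mode-dependent signal.
    carry = 0 if split else (lo >> 32)
    hi = ((a >> 32) & MASK32) + ((b >> 32) & MASK32) + carry
    return ((hi & MASK32) << 32) | (lo & MASK32)
```

The point of the sketch is that the mode only gates one carry signal in the middle of the adder, rather than selecting between two complete adders.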

- convert everything to 64 bits and do all operations in 64 bits (I think this is how the Pentium works), BUT the problem is that in the last stage I will have to cut the mantissa and exponent, and this will add some logic, and so increase the number of stages by at least one (especially for the rounding step).
I guess that's not an option.

For the moment, I think everything should fit into 4 cycles, but some problems can occur, so it may take 5 cycles in the end.
So I would like to know what you think about that.


And another thing I don't understand:
I was told that
if ( thing = '1') then
 A := not B;
end if;
has 2 gates of delay (d=2)
If you implement it as a row of inverters followed by a MUX, it's d=1/t=1 for the inverters and d=1/t=1 for the MUX, giving a total of d=2/t=2 (for the datapath).

And according to Yann, a MUX2 is d=2.

But afterwards, Michael told me that the shifter alone is d=LN (for each bit of the N vector, which is LN bits long, there is a MUX)... so it should normally take d=2*LN...
So here is my problem... :(
See above. You can assume that the target technology provides MUXes that are faster than the obvious implementation:

Y := (Sel and A) or (not Sel and B)

which would have d=2/t=2 for the datapath.
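To make the d=LN figure for the shifter concrete, here is a small Python model of a logarithmic shifter built from rows of 2:1 MUXes. It is a sketch under the assumption stated above (each MUX row counts as d=1), and the function name `shift_left` and its return convention are invented for this example.

```python
def shift_left(value, amount, width=64):
    """Logarithmic left shifter; returns (result, number of MUX rows traversed)."""
    mask = (1 << width) - 1
    levels = (width - 1).bit_length()  # LN = log2(width) when width is a power of two
    for i in range(levels):
        shifted = (value << (1 << i)) & mask
        # One row of 2:1 MUXes, selected by bit i of the shift amount.
        value = shifted if (amount >> i) & 1 else value
    return value, levels
```

For a 64-bit operand the data passes through 6 rows, so the datapath depth is d=6 if a MUX is d=1, and d=12 only if each MUX is built from discrete AND/OR gates.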

And is it the same for logic equations?
I mean (A, B, C are std_ulogic, not vectors):
(A and B) or C
d=2 ?
Yes, and t=2. It may be faster when AOI gates are used, but they're not generally available. In an FPGA, this will also be relatively fast (one cell delay).

(A xor B) xor C
d=2?
d=2, but t=4 (if the target technology really uses XORs).

Again, this is relatively fast in an FPGA where most functions with 3 or 4 operands can be implemented in a single cell (or row of cells).

On a full custom chip, a 3-input XOR may be implemented as

Y := (A and B and C)
or (A and not B and not C)
or (not A and B and not C)
or (not A and not B and C);

or even

Y := not (
(not A and not B and not C) or
(not A and B and C) or
(A and not B and C) or
(A and B and not C));

which takes 3 inverters and an and-or-invert (AOI) gate, resulting in d=2/t=2. But on the other hand, you wrote "xor", so the synthesizer may create just that.

I suggest you use work.misc.xor3 (available for both std_ulogic and std_ulogic_vector) if you need a 3-input XOR. It's easier to adapt a single function to the target than to scan the whole F-CPU for stray 3-input XORs.

On the other hand, if you *know* that one of the inputs arrives late, you can as well use

Y := (A xor B) xor C; -- when C arrives late
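The three ways of writing a 3-input XOR given above can be checked against each other exhaustively. The following Python sketch does that over all 8 input combinations; the function names are made up for this check and have nothing to do with work.misc.xor3.

```python
from itertools import product

def xor3_sop(a, b, c):
    # Sum of products: true when an odd number of inputs is true.
    return bool((a and b and c) or (a and not b and not c)
                or (not a and b and not c) or (not a and not b and c))

def xor3_aoi(a, b, c):
    # Inverted sum of the even-parity terms (and-or-invert form).
    return not ((not a and not b and not c) or (not a and b and c)
                or (a and not b and c) or (a and b and not c))

def xor3_chain(a, b, c):
    # (A xor B) xor C, the form to use when C arrives late.
    return (a != b) != c

def all_forms_agree():
    return all(xor3_sop(*v) == xor3_aoi(*v) == xor3_chain(*v)
               for v in product([False, True], repeat=3))
```

This only confirms that the forms compute the same function; which one is fastest still depends on the target technology.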

u = (Sa and not(s) ) or (not(de xor Sa) and s)
is d=3?
Approximately d=3/t=4, but:

	-- d=2/t=3
	if s = '1' then
		-- d=1/t=2
		u := de xnor Sa;
	else
		-- d=0/t=0
		u := Sa;
	end if;

Or, if Sa arrives substantially later than de and s:

	u := Sa xor (s and not de);

where the delay ranges from d=1/t=2 (Sa -> u) to d=3/t=4 (de -> u).
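The original equation for u and the two rewrites can be checked for equivalence in the same way. This is a small Python sketch with invented function names; it compares the three forms on all 8 combinations of Sa, de and s.

```python
from itertools import product

def u_original(sa, de, s):
    # u = (Sa and not s) or (not (de xor Sa) and s)
    return bool((sa and not s) or ((not (de != sa)) and s))

def u_mux(sa, de, s):
    # if s = '1' then u := de xnor Sa; else u := Sa;
    return (de == sa) if s else sa

def u_late_sa(sa, de, s):
    # u := Sa xor (s and not de), for when Sa arrives late
    return sa != (s and not de)

def u_forms_agree():
    return all(u_original(*v) == u_mux(*v) == u_late_sa(*v)
               for v in product([False, True], repeat=3))
```

All three describe the same function, so the choice between them is purely a question of which input is on the critical path.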

Just try to keep the balance ;-)
Michael.
