
Re: [f-cpu] magnitude comparison

Hello F-World!:

Michael Riepe wrote:

Shouldn't this be the other way round?

if vv(2*i+1) = '1' then
  pp(i) := pp(2*i+1);
else
  pp(i) := pp(2*i);
end if;

The most significant bit is on the left.


It doesn't seem to work with:
a := 1001
b := 1100
swap? := 0 -- should swap because b > a

a := 0010
b := 1000
swap? := 0

a := 00100001
b := 10000001
swap? := 0


Strangely, the vectors vv and pp are reversed (so the most significant bit is on the right).
I tested my algorithm with:
Run <= compare_vector("1000","1001");
Run <= compare_vector("1001","0100");
Run <= compare_vector("1001","1100");
Run <= compare_vector("0011","0011");
Run <= compare_vector("1110","1111");
Run <= compare_vector("1000","1000");
Run <= compare_vector("1001","1000");
Run <= compare_vector("0010","1000");
Run <= compare_vector("10010000", "01000000");
Run <= compare_vector("00100000", "00110000");
Run <= compare_vector("11100001", "11100000");
Run <= compare_vector("11100000", "11100001");
Run <= compare_vector("00100001", "10000001");
Run <= compare_vector("0010","1000");
Run <= compare_vector("0000000000100001", "1000000000000001");
Run <= compare_vector("0000000000100001", "0000000000100010");
Run <= compare_vector("0000000000100001", "0000000000100000");
Run <= compare_vector("1000000000100001", "1000000000000000");

I don't think I cover all cases, but it seems to work correctly...

vv(i) := vv(2*i+1) or vv(2*i); -- d=1/t=1
end loop;
exit when step >= L;
-- cost for each loop : d=1/t=1
end loop;
swap := pp(0) and vv(0); -- d=1/t=1

-- print_vector("a", a);
-- print_vector("b", b);
-- print_stdval("swap?", swap);
return swap;
end;
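
For readers who do not have the earlier mails at hand, here is a minimal, self-contained sketch of the 2:1 reduction tree discussed above, wrapped in a self-checking testbench. It is an illustration, not the function from this thread: the first stage (vv := a xor b, pp := b and not a) is an assumption, since it is not part of the quoted fragment, and so are the names cmp_tree_tb and compare_sketch, the index convention (MSB at the highest index, as Michael suggests) and the restriction to two operands of the same power-of-two length.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity cmp_tree_tb is
end entity;

architecture sim of cmp_tree_tb is

  -- Hypothetical helper: returns '1' when b > a (unsigned).
  -- Assumes a'length = b'length and that the length is a power of two.
  function compare_sketch (a, b : std_ulogic_vector) return std_ulogic is
    constant L  : natural := a'length;
    variable aa : std_ulogic_vector(L-1 downto 0) := a;  -- MSB at index L-1
    variable bb : std_ulogic_vector(L-1 downto 0) := b;
    variable vv : std_ulogic_vector(L-1 downto 0);
    variable pp : std_ulogic_vector(L-1 downto 0);
    variable n  : natural := L;
  begin
    -- Assumed first stage: vv marks the bits that differ,
    -- pp marks the bits where b has a '1' and a has a '0'.
    vv := aa xor bb;
    pp := bb and not aa;
    -- 2:1 reduction: keep the candidate from the upper half when it is valid.
    while n > 1 loop
      n := n / 2;
      for i in 0 to n-1 loop
        if vv(2*i+1) = '1' then
          pp(i) := pp(2*i+1);
        else
          pp(i) := pp(2*i);
        end if;
        vv(i) := vv(2*i+1) or vv(2*i);
      end loop;
    end loop;
    return pp(0) and vv(0);
  end function;

begin

  check : process
  begin
    -- A few of the vectors listed in this mail; '1' means "b > a, swap".
    assert compare_sketch("1001", "1100") = '1' report "case 1" severity failure;
    assert compare_sketch("0010", "1000") = '1' report "case 2" severity failure;
    assert compare_sketch("1001", "1000") = '0' report "case 3" severity failure;
    assert compare_sketch("0011", "0011") = '0' report "case 4" severity failure;
    assert compare_sketch("00100001", "10000001") = '1' report "case 5" severity failure;
    assert compare_sketch("11100001", "11100000") = '0' report "case 6" severity failure;
    report "all sketch checks passed";
    wait;
  end process;

end architecture;
```

The loop updates pp and vv in place, which is safe here because iteration i only reads indices 2*i and 2*i+1 and writes index i, so no entry is overwritten before it has been read.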

I estimate the delay at around d=7 or d=8 for 32-bit or 64-bit operands. It's easy to cut between the first and second stage...

Yep.  I once considered a similar circuit for the EU_CMP unit, too.
I finally dropped it, but I don't remember why :(  But I still have
the source file :)

I still see a minor delay problem.  It's true that I count a MUX
as d=1/t=1 but only for the datapath

What is the difference between the "datapath" and the inside of a unit like this?
Is my estimate not "realistic"?
What delay would you estimate for a MUX in my algorithm?

-- that is, the control signal
should arrive early. In this circuit, it's always late. You may get
better latency using 4:1 MUXes (at least in the first two stages):

for level in 1 to 15 loop
  step := 4**level;
  left := L / step;
  for i in 0 to left-1 loop
    if vv(4*i+3) = '1' then
      pp(i) := pp(4*i+3);
    elsif vv(4*i+2) = '1' then
      pp(i) := pp(4*i+2);
    elsif vv(4*i+1) = '1' then
      pp(i) := pp(4*i+1);
    else
      pp(i) := pp(4*i+0);
    end if;
    vv(i) := vv(4*i+3) or vv(4*i+2) or vv(4*i+1) or vv(4*i+0);
  end loop;
  exit when step >= L;
end loop;

Each loop counts as d=2 t=2 now (which is realistic), but you'll
need only half the number of loops. I used this kind of stage in
my alternate EU_CMP version. In fact, my stage was a little more
complex because it did not only extract the "leading" bit but also
calculated its index and bit mask (for the msb instruction).

The drawback is that a 4:1 stage can't be realized in most FPGAs'
cells because the core element has too many inputs. On the other
hand, one may extract the 4:1 core and put it in a function:

function compare4 (pp, vv : in std_ulogic_vector)
return std_ulogic_vector;

for level in 1 to 15 loop
  step := 4**level;
  left := L / step;
  for i in 0 to left-1 loop
    pp(i) := compare4(pp(4*i+3 downto 4*i), vv(4*i+3 downto 4*i));
    vv(i) := vv(4*i+3) or vv(4*i+2) or vv(4*i+1) or vv(4*i+0);
  end loop;
  exit when step >= L;
end loop;

and the function could use 2:1 stages internally if necessary.
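
As an illustration of that last remark, the sketch below builds a 4:1 core from three 2:1 cells and checks it exhaustively against the 4:1 priority chain shown earlier. This is only one possible decomposition, under assumptions: the helper names sel2 and compare4_ref and the testbench compare4_tb are hypothetical, and compare4 returns a single std_ulogic here (not a vector as in the declaration above) so that it can be compared bit for bit.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity compare4_tb is
end entity;

architecture sim of compare4_tb is

  -- Hypothetical 2:1 cell: take the upper candidate if it is valid.
  function sel2 (p_hi, v_hi, p_lo : std_ulogic) return std_ulogic is
  begin
    if v_hi = '1' then
      return p_hi;
    else
      return p_lo;
    end if;
  end function;

  -- 4:1 core built from three 2:1 cells (assumed decomposition).
  function compare4 (pp, vv : std_ulogic_vector(3 downto 0)) return std_ulogic is
    variable p_hi, p_lo, v_hi : std_ulogic;
  begin
    p_hi := sel2(pp(3), vv(3), pp(2));  -- winner of the upper pair
    p_lo := sel2(pp(1), vv(1), pp(0));  -- winner of the lower pair
    v_hi := vv(3) or vv(2);             -- upper pair holds a valid bit
    return sel2(p_hi, v_hi, p_lo);
  end function;

  -- Reference: the 4:1 priority chain from the loop above.
  function compare4_ref (pp, vv : std_ulogic_vector(3 downto 0)) return std_ulogic is
  begin
    if vv(3) = '1' then
      return pp(3);
    elsif vv(2) = '1' then
      return pp(2);
    elsif vv(1) = '1' then
      return pp(1);
    else
      return pp(0);
    end if;
  end function;

begin

  check : process
    variable p, v : std_ulogic_vector(3 downto 0);
  begin
    -- Try all 16 x 16 combinations of pp and vv.
    for x in 0 to 15 loop
      for y in 0 to 15 loop
        p := std_ulogic_vector(to_unsigned(x, 4));
        v := std_ulogic_vector(to_unsigned(y, 4));
        assert compare4(p, v) = compare4_ref(p, v)
          report "compare4 mismatch" severity failure;
      end loop;
    end loop;
    report "compare4 matches the 4:1 priority chain";
    wait;
  end process;

end architecture;
```

The two first-level cells work in parallel, so this form has a logic depth of two 2:1 cells, compared with the three chained decisions of the if/elsif version.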

yes that could be better...

It's not that I don't want to use your approach with the compound adder, but I have a method using a Leading One Predictor
(which assumes that mantissa A is greater than mantissa B, so the problem arises when the exponents
are equal; the document I have uses a comparator. I propose to put it in the first stage).

The LOP needs its operands in a particular order?

Yes, it assumes that the difference between mantissa A and mantissa B is positive.
In fact, the theory defines a signed-digit vector W = A - B,
where each signed digit of W is +1, 0 or -1.
If Ma > Mb, then W is composed of k "0"s followed by a "1".
This form is written
0^k 1
(damn, why isn't MathML supported by all email readers?)
The theory then goes through the different possibilities
and extracts the logic equations.
Since I don't have enough time to check the theory in depth, I will use it directly :[
I'm currently writing my report, so the VHDL hacking is slowing down. I will explain *everything*...

Maybe that can be changed.

I think so, but for the moment I'm concentrating on a working unit. I will improve it later... or leave
enough information to improve it easily in the future...

I use a compound adder
anyway (for rounding, not for normalization).
With a leading one predictor, it's really fast to get the number of shifts to apply to the mantissa during normalisation.

Yep. Without prediction of the shift count, you would have to
calculate it from the result, which takes at least half a stage.

Only half a stage for a 54-bit mantissa???
You need to detect the leading one and then encode its position in binary to feed the shifter...

I have another couple of questions:
1) Why does F-CPU use std_ulogic? In January we had a conference with a guy from STMicroelectronics. I asked
him what he thought about std_ulogic: he said they would NEVER use it anywhere (they used it before,
but they switched to std_logic)...
So why is there so much ulogic in F-CPU? Is the "resolution" function so important? I think it can lead to some mistakes
(we can assign a signal from multiple processes)...
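
For context on the question above: in IEEE 1164, std_logic is the resolved subtype of std_ulogic, so the practical difference between the two is what happens when a signal has more than one driver. The sketch below is only an illustration of that difference, not a statement of why F-CPU chose std_ulogic; the entity and label names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity resolved_tb is
end entity;

architecture sim of resolved_tb is
  signal r : std_logic;        -- resolved: several drivers are legal
  -- signal u : std_ulogic;    -- unresolved: see the commented lines below
begin

  -- Two concurrent drivers on the same resolved signal.
  drv0 : r <= '0';
  drv1 : r <= '1';

  -- With an unresolved signal, the same pattern is rejected by the tool
  -- because std_ulogic has no resolution function:
  -- u <= '0';
  -- u <= '1';

  check : process
  begin
    wait for 1 ns;
    -- The 1164 resolution table maps the conflict '0' versus '1' to 'X'.
    assert r = 'X' report "expected conflict value 'X'" severity failure;
    report "two drivers on std_logic resolved to 'X'";
    wait;
  end process;

end architecture;
```

In other words, an accidental second driver shows up as an 'X' during simulation with std_logic, whereas with std_ulogic it is reported before the simulation starts.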

2) question about delay
Let a, b, c, e, f be std_ulogic.

f = a and b and c and e
It's an AND4, so d=1/t=1,
but an AND4 is composed of 3 AND2s.
The obvious implementation is
     __
a --|& |
b --|__|---\     __
            \---|& |
     __     /---|__|--- f
c --|& |---/
e --|__|

But the synthesizer can make it faster (you told me about the MUX) in CMOS technology (though not on an FPGA).

But if I write
f = (a and b) or (c and e)
then I should estimate it as d=2/t=2 (right?)

I understand that in CMOS technology we can make an AND2 (or NAND2) easily, but I don't understand why, if the synthesizer can make an AND4 with d=1/t=1,
it couldn't optimise my second function too...

And then there is the inverter:
f = not(a)

To me it's strange to rate a 2:1 MUX with the same delay as an AND2, a single AND2 with the same delay as an AND4... and an inverter
with the same delay as a MUX...
Are there any rules for this estimation? Because for the moment, I estimate everything nearly at random...

I will give an oral presentation of the project in 2 weeks, so I want to be sure about what I will say ;)

(Yes, yes, there are a lot of questions in my mail, sorry... :)


~~ Gaetan ~~
