Re: [f-cpu] magnetude comparison

To: f-cpu@seul.org
Subject: Re: [f-cpu] magnetude comparison
From: "gaetan@xeberon.net" <gaetan@xeberon.net>
Date: Tue, 02 Mar 2004 01:13:56 +0100
In-reply-to: <20040301234546.26608@thrai.stud.uni-hannover.de>
References: <40422843.2090605@xeberon.net> <40422BBC.8040607@f-cpu.org> <4042428F.50602@xeberon.net> <20040301002528.14378@thrai.stud.uni-hannover.de> <40428896.6070802@xeberon.net> <20040301234546.26608@thrai.stud.uni-hannover.de>
Reply-to: f-cpu@seul.org


Michael Riepe wrote:

Ah... I see where the bug is. You wrote:

function compare_vector(a, b : std_ulogic_vector) return std_ulogic is
constant L : natural := a'length;
constant aa : std_ulogic_vector(L-1 downto 0) := a;
constant bb : std_ulogic_vector(L-1 downto 0) := b;
variable pp, vv: std_ulogic_vector(L-1 downto 0);
variable p, v, swap : std_ulogic;
variable step, level, left : natural;
begin

-- (d=0/t=0)
for i in L-1 downto 0 loop
pp(i) := b(i);
vv(i) := a(i) xor b(i);
end loop;

Note that you used a and b in the loop, not aa and bb as you should.
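The reason this matters: a and b are unconstrained parameters, so their index ranges are whatever the caller passed (for example 7 to 14, or an ascending range), while aa and bb are always indexed L-1 downto 0. A minimal sketch of the normalized form follows. The package name normalize_demo and the function name diff_bits are hypothetical, and only the XOR step of the first level is reproduced, since the rest of the function is not shown in this message.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

package normalize_demo is
  -- Hypothetical helper: bitwise difference of two equally long vectors.
  function diff_bits(a, b : std_ulogic_vector) return std_ulogic_vector;
end package normalize_demo;

package body normalize_demo is
  function diff_bits(a, b : std_ulogic_vector) return std_ulogic_vector is
    constant L  : natural := a'length;
    -- Normalized copies: always indexed L-1 downto 0, whatever range
    -- the caller used. Assumes b'length = a'length; otherwise the
    -- initialization of bb fails at run time.
    constant aa : std_ulogic_vector(L-1 downto 0) := a;
    constant bb : std_ulogic_vector(L-1 downto 0) := b;
    variable vv : std_ulogic_vector(L-1 downto 0);
  begin
    for i in L-1 downto 0 loop
      -- Index the normalized copies, never the raw parameters.
      vv(i) := aa(i) xor bb(i);
    end loop;
    return vv;
  end function diff_bits;
end package body normalize_demo;
```

With this form, calling diff_bits on a signal declared as std_ulogic_vector(8 to 15) works the same as on one declared (7 downto 0).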

Argh, what a mistake...

Isn't life great? ;)

The VHDL world is so wonderful...

[...]

I still see a minor delay problem. It's true that I count a MUX
as d=1/t=1, but only for the datapath.

What is the difference between the "datapath" and the rest within a unit like this?
Is my estimation not "realistic"?
What delay do you estimate for a MUX in my algorithm?

In a MUX, there is the data path (selected input -> output) and the
control path (selector -> output). The control path usually has
higher latency because the selector has to be (de/en)coded first.
For 2:1 or 4:1 MUXes, the control -> output latency usually is d=2/t=2.
Since the source of the control lines is an XOR (d=1/t=2) in your
case, the first level will have a total latency of d=3/t=4 for pp.

The following levels have early-arriving control signals, so they
will add the usual d=1/t=1 latency to the datapath. But with five
more levels and the final AND, the total latency will be d=9/t=10.
The 4:1 variant with a total of 3 levels fits nicely into a single
stage (d=6/t=7), unless you use FPGAs.
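The totals above can be written out as a sum of (d, t) pairs. This is only a reading of the figures quoted in the paragraph: the final AND is taken as (1, 1) because that is what makes the sums come out, and the level counts (six for 2:1 MUXes, three for 4:1 MUXes) correspond to a 64-bit operand.

```latex
\begin{align*}
\text{first level:}\quad
  & \underbrace{(1,2)}_{\text{XOR}}
    + \underbrace{(2,2)}_{\text{select}\to\text{output}} = (3,4) \\
\text{2:1 MUX tree:}\quad
  & (3,4) + 5\cdot(1,1) + \underbrace{(1,1)}_{\text{final AND}} = (9,10) \\
\text{4:1 MUX tree:}\quad
  & (3,4) + 2\cdot(1,1) + \underbrace{(1,1)}_{\text{final AND}} = (6,7)
\end{align*}
```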

Why are things so complex...
But I am starting to understand.

[...]

The LOP needs its operands in a particular order?

Yes, it assumes the difference between Mantissa A and Mantissa B will be positive.
In fact, the theory defines a signed-bit vector W = A - B,
where each signed bit of W will be +1, 0 or -1.

Umh... redundant encoding. I used that in the SRT divider.

If Ma > Mb, then W is composed of k "0" digits followed by a "1".
This form is written
0^k 1
(damn, why isn't MathML supported by all email readers?)

Because some of them still run in text mode?

Argh!!! Some people still use the 'mail' command?

Then the theory continues with the different possibilities:
0^k.1.1...
0^k.1.0...
0^k.1.-1..

That looks familiar.

[...]

Yep. Without prediction of the shift count, you would have to
calculate it from the result, which takes at least half a stage.

Only half a stage for a 54-bit mantissa???

Yes.

You need to detect the leading one, then encode its position in binary to feed the shifter...

If you use a shifter that requires a binary encoded shift count :)

My omega network shifter can also be controlled by bit vectors that
represent numbers like (2**shift_count)-1. They are much easier to
generate in this case: a single left-to-right cascade_or is sufficient.
That will take d=3/t=3 for a 64-bit operand -- half a stage.
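To illustrate the idea of a left-to-right OR cascade, here is a behavioral sketch. It is not the F-CPU cascade_or itself, whose interface is not shown in this thread: the package name prefix_or_demo and the function name prefix_or_from_left are hypothetical. Every output bit is the OR of the input bits at or to the left of it, so everything from the leading one downwards becomes '1' and the result has the form (2**n)-1.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

package prefix_or_demo is
  -- Hypothetical helper: y(i) = x(L-1) or x(L-2) or ... or x(i).
  function prefix_or_from_left(x : std_ulogic_vector)
    return std_ulogic_vector;
end package prefix_or_demo;

package body prefix_or_demo is
  function prefix_or_from_left(x : std_ulogic_vector)
    return std_ulogic_vector is
    constant L   : natural := x'length;
    -- Normalized copy, indexed L-1 downto 0 (leftmost bit = MSB).
    constant xx  : std_ulogic_vector(L-1 downto 0) := x;
    variable y   : std_ulogic_vector(L-1 downto 0);
    variable acc : std_ulogic := '0';
  begin
    -- Walk from the MSB to the LSB, accumulating the OR so far.
    -- This loop is purely behavioral; it says nothing about depth.
    for i in L-1 downto 0 loop
      acc  := acc or xx(i);
      y(i) := acc;
    end loop;
    return y;
  end function prefix_or_from_left;
end package body prefix_or_demo;
```

For example, the input "00010110" yields "00011111". The serial loop only describes the function; a fast implementation would presumably be a tree, and three levels of 4-input ORs would match the quoted d=3/t=3 for 64 bits.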

I don't know what an omega network is; I thought my LOP was a good thing...
However, I think I will continue with it since I have already started. I'll study your omega network later in the year
(during my placement in Scotland, for example). I will finish my project and try to deliver a working unit...
No time left :(

2) Question about delay
Let a, b, c, e, f be std_ulogic.

f = a and b and c and e
It's an AND4, so d=1/t=1,
but an AND4 is composed of 3 AND2s.
The obvious implementation is
     __
a --|& |--\     __
b --|__|   \---|& |---
     __    /---|__|
c --|& |--/
e --|__|

But the synthesizer can make it faster (you told me about MUXes) on CMOS technology (and not on an FPGA).

But if I write
f = (a and b) or (c and e)
then I should estimate it as d=2/t=2 (right?)

That depends on the target, but d=2/t=2 -- meaning separate AND and
OR gates -- is the safest choice.

That's not much fun, I have some control logic (with some "complex" boolean equations)...

I understand that in CMOS technology we can easily make an AND2 (or NAND2), but I don't understand why, if the synthesizer can make an AND4 with d=1/t=1,
it couldn't optimise my second function...

f = not(a)
d=1/t=1

To me it's strange to estimate a 2:1 MUX with the same delay as an AND2, a single AND2 with the same delay as an AND4... and an inverter
with the same delay as a MUX...
Are there any rules for this estimation? Because for the moment, I estimate everything nearly randomly...

Well, of course there are differences between an AND2 and an AND4
or between an inverter and a MUX. The latency also varies with the
size of the transistors (due to their different gate capacitances),
length and width of the wires and so on. We simply can't take that
into account. We have to let the synthesizer do it.

The base for my estimations, in particular the t value, is more or
less the way these elements are realized in a CMOS process -- it's the
number of transistors (or transistor pairs) the signal has to pass,
whether from gate to drain (as in most gates) or from source to drain
(as in pass gates). Output inverters in AND/OR gates (as opposed to
NAND/NOR) are ignored because most expressions can be re-arranged so
that there is at most a single inverter at each input (or output) line.

The old d value is a little more inaccurate because a "gate" as such
doesn't exist -- an XOR is much slower than a NAND, for example.
But it was the basis of the initial "six gate" design rule, so I keep
it. And sometimes I violate it, if the t value indicates that I may.

I assume that gates with more than 4 inputs (or XOR gates with more
than 2 inputs) are not implemented directly but by combination of
two or more simpler gates. Latency is approximately d=t=log4(n) for
n-input AND/OR gates and d=t=log2(n) for n-input XOR gates, rounded
up to the next higher integer. Note that this is also consistent
with standard FPGAs that provide 4-input cells (they're even a little
faster when performing XORs or arbitrary functions).
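The rounding rule above can be sketched as a small helper function. The package name delay_estimate and the function name gate_levels are hypothetical. The function only counts levels of a balanced tree with the given fan-in (4 for AND/OR, 2 for XOR, as assumed in the paragraph above); it does not model transistor sizing or wiring.

```vhdl
package delay_estimate is
  -- Number of gate levels (d = t) needed to combine n inputs with
  -- gates of at most fan_in inputs, i.e. ceil(log_fan_in(n)).
  function gate_levels(n      : positive;
                       fan_in : integer range 2 to integer'high)
    return natural;
end package delay_estimate;

package body delay_estimate is
  function gate_levels(n      : positive;
                       fan_in : integer range 2 to integer'high)
    return natural is
    variable reach  : positive := 1;  -- inputs covered so far
    variable levels : natural  := 0;
  begin
    -- Each extra level multiplies the number of covered inputs by
    -- fan_in. Intended for realistic gate widths; very large n
    -- could overflow the intermediate product.
    while reach < n loop
      reach  := reach * fan_in;
      levels := levels + 1;
    end loop;
    return levels;
  end function gate_levels;
end package body delay_estimate;
```

With these assumptions, gate_levels(4, 4) is 1 (a plain AND4), gate_levels(5, 4) is 2, gate_levels(64, 4) is 3, and gate_levels(8, 2) is 3 for an 8-input XOR.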

I suppose that MUXes use a faster representation than explicit AND-OR.
This could be AOI gates, pass gates or similar, with d=t=1 for the
data path up to a number of 4 selectable inputs. I try to avoid bigger
MUXes if I can. For the select->output path, d=t=2 is a safer choice.
In FPGAs, 2:1 MUXes usually need a single cell, so d=t=1 is also
realistic in this case. 4:1 MUXes may be more expensive, however.

Everything not mentioned above is subject to additional thinking ;)

Well, well, well... very good.
I think it will take me a few nights to digest it... but it's very interesting.
But why not write it down (for example in the wiki)? Then any new rookie who comes along will not
ask you that again and again...

Thanks very much for everything, Michael!

++

--

~~ Gaetan ~~
http://www.xeberon.net
