
Re: [f-cpu] magnitude comparison

Hi F-gang,

On Mon, Mar 01, 2004 at 06:47:03PM +0100, gaetan@xeberon.net wrote:
> Hello F-World!:
> Michael Riepe a écrit :
> >
> >
> >Shouldn't this be the other way round?
> >
> >	if vv(2*i+1) = '1' then
> >		pp(i) := pp(2*i+1);
> >	else
> >		pp(i) := pp(2*i);
> >	end if;
> >
> >The most significant bit is on the left.
> >
> >  
> >
> Nope.
> It doesn't seem to work with:
> a := 1001
> b := 1100
> swap? := 0 -- should swap because b>a
> Strangely, the vectors vv and pp are reversed (so the most significant
> bit is on the right).
> I tested my algorithm with:
>         Run <= compare_vector("1000","1001");

Ah... I see where the bug is.  You wrote:

>     function compare_vector(a, b : std_ulogic_vector) return std_ulogic is
>       constant L : natural := a'length;
>       constant aa : std_ulogic_vector(L-1 downto 0) := a;
>       constant bb : std_ulogic_vector(L-1 downto 0) := b;
>       variable pp, vv: std_ulogic_vector(L-1 downto 0);
>       variable p, v, swap : std_ulogic;
>       variable step, level, left : natural;
>     begin
>         -- (d=0/t=0)
>         for i in  L-1 downto 0 loop
>             pp(i) := b(i);
>             vv(i) := a(i) xor b(i);
>         end loop;

Note that you used a and b in the loop, not aa and bb as you should have.

The index ranges of the formal parameters a and b depend on the
actual parameters given in the function call -- and in your test
above, they're ascending (left to right) instead of descending (left
downto right).  pp and vv, on the other hand, always have descending
index ranges, so e.g.

	pp(i) := b(i);

reverses the vector when b has an ascending index range.  If you fix
that, then you'll have to change the if statement as outlined earlier.

Damn pitfall, eh? ;)
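For what it's worth, the whole selection tree (with the corrected if
statement and with the operands used consistently) can be modeled outside
VHDL.  Below is a quick Python sketch, not the VHDL itself: the lists are
MSB-first, so the more significant element of each pair sits at the even
index, which mirrors pp(2*i+1) in the downto-indexed vectors.

```python
def compare_vector(a: str, b: str) -> int:
    """Tree comparison sketch: returns 1 when b > a (i.e. "swap"), else 0.

    a and b are bit strings of equal power-of-two length, MSB first.
    """
    assert len(a) == len(b) and (len(a) & (len(a) - 1)) == 0
    # level 0: pp holds b's bits, vv flags the positions where a and b differ
    pp = [int(x) for x in b]
    vv = [x != y for x, y in zip(a, b)]
    # each level keeps, per pair, the value from the more significant half
    # if that half saw a difference -- otherwise the less significant one
    while len(pp) > 1:
        pp = [pp[2 * i] if vv[2 * i] else pp[2 * i + 1]
              for i in range(len(pp) // 2)]
        vv = [vv[2 * i] or vv[2 * i + 1] for i in range(len(vv) // 2)]
    # the final AND: a swap only makes sense if the operands differ at all
    return pp[0] if vv[0] else 0

print(compare_vector("1001", "1100"))  # 1 -- b > a, so swap
print(compare_vector("1000", "1001"))  # 1
print(compare_vector("1100", "1001"))  # 0
```

The result is the b-bit at the most significant differing position, which
is exactly "b > a".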

By the way: it's not necessary to copy bitwise.  A simple

	pp := bb;
	vv := aa xor bb;

should do it.  Note that the index range is converted automatically in
this case, so you could also write

	pp := b;
	vv := a xor b;

and get exactly the same behaviour: the leftmost bit of the source is
copied into the leftmost bit of the destination, and so on, up to the
rightmost bit:

	variable a : std_ulogic_vector(1 to 4);
	variable b : std_ulogic_vector(15 downto 12);

	a := b;
	-- means:
	a(1) := b(15);
	a(2) := b(14);
	a(3) := b(13);
	a(4) := b(12);

Bit string literals have an ascending range, at least in VHDL'93:
"the direction of the index subtype of the aggregate is that of the
index subtype of the base type of the aggregate".

Since the index subtype of "std_ulogic_vector" is "natural", which
has an ascending range (0 to integer'high), the index range of the
literal is also ascending.  The leftmost bound (available as the
'left attribute) is taken from the index subtype as well.  In the case
of natural, it's 0, but for ordinary string literals it is 1, because
type "string" has "positive" as its index subtype...

That is, string'("1010") has indices 1 to 4 (left to right) while
std_ulogic_vector'("1010") has indices 0 to 3.  And "1010" alone can
be either one, or something entirely different, depending on the
context it appears in.

Isn't life great? ;)

> >I still see a minor delay problem.  It's true that I count a MUX
> >as d=1/t=1 but only for the datapath
> >
> what is the difference between the "datapath" and the control path
> within a unit like this?
> is my estimation not "realistic"?
> how much do you estimate a MUX at in my algorithm?

In a MUX, there is the data path (selected input -> output) and the
control path (selector -> output).  The control path usually has
higher latency because the selector has to be (de/en)coded first.
For 2:1 or 4:1 MUXes, the control -> output latency usually is d=2/t=2.
Since the source of the control lines is an XOR (d=1/t=2) in your
case, the first level will have a total latency of d=3/t=4 for pp.

The following levels have early-arriving control signals, so they
will add the usual d=1/t=1 latency to the datapath.  But with five
more levels and the final AND, the total latency will be d=9/t=10.
The 4:1 variant with a total of 3 levels fits nicely into a single
stage (d=6/t=7), unless you use FPGAs.
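As a sanity check on the arithmetic (a throwaway sketch, not part of the
design): the 2:1 tree over 64 bits has six levels; the first costs d=3/t=4
because of the late XOR-driven select, the remaining ones d=1/t=1 each,
plus the final AND.

```python
import math

def tree_latency(width, radix=2, first_level=(3, 4), per_level=(1, 1),
                 final_and=(1, 1)):
    """Add up (d, t) for a selection tree of radix:1 MUXes over `width` bits."""
    levels = round(math.log(width, radix))  # 6 levels at radix 2, 3 at radix 4
    d = first_level[0] + (levels - 1) * per_level[0] + final_and[0]
    t = first_level[1] + (levels - 1) * per_level[1] + final_and[1]
    return d, t

print(tree_latency(64))           # (9, 10) -- the 2:1 tree
print(tree_latency(64, radix=4))  # (6, 7)  -- the 4:1 variant
```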

> >The LOP needs its operands in a particular order?
> >  
> >
> Yes it assumes the difference between Mantissa A and Mantissa B will be 
> positive.
> In fact, the theory defines a signed-bit vector W=A-B
> each signed bit of W will be +1, 0 or -1

Umh... redundant encoding.  I used that in the SRT divider.

> If Ma > Mb, then W is composed of k "0"s followed by a "1".
> This form is called
>    0^k 1
> (damned, why isn't MathML supported by all email readers)

Because some of them still run in text mode?

> Then the theory continues on different possibility:
> 0^k.1.1...
> 0^k.1.0...
> 0^k.1.-1..

That looks familiar.

> >Yep.  Without prediction of the shift count, you would have to
> >calculate it from the result, which takes at least half a stage.
> >
> >  
> >
> only half a stage for 54 bit mantissa???


> you need to detect the LSB then to code the position in binary to feed 
> the shifter...

If you use a shifter that requires a binary encoded shift count :)

My omega network shifter can also be controlled by bit vectors that
represent numbers like (2**shift_count)-1.  They are much easier to
generate in this case: a single left-to-right cascade_or is sufficient.
That will take d=3/t=3 for a 64-bit operand -- half a stage.
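To illustrate (a Python toy, not the actual omega-network control logic):
a left-to-right cascaded OR turns the result vector into a thermometer
code, and the run of zeroes in front of the leading one directly encodes
the shift distance -- no binary encoder needed.

```python
def cascade_or(w):
    """Left-to-right cascaded OR over a bit vector (MSB first)."""
    acc, out = 0, []
    for bit in w:
        acc |= bit          # once a 1 has been seen, everything after is 1
        out.append(acc)
    return out

w = [0, 0, 0, 1, 1, 0, 1, 0]   # leading one at position 3
mask = cascade_or(w)
print(mask)                     # [0, 0, 0, 1, 1, 1, 1, 1]
print(mask.count(0))            # 3 -- the normalization shift distance
```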

> I have another couple of questions:
> 1) why does F-CPU use std_ulogic? In January we had a conference with
> a guy from STMicroelectronics, and I asked him what he thought about
> std_ulogic: he said they would NEVER use it anywhere (they used it
> before, but they switched to std_logic)...
> So why is there so much ulogic in F-CPU? Is the resolution function so
> important? I think it can lead to some mistakes
> (we can assign a signal from multiple processes)...

That's one reason.  With std_logic, a signal may have multiple drivers
that contribute to its value.  That can't happen with unresolved
logic because the tools would signal an error.  And since we don't use
tri-state or explicit "wired-and" or "wired-or" logic inside the core,
using std_ulogic is both safer and cleaner.

> 2) a question about delay
> let a, b, c, e, f be std_ulogic
> f = a and b and c and e
> it's an AND4, so d=1/t=1
> but an AND4 is composed of 3 AND2s
> the obvious implementation is
>           __
> a --|& |--\         __
> b --|__|   \-----|& |--- f
>           __     |__|
> c --|& |---/
> e --|__|
> but the synthesizer can make it faster (you told me about MUXes), on
> CMOS technology (and not on FPGA)
> but if i do
> f = (a and b) or (c and e)
> then i should estimate it d=2/t=2 (right?)

That depends on the target, but d=2/t=2 -- meaning separate AND and
OR gates -- is the safest choice.

> I understand that in CMOS technology we can make an AND2 (or NAND2)
> easily, but i don't understand: if the synthesizer can make an AND4
> with d=1/t=1, why couldn't it optimise my second function...
> f = not(a)
> d=1/t=1
> For me it's strange to evaluate a MUX2:1 with the same delay as an
> AND2, a single AND2 with the same delay as an AND4... and an inverter
> with the same delay as a MUX...
> Are there any rules for this estimation? because for the moment, i
> estimate everything nearly randomly...

Well, of course there are differences between an AND2 and an AND4
or between an inverter and a MUX.  The latency also varies with the
size of the transistors (due to their different gate capacitances),
the length and width of the wires, and so on.  We simply can't take
all of that into account; we have to let the synthesizer do it.

The base for my estimations, in particular the t value, is more or
less the way these elements are realized in a CMOS process -- it's the
number of transistors (or transistor pairs) the signal has to pass,
whether from gate to drain (as in most gates) or from source to drain
(as in pass gates).  Output inverters in AND/OR gates (as opposed to
NAND/NOR) are ignored because most expressions can be re-arranged so
that there is at most a single inverter at each input (or output) line.

The old d value is a little more inaccurate because a "gate" as such
doesn't exist -- an XOR is much slower than a NAND, for example.
But it was the basis of the initial "six gate" design rule, so I keep
it.  And sometimes I violate it, if the t value indicates that I may.

I assume that gates with more than 4 inputs (or XOR gates with more
than 2 inputs) are not implemented directly but by combination of
two or more simpler gates.  Latency is approximately d=t=log4(n) for
n-input AND/OR gates and d=t=log2(n) for n-input XOR gates, rounded
up to the next higher integer.  Note that this is also consistent
with standard FPGAs that provide 4-input cells (they're even a little
faster when performing XORs or arbitrary functions).
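These rules of thumb are easy to write down (a hypothetical helper, just
restating the estimate above): latency grows as ceil(log4 n) for n-input
AND/OR and ceil(log2 n) for n-input XOR, with a floor of one level.

```python
def gate_latency(n, base=4):
    """d = t = number of gate levels for an n-input gate built from
    base-input primitives (base=4 for AND/OR, base=2 for XOR)."""
    levels = 1
    while base ** levels < n:   # integer arithmetic avoids log() rounding
        levels += 1
    return levels

print(gate_latency(4))           # 1 -- an AND4 still fits in one level
print(gate_latency(16))          # 2
print(gate_latency(3, base=2))   # 2 -- a 3-input XOR needs two XOR2s
```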

I suppose that MUXes use a faster representation than explicit AND-OR.
This could be AOI gates, pass gates or similar, with d=t=1 for the
data path up to a number of 4 selectable inputs.  I try to avoid bigger
MUXes if I can.  For the select->output path, d=t=2 is a safer choice.
In FPGAs, 2:1 MUXes usually need a single cell, so d=t=1 is also
realistic in this case.  4:1 MUXes may be more expensive, however.

Everything not mentioned above is subject to additional thinking ;)

 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"