
Re: [f-cpu] SIMD and exception

To: f-cpu@seul.org
Subject: Re: [f-cpu] SIMD and exception
From: "gaetan@xeberon.net" <gaetan@xeberon.net>
Date: Tue, 10 Feb 2004 10:04:44 +0100
In-reply-to: <4027EF57.2080502@stud.uni-hannover.de>
References: <40278EF6.1010805@xeberon.net> <4027D85F.8030202@xeberon.net> <4027EF57.2080502@stud.uni-hannover.de>

hi

Michael Riepe wrote:

Two other questions today :)
1) I think my current way of handling SIMD is not correct. For the moment, in each stage, I check whether I am operating on 32-bit or 64-bit data. The problem is that this adds a MUX to each stage... and it is at least 2 gates deep.
I count MUXes as d=1/t=1 (provided that the select line has less delay than the datapath).

ok thx...

So I am thinking of two other methods:
- Select the right path in the first stage (with a decoder, to prevent the unused path from switching transistors), and select the right result in the last stage. The problem is that this will use twice as many components (for instance, I currently use only one 64-bit generic_adder for both SIMD formats, but now I will need one addition for single and another addition for double floats). It will also need twice as many registers. The problem here is the area.
Well, some parts have to be duplicated - e.g. the exponent subtractor. You'll also need separate shifters and rounding logic for each chunk.
The generic adder, however, can be split in the middle without increasing its delay, by using custom versions of CLA and CSV (that's what I did in the IAdd unit).
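
As a rough illustration of splitting an adder in the middle, here is a behavioral sketch. It is not the IAdd unit and does not model the custom CLA/CSV structure or its delay; the entity name split_add64 and the mode input simd32 are hypothetical. It only shows the functional idea: the carry out of the low half is blocked at the lane boundary in SIMD mode.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity, not taken from the F-CPU sources: a behavioral
-- model of a 64-bit adder that can be cut into two 32-bit lanes.
entity split_add64 is
    port (
        a, b   : in  std_ulogic_vector(63 downto 0);
        simd32 : in  std_ulogic;  -- '1': two 32-bit adds, '0': one 64-bit add
        y      : out std_ulogic_vector(63 downto 0)
    );
end entity split_add64;

architecture behave of split_add64 is
begin
    process (a, b, simd32)
        variable lo, hi : unsigned(32 downto 0);
        variable cin_hi : unsigned(0 downto 0);
    begin
        -- Low half, extended by one bit to capture its carry out.
        lo := resize(unsigned(a(31 downto 0)), 33)
            + resize(unsigned(b(31 downto 0)), 33);
        -- The carry crosses the middle only in 64-bit mode.
        cin_hi(0) := lo(32) and not simd32;
        hi := resize(unsigned(a(63 downto 32)), 33)
            + resize(unsigned(b(63 downto 32)), 33)
            + cin_hi;
        y <= std_ulogic_vector(hi(31 downto 0))
           & std_ulogic_vector(lo(31 downto 0));
    end process;
end architecture behave;
```

A real implementation would gate the carry inside the carry-lookahead tree rather than rely on the synthesizer, which is where the "no extra delay" property comes from.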

So here is a problem:
The first stage contains the exponent subtraction. For single, the exponent size is 8 bits, so the subtractor delay is 6. So it just fits into the stage.
But for double it takes more, so I will have to split it between stage 1 and stage 2 (with the CSV and CLA). So the two datapaths (single and double) will not arrive at the main adder (the mantissa adder) at the same time. So I cannot use the same adder for both operations (unless I buffer the single datapath and "wait" for the double one).
Is there any restriction on the number of registers? This will add a lot of registers (for the double and single datapaths)...

- Convert everything to 64 bits and do all operations in 64 bits (I think this is how the Pentium works), BUT the problem is that in the last stage I will have to cut the mantissa and exponent, and this will add some logic, so it increases the number of stages by at least one (especially for the rounding step).
I guess that's not an option.

no prob

See above. You can assume that the target technology provides MUXes that are faster than the obvious implementation:

Y := (Sel and A) or (not Sel and B)

which would have d=2/t=2 for the datapath.
And is it the same for logic equations?
I mean (A, B, C are std_ulogic, not vectors):
(A and B) or C
d=2 ?
Yes, and t=2. It may be faster when AOI gates are used, but they're not generally available. In an FPGA, this will also be relatively fast (one cell delay).
(A xor B) xor C
d=2?
d=2, but t=4 (if the target technology really uses XORs).

Again, this is relatively fast in an FPGA where most functions with 3 or 4 operands can be implemented in a single cell (or row of cells).

On a full custom chip, a 3-input XOR may be implemented as

Y := (A and B and C)
    or (A and not B and not C)
    or (not A and B and not C)
    or (not A and not B and C);

or even

Y := not (
    (not A and not B and not C) or
    (not A and B and C) or
    (A and not B and C) or
    (A and B and not C));

which takes 3 inverters and an and-or-invert (AOI) gate, resulting in d=2/t=2. But on the other hand, you wrote "xor", so the synthesizer may create just that.
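
To check that both expanded forms really compute a 3-input XOR, a small self-checking testbench can walk through all eight input combinations. This is only a sketch: the entity name xor3_forms_tb and the variable names are hypothetical and not part of the F-CPU sources.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical self-checking testbench (not part of the F-CPU sources).
entity xor3_forms_tb is
end entity xor3_forms_tb;

architecture sim of xor3_forms_tb is
begin
    check : process
        type bit_values is array (0 to 1) of std_ulogic;
        constant VAL : bit_values := ('0', '1');
        variable a, b, c           : std_ulogic;
        variable y_ref, y_sop, y_aoi : std_ulogic;
    begin
        for i in 0 to 1 loop
            for j in 0 to 1 loop
                for k in 0 to 1 loop
                    a := VAL(i); b := VAL(j); c := VAL(k);
                    -- Reference: chained two-input XORs.
                    y_ref := (a xor b) xor c;
                    -- Sum-of-products form (the four odd-parity terms).
                    y_sop := (a and b and c)
                          or (a and not b and not c)
                          or (not a and b and not c)
                          or (not a and not b and c);
                    -- Inverted even-parity terms (inverters plus one AOI gate).
                    y_aoi := not ((not a and not b and not c)
                               or (not a and b and c)
                               or (a and not b and c)
                               or (a and b and not c));
                    assert y_sop = y_ref
                        report "SOP form differs from XOR chain" severity failure;
                    assert y_aoi = y_ref
                        report "AOI form differs from XOR chain" severity failure;
                end loop;
            end loop;
        end loop;
        report "all 8 input combinations checked";
        wait;  -- suspend the process forever
    end process check;
end architecture sim;
```

The testbench checks function only; the d/t figures depend on how the target technology maps each form.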

I suggest you use work.misc.xor3 (available for both std_ulogic and std_ulogic_vector) if you need a 3-input XOR. It's easier to adapt a single function to the target than to scan the whole F-CPU for stray 3-input XORs.

On the other hand, if you *know* that one of the inputs arrives late, you can as well use

Y := (A xor B) xor C; -- when C arrives late

u = (Sa and not(s) ) or (not(de xor Sa) and s)
is d=3?


Approximately d=3/t=4, but:

    -- d=2/t=3
    if s = '1' then
        -- d=1/t=2
        u := de xnor Sa;
    else
        -- d=0/t=0
        u := Sa;
    end if;

Or, if Sa arrives substantially later than de and s:

    u := Sa xor (s and not de);

where the delay varies from d=1/t=2 (Sa -> u) to d=3/t=4 (de -> u).
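
The three ways of writing u above can be checked against each other with a short self-checking testbench. Again this is only a sketch; the entity name u_forms_tb and the variable names u_eq, u_mux and u_late are made up for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical self-checking testbench (not part of the F-CPU sources).
entity u_forms_tb is
end entity u_forms_tb;

architecture sim of u_forms_tb is
begin
    check : process
        type bit_values is array (0 to 1) of std_ulogic;
        constant VAL : bit_values := ('0', '1');
        variable Sa, s, de          : std_ulogic;
        variable u_eq, u_mux, u_late : std_ulogic;
    begin
        for i in 0 to 1 loop
            for j in 0 to 1 loop
                for k in 0 to 1 loop
                    Sa := VAL(i); s := VAL(j); de := VAL(k);
                    -- Original AND/OR equation.
                    u_eq := (Sa and not s) or (not (de xor Sa) and s);
                    -- MUX form: s selects between Sa and (de xnor Sa).
                    if s = '1' then
                        u_mux := de xnor Sa;
                    else
                        u_mux := Sa;
                    end if;
                    -- Form that favours a late-arriving Sa.
                    u_late := Sa xor (s and not de);
                    assert u_mux = u_eq
                        report "MUX form differs from equation" severity failure;
                    assert u_late = u_eq
                        report "late-Sa form differs from equation" severity failure;
                end loop;
            end loop;
        end loop;
        report "all 8 input combinations checked";
        wait;  -- suspend the process forever
    end process check;
end architecture sim;
```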

Just try to keep the balance ;-)

OK, thank you Michael!

Michael.

*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/

--

~~ Gaetan ~~
http://www.xeberon.net

