[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]

Re: Shifting theory (was Re: [f-cpu] VHDL and delay estimation)



Hi F-gang,

Yann Guidon wrote:
[...]
the wiring delays are an annoying issue.
particularly for shifters...
Yep.

Currently, i use a method to approximate the "cost" of
wiring but it's not an absolute measure :
the delay is roughly proportional to the number of wires that
must be crossed.
For example, if you have to perform a AND between
vector(x) and vector(y), the "cost" is |x-y|

In practice it should be |x-y|/2 :

          AND
          / \
         /   \
        /     \
       /       \
       |       |            ********¤******¤********* (vector)
       x       y

but that's the "big O" story.
Well, that depends on the layout. If the vectors x and y are interleaved, there are no crossings at all:

+---+
X(0) --| |
| & |-- Z(0)
Y(0) --| |
+---+

+---+
X(1) --| |
| & |-- Z(1)
Y(1) --| |
+---+

...

+---+
X(N-1) --| |
| & |-- Z(N-1)
Y(N-1) --| |
+---+

(I hope this is readable)

The goal is to reduce the relative distance.
For a shifting block it is not always possible to place the gate
at the middle point between the sources.

There are some topologies, such as the one Michael implemented,
where all the shifting stages have the same critical datapath
but all paths have different lengths, i consider this under-performant
even though it's easier to layout. Shifters that have constant wire lengths
in each stage have a balanced CDP so there is no case against
which to optimise the layout.
It's true that the individual paths have different lengths, but there is an upper bound (that is, a worst case delay) of N/2*log2(N), where N is the shifted operand's width. In a barrel shifter, the delay grows exponentially with every additional stage: stage i contains (2**i-1)*N crossings. The total delay will be no better than O(N*N). At N=64, this is already a big difference.

And don't forget SIMD, please. That will most likely require additional wires in a barrel shifter, while it comes for free with the omega network.

Another issue is loss in the wires :
when a wire crosses  more than 8 to 16 "bits",
a small buffer/inverter can boost the signal and reduce
the transmission time. Usually, this is automatically
performed by good synthesizers (and they will
even often optimise it out) and it will slow down RTL
simulation. However this issue is increasingly dominant in
modern silicon technologies and we must be careful
if we want to reach even higher clock speeds...

So, for mostly combinatorial blocks, such as addition,
there is not much to worry, but shift/rotation is another deal :
it seems to require only muxes but N wires of length L
require a surface of N x L. This comment applies to internal buses too.

For shifting, such as the normalisation, the "trick" is to create
a "generic" shifting stage. Pipelining can then be enabled
with a if ... generate that implements the pipeline latch.
That's exactly what the omega network provides: a uniform shifting stage that can be easily pipelined.

Michael.

*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu in the body. http://f-cpu.seul.org/