[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [f-cpu] Something new to play with :)

On Sun, Sep 23, 2001 at 03:52:55AM +0200, Yann Guidon wrote:
> hi !
> Michael Riepe wrote:
> <snip>
> thanks. I did not look at it (yet) though.
> When i was at Graham's library, her, home,
> i read a few books about networks and interco.
> In the F-CPU we have a very tough environment.
> I think that an Omega network would be the
> minimum degree of 'smartnes'. However i would
> like to design a network with an increased
> number of stages but "thinner" stages than Omega.
> The problem is that when you have a 64-bit wide
> omega network, the "critical datapath" can be
> represented as the lenght of the longest wire
> in the slice, and Omega needs to cross at least
> 32 wires in the CDP (and that's a lot). If what
> books say is true, the delay through an IC wire
> is roughly proportional to the square of the
> length (which can be modeled as a string of RC cells).
> If we adopt a "cell" composed of 4* 4-input
> multiplexers, we will "need" 3 Omega slices but
> it is not dangerous to increase the depth (to 5 or 6)
> if we reduce the crossings and wire lengths.
> If we reduce the length of one wire by two, there's
> a 4x win.
> However Omega has a very very interesting feature :
> it belongs to a family where the "address" (binary number
> that says how much you shift) needs almost no decoding
> before feeding the mux inputs.

That's right -- but how many different operations can it perform?

> One hidden drawback that is inherent with the first
> naive approach of the transistor array is that the
> decoding of the individual 4096 transistor gates
> "takes" much time.

Not as much as it may seem.  Every output bit is a copy of a single input
bit, or is zero -- that is, you need one 65:1 mux (or 64:1 mux plus an
AND gate) per bit.  The control lines of the mux depend on the index of
the result bit (which is a constant), the shift count, the chunk size,
and the kind of operation to be performed.  In the most complex case
(64-bit left/right shift), you need a 6-bit unsigned add/subtract
operation with saturation (with one constant and one variable operand).

> I believe (warning : gut feeling ahead) that if we have more
> than 3 stages made of 16*4* 4-input muxes, the decoding
> will not be too hard. below that point (the "naive approach"
> being an example) we need a significant decoding logic
> (time and space).

I used two stages because the decoding is easier to get right.  Since SIMD
is byte-oriented, the choice of 8x8 cells for the first stage was obvious.
With 4x4 cells, both the second and third stage would need rather complex
decoding logic.

The approach I used is similar to the one I proposed half a year ago, but
slightly improved.  The first stage shifts individual bytes by [0;7] bits
(using 8:1 muxes), the second stage combines the results of the first.
Since at most two bytes of the intermediate result contribute to each
result byte, the second stage is a pair of 16:1 muxes whose outputs
are ORed.

The decoding logic for the first stage (single-bit shifts) is pretty
simple -- the least significant bits of the B operand drive the control
lines of the mux directly.  Stage 2 (byte shifts) needs more complex
decoding logic, but does not depend on the results of stage 1 (i.e. it
can be calculated while the bit-shifts are performed).  Currently, I use
muxes (case statement) to select one of 30 constant "byte selectors",
but a ROM-based approach or discrete logic is also possible.

In order to speed up the bitwise operations, left-shift and right-shift
operations are handled separately.  A third block (which is similar to
the second stage used by the bitwise operands) performs the bytewise
operations, and a final output mux selects the correct result.

 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/