
[f-cpu] FC0 P&R : i finally got it !


While thinking about the f*ck*d up pseudo-MIPS
(microcoded !) project that my group is doing
this week (as part of the diploma), I remembered
the "over-cell routing" technique and how the register
set was routed by Alliance.

How many metal layers remain free over a 3R2W register set ?
This might be the solution to the P&R problem that
I ran into while using Alliance for the ROP2 unit.

First we have to optimise the register set for small area
and low latency. The current design of the register set
(split into 5 subparts of 8, 8, 16, 16 and 16 bits) is rather
cool : we can optimise each subpart once and copy/paste it.

So we have the following preliminary layout :

                r                       r
                e                       e
                g                       g
                1                       63
               |                           |
bits 48 to 63  |                           |
               |                           |
               |                           |
bits 32 to 47  |                           |
               |                           |
               |                           |
bits 16 to 31  |                           |
               |                           |
bits 8 to 15   |                           |
bits 0 to 7    |                           |
                 R/W addresses (5*63 wires)

At each boundary between clusters, we can place the write and
local decoders, as well as buffers that amplify the address
lines for the "jump" to the next cluster. The cool thing
is that a buffer can be disabled if the write does not
propagate to the last clusters (MSB). Nico will be happy
because, at first glance, this consumes less power when the
writes are on bytes or words.
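
As a rough illustration of the buffer gating described above, here is a
minimal Python sketch. It assumes the address lines enter on the LSB side
and hop from cluster to cluster towards the MSB, with one buffer per
boundary; the names CLUSTER_BITS, clusters_needed and buffers_enabled are
made up for this example and do not come from any F-CPU source.

```python
# Cluster widths in bits, LSB first, as given in the post (8, 8, 16, 16, 16).
CLUSTER_BITS = [8, 8, 16, 16, 16]


def clusters_needed(write_bits):
    """Return how many clusters, counted from the LSB side, a write touches."""
    if write_bits <= 0 or write_bits > sum(CLUSTER_BITS):
        raise ValueError("write width must be between 1 and 64 bits")
    covered = 0
    for count, width in enumerate(CLUSTER_BITS, start=1):
        covered += width
        if covered >= write_bits:
            return count


def buffers_enabled(write_bits):
    """Assumption: one buffer sits at each boundary between two clusters,
    and it is enabled only if the cluster behind it is written."""
    return clusters_needed(write_bits) - 1


for size in (8, 16, 32, 64):
    print(size, "bit write:", clusters_needed(size), "clusters,",
          buffers_enabled(size), "buffers enabled")
```

Under these assumptions a byte write never leaves the first cluster, so
every buffer stays off, while a full 64-bit write needs all four of them.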

Next step : split the register set into two halves along the
vertical axis so we can put the "bypass registers" of
the "crossbar" at equal distance from all registers and
all units :

                r            r       r           r
                e            e       e           e
                g            g       g           g
                1            3       3           6
                             1       2           3
               +--------------+     +-------------+
               |              |  B  |             |
               |              |  U  |             |
               |              |  F  |             |
               +--------------+  F  +-------------+
               |              |  E  |             |
               |              |  R  |             |
               |              |  S  |             |
               +--------------+     +-------------+
               |              |  A  |             |
               |              |  N  |             |
               |              |  D  |             |
               +--------------+     +-------------+
               |              |  F  |             |
               +--------------+  F  +-------------+
               |              |  s  |             |
               +--------------+     +-------------+
                ^^^^^^^^^^^^^^      ^^^^^^^^^^^^^
                ||||||||||||||      |||||||||||||
                 R/W addresses       (5*63 wires)

This way, we can put all the normal EUs on the left and the right
of the register set. The "Xbar" will be routed over the register set
and most units. Seen from the side, it looks like this :

  |      |     |      |        |             |     |     |
  +-POPC-+-SHL-+-ROP2-+  R1:31-B-R32:63  ASU-+-IDU-+-INC-+

It has drawbacks and good sides : it removes the need
to design EUs which have the input and the output on the same side
(as in the "original Xbar" design) and we can "chain" operators
when a certain sequence is often used (such as shift then mask,
so SHL and ROP2 are next to each other).
The bad news is that if you want to do a shift then an add,
in the case shown, the signal will take too much time
to travel across the units and the R7's length. The bypass
buffers of the register set will have to play the role
of amplifiers and it may cost one cycle to travel through
the remaining distance.

So on one side we win one cycle, on the other it costs
one more cycle. The designer's art will be to identify
often-used operator sequences so that the corresponding
units are placed in the same order. Otherwise,
instead of being "for free", it might cost two cycles...
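
To make the trade-off above a bit more concrete, here is a small Python
sketch of a cost model. It is only one reading of the post, not a timing
analysis : it assumes a result handed to a directly adjacent unit is
"free" (0 cycles), a transfer between non-adjacent units on the same side
costs the usual 1 cycle, and a transfer across the register set costs 2
cycles. Adjacency is treated as symmetric. The unit order is taken from
the side view ; the function names are invented for this example.

```python
# Unit order on each side of the register set, from the side view.
LEFT = ["POPC", "SHL", "ROP2"]
RIGHT = ["ASU", "IDU", "INC"]


def side(unit):
    """Return the list (LEFT or RIGHT) that contains the unit."""
    if unit in LEFT:
        return LEFT
    if unit in RIGHT:
        return RIGHT
    raise ValueError("unknown unit: " + unit)


def transfer_cycles(producer, consumer):
    """Assumed cost of forwarding a result from producer to consumer."""
    if side(producer) is not side(consumer):
        return 2  # must cross the register set and its bypass buffers
    units = side(producer)
    if abs(units.index(producer) - units.index(consumer)) == 1:
        return 0  # neighbours can be chained
    return 1      # same side, but not neighbours


def sequence_cost(units):
    """Sum the assumed transfer costs along a chain of operations."""
    return sum(transfer_cycles(a, b) for a, b in zip(units, units[1:]))


print("shift then mask:", transfer_cycles("SHL", "ROP2"))
print("shift then add :", transfer_cycles("SHL", "ASU"))
```

With a model like this, one could feed in the operator pairs that occur
most often in real code and compare a few candidate unit orderings.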

The next step is cool : given enough metal layers,
we can draw vertical wires to feed the multiply unit, located
"above" (seen from the top) the other units and the
register set. Because this unit is long and has several outputs,
vertical wires will be drawn wherever needed to meet the
horizontal main buses. I have neither the patience nor the skill to
draw that in ASCII but it seems to work on paper sketches.

All of the above is based on the fact that F-CPU uses
very wide data, so each unit will look
like a very thin column. For example, the ROP2 unit can be
10 times taller than wide and this will increase when 128-bit
(and wider) registers are used. So (for manual P&R), the usual
column datapath approach is necessary (modulo the unavoidable
clock and power problems). The behaviour of a software P&R
will be completely different and I'm not even speaking about FPGA.
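
A back-of-the-envelope Python sketch of how that aspect ratio grows. It
assumes the 10:1 figure refers to the 64-bit datapath, that the height of
a unit grows linearly with the register width and that its width stays
constant. These are assumptions made for illustration, not measurements.

```python
def aspect_ratio(bits, base_bits=64, base_ratio=10.0):
    """Height/width ratio of a column unit, assuming height scales
    linearly with the data width and the unit width does not change."""
    return base_ratio * bits / base_bits


for width in (64, 128, 256):
    print(width, "bits ->", aspect_ratio(width), ": 1")
```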

The rest of the circuit layout doesn't change from the old
layout pictures. I simply modified the "central hub" idea.
I hope we'll get access to a decent synthesiser that
allows floorplanning etc...
