
Re: [f-cpu] Alternative ROP2 Implementation



hi !

Michael Riepe wrote:

During those boring easter holidays ;) I have found another way to
implement the ROP2 unit. It's based on the formulas (in pseudo-C):

a & b      == b ?  a :  0   // and
a & ~b     == b ?  0 :  a   // andn
a ^ b      == b ? ~a :  a   // xor
a | b      == b ?  1 :  a   // or
~(a | b)   == b ?  0 : ~a   // nor
~(a ^ b)   == b ?  a : ~a   // xnor
a | ~b     == b ?  a :  1   // orn
~(a & b)   == b ? ~a :  1   // nand
              b ?  a :  c   // mux
              b ?  c :  a   // muxr (new: "reversed" mux)
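
The formulas above can be checked exhaustively on single bits. The following is only a sketch in plain C: the operation numbering (0 to 9, in the order of the list) and all function names are made up for this check and have nothing to do with the F-CPU opcode encoding.

```c
/* Operations are numbered 0..9 in the order of the list above.
 * This numbering is hypothetical, not the F-CPU opcode encoding. */
enum { ROP2_NOPS = 10 };

/* Left-hand side: each operation computed directly on single bits (0 or 1). */
static unsigned direct_form(int op, unsigned a, unsigned b, unsigned c)
{
    switch (op) {
    case 0:  return a & b;                          /* and   */
    case 1:  return a & (b ^ 1u);                   /* andn  */
    case 2:  return a ^ b;                          /* xor   */
    case 3:  return a | b;                          /* or    */
    case 4:  return (a | b) ^ 1u;                   /* nor   */
    case 5:  return (a ^ b) ^ 1u;                   /* xnor  */
    case 6:  return a | (b ^ 1u);                   /* orn   */
    case 7:  return (a & b) ^ 1u;                   /* nand  */
    case 8:  return (b & a) | ((b ^ 1u) & c);       /* mux   */
    default: return (b & c) | ((b ^ 1u) & a);       /* muxr  */
    }
}

/* Right-hand side: b selects between two of the signals 0, 1, a, ~a, c. */
static unsigned select_form(int op, unsigned a, unsigned b, unsigned c)
{
    unsigned na = a ^ 1u;                           /* ~a for a single bit */
    switch (op) {
    case 0:  return b ? a  : 0u;
    case 1:  return b ? 0u : a;
    case 2:  return b ? na : a;
    case 3:  return b ? 1u : a;
    case 4:  return b ? 0u : na;
    case 5:  return b ? a  : na;
    case 6:  return b ? a  : 1u;
    case 7:  return b ? na : 1u;
    case 8:  return b ? a  : c;
    default: return b ? c  : a;
    }
}

/* Counts the input combinations where the two forms disagree. */
static int count_mismatches(void)
{
    int bad = 0;
    for (int op = 0; op < ROP2_NOPS; op++)
        for (unsigned a = 0; a < 2; a++)
            for (unsigned b = 0; b < 2; b++)
                for (unsigned c = 0; c < 2; c++)
                    if (direct_form(op, a, b, c) != select_form(op, a, b, c))
                        bad++;
    return bad;
}
```

Calling count_mismatches() walks all 80 combinations of operation and input bits and returns the number of disagreements between the two columns.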

Note the similarity between and/andn and mux/muxr. The attached GIF
shows the actual implementation. The five signals: 0, 1, a, ~a and c,
are "precomputed" (only ~a and c actually need any gates) and passed to
two n-bit wide 8:1 muxes that are directly controlled by the opcode's
function bits. The individual bits of b then select from their outputs
(that's a row of <n> 1-bit 2:1 muxes).
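
As a rough word-level model of the datapath just described (not the VHDL in the package): two source selectors stand in for the two 8:1 muxes, and the final row of 2:1 muxes is written as a bitwise select on b. The message does not say how the opcode's function bits map to mux inputs, so the enum values and the names pick() and rop2_datapath() below are assumptions, and the selectors are passed in directly.

```c
#include <stdint.h>

/* The five precomputed signals. The numbering is hypothetical; the real
 * unit derives the mux selects from the opcode's function bits. */
enum src { SRC_ZERO, SRC_ONE, SRC_A, SRC_NOT_A, SRC_C };

/* One of the wide muxes: picks a whole word from the precomputed signals. */
static uint64_t pick(enum src s, uint64_t a, uint64_t c)
{
    switch (s) {
    case SRC_ZERO:  return 0;
    case SRC_ONE:   return ~(uint64_t)0;
    case SRC_A:     return a;
    case SRC_NOT_A: return ~a;
    default:        return c;            /* SRC_C */
    }
}

/* sel1 names the signal used where a bit of b is 1, sel0 where it is 0.
 * The last line models the row of 1-bit 2:1 muxes controlled by b. */
static uint64_t rop2_datapath(enum src sel1, enum src sel0,
                              uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t when_one  = pick(sel1, a, c);
    uint64_t when_zero = pick(sel0, a, c);
    return (b & when_one) | (~b & when_zero);
}
```

In this model, rop2_datapath(SRC_NOT_A, SRC_A, a, b, 0) gives a ^ b and rop2_datapath(SRC_A, SRC_C, a, b, c) gives the mux. Since b only enters in the last line, it is the one operand that can arrive late.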

i believe that this implementation is actually slower than the previous version.
The reason is simple: you implement the equivalent of a 16-input MUX,
whereas the previous version uses a 4-input MUX as its core.
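
For comparison, here is the generic truth-table form of a 4:1-mux ROP2 in C. The earlier unit is not shown in this message, so this is an assumption about its general shape and not a copy of it: each bit pair (a, b) acts as the select of a 4:1 mux whose four data inputs are function bits. The name rop2_mux4() and the bit order of f are made up for the example.

```c
#include <stdint.h>

/* f is a 4-bit truth table (hypothetical encoding): bit number (2*a + b)
 * of f is the result for that combination of a and b. Each term below is
 * one data input of the 4:1 mux, enabled for the bit positions that
 * select it. */
static uint64_t rop2_mux4(unsigned f, uint64_t a, uint64_t b)
{
    uint64_t y = 0;
    if (f & 1u) y |= ~a & ~b;   /* input 0: a = 0, b = 0 */
    if (f & 2u) y |= ~a &  b;   /* input 1: a = 0, b = 1 */
    if (f & 4u) y |=  a & ~b;   /* input 2: a = 1, b = 0 */
    if (f & 8u) y |=  a &  b;   /* input 3: a = 1, b = 1 */
    return y;
}
```

With this encoding, f = 8 is and, f = 6 is xor, f = 14 is or and f = 7 is nand. Both a and b sit on the select lines here, so neither of them can arrive late.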

The main advantage of this kind of circuit is that the `b' operand signals
may arrive later than the rest. That makes it possible to put a SIMD <n>:2**<n>
decoder in front of it, which helps provide the full set of `bitop'
instructions:

band   y = a & (1 << b)      // also called btst
bandn  y = a & ~(1 << b)     // also called bclr
bxor   y = a ^ (1 << b)      // also called bchg
bor    y = a | (1 << b)      // also called bset
bnor   y = ~(a | (1 << b))   // new
bxnor  y = ~(a ^ (1 << b))   // new
born   y = a | ~(1 << b)     // new
bnand  y = ~(a & (1 << b))   // new

with a latency of just 1 cycle (which won't work with the SHL unit).
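
A minimal C sketch of the decoder-in-front idea, under assumptions the message does not confirm: the non-SIMD case uses the low 6 bits of b (a 6:64 decoder), and the SIMD case is shown only for 8-bit chunks, where each byte of b supplies a 3-bit index for its own lane (a 3:8 decoder per lane). All function names are hypothetical.

```c
#include <stdint.h>

/* Non-SIMD 6:64 decoder: one-hot mask from the low 6 bits of b. */
static uint64_t decode64(uint64_t b)
{
    return (uint64_t)1 << (b & 63u);
}

/* SIMD decoder for 8-bit chunks (assumed layout): the low 3 bits of each
 * byte of b pick one bit inside the same byte lane. */
static uint64_t decode8x8(uint64_t b)
{
    uint64_t mask = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned idx = (unsigned)(b >> (8 * lane)) & 7u;
        mask |= (uint64_t)1 << (8 * lane + idx);
    }
    return mask;
}

/* The decoded mask then takes the place of b in the select stage. */
static uint64_t bchg(uint64_t a, uint64_t b)    /* bxor: b ? ~a : a */
{
    uint64_t m = decode64(b);
    return (m & ~a) | (~m & a);
}

static uint64_t bset(uint64_t a, uint64_t b)    /* bor: b ? 1 : a */
{
    uint64_t m = decode64(b);
    return (m & ~(uint64_t)0) | (~m & a);
}

static uint64_t bclr(uint64_t a, uint64_t b)    /* bandn: b ? 0 : a */
{
    uint64_t m = decode64(b);
    return ~m & a;
}
```

The decoder output is only needed at the final select, which is why a late `b' input leaves room for it within the cycle in this model.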

is that the only reason you rebuilt this unit?

The fcpu-mr-rop2-20030421.tar.gz package (second attachment) contains a
rewrite of the ROP2 unit that supports all instructions mentioned above,
as well as combine mode up to a chunk size of 64 bits (but only for the
ordinary logical operators, not for bitop -- I doubt that it makes sense
for them).  Latency is critical in combine mode (I had to violate the
6G rule again, but I still obey the 10T rule), therefore I'd like to
receive synthesis and speed reports.

The unit has been tested with both Simili and Vanilla, the testbench
I used is included in the package.  You'll also need some stuff from
the `common' directory; see eu_rop2/Makefile for details.

i'd really be interested to see the speed differences ....
and we'll take the faster version, of course :-)
but i don't want to bury the old plain MUX-4 version
which can still be useful.

btw, does your bitop version support SIMD operations ?

YG


