Re: [f-cpu] Late answer on "

On Wed, Jun 19, 2002 at 02:24:38AM +0200, Yann Guidon wrote:
> > On the other hand, two slightly
> > different instructions would be sufficient for *all* word sizes:
> > 
> >     loadcons $imm17, reg    // similar to the original `loadconsx'
> >     => reg := sign_extend(imm17)
> > 
> >     loadconsp $imm16, reg   // `p' means `partial'
> >     => reg := shift_left(reg, 16) | imm16
> > 
> > Values between -65536 and 65535, inclusively, can be loaded with a
> > single instruction, 32-bit values need two instructions, and so on.
> > This solution is more general than the original loadcons[x] instructions
> > and IMHO also much more elegant.
> do you mean that you include the SHL in the pipeline?

Heck no! It's just a hardwired 16-bit left shift:

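	if LOADCONSP = '1' then
		-- old bits 47..0 move up by 16, the new imm16 fills the bottom
		Register_In(63 downto 16) <= Register_Out(47 downto 0);
		Register_In(15 downto 0) <= Data_from_Xbar(15 downto 0);
	else
		-- straight-through: plain register write
		Register_In <= Data_from_Xbar;
	end if;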

BTW: If you use a 3-way MUX, you can also do zero extension.  Add another
input and you can choose between old loadcons, new loadcons, zero
extension and `straight-through' mode. Fact is that you *will* need a
MUX for both variants if we abandon partial writes.
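
For illustration, the semantics of the two proposed instructions can be
modelled in a few lines of Python. This is only a behavioural sketch: it
assumes 64-bit registers, and the function names and the MASK64 constant
are made up for the example, not taken from the F-CPU sources.

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit registers


def loadcons(imm17: int) -> int:
    """reg := sign_extend(imm17)"""
    imm17 &= 0x1FFFF
    if imm17 & 0x10000:          # bit 16 is the sign bit
        imm17 -= 0x20000
    return imm17 & MASK64        # two's complement in 64 bits


def loadconsp(reg: int, imm16: int) -> int:
    """reg := shift_left(reg, 16) | imm16"""
    return ((reg << 16) | (imm16 & 0xFFFF)) & MASK64


# A 32-bit constant needs two instructions:
r = loadcons(0x1234)
r = loadconsp(r, 0x5678)

# A small negative constant: sign-extend first, then shift in the low bits.
minus_two = loadconsp(loadcons(0x1FFFF), 0xFFFE)
```

Each further `loadconsp` adds 16 more bits, so a full 64-bit constant takes
one `loadcons` followed by three `loadconsp` in this model.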

> in that case, "strings" of consecutive loadcons will have a terrific
> latency ! The purpose of the previous version was clearly to allow
> the programmer to issue 4 loadcons in 4 cycles, in a row.

That should be possible.

> > Since we need 8 bits for the opcode and 6 bits for the destination
> > register, we can encode all variants using only a single opcode (compared
> > to 8 opcodes for loadcons[x]):
> given the relative usefulness of loadcons, allocating 8 opcodes is not
> completely unjustified.

IMHO it is.

> >          8   + 1 + 1 +   16  +  6  = 32 bits
> >     +--------+---+---+-------+-----+
> >     | opcode | P | S | imm16 | reg |
> >     +--------+---+---+-------+-----+
> > 
> >         P=0 => load full register; S is the sign bit
> >         P=1 => load least significant 16 bits of the register; S is ignored
> > 
> > In case you didn't notice it: the same encoding is used by `loadaddri[d]'.
> thanks for the remark, but `loadaddri[d]' doesn't use SHL...

Neither does loadconsp :P
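
The quoted 8 + 1 + 1 + 16 + 6 layout can be packed and unpacked as sketched
below. Two things here are assumptions made for the example: the fields are
placed from the most significant bit downwards in the order drawn in the
diagram, and the opcode value 0x42 is hypothetical.

```python
OPCODE_LOADCONS = 0x42  # hypothetical opcode value, for illustration only


def encode(opcode: int, p: int, s: int, imm16: int, reg: int) -> int:
    """Pack | opcode(8) | P(1) | S(1) | imm16(16) | reg(6) | into 32 bits."""
    if not (0 <= opcode < 256 and p in (0, 1) and s in (0, 1)
            and 0 <= imm16 < 65536 and 0 <= reg < 64):
        raise ValueError("field out of range")
    return (opcode << 24) | (p << 23) | (s << 22) | (imm16 << 6) | reg


def decode(word: int):
    """Return (opcode, P, S, imm16, reg) from a 32-bit instruction word."""
    return ((word >> 24) & 0xFF, (word >> 23) & 1, (word >> 22) & 1,
            (word >> 6) & 0xFFFF, word & 0x3F)


def immediate(p: int, s: int, imm16: int) -> int:
    """P=0: S is the sign bit of a 17-bit value; P=1: S is ignored."""
    return imm16 if p else imm16 - (s << 16)


word = encode(OPCODE_LOADCONS, 1, 0, 0x33BB, 42)
fields = decode(word)
```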

> > Implementing the new `loadcons' is simple: the decoder sign-extends the
> > immediate value and sends it along. `loadconsp' is a little more tricky
> > because it needs a `feedback loop' from one of the register set's read
> > ports to one of the write ports. Fortunately, the left shift and the
> > `or' operations take almost no time (we need an extra mux, the rest is
> > just a bunch of wires).
> I am more and more reluctant to perform shifts on the Xbar.
> I thought we could perform some bit-reversing there, for example,
> but in practice it's too difficult to manage. And how do you
> manage the bypasses ?... i don't want this to become yet
> another naughty hack.

Why bypasses? Constants are supposed to go into the register set
directly, aren't they?

If the loadcons[p] instructions are always issued in-order, there is
another way to implement them: add an `accumulator' register to the
instruction decoder. That is, a load will look like this:

    loadcons 0x7777, r0     // acc = 0x0000000000007777;
    // maybe do something else here
    loadconsp 0x33bb, r0    // acc = 0x00000000777733bb;
    // maybe do something else here
    loadconsp 0x1919, r42   // acc = 0x0000777733bb1919; r42 = acc;

Note that destination register `r0' is used as a synonym for `accumulate
but do not write'.

	+ less pressure on the register set (ports remain free)
	+ feedback loop is local, not wrapped around the register set
	+ less timing critical!
	+ can be interleaved with other instructions (except loadcons)
	- disables well-known loadcons tricks (but probably enables others)
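
The accumulator variant can be sketched as a small Python model. It
assumes 64-bit registers and a 64-entry register set; the class and method
names are invented for the sketch, and the only behaviour modelled is the
accumulate step and the "r0 means do not write" rule described above.

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit registers


class Decoder:
    """Toy model of a decoder with a local constant accumulator."""

    def __init__(self):
        self.acc = 0
        self.regs = [0] * 64

    def _commit(self, rd: int) -> None:
        if rd != 0:                  # r0: accumulate but do not write
            self.regs[rd] = self.acc

    def loadcons(self, imm17: int, rd: int) -> None:
        imm17 &= 0x1FFFF
        if imm17 & 0x10000:          # sign-extend the 17-bit immediate
            imm17 -= 0x20000
        self.acc = imm17 & MASK64
        self._commit(rd)

    def loadconsp(self, imm16: int, rd: int) -> None:
        # the feedback loop stays inside the decoder
        self.acc = ((self.acc << 16) | (imm16 & 0xFFFF)) & MASK64
        self._commit(rd)


d = Decoder()
d.loadcons(0x7777, 0)
d.loadconsp(0x33BB, 0)
d.loadconsp(0x1919, 42)
```

In this model the register set is written exactly once, by the last
instruction of the sequence; the earlier ones only touch `acc`.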

If the value is used immediately after the load, it can also be sent to
the EU directly while the value is moved to the destination register. That
is, the sequence

	loadcons 0x12, r1
	mul r1, r2, r3

will take only one cycle longer than

	muli 0x12, r2, r3

while it allows bigger (17-bit) constants.

> i don't want to use the "shift" approach. I don't know for the ALPHA,
> but even MIPS uses a specific instruction to load the MSB with a constant.

SPARC as well (but they split the register after 22 bits).

> The "relative" approach increases the dependencies between the operations,
> while the "absolute" way does not require an order. I remember that Cedric
> used loadcons optimisations to create a specific constant in his RC5 code...
> the "old" loadcons can still be done without partial writes, like you
> said, with another MUX in the CDP. ok.
> But remember that a shift requires a certain amount of Silicon surface,
> much more than a simple mux, and it depends on the number of wires to cross.

Since it's not really a shift, it requires a 48-bit MUX, 48 wires and
1 control line.  A normal (unshifted) feedback, as needed for the old
loadcons, requires four 16-bit MUXes, 64 wires and 4 control lines. Not
a big difference (and my version is actually cheaper).
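
The two feedback paths can be compared with a short Python model. It
assumes 64-bit registers and assumes that the old loadcons writes one
16-bit lane while the other three lanes are fed back unchanged; the
function names are made up for the sketch. It also shows the ordering
point raised in the quoted text: lane writes commute, shifted writes
do not.

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit registers


def feedback_shifted(reg_out: int, imm16: int) -> int:
    """New scheme: one 48-bit mux, bits 63..16 take reg_out bits 47..0."""
    return ((reg_out << 16) | (imm16 & 0xFFFF)) & MASK64


def feedback_lanes(reg_out: int, imm16: int, lane: int) -> int:
    """Old scheme (assumed): four 16-bit muxes, one lane takes the immediate."""
    shift = 16 * lane
    keep = reg_out & ~(0xFFFF << shift) & MASK64
    return keep | ((imm16 & 0xFFFF) << shift)


# "Absolute" lane writes give the same result in either order ...
a = feedback_lanes(feedback_lanes(0, 0x1234, 1), 0x5678, 0)
b = feedback_lanes(feedback_lanes(0, 0x5678, 0), 0x1234, 1)

# ... while "relative" shifted writes depend on the order of issue.
c = feedback_shifted(feedback_shifted(0, 0x1234), 0x5678)
d = feedback_shifted(feedback_shifted(0, 0x5678), 0x1234)
```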

> My conclusion: partial writes are being abandoned but
> the "old" loadcons is still useful and easy to do.
> I don't even think that there will be a problem.
> It's just like a "move" instruction but with a modified
> datapath.

I guess I should be satisfied, but I'm not :)

 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"