Re: [f-cpu] Late answer

To: f-cpu@seul.org
Subject: Re: [f-cpu] Late answer
From: Yann Guidon <whygee@f-cpu.org>
Date: Thu, 20 Jun 2002 05:52:29 +0200
Delivered-To: archiver@seul.org
Delivered-To: f-cpu-outgoing@seul.org
Delivered-To: f-cpu@seul.org
Delivery-Date: Wed, 19 Jun 2002 23:44:44 -0400
Organization: http://www.f-cpu.org
References: <3D0FCF46.1584D826@f-cpu.org> <20020619154718.58467@thrai.stud.uni-hannover.de>
Reply-To: f-cpu@seul.org
Sender: owner-f-cpu@seul.org

hi !

Michael Riepe wrote:
> On Wed, Jun 19, 2002 at 02:24:38AM +0200, Yann Guidon wrote:
> [...]
> > > On the other hand, two slightly
> > > different instructions would be sufficient for *all* word sizes:
> > >
> > >     loadcons $imm17, reg    // similar to the original `loadconsx'
> > >     => reg := sign_extend(imm17)
> > >
> > >     loadconsp $imm16, reg   // `p' means `partial'
> > >     => reg := shift_left(reg, 16) | imm16
> > >
> > > Values between -65536 and 65535, inclusively, can be loaded with a
> > > single instruction, 32-bit values need two instructions, and so on.
> > > This solution is more general than the original loadcons[x] instructions
> > > and IMHO also much more elegant.
> > do you mean that you include the SHL in the pipeline ?
> Heck no! It's just a hardwired 16-bit left shift:
> 
>         if LOADCONSP = '1' then
>                 Register_In(63 downto 16) <= Register_Out(47 downto 0);
>                 Register_In(15 downto 0) <= Data_from_Xbar(15 downto 0);
>         else
>                 Register_In <= Data_from_Xbar;
>         end if;

i figured this later... and it is not a "clean" option, if we consider
future cores that don't work the same way as FC0. #If# the future cores
don't use an in-order pipeline or #if# some kind of "translation"
is performed on the instruction, then the operation will pass through
different stages.
Now imagine a crazy coder like us, who reads the description of this
instruction : he thinks "great ! a zero-latency shift instruction !"
and we'll soon see this instruction used for things completely unrelated
to constant loading... And the coders will be disappointed when another
core performs the instruction differently.

The second objection deals with the surface of the shift on the Xbar.
see below.
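
For reference, the two instructions proposed in the quoted text can be
modelled in a few lines. This is only a behavioural sketch, assuming a
64-bit register; the function names (sign_extend, load64) are made up for
the illustration, and nothing here says how a given core would implement
the operation.

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit registers


def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of value to 64 bits."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & MASK64


def loadcons(imm17):
    """reg := sign_extend(imm17)"""
    return sign_extend(imm17, 17)


def loadconsp(reg, imm16):
    """reg := shift_left(reg, 16) | imm16"""
    return ((reg << 16) | (imm16 & 0xFFFF)) & MASK64


def load64(value):
    """Hypothetical helper: build a 64-bit constant with 1 + 3 instructions."""
    reg = loadcons((value >> 48) & 0xFFFF)
    for shift in (32, 16, 0):
        reg = loadconsp(reg, (value >> shift) & 0xFFFF)
    return reg


print(hex(load64(0x0123456789ABCDEF)))
```

Each loadconsp reads the result of the previous one, so the four
instructions of load64 form a dependency chain.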

> BTW: If you use a 3-way MUX, you can also do zero extension.  Add another
> input and you can choose between old loadcons, new loadcons, zero
> extension and `straight-through' mode. Fact is that you *will* need a
> MUX for both variants if we abandon partial writes.
sure, but i prefer a simple 2-input mux.
Though there are several kinds of immediates, which are treated/expanded
during the decode stage. Only the result of the decoded constant
will feed the next mux....

And i forgot to mention a "2nd order bypass", which remembers the last
word written to the R7 : R7's latency is so high that data needs
2 cycles to go in and out, between the time it is written and the time
it can be read again... So in fact, the above mux requires 4 ports :
1 for constants, 1 for register output, and 1 for each "old" write
port value... so it's full now.

> > in that case, "strings" of consecutive loadcons will have a terrific
> > latency ! The purpose of the previous version was clearly to allow
> > the programmer to issue 4 loadcons in 4 cycles, in a row.
> That should be possible.
at what price ?

> > > Since we need 8 bits for the opcode and 6 bits for the destination
> > > register, we can encode all variants using only a single opcode (compared
> > > to 8 opcodes for loadcons[x]):
> > given the relative usefulness of loadcons, allocating 8 opcodes is not
> > completely unjustified.
> IMHO it is.
mmmm we could limit the constants to 64 bits and free 2 bits / 4(8) opcodes ?

> > >          8   + 1 + 1 +   16  +  6  = 32 bits
> > >     +--------+---+---+-------+-----+
> > >     | opcode | P | S | imm16 | reg |
> > >     +--------+---+---+-------+-----+
> > >
> > >         P=0 => load full register; S is the sign bit
> > >         P=1 => load least significant 16 bits of the register; S is ignored
> > >
> > > In case you didn't notice it: the same encoding is used by `loadaddri[d]'.
> > thanks for the remark, but `loadaddri[d]' doesn't use SHL...
> Neither does loadconsp :P
but loadaddri uses the ASU to compute PC-relative pointers (i KNEW there
was a flaw in what you claimed ;-D)
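
The 32-bit layout quoted above can be packed and unpacked as follows.
This is a sketch under assumptions: the fields are taken to run from the
most significant bit (opcode) to the least significant bit (reg) in the
order they are drawn, and the opcode value used in the example is
arbitrary, not an actual F-CPU opcode.

```python
def encode(opcode, p, s, imm16, reg):
    """Pack | opcode(8) | P(1) | S(1) | imm16(16) | reg(6) | into 32 bits."""
    if not (0 <= opcode < 256 and p in (0, 1) and s in (0, 1)):
        raise ValueError("opcode, P or S out of range")
    if not (0 <= imm16 < (1 << 16) and 0 <= reg < 64):
        raise ValueError("imm16 or reg out of range")
    return (opcode << 24) | (p << 23) | (s << 22) | (imm16 << 6) | reg


def decode(word):
    """Return (opcode, P, S, imm16, reg) from a 32-bit instruction word."""
    return (
        (word >> 24) & 0xFF,
        (word >> 23) & 1,
        (word >> 22) & 1,
        (word >> 6) & 0xFFFF,
        word & 0x3F,
    )


# 0xAB is a placeholder opcode chosen for the example only.
word = encode(0xAB, 1, 0, 0x33BB, 42)
print(hex(word), decode(word))
```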

> > > Implementing the new `loadcons' is simple: the decoder sign-extends the
> > > immediate value and sends it along. `loadconsp' is a little more tricky
> > > because it needs a `feedback loop' from one of the register set's read
> > > ports to one of the write ports. Fortunately, the left shift and the
> > > `or' operations take almost no time (we need an extra mux, the rest is
> > > just a bunch of wires).
> >
> > I am more and more reluctant to perform shifts on the Xbar.
> > I thought we could perform some bit-reversing there, for example,
> > but in practice it's too difficult to manage. And how do you
> > manage the bypasses ?... i don't want this to become yet
> > another naughty hack.
> 
> Why bypasses? Constants are supposed to go into the register set
> directly, aren't they?

not directly, otherwise we can't do
  loadconsx 0x1234, r1
  add r1, r2, r3
there would be some bypass troubles. To keep things simple, all
the write operations MUST share the same datapath, including
the R7 read, Xbar read cycle, Xbar write cycle and R7 writeback.
If we writeback after the read cycle, it creates new bypass conditions...

If loadcons and move use the 2 Xbar (read then write) cycles,
it's not slower from the user point of view and it doesn't create
special bypass conditions.

> If the loadcons[p] instructions are always issued in-order, there is
> another way to implement them: add an `accumulator' register to the
> instruction decoder. That is, a load will look like this:
> 
>     loadcons 0x7777, r0     // acc = 0x0000000000007777;
>     // maybe do something else here
>     loadconsp 0x33bb, r0    // acc = 0x00000000777733bb;
>     // maybe do something else here
>     loadconsp 0x1919, r42   // acc = 0x0000777733bb1919; r42 = acc;
> 
> Note that destination register `r0' is used as a synonym for `accumulate
> but do not write'.

hmmmm i thought about this for a while (so i won't blame you :-P)
but quickly abandoned this idea. What would happen if an IRQ fired
in the middle of this sequence ? r0 would be lost and we wouldn't know
where to write its contents :-( So loadcons MUST always specify the
destination.

>         + less pressure on the register set (ports remain free)
>         + feedback loop is local, not wrapped around the register set
>         + less timing critical!
>         + can be interleaved with other instructions (except loadcons)
>         - disables well-known loadcons tricks (but probably enables others)
but major flaw with interrupts and exceptions/traps :-(
in other words : it's not "atomic" and it relies on the state of a single
register (so it might become a bottleneck later) ...
and if there is a state somewhere, it might confuse the compilers as well...
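
The interrupt problem can be shown with a toy model of the accumulator
scheme. Everything here is an assumption made for the illustration: the
class and function names are invented, sign extension is left out, and the
"handler" is simply some code that itself uses loadcons while the hidden
accumulator is not saved or restored.

```python
MASK64 = (1 << 64) - 1


class AccDecoder:
    """Toy model: the decoder owns one hidden accumulator register."""

    def __init__(self):
        self.acc = 0
        self.regs = [0] * 64

    def _commit(self, rd):
        # r0 as destination means "accumulate but do not write".
        if rd != 0:
            self.regs[rd] = self.acc

    def loadcons(self, imm16, rd):
        self.acc = imm16 & 0xFFFF  # sign extension omitted in this sketch
        self._commit(rd)

    def loadconsp(self, imm16, rd):
        self.acc = ((self.acc << 16) | (imm16 & 0xFFFF)) & MASK64
        self._commit(rd)


def run_sequence(irq_handler=None):
    """Run the three-instruction sequence, optionally interrupted."""
    cpu = AccDecoder()
    cpu.loadcons(0x7777, 0)
    cpu.loadconsp(0x33BB, 0)
    if irq_handler is not None:
        irq_handler(cpu)  # IRQ fires in the middle of the sequence
    cpu.loadconsp(0x1919, 42)
    return cpu.regs[42]


def handler(cpu):
    # Hypothetical handler that loads a constant of its own.
    cpu.loadcons(0x00FF, 5)


print(hex(run_sequence()))         # 0x777733bb1919
print(hex(run_sequence(handler)))  # 0xff1919
```

In the interrupted run, r42 receives a value built from the handler's
constant, because the partially accumulated state lives nowhere that a
normal context save would reach.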

> [...]
> > i don't want to use the "shift" approach. I don't know for the ALPHA,
> > but even MIPS uses a specific instruction to load the MSB with a constant.
> SPARC as well (but they split the register after 22 bits).
> 
> > The "relative" approach increases the dependencies between the operations,
> > while the "absolute" way does not require an order. I remember that Cedric
> > used loadcons optimisations to create a specific constant in his RC5 code...
> >
> > the "old" loadcons can still be done without partial writes, like you
> > said, with another MUX in the CDP. ok.
> > But remember that a shift requires a certain amount of Silicon surface,
> > much more than a simple mux, and it depends on the number of wires to cross.
> Since it's not really a shift,
but IT IS A SHIFT ! it's not a barrel shifter, but you need at least another
metal layer to bring a bunch of wires up to 16 positions.
Given the mask rules of 1Lambda for a wire width and 2L for spacing
(as a rough estimate), then your shift will consume a surface of at least
((2+1)*16) * 48 Lambdas. And since oblique routing (45° wires) is not usual,
it's going to take even more. OTOH, a straight line consumes far less wire
and surface.
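
The rough estimate can be spelled out as arithmetic. The sketch below
reads it as one wire pitch (1 Lambda of width plus 2 Lambdas of spacing)
per bit position shifted, times the number of shifted wires; that reading,
the function name and the result being expressed in square Lambdas are
assumptions, and real layout rules would change the numbers.

```python
WIRE_WIDTH = 1    # Lambdas, rough mask rule from the text
WIRE_SPACING = 2  # Lambdas, rough mask rule from the text


def shift_routing_area(shift_positions, wires):
    """Lower-bound area estimate for a hardwired shift (assumed model)."""
    pitch = WIRE_WIDTH + WIRE_SPACING
    return pitch * shift_positions * wires


# 48 wires, each displaced by 16 bit positions
print(shift_routing_area(16, 48))  # 2304
```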

> it requires a 48-bit MUX, 48 wires and 1 control line.
> A normal (unshifted) feedback, as needed for the old
> loadcons, requires four 16-bit MUXes, 64 wires and 4 control lines. Not
> a big difference (and my version is actually cheaper).
not cheaper in wire length because the number of wires you shift
(16 in this case) is also the minimal distance between 2 gates.
Unless you eat up all the routing/metal levels...
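
The "absolute" versus "relative" distinction mentioned in the quoted text
above can be checked with a small experiment. It assumes that the old
loadcons writes a 16-bit immediate into a chosen 16-bit slot of the
register and leaves the other slots untouched; the function names and the
chunk values are made up for the example.

```python
from itertools import permutations

MASK64 = (1 << 64) - 1
# slot number -> 16-bit chunk of the target constant 0x0123456789ABCDEF
CHUNKS = {3: 0x0123, 2: 0x4567, 1: 0x89AB, 0: 0xCDEF}


def loadcons_abs(reg, imm16, slot):
    """Assumed old-style loadcons: replace one 16-bit slot, keep the rest."""
    shift = 16 * slot
    cleared = reg & ~(0xFFFF << shift)
    return (cleared | ((imm16 & 0xFFFF) << shift)) & MASK64


def loadconsp(reg, imm16):
    """Shift-based variant: reg := (reg << 16) | imm16."""
    return ((reg << 16) | (imm16 & 0xFFFF)) & MASK64


def build_absolute(order):
    reg = 0
    for slot in order:
        reg = loadcons_abs(reg, CHUNKS[slot], slot)
    return reg


def build_relative(order):
    reg = 0
    for slot in order:
        reg = loadconsp(reg, CHUNKS[slot])
    return reg


orders = list(permutations(CHUNKS))
absolute_results = {build_absolute(o) for o in orders}
relative_results = {build_relative(o) for o in orders}
print(len(absolute_results), len(relative_results))  # 1 24
```

All 24 orderings of the slot-based loads give the same register value,
while the shift-based loads give a different value for every ordering, so
only the former can be freely reordered or interleaved.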

> > My conclusion : partial writes are being abandoned but
> > the "old" loadcons is still useful and easy to do.
> > I don't even think that there will be a problem.
> > It's just like a "move" instruction but with a modified
> > datapath.
> I guess I should be satisfied, but I'm not :)
Don't worry, you can confide yourself to Doktor Guidon ;-)

>  Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~