
Re: [f-cpu] Late answer



On Thu, Jun 20, 2002 at 05:52:29AM +0200, Yann Guidon wrote:
[loadconsp]
> > Heck no! It's just a hardwired 16-bit left shift:
> > 
> >         if LOADCONSP = '1' then
> >                 Register_In(63 downto 16) <= Register_Out(47 downto 0);
> >                 Register_In(15 downto 0) <= Data_from_Xbar(15 downto 0);
> >         else
> >                 Register_In <= Data_from_Xbar;
> >         end if;
> 
> i figured this later... and it is not a "clean" option, if we consider
> future cores that don't work the same way as FC0. #If# the future cores
> don't use an in-order pipeline or #if# some kind of "translation"
> is performed on the instruction, then the operation will pass through
> different stages.

But it will still perform the same operation. It has to, for
compatibility.
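
For reference, whatever the pipeline looks like, the architectural result
of the quoted VHDL can be written down as a small behavioural model. This
is only a sketch in Python; the function name and the explicit 64-bit mask
are my own choices for illustration, not part of any F-CPU source.

```python
MASK64 = (1 << 64) - 1  # assumes 64-bit registers, as in the quoted VHDL


def loadconsp(old_value, imm16):
    """Model of the hardwired shift: the old bits 47..0 move up to 63..16
    and the 16-bit immediate fills bits 15..0."""
    return ((old_value << 16) | (imm16 & 0xFFFF)) & MASK64


# Building a constant in two steps:
value = loadconsp(0x7777, 0x33BB)   # 0x777733bb
value = loadconsp(value, 0x1919)    # 0x777733bb1919
```

Any core that claims compatibility has to produce these values, no matter
how many stages the operation passes through.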

> Now imagine a crazy coder like us, who reads the description of this
> instruction : he thinks "great ! a zero-latency shift instruction !"
> and we'll soon see this instruction used for things completely unrelated
> to constant loading... And the coders will be disappointed when another
> core performs the instruction differently.

It can't. But it may be slower -> not our problem. If some weirdo uses
an instruction for purposes it wasn't meant for, ... ;)

On the other hand, I'm one of those weirdoes myself... ;)

> The second objection deals with the surface of the shift on the Xbar.
> see below.

Hmm...

> > BTW: If you use a 3-way MUX, you can also do zero extension.  Add another
> > input and you can choose between old loadcons, new loadcons, zero
> > extension and `straight-through' mode. Fact is that you *will* need a
> > MUX for both variants if we abandon partial writes.
> sure, but i prefer a simple 2-input mux.
> Though there are several kinds of immediates, which are treated/expanded
> during the decode stage. Only the result of the decoded constant
> will feed the next mux....
> 
> And i forgot to mention a "2nd order bypass", which remembers the last
> word written to the R7, because R7's latency is so high that it needs
> 2 cycles for data to go in and out, between the time it is written to
> when it is read again... So in fact, the above mux requires 4 ports :
> 1 for constants, 1 for register output, and 1 for each "old" write
> port value... so it's full now.

Oh boy... 2nd order bypass? *sigh* Sounds like `angina pectoris'.

[...]
> > > given the relative usefulness of loadcons, allocating 8 opcodes is not
> > > completely unjustified.
> > IMHO it is.
> mmmm we could limit the constants to 64 bits and free 2 bits / 4(8) opcodes ?

We can free 7 opcodes without limiting constant size.

> > > >          8   + 1 + 1 +   16  +  6  = 32 bits
> > > >     +--------+---+---+-------+-----+
> > > >     | opcode | P | S | imm16 | reg |
> > > >     +--------+---+---+-------+-----+
> > > >
> > > >         P=0 => load full register; S is the sign bit
> > > >         P=1 => load least significant 16 bits of the register; S is ignored
> > > >
> > > > In case you didn't notice it: the same encoding is used by `loadaddri[d]'.
> > > thanks for the remark, but `loadaddri[d]' doesn't use SHL...
> > Neither does loadconsp :P
> but loadaddri uses the ASU to compute PC-relative pointers (i KNEW there
> was a flaw in what you claimed ;-D)

The encoding is still the same, though. I didn't claim that it uses
the same EU, did I? ;)
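
To make the encoding concrete, here is a sketch of how the 32-bit word
could be packed and unpacked. It assumes that the leftmost field in the
diagram is the most significant one (opcode in bits 31..24, P in bit 23,
S in bit 22, imm16 in bits 21..6, reg in bits 5..0); the opcode value in
the example is made up.

```python
def encode(opcode, p, s, imm16, reg):
    """Pack the 8 + 1 + 1 + 16 + 6 bit fields into one 32-bit word."""
    if not (0 <= opcode < 256 and p in (0, 1) and s in (0, 1)
            and 0 <= imm16 < 65536 and 0 <= reg < 64):
        raise ValueError("field out of range")
    return (opcode << 24) | (p << 23) | (s << 22) | (imm16 << 6) | reg


def decode(word):
    """Unpack a 32-bit word into (opcode, P, S, imm16, reg)."""
    return ((word >> 24) & 0xFF,
            (word >> 23) & 1,
            (word >> 22) & 1,
            (word >> 6) & 0xFFFF,
            word & 0x3F)


# 0x12 is a hypothetical opcode; P=1 selects the partial load.
word = encode(0x12, 1, 0, 0x33BB, 42)
fields = decode(word)
```

The point is only that the field layout is shared; which EU consumes the
decoded fields is a separate question.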

[...]
> > Why bypasses? Constants are supposed to go into the register set
> > directly, aren't they?
> 
> not directly, otherwise we can't do
>   loadconsx 0x1234, r1
>   add r1, r2, r3
> there would be some bypass troubles. To keep things simple, all
> the write operations MUST share the same datapath, including
> the R7 read, Xbar read cycle, Xbar write cycle and R7 writeback.
> If we writeback after the read cycle, it creates new bypass conditions...

You can't bypass when you load a partial constant:

	loadconsx.1 0, r1
	// no bypass possible!
	add r1, r2, r3

(ironically, this is one way to zero-extend a 16- or 32-bit value).

When part of the value comes from the register set and the rest from
the decoder, you *have* to read the register - and you have to write
it *before*, or you'll need another MUX inside the datapath.
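
A small model shows why the old register value is needed. It assumes
that `loadconsx.n imm16, r' keeps the bits below field n, puts the
immediate into 16-bit field n and sign-extends it into everything above;
that is my reading of the instruction, so treat the semantics as an
assumption rather than a specification.

```python
MASK64 = (1 << 64) - 1  # assumes 64-bit registers


def loadconsx(old_value, imm16, n):
    """Assumed semantics of loadconsx.n: keep bits below field n, insert
    imm16 at field n, sign-extend into the upper bits."""
    shift = 16 * n
    low = old_value & ((1 << shift) - 1)   # this part comes from the register
    imm = imm16 & 0xFFFF
    if imm & 0x8000:                       # negative 16-bit immediate
        imm -= 1 << 16
    return ((imm << shift) | low) & MASK64


r1 = 0xDEADBEEFCAFE1234
zext16 = loadconsx(r1, 0, 1)   # keeps bits 15..0, clears the rest
zext32 = loadconsx(r1, 0, 2)   # keeps bits 31..0, clears the rest
```

The `low' term is the part that has to be read from the register set,
which is exactly why the result can't come from the decoder alone.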

[...]
> > If the loadcons[p] instructions are always issued in-order, there is
> > another way to implement them: add an `accumulator' register to the
> > instruction decoder. That is, a load will look like this:
> > 
> >     loadcons 0x7777, r0     // acc = 0x0000000000007777;
> >     // maybe do something else here
> >     loadconsp 0x33bb, r0    // acc = 0x00000000777733bb;
> >     // maybe do something else here
> >     loadconsp 0x1919, r42   // acc = 0x0000777733bb1919; r42 = acc;
> > 
> > Note that destination register `r0' is used as a synonym for `accumulate
> > but do not write'.
> 
> hmmmm i thought about this for a while (so i won't blame you :-P)
> but quickly abandoned this idea. What would happen if an IRQ fired
> in the middle of this sequence ? r0 would be lost and we wouldn't know
> where to write its contents :-( So loadcons MUST always specify the
> destination.

That's easy to solve, isn't it? The obvious solution is to include the
accumulator register in the SRB. When the IRQ arrives, the register is
automatically saved if it is going to be re-used inside the IRQ service
routine, and it is automatically restored when the IRQ handler returns.
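
As a sketch of what I mean: the class below models the decoder-side
accumulator and an interrupt that arrives in the middle of a sequence.
All names are made up for illustration, `loadcons' is assumed to
zero-extend its immediate, and the save is done eagerly on every
interrupt, whereas the real SRB would only save what the handler is
about to re-use.

```python
MASK64 = (1 << 64) - 1  # assumes 64-bit registers


class Decoder:
    """Toy model of a decoder with a loadcons accumulator."""

    def __init__(self):
        self.acc = 0
        self.regs = [0] * 64

    def _write(self, dest):
        if dest != 0:              # r0 means "accumulate but do not write"
            self.regs[dest] = self.acc

    def loadcons(self, imm16, dest):
        self.acc = imm16 & 0xFFFF  # assumption: zero extension
        self._write(dest)

    def loadconsp(self, imm16, dest):
        self.acc = ((self.acc << 16) | (imm16 & 0xFFFF)) & MASK64
        self._write(dest)


dec = Decoder()
dec.loadcons(0x7777, 0)
dec.loadconsp(0x33BB, 0)

saved_acc = dec.acc                # IRQ arrives: accumulator is saved
dec.loadcons(0x00FF, 5)            # the handler loads its own constant
dec.acc = saved_acc                # handler returns: accumulator restored

dec.loadconsp(0x1919, 42)          # r42 = 0x0000777733bb1919
```

The interrupted sequence completes with the same result as if the
handler had never run.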

> >         + less pressure on the register set (ports remain free)
> >         + feedback loop is local, not wrapped around the register set
> >         + less timing critical!
> >         + can be interleaved with other instructions (except loadcons)
> >         - disables well-known loadcons tricks (but probably enables others)
> but major flaw with interrupts and exceptions/traps :-(

Solved :)

> in other words : it's not "atomic" and it relies on the state of a single
> register (so it might become a bottleneck later) ...
> and if there is a state somewhere, it might confuse the compilers as well...

There's no need to become confused. Compilers can treat a loadcons
sequence as a single instruction, and assemblers can translate a single
`loadcons' into an appropriate sequence automatically. You just write

	loadcons some_large_constant, r23

and the assembler does the rest. A super-optimizing compiler will of
course have to handle loadcons itself (for instruction interleaving
etc.), but then it's also supposed to know what it's doing.
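
What the assembler would do for such a pseudo-instruction can be
sketched as follows. The function name and the tuple format are
hypothetical, and the sketch assumes the accumulator scheme with a
zero-extending first `loadcons' and `r0' as the "do not write" target.

```python
MASK64 = (1 << 64) - 1  # assumes 64-bit registers


def expand_loadcons(constant, dest):
    """Expand `loadcons constant, dest` into (mnemonic, imm16, reg) steps."""
    if not 1 <= dest <= 63:
        raise ValueError("dest must be r1..r63 (r0 means 'do not write')")
    value = constant & MASK64
    # Split into 16-bit chunks, most significant first.
    chunks = [(value >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]
    # Leading zero chunks are redundant if the first load zero-extends.
    while len(chunks) > 1 and chunks[0] == 0:
        chunks.pop(0)
    sequence = []
    for i, chunk in enumerate(chunks):
        mnemonic = "loadcons" if i == 0 else "loadconsp"
        target = dest if i == len(chunks) - 1 else 0
        sequence.append((mnemonic, chunk, target))
    return sequence


for mnemonic, imm, reg in expand_loadcons(0x777733BB1919, 23):
    print(f"\t{mnemonic} 0x{imm:04x}, r{reg}")
```

Small constants still cost a single instruction; only the last step
names the real destination register.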

[...]
> Given the mask rules of 1Lambda for a wire width and 2L for spacing
> (as a rough estimate), then your shift will consume a surface of at least
> (2+1*16) * 48 Lambdas. And since oblique routing (45° wires) is not usual,
> it's going to take even more. OTOH, a straight line consumes far fewer wires
> and surface.

Ok, that's a point. Then let's use the accumulator solution (where the
shift is moved outside the Xbar/EU complex). It also has the advantage
that it can't be abused as a `fast left shift' ;)

[...]
> > I guess I should be satisfied, but I'm not :)
> Don't worry, you can confide yourself to Doktor Guidon ;-)

Don't tell me about doctors... *sigh*

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"