
Re: [f-cpu] Late answer



hi !

Michael Riepe wrote:
> On Thu, Jun 20, 2002 at 05:52:29AM +0200, Yann Guidon wrote:
> [loadconsp]
> > > Heck no! It's just a hardwired 16-bit left shift:
> > >         if LOADCONSP = '1' then
> > >                 Register_In(63 downto 16) <= Register_Out(47 downto 0);
> > >                 Register_In(15 downto 0) <= Data_from_Xbar(15 downto 0);
> > >         else
> > >                 Register_In <= Data_from_Xbar;
> > >         end if;
> > i figured this later... and it is not a "clean" option, if we consider
> > future cores that don't work the same way as FC0. #If# the future cores
> > don't use an in-order pipeline or #if# some kind of "translation"
> > is performed on the instruction, then the operation will pass through
> > different stages.
> But it will still perform the same operation. It has to, for compatibility.
certainly, but that's only the first half of the problem. if loadcons
relies on naughty hacks, it will be much more difficult to implement later...
{insert thought about your "accumulator" proposal here}
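
to make the "zero-latency shift" worry concrete, here is a small python
model of the quoted VHDL. it is only a sketch: the function names are
invented for illustration and are not part of any F-CPU source.

```python
MASK64 = (1 << 64) - 1  # registers are modelled as 64-bit values


def loadconsp(reg_value: int, imm16: int) -> int:
    """Model of the quoted VHDL: bits 47..0 move up to 63..16,
    and the 16-bit immediate fills bits 15..0."""
    return ((reg_value << 16) | (imm16 & 0xFFFF)) & MASK64


def shl16(reg_value: int) -> int:
    """A plain 16-bit left shift, for comparison."""
    return (reg_value << 16) & MASK64


# Intended use: build a 64-bit constant from four 16-bit chunks.
r = 0
for chunk in (0x0123, 0x4567, 0x89AB, 0xCDEF):
    r = loadconsp(r, chunk)

# "Abuse": with a zero immediate the instruction is just a shift by 16.
x = 0x0000_1111_2222_3333
same_as_shift = loadconsp(x, 0) == shl16(x)
```

with a zero immediate the model is indistinguishable from a 16-bit SHL,
which is exactly what a coder would be tempted to exploit.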

> > Now imagine a crazy coder like us, who reads the description of this
> > instruction : he thinks "great ! a zero-latency shift instruction !"
> > and we'll soon see this instruction used for things completely unrelated
> > to constant loading... And the coders will be disappointed when another
> > core will perform the instruction differently.
> It can't. But it may be slower -> not our problem. If some weirdo uses
> an instruction for purposes it wasn't meant for, ... ;)
> On the other hand, I'm one of those weirdoes myself... ;)
now, you see what i mean.
now, imagine the crowd of people who have read Michael Abrash's
"Zen of Code Optimization" and who apply his saying "don't look at
an instruction for what it is meant to do, but for what it does"...

> > The second objection deals with the surface of the shift on the Xbar.
> > see below.
> Hmm...
sure...

> > > BTW: If you use a 3-way MUX, you can also do zero extension.  Add another
> > > input and you can choose between old loadcons, new loadcons, zero
> > > extension and `straight-through' mode. Fact is that you *will* need a
> > > MUX for both variants if we abandon partial writes.
> > sure, but i prefer a simple 2-input mux.
> > Though there are several kinds of immediates, which are treated/expanded
> > during the decode stage. Only the result of the decoded constant
> > will feed the next mux....
> >
> > And i forgot to mention a "2nd order bypass", which remembers the last
> > word written to the R7, because R7's latency is so high that it needs
> > 2 cycles for data to go in and out, between the time it is written
> > and the time it is read again... So in fact, the above mux requires 4 ports :
> > 1 for constants, 1 for register output, and 1 for each "old" write
> > port value... so it's full now.
> 
> Oh boy... 2nd order bypass? *sigh* Sounds like `angina pectoris'.

Fortunately, Doktor Guidon found the cure ;-P

> [...]
> > > > given the relative usefulness of loadcons, allocating 8 opcodes is not
> > > > completely unjustified.
> > > IMHO it is.
> > mmmm we could limit the constants to 64 bits and free 2 bits / 4(8) opcodes ?
> We can free 7 opcodes without limiting constant size.
groumpfh.

> > > > >          8   + 1 + 1 +   16  +  6  = 32 bits
> > > > >     +--------+---+---+-------+-----+
> > > > >     | opcode | P | S | imm16 | reg |
> > > > >     +--------+---+---+-------+-----+
> > > > >
> > > > >         P=0 => load full register; S is the sign bit
> > > > >         P=1 => load least significant 16 bits of the register; S is ignored
> > > > >
> > > > > In case you didn't notice it: the same encoding is used by `loadaddri[d]'.
> > > > thanks for the remark, but `loadaddri[d]' doesn't use SHL...
> > > Neither does loadconsp :P
> > but loadaddri uses the ASU to compute PC-relative pointers (i KNEW there
> > was a flaw in what you claimed ;-D)
> The encoding is still the same, though. I didn't claim that it uses
> the same EU, did I? ;)
sorry, wrong question :-P
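
for reference, the 32-bit layout quoted above (8 + 1 + 1 + 16 + 6) can be
packed and unpacked as in the sketch below. the assumptions are mine: the
leftmost field of the diagram is taken as the most significant one, and
the opcode value 0x5A is purely hypothetical.

```python
# Field widths from the quoted diagram: opcode | P | S | imm16 | reg
OPCODE_BITS, P_BITS, S_BITS, IMM_BITS, REG_BITS = 8, 1, 1, 16, 6
WORD_BITS = OPCODE_BITS + P_BITS + S_BITS + IMM_BITS + REG_BITS


def encode(opcode: int, p: int, s: int, imm16: int, reg: int) -> int:
    """Pack the five fields into one instruction word (leftmost = MSB, assumed)."""
    fields = ((opcode, OPCODE_BITS), (p, P_BITS), (s, S_BITS),
              (imm16, IMM_BITS), (reg, REG_BITS))
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word: int) -> tuple:
    """Unpack an instruction word into (opcode, p, s, imm16, reg)."""
    reg = word & 0x3F
    imm16 = (word >> 6) & 0xFFFF
    s = (word >> 22) & 1
    p = (word >> 23) & 1
    opcode = (word >> 24) & 0xFF
    return opcode, p, s, imm16, reg


HYPOTHETICAL_OPCODE = 0x5A  # not a real F-CPU opcode
word = encode(HYPOTHETICAL_OPCODE, 1, 0, 0x33BB, 42)
```

the two extra bits (P and S) replace the separate opcodes, which is where
the "free 7 opcodes" claim comes from.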

> [...]
> > > Why bypasses? Constants are supposed to go into the register set
> > > directly, aren't they?
> > not directly, otherwise we can't do
> >   loadconsx 0x1234, r1
> >   add r1, r2, r3
> > there would be some bypass troubles. To keep things simple, all
> > the write operations MUST share the same datapath, including
> > the R7 read, Xbar read cycle, Xbar write cycle and R7 writeback.
> > If we writeback after the read cycle, it creates new bypass conditions...
> You can't bypass when you load a partial constant:
>         loadconsx.1 0, r1
>         // no bypass possible!
>         add r1, r2, r3

partial constants certainly ARE bypassable, otherwise what would be the
point of the MUXes in the Xbar read cycle ?

> (ironically, this is one way to zero-extend a 16- or 32-bit value).
> 
> When part of the value comes from the register set and the rest from
> the decoder, you *have* to read the register - and you have to write
> it *before*, or you'll need another MUX inside the datapath.

There is already a MUX in the xbar, which chooses between the register
set's output, the constants that come from the decoder (for the usual
addi stuffs) and the other bypasses (both levels of both write ports).
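
a rough behavioural sketch of that selection, in python. the priority
order (immediate first, then the newest pending write, then the register
set) and all the names are assumptions made for illustration, not the
actual FC0 logic.

```python
from typing import NamedTuple, Optional, Sequence


class PendingWrite(NamedTuple):
    """A value on a write port that has not reached the register set yet."""
    reg: int
    value: int


def select_operand(reg: int,
                   regfile: Sequence[int],
                   imm: Optional[int] = None,
                   bypass: Sequence[PendingWrite] = ()) -> int:
    """Pick the operand source.

    `bypass` holds up to 4 entries (2 levels x 2 write ports), ordered
    newest first, so the most recent write to `reg` wins (assumed policy).
    """
    if imm is not None:          # constant coming from the decoder
        return imm
    for pending in bypass:       # 1st-order, then 2nd-order bypass
        if pending.reg == reg:
            return pending.value
    return regfile[reg]          # nothing in flight: use the register set


# loadconsx 0x1234, r1 ; add r1, r2, r3
regfile = [0] * 64
regfile[1] = 0xDEAD              # stale value still in the register set
in_flight = [PendingWrite(reg=1, value=0x1234)]
operand_r1 = select_operand(1, regfile, bypass=in_flight)
operand_r2 = select_operand(2, regfile, bypass=in_flight)
```

here the add sees 0x1234 for r1 even though the register set still holds
the old value, which is the case the bypass exists for.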

> [...]
> > > If the loadcons[p] instructions are always issued in-order, there is
> > > another way to implement them: add an `accumulator' register to the
> > > instruction decoder. That is, a load will look like this:
> > >
> > >     loadcons 0x7777, r0     // acc = 0x0000000000007777;
> > >     // maybe do something else here
> > >     loadconsp 0x33bb, r0    // acc = 0x00000000777733bb;
> > >     // maybe do something else here
> > >     loadconsp 0x1919, r42   // acc = 0x0000777733bb1919; r42 = acc;
> > >
> > > Note that destination register `r0' is used as a synonym for `accumulate
> > > but do not write'.
> >
> > hmmmm i thought about this for a while (so i won't blame you :-P)
> > but quickly abandoned this idea. What would happen if an IRQ fired
> > in the middle of this sequence ? r0 would be lost and we wouldn't know
> > where to write its contents :-( So loadcons MUST always specify the
> > destination.
> 
> That's easy to solve, isn't it?
so easy that you forget the catches... 'easy' is not always 'good'.

> The obvious solution is to include the
> accumulator register in the SRB. When the IRQ arrives, the register is
> automatically saved if it is going to be re-used inside the IRQ service
> routine, and it is automatically restored when the IRQ handler returns.
no, not THAT ....
you're just adding yet another non-portable feature. have you already
forgotten binary compatibility ? and the naughty hackers' code that performs
platform-dependent optimisations ?...
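
to show the hazard i mean, here is a toy python model of the proposed
accumulator. it is a sketch under assumptions: the class and method names
are invented, sign extension is not modelled, and the "IRQ handler" is
just one loadcons executed in the middle of the sequence with no state
saved.

```python
MASK64 = (1 << 64) - 1


class AccumulatorDecoder:
    """Toy model: the constant is built in a hidden decoder register."""

    def __init__(self) -> None:
        self.acc = 0
        self.regs = {}

    def _write(self, dest: int) -> None:
        if dest != 0:            # r0 means "accumulate but do not write"
            self.regs[dest] = self.acc

    def loadcons(self, imm16: int, dest: int) -> None:
        self.acc = imm16 & 0xFFFF
        self._write(dest)

    def loadconsp(self, imm16: int, dest: int) -> None:
        self.acc = ((self.acc << 16) | (imm16 & 0xFFFF)) & MASK64
        self._write(dest)


def run(with_irq: bool) -> int:
    cpu = AccumulatorDecoder()
    cpu.loadcons(0x7777, 0)
    cpu.loadconsp(0x33BB, 0)
    if with_irq:
        # handler loads its own constant and does not save/restore acc
        cpu.loadcons(0x0001, 5)
    cpu.loadconsp(0x1919, 42)
    return cpu.regs[42]


clean = run(with_irq=False)
clobbered = run(with_irq=True)
```

without the interrupt r42 gets the intended constant; with it, r42 is
built on top of the handler's value. saving the accumulator hides this,
but only by adding hidden state that every future core has to reproduce.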

> > >         + less pressure on the register set (ports remain free)
> > >         + feedback loop is local, not wrapped around the register set
> > >         + less timing critical!
> > >         + can be interleaved with other instructions (except loadcons)
> > >         - disables well-known loadcons tricks (but probably enables others)
> > but major flaw with interrupts and exceptions/traps :-(
> Solved :)
i don't consider this as a viable solution ...

> > in other words : it's not "atomic" and it relies on the state of a single
> > register (so it might become a bottleneck later) ...
> > and if there is a state somewhere, it might confuse the compilers as well...
> 
> There's no need to become confused. Compilers can treat a loadcons
> sequence as a single instruction, and assemblers can translate a single
> `loadcons' into an appropriate sequence automatically. You just write
> 
>         loadcons some_large_constant, r23
> 
> and the assembler does the rest. A super-optimizing compiler will of
> course have to handle loadcons itself (for instruction interleaving
> etc.), but then it's also supposed to know what it does.
"it's also supposed to know what it does." but does it want to do it ?

> [...]
> > Given the mask rules of 1Lambda for a wire width and 2L for spacing
> > (as a rough estimate), then your shift will consume a surface of at least
> > (2+1*16) * 48 Lambdas.
oops ! wrong formula ! it should be 16 * 48 * (1+2)^2 which is even 3x larger.
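
the arithmetic, spelled out in python as a sanity check. the variable
names are mine, and reading the first estimate as (2+1) * 16 * 48 is an
assumption: it is the reading under which "3x larger" holds, while the
formula as literally typed evaluates to a smaller number.

```python
WIRE_WIDTH = 1      # Lambda, from the rough mask rule quoted above
WIRE_SPACING = 2    # Lambda
SHIFT = 16          # bit positions each wire is displaced
WIRES = 48          # bits 47..0 that get shifted

pitch = WIRE_WIDTH + WIRE_SPACING                  # 3 Lambda per wire track

as_typed = (2 + 1 * SHIFT) * WIRES                 # literal reading of the first formula
linear_estimate = pitch * SHIFT * WIRES            # assumed intended reading
corrected = SHIFT * WIRES * pitch ** 2             # corrected formula

ratio = corrected // linear_estimate
```

this gives 864 for the formula as typed, 2304 for the assumed reading and
6912 Lambda^2 for the corrected one, i.e. a factor of 3 between the last
two.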

> > And since oblique routing (45° wires) is not usual,
> > it's going to take even more. OTOH, a straight line consumes far fewer wires
> > and surface.
> Ok, that's a point. Then let's use the accumulator solution (where the
> shift is moved outside the Xbar/EU complex). It also has the advantage
> that it can't be abused as a `fast left shift' ;)
no ! loadcons is loadcons and is not a unit in itself.
otherwise, one has to use SHL instead.

> [...]
> > > I guess I should be satisfied, but I'm not :)
> > Don't worry, you can confide in Doktor Guidon ;-)
> Don't tell me about doctors... *sigh*
are you sick ?...
say "dreiunddreissig"... :-)

>  Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PS: just in case you didn't know, Simili2 build 23 is out on
symphonyeda.com. Several bug fixes, but still no command-line only
package.