
Re: [f-cpu] Late answer

Well, here we go again...

> > > i figured this later... and it is not a "clean" option, if we consider
> > > future cores that don't work the same way as FC0. #If# the future cores
> > > don't use an in-order pipeline or #if# some kind of "translation"
> > > is performed on the instruction, then the operation will pass through
> > > different stages.
> > But it will still perform the same operation. It has to, for compatibility.
> certainly, but that's the first half of the problem. Though if loadcons
> relies on naughty hacks, it will be much more difficult to implement later...
> {insert thought about your "accumulator" proposal here}

A future core will have to do only one thing: implement the specified
behaviour. If we specify the `shifted' version, it will have to work that
way, whether it uses a hardwired shift, the SHL or something else. If we
specify the original `partial write' version, it will have to work *that*
way, whether the register set actually supports partial writes or not.
And if we specify the `accumulator' version, it will of course have to
use an accumulator. We're not only discussing the implementation, but
also the specification (because the current specification is suboptimal
and hard to implement).
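
To make the distinction between specification and implementation concrete,
here is a rough Python model of two of the variants. The function names and
the exact bit layout (four 16-bit slots in a 64-bit register, slot 0 being
the least significant) are assumptions made for illustration; they are not
taken from the F-CPU manual. The point is only that both specifications can
build the same 64-bit constant, whatever hardware sits underneath.

```python
MASK64 = (1 << 64) - 1

def loadcons_shifted(reg, imm16):
    """Hypothetical `shifted' variant: shift left by 16, insert imm16 at the bottom."""
    return ((reg << 16) | (imm16 & 0xFFFF)) & MASK64

def loadcons_partial(reg, imm16, slot):
    """Hypothetical `partial write' variant: overwrite one 16-bit slot, keep the rest."""
    shift = 16 * slot
    cleared = reg & ~(0xFFFF << shift) & MASK64
    return cleared | ((imm16 & 0xFFFF) << shift)

def build_shifted(value):
    """Build a 64-bit constant, most significant chunk first."""
    reg = 0
    for slot in (3, 2, 1, 0):
        reg = loadcons_shifted(reg, (value >> (16 * slot)) & 0xFFFF)
    return reg

def build_partial(value):
    """Build a 64-bit constant, one slot at a time (any order would do)."""
    reg = 0
    for slot in range(4):
        reg = loadcons_partial(reg, (value >> (16 * slot)) & 0xFFFF, slot)
    return reg
```

Either builder needs four steps for a full 64-bit value; the difference lies
in what the register set and the Xbar must support, not in the result.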

> > > Now imagine a crazy coder like us, who reads the description of this
> > > instruction : he thinks "great ! a zero-latency shift instruction !"
> > > and we'll soon see this instruction used for things completely unrelated
> > > to constant loading... And the coders will be disappointed when another
> > > core performs the instruction differently.
> > It can't. But it may be slower -> not our problem. If some weirdo uses
> > an instruction for purposes it wasn't meant for, ... ;)
> > On the other hand, I'm one of those weirdoes myself... ;)
> now, you see what i mean.
> now, imagine the crowd of people who have read Michael Abrash's
> "Zen of Code Optimization" and who apply his saying "don't look at
> an instruction for what it is meant to do, but for what it does"...

If you look at the specification (and not the implementation), everything
will be fine. If we *specify* that `loadconsp' does a left-shift and
replaces the least significant 16 bits, then why shouldn't a programmer
use it for that purpose? You can also use the current `loadconsx.1 0,
reg' and `loadconsx.2 0, reg' for zero-extending 16- and 32-bit values.
There's nothing wrong with it. For example, Thomas used mixl/mixh with r0
as the first operand in order to separate 8-bit RGB values and zero-extend
them to 16 bits, or sdup.q in order to duplicate a 32-bit constant.
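
The zero-extension trick can be sketched in Python. The model below assumes
that `loadconsx.N imm, reg' keeps the bits below slot N, writes imm into
slot N and sign-extends it into the bits above. That reading is inferred
from the zero-extension behaviour described here, not quoted from the
manual, so treat the function as hypothetical.

```python
MASK64 = (1 << 64) - 1

def loadconsx(reg, imm16, slot):
    """Assumed semantics: keep bits below `slot`, write imm16, sign-extend upward."""
    shift = 16 * slot
    low = reg & ((1 << shift) - 1)   # bits below the slot survive
    imm = imm16 & 0xFFFF
    if imm & 0x8000:                 # sign-extend the 16-bit immediate
        imm -= 0x10000
    return (low | (imm << shift)) & MASK64

# loadconsx.1 0, reg  -> zero-extends the low 16 bits
zext16 = loadconsx(0xFFFFFFFFFFFF8001, 0, 1)
# loadconsx.2 0, reg  -> zero-extends the low 32 bits
zext32 = loadconsx(0xDEADBEEFCAFEBABE, 0, 2)
```

With an immediate of 0 the extension fills the upper bits with zeros, which
is all the trick relies on.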

> > You can't bypass when you load a partial constant:
> >         loadconsx.1 0, r1
> >         // no bypass possible!
> >         add r1, r2, r3
> partial constants certainly ARE bypassable, otherwise what would be the
> point of the MUXes in the Xbar read cycle ?

And if we can avoid those MUXes? Move the complexity out of the Xbar?
Make the datapath shorter? I thought that was what you wanted.

> > (ironically, this is one way to zero-extend a 16- or 32-bit value).
> > 
> > When part of the value comes from the register set and the rest from
> > the decoder, you *have* to read the register - and you have to write
> > it *before*, or you'll need another MUX inside the datapath.
> There is already a MUX in the xbar, which chooses between the register
> set's output, the constants that come from the decoder (for the usual
> addi stuff) and the other bypasses (both levels of both write ports).

That's a lot of stuff. Four n-way 16-bit muxes with control logic for
each 64-bit datapath - and there are at least three of them, because EUs
can take three operands.
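
As a rough illustration of that cost, the sketch below models an operand
read where every 16-bit chunk has its own source select. The source names
("regfile", "decoder") and the helper are made up for this example; only
the chunk width, the 64-bit datapath and the three operands come from the
discussion above.

```python
CHUNK_BITS = 16
CHUNKS_PER_WORD = 64 // CHUNK_BITS   # four 16-bit muxes per 64-bit datapath
OPERANDS = 3                         # EUs can take three operands

def read_operand_chunked(sources, select):
    """Assemble a 64-bit operand chunk by chunk.

    sources: dict mapping a source name to a 64-bit value
    select:  one source name per 16-bit chunk, chunk 0 = least significant
    """
    out = 0
    for chunk, name in enumerate(select):
        shift = CHUNK_BITS * chunk
        out |= ((sources[name] >> shift) & 0xFFFF) << shift
    return out

# A partial constant in chunk 1 comes from the decoder, the rest from the register set.
sources = {"regfile": 0x1111222233334444, "decoder": 0xBEEF << 16}
operand = read_operand_chunked(sources, ["regfile", "decoder", "regfile", "regfile"])

total_muxes = CHUNKS_PER_WORD * OPERANDS   # minimum count for one EU's inputs
```

If only full-length words are moved, one select per operand is enough
instead of one per chunk.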

[...accumulating loadcons...]
> > > hmmmm i thought about this for a while (so i won't blame it on you :-P)
> > > but quickly abandoned this idea. What would happen if an IRQ fired
> > > in the middle of this sequence ? r0 would be lost and we wouldn't know
> > > where to write its contents :-( So loadcons MUST always specify the
> > > destination.
> > 
> > That's easy to solve, isn't it?
> so easy that you forget the catches... 'easy' is not always 'good'.
> > The obvious solution is to include the
> > accumulator register in the SRB. When the IRQ arrives, the register is
> > automatically saved if it is going to be re-used inside the IRQ service
> > routine, and it is automatically restored when the IRQ handler returns.
> no, not THAT ....
> you're just adding yet another non-portable feature. did you already forget
> the binary compatibility ? the naughty hacker's codes that perform
> platform-dependent optimisations ?...

Since this variant works differently, we would of course have to specify
that it works this way (and future cores would have to work just the same
way). The point is that this variant avoids partial data transports in
both the register set and the Xbar. We will only have to move full-length
words around - and that makes a lot of things a lot easier, doesn't it?
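
A minimal Python sketch of the accumulator idea, including the IRQ case.
The operation names (acc_push, acc_store) and the save/restore mechanism
are hypothetical; the model saves the accumulator unconditionally, whereas
the proposal above only saves it when the handler re-uses it. It shows that
an interrupt in the middle of a sequence does not lose the partial
constant, and that only full-width values ever reach the register set.

```python
MASK64 = (1 << 64) - 1

class Core:
    def __init__(self):
        self.regs = {}   # register name -> 64-bit value
        self.acc = 0     # hypothetical constant accumulator

    def acc_push(self, imm16):
        """Shift the accumulator left by 16 and append a 16-bit chunk."""
        self.acc = ((self.acc << 16) | (imm16 & 0xFFFF)) & MASK64

    def acc_store(self, rd):
        """Write the whole accumulator to a register (full-width write only)."""
        self.regs[rd] = self.acc

    def irq(self, handler):
        """Save the accumulator with the context, run the handler, restore it."""
        saved = self.acc
        handler(self)
        self.acc = saved

def noisy_handler(core):
    # The handler builds its own constant and clobbers the accumulator.
    core.acc_push(0xDEAD)
    core.acc_push(0xBEEF)
    core.acc_store("r5")

core = Core()
core.acc_push(0x0123)
core.acc_push(0x4567)
core.irq(noisy_handler)      # IRQ fires in the middle of the sequence
core.acc_push(0x89AB)
core.acc_push(0xCDEF)
core.acc_store("r23")
```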

> > > in other words : it's not "atomic" and it relies on the state of a single
> > > register (so it might become a bottleneck later) ...
> > > and if there is a state somewhere, it might confuse the compilers as well...
> > 
> > There's no need to become confused. Compilers can treat a loadcons
> > sequence as a single instruction, and assemblers can translate a single
> > `loadcons' into an appropriate sequence automatically. You just write
> > 
> >         loadcons some_large_constant, r23
> > 
> > and the assembler does the rest. A super-optimizing compiler will of
> > course have to handle loadcons itself (for instruction interleaving
> > etc.), but then it's also supposed to know what it does.
> "it's also supposed to know what it does." but does it want to do it ?

If it wants to squeeze out every unnecessary cycle, yes. Otherwise, no.
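
The assembler side of this could look roughly like the following. The
per-slot mnemonic `loadcons.N' and the output format are assumptions for
the sake of the example; a real assembler would emit whatever sequence the
final specification calls for, and could skip chunks it does not need.

```python
MASK64 = (1 << 64) - 1

def expand_loadcons(value, reg):
    """Expand a `loadcons value, reg` pseudo-instruction into four 16-bit loads."""
    value &= MASK64
    return [
        f"loadcons.{slot} 0x{(value >> (16 * slot)) & 0xFFFF:04x}, {reg}"
        for slot in range(4)
    ]

sequence = expand_loadcons(0x0123456789ABCDEF, "r23")
```

A compiler that does not care about instruction interleaving never sees
the individual steps; it just emits the single pseudo-instruction.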

> > Ok, that's a point. Then let's use the accumulator solution (where the
> > shift is moved outside the Xbar/EU complex). It also has the advantage
> > that it can't be abused as a `fast left shift' ;)
> no ! loadcons is loadcons and is not a unit in itself.
> otherwise, one has to use SHL instead.

Who says it isn't? As far as I am concerned, the current F-CPU
specification is not cast in stone. If there are mistakes in it, we
have to correct them - and believe me, partial writes (and partial data
moves) *are* mistakes. There are others - like the `mac mistake': the
mac and mul instructions use different result formats (mac results are
widened "if the destination register is wide enough"). Since `mac.64'
will behave differently on a 64- and a 128-bit F-CPU, this clearly is
a Big Bad Bug in the specification. But there are always alternatives,
and we should at least consider them. AND we should take care that we
choose an alternative that is easy to implement. KISS principle.
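
The `mac mistake' can be shown with a small Python model. The widening
rule used here (double-width result if the register is at least twice the
operand size, otherwise a truncated result) is an interpretation of "if
the destination register is wide enough" and may not match the manual's
wording exactly; the function and its parameters are hypothetical.

```python
def mac(a, b, acc, operand_bits, register_bits):
    """Multiply-accumulate whose result width depends on the register width."""
    if register_bits >= 2 * operand_bits:
        result_bits = 2 * operand_bits   # widened result
    else:
        result_bits = operand_bits       # truncated result
    return (acc + a * b) & ((1 << result_bits) - 1)

# The same mac.64 with the same inputs on two different cores:
on_64_bit_core = mac(2**63, 4, 1, operand_bits=64, register_bits=64)
on_128_bit_core = mac(2**63, 4, 1, operand_bits=64, register_bits=128)
```

The two calls return different values for identical inputs, which is
exactly the kind of platform dependence a specification should rule out.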

 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/