
More Alphabet Soup (was: [f-cpu] (!) a few noteworthy things)


Unfortunately, one of my mails (concerning partial writes) seems to
have gone unnoticed...

On Tue, Jun 18, 2002 at 02:36:23AM +0200, Yann Guidon wrote:
> hi !
> Michael Riepe wrote:
> > On Mon, Jun 17, 2002 at 03:31:15AM +0200, Yann Guidon wrote:
> > [...]
> > > - the SIMD flag still creates problems.
> > > Partial writes to a register are handled but bypass conditions are
> > > a major headache, and this has a big impact on the "zero flags".
> > > We should not forget the potential troubles that this choice
> > > can make on future architectures. Here are the existing possibilities :
> > >  a) specify that the high part is unchanged
> > >    (only the low byte/word/dword/etc. is updated)
> > >   --> this is the current approach.
> >
> > - requires partial writes
> on the register set, this is not much of a problem. However, it
> becomes a problem for 2 things :
>  - keeping the "zero" flags up to date (partial write of the flag + timing problems)
>  - bypass (one unit's output is connected to another unit's input, but one part of the
>    word must come from the register set... and this can become quite complex)
> > - requires additional instructions for zero/sign extension
> Isn't SHL meant to do that ?

Or any other unit. The omega shifter can't do it; if we want it, we need
extra code.

> btw, we don't have sign extension instructions because this SIMD/partial write
> stuff is still not solved. i hope it will be solved cleanly soon.
> > >  b) specify that the high part is cleared --> simpler solution
> > 
> > + requires no partial writes
> > + saves one instruction for zero extension
> particularly important for pointers !!!

Maybe not for pointers (since they're always full size), but for pointer
arithmetic (add/subtract a small offset to/from a pointer, for example).

> > + cheap to implement:
> > 
> >         signal X, Y, Mask : std_ulogic_vector(63 downto 0);
> >         ...
> >         Mask <= (
> >                 63 downto 32 => SIMD or U(2),
> >                 31 downto 16 => SIMD or U(1),
> >                 15 downto  8 => SIMD or U(0),
> >                 others => '1'
> >         );
> >         -- note that Mask is available from the decoder
> >         -- there's only an AND (or maybe MUX) inside the signal path
> >         Y <= X and Mask;
> that's where it hurts : it's in the critical datapath :-/

If bypassed values don't have to be zero-extended (as in your proposal f)),
we can move the masking into the write ports of the register set, out of
the EU result path. And it's only a single AND (or MUX).
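For anyone who wants to experiment with the masking rule in software,
here is a minimal sketch that mirrors the VHDL fragment quoted above. It
assumes that U(0), U(1) and U(2) are three size bits delivered by the
decoder, exactly as they appear in that fragment; how they encode the
operand size is not stated there, so the sketch just reproduces the three
OR terms. The names result_mask and mask_result are invented for
illustration.

```python
MASK64 = (1 << 64) - 1

def result_mask(simd, u):
    """Build the 64-bit mask from the SIMD flag and size bits u = (U0, U1, U2).

    Mirrors the quoted VHDL aggregate; the meaning of the U bits is an
    assumption taken from that fragment only.
    """
    u0, u1, u2 = u
    mask = 0xFF                        # bits 7..0 are always kept
    if simd or u0:
        mask |= 0xFF << 8              # bits 15..8
    if simd or u1:
        mask |= 0xFFFF << 16           # bits 31..16
    if simd or u2:
        mask |= 0xFFFFFFFF << 32       # bits 63..32
    return mask

def mask_result(x, simd, u):
    """Y <= X and Mask, restricted to 64 bits."""
    return x & result_mask(simd, u) & MASK64
```

With the SIMD flag set the mask is all ones, so the result passes through
unchanged; with the flag cleared only the selected low part survives.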

> > >  c) specify that the high part is sign-extended
> > >     (sign extension might create troubles like those of the
> > >      current solution)
> > + requires no partial writes
> > + saves one instruction for sign extension
> > - more complex than b) because there are multiple sign bits to
> >   consider
> that's what i noted : not a good solution.

No, not at all.

> > >  d) specify that the SIMD flag has no effect at all and the
> > >    high part is updated with the rest of the word (just like a
> > >    normal SIMD operation would do)
> > + all the world is SIMD :)
> ... and "God is real, unless you declare it as a char"
>  (ok, it's not my invention, but it goes well along your remark :-D)

I guess it was `... unless declared integer'. There is no `char' in
Fortran (only character - but God probably lacks that ;).

> > + requires no partial writes
> > + even cheaper to implement than b)
> sure...
> > - requires additional instruction for zero/sign extension
> it's needed anyway (at least compiler writers will want one,
> and people will be fed up with fiddling with arithmetic shifts etc...)
> don't you do one with your unit ?

Not yet. I've been looking for a better solution...

> > >  e) specify that the flag return an "undefined/reserved" behaviour
> > >    for the MSB (could be both dangerous and safe, it would force
> > >    compilers to generate valid pointers all the time)
> > + even cheaper to implement than b)
> > - worst solution ever
> sure.
> > [...]
> > e) will allow implementors to build F-CPUs that work like a), b), c), d),
> > or any other way.
> that's the obvious purpose (you seem to read in my mind ;-D)

.o( using Linux Telepathy Driver v0.0.1-alpha )

> > As soon as those versions exist, programmers will use
> > this particular `feature' (trust me - they *will*),
> i know this, too.
> > and the resulting code will no longer be compatible between F-CPU versions.
> > Therefore, we have to avoid e).
> This was meant as a "temporary" version before the "F-CPU rev. 1" milestone.
> Before that, compatibility is not ensured and binary compatibility will
> not be enforced (because the opcodes won't be defined).

If the non-SIMD opcodes trigger a trap, your e) is almost my f).

> > Since I don't like a),
> at least it won't disturb people who are used to non-RISC systems...
> that's the behaviour i have seen on most existing multi-size computers
> (68xx, 68K, inteloids etc...)

And it sucks.

> > and c) is more expensive than b),
> obviously.
> > and d) is what we have in SIMD mode, I prefer b).
> b) is what first came to my mind but i soon realised that there are
> other possibilities, so i wanted to explore them all.
> My sources and model implement a) but i did not choose between b) and f).
> f) is closer to a) and b) shifts the problem from the register set to
> the execution units. It's a tough decision and we have to consider a lot
> of things, including future implementations...
> > On the other hand, turning SIMD on unconditionally *is* tempting.
> :-)
> didn't you propose that ?... or i misunderstood your request ?

No, *you* did. I still want b), one way or another.

> > It would free one flag and streamline the instruction set (the s- prefix
> > will no longer be needed). That is, my second choice is d).
> that's a radical cleanup, right :-)

Tabula rasa (for those of you who don't understand Latin: `clean slate')
is better than creeping featurism (SW people, are you listening?),
especially if you have to deal with limited resources (like die space
or delay time). I don't want the F-CPU to suffer from `galloping
elephantiasis' (or was it called `chronic Intelism'? ;)

> > What about f): keep the SIMD bit but make d) the default and b) optional
> > behaviour.
> you want to make a special case of e) ?

Yes, but a *clean* one. There's a difference between saying: `the
behaviour is unspecified' and `these opcodes are reserved for a specific

> > That is, when the SIMD flag is cleared, a `conforming'
> > F-CPU must either mask the result or trigger an `invalid instruction'
> > trap (this can be handled inside the decoder).  From the design and
> > specification point of view, this solution is much cleaner than e).
> my intent for e) was to be completely ... open and as imprecise as possible
> so your proposition is certainly more "orthodox". but that doesn't
> solve the whole problem at all (the programs would spend their time
> in the exception routines...)

Programs are supposed to use SIMD operations whenever possible.  By making
other variants much more expensive, we kind of force d) through the back
door, at least on the specification level :)

I know, this sounds like RMS in a Linux vs. GNU/Linux debate ;)

> > I suggest we choose f) but make any reasonable effort to implement b).
> i prefer "my" f) but b) has some problems to be solved...

I guess we can combine your f) with a), b), c) or d) - or my f).

> > Did you think about the new loadcons[p] I suggested?
> ??? did i miss or forget something ?
> can you explain and detail what you mean ?

See my mail from Wed, 5 Jun 2002 22:47:29 +0200. The subject was
"Partial Writes Considered Harmful".

> > On Mon, Jun 17, 2002 at 05:10:14AM +0200, Yann Guidon wrote:
> > > Ben Franchuk wrote:
> > > > Yann Guidon wrote:
> > > <snip>
> > > this is unrelated but you just gave me an idea for a 6th solution ("f)") :-)
> > >
> > >  f) avoid bypass if the size of the written register is different from
> > >     the required operand :-)
> > 
> > Is that a variant of a), b), c) or d)?
> no.
> it's something completely different.

They're almost orthogonal.

> the idea behind this is that in a), the bypass network has to choose between
> more data and mix them into a single word (i gave examples).
> my idea was to not do the bypass (and thus avoid the complex mux management
> in the critical datapath) when a chunk size change is detected. the first
> instruction would continue to go through the write pipeline and the second
> (which would usually trigger the bypass) would wait until the register set
> performs the partial write itself. This keeps the complex muxes out of the
> critical datapath, it solves one problem at the cost of some more latencies
> from time to time in "micro$oft-like" codes.

That can be done with any variant. When the chunk sizes don't match,
let the second instruction wait until the register set has performed
the write. Whether it's a partial write (a), a masked write (b) or a
sign-extended write (c) doesn't really matter.

BTW: if masking is done when the register is written, bypassing *is*
possible unless the second instruction has wider operands.
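A small value-level sketch of that last remark, under the assumption that
a non-SIMD instruction of size n only looks at the low n bits of its
operands. The raw EU output, the helper names and the sizes below are all
hypothetical; nothing here is taken from the F-CPU sources.

```python
def low_mask(bits):
    return (1 << bits) - 1

def write_masked(raw_result, producer_bits):
    """Value the register set holds after a masked write (high part cleared)."""
    return raw_result & low_mask(producer_bits)

def operand_seen(value, consumer_bits):
    """The part of an operand that a consumer of the given size looks at."""
    return value & low_mask(consumer_bits)

def needs_wait(producer_bits, consumer_bits):
    """With write masking, only a wider consumer has to wait for the write."""
    return consumer_bits > producer_bits

raw = 0xDEADBEEF000000AB        # hypothetical EU output, only low 8 bits valid
stored = write_masked(raw, 8)   # what ends up in the register set

# An 8-bit consumer sees the same operand on the bypass and in the register.
same_size_ok = operand_seen(raw, 8) == operand_seen(stored, 8)
# A 64-bit consumer would pick up unmasked high bits from the bypass.
wider_ok = operand_seen(raw, 64) == operand_seen(stored, 64)
```

Here same_size_ok comes out true and wider_ok false, which is the whole
argument: the unmasked high bits only matter to a consumer that reads
them.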

> PS : for the good reason that i would like to rest a bit (and few
> people would be able to decipher my attempts at being clear...),
> i skipped the reasons why b) shifts the problem from the register set
> to the main datapath.

I'm aware of that anyway. But b) combined with your f) might be a good

> However, during a little pause, i got an interesting idea !
> Here is the problem : For reasons that i don't have time to detail,
> the "Xbar" is a bunch of wires that are routed over the execution
> units to keep distances small and avoid the big "stuff" that can be
> found on the usual FC0 diagrams. This uses one or two more metal layers
> but the surface and the distance are reduced.
> To optimise the design, the usual execution units are all in line, but
> one half of the units are flipped (in an odd/even fashion) so neighbouring
> units can "share" the latches of the input or the output of the Xbar.
> This organisation comes from some experiments i made on precharacterized
> cells, and i noticed that the FF take one half of the surface, while
> the execution unit is a very long strip. Sharing the FFs between neighbours
> can save 25% of the surface/speed/consumption/price...

At the input of the EUs, at the output, or both? Note that you may
introduce EU dependencies that way.

> One solution i propose for implementing b) is to "clear" the corresponding MSB
> when the data is handed to the Xbar by the register set. Most instructions
> (except ROP2 and INC with the "neg" and "not" operations) will return a 0
> result (0+0=0, etc.). So there is less need to modify the Xbar_output units
> and less wires to spread.

Another exception is the IDU (0/0 = ?).
For other drawbacks, see A) below.
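To make the exceptions concrete, here is a rough sketch of what happens
to the high part when the operands are masked at read time and the EU
itself does no masking. It uses plain 64-bit integer arithmetic as a
stand-in for the units, an 8-bit chunk size, and made-up register
contents; the helper names are hypothetical. Addition is left out on
purpose, because whether a carry crosses the chunk boundary depends on
the adder, which this sketch does not model.

```python
MASK64 = (1 << 64) - 1

def early_mask(value, bits):
    """Clear the high part of an operand when it is read from the register."""
    return value & ((1 << bits) - 1)

def high_part(value, bits):
    """Everything above the low chunk of a 64-bit result."""
    return (value & MASK64) >> bits

a = early_mask(0x1122334455667788, 8)   # low byte only: 0x88
b = early_mask(0x99AABBCCDDEEFF0F, 8)   # low byte only: 0x0F

or_high = high_part(a | b, 8)    # 0 op 0 stays 0 in the high part
xor_high = high_part(a ^ b, 8)   # same here
not_high = high_part(~a, 8)      # inverting zeros gives all ones
neg_high = high_part(-a, 8)      # two's complement negate also fills the high part
```

OR and XOR leave the high part at zero, while NOT and negate fill it with
ones, so those operations would still need a mask on the result.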

> Do you understand the trick ? instead of clearing the MSB of the results
> at the output of the EUs (it's important to do this because the result is
> needed for bypass directly on the Xbar, and the partial write of the register
> set will not be enough), the MSB is cleared when the operands are read
> (less places where the MSB is cleared, so less control signals).

There are at least four approaches (capital letters this time):

A) `early masking': mask off the high bits when the register is read
   (before the instruction is issued).

	+ masking moved to register set
	- bypass impossible when first instruction has wider operands 1)
	- requires at least 3 masking units (one per register read port)
	- some instructions need special handling (complex!)

	1) Surprise! You need to mask the operands of the second instruction
	   but there is no masking unit inside the bypass.

B) `write masking': mask off the high bits when the result is
   written to the register.

	+ masking moved to register set
	+ requires only 2 masking units (one per register write port)
	- bypass impossible when second instruction has wider operands

C) `late masking': store the chunk size in the register set (or
   scoreboard) and mask off the high bits when the result is read from the
   register (that is, before the *next* instruction that uses the value).

	+ masking moved to register set
	- bypass impossible when second instruction has wider operands
	- requires at least 3 masking units (one per register read port)
	- requires extra `valid' bits that indicate the chunk size

D) move the problem to the EUs. This can be done easily in the IMU, but
   there's no room left inside the ASU, for example. SHL is pretty tight,
   too (I already violate the `6 gate rule' there).

	+ bypass always possible
	- heavy implementation
	- not all EUs support it

E/F/G anyone? ;)

If we can add a masking unit inside the bypass, A), B) and C) will always
be able to bypass results. But even if we can't, B) looks like the best
solution so far.
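The bypass rules for the four approaches can be summarised in a few lines
of code. This is only a restatement of the +/- lists above, assuming that
"first" is the instruction producing the value and "second" the one
consuming it; the function names and the mask_in_bypass parameter are
invented for the sketch.

```python
def can_bypass(approach, first_bits, second_bits, mask_in_bypass=False):
    """Return True if the second instruction may take its operand from the bypass."""
    if approach not in ("A", "B", "C", "D"):
        raise ValueError("unknown approach: " + repr(approach))
    if approach == "D" or mask_in_bypass:
        # D) masks inside the EUs; a masking unit in the bypass fixes A), B), C).
        return True
    if approach == "A":
        # Early masking: fails when the first instruction has wider operands.
        return first_bits <= second_bits
    # B) write masking and C) late masking: fail when the second is wider.
    return second_bits <= first_bits

def summary(first_bits, second_bits, mask_in_bypass=False):
    """Bypass verdict for all four approaches at once."""
    return {ap: can_bypass(ap, first_bits, second_bits, mask_in_bypass)
            for ap in "ABCD"}
```

For example, summary(64, 8) reports that only A) has to wait, while
summary(8, 64) reports that B) and C) have to wait.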

 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"