Re: More Alphabet Soup (was: [f-cpu] (!) a few noteworthy things)

The Health Bureau says : "don't read this post unless you have
a full bottle of aspirin next to you ! Now you are warned !".
In case you don't understand this post, reread it several times.

- - - - - - - - - - - 8<- - - - - - - - - - - - - - - - - - - -

hi !

Michael Riepe wrote:
> Hi!
> Unfortunately, one of my mails (concerning partial writes) seems to
> have gone unnoticed...
certainly my fault. my emails and computers are breaking up...
so i better keep the old, working and rather stable NS4.51
that i have used for years...

> On Tue, Jun 18, 2002 at 02:36:23AM +0200, Yann Guidon wrote:
> > Michael Riepe wrote:
> > > - requires additional instructions for zero/sign extension
> > Isn't SHL meant to do that ?
> Or any other unit. The omega shifter can't do it; if we want it, we need
> extra code.

I don't think Omega is the best choice... When i find time, i'll try
to program another strategy where the wire length is shorter in the
Critical DataPath.

Concerning the latency, it seems obvious that, past a certain point,
it should be pipelined. Though i'm not sure whether all the control logic
can keep up ...

> > > >  b) specify that the high part is cleared --> simpler solution
> > >
> > > + requires no partial writes
> > > + saves one instruction for zero extension
> > particularly important for pointers !!!
> Maybe not for pointers (since they're always full size), but for pointer
> arithmetics (add/subtract a small offset to/from a pointer, for example).
that's more or less what i meant :-)

> > > + cheap to implement:
> > >         signal X, Y, Mask : std_ulogic_vector(63 downto 0);
> > >         ...
> > >         Mask <= (
> > >                 63 downto 32 => SIMD or U(2),
> > >                 31 downto 16 => SIMD or U(1),
> > >                 15 downto  8 => SIMD or U(0),
> > >                 others => '1'
> > >         );
> > >         -- note that Mask is available from the decoder
> > >         -- there's only an AND (or maybe MUX) inside the signal path
> > >         Y <= X and Mask;
> >
> > that's where it hurts : it's in the critical datapath :-/
> If bypassed values don't have to be zero-extended (as in your f) proposal),
> we can include it into the write ports of the register set. And it's only
> a single AND (or MUX).
that's what i explain later : the zero-extension can not happen at that level,
but at (or before) the EU's outputs. We don't want to have to check whether the data
granularity changes between two depending operations... and the bypassed value
MUST be the same as what is written to the Register Set.
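
A minimal behavioural sketch of that rule, in Python rather than VHDL so it can be run directly. It mirrors the Mask expression quoted above. The thermometer encoding of U (0b000 = 8 bits, 0b001 = 16, 0b011 = 32, 0b111 = 64) and the names result_mask and eu_output are assumptions made for illustration, they are not taken from the F-CPU sources.

```python
FULL64 = (1 << 64) - 1

def result_mask(simd: bool, u: int) -> int:
    """Model of the quoted Mask signal; bit k of `u` stands for U(k)."""
    def keep(k: int) -> bool:
        return simd or bool((u >> k) & 1)

    mask = 0xFF                      # "others => '1'": bits 7..0 are always kept
    if keep(0):
        mask |= 0xFF << 8            # bits 15..8
    if keep(1):
        mask |= 0xFFFF << 16         # bits 31..16
    if keep(2):
        mask |= 0xFFFFFFFF << 32     # bits 63..32
    return mask

def eu_output(raw: int, simd: bool, u: int) -> int:
    """Mask at the EU output, so bypass and register write see the same value."""
    return raw & FULL64 & result_mask(simd, u)

raw = 0x1122334455667788
bypassed = eu_output(raw, simd=False, u=0b001)   # assumed code for a 16-bit scalar
written = bypassed                               # the register set stores it unchanged
assert bypassed == written == 0x7788
```

Because the mask is applied once, before the value reaches either the bypass or the write port, there is nothing left to compare between two dependent operations.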

> > > >  c) specify that the high part is sign-extended
> > > >     (sign extension might create troubles like those of the
> > > >      current solution)
> > > + requires no partial writes
> > > + saves one instruction for sign extension
> > > - more complex than b) because there are multiple sign bits to
> > >   consider
> > that's what i noted : not a good solution.
> No, not at all.
so let's just leave it alone...

> > > >  d) specify that the SIMD flag has no effect at all and the
> > > >    high part is updated with the rest of the word (just like a
> > > >    normal SIMD operation would do)
> > > + all the world is SIMD :)
> > ... and "God is real, unless you declare it as a char"
> >  (ok, it's not my invention, but it goes well along your remark :-D)
> I guess it was `... unless declared integer'. There is no `char' in
> Fortran (only character - but God probably lacks that ;).
sorry, i am equipped with defective core memory...

> > > - requires additional instruction for zero/sign extension
> > it's needed anyway (at least compiler writers will want one,
> > and people will be fed up to fiddle with arithmetic shifts etc...)
> > don't you do one with your unit ?
> Not yet. I've been looking for a better solution...
there is one. It's a bit heavy but should scale correctly.
i think i explained it on this list, several months ago...

> > > >  e) specify that the flag return an "undefined/reserved" behaviour
> > > >    for the MSB (could be both dangerous and safe, it would force
> > > >    compilers to generate valid pointers all the time)
> > > + even cheaper to implement than b)
> > > - worst solution ever
> > sure.
> > > [...]
> > > e) will allow implementors to build F-CPUs that work like a), b), c), d),
> > > or any other way.
> > that's the obvious purpose (you seem to read in my mind ;-D)
> .o( using Linux Telepathy Driver v0.0.1-alpha )
seems you can renumber it as v0.99-beta already.

> > > As soon as those versions exist, programmers will use
> > > this particular `feature' (trust me - they *will*),
> > i know this, too.
> >
> > > and the resulting code will no longer be compatible between F-CPU versions.
> > > Therefore, we have to avoid e).
> > This was meant as a "temporary" version before the "F-CPU rev. 1" milestone.
> > Before that, compatibility is not ensured and binary compatibility will
> > not be enforced (because the opcodes won't be defined).
> If the non-SIMD opcodes trigger a trap, your e) is almost my f).
almost, except that e) is more "generic" and attempts to "not" specify, indeed.

> > > Since I don't like a),
> > at least it won't disturb people who are used to non-RISC systems...
> > that's the behaviour i have seen on most existing multi-size computers
> > (68xx, 68K, inteloids etc...)
> And it sucks.
there are some ways to work around that. but it's not "plain orthodox RISC",
MIPS, ALPHA, SPARC and others seem to be more than happy with only
one data size and the SIMD flag would seem useless to them, since usually
they would use the 64-bit size by default (which is the same, SIMD flag or
not, on a 64-bit machine). However, F-CPU can have more than 64 bits per
register... thus, at least the size flags are absolutely necessary.

> > > On the other hand, turning SIMD on unconditionally *is* tempting.
> > :-)
> > didn't you propose that ?... or i misunderstood your request ?
> No, *you* did. I still want b), one way or another.
i'll try to do b) since it is consistent with pointer arithmetic.
I already have found a hack in yesterday's mail, so we'll probably
use that anyway.

> > > It would free one flag and streamline the instruction set (the s- prefix
> > > will no longer be needed). That is, my second choice is d).
> > that's a radical cleanup, right :-)
> Tabula rasa (for those of you who don't understand latin: `clean table')
it's very close to the french saying ;-)

> is better than creeping featurism (SW people, are you listening?),
At least a few are. Cedric warned me about d), which would be dangerous
for SW people. that would be too complex to handle "cleanly" with compilers...

but ...



The SIMD flag could be turned into a "pointer" flag ???????

waddayouthink ?

We would still use the b) approach with a bit of d) in the description,
but the current SIMD flag could have an inverted meaning, and trigger
TLB checks and the rest ?...

It would be :
 * all "normal" operations are SIMD (you said "all the world is SIMD :)")
   and operand size would be managed as in RISC world
 * all pointer operations must have the flag reset so the result is zero-extended
    and the TLB is correctly checked (but then we need bits to indicate whether
    it's I or D)

but i fear it's even more complex and confusing than before...

for example, the 3 bits would have different meanings than today, we would need
the following combinations :
 - instruction pointer
 - data pointer
 - SIMD 8/16/32/64/128/256 bits/chunk

So there are some instructions that become useless on the memory side...
argh. i wish i didn't have this idea.
forget it.

> especially if you have to deal with limited resources (like die space
if you listen to nicO, you know it's not the biggest problem ;-P

> or delay time).
that's the critical point, which is more or less proportional to the surface...

> I don't want the F-CPU to suffer from `galloping
> elephantiasis' (or was it called `chronic Intelism'? ;)
you named it.

> > > What about f): keep the SIMD bit but make d) the default and b) optional
> > > behaviour.
> > you want to make a special case of e) ?
> Yes, but a *clean* one. There's a difference between saying: `the
> behaviour is unspecified' and `these opcodes are reserved for a specific
> purpose'.
sure. one is a subset of the other ;-)

> > > That is, when the SIMD flag is cleared, a `conforming'
> > > F-CPU must either mask the result or trigger an `invalid instruction'
> > > trap (this can be handled inside the decoder).  From the design and
> > > specification point of view, this solution is much cleaner than e).
> > my intent for e) was to be completely ... open and as imprecise as possible
> > so your proposition is certainly more "orthodox". but that doesn't
> > solve the whole problem at all (the programs would spend their time
> > in the exception routines...)
> <fanatic>
> Programs are supposed to use SIMD operations whenever possible.  By making
> other variants much more expensive, we kind of force d) through the back
> door, at least on the specification level :)
> </fanatic>
> I know, this sounds like RMS in a Linux vs. GNU/Linux debate ;)
no. it sounds rather logical, but not fair.
Still, the problem of the pointer's definition is hurting us here.

in my vision, a pointer is held in a whole register, whatever the size
of both. a pointer has no defined size, just like the register.
by not binding the pointer format to the existing data formats
(char, int, long int...), it becomes difficult to do pointer arithmetics
with "common" arithmetic operations.

Just a note about multiple pointers : a register can contain
ONLY ONE pointer, otherwise how would we handle jumps and load/stores ?

another note :
a scatter/gather instruction would be ideally performed using a "base"
pointer (checked the usual way) and a SIMD "offset", so every SIMD offset
chunk is checked in parallel against the maximum allowed offset (size of page
in TLB ?) and the TLB doesn't need as many ports as there are chunks...
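
A rough sketch of that scatter/gather idea, assuming unsigned offsets, a 64-bit offset register and a base pointer that has already passed the usual TLB check. The function name gather_addresses and the max_offset parameter are hypothetical, nothing like this exists in the manual yet.

```python
def gather_addresses(base, offsets_reg, chunk_bits, max_offset, reg_bits=64):
    """Split `offsets_reg` into chunks and add each one to the checked base.

    Each chunk is compared against `max_offset` (for example the room left
    in the page); in hardware these comparisons would all run in parallel.
    """
    chunk_mask = (1 << chunk_bits) - 1
    addresses = []
    for i in range(reg_bits // chunk_bits):
        offset = (offsets_reg >> (i * chunk_bits)) & chunk_mask
        if offset > max_offset:
            # A real core would trap here; the sketch raises an exception.
            raise ValueError(f"chunk {i}: offset {offset:#x} exceeds {max_offset:#x}")
        addresses.append(base + offset)
    return addresses

# Four 16-bit offsets packed into one register, page-sized limit of 0xFFF.
addrs = gather_addresses(0x1000, 0x0004_0003_0002_0001, 16, 0xFFF)
assert addrs == [0x1001, 0x1002, 0x1003, 0x1004]
```

Only the base goes through the TLB, the offsets need one comparator per chunk and no extra TLB port.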

> > > I suggest we choose f) but make any reasonable effort to implement b).
> > i prefer "my" f) but b) has some problems to be solved...
> I guess we can combine your f) with a), b), c) or d) - or my f).
now i understand why you renamed the subject to "Re: More Alphabet Soup " :-P

The obvious goal of this thread is to SPECIFY how SIMD and scalar operations
interact, so i would not be happy if at the end we say "ok, so let's leave it
unspecified". This thread would then prove useless...

b) is now my goal, for obvious architectural/design reasons, and i try to find
simple solutions as well. You said you desired b) too and it looks
like the most suitable solution (nobody yelled at b) yet).
a) is the current approach but let's see how we can do b) and allow
bypasses without exploding the control logic's surface...

> > > Did you think about the new loadcons[p] I suggested?
> > ??? did i miss or forget something ?
> > can you explain and detail what you mean ?
> See my mail from Wed, 5 Jun 2002 22:47:29 +0200. The subject was
> "Partial Writes Considered Harmful".
thanks for the references.
Unfortunately, by the date, i see that it was "swallowed" during a
Netscape 6 crash... ScanDisk did the rest. do you feel how desperate i am ???
i'll try to find it on the seul.org archives.

ok, i found it on my HDD anyway, it was fortunately not wiped away for good.
it took minutes to find/scan, however. It was also buried in the "calling
conventions" thread...

I'll reply to this in another mail, for the sake of readability.

> > > On Mon, Jun 17, 2002 at 05:10:14AM +0200, Yann Guidon wrote:
> > > > Ben Franchuk wrote:
> > > > > Yann Guidon wrote:
> > > > <snip>
> > > > this is unrelated but you just gave me an idea for a 6th solution ("f)") :-)
> > > >  f) avoid bypass if the size of the written register is different from
> > > >     the required operand :-)
> > > Is that a variant of a), b), c) or d)?
> > no.
> > it's something completely different.
> They're almost orthogonal.
it's an "enhancement" on a) but it's useless if b) works as expected.

> > the idea behind this is that in a), the bypass network has to choose between
> > more data and mix them into a single word (i gave examples).
> >
> > my idea was to not do the bypass (and thus avoid the complex mux management
> > in the critical datapath) when a chunk size change is detected. the first
> > instruction would continue to go through the write pipeline and the second
> > (which would usually trigger the bypass) would wait until the register set
> > performs the partial write itself. This keeps the complex muxes out of the
> > critical datapath, it solves one problem at the cost of some more latencies
> > from time to time in "micro$oft-like" codes.
> That can be done with any variant.
probably but the goal is to make it useless.

> When the chunk sizes don't match,
> let the second instruction wait until the register set has performed
> the write. Whether it's a partial write (a), a masked write (b) or a
> sign-extended write (c) doesn't really matter.

a "clean" bypass rule is necessary, however. if there are special rules,
coding might get uselessly complex, too intel-like, like in P6 cores....

> BTW: if masking is done when the register is written, bypassing *is*
> possible unless the second instruction has wider operands.

my latest idea with b) is that "if" the masking is done at the EU output
(and at the register set output as well, because it works for most operations
:-D) then there is no need to check anything. the programming model is
simplified and the behaviour is clearly determined.

> > PS : for the good reason that i would like to rest a bit (and few
> > people would be able to decipher my attempts at being clear...),
> > i skipped the reasons why b) shifts the problem from the register set
> > to the main datapath.
> I'm aware of that anyway. But b) combined with your f) might be a good
> solution.
the point with the latest b) implementation details is that f) is useless.
the scheduling rules are not changed because the masking is done (mostly)
before the operation starts.

> > However, during a little pause, i got an interesting idea !
> >
> > Here is the problem : For reasons that i don't have time to detail,
> > the "Xbar" is a bunch of wires that are routed over the execution
> > units to keep distances small and avoid the big "stuff" that can be
> > found on the usual FC0 diagrams. This uses one or two more metal layers
> > but the surface and the distance are reduced.
> >
> > To optimise the design, the usual execution units are all in line, but
> > one half of the units are flipped (in an odd/even fashion) so neighbouring
> > units can "share" the latches of the input or the output of the Xbar.
> > This organisation comes from some experiments i made on precharacterized
> > cells, and i noticed that the FF take one half of the surface, while
> > the execution unit is a very long strip. Sharing the FFs between neighbours
> > can save 25% of the surface/speed/consumption/price...
> At the input of the EUs, at the output, or both?
both. a FF gets or sends values to/from the Xbar on the higher metal levels,
and collects or sends data in two opposite directions. In fact, this factors
the HW, no need for individual FF for input and output of each EU.

Here is a proposal/first picture :
    _______________________________________________________________________    read buses
   | _____|______________________|______________________|____________      |   write buses
   ||     |          |           |           |          |            |     |
   R7    in-> INC ->out<- ROP2 <-in-> ASU ->out<- SHL <-in-> POPC ->out<- IDIV

it's very incomplete, bypass and signal repeaters are not represented,
but making graphic illustrations is almost impossible for me (unless
you prefer i send scanned pictures...)
Note that due to its size, the IMUL is not inside the central datapath,
and the IDIV can be put on one extremity (it is likely to be smaller than
the IMUL unit).

> Note that you may introduce EU dependencies that way.
I don't see what you mean by "EU dependencies".

The only scheduling problem i have seen so far is that it becomes difficult to "chain"
EUs, for example chain the output of SHL to the input of ROP2. OTOH the gain
is meaningful, both in die space and speed. -25% for the main pipeline (ROP2,
INC, ASU, SHL, POPC) is something good. Removing one half of the FF from the
Xbar's CDP is really worth it.
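
A back-of-the-envelope model of where the 25% comes from. It assumes that all FF banks have the same size, that the FFs take one half of the surface (the figure from my experiments), and that the two banks at the ends of the row have no neighbour to share with. The function names are made up for this sketch.

```python
def ff_banks(n_units, shared):
    """Number of FF banks for a row of n_units execution units."""
    # Unshared: one input bank and one output bank per unit.
    # Shared: one bank per boundary between neighbours, plus the two ends.
    return n_units + 1 if shared else 2 * n_units

def surface_saving(n_units, ff_fraction=0.5):
    """Fraction of the total surface saved by sharing the FF banks."""
    before = ff_banks(n_units, shared=False)
    after = ff_banks(n_units, shared=True)
    return ff_fraction * (before - after) / before

for n in (5, 6, 50):
    print(n, round(surface_saving(n), 3))
```

Under these assumptions the five units named above give 20%, and the figure approaches 25% as the row gets longer, so 25% is better read as the upper bound.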

> [...]
> > One solution i propose for implementing b) is to "clear" the corresponding MSB
> > when the data is handed to the Xbar by the register set. Most instructions
> > (except ROP2 and INC with the "neg" and "not" operations) will return a 0
> > result (0+0=0, etc.). So there is less need to modify the Xbar_output units
> > and less wires to spread.
> Another exception is the IDU (0/0 = ?).
ooops. well, huh... then the EU output is explicitly cleared in that case,
and this special case must be handled by the unit...
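
Here is a small sketch of which operations keep the high part at zero when the operands are cleared at read time. It assumes the EU works chunk-wise on the whole word, with carries cut at chunk boundaries. The "nor" and "inc" entries are stand-ins picked for this sketch to represent the ROP2 and INC exceptions, and the division case is left out.

```python
WIDTH = 64

def chunkwise(op, a, b, bits):
    """Apply `op` to every `bits`-wide chunk independently (carries are cut)."""
    m = (1 << bits) - 1
    out = 0
    for shift in range(0, WIDTH, bits):
        out |= (op((a >> shift) & m, (b >> shift) & m) & m) << shift
    return out

OPS = {
    "add": lambda x, y: x + y,
    "xor": lambda x, y: x ^ y,
    "nor": lambda x, y: ~(x | y),   # ROP2-style function: 0,0 gives all ones
    "inc": lambda x, y: x + 1,      # INC-style operation: 0 becomes 1
}

def early_masked(name, a, b, bits):
    """Clear the high part of both operands at read time, then run the EU."""
    low = (1 << bits) - 1
    return chunkwise(OPS[name], a & low, b & low, bits)

def high_part_is_clear(name, a, b, bits):
    return early_masked(name, a, b, bits) >> bits == 0

for name in OPS:
    print(name, high_part_is_clear(name, 0xFFFF_FFFF_FFFF_FFFF, 1, 8))
```

The operations that fail this check are the ones whose output still has to be cleared explicitly.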

> For other drawbacks, see A) below.
> > Do you understand the trick ? instead of clearing the MSB of the results
> > at the output of the EUs (it's important to do this because the result is
> > needed for bypass directly on the Xbar, and the partial write of the register
> > set will not be enough), the MSB is cleared when the operands are read
> > (fewer places where the MSB is cleared, so fewer control signals).
> There are at least four approaches (capital letters this time):
you really like alphabet soup :-)

However, note that all approaches can be combined in one way or another.

> A) `early masking': mask off the high bits when the register is read
>    (before the instruction is issued).
>         + masking moved to register set
and fewer places where this must be done ---> less control logic and fewer long wires

>         - bypass impossible when first instruction has wider operands 1)
hhhmmm ???... i'll have to check that.

>         - requires at least 3 masking units (one per register read port)
i guess it's not the critical parameter and compared to others, it's even the
simplest one.

>         - some instructions need special handling (complex!)
which ?

>         1) Surprise! You need to mask the operands of the second instruction
>            but there is no masking unit inside the bypass.
if it's needed, we'll make it.

> B) `write masking': mask off the high bits when the result is
>    written to the register.
i think it was your first idea (or at least that's how i understood it,
and i later developed it).

>         + masking moved to register set
>         + requires only 2 masking units (one per register write port)
there is no big difference in practice, i guess.

But the real difference is that 2 instructions can use the 2 write ports
and they can use 2 different write sizes --> in practice, it's more
complex than A) because A) needs 1 mask control logic, while B) needs 2.

>         - bypass impossible when second instruction has wider operands
that's why i thought about moving it further down the pipeline...

B+f is possible but not satisfying. there is certainly something better.

> C) `late masking': store the chunk size in the register set (or
>    scoreboard) and mask off the high bits when the result is read from the
>    register (that is, before the *next* instruction that uses the value).
>         + masking moved to register set
>         - bypass impossible when second instruction has wider operands
>         - requires at least 3 masking units (one per register read port)
>         - requires extra `valid' bits that indicate the chunk size

where did that crazy idea come from ? ...
can't we just try to make something a kid can understand ?...

> D) move the problem to the EUs. This can be done easily in the IMU, but
>    there's no room left inside the ASU, for example. SHL is pretty tight,
>    too (I already violate the `6 gate rule' there).
then split SHL into 2 stages... given the long wires of your Omega network,
it won't be useless...

>         + bypass always possible
>         - heavy implementation
>         - not all EUs support it

i propose to let the Xbar "unit" perform a part of it.
it can mask off the MSB when it reads the register operands.
If the EU can't do it, then Xbar masks the EU output locally.

> E/F/G anyone? ;)
> If we can add a masking unit inside the bypass, ABC) will always be able
> to bypass results. But even if we can't, B) looks like the best solution
> so far.

not really because it still has problems with bypass (detecting special conditions).
Don't forget that without bypass capability, FC0 will be ... unacceptably slow.

i think we can already start to modify the manual to reflect these new
specifications. the R7 VHDL source will become simpler, and i have a better
view of the Xbar and scheduler architecture. Ain't it great ?...

>  Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>