Re: [f-cpu] (!) a few noteworthy things

hi !

Michael Riepe wrote:
> On Mon, Jun 17, 2002 at 03:31:15AM +0200, Yann Guidon wrote:
> [...]
> > - the SIMD flag still creates problems.
> > Partial writes to a register are handled but bypass conditions are
> > a major headache, and this has a big impact on the "zero flags".
> > We should not forget the potential troubles that this choice
> > can make on future architectures. Here are the existing possibilities :
> >  a) specify that the high part is unchanged
> >    (only the low byte/word/dword/etc. is updated)
> >   --> this is the current approach.
> - requires partial writes
on the register set, this is not much of a problem. However, it
becomes a problem for 2 things :
 - keeping the "zero" flags up to date (partial write of the flag + timing problems)
 - bypass (one unit's output is connected to another unit's input, but one part of the
   word must come from the register set... and this can become quite complex)
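
A rough behavioural sketch of what a) implies, in Python for illustration only. The function name and the size table are hypothetical and not taken from the F-CPU sources; the sizes follow the "byte/word/dword" wording above.

```python
MASK64 = (1 << 64) - 1

# Hypothetical size table: chunk name -> width in bits.
SIZE_BITS = {"byte": 8, "word": 16, "dword": 32, "qword": 64}

def partial_write(old: int, result: int, size: str) -> int:
    """Option a): replace only the low `size` bits of `old` with `result`.

    The high part of the destination register is left unchanged.
    """
    low = (1 << SIZE_BITS[size]) - 1
    high = ~low & MASK64
    return (old & high) | (result & low)
```

The same merge would have to happen on a bypassed operand: the low chunk comes from the producing unit, the rest from the register set, which is where the extra muxes come from.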

> - requires additional instructions for zero/sign extension
Isn't SHL meant to do that ?

btw, we don't have sign extension instructions because this SIMD/partial write
stuff is still not solved. i hope it will be solved cleanly soon.

> >  b) specify that the high part is cleared --> simpler solution
> + requires no partial writes
> + saves one instruction for zero extension
particularly important for pointers !!!

> + cheap to implement:
>         signal X, Y, Mask : std_ulogic_vector(63 downto 0);
>         ...
>         Mask <= (
>                 63 downto 32 => SIMD or U(2),
>                 31 downto 16 => SIMD or U(1),
>                 15 downto  8 => SIMD or U(0),
>                 others => '1'
>         );
>         -- note that Mask is available from the decoder
>         -- there's only an AND (or maybe MUX) inside the signal path
>         Y <= X and Mask;

that's where it hurts : it's in the critical datapath :-/
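
For readers who prefer to poke at the mask numerically, here is a small Python transcription of the quoted VHDL aggregate. It is only a sketch: the function names are made up, and how the decoder encodes operand sizes into U(2 downto 0) is not stated in this thread, so the code just takes the three bits as given.

```python
MASK64 = (1 << 64) - 1

def result_mask(simd: int, u: int) -> int:
    """Build the 64-bit Mask from the SIMD flag (0/1) and the 3-bit field U."""
    def field(hi: int, lo: int, bit: int) -> int:
        # Each bit range is all ones when SIMD or U(bit) is set, else all zeros.
        enabled = simd | ((u >> bit) & 1)
        width = hi - lo + 1
        return (((1 << width) - 1) << lo) if enabled else 0

    # 'others => 1' in the VHDL covers the low byte, bits 7 downto 0.
    return field(63, 32, 2) | field(31, 16, 1) | field(15, 8, 0) | 0xFF

def mask_result(x: int, simd: int, u: int) -> int:
    """Y <= X and Mask: clear the high part of a result (option b)."""
    return x & result_mask(simd, u)
```

With the SIMD flag set the mask is all ones, so the AND is a no-op; with it cleared, each U bit that is zero clears its part of the result.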

> >  c) specify that the high part is sign-extended
> >     (sign extension might create troubles like those of the
> >      current solution)
> + requires no partial writes
> + saves one instruction for sign extension
> - more complex than b) because there are multiple sign bits to
>   consider
that's what i noted : not a good solution.

> >  d) specify that the SIMD flag has no effect at all and the
> >    high part is updated with the rest of the word (just like a
> >    normal SIMD operation would do)
> + all the world is SIMD :)
... and "God is real, unless you declare it as a char"
 (ok, it's not my invention, but it goes well along your remark :-D)

> + requires no partial writes
> + even cheaper to implement than b)

> - requires additional instruction for zero/sign extension
it's needed anyway (at least compiler writers will want one,
and people will get fed up with fiddling with arithmetic shifts etc...)

don't you do one with your unit ?

> >  e) specify that the flag returns an "undefined/reserved" behaviour
> >    for the MSB (could be both dangerous and safe, it would force
> >    compilers to generate valid pointers all the time)
> + even cheaper to implement than b)
> - worst solution ever

> [...]
> e) will allow implementors to build F-CPUs that work like a), b), c), d),
> or any other way.
that's the obvious purpose (you seem to read in my mind ;-D)

> As soon as those versions exist, programmers will use
> this particular `feature' (trust me - they *will*),
i know this, too.

> and the resulting code will no longer be compatible between F-CPU versions.
> Therefore, we have to avoid e).
This was meant as a "temporary" version before the "F-CPU rev. 1" milestone.
Before that, compatibility is not ensured and binary compatibility will
not be enforced (because the opcodes won't be defined).

> Since I don't like a),
at least it won't disturb people who are used to non-RISC systems...
that's the behaviour i have seen on most existing multi-size computers
(68xx, 68K, inteloids etc...)

> and c) is more expensive than b),

> and d) is what we have in SIMD mode, I prefer b).
b) is what first came to my mind but i soon realised that there are
other possibilities, so i wanted to explore them all.

My sources and model implement a) but i have not chosen between b) and f) yet.
f) is closer to a), while b) shifts the problem from the register set to
the execution units. It's a tough decision and we have to consider a lot
of things, including future implementations...

> On the other hand, turning SIMD on unconditionally *is* tempting.
didn't you propose that ?... or did i misunderstand your request ?

> It would free one flag and streamline the instruction set (the s- prefix
> will no longer be needed). That is, my second choice is d).
that's a radical cleanup, right :-)

> What about f): keep the SIMD bit but make d) the default and b) optional
> behaviour.
you want to make a special case of e) ?

> That is, when the SIMD flag is cleared, a `conforming'
> F-CPU must either mask the result or trigger an `invalid instruction'
> trap (this can be handled inside the decoder).  From the design and
> specification point of view, this solution is much cleaner than e).
my intent for e) was to be completely ... open and as imprecise as possible,
so your proposal is certainly more "orthodox". but that doesn't
solve the whole problem at all (the programs would spend their time
in the exception routines...)

> I suggest we choose f) but make any reasonable effort to implement b).
i prefer "my" f) but b) has some problems to be solved...

> Did you think about the new loadcons[p] I suggested?
??? did i miss or forget something ?
can you explain and detail what you mean ?

> On Mon, Jun 17, 2002 at 05:10:14AM +0200, Yann Guidon wrote:
> > Ben Franchuk wrote:
> > > Yann Guidon wrote:
> > <snip>
> > this is unrelated but you just gave me an idea for a 6th solution ("f)") :-)
> >
> >  f) avoid bypass if the size of the written register is different from
> >     the required operand :-)
> Is that a variant of a), b), c) or d)?
it's something completely different.

the idea behind this is that in a), the bypass network has to choose between
more data sources and merge them into a single word (i gave examples).

my idea was to not do the bypass (and thus avoid the complex mux management
in the critical datapath) when a chunk size change is detected. the first
instruction would continue to go through the write pipeline and the second
(which would usually trigger the bypass) would wait until the register set
performs the partial write itself. This keeps the complex muxes out of the
critical datapath: it solves one problem at the cost of some extra latency
from time to time in "micro$oft-like" codes.
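
The decision described for f) can be sketched as a tiny Python model. Everything here (the `PendingWrite` record, the function name, the three string outcomes) is hypothetical and only meant to show the rule, not how the scheduler is actually built.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PendingWrite:
    reg: int   # destination register of the instruction still in the write pipeline
    size: int  # chunk size (in bits) that instruction writes

def operand_source(src_reg: int, src_size: int,
                   pending: Optional[PendingWrite]) -> str:
    """Decide where a source operand comes from under option f)."""
    if pending is None or pending.reg != src_reg:
        return "regfile"   # no dependency: read the register set normally
    if pending.size == src_size:
        return "bypass"    # same chunk size: the usual bypass is safe
    return "stall"         # size change: wait for the partial write to land
```

Only the last case costs extra cycles, and it should only show up in code that mixes operand sizes on the same register.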

>  Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>

PS : for the good reason that i would like to rest a bit (and few
people would be able to decipher my attempts at being clear...),
i skipped the reasons why b) shifts the problem from the register set
to the main datapath.

However, during a little pause, i got an interesting idea !

Here is the problem : For reasons that i don't have time to detail,
the "Xbar" is a bunch of wires that are routed over the execution
units to keep distances small and avoid the big "stuff" that can be
found on the usual FC0 diagrams. This uses one or two more metal layers
but the surface and the distance are reduced.

To optimise the design, the usual execution units are all in line, but
one half of the units are flipped (in an odd/even fashion) so neighbouring
units can "share" the latches of the input or the output of the Xbar.
This organisation comes from some experiments i made on precharacterized
cells, and i noticed that the FFs take one half of the surface, while
the execution unit is a very long strip. Sharing the FFs between neighbours
can save 25% of the surface/speed/consumption/price...

The "Xbar" entity can be seen as divided into 3 components :
 - Xbar_input latches the Xbar read and write lines, thus performing
    the bypass, and gives the operand to the left and right-hand units.
 - Xbar_output chooses between the left and right-hand unit's results
    and "buffers" the result on the Xbar result autobahn. In order to
    reduce wire lengths and reduce the need for high drives, an additional
    MUX (at the output of the FF) will choose between the local FF
    or the data that comes from the next Xbar_output. I have used this
    technique once and it gives good results on hand-placed semicustom ASIC.
 - Xbar_network connects the Register Set, all the Xbar_input and all the
    Xbar_output's together. it can be considered as the part that is routed
    on the high metal layers.

This is pretty easy to design and control, because the result MUX is straightforward
and spread over the whole datapath. However, with a 64-bit design, these control
signals might be slow and heavy to operate. The "result MUX" requires long lines
and high drive. Adding another control signal that selects "0" at the MSB requires 2x
more drive and this is my problem :-/

One solution i propose for implementing b) is to "clear" the corresponding MSB
when the data is handed to the Xbar by the register set. Most instructions
(except ROP2 and INC with the "neg" and "not" operations) will then return 0
in those bits (0+0=0, etc.). So there is less need to modify the Xbar_output units
and fewer wires to spread.

Do you understand the trick ? instead of clearing the MSB of the results
at the output of the EUs (it's important to do this because the result is
needed for bypass directly on the Xbar, and the partial write of the register
set will not be enough), the MSB is cleared when the operands are read
(fewer places where the MSB is cleared, so fewer control signals).
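
A quick Python check of the trick, as a sketch only: the helper names are invented, the chunk is fixed to the low byte, and only two bitwise operations are modelled. Addition is left out on purpose, since it also depends on where the carry chain is cut, which this sketch does not model.

```python
MASK64 = (1 << 64) - 1
LOW8 = 0xFF  # keep the low byte, clear the rest

def op_xor(a: int, b: int) -> int:
    return (a ^ b) & MASK64

def op_nor(a: int, b: int) -> int:
    # Stands in for the ROP2 functions that turn zeros into ones.
    return ~(a | b) & MASK64

def mask_at_read(op, a: int, b: int, keep: int = LOW8) -> int:
    """Clear the high part of the operands, then operate (the proposed trick)."""
    return op(a & keep, b & keep)

def mask_at_output(op, a: int, b: int, keep: int = LOW8) -> int:
    """Operate on full words, then clear the high part of the result."""
    return op(a, b) & keep
```

For XOR both orders give the same word, so masking at read is enough. For NOR the zeroed high part comes back as ones, which is the kind of case that needs "some more care" inside the datapath.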

With this trick, the Register set is super-simplified, but some more care
must be taken inside the datapath.

I hope that this explanation is not too confusing... i gotta make some more
drawings but i can't do that now because my computers are broken... :-(