
Re: [f-cpu] (!) a few noteworthy things



Yann Guidon wrote:
> 
> hi !
> 
> Michael Riepe wrote:
> > On Mon, Jun 17, 2002 at 03:31:15AM +0200, Yann Guidon wrote:
> > [...]
> > > - the SIMD flag still creates problems.
> > > Partial writes to a register are handled but bypass conditions are
> > > a major headache, and this has a big impact on the "zero flags".
> > > We should not forget the potential troubles that this choice
> > > can make on future architectures. Here are the existing possibilities :
> > >  a) specify that the high part is unchanged
> > >    (only the low byte/word/dword/etc. is updated)
> > >   --> this is the current approach.
> >
> > - requires partial writes
> on the register set, this is not much of a problem. However, it
> becomes a problem for 2 things :
>  - keeping the "zero" flags up to date (partial write of the flag + timing problems)

We should define 8-, 16-, 32- and 64-bit "zero" flags.
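
A minimal sketch of what per-size zero flags could look like, written as a behavioural model in Python rather than HDL. It assumes one flag per operand width, each telling whether the low 8, 16, 32 or 64 bits of the register are all zero. The name `zero_flags` and that interpretation are assumptions made for illustration; nothing in this thread fixes them.

```python
SIZES = (8, 16, 32, 64)          # operand widths discussed in the thread
MASK64 = (1 << 64) - 1           # registers are modelled as 64-bit values


def zero_flags(value):
    """Return {size: flag}, where flag is True if the low `size` bits are 0.

    Hypothetical helper: one zero flag per operand width.
    """
    value &= MASK64
    return {size: (value & ((1 << size) - 1)) == 0 for size in SIZES}


if __name__ == "__main__":
    # Only bit 8 is set, so the 8-bit view is zero but the wider views are not.
    print(zero_flags(0x100))
```

With flags like these, a partial write of the low byte only needs to refresh the flags whose width covers the written bits, but every wider flag still depends on bits that were not written, which is the timing problem mentioned above.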

>  - bypass (one unit's output is connected to another unit's input, but one part of the
>    word must come from the register set... and this can become quite complex)
>

I don't think it's complex: it's just overkill! We need to read the
register that will be written! Otherwise we can't compute anything that
depends on all the bits in the register. In e), if we check the SIMD
size to decide on the bypass, there is still the problem of the zero
flags, which are no longer correct. Or we need to treat it as a special
case with a longer cdp.
 
> > - requires additional instructions for zero/sign extension
> Isn't SHL meant to do that ?
> 
> btw, we don't have sign extension instructions because this SIMD/partial write
> stuff is still not solved. i hope it will be solved cleanly soon.
> 
> > >  b) specify that the high part is cleared --> simpler solution
> >
> > + requires no partial writes
> > + saves one instruction for zero extension
> particularly important for pointers !!!
>

???? Could you explain more ?
 
> > + cheap to implement:
> >
> >         signal X, Y, Mask : std_ulogic_vector(63 downto 0);
> >         ...
> >         Mask <= (
> >                 63 downto 32 => SIMD or U(2),
> >                 31 downto 16 => SIMD or U(1),
> >                 15 downto  8 => SIMD or U(0),
> >                 others => '1'
> >         );
> >         -- note that Mask is available from the decoder
> >         -- there's only an AND (or maybe MUX) inside the signal path
> >         Y <= X and Mask;
> 
> that's where it hurts : it's in the critical datapath :-/
> 
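
For readers who don't read VHDL, here is a rough behavioural model of the mask in Python. It assumes `U` is a 3-bit field where bit i corresponds to `U(i)` in the snippet above, so that 8/16/32/64-bit operations would use 0b000, 0b001, 0b011 and 0b111. That encoding is inferred from the snippet, not stated in it, and the function names are hypothetical.

```python
def result_mask(simd, u):
    """Model of Mask: which result bits survive for a given SIMD flag and U.

    simd: bool, the SIMD flag
    u:    int, 3-bit size field (bit i stands for U(i); assumed encoding)
    """
    mask = 0x00000000000000FF                      # others => '1' (low byte)
    if simd or (u >> 0) & 1:
        mask |= 0x000000000000FF00                 # bits 15 downto 8
    if simd or (u >> 1) & 1:
        mask |= 0x00000000FFFF0000                 # bits 31 downto 16
    if simd or (u >> 2) & 1:
        mask |= 0xFFFFFFFF00000000                 # bits 63 downto 32
    return mask


def apply_mask(x, simd, u):
    """Y <= X and Mask: clear the high part of a non-SIMD result."""
    return x & result_mask(simd, u)


if __name__ == "__main__":
    print(hex(apply_mask(0x1234567890ABCDEF, False, 0b001)))
```

The model only shows the logic function. It says nothing about the extra gate delay in the datapath, which is the actual objection here.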
> > >  c) specify that the high part is sign-extended
> > >     (sign extension might create troubles like those of the
> > >      current solution)
> > + requires no partial writes
> > + saves one instruction for sign extension
> > - more complex than b) because there are multiple sign bits to
> >   consider
> that's what i noted : not a good solution.
> 
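
To make "multiple sign bits" concrete, here is a small Python model of what c) would have to compute. The sign bit sits at position 7, 15, 31 or 63 depending on the operand size, so the hardware needs to pick one of several bits before it can fill the high part. The function name and the 64-bit register width are assumptions for illustration.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value, size):
    """Sign-extend the low `size` bits of `value` to a 64-bit result.

    Hypothetical model of option c); `size` is 8, 16, 32 or 64.
    """
    low = value & ((1 << size) - 1)     # keep only the operand's own bits
    sign = 1 << (size - 1)              # position of the sign bit for this size
    # XOR/subtract trick: flips the sign bit, then subtracts its weight,
    # which propagates the sign into all higher bits.
    return ((low ^ sign) - sign) & MASK64


if __name__ == "__main__":
    for size in (8, 16, 32):
        print(size, hex(sign_extend(1 << (size - 1), size)))
```

Compared with b), the high part is no longer a constant but depends on a data bit whose position changes with the size, which is where the extra cost comes from.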
> > >  d) specify that the SIMD flag has no effect at all and the
> > >    high part is updated with the rest of the word (just like a
> > >    normal SIMD operation would do)
> > + all the world is SIMD :)
> ... and "God is real, unless you declare it as a char"
>  (ok, it's not my invention, but it goes well along your remark :-D)
> 
> > + requires no partial writes
> > + even cheaper to implement than b)
> sure...
> 
> > - requires additional instruction for zero/sign extension
> it's needed anyway (at least compiler writers will want one,
> and people will get fed up with fiddling with arithmetic shifts etc...)
> 
> don't you do one with your unit ?
> 
> > >  e) specify that the flag returns an "undefined/reserved" behaviour
> > >    for the MSB (could be both dangerous and safe, it would force
> > >    compilers to generate valid pointers all the time)
> > + even cheaper to implement than b)
> > - worst solution ever
> sure.
> 
> > [...]
> > e) will allow implementors to build F-CPUs that work like a), b), c), d),
> > or any other way.
> that's the obvious purpose (you seem to read in my mind ;-D)
> 
> > As soon as those versions exist, programmers will use
> > this particular `feature' (trust me - they *will*),
> i know this, too.
> 
> > and the resulting code will no longer be compatible between F-CPU versions.
> > Therefore, we have to avoid e).
> This was meant as a "temporary" version before the "F-CPU rev. 1" milestone.
> Before that, compatibility is not ensured and binary compatibility will
> not be enforced (because the opcodes won't be defined).
>

Personally, I think we shouldn't try to keep binary compatibility.
 
> > Since I don't like a),
> at least it won't disturb people who are used to non-RISC systems...
> that's the behaviour i have seen on most existing multi-size computers
> (68xx, 68K, inteloids etc...)
> 
> > and c) is more expensive than b),
> obviously.
> 
> > and d) is what we have in SIMD mode, I prefer b).
> b) is what first came to my mind but i soon realised that there are
> other possibilities, so i wanted to explore them all.
> 
> My sources and model implement a) but i did not choose between b) and f).
> f) is closer to a) and b) shifts the problem from the register set to
> the execution units. It's a tough decision and we have to consider a lot
> of things, including future implementations...
> 
> > On the other hand, turning SIMD on unconditionally *is* tempting.
> :-)
> didn't you propose that ?... or i misunderstood your request ?
> 
> > It would free one flag and streamline the instruction set (the s- prefix
> > will no longer be needed). That is, my second choice is d).
> that's a radical cleanup, right :-)
> 
> > What about f): keep the SIMD bit but make d) the default and b) optional
> > behaviour.
> you want to make a special case of e) ?
> 
> > That is, when the SIMD flag is cleared, a `conforming'
> > F-CPU must either mask the result or trigger an `invalid instruction'
> > trap (this can be handled inside the decoder).  From the design and
> > specification point of view, this solution is much cleaner than e).
> my intent for e) was to be completely ... open and as imprecise as possible
> so your proposition is certainly more "orthodox". but that doesn't
> solve the whole problem at all (the programs would spend their time
> in the exception routines...)
> 
> > I suggest we choose f) but make any reasonable effort to implement b).
> i prefer "my" f) but b) has some problems to be solved...
> 
> > Did you think about the new loadcons[p] I suggested?
> ??? did i miss or forget something ?
> can you explain and detail what you mean ?
> 
> > On Mon, Jun 17, 2002 at 05:10:14AM +0200, Yann Guidon wrote:
> > > Ben Franchuk wrote:
> > > > Yann Guidon wrote:
> > > <snip>
> > > this is unrelated but you just gave me an idea for a 6th solution ("f)") :-)
> > >
> > >  f) avoid bypass if the size of the written register is different from
> > >     the required operand :-)
> >
> > Is that a variant of a), b), c) or d)?
> no.
> it's something completely different.
> 
> the idea behind this is that in a), the bypass network has to choose between
> more data and mix them into a single word (i gave examples).
> 
> my idea was to not do the bypass (and thus avoid the complex mux management
> in the critical datapath) when a chunk size change is detected. the first
> instruction would continue to go through the write pipeline and the second
> (which would usually trigger the bypass) would wait until the register set
> performs the partial write itself. This keeps the complex muxes out of the
> critical datapath, it solves one problem at the cost of some more latencies
> from time to time in "micro$oft-like" codes.
> 
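
The bypass rule in f) can be sketched as a tiny decision function. This is a behavioural model in Python, not a description of the FC0 scheduler: the function names and the `write_latency` parameter (extra cycles until the register set has completed the partial write) are hypothetical.

```python
def can_bypass(written_size, operand_size):
    """f): bypass only when producer and consumer use the same chunk size."""
    return written_size == operand_size


def extra_wait_cycles(written_size, operand_size, write_latency=1):
    """Cycles the consumer waits beyond the normal bypass case.

    write_latency is an assumed figure for how long the register set needs
    to perform the partial write; the thread does not give a number.
    """
    if can_bypass(written_size, operand_size):
        return 0                 # result is forwarded directly on the Xbar
    return write_latency         # consumer reads the merged word from the register set


if __name__ == "__main__":
    print(extra_wait_cycles(64, 64))   # same size: bypass, no wait
    print(extra_wait_cycles(8, 64))    # size change: wait for the partial write
```

The point of the rule is that the merge of old and new bits happens only in the register set, so no mixing mux is needed on the bypass path.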

Nice, but there is still a problem with the zero flag.

nicO

> >  Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
> WHYGEE
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> PS : for the good reason that i would like to rest a bit (and few
> people would be able to decipher my attempts at being clear...),
> i skipped the reasons why b) shifts the problem from the register set
> to the main datapath.
> 
> However, during a little pause, i got an interesting idea !
> 
> Here is the problem : For reasons that i don't have time to detail,
> the "Xbar" is a bunch of wires that are routed over the execution
> units to keep distances small and avoid the big "stuff" that can be
> found on the usual FC0 diagrams. This uses one or two more metal layers
> but the surface and the distance are reduced.
> 
> To optimise the design, the usual execution units are all in line, but
> one half of the units are flipped (in an odd/even fashion) so neighbouring
> units can "share" the latches of the input or the output of the Xbar.
> This organisation comes from some experiments i made on precharacterized
> cells, and i noticed that the FF take one half of the surface, while
> the execution unit is a very long strip. Sharing the FFs between neighbours
> can save 25% of the surface/speed/consumption/price...
> 
> The "Xbar" entity can be seen as divided into 3 components :
>  - Xbar_input latches the Xbar read and write lines, thus performing
>     the bypass, and gives the operand to the left and right-hand units.

You really want to use "latches" ??? :(

>  - Xbar_output chooses between the left and right-hand unit's results
>     and "buffers" the result on the Xbar result autobahn. In order to
>     reduce wire lengths and reduce the need for high drives, an additional
>     MUX (at the output of the FF) will choose between the local FF
>     or the data that comes from the next Xbar_output. I have used this
>     technique once and it gives good results on hand-placed semicustom ASIC.
>  - Xbar_network connects the Register Set, all the Xbar_input and all the
>     Xbar_output's together. it can be considered as the part that is routed
>     on the high metal layers.
> 
> This is pretty easy to design and control, because the result MUX is straightforward
> and spread over the whole datapath. However, with a 64-bit design, these control
> signals might be slow and heavy to operate. The "result MUX" requires long lines
> and high drive. Adding another control signal that selects "0" at the MSB requires 2x
> more drive and this is my problem :-/
> 
> One solution i propose for implementing b) is to "clear" the corresponding MSB
> when the data is handed to the Xbar by the register set. Most instructions
> (except ROP2 and INC with the "neg" and "not" operations) will return a 0
> result (0+0=0, etc.). So there is less need to modify the Xbar_output units
> and less wires to spread.
> 
> Do you understand the trick ? instead of clearing the MSB of the results
> at the output of the EUs (it's important to do this because the result is
> needed for bypass directly on the Xbar, and the partial write of the register
> set will not be enough), the MSB is cleared when the operands are read
> (less places where the MSB is cleared, so less control signals).
> 
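
A quick way to see the trick is a Python model of "clear the high part when the operands are read". The sketch below uses three bitwise operations plus NOT; the names are hypothetical, and the choice of operations is for illustration only. An add is deliberately left out, because whether a carry can leave the low chunk depends on how the adder cuts carries, which this message does not specify.

```python
MASK64 = (1 << 64) - 1


def read_operand(reg, size):
    """Register read for a non-SIMD op: the part above `size` bits is cleared."""
    return reg & ((1 << size) - 1)


def high_part(value, size):
    """Bits of a 64-bit value above the low `size` bits."""
    return (value & MASK64) >> size


# Operations applied to operands that were already cleared on read.
OPS = {
    "and": lambda a, b: a & b,
    "or":  lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
    "not": lambda a, b: ~a & MASK64,   # unary; the second operand is ignored
}


def needs_output_clear(op, a, b, size):
    """True if the result has non-zero bits above `size` despite cleared inputs."""
    ra, rb = read_operand(a, size), read_operand(b, size)
    return high_part(OPS[op](ra, rb), size) != 0


if __name__ == "__main__":
    a, b = 0xDEADBEEF12345678, 0x0123456789ABCDEF
    for op in OPS:
        print(op, needs_output_clear(op, a, b, 16))
```

In this model only the inverting operation produces a non-zero high part, which matches the exceptions listed above: those cases still need an explicit clear somewhere after the execution unit.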
> With this trick, the Register set is super-simplified, but some more care
> must be taken inside the datapath.
> 
> I hope that this explanation is not too confusing... i gotta make some more
> drawings but i can't do that now because my computers are broken... :-(
> *************************************************************
> To unsubscribe, send an e-mail to majordomo@seul.org with
> unsubscribe f-cpu       in the body. http://f-cpu.seul.org/