
Re: More Alphabet Soup (was: [f-cpu] (!) a few noteworthy things)



hi,

Again, i don't know how long it will take me to write this,
or you to read it... Anyway, don't be afraid... or we'll never
do anything at all...

Michael Riepe wrote:
> On Wed, Jun 19, 2002 at 02:24:40AM +0200, Yann Guidon wrote:
> > The Health Bureau says : "don't read this post unless you have
> > a full bottle of aspirin next to you ! Now you are warned !".
> *broad evil grin* :)
i guess i scared a lot of people with that ...

> [...]
> > > > > - requires additional instructions for zero/sign extension
> > > > Isn't SHL meant to do that ?
> > > Or any other unit. The omega shifter can't do it; if we want it, we need
> > > extra code.
> > I don't think Omega is the best choice... When i find time, i'll try
> > to program another strategy where the wire length is shorter in the
> > Critical DataPath.
> If you can find one where wire length doesn't explode in the later stages...
if you include that in your construction rules, this will be ok.
My "design rule" says to not cross more than 16 wires per shift...

> The nicest property of the omega net is that all stages are equal
> (except the control logic).
Omega can do cut & paste and thus, all paths are equal.
However it is suboptimal because the paths are 2x longer
than the optimal path.

In the construction i want to use, there is a first stage
made of 2 layers of 4-mux, followed by a 2nd stage made of
3 layers of 3-mux. The first stage does shifts up to 16 bits
in either direction, the second stage just handles the "long
wires" and performs shifts of +1, 0 or -1 blocks of 16 wires.
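
For illustration, here is a minimal behavioural sketch of that two-stage construction, in Python rather than VHDL. Several things are assumptions and not part of the design above: the 64-bit width, the split of the first stage into a 0..3 layer and a 0/4/8/12 layer, and all the names. Only left shifts are modelled, so the "-1 block" input of the 3-mux layers is never selected.

```python
WIDTH = 64                    # assumed register width
MASK = (1 << WIDTH) - 1
BLOCK = 16                    # one "block" of long wires

def mux_layer(word, choice, step):
    """One mux layer: select the input shifted left by choice * step bits."""
    return (word << (choice * step)) & MASK

def shift_left(word, n):
    """Logical left shift by n (0..63) through the two stages."""
    fine, blocks = n % BLOCK, n // BLOCK
    # Stage 1: two layers of 4-input muxes, covering 0..15 bits in total.
    word = mux_layer(word, fine % 4, 1)      # 0, 1, 2 or 3 bits
    word = mux_layer(word, fine // 4, 4)     # 0, 4, 8 or 12 bits
    # Stage 2: three layers of 3-input muxes; each moves 0 or +1 block here.
    for layer in range(3):
        word = mux_layer(word, 1 if layer < blocks else 0, BLOCK)
    return word
```

In this model no single layer moves a bit by more than one block of 16, which is the point of keeping the long wires in the second stage.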


> > Concerning the latency, it seems obvious that, past a certain point,
> > it should be pipelined. Though i'm not sure whether all the control logic
> > can keep up ...
> A pipelined SHL would be more difficult to write but should be
> possible. But please let's keep the 1-stage version for now.
is there any reason not to ?

> [...]
> > > If bypassed values don't have to be zero-extended (as in your f) proposal),
> > > we can include it into the write ports of the register set. And it's only
> > > a single AND (or MUX).
> > that's what i explain later : the zero-extension can not happen at that level,
> > but at (or before) the EU's outputs. We don't want to have to check whether the data
> > granularity changes between two depending operations... and the bypassed value
> > MUST be the same as what is written to the Register Set.
> 
> The bypassed value need NOT be the same. Only the valid part must be
> identical, the rest is a `don't care'. If it's not masked in the bypass,
> it will be masked after the second instruction (given that the second
> instruction does not use bigger chunks).

ok but it's only one half of the problem. when the chunks are bigger,
it's another story...
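
A small worked example of both halves, as a Python sketch. It assumes a 64-bit register, a 16-bit first instruction and plain additions; the helper names are hypothetical.

```python
REG_MASK = (1 << 64) - 1

def add(a, b):
    """Full-width adder; the chunk size only decides which result bits are valid."""
    return (a + b) & REG_MASK

def zero_extend(value, bits):
    """Keep the low `bits` bits and clear everything above them."""
    return value & ((1 << bits) - 1)

# First instruction: a 16-bit add whose carry spills into bit 16.
raw = add(0xFFFF, 0x0001)            # unmasked EU output
stored = zero_extend(raw, 16)        # what the register set receives

# Second instruction is also 16 bits wide: the garbage above bit 15 is harmless.
same_width_bypass = zero_extend(add(raw, 5), 16)
same_width_regset = zero_extend(add(stored, 5), 16)

# Second instruction is 32 bits wide: the unmasked bypass value leaks bit 16.
wider_bypass = zero_extend(add(raw, 5), 32)
wider_regset = zero_extend(add(stored, 5), 32)
```

With equal or smaller chunks the bypassed and stored paths agree; with a bigger chunk they differ, which is the "other half" of the problem.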

> The granularity check isn't hard to do. Let U1 and U2 be the decoded
> `size vectors' (that is, "000", "001", "011" or "111") and SIMD1 and
> SIMD2 be the SIMD flags of the first and second instruction, respectively,
> then bypassing without masking is permitted if
> 
>         not (U1 or (2 downto 0 => SIMD1))
>         and (U2 or (2 downto 0 => SIMD2)) = "000"
> 
> It's MUCH harder to check for the case whether a bypass is appropriate
> at all (compare register numbers and so on)!

ok, we can check whether there is a size change. and then ?
the "simple" solution is to "hold/stall" the decode pipeline,
but that is not a fun thought...
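
The size-change check quoted above can be sketched in Python as follows. The grouping of the expression as ((not V1) and V2) = "000" is my reading of it, and the mapping of the size vectors to 8/16/32/64-bit chunks is an assumption; the function names are hypothetical.

```python
# Decoded size vectors from the quoted text; the widths are an assumed mapping.
SIZE_VECTOR = {8: 0b000, 16: 0b001, 32: 0b011, 64: 0b111}

def valid_mask(u, simd):
    """U or (2 downto 0 => SIMD): a set SIMD flag marks every chunk as valid."""
    return u | (0b111 if simd else 0b000)

def can_bypass_unmasked(u1, simd1, u2, simd2):
    """True when the second instruction reads nothing the first left invalid."""
    return (~valid_mask(u1, simd1) & valid_mask(u2, simd2) & 0b111) == 0
```

For example, `can_bypass_unmasked(SIZE_VECTOR[32], False, SIZE_VECTOR[16], False)` holds, while swapping the two sizes does not. What to do when the check fails (stall, or mask in the bypass) is exactly the open question here.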

> [...]
> > MAYBE.....
> >
> > The SIMD flag could be turned into a "pointer" flag ???????
<snip>
> > but i fear it's even more complex and confusing than before...
> Yep, and it has severe semantic problems.
> 
> > So there are some instructions that become useless on the memory side...
> > argh. i wish i didn't have this idea.
> > forget it.
> 
> Ok. It's better that way.

pffffiuuh.... ideas are difficult to control...

> [...]
> > in my vision, a pointer is held in a whole register, whatever the size
> > of both. a pointer has no defined size, just like the register.
> 
> Yep. In practice, the number of valid bits in a pointer will be
> determined by the hardware (number of address lines) and/or the
> operating system (TLB miss handler).

This should be written in the manual...

> > by not binding the pointer format to the existing data formats
> > (char, int, long int...), it becomes difficult to do pointer arithmetics
> > with "common" arithmetic operations.
> The answer is, of course: use SIMD mode with maximum chunk size. Since
> it is identical to non-SIMD mode with the same chunk size...
no, because not all F-CPUs are 64 bits wide...

> [...port sharing between EUs...]
> > > Note that you may introduce EU dependencies that way.
> > I don't see what you mean by "EU dependencies".
> If two EUs share a port, you can use only one of them at a time. This
> currently doesn't matter for input ports (because we build a 1-issue CPU)
> but is important for output ports - results MUST NOT arrive at the same
> time, and the scheduler will have to take care of that. Yet another
> special case to handle...

i went to a japanese restaurant today and made a few drawings on my papers...
===> it's not a problem.

One parameter is that we can group units that have the same latency :
the current ROP2 and INC units are rather similar and can share the same
"output" port, which can be further simplified. This port must however
be able to write to either of R7's write ports (if a preceding ASU
operation was started, for example, ROP2/INC has to use the alternate
write port).
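
To illustrate why grouping by latency helps, here is a minimal Python sketch that lists the cycles in which two results would reach one shared output port. The latency values and the function name are assumptions for illustration, not FC0 figures.

```python
from collections import Counter

def port_collisions(issued, latency):
    """Return the cycles in which more than one result reaches the shared port.

    issued  -- list of (issue_cycle, unit) pairs, at most one per cycle
    latency -- dict mapping unit name to its latency in cycles
    """
    arrivals = Counter(cycle + latency[unit] for cycle, unit in issued)
    return sorted(cycle for cycle, count in arrivals.items() if count > 1)

# Assumed latencies, for illustration only.
LATENCY = {"ROP2": 1, "INC": 1, "ASU": 2}

# Units with equal latency, issued one per cycle, can never collide.
same_latency = port_collisions([(0, "ROP2"), (1, "INC"), (2, "ROP2")], LATENCY)

# A slower unit followed by a faster one lands on the same cycle.
mixed_latency = port_collisions([(0, "ASU"), (1, "INC")], LATENCY)
```

With one instruction issued per cycle and a single common latency, the arrival cycles are all distinct, so the shared port needs no arbitration; the collision only appears once units of different latency feed the same port.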

Another problem arises, however : i've been very lax about the
"variable latency" of the units, such as additional/optional pipe stages
for some units. putting a non-shareable "output" in the middle of
some units might be difficult in practice. we'll probably have to abandon
the idea of min/max/sort/etc. in the INC unit, as well as 16-bit and 32-bit
combination in ROP2, and the 8-bit 1-cycle latency of the ASU.
On the good side, the latency decoder is simplified...

> [...]
> > > Another exception is the IDU (0/0 = ?).
> > ooops. well, huh... then the EU output is explicitly cleared in that case,
> > and this special case must be handled by the unit...
> 
> Ok. I suppose the IDU won't crash and burn when the divisor is zero;
> it just will produce meaningless results (which will be masked).

let's hope so and specify it now ;-P

> > > For other drawbacks, see A) below.
> > >
> > > > Do you understand the trick ? instead of clearing the MSB of the results
> > > > at the output of the EUs (it's important to do this because the result is
> > > > needed for bypass directly on the Xbar, and the partial write of the register
> > > > set will not be enough), the MSB is cleared when the operands are read
> > > > (less places where the MSB is cleared, so less control signals).
> > >
> > > There are at least four approaches (capital letters this time):
> > you really like alphabet soup :-)
> >
> > However, note that all approaches can be combined in a way or another.
> >
> > > A) `early masking': mask off the high bits when the register is read
> > >    (before the instruction is issued).
> > >
> > >         + masking moved to register set
> > and less places where this must be done ---> less control logic and less long wires
> >
> > >         - bypass impossible when first instruction has wider operands 1)
> > hhhmmm ???... i'll have to check that.
> >
> > >         - requires at least 3 masking units (one per register read port)
> > i guess it's not the critical parameter and compared to others, it's even the
> > simplest one.
> >
> > >         - some instructions need special handling (complex!)
> > which ?
> 
> You already mentioned them - ROP2 (xnor, orn), INC - and IDU.

is SHL safe ?
then, MSB clearing is performed on the R7 read ports and on the "output" port
shared by INC and ROP2 (add to that that they have the same latency,
and you understand why they are grouped together ;-)
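
A short Python sketch of why xnor, orn and INC need the extra mask at their output even when the operands were already cleared at the read port. It assumes 64-bit registers and 8-bit operands; the helper names are hypothetical.

```python
REG_MASK = (1 << 64) - 1

def low_mask(bits):
    return (1 << bits) - 1

def xnor(a, b):
    return ~(a ^ b) & REG_MASK

def orn(a, b):
    return (a | ~b) & REG_MASK

def inc(a):
    return (a + 1) & REG_MASK

# 8-bit operands, already zero-extended when read from the register set.
a = 0x12 & low_mask(8)
b = 0x34 & low_mask(8)

and_raw = a & b          # high bits stay clear: no output mask needed
xnor_raw = xnor(a, b)    # 0 xnor 0 = 1, so every high bit is set
orn_raw = orn(a, b)      # inverting a cleared high bit sets it
inc_raw = inc(0xFF)      # the carry leaves the 8-bit chunk

xnor_masked = xnor_raw & low_mask(8)
inc_masked = inc_raw & low_mask(8)
```

Operations that map zeroes to zeroes (and, or, xor) keep the high part clean for free; the ones that invert an operand or propagate a carry do not.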

> > >         1) Surprise! You need to mask the operands of the second instruction
> > >            but there is no masking unit inside the bypass.
> > if it's needed, we'll make it.
> 
> If we put masks into the bypass and the register write ports, the
> whole discussion is closed. With that, we can always bypass, and we
> can always zero-extend.

According to my remark above, masking at 2 locations only
is possible. no scheduling trick, i guess it's the best i can do.

> > > B) `write masking': mask off the high bits when the result is
> > >    written to the register.
> > i think it was your first idea (or at least that's how i understood it,
> > and i later developed it).
> Right.
working with you is such a pleasure, dear colleague :-)

> > >         + masking moved to register set
> > >         + requires only 2 masking units (one per register write port)
> > there is no big difference in practice, i guess.
> >
> > But the real difference is that 2 instructions can use the 2 write ports
> > and they can use 2 different write sizes --> in practice, it's more
> > complex than A) because A) needs 1 mask control logic, while B) needs 2.
> 
> I suppose that every masking unit has its own decoder logic anyway,
> in order to reduce the number of wires. You only send the SIMD and U
> bits to it, and the rest is done in place.
if you can do it only once (as on the read port), it certainly is easier
to understand and implement :-)

> > >         - bypass impossible when second instruction has wider operands
> > that's why i thought about moving it further down the pipeline...
> > B+f is possible but not satisfying. there is certainly something better.
> B+b, with an optional masking unit inside the bypass :)
i choose A+optional masking at some EU output :-)

> > > C) `late masking': store the chunk size in the register set (or
> > >    scoreboard) and mask off the high bits when the result is read from the
> > >    register (that is, before the *next* instruction that uses the value).
> > >         + masking moved to register set
> > >         - bypass impossible when second instruction has wider operands
ouch.
> > >         - requires at least 3 masking units (one per register read port)
> > >         - requires extra `valid' bits that indicate the chunk size
ouch^3 !!!

> > where did that crazy idea come from ? ...
> Um. Hmm. Where did... outta my crazy head? ;)
ohgottohgottohgott !

> > can't we just try to make something a kid can understand ?...
> Ok, let's build a Turing machine ;)
can this run a Linux kernel ? :-P

> > > D) move the problem to the EUs. This can be done easily in the IMU, but
> > >    there's no room left inside the ASU, for example. SHL is pretty tight,
> > >    too (I already violate the `6 gate rule' there).
> > then split SHL into 2 stages... given the long wires of your Omega network,
> > it won't be useless...
> >
> > >         + bypass always possible
> > >         - heavy implementation
> > >         - not all EUs support it
> > i propose to let the Xbar "unit" perform a part of it.
> > it can mask off the MSB when it reads the register operands.
> > If the EU can't do it, then Xbar masks the EU output locally.
> I thought you wanted to reduce the complexity of the Xbar?
i think i have come to a pretty simple implementation and masking
can be performed at 2 locations (R7 output and INC/ROP2 Xbar tap).

> > > E/F/G anyone? ;)
> > >
> > > If we can add a masking unit inside the bypass, ABC) will always be able
> > > to bypass results. But even if we can't, B) looks like the best solution
> > > so far.
> >
> > not really because it still has problems with bypass (detecting special conditions).
> > Don't forget that without bypass capability, FC0 will be ... unacceptably slow.
> 
> If you want maximum speed, always use full words (with or without SIMD).
sure. but it makes the coding rules more complex.

as of today, there is one main restriction :
     the stall/delay between two dependent instructions is
     only the latency of the execution unit used.
If we make it more complex, optimisation rules will become too complex.
FC0 can already be programmed/seen/optimised like a 2-way or 3-way
superscalar RISC machine, that's already enough.
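
That rule can be written down as a tiny scheduling model. This is a Python sketch under stated assumptions: single issue, in order, results bypassed as soon as the unit's latency has elapsed, and made-up latency values; it is not a description of the real FC0 scheduler.

```python
def issue_cycles(program, latency):
    """Single-issue, in-order model: an instruction waits only for its operands.

    program -- list of (unit, dest, sources) tuples
    latency -- dict mapping unit name to its latency in cycles
    Returns the issue cycle of each instruction.
    """
    ready = {}          # register -> first cycle its value can be bypassed
    cycle = 0
    issued = []
    for unit, dest, sources in program:
        cycle = max([cycle] + [ready.get(src, 0) for src in sources])
        issued.append(cycle)
        ready[dest] = cycle + latency[unit]
        cycle += 1      # at most one instruction per cycle
    return issued

LATENCY = {"ROP2": 1, "ASU": 2}   # assumed values, for illustration only

program = [
    ("ASU",  "r1", ["r2", "r3"]),   # r1 = r2 + r3
    ("ROP2", "r4", ["r1", "r5"]),   # depends on r1: waits for the ASU latency
    ("ROP2", "r6", ["r4", "r1"]),   # depends on r4: ROP2 latency, no extra stall
]
cycles = issue_cycles(program, LATENCY)
```

The only number a programmer needs per instruction is the latency of its unit; a size-dependent bypass penalty would add a second term to the `max` and make hand optimisation harder.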

ok, let's continue this thread on the other mail ;-D

>  Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~