
Re: More Alphabet Soup (was: [f-cpu] (!) a few noteworthy things)

On Thu, Jun 20, 2002 at 04:32:21AM +0200, Yann Guidon wrote:
> > A pipelined SHL would be more difficult to write but should be
> > possible. But please let's keep the 1-stage version for now.
> Is there any reason not to?

Not at the moment.

> > The granularity check isn't hard to do. Let U1 and U2 be the decoded
> > `size vectors' (that is, "000", "001", "011" or "111") and SIMD1 and
> > SIMD2 be the SIMD flags of the first and second instruction, respectively,
> > then bypassing without masking is permitted if
> > 
> >         (not (U1 or (2 downto 0 => SIMD1))
> >          and (U2 or (2 downto 0 => SIMD2))) = "000"
> > 
> > It's MUCH harder to check for the case whether a bypass is appropriate
> > at all (compare register numbers and so on)!
> OK, we can check whether there is a size change. And then?
> The "simple" solution is to "hold/stall" the decode pipeline,
> but this thought is not much fun...

Stalling for 1 cycle is better than scheduling an explicit "zero
extend" instruction that will take at least 2 cycles (one for
operating, and one for another result bypass).
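The bypass condition quoted above can be modelled in a few lines of Python to check it against concrete cases. This is a minimal sketch, assuming the size-vector encoding given in the thread (000 = 8-bit up to 111 = 64-bit) and treating SIMD mode as covering the whole register; the function names are hypothetical, not taken from any F-CPU source.

```python
# Decoded size vectors as 3-bit integers (assumed encoding, as in the
# thread): 0b000 = 8-bit, 0b001 = 16-bit, 0b011 = 32-bit, 0b111 = 64-bit.
SIZE_VECTORS = {8: 0b000, 16: 0b001, 32: 0b011, 64: 0b111}


def effective_width(u: int, simd: bool) -> int:
    """Size vector actually covered by an operand or a result.

    In SIMD mode the whole register is used, so the vector is forced to
    all ones, like (2 downto 0 => SIMD) in the VHDL expression.
    """
    return u | (0b111 if simd else 0b000)


def can_bypass_unmasked(u1: int, simd1: bool, u2: int, simd2: bool) -> bool:
    """True if instruction 1's result may be forwarded to instruction 2
    without zero-extension, i.e. every chunk that instruction 2 reads
    was written by instruction 1."""
    written = effective_width(u1, simd1)
    read = effective_width(u2, simd2)
    return (~written & read & 0b111) == 0
```

When can_bypass_unmasked() returns False, the consumer reads chunks the producer never wrote, which is exactly the case where the decoder has to stall or a mask has to be applied.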

> > > by not binding the pointer format to the existing data formats
> > > (char, int, long int...), it becomes difficult to do pointer arithmetic
> > > with "common" arithmetic operations.
> > The answer is, of course: use SIMD mode with maximum chunk size. Since
> > it is identical to non-SIMD mode with the same chunk size...
> No, because not all F-CPUs are 64-bit wide...

I said: MAXIMUM chunk size - that is, full-word. If that isn't possible,
use 64-bit mode instead (I doubt that we'll see F-CPU chips with more
than 64 address lines during the next 10...20 years).
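The claim that SIMD mode with the maximum (full-word) chunk size behaves exactly like non-SIMD mode can be checked with a generic lane-wise adder: with a single lane, no carry is ever cut off at a chunk boundary. A minimal sketch, assuming unsigned operands that fit in word_bits; simd_add() and scalar_add() are hypothetical helpers, not a model of the actual F-CPU ASU.

```python
def simd_add(a: int, b: int, chunk_bits: int, word_bits: int = 64) -> int:
    """Lane-wise addition: each chunk wraps around on its own and
    carries never cross chunk boundaries."""
    assert word_bits % chunk_bits == 0, "chunks must tile the word"
    lane_mask = (1 << chunk_bits) - 1
    result = 0
    for shift in range(0, word_bits, chunk_bits):
        lane_a = (a >> shift) & lane_mask
        lane_b = (b >> shift) & lane_mask
        result |= ((lane_a + lane_b) & lane_mask) << shift
    return result


def scalar_add(a: int, b: int, word_bits: int = 64) -> int:
    """Ordinary full-word addition modulo 2**word_bits."""
    return (a + b) & ((1 << word_bits) - 1)


print(hex(simd_add(0xFFFFFFFF, 1, chunk_bits=64)))  # 0x100000000
print(hex(simd_add(0xFFFFFFFF, 1, chunk_bits=32)))  # 0x0
```

With 32-bit chunks the same operands give 0, because the carry out of the low lane is dropped; with a single full-word lane the result matches scalar_add().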

> > [...port sharing between EUs...]
> > > > Note that you may introduce EU dependencies that way.
> > > I don't see what you mean by "EU dependencies".
> > If two EUs share a port, you can use only one of them at a time. This
> > currently doesn't matter for input ports (because we build a 1-issue CPU)
> > but is important for output ports - results MUST NOT arrive at the same
> > time, and the scheduler will have to take care of that. Yet another
> > special case to handle...
> I went to a Japanese restaurant today and made a few drawings on paper...

Japanese characters, again? ;) *scribble*

> ===> it's not a problem.
> One factor is that we can group units that have the same latency:
> the current ROP2 and INC units are rather similar and can share the same
> "output" port, which can be further simplified. This port, however, needs
> to support writing to either of R7's write ports (for example, if a preceding
> ASU operation was started, ROP2/INC has to use the alternate write port).

What a hack!
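The rule quoted above (results from EUs that share an output port must not arrive in the same cycle), combined with Yann's idea of letting ROP2/INC fall back to the alternate write port, amounts to a per-cycle reservation table. A minimal sketch, assuming a single-issue pipeline, two write ports on the register set and a fixed latency per EU group; the class name and the latency values in the usage lines are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple


@dataclass
class WritePortScheduler:
    """Per-cycle reservation table for the register set's write ports.

    Assumptions (hypothetical, for illustration): single issue, ports
    numbered 0..num_ports-1, and a fixed latency per EU group.
    """
    num_ports: int = 2
    # (arrival cycle, port) pairs already reserved for a result.
    busy: Set[Tuple[int, int]] = field(default_factory=set)

    def try_issue(self, now: int, latency: int, preferred: int) -> Optional[int]:
        """Reserve a port for a result arriving at cycle now + latency.

        The preferred port is tried first, then the alternate port(s).
        Returns the port number, or None if all ports are taken in that
        cycle, in which case the instruction has to stall.
        """
        arrival = now + latency
        others = [p for p in range(self.num_ports) if p != preferred]
        for port in [preferred] + others:
            if (arrival, port) not in self.busy:
                self.busy.add((arrival, port))
                return port
        return None


# Made-up latencies: an ASU-like op (latency 2) issued at cycle 0 and a
# ROP2/INC-like op (latency 1) issued at cycle 1 both finish in cycle 2,
# so the second one is moved to the alternate port.
sched = WritePortScheduler()
print(sched.try_issue(now=0, latency=2, preferred=0))  # 0
print(sched.try_issue(now=1, latency=1, preferred=0))  # 1
```

With num_ports=1 the second request returns None instead, which is the "EU dependency" described above. The sketch never drops old entries; a hardware version would more plausibly be a small shift register of busy bits per port.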

> Another problem arises, however: I've been very lax about the
> "variable latency" of the units, such as additional/optional pipe stages
> for some units. Putting a non-shareable "output" in the middle of
> some units might be difficult in practice. We'll probably have to abandon
> the idea of min/max/sort/etc. in the INC unit, as well as the 16-bit and 32-bit
> combination in ROP2, and the 8-bit 1-cycle latency of the ASU.
> On the bright side, the latency decoder is simplified...

What about the IMU? It has ports at d=3, d=4, d=5 and d=6.

> > > >         - some instructions need special handling (complex!)
> > > which ?
> > 
> > You already mentioned them - ROP2 (xnor, orn), INC - and IDU.
> Is SHL safe?

Shifting/rotating 0 by 0 bits, in any direction, gives 0 ;)

> Then, MSB clearing is performed on the R7 read ports and on the "output" port
> shared by INC and ROP2 (add to that the fact that they have the same latency,
> and you understand why they are grouped together ;-)

I still like the B+b solution better. What about future EUs that we
didn't specify yet? What about the FP units? Not all FP operations have
the `f(0)=0' property! That adds a lot of potential special cases :(
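The `f(0)=0' argument can be checked mechanically. For bitwise operations each result bit depends only on the same bit position of the inputs, so if op(0, 0) is 0 the cleared upper bits stay cleared; otherwise the upper result bits come out as ones and MSB clearing on the read ports is not enough. A minimal sketch: the operation list is a generic set of two-input bitwise functions (assumed, not necessarily the exact F-CPU ROP2 repertoire), and the FP entries are only illustrations.

```python
import math

MASK64 = (1 << 64) - 1

# Generic two-input bitwise operations on 64-bit unsigned values
# (assumed set, for illustration only).
BITWISE_OPS = {
    "and":  lambda a, b: a & b,
    "or":   lambda a, b: a | b,
    "xor":  lambda a, b: a ^ b,
    "andn": lambda a, b: a & ~b & MASK64,
    "orn":  lambda a, b: (a | ~b) & MASK64,
    "nand": lambda a, b: ~(a & b) & MASK64,
    "nor":  lambda a, b: ~(a | b) & MASK64,
    "xnor": lambda a, b: ~(a ^ b) & MASK64,
}

# Illustrative floating-point functions (second operand ignored by
# the one-input ones).
FP_OPS = {
    "fadd": lambda a, b: a + b,
    "fmul": lambda a, b: a * b,
    "exp":  lambda a, b: math.exp(a),
    "cos":  lambda a, b: math.cos(a),
}


def violates_zero_property(op) -> bool:
    """True if op(0, 0) != 0, i.e. zero inputs do not give a zero result."""
    return op(0, 0) != 0


print(sorted(n for n, op in BITWISE_OPS.items() if violates_zero_property(op)))
# ['nand', 'nor', 'orn', 'xnor']
```

This flags orn and xnor, matching the ROP2 cases mentioned earlier, plus nand and nor if such operations are present; exp and cos show why FP units would need their own special cases.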

> > > >         1) Surprise! You need to mask the operands of the second instruction
> > > >            but there is no masking unit inside the bypass.
> > > if it's needed, we'll make it.
> > 
> > If we put masks into the bypass and the register write ports, the
> > whole discussion is closed. With that, we can always bypass, and we
> > can always zero-extend.
> According to my remark above, masking at only 2 locations
> is possible, with no scheduling trick. I guess it's the best I can do.

Sorry, but it's an ugly hack. Boomerang-style - if you don't pay
attention, it will come back and hit you. I prefer clean solutions.

> > > >         + masking moved to register set
> > > >         + requires only 2 masking units (one per register write port)
> > > There is no big difference in practice, I guess.
> > >
> > > But the real difference is that 2 instructions can use the 2 write ports
> > > and they can use 2 different write sizes --> in practice, it's more
> > > complex than A) because A) needs one mask control logic block, while B) needs two.
> > 
> > I suppose that every masking unit has its own decoder logic anyway,
> > in order to reduce the number of wires. You only send the SIMD and U
> > bits to it, and the rest is done in place.
> if you can do it only once (as on the read port), it certainly is easier
> to understand and implement :-)

And takes many more (and longer) wires.
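Whichever placement is chosen, each masking unit only needs the three U bits and the SIMD bit; the byte-lane enables can be generated locally, as suggested above. A minimal sketch, assuming a 64-bit word masked at byte granularity with the thermometer-coded size vectors from the thread; decode_byte_enables() and mask_result() are hypothetical names.

```python
def decode_byte_enables(u: int, simd: bool) -> int:
    """Expand 3 U bits + 1 SIMD bit into 8 byte-lane enables (bit i = lane i).

    Assumed encoding: U = 0b000, 0b001, 0b011, 0b111 keeps 1, 2, 4 or 8
    low bytes; SIMD mode keeps all 8 bytes.
    """
    if simd:
        return 0xFF
    enables = 0x01          # lane 0 is always kept
    if u & 0b001:
        enables |= 0x02     # lane 1
    if u & 0b010:
        enables |= 0x0C     # lanes 2-3
    if u & 0b100:
        enables |= 0xF0     # lanes 4-7
    return enables


def mask_result(value: int, u: int, simd: bool) -> int:
    """Zero-extend a 64-bit value by clearing every disabled byte lane."""
    enables = decode_byte_enables(u, simd)
    mask = 0
    for lane in range(8):
        if (enables >> lane) & 1:
            mask |= 0xFF << (8 * lane)
    return value & mask


print(hex(mask_result(0x1122334455667788, 0b001, False)))  # 0x7788
```

Each enable is just the OR of one U bit and SIMD, so four control wires per masking unit suffice regardless of how many byte lanes it drives.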

> > > Can't we just try to make something a kid can understand?...
> > Ok, let's build a Turing machine ;)
> Can this run a Linux kernel? :-P

A Turing machine can run anything, so the question is: "Has Linux been
ported to it already?" ;)

 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/