[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: More Alphabet Soup (was: [f-cpu] (!) a few noteworthy things)

On Wed, Jun 19, 2002 at 02:24:40AM +0200, Yann Guidon wrote:
> The Health Bureau says : "don't read this post unless you have
> a full bottle of aspirin next to you ! Now you are warned !".

*broad evil grin* :)

> > > > - requires additional instructions for zero/sign extension
> > > Isn't SHL meant to do that ?
> > Or any other unit. The omega shifter can't do it; if we want it, we need
> > extra code.
> I don't think Omega is the best choice... When i find time, i'll try
> to program another strategy where the wire length is shorter in the
> Critical DataPath.

If you can find one where wire length doesn't explode in the later stages...
The nicest property of the omega net ist that all stages are equal
(except the control logic).

> Concerning the latency, it seems obvious that, past a certain point,
> it should be pipelined. Though i'm not sure whether all the control logic
> can keep up ...

A pipelined SHL would be more difficult to write but should be
possible. But please let's keep the 1-stage version for now.

> > If bypassed values don't have to be zero-extended (as in your f) proposal),
> > we can include it into the write ports of the register set. And it's only
> > a single AND (or MUX).
> that's what i explain later : the zero-extension can not happen at that level,
> but at (or before) the EU's outputs. We don't want to have to check whether the data
> granularity changes between two depending operations... and the bypassed value
> MUST be the same as what is written to the Register Set.

The bypassed value need NOT be the same. Only the valid part must be
identical, the rest is a `don't care'. If it's not masked in the bypass,
it will be masked after the second instruction (given that the second
instruction does not use bigger chunks).

The granularity check isn't hard to do. Let U1 and U2 be the decoded
`size vectors' (that is, "000", "001", "011" or "111") and SIMD1 and
SIMD2 be the SIMD flags of the first and second instruction, respectively,
then bypassing without masking is permitted if

	not (U1 or (2 downto 0 => SIMD1))
	and (U2 or (2 downto 0 => SIMD2)) = "000"

It's MUCH harder to check for the case whether a bypass is appropriate
at all (compare register numbers and so on)!

> MAYBE.....
> The SIMD flag could be turned into a "pointer" flag ???????
> waddayouthink ?
> We would still use the b) approach with a bit of d) in the description,
> but the current SIMD flag could have an inverted meaning, and trigger
> TLB checks and the rest ?...
> It would be :
>  * all "normal" operations are SIMD (you said "all the world is SIMD :)")
>    and operand size would be managed as in RISC world
>  * all pointers operations must have the flag reset so the result is zero-extended
>     and the TLB is correctly checked (but then we need bits to indicate whether
>     it's I or D)

I guess we can limit ourselves to data pointers. But there's another
problem: consider a two-dimensional array `int a[5][9]'. The address
of the element `a[y][x]' is calculated as `a + sizeof(int) * (9 * y + x)',
that is

	// let r1 = a, r2 = x, r3 = y
	muli 9, r3, r4
	add r2, r4, r5
	muli INTSIZE, r5, r6	// actually, this will be a shift
	add r1, r6, r1

Do you want ALL those results to be pointers that trigger a prefetch? What
about r4/r5/r6 which do not point to memory but are just integers?

> but i fear it's evenn more complex and confusing than before...

Yep, and it has severe semantic problems.

> for example, the 3 bits would have different meanings than today, we would need
> the following combinations :
>  - instruction pointer
>  - data pointer
>  - SIMD 8/16/32/64/128/256 bits/chunk
> So there are some instructions that become useless on the memory side...
> argh. i wish i didn't have this idea.
> forget it.

Ok. It's better that way.

> in my vision, a pointer is held in a whole register, whatever the size
> of both. a pointer has no defined size, just like the register.

Yep. In practice, the number of valid bits in a pointer will be
determined by the hardware (number of address lines) and/or the
operating system (TLB miss handler).

> by not binding the pointer format to the existing data formats
> (char, int, long int...), it becomes difficult to do pointer arithmetics
> with "common" arithmetic operations.

The answer is, of course: use SIMD mode with maximum chunk size. Since
it is identical to non-SIMD mode with the same chunk size...

> Just a note about multiple pointers : a register can contain
> ONLY ONE pointer, otherwise how would we handle jumps and load/stores ?


> another note :
> a scatter/gather instruction would be ideally performed using a "base"
> pointer (checked the usual way) and a SIMD "offset", so every SIMD offset
> chunk is parallelly checked against the maximum allowed offset (size of page
> in TLB ?) and the TLB doens't need as many ports as there are chunks...

Sounds good. Especially if we require the base pointer to be page
aligned :)

> > > > I suggest we choose f) but make any reasonable effort to implement b).
> > > i prefer "my" f) but b) has some problems to be solved...
> > I guess we can combine your f) with a), b), c) or d) - or my f).
> now i understand why you renamed the subject to "Re: More Alphabet Soup " :-P

*evil grin* again :)

[...port sharing between EUs...]
> > Note that you may introduce EU dependencies that way.
> I don't see what you mean by "EU dependencies".

If two EUs share a port, you can use only one of them at a time. This
currently doesn't matter for input ports (because we build a 1-issue CPU)
but is important for output ports - results MUST NOT arrive at the same
time, and the scheduler will have to take care of that. Yet another
special case to handle...

> > Another exception is the IDU (0/0 = ?).
> ooops. well, huh... then the EU output is explicitely cleared in that case,
> and this special case must be handled by the unit...

Ok. I suppose the IDU won't crash and burn when the divisor is zero;
it just will produce meaningless results (which will be masked).

> > For other drawbacks, see A) below.
> > 
> > > Do you understand the trick ? instead of clearing the MSB of the results
> > > at the output of the EUs (it's important to do this because the result is
> > > needed for bypass directly on the Xbar, and the partial write of the register
> > > set will not be enough), the MSB is cleared when the operands are read
> > > (less places where the MSB is cleared, so less control signals).
> > 
> > There are at least four approaches (capital letters this time):
> you really like alphabet soup :-)
> However, note that all approaches can be combined in a way or another.
> > A) `early masking': mask off the high bits when the register is read
> >    (before the instruction is issued).
> > 
> >         + masking moved to register set
> and less places where this must be done ---> less control logic and less long wires
> >         - bypass impossible when first instruction has wider operands 1)
> hhhmmm ???... i'll have to check that.
> >         - requires at least 3 masking units (one per register read port)
> i guess it's not the critical parameter and compared to others, it's even the
> simplest one.
> >         - some instructions need special handling (complex!)
> which ?

You already mentioned them - ROP2 (xnor, orn), INC - and IDU.

> >         1) Surprise! You need to mask the operands of the second instruction
> >            but there is no masking unit inside the bypass.
> if it's needed, we'll make it.

If we put masks into the bypass and the register write ports, the
whole discussion is closed. With that, we can always bypass, and we
can always zero-extend.

> > B) `write masking': mask off the high bits when the result is
> >    written to the register.
> i think it was your first idea (or at least i understood that,
> and i later developped).


> >         + masking moved to register set
> >         + requires only 2 masking units (one per register write port)
> there is no big difference in practice, i guess.
> But the real difference is that 2 instructions can use the 2 write ports
> and they can use 2 different write sizes --> in practice, it's more
> complex than A) because A) needs 1 mask control logic, while B) needs 2.

I suppose that every masking unit has its own decoder logic anyway,
in order to reduce the number of wires. You only send the SIMD and U
bits to it, and the rest is done in place.

> >         - bypass impossible when second instruction has wider operands
> that's why i thought about moving it further down the pipeline...
> B+f is possible but not satisfying. there is certainly something better.

B+b, with an optional masking unit inside the bypass :)

> > C) `late masking': store the chunk size in the register set (or
> >    scoreboard) and mask off the high bits when the result is read from the
> >    register (that is, before the *next* instruction that uses the value).
> > 
> >         + masking moved to register set
> >         - bypass impossible when second instruction has wider operands
> >         - requires at least 3 masking units (one per register read port)
> >         - requires extra `valid' bits that indicate the chunk size
> where did that crazy idea come from ? ...

Um. Hmm. Where did... outta my crazy head? ;)

> can't we just trry to make something a kid can understand ?...

Ok, let's build a Turing machine ;)

> > D) move the problem to the EUs. This can be done easily in the IMU, but
> >    there's no room left inside the ASU, for example. SHL is pretty tight,
> >    too (I already violate the `6 gate rule' there).
> then split SHL into 2 stages... given the long wires of you Omega network,
> it won't be useless...
> >         + bypass always possible
> >         - heavy implementation
> >         - not all EUs support it
> i propose to let the Xbar "unit" perform a part of it.
> it can mask off the MSB when it reads the register operands.
> If the EU can't do it, then Xbar masks the EU output locally.

I thought you wanted to reduce the complexity of the Xbar?

> > E/F/G anyone? ;)
> > 
> > If we can add a masking unit inside the bypass, ABC) will always be able
> > to bypass results. But even if we can't, B) looks like the best solution
> > so far.
> not really because it still has problems with bypass (detecting special conditions).
> Don't forget that without bypass capability, FC0 will be ... unacceptably slow.

If you want maximum speed, always use full words (with or without SIMD).

 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/