
Re: [f-cpu] correction : no delay required for the multiplier



On Sat, Dec 22, 2001 at 04:08:44AM +0100, Yann Guidon wrote:

> while laying out the scheduler, it appeared to me that
> there is no need to delay one of the multiplier's outputs.
> The complexity is manageable in the decoder when it comes
> to predicting the latency. One side of the problem that i did
> not see is that there is no need for an additional port!
> if we delay the multiplier, we'll require more HW for
> managing the additional instantaneous bandwidth, which
> is rather uncommon in real cases.

We need to delay the 8- and 16-bit low parts for macl/mach.

> I am still unsure about the difference between mac and mul
> (in your implementation) but i stick to the basics :
> we can insert up to two register writes per instruction
> in the scheduler queue (whatever the position).

`mul' is 2r1w -- no second write needed, use the `low part' output for
the corresponding chunk size.

`mulh' is 2r2w -- same as `mul', but also write the high part to the
second register. For 8- and 16-bit operations, the high part has an
additional delay (+ 1 cycle).

`mac' in its original (documented) form is 3r1w, but its results are
scattered across both high and low parts of the IMU output (unless we
route them to a third port).

The alternative mac (`amac') instruction I proposed earlier is also 3r1w
but otherwise equals the `mul' instruction.  Its result is available at
the `low part' outputs but is truncated to the chunk size. The current
IMU supports this instruction, too.
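
To make the difference between the variants concrete, here is a minimal
behavioural sketch of `mul', `mulh' and `amac' as described above. It is
not the IMU code: the function names are made up for illustration,
operands are assumed to be unsigned and to fit in one chunk, and the
documented `mac' is left out because its results are spread over both
outputs.

```python
# Behavioural sketch only. Unsigned operands and all names here are
# assumptions made for illustration, not taken from the IMU sources.

def split_product(value, bits):
    """Split a value into (low, high) parts, each `bits` wide."""
    mask = (1 << bits) - 1
    return value & mask, (value >> bits) & mask

def mul(a, b, bits):
    """2r1w: only the low part of the product is written."""
    low, _ = split_product(a * b, bits)
    return low

def mulh(a, b, bits):
    """2r2w: low part goes to the first register, high part to the second."""
    return split_product(a * b, bits)

def amac(a, b, c, bits):
    """3r1w: like `mul' with an addend, truncated to the chunk size."""
    low, _ = split_product(a * b + c, bits)
    return low

# (register reads, register writes) per instruction, as listed above
PORT_USAGE = {"mul": (2, 1), "mulh": (2, 2), "mac": (3, 1), "amac": (3, 1)}
```

For an 8-bit chunk, mul(200, 3, 8) keeps only the low byte of 600,
while mulh(200, 3, 8) returns both bytes.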

> Think about the Load operations :
> the pointer update (addition) takes more time than transferring
> the data from the LSU to the register. Delaying the data
> load would seriously reduce the processor's performance...

Yep.

> In the scheduler FIFO, i have two columns of N positions
> (N-deep FIFO), each position holding the following information:
>   - valid           (1 bit)
>   - register number (6 bits)      -> used during the register write cycle
>   - write mask      (2 bits)      -> used during the register write cycle
>   - write port      (3 or 4 bits) -> used during the Xbar cycle to select the data
> Each FIFO stage contains (for most bits) a register (that is clocked
> by the main clock), a few comparators (for the register numbers)
> and a MUX that selects either the previous stage's data OR the data coming
> from the decoder's LUT.
> 
> This is where it is possible to make a few sophisticated things :
> currently, each column can select its own destination register,
> write mask and write port, at positions independent of each other
> (still with the limit of 2 registers per instruction because one
> column can only insert 1 register). It is therefore possible
> to write one register in column 1 at cycle i and another register
> in column 2 at cycle j, as long as the decoder's LUT can predict i and j.
> 
> Is it clear, or do you require a drawing? :-)
> (i'm kidding, but if it's really unclear, i am currently
> drawing a few things that can help in understanding it).

It's crystal clear :)  And it looks good to me.  I would probably
store a pre-decoded write mask (5 bits) instead, but that's a minor
issue.
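
For readers who prefer code to drawings, the two-column queue described
above could be modelled roughly as follows. This is only a sketch under
assumptions: the class names, the depth of 8, the shift direction
(position 0 is the head) and the error handling are invented for
illustration, and the register-number comparators are left out.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Slot:
    reg: int    # destination register number (6 bits)
    mask: int   # write mask (2 bits)
    port: int   # write port, selects the data during the Xbar cycle

class Scheduler:
    """Two columns of N positions; None stands for a cleared valid bit."""

    def __init__(self, depth: int = 8):  # depth is an arbitrary assumption
        self.columns: List[List[Optional[Slot]]] = [
            [None] * depth, [None] * depth
        ]

    def issue(self, writes):
        """writes: up to two (position, Slot) pairs, one per column."""
        if len(writes) > 2:
            raise ValueError("at most two register writes per instruction")
        pairs = list(zip(self.columns, writes))
        # Check every target position first so a failed issue changes nothing.
        if any(column[pos] is not None for column, (pos, _) in pairs):
            raise ValueError("position already occupied")
        for column, (pos, slot) in pairs:
            column[pos] = slot

    def tick(self):
        """One clock: return the valid head slots, shift the rest forward."""
        heads = [column[0] for column in self.columns]
        self.columns = [column[1:] + [None] for column in self.columns]
        return [slot for slot in heads if slot is not None]
```

In this model a slot inserted at position p comes out on tick p + 1, so
the decoder's LUT only has to supply one position per column.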

> I see the multiplier's "issue" as an unexpected solution
> to the Xbar port number problem :-) we could then spare 1/2 of
> the MUXes required for the multiplier on the Xbar !
> 
> I don't know precisely, but it would be ok to have 4 output
> ports for the multiplier (each with 64 bits).
> when two results are available in a single cycle, then
> two contiguous ports are used simultaneously.

The current timing may allow us to add MUXes after some of the outputs
without increasing the number of stages.  In particular, we could use
a 4:1 mux for the low parts, and another one for the high parts, while
maintaining the timing for `mulh' (4/5/5/6 cycles, depending on the
chunk size).  We'll lose the ability to schedule two `mul' instructions
with different chunk sizes so that both results arrive at the same time,
but I guess that doesn't matter much. `mach'/`macl' will need their own
outputs (and 4:1 muxes). This would simplify the interface considerably:

	- `mul' and `amac' use port #0
	- `mulh' uses port #0 and #1
	- `macl' uses port #2
	- `mach' uses port #3
	- all outputs have a delay of 4/5/5/6 cycles (for 8/16/32/64-bit chunks)

The output mux selectors can be driven internally (by the chunk size
control lines) or externally, whichever is more appropriate.
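
As a rough illustration of the interface listed above, the port
assignment and the 4/5/5/6 timing fit into two small tables. The
sketch below assumes that a result appears a fixed number of cycles
after issue; the function names and the tuple layout are made up for
the example.

```python
# Sketch of the proposed interface. The tables follow the list above;
# everything else (names, issue-cycle convention) is an assumption.

LATENCY = {8: 4, 16: 5, 32: 5, 64: 6}   # cycles, indexed by chunk size
PORTS = {
    "mul": (0,),
    "amac": (0,),
    "mulh": (0, 1),
    "macl": (2,),
    "mach": (3,),
}

def arrival(op, chunk_bits, issue_cycle):
    """Set of (port, cycle) pairs at which the results of `op` appear."""
    done = issue_cycle + LATENCY[chunk_bits]
    return {(port, done) for port in PORTS[op]}

def conflicts(a, b):
    """True if two (op, chunk_bits, issue_cycle) tuples collide on a port."""
    return bool(arrival(*a) & arrival(*b))
```

A 64-bit `mul' issued at cycle 0 and an 8-bit `mul' issued at cycle 2
would both want port #0 at cycle 6, which is the case we can no longer
schedule.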

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"