
[f-cpu] correction : no delay required for the multiplier



hi !

While laying out the scheduler, I realized that
there is no need to delay one of the multiplier's outputs.
The complexity is manageable in the decoder when it comes
to predicting the latency. One side of the problem that I had
not seen is that there is no need for an additional port!
If we delay the multiplier, we'll need more HW to
manage the additional instantaneous bandwidth, which
is rather uncommon in real cases.

I am still unsure about the difference between mac and mul
(in your implementation) but I stick to the basics:
we can insert up to two register writes per instruction
in the scheduler queue (whatever the position).

Think about the Load operations:
the pointer update (addition) takes more time than transferring
the data from the LSU to the register. Delaying the data
load would seriously reduce the processor's performance...

In the scheduler FIFO, I have two columns of N positions
(N-deep FIFO), each position holding the following information:
  - valid           (1 bit)
  - register number (6 bits)      -> used during the register write cycle
  - write mask      (2 bits)      -> used during the register write cycle
  - write port      (3 or 4 bits) -> used during the Xbar cycle to select the data
Each FIFO stage contains (for most bits) a register (that is clocked
by the main clock), a few comparators (for the register numbers)
and a MUX that selects either the previous stage's data OR the data coming
from the decoder's LUT.
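
To make the stage description above more concrete, here is a small
behavioural sketch of one column in Python. It is only a model under
assumptions, not the actual design: the names Slot and Column, the choice
of index 0 as the output end, and the conflict check are all mine. The
comparators on the register numbers are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Slot:
    valid: bool = False  # 1 bit
    reg: int = 0         # 6 bits, used during the register write cycle
    mask: int = 0        # 2 bits, used during the register write cycle
    port: int = 0        # 3 or 4 bits, used during the Xbar cycle


EMPTY = Slot()


class Column:
    """One N-deep column. Index 0 is the output end (an assumption)."""

    def __init__(self, depth: int) -> None:
        self.slots: List[Slot] = [EMPTY] * depth

    def clock(self, insert_at: Optional[int] = None, new: Slot = EMPTY) -> Slot:
        """Advance one main-clock cycle and return the slot that leaves."""
        depth = len(self.slots)
        if insert_at is not None and not 0 <= insert_at < depth:
            raise ValueError("insert position outside the FIFO")
        leaving = self.slots[0]
        nxt = []
        for k in range(depth):
            # Data offered by the previous stage (empty at the far end).
            prev = self.slots[k + 1] if k + 1 < depth else EMPTY
            if insert_at == k:
                # MUX selects the decoder's LUT data for this stage.
                if prev.valid:
                    raise RuntimeError("position already taken by an earlier write")
                nxt.append(new)
            else:
                # MUX selects the previous stage's data.
                nxt.append(prev)
        self.slots = nxt
        return leaving
```

With this convention, a write inserted at position k during one clock
leaves the column k+1 clocks later.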

This is where it is possible to do a few sophisticated things:
currently, each column can select its own destination register,
write mask and write port, at positions independent of each other
(still with the limit of 2 registers per instruction because one
column can only insert 1 register). It is therefore possible
to write one register in column 1 at cycle i and another register
in column 2 at cycle j, as long as the decoder's LUT can predict i and j.
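
The two-column insertion can be sketched in the same spirit. Everything
specific here is hypothetical: the opcode names, the positions and port
numbers in the LUT, the depth of 4 and the full write mask are invented
for illustration only. The sketch just shows one instruction placing one
write per column at independent positions.

```python
from typing import Dict, List, Optional, Sequence, Tuple

DEPTH = 4  # hypothetical N

# (register number, write mask, write port); None stands for valid = 0.
Entry = Optional[Tuple[int, int, int]]

# Hypothetical decoder LUT: per opcode, one (position, write port) pair
# for each column that the instruction uses.
LUT: Dict[str, List[Tuple[int, int]]] = {
    "add": [(1, 0)],
    # Loaded data is ready before the pointer update, so it gets the
    # earlier position in column 1; the update goes into column 2.
    "load_postinc": [(1, 2), (2, 0)],
}


def clock(
    columns: List[List[Entry]],
    opcode: Optional[str] = None,
    dest_regs: Sequence[int] = (),
) -> Tuple[List[List[Entry]], List[Entry]]:
    """Shift both columns by one stage, then insert the new writes.

    Returns the new columns and the entries that left them this cycle.
    """
    leaving = [col[0] for col in columns]
    shifted = [col[1:] + [None] for col in columns]
    if opcode is not None:
        writes = LUT[opcode]
        if len(writes) > len(columns) or len(writes) != len(dest_regs):
            raise ValueError("at most one register write per column")
        for c, ((pos, port), reg) in enumerate(zip(writes, dest_regs)):
            if shifted[c][pos] is not None:
                raise RuntimeError("position already taken in column %d" % (c + 1))
            shifted[c][pos] = (reg, 0b11, port)  # full write mask assumed
    return shifted, leaving
```

For example, issuing the hypothetical "load_postinc" with destination
registers (7, 3) puts register 7 in column 1 and register 3 in column 2,
and the two writes leave the FIFO one cycle apart.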

Is it clear, or do you need a drawing? :-)
(I'm kidding, but if it's really unclear, I am currently
drawing some figures that can help to understand.)

I see the multiplier's "issue" as an unexpected solution
to the Xbar port number problem :-) we could then save half of
the MUXes required for the multiplier on the Xbar!

I don't know precisely, but it would be ok to have 4 output
ports for the multiplier (each with 64 bits).
When two results are available in a single cycle,
two contiguous ports are used simultaneously.

I am now waiting for Michael's deep explanations and ideas.

WHYGEE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~