
Re: SR [was:Re: [f-cpu] delayed issue]



hi !

devik wrote:

Humm... I am thinking about a "true" 4r2w (on LIW) but with a 1r1w split
register bank. The first problem is the use of a 3r2w register bank when
90% of the instructions are 2r1w.

But not all instructions have the same latency!
So there are many cases where a "long" instruction will prevent short
ones from completing.
The 2nd write port is here to remove 90% of the compiler's pressure to
detect when this situation occurs.
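
The latency argument above can be sketched with a toy model. This is only an illustration, not F-CPU code: the function name, the latencies and the "one result per instruction" rule are all assumptions made for the example. Instruction i issues at cycle i and writes its result at cycle i + latency, and the function lists the cycles where more results arrive than there are write ports.

```python
from collections import Counter

def writeback_conflicts(latencies, write_ports):
    """Return the cycles in which more results arrive than there are
    write ports. Hypothetical model: instruction i issues at cycle i
    and writes exactly one result at cycle i + latencies[i]."""
    arrivals = Counter(i + lat for i, lat in enumerate(latencies))
    return sorted(cycle for cycle, n in arrivals.items() if n > write_ports)

# A 3-cycle op followed by two 1-cycle ops (made-up latencies):
# the long op and the second short op both finish at cycle 3.
program = [3, 1, 1]
print(writeback_conflicts(program, write_ports=1))  # [3]
print(writeback_conflicts(program, write_ports=2))  # []
```

With one write port the compiler (or the hardware) has to detect the collision at cycle 3; with two ports this particular case goes away.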

Just one small note. When I played with my simulator I found
that by adding a 4-entry "delay" buffer at the output of the EUs it is
possible to avoid about 4/5 of all write-contention-related stalls.

4 entries for each of the 6 integer units? That makes 24 register numbers to compare
and more data wires to route!
This kind of buffer can increase the frequency a bit by reducing the size of the
register set, but it adds at least one pipeline stage and more management logic.

I only don't know whether it is possible to write the scheduler
in a way that could handle it.
But because you issue AT MOST one insn/cycle (assuming no stalls),
for 1w ops you should always be able to find a free write
cycle.

That's the theory.
In practice, F-CPU is meant to compensate for its scalar design with instructions
that are richer than those of classical RISC processors. These instructions are
used more heavily in critical loops of data-intensive programs (like video and sound),
where performance matters most.
The load with postincrement instruction reads 2 operands and produces 2 results.
If you schedule such instructions well, you can issue 1 instruction per cycle continuously
(provided there is no data hazard). If the register set has only 1 write port and
a 4-deep write queue, then the queue will fill up if more than 4 consecutive loads
are programmed. This happens a lot, for example when saving/restoring the
register set manually ...
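
The arithmetic behind this can be checked with a small sketch. It is a simplified model under stated assumptions, not a description of any real F-CPU queue: each cycle one load delivers both of its results at once, the register set retires `write_ports` results per cycle, and whatever is left must fit in the queue. The function and parameter names are made up for the example.

```python
def loads_before_stall(queue_depth, write_ports=1, results_per_load=2):
    """Count how many back-to-back loads can issue before the write
    queue would overflow. Returns None if the queue never fills.
    Hypothetical model: one load per cycle, results_per_load results
    arrive together, write_ports of them are written immediately."""
    if results_per_load <= write_ports:
        return None  # every result is written at once, nothing queues up
    queued = 0
    issued = 0
    while True:
        leftover = queued + results_per_load - write_ports
        if leftover > queue_depth:
            return issued  # the next load would have to stall
        queued = leftover
        issued += 1

print(loads_before_stall(queue_depth=4))                 # 4
print(loads_before_stall(queue_depth=4, write_ports=2))  # None
```

Each load adds one net entry to the queue, so a 4-deep queue absorbs 4 loads and the 5th has to wait, while a second write port keeps the queue empty in this model.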


When I think about it just now, one could simply write results
to the delay buffers if there is no free write cycle and introduce
a stall when one such buffer is almost full. But I can't imagine
the complexity just now.

the first major problem that arises is how to manage the consistency
of the pipeline, how much logic is needed for the additional buffers,
how many cycles it will add to the pipeline ....
In a sense, it's almost the same problem as simple OOOE.

To keep the number of pipeline stages to the minimum that is necessary
to perform the function, the pipeline must not spend any time wondering
where data lies instead of performing useful computations. It only has to
check for possible exceptions, bypasses and stalls, and that already takes
the first 2 cycles of the pipeline (maybe the most complex part of the design).
Adding anything else would make development impossible for our small team.

In fact the register set problem is more related to technology than
architecture. The solution is something like splitting the set into
sub-banks or 'strides' as with other memory technologies (caches,
DRAM etc.). But it depends a lot on the available technology and
the foundry ... Of course several solutions could be tested
(like 8 banks of 8 registers, or 4 banks of 16 registers) but the
best solution will vary with the silicon properties, number
of metal layers etc .....
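
To make the banking idea concrete, here is a small sketch of how 64 registers could be mapped onto 8 or 4 banks. Everything in it is an assumption for illustration: the two mappings (interleaved and blocked), the helper names, and the rule that a bank serves only one distinct register per cycle. A real design would choose these according to the technology.

```python
NUM_REGS = 64  # 8 banks of 8 or 4 banks of 16, as in the text above

def bank_of(reg, num_banks, interleaved=True):
    """Map a register number to a bank. Interleaved: consecutive
    registers go to consecutive banks. Blocked: each bank holds a
    contiguous range of registers. Both mappings are hypothetical."""
    if interleaved:
        return reg % num_banks
    return reg // (NUM_REGS // num_banks)

def has_bank_conflict(regs, num_banks, interleaved=True):
    """True if two different registers accessed in the same cycle
    fall in the same bank (assuming one access per bank per cycle)."""
    distinct = set(regs)
    banks = {bank_of(r, num_banks, interleaved) for r in distinct}
    return len(banks) < len(distinct)

# Three neighbouring registers read in the same cycle:
print(has_bank_conflict([1, 2, 3], num_banks=8))                     # False
print(has_bank_conflict([1, 2, 3], num_banks=8, interleaved=False))  # True
```

Which mapping conflicts less depends on how the compiler allocates registers, so the sketch only shows that the choice matters, not which one is better.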


Now we seem to speak only about the 2 write ports.
But the 3 read ports also have a role, for example for the
store with post increment (we need the pointer, the modifier and
the data) and for SRB (the 3rd port is used to "spy" data
on the Xbar for the backup).
On top of that, several people seem "shocked" by the
unbalanced latencies between the "normal units" with 6 gates
of CDP and the register set. But the Xbar (the central
bus that routes data to/from units and the register set) is
also going to be slow. The good side is that these slow parts
can be pipelined with little impact on the scheduling.
Adding buffers would be a cure that would be worse than
the problem.

Now, a lot of people seem to be discovering, rediscovering
or rediscussing old stuff. What I would like is simply: source code ....

have fun,

good night,
devik

YG

*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/