
Re: [f-cpu] new cjump instruction



On Mon, Apr 14, 2003 at 10:57:15AM +0200, Yann Guidon wrote:
[...]
> In the instruction flow pipeline, the Fetcher is at the first place :
> it is where instructions are fetched and buffered. When a line
> is first accessed, the Fetcher takes the address of the current line,
> increments it (upon overflow of the 12 LSB, it checks the TLB)
> and performs a read request to L1. The goal is to maintain
> a constantly prefetched instruction stream, and compensate for
> average memory latencies: with 8 instructions per line (256 bits),
> it gives 8 cycles to compute the next address, check whether it is
> present in the LSU, Fetcher and I-Cache, and fetch it. If absent,
> there are still a few clock cycles left for a fetch from L2.

The fetcher could increment the address as soon as it requests a cache
line (in parallel, just as a postincremented load would do).  Maybe even
earlier if we use a `tandem register' (that is, double buffering).
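The line-address increment described above can be sketched as follows. This is an illustrative model, not the actual VHDL: it assumes 256-bit (32-byte) cache lines and 4 KiB pages, so a carry out of the 12 LSB means a page boundary was crossed and the TLB must be consulted again.

```python
# Hypothetical sketch of the fetcher's next-line address computation.
# Assumptions (not from the post): 32-byte lines, 12-bit page offsets.
LINE_BYTES = 32
PAGE_MASK = (1 << 12) - 1

def next_fetch_address(line_addr):
    """Increment the line address; report whether the 12 LSB wrapped,
    i.e. whether a new TLB lookup is needed (page boundary crossed)."""
    nxt = line_addr + LINE_BYTES
    needs_tlb = (nxt & PAGE_MASK) == 0  # carry out of the 12 LSB
    return nxt, needs_tlb
```

In the double-buffered (`tandem register') variant, this computation would simply be launched one line earlier, overlapping with the fetch of the current line.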

> So the "Fetcher" fetches instructions in advance, sequentially.
> Linear code will use a "double buffer" with 2 lines.
> If a jump occurs, a new set of lines is found through LRU,
> so a jump back (return ?) to the calling code is possible.
> There are usually 8 lines.

Which means that we can handle at least four different locations,
right?
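A toy model of the 8 line buffers with LRU replacement, just to make the "jump back to the calling code" scenario concrete (class and method names are invented for illustration):

```python
# Toy model of the fetcher's line buffers with LRU replacement.
from collections import OrderedDict

class LineBuffers:
    def __init__(self, nlines=8):
        self.nlines = nlines
        self.lines = OrderedDict()   # line address -> line data

    def access(self, addr):
        """Return True on a hit; on a miss, evict the LRU line."""
        if addr in self.lines:
            self.lines.move_to_end(addr)   # mark as most recently used
            return True
        if len(self.lines) >= self.nlines:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[addr] = None             # placeholder for fetched data
        return False
```

With 2 lines per linear stream, 8 buffers indeed keep up to four double-buffered locations live, so a short call/return sequence can hit in the buffers on the way back.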

> The Fetcher presents one instruction at each clock cycle
> to the pipeline, as well as its address (which is generated
> from the line's virtual address and the index of the instruction in
> the line, obtained from the counter). The next pipeline stages
> (decode / R7 read and issue/Xbar) give a "next instruction"
> bit to the fetcher, in order to advance the instruction counter
> and get the next instruction. So the fetcher selects a new word
> either in the current line or another one, and this is completely
> transparent to the rest of the pipeline (which just begs for
> a new instruction). Naturally, a stall in the memory system
> propagates to the pipeline through the Fetcher, but we count
> on its buffers and "intelligence" to reduce the stall's cost.

Yep.  But I'd rather present the whole cache line to the decoder, and
let it send back a `next line' bit.  Alternatively, the decoder could
use an `instruction window' that contains 2...4 (or more) instructions.
In case we ever want to do a kind of peephole optimization here (or issue
several instructions per cycle), we wouldn't have to change the fetcher.
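The address generation described in the quoted paragraph (line virtual address plus in-line counter) can be sketched as below, assuming 32-bit instructions and 256-bit lines, i.e. a 3-bit counter:

```python
# Sketch of the presented instruction's address: the line's virtual
# address (tag, 32-byte aligned) combined with the 3-bit counter.
INSN_BYTES = 4  # 32-bit instruction words

def insn_address(line_vaddr, counter):
    assert 0 <= counter < 8 and (line_vaddr & 0x1F) == 0
    return line_vaddr | (counter * INSN_BYTES)
```

An `instruction window' variant would present several consecutive addresses at once and receive a `next line' bit instead of a per-instruction advance.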

BTW: One possible optimization would be `loadcons clustering'.  If there
are several loadcons[x] instructions in a row (with the same destination
register), the decoder might put the constants together and process all
the loads in a single cycle.
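The clustering idea can be modelled as a peephole pass over the decoded stream. The tuple encoding below is made up for illustration (it is not the real F-CPU instruction format); the point is only that runs of loadcons[x] to the same destination fuse into one write:

```python
# Illustrative `loadcons clustering' pass.  Instructions are fake
# tuples: ("loadcons", chunk_index, imm16, dest_reg) or other ops.
def cluster_loadcons(instrs):
    """Fuse runs of loadcons to the same register into a single
    ("loadcons_fused", {chunk: imm16}, dest) pseudo-op."""
    out = []
    for ins in instrs:
        if (ins[0] == "loadcons" and out
                and out[-1][0] == "loadcons_fused"
                and out[-1][2] == ins[3]):
            out[-1][1][ins[1]] = ins[2]   # extend the open cluster
        elif ins[0] == "loadcons":
            out.append(("loadcons_fused", {ins[1]: ins[2]}, ins[3]))
        else:
            out.append(ins)               # any other op breaks the run
    return out
```

The fused pseudo-op then writes all 16-bit chunks of the constant in one cycle instead of one cycle per loadcons.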

> Next comes "Decode". It "explodes" the instruction word
> into all the subfields and checks for whatever conditions exist.
> This means: reading all registers in the 3 corresponding fields,
> checking whether their value is available in the next cycle,
> whether the pointer field corresponds to a valid pointer,
> an invalid pointer or nothing, checking whether the condition
> is true, checking whether the instruction is valid ...
> 
> All those fields and flags are condensed into a few bits
> in the next cycle : "Issue/Xbar", where it is computed
> whether to fetch the next instruction, jump to the target,
> wait (stall) or trap. At the end of this cycle, if the instruction
> is valid and accepted, it "officially" enters the pipeline
> and the corresponding flags are updated in the scheduler's queue
> (depending on the latency determined by a table lookup
> during the decode cycle).

Yep.
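The condensation step in Issue/Xbar amounts to a small priority decision over the decoded flags. The flag names below are invented for illustration; only the four outcomes (next / jump / stall / trap) come from the quoted description:

```python
# Sketch of the Issue/Xbar decision: decoded flags are condensed
# into one of four outcomes.  Trap has highest priority, then stall.
def issue_decision(valid, operands_ready, condition_true, is_jump, faults):
    if faults or not valid:
        return "trap"
    if not operands_ready:
        return "stall"
    if is_jump and condition_true:
        return "jump"
    return "next"
```

Only when the outcome is "next" or "jump" does the instruction "officially" enter the pipeline and update the scheduler's queue.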

> Answers to MR :
> * currently, I don't think that predecoding
> can bring any advantage (yet). It may be useful in later architectures,
> but for now I see no case where it brings any benefit.

The main idea was to simplify the decoder (lower latency), and to
have some bits available in the fetcher that can be used for code flow
analysis.  E.g. if there is an unconditional jump, the fetcher may skip
the `linear' prefetch cycle to save memory bandwidth (and a buffer);
if there is a syscall instruction, it may prefetch code from the syscall
handler, and so on.

> * Loadaddri : the ASU can be used anyway, because the TLB is hooked
> to its output (on a "user-hidden port", kinda specific bypass).
> The TLB can get addresses from the Xbar ( a register or a unit bypass)
> or from the ASU (used by load/store with postincrement).

The TLB will also get addresses directly from the fetcher (for sequential
instruction prefetch), won't it?

> One "optimisation" is to bypass the TLB when no carry occurs
> out of the 12 LSB, but the lookup in the LSU and/or the Fetcher's
> buffers is always performed after that.
> It is better to reuse the ASU when there is a register operand,
> because it will benefit from the bypass on Xbar.
> A specific adder could be added to perform "fast/early" compute
> but it can be spared at the cost of more latency. Both solutions
> should be left to the user, in case there is a possible tradeoff
> for power consumption, speed and room.
> However, the Fetcher contains a pair of 'incrementers':
> one increments the virtual address "tag" and the other
> is a simple counter (from 0 to 7). Maybe the large incrementer
> can be made into a normal adder that performs "loadaddri".
> This won't work for register-relative "loadaddr".

An adder for loadaddri won't fit into a single pipeline stage.  And there
is another problem with loadaddri: both operands come from the IF/D unit.

We already have so many adders that one more or less doesn't really count.
EU_ASU has one 64-bit adder, EU_IMU has four 128-bit adders, EU_IDU has
one 64-bit adder (two if we're going to support `rem' and `divrem'),
and we'll need yet another adder to support `addsub'.  Plus any adders
that will be required to implement the FPU (at least two or three).
Small adders (< 64 bits) not listed.

We could make a version with shared adders (e.g. ASU/IMU/IDU might
share a single 128-bit adder), but its latency would probably be worse,
parallelism would be lost, and scheduling might become more difficult.
On the other hand, it would allow us to make a cheaper FC0 -- we
would save not only a significant portion of die area, but also ~10
Xbar ports.  I'll write such an `Integer Arithmetic Unit' (EU_IAU)
if you're interested.

-- 
 Michael "Tired" Riepe <Michael.Riepe@stud.uni-hannover.de>
 "All I wanna do is have a little fun before I die"