
Re: [f-cpu] new cjump instruction



hi,

Michael Riepe wrote:

On Sun, Apr 13, 2003 at 01:21:05PM +0000, nico wrote:
[...]

All of this is part of the beginning of the pipeline (fetch stages).
There is nothing about it in the manual or elsewhere.
[...]

Well, then let's try to fill that gap.

<snip>

Did I miss anything?  Well, probably a lot.
Yann, please correct me if I'm wrong.

Not much time to answer, but:

The LSU (Load Store Unit) and the Fetcher have
a similar structure : prefetch logic (increment addresses),
cache logic (LRU, validity bits etc.) and multiported memory
(small, fine-grained cache).

The main difference is that the Fetcher's buffer contains 32-bit words
in read-only mode, while the LSU manages reads and writes of bytes,
which increases the complexity a bit.
The Fetcher also has a sequential behaviour (there is a flag that
says whether to get the next instruction) while the LSU is random-access.

In the instruction flow pipeline, the Fetcher comes first:
it is where instructions are fetched and buffered. When a line
is first accessed, the Fetcher takes the address of the current line,
increments it (on overflow of the 12 LSBs, it checks the TLB)
and issues a read request to L1. The goal is to maintain
a constantly prefetched instruction stream and compensate for
average memory latencies: with 8 instructions per line (256 bits),
this gives 8 cycles to compute the next address, check whether it is
present in the LSU, the Fetcher and the I-Cache, and fetch it. If it is
absent, there are still a few clock cycles left for a fetch from L2.
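The page-crossing check described above (a carry out of the 12 LSBs triggers a TLB lookup) can be sketched like this; the 32-byte line size and 12-bit page offset come from the text, everything else is purely illustrative:

```python
# Hypothetical sketch of the Fetcher's next-line computation.
LINE_BYTES = 32          # 256-bit line = 8 x 32-bit instructions
PAGE_BITS  = 12          # TLB check needed when the 12 LSBs overflow

def next_line(addr):
    """Return (next_line_address, needs_tlb_lookup)."""
    nxt = (addr + LINE_BYTES) & 0xFFFFFFFF
    # A carry out of the 12 LSBs means a page boundary was crossed,
    # so the translation must be refreshed from the TLB.
    crossed = (nxt >> PAGE_BITS) != (addr >> PAGE_BITS)
    return nxt, crossed

# Crossing from the last line of a 4 KiB page:
nxt, crossed = next_line(0x0FE0)
print(hex(nxt), crossed)             # -> 0x1000 True
nxt, crossed = next_line(0x0100)
print(hex(nxt), crossed)             # -> 0x120 False
```

In the common (no-carry) case the TLB can be left alone, which is exactly the "optimisation" mentioned further down in the Loadaddri answer.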

So the "Fetcher" fetches instructions in advance, sequentially.
Linear code uses a "double buffer" of 2 lines.
If a jump occurs, a new line is allocated through LRU, so the
calling code's line stays buffered and a jump back (a return?) is possible.
There are usually 8 lines.
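A minimal model of the 8-line, LRU-managed buffer described above; the class and method names are my own, not the actual F-CPU implementation:

```python
from collections import OrderedDict

class LineBuffer:
    """Toy model: 8 instruction lines, least-recently-used replacement."""
    def __init__(self, nlines=8):
        self.nlines = nlines
        self.lines = OrderedDict()        # tag -> line data, oldest first

    def access(self, tag):
        """Touch a line; returns True on a hit, False on a miss."""
        if tag in self.lines:
            self.lines.move_to_end(tag)   # hit: mark most recently used
            return True
        if len(self.lines) >= self.nlines:
            self.lines.popitem(last=False)  # evict the LRU line
        self.lines[tag] = "fetched"         # simulate a fill from L1
        return False

buf = LineBuffer()
buf.access(0x100)         # calling code
buf.access(0x200)         # jump target
print(buf.access(0x100))  # -> True: the caller's line is still buffered
```

This shows why a jump back to the calling code usually hits: with 8 lines, the caller's line is rarely the LRU victim.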

The Fetcher presents one instruction per clock cycle
to the pipeline, along with its address (which is generated
from the line's virtual address and the index of the instruction in
the line, obtained from the counter). The next pipeline stages
(decode/R7-read and issue/Xbar) send a "next instruction"
bit to the Fetcher, in order to advance the instruction counter
and get the next instruction. The Fetcher then selects a new word,
either in the current line or in another one, and this is completely
transparent to the rest of the pipeline (which simply asks for
a new instruction). Naturally, a memory stall
propagates to the pipeline through the Fetcher, but we count
on its buffers and "intelligence" to reduce the stall's cost.
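The address generation and counter advance can be sketched as follows, assuming 32-bit (4-byte) instructions, 8 per 256-bit line, as stated above; the function names are illustrative:

```python
LINE_BYTES = 32

def insn_address(line_tag, counter):
    """Rebuild the instruction's address from the line's virtual
    address (line-aligned tag) and the 3-bit counter (0..7)."""
    assert 0 <= counter < 8
    return line_tag | (counter << 2)   # 4 bytes per instruction

def advance(line_tag, counter):
    """React to the 'next instruction' bit from decode/issue."""
    counter = (counter + 1) & 7
    if counter == 0:                   # wrapped: move to the next line
        line_tag += LINE_BYTES
    return line_tag, counter

tag, ctr = 0x1000, 6
print(hex(insn_address(tag, ctr)))    # -> 0x1018
tag, ctr = advance(tag, ctr)          # ctr = 7, same line
tag, ctr = advance(tag, ctr)          # wrap: next line
print(hex(tag), ctr)                  # -> 0x1020 0
```

The line switch on wrap-around is what stays transparent to the rest of the pipeline.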


Next comes "Decode". It "explodes" the instruction word
into all its subfields and checks whatever conditions exist.
This means: reading the registers named in the 3 corresponding fields,
checking whether their values are available in the next cycle,
whether the pointer field corresponds to a valid pointer,
an invalid pointer or nothing, whether the condition
is true, whether the instruction is valid, ...

All those fields and flags are condensed into a few bits
in the next cycle, "Issue/Xbar", where it is decided
whether to fetch the next instruction, jump to the target,
wait (stall) or trap. At the end of this cycle, if the instruction
is valid and accepted, it "officially" enters the pipeline
and the corresponding flags are updated in the scheduler's queue
(depending on the latency determined by a table lookup
during the decode cycle).
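The "explode into subfields" step can be illustrated like this. The exact F-CPU field layout is not given in this mail; the 8-bit opcode and three 6-bit register fields below are assumptions, used only to show the idea:

```python
def decode(word):
    """Split a 32-bit instruction word into subfields.
    Field positions and widths are hypothetical."""
    return {
        "opcode": (word >> 24) & 0xFF,
        "r1":     (word >> 12) & 0x3F,   # first source register
        "r2":     (word >> 6)  & 0x3F,   # second source register
        "rd":     word & 0x3F,           # destination register
    }

# Assemble a word with opcode 0x21 and registers r3, r5 -> r7:
word = (0x21 << 24) | (3 << 12) | (5 << 6) | 7
print(decode(word))   # -> {'opcode': 33, 'r1': 3, 'r2': 5, 'rd': 7}
```

In hardware this is just wiring, of course: each field fans out to the register file ports and condition logic in parallel.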


Answers to MR :
* Currently, I don't think that predecoding
brings any advantage. It may be useful in later architectures,
but I see no case where it brings a benefit yet.
* Loadaddri: the ASU can be used anyway, because the TLB is hooked
to its output (on a "user-hidden port", a kind of dedicated bypass).
The TLB can get addresses from the Xbar (a register or a unit bypass)
or from the ASU (used by load/store with postincrement).
One "optimisation" is to bypass the TLB when no carry occurs
out of the 12th bit, but the lookup in the LSU's and/or Fetcher's
buffers is always performed after that.
It is better to reuse the ASU when there is a register operand,
because it benefits from the bypass on the Xbar.
A dedicated adder could be added for a "fast/early" computation,
but it can be spared at the cost of more latency. Both solutions
should be left to the user, in case there is a tradeoff to make
between power consumption, speed and room.
However, the Fetcher contains a pair of "incrementers":
one increments the virtual address "tag" and the other
is a simple counter (from 0 to 7). Maybe the large incrementer
can be turned into a normal adder that performs "loadaddri";
this won't work for the register-relative "loadaddr", though.
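A hedged sketch of that last idea: the Fetcher's "large incrementer" (normally tag + line size) generalized into an adder, so it could also serve an immediate-offset "loadaddri". Purely illustrative, not the actual design:

```python
LINE_BYTES = 32   # 256-bit line

def fetch_adder(tag, imm=None):
    """imm is None for normal sequential prefetch;
    otherwise it is a loadaddri-style immediate offset."""
    return tag + (LINE_BYTES if imm is None else imm)

print(hex(fetch_adder(0x1000)))        # -> 0x1020 (sequential prefetch)
print(hex(fetch_adder(0x1000, 0x40)))  # -> 0x1040 (loadaddri-style)
```

The register-relative "loadaddr" doesn't fit here because the register value would have to travel from the Xbar to the Fetcher, which the incrementer path doesn't provide.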

I hope everyone understood,

YG


*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/