
Re: [f-cpu] new cjump instruction



hi,

Michael Riepe wrote:

On Mon, Apr 14, 2003 at 10:57:15AM +0200, Yann Guidon wrote:
[...]

In the instruction flow pipeline, the Fetcher comes first :
it is where instructions are fetched and buffered. When a line
is first accessed, the Fetcher takes the address of the current line,
increments it (upon overflow of the 12 LSB, it checks the TLB)
and issues a read request to L1. The goal is to maintain
a constantly prefetched instruction stream and compensate for
average memory latencies : with 8 instructions per line (256 bits),
there are 8 cycles to compute the next address, check whether it is
present in the LSU, the Fetcher or the I-Cache, and fetch it. If it is
absent, there are still a few clock cycles left for a fetch from L2.
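To make this concrete, here is a rough C model of the next-line address
step (the names and exact layout are mine, not taken from the sources) :

#include <stdint.h>

#define LINE_BYTES 32u   /* 256-bit line = 8 x 32-bit instructions */
#define PAGE_BITS  12u   /* the 12 LSB address bytes inside a page */

/* advance the prefetch address by one line; the TLB only needs to
   be consulted when the increment carries out of the 12 LSB */
uint64_t next_line_address(uint64_t cur, int *check_tlb)
{
    uint64_t next = cur + LINE_BYTES;
    *check_tlb = ((cur ^ next) >> PAGE_BITS) != 0;  /* page crossed ? */
    return next;
}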

The fetcher could increment the address as soon as it requests a cache
line (in parallel, just as a postincremented load would do). Maybe even
earlier if we use a `tandem register' (that is, double buffering).

well, too much prefetch can overload the memory system
and increase the real execution time.
But of course, since the sources are "free", anybody can "play"
with the Fetcher's strategy.

So the "Fetcher" fetches instructions in advance, sequentially.
Linear code will use a "double buffer" with 2 lines.
If a jump occurs, a new set of lines is found through LRU,
so a jump back (return ?) to the calling code is possible.
There are usually 8 lines.

Which means that we can handle at least four different locations,
right?

"at least", right :
first, usually, only one stream is using double-buffering.
The rest will be code that was the source of a jump (in which case,
the double buffering will "die" progressively through LRU) or of a call
(then, if too many calls are done, the Fetcher will "forget" the oldest
parts of the virtual "call stack").

The LRU algorithm and the line selection will not be trivial algos,
but at least they implicitly implement a call stack.
Heuristics must be found to decide whether a return address
keeps its "double" : for example, if the jump occurs after the 4th instruction,
the cost of re-fetching the next line from L1 upon return increases dramatically,
so we keep the prefetched line in that case and "forget" it when
the instruction pointer's bit #4 is cleared.

This kind of trick can spare some lines and raise the number of
really fetched locations to 6 instead of 8/2 = 4. At least one pair is used
for the current instruction flow and another pair is possibly incompressible.
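A crude C sketch of that heuristic (assuming 32-byte lines; the rule
itself is only an example, not a fixed design decision) :

#include <stdint.h>

/* on a taken jump or call, decide whether the prefetched "double"
   of the line we are leaving is worth keeping for the return path.
   Bit #4 of the IP selects the half of the 32-byte line : if it is
   cleared (first half), the next line was not needed soon anyway,
   so its prefetched copy may be recycled by the LRU. */
int keep_prefetched_double(uint64_t ip_at_jump)
{
    return (ip_at_jump >> 4) & 1;
}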

The Fetcher presents one instruction to the pipeline at each clock
cycle, as well as its address (which is generated
from the line's virtual address and the index of the instruction in
the line, obtained from the counter). The next pipeline stages
(decode / R7 read and issue/Xbar) send a "next instruction"
bit back to the fetcher, in order to advance the instruction counter
and get the next instruction. So the fetcher selects a new word
either in the current line or in another one, and this is completely
transparent to the rest of the pipeline (which simply asks for
a new instruction). Naturally, if there is a stall in the memory,
it propagates to the pipeline through the Fetcher, but we count
on its buffers and "intelligence" to reduce the stall's cost.
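In C-like pseudo-hardware (the structure and names are my own
invention), that interface looks like this :

#include <stdint.h>

typedef struct {
    uint32_t line[8];     /* the current 256-bit line, 8 words  */
    uint64_t line_vaddr;  /* virtual address of the line        */
    unsigned index;       /* the instruction counter, 0..7      */
} fetcher_t;

/* present the current instruction and its address to the pipeline;
   the address is rebuilt from the line tag (assumed 32-byte
   aligned) and the counter */
uint32_t fetcher_present(const fetcher_t *f, uint64_t *iaddr)
{
    *iaddr = f->line_vaddr | (4u * f->index);
    return f->line[f->index];
}

/* the "next instruction" bit from decode/issue advances the counter;
   on wrap-around, the real Fetcher switches to the prefetched line */
void fetcher_next(fetcher_t *f)
{
    f->index = (f->index + 1) & 7;
}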

Yep.  But I'd rather present the whole cache line to the decoder, and
let it send back a `next line' bit.

hmmm ... no.
even a cache line is a buffer that must be carefully handled.
and a decoder decodes, a fetcher fetches.
At some point, there must be a "current instruction" being decoded :
that is where the boundary between "fetch" and "decode" lies.

If you provide several instructions for parallel decoding,
it increases the decoding logic and the problems.
Furthermore, in the classical computing world, only one
instruction is "active" at a time. So i consider that it sits
in one "pipeline register" which is updated by the "fetcher" upon
request from the decode and issue stages.

Alternatively, the decoder could
use an `instruction window' that contains 2...4 (or more) instructions.
In case we ever want to do a kind of peephole optimization here (or issue
several instructions per cycle), we wouldn't have to change the fetcher.

KISS, you know ?

first, it's usually the compiler's job to optimise the code sequences, and
it can do so on a much larger window than 2 to 4 instructions.

Next, we have 64 registers and potentially 4 register addresses per instruction :
how are you going to compare 8 to 16 6-bit words with one another ?
(theoretically, that's 8*7/2=28 to 16*15/2=120 comparators.)
The clock speed and/or the pipeline depth are going to suffer.
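For the record, that count comes from cross-checking every pair of
register fields in the window :

/* 2..4 instructions x 4 register fields of 6 bits each :
   a full pairwise cross-check needs n*(n-1)/2 comparators */
unsigned pairwise_comparators(unsigned n_insns)
{
    unsigned fields = 4 * n_insns;
    return fields * (fields - 1) / 2;   /* 8 -> 28, 16 -> 120 */
}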

BTW: One possible optimization would be `loadcons clustering'. If there
are several loadcons[x] instructions in a row (with the same destination
register), the decoder might put the constants together and process all
the loads in a single cycle.

naaah ....

how are you going to deal with "clusters" that spread across a cache line
boundary or (worse) across a page boundary ?

Next, you'll have to add comparators that determine whether each loadcons
targets the same register. That's O(log2(8)) : at least 3 levels of logic
for 8 instructions per line.

Then, you'll have to make sure that the constants come in the correct order :
no room for constant optimisations or "holes" in the constant, you have
to provide contiguous indices.

those are the reasons i'm able to find in my post-tiredness state,
wait till i get some sleep if you want more :-)
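For reference, here is roughly what a merged cluster would have to
compute — i'm assuming the usual loadcons form that writes a 16-bit
immediate into the register slice selected by its index field :

#include <stdint.h>

/* write a 16-bit immediate into slice idx (0..3) of a 64-bit
   register; a "cluster" is only mergeable when all loadcons target
   the same register with distinct, correctly ordered indices,
   which is exactly what the comparators above must verify */
uint64_t apply_loadcons(uint64_t reg, unsigned idx, uint16_t imm)
{
    uint64_t mask = 0xFFFFull << (16u * idx);
    return (reg & ~mask) | ((uint64_t)imm << (16u * idx));
}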

Next comes "Decode". it "explodes" the instruction word
into all the subfields and checks for whatever conditions exists.
This means : reading all registers in the 3 corresponding fields,
checking whether their value is available in the next cycle,
whether the pointer field corresponds to a valid pointer
or an invalid pointer or nothing, check whether the condition
is true, check whether the instruction is valid ....

All those fields and flags are condensed into a few bits
in the next cycle : "Issue/Xbar", where it is computed
whether to fetch the next instruction, jump to the target,
wait (stall) or trap. At the end of this cycle, if the instruction
is valid and accepted, it "officially" enters the pipeline
and the corresponding flags are updated in the scheduler's queue
(depending on the latency determined by a table lookup
during the decode cycle).
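Schematically (my own condensation; the real priority among the four
cases is still open) :

typedef enum { DO_NEXT, DO_JUMP, DO_STALL, DO_TRAP } issue_t;

/* the Decode-stage flags are condensed into one of four outcomes
   during the Issue/Xbar cycle */
issue_t issue_decision(int insn_valid, int pointer_valid,
                       int operands_ready, int is_jump, int cond_true)
{
    if (!insn_valid || !pointer_valid) return DO_TRAP;
    if (!operands_ready)               return DO_STALL;
    if (is_jump && cond_true)          return DO_JUMP;
    return DO_NEXT;
}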

Yep.


Answers to MR :
* currently, i don't think that predecoding
brings any advantage (yet). It may become useful in later architectures,
but i see no case where it brings a benefit to FC0.

The main idea was to simplify the decoder (lower latency),

what complex instruction needs preprocessing ?

and to have some bits available in the fetcher that can be used for code flow analysis.

???

E.g. if there is an unconditional jump, the fetcher may skip
the `linear' prefetch cycle to save memory bandwidth (and a buffer);

hmmm ... there are many ways to code an unconditional jump,
like a call without ever using return for example.
Jump instructions are easy to find but they offer quite some freedom
and even return can be conditional (since it uses the main "jump" form).
You're going to need 8 complex comparators that add 1 bit to each cache line.
It's probably interesting but i'm curious to know how many tenths of a percent
of wallclock time it is going to save.

if there is a syscall instruction, it may prefetch code from the syscall
handler, and so on.

syscall can have either an immediate form or a register form :
prefetch will not be possible for the register form.

* Loadaddri : the ASU can be used anyway, because the TLB is hooked
to its output (on a "user-hidden port", a kind of specific bypass).
The TLB can get addresses from the Xbar (a register or a unit bypass)
or from the ASU (used by load/store with postincrement).

The TLB will also get addresses directly from the fetcher (for sequential
instruction prefetch), won't it?

i guess so but i'm too tired to think about it...

now there is another question : split or unified TLB ?

One "optimisation" is to bypass the TLB when no carry occurs
on the 12th bit but lookup in the LSU and/or Fetcher's buffer is
always performed after that.
It is better to reuse the ASU when there is a register operand,
because it will benefit from the bypass on Xbar.
A specific adder could be added to perform "fast/early" compute
but it can be spared at the cost of more latency. Both solutions
should be left to the user, in case there is a possible tradeoff
for power consumption, speed and room.
However, the Fetcher contains a pair of 'incrementers',
one to increment the virtual address "tag" and another
is a simple counter (from 0 to 7). maybe the large incrementer
can be made into a normal adder that performs "loadaddri".
This won't work for register-relative "loadaddr".
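The bypass condition itself is trivial — a one-line C sketch,
mirroring the fetch-side check :

#include <stdint.h>

/* the old translation can be reused as long as the new address
   stays in the same 4 KiB page (no carry out of the 12 LSB) */
int tlb_lookup_needed(uint64_t old_va, uint64_t new_va)
{
    return ((old_va ^ new_va) >> 12) != 0;
}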

An adder for loadaddri won't fit into a single pipeline stage.

if there is a carry out of the Xth position, it can stall the process for one
cycle in order to compute the MSB, can't it ?
i think that this won't occur often (the frequency is inversely proportional
to the memory size that can be addressed by the LSB of the computed pointer).

And there
is another problem with loadaddri: both operands come from the IF/D unit.

With the immediate form, both operands (NIP [Next IP] and offset) come from the
instruction that is presented by the Fetcher to the decoder, where all fields can
be examined in parallel. These fields can be "looped back" to the Fetcher, which
can then compute the address with its internal incrementer/avalanche-adder.
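The computation itself is then just (a sketch, assuming a sign-extended
immediate; the "avalanche-adder" internals are left out) :

#include <stdint.h>

/* immediate form : dest = NIP + sign-extended offset, both operands
   coming from the Fetcher/Decoder itself */
uint64_t loadaddri(uint64_t nip, int32_t offset)
{
    return nip + (int64_t)offset;
}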

We already have so many adders that one more or less doesn't really count.
EU_ASU has one 64-bit adder, EU_IMU has four 128-bit adders, EU_IDU has
one 64-bit adder (two if we're going to support `rem' and `divrem'),
and we'll need yet another adder to support `addsub'.  Plus any adders
that will be required to implement the FPU (at least two or three).
Small adders (< 64 bits) not listed.

We could make a version with shared adders (e.g. ASU/IMU/IDU might
share a single 128-bit adder), but its latency would probably be worse,
parallelism would be lost, and scheduling might become more difficult.
On the other hand, it would allow us to make a cheaper FC0 -- we
would save not only a significant portion of die area, but also ~10
Xbar ports.  I'll write such an `Integer Arithmetic Unit' (EU_IAU)
if you're interested.

Now that most of the interesting units are "ready", there is quite a lot of work
left to connect them together and then schedule the operations.
And i don't know what has happened to the Register Set's sources.

@+
YG
