
Re: [f-cpu] F-CPU fetch unit



hi,

Bogdan Petrisor wrote:

--- Yann Guidon <whygee@xxxxxxxxx> wrote:


stop! The PC is not a real register.
In fact, there is no PC. It is just provided as an artifact
for old-school programming.

"PC" is just a read-only port as seen from the Xbar.
It is "built" at every cycle by the Fetcher, which concatenates
several internal fields and counters. That's all.
The PC is "advanced" simply because the running counters
are incremented to (speculatively) provide new instructions to
the instruction decoder.


ok, so the PC (let's call it that, but we're referring to the RO port of the Xbar)

yup, and more specifically: there are several versions of the PC, as the instructions flow
in the pipeline. Currently, there is a 2-stage FIFO, so the "current PC" as seen from the Xbar
corresponds to the instruction that has been issued, so we can compute correct addresses.


must have the width of the other registers that are connected to the Xbar.

right, and this is zero-extended whenever needed.
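
To make the "rebuilt, not stored" idea concrete, here is a small C sketch: a line keeps the
virtual-address MSBs and a running in-page counter, the "PC" seen on the Xbar port is their
concatenation (returned in a 64-bit value, so already zero-extended), and a 2-deep FIFO keeps
it aligned with the issued instruction. The names, widths and details are my own assumptions,
not the actual VHDL.

  /* Illustrative sketch only: how the "PC" read port could be rebuilt
     every cycle from a Fetcher line's internal fields.  All names and
     widths are assumptions made for this example. */
  #include <stdint.h>
  #include <stdio.h>

  typedef struct {
      uint64_t vpage_msb;     /* virtual address MSBs kept in the line     */
      unsigned page_shift;    /* log2(page size), known from the TLB       */
      uint64_t counter;       /* running in-page counter (speculative)     */
  } fetch_line;

  /* Concatenate the stored MSBs with the in-page counter; returning a
     uint64_t zero-extends the result to the register width.             */
  static uint64_t rebuild_pc(const fetch_line *l)
  {
      uint64_t in_page = l->counter & ((1ULL << l->page_shift) - 1);
      return (l->vpage_msb << l->page_shift) | in_page;
  }

  /* 2-deep FIFO so the PC visible on the Xbar port corresponds to the
     instruction that has actually been issued.                          */
  typedef struct { uint64_t stage[2]; } pc_fifo;

  static uint64_t pc_fifo_step(pc_fifo *f, uint64_t fetched_pc)
  {
      uint64_t issued_pc = f->stage[1];
      f->stage[1] = f->stage[0];
      f->stage[0] = fetched_pc;
      return issued_pc;
  }

  int main(void)
  {
      fetch_line l = { .vpage_msb = 0x12345, .page_shift = 12, .counter = 0x40 };
      pc_fifo f = { { 0, 0 } };
      for (int i = 0; i < 4; i++) {
          printf("Xbar PC = 0x%016llx\n",
                 (unsigned long long)pc_fifo_step(&f, rebuild_pc(&l)));
          l.counter += 4;      /* speculative advance of the line          */
      }
      return 0;
  }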

That would make it 64 for now (a
configurable constant, from what I've seen in the source files). This PC is an output from the
fetcher. But how does the jump work then? Is the instruction decoded here instead of in the decoder?


Jumps (as well as call, return, etc.) are a bit weird when compared to other architectures.
The goals were 1) to escape current patents, 2) to remove unnecessary operations, and 3) to exploit
as much as possible a certain set of seemingly unrelated units, to get the most out of them.


To start with the developer's eye: there is no "jump +100" or the like (as seen in x86 and others).
The reason: computing the jump address, checking the TLB, and so on, is VERY slow.
So the slowness is spread among several operations: one "prepares" the jump, and another
actually performs it (this also reduces the need for delay slots).
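
A rough way to picture the split, in C (the function names are invented for the example, they
are not real F-CPU mnemonics): the slow half is scheduled early, away from the branch, and the
fast half only consumes its result.

  /* Minimal sketch of the "prepare the jump, then perform it" split.
     Function and field names are invented for illustration; they are
     not real F-CPU mnemonics. */
  #include <stdint.h>
  #include <stdbool.h>
  #include <stdio.h>

  typedef struct {
      uint64_t target;   /* jump address, computed ahead of time        */
      bool     ready;    /* TLB checked, target already being fetched   */
  } prepared_jump;

  /* Slow half: address arithmetic, TLB check, speculative fetch start.
     Scheduled as early as possible, far from the branch.               */
  static prepared_jump prepare(uint64_t pc, uint64_t reg, int64_t imm)
  {
      prepared_jump p = { .target = pc + reg + (uint64_t)imm, .ready = true };
      return p;
  }

  /* Fast half: the control transfer only consumes the prepared state,
     so no address computation or translation is left to do.            */
  static uint64_t jump(const prepared_jump *p, uint64_t fallthrough)
  {
      return p->ready ? p->target : fallthrough;
  }

  int main(void)
  {
      uint64_t pc = 0x1000;
      prepared_jump p = prepare(pc, 0x200, 0x10);  /* ... useful work here ... */
      pc = jump(&p, pc + 4);
      printf("new PC = 0x%llx\n", (unsigned long long)pc);
      return 0;
  }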


The best code examples are found in the call/return and loop situations,
where we have a perfect locality to exploit. The jump addresses are stored
in a normal register (that is, the VIRTUAL one, when in user mode, so that
the REAL state of the computer can be saved transparently whenever
a task switch occurs) as well as in the Fetcher (the linear address).
Return and loop are specific kinds of jump
because there is almost nothing to do: the Fetcher already contains
the target instructions. We just have to instruct it to keep the data
for a while, so there is no penalty when reusing the instructions.
This is done with the "call" and "loopentry" instructions:
they mark the "current line" in the Fetcher as "do not flush".
If a flush or a task switch happens anyway, the data are not lost:
the jump address is kept in the register set.
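
For concreteness, here is a small C sketch of that "do not flush" marking; the field names and
the victim policy are assumptions made for the example, not the real design.

  /* Sketch of how "call" / "loopentry" could pin the current Fetcher
     line so a later return or loop-back finds its instructions still
     there.  Field names and the victim policy are assumptions.        */
  #include <stdint.h>
  #include <stdbool.h>

  #define NLINES 8

  typedef struct {
      uint64_t vaddr;         /* address of the instructions in the line */
      bool     valid;
      bool     do_not_flush;  /* set by call / loopentry                 */
  } fetch_line;

  static fetch_line lines[NLINES];

  /* Mark the line currently feeding the decoder as "do not flush".      */
  static void mark_do_not_flush(int current_line)
  {
      lines[current_line].do_not_flush = true;
  }

  /* When room is needed, only unpinned lines are eviction candidates
     (the real policy would be more subtle).                            */
  static int pick_victim(void)
  {
      for (int i = 0; i < NLINES; i++)
          if (!lines[i].do_not_flush)
              return i;
      return -1;               /* everything pinned: nothing to evict    */
  }

  int main(void)
  {
      mark_do_not_flush(0);    /* e.g. on "loopentry"                    */
      return pick_victim();    /* now never returns line 0               */
  }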

For a "normal jump", there is also a prefetch instruction,
that loads the current PC, adds a register and an immediate,
and stores the result to another register. A side effect is that
the result is checked in the TLB and the target register
is "associated" to a given line of the Fetcher, which starts
a memory fetch if it is absent.
This way, the jump or call instructions do have one cycle
of penalty in the best case (this is not compressible). In the
worst case (which I think will be quite common), this might
take around ten cycles. So scheduling is very important,
but not as difficult as on x86 because every architectural
detail is exposed.
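
Here is a hedged C sketch of that prefetch's side effects: compute the target, check it in the
TLB, and associate the target register with a Fetcher line, starting a fill if the target is
absent. The TLB/fill interfaces, the 32-byte (256-bit) line granularity and the trivial victim
choice are all assumptions made for the example.

  /* Sketch of the prefetch side effects: compute the target, check the
     TLB, associate the target register with a Fetcher line and start a
     fill if the target is absent.  Interfaces, line granularity and the
     victim choice are assumptions. */
  #include <stdint.h>
  #include <stdbool.h>

  #define NLINES     8
  #define LINE_MASK  (~(uint64_t)0x1F)   /* 256-bit wide lines           */

  typedef struct {
      uint64_t vaddr;       /* line-aligned address held by the line     */
      bool     valid;
      bool     fetching;    /* fill from L1 / memory in progress         */
      int      assoc_reg;   /* register "associated" with the line       */
  } fetch_line;

  static fetch_line lines[NLINES];

  /* Stubs standing in for the real TLB and fill logic.                  */
  static bool tlb_check(uint64_t vaddr)        { (void)vaddr; return true; }
  static void start_fill(int line, uint64_t a) { (void)line; (void)a; }

  /* Called when the prefetch computes PC + reg + imm into target_reg.   */
  static void prefetch_target(uint64_t pc, uint64_t reg, int64_t imm,
                              int target_reg, uint64_t *reg_file)
  {
      uint64_t target = pc + reg + (uint64_t)imm;
      reg_file[target_reg] = target;     /* the architectural result      */

      if (!tlb_check(target))
          return;                        /* the fault shows up on the jump */

      for (int i = 0; i < NLINES; i++)   /* already present? just tag it  */
          if (lines[i].valid && lines[i].vaddr == (target & LINE_MASK)) {
              lines[i].assoc_reg = target_reg;
              return;
          }

      int victim = 0;                    /* real policy is more subtle    */
      lines[victim] = (fetch_line){ target & LINE_MASK, false, true, target_reg };
      start_fill(victim, target & LINE_MASK);
  }

  int main(void)
  {
      uint64_t regs[64] = { 0 };
      prefetch_target(0x1000, 0x200, 0x10, 5, regs);
      return lines[0].fetching ? 0 : 1;
  }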

Note that the LSU (that is, load and store)
works mostly the same, with the added complexity of
supporting reads and writes (so there are a few more ports)
with any bit width. There, latency is hidden because
the read or write can be performed in parallel with the pointer update.
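
As a tiny illustration of that idea (an access happening in parallel with the pointer update);
the names are invented and memory is modeled as a plain word array, not the real LSU interface.

  /* Tiny illustration of "access in parallel with pointer update":
     one operation returns the loaded data and the advanced pointer.    */
  #include <stdint.h>

  static uint64_t load_update(const uint64_t *mem, uint64_t *index,
                              int64_t stride)
  {
      uint64_t data = mem[*index];   /* access at the old pointer value  */
      *index += (uint64_t)stride;    /* pointer update in the same op    */
      return data;
  }

  int main(void)
  {
      uint64_t mem[4] = { 1, 2, 3, 4 }, idx = 0, sum = 0;
      for (int i = 0; i < 4; i++)
          sum += load_update(mem, &idx, 1);
      return (int)sum;               /* 10 */
  }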


- The Fetcher SPECULATIVELY fetches instructions from the L1 I-cache and from outside memory in case of a miss.
- The Fetcher is composed of several "lines" (say, 8 for a start). These are a sort of "cache" for the L1 cache,
but with many ports (read ports go to the decoder and I-L1, write ports are fed by the external memory and the L1).
- Each line contains a field that says whether the translation is OK, and an additional field contains the virtual address's MSBs
(so that the PC can be rebuilt).
- There is also a flag that gives the size of the page. It is used both for rebuilding the PC and to trigger a TLB lookup
when the internal counter overflows (we have to know which of the counter's bits will trigger the lookup; see the sketch below).
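
To put those fields together, here is one possible, purely illustrative C model of a line,
including the "which counter bit triggers the lookup" check; the names and widths are guesses,
not the actual implementation.

  /* One possible (purely illustrative) C model of a Fetcher line with
     the fields listed above, plus the page-crossing check.             */
  #include <stdint.h>
  #include <stdbool.h>

  #define LINE_BYTES 32                /* 256-bit wide line              */

  typedef struct {
      uint8_t  insn[LINE_BYTES];       /* instructions held by the line  */
      bool     translation_ok;         /* TLB lookup done and valid      */
      uint64_t vaddr_msb;              /* virtual address MSBs (PC rebuild) */
      unsigned page_shift;             /* log2(page size), from the TLB  */
      uint64_t counter;                /* running offset inside the page */
      bool     do_not_flush;           /* set by call / loopentry        */
  } fetch_line;

  /* The page-size field tells which counter bit matters: if an
     increment changes the bits above it, we crossed a page boundary
     and a new TLB lookup must be triggered.                            */
  static bool crosses_page(const fetch_line *l, uint64_t step)
  {
      uint64_t page_mask = ~((1ULL << l->page_shift) - 1);
      return ((l->counter + step) & page_mask) != (l->counter & page_mask);
  }

  int main(void)
  {
      fetch_line l = { .page_shift = 12, .counter = 0xFFC };
      return crosses_page(&l, 8);      /* 1: the next fetch leaves the page */
  }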


so you mean something like 8 little fetchers, and while one of them puts the current instruction out, the
others get more instructions from the cache?

there is one "Fetcher", containing 8 (more or less) independent lines.
Only one can read the L1 at a time, because there is usually only one read port
(but it is very wide, 256 bits at first, and the L1 is certainly pipelined itself).
Same for L1 writes. For memory reads/writes, there is only one memory interface
(with a configurable width, like 64 or 256 bits), so it is slower (one of the Fetcher's lines
is tied up for 2 or 4 cycles). And because the LSU works mostly the same,
a priority must be computed to decide which of the LSU or the Fetcher can access
the external memory at a given time.
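
A minimal sketch of such a priority computation, assuming a simple fixed ordering (the real
policy would certainly be more refined):

  /* Minimal sketch of the LSU / Fetcher arbitration for the single
     external memory interface.  The fixed priority is an assumption. */
  #include <stdbool.h>

  typedef enum { GRANT_NONE, GRANT_LSU, GRANT_FETCHER } grant_t;

  typedef struct {
      bool memory_busy;         /* previous 2- or 4-cycle burst ongoing  */
      bool fetch_demand_miss;   /* the decoder is stalled on this line   */
      bool lsu_request;         /* pending load/store to external memory */
      bool fetch_speculative;   /* purely speculative line fill          */
  } mem_requests;

  static grant_t arbitrate(const mem_requests *r)
  {
      if (r->memory_busy)       return GRANT_NONE;
      if (r->fetch_demand_miss) return GRANT_FETCHER;  /* unblock decode */
      if (r->lsu_request)       return GRANT_LSU;
      if (r->fetch_speculative) return GRANT_FETCHER;
      return GRANT_NONE;
  }

  int main(void)
  {
      mem_requests r = { .lsu_request = true, .fetch_speculative = true };
      return arbitrate(&r) == GRANT_LSU ? 0 : 1;   /* 0 expected         */
  }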



So when a miss finally happens, does the fetcher get only the guilty instruction from the RAM, or is it
responsible for the replacement of the pages in the cache?


The Fetcher does not replace entries in the TLB.
It fills the cache and its lines with data that belong to the current process,
as indicated in the TLB.


You see, it is quite complex: the function arises from the collaboration of several specialised units,
rather than a single "black box" (that's why the term "MMU" is not suitable at all).
And developing the Fetcher+LSU+TLB+Cache memory system is not as easy as creating an execution unit, because there are a LOT of side effects.


yes, I (usually) love problems :D:D. The reason I'm asking this is that I like to have a
general idea about the ports before starting to work on a block/entity/unit/whatever you like to
call it.


The "execution units" (in the "execution pipeline") are quite straightforward to design,
they have an obvious interface (data in and out, plus all the necessary flags) and they
can stand alone. However, as you see, the rest is not easy at all, because of their interactions
and collective work.


But the execution units are not all finished.

YG

*************************************************************
To unsubscribe, send an e-mail to majordomo@xxxxxxxx with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/