
Re: [f-cpu] new cjump instruction



hi !

nico wrote:

On Sat, 12 Apr 2003 21:16:51 +0200
Yann Guidon <whygee@f-cpu.org> wrote:

hi,

nico wrote:

On Sat, 12 Apr 2003 02:54:07 +0200
Yann Guidon <whygee@f-cpu.org> wrote:

huh, i don't think that it's a good answer ....

on top of that, this technique poses new problems in FC0's pipeline.
sure, addresses are computed fast, but what about their validation,
their fetch, their lookup in the buffers ......

Validation is easy because you stay inside a page.

validation seems ok BUT how do you validate that the new address is
already present and available ? you were the first to mention that
associative logic is slow ....

it is. But where do you see any associative logic ?

it depends on the architecture of the cache memory that is used,
and this can change from implementation to implementation ....

fetches are lightning fast.

i do not agree.

I would say : in this case it will be the fastest compared to the others.

each stage will be as "fast" as in other cases.
however, the sequence of events changes dramatically and the
register-based version is clearly faster for small loops.

You have an entire clock cycle to do it (no
adder or register read before accessing L1 caches).

worse :
you can check a 64-input associative logic (corresponding to the
registers) faster than the "hit" logic of the cache (shorter wires
etc....), just in case the other LSU logic says "no" (hey : one has to
multiplex the 12-bit LSB path from the instruction with the XBAR's
output which leads to the LSU address comparators) [a multiplexor is
not a 'free' gate today]

hmm, i made a mistake here : not only do we have to check the LSU,
but also the Fetcher's buffer, to look up for this address.
and i'm not sure that the address comparison will be only 12-bit wide.
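
to make that concrete, here is a small C model of the register check
(purely illustrative : in hardware the 64 comparisons happen in
parallel, one comparator per register, and all the names are mine) :

#include <stdint.h>
#include <stdbool.h>

/* illustrative software model only : FC0 would do the 64 comparisons
 * in parallel, plus similar checks on the LSU and Fetcher buffers ;
 * the sequential loop is just the functional view. */
bool target_is_at_hand(uint64_t target,
                       const uint64_t known_addr[64],
                       const bool valid[64])
{
    for (int r = 0; r < 64; r++)
        if (valid[r] && known_addr[r] == target)
            return true;   /* hit : no need to go to L1/L2/DRAM */
    return false;          /* miss : the slow path begins */
}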

So let's imagine there is a "dedicated fast path" for this 12-bit address to the L1 cache,
and it works in 1 or 2 cycles (well, this adds another address comparator that
will be active EVERY DAMN CYCLE).
Then, data must be sent from the cache to the LSU (if absent from the
LSU), which easily takes one cycle. Then comes the usual pipeline
sequence : decode, then issue. So in fact it is not faster.
Of course the classical jump is more complex, but it is more flexible.
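
a quick back-of-the-envelope model of that sequence (my own rough
numbers, nothing measured ; compile and run it if you like) :

#include <stdio.h>

/* assumed per-step latencies (cycles) for the immediate-cjump path
 * described above ; the exact figures depend on the implementation. */
enum {
    L1_FAST_PATH = 2,  /* the "dedicated fast path", 1 or 2 cycles */
    CACHE_TO_LSU = 1,  /* move the data from L1 to the LSU         */
    DECODE       = 1,
    ISSUE        = 1
};

int main(void)
{
    printf("immediate cjump, target not already in the LSU : ~%d cycles\n",
           L1_FAST_PATH + CACHE_TO_LSU + DECODE + ISSUE);
    printf("register jump, target in the Fetcher's buffer : decode + issue only\n");
    return 0;
}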

All of this is part of the beginning of the pipeline (fetch stages).

yep, but depending on where the target address is located,
the sequence of events changes.
face it : just like any memory access (either data fetch or jump),
the target is either (or both) in LSU, Fetcher, L1, L2, main DRAM or in mass storage.

For short loops, the register-based jump wins : the target remains in "cache"
in the Fetcher's buffer, the number of checks is minimal, and only
6 bits are needed to encode it.
This comes at the price of prefetching the target : there might be a stall during
the first loop run, that's all.
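
a tiny software model of that behaviour (buffer size and stall cost
are invented, only the shape of the argument matters) :

#include <stdint.h>
#include <stdio.h>

#define FETCH_SLOTS 4          /* assumed buffer size, illustration only */

struct fetch_buffer {
    uint64_t tag[FETCH_SLOTS]; /* addresses of buffered instruction lines */
    int      used;
};

/* the first jump to 'target' stalls (buffer miss), every following
 * iteration hits the buffer : the stall is paid once per loop, not
 * once per iteration. */
static int jump_cost(struct fetch_buffer *fb, uint64_t target)
{
    for (int i = 0; i < fb->used; i++)
        if (fb->tag[i] == target)
            return 0;                      /* hit : no extra cycles   */
    if (fb->used < FETCH_SLOTS)
        fb->tag[fb->used++] = target;      /* prefetch fills a slot   */
    return 3;                              /* assumed L1 refill stall */
}

int main(void)
{
    struct fetch_buffer fb = {0};
    uint64_t loop_head = 0x1000;
    int stalls = 0;
    for (int iter = 0; iter < 100; iter++)
        stalls += jump_cost(&fb, loop_head);
    printf("total stall cycles over 100 iterations : %d\n", stalls); /* 3 */
    return 0;
}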

your immediate cjump does not solve the pipeline bubble problem,
so it has no chance to run small loops faster.
it also requires address checks that are not minimal.
I concede that page boundaries are not to be checked,
but it adds a burden on the software and it does not
remove all the checks for the other things : where in the
memory hierarchy is the target ?

There is nothing about it in the manual or elsewhere.
trust me if you want.

So i use the usual paradigm of a memory bus (address/data).

unless you want to design a small microcontroller, this is not useful.

So the fetcher
sends an address and receives data from the L1 cache memory. The
address must be available as soon as possible. The 12 LSB don't need any
adder, so it's the fastest. (besides, a fast L1 has a 2 or 3 cycle
latency but a throughput of 1)
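
for example, assuming 4 KiB pages and that the 12-bit immediate simply
replaces the page-offset bits of the PC instead of being added to it
(my reading of the proposal), the target is a pure bit-splice with no
carry chain, which is why it can reach the L1 index port so early :

#include <stdint.h>

/* sketch : form the jump target by splicing a 12-bit immediate into
 * the low bits of the PC. A plain AND/OR, no carry propagation.
 * (assumes the target stays inside the current 4 KiB page.) */
static inline uint64_t cjump_target(uint64_t pc, uint32_t imm12)
{
    return (pc & ~0xFFFull) | (imm12 & 0xFFFu);
}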

in this situation, you don't take any instruction buffer into account.
So each time you jump, you have a 2 or 3-cycle penalty.

The purpose of the register-based jump is to benefit from the
instruction buffer (the "Fetcher"'s buffer). It doesn't require an addition either,
and spares the L1 fetch in small loops (guaranteed !)
Of course there is HW overhead to check the validity of the register,
and some SW overhead to prefetch the target BUT it doesn't impose tough
constraints on the linker or compiler (well, not of the kind you
re-proposed) and the prefetch can be simplified quite easily (when several
instructions lead to the same jump address).
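
that last point just means the target load is hoisted and shared ; as
a hypothetical C-level analogy (the actual F-CPU mnemonics are not
quoted here, so everything below is mine), it is the same
transformation as hoisting a function pointer out of a loop :

#include <stdio.h>

void handler(void) { puts("handler"); }   /* dummy target, illustration */

/* several jump sites share one target, so its address is loaded
 * ("prefetched") into a register once and reused by all of them. */
void process(const int *data, int n)
{
    void (*target)(void) = handler;   /* one prefetch...            */
    for (int i = 0; i < n; i++) {
        if (data[i] < 0)   target();  /* ...reused by this site     */
        if (data[i] > 255) target();  /* ...and by this one too     */
    }
}

int main(void)
{
    int data[4] = { -1, 300, 5, -7 };
    process(data, 4);
    return 0;
}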

If you want to add multiple buffers and a bunch of decoders, that's up to
you (this also adds latency). But you will always need a memory
bus somewhere, and it will be the slowest part.
you forget about the instruction buffer.

But the real problem is for the compiler. What is the opinion of the
compiler writer ?

it's a useless constraint ....

no, it's not useless. It permits 0-cycle jumps (without jump prediction).

no.

So unrolling loops will be far less interesting, and you will save L1 cache
space. In the worst case (1 function per .c file), such a file could avoid
using it. Tight loops could be much faster.

your "solution" does not remove pipeline bubbles !
it is NOT possible to jump in no cycle, because the pipeline
is so tight that there are at least 1 or 2 instruction being decoded at the same time.
i know it, because i have fallen in the same trap.

Even if you maintain 2 parallel execution streams in advance,
or store stalled pipeline stages, you CAN'T jump in zero time
simply because the register set would require 2x or 3x more read ports !
And even then, the time to propagate the order to switch from
one execution stream to another is too high, it costs at least
one cycle, so here we are back to the initial problem :
we can't compress time.
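
a toy trace of why (stage count and names are invented, not FC0's real
pipeline) : when the jump is resolved, the younger stages already hold
the fall-through instructions and must be squashed :

#include <stdio.h>

/* toy in-order pipeline : when the jump at address 2 is resolved in
 * Execute, Fetch and Decode already hold the fall-through
 * instructions 3 and 4, which get squashed (-1 = bubble). No
 * addressing trick removes those bubble cycles. */
int main(void)
{
    int pc = 0, fetch = -1, decode = -1, exec = -1;
    int jump_at = 2, jump_to = 10;

    for (int cyc = 0; cyc < 8; cyc++) {
        exec   = decode;
        decode = fetch;
        fetch  = pc++;
        if (exec == jump_at) {          /* jump resolved : flush    */
            fetch = decode = -1;        /* squash 2 younger insns   */
            pc = jump_to;               /* redirect fetch           */
        }
        printf("cycle %d : fetch=%2d decode=%2d exec=%2d\n",
               cyc, fetch, decode, exec);
    }
    return 0;
}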

i hope that this explanation is not confusing.

So devik, what do you think of it ?

nicO

YG

YG
