Re: [f-cpu] new cjump instruction
On Sun, 13 Apr 2003 19:34:06 +0200
Yann Guidon <whygee@f-cpu.org> wrote:
> hi !
>
> nico wrote:
>
> >On Sat, 12 Apr 2003 21:16:51 +0200
> >Yann Guidon <whygee@f-cpu.org> wrote:
> >
> >>hi,
> >>
> >>nico wrote:
> >>
> >>>On Sat, 12 Apr 2003 02:54:07 +0200
> >>>Yann Guidon <whygee@f-cpu.org> wrote:
> >>>
> >>>>huh, i don't think that it's a good answer ....
> >>>>
> >>>>on top of that, this technique poses new problems in FC0's
> >>>>pipeline. sure, addresses are computed fast, but what about their
> >>>>validation, their fetch, their lookup in the buffers ...
> >>>>
> >>>Validation is unnecessary because you stay inside a page.
> >>>
> >>validation seems ok BUT how do you validate that the new address is
> >>already present and available ? you were the first to mention that
> >>associative logic is slow ....
> >>
> >it is. But where do you see any associative logic ?
> >
> >
> it depends on the architecture of the cache memory that is used,
> and this can change from implementation to implementation ....
>
Yep. But there is always a memory to access. Besides, "usual" memory IPs
have single-cycle access (async read, sync write).
> >>>fetch is lightning fast.
> >>>
> >>i do not agree.
> >>
> >>
> >I would say: it will be the fastest in this case compared to the others.
> >
> >
> each stage will be as "fast" as in other cases.
> however, the sequence of events changes dramatically and the
> register-based version is clearly faster for small loops.
>
When you look at gcc output code, you see heavy use of immediate jumps,
each translated into 5 consecutive instructions. Not very optimal.
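The "5 consecutive instructions" above can be sketched with a toy model: on a 64-bit machine whose instruction format only carries 16-bit immediates, a far jump needs 4 constant-load instructions to build the target address plus the jump itself. This is an illustrative sketch, not actual f-cpu syntax; the `loadcons.N` mnemonic below is an assumption modeled loosely on f-cpu's loadcons-style constant loading.

```python
# Rough model of why a jump to a full 64-bit immediate address costs
# ~5 instructions when immediates are limited to 16 bits per instruction.
# Mnemonics are illustrative, not real f-cpu assembly.

def build_far_jump(target, imm_bits=16, word_bits=64):
    """Return the instruction sequence needed to jump to `target`."""
    seq = []
    for chunk in range(word_bits // imm_bits):
        part = (target >> (chunk * imm_bits)) & ((1 << imm_bits) - 1)
        seq.append(f"loadcons.{chunk} r1, 0x{part:04x}")  # fill one 16-bit slice
    seq.append("jmp r1")  # finally jump through the register
    return seq

for insn in build_far_jump(0x7FFF_DEAD_BEEF):
    print(insn)  # 4 constant loads + 1 jump = 5 instructions
```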
> >>>You have an entire clock cycle to do it (no
> >>>adder or register read before accessing L1 caches).
> >>>
> >>worse :
> >>you can check a 64-input associative logic (corresponding to the
> >>registers) faster than the "hit" logic of the cache (shorter wires
> >>etc....), just in case the other LSU logic says "no" (hey: one has to
> >>multiplex the 12-bit LSB path from the instruction with the XBAR's
> >>output which leads to the LSU address comparators) [a multiplexor is
> >>not a 'free gate' today]
> >>
> >>
> hmm, i made a mistake here : not only do we have to check the LSU,
> but also the Fetcher's buffer, to look up for this address.
> and i'm not sure that the address comparison will be only 12-bit wide.
>
The most I can say is that it is as usual: not clear at all. :)
At one time or another, you will have to access a memory bus! A pipeline
must be fed every cycle. Adding buffers doesn't change that.
> >>So let's imagine there is a "dedicated fast path" for this 12-bit
> >>address to the L1 cache,
> >>and it works in 1 or 2 cycles (well, this adds one other address
> >>comparator that
> >>will be active EVERY DAMN CYCLE).
> >>Then, data must be sent from the cache to the LSU (if absent from the
> >>LSU), which easily takes one cycle. Then goes the usual pipeline
> >>thing: decode, then issue. So in fact it is not faster.
> >>Of course the classical jump is more complex, but it is more
> >>flexible.
> >
> >All of this is part of the beginning of the pipeline (fetch stages).
> >
> yep but depending on where the target address is located,
> the sequence of events changes.
> face it : just like any memory access (either data fetch or jump),
> the target is either (or both) in LSU, Fetcher, L1, L2, main DRAM or
> in mass storage.
>
Yes, but the LSU/Fetcher is inside the pipeline. The output is the L1 caches,
which represent 95% of the fetches (then L2, then DRAM; there is no
direct access to mass storage!).
> For short loops, the register-based jump wins : the target remains in
> "cache"
What is L1 if it isn't a cache ?
> in the Fetcher's buffer, and the number of checks is minimal, and
> there are only 6 bits to code this.
> This comes at the price of prefetching the target, there might be a
> stall during
> the first loop run, that's all.
>
That means when there is no loop, you must add a bubble beside each
instruction?
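Yann's amortization argument above can be put in numbers: a register-based jump whose target is prefetched pays one stall on the first iteration, after which the loop runs bubble-free. A toy cycle model, with made-up (not measured FC0) figures for the body length and penalties:

```python
# Toy cycle count for a small loop: the register-based jump pays a
# one-off prefetch stall, an immediate jump pays a fetch bubble on
# every pass. All numbers are illustrative assumptions.

def loop_cycles(iters, body=4, first_miss=3, per_jump_bubble=0):
    """Total cycles: one up-front stall, then `body` cycles plus any
    steady-state jump bubble per iteration."""
    return first_miss + iters * (body + per_jump_bubble)

prefetched = loop_cycles(100)                                   # target held in Fetcher buffer
immediate  = loop_cycles(100, first_miss=0, per_jump_bubble=2)  # assumed 2-cycle fetch bubble
print(prefetched, immediate)  # 403 vs 600: the one-off stall amortizes
```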
> your immediate cjump does not solve the pipeline bubble problem,
> so it has no chance to run small loops faster.
:) Because you try to adapt this instruction to your complex stuff. "My
jump" doesn't need all this complexity.
> it also requires address checks that are not minimal.
Nope. No need.
> I concede that page boundaries are not to be checked,
> but it adds a burden on the software and it does not
> remove all the checks for the other things : where is
> the target, in the memory hierarchy ?
>
That's a false problem. This hierarchy is undefined nowadays.
The only true problem is a software one.
> >There is nothing about it in the manual or elsewhere.
> >
> >
> trust me if you want.
:) Good joke !
>
> >So i use the usual paradigm of a memory bus (adresse/data).
> >
> unless you want to design a small microcontroller, this is not useful.
>
That's just how all CPUs work...
> >So fetch
> >sends an address and receives data from the L1 cache memory. The
> >address must be available as soon as possible. So the 12 LSB don't need
> >any adder, so it's the fastest. (Besides, fast L1s have 2 or 3
> >cycles of latency but a throughput of 1.)
> >
> >
> in this situation, you don't take any instruction buffer into account.
? The purpose of the fetch stage is to fill the instruction buffer. So
I don't understand your point.
> So each time you jump, you have a 2 or 3-cycle penalty.
That depends only on the L1 cache structure. You must feed the pipeline
every cycle whatever internal buffer you use; it is always small
compared to L1. If it's not, it's much easier to use a faster L1 cache,
like the Pentium 4's choice of an 8 KB L1 cache with a latency of 2,
compared to the latency of 3 of the Athlon's 64 KB cache.
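The size-versus-latency trade-off mentioned here can be sketched with a standard average-access-time calculation. The hit rates and L2 penalty below are made-up illustrative values, not actual Pentium 4 or Athlon measurements:

```python
# Sketch of the L1 size-vs-latency trade-off: a small 2-cycle cache
# against a larger 3-cycle one. Hit rates and L2 penalty are assumed,
# purely illustrative numbers.

def avg_access(hit_rate, l1_latency, l2_penalty=10):
    """Average cycles per access for a two-level hierarchy."""
    return hit_rate * l1_latency + (1 - hit_rate) * (l1_latency + l2_penalty)

small_fast = avg_access(hit_rate=0.92, l1_latency=2)  # 8 KB-ish cache
big_slow   = avg_access(hit_rate=0.97, l1_latency=3)  # 64 KB-ish cache
print(round(small_fast, 2), round(big_slow, 2))  # 2.8 vs 3.3 with these assumptions
```

With these (assumed) numbers the small fast cache wins on average latency, which is the design bet nicO attributes to the Pentium 4.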
>
> The purpose of the register-based jump is to benefit from the
> instruction buffer (the "Fetcher"'s buffer). It doesn't even require
> an addition either,
> and spares the L1 fetch in small loops (guaranteed!)
That depends on what you call "small": f-cpu needs a lot of unrolling
because of RAW dependencies (don't forget the "internal" RAW of the
MAC instruction, and the 6 stages of the mul IU). This buffer is almost
sure to overflow quickly.
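The unrolling pressure from the 6-stage multiplier can be quantified with a toy single-issue model: a dependent multiply chain stalls until the previous result retires, and interleaving independent chains (unrolling) hides that latency. The model is an assumption, not actual FC0 scheduling:

```python
# Why f-cpu code wants heavy unrolling: with a 6-stage pipelined
# multiplier, a chain of dependent multiplies inserts bubbles unless
# enough independent chains are interleaved. Idealized single-issue
# model with assumed numbers.

def mul_loop_cycles(muls, unroll, mul_latency=6):
    """Cycles to issue `muls` multiplies when `unroll` independent
    dependency chains are interleaved; unroll >= mul_latency hides
    the latency completely."""
    stall = max(mul_latency - unroll, 0)  # bubbles between dependent ops
    return muls * (1 + stall)

print(mul_loop_cycles(60, unroll=1))  # 360 cycles: 5 bubbles per multiply
print(mul_loop_cycles(60, unroll=6))  # 60 cycles: latency fully hidden
```

A 6x-unrolled loop is exactly what tends to blow past a small instruction buffer, which is nicO's point.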
> Of course there is HW overhead to check the validity of the register,
> and some SW overhead to prefetch the target BUT it doesn't impose
Overhead that appears because instead of having one instruction, most
of the time you have 5 of them.
> tough constraints on the linker or compiler (well, not of the kind of
> what you re-proposed) and the prefetch can be simplified quite easily
> (when several instructions lead to the same jump address).
>
Simplify prefetch? What is simpler than a memory bus?
> >If you want to add multiple buffers and a bunch of decoders, that's up
> >to you (this also adds latency). But you will always need a memory
> >bus somewhere, and it will be the slowest part.
> >
> >
> you forget about the instruction buffer.
>
Nope.
> >>>But the real problem is for the compiler. What is the opinion of
> >>>the compiler writer?
> >>>
> >>it's a useless constraint ....
> >>
> >>
> >no it's not useless. It permits a 0-cycle jump (without jump
> >prediction).
> >
> no.
>
Because you think of all the complexity you will add.
> >So unrolling loops will be far less interesting, and you will save L1
> >cache space. In the worst case (1 function per .c file), such a file
> >could avoid using it. Tight loops could be much faster.
> >
> >
> your "solution" does not remove pipeline bubbles !
> it is NOT possible to jump in no cycle, because the pipeline
> is so tight that there are at least 1 or 2 instruction being decoded
> at the same time.
??? We have a single-issue CPU.
> i know it, because i have fallen in the same trap.
>
> Even if you maintain 2 parallel execution streams in advance,
> or store stalled pipeline stages, you CAN'T jump in zero time
> simply because the register set would require 2x or 3x more read ports
> ! And even then, the time to propagate the order to switch from
> one execution stream to another is too high, it costs at least
> one cycle, so here we are back to the initial problem :
> we can't compress time.
That's why you are stuck by the L1 cache speed.
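Yann's "we can't compress time" point reduces to a simple model: in a single-issue pipeline, a taken jump resolved in stage N squashes the N-1 younger instructions already fetched behind it, so even an instant stream switch costs at least one cycle. The stage count below is an illustrative assumption, not the actual FC0 depth:

```python
# Minimal model of taken-jump cost in a tight single-issue pipeline:
# each taken jump resolved in stage `resolve_stage` squashes the
# younger instructions behind it, i.e. `resolve_stage - 1` bubbles.
# Stage numbers are assumptions, not FC0's real pipeline depth.

def run_cycles(instructions, taken_jumps, resolve_stage=3):
    """Total cycles = one per instruction + bubbles per taken jump."""
    bubbles_per_jump = resolve_stage - 1
    return instructions + taken_jumps * bubbles_per_jump

print(run_cycles(1000, taken_jumps=100))                   # 1200 cycles
print(run_cycles(1000, taken_jumps=100, resolve_stage=2))  # 1100: still 1 bubble each
```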
>
> i hope that this explanation is not confusing.
It was, as usual :)
nicO
>
> >So devik, what do you think of it ?
> >
> >>>nicO
> >>>
> >>YG
> >>
> >>
> YG
>
*************************************************************
To unsubscribe, send an e-mail to majordomo@seul.org with
unsubscribe f-cpu in the body. http://f-cpu.seul.org/