
Re: [f-cpu] new cjump instruction



On Sun, 13 Apr 2003 19:34:06 +0200
Yann Guidon <whygee@f-cpu.org> wrote:

> hi !
> 
> nico wrote:
> 
> >On Sat, 12 Apr 2003 21:16:51 +0200
> >Yann Guidon <whygee@f-cpu.org> wrote:
> >
> >>hi,
> >>
> >>nico wrote:
> >>
> >>>On Sat, 12 Apr 2003 02:54:07 +0200
> >>>Yann Guidon <whygee@f-cpu.org> wrote:
> >>>
> >>>>huh, i don't think that it's a good answer ....
> >>>>
> >>>>on top of that, this technique poses new problems in FC0's pipeline.
> >>>>sure, addresses are computed fast, but what about their validation,
> >>>>their fetch, their lookup in the buffers ......
> >>>>
> >>>Validation is useless because you stay inside a page.
> >>>
> >>validation seems ok BUT how do you validate that the new address is
> >>already present and available ? you were the first to mention that
> >>associative logic is slow ....
> >>
> >it is. But where do you see any associative logic ?
> >  
> >
> it depends on the architecture of the cache memory that is used,
> and this can change from implementation to implementation ....
> 

Yep. But it is always a memory that has to be accessed. Besides, "usual"
memory IP blocks are one-cycle access (async read, sync write).

> >>>fetches are lightning fast.
> >>>
> >>i do not agree.
> >>    
> >>
> >I would say: it will be the fastest in this case compared to the others.
> >  
> >
> each stage will be as "fast" as in other cases.
> however, the sequence of events changes dramatically and the
> register-based version is clearly faster for small loops.
> 

When you look at gcc's output code, you see heavy use of the immediate
jump, which is translated into 5 consecutive instructions. Not very optimal.
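
As a toy illustration of that cost, here is a sketch of how a jump to a
full 64-bit immediate address might expand when the target has to be built
from 16-bit immediate chunks before a register jump. The mnemonics and the
chunking scheme are assumptions for illustration only, not actual F-CPU
encodings:

```python
def materialize_jump(target, chunk_bits=16):
    """Expand 'jump to 64-bit immediate' into a load-the-constant
    sequence plus a register jump. Mnemonics are hypothetical."""
    seq = []
    for shift in range(0, 64, chunk_bits):
        chunk = (target >> shift) & ((1 << chunk_bits) - 1)
        if shift == 0:
            seq.append(f"loadcons {chunk:#06x}, r1")          # low chunk
        elif chunk:
            seq.append(f"loadcons.{shift} {chunk:#06x}, r1")  # higher chunk
    seq.append("jmp r1")
    return seq

# A target using all four 16-bit chunks costs 4 loads + 1 jump:
print(len(materialize_jump(0x0123456789ABCDEF)))  # 5
```

A near target whose upper chunks are zero would expand to fewer
instructions, but a general 64-bit address pays the full price.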

> >>>You have an entire clock cycle to do it (no
> >>>adder or register read before accessing L1 caches).
> >>>
> >>worse :
> >>you can check a 64-input associative logic (corresponding to the
> >>registers) faster than the "hit" logic of the cache (shorter wires
> >>etc....), just in case the other LSU logic says "no" (hey : one has
> >>to multiplex the 12-bit LSB path from the instruction with the XBAR's
> >>output which leads to the LSU address comparators) [a multiplexor is
> >>not 'free gate' today]
> >>    
> >>
> hmm, i made a mistake here : not only we have to check the LSU,
> but also the Fetcher's buffer, to look up for this address.
> and i'm not sure that the address comparison will be only 12-bit wide.
> 

The most I can say is that it is as usual: not clear at all. :)

At one time or another, you will have to access a memory bus! A pipeline
must be fed every cycle. Adding buffers doesn't change that.

> >>So let's imagine there is a "dedicated fast path" for this 12-bit 
> >>address to the L1 cache,
> >>and it works in 1 or 2 cycles (well, this adds one other address 
> >>comparator that
> >>will be active EVERY DAMN CYCLE).
> >>Then, data must be sent from the cache to the LSU (if absent from the
> >>LSU), which takes easily one cycle. Then goes the usual pipeline thing:
> >>decode, then issue. so in fact it is not faster.
> >>Of course the classical jump is more complex, but it is more flexible.
> >
> >All of this is part of the beginning of the pipeline (fetch stages).
> >
> yep but depending on where the target address is located,
> the sequence of events change.
> face it : just like any memory access (either data fetch or jump),
> the target is either (or both) in LSU, Fetcher, L1, L2, main DRAM or
> in mass storage.
> 

Yes, but the LSU/Fetcher is inside the pipeline. Behind them sit the L1
caches, which serve 95% of the fetches (then L2, then DRAM; there is no
direct access to mass storage!).

> For short loops, the register-based jump wins : the target remains in 
> "cache"

What is L1 if it isn't a cache ?

> in the Fetcher's buffer, and the number of checks is minimal, and
> there are only 6 bits to code this.
> This comes at the price of prefetching the target, there might be a 
> stall during
> the first loop run, that's all.
> 

Does that mean that when there is no loop, you must add bubbles next to
each instruction?
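
To make the trade-off concrete, here is a toy single-issue cycle count
contrasting a jump that costs a bubble on every iteration with a
register-based jump whose prefetch stalls only on the first run. All
numbers are made up for illustration; they are not FC0 measurements:

```python
def loop_cycles(body_len, iters, bubble_per_jump=0, one_time_stall=0):
    """Total cycles for a loop on a toy single-issue pipeline."""
    return one_time_stall + iters * (body_len + bubble_per_jump)

# Immediate jump: assume a 2-cycle bubble on every taken jump.
imm = loop_cycles(body_len=8, iters=100, bubble_per_jump=2)   # 1000
# Register jump with a prefetched target: one 2-cycle stall, first run only.
reg = loop_cycles(body_len=8, iters=100, one_time_stall=2)    # 802
print(imm, reg)
```

The per-iteration bubble dominates for long-running loops, while the
one-time prefetch stall is amortized; straight-line code with no loop is
where the two schemes cost about the same.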

> your immediate cjump does not solve the pipeline bubble problem,
> so it has no chance to run small loops faster.

:) That's because you try to adapt this instruction to your complex
machinery. "My jump" doesn't need all this complexity.

> it also requires address checks that are not minimal.

Nope. No need.

> I concede that page boundaries are not to be checked,
> but it adds a burden on the software and it does not
> remove all the checks for the other things : where is
> the target, in the memory hierarchy ?
> 

That's a false problem. This hierarchy is undefined nowadays.

The only true problem is a software one.


> >There is nothing about it in the manual or elsewhere. 
> >  
> >
> trust me if you want.

:) Good joke !

> 
> >So i use the usual paradigm of a memory bus (adresse/data).
> >
> unless you want to design a small microcontroller, this is not useful.
> 

That's simply how all CPUs work...

> >So fetch
> >sends an address and receives data from the L1 cache memory. The
> >address must be available as soon as possible. So the 12 LSBs don't
> >need any adder, so it's the fastest. (besides, fast L1s have 2 or 3
> >cycles latency but a throughput of 1)
> >  
> >
> in this situation, you don't take any instruction buffer into account.

? The purpose of the fetch stage is to fill the instruction buffer. So
I don't understand your point.

> So each time you jump, you have a 2 or 3-cycle penalty.

It depends on the L1 cache structure only. You must feed the pipeline
every cycle; whatever internal buffer you use, it is always small
compared to the L1. If it isn't, it is much easier to use a faster L1
cache, like the Pentium 4's choice of an 8 KB L1 cache with a latency of
2, compared to the latency of 3 of the Athlon's 64 KB cache.
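
That trade-off can be written as the usual average-memory-access-time
formula. The miss rates and L2 penalty below are invented for
illustration; only the 2- vs 3-cycle hit latencies come from the
Pentium 4 / Athlon comparison above:

```python
def amat(hit_latency, miss_rate, miss_penalty):
    """Average memory access time: hit latency plus expected miss cost."""
    return hit_latency + miss_rate * miss_penalty

# Small fast L1 (2-cycle hit, assumed higher miss rate) vs large slower
# L1 (3-cycle hit, assumed lower miss rate), with an assumed 10-cycle
# penalty to reach L2 on a miss.
small_fast = amat(2, 0.06, 10)   # 2.6 cycles on average
large_slow = amat(3, 0.02, 10)   # 3.2 cycles on average
print(small_fast, large_slow)
```

Whether the small fast cache wins depends entirely on the real miss
rates and miss penalty, which is why both design points exist.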


> 
> The purpose of the register-based jump is to benefit from the
> instruction buffer (the "Fetcher"'s buffer). It doesn't even require
> an addition either,
> and spares the L1 fetch in small loops (garanteed !)

It depends on what you call "small": F-CPU needs a lot of unrolling
because of RAW dependencies (don't forget the "internal" RAW of the MAC
instruction, and the 6 stages of the multiply unit). These buffers are
almost sure to overflow quickly.
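
A rough rule of thumb for how much unrolling that takes: on a
single-issue pipeline, a result that is ready after N cycles needs about
N independent instructions interleaved behind it to avoid stalling, so a
body with a single dependent operation needs roughly N interleaved
copies. This is a toy model, not FC0 timing:

```python
import math

def min_unroll(result_latency, independent_ops_per_copy=1):
    """Interleaved loop copies needed to cover a result latency on a
    toy single-issue pipeline (no other stalls modelled)."""
    return math.ceil(result_latency / independent_ops_per_copy)

# Hiding a 6-cycle multiply latency with one multiply per loop body
# takes about 6 interleaved copies (and 6 live accumulators):
print(min_unroll(6))  # 6
```

Six copies of even a short body quickly exceed a small fetch buffer,
which is the overflow concern raised above.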

> Of course there is HW overhead to check the validity of the register,
> and some SW overhead to prefetch the target BUT it doesn't impose

Overhead that you get because, instead of having one instruction, most
of the time you have 5 of them.

> tough constraints on the linker or compiler (well, not of the kind of
> what you re-proposed) and the prefetch can be simplified quite easily
> (when several instructions lead to the same jump address).
> 

Simplify the prefetch? What is simpler than a memory bus?

> >If you want to add multiple buffer, and bunch of decoder, that's up
> >to you (this also add latency). But you will always need somewhere a
> >memory bus, and it will be the slowest part. 
> >  
> >
> you forget about the instruction buffer.
> 

Nope.

> >>>But the real problem is for the compiler. What is the opinion of
> >the>>compiler writter ?
> >>>
> >>it's a useless constraint ....
> >>    
> >>
> >no it's not useless. It permits a 0-cycle jump (without jump
> >prediction).
> >
> no.
> 

That's because you think of all the complexity you would add.

> >So unrolling loops will be far less interesting; you will save L1
> >cache space. In the worst case (1 function per .c file), such a file
> >could avoid using it. Tight loops could be much faster.
> >  
> >
> your "solution" does not remove pipeline bubbles !
> it is NOT possible to jump in no cycle, because the pipeline
> is so tight that there are at least 1 or 2 instruction being decoded
> at the same time.

??? We have a single-issue CPU.

> i know it, because i have fallen in the same trap.
> 
> Even if you maintain 2 parallel execution streams in advance,
> or store stalled pipeline stages, you CAN'T jump in zero time
> simply because the register set would require 2x or 3x more read ports
> ! And even then, the time to propagate the order to switch from
> one execution stream to another is too high, it costs at least
> one cycle, so here we are back to the initial problem :
> we can't compress time.

That's why you are stuck at the L1 cache speed.

> 
> i hope that this explanation is not confusing.

It was, as usual :)

nicO

> 
> >So devik, what do you think of it ?
> >
> >>>nicO
> >>>
> >>YG
> >>    
> >>
> YG
> 
> *************************************************************
> To unsubscribe, send an e-mail to majordomo@seul.org with
> unsubscribe f-cpu       in the body. http://f-cpu.seul.org/