
Re: [f-cpu] new cjump instruction



On Sat, 12 Apr 2003 21:16:51 +0200
Yann Guidon <whygee@f-cpu.org> wrote:

> hi,
> 
> nico wrote:
> 
> >On Sat, 12 Apr 2003 02:54:07 +0200
> >Yann Guidon <whygee@f-cpu.org> wrote:
> >
> >>huh, i don't think that it's a good answer ....
> >>
> >>on top of that, this technique poses new problems in FC0's pipeline.
> >>sure, addresses are computed fast, but what about their validation,
> >>their fetch, their lookup in the buffers ......
> >>    
> >>
> >
> >Validation is useful because you stay inside a page.
> >  
> >
> validation seems ok BUT how do you validate that the new address is
> already present and available ? you were the first to mention that
> associative logic is slow ....
> 

it is. But where do you see any associative logic ?

> >fetch is lightning fast.
> >
> i do not agree.

I would say: in this case it will be the fastest, compared to the others.

> 
> >You have an entire clock cycle to do it (no
> >adder or register read before accessing L1 caches).
> >  
> >
> worse :
> you can check a 64-input associative logic (corresponding to the
> registers) faster than the "hit" logic of the cache (shorter wires
> etc....), just in case the other LSU logic says "no" (hey : one has to
> multiplex the 12-bit LSB path from the instruction with the XBAR's
> output, which leads to the LSU address comparators) [a multiplexor is
> not a 'free' gate today]
> 
> So let's imagine there is a "dedicated fast path" for this 12-bit 
> address to the L1 cache,
> and it works in 1 or 2 cycles (well, this adds one other address 
> comparator that
> will be active EVERY DAMN CYCLE).
> Then, data must be sent from the cache to the LSU (if absent from the
> LSU), which takes easily one cycle. Then goes the usual pipeline thing
> : decode, then issue.
> so in fact it is not faster.
> Of course the classical jump is more complex, but it is more flexible.
> 

All of this belongs to the beginning of the pipeline (the fetch stages).
There is nothing about it in the manual or elsewhere.

So I use the usual paradigm of a memory bus (address/data): fetch sends
an address and receives data from the L1 cache memory. The address must
be available as soon as possible. The 12 LSBs don't need any adder, so
this is the fastest path. (Besides, a fast L1 has 2 or 3 cycles of
latency but a throughput of 1.)

If you want to add multiple buffers and a bunch of decoders, that's up
to you (it also adds latency). But somewhere you will always need a
memory bus, and it will be the slowest part.

> >But the real problem is for the compiler. What is the opinion of the
> >compiler writter ?
> >
> it's a useless constraint ....

No, it's not useless. It permits 0-cycle jumps (without jump
prediction). So unrolling loops will become far less interesting, and
you will save L1 cache space. In the worst case (one function per .c
file), such a file could simply avoid using it. Tight loops could be
much faster.

So devik, what do you think of it ?

> 
> >nicO
> >  
> >
> YG
> 
> *************************************************************
> To unsubscribe, send an e-mail to majordomo@seul.org with
> unsubscribe f-cpu       in the body. http://f-cpu.seul.org/