
Re: [f-cpu] second order prefetch in FC0



hi,

ben franchuk wrote:

cyrano@nerim.net wrote:

I don't like prefetch. Can gcc really calculate the very narrow windows where a prefetch is useful? Prefetches are implementation-dependent but also clock-speed-dependent!

I much prefer multi-load/store (for example, a complete cache line that fills 4 or 8 registers).

So your proposal looks like a kind of double load (a = toto->titi) or load then store (toto->titi = a). This could be a feature of the "internal CPU buses" and a new instruction. As we control L1/L2 access and don't need to conform to the limited features of SDRAM, this kind of bus cycle could be added and optimised close to the cache controller.

The problem is that the prefetch values and other timing information are not
constants but variables. You really want to bunch up all the prefetch
information and sort it on both data flow and the timing variables. Prefetch
also makes a mess of DMA and of serialized instructions, for example when
single stepping.
Ben.
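
To make the timing point concrete, here is a back-of-the-envelope sketch in Python. It uses the usual rule of thumb that a prefetch has to be issued about ceil(memory latency / time per loop iteration) iterations ahead of the load it serves. The function name and all the numbers are made up for illustration; nothing here is measured on a real machine or taken from FC0.

```python
import math

def prefetch_distance(mem_latency_ns, clock_mhz, cycles_per_iteration):
    """Loop iterations a prefetch must run ahead to hide the memory latency."""
    cycle_ns = 1000.0 / clock_mhz                  # duration of one clock cycle
    iteration_ns = cycles_per_iteration * cycle_ns  # duration of one loop iteration
    return math.ceil(mem_latency_ns / iteration_ns)

# Same loop (10 cycles per iteration) and same memory (hypothetical 100 ns
# latency), only the core clock changes:
for mhz in (100, 500, 1000):
    print(mhz, "MHz ->", prefetch_distance(100.0, mhz, 10), "iterations ahead")
```

With these assumed numbers the required distance goes from 1 iteration at 100 MHz to 10 iterations at 1000 MHz, so a distance hard-coded by the compiler for one clock speed is wrong for another.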

i like the idea of "adaptive" computers that record resource utilisation patterns.
This keeps the code highly portable, and some code
becomes efficient after a few loop runs. For example:
- Pentium (P53) records the boundaries of the variable-size instructions
with one bit per byte. These bits are invisible to the program and are generated
automatically the first time the cache line is executed. There is no
decoding/parsing penalty from the second run onwards.
- Alpha 21264 uses data and instruction cache chaining (i don't remember the right term).
Each cache line contains the 2 physical addresses of the last two memory accesses
made after the cache line was used. This speeds up linked lists: when the
cache line is used again, the cache mechanism prefetches the 2 cache lines
referenced by the tag.
But, as i presume, these methods are certainly a patent minefield.
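
The chaining idea can be sketched as a toy simulation in Python. It only models the mechanism as described above (each line remembers the two lines touched right after it and prefetches them the next time it is used); it is not taken from any Alpha documentation. The class and function names are hypothetical, the cache is a tiny LRU, prefetches are assumed to complete instantly, and the link table is assumed to survive eviction of the line it belongs to.

```python
from collections import OrderedDict

class ChainedCache:
    """Toy LRU cache where every line remembers its last two successors."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # cached line numbers, in LRU order
        self.links = {}              # line -> up to two lines accessed right after it
        self.previous = None         # line touched by the previous access
        self.misses = 0

    def _fill(self, line):
        """Bring a line in (or refresh it), evicting the least recently used."""
        self.lines[line] = True
        self.lines.move_to_end(line)
        while len(self.lines) > self.capacity:
            self.lines.popitem(last=False)

    def access(self, line):
        if line not in self.lines:
            self.misses += 1
        self._fill(line)
        # Learn: remember this line as a successor of the previous one.
        if self.previous is not None and self.previous != line:
            links = self.links.setdefault(self.previous, [])
            if line in links:
                links.remove(line)
            links.append(line)
            del links[:-2]           # keep only the two most recent successors
        # Predict: prefetch the lines that followed this one in the past.
        for successor in self.links.get(line, []):
            self._fill(successor)
        self.previous = line

def misses_per_run(order, runs, capacity):
    """Walk the same chain of cache lines several times, counting misses per walk."""
    cache = ChainedCache(capacity)
    result = []
    for _ in range(runs):
        before = cache.misses
        for line in order:
            cache.access(line)
        result.append(cache.misses - before)
    return result

# A "linked list" scattered over 8 lines, walked 3 times with room for only 4.
print(misses_per_run([3, 7, 1, 5, 2, 6, 0, 4], runs=3, capacity=4))
```

This prints [8, 1, 0]: the first walk misses on every line while the links are learned, and later walks hit almost everywhere. A plain LRU cache of the same size would miss 8 times on every walk, because each line is evicted before it is reused.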

Concerning Devik's proposal of "delayed execution", the big problems are
- having the compiler generate the delays (and forcing a recompile if another arch is used)
- there is no room left for this in the opcodes
so my idea was to check how these delays could be generated 'on the fly'
and invisibly, using a first "parsing pass" like the Pentium's instruction
alignment method. But in the end, it consumes much more resources
and pipeline stages than FC0's OOOC so i don't see the point yet.

Concerning the 2nd order prefetch, i don't think we could afford
DEC/Compaq/HP's patents on this ...

YG
