I don't like prefetch. Can gcc really calculate the very narrow window in which a prefetch is useful? Prefetches are implementation-dependent, but also clock-speed-dependent!
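To make the timing problem concrete, here is a minimal sketch using `__builtin_prefetch`, which GCC does provide. The function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` are made up for illustration, and the value 16 is only a guess: too small and the data arrives late, too large and the line may be evicted before it is used, and the right value changes with the implementation and the clock.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch.
   The useful value depends on the CPU, the memory latency and the clock. */
#define PREFETCH_DISTANCE 16

/* Sum an array while hinting the cache about elements we will need soon. */
long sum_with_prefetch(const long *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* rw = 0 (read), locality = 1 (low temporal locality). */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 1);
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same whatever distance is chosen; only the timing changes.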
I much prefer multi-load/store (for example, a complete cache line that fills 4 or 8 registers).
So your proposal looks like a kind of double load (a = toto->titi) or a load then store (toto->titi = a). This could be a feature of the "internal CPU buses" and a new instruction. As we control L1/L2 access and don't need to conform to the limited features of SDRAM, this kind of bus cycle could be added and optimised close to the cache controller.
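For reference, this is how I read the two access patterns, written out in plain C. The struct `item` and the function names are hypothetical; only `toto`, `titi` and `a` come from the expressions above. It is a sketch of the dependent memory accesses, not of the proposed instruction itself.

```c
/* Hypothetical record type; only the field name titi comes from the example. */
struct item {
    long titi;
};

/* "Double load": a = toto->titi, with toto itself fetched from memory. */
long double_load(struct item **slot)
{
    struct item *toto = *slot;  /* first load: fetch the pointer */
    return toto->titi;          /* second load: depends on the first */
}

/* "Load then store": toto->titi = a. */
void load_then_store(struct item **slot, long a)
{
    struct item *toto = *slot;  /* load the pointer */
    toto->titi = a;             /* store through it */
}
```

In both cases the second access cannot start before the first has returned the pointer, which is why doing the whole sequence next to the cache controller looks attractive.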
The problem is the prefetch values and other timing information is not a