So I didn't intend to put this LUT on the CPU data path at all.
The CPU datapath is where data are processed, modified, etc.
I think it can be implemented as an MMU extension module that handles
the address conversion only for the tiled pages (for graphics
purposes, if/when needed).
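
To make the proposal concrete, here is a minimal sketch of what "address conversion for tiled pages" amounts to when all sizes are powers of two: a pure reordering of address bits. The tile size, the surface width and the name linear_to_tiled are assumptions made up for illustration; nothing in the discussion fixes them.

```python
# Hypothetical geometry, for illustration only.
TILE_BITS = 3    # 8x8-pixel tiles
WIDTH_BITS = 6   # 64-pixel-wide surface


def linear_to_tiled(addr):
    """Reorder the bits of a linear pixel address (y, x) into tiled order."""
    tile_mask = (1 << TILE_BITS) - 1
    x = addr & ((1 << WIDTH_BITS) - 1)
    y = addr >> WIDTH_BITS
    x_lo, x_hi = x & tile_mask, x >> TILE_BITS
    y_lo, y_hi = y & tile_mask, y >> TILE_BITS
    # Tiled layout, most to least significant field: y_hi | x_hi | y_lo | x_lo
    tile_index = (y_hi << (WIDTH_BITS - TILE_BITS)) | x_hi
    return (tile_index << (2 * TILE_BITS)) | (y_lo << TILE_BITS) | x_lo
```

Compared with the linear layout (y_hi | y_lo | x_hi | x_lo), only the two middle fields swap places, so the mapping is a fixed permutation of address bits.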
There is no "magic box" called MMU: there's the LSU, the TLBs, some
buffers here and there, cache memory... Adding a LUT "when requested"
in the address path would not just increase the memory latency but
also make it more complex, because only "certain cycles" will have one
more cycle of latency. It also adds new "visible state" to the CPU
architecture: in a multithreaded CPU, one thread would want one
conversion and another would want a different one, so the LUT has to
be swapped at every thread switch, which makes the switch
ridiculously slow. And you still haven't specified how you'll tell
the LUT to be active or not.
OTOH, a LUT instruction is feasible. Some address bits may be used to
select one thread's contents, and it would be useful for many, many
applications (gamma conversions, crypto, hashes...), unlike reordering
address bits, which is useful for only one specific use.
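
A rough behavioural sketch of such a LUT instruction, assuming a single shared table RAM whose upper address bits come from the thread ID. The sizes, the class LutUnit and its method names are hypothetical; this only models the addressing idea, not any real instruction set.

```python
INDEX_BITS = 8    # assumed: 256 entries per thread
THREAD_BITS = 2   # assumed: 4 hardware threads


class LutUnit:
    """Toy model of a shared LUT RAM, partitioned by thread ID."""

    def __init__(self):
        self.ram = [0] * (1 << (INDEX_BITS + THREAD_BITS))

    def _address(self, thread_id, index):
        # Upper bits select the thread's partition, lower bits the entry.
        thread = thread_id & ((1 << THREAD_BITS) - 1)
        return (thread << INDEX_BITS) | (index & ((1 << INDEX_BITS) - 1))

    def load(self, thread_id, table):
        """Fill one thread's partition with its own table."""
        for i, value in enumerate(table):
            self.ram[self._address(thread_id, i)] = value

    def lut(self, thread_id, index):
        """The 'LUT instruction': one lookup in the caller's partition."""
        return self.ram[self._address(thread_id, index)]


unit = LutUnit()
# Thread 0: gamma conversion table (exponent 1/2.2 chosen as an example).
unit.load(0, [round(255 * (i / 255) ** (1 / 2.2)) for i in range(256)])
# Thread 1: an arbitrary byte substitution (bit inversion) as a stand-in.
unit.load(1, [i ^ 0xFF for i in range(256)])
```

Since each thread only ever addresses its own partition, nothing has to be saved or reloaded on a thread switch, and the same unit serves unrelated uses.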