
Re: ZigZag (was Re: [f-cpu] Status quo)



On 2015-04-01 22:37, Cedric BAIL wrote:
Hello,
hello!

I think I wasn't really clear in my previous description of this
topic. The reason I brought up this idea is that the same zigzag
concept can be used in two cases. The first one is when you use
a GPU and upload a texture. That is a pretty self-contained
piece of code and will usually be manually optimized, so an
instruction is fine for it.

fair enough.

The second case is that this kind of memory layout will also help
the software implementations that are typically used to do 2D
graphics. But in that case, the burden on a 2D graphics toolkit of
using a special instruction and changing its own rendering pipeline
is so high that it won't be worth it at all. That's where making
the operation transparent to the software, by flagging some entries
in the TLB, would pay off much more.

I think I get the idea of flagging the TLB entry.
It solves one issue I raised before: how can the compiler tell
the CPU to shuffle the address bits of one pointer and not another?

I notice that the point is to optimise cache line reuse.
With 4 bytes per pixel, there are only 10 bits to shuffle,
which is not too bad. Nicolas gave a simple bit-shuffling example,
but the zigzag you showed probably uses a Gray code, which needs
one XOR per bit. That's a tiny bit more latency than a plain shuffle,
but today the bulk of the delays are in wiring, including fanouts...
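
To make this concrete, here is a quick C sketch of both manglings.
I'm assuming a 32x32-pixel tile with 4 bytes per pixel, so the 10
address bits above the 2 byte-select bits get shuffled (the tile
shape is my assumption, not something Cedric specified):

#include <stdint.h>

/* Plain bit shuffle: interleave the 5 x bits with the 5 y bits
 * (Morton/Z-order). In hardware this is pure wiring. */
static uint32_t morton_offset(uint32_t x, uint32_t y)
{
    uint32_t off = 0;
    for (int i = 0; i < 5; i++) {
        off |= ((x >> i) & 1) << (2 * i);     /* x bit -> even position */
        off |= ((y >> i) & 1) << (2 * i + 1); /* y bit -> odd position  */
    }
    return off << 2;                          /* 4 bytes per pixel */
}

/* Gray code: one XOR per bit, so consecutive indices differ
 * by a single bit. */
static uint32_t gray(uint32_t v)
{
    return v ^ (v >> 1);
}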

However, the question of which pattern to use is not settled,
and I see no consensus. If different patterns must be supported,
a hardware version is unlikely to be efficient and cost-effective.
Imagine that one pattern is implemented and people need another one:
our pattern will go unused, so it's not even worth implementing...

But the big problem I see is with timing and scheduling inside the CPU.

Let me remind you how memory access works:
the instruction sends an address to the Load/Store Unit (LSU).
This address is sent to both the cache blocks and the TLB,
so both look the address up in parallel.

The TLB answers with the access rights, owner and ZZ bit.
At the same time, the data cache answers with the data.
By then it's too late: the wrong address has been accessed,
and another data cache read is necessary, meaning more
power draw and more wasted time. The efficiency
is no better than without ZZ.

Using TLB flags is not a bad idea in principle,
but it moves the issue to where it hurts.
If the double read is to be avoided, then every
instruction must wait for the TLB's answer before
inquiring the data cache. The WHOLE CPU takes a
performance hit of 30% because reads are vital.
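
Here is the dependency in toy C form; tlb_lookup(), cache_lookup()
and shuffle() are made-up stubs, only the ordering matters:

#include <stdint.h>
#include <stdbool.h>

struct tlb_entry { bool zz; /* plus rights, owner... */ };

/* Made-up stubs, just so the snippet compiles. */
static struct tlb_entry tlb_lookup(uint32_t vaddr)
{ struct tlb_entry e = { (vaddr >> 12) & 1 }; return e; }
static uint32_t shuffle(uint32_t vaddr)     { return vaddr ^ (vaddr >> 1); }
static uint32_t cache_lookup(uint32_t addr) { return addr; }

uint32_t load(uint32_t vaddr)
{
    /* Today: tlb_lookup() and cache_lookup() are issued in the
     * same cycle, on the same vaddr.
     * With a per-page ZZ flag: the cache index depends on the
     * TLB's answer, so the two lookups are serialised. */
    struct tlb_entry e = tlb_lookup(vaddr);
    uint32_t a = e.zz ? shuffle(vaddr) : vaddr;
    return cache_lookup(a);
}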

This is why I advocate solutions that move the address
shuffling before it reaches the LSU.

F-CPU is a general-purpose processor so it is important
that adding a feature that boosts one particular operation
does not impact all the rest.

and will require manual writing of the assembly code,
not necessarily, but if your compiler doesn't support the CPU's
features, why use it?
There is no difference to me between writing assembly code and using a
C function stub that is just converted to an assembly function.
It is the same amount of work for the developers.
what about adding a keyword to the compiler?
C has "static", "volatile", "inline" etc., and GCC has more modifiers/specifiers.
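
For example (the "zigzag" attribute below is hypothetical, it does
not exist in GCC; it only shows what such an extension could look like):

#include <stdint.h>

/* Hypothetical attribute: not a real GCC extension. */
typedef uint32_t zz_pixel __attribute__((zigzag));

void blit_row(zz_pixel *dst, const uint32_t *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i]; /* the compiler would mangle dst's LSBs */
}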

The inter-block pattern is managed by the allocator, OK.
Then how do you specify that a pointer must have its LSBs mangled?

ioctl, mmap flags, ... whatever fits the needs of the kernel. The point
is that the memory allocator for images is already self-contained and
requires little software change.
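
Something like this, for instance; MAP_ZIGZAG is a flag I made up to
illustrate the point, it does not exist in any kernel:

#include <stddef.h>
#include <sys/mman.h>

#define MAP_ZIGZAG 0x80000 /* hypothetical flag, for illustration only */

/* The image allocator asks the kernel to set the ZZ bit on
 * the pages backing a pixel buffer. */
void *alloc_image(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_ZIGZAG, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}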

so you push software issues to the hardware.
Hardware is good at handling data, but when you touch the address bus,
things can get ugly :-/

Given that there are several ways to mix the LSBs, this has no place
inside the CPU or directly on its address bus. And since it's
a problem that is specific to the GPU, why isn't it possible
to manage it on the GPU's side of the bus?

When I read this, I understood that you hadn't understood why I was
talking about this subject :-)
hey, _you_ are the graphics specialist ;-)

I've programmed graphics but not at your advanced level...

So now I hope it is clearer. Basically,
my point here is that the more efficient we are with memory bandwidth
usage, the better we can use the performance of the CPU for most
software. This is just one trick I know of; maybe others have other
ideas on how to reduce the memory bandwidth we need for specific
operations, and from there we can infer what the best solution is: MMU
flags, ISA, block. That's also why I talked about the RLE
compression of glyphs. I can't really think of any other trick that
would "compress" the memory, but I am sure you get the idea.

I'm curious about your RLE compression, but it's way too early
to think about an actual implementation; I'm currently trying to
get suitable FPGAs ;-) I can't wait to have a good platform!
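
Just to check that I understand the idea, here is the kind of glyph
RLE I imagine, assuming a simple (count, alpha) byte-pair encoding;
your actual scheme may differ:

#include <stddef.h>
#include <stdint.h>

/* Expand (count, alpha) pairs into an 8-bit coverage buffer.
 * Returns the number of pixels written; out must be large enough. */
size_t glyph_rle_expand(const uint8_t *in, size_t in_len, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i + 1 < in_len; i += 2) {
        uint8_t count = in[i];     /* run length     */
        uint8_t alpha = in[i + 1]; /* coverage value */
        for (unsigned j = 0; j < count; j++)
            out[o++] = alpha;
    }
    return o;
}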

Oh BTW, OpenGraphics is publishing information about
https://github.com/jbush001/NyuziProcessor

yg
