
Re: ZigZag (was Re: [f-cpu] Status quo)



On Thu, Apr 2, 2015 at 12:09 PM,  <whygee@xxxxxxxxx> wrote:
> On 2015-04-01 22:37, Cedric BAIL wrote:
>> The second case is that having that kind of memory layout is also
>> going to help the software implementations that are typically used
>> to do 2D graphics. But in that case the burden on the toolkit doing
>> the 2D graphics is so high to use a special instruction and change
>> its own rendering pipeline that it won't be worth it at all for them
>> to do it. That's where making the operation transparent to the
>> software by flagging some entries in the TLB would pay much more.
>
> I think I get the idea of flagging the TLB entry.
> It solves one issue I raised before : how can the compiler tell
> the CPU to shuffle the address bits on a certain pointer and not another ?

That's the important bit: it doesn't need to. The only modification
needed is in the kernel, plus a way for userspace to request this
optimisation when it allocates memory. Then toolkits and applications
just have to modify the memory allocator they use for images.
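
To make it concrete, here is a rough sketch of what the toolkit side
could look like, assuming the kernel exposed something like a
hypothetical MAP_ZIGZAG mmap flag (no such flag exists today; an
ioctl on a graphics device would do just as well):

#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical flag the kernel could expose; it does not exist today. */
#ifndef MAP_ZIGZAG
#define MAP_ZIGZAG 0x4000000
#endif

/* Image allocator of a toolkit: only this function needs to change,
 * the rest of the rendering code keeps using plain pointers. */
static void *image_buffer_alloc(size_t width, size_t height)
{
    size_t len = width * height * 4;               /* 4 bytes per pixel */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_ZIGZAG, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}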

> I notice that the point is to optimise cache line reuse.
> If there are 4 bytes per pixel, there are only 10 bits to shuffle,
> which is not too bad. Nicolas gave a simple bit shuffling example,
> but the zigzag you showed probably uses a Gray code, which uses
> one XOR per bit. That's a tiny bit more latency than a XOR but
> today the bulk of the delays are in wiring, including fanouts...
>
> However the question of which pattern to use is not solved
> and I see no consensus. If different patterns must be supported,
> a hardware version is unlikely to be efficient and cost-effective.
> Imagine that one pattern is implemented and people need another:
> our pattern will be unused, so it's not even worth implementing it...

That's a good question: which pattern is best, and are some of them
counter-productive? I am not so sure that there is any need to define
the exact pattern from userspace at all. As long as you get better 2D
cache locality, that's all you care about. The kernel will need to
know the details of how to program the TLB, but it doesn't have to
care about the exact pattern either. The only case where the exact
pattern matters is when you define the SoC and choose the one that
matches a potential GPU or best fits the cache subsystem of the CPU.
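
For illustration, one candidate pattern in software form is Morton
order (Z-order), where the x and y bits of a pixel address inside a
tile are interleaved. This is only a sketch of the kind of shuffle we
are talking about, not a pattern anyone has agreed on:

#include <stdint.h>

/* Interleave the low 16 bits of x and y into a Morton (Z-order) index:
 * x bits go to even positions, y bits to odd positions.  Nearby (x, y)
 * pixels end up close together in memory, which is the whole point of
 * the zigzag/tile layouts discussed above. */
static uint32_t morton_index(uint16_t x, uint16_t y)
{
    uint32_t idx = 0;
    for (int i = 0; i < 16; i++) {
        idx |= ((uint32_t)(x >> i) & 1u) << (2 * i);
        idx |= ((uint32_t)(y >> i) & 1u) << (2 * i + 1);
    }
    return idx;
}

/* Byte offset of a 4-byte pixel inside a Morton-ordered buffer. */
static uint32_t pixel_offset(uint16_t x, uint16_t y)
{
    return morton_index(x, y) * 4;
}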

> But the big problem I see is with timing and scheduling inside the CPU.
>
> Let me remind you how memory access works :
> the instruction sends an address to the Load/Store unit (LSU).
> this address is sent to both the cache blocks and the TLB
> so both look the address up in parallel.
>
> The TLB answers with the access rights, owner and ZZ bit.
> At the same time, the data cache answers with the data.
> It's too late, the wrong address is accessed,
> another data cache read is necessary, meaning more
> power draw and more wasted time. The efficiency
> is not better than without ZZ.

I see: you need the ZZ bit to compute the real address you are
accessing in the L1. If you can't avoid that lookup, I do understand
your issue.
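
To restate that dependency in code form (a toy model, every name here
is invented, only the data flow matters):

#include <stdint.h>
#include <stdbool.h>

static uint32_t swizzle(uint32_t addr)        /* stand-in zigzag remap   */
{
    return addr ^ ((addr >> 3) & 0x38);       /* arbitrary bit mixing    */
}

static bool tlb_zz_flag(uint32_t vaddr)       /* stand-in TLB lookup     */
{
    return (vaddr & 0x80000000u) != 0;        /* pretend: top half is ZZ */
}

/* Without ZZ the cache set index depends only on the virtual address,
 * so the data cache and the TLB can be looked up in parallel. */
static uint32_t cache_index_plain(uint32_t vaddr)
{
    return (vaddr >> 6) & 0x3f;
}

/* With ZZ the index also depends on the TLB's answer, so the cache
 * lookup cannot start before the TLB has replied. */
static uint32_t cache_index_zz(uint32_t vaddr)
{
    bool zz = tlb_zz_flag(vaddr);             /* serializing dependency  */
    uint32_t a = zz ? swizzle(vaddr) : vaddr;
    return (a >> 6) & 0x3f;
}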

> Using TLB flags is not a bad idea in principle
> but it moves the issues where it hurts.
> If the double-read is to be avoided, then every
> instruction must wait for the TLB's answer before
> inquiring the data cache. The WHOLE CPU takes a
> performance hit of 30% because reads are vital.

Completely defeating the purpose of having more memory bandwidth :-)

> This is why I advocate solutions that move the address
> shuffling before it reaches the LSU.
>
> F-CPU is a general-purpose processor so it is important
> that adding a feature that boosts one particular operation
> does not impact all the rest.

Sure. I am just trying to expose mechanisms here that help reduce the
need for memory bandwidth. I am wondering how GPUs do it, as they need
to do those translations too, and they seem to have found a way to get
better performance that way for sure!

>>>> and will require manual writing of the assembly code,
>>>
>>> not necessarily, but if your compiler doesn't support the CPU's
>>> features, why use it ?
>>
>> There is no difference to me between writing assembly code and using a
>> C function stub that actually is just converted to an assembly
>> function. It is the same amount of work for the developers.
>
> what about adding a keyword to the compiler ?
> C has "static", "volatile", "inline" etc. and GCC as more
> modifiers/specifiers.

Completely useless, for two reasons. First, nobody is going to use an
architecture-specific keyword in generic source code. Second, the
issue here is that you also need to change your rendering pipeline
logic completely to work on ZZ tiles and benefit from this
instruction. That is a huge amount of work that would not pay off as
much on other targets (or more likely not at all). That's why I think
the solution in the ISA is not going to take off for this use case.

>>> The inter-block pattern is managed by the allocator, ok.
>>> Then how do you define that a pointer must have its LSB mangled ?
>>
>> ioctl, mmap flags, ... whatever fits the needs of the kernel. The point
>> is that the memory allocator for images is already self-contained and
>> requires little software change.
>
> so you push software issues to the hardware.
> hardware is good at handling data but when you touch the address bus,
> things can get ugly :-/

:-)

>> So now I hope it is clearer. Basically
>> my point here is that the more efficient we are with memory bandwidth
>> usage, the better we can use the performance of the CPU for most
>> software. This is just one trick I know of; maybe others have other
>> ideas on how to reduce the memory bandwidth we need for specific
>> operations, and from there we can infer what the best solution is: MMU
>> flags, ISA, block. That's also why I talked about the RLE
>> compression of glyphs. I can't really think of any other trick that
>> would "compress" the memory, but I am sure you got the idea.
>
>
> I'm curious about your RLE compression but it's way too early
> to think about an actual implementation, I'm currently trying to
> get suitable FPGAs ;-) I can't wait to have a good platform !

The RLE compression is pretty simple: there is a first step that
drops precision on the glyph by switching to 4 bits per pixel (that
obviously improves speed for glyphs from 20 to 30 or 40 pixels high, I
forgot the exact number). Then you encode the number of consecutive
identical pixels in the glyph on 4 bits, resulting in 1 byte for as
many as 16 pixels. There is also a jump table, but I forget how it
works. Maybe just an array of bytes, where the index to the start of
the next line is encoded using the first bit of each byte to indicate
whether the next byte is part of the same jump index or not. No rocket
science here, just techniques optimized specifically for glyph
rendering.
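
Roughly, the run-length step looks like this (a sketch of the scheme
as described above, not the actual code; the jump table is left out
since I don't remember how it works):

#include <stdint.h>
#include <stddef.h>

/* Input: glyph coverage already reduced to 4 bits per pixel (0..15).
 * Output: one byte per run, low nibble = pixel value, high nibble =
 * run length - 1, so a single byte covers up to 16 identical pixels.
 * "out" must hold at least npix bytes (worst case: no runs at all). */
static size_t glyph_rle_encode(const uint8_t *px4, size_t npix, uint8_t *out)
{
    size_t o = 0;

    for (size_t i = 0; i < npix; ) {
        uint8_t val = px4[i] & 0x0f;
        size_t run = 1;

        while (i + run < npix && run < 16 && (px4[i + run] & 0x0f) == val)
            run++;
        out[o++] = (uint8_t)(((run - 1) << 4) | val);
        i += run;
    }
    return o;                                 /* number of encoded bytes */
}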

We have a project to start encoding images in a palettized mode
(starting with GRY8 for a test) and to use that as a source on almost
the same principle. That is still a little bit far away for the
moment, but you get the idea. The thing is that it doesn't pay off
that much, as we don't yet have enough information to build a pipeline
where the output buffer automatically adapts its colour space
depending on the series of sources. So we are limited to just
optimizing the current source and still reading 4 bytes and writing 4
bytes in all cases. Finding a way to actually compress the write
buffer would pay much more, but it is way more difficult.

-- 
Cedric BAIL