
Re: ZigZag (was Re: [f-cpu] Status quo)



Hi Cédric :-)

On 2015-04-03 14:20, Cedric BAIL wrote:
On Thu, Apr 2, 2015 at 12:09 PM,  <whygee@xxxxxxxxx> wrote:
I think I get the idea of flagging the TLB entry.
It solves one issue I raised before : how can the compiler tell
the CPU to shuffle the address bits on a certain pointer and not another ?

That's the important bit: it doesn't need to. It is only necessary to
modify the kernel and expose a way for userspace to request this
optimisation when it allocates memory. Toolkits and applications can
then just modify the memory allocator they use for images.

it's fine at this point.

However, it pushes the problem to a place where it becomes worse.
Writes could be delayed for a cycle or two before they commit to the cache;
it would be a little inconvenient but not as troublesome as the reads.
I see no way to feed the TLB's zz flag back to the execution side of the CPU.

However the question of which pattern to use is not solved
and I see no consensus. If different patterns must be supported,
a hardware version is unlikely to be efficient and cost-effective.
Imagine that one pattern is implemented and people need another,
our pattern will be unused so it's not even worth it to implement it...

Which pattern is best, and whether some are counter-productive, is a
good question. I am not too sure that there is any need to define the
exact pattern from userspace at all. As long as you have
better 2D cache locality, that's all you care about. The kernel will
need to know the details of how to address the TLB, but it too need not
care about the exact pattern. The only case where you need to care
about the exact pattern is when you define the SoC and choose the
one that matches a potential GPU or better fits the cache subsystem of
the CPU.

ok so the pattern is "to be determined from actual tests".

As for the GPU, we can use tailored F-CPU cores with wide SIMD,
so a single development environment can target both uses,
and it keeps the zz pattern identical across a given system.

The TLB answers with the access rights, owner and ZZ bit.
At the same time, the data cache answers with the data.
It's too late, the wrong address is accessed,
another data cache read is necessary, meaning more
power draw and more wasted time. The efficiency
is not better than without ZZ.

I see you need the ZZ bit to actually compute the real address you are
accessing in the L1. If you can't avoid that, I do understand your
issue.

It's indeed a tricky problem.

An address is an address...
If you want X, you ask for it, and you get it.
If you want X or Y and it depends on X,
then the computation or lookup of X is the thing to eliminate.

TLB-based memory management works because the address
is split into different parallel sections and
the high bits are looked up while the lower bits
hit the cache array directly. Parallelism is the key.
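
A minimal sketch of that split, to show why the zz flag arrives too
late when it lives in the TLB. The page size, the cache line size and
the dict-based TLB are all assumptions made for the illustration, not
F-CPU parameters.

```python
PAGE_BITS = 12   # assumption: 4 KiB pages
LINE_BITS = 5    # assumption: 32-byte cache lines


def split_address(vaddr):
    """Split a virtual address into (virtual page number, page offset)."""
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    vpn = vaddr >> PAGE_BITS
    return vpn, offset


def translate(vaddr, tlb):
    """Model one access; `tlb` is a hypothetical dict: vpn -> {"frame", "zz"}."""
    vpn, offset = split_address(vaddr)
    # In hardware the two lookups below run in parallel:
    # the low bits index the cache array without waiting for the TLB...
    set_index = offset >> LINE_BITS
    # ...while the high bits are looked up in the TLB.
    entry = tlb[vpn]
    paddr = (entry["frame"] << PAGE_BITS) | offset
    # The zz flag is only known here, after set_index has already been used.
    return paddr, set_index, entry["zz"]


tlb = {0x12345: {"frame": 0x42, "zz": True}}
paddr, set_index, zz = translate(0x12345ABC, tlb)
```

The cache index depends only on the low bits, so a flag that comes out
of the TLB cannot change which line was read in the same cycle.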

The only way I see to remove the computation is
to put the zz flag in the address/pointer itself,
in the MSB for example. This way the property is
carried through all computations, it is set by the
memory allocators and there is no special instruction
to create.

This approach cuts the addressable memory size in half.
But it's ok for a 64-bit computer, I think...
The loss can be reduced with an AND gate, combining 2 or 3 LSB...

This is why I advocate solutions that move the address
shuffling before it reaches the LSU.

With the flag IN the address, the LSU can conditionally
shuffle the address bits before they reach the cache,
so the honor is safe :-P
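
A rough sketch of the idea, under loud assumptions: the flag sits in
bit 63, the shuffle is a Morton-style interleave of the low x/y bits of
a 16x16 tile, and tag_zz() stands in for what a memory allocator would
do. The actual pattern is still "to be determined", so the interleave
here is only a placeholder.

```python
ZZ_FLAG = 1 << 63        # assumption: flag in the MSB of a 64-bit pointer
ADDR_MASK = ZZ_FLAG - 1  # the remaining 63 address bits
K = 4                    # assumption: 16x16 tile, 4 bits of x and 4 of y


def tag_zz(addr):
    """Hypothetical allocator helper: mark a pointer as zz-addressed."""
    return addr | ZZ_FLAG


def interleave(x, y, k):
    """Morton-style interleave: x bits go to even positions, y bits to odd."""
    out = 0
    for i in range(k):
        out |= ((x >> i) & 1) << (2 * i)
        out |= ((y >> i) & 1) << (2 * i + 1)
    return out


def lsu_address(ptr, k=K):
    """Address sent to the cache: shuffled only if the pointer carries the flag."""
    if not ptr & ZZ_FLAG:
        return ptr                       # ordinary pointer, untouched
    addr = ptr & ADDR_MASK               # strip the flag
    low_mask = (1 << (2 * k)) - 1
    low = addr & low_mask                # assumed layout: [y: k bits][x: k bits]
    x = low & ((1 << k) - 1)
    y = low >> k
    return (addr & ~low_mask) | interleave(x, y, k)


p = tag_zz(0x1200 | (0b0011 << 4) | 0b0101)   # y = 3, x = 5
shuffled = lsu_address(p)
```

Because the decision only needs a bit of the pointer itself, nothing
has to wait for the TLB, and ordinary pointer arithmetic such as p + 1
keeps the flag set.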

F-CPU is a general-purpose processor so it is important
that adding a feature that boosts one particular operation
does not impact all the rest.

Sure. I am just here trying to expose mechanisms that help reduce the
need for memory bandwidth.

I know :-)

I am wondering how GPUs are doing it, as they will
need to do those translations too, and they seem to have found a way to
get better performance that way for sure !

Dig through the patents ?

so you push software issues to the hardware.
hardware is good at handling data but when you touch the address bus,
things can get ugly :-/

:-)

Welcome to the F-CPU world !


The RLE compression is pretty stupid: a first step drops
precision on the glyph by switching to 4 bits (that obviously improves
speed for glyphs from 20 to 30 or 40 pixels high, I forgot the exact
number). Then you encode the number of consecutive identical pixels in
the glyph on 4 bits, resulting in 1 byte for as many as 16 pixels.
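
One possible reading of that scheme, as a small sketch. The nibble
order (run length minus one in the high nibble, 4-bit value in the low
nibble) and the function names are assumptions; the real encoder may
pack things differently.

```python
def rle4_encode(pixels):
    """Encode 8-bit glyph pixels as bytes of (run - 1) << 4 | 4-bit value."""
    out = bytearray()
    i = 0
    while i < len(pixels):
        value = pixels[i] >> 4          # drop precision: 8 bits -> 4 bits
        run = 1
        # Extend the run while the quantized value repeats, up to 16 pixels.
        while (i + run < len(pixels) and run < 16
               and (pixels[i + run] >> 4) == value):
            run += 1
        out.append(((run - 1) << 4) | value)
        i += run
    return bytes(out)


def rle4_decode(data):
    """Expand each byte back to `run` pixels, scaling 4 bits to 8 bits."""
    out = bytearray()
    for b in data:
        run = (b >> 4) + 1
        value = b & 0x0F
        out.extend([value * 17] * run)  # 0x0 -> 0x00, 0xF -> 0xFF
    return bytes(out)


glyph_row = bytes([0] * 20 + [255] * 3)
packed = rle4_encode(glyph_row)
```

With this packing a run of 16 identical pixels costs one byte, and a
row with no repeats at all costs one byte per pixel, so the output is
never larger than the 8-bit source.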

I see, so the drop to 4 bits keeps the data from expanding in the worst case.

There is also a jump table, but I forget how it works.

a jump table ? where ? how ? why ? what ?

Maybe just an
array of bytes, where the index to the start of the next line is encoded
using the first bit of each byte to indicate whether the next byte is part
of the same jump index or not. No rocket science here, just optimized
techniques specifically for glyph rendering.
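
Read literally, that description resembles a variable-length integer
encoding. The sketch below is only a guess at it: it assumes the "first
bit" is the top bit of each byte, that it means "more bytes follow",
and that the remaining 7 bits carry the index, low bits first. None of
this is confirmed by the actual code.

```python
def encode_offsets(offsets):
    """Pack line-start offsets, 7 bits per byte, top bit = 'more follows'."""
    out = bytearray()
    for n in offsets:
        while True:
            low7 = n & 0x7F
            n >>= 7
            if n:
                out.append(0x80 | low7)   # continuation bit set
            else:
                out.append(low7)          # last byte of this index
                break
    return bytes(out)


def decode_offsets(data):
    """Inverse of encode_offsets: rebuild the list of offsets."""
    offsets = []
    value = 0
    shift = 0
    for b in data:
        value |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7                    # same index continues in next byte
        else:
            offsets.append(value)
            value = 0
            shift = 0
    return offsets


table = encode_offsets([0, 5, 300])
```

Small offsets take one byte each, and only the occasional large one
spills into a second byte.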

I'm confused here.

We have a project to start encoding images in a palettized mode
(starting with GRY8 as a test) and use that as a source on almost the
same principle. That is still a little way off for the moment,
but you get the idea.

not at all :-D I miss the context, I'm not a 2D pipeline specialist.

The thing is that it doesn't pay that much, as we
don't yet have enough information to build a pipeline where the output
buffer will automatically adapt its color space depending on a series
of sources. So we are limited to just optimizing the current source and
still read 4 bytes and write 4 bytes in all cases. Finding a way to
actually compress the write buffer would pay much more, but is way
more difficult.

I should go back to my FPGA then :-D

yg
*************************************************************
To unsubscribe, send an e-mail to majordomo@xxxxxxxx with
unsubscribe f-cpu       in the body. http://f-cpu.seul.org/