
Re: [f-cpu] Status quo



Hi Cedric,

On 03/31/2015 05:13 PM, Cedric BAIL wrote:
On Tue, Mar 31, 2015 at 1:02 AM, Nikolay Dimitrov <picmaster@xxxxxxx>
wrote:
On 03/30/2015 11:30 PM, Cedric BAIL wrote:
On Mon, Mar 30, 2015 at 9:38 PM,  <whygee@xxxxxxxxx> wrote:
On 2015-03-30 14:21, Nicolas Boulay wrote:
2015-03-30 13:42 GMT+02:00 <whygee@xxxxxxxxx>:
For the "crypto" stuff, I guess that AES and CRC are the basis
things you need. I would argue that helper for ECC is also very
useful this day ( https://tools.ietf.org/html/rfc5656 ). Them
being accessible from outside of the kernel space is an absolute
must (That's why instruction are usually better suited for that
task).

Linux CAAM and cryptodev provide such a device abstraction.

Not really, if I remember correctly, as they require a system call
to actually access the device. This can have a huge impact on what
and when you can use them.

It's a virtual device, accessed by ioctls. You can have 0 or more
physical devices abstracted by the driver; the decision of how to
implement the interface is yours. The only limitation which always
applies is to make sure the driver supports multiple contexts
(users) at the same time. But again I agree, the crypto stuff can be
implemented anywhere, including in userspace - a custom library,
openssl, UIO, wherever you want.
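
For reference, a single operation through cryptodev looks roughly
like this (a minimal sketch from memory of the cryptodev-linux ioctl
interface, error handling omitted) - it also shows Cedric's point
that every operation goes through a syscall:

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <crypto/cryptodev.h>

/* Sketch: one AES-128-CBC encryption through /dev/crypto.
 * Error handling omitted; key, iv and buffers are assumed valid. */
int encrypt_once(uint8_t *key, uint8_t *iv,
                 uint8_t *src, uint8_t *dst, size_t len)
{
    int cfd = open("/dev/crypto", O_RDWR, 0);
    struct session_op sess;
    struct crypt_op cryp;

    memset(&sess, 0, sizeof(sess));
    sess.cipher = CRYPTO_AES_CBC;
    sess.keylen = 16;
    sess.key    = key;
    ioctl(cfd, CIOCGSESSION, &sess);     /* create a session (syscall)  */

    memset(&cryp, 0, sizeof(cryp));
    cryp.ses = sess.ses;
    cryp.op  = COP_ENCRYPT;
    cryp.len = len;
    cryp.src = src;
    cryp.dst = dst;
    cryp.iv  = iv;
    ioctl(cfd, CIOCCRYPT, &cryp);        /* every operation is an ioctl */

    ioctl(cfd, CIOCFSESSION, &sess.ses); /* tear down the session       */
    close(cfd);
    return 0;
}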

Actually, one of the things that I would have loved to see many
times is a way to tell the MMU to map a memory area as a tile:
using it linearly, but with the physical memory being spread by
blocks over a surface. I am pretty sure I am not being clear here,
but the main purpose is to be able to swizzle memory at no cost
before uploading it to the GPU. So you would just allocate the
memory using a specific syscall, and that memory, which appears
linear to the process, could directly be used by a GPU texture
unit. You would be able to directly decompress a JPEG in memory,
for example, and the GPU could use that memory directly at no cost
and with the best possible frame rate. As an additional benefit,
you could use that same texture with a software renderer and
benefit from better cache locality. Of course that requires
cooperation with the GPU block...
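
Just to make the idea concrete, here is a minimal sketch of what
such a tiled ("swizzled") address mapping could look like. The 4x4
tile size and the row-major-of-tiles order are only assumptions for
illustration - real GPUs use their own patterns, and an MMU-based
trick could of course only remap at page granularity:

#include <stddef.h>
#include <stdint.h>

#define TILE_W 4
#define TILE_H 4

/* Map a linear (x, y) pixel coordinate to its offset (in pixels)
 * inside a tiled layout: TILE_W x TILE_H blocks stored contiguously,
 * the tiles themselves laid out in row-major order. */
static size_t tiled_offset(uint32_t x, uint32_t y, uint32_t width)
{
    uint32_t tiles_per_row = (width + TILE_W - 1) / TILE_W;
    uint32_t tile_x = x / TILE_W, in_x = x % TILE_W;
    uint32_t tile_y = y / TILE_H, in_y = y % TILE_H;
    size_t   tile   = (size_t)tile_y * tiles_per_row + tile_x;

    return tile * (TILE_W * TILE_H) + (size_t)in_y * TILE_W + in_x;
}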

Isn't this what Linux DRI2 is already doing?

No, DRI is a direct path for sending GPU commands from a user-space
application (it still requires interception and analysis by the
kernel before being sent to the GPU). Here I am talking about the
texture upload operation, which usually requires converting the
memory layout before uploading/exposing it to the GPU.

DRI uses memory managed by DRM, which does exactly what you need.

Most 2D rendering operations are just limited by that; only smooth
up- and down-scaling is CPU-intensive enough that you may be able
to use 2 cores before you saturate your memory.

Yes. And color space conversion is a nasty one too, especially when
you need to do it on all graphics surfaces arranged in several
semi-transparent layers - the count of color-converted pixels can
easily reach 2x-3x the screen resolution itself. It's better to
have such modules in hardware - they can convert multiple pixels
per clock, offloading the CPU and reducing the image latency.

Agreed, color conversion is also another area where you want a good
infrastructure to do it (be it a set of instructions or an external
block).
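
To show the kind of per-pixel arithmetic involved, one pixel of a
plain BT.601-style YCbCr to RGB conversion on the CPU looks roughly
like this (a fixed-point sketch, coefficients rounded to /256):

#include <stdint.h>

static inline uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* One pixel of full-range BT.601 YCbCr -> RGB, fixed point (x256). */
static void ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr,
                         uint8_t *r, uint8_t *g, uint8_t *b)
{
    int c = y, d = cb - 128, e = cr - 128;

    *r = clamp_u8(c + ((359 * e) >> 8));          /* + 1.402 * Cr          */
    *g = clamp_u8(c - ((88 * d + 183 * e) >> 8)); /* - 0.344*Cb - 0.714*Cr */
    *b = clamp_u8(c + ((454 * d) >> 8));          /* + 1.772 * Cb          */
}

Three multiplies and a handful of adds per pixel, times every pixel
of every layer, is exactly the kind of work where a fixed-function
block or a dedicated instruction pays for itself.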

As a reference, using a C implementation to draw RLE glyphs is
faster than any MMX/NEON code when the glyph's vertical size is
above 20 pixels. I don't know what other kind of light compression
we could do to improve the available memory bandwidth.

That seems quite specific - you're probably blitting GUI widgets to
the screen? I would leave the widget rendering task to the CPU (via
the graphics toolkit library), which would then just share the
frame with the display server - if there are no changes in the
widgets, the frame doesn't change and the display server just
renders the same frame again on VSYNC. Spoiler: "wayland".

Yes, that's already the case in most toolkits. Here I am talking
about the source of the pixels used in the blit, not the
destination. Any sane toolkit will try hard to reduce the area it
has to draw, but also reducing the amount of input data does help,
especially on devices with huge DPI. You actually want some light
compression technique while reading the data you are blitting to
the screen.

You usually do two 32-bit reads and one 32-bit write. In the case
of a glyph, which by nature is only an alpha mask, you instead do
an 8-bit read, a 32-bit read and another 32-bit write. The 32-bit
read and write from the texture, if done correctly, are likely to
be cached, but the 8-bit one is quite unlikely to be. It is the RLE
encoding of those 8-bit values that pays off enough while rendering
text that you can even use C instead of ASM to implement it. This
RLE encoding pays off on glyphs because they are composed of large
areas of the same value with only a small ramp when transitioning
from one to the other. That's not something you can apply to every
source of pixels.
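
To make that concrete, blitting one row of an RLE-encoded 8-bit
alpha glyph could look roughly like this in C. The run format here
(a length byte followed by an alpha byte) is invented purely for
illustration, it is not any particular toolkit's actual encoding:

#include <stdint.h>

/* Naive per-channel blend: dst = src*a + dst*(255 - a). */
static inline uint32_t blend_argb(uint32_t src, uint32_t dst, uint8_t a)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xff;
        uint32_t d = (dst >> shift) & 0xff;
        out |= (((s * a + d * (255 - a)) / 255) & 0xff) << shift;
    }
    return out;
}

/* Blit one glyph row given as 'nruns' (length, alpha) pairs over an
 * ARGB destination, with a single glyph color. Transparent runs are
 * skipped and opaque runs are plain fills, so only the short
 * anti-aliased ramps pay the full read-modify-write blend. */
static void blit_rle_row(const uint8_t *rle, int nruns,
                         uint32_t color, uint32_t *dst)
{
    for (int i = 0; i < nruns; i++) {
        uint8_t len = rle[2 * i];
        uint8_t a   = rle[2 * i + 1];

        if (a == 0) {
            dst += len;
        } else if (a == 255) {
            for (int p = 0; p < len; p++)
                *dst++ = color;
        } else {
            for (int p = 0; p < len; p++, dst++)
                *dst = blend_argb(color, *dst, a);
        }
    }
}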

The only other graphical component that is likely to be compressed
is the background, which is likely to be a JPEG. I am not sure how
you could sanely get the JPEG macro blocks into the cache and do
inline expansion while reading them on the CPU. Anyway, this was
just to point out that light compression does help a lot; if we can
have either an instruction or some trick with the MMU, it will
definitely pay off.

That's definitely an interesting observation. I just shared my
thoughts that such RLE can be more efficient with somewhat longer
sequences (larger areas of solid color), typical for GUI widgets,
where it would probably bring the biggest improvement.

Regards,
Nikolay