
Re: [f-cpu] Status quo



Hi Cedric,

On 03/30/2015 11:30 PM, Cedric BAIL wrote:
Hello,

On Mon, Mar 30, 2015 at 9:38 PM,  <whygee@xxxxxxxxx> wrote:
On 2015-03-30 14:21, Nicolas Boulay wrote:
2015-03-30 13:42 GMT+02:00 <whygee@xxxxxxxxx>:
Another kind of pure HW accelerator is JPEG/MPEG block DCT
"accelerator".

That's a subject I have been playing with for a decade now while writing
a 2D toolkit. I have yet to find a setup where using one of those
JPEG blocks was actually useful. In the best-case scenario they save a
little bit of battery, but the pain of maintaining the software is way
bigger than the little win they provide (especially when you make a
huge effort to open a file once and share it across processes). Even
Intel with their libva doesn't really help much. It's more something
that you use for fun than for any real reason. Note that this is only
true for still images; as soon as you start having an animation/movie,
a dedicated block makes sense and is actually useful...

Exactly. MJPEG streams can benefit more from such hardware, and so can
handling JPEG images in a browser on a portable media device. I also
agree that we should always measure and verify whether the accelerator
block delivers the required performance; otherwise there's no point in
having it.

But you still implement many codecs in software, as new codecs appear
almost every day and you need to decode them. That's where Intel CPUs
have a serious lead over ARM ones, as you can find optimized assembly
code for almost any new codec. Providing blocks that are easy to use
from the CPU for any codec is useful. Whether it is an instruction or
an "external" block is not really the issue here; if you can easily
detect it and use it, that will be fine.

Well, I'm not entirely sure. I used to work with Renesas SH-4A and Freescale
IMX5/IMX6 on QNX & Linux, where 3rd-party suppliers delivered
already-optimized libraries, and Intel was almost non-existent in that
specific market (automotive infotainment). But I agree with the second
part: driver and library interfaces are the proper place to make such
abstractions. The Linux DMA API is an excellent example of this - all
transfers are initiated via the same API calls, yet behind them there can
be hardware-specific code which controls a pool of DMA controllers, or a
pure software fallback (memcpy implemented in ASM).
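
As a rough sketch of that dispatch pattern (this is only an illustration,
not the actual Linux dmaengine interface; the names below are made up):

    /* Illustrative sketch only -- not the real Linux DMA API.
     * Callers use one entry point; the backend is either a hardware
     * DMA channel (registered by a board-specific driver) or a plain
     * memcpy() fallback when no controller is present. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    typedef bool (*dma_copy_fn)(void *dst, const void *src, size_t len);
    static dma_copy_fn hw_dma_copy;   /* NULL when no controller exists */

    void dma_set_backend(dma_copy_fn fn) { hw_dma_copy = fn; }

    /* single API call used by all clients */
    void dma_copy(void *dst, const void *src, size_t len)
    {
        if (hw_dma_copy && hw_dma_copy(dst, src, len))
            return;               /* hardware path took the transfer */
        memcpy(dst, src, len);    /* pure software fallback */
    }

Clients only ever call dma_copy(); whether a controller actually does the
work stays a platform detail.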

For the "crypto" stuff, I guess that AES and CRC are the basis
things you need. I would argue that helper for ECC is also very
useful this day ( https://tools.ietf.org/html/rfc5656 ). Them being
accessible from outside of the kernel space is an absolute must
(That's why instruction are usually better suited for that task).

Linux CAAM and cryptodev provide such a device abstraction.
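
For the instruction side of Cedric's point, a minimal sketch: on x86 with
SSE4.2 the CRC32C primitive is reachable directly from user space through
compiler intrinsics (build with -msse4.2), with no driver or syscall
involved:

    /* CRC32C via the SSE4.2 crc32 instruction, usable directly from
     * user space. Uses the iSCSI convention of inverting the CRC at
     * the start and the end. Build with -msse4.2. */
    #include <nmmintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    uint32_t crc32c(uint32_t crc, const unsigned char *buf, size_t len)
    {
        crc = ~crc;
        while (len--)
            crc = _mm_crc32_u8(crc, *buf++);
        return ~crc;
    }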

Actually, one of the things that I would have loved to see many times
is a way to tell the MMU to map a memory area as tiles: the process
uses it linearly, but the physical memory is spread block by block over
a surface. I am probably not being clear here, but the main purpose
is to be able to swizzle memory at no cost before uploading it to
the GPU. You would just allocate the memory using a specific
syscall, and that memory, which appears linear to the process, could
directly be used by a GPU texture unit. You would be able to
decompress a JPEG directly into memory, for example, and the GPU could
use that memory with no extra cost and the best possible
frame rate. As an additional benefit, you could use that same texture
with a software renderer and benefit from better cache locality. Of
course, that requires cooperation with the GPU block...

Isn't this what Linux DRI2 is already doing?
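
Either way, to make the "swizzle" part concrete, here is a minimal sketch
of a block-linear layout (the tile size is picked arbitrarily; real GPUs
use their own formats). The MMU trick Cedric describes would let the
process keep writing with linear (x, y) addressing while the physical
pages already follow a layout like this:

    /* Minimal block-linear ("tiled") addressing sketch.
     * Pixels inside each TILE_W x TILE_H tile are stored contiguously;
     * tiles are laid out row-major across the surface. */
    #include <stddef.h>
    #include <stdint.h>

    #define TILE_W 16
    #define TILE_H 16

    static size_t tiled_offset(uint32_t x, uint32_t y, uint32_t surf_width)
    {
        uint32_t tiles_per_row = (surf_width + TILE_W - 1) / TILE_W;
        uint32_t tile_x = x / TILE_W, in_x = x % TILE_W;
        uint32_t tile_y = y / TILE_H, in_y = y % TILE_H;
        size_t tile_index = (size_t)tile_y * tiles_per_row + tile_x;
        return tile_index * (TILE_W * TILE_H)
             + (size_t)in_y * TILE_W + in_x;
    }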

Well, in fact anything that makes it possible to increase the
available memory bandwidth is going to help here.

This is absolutely correct - memory bandwidth rules, especially in areas
where the memory is shared across multiple cores/controllers. In such
cases the system buses, bridges and arbiters are as important as the CPU
core itself if the system is to deliver high performance as a whole.

Most 2D rendering operations are just limited by that; only smooth up-
and down-scaling is CPU-intensive enough that you may be able to
use 2 cores before you saturate your memory.

Yes. And color space conversion is a nasty one too, especially when you
need to do it on all graphics surfaces arranged in several
semi-transparent layers - the number of color-converted pixels can easily
reach 2x-3x the screen resolution itself. It's better to have such
modules in hardware - they can convert multiple pixels per clock,
offloading the CPU and reducing the image latency.
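
To give a rough idea of the per-pixel cost, here is a fixed-point
full-range BT.601 YCbCr-to-RGB conversion (the variant used by JPEG);
doing this for every pixel of several layered surfaces is exactly the
kind of work a small hardware block does at several pixels per clock:

    /* Fixed-point full-range BT.601 (JFIF) YCbCr -> RGB, one pixel.
     * Coefficients are the usual 16.16 approximations of
     * 1.402, 0.344136, 0.714136 and 1.772. */
    #include <stdint.h>

    static inline uint8_t clamp_u8(int v)
    {
        return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
    }

    void ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr,
                      uint8_t *r, uint8_t *g, uint8_t *b)
    {
        int c = y, d = cb - 128, e = cr - 128;
        *r = clamp_u8(c + (( 91881 * e) >> 16));
        *g = clamp_u8(c - (( 22554 * d + 46802 * e) >> 16));
        *b = clamp_u8(c + ((116130 * d) >> 16));
    }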

As a reference, using a C implementation to draw RLE glyphs is
faster than any MMX/NEON code when the glyphs' vertical size is above
20 pixels. I don't know what other kind of light compression we could
do to improve the available memory bandwidth.

That seems quite specific - you're probably blitting GUI widgets to the
screen? I would leave the widget rendering task to the CPU (via the
graphics toolkit library) and would then just share the frame with the
display server - if there are no changes in the widgets, the frame
doesn't change and the display server just presents the same frame again
on VSYNC. Spoiler: "wayland".
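
On the RLE point itself, a minimal sketch of such a blit, assuming a
hypothetical run format of (coverage, count) byte pairs per glyph row
blended into an 8-bit destination; the bandwidth win comes from skipping
transparent runs and filling fully opaque runs in one go:

    /* Hypothetical RLE glyph blit: one row encoded as (coverage, count)
     * byte pairs, blended as white text into an 8-bit destination. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    void blit_rle_glyph_row(uint8_t *dst, const uint8_t *rle, size_t pairs)
    {
        for (size_t i = 0; i < pairs; i++) {
            uint8_t a = rle[2 * i];
            uint8_t count = rle[2 * i + 1];
            if (a == 0) {                 /* transparent run: skip */
                dst += count;
            } else if (a == 255) {        /* opaque run: plain fill */
                memset(dst, 255, count);
                dst += count;
            } else {                      /* partial coverage: blend */
                for (uint8_t j = 0; j < count; j++, dst++)
                    *dst = (uint8_t)((a * 255 + (255 - a) * *dst) / 255);
            }
        }
    }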

Also, I am not sure we want to tackle the subject of GPUs here, but I
still think it is worth looking at Vulkan
(http://en.wikipedia.org/wiki/Vulkan_%28API%29) and SPIR-V in
particular
(http://en.wikipedia.org/wiki/Standard_Portable_Intermediate_Representation).


The latter is likely to become a backend for OpenGL
and OpenCL, so maybe it makes sense to study it and take it into
account while designing any interaction with external blocks.

The GPU is a very interesting topic for me; I would be glad to see
any positive development in this area.

Kind regards,
Nikolay