
Re: [f-cpu] Status quo



On Tue, Mar 31, 2015 at 1:02 AM, Nikolay Dimitrov <picmaster@xxxxxxx> wrote:
> On 03/30/2015 11:30 PM, Cedric BAIL wrote:
>> On Mon, Mar 30, 2015 at 9:38 PM,  <whygee@xxxxxxxxx> wrote:
>>> On 2015-03-30 14:21, Nicolas Boulay wrote:
>>>> 2015-03-30 13:42 GMT+02:00 <whygee@xxxxxxxxx>:
>> For the "crypto" stuff, I guess that AES and CRC are the basic
>> things you need. I would argue that a helper for ECC is also very
>> useful these days ( https://tools.ietf.org/html/rfc5656 ). Them being
>> accessible from outside of kernel space is an absolute must
>> (that's why instructions are usually better suited for that task).
>
> Linux CAAM and cryptodev provide such device abstraction.

Not really, if I remember correctly, as they require a system call to
actually access the device. This can have a huge impact on what you
can use them for and when.
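
To illustrate why instruction-level access matters here, a minimal
user-space sketch using the x86 AES-NI intrinsics (the round keys are
assumed to be pre-expanded; the key schedule is omitted):

  /* Build with: gcc -O2 -maes aes_sketch.c */
  #include <immintrin.h>  /* AES-NI intrinsics */
  #include <stdint.h>

  /* Encrypt one 16-byte block with AES-128, given the 11 pre-expanded
   * round keys. Everything stays in user space: no syscall, no context
   * switch, so it is usable even for tiny payloads. */
  static void aes128_encrypt_block(const __m128i rk[11],
                                   const uint8_t in[16], uint8_t out[16])
  {
      __m128i b = _mm_loadu_si128((const __m128i *)in);
      b = _mm_xor_si128(b, rk[0]);              /* initial AddRoundKey */
      for (int i = 1; i < 10; i++)
          b = _mm_aesenc_si128(b, rk[i]);       /* 9 full rounds       */
      b = _mm_aesenclast_si128(b, rk[10]);      /* final round         */
      _mm_storeu_si128((__m128i *)out, b);
  }

Every cycle stays in the calling process, so even a 16-byte payload is
worth accelerating, which is no longer true once a system call sits in
the path.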

>> Actually, one of the things that I would have loved to see many
>> times is a way to tell the MMU to map a memory area as a tile: the
>> process uses it linearly, but the physical memory is spread in
>> blocks over a surface. I am pretty sure I am not being clear here,
>> but the main purpose is to be able to swizzle memory at no cost
>> before uploading it to the GPU. You would just allocate the memory
>> using a specific syscall, and that memory, which appears linear to
>> the process, could directly be used by a GPU texture unit. You
>> could, for example, decompress a JPEG directly into memory and the
>> GPU could use that memory as-is, at no cost and with the best
>> possible frame rate. As an additional benefit, you could use that
>> same texture with a software renderer and benefit from better cache
>> locality. Of course this requires cooperation with the GPU block...
>
> Isn't this what Linux DRI2 is already doing?

No, DRI is a direct path for sending GPU commands from a user space
application (it still requires interception and analysis by the
kernel before being sent to the GPU). Here I am talking about the
texture upload operation, which usually requires converting the memory
layout before uploading/exposing it to the GPU.
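
To make that conversion cost concrete, here is a rough sketch of the
kind of linear-to-tiled copy a driver has to do today before exposing
a texture to the GPU; the 4x4 tile size and the function name are
purely illustrative, not any particular GPU's actual layout:

  #include <stdint.h>
  #include <stddef.h>

  #define TILE 4  /* illustrative tile edge; real GPUs use 4/8/16... */

  /* Copy a WxH 32-bit image from linear (row-major) layout into a
   * tiled layout where each TILE x TILE block is stored contiguously.
   * Assumes w and h are multiples of TILE. */
  static void linear_to_tiled(uint32_t *dst, const uint32_t *src,
                              size_t w, size_t h)
  {
      size_t tiles_per_row = w / TILE;
      for (size_t y = 0; y < h; y++) {
          for (size_t x = 0; x < w; x++) {
              size_t tx = x / TILE, ty = y / TILE;    /* which tile     */
              size_t ox = x % TILE, oy = y % TILE;    /* offset in tile */
              size_t tile_index = ty * tiles_per_row + tx;
              dst[tile_index * TILE * TILE + oy * TILE + ox] = src[y * w + x];
          }
      }
  }

With a tiled MMU mapping, this whole extra pass over the image would
disappear: the CPU writes linearly and the pages already land in the
tiled layout the texture unit expects.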

>> Most 2D rendering operations are just limited by that; only smooth
>> up- and down-scaling is CPU-intensive enough that you may be able
>> to use 2 cores before you saturate your memory bandwidth.
>
> Yes. And color space conversion is a nasty one too, especially when
> you need to do it on all graphics surfaces arranged in several
> semi-transparent layers - the count of color-converted pixels can
> easily reach 2x-3x the screen resolution itself. It's better to have
> such modules in hardware - they can convert multiple pixels per
> clock, offloading the CPU and reducing the image latency.

Agreed, color conversion is another area where you want good
infrastructure (be it a set of instructions or an external block).
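
For scale, here is a scalar sketch of the per-pixel work in a typical
YUV (BT.601, full range) to ARGB conversion, using the usual
fixed-point approximations; the helper names are just for
illustration:

  #include <stdint.h>

  /* Clamp an intermediate value to the 0..255 range of an 8-bit channel. */
  static inline uint8_t clamp_u8(int v)
  {
      return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
  }

  /* Convert one YUV pixel (BT.601, full range) to packed ARGB using
   * 8.8 fixed-point coefficients. */
  static uint32_t yuv_to_argb(uint8_t y, uint8_t u, uint8_t v)
  {
      int c = y, d = u - 128, e = v - 128;
      int r = (256 * c + 359 * e + 128) >> 8;
      int g = (256 * c -  88 * d - 183 * e + 128) >> 8;
      int b = (256 * c + 454 * d + 128) >> 8;
      return 0xFF000000u | (clamp_u8(r) << 16) | (clamp_u8(g) << 8)
                         | clamp_u8(b);
  }

Multiplied by the 2x-3x of the screen resolution per frame described
above, that is a lot of multiplies and clamps for a general-purpose
core, which is the argument for dedicated instructions or an external
block.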

>> As a reference, using a C implementation to draw RLE glyphs is
>> faster than any MMX/NEON code when the glyph's vertical size is
>> above 20 pixels. I don't know what other kinds of light compression
>> we could use to improve the available memory bandwidth.
>
> That seems quite specific - you're probably blitting GUI widgets to
> the screen? I would leave the widget rendering task to the CPU (via
> the graphics toolkit library), which would then just share the frame
> with the display server - if there are no changes in the widgets, the
> frame doesn't change and the display server just renders the same
> frame again on VSYNC. Spoiler: "wayland".

Yes, that's already the case in most toolkits. Here I am talking about
the source of the pixels used in the blit, not the destination. Any
sane toolkit will try hard to reduce the area it has to draw, but
reducing the amount of input data helps too, especially on devices
with very high DPI. You actually want some light compression technique
applied while reading the data you are blitting to the screen.

You usually do two 32-bit reads and one 32-bit write per pixel. In the
case of a glyph, which is by nature only an alpha mask, you start with
an 8-bit read, then a 32-bit read and another 32-bit write. The 32-bit
read and write, if done correctly, are likely to be cached, but the
other one is quite unlikely to be. It is the RLE encoding of those
8 bits that pays off enough while rendering text that you can even use
C instead of assembly to implement it. This RLE encoding pays off on
glyphs because they are composed of large areas of the same value with
only a small ramp when transitioning from one to the other. That's not
something you can apply to every source of pixels.
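
As a sketch of what such an RLE alpha blit can look like in plain C
(the run format and names here are hypothetical, not the actual
toolkit code), assuming (coverage, length) byte pairs and a solid text
color blended src-over into an ARGB destination:

  #include <stdint.h>
  #include <stddef.h>

  /* One RLE run of a glyph row: 'len' pixels that all share the same
   * 8-bit coverage value. Long runs of 0 (empty) and 255 (solid) are
   * what makes the encoding pay off. */
  struct rle_run {
      uint8_t coverage;
      uint8_t len;
  };

  /* Blend a solid ARGB color over one destination row, driven by RLE
   * coverage runs. Fully transparent runs are skipped entirely, so
   * they cost neither the 8-bit read nor the 32-bit read/write. */
  static void blit_rle_row(uint32_t *dst, const struct rle_run *runs,
                           size_t nruns, uint32_t color)
  {
      for (size_t i = 0; i < nruns; i++) {
          uint32_t a = runs[i].coverage;
          if (a == 0) {                 /* empty run: just skip ahead */
              dst += runs[i].len;
              continue;
          }
          for (uint8_t j = 0; j < runs[i].len; j++, dst++) {
              if (a == 255) { *dst = color; continue; }   /* solid run */
              /* per-channel src-over blend, 8-bit fixed point */
              uint32_t d = *dst, out = 0;
              for (int shift = 0; shift < 32; shift += 8) {
                  uint32_t sc = (color >> shift) & 0xFF;
                  uint32_t dc = (d >> shift) & 0xFF;
                  uint32_t oc = (sc * a + dc * (255 - a)) / 255;
                  out |= oc << shift;
              }
              *dst = out;
          }
      }
  }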

The only other graphical component that is likely to be compressed is
the background, which is likely to be a JPEG. I am not sure how you
could sanely get the JPEG macroblocks into the cache and do inline
expansion while the CPU reads them. Anyway, this was just to point out
that light compression does help a lot; if we can get either an
instruction or some trick with the MMU for it, it will definitely pay
off.
-- 
Cedric BAIL