Re: [pygame] Dirty rect overlapping optimizations

I think a function that accepts a sequence of tuples in the form of (dest_surface, dest_position, source_area, blend_mode) would be enough. It is only needed as a power-user optimization and I don't seem much value in watering it down or splitting it into multiple functions for beginners. No need to overthink and complicate it. Cython may help sprite rendering in Groups, but it would have to be implemented and tested.

Has anyone considered releasing the GIL during blit operations? Is it possible the release the GIL during a blit, then block when needed, in the cases of subsequent draw operations? The optimized case would be to allow math operations to execute while the display buffer is being modified, while you are iterating over sprites.

Here is my preliminary C code for a "blit_multi" function. It is called "blit_list". It only supports (dest_surface, dest_position) tuples, but it works as so far.

https://gist.github.com/bitcraft/1785face7c5684916cde

Use it like this:

sprite_image_list = [(s.image, s.rect) for s in self.sprites]

display.blit_list(sprite_image_list)

Jitblit sounds interesting, but how is that different from just using Surface.convert() with the correct flags? I may be mistaken here, but on all modern computers, using 32-bit surfaces can be optimized into a single memcpy call...bytealigned and whatnot. No need to iterate over pixel data (RLE). If that's the case, then it doesn't make sense to support legacy 8/16/24-bit image formats, and validates pygame_sdl2's position not to support them.

I don't see a way to combat overdraw, except by informing users that it is happening.

Anyway, all of this chatter is moot, if pygame moves to a hardware sprites soon.

On Mon, Mar 27, 2017 at 6:18 AM, René Dudfield <renesd@xxxxxxxxx> wrote:

The use case is to speed up naieve drawing. With things like pygame zero, where they update the whole screen no matter what you draw. Where if people just draw several images to the screen, we should be able to easily track their changes, and only update what has changed.

Yeah! Batching APIs would be a useful additions indeed. Ones which do _multi things at once. The tricky bit is to finding a set of those functions which are useful for many cases. Also extending the API without adding a million different functions here and there. fill_multi, blit_multi etc. However, what does blit_multi do? Blit one surface to multiple places? What about blit blend arguments? What about sub rects? Shouldn't there be a batch API for blitting a lot of different surfaces to different rects at once? After all this you can see how the APIs could get complicated and the number of them large. We have drawline, and drawlines, do we have blit and blits?

Relatedly, there is also a possibility of compiling the sprite classes in Cython. This may also give a decent speedup in some cases. As shown by things like psyco, and pypy speeding them up. They're loops in python after all. But I wonder if they can be compiled easily, or if they're too dynamic.

Apart from the jit work going on using libjit, it would be useful to do data aware blits. What I mean by that is by looking at an image, you can find better ways to draw it. Eg. does the image only contain 6bits worth of colour? Then an RLE blit is the fastest, and using 32bit alpha transparency is wrong. Are there large areas of the same colour (which can be more quickly drawn with fill())? Can we split it into 5 surfaces (four on the edges, one in the middle) where only the edge surfaces are drawn with the much slower alpha transparency blitters? For many games, the images don't change, so this works quite well. However this would perhaps require quite an API change (not entirely sure of that), or could likely be implemented inside the sprite classes with some trickery.

Another area is better memory alignment on a platform specific basis can give considerable gains. On the slower raspberry pi for example, 16 byte, and 32 byte aligned memory goes much faster. On many platforms this is 4 bytes, or 8 bytes. Gains of around 600MB/s ->1300MB/s have been seen. This is one area where a libjit based blitter (or one like in ANGLE) can make significant improvements. This type of change only needs to happen where we allocate memory for surfaces (or maybe people have done that already).

Finally, hardware support for things like jpeg drawing, and DMA are possibilities. For example, the raspberrypi has dedicated hardware for jpeg decoding (as does most hardware these days). This can be used to draw a jpeg from memory into a buffer, so if you are drawing photos onto the screen, then this is a very fast way to do it. However, this would require a more abstract Sprite class which loads files itself. Sprite('mybigphoto.jpg'). But interestingly (but not unexpectedly) the CPUs are outperforming the dedicated hardware, if you dedicate the CPUs to the task. Also for when the images are small. However, the older, and slower CPUs don't. https://info-beamer.com/blog/omx-jpeg-decoding-performance-vs-libjpeg-turbo But either way, keeping the jpg photos in memory allows you to draw a lot more of them, than if you decode them and use up all of the limited memory on small platforms like the Orange|Raspberry PI. So we need a Sprite('photo.jpg') style api for this reason.

I'm interested in if anyone has done any work on dirty rect optimizations for reducing over draw for now. Because that could be applied to something like pygame zero with almost no work on their part, and no work on the users part. For non-worst cases, like where there's < 50 rects, I guess simply compiling an algorithm like DR0ID has done in C/Cython will work nicely enough - if the python code isn't fast enough already. Would need some representative pygamezero examples to check this.

On Mon, Mar 27, 2017 at 3:55 AM, Leif Theden <leif.theden@xxxxxxxxx> wrote:
Just to clarify, this is an optimization to update the display surface? TBH, if DR0ID says that it isn't a good optimization, then I'll have to follow him. Thinking deeply about the issue, the would be better payoffs by finding way to not flip bits in the display surface in the first place. To say it another way, if we can inform developers that they are drawing to specific areas in the display surface several times, then it would help profile the game better and reduce extraneous draws. Android has options to detect extraneous overdraw, and I'm sure other platforms do as well.

Also, if we are going to promote pygame group rendering, a cheap 15-20% performance increase (based on my system) could be had by creating a C function that accepts a list of surfaces and their positions, then executing the blits in C. I've brought it up before, but nothing came if it. In fact, I was asked more than once if I had converted my surfaces, so maybe people just don't read threads? ¯\_(ツ)_/¯

On Sun, Mar 26, 2017 at 2:50 PM, René Dudfield <renesd@xxxxxxxxx> wrote:
Because rectangles can overlap, it's possible to reduce the amount drawing done when using them for dirty rect updating. If there's two rectangles overlapping, then we don't need to over draw the overlapping area twice. Normally the dirty rect update algorithm used just makes a bigger rectangle which contains two of the smaller rectangles. But that's not optimal.

But, as with all over draw algorithms, can it be done fast enough to be worth it?

Here's an article on the topic:
http://gandraxa.com/detect_overlapping_subrectangles.xml

jmm0 made some code here:
https://bitbucket.org/jmm0/optimize_dirty_rects/src/c2affd5450b57fc142c919de10e530e367306222/optimize_dirty_rects.py?at=default&fileviewer=file-view-default

DR0ID, also did some with tests and faster code...
https://bitbucket.org/dr0id/pyweek-games/src/1a9943ebadc6e2102db0457d17ca3e6025f6ca60/pyweek19-2014-10/practice/dirtyrects/?at=default

So far DR0ID says it's not really fast enough to be worth it. However, there are opportunities to improve it. Perhaps a more optimal algorithm, or one which uses C or Cython.

"worst case szenario with 2000 rects it takes ~0.31299996376 seconds"
If it's 20x faster in C, then that gets down to 0.016666. Still too slow perhaps.

It's not an embarrassingly parallel problem... I think. But I haven't thought on it much at all. Maybe there is a use for multi core here.

Anyone done any other work like this? Or know of some good algos for this?

cheers,