Re: [pygame] Dirty rect overlapping optimizations

Cool about the blit function. Since the rects and surfaces are held by reference, you only need to build sprite_image_list once (if you are blitting the same things every time). Additionally, I think it would likely be fast enough for another commonly requested case (blitting the same image to many different spots). It would also work nicely for sprite groups. I think it's worth some more API discussion before merging it in. We'd also need some tests and docs.

A couple of quick points.

* Hardware is sometimes slower (see the raspberrypi link I posted). That's not surprising on platforms like this, where video memory is the same speed for the CPUs and the GPU, and where there are 4 CPU cores with a modern architecture.
* Memory reads are reduced a lot when you change the pixel size from 4 bytes to 1 or 2. On modern computers memory IO is usually the bottleneck, so even with hardware acceleration, blitting less memory is faster. If you can use the same format as the screen (some bayer format) then it's often only 8-12 bits of data per LCD pixel. YUV style formats, the ones movies use, are very fast and use less memory.
* Run length encoded blits, even if not hardware accelerated, can be very quick because you read less memory.
* You can't use memcpy for blit blends, format conversions, scaling etc. Memory alignment can be done easily by allocating an extra 16 or 32 bytes and starting the pixel data at the first aligned address inside that allocation. If you do this for each scan line, even better.
* The current blit functions to the screen can be problematic with threads, since some platforms don't have thread safe display functions. On those platforms, calling them outside the main thread is not possible. There are a few other parts of SDL which aren't exactly thread safe either. A workaround is to use multiple processes (1 per core is optimal) and shared memory. That way there are no threading issues, and this is the most efficient way to do it anyway. You can create mmap'd Surfaces with frombuffer.
* There are, however, some multi-threaded blit functions floating around that use OpenMP. See the pgreloaded branches.
* The jit blit functions can mostly gain speed by taking advantage of platform optimal prefetch instructions and SIMD, but also the CPU optimal stride (how much data to process, in what sized chunks, so it fits in the cache best). The best example of this is ANGLE, which has a very advanced run time assembler (it has to, to emulate OpenGL and dx in software!). numexpr also does some of these optimizations.
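To put the bit-depth point above into rough numbers, here is a small sketch. The 1920x1080 size and the 60 FPS rate are just assumptions for illustration, and bytes_per_frame is a made-up helper, not a pygame function.

```python
def bytes_per_frame(width, height, bytes_per_pixel):
    """Bytes touched when every pixel of a surface is read (or written) once."""
    return width * height * bytes_per_pixel


# Hypothetical full-screen blit of a 1920x1080 surface at different pixel sizes.
for bpp in (4, 2, 1):
    size = bytes_per_frame(1920, 1080, bpp)
    print(f"{bpp} byte(s)/pixel: {size / 1e6:.1f} MB per frame, "
          f"{size * 60 / 1e6:.0f} MB/s at 60 FPS")
```

Halving the pixel size halves the memory traffic, regardless of which blitter moves the data.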

Another software improvement would be using the new METH_FASTCALL calling convention on Python 3.6+, which can give 15%-30% improvements in cases like calling lots of functions.

Anyway... I think there's lots of potential improvements for the raspberrypi. Some of them will benefit both software and hardware renderers.

But I'm still interested in the dirty rect handling, for the naive case.

cheers,

On Mon, Mar 27, 2017 at 4:59 PM, Leif Theden <leif.theden@xxxxxxxxx> wrote:

I think a function that accepts a sequence of tuples in the form of (dest_surface, dest_position, source_area, blend_mode) would be enough. It is only needed as a power-user optimization and I don't see much value in watering it down or splitting it into multiple functions for beginners. No need to overthink and complicate it. Cython may help sprite rendering in Groups, but it would have to be implemented and tested.

Has anyone considered releasing the GIL during blit operations? Is it possible to release the GIL during a blit, then block when needed, in the case of subsequent draw operations? The optimized case would be to allow math operations to execute while the display buffer is being modified, while you are iterating over sprites.

Here is my preliminary C code for a "blit_multi" function. It is called "blit_list". It only supports (dest_surface, dest_position) tuples, but it works so far.
https://gist.github.com/bitcraft/1785face7c5684916cde

Use it like this:

sprite_image_list = [(s.image, s.rect) for s in self.sprites]
display.blit_list(sprite_image_list)
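
For anyone who wants to try the call pattern without building the C patch, here is a pure-Python sketch of the loop that a function like this would move into C. It is not the code from the gist. RecordingSurface is a hypothetical stand-in so the example runs without pygame, and the destination only needs a blit(source, position) method, as pygame.Surface has.

```python
def blit_list(dest, blit_sequence):
    """Blit every (source, position) pair onto dest, in order.

    Pure-Python reference only; the point of a C version is to run this
    same loop without the per-call Python overhead.
    """
    for source, position in blit_sequence:
        dest.blit(source, position)


class RecordingSurface:
    """Hypothetical stand-in for a display surface; it just records blits."""

    def __init__(self):
        self.calls = []

    def blit(self, source, position):
        self.calls.append((source, position))


screen = RecordingSurface()
# Strings stand in for images here; build the list once...
sprite_image_list = [("ship", (10, 20)), ("rock", (40, 5))]
# ...and reuse it every frame.
blit_list(screen, sprite_image_list)
```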

Jitblit sounds interesting, but how is that different from just using Surface.convert() with the correct flags? I may be mistaken here, but on all modern computers, blitting 32-bit surfaces can be optimized into a single memcpy call... byte-aligned and whatnot. No need to iterate over pixel data (RLE). If that's the case, then it doesn't make sense to support legacy 8/16/24-bit image formats, and it validates pygame_sdl2's decision not to support them.

I don't see a way to combat overdraw, except by informing users that it is happening.

Anyway, all of this chatter is moot if pygame moves to hardware sprites soon.

On Mon, Mar 27, 2017 at 6:18 AM, René Dudfield <renesd@xxxxxxxxx> wrote:
The use case is to speed up naive drawing, with things like pygame zero, which update the whole screen no matter what you draw. If people just draw several images to the screen, we should be able to easily track their changes, and only update what has changed.

Yeah! Batching APIs would be useful additions indeed. Ones which do _multi things at once. The tricky bit is finding a set of those functions which are useful for many cases, and extending the API without adding a million different functions here and there: fill_multi, blit_multi etc. However, what does blit_multi do? Blit one surface to multiple places? What about blit blend arguments? What about sub rects? Shouldn't there be a batch API for blitting a lot of different surfaces to different rects at once? After all this you can see how the APIs could get complicated and the number of them large. We have draw.line and draw.lines, so do we have blit and blits?

Relatedly, there is also the possibility of compiling the sprite classes with Cython. This may also give a decent speedup in some cases, as shown by things like psyco and pypy speeding them up. They're loops in Python after all. But I wonder if they can be compiled easily, or if they're too dynamic.

Apart from the jit work going on using libjit, it would be useful to do data aware blits. What I mean by that is that by looking at an image, you can find better ways to draw it. E.g. does the image only contain 6 bits worth of colour? Then an RLE blit is the fastest, and using 32-bit alpha transparency is wrong. Are there large areas of the same colour (which can be drawn more quickly with fill())? Can we split it into 5 surfaces (four on the edges, one in the middle) where only the edge surfaces are drawn with the much slower alpha transparency blitters? For many games the images don't change, so this works quite well. However, this would perhaps require quite an API change (not entirely sure of that), or could likely be implemented inside the sprite classes with some trickery.
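
The five-surface split could be sketched like this. It assumes a fixed border width, which is a simplification; a real version would inspect the alpha channel to find where the transparent pixels actually are. split_border is a hypothetical helper, and rects are plain (x, y, w, h) tuples so the example runs without pygame.

```python
def split_border(width, height, border):
    """Split a width x height image into four edge strips and a middle.

    Returns (x, y, w, h) tuples in the order: top, bottom, left, right, middle.
    Assumes 0 <= 2 * border <= min(width, height).
    """
    inner_w = width - 2 * border
    inner_h = height - 2 * border
    top = (0, 0, width, border)
    bottom = (0, height - border, width, border)
    left = (0, border, border, inner_h)
    right = (width - border, border, border, inner_h)
    middle = (border, border, inner_w, inner_h)
    return top, bottom, left, right, middle


# Hypothetical 64x64 sprite whose alpha pixels all sit in a 4 pixel border.
regions = split_border(64, 64, 4)
edge_pixels = sum(w * h for _, _, w, h in regions[:4])
middle_pixels = regions[4][2] * regions[4][3]
```

Only edge_pixels would need the alpha blitter; the middle could go through the plain opaque blit.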

Another area is memory alignment, where platform specific choices can give considerable gains. On the slower raspberry pi for example, 16 byte and 32 byte aligned memory goes much faster. On many platforms the natural alignment is 4 bytes or 8 bytes. Gains of around 600MB/s -> 1300MB/s have been seen. This is one area where a libjit based blitter (or one like in ANGLE) can make significant improvements. This type of change only needs to happen where we allocate memory for surfaces (or maybe people have done that already).
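
The arithmetic for this is small. Below is a sketch of how a padded scan line length and a start offset could be computed; aligned_pitch and aligned_offset are hypothetical names, and this is not how SDL or pygame currently allocates surfaces.

```python
def aligned_pitch(width, bytes_per_pixel, alignment):
    """Bytes per scan line, rounded up to a multiple of alignment."""
    row_bytes = width * bytes_per_pixel
    return (row_bytes + alignment - 1) // alignment * alignment


def aligned_offset(address, alignment):
    """Bytes to skip so that address + offset is a multiple of alignment."""
    return -address % alignment


# A 101 pixel wide, 24-bit surface needs 303 bytes per row.
pitch = aligned_pitch(101, 3, 16)
# An allocation that happens to start at address 1000 (made-up number).
offset = aligned_offset(1000, 16)
```

If the buffer starts on an aligned address and the pitch is a multiple of the alignment, every scan line starts aligned too.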

Finally, hardware support for things like jpeg drawing, and DMA, are possibilities. For example, the raspberrypi has dedicated hardware for jpeg decoding (as does most hardware these days). This can be used to draw a jpeg from memory into a buffer, so if you are drawing photos onto the screen, then this is a very fast way to do it. However, this would require a more abstract Sprite class which loads files itself: Sprite('mybigphoto.jpg'). Interestingly (but not unexpectedly) the CPUs outperform the dedicated hardware if you dedicate them to the task, and also when the images are small. The older, slower CPUs don't, however. https://info-beamer.com/blog/omx-jpeg-decoding-performance-vs-libjpeg-turbo But either way, keeping the jpeg photos in memory allows you to draw a lot more of them than if you decode them and use up all of the limited memory on small platforms like the Orange Pi or Raspberry Pi. So we need a Sprite('photo.jpg') style API for this reason.

For now, I'm interested in whether anyone has done any work on dirty rect optimizations for reducing overdraw, because that could be applied to something like pygame zero with almost no work on their part, and no work on the user's part. For non-worst cases, like where there are < 50 rects, I guess simply compiling an algorithm like the one DR0ID has done in C/Cython will work nicely enough, if the Python code isn't fast enough already. We would need some representative pygame zero examples to check this.

On Mon, Mar 27, 2017 at 3:55 AM, Leif Theden <leif.theden@xxxxxxxxx> wrote:
Just to clarify, this is an optimization to update the display surface? TBH, if DR0ID says that it isn't a good optimization, then I'll have to follow him. Thinking deeply about the issue, there would be better payoffs in finding ways to not flip bits in the display surface in the first place. To say it another way, if we can inform developers that they are drawing to specific areas of the display surface several times, then it would help them profile the game better and reduce extraneous draws. Android has options to detect extraneous overdraw, and I'm sure other platforms do as well.

Also, if we are going to promote pygame group rendering, a cheap 15-20% performance increase (based on my system) could be had by creating a C function that accepts a list of surfaces and their positions, then executes the blits in C. I've brought it up before, but nothing came of it. In fact, I was asked more than once if I had converted my surfaces, so maybe people just don't read threads? ¯\_(ツ)_/¯

On Sun, Mar 26, 2017 at 2:50 PM, René Dudfield <renesd@xxxxxxxxx> wrote:
Because rectangles can overlap, it's possible to reduce the amount of drawing done when using them for dirty rect updating. If there are two rectangles overlapping, then we don't need to draw the overlapping area twice. Normally the dirty rect update algorithm just makes a bigger rectangle which contains the two smaller rectangles. But that's not optimal.
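
As a made-up example of why the bigger rectangle is not optimal, take two 100x100 rects that overlap only at one corner. Rects are plain (x, y, w, h) tuples here so this runs without pygame, and the helper names are hypothetical.

```python
def area(rect):
    """Area of an (x, y, w, h) rect."""
    return rect[2] * rect[3]


def bounding_union(a, b):
    """Smallest rect containing both a and b (the usual dirty rect merge)."""
    left = min(a[0], b[0])
    top = min(a[1], b[1])
    right = max(a[0] + a[2], b[0] + b[2])
    bottom = max(a[1] + a[3], b[1] + b[3])
    return (left, top, right - left, bottom - top)


def intersection_area(a, b):
    """Area shared by a and b, 0 if they do not overlap."""
    w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    h = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0


a = (0, 0, 100, 100)
b = (90, 90, 100, 100)

merged_pixels = area(bounding_union(a, b))                 # one big rect
separate_pixels = area(a) + area(b)                        # overlap drawn twice
exact_pixels = separate_pixels - intersection_area(a, b)   # overlap drawn once
```

In this case the single merged rect covers nearly twice as many pixels as were actually dirtied, while updating the two rects separately only wastes the small overlap.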

But, as with all over draw algorithms, can it be done fast enough to be worth it?

Here's an article on the topic:
http://gandraxa.com/detect_overlapping_subrectangles.xml

jmm0 made some code here:
https://bitbucket.org/jmm0/optimize_dirty_rects/src/c2affd5450b57fc142c919de10e530e367306222/optimize_dirty_rects.py?at=default&fileviewer=file-view-default

DR0ID also did some work on this, with tests and faster code...
https://bitbucket.org/dr0id/pyweek-games/src/1a9943ebadc6e2102db0457d17ca3e6025f6ca60/pyweek19-2014-10/practice/dirtyrects/?at=default

So far DR0ID says it's not really fast enough to be worth it. However, there are opportunities to improve it. Perhaps a more optimal algorithm, or one which uses C or Cython.
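
To make the idea concrete, here is a minimal pairwise sketch that cuts the overlap out of one rect so that nothing is updated twice. It is not the algorithm from the article, nor jmm0's or DR0ID's code, just the simplest version I can think of. Rects are plain (x, y, w, h) tuples and subtract is a hypothetical helper.

```python
def subtract(a, b):
    """Return the parts of rect a not covered by rect b, as disjoint rects."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Corners of the intersection of a and b.
    ix0, iy0 = max(ax, bx), max(ay, by)
    ix1, iy1 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    if ix0 >= ix1 or iy0 >= iy1:
        return [a]  # no overlap, a is untouched
    pieces = []
    if iy0 > ay:  # full-width strip above the overlap
        pieces.append((ax, ay, aw, iy0 - ay))
    if iy1 < ay + ah:  # full-width strip below the overlap
        pieces.append((ax, iy1, aw, ay + ah - iy1))
    if ix0 > ax:  # strip to the left, same height as the overlap
        pieces.append((ax, iy0, ix0 - ax, iy1 - iy0))
    if ix1 < ax + aw:  # strip to the right, same height as the overlap
        pieces.append((ix1, iy0, ax + aw - ix1, iy1 - iy0))
    return pieces


a = (0, 0, 100, 100)
b = (90, 90, 100, 100)
# Keep b whole and only update the parts of a that b does not cover.
update_rects = [b] + subtract(a, b)
```

Doing this for every pair in a long list is where the cost comes from, which is the "is it fast enough" question.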

"worst case szenario with 2000 rects it takes ~0.31299996376 seconds"
If it's 20x faster in C, then that gets down to about 0.0156 seconds. That is still nearly a whole frame at 60 FPS (0.0167 seconds), so still too slow perhaps.

It's not an embarrassingly parallel problem... I think. But I haven't thought on it much at all. Maybe there is a use for multi core here.

Anyone done any other work like this? Or know of some good algos for this?

cheers,