The use case is to speed up naieve drawing. With things like pygame
zero, where they update the whole screen no matter what you draw. Where
if people just draw several images to the screen, we should be able to
easily track their changes, and only update what has changed.
Yeah! Batching APIs would be a useful additions indeed. Ones which do _multi things at once. The tricky bit is to finding a set of those functions which are useful for many cases. Also extending the API without adding a million different functions here and there. fill_multi, blit_multi etc. However, what does blit_multi do? Blit one surface to multiple places? What about blit blend arguments? What about sub rects? Shouldn't there be a batch API for blitting a lot of different surfaces to different rects at once? After all this you can see how the APIs could get complicated and the number of them large. We have drawline, and drawlines, do we have blit and blits?
Relatedly, there is also a possibility of compiling the sprite
classes in Cython. This may also give a decent speedup in some cases. As shown by things like psyco, and pypy speeding them up. They're loops in python after all. But I wonder if they can be compiled easily, or if they're too dynamic.
Apart from the jit work going on using libjit, it would be useful to do data aware blits. What I mean by that is by looking at an image, you can find better ways to draw it. Eg. does the image only contain 6bits worth of colour? Then an RLE blit is the fastest, and using 32bit alpha transparency is wrong. Are there large areas of the same colour (which can be more quickly drawn with fill())? Can we split it into 5 surfaces (four on the edges, one in the middle) where only the edge surfaces are drawn with the much slower alpha transparency blitters? For many games, the images don't change, so this works quite well. However this would perhaps require quite an API change (not entirely sure of that), or could likely be implemented inside the sprite classes with some trickery.
Another area is better memory alignment on a platform specific basis can give considerable gains. On the slower raspberry pi for example, 16 byte, and 32 byte aligned memory goes much faster. On many platforms this is 4 bytes, or 8 bytes. Gains of around 600MB/s ->1300MB/s have been seen. This is one area where a libjit based blitter (or one like in ANGLE) can make significant improvements. This type of change only needs to happen where we allocate memory for surfaces (or maybe people have done that already).
Finally, hardware support for things like jpeg drawing, and DMA are possibilities. For example, the raspberrypi has dedicated hardware for jpeg decoding (as does most hardware these days). This can be used to draw a jpeg from memory into a buffer, so if you are drawing photos onto the screen, then this is a very fast way to do it. However, this would require a more abstract Sprite class which loads files itself. Sprite('mybigphoto.jpg'). But interestingly (but not unexpectedly) the CPUs are outperforming the dedicated hardware, if you dedicate the CPUs to the task. Also for when the images are small. However, the older, and slower CPUs don't.
https://info-beamer.com/blog/omx-jpeg-decoding-performance-vs-libjpeg-turbo But either way, keeping the jpg photos in memory allows you to draw a lot more of them, than if you decode them and use up all of the limited memory on small platforms like the Orange|Raspberry PI. So we need a Sprite('photo.jpg') style api for this reason.
I'm interested in if anyone has done any work on dirty rect optimizations for reducing over draw for now. Because that could be applied to something like pygame zero with almost no work on their part, and no work on the users part. For non-worst cases, like where there's < 50 rects, I guess simply compiling an algorithm like DR0ID has done in C/Cython will work nicely enough - if the python code isn't fast enough already. Would need some representative pygamezero examples to check this.