Re: [pygame] pygame with SDL2 proposal

On Mon, Mar 20, 2017 at 5:07 PM, Ian Mallett <ian@xxxxxxxxxxxxxx> wrote:

On Mon, Mar 20, 2017 at 3:52 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Ian Mallett wrote:
Per-pixel drawing operations, if they must be individually managed by the CPU, will always be much faster to do on the CPU. This means things like Surface.set_at(...) and likely draw.circle(...) as well as potentially things like draw.rect(...) and draw.line(...).

This is undoubtedly true in the case of drawing one pixel at a time,
but are you sure it's true for lines and rectangles?

On Mon, Mar 20, 2017 at 12:25 PM, Leif Theden <leif.theden@xxxxxxxxx> wrote:
Good points Ian, but I don't see why we need to support software drawing when OpenGL supports drawing primitives? Is there a compelling reason that drawing lines with the CPU is better than doing it on the GPU?
Oh yes!

Basically, it's because a GPU runs well when it has a big, parallelizable workload, and terribly when not. Flexible, small workloads, such as you see in a typical indie game or small project, are basically exactly this worst case. They are small (rendering dozens to hundreds of objects), and they are dynamic in that the objects change positions and shading according to CPU-hosted logic. Heuristic: if you've got a branch deciding where/whether to render your object or what color it should be, then the GPU hates it and you.*

If that made sense to you, you can skip this elaboration:
----

The GPU is basically a bunch of workers (thousands, nowadays) sitting in a room. When you tell the GPU to do something, you tell everyone in the room to do that same thing. Configuring the GPU to do something else (saliently: changing the shader) is slow (for technical reasons).

I have a GTX 980 sitting on my desk right now, and it has 2048 thread processors clocked at 1126 MHz. That's ****ing insane. I can throw millions and millions of triangles at it, and it laughs right back at me because it's rendering them (essentially) 2048 at a time. The fragments (≈ pixels) generated from those triangles are also rendered 2048 at a time. This is awesome, but only if you're drawing lots of triangles or shading lots of pixels in the same way (the same shader).

But I cannot change the way I'm drawing those triangles individually. Say I alternate between a red shader and a blue shader for each of a million triangles. NVIDIA guidelines tell me I'm at about 3 seconds per frame, not even counting the rendering. This is what I mean by overhead. (To work around this problem, you double the amount of work and send a color along with each vertex as data. That's just more data and the GPU can handle it easily. But reconfigure? No good.) And this is in C-like languages. In Python, you have a huge amount of software overhead for those state changes, even before you get to the PCIe bus.
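To make the state-change argument above concrete, here is a minimal pure-Python sketch. It does not talk to a GPU; it only counts how often a hypothetical "bound shader" would change for three ways of submitting the same triangles. The names `count_state_changes`, `alternating`, `sorted_by_shader`, and `color_as_data` are made up for illustration, and the triangle count is scaled down from the million in the text.

```python
def count_state_changes(draw_calls):
    """Count how often the bound shader changes across a sequence of draws.

    Each draw call is a (shader_name, payload) tuple. The very first bind
    counts as one change, since nothing is bound beforehand.
    """
    changes = 0
    bound = None  # nothing bound yet
    for shader, _payload in draw_calls:
        if shader != bound:
            changes += 1
            bound = shader
    return changes


N = 100_000  # scaled down from the million triangles in the text

# One shader per color, switching on every single triangle.
alternating = [("red" if i % 2 == 0 else "blue", i) for i in range(N)]

# Same triangles, reordered so that equal shaders are adjacent.
# (Only valid when draw order does not affect the result.)
sorted_by_shader = sorted(alternating, key=lambda call: call[0])

# The workaround from the text: a single shader, with the color
# travelling along as per-triangle data instead of as GPU state.
color_as_data = [
    ("vertex_color", (i, "red" if i % 2 == 0 else "blue")) for i in range(N)
]

print(count_state_changes(alternating))       # 100000
print(count_state_changes(sorted_by_shader))  # 2
print(count_state_changes(color_as_data))     # 1
```

The point of the sketch is only the ratio: the per-triangle version pays the reconfiguration cost N times, while the other two pay it once or twice and push the variation into data.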

And in a typical pygame project or indie game, this is basically exactly what we're trying to do. We've got sprites with individual location data and different ways of being rendered--different textures, different blend modes, etc. Only a few objects, but decent complexity in how to draw them. With a bunch of cleverness, you could conceivably write some complex code to work around this (generate work batches, abstract to an übershader, etc.), but I doubt you could (or would want to) fully abstract this away from the user--particularly in such a flexible API as pygame.

The second issue is that the PCIe bus, which is how the CPU talks to the GPU, is really slow compared to the CPU's memory subsystem--both in terms of latency and bandwidth. My lab computer has ~64 GB/s DDR4 bandwidth (my computer at home has quadruple that) at 50ns-500ns latency. By contrast, the PCIe bus tops out at 2 GB/s at ~20000ns latency. My CPU also has 15MB of L3 cache, while my 980 has no L3 cache and only 2MiB of L2 (because streaming workloads need less caching and caching is expensive).
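As a back-of-the-envelope check on those figures, the sketch below models a transfer as latency plus size divided by bandwidth. It plugs in the numbers quoted above (64 GB/s and the pessimistic 500 ns for DRAM, 2 GB/s and 20000 ns for PCIe) and ignores caches, overlap, and driver behavior entirely. The function `transfer_time_ns` and the two payload sizes are assumptions chosen for illustration.

```python
def transfer_time_ns(num_bytes, latency_ns, bandwidth_gb_per_s):
    """Crude cost model: fixed latency plus streaming time.

    With decimal gigabytes, 1 GB/s is exactly 1 byte per nanosecond,
    so bytes / (GB/s) is already a time in nanoseconds.
    """
    return latency_ns + num_bytes / bandwidth_gb_per_s


# Figures quoted in the post.
DRAM = {"latency_ns": 500.0, "bandwidth_gb_per_s": 64.0}
PCIE = {"latency_ns": 20000.0, "bandwidth_gb_per_s": 2.0}

ONE_PIXEL = 4                    # one 32-bit pixel, as in Surface.set_at(...)
FULL_FRAME = 1920 * 1080 * 4     # one 1080p frame of 32-bit pixels

for label, size in (("one pixel", ONE_PIXEL), ("full frame", FULL_FRAME)):
    dram_ns = transfer_time_ns(size, **DRAM)
    pcie_ns = transfer_time_ns(size, **PCIE)
    print(f"{label}: DRAM {dram_ns:.4f} ns, PCIe {pcie_ns:.4f} ns")
```

For a single pixel the latency term dominates completely (roughly 500 ns against roughly 20000 ns under this model); for a full frame the bandwidth term dominates instead, and the gap is set by the 32x bandwidth ratio.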

So when you draw something on the CPU, you're drawing using a fast processor (my machine: 3.5 GHz, wide vectors, long pipe) into very close DRAM at a really low latency, but it's probably cached in L3 or lower anyway. When you draw something on the GPU, you're drawing (slowly (~1 GHz, narrow vectors, short pipe), but in-parallel) into different-DRAM-which-is-optimized-more-for-streaming and which may or may not be cached at all. Similar maybe, but you also have to wait for the command to go over the PCIe bus, take any driver sync hit, spool up the GPU pipeline in the right configuration, and so on. The overhead is worth it if you're drawing a million triangles, but not if you're calling Surface.set_at(...).
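The "worth it for a million triangles, not for set_at" claim can be sketched as a break-even calculation. This is a toy model under loud assumptions: every item costs exactly one clock cycle on one lane, the CPU uses a single scalar lane, the GPU keeps all 2048 lanes busy, and the fixed per-dispatch overhead is simply the 20000 ns PCIe latency quoted earlier. Real hardware is far messier; only the clock speeds, lane count, and latency come from the post, and the function names are hypothetical.

```python
CPU_GHZ = 3.5                    # CPU clock from the post
GPU_GHZ = 1.126                  # GTX 980 clock from the post
GPU_LANES = 2048                 # thread processors from the post
DISPATCH_OVERHEAD_NS = 20_000.0  # assumption: reuse the quoted PCIe latency


def cpu_time_ns(n_items):
    """Assumed CPU cost: one item per cycle, no fixed overhead."""
    return n_items / CPU_GHZ


def gpu_time_ns(n_items):
    """Assumed GPU cost: fixed dispatch overhead, then one item per
    cycle on each of GPU_LANES lanes, perfectly parallel."""
    return DISPATCH_OVERHEAD_NS + n_items / (GPU_GHZ * GPU_LANES)


def break_even_items():
    """Item count at which both models take the same time."""
    cpu_cost = 1.0 / CPU_GHZ
    gpu_cost = 1.0 / (GPU_GHZ * GPU_LANES)
    return DISPATCH_OVERHEAD_NS / (cpu_cost - gpu_cost)


for n in (1, 1_000, 1_000_000):
    winner = "CPU" if cpu_time_ns(n) < gpu_time_ns(n) else "GPU"
    print(f"{n:>9} items: CPU {cpu_time_ns(n):.1f} ns, "
          f"GPU {gpu_time_ns(n):.1f} ns -> {winner}")

print(f"break-even at roughly {break_even_items():.0f} items")
```

Under these assumptions the CPU wins for small batches and the GPU wins for large ones; the exact crossover depends entirely on the made-up per-item costs, so treat it as an order-of-magnitude illustration rather than a measurement.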

The point is, GPUs have great parallelism, but you pay for it in latency and usability. It's a tradeoff, and when you consider all the rest you need to do on the CPU, it's not always a clear one. But, as a heuristic, lots of geometry or fillrate or math-intensive shading calls for a GPU. Anything less calls for a CPU. My argument is that the typical use-case of pygame falls, easily, into the latter.

----

*(Of course, you can make this fast at a tremendous programmer cost by emulating all that logic on the GPU using e.g. compute shaders, which is what all the cool kids are doing, or amortizing state changes with e.g. Vulkan's new command buffers. But it requires (1) being competent at GPU architecture and (2) being willing to invest the time. I still use pygame mainly because of 2.)

Also, I'm a bit tired of the "python is slow so you may as well make everything slow and not expect it to work quickly" attitude.
I was worried someone might take it that way; this isn't my point at all. What I want is for people to remember what's important.

Clearly, one should not aspire to make things slow. I'm just saying that if a game developer tries to use Python+pygame to write some crazy graphics-intensive mega-AAA game, when it fails it's really on them for picking the wrong tool. At least for now--this is what I mean when I say we need to figure out if we like our niche.

A pygame app burns through the CPU not because of the interpreter, but because it is flipping bits in RAM when a GPU could do it.
It's both of these and more. SDL's core blitting routines are in C, occasionally vectorized, IIRC, whereas as I mentioned above you have to figure in the cost of command transfers and overhead when you do operations on the GPU.

Ian