Ian Mallett wrote: Per-pixel drawing operations, if they must be individually managed by the CPU, will always be much faster to do on the CPU. This means things like Surface.set_at(...) and likely draw.circle(...) as well as potentially things like draw.rect(...) and draw.line(...).
This is undoubtedly true in the case of drawing one pixel at a time,
but are you sure it's true for lines and rectangles?
Good points Ian, but I don't see why we need to support software drawing when OpenGL supports drawing primitives. Is there a compelling reason that drawing lines with the CPU is better than doing it on the GPU?
The GPU is basically a bunch of workers (thousands, nowadays) sitting in a room. When you tell the GPU to do something, you tell everyone in the room to do that same thing. Configuring the GPU to do something else (saliently: changing the shader) is slow (for technical reasons).

I have a GTX 980 sitting on my desk right now, and it has 2048 thread processors clocked at 1126 MHz. That's ****ing insane. I can throw millions and millions of triangles at it, and it laughs right back at me because it's rendering them (essentially) 2048 at a time. The fragments (≈ pixels) generated from those triangles are also rendered 2048 at a time. This is awesome, but only if you're drawing lots of triangles or shading lots of pixels in the same way (the same shader).

But I cannot change the way I'm drawing those triangles individually. Say I alternate between a red shader and a blue shader for each of a million triangles. NVIDIA guidelines tell me I'm at about 3 seconds per frame, not even counting the rendering. This is what I mean by overhead. (To work around this problem, you double the amount of work and send a color along with each vertex as data. That's just more data and the GPU can handle it easily. But reconfigure? No good.) And this is in C-like languages. In Python, you have a huge amount of software overhead for those state changes, even before you get to the PCIe bus.

And in a typical pygame project or indie game, this is basically exactly what we're trying to do. We've got sprites with individual location data and different ways of being rendered--different textures, different blend modes, etc. Only a few objects, but decent complexity in how to draw them.
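To make the "send a color along with each vertex as data" workaround concrete, here is a minimal sketch (names and buffer layout are my own illustration, not any particular GL binding's API) of packing per-triangle colors into one interleaved vertex array, so a single draw call replaces a million shader switches:

```python
import numpy as np

# Hypothetical sketch: instead of switching between a red shader and a
# blue shader per triangle, bake the color into the vertex data itself.
NUM_TRIANGLES = 1_000_000
RED = (1.0, 0.0, 0.0)
BLUE = (0.0, 0.0, 1.0)

# Three vertices per triangle, each vertex = x, y, z, r, g, b (6 floats).
vertices = np.zeros((NUM_TRIANGLES * 3, 6), dtype=np.float32)

# Alternate red/blue per triangle as *data*, not as a GPU state change.
colors = np.where(np.arange(NUM_TRIANGLES)[:, None] % 2 == 0, RED, BLUE)
vertices[:, 3:6] = np.repeat(colors, 3, axis=0)  # same color for all 3 verts

# From here it would be one buffer upload and one draw call: the shader
# reads the color as a per-vertex attribute and never gets reconfigured.
```

The buffer is larger (three extra floats per vertex), but as the post says, more data is cheap for the GPU; reconfiguration is not.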
With a bunch of cleverness, you could conceivably write some complex code to work around this (generate work batches, abstract to an übershader, etc.), but I doubt you could (or would want to) fully abstract this away from the user--particularly in such a flexible API as pygame.

The second issue is that the PCIe bus, which is how the CPU talks to the GPU, is really slow compared to the CPU's memory subsystem--both in terms of latency and bandwidth. My lab computer has ~64 GB/s DDR4 bandwidth (my computer at home has quadruple that) at 50ns-500ns latency. By contrast, the PCIe bus tops out at 2 GB/s at ~20000ns latency. My CPU also has 15 MB of L3 cache, while my 980 has no L3 cache and only 2 MiB of L2 (because streaming workloads need less caching and caching is expensive).

So when you draw something on the CPU, you're drawing using a fast processor (my machine: 3.5 GHz, wide vectors, long pipe) into very close DRAM at a really low latency, but it's probably cached in L3 or lower anyway. When you draw something on the GPU, you're drawing (slowly (~1 GHz, narrow vectors, short pipe), but in parallel) into different-DRAM-which-is-optimized-more-for-streaming and which may or may not be cached at all. Similar maybe, but you also have to wait for the command to go over the PCIe bus, take any driver sync hit, spool up the GPU pipeline in the right configuration, and so on. The overhead is worth it if you're drawing a million triangles, but not if you're calling Surface.set_at(...).
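You can feel the per-call-overhead half of this argument without a GPU at all. Here is a rough sketch (the `set_at` function is a stand-in I wrote to mimic a per-pixel API call, not pygame's actual implementation) comparing one-pixel-per-call writes against a single batched write:

```python
import time
import numpy as np

W, H = 256, 256
buf_slow = np.zeros((H, W), dtype=np.uint32)
buf_fast = np.zeros((H, W), dtype=np.uint32)

def set_at(buf, x, y, color):
    """Stand-in for a per-pixel API call: one Python call per pixel."""
    buf[y, x] = color

# 65,536 individual calls, Surface.set_at(...)-style.
t0 = time.perf_counter()
for y in range(H):
    for x in range(W):
        set_at(buf_slow, x, y, 0xFF00FF)
per_pixel = time.perf_counter() - t0

# The same result as one batched operation.
t0 = time.perf_counter()
buf_fast[:, :] = 0xFF00FF
bulk = time.perf_counter() - t0

print(f"per-pixel calls: {per_pixel*1000:.1f} ms, bulk fill: {bulk*1000:.3f} ms")
```

And that is purely Python call overhead on the CPU; route each of those calls through a driver and over the PCIe bus and the gap only widens.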
The point is, GPUs have great parallelism, but you pay for it in latency and usability. It's a tradeoff, and when you consider all the rest you need to do on the CPU, it's not always a clear one. But, as a heuristic, lots of geometry or fillrate or math-intensive shading calls for a GPU. Anything less calls for a CPU. My argument is that the typical use-case of pygame falls, easily, into the latter.
Also, I'm a bit tired of the "python is slow so you may as well make everything slow and not expect it to work quickly" attitude.
A pygame app burns through the CPU not because of the interpreter, but because it is flipping bits in RAM when a GPU could do it.