Just a couple quick, miscellaneous notes on this problem:
--IDK if pygame does this, but I expect blitting, with overdraw, to a fullscreen temporary surface, and then sending the whole surface every frame to the GPU would be faster than updating multiple smaller portions of the GPU memory with CPU data.
--Dirty rects were never really a performance issue for me. This is partly because I usually implement the scheme above, more-or-less, partly because I avoid it algorithmically, and partly because blitting is inexpensive and typical overdraw rates are low (as they would be in a good GPU renderer too).
--IIRC the fastest algorithm for rect-rect collision detection is the max-of-mins/min-of-maxes method, which is widely known. It maps nicely to SSE too.
--I don't doubt there's no clear performance gain for current implementations, especially if the algorithm is in Python.
--There are two key cases to think about: (1) You have 1x1 pixel rects for each pixel on the screen. No overdraw, so the algorithm is complete waste. (2) You have a bunch of fullscreen rects that of course all overlap 100%.