Okay, so I’ve probably exhausted the Mandelbrot set as a source of interest, at least for the moment. However, every so often someone mentions something or other which sounds like a fun exercise in parallelisation. The latest one is Conway’s Game of Life. Like the Mandelbrot set, this is something I used to play with when I was a teenager – and like the Mandelbrot set, it’s an embarrassingly parallel problem.
As before, I’ve written a few implementations but not worked excessively hard to optimise. All my tests were performed with a game board of 1000×500 cells, displaying at one pixel per cell (partly because that was the easiest way to draw it). I found that with the slower implementations the rendering side was irrelevant to the results, but the rendering became a bottleneck with the fast algorithms, so I’ve included results both with and without rendering.
The “benchmark” code (I use the word very lightly – normal caveats around methodology apply, this was only a weekend evenings project after all) records the “current” number of frames per second roughly once per second, and also an average over the whole run. (The fast algorithm jump around in fps pretty wildly depending on what’s going on – the more naïve ones don’t really change much between a busy board and a dead one.) The starting pattern in all cases was the R-pentomino (or F-pentomino, depending on whose terminology you use) in the middle of the board.
A base class (
BytePerPixelBoard) is used by all the implementations – this represents the board with a single byte per pixel, which is really easy to work with but obviously very wasteful in terms of memory (and therefore cache). I haven’t investigated any other representations, but I wanted to get this post out fairly quickly as I have a lot of other stuff to do at the moment. (I also have a post on Java closures which I should tidy up soon, but hey…)
All the source code is available for download, of course. There are no fancy options for verifying the results or even turning off rendering – just comment out the line in
Program.BackgroundLoop which calls
It would be fairly easy to make ParallelFastInteriorBoard derive from SimpleFastInteriorBoard, and ParallelActiveBlockBoard derive from ActiveBlockBoard, but I’ve kept the implementations completely separate for the moment. That means a huge amount of duplication, of course, including a nested class in each of the “active block” implementations – the nested classes are exactly the same. Oh, and they use internal fields instead of properties. I wanted the fields to be read-only, so I couldn’t use automatic properties – but I didn’t want to go to the hassle of having explicitly declared fields and separate properties. Trust me, I wouldn’t do this in production code. (There were some other internal fields, but they’ve mercifully been converted into properties now.)
This is about as simple as you can get. It fetches each cell’s current value “carefully” (i.e. coping with wrapping) regardless of where it is on the board. It works, but it’s almost painfully slow.
This is just an optimisation of SimpleBoard. We fetch the borders of the board “slowly” as per SimpleBoard, but once we’re on the interior of the board (any cell that’s not on an edge) we know we won’t need to worry about wrapping, so we can just use a fixed set of array offsets relative to the offset of the cell that’s being calculated.
This ends up being significantly faster, although it could still be optimised further. In the current code each cell is tested for whether it’s an interior cell or not. There’s a slight optimisation by remembering the results for “am I looking at the top or the bottom row” for a whole row, but by calculating just the edges and then the interior (or vice versa) a lot of tests could be avoided.
Behold the awesome power of
Parallel.For. Parallelising SimpleFastInteriorBoard literally took about a minute. I haven’t seen any toolkit other rival this in terms of ease of use. Admittedly I haven’t done much embarrassingly parallel work in many other toolkits, but even so it’s impressive. It’ll be interesting to see how easy it is to use for complex problems as well as simple ones.
If you look at the results, you’ll see this is the first point at which there’s a difference between rendering and not. As the speed of the calculation improves the constant time taken for rendering becomes more significant.
This is a departure in terms of implementation. We still use the same backing store (one big array of a byte per cell) but the board is logically broken up into blocks of 25×25 cells. (The size is hard-coded, but only as two constants – they can easily be changed. I haven’t investigated the effect of different block sizes thoroughly, but a few ad hoc tests suggests this is a reasonable sweet spot.) When considering a block in generation n, the changes (if any) in generation n-1 are considered. If the block itself changed, it may change again. If the relevant borders of any of its neighbours changed (including corners of diagonally neighbouring blocks), the cell may change. No other changes in the board can affect the block, so if none of these conditions are met the contents of generation n-1 can be copied to generation n for that block. This is obviously a massive saving when there are large areas of stable board, either dead or with stable patterns (such as 2×2 live blocks). It doesn’t help with blinkers (3×1 blocks which oscillate between vertical and horizontal alignments) or anything similar, but it’s still a big optimisation. It does mean a bit more house-keeping in terms of remembering what changed (anywhere in the cell, the four sides, and the four corners) each generation, but that pales into insignificance when you can remove the work from a lot of blocks.
Again, my implementation is relatively simple – but it still took a lot of work to get right! The code is much more complicated than SimpleFastInteriorBoard, and indeed pretty much all the code from SFIB is used in ActiveBlockBoard. Again, I’m sure there are optimisations that could be made to it, and alternative strategies for implementing it. For instance, I did toy with some code which didn’t keep track of what had changed in the block, but instead pushed changes to relevant neighbouring blocks, explicitly saying, “You need to recalculate next time!” There wasn’t much difference in speed, however, and it needed an extra setup stage in order to make the code understandable, so I’ve removed it.
The performance improvement of this is staggering – it makes it over 20 times faster than the single-threaded SimpleFastInteriorBoard. At first the improvement was much smaller – until I realised that actually I’d moved the bottleneck to the rendering, and re-benchmarked with rendering turned off.
Exactly the same implementation as ActiveBlockBoard, but using a
Parallel.ForEach loop. Again, blissfully easy to implement, and very effective. It didn’t have the same near-doubling effect of SimpleFastInteriorBoard to ParallelFastInteriorBoard (with rendering turned off) but I suspect this is because the amount of housekeeping work required to parallelise things starts becoming more important at the rates shown in the results. It’s still impressive stuff though.
|Implementation||FPS with rendering||FPS without rendering|
- The Parallel Extensions library rocks my world. Massive, massive kudos to Joe Duffy and the team.
- I have no idea why ParallelActiveBlockBoard is slower than ActiveBlockBoard when rendering is enabled. This was repeatable (with slightly different results each time, but the same trend).
- Always be aware that reducing one bottleneck may mean something else becomes your next bottleneck – at first I thought that ActiveBlockBoard was “only” 5 times faster than SimpleFastInteriorBoard; it was only when parallelisation showed no improvement (just the opposite) that I started turning rendering off.
- Changing the approach can have more pronounced effects than adding cores – moving to the “active block” model gave a massive improvement over brute force.
- Micro-optimisation has its place, when highly targeted and backed up with data – arguably the “fast interior” is a micro-optimisation, but again the improvement is quite dramatic.
- Benchmarks are fun, but shouldn’t be taken to be indicative of anything else.
- Did I mention that Parallel Extensions rocks?