Mandelbrot revisited – benchmark edition

I’ve had fun with the Mandelbrot set in this blog before, using it as an example of an embarrassingly parallelisable problem and demonstrating Parallel LINQ with it.

This morning, over breakfast, I described the problem to Christian Brunschen, a colleague of mine who has some parallelisation experience through implementing OpenMP in a portable manner. He immediately suggested a few possible changes to the way that I’ve approached things. Given the number of different attempts I’ve now had, I thought it would make sense to write a litte benchmarking app which could easily be expanded as further implementations were considered. The benchmark is available for download so you can play around with it yourself. (As is often the case with this kind of thing, it’s not the nicest code in the world for various reasons – the duplication of the sequence generation method, for example… Please don’t use it as a tutorial on actual coding and organisation.)

Benchmark implementation details

The benchmark allows you to vary the size of the generated image and the maximum number of iterations per point. The images can be displayed after the test run, but only the time taken to populate a byte array is recorded. The byte arrays are all allocated before any tests are run, and the garbage collector is invoked (as far as it can be) between tests. The images themselves are only generated after the tests have all completed. Each implementation is given a tiny “dummy run” (creating an image 10 pixels across, with a maximum of 2 iterations per point) first to hopefully remove JITting from the benchmarking times. I won’t pretend that this puts everything on a completely level playing field (benchmarking is hard) but hopefully it’s a good start. We check the similarity between the results of each test and the first one – in some cases they could be “nearly all the same” without there being a bug, due to the subtleties of floating point operations and inlining.

The base class that all the generators use takes care of working out the height from the width, remembering the various configuration options, allocating the array, and also providing a simple imperative method to obtain the value of a single point, without using anything fancy.

I originally wanted to put the whole thing in a nice UI, but after wasting nearly an hour trying to get WPF to behave, I decided to go for a simple console app. One day I really must learn how to write GUIs quickly…

Okay, enough introduction – let’s look at the implementations we’re testing, and then the results. The idea is to get a look at how a number of areas influence the results, as well as seeing some nifty ways of expressing things functionally.

SingleThreadImperativeSimple

This is the most trivial code you could write, basically. As the base class provides the calculation method, we just go through each row and column, work out the value, and store it in the array. The only attempt at optimisation is keeping the “current index” into the array rather than calculating it for every point.

 

public override void Generate()
{
    int index = 0;
    for (int row = 0; row < Height; row++)
    {
        for (int col = 0; col < Width; col++)
        {
            Data[index++] = ComputeMandelbrotIndex(row, col);
        }
    }
}

 

where ComplexMandelbrotIndex is in the base class:

 

protected byte ComputeMandelbrotIndex(int row, int col)
{
    double x = (col * SampleWidth) / Width + OffsetX;
    double y = (row * SampleHeight) / Height + OffsetY;

    double y0 = y;
    double x0 = x;

    for (int i = 0; i < MaxIterations; i++)
    {
        if (x * x + y * y >= 4)
        {
            return (byte)((i % 255) + 1);
        }
        double xtemp = x * x – y * y + x0;
        y = 2 * x * y + y0;
        x = xtemp;
    }
    return 0;
}

 

SingleThreadImperativeInline

This is largely the same code as SingleThreadImperativeSimple, but with everything inlined. Within the main body, there’s no access to anything other than local variables, and no method calls. One further optimisation is available: points which will have a value of 0 aren’t stored (we assume we’ll always start with a cleared array).

 

public override void Generate()
{
    int width = Width;
    int height = Height;
    int maxIterations = MaxIterations;
    int index = 0;
    byte[] data = Data;

    for (int row = 0; row < height; row++)
    {
        for (int col = 0; col < width; col++)
        {
            double x = (col * SampleWidth) / width + OffsetX;
            double y = (row * SampleHeight) / height + OffsetY;

            double y0 = y;
            double x0 = x;

            for (int i = 0; i < maxIterations; i++)
            {
                if (x * x + y * y >= 4)
                {
                    data[index] = (byte)((i % 255) + 1);
                    break;
                }
                double xtemp = x * x – y * y + x0;
                y = 2 * x * y + y0;
                x = xtemp;
            }
            // Leave data[index] = 0 by default
            index++;
        }
    }
}

 

MultiThreadUpFrontSplitImperative

This should be pretty efficient – work out how many cores we’ve got, split the work equally between them (as chunks of rows, which helps in terms of locality and just incrementing an index in the byte array each time), and then run. Wait until all the threads have finished, and we’re done. Shouldn’t be a lot of context switching required normally, and no synchronization is required. However, if some cores are busy with another process, we’ll end up context switching for no gain.

 

public override void Generate()
{
    int cores = Environment.ProcessorCount;

    int rowsPerCore = Height / cores;

    List<Thread> threads = new List<Thread>();
    for (int i = 0; i < cores; i++)
    {
        int firstRow = rowsPerCore * i;
        int rowsToCompute = rowsPerCore;
        if (i == cores – 1)
        {
            rowsToCompute = Height-(rowsPerCore*(cores-1));
        }
        Thread thread = new Thread(() =>
            {
                int index = firstRow * Width;
                for (int row = firstRow; row < firstRow + rowsToCompute; row++)
                {
                    for (int col = 0; col < Width; col++)
                    {
                        Data[index++] = ComputeMandelbrotIndex(row, col);
                    }
                }
            });
        thread.Start();
        threads.Add(thread);
    }

    threads.ForEach(thread => thread.Join());
}

 

MultiThreadRowFetching

Again, we use a fixed number of threads – but this time we start off with a queue of work – one task per row. The queue has to be synchronized every time we use it, and we also have to check whether or not it’s empty. Plenty of scope for lock contention here, particularly as the number of cores increases.

 

public override void Generate()
{
    Queue<int> rowsLeft = new Queue<int>(Enumerable.Range(0, Height));

    List<Thread> threads = new List<Thread>();
    for (int i = 0; i < Environment.ProcessorCount; i++)
    {
        Thread thread = new Thread(() =>
            {
                while (true)
                {
                    int row;
                    lock (rowsLeft)
                    {
                        if (rowsLeft.Count == 0)
                        {
                            break;
                        }
                        row = rowsLeft.Dequeue();
                    }
                    int index = row * Width;
                    for (int col = 0; col < Width; col++)
                    {
                        Data[index++] = ComputeMandelbrotIndex(row, col);
                    }
                }
            });
        thread.Start();
        threads.Add(thread);
    }

    threads.ForEach(thread => thread.Join());
}

SingleThreadLinqSimple

This is the first version of Mandelbrot generation I wrote since thinking of using LINQ, tweaked a litte to dump the output into an existing array instead of creating a new one. It’s essentially the same as SingleThreadImperativeSimple but with the for loops replaced with a query expression.

 

public override void Generate()
{
    var query = from row in Enumerable.Range(0, Height)
                from col in Enumerable.Range(0, Width)
                select ComputeMandelbrotIndex(row, col);

    int index=0;
    foreach (byte b in query)
    {
        Data[index++] = b;
    }
}

 

ParallelLinqRowByRowWithCopy

My first attempt using Parallel LINQ was somewhat disastrous due to the nature of PLINQ’s ordering. To combat this effect, I initially computed a row at a time, then copying each row into place afterwards:

 

private byte[] ComputeMandelbrotRow(int row)
{
    byte[] ret = new byte[Width];
    for (int col = 0; col < Width; col++)
    {
        ret[col] = ComputeMandelbrotIndex(row, col);
    }
    return ret;
}

public override void Generate()
{
    var query = from row in Enumerable.Range(0, Height).AsParallel(ParallelQueryOptions.PreserveOrdering)
                select ComputeMandelbrotRow(row);

    int rowStart = 0;
    foreach (byte[] row in query)
    {
        Array.Copy(row, 0, Data, rowStart, Width);
        rowStart += Width;
    }
}

 

ParallelLinqRowByRowInPlace

An obvious potential improvement is to write the data to the eventual target array as we go, to avoid creating the extra array creation and copying. At this point we have less of a purely functional solution, but it’s interesting anyway. Note that to use this idea in a query expression we have to return something even though it’s not useful. Likewise we have to make the query execute fully – so I use Max() to ensure that all the results will be computed. (I originally tried Count() but that didn’t work – presumably because the count of the results is known before the actual values are.) As we’re writing the data to the right place on each iteration, we no longer need to preserve ordering.

 

private int ComputeMandelbrotRow(int row)
{
    int index = row * Width;
    for (int col = 0; col < Width; col++)
    {
        Data[index++] = ComputeMandelbrotIndex(row, col);
    }
    return 0;
}

public override void Generate()
{
    var query = from row in Enumerable.Range(0, Height).AsParallel()
                select ComputeMandelbrotRow(row);

    query.Max();
}

 

ParallelLinqWithSequenceOfPoints

After this initial foray into Parallel LINQ, Nicholas Palladinos suggested that I could adapt the original idea in a more elegant manner by treating the sequence of points as a single input sequence, and asking Parallel LINQ to preserve the order of that whole sequence.

 

public override void Generate()
{
    var points = from row in Enumerable.Range(0, Height)
                 from col in Enumerable.Range(0, Width)
                 select new { row, col };

    var query = from point in points.AsParallel(ParallelQueryOptions.PreserveOrdering)
                select ComputeMandelbrotIndex(point.row, point.col);

    int index=0;
    foreach (byte b in query)
    {
        Data[index++] = b;
    }
}

 

SingleThreadLinqWithGenerator

My next step was to try to put the whole guts of the algorithm into the query expression, using a Complex value type and a generator to create a sequence of complex numbers generated from the starting point. This is quite a natural step, as the value is computed by just considering this infinite sequence and seeing how quickly it escapes from the circle of radius 2 on the complex plane. Aside from anything else, I think it’s a nice example of deferred execution and streaming – the sequence really would go on forever if we didn’t limit it with the Take and TakeWhile calls, but the generator itself doesn’t know it’s being limited. This version uses a sequence of points as per ParallelLinqWithSequenceOfPoints to make it easier to parallelise later.

It’s not exactly an efficient way of doing things though – there’s a huge amount of calling going on from one iterator to another. I’m actually quite surprised that the final results aren’t worse than they are.

 

public override void Generate()
{
    var points = from row in Enumerable.Range(0, Height)
                 from col in Enumerable.Range(0, Width)
                 select new { row, col };

    var query = from point in points
                // Work out the initial complex value from the row and column
                let c = new Complex((point.col * SampleWidth) / Width + OffsetX,
                                    (point.row * SampleHeight) / Height + OffsetY)
                // Work out the number of iterations
                select GenerateSequence(c, x => x * x + c).TakeWhile(x => x.SquareLength < 4)
                                                          .Take(MaxIterations)
                                                          .Count() into count
                // Map that to an appropriate byte value
                select (byte)(count == MaxIterations ? 0 : (count % 255) + 1);

    int index = 0;
    foreach (byte b in query)
    {
        Data[index++] = b;
    }
}

public static IEnumerable<T> GenerateSequence<T>(T start, Func<T, T> step)
{
    T value = start;
    while (true)
    {
        yield return value;
        value = step(value);
    }
}

 

ParallelLinqWithGenerator

From the previous implementation, parallelisation is simple.

 

public override void Generate()
{
    var points = from row in Enumerable.Range(0, Height)
                 from col in Enumerable.Range(0, Width)
                 select new { row, col };

    var query = from point in points.AsParallel(ParallelQueryOptions.PreserveOrdering)
                // Work out the initial complex value from the row and column
                let c = new Complex((point.col * SampleWidth) / Width + OffsetX,
                                    (point.row * SampleHeight) / Height + OffsetY)
                // Work out the number of iterations
                select GenerateSequence(c, x => x * x + c).TakeWhile(x => x.SquareLength < 4)
                                                          .Take(MaxIterations)
                                                          .Count() into count
                // Map that to an appropriate byte value
                select (byte)(count == MaxIterations ? 0 : (count % 255) + 1);

    int index = 0;
    foreach (byte b in query)
    {
        Data[index++] = b;
    }
}

 

SingleThreadImperativeWithComplex

Having already seen how slow the generator version was in the past, I thought it would be worth checking whether or not this was due to the use of the Complex struct – so this is a version which uses that, but is otherwise just like the very first implementation (SingleThreadImperativeSimple).

 

public override void Generate()
{
    int index = 0;
    for (int row = 0; row < Height; row++)
    {
        for (int col = 0; col < Width; col++)
        {
            Data[index++] = ComputeMandelbrotIndexWithComplex(row, col);
        }
    }
}

byte ComputeMandelbrotIndexWithComplex(int row, int col)
{
    double x = (col * SampleWidth) / Width + OffsetX;
    double y = (row * SampleHeight) / Height + OffsetY;

    Complex start = new Complex(x, y);
    Complex current = start;

    for (int i = 0; i < MaxIterations; i++)
    {
        if (current.SquareLength >= 4)
        {
            return (byte)((i % 255) + 1);
        }
        current = current * current + start;
    }
    return 0;
}

 

UnorderedParallelLinqSimple

This is where my colleague, Christian, came in. He suggested that instead of ordering the results, we just return the point as well as its value. We can then populate the array directly, regardless of the order of the results. The first version just uses an anonymous type to represent the combination:

 

public override void Generate()
{
    var query = from row in Enumerable.Range(0, Height).AsParallel()
                from col in Enumerable.Range(0, Width)
                select new { row, col, value = ComputeMandelbrotIndex(row, col) };

    foreach (var x in query)
    {
        Data[x.row * Width + x.col] = x.value;
    }
}

 

UnorderedParallelLinqWithStruct

Creating a new garbage-collected object on the heap for each point isn’t the world’s greatest idea. A very simple transformation is to use a struct instead. Without knowing the PLINQ implementation, it seems likely that the values will still end up on the heap somehow (how else could they be communicated between threads?) but I’d still expect a certain saving from this simple step.

 

public override void Generate()
{
    var query = from row in Enumerable.Range(0, Height).AsParallel()
                from col in Enumerable.Range(0, Width)
                select new Result (row, col, ComputeMandelbrotIndex(row, col));

    foreach (var x in query)
    {
        Data[x.Row * Width + x.Column] = x.Value;
    }
}

struct Result
{
    internal int row, col;
    internal byte value;

    internal int Row { get { return row; } }
    internal int Column { get { return col; } }
    internal byte Value { get { return value; } }

    internal Result(int row, int col, byte value)
    {
        this.row = row;
        this.col = col;
        this.value = value;
    }
}

 

UnorderedParallelLinqInPlace

Why bother returning the data at all? We can just write the data in place, like we did earlier with ParallelLinqRowByRowInPlace but in a more aggressive fashion (and the disadvantage of computing the index for each point). Again, we have to return a dummy value and iterate through those dummy results to force all the computations to go through.

 

public override void Generate()
{

    var query = from row in Enumerable.Range(0, Height).AsParallel()
                from col in Enumerable.Range(0, Width)
                // Note side-effect!
                select Data[row*Width + col] = ComputeMandelbrotIndex(row, col);

    // Force iteration through all results
    query.Max();
}

 

UnorderedParallelLinqInPlaceWithDelegate

UnorderedParallelLinqInPlace feels slightly nasty – we’ve clearly got a side-effect within the loop, it’s just that we know the side-effects are all orthogonal. We can make ourselves feel slightly cleaner by separating the side-effect from the main algorithm, by way of a delegate. The loop can hold its hands up and say, “I’m side-effect free if you are.” It’s not hugely satisfactory, but it’ll be interesting to see the penalty we pay for this attempt to be purer.

 

IEnumerable<T> Generate<T>(Func<int, int, T> transformation)
{
    var query = from row in Enumerable.Range(0, Height).AsParallel()
                from col in Enumerable.Range(0, Width)
                // Side-effect only if transformation contains one…
                select transformation(row, col);

    return query;
}

public override void Generate()
{
    // Transformation with side-effect
    Func<int,int,byte> transformation = (row, col) => Data[row * Width + col] = ComputeMandelbrotIndex(row, col);

    Generate(transformation).Max();
}

 

UnorderedParallelLinqInPlaceWithGenerator

We can easily combine the earlier generator code with any of these new ways of processing the results. Here we use our “algorithm in the query” approach but process the results as we go to avoid ordering issues.

 

public override void Generate()
{
    var query = from row in Enumerable.Range(0, Height).AsParallel()
                from col in Enumerable.Range(0, Width)
                // Work out the initial complex value from the row and column
                let c = new Complex((col * SampleWidth) / Width + OffsetX,
                                    (row * SampleHeight) / Height + OffsetY)
                // Work out the number of iterations
                let count = GenerateSequence(c, x => x * x + c).TakeWhile(x => x.SquareLength < 4)
                                                               .Take(MaxIterations)
                                                               .Count()
                // Map that to an appropriate byte value – and write it in place
                select Data[row * Width + col]  = (byte)(count == MaxIterations ? 0 : (count % 255) + 1);

    // Force execution
    query.Max();
}

 

ParallelForLoop

The Parallel Extensions library doesn’t just contain Parallel LINQ. It also has various other building blocks for parallelism, including a parallel for loop. This allows very simple parallelism of our very first generator – we just turn the outside loop into a parallel for loop, turning the previous inner loop into a delegate. We need to move the index variable into the outer loop so there’ll be one per row (otherwise they’d trample on each other) but that’s about it:

 

public override void Generate()
{
    Parallel.For(0, Height, row =>
    {
        int index = row * Width;
        for (int col = 0; col < Width; col++)
        {
            Data[index++] = ComputeMandelbrotIndex(row, col);
        }
    });
}

 

Results

A benchmark is nothing without results, so here they are on my dual core laptop, from three test runs. The first is the “default” settings I used to develop the benchmark – nothing hugely strenuous, but enough to see differences. I then tried a larger image with the same maximum number of iterations, then the original size with a larger number of iterations. The results are in alphabetical order because that’s how the test prints them :) Times are in milliseconds.

Implementation Width=1200; MaxIterations=200 Width=3000; MaxIterations=200 Width=1200; MaxIterations=800
MultiThreadRowFetching 380 2479 1311
MultiThreadUpFrontSplitImperative 384 2545 2088
ParallelForLoop 376 2361 1292
ParallelLinqRowByRowInPlace 378 2347 1295
ParallelLinqRowByRowWithCopy 382 2376 1288
ParallelLinqWithGenerator 4782 29752 16626
ParallelLinqWithSequenceOfPoints 549 3413 1462
SingleThreadImperativeInline 684 4352 2455
SingleThreadImperativeSimple 704 4353 2372
SingleThreadImperativeWithComplex 2795 16720 9800
SingleThreadLinqSimple 726 4522 2438
SingleThreadLinqWithGenerator 8902 52586 30075
UnorderedParalleLinqInPlace 422 2586 1317
UnorderedParallelLinqInPlaceWithDelegate 509 3093 1392
UnorderedParallelLinqInPlaceWithGenerator 5046 31657 17026
UnorderedParallelLinqSimple 556 3449 1448
UnorderedParalelLinqWithStruct 511 3227 1427

Conclusions

So, what have we learned? Well… bearing in mind that benchmarks like this are often misleading compared with real applications, etc it’s still interesting that:

  • Parallel Extensions rocks. If I were trying to include a Mandelbrot generation implementation in a production setting, I’d definitely go for the parallel for loop – it’s simple, but works just as well as anything else.
  • The micro-optimisation of SingleThreadImperativeInline really doesn’t help much, but makes the code harder to understand – just like so many micro-optimisations.
  • The “generator” form of LINQ really doesn’t perform well at all. It does parallelise pretty well, as you’d expect, but it’s just plain slow.
  • Part of the slowness is almost certainly due to the use of the Complex struct, given the results of SingleThreadImperativeWithComplex. Not sure what’s going on there.
  • The extra abstraction step from UnorderedParallelLinqInPlace to UnorderedParallelLinqInPlaceWithDelegate does have a significant impact, at least when the maximum number of iterations per point isn’t the dominant force. That doesn’t mean it would be a bad idea in production, of course – just somewhere to consider when running against performance issues.

I suspect I’ve missed some notable points though – comments?

The Beauty of Closures

Fairly soon I’m going to write a blog post comparing the different proposals under consideration for Java 7 when it comes to closures. I thought it would be worth writing some background material on it first though, so I’ve put an article on the C# in Depth site.

I’m not entirely comfortable that I’ve captured why they’re important, but the article can mature over time like a good wine. The idea is to counter the Blub Paradox (which I hadn’t come across before, but I agree with completely – it’s part of the effect I believe Steve McConnell was fighting against when talking about programming in a language).