Parallel Extensions June CTP

Either my timing is great or it’s lousy – you decide. Yesterday I posted about parallelising Conway’s Life – and today the new CTP for the Parallel Extensions library comes out! The bad news is that it meant I had to run all the tests again… the good news is that it means we can see whether or not the team’s work over the last 6 months has paid off.

Breaking change

The only breaking change I’ve seen is that AsParallel() no longer takes a ParallelQueryOptions parameter – instead, you call AsOrdered() on the value returned from AsParallel(). It was an easy change to make, and worked first time. There may well be plenty of other breaking API changes which are more significant, of course – I’m hardly using any of it.

Benchmark comparisons

One of the nice things about having the previous blog entries is that I can easily compare how things were then with how they are now. Here are the test results for the areas of the previous blog posts which used Parallel Extensions. For the Game of Life, I haven’t included the results with the rendering for the fast version, as they’re still bound by rendering (unsurprisingly).

Mandelbrot set

(Original post)

Results are in milliseconds taken to plot the whole image, so less is better. (x;y) values mean “Width=x, MaxIterations=y.”

| Description | December CTP (ms) | June CTP (ms) |
| --- | --- | --- |
| ParallelForLoop (1200;200) | 376 | 380 |
| ParallelForLoop (3000;200) | 2361 | 2394 |
| ParallelForLoop (1200;800) | 1292 | 1297 |
| ParallelLinqRowByRowInPlace (1200;200) | 378 | 393 |
| ParallelLinqRowByRowInPlace (3000;200) | 2347 | 2440 |
| ParallelLinqRowByRowInPlace (1200;800) | 1295 | 1939 |
| ParallelLinqRowByRowWithCopy (1200;200) | 382 | 411 |
| ParallelLinqRowByRowWithCopy (3000;200) | 2376 | 2484 |
| ParallelLinqRowByRowWithCopy (1200;800) | 1288 | 1401 |
| ParallelLinqWithGenerator (1200;200) | 4782 | 4868 |
| ParallelLinqWithGenerator (3000;200) | 29752 | 31366 |
| ParallelLinqWithGenerator (1200;800) | 16626 | 16855 |
| ParallelLinqWithSequenceOfPoints (1200;200) | 549 | 533 |
| ParallelLinqWithSequenceOfPoints (3000;200) | 3413 | 3290 |
| ParallelLinqWithSequenceOfPoints (1200;800) | 1462 | 1460 |
| UnorderedParalleLinqInPlace (1200;200) | 422 | 440 |
| UnorderedParalleLinqInPlace (3000;200) | 2586 | 2775 |
| UnorderedParalleLinqInPlace (1200;800) | 1317 | 1475 |
| UnorderedParallelLinqInPlaceWithDelegate (1200;200) | 509 | 514 |
| UnorderedParallelLinqInPlaceWithDelegate (3000;200) | 3093 | 3134 |
| UnorderedParallelLinqInPlaceWithDelegate (1200;800) | 1392 | 1571 |
| UnorderedParallelLinqInPlaceWithGenerator (1200;200) | 5046 | 5511 |
| UnorderedParallelLinqInPlaceWithGenerator (3000;200) | 31657 | 30258 |
| UnorderedParallelLinqInPlaceWithGenerator (1200;800) | 17026 | 19517 |
| UnorderedParallelLinqSimple (1200;200) | 556 | 595 |
| UnorderedParallelLinqSimple (3000;200) | 3449 | 3700 |
| UnorderedParallelLinqSimple (1200;800) | 1448 | 1506 |
| UnorderedParalelLinqWithStruct (1200;200) | 511 | 534 |
| UnorderedParalelLinqWithStruct (3000;200) | 3227 | 3154 |
| UnorderedParalelLinqWithStruct (1200;800) | 1427 | 1445 |

A mixed bag, but overall it looks to me like the June CTP was slightly worse than the older one. Of course, that’s assuming that everything else on my computer is the same as it was a couple of weeks ago, etc. I’m not going to claim it’s a perfect benchmark by any means. Anyway, can it do any better with the Game of Life?
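To put a rough number on “slightly worse”, the relative change for each test can be worked out from the two columns above. What follows is only a minimal sketch in Python (the tests themselves are C#, and this is not part of the benchmark code). The dictionary holds a hand-picked subset of rows from the Mandelbrot table, and the helper name `percent_change` is made up for illustration.

```python
# Timings in milliseconds as (December CTP, June CTP), copied from the
# Mandelbrot table. Only a few rows are included to keep the sketch short.
timings_ms = {
    "ParallelForLoop (1200;200)": (376, 380),
    "ParallelLinqRowByRowInPlace (1200;800)": (1295, 1939),
    "ParallelLinqWithSequenceOfPoints (3000;200)": (3413, 3290),
    "UnorderedParallelLinqInPlaceWithGenerator (3000;200)": (31657, 30258),
}


def percent_change(before, after):
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100.0


for name, (december, june) in timings_ms.items():
    change = percent_change(december, june)
    # These are timings, so a positive change means the June CTP was slower.
    verdict = "slower" if change > 0 else "faster"
    print(f"{name}: {change:+.1f}% ({verdict})")
```

For the frames-per-second figures the sign flips, of course: there a positive change is an improvement.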

Game of Life

(Original post)

Results are in frames per second, so more is better.

| Description | December CTP (fps) | June CTP (fps) |
| --- | --- | --- |
| ParallelFastInteriorBoard (rendered) | 23 | 22 |
| ParallelFastInteriorBoard (unrendered) | 29 | 28 |
| ParallelFastInteriorBoard | 508 | 592 |

Yes, ParallelFastInteriorBoard really did get that much of a speed bump, apparently. I have no idea why… which leads me to the slightly disturbing conclusion of this post:

Conclusion

The numbers above don’t mean much. At least, they’re not particularly useful because I don’t understand them. Why would a few tests become “slightly significantly worse” and one particular test get markedly better? Do I have serious benchmarking issues in terms of my test rig? (I’m sure it could be a lot “cleaner” – I didn’t think it would make very much difference.)

I’ve always been aware that off-the-cuff benchmarking like this was of slightly dubious value, at least in terms of the details. The only facts which are really useful here are the big jumps due to algorithm etc. The rest is noise which may interest Joe and the team (and hey, if it proves to be useful, that’s great) but which is beyond my ability to reason about. Modern machines are complicated beasts, with complicated operating systems and complicated platforms above them.
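One cheap way to get a feel for how much of a difference is noise is to repeat each measurement and compare the gap between the medians with the run-to-run spread. The sketch below is in Python rather than the C# of the tests, and everything in it is an assumption for illustration: the stand-in workload, the helper names `time_runs`, `summarise` and `within_noise`, and the arbitrary factor of 2 used as the threshold. It is a crude rule of thumb, not a proper statistical test.

```python
import statistics
import time


def time_runs(workload, repeats=5):
    """Call workload() `repeats` times and return each run's elapsed milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarise(samples):
    """Return (median, sample standard deviation) for a list of timings."""
    return statistics.median(samples), statistics.stdev(samples)


def within_noise(before, after, factor=2.0):
    """Crude check: is the gap between medians no bigger than `factor` times
    the larger of the two spreads? The factor is an arbitrary choice."""
    median_before, spread_before = summarise(before)
    median_after, spread_after = summarise(after)
    gap = abs(median_after - median_before)
    return gap <= factor * max(spread_before, spread_after)


def stand_in_workload():
    # Placeholder for the real thing being measured (e.g. plotting one image).
    sum(i * i for i in range(200_000))


first = time_runs(stand_in_workload)
second = time_runs(stand_in_workload)
print("first: ", summarise(first))
print("second:", summarise(second))
print("difference within noise?", within_noise(first, second))
```

Here both sets of timings come from the same code, so any gap between them is noise by construction; how large that gap turns out to be gives a baseline for judging differences between two builds.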

So ignore the numbers. Maybe the new CTP is faster for your app, maybe it’s not. It is important to know it’s out there, and that if you’re interested in parallelisation but haven’t played with it yet, this is a good time to pick it up.

4 thoughts on “Parallel Extensions June CTP”

  1. The way that I think about it is “hey, they just made a bunch of architectural changes, presumably for the better, and none of it totally screwed everything up!” Performance-tuning, at the level of resolution that would show most of the differences above, should be one of the last steps, I would think.

    There’s also this bit:

    In this CTP, PLINQ is implemented on top of the Task Parallel Library, which does not yet have thread injection in this CTP (see the related above issue #1). Some PLINQ queries require more concurrently-running tasks than the number of threads that Task Parallel Library creates, so you may observe deadlocks. Specifically, binary operators like SelectMany and Join that use the output of another PLINQ query as the second data source are likely to hit this. We have provided a workaround for this CTP: the previous implementation, which runs on top of the .NET ThreadPool, is still available. Just set the PLINQ_USE_THREADPOOL environment variable to a non-empty value and PLINQ will revert back to the ThreadPool. This setting will go away in subsequent releases.

  2. Hey Jon, I’ve been enjoying your posts. Remember that, at this point, not a lot of performance optimizations have been done to the CTPs as the focus is really on the API and functionality. So, I’m not too surprised that there are some performance anomalies (increases) in there, especially given that PLINQ switched its underlying implementation from ThreadPool-based to TPL-based.

    I’m sure we’ll see performance within Parallel Extensions really start to soar as we get closer to release :).

    Keep up the great work!

  3. My first thought with respect to the performance changes: in addition to your other observations, I’ll also add that getting good performance from incorrect code is often a lot easier than getting good performance from correct code. (This has been a classic issue with graphics card benchmarking…do you count better frame rates if they come at the expense of accurate rendering?)

    In other words, it’s entirely possible that in the course of fixing some bugs in the Parallel Extensions, certain things got slower. :)

  4. PLINQ is wonderful. I am working with Parallel Extensions. I want to exploit multicore now!
    Two books recommended for those interested in multicore development with C# 2008 and with future C# 2010:
    Concurrent Programming on Windows
    Author: Joe Duffy
    http://www.amazon.com/Concurrent-Programming-on-Windows/dp/B0015DYKI4/ref=ed_oe_k
    C# 2008 and 2005 Threaded Programming: Beginner’s Guide
    Author: Gastón C. Hillar
    http://www.amazon.com/2008-2005-Threaded-Programming-Beginners/dp/1847197108
