Category Archives: Benchmarking

Tracking down a performance hit

I’ve been following the progress of .NET Core with a lot of interest, and trying to make the Noda Time master branch keep up with it. The aim is that when Noda Time 2.0 eventually ships (apologies for the delays…) it will be compatible with .NET Core from the start. (I’d expected to be able to support netstandard1.0, but that appears to have too much missing from it. It looks like netstandard1.3 will be the actual target.)

I’ve been particularly looking forward to being able to run the Noda Time benchmarks (now using BenchmarkDotNet) to compare .NET Core on Linux with the same code on Windows. In order to make that a fair comparison, I now have two Intel NUCs, both sporting an i5-5250U and 8GB of memory.

As it happens, I haven’t got as far as running the benchmarks under .NET Core – but I am now able to run all the unit tests on both Linux and Windows, using both the net451 TFM and netcoreapp1.0.

When I did that recently, I was pretty shocked to see that (depending on which tests I ran) the tests were 6-10 times slower on Linux than on Windows, using netcoreapp1.0 in both cases. This post is a brief log of what I did to track down the problem.

Step 1: Check that there’s really a problem

Thought: Is this actually just a matter of not running the tests in a release configuration, or something similar?

Verification: I ran the tests several times, specifying -c Release on the command line to use the release build of both NodaTime.Test.dll and NodaTime.dll. Running under a debugger definitely wasn’t an issue, as this was all just done from the shell.

Additionally, I ran the tests in two ways – firstly, running the whole test suite, and secondly running with --where=cat!=Slow to avoid the few tests I’ve got which are known to be really pretty slow. They’re typically tests which compare the answers the BCL gives with the answers Noda Time gives, across the whole of history for a particular calendar system or time zone. I’m pleased to report that the bottleneck in these tests is almost always the BCL, but that doesn’t help to speed them up. If only the “slow” tests had been much slower on Linux, that might have pointed to the problems being in BCL calendar or time zone code.

The ratios vary, but there was enough of a problem under both circumstances for it to be worth looking further.

Step 2: Find a problematic test

I didn’t have very strong expectations one way or another about whether this would come down to some general problem in the JIT on Linux, or whether there might be one piece of code causing problems in some tests but not others. Knowing that there are significant differences in handling of some culture and time zone code between the Linux and Windows implementations, I wanted to find a test which used the BCL as little as possible – but which was also slow enough for the differences in timing to be pronounced and not easily explicable by the problems of measuring small amounts of time.

Fortunately, NUnit produces a TestResult.xml file which is easy to parse with LINQ to XML, so I could easily transform the results from Windows and Linux into a list of tests, ordered by duration (descending), and spot the right kind of test.
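
For reference, the query was along these lines. This is a sketch from memory of the NUnit 3 result format; the "test-case", "fullname" and "duration" names are assumptions about that format rather than anything guaranteed here.

// Sketch only: assumes NUnit 3's TestResult.xml, with "test-case" elements
// carrying "fullname" and "duration" attributes (duration in seconds).
using System;
using System.Linq;
using System.Xml.Linq;

class SlowestTests
{
    static void Main()
    {
        var doc = XDocument.Load("TestResult.xml");
        var slowest = doc.Descendants("test-case")
            .Select(tc => new
            {
                Name = (string) tc.Attribute("fullname"),
                Duration = (double) tc.Attribute("duration")
            })
            .OrderByDescending(tc => tc.Duration)
            .Take(20);

        foreach (var test in slowest)
        {
            Console.WriteLine("{0}: {1}s", test.Name, test.Duration);
        }
    }
}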

I found my answer in UmAlQuraYearMonthDayCalculatorTest.GetYearMonthDay_DaysSinceEpoch, which effectively tests the Um Al Qura calendar for self-consistency, by iterating over every day in the supported time period and checking that we can convert from “days since Unix epoch” to an expected “year, month, day”. In particular, this test doesn’t rely on the Windows implementation of the calendar, nor does it use any time zones, cultures or anything similar. It’s nicely self-contained.

This test took 2051ms on Linux and 295ms on Windows. It’s possible that those figures were from a debug build, but I repeated the tests using a release build and confirmed that the difference was still similar.

Step 3: Find the bottleneck

At this point, my aim was to try to remove bits of the test at a time, until the difference went away. I expected to find something quite obscure causing the difference – something like different CPU cache behaviour. I knew that the next step would be to isolate the problem to a small piece of code, but I expected that it would involve a reasonable chunk of Noda Time – at least a few types.

I was really lucky here – the first and most obvious call to remove made a big difference: the equality assertion. Assertions are usually the first thing to remove in tests, because everything else typically builds something that you use in the assertions… if you’re making a call without either using the result later or asserting something about the result, presumably you’re only interested in side effects.

As soon as I removed the call to Assert.AreEqual(expected, actual), the execution time dropped massively on Linux, but hardly moved on Windows: they were effectively on a par.

I wondered whether the problem was with the fact that I was asserting equality between custom structs, and so tried replacing the real assertions with assertions of equality of strings, then of integers. No significant difference – they all showed the same discrepancy between Windows and Linux.

Step 4: Remove Noda Time

Once I’d identified the assertions as the cause of the problem, it was trivial to start a new test project with no dependency on Noda Time, consisting of a test like this:

[Test]
public void Foo()
{
    for (int i = 0; i < 1000000; i++)
    {
        var x = 10;
        var y = 10;
        Assert.AreEqual(x, y);
    }
}

This still demonstrated the problem consistently, and allowed simpler experimentation with different assertions.

Step 5: Dig into NUnit

For once in my life, I was glad that a lot of implementation details of a framework were exposed publicly. I was able to try lots of different “bits” of asserting equality, in order to pin down the problem. Things I tried:

  • Assert.AreEqual(x, y): slow
  • Assert.That(x, Is.EqualTo(y)): slow
  • Constructing an NUnitEqualityComparer: fast
  • Calling NUnitEqualityComparer.AreEqual: fast. (Here the construction occurred before the loop, and the comparisons were in the loop.)
  • Calling Is.EqualTo(y): slow

The last two bullets were surprising. I’d been tipped off that NUnitEqualityComparer uses reflection, which could easily differ in performance between Windows and Linux… but checking for equality seemed to be fast, and just constructing the constraint was slow. In poking around the NUnit source code (thank goodness for Open Source!) it’s obvious why Assert.AreEqual(x, y) and Assert.That(y, Is.EqualTo(x)) behave the same way – the former just calls the latter.

So, why is Is.EqualTo(y) slow (on Linux)? The method itself is simple – it just creates an instance of EqualConstraint. The EqualConstraint constructor body doesn’t do much… so I proved that it’s not EqualConstraint causing the problem by deriving my own constraint with a no-op implementation of ApplyTo… sure enough, just constructing that is slow.

That leaves the constructor of the Constraint abstract base class:

protected Constraint(params object[] args)
{
    Arguments = args;

    DisplayName = this.GetType().Name;
    if (DisplayName.EndsWith("`1") || DisplayName.EndsWith("`2"))
        DisplayName = DisplayName.Substring(0, DisplayName.Length - 2);
    if (DisplayName.EndsWith("Constraint"))
        DisplayName = DisplayName.Substring(0, DisplayName.Length - 10);
}

That looks innocuous enough… but maybe calling GetType().Name is expensive on Linux. So test that… nope, it’s fast.

At this point I’m beginning to wonder whether we’ll ever get to the bottom of it, but let’s just try…

[Test]
public void EndsWith()
{
    string text = "abcdefg";
    for (int i = 0; i < Iterations; i++)
    {
        text.EndsWith("123");
    }
}

… and sure enough, it’s fast on Windows and slow on Linux. Wow. Looks like we have a culprit.

Step 6: Remove NUnit

At this point, it’s relatively plain sailing. We can reproduce the issue in a simple console app. I won’t list the code here, but it’s in the GitHub issue. It just times calling EndsWith once (to get it JIT compiled) and then a million times. Is it the most rigorous benchmark in the world? Absolutely not… but when the difference is between 5.3s on Linux and 0.16s on Windows, on the same hardware, I’m not worried about inaccuracy of a few milliseconds here or there.
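
The code in the GitHub issue is the definitive version, but the shape of it is roughly this (a sketch rather than the exact repro):

using System;
using System.Diagnostics;

class EndsWithBenchmark
{
    static void Main()
    {
        string text = "abcdefg";
        // Call it once first so that JIT compilation isn't included in the timing.
        text.EndsWith("123");

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < 1000000; i++)
        {
            text.EndsWith("123");
        }
        sw.Stop();
        Console.WriteLine("Elapsed: {0}ms", sw.ElapsedMilliseconds);
    }
}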

Step 7: File a CoreCLR issue

So, as I’ve shown, I filed a bug on GitHub. I’d like to think it was a pretty good bug report:

  • Details of the environment
  • Short but complete console app ready to copy/paste/compile/run
  • Results

Exactly the kind of thing I’d have put into a Stack Overflow question – when I ask for a minimal, complete example on Stack Overflow, this is what I mean.

Anyway, about 20 minutes later (!!!), Stephen Toub has basically worked out the nub of it: it’s a culture issue. Initially, he couldn’t reproduce it – he saw the same results on Windows and Linux. But changing his culture to en-GB, he saw what I was seeing. I then confirmed the opposite – when I ran the code having set LANG=en-US, the problem went away for me. Stephen pulled Matt Ellis in, who gave more details as to what was going wrong behind the scenes.
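
If you want to see the culture-sensitivity without touching environment variables, you can force a culture in code. A rough sketch, assuming a framework where CultureInfo.CurrentCulture is settable (.NET Core, or .NET 4.6+):

using System;
using System.Diagnostics;
using System.Globalization;

class CultureSensitivity
{
    static void Main()
    {
        // Assumption: CultureInfo.CurrentCulture has a setter on the framework in use
        // (.NET Core / .NET 4.6+); on older frameworks, set Thread.CurrentThread.CurrentCulture.
        TimeEndsWith(new CultureInfo("en-US"));
        TimeEndsWith(new CultureInfo("en-GB"));
    }

    static void TimeEndsWith(CultureInfo culture)
    {
        CultureInfo.CurrentCulture = culture;
        string text = "abcdefg";
        text.EndsWith("123"); // Warm-up so JIT compilation isn't timed

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < 1000000; i++)
        {
            text.EndsWith("123");
        }
        sw.Stop();
        Console.WriteLine("{0}: {1}ms", culture.Name, sw.ElapsedMilliseconds);
    }
}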

Step 8: File an NUnit issue

Matt Ellis suggested filing an issue against NUnit, as there’s no reason this code should be culture-sensitive. By specifying the string comparison as Ordinal, we can go through an even faster path than using the US culture. So

if (DisplayName.EndsWith("Constraint"))

becomes

if (DisplayName.EndsWith("Constraint", StringComparison.Ordinal))

… and the equivalent for the other two calls.

I pointed out in the issue that it was also a little bit odd that this was being worked out in every Constraint constructor call, when of course it’s going to give the same result for every instance of the same type. When “every Constraint constructor call” becomes “every assertion in an entire test run”, it’s a pretty performance-critical piece of code. While unit tests aren’t important in terms of performance in the same way that production code is, anything which adds friction is bad news.

Hopefully the NUnit team will apply the simple improvement for the next release, and then the CoreCLR team can attack the tougher underlying problem over time.

Step 9: Blog about it

Open up StackEdit, start typing: “I’ve been following the progress”… :)

Conclusion

None of the steps I’ve listed here is particularly tricky. Diagnosing problems is often more a matter of determination and being unwilling to admit defeat than cleverness. (I’m not denying that there’s a certain art to being able to find the right seam to split the problem in two, admittedly.)

I hope this has been useful as a “start to finish” example of what a diagnostic session can look and feel like. It wasn’t one physical session, of course – I found bits of time to investigate it over the course of a day or so – but it would have been the same steps either way.

Smug, satisfied smile…

Array covariance: not just ugly, but slow too

It seems to be quite a long time since I’ve written a genuine "code" blog post. Time to fix that.

This material may well be covered elsewhere – it’s certainly not terrifically original, and I’ve been meaning to post about it for a long time. In particular, I remember mentioning it at CodeMash in 2012. Anyway, the time has now come.

Refresher on array covariance

Just as a bit of background before we delve into the performance aspect, let me remind you what array covariance is, and when it applies. The basic idea is that C# allows a reference conversion from type TDerived[] to type TBase[], so long as:

  • TDerived and TBase are both reference types (potentially interfaces)
  • There’s a reference conversion from TDerived to TBase (so either TDerived is the same as TBase, or a subclass, or an implementing class etc)

Just to remind you about reference conversions, those are conversions from one reference type to another, where the result (on success) is never a reference to a different object. To quote section 6.1.6 of the C# 5 spec:

Reference conversions, implicit or explicit, never change the referential identity of the object being converted. In other words, while a reference conversion may change the type of the reference, it never changes the type or value of the object being referred to.

So as a simple example, there’s a reference conversion from string to object, therefore there’s a reference conversion from string[] to object[]:

string[] strings = new string[10];
object[] objects = strings;

// strings and objects now refer to the same object

There is not a reference conversion between value type arrays, so you can’t use the same code to convert from int[] to object[].

The nasty part is that every store operation into a reference type array now has to be checked at execution time for type safety. So to extend our sample code very slightly:

string[] strings = new string[10]; 
object[] objects = strings;

objects[0] = "string"; // This is fine
objects[0] = new Button(); // This will fail

The last line here will fail with an ArrayTypeMismatchException, to avoid storing a Button reference in a String[] object. When I said that every store operation has to be checked, that’s a slight exaggeration: in theory, if the compile-time type is an array with an element type which is a sealed class, the check can be avoided as it can’t fail.

Avoiding array covariance

I would rather arrays weren’t covariant in the first place, but there’s not a lot that can be done about that now. However, we can work around this, if we really need to. We know that value type arrays are not covariant… so how about we use a value type array instead, even if we want to store reference types?

All we need is a value type which can store the reference type we need – which is dead easy with a wrapper type:

public struct Wrapper<T> where T : class
{
    private readonly T value;
    public T Value { get { return value; } }
    
    public Wrapper(T value)
    {
        this.value = value;
    }
    
    public static implicit operator Wrapper<T>(T value)
    {
        return new Wrapper<T>(value);
    }
}

Now if we have a Wrapper<string>[], we can’t assign that to a Wrapper<object>[] variable – the types are incompatible. If that feels a bit clunky, we can put the array into its own type:

public sealed class InvariantArray<T> where T : class
{
    private readonly Wrapper<T>[] array;
    
    public InvariantArray(int size)
    {
        array = new Wrapper<T>[size];
    }
    
    public T this[int index]
    {
        get { return array[index].Value; }
        set { array[index] = value; }
    }
}

Just to clarify, we now only have value type arrays, but ones where each value is a plain wrapper for a reference. We now can’t accidentally violate type-safety at compile-time, and the CLR doesn’t need to validate write operations.
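
A quick sketch of what that buys us:

// Sketch: the covariant conversion is simply unavailable for the wrapped arrays.
Wrapper<string>[] wrappedStrings = new Wrapper<string>[10];

// The next line doesn't compile: Wrapper<T> is a value type, so there's no
// reference conversion between the two array types.
// Wrapper<object>[] wrappedObjects = wrappedStrings;

// Writing still looks natural, thanks to the implicit conversion from T to Wrapper<T>.
wrappedStrings[0] = "hello";
string value = wrappedStrings[0].Value;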

There’s no memory overhead here – aside from the type information at the start, I’d actually expect the contents of a Wrapper<T>[] to be indistinguishable from a T[] in memory.

Benchmarking

So, how does it perform? I’ve written a small console app to test it. You can download the full code, but the gist of it is that we use a stopwatch to measure how long it takes to either repeatedly write to an array, or repeatedly read from an array (validating that the value read is non-null, just to prove that we’ve really read something). I’m hoping I haven’t fallen foul of any of the various mistakes in benchmarking which are so easy to make.
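
The downloadable program is the real benchmark; the heart of each measurement is roughly like this (a sketch for just one of the four scenarios listed below, with illustrative constants):

using System;
using System.Diagnostics;

class CovarianceBenchmark
{
    const int Size = 100;
    const int Iterations = 100000000;

    static void Main()
    {
        string[] array = new string[Size];

        // Repeatedly write to every element.
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations / Size; i++)
        {
            for (int j = 0; j < Size; j++)
            {
                array[j] = "value";
            }
        }
        sw.Stop();
        Console.WriteLine("Write: {0}ms", sw.ElapsedMilliseconds);

        // Repeatedly read from every element, checking for null to prove we read something.
        sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations / Size; i++)
        {
            for (int j = 0; j < Size; j++)
            {
                if (array[j] == null)
                {
                    throw new Exception("Unexpected null");
                }
            }
        }
        sw.Stop();
        Console.WriteLine("Read: {0}ms", sw.ElapsedMilliseconds);
    }
}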

The test tries four scenarios:

  • object[] (but still storing strings)
  • string[]
  • Wrapper<string>[]
  • InvariantArray<string>

Running against an array size of 100, with 100 million iterations per test, I get the following results on my Thinkpad Twist:

Array type Read time (ms) Write time (ms)
object[] 11842 44827
string[] 12000 40865
Wrapper<string>[] 11843 29338
InvariantArray<string> 11825 32973

That’s just one run, but the results are fairly consistent across runs. The one interesting deviation is the write time for object[] – I’ve observed it sometimes being the same as for string[], but not consistently. I don’t understand this, but it does seem that the JIT isn’t performing the optimization for string[] that it could if it spotted that string is sealed.

Both of the workarounds to avoid array covariance make a noticeable difference to the performance of writing to the array, without affecting read performance. Hooray!

Conclusion

I think it would be a very rare application which noticed a significant performance boost here, but I do like the fact that this is one of those situations where a cleaner design also leads to better performance, without many obvious technical downsides.

That said, I doubt that I’ll actually be using this in real code any time soon – the fact that it’s just "different" to normal C# code is a big downside in itself. Hope you found it interesting though :)

Optimization and generics, part 2: lambda expressions and reference types

As with almost any performance work, your mileage may vary (in particular the 64-bit JIT may work differently) and you almost certainly shouldn’t care. Relatively few people write production code which is worth micro-optimizing. Please don’t take this post as an invitation to make code more complicated for the sake of irrelevant and possibly mythical performance changes.

It took me a surprisingly long time to find the problem described in the previous blog post, and almost no time at all to fix it. I understood why it was happening. This next problem took a while to identify at all, but even when I’d found a workaround I had no idea why it worked. Furthermore, I couldn’t reproduce it in a test case… because I was looking for the wrong set of triggers. I’ve now found at least some of the problem though.

This time the situation in Noda Time is harder to describe, although it concerns the same area of code. In various places I need to create new delegates containing parsing steps and add them to the list of steps required for a full parse. I can always use lambda expressions, but in many cases I’ve got the same logic repeatedly… so I decided to pull it out into a method. Bang – suddenly the code runs far slower. (In reality, I’d performed this refactoring first, and "unrefactored" it to speed things up.)

I think the problem comes down to method group conversions with generic methods and a type argument which is a reference type. The CLR isn’t very good at them, and the C# compiler uses them more than it needs to.

Show me the benchmark!

The complete benchmark code is available of course, but fundamentally I’m doing the same thing in each test case: creating a delegate of type Action which does nothing, and then checking that the delegate reference is non-null (just to avoid the JIT optimizing it away). In each case this is done in a generic method with a single type parameter. I call each method in two ways: once with int as the type argument, and once with string as the type argument. Here are the different cases involved:

  • Use a lambda expression: Action foo = () => {};
  • Fake what I expected the compiler to do: keep a separate generic cache class with a static variable for the delegate; populate the cache once if necessary, and thereafter use the cache field
  • Fake what the compiler is actually doing with the lambda expression: write a separate generic method and perform a method group conversion to it
  • Do what the compiler could do: write a separate non-generic method and perform a method group conversion to it
  • Use a method group conversion to a static (non-generic) method on a generic type
  • Use a method group conversion to an instance (non-generic) method on a generic type, via a generic cache class with a single field in referring to an instance of the generic class

(Yes, the last one is a bit convoluted – but the line in the method itself is simple: Action foo = ClassHolder<T>.SampleInstance.NoOpInstance;)

Remember, we’re doing each of these in a generic method, and calling that generic method using a type argument of either int or string. (I’ve run a few tests, and the exact type isn’t important – all that matters is that int is a value type, and string is a reference type.)

Importantly, we’re not capturing any variables, and the type parameter is not involved in either the delegate type or any part of the implementation body.
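
As a sketch of what the first two bullets look like in code (the names here are mine, not the ones in the real benchmark):

using System;

class DelegateCreationSketch
{
    static void Main()
    {
        // Call each case with both a value type and a reference type argument.
        TestLambda<int>();
        TestLambda<string>();
        TestGenericCache<int>();
        TestGenericCache<string>();
    }

    // Case 1: a lambda expression inside a generic method.
    static void TestLambda<T>()
    {
        Action foo = () => {};
        if (foo == null) { throw new Exception("Delegate was null"); }
    }

    // Case 2: what I expected the compiler to do, roughly speaking: a generic
    // cache class holding one delegate per constructed type, created at most once.
    static class ActionCache<T>
    {
        internal static readonly Action NoOp = () => {};
    }

    static void TestGenericCache<T>()
    {
        Action foo = ActionCache<T>.NoOp;
        if (foo == null) { throw new Exception("Delegate was null"); }
    }
}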

Benchmark results

Again, times are in milliseconds – but this time I didn’t want to run it for 100 million iterations, as the "slow" versions would have taken far too long. I’ve run this on the x64 JIT as well and seen the same effect, but I haven’t included the figures here.

Times in milliseconds for 10 million iterations

Test TestCase<int> TestCase<string>
Lambda expression 180 29684
Generic cache class 90 288
Generic method group conversion 184 30017
Non-generic method group conversion 178 189
Static method on generic type 180 29276
Instance method on generic type 202 299

Yes, it’s about 150 times slower to create a delegate from a generic method with a reference type as the type argument than with a value type… and yet this is the first I’ve heard of this. (I wouldn’t be surprised if there were a post from the CLR team about it somewhere, but I don’t think it’s common knowledge by any means.)

Conclusion

One of the tricky things is that it’s hard to know exactly what the C# compiler is going to do with any given lambda expression. In fact, the method which was causing me grief earlier on isn’t generic, but it’s in a generic type and captures some variables which use the type parameters – so perhaps that’s causing a generic method group conversion somewhere along the way.

Noda Time is a relatively extreme case, but if you’re using delegates in any performance-critical spots, you should really be aware of this issue. I’m going to ping Microsoft (first informally, and then via a Connect report if that would be deemed useful) to see if there’s an awareness of this internally as potential "gotcha", and whether there’s anything that can be done. Normal trade-offs of work required vs benefit apply, of course. It’s possible that this really is an edge case… but with lambdas flying everywhere these days, I’m not sure that it is.

Maybe tomorrow I’ll actually be able to finish getting Noda Time moved onto the new system… all of this performance work has been a fun if surprising distraction from the main job of shipping working code…

Optimization and generics, part 1: the new() constraint (updated: now with CLR v2 results)

As with almost any performance work, your mileage may vary (in particular the 64-bit JIT may work differently) and you almost certainly shouldn’t care. Relatively few people write production code which is worth micro-optimizing. Please don’t take this post as an invitation to make code more complicated for the sake of irrelevant and possibly mythical performance changes.

I’ve been doing quite a bit of work on Noda Time recently – and have started getting my head round all the work that James Keesey has put into the parsing/formatting. I’ve been reworking it so that we can do everything without throwing any exceptions, and also to work on the idea of parsing a pattern once and building a sequence of actions for both formatting and parsing from the pattern. To format or parse a value, we then just need to apply the actions in turn. Simples.

Given that this is all in the name of performance (and I consider Noda Time to be worth optimizing pretty hard) I was pretty cross when I ran a complete revamp through the little benchmarking tool we use, and found that my rework had made everything much slower. Even parsing a value after parsing the pattern was slower than parsing both the value and the pattern together. Something was clearly very wrong.

In fact, it turns out that at least two things were very wrong. The first (the subject of this post) was easy to fix and actually made the code a little more flexible. The second (the subject of the next post, which may be tomorrow) is going to be harder to work around.

The new() constraint

In my SteppedPattern type, I have a generic type parameter – TBucket. It’s already constrained in terms of another type parameter, but that’s irrelevant as far as I’m aware. (After today though, I’m taking very little for granted…) The important thing is that before I try to parse a value, I want to create a new bucket. The idea is that bits of information end up in the bucket as they’re being parsed, and at the very end we put everything together. So each parse operation requires a new bucket. How can we create one in a nice generic way?

Well, we can just call its public parameterless constructor. I don’t mind the types involved having such a constructor, so all we need to do is add the new() constraint, and then we can call new TBucket():

// Somewhat simplified…
internal sealed class SteppedPattern<TBucket> : IParsePattern<TBucket>
    where TBucket : new()
{
    public ParseResult Parse(string value)
    {
        TBucket bucket = new TBucket();

        // Rest of parsing goes here
    }
}

Great! Nice and simple. Unfortunately, it turned out that that one line of code was taking 75% of the time to parse a value. Just creating an empty bucket – pretty much the simplest bit of parsing. I was amazed when I discovered that.

Fixing it with a provider

The fix is reasonably easy. We just need to tell the type how to create an instance, and we can do that with a delegate:

// Somewhat simplified…
internal sealed class SteppedPattern<TBucket> : IParsePattern<TBucket>
{
    private readonly Func<TBucket> bucketProvider;

    internal SteppedPattern(Func<TBucket> bucketProvider)
    {
        this.bucketProvider = bucketProvider;
    }

    public ParseResult Parse(string value)
    {
        TBucket bucket = bucketProvider();

        // Rest of parsing goes here
    }
}

Now I can just call new SteppedPattern<OffsetBucket>(() => new OffsetBucket()) or whatever. This also means I can keep the constructor internal, not that I care much. I could even reuse old parse buckets if that wouldn’t be a semantic problem – in other cases it could be useful. Hooray for lambda expressions – until we get to the next post, anyway.

Show me the figures!

You don’t want to have to run Noda Time’s benchmarks to see the results for yourself, so I wrote a small benchmark to time just the construction of a generic type. As a measure of how insignificant this would be for most apps, these figures are in milliseconds, performing 100 million iterations of the action in question. Unless you’re going to do this in performance-critical code, you just shouldn’t care.

Anyway, the benchmark has four custom types: two classes, and two structs – a small and a large version of each. The small version has a single int field; the large version has eight long fields. For each type, I benchmarked both approaches to initialization.
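
The downloadable benchmark is the definitive version, but the core of each measurement looks something like this (a sketch with my own names, showing just one of the types):

using System;
using System.Diagnostics;

class ConstructionBenchmark
{
    const int Iterations = 100000000;

    class SmallClass { public int Value; }

    static void Main()
    {
        Console.WriteLine("new() constraint: {0}ms",
            TimeWithConstraint<SmallClass>());
        Console.WriteLine("Provider delegate: {0}ms",
            TimeWithProvider(() => new SmallClass()));
    }

    // Construction via the new() constraint.
    static long TimeWithConstraint<T>() where T : new()
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
        {
            T item = new T();
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    // Construction via a provider delegate.
    static long TimeWithProvider<T>(Func<T> provider)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
        {
            T item = provider();
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}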

The results on two machines (32-bit and 64-bit) are below, for both the v2 CLR and v4. (The 64-bit machine is much faster in general – you should only compare results within one machine, as it were.)

CLR v4: 32-bit results (ms per 100 million iterations)

Test type new() constraint Provider delegate
Small struct 689 1225
Large struct 11188 7273
Small class 16307 1690
Large class 17471 3017

CLR v4: 64-bit results (ms per 100 million iterations)

Test type new() constraint Provider delegate
Small struct 473 868
Large struct 2670 2396
Small class 8366 1189
Large class 8805 1529

CLR v2: 32-bit results (ms per 100 million iterations)

Test type new() constraint Provider delegate
Small struct 703 1246
Large struct 11411 7392
Small class 143967 1791
Large class 143107 2581

CLR v2: 64-bit results (ms per 100 million iterations)

Test type new() constraint Provider delegate
Small struct 510 686
Large struct 2334 1731
Small class 81801 1539
Large class 83293 1896

(An earlier version of this post had a mistake – my original tests used structs for everything, despite the names.)

Others have reported slightly different results, including the new() constraint being better for both large and small structs.

Just in case you hadn’t spotted them, look at the results for classes. Those are the real results – it took over 2 minutes to run the test using the new() constraint on my 32-bit laptop, compared with under two seconds for the provider. Yikes. This was actually the situation I was in for Noda Time, which is built on .NET 2.0 – it’s not surprising that so much of my benchmark’s time was spent constructing classes, given results like this.

Of course you can download the benchmark program for yourself and see how it performs on your machine. It’s a pretty cheap-and-cheerful benchmark, but when the differences are this big, minor sources of inaccuracy don’t bother me too much. The simplest way to run under CLR v2 is to compile with the .NET 3.5 C# compiler to start with.

What’s going on under the hood?

As far as I’m aware, there’s no IL to give support for the new() constraint, in terms of using the parameterless constructor. (The constraint itself can be expressed in IL though.) Instead, when we write new T() in C#, the compiler emits a call to Activator.CreateInstance. Apparently, that’s slower than calling a delegate – presumably due to trying to find an accessible constructor with reflection, and invoking it. I suspect it could be optimized relatively easily – e.g. by caching the results per type it’s called with, in terms of delegates. I’m slightly surprised this hasn’t (apparently) been optimized, given how easy it is to cache values by generic type. No doubt there’s a good reason lurking there somewhere, even if it’s only the memory taken up by the cache.
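
For what it’s worth, here’s roughly the kind of per-type caching I have in mind. This is purely a hypothetical sketch using expression trees (so .NET 3.5+), not a description of what the framework actually does:

using System;
using System.Linq.Expressions;

// Hypothetical sketch: cache one compiled factory delegate per closed type, so that
// repeated construction doesn't go through Activator.CreateInstance each time.
internal static class ConstructorCache<T> where T : new()
{
    internal static readonly Func<T> Factory =
        Expression.Lambda<Func<T>>(Expression.New(typeof(T))).Compile();
}

// Usage (hypothetical): TBucket bucket = ConstructorCache<TBucket>.Factory();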

Either way, it’s easy to work around in general.

Conclusion

I wouldn’t have found this gotcha if I didn’t have before and after tests (or in this case, side-by-side tests of the old way and the new way of parsing). The real lesson of this post shouldn’t be about the new() constraint – it should be how important it is to test performance (assuming you care), and how easy it is to assume certain operations are cheap.

Next post: something much weirder.

LINQ To Objects and the performance of nested “Where” calls

This post came out of this Stack Overflow question, which essentially boils down to which is better out of these two options:

var oneBigPredicate = collection.Where(x => Condition1(x)
                                         && Condition2(x)
                                         && Condition3(x));

var multiplePredicates = collection.Where(x => Condition1(x))
                                   .Where(x => Condition2(x))
                                   .Where(x => Condition3(x));

The first case is logically a single "wrapper" sequence around the original collection, with a filter which checks all three conditions (but applying short-circuiting logic, of course) before the wrapper will yield an item from the original sequence.

The second case is logically a set of concentric wrapper sequences, each applying a filter which checks a single condition. Asking the "outer" wrapper for the next item involves that one asking the "middle" wrapper for the next item, which asks the "inner" wrapper for the next item, which asks the original collection for the next item… the item is then passed "outwards" through the wrappers, with the filters being applied as we go, of course.

Now the two will achieve the same result, and I should say up-front that in most realistic cases, it won’t make a significant difference which you use. But realistic cases aren’t nearly as interesting to investigate as pathological cases, so I decided to benchmark a few different options for the second case. In particular, I wanted to find out how long it took to iterate over a query in the cases where the condition was either "always true" or "always false" – and vary the depth of nesting. Note that I’m not actually testing the first kind of query shown above… I suspect it wouldn’t be terribly interesting, at least compared with the results of the second query.

The simplest way of creating a "collection" of a large size is to use Enumerable.Repeat(), and the simplest way of iterating over the whole sequence is to just call Count() on it… so that’s what I do, timing how long the Count() call takes. (I don’t actually print out the results of Count(), but with an "always false" predicate I’ll get 0, and with an "always true" predicate I’ll get the size of the input collection.)

Here’s the sample code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Test
{
    static void Main()
    {
        int size = 10000000;
        Console.WriteLine("Always false");
        RunTests(size, x => false);
        
        Console.WriteLine("Always true");
        RunTests(size, x => true);
        
    }
    
    static void RunTests(int size, Func<string, bool> predicate)
    {
        for (int i = 1; i <= 10; i++)
        {
            RunTest(i, size, predicate);
        }
    }
    
    static void RunTest(int depth, int size, Func<string, bool> predicate)
    {
        IEnumerable<string> input = Enumerable.Repeat("value", size);
        
        for (int i = 0; i < depth; i++)
        {
            input = input.Where(predicate);
        }
        
        Stopwatch sw = Stopwatch.StartNew();
        input.Count();
        sw.Stop();
        Console.WriteLine("Depth: {0} Size: {1} Time: {2}ms",
                          depth, size, sw.ElapsedMilliseconds);
    }
}

You may notice there’s no JIT warm-up here – but I’ve tried that, and it doesn’t alter the results much. Likewise I’ve also tried with a larger size of collection, and the trend is the same – and the trend is the interesting bit.

Time to guess the results…

Before I include the results, I’ll explain what I thought would happen. I had the mental model I described before, with multiple sequences feeding each other.

When the condition is always false, I’d expected the call to MoveNext() from Count() to get all the way to the innermost filtering sequence, which would then iterate over all of the input collection, applying the filter on each item and never yielding a result. That innermost MoveNext() call returns false, and that propagates right back to Count(). All the (significant) time is spent in the innermost loop, testing and rejecting items. That shouldn’t depend on the depth of the nesting we’ve got, right? I’d expect it to be linear in terms of the size of the collection, but constant in terms of nesting. We’ll see.

When the condition is always true, we’re in a much worse situation. We need to propagate every item from the collection through all the filters, out to Count(), checking the condition and accepting the item on each check. Each MoveNext() call has to go all the way to the innermost filter, then the result has to propagate out again. That sounds like it should be roughly linear in the depth of the nesting, as well as still being linear in the size of the collection – assuming a very simplistic execution model, of course.

Before you look at the results, check that you understand my logic, and see if you can think why it might not be the case.

The actual results

Here’s what I’ve actually observed, with all times in milliseconds. The size of the collection is constant – we’re only varying the depth.

Depth Time for "always false" Time for "always true"
1 182 305
2 219 376
3 246 452
4 305 548
5 350 650
6 488 709
7 480 795
8 526 880
9 583 996
10 882 1849

There are a few things to note here:

  • The "always true" time is going up broadly linearly, as we’d expected
  • The "always false" time is also going going up broadly linearly, even though we’d expected it to be roughly constant
  • The "always false" time is still significantly better than the "always true" time, which is somewhat reassuring
  • The performance at a depth of 10 is significantly worse than at a depth of 9 in both cases

Okay, so that’s bizarre. What can possibly be making the time taken by the "inner" filtering sequence go up with the depth of the nesting? It shouldn’t know about the rest of the calls to Where – that’s not its job. Time to investigate…

Reimplementing Where

As you may have seen before, Where isn’t very hard to implement – at least it’s not hard to implement if you’re not trying to be clever. In order to mess around with things to check that my mental model was correct, I decided to run exactly the same tests again, but against the Edulinq implementation. This time, the results are very different:

Depth Time for "always false" Time for "always true"
1 232 502
2 235 950
3 256 1946
4 229 2571
5 227 3103
6 226 3535
7 226 3901
8 232 4365
9 229 4838
10 226 5219

Well look at that… suddenly our "always false" query has the expected characteristics. The "always true" query is basically linear except for the jump in time taken between a depth of 2 and a depth of 3. This may well be the same sort of discontinuity present in the earlier results, but at a different depth. (I’ll do more research on that another time, I think.)
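
For reference, the naive implementation really is just an iterator block; here’s a sketch along the lines of the Edulinq approach, with argument validation omitted for brevity:

public static class NaiveEnumerable
{
    // Naive Where: a plain iterator block, with no predicate combining and
    // (for brevity) no argument validation.
    public static IEnumerable<TSource> Where<TSource>(
        this IEnumerable<TSource> source,
        Func<TSource, bool> predicate)
    {
        foreach (TSource item in source)
        {
            if (predicate(item))
            {
                yield return item;
            }
        }
    }
}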

So if the naive implementation actually works better in some cases, what’s going wrong in the "always false" case in the real LINQ to Objects? We can work this out by taking a stack trace from a predicate. Here’s a sample query which will throw an exception:

var query = Enumerable.Repeat("value", 1)
                      .Where(x => { throw new Exception("Bang!"); })
                      .Where(x => true)
                      .Where(x => true)
                      .Where(x => true)
                      .Where(x => true)
                      .Where(x => true)
                      .Where(x => true)
                      .Where(x => true);

When you try to count that query (or do anything else which will iterate over it) you get a stack trace like this:

Unhandled Exception: System.Exception: Bang!
   at Test.<Main>b__0(String x)
   at System.Linq.Enumerable.<>c__DisplayClassf`1.<CombinePredicates>b__e(TSource x)
   at System.Linq.Enumerable.<>c__DisplayClassf`1.<CombinePredicates>b__e(TSource x)
   at System.Linq.Enumerable.<>c__DisplayClassf`1.<CombinePredicates>b__e(TSource x)
   at System.Linq.Enumerable.<>c__DisplayClassf`1.<CombinePredicates>b__e(TSource x)
   at System.Linq.Enumerable.<>c__DisplayClassf`1.<CombinePredicates>b__e(TSource x)
   at System.Linq.Enumerable.<>c__DisplayClassf`1.<CombinePredicates>b__e(TSource x)
   at System.Linq.Enumerable.<>c__DisplayClassf`1.<CombinePredicates>b__e(TSource x)
   at System.Linq.Enumerable.WhereEnumerableIterator`1.MoveNext()
   at System.Linq.Enumerable.Count[TSource](IEnumerable`1 source)
   at Test.Main()

Ooh, look at that stack – we’ve got 8 Where clauses, and 7 levels of DisplayClassf`1 – which looks like a generated class, perhaps for a lambda expression.

Between this and a helpful email from Eric Lippert, we can basically work out what’s going on. LINQ to Objects knows that it can combine two Where clauses by just constructing a compound filter. Just for kicks, let’s do the same thing – almost certainly more simplistically than LINQ to Objects – but in a way that will give the right idea:

// We could have an interface for this, of course.
public class FilteredEnumerable<T> : IEnumerable<T>
{
    private readonly IEnumerable<T> source;
    private readonly Func<T, bool> predicate;
    
    internal FilteredEnumerable(IEnumerable<T> source,
                                Func<T, bool> predicate)
    {
        this.source = source;
        this.predicate = predicate;
    }
    
    public IEnumerator<T> GetEnumerator()
    {
        foreach (T item in source)
        {
            if (predicate(item))
            {
                yield return item;
            }
        }
    }
    
    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
    
    public FilteredEnumerable<T> Where(Func<T, bool> extraPredicate)
    {
        return new FilteredEnumerable<T>(source,
                                         x => this.predicate(x) && extraPredicate(x));
    }
}

public static class Extensions
{
    public static IEnumerable<T> Where<T>(this IEnumerable<T> source,
                                          Func<T, bool> predicate)
    {
        var filtered = source as FilteredEnumerable<T>;
        return filtered == null ? new FilteredEnumerable<T>(source, predicate)
                                : filtered.Where(predicate);
    }
}

Let’s run the same tests as before, and see how we do…

Depth Time for "always false" Time for "always true"
1 240 504
2 275 605
3 332 706
4 430 797
5 525 892
6 558 980
7 638 1085
8 715 1208
9 802 1343
10 1312 2768

Oh look – it feels like the real LINQ to Objects implementation. A bit slower, certainly, but that’s okay. The way it trends is the same. In particular, it’s slower than the naive implementation for the "always false" filter, but faster for the "always true" filter.

Could things be improved?

The problem here is the creation of nested delegates. We end up with a large stack (which appears to be causing a bigger problem when it reaches a depth of 10) when really we want to just build another delegate.

The thought occurs that we could potentially use expression trees to do this. Not in a signature-compatible way with LINQ to Objects, but we should be able to combine (a => a == 3) and (b => b != 10) into (x => x == 3 && x != 10) effectively. Then when we’re asked to iterate, we just need to compile the expression tree to a delegate, and filter using that single, efficient delegate.
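
Here’s a rough sketch of the combining step, with the parameter rewriting done by an ExpressionVisitor (all the names here are mine; this isn’t code from LINQ to Objects or Edulinq):

using System;
using System.Linq.Expressions;

public static class PredicateCombiner
{
    // Combines two predicates into a single x => first(x) && second(x),
    // rewriting both bodies to share one parameter.
    public static Expression<Func<T, bool>> And<T>(
        Expression<Func<T, bool>> first,
        Expression<Func<T, bool>> second)
    {
        ParameterExpression x = Expression.Parameter(typeof(T), "x");
        Expression left = new ParameterReplacer(first.Parameters[0], x).Visit(first.Body);
        Expression right = new ParameterReplacer(second.Parameters[0], x).Visit(second.Body);
        return Expression.Lambda<Func<T, bool>>(Expression.AndAlso(left, right), x);
    }

    // Replaces every use of one parameter with another, so "a" and "b" both become "x".
    private sealed class ParameterReplacer : ExpressionVisitor
    {
        private readonly ParameterExpression from;
        private readonly ParameterExpression to;

        internal ParameterReplacer(ParameterExpression from, ParameterExpression to)
        {
            this.from = from;
            this.to = to;
        }

        protected override Expression VisitParameter(ParameterExpression node)
        {
            return node == from ? to : base.VisitParameter(node);
        }
    }
}

// Usage: combine a => a == 3 and b => b != 10 into x => x == 3 && x != 10,
// then compile it once into a single delegate for filtering.
// Expression<Func<int, bool>> combined = PredicateCombiner.And<int>(a => a == 3, b => b != 10);
// Func<int, bool> filter = combined.Compile();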

There are three problems with this:

  • It goes against the normal approach of LINQ to Objects, using delegates instead of expression trees. Heck, with expression trees throughout we could do all kinds of interesting optimizations.
  • Creating and compiling the expression tree may be more expensive than the iteration – we don’t really know. It depends on how large the collection is, etc.
  • It’s somewhat complicated to implement, because we need to rewrite the constituent expressions; note how "a" and "b" both become "x" in the example above. This is the main reason I haven’t actually bothered trying this in order to benchmark it :)

There are various other options to do with building a more efficient way of evaluating the predicates. One pretty simple (but repetitive) solution is to have one class per number of predicates, with a field per predicate – FilteredEnumerable1, FilteredEnumerable2 etc. When you get bored (e.g. FilteredEnumerable9) you construct any further ones by combining predicates as per the LINQ to Objects approach. For example, here’s an implementation of FilteredEnumerable3:

public class FilteredEnumerable3<T> : IFilteredEnumerable<T>
{
    private readonly IEnumerable<T> source;
    private readonly Func<T, bool> predicate0;
    private readonly Func<T, bool> predicate1;
    private readonly Func<T, bool> predicate2;
    
    internal FilteredEnumerable3(IEnumerable<T> source,
                                 Func<T, bool> predicate0,
                                 Func<T, bool> predicate1,
                                 Func<T, bool> predicate2)
    {
        this.source = source;
        this.predicate0 = predicate0;
        this.predicate1 = predicate1;
        this.predicate2 = predicate2;
    }
    
    public IEnumerator<T> GetEnumerator()
    {
        foreach (T item in source)
        {
            if (predicate0(item) &&
                predicate1(item) &&
                predicate2(item))
            {
                yield return item;
            }
        }
    }
    
    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
    
    public IFilteredEnumerable<T> Where(Func<T, bool> extraPredicate)
    {
        return new FilteredEnumerable4<T>(source,
            predicate0,
            predicate1,
            predicate2,
            extraPredicate);
    }
}

In this case, I’m rather pleased with the results:

Depth Time for "always false" Time for "always true"
1 237 504
2 231 551
3 231 599
4 225 625
5 228 703
6 224 787
7 222 876
8 225 966
9 232 1069
10 219 1191

Now that’s more like it. Other than a few cells, we’re actually outperforming LINQ to Objects – and still compatible with it. The only problem is that the cells where it’s slower are the common ones – the top few in the right hand column. And this was just going up to FilteredEnumerable4<T>.

I honestly don’t know why my FilteredEnumerable1<T> is slower than whatever’s in LINQ to Objects. But it’s nice to see the rest going quickly… I suspect there are some downsides as well, and basically I trust the team to make the right decisions for normal cases.

And there’s more…

You may be surprised to hear I’ve kept things simple here – as Eric mentioned to me, it’s not just Where that can be transformed in this way. Select works as well – you can combine projections together pretty easily. So imagine that we’ve effectively got this transformation:

// Normal LINQ query
var query = list.Where(x => Condition1(x))
                .Where(x => Condition2(x))
                .Select(x => Projection1(x))
                .Select(y => Projection2(y));

// After optimization
var query = list.WhereSelect(x => Condition1(x) && Condition2(x),
                             x => Projection2(Projection1(x)));

Of course at that point, my FilteredEnumerableX<T> approach becomes even more unwieldy as you have two axes – the number of predicates and the number of projections.

Conclusion

As is so often the case with performance, there’s more to Where than there first appears. When I started benchmarking this I had no idea what I was getting myself into. The great thing is that very few people would ever need to know this. It’s all hidden by the abstraction.

Still, isn’t this sort of thing fun?

There’s a hole in my abstraction, dear Liza, dear Liza

I had an interesting day at work today. I thought my code had broken… but it turns out it was just a strange corner case which made it work very slowly. Usually when something interesting happens in my code it’s quite hard to blog about it, because of all the confidentiality issues involved. In this case, it’s extremely easy to reproduce the oddity in an entirely vanilla manner. All we need is the Java collections API.

I have a set – a HashSet, in fact. I want to remove some items from it… and many of the items may well not exist. In fact, in our test case, none of the items in the "removals" collection will be in the original set. This sounds – and indeed is – extremely easy to code. After all, we’ve got Set<T>.removeAll to help us, right?

Let’s make this concrete, and look at a little test. We specify the size of the "source" set and the size of the "removals" collection on the command line, and build both of them. The source set contains only non-negative integers; the removals set contains only negative integers. We measure how long it takes to remove all the elements using System.currentTimeMillis(), which isn’t the world’s most accurate stopwatch but is more than adequate in this case, as you’ll see. Here’s the code:

import java.util.*;

public class Test
{
    public static void main(String[] args)
    {
        int sourceSize = Integer.parseInt(args[0]);
        int removalsSize = Integer.parseInt(args[1]);
        
        Set<Integer> source = new HashSet<Integer>();
        Collection<Integer> removals = new ArrayList<Integer>();
        
        for (int i = 0; i < sourceSize; i++)
        {
            source.add(i);
        }
        for (int i = 1; i <= removalsSize; i++)
        {
            removals.add(-i);
        }
        
        long start = System.currentTimeMillis();
        source.removeAll(removals);
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end – start) + "ms");
    }
}

Let’s start off by giving it an easy job: a source set of 100 items, and 100 to remove:

c:\Users\Jon\Test>java Test 100 100
Time taken: 1ms

Okay, so we hadn’t expected it to be slow… clearly we can ramp things up a bit. How about a source of one million items¹ and 300,000 items to remove?

c:\Users\Jon\Test>java Test 1000000 300000
Time taken: 38ms

Hmm. That still seems pretty speedy. Now I feel I’ve been a little bit cruel, asking it to do all that removing. Let’s make it a bit easier – 300,000 source items and 300,000 removals:

c:\Users\Jon\Test>java Test 300000 300000
Time taken: 178131ms

Excuse me? Nearly three minutes? Yikes! Surely it ought to be easier to remove items from a smaller collection than the one we managed in 38ms? Well, it does all make sense, eventually. HashSet<T> extends AbstractSet<T>, which includes this snippet in its documentation for the removeAll method:

This implementation determines which is the smaller of this set and the specified collection, by invoking the size method on each. If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection. If it is so contained, it is removed from this set with the iterator’s remove method. If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set’s remove method.

Now that sounds reasonable on the surface of it – iterate through the smaller collection, check for the presence in the bigger collection. However, this is where the abstraction is leaky. Just because we can ask for the presence of an item in a large collection doesn’t mean it’s going to be fast. In our case, the collections are the same size – but checking for the presence of an item in the HashSet is O(1) whereas checking in the ArrayList is O(N)… whereas the cost of iterating is going to be the same for each collection. Basically by choosing to iterate over the HashSet and check for presence in the ArrayList, we’ve got an O(M * N) solution overall instead of an O(N) solution. Ouch. The removeAll method is making an "optimization" based on assumptions which just aren’t valid in this case.

Then fix it, dear Henry, dear Henry, dear Henry

There are two simple ways of fixing the problem. The first is to simply change the type of the collection we’re removing from. Simply changing ArrayList<Integer> to HashSet<Integer> gets us back down to the 34ms range. We don’t even need to change the declared type of removals.

The second approach is to change the API we use: if we know we want to iterate over removals and perform the lookup in source, that’s easy to do:

for (Integer value : removals)
{
    source.remove(value);
}

In fact, on my machine that performs slightly better than removeAll – it doesn’t need to check the return value of remove on each iteration, which removeAll does in order to return whether or not any items were removed. The above runs in about 28ms. (I’ve tested it with rather larger datasets, and it really is faster than the dual-hash-set approach.)

However, both of these approaches require comments in the source code to explain why we’re not using the most obvious code (a list and removeAll). I can’t complain about the documentation here – it says exactly what it will do. It’s just not obvious that you need to worry about it, until you run into such a problem.

So what should the implementation do? Arguably, it really needs to know what’s cheap in each of the collections it’s dealing with. The idea of probing for performance characteristics before you decide on a strategy is completely anathema to the clean abstraction we like to consider with frameworks like Java collections… but maybe in this case it would be a good idea.


¹ Please perform Dr Evil impression while reading this. I’m watching you through your webcam, and I’ll be disappointed if I don’t see you put your little finger to your mouth.

Revisiting randomness

Almost every Stack Overflow question which includes the words "random" and "repeated" has the same basic answer. It’s one of the most common "gotchas" in .NET, Java, and no doubt other platforms: creating a new random number generator without specifying a seed will depend on the current instant of time. The current time as measured by the computer doesn’t change very often compared with how often you can create and use a random number generator – so code which repeatedly creates a new instance of Random and uses it once will end up showing a lot of repetition.
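
In .NET (at least up to .NET Framework 4.x, where the parameterless constructor seeds from the current time), the gotcha looks something like this:

// Each iteration creates a brand new Random, seeded from the current time,
// so within one "tick" of the clock every instance produces the same sequence.
for (int i = 0; i < 10; i++)
{
    Random random = new Random();
    Console.Write("{0} ", random.Next(1, 7)); // Very likely to print the same value ten times
}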

One common solution is to use a static field to store a single instance of Random and reuse it. That’s okay in Java (where Random is thread-safe) but it’s not so good in .NET – if you use the same instance concurrently from multiple threads, you can corrupt the internal data structures.

A long time ago, I created a StaticRandom class in MiscUtil – essentially, it was just a bunch of static methods (to mirror the instance methods found in Random) wrapping a single instance of Random and locking appropriately. This allows you to just call StaticRandom.Next(1, 7) to roll a die, for example. However, it has a couple of problems:

  • It doesn’t scale well in a multi-threaded environment. When I originally wrote it, I benchmarked an alternative approach using [ThreadStatic] and at the time, locking won (at least on my computer, which may well have only had a single core).
  • It doesn’t provide any way of getting at an instance of Random, other than by using new Random(StaticRandom.Next()).

The latter point is mostly a problem because it encourages a style of coding where you just use StaticRandom.Next(…) any time you want a random number. This is undoubtedly convenient in some situations, but it goes against the idea of treating a source of randomness as a service or dependency. It makes it harder to get repeatability and to see what needs that dependency.

I could have just added a method generating a new instance into the existing class, but I decided to play with a different approach – going back to per-thread instances, but this time using the ThreadLocal<T> class introduced in .NET 4.0. Here’s the resulting code:

using System;
using System.Threading;

namespace RandomDemo
{
    /// <summary>
    /// Convenience class for dealing with randomness.
    /// </summary>
    public static class ThreadLocalRandom
    {
        /// <summary>
        /// Random number generator used to generate seeds,
        /// which are then used to create new random number
        /// generators on a per-thread basis.
        /// </summary>
        private static readonly Random globalRandom = new Random();
        private static readonly object globalLock = new object();

        /// <summary>
        /// Random number generator
        /// </summary>
        private static readonly ThreadLocal<Random> threadRandom = new ThreadLocal<Random>(NewRandom);

        /// <summary>
        /// Creates a new instance of Random. The seed is derived
        /// from a global (static) instance of Random, rather
        /// than time.
        /// </summary>
        public static Random NewRandom()
        {
            lock (globalLock)
            {
                return new Random(globalRandom.Next());
            }
        }

        /// <summary>
        /// Returns an instance of Random which can be used freely
        /// within the current thread.
        /// </summary>
        public static Random Instance { get { return threadRandom.Value; } }

        /// <summary>See <see cref="Random.Next()" /></summary>
        public static int Next()
        {
            return Instance.Next();
        }

        /// <summary>See <see cref="Random.Next(int)" /></summary>
        public static int Next(int maxValue)
        {
            return Instance.Next(maxValue);
        }

        /// <summary>See <see cref="Random.Next(int, int)" /></summary>
        public static int Next(int minValue, int maxValue)
        {
            return Instance.Next(minValue, maxValue);
        }

        /// <summary>See <see cref="Random.NextDouble()" /></summary>
        public static double NextDouble()
        {
            return Instance.NextDouble();
        }

        /// <summary>See <see cref="Random.NextBytes(byte[])" /></summary>
        public static void NextBytes(byte[] buffer)
        {
            Instance.NextBytes(buffer);
        }
    }
}

The user can still call the static Next(…) methods if they want, but they can also get at the thread-local instance of Random by calling ThreadLocalRandom.Instance – or easily create a new instance with ThreadLocalRandom.NewRandom(). (The fact that NewRandom uses the global instance rather than the thread-local one is an implementation detail really; it happens to be convenient from the point of view of the ThreadLocal<T> constructor. It wouldn’t be terribly hard to change this.)

Now it’s easy to write a method which needs randomness (e.g. to shuffle a deck of cards) and give it a Random parameter, then call it using the thread-local instance:

public void Shuffle(Random rng)
{
    …
}

deck.Shuffle(ThreadLocalRandom.Instance);

The Shuffle method is then easier to test and debug, and expresses its dependency explicitly.
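As a concrete (and entirely hypothetical) example of such a method, here’s a minimal sketch of an in-place Fisher-Yates shuffle which takes its source of randomness as an explicit parameter – none of the names below come from real library code:

using System;
using System.Collections.Generic;

public static class ShuffleExtensions
{
    /// <summary>
    /// Shuffles the given list in place, using the supplied
    /// source of randomness (Fisher-Yates).
    /// </summary>
    public static void Shuffle<T>(this IList<T> list, Random rng)
    {
        for (int i = list.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1); // 0 <= j <= i
            T tmp = list[i];
            list[i] = list[j];
            list[j] = tmp;
        }
    }
}

In production code you’d pass ThreadLocalRandom.Instance; in a test you could pass new Random(12345) and assert against a known ordering.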

Performance

I tested the "old" and "new" implementations in a very simple way – for varying numbers of threads, I called Next() a fixed number of times (from each thread) and timed how long it took for all the threads to finish. I also tried a .NET-3.5-compatible version using ThreadStatic instead of ThreadLocal<T>, and ran both the original code and the ThreadStatic version on .NET 3.5 as well.
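The harness itself was nothing fancy – roughly this sort of thing (a simplified sketch rather than the exact code I ran; the iteration count is just an arbitrary large constant):

using System;
using System.Diagnostics;
using System.Threading;

class RandomBenchmark
{
    // Arbitrary figure - just needs to be large enough to give measurable times.
    const int IterationsPerThread = 10000000;

    static void Main()
    {
        for (int threads = 1; threads <= 4; threads++)
        {
            RunTest(threads);
        }
    }

    static void RunTest(int threadCount)
    {
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                for (int j = 0; j < IterationsPerThread; j++)
                {
                    ThreadLocalRandom.Next(); // or StaticRandom.Next(), etc.
                }
            });
        }
        Stopwatch stopwatch = Stopwatch.StartNew();
        foreach (Thread thread in threads)
        {
            thread.Start();
        }
        foreach (Thread thread in threads)
        {
            thread.Join();
        }
        stopwatch.Stop();
        Console.WriteLine("{0} thread(s): {1}ms", threadCount, stopwatch.ElapsedMilliseconds);
    }
}

Each cell in the table below came from a run along those lines.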

Threads   StaticRandom (4.0b2)   ThreadLocalRandom (4.0b2)   ThreadStaticRandom (4.0b2)   StaticRandom (3.5)   ThreadStaticRandom (3.5)
1         11582                  6016                        10150                        10373                16049
2         24667                  7214                        9043                         25062                17257
3         38095                  10295                       14771                        36827                25625
4         49402                  13435                       19116                        47882                34415

A few things to take away from this:

  • The numbers were slightly erratic; somehow it was quicker to do twice the work with ThreadStaticRandom on .NET 4.0b2! This isn’t the perfect benchmarking machine; we’re interested in trends rather than absolute figures.
  • Locking hasn’t changed much in performance between framework versions
  • ThreadStatic has improved massively between .NET 3.5 and 4.0
  • Even on 3.5, ThreadStatic wins over a global lock as soon as there’s contention

The only downside to the ThreadLocal<T> version is that it requires .NET 4.0 – but the ThreadStatic version works pretty well too.
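For reference, the ThreadStatic version I benchmarked was along these lines (a sketch; the real class mirrors the same static wrapper methods as ThreadLocalRandom above):

using System;

public static class ThreadStaticRandom
{
    private static readonly Random globalRandom = new Random();
    private static readonly object globalLock = new object();

    /// <summary>
    /// One Random per thread, created lazily on first use. Note that the
    /// field can't be initialized inline - that would only happen on the
    /// first thread to touch the type.
    /// </summary>
    [ThreadStatic]
    private static Random threadRandom;

    public static Random Instance
    {
        get
        {
            if (threadRandom == null)
            {
                // Seed from the global instance so that threads started in
                // quick succession don't share time-based seeds.
                lock (globalLock)
                {
                    threadRandom = new Random(globalRandom.Next());
                }
            }
            return threadRandom;
        }
    }

    public static int Next()
    {
        return Instance.Next();
    }

    // Next(int), Next(int, int), NextDouble() and NextBytes(byte[])
    // delegate to Instance in exactly the same way as before.
}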

It’s worth bearing in mind that of course this is testing the highly unusual situation where there’s a lot of contention in the global lock version. The performance difference in the single-threaded version where the lock is always uncontended is still present, but very small.

Update

After reading the comments and thinking further, I would indeed get rid of the static methods elsewhere. Also, for the purposes of dependency injection, I agree that it’s a good idea to have a factory interface where that’s not overkill. The factory implementation could use either the ThreadLocal or ThreadStatic implementations, or effectively use the global lock version (by having its own instance of Random and a lock). In many cases I’d regard that as overkill, however.

One other interesting option would be to create a thread-safe instance of Random to start with, which delegated to thread-local "normal" implementations. That would be very useful from a DI standpoint.
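A sketch of that last idea (purely hypothetical – this isn’t code I’ve tested): a Random subclass whose overrides all delegate to a per-thread instance, so a single object can be registered with a container and shared freely:

using System;
using System.Threading;

/// <summary>
/// A Random which is safe to share between threads: every call is
/// forwarded to a thread-local "normal" Random instance.
/// </summary>
public sealed class ThreadSafeRandom : Random
{
    private readonly ThreadLocal<Random> local =
        new ThreadLocal<Random>(ThreadLocalRandom.NewRandom);

    public override int Next() { return local.Value.Next(); }
    public override int Next(int maxValue) { return local.Value.Next(maxValue); }
    public override int Next(int minValue, int maxValue) { return local.Value.Next(minValue, maxValue); }
    public override double NextDouble() { return local.Value.NextDouble(); }
    public override void NextBytes(byte[] buffer) { local.Value.NextBytes(buffer); }
}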

Buffering vs streaming benchmark: first results

My poor laptop’s had a busy weekend. It’s run 72 tests, rebooting between each test. Most of these tests have kept both the CPU and disk busy for a lot of the time. I expect to update this blog post with more numbers – and possibly more strategies – as time goes on, but I wanted to get the numbers out as quickly as possible. Graphs should be coming when I’ve got a decent network to experiment with Google graphing.

Source code and setup

While I don’t really expect anyone else to sacrifice their computer for the best part of 24 hours, it doesn’t take very much effort to get the tests running. This zip file contains the source code for three programs, a batch file to generate the test data, and a command file.

The CreateFiles executable creates input files based on its command line arguments. (This is run multiple times by the batch file.) EncryptFiles “encrypts” (I use the term loosely) all the files in a given directory, writing the results to another directory. The class design of EncryptFiles is a little bit funny, but it’s really tailored for these tests. It should be easy to write other strategies in the same way too.

ExecuteAndReboot is used to automate the testing. Basically it reads a command file (from a hard-coded directory; simplicity trumps flexibility here). If the file isn’t empty, it waits five minutes (so that after a reboot the computer has time to settle down after loading all its services) and runs the command specified in the first line of the file, appending the output to a results file. When the command has finished executing, ExecuteAndReboot rewrites the command file, removing the first line, and schedules a reboot. The idea is to drop ExecuteAndReboot into a Startup folder, set Windows to log in automatically, reboot once and then just let it go. If you want to use the computer, just wait until it’s in the “waiting five minutes” bit at the start of the cycle and kill ExecuteAndReboot. Next time you reboot it will start testing again.

That’s the general principle – if anyone does want to run the tests on their own machine and needs a bit of help setting it up, just let me know.

Test environment and workload

I expect the details of the execution environment to affect the outcome significantly. For the sake of the record then, my laptop specs are:

Dell Inspiron 1720 (unlikely to be relevant, but…)
Intel Core 2 Duo T7250 2.0GHz
Vista Home Premium 32 bit
3 GB main memory, 32K L1 cache, 2048K L2 cache
2 disks, both 120GB (Fujitsu MHY2120BH, 5400RPM, 8MB cache)

Even though there are two drives in the system, the tests only ever read from and wrote to a single disk.

I created four sets of input data; I’ve presented the results for each one separately below, and included the description of the files along with the results. Basically in each case there’s a set of files where each file is the same size, consisting of a fixed number of fixed-length lines.

The “encryption” that is performed in each case (on the encoded version of the line) is just to XOR each byte with a value. This is repeated a specified number of times, and the XOR value on each pass is the current index, i.e. it first XORs with 0, then 1, then 2 etc. I’ve used work levels of 0 (the loop never runs), 100 and 1000. These were picked somewhat arbitrarily, but they give interesting results. At 0 the process is clearly IO-bound. At 1000 it’s clearly CPU-bound. At 100 it’s not pegging the CPU, but there’s significant activity. I may try 200 as well at some point.

For each scenario I tried using 1, 2 and 3 threads (and 4 for the big tests; I’ll go back and repeat earlier ones with 4 threads soon). The two strategies used are “streaming” (read a line, encrypt it, write the line, repeat) and “buffering” (read a whole file, encrypt it all, write it all out). After a thread has started, it needs no shared data at all – it’s given the complete list of files to encrypt at the very start.
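In code, the per-file work for the two strategies boils down to something like this (a simplified sketch rather than the code in the zip file – paths, error handling and the exact output encoding are glossed over):

using System.IO;
using System.Text;

static class EncryptionStrategies
{
    // Streaming: read a line, encrypt it, write it out, repeat.
    static void EncryptStreaming(string inputFile, string outputFile, int workLevel)
    {
        using (TextReader reader = File.OpenText(inputFile))
        using (TextWriter writer = File.CreateText(outputFile))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(EncryptLine(line, workLevel));
            }
        }
    }

    // Buffering: read the whole file, encrypt everything, then write it all out.
    static void EncryptBuffering(string inputFile, string outputFile, int workLevel)
    {
        string[] lines = File.ReadAllLines(inputFile);
        for (int i = 0; i < lines.Length; i++)
        {
            lines[i] = EncryptLine(lines[i], workLevel);
        }
        File.WriteAllLines(outputFile, lines);
    }

    // The per-line "encryption": workLevel passes of XORing each byte
    // with the pass index, as described above.
    static string EncryptLine(string line, int workLevel)
    {
        byte[] data = Encoding.ASCII.GetBytes(line);
        for (int pass = 0; pass < workLevel; pass++)
        {
            for (int i = 0; i < data.Length; i++)
            {
                data[i] ^= (byte) pass;
            }
        }
        return Encoding.ASCII.GetString(data);
    }
}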

Results

For every table, the number of threads is shown by the leftmost column, and the work level is shown by the topmost row. The cells are of the form “x/y” where “x” is the time taken for the streaming strategy, and “y” is the time taken for the buffering strategy. All times are in seconds. In each column, the optimal (fastest) result is marked with an asterisk.

Small tests

There are 10000 test files. Each is 100 lines long, with 100 characters per line.

Threads/Work    0          100        1000
1               47*/90     94/79*     163/158
2               95/112     113/157    129*/146
3               179/128    130/144    152/160
4               144/139    163/193    156/204

Medium tests (short lines)

There are 100 test files. Each is 500,000 lines long, with 20 characters per line.

Threads/Work    0           100         1000
1               138*/157    219*/267    1125/1205
2               196/238     259/263     604/691
3               292/250     312/236     624/676
4               302/291     321/281     602*/666

Medium tests (long lines)

There are 100 test files. Each is 100 lines long, with 100,000 characters per line.

Threads/Work    0           100         1000
1               203*/213    249*/286    1027/1107
2               283/264     311/296     602/696
3               319/284     314/278     608/618
4               360/300     336/297     607/598*

Big tests

There are 20 files. Each file is 100,000 lines long, with 1000 characters per line.

Threads/Work    0           100         1000
1               190*/246    349*/441    2047/2165
2               365/375     392/473     1097*/1387
3               456/379     484/399     1161/1296
4               517/442     521/431     1113/1214

Conclusions

I’m loath to draw too many conclusions so far. After all, we’ve only got data for a single test environment. It would be interesting to see what would happen on a quad core machine. However, I think a few things can be said with reasonable confidence:

  • Sometimes the streaming strategy wins, sometimes the buffering strategy wins. The difference can be quite significant either way. For a given input and workload, the streaming solution almost always wins on my machine.
  • In an ideal scenario, the bottlenecked resource should be busy for as much of the time as possible. For example, in a high-CPU task it makes little sense to have idle cores when you’ve already loaded data. Likewise in a disk-bound task we should always be fetching (or writing) data even if all our cores are momentarily busy. Personally I think this is the operating system’s prefetch system’s job. We could do it ourselves, but it’s nicer if we don’t have to.
  • If we’re disk-bound, it makes sense to read (or write) one file at a time, to avoid seeking. This is obvious from the results – for work levels of 0 and 100, a single-thread solution is consistently best (and streaming usually works better than buffering). Based on this assumption, I suspect a cleaner solution would be to have a single thread reading largish chunks (but not the whole file) and feeding the other threads – or to use async IO. However, this all gets very complicated. One nice aspect of both of the current solutions is that they’re really simple to implement safely. Reading line-by-line in an async manner is a pain. I’d quite like to write an async “execute for every line” helper at some time, but not just yet.
  • If we’re CPU bound, the buffering solution’s sweet spot is 3 threads (remember we’re using 2 cores) and the streaming solution works better with just 2 threads – introducing more threads will make both the IO and context switching less efficient.
  • The streaming strategy always uses less application memory than the buffering strategy. When running the buffering strategy against the “big” data set with 4 threads, the memory varied between 800MB and 1.8GB (as measured by the “Total bytes in all heaps” performance counter), vs around 165KB with the streaming strategy. (Obviously the process itself took more than 165KB, but that was the memory used in “normal” managed heap memory.) Using more memory to improve performance is a valid technique, but I’d say it’s only worth it when the improvement is very significant. Additionally, the streaming technique instantly scales to huge data files. While they’re not in the current requirements, it doesn’t hurt to get that ability for free.

Next steps…

I’ve included quite a few “I’m going to try to […]” points in this post. Just for reference, my plans are:

  • Turn off my virus checker! If this makes a significant difference, I’ll rerun all the current tests
  • Try to visualise the results with Google graphs – although I think it may get cluttered
  • Rerun big/buffering/1000 test – odd results
  • Rerun tests with a work level of 200 and keep an eye on CPU usage – I’d like to spot where we tip between CPU and IO being the bottleneck
  • Try to optimize the Stream/StreamReader being used – we should be able to skip the FileStream buffer completely, and specifying FileOptions.SequentialScan should give more hints to Windows about how we’re intending to read the data. I’ll try a few different buffer sizes, probably 4K, 8K and 64K, just to see what happens.
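For that last point, the sort of thing I have in mind is (hypothetical – I haven’t measured any of this yet):

using System.IO;
using System.Text;

static class SequentialReading
{
    // Open a file for sequential, line-by-line reading. The FileStream buffer
    // is kept at the minimum (the StreamReader buffers anyway), and
    // FileOptions.SequentialScan hints to Windows about the access pattern.
    // bufferSize is the StreamReader buffer to experiment with (4K, 8K, 64K).
    static StreamReader OpenSequential(string path, int bufferSize)
    {
        FileStream stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, 1, FileOptions.SequentialScan);
        return new StreamReader(stream, Encoding.ASCII, false, bufferSize);
    }
}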

This is a really interesting little example problem, because it sounds so simple (and it is simple in terms of the implementation) but it has a lot of environment variables. Of course it would be nicer if we didn’t have to reboot the machine between test runs in order to get a fair result, but never mind…

Benchmarking IO: buffering vs streaming

I mentioned in my recent book review that I was concerned about a recommendation to load all of the data from an input file before processing all of it. This seems to me to be a bad idea in an age where Windows prefetch will anticipate what data you need next, etc – allowing you to process efficiently in a streaming fashion.

However, without any benchmarks I’m just guessing. I’d like to set up a benchmark to test this – it’s an interesting problem which I suspect has lots of nuances. This isn’t about trying to prove the book wrong – it’s about investigating a problem which sounds relatively simple, but could well not be. I wouldn’t be at all surprised to see that in some cases the streaming solution is faster, and in other cases the buffered solution is faster.

The Task

The situation presented is like this:

  • We have a bunch of input files, either locally or on the network (I’m probably just going to test locally for now)
  • Each file is less than 100MB
  • We need to encrypt each line of text in each input file, writing it to a corresponding output file

The method suggested in the book is for each thread to:

  1. Load a file into a List<string>
  2. Encrypt every line (replacing it in the list)
  3. Save to a new file

My alternative option is:

  1. Open a TextReader and a TextWriter for the input/output
  2. Repeatedly read a line, encrypt, write the encrypted line until we’ve exhausted the input file
  3. Close both the reader and the writer

These are the two implementations I want to test. I strongly suspect that the optimal solution would involve async IO, but doing an async version of ReadLine is a real pain for various reasons. I’m going to keep it simple – using plain threading, no TPL etc.

I haven’t written any code yet. This is where you come in – not to write the code for me, but to make sure I test in a useful way.

Environmental variations

My plan of attack is to first write a small program to generate the input files. These will just be random text files, and the program will have a few command line parameters:

  • Directory to put files under (one per test variation, basically)
  • Number of files to create
  • Number of lines per file
  • Number of characters per line

I’ll probably test a high and a low number for each of the last three parameters, possibly omitting a few variations for practical reasons.
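The generator really doesn’t need to be anything more than this sort of thing (a hypothetical sketch of CreateFiles):

using System;
using System.IO;
using System.Text;

class CreateFiles
{
    // Expected arguments: <directory> <fileCount> <linesPerFile> <charsPerLine>
    static void Main(string[] args)
    {
        string directory = args[0];
        int fileCount = int.Parse(args[1]);
        int linesPerFile = int.Parse(args[2]);
        int charsPerLine = int.Parse(args[3]);

        Directory.CreateDirectory(directory);
        Random rng = new Random();
        for (int file = 0; file < fileCount; file++)
        {
            using (TextWriter writer = File.CreateText(Path.Combine(directory, file + ".txt")))
            {
                for (int line = 0; line < linesPerFile; line++)
                {
                    StringBuilder builder = new StringBuilder(charsPerLine);
                    for (int i = 0; i < charsPerLine; i++)
                    {
                        // Random lower-case ASCII letters - nothing clever.
                        builder.Append((char) rng.Next('a', 'z' + 1));
                    }
                    writer.WriteLine(builder);
                }
            }
        }
    }
}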

In an ideal world I’d test on several different computers, locally and networked, but that just isn’t practical. In particular I’d be interested to see how much difference an SSD (low seek time) makes to this test. I’ll be using my normal laptop, which is a dual core Core Duo with two normal laptop disks. I may well try using different drives for reading and writing to see how much difference that makes.

Benchmarking

The benchmark program will also have a few command line parameters:

  • Directory to read files from
  • Directory to write files to
  • Number of threads to use (in some cases I suspect that more threads than cores will be useful, to avoid cores idling while data is read for a blocking thread)
  • Strategy to use (buffered or streaming)
  • Encryption work level

The first three parameters here are pretty self-explanatory, but the encryption work level isn’t. Basically I want to be able to vary the difficulty of the task, which will vary whether it ends up being CPU-bound or IO-bound (I expect). So, for a particular line I will:

  • Convert to binary (using Encoding.ASCII – I’ll generate just ASCII files)
  • Encrypt the binary data
  • Encrypt the encrypted binary data
  • Encrypt the encrypted encrypted […] etc until we’ve hit the number given by the encryption work level
  • Base64 encode the result – this will be the output line

So with an encryption work level of 1 I’ll just encrypt once. With a work level of 2 I’ll encrypt twice, etc. This is purely for the sake of giving the computer something to do. I’ll use AES unless anyone has a better suggestion. (Another option would be to just use an XOR or something else incredibly simple.) The key/IV will be fixed for all tests, just in case that has a bearing on anything.
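In code, each line would go through something like this (a sketch of the plan above, not tested code; the all-zero key and IV are just placeholders for whatever fixed values I end up using):

using System;
using System.Security.Cryptography;
using System.Text;

static class LineEncryptor
{
    // Fixed key/IV for every test run, as described above. 16 bytes = AES-128.
    private static readonly byte[] Key = new byte[16];
    private static readonly byte[] IV = new byte[16];

    static string EncryptLine(string line, int workLevel)
    {
        byte[] data = Encoding.ASCII.GetBytes(line);
        using (Aes aes = Aes.Create())
        {
            aes.Key = Key;
            aes.IV = IV;
            // Encrypt the result of the previous pass, workLevel times in total.
            for (int pass = 0; pass < workLevel; pass++)
            {
                using (ICryptoTransform encryptor = aes.CreateEncryptor())
                {
                    data = encryptor.TransformFinalBlock(data, 0, data.Length);
                }
            }
        }
        return Convert.ToBase64String(data);
    }
}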

The benchmarking program is going to be as simple as I can possibly make it:

  • Start a stopwatch
  • Read the names of all the files in the directory
  • Create a list of files for each thread to encrypt
  • Create and start the threads
  • Use Thread.Join on all the threads
  • Stop the stopwatch and report the time taken

No rendezvous required at all, which certainly simplifies things. By creating the work list before the thread, I don’t need to worry about memory model issues. It should all just be fine.
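As a sketch, the whole driver is little more than this (hypothetical code – the per-file strategy dispatch is elided):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Threading;

class Benchmark
{
    // Expected arguments: <inputDir> <outputDir> <threadCount> <strategy> <workLevel>
    static void Main(string[] args)
    {
        string inputDir = args[0];
        string outputDir = args[1];
        int threadCount = int.Parse(args[2]);
        string strategy = args[3];
        int workLevel = int.Parse(args[4]);

        Stopwatch stopwatch = Stopwatch.StartNew();
        string[] files = Directory.GetFiles(inputDir);

        // Partition the files up front, so the threads never share any state.
        List<string>[] workLists = new List<string>[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            workLists[i] = new List<string>();
        }
        for (int i = 0; i < files.Length; i++)
        {
            workLists[i % threadCount].Add(files[i]);
        }

        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            List<string> workList = workLists[i];
            threads[i] = new Thread(() => EncryptAll(workList, outputDir, strategy, workLevel));
            threads[i].Start();
        }
        foreach (Thread thread in threads)
        {
            thread.Join();
        }
        stopwatch.Stop();
        Console.WriteLine("Total time: {0}ms", stopwatch.ElapsedMilliseconds);
    }

    static void EncryptAll(List<string> files, string outputDir, string strategy, int workLevel)
    {
        foreach (string file in files)
        {
            // Apply the chosen strategy (buffered or streaming) to this file,
            // using workLevel to control how much encryption is done per line.
        }
    }
}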

In the absence of a better way of emptying all the file read caches (at the Windows and disk levels) I plan to reboot my computer between test runs (which makes it pretty expensive in terms of time spent – hence omitting some variations). I wasn’t planning on shutting services etc down: I really hope that Vista won’t do anything silly like trying to index the disk while I’ve got a heavy load going. Obviously I won’t run any other applications at the same time.

If anyone has any suggested changes, I’d be very glad to hear them. Have I missed anything? Should I run a test where the file sizes vary? Is there a better way of flushing all caches than rebooting?

I don’t know exactly when I’m going to find time to do all of this, but I’ll get there eventually :)

Benchmarking: designing an API with unusual goals

In a couple of recent posts I’ve written about a benchmarking framework and the results it produced for using for vs foreach in loops. I’m pleased with what I’ve done so far, but I don’t think I’ve gone far enough yet. In particular, while it’s good at testing multiple algorithms against a single input, it’s not good at trying several different inputs to demonstrate the complexity vs input size. I wanted to rethink the design at three levels – what the framework would be capable of, how developers would use it, and then the fine-grained level of what the API would look like in terms of types, methods etc. These may all sound quite similar on the face of it, but this project is somewhat different to a lot of other coding I’ve done, mostly because I want to lower the barrier to entry as far as humanly possible.

Before any of this is meaningful, however, I really needed an idea of the fundamental goal. Why was I writing yet another benchmarking framework anyway? While I normally cringe at mission statements because they’re so badly formulated and used, I figured this time it would be helpful.

Minibench makes it easy for developers to write and share tests to investigate and measure code performance.

The words in bold (or for the semantically inclined, the strong words) are the real meat of it. It’s quite scary that even within a single sentence there are seven key points to address. Some are quite simple, others cause grief. Now let’s look at each of the areas of design in turn.

Each element of the design should either clearly contribute to the mission statement or help in a non-functional way (e.g. make the project feasible in a reasonable timeframe, avoid legal issues etc). I’m aware that with the length of this post, it sounds like I’m engaging in "big upfront design" but I’d like to think that it’s at least informed by my recent attempt, and that the design criteria here are statements of intent rather than implementation commitments. (Aargh, buzzword bingo… please persevere!)

What can it do?

As we’ve already said, it’s got to be able to measure code performance. That’s a pretty vague definition, however, so I’m going to restrict it a bit – the design is as much about saying what isn’t included as what is.

  • Each test will take the form of a single piece of code which is executed many times by the framework. It will have an input and an expected output. (Operations with no natural output can return a constant; I’m not going to make any special allowance for them.)
  • The framework should take the tedium out of testing. In particular I don’t want to have to run it several times to get a reasonable number of iterations. I suspect it won’t be feasible to get the framework to guess appropriate inputs, but that would be lovely if possible.
  • Only wall time is measured. There are loads of different metrics which could be applied: CPU execution time, memory usage, IO usage, lock contention – all kinds of things. Wall time (i.e. actual time elapsed, as measured by a clock on the wall) is by far the simplest to understand and capture, and it’s the one most frequently cited in newsgroup and forum questions in my experience.
  • The benchmark is uninstrumented. I’m not going to start rewriting your code dynamically. Frankly this is for reasons of laziness. A really professional benchmarking system might take your IL and wrap it in a timing loop within a single method, somehow enforcing that the result of each iteration is used. I don’t believe that’s worth my time and energy, as well as quite possibly being beyond my capabilities.
  • As a result of the previous bullet, the piece of code to be run lots of times needs to be non-trivial. The reality is that it’ll end up being called as a delegate. This is pretty quick, but if you’re just testing "is adding two doubles faster or slower than adding two floats" then you’ll need to put a bit more work in (e.g. having a loop in your own code as well). There’s a sketch of the framework’s conceptual timing loop after this list.
  • As well as the use case of "which of these algorithms performs the best with this input?" I want to support "how does the performance vary as a function of the input?" This should support multiple algorithms at the same time as multiple inputs.
  • The output should be flexible but easy to describe in code. For single-input tests simple text output is fine (although the exact figures to produce can be interesting); for multiple inputs against multiple tests a graph would often be ideal. If I don’t have the energy to write a graphing output I should at least support writing to CSV or TSV so that a spreadsheet or graphing tool can do the heavy lifting.
  • The output should be useful – it should make it easy to compare the performance of different algorithms and/or inputs. It’s clear from the previous post here that just including the scaled score doesn’t give an obvious meaning. Some careful wording in the output, as well as labeled columns, may be required. This is emphatically not a dig at anyone confused by the last post – any confusion was my own fault.
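To make the delegate point concrete, the heart of what the framework has to do is conceptually no more than this (a sketch of the concept, not the actual MiniBench API):

using System;
using System.Diagnostics;

static class TimingCore
{
    // Check the test once against the expected output, then time
    // the given number of iterations of the delegate call.
    static TimeSpan Time<TInput, TOutput>(Func<TInput, TOutput> test,
                                          TInput input,
                                          TOutput expectedOutput,
                                          int iterations)
    {
        if (!Equals(test(input), expectedOutput))
        {
            throw new InvalidOperationException("Test produced the wrong output");
        }
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            test(input);
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}

The per-iteration delegate invocation is the overhead mentioned above – for very cheap operations it can dominate, which is why a trivial test body needs its own inner loop.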

Okay, that doesn’t sound too unreasonable. The next area is much harder, in my view.

How does a developer use it?

Possibly the most important word in the mission statement is share. The reason I started this project at all is that I was fed up with spending ages writing timing loops for benchmarks which I’d then post on newsgroups or Stack Overflow. That means there are two (overlapping) categories of user:

  • A developer writing a test. This needs to be easy, but that’s an aspect of design that I’m reasonably familiar with. I’m not saying I’m good at it, but at least I have some prior experience.
  • A developer reading a newsgroup/forum post, and wanting to run the benchmark for themselves. This distribution aspect is the hard bit – or at least the bit requiring imagination. I want the barrier to running the code to be really, really low. I suspect that there’ll be a "fear of the unknown" to start with which is hard to conquer, but if the framework becomes widely used I want the reader’s reaction to be: "Ah, there’s a MiniBench for this. I’m confident that I can download and run this code with almost no effort."

This second bullet is the one that my friend Douglas and I have been discussing over the weekend, in some ways playing a game of one-upmanship: "I can think of an idea which is even easier than yours." It’s a really fun game to play. Things we’ve thought about so far:

  • A web page which lets you upload a full program (without the framework) and spits out a URL which can be posted onto Stack Overflow etc. The user would then choose from the following formats:
    • Single .cs file containing the whole program – just compile and run. (This would also be shown on the download page.)
    • Test code only – for those who already have the framework
    • Batch file – just run it to extract/build/run the C# code.
    • NAnt project file containing the C# code embedded in it – just run NAnt
    • MSBuild project file – ditto but with msbuild.
    • Zipped project – open the project to load the test in one file and the framework code in other (possibly separate) .cs files
    • Zipped solution – open to load two projects: the test code in one and the framework in the other
  • A web page which lets you upload your results and browse the results of others

Nothing’s finalised here, but I like the general idea. I’ve managed (fairly easily) to write a "self-building" batch file, but I haven’t tried with NAnt/MSBuild yet. I can’t imagine it’s that hard – but then I’m not sure how much value there is either. What I do want to try to aim for is users running the tests properly, first time, without much effort. Again, looking back at the last post, I want to make it obvious to users if they’re running under a debugger, which is almost always the wrong thing to be doing. (I’m pretty sure there’s an API for this somewhere, and if there’s not I’m sure I can work out an evil way of detecting it anyway.)

The main thing is the ease of downloading and running the benchmark. I can’t see how it could be much easier than "follow link; choose format and download; run batch file" – unless the link itself was to the batch file, of course. (That would make it harder to use for people who wanted to get the source in a more normal format, of course.)

Going back to the point of view of the developer writing the test, I need to make sure it’s easy enough for me to use from home, work and on the train. That may mean a web page where I can just type in code, the input and expected output, and let it fill in the rest of the code for me. It may mean compiling a source file against a library from the command line. It may mean compiling a source file against the source code of the framework from the command line, with the framework code all in one file. It may mean building in Visual Studio. I’d like to make all of these cases as simple as possible – which is likely to make it simple for other developers as well. I’m not planning on optimising the experience when it comes to writing a benchmark on my mobile though – that might be a step too far!

What should the API look like?

When we get down to the nitty-gritty of types and methods, I think what I’ve got is a good starting point. There are still a few things to think about though:

  • We nearly have the functionality required for running a suite with different inputs already – the only problem is that we’re specifying the input (and expected output) in the constructor rather than as parameters to the RunTests method. I could change that… but then we lose the benefit of type inference when creating the suite. I haven’t resolved this to my satisfaction yet :(
  • The idea of having the suite automatically set up using attributed methods appeals, although we’d still need a Main method to create the suite and format the output. The suite creation can be simplified, but the chances of magically picking the most appropriate output are fairly slim. I suppose it could go for the "scale to best by number of iterations and show all columns" option by default… that still leaves the input and expected output, of course. I’m sure I’ll have something like this as an option, but I don’t know how far it will go.
  • The "configuration" side of it is expressed as a couple of constants at the moment. These control the minimum amount of time to run tests for before we believe we’ll be able to guess how many iterations we’ll need to get close to the target time, and the target time itself. These are currently set at 2 seconds and 30 seconds respectively – but when running tests just to check that you’ve got the right output format etc, that’s far too long. I suspect I should make a test suite have a configuration, and default to those constants but allow them to be specified on the command line as well, or explicitly in code.
  • Why do we need to set the expected output? In many cases you can be pretty confident that at least one of the test cases will be correct – so it’s probably simpler just to run each test once and check that the results are the same for all of them, and take that as the expected output. If you don’t have to specify the expected output, it becomes easier to specify a sequence of inputs to test.
  • Currently BenchmarkResult is nongeneric. This makes things simpler internally – but should a result know the input that it was derived from? Or should the ResultSuite (which is also nongeneric) know the input that has been applied to all its functions? The information will certainly need to be somewhere so that it can be output appropriately in the multiple input case.

My main points of design focus around three areas:

  • Is it easy to pick up? The more flexible it is, with lots of options, the more daunting it may seem.
  • Is it flexible enough to be useful in a variety of situations? I don’t know what users will want to benchmark – and if I don’t build the right tool, it will be worthless to them.
  • Is the resulting test code easy and brief enough to include in a forum post, with a link to the full program? Will readers understand it?

As you can see, these are aimed at three slightly different people: the first time test writer, the veteran test writer, and the first time test reader. Getting the balance between the three is tricky.

What’s next?

I haven’t started rewriting the framework yet, but will probably do so soon. This time I hope to do it in a rather more test-driven way, although of course the timing-specific elements will be tricky unless I start using a programmatic clock etc. I’d really like comments around this whole process:

  • Is this worth doing?
  • Am I asking the right questions?
  • Are my answers so far headed in the right direction?
  • What else haven’t I thought of?