Optimization and generics, part 2: lambda expressions and reference types

As with almost any performance work, your mileage may vary (in particular the 64-bit JIT may work differently) and you almost certainly shouldn’t care. Relatively few people write production code which is worth micro-optimizing. Please don’t take this post as an invitation to make code more complicated for the sake of irrelevant and possibly mythical performance changes.

It took me a surprisingly long time to find the problem described in the previous blog post, and almost no time at all to fix it. I understood why it was happening. This next problem took a while to identify at all, but even when I’d found a workaround I had no idea why it worked. Furthermore, I couldn’t reproduce it in a test case… because I was looking for the wrong set of triggers. I’ve now found at least some of the problem though.

This time the situation in Noda Time is harder to describe, although it concerns the same area of code. In various places I need to create new delegates containing parsing steps and add them to the list of steps required for a full parse. I can always use lambda expressions, but in many cases I’ve got the same logic repeatedly… so I decided to pull it out into a method. Bang – suddenly the code runs far slower. (In reality, I’d performed this refactoring first, and "unrefactored" it to speed things up.)

I think the problem comes down to method group conversions with generic methods and a type argument which is a reference type. The CLR isn’t very good at them, and the C# compiler uses them more than it needs to.

Show me the benchmark!

The complete benchmark code is available of course, but fundamentally I’m doing the same thing in each test case: creating a delegate of type Action which does nothing, and then checking that the delegate reference is non-null (just to avoid the JIT optimizing it away). In each case this is done in a generic method with a single type parameter. I call each method in two ways: once with int as the type argument, and once with string as the type argument. Here are the different cases involved:

  • Use a lambda expression: Action foo = () => {};
  • Fake what I expected the compiler to do: keep a separate generic cache class with a static variable for the delegate; populate the cache once if necessary, and thereafter use the cache field
  • Fake what the compiler is actually doing with the lambda expression: write a separate generic method and perform a method group conversion to it
  • Do what the compiler could do: write a separate non-generic method and perform a method group conversion to it
  • Use a method group conversion to a static (non-generic) method on a generic type
  • Use a method group conversion to an instance (non-generic) method on a generic type, via a generic cache class with a single field in referring to an instance of the generic class

(Yes, the last one is a bit convoluted – but the line in the method itself is simple: Action foo = ClassHolder<T>.SampleInstance.NoOpInstance;

Remember, we’re doing each of these in a generic method, and calling that generic method using a type argument of either int or string. (I’ve run a few tests, and the exact type isn’t important – all that matters is that int is a value type, and string is a reference type.)

Importantly, we’re not capturing any variables, and the type parameter is not involved in either the delegate type or any part of the implementation body.

Benchmark results

Again, times are in milliseconds – but this time I didn’t want to run it for 100 million iterations, as the "slow" versions would have taken far too long. I’ve run this on the x64 JIT as well and seen the same effect, but I haven’t included the figures here.

Times in milliseconds for 10 million iterations

Test TestCase<int> TestCase<string>
Lambda expression 180 29684
Generic cache class 90 288
Generic method group conversion 184 30017
Non-generic method group conversion 178 189
Static method on generic type 180 29276
Instance method on generic type 202 299

Yes, it’s about 150 times slower to create a delegate from a generic method with a reference type as the type argument than with a value type… and yet this is the first I’ve heard of this. (I wouldn’t be surprised if there were a post from the CLR team about it somewhere, but I don’t think it’s common knowledge by any means.)

Conclusion

One of the tricky things is that it’s hard to know exactly what the C# compiler is going to do with any given lambda expression. In fact, the method which was causing me grief earlier on isn’t generic, but it’s in a generic type and captures some variables which use the type parameters – so perhaps that’s causing a generic method group conversion somewhere along the way.

Noda Time is a relatively extreme case, but if you’re using delegates in any performance-critical spots, you should really be aware of this issue. I’m going to ping Microsoft (first informally, and then via a Connect report if that would be deemed useful) to see if there’s an awareness of this internally as potential "gotcha", and whether there’s anything that can be done. Normal trade-offs of work required vs benefit apply, of course. It’s possible that this really is an edge case… but with lambdas flying everywhere these days, I’m not sure that it is.

Maybe tomorrow I’ll actually be able to finish getting Noda Time moved onto the new system… all of this performance work has been a fun if surprising distraction from the main job of shipping working code…

Optimization and generics, part 1: the new() constraint (updated: now with CLR v2 results)

As with almost any performance work, your mileage may vary (in particular the 64-bit JIT may work differently) and you almost certainly shouldn’t care. Relatively few people write production code which is worth micro-optimizing. Please don’t take this post as an invitation to make code more complicated for the sake of irrelevant and possibly mythical performance changes.

I’ve been doing quite a bit of work on Noda Time recently – and have started getting my head round all the work that James Keesey has put into the parsing/formatting. I’ve been reworking it so that we can do everything without throwing any exceptions, and also to work on the idea of parsing a pattern once and building a sequence of actions for both formatting and parsing from the action. To format or parse a value, we then just need to apply the actions in turn. Simples.

Given that this is all in the name of performance (and I consider Noda Time to be worth optimizing pretty hard) I was pretty cross when I ran a complete revamp through the little benchmarking tool we use, and found that my rework had made everything much slower. Even parsing a value after parsing the pattern was slower than parsing both the value and the pattern together. Something was clearly very wrong.

In fact, it turns out that at least two things were very wrong. The first (the subject of this post) was easy to fix and actually made the code a little more flexible. The second (the subject of the next post, which may be tomorrow) is going to be harder to work around.

The new() constraint

In my SteppedPattern type, I have a generic type parameter – TBucket. It’s already constrained in terms of another type parameter, but that’s irrelevant as far as I’m aware. (After today though, I’m taking very little for granted…) The important thing is that before I try to parse a value, I want to create a new bucket. The idea is that bits of information end up in the bucket as they’re being parsed, and at the very end we put everything together. So each parse operation requires a new bucket. How can we create one in a nice generic way?

Well, we can just call its public parameterless constructor. I don’t mind the types involved having such a constructor, so all we need to do is add the new() constraint, and then we can call new TBucket():

// Somewhat simplified…
internal sealed class SteppedPattern<TBucket> : IParsePattern<TBucket>
    where TBucket : new()
{
    public ParseResult Parse(string value)
    {
        TBucket bucket = new TBucket();

        // Rest of parsing goes here
    }
}

Great! Nice and simple. Unfortunately, it turned out that that one line of code was taking 75% of the time to parse a value. Just creating an empty bucket – pretty much the simplest bit of parsing. I was amazed when I discovered that.

Fixing it with a provider

The fix is reasonably easy. We just need to tell the type how to create an instance, and we can do that with a delegate:

// Somewhat simplified…
internal sealed class SteppedPattern<TBucket> : IParsePattern<TBucket>
{
    private readonly Func bucketProvider;

    internal SteppedPattern(Func bucketProvider)
    {
        this.bucketProvider = bucketProvider;
    }

    public ParseResult Parse(string value)
    {
        TBucket bucket = bucketProvider();

        // Rest of parsing goes here
    }
}

Now I can just call new SteppedPattern(() => new OffsetBucket()) or whatever. This also means I can keep the constructor internal, not that I care much. I could even reuse old parse buckets if that wouldn’t be a semantic problem – in other cases it could be useful. Hooray for lambda expressions – until we get to the next post, anyway.

Show me the figures!

You don’t want to have to run Noda Time’s benchmarks to see the results for yourself, so I wrote a small benchmark to time just the construction of a generic type. As a measure of how insignificant this would be for most apps, these figures are in milliseconds, performing 100 million iterations of the action in question. Unless you’re going to do this in performance-critical code, you just shouldn’t care.

Anyway, the benchmark has four custom types: two classes, and two structs – a small and a large version of each. The small version has a single int field; the large version has eight long fields. For each type, I benchmarked both approaches to initialization.

The results on two machines (32-bit and 64-bit) are below, for both the v2 CLR and v4. The 64-bit machine is much faster in general – you should only compare results within one machine, as it were.)

CLR v4: 32-bit results (ms per 100 million iterations)

Test type new() constraint Provider delegate
Small struct 689 1225
Large struct 11188 7273
Small class 16307 1690
Large class 17471 3017

CLR v4: 64-bit results (ms per 100 million iterations)

Test type new() constraint Provider delegate
Small struct 473 868
Large struct 2670 2396
Small class 8366 1189
Large class 8805 1529

CLR v2: 32-bit results (ms per 100 million iterations)

Test type new() constraint Provider delegate
Small struct 703 1246
Large struct 11411 7392
Small class 143967 1791
Large class 143107 2581

CLR v2: 64-bit results (ms per 100 million iterations)

Test type new() constraint Provider delegate
Small struct 510 686
Large struct 2334 1731
Small class 81801 1539
Large class 83293 1896

(An earlier version of this post had a mistake – my original tests used structs for everything, despite the names.)

Others have reported slightly different results, including the new() constraint being better for both large and small structs.

Just in case you hadn’t spotted them, look at the results for classes. Those are the real results – it took over 2 minutes to run the test using the new() constraint on my 32-bit laptop, compared with under two seconds for the provider. Yikes. This was actually the situation I was in for Noda Time, which is built on .NET 2.0 – it’s not surprising that so much of my benchmark’s time was spent constructing classes, given results like this.

Of course you can download the benchmark program for yourself and see how it performs on your machine. It’s a pretty cheap-and-cheerful benchmark, but when the differences are this big, minor sources of inaccuracy don’t bother me too much. The simplest way to run under CLR v2 is to compile with the .NET 3.5 C# compiler to start with.

What’s going on under the hood?

As far as I’m aware, there’s no IL to give support for the new() constraint, in terms of using the parameterless constructor. (The constraint itself can be expressed in IL though.) Instead, when we write new T() in C#, the compiler emits a call to Activator.CreateInstance. Apparently, that’s slower than calling a delegate – presumably due to trying to find an accessible constructor with reflection, and invoking it. I suspect it could be optimized relatively easily – e.g. by caching the results per type it’s called with, in terms of delegates. I’m slightly surprised this hasn’t (apparently) been optimized, given how easy it is to cache values by generic type. No doubt there’s a good reason lurking there somewhere, even if it’s only the memory taken up by the cache.

Either way, it’s easy to work around in general.

Conclusion

I wouldn’t have found this gotcha if I didn’t have before and after tests (or in this case, side-by-side tests of the old way and the new way of parsing). The real lesson of this post shouldn’t be about the new() constraint – it should be how important it is to test performance (assuming you care), and how easy it is to assume certain operations are cheap.

Next post: something much weirder.