Optimization and generics, part 2: lambda expressions and reference types

As with almost any performance work, your mileage may vary (in particular the 64-bit JIT may work differently) and you almost certainly shouldn’t care. Relatively few people write production code which is worth micro-optimizing. Please don’t take this post as an invitation to make code more complicated for the sake of irrelevant and possibly mythical performance changes.

It took me a surprisingly long time to find the problem described in the previous blog post, and almost no time at all to fix it. I understood why it was happening. This next problem took a while to identify at all, but even when I’d found a workaround I had no idea why it worked. Furthermore, I couldn’t reproduce it in a test case… because I was looking for the wrong set of triggers. I’ve now found at least some of the problem though.

This time the situation in Noda Time is harder to describe, although it concerns the same area of code. In various places I need to create new delegates containing parsing steps and add them to the list of steps required for a full parse. I can always use lambda expressions, but in many cases I’ve got the same logic repeatedly… so I decided to pull it out into a method. Bang – suddenly the code runs far slower. (In reality, I’d performed this refactoring first, and "unrefactored" it to speed things up.)

I think the problem comes down to method group conversions with generic methods and a type argument which is a reference type. The CLR isn’t very good at them, and the C# compiler uses them more than it needs to.

Show me the benchmark!

The complete benchmark code is available of course, but fundamentally I’m doing the same thing in each test case: creating a delegate of type Action which does nothing, and then checking that the delegate reference is non-null (just to avoid the JIT optimizing it away). In each case this is done in a generic method with a single type parameter. I call each method in two ways: once with int as the type argument, and once with string as the type argument. Here are the different cases involved:

  • Use a lambda expression: Action foo = () => {};
  • Fake what I expected the compiler to do: keep a separate generic cache class with a static variable for the delegate; populate the cache once if necessary, and thereafter use the cache field
  • Fake what the compiler is actually doing with the lambda expression: write a separate generic method and perform a method group conversion to it
  • Do what the compiler could do: write a separate non-generic method and perform a method group conversion to it
  • Use a method group conversion to a static (non-generic) method on a generic type
  • Use a method group conversion to an instance (non-generic) method on a generic type, via a generic cache class with a single field in referring to an instance of the generic class

(Yes, the last one is a bit convoluted – but the line in the method itself is simple: Action foo = ClassHolder<T>.SampleInstance.NoOpInstance;

Remember, we’re doing each of these in a generic method, and calling that generic method using a type argument of either int or string. (I’ve run a few tests, and the exact type isn’t important – all that matters is that int is a value type, and string is a reference type.)

Importantly, we’re not capturing any variables, and the type parameter is not involved in either the delegate type or any part of the implementation body.

Benchmark results

Again, times are in milliseconds – but this time I didn’t want to run it for 100 million iterations, as the "slow" versions would have taken far too long. I’ve run this on the x64 JIT as well and seen the same effect, but I haven’t included the figures here.

Times in milliseconds for 10 million iterations

Test TestCase<int> TestCase<string>
Lambda expression 180 29684
Generic cache class 90 288
Generic method group conversion 184 30017
Non-generic method group conversion 178 189
Static method on generic type 180 29276
Instance method on generic type 202 299

Yes, it’s about 150 times slower to create a delegate from a generic method with a reference type as the type argument than with a value type… and yet this is the first I’ve heard of this. (I wouldn’t be surprised if there were a post from the CLR team about it somewhere, but I don’t think it’s common knowledge by any means.)

Conclusion

One of the tricky things is that it’s hard to know exactly what the C# compiler is going to do with any given lambda expression. In fact, the method which was causing me grief earlier on isn’t generic, but it’s in a generic type and captures some variables which use the type parameters – so perhaps that’s causing a generic method group conversion somewhere along the way.

Noda Time is a relatively extreme case, but if you’re using delegates in any performance-critical spots, you should really be aware of this issue. I’m going to ping Microsoft (first informally, and then via a Connect report if that would be deemed useful) to see if there’s an awareness of this internally as potential "gotcha", and whether there’s anything that can be done. Normal trade-offs of work required vs benefit apply, of course. It’s possible that this really is an edge case… but with lambdas flying everywhere these days, I’m not sure that it is.

Maybe tomorrow I’ll actually be able to finish getting Noda Time moved onto the new system… all of this performance work has been a fun if surprising distraction from the main job of shipping working code…

17 thoughts on “Optimization and generics, part 2: lambda expressions and reference types”

  1. I don’t have the time to go spelunking in the CLR at the moment, but a thought to consider: when you (if you are a runtime) instantiate the body of a generic method, you have two options: you can fully specialize it to the type arguments, or you can try and share a single method body across multiple type arguments, if the generated machine code is going to be compatible. But there’s always the danger that the user is going to do something silly like refer to one of the type arguments; that seems to hamper sharing, right? Well, not necessarily – not if you can somehow wrangle the type argument(s) into hidden actual arguments of the generic method bodies.

    If your method is a non-generic instance method on a generic class, sticking the type arguments somewhere in the instance will do (you can get them via ‘this’). But if it’s a static method, that’s not available. What can you do? You could try and modify all the call sites; but that won’t work when the thing is stuffed in a delegate. An option there is to create a thunk, a tiny method whose job is to shuffle arguments around to introduce the new type arguments and then forward the call. But that shuffling is going to be somewhat expensive.

    Of course, if the instantiation types are not amenable to code sharing (usually only reference types are shared, via System.__Canon “placeholder” type argument IIRC), then all of this is irrelevant – you can simply embed the types directly in the code. But for System.String, not so much.

    Note: without spelunking in the CLR generated machine code with the help of WinDBG and SOS, all the above is just an educated guess.

    Like

  2. Oh, and I can pretty much guarantee that the guy who wrote the code to enable the combination of method body sharing and storing references to same in delegates knows about the tradeoff :) Hopefully you can get MS to explain this mechanism to you. From my point of view it’s fairly clear what’s happening – there’s few other ways of doing it that don’t have similar or worse drawbacks.

    Like

  3. Both of these are more than interesting, thanks for posting.

    I might make issue with your closing remark though:

    “…but if you’re using delegates in any performance-critical spots, you should really be aware of this issue.”

    Delegates are not the underlying issue here, generics are!

    Like

  4. @Barry: Right, that all makes sense. But does that mean there’s the same sort of hit when *calling* a method? And could’t the JIT detect cases where the user is *not* going to refer to the type arguments?

    Like

  5. When you talked about the 150 times difference, I remembered a different cheap-and-cheerful benchmark I once did based on your article https://codeblog.jonskeet.uk/2008/08/09/making-reflection-fly-and-exploring-delegates

    The delegate that your technique creates is about 3 times slower than accessing a property directly*, but using PropertyInfo.GetValue() is _drum roll_ about 150 times slower!

    This could of course be a total coincidence.

    However, now that I think about it, your technique does create a generic delegate in the end, right? So why doesn’t this delegate suffer the same performance penalty?

    Something to ponder about…

    *
    the property was a simple auto property
    accessing directly meant hard coding `foo = obj.Value;` in the benchmark

    Like

  6. @Patrick: I suspect that’s a coincidence. It’s worth remembering the difference between a generic delegate (i.e. where the delegate type itself is generic) and a delegate whose target is a generic method, which is the situation we’re looking at here.

    Like

  7. If you step through the code in the debugger, you’ll see that the slow case is actually the normal (unoptimized) case.

    For the other cases, the JIT actually specializes the delegate construction, but for the slow case it falls back to calling the general COMDelegate::DelegateConstruct() implementation, which obviously has to do a lot of bookkeeping and lookups.

    So I don’t think there’s anything interesting going on, the CLR folks just didn’t get around to optimizing this case.

    Like

  8. I remember already sawing that performance problem elsewhere. When working with Retlang, an high performance asynchronous library, I remember sawing this post by Mike Rettig, Retlang’s author: http://social.msdn.microsoft.com/Forums/en-US/clr/thread/dd66d965-73cc-4d7d-9fe6-3b9bd3c55c57

    The described problem uses closures though, whereas yours don’t, but both seem closely related. What’s very interesting is that the bug reported in the above post was fixed in CLR 4.

    After I few tests, it appears to only be fixed for closures! In your benchmark, adding an useless closure to the foo action speeds things up: value types and reference types have equivalent speeds. Targeting .NET 3.5 and running on the CLR 2.0 make the performance problem come back, closure or not, since it wasn’t fixed for this runtime.

    It seems that Microsoft is already aware of this bug, has already fixed it, just not in every case.

    Like

  9. @Julien: Ooh, that’s very interesting – thanks. I’ll see if I can work out what’s going on. In particular, I wonder whether this is a difference in the C# compiler or the CLR itself. If it’s in the compiler, I may be able to fake it in .NET 2…

    Like

  10. Interestingly, Mono seems to have fixed the same issue:
    on Mono JIT 2.6.7, the gap between slow/fast was ~170x; On Mono JIT 2.10.4 the difference between reftype/valuetype is all but gone and the gap is reduced to a mere ~30x

    But the sensational part was this:
    ===================================

    Enabling the sgen garbage collector reduces the gap between slowest/fastest methods from ~30x to a more sane ~6x ratio. Enabling sgen reduced the net runtime of the whole benchmark by more than a factor of ~6x.

    Note the overal benchmark runtime reduction nets a whopping ~18x between 2.6.7 and 2.10.4-sgen.

    Recompiling with the new CLR compiler and .NET 4.0 framework made no visible difference.

    I could squeeze out a drop or two more performance by setting the MONO_GENERIC_SHARING=none environment option: it results in reftypes having _exactly_ the same performance as the valuetype variants, as expected.

    —————————————–

    The data:
    ===================
    On Mono (linux 64bit) I got the following results (unmodified benchmark).

    [mono-2.6.7] ~ @ gmcs -optimize+ GenericsAndLambdas.cs
    [mono-2.6.7] ~ @ time mono ./GenericsAndLambdas.exe
    Environment: CLR 2.0.50727.1433 on Unix 2.6.38.11
    LambdaInt32 : 1461ms
    LambdaString : 10498ms
    FakeCachedLambdaInt32 : 60ms
    FakeCachedLambdaString : 84ms
    GenericMethodGroupInt32 : 1132ms
    GenericMethodGroupString : 10390ms
    NonGenericMethodGroupInt32 : 1228ms
    NonGenericMethodGroupString : 1224ms
    StaticMethodOnGenericTypeInt32 : 1136ms
    StaticMethodOnGenericTypeString : 7782ms
    InstanceMethodOnGenericTypeInt32 : 1181ms
    InstanceMethodOnGenericTypeString: 8397ms

    real 0m44.623s
    user 0m44.490s
    sys 0m4.970s

    [mono-2.10.4] ~ @ time mono ./GenericsAndLambdas.exe
    Environment: CLR 2.0.50727.1433 on Unix 2.6.38.11
    LambdaInt32 : 1535ms
    LambdaString : 1568ms
    FakeCachedLambdaInt32 : 52ms
    FakeCachedLambdaString : 81ms
    GenericMethodGroupInt32 : 1629ms
    GenericMethodGroupString : 1807ms
    NonGenericMethodGroupInt32 : 1623ms
    NonGenericMethodGroupString : 1625ms
    StaticMethodOnGenericTypeInt32 : 1698ms
    StaticMethodOnGenericTypeString : 1654ms
    InstanceMethodOnGenericTypeInt32 : 1539ms
    InstanceMethodOnGenericTypeString: 1639ms

    real 0m16.504s
    user 0m17.260s
    sys 0m3.890s

    [mono] ~ @ time mono –gc=sgen ./GenericsAndLambdas.exe
    Environment: CLR 2.0.50727.1433 on Unix 2.6.38.11
    LambdaInt32 : 192ms
    LambdaString : 218ms
    FakeCachedLambdaInt32 : 49ms
    FakeCachedLambdaString : 81ms
    GenericMethodGroupInt32 : 202ms
    GenericMethodGroupString : 224ms
    NonGenericMethodGroupInt32 : 195ms
    NonGenericMethodGroupString : 197ms
    StaticMethodOnGenericTypeInt32 : 203ms
    StaticMethodOnGenericTypeString : 224ms
    InstanceMethodOnGenericTypeInt32 : 211ms
    InstanceMethodOnGenericTypeString: 305ms

    real 0m2.392s
    user 0m2.380s
    sys 0m0.030s

    Like

  11. @Seth: Thanks – that’s really interesting. Out of interest, are you able to run .NET on the same hardware, so we can compare absolute performance?

    Like

  12. I agree 100% about most applications not needing micro-optimization, ut I have also run into quite a few where the overall performance was due to the summation of lots of little “problems”, so [IMHO] it is important that most developers are aware of these potential pitfalls, and can call on someone when their impact is suspected.

    I just recently finished an application that built huge numbers of delegates as part of a simulation of a very dynamic environment. Getting the desired perofrmance was part trial and error (i.e. experimental/empricial trials) with different approaches, some was deep dives into the code (both at the IL and native levels). Every time I get down to that level, I learn something new about (a specific version of ) the CLR.

    Like

  13. @David: Agreed that the “death by a thousand cuts” issue is real. That said, I think it’s better to overemphasize the “don’t optimize until needed” message rather than underemphasizing it :)

    And again, I thoroughly agree on the “learning something new” front. Fascinating stuff.

    Like

  14. I wonder how fast in terms of MB/s parsing you can get if you do not use unsafe code and use delegates. I did ask on SO how fast one can get with managed code (http://stackoverflow.com/questions/7153315/how-to-parse-a-text-file-in-c-and-be-io-bound) but it seems that when you use SSDs or any modern hard disk your application will not be IO bound but as fast as your parser can get.

    Except from that my main concern with .NET is that you are quickly GC limitted (http://geekswithblogs.net/akraus1/archive/2011/08/13/146515.aspx) when your object graph becomes big (500MB+, several million objects). There are some mitigations like Forwarding Emtpy Values (http://geekswithblogs.net/akraus1/archive/2011/08/18/146583.aspx) but I have never read anything substantial how to program memory and GC efficient with .NET.

    Like

Leave a comment