Micro-optimization: the surprising inefficiency of readonly fields

July 16, 2014 jonskeet 30 Comments

Introduction

Recently I’ve been optimizing the heck out of Noda Time. Most of the time this has been the normal cycle: measure, find bottlenecks, carefully analyse them, lather, rinse, repeat. Yesterday I had a hunch about a particular cost, and decided to experiment… leading to a surprising optimization.

Noda Time’s core types are mostly value types – date/time values are naturally value types, just as DateTime and DateTimeOffset are in the BCL. Noda Time’s types are a bit bigger than most value types, however – the largest being ZonedDateTime, weighing in at 40 bytes in an x64 CLR at the moment. (I can shrink it down to 32 bytes with a bit of messing around, although it’s not terribly pleasant to do so.) The main reason for the bulk is that we have two reference types involved (the time zone and the calendar system), and in Noda Time 2.0 we’re going to have nanosecond resolution instead of tick resolution (so we need 12 bytes just to store a point in time). While this goes against the Class Library Design Guidelines, it would be odd for the smaller types (LocalDate, LocalTime) to be value types and the larger ones to be reference types. Overall, these still feel like value types.

A lot of these value types are logically composed of each other:

A LocalDate is a YearMonthDay and a CalendarSystem reference
A LocalDateTime is a LocalDate and a LocalTime
An OffsetDateTime is a LocalDateTime and an Offset
A ZonedDateTime is an OffsetDateTime and a DateTimeZone reference

This leads to a lot of delegation, potentially – asking a ZonedDateTime for its Year could mean asking the OffsetDateTime, which would ask the LocalDateTime, which would ask the LocalDate, which would ask the YearMonthDay. Very nice from a code reuse point of view, but potentially inefficient due to copying data.

Why would there be data copying involved? Well, that’s where this blog post comes in.

Behaviour of value type member invocations

When an instance member (method or property) belonging to a value type is invoked, the exact behaviour depends on the kind of expression it is called on. From the C# 5 spec, section 7.5.5 (where E is the expression the member M is invoked on, and the type declaring M is a value type):

If E is not classified as a variable, then a temporary local variable of E’s type is created and the value of E is assigned to that variable. E is then reclassified as a reference to that temporary local variable. The temporary variable is accessible as this within M, but not in any other way. Thus, only when E is a true variable is it possible for the caller to observe the changes that M makes to this.

So when is a variable not a variable? When it’s readonly… from section 7.6.4 (emphasis mine):

If T is a struct-type and I identifies an instance field of that struct-type:

If E is a value, or if the field is readonly and the reference occurs outside an instance constructor of the struct in which the field is declared, then the result is a value, namely the value of the field I in the struct instance given by E.

(There’s a very similar bullet for T being a class-type; the important part is that the field type is a value type.)

The upshot is that if you have a method call of:

int result = someField.Foo();

then it’s effectively converted into this:

var tmp = someField;
int result = tmp.Foo();

Now if the type of the field is quite a large value type, but Foo() doesn’t modify the value (which it never does within my value types), that’s performing a copy completely unnecessarily.
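
The hidden copy is easiest to observe with a deliberately mutable struct. The following is a minimal sketch, not code from Noda Time: the Counter and CopyDemo types are hypothetical and exist only to make the copy visible. Calling a mutating method through a readonly field changes the temporary copy, so the field itself never changes.

```csharp
using System;

// Hypothetical mutable struct, used only to make the hidden copy observable.
public struct Counter
{
    private int count;

    // Mutates "this" and returns the new count.
    public int Increment()
    {
        count++;
        return count;
    }
}

public class CopyDemo
{
    // Outside the constructor, a readonly field is classified as a value,
    // so each call below operates on a fresh temporary copy.
    private readonly Counter readonlyCounter = new Counter();

    // A read/write field is a variable, so calls operate on the field itself.
    private Counter mutableCounter = new Counter();

    public int BumpReadonly() { return readonlyCounter.Increment(); }
    public int BumpMutable() { return mutableCounter.Increment(); }

    public static void Main()
    {
        var demo = new CopyDemo();
        Console.WriteLine(demo.BumpReadonly()); // 1
        Console.WriteLine(demo.BumpReadonly()); // 1 again: the copy was incremented
        Console.WriteLine(demo.BumpMutable());  // 1
        Console.WriteLine(demo.BumpMutable());  // 2: the field itself was incremented
    }
}
```

The same copy happens when the method does not mutate anything; it is simply invisible apart from its cost.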

To see this in action outside Noda Time, I’ve built a little sample app.

Show me the code!

Our example is a simple 256-bit type, composed of 4 Int64 values. The type itself doesn’t do anything useful – it just holds the four values, and exposes them via properties. We then measure how long it takes to sum the four properties lots of times.

using System;
using System.Diagnostics;

public struct Int256
{
    private readonly long bits0;
    private readonly long bits1;
    private readonly long bits2;
    private readonly long bits3;

    public Int256(long bits0, long bits1, long bits2, long bits3)
    {
        this.bits0 = bits0;
        this.bits1 = bits1;
        this.bits2 = bits2;
        this.bits3 = bits3;
    }

    public long Bits0 { get { return bits0; } }
    public long Bits1 { get { return bits1; } }
    public long Bits2 { get { return bits2; } }
    public long Bits3 { get { return bits3; } }
}

class Test
{
    private readonly Int256 value;

    public Test()
    {
        value = new Int256(1L, 5L, 10L, 100L);
    }

    public long TotalValue
    {
        get
        {
            return value.Bits0 + value.Bits1 + value.Bits2 + value.Bits3;
        }
    }

    public void RunTest()
    {
        // Just make sure it’s JITted…
        var sample = TotalValue;
        Stopwatch sw = Stopwatch.StartNew();
        long total = 0;
        for (int i = 0; i < 1000000000; i++)
        {
            total += TotalValue;
        }
        sw.Stop();
        Console.WriteLine("Total time: {0}ms", sw.ElapsedMilliseconds);
    }

    static void Main()
    {
        new Test().RunTest();
    }
}

Built from the command line with /o+ /debug- and run in a 64-bit CLR (but not RyuJIT), this takes about 20 seconds on my laptop. We can make it much faster with just one small change:

class Test
{
    private Int256 value;

    // Code as before
}

The same test now takes about 4 seconds – a 5-fold speed improvement, just by making a field non-readonly. If we look at the IL for the TotalValue property, the copying becomes obvious. Here it is when the field is readonly:

.method public hidebysig specialname instance int64
        get_TotalValue() cil managed
{
// Code size       60 (0x3c)
.maxstack 2
.locals init (valuetype Int256 V_0,
           valuetype Int256 V_1,
           valuetype Int256 V_2,
           valuetype Int256 V_3)
IL_0000: ldarg.0
IL_0001: ldfld      valuetype Int256 Test::'value'
IL_0006: stloc.0
IL_0007: ldloca.s   V_0
IL_0009: call       instance int64 Int256::get_Bits0()
IL_000e: ldarg.0
IL_000f: ldfld      valuetype Int256 Test::'value'
IL_0014: stloc.1
IL_0015: ldloca.s   V_1
IL_0017: call       instance int64 Int256::get_Bits1()
IL_001c: add
IL_001d: ldarg.0
IL_001e: ldfld      valuetype Int256 Test::'value'
IL_0023: stloc.2
IL_0024: ldloca.s   V_2
IL_0026: call       instance int64 Int256::get_Bits2()
IL_002b: add
IL_002c: ldarg.0
IL_002d: ldfld      valuetype Int256 Test::'value'
IL_0032: stloc.3
IL_0033: ldloca.s   V_3
IL_0035: call       instance int64 Int256::get_Bits3()
IL_003a: add
IL_003b: ret
} // end of method Test::get_TotalValue

And here it is when the field’s not readonly:

.method public hidebysig specialname instance int64
        get_TotalValue() cil managed
{
// Code size       48 (0x30)
.maxstack 8
IL_0000: ldarg.0
IL_0001: ldflda     valuetype Int256 Test::'value'
IL_0006: call       instance int64 Int256::get_Bits0()
IL_000b: ldarg.0
IL_000c: ldflda     valuetype Int256 Test::'value'
IL_0011: call       instance int64 Int256::get_Bits1()
IL_0016: add
IL_0017: ldarg.0
IL_0018: ldflda     valuetype Int256 Test::'value'
IL_001d: call       instance int64 Int256::get_Bits2()
IL_0022: add
IL_0023: ldarg.0
IL_0024: ldflda     valuetype Int256 Test::'value'
IL_0029: call       instance int64 Int256::get_Bits3()
IL_002e: add
IL_002f: ret
} // end of method Test::get_TotalValue

Note that it’s still loading the field address (ldflda) four times. You might expect that copying the field onto the stack once via a temporary variable would be faster, but that ends up at about 6.5 seconds on my machine.
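
The temporary-variable variant isn't shown in the post, so the following is only a sketch of what it might look like. The class name LocalCopyTest and the local name copy are assumptions; Int256 is the same type as in the sample above, repeated so the block compiles on its own.

```csharp
using System;

public struct Int256
{
    private readonly long bits0;
    private readonly long bits1;
    private readonly long bits2;
    private readonly long bits3;

    public Int256(long bits0, long bits1, long bits2, long bits3)
    {
        this.bits0 = bits0;
        this.bits1 = bits1;
        this.bits2 = bits2;
        this.bits3 = bits3;
    }

    public long Bits0 { get { return bits0; } }
    public long Bits1 { get { return bits1; } }
    public long Bits2 { get { return bits2; } }
    public long Bits3 { get { return bits3; } }
}

// Hypothetical name; stands in for the Test class from the sample.
class LocalCopyTest
{
    private readonly Int256 value = new Int256(1L, 5L, 10L, 100L);

    public long TotalValue
    {
        get
        {
            // One explicit copy into a local variable. The local is a true
            // variable, so the four property calls take its address directly
            // instead of copying the field before each call.
            Int256 copy = value;
            return copy.Bits0 + copy.Bits1 + copy.Bits2 + copy.Bits3;
        }
    }

    static void Main()
    {
        Console.WriteLine(new LocalCopyTest().TotalValue); // 116
    }
}
```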

There is an optimization which is even faster – moving the totalling property into Int256. That way (with the non-readonly field, still) the total time is less than a second – twenty times faster than the original code!
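
As a rough sketch of that last variant: the property name Total and the class name TotalOnStructTest are assumptions, since the post doesn't show this code. Inside the struct's own members the fields are read directly through this, so no property calls on the containing class's field are needed beyond the single call to Total.

```csharp
using System;

public struct Int256
{
    private readonly long bits0;
    private readonly long bits1;
    private readonly long bits2;
    private readonly long bits3;

    public Int256(long bits0, long bits1, long bits2, long bits3)
    {
        this.bits0 = bits0;
        this.bits1 = bits1;
        this.bits2 = bits2;
        this.bits3 = bits3;
    }

    // Hypothetical property: sums the fields inside the struct itself.
    public long Total { get { return bits0 + bits1 + bits2 + bits3; } }
}

// Hypothetical name; stands in for the Test class from the sample.
class TotalOnStructTest
{
    // Non-readonly, as in the fastest configuration described above.
    private Int256 value = new Int256(1L, 5L, 10L, 100L);

    // A single member invocation on the field instead of four.
    public long TotalValue { get { return value.Total; } }

    static void Main()
    {
        Console.WriteLine(new TotalOnStructTest().TotalValue); // 116
    }
}
```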

Conclusion

This isn’t an optimization I’d recommend in general. Most code really doesn’t need to be micro-optimized this hard, and most code doesn’t deal with large value types like the ones in Noda Time. However, I regard Noda Time as a sort of "system level" library, and I don’t ever want someone to decide not to use it on performance grounds. My benchmarks show that for potentially-frequently-called operations (such as the properties on ZonedDateTime) it really does make a difference, so I’m going to go for it.

I intend to apply a custom attribute to each of these "would normally be readonly" fields to document the intended behaviour of the field – and then when Roslyn is fully released, I’ll probably write a test to validate that all of these fields would still compile if the field were made readonly (i.e. that they’re never assigned to outside the constructor).

Aside from anything else, I find the subtle difference in behaviour between a readonly field and a read/write field fascinating… it’s something I’d been vaguely aware of in the past, but this is the first time that it’s had a practical impact on me. Maybe it’ll never make any difference to your code… but it’s probably worth being aware of anyway.

30 thoughts on “Micro-optimization: the surprising inefficiency of readonly fields”

Mark Rendle says:

July 19, 2014 at 8:54 am

That’s interesting.

I’d be tempted to use #if around declarations in this instance:

#if(DEBUG)
private readonly int _x;
#else
private int _x;
#endif

Doesn’t have to be DEBUG, obviously, but means you can just run a compile with whatever flag you use set to check the readonly behaviour.

pete.d says:

July 19, 2014 at 12:25 pm

It would be interesting to see a performance comparison that a) uses a value type as large as the real-world one (i.e. 320 bits instead of 256), and b) considers the performance of the reference type equivalent.

Even more interesting would be to see a more realistic performance profile, one designed to show the real-world performance impact of using a 40-byte value type instead of a reference type.

I empathize with the goal to ensure no one eschews NodaTime on the basis of performance. But it’s not clear to me that a profile performed here does that.

The most useful result is having demonstrated that, just as one might suppose, putting the arithmetic inside the value type instead of repeatedly retrieving the value type’s internals provides the best performance. But at the end of the day, what if even that optimization provides only a fraction the performance that a reference type would?

skeet says:

July 20, 2014 at 2:19 am

@M

skeet says:

July 20, 2014 at 2:20 am

@Mark: That’s an interesting idea. I like the way I could easily switch between configurations to check that it would still compile. I have something similar for preconditions of internal methods. Hmm.

skeet says:

July 20, 2014 at 2:37 am

@pete.d: The fact that OffsetDateTime and ZonedDateTime are value types has more to do with consistency than the performance goal. They still *feel* like natural value types to me – and I’d find it hard to justify why (say) LocalDateTime would be a value type but OffsetDateTime would be a reference type.

I agree that the performance tests I’ve done don’t address that question – but I’m actually not sure how I *would* address that, as suddenly the frequency of different operations becomes more important… there’s more cost to construction due to allocation and GC, so I’d need to understand the relative frequency of “create new value” vs “interrogate existing value”. It’s a question which would be good to answer, but which I don’t have the data for at the moment.

Of course I could still *try* making ZonedDateTime a class and run the existing benchmarks, just for the sake of interest…

Frank Niemeyer says:

July 20, 2014 at 8:23 am

This seems to be a problem of JIT64, specifically; if you look at the generated machine code, there is lots of unnecessary copying going on in the readonly case. Although the x86 JIT compiler produces less efficient code for the readonly case as well, the difference is much less severe (approx. +14% runtime compared to the “not-readonly” case on a Core i5-4570).

pete.d says:

July 20, 2014 at 3:36 pm

Re: value vs reference type

Are these values in NodaTime immutable? If not, well there’s a problem right there.

If they are immutable, then so too could be the reference type version of the objects.

And in that case I would be surprised if reference type overheads of allocation and GC were significant. As immutable reference types, new instances need be made only when new values are actually made, and allocation and GC is likely to be _less_ expensive than copying 40 bytes every time the value is passed from one place in the code to another (e.g. down a method call stack or across threads).

As I know you’re well aware, allocation in .NET is basically free. During the lifetime of the object, you win by a margin of 5x even on x64 (10x on x86…and that ignores the fact that register-sized pointers are cheaper even per byte to copy than strings of bytes) on passing/copying/etc. So the only real overhead is collection, but of course .NET is well-optimized for dealing with short-lived objects, and long-lived objects get to exercise their 5x copying advantage more often.

I admit, I am not fully versed in the design philosophies behind the C# distinction between value and reference types. But it’s my understanding that the main goal was to expose to the C# programmer more control over allocation strategies. I don’t see value vs reference as having any real semantic benefit, so the idea that a type “feels like” a value type is foreign to me. Or rather, to me the way to deal with a type that “feels like” a value is really that it feels like something that ought to be immutable. Which is as easily accomplished with reference types as with value types.

Back to the performance question then…the one place I can see an argument in favor of sticking with a value type is if you expect there to be very large arrays of these objects, where most or all of the value instances are unique. But even there, you’re talking only 20% overhead (10% on x86) for a specific scenario, where in other scenarios there may be no performance benefit.

Well, sorry for the rambling. The bottom line is that I believe that in truth, time is not such a central element of nearly any program that these types of decisions would have real performance impacts (i.e. ones a user would notice) in the vast majority of cases.

You’re also correct that doing a performance analysis that takes into account a reference type implementation would be much more complicated. But a performance analysis minus consideration of reference types isn’t a valid analysis.

I.e. it’s academically interesting to understand the impact of readonly versus read/write value type members, but inasmuch as it presupposes value types instead of reference types, it seems to miss the bigger picture.

I just think that inasmuch as anyone might worry about performance to this degree, that the basic choice of using value types in the first place seems questionable. :) And if there are non-performance reasons to stick with value types, then I think the goal of “don’t _ever_ [emphasis pete.d] want someone to decide not to use it on performance grounds” is a distraction and should probably be abandoned. Because if using a value type for non-performance reasons results in less-than-optimal performance in any scenario, it’s practically certain at least one person will opt out of NodaTime for performance reasons.

(My apologies if this is a repeat submission…the server seems very slow and I have not gotten a confirmation of posting)

Chris Sinclair says:

July 21, 2014 at 9:09 am

@Mark: Not sure if it’s a good idea or not, but you can avoid redeclaring (that is, repeating yourself) the variable:

private
#if(DEBUG)
readonly
#endif
int _x;

or

#if(DEBUG)
readonly
#endif
private int _x;

Might throw someone for a loop reading the code though.

Guillaume Pouillet says:

July 28, 2014 at 11:53 am

Why not use a custom FxCop rule?
You just have to check, for every field with your custom attribute, whether there is an assignment outside of its class’s constructor.

1. jonskeet says:
  
  July 28, 2014 at 11:55 am
  
  Firstly, ‘cos I don’t use FxCop at the moment :) It may well handle this particular case well, but there are other things I’m going to want to check which require a richer API than FxCop provides. I’m going to lean heavily on Roslyn for those…
  
  1. Jakub Linhart (@JakubLinhart) says:
    
    July 30, 2014 at 1:34 pm
    
    I would love to hear more about how you are going to use Roslyn to check those rules.
    
    1. jonskeet says:
      
      July 30, 2014 at 1:36 pm
      
      It’s still a far-off dream at the moment, but just validating that nothing ever writes to the fields outside a constructor should be reasonably straightforward.
      
      1. Torleif Berger (@svish) says:
        
        August 4, 2014 at 9:52 am
        
        Even if reasonably straightforward, I’d still like to hear how you’d do such a thing :)
        
        
        jonskeet says:
        
        August 4, 2014 at 9:56 am
        
        I’m sure I’ll have a blog post about that when I get round to it :)
        
Gregory says:

July 28, 2014 at 4:39 pm

Reblogged this on Under Framework.


Hello everyone,

Based on Jon’s code, I’ve been running some tests (with 100 million instead of 1 billion iterations, though). Because both Test’s value field and Int256’s Bits fields are readonly, I’ve been benchmarking all possible permutations (and additionally the case where the TotalValue property was moved into the struct itself), with the following results:

.NET 4.0 results
Release x64:
Readonly value, readonly fields:                              1763 ms
Readonly value, non-readonly fields:                          1709 ms
Readonly value, non-readonly fields, Total on structure:       389 ms
Non-Readonly value, readonly fields:                           255 ms
Non-Readonly value, non-readonly fields:                       256 ms
Non-Readonly value, non-readonly fields, Total on structure:   249 ms

Release x86:
Readonly value, readonly fields:                               428 ms
Readonly value, non-readonly fields:                           412 ms
Readonly value, non-readonly fields, Total on structure:        32 ms
Non-Readonly value, readonly fields:                           399 ms
Non-Readonly value, non-readonly fields:                       405 ms
Non-Readonly value, non-readonly fields, Total on structure:   491 ms

Can someone explain to me why
1. The x64 versions get ‘progressively’ better the more you optimize for non-readonly modifiers, while the x86 versions do not?
2. There is this ridiculous speedup at test case 3 for Release x86 builds?
3. Even more confusing, compiling this for .NET 2.0 yields:

.NET 2.0 results
Release x64:
Readonly value, readonly fields:                              1567 ms
Readonly value, non-readonly fields:                          1586 ms
Readonly value, non-readonly fields, Total on structure:       326 ms
Non-Readonly value, readonly fields:                            32 ms
Non-Readonly value, non-readonly fields:                        32 ms
Non-Readonly value, non-readonly fields, Total on structure:    32 ms

Release x86:
Readonly value, readonly fields:                               768 ms
Readonly value, non-readonly fields:                           765 ms
Readonly value, non-readonly fields, Total on structure:        33 ms
Non-Readonly value, readonly fields:                           571 ms
Non-Readonly value, non-readonly fields:                       505 ms
Non-Readonly value, non-readonly fields, Total on structure:   579 ms

Pastebin of the Benchmark code: http://pastebin.com/uu6EBMwD


Perepechko Grigory says:

October 26, 2015 at 12:01 pm

Bullshit.

If you add Console.WriteLine(total), the results become completely different, and of course much more predictable. The optimizer simply cuts out all your code. And BTW, how do you get the assembly code? Via the VS Disassembly window under the debugger?

The only interesting thing for me was, WHY in x86 accessing bits0-bits3 via getters were slightly faster than via readonly fields. Try doing this after adding Console.WriteLine(total).

1. jonskeet says:
  
  October 26, 2015 at 3:12 pm
  
  The results don’t become different for me – the non-readonly form is still vastly faster than the one with the readonly fields, and the timing is basically the same as without printing out the total. If you’re seeing something different, that’s odd… So no, this isn’t due to the optimizer removing code. The IL was obtained via ildasm – note that this is only IL, not native assembly code.
  
  I haven’t shown it here, but another option is copying the Int256 value to another variable within the getter, and then using the getters of that – which gives an intermediate speed.
  
  You can doubt the benefits of this all you like, but I’ve seen it reproduced in various different scenarios in Noda Time – it really makes a difference.
  
Anton says:

October 26, 2015 at 4:06 pm

Try the x86 compiler; you will be surprised (using total or not).

1. jonskeet says:
  
  October 26, 2015 at 4:10 pm
  
  Just tried, and using /platform:x86 does make the difference much smaller (although printing total or not doesn’t affect things) – but personally I’m more interested in x64 performance anyway, for Noda Time.
  
Anton says:

October 26, 2015 at 4:17 pm

And try making BitsX public readonly fields, instead of getters, on the x64 compiler.

1. jonskeet says:
  
  October 26, 2015 at 4:20 pm
  
  Well yes, but at that point that changes the “If E is not classified as a variable” part of the spec that I quoted…
  
Ramon Smits (@ramonsmits) says:

March 20, 2017 at 8:28 am

Does this still hold up with the current revisions of the framework? It’s now March 2017, so almost 3 years have gone by.

1. jonskeet says:
  
  March 20, 2017 at 8:30 am
  
  The framework doesn’t matter much – only the runtime does. And the language rules are the same: a copy is required. That said, I have noticed some changes when demonstrating things, so the set of situations where it makes a difference has changed over time.
  
Kent Boogaart says:

August 19, 2017 at 11:13 am

Sorry to necro this thread, but I didn’t see any mention of using a Fody addin to strip initonly from struct fields (I just went looking for such an addin and found this post in the process). Such an approach would give you the compile-time benefits of readonly whilst yielding the run-time benefits of better performance. Of course, you could require an opt-in strategy where structs must be annotated with a [StripInitOnly] attribute or whatever.

Pingback: Implementing IXmlSerializable in readonly structs | Jon Skeet's coding blog
Pingback: Novidades do C# 7.2: Estruturas imutáveis | Lambda3
sollyucko says:

March 10, 2019 at 11:40 pm

One thing that could help would be if methods could be marked as const, i.e. pure in the sense of not changing anything. In this case, marking the getters const would allow the compiler to perform various optimizations.

Frédéric Del says:

February 2, 2020 at 3:34 pm

I have to admit that I struggled a bit to understand your extract from the C# specs.

One nice thing about ReSharper is that it can spot this kind of issue directly. If you call a method GetBit0() on a readonly struct, you get the message “Possibly impure struct method called on readonly variable: struct value always copied before invocation”, which makes it clear that readonly will trigger a copy before any access.

Unfortunately, the check doesn’t appear when you call properties. I suppose this check is there to spot potential bugs rather than to make the performance impact more understandable.

Christophe Bertrand.net says:

April 17, 2021 at 7:37 am

Hi

On .NET 5 (Core), removing the readonly keyword is not faster [64 bits release compilation opened outside Visual Studio].
The IL code is still different with and without readonly. And slightly different from yours by the way.
So I suppose the JIT now optimizes away the problem you point out. Maybe thanks to articles like this one.

In conclusion, there is no reason to remove readonly anymore.
