Category Archives: Performance

Tracking down a performance hit

I’ve been following the progress of .NET Core with a lot of interest, and trying to make the Noda Time master branch keep up with it. The aim is that when Noda Time 2.0 eventually ships (apologies for the delays…) it will be compatible with .NET Core from the start. (I’d expected to be able to support netstandard1.0, but that appears to have too much missing from it. It looks like netstandard1.3 will be the actual target.)

I’ve been particularly looking forward to being able to run the Noda Time benchmarks (now using BenchmarkDotNet) to compare .NET Core on Linux with the same code on Windows. In order to make that a fair comparison, I now have two Intel NUCs, both sporting an i5-5250U and 8GB of memory.

As it happens, I haven’t got as far as running the benchmarks under .NET Core – but I am now able to run all the unit tests on both Linux and Windows, using both the net451 TFM and netcoreapp1.0.

When I did that recently, I was pretty shocked to see that (depending on which tests I ran) the tests were 6-10 times slower on Linux than on Windows, using netcoreapp1.0 in both cases. This post is a brief log of what I did to track down the problem.

Step 1: Check that there’s really a problem

Thought: Is this actually just a matter of not running the tests in a release configuration, or something similar?

Verification: I ran the tests several times, specifying -c Release on the command line to use the release build of both NodaTime.Test.dll and NodaTime.dll. Running under a debugger definitely wasn’t an issue, as this was all just done from the shell.

Additionally, I ran the tests in two ways – firstly, running the whole test suite, and secondly running with --where=cat!=Slow to avoid the few tests I’ve got which are known to be really pretty slow. They’re typically tests which compare the answers the BCL gives with the answers Noda Time gives, across the whole of history for a particular calendar system or time zone. I’m pleased to report that the bottleneck in these tests is almost always the BCL, but that doesn’t help to speed them up. If only the “slow” tests had been much slower on Linux, that might have pointed to the problems being in BCL calendar or time zone code.

The ratios vary, but there was enough of a problem under both circumstances for it to be worth looking further.

Step 2: Find a problematic test

I didn’t have very strong expectations one way or another about whether this would come down to some general problem in the JIT on Linux, or whether there might be one piece of code causing problems in some tests but not others. Knowing that there are significant differences in handling of some culture and time zone code between the Linux and Windows implementations, I wanted to find a test which used the BCL as little as possible – but which was also slow enough for the differences in timing to be pronounced and not easily explicable by the problems of measuring small amounts of time.

Fortunately, NUnit produces a TestResult.xml file which is easy to parse with LINQ to XML, so I could easily transform the results from Windows and Linux into a list of tests, ordered by duration (descending), and spot the right kind of test.

I found my answer in UmAlQuraYearMonthDayCalculatorTest.GetYearMonthDay_DaysSinceEpoch, which effectively tests the Um Al Qura calendar for self consistency, by iterating over every day in the supported time period and checking that we can convert from “days since Unix epoch” to an expected “year, month day”. In particular, this test doesn’t rely on the Windows implementation of the calendar, nor does it use any time zones, cultures or anything similar. It’s nicely self-contained.

This test took 2051ms on Linux and 295ms on Windows. It’s possible that those figures were from a debug build, but I repeated the tests using a release build and confirmed that the difference was still similar.

Step 3: Find the bottleneck

At this point, my aim was to try to remove bits of the test at a time, until the difference went away. I expected to find something quite obscure causing the difference – something like different CPU cache behaviour. I knew that the next step would be to isolate the problem to a small piece of code, but I expected that it would involve a reasonable chunk of Noda Time – at least a few types.

I was really lucky here – the first and most obvious call to remove made a big difference: the equality assertion. Assertions are usually the first thing to remove in tests, because everything else typically builds something that you use in the assertions… if you’re making a call without either using the result later or asserting something about the result, presumably you’re only interested in side effects.

As soon as I removed the call to Assert.AreEqual(expected, actual), the execution time dropped massively on Linux, but hardly moved on Windows: they were effectively on a par.

I wondered whether the problem was with the fact that I was asserting equality between custom structs, and so tried replacing the real assertions with assertions of equality of strings, then of integers. No significant difference – they all showed the same discrepancy between Windows and Linux.

Step 4: Remove Noda Time

Once I’d identified the assertions as the cause of the problem, it was trivial to start a new test project with no dependency on Noda Time, consisting of a test like this:

[Test]
public void Foo()
{
    for (int i = 0; i < 1000000; i++)
    {
        var x = 10;
        var y = 10;
        Assert.AreEqual(x, y);
    }
}

This still demonstrated the problem consistently, and allowed simpler experimentation with different assertions.

Step 5: Dig into NUnit

For once in my life, I was glad that a lot of implementation details of a framework were exposed publicly. I was able to try lots of different “bits” of asserting equality, in order to pin down the problem. Things I tried:

  • Assert.AreEqual(x, y): slow
  • Assert.That(x, Is.EqualTo(y)): slow
  • Constructing an NUnitEqualityComparer: fast
  • Calling NUnitEqualityComparer.AreEqual: fast. (Here the construction occurred before the loop, and the comparisons were in the loop.)
  • Calling Is.EqualTo(y): slow

The last bullets two bullets were surprising. I’d been tipped off that NUnitEqualityComparer uses reflection, which could easily differ in performance between Windows and Linux… but checking for equality seemed to be fast, and just constructing the constraint was slow. In poking around the NUnit source code (thank goodness for Open Source!) it’s obvious why Assert.AreEqual(x, y) and Assert.That(y, Is.EqualTo(x)) behave the same way – the former just calls the latter.

So, why is Is.EqualTo(y) slow (on Linux)? The method itself is simple – it just creates an instance of EqualConstraint. The EqualConstraint constructor body doesn’t do much… so I proved that it’s not EqualConstraint causing the problem by deriving my own constraint with a no-op implementation of ApplyTo… sure enough, just constructing that is slow.

That leaves the constructor of the Constraint abstract base class:

protected Constraint(params object[] args)
{
    Arguments = args;

    DisplayName = this.GetType().Name;
    if (DisplayName.EndsWith("`1") || DisplayName.EndsWith("`2"))
        DisplayName = DisplayName.Substring(0, DisplayName.Length - 2);
    if (DisplayName.EndsWith("Constraint"))
            DisplayName = DisplayName.Substring(0, DisplayName.Length - 10);
}

That looks innocuous enough… but maybe calling GetType().Name is expensive on Linux. So test that… nope, it’s fast.

At this point I’m beginning to wonder whether we’ll ever get to the bottom of it, but let’s just try…

[Test]
public void EndsWith()
{
    string text = "abcdefg";
    for (int i = 0; i < Iterations; i++)
    {
        text.EndsWith("123");
    }
}

… and sure enough, it’s fast on Windows and slow on Linux. Wow. Looks like we have a culprit.

Step 6: Remove NUnit

At this point, it’s relatively plain sailing. We can reproduce the issue in a simple console app. I won’t list the code here, but it’s in the GitHub issue. It just times calling EndsWith once (to get it JIT compiled) and then a million times. Is it the most rigorous benchmark in the world? Absolutely not… but when the difference is between 5.3s on Linux and 0.16s on Windows, on the same hardware, I’m not worried about inaccuracy of a few milliseconds here or there.

Step 7: File a CoreCLR issue

So, as I’ve shown, I filed a bug on GitHub. I’d like to think it was a pretty good bug report:

  • Details of the environment
  • Short but complete console app ready to copy/paste/compile/run
  • Results

Exactly the kind of thing I’d have put into a Stack Overflow question – when I ask for a minimal, complete example on Stack Overflow, this is what I mean.

Anyway, about 20 minutes later (!!!), Stephen Toub has basically worked out the nub of it: it’s a culture issue. Initially, he couldn’t reproduce it – he saw the same results on Windows and Linux. But changing his culture to en-GB, he saw what I was seeing. I then confirmed the opposite – when I ran the code having set LANG=en-US, the problem went away for me. Stephen pulled Matt Ellis in, who gave more details as to what was going wrong behind the scenes.

Step 8: File an NUnit issue

Matt Ellis suggested filing an issue against NUnit, as there’s no reason this code should be culture-sensitive. By specifying the string comparison as Ordinal, we can go through an even faster path than using the US culture. So

if (DisplayName.EndsWith("Constraint"))

becomes

if (DisplayName.EndsWith("Constraint", StringComparison.Ordinal))

… and the equivalent for the other two calls.

I pointed out in the issue that it was also a little bit odd that this was being worked out in every Constraint constructor call, when of course it’s going to give the same result for every instance of the same type. When “every Constraint constructor call” becomes “every assertion in an entire test run”, it’s a pretty performance-critical piece of code. While unit tests aren’t important in terms of performance in the same way that production code is, anything which adds friction is bad news.

Hopefully the NUnit team will apply the simple improvement for the next release, and then the CoreCLR team can attack the tougher underlying problem over time.

Step 9: Blog about it

Open up Stack Edit, start typing: “I’ve been following the progress”… :)

Conclusion

None of the steps I’ve listed here is particularly tricky. Diagnosing problems is often more a matter of determination and being unwilling to admit defeat than cleverness. (I’m not denying that there’s a certain art to being able to find the right seam to split the problem in two, admittedly.)

I hope this has been useful as a “start to finish” example of what a diagnostic session can look and feel like. It wasn’t one physical session, of course – I found bits of time to investigate it over the course of a day or so – but it would have been the same steps either way.

Smug, satisfied smile…

Object pooling and thread safety

A few days ago, I was watching Dustin Campell’s excellent talk from TechEd 2013, “Essential truths everyone should know about performance in a large managed code base”. I highly recommend it, and it made me think a bit more about Noda Time’s performance. (It doesn’t take much to make me think about Noda Time’s performance, admittedly.)

Base situation: no pooling

Most of our formatting goes through something called SteppedPattern – that’s basically one multicast delegate (with lots of actions) for formatting, and a list of delegates for parsing. The Format method was really simple before this week:

public string Format(TResult value)
{
    StringBuilder builder = new StringBuilder();
    // This will call all the actions in the multicast delegate.
    formatActions(value, builder);
    return builder.ToString();
}

Very simple – but it allocates a new StringBuilder on each call, which is somewhat unnecessary. Allocation is cheap, but not free – and string formatting is something which could reasonably happen an awful lot.

In the talk, Dustin showed an example where introducing a single object pool – also for StringBuilder – had made marked difference. Unfortunately, in my case it’s not as simple as we might like. Or at least, I’m not confident that it is. Let’s look at the options I’ve considered so far – in no particular order.

Option 1: Lock-free using Interlocked

After construction, a formatter can be used on multiple threads, so some case is needed to make sure we don’t end up using one StringBuilder from multiple threads at a time. Interlocked.Exchange makes this reasonably straightforward:

private StringBuilder cachedBuilder = new StringBuilder();
...
public string Format(TResult value)
{
    // Try to borrow the cached builder, in a thread-safe way.
    // (If another thread is using the builder, this will return null.)
    StringBuilder builder = Interlocked.Exchange(ref cachedBuilder, null);
    if (builder == null)
    {
        builder = new StringBuilder();
    }
    else
    {
        builder.Length = 0;
    }
    // This will call all the actions in the multicast delegate.
    formatActions(value, builder);
    string ret = builder.ToString();
    // Whether we managed to borrow the cached builder or not,
    // we can replace it with the one we've used, now that we
    // know what to return.
    Volatile.Write(ref cachedBuilder, builder);
    return ret;
}

(See comments for discussion on the Volatile.Write call, which was previously just an assignment statement, along with the use of Interlocked.Exchange compared with an earlier version using Interlocked.CompareExchange.)
I’m pretty confident that this does achieve the goal of only using the builder in one thread at a time. If an exception is thrown while formatting, that’s fine – we don’t end up putting anything back in the cache, and the next caller will allocate a new one.
I don’t believe we need to use Exchange a second time when writing to cachedBuilder – we’re certainly fine if it takes a while for that write to be seen by other threads, as the only downside would be not using the cached builder when we might want to.

But is that enough to make this code thread-safe? Is it possible that a write from one thread would only be seen in another thread after the second thread has borrowed the builder? Is it possible that the write to cachedBuilder actually occurs before the call to builder.ToString() completes, allowing another thread to start messing with the builder before we’ve got our result?

My guess is that it’s actually okay – but lock-free code fundamentally scares me, and I’m very cautious about writing it. (Elsewhere in Noda Time we do have some lock-free mutable data – but usually in the form of array-based caches where the element type is guaranteed to have atomic writes, and is immutable.)

So while this is “clever” and may be fine, it makes me nervous, which is not a good place to be.

Option 2: Locking

I’m much more confident about using threads when locking is involved. I find it easier to reason about what’s allowed and what’s not.

private readonly StringBuilder pooledStringBuilder = new StringBuilder();
private bool pooledBuilderInUse = false;
...

public string Format(TResult value)
{
    StringBuilder builder = null;
    lock (pooledStringBuilder)
    {
        if (!pooledBuilderInUse)
        {
            pooledBuilderInUse = true;
            builder = pooledStringBuilder;
            builder.Length = 0;
        }
    }
    try
    {
        builder = builder ?? new StringBuilder();
        // This will call all the actions in the multicast delegate.
        formatActions(value, builder);
        return builder.ToString();
    }
    finally
    {
        if (builder == pooledStringBuilder)
        {
            lock (pooledStringBuilder)
            {
                pooledBuilderInUse = false;
            }
        }
    }
}

Unlike the previous solution, this time we always know what the builder is (and never change it) – that gives us something to lock on. What we change within the lock is whether or not the builder is in use. This should be safe across multiple threads, but it’s ever so verbose – and in my benchmarks, the results weren’t terribly clear.

Option 3: ThreadStatic

The obvious alternative to sharing one builder between multiple threads is to have a
single builder per thread. There are two approaches to this in .NET:
[ThreadStatic] and ThreadLocal.

ThreadStatic is the older mechanism, and is simply a matter of decorating a static field with the [ThreadStatic] attribute. Each thread has a separate variable, as if by magic. So in principle, the implementation just looks like this:

[ThreadStatic]
private static StringBuilder cachedBuilder;

public string Format(TResult value)
{
    StringBuilder builder = cachedBuilder;
    if (cachedBuilder == null)
    {
        builder = new StringBuilder();
        cachedBuilder = builder;
    }
    else
    {
        builder.Length = 0;
    }
    // This will call all the actions in the multicast delegate.
    formatActions(value, builder);
    return builder.ToString();
}

Unfortunately we still need to check for nullity on every call, as any initialization
expression for the variable will only be executed by the first thread which uses the class. Still, the code is reasonably simple.

There’s a big “gotcha” with this though: our code is now not re-entrant. If one SteppedPattern has a format action which calls format on another SteppedPattern, we’ll end up trying to use the same builder for both operations, resetting the builder in the middle.

Fortunately, SteppedPattern already has a method (FormatPartial) designed for calling one formatter within another. In the very few places where this is required, I’m confident that it’s being called correctly – but I’m not keen on making my code fragile like this. It’s fairly safe from third party intervention – there are very few places (if any) where formatting code calls out to any other code which could include a format call. Just writing this paragraph, I’ve come up with a couple of worrying parts (fetching values from a CultureInfo, and fetching the ID of a time zone) but I think they’re safe… the question is whether I’ve missed something similar in another case. Either way, it would have to be pretty pathological code to cause a problem, and I consider the chances of this being a problem in the real world to be essentially zero.

However, I still want to guard against accidental re-entrancy within Noda Time itself – so I’ve got a debug-only version of Format which performs a crude re-entrancy check using another [ThreadStatic] variable. (And I’ve got a debug-only test to check that the check works, of course.) So long as my unit tests all still pass in a debug configuration, I’m reasonably confident. (You can see the check and the test in the revision that added this code. It may be gone in the latest version, of course – I’m still on the fence about all of this.)

There’s a second issue with [ThreadStatic] which is its level of support in the PCL. We just recently updated our PCL profiles, and I think using [ThreadStatic] would have been a problem if we’d continued to support Windows Phone 7, as that didn’t include [ThreadStatic] as far as I can tell. We’re fine in the current profile, but this is the first time I’ve knowing added code which prevents us from going back to a WP7-enabled profile.

Option 4: ThreadLocal

ThreadLocal is a more recent solution to the thread-local-storage problem. It’s generally a cleaner solution:

  • The variable doesn’t have to be static
  • You can specify an initialization delegate to be executed on first use in any particular thread
  • It’s clear from the type itself what you’re doing, rather than using an attribute which makes a variable act completely differently from normal

That means we can have one StringBuilder per formatter per thread (if we want; we could still make it a static variable if we wanted exactly one StringBuilder for each thread), and simplify the code within the Format method:

private ThreadLocal<StringBuilder> cachedBuilder =
    new ThreadLocal<StringBuilder>(() => new StringBuilder());

public string Format(TResult value)
{
    StringBuilder builder = cachedBuilder.Value;
    builder.Length = 0;
    // This will call all the actions in the multicast delegate.
    formatActions(value, builder);
    return builder.ToString();
}

However, I haven’t tested this code, as ThreadLocal required .NET 4.0 – whereas Noda Time targets .NET 3.5.

Conclusion

You may have noticed that I haven’t provided any performance figures in this post. That’s deliberate, because the results I’ve seen have had too much variance so far. The basic upshot of this is that I don’t yet have enough evidence of the benefit to justify the complexity – it’s just one StringBuilder per format operation, after all, and there’s plenty of other work going on at the same time. I believe there is a benefit, but it’s pretty marginal – and the level of benefit depends on which of the above options is taken.

I’m intrigued by the first option though, and whether experts would consider it to be safe. I still want to see a really good specification for the .NET memory model (which I know to be stronger than the one in the ECMA CLI specification) – I know it’s something that Microsoft has wanted to get round to writing up, but I don’t know if it’s actually happened… I fully expect to see contradictory opinions in comments :)

Micro-optimization: the surprising inefficiency of readonly fields

Introduction

Recently I’ve been optimizing the heck out of Noda Time. Most of the time this has been a case of the normal measurement, find bottlenecks, carefully analyse them, lather, rinse, repeat. Yesterday I had a hunch about a particular cost, and decided to experiment… leading to a surprising optimization.

Noda Time’s core types are mostly value types – date/time values are naturally value types, just as DateTime and DateTimeOffset are in the BCL. Noda Time’s types are a bit bigger than most value types, however – the largest being ZonedDateTime, weighing in at 40 bytes in an x64 CLR at the moment. (I can shrink it down to 32 bytes with a bit of messing around, although it’s not terribly pleasant to do so.) The main reason for the bulk is that we have two reference types involved (the time zone and the calendar system), and in Noda Time 2.0 we’re going to have nanosecond resolution instead of tick resolution (so we need 12 bytes just to store a point in time). While this goes against the Class Library Design Guidelines, it would be odd for the smaller types (LocalDate, LocalTime) to be value types and the larger ones to be reference types. Overall, these still feel like value types.

A lot of these value types are logically composed of each other:

  • A LocalDate is a YearMonthDay and a CalendarSystem reference
  • A LocalDateTime is a LocalDate and a LocalTime
  • An OffsetDateTime is a LocalDateTime and an Offset
  • A ZonedDateTime is an OffsetDateTime and a DateTimeZone reference

This leads to a lot of delegation, potentially – asking a ZonedDateTime for its Year could mean asking the OffsetDateTime, which would ask the LocalDateTime, which would ask the LocalDate, which would ask the YearMonthDay. Very nice from a code reuse point of view, but potentially inefficient due to copying data.

Why would there be data copying involved? Well, that’s where this blog post comes in.

Behaviour of value type member invocations

When an instance member (method or property) belonging to a value type is invoked, the exact behaviour depends on the kind of expression it is called on. From the C# 5 spec, section 7.5.5 (where E is the expression the member M is invoked on, and the type declaring M is a value type):

If E is not classified as a variable, then a temporary local variable of E’s type is created and the value of E is assigned to that variable. E is then reclassified as a reference to that temporary local variable. The temporary variable is accessible as this within M, but not in any other way. Thus, only when E is a true variable is it possible for the caller to observe the changes that M makes to this.

So when is a variable not a variable? When it’s readonly… from section 7.6.4 (emphasis mine) :

If T is a struct-type and I identifies an instance field of that class-type:

  • If E is a value, or if the field is readonly and the reference occurs outside an instance constructor of the struct in which the field is declared, then the result is a value, namely the value of the field I in the struct instance given by E.

(There’s a very similar bullet for T being a class-type; the important part is that the field type is a value type

The upshot is that if you have a method call of:

int result = someField.Foo();

then it’s effectively converted into this:

var tmp = someField;
int result = tmp.Foo();

Now if the type of the field is quite a large value type, but Foo() doesn’t modify the value (which it never does within my value types), that’s performing a copy completely unnecessarily.

To see this in action outside Noda Time, I’ve built a little sample app.

Show me the code!

Our example is a simple 256-bit type, composed of 4 Int64 values. The type itself doesn’t do anything useful – it just holds the four values, and exposes them via properties. We then measure how long it takes to sum the four properties lots of times.

using System;
using System.Diagnostics;

public struct Int256
{
    private readonly long bits0;
    private readonly long bits1;
    private readonly long bits2;
    private readonly long bits3;
    
    public Int256(long bits0, long bits1, long bits2, long bits3)
    {
        this.bits0 = bits0;
        this.bits1 = bits1;
        this.bits2 = bits2;
        this.bits3 = bits3;
    }
    
    public long Bits0 { get { return bits0; } }
    public long Bits1 { get { return bits1; } }
    public long Bits2 { get { return bits2; } }
    public long Bits3 { get { return bits3; } }
}

class Test
{
    private readonly Int256 value;

    public Test()
    {
        value = new Int256(1L, 5L, 10L, 100L);
    }
    
    public long TotalValue 
    { 
        get 
        {
            return value.Bits0 + value.Bits1 + value.Bits2 + value.Bits3; 
        }
    }
    
    public void RunTest()
    {
        // Just make sure it’s JITted…
        var sample = TotalValue;
        Stopwatch sw = Stopwatch.StartNew();
        long total = 0;
        for (int i = 0; i < 1000000000; i++)
        {
            total += TotalValue;
        }
        sw.Stop();
        Console.WriteLine("Total time: {0}ms", sw.ElapsedMilliseconds);
    }
    
    static void Main()
    {
        new Test().RunTest();
    }
}

Building this from the command line with /o+ /debug- and running (in a 64-bit CLR, but no RyuJIT) this takes about 20 seconds to run on my laptop. We can make it much faster with just one small change:

class Test
{
    private Int256 value;

    // Code as before
}

The same test now takes about 4 seconds – a 5-fold speed improvement, just by making a field non-readonly. If we look at the IL for the TotalValue property, the copying becomes obvious. Here it is when the field is readonly:

.method public hidebysig specialname instance int64 
        get_TotalValue() cil managed
{
  // Code size       60 (0x3c)
  .maxstack  2
  .locals init (valuetype Int256 V_0,
           valuetype Int256 V_1,
           valuetype Int256 V_2,
           valuetype Int256 V_3)
  IL_0000:  ldarg.0
  IL_0001:  ldfld      valuetype Int256 Test::’value’
  IL_0006:  stloc.0
  IL_0007:  ldloca.s   V_0
  IL_0009:  call       instance int64 Int256::get_Bits0()
  IL_000e:  ldarg.0
  IL_000f:  ldfld      valuetype Int256 Test::’value’
  IL_0014:  stloc.1
  IL_0015:  ldloca.s   V_1
  IL_0017:  call       instance int64 Int256::get_Bits1()
  IL_001c:  add
  IL_001d:  ldarg.0
  IL_001e:  ldfld      valuetype Int256 Test::’value’
  IL_0023:  stloc.2
  IL_0024:  ldloca.s   V_2
  IL_0026:  call       instance int64 Int256::get_Bits2()
  IL_002b:  add
  IL_002c:  ldarg.0
  IL_002d:  ldfld      valuetype Int256 Test::’value’
  IL_0032:  stloc.3
  IL_0033:  ldloca.s   V_3
  IL_0035:  call       instance int64 Int256::get_Bits3()
  IL_003a:  add
  IL_003b:  ret
} // end of method Test::get_TotalValue

And here it is when the field’s not readonly:

.method public hidebysig specialname instance int64 
        get_TotalValue() cil managed
{
  // Code size       48 (0x30)
  .maxstack  8
  IL_0000:  ldarg.0
  IL_0001:  ldflda     valuetype Int256 Test::’value’
  IL_0006:  call       instance int64 Int256::get_Bits0()
  IL_000b:  ldarg.0
  IL_000c:  ldflda     valuetype Int256 Test::’value’
  IL_0011:  call       instance int64 Int256::get_Bits1()
  IL_0016:  add
  IL_0017:  ldarg.0
  IL_0018:  ldflda     valuetype Int256 Test::’value’
  IL_001d:  call       instance int64 Int256::get_Bits2()
  IL_0022:  add
  IL_0023:  ldarg.0
  IL_0024:  ldflda     valuetype Int256 Test::’value’
  IL_0029:  call       instance int64 Int256::get_Bits3()
  IL_002e:  add
  IL_002f:  ret
} // end of method Test::get_TotalValue

Note that it’s still loading the field address (ldflda) four times. You might expect that copying the field onto the stack once via a temporary variable would be faster, but that ends up at about 6.5 seconds on my machine.

There is an optimization which is even faster – moving the totalling property into Int256. That way (with the non-readonly field, still) the total time is less than a second – twenty times faster than the original code!

Conclusion

This isn’t an optimization I’d recommend in general. Most code really doesn’t need to be micro-optimized this hard, and most code doesn’t deal with large value types like the ones in Noda Time. However, I regard Noda Time as a sort of "system level" library, and I don’t ever want someone to decide not to use it on  performance grounds. My benchmarks show that for potentially-frequently-called operations (such as the properties on ZonedDateTime) it really does make a difference, so I’m going to go for it.

I intend to apply a custom attribute to each of these "would normally be readonly" fields to document the intended behaviour of the field – and then when Roslyn is fully released, I’ll probably write a test to validate that all of these fields would still compile if the field were made readonly (e.g. that they’re never assigned to outside the constructor).

Aside from anything else, I find the subtle difference in behaviour between a readonly field and a read/write field fascinating… it’s something I’d been vaguely aware of in the past, but this is the first time that it’s had a practical impact on me. Maybe it’ll never make any difference to your code… but it’s probably worth being aware of anyway.