I’ve been following the progress of .NET Core with a lot of interest, and trying to make the Noda Time master branch keep up with it. The aim is that when Noda Time 2.0 eventually ships (apologies for the delays…) it will be compatible with .NET Core from the start. (I’d expected to be able to support netstandard1.0, but that appears to have too much missing from it. It looks like netstandard1.3 will be the actual target.)

I’ve been particularly looking forward to being able to run the Noda Time benchmarks (now using BenchmarkDotNet) to compare .NET Core on Linux with the same code on Windows. In order to make that a fair comparison, I now have two Intel NUCs, both sporting an i5-5250U and 8GB of memory.

As it happens, I haven’t got as far as running the benchmarks under .NET Core – but I am now able to run all the unit tests on both Linux and Windows, using both the net451 TFM and netcoreapp1.0.

When I did that recently, I was pretty shocked to see that (depending on which tests I ran) the tests were 6-10 times slower on Linux than on Windows, using netcoreapp1.0 in both cases. This post is a brief log of what I did to track down the problem.

Step 1: Check that there’s really a problem

Thought: Is this actually just a matter of not running the tests in a release configuration, or something similar?

Verification: I ran the tests several times, specifying -c Release on the command line to use the release build of both NodaTime.Test.dll and NodaTime.dll. Running under a debugger definitely wasn’t an issue, as this was all just done from the shell.

Additionally, I ran the tests in two ways – firstly, running the whole test suite, and secondly running with --where=cat!=Slow to avoid the few tests I’ve got which are known to be really pretty slow. They’re typically tests which compare the answers the BCL gives with the answers Noda Time gives, across the whole of history for a particular calendar system or time zone. I’m pleased to report that the bottleneck in these tests is almost always the BCL, but that doesn’t help to speed them up. If only the “slow” tests had been much slower on Linux, that might have pointed to the problems being in BCL calendar or time zone code.

The ratios vary, but there was enough of a problem under both circumstances for it to be worth looking further.

Step 2: Find a problematic test

I didn’t have very strong expectations one way or another about whether this would come down to some general problem in the JIT on Linux, or whether there might be one piece of code causing problems in some tests but not others. Knowing that there are significant differences in handling of some culture and time zone code between the Linux and Windows implementations, I wanted to find a test which used the BCL as little as possible – but which was also slow enough for the differences in timing to be pronounced and not easily explicable by the problems of measuring small amounts of time.

Fortunately, NUnit produces a TestResult.xml file which is easy to parse with LINQ to XML, so I could easily transform the results from Windows and Linux into a list of tests, ordered by duration (descending), and spot the right kind of test.

I found my answer in UmAlQuraYearMonthDayCalculatorTest.GetYearMonthDay_DaysSinceEpoch, which effectively tests the Um Al Qura calendar for self consistency, by iterating over every day in the supported time period and checking that we can convert from “days since Unix epoch” to an expected “year, month day”. In particular, this test doesn’t rely on the Windows implementation of the calendar, nor does it use any time zones, cultures or anything similar. It’s nicely self-contained.

This test took 2051ms on Linux and 295ms on Windows. It’s possible that those figures were from a debug build, but I repeated the tests using a release build and confirmed that the difference was still similar.

Step 3: Find the bottleneck

At this point, my aim was to try to remove bits of the test at a time, until the difference went away. I expected to find something quite obscure causing the difference – something like different CPU cache behaviour. I knew that the next step would be to isolate the problem to a small piece of code, but I expected that it would involve a reasonable chunk of Noda Time – at least a few types.

I was really lucky here – the first and most obvious call to remove made a big difference: the equality assertion. Assertions are usually the first thing to remove in tests, because everything else typically builds something that you use in the assertions… if you’re making a call without either using the result later or asserting something about the result, presumably you’re only interested in side effects.

As soon as I removed the call to Assert.AreEqual(expected, actual), the execution time dropped massively on Linux, but hardly moved on Windows: they were effectively on a par.

I wondered whether the problem was with the fact that I was asserting equality between custom structs, and so tried replacing the real assertions with assertions of equality of strings, then of integers. No significant difference – they all showed the same discrepancy between Windows and Linux.

Step 4: Remove Noda Time

Once I’d identified the assertions as the cause of the problem, it was trivial to start a new test project with no dependency on Noda Time, consisting of a test like this:

[Test]
public void Foo()
{
    for (int i = 0; i < 1000000; i++)
    {
        var x = 10;
        var y = 10;
        Assert.AreEqual(x, y);
    }
}

This still demonstrated the problem consistently, and allowed simpler experimentation with different assertions.

Step 5: Dig into NUnit

For once in my life, I was glad that a lot of implementation details of a framework were exposed publicly. I was able to try lots of different “bits” of asserting equality, in order to pin down the problem. Things I tried:

Assert.AreEqual(x, y): slow
Assert.That(x, Is.EqualTo(y)): slow
Constructing an NUnitEqualityComparer: fast
Calling NUnitEqualityComparer.AreEqual: fast. (Here the construction occurred before the loop, and the comparisons were in the loop.)
Calling Is.EqualTo(y): slow

The last bullets two bullets were surprising. I’d been tipped off that NUnitEqualityComparer uses reflection, which could easily differ in performance between Windows and Linux… but checking for equality seemed to be fast, and just constructing the constraint was slow. In poking around the NUnit source code (thank goodness for Open Source!) it’s obvious why Assert.AreEqual(x, y) and Assert.That(y, Is.EqualTo(x)) behave the same way – the former just calls the latter.

So, why is Is.EqualTo(y) slow (on Linux)? The method itself is simple – it just creates an instance of EqualConstraint. The EqualConstraint constructor body doesn’t do much… so I proved that it’s not EqualConstraint causing the problem by deriving my own constraint with a no-op implementation of ApplyTo… sure enough, just constructing that is slow.

That leaves the constructor of the Constraint abstract base class:

protected Constraint(params object[] args)
{
    Arguments = args;

    DisplayName = this.GetType().Name;
    if (DisplayName.EndsWith("`1") || DisplayName.EndsWith("`2"))
        DisplayName = DisplayName.Substring(0, DisplayName.Length - 2);
    if (DisplayName.EndsWith("Constraint"))
            DisplayName = DisplayName.Substring(0, DisplayName.Length - 10);
}

That looks innocuous enough… but maybe calling GetType().Name is expensive on Linux. So test that… nope, it’s fast.

At this point I’m beginning to wonder whether we’ll ever get to the bottom of it, but let’s just try…

[Test]
public void EndsWith()
{
    string text = "abcdefg";
    for (int i = 0; i < Iterations; i++)
    {
        text.EndsWith("123");
    }
}

… and sure enough, it’s fast on Windows and slow on Linux. Wow. Looks like we have a culprit.

Step 6: Remove NUnit

At this point, it’s relatively plain sailing. We can reproduce the issue in a simple console app. I won’t list the code here, but it’s in the GitHub issue. It just times calling EndsWith once (to get it JIT compiled) and then a million times. Is it the most rigorous benchmark in the world? Absolutely not… but when the difference is between 5.3s on Linux and 0.16s on Windows, on the same hardware, I’m not worried about inaccuracy of a few milliseconds here or there.

Step 7: File a CoreCLR issue

So, as I’ve shown, I filed a bug on GitHub. I’d like to think it was a pretty good bug report:

Details of the environment
Short but complete console app ready to copy/paste/compile/run
Results

Exactly the kind of thing I’d have put into a Stack Overflow question – when I ask for a minimal, complete example on Stack Overflow, this is what I mean.

Anyway, about 20 minutes later (!!!), Stephen Toub has basically worked out the nub of it: it’s a culture issue. Initially, he couldn’t reproduce it – he saw the same results on Windows and Linux. But changing his culture to en-GB, he saw what I was seeing. I then confirmed the opposite – when I ran the code having set LANG=en-US, the problem went away for me. Stephen pulled Matt Ellis in, who gave more details as to what was going wrong behind the scenes.

Step 8: File an NUnit issue

Matt Ellis suggested filing an issue against NUnit, as there’s no reason this code should be culture-sensitive. By specifying the string comparison as Ordinal, we can go through an even faster path than using the US culture. So

if (DisplayName.EndsWith("Constraint"))

becomes

if (DisplayName.EndsWith("Constraint", StringComparison.Ordinal))

… and the equivalent for the other two calls.

I pointed out in the issue that it was also a little bit odd that this was being worked out in every Constraint constructor call, when of course it’s going to give the same result for every instance of the same type. When “every Constraint constructor call” becomes “every assertion in an entire test run”, it’s a pretty performance-critical piece of code. While unit tests aren’t important in terms of performance in the same way that production code is, anything which adds friction is bad news.

Hopefully the NUnit team will apply the simple improvement for the next release, and then the CoreCLR team can attack the tougher underlying problem over time.

Step 9: Blog about it

Open up Stack Edit, start typing: “I’ve been following the progress”… :)

Conclusion

None of the steps I’ve listed here is particularly tricky. Diagnosing problems is often more a matter of determination and being unwilling to admit defeat than cleverness. (I’m not denying that there’s a certain art to being able to find the right seam to split the problem in two, admittedly.)

I hope this has been useful as a “start to finish” example of what a diagnostic session can look and feel like. It wasn’t one physical session, of course – I found bits of time to investigate it over the course of a day or so – but it would have been the same steps either way.

Smug, satisfied smile…

21 thoughts on “Tracking down a performance hit”

lazerfx says:

June 9, 2016 at 9:19 pm

I know this feeling, on a somewhat smaller scale. I’ve been dropping bug reports into various Aurelia and MVC core related repositories over the past couple of weeks, and it’s very satisfying when a bug report is a) promptly responded to, and b) quickly fixed :-D

LikeLike

Dax Fohl says:

June 10, 2016 at 2:31 am

Any idea how involved the CoreCLR issue is?

LikeLike

1. jonskeet says:
  
  June 10, 2016 at 5:55 am
  
  I think it’s probably going to be simple to improve it in some situations, but tricky to improve it more globally. That’s the sense I get from the comments on the issue – follow the link in the post for more information.
  
  LikeLiked by 1 person
  
  1. ExperiencedProgrammer says:
    
    July 18, 2016 at 11:29 am
    
    Globalization is hard.
    
    LikeLike
    
Ryan Heath says:

June 10, 2016 at 5:30 am

Nice breakdown!

I still wonder why the en-GB culture EndsWith is still slower under Linux than under Windows …
We can not always use an ordinal comparison …

// Ryan

LikeLike

1. jonskeet says:
  
  June 10, 2016 at 5:51 am
  
  There are more details on the GitHub CoreCLR issue. Basically some optimizations are assumed to be valid for en-US but not en-GB; the team has set a milestone of 1.1 to improve this.
  
  LikeLike
  
adatapost says:

June 10, 2016 at 6:12 am

Reblogged this on A datapost’s Blog.

LikeLike

Pingback: The Morning Brew #2111 | Tech News
gregsdennis says:

June 13, 2016 at 3:18 pm

Jon, that last step (9) wasn’t very clear. Could you elaborate?

LikeLike

DragonSpark Technologies says:

June 14, 2016 at 5:28 pm

Wow the legendary Jon Skeet. I saw you at Xamarin 2014 during a keynote, I think it was? Standing on the shoulders of giants? I had no idea who you were, haha. But for the past 4-6 months I have been doing some core framework searches on StackOverflow and you are EVERYWHERE, LOL. It’s almost odd to not see you as the first answer. So now I see why you were talking to all of Xamarin. :) Anyways, big fan (now that I know who you are! LOL), and wanted to share. I’ve also got you subscribed now in my RSS.

Btw, you are listed in this week’s This Week In .NET: https://blogs.msdn.microsoft.com/dotnet/2016/06/14/the-week-in-net-6142016/

LikeLike

Pingback: The week in .NET – 6/14/2016 | Tech News
sespoon says:

June 15, 2016 at 8:18 am

Thanks for the effort of outlining your steps here! Tracking down performance issues requires a lot of skill and experience – and it is always good to see how others approach it.

LikeLike

Kirill Rakhman says:

June 16, 2016 at 8:16 am

I’m not a .NET expert but aren’t there any profiling tools where you can see the distribution of time spent per method?

LikeLike

1. jonskeet says:
  
  June 16, 2016 at 8:18 am
  
  There are, but I don’t know how many of them work with .NET Core on Linux at the moment. (I strongly suspect that tracking down the problem here was simpler than investigating that possibility.)
  
  LikeLike
  
  1. Kirill Rakhman says:
    
    June 16, 2016 at 8:19 am
    
    Right, thanks for the reply.
    
    LikeLike
    
Pingback: Szumma #044 – 2016 24. hét | d/fuel
Szymon Warda says:

June 25, 2016 at 10:12 am

Did You check NBench for perf testing? If Yes why did You choose BenchmarkDotNet?

LikeLike

Pingback: .NET дайджест #11: Не все так красочно c .NET Core и С#, границы DDD и микросервисов, обзор подсистемы Windows для Linux - LIKELY.CLUB
aron says:

March 20, 2017 at 10:12 am

Did you try this out in Turkey?

LikeLike

1. jonskeet says:
  
  March 20, 2017 at 10:27 am
  
  No, but as I say in my post, performing an ordinal comparison should be better. So it may fix things that were broken in Turkey.
  
  LikeLike
  
Pingback: Diagnosing slow tests (again) | Jon Skeet's coding blog