Diagnosing slow tests (again)

I’ve previously blogged about a case where tests on .NET Core on Linux were much, much slower than the same tests on Windows. Today I’ve got a very similar problem – but I suspect the cause isn’t going to be the same.

This is my reasonably raw log – skip to the end for a summary of what I found, what went well, and what went badly. Apologies for skipping around in terms of past/present/future tense – being consistent in that respect would be distracting enough to make the rest worse, I think.

Step 0: notice the problem

I spotted this problem because I happened to be looking at the Travis and AppVeyor builds while moving to .NET Core 2.0. Looking at the test runs, I saw that the tests for .NET Core 1.0 were completing in about 15 seconds on AppVeyor (Windows) and 133 seconds on Travis (Linux). My experience is that AppVeyor seems to use slightly less powerful hardware than Travis, making this a very odd result.

I looked at several builds, and it seemed consistent across them – very odd.

Step 1: file an issue

My main focus at that point wasn’t to investigate performance – it was to get the build working with .NET Core 2. However, I didn’t want to forget about the problem, so I filed an issue in the Noda Time repo. At this point, I have no idea where the problem is, so the Noda Time repo is the most obvious place to put it. There’s no evidence that it’s a problem with NUnit or .NET Core itself, for example.

Step 2: reproduce locally

I have two machines at home that I use for running the automated benchmarks. One (bagpuss) runs Windows 10, the other (gabriel) runs Ubuntu 16.04. Both are Intel NUCs with i5-5250U processors – I bought one much later than the other, but tried hard to find hardware as close to identical as I could, to enable cross-OS benchmark comparisons without dual booting or virtualization.

I ran the tests on both machines, and copied the TestResult.xml output files to my main dev box for analysis.

First interesting point: the difference isn’t as marked. The tests take 77 seconds on gabriel and 29 seconds on bagpuss. That’s still a lot, but very different to what we’re seeing in CI. It’s possible that the Travis machine is twice as slow as gabriel/bagpuss and that the AppVeyor machine is twice as fast, but that seems unlikely.

Deem that temporarily out of scope as it’s harder to investigate than the fixed-hardware, easy-access problem I have now. File another issue and I can come back to it.

Step 3: Look for individual test case discrepancies

If we’re really lucky, there’ll be a single test case that has a much longer duration on gabriel than on bagpuss, and I can then delve into it.

Investigation:

  • Create a small console app that loads the two XML files, finds all the test-case elements, and puts them in a dictionary from fullname to duration, parsing the duration attribute as a number (see the sketch after this list)
    • Oops: some full names aren’t unique (due to using parameters that don’t override ToString, I suspect).
    • Load into a lookup instead, then convert that into a Dictionary by taking the first test from each group. Unlikely to affect the results
  • For each test that’s in both result sets, find the ratio of gabriel/bagpuss
  • Dump the top 20 ratios, and the slowest 10 tests from each set for good measure.
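For reference, the core of that tool looks something like this. It's a minimal sketch rather than the exact code, assuming NUnit 3's TestResult.xml format (test-case elements with fullname and duration attributes); the file names are purely illustrative.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

class CompareTestDurations
{
    static void Main()
    {
        // Illustrative file names for the two copied result files.
        var gabriel = LoadDurations("gabriel-TestResult.xml");
        var bagpuss = LoadDurations("bagpuss-TestResult.xml");

        // For each test present in both sets, compute the gabriel/bagpuss ratio
        // and dump the 20 worst offenders. Zero-duration tests are skipped, as
        // tiny durations are too noisy to be meaningful anyway.
        var worst = gabriel.Keys
            .Where(bagpuss.ContainsKey)
            .Where(name => bagpuss[name] > 0)
            .Select(name => new { Name = name, Ratio = gabriel[name] / bagpuss[name] })
            .OrderByDescending(x => x.Ratio)
            .Take(20);
        foreach (var entry in worst)
        {
            Console.WriteLine($"{entry.Ratio:0.00}  {entry.Name}");
        }
    }

    static Dictionary<string, double> LoadDurations(string file) =>
        XDocument.Load(file)
            .Descendants("test-case")
            // Full names aren't unique, so group via a lookup and take the
            // first result from each group.
            .ToLookup(tc => (string) tc.Attribute("fullname"),
                      tc => (double) tc.Attribute("duration"))
            .ToDictionary(group => group.Key, group => group.First());
}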

Okay, lots of text handling tests with ratios of more than 2:1, but they’re all very small durations. I doubt that they’ll be responsible for everything, and could easily just be “measuring small amounts of time is flaky”. Nothing to blame yet.

Point to note: this console app was always less than 50 lines of code. LINQ to XML makes this sort of thing really easy.

Step 4: Look for test fixture discrepancies

If individual tests don’t show a massive change, what about test fixtures? Let’s change the program to load test-suite elements where type="TestFixture", which will give us durations at a per-test-class level. We’ll start by dumping the slowest 10 fixtures for each environment… and wow, we have a result!
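The change to the loading code is tiny, by the way – something along these lines, again a sketch written against NUnit 3's XML format rather than the exact code:

static Dictionary<string, double> LoadFixtureDurations(string file) =>
    XDocument.Load(file)
        .Descendants("test-suite")
        // Only the fixture-level suites; assembly- and namespace-level
        // suites have other type values.
        .Where(ts => (string) ts.Attribute("type") == "TestFixture")
        .ToLookup(ts => (string) ts.Attribute("fullname"),
                  ts => (double) ts.Attribute("duration"))
        .ToDictionary(group => group.Key, group => group.First());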

NodaTime.Test.TimeZones.TzdbDateTimeZoneSourceTest takes 65 seconds on gabriel and 17 seconds on bagpuss! That’s out of a total of 77 and 29 seconds respectively – so removing those would take us to 12 seconds on each machine.

Concern: that’s assuming no multithreading is going on. I don’t think NUnit parallelizes by default, but it would definitely make things trickier to reason about.

Step 5: check the logs in more detail

Given that we haven’t seen any very slow individual tests, my guess at this point is that the slowness is going to be in test fixture setup. I haven’t looked at the code yet. Let’s have a look at the XML logs for that one test in more detail.

Yikes: all the difference is in GuessZoneIdByTransitionsUncached. That’s a single method, but it’s parameterized for all system time zones.

At this point, a memory stirs: I’ve looked at this before. I have a distinct memory of being in an airport lounge. At the time I wasn’t trying to work out the OS discrepancy so much as why this one test was taking up over half the total unit test time, even on Windows.

Looking at the code for GuessZoneIdByTransitionsUncached, I see I was even diligent enough to create an issue and keep a bit of a log of what I found. Hooray! So, now we know why this is slow in general. But why is it so much slower on gabriel than on bagpuss?

The answer is that gabriel appears to have rather more system time zones than bagpuss: it’s running 424 tests instead of 135. If bagpuss ran those 424 tests at its current per-test rate, it would take about 53 seconds (17 × 424 / 135) to gabriel’s 65 – that’s not so bad.
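For context, this is the kind of platform-dependent test count you get from NUnit parameterization. Here's an illustrative sketch – not the actual Noda Time test, and the class name and body are hypothetical – of a test that runs once per system time zone, which is why the count differs between operating systems:

using System;
using System.Collections.Generic;
using NUnit.Framework;

public class ZoneGuessingTest
{
    // One test case per time zone the OS knows about - Linux typically
    // reports far more zones than Windows, hence 424 tests vs 135.
    private static IEnumerable<TimeZoneInfo> SystemZones =>
        TimeZoneInfo.GetSystemTimeZones();

    [TestCaseSource(nameof(SystemZones))]
    public void GuessZoneIdByTransitionsUncached(TimeZoneInfo zone)
    {
        // The real test exercises the expensive, uncached zone-guessing logic;
        // this placeholder assertion just keeps the sketch runnable.
        Assert.That(zone.Id, Is.Not.Empty);
    }
}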

Step 6: test on CI again

In one fell swoop, we can confirm that this is the problem on Travis and potentially work out why AppVeyor is so fast. If we only run this one test fixture, we can look at how many tests are running on each machine (and in which environments, as I now run the tests on net45 (AppVeyor only), netcoreapp1.0 and netcoreapp2.0).

That’s really easy to do – create a PR which just changes the test filter, let CI run it, then delete the PR.

Results:

  • Travis, both environments run 424 tests – 95 seconds for netcoreapp1.0, 92 seconds for netcoreapp2.0
  • AppVeyor, all environments roughly equal, 135 tests – about 3 seconds

That certainly shows that this is taking the majority of the time on Travis, but we still have two questions:

  • How is AppVeyor so much faster than my Core i7 development machine, where the same tests take about 15 seconds?
  • Travis is still only doing about 3 times as much work as AppVeyor, but it’s taking 30 times as long. Why?

Step 7: spot flag differences

I look at the Travis script. I look at the AppVeyor script. I look at what I’ve been running locally. There’s a difference… on AppVeyor, I’m running with -c Release, whereas locally I’ve used the default (debug). Can a debug build make that much difference?

  • Test locally: running on my dev box with -c Release brings the time down from 15s to 3.9s. Yikes!
  • Update the Travis script to run with -c Release as well

Now the tests take about 30 seconds on Travis. That’s still three times as long as it would be on Windows for the same number of time zones, but it’s much, much closer.

Run the tests with the same flags on gabriel and bagpuss: gabriel takes 19s, bagpuss takes 4.3s. Taking the number of time zones into account (424 against 135), that works out at roughly 45ms per test on gabriel against 32ms on bagpuss – so netcoreapp1.0 on Linux is about 40% slower than netcoreapp1.0 on Windows for each test. I can investigate that separately, but it does sound like Travis is just using slower hardware than AppVeyor, contrary to my previous suspicions.

Step 8: update Travis script and merge

Even though this one test takes a large proportion of the test time on Linux (where there are three times as many time zones to run through), it’s not a gating factor. Correctness beats speed in this case – I can live with waiting a few minutes for a complete CI test run.

We still have some investigation to do, but we can already make things definitely better by running the release build on Travis. On a new branch – not the one I’ve been investigating on – create a commit that just changes Travis to use -c Release. Observe the total time go down from 133 seconds to 57 seconds. Yay.

Points to note

  • When you’re running performance tests, don’t compare release with debug builds. Doh. This was a rookie mistake.
  • Reproducing a CI problem locally is key: it’s much, much easier to work with local machines than CI. It’s a faster turnaround, and you get easier access to detailed results.
  • Having multiple machines with the same hardware is really useful for performance comparisons.
  • Don’t assume that what sounds like the same set of tests will run the same number of tests on different platforms – if a test is parameterized, check how many tests are actually running in each case.
  • Ad-hoc tools are great – the console app I wrote for steps 3 and 4 was quick to write, and helped zoom in on the problem very simply.

Further work

Almost every diagnostic investigation seems to “end” with something else to look into. In this case, I want to find out why Linux is slower than Windows for the GuessZoneIdByTransitionsUncached test on a per-test basis. Is this a Noda Time difference, or a BCL difference? That can wait.

Diagnostics everywhere!

For a long time, I’ve believed that diagnostic skills are incredibly important for software engineers, and often poorly understood.

The main evidence I see of poor diagnostic skills is on Stack Overflow:

  • “I have a program that does 10 things, and the output isn’t right. Please fix.”
  • “I can’t post a short but complete program, because my code is long.”
  • “I can’t post a short but complete program, because it’s company code.”

I want to tackle this as best I can. If I can move the needle on the average diagnostic skill level, that will probably have more impact on the world than any code I could write. So, this is my new mission, basically.

Expect to see:

  • Blog posts
  • Podcast interviews
  • Screencasts
  • Conference and user group talks

In particular, I’ve now resolved that when I’m facing a diagnostic situation that at least looks like it’ll be interesting, I’m going to start writing a blog post about it immediately. The blog post will be an honest list of the steps I’ve been through, including those that turned out not to be helpful, to act as case studies. In the process, I hope to improve my own skills (by reflecting on whether the blind alleys were avoidable) and provide examples for other developers to follow as well.

There’s a new “Diagnostics” category on my blog – everything will be there.

Using .NET Core 2.0 SDK on Travis

This is just a brief post that I’m hoping may help some people migrate to use .NET Core 2.0 SDK on Travis. TL;DR: see the end of the post for a sample configuration.

Yesterday (August 15th), .NET Core 2.0 was fully released. Wonderfully, Travis already supports it. You just need dotnet: 2.0.0 in your YAML file.

I decided to experiment with upgrading the Noda Time build to require the .NET Core 2.0 SDK. To be clear, I’m not doing anything in the code that requires 2.0, but it simplifies my build scripts.

Additionally, supporting netcoreapp2.0 means I’ll be able to run my benchmarks against that as well, which is going to be very interesting. However, my tests still target netcoreapp1.0, and that’s where I ran into problems.

Having done the bare minimum to try using 2.0 (edit global.json and .travis.yml) I ran into this error:

The specified framework 'Microsoft.NETCore.App', version '1.0.5' was not found.
  - Check application dependencies and target a framework version installed at:
      /
  - Alternatively, install the framework version '1.0.5'.

That makes sense. Although netcoreapp2.0 is compatible with netstandard1.0 (i.e. you can use libraries targeted to netstandard1.0 in a 2.0 environment), an application targeting netcoreapp1.0 really needs a 1.0 runtime.

So, we need to install just the runtime as well. I’d expected this to be potentially painful, but it’s really not. You just need an addons section in the YAML file:

addons:
  apt:
    sources:
    - sourceline: 'deb [arch=amd64] https://packages.microsoft.com/repos/microsoft-ubuntu-trusty-prod trusty main'
      key_url: 'https://packages.microsoft.com/keys/microsoft.asc'
    packages:
    - dotnet-hostfxr-1.0.1
    - dotnet-sharedframework-microsoft.netcore.app-1.0.5

Note that in my case, I want netcoreapp1.0 – if you need netcoreapp1.1, you’d probably install dotnet-sharedframework-microsoft.netcore.app-1.1.2.

So, aside from comments etc, my new Travis configuration will look like this:

language: csharp
mono: none
dotnet: 2.0.0
dist: trusty

addons:
  apt:
    sources:
    - sourceline: 'deb [arch=amd64] https://packages.microsoft.com/repos/microsoft-ubuntu-trusty-prod trusty main'
      key_url: 'https://packages.microsoft.com/keys/microsoft.asc'
    packages:
    - dotnet-hostfxr-1.0.1
    - dotnet-sharedframework-microsoft.netcore.app-1.0.5

script:
  - build/travis.sh

I can now build with the 2.0 SDK, and run tests under both netcoreapp1.0 and netcoreapp2.0.

I’m hoping it’s just as simple on AppVeyor when that supports 2.0 as well…

Upcoming speaking engagements

I’ve got a few speaking engagements coming up that I thought it might be worth publicising a bit further. They’re all within just over a week of each other, which is going to be somewhat tiring, but…

Here they are, in chronological order:

Progressive .NET 2017

Progressive .NET 2017 will be held on September 13th-15th in London.

The agenda is already available, and looks pretty awesome. I’d go into details of who I’m most looking forward to hearing, but it’s too tricky to rank talks with that line-up… Of course, I’m mostly looking forward to catching up with people afterwards. (In particular Carl and Richard, whose talk I won’t be able to attend without leaving my own audience somewhat disappointed.)

My talk here is the most Google-centric of the three – it’ll show how you can get going with ASP.NET Core on Google Cloud Platform, with both App Engine Flexible Environment and Google Container Engine. Basically, I get to show off what my team’s been working on for the last couple of years, which is always fun… although this talk is more about the tools than the libraries I specifically work on.

My colleague Mete Atamel will be approaching a related topic from a different angle – he’ll be going into more detail on the ASP.NET Core + Docker + Kubernetes stack.

.NET Conf

.NET Conf is on September 19th-21st. It’s free and virtual, so there’s basically no barrier to entry here. The precise agenda has yet to be announced, but the speakers shown so far are stellar.

My talk for .NET Conf will be on diagnostics and problem solving: the path from “Something isn’t working” to “Here’s a great, well-researched Stack Overflow question” or “Ah, that’s what was wrong, and I fixed it!” Finding that path is a skill I’m passionate about – expect to hear more from me in all forms over the next few years. (Now that Noda Time is “mostly done” it’s probably going to be my next big goal in personal time.)

Developing a Digital Future

Developing a Digital Future is a one day conference in Moldova, hosted by Amdaris on September 21st.

I’m delighted to be giving the keynote, whose subject will be the name of the conference. What will tech be like in 5, 10, 20 years? How will it change society? How will our jobs as software engineers be different as a result?

In the afternoon, I’ll be doing a round-table event for a subset of attendees – live coding, answering attendee questions etc. Lots of fun!