Diagnosing a single slow test

In the last blog post, I investigated why my Noda Time tests on Travis were running much slower than those on AppVeyor. I resolved a lot of the problem just by making sure I was running release builds on Travis.

That left a single test which takes up about half of the test time though: TzdbDateTimeZoneSourceTest.GuessZoneIdByTransitionsUncached. It seems to run more slowly on .NET Core on Linux than on Windows. The aim of this post is to work out why – at least to the extent of understanding it better than before.

Step 0: Initial assumptions/knowledge and aim

For the moment, I’ll assume that my records of previous investigations are correct. So the situation is:

  • I have near-identical hardware to run tests on: gabriel is running Linux; bagpuss is running Windows.
  • Under a release build, the test takes 19s on gabriel and 4.3s on bagpuss
  • The test does more work on gabriel: it checks 434 zones instead of 135 on bagpuss
  • That still shows a per-test slow-down of 40%
  • I’ve looked at this code before to try to optimize it. In production, this method is only going to be called for one time zone (the system default) so it’s not really a problem – but improving the test speed would be nice. (We can categorize it as a “slow” test if necessary, but I’d rather not.)

Aims of this diagnosis:

  • Find out whether the speed difference is in the test code or the production code that it’s testing. (This test has a lot of assertions.)
  • Find out whether the speed difference is in Noda Time code (e.g. Instant) or in BCL code (e.g. TimeZoneInfo). If it’s in Noda Time code, that could lead to another investigation later, with potential optimizations.
  • Make another attempt at speeding up the test across the board

Step 1: validate assumptions

Let’s make sure we’re still in the same state as we were before. (That seems pretty likely.)

On both machines, pull the current code and run

dotnet run -c Release -f netcoreapp1.0 -- --where=test=~Guess

from the src/NodaTime.Test directory.

Odd: it now takes ~6 seconds on bagpuss. At that point, there’s no time discrepancy. But I’m sure there was when I wrote up the previous post. I didn’t make up the 4.3s result… but it’s entirely possible it was on the wrong machine. Run the tests several times – yes, it’s always around 6 seconds. Odd.

At this point, I nearly gave up. After a little while doing other things, I decided to pursue the rest of the diagnosis anyway… at which point the execution time was back to about 4 seconds. That’s very, very strange. I still don’t know what was going on there.

Brutal honesty time: although I ran the tests at this point, I didn’t spot the discrepancy until I was at the end of step 2. I then looked back in the command shell buffer and saw the 6 second run, which was consistent with the runs at the end of step 2.

Step 2: shorten experiment iteration time

The only test we’re interested in is this one. We’re going to be changing the code occasionally, so let’s ditch everything else.

  • Start a new branch in git
  • Move TzdbDateTimeZoneSourceTest.cs into NodaTime.Test so it’s easier to find. (It was in a subdirectory.)
  • Remove all other tests from TzdbDateTimeZoneSourceTest.cs
  • Delete all the subdirectories in NodaTime.Test
  • Remove everything in NodaTime.Test apart from Program.cs and TzdbDateTimeZoneSourceTest.cs
  • Change the target framework to just netcoreapp1.0 so we don’t need to specify it. (This also requires changing the csproj file to use instead of.)
  • Add a Release property to attempt to make it a release build by default
  • Test: run dotnet run – oops, even though it’s running from a Release folder, it’s taking 13 seconds instead of 3 on Windows.
  • Revert previous change, and try dotnet run -c Release – that’s more like it

Great. We can push the commit to github and fetch it on all the relevant machines. (I’m developing on a third machine, just using gabriel and bagpuss to run tests.)

Step 3: Narrow down the time-consuming code

As I mentioned before, the test has a lot of assertions, and they involve a lot of time zone work. Let’s see how much time they’re contributing – by removing them entirely.

  • Comment out everything after the call to GuessZoneIdByTransitionsUncached
  • Run locally just to check it does still run
  • Commit and push
  • Fetch on gabriel/bagpuss and rerun

Result: the checking part of the test is almost free.

Step 4: Remove NUnit

At this point I’m fairly confident that NUnit is irrelevant. I still want to use the NodaTime.Test project, as NodaTime is already configured to expose internal members to it – and GuessZoneIdByTransitionsUncached is internal. However, it’s very easy to remove NUnit – just remove the dependencies, and replace all the code in Program with a simple test:

using NodaTime.TimeZones;
using System;
using System.Diagnostics;
using System.Linq;

namespace NodaTime.Test
{
    class Program
    {
        public static void Main(string[] args)
        {
            var zones = TimeZoneInfo.GetSystemTimeZones().ToList();

            // Fetch the source outside the measurement time
            var source = TzdbDateTimeZoneSource.Default;
            // JIT compile before we start
            source.GuessZoneIdByTransitionsUncached(TimeZoneInfo.Local);

            var stopwatch = Stopwatch.StartNew();
            foreach (var zone in zones)
            {
                source.GuessZoneIdByTransitionsUncached(zone);
            }
            stopwatch.Stop();
            Console.WriteLine($"Zones tested: {zones.Count}");
            Console.WriteLine($"Elapsed time: {stopwatch.Elapsed}");
        }
    }
}

There are a couple of differences here compared with the NUnit test:

  • I’m deliberately fetching TzdbDateTimeZoneSource.Default (which needs to read some resources) outside the measurement
  • I’m deliberately calling GuessZoneIdByTransitionsUncached once outside the measurement to remove JIT compilation time

This still isn’t a good benchmark by any means, but it’s probably sufficient for now.

Commit/push/fetch/run as before.

Result: similar to before, but a little faster. Playing around with the code a little further, it seems that the bullet points above make a difference of about 0.1s each (on bagpuss, at least) – there’s quite a lot of code that wouldn’t have been JIT-compiled before the first call, and loading time zone data is somewhat expensive. NUnit also adds a little bit of overhead, which is understandable.

We now have more accurate data for just GuessZoneIdByTransitionsUncached:

  • bagpuss: 3.6s
  • gabriel: 19.0s (just under 19 in most cases, instead of just over when running NUnit)

Now we need to get into Noda Time code…

Step 5: Gut the production code

First, the source code helpfully points to https://github.com/nodatime/nodatime/issues/686 with some comments from my last investigation.

We don’t care about correctness at this point, just speed. We can return null at any point. At this point you probably need to know a bit about what the method does. The aim is to guess a matching TZDB (aka IANA, tz, zoneinfo or Olson) time zone from a BCL TimeZoneInfo, when we can’t use the names to match. It combines all the time zone transitions (e.g. going into or out of DST) of every TZDB zone for the next 5 years, and then computes a score for each zone by checking how many of the UTC offsets at those transition points match the offsets returned by the BCL zone.

Let’s divide the code into 5 sections:

  1. Working out the start and end of this UTC year
  2. Getting all the time zones
  3. Getting all the transitions we’re going to test
  4. Getting all the offsets from the BCL zone
  5. Computing the scores

It’s going to be simplest to measure just 1, then 1+2, then 1+2+3 etc. We can just add return null; after the step we’re interested in. We hope the results will tell us:

  • Which part of the code is slowest (so we can try to optimize it)
  • Which part of the code is slower on Linux than Windows

For simplicity, let’s make gabriel use the same number of time zones as bagpuss, just by adding .Take(135) in Program.cs above first. That way we’re comparing the same amount of work on each machine – although admittedly we don’t know that each time zone takes the same amount of time. Without any change to the production code, gabriel now takes 6s – which is about what we’d expect if the time per zone is uniform, as 19 * 135/434 = 5.91[…].

Results on gabriel:

  • Step 1: 0s (it’s free, effectively)
  • Step 1+2: 1.82s
  • Step 1+2+3: 3.15s
  • Step 1+2+3+4: 3.21s
  • Step 1+2+3+4+5: 6s

Results on bagpuss:

  • Step 1: 0s
  • Step 1+2: 1.21s
  • Step 1+2+3: 2.10s
  • Step 1+2+3+4: 2.13s
  • Step 1+2+3+4+5: 3.63s

Or in other words, stepwise:

Step 1: 0s on both
Step 2: 1.82s on gabriel; 1.21s on bagpuss
Step 3: 1.33s on gabriel; 0.89s on bagpuss
Step 4: 0.06s on gabriel; 0.03s on bagpuss
Step 5: 2.79s on gabriel; 1.50s on bagpuss

No particular smoking gun, unfortunately – no obvious part that would be easily optimized. The last part looks like it shows the worst platform difference though… let’s see if we can dig into that.

Step 6: Pivot to DateTimeZone.GetUtcOffset

Let’s change our test pretty radically:

  • Get a list of all the built in TZDB time zones
  • Create a list of 50,000 instants, starting at 2000-01-01T00:00:00Z and spaced 1 hour apart. 100,000 is some enough instants to make the test last for roughly 10 seconds. (Fast enough to iterate, slow enough to give significant results.)
  • For every zone, find the UTC offset for every instant.

To simplify reasoning about what we’re testing, we want to remove time zone caching, too. Time zone computations are inherently fiddly, so Noda Time has a cache for each zone. In GuessZoneIdByTransitionsUncached we were coming up with a new DateTimeZone in each call, so while there’d be some caching involved, it wouldn’t be as much as a simple version of this new test. For the moment, let’s just remove the time zone caching layer entirely:

var zones = DateTimeZoneProviders.Tzdb.GetAllZones()
    .Select(zone => zone is CachedDateTimeZone cached ? cached.TimeZone : zone)
    .ToList();
var start = Instant.FromUtc(2000, 1, 1, 0, 0);
var gap = Duration.FromHours(1);
var instants = Enumerable
    .Range(0, 100000)
    .Select(index => start + gap * index)
    .ToList();

var stopwatch = Stopwatch.StartNew();
foreach (var zone in zones)
{
    foreach (var instant in instants)
    {
        zone.GetUtcOffset(instant);
    }
}
stopwatch.Stop();
Console.WriteLine($"Elapsed time: {stopwatch.Elapsed}");

There’s still one other caching layer that’s relevant, and that’s in terms of conversions between calendrical values and instants… Noda Time is optimized for the years 1900-2100 in the Gregorian calendar. If our test went outside that range, we’d get very different results, hitting different pieces of code very hard. Our original test was looking at 5 years around now, so let’s stick to the optimized range for now.

Results:

  • gabriel: 19.5s
  • bagpuss: 12.5

Good – we’re still seeing the Linux run being about 50% slower than the Windows run.

Step 7: Break out a profiler!

It’s been a while since I’ve done any profiling in .NET. I had a quick search to see what was available in terms of free tools that handle .NET Core, and came across CodeTrack. I don’t think it runs under Linux, but that’s okay – we’re mostly using it to track down the hotspots first, so we can fine tune our tests.

I’m not specifically advocating CodeTrack – profiling is an art and science about which I know relatively little – but it was easy to get going. 85% of the time was spent in StandardDaylightAlternatingMap.GetZoneInterval, so that looks like a good place to look next.

Step 8: Write actual benchmarks

I have a benchmark suite for Noda Time already, and at this point we’ve got to a stage where it’s probably worth using proper benchmarking tools. BenchmarkDotNet is my framework of choice – so let’s add a couple of benchmarks.

Looking at the code and the profiler, it looks worth testing both StandardDaylightAlternatingMap and ZoneRecurrence. Having added those benchmarks, I might as well put this down until the next day, as the complete benchmark suite runs automatically every night.

Step 9: Compare the benchmarks

The benchmark results are automatically uploaded to the Noda Time web site although the UI for browsing them is somewhat unstable (in terms of URLs etc) – hence the lack of link here.

Looking at the benchmark results, it turns out I’d already got benchmarks for StandardDaylightAlternatingMap, annoyingly enough – so I needed to tidy those up.

However, they do clearly show the difference between .NET Core on Linux and on Windows, so we’ve narrowed the problem down further. That doesn’t show that only that code has different performance, but it’s hopefully a good enough start that we’ll be able to continue and get a short but complete example to provide to the .NET Core CLR team.

Diving deeper into the implementation, ZoneYearOffset makes sense to test too… the further we go down, the less code will be involved.

Step 10: take a step back for now

We’ve made progress, but we’re not all the way there. That’s okay – the progress is recorded (both here and in benchmarks) and in looking at the benchmark that’s taking so long to run, we’ve confirmed that we’re not doing anything stupid. It happens to end up doing a lot more work than any application code normally would – work that would normally only happen once (so can be inefficient without being a problem) is exercised hundreds of times. Such is life – I could decide not to run it in CI if I want, running it occasionally manually and always just before a release.

I’ll come back to this issue at some point in the future, but at this point, other projects have become more important.

Points to note and lessons learned

  • Validate assumptions carefully. I still don’t quite know why there was the blip in execution time, but it was very surprising.
  • Unit tests are no substitute for a good benchmarking framework.
  • Profilers exist, and are useful. No specific recommendations here, but make sure you’re aware of the tools you could benefit from.
  • Sometimes, you need to put down a problem for a while. Try not to lose the information you’ve already discovered – capture it in an issue, comments, a blog post like this one, a private document… anything to preserve information. My head leaks data very, very quickly.

Addendum

I’m still investigating where .NET Core runs faster on Linux than on Windows, but I’ve fixed this test – PR 953 is a pretty minor refactoring, but it allows the tests to pass in the same list of time zones into each test case. That has two benefits:

  • We don’t need to load the zones from raw data on each test
  • The “get UTC offset at instant X” result is cached in the time zone, so after the first test there’s very little work to do

This very small, internal-only change has halved the total time of the Noda Time tests on Travis. Yay.

5 thoughts on “Diagnosing a single slow test”

  1. I’m wondering why step 2-6 was before step 7, why didn’t you simply fire up a profilere to begin with, after ensuring you’re still experiencing the differences? Every “fix performance issue” article I’ve read, and all my experience, tells me that you should measure, measure, measure. Why did you decide otherwise? It seems you even changed the code (step 2, 4 and 5) before you started using a profiler?

    Like

    1. The profiler only runs on Windows, so that doesn’t help when you’re trying to find the difference between .NET Core on Linux and .NET Core on Windows, which was the main aim. Trying to make the tests quicker would have been nice, but my main point of interest was (and still is) why things were running slower on Linux than on Windows.

      The code changes in steps 2 and 4 were only changes to the test code, not the production code – trying to reduce how much was running. That was useful when I did end up running the profiler. The code changes in step 5 were just ways of measuring subsets of the code reasonably simply.

      Every “fix performance issue” article I’ve read, and all my experience, tells me that you should measure, measure, measure.

      And that’s what I was doing – with a fairly blunt tool, but they’re measurements nonetheless. They were more realistic measurements than the profiler gave in one mode, too – the profiler in tracing mode made the whole thing 10x slower. Great for seeing how many calls are made to what, but not so good for knowing the overall performance. Less of a difference in sampling mode, but my point is that there are pros and cons for different measurement techniques. I find profilers good for giving you strong suggestions about where to look for hotspots, but after that I really only trust the measurements from benchmarking tools.

      Using a profiler just for Windows followed by adding benchmarks for the hotspots of the test would probably have got me to the same state quicker, but not very much so – and I wouldn’t have had as much confidence that those benchmarks would show a difference between .NET Core on Linux and Windows as I did by the time I got there with this path.

      I’m certainly not going to argue against using a profiler earlier, but there are multiple feasible paths when diagnosing problems, and I don’t feel particularly foolish for not reaching for a profiler earlier in this case.

      Like

  2. We’re still in the infancy of .NET Core/Standard as far as I can see (opinion) so one part that I take away from your post and your reply is that we need more/better cross-platform tools.

    To be honest, I’m not entirely sure how to conduct profiling against different platforms at the same time in the interest of finding out why code running on one platform take more time than on a different platform but your initial sentence, “runs only on Windows”, will have to be fixed! We need profilers that can run on multiple platforms and give comparative results.

    I’m currently combating a similar case myself wherein a simple LINQ-like query runs “fine” on Windows but is “taking forever” on Linux, yet on OS X it seems I have to consider the current moon phase to know whether it will run slow or fast.

    I’m looking forward to reading more about what you finally find is the source of your differences, if you ever get to the bottom of it, but this article, and many others I’ve read over the past couple of months, tells me that we still have a ways to go in order to get true cross-platform C# and .NET code.

    Like

    1. Yes, more cross-platform tools such as profilers and coverage tools will definitely be welcome. It’s possible that a cross-platform profiler already exists, but I haven’t seen it yet.

      Will definitely post again with any further results when I get back to this.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s