Diagnosing a VS-only build problem

I do most of my work in the google-cloud-dotnet GitHub repo. (That’s the repo for the Google Cloud Client Libraries for .NET, just to get a quick bit of marketing in there.) We try to keep our build and test dependencies up to date, so I recently updated to the latest versions of Microsoft.NET.Test.Sdk and xUnit.

Leaving aside problems identified by the xUnit analyzer which is now bundled in the xUnit package (some of which were useful, and some of which weren’t), this change caused another problem: while building from the command line worked fine, building some of the many solutions from Visual Studio (15.3.3) generated the following error:

Error MSB4018 The “GenerateBindingRedirects” task failed unexpectedly.
System.IO.PathTooLongException: The specified path, file name, or both are too long. The fully qualified file name must be less than 260 characters, and the directory name must be less than 248 characters.
at System.IO.PathHelper.GetFullPathName()
at System.IO.Path.LegacyNormalizePath(…)
at System.IO.Path.NormalizePath(…)
at System.IO.Path.NormalizePath(…)

(more stack trace here)

The output window shows a file related to the error:

C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\MSBuild\15.0\Bin\Microsoft.Common.CurrentVersion.targets(2099,5)

So, what’s going on, and how do we fix it?

Step 1: Check that this change is responsible

Git makes life really easy here. For peace of mind – particularly for this kind of issue – I tend to close Visual Studio when switching branches, so:

  • Find a solution that fails (e.g. Google.Cloud.Diagnostics.AspNetCore)
  • Close VS
  • git checkout master
  • Rebuild solution in VS – looks fine
  • Close VS
  • git checkout branch with change
  • Rebuild in VS – observe error
  • git checkout master
  • Rebuild solution in VS – looks fine
  • Close VS
  • git checkout branch with change
  • Rebuild in VS – observe error

Yes, I did everything twice, just to make sure it wasn’t an anomaly.

So yes, it’s definitely a problem.

Step 2: Moan on Twitter

This isn’t the most scientific diagnostic tool in the world, but posting on Twitter about the problem did at least reassure me that I wasn’t alone.

Step 3: Narrow down the cause

Even though I observed the problem in Google.Cloud.Diagnostics.AspNetCore, the full error message in VS referred to Google.Cloud.Diagnostics.Common.IntegrationTests. That’s part of the Google.Cloud.Diagnostics.Common solution – the AspNetCore projects depend on the Common projects.

Try step 1 again with Google.Cloud.Diagnostics.Common (just one pass and one failure this time) – yes, it’s still a problem there. That’s useful.

Step 4: Try a workaround

All my source is under c:\Users\Jon\Test\Projects. The “Test” part is a bit of an anomaly, and next time I get a new machine, I’ll probably work without it, but there’s no need to change now. The Projects directory is actually a junction (symlink) to the c:\Users\Jon\Documents\Visual Studio 2015\Projects directory. That’s quite a long path to start with… let’s see if getting rid of the symlink helps.
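For reference, a directory junction like that is what you get from a command along these lines – purely illustrative, and not necessarily how mine was originally set up:

mklink /J c:\Users\Jon\Test\Projects "c:\Users\Jon\Documents\Visual Studio 2015\Projects"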

  • Delete symlink
  • Move the Projects directory from under Documents to directly under Test
  • Try to build: same failure

Okay, it looks like we’ll have to be a bit more subtle.

Step 5: Play spot the difference

Given that we have code which builds on the command line in both the working and failing situations, we can reasonably easily see any differences in generated files.

  • On the command line, go into the Google.Cloud.Diagnostics.Common.IntegrationTests directory
  • Delete bin and obj
  • git checkout master
  • Run dotnet build
  • Copy the resulting bin and obj directories to a “working” directory
  • git checkout branch with change
  • Delete bin and obj
  • Run dotnet build
  • Copy the resulting bin and obj directories to a “broken” directory
  • Run kdiff3 against the working and broken directories

There are lots of differences between the directories, as I’d expect, but given that this is about binding redirects, it’s reasonable to use experience and start scanning for filenames ending with .config.

Sure enough, in the “broken” directory, under obj\net452, there was a file called Google.Cloud.Diagnostics.Common.IntegrationTests.csproj.Google.Cloud.Diagnostics.Common.IntegrationTests.dll.config. That’s not a copy and paste error – it really is a 115-character-long filename, even leaving out any directory parts.

In the file system, the full path is: c:\users\jon\Test\Projects/google-cloud-dotnet\apis\Google.Cloud.Diagnostics.Common\Google.Cloud.Diagnostics.Common.IntegrationTests\obj\Debug\net452\Google.Cloud.Diagnostics.Common.IntegrationTests.csproj.Google.Cloud.Diagnostics.Common.IntegrationTests.dll.config – that’s 266 characters.

Step 6: Try being slightly cheeky: very temporary workaround

Okay, so moving away from the “Documents\Visual Studio 2015” directory didn’t help much, but given that we’re just on the limit, let’s try just renaming “google-cloud-dotnet” to “short” (very temporarily).

Try opening it in Visual Studio – wahoo, it works :) The .config file is generated by Visual Studio correctly.

Step 7: Work out who to blame

So, where’s the bug?

  • It’s a shame that Visual Studio 2017 doesn’t support long filenames, even though the dotnet CLI does
  • Surely we don’t need such a long filename anyway
  • Do I need the project name to be so long?

Looking back to the very first description, let’s have a look at the msbuild file that’s mentioned: C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\MSBuild\15.0\Bin\Microsoft.Common.CurrentVersion.targets

That has:

<GenerateBindingRedirects
  AppConfigFile="@(AppConfigWithTargetPath)"
  TargetName="$(TargetFileName).config"
  OutputAppConfigFile="$(_GenerateBindingRedirectsIntermediateAppConfig)"
  SuggestedRedirects="@(SuggestedBindingRedirects)"
>

So it looks like the problem is _GenerateBindingRedirectsIntermediateAppConfig, which is defined in a property group elsewhere as:

<_GenerateBindingRedirectsIntermediateAppConfig>$(IntermediateOutputPath)$(MSBuildProjectFile).$(TargetFileName).config</_GenerateBindingRedirectsIntermediateAppConfig>

That certainly looks like it’s responsible for the choice of file.
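For this project, those properties expand as follows, which is exactly the 115-character filename we found under obj\Debug\net452:

$(IntermediateOutputPath) = obj\Debug\net452\
$(MSBuildProjectFile)     = Google.Cloud.Diagnostics.Common.IntegrationTests.csproj
$(TargetFileName)         = Google.Cloud.Diagnostics.Common.IntegrationTests.dll

Concatenated (with the extra “.” and “.config”), that’s the monster filename from step 5.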

A quick search for _GenerateBindingRedirectsIntermediateAppConfig shows that I’m not the first person to have run into this – PathTooLongException with _GenerateBindingRedirectsIntermediateAppConfig describes exactly the pain I’ve been going through, and even helpfully references MSBuild should handle paths longer than MAX_PATH.

I’ll add my little bit of feedback to the former issue as soon as I’ve published this post (so I can link to it).

Step 8: Decide what to do

My options appear to be:

  1. Change the project name or the directory structure
  2. Ask everyone on the team to work from a really short root directory name
  3. Downgrade my dependencies again
  4. Live with Visual Studio not building a few solutions properly – we can still work in Visual Studio, so long as we don’t need a full build in it.

I went with option 4, and published this blog post.

Step 9: Embrace community feedback

Those were the only options I considered at the time of original writing, partly because I was somewhat scared of this suggestion, being inexperienced in msbuild. However, combining that suggestion with a Directory.Build.targets file (which I didn’t know about before being tipped off by Nick Guerrera), it was very easy to implement a repo-wide workaround.
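The workaround amounts to a Directory.Build.targets file at the repository root which overrides the property with something shorter. This is an illustrative sketch rather than a copy of the real file – the exact replacement value may differ:

<Project>
  <PropertyGroup>
    <!-- Drop the $(MSBuildProjectFile) part of the intermediate config filename,
         keeping the full path under the Windows path limit. -->
    <_GenerateBindingRedirectsIntermediateAppConfig>$(IntermediateOutputPath)$(TargetFileName).config</_GenerateBindingRedirectsIntermediateAppConfig>
  </PropertyGroup>
</Project>

MSBuild 15 automatically imports Directory.Build.targets into every project under that directory, after the common targets have defined the property, which is what makes a single-file, repo-wide override possible.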

My friend Kirill Osenkov also pointed me at his MSBuild Log Viewer, noting that if it failed in Visual Studio, it would probably fail from msbuild on the command line too. He was right, and the log viewer pinpointed what it was trying to write, which would have saved time.

Points to note

  • Situations where you have a working case and a broken case are great. They can help you validate your problem (does it go away if I revert to the previous commit?) and find the cause (what’s the difference in output between working and broken?)
  • Temporary approaches to double-check the diagnosis (such as me renaming my source directory to “short”) can be useful – don’t forget to undo them though!
  • If you’ve run into a problem, someone else probably has too
  • The .NET community can be really amazing – I was really impressed by the response here
  • Tools are often available – in this case, msbuildlog.com would have saved me quite a bit of time.

Diagnosing a corner-case tooling bug

Unlike the previous posts, which have been based on Noda Time, this one is based on some issues I’ve had with my main project at work: the Google Cloud Client Libraries for .NET.

Background

This is somewhat detailed for those of you who really care about the chain of events. If you don’t, skip to the next subheading.

google-cloud-dotnet is a complicated repo with lots of projects, generated project files, its own Roslyn analyzers, a fairly involved documentation process using docfx, and other bits and pieces. We use Travis and AppVeyor for CI, and the AppVeyor build has historically gone through the documentation build as well.

On the evening of August 17th (UK time), AppVeyor’s “Visual Studio 2017” image changed to include the .NET Core 2.0 SDK by default. This broke us in a few ways:

  • Our appveyor.yml invoked a bash script, but without explicitly invoking bash. I don’t know why it worked before, but the change broke that. Easily fixed.
  • Our docfx build completely broke on AppVeyor due to differences in how msbuild is found or invoked. I don’t understand this yet, but for the moment we’ve had to remove the docfx part of our build from AppVeyor. It works on our development machines – I believe due to the presence of VS2017 itself, rather than just the build tools.
  • Some aspect of building with .NET Core 2.0 broke our Roslyn analyzers, causing them to throw missing method exceptions. I don’t understand enough about the breakage to explain it or say whether it was a bug, but we had to change the analyzers for them to work with the latest compilers. This wasn’t a problem in our command-line builds as we had pinned to the 1.0 SDK using global.json, but it seems VS2017 update 3 uses the latest Roslyn version anyway (reasonably) so building in Visual Studio caused problems.

Unfortunately the solution to the last bullet point caused the analyzers to break under the 1.0 SDK… so we were basically in a no-win situation until we upgraded everything to use the 2.0 SDK. (Just to reassure customers, we’re still targeting netstandard1.0 – just because we’re using the 2.0 SDK doesn’t mean you have to.)

Having updated to the 2.0 SDK, we were now guaranteed to have the C# 7.1 compiler available, so after a project file update to set LangVersion to latest, we could use new features.

The problem for this post: docfx and default literals

Our code has a lot of methods accepting cancellation tokens, and the default is always specified as

CancellationToken cancellationToken = default(CancellationToken)

That’s quite a mouthful – but C# 7.1 default literals can help! This post isn’t about the language feature, but basically it allows you to just use default in places where the C# compiler can work out which type’s default value you mean. So we can change all occurrences of the above to:

CancellationToken cancellationToken = default

Great! A colleague put together a pull request, and I was about to give an LGTM and suggest that he merged it, when I remembered that we weren’t building the documentation on AppVeyor any more. Given that this was the first build using C# 7.1 features, I thought it would be worth just checking that it worked.

It didn’t. It failed with an error message:

Error:Error extracting metadata for c:/Users/jon/Test/Projects/google-cloud-dotnet/apis/Google.Cloud.BigQuery.V2/Google.Cloud.BigQuery.V2/Google.Cloud.BigQuery.V2.csproj: Value cannot be null.
Parameter name: name

This post is a log of the steps I’ve gone through, not trying to fix the bug, but trying to report the problem in a friendly manner.

On Stack Overflow, when I ask for a short but complete program demonstrating the problem, this is the sort of process I expect people to go through.

This won’t be quite as accurate a log as some other posts, as I only started writing it when I was most of the way through the investigation. Hopefully it’ll still be useful.

Step 0: Confirm the problem is real

This is where git is really, really useful. I was very easily able to clone my colleague’s fork, checkout the branch his PR was created from, and go through the build procedure to see the error shown above.

Once I’d done that, I checked out the master branch, pulled to make sure I was up to date, and rebuilt. No problems – so it was definitely the change in that PR.

Step 1: Try the simplest potential fix

We have a file in our repo specifying the version of docfx to use, so trying a different version is trivial. I knew we weren’t using the latest version, and that the latest version used a new version of Roslyn itself. It would make sense for that to fix the problem.

Tried it – no luck. At this point, it seemed unlikely that we’d be able to resolve the problem ourselves, and I was looking for a minimal example to report it.

Step 2: Try to reproduce in a minimal example

A minimal example in docfx is pretty simple. It needs three files:

A csproj for the project:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>netstandard1.0</TargetFramework>
    <LangVersion>latest</LangVersion>
  </PropertyGroup>
</Project>

A docfx.json file, just enough to build the metadata in this case:

{
  "metadata": [{
    "src": [{ "files": ["Project.csproj"] }],
    "dest": "api",
  }]
}

And a source file for it to build:

namespace Project
{
    public class Class1
    {
        public void Foo(int x = default(int)) {}
    }
}

Then I just needed to build the project and then the metadata:

dotnet build && docfx metadata -f

That worked, as expected. I then changed default(int) to just default and built again, hoping to see the failure. Unfortunately, that worked too.

At this point, I have no clue about the differences between the real project which didn’t work and the minimal project that did.

Step 3: find a smaller reproduction from real code

As mentioned in the introduction, the repo contains a lot of projects. The BigQuery project mentioned in the error message contains a lot of code – not an ideal candidate to cut down. Looking through the changes in the PR, I spotted that files in Google.Cloud.Translation.V2 had also been changed. That’s a much smaller project, and would be simpler to work with.

Instead of running through the full build, I just ran the commands to build that specific project and then its documentation. Bingo – same error as before, basically:

Error:Error extracting metadata for C:/Users/jon/Test/Projects/google-cloud-dotnet/apis/Google.Cloud.Translation.V2/Google.Cloud.Translation.V2/Google.Cloud.Translation.V2.csproj: Value cannot be null.
Parameter name: name

Step 4: isolate the smaller reproduction

I copied the broken code away from the git repo, and hacked the project file a bit:

  • I didn’t need the signing key
  • I didn’t need the import that prevented building net45 code on Linux
  • I didn’t need the Roslyn analyzer

I copied and modified the docfx.json file that was used to create the metadata, and tested it again: same error.

At this point, we’ve got a reproduction with “only” 10 C# files and a single project. None of the infrastructure used for the rest of the repo is required. That’s a good start.

Step 5: Gut the code

The aim is to remove as much code as possible, leaving only what’s required to demonstrate the problem.

The project contained two parallel pairs of classes (TranslationClient/TranslationClientImpl and AdvancedTranslationClient/AdvancedTranslationClientImpl). They’re very similar (and really only both exist for backwards compatibility) – it was trivial to remove the “advanced” pair and retest: same failure.

Next, remove the TranslationClientImpl class – we weren’t running any code, so if we could reproduce the problem with the abstract TranslationClient class, that would be fine. Remove all factory methods and the class itself: same failure.

Next, remove the ListLanguages and DetectLanguage methods (and their related friends) – at which point the Language and Detection classes can go, along with LanguageCodes which is really just helper code. Same failure.

We’re down to relatively little code now, but still more than I really want. What’s left is a bunch of calls to do translation. It’s basically a matrix of 8 methods, for every combination of:

  • Translate a single item or multiple items
  • Translate text or HTML
  • Sync or async

The single/multiple part is just implemented in terms of the “single” method calling the “multiple” method and extracting the result, using an expression-bodied member. Only the async methods used the default literal, so I decided to remove the synchronous code first. Same failure.
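For illustration, the shape of that remaining pair was roughly as follows – hypothetical names and signatures rather than the actual Translation API, which returns TranslationResult objects and takes extra parameters:

using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public abstract class TranslationClientSketch
{
    // "Single" overload: an expression-bodied member delegating to the "multiple" overload.
    public virtual async Task<string> TranslateTextAsync(
        string text, CancellationToken cancellationToken = default) =>
        (await TranslateTextAsync(new[] { text }, cancellationToken).ConfigureAwait(false)).Single();

    // "Multiple" overload: the method the single-item call delegates to.
    public abstract Task<IList<string>> TranslateTextAsync(
        IEnumerable<string> textItems, CancellationToken cancellationToken = default);
}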

The HTML and text methods were basically the same in the abstract class (the real differences, small as they are, lie in the implementation) so it was simple to remove the HTML methods. Same failure.

That left two methods: the two async methods, both using the default literal, and one of which (“single”) called the other (“multiple”). I tried removing the “single” method – suddenly, the failure went away. I restored the method, and double-checked: yes, we’re back to seeing the failure.

Still, two methods is reasonable. Let’s reduce those as far as we can, changing them to return strings instead of TranslationResult objects, for example, and removing the TranslationModel? parameter in each case: same failure.

Okay, this looks like it’s enough to reproduce the problem even more minimally.

Step 6: Back to the minimal example code

I went back to the sample project from step 2, with a tiny project file and a tiny docfx.json file. I thought the problem might be one method calling another, so I reproduced that within the test code. (Unfortunately I no longer have the code I tried at this point.)

No failure. Grrr. What was the significant difference between that code and the failing Translation code?

Step 7: Transplant the definitely-failing code

Pretty clueless at this point, I copied the TranslateText methods that were definitely failing in the Translation project into my test project. As I’d removed all the Translation-specific types now, that didn’t cause any compilation problems in itself… but the failure was back. Hooray.

So we now have:

  • Slightly complex code that at least slightly resembles the Translation code
  • A project file that’s nice and simple
  • A docfx file that’s nice and simple

That’s still good: we can now jettison the gutted Translation code, and work just within our minimal test project.

Step 8: Reduce to one method

We still had one method (“single”) calling another (“multiple”). Removing “single” removed the failure; removing “multiple” would cause a compilation failure in “single”. Let’s cut that link.

First, I replaced the expression-bodied member from “single” with a regular method body, awaited a Task.Delay(...) (to avoid a warning about not awaiting anything) and returned null. The failure went away. Weird.

Next, I tried making it a non-async method, and just returned a null task instead. The failure stayed away.

My first “reasonably minimal, single-method” failure code looked like this:

using System.Threading.Tasks;

namespace Project
{
    public class Class1
    {
        /// <param name="x">A parameter</param>
        public virtual async Task<string> Foo(int x = default) => await Task.FromResult<string>(null);
    }
}

I’ve since played around a lot and reduced it somewhat:

using System.Threading.Tasks;

namespace Project
{
    public class Class1
    {
        /// <param name="x">A parameter</param>
        public Task Foo(int x = default) => (Task) null;
    }
}

Note that it’s now no longer an async method. This generates the same error as before, and it’s reliable, but any one of the following changes makes the failure go away:

  • Change default to default(int)
  • Remove the <param> documentation
  • Make it a regular method instead of an expression-bodied method
  • Change (Task) null to just null

That’s a pretty sensitive bug! I’ve still no idea where the problem is, but it’s now ready to tell the docfx team.

Step 9: file a bug

With all of this work done, it was simple to file an issue with the repro code. At this point, I can relax…

Points to note

  • The only chance I really had for resolving this problem myself was updating to the latest version of docfx. The C# code in my project is correct, so it really looks like a bug somewhere (whether in docfx or in one of its dependencies is still up for grabs) but it’s buried further than I’m likely to go.
  • To be as friendly as I can, I knew I had to provide a way for the team to reproduce the problem.
  • In this case, I tried a simple “let’s reproduce from scratch” unsuccessfully before going back to the known-failing code
  • Even though the original scope (“all the code in the repo”) was huge, it was fairly simple to reduce it in stages:
    • Find a sub-project which demonstrated the problem
    • Isolate that sub-project from the rest of the repo infrastructure
    • Cut down that sub-project to a reasonably minimal example
    • Transplant that example into a new, completely minimal project
  • Most of the time I “wasted” was due to the very odd nature of the failure – very small, usually-irrelevant perturbations of the code changed the result in an unexpected way. I could probably have realized this earlier, and started making my changes even smaller between tests.

Diagnosing a single slow test

In the last blog post, I investigated why my Noda Time tests on Travis were running much slower than those on AppVeyor. I resolved a lot of the problem just by making sure I was running release builds on Travis.

That left a single test which takes up about half of the test time though: TzdbDateTimeZoneSourceTest.GuessZoneIdByTransitionsUncached. It seems to run more slowly on .NET Core on Linux than on Windows. The aim of this post is to work out why – at least to the extent of understanding it better than before.

Step 0: Initial assumptions/knowledge and aim

For the moment, I’ll assume that my records of previous investigations are correct. So the situation is:

  • I have near-identical hardware to run tests on: gabriel is running Linux; bagpuss is running Windows.
  • Under a release build, the test takes 19s on gabriel and 4.3s on bagpuss
  • The test does more work on gabriel: it checks 434 zones instead of 135 on bagpuss
  • That still shows a per-test slow-down of 40%
  • I’ve looked at this code before to try to optimize it. In production, this method is only going to be called for one time zone (the system default) so it’s not really a problem – but improving the test speed would be nice. (We can categorize it as a “slow” test if necessary, but I’d rather not.)

Aims of this diagnosis:

  • Find out whether the speed difference is in the test code or the production code that it’s testing. (This test has a lot of assertions.)
  • Find out whether the speed difference is in Noda Time code (e.g. Instant) or in BCL code (e.g. TimeZoneInfo). If it’s in Noda Time code, that could lead to another investigation later, with potential optimizations.
  • Make another attempt at speeding up the test across the board

Step 1: validate assumptions

Let’s make sure we’re still in the same state as we were before. (That seems pretty likely.)

On both machines, pull the current code and run

dotnet run -c Release -f netcoreapp1.0 -- --where=test=~Guess

from the src/NodaTime.Test directory.

Odd: it now takes ~6 seconds on bagpuss, which doesn’t match the 4.3s I recorded when I wrote up the previous post. I didn’t make up the 4.3s result… but it’s entirely possible it was on the wrong machine. Run the tests several times – yes, it’s always around 6 seconds. Odd.

At this point, I nearly gave up. After a little while doing other things, I decided to pursue the rest of the diagnosis anyway… at which point the execution time was back to about 4 seconds. That’s very, very strange. I still don’t know what was going on there.

Brutal honesty time: although I ran the tests at this point, I didn’t spot the discrepancy until I was at the end of step 2. I then looked back in the command shell buffer and saw the 6 second run, which was consistent with the runs at the end of step 2.

Step 2: shorten experiment iteration time

The only test we’re interested in is this one. We’re going to be changing the code occasionally, so let’s ditch everything else.

  • Start a new branch in git
  • Move TzdbDateTimeZoneSourceTest.cs into NodaTime.Test so it’s easier to find. (It was in a subdirectory.)
  • Remove all other tests from TzdbDateTimeZoneSourceTest.cs
  • Delete all the subdirectories in NodaTime.Test
  • Remove everything in NodaTime.Test apart from Program.cs and TzdbDateTimeZoneSourceTest.cs
  • Change the target framework to just netcoreapp1.0 so we don’t need to specify it. (This also requires changing the csproj file to use <TargetFramework> instead of <TargetFrameworks>.)
  • Add a property setting the configuration to Release, to attempt to make it a release build by default
  • Test: run dotnet run – oops, even though it’s running from a Release folder, it’s taking 13 seconds instead of 3 on Windows.
  • Revert previous change, and try dotnet run -c Release – that’s more like it

Great. We can push the commit to github and fetch it on all the relevant machines. (I’m developing on a third machine, just using gabriel and bagpuss to run tests.)

Step 3: Narrow down the time-consuming code

As I mentioned before, the test has a lot of assertions, and they involve a lot of time zone work. Let’s see how much time they’re contributing – by removing them entirely.

  • Comment out everything after the call to GuessZoneIdByTransitionsUncached
  • Run locally just to check it does still run
  • Commit and push
  • Fetch on gabriel/bagpuss and rerun

Result: the checking part of the test is almost free.

Step 4: Remove NUnit

At this point I’m fairly confident that NUnit is irrelevant. I still want to use the NodaTime.Test project, as NodaTime is already configured to expose internal members to it – and GuessZoneIdByTransitionsUncached is internal. However, it’s very easy to remove NUnit – just remove the dependencies, and replace all the code in Program with a simple test:

using NodaTime.TimeZones;
using System;
using System.Diagnostics;
using System.Linq;

namespace NodaTime.Test
{
    class Program
    {
        public static void Main(string[] args)
        {
            var zones = TimeZoneInfo.GetSystemTimeZones().ToList();

            // Fetch the source outside the measurement time
            var source = TzdbDateTimeZoneSource.Default;
            // JIT compile before we start
            source.GuessZoneIdByTransitionsUncached(TimeZoneInfo.Local);

            var stopwatch = Stopwatch.StartNew();
            foreach (var zone in zones)
            {
                source.GuessZoneIdByTransitionsUncached(zone);
            }
            stopwatch.Stop();
            Console.WriteLine($"Zones tested: {zones.Count}");
            Console.WriteLine($"Elapsed time: {stopwatch.Elapsed}");
        }
    }
}

There are a couple of differences here compared with the NUnit test:

  • I’m deliberately fetching TzdbDateTimeZoneSource.Default (which needs to read some resources) outside the measurement
  • I’m deliberately calling GuessZoneIdByTransitionsUncached once outside the measurement to remove JIT compilation time

This still isn’t a good benchmark by any means, but it’s probably sufficient for now.

Commit/push/fetch/run as before.

Result: similar to before, but a little faster. Playing around with the code a little further, it seems that the bullet points above make a difference of about 0.1s each (on bagpuss, at least) – there’s quite a lot of code that wouldn’t have been JIT-compiled before the first call, and loading time zone data is somewhat expensive. NUnit also adds a little bit of overhead, which is understandable.

We now have more accurate data for just GuessZoneIdByTransitionsUncached:

  • bagpuss: 3.6s
  • gabriel: 19.0s (just under 19 in most cases, instead of just over when running NUnit)

Now we need to get into Noda Time code…

Step 5: Gut the production code

First, the source code helpfully points to https://github.com/nodatime/nodatime/issues/686 with some comments from my last investigation.

We don’t care about correctness at this point, just speed. We can return null at any point. At this point you probably need to know a bit about what the method does. The aim is to guess a matching TZDB (aka IANA, tz, zoneinfo or Olson) time zone from a BCL TimeZoneInfo, when we can’t use the names to match. It combines all the time zone transitions (e.g. going into or out of DST) of every TZDB zone for the next 5 years, and then computes a score for each zone by checking how many of the UTC offsets at those transition points match the offsets returned by the BCL zone.
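To make that description concrete, here’s a rough sketch of the idea using Noda Time’s public API. This is hypothetical code – the method name, the five-year approximation and the scoring details are my assumptions for illustration, not the real GuessZoneIdByTransitionsUncached implementation:

using System;
using System.Collections.Generic;
using System.Linq;
using NodaTime;

static class ZoneGuesserSketch
{
    public static string GuessZoneId(TimeZoneInfo bclZone, IEnumerable<DateTimeZone> candidates)
    {
        var zones = candidates.ToList();
        // The period to examine: roughly "now to now + 5 years".
        var start = SystemClock.Instance.GetCurrentInstant();
        var interval = new Interval(start, start + Duration.FromDays(365 * 5));

        // Every transition instant of every candidate zone within that period.
        var transitions = zones
            .SelectMany(zone => zone.GetZoneIntervals(interval))
            .Where(zi => zi.HasStart && zi.Start > start)
            .Select(zi => zi.Start)
            .Distinct()
            .ToList();

        // The BCL zone's UTC offset at each transition point.
        var bclOffsets = transitions
            .Select(t => bclZone.GetUtcOffset(t.ToDateTimeUtc()))
            .ToList();

        // Score each candidate by how many of its offsets match the BCL zone's, and pick the best.
        return zones
            .OrderByDescending(zone => transitions
                .Where((t, i) => zone.GetUtcOffset(t).ToTimeSpan() == bclOffsets[i])
                .Count())
            .First()
            .Id;
    }
}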

Let’s divide the code into 5 sections:

  1. Working out the start and end of this UTC year
  2. Getting all the time zones
  3. Getting all the transitions we’re going to test
  4. Getting all the offsets from the BCL zone
  5. Computing the scores

It’s going to be simplest to measure just 1, then 1+2, then 1+2+3 etc. We can just add return null; after the step we’re interested in. We hope the results will tell us:

  • Which part of the code is slowest (so we can try to optimize it)
  • Which part of the code is slower on Linux than Windows

For simplicity, let’s make gabriel use the same number of time zones as bagpuss, just by adding .Take(135) in Program.cs above first. That way we’re comparing the same amount of work on each machine – although admittedly we don’t know that each time zone takes the same amount of time. Without any change to the production code, gabriel now takes 6s – which is about what we’d expect if the time per zone is uniform, as 19 * 135/434 = 5.91[…].
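Concretely, that’s just a tweak to the first line of the test program shown earlier:

var zones = TimeZoneInfo.GetSystemTimeZones().Take(135).ToList();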

Results on gabriel:

  • Step 1: 0s (it’s free, effectively)
  • Step 1+2: 1.82s
  • Step 1+2+3: 3.15s
  • Step 1+2+3+4: 3.21s
  • Step 1+2+3+4+5: 6s

Results on bagpuss:

  • Step 1: 0s
  • Step 1+2: 1.21s
  • Step 1+2+3: 2.10s
  • Step 1+2+3+4: 2.13s
  • Step 1+2+3+4+5: 3.63s

Or in other words, stepwise:

  • Step 1: 0s on both
  • Step 2: 1.82s on gabriel; 1.21s on bagpuss
  • Step 3: 1.33s on gabriel; 0.89s on bagpuss
  • Step 4: 0.06s on gabriel; 0.03s on bagpuss
  • Step 5: 2.79s on gabriel; 1.50s on bagpuss

No particular smoking gun, unfortunately – no obvious part that would be easily optimized. The last part looks like it shows the worst platform difference though… let’s see if we can dig into that.

Step 6: Pivot to DateTimeZone.GetUtcOffset

Let’s change our test pretty radically:

  • Get a list of all the built in TZDB time zones
  • Create a list of 100,000 instants, starting at 2000-01-01T00:00:00Z and spaced 1 hour apart. That’s enough instants to make the test last for roughly 10 seconds. (Fast enough to iterate, slow enough to give significant results.)
  • For every zone, find the UTC offset for every instant.

To simplify reasoning about what we’re testing, we want to remove time zone caching, too. Time zone computations are inherently fiddly, so Noda Time has a cache for each zone. In GuessZoneIdByTransitionsUncached we were coming up with a new DateTimeZone in each call, so while there’d be some caching involved, it wouldn’t be as much as a simple version of this new test. For the moment, let’s just remove the time zone caching layer entirely:

var zones = DateTimeZoneProviders.Tzdb.GetAllZones()
    .Select(zone => zone is CachedDateTimeZone cached ? cached.TimeZone : zone)
    .ToList();
var start = Instant.FromUtc(2000, 1, 1, 0, 0);
var gap = Duration.FromHours(1);
var instants = Enumerable
    .Range(0, 100000)
    .Select(index => start + gap * index)
    .ToList();

var stopwatch = Stopwatch.StartNew();
foreach (var zone in zones)
{
    foreach (var instant in instants)
    {
        zone.GetUtcOffset(instant);
    }
}
stopwatch.Stop();
Console.WriteLine($"Elapsed time: {stopwatch.Elapsed}");

There’s still one other caching layer that’s relevant, and that’s in terms of conversions between calendrical values and instants… Noda Time is optimized for the years 1900-2100 in the Gregorian calendar. If our test went outside that range, we’d get very different results, hitting different pieces of code very hard. Our original test was looking at 5 years around now, so let’s stick to the optimized range for now.

Results:

  • gabriel: 19.5s
  • bagpuss: 12.5s

Good – we’re still seeing the Linux run being about 50% slower than the Windows run.

Step 7: Break out a profiler!

It’s been a while since I’ve done any profiling in .NET. I had a quick search to see what was available in terms of free tools that handle .NET Core, and came across CodeTrack. I don’t think it runs under Linux, but that’s okay – we’re mostly using it to track down the hotspots first, so we can fine tune our tests.

I’m not specifically advocating CodeTrack – profiling is an art and science about which I know relatively little – but it was easy to get going. 85% of the time was spent in StandardDaylightAlternatingMap.GetZoneInterval, so that looks like a good place to look next.

Step 8: Write actual benchmarks

I have a benchmark suite for Noda Time already, and at this point we’ve got to a stage where it’s probably worth using proper benchmarking tools. BenchmarkDotNet is my framework of choice – so let’s add a couple of benchmarks.

Looking at the code and the profiler, it looks worth testing both StandardDaylightAlternatingMap and ZoneRecurrence. Having added those benchmarks, I might as well put this down until the next day, as the complete benchmark suite runs automatically every night.
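For anyone unfamiliar with BenchmarkDotNet, the shape of such a benchmark is roughly this – an illustrative sketch using the public GetUtcOffset entry point, whereas the real benchmarks target the internal StandardDaylightAlternatingMap and ZoneRecurrence types directly:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using NodaTime;

public class GetUtcOffsetBenchmarks
{
    private static readonly DateTimeZone Zone = DateTimeZoneProviders.Tzdb["Europe/London"];
    private static readonly Instant SampleInstant = Instant.FromUtc(2017, 6, 1, 12, 0);

    [Benchmark]
    public Offset GetUtcOffset() => Zone.GetUtcOffset(SampleInstant);
}

class Program
{
    static void Main() => BenchmarkRunner.Run<GetUtcOffsetBenchmarks>();
}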

Step 9: Compare the benchmarks

The benchmark results are automatically uploaded to the Noda Time web site although the UI for browsing them is somewhat unstable (in terms of URLs etc) – hence the lack of link here.

Looking at the benchmark results, it turns out I’d already got benchmarks for StandardDaylightAlternatingMap, annoyingly enough – so I needed to tidy those up.

However, they do clearly show the difference between .NET Core on Linux and on Windows, so we’ve narrowed the problem down further. That doesn’t show that only that code has different performance, but it’s hopefully a good enough start that we’ll be able to continue and get a short but complete example to provide to the .NET Core CLR team.

Diving deeper into the implementation, ZoneYearOffset makes sense to test too… the further we go down, the less code will be involved.

Step 10: take a step back for now

We’ve made progress, but we’re not all the way there. That’s okay – the progress is recorded (both here and in benchmarks) and in looking at the benchmark that’s taking so long to run, we’ve confirmed that we’re not doing anything stupid. It happens to end up doing a lot more work than any application code normally would – work that would normally only happen once (so can be inefficient without being a problem) is exercised hundreds of times. Such is life – I could decide not to run it in CI if I want, running it occasionally manually and always just before a release.

I’ll come back to this issue at some point in the future, but at this point, other projects have become more important.

Points to note and lessons learned

  • Validate assumptions carefully. I still don’t quite know why there was the blip in execution time, but it was very surprising.
  • Unit tests are no substitute for a good benchmarking framework.
  • Profilers exist, and are useful. No specific recommendations here, but make sure you’re aware of the tools you could benefit from.
  • Sometimes, you need to put down a problem for a while. Try not to lose the information you’ve already discovered – capture it in an issue, comments, a blog post like this one, a private document… anything to preserve information. My head leaks data very, very quickly.

Addendum

I’m still investigating why .NET Core runs this code more slowly on Linux than on Windows, but I’ve fixed this test – PR 953 is a pretty minor refactoring, but it allows the same list of time zones to be passed into each test case. That has two benefits:

  • We don’t need to load the zones from raw data on each test
  • The “get UTC offset at instant X” result is cached in the time zone, so after the first test there’s very little work to do

This very small, internal-only change has halved the total time of the Noda Time tests on Travis. Yay.

Diagnosing slow tests (again)

I’ve previously blogged about a case where tests on .NET Core on Linux were much, much slower than the same tests on Windows. Today I’ve got a very similar problem – but I suspect the cause isn’t going to be the same.

This is my reasonably raw log – skip to the end for a summary of what I found, what went well, and what went badly. Apologies for skipping around in terms of past/present/future tense – being consistent in that respect would be distracting enough to make the rest worse, I think.

Step 0: notice the problem

I spotted this problem because I happened to be looking at the Travis and AppVeyor builds, moving to .NET Core 2.0. Looking at the test runs, I saw that the tests for .NET Core 1.0 were completing in about 15 seconds on AppVeyor (Windows), and 133 seconds on Travis (Linux). My experience is that AppVeyor seems to use slightly less powerful hardware than Travis, making this a very odd result.

I looked at several builds, and it seemed consistent across them – very odd.

Step 1: file an issue

My main focus at that point wasn’t to investigate performance – it was to get the build working with .NET Core 2. However, I didn’t want to forget about the problem, so I filed an issue in the Noda Time repo. At this point, I have no idea where the problem is, so the Noda Time repo is the most obvious place to put it. There’s no evidence that it’s a problem with NUnit or .NET Core itself, for example.

Step 2: reproduce locally

I have two machines at home that I use for running the automated benchmarks. One (bagpuss) runs Windows 10, the other (gabriel) runs Ubuntu 16.04. Both are Intel NuCs with i5-5250U processors – I bought one much later than the other, but looked hard to find as close to identical hardware as I could, to enable cross-OS benchmark comparisons without dual booting or virtualization.

I ran the tests on both machines, and copied the TestResult.xml output files to my main dev box for analysis.

First interesting point: the difference isn’t as marked. The tests take 77 seconds on gabriel and 29 seconds on bagpuss. That’s still a lot, but very different to what we’re seeing in CI. It’s possible that the Travis machine is twice as slow as gabriel/bagpuss and that the AppVeyor machine is twice as fast, but that seems unlikely.

Deem that temporarily out of scope as it’s harder to investigate than the fixed-hardware, easy-access problem I have now. File another issue and I can come back to it.

Step 3: Look for individual test case discrepancies

If we’re really lucky, there’ll be a single test case that has a much longer duration on gabriel than on bagpuss, and I can then delve into it.

Investigation:

  • Create a small console app that loads the two XML files, finds all the test-case elements and puts them in a dictionary from full name to duration (parsed from the duration attribute) – see the sketch after this list
    • Oops: some full names aren’t unique (due to using parameters that don’t override ToString, I suspect).
    • Load into a lookup instead, then convert that into a Dictionary by taking the first test from each group. Unlikely to affect the results
  • For each test that’s in both result sets, find the ratio of gabriel/bagpuss
  • Dump the top 20 ratios, and the slowest 10 tests from each set for good measure.
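The whole thing is small enough that a sketch of it fits here – a reconstruction from memory rather than the original code, with hypothetical file arguments:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

class CompareTestDurations
{
    static void Main(string[] args)
    {
        // args[0] and args[1] are the two TestResult.xml files (e.g. gabriel then bagpuss).
        var gabriel = LoadDurations(args[0]);
        var bagpuss = LoadDurations(args[1]);

        var ratios = gabriel.Keys.Intersect(bagpuss.Keys)
            .Where(name => bagpuss[name] > 0)
            .Select(name => new { name, ratio = gabriel[name] / bagpuss[name] })
            .OrderByDescending(x => x.ratio)
            .Take(20);
        foreach (var entry in ratios)
        {
            Console.WriteLine($"{entry.ratio:0.00}  {entry.name}");
        }
    }

    // Duration (in seconds) of each test case, keyed by full name; duplicates resolved
    // by taking the first test from each group, as described above.
    static Dictionary<string, double> LoadDurations(string file) =>
        XDocument.Load(file)
            .Descendants("test-case")
            .ToLookup(tc => (string) tc.Attribute("fullname"),
                      tc => (double) tc.Attribute("duration"))
            .ToDictionary(g => g.Key, g => g.First());
}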

Okay, lots of text handling tests with ratios of more than 2:1, but they’re all very small durations. I doubt that they’ll be responsible for everything, and could easily just be “measuring small amounts of time is flaky”. Nothing to blame yet.

Point to note: this console app was always less than 50 lines of code. LINQ to XML makes this sort of thing really easy.

Step 4: Look for test fixture discrepancies

If individual tests don’t show a massive change, what about test fixtures? Let’s change the program to load test-suite elements where type="TestFixture", which will get us the durations at a per-test-class level. We’ll start by dumping the slowest 10 fixtures for each environment… and wow, we have a result!

NodaTime.Test.TimeZones.TzdbDateTimeZoneSourceTest takes 65 seconds on gabriel and 17 seconds on bagpuss! That’s out of a total of 77 and 29 seconds respectively – so removing those would take us to 12 seconds on each machine.

Concern: that’s assuming no multithreading is going on. I don’t think NUnit parallelizes by default, but it would definitely make things trickier to reason about.

Step 5: check the logs in more detail

Given that we haven’t seen any very slow individual tests, my guess at this point is that the slowness is going to be in test fixture setup. I haven’t looked at the code yet. Let’s have a look at the XML logs for that one test in more detail.

Yikes: all the difference is in GuessZoneIdByTransitionsUncached. That’s a single method, but it’s parameterized for all system time zones.

At this point, a memory stirs: I’ve looked at this before. I have a distinct memory of being in an airport lounge. At the time I wasn’t trying to work out the OS discrepancy so much as why this one test was taking up over half the total unit test time, even on Windows.

Looking at the code for GuessZoneIdByTransitionsUncached, I see I was even diligent enough to create an issue and keep a bit of a log of what I found. Hooray! So, now we know why this is slow in general. But why is it so much slower on gabriel than on bagpuss?

The answer is that gabriel appears to have rather more system time zones than bagpuss: it’s running 424 tests instead of 135. If each test ran as quickly on gabriel as on bagpuss, gabriel would take about 53 seconds rather than 65 – that’s not so bad.

Step 6: test on CI again

In one fell swoop, we can confirm that this is the problem on Travis and potentially work out why AppVeyor is so fast. If we only run this one test fixture, we can look at how many tests are running on each machine (and in which environments), as I now run the tests on net45 (AppVeyor only), netcoreapp1.0 and netcoreapp2.0.

That’s really easy to do – create a PR which just changes the test filter, let CI run it, then delete the PR.
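The filter change itself is just a tweak to the test command – something along these lines (the exact expression isn’t important):

dotnet run -c Release -f netcoreapp1.0 -- --where=test=~TzdbDateTimeZoneSourceTest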

Results:

  • Travis, both environments run 424 tests – 95 seconds for netcoreapp1.0, 92 seconds for netcoreapp2.0
  • AppVeyor, all environments roughly equal, 135 tests – about 3 seconds

That certainly shows that this is taking the majority of the time on Travis, but we still have two questions:

  • How is AppVeyor so much faster than my Core i7 development machine, where the same tests take about 15 seconds?
  • Travis is still only doing about 3 times as much work as AppVeyor, but it’s taking 30 times as long. Why?

Step 7: spot flag differences

I look at the Travis script. I look at the AppVeyor script. I look at what I’ve been running locally. There’s a difference… on AppVeyor, I’m running with -c Release, whereas locally I’ve used the default (debug). Can a debug build make that much difference?

  • Test locally: running on my dev box with -c Release brings the time down from 15s to 3.9s. Yikes!
  • Update the Travis script to run with -c Release as well

Now the tests take about 30 seconds on Travis. That’s still three times as long as it would be on Windows for the same number of time zones, but it’s much, much closer.

Run the tests with the same flags on gabriel and bagpuss: gabriel takes 19s, bagpuss takes 4.3s. Taking the number of time zones into account (19s × 135/424 ≈ 6.0s of gabriel time for the same 135 zones, against bagpuss’s 4.3s), that suggests that netcoreapp1.0 on Linux is about 40% slower than netcoreapp1.0 on Windows for each test. I can investigate that separately, but it does sound like Travis is just using slower hardware than AppVeyor, contrary to my previous suspicions.

Step 8: update Travis script and merge

Even though this one test takes a large proportion of the test time on Linux (where there are three times as many time zones to run through), it’s not a gating factor. Correctness beats speed in this case – I can live with waiting a few minutes for a complete CI test run.

We still have some investigation to do, but we can already make things definitely better by running the release build on Travis. On a new branch – not the one I’ve been investigating on – create a commit that just changes Travis to use -c Release. Observe the total time go down from 133 seconds to 57 seconds. Yay.

Points to note

  • When you’re running performance tests, don’t compare release with debug builds. Doh. This was a rookie mistake.
  • Reproducing a CI problem locally is key: it’s much, much easier to work with local machines than CI. It’s a faster turnaround, and you get easier access to detailed results.
  • Having multiple machines with the same hardware is really useful for performance comparisons.
  • Don’t assume that what sounds like the same set of tests will run the same number of tests on different platforms – if a test is parameterized, check how many tests are actually running in each case.
  • Ad-hoc tools are great – the console app I wrote for steps 3 and 4 was quick to write, and helped zoom in on the problem very simply.

Further work

Almost every diagnostic investigation seems to “end” with something else to look into. In this case, I want to find out why Linux is slower than Windows for the GuessZoneIdByTransitionsUncached test on a per-test basis. Is this a Noda Time difference, or a BCL difference? That can wait.

Diagnostics everywhere!

For a long time, I’ve believed that diagnostic skills are incredibly important for software engineers, and often poorly understood.

The main evidence I see of poor diagnostic skills is on Stack Overflow:

  • “I have a program that does 10 things, and the output isn’t right. Please fix.”
  • “I can’t post a short but complete program, because my code is long.”
  • “I can’t post a short but complete program, because it’s company code.”

I want to tackle this as best I can. If I can move the needle on the average diagnostic skill level, that will probably have more impact on the world than any code I could write. So, this is my new mission, basically.

Expect to see:

  • Blog posts
  • Podcast interviews
  • Screencasts
  • Conference and user group talks

In particular, I’ve now resolved that when I’m facing a diagnostic situation that at least looks like it’ll be interesting, I’m going to start writing a blog post about it immediately. The blog post will be an honest list of the steps I’ve been through, including those that turned out not to be helpful, to act as case studies. In the process, I hope to improve my own skills (by reflecting on whether the blind alleys were avoidable) and provide examples for other developers to follow as well.

There’s a new “Diagnostics” category on my blog – everything will be there.

Using .NET Core 2.0 SDK on Travis

This is just a brief post that I’m hoping may help some people migrate to use .NET Core 2.0 SDK on Travis. TL;DR: see the end of the post for a sample configuration.

Yesterday (August 15th), .NET Core 2.0 was fully released. Wonderfully, Travis already supports it. You just need dotnet: 2.0.0 in your YAML file.

I decided to experiment with upgrading the Noda Time build to require the .NET Core 2.0 SDK. To be clear, I’m not doing anything in the code that requires 2.0, but it simplifies my build scripts.

Additionally, supporting netcoreapp2.0 means I’ll be able to run my benchmarks against that as well, which is going to be very interesting. However, my tests still target netcoreapp1.0, and that’s where I ran into problems.

Having done the bare minimum to try using 2.0 (edit global.json and .travis.yml) I ran into this error:

The specified framework 'Microsoft.NETCore.App', version '1.0.5' was not found.
  - Check application dependencies and target a framework version installed at:
      /
  - Alternatively, install the framework version '1.0.5'.

That makes sense. Although netcoreapp2.0 is compatible with netstandard1.0 (i.e. you can use libraries targeted to netstandard1.0 in a 2.0 environment) an application targeting netcoreapp1.0 really needs a 1.0 runtime.

So, we need to install just the runtime as well. I’d expected this to be potentially painful, but it’s really not. You just need an addons section in the YAML file:

addons:
  apt:
    sources:
    - sourceline: 'deb [arch=amd64] https://packages.microsoft.com/repos/microsoft-ubuntu-trusty-prod trusty main'
      key_url: 'https://packages.microsoft.com/keys/microsoft.asc'
    packages:
    - dotnet-sharedframework-microsoft.netcore.app-1.0.5

Note that in my case, I want netcoreapp1.0 – if you need netcoreapp1.1, you’d probably install dotnet-sharedframework-microsoft.netcore.app-1.1.2.

So, aside from comments etc, my new Travis configuration will look like this:

language: csharp
mono: none
dotnet: 2.0.0
dist: trusty

addons:
  apt:
    sources:
    - sourceline: 'deb [arch=amd64] https://packages.microsoft.com/repos/microsoft-ubuntu-trusty-prod trusty main'
      key_url: 'https://packages.microsoft.com/keys/microsoft.asc'
    packages:
    - dotnet-hostfxr-1.0.1
    - dotnet-sharedframework-microsoft.netcore.app-1.0.5

script:
  - build/travis.sh

I can now build with the 2.0 SDK, and run tests under both netcoreapp1.0 and netcoreapp2.0.

I’m hoping it’s just as simple on AppVeyor when that supports 2.0 as well…

Upcoming speaking engagements

I’ve got a few speaking engagements coming up that I thought it might be worth publicising a bit further. They’re all within just over a week of each other, which is going to be somewhat tiring, but…

Here they are, in chronological order:

Progressive .NET 2017

Progressive .NET 2017 will be held on September 13th-15th in London.

The agenda is already available, and looks pretty awesome. I’d go into details of who I’m most looking forward to hearing, but it’s too tricky to rank talks with that line-up… Of course, I’m mostly looking forward to catching up with people afterwards. (In particular Carl and Richard, whose talk I won’t be able to attend without leaving my own audience somewhat disappointed.)

This is my most Google-centric talk of the three – it’ll show how you can get going on ASP.NET Core in Google Cloud Platform, with both AppEngine Flexible Environment and Google Container Engine. Basically, I get to show off what my team’s been working on for the last couple of years, which is always fun… although this talk is more about the tools than the libraries I specifically work on.

My colleague Mete Atamel will be approaching a related topic from a different angle – he’ll be going into more detail on the ASP.NET Core + Docker + Kubernetes stack.

.NET Conf

.NET Conf is on September 19th-21st. It’s free and virtual, so there’s basically no barrier to entry here. The precise agenda has yet to be announced, but the speakers shown so far are stellar.

My talk for .NET Conf will be on diagnostics and problem solving: the path from “Something isn’t working” to “Here’s a great, well-researched Stack Overflow question” or “Ah, that’s what was wrong, and I fixed it!” Finding that path is a skill I’m passionate about – expect to hear more from me in all forms over the next few years. (Now that Noda Time is “mostly done” it’s probably going to be my next big goal in personal time.)

Developing a Digital Future

Developing a Digital Future is a one day conference in Moldova, hosted by Amdaris on September 21st.

I’m delighted to be giving the keynote, whose subject will be the name of the conference. What will tech be like in 5, 10, 20 years? How will it change society? How will our jobs as software engineers be different as a result?

In the afternoon, I’ll be doing a round-table event for a subset of attendees – live coding, answering attendee questions etc. Lots of fun!