Category Archives: Diagnostics

Diagnosing an ASP.NET Core hard crash

February 20, 2022 jonskeet 6 Comments

As part of my church A/V system (At Your Service), I run a separate local web server to interact with the Zoom SDK. Initially this was because the Zoom SDK would only run in 32-bit processes and I needed a 64-bit process to handle the memory requirements for the rest of the app. However, it’s also proven useful in terms of keeping the Zoom meeting for a church service alive if At Your Service crashes. Obviously I try hard to avoid that happening, but when interoperating with a lot of native code (LibVLC, NDI, the Stream Deck, PowerPoint via COM) there are quite a few avenues for crashes. The web server runs ASP.NET Core within a WPF application to make it easy to interact with logs while it’s running, and to give the Zoom SDK a normal event dispatcher.

Yesterday when trying to change my error handling code significantly, I found that the web server was crashing hard, with no obvious trace of what’s going on. I’ve already spent a little time trying to figure out what’s going on, but I couldn’t get to the bottom of it. I know the immediate cause of the crash, and I’ve fixed that fairly easily – but I want to harden the web server against any further bugs I might introduce. I figured it would be useful to blog about that process as I went along.

What I know so far

The immediate crash was due to an exception being thrown in an async void method.

Relevant threading aspects:

I start the ASP.NET Core app in a separate thread (although that’s probably unnecessary anyway, now that I think about it) calling IHost.Start
I have handlers for Dispatcher.UnhandledException and TaskScheduler.UnobservedTaskException
I execute all Zoom-specific code on the WPF Dispatcher thread

The immediate error came from code like the following. You can ignore the way this effectively reproduces Web API to some extent… it’s the method body that’s important.

public abstract class CommandBase<TRequest, TResponse> : CommandBase
{
    public override async Task HandleRequest(HttpContext context)
    {
        var reader = new StreamReader(context.Request.Body);
        var text = await reader.ReadToEndAsync();
        var request = JsonUtilities.Parse<TRequest>(text);

        var dispatcher = Application.Current.Dispatcher;
        try
        {
            var response = await dispatcher.Invoke(() => ExecuteAsync(request));
            var responseJson = JsonUtilities.ToJson(response);
            await context.Response.WriteAsync(responseJson);
        }
        catch (ZoomSdkException ex)
        {
            SetExceptionResponse(new ZoomExceptionResponse { /* More info here */ });
        }

        async void SetExceptionResponse(ZoomExceptionResponse response)
        {
            var responseJson = JsonUtilities.ToJson(response);
            await context.Response.WriteAsync(responseJson);
            context.Response.StatusCode = 500;
        }
    }

    public abstract Task<TResponse> ExecuteAsync(TRequest request);
}

There are at least three problems here:

I’m trying to set HttpResponse.StatusCode after writing the body
The SetExceptionResponse method is async void (generally a bad idea)
I’m not awaiting the call to SetExceptionResponse (which I can’t, due to it returning void)

(It’s also a bit pointless having a local method there. This code could do with being rewritten when I don’t have Covid brain fog, but hey…)

The first of these causes an InvalidOperationException to be thrown. The second and third, between them, cause the app to crash. The debugger has been no help here in working out what’s going on.

Create with a console app to start ASP.NET Core

It feels like this should be really easy to demonstrate in a simple console app that does nothing but start a web server which fails in this particular way.

At this stage I should say how much I love the new top-level statements in C# 10. They make simple complete examples an absolute joy. So let’s create a console app, change the Sdk attribute in the project file to Microsoft.NET.Sdk.Web, and see what we can do. I’m aware that with ASP.NET Core 6 there are probably even simpler ways of starting the server, but this will do for now:

using System.Net;

var host = Host.CreateDefaultBuilder()
    .ConfigureWebHostDefaults(builder => builder
        .ConfigureKestrel((context, options) => options.Listen(IPAddress.Loopback, 8080))
        .Configure(application => application.Run(HandleRequest)))
    .Build();
host.Start();
host.WaitForShutdown();

async Task HandleRequest(HttpContext context)
{
    await context.Response.WriteAsync("Testing");
}

Trying to run that initially brought up prompts about IIS Express and trusting SSL certificates – all very normal for a regular web app, but not what I want here. After editing launchSettings.json to a simpler set of settings:

{
"profiles": {
"AspNetCoreCrash": {
"commandName": "Project"
}
}
}

… I can now start the debugger, then open up localhost:8080 and get the testing page. Great.

Reproduce the exception

Next step: make sure I can throw the InvalidOperationException in the same way as the original code. This is easy, just replacing the body of the HandlRequest method:

async Task HandleRequest(HttpContext context)
{
    await context.Response.WriteAsync("Testing");
    context.Response.StatusCode = 500;
}

Sure enough the console logs show that it’s failed as expected:

System.InvalidOperationException: StatusCode cannot be set because the response has already started.
  at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ThrowResponseAlreadyStartedException(String value)
  at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.set_StatusCode(Int32 value)
  at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.Microsoft.AspNetCore.Http.Features.IHttpResponseFeature.set_StatusCode(Int32 value)
  at Microsoft.AspNetCore.Http.DefaultHttpResponse.set_StatusCode(Int32 value)
  at Program.<<Main>$>g__HandleRequest|0_1(HttpContext context) in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\Program.cs:line 19
  at Microsoft.WebTools.BrowserLink.Net.BrowserLinkMiddleware.ExecuteWithFilterAsync(IHttpSocketAdapter injectScriptSocket, String requestId, HttpContext httpContext)
  at Microsoft.AspNetCore.Watch.BrowserRefresh.BrowserRefreshMiddleware.InvokeAsync(HttpContext context)
  at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)

… but (again, as expected) the server is still running. It’s interesting that BrowserLink occurs in the stack trace – I suspect that wouldn’t be the case in my

Let’s try making the failure occur in the same way as in At Your Service:

async Task HandleRequest(HttpContext context)
{
    // In AYS we await executing code in the dispatcher;
    // Task.Yield should take us off the synchronous path.
    await Task.Yield();
    WriteError();

    async void WriteError()
    {
        await context.Response.WriteAsync("Testing");
        context.Response.StatusCode = 500;
    }
}

This time we get a longer stack trace, and the process quits, just like in AYS:

System.InvalidOperationException: StatusCode cannot be set because the response has already started.
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ThrowResponseAlreadyStartedException(String value)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.set_StatusCode(Int32 value)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.Microsoft.AspNetCore.Http.Features.IHttpResponseFeature.set_StatusCode(Int32 value)
   at Microsoft.AspNetCore.Http.DefaultHttpResponse.set_StatusCode(Int32 value)
   at Program.c__DisplayClass0_0.<<<Main>$>g__WriteError|4>d.MoveNext() in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\Program.cs:line 19
--- End of stack trace from previous location ---
   at System.Threading.Tasks.Task.c.b__128_1(Object state)
   at System.Threading.QueueUserWorkItemCallback.c.b__6_0(QueueUserWorkItemCallback quwi)
   at System.Threading.ExecutionContext.RunForThreadPoolUnsafe[TState](ExecutionContext executionContext, Action`1 callback, TState& state)
   at System.Threading.QueueUserWorkItemCallback.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

This happens in both the debugger and when running from the command line.

Setting a break point in the WriteError method shows a stack trace like this:

   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at Program.c__DisplayClass0_0.<<Main>$>g__WriteError|4()
   at Program.<<Main>$>g__HandleRequest|0_1(HttpContext context) in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\Program.cs:line 14
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
   at System.Runtime.CompilerServices.YieldAwaitable.YieldAwaiter.c.b__6_0(Action innerContinuation, Task continuationIdTask)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.ContinuationWrapper.Invoke()
   at System.Runtime.CompilerServices.YieldAwaitable.YieldAwaiter.RunAction(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

There’s nothing about ASP.NET Core in there at all… so I wonder if we can take that out of the equation too?

Reproducing the crash in a pure console app

To recap, I’m expecting at this stage that to reproduce the crash I should:

Write an async void method that throws an exception
Call that method from a regular async method

Let’s try:

#pragma warning disable CS1998 // Async method lacks 'await' operators and will run synchronously
await NormalAsyncMethod();
Console.WriteLine("Done");

async Task NormalAsyncMethod()
{
    await Task.Yield();
    Console.WriteLine("Start ofNormalAsyncMethod");
    BrokenAsyncMethod();
    Console.WriteLine("End of NormalAsyncMethod");
}

async void BrokenAsyncMethod()
{
    await Task.Yield();
    throw new Exception("Bang");
}

Hmm. That exits normally:

$ dotnet run
Start ofNormalAsyncMethod
End of NormalAsyncMethod
Done

But maybe there’s a race condition between the main thread finishing and the problem crashing the process? Let’s add a simple sleep:

#pragma warning disable CS1998 // Async method lacks 'await' operators and will run synchronously
await NormalAsyncMethod();
Thread.Sleep(1000);
Console.WriteLine("Done");
// Remainder of code as before

Yup, this time it crashes hard:

Start ofNormalAsyncMethod
End of NormalAsyncMethod
Unhandled exception. System.Exception: Bang
   at Program.<<Main>$>g__BrokenAsyncMethod|0_1() in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\ConsoleCrash\Program.cs:line 15
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__128_1(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

Interlude: what about async Task?

At this point I’m remembering some of what I’ve learned about how async void methods handle exceptions. What happens if we turn it into an async Task method instead? At that point, the Task returned by the method (which we ignore) will have the exception, and as by default unobserved task exceptions no longer crash the process, maybe we’ll be okay. So just changing BrokenAsyncMethod to:

async Task BrokenAsyncMethod()
{
    throw new Exception("Bang");
}

(and ignoring the warning at the call site)… the program no longer crashes. (I could subscribe to TaskScheduler.UnobservedTaskException but I’m not that bothered… I’m pretty convinced it would fire, at least eventually.)

Do all ThreadPool exceptions crash the app?

We don’t need to use async methods to execute code on the thread pool. What happens if we just write a method which throws an exception, and call that from the thread pool?

ThreadPool.QueueUserWorkItem(ThrowException);
Thread.Sleep(1000);
Console.WriteLine("Done");

void ThrowException(object? state)
{
    throw new Exception("Bang!");
}

Yup, that crashes:

Unhandled exception. System.Exception: Bang!
   at Program.<<Main>$>g__ThrowException|0_0(Object state) in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\ConsoleCrash\Program.cs:line 7
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

At this point some readers (if there still are any…) may be surprised that this is a surprise to me. It’s been a long time since I’ve interacted with the thread pool directly, and taking down the process like this feels a little harsh to me. (There are pros and cons, certainly. I’m not trying to argue that Microsoft made the wrong decision here.)

Can we change the ThreadPool behaviour?

Given that we have things like TaskScheduler.UnobservedTaskException, I’d expect there to be something similar for the thread pool… but I can’t see anything. It looks like this is behaviour that changed with .NET 2.0 – back in 1.x, thread pool exceptions didn’t tear down the application.

After a bit more research, I found AppDomain.UnhandledException. This allows us to react to an exception that’s about to take down the application, but it doesn’t let us mark it as “handled”.

Here’s an example:

AppDomain.CurrentDomain.UnhandledException += (sender, args) =>
    Console.WriteLine($"Unhandled exception: {((Exception)args.ExceptionObject).Message}");
ThreadPool.QueueUserWorkItem(ThrowException);
Thread.Sleep(1000);
Console.WriteLine("Done");

void ThrowException(object? state) =>
    throw new Exception("Bang!");

Running this code a few times, I always get output like this:

Unhandled exception: Bang!
Unhandled exception. System.Exception: Bang!
   at Program.<<Main>$>g__ThrowException|0_1(Object state) in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\ConsoleCrash\Program.cs:line 8
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

… but sometimes “Done” is printed too. So I guess there’s some uncertainty about how quickly the AppDomain is torn down.

Hardening At Your Service

Given what I know now, I don’t think I can easily stop the web server for Zoom interactions from terminating if I have a bug – but I can make it easier to find that bug afterwards. I already normally write the log out to a text file when the app exits, but that only happens for an orderly shutdown.

Fortunately, it looks like the AppDomain.UnhandledException is given enough time to write the log out before the process terminates. Temporarily reverting to the broken code allows me to test that – and yes, I get a log with an appropriate critical error.

Conclusion

I knew that async void methods were generally bad, but I hadn’t quite appreciated how dangerous they are, particularly when executed from a thread pool thread.

While I’m not thrilled with the final result, I at least understand it now, and can find similar errors more easily in the future. The “not understanding” part was the main motivation for this blog post – given that I’d already found the immediate bug, I could have just fixed it and ignored the worrying lack of diagnostic information… but I always find it tremendously unsettling when I can’t explain significant behaviour. It’s not always worth investigating immediately but it’s generally useful to come back to it later on and keep diving deeper until you’ve got to the bottom of it.

I haven’t put the source code for this blog post in GitHub as there are so many iterations – and because it’s all in the post itself. Shout if you would find it valuable, and I’m happy to add it after all.

C#, Diagnostics

Diagnosing a VISCA camera issue

February 17, 2022 jonskeet Leave a comment

As I have mentioned before, I’ve been spending a lot of time over the last two years writing code for my local church’s A/V system. (Indeed, I’ve been giving quite a few user group talks recently about the fun I’ve had doing so.) That new A/V system is called “At Your Service”, or AYS for short. I only mention that because it’s simpler to refer to AYS for the rest of this post than “the A/V app”.

Our church now uses three cameras from PTZOptics. They’re all effectively the same model: 30x optical Zoom, Power over Ethernet, with NDI support (which I use for the actual video capture – more about that in a subsequent blog post). I control the cameras using VISCA over TCP/IP, as I’ve blogged about before. At home I have one PTZOptics camera (again, basically the same camera) and a Minrray UV5120A – also 30x optical zoom with NDI support.

Very occasionally, the cameras run into problems: sometimes the VISCA interface comes up, but there’s no NDI stream; sometimes the camera just doesn’t stop panning after I move it. In the early days, when I would only send a VISCA command when I wanted to actually do something, I believe the TCP/IP connection was closed automatically after being idle for an hour.

For these reasons, I now have two connections per camera:

One for the main control, which includes a “heartbeat” command of “get current pan/tilt/zoom” issued every three minutes.
One for frequent status checking, so that we can tell if the camera itself is online even if the main control connection ends up causing problems. This sends a “get power status” every 5 seconds.

If anything goes wrong, the user can remotely reboot the camera (which creates a whole new TCP/IP connection just in case the existing ones are broken; obviously this doesn’t help if the camera is completely failing in network terms).

This all works pretty well – but my logs sometimes show an error response from the camera – equivalent to an HTTP 400, basically saying “the VISCA command you sent has incorrect syntax”. Yesterday I decided to dedicate some time to working out what was going on, and that’s the topic of this blog post. The actual problem isn’t terribly important – I’d be surprised if more than a handful of readers (if any!) faced the same issue. But I’m always interested in the diagnostic process.

Hack AYS to add more stress

I started off by modifying AYS to make it send more commands to the camera, in two different ways:

I modified the frequency of the status check and heartbeat
I made each of those also send 100 commands instead of just one
- First sending the 100 commands one at a time
- Then changed to sending them as 100 tasks started at the same time and then awaited as a bunch

This increased the rate of error – but not in any nicely reproducible way.

Looking back, I probably shouldn’t have started here. AYS does a lot of other stuff, including quite a lot of initialization on startup (VLC, OBS, Zoom) which makes the feedback loop a lot slower when experimenting. I needed something that was dedicated to just provoking the problem.

Create a dedicated stress test app

It’s worth noting at this point that I’ve got a library (source here) that I’m using to handle the TCP connections and send the commands. This keeps all the knowledge of exactly how VISCA works away from AYS, and makes it reasonably easy to write test code separately from the app. However, it does mean that when something goes wrong, my first expectation is that this is a problem in the library. I always assume my own code is broken before placing the blame elsewhere… and usually that assumption is well-founded.

Despite the changes to AYS not provoking things as much as I’d expected, I still thought concurrency would be the key. How could we overload the camera?

I wrote a simple stress test app to send lots of simple “get power status” requests at the camera, from multiple tasks, and record the number of successful commands vs ones which failed for various different reasons (camera responded with an error; camera violated VISCA itself; command timed out). Initially, I found that some tasks had issued far more requests than others – this turned out to be due to the way that the thread pool (used by tasks in a console app) starts small and gradually expands. A significant chunk of the stress test app is given over to getting all the tasks “ready to go” and releasing them all at the same time.

The stress test app is configurable in a number of ways:

The camera IP address and VISCA port
Whether to use a single controller object (which means a single TCP connection) or one per task. (The library queues concurrent requests, so any given TCP connection only has a single active command at a time.)
How long each task should delay between requests
How long each command should wait before timing out
How long the test should last overall

At first, the results seemed all over the place. With a single controller, and enough of a delay (or few enough tasks) to avoid commands timing out due to the queuing, I received no errors. With multiple controllers, I saw all kinds of results – some tasks never managing a single request, others with just a few errors, and lots of timeouts.

Oh, and if I overwhelmed the camera too much, it would just reset itself during the test (which involves it tilting up to the ceiling then back down to horizontal). Not quite “halt and catch fire”, but a very physically visible indication of an error. I’m pretty confident I haven’t actually damaged the camera though, and it seems fairly reasonable for it to reset itself in the face of a sort of DoS attack like this stress test.

However, I did spot that I never had a completely clean test with more than one task. I wondered: how low could I take the load and still see problems?

Even with just two tasks, and a half second delay between attempts, I’d reliably see one task have a single error response, and the other task have a command timeout. Aha – we’re getting somewhere.

On a hunch (so often the case with diagnostics) I added console output in the catch blocks to include the time at which the exception was thrown. As I’d started to suspect, it was the very first command that failed, and it failed for both tasks. Now we’re getting somewhere.

What’s special about the first command? We’ve carefully set the system up to start all the tasks sending commands at exactly the same time (as near as the scheduler will allow, anyway) – they may well get out of sync after the first command, as different responses take slightly different lengths of time to come back, but the first commands should be sent pretty much simultaneously.

At this point, I broke out Wireshark – an excellent network protocol analyzer, allowing me to see individual TCP packets. Aside from anything else, this could help to rule out client-side issues: if the TCP packets I sent looked correct, the code sending them was irrelevant.

Sure enough, I saw:

Two connections being established
The same 5 bytes (the VISCA command: 81 09 04 00 ff) being sent on each connection in two packets about 20 microseconds apart.
A 4 byte response (a VISCA error: 90 60 02 ff) being sent down the first connection 10ms later
No data being sent down the second connection

That explained the “error on one, timeout on the other” behaviour I’d observed at the client side.

Test the other camera

Now that I could reliably reproduce the bug, I wondered – what does the Minrray camera do?

I expected it to have the same problem. I’m aware that there are a number of makes of camera that use very similar hardware – PTZOptics, Minrray, SMTAV, Zowietek for example. While they all have slightly different firmware and user interfaces, I expected that the core functionality of the camera would be common code.

I was wrong I have yet to reproduce the issue with the Minrray camera – even throwing 20 tasks at it, with only a millisecond delay between commands, I’ve seen no errors.

Report the bug

So, it looks like this may be a bug in the PTZOptics firmware. At first I was reluctant to go to the trouble of reporting the bug, given that it’s a real edge case. (I suspect relatively few people using these cameras connect more than one TCP stream to them over VISCA.) However, I received encouragement on Twitter, and I’ve had really good support experiences with PTZOptics, so I thought it worth having a go.

While my stress test app is much simpler than AYS, it’s definitely not a minimal example to demonstrate the problem. It’s 100 lines long and requires a separate library. Fortunately, all of the code for recording different kinds of failures, and starting tasks to loop for a while, etc – that’s all unnecessary when trying to just demonstrate the problem. Likewise I don’t need to worry about queuing commands on a single connection if I’m only sending a single command down each; I don’t need any of the abstraction that my library contains.

Instead, I boiled it down to this – 30 lines (of which several are comments or whitespace) that just sends two commands “nearly simultaneously” on separate connections and shows the response returned on the the first connection.

using System.Net.Sockets;

string host = "192.168.1.45";
int port = 5678;

byte[] getPowerCommand = { 0x81, 0x09, 0x04, 0x00, 0xff };

// Create two clients - these will connect immediately
var client1 = new TcpClient(host, port);
client1.NoDelay = true;
var stream1 = client1.GetStream();

var client2 = new TcpClient(host, port);
client2.NoDelay = true;
var stream2 = client2.GetStream();

// Write the "get power" command to both sockets as close to simultaneously as we can
stream1.Write(getPowerCommand);
stream2.Write(getPowerCommand);

// Read the response from the first client
var buffer = new byte[10];
int bytesRead = stream1.Read(buffer);
// Print it out in hex.
// This is an error: 90-60-02-FF
// This is success: 90-50-02-FF
Console.WriteLine(BitConverter.ToString(buffer[..bytesRead]));

// Note: this sample doesn't read from stream2, but basically when the bug strikes,
// there's nothing to read: the camera doesn't respond.

Simple but effectively – it reliably reproduces the error on the PTZOptics camera, and shows no problems on the Minrray.

I included that with the bug report, and have received a response from PTZOptics already, saying they’re looking into it. I don’t expect a new firmware version with a fix in any time soon – but I hope it might be in their next firmware update… and at least now I can test that easily.

Conclusion

This was classic diagnostic work:

Go from complex code to simple code
Try lots of configurations to try to make sense of random-looking data
Follow hunches and get more information
Use all the tools at your disposal (Wireshark in this case) to isolate the problem as far as possible
Once you understand the problem (even if you can’t fix it), write code designed specifically to reproduce it simply

Hope you found this as interesting as I did! Next (this weekend, hopefully): displaying the video output of these cameras on a Stream Deck…

C#, Diagnostics, General

Travis logs and .NET Core console output

July 19, 2020 jonskeet 5 Comments

This is a blog post rather than a bug report, partly because I really don’t know what’s at fault. Others with more knowledge of how the console works in .NET Core, or exactly what the Travis log does, might be able to dig deeper.

TL;DR: If you’re running jobs using .NET Core 3.1 on Travis and you care about the console output, you might want to set the TERM environment variable to avoid information being swallowed.

Much of my time is spent in the Google Cloud Libraries for .NET repository. That single repository hosts a lot of libraries, and many of the pull requests are from autogenerated code where the impact on the public API surface may not be immediately obvious. (It would be easy to miss one breaking change within dozens of comment changes, for example.) Our Travis build includes a job to work out the public API changes, which is fantastically useful. (Example)

When we updated our .NET Core SDK to 3.1 – or at least around that time; it may have been coincidence – we noticed that some of the log lines in our Travis jobs seemed to be missing. They were actually missing from all the jobs, but it was particularly noticeable for that “change detection” job because the output can often be small, but should always contain a “Diff level” line. It’s really obvious when that line is missing.

I spent rather longer trying to diagnose what was going wrong than I should have done. A colleague noted that clicking on “Raw log” showed that we were getting all the output – it’s just that Travis was swallowing some of it, due to control characters being emitted. This blog post is a distillation of what I learned when trying to work out what was going on.

A simple set of Travis jobs

In my DemoCode repository I’ve created a Travis setup for the sake of this post.

Here are the various files involved:

.travis.yml:

dist: xenial  

language: csharp  
mono: none  
dotnet: 3.1.301  

jobs:  
  include:  
    - name: "Default terminal, no-op program"  
      script: TravisConsole/run-dotnet.sh 0  

    - name: "Default terminal, write two lines"  
      script: TravisConsole/run-dotnet.sh 2  

    - name: "Mystery terminal, no-op program"  
      env: TERM=mystery  
      script: TravisConsole/run-dotnet.sh 0  

    - name: "Mystery terminal, write two lines"  
      env: TERM=mystery  
      script: TravisConsole/run-dotnet.sh 2  

    - name: "Mystery terminal, write two lines, no logo"  
      env: TERM=mystery DOTNET_NOLOGO=true  
      script: TravisConsole/run-dotnet.sh 2

TravisConsole/run-dotnet.sh:

#!/bin/bash  

set -e  

cd $(readlink -f $(dirname ${BASH_SOURCE}))  

echo "Before dotnet run (first)"  
dotnet run -- $1  
echo "After dotnet run (first)"  

echo "Before dotnet run (second)"  
dotnet run -- $1  
echo "After dotnet run (second)"

TravisConsole/Program.cs:

using System;  

class Program  
{  
    static void Main(string[] args)  
    {  
        int count = int.Parse(args[0]);  
        for (int i = 1; i <= count; i++)  
        {  
             Console.WriteLine($"Line {i}");  
        }  
    }  
}

So each job runs the same .NET Core console application twice with the same command line argument – either 0 (in which case nothing is printed out) or 2 (in which case two it prints out “Line 1” then “Line 2”). The shell script also logs before and after executing the console application. The only other differences are in terms of the environment variables:

Some jobs use TERM=mystery instead of the default
The final job uses DOTNET_NOLOGO=true

I’ll come back to the final job right at the end – we’ll concentrate on the impact of the TERM environment variable first, as that’s the main point of the post. Next we’ll look at the output of the jobs – in each case showing it in the “pretty” log first, then in the “raw” log. The pretty log has colour, and I haven’t tried to reproduce that. I’ve also only shown the relevant bit – the call to run-dotnet.sh.

You can see all of the output shown here in the Travis UI, of course.

Job 1: Default terminal, no-op program

Pretty log

$ TravisConsole/run-dotnet.sh 0
Before dotnet run (first)
Welcome to .NET Core 3.1!
---------------------
SDK Version: 3.1.301
----------------
Explore documentation: https://aka.ms/dotnet-docs
Report issues and find source on GitHub: https://github.com/dotnet/core
Find out what's new: https://aka.ms/dotnet-whats-new
Learn about the installed HTTPS developer cert: https://aka.ms/aspnet-core-https
Use 'dotnet --help' to see available commands or visit: https://aka.ms/dotnet-cli-docs
Write your first app: https://aka.ms/first-net-core-app
--------------------------------------------------------------------------------------
Before dotnet run (second)
The command "TravisConsole/run-dotnet.sh 0" exited with 0.

Note the lack of After dotnet run in each case.

Raw log

[0K$ TravisConsole/run-dotnet.sh 0
Before dotnet run (first)

Welcome to .NET Core 3.1!

---------------------

SDK Version: 3.1.301

----------------

Explore documentation: https://aka.ms/dotnet-docs

Report issues and find source on GitHub: https://github.com/dotnet/core

Find out what's new: https://aka.ms/dotnet-whats-new

Learn about the installed HTTPS developer cert: https://aka.ms/aspnet-core-https

Use 'dotnet --help' to see available commands or visit: https://aka.ms/dotnet-cli-docs

Write your first app: https://aka.ms/first-net-core-app

--------------------------------------------------------------------------------------
[?1h=[?1h=[?1h=[?1h=[?1h=[?1h=[?1h=After dotnet run (first)
Before dotnet run (second)
[?1h=[?1h=[?1h=[?1h=[?1h=[?1h=[?1h=After dotnet run (second)
travis_time:end:18aa556c:start=1595144448336834755,finish=1595144452475616837,duration=4138782082,event=script
[0K[32;1mThe command "TravisConsole/run-dotnet.sh 0" exited with 0.[0m

In the raw log, we can see that After dotnet run is present each time, but with [?1h=[?1h=[?1h=[?1h=[?1h=[?1h=[?1h= before it. Let’s see what happens when our console application actually writes to the console.

Job 2: Default terminal, write two lines

Pretty log

$ TravisConsole/run-dotnet.sh 2
Before dotnet run (first)
Welcome to .NET Core 3.1!
---------------------
SDK Version: 3.1.301
----------------
Explore documentation: https://aka.ms/dotnet-docs
Report issues and find source on GitHub: https://github.com/dotnet/core
Find out what's new: https://aka.ms/dotnet-whats-new
Learn about the installed HTTPS developer cert: https://aka.ms/aspnet-core-https
Use 'dotnet --help' to see available commands or visit: https://aka.ms/dotnet-cli-docs
Write your first app: https://aka.ms/first-net-core-app
--------------------------------------------------------------------------------------
Line 2
Before dotnet run (second)
Line 2
The command "TravisConsole/run-dotnet.sh 2" exited with 0.

This time we don’t have After dotnet run – and we don’t have Line 1 either. As expected, they are present in the raw log, but with control characters before them:

[0K$ TravisConsole/run-dotnet.sh 2
Before dotnet run (first)

Welcome to .NET Core 3.1!

---------------------

SDK Version: 3.1.301

----------------

Explore documentation: https://aka.ms/dotnet-docs

Report issues and find source on GitHub: https://github.com/dotnet/core

Find out what's new: https://aka.ms/dotnet-whats-new

Learn about the installed HTTPS developer cert: https://aka.ms/aspnet-core-https

Use 'dotnet --help' to see available commands or visit: https://aka.ms/dotnet-cli-docs

Write your first app: https://aka.ms/first-net-core-app

--------------------------------------------------------------------------------------
[?1h=[?1h=[?1h=[?1h=[?1h=[?1h=Line 1
Line 2
[?1h=After dotnet run (first)
Before dotnet run (second)
[?1h=[?1h=[?1h=[?1h=[?1h=[?1h=Line 1
Line 2
[?1h=After dotnet run (second)
travis_time:end:00729828:start=1595144445905196926,finish=1595144450121508733,duration=4216311807,event=script
[0K[32;1mThe command "TravisConsole/run-dotnet.sh 2" exited with 0.[0m

Now let’s try with the TERM environment variable set.

Job 3: Mystery terminal, no-op program

$ TravisConsole/run-dotnet.sh 0
Before dotnet run (first)
Welcome to .NET Core 3.1!
---------------------
SDK Version: 3.1.301
----------------
Explore documentation: https://aka.ms/dotnet-docs
Report issues and find source on GitHub: https://github.com/dotnet/core
Find out what's new: https://aka.ms/dotnet-whats-new
Learn about the installed HTTPS developer cert: https://aka.ms/aspnet-core-https
Use 'dotnet --help' to see available commands or visit: https://aka.ms/dotnet-cli-docs
Write your first app: https://aka.ms/first-net-core-app
--------------------------------------------------------------------------------------
After dotnet run (first)
Before dotnet run (second)
After dotnet run (second)
The command "TravisConsole/run-dotnet.sh 0" exited with 0.

That’s more like it! This time the raw log doesn’t contain any characters within the script execution itself. (There are still blank lines in the “logo” part, admittedly. Not sure why, but we’ll get rid of that later anyway.)

[0K$ TravisConsole/run-dotnet.sh 0
Before dotnet run (first)

Welcome to .NET Core 3.1!

---------------------

SDK Version: 3.1.301

----------------

Explore documentation: https://aka.ms/dotnet-docs

Report issues and find source on GitHub: https://github.com/dotnet/core

Find out what's new: https://aka.ms/dotnet-whats-new

Learn about the installed HTTPS developer cert: https://aka.ms/aspnet-core-https

Use 'dotnet --help' to see available commands or visit: https://aka.ms/dotnet-cli-docs

Write your first app: https://aka.ms/first-net-core-app

--------------------------------------------------------------------------------------
After dotnet run (first)
Before dotnet run (second)
After dotnet run (second)
travis_time:end:11222e41:start=1595144449188901003,finish=1595144453242229433,duration=4053328430,event=script
[0K[32;1mThe command "TravisConsole/run-dotnet.sh 0" exited with 0.[0m

Let’s just check that it still works with actual output:

Job 4: Mystery terminal, write two lines

Pretty log

4.45s$ TravisConsole/run-dotnet.sh 2
Before dotnet run (first)
Welcome to .NET Core 3.1!
---------------------
SDK Version: 3.1.301
----------------
Explore documentation: https://aka.ms/dotnet-docs
Report issues and find source on GitHub: https://github.com/dotnet/core
Find out what's new: https://aka.ms/dotnet-whats-new
Learn about the installed HTTPS developer cert: https://aka.ms/aspnet-core-https
Use 'dotnet --help' to see available commands or visit: https://aka.ms/dotnet-cli-docs
Write your first app: https://aka.ms/first-net-core-app
--------------------------------------------------------------------------------------
Line 1
Line 2
After dotnet run (first)
Before dotnet run (second)
Line 1
Line 2
After dotnet run (second)
The command "TravisConsole/run-dotnet.sh 2" exited with 0.

Exactly what we’d expect from inspection. The raw log doesn’t hold any surprises either.

Raw log

[0K$ TravisConsole/run-dotnet.sh 2
Before dotnet run (first)

Welcome to .NET Core 3.1!

---------------------

SDK Version: 3.1.301

----------------

Explore documentation: https://aka.ms/dotnet-docs

Report issues and find source on GitHub: https://github.com/dotnet/core

Find out what's new: https://aka.ms/dotnet-whats-new

Learn about the installed HTTPS developer cert: https://aka.ms/aspnet-core-https

Use 'dotnet --help' to see available commands or visit: https://aka.ms/dotnet-cli-docs

Write your first app: https://aka.ms/first-net-core-app

--------------------------------------------------------------------------------------
Line 1
Line 2
After dotnet run (first)
Before dotnet run (second)
Line 1
Line 2
After dotnet run (second)
travis_time:end:0203f787:start=1595144444502091825,finish=1595144448950945977,duration=4448854152,event=script
[0K[32;1mThe command "TravisConsole/run-dotnet.sh 2" exited with 0.[0m

Job 5: Mystery terminal, write two lines, no logo

While job 4 is almost exactly what we want, it’s still got the annoying “Welcome to .NET Core 3.1!” section. That’s a friendly welcome for users in an interactive context, but pointless for continuous integration. Fortunately it’s now easy to turn off by setting DOTNET_NOLOGO=true. We now have exactly the log we’d want:

Pretty log

$ TravisConsole/run-dotnet.sh 2
Before dotnet run (first)
Line 1
Line 2
After dotnet run (first)
Before dotnet run (second)
Line 1
Line 2
After dotnet run (second)
The command "TravisConsole/run-dotnet.sh 2" exited with 0.

Raw log

[0K$ TravisConsole/run-dotnet.sh 2
Before dotnet run (first)
Line 1
Line 2
After dotnet run (first)
Before dotnet run (second)
Line 1
Line 2
After dotnet run (second)
travis_time:end:0bb5a6d4:start=1595144448986411002,finish=1595144453476210113,duration=4489799111,event=script
[0K[32;1mThe command "TravisConsole/run-dotnet.sh 2" exited with 0.[0m

Conclusion

The use of mystery as the value of the TERM environment variable isn’t special, other than “not being a terminal that either Travis or .NET Core will have any fixed expectations about”. I expect that .NET Core is trying to be clever with its output based on the TERM environment variable, and that Travis isn’t handling the control characters in quite the way that .NET Core expects it to. Which one is right, and which one is wrong? It doesn’t really matter to me, so long as I can fix it.

This does potentially have a cost, of course. Anything which would actually produce prettier output based on the TERM environment variable is being hampered by this change. But so far we haven’t seen any problems. (It certainly isn’t stopping our Travis logs from using colour, for example.)

I discovered the DOTNET_NOLOGO environment variable – introduced in .NET Core 3.1.301, I think – incidentally while researching this problem. It’s not strictly related to the core problem, but it is related to the matter of “making CI logs readable” so I thought I’d include it here.

I was rather surprised not to see complaints about this all over the place. As you can see from the code above, it’s not like I’m doing anything particularly “special” – just writing lines out to the console. Are other developers not having the same problem, or just not noticing the problem? Either way, I hope this post helps either the .NET Core team to dive deeper, find out what’s going on and fix it (talking to the Travis team if appropriate), or at least raise awareness of the issue so that others can apply the same workaround.

Diagnostics, General, Noda Time

Using “git bash” from AppVeyor

October 12, 2019 jonskeet 2 Comments

Update: I don’t know whether it was partially due to this blog post or not, but AppVeyor has fixed things so that you don’t (currently, 20th October 2019) need to use the fix in this post. You may want to include it anyway, for the sake of future-proofing.

TL;DR: If your AppVeyor build starts breaking because it’s started using WSL bash, change the path in your YAML file – see the end of the post for an example.

For years now, I’ve used bash scripts for all kinds of automation in Windows projects. The version of bash I use is the one that comes with Git for Windows – I believe its origins include Cygwin, MSYS2, and MinGW-w64. (I don’t know enough about the differences between those projects or which builds on which to say more. Fortunately, I don’t need to.) This version of bash is installed by default AppVeyor, the CI system I typically use for Windows builds, so I don’t need to do anything else.

Importantly, I don’t want to use Windows Subsystem for Linux (WSL) on Windows builds. The point of doing the build is to use the Windows tool chains. I use Travis for doing Linux builds.

On October 11th 2019, my Noda Time AppVeyor build failed with this error:

build/appveyor.sh: line 11: dotnet: command not found

It turns out this is because AppVeyor has started shipping WSL with its Visual Studio 2019 images. The bash from WSL is earlier in the path than the version of bash from git, so that one is used, and everything starts failing.

It took a little while to diagnose this, but the fix is pretty easy – you just need to put git bash earlier in your path. I chose to do this in the “install” part of appveyor.yml:

install:
  # Make sure we get the bash that comes with git, not WSL bash
  - ps: $env:Path = "C:\Program Files\Git\bin;$env:Path"

Using just that change, the build started working again. Hooray!

Diagnostics, Google, Noda Time

Hosting ASP.NET Core behind https in Google Kubernetes Engine

March 17, 2019 jonskeet 11 Comments

Side-note: this may be one of the clumsiest titles I’ve ever written for a blog post. But it does what it says on the tin. Oh, and the space after “ASP” in “ASP .NET Core” everywhere it to avoid auto-linking. While I could use a different dot or a zero-width non-breaking space to avoid it, I’m not sure I trust WordPress to do the right thing with those…

Background

Over the past few weeks, I’ve moved nodatime.org, csharpindepth.com and jonskeet.uk over to Google Kubernetes Engine. (They all used to be hosted on Azure.)

I’ve done this for a few reasons:

As my job is primarily making .NET developers more productive on Google Cloud Platform, it feels natural to run my own code there. I want to see where there are friction points, so I can help fix them.
I wanted more redundancy, particularly for nodatime.org; Kubernetes felt a simple way of managing that at a reasonable cost
HTTPS certificate management (via Let’s Encrypt) has been a bit painful for me on Azure; I could have automated more, but that would have taken extra time I don’t have. (It may also have improved since I last looked.)

The first of these is the most important, by a long way. But the HTTPS management part – and then the knock-on effects – is what I’m looking at in this blog post.

Basic hosting

Hosting an ASP .NET Core application in Google Kubernetes Engine (GKE from now on) is really simple, at least once you’ve understood the Kubernetes concepts. I have:

A single cluster
A single ingress backed by a Google Cloud Load Balancer for all my domains, with a reserved static IP address (set up in DNS)
A service per application (nodatime.org example)
A deployment per application (nodatime.org example)

In each case, the ASP .NET Core application is built with a vanilla Dockerfile which would not look unusual to anyone who’s hosted ASP .NET Core in Docker anywhere else.

I happen to use Google Cloud Build to build the Docker images, and Google Container Registry to host the images, but neither of those are required. (For csharpindepth.com and jonskeet.uk there are simple triggers in Google Cloud Build to build and deploy on GitHub pushes; for nodatime.org it’s a bit more complicated as the documentation build currently has some Windows dependencies. I have a machine at home that polls GitHub every half hour, and pushes the built application to Google Cloud Build for packaging when necessary.)

So, that gets HTTP hosting sorted. I dare say there are some aspects I’ve not configured as well as I could have done, but it was reasonably straightforward to get going.

HTTPS with Google-managed certificates

With HTTP working, it’s time to shoot for HTTPS. It’s important to note that the apps I’m talking about are all hobby projects, not commercial ones – I’m already paying for hosting, so I don’t want to have to pay for SSL certificates as well. Enter Let’s Encrypt, of course.

A while ago I used Let’s Encrypt to set up HTTPS on Azure, and while it was free and I didn’t have to write any code, it wasn’t exactly painless. I followed two guides at the same time, because neither of them exactly matched the Azure portal I was looking at. There were lots of bits of information to grab from various different bits of the portal, and it took a couple of attempts to get right… but I got there. I also set up a web job to renew the certificates, but didn’t go through the hoops required to run those web jobs periodically. (There were instructions, but it looked like they’d take a while to work through compared with just manually running the web job every couple of months or so. I decided to take the pragmatic approach, knowing that I was expecting to move to GKE anyway. If Azure had been the expected permanent home for the apps, I’d have gone through the steps and I’m sure they’d have worked fine.) I don’t know which guide I worked through at the time, but if I were starting today I’d probably try Scott Hanselman’s guide.

So, what can I do on Google Cloud Platform instead? I decided to terminate the SSL connection at the load balancer, using Google-managed certificates. To be really clear, these are currently in beta, but have worked well for me so far. Terminating the SSL connection at the load balancer means that the load balancer forwards the request to the Kubernetes service as an HTTP request, not HTTPS. The ASP .NET Core app itself only exposes an HTTP port, so it doesn’t need to know any details of certificates.

The steps to achieve this are simple, assuming you have the Google Cloud SDK (gcloud) installed already:

Create the certificate, e.g.
gcloud beta compute ssl-certificates create nodatime-org --domains nodatime.org
Attach the certificate to the load balancer, via the Kubernetes ingress in my case, with an annotation in the ingress metadata:
ingress.gcp.kubernetes.io/pre-shared-cert: "nodatime-org"
Apply the modifications to the ingress:
kubectl apply -f ingress.yaml
Wait for the certificate to become valid (the provisioning procedure takes a little while, and I’ve seen some interesting errors while that’s taking place)
Enjoy HTTPS support, with auto-renewing certificates!

There are only two downsides to this that I’ve experienced so far:

Currently each certificate can only be associated with a single domain. For example, I have different certificates for nodatime.org, http://www.nodatime.org and test.nodatime.org. (More about the last of these later.) This is a minor annoyance, but the ingress supports multiple pre-shared certificates, so it has no practical implications for me.
I had to accept some downtime on HTTPS when transferring from Azure to GKE, while the certificate was provisioning after I’d transferred the DNS entry. This was a one-time issue of course, and one that wouldn’t affect most users.

Beyond the basics

At this point I had working HTTPS URLs – but any visitor using HTTP would stay that way. (At the time of writing this is still true for csharpindepth.com and jonskeet.uk.) Obviously I’d like to encourage secure browsing, so I’d like to use the two pieces of functionality provided by ASP .NET Core:

Redirection of HTTP requests via app.UseHttpsRedirection()
HSTS support via app.UseHsts()

I should note here that the Microsoft documentation was fabulously useful throughout. It didn’t quite get me all the way, but it was really close.

Now, I could have just added those calls into the code and deployed straight to production. Local testing would have worked – it would have redirected from localhost:5000 on HTTP to localhost:5001 on HTTPS with no problems. It would also have failed massively for reasons we’ll look at in a minute. Just for a change, I happened to do the right thing…

For hosting changes, always use a test deployment first

In Azure, I had a separate AppService I could deploy to, called nodatimetest. It didn’t have a fancy URL, but it worked okay. That’s where I tested Azure-specific changes before deploying to the real AppService. Unfortunately, it wouldn’t have helped in this situation, as it didn’t have a certificate.

Fortunately, creating a new service in Kubernetes, adding it to the ingress, and creating a managed certificate is so easy that I did do this for the new hosting – and I’m so glad I did so. I use a small script to publish the local ASP .NET Core build to Google Cloud Build which does the Docker packaging, pushes it to Google Container Registry and updates the Kubernetes deployment. As part of that script, I add a small text file containing the current timestamp so I can check that I’m really looking at the deployment I expect. It takes just under two minutes to build, push, package, deploy – not a tight loop you’d want for every day development, but pretty good for the kind of change that can’t be tested locally.

So, I made the changes to use HTTPS redirection and HSTS, deployed, and… there was no obvious change.

Issue 1: No HTTPS port to redirect to

Remember how the ASP .NET Core app in Kubernetes is only listening on HTTP? That means it doesn’t know which port to redirect users to for HTTPS. Oops. While I guess it would be reasonable to guess 443 if it didn’t know any better, the default of “don’t redirect if you haven’t been told a port” means that your application doesn’t stop working if you get things wrong – it just doesn’t redirect.

This is easily fixed in ConfigureServices:

services.AddHttpsRedirection(options => options.HttpsPort = 443);

… but I’ve added conditional code so it doesn’t do that in development environment, as otherwise it would try to redirect from localhost:5000 to localhost:443, which wouldn’t work. This is a bit hacky at the moment, which is a common theme – I want to clean up all the configuration at some point quite soon (moving things into appsettings.json as far as possible) but it’s just about hanging together for now.

So, make the change, redeploy to test, and… observed infinite redirection in the browser. What?

Issue 2: Forwarding proxied headers

Remember again how the ASP .NET Core app is only listening on HTTP? We want it to behave differently depending on whether the end user made a request to the load balancer on HTTP or HTTPS. That means using headers forwarded from the proxy (in our case the load balancer) to determine the original request scheme. Fortunately, again there’s documentation on hand for this.

There are two parts to configuring this:

Configuring the ForwardedHeadersOptions in ConfigureServices
Calling app.UseForwardedHeaders() in Configure

(At least, that’s the way that’s documented. I’m sure there are myriad alternatives, but my experience level of ASP .NET Core is such that I’m still in “follow the examples verbatim, changing as little as possible” at the moment.)

I won’t go into the gory details of exactly how many times I messed up the forwarded headers options, but I will say:

The example which just changes options.ForwardedHeaders is probably fine if your proxy server is local to the application, but otherwise you will need do to extra work
The troubleshooting part of the documentation is spectacularly useful
There are warnings logged if you get things wrong, and those logs will help you – but they’re at a debug log level, so you may need to update your logging settings. (I only realized this after I’d fixed the problem, partly thanks to Twitter.)

Lesson to learn: when debugging a problem, turn on debugging logs. Who’d have thought?

Configuring this properly is an area where you really need to understand your deployment and how a request reaches you. In my case, the steps are:

The user’s HTTPS request is terminated by the load balancer
The load balancer makes a request to the Kubernetes service
The Kubernetes service makes a request to the application running on one of the suitable nodes

This leads to a relatively complex configuration, as there are two networks to trust (Google Cloud load balancers, and my internal Kubernetes network) and we need to allow two “hops” of proxying. So my configuration code looks like this:

services.Configure<ForwardedHeadersOptions>(options =>
{
    options.KnownNetworks.Clear();
    // Google Cloud Platform load balancers
    options.KnownNetworks.Add(new IPNetwork(IPAddress.Parse("130.211.0.0"), 22));
    options.KnownNetworks.Add(new IPNetwork(IPAddress.Parse("35.191.0.0"), 16));
    // GKE service which proxies the request as well.
    options.KnownNetworks.Add(new IPNetwork(IPAddress.Parse("10.0.0.0"), 8));
    options.ForwardedHeaders = ForwardedHeaders.XForwardedFor | ForwardedHeaders.XForwardedProto;
    options.ForwardLimit = 2;
});

(The call to KnownNetworks.Clear() probably isn’t necessary. The default is to include the loopback, which is safe enough to leave in the list.)

Okay, deploy that to the test environment. Everything will work now, right? Well, sort of…

Issue 3: make sure health checks are healthy!

As it happens, when I’d finished fixing issue 2, I needed to help at a birthday party for a family we’re friends with. Still, I went to the party happily knowing everything was fine.

I then came home and found the test deployment was broken. Really broken. “502 Bad Gateway” broken. For both HTTP and HTTPS. This is not good.

I tried adding more logging, but it looked like none of my requests were getting through to the application. I could see in the logs (thank you, Stackdriver!) that requests were being made, always to just “/” on HTTP. They were all being redirected to HTTPS via a 307, as I’d expect.

This confused me for a while. I honestly can’t remember what gave me the lightbulb moment of “Ah, these are load balancer health checks, and it thinks they’re failing!” but I checked with the load balancer in the Google Cloud Console and sure enough, I had multiple working backends, and one broken one – my test backend. The reason I hadn’t seen this before was that I’d only checked the test deployment for a few minutes – not long enough for the load balancer to deem the backend unhealthy.

I was stuck at this point for a little while. I considered reconfiguring the load balancer to make the health check over HTTPS, but I don’t think that could work as the app isn’t serving HTTPS itself – I’d need to persuade it to make the request as if it were a user-generated HTTPS request, with appropriate X-Forwarded-Proto etc headers. However, I saw that I could change which URL the load balancer would check. So how about we add a /healthz URL that would be served directly without being redirected? (The “z” at the end is a bit of Googler heritage. Just /health would be fine too, of course.)

I started thinking about adding custom inline middleware to do this, but fortunately didn’t get too far before realizing that ASP .NET Core provides health checking already… so all I needed to do was add the health check middleware before the HTTPS redirection middleware, and all would be well.

So in ConfigureServices, I added a no-op health check service:

services.AddHealthChecks();

And in Configure I added the middleware at an appropriate spot:

app.UseHealthChecks("/healthz");

After reconfiguring the health check on the load balancer, I could see /healthz requests coming in and receiving 200 (OK) responses… and the load balancer was then happy to use the backend again. Hooray!

After giving the test service long enough to fail, I deployed to production, changed the load balancer health check URL, and all was well. I did the two parts of this quickly enough so that it never failed – a safer approach would have been to add the health check handler but without the HTTPS redirection first, deploy that, change the health check URL, then turn on HTTPS.

But the end result is, all is working! Hooray!

Conclusion

Moving the service in the first place has been a long process, mostly due to a lack of time to spend on it, but the HTTPS redirection has been its own interesting bit of simultaneous pleasure and frustration. I’ve learned a number of lessons along the way:

The combination of Google Kubernetes Engine, Google Cloud Load Balancing, Google Cloud Build and Google Container registry is pretty sweet.
Managed SSL certificates are wonderfully easy to use, even if there is a bit of a worrying delay while provisioning
It’s really, really important to be able to test deployment changes (such as HTTPS redirection) in an environment which is very similar to production, but which no-one is depending on. (Obviously if you have a site which few people care about anyway, there’s less risk. But as it’s easy to set up a test deployment on GKE, why not do it anyway?)
HTTPS redirection caused me three headaches, all predictable:
- ASP .NET Core needs to know the HTTPS port to redirect to.
- You need to configure forwarded headers really carefully, and know your deployment model thoroughly .
- Be aware of health checks! Make sure you leave a test deployment “live” for long enough for the health checks to mark it as unhealthy if you’ve done something wrong, before you deploy to production.
When debugging, turn on debug logging. Sounds obvious in retrospect, doesn’t it? (Don’t start trying to copy middleware source code into your own application so you can add logging, rather than using the logging already there…)

I also have some future work to do:

There’s a particular URL (http://nodatime.org/tzdb/latest.txt) which is polled by applications in order to spot time zone information changes. That’s the bulk of the traffic to the site. It currently redirects to HTTPS along with everything else, which leads to the total traffic being nearly double what it was before, for no real benefit. I’ve encouraged app authors to use HTTPS instead, but I’ve also filed a feature request against myself to consider serving that particular URL without the redirect. It looks like that’s non-trivial though.
I have a bunch of hard-coded information which should really be in appsettings.json. I want to move all of that, but I need to learn more about the best way of doing it first.

All in all, this has been a very positive experience – I hope the details above are useful to anyone else hosting ASP .NET Core apps in Google Kubernetes Engine.

Diagnostics

Diagnosing a Linux-only unit test failure

October 8, 2017 jonskeet 14 Comments

This is an odd one. I’m currently working on Cloud Firestore support for C#, and I’ve introduced a GeoPoint struct to represent a latitude/longitude pair, each being represented by a double. It implements IEquatable and overloads == and != in the obvious way. So far, so good, and I have working tests which have passed on CI.

I’ve added some more types that implement equality, so I thought I’d extract out common code into an EqualityTester class that can handle both the Equals methods and the operators (the latter via reflection).

All went well on my local machine, but on CI one environment failed. We use both AppVeyor and Travis in CI, and only Travis was failing. There were two failures – one testing the equality operators, and one as part of a broader serialization/deserialization test. Odd. Time to go through the normal diagnostic process…

Step 1: Check there’s really a problem

The nature of the failing tests don’t feel like an intermittent problem, but it’s always good to get more of an idea. I forced Travis to rebuild the pull request on Travis a couple of times: yes, it failed the same two tests each time.

Step 2: Reproduce locally

Tests that only fail in CI are really annoying. I’ve had this a couple of times, and it’s always taken hours to sort out, because the “try something, get results, think about what they mean” cycle is so slow.

Fortunately, I have a Linux box at home – so I fetched the pull request and ran the tests there – and they failed. Hooray! It’s not quite as nice as being able to reproduce it on my Windows laptop, but it’s much better than being isolated to CI.

Step 3: Remove irrelevant test code

There are two parts to the code that’s failing: the unit tests, and the production code. Ideally, I’d like to get rid of both, to create a truly minimal example. I usually try to take out one project at a time, and I could have gone either way – but this time I decided to remove the testing first.

So in the same repo, I created a new netcoreapp1.0 console application, with a project reference to the production code. I moved the operator part of my equality tester code into Program.cs, simplified it slightly, introduced by own Assert class with True and False methods that printed out the result, and I was ready to run:

using System;
using System.Reflection;
using Google.Cloud.Firestore.Data;

class Program
{
   static void Main(string[] args)
   {
       var gp1 = new GeoPoint(1.5, 2.5);
       var gp2 = new GeoPoint(1.5, 2.5);
       AssertEqualityOperators(gp1, gp2);
   }

   internal static void AssertEqualityOperators<T>(T control, T equal)
   {
       var typeInfo = typeof(T).GetTypeInfo();
       var equalityMethodInfo = typeInfo.GetMethod(
           "op_Equality", new[] { typeof(T), typeof(T) });
       Func<T, T, bool> equality = 
           (lhs, rhs) => (bool) equalityMethodInfo.Invoke(null, new object[] { lhs, rhs });

       Assert.True(equality(control, equal));
   }
}

class Assert
{
   public static void True(bool actual)
       => Console.WriteLine($"Expected True; was {actual}");

   public static void False(bool actual)
       => Console.WriteLine($"Expected False; was {actual}");
}

Sure enough, the output shows the problem:

Expected True; was False

Great. Now let’s remove the other dependency…

Step 4: Remove production code dependency

Rather than copy the whole of GeoPoint into my test project, I just created a small Test struct which had similar equality code.

Dead-end: a single double value

I made a mistake here – I changed too much in one go. I reduced my Test struct to just one double field:

using System;
using System.Reflection;

class Program
{
   static void Main(string[] args)
   {
       Test t1 = new Test(1.5);
       Test t2 = new Test(1.5);

       var equalityMethodInfo = typeof(Test)
           .GetMethod("op_Equality", new[] { typeof(Test), typeof(Test) });
       Func<Test, Test, bool> equality =
           (lhs, rhs) => (bool) equalityMethodInfo.Invoke(null, new object[] { lhs, rhs });
       Console.WriteLine(t1 == t2);
       Console.WriteLine(equality(t1, t2));
       Console.WriteLine(equalityMethodInfo.Invoke(null, new object[] { t1, t2 }));
   }
}

struct Test : IEquatable<Test>
{
    private readonly double value;

    public Test(double value)
    {
        this.value = value;
    }

    public static bool operator==(Test lhs, Test rhs) =>
        lhs.value == rhs.value;

    public static bool operator!=(Test lhs, Test rhs) => !(lhs == rhs);

    public override bool Equals(object obj) =>
        obj is Test t && Equals(t);

    public bool Equals(Test other) => this == other;

    public override int GetHashCode() => value.GetHashCode();
}

That prints “True” three times – which is what we’d like it to do eventually, but it means it’s not reproducing the problem we want it do.

Back to reproducing it: two double values

Taking one step back towards the previous code, I went back to two double values instead of just one, expanding the Test struct like this:

struct Test : IEquatable<Test>
{
    private readonly double x;
    private readonly double y;

    public Test(double x, double y)
    {
        this.x = x;
        this.y = y;
    }

    public static bool operator==(Test lhs, Test rhs) =>
        lhs.x == rhs.x && lhs.y == rhs.y;

    public static bool operator!=(Test lhs, Test rhs) => !(lhs == rhs);

    public override bool Equals(object obj) =>
        obj is Test t && Equals(t);

    public bool Equals(Test other) => this == other;

    public override int GetHashCode() => x.GetHashCode() + y.GetHashCode();
}

Change both constructor calls to Test(1.5, 2.5) and bingo: it prints True, False, False.

Step 5: Explore the scope of the problem

I tried the exact same code on Windows, and (unsurprisingly, given that my earlier unit tests passed) it didn’t reproduce the problem. But sticking with Linux, we can still try multiple frameworks and multiple build configurations.

I wouldn’t like to say what prompted me to try different frameworks – intuition can be hard to explain, unfortunately. But let’s change the project file a bit:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>netcoreapp1.0;netcoreapp1.1;netcoreapp2.0</TargetFrameworks>
  </PropertyGroup>
</Project>

And then run it in several different configurations (remember that three “True” lines means “everything is working”, and “True”, “False”, “False” means “this is the problem we’re investigating”):

$ dotnet run -c Debug -f netcoreapp1.0
True
False
False
$ dotnet run -c Release -f netcoreapp1.0
True
False
False
$ dotnet run -c Debug -f netcoreapp1.1
True
True
True
$ dotnet run -c Release -f netcoreapp1.1
True
True
True
$ dotnet run -c Debug -f netcoreapp2.0
True
True
True
$ dotnet run -c Release -f netcoreapp2.0
True
True
True

So, it fails under .NET Core 1.0, but works under .NET Core 1.1 and .NET Core 2.0. That certainly makes it sound like a .NET Core 1.0 bug, but that doesn’t mean I want to give up at this point. It’s entirely possible that I have users who are using .NET Core 1.0, so I want my code to work under it.

Now, I’ve only given major/minor version numbers – and that’s partly because I don’t have a very good handle on the different parts of .NET Core on Linux and how to find the version numbers. The 1.0 packages I’ve got installed are listed as:

dotnet-dev-1.0.4/xenial,now 1.0.4-1 amd64 [installed]
dotnet-hostfxr-1.0.1/xenial,now 1.0.1-1 amd64 [installed]
dotnet-sharedframework-microsoft.netcore.app-1.0.5/xenial,now 1.0.5-1 amd64 [installed,automatic]

No dotnet-runtime package, which I believe is part of 2.0 instead.

Step 6: Look into the failed comparison

This part is really simple – just add more diagnostics to the == method:

public static bool operator==(Test lhs, Test rhs)
{
    Console.WriteLine($"lhs=({lhs.x}, {lhs.y}); rhs=({rhs.x}, {rhs.y})");
    Console.WriteLine($"{lhs.x == rhs.x}, {lhs.y == rhs.y}");
    return lhs.x == rhs.x && lhs.y == rhs.y;
}

Now things get really interesting in the output:

lhs=(1.5, 2.5); rhs=(1.5, 2.5)
True, True
True
lhs=(1.5, 1.85492638478664E-316); rhs=(1.5, 6.95251497332956E-310)
True, False
False
lhs=(1.5, 6.9525149733517E-310); rhs=(1.5, 6.9525149733351E-310)
True, False
False

The first three lines are as expected – and that’s the working comparison. But each of the three-line-segments for the second and third comparisons show the first field (x) being fine, but the second being some tiny double value, and inconsistent at that.

Step 7: Try (and fail) to simplify

Maybe boxing is to blame? How about we try a really simple version of boxing – just override ToString(), box a value and see what comes up. The change in Test is simply:

public override string ToString() => $"{x}, {y}";

And then to test it:

object x = new Test(1.5, 2.5);
Console.WriteLine(x);

This prints 1.5, 2.5 – so it appears to be boxed correctly.

Step 8: Try simplifying a bit less

In our failing program, there are two potential complications at play: we’re boxing the value, and we’re calling a method via reflection. If just boxing isn’t a problem, maybe reflection is?

I decided to try two things at once: an instance method and a static method accepting a Test parameter (like the operator). Now we’re not doing any comparisons, we can get rid of a lot of code. Here’s the new complete code:

using System;
using System.Reflection;

class Program
{
   static void Main(string[] args)
   {
       Test t = new Test(1.5, 2.5);
       var foo = typeof(Test).GetMethod("Foo");
       foo.Invoke(t, null);
       var bar = typeof(Test).GetMethod("Bar");
       bar.Invoke(null, new object[] { t });
   }
}

struct Test
{
    private readonly double x;
    private readonly double y;

    public Test(double x, double y)
    {
        this.x = x;
        this.y = y;
    }

    public void Foo() => Console.WriteLine($"{x}, {y}");
    public static void Bar(Test t) => Console.WriteLine($"{t.x}, {t.y}");
}

And now the results:

1.5, 2.5
1.5, 6.95260190884551E-310

Interesting! At this point, there are lots of things that I could check:

Does it affect all parameters?
Does it happen for other field types? Other primitives? References?
Can I provoke it in any other circumstances other than via
reflection?

Most of these can probably be answered by finding an existing issue that was fixed in netcore1.1… which is reasonably feasible given the Open Source nature of .NET Core.

That can all wait though (it’s a Sunday evening for me, after all…) – let’s see if we can fix the problem first.

Step 9: Fix it with a stroke of luck

Given that this feels like a memory management issue, let’s try something I’ve basically not used before: explicit struct layout.

A struct containing two double fields should be pretty easy to lay out: just fields at offsets 0 and 8, right? Admittedly it also sounds like it should be easy for the CLR to get right itself… but at least laying it out explicitly is low-risk.

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Explicit)]
struct Test
{
    [FieldOffset(0)]
    private readonly double x;
    [FieldOffset(8)]
    private readonly double y;

    ...
}

Normally, things don’t get fixed on first attempt, but in this case… it just worked!

Backtracking from the simplification, to the standalone comparison version, to fixing GeoPoint (which involved not using automatically implemented properties any more, as I want to apply the attribute to the field, not the property – boo!) and checking it works locally, to adding a commit to the PR and seeing it pass on Travis. Yay!

Next steps and points to note

My next step is definitely to try to find some history of this issue. It’s a surprisingly big problem to affect the GA .NET Core 1.0 runtime, so I expect there’s some record of it somewhere. I want to understand better what was wrong, whether my fix is appropriate, and whether I need to apply it elsewhere.

Update: this is a still an issue in the 1.0.7 runtime, and apparently even in the 1.0.8-servicing daily builds, so I’ve filed an issue and look forward to seeing what happens there.

Some interesting aspects:

If you initially see a problem in CI, your top priority is to reproduce it locally.
As ever, remove as much other code as you can – your production code, your test code, all dependencies.
It’s handy to have multiple execution environments available. If I hadn’t had a local Linux box, this would have been a nightmare.
Try different configurations and different runtimes/frameworks.
Sometimes the bug really is in the runtime or BCL. It’s pretty rare, but it does happen.

Diagnostics

Diagnosing a VS-only build problem

August 30, 2017 jonskeet 14 Comments

I do most of my work in the google-cloud-dotnet github repo (That’s the repo for the Google Cloud Client Libraries for .NET, just to get a quick bit of marketing in there.) We try to keep our build and test dependencies up to date, so I recently updated to the latest versions of Microsoft.NET.Test.Sdk and xUnit.

Leaving aside problems identified by the xUnit analyzer which is now bundled in the xUnit package (some of which were useful, and some of which weren’t), this change caused another problem: while building from the command line worked fine, building some of the many solutions from Visual Studio (15.3.3) generated the following error:

Error MSB4018 The “GenerateBindingRedirects” task failed unexpectedly.
System.IO.PathTooLongException: The specified path, file name, or both are too long. The fully qualified file name must be less than 260 characters, and the directory name must be less than 248 characters.
at System.IO.PathHelper.GetFullPathName()
at System.IO.Path.LegacyNormalizePath(…)
at System.IO.Path.NormalizePath(…)
at System.IO.Path.NormalizePath(…)

(more stack trace here)

The output window shows a file related to the error:

C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\MSBuild\15.0\Bin\Microsoft.Common.CurrentVersion.targets(2099,5)

So, what’s going on, and how do we fix it?

Step 1: Check that this change is responsible

Git makes life really easy here. For peace of mind – particularly for this kind of issue – I tend to close Visual Studio when switching branches, so:

Find a solution that fails (e.g. Google.Cloud.Diagnostics.AspNetCore)
Close VS
git checkout master
Rebuild solution in VS – looks fine
Close VS
git checkout branch with change
Rebuild in VS – observe error
git checkout master
Rebuild solution in VS – looks fine
Close VS
git checkout branch with change
Rebuild in VS – observe error

Yes, I did everything twice, just to make sure it wasn’t an anomaly.

So yes, it’s definitely a problem.

Step 2: Moan on Twitter

This isn’t the most scientific diagnostic tool in the world, but posting on Twitter about the problem did at least reassure me that I wasn’t alone.

Step 3: Narrow down the cause

Even though I observed the problem in Google.Cloud.Diagnostics.AspNetCore, the full error message in VS referred to Google.Cloud.Diagnostics.Common.IntegrationTests. That’s part of the Google.Cloud.Diagnostics.Common solution – the AspNetCore projects depend on the Common projects.

Try step 1 again with Google.Cloud.Diagnostics.Common (just one pass and one failure this time) – yes, it’s still a problem there. That’s useful.

Step 4: Try a workaround

All my source is under c:\Users\Jon\Test\Projects. The “Test” part is a bit of an anomaly, and next time I get a new machine, I’ll probably work without it, but there’s no need to change now. The Projects directory is actually a junction (symlink) to the c:\Users\Jon\Documents\Visual Studio 2015\Projects directory. That’s quite a long path to start with… let’s see if getting rid of the symlink helps.

Delete symlink
Move Projects file from under Documents to directly under Test
Try to build: same failure

Okay, it looks like we’ll have to be a bit more subtle.

Step 5: Play spot the difference

Given that we have code which builds on the command line in both the working and failing situations, we can reasonably easily see any differences in generated files.

On the command line, go into the Google.Cloud.Diagnostics.Common.IntegrationTests directory
Delete bin and obj
git checkout master
Run dotnet build
Copy the resulting bin and obj directories to a “working” directory
git checkout branch with change
Delete bin and obj
Run dotnet build
Copy the resulting bin and obj directories to a “broken” directory
Run kdiff3 against the working and broken directories

There are lots of differences between the directories, as I’d expect, but given that this is about binding redirects, it’s reaosnable to use experience and start scanning for filenames ending with .config.

Sure enough, in the “broken” directory, under obj\net452, there was a file called Google.Cloud.Diagnostics.Common.IntegrationTests.csproj.Google.Cloud.Diagnostics.Common.IntegrationTests.dll.config. That’s not a copy and paste error – it really is a 115-character-long filename, even leaving out any directory parts.

In the file system, the full path is: c:\users\jon\Test\Projects/google-cloud-dotnet\apis\Google.Cloud.Diagnostics.Common\Google.Cloud.Diagnostics.Common.IntegrationTests\obj\Debug\net452\Google.Cloud.Diagnostics.Common.IntegrationTests.csproj.Google.Cloud.Diagnostics.Common.IntegrationTests.dll.config – that’s 266 characters.

Step 6: Try being slightly cheeky: very temporary workaround

Okay, so moving away from the “Documents\Visual Studio 2015” directory didn’t help much, but given that we’re just on the limit, let’s try just renaming “google-cloud-dotnet” to “short” (very temporarily).

Try opening it in Visual Studio – wahoo, it works :) The .config file is generated by Visual Studio correctly.

Step 7: Work out who to blame

So, where’s the bug?

It’s a shame that Visual Studio 2017 doesn’t support long filenames, even though the dotnet CLI does
Surely we don’t need such a long filename anyway
Do I need the project name to be so long?

Looking back to the very first description, let’s have a look at the msbuild file that’s mentioned: C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\MSBuild\15.0\Bin\Microsoft.Common.CurrentVersion.targets

That has:

<GenerateBindingRedirects
  AppConfigFile="@(AppConfigWithTargetPath)"
  TargetName="$(TargetFileName).config"
  OutputAppConfigFile="$(_GenerateBindingRedirectsIntermediateAppConfig)"
  SuggestedRedirects="@(SuggestedBindingRedirects)"
>

So it looks like the problem is _GenerateBindingRedirectsIntermediateAppConfig, which is defined in a property group elsewhere as:

<_GenerateBindingRedirectsIntermediateAppConfig>$(IntermediateOutputPath)$(MSBuildProjectFile).$(TargetFileName).config</_GenerateBindingRedirectsIntermediateAppConfig>

That certainly looks like it’s responsible for the choice of file.

A quick search for _GenerateBindingRedirectsIntermediateAppConfig shows that I’m not the first person to have run into this – PathTooLongException with _GenerateBindingRedirectsIntermediateAppConfig describes exactly the pain I’ve been going through, and even helpfully references MSBuild should handle paths longer than MAX_PATH.

I’ll add my little bit of feedback to the former issue as soon as I’ve published this post (so I can link to it).

Step 8: Decide what to do

My options appear to be:

Change the project name or the directory structure
Ask everyone on the team to work from a really short root directory name
Downgrade my dependencies again
Live with Visual Studio not building a few solutions properly – we can still work in Visual Studio, so long as we don’t need a full build in it.

I went with option 4, and published this blog post.

Step 9: Embrace community feedback

Theose were the only options I considered at the time of original writing, partly as I was somewhat scared of this suggestion, being inexperienced in msbuild. However, combining that with a Directory.Build.targets file (which I didn’t know about before being tipped off by Nick Guerrera), it was very easy to implement a repo-wide workaround.

My friend Kirill Osenkov also pointed me at his MSBuild Log Viewer and pointed out that if it failed in Visual Studio, it would probably fail from msbuild on the command line. He was right, and the log viewer pinpointed what it was trying to write, which would have saved time.

Points to note

Situations where you have a working case and a broken case are great. They can help you validate your problem (does it go away if I revert to the previous commit?) and find the cause (what’s the difference in output between working and broken?)
Temporary approaches to double-check the diagnosis (such as me renaming my source directory to “short”) can be useful – don’t forget to undo them though!
If you’ve run into a problem, someone else probably has too
The .NET community can be really amazing – I was really impressed by the response here
Tools are often available – in this case, msbuildlog.com would have saved me quite a bit of time.

Diagnostics

Diagnosing a corner-case tooling bug

August 19, 2017 jonskeet 3 Comments

Unlike the previous tests which have been based on Noda Time, this post is based on some issues I’ve had with my main project at work: the Google Cloud Client Libraries for .NET.

Background

This is somewhat detailed for those of you who really care about the chain of events. If you don’t, skip to the next subheading.

google-cloud-dotnet is a complicated repo with lots of projects, generated project files, its own Roslyn analyzers, a fairly involved documentation process using docfx, and other bits and pieces. We use Travis and AppVeyor for CI, and the AppVeyor build has historically gone through the documentation build as well.

On the evening of August 17th (UK time), AppVeyor’s “Visual Studio 2017” image changed to include the .NET Core 2.0 SDK by default. This broke us in a few ways:

Our appveyor.yml invoked a bash script, but without explicitly invoking bash. I don’t know why it worked before, but the change broke that. Easily fixed.
Our docfx build completely broke on AppVeyor due to differences in how msbuild is found or invoked. I don’t understand this yet, but for the moment we’ve had to remove the docfx part of our build from AppVeyor. It works on our development machines – I believe due to the presence of VS2017 itself, rather than just the build tools.
Some aspect of building with .NET Core 2.0 broke our Roslyn analyzers, causing them to throw missing method exceptions. I don’t understand enough about the breakage to explain it or say whether it was a bug, but we had to change the analyzers for them to work with the latest compilers. This wasn’t a problem in our command-line builds as we had pinned to the 1.0 SDK using global.json, but it seems VS2017 update 3 uses the latest Roslyn version anyway (reasonably) so building in Visual Studio caused problems.

Unfortunately the solution to the last bullet point caused the analyzers to break under the 1.0 SDK… so we were basically in a no-win situation until we upgraded everything to use the 2.0 SDK. (Just to reassure customers, we’re still targeting netstandard1.0 – just because we’re using the 2.0 SDK doesn’t mean you have to.)

Having updated to the 2.0 SDK, we were now guaranteed to have the C# 7.1 compiler available, so after a project file update to include latest, we could use new features.

The problem for this post: docfx and default literals

Our code has a lot of methods accepting cancellation tokens, and the default is always specified as

CancellationToken cancellationToken = default(CancellationToken)

That’s quite a mouthful – but C# 7.1 default literals can help! This post isn’t about the language feature, but basically it allows you to just use default in places where the C# compiler can work out which type’s default value you mean. So we can change all occurrences of the above to:

CancellationToken cancellationToken = default

Great! A colleague put together a pull request, and I was about to give an LGTM and suggest that he merged it, when I remembered that we weren’t building the documentation on AppVeyor any more. Given that this was the first build using C# 7.1 features, I thought it would be worth just checking that it worked.

It didn’t. It failed with an error message:

Error:Error extracting metadata for c:/Users/jon/Test/Projects/google-cloud-dotnet/apis/Google.Cloud.BigQuery.V2/Google.Cloud.BigQuery.V2/Google.Cloud.BigQuery.V2.csproj: Value cannot be null.
Parameter name: name

This post is a log of the steps I’ve gone through, not trying to fix the bug, but trying to report the problem in a friendly manner.

On Stack Overflow, when I ask for a short but complete program demonstrating the problem, this is the sort of process I expect people to go through.

This won’t be quite as accurate a log as some other posts, as I only started writing it when I was most of the way through the investigation. Hopefully it’ll still be useful.

Step 0: Confirm the problem is real

This is where git is really, really useful. I was very easily able to clone my colleague’s fork, checkout the branch his PR was created from, and go through the build procedure to see the error shown above.

Once I’d done that, I checked out the master branch, pulled to make sure I was up to date, and rebuilt. No problems – so it was definitely the change in that PR.

Step 1: Try the simplest potential fix

We have a file in our repo specifying the version of docfx to use, so trying a different version is trivial. I knew we weren’t using the latest version, and that the latest version used a new version of Roslyn itself. It would make sense for that to fix the problem.

Tried it – no luck. At this point, it seemed unlikely that we’d be able to resolve the problem ourselves, and I was looking for a minimal example to report it.

Step 2: Try to reproduce in a minimal example

A minimal example in docfx is pretty simple. It needs three files:

A csproj for the project:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>netstandard1.0</TargetFramework>
    <LangVersion>latest</LangVersion>
  </PropertyGroup>
</Project>

A docfx.json file, just enough to build the metadata in this case:

{
  "metadata": [{
    "src": [{ "files": ["Project.csproj"] }],
    "dest": "api",
  }]
}

And a source file for it to build:

namespace Project
{
    public class Class1
    {
        public void Foo(int x = default(int)) {}
    }
}

Then I just needed to build the project and then the metadata:

dotnet build && docfx metadata -f

That worked, as expected. I then changed default(int) to just default and built again, hoping to see the failure. Unfortunately, that worked too.

At this point, I have no clue about the differences between the real project which didn’t work and the minimal project that did.

Step 3: find a smaller reproduction from real code

As mentioned in the introduction, the repo contains a lot of projects. The BigQuery project mentioned in the error message contains a lot of code – not an ideal candidate to cut down. Looking through the changes in the PR, I spotted that files in Google.Cloud.Translation.V2 had also been changed. That’s a much smaller project, and would be simpler to work with.

Instead of running through the full build, I just ran the commands to build that specific project and then its documentation. Bingo – same error as before, basically:

Error:Error extracting metadata for C:/Users/jon/Test/Projects/google-cloud-dotnet/apis/Google.Cloud.Translation.V2/Google.Cloud.Translation.V2/Google.Cloud.Translation.V2.csproj: Value cannot be null.
Parameter name: name

Step 4: isolate the smaller reproduction

I copied the broken code away from the git repo, and hacked the project file a bit:

I didn’t need the signing key
I didn’t need the import that prevented building net45 code on Linux
I didn’t need the Roslyn analyzer

I copied and modified the docfx.json file that was used to create the metadata, and tested it again: same error.

At this point, we’ve got a reproduction with “only” 10 C# files and a single project. None of the infrastructure used for the rest of the repo is required. That’s a good start.

Step 5: Gut the code

The aim is to remove as much code as possible, leaving only what’s required to demonstrate the problem.

The project contained two parallel pairs of classes (TranslationClient/TranslationClientImpl and AdvancedTranslationClient/AdvancedTranslationClientImpl). They’re very similar (and really only both exist for backwards compatibility) – it was trivial to remove the “advanced” pair and retest: same failure.

Next, remove the TranslationClientImpl class – we weren’t running any code, so if we could reproduce the problem with the abstract TranslationClient class, that would be fine. Remove all factory methods and the class itself: same failure.

Next, remove the ListLanguages and DetectLanguage methods (and their related friends) – at which point the Language and Detection classes can go, along with LanguageCodes which is really just helper code. Same failure.

We’re down to relatively little code now, but still more than I really want. What’s left is a bunch of calls to do translation. It’s basically a matrix of 8 methods, for every combination of:

Translate a single item or multiple items
Translate text or HTML
Sync or async

The single/multiple part is just implemented in terms of the “single” method calling the “multiple” method and extracting the result, using an expression-bodied member. Only the async methods used the default literal, so I decided to remove the synchronous code first. Same failure.

The HTML and text methods were basically the same in the abstract class (the real differences, small as they are, lie in the implementation) so it was simple to remove the HTML methods. Same failure.

That left two methods: the two async methods, both using the default literal, and one of which (“single”) called the other (“multiple”). I tried removing the “single” method – suddenly, the failure went away. I restored the method, and double-checked: yes, we’re back to seeing the failure.

Still, two methods is reasonable. Let’s reduce those as far as we can, changing them to return strings instead of TranslationResult objects, for example, and removing the TranslationModel? parameter in each case: same failure.

Okay, this looks like it’s enough to reproduce the problem even more minimally.

Step 6: Back to the minimal example code

I went back to the sample project from step 2, with a tiny project file and a tiny docfx.json file. I thought the problem might be one method calling another, so I reproduced that within the test code. (Unfortunately I no longer have the code I tried at this point.)

No failure. Grrr. What was the significant difference between that code and the failing Translation code?

Step 7: Transplant the definitely-failing code

Pretty clueless at this point, I copied the TranslateText methods that were definitely failing in the Translation project into my test project. As I’d removed all the Translation-specific types now, that didn’t cause any compilation problems in itself… but the failure was back. Hooray.

So we now have:

Slightly complex code that at least slightly resembles the Translation code
A project file that’s nice and simple
A docfx file that’s nice and simple

That’s still good: we can now jettison the gutted Translation code, and work just within our minimal test project.

Step 8: Reduce to one method

We still had one method (“single”) calling another (“multiple”). Removing “single” removed the failure; removing “multiple” would cause a compilation failure in “single”. Let’s cut that link.

First, I replaced the the expression-bodied member from “single” with a regular method body, awaited a Task.Delay(...) (to avoid a warning about not awaiting anything) and returned null. The failure went away. Weird.

Next, I tried making it a non-async method, and just returned a null task instead. The failure stayed away.

My first “reasonably minimal, single-method” failure code looked like this:

using System.Threading.Tasks;

namespace Project
{
    public class Class1
    {
        /// <param name="x">A parameter</param>
        public virtual async Task<string> Foo(int x = default) => await Task.FromResult<string>(null);
    }
}

I’ve since played around a lot and reduced it somewhat:

using System.Threading.Tasks;

namespace Project
{
    public class Class1
    {
        /// <param name="x">A parameter</param>
        public Task Foo(int x = default) => (Task) null;
    }
}

Note that it’s now no longer an async method. This generates the same error as before, and it’s reliable, but any one of the following changes makes the failure go away:

Change default to default(int), the failure goes away
Remove the “ documentation, the failure goes away
Make it a regular method instead of an expression-bodied method
Change (Task) null to just null

That’s a pretty sensitive bug! I’ve still no idea where the problem is, but it’s now ready to tell the docfx team.

Step 9: file a bug

With all of this work done, it was simple to file an issue with the repro code. At this point, I can relax…

Points to note

The only chance I really had for resolving this problem myself was updating to the latest version of docfx. The C# code in my project is correct, so it really looks like a bug somewhere (whether in docfx or in one of its dependencies is still up for grabs) but it’s buried further than I’m likely to go.
To be as friendly as I can, I knew I had to provide a way for the team to reproduce the problem.
In this case, I tried a simple “let’s reproduce from scratch” unsuccessfully before going back to the known-failing code
Even though the original scope (“all the code in the repo”) was huge, it was fairly simple to reduce it in stages:
- Find a sub-project which demonstrated the problem
- Isolate that sub-project from the rest of the repo infrastructure
- Cut down that sub-project to a reasonably minimal example
- Transplant that example into a new, completely minimal project
Most of the time I “wasted” was due to the very odd nature of the failure – very small, usually-irrelevant perturbations of the code changed the result in an unexpected way. I could probably have realized this earlier, and started making my changes even smaller between tests.

Diagnostics

Diagnosing a single slow test

August 19, 2017 jonskeet 7 Comments

In the last blog post, I investigated why my Noda Time tests on Travis were running much slower than those on AppVeyor. I resolved a lot of the problem just by making sure I was running release builds on Travis.

That left a single test which takes up about half of the test time though: TzdbDateTimeZoneSourceTest.GuessZoneIdByTransitionsUncached. It seems to run more slowly on .NET Core on Linux than on Windows. The aim of this post is to work out why – at least to the extent of understanding it better than before.

Step 0: Initial assumptions/knowledge and aim

For the moment, I’ll assume that my records of previous investigations are correct. So the situation is:

I have near-identical hardware to run tests on: gabriel is running Linux; bagpuss is running Windows.
Under a release build, the test takes 19s on gabriel and 4.3s on bagpuss
The test does more work on gabriel: it checks 434 zones instead of 135 on bagpuss
That still shows a per-test slow-down of 40%
I’ve looked at this code before to try to optimize it. In production, this method is only going to be called for one time zone (the system default) so it’s not really a problem – but improving the test speed would be nice. (We can categorize it as a “slow” test if necessary, but I’d rather not.)

Aims of this diagnosis:

Find out whether the speed difference is in the test code or the production code that it’s testing. (This test has a lot of assertions.)
Find out whether the speed difference is in Noda Time code (e.g. Instant) or in BCL code (e.g. TimeZoneInfo). If it’s in Noda Time code, that could lead to another investigation later, with potential optimizations.
Make another attempt at speeding up the test across the board

Step 1: validate assumptions

Let’s make sure we’re still in the same state as we were before. (That seems pretty likely.)

On both machines, pull the current code and run

dotnet run -c Release -f netcoreapp1.0 -- --where=test=~Guess

from the src/NodaTime.Test directory.

Odd: it now takes ~6 seconds on bagpuss. At that point, there’s no time discrepancy. But I’m sure there was when I wrote up the previous post. I didn’t make up the 4.3s result… but it’s entirely possible it was on the wrong machine. Run the tests several times – yes, it’s always around 6 seconds. Odd.

At this point, I nearly gave up. After a little while doing other things, I decided to pursue the rest of the diagnosis anyway… at which point the execution time was back to about 4 seconds. That’s very, very strange. I still don’t know what was going on there.

Brutal honesty time: although I ran the tests at this point, I didn’t spot the discrepancy until I was at the end of step 2. I then looked back in the command shell buffer and saw the 6 second run, which was consistent with the runs at the end of step 2.

Step 2: shorten experiment iteration time

The only test we’re interested in is this one. We’re going to be changing the code occasionally, so let’s ditch everything else.

Start a new branch in git
Move TzdbDateTimeZoneSourceTest.cs into NodaTime.Test so it’s easier to find. (It was in a subdirectory.)
Remove all other tests from TzdbDateTimeZoneSourceTest.cs
Delete all the subdirectories in NodaTime.Test
Remove everything in NodaTime.Test apart from Program.cs and TzdbDateTimeZoneSourceTest.cs
Change the target framework to just netcoreapp1.0 so we don’t need to specify it. (This also requires changing the csproj file to use instead of.)
Add a Release property to attempt to make it a release build by default
Test: run dotnet run – oops, even though it’s running from a Release folder, it’s taking 13 seconds instead of 3 on Windows.
Revert previous change, and try dotnet run -c Release – that’s more like it

Great. We can push the commit to github and fetch it on all the relevant machines. (I’m developing on a third machine, just using gabriel and bagpuss to run tests.)

Step 3: Narrow down the time-consuming code

As I mentioned before, the test has a lot of assertions, and they involve a lot of time zone work. Let’s see how much time they’re contributing – by removing them entirely.

Comment out everything after the call to GuessZoneIdByTransitionsUncached
Run locally just to check it does still run
Commit and push
Fetch on gabriel/bagpuss and rerun

Result: the checking part of the test is almost free.

Step 4: Remove NUnit

At this point I’m fairly confident that NUnit is irrelevant. I still want to use the NodaTime.Test project, as NodaTime is already configured to expose internal members to it – and GuessZoneIdByTransitionsUncached is internal. However, it’s very easy to remove NUnit – just remove the dependencies, and replace all the code in Program with a simple test:

using NodaTime.TimeZones;
using System;
using System.Diagnostics;
using System.Linq;

namespace NodaTime.Test
{
    class Program
    {
        public static void Main(string[] args)
        {
            var zones = TimeZoneInfo.GetSystemTimeZones().ToList();

            // Fetch the source outside the measurement time
            var source = TzdbDateTimeZoneSource.Default;
            // JIT compile before we start
            source.GuessZoneIdByTransitionsUncached(TimeZoneInfo.Local);

            var stopwatch = Stopwatch.StartNew();
            foreach (var zone in zones)
            {
                source.GuessZoneIdByTransitionsUncached(zone);
            }
            stopwatch.Stop();
            Console.WriteLine($"Zones tested: {zones.Count}");
            Console.WriteLine($"Elapsed time: {stopwatch.Elapsed}");
        }
    }
}

There are a couple of differences here compared with the NUnit test:

I’m deliberately fetching TzdbDateTimeZoneSource.Default (which needs to read some resources) outside the measurement
I’m deliberately calling GuessZoneIdByTransitionsUncached once outside the measurement to remove JIT compilation time

This still isn’t a good benchmark by any means, but it’s probably sufficient for now.

Commit/push/fetch/run as before.

Result: similar to before, but a little faster. Playing around with the code a little further, it seems that the bullet points above make a difference of about 0.1s each (on bagpuss, at least) – there’s quite a lot of code that wouldn’t have been JIT-compiled before the first call, and loading time zone data is somewhat expensive. NUnit also adds a little bit of overhead, which is understandable.

We now have more accurate data for just GuessZoneIdByTransitionsUncached:

bagpuss: 3.6s
gabriel: 19.0s (just under 19 in most cases, instead of just over when running NUnit)

Now we need to get into Noda Time code…

Step 5: Gut the production code

First, the source code helpfully points to https://github.com/nodatime/nodatime/issues/686 with some comments from my last investigation.

We don’t care about correctness at this point, just speed. We can return null at any point. At this point you probably need to know a bit about what the method does. The aim is to guess a matching TZDB (aka IANA, tz, zoneinfo or Olson) time zone from a BCL TimeZoneInfo, when we can’t use the names to match. It combines all the time zone transitions (e.g. going into or out of DST) of every TZDB zone for the next 5 years, and then computes a score for each zone by checking how many of the UTC offsets at those transition points match the offsets returned by the BCL zone.

Let’s divide the code into 5 sections:

Working out the start and end of this UTC year
Getting all the time zones
Getting all the transitions we’re going to test
Getting all the offsets from the BCL zone
Computing the scores

It’s going to be simplest to measure just 1, then 1+2, then 1+2+3 etc. We can just add return null; after the step we’re interested in. We hope the results will tell us:

Which part of the code is slowest (so we can try to optimize it)
Which part of the code is slower on Linux than Windows

For simplicity, let’s make gabriel use the same number of time zones as bagpuss, just by adding .Take(135) in Program.cs above first. That way we’re comparing the same amount of work on each machine – although admittedly we don’t know that each time zone takes the same amount of time. Without any change to the production code, gabriel now takes 6s – which is about what we’d expect if the time per zone is uniform, as 19 * 135/434 = 5.91[…].

Results on gabriel:

Step 1: 0s (it’s free, effectively)
Step 1+2: 1.82s
Step 1+2+3: 3.15s
Step 1+2+3+4: 3.21s
Step 1+2+3+4+5: 6s

Results on bagpuss:

Step 1: 0s
Step 1+2: 1.21s
Step 1+2+3: 2.10s
Step 1+2+3+4: 2.13s
Step 1+2+3+4+5: 3.63s

Or in other words, stepwise:

Step 1: 0s on both
Step 2: 1.82s on gabriel; 1.21s on bagpuss
Step 3: 1.33s on gabriel; 0.89s on bagpuss
Step 4: 0.06s on gabriel; 0.03s on bagpuss
Step 5: 2.79s on gabriel; 1.50s on bagpuss

No particular smoking gun, unfortunately – no obvious part that would be easily optimized. The last part looks like it shows the worst platform difference though… let’s see if we can dig into that.

Step 6: Pivot to DateTimeZone.GetUtcOffset

Let’s change our test pretty radically:

Get a list of all the built in TZDB time zones
Create a list of 50,000 instants, starting at 2000-01-01T00:00:00Z and spaced 1 hour apart. 100,000 is some enough instants to make the test last for roughly 10 seconds. (Fast enough to iterate, slow enough to give significant results.)
For every zone, find the UTC offset for every instant.

To simplify reasoning about what we’re testing, we want to remove time zone caching, too. Time zone computations are inherently fiddly, so Noda Time has a cache for each zone. In GuessZoneIdByTransitionsUncached we were coming up with a new DateTimeZone in each call, so while there’d be some caching involved, it wouldn’t be as much as a simple version of this new test. For the moment, let’s just remove the time zone caching layer entirely:

var zones = DateTimeZoneProviders.Tzdb.GetAllZones()
    .Select(zone => zone is CachedDateTimeZone cached ? cached.TimeZone : zone)
    .ToList();
var start = Instant.FromUtc(2000, 1, 1, 0, 0);
var gap = Duration.FromHours(1);
var instants = Enumerable
    .Range(0, 100000)
    .Select(index => start + gap * index)
    .ToList();

var stopwatch = Stopwatch.StartNew();
foreach (var zone in zones)
{
    foreach (var instant in instants)
    {
        zone.GetUtcOffset(instant);
    }
}
stopwatch.Stop();
Console.WriteLine($"Elapsed time: {stopwatch.Elapsed}");

There’s still one other caching layer that’s relevant, and that’s in terms of conversions between calendrical values and instants… Noda Time is optimized for the years 1900-2100 in the Gregorian calendar. If our test went outside that range, we’d get very different results, hitting different pieces of code very hard. Our original test was looking at 5 years around now, so let’s stick to the optimized range for now.

Results:

gabriel: 19.5s
bagpuss: 12.5

Good – we’re still seeing the Linux run being about 50% slower than the Windows run.

Step 7: Break out a profiler!

It’s been a while since I’ve done any profiling in .NET. I had a quick search to see what was available in terms of free tools that handle .NET Core, and came across CodeTrack. I don’t think it runs under Linux, but that’s okay – we’re mostly using it to track down the hotspots first, so we can fine tune our tests.

I’m not specifically advocating CodeTrack – profiling is an art and science about which I know relatively little – but it was easy to get going. 85% of the time was spent in StandardDaylightAlternatingMap.GetZoneInterval, so that looks like a good place to look next.

Step 8: Write actual benchmarks

I have a benchmark suite for Noda Time already, and at this point we’ve got to a stage where it’s probably worth using proper benchmarking tools. BenchmarkDotNet is my framework of choice – so let’s add a couple of benchmarks.

Looking at the code and the profiler, it looks worth testing both StandardDaylightAlternatingMap and ZoneRecurrence. Having added those benchmarks, I might as well put this down until the next day, as the complete benchmark suite runs automatically every night.

Step 9: Compare the benchmarks

The benchmark results are automatically uploaded to the Noda Time web site although the UI for browsing them is somewhat unstable (in terms of URLs etc) – hence the lack of link here.

Looking at the benchmark results, it turns out I’d already got benchmarks for StandardDaylightAlternatingMap, annoyingly enough – so I needed to tidy those up.

However, they do clearly show the difference between .NET Core on Linux and on Windows, so we’ve narrowed the problem down further. That doesn’t show that only that code has different performance, but it’s hopefully a good enough start that we’ll be able to continue and get a short but complete example to provide to the .NET Core CLR team.

Diving deeper into the implementation, ZoneYearOffset makes sense to test too… the further we go down, the less code will be involved.

Step 10: take a step back for now

We’ve made progress, but we’re not all the way there. That’s okay – the progress is recorded (both here and in benchmarks) and in looking at the benchmark that’s taking so long to run, we’ve confirmed that we’re not doing anything stupid. It happens to end up doing a lot more work than any application code normally would – work that would normally only happen once (so can be inefficient without being a problem) is exercised hundreds of times. Such is life – I could decide not to run it in CI if I want, running it occasionally manually and always just before a release.

I’ll come back to this issue at some point in the future, but at this point, other projects have become more important.

Points to note and lessons learned

Validate assumptions carefully. I still don’t quite know why there was the blip in execution time, but it was very surprising.
Unit tests are no substitute for a good benchmarking framework.
Profilers exist, and are useful. No specific recommendations here, but make sure you’re aware of the tools you could benefit from.
Sometimes, you need to put down a problem for a while. Try not to lose the information you’ve already discovered – capture it in an issue, comments, a blog post like this one, a private document… anything to preserve information. My head leaks data very, very quickly.

Addendum

I’m still investigating where .NET Core runs faster on Linux than on Windows, but I’ve fixed this test – PR 953 is a pretty minor refactoring, but it allows the tests to pass in the same list of time zones into each test case. That has two benefits:

We don’t need to load the zones from raw data on each test
The “get UTC offset at instant X” result is cached in the time zone, so after the first test there’s very little work to do

This very small, internal-only change has halved the total time of the Noda Time tests on Travis. Yay.

Diagnostics

Diagnosing slow tests (again)

August 17, 2017 jonskeet 3 Comments

I’ve previously blogged about a case where tests on .NET Core on Linux were much, much slower than the same tests on Windows. Today I’ve got a very similar problem – but I suspect the cause isn’t going to be the same.

This is my reasonably raw log – skip to the end for a summary of what I found, what went well, and what went badly. Apologies for skipping around in terms of past/present/future tense – being consistent in that respect would be distracting enough to make the rest worse, I think.

Step 0: notice the problem

I spotted this problem because I happened to be looking at the Travis and AppVeyor builds, moving to .NET Core 2.0. Looking at the test runs, I saw that the tests for .NET Core 1.0 were completing in about 15 seconds on AppVeyor (Windows), and 133 seconds on Travis Linux). My experience is that AppVeyor seems to use slightly less powerful hardware than Travis, making this a very odd result.

I looked at several builds, and it seemed consistent across them – very odd.

Step 1: file an issue

My main focus at that point wasn’t to investigate performance – it was to get the build working with .NET Core 2. However, I didn’t want to forget about the problem, so I filed an issue in the Noda Time repo. At this point, I have no idea where the problem is, so the Noda Time repo is the most obvious place to put it. There’s no evidence that it’s a problem with NUnit or .NET Core itself, for example.

Step 2: reproduce locally

I have two machines at home that I use for running the automated benchmarks. One (bagpuss) runs Windows 10, the other (gabriel) runs Ubuntu 16.04. Both are Intel NuCs with i5-5250U processors – I bought one much later than the other, but looked hard to find as close to identical hardware as I could, to enable cross-OS benchmark comparisons without dual booting or virtualization.

I ran the tests on both machines, and copied the TestResult.xml output files to my main dev box for analysis.

First interesting point: the difference isn’t as marked. The tests take 77 seconds on gabriel and 29 seconds on bagpuss. That’s still a lot, but very different to what we’re seeing in CI. It’s possible that the Travis machine is twice as slow as gabriel/bagpuss and that the AppVeyor machine is twice as fast, but that seems unlikely.

Deem that temporarily out of scope as it’s harder to investigate than the fixed-hardware, easy-access problem I have now. File another issue and I can come back to it.

Step 3: Look for individual test case discrepancies

If we’re really lucky, there’ll be a single test case that has a much longer duration on gabriel than on bagpuss, and I can then delve into it.

Investigation:

Create a small console app that loads the two XML files, finds all the test-case elements and puts them in a dictionary from fullname to duration (the latter parsed as a dictionary)
- Oops: some full names aren’t unique (due to using parameters that don’t override ToString, I suspect).
- Load into a lookup instead, then convert that into a Dictionary by taking the first test from each group. Unlikely to affect the results
For each test that’s in both result sets, find the ratio of gabriel/bagpuss
Dump the top 20 ratios, and the slowest 10 tests from each set for good measure.

Okay, lots of text handling tests with ratios of more than 2:1, but they’re all very small durations. I doubt that they’ll be responsible for everything, and could easily just be “measuring small amounts of time is flaky”. Nothing to blame yet.

Point to note: this console app was always less than 50 lines of code. LINQ to XML makes this sort of thing really easy.

Step 4: Look for test fixture discrepancies

If individual tests don’t show a massive change, what about test fixtures? Let’s change the program to load test-suite elements where type="TestFixture" which will get us the durations as a per-test-class level. We’ll start by dumping the slowest 10 tests for each environment… and wow, we have a result!

NodaTime.Test.TimeZones.TzdbDateTimeZoneSourceTest takes 65 seconds on Gabriel and 17 seconds on Bagpuss! That’s out of a total of 77 and 29 seconds respectively – so removing those would take us to 12 seconds on each machine.

Concern: that’s assuming no multithreading is going on. I don’t think NUnit parallelizes by default, but it would definitely make things trickier to reason about.

Step 5: check the logs in more detail

Given that we haven’t seen any very slow individual tests, my guess at this point is that the slowness is going to be in test fixture setup. I haven’t looked at the code yet. Let’s have a look at the XML logs for that one test in more detail.

Yikes: all the difference is in GuessZoneIdByTransitionsUncached. That’s a single method, but it’s parameterized for all system time zones.

At this point, a memory stirs: I’ve looked at this before. I have a distinct memory of being in an airport lounge. At the time I wasn’t trying to work out the OS discrepancy so much as why this one test was taking up over half the total unit test time, even on Windows.

Looking at the code for GuessZoneIdByTransitionsUncached, I see I was even diligent enough to create an issue and keep a bit of a log of what I found. Hooray! So, now we know why this is slow in general. But why is it so much slower on gabriel than on bagpuss?

The answer is that gabriel appears to have rather more system time zones than bagpuss: it’s running 424 tests instead of 135. If they ran each test equally fast, it would take bagpuss about 53 seconds to gabriel’s 65 – that’s not so bad.

Step 6: test on CI again

In one fell swoop, we can confirm that this is the problem on Travis and potentially work out why AppVeyor is so fast. If we only run this one test fixture, we can look at how many tests are running on each machine (and in which environments, as I now run the tests on net45 (AppVeyor only), netcoreapp1.0 and netcoreapp2.0.

That’s really easy to do – create a PR which just changes the test filter, let CI run it, then delete the PR.

Results:

Travis, both environments run 424 tests – 95 seconds for netcoreapp1.0, 92 seconds for netcoreapp2.0
AppVeyor, all environments roughly equal, 135 tests – about 3 seconds

That certainly shows that this is taking the majority of the time on Travis, but we still have two questions:

How is AppVeyor so much faster than my Core i7 development machine, where the same tests take about 15 seconds?
Travis is still only doing about 3 times as much work as AppVeyor, but it’s taking 30 times as long. Why?

Step 7: spot flag differences

I look at the Travis script. I look at the AppVeyor script. I look at what I’ve been running locally. There’s a difference… on AppVeyor, I’m running with -c Release, whereas locally I’ve used the default (debug). Can a debug build make that much difference?

Test locally: running on my dev box with -c Release brings the time down from 15s to 3.9s. Yikes!
Update the Travis script to run with -c Release as well

Now the tests take about 30 seconds on Travis. That’s still three times as long as it would be on Windows for the same number of time zones, but it’s much, much closer.

Run the tests with the same flags on gabriel and bagpuss: gabriel takes 19s, bagpuss takes 4.3s. Taking the number of time zones into account, that suggests that netcoreapp1.0 on Linux is about 40% slower than netcoreapp1.0 on Windows for each test. I can investigate that separately, but it does sound like Travis is just using slower hardware than AppVeyor, contrary to my previous suspicions.

Step 8: update Travis script and merge

Even though this one test takes a large proportion of the test time on Linux (where there are three times as many time zones to run through), it’s not a gating factor. Correctness beats speed in this case – I can live with waiting a few minutes for a complete CI test run.

We still have some investigation to do, but we can already make things definitely better by running the release build on Travis. On a new branch – not the one I’ve been investigating on – create a commit that just changes Travis to use -c Release. Observe the total time go down from 133 seconds to 57 seconds. Yay.

Points to note

When you’re running performance tests, don’t compare release with debug builds. Doh. This was a rookie mistake.
Reproducing a CI problem locally is key: it’s much, much easier to work with local machines than CI. It’s a faster turnaround, and you get easier access to detailed results.
Having multiple machines with the same hardware is really useful for performance comparisons.
Don’t assume that what sounds like the same set of tests will run the same number of tests on different platforms – if a test is parameterized, check how many tests are actually running in each case.
Ad-hoc tools are great – the console app I wrote for steps 3 and 4 was quick to write, and helped zoom in on the problem very simply.

Further work

Almost every diagnostic investigation seems to “end” with something else to look into. In this case, I want to find out why Linux is slower than Windows for the GuessTimeZoneTransitionsUncached test on a per-test basis. Is this a Noda Time difference, or a BCL difference? That can wait.