Diagnosing a Linux-only unit test failure

This is an odd one. I’m currently working on Cloud Firestore support for C#, and I’ve introduced a GeoPoint struct to represent a latitude/longitude pair, each being represented by a double. It implements IEquatable and overloads == and != in the obvious way. So far, so good, and I have working tests which have passed on CI.

I’ve added some more types that implement equality, so I thought I’d extract out common code into an EqualityTester class that can handle both the Equals methods and the operators (the latter via reflection).

All went well on my local machine, but on CI one environment failed. We use both AppVeyor and Travis in CI, and only Travis was failing. There were two failures – one testing the equality operators, and one as part of a broader serialization/deserialization test. Odd. Time to go through the normal diagnostic process…

Step 1: Check there’s really a problem

The nature of the failing tests don’t feel like an intermittent problem, but it’s always good to get more of an idea. I forced Travis to rebuild the pull request on Travis a couple of times: yes, it failed the same two tests each time.

Step 2: Reproduce locally

Tests that only fail in CI are really annoying. I’ve had this a couple of times, and it’s always taken hours to sort out, because the “try something, get results, think about what they mean” cycle is so slow.

Fortunately, I have a Linux box at home – so I fetched the pull request and ran the tests there – and they failed. Hooray! It’s not quite as nice as being able to reproduce it on my Windows laptop, but it’s much better than being isolated to CI.

Step 3: Remove irrelevant test code

There are two parts to the code that’s failing: the unit tests, and the production code. Ideally, I’d like to get rid of both, to create a truly minimal example. I usually try to take out one project at a time, and I could have gone either way – but this time I decided to remove the testing first.

So in the same repo, I created a new netcoreapp1.0 console application, with a project reference to the production code. I moved the operator part of my equality tester code into Program.cs, simplified it slightly, introduced by own Assert class with True and False methods that printed out the result, and I was ready to run:

using System;
using System.Reflection;
using Google.Cloud.Firestore.Data;

class Program
{
   static void Main(string[] args)
   {
       var gp1 = new GeoPoint(1.5, 2.5);
       var gp2 = new GeoPoint(1.5, 2.5);
       AssertEqualityOperators(gp1, gp2);
   }

   internal static void AssertEqualityOperators<T>(T control, T equal)
   {
       var typeInfo = typeof(T).GetTypeInfo();
       var equalityMethodInfo = typeInfo.GetMethod(
           "op_Equality", new[] { typeof(T), typeof(T) });
       Func<T, T, bool> equality = 
           (lhs, rhs) => (bool) equalityMethodInfo.Invoke(null, new object[] { lhs, rhs });

       Assert.True(equality(control, equal));
   }
}

class Assert
{
   public static void True(bool actual)
       => Console.WriteLine($"Expected True; was {actual}");

   public static void False(bool actual)
       => Console.WriteLine($"Expected False; was {actual}");
}

Sure enough, the output shows the problem:

Expected True; was False

Great. Now let’s remove the other dependency…

Step 4: Remove production code dependency

Rather than copy the whole of GeoPoint into my test project, I just created a small Test struct which had similar equality code.

Dead-end: a single double value

I made a mistake here – I changed too much in one go. I reduced my Test struct to just one double field:

using System;
using System.Reflection;

class Program
{
   static void Main(string[] args)
   {
       Test t1 = new Test(1.5);
       Test t2 = new Test(1.5);

       var equalityMethodInfo = typeof(Test)
           .GetMethod("op_Equality", new[] { typeof(Test), typeof(Test) });
       Func<Test, Test, bool> equality =
           (lhs, rhs) => (bool) equalityMethodInfo.Invoke(null, new object[] { lhs, rhs });
       Console.WriteLine(t1 == t2);
       Console.WriteLine(equality(t1, t2));
       Console.WriteLine(equalityMethodInfo.Invoke(null, new object[] { t1, t2 }));
   }
}

struct Test : IEquatable<Test>
{
    private readonly double value;

    public Test(double value)
    {
        this.value = value;
    }

    public static bool operator==(Test lhs, Test rhs) =>
        lhs.value == rhs.value;

    public static bool operator!=(Test lhs, Test rhs) => !(lhs == rhs);

    public override bool Equals(object obj) =>
        obj is Test t && Equals(t);

    public bool Equals(Test other) => this == other;

    public override int GetHashCode() => value.GetHashCode();
}

That prints “True” three times – which is what we’d like it to do eventually, but it means it’s not reproducing the problem we want it do.

Back to reproducing it: two double values

Taking one step back towards the previous code, I went back to two double values instead of just one, expanding the Test struct like this:

struct Test : IEquatable<Test>
{
    private readonly double x;
    private readonly double y;

    public Test(double x, double y)
    {
        this.x = x;
        this.y = y;
    }

    public static bool operator==(Test lhs, Test rhs) =>
        lhs.x == rhs.x && lhs.y == rhs.y;

    public static bool operator!=(Test lhs, Test rhs) => !(lhs == rhs);

    public override bool Equals(object obj) =>
        obj is Test t && Equals(t);

    public bool Equals(Test other) => this == other;

    public override int GetHashCode() => x.GetHashCode() + y.GetHashCode();
}

Change both constructor calls to Test(1.5, 2.5) and bingo: it prints True, False, False.

Step 5: Explore the scope of the problem

I tried the exact same code on Windows, and (unsurprisingly, given that my earlier unit tests passed) it didn’t reproduce the problem. But sticking with Linux, we can still try multiple frameworks and multiple build configurations.

I wouldn’t like to say what prompted me to try different frameworks – intuition can be hard to explain, unfortunately. But let’s change the project file a bit:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>netcoreapp1.0;netcoreapp1.1;netcoreapp2.0</TargetFrameworks>
  </PropertyGroup>
</Project>

And then run it in several different configurations (remember that three “True” lines means “everything is working”, and “True”, “False”, “False” means “this is the problem we’re investigating”):

$ dotnet run -c Debug -f netcoreapp1.0
True
False
False
$ dotnet run -c Release -f netcoreapp1.0
True
False
False
$ dotnet run -c Debug -f netcoreapp1.1
True
True
True
$ dotnet run -c Release -f netcoreapp1.1
True
True
True
$ dotnet run -c Debug -f netcoreapp2.0
True
True
True
$ dotnet run -c Release -f netcoreapp2.0
True
True
True

So, it fails under .NET Core 1.0, but works under .NET Core 1.1 and .NET Core 2.0. That certainly makes it sound like a .NET Core 1.0 bug, but that doesn’t mean I want to give up at this point. It’s entirely possible that I have users who are using .NET Core 1.0, so I want my code to work under it.

Now, I’ve only given major/minor version numbers – and that’s partly because I don’t have a very good handle on the different parts of .NET Core on Linux and how to find the version numbers. The 1.0 packages I’ve got installed are listed as:

  • dotnet-dev-1.0.4/xenial,now 1.0.4-1 amd64 [installed]
  • dotnet-hostfxr-1.0.1/xenial,now 1.0.1-1 amd64 [installed]
  • dotnet-sharedframework-microsoft.netcore.app-1.0.5/xenial,now 1.0.5-1 amd64 [installed,automatic]

No dotnet-runtime package, which I believe is part of 2.0 instead.

Step 6: Look into the failed comparison

This part is really simple – just add more diagnostics to the == method:

public static bool operator==(Test lhs, Test rhs)
{
    Console.WriteLine($"lhs=({lhs.x}, {lhs.y}); rhs=({rhs.x}, {rhs.y})");
    Console.WriteLine($"{lhs.x == rhs.x}, {lhs.y == rhs.y}");
    return lhs.x == rhs.x && lhs.y == rhs.y;
}

Now things get really interesting in the output:

lhs=(1.5, 2.5); rhs=(1.5, 2.5)
True, True
True
lhs=(1.5, 1.85492638478664E-316); rhs=(1.5, 6.95251497332956E-310)
True, False
False
lhs=(1.5, 6.9525149733517E-310); rhs=(1.5, 6.9525149733351E-310)
True, False
False

The first three lines are as expected – and that’s the working comparison. But each of the three-line-segments for the second and third comparisons show the first field (x) being fine, but the second being some tiny double value, and inconsistent at that.

Step 7: Try (and fail) to simplify

Maybe boxing is to blame? How about we try a really simple version of boxing – just override ToString(), box a value and see what comes up. The change in Test is simply:

public override string ToString() => $"{x}, {y}";

And then to test it:

object x = new Test(1.5, 2.5);
Console.WriteLine(x);

This prints 1.5, 2.5 – so it appears to be boxed correctly.

Step 8: Try simplifying a bit less

In our failing program, there are two potential complications at play: we’re boxing the value, and we’re calling a method via reflection. If just boxing isn’t a problem, maybe reflection is?

I decided to try two things at once: an instance method and a static method accepting a Test parameter (like the operator). Now we’re not doing any comparisons, we can get rid of a lot of code. Here’s the new complete code:

using System;
using System.Reflection;

class Program
{
   static void Main(string[] args)
   {
       Test t = new Test(1.5, 2.5);
       var foo = typeof(Test).GetMethod("Foo");
       foo.Invoke(t, null);
       var bar = typeof(Test).GetMethod("Bar");
       bar.Invoke(null, new object[] { t });
   }
}

struct Test
{
    private readonly double x;
    private readonly double y;

    public Test(double x, double y)
    {
        this.x = x;
        this.y = y;
    }

    public void Foo() => Console.WriteLine($"{x}, {y}");
    public static void Bar(Test t) => Console.WriteLine($"{t.x}, {t.y}");
}

And now the results:

1.5, 2.5
1.5, 6.95260190884551E-310

Interesting! At this point, there are lots of things that I could check:

  • Does it affect all parameters?
  • Does it happen for other field types? Other primitives? References?
  • Can I provoke it in any other circumstances other than via
    reflection?

Most of these can probably be answered by finding an existing issue that was fixed in netcore1.1… which is reasonably feasible given the Open Source nature of .NET Core.

That can all wait though (it’s a Sunday evening for me, after all…) – let’s see if we can fix the problem first.

Step 9: Fix it with a stroke of luck

Given that this feels like a memory management issue, let’s try something I’ve basically not used before: explicit struct layout.

A struct containing two double fields should be pretty easy to lay out: just fields at offsets 0 and 8, right? Admittedly it also sounds like it should be easy for the CLR to get right itself… but at least laying it out explicitly is low-risk.

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Explicit)]
struct Test
{
    [FieldOffset(0)]
    private readonly double x;
    [FieldOffset(8)]
    private readonly double y;

    ...
}

Normally, things don’t get fixed on first attempt, but in this case… it just worked!

Backtracking from the simplification, to the standalone comparison version, to fixing GeoPoint (which involved not using automatically implemented properties any more, as I want to apply the attribute to the field, not the property – boo!) and checking it works locally, to adding a commit to the PR and seeing it pass on Travis. Yay!

Next steps and points to note

My next step is definitely to try to find some history of this issue. It’s a surprisingly big problem to affect the GA .NET Core 1.0 runtime, so I expect there’s some record of it somewhere. I want to understand better what was wrong, whether my fix is appropriate, and whether I need to apply it elsewhere.

Update: this is a still an issue in the 1.0.7 runtime, and apparently even in the 1.0.8-servicing daily builds, so I’ve filed an issue and look forward to seeing what happens there.

Some interesting aspects:

  • If you initially see a problem in CI, your top priority is to reproduce it locally.
  • As ever, remove as much other code as you can – your production code, your test code, all dependencies.
  • It’s handy to have multiple execution environments available. If I hadn’t had a local Linux box, this would have been a nightmare.
  • Try different configurations and different runtimes/frameworks.
  • Sometimes the bug really is in the runtime or BCL. It’s pretty rare, but it does happen.

14 thoughts on “Diagnosing a Linux-only unit test failure”

  1. It’s handy to have multiple execution environments available. If I hadn’t had a local Linux box, this would have been a nightmare.

    I’ve been debugging my own Travis-CI-only unit test failures, and have found Docker for Windows invaluable for this. As per “Interactively build and run a simple .NET Core application” at https://hub.docker.com/r/microsoft/dotnet/, run ‘docker run -it –rm microsoft/dotnet’ to start a container with the .NET SDK that can be used to test out small repros. (One can use the -v flag to map a local directory into the container.)

    Like

    1. Yes, Docker is extremely handy for this. Our newer tests are whole containers – so you can even get the exact failing version.
      With Python, you can even debug into the thing – invaluable. That should work in .Net Core, too but I have not tried it yet

      Like

  2. Seems a lot like a JIT bug, probably to do with the difference between Windows and Linux ABIs. Passing structs is tricky. Passing floating-point arguments is tricky. Mixing them doesn’t make anything easier. :)

    Setting explicit layout — even if it’s exactly the same as the compiler would generate — disables some optimizations like passing struct fields in registers. My guess is the working/nonworking cases on Linux diverge here: https://github.com/dotnet/coreclr/blob/7590378d/src/vm/methodtable.cpp#L2337

    (From HasExplicitFieldOffsetLayout to StructLayout.Explicit: https://github.com/dotnet/coreclr/blob/559c603f/src/vm/class.h#L867, https://github.com/dotnet/coreclr/blob/cb8cfba3/src/inc/corhdr.h#L345, https://github.com/dotnet/coreclr/blob/32f0f972/src/md/compiler/custattr_emit.cpp#L1366)

    I happen to remember a lot about struct layout and field promotion because an insufficiently cautious optimization here led to MS11-028. That’s when this code was added: https://github.com/dotnet/coreclr/blob/02c0ef24/src/jit/lclvars.cpp#L1633-L1639

    Like

  3. Hi John, I’d be interested to know if you could reproduce this issue on the Linux subsystem for Windows 10.

    I’ve found it very useful, it’s so handy not to have to run two separate environments.

    Like

  4. If i’m correctly then you are a program leader in the C# language development area at Microsoft.
    I wonder if you could bring in ML (neural nets) more simple into C#, compared to python C# is loosing in this area, while it could be a perfect area for the C# language too.

    Like

    1. No, I work for Google, not Microsoft.

      While I’d love to see ML in C# as well, speaking with members of the ML community it sounds like there are lots of other things that .NET needs that aren’t really ML-specific (visualization libraries, an equivalent to Jupyter notebooks etc) that would really be required before it became viable. Miguel de Icaza has a C# library supporting TensorFlow I believe, but I haven’t used it myself.

      Like

Leave a comment