Hosting ASP.NET Core behind https in Google Kubernetes Engine

Side-note: this may be one of the clumsiest titles I’ve ever written for a blog post. But it does what it says on the tin. Oh, and the space after “ASP” in “ASP .NET Core” everywhere is to avoid auto-linking. While I could use a different dot or a zero-width non-breaking space to avoid it, I’m not sure I trust WordPress to do the right thing with those…

Background

Over the past few weeks, I’ve moved nodatime.org, csharpindepth.com and jonskeet.uk over to Google Kubernetes Engine. (They all used to be hosted on Azure.)

I’ve done this for a few reasons:

  • As my job is primarily making .NET developers more productive on Google Cloud Platform, it feels natural to run my own code there. I want to see where there are friction points, so I can help fix them.
  • I wanted more redundancy, particularly for nodatime.org; Kubernetes felt like a simple way of managing that at a reasonable cost.
  • HTTPS certificate management (via Let’s Encrypt) has been a bit painful for me on Azure; I could have automated more, but that would have taken extra time I don’t have. (It may also have improved since I last looked.)

The first of these is the most important, by a long way. But the HTTPS management part – and then the knock-on effects – is what I’m looking at in this blog post.

Basic hosting

Hosting an ASP .NET Core application in Google Kubernetes Engine (GKE from now on) is really simple, at least once you’ve understood the Kubernetes concepts. I have:

In each case, the ASP .NET Core application is built with a vanilla Dockerfile which would not look unusual to anyone who’s hosted ASP .NET Core in Docker anywhere else.

I happen to use Google Cloud Build to build the Docker images, and Google Container Registry to host the images, but neither of those are required. (For csharpindepth.com and jonskeet.uk there are simple triggers in Google Cloud Build to build and deploy on GitHub pushes; for nodatime.org it’s a bit more complicated as the documentation build currently has some Windows dependencies. I have a machine at home that polls GitHub every half hour, and pushes the built application to Google Cloud Build for packaging when necessary.)

So, that gets HTTP hosting sorted. I dare say there are some aspects I’ve not configured as well as I could have done, but it was reasonably straightforward to get going.

HTTPS with Google-managed certificates

With HTTP working, it’s time to shoot for HTTPS. It’s important to note that the apps I’m talking about are all hobby projects, not commercial ones – I’m already paying for hosting, so I don’t want to have to pay for SSL certificates as well. Enter Let’s Encrypt, of course.

A while ago I used Let’s Encrypt to set up HTTPS on Azure, and while it was free and I didn’t have to write any code, it wasn’t exactly painless. I followed two guides at the same time, because neither of them exactly matched the Azure portal I was looking at. There were lots of bits of information to grab from various different bits of the portal, and it took a couple of attempts to get right… but I got there. I also set up a web job to renew the certificates, but didn’t go through the hoops required to run those web jobs periodically. (There were instructions, but it looked like they’d take a while to work through compared with just manually running the web job every couple of months or so. I decided to take the pragmatic approach, knowing that I was expecting to move to GKE anyway. If Azure had been the expected permanent home for the apps, I’d have gone through the steps and I’m sure they’d have worked fine.) I don’t know which guide I worked through at the time, but if I were starting today I’d probably try Scott Hanselman’s guide.

So, what can I do on Google Cloud Platform instead? I decided to terminate the SSL connection at the load balancer, using Google-managed certificates. To be really clear, these are currently in beta, but have worked well for me so far. Terminating the SSL connection at the load balancer means that the load balancer forwards the request to the Kubernetes service as an HTTP request, not HTTPS. The ASP .NET Core app itself only exposes an HTTP port, so it doesn’t need to know any details of certificates.

The steps to achieve this are simple, assuming you have the Google Cloud SDK (gcloud) installed already:

  • Create the certificate, e.g.
    gcloud beta compute ssl-certificates create nodatime-org --domains nodatime.org
  • Attach the certificate to the load balancer, via the Kubernetes ingress in my case, with an annotation in the ingress metadata:
    ingress.gcp.kubernetes.io/pre-shared-cert: "nodatime-org"
  • Apply the modifications to the ingress:
    kubectl apply -f ingress.yaml
  • Wait for the certificate to become valid (the provisioning procedure takes a little while, and I’ve seen some interesting errors while that’s taking place)
  • Enjoy HTTPS support, with auto-renewing certificates!
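For reference, the steps above correspond to an ingress manifest along these lines (the names here are illustrative rather than my exact configuration; the annotation is the important part):

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: nodatime-ingress
  annotations:
    # Attach the Google-managed certificate created with gcloud
    ingress.gcp.kubernetes.io/pre-shared-cert: "nodatime-org"
spec:
  backend:
    # The Kubernetes service fronting the ASP .NET Core deployment
    serviceName: nodatime-web
    servicePort: 80
```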

There are only two downsides to this that I’ve experienced so far:

  • Currently each certificate can only be associated with a single domain. For example, I have different certificates for nodatime.org, www.nodatime.org and test.nodatime.org. (More about the last of these later.) This is a minor annoyance, but the ingress supports multiple pre-shared certificates, so it has no practical implications for me.
  • I had to accept some downtime on HTTPS when transferring from Azure to GKE, while the certificate was provisioning after I’d transferred the DNS entry. This was a one-time issue of course, and one that wouldn’t affect most users.

Beyond the basics

At this point I had working HTTPS URLs – but any visitor using HTTP would stay that way. (At the time of writing this is still true for csharpindepth.com and jonskeet.uk.) Obviously I’d like to encourage secure browsing, so I’d like to use the two pieces of functionality provided by ASP .NET Core:

  • Redirection of HTTP requests via app.UseHttpsRedirection()
  • HSTS support via app.UseHsts()

I should note here that the Microsoft documentation was fabulously useful throughout. It didn’t quite get me all the way, but it was really close.

Now, I could have just added those calls into the code and deployed straight to production. Local testing would have worked – it would have redirected from localhost:5000 on HTTP to localhost:5001 on HTTPS with no problems. It would also have failed massively for reasons we’ll look at in a minute. Just for a change, I happened to do the right thing…

For hosting changes, always use a test deployment first

In Azure, I had a separate AppService I could deploy to, called nodatimetest. It didn’t have a fancy URL, but it worked okay. That’s where I tested Azure-specific changes before deploying to the real AppService. Unfortunately, it wouldn’t have helped in this situation, as it didn’t have a certificate.

Fortunately, creating a new service in Kubernetes, adding it to the ingress, and creating a managed certificate is so easy that I did do this for the new hosting – and I’m so glad I did so. I use a small script to publish the local ASP .NET Core build to Google Cloud Build which does the Docker packaging, pushes it to Google Container Registry and updates the Kubernetes deployment. As part of that script, I add a small text file containing the current timestamp so I can check that I’m really looking at the deployment I expect. It takes just under two minutes to build, push, package, deploy – not a tight loop you’d want for everyday development, but pretty good for the kind of change that can’t be tested locally.

So, I made the changes to use HTTPS redirection and HSTS, deployed, and… there was no obvious change.

Issue 1: No HTTPS port to redirect to

Remember how the ASP .NET Core app in Kubernetes is only listening on HTTP? That means it doesn’t know which port to redirect users to for HTTPS. Oops. While I guess it would be reasonable to guess 443 if it didn’t know any better, the default of “don’t redirect if you haven’t been told a port” means that your application doesn’t stop working if you get things wrong – it just doesn’t redirect.

This is easily fixed in ConfigureServices:

services.AddHttpsRedirection(options => options.HttpsPort = 443);

… but I’ve added conditional code so it doesn’t do that in the development environment, as otherwise it would try to redirect from localhost:5000 to localhost:443, which wouldn’t work. This is a bit hacky at the moment, which is a common theme – I want to clean up all the configuration at some point quite soon (moving things into appsettings.json as far as possible) but it’s just about hanging together for now.
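The conditional part is roughly along these lines – a sketch rather than my exact code, assuming an IHostingEnvironment injected into Startup:

services.AddHttpsRedirection(options => options.HttpsPort = 443);

```csharp
public class Startup
{
    private readonly IHostingEnvironment environment;

    public Startup(IHostingEnvironment environment) =>
        this.environment = environment;

    public void ConfigureServices(IServiceCollection services)
    {
        // Locally, Kestrel serves HTTPS on its own port (5001 by default),
        // so only pin the redirect port when running behind the load balancer.
        if (!environment.IsDevelopment())
        {
            services.AddHttpsRedirection(options => options.HttpsPort = 443);
        }
        // ... the rest of the service configuration ...
    }
}
```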

So, make the change, redeploy to test, and… the browser went into an infinite redirect loop. What?

Issue 2: Forwarding proxied headers

Remember again how the ASP .NET Core app is only listening on HTTP? We want it to behave differently depending on whether the end user made a request to the load balancer on HTTP or HTTPS. That means using headers forwarded from the proxy (in our case the load balancer) to determine the original request scheme. Fortunately, again there’s documentation on hand for this.

There are two parts to configuring this:

  • Configuring the ForwardedHeadersOptions in ConfigureServices
  • Calling app.UseForwardedHeaders() in Configure

(At least, that’s the way that’s documented. I’m sure there are myriad alternatives, but my experience level of ASP .NET Core is such that I’m still in “follow the examples verbatim, changing as little as possible” at the moment.)

I won’t go into the gory details of exactly how many times I messed up the forwarded headers options, but I will say:

  • The example which just changes options.ForwardedHeaders is probably fine if your proxy server is local to the application, but otherwise you will need to do extra work
  • The troubleshooting part of the documentation is spectacularly useful
  • There are warnings logged if you get things wrong, and those logs will help you – but they’re at a debug log level, so you may need to update your logging settings. (I only realized this after I’d fixed the problem, partly thanks to Twitter.)

Lesson to learn: when debugging a problem, turn on debugging logs. Who’d have thought?
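For what it’s worth, bumping the relevant category in appsettings.json looks something like this (assuming the default category naming, which follows the middleware’s namespace):

```json
{
  "Logging": {
    "LogLevel": {
      "Default": "Warning",
      "Microsoft.AspNetCore.HttpOverrides": "Debug"
    }
  }
}
```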

Configuring this properly is an area where you really need to understand your deployment and how a request reaches you. In my case, the steps are:

  • The user’s HTTPS request is terminated by the load balancer
  • The load balancer makes a request to the Kubernetes service
  • The Kubernetes service makes a request to the application running on one of the suitable nodes

This leads to a relatively complex configuration, as there are two networks to trust (Google Cloud load balancers, and my internal Kubernetes network) and we need to allow two “hops” of proxying. So my configuration code looks like this:

services.Configure<ForwardedHeadersOptions>(options =>
{
    options.KnownNetworks.Clear();
    // Google Cloud Platform load balancers
    options.KnownNetworks.Add(new IPNetwork(IPAddress.Parse("130.211.0.0"), 22));
    options.KnownNetworks.Add(new IPNetwork(IPAddress.Parse("35.191.0.0"), 16));
    // GKE service which proxies the request as well.
    options.KnownNetworks.Add(new IPNetwork(IPAddress.Parse("10.0.0.0"), 8));
    options.ForwardedHeaders = ForwardedHeaders.XForwardedFor | ForwardedHeaders.XForwardedProto;
    options.ForwardLimit = 2;
});

(The call to KnownNetworks.Clear() probably isn’t necessary. The default is to include the loopback, which is safe enough to leave in the list.)

Okay, deploy that to the test environment. Everything will work now, right? Well, sort of…

Issue 3: make sure health checks are healthy!

As it happens, when I’d finished fixing issue 2, I needed to help at a birthday party for a family we’re friends with. Still, I went to the party happily knowing everything was fine.

I then came home and found the test deployment was broken. Really broken. “502 Bad Gateway” broken. For both HTTP and HTTPS. This is not good.

I tried adding more logging, but it looked like none of my requests were getting through to the application. I could see in the logs (thank you, Stackdriver!) that requests were being made, always to just “/” on HTTP. They were all being redirected to HTTPS via a 307, as I’d expect.

This confused me for a while. I honestly can’t remember what gave me the lightbulb moment of “Ah, these are load balancer health checks, and it thinks they’re failing!” but I checked with the load balancer in the Google Cloud Console and sure enough, I had multiple working backends, and one broken one – my test backend. The reason I hadn’t seen this before was that I’d only checked the test deployment for a few minutes – not long enough for the load balancer to deem the backend unhealthy.

I was stuck at this point for a little while. I considered reconfiguring the load balancer to make the health check over HTTPS, but I don’t think that could work as the app isn’t serving HTTPS itself – I’d need to persuade it to make the request as if it were a user-generated HTTPS request, with appropriate X-Forwarded-Proto etc. headers. However, I saw that I could change which URL the load balancer would check. So how about we add a /healthz URL that would be served directly without being redirected? (The “z” at the end is a bit of Googler heritage. Just /health would be fine too, of course.)

I started thinking about adding custom inline middleware to do this, but fortunately didn’t get too far before realizing that ASP .NET Core provides health checking already… so all I needed to do was add the health check middleware before the HTTPS redirection middleware, and all would be well.

So in ConfigureServices, I added a no-op health check service:

services.AddHealthChecks();

And in Configure I added the middleware at an appropriate spot:

app.UseHealthChecks("/healthz");

After reconfiguring the health check on the load balancer, I could see /healthz requests coming in and receiving 200 (OK) responses… and the load balancer was then happy to use the backend again. Hooray!
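Putting the pieces together, the relevant part of my Configure method ends up ordered roughly like this (a sketch of my setup rather than a canonical ordering; environment is assumed to be an injected IHostingEnvironment):

```csharp
public void Configure(IApplicationBuilder app)
{
    // Trust X-Forwarded-For / X-Forwarded-Proto from the load balancer
    // first, so later middleware sees the original scheme.
    app.UseForwardedHeaders();

    // Health checks respond before any redirection, so the load
    // balancer's plain-HTTP probes always get a 200.
    app.UseHealthChecks("/healthz");

    if (!environment.IsDevelopment())
    {
        app.UseHsts();
    }
    app.UseHttpsRedirection();

    app.UseStaticFiles();
    app.UseMvc();
}
```

The ordering is the point here: anything registered before UseHttpsRedirection is served over plain HTTP without a redirect.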

After giving the test service long enough to fail, I deployed to production, changed the load balancer health check URL, and all was well. I did the two parts of this quickly enough so that it never failed – a safer approach would have been to add the health check handler but without the HTTPS redirection first, deploy that, change the health check URL, then turn on HTTPS.

But the end result is, all is working! Hooray!

Conclusion

Moving the service in the first place has been a long process, mostly due to a lack of time to spend on it, but the HTTPS redirection has been its own interesting bit of simultaneous pleasure and frustration. I’ve learned a number of lessons along the way:

  • The combination of Google Kubernetes Engine, Google Cloud Load Balancing, Google Cloud Build and Google Container Registry is pretty sweet.
  • Managed SSL certificates are wonderfully easy to use, even if there is a bit of a worrying delay while provisioning.
  • It’s really, really important to be able to test deployment changes (such as HTTPS redirection) in an environment which is very similar to production, but which no-one is depending on. (Obviously if you have a site which few people care about anyway, there’s less risk. But as it’s easy to set up a test deployment on GKE, why not do it anyway?)
  • HTTPS redirection caused me three headaches, all predictable:
    • ASP .NET Core needs to know the HTTPS port to redirect to.
    • You need to configure forwarded headers really carefully, and know your deployment model thoroughly.
    • Be aware of health checks! Make sure you leave a test deployment “live” for long enough for the health checks to mark it as unhealthy if you’ve done something wrong, before you deploy to production.
  • When debugging, turn on debug logging. Sounds obvious in retrospect, doesn’t it? (Don’t start trying to copy middleware source code into your own application so you can add logging, rather than using the logging already there…)

I also have some future work to do:

  • There’s a particular URL (http://nodatime.org/tzdb/latest.txt) which is polled by applications in order to spot time zone information changes. That’s the bulk of the traffic to the site. It currently redirects to HTTPS along with everything else, which leads to the total traffic being nearly double what it was before, for no real benefit. I’ve encouraged app authors to use HTTPS instead, but I’ve also filed a feature request against myself to consider serving that particular URL without the redirect. It looks like that’s non-trivial though.
  • I have a bunch of hard-coded information which should really be in appsettings.json. I want to move all of that, but I need to learn more about the best way of doing it first.

All in all, this has been a very positive experience – I hope the details above are useful to anyone else hosting ASP .NET Core apps in Google Kubernetes Engine.

NullableAttribute and C# 8

Background: Noda Time and C# 8

Note: this blog post was written based on experimentation with Visual Studio 2019 preview 2.2. It’s possible that some of the details here will change over time.

C# 8 is nearly here. At least, it’s close enough to being “here” that there are preview builds of Visual Studio 2019 available that support it. Unsurprisingly, I’ve been playing with it quite a bit.

In particular, I’ve been porting the Noda Time source code to use the new C# 8 features. The master branch of the repo is currently the code for Noda Time 3.0, which won’t be shipping (as a GA release) until after C# 8 and Visual Studio 2019 have fully shipped, so it’s a safe environment in which to experiment.

While it’s possible that I’ll use other C# 8 features in the future, the two C# 8 features that impact Noda Time most are nullable reference types and switch expressions. Both sets of changes are merged into master now, but the pull requests are still available so you can see just the changes:

The switch expressions PR is much simpler than the nullable reference types one. It’s entirely an implementation detail… although admittedly one that confused docfx, requiring a few of those switch expressions to be backed out or moved in a later PR.

Nullable reference types are a much, much bigger deal. They affect the public API, so they need to be treated much more carefully, and the changes end up being spread far and wide throughout the codebase. That’s why the switch expression PR is a single commit, whereas nullable reference types is split into 14 commits – mostly broken up by project.

Reviewing the public API of a nullable reference type change

So I’m now in a situation where I’ve got nullable reference type support in Noda Time. Anyone consuming the 3.0 build (and there’s an alpha available for experimentation purposes) from C# 8 will benefit from the extra information that can now be expressed about parameters and return values. Great!

But how can I be confident in the changes to the API? My process for making the change in the first place was to enable nullable reference types and see what warnings were created. That’s a great starting point, but it doesn’t necessarily catch everything. In particular, although I started with the main project (the one that creates NodaTime.dll), I found that I needed to make more changes later on, as I modified other projects.

Just because your code compiles without any warnings with nullable reference types enabled doesn’t mean it’s “correct” in terms of the API you want to expose.

For example, consider this method:

public static string Identity(string input) => input;

That’s entirely valid C# 7 code, and doesn’t require any changes to compile, warning-free, in C# 8 with nullable reference types enabled. But it may not be what you actually want to expose. I’d argue that it should look like one of these:

// Allowing null input, and producing nullable output
public static string? Identity(string? input) => input;

// Preventing null input, and producing non-nullable output
public static string Identity(string input)
{
    // Convenience method for nullity checking.
    Preconditions.CheckNotNull(input, nameof(input));
    return input;
}

If you were completely diligent when writing tests for the code before C# 8, it should be obvious which is required – because you’d presumably have something like:

[Test]
public void Identity_AcceptsNull()
{
    Assert.IsNull(Identity(null));
}

That test would have produced a warning in C# 8, and would have suggested that the null-permissive API is the one you wanted. But maybe you forgot to write that test. Maybe the test you would have written was one that would have shown up a need to put that precondition in. It’s entirely possible that you write much more comprehensive tests than I do, but I suspect most of us have some code that isn’t explicitly tested in terms of its null handling.

The important take-away here is that even code that hasn’t changed in appearance can change meaning in C# 8… so you really need to review any public APIs. How do you do that? Well, you could review the entire public API surface you’re exposing, of course. For many libraries that would be the simplest approach to take, as a “belt and braces” attitude to review. For Noda Time that’s less appropriate, as so much of the API only deals in value types. While a full API review would no doubt be useful in itself, I just don’t have the time to do it right now.

Instead, what I want to review is any API element which is impacted by the C# 8 change – even if the code itself hasn’t changed. Fortunately, that’s relatively easy to do.

Enter NullableAttribute

The C# 8 compiler applies a new attribute to every API element which is affected by nullability. As an example of what I mean by this, consider the following code which uses the #nullable directive to control the nullable context of the code.

public class Test
{
#nullable enable
    public void X(string input) {}

    public void Y(string? input) {}
#nullable restore

#nullable disable
    public void Z(string input) {}
#nullable restore
}

The C# 8 compiler creates an internal NullableAttribute class within the assembly (which I assume it wouldn’t if we were targeting a framework that already includes such an attribute) and applies the attribute anywhere it’s relevant. So the above code compiles to the same IL as this:

using System.Runtime.CompilerServices;

public class Test
{
    public void X([Nullable((byte) 1)] string input) {}

    public void Y([Nullable((byte) 2)] string input) {}

    public void Z(string input) {}
}

Note how the parameter for Z doesn’t have the attribute at all, because that code is still oblivious to nullable reference types. But both X and Y have the attribute applied to their parameters – just with different arguments to describe the nullability. 1 is used for not-null; 2 is used for nullable.

That makes it relatively easy to write a tool to display every part of a library’s API that relates to nullable reference types – just find all the members that refer to NullableAttribute, and filter down to public and protected members. It’s slightly annoying that NullableAttribute doesn’t have any properties; code to analyze an assembly needs to find the appropriate CustomAttributeData and examine the constructor arguments. It’s awkward, but not insurmountable.
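As a sketch of the idea, something along these lines finds parameters decorated with the attribute – matched by name, since the attribute type is embedded in each assembly. (This is a simplified illustration, not the actual Noda Time tool.)

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public static class NullabilityReport
{
    // One line per public method parameter decorated with the
    // compiler-generated NullableAttribute, showing the raw
    // constructor arguments.
    public static List<string> Report(Assembly assembly) =>
        (from type in assembly.GetTypes()
         where type.IsPublic
         from method in type.GetMethods(
             BindingFlags.Public | BindingFlags.Instance |
             BindingFlags.Static | BindingFlags.DeclaredOnly)
         from parameter in method.GetParameters()
         from attribute in parameter.GetCustomAttributesData()
         where attribute.AttributeType.Name == "NullableAttribute"
         select $"{type.Name}.{method.Name}({parameter.Name}): " +
                string.Join(", ", attribute.ConstructorArguments))
        .ToList();
}

#nullable enable
public class Sample
{
    // Expected flags {1, 2}: non-null list of nullable strings.
    public void Method(List<string?> names) {}
}
#nullable restore

public class Program
{
    public static void Main() =>
        NullabilityReport.Report(typeof(Sample).Assembly)
            .ForEach(Console.WriteLine);
}
```

A real tool would also need to cover return values, properties, fields and so on, but the shape is the same: find the CustomAttributeData, then inspect its constructor arguments.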

I’ve started doing exactly that in the Noda Time repository, and got it to the state where it’s fine for Noda Time’s API review. It’s a bit quick and dirty at the moment. It doesn’t show protected members, or setter-only properties, or handle arrays, and there are probably other things I’ve forgotten about. I intend to improve the code over time and probably move it to my Demo Code repository at some point, but I didn’t want to wait until then to write about NullableAttribute.

But hey, I’m all done, right? I’ve just explained how NullableAttribute works, so what’s left? Well, it’s not quite as simple as I’ve shown so far.

NullableAttribute in more complex scenarios

It would be oh-so-simple if each parameter or return type could just be nullable or non-nullable. But life gets more complicated than that, with both generics and arrays. Consider a method called GetNames() returning a list of strings. All of these are valid:

// Return value is non-null, and elements aren't null
List<string> GetNames()

// Return value is non-null, but elements may be null
List<string?> GetNames()

// Return value may be null, but elements aren't null
List<string>? GetNames()

// Return value may be null, and elements may be null
List<string?>? GetNames()

So how are those represented in IL? Well, NullableAttribute has one constructor accepting a single byte for simple situations, but another one accepting byte[] for more complex ones like this. Of course, List<string> is still relatively simple – it’s just a single top-level generic type with a single type argument. For a more complex example, imagine Dictionary<List<string?>, string[]?>. (A non-nullable reference to a dictionary where each key is a not-null list of nullable strings, and each value is a possibly-null array of non-nullable elements. Ouch.)

The layout of NullableAttribute in these cases can be thought of in terms of a pre-order traversal of a tree representing the type, where generic type arguments and array element types are leaves in the tree. The above example could be thought of as this tree:

         Dictionary<,> (not null)
            /               \
           /                 \
 List<> (not null)      Array (nullable)
        |                     |
        |                     |
 string (nullable)      string (not null)

The pre-order traversal of that tree gives us these values:

  • Not null (dictionary)
  • Not null (list)
  • Nullable (string)
  • Nullable (array)
  • Not null (string)

So a parameter declared with that type would be decorated like this:

[Nullable(new byte[] { 1, 1, 2, 2, 1 })]
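Just to make the traversal concrete, here’s a little sketch that re-applies a flag array to a type using the same pre-order walk. It’s illustrative only: it handles just the reference types, arrays and generics needed for this example.

```csharp
using System;
using System.Collections.Generic;

public static class NullableFlags
{
    // Renders a type with ?s re-applied from a NullableAttribute flag
    // array: one byte per reference type in pre-order (1 = not null,
    // 2 = nullable), with generic type arguments and array element
    // types as children in the tree.
    public static string Annotate(Type type, byte[] flags)
    {
        int index = 0;
        return Annotate(type, flags, ref index);
    }

    private static string Annotate(Type type, byte[] flags, ref int index)
    {
        if (type.IsArray)
        {
            // The array node's flag comes before its element type's.
            string suffix = flags[index++] == 2 ? "?" : "";
            return Annotate(type.GetElementType(), flags, ref index) + "[]" + suffix;
        }
        if (type.IsValueType)
        {
            // Value types are skipped in this sketch.
            return type.Name;
        }
        string marker = flags[index++] == 2 ? "?" : "";
        if (!type.IsGenericType)
        {
            return (type == typeof(string) ? "string" : type.Name) + marker;
        }
        string name = type.Name.Substring(0, type.Name.IndexOf('`'));
        var arguments = new List<string>();
        foreach (var argument in type.GetGenericArguments())
        {
            arguments.Add(Annotate(argument, flags, ref index));
        }
        return $"{name}<{string.Join(", ", arguments)}>{marker}";
    }

    public static void Main()
    {
        // The example from the text: flags {1, 1, 2, 2, 1} applied to
        // the "plain" type prints Dictionary<List<string?>, string[]?>
        Console.WriteLine(Annotate(
            typeof(Dictionary<List<string>, string[]>),
            new byte[] { 1, 1, 2, 2, 1 }));
    }
}
```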

But wait, there’s more!

NullableAttribute in simultaneously-complex-and-simple scenarios

The compiler has one more trick up its sleeve. When all the elements in the tree are “not null” or all elements in the tree are “nullable”, it simply uses the constructor with the single-byte parameter instead. So Dictionary<List<string>, string[]> would be decorated with [Nullable((byte) 1)] and Dictionary<List<string?>?, string?[]?>? would be decorated with [Nullable((byte) 2)].

(Admittedly, Dictionary<,> doesn’t permit null keys anyway, but that’s an implementation detail.)

Conclusion

The C# 8 feature of nullable reference types is a really complicated one. I don’t think we’ve seen anything like this since async/await. This post has just touched on one interesting implementation detail. I’m sure there’ll be more posts on nullability over the next few months…

Farewell, Daisy Shipton

This is more of a quick, explanatory “heads-up” post than anything else.

On March 31st 2018, I started an experiment: I created a new Stack Overflow user called “Daisy Shipton” with no picture and a profile that just read “Love coding in C#” (or similar). I wanted to see how a new user presenting with a traditionally-female name would be treated, while posting the same content that I normally would. This experiment was only a small part of my thinking around the culture of Stack Overflow, and I expect to write more on that subject, touching on the experience of “Daisy”, at another time.

I let a few people in on the secret as I went along – people who I fully expected to recognize my writing style fairly quickly. A single person emailed me to ask whether Daisy and I were the same person – well done to them for spotting it. (Once someone had the idea, the evidence was pretty compelling – the “Jon Skeet” account went into a decline in posting answers at the same time that the “Daisy Shipton” account was created, and Daisy just happened to post about C#, Noda Time, Protocol Buffers, time zones and Google Cloud Platform client libraries for .NET. I really wasn’t trying to cover my tracks.)

As Daisy reached a rep of about 12,000 points, there seemed little point in continuing the experiment, so I asked for “her” account to be merged into my regular one. So if you see comments on my posts referring to @DaisyShipton, that’s why.

There’s one aspect of experimentation that never happened: Daisy never asked a question. Next time I want to ask a question on Stack Overflow, I’ll probably create another account to see how a question I think is good is received when posted from a 1-rep account.

It’s been fun, but it’ll also be nice to only have one account to manage now…

First steps with nullable reference types

This blog post is effectively a log of my experience with the preview of the C# 8 nullable reference types feature.

There are lots of caveats here: it’s mostly “as I go along” so there may well be backtracking. I’m not advising the right thing to do, as I’m still investigating that myself. And of course the feature is still changing. Oh, and this blog post is inconsistent about its tense. Sometimes I write in the present tense as I go along, sometimes I wrote in the past tense afterwards without worrying about it. I hope this isn’t/wasn’t/won’t be too annoying.

I decided that the best way of exploring the feature would be to try to use it with Noda Time. In particular:

  • Does it find any existing bugs?
  • Do my existing attributes match what Roslyn expects?
  • Does the feature get in the way, or improve my productivity?

Installation

I started at the preview page on GitHub. There are two really important points to note:

  • Do this on a non-production machine. I used an old laptop, but presumably you can use a VM instead.
  • Uninstall all versions of Visual Studio other than VS2017 first

I ended up getting my installation into a bad state, and had to reinstall VS2017 (and then the preview) before it would work again. Fortunately that takes a lot less time than it used to.

Check it works

The preview does not work with .NET Core projects or the command-line csc.

It’s only for old-style MSBuild projects targeting the .NET Framework, and only from Visual Studio.

So to test your installation:

  • Create a new .NET Framework desktop console app
  • Edit Program.cs to include: string? x = null;
  • Build

If you get an error CS0453 (“The type ‘string’ must be a non-nullable value type…”) then it’s not working. If it builds with maybe a warning about the variable not being used, you’re good to go.

First steps with Noda Time

The first thing I needed to do was convert Noda Time to a desktop project. This didn’t require the preview to be installed, so I was able to do it on my main machine.

I created a new solution with three desktop projects (NodaTime, NodaTime.Test and NodaTime.Testing), and added the dependencies between the projects and external ones. I then copied these project files over the regular Noda Time ones.

Handy tip: if you add <Compile Include="**\*.cs" /> in an MSBuild file and open it in Visual Studio, VS will replace it with all the C# files it finds. No need for tedious “Add existing” all over the place.

A small amount of fiddling was required for signing and resources, and then I had a working copy of Noda Time targeting just .NET 4.5. All tests passed :)

For anyone wanting to follow my progress, the code is in a branch of my fork of Noda Time although I don’t know how long I’ll keep it for.

Building with the preview

After fetching that branch onto my older laptop, it built first time – with 228 warnings, most of which were “CS8600: Cannot convert null to non-nullable reference.” Hooray – this is exactly what we want. Bear in mind that this is before I’ve made any changes to the code.

The warnings were split between the three projects like this:

  • NodaTime: 94
  • NodaTime.Testing: 0
  • NodaTime.Test: 134

Follow the annotations

Noda Time already uses [CanBeNull] and [NotNull] attributes for both parameters and return types to indicate expectations. The first obvious step is to visit each application of [CanBeNull] and use a nullable reference type there.

To make it easier to see what’s going on, I first unloaded the NodaTime.Test project. This was so that I could concentrate on making NodaTime self-consistent to start with.

Just doing that actually raised the number of warnings from 94 to 110. Clearly I’m not as consistent as I’d like to be. I suspect I’ve got plenty of parameters which can actually be null but which I didn’t apply the annotation to. It’s time to start actually looking at the warnings.

Actually fix things

I did this in a completely haphazard fashion: fix one warning, go onto another.

I’ve noticed a pattern that was already feasible before, but has extra benefits in the nullable reference type world. Instead of this:

// Old code
string id = SomeMethodThatCanReturnNull();
if (id == null)
{
    throw new SomeException();
}
// Use id knowing it's not null

… I can use the ?? operator with the C# 7 feature of throw expressions:

// New code
string id = SomeMethodThatCanReturnNull() ??
    throw new SomeException();
// Use id knowing it's not null

That avoids having a separate local variable of type string?, which can be very handy.

I did find a few places where the compiler could do better at working out nullity. For example:

// This is fine
string? x = SomeExpressionThatCanReturnNull();
if (x == null)
{
    return;
}
string y = x;

// This creates a warning: the compiler doesn't know that x
// can't be null on the last line
string? x = SomeExpressionThatCanReturnNull();
if (ReferenceEquals(x, null))
{
    return;
}
string y = x;

The preview doc talks about this in the context of string.IsNullOrEmpty; the ReferenceEquals version is a little more subtle as we can’t determine nullity just from the output – it’s only relevant if the other argument is a constant null. On the other hand, that’s such a fundamental method that I’m hopeful it’ll get fixed.

Fixing these warnings didn’t take very long, but it was definitely like playing whack-a-mole: you fix one warning, and that causes another. For example, you might make a return type nullable to make a return null; statement work – and that affects all the callers.

I found that rebuilding would sometimes find more warnings, too. At one point I thought I was done (for the time being) – after rebuilding, I had 26 warnings.

I ran into one very common problem: implementing IEquatable<T> (for a concrete reference type T). In every case, I ended up making it implement IEquatable<T?>. I think that’s the right thing to do… (I didn’t do it consistently though, as I found out later on. And IEqualityComparer<T> is trickier, as I’ll talk about later.)
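To make that pattern concrete, here’s a minimal sketch – with a hypothetical Widget type, not actual Noda Time code – of implementing IEquatable&lt;T?&gt; on a concrete reference type:

```csharp
using System;

// Hypothetical type: implementing the nullable form of the interface,
// so that Equals(null) is explicitly allowed by the type system.
public sealed class Widget : IEquatable<Widget?>
{
    public int Id { get; }
    public Widget(int id) => Id = id;

    // A null reference is simply "not equal"; no warning for callers
    // passing a possibly-null value.
    public bool Equals(Widget? other) => other != null && other.Id == Id;

    public override bool Equals(object? obj) => Equals(obj as Widget);
    public override int GetHashCode() => Id;
}
```

With this shape, widget.Equals(null) just returns false, and the compiler doesn’t complain about null arguments.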

Reload the test project

So, after about an hour of fixing warnings in the main project, what would happen when I reload the test project? We previously had 134 warnings in the test project. After reloading… I was down to 123.

Fixing the test project involved fixing a lot more of the production code, interestingly enough. And that led me to find a limitation not mentioned on the preview page:

public static NodaFormatInfo GetInstance(IFormatProvider? provider)
{
    switch (provider)
    {
        case null:
            return ...;
        case CultureInfo cultureInfo:
            return ...;
        case DateTimeFormatInfo dateTimeFormatInfo:
            return ...;
        default:
            throw new ArgumentException($"Cannot use provider of type {provider.GetType()}");
    }
}

This causes a warning of a possible dereference in the default case – despite the fact that provider clearly can’t be null, as otherwise it would match the null case. Will try to provide a full example in a bug report.

The more repetitive part is fixing all the tests that ensure a method throws an ArgumentNullException if called with a null argument. As there’s now a compile-time check as well, the argument needs to be null!, meaning “I know it’s really null, but pretend it isn’t.” It makes me chuckle in terms of syntax, but it’s tedious to fix every occurrence.
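Each such fix ends up looking something like this – a sketch using NUnit (which Noda Time uses), with a hypothetical Library.Process(string) method standing in for whatever is under test:

```csharp
// Hypothetical method under test: Library.Process validates its argument.
[Test]
public void Process_NullArgument_ThrowsArgumentNullException()
{
    // Without the null-forgiving "!", the compiler warns about converting
    // a null literal to the non-nullable parameter type. "null!" says
    // "I know it's really null, but pretend it isn't" for this one call.
    Assert.Throws<ArgumentNullException>(() => Library.Process(null!));
}
```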

IEqualityComparer<T>

I have discovered an interesting problem. It’s hard to implement IEqualityComparer<T> properly. The signatures on the interface are pretty trivial:

public interface IEqualityComparer<in T>
{
    bool Equals(T x, T y);
    int GetHashCode(T obj);
}

But problems lurk beneath the surface. The documentation for the Equals() method doesn’t state what should happen if x or y is null. I’ve typically treated this as valid, and just used the normal equality rules (two null references are equal to each other, but nothing else is equal to a null reference.) Compare that with GetHashCode(), where it’s explicitly documented that the method should throw ArgumentNullException if obj is null.

Now think about a type I’d like to implement an equality comparer for – Period for example. Should I write:

public class PeriodComparer : IEqualityComparer<Period?>

This allows x and y to be null – but also allows obj to be null, causing an ArgumentNullException, which this language feature is trying to eradicate as far as possible.

I could implement the non-nullable version instead:

public class PeriodComparer : IEqualityComparer<Period>

Now the compiler will check that you’re not passing a possibly-null value to GetHashCode(), but it will also flag possibly-null arguments to Equals(), despite null being perfectly valid there.

This feels like it’s a natural but somewhat unwelcome result of the feature arriving so much later than the rest of the type system. I’ve chosen to implement the nullable form, but still throw the exception in GetHashCode(). I’m not sure that’s the right solution, but I’d be interested to hear what others think.
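Concretely, the chosen approach looks something like this – a sketch using a hypothetical Item type in place of Period: Equals accepts nulls with the usual equality rules, while GetHashCode keeps its documented ArgumentNullException:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type standing in for Period.
public sealed class Item
{
    public string Name { get; }
    public Item(string name) => Name = name;
}

// Implement the nullable form of the interface...
public sealed class ItemComparer : IEqualityComparer<Item?>
{
    // Two nulls are equal; a null is equal to nothing else.
    public bool Equals(Item? x, Item? y) =>
        x == null ? y == null : y != null && x.Name == y.Name;

    // ...but keep the documented exception for a null argument here.
    public int GetHashCode(Item? obj) =>
        obj == null ? throw new ArgumentNullException(nameof(obj))
                    : obj.Name.GetHashCode();
}
```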

Found bugs in Noda Time!

One of the things I was interested in finding out with all of this was how consistent Noda Time is in terms of its nullity handling. Until you have a tool like this, it’s hard to tell. I’m very pleased to say that most of it hangs together nicely – although so far that’s only the result of getting down to no warnings, rather than a file-by-file check through the code, which I suspect I’ll want to do eventually.

I did find two bugs, however. Noda Time tries to handle the case where TimeZoneInfo.Local returns a null reference, because we’ve seen that happen in the wild. (Hopefully that’s been fixed now in Mono, but even so it’s nice to know we can cope.) It turns out that we have code to cope with it in one place, but there are two places where we don’t… and the C# 8 tooling found that. Yay!

Found a bug in the preview!

To be clear, I didn’t expect the preview code to be perfect. As noted earlier, there are a few places I think it can be smarter. But I found a nasty bug that would hang Visual Studio and cause csc.exe to fail when building. It turns out that if you have a type parameter T with a constraint of T : class, IEquatable<T?>, that causes a stack overflow. I’ve reported the bug (now filed on GitHub thanks to diligent Microsoft folks) so hopefully it’ll be fixed long before the final version.

Admittedly the constraint is interesting in itself – it’s not necessarily clear what it means, if T is already a nullable reference type. I’ll let smarter people than myself work that out.
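For reference, the shape of the declaration that triggered the problem – a minimal sketch rather than the actual Noda Time code:

```csharp
// In the preview, a constraint of this shape caused a stack overflow
// in the compiler (the type parameter constrained to be a class that
// implements IEquatable of its own nullable form).
public class Cache<T> where T : class, IEquatable<T?>
{
}
```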

Conclusion

Well, that was a jolly exercise. My first impressions are:

  • We really need class library authors to embrace this as soon as C# 8 comes out, in order to make it as useful as possible early. Noda Time has no further dependencies, fortunately.
  • It didn’t take as long as I feared it might to do a first pass at annotating Noda Time, although I’m sure I missed some things while doing it.
  • A few bugs aside, the tooling is generally in very good shape; the warnings it produced were relevant and easy to understand.
  • It’s going to take me a while to get used to things like IList<string?>? for a nullable list of nullable strings.

Overall I’m very excited by all of this – I’m really looking forward to the final release. I suspect more blog posts will come over time…

Backward compatibility and overloading

I started writing a blog post about versioning in July 2017. I’ve mostly abandoned it, because I think the topic is too vast for a single post. It potentially needs a whole site/wiki/repository devoted to it. I hope to come back to it at some point, because I believe this is a hugely important topic that doesn’t get as much attention as it deserves.

In particular, the .NET ecosystem is mostly embracing semantic versioning – which sounds great, but does rely on us having a common understanding of what’s meant by a “breaking change”. That’s something I’ve been thinking about quite a lot. One aspect which has struck me forcefully recently is how hard it is to avoid breaking changes when using method overloading. That’s what this post is about, mostly because it’s fun.

First, a quick definition…

Source and binary compatibility

If I can recompile my client code with a new version of the library and it all works fine, that’s source compatible. If I can redeploy my existing client binary with a new version of the library without recompiling, that’s binary compatible. Neither of these is a superset of the other:

  • Some changes are both source and binary incompatible, such as removing a whole public type that you depended on.
  • Some changes are source compatible but binary incompatible, such as changing a public static read-only field into a property.
  • Some changes are binary compatible but source incompatible, such as adding an overload which could cause compile-time ambiguity.
  • Some changes are source and binary compatible, such as reimplementing the body of a method.
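The second bullet is worth a concrete sketch, as it often surprises people (MaxSize is a made-up member name):

```csharp
// Library version 1.0
public static readonly int MaxSize = 10;

// Library version 1.1
// Source-compatible: client code reading Library.MaxSize still compiles.
// Binary-incompatible: a client compiled against 1.0 refers to a field,
// which no longer exists at run time, so the client fails with a
// MissingFieldException instead of calling the property getter.
public static int MaxSize { get; } = 10;
```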

So what are we talking about?

I’m going to assume that we have a public library at version 1.0, and we wish to add some overloads in version 1.1. We’re following semantic versioning, so we need to be backward compatible. What does that mean we can and can’t do, and is it a simple binary choice?

In various cases, I’ll present library code at version 1.0 and version 1.1, then “client” code (i.e. code that is using the library) which could be broken by the change. I’m not presenting method bodies or class declarations, as they’re largely irrelevant – focus on the signatures. It should be easy to reproduce any of this if you’re interested though. We’ll imagine that all the methods I present are in a class called Library.

Simplest conceivable change, foiled by method group conversions

The simplest example I can imagine would be adding a parameterized method when there’s a parameterless one already:

// Library version 1.0
public void Foo()

// Library version 1.1
public void Foo()
public void Foo(int x)

Even that’s not completely compatible. Consider this client code:

// Client
static void Method()
{
    var library = new Library();
    HandleAction(library.Foo);
}

static void HandleAction(Action action) {}
static void HandleAction(Action<int> action) {}

In library version 1.0, that’s fine. The call to HandleAction performs a method group conversion of library.Foo to create an Action. In library version 1.1, it’s ambiguous: the method group can be converted to either Action or Action<int>. So it’s not source compatible, if we’re going to be strict about it.

At this point you might be tempted to give up and go home, resolving never to add any overloads, ever again. Or maybe we can say that this is enough of a corner case to not consider it breaking. Let’s call method group conversions out of scope for now.

Unrelated reference types

We get into a different kind of territory when we have overloads with the same number of parameters. You might expect this library change to be non-breaking:

// Library version 1.0
public void Foo(string x)

// Library version 1.1
public void Foo(string x)
public void Foo(FileStream x)

That feels like it should be reasonable. The original method still exists, so we won’t be breaking binary compatibility. The simplest way of breaking source compatibility is to have a call that either works in v1.0 but doesn’t in v1.1, or works in both but does something different in v1.1 than it did in v1.0.

How can a call break between v1.0 and v1.1? We’d have to have an argument that’s compatible with both string and FileStream. But they’re unrelated reference types…

The first failure is if we have a user-defined implicit conversion to both string and FileStream:

// Client
class OddlyConvertible
{
    public static implicit operator string(OddlyConvertible c) => null;
    public static implicit operator FileStream(OddlyConvertible c) => null;
}

static void Method()
{
    var library = new Library();
    var convertible = new OddlyConvertible();
    library.Foo(convertible);
}

Hopefully the problem is obvious: what used to be unambiguous via a conversion to string is now ambiguous as the OddlyConvertible type can be implicitly converted to both string and FileStream. (Both overloads are applicable, neither is better than the other.)

It may be reasonable to exclude user-defined conversions… but there’s a far simpler way of making this fail:

// Client
static void Method()
{
    var library = new Library();
    library.Foo(null);
}

The null literal is implicitly convertible to any reference type or any nullable value type… so again, the call becomes ambiguous in the library v1.1. Let’s try again…

Reference type and non-nullable value type parameters

If we don’t mind user-defined conversions, but don’t like null literals causing a problem, how about introducing an overload with a non-nullable value type?

// Library version 1.0
public void Foo(string x)

// Library version 1.1
public void Foo(string x)
public void Foo(int x)

This looks good – library.Foo(null) will be fine in v1.1. So is it safe? Not in C# 7.1…

// Client
static void Method()
{
    var library = new Library();
    library.Foo(default);
}

The default literal is like the null literal, but for any type. It’s really useful – and a complete pain when it comes to overloading and compatibility :(

Optional parameters

Optional parameters bring their own kind of pain. Suppose we have one optional parameter, but wish to add a second. We have three options, shown as 1.1a, 1.1b and 1.1c below.

// Library version 1.0
public void Foo(string x = "")

// Library version 1.1a
// Keep the existing method, but add another one with two optional parameters.
public void Foo(string x = "")
public void Foo(string x = "", string y = "")

// Library version 1.1b
// Just add the parameter to the existing method.
public void Foo(string x = "", string y = "")

// Library version 1.1c
// Keep the old method but make the parameter required, and add a new method
// with both parameters optional.
public void Foo(string x)
public void Foo(string x = "", string y = "")

Let’s think about a client that makes two calls:

// Client
static void Method()
{
    var library = new Library();
    library.Foo();
    library.Foo("xyz");
}

Library 1.1a keeps binary compatibility, but breaks source compatibility: the library.Foo() call is now ambiguous. The C# overloading rules prefer a method that doesn’t need the compiler to “fill in” any optional parameters, but they don’t have any preference in terms of how many optional parameters are filled in.

Library 1.1b keeps source compatibility, but breaks binary compatibility. Existing compiled code will expect to call a method with a single parameter – and that method no longer exists.

Library 1.1c keeps binary compatibility, but is potentially odd around source compatibility. The library.Foo() call now resolves to the two-parameter method, whereas library.Foo("xyz") resolves to the one-parameter method (which the compiler prefers over the two-parameter method because it doesn’t need to fill in any optional parameters). That may very well be okay, if the one-parameter version simply delegates to the two-parameter version using the same default value. It feels odd for the meaning of the first call to change though, when the method it used to resolve to still exists.
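If you do take the 1.1c route, the delegation described above might look like this (a sketch):

```csharp
// Library version 1.1c, with the one-parameter overload delegating to the
// two-parameter one using the same default value, so both possible
// resolutions of library.Foo("xyz") behave identically.
public void Foo(string x) => Foo(x, "");
public void Foo(string x = "", string y = "") { /* real implementation */ }
```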

Optional parameters get even hairier when you don’t want to add a new one at the end, but in the middle – e.g. if you’re trying to follow a convention of keeping an optional CancellationToken parameter at the end. I’m not going to dive into this…

Generics

Type inference is a tricky beast at the best of times. With overload resolution it goes into full-on nightmare mode.

Let’s have a single non-generic method in v1.0, and then add a generic method in v1.1.

// Library version 1.0
public void Foo(object x)

// Library version 1.1
public void Foo(object x)
public void Foo<T>(T x)

That doesn’t seem too awful… but let’s look closely at what happens to client code:

// Client
static void Method()
{
    var library = new Library();
    library.Foo(new object());
    library.Foo("xyz");
}

In library v1.0, both calls resolve to Foo(object) – the only method that exists.

Library v1.1 is binary-compatible: if we use a client executable compiled against v1.0 but running against v1.1, both calls will still use Foo(object). But if we recompile, the second call (and only the second one) will change to using the generic method. Both methods are applicable for both calls.

In the first call, T would be inferred to be object, so the argument-to-parameter-type conversion is just object to object in both cases. Great. The compiler applies a tie-break rule that prefers non-generic methods over generic methods.

In the second call, T would be inferred to be string, so the argument-to-parameter-type conversion is string to object for the original method and string to string for the generic method. The latter is a “better” conversion, so the second method is picked.

If the two methods behave the same way, that’s fine. If they don’t, you’ve broken compatibility in a very subtle way.

Inheritance and dynamic typing

I’m sorry: I just don’t have the energy. Both inheritance and dynamic typing would interact with overload resolution in “fun” and obscure ways.

If you add a method in one level of the inheritance hierarchy which overloads a method in a base class, the new method will be examined first, and picked over the base class method even when the base class method is more specific in terms of argument-to-parameter-type conversions. There’s lots of scope for messing things up.

Likewise with dynamic typing (within the client code), to some extent all bets are off. You’re already sacrificing a lot of compile-time safety… it shouldn’t come as a surprise when things break.

Conclusion

I’ve tried to keep the examples reasonably simple here. It can get really complicated really quickly as soon as you have multiple optional parameters etc.

Versioning is hard and makes my head hurt.

Stack Overflow Culture

This blog post was most directly provoked by this tweet from my friend Rob Conery, explaining why he’s giving up contributing on Stack Overflow.

However, it’s been a long time coming. A while ago I started writing a similar post, but it got longer and longer without coming to any conclusion. I’m writing this one with a timebox of one hour, and then I’ll post whatever I’ve got. (I may then reformat it later.)

I’m aware of the mixed feelings many people have about Stack Overflow. Some consider it to be completely worthless, but I think more people view it as “a valuable resource, but a scary place to contribute due to potential hostility.” Others contribute on a regular basis, occasionally experiencing or witnessing hostility, but generally having a reasonable time.

This post talks about my experiences and my thoughts on where Stack Overflow has a problem, where I disagree with some of the perceived problems, and what can be done to improve the situation. This is a topic I wish I’d had time to talk about in more detail with the Stack Overflow team when I visited them in New York in February, but we were too busy discussing other important issues.

For a lot of this post I’ll talk about “askers” and “answerers”. This is a deliberate simplification for the sake of, well, simplicity. Many users are both askers and answerers, and a lot of the time I’ll write comments with a view to being an answerer, but without necessarily ending up writing an answer. Although any given user may take on different roles even in the course of an hour, for a single post each person usually has a single role. There are other roles too, of course – “commenter on someone else’s answer”, for example – but I’m not trying to be exhaustive here.

Differences in goals and expectations

Like most things in life, Stack Overflow works best when everyone has the same goal. We can all take steps towards that goal together. Conversely, when people in a single situation have different goals, that’s when trouble often starts.

On Stack Overflow, the most common disconnect is between these two goals:

  • Asker: minimize the time before I’m unblocked on the problem I’m facing
  • Answerer: maximize the value to the site of any given post, treating the site as a long-lasting resource

In my case, I often have a sub-goal of “try to help improve the diagnostic skill of software engineers so that they’re in a better position to solve their own problems.”

As an example, consider this question – invented, but not far-fetched:

Random keeps giving me the same numbers. Is it broken?

This is a low-quality question, in my view. (I’ll talk more about that later.) I know what the problem is likely to be, but to work towards my goal I want the asker to improve the question – I want to see their code, the results etc. If I’m right about the problem (creating multiple instances of System.Random in quick succession, which will also use the same system-time-based seed), I’d then almost certainly be able to close the question as a duplicate, and it could potentially be deleted. In its current form, it provides no benefit to the site. I don’t want to close the question as a duplicate without seeing that it really is a duplicate though.
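For anyone who hasn’t hit it, the problem I’d suspect looks like this – a sketch relying on the .NET Framework behaviour, where a parameterless new Random() is seeded from the system clock:

```csharp
using System;

// Likely broken: each iteration creates a new Random. On .NET Framework,
// instances created within the same clock tick share a time-based seed,
// so they produce identical sequences.
for (int i = 0; i < 5; i++)
{
    var random = new Random();
    Console.Write(random.Next(1, 7) + " ");
}
Console.WriteLine();

// Fixed: create one instance and reuse it.
var rng = new Random();
for (int i = 0; i < 5; i++)
{
    Console.Write(rng.Next(1, 7) + " ");
}
Console.WriteLine();
```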

Now from the asker’s perspective, none of that is important. If they know that I have an idea what the problem might be, their perspective is probably that I should just tell them so they can be unblocked. Why take another 10 minutes to reproduce the problem in a good question, if I can just give them the answer now? Worse, if they do take the time to do that and then I promptly close their question as a duplicate, it feels like wasted time.

Now if I ignored emotions, I’d argue that the time wasn’t wasted:

  • The asker learned that when they ask a clearer question, they get to their answer more quickly. (Assuming they follow the link to the duplicate and apply it.)
  • The asker learned that it’s worth searching for duplicate questions in their research phase, as that may mean they don’t need to ask the question at all.

But ignoring emotions is a really bad idea, because we’re all human. What may well happen in that situation – even if I’ve been polite throughout – is that the asker will decide that Stack Overflow is full of “traffic cop” moderators who only care about wielding power. I could certainly argue that that’s unfair – perhaps highlighting my actual goals – but that may not change anyone’s mind.

So that’s one problem. How does the Stack Overflow community agree what the goal of the site is, and then make that clearer to users when they ask a question? It’s worth noting that the tour page (which curiously doesn’t seem to be linked from the front page of the site any more) does include this text:

With your help, we’re working together to build a library of detailed answers to every question about programming.

I tend to put it slightly differently:

The goal of Stack Overflow is to create a repository of high-quality questions, and high-quality answers to those questions.

Is that actually a shared vision? If askers were aware of it, would that help? I’d like to hope so, although I doubt that it would completely stop all problems. (I don’t think anything would. The world isn’t a perfect place.)

Let’s move onto another topic where I disagree with some people: low-quality questions.

Yes, there are low-quality questions

I assert that even if it can’t be measured in a totally objective manner, there are high-quality questions and low-quality questions (and lots in between).

I view a high-quality question in the context of Stack Overflow as one which:

  • Asks a question, and is clear in what it’s asking for. It should be reasonably obvious whether any given attempted answer does answer the question. (That’s separate to whether the answer is correct.)
  • Avoids irrelevancies. This can be hard, but I view it as part of due diligence: if you’re encountering a problem as part of writing a web-app, you should at least try to determine whether the context of a web-app is relevant to the problem.
  • Is potentially useful to other people. This is where avoiding irrelevant aspects is important. Lots of people need to parse strings as dates; relatively few will need to parse strings as dates using framework X version Y in conjunction with a client written in COBOL, over a custom and proprietary network protocol.
  • Explains what the asker has already tried or researched, and where they’ve become stuck.
  • Where appropriate (which is often the case) contains a minimal example demonstrating the problem.
  • Is formatted appropriately. No whole-page paragraphs, no code that’s not formatted as code, etc.

There are lots of questions which meet all those requirements, or at least most of them.

I think it’s reasonable to assert that such a question is of higher quality than a question which literally consists of a link to a photo of a homework assignment, and that’s it. Yes, I’ve seen questions like that. They’re not often quite that bad, but if we really can’t agree that that is a low-quality question, I don’t know what we can agree on.

Of course, there’s a huge spectrum in between – but I think it’s important to accept that there are such things as low-quality questions, or at least to debate it and find out where we disagree.

Experience helps write good questions, but isn’t absolutely required

I’ve seen a lot of Meta posts complaining that Stack Overflow is too hard on newcomers, who can’t be expected to write a good question.

I would suggest that a newcomer who accepts the premise of the site and is willing to put in effort is likely to be able to come up with at least a reasonable question. It may take them longer to perform the research and write the question, and the question may well not be as crisp as one written by a more experienced developer in the same situation, but I believe that on the whole, newcomers are capable of writing questions of sufficient quality for Stack Overflow. They may not be aware of what they need to do or why, but that’s a problem with a different solution than just “we should answer awful questions which show no effort because the asker may be new to tech”.

One slightly separate issue is whether people have the diagnostic skills required to write genuinely good questions. This is a topic dear to my heart, and I really wish I had a good solution, but I don’t. I firmly believe that if we can help programmers become better at diagnostics, then that will be of huge benefit to them well beyond asking better Stack Overflow questions.

Some regular users behave like jerks on Stack Overflow, but most don’t

I’m certainly not going to claim that the Stack Overflow community is perfect. I have seen people being rude to people asking bad questions – and I’m not going to excuse that. If you catch me being rude, call me out on it. I don’t believe that requesting improvements to a question is rude in and of itself though. It can be done nicely, or it can be done meanly. I’m all for raising the level of civility on Stack Overflow, but I don’t think that has to be done at the expense of site quality.

I’d also say that I’ve experienced plenty of askers who react very rudely to being asked for more information. It’s far from one-way traffic. I think I’ve probably seen more rudeness in this direction than from answerers, in fact – although the questions usually end up being closed and deleted, so anyone just browsing the site casually is unlikely to see that.

My timebox is rapidly diminishing, so let me get to the most important point. We need to be nicer to each other.

Jon’s Stack Overflow Covenant

I’ve deliberately called this my covenant, because it’s not my place to try to impose it on anyone else. If you think it’s something you could get behind (maybe with modifications), that’s great. If Stack Overflow decides to adopt it somewhere in the site guidelines, they’re very welcome to take it and change it however they see fit.

Essentially, I see many questions as a sort of transaction between askers and answerers. As such, it makes sense to have a kind of contract – but that sounds more like business, so I’d prefer to think of a covenant of good faith.

As an answerer, I will…

  • Not be a jerk.
  • Remember that the person I’m responding to is a human being, with feelings.
  • Assume that the person I’m responding to is acting in good faith and wants to be helped.
  • Be clear that a comment on the quality of a question is not a value judgement on the person asking it.
  • Remember that sometimes, the person I’m responding to may feel they’re being judged, even if I don’t think I’m doing that.
  • Be clear in my comments about how a question can be improved, giving concrete suggestions for positive changes rather than emphasizing the negative aspects of the current state.
  • Be clear in my answers, remembering that not everyone has the same technical context that I do (so some terms may need links etc).
  • Take the time to present my answer well, formatting it as readably as I can.

As an asker, I will…

  • Not be a jerk.
  • Remember that anyone who responds to me is a human being, with feelings.
  • Assume that any person who responds to me is acting in good faith and trying to help me.
  • Remember that I’m asking people to give up their time, for free, to help me with a problem.
  • Respect the time of others by researching my question before asking, narrowing it down as far as I can, and then presenting as much information as I think may be relevant.
  • Take the time to present my question well, formatting it as readably as I can.

I hope that most of the time, I’ve already been following that. Sometimes I suspect I’ve fallen down. Hopefully by writing it out explicitly, and then reading it, I’ll become a better community member.

I think if everyone fully took something like this on board before posting anything on Stack Overflow, we’d be in a better place.

Implementing IXmlSerializable in readonly structs

Background

There are three things you need to know to start with:

Operations on read-only variables which are value types copy the variable value first. I’ve written about this before on this blog. C# 7.2 addresses this by introducing the readonly modifier for structs. See the language proposal for more details. I was touched to see that the proposal references my blog post :)

The ref readonly local variables and in parameter features in C# 7.2 mean that “read-only variables” are likely to be more common in C# than they have been in the past.

Noda Time includes many value types which implement IXmlSerializable. Noda Time implements IXmlSerializable.ReadXml by assigning to this: fundamentally IXmlSerializable assumes a mutable type. I use explicit interface implementation to make this less likely to be used directly on an unboxed variable. With a generic method using an interface constraint you can observe a simple method call mutating a non-read-only variable, but that’s generally harmless.
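That last observation can be sketched like this, with a hypothetical mutable struct (not Noda Time code): the interface-constrained generic method makes a constrained call, so no boxing occurs and the mutation is visible to the caller.

```csharp
using System;

public interface IMutable
{
    void Mutate();
}

// Hypothetical mutable struct implementing the interface.
public struct Counter : IMutable
{
    public int Value;
    public void Mutate() => Value++;
}

public static class Demo
{
    // Constrained call on an unboxed struct: the mutation applies to the
    // caller's variable, not to a boxed copy.
    public static void MutateInPlace<T>(ref T item) where T : IMutable =>
        item.Mutate();
}
```

After `var c = new Counter(); Demo.MutateInPlace(ref c);`, c.Value is 1 – the simple method call has mutated a non-read-only variable.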

Adding the readonly modifier to Noda Time structs

I relished news of the readonly modifier for struct declarations. At last, I can remove my ReadWriteForEfficiency attribute! Hmm. Not so much. To be fair, some of the structs (7 out of 18) were fine. But every struct that implements IXmlSerializable gave me this error:

Cannot assign to ‘this’ because it is read-only

That’s reasonable: in members of a readonly struct, this effectively becomes an in parameter instead of a ref parameter. But how can we fix that? Like any sensible developer, I turned to Stack Overflow, which has precisely the question I’m interested in. It even has an answer! Unfortunately, it amounts to a workaround: assigning to this via unsafe code.

Violating readonly with unsafe code

To give a concrete example of the answer from Stack Overflow, here’s my current LocalTime code:

void IXmlSerializable.ReadXml([NotNull] XmlReader reader)
{
    Preconditions.CheckNotNull(reader, nameof(reader));
    var pattern = LocalTimePattern.ExtendedIso;
    string text = reader.ReadElementContentAsString();
    this = pattern.Parse(text).Value;
}

Here’s the code that compiles when LocalTime is marked as readonly, after enabling unsafe blocks:

unsafe void IXmlSerializable.ReadXml([NotNull] XmlReader reader)
{
    Preconditions.CheckNotNull(reader, nameof(reader));
    var pattern = LocalTimePattern.ExtendedIso;
    string text = reader.ReadElementContentAsString();
    fixed (LocalTime* thisAddr = &this)
    {
        *thisAddr = pattern.Parse(text).Value;
    }
}

Essentially, the unsafe code is bypassing the controls around read-only structs. Just for kicks, let’s apply the same change
throughout Noda Time, and think about what would happen…

As it happens, that fix doesn’t work universally: ZonedDateTime is a “managed type” because it contains a reference (to a DateTimeZone) which means you can’t create a pointer to it. That’s a pity, but if we can make everything else readonly, that’s a good start. Now let’s look at the knock-on effects…

Safe code trying to abuse the unsafe code

Let’s try to abuse our “slightly dodgy” implementations. Here’s a class we’ll put in a “friendly” assembly which is trying to be as helpful as possible:

using NodaTime;

public class AccessToken
{
    private readonly Instant expiry;
    public ref readonly Instant Expiry => ref expiry;

    public AccessToken(Instant expiry) => this.expiry = expiry;
}

Great – it lets you get at the expiry time of an access token without even copying the value.

The big test is: can we break this friendly code’s assumptions about expiry really not changing its value?

Here’s code I expected to mutate the access token:

static void MutateAccessToken(AccessToken accessToken)
{
    ref readonly Instant expiry = ref accessToken.Expiry;
    string xml = "<evil>2100-01-01T00:00:00Z</evil>";
    EvilReadXml(in expiry, xml);
}

static void EvilReadXml<T>(in T value, string xml) where T : IXmlSerializable
{
    var reader = XmlReader.Create(new StringReader(xml));
    reader.Read();
    value.ReadXml(reader);
}

We have an in parameter in EvilReadXml, so the expiry variable is being passed by reference, and then we’re calling ReadXml on that parameter… so doesn’t that mean we’ll modify the parameter, and thus the underlying expiry field in the object?

Nope. Thinking about it, the compiler doesn’t know when it compiles EvilReadXml that T is a readonly struct – it could be a regular struct. So it has to create a copy of the value before calling ReadXml.
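That defensive copy can be demonstrated with an ordinary mutable struct and a toy interface of my own (a sketch, not Noda Time code – it needs C# 7.2 for the in parameter):

```csharp
using System;

interface IIncrementable
{
    void Increment();
}

struct Box : IIncrementable
{
    public int Value;
    public void Increment() => Value++;
}

static class Demo
{
    // value is an in parameter. Interface members are treated as
    // potentially mutating, so the compiler copies value before the
    // call - the caller's variable is untouched.
    public static void IncrementViaIn<T>(in T value)
        where T : IIncrementable
        => value.Increment();
}

class Program
{
    static void Main()
    {
        var box = new Box();
        Demo.IncrementViaIn(in box);
        Console.WriteLine(box.Value); // prints 0: only the defensive copy changed
    }
}
```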

Looking at the spec proposal, there’s one interesting point in the section on ref extension methods:

However any use of an in T parameter will have to be done through an interface member. Since all interface members are considered mutating, any such use would require a copy.

Hooray! That suggests it’s safe – at least for now. I’m still worried though: what if the C# team adds a readonly struct generic constraint in C# 8? Would that allow a small modification of the above code to break? And if interface methods are considered mutating anyway, why doesn’t the language know that when I’m trying to implement the interface?

But hey, now that we know unsafe code is able to work around this in our XML implementation, what’s to stop nasty code from using the same trick directly?

Unsafe code abusing safe code

Imagine we didn’t support XML serialization at all. Could unsafe code mutate AccessToken? It turns out it can, very easily:

static void MutateAccessToken(AccessToken accessToken)
{
    ref readonly Instant expiry = ref accessToken.Expiry;
    unsafe
    {
        fixed (Instant* expiryAddr = &expiry)
        {
            *expiryAddr = Instant.FromUtc(2100, 1, 1, 0, 0);
        }
    }
}

This isn’t too surprising – unsafe code is, after all, unsafe. I readily admit I’m not an expert on .NET security, and I know the
landscape has changed quite a bit over time. These days I believe the idea of a sandbox has been somewhat abandoned – if you care about executing code you don’t really trust, you do it in a container and use that as your security boundary. That’s about the limit of my knowledge though, which could be incorrect anyway.

Where do we go from here?

At this point, I’m stuck making a choice between several unpleasant ones:

  • Leave Noda Time public structs as non-read-only. That prevents users from writing efficient code to use them without copying.
  • Remove XML serialization from Noda Time 3.0. We’re probably going to remove binary serialization anyway on the grounds that Microsoft is somewhat-discouraging it going forward, and it’s generally considered a security problem. However, I don’t think XML serialization is considered to be as bad, so long as you use it carefully, preventing arbitrary type loading. (See this paper for more information.)
  • Implement XML serialization using unsafe code as shown above. Even though my attempt at abusing it failed, that doesn’t provide very much comfort. I don’t know what other ramifications there might be for including unsafe code. Does that limit where the code can be deployed? It also doesn’t help with my concern about a future readonly struct constraint.

Thoughts very welcome. Reassurances from the C# team that “readonly struct” constraints won’t happen particularly welcome… along with alternative ways of implementing XML serialization.

NuGet package statistics

For a while, I’ve been considering how useful nuget.org statistics are.

I know there have been issues in the past around accuracy, but that’s not what I’m thinking about. I’ve been
trying to work out what the numbers mean at all and whether that’s useful.

I’m pretty sure an older version of the nuget.org gallery gave stats on a per-operation basis, but right now it looks like we can break down the downloads by package version, client name and client version. (NodaTime example)

In a way, the lack of NuGet “operation” at least makes it simpler to talk about: we only know about “downloads”. So, what counts as a download?

What’s a download?

Here are a few things that might increment that counter:

  • Manual download from the web page
  • Adding a new package in Visual Studio
  • Adding a new package in Visual Studio Code
  • nuget install from the command line
  • dotnet restore for a project locally
  • dotnet restore in a Continuous Integration system testing a PR
  • dotnet restore in a CI system testing a merged PR

All of them sound plausible, but it’s also possible that they wouldn’t increment the counter:

  • I might have a package in my NuGet cache locally
  • A CI system might have its own global package cache
  • A CI system might use a mirror service somehow

So what does the number really mean? Some set of coincidences in terms of developer behavior and project lifetime? One natural reaction to this is “The precise meaning of the number doesn’t matter, but bigger is better.” I’d suggest that’s overly complacent.

Suppose I’m right that some CI systems have a package cache, but others don’t. Suppose we look at packages X and Y which have download numbers of 1000 and 100,000 respectively. (Let’s ignore
which versions those are for, or how long those versions have been out.) Does that mean Y‘s usage is “better” than X‘s in some way? Not necessarily. Maybe it means there’s a single actively-developed
open source project using Y and a CI system that doesn’t have a NuGet cache (and configured to build each PR on each revision), whereas maybe there are a thousand entirely separate projects using
X, but all using a CI system that just serves up a single version from a cache for everything.

Of course, that’s an extreme position. It’s reasonable to suggest that on average, if package Y has larger download numbers than package X, then it’s likely to be more widely used… but can we
do better?

What are we trying to measure?

Imagine we had perfect information: a view into every machine on the planet, and every operation any of them performed. What number would we want to report? What does it mean for a package to be “popular” or “widely used”?

Maybe we should think in terms of “number of projects that use package X“. Let’s consider some situations:

  • A project created to investigate a problem, then deleted. Never even committed to a source control system.
  • A project which is created and committed to source control, but never used.
  • A project created and in production use, maintained by 1 person.
  • A project created and in production use, maintained by a team of
    100 people.
  • A project created by 1 person, but then forked by 10 people and
    never merged.
  • A project created on github by 1 person, and forked by 10 people on github, with them repeatedly creating branches and merging back into the original repo.
  • A project which doesn’t use package X directly, but uses package Y that depends on package X.

If those all happened for the same package, what number would you want each of those projects to contribute to the package usage?

One first-order approximation could be achieved with “take some hash of the name of the project and propagate it (even past caches) when installing a package”. That would allow us to be reasonably confident in some measure of “how many differently-named projects depend on package X” which might at least feel slightly more reasonable, although it’s unclear to me how throwaway projects would end up being represented. (Do people tend to use the same names as each other for throwaway projects? I bet Console1 and WindowsForms1 would be pretty popular…)
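As a sketch, the hashing part of that thought experiment might look like this (my own illustration – nothing like this exists in the NuGet client):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class ProjectNameHasher
{
    // The client would send this opaque token instead of the project
    // name; identical names (all those Console1 projects) collapse
    // into a single bucket, for better or worse.
    public static string HashProjectName(string projectName)
    {
        using (var sha256 = SHA256.Create())
        {
            byte[] hash = sha256.ComputeHash(Encoding.UTF8.GetBytes(projectName));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}
```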

That isn’t a serious suggestion, by the way – it’s not clear to me that hashing alone provides sufficient privacy protection, for a start. There are multiple further issues in terms of cache-busting, too. It’s an interesting thought experiment.

What do I actually care about though?

That’s even assuming that “number of projects that use package X” is a useful measure. It’s not clear to me that it is.

As an open source contributor, there are two aspects I care about:

  • How many people will I upset, and how badly, if I break something?
  • How many people will I delight, and to what extent, if I implement a particular new feature?

It’s not clear to me that any number is going to answer those questions for me.

So what do you care about? What would you want nuget.org to show if it could? What do you think would be reasonable for it to show in the real world with real world constraints?

Diagnosing a Linux-only unit test failure

This is an odd one. I’m currently working on Cloud Firestore support for C#, and I’ve introduced a GeoPoint struct to represent a latitude/longitude pair, each being represented by a double. It implements IEquatable and overloads == and != in the obvious way. So far, so good, and I have working tests which have passed on CI.

I’ve added some more types that implement equality, so I thought I’d extract out common code into an EqualityTester class that can handle both the Equals methods and the operators (the latter via reflection).

All went well on my local machine, but on CI one environment failed. We use both AppVeyor and Travis in CI, and only Travis was failing. There were two failures – one testing the equality operators, and one as part of a broader serialization/deserialization test. Odd. Time to go through the normal diagnostic process…

Step 1: Check there’s really a problem

The nature of the failing tests doesn’t feel like an intermittent problem, but it’s always good to get more of an idea. I forced Travis to rebuild the pull request a couple of times: yes, it failed the same two tests each time.

Step 2: Reproduce locally

Tests that only fail in CI are really annoying. I’ve had this a couple of times, and it’s always taken hours to sort out, because the “try something, get results, think about what they mean” cycle is so slow.

Fortunately, I have a Linux box at home – so I fetched the pull request and ran the tests there – and they failed. Hooray! It’s not quite as nice as being able to reproduce it on my Windows laptop, but it’s much better than being isolated to CI.

Step 3: Remove irrelevant test code

There are two parts to the code that’s failing: the unit tests, and the production code. Ideally, I’d like to get rid of both, to create a truly minimal example. I usually try to take out one project at a time, and I could have gone either way – but this time I decided to remove the testing first.

So in the same repo, I created a new netcoreapp1.0 console application, with a project reference to the production code. I moved the operator part of my equality tester code into Program.cs, simplified it slightly, introduced my own Assert class with True and False methods that printed out the result, and I was ready to run:

using System;
using System.Reflection;
using Google.Cloud.Firestore.Data;

class Program
{
   static void Main(string[] args)
   {
       var gp1 = new GeoPoint(1.5, 2.5);
       var gp2 = new GeoPoint(1.5, 2.5);
       AssertEqualityOperators(gp1, gp2);
   }

   internal static void AssertEqualityOperators<T>(T control, T equal)
   {
       var typeInfo = typeof(T).GetTypeInfo();
       var equalityMethodInfo = typeInfo.GetMethod(
           "op_Equality", new[] { typeof(T), typeof(T) });
       Func<T, T, bool> equality = 
           (lhs, rhs) => (bool) equalityMethodInfo.Invoke(null, new object[] { lhs, rhs });

       Assert.True(equality(control, equal));
   }
}

class Assert
{
   public static void True(bool actual)
       => Console.WriteLine($"Expected True; was {actual}");

   public static void False(bool actual)
       => Console.WriteLine($"Expected False; was {actual}");
}

Sure enough, the output shows the problem:

Expected True; was False

Great. Now let’s remove the other dependency…

Step 4: Remove production code dependency

Rather than copy the whole of GeoPoint into my test project, I just created a small Test struct which had similar equality code.

Dead-end: a single double value

I made a mistake here – I changed too much in one go. I reduced my Test struct to just one double field:

using System;
using System.Reflection;

class Program
{
   static void Main(string[] args)
   {
       Test t1 = new Test(1.5);
       Test t2 = new Test(1.5);

       var equalityMethodInfo = typeof(Test)
           .GetMethod("op_Equality", new[] { typeof(Test), typeof(Test) });
       Func<Test, Test, bool> equality =
           (lhs, rhs) => (bool) equalityMethodInfo.Invoke(null, new object[] { lhs, rhs });
       Console.WriteLine(t1 == t2);
       Console.WriteLine(equality(t1, t2));
       Console.WriteLine(equalityMethodInfo.Invoke(null, new object[] { t1, t2 }));
   }
}

struct Test : IEquatable<Test>
{
    private readonly double value;

    public Test(double value)
    {
        this.value = value;
    }

    public static bool operator==(Test lhs, Test rhs) =>
        lhs.value == rhs.value;

    public static bool operator!=(Test lhs, Test rhs) => !(lhs == rhs);

    public override bool Equals(object obj) =>
        obj is Test t && Equals(t);

    public bool Equals(Test other) => this == other;

    public override int GetHashCode() => value.GetHashCode();
}

That prints “True” three times – which is what we’d like it to do eventually, but it means it’s not reproducing the problem we want it to.

Back to reproducing it: two double values

Taking one step back towards the previous code, I went back to two double values instead of just one, expanding the Test struct like this:

struct Test : IEquatable<Test>
{
    private readonly double x;
    private readonly double y;

    public Test(double x, double y)
    {
        this.x = x;
        this.y = y;
    }

    public static bool operator==(Test lhs, Test rhs) =>
        lhs.x == rhs.x && lhs.y == rhs.y;

    public static bool operator!=(Test lhs, Test rhs) => !(lhs == rhs);

    public override bool Equals(object obj) =>
        obj is Test t && Equals(t);

    public bool Equals(Test other) => this == other;

    public override int GetHashCode() => x.GetHashCode() + y.GetHashCode();
}

Change both constructor calls to Test(1.5, 2.5) and bingo: it prints True, False, False.

Step 5: Explore the scope of the problem

I tried the exact same code on Windows, and (unsurprisingly, given that my earlier unit tests passed) it didn’t reproduce the problem. But sticking with Linux, we can still try multiple frameworks and multiple build configurations.

I wouldn’t like to say what prompted me to try different frameworks – intuition can be hard to explain, unfortunately. But let’s change the project file a bit:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>netcoreapp1.0;netcoreapp1.1;netcoreapp2.0</TargetFrameworks>
  </PropertyGroup>
</Project>

And then run it in several different configurations (remember that three “True” lines means “everything is working”, and “True”, “False”, “False” means “this is the problem we’re investigating”):

$ dotnet run -c Debug -f netcoreapp1.0
True
False
False
$ dotnet run -c Release -f netcoreapp1.0
True
False
False
$ dotnet run -c Debug -f netcoreapp1.1
True
True
True
$ dotnet run -c Release -f netcoreapp1.1
True
True
True
$ dotnet run -c Debug -f netcoreapp2.0
True
True
True
$ dotnet run -c Release -f netcoreapp2.0
True
True
True

So, it fails under .NET Core 1.0, but works under .NET Core 1.1 and .NET Core 2.0. That certainly makes it sound like a .NET Core 1.0 bug, but that doesn’t mean I want to give up at this point. It’s entirely possible that I have users who are using .NET Core 1.0, so I want my code to work under it.

Now, I’ve only given major/minor version numbers – and that’s partly because I don’t have a very good handle on the different parts of .NET Core on Linux and how to find the version numbers. The 1.0 packages I’ve got installed are listed as:

  • dotnet-dev-1.0.4/xenial,now 1.0.4-1 amd64 [installed]
  • dotnet-hostfxr-1.0.1/xenial,now 1.0.1-1 amd64 [installed]
  • dotnet-sharedframework-microsoft.netcore.app-1.0.5/xenial,now 1.0.5-1 amd64 [installed,automatic]

No dotnet-runtime package, which I believe is part of 2.0 instead.

Step 6: Look into the failed comparison

This part is really simple – just add more diagnostics to the == method:

public static bool operator==(Test lhs, Test rhs)
{
    Console.WriteLine($"lhs=({lhs.x}, {lhs.y}); rhs=({rhs.x}, {rhs.y})");
    Console.WriteLine($"{lhs.x == rhs.x}, {lhs.y == rhs.y}");
    return lhs.x == rhs.x && lhs.y == rhs.y;
}

Now things get really interesting in the output:

lhs=(1.5, 2.5); rhs=(1.5, 2.5)
True, True
True
lhs=(1.5, 1.85492638478664E-316); rhs=(1.5, 6.95251497332956E-310)
True, False
False
lhs=(1.5, 6.9525149733517E-310); rhs=(1.5, 6.9525149733351E-310)
True, False
False

The first three lines are as expected – that’s the working comparison. But each of the three-line segments for the second and third comparisons shows the first field (x) being fine, but the second being some tiny double value – and an inconsistent one at that.

Step 7: Try (and fail) to simplify

Maybe boxing is to blame? How about we try a really simple version of boxing – just override ToString(), box a value and see what comes up. The change in Test is simply:

public override string ToString() => $"{x}, {y}";

And then to test it:

object x = new Test(1.5, 2.5);
Console.WriteLine(x);

This prints 1.5, 2.5 – so it appears to be boxed correctly.

Step 8: Try simplifying a bit less

In our failing program, there are two potential complications at play: we’re boxing the value, and we’re calling a method via reflection. If just boxing isn’t a problem, maybe reflection is?

I decided to try two things at once: an instance method and a static method accepting a Test parameter (like the operator). Now we’re not doing any comparisons, we can get rid of a lot of code. Here’s the new complete code:

using System;
using System.Reflection;

class Program
{
   static void Main(string[] args)
   {
       Test t = new Test(1.5, 2.5);
       var foo = typeof(Test).GetMethod("Foo");
       foo.Invoke(t, null);
       var bar = typeof(Test).GetMethod("Bar");
       bar.Invoke(null, new object[] { t });
   }
}

struct Test
{
    private readonly double x;
    private readonly double y;

    public Test(double x, double y)
    {
        this.x = x;
        this.y = y;
    }

    public void Foo() => Console.WriteLine($"{x}, {y}");
    public static void Bar(Test t) => Console.WriteLine($"{t.x}, {t.y}");
}

And now the results:

1.5, 2.5
1.5, 6.95260190884551E-310

Interesting! At this point, there are lots of things that I could check:

  • Does it affect all parameters?
  • Does it happen for other field types? Other primitives? References?
  • Can I provoke it in any other circumstances other than via
    reflection?

Most of these can probably be answered by finding an existing issue that was fixed in netcore1.1… which is reasonably feasible given the Open Source nature of .NET Core.

That can all wait though (it’s a Sunday evening for me, after all…) – let’s see if we can fix the problem first.

Step 9: Fix it with a stroke of luck

Given that this feels like a memory management issue, let’s try something I’ve basically not used before: explicit struct layout.

A struct containing two double fields should be pretty easy to lay out: just fields at offsets 0 and 8, right? Admittedly it also sounds like it should be easy for the CLR to get right itself… but at least laying it out explicitly is low-risk.

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Explicit)]
struct Test
{
    [FieldOffset(0)]
    private readonly double x;
    [FieldOffset(8)]
    private readonly double y;

    ...
}

Normally, things don’t get fixed on first attempt, but in this case… it just worked!

From there, I backtracked: from the simplification to the standalone comparison version, then to fixing GeoPoint (which involved no longer using automatically implemented properties, as I want to apply the attribute to the field, not the property – boo!), checking it worked locally, and finally adding a commit to the PR and seeing it pass on Travis. Yay!
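The shape of the GeoPoint fix was along these lines (a sketch only – the real type has equality members and more besides):

```csharp
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Explicit)]
public struct GeoPoint
{
    // FieldOffset must be applied to fields, which is why the
    // automatically implemented properties had to go.
    [FieldOffset(0)]
    private readonly double latitude;
    [FieldOffset(8)]
    private readonly double longitude;

    public GeoPoint(double latitude, double longitude)
    {
        this.latitude = latitude;
        this.longitude = longitude;
    }

    public double Latitude => latitude;
    public double Longitude => longitude;
}
```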

Next steps and points to note

My next step is definitely to try to find some history of this issue. It’s a surprisingly big problem to affect the GA .NET Core 1.0 runtime, so I expect there’s some record of it somewhere. I want to understand better what was wrong, whether my fix is appropriate, and whether I need to apply it elsewhere.

Update: this is still an issue in the 1.0.7 runtime, and apparently even in the 1.0.8-servicing daily builds, so I’ve filed an issue and look forward to seeing what happens there.

Some interesting aspects:

  • If you initially see a problem in CI, your top priority is to reproduce it locally.
  • As ever, remove as much other code as you can – your production code, your test code, all dependencies.
  • It’s handy to have multiple execution environments available. If I hadn’t had a local Linux box, this would have been a nightmare.
  • Try different configurations and different runtimes/frameworks.
  • Sometimes the bug really is in the runtime or BCL. It’s pretty rare, but it does happen.

Diagnosing a VS-only build problem

I do most of my work in the google-cloud-dotnet github repo (That’s the repo for the Google Cloud Client Libraries for .NET, just to get a quick bit of marketing in there.) We try to keep our build and test dependencies up to date, so I recently updated to the latest versions of Microsoft.NET.Test.Sdk and xUnit.

Leaving aside problems identified by the xUnit analyzer which is now bundled in the xUnit package (some of which were useful, and some of which weren’t), this change caused another problem: while building from the command line worked fine, building some of the many solutions from Visual Studio (15.3.3) generated the following error:

Error MSB4018 The “GenerateBindingRedirects” task failed unexpectedly.
System.IO.PathTooLongException: The specified path, file name, or both are too long. The fully qualified file name must be less than 260 characters, and the directory name must be less than 248 characters.
at System.IO.PathHelper.GetFullPathName()
at System.IO.Path.LegacyNormalizePath(…)
at System.IO.Path.NormalizePath(…)
at System.IO.Path.NormalizePath(…)

(more stack trace here)

The output window shows a file related to the error:

C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\MSBuild\15.0\Bin\Microsoft.Common.CurrentVersion.targets(2099,5)

So, what’s going on, and how do we fix it?

Step 1: Check that this change is responsible

Git makes life really easy here. For peace of mind – particularly for this kind of issue – I tend to close Visual Studio when switching branches, so:

  • Find a solution that fails (e.g. Google.Cloud.Diagnostics.AspNetCore)
  • Close VS
  • git checkout master
  • Rebuild solution in VS – looks fine
  • Close VS
  • git checkout branch with change
  • Rebuild in VS – observe error
  • git checkout master
  • Rebuild solution in VS – looks fine
  • Close VS
  • git checkout branch with change
  • Rebuild in VS – observe error

Yes, I did everything twice, just to make sure it wasn’t an anomaly.

So yes, it’s definitely a problem.

Step 2: Moan on Twitter

This isn’t the most scientific diagnostic tool in the world, but posting on Twitter about the problem did at least reassure me that I wasn’t alone.

Step 3: Narrow down the cause

Even though I observed the problem in Google.Cloud.Diagnostics.AspNetCore, the full error message in VS referred to Google.Cloud.Diagnostics.Common.IntegrationTests. That’s part of the Google.Cloud.Diagnostics.Common solution – the AspNetCore projects depend on the Common projects.

Try step 1 again with Google.Cloud.Diagnostics.Common (just one pass and one failure this time) – yes, it’s still a problem there. That’s useful.

Step 4: Try a workaround

All my source is under c:\Users\Jon\Test\Projects. The “Test” part is a bit of an anomaly, and next time I get a new machine, I’ll probably work without it, but there’s no need to change now. The Projects directory is actually a junction (symlink) to the c:\Users\Jon\Documents\Visual Studio 2015\Projects directory. That’s quite a long path to start with… let’s see if getting rid of the symlink helps.

  • Delete symlink
  • Move Projects file from under Documents to directly under Test
  • Try to build: same failure

Okay, it looks like we’ll have to be a bit more subtle.

Step 5: Play spot the difference

Given that we have code which builds on the command line in both the working and failing situations, we can reasonably easily see any differences in generated files.

  • On the command line, go into the Google.Cloud.Diagnostics.Common.IntegrationTests directory
  • Delete bin and obj
  • git checkout master
  • Run dotnet build
  • Copy the resulting bin and obj directories to a “working” directory
  • git checkout branch with change
  • Delete bin and obj
  • Run dotnet build
  • Copy the resulting bin and obj directories to a “broken” directory
  • Run kdiff3 against the working and broken directories

There are lots of differences between the directories, as I’d expect, but given that this is about binding redirects, it’s reasonable to use experience and start scanning for filenames ending with .config.

Sure enough, in the “broken” directory, under obj\net452, there was a file called Google.Cloud.Diagnostics.Common.IntegrationTests.csproj.Google.Cloud.Diagnostics.Common.IntegrationTests.dll.config. That’s not a copy and paste error – it really is a 115-character-long filename, even leaving out any directory parts.

In the file system, the full path is: c:\users\jon\Test\Projects/google-cloud-dotnet\apis\Google.Cloud.Diagnostics.Common\Google.Cloud.Diagnostics.Common.IntegrationTests\obj\Debug\net452\Google.Cloud.Diagnostics.Common.IntegrationTests.csproj.Google.Cloud.Diagnostics.Common.IntegrationTests.dll.config – that’s 266 characters.

Step 6: Try being slightly cheeky: very temporary workaround

Okay, so moving away from the “Documents\Visual Studio 2015” directory didn’t help much, but given that we’re just on the limit, let’s try just renaming “google-cloud-dotnet” to “short” (very temporarily).

Try opening it in Visual Studio – wahoo, it works :) The .config file is generated by Visual Studio correctly.

Step 7: Work out who to blame

So, where’s the bug?

  • It’s a shame that Visual Studio 2017 doesn’t support long filenames, even though the dotnet CLI does
  • Surely we don’t need such a long filename anyway
  • Do I need the project name to be so long?

Looking back to the very first description, let’s have a look at the msbuild file that’s mentioned: C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\MSBuild\15.0\Bin\Microsoft.Common.CurrentVersion.targets

That has:

<GenerateBindingRedirects
  AppConfigFile="@(AppConfigWithTargetPath)"
  TargetName="$(TargetFileName).config"
  OutputAppConfigFile="$(_GenerateBindingRedirectsIntermediateAppConfig)"
  SuggestedRedirects="@(SuggestedBindingRedirects)"
>

So it looks like the problem is _GenerateBindingRedirectsIntermediateAppConfig, which is defined in a property group elsewhere as:

<_GenerateBindingRedirectsIntermediateAppConfig>$(IntermediateOutputPath)$(MSBuildProjectFile).$(TargetFileName).config</_GenerateBindingRedirectsIntermediateAppConfig>

That certainly looks like it’s responsible for the choice of file.

A quick search for _GenerateBindingRedirectsIntermediateAppConfig shows that I’m not the first person to have run into this – PathTooLongException with _GenerateBindingRedirectsIntermediateAppConfig describes exactly the pain I’ve been going through, and even helpfully references MSBuild should handle paths longer than MAX_PATH.

I’ll add my little bit of feedback to the former issue as soon as I’ve published this post (so I can link to it).

Step 8: Decide what to do

My options appear to be:

  1. Change the project name or the directory structure
  2. Ask everyone on the team to work from a really short root directory name
  3. Downgrade my dependencies again
  4. Live with Visual Studio not building a few solutions properly – we can still work in Visual Studio, so long as we don’t need a full build in it.

I went with option 4, and published this blog post.

Step 9: Embrace community feedback

Those were the only options I considered at the time of original writing, partly as I was somewhat scared of this suggestion, being inexperienced in msbuild. However, combining that with a Directory.Build.targets file (which I didn’t know about before being tipped off by Nick Guerrera), it was very easy to implement a repo-wide workaround.
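The repo-wide workaround amounts to redefining the over-long property just before the task runs – something like this in a Directory.Build.targets at the repository root (a sketch of the approach, not the exact file from the repo):

```xml
<Project>
  <!-- Redefine the intermediate app.config path so it no longer
       embeds the project name twice; this happens just before the
       GenerateBindingRedirects task consumes the property. -->
  <Target Name="ShortenGenerateBindingRedirectsPath"
          BeforeTargets="GenerateBindingRedirects">
    <PropertyGroup>
      <_GenerateBindingRedirectsIntermediateAppConfig>$(IntermediateOutputPath)$(TargetFileName).config</_GenerateBindingRedirectsIntermediateAppConfig>
    </PropertyGroup>
  </Target>
</Project>
```

Dropping the $(MSBuildProjectFile) prefix shortens the generated filename considerably, which is enough to get back under the 260-character limit in this case.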

My friend Kirill Osenkov also pointed me at his MSBuild Log Viewer, noting that if it failed in Visual Studio, it would probably fail from msbuild on the command line too. He was right, and the log viewer pinpointed what it was trying to write, which would have saved time.

Points to note

  • Situations where you have a working case and a broken case are great. They can help you validate your problem (does it go away if I revert to the previous commit?) and find the cause (what’s the difference in output between working and broken?)
  • Temporary approaches to double-check the diagnosis (such as me renaming my source directory to “short”) can be useful – don’t forget to undo them though!
  • If you’ve run into a problem, someone else probably has too
  • The .NET community can be really amazing – I was really impressed by the response here
  • Tools are often available – in this case, msbuildlog.com would have saved me quite a bit of time.