
Surprise! Creating an instance of an open generic type

This is a brief post documenting a very weird thing I partly came up with on Stack Overflow today.

The context is this question. But to skip to the shock, we end up with code like this:

object x = GetWeirdValue();
// This line prints True. Be afraid - be very afraid!
Console.WriteLine(x.GetType().GetTypeInfo().IsGenericTypeDefinition);

That just shouldn’t happen. You shouldn’t be able to create an instance of an open type – a type that still contains generic type parameters. What does a List<T> (rather than a List<string> or List<int>) mean? It’s like creating an instance of an abstract class.

Before today, I’d have expected it to be impossible – the CLR should just not allow such an object to exist. I now know one – and only one – way to do it. While you can’t get normal field values for an open generic type, you can get constants… after all, they’re constant values, right? That’s fine for most constants, because those can’t be generic types – int, string etc. The only type of constant with a user-defined type is an enum. Enums themselves aren’t generic, of course… but what if it’s nested inside another generic type, like this:

class Generic<T>
{
    enum GenericEnum
    {
        Foo = 0
    }
}

Now Generic<>.GenericEnum is an open type, because it’s nested in an open type. Using Enum.GetValues(typeof(Generic<>.GenericEnum)) fails in the expected way: the CLR complains that it can’t create instances of the open type. But if you use reflection to get at the constant field representing Foo, the CLR magically converts the underlying integer (which is what’s in the IL of course) into an instance of the open type.

Here’s the complete code:

using System;
using System.Reflection;

class Program
{
    static void Main(string[] args)
    {
        object x = GetWeirdValue();
        // This line prints True
        Console.WriteLine(x.GetType().GetTypeInfo().IsGenericTypeDefinition);
    }

    static object GetWeirdValue() =>
        typeof(Generic<>.GenericEnum).GetTypeInfo()
            .GetDeclaredField("Foo")
            .GetValue(null);

    class Generic<T>
    {
        public enum GenericEnum
        {
            Foo = 0
        }
    }
}

… and the corresponding project file, to prove it works for both the desktop and .NET Core…

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>netcoreapp1.0;net45</TargetFrameworks>
  </PropertyGroup>

</Project>

Use this at your peril. I expect that many bits of code dealing with reflection would be surprised if they were provided with a value like this…

It turns out I’m not the first one to spot this. (That would be pretty unlikely, admittedly.) Kirill Osenkov blogged two other ways of doing this, discovered by Vladimir Reshetnikov, back in 2014.

Tracking down a performance hit

I’ve been following the progress of .NET Core with a lot of interest, and trying to make the Noda Time master branch keep up with it. The aim is that when Noda Time 2.0 eventually ships (apologies for the delays…) it will be compatible with .NET Core from the start. (I’d expected to be able to support netstandard1.0, but that appears to have too much missing from it. It looks like netstandard1.3 will be the actual target.)

I’ve been particularly looking forward to being able to run the Noda Time benchmarks (now using BenchmarkDotNet) to compare .NET Core on Linux with the same code on Windows. In order to make that a fair comparison, I now have two Intel NUCs, both sporting an i5-5250U and 8GB of memory.

As it happens, I haven’t got as far as running the benchmarks under .NET Core – but I am now able to run all the unit tests on both Linux and Windows, using both the net451 TFM and netcoreapp1.0.

When I did that recently, I was pretty shocked to see that (depending on which tests I ran) the tests were 6-10 times slower on Linux than on Windows, using netcoreapp1.0 in both cases. This post is a brief log of what I did to track down the problem.

Step 1: Check that there’s really a problem

Thought: Is this actually just a matter of not running the tests in a release configuration, or something similar?

Verification: I ran the tests several times, specifying -c Release on the command line to use the release build of both NodaTime.Test.dll and NodaTime.dll. Running under a debugger definitely wasn’t an issue, as this was all just done from the shell.

Additionally, I ran the tests in two ways – firstly, running the whole test suite, and secondly running with --where=cat!=Slow to avoid the few tests I’ve got which are known to be really pretty slow. They’re typically tests which compare the answers the BCL gives with the answers Noda Time gives, across the whole of history for a particular calendar system or time zone. I’m pleased to report that the bottleneck in these tests is almost always the BCL, but that doesn’t help to speed them up. If only the “slow” tests had been much slower on Linux, that might have pointed to the problems being in BCL calendar or time zone code.

The ratios vary, but there was enough of a problem under both circumstances for it to be worth looking further.

Step 2: Find a problematic test

I didn’t have very strong expectations one way or another about whether this would come down to some general problem in the JIT on Linux, or whether there might be one piece of code causing problems in some tests but not others. Knowing that there are significant differences in handling of some culture and time zone code between the Linux and Windows implementations, I wanted to find a test which used the BCL as little as possible – but which was also slow enough for the differences in timing to be pronounced and not easily explicable by the problems of measuring small amounts of time.

Fortunately, NUnit produces a TestResult.xml file which is easy to parse with LINQ to XML, so I could easily transform the results from Windows and Linux into a list of tests, ordered by duration (descending), and spot the right kind of test.
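As an illustration of the kind of query involved (a sketch rather than the script I actually used, and assuming the NUnit 3 results format where each test-case element has name and duration attributes):

using System;
using System.Linq;
using System.Xml.Linq;

class SlowestTests
{
    static void Main(string[] args)
    {
        // args[0] is the path to a TestResult.xml file.
        var doc = XDocument.Load(args[0]);
        var testsByDuration = doc.Descendants("test-case")
            .Select(tc => new
            {
                Name = (string) tc.Attribute("name"),
                Duration = (double?) tc.Attribute("duration") ?? 0d
            })
            .OrderByDescending(tc => tc.Duration);
        // Print the slowest 20 tests, longest first.
        foreach (var test in testsByDuration.Take(20))
        {
            Console.WriteLine($"{test.Duration:0.000}s  {test.Name}");
        }
    }
}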

I found my answer in UmAlQuraYearMonthDayCalculatorTest.GetYearMonthDay_DaysSinceEpoch, which effectively tests the Um Al Qura calendar for self-consistency, by iterating over every day in the supported time period and checking that we can convert from “days since Unix epoch” to an expected “year, month, day”. In particular, this test doesn’t rely on the Windows implementation of the calendar, nor does it use any time zones, cultures or anything similar. It’s nicely self-contained.

This test took 2051ms on Linux and 295ms on Windows. It’s possible that those figures were from a debug build, but I repeated the tests using a release build and confirmed that the difference was still similar.

Step 3: Find the bottleneck

At this point, my aim was to try to remove bits of the test at a time, until the difference went away. I expected to find something quite obscure causing the difference – something like different CPU cache behaviour. I knew that the next step would be to isolate the problem to a small piece of code, but I expected that it would involve a reasonable chunk of Noda Time – at least a few types.

I was really lucky here – the first and most obvious call to remove made a big difference: the equality assertion. Assertions are usually the first thing to remove in tests, because everything else typically builds something that you use in the assertions… if you’re making a call without either using the result later or asserting something about the result, presumably you’re only interested in side effects.

As soon as I removed the call to Assert.AreEqual(expected, actual), the execution time dropped massively on Linux, but hardly moved on Windows: they were effectively on a par.

I wondered whether the problem was with the fact that I was asserting equality between custom structs, and so tried replacing the real assertions with assertions of equality of strings, then of integers. No significant difference – they all showed the same discrepancy between Windows and Linux.

Step 4: Remove Noda Time

Once I’d identified the assertions as the cause of the problem, it was trivial to start a new test project with no dependency on Noda Time, consisting of a test like this:

[Test]
public void Foo()
{
    for (int i = 0; i < 1000000; i++)
    {
        var x = 10;
        var y = 10;
        Assert.AreEqual(x, y);
    }
}

This still demonstrated the problem consistently, and allowed simpler experimentation with different assertions.

Step 5: Dig into NUnit

For once in my life, I was glad that a lot of implementation details of a framework were exposed publicly. I was able to try lots of different “bits” of asserting equality, in order to pin down the problem. Things I tried:

  • Assert.AreEqual(x, y): slow
  • Assert.That(x, Is.EqualTo(y)): slow
  • Constructing an NUnitEqualityComparer: fast
  • Calling NUnitEqualityComparer.AreEqual: fast. (Here the construction occurred before the loop, and the comparisons were in the loop.)
  • Calling Is.EqualTo(y): slow

The last two bullets were surprising. I’d been tipped off that NUnitEqualityComparer uses reflection, which could easily differ in performance between Windows and Linux… but checking for equality seemed to be fast, and just constructing the constraint was slow. In poking around the NUnit source code (thank goodness for Open Source!) it’s obvious why Assert.AreEqual(x, y) and Assert.That(y, Is.EqualTo(x)) behave the same way – the former just calls the latter.

So, why is Is.EqualTo(y) slow (on Linux)? The method itself is simple – it just creates an instance of EqualConstraint. The EqualConstraint constructor body doesn’t do much… so I proved that it’s not EqualConstraint causing the problem by deriving my own constraint with a no-op implementation of ApplyTo… sure enough, just constructing that is slow.

That leaves the constructor of the Constraint abstract base class:

protected Constraint(params object[] args)
{
    Arguments = args;

    DisplayName = this.GetType().Name;
    if (DisplayName.EndsWith("`1") || DisplayName.EndsWith("`2"))
        DisplayName = DisplayName.Substring(0, DisplayName.Length - 2);
    if (DisplayName.EndsWith("Constraint"))
        DisplayName = DisplayName.Substring(0, DisplayName.Length - 10);
}

That looks innocuous enough… but maybe calling GetType().Name is expensive on Linux. So test that… nope, it’s fast.

At this point I’m beginning to wonder whether we’ll ever get to the bottom of it, but let’s just try…

private const int Iterations = 1000000;

[Test]
public void EndsWith()
{
    string text = "abcdefg";
    for (int i = 0; i < Iterations; i++)
    {
        text.EndsWith("123");
    }
}

… and sure enough, it’s fast on Windows and slow on Linux. Wow. Looks like we have a culprit.

Step 6: Remove NUnit

At this point, it’s relatively plain sailing. We can reproduce the issue in a simple console app. I won’t list the code here, but it’s in the GitHub issue. It just times calling EndsWith once (to get it JIT compiled) and then a million times. Is it the most rigorous benchmark in the world? Absolutely not… but when the difference is between 5.3s on Linux and 0.16s on Windows, on the same hardware, I’m not worried about inaccuracy of a few milliseconds here or there.
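The shape of that console app is roughly this – a minimal sketch rather than the exact code in the issue:

using System;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        string text = "abcdefg";
        // Call EndsWith once first, so that JIT compilation isn't included in the timing.
        text.EndsWith("123");

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < 1000000; i++)
        {
            text.EndsWith("123");
        }
        stopwatch.Stop();
        Console.WriteLine(stopwatch.Elapsed);
    }
}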

Step 7: File a CoreCLR issue

So, as I’ve shown, I filed a bug on GitHub. I’d like to think it was a pretty good bug report:

  • Details of the environment
  • Short but complete console app ready to copy/paste/compile/run
  • Results

Exactly the kind of thing I’d have put into a Stack Overflow question – when I ask for a minimal, complete example on Stack Overflow, this is what I mean.

Anyway, about 20 minutes later (!!!), Stephen Toub had basically worked out the nub of it: it’s a culture issue. Initially, he couldn’t reproduce it – he saw the same results on Windows and Linux. But after changing his culture to en-GB, he saw what I was seeing. I then confirmed the opposite – when I ran the code having set LANG=en-US, the problem went away for me. Stephen pulled Matt Ellis in, who gave more details as to what was going wrong behind the scenes.

Step 8: File an NUnit issue

Matt Ellis suggested filing an issue against NUnit, as there’s no reason this code should be culture-sensitive. By specifying the string comparison as Ordinal, we can go through an even faster path than using the US culture. So

if (DisplayName.EndsWith("Constraint"))

becomes

if (DisplayName.EndsWith("Constraint", StringComparison.Ordinal))

… and the equivalent for the other two calls.

I pointed out in the issue that it was also a little bit odd that this was being worked out in every Constraint constructor call, when of course it’s going to give the same result for every instance of the same type. When “every Constraint constructor call” becomes “every assertion in an entire test run”, it’s a pretty performance-critical piece of code. While unit tests aren’t important in terms of performance in the same way that production code is, anything which adds friction is bad news.
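To show the sort of thing I mean – this is just a sketch of the idea, not a proposed patch – the display name could be computed once per constraint type and then cached:

using System;
using System.Collections.Concurrent;

static class ConstraintNameCache
{
    private static readonly ConcurrentDictionary<Type, string> cache =
        new ConcurrentDictionary<Type, string>();

    // Computes the "display name" for a constraint type once, then reuses it
    // for every subsequent instance of the same type.
    public static string GetDisplayName(Type type) =>
        cache.GetOrAdd(type, t =>
        {
            string name = t.Name;
            if (name.EndsWith("`1", StringComparison.Ordinal) ||
                name.EndsWith("`2", StringComparison.Ordinal))
            {
                name = name.Substring(0, name.Length - 2);
            }
            if (name.EndsWith("Constraint", StringComparison.Ordinal))
            {
                name = name.Substring(0, name.Length - 10);
            }
            return name;
        });
}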

Hopefully the NUnit team will apply the simple improvement for the next release, and then the CoreCLR team can attack the tougher underlying problem over time.

Step 9: Blog about it

Open up Stack Edit, start typing: “I’ve been following the progress”… :)

Conclusion

None of the steps I’ve listed here is particularly tricky. Diagnosing problems is often more a matter of determination and being unwilling to admit defeat than cleverness. (I’m not denying that there’s a certain art to being able to find the right seam to split the problem in two, admittedly.)

I hope this has been useful as a “start to finish” example of what a diagnostic session can look and feel like. It wasn’t one physical session, of course – I found bits of time to investigate it over the course of a day or so – but it would have been the same steps either way.

Smug, satisfied smile…

Versioning conundrum for Noda Time – help requested

Obviously I’d normally ask developer questions on Stack Overflow but in this case, it feels like the answers may be at least somewhat opinion-based. If it turns out that it’s sufficiently straightforward that a Stack Overflow question and answer would be useful, I can always repost it there later.

The Facts

Noda Time 1.x exists “in production”, and the latest version is 1.3.1. This targets .NET 3.5 Client Profile, .NET 4.0, and PCL Profile 328 (in a directory of lib\portable-net4+sl5+netcore45+wpa81+wp8+MonoAndroid1+MonoTouch1+XamariniOS1).

Noda Time currently includes the IANA time zone data (“TZDB”) – each released version of Noda Time contains the TZDB version that was “most recent” at the time that the Noda Time release was built. This gets out of date quite quickly, as there are multiple releases of TZDB every year. Those releases are named 2016a, 2016b etc. Noda Time also provides the ability to read .nzd files (Noda Zone Data – a custom format) and every time there’s a new release of TZDB, I build a .nzd file and upload it to nodatime.org, updating http://nodatime.org/tzdb/latest.txt to point to the latest version.

Noda Time 2.0 has not been released yet. When I do release it, I expect to target .NET 4.5 and netstandard1.0.

Each Noda Time 1.x release has an AssemblyVersion just based on major/minor, i.e. 1.0, 1.1, 1.2 etc. Based on this blog post, this may have been a mistake – it should quite possibly have been 1.0 for all versions. Obviously I can’t fix that now, but I can make the 2.x releases use 2.0 everywhere.

When 2.0 is “pretty much ready” we’re going to cut a 1.4 release which deprecates things that are removed in 2.0 and provides the new approaches as far as possible. For example, the IClock.Now property from 1.x is removed in 2.0, and replaced by IClock.GetCurrentInstant(). We’ll deprecate the Now property and introduce a GetCurrentInstant() extension method which delegates to it. This shouldn’t break any 1.x users, but should allow them to move over to the new API as far as possible before upgrading to 2.0. The intention is that users wouldn’t stay on 1.4 for very long. (Obviously they could do so, but there’s not a lot of benefit. 1.4 won’t have new features – it’s really just a transition version.)
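In code, the transition shim might look something like this – a sketch based on the description above rather than the actual 1.4 source, with the class name being my own invention:

using NodaTime;

public static class ClockExtensions
{
    // Mirrors the 2.0 API shape while delegating to the 1.x property
    // (which would itself be marked [Obsolete] in 1.4).
    public static Instant GetCurrentInstant(this IClock clock) => clock.Now;
}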

So far, that’s just the way of the world. Now I want to make it easier for users to stay up-to-date with TZDB – including if nodatime.org goes down. (That’s considerably more likely than nuget.org going down, for example.)

The plan is to introduce a new nearly-data-only assembly, packaged as NodaTime.Tzdb. The aim is to allow users to update their data dependency at build time, in a controlled fashion. If you only want to specify an exact version to depend on, you can do so. If you want to pick up the latest version every time you build, that should be possible too.

The tricky bits come in terms of the versioning.

Some options

Firstly, consider the versioning scheme for the package itself, ignoring everything else. I plan to use something like this:

  • 2016a => 1.2016.1
  • 2016b => 1.2016.2
  • 2016c => 1.2016.3
  • 2017a => 1.2017.1

This should make it reasonably easy to tell the TZDB version just from the package version.

However, I’m considering a few options within this idea:

  • I could create a single package per TZDB release, targeting .NET 3.5 client profile, .NET 4.0, the Profile 328 PCL, .NET 4.5, and .NET Standard 1.0. The first four of these could depend on Noda Time 1.1, and the last one would have to depend on Noda Time 2.0.
  • I could do the above, but depend on 1.3.1 instead of 1.1.
  • I could create one package with two versions per TZDB release – a 1.x depending on Noda Time 1.1, and a 2.x depending on Noda Time 2.0. For example, when TZDB 2016d is released, I could create 1.2016.4 and 2.2016.4.
  • I could create one package version depending on 1.1, one depending on 1.2, one depending on 1.3, one depending on 1.4 (when that exists) and one depending on 2.0.
  • I could create two separate packages, i.e. include the Noda Time major version number in the package name. I don’t like this idea, but it’s on the table.

Some concerns and questions

There are various aspects to this which cause me a few worries. I’m not sure how well I can really structure or segregate those, so I’ll just list them.

  • Can a non-prerelease package depend on a prerelease package for some frameworks? If not, that possibly blows the “single version” idea out of the water, as I can’t depend on NodaTime v2.0 yet – it’s not out.
  • Even if that’s feasible, is it sane to depend on different major versions of the NodaTime package from within a single version of the NodaTime.Tzdb package, or is that going to cause massive confusion?
  • Should I depend on NodaTime v1.1 or v1.3.1? They have different AssemblyVersion numbers, which I believe means an assembly binding redirect will be required if I depend on 1.1 but users depend on 1.3.1. To be clear, I don’t expect many users to still be on versions older than 1.3.1.
  • Likewise, is it going to cause issues for .NET 4.5 users who use NodaTime 2.0 (eventually) if they depend on a version of NodaTime.Tzdb that depends on NodaTime 1.3.1? Again, presumably assembly binding redirects are needed.
  • If I go with the “two-version” scheme (i.e. 1.2016.4 and 2.2016.4 etc) how careful would NodaTime 1.3.1 users have to be? I wouldn’t want them to accidentally get upgraded to NodaTime 2.0 when that’s released, by accidentally taking the 2.x line of NodaTime.Tzdb.
  • Does dotnet cli support the nuget allowedVersions feature at all? I haven’t found any support for it in DNX, but really it’s vital for this scheme to work at all – basically I’d expect a NodaTime 1.3.1 user to specify an allowed version range for NodaTime.Tzdb of [1,2)
  • Is my scheme of 1.2016.4 (etc) sensible? It’s somewhat abusing major/minor/patch, in that there’s no real difference in meaning between a minor version bump (“it’s the new year”) and a patch bump (“there’s been another release in the same year”). Neither kind of change will be breaking (unless you depend on specific time zones behaving in specific ways, of course), and it’s handy to be able to give a simple mapping between TZDB version and package version, but there may be consequences I’m unaware of.

Please feel free to ask clarifying questions in comments. I’ll look forward to getting some answers :)

Ultimate Man Cave: voice automation for my shed

Source code for everything is on GitHub. It probably won’t be useful to you unless you’ve got very similar hardware to mine, but you may want to just have a look.

Background

Near the end of 2015, we had a new shed built at the back of our garden. The term “shed” is downplaying it somewhat – it’s a garden building, about 7m x 2.5m, with heating, lighting and an ethernet connection from the house.

It’s divided in half, with one half being a normal shed (lawnmower, wheelbarrow, tools etc) and one half being my office for working from home. Both sides are also used for general storage – we have a lot of stuff to sort out from a loft conversion a few years ago.

The shed

It only took about three days of using the shed for me to work out that I wanted remote-controlled lighting. If I’m going out there at 6.30am in winter, it’s pretty dark – so it’s really useful to be able to turn the lights on from the house first, so I can negotiate the muddier bits of the garden, see the keyhole to unlock it etc.

After a little research, this turned out to be pretty easy: MiLight is simple and relatively cheap. The equivalent of $100 got me four lights and a wifi controller box. It only took me a few minutes to configure it to talk to my wifi and install the Light Controller Android app, and then I could easily turn my lights on and off from my phone in the house, before stepping outside. Yay. First steps to home automation.

I won’t go into all the details of the rest of the tech in my shed, but the important parts for the purposes of this post are the MiLight lighting, the Sonos unit, the Onkyo amplifier/receiver, an Intel NUC, and a Raspberry Pi 3.

Command-line automation

Sometimes, I’m too lazy to reach for my phone when I want to turn on the lights. Very much a first world problem, I realize. And not so much a problem, as an opportunity to see what’s feasible.

So, I looked around the net for code related to MiLight / EasyBulb, and found (amongst other things) Andy Scott’s MiLight.NET library on GitHub. A small amount of tweaking, and I had a short console app allowing me to run “lights on” or “lights off” which did the obvious thing. Amongst other things, copying this onto an Intel NUC allowed me to turn the lights off via remote desktop when Holly messaged me at the (Google) office to tell me that I’d left them on. It also meant I could schedule a task to turn the lights off at 10.30pm automatically, in case I forgot when I came in.

For a few months, that kept me satisfied… but it was never going to be the final solution.

The next step was to look at other aspects I could automate, and both the amplifier/receiver and the Sonos unit were obvious targets. I knew both had network support, as I already had apps for both on my phone, but I had no idea what the protocols involved were. The amplifier lives in an A/V cabinet, and I normally keep the doors of that shut – so just turning it on, setting the source, and changing the volume either involved getting the phone out or opening the cabinet. Again, could do better.

Sonos supports UPnP/SOAP for control. An old blog post got me started, and then I used Intel Device Spy to work out what else I could easily do. (I don’t have very demanding requirements – just play/pause, set volume, next/previous track is fine.)
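For a flavour of what that looks like, here’s a rough sketch of a “play” request – written from memory of the Sonos UPnP AVTransport service rather than copied from my code, so treat the endpoint, port and header details as approximate:

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class SonosControl
{
    private static readonly HttpClient client = new HttpClient();

    // Sends a SOAP "Play" request to the AVTransport service on a Sonos unit.
    public static async Task PlayAsync(string ipAddress)
    {
        const string body =
            @"<s:Envelope xmlns:s=""http://schemas.xmlsoap.org/soap/envelope/""
    s:encodingStyle=""http://schemas.xmlsoap.org/soap/encoding/"">
  <s:Body>
    <u:Play xmlns:u=""urn:schemas-upnp-org:service:AVTransport:1"">
      <InstanceID>0</InstanceID>
      <Speed>1</Speed>
    </u:Play>
  </s:Body>
</s:Envelope>";
        var request = new HttpRequestMessage(
            HttpMethod.Post,
            $"http://{ipAddress}:1400/MediaRenderer/AVTransport/Control")
        {
            Content = new StringContent(body, Encoding.UTF8, "text/xml")
        };
        request.Headers.Add("SOAPACTION", "\"urn:schemas-upnp-org:service:AVTransport:1#Play\"");
        var response = await client.SendAsync(request);
        response.EnsureSuccessStatusCode();
    }
}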

It turns out that Onkyo has its own protocol called ISCP (Integra Serial Control Protocol) which has a network binding called eISCP. There’s remarkably good documentation in the form of an Excel spreadsheet, providing more information than I’m ever likely to need.

Implementing both of these was slightly faffy. The eISCP code didn’t work for some time, then started working – presumably with some minor tweak, but it wasn’t clear to me which of the many tweaks I made actually fixed it. The Sonos code worked fairly soon, but was very inelegant for quite a while.

Initially, this was all driven from the command line. I introduced a very simple sort of discovery, separating out controllers from their commands:

public interface IController
{
    string Name { get; }
    IImmutableList<ICommand> Commands { get; }
}

public interface ICommand
{
    string Name { get; }
    string Description { get; }

    void Execute(params string[] arguments);
}

There’s then a Factory class with a static AllControllers property. (I’m not keen on the naming here, but we’ll come to that later.)

The fact that Execute takes a string array is indicative of its use for a command line application – although looking at it now, I might have made it IEnumerable given that I’ll always be skipping the first actual argument which identifies the controller.

Anyway, this allows a very simple command line app which doesn’t know anything about lights, music etc – it just offers you the controllers and commands it finds.

There’s only actually one implementation of IController, called ReflectiveController. You pass it the real controller to wrap, which can be any instance of a type with a description and with public methods which also have descriptions. These descriptions are provided with an attribute. The arguments passed to Execute are then converted to the method parameter types using Convert.ChangeType. Crude but effective.
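As a rough sketch of that idea – my own simplification rather than the code that’s actually on GitHub – the command side might look like this:

using System;
using System.Linq;
using System.Reflection;

// A simplified stand-in for the reflective wrapping described above.
public sealed class ReflectiveCommand
{
    private readonly object target;
    private readonly MethodInfo method;

    public ReflectiveCommand(object target, MethodInfo method)
    {
        this.target = target;
        this.method = method;
    }

    public string Name => method.Name;

    public void Execute(params string[] arguments)
    {
        // Convert each string argument to the corresponding parameter type,
        // then invoke the underlying public method on the wrapped controller.
        var parameters = method.GetParameters();
        object[] converted = arguments
            .Select((arg, index) => Convert.ChangeType(arg, parameters[index].ParameterType))
            .ToArray();
        method.Invoke(target, converted);
    }
}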

With this in place, adding a new command to an existing controller is just a matter of adding a public method. Adding a new controller is just a matter of creating a new class with a description, and adding it to the list of controllers in Factory. It’s all really, really simple.

Deploy to the Pi!

This was the aim all along, of course – I’ve been wanting to try out Windows IoT edition, and put my Raspberry Pi to good use, and try out Windows UAP to get a feeling for it. (In particular, I want to learn about some of the constraints I’ll run into with Noda Time 2.0.) This project was a fantastic excuse to do all three.

I started off by building the application just on my laptop. This is one of the lovely benefits of universal apps – you can get them working in a convenient environment first, then deploy elsewhere when you’re ready.

In fact, the very first version of the app didn’t have any speech recognition – it just had buttons to turn the lights on or off. I checked that this worked on both my laptop and the Raspberry Pi – it was nice to see that Windows IoT still supports a UI over HDMI, and it all worked fine, first time. A few years ago, this would have been absolutely stunning in itself – but I think we’re starting to take portability for granted.

Voice automation

On to the final steps: adding speech recognition.

I had a bit of a false start, as there are multiple approaches to speech recognition in Windows UAP. Initially I tried using Cortana, but never got that to work. Instead, I went with the Windows.Media.SpeechRecognition library, which worked pretty much immediately. Again, my initial attempt was more complicated than it needed to be, using an SRGS grammar file. This worked, but it was fiddly. When I discovered the SpeechRecognitionListConstraint class, it was beautiful… it’s literally just a list of strings, and the speech recognizer raises an event when any of those strings is recognized.

The code required to start the speech recognition is trivial:

private async void RegisterVoiceActivation(object sender, RoutedEventArgs e)
{
    recognizer = new SpeechRecognizer
    {
        Constraints = { new SpeechRecognitionListConstraint(handlers.Keys) }
    };
    recognizer.ContinuousRecognitionSession.ResultGenerated += HandleVoiceCommand;
    recognizer.StateChanged += HandleStateChange;

    SpeechRecognitionCompilationResult compilationResult = await recognizer.CompileConstraintsAsync();

    if (compilationResult.Status == SpeechRecognitionResultStatus.Success)
    {
        await recognizer.ContinuousRecognitionSession.StartAsync();
    }
    else
    {
        await Dispatcher.RunIdleAsync(_ => lastState.Text = $"Compilation failed: {compilationResult.Status}");
    }
}

Given the way we’re compiling the constraints, I’d be reasonably happy not checking the compilation result, but I just never took that code away after using it for SRGS (where it was very much required).

The HandleVoiceCommand method just checks whether the recognition confidence is above a certain threshold (0.6 at the moment, but I may tweak it down a bit), and if so, it consults a dictionary to find a delegate to invoke. It also updates the UI for diagnostic purposes. The dictionary itself is the only code that knows about the shed controllers, using C# 6’s using static feature to avoid having Factory. everywhere:

private const string Prefix = "shed ";

private static readonly Dictionary<string, Action> handlers = new Dictionary<string, Action>
{
    { "lights on", Lighting.On },
    { "lights off", Lighting.Off },
    { "music play", Sonos.Play },
    { "music pause", Sonos.Pause },
    { "music mute", () => Sonos.SetVolume(0) },
    { "music quiet", () => Sonos.SetVolume(30) },
    { "music medium", () => Sonos.SetVolume(60) },
    { "music loud", () => Sonos.SetVolume(90) },
    { "music next", Sonos.Next },
    { "music previous", Sonos.Previous },
    { "music restart", Sonos.Restart },
    { "amplifier on", Amplifier.On },
    { "amplifier off", Amplifier.Off },
    { "amplifier mute", () => Amplifier.SetVolume(0) },
    { "amplifier quiet", () => Amplifier.SetVolume(30) },
    { "amplifier medium", () => Amplifier.SetVolume(50) },
    { "amplifier loud", () => Amplifier.SetVolume(60) },
    { "amplifier source pie", () => Amplifier.Source("pi") },
    { "amplifier source sonos", () => Amplifier.Source("sonos") },
    { "amplifier source playstation", () => Amplifier.Source("ps4") }
}.WithKeyPrefix(Prefix);

Here, WithKeyPrefix is just a small extension method to create a new dictionary with a specified prefix to each key.
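It’s about as simple as it sounds – something like this (my reconstruction rather than the code in the repository):

using System.Collections.Generic;
using System.Linq;

public static class DictionaryExtensions
{
    // Returns a new dictionary whose keys are the original keys with the given prefix prepended.
    public static Dictionary<string, TValue> WithKeyPrefix<TValue>(
        this Dictionary<string, TValue> dictionary, string prefix) =>
        dictionary.ToDictionary(pair => prefix + pair.Key, pair => pair.Value);
}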

Just like with the command line version, adding a command is now simply a matter of adding a single entry in this dictionary.
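For completeness, HandleVoiceCommand itself is only a few lines. This is roughly its shape – a sketch based on the description above, so the details (and the exact event argument types from Windows.Media.SpeechRecognition) may not match my real code:

private void HandleVoiceCommand(
    SpeechContinuousRecognitionSession sender,
    SpeechContinuousRecognitionResultGeneratedEventArgs args)
{
    var result = args.Result;
    // Ignore anything the recognizer isn't reasonably confident about.
    if (result.RawConfidence < 0.6)
    {
        return;
    }
    Action handler;
    if (handlers.TryGetValue(result.Text, out handler))
    {
        handler();
    }
}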

Deploy that on my Raspberry Pi, and as if by magic, I can say “shed lights on” and the lights come on, etc. Admittedly after saying “shed music play” it can be quite tricky to launch further actions, as the music interferes with the speech recognition for obvious reasons.

Simple code for the win

I’d like to take a few moments to talk about the code. At this point, you may want to have GitHub open in another tab to follow along.

There are lots of things about the code which I’d deem pretty unacceptable at work:

  • It uses the service locator pattern instead of dependency injection. I’m not a fan of this in general.
  • I really hate the name Factory – but I haven’t found anything significantly better, yet. (ControllerProvider? I’d call it just Controllers, but that’s the final part of the namespace name…)
  • There are no tests. At all. Not even a test project.
  • There are only a few comments.
  • The IP addresses are hard-coded into Factory. No config files, no discovery, not even names – just IP addresses.
  • There’s no abstraction beyond IController and ICommand. I could potentially have an IVolumeController, IMusicController, ISourceController etc.

None of these bother me, even though the code is “in production” and I’m expecting to use it for a long time. It’s never going to grow large enough for the service locator pattern to be a problem. With so few types involved, a few non-ideal names isn’t going to cause much of a problem. The only tests that matter are the ones involving me saying “shed amplifier on” and the amplifier either turning on or not… there’s very little code here that’s really testable anyway. My device IP addresses are all fixed by my router, so I’d only have to change them if I change that – and I’d still end up changing it in just one place. Extra abstraction wouldn’t actually give me any benefits at the moment.

So yes, basically I’m happy with the code now. It provides me value, and it’s easy to maintain. In particular, adding extra controllers or commands is trivial. I guess what I’m saying is that this is a reminder that not all code is “enterprise software” and even “best practice” rules such as writing no code without tests have their limitations. Context is king.

What next?

My Raspberry Pi 3 has a small touchscreen display on it, which uses the Raspberry Pi’s SPI interface for communication. I haven’t yet managed to get this working, but obviously that would be a lovely next step. It’s a bit of a pain changing from DisplayPort to HDMI to see the UI and check what phrases have been recognized, for example. The display part will definitely be useful – I might use the touch part just for a very few key commands, such as “stop the music, you can’t hear me any more!”

The device I’d most like to control next is the heater. I keep leaving the heating on accidentally, then having to put my shoes on again to go out and just turn the heating off. If the heater plugged in via a regular socket, it would be easy enough to sort out – but unfortunately the power cable goes straight into a box in the wall. I may try to sort this out at some point, but it’s going to be a pain.

The other thing I’d like to do is add the ability to switch monitor inputs using DDC/CI. That could be tricky in terms of getting access to such a low-level API, and also it requires a permanent “live” connection to the monitor – whereas both my HDMI and DisplayPort connections are switched (by the Onkyo for HDMI, and a KVM for DisplayPort). I’m still thinking about that one. I could potentially have a secondary output from the NUC to a DVI input on the monitor, then make the NUC listen as a server that the Pi could talk to…

Conclusion

Home automation is fun and simple – but it really, really helps to have a project which will actually be useful to you. I’ve had a few Raspberry Pis sitting around for ages waiting to be used. They’ve always been fun to play with, but now there’s a purpose, and that makes a huge difference…

To base() or not to base(), that is the question

Today I’ve been reviewing the ECMA-334 C# specification, and in particular the section about class instance constructors.

I was struck by this piece in a clause about default constructors:

If a class contains no instance constructor declarations, a default instance constructor is automatically provided. That default constructor simply invokes the parameterless constructor of the direct base class.

I believe this to be incorrect, and indeed it is, as shown here (in C# 6 code for brevity, despite this being the C# 5 spec that I’m reviewing; that’s irrelevant in this case):

using System;

class Base
{
    public int Foo { get; }

    public Base(int foo = 5)
    {
        Foo = foo;
    }
}

class Derived : Base
{    
}

class Test
{
    static void Main()
    {
        var d = new Derived();
        Console.WriteLine(d.Foo); // Prints 5
    }    
}

Here the default constructor in Derived clearly doesn’t execute a parameterless constructor in Base because there is no parameterless constructor in Base. Instead, it executes the parameterized constructor, providing the default argument value.

So, I considered whether we could reword the standard to something like:

If a class contains no instance constructor declarations, a default instance constructor is automatically provided. That default constructor simply invokes a constructor of the direct base class as if the default constructor contained a constructor initializer of base().

But is that always the case? It turns out it’s not – at least not in Roslyn. There are more interesting optional parameters we can use than just int foo = 5. Let’s have a look:

using System;
using System.Runtime.CompilerServices;

class Base
{
    public string Origin { get; }

    public Base([CallerMemberName] string name = "Unspecified",
                [CallerFilePath] string source = "Unspecified",                
                [CallerLineNumber] int line = -1)
    {
        Origin = $"{name} - {source}:{line}";
    }
}

class Derived1 : Base {}
class Derived2 : Base
{
    public Derived2() {}
}
class Derived3 : Base
{
    public Derived3() : base() {}
}

class Test
{
    static void Main()
    {
        Console.WriteLine(new Derived1().Origin);
        Console.WriteLine(new Derived2().Origin);
        Console.WriteLine(new Derived3().Origin);
    }    
}

The result is:

Unspecified - Unspecified:-1
Unspecified - Unspecified:-1
.ctor - c:\Users\Jon\Test\Test.cs:23

When base() is explicitly specified, that source location is treated as the “caller” for caller member info attributes. When it’s implicit (including when there’s a default constructor), no source location is made available to the Base constructor.

This is somewhat compiler-specific – and I can imagine different results where the default constructor could specify a name but not source file or line number, and the declared constructor with an implicit call could specify the name and source file but no line number.

I would never suggest using this little tidbit of Roslyn implementation trivia, but it’s fun nonetheless…

“Sideways overriding” with partial methods

First note: this blog post is very much tongue in cheek. I’m not actually planning on using the idea. But it was too fun not to share.

As anyone following my activity on GitHub may be aware, I’ve been doing quite a lot of work on Protocol Buffers recently – in particular, a mostly-new port for proto3. I’ve recently been looking at JSON support, and thinking about how to implement “overriding” ToString() for a few well-known types. I generate partial classes, so that gives me a hook to provide extra functionality. Indeed, I’m planning on using this to provide conversion methods for Timestamp and Duration, for example. However, you can’t really override anything with partial methods.

Refresher on partial methods

While partial classes were introduced in C# 2, partial methods were introduced in C# 3. The idea is that one source file (usually the generated one) can provide a partial method signature, and another source file (usually the manually-written one) can provide an implementation if it wants to. Any part of the source can call the method, and the call will be removed at compile-time if nothing provides an implementation. The fact that the method may not be there leads to some limitations:

  • Partial methods are implicitly private, but you can’t specify an access modifier explicitly
  • Partial methods are always void – they can’t return any values
  • Partial methods cannot have out parameters

(Interestingly, a partial method implementation can be an async method – but with a return type of void, which is never a nice situation to be in.)

There’s more in the spec, but the last two bullets are the important part.

So, suppose I want to override ToString() in the generated code, but provide a mechanism for that override to be “further overridden” effectively, in the manual code for the same class? How do I get the value from an “extra override”? How do I even detect whether or not it’s there?

Side effects to the rescue!

(Now there’s a phrase you never thought you’d hear from me.)

I mentioned before that if a partial method is called but no implementation is provided, the call is removed. That includes all aspects of the call – including the evaluation of the arguments. So if evaluating the argument has a side-effect… we can spot that side effect.

Next, we have to work out how to get a value back from a method. We can’t use the return value, and we can’t use an out parameter. There are two options here: we could either pass a wrapper (e.g. an array with a single element) and allow the “extra override” to populate the wrapper… or we can use a ref parameter. The latter feels ever-so-slightly cleaner to me.

And so the ugly hack is born. The code generator can always generate code like this:

partial void ToStringOverride(bool ignored, ref string value);

public override string ToString()
{
    string value = null;
    bool overridden = false;
    ToStringOverride(overridden = true, ref value);
    return overridden ? value : "Original";
}

For any partial class where the ToStringOverride method isn’t implemented, overridden will still be false, so we’ll fall back to returning "Original". (I would hope that any decent JIT would remove the overridden and value local variables entirely at that point.) Otherwise, we’ll return whatever the method has changed value to.

Here’s a short but complete example:

using System;

// Generated code
partial class UglyHack1
{
    partial void ToStringOverride(bool ignored, ref string value);

    public override string ToString()
    {
        string value = null;
        bool overridden = false;
        ToStringOverride(overridden = true, ref value);
        return overridden ? value : "Original";
    }
}

// Generated code
partial class UglyHack2
{
    partial void ToStringOverride(bool ignored, ref string value);

    public override string ToString()
    {
        string value = null;
        bool overridden = false;
        ToStringOverride(overridden = true, ref value);
        return overridden ? value : "Original";        
    }    
}

// Manual code
partial class UglyHack2
{
    partial void ToStringOverride(bool ignored, ref string value)
    {
        value = "Different!";
    }
}

class Test
{
    static void Main()
    {
        var g1 = new UglyHack1();
        var g2 = new UglyHack2();

        Console.WriteLine(g1);
        Console.WriteLine(g2);
    }
}

Horribly ugly, but it works…

Alternatives?

Obviously this isn’t really pleasant. Some alternatives:

  • Derive from the generated class in order to override ToString again. Doesn’t work with sealed classes, and will only work if clients create instances of the derived class.
  • Introduce a new interface, and allow manual code to implement it on the partial class. The ToString method can then check whether this is IMyOtherToString or whatever, and call it appropriately – there’s a sketch of this after the list. This introduces another virtual call for no great reason, and exposes the interface to the outside world, which we may not want to do.
  • Don’t override ToString in the generated code at all. Not good if you normally want to override it.
  • Introduce an abstract base class which the generated class derives from. Override ToString() in that base class, possibly calling an abstract member which is then provided in the generated class – but allowing the manual code to override ToString() again.
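As a sketch of that second option, using the hypothetical IMyOtherToString name from the bullet above:

// Hand-written interface that manual code can choose to implement.
public interface IMyOtherToString
{
    string ToStringOverride();
}

// Generated code
public partial class Generated
{
    public override string ToString()
    {
        var custom = this as IMyOtherToString;
        return custom != null ? custom.ToStringOverride() : "Original";
    }
}

// Manual code opts in by implementing the interface on its half of the partial class.
public partial class Generated : IMyOtherToString
{
    public string ToStringOverride() => "Different!";
}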

Conclusion

Ugly hacks are fun. But it’s much better to keep them where they belong: in a blog post, not in production code.

Backwards compatibility is (still) hard

At the moment, I’m spending a fair amount of time thinking about a new version of the C# API and codegen for Protocol Buffers, as well as other APIs for interacting with Google services. While that’s the context for this post, I want to make it very clear that this is still a personal post, and should in no way be taken to be “Google’s opinion” on anything. The underlying issue could apply in many other situations, but it’s easiest to describe a concrete scenario.

Context and current state: the builder pattern

The problem I’ve been trying to address is the relative pain of initializing a protobuf message. Protocol buffer messages are declared in a separate schema file (.proto) and then code is generated. The schema declares fields, each of which has a name, a type and a number associated with it. The generated message types are immutable, with builder classes associated with them. So for example, we might start off with a message like this:

message Person {
  string first_name = 1;
  string last_name = 3;
}

And construct a Person object in C# like this:

var person = new Person.Builder { FirstName = "Jon", LastName = "Skeet" }.Build();
// Now person.FirstName and person.LastName are readonly properties

That’s not awful, but it’s not the cleanest code in the world. We can make it slightly simpler using an implicit conversion from the builder type to the message type:

Person person = new Person.Builder { FirstName = "Jon", LastName = "Skeet" };

It’s still not really clean though. Let’s revisit why the builder pattern is useful:

  • We can specify just the properties we want.
  • By deferring the “build” step until after we’ve specified everything, we get mutability without building a lot of intermediate objects.

If only there were another language construct allowing that…

Optional parameters to the rescue!

If we provided a constructor with an optional parameter for each property, we can specify just what we want. So something like:

public Person(string firstName = null, string lastName = null)
...
var person = new Person(firstName: "Jon", lastName: "Skeet");

Hooray! That looks much nicer:

  • We can use var (if we want to) because there are no implicit conversions to confuse things.
  • We don’t need to mention a builder at all.
  • Every piece of text in the statement is something we want to express, and we only express it once.

That last point is a lovely place to be in terms of API design – while you still need to worry about naming, ordering and how the syntax fits into bigger expressions, you’ve achieved some sense of “as simple as possible, but no simpler”.

So, that’s all great – except for versioning.

Let’s just add a field at the end…

One of the aims of protocol buffers is to support an evolving schema. (The limitations are different for proto2 and proto3, but that’s a slightly different matter.) So what happens if we add a new field to the message?

message Person {
  string first_name = 1;
  string last_name = 3;
  string title = 4; // Mr, Mrs etc
}

Now we end up with the following constructor:

public Person(string firstName = null, string lastName = null, string title = null)

The code still compiles – but if we try to run our old client code against the new version of the library, it will fail – because the method it refers to no longer exists. So we have source compatibility, but not binary compatibility.

Let’s just add a field in the middle…

You may have noticed that I don’t have a field with tag 2 – this is not an accident. Suppose we now add it, for the obvious middle_name field:

message Person {
  string first_name = 1;
  string middle_name = 2;
  string last_name = 3;
  string title = 4; // Mr, Mrs etc
}

Regenerate the code, and we end up with a constructor with 4 parameters:

public Person(
    string firstName = null,
    string middleName = null,
    string lastName = null,
    string title = null)

Just to be clear, this change is entirely fine in protocol buffers – while normally fields are assigned incrementally, it shouldn’t be a breaking change to add a new field “between” existing ones.

Let’s take a look at our client code again:

var person = new Person(firstName: "Jon", lastName: "Skeet");

Yup, that still works – we need to recompile, but we still end up with a Person with the right properties. But that’s not the only code we could have started with. Suppose we’d actually had:

var person = new Person("Jon", "Skeet");

Until this last change that would have been fine – even after we’d added the optional title parameter, the two arguments would still have mapped to firstName and lastName respectively.

Having added the middle_name field, however, the code would still compile with no errors or warnings, but the meaning of the second argument would have changed – it would now map onto the middleName parameter instead of lastName.

Basically, we’d like to stop this code (using positional arguments) from compiling in the first place.

Feature requests and a workaround

The two features we really want from C# here are:

  • Some way of asking the generated code to perform dynamic overload resolution at execution time… not based on dynamic values, but on the basis that the code we’re compiling against may have changed since we compiled. This resolution only needs to be performed once, on first execution (or class load, or whatever) as by the time we’re executing, everything is fixed (the parameter names and types, and the argument names and types). It could be efficient.
  • Some way of forcing any call sites to use named arguments for any optional parameters. (Even though in our case all the parameters are optional, I can easily imagine a case where there are a few required parameters and then the optional ones. Using positional arguments for those required parameters is fine.)

It’s hard (without forking Roslyn :) to implement these feature requests ourselves, but for the second one we can at least have a workaround. Consider the following struct:

public struct DoNotCallThisMethodWithPositionalArguments {}

… and now imagine our generated constructor had been :

public Person(
    DoNotCallThisMethodWithPositionalArguments ignoreMe =
        default(DoNotCallThisMethodWithPositionalArguments),
    string firstName = null,
    string middleName = null,
    string lastName = null,
    string title = null)

Now our constructor call using positional arguments will fail, because there’s no conversion from string to the crazily-named struct. The only “nice” way you can call it is to use named arguments, which is what we wanted. You could call it using positional arguments like this:

var person = new Person(
    new DoNotCallThisMethodWithPositionalArguments(),
    "Jon",
    "Skeet");

(or using default(...) like the constructor declaration) – but at this point the code looks broken, so it’s your own fault if you decide to use it.

The reason for making it a struct rather than a class is to avoid null being convertible to it. Annoyingly, it wouldn’t be hard to make a class that you could never create an actual instance of, but you can’t prevent anyone from creating a value of a struct. Basically, what we really want is a type such that there is no valid expression which is convertible to that type – but aside from static classes (which can’t be used as parameter types) I don’t know of any way of doing that. (I don’t know what would happen if you compiled the Person class using a non-static class as the first parameter, then made that class static and recompiled it. Confusion on the part of the C# compiler, I should think.)

Another option (as mentioned in the comments) is to have a “poor man’s” version of the compiler enforcement via a Roslyn Code Diagnostic – add an attribute to any method call where you want “All optional parameters must be specified with named arguments” to apply, and then make the code diagnostic complain if you disobey that. That diagnostic could ship with the Protocol Buffers NuGet package, which would make for a pretty nice experience. Not quite as good as a language feature though :)

Conclusion

Default parameters are a pain in terms of compatibility. For internal company code, it’s often reasonable to only care about source compatibility as you can recompile all calling code before deployment – but for open source projects, binary compatibility within the same major version is the norm.

How useful and common do I think these features would be? Probably not common enough to meet the bar – unless there’s encouragement within comments here, in which case I’m happy to file feature requests on GitHub, of course.

As it happens, I’m currently looking at radical changes to the C# implementation of Protocol Buffers, regretfully losing the immutability aspect due to it raising the barrier to entry. It’s not quite a done deal yet, but assuming that goes ahead, all of this will be mostly irrelevant – for Protocol Buffers. There are plenty of other places where code generation could be more robustly backward-compatible through judicious use of optional-but-please-use-named-arguments parameters though…

Precedence: ordering or grouping?

As I’ve mentioned before, I’m part of the technical group looking at updating the ECMA-334 C# standard to reflect the C# 5 Microsoft specification. I recently made a suggestion that I thought would be uncontroversial, but which caused some discussion – and prompted this “request for comment” post, effectively.

What does the standard say about precedence?

The current proposed standard includes the following text:

The order of evaluation of operators in an expression is determined by the precedence and associativity of the operators (§13.4.2).

Operands in an expression are evaluated from left to right.

When an expression contains multiple operators, the precedence of the operators controls the order in which the individual operators are evaluated. [Note: For example, the expression x + y * z is evaluated as x + (y * z) because the * operator has higher precedence than the binary + operator. end note]

I like the example in the note, but I’m not keen on the rest of the wording. It’s very easy to miss the difference between operands for any given expression always being evaluated left to right, and operators being evaluated in an order determined by precedence.

I’ve always thought about precedence in terms of grouping, not ordering – I think of one operator as “binding tighter” than another rather than as being “executed before” it. When I consider precedence, I mentally apply brackets to group operators and operands together explicitly. Eric Lippert has blogged along similar lines but I wouldn’t want to put words into his mouth by suggesting he agrees with me. Interestingly, he includes:

Order of evaluation rules describe the order in which each operand in an expression is evaluated.

That’s certainly true about the order of evaluation of operands in an expression, but as seen earlier precedence is also specified in terms of “order of evaluation”. To me, that’s what makes the standard confusing.

Importantly though, when I expressed this in a meeting, smarter people than me said that they exactly thought of precedence in terms of order of evaluation. Grouping was just another way of looking at it, but a sort of secondary approach.

What does “order of evaluation” even mean?

Let’s take a closer look at the wording of the standard. It’s fairly clear what “operands in an expression are evaluated from left to right” means (ignoring the possibility that the “left” operand actually occurs to the right of the “right” operand physically due to line breaks). The left operand is completely evaluated, from start to finish, before the right operand is evaluated. Great.

But what about the “order of evaluation of operators”? Here it’s trickier. Does “evaluating” a + b include first evaluating a and b? Should “order of evaluation” mean “order of starting to evaluate”? If so, the standard would actually be inaccurate. Let’s go back to the example of x + y * z. We can view that as a sequence of steps:

  1. Evaluate x.
  2. Evaluate y.
  3. Evaluate z.
  4. Multiply the results of steps 2 and 3.
  5. Add the results of steps 1 and 4.

Note that the multiplication definitely occurs before the addition. So that looks right. But if I rewrite it just a little to give more context (sorry about the bullets; it’s the only way I could get the formatting right):

  • Evaluate x + y * z
    • 1 Evaluate x
    • 2 Evaluate y * z
      • 2a Evaluate y
      • 2b Evaluate z
      • 2c Multiply the results of steps 2a and 2b; this is the result of step 2
    • 3 Add the results of steps 1 and 2; this is the result of the expression

At that point, it’s clear that we’ve started evaluating the + operator before we’ve started evaluating the * operator.

Similarly, if we view it as a tree:

    +
   / \
  x   *
     / \
    y   z

… we hit the + node before we hit the * node.

So from the “starting to evaluate” perspective, precedence appears to fall apart. The ordering only makes sense when you start talking about the operator performing its duty with the already-evaluated operands – which is tricky for ??, ?. and ?: which don’t always evaluate all their operands. I suspect that’s fixable though, with careful wording – and I’m influenced by the fact that the term “precedence” is naturally ordering-related (one thing preceding another). Maybe it’s as simple as talking about the order of completing evaluation of operands.

Non-conclusion: over to you

So, what should we do in the standard? Given the range of views on the technical group, I said I’d write this blog post and canvas opinion. Readers: how do you as readers think about precedence? How should the standard talk about precedence? Is any aspect of the existing wording (in C# 5 specification it’s section 7.3.1; in ECMA-334 4th edition it’s section 14.2.1) particularly helpful or confusing?

The technical group is full of very smart people – all of them smarter than me, and with a deeper computer science (and/or C# compiler implementation) background. That makes me simultaneously nervous of proposing changes and confident in my role of “interested amateur”: if I find something confusing, I suspect some other readers will too.

I’m in no way saying it’s wrong to think of precedence in terms of ordering – albeit with a more precise definition of ordering than we’ve got now – but I’m suggesting it’s not the most helpful way of expressing it for readers. Just to be entirely clear, I’m not suggesting any sort of semantic change – if we change the wording of the standard, it would be purely about clarification, with no behavioural change.

I had originally intended to make this blog post as “on the fence” as possible, but the more I’ve looked at it, the more I’ve reinforced my original position – I can only apologise for not being terribly even-handed. I’m very happy to be corrected though, and look forward to reading plenty of comments. Don’t be shy.

When is an identifier not an identifier? (Attack of the Mongolian Vowel Separator)

Here are a few things you may not be aware of:

  • C# identifiers can include Unicode escape sequences (\u1234 etc)
  • C# identifiers can include Unicode characters in the category “Other, formatting” (Cf) but these are ignored when comparing identifiers for equality
  • The Mongolian Vowel Separator (U+180E) has oscillated between the Cf and Zs categories a couple of times
  • .NET has its own copy of Unicode categories, separate from whatever Win32 might provide
  • Roslyn (written in .NET) uses the .NET Unicode categories, whereas csc.exe (the “old” native C# compiler) uses either the Win32 categories or its own built-in copy
  • Neither the .NET table nor the Win32 table necessarily reflects exactly what any one version of the Unicode standard says
  • Compilers can have bugs in them

Put them together, and chaos ensues!

How this all started – blame Vladimir

I started looking at this based on a discussion in our ECMA technical group meeting last week, when we were considering the normative references – and in particular, which version of Unicode we were going to target. Currently the ECMA 4th edition spec targets Unicode 4.0 and the Microsoft C# 5 specification targets Unicode 3.0. It’s not clear to me whether any compilers actually take note of this, and moving forward we’d like both the ECMA and Microsoft standards to avoid specifying a particular version of Unicode, effectively encouraging compiler authors to use the most recent one available to them. Despite the wrinkles listed below, I think that makes the most sense for real-world use – requiring compilers to ship with their own private copies of Unicode would be crazy.

When discussing this, Vladimir Reshetnikov mentioned the Mongolian Vowel Separator (U+180E) which has had an interesting life. It was introduced in Unicode 3.0.0, when it was in the Cf category (“other, formatting”). Then in Unicode 4.0.0 it was moved into the Zs (separator, space) category. In Unicode 6.3.0 it was then moved back to the Cf category.

Of course, my natural inclination was to try to abuse this. My initial aim was to come up with code which behaved differently depending on which version of Unicode the compiler was using. It turned out to be a little more complicated than that, but we’ll start by assuming a hypothetical compiler with no bugs, which obeys whichever version of the Unicode standard we want it to. (Arguably that’s already a bug given the requirements of the current C# specs, but we’ll set that aside.)

Hypothetical example 1: valid or invalid

For simplicity, let’s start with some source code which is all in ASCII:

using System;

class MvsTest
{
    static void Main()
    {
        string stringx = "a";
        string\u180ex = "b";
        Console.WriteLine(stringx);
    }
}

If the compiler is using Unicode 6.3 or higher (or a version earlier than 4.0) then U+180E is deemed to be in the Cf category, and is therefore valid within an identifier. In that case, it’s fine for it to be escaped as per the code above. At that point, the identifier in the second line of the method is deemed to be “the same as” stringx, so the output is b.

What about a compiler using a version of Unicode between 4.0 and 6.2 (inclusive) though? At that point, U+180E is deemed to be in the Zs category, which makes it a whitespace character. Zs characters are allowed as whitespace within C# programs – but not within identifiers. Once the character isn’t valid within an identifier – and because this isn’t within a character/string literal – it’s invalid to use the Unicode escape sequence at all, so the code doesn’t compile.
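
To make that concrete, here’s a rough sketch (mine, not the actual compiler code) of the category check a compiler effectively performs for characters inside an identifier – letters, digits, connecting, combining and formatting characters are allowed, so the Cf/Zs classification of U+180E is exactly what decides the outcome:

using System;
using System.Globalization;

class IdentifierPartCheck
{
    // Rough approximation of the identifier-part-character rule.
    static bool IsAllowedInsideIdentifier(char c)
    {
        switch (char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.TitlecaseLetter:
            case UnicodeCategory.ModifierLetter:
            case UnicodeCategory.OtherLetter:
            case UnicodeCategory.LetterNumber:
            case UnicodeCategory.DecimalDigitNumber:
            case UnicodeCategory.ConnectorPunctuation:
            case UnicodeCategory.NonSpacingMark:
            case UnicodeCategory.SpacingCombiningMark:
            case UnicodeCategory.Format:
                return true;
            default:
                return false;
        }
    }

    static void Main()
    {
        // True on a BCL whose tables say Cf; false on one which says Zs.
        Console.WriteLine(IsAllowedInsideIdentifier('\u180e'));
    }
}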

Hypothetical example 2: valid two different ways

We can write the same code without using an escape sequence, however. If you create a regular ASCII file like this:

using System;

class MvsTest
{
    static void Main()
    {
        string stringx = "a";
        stringAAAx = "b";
        Console.WriteLine(stringx);
    }
}

and then open it up in a hex editor and replace the AAA with the bytes E1 A0 8E, you’ve got a file containing the UTF-8 representation of U+180E at the same location as the Unicode escape sequence in the first version.
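
If hex editing sounds fiddly, here’s an alternative sketch (mine, not from the experiment as originally performed): a small generator program which writes the same file, embedding the raw U+180E character via an escape in the generator’s own string literal.

using System.IO;
using System.Text;

class GenerateMvsTest
{
    static void Main()
    {
        // The \u180e escape is legal here because it's inside a string literal
        // in the *generator*; the generated file ends up containing the raw
        // U+180E character (bytes E1 A0 8E in UTF-8) within the identifier.
        string source =
            "using System;\n" +
            "\n" +
            "class MvsTest\n" +
            "{\n" +
            "    static void Main()\n" +
            "    {\n" +
            "        string stringx = \"a\";\n" +
            "        string\u180ex = \"b\";\n" +
            "        Console.WriteLine(stringx);\n" +
            "    }\n" +
            "}\n";
        // new UTF8Encoding(false) writes UTF-8 without a byte order mark.
        File.WriteAllText("MvsTest.cs", source, new UTF8Encoding(false));
    }
}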

So, a compiler which compiled the first version would still compile this version (assuming you could tell it that the source code was UTF-8), and the results would be the same – it would print b as the second statement of the method would be a simple assignment to the existing variable.

However, a compiler which treats U+180E as whitespace (and would therefore reject the first program) would accept this program – and treat that second statement as a declaration of a second local variable (x), assigning it an initial value. You might get a warning about the variable being unused, but it’s valid C# and the output is a.

Reality: the Microsoft compilers

Whenever we talk about the Microsoft C# compiler these days, we need to distinguish between the “native” compiler (csc) and Roslyn (rcsc, although typically I just call it Roslyn).

As it’s written in native code, csc uses whatever Windows supplies for its Unicode character tables – or possibly it embeds its own copy directly in the executable. (I’ve been scouring MSDN to find a Win32 native function to tell me the Unicode category of a specific code point, and have failed so far. It would have been useful…)

Compare that with Roslyn, which is written in C# and (as far as I’m aware) uses char.GetUnicodeCategory – which in turn uses the Unicode tables built into mscorlib.

My experiments suggest that whatever the native compiler uses to get the Unicode category has treated U+180E as a formatting character forever. At least, I’ve tried to find old machines (including VM images) which haven’t had Windows update applied since September 2013 (which is when Unicode 6.3 was published) and they all compile the first program listed above. I’m beginning to suspect that csc might actually have a copy of Unicode 3.0 built into it; it certainly treats U+180E as a formatting character, but doesn’t like either U+0600 or U+00AD within identifiers. (U+0600 wasn’t introduced until Unicode 4.0, but has always been a formatting character; U+00AD was a “dash punctuation” character in Unicode 3.0, and became a formatting character in 4.0.)
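
One way to probe a particular compiler is simply to feed it identifiers containing the escapes in question. Here’s a hedged sketch of such a probe (the code points are the ones discussed above; comment lines in or out to see which ones a given compiler accepts):

class IdentifierProbe
{
    static void Main()
    {
        // Whether each declaration compiles depends on the compiler's Unicode
        // tables; expect "unused variable" warnings for any that do compile.
        int a\u00adb = 0;  // U+00AD (SOFT HYPHEN): dash punctuation in Unicode 3.0, Cf from 4.0
        int c\u0600d = 0;  // U+0600 (ARABIC NUMBER SIGN): Cf ever since its introduction in 4.0
        int e\u180ef = 0;  // U+180E (MONGOLIAN VOWEL SEPARATOR): Cf, then Zs, then Cf again
    }
}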

The table built into mscorlib has definitely changed over time, however. If you run a simple program such as this:

using System;

class Test
{
    static void Main()
    {
        Console.WriteLine(Environment.Version);
        Console.WriteLine(char.GetUnicodeCategory('\u180e'));
    }
}

then running under CLR v2 the result is “SpaceSeparator”, whereas running on CLR v4 (at least on a recently-updated system) the result is “Format”.

Of course, Roslyn won’t run under old CLRs, but we have hope by way of csharppad.com – which runs Roslyn in an environment (of uncertain origin – Mono, perhaps? I’m unsure) that prints “SpaceSeparator” for the above. Sure enough, the first program fails to compile – but it’s harder to check the second program, as csharppad.com doesn’t allow you to upload source code, and copy/paste produces some odd results.

Reality: mcs (Mono C# compiler)

Mono’s compiler uses the BCL GetUnicodeCategory code too, which should make it significantly simpler to experiment – but unfortunately, the Mono parser has (at least) two bugs in it:

  • It will allow any Unicode escape sequence in an identifier, whether it’s an escape sequence for a valid identifier part or not. For example, string\u0020x = "" is valid under the Mono compiler. Filed as bug 24968. Source.
  • It doesn’t allow formatting characters within identifiers – it includes characters in classes Mn, Mc, Nd and Pc, but not Cf. Filed as bug 24969. Source.

Because of these bugs, the first program always compiles and prints “b”, whereas the second program always fails to compile, regardless of whether U+180E is treated as being in Zs or Cf.

What version is this, anyway?

Next, let’s think about the Unicode data itself. It’s not at all clear which version any particular BCL implementation is actually using. Consider this little program:

using System;

class Test
{
    static void Main()
    {
        Console.WriteLine(char.GetUnicodeCategory('\u00ad'));
        Console.WriteLine(char.GetUnicodeCategory('\u0600'));
        Console.WriteLine(char.GetUnicodeCategory('\u180e'));
    }
}

On my computer, under CLR v4 this prints “DashPunctuation, Format, Format”, and under both Mono (3.3.0) and CLR v2 it prints “DashPunctuation, Format, SpaceSeparator”.

That’s very odd. It doesn’t correspond with any version of the Unicode standard, as far as I can tell:

  • U+00AD was a Po (other, punctuation) character in Unicode 1.x, then Pd (dash, punctuation) in 2.x and 3.x, and from 4.0 onwards has been Cf.
  • U+0600 was only introduced in Unicode 4.0, and has always been Cf.
  • U+180E was introduced as Cf in 3.0, then changed to Zs in 4.0, then back to Cf in 6.3.

So there is no version where the first line agrees with either the second line or the third line. I’m basically a bit baffled by this.

What about nameof and CallerMemberName?

The names of identifiers aren’t only used for comparisons – they’re available as strings without any reflection being involved at all. Since C# 5, we’ve had the CallerMemberName attribute, allowing things like:

public static void X\u0600y()
{
    ShowCaller();
}

public static void ShowCaller([CallerMemberName] string caller = null)
{
    Console.WriteLine("Called by {0}", caller);
}

And in C# 6, we can write:

string x\u0600y = "";
Console.WriteLine("nameof = {0}", nameof(x\u0600y));

What should those print? They do just print “Xy” and “xy” as the names (respectively), as if the compiler has simply thrown away the formatting character entirely. But what should they print? Bear in mind that in the second case, we could easily have used nameof(xy) and that would still have compared equal to the declared identifier.
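
For convenience, here’s a self-contained sketch combining the two fragments above into one runnable program (assuming a compiler that accepts Cf escapes in identifiers, as discussed):

using System;
using System.Runtime.CompilerServices;

class NameTest
{
    public static void X\u0600y()
    {
        ShowCaller();
    }

    public static void ShowCaller([CallerMemberName] string caller = null)
    {
        Console.WriteLine("Called by {0}", caller);
    }

    static void Main()
    {
        X\u0600y();  // Prints "Called by Xy"

        // The compiler may still warn that this variable is unused.
        string x\u0600y = "";
        Console.WriteLine("nameof = {0}", nameof(x\u0600y));  // Prints "nameof = xy"
    }
}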

We can’t even say “What’s the name of the member being declared?” because you can overload with “different but equal” identifiers:

public static void Xy() {}
public static void X\u0600y() {}
public static void X\u070fy() {}
...
Console.WriteLine(nameof(X\u200by));

What should that print out? I’m sure you’ll be relieved to hear that the C# team has a plan in place – but fundamentally this is one of those “no obvious right answer” scenarios. It gets even weirder when you bring the CLI specification into the mix. Section I.8.5.1 of ECMA-335 6th edition has:

Assemblies shall follow Annex 7 of Technical Report 15 of the Unicode Standard 3.0 governing the set of characters permitted to start and be included in identifiers, available online at http://www.unicode.org/unicode/reports/tr15/tr15-18.html. Identifiers shall be in the canonical format defined by Unicode Normalization Form C. For CLS purposes, two identifiers are the same if their lowercase mappings (as specified by the Unicode locale-insensitive, one-to-one lowercase mappings) are the same. That is, for two identifiers to be considered different under the CLS they shall differ in more than simply their case. However, in order to override an inherited definition the CLI requires the precise encoding of the original declaration be used.

I would love to explore the impact of this by adding a Cf character into IL, but unfortunately I haven’t worked out a way of affecting the encoding of ilasm, in order to persuade it that my hacked up IL is what I want it to be.

Conclusion

As noted before, text is hard.

It turns out that even when restricting oneself to identifiers, text is hard. Who would’ve thought?

When is a string not a string?

As part of my “work” on the ECMA-334 TC49-TG2 technical group, standardizing C# 5 (which will probably be completed long after C# 6 is out… but it’s a start!) I’ve had the pleasure of being exposed to some of the interesting ways in which Vladimir Reshetnikov has tortured C#. This post highlights one of the issues he’s raised. As usual, it will probably never impact 99.999% of C# developers… but it’s a lovely little problem to look at.

Relevant specifications referenced in this post:
  • The Unicode Standard, version 7.0.0 – in particular, chapter 3
  • C# 5 (Word document)
  • ECMA-335 (CLI specification)

What is a string?

How would you define the string (or System.String) type? I can imagine a number of responses to that question, from vague to pretty specific, and not all well-defined:

  • “Some text”
  • A sequence of characters
  • A sequence of Unicode characters
  • A sequence of 16-bit characters
  • A sequence of UTF-16 code units

The last of these is correct. The C# 5 specification (section 1.3) states:

Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.

So far, so good. But that’s C#. What about IL? What does that use, and does it matter? It turns out that it does… Strings need to be represented in IL as constants, and the nature of that representation is important – not only in terms of the encoding used, but also in how the encoded data is interpreted. In particular, a sequence of UTF-16 code units isn’t always representable as a sequence of UTF-8 code units.

I feel ill (formed)

Consider the C# string literal "X\uD800Y". That is a string consisting of three UTF-16 code units:

  • 0x0058 – ‘X’
  • 0xD800 – High surrogate
  • 0x0059 – ‘Y’

That’s fine as a string – it’s even a Unicode string according to the spec (item D80). However, it’s ill-formed (item D84). That’s because the UTF-16 code unit 0xD800 doesn’t map to a Unicode scalar value (item D76) – the set of Unicode scalar values explicitly excludes the high/low surrogate code points.

Just in case you’re new to surrogate pairs: UTF-16 only deals in 16-bit code units, which means it can’t cope with the whole of Unicode (which ranges from U+0000 to U+10FFFF inclusive). If you want to represent a value greater than U+FFFF in UTF-16, you need to use two UTF-16 code units: a high surrogate (in the range 0xD800 to 0xDBFF) followed by a low surrogate (in the range 0xDC00 to 0xDFFF). So a high surrogate on its own makes no sense. It’s a valid UTF-16 code unit in itself, but it only has meaning when followed by a low surrogate.
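
If you want to check a string for this kind of problem programmatically, here’s a minimal sketch (mine, not anything from the spec) which walks the UTF-16 code units and validates that every surrogate belongs to a pair:

using System;

class WellFormedCheck
{
    // Returns true if the string is well-formed UTF-16, i.e. every high
    // surrogate is immediately followed by a low surrogate, and no low
    // surrogate appears on its own.
    static bool IsWellFormed(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsHighSurrogate(text[i]))
            {
                if (i + 1 == text.Length || !char.IsLowSurrogate(text[i + 1]))
                {
                    return false;
                }
                i++; // Skip the low surrogate we just matched.
            }
            else if (char.IsLowSurrogate(text[i]))
            {
                return false; // Low surrogate without a preceding high surrogate.
            }
        }
        return true;
    }

    static void Main()
    {
        Console.WriteLine(IsWellFormed("X\uD800Y"));       // False
        Console.WriteLine(IsWellFormed("X\uD800\uDC00Y")); // True
    }
}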

Show me some code!

So what does this have to do with C#? Well, string constants have to be represented in IL somehow. As it happens, there are two different representations: most of the time, UTF-16 is used, but attribute constructor arguments use UTF-8.

Let’s take an example:

using System;
using System.ComponentModel;
using System.Text;
using System.Linq;

[Description(Value)]
class Test
{
    const string Value = "X\ud800Y";

    static void Main()
    {
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }

    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}

The output of this code (under .NET) is:

Attribute: 0058 fffd fffd 0059
Constant: 0058 d800 0059

As you can see, the “constant” (Test.Value) has been preserved as a sequence of UTF-16 code units, but the attribute property contains U+FFFD characters (the Unicode replacement character, which is used to indicate broken data when decoding binary to text). Let’s dig a little deeper and look at the IL for the attribute and the constant:

.custom instance void [System]System.ComponentModel.DescriptionAttribute::.ctor(string)
= ( 01 00 05 58 ED A0 80 59 00 00 )

.field private static literal string Value
= bytearray (58 00 00 D8 59 00 )

The format of the constant (Value) is really simple – it’s just little-endian UTF-16. The format of the attribute is specified in ECMA-335 section II.23.3. Here, the meaning is:

  • Prolog (01 00)
  • Fixed arguments (for specified constructor signature)
    • 05 58 ED A0 80 59 (a single string argument as a SerString)
      • 05 (the length, i.e. 5, as a PackedLen)
      • 58 ED A0 80 59 (the UTF-8-encoded form of the string)
  • Number of named arguments (00 00)
  • Named arguments (there aren’t any)

The interesting part is the “UTF-8-encoded form of the string” here. It’s not valid UTF-8, because the input isn’t a well-formed string. The compiler has taken the high surrogate, determined that there isn’t a low surrogate after it, and just treated it as a value to be encoded in the normal UTF-8 way of encoding anything in the range U+0800 to U+FFFF inclusive.
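
To see where ED A0 80 comes from, here’s a small sketch of that arithmetic – just the standard three-byte UTF-8 bit pattern applied to the code unit 0xD800, not the compiler’s actual code:

using System;

class ThreeByteEncoding
{
    static void Main()
    {
        int codeUnit = 0xD800;
        // Standard 3-byte UTF-8 layout: 1110xxxx 10xxxxxx 10xxxxxx
        byte b1 = (byte) (0xE0 | (codeUnit >> 12));         // 0xED
        byte b2 = (byte) (0x80 | ((codeUnit >> 6) & 0x3F)); // 0xA0
        byte b3 = (byte) (0x80 | (codeUnit & 0x3F));        // 0x80
        Console.WriteLine("{0:X2} {1:X2} {2:X2}", b1, b2, b3); // ED A0 80
    }
}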

It’s worth noting that if we had a full surrogate pair, UTF-8 would encode the single Unicode scalar value being represented, using 4 bytes. For example, if we change the declaration of Value to:

const string Value = "X\ud800\udc00Y";

then the UTF-8 bytes in the IL are 58 F0 90 80 80 59 – where F0 90 80 80 is the UTF-8 encoding for U+10000. That’s a well-formed string, and we get the same value for both the description attribute and the constant.

So in our original example, the string constant (encoded as UTF-16 in the IL) is just decoded without checking whether or not it’s ill-formed, whereas the attribute argument (encoded as UTF-8) is decoded with extra validation, which detects the ill-formed code unit sequence and replaces it.

Encoding behaviour

So which approach is right? According to the Unicode specification (item C10) both could be fine:

When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.

and

Conformant processes cannot interpret ill-formed code unit sequences. However, the conformance clauses do not prevent processes from operating on code unit sequences that do not purport to be in a Unicode character encoding form. For example, for performance reasons a low-level string operation may simply operate directly on code units, without interpreting them as characters. See, especially, the discussion under D89.

It’s not at all clear to me whether either the attribute argument or the constant value “purports to be in a Unicode character encoding form”. In my experience, very few pieces of documentation or specification are clear about whether they expect a piece of text to be well-formed or not.

Additionally, System.Text.Encoding implementations can often be configured to determine how they behave when encoding or decoding ill-formed data. For example, Encoding.UTF8.GetBytes(Value) returns byte sequence 58 EF BF BD 59 – in other words, it spots the bad data and replaces it with U+FFFD as part of the encoding… so decoding this value will result in X U+FFFD Y with no problems. On the other hand, if you use new UTF8Encoding(true, true).GetBytes(Value), an exception will be thrown. The first constructor argument is whether or not to emit a byte order mark under certain circumstances; the second one is what dictates the encoding behaviour in the face of invalid data, along with the EncoderFallback and DecoderFallback properties.
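
Here’s a short sketch illustrating that difference (mine, and not exhaustive – the fallback mechanism has more options than shown here):

using System;
using System.Text;

class EncodingBehaviour
{
    const string Value = "X\ud800Y";

    static void Main()
    {
        // The default UTF8 encoding silently replaces the lone surrogate.
        byte[] lenient = Encoding.UTF8.GetBytes(Value);
        Console.WriteLine(BitConverter.ToString(lenient));  // 58-EF-BF-BD-59

        // Arguments: emit a byte order mark preamble; throw on invalid data.
        var strict = new UTF8Encoding(true, true);
        try
        {
            strict.GetBytes(Value);
        }
        catch (EncoderFallbackException)
        {
            Console.WriteLine("Strict encoding threw EncoderFallbackException");
        }
    }
}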

Language behaviour

So should this compile at all? Well, the language specification doesn’t currently prohibit it – but specifications can be changed :)

In fact, both csc and Roslyn do prohibit the use of ill-formed strings with certain attributes. For example, with DllImportAttribute:

[DllImport(Value)]
static extern void Foo();

This gives an error when Value is ill-formed:

error CS0591: Invalid value for argument to 'DllImport' attribute

There may be other attributes this is applied to as well; I’m not sure.

If we take it as read that the ill-formed value won’t be decoded back to its original form when the attribute is instantiated, I think it would be entirely reasonable to make it a compile-time failure – for attributes. (This is assuming that the runtime behaviour can’t be changed to just propagate the ill-formed string.)

What about the constant value though? Should that be allowed? Can it serve any purpose? Well, the precise value I’ve given is probably not terribly helpful – but it could make sense to have a string constant which ends with a high surrogate or starts with a low surrogate… because it can then be combined with another string to form a well-formed UTF-16 string. Of course, you should be very careful about this sort of thing – read Unicode Technical Report 36, “Unicode Security Considerations”, for some thoroughly alarming possibilities.
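
As a concrete sketch of that idea (mine, not from any real codebase): two constants which are individually ill-formed, but whose concatenation is fine.

using System;

class SplitSurrogates
{
    const string HighHalf = "X\uD83D"; // ends with a lone high surrogate
    const string LowHalf = "\uDE00Y";  // starts with a lone low surrogate

    static void Main()
    {
        string combined = HighHalf + LowHalf;
        // The concatenation is well-formed: X, U+1F600 (as a surrogate pair), Y.
        Console.WriteLine(combined.Length);                   // 4
        Console.WriteLine(char.IsSurrogatePair(combined, 1)); // True
    }
}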

Corollaries

One interesting aspect to all of this is that “string encoding arithmetic” doesn’t behave as you might expect it to. For example, consider this method:

// Bad code!
string SplitEncodeDecodeAndRecombine
    (string input, int splitPoint, Encoding encoding)
{
    byte[] firstPart = encoding.GetBytes(input.Substring(0, splitPoint));
    byte[] secondPart = encoding.GetBytes(input.Substring(splitPoint));
    return encoding.GetString(firstPart) + encoding.GetString(secondPart);            
}

You might expect that this would be a no-op so long as everything is non-null and splitPoint is within range… but if you happen to split in the middle of a surrogate pair, it’s not going to be happy. There may well be other potential problems lurking there, depending on things like normalization form – I don’t think so, but at this point I’m unwilling to bet too heavily on string behaviour.
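
Here’s a demonstration of that failure mode – a sketch with the method above repeated (and made static) so that it’s self-contained:

using System;
using System.Text;

class RoundTripDemo
{
    // Same "bad code" as above, made static so it can be called from Main.
    static string SplitEncodeDecodeAndRecombine(
        string input, int splitPoint, Encoding encoding)
    {
        byte[] firstPart = encoding.GetBytes(input.Substring(0, splitPoint));
        byte[] secondPart = encoding.GetBytes(input.Substring(splitPoint));
        return encoding.GetString(firstPart) + encoding.GetString(secondPart);
    }

    static void Main()
    {
        // X, U+1F600 (stored as a surrogate pair), Y.
        string input = "X\uD83D\uDE00Y";
        // Splitting at index 2 cuts the surrogate pair in half; each lone
        // surrogate is replaced with U+FFFD during encoding.
        string output = SplitEncodeDecodeAndRecombine(input, 2, Encoding.UTF8);
        Console.WriteLine(output == input); // False
    }
}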

If you think the above code is unrealistic, just imagine partitioning a large body of text, whether that’s across network packets, files, or whatever. You might feel clever for realizing that without a bit of care you’d end up splitting the binary data in the middle of a UTF-16 code unit… but even handling that doesn’t save you. Yikes.

I’m tempted to swear off text data entirely at this point. Floating point is a nightmare, dates and times… well, you know my feelings about those. I wonder what projects are available that only need to deal with integers, and where all operations are guaranteed not to overflow. Let me know if you have any.

Conclusion

Text is hard.