C# 6 in action

Now that the Visual Studio 2015 Preview is available and the C# 6 feature set is a bit more stable, I figured it was time to start updating the Noda Time 2.0 source code to C# 6. The target framework is still .NET 3.5 (although that might change; I gather very few developers are actually going to be hampered by a change to target 4.0 if that would make things easier) but we can still take advantage of all the goodies C# 6 has in store.

I’ve checked all the changes into a dedicated branch which will only contain changes relevant to C# 6 (although a couple of tiny other changes have snuck in). When I’ve got round to updating my continuous integration server, I’ll merge onto the default branch, but I’m in no rush. (I’ll need to work out what to do about Mono at that point, too – there are various options there.)

In this post, I’ll go through the various C# 6 features, and show how useful (or otherwise) they are in Noda Time.

Read-only automatically implemented properties (“autoprops”)

Finally! I’ve been waiting for these for ages. You can specify a property with just a blank getter, and then assign it a value from either the declaration statement, or within a constructor/static constructor.

So for example, in DateTimeZone, this:

private static readonly DateTimeZone UtcZone = new FixedDateTimeZone(Offset.Zero);
public static DateTimeZone Utc { get { return UtcZone; } }

becomes

public static DateTimeZone Utc { get; } = new FixedDateTimeZone(Offset.Zero);

and

private readonly string id;
public string Id { get { return id; } }
protected DateTimeZone(string id, ...)
{
    this.id = id;
    ...
}

becomes

public string Id { get; }
protected DateTimeZone(string id, ...)
{
    this.Id = id;
    ...
}

As I mentioned before, I’ve been asking for this feature for a very long time – so I’m slightly surprised to find myself not entirely positive about the results. The problem it introduces isn’t really new – it’s just one that I’m not used to, as I haven’t used automatically-implemented properties much in a large code base. The issue is consistency.

With separate fields and properties, if you knew you didn’t need any special behaviour due to the properties when you accessed the value within the same type, you could always use the fields. With automatically-implemented properties, the incidental fact that a field is also exposed as a property changes the code – because now the whole class refers to it as a property instead of as a field.

I’m sure I’ll get used to this – it’s just a matter of time.

Initial values for automatically-implemented properties

The ability to specify an initial value for automatically-implemented properties applies to writable properties as well. I haven’t used that in the main body of Noda Time (almost all Noda Time types are immutable), but here’s an example from PatternTestData, which is used to provide data for text parsing/formatting tests. The code before:

internal CultureInfo Culture { get; set; }

public PatternTestData(...)
{
    ...
    Culture = CultureInfo.InvariantCulture;
}

And after:

internal CultureInfo Culture { get; set; } = CultureInfo.InvariantCulture;

public PatternTestData(...)
{
    ...
}

It’s worth noting that just like with a field initializer, you can’t call instance members within the same class when initializing an automatically-implemented property.

Expression-bodied members

I’d expected to like this feature… but I hadn’t expected to fall in love with it quite as much as I’ve done. It’s really simple to describe – if you’ve got a read-only property, or a method or operator, and the body is just a single return statement (or any other simple statement for a void method), you can use => to express that instead of putting the body in braces. It’s worth noting that this is not a lambda expression, nor is the compiler converting anything to delegates – it’s just a different way of expressing the same thing. Three examples of this from LocalDateTime (one property, one operator, one method – they’re not next to each other in the original source code, but it makes it simpler for this post):

public int Year { get { return date.Year; } }

public static LocalDateTime operator +(LocalDateTime localDateTime, Period period)
{
    return localDateTime.Plus(period);
}

public static LocalDateTime Add(LocalDateTime localDateTime, Period period)
{
    return localDateTime.Plus(period);
}

becomes

public int Year => date.Year;

public static LocalDateTime operator +(LocalDateTime localDateTime, Period period) =>
    localDateTime.Plus(period);

public static LocalDateTime Add(LocalDateTime localDateTime, Period period) =>
    localDateTime.Plus(period);

In my actual code, the operator and method each only take up a single (pretty long) line. For some other methods – particularly ones where the body has a comment – I’ve split it into multiple lines. How you format your code is up to you, of course :)

So what’s the benefit of this? Why do I like it? It makes the code feel more functional. It makes it really clear which methods are just shorthand for some other expression, and which really do involve a series of steps. It’s far too early to claim that this improves the quality of the code or the API, but it definitely feels nice. One interesting data point – using this has removed about half of the return statements across the whole of the NodaTime production assembly. Yup, we’ve got a lot of properties which just delegate to something else – particularly in the core types like LocalDate and LocalTime.

The nameof operator

This was a no-brainer in Noda Time. We have a lot of code like this:

public void Foo([NotNull] string x)
{
    Preconditions.CheckNotNull(x, "x");
    ...
}

This trivially becomes:

public void Foo([NotNull] string x)
{
    Preconditions.CheckNotNull(x, nameof(x));
    ...
}

Checking that every call to Preconditions.CheckNotNull (and CheckArgument etc) uses nameof and that the name is indeed the name of a parameter is going to be one of the first code diagnostics I write in Roslyn, when I finally get round to it. (That will hopefully be very soon – I’m talking about it at CodeMash in a month!)

Dictionary initializers

We’ve been able to use collection initializers with dictionaries since C# 3, using the Add method. C# 6 adds the ability to use the indexer too, which leads to code which looks more like normal dictionary access. As an example, I’ve changed a “field enum value to delegate” dictionary in TzdbStreamData from

private static readonly Dictionary<TzdbStreamFieldId, Action<Builder, TzdbStreamField>> FieldHanders =
    new Dictionary<TzdbStreamFieldId, Action<Builder, TzdbStreamField>>
{
   { TzdbStreamFieldId.StringPool, (builder, field) => builder.HandleStringPoolField(field) },
   { TzdbStreamFieldId.TimeZone, (builder, field) => builder.HandleZoneField(field) },
   { TzdbStreamFieldId.TzdbIdMap, (builder, field) => builder.HandleTzdbIdMapField(field) },
   ...
}

to:

private static readonly Dictionary<TzdbStreamFieldId, Action<Builder, TzdbStreamField>> FieldHanders =
    new Dictionary<TzdbStreamFieldId, Action<Builder, TzdbStreamField>>
{
    [TzdbStreamFieldId.StringPool] = (builder, field) => builder.HandleStringPoolField(field),
    [TzdbStreamFieldId.TimeZone] = (builder, field) => builder.HandleZoneField(field),
    [TzdbStreamFieldId.TzdbIdMap] = (builder, field) => builder.HandleTzdbIdMapField(field),
...
}

One downside of this is that the initializer will now not throw an exception if the same key is specified twice. So whereas the bug in this code is obvious immediately:

var dictionary = new Dictionary<string, string>
{
    { "a", "b" },
    { "a", "c" }
};

if you convert it to similar code using the indexer:

var dictionary = new Dictionary<string, string
{
    ["a"] = "b",
    ["a"] = "c",
};

… you end up with a dictionary which only has a single value.

To be honest, I’m now pretty used to the syntax which uses Add – so even though there are some other dictionaries initialized with collection initializers in Noda Time, I don’t think I’ll be changing them.

Using static members

For a while I didn’t think I was going to use this much – and then I remembered NodaConstants. The majority of the constants here are things like MillisecondsPerHour, and they’re used a lot in some of the core types like Duration. The ability to add a using directive for a whole type, which imports all the members of that type, allows code like this:

public int Seconds =>
    unchecked((int) ((NanosecondOfDay / NodaConstants.NanosecondsPerSecond) % NodaConstants.SecondsPerMinute));

to become:

using NodaTime.NodaConstants;
...
public int Seconds =>
    unchecked((int) ((NanosecondOfDay / NanosecondsPerSecond) % SecondsPerMinute));

Expect to see this to be used a lot in trigonometry code, making all those calls to Math.Cos, Math.Sin etc a lot more readable.

Another benefit of this syntax is to allow extension methods to be imported just from an individual type instead of from a whole namespace. In Noda Time 2.0, I’m introducing a NodaTime.Extensions namespace with extensions to various BCL types (to allow more fluent conversions such as DateTimeOffset.ToOffsetDateTime()) – I suspect that not everyone will want all of these extensions to be available all the time, so the ability to import selectively is very welcome.

String interpolation

We don’t use the system default culture much in Noda Time, which the string interpolation feature always does, so string interpolation isn’t terribly useful – but there are a few cases where it’s handy.

For example, consider this code:

throw new KeyNotFoundException(
    string.Format("No calendar system for ID {0} exists", id));

With C# 6 in the VS2015 preview, this has become

throw new KeyNotFoundException("No calendar system for ID \{id} exists");

Note that the syntax of this feature is not finalized yet – I expect to have to change this for the final release to:

throw new KeyNotFoundException($"No calendar system for ID {id} exists");

It’s always worth considering places where a feature could be used, but probably shouldn’t. ZoneInterval is one such place. Its ToString() feature looks like this:

public override string ToString() =>
    string.Format("{0}: [{1}, {2}) {3} ({4})",
        Name,
        HasStart ? Start.ToString() : "StartOfTime",
        HasEnd ? End.ToString() : "EndOfTime",
        WallOffset,
        Savings);

I tried using string interpolation here, but it ended up being pretty horrible:

  • String literals within string literals look very odd
  • An interpolated string literal has to be on a single line, which ended up being very long
  • The fact that two of the arguments use the conditional operator makes them harder to read as part of interpolation

Basically, I can see string interpolation being great for “simple” interpolation with no significant logic, but less helpful for code like this.

Null propagation

Amazingly, I’ve only found a single place to use null propagation in Noda Time so far. As a lot of the types are value types, we don’t do a lot of null checking – and when we do, it’s typically to throw an exception if the value is null. However, this is the one place I’ve found so far, in BclDateTimeZone.ForSystemDefault. The original code is:

if (currentSystemDefault == null || currentSystemDefault.OriginalZone != local)

With null propagation we can handle “we don’t have a cached system default” and “the cached system default is for the wrong time zone” with a single expression:

if (currentSystemDefault?.OriginalZone != local)

(Note that local will never be null, or this transformation wouldn’t be valid.)

There may well be a few other places this could be used, and I’m certain it’ll be really useful in a lot of other code – it’s just that the Noda Time codebase doesn’t contain much of the kind of code this feature is targeted at.

Conclusion

What a lot of features! C# 6 is definitely a “lots of little features” release rather than the “single big feature” releases we’ve seen with C# 4 (dynamic) and C# 5 (async). Even C# 3 had a lot of little features which all served the common purpose of LINQ. If you had to put a single theme around C# 6, it would probably be making existing code more concise – it’s the sort of feature set that really lends itself to this “refactor the whole codebase to tidy it up” approach.

I’ve been very pleasantly surprised by how much I like expression-bodied members, and read-only automatically implemented properties are a win too (even though I need to get used to it a bit). Other features such as static imports are definitely welcome to remove some of the drudgery of constants and provide finer-grained extension method discovery.

Altogether, I’m really pleased with how C# 6 has come together – I’m looking forward to merging the C# 6 branch into the main Noda Time code base as soon as I’ve got my continuous integration server ready for it…

When is an identifier not an identifier? (Attack of the Mongolian Vowel Separator)

Here’s a few things you may not be aware of:

  • C# identifiers can include Unicode escape sequences (\u1234 etc)
  • C# identifiers can include Unicode characters in the category “Other, formatting” (Cf) but these are ignored when comparing identifiers for equality
  • The Mongolian Vowel Separator (U+180E) has oscillated between the Cf and Zs categories a couple of times
  • .NET has its own copy of Unicode categories, separate from whatever Win32 might provide
  • Roslyn (built in .NET) uses the Unicode categories, whereas csc.exe (the “old” native C# compiler) uses either the Win32 categories or a built-in copy
  • Neither the .NET table nor the Win32 table necessarily reflects exactly what any one version of the Unicode standard says
  • Compilers can have bugs in

Put them together, and chaos ensues!

How this all started – blame Vladimir

I started looking at this based on a discussion in our ECMA technical group meeting last week, when we were considering the normative references – and in particular, which version of Unicode we were going to target. Currently the ECMA 4th edition spec targets Unicode 4.0 and the Microsoft C# 5 specification targets Unicode 3.0. It’s not clear to me whether any compilers actually take note of this, and moving forward we’d like both the ECMA and Microsoft standards to not specify a particular version of Unicode, effectively encouraging compiler authors to use the most recent one available to them. Despite the wrinkles listed below, I think that makes the most sense for real world uses – it’s crazy to require compilers to ship with their own private copies of Unicode, effectively.

When discussing this, Vladimir Reshetnikov mentioned the Mongolian Vowel Separator (U+180E) which has had an interesting life. It was introduced in Unicode 3.0.0, when it was in the Cf category (“other, formatting”). Then in Unicode 4.0.0 it was moved into the Zs (separator, space) category. In Unicode 6.3.0 it was then moved back to the Cf category.

Of course, my natural inclination was to try to abuse this. My initial aim was to come up with code which behaved differently depending on which version of Unicode the compiler was using. It turned out to be a little more complicated than that, but we’ll assume a hypothetical compiler first, with no bugs, but which obeys whichever version of the Unicode standard we want it to. (Arguably that’s already a bug given the requirements of the current C# specs, but we’ll set that aside.)

Hypothetical example 1: valid or invalid

For simplicity, let’s start with some source code which is all in ASCII:

class MvsTest
{
    static void Main()
    {
        string stringx = "a";
        string\u180ex = "b";
        Console.WriteLine(stringx);
    }
}

If the compiler is using Unicode 6.3 or higher (or a version earlier than 4.0) then U+180E is deemed to be in the Cf category, and is therefore valid within an identifier. In that case, it’s fine for it to be escaped as per the code above. At that point, the identifier in the second line of the method is deemed to be “the same as” stringx, so the output is b.

What about a compiler using a version of Unicode between 4.0 and 6.2 (inclusive) though? At that point, U+180E is deemed to be in the Zs category, which makes it a whitespace character. Zs characters are allowed as whitespace within C# programs – but not within identifiers. Once it’s not a valid identifier – and because this isn’t within a character/string literal – it’s invalid to use the Unicode escape sequence, so the code doesn’t compile.

Hypothetical example 2: valid two different ways

We can write the same code without using an escape sequence, however. If you create a regular ASCII file like this:

class MvsTest
{
    static void Main()
    {
        string stringx = "a";
        stringAAAx = "b";
        Console.WriteLine(stringx);
    }
}

then open it up in a hex editor and replace the AAA with bytes E1 A0 8E, then you’ve got a file containing the UTF-8 representation of U+180E at the same location as we had the Unicode escape sequence in the first version.

So, a compiler which compiled the first version would still compile this version (assuming you could tell it that the source code was UTF-8), and the results would be the same – it would print b as the second statement of the method would be a simple assignment to the existing variable.

However, a compiler which treats U+180E as whitespace and would therefore treat the first program as an error would accept this program – and treat that second statement as a declaration of a second local variable (x) and assign it an initial value. You might get a warning about the variable being unused, but it’s valid C# and the output is a.

Reality: the Microsoft compilers

Whenever we talk about the Microsoft C# compiler these days, we need to distinguish between the “native” compiler (csc) and Roslyn (rcsc, although typically I just call it Roslyn).

As it’s written in native code, csc uses whatever Windows supplies for its Unicode character tables – or it embeds it directly in the executable, potentially. (I’ve been scouring MSDN to find a Win32 native function to tell me the Unicode category of a specific code point, and failed so far. It would have been useful…)

Compare that with Roslyn, which is written in C# and (as far as I’m aware) uses char.GetUnicodeCategory – which in turn uses the Unicode tables built into mscorlib.

My experiments suggest that whatever the native compiler uses to get the Unicode category has treated U+180E as a formatting character forever. At least, I’ve tried to find old machines (including VM images) which haven’t had Windows update applied since September 2013 (which is when Unicode 6.3 was published) and they all compile the first program listed above. I’m beginning to suspect that csc might actually have a copy of Unicode 3.0 built into it; it certainly treats U+180E as a formatting character, but doesn’t like either U+0600 or U+00AD within identifiers. (U+0600 wasn’t introduced until Unicode 4.0, but has always been a formatting character; U+00AD was a “dash punctuation” character in Unicode 3.0, and became a formatting character in 4.0.)

The table built into mscorlib has definitely changed over time, however. If you run a simple program such as this:

using System;

class Test
{
    static void Main()
    {
        Console.WriteLine(Environment.Version);
        Console.WriteLine(char.GetUnicodeCategory('\u180e'));
    }
}

then running under CLRv2, the result is “SpaceSeparator” whereas running on CLRv4 (at least on a recently-updated system), the result is “Format”.

Of course, Roslyn won’t run under old CLRs, but we have hope by way of csharppad.com – which runs Roslyn in an environment (of uncertain origin – Mono? I’m unsure) which prints “SpaceSeparator” for the above. Sure enough the first program fails to compile – but it’s harder to check the second program, as csharppad.com doesn’t allow you to upload source code, and copy/paste produces some odd results.

Reality: mcs (Mono C# compiler)

Mono’s compiler uses the BCL GetUnicodeCategory code too, which should make it significantly simpler to experiment – but unfortunately, the Mono parser has (at least) two bugs in it:

  • It will allow any Unicode escape sequence in an identifier, whether it’s an escape sequence for a valid identifier part or not. For example, string\u0020x = "" is valid under the Mono compiler. Filed as bug 24968. Source.
  • It doesn’t allow formatting characters within identifiers – it includes characters in classes Mn, Mc, Nd and Pc, but not Cf. Filed as bug 24969. Source.

For this reason, the first program always compiles and prints “b” whereas the second program always fails to compile, regardless of whether U+180E is treated as being in Zs or Cf.

What version is this, anyway?

Next, let’s think about the Unicode data itself. It’s not at all clear which version any particular BCL implementation is actually using. Consider this little program:

using System;

class Test
{
    static void Main()
    {
        Console.WriteLine(char.GetUnicodeCategory('\u00ad'));
        Console.WriteLine(char.GetUnicodeCategory('\u0600'));
        Console.WriteLine(char.GetUnicodeCategory('\u180e'));
    }
}

On my computer, under CLR v4 this prints “DashPunctuation, Format, Format”, and under both Mono (3.3.0) and CLR v2 it prints “DashPunctuation, Format, SpaceSeparator”.

That’s very odd. It doesn’t correspond with any version of the Unicode standard, as far as I can tell:

  • U+00AD was a Po (other, punctuation) character in Unicode 1.x, then Pd (dash, punctuation) in 2.x and 3.x, and from 4.0 onwards has been Cf.
  • U+0600 was only introduced in Unicode 4.0, and has always been Cf
  • U+180E introduced as Cf in 3.0, then changed to Zs in 4.0, then back to Cf in 6.3.

So there is no version where the first line agrees with either the second line or the third line. I’m basically a bit baffled by this.

What about nameof and CallerMemberName?

The names of identifiers aren’t only used for comparisons – they’re available as strings without any reflection being involved at all. From C# 5, we’ve had CallerMemberName attribute, allowing things like:

public static void X\u0600y()
{
    ShowCaller();
}

public static void ShowCaller([CallerMemberName] string caller = null)
{
    Console.WriteLine("Called by {0}", caller);
}

And in C# 6, we can write:

string x\u0600y = "";
Console.WriteLine("nameof = {0}", nameof(x\u0600y));

What should those print? They do just print “Xy” and “xy” as the names (respectively), as if the compiler has simply thrown away the formatting character entirely. But what should they print? Bear in mind that in the second case, we could easily have used nameof(xy) and that would still have compared equal to the declared identifier.

We can’t even say “What’s the name of the member being declared?” because you can overload with “different but equal” identifiers:

public static void Xy() {}
public static void X\u0600y() {}
public static void X\u070fy() {}
...
Console.WriteLine(nameof(X\u200by));

What should that print out? I’m sure you’ll be relieved to hear that the C# team has a plan in place – but fundamentally this is one of these “no obvious right answer” scenarios. It gets even weirder when you bring the CLI specification into the mix. Section I.8.5.1 of ECMA-335 6th edition has:

Assemblies shall follow Annex 7 of Technical Report 15 of the Unicode Standard
3.0 governing the set of characters permitted to start and be included in identifiers, available online
at http://www.unicode.org/unicode/reports/tr15/tr15-18.html. Identifiers shall be in the
canonical format defined by Unicode Normalization Form C. For CLS purposes, two identifiers
are the same if their lowercase mappings (as specified by the Unicode locale-insensitive, one-to-one
lowercase mappings) are the same. That is, for two identifiers to be considered different
under the CLS they shall differ in more than simply their case. However, in order to override an
inherited definition the CLI requires the precise encoding of the original declaration be used.

I would love to explore the impact of this by adding a Cf character into IL, but unfortunately I haven’t worked out a way of affecting the encoding of ilasm, in order to persuade it that my hacked up IL is what I want it to be.

Conclusion

As noted before, text is hard.

It turns out that even when restricting oneself to identifiers, text is hard. Who would’ve thought?