Clean event handler invocation with C# 6

The problem

Invoking event handlers in C# has always been a bit of a pain, because an event with no subscribers is usually represented as a null reference. This leads to code like this:

public event EventHandler Foo;

public void OnFoo()
{
    EventHandler handler = Foo;
    if (handler != null)
    {
        handler(this, EventArgs.Empty);
    }   
}

It’s important to use the handler local variable: if you access the field twice instead, it’s possible that the last subscriber will unsubscribe between the check and the invocation:

// Bad code, do not use!
if (Foo != null)
{
    // Foo could be null here, if the class is intended
    // to be used from other threads.
    Foo(this, EventArgs.Empty);
}

Now this can be simplified slightly using an extension method:

public static void Raise(this EventHandler handler, object sender, EventArgs args)
{
    if (handler != null)
    {
        handler(sender, args);
    }   
}

Then raising each event takes just a single line:

public void OnFoo()
{
    Foo.Raise(this, EventArgs.Empty);
}

However, this means having a different extension method for each delegate type. It’s not too bad if you’re using EventHandler but it’s still not ideal.
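
For completeness, here’s a sketch of the generic overload you’d also want for EventHandler<TEventArgs> – and you’d still need more overloads for any custom delegate types:

public static void Raise<TEventArgs>(this EventHandler<TEventArgs> handler,
    object sender, TEventArgs args) where TEventArgs : EventArgs
{
    // Same null check as before, just generalized over the EventArgs type.
    if (handler != null)
    {
        handler(sender, args);
    }
}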

C# 6 to the rescue!

The null-conditional operator (?.) in C# 6 isn’t just for properties. It can also be used for method calls. The compiler does the right thing (evaluating the expression only once) so you can do without the extension method entirely:

public void OnFoo()
{
    Foo?.Invoke(this, EventArgs.Empty);
}

Hooray! This will never throw a NullReferenceException, and doesn’t need any extra utility classes.

Admittedly it might be nicer if you could write Foo?(this, EventArgs.Empty) but that would no longer be a ?. operator, so would complicate the language quite a bit, I suspect. The extra slight cruft of Invoke really doesn’t bother me much.

What is this thing you call thread-safe?

The code we’ve got so far is “thread-safe” in that it doesn’t matter what other threads do – you won’t get a NullReferenceException from the above code. However, if other threads are subscribing to the event or unsubscribing from it, you might not see the most recent changes for the normal reasons of memory models being complicated.

As of C# 4, field-like events are implemented using Interlocked.CompareExchange, so we can just use a corresponding Interlocked.CompareExchange call to make sure we get the most recent value. There’s nothing new about being able to do that, admittedly, but it does mean we can just write:

public void OnFoo()
{
    Interlocked.CompareExchange(ref Foo, null, null)?.Invoke(this, EventArgs.Empty);
}

with no other code, to invoke the absolute latest set of event subscribers, without risking a NullReferenceException. Thanks to David Fowler for reminding me about this aspect.

Admittedly the CompareExchange call is ugly. In .NET 4.5 and up, there’s Volatile.Read which may do the trick, but it’s not entirely clear to me (based on the documentation) whether it actually does the right thing. (The summary suggests it’s about preventing the movement of later reads/writes earlier than the given volatile read; we want to prevent earlier writes from being moved later.)

public void OnFoo()
{
    // .NET 4.5+, may or may not be safe...
    Volatile.Read(ref Foo)?.Invoke(this, EventArgs.Empty);
}

… but that makes me nervous in terms of whether I’ve missed something. Expert readers may well be able to advise me on why this is sufficiently foolish that it’s not in the BCL.

An alternative approach

One alternative approach I’ve used in the past is to create a dummy event handler, usually using the one feature that anonymous methods have over lambda expressions – the ability to indicate that you don’t care about the parameters by not even specifying a parameter list:

public event EventHandler Foo = delegate {};

public void OnFoo()
{
    // Foo will never be null
    Volatile.Read(ref Foo).Invoke(this, EventArgs.Empty);   
}

This has all the same memory barrier issues as before, but it does mean you don’t have to worry about the nullity aspect. It looks a little odd and presumably there’s a tiny performance penalty, but it’s a good alternative option to be aware of.

Conclusion

Yup, C# 6 rocks again. Really looking forward to the final release.

C# 7 feature request #1… extension attributes

Last week I learned that using static is going to be the syntax for importing static members (including extension methods) in C# 6. That fulfils a feature request I made in September 2005 (my fourth ever blog post, as it happens). With a feature request turnaround of 10 years, I figure I should put everything I could ever want out there now… (Just kidding really – more seriously, I’m really pleased to see this change in C# 6, relative to both C# 5 and the earlier designs of C# 6.)

Last week at CodeMash, Dustin Campbell demonstrated one Roslyn DiagnosticAnalyzer for ArgumentNullException and friends, and I demonstrated one I’d written for the JetBrains InvokerParameterName attribute. The two diagnostics check largely the same thing: some parameters are “special” in that the argument for them should be a parameter in the current context, and should use the C# 6 nameof operator.

The most significant difference between them (beyond quality – it’s not hard to spot which is written by a C# team member…) is that my diagnostic spots these special parameters because they are decorated with a particular attribute, whereas Dustin’s has the relevant information hard-coded within the diagnostic. Mine can’t spot the ArgumentNullException constructor, and Dustin’s can’t spot the Preconditions.CheckNotNull method in Noda Time. It would be really nice if I could pretend that certain members of existing types had particular attributes applied to them. For example, wouldn’t it be nice if I could write something like:

using JetBrains.Annotations;

namespace System
{
    public extern partial class ArgumentNullException
    {
        public ArgumentNullException([InvokerParameterName] string paramName);
    }
}

The rules would be roughly like this:

  • A type declared with public extern partial modifiers would indicate that attributes should be applied to an existing type.
  • Each member listed would be checked for existence, and any attributes present in the declaration would be noted in the IL for the “extending” assembly.
  • A new call on Type, MethodInfo etc would be created to allow all attributes to be fetched for a member, whether “extended” or not. This would find all attributes contributed by all assemblies loaded in the current AppDomain (so you’d need to make sure that you did something to initialize any assembly containing extension attributes).
  • A similar new call would be available within Roslyn – we’d need to think carefully about which assemblies it examined, of course.

By making this an “opt-in” mechanism from the examiner perspective, it would be safe – anything dealing in security attributes would want to use the existing mechanism, for example.

This wouldn’t just be useful for diagnostic purposes, mind you. There are a number of situations where you might want to augment existing types with more metadata, whether those attributes are your own custom ones or existing ones in the framework. Want to apply an attribute-based serialization framework to a different third-party type? Sure, just use those extensions. Want to give system enums localization-oriented attributes for your own framework? No problem.
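
As a sketch of that last idea, using the same hypothetical syntax – LocalizationKey here is an imaginary attribute, purely for illustration:

using MyFramework.Localization;

namespace System
{
    // Hypothetical: applies extension attributes to the existing System.DayOfWeek enum.
    public extern partial enum DayOfWeek
    {
        [LocalizationKey("day-sunday")] Sunday,
        [LocalizationKey("day-monday")] Monday,
        // ... and so on for the remaining days
    }
}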

I’m not actually expecting this feature to go very far – it’s relatively niche, as well as possibly requiring CLR changes (rather than just framework and language changes). Still, having thought of it, I decided it would be odd to keep it to myself. And hey, it’s a good excuse to create the C# 7 category on my blog…

C# 6 in action

Now that the Visual Studio 2015 Preview is available and the C# 6 feature set is a bit more stable, I figured it was time to start updating the Noda Time 2.0 source code to C# 6. The target framework is still .NET 3.5 (although that might change; I gather very few developers are actually going to be hampered by a change to target 4.0 if that would make things easier) but we can still take advantage of all the goodies C# 6 has in store.

I’ve checked all the changes into a dedicated branch which will only contain changes relevant to C# 6 (although a couple of tiny other changes have snuck in). When I’ve got round to updating my continuous integration server, I’ll merge onto the default branch, but I’m in no rush. (I’ll need to work out what to do about Mono at that point, too – there are various options there.)

In this post, I’ll go through the various C# 6 features, and show how useful (or otherwise) they are in Noda Time.

Read-only automatically implemented properties (“autoprops”)

Finally! I’ve been waiting for these for ages. You can specify a property with just a blank getter, and then assign it a value from either the declaration statement, or within a constructor/static constructor.

So for example, in DateTimeZone, this:

private static readonly DateTimeZone UtcZone = new FixedDateTimeZone(Offset.Zero);
public static DateTimeZone Utc { get { return UtcZone; } }

becomes

public static DateTimeZone Utc { get; } = new FixedDateTimeZone(Offset.Zero);

and

private readonly string id;
public string Id { get { return id; } }
protected DateTimeZone(string id, ...)
{
    this.id = id;
    ...
}

becomes

public string Id { get; }
protected DateTimeZone(string id, ...)
{
    this.Id = id;
    ...
}

As I mentioned before, I’ve been asking for this feature for a very long time – so I’m slightly surprised to find myself not entirely positive about the results. The problem it introduces isn’t really new – it’s just one that I’m not used to, as I haven’t used automatically-implemented properties much in a large code base. The issue is consistency.

With separate fields and properties, if you knew you didn’t need any special behaviour due to the properties when you accessed the value within the same type, you could always use the fields. With automatically-implemented properties, the incidental fact that a field is also exposed as a property changes the code – because now the whole class refers to it as a property instead of as a field.

I’m sure I’ll get used to this – it’s just a matter of time.

Initial values for automatically-implemented properties

The ability to specify an initial value for automatically-implemented properties applies to writable properties as well. I haven’t used that in the main body of Noda Time (almost all Noda Time types are immutable), but here’s an example from PatternTestData, which is used to provide data for text parsing/formatting tests. The code before:

internal CultureInfo Culture { get; set; }

public PatternTestData(...)
{
    ...
    Culture = CultureInfo.InvariantCulture;
}

And after:

internal CultureInfo Culture { get; set; } = CultureInfo.InvariantCulture;

public PatternTestData(...)
{
    ...
}

It’s worth noting that just like with a field initializer, you can’t call instance members within the same class when initializing an automatically-implemented property.
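
For example, this sketch illustrates the restriction – the commented-out property would fail to compile, because its initializer would implicitly use this:

public class Example
{
    private const int DefaultLength = 10;

    // Fine: constants and static members are accessible in initializers.
    public int Length { get; set; } = DefaultLength;

    // Compile-time error if uncommented: the initializer can't call
    // instance members, as there's no 'this' available yet.
    // public int DoubleLength { get; set; } = ComputeDoubleLength();

    private int ComputeDoubleLength() => Length * 2;
}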

Expression-bodied members

I’d expected to like this feature… but I hadn’t expected to fall in love with it quite as much as I’ve done. It’s really simple to describe – if you’ve got a read-only property, or a method or operator, and the body is just a single return statement (or any other simple statement for a void method), you can use => to express that instead of putting the body in braces. It’s worth noting that this is not a lambda expression, nor is the compiler converting anything to delegates – it’s just a different way of expressing the same thing. Three examples of this from LocalDateTime (one property, one operator, one method – they’re not next to each other in the original source code, but it makes it simpler for this post):

public int Year { get { return date.Year; } }

public static LocalDateTime operator +(LocalDateTime localDateTime, Period period)
{
    return localDateTime.Plus(period);
}

public static LocalDateTime Add(LocalDateTime localDateTime, Period period)
{
    return localDateTime.Plus(period);
}

becomes

public int Year => date.Year;

public static LocalDateTime operator +(LocalDateTime localDateTime, Period period) =>
    localDateTime.Plus(period);

public static LocalDateTime Add(LocalDateTime localDateTime, Period period) =>
    localDateTime.Plus(period);

In my actual code, the operator and method each only take up a single (pretty long) line. For some other methods – particularly ones where the body has a comment – I’ve split it into multiple lines. How you format your code is up to you, of course :)

So what’s the benefit of this? Why do I like it? It makes the code feel more functional. It makes it really clear which methods are just shorthand for some other expression, and which really do involve a series of steps. It’s far too early to claim that this improves the quality of the code or the API, but it definitely feels nice. One interesting data point – using this has removed about half of the return statements across the whole of the NodaTime production assembly. Yup, we’ve got a lot of properties which just delegate to something else – particularly in the core types like LocalDate and LocalTime.

The nameof operator

This was a no-brainer in Noda Time. We have a lot of code like this:

public void Foo([NotNull] string x)
{
    Preconditions.CheckNotNull(x, "x");
    ...
}

This trivially becomes:

public void Foo([NotNull] string x)
{
    Preconditions.CheckNotNull(x, nameof(x));
    ...
}
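
For reference, here’s a minimal sketch of what a helper like Preconditions.CheckNotNull might look like – the real Noda Time implementation differs in its details:

internal static class Preconditions
{
    internal static T CheckNotNull<T>(T argument, string paramName) where T : class
    {
        if (argument == null)
        {
            throw new ArgumentNullException(paramName);
        }
        // Returning the argument allows checks to be chained inline.
        return argument;
    }
}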

Checking that every call to Preconditions.CheckNotNull (and CheckArgument etc) uses nameof and that the name is indeed the name of a parameter is going to be one of the first code diagnostics I write in Roslyn, when I finally get round to it. (That will hopefully be very soon – I’m talking about it at CodeMash in a month!)

Dictionary initializers

We’ve been able to use collection initializers with dictionaries since C# 3, using the Add method. C# 6 adds the ability to use the indexer too, which leads to code which looks more like normal dictionary access. As an example, I’ve changed a “field enum value to delegate” dictionary in TzdbStreamData from

private static readonly Dictionary<TzdbStreamFieldId, Action<Builder, TzdbStreamField>> FieldHandlers =
    new Dictionary<TzdbStreamFieldId, Action<Builder, TzdbStreamField>>
{
   { TzdbStreamFieldId.StringPool, (builder, field) => builder.HandleStringPoolField(field) },
   { TzdbStreamFieldId.TimeZone, (builder, field) => builder.HandleZoneField(field) },
   { TzdbStreamFieldId.TzdbIdMap, (builder, field) => builder.HandleTzdbIdMapField(field) },
   ...
}

to:

private static readonly Dictionary<TzdbStreamFieldId, Action<Builder, TzdbStreamField>> FieldHandlers =
    new Dictionary<TzdbStreamFieldId, Action<Builder, TzdbStreamField>>
{
    [TzdbStreamFieldId.StringPool] = (builder, field) => builder.HandleStringPoolField(field),
    [TzdbStreamFieldId.TimeZone] = (builder, field) => builder.HandleZoneField(field),
    [TzdbStreamFieldId.TzdbIdMap] = (builder, field) => builder.HandleTzdbIdMapField(field),
...
}

One downside of this is that the initializer will now not throw an exception if the same key is specified twice. So whereas the bug in this code is obvious immediately:

var dictionary = new Dictionary<string, string>
{
    { "a", "b" },
    { "a", "c" }
};

if you convert it to similar code using the indexer:

var dictionary = new Dictionary<string, string>
{
    ["a"] = "b",
    ["a"] = "c",
};

… you end up with a dictionary which only has a single value.

To be honest, I’m now pretty used to the syntax which uses Add – so even though there are some other dictionaries initialized with collection initializers in Noda Time, I don’t think I’ll be changing them.

Using static members

For a while I didn’t think I was going to use this much – and then I remembered NodaConstants. The majority of the constants here are things like MillisecondsPerHour, and they’re used a lot in some of the core types like Duration. The ability to add a using directive for a whole type, which imports all the members of that type, allows code like this:

public int Seconds =>
    unchecked((int) ((NanosecondOfDay / NodaConstants.NanosecondsPerSecond) % NodaConstants.SecondsPerMinute));

to become the following (this uses the VS2015 preview syntax – the final C# 6 syntax is using static NodaTime.NodaConstants;):

using NodaTime.NodaConstants;
...
public int Seconds =>
    unchecked((int) ((NanosecondOfDay / NanosecondsPerSecond) % SecondsPerMinute));

Expect to see this used a lot in trigonometry code, making all those calls to Math.Cos, Math.Sin etc a lot more readable.

Another benefit of this syntax is to allow extension methods to be imported just from an individual type instead of from a whole namespace. In Noda Time 2.0, I’m introducing a NodaTime.Extensions namespace with extensions to various BCL types (to allow more fluent conversions such as DateTimeOffset.ToOffsetDateTime()) – I suspect that not everyone will want all of these extensions to be available all the time, so the ability to import selectively is very welcome.
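
As a sketch of the kind of extension involved – assuming OffsetDateTime.FromDateTimeOffset as the underlying conversion; the real code may differ:

namespace NodaTime.Extensions
{
    public static class DateTimeOffsetExtensions
    {
        // Allows dateTimeOffset.ToOffsetDateTime() as a fluent conversion.
        public static OffsetDateTime ToOffsetDateTime(this DateTimeOffset dateTimeOffset) =>
            OffsetDateTime.FromDateTimeOffset(dateTimeOffset);
    }
}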

String interpolation

String interpolation always uses the system default culture, which we rarely want in Noda Time – so the feature isn’t terribly useful for us, but there are a few cases where it’s handy.

For example, consider this code:

throw new KeyNotFoundException(
    string.Format("No calendar system for ID {0} exists", id));

With C# 6 in the VS2015 preview, this has become

throw new KeyNotFoundException("No calendar system for ID \{id} exists");

Note that the syntax of this feature is not finalized yet – I expect to have to change this for the final release to:

throw new KeyNotFoundException($"No calendar system for ID {id} exists");

It’s always worth considering places where a feature could be used, but probably shouldn’t. ZoneInterval is one such place. Its ToString() method looks like this:

public override string ToString() =>
    string.Format("{0}: [{1}, {2}) {3} ({4})",
        Name,
        HasStart ? Start.ToString() : "StartOfTime",
        HasEnd ? End.ToString() : "EndOfTime",
        WallOffset,
        Savings);

I tried using string interpolation here, but it ended up being pretty horrible:

  • String literals within string literals look very odd
  • An interpolated string literal has to be on a single line, which ended up being very long
  • The fact that two of the arguments use the conditional operator makes them harder to read as part of interpolation

Basically, I can see string interpolation being great for “simple” interpolation with no significant logic, but less helpful for code like this.

Null propagation

Amazingly, I’ve only found a single place to use null propagation in Noda Time so far. As a lot of the types are value types, we don’t do a lot of null checking – and when we do, it’s typically to throw an exception if the value is null. However, this is the one place I’ve found so far, in BclDateTimeZone.ForSystemDefault. The original code is:

if (currentSystemDefault == null || currentSystemDefault.OriginalZone != local)

With null propagation we can handle “we don’t have a cached system default” and “the cached system default is for the wrong time zone” with a single expression:

if (currentSystemDefault?.OriginalZone != local)

(Note that local will never be null, or this transformation wouldn’t be valid.)

There may well be a few other places this could be used, and I’m certain it’ll be really useful in a lot of other code – it’s just that the Noda Time codebase doesn’t contain much of the kind of code this feature is targeted at.

Conclusion

What a lot of features! C# 6 is definitely a “lots of little features” release rather than the “single big feature” releases we’ve seen with C# 4 (dynamic) and C# 5 (async). Even C# 3 had a lot of little features which all served the common purpose of LINQ. If you had to put a single theme around C# 6, it would probably be making existing code more concise – it’s the sort of feature set that really lends itself to this “refactor the whole codebase to tidy it up” approach.

I’ve been very pleasantly surprised by how much I like expression-bodied members, and read-only automatically implemented properties are a win too (even though I need to get used to them a bit). Other features such as static imports are definitely welcome to remove some of the drudgery of constants and provide finer-grained extension method discovery.

Altogether, I’m really pleased with how C# 6 has come together – I’m looking forward to merging the C# 6 branch into the main Noda Time code base as soon as I’ve got my continuous integration server ready for it…

When is an identifier not an identifier? (Attack of the Mongolian Vowel Separator)

Here are a few things you may not be aware of:

  • C# identifiers can include Unicode escape sequences (\u1234 etc)
  • C# identifiers can include Unicode characters in the category “Other, formatting” (Cf) but these are ignored when comparing identifiers for equality
  • The Mongolian Vowel Separator (U+180E) has oscillated between the Cf and Zs categories a couple of times
  • .NET has its own copy of Unicode categories, separate from whatever Win32 might provide
  • Roslyn (built in .NET) uses the Unicode categories, whereas csc.exe (the “old” native C# compiler) uses either the Win32 categories or a built-in copy
  • Neither the .NET table nor the Win32 table necessarily reflects exactly what any one version of the Unicode standard says
  • Compilers can have bugs in them

Put them together, and chaos ensues!

How this all started – blame Vladimir

I started looking at this based on a discussion in our ECMA technical group meeting last week, when we were considering the normative references – and in particular, which version of Unicode we were going to target. Currently the ECMA 4th edition spec targets Unicode 4.0 and the Microsoft C# 5 specification targets Unicode 3.0. It’s not clear to me whether any compilers actually take note of this, and moving forward we’d like both the ECMA and Microsoft standards to not specify a particular version of Unicode, effectively encouraging compiler authors to use the most recent one available to them. Despite the wrinkles listed below, I think that makes the most sense for real world uses – it’s crazy to require compilers to ship with their own private copies of Unicode, effectively.

When discussing this, Vladimir Reshetnikov mentioned the Mongolian Vowel Separator (U+180E) which has had an interesting life. It was introduced in Unicode 3.0.0, when it was in the Cf category (“other, formatting”). Then in Unicode 4.0.0 it was moved into the Zs (separator, space) category. In Unicode 6.3.0 it was then moved back to the Cf category.

Of course, my natural inclination was to try to abuse this. My initial aim was to come up with code which behaved differently depending on which version of Unicode the compiler was using. It turned out to be a little more complicated than that, but we’ll assume a hypothetical compiler first, with no bugs, but which obeys whichever version of the Unicode standard we want it to. (Arguably that’s already a bug given the requirements of the current C# specs, but we’ll set that aside.)

Hypothetical example 1: valid or invalid

For simplicity, let’s start with some source code which is all in ASCII:

using System;

class MvsTest
{
    static void Main()
    {
        string stringx = "a";
        string\u180ex = "b";
        Console.WriteLine(stringx);
    }
}

If the compiler is using Unicode 6.3 or higher (or a version earlier than 4.0) then U+180E is deemed to be in the Cf category, and is therefore valid within an identifier. In that case, it’s fine for it to be escaped as per the code above. At that point, the identifier in the second line of the method is deemed to be “the same as” stringx, so the output is b.

What about a compiler using a version of Unicode between 4.0 and 6.2 (inclusive) though? At that point, U+180E is deemed to be in the Zs category, which makes it a whitespace character. Zs characters are allowed as whitespace within C# programs – but not within identifiers. Once it’s not a valid identifier – and because this isn’t within a character/string literal – it’s invalid to use the Unicode escape sequence, so the code doesn’t compile.

Hypothetical example 2: valid two different ways

We can write the same code without using an escape sequence, however. If you create a regular ASCII file like this:

using System;

class MvsTest
{
    static void Main()
    {
        string stringx = "a";
        stringAAAx = "b";
        Console.WriteLine(stringx);
    }
}

then open it up in a hex editor and replace the AAA with bytes E1 A0 8E, then you’ve got a file containing the UTF-8 representation of U+180E at the same location as we had the Unicode escape sequence in the first version.
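
If hex editing sounds fiddly, a small program can generate the same file – a sketch, relying on the fact that \u180e in an ordinary string literal just produces the character, which UTF-8 then encodes as E1 A0 8E:

using System.IO;
using System.Text;

class GenerateMvsTest
{
    static void Main()
    {
        string source =
            "using System;\n" +
            "\n" +
            "class MvsTest\n" +
            "{\n" +
            "    static void Main()\n" +
            "    {\n" +
            "        string stringx = \"a\";\n" +
            "        string\u180ex = \"b\";\n" +
            "        Console.WriteLine(stringx);\n" +
            "    }\n" +
            "}\n";
        // UTF8Encoding(false) means no byte order mark is written.
        File.WriteAllText("MvsTest.cs", source, new UTF8Encoding(false));
    }
}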

So, a compiler which compiled the first version would still compile this version (assuming you could tell it that the source code was UTF-8), and the results would be the same – it would print b as the second statement of the method would be a simple assignment to the existing variable.

However, a compiler which treats U+180E as whitespace and would therefore treat the first program as an error would accept this program – and treat that second statement as a declaration of a second local variable (x) and assign it an initial value. You might get a warning about the variable being unused, but it’s valid C# and the output is a.

Reality: the Microsoft compilers

Whenever we talk about the Microsoft C# compiler these days, we need to distinguish between the “native” compiler (csc) and Roslyn (rcsc, although typically I just call it Roslyn).

As it’s written in native code, csc uses whatever Windows supplies for its Unicode character tables – or it embeds it directly in the executable, potentially. (I’ve been scouring MSDN to find a Win32 native function to tell me the Unicode category of a specific code point, and failed so far. It would have been useful…)

Compare that with Roslyn, which is written in C# and (as far as I’m aware) uses char.GetUnicodeCategory – which in turn uses the Unicode tables built into mscorlib.

My experiments suggest that whatever the native compiler uses to get the Unicode category has treated U+180E as a formatting character forever. At least, I’ve tried to find old machines (including VM images) which haven’t had Windows update applied since September 2013 (which is when Unicode 6.3 was published) and they all compile the first program listed above. I’m beginning to suspect that csc might actually have a copy of Unicode 3.0 built into it; it certainly treats U+180E as a formatting character, but doesn’t like either U+0600 or U+00AD within identifiers. (U+0600 wasn’t introduced until Unicode 4.0, but has always been a formatting character; U+00AD was a “dash punctuation” character in Unicode 3.0, and became a formatting character in 4.0.)

The table built into mscorlib has definitely changed over time, however. If you run a simple program such as this:

using System;

class Test
{
    static void Main()
    {
        Console.WriteLine(Environment.Version);
        Console.WriteLine(char.GetUnicodeCategory('\u180e'));
    }
}

then running under CLRv2, the result is “SpaceSeparator” whereas running on CLRv4 (at least on a recently-updated system), the result is “Format”.

Of course, Roslyn won’t run under old CLRs, but we have hope by way of csharppad.com – which runs Roslyn in an environment (of uncertain origin – Mono? I’m unsure) which prints “SpaceSeparator” for the above. Sure enough the first program fails to compile – but it’s harder to check the second program, as csharppad.com doesn’t allow you to upload source code, and copy/paste produces some odd results.

Reality: mcs (Mono C# compiler)

Mono’s compiler uses the BCL GetUnicodeCategory code too, which should make it significantly simpler to experiment – but unfortunately, the Mono parser has (at least) two bugs in it:

  • It will allow any Unicode escape sequence in an identifier, whether it’s an escape sequence for a valid identifier part or not. For example, string\u0020x = "" is valid under the Mono compiler. Filed as bug 24968. Source.
  • It doesn’t allow formatting characters within identifiers – it includes characters in classes Mn, Mc, Nd and Pc, but not Cf. Filed as bug 24969. Source.

For this reason, the first program always compiles and prints “b” whereas the second program always fails to compile, regardless of whether U+180E is treated as being in Zs or Cf.

What version is this, anyway?

Next, let’s think about the Unicode data itself. It’s not at all clear which version any particular BCL implementation is actually using. Consider this little program:

using System;

class Test
{
    static void Main()
    {
        Console.WriteLine(char.GetUnicodeCategory('\u00ad'));
        Console.WriteLine(char.GetUnicodeCategory('\u0600'));
        Console.WriteLine(char.GetUnicodeCategory('\u180e'));
    }
}

On my computer, under CLR v4 this prints “DashPunctuation, Format, Format”, and under both Mono (3.3.0) and CLR v2 it prints “DashPunctuation, Format, SpaceSeparator”.

That’s very odd. It doesn’t correspond with any version of the Unicode standard, as far as I can tell:

  • U+00AD was a Po (punctuation, other) character in Unicode 1.x, then Pd (punctuation, dash) in 2.x and 3.x, and from 4.0 onwards has been Cf.
  • U+0600 was only introduced in Unicode 4.0, and has always been Cf.
  • U+180E was introduced as Cf in 3.0, then changed to Zs in 4.0, then back to Cf in 6.3.

So there is no version where the first line agrees with either the second line or the third line. I’m basically a bit baffled by this.

What about nameof and CallerMemberName?

The names of identifiers aren’t only used for comparisons – they’re available as strings without any reflection being involved at all. From C# 5, we’ve had the CallerMemberName attribute, allowing things like:

public static void X\u0600y()
{
    ShowCaller();
}

public static void ShowCaller([CallerMemberName] string caller = null)
{
    Console.WriteLine("Called by {0}", caller);
}

And in C# 6, we can write:

string x\u0600y = "";
Console.WriteLine("nameof = {0}", nameof(x\u0600y));

What should those print? They do just print “Xy” and “xy” as the names (respectively), as if the compiler has simply thrown away the formatting character entirely. But what should they print? Bear in mind that in the second case, we could easily have used nameof(xy) and that would still have compared equal to the declared identifier.

We can’t even say “What’s the name of the member being declared?” because you can overload with “different but equal” identifiers:

public static void Xy() {}
public static void X\u0600y() {}
public static void X\u070fy() {}
...
Console.WriteLine(nameof(X\u200by));

What should that print out? I’m sure you’ll be relieved to hear that the C# team has a plan in place – but fundamentally this is one of these “no obvious right answer” scenarios. It gets even weirder when you bring the CLI specification into the mix. Section I.8.5.1 of ECMA-335 6th edition has:

Assemblies shall follow Annex 7 of Technical Report 15 of the Unicode Standard
3.0 governing the set of characters permitted to start and be included in identifiers, available online
at http://www.unicode.org/unicode/reports/tr15/tr15-18.html. Identifiers shall be in the
canonical format defined by Unicode Normalization Form C. For CLS purposes, two identifiers
are the same if their lowercase mappings (as specified by the Unicode locale-insensitive, one-to-one
lowercase mappings) are the same. That is, for two identifiers to be considered different
under the CLS they shall differ in more than simply their case. However, in order to override an
inherited definition the CLI requires the precise encoding of the original declaration be used.

I would love to explore the impact of this by adding a Cf character into IL, but unfortunately I haven’t worked out a way of affecting the encoding of ilasm, in order to persuade it that my hacked up IL is what I want it to be.

Conclusion

As noted before, text is hard.

It turns out that even when restricting oneself to identifiers, text is hard. Who would’ve thought?

When is a string not a string?

As part of my “work” on the ECMA-334 TC49-TG2 technical group, standardizing C# 5 (which will probably be completed long after C# 6 is out… but it’s a start!) I’ve had the pleasure of being exposed to some of the interesting ways in which Vladimir Reshetnikov has tortured C#. This post highlights one of the issues he’s raised. As usual, it will probably never impact 99.999% of C# developers… but it’s a lovely little problem to look at.

Relevant specifications referenced in this post:

  • The Unicode Standard, version 7.0.0 – in particular, chapter 3
  • C# 5 (Word document)
  • ECMA-335 (CLI specification)

What is a string?

How would you define the string (or System.String) type? I can imagine a number of responses to that question, from vague to pretty specific, and not all well-defined:

  • “Some text”
  • A sequence of characters
  • A sequence of Unicode characters
  • A sequence of 16-bit characters
  • A sequence of UTF-16 code units

The last of these is correct. The C# 5 specification (section 1.3) states:

Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.

So far, so good. But that’s C#. What about IL? What does that use, and does it matter? It turns out that it does… Strings need to be represented in IL as constants, and the nature of that representation is important, not only in terms of the encoding used, but how the encoded data is interpreted. In particular, a sequence of UTF-16 code units isn’t always representable as a sequence of UTF-8 code units.

I feel ill (formed)

Consider the C# string literal "X\uD800Y". That is a string consisting of three UTF-16 code units:

  • 0x0058 – ‘X’
  • 0xD800 – High surrogate
  • 0x0059 – ‘Y’

That’s fine as a string – it’s even a Unicode string according to the spec (item D80). However, it’s ill-formed (item D84). That’s because the UTF-16 code unit 0xD800 doesn’t map to a Unicode scalar value (item D76) – the set of Unicode scalar values explicitly excludes the high/low surrogate code points.

Just in case you’re new to surrogate pairs: UTF-16 only deals in 16-bit code units, which means it can’t cope with the whole of Unicode (which ranges from U+0000 to U+10FFFF inclusive). If you want to represent a value greater than U+FFFF in UTF-16, you need to use two UTF-16 code units: a high surrogate (in the range 0xD800 to 0xDBFF) followed by a low surrogate (in the range 0xDC00 to 0xDFFF). So a high surrogate on its own makes no sense. It’s a valid UTF-16 code unit in itself, but it only has meaning when followed by a low surrogate.
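
A quick illustration of those mechanics, using the BCL’s surrogate helpers:

string wellFormed = "X\uD800\uDC00Y"; // high surrogate followed by low surrogate
Console.WriteLine(char.IsHighSurrogate(wellFormed[1]));              // True
Console.WriteLine(char.IsLowSurrogate(wellFormed[2]));               // True
Console.WriteLine(char.ConvertToUtf32(wellFormed, 1).ToString("x")); // 10000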

Show me some code!

So what does this have to do with C#? Well, string constants have to be represented in IL somehow. As it happens, there are two different representations: most of the time, UTF-16 is used, but attribute constructor arguments use UTF-8.

Let’s take an example:

using System;
using System.ComponentModel;
using System.Text;
using System.Linq;

[Description(Value)]
class Test
{
    const string Value = "X\ud800Y";

    static void Main()
    {
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }

    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}

The output of this code (under .NET) is:

Attribute: 0058 fffd fffd 0059
Constant: 0058 d800 0059

As you can see, the “constant” (Test.Value) has been preserved as a sequence of UTF-16 code units, but the attribute property has U+FFFD (the Unicode replacement character which is used to indicate broken data when decoding binary to text). Let’s dig a little deeper and look at the IL for the attribute and the constant:

.custom instance void [System]System.ComponentModel.DescriptionAttribute::.ctor(string)
= ( 01 00 05 58 ED A0 80 59 00 00 )

.field private static literal string Value
= bytearray (58 00 00 D8 59 00 )

The format of the constant (Value) is really simple – it’s just little-endian UTF-16. The format of the attribute is specified in ECMA-335 section II.23.3. Here, the meaning is:

  • Prolog (01 00)
  • Fixed arguments (for specified constructor signature)
    • 05 58 ED A0 80 59 (a single string argument as a SerString)
      • 05 (the length, i.e. 5, as a PackedLen)
      • 58 ED A0 80 59 (the UTF-8-encoded form of the string)
  • Number of named arguments (00 00)
  • Named arguments (there aren’t any)

The interesting part is the “UTF-8-encoded form of the string” here. It’s not valid UTF-8, because the input isn’t a well-formed string. The compiler has taken the high surrogate, determined that there isn’t a low surrogate after it, and just treated it as a value to be encoded in the normal UTF-8 way of encoding anything in the range U+0800 to U+FFFF inclusive.
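
To see where ED A0 80 comes from, here’s the normal three-byte UTF-8 pattern (1110xxxx 10xxxxxx 10xxxxxx) applied to 0xD800 – exactly what makes the result ill-formed, as surrogate code points aren’t supposed to be encoded this way at all:

int codeUnit = 0xD800;
byte b1 = (byte) (0xE0 | (codeUnit >> 12));          // 0xED
byte b2 = (byte) (0x80 | ((codeUnit >> 6) & 0x3F));  // 0xA0
byte b3 = (byte) (0x80 | (codeUnit & 0x3F));         // 0x80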

It’s worth noting that if we had a full surrogate pair, UTF-8 would encode the single Unicode scalar value being represented, using 4 bytes. For example, if we change the declaration of Value to:

const string Value = "X\ud800\udc00Y";

then the UTF-8 bytes in the IL are 58 F0 90 80 80 59 – where F0 90 80 80 is the UTF-8 encoding for U+10000. That’s a well-formed string, and we get the same value for both the description attribute and the constant.

So in our original example, the string constant (encoded as UTF-16 in the IL) is just decoded without checking whether or not it’s ill-formed, whereas the attribute argument (encoded as UTF-8) is decoded with extra validation, which detects the ill-formed code unit sequence and replaces it.

Encoding behaviour

So which approach is right? According to the Unicode specification (item C10) both could be fine:

When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.

and

Conformant processes cannot interpret ill-formed code unit sequences. However, the conformance clauses do not prevent processes from operating on code unit sequences that do not purport to be in a Unicode character encoding form. For example, for performance reasons a low-level string operation may simply operate directly on code units, without interpreting them as characters. See, especially, the discussion under D89.

It’s not at all clear to me whether either the attribute argument or the constant value “purports to be in a Unicode character encoding form”. In my experience, very few pieces of documentation or specification are clear about whether they expect a piece of text to be well-formed or not.

Additionally, System.Text.Encoding implementations can often be configured to determine how they behave when encoding or decoding ill-formed data. For example, Encoding.UTF8.GetBytes(Value) returns byte sequence 58 EF BF BD 59 – in other words, it spots the bad data and replaces it with U+FFFD as part of the encoding… so decoding this value will result in X U+FFFD Y with no problems. On the other hand, if you use new UTF8Encoding(true, true).GetBytes(Value), an exception will be thrown. The first constructor argument is whether or not to emit a byte order mark under certain circumstances; the second one is what dictates the encoding behaviour in the face of invalid data, along with the EncoderFallback and DecoderFallback properties.
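
As a quick demonstration of those two behaviours – a sketch using the same Value constant as before:

using System;
using System.Text;

class EncodingBehaviour
{
    static void Main()
    {
        const string Value = "X\ud800Y";

        // Default UTF8 property: replaces the lone surrogate, giving 58-EF-BF-BD-59.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(Value)));

        // throwOnInvalidBytes: true makes invalid input throw instead.
        var strict = new UTF8Encoding(true, true);
        try
        {
            strict.GetBytes(Value);
        }
        catch (EncoderFallbackException)
        {
            Console.WriteLine("Strict encoding threw EncoderFallbackException");
        }
    }
}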

Language behaviour

So should this compile at all? Well, the language specification doesn’t currently prohibit it – but specifications can be changed :)

In fact, both csc and Roslyn do prohibit the use of ill-formed strings with certain attributes. For example, with DllImportAttribute:

[DllImport(Value)]
static extern void Foo();

This gives an error when Value is ill-formed:

error CS0591: Invalid value for argument to 'DllImport' attribute

There may be other attributes this is applied to as well; I’m not sure.

If we take it as read that the ill-formed value won’t be decoded back to its original form when the attribute is instantiated, I think it would be entirely reasonable to make it a compile-time failure – for attributes. (This is assuming that the runtime behaviour can’t be changed to just propagate the ill-formed string.)

What about the constant value though? Should that be allowed? Can it serve any purpose? Well, the precise value I’ve given is probably not terribly helpful – but it could make sense to have a string constant which ends with a high surrogate or starts with a low surrogate… because it can then be combined with another string to form a well-formed UTF-16 string. Of course, you should be very careful about this sort of thing – read the Unicode Technical Report 36 “Security Considerations” for some thoroughly alarming possibilities.
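
For example – a sketch where each constant is individually ill-formed, but the concatenation is fine:

const string High = "X\uD800"; // ends with a lone high surrogate
const string Low = "\uDC00Y";  // starts with a lone low surrogate
string combined = High + Low;  // "X" then U+10000 then "Y" - well-formed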

Corollaries

One interesting aspect to all of this is that “string encoding arithmetic” doesn’t behave as you might expect it to. For example, consider this method:

// Bad code!
string SplitEncodeDecodeAndRecombine
    (string input, int splitPoint, Encoding encoding)
{
    byte[] firstPart = encoding.GetBytes(input.Substring(0, splitPoint));
    byte[] secondPart = encoding.GetBytes(input.Substring(splitPoint));
    return encoding.GetString(firstPart) + encoding.GetString(secondPart);            
}

You might expect that this would be a no-op so long as everything is non-null and splitPoint is within range… but if you happen to split in the middle of a surrogate pair, it’s not going to be happy. There may well be other potential problems lurking there, depending on things like normalization form – I don’t think so, but at this point I’m unwilling to bet too heavily on string behaviour.

If you think the above code is unrealistic, just imagine partitioning a large body of text, whether that’s across network packets, files, or whatever. You might feel clever for realizing that without a bit of care you’d get binary data split between UTF-16 code units… but even handling that doesn’t save you. Yikes.
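
If you really do need to partition text at arbitrary points, one defensive tweak is to nudge the split point off a surrogate boundary first – a sketch which fixes only this particular hazard:

static int AdjustSplitPoint(string input, int splitPoint)
{
    // If the split would separate a high surrogate from its low surrogate,
    // move the split back by one code unit.
    if (splitPoint > 0 && splitPoint < input.Length &&
        char.IsHighSurrogate(input[splitPoint - 1]) &&
        char.IsLowSurrogate(input[splitPoint]))
    {
        return splitPoint - 1;
    }
    return splitPoint;
}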

I’m tempted to swear off text data entirely at this point. Floating point is a nightmare, dates and times… well, you know my feelings about those. I wonder what projects are available that only need to deal with integers, and where all operations are guaranteed not to overflow. Let me know if you have any.

Conclusion

Text is hard.

Writing and speaking update

This post has a few purposes – it’s partly a bit of advertising, but it’s also meant to serve as a quick way of replying to speaking requests for a while… if you’ve been directed here by an email from me, I hope you’ll excuse the “form letter” approach. (You can probably skip down to the bottom.)

Upcoming speaking engagements

I’ve got a few speaking engagements in the next four months:

  • December 2nd, hack.summit(): Humanity: Epic Fail (not currently sure how Tony the Pony will handle working with a webcam, but we’ll see)
  • December 3rd, NDC London: Abusing C#
  • January 6th-9th: CodeMash: workshop on LINQ, and a session on testing with Roslyn
  • February 27th: NorDevCon: Keynote (!) on passion, and C# 6 (probably with a bit of Roslyn too)

New book under development: Grokking C#

A long time ago I wrote about my hope to write two books: C# in Style and C# from Scratch. As you may have noticed, these haven’t happened yet, although I have had a couple of abortive attempts.

I’ve been working with my publisher (Manning) to think about the best way of introducing C# to newcomers, and we’ve made solid progress on a book with the working title of “Grokking C#” (subtitle: getting stuff done with objects). The aim will be to introduce both C# and OO hand-in-hand… not just “enough OO to understand the syntax of C#” but hopefully enough of the big picture to encourage readers to think about objects in a way which will last them long beyond the short time they’re reading the book.

The “Grokking” series from Manning is a pretty new one, but I’m really excited about the approach – it’s very visual, with a lot of thought put into reducing cognitive load. Have a look at Grokking Functional Programming for an idea of the style. If this sounds a bit like the Head-First books, there’s a good reason for that – Bert Bates who wrote Head-First Java with Kathy Sierra is acting in a consulting role for the Grokking books. However, the Grokking books have a style of their own – one which happens to take the things I like about the Head-First books, but without some of the aspects I’m less fond of.

I’m very much looking forward to learning a lot from Bert about getting a message out as clearly as possible. My blog, book and conference audience has always been reasonably advanced, and while I obviously answer Stack Overflow questions asked by newcomers, I haven’t done a large amount of writing for that audience. I expect this to be very challenging, but really rewarding.

No more speaking engagements…

Now for the bit which is tricky to tell conference organizers: I’m not going to accept any extra speaking engagements until the book is done, at least in first draft.

Writing the book is going to take a lot of my time, some of which would otherwise be family time… and currently speaking is taking about half of my vacation allocation each year. If I tried to continue with the speaking and the book, my family would suffer – and they don’t deserve that.

Don’t get me wrong – I still love speaking, and I’ll look forward to returning to the conference scene when I’m done, assuming anyone still wants me. Hopefully by then I’ll have some fresh ideas for talks as well. (I’ve been considering “Techniques for persuading people you’re an expert when you’re not” as a soft-skills talk, although I’m not sure I want to give away the secret…)

Apologies to conference organizers who I’ve already put off a few times with “ask me next year” – please trust me when I say it isn’t personal. If I had more time (or didn’t have a family!) I’d love to spend my life travelling around the world giving talks and chatting with developers. There just aren’t enough hours in the day :(

Violating the “smart enum” pattern in C#

For a while now, I’ve been a big fan of a pattern in C# which mimics Java enums to a certain extent. In general, it’s a lovely pattern. Only after reading a comment on a recent blog post by Eric Lippert did I find out about a horrible flaw. Dubious thanks to John Payson for prompting this post. (And I fully intend to include this in the Abusing C# talk that I give every so often…)

Trigger warning: this post contains code you may wish you’d never seen.

Let’s start off with the happy path – a moment of calm before the storm.

When things go well…

The idea is to create a class – which can be abstract – with a certain fixed set of instances, which are available via static properties on the class itself. The class is not sealed, but it prevents other classes from subclassing it by using a private constructor. Here’s an example, including how it can be used:

using System;

public abstract class Operation
{
    public static Operation Add { get; } = new AddOperation();
    public static Operation Subtract { get; } = new SubtractOperation();

    // Prevent non-nested subclasses. They can't call this constructor,
    // which is the only way of creating an instance.
    private Operation()
    {}

    public abstract int Apply(int lhs, int rhs);

    private class AddOperation : Operation
    {
        public override int Apply(int lhs, int rhs)
        {
            return lhs + rhs;
        }
    }

    private class SubtractOperation : Operation
    {
        public override int Apply(int lhs, int rhs)
        {
            return lhs - rhs;
        }
    }
}

public class Test
{
    static void Main()
    {
        Operation add = Operation.Add;
        Console.WriteLine(add.Apply(5, 3)); // 8
    }
}

(Note the use of the new “automatically implemented read-only property” from C# 6. Yum!)

Obviously there are possible variations of this – there could be multiple instances of one subclass, or we could even create instances dynamically if that were useful. The main point is that no other code can create instances or even declare their own subclasses of Operation. That private constructor stops everything, right? Um…

My eyes! The goggles do nothing!

I’m sure all my readers are aware that every C# constructor either has to chain to another constructor using : this(...) or chain to a base class constructor, either implicitly calling a constructor without arguments or explicitly using : base(...).

What I wasn’t aware of – or rather, I think I’d been aware of once, but ignored – was that constructor initializers can have cycles. In other words, it’s absolutely fine to write code like this:

public class SneakyOperation : Operation
{
    public SneakyOperation() : this(1)
    {}

    public SneakyOperation(int i) : this()
    {}

    public override int Apply(int lhs, int rhs)
    {
        return lhs * rhs;
    }
}

Neither constructor attempts to call a base class constructor, so it doesn’t matter that the only base class constructor is private. This code compiles with no issues – not even a warning.

So we’ve already failed in our intention to prevent subclassing from occurring… but this will fail with a StackOverflowException which will take down the process, so there’s not much harm, right?

Well… suppose we didn’t let it throw a StackOverflowException. Suppose we triggered our own exception first. There are various ways of doing this. For example, we could use arithmetic:

public SneakyOperation(int x) : this(x, 0)
{}

// This isn't going to be happy if y is 0...
public SneakyOperation(int x, int y) : this(x / y)
{}

Or we could have one constructor call another with a delegate which will deliberately throw:

public SneakyOperation(int x) : this(() => { throw new Exception("Bang!"); })
{}

public SneakyOperation(Func<int> func) : this(func())
{}

So now, we have a class where you can call the constructor, and an exception will be thrown. But hey, you still can’t get at the instance which was created, right? Well, this is tricky. We genuinely can’t use this anywhere in code which we expect to execute – there’s no way of reaching a constructor body, because in order to do that we’d have to have some route which took us through the base class constructor, which we can’t access. Field initializers aren’t allowed to use this, either explicitly or implicitly (e.g. by calling instance methods). If there’s a way of circumventing that, I’m not aware of it.

So, we can’t capture a reference to the instance before throwing the exception. And after the exception is thrown, it’s garbage, and therefore unavailable… unless we take it out of the garbage bin, of course. Enter a finalizer…

private static volatile Operation instance;

~SneakyOperation()
{
    // Resurrect the instance.
    instance = this;
}

Now all we have to do is wrap up the "create an object and wait for it to be resurrected" logic in a handy factory method:

public static Operation GetInstance()
{
    try
    {
        new SneakyOperation(0);
    }
    catch {} // Thoroughly expected
    Operation local;
    while ((local = instance) == null)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
    return local;
}

Admittedly if multiple threads call GetInstance() at a time, there’s no guarantee which thread will get which instance, or whether they’ll get the same instance… but that’s fine. There’s very little to guarantee this will ever terminate, in fact… but I don’t care much, as clearly I’m never going to use this code in production. It works on my machine, for the few times I’ve been able to bear running it. Here’s a complete example (just combine it with the Operation class as before) in case you want to run it yourself:

public class SneakyOperation : Operation
{
    private static volatile Operation instance;

    private SneakyOperation(int x) : this(() => { throw new Exception(); })
    {}

    private SneakyOperation(Func<int> func) : this(func())
    {}

    public static Operation GetInstance()
    {
        try
        {
            new SneakyOperation(0);
        }
        catch {}
        Operation local;
        while ((local = instance) == null)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
        }
        return local;
    }

    ~SneakyOperation()
    {
        instance = this;
    }

    public override int Apply(int lhs, int rhs)
    {
        return lhs * rhs;
    }
}

public class Test
{
    static void Main()
    {
        Operation sneaky = SneakyOperation.GetInstance();
        Console.WriteLine(sneaky.Apply(5, 3)); // 15
    }
}

Vile, isn’t it?

Fixing the problem

The first issue is the cyclic constructors. That’s surely a mistake in the language specification. (It’s not a bug in the compiler, as far as I can tell – I can’t find anything in the spec to prohibit it.) This can be fixed, and it shouldn’t even be too hard to do. There’s nothing conditional in constructor initializers; each constructor chains to exactly one other constructor, known via overload resolution at compile time. Even dynamic doesn’t break this – constructor initializers can’t be dispatched dynamically. This makes detecting a cycle trivial – from each constructor, just follow the chain of constructor calls until either you reach a base class constructor (success) or you reach one of the constructor declarations already in the chain (failure). I will be pestering Mads for this change – it may be too late for C# 6, but it’s never too early to ask about C# 7. This will be a breaking change – but it will only break code which is tragically broken already, always resulting in either an exception or a stack overflow.
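
To make that concrete, here’s a sketch of the check, over a hypothetical compiler model in which each constructor knows which constructor its initializer chains to (null when it chains to a base class constructor):

// Constructor and ChainedConstructor are hypothetical compiler-model types.
static bool HasConstructorCycle(Constructor start)
{
    var seen = new HashSet<Constructor>();
    for (Constructor current = start; current != null; current = current.ChainedConstructor)
    {
        if (!seen.Add(current))
        {
            return true; // revisited a constructor already in the chain: cycle
        }
    }
    return false; // reached a base class constructor call: no cycle
}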

Of course, a C# language change wouldn’t change the validity of such code being created through IL. peverify appears to be happy with it, for example. Still, a language fix would be a good start.

There’s a second language change which could help to fix this too – the introduction of private abstract/virtual methods. Until you consider nested classes, neither of those make sense – but as soon as you remember that there can be subclasses with access to private members, it makes perfect sense to have them. If Operation had a private abstract method (probably exposed via a public concrete method) then SneakyOperation (and any further subclasses) would have to be abstract, as it couldn’t possibly override the method. I’ve already blogged about something similar with public abstract classes containing internal abstract methods, which prevents concrete subclasses outside that assembly… but this would restrict inheritance even further.

Does it reach the usefulness bar required for new language features? Probably not. Would it be valid IL with the obvious meaning? I’ve performed experiments which suggest that it would be valid, but not as restrictive as one would expect – but I could be mistaken. I think there’s distinct merit in the idea, but I’m not expecting to see it come to pass.

Conclusion

So, what can you take away from this blog post? For starters, the fact that the language can probably be abused in far nastier ways than you’ve previously imagined – and that it’s very easy to miss the holes in the spec which allow for that abuse. (It’s not like this hole is a recent one… it’s been open forever.) You should absolutely not take away the idea that this is a recommended coding pattern – although the original pattern (for Operation) is a lovely one which deserves more widespread recognition.

Finally, you can take away the fact that nothing gets my attention quicker than a new (to me) way of abusing the language.
