Category Archives: Evil code

Lying to the compiler

This morning I tweeted this:

Just found a C# 8 nullable reference types warning in Noda Time. Fixing it by changing Foo(x, x?.Bar) to Foo(x, x?.Bar!) which looks really dodgy… anyone want to guess why it’s okay?

This attracted more interest than I expected, so I thought I’d blog about it.

First let’s unpack what x?.Bar! means. x?.Bar means “if x is non-null, the value of x.Bar; otherwise, the corresponding null value”. The ! operator at the end is introduced in C# 8, and it’s the damn-it operator (more properly the “null-forgiving operator”, but I expect to keep calling it the damn-it operator forever). It tells the compiler to treat the preceding expression as “definitely not null” even if the compiler isn’t sure for whatever reason. Importantly, this does not emit a null check in the IL – it’s a compile-time only change.

When talking about the damn-it operator, I’ve normally given two scenarios where it makes sense:

  • When testing argument validation
  • When you have invariants in your code which allow you to know more than the compiler does about nullability. This is a little bit like a cast: you’re saying you know more than the compiler. Remember that it’s not like a cast in terms of behaviour though; it’s not checked at execution time.

My tweet this morning wasn’t about either of these cases. It’s in production code, and I absolutely believe that it’s possible for x?.Bar to be null. I’m lying to the compiler to get it to stop it emitting a warning. The reason is that in the case where the value is null, it won’t matter that it’s null.

The actual code is in this Noda Time commit, but the code below provides a simplified version. We have three classes:

  • Person, with a name and home address
  • Address, with some properties I haven’t bothered showing here
  • Delivery, with a recipient and an address to deliver to
using System;

public sealed class Address
{
    // Properties here
}

public sealed class Person
{
    public string Name { get; }
    public Address HomeAddress { get; }

    public Person(string name, Address homeAddress)
    {
        Name = name ??
            throw new ArgumentNullException(nameof(name));
        HomeAddress = homeAddress ??
            throw new ArgumentNullException(nameof(homeAddress));
    }
}

public sealed class Delivery
{
    public Person Recipient { get; }
    public Address Address { get; }

    public Delivery(Person recipient)
        : this(recipient, recipient?.HomeAddress!)
    {
    }

    public Delivery(Person recipient, Address address)
    {
        Recipient = recipient ??
            throw new ArgumentNullException(nameof(recipient));
        Address = address ??
            throw new ArgumentNullException(nameof(address));
    }
}

The interesting part is the Delivery(Person) constructor, that delegates to the Delivery(Person, Address) constructor.

Here’s a version that would compile with no warnings:

public Delivery(Person recipient)
    : this(recipient, recipient.HomeAddress)

However, now if recipient is null, that will throw NullReferenceException instead of the (preferred) ArgumentNullException. Remember that nullable reference checking in C# 8 is really just advisory – the compiler does nothing to stop a non-nullable reference variable from actually having a value of null. This means we need to keep all the argument validation we’ve already got.

We could validate recipient before we pass it on to the other constructor:

public Delivery(Person recipient)
    : this(recipient ?? throw new ArgumentNullException(...),
           recipient.HomeAddress)

That will throw the right exception, but it’s ugly and more code than we need. We know that the constructor we’re delegating to already validates recipient – we just need to get that far. That’s where the null-conditional operator comes in. So we can write:

public Delivery(Person recipient)
    : this(recipient, recipient?.HomeAddress)

That will behave as we want it to – if recipient is null, we’ll pass null values as both arguments to the other constructor, and it will validate them. But now the compiler warns that the second argument could be null, and the parameter is meant to be non-null. The solution is to use the damn-it operator:

public Delivery(Person recipient)
    : this(recipient, recipient?.HomeAddress!)

Now we get the behaviour we want, with no redundant code, and no warnings. We’re lying to the compiler and satisfied that we’re doing so sensibly, because recipient?.HomeAddress is only null if recipient is null, and we know that that will be validated first anyway.

I’ve added a comment, as it’s pretty obscure otherwise – but part of me just enjoys the oddity of it all :)

Backward compatibility and overloading

I started writing a blog post about versioning in July 2017. I’ve mostly abandoned it, because I think the topic is too vast for a single post. It potentially needs a whole site/wiki/repository devoted to it. I hope to come back to it at some point, because I believe this is a hugely important topic that doesn’t get as much attention as it deserves.

In particular, the .NET ecosystem is mostly embracing semantic versioning – which sounds great, but does rely on us having a common understanding of what’s meant by a “breaking change”. That’s something I’ve been thinking about quite a lot. One aspect which has struck me forcefully recently is how hard it is to avoid breaking changes when using method overloading. That’s what this post is about, mostly because it’s fun.

First, a quick definition…

Source and binary compatibility

If I can recompile my client code with a new version of the library and it all works fine, that’s source compatible. If I can redeploy my existing client binary with a new version of the library without recompiling, that’s binary compatible. Neither of these is a superset of the other:

  • Some changes are both source and binary incompatible, such as removing a whole public type that you depended on.
  • Some changes are source compatible but binary incompatible, such as changing a public static read-only field into a property.
  • Some changes are binary compatible but source incompatible, such as adding an overload which could cause compile-time ambiguity.
  • Some changes are source and binary compatible, such as reimplementing the body of a method.

So what are we talking about?

I’m going to assume that we have a public library at version 1.0, and we wish to add some overloads in version 1.1. We’re following semantic versioning, so we need to be backward compatible. What does that mean we can and can’t do, and is it a simple binary choice?

In various cases, I’ll present library code at version 1.0 and version 1.1, then “client” code (i.e. code that is using the library) which could be broken by the change. I’m not presenting method bodies or class declarations, as they’re largely irrelevant – focus on the signatures. It should be easy to reproduce any of this if you’re interested though. We’ll imagine that all the methods I present are in a class called Library.

Simplest conceivable change, foiled by method group conversions

The simplest example I can imagine would be adding a parameterized method when there’s a parameterless one already:

// Library version 1.0
public void Foo()

// Library version 1.1
public void Foo()
public void Foo(int x)

Even that’s not completely compatible. Consider this client code:

// Client
static void Method()
{
    var library = new Library();
    HandleAction(library.Foo);
}

static void HandleAction(Action action) {}
static void HandleAction(Action<int> action) {}

In library version 1.0, that’s fine. The call to HandleAction performs a method group conversion of library.Foo to create an Action. In library version 1.1, it’s ambiguous: the method group can be converted to either Action or Action<int>. So it’s not source compatible, if we’re going to be strict about it.

At this point you might be tempted to give up and go home, resolving never to add any overloads, ever again. Or maybe we can say that this is enough of a corner case to not consider it breaking. Let’s call method group conversions out of scope for now.

Unrelated reference types

We get into a different kind of territory when we have overloads with the same number of parameters. You might expect this library change to be non-breaking:

// Library version 1.0
public void Foo(string x)

// Library version 1.1
public void Foo(string x)
public void Foo(FileStream x)

That feels like it should be reasonable. The original method still exists, so we won’t be breaking binary compatibility. The simplest way of breaking source compatibility is to have a call that either works in v1.0 but doesn’t in v1.1, or works in both but does something different in v1.1 than it did in v1.0.

How can a call break between v1.0 and v1.1? We’d have to have an argument that’s compatible with both string and FileStream. But they’re unrelated reference types…

The first failure is if we have a user-defined implicit conversion to both string and FileStream:

// Client
class OddlyConvertible
{
    public static implicit operator string(OddlyConvertible c) => null;
    public static implicit operator FileStream(OddlyConvertible c) => null;
}

static void Method()
{
    var library = new Library();
    var convertible = new OddlyConvertible();
    library.Foo(convertible);
}

Hopefully the problem is obvious: what used to be unambiguous via a conversion to string is now ambiguous as the OddlyConvertible type can be implicitly converted to both string and FileStream. (Both overloads are applicable, neither is better than the other.)

It may be reasonable to exclude user-defined conversions… but there’s a far simpler way of making this fail:

// Client
static void Method()
{
    var library = new Library();
    library.Foo(null);
}

The null literal is implicitly convertible to any reference type or any nullable value type… so again, the call becomes ambiguous in the library v1.1. Let’s try again…

Reference type and non-nullable value type parameters

If we don’t mind user-defined conversions, but don’t like null literals causing a problem, how about introducing an overload with a non-nullable value type?

// Library version 1.0
public void Foo(string x)

// Library version 1.1
public void Foo(string x)
public void Foo(int x)

This looks good – library.Foo(null) will be fine in v1.1. So is it safe? Not in C# 7.1…

// Client
static void Method()
{
    var library = new Library();
    library.Foo(default);
}

The default literal is like the null literal, but for any type. It’s really useful – and a complete pain when it comes to overloading and compatibility :(

Optional parameters

Optional parameters bring their own kind of pain. Suppose we have one optional parameter, but wish to add a second. We have three options, shown as 1.1a, 1.1b and 1.1c below.

// Library version 1.0
public void Foo(string x = "")

// Library version 1.1a
// Keep the existing method, but add another one with two optional parameters.
public void Foo(string x = "")
public void Foo(string x = "", string y = "")

// Library version 1.1b
// Just add the parameter to the existing method.
public void Foo(string x = "", string y = "")

// Library version 1.1c
// Keep the old method but make the parameter required, and add a new method
// with both parameters optional.
public void Foo(string x)
public void Foo(string x = "", string y = "")

Let’s think about a client that makes two calls:

// Client
static void Method()
{
    var library = new Library();
    library.Foo();
    library.Foo("xyz");
}

Library 1.1a keeps binary compatiblity, but breaks source compatibility: the library.Foo() is now ambiguous. The C# overloading rules prefer a method that doesn’t need the compiler to “fill in” any optional parameters, but it doesn’t have any preference in terms of how many optional parameters are filled in.

Library 1.1b keeps source compatibility, but breaks binary compatibility. Existing compiled code will expect to call a method with a single parameter – and that method no longer exists.

Library 1.1c keeps binary compatibility, but is potentially odd around source compatibility. The library.Foo() call now resolves to the two-parameter method, whereas library.Foo("xyz") resolves to the one-parameter method (which the compiler prefers over the two-parameter method because it doesn’t need to fill in any optional parameters). That may very well be okay, if the one-parameter version simply delegates to the two-parameter version using the same default value. It feels odd for the meaning of the first call to change though, when the method it used to resolve to still exists.

Optional parameters get even hairer when you don’t want to add a new one at the end, but in the middle – e.g. if you’re trying to follow a convention of keeping an optional CancellationToken parameter at the end. I’m not going to dive into this…

Generics

Type inference is a tricky beast at the best of times. With overload resolution it goes into full-on nightmare mode.

Let’s have a single non-generic method in v1.0, and then add a generic method in v1.1.

// Library version 1.0
public void Foo(object x)

// Library version 1.1
public void Foo(object x)
public void Foo<T>(T x)

That doesn’t seem too awful… but let’s look closely at what happens to client code:

// Client
static void Method()
{
    var library = new Library();
    library.Foo(new object());
    library.Foo("xyz");
}

In library v1.0, both calls resolve to Foo(object) – the only method that exists.

Library v1.1 is binary-compatible: if we use a client executable compiled against v1.0 but running against v1.1, both calls will still use Foo(object). But if we recompile, the second call (and only the second one) will change to using the generic method. Both methods are applicable for both calls.

In the first call, T would be inferred to be object, so the argument-to-parameter-type conversion is just object to object in both cases. Great. The compiler applies a tie-break rule that prefers non-generic methods over generic methods.

In the second call, T would be inferred to be string, so the argument-to-parameter-type conversion is string to object for the original method and string to string for the generic method. The latter is a “better” conversion, so the second method is picked.

If the two methods behave the same way, that’s fine. If they don’t, you’ve broken compatibility in a very subtle way.

Inheritance and dynamic typing

I’m sorry: I just don’t have the energy. Both inheritance and dynamic typing would interact with overload resolution in “fun” and obscure ways.

If you add a method in one level of the inheritance hierarchy which overloads a method in a base class, the new method will be examined first, and picked over the base class method even when the base class method is more specific in terms of argument-to-parameter-type conversions. There’s lots of scope for messing things up.

Likewise with dynamic typing (within the client code), to some extent all bets are off. You’re already sacrificing a lot of compile-time safety… it shouldn’t come as a surprise when things break.

Conclusion

I’ve tried to keep the examples reasonably simple here. It can get really complicated really quickly as soon as you have multiple optional parameters etc.

Versioning is hard and makes my head hurt.

Implementing IXmlSerializable in readonly structs

Background

There are three things you need to know to start with:

Operations on read-only variables which are value types copy the variable value first. I’ve written about this before on this blog. C# 7.2 addresses this by introducing the readonly modifier for structs. See the language proposal for more details. I was touched to see that the proposal references my blog post :)

The ref readonly local variables and in parameter features in C# 7.2 mean that “read-only variables” are likely to be more common in C# than they have been in the past.

Noda Time includes many value types which implement IXmlSerializable. Noda Time implements IXmlSerializable.ReadXml by assigning to this: fundamentally IXmlSerializable assumes a mutable type. I use explicit interface implementation to make this less likely to be used directly on an unboxed variable. With a generic method using an interface constraint you can observe a simple method call mutating a non-read-only variable, but that’s generally harmless.

Adding the readonly modifier to Noda Time structs

I relished news of the readonly modifier for struct declarations. At last, I can remove my ReadWriteForEfficiency attribute! Hmm. Not so much. To be fair, some of the structs (7 out of 18) were fine. But every struct that implements IXmlSerializable gave me this error:

Cannot assign to ‘this’ because it is read-only

That’s reasonable: in members of a readonly struct, this effectively becomes an in parameter instead of a ref parameter. But how can we fix that? Like any sensible developer, I turned to Stack Overflow, which has precisely the question I’m interested in. It even has an answer! Unfortunately, it amounts to a workaround: assigning to this via unsafe code.

Violating readonly with unsafe code

To give a concrete example of the answer from Stack Overflow, here’s my current LocalTime code:

void IXmlSerializable.ReadXml([NotNull] XmlReader reader)
{
    Preconditions.CheckNotNull(reader, nameof(reader));
    var pattern = LocalTimePattern.ExtendedIso;
    string text = reader.ReadElementContentAsString();
    this = pattern.Parse(text).Value;
}

Here’s the code that compiles when LocalTime is marked as readonly, after enabling unsafe blocks:

unsafe void IXmlSerializable.ReadXml([NotNull] XmlReader reader)
{
    Preconditions.CheckNotNull(reader, nameof(reader));
    var pattern = LocalTimePattern.ExtendedIso;
    string text = reader.ReadElementContentAsString();
    fixed (LocalTime* thisAddr = &this)
    {
        *thisAddr = pattern.Parse(text).Value;
    }
}

Essentially, the unsafe code is bypassing the controls around read-only structs. Just for kicks, let’s apply the same change
throughout Noda Time, and think about what would happen…

As it happens, that fix doesn’t work universally: ZonedDateTime is a “managed type” because it contains a reference (to a DateTimeZone) which means you can’t create a pointer to it. That’s a pity, but if we can make everything else readonly, that’s a good start. Now let’s look at the knock-on effects…

Safe code trying to abuse the unsafe code

Let’s try to abuse our “slightly dodgy” implementations. Here’s a class we’ll put in a “friendly” assembly which is trying to be as helpful as possible:

using NodaTime;

public class AccessToken
{
    private readonly Instant expiry;
    public ref readonly Instant Expiry => ref expiry;

    public AccessToken(Instant expiry) => this.expiry = expiry;
}

Great – it lets you get at the expiry time of an access token without even copying the value.

The big test is: can we break this friendly’s code’s assumptions about expiry really not changing its value?

Here’s code I expected to mutate the access token:

static void MutateAccessToken(AccessToken accessToken)
{
    ref readonly Instant expiry = ref accessToken.Expiry;
    string xml = "<evil>2100-01-01T00:00:00Z</evil>";
    EvilReadXml(in expiry, xml);
}

static void EvilReadXml<T>(in T value, string xml) where T : IXmlSerializable
{
    var reader = XmlReader.Create(new StringReader(xml));
    reader.Read();
    value.ReadXml(reader);
}

We have an in parameter in EvilReadXml, so the expiry variable is being passed by reference, and then we’re calling ReadXml on that parameter… so doesn’t that mean we’ll modify the parameter, and thus the underlying expiry field in the object?

Nope. Thinking about it, the compiler doesn’t know when it compiles EvilReadXml that T is a readonly struct – it could be a regular struct. So it has to create a copy of the value before calling ReadXml.

Looking the the spec proposal, there’s one interesting point in the section on ref extension methods:

However any use of an in T parameter will have to be done through an interface member. Since all interface members are considered mutating, any such use would require a copy.

Hooray! That suggests it’s safe – at least for now. I’m still worried though: what if the C# team adds a readonly struct generic constraint in C# 8? Would that allow a small modification of the above code to break? And if interface methods are considered mutating anyway, why doesn’t the language know that when I’m trying to implement the interface?

But hey, now that we know unsafe code is able to work around this in our XML implementation, what’s to stop nasty code from using the same trick directly?

Unsafe code abusing safe code

Imagine we didn’t support XML serialization at all. Could unsafe code mutate AccessToken? It turns out it can, very easily:

static void MutateAccessToken(AccessToken accessToken)
{
    ref readonly Instant expiry = ref accessToken.Expiry;
    unsafe
    {
        fixed (Instant* expiryAddr = &expiry)
        {
            *expiryAddr = Instant.FromUtc(2100, 1, 1, 0, 0);
        }
    }
}

This isn’t too surprising – unsafe code is, after all, unsafe. I readily admit I’m not an expert on .NET security, and I know the
landscape has changed quite a bit over time. These days I believe the idea of a sandbox has been somewhat abandoned – if you care about executing code you don’t really trust, you do it in a container and use that as your security boundary. That’s about the limit of my knowledge though, which could be incorrect anyway.

Where do we go from here?

At this point, I’m stuck making a choice between several unpleasant ones:

  • Leave Noda Time public structs as non-read-only. That prevents users from writing efficient code to use them without copying.
  • Remove XML serialization from Noda Time 3.0. We’re probably going to remove binary serialization anyway on the grounds that Microsoft is somewhat-discouraging it going forward, and it’s generally considered a security problem. However, I don’t think XML serialization is considered to be as bad, so long as you use it carefully, preventing arbitrary type loading. (See this paper for more information.)
  • Implement XML serialization using unsafe code as shown above. Even though my attempt at abusing it failed, that doesn’t provide very much comfort. I don’t know what other ramifications there might be for including unsafe code. Does that limit where the code can be deployed? It also doesn’t help with my concern about a future readonly struct constraint.

Thoughts very welcome. Reassurances from the C# team that “readonly struct” constraints won’t happen particularly welcome… along with alternative ways of implementing XML serialization.

Surprise! Creating an instance of an open generic type

This is a brief post documenting a very weird thing I partly came up with on Stack Overflow today.

The context is this question. But to skip to the shock, we end up with code like this:

object x = GetWeirdValue();
// This line prints True. Be afraid - be very afraid!
Console.WriteLine(x.GetType().GetTypeInfo().IsGenericTypeDefinition);

That just shouldn’t happen. You shouldn’t be able to create an instance of an open type – a type that still contains generic type parameters. What does a List<T> (rather than a List<string> or List<int>) mean? It’s like creating an instance of an abstract class.

Before today, I’d have expected it to be impossible – the CLR should just not allow such an object to exist. I now know one – and only one – way to do it. While you can’t get normal field values for an open generic type, you can get constants… after all, they’re constant values, right? That’s fine for most constants, because those can’t be generic types – int, string etc. The only type of constant with a user-defined type is an enum. Enums themselves aren’t generic, of course… but what if it’s nested inside another generic type, like this:

class Generic<T>
{
    enum GenericEnum
    {
        Foo = 0
    }
}

Now Generic<>.GenericEnum is an open type, because it’s nested in an open type. Using Enum.GetValues(typeof(Generic<>.GenericEnum)) fails in the expected way: the CLR complains that it can’t create instances of the open type. But if you use reflection to get at the constant field representing Foo, the CLR magically converts the underlying integer (which is what’s in the IL of course) into an instance of the open type.

Here’s the complete code:

using System;
using System.Reflection;

class Program
{
    static void Main(string[] args)
    {
        object x = GetWeirdValue();
        // This line prints True
        Console.WriteLine(x.GetType().GetTypeInfo().IsGenericTypeDefinition);
    }

    static object GetWeirdValue() =>
        typeof(Generic<>.GenericEnum).GetTypeInfo()
            .GetDeclaredField("Foo")
            .GetValue(null);

    class Generic<T>
    {
        public enum GenericEnum
        {
            Foo = 0
        }
    }
}

… and the corresponding project file, to prove it works for both the desktop and .NET Core…

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>netcoreapp1.0;net45</TargetFrameworks>
  </PropertyGroup>

</Project>

Use this at your peril. I expect that many bits of code dealing with reflection would be surprised if they were provided with a value like this…

It turns out I’m not the first one to spot this. (That would be pretty unlikely, admittedly.) Kirill Osenkov blogged two other ways of doing this, discovered by Vladimir Reshetnikov, back in 2014.

To base() or not to base(), that is the question

Today I’ve been reviewing the ECMA-334 C# specification, and in particular the section about class instance constructors.

I was struck by this piece in a clause about default constructors:

If a class contains no instance constructor declarations, a default instance constructor is automatically provided. That default constructor simply invokes the parameterless constructor of the direct base class.

I believe this to be incorrect, and indeed it is, as shown here (in C# 6 code for brevity, despite this being the C# 5 spec that I’m reviewing; that’s irrelevant in this case):

using System;

class Base
{
    public int Foo { get; }

    public Base(int foo = 5)
    {
        Foo = foo;
    }
}

class Derived : Base
{    
}

class Test
{
    static void Main()
    {
        var d = new Derived();
        Console.WriteLine(d.Foo); // Prints 5
    }    
}

Here the default constructor in Derived clearly doesn’t execute a parameterless constructor in Base because there is no parameterless constructor in Base. Instead, it executes the parameterized constructor, providing the default argument value.

So, I considered whether we could reword the standard to something like:

If a class contains no instance constructor declarations, a default instance constructor is automatically provided. That default constructor simply invokes a constructor of the direct base class as if the default constructor contained a constructor initializer of base().

But is that always the case? It turns out it’s not – at least not in Roslyn. There are more interesting optional parameters we can use than just int foo = 5. Let’s have a look:

using System;
using System.Runtime.CompilerServices;

class Base
{
    public string Origin { get; }

    public Base([CallerMemberName] string name = "Unspecified",
                [CallerFilePath] string source = "Unspecified",                
                [CallerLineNumber] int line = -1)
    {
        Origin = $"{name} - {source}:{line}";
    }
}

class Derived1 : Base {}
class Derived2 : Base
{
    public Derived2() {}
}
class Derived3 : Base
{
    public Derived3() : base() {}
}

class Test
{
    static void Main()
    {
        Console.WriteLine(new Derived1().Origin);
        Console.WriteLine(new Derived2().Origin);
        Console.WriteLine(new Derived3().Origin);
    }    
}

The result is:

Unspecified - Unspecified:-1
Unspecified - Unspecified:-1
.ctor - c:\Users\Jon\Test\Test.cs:23

When base() is explicitly specified, that source location is treated as the “caller” for caller member info attributes. When it’s implicit (including when there’s a default constructor), no source location is made available to the Base constructor.

This is somewhat compiler-specific – and I can imagine different results where the default constructor could specify a name but not source file or line number, and the declared constructor with an implicit call could specify the name and source file but no line number.

I would never suggest using this little tidbit of Roslyn implementation trivia, but it’s fun nonetheless…

“Sideways overriding” with partial methods

First note: this blog post is very much tongue in cheek. I’m not actually planning on using the idea. But it was too fun not to share.

As anyone following my activity on GitHub may be aware, I’ve been quite a lot of work on Protocol Buffers recently – in particular, a mostly-new port for proto3. I’ve recently been looking at JSON support, and thinking about how to implement “overriding” ToString() for a few well-known types. I generate partial classes, so that gives me a hook to provide extra functionality. Indeed, I’m planning on using this to provide conversion methods for Timestamp and Duration, for example. However, you can’t really override anything in partial methods.

Refresher on partial methods

While partial classes were introduced in C# 2, partial methods were introduced in C# 3. The idea is that one source file (usually the generated one) can provide a partial method signature, and another source file (usually the manually-written one) can provide an implementation if it wants to. Any part of the source can call the method, and the call will be removed at compile-time if nothing provides an implementation. The fact that the method may not be there leads to some limitations:

  • Partial methods are implicitly private, but you can’t specify an access modifier explicitly
  • Partial methods are always void – they can’t return any values
  • Partial methods cannot have out parameters

(Interestingly, a partial method implementation can be an async method – but with a return type of void, which is never a nice situation to be in.)

There’s more in the spec, but the last two bullets are the important part.

So, suppose I want to override ToString() in the generated code, but provide a mechanism for that override to be “further overridden” effectively, in the manual code for the same class? How do I get the value from an “extra override”? How do I even detect whether or not it’s there?

Side effects to the rescue!

(Now there’s a phrase you never thought you’d hear from me.)

I mentioned before that if a partial method is called but no implementation is provided, the call is removed. That includes all aspects of the call – including the evaluation of the arguments. So if evaluating the argument has a side-effect… we can spot that side effect.

Next, we have to work out how to get a value back from a method. We can’t use the return value, and we can’t use an out parameter. There are two options here: we could either pass a wrapper (e.g. an array with a single element) and allow the “extra override” to populate the wrapper… or we can use a ref parameter. The latter feels ever-so-slightly cleaner to me.

And so the ugly hack is born. The code generator can always generate code like this:

partial void ToStringOverride(bool ignored, ref string value);

public override string ToString()
{
    string value = null;
    bool overridden = false;
    ToStringOverride(overridden = true, ref value);
    return overridden ? value : "Original";
}

For any partial class where the ToStringOverride method isn’t implemented, overridden will still be false, so we’ll fall back to returning "Original". (I would hope that any decent JIT would remove the overridden and value local variables entirely at that point.) Otherwise, we’ll return whatever the method has changed value to.

Here’s a short but complete example:

using System;

// Generated code
partial class UglyHack1
{
    partial void ToStringOverride(bool ignored, ref string value);

    public override string ToString()
    {
        string value = null;
        bool overridden = false;
        ToStringOverride(overridden = true, ref value);
        return overridden ? value : "Original";
    }
}

// Generated code
partial class UglyHack2
{
    partial void ToStringOverride(bool ignored, ref string value);

    public override string ToString()
    {
        string value = null;
        bool overridden = false;
        ToStringOverride(overridden = true, ref value);
        return overridden ? value : "Original";        
    }    
}

// Manual code
partial class UglyHack2
{
    partial void ToStringOverride(bool ignored, ref string value)
    {
        value = "Different!";
    }
}

class Test
{
    static void Main()
    {
        var g1 = new UglyHack1();
        var g2 = new UglyHack2();

        Console.WriteLine(g1);
        Console.WriteLine(g2);
    }
}

Horribly ugly, but it works…

Alternatives?

Obviously this isn’t really pleasant. Some alternatives:

  • Derive from the generated class in order to override ToString again. Doesn’t work with sealed classes, and will only work if clients create instances of the derived class.
  • Introduce a new interface, and allow manual code to implement it on the partial class. The ToString method can then check this is IMyOtherToString or whatever, and call it appropriately. This introduces another virtual call for no great reason, and exposes the interface to the outside world, which we may not want to do.
  • Don’t override ToString in the generated code at all. Not good if you normally want to override it.
  • Introduce an abstract base class which the generated class derives from. Override ToString() in that base class, possibly calling an abstract member which is then provided in the generated class – but allowing the manual code to override ToString() again.

Conclusion

Ugly hacks are fun. But it’s much better to keep them where it belongs: in a blog post, not in production code.

Backwards compatibility is (still) hard

At the moment, I’m spending a fair amount of time thinking about a new version of the C# API and codegen for Protocol Buffers, as well as other APIs for interacting with Google services. While that’s the context for this post, I want to make it very clear that this is still a personal post, and should in no way be taken to be “Google’s opinion” on anything. The underlying issue could apply in many other situations, but it’s easiest to describe a concrete scenario.

Context and current state: the builder pattern

The problem I’ve been trying to address is the relative pain of initializing a protobuf message. Protocol buffer messages are declared in a separate schema file (.proto) and then code is generated. The schema declares fields, each of which has a name, a type and a number associated with it. The generated message types are immutable, with builder classes associated with them. So for example, we might start off with a message like this:

message Person {
  string first_name = 1;
  string last_name = 3;
}

And construct a Person object in C# like this:

var person = new Person.Builder { FirstName = "Jon", LastName = "Skeet" }.Build();
// Now person.FirstName and person.LastName are readonly properties

That’s not awful, but it’s not the cleanest code in the world. We can make it slightly simpler using an implicit conversion from the builder type to the message type:

Person person = new Person.Builder { FirstName = "Jon", LastName = "Skeet" };

It’s still not really clean though. Let’s revisit why the builder pattern is useful:

  • We can specify just the properties we want.
  • By deferring the “build” step until after we’ve specified everything, we get mutability without building a lot of intermediate objects.

If only there were another language construct allowing that…

Optional parameters to the rescue!

If we provided a constructor with an optional parameter for each property, we can specify just what we want. So something like:

public Person(string firstName = null, string lastName = null)
...
var person = new Person(firstName: "Jon", lastName: "Skeet");

Hooray! That looks much nicer:

  • We can use var (if we want to) because there are no implicit conversions to confuse things.
  • We don’t need to mention a builder at all.
  • Every piece of text in the statement is something we want to express, and we only express it once.

That last point is a lovely place to be in terms of API design – while you still need to worry about naming, ordering and how the syntax fits into bigger expressions, you’ve achieved some sense of “as simple as possible, but no simpler”.

So, that’s all great – except for versioning.

Let’s just add a field at the end…

One of the aims of protocol buffers is to support an evolving schema. (The limitations are different for proto2 and proto3, but that’s a slightly different matter.) So what happens if we add a new field to the message?

message Person {
  string first_name = 1;
  string last_name = 3;
  string title = 4; // Mr, Mrs etc
}

Now we end up with the following constructor:

public Person(string firstName = null, string lastName = null, string title = null)

The code still compiles – but if we try to use run our old client code against the new version of the library, it will fail – because the method it refers to no longer exists. So we have source compatibility, but not binary compatibility.

Let’s just add a field in the middle…

You may have noticed that I don’t have a field with tag 2 – this is not an accident. Suppose we now add it, for the obvious middle_name field:

message Person {
  string first_name = 1;
  string middle_name = 2;
  string last_name = 3;
  string title = 4; // Mr, Mrs etc
}

Regenerate the code, and we end up with a constructor with 4 parameters:

public Person(
    string firstName = null,
    string middleName = null,
    string lastName = null,
    string title = null)

Just to be clear, this change is entirely fine in protocol buffers – while normally fields are assigned incrementally, it shouldn’t be a breaking change to add a new field “between” existing ones.

Let’s take a look at our client code again:

var person = new Person(firstName: "Jon", lastName: "Skeet");

Yup, that still works – we need to recompile, but we still end up with a Person with the right properties. But that’s not the only code we could have started with. Suppose we’d actually had:

var person = new Person("Jon", "Skeet");

Until this last change that would have been fine – even after we’d added the optional title parameter, the two arguments would still have mapped to firstName and lastName respectively.

Having added the middle_name field, however, the code would still compile with no errors or warnings, but the meaning of the second argument would have changed – it would now map onto the middleName parameter instead of lastName.

Basically, we’d like to stop this code (using positional arguments) from compiling in the first place.

Feature requests and a workaround

The two features we really want from C# here are:

  • Some way of asking the generated code to perform dynamic overload resolution at execution time… not based on dynamic values, but on the basis that the code we’re compiling against may have changed since we compiled. This resolution only needs to be performed once, on first execution (or class load, or whatever) as by the time we’re executing, everything is fixed (the parameter names and types, and the argument names and types). It could be efficient.
  • Some way of forcing any call sites to use named arguments for any optional parameters. (Even though in our case all the parameters are optional, I can easily imagine a case where there are a few required parameters and then the optional ones. Using positional arguments for those required parameters is fine.)

It’s hard (without forking Roslyn :) to implement these feature requests ourselves, but for the second one we can at least have a workaround. Consider the following struct:

public struct DoNotCallThisMethodWithPositionalArguments {}

… and now imagine our generated constructor had been :

public Person(
    DoNotCallThisMethodWithPositionalArguments ignoreMe =
        default(DoNotCallThisMethodWithPositionalArguments),
    string firstName = null,
    string middleName = null,
    string lastName = null,
    string title = null)

Now our constructor call using positional arguments will fail, because there’s no conversion from string to the crazily-named struct. The only “nice” way you can call it is to use named arguments, which is what we wanted. You could call it using positional arguments like this:

var person = new Person(
    new DoNotCallThisMethodWithPositionalArguments(),
    "Jon",
    "Skeet");

(or using default(...) like the constructor declaration) – but at this point the code looks broken, so it’s your own fault if you decide to use it.

The reason for making it a struct rather than a class is to avoid null being convertible to it. Annoyingly, it wouldn’t be hard to make a class that you could never create an actual instance of, but you can’t prevent anyone from creating a value of a struct. Basically, what we really want is a type such that there is no valid expression which is convertible to that type – but aside from static classes (which can’t be used as parameter types) I don’t know of any way of doing that. (I don’t know what would happen if you compiled the Person class using a non-static class as the first parameter, then made that class static and recompiled it. Confusion on the part of the C# compiler, I should think.)

Another option (as mentioned in the comments) is to have a “poor man’s” version of the compiler enforcement via a Roslyn Code Diagnostic – add an attribute to any method call where you want “All optional parameters must be specified with named arguments” to apply, and then make the code diagnostic complain if you disobey that. That diagnostic could ship with the Protocol Buffers NuGet package, which would make for a pretty nice experience. Not quite as good as a language feature though :)

Conclusion

Default parameters are a pain in terms of compatibility. For internal company code, it’s often reasonable to only care about source compatibility as you can recompile all calling code before deployment – but for open source projects, binary compatibility within the same major version is the norm.

How useful and common do I think these features would be? Probably not common enough to meet the bar – unless there’s encouragement within comments here, in which case I’m happy to file feature requests on GitHub, of course.

As it happens, I’m currently looking at radical changes to the C# implementation of Protocol Buffers, regretfully losing the immutability aspect due to it raising the barrier to entry. It’s not quite a done deal yet, but assuming that goes ahead, all of this will be mostly irrelevant – for Protocol Buffers. There are plenty of other places where code generation could be more robustly backward-compatible through judicious use of optional-but-please-use-named-arguments parameters though…

When is an identifier not an identifier? (Attack of the Mongolian Vowel Separator)

Here’s a few things you may not be aware of:

  • C# identifiers can include Unicode escape sequences (\u1234 etc)
  • C# identifiers can include Unicode characters in the category “Other, formatting” (Cf) but these are ignored when comparing identifiers for equality
  • The Mongolian Vowel Separator (U+180E) has oscillated between the Cf and Zs categories a couple of times
  • .NET has its own copy of Unicode categories, separate from whatever Win32 might provide
  • Roslyn (built in .NET) uses the Unicode categories, whereas csc.exe (the “old” native C# compiler) uses either the Win32 categories or a built-in copy
  • Neither the .NET table nor the Win32 table necessarily reflects exactly what any one version of the Unicode standard says
  • Compilers can have bugs in

Put them together, and chaos ensues!

How this all started – blame Vladimir

I started looking at this based on a discussion in our ECMA technical group meeting last week, when we were considering the normative references – and in particular, which version of Unicode we were going to target. Currently the ECMA 4th edition spec targets Unicode 4.0 and the Microsoft C# 5 specification targets Unicode 3.0. It’s not clear to me whether any compilers actually take note of this, and moving forward we’d like both the ECMA and Microsoft standards to not specify a particular version of Unicode, effectively encouraging compiler authors to use the most recent one available to them. Despite the wrinkles listed below, I think that makes the most sense for real world uses – it’s crazy to require compilers to ship with their own private copies of Unicode, effectively.

When discussing this, Vladimir Reshetnikov mentioned the Mongolian Vowel Separator (U+180E) which has had an interesting life. It was introduced in Unicode 3.0.0, when it was in the Cf category (“other, formatting”). Then in Unicode 4.0.0 it was moved into the Zs (separator, space) category. In Unicode 6.3.0 it was then moved back to the Cf category.

Of course, my natural inclination was to try to abuse this. My initial aim was to come up with code which behaved differently depending on which version of Unicode the compiler was using. It turned out to be a little more complicated than that, but we’ll assume a hypothetical compiler first, with no bugs, but which obeys whichever version of the Unicode standard we want it to. (Arguably that’s already a bug given the requirements of the current C# specs, but we’ll set that aside.)

Hypothetical example 1: valid or invalid

For simplicity, let’s start with some source code which is all in ASCII:

class MvsTest
{
    static void Main()
    {
        string stringx = "a";
        string\u180ex = "b";
        Console.WriteLine(stringx);
    }
}

If the compiler is using Unicode 6.3 or higher (or a version earlier than 4.0) then U+180E is deemed to be in the Cf category, and is therefore valid within an identifier. In that case, it’s fine for it to be escaped as per the code above. At that point, the identifier in the second line of the method is deemed to be “the same as” stringx, so the output is b.

What about a compiler using a version of Unicode between 4.0 and 6.2 (inclusive) though? At that point, U+180E is deemed to be in the Zs category, which makes it a whitespace character. Zs characters are allowed as whitespace within C# programs – but not within identifiers. Once it’s not a valid identifier – and because this isn’t within a character/string literal – it’s invalid to use the Unicode escape sequence, so the code doesn’t compile.

Hypothetical example 2: valid two different ways

We can write the same code without using an escape sequence, however. If you create a regular ASCII file like this:

class MvsTest
{
    static void Main()
    {
        string stringx = "a";
        stringAAAx = "b";
        Console.WriteLine(stringx);
    }
}

then open it up in a hex editor and replace the AAA with bytes E1 A0 8E, then you’ve got a file containing the UTF-8 representation of U+180E at the same location as we had the Unicode escape sequence in the first version.

So, a compiler which compiled the first version would still compile this version (assuming you could tell it that the source code was UTF-8), and the results would be the same – it would print b as the second statement of the method would be a simple assignment to the existing variable.

However, a compiler which treats U+180E as whitespace and would therefore treat the first program as an error would accept this program – and treat that second statement as a declaration of a second local variable (x) and assign it an initial value. You might get a warning about the variable being unused, but it’s valid C# and the output is a.

Reality: the Microsoft compilers

Whenever we talk about the Microsoft C# compiler these days, we need to distinguish between the “native” compiler (csc) and Roslyn (rcsc, although typically I just call it Roslyn).

As it’s written in native code, csc uses whatever Windows supplies for its Unicode character tables – or it embeds it directly in the executable, potentially. (I’ve been scouring MSDN to find a Win32 native function to tell me the Unicode category of a specific code point, and failed so far. It would have been useful…)

Compare that with Roslyn, which is written in C# and (as far as I’m aware) uses char.GetUnicodeCategory – which in turn uses the Unicode tables built into mscorlib.

My experiments suggest that whatever the native compiler uses to get the Unicode category has treated U+180E as a formatting character forever. At least, I’ve tried to find old machines (including VM images) which haven’t had Windows update applied since September 2013 (which is when Unicode 6.3 was published) and they all compile the first program listed above. I’m beginning to suspect that csc might actually have a copy of Unicode 3.0 built into it; it certainly treats U+180E as a formatting character, but doesn’t like either U+0600 or U+00AD within identifiers. (U+0600 wasn’t introduced until Unicode 4.0, but has always been a formatting character; U+00AD was a “dash punctuation” character in Unicode 3.0, and became a formatting character in 4.0.)

The table built into mscorlib has definitely changed over time, however. If you run a simple program such as this:

using System;

class Test
{
    static void Main()
    {
        Console.WriteLine(Environment.Version);
        Console.WriteLine(char.GetUnicodeCategory('\u180e'));
    }
}

then running under CLRv2, the result is “SpaceSeparator” whereas running on CLRv4 (at least on a recently-updated system), the result is “Format”.

Of course, Roslyn won’t run under old CLRs, but we have hope by way of csharppad.com – which runs Roslyn in an environment (of uncertain origin – Mono? I’m unsure) which prints “SpaceSeparator” for the above. Sure enough the first program fails to compile – but it’s harder to check the second program, as csharppad.com doesn’t allow you to upload source code, and copy/paste produces some odd results.

Reality: mcs (Mono C# compiler)

Mono’s compiler uses the BCL GetUnicodeCategory code too, which should make it significantly simpler to experiment – but unfortunately, the Mono parser has (at least) two bugs in it:

  • It will allow any Unicode escape sequence in an identifier, whether it’s an escape sequence for a valid identifier part or not. For example, string\u0020x = "" is valid under the Mono compiler. Filed as bug 24968. Source.
  • It doesn’t allow formatting characters within identifiers – it includes characters in classes Mn, Mc, Nd and Pc, but not Cf. Filed as bug 24969. Source.

For this reason, the first program always compiles and prints “b” whereas the second program always fails to compile, regardless of whether U+180E is treated as being in Zs or Cf.

What version is this, anyway?

Next, let’s think about the Unicode data itself. It’s not at all clear which version any particular BCL implementation is actually using. Consider this little program:

using System;

class Test
{
    static void Main()
    {
        Console.WriteLine(char.GetUnicodeCategory('\u00ad'));
        Console.WriteLine(char.GetUnicodeCategory('\u0600'));
        Console.WriteLine(char.GetUnicodeCategory('\u180e'));
    }
}

On my computer, under CLR v4 this prints “DashPunctuation, Format, Format”, and under both Mono (3.3.0) and CLR v2 it prints “DashPunctuation, Format, SpaceSeparator”.

That’s very odd. It doesn’t correspond with any version of the Unicode standard, as far as I can tell:

  • U+00AD was a Po (other, punctuation) character in Unicode 1.x, then Pd (dash, punctuation) in 2.x and 3.x, and from 4.0 onwards has been Cf.
  • U+0600 was only introduced in Unicode 4.0, and has always been Cf
  • U+180E introduced as Cf in 3.0, then changed to Zs in 4.0, then back to Cf in 6.3.

So there is no version where the first line agrees with either the second line or the third line. I’m basically a bit baffled by this.

What about nameof and CallerMemberName?

The names of identifiers aren’t only used for comparisons – they’re available as strings without any reflection being involved at all. From C# 5, we’ve had CallerMemberName attribute, allowing things like:

public static void X\u0600y()
{
    ShowCaller();
}

public static void ShowCaller([CallerMemberName] string caller = null)
{
    Console.WriteLine("Called by {0}", caller);
}

And in C# 6, we can write:

string x\u0600y = "";
Console.WriteLine("nameof = {0}", nameof(x\u0600y));

What should those print? They do just print “Xy” and “xy” as the names (respectively), as if the compiler has simply thrown away the formatting character entirely. But what should they print? Bear in mind that in the second case, we could easily have used nameof(xy) and that would still have compared equal to the declared identifier.

We can’t even say “What’s the name of the member being declared?” because you can overload with “different but equal” identifiers:

public static void Xy() {}
public static void X\u0600y() {}
public static void X\u070fy() {}
...
Console.WriteLine(nameof(X\u200by));

What should that print out? I’m sure you’ll be relieved to hear that the C# team has a plan in place – but fundamentally this is one of these “no obvious right answer” scenarios. It gets even weirder when you bring the CLI specification into the mix. Section I.8.5.1 of ECMA-335 6th edition has:

Assemblies shall follow Annex 7 of Technical Report 15 of the Unicode Standard
3.0 governing the set of characters permitted to start and be included in identifiers, available online
at http://www.unicode.org/unicode/reports/tr15/tr15-18.html. Identifiers shall be in the
canonical format defined by Unicode Normalization Form C. For CLS purposes, two identifiers
are the same if their lowercase mappings (as specified by the Unicode locale-insensitive, one-to-one
lowercase mappings) are the same. That is, for two identifiers to be considered different
under the CLS they shall differ in more than simply their case. However, in order to override an
inherited definition the CLI requires the precise encoding of the original declaration be used.

I would love to explore the impact of this by adding a Cf character into IL, but unfortunately I haven’t worked out a way of affecting the encoding of ilasm, in order to persuade it that my hacked up IL is what I want it to be.

Conclusion

As noted before, text is hard.

It turns out that even when restricting oneself to identifiers, text is hard. Who would’ve thought?

When is a string not a string?

When is a string not a string?

As part of my “work” on the ECMA-334 TC49-TG2 technical group, standardizing C# 5 (which will probably be completed long after C# 6 is out… but it’s a start!) I’ve had the pleasure of being exposed to some of the interesting ways in which Vladimir Reshetnikov has tortured C#. This post highlights one of the issues he’s raised. As usual, it will probably never impact 99.999% of C# developers… but it’s a lovely little problem to look at.

Relevant specifications referenced in this post:
– The Unicode Standard, version 7.0.0 – in particular, chapter 3
C# 5 (Word document)
ECMA-335 (CLI specification)

What is a string?

How would you define the string (or System.String) type? I can imagine a number of responses to that question, from vague to pretty specific, and not all well-defined:

  • “Some text”
  • A sequence of characters
  • A sequence of Unicode characters
  • A sequence of 16-bit characters
  • A sequence of UTF-16 code units

The last of these is correct. The C# 5 specification (section 1.3) states:

Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.

So far, so good. But that’s C#. What about IL? What does that use, and does it matter? It turns out that it does… Strings need to be represented in IL as constants, and the nature of that representation is important, not only in terms of the encoding used, but how the encoded data is interpreted. In particular, a sequence of UTF-16 code units isn’t always representable as a sequence of UTF-8 code units.

I feel ill (formed)

Consider the C# string literal "X\uD800Y". That is a string consisting of three UTF-16 code units:

  • 0x0058 – ‘X’
  • 0xD800 – High surrogate
  • 0x0059 – ‘Y’

That’s fine as a string – it’s even a Unicode string according to the spec (item D80). However, it’s ill-formed (item D84). That’s because the UTF-16 code unit 0xD800 doesn’t map to a Unicode scalar value (item D76) – the set of Unicode scalar values explicitly excludes the high/low surrogate code points.

Just in case you’re new to surrogate pairs: UTF-16 only deals in 16-bit code units, which means it can’t cope with the whole of Unicode (which ranges from U+0000 to U+10FFFF inclusive). If you want to represent a value greater than U+FFFF in UTF-16, you need to use two UTF-16 code units: a high surrogate (in the range 0xD800 to 0xDBFF) followed by a low surrogate (in the range 0xDC00 to 0xDFFF). So a high surrogate on its own makes no sense. It’s a valid UTF-16 code unit in itself, but it only has meaning when followed by a low surrogate.

Show me some code!

So what does this have to do with C#? Well, string constants have to be represented in IL somehow. As it happens, there are two different representations: most of the time, UTF-16 is used, but attribute constructor arguments use UTF-8.

Let’s take an example:

using System;
using System.ComponentModel;
using System.Text;
using System.Linq;

[Description(Value)]
class Test
{
    const string Value = "X\ud800Y";

    static void Main()
    {
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }

    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}

The output of this code (under .NET) is:

Attribute: 0058 fffd fffd 0059
Constant: 0058 d800 0059

As you can see, the “constant” (Test.Value) has been preserved as a sequence of UTF-16 code units, but the attribute property has U+FFFD (the Unicode replacement character which is used to indicate broken data when decoding binary to text). Let’s dig a little deeper and look at the IL for the attribute and the constant:

.custom instance void [System]System.ComponentModel.DescriptionAttribute::.ctor(string)
= ( 01 00 05 58 ED A0 80 59 00 00 )

.field private static literal string Value
= bytearray (58 00 00 D8 59 00 )

The format of the constant (Value) is really simple – it’s just little-endian UTF-16. The format of the attribute is specified in ECMA-335 section II.23.3. Here, the meaning is:

  • Prolog (01 00)
  • Fixed arguments (for specified constructor signature)
    • 05 58 ED A0 80 59 (a single string argument as a SerString)
      • 05 (the length, i.e. 5, as a PackedLen)
      • 58 ED A0 80 59 (the UTF-8-encoded form of the string)
  • Number of named arguments (00 00)
  • Named arguments (there aren’t any)

The interesting part is the “UTF-8-encoded form of the string” here. It’s not valid UTF-8, because the input isn’t a well-formed string. The compiler has taken the high surrogate, determined that there isn’t a low surrogate after it, and just treated it as a value to be encoded in the normal UTF-8 way of encoding anything in the range U+0800 to U+FFFF inclusive.

It’s worth noting that if we had a full surrogate pair, UTF-8 would encode the single Unicode scalar value being represented, using 4 bytes. For example, if we change the declaration of Value to:

const string Value = "X\ud800\udc00Y";

then the UTF-8 bytes in the IL are 58 F0 90 80 80 59 – where F0 90 80 80 is the UTF-8 encoding for U+10000. That’s a well-formed string, and we get the same value for both the description attribute and the constant.

So in our original example, the string constant (encoded as UTF-16 in the IL) is just decoded without checking whether or not it’s ill-formed, whereas the attribute argument (encoded as UTF-8) is decoded with extra validation, which detects the ill-formed code unit sequence and replaces it.

Encoding behaviour

So which approach is right? According to the Unicode specification (item C10) both could be fine:

When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.

and

Conformant processes cannot interpret ill-formed code unit sequences. However, the conformance clauses do not prevent processes from operating on code unit sequences that do not purport to be in a Unicode character encoding form. For example, for performance reasons a low-level string operation may simply operate directly on code units, without interpreting them as characters. See, especially, the discussion under D89.

It’s not at all clear to me whether either the attribute argument or the constant value “purports to be in a Unicode character encoding form”. In my experience, very few pieces of documentation or specification are clear about whether they expect a piece of text to be well-formed or not.

Additionally, System.Text.Encoding implementations can often be configured to determine how they behave when encoding or decoding ill-formed data. For example, Encoding.UTF8.GetBytes(Value) returns byte sequence 58 EF BF BD 59 – in other words, it spots the bad data and replaces it with U+FFFD as part of the encoding… so decoding this value will result in X U+FFFD Y with no problems. On the other hand, if you use new UTF8Encoding(true, true).GetBytes(Value), an exception will be thrown. The first constructor argument is whether or not to emit a byte order mark under certain circumstances; the second one is what dictates the encoding behaviour in the face of invalid data, along with the EncoderFallback and DecoderFallback properties.

Language behaviour

So should this compile at all? Well, the language specification doesn’t currently prohibit it – but specifications can be changed :)

In fact, both csc and Roslyn do prohibit the use of ill-formed strings with certain attributes. For example, with DllImportAttribute:

[DllImport(Value)]
static extern void Foo();

This gives an error when Value is ill-formed:

error CS0591: Invalid value for argument to 'DllImport' attribute

There may be other attributes this is applied to as well; I’m not sure.

If we take it as read that the ill-formed value won’t be decoded back to its original form when the attribute is instantiated, I think it would be entirely reasonable to make it a compile-time failure – for attributes. (This is assuming that the runtime behaviour can’t be changed to just propagate the ill-formed string.)

What about the constant value though? Should that be allowed? Can it serve any purpose? Well, the precise value I’ve given is probably not terribly helpful – but it could make sense to have a string constant which ends with a high surrogate or starts with a low surrogate… because it can then be combined with another string to form a well-formed UTF-16 string. Of course, you should be very careful about this sort of thing – read the Unicode Technical Report 36 “Security Considerations” for some thoroughly alarming possibilities.

Corollaries

One interesting aspect to all of this is that “string encoding arithmetic” doesn’t behave as you might expect it to. For example, consider this method:

// Bad code!
string SplitEncodeDecodeAndRecombine
    (string input, int splitPoint, Encoding encoding)
{
    byte[] firstPart = encoding.GetBytes(input.Substring(0, splitPoint));
    byte[] secondPart = encoding.GetBytes(input.Substring(splitPoint));
    return encoding.GetString(firstPart) + encoding.GetString(secondPart);            
}

You might expect that this would be a no-op so long as everything is non-null and splitPoint is within range… but if you happen to split in the middle of a surrogate pair, it’s not going to be happy. There may well be other potential problems lurking there, depending on things like normalization form – I don’t think so, but at this point I’m unwilling to bet too heavily on string behaviour.

If you think the above code is unrealistic, just imagine partitioning a large body of text, whether that’s across network packets, files, or whatever. You might feel clever for realizing that without a bit of care you’d get binary data split between UTF-16 code units… but even handling that doesn’t save you. Yikes.

I’m tempted to swear off text data entirely at this point. Floating point is a nightmare, dates and times… well, you know my feelings about those. I wonder what projects are available that only need to deal with integers, and where all operations are guaranteed not to overflow. Let me know if you have any.

Conclusion

Text is hard.

Violating the “smart enum” pattern in C#

For a while now, I’ve been a big fan of a pattern in C# which mimics Java enums to a certain extent. In general, it’s a lovely pattern. Only after reading a comment on a recent blog post by Eric Lippert did I find out about a horrible flaw. Dubious thanks to John Payson for prompting this post. (And I fully intend to include this in the Abusing C# talk that I give every so often…)

Trigger warning: this post contains code you may wish you’d never seen.

Let’s start off with the happy path – a moment of calm before the storm.

When things go well…

The idea is to create a class – which can be abstract – with a certain fixed set of instances, which are available via static properties on the class itself. The class is not sealed, but it prevents other classes from subclassing it by using a private constructor. Here’s an example, including how it can be used:

using System;

public abstract class Operation
{
    public static Operation Add { get; } = new AddOperation();
    public static Operation Subtract { get; } = new SubtractOperation();

    // Prevent non-nested subclasses. They can't call this constructor,
    // which is the only way of creating an instance.
    private Operation()
    {}

    public abstract int Apply(int lhs, int rhs);

    private class AddOperation : Operation
    {
        public override int Apply(int lhs, int rhs)
        {
            return lhs + rhs;
        }
    }

    private class SubtractOperation : Operation
    {
        public override int Apply(int lhs, int rhs)
        {
            return lhs - rhs;
        }
    }
}

public class Test
{
    static void Main()
    {
        Operation add = Operation.Add;
        Console.WriteLine(add.Apply(5, 3)); // 8
    }
}

(Note the use of the new “automatically implemented read-only property” from C# 6. Yum!)

Obviously there are possible variations of this – there could be multiple instances of one subclass, or we could even create instances dynamically if that were useful. The main point is that no other code can create instances or even declare their own subclasses of Operation. That private constructor stops everything, right? Um…

My eyes! The goggles do nothing!

I’m sure all my readers are aware that every C# constructor either has to chain to another constructor using : this(...) or chain to a base class constructor, either implicitly calling a constructor without arguments or explicitly using : base(...).

What I wasn’t aware of – or rather, I think I’d been aware of once, but ignored – was that constructor initializers can have cycles. In other words, it’s absolutely fine to write code like this:

public class SneakyOperation : Operation
{
    public SneakyOperation() : this(1)
    {}

    public SneakyOperation(int i) : this()
    {}

    public override int Apply(int lhs, int rhs)
    {
        return lhs * rhs;
    }
}

Neither operator attempts to call a base class constructor, so it doesn’t matter that the only base class constructor is private. This code compiles with no issues – not even a warning.

So we’ve already failed in our intention to prevent subclassing from occurring… but this will fail with a StackOverflowException which will take down the process, so there’s not much harm, right?

Well… suppose we didn’t let it throw a StackOverflowException. Suppose we triggered our own exception first. There are various ways of doing this. For example, we could use arithmetic:

public SneakyOperation(int x) : this(x, 0)
{}

// This isn't going to be happy if y is 0...
public SneakyOperation(int x, int y) : this(x / y)
{}

Or we could have one constructor call another with a delegate which will deliberately throw:

public SneakyOperation(int x) : this(() => { throw new Exception("Bang!"); })
{}

public SneakyOperation(Func<int> func) : this(func())
{}

So now, we have a class where you can call the constructor, and an exception will be thrown. But hey, you still can’t get at the instance which was created, right? Well, this is tricky. We genuinely can’t use this anywhere in code which we expect to execute – there’s no way of reaching a constructor body, because in order to do that we’d have to have some route which took us through the base class constructor, which was can’t access. Field initializers aren’t allow to use this, either explicitly or implicitly (e.g. by calling instance methods). If there’s a way of circumventing that, I’m not aware of it.

So, we can’t capture a reference to the instance before throwing the exception. And after the exception is thrown, it’s garbage, and therefore unavailable… unless we take it out of the garbage bin, of course. Enter a finalizer…

private static volatile Operation instance;

~SneakyOperation()
{
    // Resurrect the instance.
    instance = this;
}

Now all we have to do is wrap up the "create an object and wait for it to be resurrected" logic in a handy factory method:

public static Operation GetInstance()
{
    try
    {
        new SneakyOperation(0);
    }
    catch {} // Thoroughly expected
    Operation local;
    while ((local = instance) == null)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
    return local;
}

Admittedly if multiple threads call GetInstance() at a time, there’s no guarantee which thread will get which instance, or whether they’ll get the same instance… but that’s fine. There’s very little to guarantee this will ever terminate, in fact… but I don’t care much, as clearly I’m never going to use this code in production. It works on my machine, for the few times I’ve been able to bear running it. Here’s a complete example (just combine it with the Operation class as before) in case you want to run it yourself:

public class SneakyOperation : Operation
{
    private static volatile Operation instance;

    private SneakyOperation(int x) : this(() => { throw new Exception(); })
    {}

    private SneakyOperation(Func<int> func) : this(func())
    {}

    public static Operation GetInstance()
    {
        try
        {
            new SneakyOperation(0);
        }
        catch {}
        Operation local;
        while ((local = instance) == null)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
        }
        return local;
    }

    ~SneakyOperation()
    {
        instance = this;
    }

    public override int Apply(int lhs, int rhs)
    {
        return lhs * rhs;
    }
}

public class Test
{
    static void Main()
    {
        Operation sneaky = SneakyOperation.GetInstance();
        Console.WriteLine(sneaky.Apply(5, 3)); // 15
    }
}

Vile, isn’t it?

Fixing the problem

The first issue is the cyclic constructors. That’s surely a mistake in the language specification. (It’s not a bug in the compiler, as far as I can tell – I can’t find anything in the spec to prohibit it.) This can be fixed, and it shouldn’t even be too hard to do. There’s nothing conditional in constructor initializers; each constructor chains to exactly one other constructor, known via overload resolution at compile time. Even dynamic doesn’t break this – constructor initializers can’t be dispatched dynamically. This makes detecting a cycle trivial – from each constructor, just follow the chain of constructor calls until either you reach a base class constructor (success) or you reach one of the constructor declarations already in the chain (failure). I will be pestering Mads for this change – it may be too late for C# 6, but it’s never too early to ask about C# 7. This will be a breaking change – but it will only break code which is tragically broken already, always resulting in either an exception or a stack overflow.

Of course, a C# language change wouldn’t change the validity of such code being created through IL. peverify appears to be happy with it, for example. Still, a language fix would be a good start.

There’s a second language change which could help to fix this too – the introduction of private abstract/virtual methods. Until you consider nested classes, neither of those make sense – but as soon as you remember that there can be subclasses with access to private members, it makes perfect sense to have them. If Operation had a private abstract method (probably exposed via a public concrete method) then SneakyOperation (and any further subclasses) would have to be abstract, as it couldn’t possibly override the method. I’ve already blogged about something similar with public abstract classes containing internal abstract methods, which prevents concrete subclasses outside that assembly… but this would restrict inheritance even further.

Does it reach the usefulness bar required for new language features? Probably not. Would it be valid IL with the obvious meaning? I’ve performed experiments which suggest that it would be valid, but not as restrictive as one would expect – but I could be mistaken. I think there’s distinct merit in the idea, but I’m not expecting to see it come to pass.

Conclusion

So, what can you take away from this blog post? For starters, the fact that the language can probably be abused in far nastier ways than you’ve previously imagined – and that it’s very easy to miss the holes in the spec which allow for that abuse. (It’s not like this hole is a recent one… it’s been open forever.) You should absolutely not take away the idea that this is a recommended coding pattern – although the original pattern (for Operation) is a lovely one which deserves more widespread recognition.

Finally, you can take away the fact that nothing gets my attention quicker than a new (to me) way of abusing the language.