Category Archives: Evil code

Surprise! Creating an instance of an open generic type

This is a brief post documenting a very weird thing I partly came up with on Stack Overflow today.

The context is this question. But to skip to the shock, we end up with code like this:

object x = GetWeirdValue();
// This line prints True. Be afraid - be very afraid!
Console.WriteLine(x.GetType().GetTypeInfo().IsGenericTypeDefinition);

That just shouldn’t happen. You shouldn’t be able to create an instance of an open type – a type that still contains generic type parameters. What does a List<T> (rather than a List<string> or List<int>) mean? It’s like creating an instance of an abstract class.

Before today, I’d have expected it to be impossible – the CLR should just not allow such an object to exist. I now know one – and only one – way to do it. While you can’t get normal field values for an open generic type, you can get constants… after all, they’re constant values, right? That’s fine for most constants, because those can’t be generic types – int, string etc. The only type of constant with a user-defined type is an enum. Enums themselves aren’t generic, of course… but what if it’s nested inside another generic type, like this:

class Generic<T>
{
    enum GenericEnum
    {
        Foo = 0
    }
}

Now Generic<>.Enum is an open type, because it’s nested in an open type. Using Enum.GetValues(typeof(Generic<>.GenericEnum)) fails in the expected way: the CLR complains that it can’t create instances of the open type. But if you use reflection to get at the constant field representing Foo, the CLR magically converts the underlying integer (which is what’s in the IL of course) into an instance of the open type.

Here’s the complete code:

using System;
using System.Reflection;

class Program
{
    static void Main(string[] args)
    {
        object x = GetWeirdValue();
        // This line prints True
        Console.WriteLine(x.GetType().GetTypeInfo().IsGenericTypeDefinition);
    }

    static object GetWeirdValue() =>
        typeof(Generic<>.GenericEnum).GetTypeInfo()
            .GetDeclaredField("Foo")
            .GetValue(null);

    class Generic<T>
    {
        public enum GenericEnum
        {
            Foo = 0
        }
    }
}

… and the corresponding project file, to prove it works for both the desktop and .NET Core…

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>netcoreapp1.0;net45</TargetFrameworks>
  </PropertyGroup>

</Project>

Use this at your peril. I expect that many bits of code dealing with reflection would be surprised if they were provided with a value like this…

It turns out I’m not the first one to spot this. (That would be pretty unlikely, admittedly.) Kirill Osenkov blogged two other ways of doing this, discovered by Vladimir Reshetnikov, back in 2014.

To base() or not to base(), that is the question

Today I’ve been reviewing the ECMA-334 C# specification, and in particular the section about class instance constructors.

I was struck by this piece in a clause about default constructors:

If a class contains no instance constructor declarations, a default instance constructor is automatically provided. That default constructor simply invokes the parameterless constructor of the direct base class.

I believe this to be incorrect, and indeed it is, as shown here (in C# 6 code for brevity, despite this being the C# 5 spec that I’m reviewing; that’s irrelevant in this case):

using System;

class Base
{
    public int Foo { get; }

    public Base(int foo = 5)
    {
        Foo = foo;
    }
}

class Derived : Base
{    
}

class Test
{
    static void Main()
    {
        var d = new Derived();
        Console.WriteLine(d.Foo); // Prints 5
    }    
}

Here the default constructor in Derived clearly doesn’t execute a parameterless constructor in Base because there is no parameterless constructor in Base. Instead, it executes the parameterized constructor, providing the default argument value.

So, I considered whether we could reword the standard to something like:

If a class contains no instance constructor declarations, a default instance constructor is automatically provided. That default constructor simply invokes a constructor of the direct base class as if the default constructor contained a constructor initializer of base().

But is that always the case? It turns out it’s not – at least not in Roslyn. There are more interesting optional parameters we can use than just int foo = 5. Let’s have a look:

using System;
using System.Runtime.CompilerServices;

class Base
{
    public string Origin { get; }

    public Base([CallerMemberName] string name = "Unspecified",
                [CallerFilePath] string source = "Unspecified",                
                [CallerLineNumber] int line = -1)
    {
        Origin = $"{name} - {source}:{line}";
    }
}

class Derived1 : Base {}
class Derived2 : Base
{
    public Derived2() {}
}
class Derived3 : Base
{
    public Derived3() : base() {}
}

class Test
{
    static void Main()
    {
        Console.WriteLine(new Derived1().Origin);
        Console.WriteLine(new Derived2().Origin);
        Console.WriteLine(new Derived3().Origin);
    }    
}

The result is:

Unspecified - Unspecified:-1
Unspecified - Unspecified:-1
.ctor - c:\Users\Jon\Test\Test.cs:23

When base() is explicitly specified, that source location is treated as the “caller” for caller member info attributes. When it’s implicit (including when there’s a default constructor), no source location is made available to the Base constructor.

This is somewhat compiler-specific – and I can imagine different results where the default constructor could specify a name but not source file or line number, and the declared constructor with an implicit call could specify the name and source file but no line number.

I would never suggest using this little tidbit of Roslyn implementation trivia, but it’s fun nonetheless…

“Sideways overriding” with partial methods

First note: this blog post is very much tongue in cheek. I’m not actually planning on using the idea. But it was too fun not to share.

As anyone following my activity on GitHub may be aware, I’ve been quite a lot of work on Protocol Buffers recently – in particular, a mostly-new port for proto3. I’ve recently been looking at JSON support, and thinking about how to implement “overriding” ToString() for a few well-known types. I generate partial classes, so that gives me a hook to provide extra functionality. Indeed, I’m planning on using this to provide conversion methods for Timestamp and Duration, for example. However, you can’t really override anything in partial methods.

Refresher on partial methods

While partial classes were introduced in C# 2, partial methods were introduced in C# 3. The idea is that one source file (usually the generated one) can provide a partial method signature, and another source file (usually the manually-written one) can provide an implementation if it wants to. Any part of the source can call the method, and the call will be removed at compile-time if nothing provides an implementation. The fact that the method may not be there leads to some limitations:

  • Partial methods are implicitly private, but you can’t specify an access modifier explicitly
  • Partial methods are always void – they can’t return any values
  • Partial methods cannot have out parameters

(Interestingly, a partial method implementation can be an async method – but with a return type of void, which is never a nice situation to be in.)

There’s more in the spec, but the last two bullets are the important part.

So, suppose I want to override ToString() in the generated code, but provide a mechanism for that override to be “further overridden” effectively, in the manual code for the same class? How do I get the value from an “extra override”? How do I even detect whether or not it’s there?

Side effects to the rescue!

(Now there’s a phrase you never thought you’d hear from me.)

I mentioned before that if a partial method is called but no implementation is provided, the call is removed. That includes all aspects of the call – including the evaluation of the arguments. So if evaluating the argument has a side-effect… we can spot that side effect.

Next, we have to work out how to get a value back from a method. We can’t use the return value, and we can’t use an out parameter. There are two options here: we could either pass a wrapper (e.g. an array with a single element) and allow the “extra override” to populate the wrapper… or we can use a ref parameter. The latter feels ever-so-slightly cleaner to me.

And so the ugly hack is born. The code generator can always generate code like this:

partial void ToStringOverride(bool ignored, ref string value);

public override string ToString()
{
    string value = null;
    bool overridden = false;
    ToStringOverride(overridden = true, ref value);
    return overridden ? value : "Original";
}

For any partial class where the ToStringOverride method isn’t implemented, overridden will still be false, so we’ll fall back to returning "Original". (I would hope that any decent JIT would remove the overridden and value local variables entirely at that point.) Otherwise, we’ll return whatever the method has changed value to.

Here’s a short but complete example:

using System;

// Generated code
partial class UglyHack1
{
    partial void ToStringOverride(bool ignored, ref string value);

    public override string ToString()
    {
        string value = null;
        bool overridden = false;
        ToStringOverride(overridden = true, ref value);
        return overridden ? value : "Original";
    }
}

// Generated code
partial class UglyHack2
{
    partial void ToStringOverride(bool ignored, ref string value);

    public override string ToString()
    {
        string value = null;
        bool overridden = false;
        ToStringOverride(overridden = true, ref value);
        return overridden ? value : "Original";        
    }    
}

// Manual code
partial class UglyHack2
{
    partial void ToStringOverride(bool ignored, ref string value)
    {
        value = "Different!";
    }
}

class Test
{
    static void Main()
    {
        var g1 = new UglyHack1();
        var g2 = new UglyHack2();

        Console.WriteLine(g1);
        Console.WriteLine(g2);
    }
}

Horribly ugly, but it works…

Alternatives?

Obviously this isn’t really pleasant. Some alternatives:

  • Derive from the generated class in order to override ToString again. Doesn’t work with sealed classes, and will only work if clients create instances of the derived class.
  • Introduce a new interface, and allow manual code to implement it on the partial class. The ToString method can then check this is IMyOtherToString or whatever, and call it appropriately. This introduces another virtual call for no great reason, and exposes the interface to the outside world, which we may not want to do.
  • Don’t override ToString in the generated code at all. Not good if you normally want to override it.
  • Introduce an abstract base class which the generated class derives from. Override ToString() in that base class, possibly calling an abstract member which is then provided in the generated class – but allowing the manual code to override ToString() again.

Conclusion

Ugly hacks are fun. But it’s much better to keep them where it belongs: in a blog post, not in production code.

Backwards compatibility is (still) hard

At the moment, I’m spending a fair amount of time thinking about a new version of the C# API and codegen for Protocol Buffers, as well as other APIs for interacting with Google services. While that’s the context for this post, I want to make it very clear that this is still a personal post, and should in no way be taken to be “Google’s opinion” on anything. The underlying issue could apply in many other situations, but it’s easiest to describe a concrete scenario.

Context and current state: the builder pattern

The problem I’ve been trying to address is the relative pain of initializing a protobuf message. Protocol buffer messages are declared in a separate schema file (.proto) and then code is generated. The schema declares fields, each of which has a name, a type and a number associated with it. The generated message types are immutable, with builder classes associated with them. So for example, we might start off with a message like this:

message Person {
  string first_name = 1;
  string last_name = 3;
}

And construct a Person object in C# like this:

var person = new Person.Builder { FirstName = "Jon", LastName = "Skeet" }.Build();
// Now person.FirstName and person.LastName are readonly properties

That’s not awful, but it’s not the cleanest code in the world. We can make it slightly simpler using an implicit conversion from the builder type to the message type:

Person person = new Person.Builder { FirstName = "Jon", LastName = "Skeet" };

It’s still not really clean though. Let’s revisit why the builder pattern is useful:

  • We can specify just the properties we want.
  • By deferring the “build” step until after we’ve specified everything, we get mutability without building a lot of intermediate objects.

If only there were another language construct allowing that…

Optional parameters to the rescue!

If we provided a constructor with an optional parameter for each property, we can specify just what we want. So something like:

public Person(string firstName = null, string lastName = null)
...
var person = new Person(firstName: "Jon", lastName: "Skeet");

Hooray! That looks much nicer:

  • We can use var (if we want to) because there are no implicit conversions to confuse things.
  • We don’t need to mention a builder at all.
  • Every piece of text in the statement is something we want to express, and we only express it once.

That last point is a lovely place to be in terms of API design – while you still need to worry about naming, ordering and how the syntax fits into bigger expressions, you’ve achieved some sense of “as simple as possible, but no simpler”.

So, that’s all great – except for versioning.

Let’s just add a field at the end…

One of the aims of protocol buffers is to support an evolving schema. (The limitations are different for proto2 and proto3, but that’s a slightly different matter.) So what happens if we add a new field to the message?

message Person {
  string first_name = 1;
  string last_name = 3;
  string title = 4; // Mr, Mrs etc
}

Now we end up with the following constructor:

public Person(string firstName = null, string lastName = null, string title = null)

The code still compiles – but if we try to use run our old client code against the new version of the library, it will fail – because the method it refers to no longer exists. So we have source compatibility, but not binary compatibility.

Let’s just add a field in the middle…

You may have noticed that I don’t have a field with tag 2 – this is not an accident. Suppose we now add it, for the obvious middle_name field:

message Person {
  string first_name = 1;
  string middle_name = 2;
  string last_name = 3;
  string title = 4; // Mr, Mrs etc
}

Regenerate the code, and we end up with a constructor with 4 parameters:

public Person(
    string firstName = null,
    string middleName = null,
    string lastName = null,
    string title = null)

Just to be clear, this change is entirely fine in protocol buffers – while normally fields are assigned incrementally, it shouldn’t be a breaking change to add a new field “between” existing ones.

Let’s take a look at our client code again:

var person = new Person(firstName: "Jon", lastName: "Skeet");

Yup, that still works – we need to recompile, but we still end up with a Person with the right properties. But that’s not the only code we could have started with. Suppose we’d actually had:

var person = new Person("Jon", "Skeet");

Until this last change that would have been fine – even after we’d added the optional title parameter, the two arguments would still have mapped to firstName and lastName respectively.

Having added the middle_name field, however, the code would still compile with no errors or warnings, but the meaning of the second argument would have changed – it would now map onto the middleName parameter instead of lastName.

Basically, we’d like to stop this code (using positional arguments) from compiling in the first place.

Feature requests and a workaround

The two features we really want from C# here are:

  • Some way of asking the generated code to perform dynamic overload resolution at execution time… not based on dynamic values, but on the basis that the code we’re compiling against may have changed since we compiled. This resolution only needs to be performed once, on first execution (or class load, or whatever) as by the time we’re executing, everything is fixed (the parameter names and types, and the argument names and types). It could be efficient.
  • Some way of forcing any call sites to use named arguments for any optional parameters. (Even though in our case all the parameters are optional, I can easily imagine a case where there are a few required parameters and then the optional ones. Using positional arguments for those required parameters is fine.)

It’s hard (without forking Roslyn :) to implement these feature requests ourselves, but for the second one we can at least have a workaround. Consider the following struct:

public struct DoNotCallThisMethodWithPositionalArguments {}

… and now imagine our generated constructor had been :

public Person(
    DoNotCallThisMethodWithPositionalArguments ignoreMe =
        default(DoNotCallThisMethodWithPositionalArguments),
    string firstName = null,
    string middleName = null,
    string lastName = null,
    string title = null)

Now our constructor call using positional arguments will fail, because there’s no conversion from string to the crazily-named struct. The only “nice” way you can call it is to use named arguments, which is what we wanted. You could call it using positional arguments like this:

var person = new Person(
    new DoNotCallThisMethodWithPositionalArguments(),
    "Jon",
    "Skeet");

(or using default(...) like the constructor declaration) – but at this point the code looks broken, so it’s your own fault if you decide to use it.

The reason for making it a struct rather than a class is to avoid null being convertible to it. Annoyingly, it wouldn’t be hard to make a class that you could never create an actual instance of, but you can’t prevent anyone from creating a value of a struct. Basically, what we really want is a type such that there is no valid expression which is convertible to that type – but aside from static classes (which can’t be used as parameter types) I don’t know of any way of doing that. (I don’t know what would happen if you compiled the Person class using a non-static class as the first parameter, then made that class static and recompiled it. Confusion on the part of the C# compiler, I should think.)

Another option (as mentioned in the comments) is to have a “poor man’s” version of the compiler enforcement via a Roslyn Code Diagnostic – add an attribute to any method call where you want “All optional parameters must be specified with named arguments” to apply, and then make the code diagnostic complain if you disobey that. That diagnostic could ship with the Protocol Buffers NuGet package, which would make for a pretty nice experience. Not quite as good as a language feature though :)

Conclusion

Default parameters are a pain in terms of compatibility. For internal company code, it’s often reasonable to only care about source compatibility as you can recompile all calling code before deployment – but for open source projects, binary compatibility within the same major version is the norm.

How useful and common do I think these features would be? Probably not common enough to meet the bar – unless there’s encouragement within comments here, in which case I’m happy to file feature requests on GitHub, of course.

As it happens, I’m currently looking at radical changes to the C# implementation of Protocol Buffers, regretfully losing the immutability aspect due to it raising the barrier to entry. It’s not quite a done deal yet, but assuming that goes ahead, all of this will be mostly irrelevant – for Protocol Buffers. There are plenty of other places where code generation could be more robustly backward-compatible through judicious use of optional-but-please-use-named-arguments parameters though…

When is an identifier not an identifier? (Attack of the Mongolian Vowel Separator)

Here’s a few things you may not be aware of:

  • C# identifiers can include Unicode escape sequences (\u1234 etc)
  • C# identifiers can include Unicode characters in the category “Other, formatting” (Cf) but these are ignored when comparing identifiers for equality
  • The Mongolian Vowel Separator (U+180E) has oscillated between the Cf and Zs categories a couple of times
  • .NET has its own copy of Unicode categories, separate from whatever Win32 might provide
  • Roslyn (built in .NET) uses the Unicode categories, whereas csc.exe (the “old” native C# compiler) uses either the Win32 categories or a built-in copy
  • Neither the .NET table nor the Win32 table necessarily reflects exactly what any one version of the Unicode standard says
  • Compilers can have bugs in

Put them together, and chaos ensues!

How this all started – blame Vladimir

I started looking at this based on a discussion in our ECMA technical group meeting last week, when we were considering the normative references – and in particular, which version of Unicode we were going to target. Currently the ECMA 4th edition spec targets Unicode 4.0 and the Microsoft C# 5 specification targets Unicode 3.0. It’s not clear to me whether any compilers actually take note of this, and moving forward we’d like both the ECMA and Microsoft standards to not specify a particular version of Unicode, effectively encouraging compiler authors to use the most recent one available to them. Despite the wrinkles listed below, I think that makes the most sense for real world uses – it’s crazy to require compilers to ship with their own private copies of Unicode, effectively.

When discussing this, Vladimir Reshetnikov mentioned the Mongolian Vowel Separator (U+180E) which has had an interesting life. It was introduced in Unicode 3.0.0, when it was in the Cf category (“other, formatting”). Then in Unicode 4.0.0 it was moved into the Zs (separator, space) category. In Unicode 6.3.0 it was then moved back to the Cf category.

Of course, my natural inclination was to try to abuse this. My initial aim was to come up with code which behaved differently depending on which version of Unicode the compiler was using. It turned out to be a little more complicated than that, but we’ll assume a hypothetical compiler first, with no bugs, but which obeys whichever version of the Unicode standard we want it to. (Arguably that’s already a bug given the requirements of the current C# specs, but we’ll set that aside.)

Hypothetical example 1: valid or invalid

For simplicity, let’s start with some source code which is all in ASCII:

class MvsTest
{
    static void Main()
    {
        string stringx = "a";
        string\u180ex = "b";
        Console.WriteLine(stringx);
    }
}

If the compiler is using Unicode 6.3 or higher (or a version earlier than 4.0) then U+180E is deemed to be in the Cf category, and is therefore valid within an identifier. In that case, it’s fine for it to be escaped as per the code above. At that point, the identifier in the second line of the method is deemed to be “the same as” stringx, so the output is b.

What about a compiler using a version of Unicode between 4.0 and 6.2 (inclusive) though? At that point, U+180E is deemed to be in the Zs category, which makes it a whitespace character. Zs characters are allowed as whitespace within C# programs – but not within identifiers. Once it’s not a valid identifier – and because this isn’t within a character/string literal – it’s invalid to use the Unicode escape sequence, so the code doesn’t compile.

Hypothetical example 2: valid two different ways

We can write the same code without using an escape sequence, however. If you create a regular ASCII file like this:

class MvsTest
{
    static void Main()
    {
        string stringx = "a";
        stringAAAx = "b";
        Console.WriteLine(stringx);
    }
}

then open it up in a hex editor and replace the AAA with bytes E1 A0 8E, then you’ve got a file containing the UTF-8 representation of U+180E at the same location as we had the Unicode escape sequence in the first version.

So, a compiler which compiled the first version would still compile this version (assuming you could tell it that the source code was UTF-8), and the results would be the same – it would print b as the second statement of the method would be a simple assignment to the existing variable.

However, a compiler which treats U+180E as whitespace and would therefore treat the first program as an error would accept this program – and treat that second statement as a declaration of a second local variable (x) and assign it an initial value. You might get a warning about the variable being unused, but it’s valid C# and the output is a.

Reality: the Microsoft compilers

Whenever we talk about the Microsoft C# compiler these days, we need to distinguish between the “native” compiler (csc) and Roslyn (rcsc, although typically I just call it Roslyn).

As it’s written in native code, csc uses whatever Windows supplies for its Unicode character tables – or it embeds it directly in the executable, potentially. (I’ve been scouring MSDN to find a Win32 native function to tell me the Unicode category of a specific code point, and failed so far. It would have been useful…)

Compare that with Roslyn, which is written in C# and (as far as I’m aware) uses char.GetUnicodeCategory – which in turn uses the Unicode tables built into mscorlib.

My experiments suggest that whatever the native compiler uses to get the Unicode category has treated U+180E as a formatting character forever. At least, I’ve tried to find old machines (including VM images) which haven’t had Windows update applied since September 2013 (which is when Unicode 6.3 was published) and they all compile the first program listed above. I’m beginning to suspect that csc might actually have a copy of Unicode 3.0 built into it; it certainly treats U+180E as a formatting character, but doesn’t like either U+0600 or U+00AD within identifiers. (U+0600 wasn’t introduced until Unicode 4.0, but has always been a formatting character; U+00AD was a “dash punctuation” character in Unicode 3.0, and became a formatting character in 4.0.)

The table built into mscorlib has definitely changed over time, however. If you run a simple program such as this:

using System;

class Test
{
    static void Main()
    {
        Console.WriteLine(Environment.Version);
        Console.WriteLine(char.GetUnicodeCategory('\u180e'));
    }
}

then running under CLRv2, the result is “SpaceSeparator” whereas running on CLRv4 (at least on a recently-updated system), the result is “Format”.

Of course, Roslyn won’t run under old CLRs, but we have hope by way of csharppad.com – which runs Roslyn in an environment (of uncertain origin – Mono? I’m unsure) which prints “SpaceSeparator” for the above. Sure enough the first program fails to compile – but it’s harder to check the second program, as csharppad.com doesn’t allow you to upload source code, and copy/paste produces some odd results.

Reality: mcs (Mono C# compiler)

Mono’s compiler uses the BCL GetUnicodeCategory code too, which should make it significantly simpler to experiment – but unfortunately, the Mono parser has (at least) two bugs in it:

  • It will allow any Unicode escape sequence in an identifier, whether it’s an escape sequence for a valid identifier part or not. For example, string\u0020x = "" is valid under the Mono compiler. Filed as bug 24968. Source.
  • It doesn’t allow formatting characters within identifiers – it includes characters in classes Mn, Mc, Nd and Pc, but not Cf. Filed as bug 24969. Source.

For this reason, the first program always compiles and prints “b” whereas the second program always fails to compile, regardless of whether U+180E is treated as being in Zs or Cf.

What version is this, anyway?

Next, let’s think about the Unicode data itself. It’s not at all clear which version any particular BCL implementation is actually using. Consider this little program:

using System;

class Test
{
    static void Main()
    {
        Console.WriteLine(char.GetUnicodeCategory('\u00ad'));
        Console.WriteLine(char.GetUnicodeCategory('\u0600'));
        Console.WriteLine(char.GetUnicodeCategory('\u180e'));
    }
}

On my computer, under CLR v4 this prints “DashPunctuation, Format, Format”, and under both Mono (3.3.0) and CLR v2 it prints “DashPunctuation, Format, SpaceSeparator”.

That’s very odd. It doesn’t correspond with any version of the Unicode standard, as far as I can tell:

  • U+00AD was a Po (other, punctuation) character in Unicode 1.x, then Pd (dash, punctuation) in 2.x and 3.x, and from 4.0 onwards has been Cf.
  • U+0600 was only introduced in Unicode 4.0, and has always been Cf
  • U+180E introduced as Cf in 3.0, then changed to Zs in 4.0, then back to Cf in 6.3.

So there is no version where the first line agrees with either the second line or the third line. I’m basically a bit baffled by this.

What about nameof and CallerMemberName?

The names of identifiers aren’t only used for comparisons – they’re available as strings without any reflection being involved at all. From C# 5, we’ve had CallerMemberName attribute, allowing things like:

public static void X\u0600y()
{
    ShowCaller();
}

public static void ShowCaller([CallerMemberName] string caller = null)
{
    Console.WriteLine("Called by {0}", caller);
}

And in C# 6, we can write:

string x\u0600y = "";
Console.WriteLine("nameof = {0}", nameof(x\u0600y));

What should those print? They do just print “Xy” and “xy” as the names (respectively), as if the compiler has simply thrown away the formatting character entirely. But what should they print? Bear in mind that in the second case, we could easily have used nameof(xy) and that would still have compared equal to the declared identifier.

We can’t even say “What’s the name of the member being declared?” because you can overload with “different but equal” identifiers:

public static void Xy() {}
public static void X\u0600y() {}
public static void X\u070fy() {}
...
Console.WriteLine(nameof(X\u200by));

What should that print out? I’m sure you’ll be relieved to hear that the C# team has a plan in place – but fundamentally this is one of these “no obvious right answer” scenarios. It gets even weirder when you bring the CLI specification into the mix. Section I.8.5.1 of ECMA-335 6th edition has:

Assemblies shall follow Annex 7 of Technical Report 15 of the Unicode Standard
3.0 governing the set of characters permitted to start and be included in identifiers, available online
at http://www.unicode.org/unicode/reports/tr15/tr15-18.html. Identifiers shall be in the
canonical format defined by Unicode Normalization Form C. For CLS purposes, two identifiers
are the same if their lowercase mappings (as specified by the Unicode locale-insensitive, one-to-one
lowercase mappings) are the same. That is, for two identifiers to be considered different
under the CLS they shall differ in more than simply their case. However, in order to override an
inherited definition the CLI requires the precise encoding of the original declaration be used.

I would love to explore the impact of this by adding a Cf character into IL, but unfortunately I haven’t worked out a way of affecting the encoding of ilasm, in order to persuade it that my hacked up IL is what I want it to be.

Conclusion

As noted before, text is hard.

It turns out that even when restricting oneself to identifiers, text is hard. Who would’ve thought?

When is a string not a string?

When is a string not a string?

As part of my “work” on the ECMA-334 TC49-TG2 technical group, standardizing C# 5 (which will probably be completed long after C# 6 is out… but it’s a start!) I’ve had the pleasure of being exposed to some of the interesting ways in which Vladimir Reshetnikov has tortured C#. This post highlights one of the issues he’s raised. As usual, it will probably never impact 99.999% of C# developers… but it’s a lovely little problem to look at.

Relevant specifications referenced in this post:
– The Unicode Standard, version 7.0.0 – in particular, chapter 3
C# 5 (Word document)
ECMA-335 (CLI specification)

What is a string?

How would you define the string (or System.String) type? I can imagine a number of responses to that question, from vague to pretty specific, and not all well-defined:

  • “Some text”
  • A sequence of characters
  • A sequence of Unicode characters
  • A sequence of 16-bit characters
  • A sequence of UTF-16 code units

The last of these is correct. The C# 5 specification (section 1.3) states:

Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.

So far, so good. But that’s C#. What about IL? What does that use, and does it matter? It turns out that it does… Strings need to be represented in IL as constants, and the nature of that representation is important, not only in terms of the encoding used, but how the encoded data is interpreted. In particular, a sequence of UTF-16 code units isn’t always representable as a sequence of UTF-8 code units.

I feel ill (formed)

Consider the C# string literal "X\uD800Y". That is a string consisting of three UTF-16 code units:

  • 0x0058 – ‘X’
  • 0xD800 – High surrogate
  • 0x0059 – ‘Y’

That’s fine as a string – it’s even a Unicode string according to the spec (item D80). However, it’s ill-formed (item D84). That’s because the UTF-16 code unit 0xD800 doesn’t map to a Unicode scalar value (item D76) – the set of Unicode scalar values explicitly excludes the high/low surrogate code points.

Just in case you’re new to surrogate pairs: UTF-16 only deals in 16-bit code units, which means it can’t cope with the whole of Unicode (which ranges from U+0000 to U+10FFFF inclusive). If you want to represent a value greater than U+FFFF in UTF-16, you need to use two UTF-16 code units: a high surrogate (in the range 0xD800 to 0xDBFF) followed by a low surrogate (in the range 0xDC00 to 0xDFFF). So a high surrogate on its own makes no sense. It’s a valid UTF-16 code unit in itself, but it only has meaning when followed by a low surrogate.

Show me some code!

So what does this have to do with C#? Well, string constants have to be represented in IL somehow. As it happens, there are two different representations: most of the time, UTF-16 is used, but attribute constructor arguments use UTF-8.

Let’s take an example:

using System;
using System.ComponentModel;
using System.Text;
using System.Linq;

[Description(Value)]
class Test
{
    const string Value = "X\ud800Y";

    static void Main()
    {
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }

    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}

The output of this code (under .NET) is:

Attribute: 0058 fffd fffd 0059
Constant: 0058 d800 0059

As you can see, the “constant” (Test.Value) has been preserved as a sequence of UTF-16 code units, but the attribute property has U+FFFD (the Unicode replacement character which is used to indicate broken data when decoding binary to text). Let’s dig a little deeper and look at the IL for the attribute and the constant:

.custom instance void [System]System.ComponentModel.DescriptionAttribute::.ctor(string)
= ( 01 00 05 58 ED A0 80 59 00 00 )

.field private static literal string Value
= bytearray (58 00 00 D8 59 00 )

The format of the constant (Value) is really simple – it’s just little-endian UTF-16. The format of the attribute is specified in ECMA-335 section II.23.3. Here, the meaning is:

  • Prolog (01 00)
  • Fixed arguments (for specified constructor signature)
    • 05 58 ED A0 80 59 (a single string argument as a SerString)
      • 05 (the length, i.e. 5, as a PackedLen)
      • 58 ED A0 80 59 (the UTF-8-encoded form of the string)
  • Number of named arguments (00 00)
  • Named arguments (there aren’t any)

The interesting part is the “UTF-8-encoded form of the string” here. It’s not valid UTF-8, because the input isn’t a well-formed string. The compiler has taken the high surrogate, determined that there isn’t a low surrogate after it, and just treated it as a value to be encoded in the normal UTF-8 way of encoding anything in the range U+0800 to U+FFFF inclusive.

It’s worth noting that if we had a full surrogate pair, UTF-8 would encode the single Unicode scalar value being represented, using 4 bytes. For example, if we change the declaration of Value to:

const string Value = "X\ud800\udc00Y";

then the UTF-8 bytes in the IL are 58 F0 90 80 80 59 – where F0 90 80 80 is the UTF-8 encoding for U+10000. That’s a well-formed string, and we get the same value for both the description attribute and the constant.

So in our original example, the string constant (encoded as UTF-16 in the IL) is just decoded without checking whether or not it’s ill-formed, whereas the attribute argument (encoded as UTF-8) is decoded with extra validation, which detects the ill-formed code unit sequence and replaces it.

Encoding behaviour

So which approach is right? According to the Unicode specification (item C10) both could be fine:

When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.

and

Conformant processes cannot interpret ill-formed code unit sequences. However, the conformance clauses do not prevent processes from operating on code unit sequences that do not purport to be in a Unicode character encoding form. For example, for performance reasons a low-level string operation may simply operate directly on code units, without interpreting them as characters. See, especially, the discussion under D89.

It’s not at all clear to me whether either the attribute argument or the constant value “purports to be in a Unicode character encoding form”. In my experience, very few pieces of documentation or specification are clear about whether they expect a piece of text to be well-formed or not.

Additionally, System.Text.Encoding implementations can often be configured to determine how they behave when encoding or decoding ill-formed data. For example, Encoding.UTF8.GetBytes(Value) returns byte sequence 58 EF BF BD 59 – in other words, it spots the bad data and replaces it with U+FFFD as part of the encoding… so decoding this value will result in X U+FFFD Y with no problems. On the other hand, if you use new UTF8Encoding(true, true).GetBytes(Value), an exception will be thrown. The first constructor argument is whether or not to emit a byte order mark under certain circumstances; the second one is what dictates the encoding behaviour in the face of invalid data, along with the EncoderFallback and DecoderFallback properties.

Language behaviour

So should this compile at all? Well, the language specification doesn’t currently prohibit it – but specifications can be changed :)

In fact, both csc and Roslyn do prohibit the use of ill-formed strings with certain attributes. For example, with DllImportAttribute:

[DllImport(Value)]
static extern void Foo();

This gives an error when Value is ill-formed:

error CS0591: Invalid value for argument to 'DllImport' attribute

There may be other attributes this is applied to as well; I’m not sure.

If we take it as read that the ill-formed value won’t be decoded back to its original form when the attribute is instantiated, I think it would be entirely reasonable to make it a compile-time failure – for attributes. (This is assuming that the runtime behaviour can’t be changed to just propagate the ill-formed string.)

What about the constant value though? Should that be allowed? Can it serve any purpose? Well, the precise value I’ve given is probably not terribly helpful – but it could make sense to have a string constant which ends with a high surrogate or starts with a low surrogate… because it can then be combined with another string to form a well-formed UTF-16 string. Of course, you should be very careful about this sort of thing – read the Unicode Technical Report 36 “Security Considerations” for some thoroughly alarming possibilities.

Corollaries

One interesting aspect to all of this is that “string encoding arithmetic” doesn’t behave as you might expect it to. For example, consider this method:

// Bad code!
string SplitEncodeDecodeAndRecombine
    (string input, int splitPoint, Encoding encoding)
{
    byte[] firstPart = encoding.GetBytes(input.Substring(0, splitPoint));
    byte[] secondPart = encoding.GetBytes(input.Substring(splitPoint));
    return encoding.GetString(firstPart) + encoding.GetString(secondPart);            
}

You might expect that this would be a no-op so long as everything is non-null and splitPoint is within range… but if you happen to split in the middle of a surrogate pair, it’s not going to be happy. There may well be other potential problems lurking there, depending on things like normalization form – I don’t think so, but at this point I’m unwilling to bet too heavily on string behaviour.

If you think the above code is unrealistic, just imagine partitioning a large body of text, whether that’s across network packets, files, or whatever. You might feel clever for realizing that without a bit of care you’d get binary data split between UTF-16 code units… but even handling that doesn’t save you. Yikes.

I’m tempted to swear off text data entirely at this point. Floating point is a nightmare, dates and times… well, you know my feelings about those. I wonder what projects are available that only need to deal with integers, and where all operations are guaranteed not to overflow. Let me know if you have any.

Conclusion

Text is hard.

Violating the “smart enum” pattern in C#

For a while now, I’ve been a big fan of a pattern in C# which mimics Java enums to a certain extent. In general, it’s a lovely pattern. Only after reading a comment on a recent blog post by Eric Lippert did I find out about a horrible flaw. Dubious thanks to John Payson for prompting this post. (And I fully intend to include this in the Abusing C# talk that I give every so often…)

Trigger warning: this post contains code you may wish you’d never seen.

Let’s start off with the happy path – a moment of calm before the storm.

When things go well…

The idea is to create a class – which can be abstract – with a certain fixed set of instances, which are available via static properties on the class itself. The class is not sealed, but it prevents other classes from subclassing it by using a private constructor. Here’s an example, including how it can be used:

using System;

public abstract class Operation
{
    public static Operation Add { get; } = new AddOperation();
    public static Operation Subtract { get; } = new SubtractOperation();

    // Prevent non-nested subclasses. They can't call this constructor,
    // which is the only way of creating an instance.
    private Operation()
    {}

    public abstract int Apply(int lhs, int rhs);

    private class AddOperation : Operation
    {
        public override int Apply(int lhs, int rhs)
        {
            return lhs + rhs;
        }
    }

    private class SubtractOperation : Operation
    {
        public override int Apply(int lhs, int rhs)
        {
            return lhs - rhs;
        }
    }
}

public class Test
{
    static void Main()
    {
        Operation add = Operation.Add;
        Console.WriteLine(add.Apply(5, 3)); // 8
    }
}

(Note the use of the new “automatically implemented read-only property” from C# 6. Yum!)

Obviously there are possible variations of this – there could be multiple instances of one subclass, or we could even create instances dynamically if that were useful. The main point is that no other code can create instances or even declare their own subclasses of Operation. That private constructor stops everything, right? Um…

My eyes! The goggles do nothing!

I’m sure all my readers are aware that every C# constructor either has to chain to another constructor using : this(...) or chain to a base class constructor, either implicitly calling a constructor without arguments or explicitly using : base(...).

What I wasn’t aware of – or rather, I think I’d been aware of once, but ignored – was that constructor initializers can have cycles. In other words, it’s absolutely fine to write code like this:

public class SneakyOperation : Operation
{
    public SneakyOperation() : this(1)
    {}

    public SneakyOperation(int i) : this()
    {}

    public override int Apply(int lhs, int rhs)
    {
        return lhs * rhs;
    }
}

Neither operator attempts to call a base class constructor, so it doesn’t matter that the only base class constructor is private. This code compiles with no issues – not even a warning.

So we’ve already failed in our intention to prevent subclassing from occurring… but this will fail with a StackOverflowException which will take down the process, so there’s not much harm, right?

Well… suppose we didn’t let it throw a StackOverflowException. Suppose we triggered our own exception first. There are various ways of doing this. For example, we could use arithmetic:

public SneakyOperation(int x) : this(x, 0)
{}

// This isn't going to be happy if y is 0...
public SneakyOperation(int x, int y) : this(x / y)
{}

Or we could have one constructor call another with a delegate which will deliberately throw:

public SneakyOperation(int x) : this(() => { throw new Exception("Bang!"); })
{}

public SneakyOperation(Func<int> func) : this(func())
{}

So now, we have a class where you can call the constructor, and an exception will be thrown. But hey, you still can’t get at the instance which was created, right? Well, this is tricky. We genuinely can’t use this anywhere in code which we expect to execute – there’s no way of reaching a constructor body, because in order to do that we’d have to have some route which took us through the base class constructor, which was can’t access. Field initializers aren’t allow to use this, either explicitly or implicitly (e.g. by calling instance methods). If there’s a way of circumventing that, I’m not aware of it.

So, we can’t capture a reference to the instance before throwing the exception. And after the exception is thrown, it’s garbage, and therefore unavailable… unless we take it out of the garbage bin, of course. Enter a finalizer…

private static volatile Operation instance;

~SneakyOperation()
{
    // Resurrect the instance.
    instance = this;
}

Now all we have to do is wrap up the "create an object and wait for it to be resurrected" logic in a handy factory method:

public static Operation GetInstance()
{
    try
    {
        new SneakyOperation(0);
    }
    catch {} // Thoroughly expected
    Operation local;
    while ((local = instance) == null)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
    return local;
}

Admittedly if multiple threads call GetInstance() at a time, there’s no guarantee which thread will get which instance, or whether they’ll get the same instance… but that’s fine. There’s very little to guarantee this will ever terminate, in fact… but I don’t care much, as clearly I’m never going to use this code in production. It works on my machine, for the few times I’ve been able to bear running it. Here’s a complete example (just combine it with the Operation class as before) in case you want to run it yourself:

public class SneakyOperation : Operation
{
    private static volatile Operation instance;

    private SneakyOperation(int x) : this(() => { throw new Exception(); })
    {}

    private SneakyOperation(Func<int> func) : this(func())
    {}

    public static Operation GetInstance()
    {
        try
        {
            new SneakyOperation(0);
        }
        catch {}
        Operation local;
        while ((local = instance) == null)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
        }
        return local;
    }

    ~SneakyOperation()
    {
        instance = this;
    }

    public override int Apply(int lhs, int rhs)
    {
        return lhs * rhs;
    }
}

public class Test
{
    static void Main()
    {
        Operation sneaky = SneakyOperation.GetInstance();
        Console.WriteLine(sneaky.Apply(5, 3)); // 15
    }
}

Vile, isn’t it?

Fixing the problem

The first issue is the cyclic constructors. That’s surely a mistake in the language specification. (It’s not a bug in the compiler, as far as I can tell – I can’t find anything in the spec to prohibit it.) This can be fixed, and it shouldn’t even be too hard to do. There’s nothing conditional in constructor initializers; each constructor chains to exactly one other constructor, known via overload resolution at compile time. Even dynamic doesn’t break this – constructor initializers can’t be dispatched dynamically. This makes detecting a cycle trivial – from each constructor, just follow the chain of constructor calls until either you reach a base class constructor (success) or you reach one of the constructor declarations already in the chain (failure). I will be pestering Mads for this change – it may be too late for C# 6, but it’s never too early to ask about C# 7. This will be a breaking change – but it will only break code which is tragically broken already, always resulting in either an exception or a stack overflow.

Of course, a C# language change wouldn’t change the validity of such code being created through IL. peverify appears to be happy with it, for example. Still, a language fix would be a good start.

There’s a second language change which could help to fix this too – the introduction of private abstract/virtual methods. Until you consider nested classes, neither of those make sense – but as soon as you remember that there can be subclasses with access to private members, it makes perfect sense to have them. If Operation had a private abstract method (probably exposed via a public concrete method) then SneakyOperation (and any further subclasses) would have to be abstract, as it couldn’t possibly override the method. I’ve already blogged about something similar with public abstract classes containing internal abstract methods, which prevents concrete subclasses outside that assembly… but this would restrict inheritance even further.

Does it reach the usefulness bar required for new language features? Probably not. Would it be valid IL with the obvious meaning? I’ve performed experiments which suggest that it would be valid, but not as restrictive as one would expect – but I could be mistaken. I think there’s distinct merit in the idea, but I’m not expecting to see it come to pass.

Conclusion

So, what can you take away from this blog post? For starters, the fact that the language can probably be abused in far nastier ways than you’ve previously imagined – and that it’s very easy to miss the holes in the spec which allow for that abuse. (It’s not like this hole is a recent one… it’s been open forever.) You should absolutely not take away the idea that this is a recommended coding pattern – although the original pattern (for Operation) is a lovely one which deserves more widespread recognition.

Finally, you can take away the fact that nothing gets my attention quicker than a new (to me) way of abusing the language.