Category Archives: Evil code

When is a string not a string?

When is a string not a string?

As part of my “work” on the ECMA-334 TC49-TG2 technical group, standardizing C# 5 (which will probably be completed long after C# 6 is out… but it’s a start!) I’ve had the pleasure of being exposed to some of the interesting ways in which Vladimir Reshetnikov has tortured C#. This post highlights one of the issues he’s raised. As usual, it will probably never impact 99.999% of C# developers… but it’s a lovely little problem to look at.

Relevant specifications referenced in this post:
– The Unicode Standard, version 7.0.0 – in particular, chapter 3
C# 5 (Word document)
ECMA-335 (CLI specification)

What is a string?

How would you define the string (or System.String) type? I can imagine a number of responses to that question, from vague to pretty specific, and not all well-defined:

  • “Some text”
  • A sequence of characters
  • A sequence of Unicode characters
  • A sequence of 16-bit characters
  • A sequence of UTF-16 code units

The last of these is correct. The C# 5 specification (section 1.3) states:

Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.

So far, so good. But that’s C#. What about IL? What does that use, and does it matter? It turns out that it does… Strings need to be represented in IL as constants, and the nature of that representation is important, not only in terms of the encoding used, but how the encoded data is interpreted. In particular, a sequence of UTF-16 code units isn’t always representable as a sequence of UTF-8 code units.

I feel ill (formed)

Consider the C# string literal "X\uD800Y". That is a string consisting of three UTF-16 code units:

  • 0x0058 – ‘X’
  • 0xD800 – High surrogate
  • 0x0059 – ‘Y’

That’s fine as a string – it’s even a Unicode string according to the spec (item D80). However, it’s ill-formed (item D84). That’s because the UTF-16 code unit 0xD800 doesn’t map to a Unicode scalar value (item D76) – the set of Unicode scalar values explicitly excludes the high/low surrogate code points.

Just in case you’re new to surrogate pairs: UTF-16 only deals in 16-bit code units, which means it can’t cope with the whole of Unicode (which ranges from U+0000 to U+10FFFF inclusive). If you want to represent a value greater than U+FFFF in UTF-16, you need to use two UTF-16 code units: a high surrogate (in the range 0xD800 to 0xDBFF) followed by a low surrogate (in the range 0xDC00 to 0xDFFF). So a high surrogate on its own makes no sense. It’s a valid UTF-16 code unit in itself, but it only has meaning when followed by a low surrogate.

Show me some code!

So what does this have to do with C#? Well, string constants have to be represented in IL somehow. As it happens, there are two different representations: most of the time, UTF-16 is used, but attribute constructor arguments use UTF-8.

Let’s take an example:

using System;
using System.ComponentModel;
using System.Text;
using System.Linq;

[Description(Value)]
class Test
{
    const string Value = "X\ud800Y";

    static void Main()
    {
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }

    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}

The output of this code (under .NET) is:

Attribute: 0058 fffd fffd 0059
Constant: 0058 d800 0059

As you can see, the “constant” (Test.Value) has been preserved as a sequence of UTF-16 code units, but the attribute property has U+FFFD (the Unicode replacement character which is used to indicate broken data when decoding binary to text). Let’s dig a little deeper and look at the IL for the attribute and the constant:

.custom instance void [System]System.ComponentModel.DescriptionAttribute::.ctor(string)
= ( 01 00 05 58 ED A0 80 59 00 00 )

.field private static literal string Value
= bytearray (58 00 00 D8 59 00 )

The format of the constant (Value) is really simple – it’s just little-endian UTF-16. The format of the attribute is specified in ECMA-335 section II.23.3. Here, the meaning is:

  • Prolog (01 00)
  • Fixed arguments (for specified constructor signature)
    • 05 58 ED A0 80 59 (a single string argument as a SerString)
      • 05 (the length, i.e. 5, as a PackedLen)
      • 58 ED A0 80 59 (the UTF-8-encoded form of the string)
  • Number of named arguments (00 00)
  • Named arguments (there aren’t any)

The interesting part is the “UTF-8-encoded form of the string” here. It’s not valid UTF-8, because the input isn’t a well-formed string. The compiler has taken the high surrogate, determined that there isn’t a low surrogate after it, and just treated it as a value to be encoded in the normal UTF-8 way of encoding anything in the range U+0800 to U+FFFF inclusive.

It’s worth noting that if we had a full surrogate pair, UTF-8 would encode the single Unicode scalar value being represented, using 4 bytes. For example, if we change the declaration of Value to:

const string Value = "X\ud800\udc00Y";

then the UTF-8 bytes in the IL are 58 F0 90 80 80 59 – where F0 90 80 80 is the UTF-8 encoding for U+10000. That’s a well-formed string, and we get the same value for both the description attribute and the constant.

So in our original example, the string constant (encoded as UTF-16 in the IL) is just decoded without checking whether or not it’s ill-formed, whereas the attribute argument (encoded as UTF-8) is decoded with extra validation, which detects the ill-formed code unit sequence and replaces it.

Encoding behaviour

So which approach is right? According to the Unicode specification (item C10) both could be fine:

When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.

and

Conformant processes cannot interpret ill-formed code unit sequences. However, the conformance clauses do not prevent processes from operating on code unit sequences that do not purport to be in a Unicode character encoding form. For example, for performance reasons a low-level string operation may simply operate directly on code units, without interpreting them as characters. See, especially, the discussion under D89.

It’s not at all clear to me whether either the attribute argument or the constant value “purports to be in a Unicode character encoding form”. In my experience, very few pieces of documentation or specification are clear about whether they expect a piece of text to be well-formed or not.

Additionally, System.Text.Encoding implementations can often be configured to determine how they behave when encoding or decoding ill-formed data. For example, Encoding.UTF8.GetBytes(Value) returns byte sequence 58 EF BF BD 59 – in other words, it spots the bad data and replaces it with U+FFFD as part of the encoding… so decoding this value will result in X U+FFFD Y with no problems. On the other hand, if you use new UTF8Encoding(true, true).GetBytes(Value), an exception will be thrown. The first constructor argument is whether or not to emit a byte order mark under certain circumstances; the second one is what dictates the encoding behaviour in the face of invalid data, along with the EncoderFallback and DecoderFallback properties.

Language behaviour

So should this compile at all? Well, the language specification doesn’t currently prohibit it – but specifications can be changed :)

In fact, both csc and Roslyn do prohibit the use of ill-formed strings with certain attributes. For example, with DllImportAttribute:

[DllImport(Value)]
static extern void Foo();

This gives an error when Value is ill-formed:

error CS0591: Invalid value for argument to 'DllImport' attribute

There may be other attributes this is applied to as well; I’m not sure.

If we take it as read that the ill-formed value won’t be decoded back to its original form when the attribute is instantiated, I think it would be entirely reasonable to make it a compile-time failure – for attributes. (This is assuming that the runtime behaviour can’t be changed to just propagate the ill-formed string.)

What about the constant value though? Should that be allowed? Can it serve any purpose? Well, the precise value I’ve given is probably not terribly helpful – but it could make sense to have a string constant which ends with a high surrogate or starts with a low surrogate… because it can then be combined with another string to form a well-formed UTF-16 string. Of course, you should be very careful about this sort of thing – read the Unicode Technical Report 36 “Security Considerations” for some thoroughly alarming possibilities.

Corollaries

One interesting aspect to all of this is that “string encoding arithmetic” doesn’t behave as you might expect it to. For example, consider this method:

// Bad code!
string SplitEncodeDecodeAndRecombine
    (string input, int splitPoint, Encoding encoding)
{
    byte[] firstPart = encoding.GetBytes(input.Substring(0, splitPoint));
    byte[] secondPart = encoding.GetBytes(input.Substring(splitPoint));
    return encoding.GetString(firstPart) + encoding.GetString(secondPart);            
}

You might expect that this would be a no-op so long as everything is non-null and splitPoint is within range… but if you happen to split in the middle of a surrogate pair, it’s not going to be happy. There may well be other potential problems lurking there, depending on things like normalization form – I don’t think so, but at this point I’m unwilling to bet too heavily on string behaviour.

If you think the above code is unrealistic, just imagine partitioning a large body of text, whether that’s across network packets, files, or whatever. You might feel clever for realizing that without a bit of care you’d get binary data split between UTF-16 code units… but even handling that doesn’t save you. Yikes.

I’m tempted to swear off text data entirely at this point. Floating point is a nightmare, dates and times… well, you know my feelings about those. I wonder what projects are available that only need to deal with integers, and where all operations are guaranteed not to overflow. Let me know if you have any.

Conclusion

Text is hard.

Violating the “smart enum” pattern in C#

For a while now, I’ve been a big fan of a pattern in C# which mimics Java enums to a certain extent. In general, it’s a lovely pattern. Only after reading a comment on a recent blog post by Eric Lippert did I find out about a horrible flaw. Dubious thanks to John Payson for prompting this post. (And I fully intend to include this in the Abusing C# talk that I give every so often…)

Trigger warning: this post contains code you may wish you’d never seen.

Let’s start off with the happy path – a moment of calm before the storm.

When things go well…

The idea is to create a class – which can be abstract – with a certain fixed set of instances, which are available via static properties on the class itself. The class is not sealed, but it prevents other classes from subclassing it by using a private constructor. Here’s an example, including how it can be used:

using System;

public abstract class Operation
{
    public static Operation Add { get; } = new AddOperation();
    public static Operation Subtract { get; } = new SubtractOperation();

    // Prevent non-nested subclasses. They can't call this constructor,
    // which is the only way of creating an instance.
    private Operation()
    {}

    public abstract int Apply(int lhs, int rhs);

    private class AddOperation : Operation
    {
        public override int Apply(int lhs, int rhs)
        {
            return lhs + rhs;
        }
    }

    private class SubtractOperation : Operation
    {
        public override int Apply(int lhs, int rhs)
        {
            return lhs - rhs;
        }
    }
}

public class Test
{
    static void Main()
    {
        Operation add = Operation.Add;
        Console.WriteLine(add.Apply(5, 3)); // 8
    }
}

(Note the use of the new “automatically implemented read-only property” from C# 6. Yum!)

Obviously there are possible variations of this – there could be multiple instances of one subclass, or we could even create instances dynamically if that were useful. The main point is that no other code can create instances or even declare their own subclasses of Operation. That private constructor stops everything, right? Um…

My eyes! The goggles do nothing!

I’m sure all my readers are aware that every C# constructor either has to chain to another constructor using : this(...) or chain to a base class constructor, either implicitly calling a constructor without arguments or explicitly using : base(...).

What I wasn’t aware of – or rather, I think I’d been aware of once, but ignored – was that constructor initializers can have cycles. In other words, it’s absolutely fine to write code like this:

public class SneakyOperation : Operation
{
    public SneakyOperation() : this(1)
    {}

    public SneakyOperation(int i) : this()
    {}

    public override int Apply(int lhs, int rhs)
    {
        return lhs * rhs;
    }
}

Neither operator attempts to call a base class constructor, so it doesn’t matter that the only base class constructor is private. This code compiles with no issues – not even a warning.

So we’ve already failed in our intention to prevent subclassing from occurring… but this will fail with a StackOverflowException which will take down the process, so there’s not much harm, right?

Well… suppose we didn’t let it throw a StackOverflowException. Suppose we triggered our own exception first. There are various ways of doing this. For example, we could use arithmetic:

public SneakyOperation(int x) : this(x, 0)
{}

// This isn't going to be happy if y is 0...
public SneakyOperation(int x, int y) : this(x / y)
{}

Or we could have one constructor call another with a delegate which will deliberately throw:

public SneakyOperation(int x) : this(() => { throw new Exception("Bang!"); })
{}

public SneakyOperation(Func<int> func) : this(func())
{}

So now, we have a class where you can call the constructor, and an exception will be thrown. But hey, you still can’t get at the instance which was created, right? Well, this is tricky. We genuinely can’t use this anywhere in code which we expect to execute – there’s no way of reaching a constructor body, because in order to do that we’d have to have some route which took us through the base class constructor, which was can’t access. Field initializers aren’t allow to use this, either explicitly or implicitly (e.g. by calling instance methods). If there’s a way of circumventing that, I’m not aware of it.

So, we can’t capture a reference to the instance before throwing the exception. And after the exception is thrown, it’s garbage, and therefore unavailable… unless we take it out of the garbage bin, of course. Enter a finalizer…

private static volatile Operation instance;

~SneakyOperation()
{
    // Resurrect the instance.
    instance = this;
}

Now all we have to do is wrap up the "create an object and wait for it to be resurrected" logic in a handy factory method:

public static Operation GetInstance()
{
    try
    {
        new SneakyOperation(0);
    }
    catch {} // Thoroughly expected
    Operation local;
    while ((local = instance) == null)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
    return local;
}

Admittedly if multiple threads call GetInstance() at a time, there’s no guarantee which thread will get which instance, or whether they’ll get the same instance… but that’s fine. There’s very little to guarantee this will ever terminate, in fact… but I don’t care much, as clearly I’m never going to use this code in production. It works on my machine, for the few times I’ve been able to bear running it. Here’s a complete example (just combine it with the Operation class as before) in case you want to run it yourself:

public class SneakyOperation : Operation
{
    private static volatile Operation instance;

    private SneakyOperation(int x) : this(() => { throw new Exception(); })
    {}

    private SneakyOperation(Func<int> func) : this(func())
    {}

    public static Operation GetInstance()
    {
        try
        {
            new SneakyOperation(0);
        }
        catch {}
        Operation local;
        while ((local = instance) == null)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
        }
        return local;
    }

    ~SneakyOperation()
    {
        instance = this;
    }

    public override int Apply(int lhs, int rhs)
    {
        return lhs * rhs;
    }
}

public class Test
{
    static void Main()
    {
        Operation sneaky = SneakyOperation.GetInstance();
        Console.WriteLine(sneaky.Apply(5, 3)); // 15
    }
}

Vile, isn’t it?

Fixing the problem

The first issue is the cyclic constructors. That’s surely a mistake in the language specification. (It’s not a bug in the compiler, as far as I can tell – I can’t find anything in the spec to prohibit it.) This can be fixed, and it shouldn’t even be too hard to do. There’s nothing conditional in constructor initializers; each constructor chains to exactly one other constructor, known via overload resolution at compile time. Even dynamic doesn’t break this – constructor initializers can’t be dispatched dynamically. This makes detecting a cycle trivial – from each constructor, just follow the chain of constructor calls until either you reach a base class constructor (success) or you reach one of the constructor declarations already in the chain (failure). I will be pestering Mads for this change – it may be too late for C# 6, but it’s never too early to ask about C# 7. This will be a breaking change – but it will only break code which is tragically broken already, always resulting in either an exception or a stack overflow.

Of course, a C# language change wouldn’t change the validity of such code being created through IL. peverify appears to be happy with it, for example. Still, a language fix would be a good start.

There’s a second language change which could help to fix this too – the introduction of private abstract/virtual methods. Until you consider nested classes, neither of those make sense – but as soon as you remember that there can be subclasses with access to private members, it makes perfect sense to have them. If Operation had a private abstract method (probably exposed via a public concrete method) then SneakyOperation (and any further subclasses) would have to be abstract, as it couldn’t possibly override the method. I’ve already blogged about something similar with public abstract classes containing internal abstract methods, which prevents concrete subclasses outside that assembly… but this would restrict inheritance even further.

Does it reach the usefulness bar required for new language features? Probably not. Would it be valid IL with the obvious meaning? I’ve performed experiments which suggest that it would be valid, but not as restrictive as one would expect – but I could be mistaken. I think there’s distinct merit in the idea, but I’m not expecting to see it come to pass.

Conclusion

So, what can you take away from this blog post? For starters, the fact that the language can probably be abused in far nastier ways than you’ve previously imagined – and that it’s very easy to miss the holes in the spec which allow for that abuse. (It’s not like this hole is a recent one… it’s been open forever.) You should absolutely not take away the idea that this is a recommended coding pattern – although the original pattern (for Operation) is a lovely one which deserves more widespread recognition.

Finally, you can take away the fact that nothing gets my attention quicker than a new (to me) way of abusing the language.

When is a constant not a constant? When it’s a decimal…

A comment on a Stack Overflow post recently got me delving into constants a bit more thoroughly than I have done before.

Const fields

I’ve been aware for a while that although you can specify decimal field as a const in C#, it’s not really const as far as the CLR is concerned. Let’s consider this class to start with:

class Test
{
    const int ConstInt32 = 5;
    const decimal ConstDecimal = 5;
}

Firstly, csc gives us a warning about ConstDecimal but not about ConstInt32:

Test.cs(4,19): warning CS0414: The field ‘Test.ConstDecimal’ is assigned but its value is never used

The Roslyn compiler (rcsc) doesn’t warn about either of the fields. This is just a curiosity, really – but it already shows that there’s some difference in how they’re compiled. When we delve into the IL, the difference is much more pronounced:

.class private auto ansi beforefieldinit Test
       extends [mscorlib]System.Object
{
  .field private static literal int32 ConstInt32 = int32(0x00000005)
  .field private static initonly valuetype [mscorlib]System.Decimal ConstDecimal
  .custom instance void [mscorlib]DecimalConstantAttribute::.ctor
      (uint8, uint8, uint32, uint32, uint32) = 
      ( 01 00 00 00 00 00 00 00 00 00 00 00 05 00 00 00 00 00 ) 

  // Skip the parameterless constructor

  .method private hidebysig specialname rtspecialname static 
          void  .cctor() cil managed
  {
    // Code size       12 (0xc)
    .maxstack  8
    IL_0000:  ldc.i4.5
    IL_0001:  newobj     instance void [mscorlib]System.Decimal::.ctor(int32)
    IL_0006:  stsfld     valuetype [mscorlib]System.Decimal Test::ConstDecimal
    IL_000b:  ret
  } // end of method Test::.cctor

} // end of class Test

First things to note:

ConstInt32 has the literal constraint in IL. From ECMA 335, I.8.6.1.2:

The literal constraint promises that the value of the location is actually a fixed value
of a built-in type. The value is specified as part of the constraint. Compilers are
required to replace all references to the location with its value, and the VES therefore
need not allocate space for the location. This constraint, while logically applicable to
any location, shall only be placed on static fields of compound types. Fields that are
so marked are not permitted to be referenced from CIL (they shall be in-lined to their
constant value at compile time), but are available using reflection and tools that
directly deal with the metadata.

Whereas ConstDecimal only has the initonly constraint. Again, from ECMA 335:

The init-only constraint promises (hence, requires) that once the location has been
initialized, its contents never change. Namely, the contents are initialized before any
access, and after initialization, no value can be stored in the location. The contents are
always identical to the initialized value (see §I.8.2.3). This constraint, while logically
applicable to any location, shall only be placed on fields (static or instance) of
compound types.

(Here “compound type” just means “not an array type” – although in quite a confusing manner. I would ignore it if I were you.)

Next, note that there’s an attribute (System.Runtime.CompilerServices.DecimalConstantAttribute – I’ve taken the namespace off in the listing for the sake of readability) applied to the field, which tells anything consuming the assembly that it should be a constant, and what its value is. Indeed, if you’re very careful, you can create your own “constants” like this:

[DecimalConstant((byte)0, (byte)0, (uint)0, (uint)0, (uint) 5)]
public static readonly decimal ConstDecimal;

That field declaration will be treated as a constant in other assembles – but not within the same assembly. So printing ConstDecimal within the same assembly will result in 0 (unless you change it to another value in the static initializer) whereas printing Test.ConstDecimal in a different assembly will result in 5. (It won’t even touch the field at execution time.) I’m sure I can work out some nasty ways of abusing that, if I try hard enough.

Note that the casts to uint are important – if you accidentally call the attribute constructor with a (byte, byte, int, int, int), the compiler doesn’t recognize it. (Interesting, the latter was only introduced in .NET 2.0. I’ve no idea why.)

Amusingly, you can combine the two:

[DecimalConstant((byte)0, (byte)0, (uint)0, (uint)0, (uint) 5)]
public const decimal ConstDecimal = 10;

In this case, the IL contains both DecimalConstant attributes, despite the fact that it’s only legal to apply one. (AllowMultiple is false on its AttributeUsage.) The compiler appears to pick the one specified by the value rather than manually applied, which is slightly disappointing, but of no real importance.

Optional parameters

In the case of const fields, we’ve really only cared about what the compiler does. Let’s try something where both the compiler and the framework can get involved: optional parameters.

Again, let’s write a little class to demonstrate how the default values of optional parameters are encoded in IL:

public class Test
{
    public void PrintInt32(int x = 10)
    {
        Console.WriteLine(x);
    }

    public void PrintDecimal(decimal d = 10m)
    {
        Console.WriteLine(d);
    }
}

The important bits of the generated IL are:

.method public hidebysig instance void PrintInt32([opt] int32 x) cil managed
{
    .param [1] = int32(0x0000000A)
    ...
} // end of method Test::PrintInt32

.method public hidebysig instance void PrintDecimal(
    [opt] valuetype [mscorlib]System.Decimal d) cil managed
{
    .param [1]
    .custom instance void [mscorlib] DecimalConstantAttribute::.ctor
    (uint8, uint8, uint32, uint32, uint32) =
    ( 01 00 00 00 00 00 00 00 00 00 00 00 0A 00 00 00 00 00 ) 
    ...
} // end of method Test::PrintDecimal

Again, we have a DecimalConstantAttribute to describe the default value for the decimal parameter, whereas . If you call the method but don’t specify an argument, the compiler notes the DecimalConstantAttribute applied to the method parameter, and constructs the value in the calling code. That’s not the only way of observing the default value, however – you can do it with reflection, too:

public static void Main()
{
    var method = typeof(Test).GetMethod("PrintDecimal");
    var parameter = method.GetParameters()[0];
    Console.WriteLine("Default value: {0}", parameter.DefaultValue);
}

As you’d expect, that prints a default value of 10. There’s no direct equivalent of DefaultValue for FieldInfo – there’s GetRawConstantValue() but that only works for constants that the CLR is directly aware of – it fails for a field like const decimal Foo = 10m , with an InvalidOperationException. I’ll talk more about CLR constants later.

Now let’s try something a bit more tricksy though… C# doesn’t support DateTime literals, unlike VB – but there’s a DateTimeConstantAttribute – what happens if we try to apply that ourselves? Let’s see…

public void PrintDateTime
    ([Optional, DateTimeConstant(635443315962469079L)] DateTime date)
{
    Console.WriteLine(date);
}

So if we call PrintDateTime(), what does that print? Well (leaving aside the formatting – the examples below use the UK default formatting):

  • With csc.exe (the “old” C# compiler), with the call in the same assembly as the method declaration, it prints 01/01/0001 00:00:00
  • With csc.exe, with the call in a different assembly to the method declaration, it prints 22/08/2014 19:13:16
  • With rcsc.exe (Roslyn), it prints 22/08/2014 19:13:16 regardless of where it’s called from
  • If you call it dynamically (dynamic d = new Test(); d.PrintDateTime();) it prints 22/08/2014 19:13:16 regardless of the compiler – at least with the version of .NET I’m using. It may well vary by version.

In every case, printing out the ParameterInfo.DefaultValue gives the right answer: the framework doesn’t care whether or not the compiler understands the attribute.

In case you’re wondering why I didn’t mention this possibility for constant fields – I tried it, and it didn’t work, even in Roslyn. For some reason optional parameters are treated as more “special” than constant fields.

Having got this far, why stop with DateTime? The DateTimeConstantAttribute class derives from CustomConstantAttribute (whereas DecimalConstantAttribute just derives from Attribute). So can I introduce my own constant attributes? Noda Time seems to be an obvious candidate for this – let’s try for a LocalDateConstantAttribute:

[AttributeUsage(AttributeTargets.Parameter | AttributeTargets.Field)]
public class LocalDateConstantAttribute : CustomConstantAttribute
{
    private readonly LocalDate value;

    public LocalDateConstantAttribute(int year, int month, int day)
    {
        value = new LocalDate(year, month, day);
    }

    public override object Value { get { return value; } }
}
...
public void PrintLocalDate(
    [Optional, LocalDateConstant(2014, 8, 22)] LocalDate date)
{
    Console.WriteLine(date);
}

How does this fare? Not so well, unfortunately:
– With a normal method call (regardless of assembly), it prints 01 January 1970 (the default value for a LocalDate in Noda Time 1.3)
– With a dynamic method call it prints 01 January 1970
– With reflection (i.e. using ParameterInfo.DefaultValue) the framework does construct the appropriate LocalDate, which seems logical as it’s presumably just using the Value property of the attribute

So, there’s still work to be done there. I think it would be feasible to do this in a general way, if it’s acceptable for an exception to be thrown if the Value property returns an incompatible type of object to the parameter type. The great thing is that Roslyn is open source, so I should be able to spike this myself, right? How hard can it be? (Cue howls of derisive laughter from the Roslyn team, who will know much better than I how hard it’s likely to really be.)

CLR constants and attribute arguments

So with constant fields, it was all down to the compiler, really. For optional parameters, it’s mostly still down to the compiler, but with framework support for reflection. What about attribute arguments? They’re the other notable aspect of C# which requires compile-time constants. For example, this is fine:

[Description("This is a constant")]

But this is not:

[Description("Initialized at " + DateTime.Now)]

Intriguingly, this is fine too:

[ContractClass(typeof(Foo))]

… despite the fact that

const Type ConstType = typeof(Foo);

isn’t valid. The only constant expression which is valid for type Type is null. So in section 17.2 of the C# 5 specification, Type is explicitly called out:

An expression E is an attribute-argument-expression if all of the following statements are true:

  • The type of E is an attribute parameter type (§17.1.3).
  • At compile-time, the value of E can be resolved to one of the following:
    • A constant value.
    • A System.Type object.
    • A one-dimensional array of attribute-argument-expressions.

(Interestingly, there’s no indication that I can see that the value of E has to be obtained via typeof, in the spec – clearly [ContractClass(Type.GetType("TodayIs" + DateTime.Today.Day))] should be invalid, but I can’t currently see which bit of the spec that violates. Something to be tightened up, potentially.)

And the “attribute parameter type” part – section 17.1.3 – looks like this:

The types of positional and named parameters for an attribute class are limited to the attribute parameter types, which are:

  • One of the following types: bool, byte, char, double, float, int, long, sbyte, short, string, uint, ulong, ushort.
  • The type object.
  • The type System.Type.
  • An enum type, provided it has public accessibility and the types in which it is nested (if any) also have public accessibility (§17.2).
  • Single-dimensional arrays of the above types.

Oops – no decimal. Note that the type of E that has to be one of those types, as well as the parameter type… so it doesn’t help to have a parameter of type object and then try to pass a constant decimal value as the argument.

Basically, attributes arguments are sufficiently low-level that the CLR itself wants to know about them – and while it has a deep knowledge of Type, string and the primitive value types, it doesn’t have much knowledge about decimal (or at least, it isn’t required to). Attribute arguments can’t use funky “custom constant” attributes to specify values like decimal or DateTime – you really are limited to the types that the CLR knows about. In a future version it’s not inconceivable that this could be broadened, but at the moment it’s pretty strict.

Conclusion

So, it turns out the idea of a constant isn’t terribly constant in itself. We have:

  • Constant fields, which are primarily a compile-time concern, and therefore language-specific.
  • Optional parameter default values, which feel like they ought to be just like constant fields (in that a value specified in one place is substituted in another) but apparently have a bit more support in the C# compiler… and more reflection support too.
  • Attribute arguments, which are the strictest form of constant I’ve found so far, in that they have to correspond to a small set of CLR “special” types.

I didn’t even talk about constant expressions (section 7.19 of the C# 5 spec) much in this post – maybe I’ll delve into those in more detail in another post.

Unlike my normal day-dreaming about changing the compiler, I think I really might have a crack at Roslyn for supporting arbitrary optional parameters – it feels like it could potentially be genuinely useful (which is also unlike most of my idle speculation).

So next time you ask yourself whether something is a constant, you should start off by asking yourself what you mean by “constant” in the first place.

The BobbyTables culture

I started writing a post like this a long time ago, but somehow never finished it.

Countless posts on Stack Overflow are vulnerable to SQL injection attacks. Along with several other users, I always raise this when it shows up – this is something that really just shouldn’t happen these days. It’s a well-understood issue,and parameterized SQL is a great solution in almost all cases. (No, it doesn’t work if you want to specify an column or table name dynamically. Yes, whitelisting is the solution there.)

The response usually falls into one of three camps:

  • Ah – I didn’t know about that. Great, I’ll fix it now. Thanks!
  • This is just a prototype. I’ll fix it for the real thing. (Ha! Like that ever happens.)
  • Well yes, in theory – but I’m just using numbers. That’s not a problem, is it?

Now personally I feel that you should just get the habit of using parameterized queries all the time, even when you could get away without it. This post is a somewhat tongue-in-cheek counterargument to the last of these responses. If you haven’t seen Bobby Tables, you really should. It’s the best 10-second explanation of SQL injection that I’ve ever seen, and I almost always drop a link to it when I’m adding a comment on a vulnerable query on Stack Overflow.

So in honour of Bobby, here’s a little program. See if you can predict the output.

using System;
using System.Globalization;
using System.Threading;

class Test
{
    static void Main()
    {
        string sql = "SELECT * FROM Foo WHERE BarDate > '" + DateTime.Today + "'";
        // Imagine you're executing the query here...
        Console.WriteLine(sql);

        int bar = -10;
        sql = "SELECT * FROM Foo WHERE BarValue = " + bar;
        // Imagine you're executing the query here...
        Console.WriteLine(sql);
    }

    // Some other code here...
}

Does that look okay? Not great, admittedly – but not too bad, right? Well, the output of the program is:

SELECT * FROM Foo WHERE BarDate > '2014-08-08' OR ' '=' '
SELECT * FROM Foo WHERE BarValue = 1 OR 1=1 OR 1=10

Yikes! Our queries aren’t filtering out anything!

Of course, the black magic is in “Some other code here” part:

static Test()
{
    InstallBobbyTablesCulture();
}

static void InstallBobbyTablesCulture()
{
    CultureInfo bobby = (CultureInfo) CultureInfo.InvariantCulture.Clone();
    bobby.DateTimeFormat.ShortDatePattern = @"yyyy-MM-dd'' OR ' '=''";
    bobby.DateTimeFormat.LongTimePattern = "";
    bobby.NumberFormat.NegativeSign = "1 OR 1=1 OR 1=";
    Thread.CurrentThread.CurrentCulture = bobby;
}

Neither numbers (well, negative numbers in this case) nor dates are safe. And of course if your database permissions aren’t set correctly, the queries could do a lot more than just remove any filtering. For extra fun, you can subvert some custom format strings – by changing the DateSeparator property, for example.

Even in sensible cultures, if the database expects you to use . for the decimal separator and you’re in a European culture that uses , instead, do you know how your database will behave? If you sanitize your input based on the numeric value, but then that isn’t the value that the database sees due to a string conversion, how comfortable are you that your application is still safe? It may not allow direct damage, but it could potentially reveal more data than you originally expected – which is definitely a vulnerability in a form.

Now the chances of me getting onto your system and installing the Bobby Tables culture – let alone making it the system default – are pretty slim, and if that happens you’ve probably got bigger problems anyway… but it’s the principle of the thing. You don’t care about a text representation of your values: you just want to get them to the database intact.

Parameterized SQL: just say yes.

Micro-optimization: the surprising inefficiency of readonly fields

Introduction

Recently I’ve been optimizing the heck out of Noda Time. Most of the time this has been a case of the normal measurement, find bottlenecks, carefully analyse them, lather, rinse, repeat. Yesterday I had a hunch about a particular cost, and decided to experiment… leading to a surprising optimization.

Noda Time’s core types are mostly value types – date/time values are naturally value types, just as DateTime and DateTimeOffset are in the BCL. Noda Time’s types are a bit bigger than most value types, however – the largest being ZonedDateTime, weighing in at 40 bytes in an x64 CLR at the moment. (I can shrink it down to 32 bytes with a bit of messing around, although it’s not terribly pleasant to do so.) The main reason for the bulk is that we have two reference types involved (the time zone and the calendar system), and in Noda Time 2.0 we’re going to have nanosecond resolution instead of tick resolution (so we need 12 bytes just to store a point in time). While this goes against the Class Library Design Guidelines, it would be odd for the smaller types (LocalDate, LocalTime) to be value types and the larger ones to be reference types. Overall, these still feel like value types.

A lot of these value types are logically composed of each other:

  • A LocalDate is a YearMonthDay and a CalendarSystem reference
  • A LocalDateTime is a LocalDate and a LocalTime
  • An OffsetDateTime is a LocalDateTime and an Offset
  • A ZonedDateTime is an OffsetDateTime and a DateTimeZone reference

This leads to a lot of delegation, potentially – asking a ZonedDateTime for its Year could mean asking the OffsetDateTime, which would ask the LocalDateTime, which would ask the LocalDate, which would ask the YearMonthDay. Very nice from a code reuse point of view, but potentially inefficient due to copying data.

Why would there be data copying involved? Well, that’s where this blog post comes in.

Behaviour of value type member invocations

When an instance member (method or property) belonging to a value type is invoked, the exact behaviour depends on the kind of expression it is called on. From the C# 5 spec, section 7.5.5 (where E is the expression the member M is invoked on, and the type declaring M is a value type):

If E is not classified as a variable, then a temporary local variable of E’s type is created and the value of E is assigned to that variable. E is then reclassified as a reference to that temporary local variable. The temporary variable is accessible as this within M, but not in any other way. Thus, only when E is a true variable is it possible for the caller to observe the changes that M makes to this.

So when is a variable not a variable? When it’s readonly… from section 7.6.4 (emphasis mine) :

If T is a struct-type and I identifies an instance field of that class-type:

  • If E is a value, or if the field is readonly and the reference occurs outside an instance constructor of the struct in which the field is declared, then the result is a value, namely the value of the field I in the struct instance given by E.

(There’s a very similar bullet for T being a class-type; the important part is that the field type is a value type

The upshot is that if you have a method call of:

int result = someField.Foo();

then it’s effectively converted into this:

var tmp = someField;
int result = tmp.Foo();

Now if the type of the field is quite a large value type, but Foo() doesn’t modify the value (which it never does within my value types), that’s performing a copy completely unnecessarily.

To see this in action outside Noda Time, I’ve built a little sample app.

Show me the code!

Our example is a simple 256-bit type, composed of 4 Int64 values. The type itself doesn’t do anything useful – it just holds the four values, and exposes them via properties. We then measure how long it takes to sum the four properties lots of times.

using System;
using System.Diagnostics;

public struct Int256
{
    private readonly long bits0;
    private readonly long bits1;
    private readonly long bits2;
    private readonly long bits3;
    
    public Int256(long bits0, long bits1, long bits2, long bits3)
    {
        this.bits0 = bits0;
        this.bits1 = bits1;
        this.bits2 = bits2;
        this.bits3 = bits3;
    }
    
    public long Bits0 { get { return bits0; } }
    public long Bits1 { get { return bits1; } }
    public long Bits2 { get { return bits2; } }
    public long Bits3 { get { return bits3; } }
}

class Test
{
    private readonly Int256 value;

    public Test()
    {
        value = new Int256(1L, 5L, 10L, 100L);
    }
    
    public long TotalValue 
    { 
        get 
        {
            return value.Bits0 + value.Bits1 + value.Bits2 + value.Bits3; 
        }
    }
    
    public void RunTest()
    {
        // Just make sure it’s JITted…
        var sample = TotalValue;
        Stopwatch sw = Stopwatch.StartNew();
        long total = 0;
        for (int i = 0; i < 1000000000; i++)
        {
            total += TotalValue;
        }
        sw.Stop();
        Console.WriteLine("Total time: {0}ms", sw.ElapsedMilliseconds);
    }
    
    static void Main()
    {
        new Test().RunTest();
    }
}

Building this from the command line with /o+ /debug- and running (in a 64-bit CLR, but no RyuJIT) this takes about 20 seconds to run on my laptop. We can make it much faster with just one small change:

class Test
{
    private Int256 value;

    // Code as before
}

The same test now takes about 4 seconds – a 5-fold speed improvement, just by making a field non-readonly. If we look at the IL for the TotalValue property, the copying becomes obvious. Here it is when the field is readonly:

.method public hidebysig specialname instance int64 
        get_TotalValue() cil managed
{
  // Code size       60 (0x3c)
  .maxstack  2
  .locals init (valuetype Int256 V_0,
           valuetype Int256 V_1,
           valuetype Int256 V_2,
           valuetype Int256 V_3)
  IL_0000:  ldarg.0
  IL_0001:  ldfld      valuetype Int256 Test::’value’
  IL_0006:  stloc.0
  IL_0007:  ldloca.s   V_0
  IL_0009:  call       instance int64 Int256::get_Bits0()
  IL_000e:  ldarg.0
  IL_000f:  ldfld      valuetype Int256 Test::’value’
  IL_0014:  stloc.1
  IL_0015:  ldloca.s   V_1
  IL_0017:  call       instance int64 Int256::get_Bits1()
  IL_001c:  add
  IL_001d:  ldarg.0
  IL_001e:  ldfld      valuetype Int256 Test::’value’
  IL_0023:  stloc.2
  IL_0024:  ldloca.s   V_2
  IL_0026:  call       instance int64 Int256::get_Bits2()
  IL_002b:  add
  IL_002c:  ldarg.0
  IL_002d:  ldfld      valuetype Int256 Test::’value’
  IL_0032:  stloc.3
  IL_0033:  ldloca.s   V_3
  IL_0035:  call       instance int64 Int256::get_Bits3()
  IL_003a:  add
  IL_003b:  ret
} // end of method Test::get_TotalValue

And here it is when the field’s not readonly:

.method public hidebysig specialname instance int64 
        get_TotalValue() cil managed
{
  // Code size       48 (0x30)
  .maxstack  8
  IL_0000:  ldarg.0
  IL_0001:  ldflda     valuetype Int256 Test::’value’
  IL_0006:  call       instance int64 Int256::get_Bits0()
  IL_000b:  ldarg.0
  IL_000c:  ldflda     valuetype Int256 Test::’value’
  IL_0011:  call       instance int64 Int256::get_Bits1()
  IL_0016:  add
  IL_0017:  ldarg.0
  IL_0018:  ldflda     valuetype Int256 Test::’value’
  IL_001d:  call       instance int64 Int256::get_Bits2()
  IL_0022:  add
  IL_0023:  ldarg.0
  IL_0024:  ldflda     valuetype Int256 Test::’value’
  IL_0029:  call       instance int64 Int256::get_Bits3()
  IL_002e:  add
  IL_002f:  ret
} // end of method Test::get_TotalValue

Note that it’s still loading the field address (ldflda) four times. You might expect that copying the field onto the stack once via a temporary variable would be faster, but that ends up at about 6.5 seconds on my machine.

There is an optimization which is even faster – moving the totalling property into Int256. That way (with the non-readonly field, still) the total time is less than a second – twenty times faster than the original code!

Conclusion

This isn’t an optimization I’d recommend in general. Most code really doesn’t need to be micro-optimized this hard, and most code doesn’t deal with large value types like the ones in Noda Time. However, I regard Noda Time as a sort of "system level" library, and I don’t ever want someone to decide not to use it on  performance grounds. My benchmarks show that for potentially-frequently-called operations (such as the properties on ZonedDateTime) it really does make a difference, so I’m going to go for it.

I intend to apply a custom attribute to each of these "would normally be readonly" fields to document the intended behaviour of the field – and then when Roslyn is fully released, I’ll probably write a test to validate that all of these fields would still compile if the field were made readonly (e.g. that they’re never assigned to outside the constructor).

Aside from anything else, I find the subtle difference in behaviour between a readonly field and a read/write field fascinating… it’s something I’d been vaguely aware of in the past, but this is the first time that it’s had a practical impact on me. Maybe it’ll never make any difference to your code… but it’s probably worth being aware of anyway.

Extension methods, explicitly implemented interfaces and collection initializers

This post is the answer to yesterday’s brainteaser. As a reminder, I was asking what purpose this code might have:

public static class Extensions 

    public static void Add<T>(this ICollection<T> source, T item) 
    { 
        source.Add(item); 
    } 
}

There are plenty of answers, varying from completely incorrect (sorry!) to pretty much spot on.

As many people noticed, ICollection<T> already has an Add method taking an item of type T. So what difference could this make? Well, consider LinkedList<T>, which implements ICollection<T>, used as below:

// Invalid
LinkedList<int> list = new LinkedList<int>();
list.Add(10);

That’s not valid code (normally)…. whereas this is:

// Valid
ICollection<int> list = new LinkedList<int>();
list.Add(10);

The only difference is the compile-time type of the list variable – and that changes things because LinkedList<T> implements ICollection<T>.Add using explicit interface implementation. (Basically you’re encouraged to use AddFirst and AddLast instead, to make it clear which end you’re adding to. Add is equivalent to AddLast.)

Now consider the invalid code above, but with the brainteaser extension method in place. Now it’s a perfectly valid call to the extension method, which happens to delegate straight to the ICollection<T> implementation. Great! But why bother? Surely we can just cast list if we really want to:

LinkedList<int> list = new LinkedList<int>();
((IList<int>)list).Add(10);

That’s ugly (really ugly) – but it does work. But what about situations where you can’t cast? They’re pretty rare, but they do exist. Case in point: collection initializers. This is where the C# 6 connection comes in. As of C# 6 (at least so far…) collection initializers have changed so that an appropriate Add extension method is also permitted. So for example:

// Sometimes valid :)
LinkedList<int> list = new LinkedList<int> { 10, 20 };

That’s invalid in C# 5 whatever you do, and it’s only valid in C# 6 when you’ve got a suitable extension method in place, such as the one in yesterday’s post. There’s nothing to say the extension method has to be on ICollection<T>. While it might feel nice to be general, most implementations of ICollection<T> which use explicit interface implementation for ICollection<T>.Add do so for a very good reason. With the extension method in place, this is valid too…

// Valid from a language point of view…
ReadOnlyCollection<int> collection = new ReadOnlyCollection<int>(new[] { 10, 20 }) { 30, 40 };

That will compile, but it’s obviously not going to succeed at execution time. (It throws NotSupportedException.)

Conclusion

I don’t think I’d ever actually use the extension method I showed yesterday… but that’s not quite the same thing as it being useless, particularly when coupled with C# 6’s new-and-improved collection initializers. (The indexer functionality means you can use collection initializers with ConcurrentDictionary<,> even without extension methods, by the way.)

Explicit interface implementation is an interesting little corner to C# which is easy to forget about when you look at code – and which doesn’t play nicely with dynamic typing, as I’ve mentioned before.

And finally…

Around the same time as I posted the brainteaser yesterday, I also remarked on how unfortunate it was that StringBuilder didn’t implement IEnumerable<char>. It’s not that I really want to iterate over a StringBuilder… but if it implemented IEnumerable, I could use it with a collection initializer, having added some extension methods. This would have been wonderfully evil…

using System;
using System.Text; 

public static class Extensions  
{  
    public static void Add(this StringBuilder builder, string text)
    {  
        builder.AppendLine(text);
    }  

    public static void Add(this StringBuilder builder, string format, params object[] arguments)
    {  
        builder.AppendFormat(format, arguments);
        builder.AppendLine();
    }  

class Test
{
    static void Main()
    {
        // Sadly invalid :(
        var builder = new StringBuilder
        {
            "Just a plain message",
            { "A message with formatting, recorded at {0}", DateTime.Now }
        };
    }
}

Unfortunately it’s not to be. But watch this space – I’m sure I’ll find some nasty ways of abusing C# 6…

Quick brainteaser

Just a really quick one today…

What’s the point of this code? Does it have any point at all?

public static class Extensions
{
    public static void Add<T>(this ICollection<T> source, T item)
    {
        source.Add(item);
    }
}

Bonus marks if you can work out what made me think about it.

I suggest you ROT-13 answers to avoid spoilers for other readers.