When is a string not a string?

When is a string not a string?

As part of my “work” on the ECMA-334 TC49-TG2 technical group, standardizing C# 5 (which will probably be completed long after C# 6 is out… but it’s a start!) I’ve had the pleasure of being exposed to some of the interesting ways in which Vladimir Reshetnikov has tortured C#. This post highlights one of the issues he’s raised. As usual, it will probably never impact 99.999% of C# developers… but it’s a lovely little problem to look at.

Relevant specifications referenced in this post:
– The Unicode Standard, version 7.0.0 – in particular, chapter 3
C# 5 (Word document)
ECMA-335 (CLI specification)

What is a string?

How would you define the string (or System.String) type? I can imagine a number of responses to that question, from vague to pretty specific, and not all well-defined:

  • “Some text”
  • A sequence of characters
  • A sequence of Unicode characters
  • A sequence of 16-bit characters
  • A sequence of UTF-16 code units

The last of these is correct. The C# 5 specification (section 1.3) states:

Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.

So far, so good. But that’s C#. What about IL? What does that use, and does it matter? It turns out that it does… Strings need to be represented in IL as constants, and the nature of that representation is important, not only in terms of the encoding used, but how the encoded data is interpreted. In particular, a sequence of UTF-16 code units isn’t always representable as a sequence of UTF-8 code units.

I feel ill (formed)

Consider the C# string literal "X\uD800Y". That is a string consisting of three UTF-16 code units:

  • 0x0058 – ‘X’
  • 0xD800 – High surrogate
  • 0x0059 – ‘Y’

That’s fine as a string – it’s even a Unicode string according to the spec (item D80). However, it’s ill-formed (item D84). That’s because the UTF-16 code unit 0xD800 doesn’t map to a Unicode scalar value (item D76) – the set of Unicode scalar values explicitly excludes the high/low surrogate code points.

Just in case you’re new to surrogate pairs: UTF-16 only deals in 16-bit code units, which means it can’t cope with the whole of Unicode (which ranges from U+0000 to U+10FFFF inclusive). If you want to represent a value greater than U+FFFF in UTF-16, you need to use two UTF-16 code units: a high surrogate (in the range 0xD800 to 0xDBFF) followed by a low surrogate (in the range 0xDC00 to 0xDFFF). So a high surrogate on its own makes no sense. It’s a valid UTF-16 code unit in itself, but it only has meaning when followed by a low surrogate.

Show me some code!

So what does this have to do with C#? Well, string constants have to be represented in IL somehow. As it happens, there are two different representations: most of the time, UTF-16 is used, but attribute constructor arguments use UTF-8.

Let’s take an example:

using System;
using System.ComponentModel;
using System.Text;
using System.Linq;

[Description(Value)]
class Test
{
    const string Value = "X\ud800Y";

    static void Main()
    {
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }

    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}

The output of this code (under .NET) is:

Attribute: 0058 fffd fffd 0059
Constant: 0058 d800 0059

As you can see, the “constant” (Test.Value) has been preserved as a sequence of UTF-16 code units, but the attribute property has U+FFFD (the Unicode replacement character which is used to indicate broken data when decoding binary to text). Let’s dig a little deeper and look at the IL for the attribute and the constant:

.custom instance void [System]System.ComponentModel.DescriptionAttribute::.ctor(string)
= ( 01 00 05 58 ED A0 80 59 00 00 )

.field private static literal string Value
= bytearray (58 00 00 D8 59 00 )

The format of the constant (Value) is really simple – it’s just little-endian UTF-16. The format of the attribute is specified in ECMA-335 section II.23.3. Here, the meaning is:

  • Prolog (01 00)
  • Fixed arguments (for specified constructor signature)
    • 05 58 ED A0 80 59 (a single string argument as a SerString)
      • 05 (the length, i.e. 5, as a PackedLen)
      • 58 ED A0 80 59 (the UTF-8-encoded form of the string)
  • Number of named arguments (00 00)
  • Named arguments (there aren’t any)

The interesting part is the “UTF-8-encoded form of the string” here. It’s not valid UTF-8, because the input isn’t a well-formed string. The compiler has taken the high surrogate, determined that there isn’t a low surrogate after it, and just treated it as a value to be encoded in the normal UTF-8 way of encoding anything in the range U+0800 to U+FFFF inclusive.

It’s worth noting that if we had a full surrogate pair, UTF-8 would encode the single Unicode scalar value being represented, using 4 bytes. For example, if we change the declaration of Value to:

const string Value = "X\ud800\udc00Y";

then the UTF-8 bytes in the IL are 58 F0 90 80 80 59 – where F0 90 80 80 is the UTF-8 encoding for U+10000. That’s a well-formed string, and we get the same value for both the description attribute and the constant.

So in our original example, the string constant (encoded as UTF-16 in the IL) is just decoded without checking whether or not it’s ill-formed, whereas the attribute argument (encoded as UTF-8) is decoded with extra validation, which detects the ill-formed code unit sequence and replaces it.

Encoding behaviour

So which approach is right? According to the Unicode specification (item C10) both could be fine:

When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.

and

Conformant processes cannot interpret ill-formed code unit sequences. However, the conformance clauses do not prevent processes from operating on code unit sequences that do not purport to be in a Unicode character encoding form. For example, for performance reasons a low-level string operation may simply operate directly on code units, without interpreting them as characters. See, especially, the discussion under D89.

It’s not at all clear to me whether either the attribute argument or the constant value “purports to be in a Unicode character encoding form”. In my experience, very few pieces of documentation or specification are clear about whether they expect a piece of text to be well-formed or not.

Additionally, System.Text.Encoding implementations can often be configured to determine how they behave when encoding or decoding ill-formed data. For example, Encoding.UTF8.GetBytes(Value) returns byte sequence 58 EF BF BD 59 – in other words, it spots the bad data and replaces it with U+FFFD as part of the encoding… so decoding this value will result in X U+FFFD Y with no problems. On the other hand, if you use new UTF8Encoding(true, true).GetBytes(Value), an exception will be thrown. The first constructor argument is whether or not to emit a byte order mark under certain circumstances; the second one is what dictates the encoding behaviour in the face of invalid data, along with the EncoderFallback and DecoderFallback properties.

Language behaviour

So should this compile at all? Well, the language specification doesn’t currently prohibit it – but specifications can be changed :)

In fact, both csc and Roslyn do prohibit the use of ill-formed strings with certain attributes. For example, with DllImportAttribute:

[DllImport(Value)]
static extern void Foo();

This gives an error when Value is ill-formed:

error CS0591: Invalid value for argument to 'DllImport' attribute

There may be other attributes this is applied to as well; I’m not sure.

If we take it as read that the ill-formed value won’t be decoded back to its original form when the attribute is instantiated, I think it would be entirely reasonable to make it a compile-time failure – for attributes. (This is assuming that the runtime behaviour can’t be changed to just propagate the ill-formed string.)

What about the constant value though? Should that be allowed? Can it serve any purpose? Well, the precise value I’ve given is probably not terribly helpful – but it could make sense to have a string constant which ends with a high surrogate or starts with a low surrogate… because it can then be combined with another string to form a well-formed UTF-16 string. Of course, you should be very careful about this sort of thing – read the Unicode Technical Report 36 “Security Considerations” for some thoroughly alarming possibilities.

Corollaries

One interesting aspect to all of this is that “string encoding arithmetic” doesn’t behave as you might expect it to. For example, consider this method:

// Bad code!
string SplitEncodeDecodeAndRecombine
    (string input, int splitPoint, Encoding encoding)
{
    byte[] firstPart = encoding.GetBytes(input.Substring(0, splitPoint));
    byte[] secondPart = encoding.GetBytes(input.Substring(splitPoint));
    return encoding.GetString(firstPart) + encoding.GetString(secondPart);            
}

You might expect that this would be a no-op so long as everything is non-null and splitPoint is within range… but if you happen to split in the middle of a surrogate pair, it’s not going to be happy. There may well be other potential problems lurking there, depending on things like normalization form – I don’t think so, but at this point I’m unwilling to bet too heavily on string behaviour.

If you think the above code is unrealistic, just imagine partitioning a large body of text, whether that’s across network packets, files, or whatever. You might feel clever for realizing that without a bit of care you’d get binary data split between UTF-16 code units… but even handling that doesn’t save you. Yikes.

I’m tempted to swear off text data entirely at this point. Floating point is a nightmare, dates and times… well, you know my feelings about those. I wonder what projects are available that only need to deal with integers, and where all operations are guaranteed not to overflow. Let me know if you have any.

Conclusion

Text is hard.

Writing and speaking update

This post has a few purposes – it’s partly a bit of advertising, but it’s also meant to serve as a quick way of replying to speaking requests for a while… if you’ve been directed here by an email from me, I hope you’ll excuse the “form letter” approach. (You can probably skip down to the bottom.)

Upcoming speaking engagements

I’ve got a few speaking engagements in the next four months:

  • December 2nd, hack.summit(): Humanity: Epic Fail (not currently sure how Tony the Pony will handle working with a webcam, but we’ll see)
  • December 3rd, NDC London: Abusing C#
  • January 6th-9th: CodeMash: workshop on LINQ, and a session on testing with Roslyn
  • February 27th: NorDevCon: Keynote (!) on passion, and C# 6 (probably with a bit of Roslyn too)

New book under development: Grokking C#

A long time ago I wrote about my hope to write two books: C# in Style and C# from Scratch. As you may have noticed, these haven’t happened yet, although I have had a couple of abortive attempts.

I’ve been working with my publisher (Manning) to think about the best way of introducing C# to newcomers, and we’ve made solid progress on a book with the working title of “Grokking C#” (subtitle: getting stuff done with objects). The aim will be to introduce both C# and OO hand-in-hand… not just “enough OO to understand the syntax of C#” but hopefully enough of the big picture to encourage readers to think about objects in a way which will last them long beyond the short time they’re reading the book.

The “Grokking” series from Manning is a pretty new one, but I’m really excited about the approach – it’s very visual, with a lot of thought put into reducing cognitive load. Have a look at Grokking Functional Programming for an idea of the style. If this sounds a bit like the Head-First books, there’s a good reason for that – Bert Bates who wrote Head-First Java with Kathy Sierra is acting in a consulting role for the Grokking books. However, the Grokking books have a style of their own – one which happens to take the things I like about the Head-First books, but without some of the aspects I’m less fond of.

I’m very much looking forward to learning a lot from Bert about getting a message out as clearly as possible. My blog, book and conference audience has always been reasonably advanced, and while I obviously answer Stack Overflow questions asked by newcomers, I haven’t done a large amount of writing for that audience. I expect this to be very challenging, but really rewarding.

No more speaking engagements…

Now for the bit which is tricky to tell conference organizers: I’m not going to accept any extra speaking engagements until the book is done, at least in first draft.

Writing the book is going to take a lot of my time, some of which would otherwise be family time… and currently speaking is taking about half of my vacation allocation each year. If I tried to continue with the speaking and the book, my family would suffer – and they don’t deserve that.

Don’t get me wrong – I still love speaking, and I’ll look forward to returning to the conference scene when I’m done, assuming anyone still wants me. Hopefully by then I’ll have some fresh ideas for talks as well. (I’ve been considering “Techniques for persuading people you’re an expert when you’re not” as a soft-skills talk, although I’m not sure I want to give away the secret…)

Apologies to conference organizers who I’ve already put off a few times with “ask me next year” – people trust me when I say it isn’t personal. If I had more time (or didn’t have a family!) I’d love to spend my life travelling around the world giving talks and chatting with developers. There just aren’t enough hours in the day :(

Violating the “smart enum” pattern in C#

For a while now, I’ve been a big fan of a pattern in C# which mimics Java enums to a certain extent. In general, it’s a lovely pattern. Only after reading a comment on a recent blog post by Eric Lippert did I find out about a horrible flaw. Dubious thanks to John Payson for prompting this post. (And I fully intend to include this in the Abusing C# talk that I give every so often…)

Trigger warning: this post contains code you may wish you’d never seen.

Let’s start off with the happy path – a moment of calm before the storm.

When things go well…

The idea is to create a class – which can be abstract – with a certain fixed set of instances, which are available via static properties on the class itself. The class is not sealed, but it prevents other classes from subclassing it by using a private constructor. Here’s an example, including how it can be used:

using System;

public abstract class Operation
{
    public static Operation Add { get; } = new AddOperation();
    public static Operation Subtract { get; } = new SubtractOperation();

    // Prevent non-nested subclasses. They can't call this constructor,
    // which is the only way of creating an instance.
    private Operation()
    {}

    public abstract int Apply(int lhs, int rhs);

    private class AddOperation : Operation
    {
        public override int Apply(int lhs, int rhs)
        {
            return lhs + rhs;
        }
    }

    private class SubtractOperation : Operation
    {
        public override int Apply(int lhs, int rhs)
        {
            return lhs - rhs;
        }
    }
}

public class Test
{
    static void Main()
    {
        Operation add = Operation.Add;
        Console.WriteLine(add.Apply(5, 3)); // 8
    }
}

(Note the use of the new “automatically implemented read-only property” from C# 6. Yum!)

Obviously there are possible variations of this – there could be multiple instances of one subclass, or we could even create instances dynamically if that were useful. The main point is that no other code can create instances or even declare their own subclasses of Operation. That private constructor stops everything, right? Um…

My eyes! The goggles do nothing!

I’m sure all my readers are aware that every C# constructor either has to chain to another constructor using : this(...) or chain to a base class constructor, either implicitly calling a constructor without arguments or explicitly using : base(...).

What I wasn’t aware of – or rather, I think I’d been aware of once, but ignored – was that constructor initializers can have cycles. In other words, it’s absolutely fine to write code like this:

public class SneakyOperation : Operation
{
    public SneakyOperation() : this(1)
    {}

    public SneakyOperation(int i) : this()
    {}

    public override int Apply(int lhs, int rhs)
    {
        return lhs * rhs;
    }
}

Neither operator attempts to call a base class constructor, so it doesn’t matter that the only base class constructor is private. This code compiles with no issues – not even a warning.

So we’ve already failed in our intention to prevent subclassing from occurring… but this will fail with a StackOverflowException which will take down the process, so there’s not much harm, right?

Well… suppose we didn’t let it throw a StackOverflowException. Suppose we triggered our own exception first. There are various ways of doing this. For example, we could use arithmetic:

public SneakyOperation(int x) : this(x, 0)
{}

// This isn't going to be happy if y is 0...
public SneakyOperation(int x, int y) : this(x / y)
{}

Or we could have one constructor call another with a delegate which will deliberately throw:

public SneakyOperation(int x) : this(() => { throw new Exception("Bang!"); })
{}

public SneakyOperation(Func<int> func) : this(func())
{}

So now, we have a class where you can call the constructor, and an exception will be thrown. But hey, you still can’t get at the instance which was created, right? Well, this is tricky. We genuinely can’t use this anywhere in code which we expect to execute – there’s no way of reaching a constructor body, because in order to do that we’d have to have some route which took us through the base class constructor, which was can’t access. Field initializers aren’t allow to use this, either explicitly or implicitly (e.g. by calling instance methods). If there’s a way of circumventing that, I’m not aware of it.

So, we can’t capture a reference to the instance before throwing the exception. And after the exception is thrown, it’s garbage, and therefore unavailable… unless we take it out of the garbage bin, of course. Enter a finalizer…

private static volatile Operation instance;

~SneakyOperation()
{
    // Resurrect the instance.
    instance = this;
}

Now all we have to do is wrap up the "create an object and wait for it to be resurrected" logic in a handy factory method:

public static Operation GetInstance()
{
    try
    {
        new SneakyOperation(0);
    }
    catch {} // Thoroughly expected
    Operation local;
    while ((local = instance) == null)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
    return local;
}

Admittedly if multiple threads call GetInstance() at a time, there’s no guarantee which thread will get which instance, or whether they’ll get the same instance… but that’s fine. There’s very little to guarantee this will ever terminate, in fact… but I don’t care much, as clearly I’m never going to use this code in production. It works on my machine, for the few times I’ve been able to bear running it. Here’s a complete example (just combine it with the Operation class as before) in case you want to run it yourself:

public class SneakyOperation : Operation
{
    private static volatile Operation instance;

    private SneakyOperation(int x) : this(() => { throw new Exception(); })
    {}

    private SneakyOperation(Func<int> func) : this(func())
    {}

    public static Operation GetInstance()
    {
        try
        {
            new SneakyOperation(0);
        }
        catch {}
        Operation local;
        while ((local = instance) == null)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
        }
        return local;
    }

    ~SneakyOperation()
    {
        instance = this;
    }

    public override int Apply(int lhs, int rhs)
    {
        return lhs * rhs;
    }
}

public class Test
{
    static void Main()
    {
        Operation sneaky = SneakyOperation.GetInstance();
        Console.WriteLine(sneaky.Apply(5, 3)); // 15
    }
}

Vile, isn’t it?

Fixing the problem

The first issue is the cyclic constructors. That’s surely a mistake in the language specification. (It’s not a bug in the compiler, as far as I can tell – I can’t find anything in the spec to prohibit it.) This can be fixed, and it shouldn’t even be too hard to do. There’s nothing conditional in constructor initializers; each constructor chains to exactly one other constructor, known via overload resolution at compile time. Even dynamic doesn’t break this – constructor initializers can’t be dispatched dynamically. This makes detecting a cycle trivial – from each constructor, just follow the chain of constructor calls until either you reach a base class constructor (success) or you reach one of the constructor declarations already in the chain (failure). I will be pestering Mads for this change – it may be too late for C# 6, but it’s never too early to ask about C# 7. This will be a breaking change – but it will only break code which is tragically broken already, always resulting in either an exception or a stack overflow.

Of course, a C# language change wouldn’t change the validity of such code being created through IL. peverify appears to be happy with it, for example. Still, a language fix would be a good start.

There’s a second language change which could help to fix this too – the introduction of private abstract/virtual methods. Until you consider nested classes, neither of those make sense – but as soon as you remember that there can be subclasses with access to private members, it makes perfect sense to have them. If Operation had a private abstract method (probably exposed via a public concrete method) then SneakyOperation (and any further subclasses) would have to be abstract, as it couldn’t possibly override the method. I’ve already blogged about something similar with public abstract classes containing internal abstract methods, which prevents concrete subclasses outside that assembly… but this would restrict inheritance even further.

Does it reach the usefulness bar required for new language features? Probably not. Would it be valid IL with the obvious meaning? I’ve performed experiments which suggest that it would be valid, but not as restrictive as one would expect – but I could be mistaken. I think there’s distinct merit in the idea, but I’m not expecting to see it come to pass.

Conclusion

So, what can you take away from this blog post? For starters, the fact that the language can probably be abused in far nastier ways than you’ve previously imagined – and that it’s very easy to miss the holes in the spec which allow for that abuse. (It’s not like this hole is a recent one… it’s been open forever.) You should absolutely not take away the idea that this is a recommended coding pattern – although the original pattern (for Operation) is a lovely one which deserves more widespread recognition.

Finally, you can take away the fact that nothing gets my attention quicker than a new (to me) way of abusing the language.

The mysteries of BCL time zone data

Note: I’ve now identified two bugs in TimeZoneInfo… details later in the post.

Background

Early on Friday morning (UTC), IANA released version 2014h of the time zone database. As a dutiful Noda Time maintainer, I fetched it, converted it into our native format, and ran the unit tests prior to pushing the new version.

Unfortunately, some tests failed. These weren’t the tests of the IANA data at all – they were tests of the BCL time zone data, which we access via TimeZoneInfo. I knew our code hadn’t changed, so I temporarily set those tests to be ignored, pushed the update, and filed a bug so I’d remember to fix it. On Friday evening, I tried to work out what had happened – strongly suspecting Windows Update had given us some “interesting” time zone data. Matt Johnson helpfully pointed me to a hotfix which I suspect rolled out via Windows Update, causing me to notice the issue. (As an aside, this is an argument in favour of regular CI builds even when no code has been pushed, if your code uses data which is updated automatically.) Digging into the time zone data for the Russian zones which were updating in the hotfix, I’m very confused.

This post was prompted by Noda Time, but actually it doesn’t involve Noda Time any further. I simply wanted to give some examples of the odd data I’m trying to understand.

Windows adjustment rules

TimeZoneInfo reveals its daylight saving rules via the GetAdjustmentRules() method, which returns an array of TimeZoneInfo.AdjustmentRule. An adjustment rule consists of:

  • The delta between daylight saving time and the zone’s standard time. (Windows time zones can only have a single standard offset; the data model doesn’t really handle the situation where the standard offset changes over time.)
  • A range of dates for which the rule is applicable
  • Transition times for the start and end of daylight savings

The transition times (TimeZoneInfo.TransitionTime) determine at which point within any given year the time zone starts or stops observing daylight saving time. Each time can either have a fixed date (e.g. “March 5th”) or a week-based rule (“The 3rd Sunday in October”). Additionally, it has a time of day associated with it.

So, in order to work out the offset from UTC for a specific moment in time, you should work out which rule it’s part of, and then check whether or not it’s in the daylight saving portion of the rule, right? That certainly sounds reasonable, but it’s slightly trickier in practice.

Awful GUI tooling ahoy!

I got fed up with poking around in the debugger and writing down all the adjustment rules, then testing what it actually did – so I wrote a very small WinForms app to experiment with. The source is in my GitHub repo, and may improve over time. (I’m really not a UX designer. I make sure that a UI does what I need it to, then run away before it breaks. Apologies to anyone who feels offended by the awfulness of this.)

The tool (“TimeZoneInfo explorer”) allows you to select a time zone, at which point the adjustment rules will be displayed. You can then also select two date/time values, and it will show how the UTC offset changes between the two, sampling it once per hour. Here’s an example, showing the UK time zone around the DST transition in October 2014:

UK time zone

Now the UK is particularly simple – especially as the Windows time zone database doesn’t know anything about the British Standard Time experiment, for example. Other time zones are more complicated, and that’s where things get tricky. Let’s look at a few and see why things aren’t as straightforward as they appear.

Awkward cases

Here’s one of the examples which caused my Noda Time failure last Friday: Kaliningrad.

Kaliningrad time zone

There are a few things to note here:

  • The rules are from the beginning of time to the end of 2010, then for the whole of 2011, then for 2014… it’s not obvious what should happen in 2012 or 2013, or after 2014.
  • The DST end transition for 2011 is at the start of the year
  • The DST start transition for 2014 is at the start of the year
  • The UTC offset changes at midnight UTC at the start of 2014

Now, the documentation for the time of day of a transition is actually fairly clear:

For transitions from standard time to daylight saving time, the TimeOfDay value represents the time of the transition in the time zone’s standard time. For transitions from daylight saving time to standard time, it represents the time of the transition in the time zone’s daylight saving time.

In other words, it’s always the time of day that would have occurred locally if there wasn’t a transition – in IANA time zone language, this is a “wall mode” transition, as it tells you the time you’d see on a wall clock exactly when you need to adjust it.

Great. Now what about the start and end date of a rule? For example, which rule – if any – is in force in Kaliningrad at 2013-12-31T23:00:00Z? The UTC year is 2013, but the local year would be 2014. Does it fall in the gap, or does it use the 2014 rule?

There are three options for interpreting the start/end dates:

  • They’re in UTC
  • They’re in standard time (i.e. using the standard UTC offset, and ignoring the possibility that the new year occurs when daylight savings are being observed
  • They’re in local time

The documentation for AdjustmentRule.DateStart and AdjustmentRule.DateEnd don’t specify how the value should be interpreted, although they do have this warning:

Unless there is a compelling reason to do otherwise, you should define the adjustment rule’s start date to occur within the time interval during which the time zone observes standard time. Unless there is a compelling reason to do so, you should not define the adjustment rule’s start date to occur within the time interval during which the time zone observes daylight saving time. For example, if a time zone’s transition from daylight saving time occurs on the third Sunday of March and its transition to daylight saving time occurs on the first Sunday of October, the effective start date of the adjustment rule should not be January 1 of a particular year, since that date occurs within the period of daylight saving time.

However, I think that every system-provided adjustment rule I’ve seen starts on January 1st and ends on December 31st.

Let’s defer judgement on what this all means until we’ve seen a couple more examples. Next up, Perth. If you enjoy adjusting your clocks, then Windows thinks that Perth is a great place to live, at least at the end of 2008:

Perth time zone

At midnight local time, the offset is adjusted from +9 to +8… and then at 8am local time (as it would have been) it’s adjusted back again, making it 9am.

Finally, here’s a similar example for Tripoli:

Tripoli time zone

This is similar to the Perth case, changing the clocks twice in quick succession – although the transition at the start of 2012 is similar to the Kaliningrad case, occurring at midnight UTC instead of midnight local time. This was actually the first case I noticed, causing issue 220 in Noda Time, and filed as a Connect bug.

So what’s the answer?

~~I haven’t yet come up with a perfect way of understanding the Windows adjustment rules.~~ I’ve now identified two bugs, , thanks to looking at the reference source. Let’s focus on Tripoli, and try to work out some way of explaining the two transitions around the start of 2012, effectively tracing the call to TimeZoneInfo.GetUtcOffset, passing in two different values:

  • 2012-12-31T22:00:00Z
  • 2013-01-01T00:00:00Z.

Just to recap, look again at the final picture above. The first adjustment rule (extended in both directions) would propose the following transitions (amongst many others):

  • January 1st 2012, midnight local time, +02 -> +03
  • November 10th 2012, 2am local time, +03 -> +02
  • January 6th 2013, midnight local time, +02 -> +03

The second adjustment rule would propose the following transitions:

  • January 3rd 2012, midnight local time, +03 -> +02
  • March 30th 2012, 1am local time, +02 -> +03
  • January 1st 2013, midnight local time (2012-12-31T21:00:00Z), +03 -> +02
  • March 29th 2013, 1am local time, +02 -> +03

Note that at 2012-12-31T22:00:00Z, neither rule would suggest that it’s in DST – and yet the offset is +03. At 2013-01-01T00:00:00Z, it still shouldn’t be in DST, and indeed the offset is +02… but it’s not obvious why there would be a transition at this point.

As noted in a comment, the reference source for TimeZoneInfo holds the key.

First bug: 2012-12-31T22:00:00Z

The code uses the standard offset to determine which adjustment rule to use for the year (so that’s answered that question) and then determines when DST starts and ends in local time. This is where the first bug lies… while the rule to use is identified in GetAdjustmentRuleForTime using the standard time version of the input, the year for which to determine the transitions is the UTC year of the input, due to this statement:

isDaylightSavings = GetIsDaylightSavingsFromUtc(time, year, zone.m_baseUtcOffset, rule, out isAmbiguousLocalDst);

Here time is the UTC version of the original input, and year is assigned a few lines earlier as:

year = time.Year;

So even though we’re using the 2013 rule, we’re finding out what DST transitions it would have used in 2012! Those transitions are:

  • January 3rd 2012, midnight local time (2012-01-02T21:00:00Z), +03 -> +02
  • March 30th 2012, 1am local time (2012-03-29T23:00:00Z), +02 -> +03

The transition times are converted from local time into UTC, and then passed to CheckIsDst to determine whether the tuple of (entering DST, time to check, exiting DST) means that the time to check is in DST. The call is effectively:

CheckIsDst(2012-03-29T23:00:00Z, // startTime
           2012-12-31T22:00:00Z, // time
           2012-01-02T21:00:00Z) // endTime

That looks like it’s in DST… so we end up with an offset of +03.

So the bug here is that the rule for 2013 is being asked for its 2012 transitions, despite the first 2013 transition actually coming earlier than the point in time we’re asking about.

Second bug: 2013-01-01T00:00:00Z

Back to the start, and again the code picks the 2013 rule (we’re now two hours into 2013, according to standard time). This time, because the UTC year is also UTC, the code asks the rule for the transitions during 2013. As shown above, these are:

  • DST ends: January 1st 2013, midnight local time (2012-12-31T21:00:00Z), +03 -> +02
  • DST starts: March 29th 2013, 1am local time (2012-03-28T23:00:00Z), +02 -> +03

This time our call to CheckIsDst is:

CheckIsDst(2013-03-28T23:00:00Z, // startTime
           2013-01-01T00:00:00Z, // time
           2012-12-31T21:00:00Z) // endTime

So far so good. This looks like we should not be in DST. But now we come to the body of CheckIsDst, which starts off like this:

static private Boolean CheckIsDst(DateTime startTime,
                                  DateTime time,
                                  DateTime endTime) {
    Boolean isDst;
    int startTimeYear = startTime.Year;
    int endTimeYear = endTime.Year;

    if (startTimeYear != endTimeYear) {
        endTime = endTime.AddYears(startTimeYear - endTimeYear);
    }

The years of startTime and endTime aren’t the same, so a year is arbitrarily added to endTime. In other words, the code is effectively assuming that the transition is on the same date every year. It does the same for time as well, if necessary – which it isn’t in our case. So after this change, we have:

  • startTime = 2013-03-28T23:00:00Z
  • time = 2013-01-01T00:00:00Z
  • endTime = 2013-12-31T21:00:00Z

This time we get the right answer – but we’re looking at an “end transition” which isn’t logically predicted by the rule. (The next “UTC end” is in January 2014, because the local date of the change would be Tuesday January 7th.)

The operation here is simply illogical. To my mind, the only justifiable way of determining whether or not a particular time falls in the DST or standard part of a recurrent rule is to predict the transition on/before it and the transition after it.

Conclusion

Of course, none of this reflects what happens in real life – the Windows time zone data is simply inaccurate here, or at best a poor facsimile of a complex situation, limited by the representation available. Still, it would be nice to be able to understand how the BCL is interpreting the data, in order to replicate it in Noda Time.

Now that I understand the two BCL bugs – or at least two of the BCL bugs – I’m minded to suppress the tests and add some warning documentation. This isn’t a matter of “the documentation isn’t terribly clear about what the rules mean” – it’s fundamentally broken behaviour. Ick.

When is a constant not a constant? When it’s a decimal…

A comment on a Stack Overflow post recently got me delving into constants a bit more thoroughly than I have done before.

Const fields

I’ve been aware for a while that although you can specify decimal field as a const in C#, it’s not really const as far as the CLR is concerned. Let’s consider this class to start with:

class Test
{
    const int ConstInt32 = 5;
    const decimal ConstDecimal = 5;
}

Firstly, csc gives us a warning about ConstDecimal but not about ConstInt32:

Test.cs(4,19): warning CS0414: The field ‘Test.ConstDecimal’ is assigned but its value is never used

The Roslyn compiler (rcsc) doesn’t warn about either of the fields. This is just a curiosity, really – but it already shows that there’s some difference in how they’re compiled. When we delve into the IL, the difference is much more pronounced:

.class private auto ansi beforefieldinit Test
       extends [mscorlib]System.Object
{
  .field private static literal int32 ConstInt32 = int32(0x00000005)
  .field private static initonly valuetype [mscorlib]System.Decimal ConstDecimal
  .custom instance void [mscorlib]DecimalConstantAttribute::.ctor
      (uint8, uint8, uint32, uint32, uint32) = 
      ( 01 00 00 00 00 00 00 00 00 00 00 00 05 00 00 00 00 00 ) 

  // Skip the parameterless constructor

  .method private hidebysig specialname rtspecialname static 
          void  .cctor() cil managed
  {
    // Code size       12 (0xc)
    .maxstack  8
    IL_0000:  ldc.i4.5
    IL_0001:  newobj     instance void [mscorlib]System.Decimal::.ctor(int32)
    IL_0006:  stsfld     valuetype [mscorlib]System.Decimal Test::ConstDecimal
    IL_000b:  ret
  } // end of method Test::.cctor

} // end of class Test

First things to note:

ConstInt32 has the literal constraint in IL. From ECMA 335, I.8.6.1.2:

The literal constraint promises that the value of the location is actually a fixed value
of a built-in type. The value is specified as part of the constraint. Compilers are
required to replace all references to the location with its value, and the VES therefore
need not allocate space for the location. This constraint, while logically applicable to
any location, shall only be placed on static fields of compound types. Fields that are
so marked are not permitted to be referenced from CIL (they shall be in-lined to their
constant value at compile time), but are available using reflection and tools that
directly deal with the metadata.

Whereas ConstDecimal only has the initonly constraint. Again, from ECMA 335:

The init-only constraint promises (hence, requires) that once the location has been
initialized, its contents never change. Namely, the contents are initialized before any
access, and after initialization, no value can be stored in the location. The contents are
always identical to the initialized value (see §I.8.2.3). This constraint, while logically
applicable to any location, shall only be placed on fields (static or instance) of
compound types.

(Here “compound type” just means “not an array type” – although in quite a confusing manner. I would ignore it if I were you.)

Next, note that there’s an attribute (System.Runtime.CompilerServices.DecimalConstantAttribute – I’ve taken the namespace off in the listing for the sake of readability) applied to the field, which tells anything consuming the assembly that it should be a constant, and what its value is. Indeed, if you’re very careful, you can create your own “constants” like this:

[DecimalConstant((byte)0, (byte)0, (uint)0, (uint)0, (uint) 5)]
public static readonly decimal ConstDecimal;

That field declaration will be treated as a constant in other assembles – but not within the same assembly. So printing ConstDecimal within the same assembly will result in 0 (unless you change it to another value in the static initializer) whereas printing Test.ConstDecimal in a different assembly will result in 5. (It won’t even touch the field at execution time.) I’m sure I can work out some nasty ways of abusing that, if I try hard enough.

Note that the casts to uint are important – if you accidentally call the attribute constructor with a (byte, byte, int, int, int), the compiler doesn’t recognize it. (Interesting, the latter was only introduced in .NET 2.0. I’ve no idea why.)

Amusingly, you can combine the two:

[DecimalConstant((byte)0, (byte)0, (uint)0, (uint)0, (uint) 5)]
public const decimal ConstDecimal = 10;

In this case, the IL contains both DecimalConstant attributes, despite the fact that it’s only legal to apply one. (AllowMultiple is false on its AttributeUsage.) The compiler appears to pick the one specified by the value rather than manually applied, which is slightly disappointing, but of no real importance.

Optional parameters

In the case of const fields, we’ve really only cared about what the compiler does. Let’s try something where both the compiler and the framework can get involved: optional parameters.

Again, let’s write a little class to demonstrate how the default values of optional parameters are encoded in IL:

public class Test
{
    public void PrintInt32(int x = 10)
    {
        Console.WriteLine(x);
    }

    public void PrintDecimal(decimal d = 10m)
    {
        Console.WriteLine(d);
    }
}

The important bits of the generated IL are:

.method public hidebysig instance void PrintInt32([opt] int32 x) cil managed
{
    .param [1] = int32(0x0000000A)
    ...
} // end of method Test::PrintInt32

.method public hidebysig instance void PrintDecimal(
    [opt] valuetype [mscorlib]System.Decimal d) cil managed
{
    .param [1]
    .custom instance void [mscorlib] DecimalConstantAttribute::.ctor
    (uint8, uint8, uint32, uint32, uint32) =
    ( 01 00 00 00 00 00 00 00 00 00 00 00 0A 00 00 00 00 00 ) 
    ...
} // end of method Test::PrintDecimal

Again, we have a DecimalConstantAttribute to describe the default value for the decimal parameter, whereas . If you call the method but don’t specify an argument, the compiler notes the DecimalConstantAttribute applied to the method parameter, and constructs the value in the calling code. That’s not the only way of observing the default value, however – you can do it with reflection, too:

public static void Main()
{
    var method = typeof(Test).GetMethod("PrintDecimal");
    var parameter = method.GetParameters()[0];
    Console.WriteLine("Default value: {0}", parameter.DefaultValue);
}

As you’d expect, that prints a default value of 10. There’s no direct equivalent of DefaultValue for FieldInfo – there’s GetRawConstantValue() but that only works for constants that the CLR is directly aware of – it fails for a field like const decimal Foo = 10m , with an InvalidOperationException. I’ll talk more about CLR constants later.

Now let’s try something a bit more tricksy though… C# doesn’t support DateTime literals, unlike VB – but there’s a DateTimeConstantAttribute – what happens if we try to apply that ourselves? Let’s see…

public void PrintDateTime
    ([Optional, DateTimeConstant(635443315962469079L)] DateTime date)
{
    Console.WriteLine(date);
}

So if we call PrintDateTime(), what does that print? Well (leaving aside the formatting – the examples below use the UK default formatting):

  • With csc.exe (the “old” C# compiler), with the call in the same assembly as the method declaration, it prints 01/01/0001 00:00:00
  • With csc.exe, with the call in a different assembly to the method declaration, it prints 22/08/2014 19:13:16
  • With rcsc.exe (Roslyn), it prints 22/08/2014 19:13:16 regardless of where it’s called from
  • If you call it dynamically (dynamic d = new Test(); d.PrintDateTime();) it prints 22/08/2014 19:13:16 regardless of the compiler – at least with the version of .NET I’m using. It may well vary by version.

In every case, printing out the ParameterInfo.DefaultValue gives the right answer: the framework doesn’t care whether or not the compiler understands the attribute.

In case you’re wondering why I didn’t mention this possibility for constant fields – I tried it, and it didn’t work, even in Roslyn. For some reason optional parameters are treated as more “special” than constant fields.

Having got this far, why stop with DateTime? The DateTimeConstantAttribute class derives from CustomConstantAttribute (whereas DecimalConstantAttribute just derives from Attribute). So can I introduce my own constant attributes? Noda Time seems to be an obvious candidate for this – let’s try for a LocalDateConstantAttribute:

[AttributeUsage(AttributeTargets.Parameter | AttributeTargets.Field)]
public class LocalDateConstantAttribute : CustomConstantAttribute
{
    private readonly LocalDate value;

    public LocalDateConstantAttribute(int year, int month, int day)
    {
        value = new LocalDate(year, month, day);
    }

    public override object Value { get { return value; } }
}
...
public void PrintLocalDate(
    [Optional, LocalDateConstant(2014, 8, 22)] LocalDate date)
{
    Console.WriteLine(date);
}

How does this fare? Not so well, unfortunately:
– With a normal method call (regardless of assembly), it prints 01 January 1970 (the default value for a LocalDate in Noda Time 1.3)
– With a dynamic method call it prints 01 January 1970
– With reflection (i.e. using ParameterInfo.DefaultValue) the framework does construct the appropriate LocalDate, which seems logical as it’s presumably just using the Value property of the attribute

So, there’s still work to be done there. I think it would be feasible to do this in a general way, if it’s acceptable for an exception to be thrown if the Value property returns an incompatible type of object to the parameter type. The great thing is that Roslyn is open source, so I should be able to spike this myself, right? How hard can it be? (Cue howls of derisive laughter from the Roslyn team, who will know much better than I how hard it’s likely to really be.)

CLR constants and attribute arguments

So with constant fields, it was all down to the compiler, really. For optional parameters, it’s mostly still down to the compiler, but with framework support for reflection. What about attribute arguments? They’re the other notable aspect of C# which requires compile-time constants. For example, this is fine:

[Description("This is a constant")]

But this is not:

[Description("Initialized at " + DateTime.Now)]

Intriguingly, this is fine too:

[ContractClass(typeof(Foo))]

… despite the fact that

const Type ConstType = typeof(Foo);

isn’t valid. The only constant expression which is valid for type Type is null. So in section 17.2 of the C# 5 specification, Type is explicitly called out:

An expression E is an attribute-argument-expression if all of the following statements are true:

  • The type of E is an attribute parameter type (§17.1.3).
  • At compile-time, the value of E can be resolved to one of the following:
    • A constant value.
    • A System.Type object.
    • A one-dimensional array of attribute-argument-expressions.

(Interestingly, there’s no indication that I can see that the value of E has to be obtained via typeof, in the spec – clearly [ContractClass(Type.GetType("TodayIs" + DateTime.Today.Day))] should be invalid, but I can’t currently see which bit of the spec that violates. Something to be tightened up, potentially.)

And the “attribute parameter type” part – section 17.1.3 – looks like this:

The types of positional and named parameters for an attribute class are limited to the attribute parameter types, which are:

  • One of the following types: bool, byte, char, double, float, int, long, sbyte, short, string, uint, ulong, ushort.
  • The type object.
  • The type System.Type.
  • An enum type, provided it has public accessibility and the types in which it is nested (if any) also have public accessibility (§17.2).
  • Single-dimensional arrays of the above types.

Oops – no decimal. Note that the type of E that has to be one of those types, as well as the parameter type… so it doesn’t help to have a parameter of type object and then try to pass a constant decimal value as the argument.

Basically, attributes arguments are sufficiently low-level that the CLR itself wants to know about them – and while it has a deep knowledge of Type, string and the primitive value types, it doesn’t have much knowledge about decimal (or at least, it isn’t required to). Attribute arguments can’t use funky “custom constant” attributes to specify values like decimal or DateTime – you really are limited to the types that the CLR knows about. In a future version it’s not inconceivable that this could be broadened, but at the moment it’s pretty strict.

Conclusion

So, it turns out the idea of a constant isn’t terribly constant in itself. We have:

  • Constant fields, which are primarily a compile-time concern, and therefore language-specific.
  • Optional parameter default values, which feel like they ought to be just like constant fields (in that a value specified in one place is substituted in another) but apparently have a bit more support in the C# compiler… and more reflection support too.
  • Attribute arguments, which are the strictest form of constant I’ve found so far, in that they have to correspond to a small set of CLR “special” types.

I didn’t even talk about constant expressions (section 7.19 of the C# 5 spec) much in this post – maybe I’ll delve into those in more detail in another post.

Unlike my normal day-dreaming about changing the compiler, I think I really might have a crack at Roslyn for supporting arbitrary optional parameters – it feels like it could potentially be genuinely useful (which is also unlike most of my idle speculation).

So next time you ask yourself whether something is a constant, you should start off by asking yourself what you mean by “constant” in the first place.

The BobbyTables culture

I started writing a post like this a long time ago, but somehow never finished it.

Countless posts on Stack Overflow are vulnerable to SQL injection attacks. Along with several other users, I always raise this when it shows up – this is something that really just shouldn’t happen these days. It’s a well-understood issue,and parameterized SQL is a great solution in almost all cases. (No, it doesn’t work if you want to specify an column or table name dynamically. Yes, whitelisting is the solution there.)

The response usually falls into one of three camps:

  • Ah – I didn’t know about that. Great, I’ll fix it now. Thanks!
  • This is just a prototype. I’ll fix it for the real thing. (Ha! Like that ever happens.)
  • Well yes, in theory – but I’m just using numbers. That’s not a problem, is it?

Now personally I feel that you should just get the habit of using parameterized queries all the time, even when you could get away without it. This post is a somewhat tongue-in-cheek counterargument to the last of these responses. If you haven’t seen Bobby Tables, you really should. It’s the best 10-second explanation of SQL injection that I’ve ever seen, and I almost always drop a link to it when I’m adding a comment on a vulnerable query on Stack Overflow.

So in honour of Bobby, here’s a little program. See if you can predict the output.

using System;
using System.Globalization;
using System.Threading;

class Test
{
    static void Main()
    {
        string sql = "SELECT * FROM Foo WHERE BarDate > '" + DateTime.Today + "'";
        // Imagine you're executing the query here...
        Console.WriteLine(sql);

        int bar = -10;
        sql = "SELECT * FROM Foo WHERE BarValue = " + bar;
        // Imagine you're executing the query here...
        Console.WriteLine(sql);
    }

    // Some other code here...
}

Does that look okay? Not great, admittedly – but not too bad, right? Well, the output of the program is:

SELECT * FROM Foo WHERE BarDate > '2014-08-08' OR ' '=' '
SELECT * FROM Foo WHERE BarValue = 1 OR 1=1 OR 1=10

Yikes! Our queries aren’t filtering out anything!

Of course, the black magic is in “Some other code here” part:

static Test()
{
    InstallBobbyTablesCulture();
}

static void InstallBobbyTablesCulture()
{
    CultureInfo bobby = (CultureInfo) CultureInfo.InvariantCulture.Clone();
    bobby.DateTimeFormat.ShortDatePattern = @"yyyy-MM-dd'' OR ' '=''";
    bobby.DateTimeFormat.LongTimePattern = "";
    bobby.NumberFormat.NegativeSign = "1 OR 1=1 OR 1=";
    Thread.CurrentThread.CurrentCulture = bobby;
}

Neither numbers (well, negative numbers in this case) nor dates are safe. And of course if your database permissions aren’t set correctly, the queries could do a lot more than just remove any filtering. For extra fun, you can subvert some custom format strings – by changing the DateSeparator property, for example.

Even in sensible cultures, if the database expects you to use . for the decimal separator and you’re in a European culture that uses , instead, do you know how your database will behave? If you sanitize your input based on the numeric value, but then that isn’t the value that the database sees due to a string conversion, how comfortable are you that your application is still safe? It may not allow direct damage, but it could potentially reveal more data than you originally expected – which is definitely a vulnerability in a form.

Now the chances of me getting onto your system and installing the Bobby Tables culture – let alone making it the system default – are pretty slim, and if that happens you’ve probably got bigger problems anyway… but it’s the principle of the thing. You don’t care about a text representation of your values: you just want to get them to the database intact.

Parameterized SQL: just say yes.

Object pooling and thread safety

A few days ago, I was watching Dustin Campell’s excellent talk from TechEd 2013, “Essential truths everyone should know about performance in a large managed code base”. I highly recommend it, and it made me think a bit more about Noda Time’s performance. (It doesn’t take much to make me think about Noda Time’s performance, admittedly.)

Base situation: no pooling

Most of our formatting goes through something called SteppedPattern – that’s basically one multicast delegate (with lots of actions) for formatting, and a list of delegates for parsing. The Format method was really simple before this week:

public string Format(TResult value)
{
    StringBuilder builder = new StringBuilder();
    // This will call all the actions in the multicast delegate.
    formatActions(value, builder);
    return builder.ToString();
}

Very simple – but it allocates a new StringBuilder on each call, which is somewhat unnecessary. Allocation is cheap, but not free – and string formatting is something which could reasonably happen an awful lot.

In the talk, Dustin showed an example where introducing a single object pool – also for StringBuilder – had made marked difference. Unfortunately, in my case it’s not as simple as we might like. Or at least, I’m not confident that it is. Let’s look at the options I’ve considered so far – in no particular order.

Option 1: Lock-free using Interlocked

After construction, a formatter can be used on multiple threads, so some case is needed to make sure we don’t end up using one StringBuilder from multiple threads at a time. Interlocked.Exchange makes this reasonably straightforward:

private StringBuilder cachedBuilder = new StringBuilder();
...
public string Format(TResult value)
{
    // Try to borrow the cached builder, in a thread-safe way.
    // (If another thread is using the builder, this will return null.)
    StringBuilder builder = Interlocked.Exchange(ref cachedBuilder, null);
    if (builder == null)
    {
        builder = new StringBuilder();
    }
    else
    {
        builder.Length = 0;
    }
    // This will call all the actions in the multicast delegate.
    formatActions(value, builder);
    string ret = builder.ToString();
    // Whether we managed to borrow the cached builder or not,
    // we can replace it with the one we've used, now that we
    // know what to return.
    Volatile.Write(ref cachedBuilder, builder);
    return ret;
}

(See comments for discussion on the Volatile.Write call, which was previously just an assignment statement, along with the use of Interlocked.Exchange compared with an earlier version using Interlocked.CompareExchange.)
I’m pretty confident that this does achieve the goal of only using the builder in one thread at a time. If an exception is thrown while formatting, that’s fine – we don’t end up putting anything back in the cache, and the next caller will allocate a new one.
I don’t believe we need to use Exchange a second time when writing to cachedBuilder – we’re certainly fine if it takes a while for that write to be seen by other threads, as the only downside would be not using the cached builder when we might want to.

But is that enough to make this code thread-safe? Is it possible that a write from one thread would only be seen in another thread after the second thread has borrowed the builder? Is it possible that the write to cachedBuilder actually occurs before the call to builder.ToString() completes, allowing another thread to start messing with the builder before we’ve got our result?

My guess is that it’s actually okay – but lock-free code fundamentally scares me, and I’m very cautious about writing it. (Elsewhere in Noda Time we do have some lock-free mutable data – but usually in the form of array-based caches where the element type is guaranteed to have atomic writes, and is immutable.)

So while this is “clever” and may be fine, it makes me nervous, which is not a good place to be.

Option 2: Locking

I’m much more confident about using threads when locking is involved. I find it easier to reason about what’s allowed and what’s not.

private readonly StringBuilder pooledStringBuilder = new StringBuilder();
private bool pooledBuilderInUse = false;
...

public string Format(TResult value)
{
    StringBuilder builder = null;
    lock (pooledStringBuilder)
    {
        if (!pooledBuilderInUse)
        {
            pooledBuilderInUse = true;
            builder = pooledStringBuilder;
            builder.Length = 0;
        }
    }
    try
    {
        builder = builder ?? new StringBuilder();
        // This will call all the actions in the multicast delegate.
        formatActions(value, builder);
        return builder.ToString();
    }
    finally
    {
        if (builder == pooledStringBuilder)
        {
            lock (pooledStringBuilder)
            {
                pooledBuilderInUse = false;
            }
        }
    }
}

Unlike the previous solution, this time we always know what the builder is (and never change it) – that gives us something to lock on. What we change within the lock is whether or not the builder is in use. This should be safe across multiple threads, but it’s ever so verbose – and in my benchmarks, the results weren’t terribly clear.

Option 3: ThreadStatic

The obvious alternative to sharing one builder between multiple threads is to have a
single builder per thread. There are two approaches to this in .NET:
[ThreadStatic] and ThreadLocal.

ThreadStatic is the older mechanism, and is simply a matter of decorating a static field with the [ThreadStatic] attribute. Each thread has a separate variable, as if by magic. So in principle, the implementation just looks like this:

[ThreadStatic]
private static StringBuilder cachedBuilder;

public string Format(TResult value)
{
    StringBuilder builder = cachedBuilder;
    if (cachedBuilder == null)
    {
        builder = new StringBuilder();
        cachedBuilder = builder;
    }
    else
    {
        builder.Length = 0;
    }
    // This will call all the actions in the multicast delegate.
    formatActions(value, builder);
    return builder.ToString();
}

Unfortunately we still need to check for nullity on every call, as any initialization
expression for the variable will only be executed by the first thread which uses the class. Still, the code is reasonably simple.

There’s a big “gotcha” with this though: our code is now not re-entrant. If one SteppedPattern has a format action which calls format on another SteppedPattern, we’ll end up trying to use the same builder for both operations, resetting the builder in the middle.

Fortunately, SteppedPattern already has a method (FormatPartial) designed for calling one formatter within another. In the very few places where this is required, I’m confident that it’s being called correctly – but I’m not keen on making my code fragile like this. It’s fairly safe from third party intervention – there are very few places (if any) where formatting code calls out to any other code which could include a format call. Just writing this paragraph, I’ve come up with a couple of worrying parts (fetching values from a CultureInfo, and fetching the ID of a time zone) but I think they’re safe… the question is whether I’ve missed something similar in another case. Either way, it would have to be pretty pathological code to cause a problem, and I consider the chances of this being a problem in the real world to be essentially zero.

However, I still want to guard against accidental re-entrancy within Noda Time itself – so I’ve got a debug-only version of Format which performs a crude re-entrancy check using another [ThreadStatic] variable. (And I’ve got a debug-only test to check that the check works, of course.) So long as my unit tests all still pass in a debug configuration, I’m reasonably confident. (You can see the check and the test in the revision that added this code. It may be gone in the latest version, of course – I’m still on the fence about all of this.)

There’s a second issue with [ThreadStatic] which is its level of support in the PCL. We just recently updated our PCL profiles, and I think using [ThreadStatic] would have been a problem if we’d continued to support Windows Phone 7, as that didn’t include [ThreadStatic] as far as I can tell. We’re fine in the current profile, but this is the first time I’ve knowing added code which prevents us from going back to a WP7-enabled profile.

Option 4: ThreadLocal

ThreadLocal is a more recent solution to the thread-local-storage problem. It’s generally a cleaner solution:

  • The variable doesn’t have to be static
  • You can specify an initialization delegate to be executed on first use in any particular thread
  • It’s clear from the type itself what you’re doing, rather than using an attribute which makes a variable act completely differently from normal

That means we can have one StringBuilder per formatter per thread (if we want; we could still make it a static variable if we wanted exactly one StringBuilder for each thread), and simplify the code within the Format method:

private ThreadLocal<StringBuilder> cachedBuilder =
    new ThreadLocal<StringBuilder>(() => new StringBuilder());

public string Format(TResult value)
{
    StringBuilder builder = cachedBuilder.Value;
    builder.Length = 0;
    // This will call all the actions in the multicast delegate.
    formatActions(value, builder);
    return builder.ToString();
}

However, I haven’t tested this code, as ThreadLocal required .NET 4.0 – whereas Noda Time targets .NET 3.5.

Conclusion

You may have noticed that I haven’t provided any performance figures in this post. That’s deliberate, because the results I’ve seen have had too much variance so far. The basic upshot of this is that I don’t yet have enough evidence of the benefit to justify the complexity – it’s just one StringBuilder per format operation, after all, and there’s plenty of other work going on at the same time. I believe there is a benefit, but it’s pretty marginal – and the level of benefit depends on which of the above options is taken.

I’m intrigued by the first option though, and whether experts would consider it to be safe. I still want to see a really good specification for the .NET memory model (which I know to be stronger than the one in the ECMA CLI specification) – I know it’s something that Microsoft has wanted to get round to writing up, but I don’t know if it’s actually happened… I fully expect to see contradictory opinions in comments :)

Follow

Get every new post delivered to your Inbox.

Join 166 other followers