Fun with Object and Collection Initializers

Gosh it feels like a long time since I’ve blogged – particularly since I’ve blogged anything really C#-language-related.

At some point I want to blog about my two CodeMash 2013 sessions (making the C# compiler/team cry, and learning lessons about API design from the Spice Girls) but those will take significant time – so here’s a quick post about object and collection initializers instead. Two interesting little oddities…

Is it an object initializer? Is it a collection initializer? No, it’s a syntax error!

The first part came out of a real life situation – FakeDateTimeZoneSource, if you want to look at the complete context.

Basically, I have a class designed to help test time zone-sensitive code. As ever, I like to create immutable objects, so I have a builder class. That builder class has various properties which we’d like to be able to set, and we’d also like to be able to provide it with the time zones it supports, as simply as possible. For the zones-only use case (where the other properties can just be defaulted) I want to support code like this:

var source = new FakeDateTimeZoneSource.Builder
{
    CreateZone("x"),
    CreateZone("y"),
    CreateZone("a"),
    CreateZone("b")
}.Build();

(CreateZone is just a method to create an arbitrary time zone with the given name.)

To achieve this, I made the Builder implement IEnumerable<DateTimeZone>, and created an Add method. (In this case the IEnumerable<> implementation actually works; in another case I’ve used explicit interface implementation and made the GetEnumerator() method throw NotSupportedException, as it’s really not meant to be called in either case.)
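To illustrate the shape the compiler requires, here's a minimal self-contained sketch – hypothetical names, with strings standing in for DateTimeZone:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Minimal sketch: a public Add method plus an IEnumerable implementation
// is all the compiler needs to permit collection initializer syntax.
public sealed class ZoneBuilder : IEnumerable<string>
{
    private readonly List<string> zones = new List<string>();

    // Each element in the collection initializer becomes a call to Add.
    public void Add(string zoneId)
    {
        zones.Add(zoneId);
    }

    public List<string> Build()
    {
        return new List<string>(zones);
    }

    public IEnumerator<string> GetEnumerator()
    {
        return zones.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

With that in place, new ZoneBuilder { "x", "y" } compiles down to two Add calls.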

So far, so good. The collection initializer worked perfectly as normal. But what about when we want to set some other properties? Without any time zones, that’s fine:

var source = new FakeDateTimeZoneSource.Builder
{
    VersionId = "foo"
}.Build();

But how could we set VersionId and add some zones? This doesn’t work:

var invalid = new FakeDateTimeZoneSource.Builder
{
    VersionId = "foo",
    CreateZone("x"),
    CreateZone("y")
}.Build();

That’s neither a valid object initializer (the second part doesn’t specify a field or property) nor a valid collection initializer (the first part does set a property).

In the end, I had to expose an IList<DateTimeZone> property:

var valid = new FakeDateTimeZoneSource.Builder
{
    VersionId = "foo",
    Zones = { CreateZone("x"), CreateZone("y") }
}.Build();
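Behind the scenes, the nested braces just turn into Add calls on whatever the Zones property returns – the property doesn't even need a setter. A minimal sketch (hypothetical names again, strings instead of DateTimeZone):

```csharp
using System;
using System.Collections.Generic;

public sealed class Builder
{
    private readonly List<string> zones = new List<string>();

    public string VersionId { get; set; }

    // A get-only property works fine here: the compiler calls
    // Zones.Add(...) for each element in the nested initializer,
    // rather than ever assigning to Zones itself.
    public IList<string> Zones
    {
        get { return zones; }
    }
}
```

So new Builder { VersionId = "foo", Zones = { "x", "y" } } sets the property once and then calls Zones.Add twice.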

An alternative would have been to expose a property of type Builder which just returned itself – the same code would have been valid, but it would have been distinctly odd, and it would have allowed some really spurious code.

I’m happy with the result in terms of the flexibility for clients – but the class design feels a bit messy, and I wouldn’t have wanted to expose this for the "production" assembly of Noda Time.

Describing all of this to a colleague gave rise to the following rather sillier observation…

Is it an object initializer? Is it a collection initializer? (Parenthetically speaking…)

In a lot of C# code, an assignment expression is just a normal expression. That means there’s potentially room for ambiguity, in exactly the same kind of situation as above – when sometimes we want a collection initializer, and sometimes we want an object initializer. Consider this sample class:

using System;
using System.Collections;

class Weird : IEnumerable
{
    public string Foo { get; set; }
    
    private int count;
    public int Count { get { return count; } }
        
    public void Add(string x)
    {
        count++;
    }
            
    IEnumerator IEnumerable.GetEnumerator()
    {
        throw new NotSupportedException();
    }    
}

As you can see, it doesn’t actually remember anything passed to the Add method, but it does remember how many times we’ve called it.

Now let’s try using Weird in two ways which only differ in terms of parentheses. First up, no parentheses:

string Foo = "x";
Weird weird = new Weird { Foo = "y" };
    
Console.WriteLine(Foo);         // x
Console.WriteLine(weird.Foo);   // y
Console.WriteLine(weird.Count); // 0

Okay, so it’s odd having a local variable called Foo, but we’re basically fine. This is an object initializer, and it’s setting the Foo property within the new Weird instance. Now let’s add a pair of parentheses:

string Foo = "x";
Weird weird = new Weird { (Foo = "y") };
    
Console.WriteLine(Foo);         // y
Console.WriteLine(weird.Foo);   // Nothing (null)
Console.WriteLine(weird.Count); // 1

Just adding those parentheses turns the object initializer into a collection initializer, whose sole item is the result of the assignment operator – which is the value which has now been assigned to Foo.

Needless to say, I don’t recommend using this approach in real code…

Stack Overflow question checklist

Note: this post is now available with a tinyurl of http://tinyurl.com/stack-checklist

My earlier post on how to write a good question is pretty long, and I suspect that even when I refer people to it, often they don’t bother reading it. So here’s a short list of questions to check after you’ve written a question (and to think about before you write the question):

  • Have you done some research before asking the question? 1
  • Have you explained what you’ve already tried to solve your problem?
  • Have you specified which language and platform you’re using, including version number where relevant?
  • If your question includes code, have you written it as a short but complete program? 2
  • If your question includes code, have you checked that it’s correctly formatted? 3
  • If your code doesn’t compile, have you included the exact compiler error?
  • If your question doesn’t include code, are you sure it shouldn’t?
  • If your program throws an exception, have you included the exception, with both the message and the stack trace?
  • If your program produces different results to what you expected, have you stated what you expected, why you expected it, and the actual results?
  • If your question is related to anything locale-specific (languages, time zones), have you stated the relevant information about your system (e.g. your current time zone)?
  • Have you checked that your question looks reasonable in terms of formatting?
  • Have you checked the spelling and grammar to the best of your ability? 4
  • Have you read the whole question to yourself carefully, to make sure it makes sense and contains enough information for someone coming to it without any of the context that you already know?

    If the answer to any of these questions is “no” you should take the time to fix up your question before posting. I realize this may seem like a lot of effort, but it will help you to get a useful answer as quickly as possible. Don’t forget that you’re basically asking other people to help you out of the goodness of their heart – it’s up to you to do all you can to make that as simple as possible.


    1 If you went from “something’s not working” to “asking a question” in less than 10 minutes, you probably haven’t done enough research.

    2 Ideally anyone answering the question should be able to copy your code, paste it into a text editor, compile it, run it, and observe the problem. Console applications are good for this – unless your question is directly about a user interface aspect, prefer to write a short console app. Remove anything not directly related to your question, but keep it complete enough to run.

    3 Try to avoid code which makes users scroll horizontally. You may well need to change how you split lines from how you have it in your IDE. Take the time to make it as clear as possible for those trying to help you.

    4 I realize that English isn’t the first language for many Stack Overflow users. We’re not looking for perfection – just some effort. If you know your English isn’t good, see if a colleague or friend can help you with your question before you post it.

    Noda Time v1.0 released

    Go get Noda Time 1.0!

    Today is the end of the longest release cycle I’ve been personally involved in. On November 5th 2009, I announced my intention to write a port of Joda Time for .NET. The next day, Noda Time was born – with a lofty (foolhardy) set of targets.

    Near the end of a talk *about* Noda Time this evening, I released Noda Time 1.0.0.

    It’s taken three years, but I’m immensely proud of what we’ve managed to achieve. We’re far from "done" but I believe we’re already significantly ahead of most other date/time APIs I’ve seen in terms of providing a clean API which reduces *incidental* complexity while highlighting the *inherent* complexity of the domain. (This is a theme I’m becoming dogmatic about on various fronts.)

    There’s more to do – I can’t see myself considering Noda Time to be "done" any time soon – but hopefully now we’ve got a stable release, we can start to build user momentum.

    One point I raised at the DotNetDevNet presentation tonight was that there’s a definite benefit (in my very biased view) in just *looking into* Noda Time:

    • If you can’t use it in your production code, use it when prototyping
    • If you can’t use it in your prototype code, play with it in personal projects
    • If you can’t use it in personal projects, read the user guide to understand the concepts

    I hope that simply looking at the various types that Noda Time provides will give you more insight into how you should be thinking about date and time handling in your code. While the BCL API has a lot of flaws, you can work around most of them if you make it crystal clear what your data means at every step. The type system will leave that largely ambiguous, but there’s nothing to stop you from naming your variables descriptively, and adding appropriate comments.

    Of course, I would far prefer it if you’d start using Noda Time and raising issues on how to make it better. Spread the word.

    Oh, and if anyone from the BCL team is reading this and would like to include something like Noda Time into .NET 5 as a "next generation" date/time, I’d be *really* interested in talking to you :)

    How can I enumerate thee? Let me count the ways…

    This weekend, I was writing some demo code for the async chapter of C# in Depth – the idea was to decompile a simple asynchronous method and see what happened. I received quite a surprise during this, in a way which had nothing to do with asynchrony.

    Given that at execution time, text refers to an instance of System.String, and assuming nothing in the body of the loop captures the ch variable, how would you expect the following loop to be compiled?

    foreach (char ch in text)
    {
        // Body here
    }

    Before today, I could think of four answers depending on the compile-time type of text, assuming it compiled at all. One of those answers applies when text is declared as dynamic, which I’m not going to go into here. Let’s stick with static typing.

    If text is declared as IEnumerable

    In this case, the compiler can only use the non-generic IEnumerator interface, and I’d expect the code to be roughly equivalent to this:

    IEnumerator iterator = text.GetEnumerator();
    try
    {
        while (iterator.MoveNext())
        {
            char ch = (char) iterator.Current;
            // Body here
        }
    }
    finally
    {
        IDisposable disposable = iterator as IDisposable;
        if (disposable != null)
        { 
            disposable.Dispose();
        }
    }

    Note how the disposal of the iterator has to be conditional, as IEnumerator doesn’t extend IDisposable.

    If text is declared as IEnumerable<char>

    Here, we don’t need any execution time casting, and the disposal can be unconditional:

    IEnumerator<char> iterator = text.GetEnumerator();
    try
    {
        while (iterator.MoveNext())
        {
            char ch = iterator.Current;
            // Body here
        }
    }
    finally
    {
        iterator.Dispose();
    }

    If text is declared as string

    Now things get interesting. System.String implements IEnumerable<char> using explicit interface implementation, and exposes a separate public GetEnumerator() method which is declared to return a CharEnumerator.

    Usually when I find a type doing this sort of thing, it’s for the sake of efficiency, to reduce heap allocations. For example, List<T>.GetEnumerator returns a List<T>.Enumerator, which is a struct with the appropriate iteration members. This means if you use foreach over an expression of type List<T>, the iterator can stay on the stack in most cases, saving object allocation and garbage collection.

    In this case, however, I suspect CharEnumerator was introduced (way back in .NET 1.0) to avoid having to box each character in the string. This was one reason for foreach handling to be based on types obeying the enumerable pattern, as well as there being support through the normal interfaces. It strikes me that it could still have been a structure in the same way as for List<T>, but maybe that wasn’t considered as an option.

    Anyway, it means that I would have expected the code to be compiled like this, even back to C# 1:

    CharEnumerator iterator = text.GetEnumerator();
    try
    {
        while (iterator.MoveNext())
        {
            char ch = iterator.Current;
            // Body here
        }
    }
    finally
    {
        iterator.Dispose();
    }

    What really happens when text is declared as string

    (This is the bit that surprised me.)

    So far, I’ve been assuming that the C# compiler doesn’t have any special knowledge about strings, when it comes to iteration. I knew it did for arrays, but that’s all. The actual result – under the C# 5 compiler, at least – is to use the Length property and the indexer directly:

    int index = 0;
    while (index < text.Length)
    {
        char ch = text[index];
        index++;
        // Body here
    }

    There’s no heap allocation, and no Dispose call. If the variable in question can change its value within the loop (e.g. if it’s a field, or a captured variable, or there’s an assignment to it within the body) then a copy is made of the variable value (just a reference, of course) first, so that all member access is performed on the same object.
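So if text were a field, say, the generated code would be roughly equivalent to this (the variable names are mine, and I've used a stand-in value so the snippet is complete):

```csharp
string text = "abc"; // stand-in for the field being iterated over
int count = 0;

// Snapshot into a local first, so every Length and indexer access
// within the loop uses the same string instance, even if the field
// is reassigned while the loop is running.
string copy = text;
int index = 0;
while (index < copy.Length)
{
    char ch = copy[index];
    index++;
    count++; // Body here
}
```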

    Conclusion

    So, there we go. There’s nothing particularly mind-blowing here – certainly nothing to affect your programming style, unless you were deliberately avoiding using foreach on strings "because it’s slow." It’s still a good lesson in not assuming you know what the compiler is going to do though… so long as the results are as expected, I’m very happy for them to put extra smarts in, even if it does mean having to change my C# in Depth sample code a bit…

    Stack Overflow and personal emails

    This post is partly meant to be a general announcement, and partly meant to be something I can point people at in the future (rather than writing a short version of this on each email).

    These days, I get at least a few emails practically every day along the lines of:

    “I saw you on Stack Overflow, and would like you to answer this development question for me…”

    It’s clear that the author:

    • Is aware of Stack Overflow
    • Is aware that Stack Overflow is a site for development Q&A
    • Is aware that I answer questions on Stack Overflow

    … and yet they believe that the right way of getting me to answer a question is by emailing it to me directly. Sometimes it’s a link to a Stack Overflow question, sometimes it’s the question asked directly in email.

    In the early days of Stack Overflow, this wasn’t too bad. I’d get maybe one email like this a week. Nowadays, it’s simply too much.

    If you have a question worthy of Stack Overflow, ask it on Stack Overflow. If you’ve been banned from asking questions due to asking too many low-quality ones before, then I’m unlikely to enjoy answering your questions by email – learn what makes a good question instead, and edit your existing questions.

    If you’ve already asked the question on Stack Overflow, you should consider why you think it’s more worthy of my attention than everyone else’s questions. You should also consider what would happen if everyone who would like me to answer a question decided to email me.

    Of course in some cases it’s appropriate. If you’ve already asked a question, written it as well as you can, waited a while to see if you get any answers naturally, and if it’s in an area that you know I’m particularly experienced in (read: the C# language, basically) then that’s fine. If your question is about something from C# in Depth – a snippet which doesn’t work or some text you don’t understand, for example – then it’s entirely appropriate to mail me directly.

    Basically, ask yourself whether you think I will actually welcome the email. Is it about something you know I’m specifically interested in? Or are you just trying to get more attention to a question, somewhat like jumping a queue?

    I’m aware that it’s possible this post makes me look either like a grumpy curmudgeon or (worse) like an egocentric pseudo-celebrity. The truth is I’m just like everyone else, with very little time on my hands – time I’d like to spend as usefully and fairly as possible.

    The future of “C# in Depth”

    I’m getting fairly frequent questions – mostly on Twitter – about whether there’s going to be a third edition of C# in Depth. I figure it’s worth answering it once in some detail rather than repeatedly in 140 characters ;)

    I’m currently writing a couple of new chapters covering the new features in C# 5 – primarily async, of course. The current "plan" is that these will be added to the existing 2nd edition to create a 3rd edition. There will be minimal changes to the existing text of the 2nd edition – basically going over the errata and editing a few places which ought to mention C# 5 early. (In particular the changes to how foreach loop variables are captured.)

    So there will definitely be new chapters. I’m hoping there’ll be a full new print (and ebook of course) edition, but no contracts have been signed yet. I’m hoping that the new chapters will be provided free electronically to anyone who’s already got the ebook of the 2nd edition – but we’ll see. Oh, and I don’t have any timelines at the moment. Work is more demanding than it was when I was writing the first and second editions, but obviously I’ll try to get the job done at a reasonable pace. (Writing about async in a way which is both accessible and accurate is really tricky, by the way.)

    Of course when I’ve finished those, I’ve got two other C# books I want to be writing… when I’m not working on Noda Time, Tekpub screencasts etc…

    Update

    I had a question on Twitter around the "two other C# books". I don’t want to go into too many details – partly because they’re very likely to change – but my intention is to write "C# from Scratch" and "C# in Style". The first would be for complete beginners; the second wouldn’t go into "how things work" so much as "how to use the language most effectively." (Yes, competition for Effective C#.) One possibility is that both would be donationware, at least in ebook form, ideally with community involvement in terms of public comments.

    I’m hoping that both will use the same codebase as an extended example, where "From Scratch" would explain what the code does, and "In Style" would explain why I chose that approach. Oh, and "From Scratch" would use unit testing as a teaching tool wherever possible, attempting to convey the idea that it’s something every self-respecting dev does :)

    The perils of conditional mutability

    This morning I was wrestling with trying to make some Noda Time unit tests faster. For some reason, the continuous integration host we’re using is really slow at loading resources under .NET 4. The unit tests which run in 10 seconds on my home laptop take over three hours on the continuous integration system. Taking stack traces at regular intervals showed the problem was with the NodaFormatInfo constructor, which reads some resources.

    I may look into streamlining the resource access later, but before we get to that point, I wanted to try to reduce the number of times we call that constructor in the first place. NodaFormatInfo is meant to be cached, so I wouldn’t have expected thousands of instances to be created – but it’s only cached when the System.Globalization.CultureInfo it’s based on is read-only. This is where the problems start…

    CultureInfo is conditionally mutable (not an official term, just one I’ve coined for the purposes of this post). You can ask whether or not it’s read-only with the IsReadOnly property, and obviously if it’s read-only you can’t change it. Additionally, CultureInfo is composed of other conditionally mutable objects – DateTimeFormatInfo, NumberFormatInfo etc. There’s a static ReadOnly method on CultureInfo to create a read-only wrapper around a mutable CultureInfo. It’s not clearly documented whether that’s expected to take a deep copy (so that callers can really rely on it not changing) or whether it’s expected to reflect any further changes made to the culture info it’s based on. To go in the other direction, you can call Clone on a CultureInfo to create a mutable copy of any existing culture.

    Further complications are introduced by the properties on the composite objects – we have properties such as DateTimeFormatInfo.MonthNames which returns a string array. Remember, arrays are always mutable. So it’s really important to know whether the array reference returned from the property refers to a copy of the underlying data, or whether it refers to the array that’s used internally by the type. Obviously for read-only DateTimeFormatInfo objects, I’d expect a copy to be returned – but for a mutable DateTimeFormatInfo, it would potentially make sense to return the underlying array reference. Unfortunately, the documentation doesn’t make this clear – but in practice, it always returns a copy. If you need to change the month names, you need to clone the array, mutate the clone, and then set the MonthNames property.

    All of this makes CultureInfo hard to work with. The caching decision earlier on only really works if a "read-only" culture genuinely won’t change behind the scenes. The type system gives you no help to catch subtle bugs at compile-time. Making any of this robust but efficient (in terms of taking minimal copies) is tricky to say the least.

    Not only does it make it hard to work with from a client’s point of view, but apparently it’s hard to implement correctly too…

    First bug: Mono’s invariant culture isn’t terribly invariant…

    (Broken in 2.10.8; apparently fixed later.)

    I discovered this while getting Noda Time’s unit tests to pass on Mono. Unfortunately there are some I’ve had to effectively disable at the moment, due to deficiencies in Mono (some of which are being fixed, of course).

    Here’s a program which builds a clone of the invariant culture, changes the clone’s genitive month names, and then prints out the first non-genitive name from the plain invariant culture:

    using System;
    using System.Globalization;

    class Test
    {
        static void Main()
        {        
            CultureInfo clone = (CultureInfo) CultureInfo.InvariantCulture.Clone();
            // Note: not even deliberately changing MonthNames for this culture!
            clone.DateTimeFormat.MonthGenitiveNames[0] = "Changed";
            
            // Prints Changed
            Console.WriteLine(CultureInfo.InvariantCulture.DateTimeFormat.MonthNames[0]);
        }
    }

    I believe this bug is really due to the lack of support for genitive month names in Mono at the moment – the MonthGenitiveNames property always just returns a reference to the month names for the invariant culture – without taking a copy first. (It always returns the invariant culture’s month names, even if you’re using a different culture entirely.) The code above shows an "innocent" attempt to change a mutable clone – but in reality we could have used any culture (including an immutable one) to make the change.

    Note that in the .NET implementation, the change would only have been made to a copy of the underlying data, so even the clone wouldn’t have reflected the change anywhere.

    Second bug: ReadOnly losing changes

    The second bug is the one I found this morning. It looks like it’s fixed in .NET 4, but it’s present in .NET 3.5, which is where it bit me this morning. When you try to make a read-only wrapper around a mutated culture, some of the properties are preserved… but some aren’t:

    using System;
    using System.Globalization;

    class Test
    {
        static void Main()
        {
            CultureInfo clone = (CultureInfo) CultureInfo.InvariantCulture.Clone();
            clone.DateTimeFormat.AMDesignator = "ChangedAm";

            // The array is recreated on each call to MonthNames, so changing the
            // value within the array itself doesn’t work :(
            string[] months = (string[]) clone.DateTimeFormat.MonthNames;
            months[0] = "ChangedMonth";
            clone.DateTimeFormat.MonthNames = months;
            
            CultureInfo readOnlyCopy = CultureInfo.ReadOnly(clone);
            Console.WriteLine(clone.DateTimeFormat.AMDesignator); // ChangedAm
            Console.WriteLine(clone.DateTimeFormat.MonthNames[0]); // ChangedMonth
                    
            Console.WriteLine(readOnlyCopy.DateTimeFormat.AMDesignator); // ChangedAm
            Console.WriteLine(readOnlyCopy.DateTimeFormat.MonthNames[0]); // January (!)
        }
    }

    I don’t know what’s going on here. In the end I just left the test code using the mutable clone, having added a comment explaining why it wasn’t creating a read-only wrapper.

    I’ve experimented with a few different options here, including setting the MonthNames property on the clone after creating the wrapper. No joy – I simply can’t make the new month names stick in the read-only copy. <sigh>

    Conclusion

    I’ve been frustrated by the approach we’ve taken to cultures in Noda Time for a while. I haven’t worked out exactly what we should do about it yet, so it’s likely to stay somewhat annoying for v1, but I may well revisit it significantly for v2. Unfortunately, there’s nothing I can do about CultureInfo itself.

    What I would have preferred in all of this is the builder pattern: make CultureInfo, DateTimeFormatInfo etc all immutable, but give each of them mutable builder types, with the ability to create a mutable builder based on an existing immutable instance, and obviously to create a new immutable instance from a builder. That would make all kinds of things simpler – including caching.
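    A sketch of what I mean – entirely hypothetical types, showing just a single property:

```csharp
// Hypothetical sketch: an immutable format type with a nested mutable
// builder. The immutable instances can be cached without any worries,
// and all mutation goes through the builder.
public sealed class FormatInfo
{
    private readonly string amDesignator;

    private FormatInfo(string amDesignator)
    {
        this.amDesignator = amDesignator;
    }

    public string AmDesignator
    {
        get { return amDesignator; }
    }

    // Create a mutable builder seeded from this immutable instance.
    public Builder ToBuilder()
    {
        return new Builder { AmDesignator = amDesignator };
    }

    public sealed class Builder
    {
        public Builder()
        {
            AmDesignator = "AM";
        }

        public string AmDesignator { get; set; }

        // Create a new immutable instance from the builder's current state.
        public FormatInfo Build()
        {
            return new FormatInfo(AmDesignator);
        }
    }
}
```

The immutable-to-builder-to-immutable round trip is what makes caching trivially safe: a cached FormatInfo simply cannot change behind your back.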

    For the moment though, I hope we can all learn lessons from this – or have old lessons reinforced, at least:

    • Making a single type behave in different ways based on different "modes" makes it hard to use correctly. (Yes, this is the same first conclusion as with DateTime in the previous post. Funny, that.)
    • Immutability has to be deep to be meaningful: it’s not much use having a supposedly read-only object which composes a StringBuilder…
    • Arrays should be considered somewhat harmful. If you’re going to return an array from a method, make sure you document whether this is a copy of the underlying data, or a "live" reference. (The latter should be very rare, particularly for a public API.) The exception here is if you return an empty array – that’s effectively immutable, so you can always return it with no problems.
    • The builder pattern rocks – use it!

    In my next post I’ll try to get back to the TimeZoneInfo oddities I mentioned last time.

    More fun with DateTime

    (Note that this is deliberately not posted in the Noda Time blog. I reckon it’s of wider interest from a design perspective, and I won’t be posting any of the equivalent Noda Time code. I’ll just say now that we don’t have this sort of craziness in Noda Time, and leave it at that…)

    A few weeks ago, I was answering a Stack Overflow question when I noticed an operation around dates and times which should have been losing information apparently not doing so. I investigated further, and discovered some "interesting" aspects of both DateTime and TimeZoneInfo. In an effort to keep this post down to a readable length (at least for most readers; certain WebDriver developers who shall remain nameless have probably given up by now already) I’ll save the TimeZoneInfo bits for another post.

    Background: daylight saving transitions and ambiguous times

    There’s one piece of inherent date/time complexity you’ll need to understand for this post to make sense: sometimes, a local date/time occurs twice. For the purposes of this post, I’m going to assume you’re in the UK time zone. On October 28th 2012, at 2am local time (1am UTC), UK clocks will go back to 1am local time. So 1:20am local time occurs twice – once at 12:20am UTC (in daylight saving time, BST), and once at 1:20am UTC (in standard time, GMT).

    If you want to run any of the code in this post and you’re not in the UK, please adjust the dates and times used to a similar ambiguity for when your clocks go back. If you happen to be in a time zone which doesn’t observe daylight savings, I’m afraid you’ll have to adjust your system time zone in order to see the effect for yourself.

    DateTime.Kind and conversions

    As you may already know, as of .NET 2.0, DateTime has a Kind property, of type DateTimeKind – an enum with the following values:

    • Local: The DateTime is considered to be in the system time zone. Not an arbitrary "local time in some time zone", but in the specific current system time zone.
    • Utc: The DateTime is considered to be in UTC (corollary: it always unambiguously represents an instant in time)
    • Unspecified: This means different things in different contexts, but it’s a sort of "don’t know" kind; this is closer to "local time in some time zone" which is represented as LocalDateTime in Noda Time.

    DateTime provides three methods to convert between the kinds:

    • ToUniversalTime: if the original kind is Local or Unspecified, convert it from local time to universal time in the system time zone. If the original kind is Utc, this is a no-op.
    • ToLocalTime: if the original kind is Utc or Unspecified, convert it from UTC to local time. If the original kind is Local, this is a no-op.
    • SpecifyKind: keep the existing date/time, but just change the kind. (So 7am stays as 7am, but it changes the meaning of that 7am effectively.)

    (Prior to .NET 2.0, ToUniversalTime and ToLocalTime were already present, but always assumed the original value needed conversion – so if you called x.ToLocalTime().ToLocalTime().ToLocalTime() the result would probably end up with the appropriate offset from UTC being applied three times!)

    Of course, none of these methods change the existing value – DateTime is immutable, and a value type – instead, they return a new value.
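    The difference between SpecifyKind and the conversion methods is easy to demonstrate (the exact result of ToUniversalTime depends on your system time zone, of course, so I've only shown the kind for that one):

```csharp
using System;

class KindDemo
{
    static void Main()
    {
        var sevenAm = new DateTime(2012, 10, 1, 7, 0, 0, DateTimeKind.Unspecified);

        // SpecifyKind: still 7am, but now labelled as UTC - no conversion.
        DateTime relabelled = DateTime.SpecifyKind(sevenAm, DateTimeKind.Utc);
        Console.WriteLine(relabelled.Hour); // 7
        Console.WriteLine(relabelled.Kind); // Utc

        // ToUniversalTime: a genuine conversion, treating the Unspecified
        // value as system-local time; the hour may well change.
        DateTime converted = sevenAm.ToUniversalTime();
        Console.WriteLine(converted.Kind); // Utc
    }
}
```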

    DateTime’s Deep Dark Secret

    (The code in this section is presented in several chunks, but it forms a single complete piece of code – later chunks refer to variables in earlier chunks. Put it all together in a Main method to run it.)

    Armed with the information in the previous sections, we should be able to make DateTime lose data. If we start with 12:20am UTC and 1:20am UTC on October 28th as DateTimes with a kind of Utc, when we convert them to local time (on a system in the UK time zone) we should get 1:20am in both cases due to the daylight saving transition. Indeed, that works:

    // Start with different UTC values around a DST transition
    var original1 = new DateTime(2012, 10, 28, 0, 20, 0, DateTimeKind.Utc);
    var original2 = new DateTime(2012, 10, 28, 1, 20, 0, DateTimeKind.Utc);

    // Convert to local time
    var local1 = original1.ToLocalTime();
    var local2 = original2.ToLocalTime();

    // Result is the same for both values. Information loss?
    var expected = new DateTime(2012, 10, 28, 1, 20, 0, DateTimeKind.Local);
    Console.WriteLine(local1 == expected); // True
    Console.WriteLine(local2 == expected); // True
    Console.WriteLine(local1 == local2);   // True

    If we’ve started with two different values, applied the same operation to both, and ended up with equal values, then we must have lost information, right? That doesn’t mean that operation is "bad" any more than "dividing by 2" is bad. You ought to be aware of that information loss, that’s all.

    So, we ought to be able to demonstrate that information loss further by converting back from local time to universal time. Here we have the opposite problem: from our local time of 1:20am, we have two valid universal times we could convert to – either 12:20am UTC or 1:20am UTC. Both answers would be correct – they are universal times at which the local time would be 1:20am. So which one will get picked? Well… here’s the surprising bit:

    // Convert back to UTC
    var roundTrip1 = local1.ToUniversalTime(); 
    var roundTrip2 = local2.ToUniversalTime();

    // Values round-trip correctly! Information has been recovered…
    Console.WriteLine(roundTrip1 == original1);  // True
    Console.WriteLine(roundTrip2 == original2);  // True
    Console.WriteLine(roundTrip1 == roundTrip2); // False

    Somehow, each of the local values knows which universal value it came from. The information has been recovered, so the reverse conversion round-trips each value back to its original one. How is that possible?

    It turns out that DateTime actually has four potential kinds: Local, Utc, Unspecified, and "local but treat it as the earlier option when resolving ambiguity". A DateTime is really just a 64-bit number of ticks, but because the range of DateTime is only January 1st 0001 to December 31st 9999, the tick count only needs 62 bits, leaving 2 bits "spare" to represent the kind. 2 bits gives 4 possible values… the three documented ones and the shadowy extra one.

    Through experimentation, I’ve discovered that the kind is preserved if you perform arithmetic on the value, too… so if you go to another "fall back" DST transition such as October 30th 2011, the ambiguity resolution works the same way as before:

    var local3 = local1.AddYears(-1).AddDays(2); 
    var local4 = local2.AddYears(-1).AddDays(2);        
    Console.WriteLine(local3.ToUniversalTime().Hour); // 0
    Console.WriteLine(local4.ToUniversalTime().Hour); // 1

    If you use DateTime.SpecifyKind with DateTimeKind.Local, however, it goes back to the "normal" kind, even though it looks like it should be a no-op:

    // Should be a no-op?
    var local5 = DateTime.SpecifyKind(local1, local1.Kind); 
    Console.WriteLine(local5.ToUniversalTime().Hour); // 1

    Is this correct behaviour? Or should it be a no-op, just like calling ToLocalTime on a "local" DateTime is? (Yes, I’ve checked – that doesn’t lose the information.) It’s hard to say, really, as this whole business appears to be undocumented… at least, I haven’t seen anything in MSDN about it. (Please add a link in the comments if you find something. The behaviour actually goes against what’s documented, as far as I can tell.)

    I haven’t looked into whether various forms of serialization preserve values like this faithfully, by the way – but you’d have to work hard to reproduce it in non-framework code. You can’t explicitly construct a DateTime with the "extra" kind; the only ways I know of to create such a value are via a conversion to local time or through arithmetic on a value which already has the kind. (Admittedly if you’re serializing a DateTime with a Kind of Local, you’re already on potentially shaky ground, given that you could be deserializing it on a machine with a different system time zone.)

    Unkind comparisons

    I’ve misled you a little, I have to admit. In the code above, when I compared the "expected" value with the results of the first conversions, I deliberately specified DateTimeKind.Local in the constructor call. After all, that’s the kind we do expect. Well, yes – but I then printed the result of comparing this value with local1 and local2… and those comparisons would have been the same regardless of the kind I’d specified in the constructor.

    All comparisons between DateTime values ignore the Kind property – and that’s not restricted to equality. So for example, consider this comparison:

    // In June: Local time is UTC+1, so 8am UTC is 9am local
    var dt1 = new DateTime(2012, 6, 1, 8, 0, 0, DateTimeKind.Utc); 
    var dt2 = new DateTime(2012, 6, 1, 8, 30, 0, DateTimeKind.Local); 
    Console.WriteLine(dt1 < dt2); // True

    When viewed in terms of "what instants in time do these both represent?" the answer here is wrong – when you convert both values into the same time zone (in either direction), dt1 occurs after dt2. But a simple look at the properties tells a different story. In practice, I suspect that most comparisons between DateTime values of different kinds involve code which is at best sloppy and is quite possibly broken in a meaningful way.

    Of course, if you bring Kind=Unspecified into the picture, it becomes impossible to compare meaningfully in a kind-sensitive way. Is 12am UTC before or after 1am Unspecified? It depends what time zone you later use.

    To be clear, it is a hard-to-resolve issue, and one that we don’t do terribly well at in Noda Time at the moment for ZonedDateTime. (And even with just LocalDateTime you’ve got issues between calendars.) This is a situation where providing separate Comparer<T> implementations works nicely – so you can explicitly say what kind of comparison you want.
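    As a sketch of the kind of comparer I mean (InstantComparer is a hypothetical name, not a Noda Time type): convert both values to UTC before comparing, and reject Unspecified values outright, since they don’t represent an instant at all.

    ```csharp
    using System;
    using System.Collections.Generic;

    // A hypothetical kind-sensitive comparer: each value is treated as an
    // instant by converting it to UTC first. Unspecified values are rejected,
    // as there's no meaningful instant to compare.
    sealed class InstantComparer : Comparer<DateTime>
    {
        public override int Compare(DateTime x, DateTime y)
        {
            if (x.Kind == DateTimeKind.Unspecified || y.Kind == DateTimeKind.Unspecified)
            {
                throw new ArgumentException("Unspecified values don't represent instants");
            }
            return x.ToUniversalTime().CompareTo(y.ToUniversalTime());
        }
    }

    class Test
    {
        static void Main()
        {
            var comparer = new InstantComparer();
            var earlier = new DateTime(2012, 6, 1, 7, 0, 0, DateTimeKind.Utc);
            var later = new DateTime(2012, 6, 1, 8, 0, 0, DateTimeKind.Utc);
            Console.WriteLine(comparer.Compare(earlier, later) < 0); // True
        }
    }
    ```

    Code which genuinely wants the "just compare the properties" behaviour can keep using the default comparison; the point is that the choice becomes explicit.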

    Conclusions

    There’s more fun to be had with a similar situation when we look at TimeZoneInfo, but for now, a few lessons:

    • Giving a type different "modes" which make it mean fairly significantly different things is likely to cause headaches
    • Keeping one of those modes secret (and preventing users from even constructing a value in that mode directly) leads to even more fun and games
    • If two instances of your type are considered "equal" but behave differently, you should at least consider whether there’s something smelly going on
    • There’s always more fun to be had with DateTime…

    Type initializer circular dependencies

    To some readers, the title of this post may induce nightmarish recollections of late-night debugging sessions. To others it may be simply the epitome of jargon. Just to break the jargon down a bit:

    • Type initializer: the code executed to initialize the static variables of a class, and the static constructor
    • Circular dependency: two bits of code which depend on each other – in this case, two classes whose type initializers each require that the other class is initialized

    A quick example of the kind of problem I’m talking about would be helpful here. What would you expect this code to print?

    using System;

    class Test
    {    
        static void Main()
        {
            Console.WriteLine(First.Beta);
        }
    }

    class First
    {
        public static readonly int Alpha = 5;
        public static readonly int Beta = Second.Gamma;
    }

    class Second
    {
        public static readonly int Gamma = First.Alpha;
    }

    Of course, without even glancing at the specification, any expectations are pretty irrelevant. Here’s what the spec says (section 10.5.5.1 of the C# 4 version):

    The static field variable initializers of a class correspond to a sequence of assignments that are executed in the textual order in which they appear in the class declaration. If a static constructor (§10.12) exists in the class, execution of the static field initializers occurs immediately prior to executing that static constructor. Otherwise, the static field initializers are executed at an implementation-dependent time prior to the first use of a static field of that class.

    In addition to the language specification, the CLI specification gives more details about type initialization in the face of circular dependencies and multiple threads. I won’t post the details here, but the gist of it is:

    • Type initialization acts like a lock, to prevent more than one thread from initializing a type
    • If the CLI notices that type A needs to be initialized in order to make progress, but it’s already in the process of initializing type A in the same thread, it continues as if the type were already initialized.

    So here’s what you might expect to happen:

    1. Initialize Test: no further action required
    2. Start running Main
    3. Start initializing First (as we need First.Beta)
    4. Set First.Alpha to 5
    5. Start initializing Second (as we need Second.Gamma)
    6. Set Second.Gamma to First.Alpha (5)
    7. End initializing Second
    8. Set First.Beta to Second.Gamma (5)
    9. End initializing First
    10. Print 5

    Here’s what actually happens – on my box, running .NET 4.5 beta. (I know that type initialization changed for .NET 4, for example. I don’t know of any changes for .NET 4.5, but I’m not going to claim it’s impossible.)

    1. Initialize Test: no further action required
    2. Start running Main
    3. Start initializing First (as we need First.Beta)
    4. Start initializing Second (we will need Second.Gamma)
    5. Set Second.Gamma to First.Alpha (0)
    6. End initializing Second
    7. Set First.Alpha to 5
    8. Set First.Beta to Second.Gamma (0)
    9. End initializing First
    10. Print 0

    Step 5 is the interesting one here. We know that we need First to be initialized, in order to get First.Alpha, but this thread is already initializing First (we started in step 3) so we just access First.Alpha and hope that it’s got the right value. As it happens, that variable initializer hasn’t been executed yet. Oops.

    (One subtlety here is that I could have declared all these variables as constants instead, using "const" – which would have avoided all these problems.)
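    Indeed, the const version prints 5: constants are evaluated at compile time, in dependency order, so there’s no type initialization involved at all.

    ```csharp
    using System;

    class Test
    {
        static void Main()
        {
            Console.WriteLine(First.Beta); // 5
        }
    }

    class First
    {
        public const int Alpha = 5;
        // Evaluated at compile time - no type initializer, no ordering issues.
        public const int Beta = Second.Gamma;
    }

    class Second
    {
        public const int Gamma = First.Alpha;
    }
    ```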

    Back in the real world…

    Hopefully that example makes it clear why circular dependencies in type initializers are nasty. They’re hard to spot, hard to debug, and hard to test. Pretty much your classic Heisenbug, really. It’s important to note that if the program above had happened to initialize Second first (to access a different variable, for example) we could have ended up with a different result. In particular, it’s easy to get into a situation where running all your unit tests can cause a failure – but if you run just the failing test, it passes.

    One way of avoiding all of this is never to use any type initializers for anything, of course. In many cases that’s exactly the right solution – but often there are natural uses, particularly for well-known values such as Encoding.UTF8, TimeZoneInfo.Utc and the like. Note that in both of those cases they are static properties, but I would expect them to be backed by static fields. I’m somewhat ambivalent about using public static readonly fields versus public static get-only properties – but as we’ll see later, there’s a definite advantage to using properties.

    Noda Time has quite a few values like this – partly because so many of its types are immutable. It makes sense to create a single UTC time zone, a single ISO calendar system, a single "pattern" (text formatter/parser) for each of a variety of common cases. In addition to the publicly visible values, there are various static variables used internally, mostly for caching purposes. All of this definitely adds complexity – and makes it harder to test – but the performance benefits can be significant.

    Unfortunately, a lot of these values end up with fairly natural circular dependencies – as I discovered just recently, where adding a new static field caused all kinds of breakage. I was able to fix the immediate cause, but it left me concerned about the integrity of the code. I’d fixed the one failure I knew about – but what about any others?

    Testing type initialization

    One of the biggest issues with type initialization is the order-sensitivity – combined with the way that once a type has been initialized once, that’s it for that AppDomain. As I showed earlier, it’s possible that initializing types in one particular order causes a problem, but a different order won’t.

    I’ve decided that for Noda Time at least, I want to be reasonably sure that type initialization circularity isn’t going to bite me. So I want to validate that no type initializers form cycles, whatever order the types are initialized in. Logically if we can detect a cycle starting with one type, we ought to be able to detect it starting with any of the other types in that cycle – but I’m sufficiently concerned about weird corner cases that I’d rather just take a brute force approach.

    So, as a rough plan:

    • Start with an empty set of dependencies
    • For each type in the target assembly:
      • Create a new AppDomain
      • Load the target assembly into it
      • Force the type to be initialized
      • Take a stack trace at the start of each type initializer and record any dependencies
    • Look for cycles in the complete set of dependencies

    Note that we’ll never spot a cycle within any single AppDomain, due to the way that type initialization works. We have to put together the results for multiple initialization sequences to try to find a cycle.

    A description of the code would probably be harder to follow than the code itself, but the code is relatively long – I’ve included it at the end of this post to avoid interfering with the narrative flow. For more up-to-date versions in the future, look at the Noda Time repository.

    This isn’t a terribly nice solution, for various reasons:

    • Creating a new AppDomain and loading assemblies into it from a unit test runner isn’t as simple as it might be. My code doesn’t currently work with NCrunch; I don’t know how it finds its assemblies yet. When I’ve fixed that, I’m sure other test runners would still be broken. Likewise I’ve had to explicitly filter types to get TeamCity (the continuous integration system Noda Time uses) to work properly. Currently, you’d need to edit the test code to change the filters. (It’s possible that there are better ways of filtering, of course.)
    • It relies on each type within the production code which has an "interesting" type initializer to have a line like this:
      private static readonly int TypeInitializationChecking = NodaTime.Utility.TypeInitializationChecker.RecordInitializationStart();
    • Not only does the previous line need to be added to the production code – it clearly gets executed each time, and takes up heap space per type. It’s only 4 bytes for each type involved, and it does no real work when we’re not testing, but it’s a nuisance anyway. I could use preprocessor directives to remove the code from non-debug or non-test-targeted builds, but that would look even messier.
    • It only picks up cycles which occur when running the version of .NET the tests happen to execute on. Given that there are ordering changes for different versions, I wouldn’t like to claim this is 100% bullet-proof. Likewise if there are only cycles when you’re running in some specific cultures (or other environmental features), it won’t necessarily pick those up.
    • I’ve deliberately not tried to make the testing code thread-safe. That’s fine in Noda Time – I don’t have any asynchronous operations or new threads in Noda Time at all – but other code may need to make this more robust.

    So with all these caveats, is it still worth it? Absolutely: it’s already found bugs which have now been fixed.

    In fact, the test didn’t get as far as reporting cycles to start with – it turned out that if you initialized one particular type first, the type initializer would fail with a NullReferenceException. Ouch! Once I’d fixed that, there were still quite a few problems to fix. Somewhat coincidentally, fixing them improved the design too – although the user-visible API didn’t change at all.

    Fixing type initializer cycles

    In the past, I’ve occasionally "fixed" type initialization ordering problems by simply moving fields around. The cycles still existed, but I figured out how to make them harmless. I can say now that this approach does not scale, and is more effort than it’s worth. The code ends up being brittle, hard to think about, and once you’ve got more than a couple of types involved it’s really error-prone, at least for my brain. It’s much better to break the cycle completely. To this end, I’ve ended up using a fairly simple technique to defer initialization of static variables. It’s a poor-man’s Lazy<T>, to some extent – but I’d rather not have to write Lazy<T> myself, and we’re currently targeting .NET 3.5…

    Basically, instead of exposing a public static readonly field which creates the cycle, you expose a public static read-only property – which returns an internal static readonly field in a nested, private static class. We still get the nice thread-safe once-only initialization of a type initializer, but the nested type won’t be initialized until it needs to be. (In theory it could be initialized earlier, but a static constructor would ensure it isn’t.) So long as nothing within the rest of the type initializer for the containing class uses that property, we avoid the cycle.

    So instead of this:

    // Requires Bar to be initialized – if Bar also requires Foo to be
    // initialized, we have a problem…
    public static readonly Foo SimpleFoo = new Foo(Bar.Zero);

    We might have:

    public static Foo SimpleFoo { get { return Constants.SimpleFoo; } }

    private static class Constants
    {
        private static readonly int TypeInitializationChecking = NodaTime.Utility.TypeInitializationChecker.RecordInitializationStart(); 

        // This requires both Foo and Bar to be initialized, but that’s okay
        // so long as neither of them require Foo.Constants to be initialized.
        // (The unit test would spot that.)
        internal static readonly Foo SimpleFoo = new Foo(Bar.Zero);
    }

    I’m currently undecided about whether to include static constructors in these classes to ensure lazy initialization. If the type initializer for Foo triggered the initializer of Foo.Constants, we’d be back to square one… but adding static constructors into each of these nested classes sounds like a bit of a pain. The nested classes should call into the type initialization checking as well, to validate they don’t cause any problems themselves.
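    For reference, here’s what that static constructor variant might look like as a complete program – Foo here is just a stand-in class, so you can see exactly when it gets created:

    ```csharp
    using System;

    class Foo
    {
        public Foo() { Console.WriteLine("Foo created"); }
    }

    class Container
    {
        public static Foo SimpleFoo { get { return Constants.SimpleFoo; } }

        private static class Constants
        {
            // An explicit (even empty) static constructor removes the
            // beforefieldinit flag, so the CLR guarantees this type isn't
            // initialized until a static member is first used.
            static Constants() {}

            internal static readonly Foo SimpleFoo = new Foo();
        }
    }

    class Test
    {
        static void Main()
        {
            Console.WriteLine("Before first access");
            // "Foo created" is only printed here, on first access.
            Console.WriteLine(Container.SimpleFoo != null); // True
        }
    }
    ```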

    Conclusion

    I have to say, part of me really doesn’t like either the testing code or the workaround. Both smack of being clever, which is never a good thing. It’s definitely worth considering whether you could actually just get rid of the type initializer (or part of it) entirely, avoiding maintaining so much static state. It would be nice to be able to detect these type initializer cycles without running anything, simply using static analysis – I’m going to see whether NDepend could do that when I get a chance. The workaround doesn’t feel as neat as Lazy<T>, which is really what’s called for here – but I don’t trust myself to implement it correctly and efficiently.

    So while both are somewhat hacky, they’re better than the alternative: buggy code. That’s what I’m ashamed to say I had in Noda Time, and I don’t think I’d ever have spotted all the cycles by inspection. It’s worth a try on your own code – see whether you’ve got problems lurking…

    Appendix: Testing code

    As promised earlier, here’s the code for the production and test classes.

    TypeInitializationChecker

    This is in NodaTime.dll itself.

    internal sealed class TypeInitializationChecker : MarshalByRefObject
    {
        private static List<Dependency> dependencies = null;

        private static readonly MethodInfo EntryMethod = typeof(TypeInitializationChecker).GetMethod("FindDependencies");

        internal static int RecordInitializationStart()
        {
            if (dependencies == null)
            {
                return 0;
            }
            Type previousType = null;
            foreach (var frame in new StackTrace().GetFrames())
            {
                var method = frame.GetMethod();
                if (method == EntryMethod)
                {
                    break;
                }
                var declaringType = method.DeclaringType;
                if (method == declaringType.TypeInitializer)
                {
                    if (previousType != null)
                    {
                        dependencies.Add(new Dependency(declaringType, previousType));
                    }
                    previousType = declaringType;
                }
            }
            return 0;
        }

        /// <summary>
        /// Invoked from the unit tests, this finds the dependency chain for a single type
        /// by invoking its type initializer.
        /// </summary>
        public Dependency[] FindDependencies(string name)
        {
            dependencies = new List<Dependency>();
            Type type = typeof(TypeInitializationChecker).Assembly.GetType(name, true);
            RuntimeHelpers.RunClassConstructor(type.TypeHandle);
            return dependencies.ToArray();
        }

        /// <summary>
        /// A simple from/to tuple, which can be marshaled across AppDomains.
        /// </summary>
        internal sealed class Dependency : MarshalByRefObject
        {
            public string From { get; private set; }
            public string To { get; private set; }
            internal Dependency(Type from, Type to)
            {
                From = from.FullName;
                To = to.FullName;
            }
        }
    }

    TypeInitializationTest

    This is within NodaTime.Test:

    [TestFixture]
    public class TypeInitializationTest
    {
        [Test]
        public void BuildInitializerLoops()
        {
            Assembly assembly = typeof(TypeInitializationChecker).Assembly;
            var dependencies = new List<TypeInitializationChecker.Dependency>();
        // Test each type in a new AppDomain – we want to see what happens when each type is initialized first.
            // Note: Namespace prefix check is present to get this to survive in test runners which
            // inject extra types. (Seen with JetBrains.Profiler.Core.Instrumentation.DataOnStack.)
            foreach (var type in assembly.GetTypes().Where(t => t.FullName.StartsWith("NodaTime")))
            {
                // Note: this won’t be enough to load the assembly in all test runners. In particular, it fails in
                // NCrunch at the moment.
                AppDomainSetup setup = new AppDomainSetup { ApplicationBase = AppDomain.CurrentDomain.BaseDirectory };
                AppDomain domain = AppDomain.CreateDomain("InitializationTest" + type.Name, AppDomain.CurrentDomain.Evidence, setup);
                var helper = (TypeInitializationChecker)domain.CreateInstanceAndUnwrap(assembly.FullName,
                    typeof(TypeInitializationChecker).FullName);
                dependencies.AddRange(helper.FindDependencies(type.FullName));
            }
            var lookup = dependencies.ToLookup(d => d.From, d => d.To);
            // This is less efficient than it might be, but I’m aiming for simplicity: starting at each type
            // which has a dependency, can we make a cycle?
            // See Tarjan’s Algorithm in Wikipedia for ways this could be made more efficient.
            // http://en.wikipedia.org/wiki/Tarjan’s_strongly_connected_components_algorithm
            foreach (var group in lookup)
            {
                Stack<string> path = new Stack<string>();
                CheckForCycles(group.Key, path, lookup);
            }
        }

        private static void CheckForCycles(string next, Stack<string> path, ILookup<string, string> dependencyLookup)
        {
            if (path.Contains(next))
            {
                Assert.Fail("Type initializer cycle: {0}-{1}", string.Join("-", path.Reverse().ToArray()), next);
            }
            path.Push(next);
            foreach (var candidate in dependencyLookup[next].Distinct())
            {
                CheckForCycles(candidate, path, dependencyLookup);
            }
            path.Pop();
        }
    }

    Diagnosing weird problems – a Stack Overflow case study

    Earlier, I came across this Stack Overflow question. I solved it, tweeted it, but then thought it would serve as a useful case study into the mental processes I go through when trying to solve a problem – whether that’s on Stack Overflow, at work, or at home.

    It’s definitely worth reading the original question, but the executive summary is:

    When I compute the checksum/hash of c:\Windows\System32\Calc.exe using various tools and algorithms, those tools all give the same answer for each algorithm. When I try doing the same thing in Java, I get different results. What’s going on?

    Now to start with, I’d like to shower a bit of praise on the author:

    • The post came with a short but utterly complete program to demonstrate the problem
    • The comments in the program showed the expected values and the actual values
    • The code was mostly pretty clean (clean enough for an SO post anyway)

    In short, it had pretty much everything I ask for in a question. Yay! Additionally, the result seemed to be strange. The chances of any one of Java’s hashing algorithms being broken seem pretty slim, but four of them? Okay, now you’ve got me interested.

    Reproducing the problem

    Unless I can spot the error immediately, I usually try to reproduce the problem in a Stack Overflow post with a short but complete program. In this case, the program was already provided, so it was just a matter of copy/paste/compile/run. This one had the additional tweak that it was comparing the results of Java with the results of other tools, so I had to get hold of an MD5 sum tool first. (I chose to start with MD5 for no particular reason.) I happened to pick this one, but it didn’t really seem it would make much difference. (As it happens, that was an incorrect assumption, but hey…)

    I ran md5sums on c:\Windows\System32\calc.exe, and got the same result as the poster. Handy.

    I then ran the Java code, and again got the same result as the poster: step complete, we have a discrepancy between at least one tool (and MD5 isn’t that hard to get right) and Java.

    Looking for obvious problems

    The code has four main areas:

    • Reading a file
    • Updating digests
    • Converting bytes to hex
    • Storing and displaying results

    Of these, all of the first three have common and fairly simple error modes. For example:

    • Failure to use the return value from InputStream.read()
    • Failure to update the digests using only the relevant portion of the data buffer
    • Failure to cope with Java’s signed bytes

    The code for storing and displaying results seemed solid enough to ignore to start with, and brief inspection suggested that the first two failure modes had been avoided. While the hex code didn’t have any obvious problems either, it made sense to check it. I simply printed the result of hard-coding the “known good” CRC-32 value:

    System.out.println(toHex(new byte[] {
        (byte) 0x8D, (byte) 0x8F, (byte) 0x5F, (byte) 0x8E
      }));

    That produced the right result, so I ruled out that part of the code too. Even if it had errors in some cases, we know it’s capable of producing the right string for one of the values we know we should be returning, so it can’t be getting that value.

    Initial checks around the file

    I’m always suspicious of stream-handling code – or rather, I know how easily it can hide subtle bugs. Even though it looked okay, I thought I’d check – so I added a simple total to the code so I could see how many bytes had been hashed. I also removed all the hash algorithms other than MD5, just for simplicity:

    MessageDigest md5 = MessageDigest.getInstance("MD5");

    FileInputStream fis = new FileInputStream(file);
    byte data[] = new byte[size];
    int len = 0;
    int total = 0;
    while ((len = fis.read(data)) != -1) {
        md5.update(data, 0, len);
        total += len;
    }
    fis.close();
    System.out.println("Total bytes read: " + total);

    results.put(md5.getAlgorithm(), toHex(md5.digest()));

    It’s worth noting that I haven’t tried to fix up bits of the code which I know I would change if I were actually doing a code review:

    • The stream isn’t being closed in a finally block, so we’ll have a dangling resource (until GC) if an IOException is thrown
    • The initial value of len is never read, and can be removed

    Neither of these matters in terms of the problem at hand, and closing the file “properly” would make the sample code more complicated. (For the sake of just a short sample program, I’d be tempted to remove it entirely.)

    The result showed the same number of bytes being read as the command prompt reported when I ran “dir c:\Windows\System32\Calc.exe” – so again, everything looks like it’s okay.

    Desperate times call for stupid measures

    Just on a whim, I decided to copy Calc.exe to a local folder (the current directory) and retry. After all, accessing a file in a system folder might have some odd notions applied to it. It’s hard to work out what, but… there’s nothing to lose just by trying a simple test. If it can rule anything out, and you’ve got no better ideas, you might as well try even the daftest idea.

    I modified the source code to use the freshly-copied file, and it gave the same result. Hmm.

    I then reran md5sums on the copied file… and it gave the same result as Java. In other words, running “md5sums c:\Windows\System32\Calc.exe” gave one result, but “md5sums CopyOfCalc.exe” gave a different result. At this point we’ve moved from Java looking like it’s behaving weirdly to md5sums looking suspicious.

    Proving the root cause

    At this point we’re sort of done – we’ve basically proved that the Java code produces the right hash for whatever data it’s given, but we’re left with the question of what’s happening on the file system. I had a hunch that it might be something to do with x86 vs x64 code (all of this was running on a 64-bit version of Windows 7) – so how do we test that assumption?

    I don’t know if there’s a simple way of running an x86 version of the JVM, but I do know how to switch .NET code between x86 and x64 – you can do that for an assembly at build time. C# also makes the hashing and hex conversion simple, so I was able to knock up a very small app to show the problem:

    using System;
    using System.IO;
    using System.Security.Cryptography;

    class Test
    {
        static void Main()
        {
            using (var md5 = MD5.Create())
            {
            string path = "c:/Windows/System32/Calc.exe";
                var bytes = md5.ComputeHash(File.ReadAllBytes(path));
                Console.WriteLine(BitConverter.ToString(bytes));
            }
        }
    }

    (For a larger file I’d have used File.OpenRead instead, but then I’d have wanted to close the stream afterwards. Somehow it wasn’t worth correcting the existing possible handle leak in the Java code, but I didn’t want to write leaky code myself. So instead I’ve got code which reads the whole file into memory unnecessarily… )

    You can choose the architecture to build against (usually AnyCPU, x86 or x64 – though it’s interesting to see that “arm” is also an option under .NET 4.5, for obvious reasons) either from Visual Studio or using the “/platform” command line option. This doesn’t change the IL emitted (as far as I’m aware) but it’s used for interop with native code – and in the case of executables, it also determines whether the process runs under the 32-bit or 64-bit CLR.

    Building and running in x86 mode gave the same answer as the original “assumed to be good” tools; building and running in x64 mode gave the same answer as Java.
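    As an aside on the earlier point about running an x86 JVM: you can at least check the bitness of the JVM you’re currently running. A small sketch (note that “sun.arch.data.model” is a HotSpot-specific property and isn’t guaranteed on every JVM; “os.arch” is more widely available):

```java
public class JvmBitness {
    // Report the data model (32/64) of the running JVM, if the
    // HotSpot-specific property is present; "unknown" otherwise.
    static String dataModel() {
        return System.getProperty("sun.arch.data.model", "unknown");
    }

    public static void main(String[] args) {
        System.out.println("os.arch = " + System.getProperty("os.arch"));
        System.out.println("data model = " + dataModel() + "-bit");
    }
}
```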

    Explaining the root cause

    At this point we’ve proved that the file system gives different results depending on whether you access it from a 64-bit process or a 32-bit process. The final piece of the puzzle was to find something to explain why that happens. With all the evidence about what was happening, it was now easy to search for more information, and I found this article giving satisfactory details. Basically, there are two different copies of the system executables on a 64-bit system: x86 ones which run under the 32-bit emulator, and x64 ones. They’re actually in different directories, but when a process opens a file in Windows\System32, the copy which matches the architecture of the process is used. It’s almost as if the Windows\System32 directory is a symlink whose target changes depending on the current process.

    A Stack Overflow comment on my answer gave a final nugget that this is called the “File System Redirector”.

    Conclusion

    Debugging sessions often feel a bit like this – particularly if you’re like me, and only resort to real debugging after unit testing has failed. It’s a matter of looking under all kinds of rocks, trying anything and everything, but keeping track of everything you try. In an ideal world, at the end of the process you should be able to explain every result you’ve seen. (You may not be able to give evidence of the actual chain of events, but you should be able to construct a plausible chain of events which concurs with your theory.)

    Be aware of areas which can often lead to problems, and check those first, gradually reducing the scope of “stuff you don’t understand” to a manageable level, until it disappears completely.