Array covariance: not just ugly, but slow too

It seems to be quite a long time since I’ve written a genuine "code" blog post. Time to fix that.

This material may well be covered elsewhere – it’s certainly not terrifically original, and I’ve been meaning to post about it for a long time. In particular, I remember mentioning it at CodeMash in 2012. Anyway, the time has now come.

Refresher on array covariance

Just as a bit of background before we delve into the performance aspect, let me remind you what array covariance is, and when it applies. The basic idea is that C# allows a reference conversion from type TDerived[] to type TBase[], so long as:

  • TDerived and TBase are both reference types (potentially interfaces)
  • There’s a reference conversion from TDerived to TBase (so either TDerived is the same as TBase, or a subclass, or an implementing class etc)

Just to remind you about reference conversions, those are conversions from one reference type to another, where the result (on success) is never a reference to a different object. To quote section 6.1.6 of the C# 5 spec:

Reference conversions, implicit or explicit, never change the referential identity of the object being converted. In other words, while a reference conversion may change the type of the reference, it never changes the type or value of the object being referred to.

So as a simple example, there’s a reference conversion from string to object, therefore there’s a reference conversion from string[] to object[]:

string[] strings = new string[10];
object[] objects = strings;

// strings and objects now refer to the same object

There is no reference conversion between value type arrays, so you can’t use the same code to convert from int[] to object[].

The nasty part is that every store operation into a reference type array now has to be checked at execution time for type safety. So to extend our sample code very slightly:

string[] strings = new string[10]; 
object[] objects = strings;

objects[0] = "string"; // This is fine
objects[0] = new Button(); // This will fail

The last line here will fail with an ArrayTypeMismatchException, to avoid storing a Button reference in a String[] object. When I said that every store operation has to be checked, that’s a slight exaggeration: in theory, if the compile-time type is an array with an element type which is a sealed class, the check can be avoided as it can’t fail.
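The execution-time check can be observed directly by catching the exception. Here’s a minimal compilable sketch of the same idea, using a plain object rather than a Button so it doesn’t depend on any UI library:

```csharp
using System;

class CovarianceDemo
{
    static void Main()
    {
        // Covariant conversion: a string[] viewed through an object[] variable
        object[] objects = new string[1];

        objects[0] = "string"; // Fine: a string really can live in a string[]

        try
        {
            objects[0] = new object(); // Not a string: the store check fails
        }
        catch (ArrayTypeMismatchException)
        {
            Console.WriteLine("Store rejected at execution time");
        }
    }
}
```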

Avoiding array covariance

I would rather arrays weren’t covariant in the first place, but there’s not a lot that can be done about that now. However, we can work around this, if we really need to. We know that value type arrays are not covariant… so how about we use a value type array instead, even if we want to store reference types?

All we need is a value type which can store the reference type we need – which is dead easy with a wrapper type:

public struct Wrapper<T> where T : class
{
    private readonly T value;
    public T Value { get { return value; } }
    
    public Wrapper(T value)
    {
        this.value = value;
    }
    
    public static implicit operator Wrapper<T>(T value)
    {
        return new Wrapper<T>(value);
    }
}

Now if we have a Wrapper<string>[], we can’t assign that to a Wrapper<object>[] variable – the types are incompatible. If that feels a bit clunky, we can put the array into its own type:

public sealed class InvariantArray<T> where T : class
{
    private readonly Wrapper<T>[] array;
    
    public InvariantArray(int size)
    {
        array = new Wrapper<T>[size];
    }
    
    public T this[int index]
    {
        get { return array[index].Value; }
        set { array[index] = value; }
    }
}
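To see both workarounds in action, here’s a compilable sketch, repeating the definitions above so it stands alone. The commented-out line is the important bit: the assignment that array covariance would have allowed simply doesn’t compile any more.

```csharp
using System;

public struct Wrapper<T> where T : class
{
    private readonly T value;
    public T Value { get { return value; } }
    public Wrapper(T value) { this.value = value; }
    public static implicit operator Wrapper<T>(T value)
    {
        return new Wrapper<T>(value);
    }
}

public sealed class InvariantArray<T> where T : class
{
    private readonly Wrapper<T>[] array;
    public InvariantArray(int size) { array = new Wrapper<T>[size]; }
    public T this[int index]
    {
        get { return array[index].Value; }
        set { array[index] = value; }
    }
}

class Demo
{
    static void Main()
    {
        Wrapper<string>[] strings = new Wrapper<string>[10];
        strings[0] = "hello"; // implicit conversion from string

        // Compile-time error: value type arrays aren't covariant
        // Wrapper<object>[] objects = strings;

        InvariantArray<string> invariant = new InvariantArray<string>(10);
        invariant[0] = "hello";
        Console.WriteLine(invariant[0]);
    }
}
```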

Just to clarify, we now only have value type arrays, but ones where each value is a plain wrapper for a reference. We now can’t accidentally violate type-safety at compile-time, and the CLR doesn’t need to validate write operations.

There’s no memory overhead here – aside from the type information at the start, I’d actually expect the contents of a Wrapper<T>[] to be indistinguishable from a T[] in memory.

Benchmarking

So, how does it perform? I’ve written a small console app to test it. You can download the full code, but the gist of it is that we use a stopwatch to measure how long it takes to either repeatedly write to an array, or repeatedly read from an array (validating that the value read is non-null, just to prove that we’ve really read something). I’m hoping I haven’t fallen foul of any of the various mistakes in benchmarking which are so easy to make.

The test tries four scenarios:

  • object[] (but still storing strings)
  • string[]
  • Wrapper<string>[]
  • InvariantArray<string>

Running against an array size of 100, with 100 million iterations per test, I get the following results on my Thinkpad Twist:

Array type              Read time (ms)  Write time (ms)
object[]                11842           44827
string[]                12000           40865
Wrapper<string>[]       11843           29338
InvariantArray<string>  11825           32973

That’s just one run, but the results are fairly consistent across runs. The one interesting deviation is the write time for object[] – I’ve observed it sometimes being the same as for string[], but not consistently. I don’t understand this, but it does seem that the JIT isn’t performing the optimization for string[] that it could if it spotted that string is sealed.

Both of the workarounds to avoid array covariance make a noticeable difference to the performance of writing to the array, without affecting read performance. Hooray!

Conclusion

I think it would be a very rare application which noticed a significant performance boost here, but I do like the fact that this is one of those situations where a cleaner design also leads to better performance, without many obvious technical downsides.

That said, I doubt that I’ll actually be using this in real code any time soon – the fact that it’s just "different" to normal C# code is a big downside in itself. Hope you found it interesting though :)

But what does it all mean?

This year before NDC, I wrote an article for the conference edition of "The Developer" magazine. Follow that link to find the article in all its illustrated glory (along with many other fine articles, of course) – or read on for just the text.

Back when I used to post on newsgroups I would frequently be in the middle of a debate about the details of some behaviour or terminology, when one poster would say: “You’re just quibbling over semantics” as if this excused any and all previous inaccuracies. I would usually agree – I was indeed quibbling about semantics, but there’s no “just” about it.

Semantics is meaning, and that’s at the heart of communication – so for example, a debate over whether it’s correct to say that Java uses pass-by-reference[1] is all about semantics. Without semantics, there’s nothing to talk about.

This has been going on for years, and I’m quite used to being the pedant in any conversation when it comes to terminology – it’s a topic close to my heart. But over the years – and importantly since my attention has migrated to Stack Overflow, which tends to be more about real problems developers are facing than abstract discussions – I’ve noticed that I’m now being picky in the same sort of way, but about the meaning of data instead of terminology.

Data under the microscope

When it comes down to it, all the data we use is just bits – 1s and 0s. We assemble order from the chaos by ascribing meaning to those bits… and not just once, but in a whole hierarchy. For example, take the bits 01001010 00000000:

  • Taken as a little-endian 16-bit unsigned integer, they form a value of 74.
  • That 16-bit unsigned integer can be viewed as a UTF-16 code unit for the character ‘J’.
  • That character might be the first character within a string.
  • That string might be the target of a reference, which is the value for a field called “firstName”.
  • That field might be within an instance of a class called “Person”.
  • The instance of “Person” whose “firstName” field has a value which is a reference to the string whose first character is ‘J’ might itself be the target of a reference, which is the value for a field called “author”, within an instance of a class called “Article”.
  • The instance of “Article” whose “author” field (fill in the rest yourself…) might itself be the target of a reference which is part of a collection, stored (indirectly) via a field called “articles” in a class called “Magazine”.
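The first couple of layers of that hierarchy can be checked mechanically. Here’s a small sketch using BitConverter – note that ToUInt16 interprets the bytes using the machine’s own byte order, so the “74” reading assumes a little-endian machine, which is what you’ll almost certainly be running on:

```csharp
using System;

class BitsDemo
{
    static void Main()
    {
        // The sixteen bits from the example: 01001010 00000000
        byte[] bits = { 0x4A, 0x00 };

        // Layer 1: a little-endian 16-bit unsigned integer
        ushort number = BitConverter.ToUInt16(bits, 0); // 74 on a little-endian machine

        // Layer 2: the same value as a UTF-16 code unit
        char letter = (char) number; // 'J'

        Console.WriteLine("{0} -> {1}", number, letter);
    }
}
```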

As we’ve zoomed out from sixteen individual bits, at every level we’ve imposed meaning. Imagine all the individual bits of information which would be involved in a single instance of the Magazine with a dozen articles, an editorial, credits – and perhaps even images. Really imagine them, all written down next to each other, possibly without even the helpful gap between bytes that I included in our example earlier.

That’s the raw data. Everything else is “just” semantics.

So what does that have to do with me?

I’m sure I haven’t told you anything you don’t already know. Yes, we can impose meaning on these puny bits, with our awesome developer power. The trouble is that bits have a habit of rebelling if you try to impose the wrong kind of meaning on them… and we seem to do that quite a lot.

The most common example I see on Stack Overflow is treating text (strings) and binary data (image files, zip files, encrypted data) as if they were interchangeable. If you try to load a JPEG using StreamReader in .NET or FileReader in Java, you’re going to have problems. There are ways you can actually get away with it – usually by using the ISO-8859-1 encoding – but it’s a little bit like trying to drive down a road with a broken steering wheel, only making progress by bouncing off other obstacles.
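The corruption is easy to demonstrate with Encoding directly, without even involving StreamReader. The bytes below are the real start of a JPEG file; 0xFF can never appear in valid UTF-8, so decoding as text and re-encoding destroys the data – whereas ISO-8859-1 maps every byte to a character and so happens to round-trip:

```csharp
using System;
using System.Text;

class TextVersusBinary
{
    static void Main()
    {
        // The first four bytes of a JPEG file (SOI marker + start of APP0)
        byte[] original = { 0xFF, 0xD8, 0xFF, 0xE0 };

        // Round-trip through UTF-8, as StreamReader would by default: lossy!
        // Invalid byte sequences become U+FFFD replacement characters.
        string asText = Encoding.UTF8.GetString(original);
        byte[] mangled = Encoding.UTF8.GetBytes(asText);
        Console.WriteLine(mangled.Length); // no longer 4 bytes

        // Round-trip through ISO-8859-1: survives, but only by accident
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] restored = latin1.GetBytes(latin1.GetString(original));
        Console.WriteLine(restored.Length); // 4, with the original values
    }
}
```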

While this is a common example, it’s far from the only one. Some of the problems which fall into this category might not obviously be due to the mishandling of data, but at a deep level they’re all quite similar:

  • SQL injection attacks due to mingling code (SQL) with data (values) instead of using parameters to keep the two separate.
  • The computer getting arithmetic “wrong” because the developer didn’t understand the meaning of floating binary point numbers, and should actually have used a floating decimal point type (such as System.Decimal or java.math.BigDecimal).
  • String formatting issues due to treating the result of a previous string formatting operation as another format string – despite the fact that now it includes user data which could really have any kind of text in it.
  • Double-encoding or double-unencoding of text data to make it safe for transport via a URL.
  • Almost anything to do with dates and times, including – but certainly not limited to – the way that java.util.Date and System.DateTime values don’t inherently have a format. They’re just values.
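The floating point item is the easiest of these to show in two lines. The same-looking arithmetic gives different answers depending on which meaning we’ve imposed on the bits:

```csharp
using System;

class FloatingPointMeaning
{
    static void Main()
    {
        // Binary floating point: 0.1 and 0.2 have no exact representation
        double d = 0.1 + 0.2;
        Console.WriteLine(d == 0.3);   // False

        // Decimal floating point: these values are represented exactly
        decimal m = 0.1m + 0.2m;
        Console.WriteLine(m == 0.3m);  // True
    }
}
```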

The sheer bulk of questions which indicate a lack of understanding of the nature of data is enormous. Of course Stack Overflow only shows a tiny part of this – it doesn’t give much insight into the mountain of code which handles data correctly from the perspective of the types involved, but does entirely inappropriate things with those values from the perspective of the intended business meaning of those values.

It’s not all doom and gloom though. We have some simple but powerful weapons available in the fight against semantic drivel.

Types

This article gives a good indication of why I’m a fan of statically typed languages. The type system can convey huge amounts of information about the nature of data, even if the business meaning of values of those types can be horribly overloaded.

Maybe it would be good if we distinguished between human-readable text which should usually be treated in a culture-sensitive way, and machine-parsable text which should usually be treated without reference to any culture. Those two types might have different operations available on them, for example – but it would almost certainly get messy very quickly.

For business-specific types though, it’s usually easy to make sure that each type is really only used for one concept, and only provides operations which are meaningful for that concept.

Meaningful names

Naming is undoubtedly hard, and I suspect most developers have had the same joyless experiences that I have of struggling for ten minutes to come up with a good class or method name, only to end up with the one we first thought of which we really don’t like… but which is simply better than all the other options we’ve considered. Still, it’s worth the struggle.

Our names don’t have to be perfect, but there’s simply no excuse for names such as “Form1” or “MyClass” which seem almost designed to carry no useful information whatsoever. Often simply the act of naming something can communicate meaning. Don’t be afraid to extract local variables in code just to clarify the meaning of some otherwise-obscure expression.

Documentation

I don’t think I’ve ever met a developer who actually enjoys writing documentation, but I think it’s hard to deny that it’s important. There tend to be very few things that are so precisely communicated just by the names of types, properties and methods that no further information is required. What is guaranteed about the data? How should it be used – and how should it not be used?

The form, style and detail level of documentation will vary from project to project, but don’t underestimate its value. Aside from behavioural details, ask yourself what meaning you’re imposing on or assuming about the data you’re dealing with… what would happen if someone else made different assumptions? What could go wrong, and how can you prevent it by expressing your own understanding clearly? This isn’t just important for large projects with big teams, either – it’s entirely possible that the person who comes to the code with a different viewpoint is going to be you, six months later.

Conclusion

I apologise if this all sounds like motherhood and apple pie. I’m not saying anything new, after all – of course we all agree that we need to understand the meaning of our data. I’m really not trying to waste your time though: I want you to take a moment to appreciate just how much it matters that we understand the data we work with, and how many problems can be avoided by putting effort into communicating that effectively and reading what others have written about their data.

There are other approaches we can take beyond those I’ve listed above, too – much more technically exciting ones – around static code analysis, contracts, complicated annotations and the like. I have nothing against them, but just understanding the value of semantics is a crucial starting point, and once everyone on your team agrees on that, you’ll be in a much better position to move forward and agree on the meaning of your data itself. From there, the world’s your oyster – if you see what I mean.


[1] It doesn’t; references are passed by value, as even my non-programmer wife knows by now. That’s how often this myth comes up.