All posts by jonskeet

Mad props to @arcaderage for the "Princess Rescue" image - see https://toggl.com/programming-princess for the full original

Reimplementing LINQ to Objects: Part 2 – “Where”

Warning: this post is quite long. Although I’ve chosen a simple operator to implement, we’ll encounter a few of the corner cases and principles involved in LINQ along the way. This will also be a somewhat experimental post in terms of format, as I try to work out the best way of presenting the material.

We’re going to implement the "Where" clause/method/operator. It’s reasonably simple to understand in general, but goes into all of the deferred execution and streaming bits which can cause problems. It’s generic, but only uses one type parameter (which is a big deal, IMO – the more type parameters a method has, the harder I find it to understand in general). Oh, and it’s a starting point for query expressions, which is a bonus.

What is it?

"Where" has two overloads:

public static IEnumerable<TSource> Where(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static IEnumerable<TSource> Where(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)

Before I go into what it actually does, I’ll point out a few things which are going to be common across nearly all of the LINQ operators we’ll be implementing:

  • These are extension methods – they’re declared in a top-level, non-nested, static class and the first parameter has a "this" modifier. They can be invoked roughly as if they were instance methods on the type of the first parameter.
  • They’re generic methods – in this case there’s just a single type parameter (TSource) which indicates the type of sequence we’re dealing with. So for (say) a list of strings, TSource would be string.
  • They take generic delegates in the Func<…> family. These are usually specified with lambda expressions, but any other way of providing a delegate will work fine too.
  • They deal with sequences. These are represented by IEnumerable<T>, with an iterator over a sequence being represented by IEnumerator<T>.

I fully expect that most readers will be comfortable with all of these concepts, so I won’t go into them in more detail. If any of the above makes you nervous, please familiarise yourself with them before continuing, otherwise you’re likely to have a hard time.

The purpose of "Where" is to filter a sequence. It takes an input sequence and a predicate, and returns another sequence. The output sequence is of the same element type (so if you put in a sequence of strings, you’ll get a sequence of strings out) and it will only contain elements from the input sequence which pass the predicate. (Each item will be passed to the predicate in turn. If the predicate returns true, the item will be part of the output sequence; otherwise it won’t.)

Now, a few important details about the behaviour:

  • The input sequence is not modified in any way: this isn’t like List<T>.RemoveAll, for example.
  • The method uses deferred execution – until you start trying to fetch items from the output sequence, it won’t start fetching items from the input sequence
  • Despite deferred execution, it will validate that the parameters aren’t null immediately
  • It streams its results: it only ever needs to look at one result at a time, and will yield it without keeping a reference to it. This means you can apply it to an infinitely long sequence (for example a sequence of random numbers)
  • It will iterate over the input sequence exactly once each time you iterate over the output sequence
  • Disposing of an iterator over the output sequence will dispose of the corresponding iterator over the input sequence. (In case you didn’t realise, the foreach statement in C# uses a try/finally block to make sure the iterator is always disposed however the loop finishes.)

Many of these points will be true for a lot of our other operators too.

The overload which takes a Func<TSource, int, bool> lets the predicate use the index within the sequence as well as the value. The index always starts at 0, and increments by 1 each time regardless of previous results from the predicate.

What are we going to test?

Ideally, we’d like to test all of the points above. The details of streaming and how many times the sequence is iterated over are frankly a pain to deal with, unfortunately. Given how much there is to implement already, we’ll come back to those.

Let’s have a look at some tests. First, here’s a simple "positive" test – we’re starting with an array of integers, and using a lambda expression to only include elements less than 4 in the output. (The word "filter" is omnipresent but unfortunate. It’s easier to talk about "filtering out" elements than "including" them, but the predicate is expressed in a positive way.)

[Test]
public void SimpleFiltering()
{
    int[] source = { 1, 3, 4, 2, 8, 1 };
    var result = source.Where(x => x < 4);
    result.AssertSequenceEqual(1, 3, 2, 1);
}

I’ve kept the TestExtensions from MoreLINQ, despite NUnit coming with CollectionAssert. I find the extension methods easier to work with for three reasons:

  • They’re extension methods, which helps to reduce the clutter
  • They can use a parameter array for the expected output, which makes the test simpler to express
  • The message is clearer when the assertion fails

Basically, AssertSequenceEqual does what you’d expect it to – it checks that the actual result (usually expressed as the variable you call the method on) matches the expected result (usually expressed as a parameter array).

So far, so good. Now let’s check argument validation:

[Test]
public void NullSourceThrowsNullArgumentException()
{
    IEnumerable<int> source = null;
    Assert.Throws<ArgumentNullException>(() => source.Where(x => x > 5));
}

[Test]
public void NullPredicateThrowsNullArgumentException()
{
    int[] source = { 1, 3, 7, 9, 10 };
    Func<int, bool> predicate = null;
    Assert.Throws<ArgumentNullException>(() => source.Where(predicate));
}

I’m not bothering to check the name within the ArgumentNullException, but importantly I’m testing that the arguments are being validated immediately. I’m not trying to iterate over the result – so if the validation is deferred, the test will fail.

The final interesting test for the moment is also around deferred execution, using a helper class called ThrowingEnumerable. This is a sequence which blows up with an InvalidOperationException if you ever try to iterate over it. Essentially, we want to check two things:

  • Just calling Where doesn’t start iterating over the source sequence
  • When we call GetEnumerator() to get an iterator and then MoveNext() on that iterator, we should start iterating, causing an exception to be thrown.

We’ll need to do something similar for other operators, so I’ve written a small helper method in ThrowingEnumerable:

internal static void AssertDeferred<T>(
    Func<IEnumerable<int>, IEnumerable<T>> deferredFunction)
{
    ThrowingEnumerable source = new ThrowingEnumerable();
    var result = deferredFunction(source);
    using (var iterator = result.GetEnumerator())
    {
        Assert.Throws<InvalidOperationException>(() => iterator.MoveNext());
    }
}

Now we can use that to check that Where really defers execution:

[Test]
public void ExecutionIsDeferred()
{
    ThrowingEnumerable.AssertDeferred(src => src.Where(x => x > 0));
}

These tests have all dealt with the simpler predicate overload – where the predicate is only passed the item, not the index. The tests involving the index are very similar.

Let’s implement it!

With all the tests passing when running against the real LINQ to Objects, it’s time to implement it in our production code. We’re going to use iterator blocks, which were introduced in C# 2 to make it easier to implement IEnumerable<T>. I have a couple of articles you can read if you want more background details… or read chapter 6 of C# in Depth (either edition). These give us deferred execution for free… but that can be a curse as well as a blessing, as we’ll see in a minute.

At its heart, the implementation is going to look something like this:

// Naive implementation
public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

Simple, isn’t it? Iterator blocks allow us to write the code pretty much how we’d describe it: we iterate over each item in the source, and if the predicate returns true for that particular item, we yield (include) it in the output sequence.

Lo and behold, some of our tests pass already. Now we just need argument validation. That’s easy, right? Let’s give it a go:

// Naive validation – broken!
public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

Hmm. Our validation tests still seem to be red, and putting a breakpoint on the "throw" statements doesn’t help… they’re not getting hit. What’s going on?

I’ve given a few pretty broad hints already. The problem is deferred execution. Until we start trying to iterate over the result, none of our code will run. Our tests deliberately don’t iterate over the result, so validation is never performed.

We’ve just hit a design flaw in C#. Iterator blocks in C# simply don’t work nicely when you want to split execution between "immediate" (usually for validation) and "deferred". Instead, we have to split our implementation into two: a normal method for validation, which then calls the iterator method for the deferred processing:

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return WhereImpl(source, predicate);
}

private static IEnumerable<TSource> WhereImpl<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

It’s ugly, but it works: all our index-less tests go green. From here, it’s a short step to implement the version using an index as well:

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<T, int, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return WhereImpl(source, predicate);
}

private static IEnumerable<TSource> WhereImpl<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)
{
    int index = 0;
    foreach (TSource item in source)
    {
        if (predicate(item, index))
        {
            yield return item;
        }
        index++;
    }
}

Now the bar is green, and we’re done. Hang on though… we haven’t used it every way we can yet.

Query expressions

So far, we’ve been calling the method directly (although as an extension method) – but LINQ also provides us with query expressions. Here’s our "SimpleFiltering" test rewritten to use a query expression:

[Test]
public void QueryExpressionSimpleFiltering()
{
    int[] source = { 1, 3, 4, 2, 8, 1 };
    var result = from x in source
                 where x < 4
                 select x;
    result.AssertSequenceEqual(1, 3, 2, 1);
}

(Note that the name is different here to in the downloadable code, to stop the blog server software blocking the name of the method. Grr.)

That will produce exactly the same code as our earlier test. The compiler basically translates this form into the previous one, leaving the condition (x < 4) as a lambda expression and then converting it appropriately (into a delegate in this case). You may be surprised that this works as we have no Select method yet…  but in this case we have a "no-op" select projection; we’re not actually performing a real transformation. In that case – and so long as there’s something else in the query, in this case our "where" clause – the compiler effectively omits the "select" clause, so it doesn’t matter that we haven’t implemented it. If you changed "select x" to "select x * 2", it would fail to compile against our Where-only LINQ implementation.

The fact that query expressions are just based on patterns like this is a very powerful feature for flexibility – it’s how LINQ to Rx is able to only implement the operators that make sense in that environment, for example. Similarly, there’s nothing in the C# compiler that "knows" about IEnumerable<T> when it comes to query expressions – which is how entirely separate interfaces such as IObservable<T> work just as well.

What have we learned?

There’s been a lot to take in here, in terms of both implementation and core LINQ principles:

  • LINQ to Objects is based on extension methods, delegates and IEnumerable<T>
  • Operators use deferred execution where appropriate and stream their data where possible
  • Operators don’t mutate the original source, but instead return a new sequence which will return the appropriate data
  • Query expressions are based on compiler translations of patterns; you don’t need to implement any more of the pattern than the relevant query expression requires
  • Iterator blocks are great for implementing deferred execution…
  • … but make eager argument validation a pain

Code download

Linq-To-Objects-2.zip

Many people have asked for a source repository for the project, and that makes sense. I’m putting it together a source repository for it now; it’s likely to be done before I post the next part.

Reimplementing LINQ to Objects: Part 1 – Introduction

About a year and a half ago, I gave a talk at a DDD day in Reading, attempting to reimplement as much of LINQ to Objects as possible in an hour. Based on the feedback from the session, I went far too fast… and I was still a long way from finishing. However, I still think it’s an interesting exercise, so I thought I’d do it again in a more leisurely way, blogging as I go. Everything will be under the "Edulinq" tag, so that’s the best way to get all the parts in order, without any of my other posts.

General approach

The plan is to reimplement the whole of LINQ to Objects, explaining each method (or group of methods) in a blog post. I’m going to try to make the code itself production quality, but I’m not going to include any XML documentation – if I’m already writing up how things work, I don’t want to do everything twice. I’ll include optimizations where appropriate, hopefully doing better than LINQ to Objects itself.

The approach is going to be fairly simple: for each LINQ method, I’ll write some unit tests (most of which I won’t show in the blog posts) and make sure they run against the normal .NET implementation. I’ll then comment out the using directive for System.Linq, and introduce one for JonSkeet.Linq instead. The tests will fail, I’ll implement the methods, and gradually they’ll go green again. It’s not quite the normal TDD pattern, but it works pretty well.

I will write up a blog entry for each LINQ operator, probably including all of the production code but only "interesting" tests. I’ll highlight important patterns as they come up – that’s half the point of the exercise, of course.

At the end of each post, I’ll include a link to download the "code so far". For the sake of anyone looking at these posts in the future, I’m going to keep the downloads numbered separately, rather than updating a single download over time. I’m hoping that the code will simply grow, but I dare say there’ll be some modifications along the way too.

The aim is not to end up with LINQBridge: I’m going to be targeting .NET 3.5 (mostly so I can use extension methods without creating my own attribute) and I’m certainly not going to start worrying about installers and the like. The purpose of all of this is purely educational: if you follow along with these blog posts, with any luck you’ll have a deeper understanding of LINQ in general and LINQ to Objects in particular. For example, topics like deferred execution are often misunderstood: looking at the implementation can clarify things pretty well.

Testing

The unit tests will be written using NUnit (just for the sake of my own familiarity). Fairly obviously, one of the things we’ll need to test quite a lot is whether two sequences are equal. We’ll do this using the TestExtensions class from MoreLINQ (which I’ve just copied into the project). The netbook on which I’ll probably write most of this code only has C# Express 2010 on it, so I’m going to use the external NUnit GUI. I’ve set this in the project file as the program to start… which you can’t do from C# Express directly, but editing the project file is easy, to include this:

<StartAction>Program</StartAction>
<StartProgram>C:Program FilesNUnit-2.5.7.10213binnet-2.0nunit-x86.exe</StartProgram>

It’s a bit of a grotty hack, but it works. The "additional command line parameters" are then set to just JonSkeet.Linq.Tests.dll – the current directory is the bin/debug directory by default, so everything’s fine. Obviously if you want to run the tests yourself and you have ReSharper or something similar, you can see the results integrated into Visual Studio.

Although I’m hoping to write reasonably production-level code, I doubt that there’ll be as many unit tests as I’d really write against production code. I fully expect the number of lines of test code to dwarf the number of lines of production code even so. There are simply vast numbers of potential corner cases… and quite a few overloads in several cases. Remember that the goal here is to examine interesting facets of LINQ.

Code layout

Just like the real LINQ to Objects, I’ll be creating an enormous static Enumerable class… but I’ll do so using partial classes, with one method name (but multiple overloads) per file. So Where will be implemented in Where.cs and tested in WhereTest.cs, for example.

First code drop

The first zip file is available: Linq-To-Objects-1.zip. It doesn’t contain any production code yet – just 4 tests for Where, so I could check that NUnit was working properly for me. Next stop… implementing Where.

Presentation preparation

As most of you know, I occasionally talk about C# at conferences, user groups or basically anywhere that people won’t attack me. A while ago I rejected PowerPoint in favour of a rather less formal approach: hand-drawn slides. Quite a few people have now asked me about how they’re prepared – even occasionally making the assumption that my awful handwriting is actually a real font – so I figured it’s worth a blog post.

My process is both primitive and complex…

1. Draw the slides

I use plain A4 printer paper and a black flipchart marker to draw the slides. Where I need one bit overlaid on top of another (e.g. to cross something out), I’ll draw the two sections separately, to merge later. Given that I’m fiddling around anyway, it doesn’t matter if there are extra bits on the paper – so if I make a small mistake, I’ll often keep going on the same sheet.

I rarely make a textual plan of my presentations beforehand – I think of roughly where I want to go, and then just draw. Often some slides will be reordered or deleted for the final presentation. I find a good fat pen in my hand helps the creative process.

2. Scan the paper

I have a relatively ancient but still trusty flatbed scanner (a Xerox 2400, if you’re interested) which I use to scan everything in. A document scanner would be more convenient – scanning 25 slides individually is very tedious – but I don’t really have the space at the moment.

I scan in greyscale at 400dpi, into Paintshop Pro v8.10. It may be 7 years old, but it does the job perfectly adequately. I then re-orient them to landscape mode and save them as BMP files. Yes, it’s as daft as it sounds, but necessary for the next step. These files are pretty big – about 15MB – and large in terms of canvas size too… but here’s an idea of what they look like (but as a PNG, scaled down significantly):

Slide before alterations

At this point I’m done with Paintshop Pro, so I close it down…

3. Convert into SVG

I used to edit the slides as PNG files in Paintshop Pro, but this had various issues:

  • Colouring was a pain
  • Resizing didn’t work terribly well
  • Cutting and pasting shapes wasn’t as easy as it might have been

For line drawing like mine, I suspect SVG is the ideal format. I found InkScape as a free tool which works pretty well… but I need to convert the images first. InkScape does actually have a conversion tool built-in, but I’ve never quite got it to work properly – whereas using potrace directly (I believe InkScape uses potrace internally) means I can batch the conversions and control them as far as I need to.

When I discovered potrace, I was amazed by it. It really does exactly what it says on the tin – it converts bitmaps into SVGs. Unfortunately it’s limited in terms of its input formats – hence the BMPs – but that’s fine.

For the record, here’s the potrace command I use:

potrace -b svg -r 400 -t 8 -O 0.4 -k 0.7 %*

I can’t remember what all the arguments mean offhand, but you can find out if you want to :)

4. Edit in InkScape

So, now I have black and white SVGs which I can edit in InkScape. I generally perform the following tweaks:

  • Remove any crud left from my dirty scanner – often there are extra shapes which need deleting
  • Resize everything and rotate shapes – usually to straighten them
  • Apply colour; I usually stick to red, black, blue and green with occasional other colours where necessary. A bit of colour’s great, but we’re not going for works of art here… and many colours look rubbish on some projectors.

Here’s the result of editing the above picture – assuming your browser can handle SVG:

Slide as an SVG

5. Convert to PNG

So now we’re done, right? Well, no… because I haven’t found any way of presenting nicely from SVGs. Aside from anything else, they take a long time to render, and I really need to be able to flick from slide to slide whenever I want to. However, I’ve found that PNGs work well as a display format. I use ImageMagick to convert from SVG to PNG usually… although very occasionally it crashes, and I have to use InkScape to export to PNG, which it’s perfectly capable of doing. (I use ImageMagick mostly so that I can do it easily in a batch.)

I convert to a 1024×768 (-ish) PNG, so that it won’t be too time-consuming to render, and I don’t need to worry about scaling it. (Most projectors I’ve used have been that size.) Here’s the PNG for the same image again – shrunk to 600×436 just to avoid taking up too much room:

Final converted slide

Even if you couldn’t see the SVG before, you should now be able to spot that not only is the image coloured in (and with a lot better contrast than the original) but I’ve tweaked the man to be falling instead of flying. (The aim of the slide is to indicate that C# should push you into the pit of success.)

6. Present!

At this point, I have three directories: one of the original BMPs, one of the SVGs, and one of the PNGs. I could then create a PowerPoint presentation consisting of all of the images… but I’ve found that PowerPoint takes an awfully long time to render images, and it would also be a bit of a pain to add them in the first place.

Instead, I use FastStone Image Viewer to present a slideshow of the images. I’ve tweaked a few options to make the images appear nice and quickly against a white background. It’s easiest if the image files are in alphabetical order, so I tend to go for filenames such as "010 Introduction.png" – always iniitially leaving 10 between each number, so that I can easily slot in an extra slide or two if necessary.

So that’s my process… four pieces of free software and one very old commercial one, a fair amount of time, and a slightly odd result.

Looking to the future…

Obviously this isn’t an ideal process. I now have a Wacom Bamboo Fun tablet, and the hope is that eventually I’ll be able to draw with it directly into InkScape, cutting out huge amounts of the process. However, so far I’m completely inept with it. If you think my drawings are bad and my handwriting is ridiculous when I’m using a pen and paper, you should see it with the tablet… but maybe that will change. I’d like to think so.

I suspect my style may change over time too, although I’m not sure in what way. Different presentations already get different treatment – for some talks I work entirely in Visual Studio, occasionally breaking to talk without coding, but without any slides… for others I’ll have slides but no code at all. If you ever happen to see me present though, please give me feedback afterwards – good or bad. It’s an area I’d love to improve on, and feedback is the best way of that happening.

Don’t let this get away

Josh Twist asked me this via Twitter:

is it possible to invoke a member before a ctor is finished (eg maybe using threaded IL trickery) or is this forbidden somehow? :D

Now I don’t know why everyone seems to think I enjoy writing code which could have bizarre effects on either you, the compiler, the resulting execution or your co-workers… but it’s an interesting topic to look at, anyway.

The perils of partially constructed objects

Hopefully it’s reasonably obvious why it’s dangerous to access a member before it’s been properly constructed – but it may be worse than you’ve considered.

In particular, immutable types are only immutable after they’re fully constructed. It’s entirely reasonable for an immutable type to change read-only fields several times during the course of initialization. The fields can only be set in the constructor itself (or a variable initializer for the field) but this can occur several times. If the constructor for the immutable type exposes the instance it’s constructing to other code, all the immutability guarantees go out of the window.

Even in mostly-mutable types, code may well assume that it’s dealing with some fixed aspects. For example, you may have some database entity type which is either freshly created with a random GUID, or created from an existing record with an ID from the database. In either case, code consuming this type wouldn’t expect to see an ID of Guid.Empty, or for the ID to change after it’s been observed… even if other properties of the object can be changed later.

What C# does to protect you

C# as a language (plus conforming compiler, of course) protects you from some of this.

When you chain to another constructor, you can’t use this to calculate any arguments you want to pass to the other constructor. The code is clearly dealing with a partially constructed object at this point – it knows none of your constructor body has been executed – so it’s protecting you from harm. Unfortunately this means you can’t even call this.GetType(), which can make it tricky to write objects which populate themselves using reflection.

During the constructor body, you have complete access to this of course – you have to, in order to set any state within the object. This is where things can get nasty.

Virtual methods

One way in which Java, C# and C++ diverge in their constructor behaviour is with regard to virtual methods:

  • In C++, the object only really "becomes" an instance of the subclass when the subclass constructor has been executed, so calling a virtual method will only execute the override in the "current" type hierarchy.
  • In Java, the object is of the final type from the start, so the most deeply overridden implementation of the virtual method is called – but this will occur without any initialization having taken place. All fields will still have their default values (null, 0 etc).
  • C# is like Java, except that variable initializers will have been executed (as they’re executed before the base class constructor is called). In other words, initialization within the constructor body won’t have taken place, but any fields which are initialized as part of the declaration will have their appropriate values.

This is really dangerous if you’re not aware of it. In particular, any time you override a virtual method in Java or C#, you need to know whether it might be called in a partially-initialized state.

Wherever possible, try not to call virtual methods from the constructor for precisely this reason. I would advise that if you absolutely have to do it (I failed to remove this behaviour when porting Joda Time to Noda Time, for example) you document that fact very heavily and make sure that you don’t call the method in any other place. Make it protected, too. Basically it should only be part of initialization. If you need similar behaviour at other times, create another method. This allows derived classes to tailor their implementation to the expected state at the time of invocation.

Callbacks

You may be thinking that this is all easy: just avoid virtual methods, don’t do anything stupid like setting a static variable to this during a constructor (making it visible from other threads before initialization is completed) and you’ll be fine.

Well, I suspect that almost every Windows Forms app in existence publishes this during the constructor. Any time you have an event handler, that’s effectively providing a callback… and if that’s an instance method, it’s tied to the relevant instance, usually this.

How sure are you – really, really sure – that none of those event handlers will fire as part of the rest of the initialization? For example, if you use Visual Studio to hook up the ControlAdded event for a WinForms form, and also add a bunch of controls to the form… when is that event going to fire? Will the autogenerated code add the event handler after it adds the controls, or before? If it adds the handler at the start, then clearly the method handling the event will be called before your constructor finishes… so you need to be ready for that.

How much of a problem is this really?

Like many matters of purity, I suspect this is usually more of a theoretical issue than a practical one. In complicated situations like the Windows Forms one above, most event handlers are likely to be fired after initialization… and there’s typically not as strong a sense of invariants being set up as there would be in an immutable data type, for example.

Immutable data types, in turn, are less likely to accidentally let this escape during construction… but the consequences of them doing so are much more severe, of course.

Conclusion

To answer Josh’s question: Yes. At least on the simplest reading of the question: members can certainly end up being invoked on an object during its construction. They can potentially end up being invoked on multiple threads during construction. This is basically under the control of the constructors in the type hierarchy though.

In particular, I believe that the .NET memory model is stricter than the ECMA specification in terms of threading: I believe a constructor will have completed (and all its writes retired) before the reference returned by the constructor can be published to another variable, which was a concern in double-checked locking. It’s a valid concern to consider though.

Alternative conclusion: almost nothing is really as simple as it appears to be.

Code and data

In a recent Stack Overflow question, I answered a question which started off with a broken XPath expression by suggesting that that poster might be better off using LINQ to XML instead. The discussion which followed in the comments (around whether or not this was an appropriate answer) led me to think about the nature of code and data, and how important context is.

I don’t think there’s any particularly deep insight in this post – so I’ll attempt to keep it relatively short. However, you might like to think about how code and data interact in your own experience, and what the effects of this can be.

Code is data

Okay, so let’s start off with the obvious: all code is data, at some level. If it’s compiled code, it’s just binary data which a machine can execute. Put it on another machine with no VM, and there’s nothing remarkable about it. It’s just a load of 1s and 0s. As source code, most languages are just plain text. Open up some source code written in C#, Ruby, Python, Java, C++ etc in Notepad and it’ll be readable. You may miss the syntax highlighting and so forth, but it’s still just text.

Code in the right context is more than just data

So what makes this data different to (say) a CSV file or a plain text story? It’s all in the context. When you load it into the right editor, or pass it to the right compiler, you get more information: in an editor you may see the aforementioned syntax highlighting, autocompletion, documentation for members you’re using; a compiler will either produce errors or a binary file. For something like Python or Ruby, you may want to feed the source into an interpreter instead of a compiler, but the principle is the same: the data takes on more meaning.

Code in the wrong code-related context is just data again

Now let’s think about typical places where you might put code (or something with similar characteristics) into the "wrong" context:

  • SQL statements
  • XSLT transformations
  • XPath expressions
  • XML or HTML text
  • Regular expressions

All of these languages have editors which understand them, and will help you avoid problems. All of these are also possible to embed in other code – C#, for example. Indeed, almost all the regular expressions I’ve personally written have ended up in Java or C# code. At that point, there are two problems:

  • You may want to include text which doesn’t embed easily within the "host" language’s string literals (particularly double quotes, backslashes and newlines)
  • The code editor doesn’t understand the additional meaning to the text

The first problem is at least somewhat mitigated by C#’s support for verbatim string literals – only double quotes remain as a problem. But the second problem is the really big one. Visual Studio isn’t going to check that your regular expression or XPath expression looks valid. It’s not going to give you syntax highlighting for your SQL statement, much less IntelliSense on the columns present in your database. Admittedly such a thing might be possible, if the IDE looked ahead to find out where the text was going to be used – but I haven’t seen any IDE that advanced yet. (The closest I’ve seen is ReSharper noticing when you’re using a format string with the wrong number of parameters – that’s primitive but still really useful.)

Of course, you could write your SQL (or XPath etc) in a dedicated editor, and then either copy and paste it into your code or embed it into your eventual binary and load it at execution time. Neither of these is particularly appealing. Copy and paste works well once, but then when you’re reading or modifying the code you lose the advantages you had unless you copy and paste it again. Embedding the file can work well in some cases – I use it liberally for test data in unit tests, for example – but I wouldn’t want it all over production code. It means that when reading the code, you have to refer to the external resource to work out what’s going to happen. In some cases that’s not too bad – it’s only like opening another class or method, I guess – but in other cases the shift of gears is too distracting.

When code is data, it’s easy to mix it with other data – badly

Within C# code, it’s easy to see the bits of data which sometimes occur in your code: string or numeric literals, typically. Maybe you subscribe to the "no magic values" philosophy, and only ever have literals (other than 0 or 1, typically) as values for constants. Well, that’s just a level of indirection – which in some ways hides the fact that you’ve still got magic values. If you’re only going to use a piece of data once, including it directly in-place actually adds to readability in my view. Anyway, without wishing to dive into that particular debate too deeply, the point is that the compiler (or whatever) will typically stop you from using that data as code – at least without being explicit about it. It will make sure that if you’re using a value, it really is a value. If you’re trying to use a variable, it had better be a variable. Putting a variable name in quotes means it’s just text, and using a word without the quotes will make the compiler complain unless you happen to have a variable with the right name.

Now compare that with embedding XPath within C#, where you might have:

var node = doc.SelectSingleNode("//foo/bar[@baz=xyz]");

Now it may be obvious to you that "xyz" is meant to be a value here, not the name of an attribute, an element, a function or anything like that… but it’s not obvious to Visual Studio, which won’t give you any warnings. This is only a special case of the previous issue of invalid code, of course, but it does lead onto a related issue… SQL injection attacks.

When you’ve already got your "code" as a simple text value – a string literal containing your SQL statement, as an obvious example – it’s all too easy to start mixing that code/data with genuine data data: a value entered by a user, for example. Hey, let’s just concatenate the two together. Or maybe use a format string, effectively mixing three languages (C#, SQL, the primitive string formatting "language" of string.Format) into a single statement. We all know the results, of course: nothing differentiates between the code/data and the genuine data, so if the user-entered value happens to look like SQL to drop a database table, we end up with Little Bobby Tables.

I’m sure 99% of my blog readers know the way to avoid SQL injection attacks: use parameterized SQL statements. Keep the data and the code separate, basically.

Expressing the same ideas, but back in the "native" language

Going back to the start of all this, the above is why I like LINQ to XML. When I express a query using LINQ to XML, it’s often a lot longer than it would have been in the equivalent XPath – but I can tell where the data goes. I know where I’m using an element name, where I’m using an attribute name, and where I’m comparing or extracting values. If I miss out some quotes, chances are pretty high that the resulting code will be invalid, and it’ll be obvious where the problem is. I’m prepared to sacrifice brevity for the fact that I only work in a single language + library, instead of trying to embed one language within another.

Likewise building XML using LINQ to XML is much better than concatenating strings – I don’t need to worry about any nasty escaping issues, for example. LINQ to XML has been so nicely design, it makes all kinds of things incredibly easy.

Regular expressions can sometimes be replaced by simple string operations. Where they can, I will often do so. I’d rather use a few IndexOf and Substring calls over a regular expression in general – but where the patterns I need get too tricky, I will currently fall back to regular expressions. I’m aware of ReadableRex but I haven’t looked at it in enough detail to say whether it can take the place of "normal" regular expressions in the way that LINQ to XML can so often take the place of XPath.

Of course, LINQ to SQL (and the Entity Framework) do something similar for SQL… although that’s slightly different, and has its own issues around predictability.

In all of these cases, however, the point is that by falling back to more verbose but more native-feeling code, some of the issues of embedding one language within another are removed. Code is still code, data is data again, and the two don’t get mixed up with each other.

Conclusion

If I ever manage to organize these thoughts in a more lucid way, I will probably just rewrite them as another (shorter) post. In the meantime, I’d urge you to think about where your code and data get uncomfortably close.

Writing the perfect question

Please note: this blog post is not a suitable place to post a development question. You should probably be doing that on Stack Overflow – after reading this post, of course. I’ve received a truly astonishing number of comments on this post which are basically misplaced questions. Such comments are simply marked as spam – they don’t belong here.

TL;DR: If you don’t have time to read this post fully, I have a checklist version. I’d strongly encourage you to read this post fully though, when you have time.

I’ve added a tinyurl to this post for easy reference:
http://tinyurl.com/stack-hints. Nice and easy to remember :)


A while ago, I wrote a blog entry on how to answer questions helpfully on sites like Stack Overflow. Recently I saw a meta question about bad questions and thought it would be worth following up with another blog post on asking questions. For the sake of convenience – and as Stack Overflow is so popular – I will assume the question is going to be asked on Stack Overflow or a similar Stack Exchange site. Most of the post doesn’t actually depend on that, but if you’re asking elsewhere you may need to tweak the advice a little.

There are plenty of similar resources around, of course – in particular, How to Ask Questions the Smart Way by Eric Raymond and Rick Moen is a perennial favourite. Still, I think I can bring something to the table.

The Golden Rule: Imagine You’re Trying To Answer The Question

If you don’t remember anything else from this post, remember this bit. Everything else follows from here. (And yes, this does smack somewhat of Matthew 7:12.)

Once you’ve finished writing your question, read it through. Imagine you were coming to it fresh, with no context other than what’s on the screen. Does it make sense? Is it clear what’s being asked? Is it easy to read and understand? Are there any obvious areas you’d need to ask about before providing an answer? You can usually do this pretty well however stuck you are on the actual question. Just apply common sense. If there’s anything wrong with the question when you’re reading it, obviously that will be a problem for whoever’s actually trying to answer it. So fix the problems. Improve the question until you can read it and think, “If I only knew the answer to the question, it would be a pleasure to provide that answer.” At that point, post and wait for the answers to come rolling in.

Obviously this is somewhat easier to do if you have a certain amount of experience answering questions, particularly on the forum where you’re about to post. So, what should you be looking out for?

Question title

When a reader first sees your question, they’re likely to be scrolling down a list of snippets. The most eye-catching part of the snippet will be the title – so use that text wisely. While you can include language or platform information, you should only do so naturally – not as a sort of “header”. For example, this is bad:

Java: Why are bytes signed?

But this is okay:

Why are bytes signed in Java?

Of course, you should also include this information in tags, as it will help people who pay particular attention to specific tags.

Ideally, a question title should be a question – but frankly that’s not always feasible. I would recommend favouring a short, descriptive title which captures the theme of the question without actually being a question instead of really trying to crowbar it into the form of a question when it really doesn’t want to be. That’s not an excuse for laziness though – it’s usually possible to come up with a good title which is genuinely a question.

It’s important that the question title is specific though, and has at least some meaning with no other information. A question such as “Why doesn’t this work?” makes absolutely no sense without the rest of the question. Likewise a “question” title of “Please help” is unlikely to do well.

Context

In most cases, anyone answering the question will need to know what language and platform you’re using. The basics should usually be communicated through tags, but it may very well be worth providing more information:

  • Language version (e.g. C# 4)
  • Platform version (e.g. .NET 3.5; note that this isn’t always implicit from the language version, or vice versa)
  • Operating system, if it could be relevant (e.g. particular permissions issues)
  • Any other relevant software (e.g. database type and version, the IDE you’re using, web server you’re connecting to)
  • Any other constraints. This is particularly important. It’s really annoying to give a perfectly good answer to a question, only to be told that you’re not allowed to use feature X or Y which provide the obvious solution.
    • If you have unusual constraints, it’s worth explaining why. Not only does this answer the obvious follow-up comment, but it also gives more information about what other solutions may not be applicable.

Describe what you’ve already tried and the results of any research. (You have searched for a solution to your problem before asking it, haven’t you? Stack Overflow isn’t meant to replace basic search skills.) Often there will be other people in a similar situation, but the answers didn’t quite match your situation. Just like the above point about unusual constraints, it saves time if you can point out differences between your situation and other common ones. It’s even worth referring to other related questions explicitly – particularly if they’re on the same forum. Aside from anything else, this shows a certain amount of “due diligence” – people are generally more willing to help you if can show you’ve already put some effort in.

You should absolutely make sure that you tag the question appropriately. If you’re not sure which exact tags are appropriate, see what gets auto-suggested and look at samples for each one. If that sounds like a lot of work, just remember how much time you may be able to save yourself in the long run. It gets easier over time, of course.

Problem statement

Make sure it’s obvious what you’re trying to get out of the question. Too many “questions” are actually just statements: when I do X, something goes wrong.

Well, what did you expect it to do? What are you trying to accomplish? What have you already tried? What happened in those attempts? Be detailed: in particular, if something didn’t work, don’t just state that: tell us how it failed. If it threw an exception, what was the exception? (Don’t just give the type – give the error message and say which line threw it. If there’s a nested exception, post that too.)

If at all possible, write a sort of “executive summary” at the start of your question, followed by a more detailed description. Remember that on the list of questions, the first few sentences will appear as a snippet. If you can get a sense of the question across in that snippet, you’re more likely to attract views from people who can answer the question.

One trap that many posters fall into is to ask how to achieve some “small” aim, but never say what the larger aim is. Often the smaller aim is either impossible or rarely a good idea – instead, a different approach is needed. Again, if you provide more context when writing your problem statement, we can suggest better designs. It’s fine to specify how you’re currently trying to solve your bigger problem, of course – that’s likely to be necessary detail – but include the bigger goal too.

Sample code and data

I may be biased on this one. I’m a huge believer in sample code, both for questions and answers… and I probably use it in an unconventional way. I usually paste it into a text editor, and try to compile it from the command line. If that’s not likely to work (and the problem isn’t obvious by inspection), I’m unlikely to bother too much with it. Firing up Eclipse or Visual Studio and finding an existing project I don’t care about or starting a new one is going to take much more time.

That means if you want me to look at code, it should:

  • Be standalone. Don’t try to talk to a database unless you really have to. (Obviously for database questions, that’s kinda hard :) If you use sample XML, provide a short but complete XML file for us to reproduce the issue with. (And the same for other file types, obviously.)
  • Be complete. If there are missing imports or using directives, that’s really annoying
  • Actually compile (unless the compilation error is the reason for the question). Don’t give me code which is “something like” the real code but which clearly isn’t the real code, and may well not exhibit the same symptoms by the time I’ve fixed it so that it compiles.
  • Ideally not bring up a UI. Unless your code is about a UI issue, don’t bring one up. Console apps are simpler, and simplicity is a huge benefit when trying to hunt down a problem.
  • Demonstrate the problem. You should be able to say, “I expected the result to be X, it’s actually Y.” (You should actually say that too, so that we can check that we get the same results.)
  • Be as short as possible. If I have to wade through hundreds of lines of code to find the problem, I’m doing work that you should be doing. Often if you work hard to reduce the problem to a short but complete program, you’ll find the issue yourself. You can absolutely do this without knowing what the problem is; you should be looking to the community for their expertise, not their willingness to spend time on your problem doing the work that you can do yourself.

Yes, this is a relatively onerous list. It doesn’t all apply to every problem, but it does apply in a great many situations. While I get put off by reams of irrelevant, badly formatted code, some of which clearly won’t compile, the inverse is true as well: if I can tell by looking at the question that the code can go through a copy/paste/compile/run cycle really quickly, I’m much more likely to pay the question significant attention.

In data-oriented questions, it’s very often helpful to give some sample data. Cut out anything irrelevant (if your real table has 50 columns, you only need to include relevant ones) but make sure that you give enough sample input for it to be meaningful. For example, if you’re trying to group some data by a PersonID column, it’s pretty useless if there’s only one PersonID given, or if each PersonID only appears once. If you are giving examples of expected input and corresponding output, make sure it’s clear why that’s the expected output. Often I see questions which give a small number of samples, and there are various ways they could be interpreted. This is one area where it’s particularly important to reread the question from a stranger’s point of view: while a brief summary of the desired results may well make sense to someone who already knows what your app is trying to achieve, it may be gobbledygook to those trying to answer your question.

Spelling, grammar and formatting

I know not everyone speaks English natively. My own command of non-English languages is lamentably poor – I’m incredibly lucky that my native tongue happens to be the lingua franca of the technical world. However, if you’re trying to communicate on an English-language forum, you owe it to yourself to make an effort to write at least reasonably correct English.

  • Please use capital letters where appropriate. It really can make a big difference in the readability of text.
  • Please split your text into paragraphs. Imagine this blog post as one big paragraph – it would be almost impossible to read.
  • Please write actual words. There are undoubtedly some abbreviations which are acceptable to most readers – IMO, IIRC etc –  there’s no reason to switch into text-speak with “gr8”, “bcoz”, “u” and so forth. It’s unlikely that you’re actually writing your question on a phone with only a primitive keyboard; show your readers respect by writing properly. It may take you a few more seconds, but if it means you get an answer quicker, it’s surely worth the extra effort.
  • Most browsers have built-in spelling checkers these days, or at least have plug-ins or extensions available to check your text. Technical text often creates a lot of false positives for checkers, but if your spelling isn’t generally great, it’s worth looking at the suggestions.

Having said all of this, you’re not trying to create a literary masterpiece. You’re trying to communicate your question as effectively as possible. If you’re faced with the choice between an unambiguous but ugly sentence, or a phrase which stirs the soul but leaves the reader confused about exactly what you mean, go for the unambiguous option every time.

One way a huge number of questions can be improved with very little effort is simply formatting them properly. Stack Overflow’s markdown editor is very good – the preview below your input box is almost always accurate in terms of the eventual result, and you can always edit the question later if anything doesn’t quite work. The exact details of the markdown is beyond the scope of this article – Stack Overflow has a detailed guide though – if you’re new to the site, I’d recommend you at least skim through it.

By far the most important kind of formatting is making code look like code. Within a text paragraph, simply surround the code with backticks `like this`. For blocks of code, just indent everything by four spaces. If you’re cutting and pasting code, it may already be indented (for example if you’re copying code within a class) but if not, the easiest way to indent everything is to paste it, select the whole code block, and then press Ctrl-K or the “{ }” button just above the editor.

One of the important things about code formatting is that it means angle brackets (and some other symbols) are preserved instead of being swallowed by the markdown formatter. In some cases this can mean all the difference between a question which is easy to answer and one which doesn’t make any sense, particularly in terms of generics in Java and C# or templates in C++. For example, like this

Why can’t I convert an expression of type List<string> to List<object>?

makes no sense at all if the type arguments are removed:

Why can’t I convert an expression of type List to List?

Often experienced members of the site will recognise what’s going on and edit your question for you, but obviously it’s better if they don’t have to.

Making a good impression

Leaving aside the main body of the question, there are a few simple ways to get the community “on your side” and therefore more likely to give you a useful answer quickly.

  • Register as a user and give yourself a meaningful name. It doesn’t have to be your real name, but frankly names like “Top Coder” or “Coding Guru” look pretty silly when you’re asking a question which others find simple. That’s still better than leaving yourself as “user154232” or whatever identifier is assigned to you by default though. Aside from anything else, it shows a certain amount of commitment to the question and/or site: if you’ve bothered to give yourself a name, you’re less likely to be an “ask-and-run” questioner.
  • Keep an eye on your question. There may well be requests for clarification – and of course, answers! If you receive an answer which wasn’t quite what you were looking for, explain carefully (and politely) why it’s not suitable for your purposes. Consider going back and editing your question to make it clearer for subsequent users.
  • Don’t add your own answer unless it really is an answer. Often users add extra details in an “answer” when they should really have just edited their question. Likewise editing your question is generally a better idea than adding a long comment to an existing answer – particularly if that comment contains a block of code (which won’t work well in a comment). If you do change the question in response to an answer though, it’s worth adding a comment to the answer just to let the user know that you’ve updated it though… you may well find they quickly edit their answer to match the revised question.
  • There’s no need to include greetings and sign-offs such as “Hi everyone!” and “Thanks – hope to get an answer soon” in the question. These will often be edited out by other users, as they’re basically a distraction. Greetings at the start of a question are particularly useless as they can take up valuable space in the snippet displayed in the question list.
  • Above all, be polite. Remember that no-one is getting paid to answer your question. Users are giving up their time to help you – so please be appreciative of that. If you’re asking a homework question, explain why you’re asking for help with something that traditionally you’d have to answer all by yourself. If a user suggests that your general approach is wrong and that there’s a better way of doing things, don’t take it personally: they’re trying to help you improve your code. By all means disagree robustly, but don’t start into ad hominem arguments. (This advice applies to answerers as well, of course.)
  • (Somewhat specific to Stack Overflow.) If an answer is particularly helpful or solves your problem, accept it by clicking on the tick mark by it. This gives extra credit to the person who provided that answer, as well as giving more information to future readers.

Conclusion and feedback

Stack Overflow is an amazing resource (along with other Q&A sites, of course). The idea that you can get a good answer to a wide range of questions within minutes is pretty staggering… but there’s an obvious correlation between the quality of a question and the likelihood that you’ll get quick, helpful answers. Put that extra bit of effort in yourself, and it will probably pay for itself very quickly.

I’m hoping to keep this blog post up to date with suggestions received – if I’ve missed out anything, over- or under-emphasized a specific point, or generally gone off track, let me know either in the comments here or mail me (skeet@pobox.com). If this document ends up elsewhere, then that copy may end up being the “canonical” one which is edited over time – in which case I’ll indicate that here.

Reflecting on presentation skills: The Guathon, August 13th 2010

(I apologise for the unstructured nature of this post. I honestly don’t know how to structure it. I’ve thought of a few ways of breaking it up by heading, and none of them really work. Particular apologies to Simon Stewart, who has requested more brevity in my blog. Just for Simon, the executive summary is: Scott Guthrie is a really good speaker. I want to be more like him.)

Yesterday I had the good fortune (well, good friends – thanks Phil!) to attend the Guathon in London. This was a free, day-long event with Scott Guthrie and Mike Ormond, talking about MVC 2 and 3, Visual Studio 2010, Windows Phone 7 and more. This was my encounter with Scott – and indeed the first time I’d seen him present. (I value videos of presentations, but rarely find time to actually watch them, more’s the pity.)

Obviously I was interested in hearing about the technologies they were talking about, but I confess I was more interested in watching how Scott went about presenting. (I’ve seen Mike present before – but clearly Scott was the "big name" here. No offence meant to Mike whatsoever, who did a great job talking about Windows Phone 7.) Scott is a legend in the industry, and as I’m very interested in improving my public speaking skills, I felt I had at least as much to learn in that area as anything else.

I was really impressed. In some ways, Scott didn’t present in a way I’d expected him to… but what he did was so much better. Not having seen him before, I’d sort of expected an utterly polished sort of talk – almost like a Steve Jobs presentation. I was hoping to get some insight into what sort of polish I could add to my presentations: where does it make sense to have photos, where do simpler visuals work, where are words important? How do you present against an enormous screen without losing the audience’s focus? Do jokes enhance a presentation or detract from it?

In retrospect, this was hopelessly naïve. I think Scott’s secret sauce is actually pretty simple: he knows what he’s talking about, and talks about it honestly and openly. He’s completely authentic, obviously passionate about what he does, good humoured (we had a few bits of mild Google/Bing banter), and interested in the audience.

At almost every turn, Scott asked the audience how many of us had used a certain feature, or developed in a certain way. This was then reflected in the level at which he pitched the next section, as well as giving a few opportunities for jokes. There were questions throughout – particularly in Mike’s talk, actually – to the extent that I’d say a good quarter to a third of the time was spent answering the audience. This was a very good thing, in my view – I can’t remember finding any of the questions irrelevant or obvious (I should state for the record that I probably asked more questions than anyone else; apologies if other attendees found my questions to be boring). Questions from the audience are always a good reality check, because they’re clearly addressing real concerns rather than the ones in the speaker’s imagination. But the best thing about the questions was Scott’s way of answering, which could broadly be divided into three types of answer:

  • A known answer: "Yes, you can do X – and you can do Y as well. But you can’t do Z."
  • An unknown answer which was easily testable: "Hmm. I’m not sure. Let’s try it. Ah yes, the code does X." (There were fewer of these, just due to the nature of the questions.)
  • An answer which was unknown but needed further investigation: "Send me a mail and I’ll get back to you about it."

The last one is most interesting – because I have absolutely no doubt that Scott will get back to anyone who sent him a mail. (I’ve sent him two.) Now don’t forget that Scott is a Corporate Vice President (Dev Div). He’s clearly a busy man… but his openness and passion make an enormously positive impression, suggesting that he’s the kind of guy who doesn’t think of himself as being above such questions. Assuming this is what he says at all his presentations (and I suspect it is), I dread to think how much time he spends every day answering emails… but I also suspect that it’s of enormous benefit to the products for which he’s responsible, by keeping the executive level in touch with grass-roots developers.

So, what have I learned from the whole experience, in terms of presentation skills?

  • You can definitely give awesome presentations without fancy graphics. Content is king.
  • There’s no substitute for knowing your stuff, and being honest about when you don’t know the answer.
  • Interaction with the audience is beneficial to everyone.
  • Sitting down and just writing out code – particularly with audience participation to make the demo "belong" to them – is absolutely fine.
  • Scott’s an incredibly nice guy, and it shines through very clearly. I really hope to see him again soon.
  • If you speak clearly, speed doesn’t matter too much: Scott talks really fast, but is very easy to listen to.
  • If you lose a vital file in the middle of a presentation, check the recycle bin. It’s the virtual equivalent of checking down the back of the sofa.
  • Don’t worry if you have more material than you have time to present, particularly if that’s due to audience questions.

Whether I’ll be able to apply this myself remains to be seen… although I’ve already been acutely aware of how much more comfortable I am when presenting on "home topics" (e.g. C# language features) than areas where I have a lot less expertise (e.g. Reactive Extensions).

There’s a hole in my abstraction, dear Liza, dear Liza

I had an interesting day at work today. I thought my code had broken… but it turns out it was just a strange corner case which made it work very slowly. Usually when something interesting happens in my code it’s quite hard to blog about it, because of all the confidentiality issues involved. In this case, it’s extremely easy to reproduce the oddity in an entirely vanilla manner. All we need is the Java collections API.

I have a set – a HashSet, in fact. I want to remove some items from it… and many of the items may well not exist. In fact, in our test case, none of the items in the "removals" collection will be in the original set. This sounds – and indeed is – extremely easy to code. After all, we’ve got Set<T>.removeAll to help us, right?

Let’s make this concrete, and look at a little test. We specify the size of the "source" set and the size of the "removals" collection on the command line, and build both of them. The source set contains only non-negative integers; the removals set contains only negative integers. We measure how long it takes to remove all the elements using System.currentTimeMillis(), which isn’t the world most accurate stopwatch but is more than adequate in this case, as you’ll see. Here’s the code:

import java.util.*;

public class Test
{
    public static void main(String[] args)
    {
        int sourceSize = Integer.parseInt(args[0]);
        int removalsSize = Integer.parseInt(args[1]);
        
        Set<Integer> source = new HashSet<Integer>();
        Collection<Integer> removals = new ArrayList<Integer>();
        
        for (int i = 0; i < sourceSize; i++)
        {
            source.add(i);
        }
        for (int i = 1; i <= removalsSize; i++)
        {
            removals.add(-i);
        }
        
        long start = System.currentTimeMillis();
        source.removeAll(removals);
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end – start) + "ms");
    }
}

Let’s start off by giving it an easy job: a source set of 100 items, and 100 to remove:

c:UsersJonTest>java Test 100 100
Time taken: 1ms

Okay, so we hadn’t expected it to be slow… clearly we can ramp things up a bit. How about a source of one million items1 and 300,000 items to remove?

c:UsersJonTest>java Test 1000000 300000
Time taken: 38ms

Hmm. That still seems pretty speedy. Now I feel I’ve been a little bit cruel, asking it to do all that removing. Let’s make it a bit easier – 300,000 source items and 300,000 removals:

c:UsersJonTest>java Test 300000 300000
Time taken: 178131ms

Excuse me? Nearly three minutes? Yikes! Surely it ought to be easier to remove items from a smaller collection than the one we managed in 38ms? Well, it does all make sense, eventually. HashSet<T> extends AbstractSet<T>, which includes this snippet in its documentation for the removeAll method:

This implementation determines which is the smaller of this set and the specified collection, by invoking the size method on each. If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection. If it is so contained, it is removed from this set with the iterator’s remove method. If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set’s remove method.

Now that sounds reasonable on the surface of it – iterate through the smaller collection, check for the presence in the bigger collection. However, this is where the abstraction is leaky. Just because we can ask for the presence of an item in a large collection doesn’t mean it’s going to be fast. In our case, the collections are the same size – but checking for the presence of an item in the HashSet is O(1) whereas checking in the ArrayList is O(N)… whereas the cost of iterating is going to be the same for each collection. Basically by choosing to iterate over the HashSet and check for presence in the ArrayList, we’ve got an O(M * N) solution overall instead of an O(N) solution. Ouch. The removeAll method is making an "optimization" based on assumptions which just aren’t valid in this case.

Then fix it, dear Henry, dear Henry, dear Henry

There are two simple ways of fixing the problem. The first is to simply change the type of the collection we’re removing from. Simply changing ArrayList<Integer> to HashSet<Integer> gets us back down to the 34ms range. We don’t even need to change the declared type of removals.

The second approach is to change the API we use: if we know we want to iterate over removals and perform the lookup in source, that’s easy to do:

for (Integer value : removals)
{
    source.remove(value);
}

In fact, on my machine that performs slightly better than removeAll – it doesn’t need to check the return value of remove on each iteration, which removeAll does in order to return whether or not any items were removed. The above runs in about 28ms. (I’ve tested it with rather larger datasets, and it really is faster than the dual-hash-set approach.)

However, both of these approaches require comments in the source code to explain why we’re not using the most obvious code (a list and removeAll). I can’t complain about the documentation here – it says exactly what it will do. It’s just not obvious that you need to worry about it, until you run into such a problem.

So what should the implementation do? Arguably, it really needs to know what’s cheap in each of the collections it’s dealing with. The idea of probing for performance characteristics before you decide on a strategy is completely anathema to clean abstraction we like to consider with frameworks like Java collections… but maybe in this case it would be a good idea.


1 Please perform Dr Evil impression while reading this. I’m watching you through your webcam, and I’ll be disappointed if I don’t see you put your little finger to your mouth.

Iterate, damn you!

Do you know the hardest thing about presenting code with surprising results? It’s hard to do so without effectively inviting readers to look for the trick. Not that that’s always enough – I failed the last of Neal and Eric’s C# puzzlers at NDC, for example. (If you haven’t already watched the video, please do so now. It’s far more entertaining than this blog post.) Anyway, this one may be obvious to some of you, but there are some interesting aspects even when you’ve got the twist, as it were.

What does the following code print?

using System;
using System.Collections.Generic;

public class WeirdIterators
{
    static void ShowNext(IEnumerator<int> iterator)
    {
        if (iterator.MoveNext())
        {
            Console.WriteLine(iterator.Current);
        }
        else
        {
            Console.WriteLine("Done");
        }
    }
    
    static void Main()
    {
        List<int> values = new List<int> { 1, 2 };
        using (var iterator = values.GetEnumerator())
        {
            ShowNext(iterator);
            ShowNext(iterator);
            ShowNext(iterator);
        }
    }
}

If you guessed "1, 2, Done" despite the title of the post and the hints that it was surprising, then you’re at least brave and firm in your beliefs. I suspect most readers will correctly guess that it prints "1, 1, 1" – but I also suspect some of you won’t have worked out why.

Let’s look at the signature of List<T>.GetEnumerator(). We’d expect it to be

public IEnumerator<T> GetEnumerator()

right? That’s what IEnumerable<T> says it’ll look like. Well, no. List<T> uses explicit interface implementation for IEnumerable<T>. The signature actually looks like this:

public List<T>.Enumerator GetEnumerator()

Hmm… that’s an unfamiliar type. Let’s have another look in MSDN

[SerializableAttribute]
public struct Enumerator : IEnumerator<T>, 
    IDisposable, IEnumerator

(It’s nested in List<T> of course.) Now that’s odd… it’s a struct. You don’t see many custom structs around, beyond the familiar ones in the System namespace. And hang on, don’t iterators fundamentally have to be mutable.

Ah. "Mutable value type" – a phrase to strike terror into the hearts of right-headed .NET developers everywhere.

So what’s going on? If we’re losing all the changes to the value, why is it printing "1, 1, 1" instead of throwing an exception due to printing out Current without first moving?

Well, we’re fetching the iterator into a variable of type List<int>.Enumerator, and then calling ShowNext() three times. On each call, the value is boxed (creating a copy), and the reference to the box is passed to ShowNext().

Within ShowNext(), the value within the box changes when we call MoveNext() – which is how it’s able to get the real first element with Current. So that mutation isn’t lost… until we return from the method. The box is now eligible for garbage collection, and no change has been made to the iterator variable’s value. On the next call to ShowNext(), a new box is created and we see the first item again…

How can we fix it?

There are various things we can do to fix the code – or at least, to make it display "1, 2, Done". We can then find other ways of breaking it again :)

Change the type of the values variable

How does the compiler work out the type of the iterator variable? Why, it looks at the return type of values.GetEnumerator(). And how does it find that? It looks at the type of the values variable, and then finds the GetEnumerator() method. In this case it finds List<int>.GetEnumerator(), so it makes the iterator variable type List<int>.Enumerator.

If suppose just change values to be of type IList<int> (or IEnumerable<int>, or ICollection<int>):

IList<int> values = new List<int> { 1, 2 };

The compiler uses the interface implementation of GetEnumerator() on List<T>. Now that could return a different type entirely – but it actually returns a boxed List<T>.Enumerator. We can see that by just printing out iterator.GetType().

So if it’s just returning the same value as before, why does it work?

Well, this time we’re boxing once – the iterator gets boxed on its way out of the GetEnumerator() method, and the same box is used for all three calls to ShowNext(). No extra copies are created, and the changes within the box don’t get lost.

Change the type of the iterator variable

This is exactly the same as the previous fix – except we don’t need to change the type of values. We can just explicitly state the type of iterator:

using (IEnumerator<int> iterator = values.GetEnumerator())

The reason this works is the same as before – we box once, and the changes within the box don’t get lost.

Pass the iterator variable by reference

The initial problem was due to the mutations involved in ShowNext() getting lost due to repeated boxing. We’ve seen how to solve it by reducing the number of boxing operations down to one, but can we remove them entirely?

Well yes, we can. If we want changes to the value of the parameter in ShowNext() to be propagated back to the caller, we just need to pass the variable by reference. When passing by reference the parameter and argument types have to match exactly of course, so we can’t leave the iterator variable being type List<T>.Enumerator without changing the parameter type. Now we could explicitly change the type of the parameter to List<T>.Enumerator – but that would tie our implementation down rather a lot, wouldn’t it? Let’s use generics instead:

static void ShowNext<T>(ref T iterator)
    where T : IEnumerator<int>

Now we can pass iterator by reference and the compiler will infer the type. The interface members (MoveNext() and Current) will be called using constrained calls, so there’s no boxing involved…

… except that when you try to just change the method calls to use ref, it doesn’t work – because apparently you can’t pass a "using variable" by reference. I’d never come across that rule before. Interesting. Fortunately, we can (roughly) expand out the using statement ourselves, like this:

var iterator = values.GetEnumerator();
try
{
    ShowNext(ref iterator);
    ShowNext(ref iterator);
    ShowNext(ref iterator);
}
finally
{
    iterator.Dispose();
}

Again, this fixes the problem – and this time there’s no boxing involved.

Let’s quickly look at one more example of it not working, before I finish…

Dynamic typing to the anti-rescue

What happens if we change the type of iterator to dynamic (and set everything else back the way it was)? I’ll readily admit, I really didn’t know what was going to happen here. There are two competing forces:

  • The dynamic type is often really just object behind the scenes… so it will be boxed once, right? That means the changes within the box won’t get lost. (This would give "1, 2, Done")
  • The dynamic type is in many ways meant to act as if you’d declared a variable of the type which it actually turns out to be at execution time – so in this case it should work as if the variable was of type List<int>.Enumerator, just like our original code. (This would give "1, 1, 1")

What actually happens? I believe it actually boxes the value returned from GetEnumerator() – and then the C# binder DLR makes sure that the value type behaviour is preserved by copying the box before passing it to ShowNext(). In other words, both bits of intuition are right, but the second effectively overrules the first. Wow. (See the comment below from Chris Burrows for more information about this. I’m sure he’s right that it’s the only design that makes sense. This is a pretty pathological example in various ways.)

Conclusion

Just say "no" to mutable value types. They do weird things.

(Fortunately the vast majority of the time this particular one won’t be a problem – it’s rare to use iterators explicitly in the first place, and when you do you very rarely pass them to another method.)

Degrees of reality in sample code

Yesterday I tweeted a link to an article about overloading that I’d just finished. In that article, all my examples look a bit like this:

using System;

class Test
{
    static void Foo(int x, int y = 5)
    {
        Console.WriteLine("Foo(int x, int y = 5)");
    }
    
    static void Foo(double x)
    {
        Console.WriteLine("Foo(double x)");
    }

    static void Main()
    {
        Foo(10);
    }
}

Each example is followed by an explanation of the output.

Fairly soon afterwards, I received an email from a reader who disagreed with my choices for sample code. ere are a few extracts from the email exchange. Please read them carefully – they really form the context of the rest of this post.

This is really not proper. When a method can do more than one thing, you might offer what are called ‘convenience overloads’, which make it easier for the consuming developer. When you start swaying away so much that you have wildly different arguments, then it’s probably time to refactor and consider creating a second method. With your example with "Foo", it’s hard to tell which is the case.

My point is, the ‘convenience overloads’ should all directly or indirectly call the one REAL method. I’m not a fan of "test", "foo", and "bar", because they rarely make the point clearer, and often make it more confusing. So let me use something more realistic. So let me use something more realistic. This nonsensical example, but hopefully is clear:

...

The point here was to make you aware of the oversight. I do what I can to try to stop bad ideas from propagating, particularly now that you're writing books. When developers read your book and consider it an "authority" on the topic, they take your example as if it's a model for what they should do. I just hope your more mindful of that in your code samples in the future.

...

Specific to this overload issue, this has come up many times for me. Developers will write 3 overloads that do wildly different things or worse, will have 98% of the same code repeated. We try to catch this in a code review, but sometimes we will get pushback because they read it in a book (hence, my comments).

...

I assume your audience is regular developer, right? In other words, the .NET Framework developers at Microsoft perhaps aren't the ones reading your books, but it's thousands of App Developer I and App Developer II that do business development? I just mean that there are far, far more "regular developers" than seasoned, expert developers who will be able to discern the difference and know what is proper. You are DEFINING what is proper in your book, you become an authority on the matter!

Anyhow, all my point was it to realize how far your influence goes once you become an author. Even the simplest, throwaway example can be seen as a best-practice beacon.

Now, this gave me pause for thought. Indeed, I went back and edited the overloading article - not to change the examples, but to make the article's scope clearer. It's describing the mechanics of overloading, rather than suggesting when it is and isn't appropriate to use overloading at all.

I don't think I'm actually wrong here, but I wanted to explore it a little more in this post, and get feedback. First I'd like to suggest a few categorizations - these aren't the only possible ones, of course, but I think they divide the spectrum reasonably. Here I'll give example examples in another area: overriding and polymorphism. I'll just describe the options first, and then we can talk about the pros and cons afterwards.

Totally abstract - no code being presented at all

Sometimes we talk about code without actually giving any examples at all. In order to override a member, it has to be declared as `virtual` in a base class, and then the overriding member uses the `override `modifier. When the virtual member is called, it is dispatched to the most specific implementation which overrides it, even if the caller is unaware of the existence of the implementation class.

Working but pointless code

This is the level my overloading article worked at. Here, you write code whose sole purpose is to demonstrate the mechanics of the feature you're describing. So in this case we might have:

using System;

public class C1
{
    public virtual void M()
    {
        Console.WriteLine("C1.M");
    }
}

public class C2 : C1
{
    public override void M()
    {
        Console.WriteLine("C2.M");
    }
}

public class C3
{
    static void Main()
    {
        C1 c = new C2();
        c.M();
    }
}

Now this is a reasonably extreme example; as a matter of personal preference I tend to use class names like "Test" or "Program" as the entry point, perhaps "BaseClass" and "DerivedClass" where "C1" and "C2" are used here, and "Foo" instead of "M" for the method name. Obviously "Foo" has no more real meaning than "M" as a name - I just get uncomfortable for some reason around single character identifiers other than for local variables. Arguably "M" is better as it stands for "method" and I could use "P" for a property etc. Whatever we choose, we're talking about metasyntactic variables really.

Complete programs indicative of design in a non-business context

This is the level at which I would probably choose to demonstrate overriding. It's certainly the one I've used for talking about generic variance. Here, the goal is to give the audience a flavour of the purpose of the feature as well as demonstrating the mechanics, but to stay in the simplistic realm of non-business examples. To adapt one of my normal examples - where I'd actually use an interface instead of an abstract class - we might end up with an example like this:

using System;
using System.Collections.Generic;

public abstract class Shape
{
    public abstract double Area { get; }
}

public class Square : Shape
{
    private readonly double side;
    
    public Square(double side)
    {
        this.side = side;
    }
    
    public override double Area { get { return side * side; } }
}

public class Circle : Shape
{
    public readonly double radius;
    
    public Circle(double radius)
    {
        this.radius = radius;
    }
    
    public override double Area { get { return Math.PI * radius * radius; } }
}

public class ShapeDemo
{
    static void Main()
    {
        List<Shape> shapes = new List<Shape>
        {
            new Square(10),
            new Circle(5)
        };
        foreach (Shape shape in shapes)
        {
            Console.WriteLine(shape.Area);
        }
    }
}

Now these are pretty tame shapes - they don't even have a location. If I were really going to demonstrate an abstract class I might try to work out something I could do in the base class to make it sensibly a non-interface... but at least we're demonstrating the property being overridden.

Business-like partial example

Here we'll use classes which sound like they could be in a real business application... but we won't fill in all the useful logic, or worry about any properties that aren't needed for the demonstation.

using System;
using System.Collections.Generic;

public abstract class Employee
{
    private readonly DateTime joinDate;
    private readonly decimal salary;
    
    // Most employees don't get bonuses any more
    public virtual int BonusPercentage { get { return 0; } }
    
    public decimal Salary { get { return salary; } }
    public DateTime JoinDate { get { return joinDate; } }
    
    public int YearsOfService
    {
        // TODO: Real calculation
        get { return DateTime.Now.Year - joinDate.Year; }
    }
    
    public Employee(decimal salary, DateTime joinDate)
    {
        this.salary = salary;
        this.joinDate = joinDate;
    }
}

public abstract class Manager : Employee
{
    // Managers always get a 15% bonus
    public override int BonusPercentage { get { return 15; } }
}

public abstract class PreIpoContract : Employee
{
    // The old style contracts were really generous
    public override int BonusPercentage
    {
        get { return YearsOfService * 2; }
    }
}

Now this particular code sample won't even compile: we haven't provided the necessary constructors in the derived classes. Note how the employees don't have names, and there are no relationships between employees and their managers, either.

Obviously we could have filled in all the rest of the code, ending up with a complete solution to an imaginary business need. Other examples at this level may well include customers and orders. One interesting thing to note here: admittedly I've only been working in the industry for 16 years, and only 12 years full time, but I don't think I've ever written a Customer or Order class as part of my job.

Full application example

No, I'm not going to provide an example of this. Usually this is the sort of thing which a book might work up to over the course of the complete text, and you'll end up with a wiki, or an e-commerce site, or an indexed library of books with complete web site around it. If you think I'm going to spend days or even weeks coding something like that just for this blog post, you'll be disappointed :)

Anyway, the idea of this is that it does something genuinely useful, and you can easily lift whole sections of it into other projects - or at least the design of it.

Which approach is best?

I'm sure you know what's coming here: it depends. In particular, I believe it depends on:

Your readership

Are they likely to copy and paste your example into production code without further thought? Arguably in that case the first option might be the best: they may not understand it, but at least it means your code won't be injuring a project.

Simply put, didactic code is not production code. The parables in the Bible aren't meant to be gripping stories with compelling characterization: they're meant to make a point. Scales aren't meant to sound like wonderful music: they're meant to help you improve your abilities to make a nice sound when you're playing real music.

The point you're trying to put across

If I'm trying to explain the mechanics of a feature, I find the second option to be useful. The reader doesn't need to try to take in the context of what the code is trying to accomplish, because it's explicitly not trying to do anything of any use. It's just demonstrating how the language or platform behaves in a particular scenario.

If, on the other hand, you're trying to explain a design principle, then the third or fourth options are useful. The third option can also be useful for the mechanics of a feature which is particularly abstract - like generic variance, as I mentioned earlier. That goes somewhere between "complete guide to where this feature should be used" and "no guidance whatsoever" - a sort of "here's a hint at the kind of situation where it could be useful."

If you're trying to explore a technology for fun, I find the third option works very well for that situation too. For example, while looking at Reactive Extensions, I've written programs to:

  • Group lines in a file by length
  • Give the results of a UK general election
  • Simulate the 1998 Brazil vs Norway world cup football match
  • Implement drag and drop using event filtering

None of these is likely to be directly useful in a real business app - but they were more appealing than solely demonstrating a sequence of numbers being generated (although with an appropriate marble diagram generator, that can be quite fun too).

The technology you're demonstrating

This is clearly related to the previous point, but I think it bears a certain amount of separation. I believe that language topics are fairly easily demonstrated with the second and third options. Library topics often deserve a slightly higher level of abstraction - and if you're going to try to demonstrate that a whole platform is worth investing time and energy in, it's useful to have something pretty real-world to show off.

Your time and skills

You know what? I suck at the fourth and fifth options here. I can't remember writing any complete, independent systems as a software engineer, and none of them have been in line-of-business applications anyway. The closest I've come is writing standalone tools which certainly have been useful, but often take shortcuts in terms of design which I wouldn't countenance in other applications. (And yes, I'm sure there's some discussion to be had around that as well, but it's not the point of this article.)

You may think my employee example above was lousy - and I'd agree with you. It's not really a great fit for inheritance, in my view - and the bonus calculation is certainly a dubious way of forcing in some polymorphism. But it was the best I could come up with in the time available to me. This wasn't some attempt to make it appear less worthy than the other options; I really am that bad at coming up with business-like examples. Other authors (by which I mean anyone writing at all, not just book authors) may well have found much better examples, either by spending more time on them, being more experienced with line-of-business apps, or having a better imagination. Or all three.

I'm not too proud to admit the things I suck at :) If I spent many extra hours coming up with examples for everything I write about, I would get a lot less written. I'm doing this in notional "spare time" after all. So even if you would prefer the fourth option over the third, would you rather have that but see less of my (ahem) "wisdom"? Personally I think everyone's better off with me braindumping using examples in forms which I'm better at.

How to read examples

Most of this post has been from the point of view of an author. Briefly, I'd like to suggest what this might mean for readers. The onus is on the author to make this clear, of course, but I think it's worth trying to be actively better readers ourselves.

  • Understand what the author is trying to achieve. Don't assume that every example will fit nicely in your application. Example code often doesn't come with any argument validation or error handling - and very rarely does it have an appropriate set of unit tests. If you're reading about how something works, don't assume that the examples are in any way realistic. They may well be simplified to demonstrate the behaviour as clearly as possible without the extra "fluff" of useful functionality.
  • Think about what may be missing, particularly if the context is an evangelical one. If someone is trying to sell you on a particular technology, then of course they'll try to show it in its best possible light. Where are the pitfalls? Where does it not stack up?
  • Don't assume authority means anything. I was quite happy to take Jeffrey Richter to task on boxing for example. Jeffrey Richter is a fabulous author and clearly a smart cookie, but that doesn't mean he's right about everything... and I really, really don't like the idea of anyone appealing to my supposed abilities to justify some bad decision. Judge any argument on its merits... find out what people think and why they think it, but then see how well their reasoning actually hangs together.

Conclusion

This was always going to be a somewhat biased look at this topic, because I hold a certain viewpoint which is clearly contrary to the one held by the chap who emailed me. That's why I included a reasonable chunk of his emails - to give at least some representation to the alternatives. This post has effectively been a longwinded justification of the form my examples have taken... but does it ring true?

I can't guarantee to change my writing style drastically on this front - at least not quickly - but I would very much appreciate your thoughts on this. I'm reluctant to exaggerate, but I think it may be even more important than working out whether "Jedi" was meant to be plural or singular - and I certainly received a lot of feedback on that topic.