Category Archives: Edulinq

Reimplementing LINQ to Objects: Part 7 – Count and LongCount

Today’s post covers two operators in one, because they’re so incredibly similar… to the point cut and paste of implementation, merely changing the name, return type, and a couple of variables.

What are they?

Count and LongCount each have two overloads: one with a predicate, and one without. Here are all four signatures:

public static int Count<TSource>(
    this IEnumerable<TSource> source)

public static int Count<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static long LongCount<TSource>(
    this IEnumerable<TSource> source)

public static long LongCount<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

As you can see, the LongCount signatures are identical to Count except in terms of their return types, which are long (Int64) instead of int (Int32).

The overloads without a predicate parameter simply return the number of items in the source collection; the ones with a predicate return the number of items for which that predicate returns true.

Interesting aspects of the behaviour:

  • These are all extension methods on IEnumerable<T> – you might argue that for the versions without a predicate, it would have been better to extend the non-generic IEnumerable, as nothing actually requires the element type.
  • Where there’s no predicate, Count is optimized for ICollection<T> and (in .NET 4) ICollection – both of which have Count properties which are expected to be faster than iterating over the entire collection. LongCount is not optimized in the same way in the .NET implementation – I’ll discuss this in the implementation section.
  • No optimization is performed in the overloads with predicates, as basically there’s no way of telling how many items will "pass" the predicate without testing them.
  • All methods use immediate execution – nothing is deferred. (If you think about it, there’s nothing which can be deferred here, when we’re just returning an int or a long.)
  • All arguments are validated simply by testing they’re non-null
  • Both methods should throw OverflowException when given a collection with more items than they can return the count of… though this is a considerably larger number in the case of LongCount than Count, of course.

What are we going to test?

In some senses, there are only two "success" tests involved here: one without a predicate and one with. Those are easy enough to deal with, but we also want to exercise the optimized paths. That’s actually trickier than it might sound, as we want to test four situations:

  • A source which implements both ICollection<T> and ICollection (easy: use List<T>)
  • A source which implements ICollection<T> but not ICollection (reasonably easy, after a little work finding a suitable type: use HashSet<T>)
  • A source which implements ICollection but not ICollection<T> but still implements IEnumerable<T> (so that we can extend it) – tricky…
  • A source which doesn’t implement ICollection or ICollection<T> (easy: use Enumerable.Range which we’ve already implemented)

The third bullet is the nasty one. Obviously there are plenty of implementations of ICollection but not ICollection<T> (e.g. ArrayList) but because it doesn’t implement IEnumerable<T>, we can’t call the Count extension method on it. In the end I wrote my own SemiGenericCollection class.

Once we’ve got sources for all those tests, we need to decide what we’re actually testing about them. Arguably we should test that the result is optimized, for example by checking that we never really enumerate the collection. That would require writing custom collections with GetEnumerator() methods which threw exceptions, but still returned a count from the Count property. I haven’t gone this far, but it’s another step we certainly could take.

For the overloads which take predicates, we don’t need to worry about the various collection interfaces as we’re not optimizing anyway.

The failure cases for null arguments are very simple, but there’s one other case to consider: overflow. For Count, I’ve implemented a test case to verify the overflow behaviour. Unfortunately we can’t run this test in the Edulinq implementation yet, as it requires Enumerable.Concat, but here it is for the record anyway:

[Test]
[Ignore("Takes an enormous amount of time!")]
public void Overflow()
{
    var largeSequence = Enumerable.Range(0, int.MaxValue)
                                  .Concat(Enumerable.Range(0, 1));
    Assert.Throws<OverflowException>(() => largeSequence.Count());
}

This guards against a bad implementation which overflows by simply wrapping the counter round to Int32.MinValue instead of throwing an exception.

As you can see, this test will be disabled even when it’s uncommented after we implement Concat, as it requires counting up to 2 billion – not great for a quick set of unit tests. Even that isn’t too bad, however, compared with the equivalent in LongCount which would have to count 2^63 items. Generating such a sequence isn’t difficult, but iterating over it all would take a very long time. We also need an equivalent test for the overload with a predicate – something I neglected until writing up this blog post, and finding a bug in the implementation as I did so :)

For LongCount, I merely have an equivalent test to the above which checks that the same sequence can have its length expressed as a long value.

Let’s implement them!

We’ll look at the overload which does have a predicate first – as it’s actually simpler:

public static int Count<TSource>(this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }

    // No way of optimizing this
    checked
    {
        int count = 0;
        foreach (TSource item in source)
        {
            if (predicate(item))
            {
                count++;
            }
        }
        return count;
    }
}

Note that this time we’re not using an iterator block (we’re not returning a sequence), so we don’t need to split the implementation into two different methods just to get eager argument validation.

After the argument validation, the main part of the method is reasonably simple, with one twist: we’re performing the whole iteration within a "checked" context. This means that if the increment of count overflows, it will throw an OverflowException instead of wrapping round to a negative number. There are some other alternatives here:

  • We could have made just the increment statement checked instead of the whole second part of the method
  • We could have explicitly tested for count == int.MaxValue before incrementing, and thrown an exception in that case
  • We could just build the whole assembly in a checked context

I think it’s useful for this section of code to be explicitly checked – it makes it obvious that it really is a requirement for general correctness. You may well prefer to make only the increment operation checked – I personally believe that the current approach draws more attention to the checked-ness, but it’s definitely a subjective matter. It’s also possible that an explicit check could be faster, although I doubt it – I haven’t benchmarked either approach.

Other than the predicate-specific parts, all the above code also appears in the optimized Count implementation – so I won’t discuss those again. Here’s the full method:

public static int Count<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    // Optimization for ICollection<T>
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        return genericCollection.Count;
    }

    // Optimization for ICollection
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        return nonGenericCollection.Count;
    }

    // Do it the slow way – and make sure we overflow appropriately
    checked
    {
        int count = 0;
        using (var iterator = source.GetEnumerator())
        {
            while (iterator.MoveNext())
            {
                count++;
            }
        }
        return count;
    }
}

The only "new" code here is the optimization. There are effectively two equivalent blocks, just testing for different collection interface types, and using whichever one it finds first (if any). I don’t know whether the .NET implementation tests for ICollection or ICollection<T> first – I could test it by implementing both interfaces but returning different counts from each, of course, but that’s probably overkill. It doesn’t really matter for well-behaved collections other than the slight performance difference – we want to test the "most likely" interface first, which I believe is the generic one.

To optimize or not to optimize?

The LongCount implementations are exactly the same as those for Count, except using long instead of int.

Notably, I still use optimizations for ICollection and ICollection<T> – but I don’t believe the .NET implementation does so. (It’s easy enough to tell by creating a huge list of bytes and comparing the time taken for Count and LongCount.)

There’s an argument for using Array.GetLongLength when the source is an array… but I don’t think the current CLR supports arrays with more than Int32.MaxValue elements anyway, so it’s a bit of a non-issue other than for future-proofing. Beyond that, I’m not sure why the .NET implementation isn’t optimized. It’s not clear what an ICollection/ICollection<T> implementation is meant to return from its Count property if it has more than Int32.MaxValue elements anyway, to be honest.

Suggestions as to what I should have done are welcome… but I should probably point out that LongCount is more likely to be used against Queryable than Enumerable – it’s easy to imagine a service representing a collection (such as a database table) which can quickly tell you the count even when it’s very large. I would imagine that there are relatively few cases where you have a collection to evaluate in-process where you really just want to iterate through the whole lot just to get the count.

Conclusion

These are our first LINQ operators which return scalar values instead of sequences – with the natural consequence that they’re simpler to understand in terms of control flow and timing. The methods simply execute – possibly with some optimization – and return their result. Nice and simple. Still, we’ve seen there can still be a few interesting aspects to consider, including questions around optimization which don’t necessarily have a good answer.

Next time, I think I’ll implement Concat – mostly so that I can uncomment the overflow tests for Count. That’s going back to an operator which returns a sequence, but it’s a really simple one…

Reimplementing LINQ to Objects: Part 6 – Repeat

A trivial method next, with even less to talk about than "Empty"… "Repeat". This blog post is merely a matter of completeness.

What is it?

"Repeat" is a static, generic non-extension method with a single overload:

public static IEnumerable<TResult> Repeat<TResult>(
    TResult element,
    int count)

It simple returns a sequence which contains the specified element, repeated "count" times. The only argument validation is that "count" has to be non-negative.

What are we going to test?

There’s really not a lot to test here. I’ve thought of 4 different scenarios:

  • A vanilla "repeat a string 3 times" sequence
  • An empty sequence (repeat an element 0 times)
  • A sequence containing null values (just to prove that "element" can be null)
  • A negative count to prove that argument validation occurs, and does so eagerly.

None of this is remotely exciting, I’m afraid.

Let’s implement it!

Just about the only thing we could do wrong here is to put the argument validation directly in an iterator block… and we’ve implemented the "split method" pattern so many times already that we wouldn’t fall into that trap. So, here’s the code in all its tedious lack of glory:

public static IEnumerable<TResult> Repeat<TResult>(TResult element, int count)
{
    if (count < 0)
    {
        throw new ArgumentOutOfRangeException("count");
    }
    return RepeatImpl(element, count);
}

private static IEnumerable<TResult> RepeatImpl<TResult>(TResult element, int count)
{
    for (int i = 0; i < count; i++)
    {
        yield return element;
    }
}

That’s it. Um, interesting points to note… none.

Conclusion

There’s no sense in dragging this out. That’s the lot. Next up, Count and LongCount – which actually do have a couple of interesting points.

Reimplementing LINQ to Objects: Part 5 – Empty

Continuing with the non-extension methods, it’s time for possibly the simplest LINQ operator around: "Empty".

What is it?

"Empty" is a generic, static method with just a single signature and no parameters:

public static IEnumerable<TResult> Empty<TResult>()

It returns an empty sequence of the appropriate type. That’s all it does.

There’s only one bit of interesting behaviour: Empty is documented to cache an empty sequence. In other words, it returns a reference to the same empty sequence every time you call it (for the same type argument, of course).

What are we going to test?

There are really only two things we can test here:

  • The returned sequence is empty
  • The returned sequence is cached on a per type argument basis

I’m using the same approach as for Range to call the static method, but this time with an alias of EmptyClass. Here are the tests:

[Test]
public void EmptyContainsNoElements()
{
    using (var empty = EmptyClass.Empty<int>().GetEnumerator())
    {
        Assert.IsFalse(empty.MoveNext());
    }
}

[Test]
public void EmptyIsASingletonPerElementType()
{
    Assert.AreSame(EmptyClass.Empty<int>(), EmptyClass.Empty<int>());
    Assert.AreSame(EmptyClass.Empty<long>(), EmptyClass.Empty<long>());
    Assert.AreSame(EmptyClass.Empty<string>(), EmptyClass.Empty<string>());
    Assert.AreSame(EmptyClass.Empty<object>(), EmptyClass.Empty<object>());

    Assert.AreNotSame(EmptyClass.Empty<long>(), EmptyClass.Empty<int>());
    Assert.AreNotSame(EmptyClass.Empty<string>(), EmptyClass.Empty<object>());
}

Of course, that doesn’t verify that the cache isn’t per-thread, or something like that… but it’ll do.

Let’s implement it!

The implementation is actually slightly more interesting than the description so far may suggest. If it weren’t for the caching aspect, we could just implement it like this:

// Doesn’t cache the empty sequence
public static IEnumerable<TResult> Empty<TResult>()
{
    yield break;
}

… but we want to obey the (somewhat vaguely) documented caching aspect too. It’s not really hard, in the end. There’s a very handy fact that we can use: empty arrays are immutable. Arrays always have a fixed size, but normally there’s no way of making an array read-only… you can always change the value of any element. But an empty array doesn’t have any elements, so there’s nothing to change. So, we can reuse the same array over and over again, returning it directly to the caller… but only if we have an empty array of the right type.

At this point you may be expecting a Dictionary<Type, Array> or something similar… but there’s another useful trick we can take advantage of. If you need a per-type cache and the type is being specific as a type argument, you can use static variables in a generic class, because each constructed type will have a distinct set of static variables.

Unfortunately, Empty is a generic method rather than a non-generic method in a generic type… so we’ve got to create a separate generic type to act as our cache for the empty array. That’s easy to do though, and the CLR takes care of initializing the type in a thread-safe way, too. So our final implementation looks like this:

public static IEnumerable<TResult> Empty<TResult>()
{
    return EmptyHolder<TResult>.Array;
}
        
private static class EmptyHolder<T>
{
    internal static readonly T[] Array = new T[0];       
}

That obeys all the caching we need, and is really simple in terms of lines of code… but it does mean you need to understand how generics work in .NET reasonably well. In some ways this is the opposite of the situation in the previous post – this is a sneaky implementation instead of the slower but arguably simpler dictionary-based one. In this case, I’m happy with the trade-off, because once you do understand how generic types and static variables work, this is simple code. It’s a case where simplicity is in the eye of the beholder.

Conclusion

So, that’s Empty. The next operator – Repeat – is likely to be even simpler, although it’ll have to be another split implementation…

Addendum

Due to the minor revolt over returning an array (which I still think is fine), here’s an alternative implementation:

public static IEnumerable<TResult> Empty<TResult>()
{
    return EmptyEnumerable<TResult>.Instance;
}

#if AVOID_RETURNING_ARRAYS
private class EmptyEnumerable<T> : IEnumerable<T>, IEnumerator<T>
{
    internal static IEnumerable<T> Instance = new EmptyEnumerable<T>();

    // Prevent construction elsewhere
    private EmptyEnumerable()
    {
    }

    public IEnumerator<T> GetEnumerator()
    {
        return this;
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return this;
    }

    public T Current
    {
        get { throw new InvalidOperationException(); }
    }

    object IEnumerator.Current
    {
        get { throw new InvalidOperationException(); }
    }

    public void Dispose()
    {
        // No-op
    }

    public bool MoveNext()
    {
        return false// There’s never a next entry
    }

    public void Reset()
    {
        // No-op
    }
}

#else
private static class EmptyEnumerable<T>
{
    internal static readonly T[] Instance = new T[0];       
}
#endif

Hopefully now everyone can build a version they’re happy with :)

Reimplementing LINQ to Objects: Part 4 – Range

This will be a short post, and there’ll probably be some more short ones coming up too. I think it makes sense to only cover multiple operators in a single post where they’re really similar. (Count and LongCount spring to mind.) I’m in your hands though – if you would prefer "chunkier" posts, please say so in the comments.

This post will deal with the Range generation operator.

What is it?

Range only has a single signature:

public static IEnumerable<int> Range(
    int start,
    int count)

Unlike most of LINQ, this isn’t an extension method – it’s a plain old static method. It returns an iterable object which will yield "count" integers, starting from "start" and incrementing each time – so a call to Enumerable.Range(6, 3) would yield 6, then 7, then 8.

As it doesn’t operate on an input sequence, there’s no sense in which it could stream or buffer its input, but:

  • The arguments need to be validated eagerly; the count can’t be negative, and it can’t be such that any element of the range could overflow Int32.
  • The values will be yielded lazily – Range should be cheap, rather than creating (say) an array of "count" elements and returning that.

How are we going to test it?

Testing a plain static method brings us a new challenge in terms of switching between the "normal" LINQ implementation and the Edulinq one. This is an artefact of the namespaces I’m using – the tests are in Edulinq.Tests, and the implementation is in Edulinq. "Parent" namespaces are always considered when the compiler tries to find a type, and they take priority over anything in using directives – even a using directive which tries to explicitly alias a type name.

The (slightly ugly) solution to this that I’ve chosen is to include a using directive to create an alias which couldn’t otherwise be resolved – in this case, RangeClass. The using directive will either alias RangeClass to System.Linq.Enumerable or Edulinq.Enumerable. The tests then all use RangeClass.Range. I’ve also changed how I’m switching between implementations – I now have two project configurations, one of which defines the NORMAL_LINQ preprocessor symbol, and the other of which doesn’t. The RangeTest class therefore contains:

#if NORMAL_LINQ
using RangeClass = System.Linq.Enumerable;
#else
using RangeClass = Edulinq.Enumerable;
#endif

There are alternatives to this approach, of course:

  • I could move the tests to a different namespace
  • I could make the project references depend on the configuration… so the "Normal LINQ" configuration wouldn’t reference the Edulinq implementation project, and the "Edulinq implementation" configuration wouldn’t reference System.Core. I could then just use Enumerable.Range with an appropriate using directive for System.Linq conditional on the NORMAL_LINQ preprocessor directive, as per the other tests.

I like the idea of the second approach, but it means manually tinkering with the test project file – Visual Studio doesn’t expose any way of conditionally including a reference. I may do this at a later date… thoughts welcome.

What are we going to test?

There isn’t much we can really test for ranges – I only have eight tests, none of which are particularly exciting:

  • A simple valid range should look right when tested with AssertSequenceEqual
  • The start value should be allowed to be negative
  • Range(Int32.MinValue, 0) is an empty range
  • Range(Int32.MaxValue, 1) yields just Int32.MaxValue
  • The count can’t be negative
  • The count can be zero
  • start+count-1 can’t exceed Int32.MaxValue (so Range(Int32.MaxValue, 2) isn’t valid)
  • start+count-1 can be Int32.MaxValue (so Range(Int32.MaxValue, 1) is valid)

The last two are tested with a few different examples each – a large start and a small count, a small start and a large count, and "fairly large" values for both start and count.

Note that I don’t have any tests for lazy evaluation – while I could test that the returned value doesn’t implement any of the other collection interfaces, it would be a little odd to do so. On the other hand, we do have tests which have an enormous count – such that anything which really tried to allocate a collection of that size would almost certainly fail…

Let’s implement it!

It will surely be no surprise by now that we’re going to use a split implementation, with a public method which performs argument validation eagerly and then uses a private method with an iterator block to perform the actual iteration.

Having validated the arguments, we know that we’ll never overflow the bounds of Int32, so we can be pretty casual in the main part of the implementation.

public static IEnumerable<int> Range(int start, int count)
{
    if (count < 0)
    {
        throw new ArgumentOutOfRangeException("count");
    }
    // Convert everything to long to avoid overflows. There are other ways of checking
    // for overflow, but this way make the code correct in the most obvious way.
    if ((long)start + (long)count – 1L > int.MaxValue)
    {
        throw new ArgumentOutOfRangeException("count");
    }
    return RangeImpl(start, count);
}

private static IEnumerable<int> RangeImpl(int start, int count)
{
    for (int i = 0; i < count; i++)
    {
        yield return start + i;
    }
}

Just a few points to note here:

  • Arguably it’s the combination of "start" and "count" which is invalid in the second check, rather than just count. It would possibly be nice to allow ArgumentOutOfRangeException (or ArgumentException in general) to blame multiple arguments rather than just one. However, using "count" here matches the framework implementation.
  • There are other ways of performing the second check, and I certainly didn’t have to make all the operands in the expression longs. However, I think this is the simplest code which is clearly correct based on the documentation. I don’t need to think about all kinds of different situations and check that they all work. The arithmetic will clearly be valid when using the Int64 range of values, so I don’t need to worry about overflow, and I don’t need to consider whether to use a checked or unchecked context.
  • There are also other ways of looping in the private iterator block method, but I think this is the simplest. Another obvious and easy alternative is to keep two values, one for the count of yielded values and the other for the next value to yield, and increment them both on each iteration. A more complex approach would be to use just one loop variable – but you can’t use "value < start + count" in case the final value is exactly Int32.MaxValue, and you can’t use "value <= start + count – 1" in case the arguments are (int.MinValue, 0). Rather than consider all the border cases, I’ve gone for an obviously-correct solution. If you really, really cared about the performance of Range, you’d want to investigate various other options.

Prior to writing up this post, I didn’t have good tests for Range(Int32.MaxValue, 1) and Range(Int32.MinValue, 0)… but as they could easily go wrong as mentioned above, I’ve now included them. I find it interesting how considering alternative implementations suggests extra tests.

Conclusion

"Range" was a useful method to implement in order to test some other operators – "Count" in particular. Now that I’ve started on the non-extension methods though, I might as well do the other two (Empty and Repeat). I’ve already implemented "Empty", and will hopefully be able to write it up today. "Repeat" shouldn’t take much longer, and then we can move on to "Count" and "LongCount".

I think this code is a good example of situations where it’s worth writing "dumb" code which looks like the documentation, rather than trying to write possibly shorter, possibly slightly more efficient code which is harder to think about. No doubt there’ll be more of that in later posts…

Reimplementing LINQ to Objects: Part 3 – “Select” (and a rename…)

It’s been a long time since I wrote part 1 and part 2 of this blog series, but hopefully things will move a bit more quickly now.

The main step forward is that the project now has a source repository on Google Code instead of just being a zip file on each blog post. I had to give the project a title at that point, and I’ve chosen Edulinq, hopefully for obvious reasons. I’ve changed the namespaces etc in the code, and the blog tag for the series is now Edulinq too. Anyway, enough of the preamble… let’s get on with reimplementing LINQ, this time with the Select operator.

What is it?

Like Where, Select has two overloads:

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, int, TResult> selector)

Again, they both operate the same way – but the second overload allows the index into the sequence to be used as part of the projection.

Simple stuff first: the method projects one sequence to another: the "selector" delegate is applied to each input element in turn, to yield an output element. Behaviour notes, which are exactly the same as Where (to the extent that I cut and paste these from the previous blog post, and just tweaked them):

  • The input sequence is not modified in any way.
  • The method uses deferred execution – until you start trying to fetch items from the output sequence, it won’t start fetching items from the input sequence.
  • Despite deferred execution, it will validate that the parameters aren’t null immediately.
  • It streams its results: it only ever needs to look at one result at a time.
  • It will iterate over the input sequence exactly once each time you iterate over the output sequence.
  • The "selector" function is called exactly once per yielded value.
  • Disposing of an iterator over the output sequence will dispose of the corresponding iterator over the input sequence.

What are we going to test?

The tests are very much like those for Where – except that in cases where we tested the filtering aspect of Where, we’re now testing the projection aspect of Select.

There are a few tests of some interest. Firstly, you can tell that the method is generic with 2 type parameters instead of 1 – it has type parameters of TSource and TResult. They’re fairly self-explanatory, but it means it’s worth having a test for the case where the type arguments are different – such as converting an int to a string:

[Test]
public void SimpleProjectionToDifferentType()
{
    int[] source = { 1, 5, 2 };
    var result = source.Select(x => x.ToString());
    result.AssertSequenceEqual("1", "5", "2");
}

Secondly, I have a test that shows what sort of bizarre situations you can get into if you include side effects in your query. We could have done this with Where as well of course, but it’s clearer with Select:

[Test]
public void SideEffectsInProjection()
{
    int[] source = new int[3]; // Actual values won’t be relevant
    int count = 0;
    var query = source.Select(x => count++);
    query.AssertSequenceEqual(0, 1, 2);
    query.AssertSequenceEqual(3, 4, 5);
    count = 10;
    query.AssertSequenceEqual(10, 11, 12);
}

Notice how we’re only calling Select once, but the results of iterating over the results change each time – because the "count" variable has been captured, and is being modified within the projection. Please don’t do things like this.

Thirdly, we can now write query expressions which include both "select" and "where" clauses:

[Test]
public void WhereAndSelect()
{
    int[] source = { 1, 3, 4, 2, 8, 1 };
    var result = from x in source
                 where x < 4
                 select x * 2;
    result.AssertSequenceEqual(2, 6, 4, 2);
}

There’s nothing mind-blowing about any of this, of course – hopefully if you’ve used LINQ to Objects at all, this should all feel very comfortable and familiar.

Let’s implement it!

Surprise surprise, we go about implementing Select in much the same way as Where. Again, I simply copied the implementation file and tweaked it a little – the two methods really are that similar. In particular:

  • We’re using iterator blocks to make it easy to return sequences
  • The semantics of iterator blocks mean that we have to separate the argument validation from the real work. (Since I wrote the previous post, I’ve learned that VB11 will have anonymous iterators, which will avoid this problem. Sigh. It just feels wrong to envy VB users, but I’ll learn to live with it.)
  • We’re using foreach within the iterator blocks to make sure that we dispose of the input sequence iterator appropriately – so long as our output sequence iterator is disposed or we run out of input elements, of course.

I’ll skip straight to the code, as it’s all so similar to Where. It’s also not worth showing you the version with an index – because it’s such a trivial difference.

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    return SelectImpl(source, selector);
}

private static IEnumerable<TResult> SelectImpl<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)
{
    foreach (TSource item in source)
    {
        yield return selector(item);
    }
}

Simple, isn’t it? Again, the real "work" method is even shorter than the argument validation.

Conclusion

While I don’t generally like boring my readers (which may come as a surprise to some of you) this was a pretty humdrum post, I’ll admit. I’ve emphasized "just like Where" several times to the point of tedium very deliberately though – because it makes it abundantly clear that there aren’t really as many tricky bits to understand as you might expect.

Something slightly different next time (which I hope will be in the next few days). I’m not quite sure what yet, but there’s an awful lot of methods still to choose from…

Reimplementing LINQ to Objects: Part 2 – “Where”

Warning: this post is quite long. Although I’ve chosen a simple operator to implement, we’ll encounter a few of the corner cases and principles involved in LINQ along the way. This will also be a somewhat experimental post in terms of format, as I try to work out the best way of presenting the material.

We’re going to implement the "Where" clause/method/operator. It’s reasonably simple to understand in general, but goes into all of the deferred execution and streaming bits which can cause problems. It’s generic, but only uses one type parameter (which is a big deal, IMO – the more type parameters a method has, the harder I find it to understand in general). Oh, and it’s a starting point for query expressions, which is a bonus.

What is it?

"Where" has two overloads:

public static IEnumerable<TSource> Where(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static IEnumerable<TSource> Where(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)

Before I go into what it actually does, I’ll point out a few things which are going to be common across nearly all of the LINQ operators we’ll be implementing:

  • These are extension methods – they’re declared in a top-level, non-nested, static class and the first parameter has a "this" modifier. They can be invoked roughly as if they were instance methods on the type of the first parameter.
  • They’re generic methods – in this case there’s just a single type parameter (TSource) which indicates the type of sequence we’re dealing with. So for (say) a list of strings, TSource would be string.
  • They take generic delegates in the Func<…> family. These are usually specified with lambda expressions, but any other way of providing a delegate will work fine too.
  • They deal with sequences. These are represented by IEnumerable<T>, with an iterator over a sequence being represented by IEnumerator<T>.

I fully expect that most readers will be comfortable with all of these concepts, so I won’t go into them in more detail. If any of the above makes you nervous, please familiarise yourself with them before continuing, otherwise you’re likely to have a hard time.

The purpose of "Where" is to filter a sequence. It takes an input sequence and a predicate, and returns another sequence. The output sequence is of the same element type (so if you put in a sequence of strings, you’ll get a sequence of strings out) and it will only contain elements from the input sequence which pass the predicate. (Each item will be passed to the predicate in turn. If the predicate returns true, the item will be part of the output sequence; otherwise it won’t.)

Now, a few important details about the behaviour:

  • The input sequence is not modified in any way: this isn’t like List<T>.RemoveAll, for example.
  • The method uses deferred execution – until you start trying to fetch items from the output sequence, it won’t start fetching items from the input sequence
  • Despite deferred execution, it will validate that the parameters aren’t null immediately
  • It streams its results: it only ever needs to look at one result at a time, and will yield it without keeping a reference to it. This means you can apply it to an infinitely long sequence (for example a sequence of random numbers)
  • It will iterate over the input sequence exactly once each time you iterate over the output sequence
  • Disposing of an iterator over the output sequence will dispose of the corresponding iterator over the input sequence. (In case you didn’t realise, the foreach statement in C# uses a try/finally block to make sure the iterator is always disposed however the loop finishes.)

Many of these points will be true for a lot of our other operators too.

The overload which takes a Func<TSource, int, bool> lets the predicate use the index within the sequence as well as the value. The index always starts at 0, and increments by 1 each time regardless of previous results from the predicate.

What are we going to test?

Ideally, we’d like to test all of the points above. The details of streaming and how many times the sequence is iterated over are frankly a pain to deal with, unfortunately. Given how much there is to implement already, we’ll come back to those.

Let’s have a look at some tests. First, here’s a simple "positive" test – we’re starting with an array of integers, and using a lambda expression to only include elements less than 4 in the output. (The word "filter" is omnipresent but unfortunate. It’s easier to talk about "filtering out" elements than "including" them, but the predicate is expressed in a positive way.)

[Test]
public void SimpleFiltering()
{
    int[] source = { 1, 3, 4, 2, 8, 1 };
    var result = source.Where(x => x < 4);
    result.AssertSequenceEqual(1, 3, 2, 1);
}

I’ve kept the TestExtensions from MoreLINQ, despite NUnit coming with CollectionAssert. I find the extension methods easier to work with for three reasons:

  • They’re extension methods, which helps to reduce the clutter
  • They can use a parameter array for the expected output, which makes the test simpler to express
  • The message is clearer when the assertion fails

Basically, AssertSequenceEqual does what you’d expect it to – it checks that the actual result (usually expressed as the variable you call the method on) matches the expected result (usually expressed as a parameter array).

So far, so good. Now let’s check argument validation:

[Test]
public void NullSourceThrowsNullArgumentException()
{
    IEnumerable<int> source = null;
    Assert.Throws<ArgumentNullException>(() => source.Where(x => x > 5));
}

[Test]
public void NullPredicateThrowsNullArgumentException()
{
    int[] source = { 1, 3, 7, 9, 10 };
    Func<int, bool> predicate = null;
    Assert.Throws<ArgumentNullException>(() => source.Where(predicate));
}

I’m not bothering to check the name within the ArgumentNullException, but importantly I’m testing that the arguments are being validated immediately. I’m not trying to iterate over the result – so if the validation is deferred, the test will fail.

The final interesting test for the moment is also around deferred execution, using a helper class called ThrowingEnumerable. This is a sequence which blows up with an InvalidOperationException if you ever try to iterate over it. Essentially, we want to check two things:

  • Just calling Where doesn’t start iterating over the source sequence
  • When we call GetEnumerator() to get an iterator and then MoveNext() on that iterator, we should start iterating, causing an exception to be thrown.

We’ll need to do something similar for other operators, so I’ve written a small helper method in ThrowingEnumerable:

internal static void AssertDeferred<T>(
    Func<IEnumerable<int>, IEnumerable<T>> deferredFunction)
{
    ThrowingEnumerable source = new ThrowingEnumerable();
    var result = deferredFunction(source);
    using (var iterator = result.GetEnumerator())
    {
        Assert.Throws<InvalidOperationException>(() => iterator.MoveNext());
    }
}

Now we can use that to check that Where really defers execution:

[Test]
public void ExecutionIsDeferred()
{
    ThrowingEnumerable.AssertDeferred(src => src.Where(x => x > 0));
}

These tests have all dealt with the simpler predicate overload – where the predicate is only passed the item, not the index. The tests involving the index are very similar.

Let’s implement it!

With all the tests passing when running against the real LINQ to Objects, it’s time to implement it in our production code. We’re going to use iterator blocks, which were introduced in C# 2 to make it easier to implement IEnumerable<T>. I have a couple of articles you can read if you want more background details… or read chapter 6 of C# in Depth (either edition). These give us deferred execution for free… but that can be a curse as well as a blessing, as we’ll see in a minute.

At its heart, the implementation is going to look something like this:

// Naive implementation
public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

Simple, isn’t it? Iterator blocks allow us to write the code pretty much how we’d describe it: we iterate over each item in the source, and if the predicate returns true for that particular item, we yield (include) it in the output sequence.

Lo and behold, some of our tests pass already. Now we just need argument validation. That’s easy, right? Let’s give it a go:

// Naive validation – broken!
public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

Hmm. Our validation tests still seem to be red, and putting a breakpoint on the "throw" statements doesn’t help… they’re not getting hit. What’s going on?

I’ve given a few pretty broad hints already. The problem is deferred execution. Until we start trying to iterate over the result, none of our code will run. Our tests deliberately don’t iterate over the result, so validation is never performed.

We’ve just hit a design flaw in C#. Iterator blocks in C# simply don’t work nicely when you want to split execution between "immediate" (usually for validation) and "deferred". Instead, we have to split our implementation into two: a normal method for validation, which then calls the iterator method for the deferred processing:

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return WhereImpl(source, predicate);
}

private static IEnumerable<TSource> WhereImpl<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

It’s ugly, but it works: all our index-less tests go green. From here, it’s a short step to implement the version using an index as well:

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<T, int, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return WhereImpl(source, predicate);
}

private static IEnumerable<TSource> WhereImpl<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)
{
    int index = 0;
    foreach (TSource item in source)
    {
        if (predicate(item, index))
        {
            yield return item;
        }
        index++;
    }
}

Now the bar is green, and we’re done. Hang on though… we haven’t used it every way we can yet.

Query expressions

So far, we’ve been calling the method directly (although as an extension method) – but LINQ also provides us with query expressions. Here’s our "SimpleFiltering" test rewritten to use a query expression:

[Test]
public void QueryExpressionSimpleFiltering()
{
    int[] source = { 1, 3, 4, 2, 8, 1 };
    var result = from x in source
                 where x < 4
                 select x;
    result.AssertSequenceEqual(1, 3, 2, 1);
}

(Note that the name is different here to in the downloadable code, to stop the blog server software blocking the name of the method. Grr.)

That will produce exactly the same code as our earlier test. The compiler basically translates this form into the previous one, leaving the condition (x < 4) as a lambda expression and then converting it appropriately (into a delegate in this case). You may be surprised that this works as we have no Select method yet…  but in this case we have a "no-op" select projection; we’re not actually performing a real transformation. In that case – and so long as there’s something else in the query, in this case our "where" clause – the compiler effectively omits the "select" clause, so it doesn’t matter that we haven’t implemented it. If you changed "select x" to "select x * 2", it would fail to compile against our Where-only LINQ implementation.

The fact that query expressions are just based on patterns like this is a very powerful feature for flexibility – it’s how LINQ to Rx is able to only implement the operators that make sense in that environment, for example. Similarly, there’s nothing in the C# compiler that "knows" about IEnumerable<T> when it comes to query expressions – which is how entirely separate interfaces such as IObservable<T> work just as well.

What have we learned?

There’s been a lot to take in here, in terms of both implementation and core LINQ principles:

  • LINQ to Objects is based on extension methods, delegates and IEnumerable<T>
  • Operators use deferred execution where appropriate and stream their data where possible
  • Operators don’t mutate the original source, but instead return a new sequence which will return the appropriate data
  • Query expressions are based on compiler translations of patterns; you don’t need to implement any more of the pattern than the relevant query expression requires
  • Iterator blocks are great for implementing deferred execution…
  • … but make eager argument validation a pain

Code download

Linq-To-Objects-2.zip

Many people have asked for a source repository for the project, and that makes sense. I’m putting it together a source repository for it now; it’s likely to be done before I post the next part.

Reimplementing LINQ to Objects: Part 1 – Introduction

About a year and a half ago, I gave a talk at a DDD day in Reading, attempting to reimplement as much of LINQ to Objects as possible in an hour. Based on the feedback from the session, I went far too fast… and I was still a long way from finishing. However, I still think it’s an interesting exercise, so I thought I’d do it again in a more leisurely way, blogging as I go. Everything will be under the "Edulinq" tag, so that’s the best way to get all the parts in order, without any of my other posts.

General approach

The plan is to reimplement the whole of LINQ to Objects, explaining each method (or group of methods) in a blog post. I’m going to try to make the code itself production quality, but I’m not going to include any XML documentation – if I’m already writing up how things work, I don’t want to do everything twice. I’ll include optimizations where appropriate, hopefully doing better than LINQ to Objects itself.

The approach is going to be fairly simple: for each LINQ method, I’ll write some unit tests (most of which I won’t show in the blog posts) and make sure they run against the normal .NET implementation. I’ll then comment out the using directive for System.Linq, and introduce one for JonSkeet.Linq instead. The tests will fail, I’ll implement the methods, and gradually they’ll go green again. It’s not quite the normal TDD pattern, but it works pretty well.

I will write up a blog entry for each LINQ operator, probably including all of the production code but only "interesting" tests. I’ll highlight important patterns as they come up – that’s half the point of the exercise, of course.

At the end of each post, I’ll include a link to download the "code so far". For the sake of anyone looking at these posts in the future, I’m going to keep the downloads numbered separately, rather than updating a single download over time. I’m hoping that the code will simply grow, but I dare say there’ll be some modifications along the way too.

The aim is not to end up with LINQBridge: I’m going to be targeting .NET 3.5 (mostly so I can use extension methods without creating my own attribute) and I’m certainly not going to start worrying about installers and the like. The purpose of all of this is purely educational: if you follow along with these blog posts, with any luck you’ll have a deeper understanding of LINQ in general and LINQ to Objects in particular. For example, topics like deferred execution are often misunderstood: looking at the implementation can clarify things pretty well.

Testing

The unit tests will be written using NUnit (just for the sake of my own familiarity). Fairly obviously, one of the things we’ll need to test quite a lot is whether two sequences are equal. We’ll do this using the TestExtensions class from MoreLINQ (which I’ve just copied into the project). The netbook on which I’ll probably write most of this code only has C# Express 2010 on it, so I’m going to use the external NUnit GUI. I’ve set this in the project file as the program to start… which you can’t do from C# Express directly, but editing the project file is easy, to include this:

<StartAction>Program</StartAction>
<StartProgram>C:Program FilesNUnit-2.5.7.10213binnet-2.0nunit-x86.exe</StartProgram>

It’s a bit of a grotty hack, but it works. The "additional command line parameters" are then set to just JonSkeet.Linq.Tests.dll – the current directory is the bin/debug directory by default, so everything’s fine. Obviously if you want to run the tests yourself and you have ReSharper or something similar, you can see the results integrated into Visual Studio.

Although I’m hoping to write reasonably production-level code, I doubt that there’ll be as many unit tests as I’d really write against production code. I fully expect the number of lines of test code to dwarf the number of lines of production code even so. There are simply vast numbers of potential corner cases… and quite a few overloads in several cases. Remember that the goal here is to examine interesting facets of LINQ.

Code layout

Just like the real LINQ to Objects, I’ll be creating an enormous static Enumerable class… but I’ll do so using partial classes, with one method name (but multiple overloads) per file. So Where will be implemented in Where.cs and tested in WhereTest.cs, for example.

First code drop

The first zip file is available: Linq-To-Objects-1.zip. It doesn’t contain any production code yet – just 4 tests for Where, so I could check that NUnit was working properly for me. Next stop… implementing Where.