Reimplementing LINQ to Objects: Part 2 – “Where”

Warning: this post is quite long. Although I’ve chosen a simple operator to implement, we’ll encounter a few of the corner cases and principles involved in LINQ along the way. This will also be a somewhat experimental post in terms of format, as I try to work out the best way of presenting the material.

We’re going to implement the "Where" clause/method/operator. It’s reasonably simple to understand in general, but goes into all of the deferred execution and streaming bits which can cause problems. It’s generic, but only uses one type parameter (which is a big deal, IMO – the more type parameters a method has, the harder I find it to understand in general). Oh, and it’s a starting point for query expressions, which is a bonus.

What is it?

"Where" has two overloads:

public static IEnumerable<TSource> Where(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static IEnumerable<TSource> Where(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)

Before I go into what it actually does, I’ll point out a few things which are going to be common across nearly all of the LINQ operators we’ll be implementing:

  • These are extension methods – they’re declared in a top-level, non-nested, static class and the first parameter has a "this" modifier. They can be invoked roughly as if they were instance methods on the type of the first parameter.
  • They’re generic methods – in this case there’s just a single type parameter (TSource) which indicates the type of sequence we’re dealing with. So for (say) a list of strings, TSource would be string.
  • They take generic delegates in the Func<…> family. These are usually specified with lambda expressions, but any other way of providing a delegate will work fine too.
  • They deal with sequences. These are represented by IEnumerable<T>, with an iterator over a sequence being represented by IEnumerator<T>.

I fully expect that most readers will be comfortable with all of these concepts, so I won’t go into them in more detail. If any of the above makes you nervous, please familiarise yourself with them before continuing, otherwise you’re likely to have a hard time.

The purpose of "Where" is to filter a sequence. It takes an input sequence and a predicate, and returns another sequence. The output sequence is of the same element type (so if you put in a sequence of strings, you’ll get a sequence of strings out) and it will only contain elements from the input sequence which pass the predicate. (Each item will be passed to the predicate in turn. If the predicate returns true, the item will be part of the output sequence; otherwise it won’t.)

Now, a few important details about the behaviour:

  • The input sequence is not modified in any way: this isn’t like List<T>.RemoveAll, for example.
  • The method uses deferred execution – until you start trying to fetch items from the output sequence, it won’t start fetching items from the input sequence
  • Despite deferred execution, it will validate that the parameters aren’t null immediately
  • It streams its results: it only ever needs to look at one result at a time, and will yield it without keeping a reference to it. This means you can apply it to an infinitely long sequence (for example a sequence of random numbers)
  • It will iterate over the input sequence exactly once each time you iterate over the output sequence
  • Disposing of an iterator over the output sequence will dispose of the corresponding iterator over the input sequence. (In case you didn’t realise, the foreach statement in C# uses a try/finally block to make sure the iterator is always disposed however the loop finishes.)

Many of these points will be true for a lot of our other operators too.

The overload which takes a Func<TSource, int, bool> lets the predicate use the index within the sequence as well as the value. The index always starts at 0, and increments by 1 each time regardless of previous results from the predicate.

What are we going to test?

Ideally, we’d like to test all of the points above. The details of streaming and how many times the sequence is iterated over are frankly a pain to deal with, unfortunately. Given how much there is to implement already, we’ll come back to those.

Let’s have a look at some tests. First, here’s a simple "positive" test – we’re starting with an array of integers, and using a lambda expression to only include elements less than 4 in the output. (The word "filter" is omnipresent but unfortunate. It’s easier to talk about "filtering out" elements than "including" them, but the predicate is expressed in a positive way.)

[Test]
public void SimpleFiltering()
{
    int[] source = { 1, 3, 4, 2, 8, 1 };
    var result = source.Where(x => x < 4);
    result.AssertSequenceEqual(1, 3, 2, 1);
}

I’ve kept the TestExtensions from MoreLINQ, despite NUnit coming with CollectionAssert. I find the extension methods easier to work with for three reasons:

  • They’re extension methods, which helps to reduce the clutter
  • They can use a parameter array for the expected output, which makes the test simpler to express
  • The message is clearer when the assertion fails

Basically, AssertSequenceEqual does what you’d expect it to – it checks that the actual result (usually expressed as the variable you call the method on) matches the expected result (usually expressed as a parameter array).

So far, so good. Now let’s check argument validation:

[Test]
public void NullSourceThrowsNullArgumentException()
{
    IEnumerable<int> source = null;
    Assert.Throws<ArgumentNullException>(() => source.Where(x => x > 5));
}

[Test]
public void NullPredicateThrowsNullArgumentException()
{
    int[] source = { 1, 3, 7, 9, 10 };
    Func<int, bool> predicate = null;
    Assert.Throws<ArgumentNullException>(() => source.Where(predicate));
}

I’m not bothering to check the name within the ArgumentNullException, but importantly I’m testing that the arguments are being validated immediately. I’m not trying to iterate over the result – so if the validation is deferred, the test will fail.

The final interesting test for the moment is also around deferred execution, using a helper class called ThrowingEnumerable. This is a sequence which blows up with an InvalidOperationException if you ever try to iterate over it. Essentially, we want to check two things:

  • Just calling Where doesn’t start iterating over the source sequence
  • When we call GetEnumerator() to get an iterator and then MoveNext() on that iterator, we should start iterating, causing an exception to be thrown.

We’ll need to do something similar for other operators, so I’ve written a small helper method in ThrowingEnumerable:

internal static void AssertDeferred<T>(
    Func<IEnumerable<int>, IEnumerable<T>> deferredFunction)
{
    ThrowingEnumerable source = new ThrowingEnumerable();
    var result = deferredFunction(source);
    using (var iterator = result.GetEnumerator())
    {
        Assert.Throws<InvalidOperationException>(() => iterator.MoveNext());
    }
}

Now we can use that to check that Where really defers execution:

[Test]
public void ExecutionIsDeferred()
{
    ThrowingEnumerable.AssertDeferred(src => src.Where(x => x > 0));
}

These tests have all dealt with the simpler predicate overload – where the predicate is only passed the item, not the index. The tests involving the index are very similar.

Let’s implement it!

With all the tests passing when running against the real LINQ to Objects, it’s time to implement it in our production code. We’re going to use iterator blocks, which were introduced in C# 2 to make it easier to implement IEnumerable<T>. I have a couple of articles you can read if you want more background details… or read chapter 6 of C# in Depth (either edition). These give us deferred execution for free… but that can be a curse as well as a blessing, as we’ll see in a minute.

At its heart, the implementation is going to look something like this:

// Naive implementation
public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

Simple, isn’t it? Iterator blocks allow us to write the code pretty much how we’d describe it: we iterate over each item in the source, and if the predicate returns true for that particular item, we yield (include) it in the output sequence.

Lo and behold, some of our tests pass already. Now we just need argument validation. That’s easy, right? Let’s give it a go:

// Naive validation – broken!
public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

Hmm. Our validation tests still seem to be red, and putting a breakpoint on the "throw" statements doesn’t help… they’re not getting hit. What’s going on?

I’ve given a few pretty broad hints already. The problem is deferred execution. Until we start trying to iterate over the result, none of our code will run. Our tests deliberately don’t iterate over the result, so validation is never performed.

We’ve just hit a design flaw in C#. Iterator blocks in C# simply don’t work nicely when you want to split execution between "immediate" (usually for validation) and "deferred". Instead, we have to split our implementation into two: a normal method for validation, which then calls the iterator method for the deferred processing:

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return WhereImpl(source, predicate);
}

private static IEnumerable<TSource> WhereImpl<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

It’s ugly, but it works: all our index-less tests go green. From here, it’s a short step to implement the version using an index as well:

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<T, int, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return WhereImpl(source, predicate);
}

private static IEnumerable<TSource> WhereImpl<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)
{
    int index = 0;
    foreach (TSource item in source)
    {
        if (predicate(item, index))
        {
            yield return item;
        }
        index++;
    }
}

Now the bar is green, and we’re done. Hang on though… we haven’t used it every way we can yet.

Query expressions

So far, we’ve been calling the method directly (although as an extension method) – but LINQ also provides us with query expressions. Here’s our "SimpleFiltering" test rewritten to use a query expression:

[Test]
public void QueryExpressionSimpleFiltering()
{
    int[] source = { 1, 3, 4, 2, 8, 1 };
    var result = from x in source
                 where x < 4
                 select x;
    result.AssertSequenceEqual(1, 3, 2, 1);
}

(Note that the name is different here to in the downloadable code, to stop the blog server software blocking the name of the method. Grr.)

That will produce exactly the same code as our earlier test. The compiler basically translates this form into the previous one, leaving the condition (x < 4) as a lambda expression and then converting it appropriately (into a delegate in this case). You may be surprised that this works as we have no Select method yet…  but in this case we have a "no-op" select projection; we’re not actually performing a real transformation. In that case – and so long as there’s something else in the query, in this case our "where" clause – the compiler effectively omits the "select" clause, so it doesn’t matter that we haven’t implemented it. If you changed "select x" to "select x * 2", it would fail to compile against our Where-only LINQ implementation.

The fact that query expressions are just based on patterns like this is a very powerful feature for flexibility – it’s how LINQ to Rx is able to only implement the operators that make sense in that environment, for example. Similarly, there’s nothing in the C# compiler that "knows" about IEnumerable<T> when it comes to query expressions – which is how entirely separate interfaces such as IObservable<T> work just as well.

What have we learned?

There’s been a lot to take in here, in terms of both implementation and core LINQ principles:

  • LINQ to Objects is based on extension methods, delegates and IEnumerable<T>
  • Operators use deferred execution where appropriate and stream their data where possible
  • Operators don’t mutate the original source, but instead return a new sequence which will return the appropriate data
  • Query expressions are based on compiler translations of patterns; you don’t need to implement any more of the pattern than the relevant query expression requires
  • Iterator blocks are great for implementing deferred execution…
  • … but make eager argument validation a pain

Code download

Linq-To-Objects-2.zip

Many people have asked for a source repository for the project, and that makes sense. I’m putting it together a source repository for it now; it’s likely to be done before I post the next part.

29 thoughts on “Reimplementing LINQ to Objects: Part 2 – “Where””

  1. I agree that the Where() implementation looks ugly, but we can do something about that. If we apply some principles to the extreme, we could:

    Say that a method should do only one thing.
    And that the implementation should be on the same level of abstraction.

    Where() has to validate the input and iterate the source. That’s two things. But we could make it orchestrate those things instead, that’s one thing.

    The implementation can be reduced to two method calls, that would make the abstraction levels the same:

    public IEnumerable Where(…)
    {
    ValidateInput(source, predicate);
    return IterateWhere(source, predicate);
    }

    Not sure if the code will show up right, but it looks cleaner inside of my head.

    Like

  2. @Thomas

    Your arguments seems fine but your implementation can be improved. The problem is that the exception’s stack trace will point to your ValidateInput method which may be misleading. You can handle this using something like:

    bool TryValidateInput(…, out exception),

    but again this method signature starts to smell :)

    Like

  3. Jon,

    Shouldn’t there be a test to verify this behavior:

    – Disposing of an iterator over the output sequence will dispose of the corresponding iterator over the input sequence.

    Like

  4. @Jason: Yes – that’s one of the types of test I haven’t implemented yet. I figured that testing deferred execution and eager validation was enough for this time. I may do the disposal for Select, which would otherwise be pretty simple :)

    Like

  5. Additional tests would be to check that the predicate is only invoked once for every element and a few others. Ideally you would test that no files have been deleted etc. You venture quite far along the testing route towards inefficiency.

    Like

  6. Hi Jon,

    I disagree with your point on,

    “LINQ to Objects is based on extension methods, delegates and IEnumerable”

    It is ofcourse based on extension methods and delegates, But there is an underlying implementation of IQueryable interface for all these extensions to be invoked from. Simply put, You can have the end-point source as IQueryable and run LINQ to Objects because its query translation is IL. You can check the EnumerableQuery which implements this, and acts the glue for IQueryable translation. How do you think the LINQ to SQL is implemented if it is only based on extension methods / lambdas / delegates?

    -Fahad

    Like

  7. @Fahad: LINQ to SQL isn’t LINQ to Objects. I very carefully wrote that *LINQ to Objects* is based on those things, rather than LINQ in general. Yes, you *can* go via IQueryable, but that’s still going to end up compiling the query expressions down to delegates.

    Like

  8. Jon, My point here is that there is a set of methods that you need to implement that makes “LINQ”, and that is with the IQueryable interface and not the other way around. It would be more like “via” delegates / lambdas and straight with IQueryable.

    -Fahad

    Like

  9. @Fahad: But I’m not implementing all of LINQ – I’m implementing LINQ to Objects. I’ll talk about expression trees etc if and when I implement AsQueryable, but I think my statement about *LINQ to Objects* is reasonable. I think it’s best to start off with *relatively* straightforward concepts, and introduce the trickier ones later, when they become relevant.

    (In particular, LINQ to Objects would still have been useful without IQuerable existing at all.)

    Like

  10. @tobi: Why do you say Jon is “venture[ing] quite far along the testing route towards inefficiency”? I’m not the best tester in the world, and have some questions myself on how the test for deferred execution is actually working, but all of his tests seem reasonable. Which ones would you not include?

    Like

  11. I think that ‘WhereImpl’ should be renamed to ‘WhereIterator’. The ‘Iterator’ suffix is more descriptive, especially when debugging and tracing the stack.

    Of course, this just a minor nitpick, so I’m sure there are differences in opinion. I’m also aware that this is an example and not necessarily the final implemented version.

    Like

  12. @swiftfoxmark2: Well, I view WhereImpl as the real implementation of Where, with the Where method itself just providing an argument-validating wrapper, basically.

    Personally I prefer WhereImpl, but as you say, it’s a matter of opinion.

    Like

  13. Hi Jon, Thanks for the blog series just discovered it today. Just a question on this statement “We’ve just hit a design flaw in C#. Iterator blocks in C# simply don’t work nicely when you want to split execution between “immediate” (usually for validation) and “deferred”. Instead, we have to split our implementation into two: a normal method for validation, which then calls the iterator method for the deferred processing:”

    In the code above what causes it to be deferred? Is it the presence of the yield keyword in the method?

    Like

    1. Absolutely fascinating. The part that I don’t get is that why c# doesn’t defer the checking of the validation as well……….this is most confusing to me. It seems like quantum physics – a magic wand: how does the debugger know whether a yield statement is in the where method unless it goes there and checks: yet it seems to foresee whether a yield statement exists in the where method before going there and checking – because when we add in the WhereImp method which actually does the implementation, then the debugger actually enters the where method. i hope that makes sense.

      In any case my question is the same as as @Dominic – what causes it to be deferred? and how does c# know that there is a yield statement in the where method before going there to check it.

      Like

      1. That’s simple – C# is a compiled language. The compiler looks at the method and works out what code it’s going to emit. If the method includes a yield statement, it’s an iterator and uses deferred execution. If it doesn’t contain a yield statement, it’s not an iterator. All of this is decided at compile time.

        Like

  14. Hi Jon, thanks for this series and as usual it has been tremendously helpful.
    One thing I can’t seem to get my head around is the iteration of the enumerable when it is chained.
    For example, if the Where() is followed by ToList(), would the sequence be iterated twice (MoveNext being called for both Where and ToList, ignoring the fact that ToList could optimize the iteration away if it is a collection with CopyTo)?
    If so, does that mean using foreach after Where will use two iterators, thus iterating twice? (We have two foreach, one being inside the Where)
    If it only iterates once, surely there has to be some magic happening that allows the predicate to be passed to subsequent chaining (like ToList)?

    Like

    1. I’m afraid it’s quite hard for me to picture the code you’re talking about – but something like foo.Where(…).ToList() only results in foo being enumerated once. When ToList() calls MoveNext(), it calls it on the result from Where() – so that will result in Where() calling MoveNext() on the iterator associated with foo, but it won’t need to iterate over it twice.

      Like

  15. Thanks, this certainly clears things up. I guess the problem for me was I keep equating MoveNext() x 2 = 2n iterations.
    So to summarize, although MoveNext() will be called twice, it isn’t the same as foo being iterated twice.

    And just to make sure, the logic applies to this as well?
    var query = foo.Where(…);
    foreach (var item in query)…

    Like

Leave a comment