Reimplementing LINQ to Objects: Part 12 – DefaultIfEmpty

After the masses of code required for all the permutations of First/Last/etc, DefaultIfEmpty is a bit of a relief.

What is it?

Even this simple operator has two overloads:

public static IEnumerable<TSource> DefaultIfEmpty<TSource>(
    this IEnumerable<TSource> source)

public static IEnumerable<TSource> DefaultIfEmpty<TSource>(
    this IEnumerable<TSource> source,
    TSource defaultValue)

The behaviour is very simple to describe:

  • If the input sequence is empty, the result sequence has a single element in it, the default value. This is default(TSource) for the overload without an extra parameter, or the specified value otherwise.
  • If the input sequence isn’t empty, the result sequence is the same as the input sequence.
  • The source argument must not be null, and this is validated eagerly.
  • The result sequence itself uses deferred execution – the input sequence isn’t read at all until the result sequence is read.
  • The input sequence is streamed; any values read are yielded immediately; no buffering is used.

Dead easy.
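Those bullet points are easy to see in action. Here’s a quick demonstration, assuming the standard System.Linq behaviour:

```csharp
int[] empty = new int[0];
int[] nonEmpty = { 1, 2, 3 };

// Empty input: a single element, default(int) == 0
Console.WriteLine(string.Join(", ", empty.DefaultIfEmpty()));     // 0

// Empty input, explicit default value
Console.WriteLine(string.Join(", ", empty.DefaultIfEmpty(5)));    // 5

// Non-empty input: the sequence is unchanged; the default value is ignored
Console.WriteLine(string.Join(", ", nonEmpty.DefaultIfEmpty(5))); // 1, 2, 3
```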

What are we going to test?

Despite it being relatively late in the day, I decided to test for argument validation – and a good job too, as my first attempt failed to split the implementation into an argument validation method and an iterator block method for the real work! It just shows how easy it is to slip up.

Other than that, I can really only see four cases worth testing:

  • No default value specified, empty input sequence
  • Default value specified, empty input sequence
  • No default value specified, non-empty input sequence
  • Default value specified, non-empty input sequence

So I have tests for all of those, and that’s it. I don’t have anything testing for streaming, lazy evaluation etc.
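For illustration, the two empty-sequence tests might look something like this – I’m assuming an AssertSequenceEqual extension helper here, so the real test code may be phrased differently:

```csharp
[Test]
public void EmptySequenceNoDefaultValue()
{
    int[] source = new int[0];
    source.DefaultIfEmpty().AssertSequenceEqual(0);
}

[Test]
public void EmptySequenceWithDefaultValue()
{
    int[] source = new int[0];
    source.DefaultIfEmpty(5).AssertSequenceEqual(5);
}
```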

Let’s implement it!

Despite my reluctance to implement one operator in terms of another elsewhere, this felt like such an obvious case, I figured it would make sense just this once. I even applied DRY to the argument validation aspect. Here’s the implementation in all its brief glory:

public static IEnumerable<TSource> DefaultIfEmpty<TSource>(
    this IEnumerable<TSource> source)
{
    // This will perform an appropriate test for source being null first.
    return source.DefaultIfEmpty(default(TSource));
}

public static IEnumerable<TSource> DefaultIfEmpty<TSource>(
    this IEnumerable<TSource> source,
    TSource defaultValue)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return DefaultIfEmptyImpl(source, defaultValue);
}

private static IEnumerable<TSource> DefaultIfEmptyImpl<TSource>(
    IEnumerable<TSource> source,
    TSource defaultValue)
{
    bool foundAny = false;
    foreach (TSource item in source)
    {
        yield return item;
        foundAny = true;
    }
    if (!foundAny)
    {
        yield return defaultValue;
    }
}

Of course, now that I’ve said how easy it was, someone will find a bug…

Aside from the use of default(TSource) to call the more complex overload from the simpler one, the only aspect of any interest is the bottom method. It irks me slightly that we’re assigning "true" to "foundAny" on every iteration… but the alternative is fairly unpleasant:

private static IEnumerable<TSource> DefaultIfEmptyImpl<TSource>(
    IEnumerable<TSource> source,
    TSource defaultValue)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            yield return defaultValue;
            yield break; // Like a "return"
        }
        yield return iterator.Current;
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}

This may be slightly more efficient, but it feels a little clumsier. We could get rid of the "yield break" by putting the remainder of the method in an "else" block, but I’m not dead keen on that, either. We could use a do/while loop instead of a simple while loop – that would at least remove the repetition of "yield return iterator.Current" but I’m not really a fan of do/while loops. I use them sufficiently rarely that they cause me more mental effort to read than I really like.

If any readers have suggestions which are significantly nicer than either of the above implementations, I’d be interested to hear them. This feels a little inelegant at the moment. It’s far from a readability disaster – it’s just not quite neat.
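For completeness, here’s what the do/while version mentioned above might look like – a sketch only; I’m not claiming it’s an improvement:

```csharp
private static IEnumerable<TSource> DefaultIfEmptyImpl<TSource>(
    IEnumerable<TSource> source,
    TSource defaultValue)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            yield return defaultValue;
            yield break;
        }
        // The do/while removes the repeated "yield return iterator.Current",
        // at the cost of a loop construct I find harder to read.
        do
        {
            yield return iterator.Current;
        } while (iterator.MoveNext());
    }
}
```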

Conclusion

Aside from the slight annoyance at the final lack of elegance, there’s not much of interest here. However, we could now implement FirstOrDefault/LastOrDefault/SingleOrDefault using DefaultIfEmpty. For example, here’s an implementation of FirstOrDefault:

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source)
{
    return source.DefaultIfEmpty().First();
}

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Can’t just use source.DefaultIfEmpty().First(predicate)
    return source.Where(predicate).DefaultIfEmpty().First();
}

Note the comment in the predicated version – the defaulting has to be the very last step, after we’ve applied the predicate… otherwise, given an empty sequence and a predicate which default(TSource) doesn’t match, we’d get an exception instead of the default value. The other two …OrDefault operators could be implemented in the same way, of course. (I haven’t done so yet, but the above code is in source control.)
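For example, LastOrDefault could follow exactly the same pattern – a sketch, since as noted I haven’t actually committed these yet:

```csharp
public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source)
{
    return source.DefaultIfEmpty().Last();
}

public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Again, the defaulting must come after the filtering.
    return source.Where(predicate).DefaultIfEmpty().Last();
}
```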

I’m currently unsure what I’ll implement next. I’ll see whether I get inspired by any particular method in the morning.

Reimplementing LINQ to Objects: Part 11 – First/Single/Last and the …OrDefault versions

Today I’ve implemented six operators, each with two overloads. At first I expected the implementations to be very similar, but they’ve all turned out slightly differently…

What are they?

We’ve got three axes of permutation here: {First, Last, Single}, {with/without OrDefault}, {with/without a predicate}. That gives us these twelve signatures:

public static TSource First<TSource>(
    this IEnumerable<TSource> source)

public static TSource First<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source)

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static TSource Last<TSource>(
    this IEnumerable<TSource> source)

public static TSource Last<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source)

public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static TSource Single<TSource>(
    this IEnumerable<TSource> source)

public static TSource Single<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static TSource SingleOrDefault<TSource>(
    this IEnumerable<TSource> source)

public static TSource SingleOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

The shared behaviour is as follows:

  • They’re all extension methods with a single generic type argument
  • They’re all implemented with immediate execution
  • They all validate that their parameters are non-null
  • The overloads with a predicate are equivalent to calling source.Where(predicate).SameOperator() – in other words, they just add a filter before applying the operator.

With those rules applied, we simply need to consider three possibilities for each operator: what happens if the source sequence is empty, contains a single element, or contains multiple elements. (This is after applying the filter if a predicate is specified, of course.) We can draw the results up in a simple table:

Operator        | Empty sequence           | Single element  | Multiple elements
----------------|--------------------------|-----------------|----------------------
First           | Throws exception         | Returns element | Returns first element
FirstOrDefault  | Returns default(TSource) | Returns element | Returns first element
Last            | Throws exception         | Returns element | Returns last element
LastOrDefault   | Returns default(TSource) | Returns element | Returns last element
Single          | Throws exception         | Returns element | Throws exception
SingleOrDefault | Returns default(TSource) | Returns element | Throws exception

As you can see, for an input sequence with a single element, the results are remarkably uniform :) Likewise, for an empty input sequence, any operator without "OrDefault" throws an exception (InvalidOperationException, in fact) and any operator with "OrDefault" returns the default value for the element type (null for reference types, 0 for int etc).

The operators really differ if the (potentially filtered) input sequence contains multiple elements – First and Last do the obvious thing, and Single throws an exception. It’s worth noting that SingleOrDefault also throws an exception – it’s not like it’s saying, "If the sequence is a single element, return it – otherwise return the default value." If you want an operator which handles multiple elements, you should be using First or Last, with the "OrDefault" version if the sequence can legitimately have no elements.

Note that if you do use an "OrDefault" operator, the result is exactly the same for an empty input sequence as for an input sequence containing exactly one element which is the default value. (I’ll be looking at the DefaultIfEmpty operator next.)
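The table translates directly into code; a few spot checks of the standard behaviour, shown with int so that default(TSource) is 0:

```csharp
int[] empty = new int[0];
int[] single = { 5 };
int[] multiple = { 5, 10 };

Console.WriteLine(empty.FirstOrDefault());  // 0
Console.WriteLine(single.Single());         // 5
Console.WriteLine(multiple.First());        // 5
Console.WriteLine(multiple.Last());         // 10

try { empty.First(); }
catch (InvalidOperationException) { Console.WriteLine("First: empty sequence throws"); }

try { multiple.SingleOrDefault(); }
catch (InvalidOperationException) { Console.WriteLine("SingleOrDefault: multiple elements still throw"); }
```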

Now we know what the operators do, let’s test them.

What are we going to test?

This morning I tweeted that I had written 72 tests before writing any implementation. In fact I ended up with 80, for reasons we’ll come to in a minute. For each operator, I tested the following 12 cases:

  • Null source (predicate-less overload)
  • Null source (predicated overload)
  • Null predicate
  • Empty sequence, no predicate
  • Empty sequence, with a predicate
  • Single element sequence, no predicate
  • Single element sequence where the element matched the predicate
  • Single element sequence where the element didn’t match the predicate
  • Multiple element sequence, no predicate
  • Multiple element sequence where no elements matched the predicate
  • Multiple element sequence where one element matched the predicate
  • Multiple element sequence where multiple elements matched the predicate

These were pretty much cut and paste jobs – I used the same data for each test against each operator, and just changed the expected results.

There are two extra tests for each of First and FirstOrDefault, and two for each of Last and LastOrDefault:

  • First/FirstOrDefault should return as soon as they’ve seen the first element, when there’s no predicate; they shouldn’t iterate over the rest of the sequence
  • First/FirstOrDefault should return as soon as they’ve seen the first matching element, when there is a predicate
  • Last/LastOrDefault are optimized for the case where the source implements IList<T> and there’s no predicate: they use Count and the indexer to access the final element
  • Last/LastOrDefault are not optimized for the case where the source implements IList<T> but there is a predicate: they iterate through the entire sequence

The last two tests involved writing a new collection called NonEnumerableList which implements IList<T> by delegating everything to a backing List<T>, except for GetEnumerator() (both the generic and nongeneric forms) which simply throws a NotSupportedException. That should be a handy one for testing optimizations in the future. I’ll discuss the optimization for Last when we get there.
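Here’s a sketch of what NonEnumerableList looks like – member-by-member delegation, with both GetEnumerator methods throwing. (Names are as I’ve described them; the real class may differ in minor details.)

```csharp
// Implements IList<T> by delegating to a backing List<T>, but throws on
// any attempt to enumerate it - handy for proving that an optimized
// code path never calls GetEnumerator().
public sealed class NonEnumerableList<T> : IList<T>
{
    private readonly List<T> backing;

    public NonEnumerableList(params T[] items)
    {
        backing = new List<T>(items);
    }

    public T this[int index]
    {
        get { return backing[index]; }
        set { backing[index] = value; }
    }

    public int Count { get { return backing.Count; } }
    public bool IsReadOnly { get { return false; } }
    public void Add(T item) { backing.Add(item); }
    public void Clear() { backing.Clear(); }
    public bool Contains(T item) { return backing.Contains(item); }
    public void CopyTo(T[] array, int index) { backing.CopyTo(array, index); }
    public int IndexOf(T item) { return backing.IndexOf(item); }
    public void Insert(int index, T item) { backing.Insert(index, item); }
    public bool Remove(T item) { return backing.Remove(item); }
    public void RemoveAt(int index) { backing.RemoveAt(index); }

    // The whole point: enumeration is forbidden.
    public IEnumerator<T> GetEnumerator() { throw new NotSupportedException(); }
    IEnumerator IEnumerable.GetEnumerator() { throw new NotSupportedException(); }
}
```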

Let’s implement them!

These operators were more interesting to implement than I’d expect, so I’m actually going to show all twelve methods. It was rarely just a matter of cutting and pasting, other than for the argument validation.

Of course, if we chose to implement the predicated versions using "Where" and the non-predicated form, and the "OrDefault" versions by using "DefaultIfEmpty" followed by the non-defaulting version, we would only have had the three non-predicated, non-defaulting versions to deal with… but as I’ve said before, there are some virtues to implementing each operator separately.

For the sake of avoiding fluff, I’ve removed the argument validation from each method – but obviously it’s there in the real code. Let’s start with First:

public static TSource First<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (iterator.MoveNext())
        {
            return iterator.Current;
        }
        throw new InvalidOperationException("Sequence was empty");
    }
}

public static TSource First<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            return item;
        }
    }
    throw new InvalidOperationException("No items matched the predicate");
}

These look surprisingly different – and that was actually a deliberate decision. I could easily have implemented the predicate-less version with a foreach loop as well, just returning unconditionally from its body. However, I chose to emphasize the fact that we’re not looping in First: we simply move to the first element if we can, and return it or throw an exception. There’s no hint that we might ever call MoveNext again. In the predicated form, of course, we have to keep looping until we find a matching value – only throwing the exception when we’ve exhausted all possibilities.

Now let’s see how it looks when we return a default for empty sequences:

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        return iterator.MoveNext() ? iterator.Current : default(TSource);
    }
}

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            return item;
        }
    }
    return default(TSource);
}

Here the predicated form looks very similar to First, but the predicate-less one is slightly different: instead of using an if block (which would be a perfectly valid approach, of course) I’ve used the conditional operator. We’re going to return something, whether we manage to move to the first element or not. Arguably it would be nice if the conditional operator allowed the second or third operands to be "throw" expressions, taking the overall type of the expression from the other result operand… but it’s no great hardship.

Next up we’ll implement Single, which is actually closer to First than Last is, in some ways:

public static TSource Single<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        TSource ret = iterator.Current;
        if (iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence contained multiple elements");
        }
        return ret;
    }
}

public static TSource Single<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    TSource ret = default(TSource);
    bool foundAny = false;
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            if (foundAny)
            {
                throw new InvalidOperationException("Sequence contained multiple matching elements");
            }
            foundAny = true;
            ret = item;
        }
    }
    if (!foundAny)
    {
        throw new InvalidOperationException("No items matched the predicate");
    }
    return ret;
}

This is already significantly more complex than First. The predicate-less version starts off the same way, but if we manage to move to the first element, we have to remember that value (as we’re hoping to return it) and then try to move to the second element. This time, if the move succeeds, we have to throw an exception – otherwise we can return our saved value.

The predicated version is even hairier. We still need to remember the first matching value we find, but this time we’re looping – so we need to keep track of whether we’ve already seen a matching value or not. If we see a second match, we have to throw an exception… and we also have to throw an exception if we reach the end without finding any matches at all. Note that although we assign an initial value of default(TSource) to ret, we’ll never reach a return statement without assigning a value to it. However, the rules around definite assignment aren’t smart enough to cope with this, so we need to provide a "dummy" value to start with… and default(TSource) is really the only value available. There is an alternative approach without using a foreach statement, where you loop until you find the first match, assign it to a local variable which is only declared at that point, followed by a second loop ensuring that there aren’t any other matches. I personally think that’s a bit more complex, which is why I’ve just used the foreach here.
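For the curious, that alternative two-loop version might look like this – a sketch; I still prefer the foreach:

```csharp
public static TSource Single<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        // First loop: find the first match, or throw if there isn't one.
        while (true)
        {
            if (!iterator.MoveNext())
            {
                throw new InvalidOperationException("No items matched the predicate");
            }
            if (predicate(iterator.Current))
            {
                break;
            }
        }
        // The variable is only declared once we know a match exists,
        // so no dummy default(TSource) assignment is needed.
        TSource ret = iterator.Current;
        // Second loop: make sure there are no further matches.
        while (iterator.MoveNext())
        {
            if (predicate(iterator.Current))
            {
                throw new InvalidOperationException("Sequence contained multiple matching elements");
            }
        }
        return ret;
    }
}
```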

The difference when we implement SingleOrDefault isn’t quite as pronounced this time though:

public static TSource SingleOrDefault<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            return default(TSource);
        }
        TSource ret = iterator.Current;
        if (iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence contained multiple elements");
        }
        return ret;
    }
}

public static TSource SingleOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    TSource ret = default(TSource);
    bool foundAny = false;
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            if (foundAny)
            {
                throw new InvalidOperationException("Sequence contained multiple matching elements");
            }
            foundAny = true;
            ret = item;
        }
    }
    return ret;
}

This time we’ve just replaced a "throw" statement in the predicate-less method with a "return" statement, and removed the test for no matches being found in the predicated method. Here our assignment of default(TSource) to ret really works in our favour – if we don’t end up assigning anything else to it, we’ve already got the right return value!

Next up is Last:

public static TSource Last<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    IList<TSource> list = source as IList<TSource>;
    if (list != null)
    {
        if (list.Count == 0)
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        return list[list.Count - 1];
    }

    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        TSource last = iterator.Current;
        while (iterator.MoveNext())
        {
            last = iterator.Current;
        }
        return last;
    }
}

public static TSource Last<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    bool foundAny = false;
    TSource last = default(TSource);
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            foundAny = true;
            last = item;
        }
    }
    if (!foundAny)
    {
        throw new InvalidOperationException("No items matched the predicate");
    }
    return last;
}

Let’s start off with the optimization at the beginning of the predicate-less method. If we find out we’re dealing with a list, we can simply fetch the count, and then either throw an exception or return the element at the final index. As an extra bit of optimization I could store the count in a local variable – but I’m assuming that the count of an IList<T> is cheap to compute. If there are significant objections to that assumption, I’m happy to change it :) Note that this is another situation where I’m assuming that anything implementing IList<T> will only hold at most Int32.MaxValue items – otherwise the optimization will fail.

If we don’t follow the optimized path, we simply iterate over the sequence, updating a local variable with the last-known element on every single iteration. This time I’ve avoided the foreach loop for no particularly good reason – we could easily have had a foundAny variable which was just set to "true" on every iteration, and then tested at the end. In fact, that’s exactly the pattern the predicated method takes. Admittedly that decision is forced upon us to some extent – we can’t just move once and then take the first value as the "first last-known element", because it might not match the predicate.

There’s no optimization for the predicated form of Last. This follows LINQ to Objects, but I don’t honestly know the reason for it there. We could easily iterate backwards from the end of the sequence using the indexer on each iteration. One possible reason which makes a certain amount of sense is that when there’s a predicate, that predicate could throw an exception for some values – and if we just skipped to the end if the collection implements IList<T>, that would be an observable difference. I’d be interested to know whether or not that is the reason – if anyone has any inside information which they can share, I’ll update this post.
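Just to make the hypothetical concrete, the optimization LINQ to Objects doesn’t perform would look something like this – name and shape are entirely my own invention:

```csharp
// HYPOTHETICAL: not what LINQ to Objects does. Iterating backwards via
// the indexer evaluates the predicate against later elements first,
// which is observably different if the predicate throws for an earlier
// element that a forwards iteration would have reached.
public static TSource LastWithListOptimization<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    IList<TSource> list = source as IList<TSource>;
    if (list != null)
    {
        for (int i = list.Count - 1; i >= 0; i--)
        {
            if (predicate(list[i]))
            {
                return list[i];
            }
        }
        throw new InvalidOperationException("No items matched the predicate");
    }
    // Otherwise, fall back to straightforward forwards iteration.
    return source.Last(predicate);
}
```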

From here, we only have one more operator to implement – LastOrDefault:

public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    IList<TSource> list = source as IList<TSource>;
    if (list != null)
    {
        return list.Count == 0 ? default(TSource) : list[list.Count - 1];
    }

    TSource last = default(TSource);
    foreach (TSource item in source)
    {
        last = item;
    }
    return last;
}

public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    TSource last = default(TSource);
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            last = item;
        }
    }
    return last;
}

This time, aside from the optimization, the predicated and non-predicated forms look very similar… more so than for any of the other operators. In each case, we start with a return value of default(TSource), and iterate over the whole sequence, updating it – only doing so when it matches the predicate, if we’ve got one.

Conclusion

This was a longer post than I anticipated when I got up this morning, but I hope the slight differences between implementations – and the mystery of the unoptimized predicated "Last/LastOrDefault" operators – have made it worth slogging through.

As a contrast – and because I’ve already mentioned it in this post – I’ll implement DefaultIfEmpty next. I reckon I can still do that this evening, if I hurry…

Addendum

It turns out I was missing some tests for Single and SingleOrDefault: what should they do if evaluating the sequence fully throws an exception? It turns out that in LINQ to Objects, the overloads without a predicate throw InvalidOperationException as soon as they see a second element, but the overloads with a predicate keep iterating even when they’ve seen a second element matching a predicate. This seems ludicrously inconsistent to me – I’ve opened a Connect issue about it; we’ll see what happens.
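A quick way to observe the difference is a sequence which throws after its second element. (Helper name invented; the behaviour is as described above, and may well change if the Connect issue is acted on.)

```csharp
// Invented helper: yields two elements, then throws.
static IEnumerable<int> TwoThenBang()
{
    yield return 1;
    yield return 2;
    throw new Exception("Bang!");
}

// Without a predicate, Single throws InvalidOperationException as soon
// as it sees the second element - "Bang!" is never reached:
//     TwoThenBang().Single();

// With a predicate, Single keeps iterating even after a second match,
// so the "Bang!" escapes instead:
//     TwoThenBang().Single(x => true);
```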

Reimplementing LINQ to Objects: Part 10 – Any and All

Another day, another blog post. I should emphasize that this rate of posting is likely to be short-lived… although if I get into the habit of writing a post on the morning commute when I go back to work after the Christmas holidays, I could keep ploughing through until we’re done.

Anyway, today we have a pair of operators: Any and All.

What are they?

"Any" has two overloads; there’s only one for "All":

public static bool Any<TSource>(
    this IEnumerable<TSource> source)

public static bool Any<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static bool All<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

The names really are fairly self-explanatory:

  • "Any" without a predicate returns whether there are any elements in the input sequence
  • "Any" with a predicate returns whether any elements in the input sequence match the predicate
  • "All" returns whether all the elements in the input sequence match the given predicate

Both operators use immediate execution – they don’t return until they’ve got the answer, basically.

Importantly, "All" has to read through the entire input sequence to return true, but can return as soon as it’s found a non-matching element; "Any" can return true as soon as it’s found a matching element, but has to iterate over the entire input sequence in order to return false. This gives rise to one very simple LINQ performance tip: it’s almost never a good idea to use a query like

// Don’t use this
if (query.Count() != 0)

That has to iterate over *all* the results in the query… when you really only care whether or not there are any results. So use "Any" instead:

// Use this instead
if (query.Any())

If this is part of a bigger LINQ to SQL query, it may not make a difference – but in LINQ to Objects it can certainly be a huge boon.
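The starkest illustration is an infinite sequence – contrived, of course:

```csharp
// An endless sequence of integers.
static IEnumerable<int> Naturals()
{
    int i = 0;
    while (true)
    {
        yield return i++;
    }
}

// Returns true as soon as the first element appears:
bool any = Naturals().Any();

// Would never return, as Count() must consume the whole sequence:
//     int count = Naturals().Count();
```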

Anyway, let’s get on to testing the three methods…

What are we going to test?

Feeling virtuous tonight, I’ve even tested argument validation again… although it’s easy to get that right here, as we’re using immediate execution.

Beyond that, I’ve tested a few scenarios:

  • An empty sequence will return false with Any, but true with All. (Whatever the predicate is for All, there are no elements which fail it.)
  • A sequence with any elements at all will make the predicate-less Any return true.
  • If all the elements don’t match the predicate, both Any and All return false.
  • If some elements match the predicate, Any will return true but All will return false.
  • If all elements match the predicate, All will return true.

Those are all straightforward, so I won’t give the code. One final test is interesting though: we prove that Any returns as soon as it’s got the result by giving it a query which will throw an exception if it’s iterated over completely. The easiest way of doing this is to start out with a sequence of integers including 0, and then use Select with a projection which divides some constant value by each element. In this test case, I’ve given it a value which will match the predicate before the value which will cause the exception to be thrown:

[Test]
public void SequenceIsNotEvaluatedAfterFirstMatch()
{
    int[] src = { 10, 2, 0, 3 };
    var query = src.Select(x => 10 / x);
    // This will finish at the second element (x = 2, so 10/x = 5)
    // It won’t evaluate 10/0, which would throw an exception
    Assert.IsTrue(query.Any(y => y > 2));
}

There’s an equivalent test for All, where a non-matching element occurs before the exceptional one.

So, with all the tests written, let’s get on with the interesting bit:

Let’s implement them!

The first thing to note is that all of these could be implemented in terms of either Any-with-a-predicate or All. For example, given All, we could implement Any as:

public static bool Any<TSource>(
    this IEnumerable<TSource> source)
{
    return source.Any(x => true);
}

public static bool Any<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return !source.All(x => !predicate(x));
}

It’s simplest to implement the predicate-less Any in terms of the predicated one – using a predicate which returns true for any element means that Any will return true for any element at all, which is what we want.

The inversions in the call to All take a minute to get your head round, but it’s basically De Morgan’s law in LINQ form: we effectively invert the predicate to find out if all of the elements don’t match the original predicate… then return the inverse. Due to the inversion, this still returns early in all the appropriate situations, too.
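The same trick works in reverse, of course – All in terms of the predicated Any (a sketch, with validation elided as before):

```csharp
public static bool All<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    // "All elements match" is equivalent to "no element fails to match".
    return !source.Any(x => !predicate(x));
}
```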

While we could do that, I’ve actually preferred a straightforward implementation of all of the separate methods:

public static bool Any<TSource>(
    this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
            
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        return iterator.MoveNext();
    }
}

public static bool Any<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }

    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            return true;
        }
    }
    return false;
}

public static bool All<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }

    foreach (TSource item in source)
    {
        if (!predicate(item))
        {
            return false;
        }
    }
    return true;
}

Aside from anything else, this makes it obvious where the "early out" comes in each case – and also means that any stack traces generated are rather easier to understand. It would be quite odd from a client developer’s point of view to call Any but see All in the stack trace, or vice versa.

One interesting point to note is that I don’t actually use a foreach loop in Any – although I could, of course. Instead, I just get the iterator and then return whether the very first call to MoveNext indicates that there are any elements. I like the fact that reading this method it’s obvious (at least to me) that we really couldn’t care less what the value of the first element is – because we never ask for it.

Conclusion

Probably the most important lesson here is the advice to use Any (without a predicate) instead of Count when you can. The rest was pretty simple – although it’s always fun to see one operator implemented in terms of another.

So, what next? Possibly Single/SingleOrDefault/First/FirstOrDefault/Last/LastOrDefault. I might as well do them all together – partly as they’re so similar, and partly to emphasize the differences which do exist.

Reimplementing LINQ to Objects: Part 9 – SelectMany

The next operator we’ll implement is actually the most important in the whole of LINQ. Most (all?) of the other operators returning sequences can be implemented via SelectMany. We’ll have a look at that at the end of this post, but let’s implement it first.

What is it?

SelectMany has 4 overloads, which look gradually more and more scary:

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TResult>> selector)

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, int, IEnumerable<TResult>> selector)

public static IEnumerable<TResult> SelectMany<TSource, TCollection, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)

public static IEnumerable<TResult> SelectMany<TSource, TCollection, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, int, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)

These aren’t too bad though. Really these are just variations of the same operation, with two "optional" bits.

In every case, we start with an input sequence. We generate a subsequence from each element in the input sequence using a delegate which can optionally take a parameter with the index of the element within the original collection.

Now, we either return each element from each subsequence directly, or we apply another delegate which takes the original element in the input sequence and the element within the subsequence.

In my experience, uses of the overloads where the original selector delegate uses the index are pretty rare – but the others (the first and the third in the list above) are fairly common. In particular, the C# compiler uses the third overload whenever it comes across a "from" clause in a query expression, other than the very first "from" clause.

It helps to put this into a bit more context. Suppose we have a query expression like this:

var query = from file in Directory.GetFiles("logs")
            from line in File.ReadLines(file)
            select Path.GetFileName(file) + ": " + line;

That would be converted into a "normal" call like this:

var query = Directory.GetFiles("logs")
                     .SelectMany(file => File.ReadLines(file),
                                 (file, line) => Path.GetFileName(file) + ": " + line);

In this case the compiler has used our final "select" clause as the projection; if the query expression had continued with "where" clauses etc, it would have created a projection to just pass along "file" and "line" in an anonymous type. This is probably the most confusing bit of the query translation process, involving transparent identifiers. For the moment we’ll stick with the simple version above.
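As a sketch of what that looks like when the query does continue, the anonymous-type pair below is illustrative rather than the compiler's exact transparent identifier:

```csharp
// Hypothetical continuation of the query with a "where" clause; the compiler
// passes both range variables along in an anonymous type, roughly like this:
var query = Directory.GetFiles("logs")
    .SelectMany(file => File.ReadLines(file),
                (file, line) => new { file, line })
    .Where(pair => pair.line.Length > 0)
    .Select(pair => Path.GetFileName(pair.file) + ": " + pair.line);
```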

So, the SelectMany call above has three arguments really:

  • The source, which is a list of strings (the filenames returned from Directory.GetFiles)
  • An initial projection which converts from a single filename to a list of the lines of text within that file
  • A final projection which converts a (file, line) pair into a single string, just by separating them with ": ".

The result is a single sequence of strings – every line of every log file, prefixed with the filename in which it appeared. So writing out the results of the query might give output like this:

test1.log: foo
test1.log: bar
test1.log: baz
test2.log: Second log file
test2.log: Another line from the second log file

It can take a little while to get your head round SelectMany – at least it did for me – but it’s a really important one to understand.

A few more details of the behaviour before we go into testing:

  • The arguments are validated eagerly – everything has to be non-null.
  • Everything is streamed. So only one element of the input is read at a time, and then a subsequence is produced from that. Only one element is then read from the subsequence at a time, yielding the results as we go, before we move onto the next input element and thus the next subsequence etc.
  • Every iterator is closed when it’s finished with, just as you’d expect by now.
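The streaming behaviour is easy to observe with a quick sketch (the console logging here is purely for illustration):

```csharp
var query = new[] { 1, 2 }.SelectMany(x =>
{
    Console.WriteLine("Fetching subsequence for {0}", x);
    return new[] { x * 10, x * 10 + 1 };
});
foreach (int item in query)
{
    Console.WriteLine("Yielded {0}", item);
}
// The "Fetching" and "Yielded" lines interleave: each subsequence is only
// requested once the previous one has been exhausted
```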

What are we going to test?

I’m afraid I’ve become lazy by this point. I can’t face writing yet more tests for null arguments. I’ve written a single test for each of the overloads. I found it hard to come up with a clear way of writing the tests, but here’s one example, for the most complicated overload:

[Test]
public void FlattenWithProjectionAndIndex()
{
    int[] numbers = { 3, 5, 20, 15 };
    var query = numbers.SelectMany((x, index) => (x + index).ToString().ToCharArray(),
                                   (x, c) => x + ": " + c);
    // 3 => "3: 3"
    // 5 => "5: 6"
    // 20 => "20: 2", "20: 2"
    // 15 => "15: 1", "15: 8"
    query.AssertSequenceEqual("3: 3", "5: 6", "20: 2", "20: 2", "15: 1", "15: 8");
}

So, to give a bit more explanation to this:

  • Each number is summed with its index (3+0, 5+1, 20+2, 15+3)
  • Each sum is turned into a string, and then converted into a char array. (We don’t really need the ToCharArray call, as string implements IEnumerable<char> already, but I thought it made it clearer.)
  • We combine each subsequence character with the original element it came from, in the form: "original value: subsequence character"

The comment shows the eventual results from each input, and the final test shows the complete result sequence.

Clear as mud? Hopefully it’s not too bad when you look at each step in turn. Okay, now let’s make it pass…

Let’s implement it!

We could implement the first three overloads in terms of calls to the final one – or more likely, a single "Impl" method without argument validation, called by all four public methods. For example, the simplest method could be implemented like this:

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TResult>> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    return SelectManyImpl(source,
                          (value, index) => selector(value),
                          (originalElement, subsequenceElement) => subsequenceElement);
}

However, I’ve decided to implement each of the methods separately – splitting them into the public extension method and a "SelectManyImpl" method with the same signature each time. I think that would make it simpler to step through the code if there were ever any problems… and it allows us to see the differences between the simplest and most complicated versions, too:

// Simplest overload
private static IEnumerable<TResult> SelectManyImpl<TSource, TResult>(
    IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TResult>> selector)
{
    foreach (TSource item in source)
    {
        foreach (TResult result in selector(item))
        {
            yield return result;
        }
    }
}

// Most complicated overload:
// – Original projection takes index as well as value
// – There’s a second projection for each original/subsequence element pair
private static IEnumerable<TResult> SelectManyImpl<TSource, TCollection, TResult>(
    IEnumerable<TSource> source,
    Func<TSource, int, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)
{
    int index = 0;
    foreach (TSource item in source)
    {
        foreach (TCollection collectionItem in collectionSelector(item, index++))
        {
            yield return resultSelector(item, collectionItem);
        }
    }
}

The correspondence between the two methods is pretty clear… but I find it helpful to have the first form, so that if I ever get confused about the fundamental point of SelectMany, it’s really easy to understand it based on the simple overload. It’s then not too big a jump to apply the two extra "optional" complications, and end up with the final method. The simple overload acts as a conceptual stepping stone, in a way.

Two minor points to note:

  • The first method could have been implemented with a "yield foreach selector(item)" if such an expression existed in C#. Using a similar construct in the more complicated form would be harder, and involve another call to Select, I suspect… probably more hassle than it would be worth.
  • I’m not explicitly using a "checked" block in the second form, even though "index" could overflow. I haven’t looked to see what the BCL does in this situation, to be honest – I think it unlikely that it will come up. For consistency I should probably use a checked block on every method which uses an index like this… or just turn arithmetic checking on for the whole assembly.
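If we did want to guard against that overflow, one option (just a sketch – I haven't checked what the BCL actually does) is to wrap only the increment in a checked expression:

```csharp
private static IEnumerable<TResult> SelectManyImpl<TSource, TCollection, TResult>(
    IEnumerable<TSource> source,
    Func<TSource, int, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)
{
    int index = 0;
    foreach (TSource item in source)
    {
        // checked() makes an overflowing increment throw OverflowException
        // instead of silently wrapping round to int.MinValue
        foreach (TCollection collectionItem in collectionSelector(item, checked(index++)))
        {
            yield return resultSelector(item, collectionItem);
        }
    }
}
```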

Reimplementing operators using SelectMany

I mentioned early on in this post that many of the LINQ operators can be implemented via SelectMany. Just as a quick example of this, here are alternative implementations of Select, Where and Concat:

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    return source.SelectMany(x => Enumerable.Repeat(selector(x), 1));
}

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return source.SelectMany(x => Enumerable.Repeat(x, predicate(x) ? 1 : 0));
}

public static IEnumerable<TSource> Concat<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    return new[] { first, second }.SelectMany(x => x);
}

Select and Where use Enumerable.Repeat as a convenient way of creating a sequence with either a single element or none. You could alternatively create a new array instead. Concat just uses an array directly: if you think of SelectMany in terms of its flattening operation, Concat is a really natural fit. I suspect that Empty and Repeat are probably feasible with recursion, although the performance would become absolutely horrible.
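For the curious, here's one hypothetical way Repeat could be expressed via SelectMany and recursion – definitely not something to use in practice:

```csharp
// Hypothetical recursive Repeat via SelectMany; each level of recursion adds
// another layer of iterators, so performance degrades horribly as count grows.
// (Argument validation omitted: a negative count is just treated as empty here.)
public static IEnumerable<TResult> Repeat<TResult>(TResult element, int count)
{
    return count <= 0
        ? Enumerable.Empty<TResult>()
        : new[] { element }.SelectMany(x => new[] { x }.Concat(Repeat(element, count - 1)));
}
```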

Currently the above implementations are in the code using conditional compilation. If this becomes a popular thing for me to implement, I might consider breaking it into a separate project. Let me know what you think – my gut feeling is that we won’t actually gain much more insight than the above methods give us… just showing how flexible SelectMany is.

SelectMany is also important in a theoretical way, in that it’s what provides the monadic nature of LINQ. It’s the "Bind" operation in the LINQ monad. I don’t intend to say any more than that on the topic – read Wes Dyer’s blog post for more details, or just search for "bind monad SelectMany" for plenty of posts from people smarter than myself.

Conclusion

SelectMany is one of LINQ’s fundamental operations, and at first sight it’s a fearsome beast. As soon as you understand that the basic operation is a flattening projection just with a couple of optional twiddles, it’s easily tamed.

Next up I’ll implement All and Any – which are nice and easy to describe by comparison.

Reimplementing LINQ to Objects: Part 8 – Concat

After our quick visit to scalar return types with Count and LongCount, we’re back to an operator returning a sequence: Concat.

What is it?

Concat only has a single signature, which makes life simple:

public static IEnumerable<TSource> Concat<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)

The return value is simply a sequence containing the elements of the first sequence followed by the elements of the second sequence – the concatenation of the two sequences.

I sometimes think it’s a pity that there aren’t Prepend/Append methods which do the same thing but for a single extra element – this would be quite useful in situations such as having a drop down list of countries with an extra option of "None". It’s easy enough to use Concat for this purpose by creating a single-element array, but specific methods would be more readable in my view. MoreLINQ has extra Concat methods for this purpose, but Edulinq is only meant to implement the methods already in LINQ to Objects.
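Such methods would be trivial to sketch in terms of Concat – these are hypothetical extensions rather than part of Edulinq, and argument validation is omitted for brevity:

```csharp
// Hypothetical Prepend/Append extensions built on Concat
public static IEnumerable<TSource> Prepend<TSource>(
    this IEnumerable<TSource> source, TSource element)
{
    return new[] { element }.Concat(source);
}

public static IEnumerable<TSource> Append<TSource>(
    this IEnumerable<TSource> source, TSource element)
{
    return source.Concat(new[] { element });
}
```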

As ever, some notes on the behaviour of Concat:

  • Arguments are validated eagerly: they must both be non-null
  • The result uses deferred execution: other than validation, the arguments aren’t used when the method is first called
  • Each sequence is only evaluated when it needs to be… if you stop iterating over the output sequence before the first input has been exhausted, the second input will remain unused

That’s basically it.

What are we going to test?

The actual concatenation part of the behaviour is very easy to test in a single example – we could potentially also demonstrate concatenation using empty sequences, but there’s no reason to suspect they would fail.

The argument validation is tested in the same way as normal, by calling the method with invalid arguments but not attempting to use the returned query.

Finally, there are a couple of tests to indicate the point at which each input sequence is used. This is achieved using the ThrowingEnumerable we originally used in the Where tests:

[Test]
public void FirstSequenceIsntAccessedBeforeFirstUse()
{
    IEnumerable<int> first = new ThrowingEnumerable();
    IEnumerable<int> second = new int[] { 5 };
    // No exception yet…
    var query = first.Concat(second);
    // Still no exception…
    using (var iterator = query.GetEnumerator())
    {
        // Now it will go bang
        Assert.Throws<InvalidOperationException>(() => iterator.MoveNext());
    }
}

[Test]
public void SecondSequenceIsntAccessedBeforeFirstUse()
{
    IEnumerable<int> first = new int[] { 5 };
    IEnumerable<int> second = new ThrowingEnumerable();
    // No exception yet…
    var query = first.Concat(second);
    // Still no exception…
    using (var iterator = query.GetEnumerator())
    {
        // First element is fine…
        Assert.IsTrue(iterator.MoveNext());
        Assert.AreEqual(5, iterator.Current);
        // Now it will go bang, as we move into the second sequence
        Assert.Throws<InvalidOperationException>(() => iterator.MoveNext());
    }
}

I haven’t written tests to check that iterators are disposed, etc – but each input sequence’s iterator should be disposed appropriately. In particular, it’s natural for the first sequence’s iterator to be disposed before the second sequence is iterated over at all.

Let’s implement it!

The implementation is reasonably simple, but it does make me hanker after F#… it’s the normal split between argument validation and iterator block implementation, but each part is really simple:

public static IEnumerable<TSource> Concat<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    return ConcatImpl(first, second);
}

private static IEnumerable<TSource> ConcatImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    foreach (TSource item in first)
    {
        yield return item;
    }
    foreach (TSource item in second)
    {
        yield return item;
    }
}

It’s worth just remembering at this point that this would still have been very annoying to implement without iterator blocks. Not really difficult as such, but we’d have had to remember which sequence we were currently iterating over (if any) and so on.

However, using F# we could have made this even simpler with the yield! expression, which yields a whole sequence instead of a single item. Admittedly in this case there aren’t significant performance benefits to using yield! (which there certainly can be in recursive situations) but it would just be more elegant to have the ability to yield an entire sequence in one statement. (Spec# has a similar construct called nested iterators, expressed using yield foreach.) I’m not going to pretend to know enough about the details of either F# or Spec# to draw more detailed comparisons, but we’ll see the pattern of "foreach item in a collection, yield the item" several more times before we’re done. Remember that we can’t extract that into a library method, as the "yield" expression needs special treatment by the C# compiler.

Conclusion

Even when presented with a simple implementation, I can still find room to gripe :) It would be nice to have nested iterators in C#, but to be honest the number of times I find myself frustrated by their absence is pretty small.

Concat is a useful operator, but it’s really only a very simple specialization of another operator: SelectMany. After all, Concat just flattens two sequences into one… whereas SelectMany can flatten a whole sequence of sequences, with even more generality available when required. I’ll implement SelectMany next, and show a few examples of how other operators can be implemented simply in terms of SelectMany. (We’ll see the same sort of ability for operators returning a single value when we implement Aggregate.)

Addendum: avoiding holding onto references unnecessarily

A comment suggested that we should set first to "null" after we’ve used it. That way, as soon as we’ve finished iterating over the collection, it may be eligible for garbage collection. That leads to an implementation like this:

private static IEnumerable<TSource> ConcatImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    foreach (TSource item in first)
    {
        yield return item;
    }
    // Avoid hanging onto a reference we don’t really need
    first = null;
    foreach (TSource item in second)
    {
        yield return item;
    }
}

Now normally I’d say this wouldn’t actually help – setting a local variable to null when it’s not used in the rest of the method doesn’t actually make any difference when the CLR is running in optimized mode, without a debugger attached: the garbage collector only cares about variables which might still be accessed in the rest of the method.

In this case, however, it makes a difference – because this isn’t a normal local variable. It ends up as an instance variable in the hidden class generated by the C# compiler… and the CLR can’t tell that the instance variable will never be used again.

Arguably we could remove our only reference to "first" at the start of GetEnumerator. We could write a method of the form:

public static T ReturnAndSetToNull<T>(ref T value) where T : class
{
    T tmp = value;
    value = null;
    return tmp;
}

and then call it like this:

foreach (TSource item in ReturnAndSetToNull(ref first))

I would certainly consider that overkill, particularly as it seems very likely that the iterator will still have a reference to the collection itself – but simply setting "first" to null after iterating over it makes sense to me.

I don’t believe that the "real" LINQ to Objects implementation does this, mind you. (At some point I’ll test it with a collection which has a finalizer.)

Reimplementing LINQ to Objects: Part 7 – Count and LongCount

Today’s post covers two operators in one, because they’re so incredibly similar… to the point that one implementation is a cut and paste of the other, merely changing the name, the return type, and a couple of variable types.

What are they?

Count and LongCount each have two overloads: one with a predicate, and one without. Here are all four signatures:

public static int Count<TSource>(
    this IEnumerable<TSource> source)

public static int Count<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static long LongCount<TSource>(
    this IEnumerable<TSource> source)

public static long LongCount<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

As you can see, the LongCount signatures are identical to Count except in terms of their return types, which are long (Int64) instead of int (Int32).

The overloads without a predicate parameter simply return the number of items in the source collection; the ones with a predicate return the number of items for which that predicate returns true.

Interesting aspects of the behaviour:

  • These are all extension methods on IEnumerable<T> – you might argue that for the versions without a predicate, it would have been better to extend the non-generic IEnumerable, as nothing actually requires the element type.
  • Where there’s no predicate, Count is optimized for ICollection<T> and (in .NET 4) ICollection – both of which have Count properties which are expected to be faster than iterating over the entire collection. LongCount is not optimized in the same way in the .NET implementation – I’ll discuss this in the implementation section.
  • No optimization is performed in the overloads with predicates, as basically there’s no way of telling how many items will "pass" the predicate without testing them.
  • All methods use immediate execution – nothing is deferred. (If you think about it, there’s nothing which can be deferred here, when we’re just returning an int or a long.)
  • All arguments are validated simply by testing they’re non-null
  • Both methods should throw OverflowException when given a collection with more items than they can return the count of… though this is a considerably larger number in the case of LongCount than Count, of course.

What are we going to test?

In some senses, there are only two "success" tests involved here: one without a predicate and one with. Those are easy enough to deal with, but we also want to exercise the optimized paths. That’s actually trickier than it might sound, as we want to test four situations:

  • A source which implements both ICollection<T> and ICollection (easy: use List<T>)
  • A source which implements ICollection<T> but not ICollection (reasonably easy, after a little work finding a suitable type: use HashSet<T>)
  • A source which implements ICollection but not ICollection<T> but still implements IEnumerable<T> (so that we can extend it) – tricky…
  • A source which doesn’t implement ICollection or ICollection<T> (easy: use Enumerable.Range which we’ve already implemented)

The third bullet is the nasty one. Obviously there are plenty of implementations of ICollection but not ICollection<T> (e.g. ArrayList) but because it doesn’t implement IEnumerable<T>, we can’t call the Count extension method on it. In the end I wrote my own SemiGenericCollection class.
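For reference, here's a sketch of the kind of thing I mean – a collection implementing the nongeneric ICollection alongside IEnumerable<T>, but not ICollection<T>. The details are illustrative rather than the exact Edulinq test class:

```csharp
// Implements ICollection and IEnumerable<int>, but *not* ICollection<int> -
// exercising the nongeneric optimization path in Count
public sealed class SemiGenericCollection : ICollection, IEnumerable<int>
{
    private readonly List<int> list = new List<int>();

    public void Add(int item) { list.Add(item); }
    public int Count { get { return list.Count; } }
    public object SyncRoot { get { return this; } }
    public bool IsSynchronized { get { return false; } }
    public void CopyTo(Array array, int index) { ((ICollection)list).CopyTo(array, index); }
    public IEnumerator<int> GetEnumerator() { return list.GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```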

Once we’ve got sources for all those tests, we need to decide what we’re actually testing about them. Arguably we should test that the result is optimized, for example by checking that we never really enumerate the collection. That would require writing custom collections with GetEnumerator() methods which threw exceptions, but still returned a count from the Count property. I haven’t gone this far, but it’s another step we certainly could take.

For the overloads which take predicates, we don’t need to worry about the various collection interfaces as we’re not optimizing anyway.

The failure cases for null arguments are very simple, but there’s one other case to consider: overflow. For Count, I’ve implemented a test case to verify the overflow behaviour. Unfortunately we can’t run this test in the Edulinq implementation yet, as it requires Enumerable.Concat, but here it is for the record anyway:

[Test]
[Ignore("Takes an enormous amount of time!")]
public void Overflow()
{
    var largeSequence = Enumerable.Range(0, int.MaxValue)
                                  .Concat(Enumerable.Range(0, 1));
    Assert.Throws<OverflowException>(() => largeSequence.Count());
}

This guards against a bad implementation which overflows by simply wrapping the counter round to Int32.MinValue instead of throwing an exception.

As you can see, this test will be disabled even when it’s uncommented after we implement Concat, as it requires counting up to 2 billion – not great for a quick set of unit tests. Even that isn’t too bad, however, compared with the equivalent in LongCount which would have to count 2^63 items. Generating such a sequence isn’t difficult, but iterating over it all would take a very long time. We also need an equivalent test for the overload with a predicate – something I neglected until writing up this blog post, and finding a bug in the implementation as I did so :)
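The equivalent predicated test might look something like this – same caveats about execution time, and the always-true predicate is deliberately trivial:

```csharp
[Test]
[Ignore("Takes an enormous amount of time!")]
public void PredicatedOverflow()
{
    var largeSequence = Enumerable.Range(0, int.MaxValue)
                                  .Concat(Enumerable.Range(0, 1));
    // Every element passes the predicate, so the count still overflows
    Assert.Throws<OverflowException>(() => largeSequence.Count(x => true));
}
```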

For LongCount, I merely have an equivalent test to the above which checks that the same sequence can have its length expressed as a long value.

Let’s implement them!

We’ll look at the overload which does have a predicate first – as it’s actually simpler:

public static int Count<TSource>(this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }

    // No way of optimizing this
    checked
    {
        int count = 0;
        foreach (TSource item in source)
        {
            if (predicate(item))
            {
                count++;
            }
        }
        return count;
    }
}

Note that this time we’re not using an iterator block (we’re not returning a sequence), so we don’t need to split the implementation into two different methods just to get eager argument validation.

After the argument validation, the main part of the method is reasonably simple, with one twist: we’re performing the whole iteration within a "checked" context. This means that if the increment of count overflows, it will throw an OverflowException instead of wrapping round to a negative number. There are some other alternatives here:

  • We could have made just the increment statement checked instead of the whole second part of the method
  • We could have explicitly tested for count == int.MaxValue before incrementing, and thrown an exception in that case
  • We could just build the whole assembly in a checked context

I think it’s useful for this section of code to be explicitly checked – it makes it obvious that it really is a requirement for general correctness. You may well prefer to make only the increment operation checked – I personally believe that the current approach draws more attention to the checked-ness, but it’s definitely a subjective matter. It’s also possible that an explicit check could be faster, although I doubt it – I haven’t benchmarked either approach.
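For comparison, the explicit-test alternative might be sketched like this (argument validation omitted for brevity):

```csharp
public static int Count<TSource>(this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    int count = 0;
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            // Explicit test instead of a checked block
            if (count == int.MaxValue)
            {
                throw new OverflowException("Sequence contained too many matching elements");
            }
            count++;
        }
    }
    return count;
}
```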

Other than the predicate-specific parts, all the above code also appears in the optimized Count implementation – so I won’t discuss those again. Here’s the full method:

public static int Count<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    // Optimization for ICollection<T>
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        return genericCollection.Count;
    }

    // Optimization for ICollection
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        return nonGenericCollection.Count;
    }

    // Do it the slow way – and make sure we overflow appropriately
    checked
    {
        int count = 0;
        using (var iterator = source.GetEnumerator())
        {
            while (iterator.MoveNext())
            {
                count++;
            }
        }
        return count;
    }
}

The only "new" code here is the optimization. There are effectively two equivalent blocks, just testing for different collection interface types, and using whichever one it finds first (if any). I don’t know whether the .NET implementation tests for ICollection or ICollection<T> first – I could test it by implementing both interfaces but returning different counts from each, of course, but that’s probably overkill. It doesn’t really matter for well-behaved collections other than the slight performance difference – we want to test the "most likely" interface first, which I believe is the generic one.

To optimize or not to optimize?

The LongCount implementations are exactly the same as those for Count, except using long instead of int.

Notably, I still use optimizations for ICollection and ICollection<T> – but I don’t believe the .NET implementation does so. (It’s easy enough to tell by creating a huge list of bytes and comparing the time taken for Count and LongCount.)
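To make that concrete, here's a sketch of the predicate-less LongCount following the Count implementation above – structurally identical, just with a long counter (and the checked block retained for consistency, even though overflowing a long counter is entirely theoretical):

```csharp
public static long LongCount<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    // Same optimizations as Count: use the Count property where available
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        return genericCollection.Count;
    }
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        return nonGenericCollection.Count;
    }

    checked
    {
        long count = 0;
        using (var iterator = source.GetEnumerator())
        {
            while (iterator.MoveNext())
            {
                count++;
            }
        }
        return count;
    }
}
```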

There’s an argument for using Array.GetLongLength when the source is an array… but I don’t think the current CLR supports arrays with more than Int32.MaxValue elements anyway, so it’s a bit of a non-issue other than for future-proofing. Beyond that, I’m not sure why the .NET implementation isn’t optimized. It’s not clear what an ICollection/ICollection<T> implementation is meant to return from its Count property if it has more than Int32.MaxValue elements anyway, to be honest.

Suggestions as to what I should have done are welcome… but I should probably point out that LongCount is more likely to be used against Queryable than Enumerable – it’s easy to imagine a service representing a collection (such as a database table) which can quickly tell you the count even when it’s very large. I would imagine that there are relatively few cases where you have a collection to evaluate in-process where you really just want to iterate through the whole lot just to get the count.

Conclusion

These are our first LINQ operators which return scalar values instead of sequences – with the natural consequence that they’re simpler to understand in terms of control flow and timing. The methods simply execute – possibly with some optimization – and return their result. Nice and simple. Still, we’ve seen there can still be a few interesting aspects to consider, including questions around optimization which don’t necessarily have a good answer.

Next time, I think I’ll implement Concat – mostly so that I can uncomment the overflow tests for Count. That’s going back to an operator which returns a sequence, but it’s a really simple one…

Reimplementing LINQ to Objects: Part 6 – Repeat

A trivial method next, with even less to talk about than "Empty"… "Repeat". This blog post is merely a matter of completeness.

What is it?

"Repeat" is a static, generic non-extension method with a single overload:

public static IEnumerable<TResult> Repeat<TResult>(
    TResult element,
    int count)

It simply returns a sequence which contains the specified element, repeated "count" times. The only argument validation is that "count" has to be non-negative.

What are we going to test?

There’s really not a lot to test here. I’ve thought of 4 different scenarios:

  • A vanilla "repeat a string 3 times" sequence
  • An empty sequence (repeat an element 0 times)
  • A sequence containing null values (just to prove that "element" can be null)
  • A negative count to prove that argument validation occurs, and does so eagerly.

None of this is remotely exciting, I’m afraid.
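As a sketch, the first two of those tests might look like this (AssertSequenceEqual is the same test helper used throughout the series):

```csharp
[Test]
public void SimpleRepeat()
{
    Enumerable.Repeat("foo", 3).AssertSequenceEqual("foo", "foo", "foo");
}

[Test]
public void EmptyRepeat()
{
    Enumerable.Repeat("foo", 0).AssertSequenceEqual();
}
```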

Let’s implement it!

Just about the only thing we could do wrong here is to put the argument validation directly in an iterator block… and we’ve implemented the "split method" pattern so many times already that we wouldn’t fall into that trap. So, here’s the code in all its tedious lack of glory:

public static IEnumerable<TResult> Repeat<TResult>(TResult element, int count)
{
    if (count < 0)
    {
        throw new ArgumentOutOfRangeException("count");
    }
    return RepeatImpl(element, count);
}

private static IEnumerable<TResult> RepeatImpl<TResult>(TResult element, int count)
{
    for (int i = 0; i < count; i++)
    {
        yield return element;
    }
}

That’s it. Um, interesting points to note… none.

Conclusion

There’s no sense in dragging this out. That’s the lot. Next up, Count and LongCount – which actually do have a couple of interesting points.