Reimplementing LINQ to Objects: Part 23 – Take/Skip/TakeWhile/SkipWhile

I genuinely expected these operators to be simple. At the time of this writing, I’m onto my third implementation of SkipWhile, struggling to find code which obviously expresses what’s going on. I find it interesting how the simplest sounding operators sometimes end up being trickier to implement than one might think.

What are they?

Take, TakeWhile, Skip and SkipWhile form a natural group of operators. Together they are known as the partitioning operators. Here are the signatures:

public static IEnumerable<TSource> Take<TSource>(
    this IEnumerable<TSource> source,
    int count)

public static IEnumerable<TSource> TakeWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static IEnumerable<TSource> TakeWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)

public static IEnumerable<TSource> Skip<TSource>(
    this IEnumerable<TSource> source,
    int count)

public static IEnumerable<TSource> SkipWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static IEnumerable<TSource> SkipWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)

The behaviour is reasonably self-explanatory:

  • source.Take(n) returns a sequence of the first n elements of source
  • source.Skip(n) returns a sequence containing all but the first n elements of source
  • source.TakeWhile(predicate) returns a sequence of the first elements of source which match the given predicate; it stops as soon as the predicate fails to match
  • source.SkipWhile(predicate) returns a sequence of all but the the first elements of source which match the given predicate; it starts yielding results as soon as the predicate fails to match

The only reason there are two overloads for TakeWhile and SkipWhile is so that you can provide a predicate which uses the index within the sequence, just like the overloads for Select and Where.

Each operator effectively partitions the input sequence into two parts, and yields either the first or second part. So for any "normal" collection (which returns the same results each time you iterate over it) the following are true:

  • source.Take(n).Concat(source.Skip(n)) is equivalent to source
  • source.TakeWhile(predicate).Concat(source.SkipWhile(predicate)) is equivalent to source (as noted in the comments, this is also assuming the predicate acts solely the same way each time it’s given the same data, without any side-effects)

A few notes on general behaviour:

  • The arguments are validated eagerly: source and predicate can’t be null
  • count can be negative (in which case it’s equivalent to 0) and can be larger than the collection too. Basically any value is valid.
  • Execution is deferred.
  • The source sequence is streamed – there’s no need to buffer anything.

What are we going to test?

One nice thing about the properties I described earlier is that it’s very easy to convert tests for Take/TakeWhile into tests for Skip/SkipWhile: you just change the expected sequence in Skip to be everything in the source sequence which wasn’t in the test for Take. For Take/Skip I’ve just used Enumerable.Range(0, 5) (i.e. 0, 1, 2, 3, 4) as the source sequence; for TakeWhile/SkipWhile I’ve used { "zero", "one", "two", "three", "four", "five" } – so each element is the word corresponding to the index. That makes it reasonably easy to write tests with a predicate involving the index.

I’ve tested:

  • Argument validation
  • Deferred execution
  • For Skip/Take:
    • A negative count
    • A count of 0
    • A count to split the sequence into non-empty parts
    • A count exactly equal to the source length (5)
    • A count larger than the source length
  • For SkipWhile/TakeWhile (both overloads in each case):
    • A predicate which fails on the first element
    • A predicate which matches some elements but not others
    • A predicate which matches all elements

The tests aren’t generally interesting, but as it can be slightly tricky to get your head round TakeWhile and SkipWhile to start with, here’s an example of each:

// In TakeWhileTest
[Test]
public void PredicateMatchingSomeElements()
{
    string[] source = { "zero", "one", "two", "three", "four", "five" };
    source.TakeWhile(x => x.Length < 5).AssertSequenceEqual("zero", "one", "two");
}

// In SkipWhileTest
[Test]
public void PredicateMatchingSomeElements()
{
    string[] source = { "zero", "one", "two", "three", "four", "five" };
    source.SkipWhile(x => x.Length < 5).AssertSequenceEqual("three", "four", "five");
}

There’s been some discussion on Twitter about the best kind of test here, with suggestions of varying the data instead of the predicate, and using sequences of Boolean values. Personally I prefer the tests the way they are – I find this sequence of strings easy to work with, and the fact that it is a sequence of strings means you can’t get the index/value parameter order wrong when providing a predicate which uses the indexes, unlike if you use an integer sequence. A Boolean sequence isn’t clear enough for me – the result isn’t obviously the first or second part of a sequence, whereas with these words, it’s obvious what’s going on… to me, at least. It’s all a matter of personal preference though, and it’s good to think about alternatives.

Let’s implement them!

Let me present this in chronological order. I started off by implementing Take and Skip, as they sounded the simplest. Indeed, they’re not too bad – here are the iterator block parts, after argument validation in the public method, of course:

private static IEnumerable<TSource> TakeImpl<TSource>(
    this IEnumerable<TSource> source,
    int count)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        for (int i = 0; i < count && iterator.MoveNext(); i++)
        {
            yield return iterator.Current;
        }
    }
}

private static IEnumerable<TSource> SkipImpl<TSource>(
    this IEnumerable<TSource> source,
    int count)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        for (int i = 0; i < count; i++)
        {
            if (!iterator.MoveNext())
            {
                yield break;
            }
        }
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}

These really aren’t too bad, although it’s interesting to note that Skip is more complicated than Take: Skip has to deal with both sections of the sequence (the "ignore" part and the "yield" part) whereas Take can return as soon as it’s finished with the "yield" part. We’ll see the same pattern in TakeWhile/SkipWhile.

Speaking of which, let’s look at them. There’s a decision to make in terms of whether to implement the overload whose predicate doesn’t use the index in terms of the one which does. It would certainly be simple, but I don’t like the double level of indirection with one delegate calling another. It may well be irrational, given that I’ve been introducing identity delegates elsewhere – but it just didn’t feel right. I’ve added it into the source code using conditional compilation just for completeness, but it’s not the default implementation.

Now, the "Impl" bits of TakeWhile really weren’t too bad:

private static IEnumerable<TSource> TakeWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    foreach (TSource item in source)
    {
        if (!predicate(item))
        {
            yield break;
        }
        yield return item;
    }
}

private static IEnumerable<TSource> TakeWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)
{
    int index = 0;
    foreach (TSource item in source)
    {
        if (!predicate(item, index))
        {
            yield break;
        }
        index++;
        yield return item;
    }
}

Again, we don’t need to worry about the second part of the sequence – the part we’re ignoring. We can just let the method complete, and the result iterator terminate. In particular, we don’t care about the value that caused the predicate to return false. Now let’s look at SkipWhile (again, just the interesting bits):

private static IEnumerable<TSource> SkipWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        while (iterator.MoveNext())
        {
            TSource item = iterator.Current;
            if (!predicate(item))
            {
                // Stop skipping now, and yield this item
                yield return item;
                break;
            }
        }
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}

private static IEnumerable<TSource> SkipWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        int index = 0;
        while (iterator.MoveNext())
        {
            TSource item = iterator.Current;
            if (!predicate(item, index))
            {
                // Stop skipping now, and yield this item
                yield return item;
                break;
            }
            index++;
        }
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}

The above code is my third attempt. You can look at the first two in the history of SkipWhile.cs – you may find one of those preferable. Other suggestions have been made, including ones using do/while… I must admit, the do/while option ends up being quite attractive, but again I have a general aversion to it as a construct. Part of the trickiness is that having found an element which makes the predicate return false, we have to yield that element before we move on to yielding the rest. It’s easy enough to do, but the extra yield return statement feels ugly.

One point to note is that if the "skipping" part finishes due to a call to MoveNext returning false (rather than the predicate failing) then we’ll call MoveNext again (at the start of the "yielding" loop). That’s explicitly called out as being okay in the IEnumerator.MoveNext() documentation, so I’m happy enough to use it.

I’m certainly not happy about those implementations, but I’m confident they at least work. Now, let’s revisit Skip and Take. Is there any way we can simplify them? Absolutely – all we need to do is use TakeWhile or SkipWhile with a predicate which uses the index and compares it to the count we’ve been given:

public static IEnumerable<TSource> Take<TSource>( 
    this IEnumerable<TSource> source, 
    int count) 
{
    // Buggy! See addendum…
    return source.TakeWhile((x, index) => index < count); 

public static IEnumerable<TSource> Skip<TSource>( 
    this IEnumerable<TSource> source, 
    int count) 

    return source.SkipWhile((x, index) => index < count); 
}

TakeWhile and SkipWhile even do the appropriate argument validation for us. As you can see, I’ve become less averse to implementing one operator using another over time… both implementations are still in the source, using conditional compilation, but currently this short form is the one which is "active". I could be persuaded to change my mind :)

An optimized Skip?

Although most of these operations can’t be sensibly optimized, it would make sense to optimize Skip when the source implements IList<T>. We can skip the skipping, so to speak, and go straight to the appropriate index. This wouldn’t spot the case where the source was modified between iterations, which may be one reason it’s not implemented in the framework as far as I’m aware. The optimization would look something like this:

var list = source as IList<TSource>;
if (list != null)
{
    // Make sure we don’t use a negative index
    count = Math.Max(count, 0);
    // Note that "count" is the count of items to skip
    for (int index = count; index < list.Count; index++)
    {
        yield return list[index];
    }
    yield break;
}

Should Edulinq do this? Answers on a postcard…

EDIT: There’s another optimization we can potentially make, even earlier. If we’re given a count which is zero or negative, Skip is effectively pointless – so we can just return the original reference. This would come just after the argument validation, not in the iterator block:

if (count <= 0)
{
    return source;
}

Returning the original reference has some implications in terms of not isolating the return value from the original sequence, but it’s worth considering. We could potentially do the same for Take, in the case where we know it’s a list and the requested count is greater than or equal to the size of the list.

Using Skip and Take together

It’s worth mentioning the most common use of Skip and Take, as well as an easy-to-make mistake. Skip and Take are ideal for paging. For example:

var pageItems = allItems.Skip(pageNumber * itemsPerPage)
                        .Take(itemsPerPage);

This simply skips as many items as it needs to, and only "takes" one page-worth of data afterwards. Very simple… but also easy to get wrong.

In many cases with LINQ, you can reorder operations with no harm – or with a compile-time error if you get them the wrong way round. Let’s consider what happens if we get it wrong this time though:

// Bug! Bug! Bug!
var pageItems = allItems.Take(itemsPerPage)
                        .Skip(pageNumber * itemsPerPage);

This will work for the first page, but after that you’ll always end up displaying an empty page: it will take the first page-worth of data, and then skip it all, leaving nothing to return.

Conclusion

That was more work than I expected it to be. Still, we’ve actually not got very many LINQ operators still to cover. By my reckoning, we still have the following operators to implement:

  • Sum, Average, Min, Max – I’m not looking forward to these, as they all have lots of overloads, and are likely to be tedious to test and implement.
  • Cast and OfType
  • ToArray and ToDictionary
  • ElementAt and ElementAtOrDefault – I should possibly have covered these along with First/Single/Last etc
  • SequenceEqual
  • Zip (from .NET 4)
  • Contains
  • OrderBy/OrderByDescending/ThenBy/ThenByDescending (probably the most interesting remaining operators)
  • Reverse

That’s not too bad a list. I’ll probably try to knock off ToArray tonight – it should be pretty simple.

Addendum

It turns out that my "simplification" for Take introduced a bug. If we only need to take two items, we should know that after we’ve taken the second one… we don’t really need to ask the iterator to move to the third item, just to decide that we’ve finished. Doing so could have side effects – such as causing an exception to be thrown. That’s exactly what we do when we use TakeWhile – we don’t care what the first value after the stopping point is, but we do try to fetch it. Oops.

Fortunately the Mono tests picked this up for me, so I’ve added my own test for it, and gone back to the original implementation of Take which just used a counter. Both my own tests and Mono’s now pass :)

15 thoughts on “Reimplementing LINQ to Objects: Part 23 – Take/Skip/TakeWhile/SkipWhile”

  1. What was the reason for replacing the “skipping” flag version of SkipWhile()? I might prefer using

    if (!skipping && predicate(…))
    {
    skipping = true;
    }

    to be clearer, but that’s very minor. Making the control flow explicit in the control structures is nice, but I feel in this case it is harder to follow. (Of course, I would use goto, so what do I know?)

    Like

  2. I get the nagging feeling that the source sequences for the tests (5 int’s vs. 6 words) are of different length for an intentional reason, but i’m not sure why – is it just to make sure the test itself wasn’t looking at the wrong sequence?

    It’s certainly possible that one being exclusive on 5 and the other being inclusive on “five” wasn’t intentional, but it’s not clear to me either way. :(

    Like

  3. ******
    So for any “normal” collection (which returns the same results each time you iterate over it) the following are true:

    source.TakeWhile(predicate).Concat(source.SkipWhile(predicate)) is equivalent to source
    ******

    Seems like this one would require the predicate to be ‘normal’ (not depend on external state, just on stuff in the collection, assumed to be immutable during the operation) as well?

    [TestMethod]
    public void TakeWhileConcatSkipWhileFailsForWeirdPredicates()
    {
    var source = new[] { 1, 2, 3 };
    int stateNumber = 0;
    Func predicate = _ => stateNumber++ < 3;
    var result = source.TakeWhile(predicate).Concat(source.SkipWhile(predicate)).ToArray(); // ToArray so we can use CollectionAssert
    CollectionAssert.AreEqual(source, result);
    }

    ******
    source.Take(n).Concat(source.Skip(n)) is equivalent to source
    ******

    Unrealted, but IMHO one neat thing is that Take.Concat.Skip works even for negative 'n' (this with mstest)

    [TestMethod]
    public void TakeConcatSkipWorksEvenForNegative()
    {
    var source = new[] { 1, 2, 3 };
    foreach (var n in Enumerable.Range(-4, 10))
    {
    var result = source.Take(n).Concat(source.Skip(n)).ToArray(); // ToArray so we can use CollectionAssert
    CollectionAssert.AreEqual(source, result);
    }
    }

    Like

  4. @SourceTestSequences: It was actually just to give another data item, as that helped with making the predicates clearer. No deep and meaningful other reason :)

    @Picking Nits: Yes, I meant with any predicate only depending on the values it’s presented with and doesn’t have any side effects. I’ll clarify the post.

    Like

  5. Just catching up on your holiday blast of posts…I think there’s a cut-and-paste error in your second “using skip and take together” sample — the semicolon should be on the third line, not second. (Unless that’s the bug! bug! bug! you’re referring to :-)

    Like

  6. I think there might be a small bug in your optimized version of Skip. If count is less than one you would be outside of the array bounds.

    You could instead use something like the following which checks for the invalid count.

    if (list != null)
    {
    // Note that “count” is the count of items to skip
    for (index = count >= 0 ? count : 0; index < list.Count; index++)
    {
    yield return list[index];
    }
    yield break;
    }

    Thanks for the series so far and such a delightful blog.

    Like

  7. @Scott: Yes, you’re right. I hadn’t tested that implementation – it’s now in the source repository, with a fix (calling Math.Max).

    I’m adding details of another possible optimization, too…

    Like

  8. Minor issue. The sentence “We could potentially do the same for Take, in the case where we know it’s a list and the is greater than or equal to the size of the list” seems to be missing the word “count.” Should it be “the count is greater than”?

    Like

Leave a comment