Reimplementing LINQ to Objects: Part 8 – Concat

After our quick visit to scalar return types with Count and LongCount, we’re back to an operator returning a sequence: Concat.

What is it?

Concat only has a single signature, which makes life simple:

public static IEnumerable<TSource> Concat<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)

The return value is simply a sequence containing the elements of the first sequence followed by the elements of the second sequence – the concatenation of the two sequences.

I sometimes think it’s a pity that there aren’t Prepend/Append methods which do the same thing but for a single extra element – this would be quite useful in situations such as having a drop down list of countries with an extra option of "None". It’s easy enough to use Concat for this purpose by creating a single-element array, but specific methods would be more readable in my view. MoreLINQ has extra Concat methods for this purpose, but Edulinq is only meant to implement the methods already in LINQ to Objects.

As ever, some notes on the behaviour of Concat:

  • Arguments are validated eagerly: they must both be non-null
  • The result uses deferred execution: other than validation, the arguments aren’t used when the method is first called
  • Each sequence is only evaluated when it needs to be… if you stop iterating over the output sequence before the first input has been exhausted, the second input will remain unused

That’s basically it.

What are we going to test?

The actual concatenation part of the behaviour is very easy to test in a single example – we could potentially also demonstrate concatenation using empty sequences, but there’s no reason to suspect they would fail.

The argument validation is tested in the same way as normal, by calling the method with invalid arguments but not attempting to use the returned query.

Finally, there are a couple of tests to indicate the point at which each input sequence is used. This is achieved using the ThrowingEnumerable we originally used in the Where tests:

[Test]
public void FirstSequenceIsntAccessedBeforeFirstUse()
{
    IEnumerable<int> first = new ThrowingEnumerable();
    IEnumerable<int> second = new int[] { 5 };
    // No exception yet…
    var query = first.Concat(second);
    // Still no exception…
    using (var iterator = query.GetEnumerator())
    {
        // Now it will go bang
        Assert.Throws<InvalidOperationException>(() => iterator.MoveNext());
    }
}

[Test]
public void SecondSequenceIsntAccessedBeforeFirstUse()
{
    IEnumerable<int> first = new int[] { 5 };
    IEnumerable<int> second = new ThrowingEnumerable();
    // No exception yet…
    var query = first.Concat(second);
    // Still no exception…
    using (var iterator = query.GetEnumerator())
    {
        // First element is fine…
        Assert.IsTrue(iterator.MoveNext());
        Assert.AreEqual(5, iterator.Current);
        // Now it will go bang, as we move into the second sequence
        Assert.Throws<InvalidOperationException>(() => iterator.MoveNext());
    }
}

I haven’t written tests to check that iterators are disposed, etc – but each input sequence’s iterator should be disposed appropriately. In particular, it’s natural for the first sequence’s iterator to be disposed before the second sequence is iterated over at all.

Let’s impement it!

The implementation is reasonably simple, but it does make me hanker after F#… it’s the normal split between argument validation and iterator block implementation, but each part is really simple:

public static IEnumerable<TSource> Concat<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    return ConcatImpl(first, second);
}

private static IEnumerable<TSource> ConcatImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    foreach (TSource item in first)
    {
        yield return item;
    }
    foreach (TSource item in second)
    {
        yield return item;
    }
}

It’s worth just remembering at this point that this would still have been very annoying to implement without iterator blocks. Not really difficult as such, but we’d have had to remember which sequence we were currently iterating over (if any) and so on.

However, using F# we could have made this even simpler with the yield! expression, which yields a whole sequence instead of a single item. Admittedly in this case there aren’t significant performance benefits to using yield! (which there certainly can be in recursive situations) but it would just be more elegant to have the ability to yield an entire sequence in one statement. (Spec# has a similar construct called nested iterators, expressed using yield foreach.) I’m not going to pretend to know enough about the details of either F# or Spec# to draw more detailed comparisons, but we’ll see the pattern of "foreach item in a collection, yield the item" several more times before we’re done. Remember that we can’t extract that into a library method, as the "yield" expression needs special treatment by the C# compiler.

Conclusion

Even when presented with a simple implementation, I can still find room to gripe :) It would be nice to have nested iterators in C#, but to be honest the number of times I find myself frustrated by their absence is pretty small.

Concat is a useful operator, but it’s really only a very simple specialization of another operator: SelectMany. After all, Concat just flattens two sequences into one… whereas SelectMany can flatten a whole sequence of sequences, with even more generality available when required. I’ll implement SelectMany next, and show a few examples of how other operators can be implemented simply in terms of SelectMany. (We’ll see the same sort of ability for operators returning a single value when we implement Aggregate.)

Addendum: avoiding holding onto references unnecessarily

A comment suggested that we should set first to "null" after we’ve used it. That way, as soon as we’ve finished iterating over the collection, it may be eligible for garbage collection. That leads to an implementation like this:

private static IEnumerable<TSource> ConcatImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    foreach (TSource item in first)
    {
        yield return item;
    }
    // Avoid hanging onto a reference we don’t really need
    first = null;
    foreach (TSource item in second)
    {
        yield return item;
    }
}

Now normally I’d say this wouldn’t actually help – setting a local variable to null when it’s not used in the rest of the method doesn’t actually make any difference when the CLR is running in optimized mode, without a debugger attached: the garbage collector only cares about variables whih might still be accessed in the rest of the method.

In this case, however, it makes a difference – because this isn’t a normal local variable. It ends up as an instance variable in the hidden class generated by the C# compiler… and the CLR can’t tell that the instance variable will never be used again.

Arguably we could remove our only reference to "first" at the start of GetEnumerator. We could write a method of the form:

public static T ReturnAndSetToNull<T>(ref T value) where T : class
{
    T tmp = value;
    value = null;
    return tmp;
}

and then call it like this:

foreach (TSource item in ReturnAndSetToNull(ref first))

I would certainly consider that overkill, particularly as it seems very likely that the iterator will still have a reference to the collection itself – but simply setting "first" to null after iterating over it makes sense to me.

I don’t believe that the "real" LINQ to Objects implementation does this, mind you. (At some point I’ll test it with a collection which has a finalizer.)

9 thoughts on “Reimplementing LINQ to Objects: Part 8 – Concat”

  1. @Mihailik: Good idea. I haven’t checked whether the LINQ to Objects implementation does that or not, but it would certainly be a valid thing to do – and worth discussing, as well… setting a local variable to null is *usually* not useful, but in this case it would be due to iterator block semantics.

    Like

  2. Slight typo:

    “public void FirstSequencesIsnt”

    Should be singlular not plural. FirstSequenceIsnt.

    I’m really enjoying this series!

    Like

  3. @James: Simply what it returns. You would usually only return IEnumerator in a class implementing IEnumerable.

    @Josh: Thanks, will fix it.

    Like

  4. Hi Jon, thank you for your post. re: the point above that the variable called ‘first’ will still exist – is it the case that extension methods cause the compiler to simply create another class in the background which then stores all the variables used in that particular extension method? clarification much appreciated.

    Like

    1. I’m not quite sure what you’re asking. Extension methods don’t introduce extra classes in the background inherently, although lambda expressions often do. It’s probably worth asking a detailed question about this on Stack Overflow.

      Like

Leave a comment