Category Archives: C#

Reimplementing LINQ to Objects: Part 15 – Union

December 30, 2010 jonskeet 5 Comments

I’m continuing the set-based operators with Union. I may even have a couple of hours tonight – possibly enough to finish off Intersect and Except as well… let’s see.

What is it?

Union is another extension method with two overloads; one taking an equality comparer and one just using the default:

public static IEnumerable<TSource> Union<TSource>(
this IEnumerable<TSource> first,
IEnumerable<TSource> second)

public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)

This "two overloads, one taking an equality comparer" pattern is familiar from Distinct; we’ll see that and Intersect and Except do exactly the same thing.

Simply put, Union returns the union of the two sequences – all items that are in either input sequence. The result sequence has no duplicates in even if one of the input sequences contains duplicates. (I’m using the term "duplicate" here to mean an element which is equal to another according to the equality comparer we’re using in the operator.)

Characteristics:

Union uses deferred execution: argument validation is basically all that happens when the method is first called; it only starts iterating over the input sequences when you iterate over the result sequence
Neither first nor second can be null; the comparer argument can be null, in which case the default equality comparer is used
The input sequences are only read as and when they’re needed; to return the first result element, only the first input element is read

It’s worth noting that the documentation for Union specifies a lot more than the Distinct documentation does:

When the object returned by this method is enumerated, Union enumerates first and second in that order and yields each element that has not already been yielded.

To me, that actually looks like a guarantee of the rules I proposed for Distinct. In particular, it’s guaranteeing that the implementation iterates over "first" before "second", and if it’s yielding elements as it goes, that guarantees that distinct elements will retain their order from the original input sequences. Whether others would read it in the same way or not, I can’t say… input welcome.

What are we going to test?

I’ve written quite a few tests for Union – possibly more than we really need, but they demonstrate a few points of usage. The tests are:

Arguments are validated eagerly
Finding the union of two sequences without specifying a comparer; both inputs have duplicate elements, and there’s one element in both
The same test as above but explicitly specifying null as the comparer, to force the default to be used
The same test as above but using a case-insensitive string comparer
Taking the union of an empty sequence with a non-empty one
Taking the union of a non-empty sequence with an empty one
Taking the union of two empty sequences
Proving that the first sequence isn’t used until we start iterating over the result sequence (using ThrowingEnumerable)
Proving that the second sequence isn’t used until we’ve exhausted the first

No new collections or comparers needed this time though – it’s all pretty straightforward. I haven’t written any tests for null elements this time – I’m convinced enough by what I saw when implementing Distinct to believe they won’t be a problem.

Let’s implement it!

First things first: we can absolutely implement the simpler overload in terms of the more complex one, and I’ll do the same for Except and Intercept. Here’s the Union method:

public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    return Union(first, second, EqualityComparer<TSource>.Default);
}

So how do we implement the more complex overload? Well, I’ve basically been a bit disappointed by Union in terms of its conceptual weight. It doesn’t really give us anything that the obvious combination of Concat and Distinct doesn’t – so let’s implement it that way first:

public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    return first.Concat(second).Distinct(comparer);
}

The argument validation can be implemented by Concat with no problems here – the parameters have the same name, so any exceptions thrown will be fine in every way.

That’s how I think about Union, but as I’ve mentioned before, I’d rather not actually see Concat and Distinct showing up in the stack trace – so here’s the fuller implementation.

public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    return UnionImpl(first, second, comparer ?? EqualityComparer<TSource>.Default);
}

private static IEnumerable<TSource> UnionImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    HashSet<TSource> seenElements = new HashSet<TSource>(comparer);
    foreach (TSource item in first)
    {
        if (seenElements.Add(item))
        {
            yield return item;
        }
    }
    foreach (TSource item in second)
    {
        if (seenElements.Add(item))
        {
            yield return item;
        }
    }
}

That feels like an absurd waste of code when we can achieve the same result so simply, admittedly. This is the first time my resolve against implementing one operator in terms of completely different ones has wavered. Just looking at it in black and white (so to speak), I’m close to going over the edge…

Conclusion

Union was a disappointingly bland operator in my view. (Maybe I should start awarding operators marks out of ten for being interesting, challenging etc.) It doesn’t feel like it’s really earned its place in LINQ, as calls to Concat/Distinct can replace it so easily. Admittedly as I’ve mentioned in several other places, a lot of operators can be implemented in terms of each other – but rarely quite so simply.

Still, I think Intersect and Except should be more interesting.

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 14 – Distinct

December 30, 2010 jonskeet 7 Comments

I’m going to implement the set-based operators next, starting with Distinct.

What is it?

Distinct has two overloads, which differ only in the simplest possible way:

public static IEnumerable<TSource> Distinct<TSource>(
this IEnumerable<TSource> source)

public static IEnumerable<TSource> Distinct<TSource>(
this IEnumerable<TSource> source,
IEqualityComparer<TSource> comparer)

The point of the operator is straightforward: the result sequence contains the same items as the input sequence, but with the duplicates removed – so an input of { 0, 1, 3, 1, 5 } will give a result sequence of { 0, 1, 3, 5 } – the second occurrence of 1 is ignored.

This time I’ve checked and double-checked the documentation – and in this case it really is appropriate to think of the first overload as just a simplified version of the second. If you don’t specify an equality comparer, the default comparer for the type will be used. The same will happen if you pass in a null reference for the comparer. The default equality comparer for a type can be obtained with the handy EqualityComparer<T>.Default property.

Just to recap, an equality comparer (represented by the IEqualityComparer<T> interface) is able to do two things:

Determine the hash code for a single item of type T
Compare any two items of type T for equality

It doesn’t have to give any sort of ordering – that’s what IComparer<T> is for, although that doesn’t have the ability to provide a hash code.

One interesting point about IEqualityComparer<T> is that the GetHashCode() method is meant to throw an exception if it’s provided with a null argument, but in practice the EqualityComparer<T>.Default implementations appear not to. This leads to an interesting question about Distinct: how should it handle null elements? It’s not documented either way, but in reality both the LINQ to Objects implementation and the simplest way of implementing it ourselves simply throws a NullReferenceException if you use a not-null-safe comparer and have a null element present. Note that the default equality comparer for any type (EqualityComparer<T>.Default) does cope with nulls.

There are other undocumented aspects of Distinct, too. Both the ordering of the result sequence and the choice of which exact element is returned when there are equal options are unspecified. In the case of ordering, it’s explicitly unspecified. From the documentation: "The Distinct method returns an unordered sequence that contains no duplicate values." However, there’s a natural approach which answers both of these questions. Distinct is specified to use deferred execution (so it won’t look at the input sequence until you start reading from the output sequence) but it also streams the results to some extent: to return the first element in the result sequence, it only needs to read the first element from the input sequence. Some other operators (such as OrderBy) have to read all their data before yielding any results.

When you implement Distinct in a way which only reads as much data as it has to, the answer to the ordering and element choice is easy:

The result sequence is in the same order as the input sequence
When there are multiple equal elements, it is the one which occurs earliest in the input sequence which is returned as part of the result sequence.

Remember that it’s perfectly possible to have elements which are considered equal under a particular comparer, but are still clearly different when looked at another way. The simplest example of this is case-insensitive string equality. Taking the above rules into account, the distinct sequence returned for { "ABC", "abc", "xyz" } with a case-insensitive comparer is { "ABC", "xyz" }.

What are we going to test?

All of the above :)

All the tests use sequences of strings for clarity, but I’m using four different comparers:

The default string comparer (which is a case-sensitive ordinal comparer)
The case-insensitive ordinal comparer
A comparer which uses object identity (so will treat two equal but distinct strings as different)
A comparer which explicitly doesn’t try to cope with null values

The tests assume that the undocumented aspects listed above are implemented with the rules that I’ve given. This means they’re over-sensitive, in that an implementation of Distinct which matches all the documented behaviour but returns elements in a different order would fail the tests. This highlights an interesting aspect of unit testing in general… what exactly are we trying to test? I can think of three options in our case:

Just the documented behaviour: anything conforming to that, however oddly, should pass
The LINQ to Objects behaviour: the framework implementation should pass all our tests, and then our implementation should as well
Our implementation’s known (designed) behaviour: we can specify that our implementation will follow particular rules above and beyond the documented contracts

In production projects, these different options are valid in different circumstances, depending on exactly what you’re trying to do. At the moment, I don’t have any known differences in behaviour between LINQ to Objects and Edulinq, although that may well change later in terms of optimizations.

None of the tests themselves are particularly interesting – although I find it interesting that I had to implement a deliberately fragile (but conformant) implementation of IEqualityComparer<T> in order to test Distinct fully.

Let’s implement it!

I’m absolutely confident in implementing the overload that doesn’t take a custom comparer using the one that does. We have two options for how to specify the custom comparer in the delegating call though – we could pass null or EqualityComparer<T>.Default, as the two are explicitly defined to behave the same way in the second overload. I’ve chosen to pass in EqualityComparer<T>.Default just for the sake of clarity – it means that anyone reading the first method doesn’t need to check the behavior of the second to understand what it will do.

We need to use the "private iterator block method" approach again, so that the arguments can be evaluated eagerly but still let the result sequence use deferred execution. The real work method uses HashSet<T> to keep track of all the elements we’ve already returned – it takes an IEqualityComparer<T> in its constructor, and the Add method adds an element to the set if there isn’t already an equal one, and returns whether or not it really had to add anything. All we have to do is iterate over the input sequence, call Add, and yield the item as part of the result sequence if Add returned true. Simple!

public static IEnumerable<TSource> Distinct<TSource>(
this IEnumerable<TSource> source)
{
return source.Distinct(EqualityComparer<TSource>.Default);
}

public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return DistinctImpl(source, comparer ?? EqualityComparer<TSource>.Default);
}

private static IEnumerable<TSource> DistinctImpl<TSource>(
    IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)
{
    HashSet<TSource> seenElements = new HashSet<TSource>(comparer);
    foreach (TSource item in source)
    {
        if (seenElements.Add(item))
        {
            yield return item;
        }
    }
}

So what about the behaviour with nulls? Well, it seems that HashSet<T> just handles that automatically, if the comparer it uses does. So long as the comparer returns the same hash code each time it’s passed null, and considers null and null to be equal, it can be present in the sequence. Without HashSet<T>, we’d have had a much uglier implementation – especially as Dictionary<TKey, TValue> doesn’t allow null keys.

Conclusion

I’m frankly bothered by the lack of specificity in the documentation for Distinct. Should you rely on the ordering rules that I’ve given here? I think that in reality, you’re reasonably safe to rely on it – it’s the natural result of the most obvious implementation, after all. I wouldn’t rely on the same results when using a different LINQ provider, mind you – when fetching the results back from a database, for example, I wouldn’t be at all surprised to see the ordering change. And of course, the fact that t documentation explicitly states that the result is unordered should act as a bit of a deterrent from relying on this.

We’ll have to make similar decisions for the other set-based operators: Union, Intersect and Except. And yes, they’re very likely to use HashSet<T> too…

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 13 – Aggregate

December 30, 2010 jonskeet 6 Comments

EDIT: I’ve had to edit this quite a bit now that a second bug was discovered… basically my implementation of the first overload was completely broken :(

Last night’s tweet asking for suggestions around which operator to implement next resulted in a win for Aggregate, so here we go.

What is it?

Aggregate has three overloads, effectively allow two defaults:

public static TSource Aggregate<TSource>(
this IEnumerable<TSource> source,
Func<TSource, TSource, TSource> func)

public static TAccumulate Aggregate<TSource, TAccumulate>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func)

Aggregate is an extension method using immediate execution, returning a single result. The generalised behaviour is as follows:

Start off with a seed. For the first overload, this defaults to the first value of the input sequence. The seed is used as the first accumulator value.
For each item in the list, apply the aggregation function, which takes the current accumulator value and the newly found item, and returns a new accumulator value.
Once the sequence has been exhausted, optionally apply a final projection to obtain a result. If no projection has been specified, we can imagine that the identity function has been provided.

The signatures make all of this look a bit more complicated because of the various type parameters involved. You can consider all the overloads as dealing with three different types, even though the first two actually have fewer type parameters:

TSource is the element type of the sequence, always.
TAccumulate is the type of the accumulator – and thus the seed. For the first overload where no seed is provided, TAccumulate is effectively the same as TSource.
TResult is the return type when there’s a final projection involved. For the first two overloads, TResult is effectively the same as TAccumulate (again, think of a default "identity projection" as being used when nothing else is specified)

In the first overload, which uses the first input element as the seed, an InvalidOperationException is thrown if the input sequence is empty.

What are we going to test?

Obviously the argument validation is reasonably simple to test – source, func and resultSelector can’t be null. But there are two different approaches to testing the "success" cases.

We could work out exactly when each delegate should be called and with what values – effectively mock every step of the iteration. This would be a bit of a pain, but a very robust way of proceeding.

The alternative approach is just to take some sample data and aggregation function, work out what the result should be, and assert that result. If the result is sufficiently unlikely to be achieved by chance, this is probably good enough – and it’s a lot simpler to implement. Here’s a sample from the most complicated test, where we have a seed and a final projection:

[Test]
public void SeededAggregationWithResultSelector()
{
    int[] source = { 1, 4, 5 };
    int seed = 5;
    Func<int, int, int> func = (current, value) => current * 2 + value;
    Func<int, string> resultSelector = result => result.ToInvariantString();
    // First iteration: 5 * 2 + 1 = 11
    // Second iteration: 11 * 2 + 4 = 26
    // Third iteration: 26 * 2 + 5 = 57
    // Result projection: 57.ToString() = "57"
    Assert.AreEqual("57", source.Aggregate(seed, func, resultSelector));
}

Now admittedly I’m not testing this to the absolute full – I’m using the same types for TSource and TAccumulate – but frankly it gives me plenty of confidence that the implementation is correct.

EDIT: My result selector now calls ToInvariantString. It used to just call ToString, but as I’ve now been persuaded that there are some cultures where that wouldn’t give us the right results, I’ve implemented an extension method which effectively means that x.ToInvariantString() is equivalent to x.ToString(CultureInfo.InvariantCulture) – so we don’t need to worry about cultures with different numeric representations etc.

Just for the sake of completeness (I’ve convinced myself to improve the code while writing this blog post), here’s an example which sums integers, but results in a long – so it copes with a result which is bigger than Int32.MaxValue. I haven’t bothered with a final projection though:

[Test]
public void DifferentSourceAndAccumulatorTypes()
{
    int largeValue = 2000000000;
    int[] source = { largeValue, largeValue, largeValue };
    long sum = source.Aggregate(0L, (acc, value) => acc + value);
    Assert.AreEqual(6000000000L, sum);
    // Just to prove we haven’t missed off a zero…
    Assert.IsTrue(sum > int.MaxValue);
}

Since I first wrote this post, I’ve also added tests for empty sequences (where the first overload should throw an exception) and a test which relies on the first overload using the first element of the sequence as the seed, rather than the default value of the input sequence’s element type.

Okay, enough about the testing… what about the real code?

Let’s implement it!

I’m still feeling my way around when it’s a good idea to implement one method by using another, but at the moment my gut feeling is that it’s okay to do so when:

You’re implementing one operator by reusing another overload of the same operator; in other words, no unexpected operators will end up in the stack trace of callers
There are no significant performance penalties for doing so
The observed behaviour is exactly the same – including argument validation
The code ends up being simpler to understand (obviously)

Contrary to an earlier version of this post, the first overload can’t be implemented in terms of the second or third ones, because of its behaviour regarding the seed and empty sequences.

public static TSource Aggregate<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, TSource, TSource> func)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (func == null)
    {
        throw new ArgumentNullException("func");
    }

    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Source sequence was empty");
        }
        TSource current = iterator.Current;
        while (iterator.MoveNext())
        {
            current = func(current, iterator.Current);
        }
        return current;
    }
}

It still makes sense to share an implementation for the second and third overloads though. There’s a choice around whether to implement the second operator in terms of the third (giving it an identity projection) or to implement the third operator in terms of the second (by just calling the second overload and then applying a projection). Obviously applying an unnecessary identity projection has a performance penalty in itself – but it’s a tiny penalty. So which is more readable? I’m in two minds about this. I like code where various methods call one other "central" method where all the real work occurs (suggesting implementing the second overload using the third) but equally I suspect I really think about aggregation in terms of getting the final value of the accumulator… with just a twist in the third overload, of an extra projection. I guess it depends on whether you think of the final projection as part of the general form or an "extra" step.

For the moment, I’ve gone with the "keep all logic in one place" approach:

public static TAccumulate Aggregate<TSource, TAccumulate>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func)
{
    return source.Aggregate(seed, func, x => x);
}

public static TResult Aggregate<TSource, TAccumulate, TResult>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func,
    Func<TAccumulate, TResult> resultSelector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (func == null)
    {
        throw new ArgumentNullException("func");
    }
    if (resultSelector == null)
    {
        throw new ArgumentNullException("resultSelector");
    }
    TAccumulate current = seed;
    foreach (TSource item in source)
    {
        current = func(current, item);
    }
    return resultSelector(current);
}

The bulk of the "real work" method is argument validation – the actual iteration is almost painfully simple.

Conclusion

The moral of today’s story is to read the documentation carefully – sometimes there’s unexpected behaviour to implement. I still don’t really know why this difference in behaviour exists… it feels to me as if the first overload really should behave like the second one, just with a default initial seed. EDIT: it seems that you need to read it really carefully. You know, every word of it. Otherwise you could make an embarrassing goof in a public blog post. <sigh>

The second moral should really be about the use of Aggregate – it’s a very generalized operator, and you can implement any number of other operators (Sum, Max, Min, Average etc) using it. In some ways it’s the scalar equivalent of SelectMany, just in terms of its diversity. Maybe I’ll show some later operators implemented using Aggregate…

Next up, there have been requests for some of the set-based operators – Distinct, Union, etc – so I’ll probably look at those soon.

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 12 – DefaultIfEmpty

December 29, 2010 jonskeet 9 Comments

After the masses of code required for all the permutations of First/Last/etc, DefaultIfEmpty is a bit of a relief.

What is it?

Even this simple operator has two overloads:

public static IEnumerable<TSource> DefaultIfEmpty<TSource>(
this IEnumerable<TSource> source)

public static IEnumerable<TSource> DefaultIfEmpty<TSource>(
this IEnumerable<TSource> source,
TSource defaultValue)

The behaviour is very simple to describe:

If the input sequence is empty, the result sequence has a single element in it, the default value. This is default(TSource) for the overload without an extra parameter, or the specified value otherwise.
If the input sequence isn’t empty, the result sequence is the same as the input sequence
The source argument must not be null, and this is validated eagerly
The result sequence itself uses deferred execution – the input sequence isn’t read at all until the result sequence is read
The input sequence is streamed; any values read are yielded immediately; no buffering is used

Dead easy.

What are we going to test?

Despite being relatively late in the day, I decided to test for argument validation – and a good job too, as my first attempt failed to split the implementation into an argument validation method and an iterator block method for the real work! It just shows how easy it is to slip up.

Other than that, I can really only see four cases worth testing:

No default value specified, empty input sequence
Default value specified, empty input sequence
No default value specified, non-empty input sequence
Default value specified, non-empty input sequence

So I have tests for all of those, and that’s it. I don’t have anything testing for streaming, lazy evaluation etc.

Let’s implement it!

Despite my reluctance to implement one operator in terms of another elsewhere, this felt like such an obvious case, I figured it would make sense just this once. I even applied DRY to the argument validation aspect. Here’s the implementation in all its brief glory:

public static IEnumerable<TSource> DefaultIfEmpty<TSource>(
    this IEnumerable<TSource> source)
{
    // This will perform an appropriate test for source being null first.
    return source.DefaultIfEmpty(default(TSource));
}

public static IEnumerable<TSource> DefaultIfEmpty<TSource>(
    this IEnumerable<TSource> source,
    TSource defaultValue)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return DefaultIfEmptyImpl(source, defaultValue);
}

private static IEnumerable<TSource> DefaultIfEmptyImpl<TSource>(
    IEnumerable<TSource> source,
    TSource defaultValue)
{
    bool foundAny = false;
    foreach (TSource item in source)
    {
        yield return item;
        foundAny = true;
    }
    if (!foundAny)
    {
        yield return defaultValue;
    }
}

Of course, now that I’ve said how easy it was, someone will find a bug…

Aside from the use of default(TSource) to call the more complex overload from the simpler one, the only aspect of any interest is the bottom method. It irks me slightly that we’re assigning "true" to "foundAny" on every iteration… but the alternative is fairly unpleasant:

private static IEnumerable<TSource> DefaultIfEmptyImpl<TSource>(
    IEnumerable<TSource> source,
    TSource defaultValue)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            yield return defaultValue;
            yield break; // Like a "return"
        }
        yield return iterator.Current;
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}

This may be slightly more efficient, but it feels a little clumsier. We could get rid of the "yield break" by putting the remainder of the method in an "else" block, but I’m not dead keen on that, either. We could use a do/while loop instead of a simple while loop – that would at least remove the repetition of "yield return iterator.Current" but I’m not really a fan of do/while loops. I use them sufficiently rarely that they cause me more mental effort to read than I really like.

If any readers have suggestions which are significantly nicer than either of the above implementations, I’d be interested to hear them. This feels a little inelegant at the moment. It’s far from a readability disaster – it’s just not quite neat.

Conclusion

Aside from the slight annoyance at the final lack of elegance, there’s not much of interest here. However, we could now implement FirstOrDefault/LastOrDefault/SingleOrDefault using DefaultIfEmpty. For example, here’s an implementation of FirstOrDefault:

public static TSource FirstOrDefault<TSource>(
this IEnumerable<TSource> source)
{
return source.DefaultIfEmpty().First();
}

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Can’t just use source.DefaultIfEmpty().First(predicate)
    return source.Where(predicate).DefaultIfEmpty().First();
}

Note the comment in the predicated version – the defaulting has to be the very last step after we’ve applied the predicate… otherwise if we pass in an empty sequence and a predicate which doesn’t match with default(TSource), we’ll get an exception instead of the default value. The other two …OrDefault operators could be implemented in the same way, of course. (I haven’t done so yet, but the above code is in source control.)

I’m currently unsure what I’ll implement next. I’ll see whether I get inspired by any particular method in the morning.

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 11 – First/Single/Last and the …OrDefault versions

December 29, 2010 jonskeet 12 Comments

Today I’ve implemented six operators, each with two overloads. At first I expected the implementations to be very similar, but they’ve all turned out slightly differently…

What are they?

We’ve got three axes of permutation here: {First, Last, Single}, {with/without OrDefault }, {with/without a predicate}. That gives us these twelve signatures:

public static TSource First<TSource>(
this IEnumerable<TSource> source)

public static TSource First<TSource>(
this IEnumerable<TSource> source,
Func<TSource, bool> predicate)

public static TSource FirstOrDefault<TSource>(
this IEnumerable<TSource> source)

public static TSource FirstOrDefault<TSource>(
this IEnumerable<TSource> source,
Func<TSource, bool> predicate)

public static TSource Last<TSource>(
this IEnumerable<TSource> source)

public static TSource Last<TSource>(
this IEnumerable<TSource> source,
Func<TSource, bool> predicate)

public static TSource LastOrDefault<TSource>(
this IEnumerable<TSource> source)

public static TSource LastOrDefault<TSource>(
this IEnumerable<TSource> source,
Func<TSource, bool> predicate)

public static TSource Single<TSource>(
this IEnumerable<TSource> source)

public static TSource Single<TSource>(
this IEnumerable<TSource> source,
Func<TSource, bool> predicate)

public static TSource SingleOrDefault<TSource>(
this IEnumerable<TSource> source)

public static TSource SingleOrDefault<TSource>(
this IEnumerable<TSource> source,
Func<TSource, bool> predicate)

The shared behaviour is as follows:

They’re all extension methods with a single generic type argument
They’re all implemented with immediate execution
They all validate that their parameters are non-null
The overloads with a predicate are equivalent to calling source.Where(predicate).SameOperator() – in other words, they just add a filter before applying the operator.

With those rules applied, we simply need to consider three possibilities for each operator: what happens if the source sequence is empty, contains a single element, or contains multiple elements. (This is after applying the filter if a predicate is specified, of course.) We can draw the results up in a simple table:

Operator	Empty sequence	Single element	Multiple elements
First	Throws exception	Returns element	Returns first element
FirstOrDefault	Returns default(TSource)	Returns element	Returns first element
Last	Throws exception	Returns element	Returns last element
LastOrDefault	Returns default(TSource)	Returns element	Returns last element
Single	Throws exception	Returns element	Throws exception
SingleOrDefault	Returns default(TSource)	Returns element	Throws exception

As you can see, for an input sequence with a single element, the results are remarkably uniform :) Likewise, for an empty input sequence, any operator without "OrDefault" throws an exception (InvalidOperationException, in fact) and any operator with "OrDefault" returns the default value for the element type (null for reference types, 0 for int etc). The operators really differ if the (potentially filtered) input sequence contains multiple elements – First and Last do the obvious thing, and Single throws an exception. It’s worth noting that SingleOrDefault also throws an exception – it’s not like it’s saying, "If the sequence is a single element, return it – otherwise return the default value." If you want an operator which handles multiple elements, you should be using First or Last, with the "OrDefault" version if the sequence can legitimately have no elements. Note that if you do use an "OrDefault" operator, the result is exactly the same for an empty input sequence as for an input sequence containing exactly one element which is the default value. (I’ll be looking at the DefaultIfEmpty operator next.)

Now we know what the operators do, let’s test them.

What are we going to test?

This morning I tweeted that I had written 72 tests before writing any implementation. In fact I ended up with 80, for reasons we’ll come to in a minute. For each operator, I tested the following 12 cases:

Null source (predicate-less overload)
Null source (predicated overload)
Null predicate
Empty sequence, no predicate
Empty sequence, with a predicate
Single element sequence, no predicate
Single element sequence where the element matched the predicate
Single element sequence where the element didn’t match the predicate
Multiple element sequence, no predicate
Multiple element
Multiple element sequence where one element matched the predicate
Multiple element sequence where multiple elements matched the predicate

These were pretty much cut and paste jobs – I used the same data for each test against each operator, and just changed the expected results.

There are two extra tests for each of First and FirstOrDefault, and two for each of Last and LastOrDefault:

First/FirstOrDefault should return as soon as they’ve seen the first element, when there’s no predicate; they shouldn’t iterate over the rest of the sequence
First/FirstOrDefault should return as soon as they’ve seen the first matching element, when there is a predicate
Last/LastOrDefault are optimized for the case where the source implements IList<T> and there’s no predicate: it uses Count and the indexer to access the final element
Last/LastOrDefault is not optimized for the case where the source implements IList<T> but there is a predicate: it iterates through the entire sequence

The last two tests involved writing a new collection called NonEnumerableList which implements IList<T> by delegating everything to a backing List<T>, except for GetEnumerator() (both the generic and nongeneric forms) which simply throws a NotSupportedException. That should be a handy one for testing optimizations in the future. I’ll discuss the optimization for Last when we get there.

Let’s implement them!

These operators were more interesting to implement than I’d expect, so I’m actually going to show all twelve methods. It was rarely just a matter of cutting and pasting, other than for the argument validation.

Of course, if we chose to implement the predicated versions using "Where" and the non-predicated form, and the "OrDefault" versions by using "DefaultIfEmpty" followed by the non-defaulting version, we would only have had the three non-predicated, non-defaulting versions to deal with… but as I’ve said before, there are some virtues to implementing each operator separately.

For the sake of avoiding fluff, I’ve removed the argument validation from each method – but obviously it’s there in the real code. Let’s start with First:

public static TSource First<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (iterator.MoveNext())
        {
            return iterator.Current;
        }
        throw new InvalidOperationException("Sequence was empty");
    }
}

public static TSource First<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            return item;
        }
    }
    throw new InvalidOperationException("No items matched the predicate");
}

These look surprisingly different – and that was actually a deliberate decision. I could easily have implemented the predicate-less version with a foreach loop as well, just returning unconditionally from its body. However, I chose to emphasize the fact that we’re not looping in First: we simply move to the first element if we can, and return it or throw an exception. There’s no hint that we might ever call MoveNext again. In the predicated form, of course, we have to keep looping until we find a matching value – only throwing the exception when we’ve exhausted all possibilities.

Now let’s see how it looks when we return a default for empty sequences:

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        return iterator.MoveNext() ? iterator.Current : default(TSource);
    }
}

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            return item;
        }
    }
    return default(TSource);
}

Here the predicated form looks very similar to First, but the predicate-less one is slightly different: instead of using an if block (which would be a perfectly valid approach, of course) I’ve used the conditional operator. We’re going to return something, whether we manage to move to the first element or not. Arguably it would be nice if the conditional operator allowed the second or third operands to be "throw" expressions, taking the overall type of the expression from the other result operand… but it’s no great hardship.

Next up we’ll implement Single, which is actually closer to First than Last is, in some ways:

public static TSource Single<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        TSource ret = iterator.Current;
        if (iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence contained multiple elements");
        }
        return ret;
    }
}

public static TSource Single<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    TSource ret = default(TSource);
    bool foundAny = false;
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            if (foundAny)
            {
                throw new InvalidOperationException("Sequence contained multiple matching elements");
            }
            foundAny = true;
            ret = item;
        }
    }
    if (!foundAny)
    {
        throw new InvalidOperationException("No items matched the predicate");
    }
    return ret;
}

This is already significantly more complex than First. The predicate-less version starts off the same way, but if we manage to move to the first element, we have to remember that value (as we’re hoping to return it) and then try to move to the second element. This time, if the move succeeds, we have to throw an exception – otherwise we can return our saved value.

The predicated version is even hairier. We still need to remember the first matching value we find, but this time we’re looping – so we need to keep track of whether we’ve already seen a matching value or not. If we see a second match, we have to throw an exception… and we also have to throw an exception if we reach the end without finding any matches at all. Note that although we assign an initial value of default(TSource) to ret, we’ll never reach a return statement without assigning a value to it. However, the rules around definite assignment aren’t smart enough to cope with this, so we need to provide a "dummy" value to start with… and default(TSource) is really the only value available. There is an alternative approach without using a foreach statement, where you loop until you find the first match, assign it to a local variable which is only declared at that point, followed by a second loop ensuring that there aren’t any other matches. I personally think that’s a bit more complex, which is why I’ve just used the foreach here.

The difference when we implement SingleOrDefault isn’t quite as pronounced this time though:

public static TSource SingleOrDefault<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            return default(TSource);
        }
        TSource ret = iterator.Current;
        if (iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence contained multiple elements");
        }
        return ret;
    }
}

public static TSource SingleOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    TSource ret = default(TSource);
    bool foundAny = false;
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            if (foundAny)
            {
                throw new InvalidOperationException("Sequence contained multiple matching elements");
            }
            foundAny = true;
            ret = item;
        }
    }
    return ret;
}

This time we’ve just replaced a "throw" statement in the predicate-less method with a "return" statement, and removed the test for no matches being found in the predicated method. Here our assignment of default(TSource) to ret really works in our favour – if we don’t end up assigning anything else to it, we’ve already got the right return value!

Next up is Last:

public static TSource Last<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    IList<TSource> list = source as IList<TSource>;
    if (list != null)
    {
        if (list.Count == 0)
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        return list[list.Count – 1];
    }

    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        TSource last = iterator.Current;
        while (iterator.MoveNext())
        {
            last = iterator.Current;
        }
        return last;
    }
}

public static TSource Last<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    bool foundAny = false;
    TSource last = default(TSource);
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            foundAny = true;
            last = item;
        }
    }
    if (!foundAny)
    {
        throw new InvalidOperationException("No items matched the predicate");
    }
    return last;
}

Let’s start off with the optimization at the beginning of the predicate-less method. If we find out we’re dealing with a list, we can simply fetch the count, and then either throw an exception or return the element at the final index. As an extra bit of optimization I could store the count in a local variable – but I’m assuming that the count of an IList<T> is cheap to compute. If there are significant objections to that assumption, I’m happy to change it :) Note that this is another situation where I’m assuming that anything implementing IList<T> will only hold at most Int32.MaxValue items – otherwise the optimization will fail.

If we don’t follow the optimized path, we simply iterate over the sequence, updating a local variable with the last-known element on every single iteration. This time I’ve avoided the foreach loop for no particularly good reason – we could easily have had a foundAny variable which was just set to "true" on every iteration, and then tested at the end. In fact, that’s exactly the pattern the predicated method takes. Admittedly that decision is forced upon us to some extent – we can’t just move once and then take the first value as the "first last-known element", because it might not match the predicate.

There’s no optimization for the predicated form of Last. This follows LINQ to Objects, but I don’t honestly know the reason for it there. We could easily iterate backwards from the end of the sequence using the indexer on each iteration. One possible reason which makes a certain amount of sense is that when there’s a predicate, that predicate could throw an exception for some values – and if we just skipped to the end if the collection implements IList<T>, that would be an observable difference. I’d be interested to know whether or not that is the reason – if anyone has any inside information which they can share, I’ll update this post.

From here, we only have one more operator to implement – LastOrDefault:

public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    IList<TSource> list = source as IList<TSource>;
    if (list != null)
    {
        return list.Count == 0 ? default(TSource) : list[list.Count – 1];
    }

    TSource last = default(TSource);
    foreach (TSource item in source)
    {
        last = item;
    }
    return last;
}

public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    TSource last = default(TSource);
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            last = item;
        }
    }
    return last;
}

This time, aside from the optimization, the predicated and non-predicated forms look very similar… more so than for any of the other operators. In each case, we start with a return value of default(TSource), and iterate over the whole sequence, updating it – only doing so when it matches the predicate, if we’ve got one.

Conclusion

This was a longer post than I anticipated when I got up this morning, but I hope the slight differences between implementations – and the mystery of the unoptimized predicated "Last/LastOrDefault" operators have made it worth slogging through.

As a contrast – and because I’ve already mentioned it in this post – I’ll implement DefaultIfEmpty next. I reckon I can still do that this evening, if I hurry…

Addendum

It turns out I was missing some tests for Single and SingleOrDefault: what should they do if evaluating the sequence fully throws an exception? It turns out that in LINQ to Objects, the overloads without a predicate throw InvalidOperationException as soon as they see a second element, but the overloads with a predicate keep iterating even when they’ve seen a second element matching a predicate. This seems ludicrously inconsistent to me – I’ve opened a Connect issue about it; we’ll see what happens.

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 10 – Any and All

December 28, 2010 jonskeet 12 Comments

Another day, another blog post. I should emphasize that this rate of posting is likely to be short-lived… although if I get into the habit of writing a post on the morning commute when I go back to work after the Christmas holidays, I could keep ploughing through until we’re done.

Anyway, today we have a pair of operators: Any and All.

What are they?

"Any" has two overloads; there’s only one for "All":

public static bool Any<TSource>(
this IEnumerable<TSource> source)

public static bool Any<TSource>(
this IEnumerable<TSource> source,
Func<TSource, bool> predicate)

public static bool All<TSource>(
this IEnumerable<TSource> source,
Func<TSource, bool> predicate)

The names really are fairly self-explanatory:

"Any" without a predicate returns whether there are any elements in the input sequence
"Any" with a predicate returns whether any elements in the input sequence match the predicate
"All" returns whether all the elements in the input sequence match the given predicate

Both operators use immediate execution – they don’t return until they’ve got the answer, basically.

Importantly, "All" has to read through the entire input sequence to return true, but can return as soon as it’s found a non-matching element; "Any" can return true as soon as it’s found a matching element, but has to iterate over the entire input sequence in order to return false. This gives rise to one very simple LINQ performance tip: it’s almost never a good idea to use a query like

// Don’t use this
if (query.Count() != 0)

That has to iterate over *all* the results in the query… when you really only care whether or not there are any results. So use "Any" instead:

// Use this instead
if (query.Any())

If this is part of a bigger LINQ to SQL query, it may not make a difference – but in LINQ to Objects it can certainly be a huge boon.

Anyway, let’s get on to testing the three methods…

What are we going to test?

Feeling virtuous tonight, I’ve even tested argument validation again… although it’s easy to get that right here, as we’re using immediate execution.

Beyond that, I’ve tested a few scenarios:

An empty sequence will return false with Any, but true with All. (Whatever the predicate is for All, there are no elements which fail it.)
A sequence with any elements at all will make the predicate-less Any return true.
If all the elements don’t match the predicate, both Any and All return false.
If some elements match the predicate, Any will return true but All will return false.
If all elements match the predicate, All will return true.

Those are all straightforward, so I won’t give the code. One final test is interesting though: we prove that Any returns as soon as it’s got the result by giving it a query which will throw an exception if it’s iterated over completely. The easiest way of doing this is to start out with a sequence of integers including 0, and then use Select with a projection which divides some constant value by each element. In this test case, I’ve given it a value which will match the predicate before the value which will cause the exception to be thrown:

[Test]
public void SequenceIsNotEvaluatedAfterFirstMatch()
{
    int[] src = { 10, 2, 0, 3 };
    var query = src.Select(x => 10 / x);
    // This will finish at the second element (x = 2, so 10/x = 5)
    // It won’t evaluate 10/0, which would throw an exception
    Assert.IsTrue(query.Any(y => y > 2));
}

There’s an equivalent test for All, where a non-matching element occurs before the exceptional one.

So, with all the tests written, let’s get on with the interesting bit:

Let’s implement them!

The first thing to note is that all of these could be implemented in terms of either Any-with-a-predicate or All. For example, given All, we could implement Any as:

public static bool Any<TSource>(
this IEnumerable<TSource> source)
{
return source.Any(x => true);
}

public static bool Any<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return !source.All(x => !predicate(x));
}

It’s simplest to implement the predicate-less Any in terms of the predicated one – using a predicate which returns true for any element means that Any will return true for any element at all, which is what we want.

The inversions in the call to All take a minute to get your head round, but it’s basically De Morgan’s law in LINQ form: we effectively invert the predicate to find out if all of the elements don’t match the original predicate… then return the inverse. Due to the inversion, this still returns early in all the appropriate situations, too.

While we could do that, I’ve actually preferred a straightforward implementation of all of the separate methods:

public static bool Any<TSource>(
    this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        return iterator.MoveNext();
    }
}

public static bool Any<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }

    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            return true;
        }
    }
    return false;
}

public static bool All<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }

    foreach (TSource item in source)
    {
        if (!predicate(item))
        {
            return false;
        }
    }
    return true;
}

Aside from anything else, this makes it obvious where the "early out" comes in each case – and also means that any stack traces generated are rather easier to understand. It would be quite odd from a client developer’s point of view to call Any but see All in the stack trace, or vice versa.

One interesting point to note is that I don’t actually use a foreach loop in Any – although I could, of course. Instead, I just get the iterator and then return whether the very first call to MoveNext indicates that there are any elements. I like the fact that reading this method it’s obvious (at least to me) that we really couldn’t care less what the value of the first element is – because we never ask for it.

Conclusion

Probably the most important lesson here is the advice to use Any (without a predicate) instead of Count when you can. The rest was pretty simple – although it’s always fun to see one operator implemented in terms of another.

So, what next? Possibly Single/SingleOrDefault/First/FirstOrDefault/Last/LastOrDefault. I might as well do them all together – partly as they’re so similar, and partly to emphasize the differences which do exist.

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 9 – SelectMany

December 27, 2010 jonskeet 9 Comments

The next operator we’ll implement is actually the most important in the whole of LINQ. Most (all?) of the other operators returning sequences can be implemented via SelectMany. We’ll have a look at that at the end of this post, but let’s implement it first.

What is it?

SelectMany has 4 overloads, which look gradually more and more scary:

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
this IEnumerable<TSource> source,
Func<TSource, IEnumerable<TResult>> selector)

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
this IEnumerable<TSource> source,
Func<TSource, int, IEnumerable<TResult>> selector)

public static IEnumerable<TResult> SelectMany<TSource, TCollection, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)

public static IEnumerable<TResult> SelectMany<TSource, TCollection, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, int, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)

These aren’t too bad though. Really these are just variations of the same operation, with two "optional" bits.

In every case, we start with an input sequence. We generate a subsequence from each element in the input sequence using a delegate which can optionally take a parameter with the index of the element within the original collection.

Now, we either return each element from each subsequence directly, or we apply another delegate which takes the original element in the input sequence and the element within the subsequence.

In my experience, uses of the overloads where the original selector delegate uses the index are pretty rare – but the others (the first and the third in the list above) are fairly common. In particular, the C# compiler uses the third overload whenever it comes across a "from" clause in a query expression, other than the very first "from" clause.

It helps to put this into a bit more context. Suppose we have a query expression like this:

var query = from file in Directory.GetFiles("logs")
from line in File.ReadLines(file)
select Path.GetFileName(file) + ": " + line;

That would be converted into a "normal" call like this:

var query = Directory.GetFiles("logs")
.SelectMany(file => File.ReadLines(file),
(file, line) => Path.GetFileName(file) + ": " + line);

In this case the compiler has used our final "select" clause as the projection; if the query expression had continued with "where" clauses etc, it would have created a projection to just pass along "file" and "line" in an anonymous type. This is probably the most confusing bit of the query translation process, involving transparent identifiers. For the moment we’ll stick with the simple version above.

So, the SelectMany call above has three arguments really:

The source, which is a list of strings (the filenames returned from Directory.GetFiles)
An initial projection which converts from a single filename to a list of the lines of text within that file
A final projection which converts a (file, line) pair into a single string, just by separating them with ": ".

The result is a single sequence of strings – every line of every log file, prefixed with the filename in which it appeared. So writing out the results of the query might give output like this:

test1.log: foo
test1.log: bar
test1.log: baz
test2.log: Second log file
test2.log: Another line from the second log file

It can take a little while to get your head round SelectMany – at least it did for me – but it’s a really important one to understand.

A few more details of the behaviour before we go into testing:

The arguments are validated eagerly – everything has to be non-null.
Everything is streamed. So only one element of the input is read at a time, and then a subsequence is produced from that. Only one element is then read from the subsequence at a time, yielding the results as we go, before we move onto the next input element and thus the next subsequence etc.
Every iterator is closed when it’s finished with, just as you’d expect by now.

What are we going to test?

I’m afraid I’ve become lazy by this point. I can’t face writing yet more tests for null arguments. I’ve written a single test for each of the overloads. I found it hard to come up with a clear way of writing the tests, but here’s one example, for the most complicated overload:

[Test]
public void FlattenWithProjectionAndIndex()
{
    int[] numbers = { 3, 5, 20, 15 };
    var query = numbers.SelectMany((x, index) => (x + index).ToString().ToCharArray(),
                                   (x, c) => x + ": " + c);
    // 3 => "3: 3"
    // 5 => "5: 6"
    // 20 => "20: 2", "20: 2"
    // 15 => "15: 1", "15: 8"
    query.AssertSequenceEqual("3: 3", "5: 6", "20: 2", "20: 2", "15: 1", "15: 8");
}

So, to give a bit more explanation to this:

Each number is summed with its index (3+0, 5+1, 20+2, 15+3)
Each sum is turned into a string, and then converted into a char array. (We don’t really need to ToCharArray call as string implements IEnumerable<char> already, but I thought it made it clearer.)
We combine each subsequence character with the original element it came from, in the form: "original value: subsequence character"

The comment shows the eventual results from each input, and the final test shows the complete result sequence.

Clear as mud? Hopefully it’s not to bad when you look at each step in turn. Okay, now let’s make it pass…

Let’s implement it!

We could implement the first three overloads in terms of calls to the final one – or more likely, a single "Impl" method without argument validation, called by all four public methods. For example, the simplest method could be implemented like this:

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TResult>> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    return SelectManyImpl(source,
                          (value, index) => selector(value),
                          (originalElement, subsequenceElement) => subsequenceElement);
}

However, I’ve decided to implement each of the methods separately – splitting them into the public extension method and a "SelectManyImpl" method with the same signature each time. I think that would make it simpler to step through the code if there were ever any problems… and it allows us to see the differences between the simplest and most complicated versions, too:

// Simplest overload
private static IEnumerable<TResult> SelectManyImpl<TSource, TResult>(
    IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TResult>> selector)
{
    foreach (TSource item in source)
    {
        foreach (TResult result in selector(item))
        {
            yield return result;
        }
    }
}

// Most complicated overload:
// – Original projection takes index as well as value
// – There’s a second projection for each original/subsequence element pair
private static IEnumerable<TResult> SelectManyImpl<TSource, TCollection, TResult>(
    IEnumerable<TSource> source,
    Func<TSource, int, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)
{
    int index = 0;
    foreach (TSource item in source)
    {
        foreach (TCollection collectionItem in collectionSelector(item, index++))
        {
            yield return resultSelector(item, collectionItem);
        }
    }
}

The correspondence between the two methods is pretty clear… but I find it helpful to have the first form, so that if I ever get confused about the fundamental point of SelectMany, it’s really easy to understand it based on the simple overload. It’s then not too big a jump to apply the two extra "optional" complications, and end up with the final method. The simple overload acts as a conceptual stepping stone, in a way.

Two minor points to note:

The first method could have been implemented with a "yield foreach selector(item)" if such an expression existed in C#. Using a similar construct in the more complicated form would be harder, and involve another call to Select, I suspect… probably more hassle than it would be worth.
I’m not explicitly using a "checked" block in the second form, even though "index" could overflow. I haven’t looked to see what the BCL does in this situation, to be honest – I think it unlikely that it will come up. For consistency I should probably use a checked block on every method which uses an index like this… or just turn arithmetic checking on for the whole assembly.

Reimplementing operators using SelectMany

I mentioned early on in this post that many of the LINQ operators can be implemented via SelectMany. Just as a quick example of this, here are alternative implementations of Select, Where and Concat:

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    return source.SelectMany(x => Enumerable.Repeat(selector(x), 1));
}

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return source.SelectMany(x => Enumerable.Repeat(x, predicate(x) ? 1 : 0));
}

public static IEnumerable<TSource> Concat<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    return new[] { first, second }.SelectMany(x => x);
}

Select and Where use Enumerable.Repeat as a convenient way of creating a sequence with either a single element or none. You could alternatively create a new array instead. Concat just uses an array directly: if you think of SelectMany in terms of its flattening operation, Concat is a really natural fit. I suspect that Empty and Repeat are probably feasible with recursion, although the performance would become absolutely horrible.

Currently the above implementations are in the code using conditional compilation. If this becomes a popular thing for me to implement, I might consider breaking it into a separate project. Let me know what you think – my gut feeling is that we won’t actually gain much more insight than the above methods give us… just showing how flexible SelectMany is.

SelectMany is also important in a theoretical way, in that it’s what provides the monadic nature of LINQ. It’s the "Bind" operation in the LINQ monad. I don’t intend to say any more than that on the topic – read Wes Dyer’s blog post for more details, or just search for "bind monad SelectMany" for plenty of posts from people smarter than myself.

Conclusion

SelectMany is one of LINQ’s fundamental operations, and at first sight it’s a fearsome beast. As soon as you understand that the basic operation is a flattening projection just with a couple of optional twiddles, it’s easily tamed.

Next up I’ll implement All and Any – which are nice and easy to describe by comparison.

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 8 – Concat

December 27, 2010 jonskeet 9 Comments

After our quick visit to scalar return types with Count and LongCount, we’re back to an operator returning a sequence: Concat.

What is it?

Concat only has a single signature, which makes life simple:

public static IEnumerable<TSource> Concat<TSource>(
this IEnumerable<TSource> first,
IEnumerable<TSource> second)

The return value is simply a sequence containing the elements of the first sequence followed by the elements of the second sequence – the concatenation of the two sequences.

I sometimes think it’s a pity that there aren’t Prepend/Append methods which do the same thing but for a single extra element – this would be quite useful in situations such as having a drop down list of countries with an extra option of "None". It’s easy enough to use Concat for this purpose by creating a single-element array, but specific methods would be more readable in my view. MoreLINQ has extra Concat methods for this purpose, but Edulinq is only meant to implement the methods already in LINQ to Objects.

As ever, some notes on the behaviour of Concat:

Arguments are validated eagerly: they must both be non-null
The result uses deferred execution: other than validation, the arguments aren’t used when the method is first called
Each sequence is only evaluated when it needs to be… if you stop iterating over the output sequence before the first input has been exhausted, the second input will remain unused

That’s basically it.

What are we going to test?

The actual concatenation part of the behaviour is very easy to test in a single example – we could potentially also demonstrate concatenation using empty sequences, but there’s no reason to suspect they would fail.

The argument validation is tested in the same way as normal, by calling the method with invalid arguments but not attempting to use the returned query.

Finally, there are a couple of tests to indicate the point at which each input sequence is used. This is achieved using the ThrowingEnumerable we originally used in the Where tests:

[Test]
public void FirstSequenceIsntAccessedBeforeFirstUse()
{
    IEnumerable<int> first = new ThrowingEnumerable();
    IEnumerable<int> second = new int[] { 5 };
    // No exception yet…
    var query = first.Concat(second);
    // Still no exception…
    using (var iterator = query.GetEnumerator())
    {
        // Now it will go bang
        Assert.Throws<InvalidOperationException>(() => iterator.MoveNext());
    }
}

[Test]
public void SecondSequenceIsntAccessedBeforeFirstUse()
{
    IEnumerable<int> first = new int[] { 5 };
    IEnumerable<int> second = new ThrowingEnumerable();
    // No exception yet…
    var query = first.Concat(second);
    // Still no exception…
    using (var iterator = query.GetEnumerator())
    {
        // First element is fine…
        Assert.IsTrue(iterator.MoveNext());
        Assert.AreEqual(5, iterator.Current);
        // Now it will go bang, as we move into the second sequence
        Assert.Throws<InvalidOperationException>(() => iterator.MoveNext());
    }
}

I haven’t written tests to check that iterators are disposed, etc – but each input sequence’s iterator should be disposed appropriately. In particular, it’s natural for the first sequence’s iterator to be disposed before the second sequence is iterated over at all.

Let’s impement it!

The implementation is reasonably simple, but it does make me hanker after F#… it’s the normal split between argument validation and iterator block implementation, but each part is really simple:

private static IEnumerable<TSource> ConcatImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    foreach (TSource item in first)
    {
        yield return item;
    }
    foreach (TSource item in second)
    {
        yield return item;
    }
}

It’s worth just remembering at this point that this would still have been very annoying to implement without iterator blocks. Not really difficult as such, but we’d have had to remember which sequence we were currently iterating over (if any) and so on.

However, using F# we could have made this even simpler with the yield! expression, which yields a whole sequence instead of a single item. Admittedly in this case there aren’t significant performance benefits to using yield! (which there certainly can be in recursive situations) but it would just be more elegant to have the ability to yield an entire sequence in one statement. (Spec# has a similar construct called nested iterators, expressed using yield foreach.) I’m not going to pretend to know enough about the details of either F# or Spec# to draw more detailed comparisons, but we’ll see the pattern of "foreach item in a collection, yield the item" several more times before we’re done. Remember that we can’t extract that into a library method, as the "yield" expression needs special treatment by the C# compiler.

Conclusion

Even when presented with a simple implementation, I can still find room to gripe :) It would be nice to have nested iterators in C#, but to be honest the number of times I find myself frustrated by their absence is pretty small.

Concat is a useful operator, but it’s really only a very simple specialization of another operator: SelectMany. After all, Concat just flattens two sequences into one… whereas SelectMany can flatten a whole sequence of sequences, with even more generality available when required. I’ll implement SelectMany next, and show a few examples of how other operators can be implemented simply in terms of SelectMany. (We’ll see the same sort of ability for operators returning a single value when we implement Aggregate.)

Addendum: avoiding holding onto references unnecessarily

A comment suggested that we should set first to "null" after we’ve used it. That way, as soon as we’ve finished iterating over the collection, it may be eligible for garbage collection. That leads to an implementation like this:

private static IEnumerable<TSource> ConcatImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    foreach (TSource item in first)
    {
        yield return item;
    }
    // Avoid hanging onto a reference we don’t really need
    first = null;
    foreach (TSource item in second)
    {
        yield return item;
    }
}

Now normally I’d say this wouldn’t actually help – setting a local variable to null when it’s not used in the rest of the method doesn’t actually make any difference when the CLR is running in optimized mode, without a debugger attached: the garbage collector only cares about variables whih might still be accessed in the rest of the method.

In this case, however, it makes a difference – because this isn’t a normal local variable. It ends up as an instance variable in the hidden class generated by the C# compiler… and the CLR can’t tell that the instance variable will never be used again.

Arguably we could remove our only reference to "first" at the start of GetEnumerator. We could write a method of the form:

public static T ReturnAndSetToNull<T>(ref T value) where T : class
{
    T tmp = value;
    value = null;
    return tmp;
}

and then call it like this:

foreach (TSource item in ReturnAndSetToNull(ref first))

I would certainly consider that overkill, particularly as it seems very likely that the iterator will still have a reference to the collection itself – but simply setting "first" to null after iterating over it makes sense to me.

I don’t believe that the "real" LINQ to Objects implementation does this, mind you. (At some point I’ll test it with a collection which has a finalizer.)

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 7 – Count and LongCount

December 26, 2010 jonskeet 20 Comments

Today’s post covers two operators in one, because they’re so incredibly similar… to the point cut and paste of implementation, merely changing the name, return type, and a couple of variables.

What are they?

Count and LongCount each have two overloads: one with a predicate, and one without. Here are all four signatures:

public static int Count<TSource>(
this IEnumerable<TSource> source)

public static int Count<TSource>(
this IEnumerable<TSource> source,
Func<TSource, bool> predicate)

public static long LongCount<TSource>(
this IEnumerable<TSource> source)

public static long LongCount<TSource>(
this IEnumerable<TSource> source,
Func<TSource, bool> predicate)

As you can see, the LongCount signatures are identical to Count except in terms of their return types, which are long (Int64) instead of int (Int32).

The overloads without a predicate parameter simply return the number of items in the source collection; the ones with a predicate return the number of items for which that predicate returns true.

Interesting aspects of the behaviour:

These are all extension methods on IEnumerable<T> – you might argue that for the versions without a predicate, it would have been better to extend the non-generic IEnumerable, as nothing actually requires the element type.
Where there’s no predicate, Count is optimized for ICollection<T> and (in .NET 4) ICollection – both of which have Count properties which are expected to be faster than iterating over the entire collection. LongCount is not optimized in the same way in the .NET implementation – I’ll discuss this in the implementation section.
No optimization is performed in the overloads with predicates, as basically there’s no way of telling how many items will "pass" the predicate without testing them.
All methods use immediate execution – nothing is deferred. (If you think about it, there’s nothing which can be deferred here, when we’re just returning an int or a long.)
All arguments are validated simply by testing they’re non-null
Both methods should throw OverflowException when given a collection with more items than they can return the count of… though this is a considerably larger number in the case of LongCount than Count, of course.

What are we going to test?

In some senses, there are only two "success" tests involved here: one without a predicate and one with. Those are easy enough to deal with, but we also want to exercise the optimized paths. That’s actually trickier than it might sound, as we want to test four situations:

A source which implements both ICollection<T> and ICollection (easy: use List<T>)
A source which implements ICollection<T> but not ICollection (reasonably easy, after a little work finding a suitable type: use HashSet<T>)
A source which implements ICollection but not ICollection<T> but still implements IEnumerable<T> (so that we can extend it) – tricky…
A source which doesn’t implement ICollection or ICollection<T> (easy: use Enumerable.Range which we’ve already implemented)

The third bullet is the nasty one. Obviously there are plenty of implementations of ICollection but not ICollection<T> (e.g. ArrayList) but because it doesn’t implement IEnumerable<T>, we can’t call the Count extension method on it. In the end I wrote my own SemiGenericCollection class.

Once we’ve got sources for all those tests, we need to decide what we’re actually testing about them. Arguably we should test that the result is optimized, for example by checking that we never really enumerate the collection. That would require writing custom collections with GetEnumerator() methods which threw exceptions, but still returned a count from the Count property. I haven’t gone this far, but it’s another step we certainly could take.

For the overloads which take predicates, we don’t need to worry about the various collection interfaces as we’re not optimizing anyway.

The failure cases for null arguments are very simple, but there’s one other case to consider: overflow. For Count, I’ve implemented a test case to verify the overflow behaviour. Unfortunately we can’t run this test in the Edulinq implementation yet, as it requires Enumerable.Concat, but here it is for the record anyway:

[Test]
[Ignore("Takes an enormous amount of time!")]
public void Overflow()
{
    var largeSequence = Enumerable.Range(0, int.MaxValue)
                                  .Concat(Enumerable.Range(0, 1));
    Assert.Throws<OverflowException>(() => largeSequence.Count());
}

This guards against a bad implementation which overflows by simply wrapping the counter round to Int32.MinValue instead of throwing an exception.

As you can see, this test will be disabled even when it’s uncommented after we implement Concat, as it requires counting up to 2 billion – not great for a quick set of unit tests. Even that isn’t too bad, however, compared with the equivalent in LongCount which would have to count 2^63 items. Generating such a sequence isn’t difficult, but iterating over it all would take a very long time. We also need an equivalent test for the overload with a predicate – something I neglected until writing up this blog post, and finding a bug in the implementation as I did so :)

For LongCount, I merely have an equivalent test to the above which checks that the same sequence can have its length expressed as a long value.

Let’s implement them!

We’ll look at the overload which does have a predicate first – as it’s actually simpler:

public static int Count<TSource>(this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }

    // No way of optimizing this
    checked
    {
        int count = 0;
        foreach (TSource item in source)
        {
            if (predicate(item))
            {
                count++;
            }
        }
        return count;
    }
}

Note that this time we’re not using an iterator block (we’re not returning a sequence), so we don’t need to split the implementation into two different methods just to get eager argument validation.

After the argument validation, the main part of the method is reasonably simple, with one twist: we’re performing the whole iteration within a "checked" context. This means that if the increment of count overflows, it will throw an OverflowException instead of wrapping round to a negative number. There are some other alternatives here:

We could have made just the increment statement checked instead of the whole second part of the method
We could have explicitly tested for count == int.MaxValue before incrementing, and thrown an exception in that case
We could just build the whole assembly in a checked context

I think it’s useful for this section of code to be explicitly checked – it makes it obvious that it really is a requirement for general correctness. You may well prefer to make only the increment operation checked – I personally believe that the current approach draws more attention to the checked-ness, but it’s definitely a subjective matter. It’s also possible that an explicit check could be faster, although I doubt it – I haven’t benchmarked either approach.

Other than the predicate-specific parts, all the above code also appears in the optimized Count implementation – so I won’t discuss those again. Here’s the full method:

public static int Count<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    // Optimization for ICollection<T>
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        return genericCollection.Count;
    }

    // Optimization for ICollection
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        return nonGenericCollection.Count;
    }

    // Do it the slow way – and make sure we overflow appropriately
    checked
    {
        int count = 0;
        using (var iterator = source.GetEnumerator())
        {
            while (iterator.MoveNext())
            {
                count++;
            }
        }
        return count;
    }
}

The only "new" code here is the optimization. There are effectively two equivalent blocks, just testing for different collection interface types, and using whichever one it finds first (if any). I don’t know whether the .NET implementation tests for ICollection or ICollection<T> first – I could test it by implementing both interfaces but returning different counts from each, of course, but that’s probably overkill. It doesn’t really matter for well-behaved collections other than the slight performance difference – we want to test the "most likely" interface first, which I believe is the generic one.

To optimize or not to optimize?

The LongCount implementations are exactly the same as those for Count, except using long instead of int.

Notably, I still use optimizations for ICollection and ICollection<T> – but I don’t believe the .NET implementation does so. (It’s easy enough to tell by creating a huge list of bytes and comparing the time taken for Count and LongCount.)

There’s an argument for using Array.GetLongLength when the source is an array… but I don’t think the current CLR supports arrays with more than Int32.MaxValue elements anyway, so it’s a bit of a non-issue other than for future-proofing. Beyond that, I’m not sure why the .NET implementation isn’t optimized. It’s not clear what an ICollection/ICollection<T> implementation is meant to return from its Count property if it has more than Int32.MaxValue elements anyway, to be honest.

Suggestions as to what I should have done are welcome… but I should probably point out that LongCount is more likely to be used against Queryable than Enumerable – it’s easy to imagine a service representing a collection (such as a database table) which can quickly tell you the count even when it’s very large. I would imagine that there are relatively few cases where you have a collection to evaluate in-process where you really just want to iterate through the whole lot just to get the count.

Conclusion

These are our first LINQ operators which return scalar values instead of sequences – with the natural consequence that they’re simpler to understand in terms of control flow and timing. The methods simply execute – possibly with some optimization – and return their result. Nice and simple. Still, we’ve seen there can still be a few interesting aspects to consider, including questions around optimization which don’t necessarily have a good answer.

Next time, I think I’ll implement Concat – mostly so that I can uncomment the overflow tests for Count. That’s going back to an operator which returns a sequence, but it’s a really simple one…

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 6 – Repeat

December 24, 2010 jonskeet 2 Comments

A trivial method next, with even less to talk about than "Empty"… "Repeat". This blog post is merely a matter of completeness.

What is it?

"Repeat" is a static, generic non-extension method with a single overload:

public static IEnumerable<TResult> Repeat<TResult>(
TResult element,
int count)

It simple returns a sequence which contains the specified element, repeated "count" times. The only argument validation is that "count" has to be non-negative.

What are we going to test?

There’s really not a lot to test here. I’ve thought of 4 different scenarios:

A vanilla "repeat a string 3 times" sequence
An empty sequence (repeat an element 0 times)
A sequence containing null values (just to prove that "element" can be null)
A negative count to prove that argument validation occurs, and does so eagerly.

None of this is remotely exciting, I’m afraid.

Let’s implement it!

Just about the only thing we could do wrong here is to put the argument validation directly in an iterator block… and we’ve implemented the "split method" pattern so many times already that we wouldn’t fall into that trap. So, here’s the code in all its tedious lack of glory:

public static IEnumerable<TResult> Repeat<TResult>(TResult element, int count)
{
    if (count < 0)
    {
        throw new ArgumentOutOfRangeException("count");
    }
    return RepeatImpl(element, count);
}

private static IEnumerable<TResult> RepeatImpl<TResult>(TResult element, int count)
{
    for (int i = 0; i < count; i++)
    {
        yield return element;
    }
}

That’s it. Um, interesting points to note… none.

Conclusion

There’s no sense in dragging this out. That’s the lot. Next up, Count and LongCount – which actually do have a couple of interesting points.