Category Archives: LINQ

Reimplementing LINQ to Objects: Part 37 – Guiding principles

January 20, 2011 jonskeet 5 Comments

Now that I’m "done" reimplementing LINQ to Objects – in that I’ve implemented all the methods in System.Linq.Enumerable – I wanted to write a few posts looking at the bigger picture. I’m not 100% sure of what this will consist of yet; I want to avoid this blog series continuing forever. However, I’m confident it will contain (in no particular order):

This post: principles governing the behaviour of LINQ to Objects
Missing operators: what else I’d have liked to see in Enumerable
Optimization: where the .NET implementation could be further optimized, and why some obvious-sounding optimizations may be inappropriate
How query expression translations work, in brief (and with a cheat sheet)
The difference between IQueryable<T> and IEnumerable<T>
Sequence identity, the "Contains" issue, and other knotty design questions
Running the Edulinq tests against other implementations

If there are other areas you want me to cover, please let me know.

The principles behind the LINQ to Objects implementation

The design LINQ to Objects is built on a few guiding principles, both in terms of design and implementation details. You need to understand these, but also implementations should be clear about what they’re doing in these terms too.

Extension method targets and argument validation

IEnumerable<T> is the core sequence type, not just for LINQ but for .NET as a whole. Almost everything is written in terms of IEnumerable<T> at least as input, with the following exceptions:

Empty, Range and Repeat don’t have input sequences (these are the only non-extension methods)
OfType and Cast work on the non-generic IEnumerable type instead
ThenBy and ThenByDescending work on IOrderedEnumerable<T>

All operators other than AsEnumerable verify that any input sequence is non-null. This validation is performed eagerly (i.e. when the method is called) even if the operator uses deferred execution for the results. Any delegate used (typically a projection or predicate of some kind) must be non-null. Again, this validation is performed eagerly.

IEqualityComparer<T> is used for all custom equality comparisons. Any parameter of this type may be null, in which case the default equality comparer for the type is used. In most cases the default equality comparer for the type is also used when no custom equality comparer is used, but Contains has some odd behaviour around this. Equality comparers are expected to be able to handle null values. IComparer<T> is only used by the OrderBy/ThenBy operators and their descending counterparts – and only then if you want custom comparisons between keys. Again, a null IComparer<T> means "use the default for the type"

Timing of input sequence "opening"

Any operator with a return type of IEnumerable<T> or IOrderedEnumerable<T> uses deferred execution. This means that the method doesn’t read anything from any input sequences until someone starts reading from the result sequence. It’s not clearly defined exactly when input sequences will first be accessed – for some operators if may be when GetEnumerator() is called; for others it may be on the first call to MoveNext() on the resulting iterator. Callers should not depend on these slight variations. Deferred execution is common for operators in the middle of queries. Operators which use deferred execution effectively represent queries rather than the results of queries – so if you change the contents of the original source of the query and then iterate over the query itself again, you’ll see the change. For example:

List<string> source = new List<string>();
var query = source.Select(x => x.ToUpper());

// This loop won’t write anything out
foreach (string x in query)
{
    Console.WriteLine(x);
}

source.Add("foo");
source.Add("bar");

// This loop will write out "FOO" and "BAR" – even
// though we haven’t changed the value of "query"
foreach (string x in query)
{
Console.WriteLine(x);
}

Deferred execution is one of the hardest parts of LINQ to understand, but once you do, everything becomes somewhat simpler.

All other operators use immediate execution, fetching all the data they need from the input before they return a value… so that by the time they do return, they will no longer see or care about changes to the input sequence. For operators returning a scalar value (such as Sum and Average) this is blatantly obvious – the value of a variable of type double isn’t going to change just because you’ve added something to a list. However, it’s slightly less for the "ToXXX" methods: ToLookup, ToArray, ToList and ToDictionary. These do not return views on the original sequence, unlike the "As" methods: AsEnumerable which we’ve seen, and Queryable.AsQueryable which I didn’t implement. Focus on the prefix part of the name: the "To" part indicates a conversion to a particular type. The "As" prefix indicates a wrapper of some kind. This is consistent with other parts of the framework, such as List<T>.AsReadOnly and Array.AsReadOnly<T>.

Very importantly, LINQ to Objects only iterates over any input sequence at most once, whether the execution is deferred or immediate. Some operators would be easier to implement if you could iterate over the input twice – but it’s important that they don’t do so. Of course if you provide the same sequence for two inputs, it will treat those as logically different sequences. Similarly if you iterate over a result sequence more than once (for operators that return IEnumerable<T> or a related interface, rather than List<T> or an array etc), that will iterate over the input sequence again.

This means it’s fine to use LINQ to Objects with sequences which may only be read once (such as a network stream), or which are relatively expensive to reread (imagine a log file reader over a huge set of logs) or which give inconsistent results (imagine a sequence of random numbers). In some cases it’s okay to use LINQ to Objects with an infinite sequence – in others it’s not. It’s usually fairly obvious which is the case.

Timing of input sequence reading, and memory usage

Where possible within deferred execution, operators act in a streaming fashion, only reading from the input sequence when they have to, and "forgetting" data as soon as they can. This allows for long – potentially infinite – sequences to be handled elegantly without memory running out.

Some operators naturally need to read all the data in before they can return anything. The most obvious example of this is Reverse, which will always yield the last element of the input stream as the first element in the result stream.

A third pattern occurs with operators such as Distinct, which yield data as they go, but accumulate elements too, taking more and more memory until the caller stops iterating (usually either by jumping out of the foreach loop, or letting it terminate naturally).

Where an operator takes two input sequences – such as Join – you need to understand the consumption of each one separately. For example, Join uses deferred execution, but as soon as you ask for the first element of the result set, it will read the "second" sequence completely and buffer it – whereas the "first" sequence is streamed. This isn’t the case for all operators with two inputs, of course – Zip streams both input sequences, for example. Check the documentation – and the relevant Edulinq blog post – for details.

Obviously any operator which uses immediate execution has to read all the data it’s interested in before it returns. This doesn’t necessarily mean they will read to the end of the sequence though, and they may not need to buffer the data they read. (Simple examples are ToList which has to keep everything, and Sum which doesn’t.)

Queries vs data

Closely related to the details of when the input is read is the concept of what the result of an operator actually represents. Operators which use deferred execution return queries: each time you iterate over the result sequence, the query will look at the input sequence again. The query itself doesn’t contain the data – it just knows how to get at the data.

Operators which use immediate execution work the other way round: they read all the data they need, and then forget about the input sequence. For operators like Average and Sum this is obvious as it’s just a simple scalar value – but for operators like ToList, ToDictionary, ToLookup and ToArray, it means that the operator has to make a copy of everything it needs. (This is potentially a shallow copy of course – depending on what user-defined projections are applied. The normal behaviour of mutable reference types is still valid.)

I realise that in many ways I’ve just said the same thing multiple times now – but hopefully that will help this crucial aspect of LINQ behaviour sink in, if you were still in any doubt.

Exception handling

I’m unaware of any situation in which LINQ to Objects will catch an exception. If your predicate or projection throws an exception, it will propagate in the obvious way.

However, LINQ to Objects does ensure that any iterator it reads from is disposed appropriately – assuming that the caller disposes of any result sequences properly, of course. Note that the foreach statement implicitly disposes of the iterator in a finally block.

Optimization

Various operators are optimized when they detect at execution time that the input sequence they’re working on offers a shortcut.

The types most commonly detected are:

ICollection<T> and ICollection for their Count property
IList<T> for its random access indexer

I’ll look at optimization in much more detail in a separate post.

Conclusion

This post has not been around the guiding principles behind LINQ itself – lambda calculus or anything like that. It’s more been a summary of the various aspects of behaviour we’ve seen across the various operators we’ve implemented. They’re the rules I’ve had to follow in order to make Edulinq reasonably consistent with LINQ to Objects.

Next time I’ll talk about some of the operators which I think should have made it into the core framework, at least for LINQ to Objects.

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 36 – AsEnumerable

January 14, 2011 jonskeet 2 Comments

Our last operator is the simplest of all. Really, really simple.

What is it?

AsEnumerable has a single signature:

public static IEnumerable<TSource> AsEnumerable<TSource>(this IEnumerable<TSource> source)

I can describe its behaviour pretty easily: it returns source.

That’s all it does. There’s no argument validation, it doesn’t create another iterator. It just returns source.

You may well be wondering what the point is… and it’s all about changing the compile-time type of the expression. I’m going to take about IQueryable<T> in another post (although probably not implement anything related to it) but hopefully you’re aware that it’s usually used for "out of process" queries – most commonly in databases.

Now it’s not entirely uncommon to want to perform some aspects of the query in the database, and then a bit more manipulation in .NET – particularly if there are aspects you basically can’t implement in LINQ to SQL (or whatever provider you’re using). For example, you may want to build a particular in-memory representation which isn’t really amenable to the provider’s model.

In that case, a query can look something like this:

var query = db.Context
              .Customers
              .Where(c => some filter for SQL)
              .OrderBy(c => some ordering for SQL)
              .Select(c => some projection for SQL)
              .AsEnumerable() // Switch to "in-process" for rest of query
              .Where(c => some extra LINQ to Objects filtering)
              .Select(c => some extra LINQ to Objects projection);

All we’re doing is changing the compile-time type of the sequence which is propagating through our query from IQueryable<T> to IEnumerable<T> – but that means that the compiler will use the methods in Enumerable (taking delegates, and executing in LINQ to Objects) instead of the ones in Queryable (taking expression trees, and usually executing out-of-process).

Sometimes we could do this with a simple cast or variable declaration. However, for one thing that’s ugly, whereas the above query is fluent and quite readable, so long as you appreciate the importance of AsEnumerable. The more important point is that it’s not always possible, because we may very well be dealing with a sequence of an anonymous type. An extension method lets the compiler use type inference to work out what the T should be for IEnumerable<T>, but you can’t actually express that in your code.

In short – it’s not nearly as useless an operator as it seems at first sight. That doesn’t make it any more complicated to test or implement though…

What are we going to test?

In the spirit of exhaustive testing, I have actually tested:

A normal sequence
A null reference
A sequence which would throw an exception if you actually tried to use it

The tests just assert that the result is the same reference as we’ve passed in.

I have one additional test which comes as close as I can to demonstrating the point of AsEnumerable without using Queryable:

[Test]
public void AnonymousType()
{
    var list = new[] {
        new { FirstName = "Jon", Surname = "Skeet" },
        new { FirstName = "Holly", Surname = "Skeet" }
    }.ToList();

    // We can’t cast to IEnumerable<T> as we can’t express T.
    var sequence = list.AsEnumerable();
    // This will now use Enumerable.Contains instead of List.Contains
    Assert.IsFalse(sequence.Contains(new { FirstName = "Tom", Surname = "Skeet" }));
}

And finally…

Let’s implement it!

There’s not much scope for an interesting implementation here I’m afraid. Here it is, in its totality:

public static IEnumerable<TSource> AsEnumerable<TSource>(this IEnumerable<TSource> source)
{
return source;
}

It feels like a fittingly simple end to the Edulinq implementation.

Conclusion

I think that’s all I’m going to actually implement from LINQ to Objects. Unless I’ve missed something, that covers all the methods of Enumerable from .NET 4.

That’s not the end of this series though. I’m going to take a few days to write up some thoughts about design choices, optimizations, other operators which might have been worth including, and a little bit about how IQueryable<T> works.

Don’t forget that the source code is freely available on Google Code. I’ll be happy to patch any embarrassing bugs :)

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 35 – Zip

January 14, 2011 jonskeet 18 Comments

Zip will be a familiar operator to any readers who use Python. It was introduced in .NET 4 – it’s not entirely clear why it wasn’t part of the first release of LINQ, to be honest. Perhaps no-one thought of it as a useful operator until it was too late in the release cycle, or perhaps implementing it in the other providers (e.g. LINQ to SQL) took too long. Eric Lippert blogged about it in 2009, and I find it interesting to note that aside from braces, layout and names we’ve got exactly the same code. (I read the post at the time of course, but implemented it tonight without looking back at what Eric had done.) It’s not exactly surprising, given how trivial the implementation is. Anyway, enough chit-chat…

What is it?

Zip has a single signature, which isn’t terribly complicated:

public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(
    this IEnumerable<TFirst> first,
    IEnumerable<TSecond> second,
    Func<TFirst, TSecond, TResult> resultSelector)

Just from the signature, the name, and experience from the rest of this blog series it should be easy enough to guess what Zip does:

It uses deferred execution, not reading from either sequence until the result sequence is read
All three parameters must be non-null; this is validated eagerly
Both sequences are iterated over "at the same time": it calls GetEnumerator() on each sequence, then moves each iterator forward, then reads from it, and repeats.
The result selector is applied to each pair of items obtained in this way, and the result yielded
It stops when either sequence terminates
As a natural consequence of how the sequences are read, we don’t need to perform any buffering: we only care about one element from each sequence at a time.

There are really only two things that I could see might have been designed differently:

It could have just returned IEnumerable<Tuple<TFirst, TSecond>> but that would have been less efficient in many cases (in terms of the GC) and inconsistent with the rest of LINQ
It could have provided different options for what to do with sequences of differents lengths. For example:
- Throw an exception
- Use the default value of the shorter sequence type against the remaining items of the longer sequence
- Use a specified default value of the shorter sequence in the same way

I don’t have any problem with the design that’s been chosen here though.

What are we going to test?

There are no really interesting test cases here. We test argument validation, deferred execution, and the obvious "normal" cases. I do have tests where "first" is longer than "second" and vice versa.

The one test case which is noteworthy isn’t really present for the sake of testing at all – it’s to demonstrate a technique which can occasionally be handy. Sometimes we really want to perform a projection on adjacent pairs of elements. Unfortunately there’s no LINQ operator to do this naturally (although it’s easy to write one) but Zip can provide a workaround, so long as we don’t mind evaluating the sequence twice. (That could be a problem in some cases, but is fine in others.)

Obviously if you just zip a sequence with itself directly you get each element paired with the same one. We effectively need to "shift" or "delay" one sequence somehow. We can do this using Skip, as shown in this test:

[Test]
public void AdjacentElements()
{
    string[] elements = { "a", "b", "c", "d", "e" };
    var query = elements.Zip(elements.Skip(1), (x, y) => x + y);
    query.AssertSequenceEqual("ab", "bc", "cd", "de");
}

It always takes me a little while to work out whether I want to make first skip or second – but if we want the second element as the first element of second (try getting that right ten times in a row – it makes sense, honest!) means that we want to call Skip on the sequence used as the argument for second. Obviously it would work the other way round too – we’d just get the pairs presented with the values switched, so the results of the query above would be "ba", "cb" etc.

Let’s implement it!

Guess what? It’s yet another operator with a split implementation between the argument validation and the "real work". I’ll skip argument validation, and get into the tricky stuff. Are you ready? Sure you don’t want another coffee?

private static IEnumerable<TResult> ZipImpl<TFirst, TSecond, TResult>(
    IEnumerable<TFirst> first,
    IEnumerable<TSecond> second,
    Func<TFirst, TSecond, TResult> resultSelector)
{
    using (IEnumerator<TFirst> iterator1 = first.GetEnumerator())
    using (IEnumerator<TSecond> iterator2 = second.GetEnumerator())
    {
        while (iterator1.MoveNext() && iterator2.MoveNext())
        {
            yield return resultSelector(iterator1.Current, iterator2.Current);
        }
    }
}

Okay, so possibly "tricky stuff" was a bit of an overstatement. Just about the only things to note are:

I’ve "stacked" the using statements instead of putting the inner one in braces and indenting it. For using statements with different variable types, this is one way to keep things readable, although it can be a pain when tools try to reformat the code. (Also, I don’t usually omit optional braces like this. It does make me feel a bit dirty.)
I’ve used the "symmetric" approach again instead of a using statement with a foreach loop inside it. That wouldn’t be hard to do, but it wouldn’t be as simple.

That’s just about it. The code does exactly what it looks like, which doesn’t make for a very interesting blog post, but does make for good readability.

Conclusion

Two operators to go, one of which I might not even tackle fully (AsQueryable – it is part of Queryable rather than Enumerable, after all).

AsEnumerable should be pretty easy…

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 34 – SequenceEqual

January 14, 2011 jonskeet 3 Comments

Nearly there now…

What is it?

SequenceEqual has two overloads – the obvious two given that we’re dealing with equality:

public static bool SequenceEqual<TSource>(
this IEnumerable<TSource> first,
IEnumerable<TSource> second)

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)

The purpose of the operator is to determine if two sequences are equal; that is, if they consist of the same elements, in the same order. A custom equality comparer can be used to compare each individual pair of elements. Characteristics:

The first and second parameters mustn’t be null, and are validated immediately.
The comparer parameter can be null, in which case the default equality comparer for TSource is used.
The overload without a comparer uses the default equality comparer for TSource (no funny discrepancies if ICollection<T> instances are involved, this time).
It uses immediate execution.
It returns as soon as it notices a difference, without evaluating the rest of either sequence.

So far, so good. Note that no optimizations are mentioned above. There are questions you might consider:

Should foo.SequenceEqual(foo) always return true?
If either or both of the sequences implements another collection interface, does that help us?

The first question sounds like it should be a no-brainer, but it’s not as simple as it sounds. Suppose we have a sequence which always generates 10 random numbers. Is it equal to itself? If you iterate over it twice, you’ll usually get different results. What about a sequence which explicitly changes each time you iterate over it, based on some side-effect? Both the .NET and Edulinq implementations say that these are non-equal. (The random case is interesting, of course – it could happen to yield the same elements as we iterate over the two sequences.)

The second question feels a little simpler to me. We can’t take a shortcut to returning true, but it seems reasonably obvious to me that if you have two collections which allow for a fast "count" operation, and the two counts are different, then the sequences are unequal. Unfortunately, LINQ to Objects appears not to optimize for this case: if you create two huge arrays of differing sizes but equal elements as far as possible, it will take a long time for SequenceEqual to return false. Edulinq does perform this optimization. Note that just having one count isn’t useful: you might expect it to, but it turns out that by the time we could realize that the lengths were different in that case, we’re about to find that out in the "normal" fashion anyway, so there’s no point in complicating the code to make use of the information.

What are we going to test?

As well as the obvious argument validation, I have tests for:

Collections of different lengths
Ranges of different lengths, both with first shorter than second and vice versa
Using a null comparer
Using a custom comparer
Using no comparer
Equal arrays
Equal ranges
The non-optimization of foo.SequenceEquals(foo) (using side-effects)
The optimization using Count (fails on LINQ to Objects)
Ordering: { 1, 2 } should not be equal to { 2, 1 }
The use of a HashSet<string> with a case-insensitive comparer: the default (case-sensitive) comparer is still used when no comparer is provided
Infinite first sequence with finite second sequence, and vice versa
Sequences which differ just before they would go bang

None of the test code is particularly interesting, to be honest.

Let’s implement it!

I’m not going to show the comparer-less overoad, as it just delegates to the one with a comparer.

Before we get into the guts of SequenceEqual, it’s time for a bit of refactoring. If we’re going to optimize for count, we’ll need to perform the same type tests as Count() twice. That would be horrifically ugly inline, so let’s extract the functionality out into a private method (which Count() can then call as well):

private static bool TryFastCount<TSource>(
    IEnumerable<TSource> source,
    out int count)
{
    // Optimization for ICollection<T>
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        count = genericCollection.Count;
        return true;
    }

    // Optimization for ICollection
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        count = nonGenericCollection.Count;
        return true;
    }
    // Can’t retrieve the count quickly. Oh well.
    count = 0;
    return false;
}

Pretty simple. Note that we always have to set the out parameter to some value. We use 0 on failure – which happens to work out nicely in the Count, as we can just start incrementing if TryFastCount has returned false.

Now we can make a start on SequenceEqual. Here’s the skeleton before we start doing the real work:

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }

    int count1;
    int count2;
    if (TryFastCount(first, out count1) && TryFastCount(second, out count2))
    {
        if (count1 != count2)
        {
            return false;
        }
    }

comparer = comparer ?? EqualityComparer<TSource>.Default;

// Main part of implementation goes here
}

I could have included the comparison between count1 and count2 within the single "if" condition, like this:

if (TryFastCount(first, out count1) &&
    TryFastCount(second, out count2) &&
    count1 != count2)
{
    return false;
}

… but I don’t usually like using the values of out parameters like this. The behaviour is well-defined and correct, but it just feels a little ugly to me.

Okay, now let’s implement the "Main part" which at the bottom of the skeleton. The idea is simple:

Get the iterators for both sequences
Use the iterators "in parallel" (not in the multithreading sense, but in the movement of the logical cursor down the sequence) to compare pairs of elements; we can return false if we ever see an unequal pair
If we ever see that one sequence has finished and the other hasn’t, we can return false
If we get to the end of both sequences in the same iteration, we can return true

I’ve got three different ways of representing the basic algorithm in code though. Fundamentally, the problem is that we don’t have a way of iterating over pairs of elements in two sequences with foreach – we can’t use one foreach loop inside another, for hopefully obvious reasons. So we’ll have to call GetEnumerator() explicitly on at least one of the sequences… and we could do it for both if we want.

The first implementation (and my least favourite) does use a foreach loop:

using (IEnumerator<TSource> iterator2 = second.GetEnumerator())
{
    foreach (TSource item1 in first)
    {
        // second is shorter than first
        if (!iterator2.MoveNext())
        {
            return false;
        }
        if (!comparer.Equals(item1, iterator2.Current))
        {
            return false;
        }
    }
    // If we can get to the next element, first was shorter than second.
    // Otherwise, the sequences are equal.
    return !iterator2.MoveNext();
}

I don’t have a desperately good reason for picking them this way round (i.e. foreach over first, and GetEnumerator() on second) other than that it seems to still give primacy to first somehow… only first gets the "special treatment" of a foreach loop. (I can almost hear the chants now, "Equal rights for second sequences! Don’t leave us out of the loop! Stop just ‘using’ us!") Although I’m being frivolous, I dislike the asymmetry of this.

The second attempt is a half-way house: it’s still asymmetric, but slightly less so as we’re explicitly fetching both iterators:

using (IEnumerator<TSource> iterator1 = first.GetEnumerator(),
                            iterator2 = second.GetEnumerator())
{
    while (iterator1.MoveNext())
    {
        // second is shorter than first
        if (!iterator2.MoveNext())
        {
            return false;
        }
        if (!comparer.Equals(iterator1.Current, iterator2.Current))
        {
            return false;
        }
    }
    // If we can get to the next element, first was shorter than second.
    // Otherwise, the sequences are equal.
    return !iterator2.MoveNext();
}

Note the use of the multi-variable "using" statement; this is equivalent nesting one statement inside another, of course.

The similarities between these two implementations are obvious – but the differences are worth pointing out. In the latter approach, we call MoveNext() on both sequences, and then we access the Current property on both sequences. In each case we use iterator1 before iterator2, but it still feels like they’re being treated more equally somehow. There’s still the fact that iterator1 is being used in the while loop condition, whereas iterator2 has to be used both inside and outside the while loop. Hmm.

The third implementation takes this even further, changing the condition of the while loop:

using (IEnumerator<TSource> iterator1 = first.GetEnumerator(),
       iterator2 = second.GetEnumerator())
{
    while (true)
    {
        bool next1 = iterator1.MoveNext();
        bool next2 = iterator2.MoveNext();
        // Sequences aren’t of same length. We don’t
        // care which way round.
        if (next1 != next2)
        {
            return false;
        }
        // Both sequences have finished – done
        if (!next1)
        {
            return true;
        }
        if (!comparer.Equals(iterator1.Current, iterator2.Current))
        {
            return false;
        }
    }

This feels about as symmetric as we can get. The use of next1 in the middle "if" condition is incidental – it could just as easily be next2, as we know the values are equal. We could switch round the order of the calls to MoveNext(), the order of arguments to comparer.Equals – the structure is symmetric.

I’m not generally a fan of while(true) loops, but in this case I think I rather like it. It makes it obvious that we’re going to keep going until we’ve got a good reason to stop: one of the three return statements. (I suppose I should apologise to fans of the dogma around single exit points for methods, if any are reading. This must be hell for you…)

Arguably this is all a big fuss about nothing – but writing Edulinq has given me a new appreciation for diving into this level of detail to find the most readable code. As ever, I’d be interested to hear your views. (All three versions are in source control. Which one is active is defined with a #define.)

Conclusion

I really don’t know why Microsoft didn’t implement the optimization around different lengths for SequenceEqual. Arguably in the context of LINQ you’re unlikely to be dealing with two materialized collections at a time – it’s much more common to have one collection and a lazily-evaluated query, or possibly just two queries… but it’s a cheap optimization and the benefits can be significant. Maybe it was just an oversight.

Our next operator also deals with pairs of elements, so we may be facing similar readability questions around it. It’s Zip – the only new LINQ query operator in .NET 4.

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 33 – Cast and OfType

January 13, 2011 jonskeet 2 Comments

More design decisions around optimization today, but possibly less controversial ones…

What are they?

Cast and OfType are somewhat unusual LINQ operators. They are extension methods, but they work on the non-generic IEnumerable type instead of the generic IEnumerable<T> type:

public static IEnumerable<TResult> Cast<TResult>(this IEnumerable source)

public static IEnumerable<TResult> OfType<TResult>(this IEnumerable source)

It’s worth mentioning what Cast and OfType are used for to start with. There are two main purposes:

Using a non-generic collection (such as a DataTable or an ArrayList) within a LINQ query (DataTable has the AsEnumerable extension method too)
Changing the type of a generic collection, usually to use a more specific type (e.g. you have List<Person> but you’re confident they’re all actually Employee instances – or you only want to query against the Employee instances)

I can’t say that I use either operator terribly often, but if you’re starting off from a non-generic collection for whatever reason, these two are your only easy way to get "into" the LINQ world.

Here’s a quick rundown of the behaviour they have in common:

The source parameter must not be null, and this is validated eagerly
It uses deferred execution: the input sequence is not read until the output sequence is
It streams its data – you can use it on arbitrarily-long sequences and the extra memory required will be constant (and small :)

Both operators effectively try to convert each element of the input sequence to the result type (TResult). When they’re successful, the results are equivalent (ignoring optimizations, which I’ll come to later). The operators differ in how they handle elements which aren’t of the result type.

Cast simply tries to cast each element to the result type. If the cast fails, it will throw an InvalidCastException in the normal way. OfType, however, sees whether each element is a value of the result type first – and ignores it if it’s not.

There’s one important case to consider where Cast will successfully return a value and OfType will ignore it: null references (with a nullable return type). In normal code, you can cast a null reference to any nullable type (whether that’s a reference type or a nullable value type). However, if you use the "is" C# operator with a null value, it will always return false. Cast and OfType follow the same rules, basically.

It’s worth noting that (as of .NET 3.5 SP1) Cast and OfType only perform reference and unboxing conversions. They won’t convert a boxed int to a long, or execute user-defined conversions. Basically they follow the same rules as converting from object to a generic type parameter. (That’s very convenient for the implementation!) In the original implementation of .NET 3.5, I believe some other conversions were supported (in particular, I believe that the boxed int to long conversion would have worked). I haven’t even attempted to replicate the pre-SP1 behaviour. You can read more details in Ed Maurer’s blog post from 2008.

There’s one final aspect to discuss: optimization. If "source" already implements IEnumerable<TResult>, the Cast operator just returns the parameter directly, within the original method call. (In other words, this behaviour isn’t deferred.) Basically we know that every cast will succeed, so there’s no harm in returning the input sequence. This means you shouldn’t use Cast as an "isolation" call to protect your original data source, in the same way as we sometimes use Select with an identity projection. See Eric Lippert’s blog post on degenerate queries for more about protecting the original source of a query.

In the LINQ to Objects implementation, OfType never returns the source directly. It always uses an iterator. Most of the time, it’s probably right to do so. Just because something implements IEnumerable<string> doesn’t mean everything within it should be returned by OfType… because some elements may be null. The same is true of an IEnumerable<int?> – but not an IEnumerable<int>. For a non-nullable value type T, if source implements IEnumerable<T> then source.OfType<T>() will always contain the exact same sequence of elements as source. It does no more harm to return source from OfType() here than it does from Cast().

What are we going to test?

There are "obvious" tests for deferred execution and eager argument validation. Beyond that, I effectively have two types of test: ones which focus on whether the call returns the original argument, and ones which test the behaviour of iterating over the results (including whether or not an exception is thrown).

The iteration tests are generally not that interesting – in particular, they’re similar to tests we’ve got everywhere else. The "identity" tests are more interesting, because they show some differences between conversions that are allowed by the CLR and those allowed by C#. It’s obvious that an array of strings is going to be convertible to IEnumerable<string>, but a test like this might give you more pause for thought:

[Test]
public void OriginalSourceReturnedForInt32ArrayToUInt32SequenceConversion()
{
IEnumerable enums = new int[10];
Assert.AreSame(enums, enums.Cast<uint>());
}

That’s trying to "cast" an int[] to an IEnumerable<uint>. If you try the same in normal C# code, it will fail – although if you cast it to "object" first (to distract the compiler, as it were) it’s fine at both compile time and execution time:

int[] ints = new int[10];
// Fails with CS0030
IEnumerable<uint> uints = (IEnumerable<uint>) ints;

// Succeeds at execution time
IEnumerable<uint> uints = (IEnumerable<uint>)(object) ints;

We can have a bit more fun at the compiler’s expense, and note its arrogance:

int[] ints = new int[10];

if (ints is IEnumerable<uint>)
{
    Console.WriteLine("This won’t be printed");
}
if (((object) ints) is IEnumerable<uint>)
{
    Console.WriteLine("This will be printed");
}

This generates a warning for the first block "The given expression is never of the provided (…) type" and the compiler has the cheek to remove the block entirely… despite the fact that it would have worked if only it had been emitted as code.

Now, I’m not really trying to have a dig at the C# team here – the compiler is actually acting entirely reasonably within the rules of C#. It’s just that the CLR has subtly different rules around conversions – so when the compiler makes a prediction about what would happen with a particular cast or "is" test, it can be wrong. I don’t think this has ever bitten me as an issue, but it’s quite fun to watch. As well as this signed/unsigned difference, there are similar conversions between arrays of enums and their underlying types.

There’s another type of conversion which is interesting:

[Test]
public void OriginalSourceReturnedDueToGenericCovariance()
{
IEnumerable strings = new List<string>();
Assert.AreSame(strings, strings.Cast<object>());
}

This takes advantage of the generic variance introduced in .NET 4 – sort of. There is now a reference conversion from List<string> to IEnumerable<object> which wouldn’t have worked in .NET 3.5. However, this isn’t due to the fact that C# 4 now knows about variance; the compiler isn’t verifying the conversion here, after all. It isn’t due to a new feature in the CLRv4 – generic variance for interfaces and delegates has been present since generics were introduced in CLRv2. It’s only due to the change in the IEnumerable<T> type, which has become IEnumerable<out T> in .NET 4. If you could make the same change to the standard library used in .NET 3.5, I believe the test above would pass. (It’s possible that the precise CLR rules for variance changed between CLRv2 and CLRv4 – I don’t think this variance was widely used before .NET 4, so the risk of it being a problematically-breaking change would have been slim.)

In addition to all these functional tests, I’ve included a couple of tests to show that the compiler uses Cast in query expressions if you give a range variable an explicit type. This works for both "from" and "join":

[Test]
public void CastWithFrom()
{
    IEnumerable strings = new[] { "first", "second", "third" };
    var query = from string x in strings
                select x;
    query.AssertSequenceEqual("first", "second", "third");
}

[Test]
public void CastWithJoin()
{
    var ints = Enumerable.Range(0, 10);
    IEnumerable strings = new[] { "first", "second", "third" };
    var query = from x in ints
                join string y in strings on x equals y.Length
                select x + ":" + y;
    query.AssertSequenceEqual("5:first", "5:third", "6:second");
}

Note how the compile-time type of "strings" is just IEnumerable in both cases. We couldn’t use this in a query expression normally, because LINQ requires generic sequences – but by giving the range variables explicit types, the compiler has inserted a call to Cast which makes the rest of the translation work.

Let’s implement them!

The "eager argument validation, deferred sequence reading" mode of Cast and OfType means we’ll use the familiar approach of a non-iterator-block public method which finally calls an iterator block if it gets that far. This time, however, the optimization occurs within the public method. Here’s Cast, to start with:

public static IEnumerable<TResult> Cast<TResult>(this IEnumerable source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    IEnumerable<TResult> existingSequence = source as IEnumerable<TResult>;
    if (existingSequence != null)
    {
        return existingSequence;
    }
    return CastImpl<TResult>(source);
}

private static IEnumerable<TResult> CastImpl<TResult>(IEnumerable source)
{
    foreach (object item in source)
    {
        yield return (TResult) item;
    }
}

We’re using the normal as/null-test to check whether we can just return the source directly, and in the loop we’re casting. We could have made the iterator block very slightly shorter here, using the behaviour of foreach to our advantage:

foreach (TResult item in source)
{
yield return item;
}

Yikes! Where’s the cast gone? How can this possibly work? Well, the cast is still there – it’s just been inserted automatically by the compiler. It’s the invisible cast that was present in almost every foreach loop in C# 1. The fact that it is invisible is the reason I’ve chosen the previous version. The point of the method is to cast each element – so it’s pretty important to make the cast as obvious as possible.

So that’s Cast. Now for OfType. First let’s look at the public entry point:

public static IEnumerable<TResult> OfType<TResult>(this IEnumerable source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (default(TResult) != null)
    {
        IEnumerable<TResult> existingSequence = source as IEnumerable<TResult>;
        if (existingSequence != null)
        {
            return existingSequence;
        }
    }
    return OfTypeImpl<TResult>(source);
}

This is almost the same as Cast, but with the additional test of "default(TResult) != null" before we check whether the input sequence is an IEnumerable<TResult>. That’s a simple way of saying, "Is this a non-nullablle value type." I don’t know for sure, but I’d hope that when the JIT compiler looks at this method, it can completely wipe out the test, either removing the body of the if statement completely for nullable value types and reference types, or just go execute the body unconditionally for non-nullable value types. It really doesn’t matter if JIT doesn’t do this, but one day I may get up the courage to tackle this with cordbg and find out for sure… but not tonight.

Once we’ve decided we’ve got to iterate over the results ourselves, the iterator block method is quite simple:

private static IEnumerable<TResult> OfTypeImpl<TResult>(IEnumerable source)
{
    foreach (object item in source)
    {
        if (item is TResult)
        {
            yield return (TResult) item;
        }
    }
}

Note that we can’t use the "as and check for null" test here, because we don’t know that TResult is a nullable type. I was tempted to try to write two versions of this code – one for reference types and one for value types. (I’ve found before that using "as and check for null" is really slow for nullable value types. That may change, of course.) However, that would be quite tricky and I’m not convinced it would have much impact. I did a quick test yesterday testing whether an "object" was actually a "string", and the is+cast approach seemed just as good. I suspect that may be because string is a sealed class, however… testing for an interface or a non-sealed class may be more expensive. Either way, it would be premature to write a complicated optimization without testing first.

Conclusion

It’s not clear to me why Microsoft optimizes Cast but not OfType. There’s a possibility that I’ve missed a reason why OfType shouldn’t be optimized even for a sequence of non-nullable value type values – if you can think of one, please point it out in the comments. My immediate objection would be that it "reveals" the source of the query… but as we’ve seen, Cast already does that sometimes, so I don’t think that theory holds.

Other than that decision, the rest of the implementation of these operators has been pretty plain sailing. It did give us a quick glimpse into the difference between the conversions that the CLR allows and the ones that the C# specification allows though, and that’s always fun.

Next up – SequenceEqual.

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 32 – Contains

January 12, 2011 jonskeet 7 Comments

After the dubious optimizations of ElementAt/ElementAtOrDefault yesterday, we meet an operator which is remarkably good at defying optimization. Sort of. Depending on how you feel it should behave.

What is it?

Contains has two overloads, which only differ by whether or not they take an equality comparer – just like Distinct, Intersect and the like:

public static bool Contains<TSource>(
this IEnumerable<TSource> source,
TSource value)

public static bool Contains<TSource>(
    this IEnumerable<TSource> source,
    TSource value,
    IEqualityComparer<TSource> comparer)

The operator simply returns a Boolean indicating whether or not "value" was found in "source". The salient points of its behaviour should be predictable now:

It uses immediate execution (as it’s returning a simple value instead of a sequence)
The source parameter cannot be null, and is validated immediately
The value parameter can be null: it’s valid to search for a null value within a sequence
The comparer parameter can be null, which is equivalent to passing in EquailtyComparer<TSource>.Default.
The overload without a comparer uses the default equality comparer too.
If a match is found, the method returns immediately without reading the rest of the input sequence.
There’s a documented optimization for ICollection<T> – but there are significant issues with it…

So far, so good.

What are we going to test?

Aside from argument validation, I have tests for the value being present in the source, and it not being present in the source… for the three options of "no comparer", "null comparer" and "specific comparer".

I then have one final test to validate that we return as soon as we’ve found a match, by giving a query which will blow up when the element after the match is computed.

Frankly none of the tests are earth-shattering, but in the spirit of giving you an idea of what they’re like, here’s one with a custom comparer – we use the same source and value for a "default comparer" test which doesn’t find the value as the case differs:

[Test]
public void MatchWithCustomComparer()
{
    // Default equality comparer is ordinal
    string[] source = { "foo", "bar", "baz" };
    Assert.IsTrue(source.Contains("BAR", StringComparer.OrdinalIgnoreCase));
}

Currently I don’t have a test for the optimization mentioned in the bullet points above, as I believe it’s broken. More later.

Let’s implement it!

To start with, let’s dispense with the overload without a comparer parameter: that just delegates to the other one by specifying EqualityComparer<TSource>.Default. Trivial. (Or so we might think. There’s more to this than meets the eye.)

I’ve got three implementations, but we’ll start with just two of them. Which one you pick would depend on whether you’re happy to use one operator to implement another. If you think that’s okay, it’s really simple:

public static bool Contains<TSource>(
    this IEnumerable<TSource> source,
    TSource value,
    IEqualityComparer<TSource> comparer)
{
    comparer = comparer ?? EqualityComparer<TSource>.Default;
    return source.Any(item => comparer.Equals(value, item));
}

"Any" has exactly the traits we want, including validation of the non-nullity of "source". It’s hardly complicated code if we don’t use Any though:

public static bool Contains<TSource>(
    this IEnumerable<TSource> source,
    TSource value,
    IEqualityComparer<TSource> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    comparer = comparer ?? EqualityComparer<TSource>.Default;
    foreach (TSource item in source)
    {
        if (comparer.Equals(value, item))
        {
            return true;
        }
    }
    return false;
}

Obviously there’s a slight penalty in using Any just because of executing a delegate on each iteration – and the extra memory requirement of building an object to capture the comparer. I haven’t measured the performance impact of this – again, it’s a candidate for benchmarking.

Can’t we optimize? (And why does LINQ to Objects think it can?)

The implementations above are all very well, but they feel ever so simplistic. With ElementAt, we were able to take advantage of the fact that an IList<T> allows us random access by index. Surely we’ve got similar collections which allow us to test for containment cheaply?

Well, yes and no. We’ve got IDictionary<TKey, TValue> which allows you to check for the presence of a particular key – but even it would be hard to test whether the sequence we’re looking at is the key sequence for some IDictionary<TSource, TValue>, and somehow get back to the dictionary.

ICollection<T> has a Contains method, but that doesn’t necessarily do the right thing. This is particularly troubling, as the MSDN documentation for the comparer-less overload has contradictory information:

(Summary)

Determines whether a sequence contains a specified element by using the default equality comparer.

(Remarks)

If the type of source implements ICollection<T>, the Contains method in that implementation is invoked to obtain the result. Otherwise, this method determines whether source contains the specified element.

Enumeration is terminated as soon as a matching element is found.

Elements are compared to the specified value by using the default equality comparer, Default.

Why is this troubling? Well, let’s look at a test:

[Test]
public void SetWithDifferentComparer()
{
    HashSet<string> sourceAsSet = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "foo", "bar", "baz" };
    IEnumerable<string> sourceAsSequence = sourceAsSet;
    Assert.IsTrue(sourceAsSet.Contains("BAR"));
    Assert.IsFalse(sourceAsSequence.Contains("BAR"));
    Assert.IsFalse(sourceAsSequence.Contains("BAR", StringComparer.Ordinal));
}

(This exact code won’t build in the Edulinq project configuration, as that doesn’t have a reference to the System.Core assembly which contains HashSet<T>. I’ve got a hack which allows me to run effectively this code though. See the source for details.)

Now this test looks correct to me: while we’re regarding the set as a set, it should use the set’s comparer and find "BAR" with a case-insensitive match. However, when we use it as a sequence in LINQ, it should obey the rules of Enumerable.Contains – which means that the middle call should use the default equality comparer for string. Under that equality comparer, "BAR" isn’t present.

It doesn’t: the above test fails on that middle call in LINQ to Objects, because HashSet<T> implements ICollection<T>. To fit in with the implementation, the documentation summary should actually be worded as something like:

"Determines whether a sequence contains a specified element by using the default equality comparer if the sequence doesn’t implement ICollection<T>, or whatever equality comparison the collection uses if it does implement ICollection<T>."

Now you may be saying to yourself that this is only like relying on IList<T> to fetch an item by index in a fashion consistent with iterating over with it – but I’d argue that any IList<T> implementation which didn’t do that was simply broken… whereas ICollection<T>.Contains is specifically documented to allow custom comparisons:

Implementations can vary in how they determine equality of objects; for example, List<T> uses Comparer<T>.Default, whereas Dictionary<TKey, TValue> allows the user to specify the IComparer<T> implementation to use for comparing keys.

Let’s leave aside the fact that those "Comparer<T>" and "IComparer<T>" should be "EqualityComparer<T>" and "IEqualityComparer<T>" respectively for the minute, and just note that it’s entirely reasonable for an implementation not to use the default equality comparer. That makes sense – but I believe it also makes sense for source.Contains(value) to be more predictable in terms of the equality comparer it uses.

Now I would certainly agree that having a method call which changes semantics based on whether the compile-time type of the source is IEnumerable<T> or ICollection<T> is undesirable too… but I’m not sure there is any particularly nice solution. The options are:

The current LINQ to Objects implementation where the comparer used is hard to predict.
The Edulinq implementation where the type’s default comparer is always used… if the compile-time type means that Enumerable.Contains is used in the first place.
Remove the comparer-less overload entirely, and force people to specify one. This is lousy for convenience.

Note that you might expect the overload which takes a comparer to work the same way if you pass in null as the comparer – but it doesn’t. That overload never delegates to ICollection<T>.Contains.

So: convenience, predictability, consistency. Pick any two. Isn’t API design fun? This isn’t even thinking about performance, of course…

It’s worth bearing in mind that even the current behaviour which is presumably meant to encourage consistency doesn’t work. One might expect that the following would always be equivalent for any sensible collection:

var first = source.Contains(value);
var second = source.Select(x => x).Contains(value);

… but of course the second line will always use EqualityComparer<T>.Default whereas the first may or may not.

(Just for fun, think about Dictionary<TKey, TValue> which implements ICollection<KeyValuePair<TKey, TValue>>; its explicitly-implemented ICollection<T>.Contains method will use its own equality comparer for the key, but the default equality comparer for the value part of the pair. Yay!)

So can we really not optimize?

I can think of exactly one situation which we could legitimately optimize without making the behaviour hard to predict. Basically we’re fine to ask the collection to do our work for us if we can guarantee it will use the right comparer. Ironically, List<T>.Contains has an overload which allows us to specify the equality comparer, so we could delegate to that – but it’s not going to be significantly faster than doing it ourselves. It’s still got to look through everything.

ISet<T> in .NET 4 doesn’t help us much – its API doesn’t talk about equality comparers. (This makes a certain amount of sense – consider SortedSet<T> which uses IComparer<T> instead of IEqualityComparer<T>. It wouldn’t make sense to ask a SortedSet<T> for an equality comparer – it couldn’t give you one, as it wouldn’t know how to produce a hash code.)

However, HashSet<T> does give us something to work with. You can ask a HashSet<T> which equality comparer it uses, so we could delegate to its implementation if and only if it would use the one we’re interested in. We can bolt that into our existing implementation pretty easily, after we’ve worked out the comparer to use:

HashSet<TSource> hashSet = source as HashSet<TSource>;
if (hashSet != null && comparer.Equals(hashSet.Comparer))
{
return hashSet.Contains(value);
}

So is this worth including or not?

Pros:

It covers one of the biggest use cases for optimizing Contains; I suspect this is used more often than the LINQ implementation of Contains working over a dictionary.
So long as the comparer doesn’t override Equals in a bizarre way, it should be a true optimization with no difference in behaviour.
The optimization is applied for both overloads of Enumerable.Contains, not just the comparer-less one.

Cons:

It’s specific to HashSet<T> rather than an interface type. That makes it feel a little too specific to be a good target of optimization.
We’ve still got the issue of consistency in terms of sourceAsSet.Contains(value) vs sourceAsSequence.Contains(value)
There’s a tiny bit of overhead if the source isn’t a hash set, and a further overhead if it is a hash set but with the wrong comparer. I’m not too bothered about this.

It’s not the default implementation in Edulinq at the moment, but I could possibly be persuaded to include it. Likewise I have a conditionally-compiled version of Contains which is compatible with LINQ to Objects, with the "broken" optimization for the comparer-less overload; this is turned off by default too.

Conclusion

Gosh! I hadn’t expected Contains to be nearly this interesting. I’d worked out that optimization would be a pain, but I hadn’t expected it to be such a weird design choice.

This is the first time I’ve deliberately gone against the LINQ to Objects behaviour, other than the MS bug around descending orderings using "extreme" comparers. The option for compatibility is there, but I feel fairly strongly that this was a bad design decision on Microsoft’s part. A bad decision out of some fairly unpleasant alternatives, I grant you. I’m willing to be persuaded of its virtues, of course – and in particular I’d welcome discussion with the LINQ team around this. In particular, it’s always fun to hear about the history of design decisions.

Next up, Cast and OfType.

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 31 – ElementAt / ElementAtOrDefault

January 11, 2011 jonskeet 11 Comments

A nice easy pair of operators tonight. I should possibly have covered them at the same time as First/Last/Single and the OrDefault variants, but never mind…

What are they?

ElementAt and ElementAtOrDefault have a single overload each:

public static TSource ElementAt<TSource>(
this IEnumerable<TSource> source,
int index)

public static TSource ElementAtOrDefault<TSource>(
this IEnumerable<TSource> source,
int index)

Isn’t that blissfully simple after the overload storm of the past few days?

The two operators work in very similar ways:

They use immediate execution.
The source parameter must not be null, and this is validated immediately.
They return the element at the specified zero-based index, if it’s in the range 0 <= index < count.

The methods only differ in their handling of an index which falls outside the given bound. ElementAt will throw an ArgumentOutOfRangeException; ElementAtOrDefault will return the default value for TSource (e.g. 0, null, false). This is true even if index is negative. You might have expected some way to specify the default value to return if the index is out of bounds, but there isn’t one. (This is consistent with FirstOrDefault() and so on, but not with Nullable<T>.GetValueOrDefault())

This behaviour leaves us some room for common code – for once I haven’t used cut and paste for the implementation. Anyway, I’m getting ahead of myself.

What are we going to test?

As you can imagine, my tests for the two operators are identical except for the expected result in the case of the index being out of range. I’ve tested the following cases:

Null source
A negative index
An index which is too big on a NonEnumerableCollection
An index which is too big on a NonEnumerableList
An index which is too big on a lazy sequence (using Enumerable.Range)
A valid index in a NonEnumerableList
A valid index in a lazy sequence

The "non-enumerable" list and collection are to test that the optimizations we’re going to perform are working. In fact, the NonEnumerableCollection test fails on LINQ to Objects – it’s only optimized for IList<T>. You’ll see what I mean in a minute… and why that might not be a bad thing.

None of the tests are very interesting, to be honest.

Let’s implement it!

As I mentioned earlier, I’ve used some common code for once (although I admit the first implementation used cut and paste). As the only difference between the two methods is the handling of a particular kind of failure, I’ve used the TryXXX pattern which exists elsewhere in the framework. There’s a common method which tries to retrieve the right element as an out parameter, and indicates whether or not it succeeded via the return value. Not every kind of failure is just returned, of course – we want to throw an ArgumentNullException if source is null in either case.

That leaves our public methods looking quite straightforward:

public static TSource ElementAt<TSource>(
    this IEnumerable<TSource> source,
    int index)
{
    TSource ret;
    if (!TryElementAt(source, index, out ret))
    {
        throw new ArgumentOutOfRangeException("index");
    }
    return ret;
}

public static TSource ElementAtOrDefault<TSource>(
    this IEnumerable<TSource> source,
    int index)
{
    TSource ret;
    // We don’t care about the return value – ret will be default(TSource) if it’s false
    TryElementAt(source, index, out ret);
    return ret;
}

TryElementAt will only return false if the index is out of bounds, so the exception is always appropriate. However, there is a disadvantage to this approach: we can’t easily indicate in the exception message whether index was too large or negative. We could have specified a message which included the value of index itself, of course. I think it’s a minor matter either way, to be honest.

The main body of the code is in TryElementAt, obviously. It would actually be very simple – just looping and counting up to index, checking as we went – except there are two potential optimizations.

The most obvious – and most profitable – optimization is if the collection implements IList<T>. If it does, we can efficiently obtain the count using the ICollection<T>.Count property (don’t forget that IList<T> extends ICollection<T>), check that it’s not too big, and then use the indexer from IList<T> to get straight to the right element. Brilliant! That’s a clear win.

The less clear optimization is if the collection implements ICollection<T> but not IList<T>, or if it only implements the nongeneric ICollection. In those cases we can still get at the count – but we can’t then get directly to the right element. In other words, we can optimize the failure case (possibly hugely), but at a very slight cost – the cost of checking whether the sequence implements either interface – for the success case, where the check won’t do us any good.

This is the sort of optimization which is impossible to judge without real data. How often are these operators called with an invalid index? How often does that happen on a collection which implements ICollection<T> but not IList<T> (or implements ICollection)? How large are those collections (so how long would it take to have found our error the normal way)? What’s the cost of performing the type check? I don’t have the answers to any of these questions. I don’t even have strong suspicions. I know that Microsoft doesn’t use the same optimization, but I don’t know whether that was due to hard data or a gut feeling.

For the moment, I’ve kept all the optimizations. Here’s the code:

private static bool TryElementAt<TSource>(
    IEnumerable<TSource> source,
    int index,
    out TSource element)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    element = default(TSource);
    if (index < 0)
    {
        return false;
    }
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
    {
        int count = collection.Count;
        if (index >= count)
        {
            return false;
        }
        // If it’s a list, we know we’re okay now – just return directly…
        IList<TSource> list = source as IList<TSource>;
        if (list != null)
        {
            element = list[index];
            return true;
        }
    }

    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        int count = nonGenericCollection.Count;
        if (index >= count)
        {
            return false;
        }
    }
    // We don’t need to fetch the current value each time – get to the right
    // place first.
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        // Note use of -1 so that we start off my moving onto element 0.
        // Don’t want to use i <= index in case index == int.MaxValue!
        for (int i = -1; i < index; i++)
        {
            if (!iterator.MoveNext())
            {
                return false;
            }
        }
        element = iterator.Current;
        return true;
    }
}

As you can see, the optimized cases actually form the bulk of the code – part of me thinks it would be worth removing the non-IList<T> optimizations just for clarity and brevity.

It’s worth looking at the "slow" case where we actually iterate. The for loop looks odd, until you think that to get at element 0, you have to call MoveNext() once. We don’t want to just add one to index or use a less-than-or-equal condition: both of those would fail in the case where index is int.MaxValue; we’d either not loop at all (by incrementing index and it overflowing either causing an exception or becoming negative) or we’d loop forever, as every int is less than or equal to int.MaxValue.

Another way to look at it is that the loop counter ("i") is the "current index" within the iterator: the iterator starts before the first element, so it’s reasonable to start at -1.

The reason I’m drawing attention to this is that I got all of this wrong first time… and was very grateful for unit tests to catch me out.

Conclusion

For me, the most interesting part of ElementAt is the decision about optimization. I’m sure I’m not the only one who optimizes without data at times – but it’s a dangerous thing to do. The problem is that this isn’t the normal micro-optimization quandary of "it’s always a tiny bit better, but it’s probably insignificant and makes the code harder to read". For the cases where this is faster, it could make an enormous difference – asking for element one million of a linked list which doesn’t quite have enough elements could be very painful. But do failure cases need to be fast? How common are they? As you can tell, I’m dithering. I think it’s at least worth thinking about what optimizations might make a difference – even if we later remove them.

Next time, I think I’ll tackle Contains – an operator which you might expect to be really fast on a HashSet<T>, but which has some interesting problems of its own…

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 30 – Average

January 10, 2011 jonskeet 13 Comments

This is the final aggregation operator, after which I suspect we won’t need to worry about floating point difficulties any more. Between this and the unexpected behaviour of Comparer<string>.Default, I’ve covered two of my "big three" pain points. It’s hard to see how I could get dates and times into Edulinq naturally; it’s even harder to see how time zones could cause problems. I’ve still got a few operators to go though, so you never know…

What is it?

Average has 20 overloads, all like the following but for long, decimal, float and double as well as int:

public static double Average(this IEnumerable<int> source)

public static double Average<TSource>(
this IEnumerable<TSource> source,
Func<TSource, int> selector)

public static double? Average(this IEnumerable<int?> source)

public static double? Average<TSource>(
this IEnumerable<TSource> source,
Func<TSource, int?> selector)

The operators acting on float sequences return float, and likewise the operators acting on decimal sequences return decimal, with the same equivalent nullable types for the nullable sequences.

As before (for Min/Max/Sum), the overloads which take a selector are equivalent to just applying that selector to each element in the sequence.

General behaviour – pretty much as you’d expect, I suspect:

Each operator calculates the arithmetic mean of a sequence of values.
source and selector can’t be null, and are validated immediately.
The operators all use immediate execution.
The operators all iterate over the entire input sequence, unless an exception is thrown (e.g. due to overflow).
The operators with a non-nullable return type throw InvalidOperationException if the input sequence is empty.
The operators with a nullable return type ignore any null input values, and return null if the input sequence is empty or contains only null values. If non-null values are present, the return value will be non-null.

It all sounds pretty simple, doesn’t it? We just sum the numbers, and divide by the count. It’s not too complicated, but we have a couple of things to consider:

How should we count items – which data type should we use? Do we need to cope with more than int.MaxValue elements?
How should we sum items? Should we be able to find the average of { int.MaxValue, int.MaxValue } even though the sum clearly overflows the bounds of int?

Given the behaviour of my tests, I believe I’ve made the same decisions. I use a long for the counter, always. I use a long total for the int/long overloads, a double total for the float/double overloads, and a decimal total for the decimal overloads. These aren’t particularly tricky decisions once you’d realised that you need to make them, but it would be very easy to implement the operators in a simplistic way without thinking about such things. (I’d probably have done so if it weren’t for the comments around Sum this morning.)

What are we going to test?

I’ve only got in-depth tests for the int overloads, covering:

Argument validation
Empty sequences for nullable and non-nullable types
Sequences with only null values
Sequences of perfectly normal values :)
Projections for all the above
A sequence with just over int.MaxValue elements, to test we can count properly

Then I have a few extra tests for interesting situations. First I check the overflow behaviour of each type, using a common pattern of averaging a sequence of (max, max, -max, -max) where "max" is the maximum value for the sequence type. The results are:

For int we get the correct result of 0 because we’re accumulating over longs
For long we get an OverflowException when it tries to add the first two values together
For float we get the correct result of 0 because we’re accumulating over doubles
For double we get PositiveInfinity because that’s the result of the first addition
For decimal we get an OverflowException when it tries to add the first two values together

Additionally, I have a couple of floating-point-specific tests: namely further proof that we use a double accumulator when averaging floats, and the behaviour of Average in the presence of NaN values:

[Test]
public void SingleUsesDoubleAccumulator()
{
    // All the values in the array are exactly representable as floats,
    // as is the correct average… but intermediate totals aren’t.
    float[] array = { 20000000f, 1f, 1f, 2f };
    Assert.AreEqual(5000001f, array.Average());
}

[Test]
public void SequenceContainingNan()
{
double[] array = { 1, 2, 3, double.NaN, 4, 5, 6 };
Assert.IsNaN(array.Average());
}

I’m sure someone can think of some other interesting scenarios I should be considering :)

Let’s implement it!

This is another cut-and-paste job, but with more editing required – for each method, I needed to make sure I was using the right accumulator type, and I occasionally removed redundant casts. Still, the code follows pretty much the same pattern for all types. Here’s the int implementation:

public static double Average(this IEnumerable<int> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    checked
    {
        long count = 0;
        long total = 0;
        foreach (int item in source)
        {
            total += item;
            count++;
        }
        if (count == 0)
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        return (double)total / (double)count;
    }
}

public static double Average<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int> selector)
{
    return source.Select(selector).Average();
}

public static double? Average(this IEnumerable<int?> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    checked
    {
        long count = 0;
        long total = 0;
        foreach (int? item in source)
        {
            if (item != null)
            {
                count++;
                total += item.Value;
            }
        }
        return count == 0 ? (double?)null : (double)total / (double)count;
    }
}

public static double? Average<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int?> selector)
{
    return source.Select(selector).Average();
}

Salient points:

Again I’m using Select to make the implementation of the overloads with selectors trivial
I’ve cast both operands of the division when calculating the average, just for clarity. We could get away with either of them.
In the case of the conditional operator, I could actually just cast one of the division operators to "double?" and then remove both of the other casts… again, I feel this version is clearer. (I could change my mind tomorrow, mind you…)
I’ve explicitly used checked blocks for int and long. For float and double we won’t get overflow anyway, and for decimal the checked/unchecked context is irrelevant.

There’s one optimization we can perform here. Consider this loop, for the nullable sequence:

long count = 0;
long total = 0;
foreach (int? item in source)
{
    if (item != null)
    {
        count++;
        total += item.Value; // This line can be optimized…
    }
}

The line I’ve highlighted seems perfectly reasonable, right? We’re trying to add the "real" non-null value within a value type value, and we know that there is a real value, because we’ve checked it’s not the null value already.

Now think about what the Value property actually does… it checks whether or not it’s the null value, and then returns the real value or throws an exception. But we know it won’t throw an exception, because we’ve checked it. We just want to get at the value – don’t bother with any more checks. That’s exactly what GetValueOrDefault() does. In the case where the value is non-null, GetValueOrDefault() and the Value property obviously do the same thing – but intuition tells me that GetValueOrDefault() can do it quicker, because it doesn’t actually need to check anything. It can just return the value of the underlying field – which will be the default value of the underlying type for a null wrapper value anyway.

I’ve benchmarked this, and on my laptop it’s about 5% faster than using Value. But… it’s such a grotty hack. I would feel dirty putting it in. Surely Value is the more readable code here – it just happens to be slower. As always, I’m undecided. There’s no behavioural difference, just a slight speed boost. Thoughts, folks?

Conclusion

I’m quite pleased to be shot of the Aggregate Operators Of Overload Doom. I’ve felt for a while that they’ve been hanging over me – I knew they’d be annoying in terms of cut and paste, but there’s been more interesting situations to consider than I’d expected.

There’s not a lot left now. According to my previous list, I’ve got:

Cast and OfType
ElementAt and ElementAtOrDefault
SequenceEqual
Zip (from .NET 4)
Contains

However, that doesn’t include AsEnumerable and AsQueryable. I’m unsure at the moment what I’m doing with those… AsEnumerable is trivial, and probably worth doing… AsQueryable could prove interesting in terms of testing, as it requires expression trees (which are in System.Core; a library I’m not referencing from tests when testing the Edulinq implementation). I’ll play around and see what happens :)

Not sure what I’ll implement next, to be honest… we’ll see tomorrow!

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 29 – Min/Max

January 9, 2011 jonskeet 7 Comments

The second and third AOOOD operators today… if I’m brave enough to tackle Average tomorrow, I’ll have done them all. More surprises here today, this time in terms of documentation…

What are they?

Min and Max are both extension methods with 22 overloads each. Min looks like this:

public static int Min(this IEnumerable<int> source)

public static int Min<TSource>(
this IEnumerable<TSource> source,
Func<TSource, int> selector)

public static int? Min(this IEnumerable<int?> source)

public static int? Min<TSource>(
this IEnumerable<TSource> source,
Func<TSource, int?> selector)

// Repeat the above four overloads for long, float, double and decimal,
// then add two more generic ones:

public static TSource Min<TSource>(this IEnumerable<TSource> source)

public static TResult Min<TSource, TResult>(
this IEnumerable<TSource> source,
Func<TSource, TResult> selector
)

(Max is exactly the same as Min; just replace the name.)

The more obvious aspects of the behaviour are as follows:

source and selector mustn’t be null
All overloads use immediate execution
The minimum or maximum value within the sequence is returned
If a selector is present, it is applied to each value within source, and the maximum of the projected values is returned. (Note how the return type of these methods is TResult, not TSource.)

Some less obvious aspects – in all cases referring to the result type (as the source type is somewhat incidental when a selector is present; it doesn’t affect the behaviour):

The type’s IComparable<T> implementation is used when available, otherwise IComparable is used. An ArgumentException is thrown if values can’t be compared. Fortunately, this is exactly the behaviour of Comparer<T>.Default.
For any nullable type (whether it’s a reference type or a nullable value type), nulls within the sequence are ignored, and an empty sequence (or one which contains only null values) will cause a null value to be returned. If there are any non-null values in the sequence, the return value will be non-null. (Note that this is different from Sum, which will return the non-null zero value for empty sequences over nullable types.)
For any non-nullable value type, an empty sequence will cause InvalidOperationException to be thrown.

The first point is particularly interesting when you consider the double and float types, and their "NaN" (not-a-number) values. For example, Math.Max regards NaN as greater than positive infinity, but Enumerable.Max regards positive infinity as being the greater of the two. Math.Min and Enumerable.Min agree, however, that NaN is less than negative infinity. (It would actually make sense to me for NaN to be treated as the numeric equivalent of null here, but that would be strange in other ways…) Basically, NaN behaves oddly in all kinds of ways. I believe that IEEE-754-2008 actually specifies behaviour with NaNs which encourages the results we’re getting here, but I haven’t verified that yet. (I can’t find a free version of the standard online, which is troubling in itself. Ah well.)

The behaviour of the nullable and non-nullable types is well documented for the type-specific overloads using int, Nullable<int> etc. However, the generic overloads (the ones using TSource) are poorly documented:

InvalidOperationException isn’t in the list of possibly-thrown arguments for any of the overloads
The methods using selectors from TSource to TResult don’t mention the possibility of nullity at all
The methods without selectors describe the behaviour of null values for reference types, but don’t mention the possibility of empty sequences for non-nullable value types, or consider nullable value types at all.

(I should point out that ArgumentException isn’t actually mentioned either for the case where values are incomparable, but that feels like a slightly less important offence for some reason. Possibly just because it didn’t trip me up.)

If I remember, I’ll open a Connect issue against this hole in the documentation when I find time. Unlike the optimizations and set ordering (where it’s reasonably forgivable to deliberately omit implementation details from the contract) you simply can’t predict the behaviour in a useful way from the documentation here. And yes, I’m going on about this because it bit me. I had to resort to writing tests and running them against LINQ to Objects to see if they were correct or not. (They were incorrect in various places.)

If you look at the behaviour of the non-generic methods, the generic ones are entirely consistent of course.

There are a couple of things which you might consider "missing" in terms of Max and Min:

The ability to find out the minimum/maximum value of a sequence by a projection. For example, consider a sequence of people. We may wish to find the youngest person in the sequence, in which case we’d like to be able to write something like:
var oldest = people.MaxBy(person => person.Age);

We can find the maximum age itself easily enough – but then we’d need a second pass to find the first person with that age. I’ve addressed this in MoreLINQ with the MaxBy and MinBy operators. The System.Interactive assembly in Reactive Extensions has the same methods too.
The ability to specify a custom IComparer<T> implementation, as we can in most of the operators using IEqualityComparer<T>. For example, we can’t find the "maximum" string in a sequence, using a case-insensitive ordinal comparison.

Still, at least that means there’s less to test…

What are we going to test?

I decided I really couldn’t find the energy to replicate all the tests for every type involved here. Instead, I have a bunch of tests for int and Nullable<int>, a few tests exploring the oddness of doubles, and a bunch of tests around the generic methods. In particular, I know that I’ve implemented decimal, float etc by calling the same methods that the int overloads use.

The tests cover:

Argument validation
Empty sequences
Sequences of null values where applicable
Projections of the above
Generic tests for nullable and non-nullable value types, and reference types (with empty sequences etc)
Incomparable values

Let’s implement them!

Okay, let’s start off with the simplest detail: the order of implementation:

Max(int)
Max(generic)
Cut and paste Max implementations for other numeric types (replace the type name, basically)
Cut and paste the entirety of Max to Min:
- Replace "Max" with "Min" everywhere
- Replace " < " with " > " everywhere (only 4 occurrences; basically the results of calling Compare or ComparerTo and comparing with 0)

Just as with Sum, I could have used templating – but I don’t think it would actually have saved me significant time.

This time, I thought I’d use Select internally for the overloads with selectors (unlike my approach for Sum which used identity projections). There’s no particular reason for this – I just thought it would be interesting to try both approaches. Overall, I think I prefer this one, but I haven’t done any benchmarking to find out the relative performance penalties.

Each set of numeric overloads calls into a single pair of generic "implementation" methods. These aren’t the public general-purpose ones: they require that the types in use implement IComparable<T>, and I’ve added a "struct" constraint just for kicks. This is just one approach. Other options:

I could have implemented the code separately for each numeric type. That may well be faster than calling IComparable<T>.Compare (at least for most types) as the IL would have contained the appropriate operator directly. However, it would have meant more code and explicitly dealing with the headache of NaNs for double/float. If I ever write benchmarks, I’ll investigate the difference that this can make.
I could have used the public generic overloads, which eventually call into Comparer<T>.Default. Again, the penalty for this (if any) is unknown to me at this point. Can the JIT inline deeply enough to make this as fast as a "native" implementation? I wouldn’t like to guess without tests.

I’ve separated out the nullable implementations from the non-nullable ones, as the behaviour differs significantly between the two.

Here’s the public code for int:

public static int Max(this IEnumerable<int> source)
{
return PrimitiveMax(source);
}

public static int Max<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int> selector)
{
    // Select will validate the arguments
    return PrimitiveMax(source.Select(selector));
}

public static int? Max(this IEnumerable<int?> source)
{
return NullablePrimitiveMax(source);
}

public static int? Max<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int?> selector)
{
    // Select will validate the arguments
    return NullablePrimitiveMax(source.Select(selector));
}

All the methods consider argument validation to be somebody else’s problem – either Select or the generic method we’re calling to find the maximum value. Part of me thinks this is lazy; part of me likes it in terms of not repeating code. All of me would prefer the ability to specify non-nullable parameters declaratively…

Here are the "primitive" methods called into above:

// These are uses by all the overloads which use a known numeric type.
// The term "primitive" isn’t truly accurate here as decimal is not a primitive
// type, but it captures the aim reasonably well.
// The constraint of being a value type isn’t really required, because we don’t rely on
// it within the method and only code which already knows it’s a comparable value type
// will call these methods anyway.

private static T PrimitiveMax<T>(IEnumerable<T> source) where T : struct, IComparable<T>
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    using (IEnumerator<T> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        T max = iterator.Current;
        while (iterator.MoveNext())
        {
            T item = iterator.Current;
            if (max.CompareTo(item) < 0)
            {
                max = item;
            }
        }
        return max;
    }
}

private static T? NullablePrimitiveMax<T>(IEnumerable<T?> source) where T : struct, IComparable<T>
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    T? max = null;
    foreach (T? item in source)
    {
        if (item != null &&
            (max == null || max.Value.CompareTo(item.Value) < 0))
        {
            max = item;
        }
    }
    return max;
}

The first method is interesting in terms of its approach to throwing an exception if the first element isn’t present, and using that as an initial candidate otherwise.

The second method needs to consider nullity twice on each iteration:

Is the item from the sequence null? If so, we can ignore it.
Is our "current maximum" null? If so, we can replace it with the item from the sequence without performing a comparison.

Now there’s one case which is ambiguous here: when both values are null. At that point we can choose to replace our "current maximum" with the item, or not… it doesn’t matter as the values are the same anyway. It is important that we don’t try to perform a comparison unless both values are non-null though… the short-circuiting && and || operators keep us safe here.

Having implemented the code above, all the interesting work lies in the generic forms. Here we don’t have different public methods to determine which kind of behaviour we’ll use: but I wrote two private methods instead, and just delegated to the right one from the public one. This seemed cleaner than putting the code all in one method:

public static TSource Max<TSource>(
    this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    // This condition will be true for reference types and nullable value types, and false for
    // non-nullable value types.
    return default(TSource) == null ? NullableGenericMax(source) : NonNullableGenericMax(source);
}

public static TResult Max<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)
{
    return Max(source.Select(selector));
}

/// <summary>
/// Implements the generic behaviour for non-nullable value types.
/// </summary>
/// <remarks>
/// Empty sequences will cause an InvalidOperationException to be thrown.
/// Note that there’s no *compile-time* validation in the caller that the type
/// is a non-nullable value type, hence the lack of a constraint on T.
/// </remarks>
private static T NonNullableGenericMax<T>(IEnumerable<T> source)
{
Comparer<T> comparer = Comparer<T>.Default;

    using (IEnumerator<T> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        T max = iterator.Current;
        while (iterator.MoveNext())
        {
            T item = iterator.Current;
            if (comparer.Compare(max, item) < 0)
            {
                max = item;
            }
        }
        return max;
    }
}

/// <summary>
/// Implements the generic behaviour for nullable types – both reference types and nullable
/// value types.
/// </summary>
/// <remarks>
/// Empty sequences and sequences comprising only of null values will cause the null value
/// to be returned. Any sequence containing non-null values will return a non-null value.
/// </remarks>
private static T NullableGenericMax<T>(IEnumerable<T> source)
{
Comparer<T> comparer = Comparer<T>.Default;

    T max = default(T);
    foreach (T item in source)
    {
        if (item != null &&
            (max == null || comparer.Compare(max, item) < 0))
        {
            max = item;
        }
    }
    return max;
}

As you can tell, there’s a significant similarity between the "PrimitiveMax" and "NonNullableGenericMax" methods, and likewise between "NullablePrimitiveMax" and "NullableGenericMax". This should come as no surprise. Fundamentally the difference is just between using an IComparable<T> implementation, and using Comparer<T>.Default. (The argument validation occurs in a different place too, as we’ll be going through a public entry point for the non-primitive code.)

Once I’d discovered the correct behaviour, this was reasonably simple. Of course, the above code wasn’t my first implementation, where I’d completely forgotten about null values, and hadn’t thought about how the nullability of the source type might affect the behaviour of empty sequences…

Conclusion

If you’re ever in a company which rewards you for checking in lots of lines of code, offer to implement Sum/Min/Max. This weekend I’ve checked in about 2,500 lines of code in (split between production and test) and none of it’s been terribly hard. Of course, if you’re ever in such a company you should also consider looking for another job. (Have I mentioned that Google’s hiring? Email me if you’re interested. I’m serious.)

As you can tell, I was slightly irritated by the lack of clarity around the documentation in some places – but I find it interesting that even a simple-sounding function like "find the maximum value from a sequence" should need the kind of documentation that’s missing here. I’m not saying it’s a failure of the design – more just musing how a complete specification is almost always going to be longer than you might think at first glance. And if you think I was diligent here, think again: I didn’t bother specifying which maximum or minimum value would be returned if there were two. For example, if a sequence consists of references to two equal but distinct strings, which reference should be returned? I have neither stated what my implementation (or the LINQ to Objects implementation) will do, nor tested for it.

Next up is Average – a single method with a mere 20 overloads. There are various corner cases to consider… but that’s a post for another day.

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 28 – Sum

January 8, 2011 jonskeet 10 Comments

Okay, I’ve bitten the bullet. The first of the four Aggregation Operators Of Overload Doom (AOOOD) that I’ve implemented is Sum. It was far from difficult to implement – just tedious.

What is it?

Sum has 20 overloads – a set of 4 for each of the types that it covers (int, long, float, double, decimal). Here are the overloads for int:

public static int Sum(this IEnumerable<int> source)

public static int? Sum(this IEnumerable<int?> source)

public static int Sum<T>(
this IEnumerable<T> source,
Func<T, int> selector)

public static int? Sum<T>(
this IEnumerable<T> source,
Func<T, int?> selector)

As you can see, there are basically two variations:

A source of the numeric type itself, or a source of an arbitrary type with a projection to the numeric type
The numeric type can be nullable or non-nullable

The behaviour is as follows:

All overloads use immediate execution: it will immediately iterate over the source sequence to compute the sum, which is obviously the return value.
source and selector must both be non-null
Where there’s a selector, the operator is equivalent to source.Select(selector).Sum() – or you can think of the versions without a selector as using an identity selector
Where the numeric type is nullable, null values are ignored
The sum of an empty sequence is 0 (even for nullable numeric types)

The last point is interesting – because the overloads with nullable numeric types never return a null value. Initially I missed the fact that the return type even was nullable. I think it’s somewhat misleading to be nullable but never null – you might have at least expected that the return value for an empty sequence (or one consisting only of null values) would be null.

For int, long and decimal, overflow within the sum will throw OverflowException. For single and double, the result will be positive or negative infinity. If the sequence contains a "not-a-number" value (NaN), the result will be NaN too.

What are we going to test?

A lot!

In total, I have 123 tests across the five types. The tests are mostly the same for each type, with the exception of overflow behaviour and not-a-number behaviour for single and double. Each overload is tested reasonably thoroughly:

Argument validation
Summing a simple sequence
Summing an empty sequence
(Nullable) Summing a simple sequence containing null values
(Nullable) Summing a sequence containing only null values
Positive overflow (either to an exception or infinity)
Negative overflow (only one test per type, rather than for each overload)
(Single/Double) Sequences containing NaN values
Projections resulting in all of the above

Most of this was done using cut and paste, leading to a 916-line source file. On Twitter, followers have suggested a couple of alternatives – templating (possibly for both the tests and the implementation), or using more advanced features of unit test frameworks. There’s nothing wrong with these suggestions – but I’m always concerned about the balance within an elegant-but-complex solution to repetition. If it takes longer to get to the "neat" solution, and then each individual test is harder to read, is it worth it? It certainly makes it easier to add one test which is then applicable over several types, or to modify all "copies" of an existing test – but equally it makes it harder to make variations (such as overflow) fit within the pattern. I have no quarrel with the idea of using more advanced techniques here, but I’ve stuck to a primitive approach for the moment.

Let’s implement it!

Again, I’ll only demonstrate the "int" implementations – but talk about single/double later on. There are plenty of ways I could have implemented this:

Delegate everything to the simplest overload using "Where" for the nullable sequences and "Select" for the projection sequences
Delegate everything to the most complex overload using identity projections
Implement each method independently
Somewhere in-between :)

In the end I’ve implemented each non-projecting overload by delegating to the corresponding projection-based one with an identity projection. I’ve implemented the non-nullable and nullable versions separately though. Here’s the complete implementation:

public static int Sum(this IEnumerable<int> source)
{
return Sum(source, x => x);
}

public static int? Sum(this IEnumerable<int?> source)
{
return Sum(source, x => x);
}

public static int Sum<T>(
    this IEnumerable<T> source,
    Func<T, int> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    checked
    {
        int sum = 0;
        foreach (T item in source)
        {
            sum += selector(item);
        }
        return sum;
    }
}

public static int? Sum<T>(
    this IEnumerable<T> source,
    Func<T, int?> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    checked
    {
        int sum = 0;
        foreach (T item in source)
        {
            sum += selector(item).GetValueOrDefault();
        }
        return sum;
    }
}

Note the use of Nullable<T>.GetValueOrDefault() to "ignore" null values – it felt easier to add zero than to use an "if" block here. I suspect it’s also more efficient, as there’s no need for any conditionality here: I’d expect the implementation of GetValueOrDefault() to just return the underlying "value" field within the Nullable<T>, without performing the check for HasValue which the Value property normally would.

Of course, if I were really bothered by performance I’d implement each operation separately, instead of using the identity projection.

Note the use of the "checked" block to make sure that overflow is handled appropriately. As I’ve mentioned before, it would quite possibly be a good idea to turn overflow checking on for the whole assembly, but here I feel it’s worth making it explicit to show that we consider overflow as an important part of the behaviour of this operator. The single/double overloads don’t use checked blocks, as their overflow behaviour isn’t affected by the checked context.

Conclusion

One down, three to go! I suspect Min and Max will use even more cutting and pasting (with judiciously applied changes, of course). There are 22 overloads for each of those operators, due to the possibility of using an arbitrary type – but I may well be able to use the most generic form to implement all the numeric versions. I may measure the impact this has on performance before deciding for sure. Anyway, that’s a topic for the next post…

Addendum

As has been pointed out in the comments, my original implementation used a float to accumulate values when summing a sequence of floats. This causes problems, as these two new tests demonstrate:

[Test]
public void NonOverflowOfComputableSumSingle()
{
    float[] source = { float.MaxValue, float.MaxValue,
                      –float.MaxValue, –float.MaxValue };
    // In a world where we summed using a float accumulator, the
    // result would be infinity.
    Assert.AreEqual(0f, source.Sum());
}

[Test]
public void AccumulatorAccuracyForSingle()
{
    // 20000000 and 20000004 are both exactly representable as
    // float values, but 20000001 is not. Therefore if we use
    // a float accumulator, we’ll end up with 20000000. However,
    // if we use a double accumulator, we’ll get the right value.
    float[] array = { 20000000f, 1f, 1f, 1f, 1f };
    Assert.AreEqual(20000004f, array.Sum());
}

The second of these tests is specific to floating point arithmetic – there’s no equivalent in the integer domain. Hopefully the comment makes the test clear. We could still do better if we used the Kahan summation algorithm, but I haven’t implemented that yet, and don’t currently intend to. Worth noting as a potential follow-on project though.

Back to the first test though: this certainly can be represented in integers. If we try to sum { int.MaxValue, int.MaxValue, -int.MaxValue, -int.MaxValue } there are two options: we can overflow (throwing an exception) or we can return 0. If we use a long accumulator, we’ll return 0. If we use an int accumulator, we’ll overflow. I genuinely didn’t know what the result for LINQ to Objects would be until I tried it – and found that it overflows. I’ve added a test to document this behaviour:

[Test]
public void OverflowOfComputableSumInt32()
{
    int[] source = { int.MaxValue, 1, -1, –int.MaxValue };
    // In a world where we summed using a long accumulator, the
    // result would be 0.
    Assert.Throws<OverflowException>(() => source.Sum());
}

Of course, I could have gone my own way and made Edulinq more capable than LINQ to Objects here, but in this case I’ve gone with the existing behaviour.