Reimplementing LINQ to Objects: Part 37 – Guiding principles

Now that I’m "done" reimplementing LINQ to Objects – in that I’ve implemented all the methods in System.Linq.Enumerable – I wanted to write a few posts looking at the bigger picture. I’m not 100% sure of what this will consist of yet; I want to avoid this blog series continuing forever. However, I’m confident it will contain (in no particular order):

  • This post: principles governing the behaviour of LINQ to Objects
  • Missing operators: what else I’d have liked to see in Enumerable
  • Optimization: where the .NET implementation could be further optimized, and why some obvious-sounding optimizations may be inappropriate
  • How query expression translations work, in brief (and with a cheat sheet)
  • The difference between IQueryable<T> and IEnumerable<T>
  • Sequence identity, the "Contains" issue, and other knotty design questions
  • Running the Edulinq tests against other implementations

If there are other areas you want me to cover, please let me know.

The principles behind the LINQ to Objects implementation

The design LINQ to Objects is built on a few guiding principles, both in terms of design and implementation details. You need to understand these, but also implementations should be clear about what they’re doing in these terms too.

Extension method targets and argument validation

IEnumerable<T> is the core sequence type, not just for LINQ but for .NET as a whole. Almost everything is written in terms of IEnumerable<T> at least as input, with the following exceptions:

  • Empty, Range and Repeat don’t have input sequences (these are the only non-extension methods)
  • OfType and Cast work on the non-generic IEnumerable type instead
  • ThenBy and ThenByDescending work on IOrderedEnumerable<T>

All operators other than AsEnumerable verify that any input sequence is non-null. This validation is performed eagerly (i.e. when the method is called) even if the operator uses deferred execution for the results. Any delegate used (typically a projection or predicate of some kind) must be non-null. Again, this validation is performed eagerly.

IEqualityComparer<T> is used for all custom equality comparisons. Any parameter of this type may be null, in which case the default equality comparer for the type is used. In most cases the default equality comparer for the type is also used when no custom equality comparer is used, but Contains has some odd behaviour around this. Equality comparers are expected to be able to handle null values. IComparer<T> is only used by the OrderBy/ThenBy operators and their descending counterparts – and only then if you want custom comparisons between keys. Again, a null IComparer<T> means "use the default for the type"

Timing of input sequence "opening"

Any operator with a return type of IEnumerable<T> or IOrderedEnumerable<T> uses deferred execution. This means that the method doesn’t read anything from any input sequences until someone starts reading from the result sequence. It’s not clearly defined exactly when input sequences will first be accessed – for some operators if may be when GetEnumerator() is called; for others it may be on the first call to MoveNext() on the resulting iterator. Callers should not depend on these slight variations. Deferred execution is common for operators in the middle of queries. Operators which use deferred execution effectively represent queries rather than the results of queries – so if you change the contents of the original source of the query and then iterate over the query itself again, you’ll see the change. For example:

List<string> source = new List<string>();
var query = source.Select(x => x.ToUpper());
        
// This loop won’t write anything out
foreach (string x in query)
{
    Console.WriteLine(x);
}
        
source.Add("foo");
source.Add("bar");

// This loop will write out "FOO" and "BAR" – even
// though we haven’t changed the value of "query"
foreach (string x in query)
{
    Console.WriteLine(x);
}

Deferred execution is one of the hardest parts of LINQ to understand, but once you do, everything becomes somewhat simpler.

All other operators use immediate execution, fetching all the data they need from the input before they return a value… so that by the time they do return, they will no longer see or care about changes to the input sequence. For operators returning a scalar value (such as Sum and Average) this is blatantly obvious – the value of a variable of type double isn’t going to change just because you’ve added something to a list. However, it’s slightly less for the "ToXXX" methods: ToLookup, ToArray, ToList and ToDictionary. These do not return views on the original sequence, unlike the "As" methods: AsEnumerable which we’ve seen, and Queryable.AsQueryable which I didn’t implement. Focus on the prefix part of the name: the "To" part indicates a conversion to a particular type. The "As" prefix indicates a wrapper of some kind. This is consistent with other parts of the framework, such as List<T>.AsReadOnly and Array.AsReadOnly<T>.

Very importantly, LINQ to Objects only iterates over any input sequence at most once, whether the execution is deferred or immediate. Some operators would be easier to implement if you could iterate over the input twice – but it’s important that they don’t do so. Of course if you provide the same sequence for two inputs, it will treat those as logically different sequences. Similarly if you iterate over a result sequence more than once (for operators that return IEnumerable<T> or a related interface, rather than List<T> or an array etc), that will iterate over the input sequence again.

This means it’s fine to use LINQ to Objects with sequences which may only be read once (such as a network stream), or which are relatively expensive to reread (imagine a log file reader over a huge set of logs) or which give inconsistent results (imagine a sequence of random numbers). In some cases it’s okay to use LINQ to Objects with an infinite sequence – in others it’s not. It’s usually fairly obvious which is the case.

Timing of input sequence reading, and memory usage

Where possible within deferred execution, operators act in a streaming fashion, only reading from the input sequence when they have to, and "forgetting" data as soon as they can. This allows for long – potentially infinite – sequences to be handled elegantly without memory running out.

Some operators naturally need to read all the data in before they can return anything. The most obvious example of this is Reverse, which will always yield the last element of the input stream as the first element in the result stream.

A third pattern occurs with operators such as Distinct, which yield data as they go, but accumulate elements too, taking more and more memory until the caller stops iterating (usually either by jumping out of the foreach loop, or letting it terminate naturally).

Where an operator takes two input sequences – such as Join – you need to understand the consumption of each one separately. For example, Join uses deferred execution, but as soon as you ask for the first element of the result set, it will read the "second" sequence completely and buffer it – whereas the "first" sequence is streamed. This isn’t the case for all operators with two inputs, of course – Zip streams both input sequences, for example. Check the documentation – and the relevant Edulinq blog post – for details.

Obviously any operator which uses immediate execution has to read all the data it’s interested in before it returns. This doesn’t necessarily mean they will read to the end of the sequence though, and they may not need to buffer the data they read. (Simple examples are ToList which has to keep everything, and Sum which doesn’t.)

Queries vs data

Closely related to the details of when the input is read is the concept of what the result of an operator actually represents. Operators which use deferred execution return queries: each time you iterate over the result sequence, the query will look at the input sequence again. The query itself doesn’t contain the data – it just knows how to get at the data.

Operators which use immediate execution work the other way round: they read all the data they need, and then forget about the input sequence. For operators like Average and Sum this is obvious as it’s just a simple scalar value – but for operators like ToList, ToDictionary, ToLookup and ToArray, it means that the operator has to make a copy of everything it needs. (This is potentially a shallow copy of course – depending on what user-defined projections are applied. The normal behaviour of mutable reference types is still valid.)

I realise that in many ways I’ve just said the same thing multiple times now – but hopefully that will help this crucial aspect of LINQ behaviour sink in, if you were still in any doubt.

Exception handling

I’m unaware of any situation in which LINQ to Objects will catch an exception. If your predicate or projection throws an exception, it will propagate in the obvious way.

However, LINQ to Objects does ensure that any iterator it reads from is disposed appropriately – assuming that the caller disposes of any result sequences properly, of course. Note that the foreach statement implicitly disposes of the iterator in a finally block.

Optimization

Various operators are optimized when they detect at execution time that the input sequence they’re working on offers a shortcut.

The types most commonly detected are:

  • ICollection<T> and ICollection for their Count property
  • IList<T> for its random access indexer

I’ll look at optimization in much more detail in a separate post.

Conclusion

This post has not been around the guiding principles behind LINQ itself – lambda calculus or anything like that. It’s more been a summary of the various aspects of behaviour we’ve seen across the various operators we’ve implemented. They’re the rules I’ve had to follow in order to make Edulinq reasonably consistent with LINQ to Objects.

Next time I’ll talk about some of the operators which I think should have made it into the core framework, at least for LINQ to Objects.

5 thoughts on “Reimplementing LINQ to Objects: Part 37 – Guiding principles”

  1. Jon,
    An excellent post, and an excellent series of posts. I’m finding them invaluable.

    Thanks mate for sharing your thoughts and insights.

    Regards
    Mark (aka Binary Worrier)

    Like

  2. I wouldn’t have a problem if this blog series continued forever, it’s very interesting in many aspects: performance, unit tests, etc. It also helped me understand some LINQ operators better (like Aggregate for example). Thanks and keep up the good work!

    Like

  3. “The normal behaviour of mutable reference types is still valid.”

    I love that sentence. You’ve expressed so much here, and so clearly, with a very succinct line.

    Operators that should have been in Linq? There’s one I’ve always wanted, and in fact implemented twice: sequence.Cache() returns a sequence that can be read multiple times; the source is only read as far as needed, but the results are cached and the source isn’t read more than once (and would continue advancing if reread past the point where the first read stopped). This is actually harder to implement than all the existing ones because we have a problem with disposabilty. It would need to be disposable so that the underlying enumeration is disposed. An IDisposableEnumerable works well though.

    Like

  4. This blog series has been really great so far, please keep it going ;)

    “If your predicate or projection throws an exception, it will propagate in the obvious way”

    It’s probably obvious to people who followed the series, but perhaps not for everyone else… You should probably make it explicit that the exception will be thrown during the call to MoveNext, while the query is enumerated.

    Like

Leave a comment