Category Archives: Edulinq

Edulinq – the e-book

I’m pleased to announce that I’ve made a first pass at converting the blog posts in the Edulinq series into e-books.

I’m using Calibre to convert to PDF and e-book format. I still have a way to go, but they’re at least readable. The Kindle version (MOBI format) is working somewhat better than the PDF version at the moment, which surprises me. In particular, although hyperlinks are displaying in the PDF, they don’t seem to be working – whereas at least the internal links in the Kindle format are working.

I’ll no doubt try to improve things over time, but I’ve put these early attempts up on the root page of the Edulinq project site. I’ve also put all the blog posts up as HTML in the site’s source control; that means you can browse the latest version directly. It also means if you sync the source control, download Calibre yourself and add “index.html” as a new e-book in the Calibre library, you can play with the conversion yourself and help me improve things. Feedback about problems is welcome; feedback including the fix is even better :)

Reimplementing LINQ to Objects: Part 45 – Conclusion and List of Posts

Table of Contents

You may consider it a little odd to have a list of posts as the final part in the series, but it makes sense when you consider that visiting the Edulinq tag page shows results in reverse chronological order. At that point, a newcomer will hopefully hit this post first, and then find it easier to navigate to the first post. Anyway…

  1. Introduction
  2. Where
  3. Select
  4. Range
  5. Empty
  6. Repeat
  7. Count and LongCount
  8. Concat
  9. SelectMany
  10. Any and All
  11. First, Single, Last and the …OrDefault versions
  12. DefaultIfEmpty
  13. Aggregate
  14. Distinct
  15. Union
  16. Intersect
  17. Except
  18. ToLookup
  19. Join
  20. ToList
  21. GroupBy
  22. GroupJoin
  23. Take, Skip, TakeWhile, SkipWhile
  24. ToArray
  25. ToDictionary
  26. Ordering:
  27. Reverse
  28. Sum
  29. Min and Max
  30. Average
  31. ElementAt and ElementAtOrDefault
  32. Contains
  33. Cast and OfType
  34. SequenceEqual
  35. Zip
  36. AsEnumerable
  37. Guiding principles
  38. What’s missing?
  39. Comparing implementations
  40. Optimization
  41. How query expressions work
  42. More optimization
  43. Out-of-process queries with IQueryable
  44. Aspects of design
  45. Conclusion and List of Posts

Thank you, and good night

When I started this series, I hadn’t realised quite how much there would be to write about. The main thrust was going to be that the implementation of LINQ is simple, and it’s the design that’s clever. As it happened, pretty much every operator ended up raising some interesting issue or other. However, hopefully the series has still "immersed" you in LINQ to Objects to some extent, and clarified how it all hangs together. It would be gratifying to think that the description at the start of each post may end up being used as a sort of "unofficial alternative documentation" with some more details than MSDN provides, but we’ll see whether than happens over time.

A number of people have asked me whether there’ll be an ebook version of this series, and the answer is currently "I don’t know." I have a few plans afoot, but I can’t tell where they’ll lead yet. Suffice to say I like the idea, and I’m looking at some options.

Anyway, thank you for reading as much of the series as you have, and I hope you’ve enjoyed it as much as I have.

Just in case you’re wondering, I’ll probably go back to posting about C# 5’s async support pretty soon…

Reimplementing LINQ to Objects: Part 44 – Aspects of Design

I promised a post on some questions of design that are raised by LINQ to Objects. I suspect that most of these have already been covered in other posts, but it may well be helpful to talk about them here too. This time I’ve thought about it particularly from the point of view of how other APIs can be built on some of the same design principles, and the awkward choices that LINQ has thrown up.

The power of composability and immutability

Perhaps the most important aspect of LINQ which I’d love other API designers to take on board is that of how complicated queries are constructed from lots of little building blocks. What makes it particularly elegant is that the result of applying each building block is unchanged by anything else you do to it afterwards.

LINQ doesn’t enforce immutability of course – you can start off with a mutable list and change its content at any time, for example, or change the properties of one of the objects referenced within it, or pass in a delegate with side-effects – but LINQ itself won’t introduce side-effects.

The Task-based Asynchronous Pattern takes a similar approach, allowing composable building blocks of tasks. I’ve seen this pattern in various guises over the years – if you find yourself thinking in terms of a pipeline of some kind, it may well be appropriate, especially if each state in the pipeline emits the same type as it consumes.

General immutability is a somewhat different design trait of course, but one which can make such a difference. The java.util.{Date,Calendar} classes are horrible, not least because they’re mutable – you can never stash a value away without being concerned that it may get changed by something else. Joda Time has some mutable implementations, but typically the immutable classes are used in a fluent way. Of course, .NET uses value types for various core types to start with, but also makes TimeZoneInfo immutable. For genuine "values" I would highly encourage API designers to at least strongly consider immutable types. They’re not always appropriate by any means, but they can be hugely useful where they fit nicely.

Extension methods on interfaces

It’s no surprise that extension methods are heavily used in LINQ, given that they were effectively introduced into the language in order to enable LINQ in the first place. However, they do work particularly well with interfaces as a way of adding common behaviour.

It also plays very nicely with the pipeline pattern above for creating pipelines in a fluent manner. Even if you just create extension methods which call a constructor to wrap/compose the previous stage in the pipeline, you can still end up with more readable code.

One problem with this is that you can’t "override" behaviour in particular implementations or interfaces which extend the original one – which is why Enumerable.ElementAt() has to detect that a sequence is actually a list, for example. If interfaces allowed method implementations, this wouldn’t be as much of a problem in the situation where you’re in control of the interface – I wouldn’t be at all surprised to find that as a feature of C#’s successor.

The lack of extension properties is also a bit of a handicap in some places, although not as many as one might expect at first glance. For example, even if we could have made Enumerable.Count() a property, would it have been a good idea to do so? Properties give a natural expectation of speed, and Count() is usually an O(n) operation.

Delegates for custom behaviour

In .NET 1.0 and 1.1, most developers used delegates for two purposes:

  • Handling events in UIs
  • Passing around behaviour to be executed in a different thread (either via Control.Invoke, or new Thread(ThreadStart), or ThreadPool.QueueUserWorkItem).

.NET 2.0 increased the range of uses of delegates somewhat, particularly with List.ConvertAll and the ability to create delegates relatively easily using anonymous methods.

However, LINQ really brought them into the mainstream. If you’re building an API which benefits from small pieces of custom behaviour, delegates can be a real boon. More complicated behaviour is still often best represented via an interface, and sometimes it’s worth having both interface and delegate representations, like Comparison<T> and IComparable<T>. It’s generally easy to convert between the two – especially if you use a method group conversion from an interface implementation’s method to the delegate type.

Laziness

One aspect of LINQ which is both a blessing and a curse is its laziness, both in terms of deferred execution (not reading from the input sequence at all until the result sequence is read) and in terms of streaming the data (only reading as much information from the input sequence as is required to answer the immediate needs of the caller).

This is great in various ways, particularly as it means you can build a complex query and use it multiple times, sometimes as a basis for other queries, knowing that it won’t actually do anything until you ask for real results. It also means that you can iterate over huge data sets, so long as you’re careful.

On the other hand, it leads to subtle issues over when code is actually executing, makes debugging harder to understand, makes it easier to accidentally change the values of captured variables between the point at which you create the query and the point at which you execute it, and basically messes with your head. This is probably the aspect of LINQ which confuses newbies more than any other.

I’m not saying it was the wrong decision for LINQ – but I would caution API designers to think carefully before introducing laziness, and to document it really thoroughly. Likewise if your API might end up returning a result which is "gradually evaluated" (streaming data etc), this should be made clear.

When names collide: options for consistency

Just in case you’ve forgotten, this is irritation with the meaning of source.Contains(element). In order to check whether a sequence contains an element or not, there has to be some idea of equality – for example, if you’re trying to find one string in a sequence of strings, are you trying to match in a case sensitive manner or not?

There’s an overload for Enumerable.Contains which allows you to specify the equality comparer to use, but the question is what should happen when you let the implementation pick the comparer.

For every other method in Enumerable, the default equality comparer for the sequence type – i.e. EqualityComparer<TSource>.Default – is picked. That sounds like source.Contains(element) should use the element type’s default comparer too, right? Well, in some cases that’s what will happen… but not if the source implements ICollection<T>, which has its own Contains method which doesn’t take an equality comparer. If that’s the case, LINQ to Objects delegates to the collection’s Contains method.

So, we have three kinds of consistency here:

  • Consistency of compile-time type: it would be nice if the behaviour of source.Contains(element) was the same whether "source" is of type IEnumerable<T> and ICollection<T>
  • Consistency of API: it would be nice if Contains behaved the same way as other methods which have overloads with and without equality comparers
  • Consistency of model: if you consider "source" to be just a sequence of elements, it shouldn’t affect the result (not just the speed) if the object actually implements ICollection<T>

I should point out that this will only be a problem if the collection uses a different notion of equality to the default equality comparer for the type. The most obvious example of this is if you have a HashSet<string> which uses a case-insensitive equality comparer. But it’s still a valid concern.

So what should the API designer do in this kind of case? Admittedly LINQ to Objects is already in a slightly unusual position as it’s based on an existing interface with known and very common interfaces extending that core one… it’s less likely to come up with other APIs. However, I think it might be enough of a smell to suggest that changing the name of the method to "ContainsElement" or something similar would be worthwhile. It’s unfortunate that "Contains" really is the obvious choice…

This issue raises another aspect of API decision I’ve considered in the past… if there’s a common way of doing something in the framework you’re building on top of, but you consider it to be broken, should you abide by that breakage for the sake of familiarity and consistency, or should you strive to be as "clean" as possible? I think it needs to be considered on a case-by-case basis, but I suspect I would usually come down on the side of cleanliness.

Documentation details

Almost all APIs are badly documented – it’s a fact of life, even with some of the best APIs I’ve worked with. I doubt that Noda Time will be a shining example either. However, at the risk of being hypocritical I’ll say that documentation is worth spending significant thought on. Not just the time taken to document your code – but the time taken to consider what you want to guarantee, what should be left unstated, and what should be explicitly left open.

For example, there’s no indication in the documentation of Cast that it will sometimes return the original source value, nor in its companion OfType method that that will never return the original source reference. This might be important to someone – why not state it? It’s possible to state the possibility without saying what cases it applies to of course, leaving some wiggle room in the future. You might consider some of the optimizations in the same way – when should an optimization be documented and when should it be implicit? Sometimes it can make a difference beyond just performance, even if only in "odd" situations (such as a predicate throwing an exception).

If you’re used to defensive coding with Code Contracts, it’s much the same type of decision – and again, it’s similar to deciding whether a method should return IEnumerable<T>, IList<T> or List<T>. There’s a balance between caller convenience, design cleanliness (where you only want to emphasize one interface aspect, even if it also happens to always return a particular type), and room for the implementation to change in the future.

Another example of considering the level of detail to document is when it comes to how input sequences are used in LINQ to Objects. What does it mean to say "this method uses deferred execution" exactly? If I call GetEnumerator() eagerly but defer the call to MoveNext(), is that still "deferred execution"? Should the documentation state when a sequence is buffered and when it’s streamed? Should it guarantee the order of the result sequence when the natural implementation makes that order easy to describe (e.g. for Distinct)? In this series I’ve tried to be as clear as possible about what actually happens – but that’s not to say that in some cases, the documentation wasn’t left deliberately ambiguous.

Conclusion

There are many other design considerations that I haven’t gone into here – particularly optimization, which I’ve already covered twice, probably saying everything I wanted to say here anyway.

I may add a few more bits to this post over time… but aside from that, I think I’m fundamentally done. I’ll write one more conclusion post, then declare Edulinq closed…

Reimplementing LINQ to Objects: Part 43 – Out-of-process queries with IQueryable

I’ve been putting off writing about this for a while now, mostly because it’s such a huge topic. I’m not going to try to give more than a brief introduction to it here – don’t expect to be able to whip up your own LINQ to SQL implementation afterwards – but it’s worth at least having an idea of what happens when you use something like LINQ to SQL, NHibernate or the Entity Framework.

Just as LINQ to Objects is primarily interested in IEnumerable<T> and the static Enumerable class, so out-of-process LINQ is primarily interested in IQueryable<T> and the static Queryable class… but before we get to them, we need to talk about expression trees.

Expression Trees

To put it in a nutshell, expression trees encapsulate logic in data instead of code. While you can introspect .NET code via MethodBase.GetMethodBody and then MethodBody.GetILAsByteArray, that’s not really a practical approach. The types in the System.Linq.Expressions define expressions in an easier-to-process manner. When expression trees were introduced in .NET 3.5, they were strictly for expressions, but the Dynamic Language Runtime uses expression trees to represent operations, and the range of logic represented had to expand accordingly, to include things like blocks.

While you certainly can build expression trees yourself (usually via the factory methods on the nongeneric Expression class), and it’s fun to do so at times, the most common way of creating them is to use the C# compiler’s support for them via lambda expressions. So far we’ve always seen a lambda expression being converted to a delegate, but it can also convert lambdas to instances of Expression<TDelegate>, where TDelegate is a delegate type which is compatible with the lambda expression. A concrete example will help here. The statement:

Expression<Func<int, int>> addOne = x => x + 1;

will be compiled into code which is effectively something like this:

var parameter = Expression.Parameter(typeof(int), "x");
var one = Expression.Constant(1, typeof(int));
var addition = Expression.Add(parameter, one);
var addOne = Expression.Lambda<Func<int, int>>(addition,  new ParameterExpression[] { parameter });

The compiler has some tricks up its sleeves which allow it to refer to methods, events and the like in a simpler way than we can from code, but largely you can regard the transformation as just a way of making life a lot simpler than if you had to build the expression trees yourself every time.

IQueryable, IQueryable<T> and IQueryProvider

Now that we’ve got the idea of being able to inspect logic relatively easily at execution time, let’s see how it applies to LINQ.

There are three interfaces to introduce, and it’s probably easiest to start with how they appear in a class diagram:

Most of the time, queries are represented using the generic IQueryable<T> interface, but this doesn’t actually add much over the nongeneric IQueryable interface it extends, other than also extending IEnumerable<T> – so you can iterate over the contents of an IQueryable<T> just as with any other sequence.

IQueryable contains the interesting bits, in the form of three properties: ElementType which indicates the type of the elements within the query (in other words, a dynamic form of the T from IQueryable<T>), Expression returns the expression tree for the query so far, and Provider returns the query provider which is responsible for creating new queries and executing the existing one. We won’t need to use the ElementType property ourselves, but we’ll need both the Provider and Expression properties.

The static Queryable class

We’re not going to implement any of the interfaces ourselves, but I’ve got a small sample program to demonstrate how they all work, imagining we were implementing most of Queryable ourselves. This static class contains extension methods for IQueryable<T> just as Enumerable does for IEnumerable<T>. Most of the query operators from LINQ to Objects appear in Queryable as well, but there are a few notable omissions, such as the To{Lookup, Array, List, Dictionary} methods. If you call one of those on an IQueryable<T>, the Enumerable implementations will be used instead. (IQueryable<T> extends IEnumerable<T>, so the extension methods in Enumerable are applicable to IQueryable<T> sequences as well.)

The big difference between the Queryable and Enumerable methods in terms of their declarations is in the parameters:

  • The "source" parameter in Queryable is always of type IQueryable<TSource> instead of IEnumerable<TSource>. (Other sequence parameters such as the sequence to concatenate for Queryable.Concat are expressed as IEnumerable<T>, interestingly enough. This allows you to express a SQL query using "local" data as well; the query methods work out whether the sequence is actually an IQueryable<T> and act accordingly.)
  • Any parameters which were delegates in Enumerable are expression trees in Queryable; so while the selector parameter in Enumerable.Select is of type Func<TSource, TResult>, the equivalent in Queryable.Select is of type Expression<Func<TSource, TResult>>

The big difference between the methods in terms of what they do is that whereas the Enumerable methods actually do the work (eventually – possibly after deferred execution of course), the Queryable methods themselves really don’t do any work: they just ask the query provider to build up a query indicating that they’ve been called.

Let’s have a look at Where for example. If we wanted to implement Queryable.Where, we would have to:

  • Perform argument checking
  • Get the "current" query’s Expression
  • Build a new expression representing a call to Queryable.Where using the current expression as the source, and the predicate expression as the predicate
  • Ask the current query’s provider to build a new IQueryable<T> based on that call expression, and return it.

It all sounds a bit recursive, I realize – the Where call needs to record that a Where call has happened… but that’s all. You may very well wonder where all the work is happening. We’ll come to that.

Now building a call expression is slightly tedious because you need to have the right MethodInfo – and as Where is overloaded, that means distinguishing between the two Where methods, which is easier said than done. I’ve actually used a LINQ query to find the right overload – the one where the predicate parameter Expression<Func<T, bool>> rather than Expression<Func<T, int, bool>>. In the .NET implementation, methods can use MethodBase.GetCurrentMethod() instead… although equally they could have created a bunch of static variables computed at class initialization time. We can’t use GetCurrentMethod() for experimentation purposes, because the query provider is likely to expect the exact correct method from System.Linq.Queryable in the System.Core assembly.

Here’s our sample implementation, broken up quite a lot to make it easier to understand:

public static IQueryable<TSource> Where<TSource>(
    this IQueryable<TSource> source,
    Expression<Func<TSource, bool>> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
        
    Expression sourceExpression = source.Expression;
    Expression quotedPredicate = Expression.Quote(predicate);
        
    // This gets the "open" method, without specific type arguments. The second parameter
    // of the method we want is of type Expression<Func<TSource, bool>>, so the sole generic
    // type argument to Expression<T> itself has two generic type arguments.
    // Let’s face it, reflection on generic methods is a mess.
    MethodInfo method = typeof(Queryable).GetMethods()
                                         .Where(m => m.Name == "Where")
                                         .Where(m => m.GetParameters()[1]
                                                      .ParameterType
                                                      .GetGenericArguments()[0]
                                                      .GetGenericArguments().Length == 2)
                                         .First();
        
    // This gets the method with the same type arguments as ours
    MethodInfo closedMethod = method.MakeGenericMethod(new Type[] { typeof(TSource) });
        
    // Now we can create a *representation* of this exact method call
    Expression methodCall = Expression.Call(closedMethod, sourceExpression, quotedPredicate);
        
    // … and ask our query provider to create a query for it
    return source.Provider.CreateQuery<TSource>(methodCall);
}

There’s only one part of this code that I don’t really understand the need for, and that’s the call to Expression.Quote on the predicate expression tree. I’m sure there’s a good reason for it, but this particular example would work without it, as far as I can see. The real implementation uses it though, so dare say it’s required in some way.

EDIT: Daniel’s comment has made this somewhat clearer to me. Each of the arguments to Expression.Call after the MethodInfo itself is meant to be an expression which represents the argument to the method call. In our example we need an expression which represents an argument of type Expression<Func<TSource, bool>>. We already have the value, but we need to provide the layer of wrapping… just as we did with Expression.Constant in the very first expression tree I showed at the top. To wrap the expression value we’ve got, we use Expression.Quote. It’s still not clear to me exactly why we can use Expression.Quote but not Expression.Constant, but at least it’s clearer why we need something

EDIT: I’m gradually getting there. This Stack Overflow answer from Eric Lippert has much to say on the topic. I’m still trying to get my head round it, but I’m sure when I’ve read Eric’s answer several times, I’ll get there.

We can even test that this works, by using the Queryable.AsQueryable method from the real .NET implementation. This creates an IQuerable<T> from any IEnumerable<T> using a built-in query provider. Here’s the test program, where FakeQueryable is a static class containing the extension method above:

using System;
using System.Collections.Generic;
using System.Linq;

class Test
{
    static void Main()
    {        
        List<int> list = new List<int> { 3, 5, 1 };
        IQueryable<int> source = list.AsQueryable();
        IQueryable<int> query = FakeQueryable.Where(source, x => x > 2);
        
        foreach (int value in query)
        {
            Console.WriteLine(value);
        }
    }
}

This works, printing just 3 and 5, filtering out the 1. Yay! (I’m explicitly calling FakeQueryable.Where rather than letting extension method resolution find it, just to make things clearer.)

Um, but what’s doing the actual work? We’ve implemented the Where clause without providing any filtering ourselves. It’s really the query provider which has built an appropriate IQueryable<T> implementation. When we call GetEnumerator() implicitly in the foreach loop, the query can examine everything that’s built up in the expression tree (which could contain multiple operators – it’s nesting queries within queries, essentially) and work out what to do. In the case of our IQueryable<T> built from a list, it just does the filtering in-process… but if we were using LINQ to SQL, that’s when the SQL would be generated. The provider recognizes the specific methods from Queryable, and applies filters, projections etc. That’s why it was important that our demo Where method pretended that the real Queryable.Where had been called – otherwise the query provider wouldn’t know what the call expression

Just to hammer the point home even further… Queryable itself neither knows nor cares what kind of data source you’re using. Its job is not to perform any query operations itself; its job is to record the requested query operations in a source-agnostic manner, and let the source provider handle them when it needs to.

Immediate execution with IQueryProvider.Execute

All the operators using deferred execution in Queryable are implemented in much the same way as our demo Where method. However, that doesn’t cover the situation where we need to execute the query now, because it has to return a value directly instead of another query.

This time I’m going to use ElementAt as the sample, simply because it’s only got one overload, which makes it very easy to grab the relevant MethodInfo. The general procedure is exactly the same as building a new query, except that this time we call the provider’s Execute method instead of CreateQuery.

public static TSource ElementAt<TSource>(this IQueryable<TSource> source, int index)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
        
    Expression sourceExpression = source.Expression;
    Expression indexExpression = Expression.Constant(index);
        
    MethodInfo method = typeof(Queryable).GetMethod("ElementAt");        
    MethodInfo closedMethod = method.MakeGenericMethod(new Type[] { typeof(TSource) });
        
    // Now we can create a *representation* of this exact method call
    Expression methodCall = Expression.Call(closedMethod, sourceExpression, indexExpression);
        
    // … and ask our query provider to execute it
    return source.Provider.Execute<TSource>(methodCall);
}

The type argument we provide to Execute is the desired return type – so for Count, we’d call Execute<int> for example. Again, it’s up to the query provider to work out what the call actually means.

It’s worth mentioning that both CreateQuery and Execute have generic and non-generic overloads. I haven’t personally encountered a use for the non-generic ones, but I gather they’re useful for various situations in generated code, particularly if you really don’t know the element type – or at least only know it dynamically, and don’t want to have to use reflection to generate an appropriate generic method call.

Transparent support in source code

One of the aspects of LINQ which raises it to the "genius" status (and "slightly scary" at the same time) is that most of the time, most developers don’t need to make any changes to their source code in order to use Enumerable or Queryable. Take this query expression and its translation:

var query = from person in family
            where person.LastName == "Skeet"
            select person.FirstName;

// Translation
var query = family.Where(person => person.LastName == "Skeet")
                  .Select(person => person.FirstName);

Which set of query methods will that use? It entirely depends on the compile-time type of the "family" variable. If that’s a type which implements IQueryable<T>, it will use the extension methods in Queryable, the lambda expression will be converted into expression trees, and the type of "query" will be IQueryable<string>. Otherwise (and assuming the type implements IEnumerable<T> isn’t some other interesting type such as ParallelEnumerable) it will use the extension methods in Enumerable, the lambda expressions will be converted into delgeates, and the type of "query" will be IEnumerable<string>.

The query expression translation part of the specification has no need to care about this, because it’s simply translating into a form which uses lambda expressions – the rest of overload resolution and lambda expression conversion deals with the details.

Genius… although it does mean you need to be careful that really you know where your query evaluation is going to take place – you don’t want to accidentally end up performing your whole query in-process having shipped the entire contents of a database across a network connection…

Conclusion

This was really a whistlestop tour of the "other" side of LINQ – and without going into any of the details of the real providers such as LINQ to SQL. However, I hope it’s given you enough of a flavour for what’s going on to appreciate the general design. Highlights:

  • Expression trees are used to capture logic in a data structure which can be examined relatively easily at execution time
  • Lambda expressions can be converted into expression trees as well as delegates
  • IQueryable<T> and IQueryable form a sort of parallel interface hierarchy to IEnumerable<T> and IEnumerable – although the queryable forms extend the enumerable forms
  • IQueryProvider enables one query to be built based on another, or executed immediately where appropriate
  • Queryable provides equivalent extension methods to most of the Enumerable LINQ operators, except that it uses IQueryable<T> sources and expression trees instead of delegates
  • Queryable doesn’t handle the queries itself at all; it simply records what’s been called and delegates the real processing to the query provider

I think I’ve now covered most of the topics I wanted to mention after finishing the actual Edulinq implementation. Next up I’ll talk about some of the thorny design issues (most of which I’ve already mentioned, but which bear repeating) and then I’ll write a brief "series conclusion" post with a list of links to all the other parts.

Reimplementing LINQ to Objects: Part 42 – More optimization

A few parts ago, I jotted down a few thoughts on optimization. Three more topics on that general theme have occurred to me, one of them prompted by the comments.

User-directed optimizations

I mentioned last time that for micro-optimization purposes, we could derive a tiny benefit if there were operators which allowed us to turn off potential optimizations – effectively declare in the LINQ query that we believed the input sequence would never be an IList<T> or an ICollection<T>, so it wasn’t worth checking it. I still believe that level of optimization would be futile.

However, going the other way is entirely possible. Imagine if we could say, "There are probably a lot of items in this collection, and the operations I want to perform on them are independent and thread-safe. Feel free to parallelize them."

That’s exactly what Parallel LINQ gives you, of course. A simple call to AsParallel() somewhere in the query – often at the start, but it doesn’t have to be – enables parallelism. You need to be careful how you use this, of course, which is why it’s opt-in… and it gives you a fair amount of control in terms of degrees of potential parallelism, whether the results are required in the original order and so on.

In some ways my "TopBy" proposal is similar in a very small way, in that it gives information relatively early in the query, allowing the subsequent parts (ThenBy clauses) to take account of the extra information provided by the user. On the other hand, the effect is extremely localized – basically just for the sequence of clauses to do with ordering.

Related to the idea of parallelism is the idea of side-effects, and how they affect LINQ to Objects itself.

Side-effects and optimization

The optimizations in LINQ to Objects appear to make some assumptions about side-effects:

  • Iterating over a collection won’t cause any side-effects
  • Predicates may cause side-effects

Without the first point, all kinds of optimizations would effectively be inappropriate. As the simplest example, Count() won’t use an iterator – it will just take the count of the collection. What if this was an odd collection which mutated something during iteration, though? Or what if accessing the Count property itself had side-effects? At that point we’d be violating our principle of not changing observable behaviour by optimizing. Again, the optimizations are basically assuming "sensible" behaviour from collections.

There’s a rather more subtle possible cause of side-effects which I’ve never seen discussed. In some situations – most obviously Skip – an operator can be implemented to move over an iterator for a time without taking each "current" value. This is due to the separation of MoveNext() from Current. What if we were dealing with an iterator which had side-effects only when Current was fetched? It would be easy to write such a sequence – but again, I suspect there’s an implicit assumption that such sequences simply don’t exist, or that it’s reasonable for the behaviour of LINQ operators with respect to them to be left unspecified.

Predicates, on the other hand, might not be so sensible. Suppose we were computing "sequence.Last(x => 10 / x > 1)" on the sequence { 5, 0, 2 }. Iterating over the sequence forwards, we end up with a DivideByZeroException – whereas if we detected that the sequence was a list, and worked our way backwards from the end, we’d see that 10 / 2 > 1, and return that last element (2) immediately. Of course, exceptions aren’t the only kind of side-effect that a predicate can have: it could mutate other state. However, it’s generally easier to spot that and cry foul of it being a proper functional predicate than notice the possibility of an exception.

I believe this is the reason the predicated Last overload isn’t optimized. It would be nice if these assumptions were documented, however.

Assumptions about performance

There’s a final set of assumptions which the common ICollection<T>/IList<T> optimizations have all been making: that using the more "direct" members of the interfaces (specifically Count and the indexer) are more efficient than simply iterating. The interfaces make no such declarations: there’s no requirement that Count has to be O(1), for example. Indeed, it’s not even the case in the BCL. The first time you ask a "view between" on a sorted set for its count after the underlying set has changed, it has to count the elements again.

I’ve had this problem before, removing items from a HashSet in Java. The problem is that there’s no way of communicating this information in a standardized way. We could use attributes for everything, but it gets very complicated, and I strongly suspect it would be a complete pain to use. Basically, performance is one area where abstractions just don’t hold up – or rather, the abstractions aren’t designed to include performance characteristics.

Even if we knew the complexity of (say) Count that still wouldn’t help us necessarily. Suppose it’s an O(n) operation – that sounds bad, until you discover that for this particular horrible collection, each iteration step is also O(n) for some reason. Or maybe there’s a collection with an O(1) count but a horrible constant value, whereas iterating is really quick per item… so for small values of O(n), iteration would be faster. Then you’ve got to bear in mind how much processor time is needed trying to work out the fastest approach… it’s all bonkers.

So instead we make these assumptions, and for the most part they’re correct. Just be aware of their presence.

Conclusion

I have reached the conclusion that I’m tired, and need sleep. I might write about Queryable, IQueryable and query expressions next time.

Reimplementing LINQ to Objects: Part 41 – How query expressions work

Okay, first a quick plug. This won’t be in as much detail as chapter 11 of C# in Depth. If you want more, buy a copy. (Until Feb 1st, there’s 43% off it if you buy it from Manning with coupon code j2543.) Admittedly that chapter has to also explain all the basic operators rather than just query expression translations, but that’s a fair chunk of it.

If you’re already familiar with query expressions, don’t expect to discover anything particularly insightful here. However, you might be interested in the cheat sheet at the end, in case you forget some of the syntax occasionally. (I know I do.)

What is this "query expression" of which you speak?

Query expressions are a little nugget of goodness hidden away in section 7.16 of the C# specification. Unlike some language features like generics and dynamic typing, query expressions keep themselves to themselves, and don’t impinge on the rest of the spec. A query expression is a bit of C# which looks a bit like a mangled version of SQL. For example:

from person in people
where person.FirstName.StartsWith("J")
orderby person.Age
select person.LastName

It looks somewhat unlike the rest of C#, which is both a blessing and a curse. On the one hand, queries stand out so it’s easy to see they’re queries. On the other hand… they stand out rather than fitting in with the rest of your code. To be honest I haven’t found this to be an issue, but it can take a little getting used to.

Every query expression can be represented in C# code, but the reverse isn’t true. Query expressions only take in a subset of the standard query operators – and only a limited set of the overloads, at that. It’s not unusual to see a query expression followed by a "normal" query operator call, for example:

var list = (from person in people
            where person.FirstName.StartsWith("J")
            orderby person.Age
            select person.LastName)
           .ToList();

So, that’s very roughly what they look like. That’s the sort of thing I’m dealing with in this post. Let’s start dissecting them.

Compiler translations

First it’s worth introducing the general principle of query expressions: they effectively get translated step by step into C# which eventually doesn’t contain any query expressions. To stick with our first example, that ends up be translated into this code:

people.Where(person => person.FirstName.StartsWith("J"))
      .OrderBy(person => person.Age)
      .Select(person => person.LastName)

It’s important to understand that the compiler hasn’t done anything apart from systematic translation to get to this point. In particular, so far we haven’t depended on what "people" is, nor "Where", "OrderBy" or "Select".

Can you tell what this code does yet? You can probably hazard a pretty good guess, but you can’t tell. Is it going to call Edulinq.Enumerable.Select, or System.Linq.Enumerable.Select, or something entirely different? It depends on the context. Heck, "people" could be the name of a type which has a static Where method. Or maybe it could be a reference to a class which has an instance method called Where… the options are open.

Of course, they don’t stay open for long: the compiler takes that expression and compiles it applying all the normal rules. It converts the lambda expression into either a delegate or an expression tree, tries to resolve Where, OrderBy and Select as normal, and life continues. (Don’t worry if you’re not sure about expression trees yet – I’ll come to them in another post.)

The important point is that the query expression translations don’t know about System.Linq. The spec barely mentioned IEnumerable<T>, and certainly doesn’t rely on it. The whole thing is pattern based. If you have an API which provides some or all of the operators used by the pattern, in an appropriate way, you can use query expressions with it. That’s the secret sauce that allows you to use the same syntax for LINQ to Objects, LINQ to SQL, LINQ to Entities, Reactive Extensions, Parallel Extensions and more.

Range variables and the initial "from" clause

The first part of the query to look at is the first "from" clause, at the start of the query. It’s worth mentioning upfront that this is handled somewhat differently to any later "from" clauses – I’ll explain how they’re translated later.

So we have an expression of the form:

from [type] identifier in expression

The "expression" part is just any expression. In most cases there isn’t a type specified, in which case the translated version is simply the expression, but with the compiler remembering the identifier as a range variable. I’ll do my best to explain what range variables are in a minute :)

If there is a type specified, that represents a call to Cast<type>(). So examples of the two translations so far are:

// Query (incomplete)
from x in people

// Translation (+ range variable of "x")
people

// Query (incomplete)
from Person x in people

// Translation (+ range variable of "x")
(people).Cast<Person>()

These aren’t complete query expressions – queries have very precise rules about how they can start and end. They always start with a "from" clause like this, and always end either with a "group by" clause or a "select" clause.

So what’s the point of the range variable? Well, that’s what gets used as the name of the lambda expression parameter used in all the later clauses. Let’s add a select clause to create a complete expression and demonstrate how the variable could be used.

A "select" clause

A select clause is usually translated into a call to Select, using the "body" of the clause as the body of the lambda expression… and the range variable as the parameter. So to expand our previous query, we might have this translation:

// Query
from x in people
select x.Name

// Translation
people.Select(x => x.Name)

That’s all that range variables are used for: to provide placeholders within lambda expressions, effectively. They’re quite unlike normal variables in most senses. It only makes sense to talk about the "value" of a range variable within a particular clause at a particular point in time when the clause is executing, for one particular value. Their nearest conceptual neighbour is probably the iteration variable declared in a foreach statement, but even that’s not really the same – particularly given the way iteration variables are captured, often to the surprise of developers.

The body part has to be a single expression – you can’t use "statement lambdas" in query expressions. For example, there’s no query expression which would translate to this:

// Can’t express this in a query expression
people.Select(x => { 
                     Console.WriteLine("Got " + x);
                     return x.Name;
                   })

That’s a perfectly valid C# expression, it’s just there’s now way of expressing it directly as a query expression.

I mentioned that a select clause usually translates into a Select call. There are two cases where it doesn’t:

  • If it’s the sole clause after a secondary "from" clause, or a "group by", "join" or "join … into" clause, the body is used in the translation of that clause
  • If it’s an "identity" projection coming after another clause, it’s removed entirely.

I’ll deal with the first point when we reach the relevant clauses. The second point leads to these translations:

// Query
from x in people
where x.IsAdult
select x

// Translation: Select is removed
people.Where(x => x.IsAdult)

// Query
from x in people
select x

// Translation: Select is *not* removed
people.Select(x => x)

The point of including the "pointless" select in the second translation is to hide the original source sequence; it’s assumed that there’s no need to do this in the first translation as the "Where" call will already have protected the source sufficiently.

The "where" clause

This one’s really simple – especially as we’ve already seen it! A where clause always just translates into a Where call. Sample translation, this time with no funny business removing degenerate query expressions:

// Query
from x in people
where x.IsAdult
select x.Name

// Translation
people.Where(x => x.IsAdult)
      .Select(x => x.Name)

Note how the range variable is propagated through the query.

The "orderby" clause

Here’s a secret: I can never remember offhand whether it’s "orderby" or "order by" – it’s confusing because it really is "group by", but "orderby" is actually just a single word. Of course, Visual Studio gives a pretty unsubtle hint in terms of colouring.

In the simplest form, an orderby clause might look like this:

// Query
from x in people
orderby x.Age
select x.Name

// Translation
people.OrderBy(x => x.Age)
      .Select(x => x.Name)

There are two things which can add complexity though:

  • You can order by multiple expressions, separating them by commas
  • Each expression can be ordered ascending implicitly, ascending explicitly or descending explicitly.

The first sort expression is always translated into OrderBy or OrderByDescending; subsequent ones always become ThenBy or ThenByDescending. It makes no difference whether you explicitly specify "ascending" or not – I’ve very rarely seen it in real queries. Here’s an example putting it all together:

// Query
from x in people
orderby x.Age, x.FirstName descending, x.LastName ascending
select x.LastName

// Translation
people.OrderBy(x => x.Age)
      .ThenByDescending(x => x.FirstName)
      .ThenBy(x => x.LastName)
      .Select(x => x.LastName)

Top tip: don’t use multiple "orderby" clauses consecutively. This query is almost certainly not what you want:

// Don’t do this!
from x in people 
orderby x.Age
orderby x.FirstName
select x.LastName

That will end up sorting by FirstName and then Age, and doing so rather slowly as it has to sort twice.

The "group by" clause

Grouping is another alternative to "select" as the final clause in a query. There are two expressions involved: the element selector (what you want to get in each group) and the key selector (how you want the groups to be organized). Unsurprisingly, this uses the GroupBy operator. So you might have a query to group people in families by their last name, with each group containing the first names of the family members:

// Query expression
from x in people 
group x.FirstName by x.LastName

// Translation
people.GroupBy(x => x.LastName, x => x.FirstName)

If the element selector is trivial, it isn’t specified as part of the translation:

// Query expression
from x in people 
group x by x.LastName

// Translation
people.GroupBy(x => x.LastName)

Query continuations

Both "select" and "group by" can be followed by "into identifier". This is known as a query continuation, and it’s really simple. Its translation in the specification isn’t in terms of a method call, but instead it transforms one query expression into another, effectively nesting one query as the source of another. I find that translation tricky to think about, personally… I prefer to think of it as using a temporary variable, like this:

// Original query
var query = from x in people
            select x.Name into y
            orderby y.Length
            select y[0];

// Query continuation translation
var tmp = from x in people
          select x.Name;

var query = from y in tmp
            orderby y.Length
            select y[0];

// Final translation into methods
var query = people.Select(x => x.Name)
                  .OrderBy(y => y.Length)
                  .Select(y => y[0]);

Obviously that final translation could have been expressed in terms of two statements as well… they’d be equivalent. This is why it’s important that LINQ uses deferred execution – you can split up a query as much as you like, and it won’t alter the execution flow. The query wouldn’t actually execute when the value is assigned to "tmp" – it’s just preparing the query for execution.

Transparent identifiers and the "let" clause

The rest of the query expression clauses all introduce an extra range variable in some form or other. This is the part of query expression translation which is hardest to understand, because it affects how any usage of the range variable in the query expression is translated.

We’ll start with probably the simplest of the remaining clauses: the "let" clause. This simply introduces a new range variable based upon a projection. It’s a bit like a "select", but after a "let" clause both the original range variable and the new one are in scope for the rest of the query. They’re typically used to avoid redundant computations, or simply to make the code simpler to read. For example, suppose computing an employee’s tax is a complicated operation, and we want to display a list of employees and the tax they pay, with the higher tax-payer first:

from x in employees
let tax = x.ComputeTax()
orderby tax descending
select x.LastName + ": " + tax

That’s pretty readable, and we’ve managed to avoid computing the tax twice (once for sorting and once for display).

The problem is, both "x" and "tax" are in scope at the same time… so what are we going to pass to the Select method at the end? We need one entity to pass through our query, which knows the value of both "x" and "tax" at any point (after the "let" clause, obviously). This is precisely the point of a transparent identifier. You can think of the above query as being translated into this:

// Translation from "let" clause to another query expression
from x in employees
select new { x, tax = x.ComputeTax() } into z
orderby z.tax descending
select z.x.LastName + ": " + z.tax

// Final translated query
employees.Select(x => new { x, tax = x.ComputeTax() })
         .OrderByDescending(z => z.tax)
         .Select(z => z.x.LastName + ": " + z.tax)

Here "z" is the transparent identifier – which I’ve made somewhat more opaque by giving it a name. In the specification, the query translations are performed in terms of "*" – which clearly isn’t a valid identifier, but which stands in for the transparent one.

The good news about transparent identifiers is that most of the time you don’t need to think of them at all. They simply let you have multiple range variables in scope at the same time. I find myself only bothering to think about them explicitly when I’m trying to work out the full translation of a query expression which uses them. It’s worth knowing about them to avoid being stumped by the concept of (say) a select clause being able to use multiple range variables, but that’s all.

Now that we’ve got the basic concept, we can move onto the final few clauses.

Secondary "from" clauses

We’ve seen that the introductory "from" clause isn’t actually translated into a method call, but any subsequent ones are. The syntax is still the same, but the translation uses SelectMany. In many cases this is used just like a cross-join (Cartesian product) but it’s more flexible than that, as the "inner" sequence introduced by the secondary "from" clause can depend on the current value from the "outer" sequence. Here’s an example of that. with the call to SelectMany in the translation:

// Query expression
from parent in adults
from child in parent.Children
where child.Gender == Gender.Male
select child.Name + " is a son of " + parent.Name

// Translation (using z for the transparent identifier)
adults.SelectMany(parent => parent.Children,
                  (parent, child) => new { parent, child })
      .Where(z => z.child.Gender == Gender.Male)
      .Select(z => z.child.Name + " is a son of " + z.parent.Name;

Again we can see the effect of the transparent identifier – an anonymous type is introduced to propagate the { parent, child } tuple through the rest of the query.

There’s a special case, however – if "the rest of the query" is just a "select" clause, we don’t need the anonymous type. We can just apply the projection directly in the SelectMany call. Here’s a similar example, but this time without the "where" clause:

// Query expression
from parent in adults
from child in parent.Children
select child.Name + " is a child of " + parent.Name

// Translation (using z for the transparent identifier)
adults.SelectMany(parent => parent.Children,
                  (parent, child) => child.Name + " is a child of " + parent.Name)

This same trick is used in GroupJoin and Join, but I won’t go into the details there. It’s simpler to just provide examples which use the shortcut, instead of including unnecessary extra clauses just to force the transparent identifier to appear in the translation.

Note that just like the introductory "from" clause, you can specify a type for the range variable, which forces a call to "Cast<>".

Simple "join" clauses (no "into")

A "join" clause without an "into" part corresponds to a call to the Join method, which represents an inner equijoin. In some ways this is like an extra "from" clause with a "where" clause to provide the relevant filtering, but there’s a significant difference: while the "from" clause (and SelectMany) allow you to project each element in the outer sequence to an inner sequence, in Join you merely provide the inner sequence directly, once. You also have to specify the two key selectors – one for the outer sequence, and one for the inner sequence. The general syntax is:

join identifier in inner-sequence on outer-key-selector equals inner-key-selector

The identifier names the extra range variable introduced. Here’s an example including the translation:

// Query expression
from customer in customers
join order in orders on customer.Id equals order.CustomerId
select customer.Name + ": " + order.Price

// Translation
customers.Join(orders,
               customer => customer.Id,
               order => order.CustomerId,
               (customer, order) => customer.Name + ": " + order.Price)

Note how if you put the key selectors the wrong way round, it’s highly unlikely that the result will compile – the lambda expression for the outer sequence doesn’t "know about" the inner sequence element, and vice versa. The C# compiler is even nice enough to guess the probable cause, and suggest the fix.

Group joins – "join … into"

Group joins look exactly the same as inner joins, except they have an extra "into identifier" part at the end. Again, this introduces an extra range variable – but it’s the identifier after the "into" which ends up in scope, not the one after "join"; that one is only used in the key selector. This is easier to see when we look at a sample translation:

// Query expression
from customer in customers
join order in orders on customer.Id equals order.CustomerId into customerOrders
select customer.Name + ": " + customerOrders.Count()

// Translation
customers.GroupJoin(orders,
                    customer => customer.Id,
                    order => order.CustomerId,
                    (customer, customerOrders) => customer.Name + ": " + customerOrders.Count())

If we had tried to refer to "order" in the select clause, the result would have been an error: it’s not in scope any more. Note that this is not a query continuation unlike "select … into" and "group … into". It introduces a new range variable, but all the previous range variables are still in scope.

That’s it! That’s all the translations that the C# compiler supports. VB’s query expressions are rather richer – but I suspect that’s at least partly because it’s more painful to write the "dot notation" syntax in VB, as the lambda expression syntax isn’t as nice as C#’s.

Translation cheat sheet

I thought it would be useful to produce a short table of the kinds of clauses supported in query expressions, with the translation used by the C# compiler. The translation is given assuming a single range variable named "x" is in scope. I haven’t given the alternative options where transparent identifiers are introduced – this table isn’t meant to be a replacement for all the information above! (Likewise this doesn’t mention the optimizations for degenerate query expressions or "identity projection" groupings.)

Query expression clause Translation
First "from [type] x in sequence" Just "sequence" or "sequence.Cast<type>()", but with the introduction of a range variable
Subsequent "from" clauses:
"from [type] y in projection"
SelectMany(x => projection, (x, y) => new { x, y })
or SelectMany(x => projection.Cast<type>(), (x, y) => new { x, y })
where predicate Where(x => predicate)
select projection Select(x => projection)
let y = projection Select(x => new { x, y = projection })
orderby o1, o2 ascending, o3 descending
(Each ordering may have descending or ascending specified explicitly; the default is ascending)
OrderBy(x => o1)
.ThenBy(x => o2)
.ThenByDescending(x => o3)
group projection by key-selector GroupBy(x => key-selector, x => projection)
join y in inner-sequece
on outer-key-selector equals inner-key-selector
Join(x => outer-key-selector,
    y => inner-key-selector,
    (x, y) => new { x, y })
join y in inner-sequece
on outer-key-selector equals inner-key-selector
into z
GroupJoin(x => outer-key-selector,
    y => inner-key-selector,
    (x, z) => new { x, z })
query1 into y
query2
(Translation in terms of a new query expression)
from y in (query1)
query2

Conclusion

Hopefully that’s made a certain amount of sense out of a fairly complicated topic. I find it’s one of those "aha!" things – at some point it clicks, and then seems reasonably simple (aside from transparent identifiers, perhaps). Until that time, query expressions can be a bit magical.

As an aside, I have a sneaking suspicion that one of my first blog posts consisted of my initial impressions of LINQ, written in a plane on the way to the MVP conference in Seattle in September 2005. I would check, but I’m finishing this post in another plane, this time on the way to San Francisco. I think I’d have been somewhat surprised to be told in 2005 that I’d still be writing blog posts about LINQ over five years later. Mind you, I can think of any number of things which have happened in the intervening years which would have astonished me to about the same degree.

Next time: some more thoughts on optimization. Oh, and I’m likely to update my wishlist of extra operators as well, but within the existing post.

Reimplementing LINQ to Objects: Part 40 – Optimization

I’m not an expert in optimization, and most importantly I don’t have any real-world benchmarks to support this post, so please take it with a pinch of salt. That said, let’s dive into what optimizations are available in LINQ to Objects.

What do we mean by optimization?

Just as we think of refactoring as changing the internal structure of code without changing its externally visible behaviour, optimization is the art/craft/science/voodoo of changing the performance of code without changing its externally visible behaviour. Sort of.

This requires two definitions: "performance" and "externally visible behaviour". Neither are as simple as they sound. In almost all cases, performance is a trade-off, whether in speed vs memory, throughput vs latency, big-O complexity vs the factors within that complexity bound, and best case vs worst case.

In LINQ to Objects the "best case vs worst case" is the balance which we need to consider most often: in the cases where we can make a saving, how significant is that saving? How much does it cost in every case to make a saving some of the time? How often do we actually take the fast path?

Externally visible behaviour is even harder to pin down, because we definitely don’t mean all externally visible behaviour. Almost all the optimizations in Edulinq are visible if you work hard enough – indeed, that’s how I have unit tests for them. I can test that Count() uses the Count property of an ICollection<T> instead of iterating over it by creating an ICollection<T> implementation which works with Count but throws an exception if you try to iterate over it. I don’t think we care about that sort of change to externally visible behaviour.

What we really mean is, "If we use the code in a sensible way, will we get the same results with the optimized code as we would without the optimization?" Much better. Nothing woolly about the term "sensible" at all, is there? More realistically, we could talk about a system where every type adheres to the contracts of every interface it implements – that would at least get rid of the examples used for unit testing. Still, even the performance of a system is externally visible… it’s easy to tell the difference between an implementation of Count() which is optimized and one which isn’t, if you’ve got a list of 10 million items.

How can we optimize in LINQ to Objects?

Effectively we have one technique for significant optimization in LINQ to Objects: finding out that a sequence implements a more capable interface than IEnumerable<T>, and then using that interface. (Or in the case of my optimization for HashSet<T> and Contains, using a concrete type in the same way.) Count() is the most obvious example of this: if the sequence implements ICollection or ICollection<T>, then we can use the Count property on that interface and we’re done. No need to iterate at all.

These are generally good optimizations because they allow us to transform an O(n) computation into an O(1) computation. That’s a pretty big win when it’s applicable, and the cost of checking is reasonably small. So long as we hit the right type once every so often – and particularly if in those cases the sequences are long – then it’s a net gain. The performance characteristics are likely to be reasonably fixed here for any one program – it’s not like we’ll sometimes win and sometimes lose for a specific query… it’s that for some queries we win and some we won’t. If we were micro-optimizing, we might want a way of calling a "non-optimized" version which didn’t even bother trying the optimization, because we know it will always fail. I would regard such an approach as a colossal waste of effort in the majority of cases.

Some optimizations are slightly less obvious, particularly because they don’t offer a change in the big-O complexity, but can still make a significant difference. Take ToArray, for example. If we know the sequence is an IList<T> we can construct an array of exactly the right size and ask the list to copy the elements into it. Chances are that copy can be very efficient indeed, basically copying a whole block of bits from one place to another – and we know we won’t need any resizing. Compare that with building up a buffer, resizing periodically including copying all the elements we’ve already discovered. Every part of that process is going to be slower, but they’re both O(n) operations really. This is a good example of where big-O notation doesn’t tell the whole story. Again, the optimization is almost certainly a good one to make.

Then there are distinctly dodgy optimizations which can make a difference, but are unlikely to apply. My optimization for ElementAt and ElementAtOrDefault comes into play here. It’s fine to check whether an object implements IList<T>, and use the indexer if so. That’s an obvious win. But I have an extra optimization to exit quickly if we can find out that the given index is out of the bounds of the sequence. Unfortunately that optimization is only useful when:

  • The sequence implements ICollection<T> or ICollection (but remember it has to implement IEnumerable<T> – there aren’t many collections implementing only the non-generic ICollection, but the generic IEnumerable<T>)
  • The sequence doesn’t implement IList<T> (which gets rid of almost all implementations of ICollection<T>)
  • The given index is actually greater than or equal to the size of the collection

All that comes at the cost of a couple of type checks… not a great cost, and we do potentially save an O(n) check for being given an index out of the bounds of the collection… but how often are we really going to make that win? This is where I’d love to have something like Dapper, but applied to LINQ to Objects and running in a significant number of real-world projects, just logging in as light a way as possible how often we win, how often we lose, and how big the benefit is.

Finally, we come to the optimizations which don’t make sense to me… such as the optimization for First in both Mono and LinqBridge. Both of these projects check whether the sequence is a list, so that they check the count and then use the indexer to fetch item 0 instead of calling GetEnumerator()/MoveNext()/Current. Now yes, there’s a chance this avoids creating an extra object (although not always, as we’ve seen before) – but they’re both O(1) operations which are likely to be darned fast. At this point not only is the payback very small (if it even exists) but the whole operation is likely to be so fast that the tiny check for whether the object implements IList<T> is likely to become more significant. Oh, and then there’s the extra code complexity – yes, that’s only relevant to the implementers, but I’d personally rather they spent their time on other things (like getting OrderByDescending to work properly… smirk). In other words, I think this is a bad target for optimization. At some point I’ll try to do a quick analysis of just how often the collection has to implement IList<T> in order for it to be worth doing this – and whether the improvement is even measurable.

Of course there are other micro-optimizations available. When we don’t need to fetch the current item (e.g. when skipping over items) let’s just call MoveNext() instead of also assigning the return value of a property to a variable. I’ve done that in various places in Edulinq, but not as an optimization strategy, which I suspect won’t make a significant difference, but for readability – to make it clearer to the reader that we’re just moving along the iterator, not examining the contents as we go.

The only other piece of optimization I think I’ve performed in Edulinq is the "yield the first results before sorting the rest" part of my quicksort implementation. I’m reasonably proud of that, at least conceptually. I don’t think it really fits into any other bucket – it’s just a matter of thinking about what we really need and when, deferring work just in case we never need to do it.

What can we not optimize in LINQ to Objects?

I’ve found a few optimizations in both Edulinq and other implementations which I believe to be invalid.

Here’s an example I happened to look at just this morning, when reviewing the code for Skip:

var list = source as IList<TSource>;
if (list != null)
{
    count = Math.Max(count, 0);
    // Note that "count" is the count of items to skip
    for (int index = count; index < list.Count; index++)
    {
        yield return list[index];
    }
    yield break;
}

If our sequence is a list, we can just skip straight to the right part of it and yield the items one at a time. That sounds great, but what if the list changes (or is even truncated!) while we’re iterating over it? An implementation working with the simple iterator would usually throw an exception, as the change would invalidate the iterator. This is definitely a behavioural change. When I first wrote about Skip, I included this as a "possible" optimization – and actually turned it on in the Edulinq source code. I now believe it to be a mistake, and have removed it completely.

Another example is Reverse, and how it should behave. The documentation is fairly unclear, but when I ran the tests, the Mono implementation used an optimization whereby if the sequence is a list, it will just return items from the tail end using the indexer. (This has now been fixed – the Mono team is quick like that!) Again, that means that changes made to the list while iterating will be reflected in the reversed sequence. I believe the documentation for Reverse ought to be clear that:

  • Execution is deferred: the input sequence isn’t read when the method is called.
  • When the result sequence is first read by the caller, a snapshot is taken, and that’s what’s used to return the data.
  • If the result sequence is read more than once (i.e. GetEnumerator is called more than once) then a new snapshot is created each time – so changes to the input sequence between calls to GetEnumerator on the result sequence will be observed.

Now this is still not as precise as it might be in terms of what "reading" a sequence entails – in particular, a simple implementation of Reverse (as per Edulinq) will actually take the snapshot on the first call to MoveNext() on the iterator returned by GetEnumerator() – but that’s probably not too bad. The snapshotting behaviour itself is important though, and should be made explicit in my opinion.

The problem with both of these "optimizations" is arguably that they’re applying list-based optimizations within an iterator block used for deferred execution. Optimizing for lists either upfront at the point of the initial method call or within an immediate execution operator (Count, ToList etc) is fine, because we assume the sequence won’t change during the course of the method’s execution. We can’t make that assumption with an iterator block, because the flow of the code is very different: our code is visited repeatedly based on the caller’s use of MoveNext().

Sequence identity

Another aspect of behaviour which isn’t well-specified is that of identity. When is it valid for an operator to return the input sequence itself as the result sequence?

In the Microsoft implementation, this can occur in two operators: AsEnumerable (which always returns the input sequence reference, even if it’s null) and Cast (which returns the original reference only if it actually implements IEnumerable<TResult>).

In Edulinq, I have two other operators which can return the input sequence: OfType (only if the original reference implements IEnumerable<TResult> and TResult is a non-nullable value type) and Skip (if you provide a count which is zero or negative). Are these valid optimizations? Let’s think about why we might not want them to be…

If you’re returning a sequence from one layer of your code to another, you usually want that sequence to be viewed only as a sequence. In particular, if it’s backed by a List<T>, you don’t want callers casting to List<T> and modifying the list. With any operator implemented by an iterator block, that’s fine – the object returned from the operator has no accessible reference to its input, and the type itself only implements IEnumerable<T> (and IEnumerator<T>, and IDisposable, etc – but not IList<T>). It’s not so good if the operator decides it’s okay to return the original reference.

The C# language specification refers to this in the section about query expression translation: a no-op projection at the end of a query can be omitted if and only if there are other operators in the query. So a query expression of "from foo in bar select foo" will translate to "bar.Select(foo => foo)" but if we had a "where" clause in the query, the Select call would be removed. It’s worth noting that the call to "Cast" generated when you explicitly specify the type of a range variable is not enough to prevent the "no-op" projection from being generated… it’s almost as if the C# team "knows" that Cast can leak sequence identity whereas Where can’t.

Personally I think that the "hiding" of the input sequence should be guaranteed where it makes sense to do so, and explicitly not guaranteed otherwise. We could also add an operator of something like "HideIdentity" which would simply (and unconditionally) add an extra iterator block into the pipeline. That way library authors wouldn’t have to guess, and would have a clear way of expressing their intention. Using Select(x => x) or Skip(0) is not clear, and in the case of Skip it would even be pointless when using Edulinq.

As for whether my optimizations are valid – that’s up for debate, really. It seems hard to justify why leaking sequence identity would be okay for Cast but not okay for OfType, whereas I think there’s a better case for claiming that Skip should always hide sequence identity.

The Contains issue…

If you remember, I have a disagreement around what Contains should do when you don’t provide an equality comparer, and when the sequence implements ICollection<T>. I believe it should be consistent with the rest of LINQ to Objects, which always uses the default equality comparer for the element type when it needs one but the user hasn’t specified one. Everyone else (Microsoft, Mono, LinqBridge) has gone with delegating to the collection’s implementation of ICollection<T>.Contains. That plays well in terms of consistency of what happens if you call Contains on that object, so that it doesn’t matter what the compile-time type is. That’s a debate to go into in another post, but I just want to point out that this is not an example of optimization. In some cases it may be faster (notably for HashSet<T>) but it stands a very good chance of changing the behaviour. There is absolutely nothing to suggest that the equality comparer used by ICollection<T> should be the default one for the type – and in some cases it definitely isn’t.

It’s therefore a matter of what result we want to get, not how to get that result faster. It’s correctness, not optimization – but both the LinqBridge and Mono tests which fail for Edulinq are called "Contains_CollectionOptimization_ReturnsTrueWithoutEnumerating" – and I think that shows a mistaken way of thinking about this.

Can we go further?

I’ve been considering a couple of optimizations which I believe to be perfectly legitimate, but which none of the implementations I’ve seen have used. One reason I haven’t implemented them myself yet is that they will reduce the effectiveness of all my unit tests. You see, I’ve generally used Enumerable.Range as a good way of testing a non-list-based sequence… but what’s to stop Range and Repeat being implemented as IList<T> implementations?

All the non-mutating members are easy to implement, and we can just throw exceptions from the mutating members (as other read-only collections do).

Would this be more efficient? Well yes, if you ever performed a Count(), ElementAt(), ToArray(), ToList() etc operation on a range or a repeated element… but how often is that going to happen? I suspect it’s pretty rare – probably rare enough not to make it worth my time, particularly when you then consider all the tests that would have to be rewritten to use something other than Range when I wanted a non-list sequence…

Conclusion

Surprise, surprise – doing optimization well is difficult. When it’s obvious what can be done, it’s not obvious what should be done… and sometimes it’s not even what is valid in the first place.

Note that none of this has really talked about data structures and algorithms. I looked at some options when implementing ordering, and I’m still thinking about the best approach for implementing TopBy (probably either a heap or a self-balancing tree – something which could take advantage of the size being constant would be nice) – but in general the optimizations here haven’t required any cunning knowledge of computer science. That’s quite a good thing, because it’s many years since I’ve studied CS seriously…

I suspect that with this post more than almost any other, I’m likely to want to add extra items in the future (or amend mistakes which reveal my incompetence). Watch this space.

Next up, I think it would be worth revisiting query expressions from scratch. Anyone who’s read C# in Depth or has followed this blog for long enough is likely to be able to skip it, but I think the series would be incomplete without a quick dive into the compiler translations involved.

Reimplementing LINQ to Objects: Part 39 – Comparing implementations

While implementing Edulinq, I only focused on two implementations: .NET 4.0 and Edulinq. However, I was aware that there were other implementations available, notably LinqBridge and the one which comes with Mono. Obviously it’s interesting to see how other implementations behave, so I’ve now made a few changes in order to make the test code run in these different environments.

The test environments

I’m using Mono 2.8 (I can’t remember the minor version number offhand) but I tend to think of it as "Mono 3.5" or "Mono 4.0" depending on which runtime I’m using and which base libraries I’m compiling against, to correspond with the .NET versions. Both runtimes ship as part of Mono 2.8. I will use these version numbers for this post, and ask forgiveness for my lack of precision: whenever you see "Mono 3.5" please just think "Mono 2.8 running against the 2.0 runtime, possibly using some of the class libraries normally associated with .NET 3.5".

LinqBridge is a bit like Edulinq – a clean room implementation of LINQ to Objects, but built against .NET 2.0. It contains its own Func delegate declarations and its own version of ExtensionAttribute for extension methods. In my experience this makes it difficult to use with the "real" .NET 3.5, so my build targets .NET 2.0 when running against LinqBridge. This means that tests using HashSet had to be disabled. The version of LinqBridge I’m running against is 1.2 – the latest binary available on the web site. This has AsEnumerable as a plain static method rather than an extension method; the code has been fixed in source control, but I wanted to run against a prebuilt binary, so I’ve just disabled my own AsEnumerable tests for LinqBridge. Likewise the tests for Zip are disabled both for LinqBridge and the "Mono 3.5" tests as Zip was only introduced in .NET 4.

The other issue of not having .NET 4 available in the tests is that the string.Join<T>(string, IEnumerable<T>) overload is unavailable – something I’d used quite a lot in the test code. I’ve created a new static class called "StringEx" and replaced string.Join with StringEx.Join everywhere.

There are batch files under a new "testing" directory which will build and run:

  • Microsoft’s LINQ to Objects and Edulinq under .NET
  • LinqBridge, Mono 3.5’s LINQ to Objects and Edulinq under Mono 3.5
  • Mono 4.0’s LINQ to Objects and Edulinq under Mono 4.0

Although I have LinqBridge running under .NET 2.0 in Visual Studio, it’s a bit of a pain building the tests from a batch file (at least without just calling msbuild). The failures running under Mono 3.5 are the same as those running under .NET 2.0 as far as I can tell, so I’m not too worried.

Note that while I have built the Mono tests under both the 3.5 and 4.0 profiles, the results were the same other than due to generic variance, so I’ve only included the results of the 4.0 profile below.

What do the tests cover?

Don’t forget that the Edulinq tests were written in the spirit of investigation. They cover aspects of LINQ’s behaviour which are not guaranteed, both in terms of optimization and simple correctness of behaviour. I have included a test which demonstrates the "issue" with calling Contains on an ICollection<T> which uses a non-default equality comparer, as well as the known issue with OrderByDescending using a comparer which returns int.MinValue. There are optimizations which are present in Edulinq but not in LINQ to Objects, and I have tests for those, too.

The tests which fail against Microsoft’s implementation (for known reasons) are normally marked with an [Ignore] attribute to prevent them from alarming me unduly during development. NUnit categories would make more sense here, but I don’t believe ReSharper supports them, and that’s the way I run the tests normally. Likewise the tests which take a very long time (such as counting more than int.MaxValue elements) are normally suppressed.

In order to truly run all my tests, I now have a horrible hack using conditional compilation: if the ALL_TESTS preprocessor symbol is defined, I build my own IgnoreAttribute class in the Edulinq.Tests namespace, which effectively takes precedence over the NUnit one… so NUnit will ignore the [Ignore], so to speak. Frankly all this conditional compilation is pretty horrible, and I wouldn’t use it for a "real" project, but this is a slightly unusual situation.

EDIT: It turns out that ReSharper does support categories. I’m not sure how far that support goes yet, but at the very least there’s "Group by categories" available. I may go through all my tests and apply a category to each one: optimization, execution mode, time-consuming etc. We’ll see whether I can find the energy for that :)

So, let’s have a look at what the test results are…

Edulinq

Unsurprisingly, Edulinq passes all its own tests, with the minor exception of CastTest.OriginalSourceReturnedDueToGenericCovariance running under Mono 3.5, which doesn’t include covariance. Arguably this test should be conditionalised to not even run in that situation, as it’s not expected to work.

Microsoft’s LINQ to Objects

8 failures, all expected:

  • Contains delegates to the ICollection<T>.Contains implementation if it exists, rather than using the default comparer for the type. This is a design and documentation issue which I’ve discussed in more detail in the Contains part of this series.
  • Optimization: ElementAt and ElementAtOrDefault don’t validate the specified index eagerly when the input sequence implements ICollection<T> but not IList<T>.
  • Optimization: OfType always uses an intermediate iterator even when the input sequence already implements IEnumerable<T> and T is a non-nullable value type.
  • Optimization: SequenceEqual doesn’t compare the counts of the sequences eagerly even when both sequences implement ICollection<T>
  • Correctness: OrderByDescending doesn’t work if you use a key comparer which returns int.MinValue
  • Consistency: Single and SingleOrDefault (with a predicate) don’t throw InvalidOperationException as soon as they encounter a second element matching the predicate; the predicate-less overloads do throw as soon as they see a second element.

All of these have been discussed already, so I won’t go into them now.

LinqBridge

LinqBridge had a total of 33 failures. I haven’t looked into them in detail, but just going from the test output I’ve broken them down into the following broad categories:

  • Optimization:
    • Cast never returns the original source, presumably always introducing an intermediate iterator.
    • All three of Microsoft’s "missed opportunities" listed above are also missed in LinqBridge
  • Use of input sequences:
    • Except and Intersect appear to read the first sequence first (possibly completely?) and then the second sequence. Edulinq and LINQ to Objects read the second sequence completely and then stream the first sequence. This behaviour is undocumented.
    • Join, GroupBy and GroupJoin appear not to be deferred at all. If I’m right, this is a definite bug.
    • Aggregation accuracy: both Average and Sum over an IEnumerable<float> appear to use a float accumulator instead of a double. This is probably worth fixing for the sake of both range and accuracy, but isn’t specified in the documentation.
    • OrderBy (etc) appears to apply the key selector multiple times while sorting. The behaviour here isn’t documented, but as I mentioned before, it could produce performance issues unnecessarily.
  • Exceptions:
    • ToDictionary should throw an exception if you give it duplicate keys; it appears not to – at least when a custom comparer is used. (It’s possible it’s just not passing the comparer along.)
    • The generic Max and Min methods don’t return the null value for the element type when that type is nullable. Instead, they throw an exception – which is the normal behaviour if the element type is non-nullable. This behaviour isn’t well documented, but is consistent with the behaviour of the non-generic overloads. See the Min/Max post for more details.
  • General bugs:
    • The generic form of Min/Max appears not to ignore null values when the element type is nullable.
    • OrderByDescending appears to be broken in the same way as Microsoft’s implementation
    • Range appears to be broken around its boundary testing.
    • Join, GroupJoin, GroupBy and ToLookup break when presented with null keys

Mono 4.0 (and 3.5, effectively)

Mono failed 18 of the tests. There are fewer definite bugs than in LinqBridge, but it’s definitely not perfect. Here’s the breakdown:

  • Optimization:
    • Mono misses the same three opportunities that LinqBridge and Microsoft miss.
    • Contains(item) delegates to ICollection<T> when it’s implemented, just like in the Microsoft implementation. (I assume the authors would call this an "optimization", hence its location in this section.) I believe that LinqBridge has the same behaviour, but that test didn’t run in the LinqBridge configuration as it uses HashSet.
  • Average/Sum accumulator types:
    • Mono appears to use float when working with float values, leading to more accumulator error than is necessary.
  • Average overflow for integer types
    • Mono appears to use checked arithmetic when summing a sequence, but not when taking the average of a sequence. So the average of { long.MaxValue, long.MaxValue, 2 } is 0. (This originally confused me into thinking it was using floating point types during the summation, but I now believe it’s just a checked/unchecked issue.)
  • Bugs:
    • Count doesn’t overflow either with or without a predicate
    • The Max handling of double.NaN isn’t in line with .NET. I haven’t investigated the reason for this yet.
    • OrderByDescending is broken in the same way as for LinqBridge and the Microsoft implementation.
    • Range is broken for both Range(int.MinValue, 0) and Range(int.MaxValue, 1). Test those boundary cases, folks :)
    • When reversing a list, Mono doesn’t buffer the current contents. In other words, changes made while iterating over the reversed list are visible in the returned sequence. The documentation isn’t very clear about the desired behaviour here, admittedly.
    • GroupJoin and Join match null keys, unlike Microsoft’s implementation.

How does Edulinq fare against other unit tests?

It didn’t seem fair to only test other implementations against the Edulinq tests. After all, it’s only natural that my tests should work against my own code. What happens if we run the Mono and LinqBridge tests against my code?

The LinqBridge tests didn’t find anything surprising. There were two failures:

  • I don’t have the "delegate Contains to ICollection<T>.Contains" behaviour, which the tests check for.
  • I don’t optimize First in the case of the collection implementing IList<T>. I view this as a pretty dubious optimization to be honest – I doubt that creating an iterator to get to the first item is going to be much slower than checking for IList<T>, fetching the count, and then fetching the first item via the indexer… and it means that all non-list implementations also have to check whether the sequence implements IList<T>. I don’t intend to change Edulinq for this.

The Mono tests picked up the same two failures as above, and two genuine bugs:

  • By implementing Take via TakeWhile, I was iterating too far: in order for the condition to become false, we had to iterate to the first item we wouldn’t return.
  • ToLookup didn’t accept null keys – a fault which propagated to GroupJoin, Join and GroupBy too. (EDIT: It turns out that it’s more subtle than that. Nothing should break, but the MS implementation ignores null keys for Join and GroupJoin. Edulinq now does the same, but I’ve raised a Connect issue to suggest this should at least be documented.)

I’ve fixed these in source control, and will add an addendum to each of the relevant posts (Take, ToLookup) when I have a moment spare.

There’s one additional failure, trying to find the average of a sequence of two Int64.MaxValue values. That overflows on both Edulinq and LINQ to Objects – that’s the downside of using an Int64 to sum the values. As mentioned, Mono suffers a degree of inaccuracy instead; it’s all a matter of trade-offs. (A really smart implementation might use Int64 while possible, and then go up to using Double where necessary, I suppose.)

Unfortunately I don’t have the tests for the Microsoft implementation, of course… I’d love to know whether there’s anything I’ve failed with there.

Conclusion

This was very interesting – there’s a mixture of failure conditions around, and plenty of "non-failures" where each implementation’s tests are enforcing their own behaviour.

I do find it amusing that all three of the "mainstream" implementations have the same OrderByDescending bug though. Other than that, the clear bugs between Mono and LinqBridge don’t intersect, which is slightly surprising.

It’s nice to see that despite not setting out to create a "production-quality" implementation of LINQ to Objects, that’s mostly what I’ve ended up with. Who knows – maybe some aspects of my implementation or tests will end up in Mono in the future :)

Given the various different optimizations mentioned in this post, I think it’s only fitting that next time I’ll discuss where we can optimize, where it’s worth optimizing, and some more tricks we could still pull out of the bag…

Reimplementing LINQ to Objects: Part 38 – What’s missing?

I mentioned before that the Zip operator was only introduced in .NET 4, so clearly there’s a little wiggle room for LINQ to Object’s query operators to grow in number. This post mentions some of the ones I think are most sorely lack – either because I’ve wanted them myself, or because I’ve seen folks on Stack Overflow want them for entirely reasonable use cases.

There is an issue with respect to other LINQ providers, of course: as soon as some useful operators are available for LINQ to Objects, there will be people who want to apply them to LINQ to SQL, the Entity Framework and the like. Worse, if they’re not included in Queryable with overloads based on expression trees, the LINQ to Objects implementation will silently get picked – leading to what looks like a lovely query performing like treacle while the client slurps over the entire database. If they are included in Queryable, then third party LINQ providers could end up with a nasty versioning problem. In other words, some care is needed and I’m glad I’m not the one who has to decide how new features are introduced.

I’ve deliberately not looked at the extra set of operators introduced in the System.Interactive part of Reactive Extensions… nor have I looked back over what we’ve implemented in MoreLINQ (an open source project I started specifically to create new operators). I figured it would be worth thinking about this afresh – but look at both of those projects for actual implementations instead of just ideas.

Currently there’s no implementation of any of this in Edulinq – but I could potentially create an "Edulinq.Extras" assembly which made it all available. Let me know if any of these sounds particularly interesting to see in terms of implementation.

FooBy

I love OrderBy and ThenBy, with their descending cousins. They’re so much cleaner than building a custom comparer which just performs a comparison between two properties. So why stop with ordering? There’s a whole bunch of operators which could do with some "FooBy" love. For example, imagine we have a list of files, and we want to find the longest one. We don’t want to perform a total ordering by size descending, nor do we want to find the maximum file size itself: we want the file with the maximum size. I’d like to be able to write that query as:

FileInfo biggestFile = files.MaxBy(file => file.Length);

Note that we can get a similar result by performing one pass to find the maximum length, and then another pass to find the file with that length. However, that’s inefficient and assumes we can read the sequence twice (and get the same results both times). There’s no need for that. We could get the same result using Aggregate with a pretty complicated aggregation, but I think this is a sufficiently common case to deserve its own operator.

We’d want to specify which value would be returned if multiple files had the same length (my suggestion would be the first one we encountered with that length) and we could also specify a key comparer to use. The signatures would look like this:

public static TSource MaxBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static TSource MaxBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)

Now it’s not just Max and Min that gain from this "By" idea. It would be useful to apply the same idea to the set operators. The simplest of these to think about would be DistinctBy, but UnionBy, IntersectBy and ExceptBy would be reasonable too. In the case of ExceptBy and IntersectBy we could potentially take the key collection to indicate the keys of the elements we wanted to exclude/include, but it would probably be more consistent to force the two input sequences to be of the same type (as they would have to be for UnionBy and IntersectBy of course). ContainsBy might be useful, but that would effectively be a Select followed by a normal Contains – possibly not useful enough to merit its own operator.

TopBy and TopByDescending

These may sound like they belong in the FooBy section, but they’re somewhat different: they’re effectively specializations of OrderBy and OrderByDescending where you already know how many elements you want to preserve. The return type would be IOrderedEnumerable<T> so you could still use ThenBy/ThenByDescending as normal. That would make the following two queries equivalent – but the second might be a lot more efficient than the first:

var takeQuery = people.OrderBy(p => p.LastName)
                      .ThenBy(p => p.FirstName)
                      .Take(3);

var topQuery = people.TopBy(p => p.LastName, 3)
                     .ThenBy(p => p.FirstName);

An implementation could easily delegate to various different strategies depending on the number given – for example, if you asked for more than 10 values, it may not be worth doing anything more than a simple sort and restrict the output. If you asked for just the top 3 values, that could return an IOrderedEnumerable implementation specifically hard-coded to 3 values, etc.

Aside from anything else, if you were confident in what the implementation did (and that’s a very big "if") you could use a potentially huge input sequence with such a query – larger than you could fit into memory in one go. That’s fine if you’re only keeping the top three values you’ve seen so far, but would fail for a complete ordering, even one which was able to yield results before performing all the ordering: if it doesn’t know you’re going to stop after three elements, it can’t throw anything away.

Perhaps this is too specialized an operator – but it’s an interesting one to think about. It’s worth noting that this probably only makes sense for LINQ to Objects, which never gets to see the whole query in one go. Providers like LINQ to SQL can optimize queries of the form OrderBy(…).ThenBy(…).Take(…) because by the time they need to translate the query into SQL, they will have an expression tree representation which includes the "Take" part.

TryFastCount and TryFastElementAt

One of the implementation details of Edulinq is its TryFastCount method, which basically encapsulates the logic around attempting to find the count of a sequence if it implements ICollection or ICollection<T>. Various built-in LINQ operators find this useful, and anyone writing their own operators has a reasonable chance of bumping into it as well. It seems pointless to duplicate the code all over the place… why not expose it? The signatures might look something like this:

public static bool TryFastCount<TSource>(
    this IEnumerable<TSource> source,
    out int count)

public static bool TryFastElementAt<TSource>(
    this IEnumerable<TSource> source,
    int index,
    out TSource value)

I would expect TryFastElementAt to use the indexer if the sequence implemented IList<T> without performing any validation: that ought to be the responsibility of the caller. TryFastCount could use a Nullable<int> return type instead of the return value / out parameter split, but I’ve kept it consistent with the methods which exist elsewhere in the framework

Scan and SelectAdjacent

These are related operators in that they deal with wanting a more global view than just the current element. Scan would act similarly to Aggregate – except that it would yield the accumulator value after each element. Here’s an example of keeping a running total:

// Signature:
public static IEnumerable<TAccumulate> Scan<TSource, TAccumulate>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func)
    

int[] source = new int[] { 3, 5, 2, 1, 4 };
var query = source.Scan(0, (current, item) => current + item);
query.AssertSequenceEqual(3, 8, 10, 11, 15);

There could be a more complicated overload with an extra conversion from TAccumulate to an extra TResult type parameter. That would let us write a Fibonacci sequence query in one line, if we really wanted to…

The SelectAdjacent operator would simply present a selector function with pairs of adjacent items. Here’s a similar example, this time calculating the difference between each pair:

// Signature:
public static IEnumerable<TResult> SelectAdjacent<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TSource, TResult> selector)
    

int[] source = new int[] { 3, 5, 2, 1, 4 };
var query = source.SelectAdjacent((current, next) => next – current);
query.AssertSequenceEqual(2, -3, -1, 3);

One oddity here is that the result sequence always contains one item fewer than the source sequence. If we wanted to keep the length the same, there are various approaches we could take – but the best one would depend on the situation.

This sounds like a pretty obscure operator, but I’ve actually seen quite a few LINQ questions on Stack Overflow where it could have been useful. Is it useful often enough to deserve its own operator? Maybe… maybe not.

DelimitWith

This one is really just a bit of a peeve – but again, it’s a pretty common requirement. We often want to take a sequence and create a single string which is (say) a comma-delimited version. Yay, String.Join does exactly what we need – particularly in .NET 4, where there’s an overload taking IEnumerable<T> so you don’t need to convert it to a string array first. However, it’s still a static method on string – and the name "Join" also looks slightly odd in the context of a LINQ query, as it’s got nothing to do with a LINQ-style join.

Compare these two queries: which do you think reads better, and feels more "natural" in LINQ?

// Current state of play…
var names = string.Join(",",
                        people.Where(p => p.Age < 18)
                              .Select(p => p.FirstName));

// Using DelimitWith
var names = people.Where(p => p.Age < 18)
                  .Select(p => p.FirstName)
                  .DelimitWith(",");

I know which I prefer :)

ToHashSet

(Added on February 23rd 2011.)

I’m surprised I missed this one first time round – I’ve bemoaned its omission in various places before now. It’s easy to create a list, dictionary, lookup or array from an anonymous type, but you can’t create a set that way. That’s mad, given how simple the relevant operator is, even with an overload for a custom equality comparer:

public static HashSet<TSource> ToHashSet<TSource>(
    this IEnumerable<TSource> source)
{
    return source.ToHashSet(EqualityComparer<TSource>.Default);
}

public static HashSet<TSource> ToHashSet<TSource>(
    this IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return new HashSet<TSource>(source, comparer ?? EqualityComparer<TSource>.Default);
}

This also makes it much simpler to create a HashSet in a readable way from an existing query expression, without either wrapping the whole query in the constructor call or using a local variable.

Conclusion

These are just the most useful extra methods I thought of, based on the kinds of query folks on Stack Overflow have asked about. I think it’s interesting that some are quite general – MaxBy, ExceptBy, Scan and so on – whereas others (TopBy, SelectAdjacent and particularly DelimitWith) are simply aimed at making some very specific but common situations simpler. It feels to me like the more general operators really are missing from LINQ – they would fit quite naturally – but the more specific ones probably deserve to be in a separate static class, as "extras".

This is only scratching the surface of what’s possible, of course – System.Interactive.EnumerableEx in Reactive Extensions has loads of options. Some of them are deliberate parallels of the operators in Observable, but plenty make sense on their own too.

One operator you may have expected to see in this list is ForEach. This is a controversial topic, but Eric Lippert has written about it very clearly (no surprise there, then). Fundamentally LINQ is about querying a sequence, not taking action on it. ForEach breaks that philosophy, which is why I haven’t included it here. Usually a foreach statement is a perfectly good alternative, and make the "action" aspect clearer.

Reimplementing LINQ to Objects: Part 37 – Guiding principles

Now that I’m "done" reimplementing LINQ to Objects – in that I’ve implemented all the methods in System.Linq.Enumerable – I wanted to write a few posts looking at the bigger picture. I’m not 100% sure of what this will consist of yet; I want to avoid this blog series continuing forever. However, I’m confident it will contain (in no particular order):

  • This post: principles governing the behaviour of LINQ to Objects
  • Missing operators: what else I’d have liked to see in Enumerable
  • Optimization: where the .NET implementation could be further optimized, and why some obvious-sounding optimizations may be inappropriate
  • How query expression translations work, in brief (and with a cheat sheet)
  • The difference between IQueryable<T> and IEnumerable<T>
  • Sequence identity, the "Contains" issue, and other knotty design questions
  • Running the Edulinq tests against other implementations

If there are other areas you want me to cover, please let me know.

The principles behind the LINQ to Objects implementation

The design LINQ to Objects is built on a few guiding principles, both in terms of design and implementation details. You need to understand these, but also implementations should be clear about what they’re doing in these terms too.

Extension method targets and argument validation

IEnumerable<T> is the core sequence type, not just for LINQ but for .NET as a whole. Almost everything is written in terms of IEnumerable<T> at least as input, with the following exceptions:

  • Empty, Range and Repeat don’t have input sequences (these are the only non-extension methods)
  • OfType and Cast work on the non-generic IEnumerable type instead
  • ThenBy and ThenByDescending work on IOrderedEnumerable<T>

All operators other than AsEnumerable verify that any input sequence is non-null. This validation is performed eagerly (i.e. when the method is called) even if the operator uses deferred execution for the results. Any delegate used (typically a projection or predicate of some kind) must be non-null. Again, this validation is performed eagerly.

IEqualityComparer<T> is used for all custom equality comparisons. Any parameter of this type may be null, in which case the default equality comparer for the type is used. In most cases the default equality comparer for the type is also used when no custom equality comparer is used, but Contains has some odd behaviour around this. Equality comparers are expected to be able to handle null values. IComparer<T> is only used by the OrderBy/ThenBy operators and their descending counterparts – and only then if you want custom comparisons between keys. Again, a null IComparer<T> means "use the default for the type"

Timing of input sequence "opening"

Any operator with a return type of IEnumerable<T> or IOrderedEnumerable<T> uses deferred execution. This means that the method doesn’t read anything from any input sequences until someone starts reading from the result sequence. It’s not clearly defined exactly when input sequences will first be accessed – for some operators if may be when GetEnumerator() is called; for others it may be on the first call to MoveNext() on the resulting iterator. Callers should not depend on these slight variations. Deferred execution is common for operators in the middle of queries. Operators which use deferred execution effectively represent queries rather than the results of queries – so if you change the contents of the original source of the query and then iterate over the query itself again, you’ll see the change. For example:

List<string> source = new List<string>();
var query = source.Select(x => x.ToUpper());
        
// This loop won’t write anything out
foreach (string x in query)
{
    Console.WriteLine(x);
}
        
source.Add("foo");
source.Add("bar");

// This loop will write out "FOO" and "BAR" – even
// though we haven’t changed the value of "query"
foreach (string x in query)
{
    Console.WriteLine(x);
}

Deferred execution is one of the hardest parts of LINQ to understand, but once you do, everything becomes somewhat simpler.

All other operators use immediate execution, fetching all the data they need from the input before they return a value… so that by the time they do return, they will no longer see or care about changes to the input sequence. For operators returning a scalar value (such as Sum and Average) this is blatantly obvious – the value of a variable of type double isn’t going to change just because you’ve added something to a list. However, it’s slightly less for the "ToXXX" methods: ToLookup, ToArray, ToList and ToDictionary. These do not return views on the original sequence, unlike the "As" methods: AsEnumerable which we’ve seen, and Queryable.AsQueryable which I didn’t implement. Focus on the prefix part of the name: the "To" part indicates a conversion to a particular type. The "As" prefix indicates a wrapper of some kind. This is consistent with other parts of the framework, such as List<T>.AsReadOnly and Array.AsReadOnly<T>.

Very importantly, LINQ to Objects only iterates over any input sequence at most once, whether the execution is deferred or immediate. Some operators would be easier to implement if you could iterate over the input twice – but it’s important that they don’t do so. Of course if you provide the same sequence for two inputs, it will treat those as logically different sequences. Similarly if you iterate over a result sequence more than once (for operators that return IEnumerable<T> or a related interface, rather than List<T> or an array etc), that will iterate over the input sequence again.

This means it’s fine to use LINQ to Objects with sequences which may only be read once (such as a network stream), or which are relatively expensive to reread (imagine a log file reader over a huge set of logs) or which give inconsistent results (imagine a sequence of random numbers). In some cases it’s okay to use LINQ to Objects with an infinite sequence – in others it’s not. It’s usually fairly obvious which is the case.

Timing of input sequence reading, and memory usage

Where possible within deferred execution, operators act in a streaming fashion, only reading from the input sequence when they have to, and "forgetting" data as soon as they can. This allows for long – potentially infinite – sequences to be handled elegantly without memory running out.

Some operators naturally need to read all the data in before they can return anything. The most obvious example of this is Reverse, which will always yield the last element of the input stream as the first element in the result stream.

A third pattern occurs with operators such as Distinct, which yield data as they go, but accumulate elements too, taking more and more memory until the caller stops iterating (usually either by jumping out of the foreach loop, or letting it terminate naturally).

Where an operator takes two input sequences – such as Join – you need to understand the consumption of each one separately. For example, Join uses deferred execution, but as soon as you ask for the first element of the result set, it will read the "second" sequence completely and buffer it – whereas the "first" sequence is streamed. This isn’t the case for all operators with two inputs, of course – Zip streams both input sequences, for example. Check the documentation – and the relevant Edulinq blog post – for details.

Obviously any operator which uses immediate execution has to read all the data it’s interested in before it returns. This doesn’t necessarily mean they will read to the end of the sequence though, and they may not need to buffer the data they read. (Simple examples are ToList which has to keep everything, and Sum which doesn’t.)

Queries vs data

Closely related to the details of when the input is read is the concept of what the result of an operator actually represents. Operators which use deferred execution return queries: each time you iterate over the result sequence, the query will look at the input sequence again. The query itself doesn’t contain the data – it just knows how to get at the data.

Operators which use immediate execution work the other way round: they read all the data they need, and then forget about the input sequence. For operators like Average and Sum this is obvious as it’s just a simple scalar value – but for operators like ToList, ToDictionary, ToLookup and ToArray, it means that the operator has to make a copy of everything it needs. (This is potentially a shallow copy of course – depending on what user-defined projections are applied. The normal behaviour of mutable reference types is still valid.)

I realise that in many ways I’ve just said the same thing multiple times now – but hopefully that will help this crucial aspect of LINQ behaviour sink in, if you were still in any doubt.

Exception handling

I’m unaware of any situation in which LINQ to Objects will catch an exception. If your predicate or projection throws an exception, it will propagate in the obvious way.

However, LINQ to Objects does ensure that any iterator it reads from is disposed appropriately – assuming that the caller disposes of any result sequences properly, of course. Note that the foreach statement implicitly disposes of the iterator in a finally block.

Optimization

Various operators are optimized when they detect at execution time that the input sequence they’re working on offers a shortcut.

The types most commonly detected are:

  • ICollection<T> and ICollection for their Count property
  • IList<T> for its random access indexer

I’ll look at optimization in much more detail in a separate post.

Conclusion

This post has not been around the guiding principles behind LINQ itself – lambda calculus or anything like that. It’s more been a summary of the various aspects of behaviour we’ve seen across the various operators we’ve implemented. They’re the rules I’ve had to follow in order to make Edulinq reasonably consistent with LINQ to Objects.

Next time I’ll talk about some of the operators which I think should have made it into the core framework, at least for LINQ to Objects.