Reimplementing LINQ to Objects: Part 45 – Conclusion and List of Posts

February 23, 2011 jonskeet 20 Comments

You may consider it a little odd to have a list of posts as the final part in the series, but it makes sense when you consider that visiting the Edulinq tag page shows results in reverse chronological order. At that point, a newcomer will hopefully hit this post first, and then find it easier to navigate to the first post. Anyway…

Thank you, and good night

When I started this series, I hadn’t realised quite how much there would be to write about. The main thrust was going to be that the implementation of LINQ is simple, and it’s the design that’s clever. As it happened, pretty much every operator ended up raising some interesting issue or other. However, hopefully the series has still "immersed" you in LINQ to Objects to some extent, and clarified how it all hangs together. It would be gratifying to think that the description at the start of each post may end up being used as a sort of "unofficial alternative documentation" with some more details than MSDN provides, but we’ll see whether that happens over time.

A number of people have asked me whether there’ll be an ebook version of this series, and the answer is currently "I don’t know." I have a few plans afoot, but I can’t tell where they’ll lead yet. Suffice to say I like the idea, and I’m looking at some options.

Anyway, thank you for reading as much of the series as you have, and I hope you’ve enjoyed it as much as I have.

Just in case you’re wondering, I’ll probably go back to posting about C# 5’s async support pretty soon…

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 44 – Aspects of Design

February 21, 2011 jonskeet 3 Comments

I promised a post on some questions of design that are raised by LINQ to Objects. I suspect that most of these have already been covered in other posts, but it may well be helpful to talk about them here too. This time I’ve thought about it particularly from the point of view of how other APIs can be built on some of the same design principles, and the awkward choices that LINQ has thrown up.

The power of composability and immutability

Perhaps the most important aspect of LINQ which I’d love other API designers to take on board is that of how complicated queries are constructed from lots of little building blocks. What makes it particularly elegant is that the result of applying each building block is unchanged by anything else you do to it afterwards.

LINQ doesn’t enforce immutability of course – you can start off with a mutable list and change its content at any time, for example, or change the properties of one of the objects referenced within it, or pass in a delegate with side-effects – but LINQ itself won’t introduce side-effects.
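As a small illustration of that composability (a sketch only; the class and variable names are made up for this example), each operator below returns a new sequence, and building further queries on top of an earlier one leaves the earlier one yielding exactly what it did before:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ComposabilityDemo
{
    // Returns the three queries rendered as text, so the effect is easy to check.
    public static string[] Describe()
    {
        int[] numbers = { 1, 2, 3, 4, 5, 6 };

        // Each operator wraps its input; none of them modifies it.
        IEnumerable<int> evens = numbers.Where(x => x % 2 == 0);
        IEnumerable<int> squares = evens.Select(x => x * x);
        IEnumerable<int> firstTwo = squares.Take(2);

        return new[]
        {
            string.Join(", ", evens),
            string.Join(", ", squares),
            string.Join(", ", firstTwo)
        };
    }

    public static void Main()
    {
        foreach (string line in Describe())
        {
            Console.WriteLine(line);
        }
        // Prints:
        // 2, 4, 6
        // 4, 16, 36
        // 4, 16
    }
}
```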

The Task-based Asynchronous Pattern takes a similar approach, allowing composable building blocks of tasks. I’ve seen this pattern in various guises over the years – if you find yourself thinking in terms of a pipeline of some kind, it may well be appropriate, especially if each stage in the pipeline emits the same type as it consumes.

General immutability is a somewhat different design trait of course, but one which can make such a difference. The java.util.{Date,Calendar} classes are horrible, not least because they’re mutable – you can never stash a value away without being concerned that it may get changed by something else. Joda Time has some mutable implementations, but typically the immutable classes are used in a fluent way. Of course, .NET uses value types for various core types to start with, but also makes TimeZoneInfo immutable. For genuine "values" I would highly encourage API designers to at least strongly consider immutable types. They’re not always appropriate by any means, but they can be hugely useful where they fit nicely.

Extension methods on interfaces

It’s no surprise that extension methods are heavily used in LINQ, given that they were effectively introduced into the language in order to enable LINQ in the first place. However, they do work particularly well with interfaces as a way of adding common behaviour.

It also plays very nicely with the pipeline pattern above for creating pipelines in a fluent manner. Even if you just create extension methods which call a constructor to wrap/compose the previous stage in the pipeline, you can still end up with more readable code.

One problem with this is that you can’t "override" behaviour in particular implementations or interfaces which extend the original one – which is why Enumerable.ElementAt() has to detect that a sequence is actually a list, for example. If interfaces allowed method implementations, this wouldn’t be as much of a problem in the situation where you’re in control of the interface – I wouldn’t be at all surprised to find that as a feature of C#’s successor.
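To make the "detect that it's really a list" idea concrete, here is a rough sketch. It is not the framework's code: the method name ElementAtOrThrow is hypothetical (chosen to avoid clashing with the real Enumerable.ElementAt), and the exact checks in a real implementation may differ.

```csharp
using System;
using System.Collections.Generic;

public static class ElementAtSketch
{
    // Hypothetical name; a sketch of the technique, not the real Enumerable.ElementAt.
    public static T ElementAtOrThrow<T>(this IEnumerable<T> source, int index)
    {
        if (source == null)
        {
            throw new ArgumentNullException(nameof(source));
        }
        if (index < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(index));
        }

        // IList<T> cannot "override" an extension method, so the specialisation
        // has to be an explicit type check inside the method itself.
        if (source is IList<T> list)
        {
            if (index >= list.Count)
            {
                throw new ArgumentOutOfRangeException(nameof(index));
            }
            return list[index];
        }

        // Fallback for arbitrary sequences: walk to the requested position.
        foreach (T item in source)
        {
            if (index == 0)
            {
                return item;
            }
            index--;
        }
        throw new ArgumentOutOfRangeException(nameof(index));
    }
}

public static class ElementAtDemo
{
    // An iterator block, so the result is not an IList<int>.
    public static IEnumerable<int> Generate()
    {
        yield return 10;
        yield return 20;
        yield return 30;
    }

    public static void Main()
    {
        Console.WriteLine(new List<int> { 10, 20, 30 }.ElementAtOrThrow(1)); // 20 (indexer path)
        Console.WriteLine(Generate().ElementAtOrThrow(2));                   // 30 (iteration path)
    }
}
```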

The lack of extension properties is also a bit of a handicap in some places, although not as many as one might expect at first glance. For example, even if we could have made Enumerable.Count() a property, would it have been a good idea to do so? Properties give a natural expectation of speed, and Count() is usually an O(n) operation.

Delegates for custom behaviour

In .NET 1.0 and 1.1, most developers used delegates for two purposes:

Handling events in UIs
Passing around behaviour to be executed in a different thread (either via Control.Invoke, or new Thread(ThreadStart), or ThreadPool.QueueUserWorkItem).

.NET 2.0 increased the range of uses of delegates somewhat, particularly with List.ConvertAll and the ability to create delegates relatively easily using anonymous methods.

However, LINQ really brought them into the mainstream. If you’re building an API which benefits from small pieces of custom behaviour, delegates can be a real boon. More complicated behaviour is still often best represented via an interface, and sometimes it’s worth having both interface and delegate representations, like Comparison<T> and IComparer<T>. It’s generally easy to convert between the two – especially if you use a method group conversion from an interface implementation’s method to the delegate type.
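Converting between a comparer interface and the Comparison<T> delegate might look something like this. It is only a sketch: the class and variable names are invented, and Comparer<T>.Create is only available from .NET 4.5 onwards, so it post-dates the framework versions discussed here.

```csharp
using System;
using System.Collections.Generic;

public static class ComparisonDemo
{
    public static List<string> SortBoth()
    {
        var words = new List<string> { "pear", "Apple", "banana" };

        // Interface to delegate: a method group conversion from the
        // interface implementation's Compare method.
        IComparer<string> comparer = StringComparer.OrdinalIgnoreCase;
        Comparison<string> comparison = comparer.Compare;
        words.Sort(comparison); // List<T>.Sort(Comparison<T>)
        Console.WriteLine(string.Join(", ", words)); // Apple, banana, pear

        // Delegate to interface: wrap the delegate in a comparer.
        Comparison<string> byLength = (x, y) => x.Length.CompareTo(y.Length);
        IComparer<string> lengthComparer = Comparer<string>.Create(byLength);
        words.Sort(lengthComparer); // List<T>.Sort(IComparer<T>)
        return words;
    }

    public static void Main()
    {
        List<string> sorted = SortBoth();
        Console.WriteLine(sorted[0]); // pear (the only four-letter word)
    }
}
```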

Laziness

One aspect of LINQ which is both a blessing and a curse is its laziness, both in terms of deferred execution (not reading from the input sequence at all until the result sequence is read) and in terms of streaming the data (only reading as much information from the input sequence as is required to answer the immediate needs of the caller).

This is great in various ways, particularly as it means you can build a complex query and use it multiple times, sometimes as a basis for other queries, knowing that it won’t actually do anything until you ask for real results. It also means that you can iterate over huge data sets, so long as you’re careful.

On the other hand, it leads to subtle issues over when code is actually executing, makes debugging harder to understand, makes it easier to accidentally change the values of captured variables between the point at which you create the query and the point at which you execute it, and basically messes with your head. This is probably the aspect of LINQ which confuses newbies more than any other.
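The captured-variable trap is easy to reproduce. In this sketch (names invented for the example), both the captured variable and the source list are changed after the query is created but before it is executed, and the results reflect the later values:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazinessDemo
{
    public static string Evaluate()
    {
        var numbers = new List<int> { 1, 2, 3, 4, 5 };
        int threshold = 2;

        // Nothing is read from "numbers" here; the lambda captures the
        // variable "threshold", not its current value.
        IEnumerable<int> query = numbers.Where(x => x > threshold);

        // Both changes happen after the query was built...
        threshold = 4;
        numbers.Add(6);

        // ...but before it is executed, so they both affect the results.
        return string.Join(", ", query);
    }

    public static void Main()
    {
        Console.WriteLine(Evaluate()); // 5, 6
    }
}
```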

I’m not saying it was the wrong decision for LINQ – but I would caution API designers to think carefully before introducing laziness, and to document it really thoroughly. Likewise if your API might end up returning a result which is "gradually evaluated" (streaming data etc), this should be made clear.

When names collide: options for consistency

Just in case you’ve forgotten, this is the irritation with the meaning of source.Contains(element). In order to check whether a sequence contains an element or not, there has to be some idea of equality – for example, if you’re trying to find one string in a sequence of strings, are you trying to match in a case-sensitive manner or not?

There’s an overload for Enumerable.Contains which allows you to specify the equality comparer to use, but the question is what should happen when you let the implementation pick the comparer.

For every other method in Enumerable, the default equality comparer for the sequence type – i.e. EqualityComparer<TSource>.Default – is picked. That sounds like source.Contains(element) should use the element type’s default comparer too, right? Well, in some cases that’s what will happen… but not if the source implements ICollection<T>, which has its own Contains method which doesn’t take an equality comparer. If that’s the case, LINQ to Objects delegates to the collection’s Contains method.

So, we have three kinds of consistency here:

Consistency of compile-time type: it would be nice if the behaviour of source.Contains(element) was the same whether "source" is of type IEnumerable<T> or ICollection<T>
Consistency of API: it would be nice if Contains behaved the same way as other methods which have overloads with and without equality comparers
Consistency of model: if you consider "source" to be just a sequence of elements, it shouldn’t affect the result (not just the speed) if the object actually implements ICollection<T>

I should point out that this will only be a problem if the collection uses a different notion of equality to the default equality comparer for the type. The most obvious example of this is if you have a HashSet<string> which uses a case-insensitive equality comparer. But it’s still a valid concern.
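The HashSet<string> case can be demonstrated in a few lines. This is a sketch which relies on the delegation to ICollection<T>.Contains described above, as observed in Microsoft's implementation; the class and variable names are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ContainsDemo
{
    public static bool[] Check()
    {
        var set = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "ABC" };

        // The compile-time type is IEnumerable<string>, so this binds to Enumerable.Contains...
        IEnumerable<string> asSequence = set;

        // ...which spots ICollection<string> and ends up using the set's own comparer.
        bool viaCollection = asSequence.Contains("abc");

        // Hiding the collection behind a projection means the default
        // (case-sensitive) comparer for string is used instead.
        bool viaPlainSequence = asSequence.Select(x => x).Contains("abc");

        // Being explicit about the comparer removes the ambiguity.
        bool viaExplicitComparer = asSequence.Contains("abc", StringComparer.OrdinalIgnoreCase);

        return new[] { viaCollection, viaPlainSequence, viaExplicitComparer };
    }

    public static void Main()
    {
        foreach (bool result in Check())
        {
            Console.WriteLine(result);
        }
        // Prints True, False, True
    }
}
```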

So what should the API designer do in this kind of case? Admittedly LINQ to Objects is already in a slightly unusual position as it’s based on an existing interface with known and very common interfaces extending that core one… it’s less likely to come up with other APIs. However, I think it might be enough of a smell to suggest that changing the name of the method to "ContainsElement" or something similar would be worthwhile. It’s unfortunate that "Contains" really is the obvious choice…

This issue raises another aspect of API decision I’ve considered in the past… if there’s a common way of doing something in the framework you’re building on top of, but you consider it to be broken, should you abide by that breakage for the sake of familiarity and consistency, or should you strive to be as "clean" as possible? I think it needs to be considered on a case-by-case basis, but I suspect I would usually come down on the side of cleanliness.

Documentation details

Almost all APIs are badly documented – it’s a fact of life, even with some of the best APIs I’ve worked with. I doubt that Noda Time will be a shining example either. However, at the risk of being hypocritical I’ll say that documentation is worth spending significant thought on. Not just the time taken to document your code – but the time taken to consider what you want to guarantee, what should be left unstated, and what should be explicitly left open.

For example, there’s no indication in the documentation of Cast that it will sometimes return the original source value, nor in its companion OfType method that it will never return the original source reference. This might be important to someone – why not state it? It’s possible to state the possibility without saying what cases it applies to of course, leaving some wiggle room in the future. You might consider some of the optimizations in the same way – when should an optimization be documented and when should it be implicit? Sometimes it can make a difference beyond just performance, even if only in "odd" situations (such as a predicate throwing an exception).
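The Cast and OfType behaviour is easy to observe, although, as noted, it is not something the documentation promises. This sketch (with invented names) uses a sequence of strings, where OfType has to be able to filter out null references and so cannot simply hand back the source:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CastIdentityDemo
{
    public static bool CastReturnsSource()
    {
        IEnumerable<string> source = new List<string> { "a", "b" };
        // The source is already an IEnumerable<string>, so Cast<string> can return it as-is.
        return ReferenceEquals(source, source.Cast<string>());
    }

    public static bool OfTypeReturnsSource()
    {
        IEnumerable<string> source = new List<string> { "a", "b" };
        // OfType<string> must skip nulls, so it wraps the source in a new sequence.
        return ReferenceEquals(source, source.OfType<string>());
    }

    public static void Main()
    {
        Console.WriteLine(CastReturnsSource());   // True
        Console.WriteLine(OfTypeReturnsSource()); // False
    }
}
```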

If you’re used to defensive coding with Code Contracts, it’s much the same type of decision – and again, it’s similar to deciding whether a method should return IEnumerable<T>, IList<T> or List<T>. There’s a balance between caller convenience, design cleanliness (where you only want to emphasize one interface aspect, even if it also happens to always return a particular type), and room for the implementation to change in the future.

Another example of considering the level of detail to document is when it comes to how input sequences are used in LINQ to Objects. What does it mean to say "this method uses deferred execution" exactly? If I call GetEnumerator() eagerly but defer the call to MoveNext(), is that still "deferred execution"? Should the documentation state when a sequence is buffered and when it’s streamed? Should it guarantee the order of the result sequence when the natural implementation makes that order easy to describe (e.g. for Distinct)? In this series I’ve tried to be as clear as possible about what actually happens – but that’s not to say that in some cases, the documentation wasn’t left deliberately ambiguous.

Conclusion

There are many other design considerations that I haven’t gone into here – particularly optimization, which I’ve already covered twice, probably saying everything I wanted to say here anyway.

I may add a few more bits to this post over time… but aside from that, I think I’m fundamentally done. I’ll write one more conclusion post, then declare Edulinq closed…

C#, Edulinq, LINQ

Reimplementing LINQ to Objects: Part 43 – Out-of-process queries with IQueryable

February 20, 2011 jonskeet 9 Comments

I’ve been putting off writing about this for a while now, mostly because it’s such a huge topic. I’m not going to try to give more than a brief introduction to it here – don’t expect to be able to whip up your own LINQ to SQL implementation afterwards – but it’s worth at least having an idea of what happens when you use something like LINQ to SQL, NHibernate or the Entity Framework.

Just as LINQ to Objects is primarily interested in IEnumerable<T> and the static Enumerable class, so out-of-process LINQ is primarily interested in IQueryable<T> and the static Queryable class… but before we get to them, we need to talk about expression trees.

Expression Trees

To put it in a nutshell, expression trees encapsulate logic in data instead of code. While you can introspect .NET code via MethodBase.GetMethodBody and then MethodBody.GetILAsByteArray, that’s not really a practical approach. The types in the System.Linq.Expressions namespace define expressions in an easier-to-process manner. When expression trees were introduced in .NET 3.5, they were strictly for expressions, but the Dynamic Language Runtime uses expression trees to represent operations, and the range of logic represented had to expand accordingly, to include things like blocks.

While you certainly can build expression trees yourself (usually via the factory methods on the nongeneric Expression class), and it’s fun to do so at times, the most common way of creating them is to use the C# compiler’s support for them via lambda expressions. So far we’ve always seen a lambda expression being converted to a delegate, but it can also convert lambdas to instances of Expression<TDelegate>, where TDelegate is a delegate type which is compatible with the lambda expression. A concrete example will help here. The statement:

Expression<Func<int, int>> addOne = x => x + 1;

will be compiled into code which is effectively something like this:

var parameter = Expression.Parameter(typeof(int), "x");
var one = Expression.Constant(1, typeof(int));
var addition = Expression.Add(parameter, one);
var addOne = Expression.Lambda<Func<int, int>>(addition, new ParameterExpression[] { parameter });

The compiler has some tricks up its sleeves which allow it to refer to methods, events and the like in a simpler way than we can from code, but largely you can regard the transformation as just a way of making life a lot simpler than if you had to build the expression trees yourself every time.
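To see the "logic as data" point in action, the following sketch builds the same tree both ways, inspects the body, and then compiles it into a delegate. The class and method names are invented for the example.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionTreeDemo
{
    // The hand-built equivalent of: x => x + 1
    public static Expression<Func<int, int>> BuildAddOne()
    {
        ParameterExpression parameter = Expression.Parameter(typeof(int), "x");
        ConstantExpression one = Expression.Constant(1, typeof(int));
        BinaryExpression addition = Expression.Add(parameter, one);
        return Expression.Lambda<Func<int, int>>(addition, parameter);
    }

    public static void Main()
    {
        Expression<Func<int, int>> fromLambda = x => x + 1;
        Expression<Func<int, int>> byHand = BuildAddOne();

        // As data: the body can be examined at execution time.
        Console.WriteLine(fromLambda.Body.NodeType); // Add
        Console.WriteLine(byHand.Body.NodeType);     // Add

        // As code: the tree can be compiled into an ordinary delegate.
        Func<int, int> compiled = byHand.Compile();
        Console.WriteLine(compiled(41)); // 42
    }
}
```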

IQueryable, IQueryable<T> and IQueryProvider

Now that we’ve got the idea of being able to inspect logic relatively easily at execution time, let’s see how it applies to LINQ.

There are three interfaces to introduce, and it’s probably easiest to start with how they appear in a class diagram:

Most of the time, queries are represented using the generic IQueryable<T> interface, but this doesn’t actually add much over the nongeneric IQueryable interface it extends, other than also extending IEnumerable<T> – so you can iterate over the contents of an IQueryable<T> just as with any other sequence.

IQueryable contains the interesting bits, in the form of three properties: ElementType indicates the type of the elements within the query (in other words, a dynamic form of the T from IQueryable<T>), Expression returns the expression tree for the query so far, and Provider returns the query provider which is responsible for creating new queries and executing the existing one. We won’t need to use the ElementType property ourselves, but we’ll need both the Provider and Expression properties.
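A quick way to get a feel for the three properties is to look at them on a query created by the framework's own AsQueryable method. This is just a sketch with invented names; the concrete provider type and the exact shape of the initial expression are implementation details, so the code prints them without assuming what they are.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class QueryablePropertiesDemo
{
    public static IQueryable<int> CreateQuery()
    {
        return new List<int> { 3, 5, 1 }.AsQueryable();
    }

    public static void Main()
    {
        IQueryable<int> query = CreateQuery();

        // The dynamic form of the T in IQueryable<T>.
        Console.WriteLine(query.ElementType); // System.Int32

        // The expression tree for the query so far, and the provider behind it.
        Console.WriteLine(query.Expression.NodeType);
        Console.WriteLine(query.Provider.GetType().Name);

        // IQueryable<T> extends IEnumerable<T>, so it can be iterated directly.
        foreach (int value in query)
        {
            Console.WriteLine(value);
        }
    }
}
```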

The static Queryable class

We’re not going to implement any of the interfaces ourselves, but I’ve got a small sample program to demonstrate how they all work, imagining we were implementing most of Queryable ourselves. This static class contains extension methods for IQueryable<T> just as Enumerable does for IEnumerable<T>. Most of the query operators from LINQ to Objects appear in Queryable as well, but there are a few notable omissions, such as the To{Lookup, Array, List, Dictionary} methods. If you call one of those on an IQueryable<T>, the Enumerable implementations will be used instead. (IQueryable<T> extends IEnumerable<T>, so the extension methods in Enumerable are applicable to IQueryable<T> sequences as well.)

The big difference between the Queryable and Enumerable methods in terms of their declarations is in the parameters:

The "source" parameter in Queryable is always of type IQueryable<TSource> instead of IEnumerable<TSource>. (Other sequence parameters such as the sequence to concatenate for Queryable.Concat are expressed as IEnumerable<T>, interestingly enough. This allows you to express a SQL query using "local" data as well; the query methods work out whether the sequence is actually an IQueryable<T> and act accordingly.)
Any parameters which were delegates in Enumerable are expression trees in Queryable; so while the selector parameter in Enumerable.Select is of type Func<TSource, TResult>, the equivalent in Queryable.Select is of type Expression<Func<TSource, TResult>>

The big difference between the methods in terms of what they do is that whereas the Enumerable methods actually do the work (eventually – possibly after deferred execution of course), the Queryable methods themselves really don’t do any work: they just ask the query provider to build up a query indicating that they’ve been called.

Let’s have a look at Where for example. If we wanted to implement Queryable.Where, we would have to:

Perform argument checking
Get the "current" query’s Expression
Build a new expression representing a call to Queryable.Where using the current expression as the source, and the predicate expression as the predicate
Ask the current query’s provider to build a new IQueryable<T> based on that call expression, and return it.

It all sounds a bit recursive, I realize – the Where call needs to record that a Where call has happened… but that’s all. You may very well wonder where all the work is happening. We’ll come to that.

Now building a call expression is slightly tedious because you need to have the right MethodInfo – and as Where is overloaded, that means distinguishing between the two Where methods, which is easier said than done. I’ve actually used a LINQ query to find the right overload – the one where the predicate parameter is of type Expression<Func<T, bool>> rather than Expression<Func<T, int, bool>>. In the .NET implementation, methods can use MethodBase.GetCurrentMethod() instead… although equally they could have created a bunch of static variables computed at class initialization time. We can’t use GetCurrentMethod() for experimentation purposes, because the query provider is likely to expect the exact correct method from System.Linq.Queryable in the System.Core assembly.
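As an aside, the "static variables computed at class initialization time" option could be sketched like this. It is not what the sample implementation below does: it is an alternative technique which lets the C# compiler perform overload resolution inside a throwaway expression tree, and then reads the MethodInfo back out of that tree. The QueryableMethods class name is hypothetical.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;
using System.Reflection;

public static class QueryableMethods
{
    // Computed once, when the class is initialized.
    public static readonly MethodInfo Where = FindWhere();

    private static MethodInfo FindWhere()
    {
        // The compiler picks the overload taking Expression<Func<T, bool>>,
        // because the lambda has a single parameter. The type argument
        // (object) is arbitrary; it is stripped off again below.
        Expression<Func<IQueryable<object>, IQueryable<object>>> sample =
            q => q.Where(x => true);

        var call = (MethodCallExpression) sample.Body;

        // Go from Queryable.Where<object> back to the "open" Queryable.Where<TSource>.
        return call.Method.GetGenericMethodDefinition();
    }

    public static void Main()
    {
        Console.WriteLine(Where.DeclaringType); // System.Linq.Queryable
        Console.WriteLine(Where.IsGenericMethodDefinition); // True

        // The result can be closed over any element type in the usual way.
        MethodInfo closed = Where.MakeGenericMethod(typeof(int));
        Console.WriteLine(closed.ReturnType == typeof(IQueryable<int>)); // True
    }
}
```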

Here’s our sample implementation, broken up quite a lot to make it easier to understand:

public static IQueryable<TSource> Where<TSource>(
    this IQueryable<TSource> source,
    Expression<Func<TSource, bool>> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }

    Expression sourceExpression = source.Expression;
    Expression quotedPredicate = Expression.Quote(predicate);

    // This gets the "open" method, without specific type arguments. The second parameter
    // of the method we want is of type Expression<Func<TSource, bool>>, so the sole generic
    // type argument to Expression<T> itself has two generic type arguments.
    // Let’s face it, reflection on generic methods is a mess.
    MethodInfo method = typeof(Queryable).GetMethods()
                                         .Where(m => m.Name == "Where")
                                         .Where(m => m.GetParameters()[1]
                                                      .ParameterType
                                                      .GetGenericArguments()[0]
                                                      .GetGenericArguments().Length == 2)
                                         .First();

    // This gets the method with the same type arguments as ours
    MethodInfo closedMethod = method.MakeGenericMethod(new Type[] { typeof(TSource) });

    // Now we can create a *representation* of this exact method call
    Expression methodCall = Expression.Call(closedMethod, sourceExpression, quotedPredicate);

    // … and ask our query provider to create a query for it
    return source.Provider.CreateQuery<TSource>(methodCall);
}

There’s only one part of this code that I don’t really understand the need for, and that’s the call to Expression.Quote on the predicate expression tree. I’m sure there’s a good reason for it, but this particular example would work without it, as far as I can see. The real implementation uses it though, so I dare say it’s required in some way.

EDIT: Daniel’s comment has made this somewhat clearer to me. Each of the arguments to Expression.Call after the MethodInfo itself is meant to be an expression which represents the argument to the method call. In our example we need an expression which represents an argument of type Expression<Func<TSource, bool>>. We already have the value, but we need to provide the layer of wrapping… just as we did with Expression.Constant in the very first expression tree I showed at the top. To wrap the expression value we’ve got, we use Expression.Quote. It’s still not clear to me exactly why we can use Expression.Quote but not Expression.Constant, but at least it’s clearer why we need something…

EDIT: I’m gradually getting there. This Stack Overflow answer from Eric Lippert has much to say on the topic. I’m still trying to get my head round it, but I’m sure when I’ve read Eric’s answer several times, I’ll get there.

We can even test that this works, by using the Queryable.AsQueryable method from the real .NET implementation. This creates an IQueryable<T> from any IEnumerable<T> using a built-in query provider. Here’s the test program, where FakeQueryable is a static class containing the extension method above:

using System;
using System.Collections.Generic;
using System.Linq;

class Test
{
    static void Main()
    {
        List<int> list = new List<int> { 3, 5, 1 };
        IQueryable<int> source = list.AsQueryable();
        IQueryable<int> query = FakeQueryable.Where(source, x => x > 2);

        foreach (int value in query)
        {
            Console.WriteLine(value);
        }
    }
}

This works, printing just 3 and 5, filtering out the 1. Yay! (I’m explicitly calling FakeQueryable.Where rather than letting extension method resolution find it, just to make things clearer.)

Um, but what’s doing the actual work? We’ve implemented the Where clause without providing any filtering ourselves. It’s really the query provider which has built an appropriate IQueryable<T> implementation. When we call GetEnumerator() implicitly in the foreach loop, the query can examine everything that’s built up in the expression tree (which could contain multiple operators – it’s nesting queries within queries, essentially) and work out what to do. In the case of our IQueryable<T> built from a list, it just does the filtering in-process… but if we were using LINQ to SQL, that’s when the SQL would be generated. The provider recognizes the specific methods from Queryable, and applies filters, projections etc. That’s why it was important that our demo Where method pretended that the real Queryable.Where had been called – otherwise the query provider wouldn’t know what the call expression meant.

Just to hammer the point home even further… Queryable itself neither knows nor cares what kind of data source you’re using. Its job is not to perform any query operations itself; its job is to record the requested query operations in a source-agnostic manner, and let the source provider handle them when it needs to.
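To see what has actually been recorded, you can walk the expression tree of a query built with the real Queryable methods. This sketch (invented names again) peels off one method call at a time, following the first argument of each call back towards the original source, much as a provider might when working out what to do:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class RecordedCallsDemo
{
    // Lists the recorded operators, outermost first.
    public static List<string> DescribeOperators(IQueryable query)
    {
        var operators = new List<string>();
        Expression current = query.Expression;

        // Each operator is a call whose first argument is the query it was applied to.
        while (current is MethodCallExpression call)
        {
            operators.Add(call.Method.DeclaringType?.Name + "." + call.Method.Name);
            current = call.Arguments[0];
        }
        return operators;
    }

    public static void Main()
    {
        IQueryable<int> query = new List<int> { 3, 5, 1 }
            .AsQueryable()
            .Where(x => x > 2)
            .Select(x => x * 10);

        foreach (string description in DescribeOperators(query))
        {
            Console.WriteLine(description);
        }
        // Prints:
        // Queryable.Select
        // Queryable.Where
    }
}
```

Nothing has been filtered or projected at this point; the query is just a description of two calls, waiting for the provider to interpret it.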

Immediate execution with IQueryProvider.Execute

All the operators using deferred execution in Queryable are implemented in much the same way as our demo Where method. However, that doesn’t cover the situation where we need to execute the query now, because it has to return a value directly instead of another query.

This time I’m going to use ElementAt as the sample, simply because it’s only got one overload, which makes it very easy to grab the relevant MethodInfo. The general procedure is exactly the same as building a new query, except that this time we call the provider’s Execute method instead of CreateQuery.

public static TSource ElementAt<TSource>(this IQueryable<TSource> source, int index)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    Expression sourceExpression = source.Expression;
    Expression indexExpression = Expression.Constant(index);

    MethodInfo method = typeof(Queryable).GetMethod("ElementAt");
    MethodInfo closedMethod = method.MakeGenericMethod(new Type[] { typeof(TSource) });

    // Now we can create a *representation* of this exact method call
    Expression methodCall = Expression.Call(closedMethod, sourceExpression, indexExpression);

    // … and ask our query provider to execute it
    return source.Provider.Execute<TSource>(methodCall);
}

The type argument we provide to Execute is the desired return type – so for Count, we’d call Execute<int> for example. Again, it’s up to the query provider to work out what the call actually means.
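For instance, a Count equivalent could be sketched along the same lines. As before, this is only an illustration in the imagined FakeQueryable class, not the framework's code. Count has two overloads, so this sketch picks the one with a single parameter:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;
using System.Reflection;

public static class FakeQueryable
{
    public static int Count<TSource>(this IQueryable<TSource> source)
    {
        if (source == null)
        {
            throw new ArgumentNullException(nameof(source));
        }

        // Count is overloaded (with and without a predicate);
        // we want the overload which only takes the source.
        MethodInfo method = typeof(Queryable).GetMethods()
            .Single(m => m.Name == "Count" && m.GetParameters().Length == 1);
        MethodInfo closedMethod = method.MakeGenericMethod(typeof(TSource));

        Expression methodCall = Expression.Call(closedMethod, source.Expression);

        // The type argument is the return type of the operator: int, not TSource.
        return source.Provider.Execute<int>(methodCall);
    }
}

public static class CountDemo
{
    public static void Main()
    {
        IQueryable<string> source = new List<string> { "a", "b", "c" }.AsQueryable();

        // Called explicitly to avoid any ambiguity with the real Queryable.Count.
        Console.WriteLine(FakeQueryable.Count(source)); // 3
    }
}
```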

It’s worth mentioning that both CreateQuery and Execute have generic and non-generic overloads. I haven’t personally encountered a use for the non-generic ones, but I gather they’re useful for various situations in generated code, particularly if you really don’t know the element type – or at least only know it dynamically, and don’t want to have to use reflection to generate an appropriate generic method call.

Transparent support in source code

One of the aspects of LINQ which raises it to the "genius" status (and "slightly scary" at the same time) is that most of the time, most developers don’t need to make any changes to their source code in order to use Enumerable or Queryable. Take this query expression and its translation:

var query = from person in family
            where person.LastName == "Skeet"
            select person.FirstName;

// Translation
var query = family.Where(person => person.LastName == "Skeet")
                  .Select(person => person.FirstName);

Which set of query methods will that use? It entirely depends on the compile-time type of the "family" variable. If that’s a type which implements IQueryable<T>, it will use the extension methods in Queryable, the lambda expressions will be converted into expression trees, and the type of "query" will be IQueryable<string>. Otherwise (and assuming the type implements IEnumerable<T> and isn’t some other interesting type such as ParallelEnumerable) it will use the extension methods in Enumerable, the lambda expressions will be converted into delegates, and the type of "query" will be IEnumerable<string>.

The query expression translation part of the specification has no need to care about this, because it’s simply translating into a form which uses lambda expressions – the rest of overload resolution and lambda expression conversion deals with the details.
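Here is a sketch of the same query expression being compiled against both kinds of source. The Describe overloads are invented helpers: they are resolved by the compile-time type of their argument, in much the same way as the query methods themselves, so they report which "world" each query ended up in.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TransparentSupportDemo
{
    // Overload resolution on the compile-time type picks one of these.
    public static string Describe<T>(IEnumerable<T> query) { return "IEnumerable"; }
    public static string Describe<T>(IQueryable<T> query) { return "IQueryable"; }

    public static string[] Run()
    {
        var names = new List<string> { "Jon", "Holly", "Tom" };
        IEnumerable<string> inProcess = names;
        IQueryable<string> queryable = names.AsQueryable();

        // Identical query expressions; only the type of the source differs.
        var first = from name in inProcess
                    where name.Length > 3
                    select name.ToUpper();

        var second = from name in queryable
                     where name.Length > 3
                     select name.ToUpper();

        return new[]
        {
            Describe(first) + ": " + string.Join(", ", first),
            Describe(second) + ": " + string.Join(", ", second)
        };
    }

    public static void Main()
    {
        foreach (string line in Run())
        {
            Console.WriteLine(line);
        }
        // Prints:
        // IEnumerable: HOLLY
        // IQueryable: HOLLY
    }
}
```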

Genius… although it does mean you need to be careful that you really know where your query evaluation is going to take place – you don’t want to accidentally end up performing your whole query in-process having shipped the entire contents of a database across a network connection…

Conclusion

This was really a whistlestop tour of the "other" side of LINQ – and without going into any of the details of the real providers such as LINQ to SQL. However, I hope it’s given you enough of a flavour for what’s going on to appreciate the general design. Highlights:

Expression trees are used to capture logic in a data structure which can be examined relatively easily at execution time
Lambda expressions can be converted into expression trees as well as delegates
IQueryable<T> and IQueryable form a sort of parallel interface hierarchy to IEnumerable<T> and IEnumerable – although the queryable forms extend the enumerable forms
IQueryProvider enables one query to be built based on another, or executed immediately where appropriate
Queryable provides equivalent extension methods to most of the Enumerable LINQ operators, except that it uses IQueryable<T> sources and expression trees instead of delegates
Queryable doesn’t handle the queries itself at all; it simply records what’s been called and delegates the real processing to the query provider

I think I’ve now covered most of the topics I wanted to mention after finishing the actual Edulinq implementation. Next up I’ll talk about some of the thorny design issues (most of which I’ve already mentioned, but which bear repeating) and then I’ll write a brief "series conclusion" post with a list of links to all the other parts.

Books, C#

C# in Depth 2nd edition: now available in mobi/epub (Kindle) format

February 17, 2011 jonskeet 10 Comments

I’m not quite sure why this hasn’t been emailed to all existing owners, but the ebook of C# in Depth 2nd edition is now available in mobi and epub form, as well as PDF.

You can download it from the Manning user account site. You need to have the existing ebook first, but if you have the hard copy there should be a voucher in the front which will let you get the ebook for free. (This should work wherever you bought the hard copy from; it doesn’t matter whether you originally ordered it from Manning or not.) If you don’t already have a login for the user account site, just register using the same email address that the ebook was sent to. That way the system automatically credits you with all the ebooks you’ve bought. If you have had ebooks delivered to multiple email addresses, you can add those in the settings page.

Anyway, click on the link to C# in Depth, and you can download the book in any of the listed formats – if you want to use it on a Kindle, just download the mobi file, copy it to the Kindle and you should be well away.

Enjoy!

Jon Skeet's coding blog

Monthly Archives: February 2011
