Category Archives: LINQ

Reimplementing LINQ to Objects: Part 1 – Introduction

About a year and a half ago, I gave a talk at a DDD day in Reading, attempting to reimplement as much of LINQ to Objects as possible in an hour. Based on the feedback from the session, I went far too fast… and I was still a long way from finishing. However, I still think it’s an interesting exercise, so I thought I’d do it again in a more leisurely way, blogging as I go. Everything will be under the "Edulinq" tag, so that’s the best way to get all the parts in order, without any of my other posts.

General approach

The plan is to reimplement the whole of LINQ to Objects, explaining each method (or group of methods) in a blog post. I’m going to try to make the code itself production quality, but I’m not going to include any XML documentation – if I’m already writing up how things work, I don’t want to do everything twice. I’ll include optimizations where appropriate, hopefully doing better than LINQ to Objects itself.

The approach is going to be fairly simple: for each LINQ method, I’ll write some unit tests (most of which I won’t show in the blog posts) and make sure they run against the normal .NET implementation. I’ll then comment out the using directive for System.Linq, and introduce one for JonSkeet.Linq instead. The tests will fail, I’ll implement the methods, and gradually they’ll go green again. It’s not quite the normal TDD pattern, but it works pretty well.

I will write up a blog entry for each LINQ operator, probably including all of the production code but only "interesting" tests. I’ll highlight important patterns as they come up – that’s half the point of the exercise, of course.

At the end of each post, I’ll include a link to download the "code so far". For the sake of anyone looking at these posts in the future, I’m going to keep the downloads numbered separately, rather than updating a single download over time. I’m hoping that the code will simply grow, but I dare say there’ll be some modifications along the way too.

The aim is not to end up with LINQBridge: I’m going to be targeting .NET 3.5 (mostly so I can use extension methods without creating my own attribute) and I’m certainly not going to start worrying about installers and the like. The purpose of all of this is purely educational: if you follow along with these blog posts, with any luck you’ll have a deeper understanding of LINQ in general and LINQ to Objects in particular. For example, topics like deferred execution are often misunderstood: looking at the implementation can clarify things pretty well.

Testing

The unit tests will be written using NUnit (just for the sake of my own familiarity). Fairly obviously, one of the things we’ll need to test quite a lot is whether two sequences are equal. We’ll do this using the TestExtensions class from MoreLINQ (which I’ve just copied into the project). The netbook on which I’ll probably write most of this code only has C# Express 2010 on it, so I’m going to use the external NUnit GUI, set in the project file as the program to start. You can’t configure that from C# Express directly, but it’s easy to edit the project file by hand to include this:

<StartAction>Program</StartAction>
<StartProgram>C:\Program Files\NUnit-2.5.7.10213\bin\net-2.0\nunit-x86.exe</StartProgram>

It’s a bit of a grotty hack, but it works. The "additional command line parameters" are then set to just JonSkeet.Linq.Tests.dll – the current directory is the bin/debug directory by default, so everything’s fine. Obviously if you want to run the tests yourself and you have ReSharper or something similar, you can see the results integrated into Visual Studio.

Although I’m hoping to write reasonably production-level code, I doubt that there’ll be as many unit tests as I’d really write against production code. I fully expect the number of lines of test code to dwarf the number of lines of production code even so. There are simply vast numbers of potential corner cases… and quite a few overloads in several cases. Remember that the goal here is to examine interesting facets of LINQ.

Code layout

Just like the real LINQ to Objects, I’ll be creating an enormous static Enumerable class… but I’ll do so using partial classes, with one method name (but multiple overloads) per file. So Where will be implemented in Where.cs and tested in WhereTest.cs, for example.
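
As a sketch of that layout (the file contents here are purely illustrative):

// Where.cs – one operator per file, each file contributing to the same class
namespace JonSkeet.Linq
{
    public static partial class Enumerable
    {
        // Where overloads will live here
    }
}

// Count.cs
namespace JonSkeet.Linq
{
    public static partial class Enumerable
    {
        // Count and LongCount overloads will live here
    }
}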

First code drop

The first zip file is available: Linq-To-Objects-1.zip. It doesn’t contain any production code yet – just 4 tests for Where, so I could check that NUnit was working properly for me. Next stop… implementing Where.

Just how lazy are you?

I’ve been reviewing chapter 10 of C# in Depth, which is about extension methods. This is where I start introducing some of the methods in System.Linq.Enumerable, such as Where and Reverse. I introduce a few pieces of terminology in callouts – and while I believe I’m using this terminology correctly according to MSDN, I suspect that some of it isn’t quite what many developers expect… in particular, what does it mean for something to be "lazy"?

Let’s start off with a question: is Enumerable.Reverse lazy? Just so we’re all on the same page, here are the interesting bits of the behaviour of Reverse:

  • The call to Reverse doesn’t fetch any items – it merely checks that you’ve not given it a null sequence, stores a reference to that sequence, and returns a new sequence.
  • Once the first item is returned from the returned sequence, all the items in the original sequence are fetched and buffered. Obviously this is required, as the first item in the reversed sequence is the last item in the original sequence.

So, is that lazy or not?

One simple definition of lazy

This morning I tweeted on the somewhat ambiguous notion of something being "lazy" – and the responses I received were along the lines of "it’s deferred execution". You could potentially sum up this notion of laziness as:

An operation is lazy if it defers work until it is required in order to return a result.

By that definition, Reverse is lazy. Assuming we don’t want to perform special optimisations for IList<T> (which change the exact behaviour), Reverse does as little work as it can – it just so happens that when the first item is requested, it has to drain the source sequence.

The MSDN definition of lazy

MSDN describes deferred execution, lazy evaluation and eager evaluation somewhat differently. Admittedly the page I’m taking these definitions from is in the context of LINQ to XML, but that effectively means it’s describing LINQ to Objects. It defines them like this:

Deferred execution means that the evaluation of an expression is delayed until its realized value is actually required.

In lazy evaluation, a single element of the source collection is processed during each call to the iterator. This is the typical way in which iterators are implemented.

In eager evaluation, the first call to the iterator will result in the entire collection being processed. A temporary copy of the source collection might also be required. For example, the OrderBy method has to sort the entire collection before it returns the first element.

Now, I take slight issue with the definition of lazy evaluation here as it specifies that a single element of the source collection is processed on each call. That’s fine for Cast, OfType, Select and a few other operators – but it would preclude Where, which might have to pull several source elements before it finds one which matches the specified filter. I still think of Where as being lazy.
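
To illustrate the problem with that definition, a single call to MoveNext() on the result of Where can process several source elements. This is my example, not MSDN’s:

int[] source = { 1, 3, 4, 6 };
IEnumerable<int> query = source.Where(x =>
{
    Console.WriteLine("Testing {0}", x);
    return x % 2 == 0;
});

using (IEnumerator<int> iterator = query.GetEnumerator())
{
    // This single call prints "Testing 1", "Testing 3", "Testing 4":
    // three source elements are consumed to produce one result.
    iterator.MoveNext();
    Console.WriteLine("First result: {0}", iterator.Current);
}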

My definition of lazy

Thinking about this more, I believe the following definition of laziness is helpful:

(This space left intentionally blank.)

I don’t believe lazy is a useful term, basically. It’s like "strong typing" – you get some sort of woolly feeling about how something will behave if it’s lazy, but chances are you’ll need something more precise anyway.

For the purposes of LINQ to Objects-style operators which return IEnumerable<T> or any interface derived from it (including IOrderedEnumerable<T>), I propose the following definitions:

An operation uses deferred execution if no elements are fetched from the source sequence until the first element is fetched from the result sequence. This applies to basically all LINQ to Objects operators which return a sequence interface. (It doesn’t apply to ToArray or ToList of course.)

An operation is streaming if it only fetches elements from the source sequence as it requires them, and does not store those elements internally unless otherwise specified.

An operation is buffering if it drains the source sequence (i.e. fetches all the elements) when the first element of the result sequence is fetched, and stores the items internally.

I’m not even entirely comfortable with this: you could claim that Reverse is "streaming but with internal storage" – but that’s against the spirit of the definitions. Why did I mention the internal storage at all? Well, consider Distinct… that streams data, in that the result sequence will return its first element immediately after reading the first element from the source sequence, and so on – but it has to remember all the elements it’s already returned, for obvious reasons.

Some operations are buffering in one input and streaming in another – for example, Join will read all of the "inner" sequence as soon as it’s asked for its first element, but streams the "outer" sequence.
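
As an illustration of that shape, here’s a sketch of how such a join can be written – not the real implementation, just the pattern:

private static IEnumerable<TResult> JoinImpl<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector)
{
    // Buffering: the whole of "inner" is read and stored as soon as
    // the first result element is requested...
    ILookup<TKey, TInner> lookup = inner.ToLookup(innerKeySelector);
    // ... but "outer" is streamed, one element at a time.
    foreach (TOuter outerElement in outer)
    {
        foreach (TInner innerElement in lookup[outerKeySelector(outerElement)])
        {
            yield return resultSelector(outerElement, innerElement);
        }
    }
}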

Why does this matter?

Is this just another example of my pedantry and desire for precision? Not really. Consider my old favourite example of LINQ: processing huge log files. Suppose each log entry contains a user ID, and we’ve got a handy log reader which will iterate over all the log files, yielding one log entry at a time.

  • Using entries.Reverse() would be disastrous if we ever actually used the result. We really, really don’t want to load everything into memory.
  • Using entries.Select(x => x.UserId) would be fine, so long as we used the result without buffering it ourselves.
  • Using entries.Select(x => x.UserId).Distinct() might be fine – it would depend on how many users we have. If you’re processing some company-internal application logs, that’s probably okay even if you’ve generated a huge number of log entries. If you’re processing Facebook logs, you could have more of a problem.

Basically, you need to know what an operation will do with its input before you can decide whether or not it’s usable for a particular situation.

The best answer for this (IMO) is to document any such methods meticulously. Yes, you then lose the flexibility of changing the behaviour at a later date – but at least you’ve then provided something that can be used with confidence.

Note that Reactive Extensions has a similar problem, but in a slightly different form – in that case, the distinction between "hot" and "cold" observables can make a big difference, along with scheduling etc. Again, documentation is the key in my view.

Optimisations in LINQ to Objects

(Edited on February 11th, 2010 to take account of a few mistakes and changes in the .NET 4.0 release candidate.)

I’ve just been fiddling with the first appendix of C# in Depth, which covers the standard query operators in LINQ, and describes a few details of the LINQ to Objects implementations. As well as specifying which operators stream and which buffer their results, and immediate vs deferred execution, I’ve noted where LINQ optimises for different collection types – or where it doesn’t, but could. I’m not talking about optimisations which require knowledge of the projection being applied or an overall picture of the query (e.g. seq.Reverse().Any() being optimised to seq.Any()) – this is just about optimisations which can be done on a pretty simple basis.

There are two main operations which can be easily optimised in LINQ: random access to an element by index, and the count of a collection. The tricky thing about optimisations like this is that we do have to make assumptions about implementation: I’m going to assume that any implementation of an interface with a Count property will be able to return the count very efficiently (almost certainly straight from a field) and likewise that random access via an indexer will be speedy. Given that both these operations already have their own LINQ operators (when I say "operator" in this blog post, I mean "LINQ query method" rather than an operator at the level of C# as a language) let’s look at those first.

Count

Count() should be pretty straightforward. Just to be clear, I’m only talking about the overload which doesn’t take a predicate: it’s pretty hard to see how you can do better than iterating through and counting matches when there is a predicate involved.

There are actually two common interfaces which declare a Count property: ICollection<T> and ICollection. While many implementations of ICollection<T> will also implement ICollection (including List<T> and T[]), it’s not guaranteed: they’re independent interfaces, unrelated other than by the fact that they both extend IEnumerable.

The MSDN documentation for Enumerable.Count() states:

If the type of source implements ICollection<T>, that implementation is used to obtain the count of elements. Otherwise, this method determines the count.

This is accurate for .NET 3.5, but in .NET 4.0 it does optimise for ICollection as well. (In beta 2 it only optimised for ICollection, skipping the generic interface.)
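
A sketch of that doubly-optimised approach (my code, not the framework’s):

public static int Count<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    // Optimisation: both interfaces declare a Count property
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        return genericCollection.Count;
    }
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        return nonGenericCollection.Count;
    }
    // Fall back to counting by iterating
    int count = 0;
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        while (iterator.MoveNext())
        {
            count++;
        }
    }
    return count;
}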

ElementAt

The equivalent of an indexer in LINQ is the ElementAt operator. Note that it can’t really be an indexer, as there’s no such thing as an "extension indexer" – which is arguably a bit of a pity, but that’s off-topic. Anyway, the obvious interface to look for here is IList<T>… and that’s exactly what ElementAt does. It ignores the possibility that you’ve only implemented the non-generic IList – but I think that’s fairly reasonable. After all, the extension method extends IEnumerable<T>, so your collection has to be aware of generics – why would you implement IList but not IList<T>? Also, using the implementation of IList would involve a conversion from object to T, which would at the very least be ugly.
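
A sketch of that approach (again my code, with most of the argument validation omitted):

public static TSource ElementAt<TSource>(this IEnumerable<TSource> source, int index)
{
    // Null/negative argument validation omitted for brevity
    IList<TSource> list = source as IList<TSource>;
    if (list != null)
    {
        // Random access: O(1) for any sensible list implementation
        return list[index];
    }
    // Fall back to iterating, counting down to the requested index
    foreach (TSource item in source)
    {
        if (index == 0)
        {
            return item;
        }
        index--;
    }
    throw new ArgumentOutOfRangeException("index");
}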

So ElementAt doesn’t actually do too badly. Now that we’ve got the core operations, what else could be optimised?

SequenceEqual

If you were going to write a method to compare two lists, you might end up with something like this (ignoring nullity concerns for the sake of brevity):

public static bool ListEqual<T>(List<T> first, List<T> second, IEqualityComparer<T> comparer)
{
    // Optimise for reflexive comparison
    if (first == second)
    {
        return true;
    }
    // If the counts are different we know the lists are different
    if (first.Count != second.Count)
    {
        return false;
    }
    // Compare each pair of elements in turn
    for (int i = 0; i < first.Count; i++)
    {
        if (!comparer.Equals(first[i], second[i]))
        {
            return false;
        }
    }
    return true;
}

Note the two separate optimisations. The first is always applicable, unless you really want to deal with sequences which will yield different results if you call GetEnumerator() on them twice. You could certainly argue that that would be a legitimate implementation, but I’d be interested to see a situation in which it made sense to try to compare such a sequence with itself and return false. As it happens, SequenceEqual doesn’t perform this optimisation.

The second optimisation – checking for different counts – is only really applicable in the case where we know that Count is optimised for both lists. In particular, I always make a point of only iterating through each source sequence once when I write a custom LINQ operator – you never know when you’ll be given a sequence which reads a huge log file from disk, yielding one line at a time. (Yes, that is my pet example, but it’s a good one and I’m sticking to it.) But we can certainly tell if both sequences implement ICollection or ICollection<T>, so it would make sense to have an "early negative" in that situation.
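
The check itself would be cheap – something along these lines (a sketch, only checking the generic interface for brevity) before falling through to the normal element-by-element comparison:

ICollection<T> firstCollection = first as ICollection<T>;
ICollection<T> secondCollection = second as ICollection<T>;
if (firstCollection != null && secondCollection != null &&
    firstCollection.Count != secondCollection.Count)
{
    // Early negative: different counts mean the sequences can't be equal
    return false;
}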

Last(predicate)

(All of this applies to LastOrDefault as well, by the way.) The implementation of Last which doesn’t take a predicate is already optimised for the IList<T> case: in that situation the method finds out the count, and returns list[count - 1] as you might expect. We certainly can’t do that when we’ve been given a predicate, as the last value might not match that predicate. However, we could walk backwards from the end of the list… if you have a list which contains a million items, and the last-but-one matches the predicate, you don’t really want to test the first 999998 items, do you? Again, this assumes that we can keep using random access on the list, but I think that’s reasonable for IList<T>.
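
Here’s a sketch of what that backwards walk might look like (hypothetical code, not the framework’s, with argument validation omitted):

public static T Last<T>(this IEnumerable<T> source, Func<T, bool> predicate)
{
    IList<T> list = source as IList<T>;
    if (list != null)
    {
        // Walk backwards: for a million-item list whose last-but-one element
        // matches, this tests just two items instead of a million
        for (int i = list.Count - 1; i >= 0; i--)
        {
            if (predicate(list[i]))
            {
                return list[i];
            }
        }
        throw new InvalidOperationException("No elements matched the predicate");
    }
    // Otherwise iterate forwards, remembering the last match
    bool found = false;
    T last = default(T);
    foreach (T item in source)
    {
        if (predicate(item))
        {
            found = true;
            last = item;
        }
    }
    if (!found)
    {
        throw new InvalidOperationException("No elements matched the predicate");
    }
    return last;
}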

Reverse

Reverse is an interesting case, because it uses deferred execution but doesn’t stream its data. In reality, it always takes a complete copy of the sequence when the first element is requested (which in itself does optimise for the case where the source implements ICollection<T>; in that situation you know the count to start with and can use source.CopyTo(destinationArray) to speed things up). You might consider an optimisation which uses random access if the source is an implementation of IList<T> – you could just lazily yield the elements in reverse order using random access. However, that would change behaviour. Admittedly the behaviour of Reverse may not be what people expect in the first place. What would you predict that this code does?

string[] array = { "a", "b", "c", "d" };

var query = array.Reverse();
array[0] = "a1";
        
var iterator = query.GetEnumerator();
array[0] = "a2";
        
// We’ll assume we know when this will stop :)
iterator.MoveNext();
Console.WriteLine(iterator.Current);
array[0] = "a3";
        
iterator.MoveNext();
Console.WriteLine(iterator.Current);
array[0] = "a4";

iterator.MoveNext();
Console.WriteLine(iterator.Current);
array[0] = "a5";
        
iterator.MoveNext();
Console.WriteLine(iterator.Current);

After careful thought, I accurately predicted the result (d, c, b, a2) – but you do need to take deferred execution *and* eager buffering into account. If nothing else, this should be a lesson in not changing the contents of query sources while you’re iterating over the query unless you’re really sure of what you’re doing.

With the candidate "optimisation" in place, we’d see (d, c, b, a5), but only when working on array directly. Working on array.Select(x => x) would have to give the original results, as it would have to iterate through all the initial values before finding the last one.
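
For concreteness, here’s roughly what that candidate (and rejected) optimisation would look like – a sketch, not real framework code:

// Lazily yields elements in reverse order via random access. This is the
// behaviour-changing version: each element is only read from the list when
// the caller asks for it, so later writes show through – hence (d, c, b, a5).
private static IEnumerable<T> LazyReverseImpl<T>(IList<T> list)
{
    for (int i = list.Count - 1; i >= 0; i--)
    {
        yield return list[i];
    }
}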

LongCount

This is an interesting one… LongCount really doesn’t make much sense unless you expect your sequence to have more than 2^31 elements, but there’s no optimisation present. The contract for IList<T> doesn’t state what Count should do if the list has more than Int32.MaxValue elements, so that can’t really be used – but potentially Array.LongLength could be used for large arrays.

A bigger question is when this would actually be useful. I haven’t tried timing Enumerable.Range(0, int.MaxValue) to see how long it would take to become relevant, but I suspect it would be a while. I can see how LongCount could be useful in LINQ to SQL – but does it even make sense in LINQ to Objects? Maybe it will be optimised in a future version with ILongList<T> for large lists…

EDIT: In fact, given comments, it sounds like the time taken to iterate over int.MaxValue items isn’t that high after all. That’ll teach me to make assumptions about running times without benchmarking… I still can’t say I’ve seen LongCount used in anger in LINQ to Objects, but it’s not quite as silly as I thought :)

Conclusion

The optimisations I’ve described here all have the potential to take a long-running operation down to almost instantaneous execution, in the "right" situation. There may well be other opportunities lurking – things I haven’t thought of. The good news is that missing optimisations could be applied in future releases without breaking any sane code. I do wonder whether supporting methods (e.g. TryFastCount and TryFastElementAt) would be useful to encourage other people writing their own LINQ operators to optimise appropriately – but that’s possibly a little too much of a niche case.
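
To give an idea of what I mean, here’s a sketch of TryFastCount – hypothetical code, not anything in the framework:

public static bool TryFastCount<T>(this IEnumerable<T> source, out int count)
{
    ICollection<T> genericCollection = source as ICollection<T>;
    if (genericCollection != null)
    {
        count = genericCollection.Count;
        return true;
    }
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        count = nonGenericCollection.Count;
        return true;
    }
    // No known way of counting cheaply without iterating
    count = 0;
    return false;
}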

Blog post frequency may well change in either direction for the near future – I’m going to be very busy with last-minute changes, fixes, indexing etc for the book, which will give me even less time for blogging. On the other hand, it can be mind-numbingly tedious, so I may resort to blogging as a form of relief…

LINQ to Rx: second impressions

My previous post had the desired effect: it generated discussion on the LINQ to Rx forum, and Erik and Wes very kindly sent me a very detailed response too. There’s no better way to cure ignorance than to display it to the world.

Rather than regurgitating the email verbatim, I’ve decided to try to write it in my own words, with extra thoughts where appropriate. That way if I’ve misunderstood anything, I may be corrected – and the very act of trying to explain all this is likely to make me explore it more deeply than I would otherwise.

I’m leaving out the bits I don’t yet understand. One of the difficulties with LINQ to Rx at the moment is that the documentation is somewhat sparse – there are loads of videos, and at least there is a CHM file for each of the assemblies bundled in the framework, but many methods just have a single sentence of description. This is entirely understandable – the framework is still in flux, after all. I’d rather have the bits but sparse docs than immaculate docs for a framework I can’t play with – but it makes it tricky to go deeper unless you’ve got time to experiment extensively. There’s an rxwiki site which looks like it may be the community’s attempt to solve this problem – but it needs a bit more input, I think. When I get a bit of time to breathe, I’d like to try to contribute there.

The good news is that I don’t think there were any mechanical aspects that I got definitively wrong in what I wrote… but the bad news is that I wasn’t thinking in Rx terms. We’ll look at the different aspects separately.

Subscriptions and collections

My first "complaint" was about the way that IEnumerable<T>.ToObservable() worked. Just to recap, I was expecting a three stage startup process:

  • Create the observable
  • Subscribe any observers
  • Tell the observable to "start"

Instead, as soon as an observer subscribes, the observable publishes everything in the sequence to it (on a different thread, by default). Separate calls to Subscribe make the observable iterate over the sequence multiple times.

Now, my original viewpoint makes sense if you think of Subscribe as being like an event subscription. It feels like something which should be passive: another viewer turning on their television to watch a live broadcast.

However, as soon as you think of IObservable.Subscribe as being the dual of IEnumerable.GetEnumerator, the Rx way makes more sense. Each call to GetEnumerator starts the sequence from scratch, and so does each call to Subscribe. This is more like inserting a disc into the DVD player – you’re still watching something, but there’s a more active element to it. You put the DVD in, it starts playing. I guess following the analogy further would make my suggested model more like a PVR :)

Additionally, this "subscription as an action" model makes more sense of methods like Return and Repeat, and also works better as a reusable object: my own idea of "now push the collection" feels dreadfully stateful: why can’t I push the collection twice? What happens if an observer subscribes after I’ve pushed?

I suspect this will trip up many an Rx neophyte; the video Wes recorded on hot and cold observables should help. Admittedly I’d already watched it before writing the blog post, so I’ve no excuse…

The subscription model can effectively be modified via composition though; using Subject (as per the blog post), AsyncSubject (which remembers the first value it sees, and only yields that), BehaviorSubject (which remembers the last value it’s seen), and ReplaySubject (which remembers everything it sees, optionally limited by a buffer) you can do quite a bit.

Wes included in his email a StartableObservable which one could start and stop. I’d come up with a slightly similar idea at home – an ObservableSequence (not nearly such a good name) which was limited to sequences: effectively it made the steps listed above explicit for a pull sequence. The code Wes provided was completely isolated from IEnumerable<T> – you would create a StartableObservable from any existing observable, then subscribe to it, then start it. It uses a Subject to effectively collect the subscriptions – starting the observable merely subscribes the subject to the original observable passed into the constructor.
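
I won’t reproduce Wes’s code verbatim, but a minimal sketch of that shape (my reconstruction, purely illustrative, using Rx’s Subject<T>) might be:

public sealed class StartableObservable<T> : IObservable<T>
{
    private readonly IObservable<T> source;
    private readonly Subject<T> subject = new Subject<T>();

    public StartableObservable(IObservable<T> source)
    {
        this.source = source;
    }

    // Subscribing just adds an observer to the subject;
    // nothing is pulled from the source yet
    public IDisposable Subscribe(IObserver<T> observer)
    {
        return subject.Subscribe(observer);
    }

    // Starting merely subscribes the subject to the original observable
    public IDisposable Start()
    {
        return source.Subscribe(subject);
    }
}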

The difference between Wes’s solution and mine is more fundamental than whether his is more general-purpose than mine or not (although it clearly is). Wes didn’t have to go outside the world of Rx at all. All the building blocks were there, he just put them together – and ended up with another building block, ready to be used with the rest. That’s a common theme in this blog post :)

Asynchronous aggregation

I did get one thing right in my previous post: my suggestion that there should be asynchronous versions of the aggregation operators is apparently not a bad one. We may see the framework come with such things in the future… but they won’t revolve around Task<T>.

What do we have to represent an asynchronous computation? Why, IObservable<T> of course. It will present the result at some future point in time. Ideally, I suppose you would deal with the count (or maximum line length, or whatever) by reacting to it asynchronously too. If necessary though, you can always just take the value produced and stash it somewhere… which is exactly what an AsyncSubject does, as mentioned above. You can get the value from that by just calling First() on it, which will block if the value hasn’t been seen yet – and you don’t need to worry about "missing" it, because of the caching within the subject.

When I started this blog post, I didn’t understand Prune, but I’ve found that writing about the whole process has made it somewhat clearer to me. Calling Prune on an observable returns an AsyncSubject – one which also unsubscribes itself from the original observable when the subject is disposed, basically allowing a more orderly cleanup. So, all we need to do is call Prune on the result of our asynchronous aggregation, and we’re away.

That’s one part of the "non-Rx" framework removed… what else can we take out of the code from the previous blog post? Well, if you look at the FutureAggregate method I posted, it does two things: maintains a running aggregate, and publishes the last result (via a Task<T>). Now the "maintain a running aggregate" looks remarkably like Scan, doesn’t it? All the future aggregates (FutureCount etc) can be built from one new building block: an observable which subscribes to an existing one, and yields the last value it sees before completion.
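
As a sketch of that building block (my own code, written against the BCL’s IObservable<T>/IObserver<T> interfaces rather than any particular Rx release):

// Yields just the final value its source produces, when the source completes
public sealed class LastValueObservable<T> : IObservable<T>
{
    private readonly IObservable<T> source;

    public LastValueObservable(IObservable<T> source)
    {
        this.source = source;
    }

    public IDisposable Subscribe(IObserver<T> observer)
    {
        return source.Subscribe(new LastValueObserver(observer));
    }

    private sealed class LastValueObserver : IObserver<T>
    {
        private readonly IObserver<T> target;
        private T lastValue;
        private bool seenValue;

        internal LastValueObserver(IObserver<T> target)
        {
            this.target = target;
        }

        public void OnNext(T value)
        {
            // Just remember the value; nothing is pushed on until completion
            lastValue = value;
            seenValue = true;
        }

        public void OnError(Exception error)
        {
            target.OnError(error);
        }

        public void OnCompleted()
        {
            if (seenValue)
            {
                target.OnNext(lastValue);
            }
            target.OnCompleted();
        }
    }
}

FutureCount would then be roughly subject.Scan(0, (count, _) => count + 1) wrapped in one of these, with an AsyncSubject subscribed to the result if you want the caching behaviour.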

I’ll check with Wes whether he’s happy for me to share his code – if so, I’ll put that and the original code into a zip file so it’s easy to compare the dull version with the shiny one.

Conclusion

It’s not enough to be able to think about Rx. To really appreciate it, you’ve got to be able to think in Rx. As I’d written a sort of "mini-Rx" before, I was arrogant enough to assume I already knew how to think in observable sequences… but apparently not. (To be fair to myself, it’s been a while and Push LINQ didn’t try to do anything genuinely asynchronous.)

I’m certainly not "in the zone" yet when it comes to Rx… but I think I can see it in the distance now. I’m heartily glad I raised my concerns over asynchronous aggregation – partly as encouragement to the team to consider including them in the framework, but mostly because it’s helped me appreciate the framework a lot better. With any luck, these two somewhat "stream of consciousness" posts may have helped you too.

Now to go over what I wrote last night for the book, and see how much of it was rubbish :)

First encounters with Reactive Extensions

I’ve been researching Reactive Extensions for the last few days, with an eye to writing a short section in chapter 12 of the second edition of C# in Depth. (This is the most radically changed chapter from the first edition; it will be covering LINQ to SQL, IQueryable, LINQ to XML, Parallel LINQ, Reactive Extensions, and writing your own LINQ to Objects operators.) I’ve watched various videos from Channel 9, but today was the first time I actually played with it. I’m half excited, and half disappointed.

My excited half sees that there’s an awful lot to experiment with, and loads to learn about join patterns etc. I’m also looking forward to trying genuine events (mouse movements etc) – so far my tests have been to do with collections.

My disappointed half thinks it’s missing something. You see, Reactive Extensions shares some concepts with my own Push LINQ library… except it’s had smarter people (no offence meant to Marc Gravell) working harder on it for longer. I’d expect it to be easier to use, and make it a breeze to do anything you could do in Push LINQ. Unfortunately, that’s not quite the case.

Subscription model

First, the way that subscription is handled for collections seems slightly odd. I’ve been imagining two kinds of observable sources:

  • Genuine "event streams" which occur somewhat naturally – for instance, mouse movement events. Subscribing to such an observable wouldn’t do anything to it other than adding subscribers.
  • Collections (and the like) where the usual use case is "set up the data pipeline, then tell it to go". In that case calling Subscribe should just add the relevant observers, but not actually "start" the sequence – after all, you may want to add more observers (we’ll see an example of this in a minute).

In the latter case, I could imagine an extension method to IEnumerable<T> called ToObservable which would return a StartableObservable<T> or something like that – you’d subscribe what you want, and then call Start on the StartableObservable<T>. That’s not what appears to happen though – if you call ToObservable(), you get an implementation which iterates over the source sequence as soon as anything subscribes to it – which just doesn’t feel right to me. Admittedly it makes life easy in the case where that’s really all you want to do, but it’s a pain otherwise.

There’s a way of working round this in Reactive Extensions: there’s Subject<T> which is both an observer and an observable. You can create a Subject<T>, Subscribe all the observers you want (so as to set up the data pipeline) and then subscribe the subject to the real data source. It’s not exactly hard, but it took me a while to work out, and it feels a little unwieldy. The next issue was somewhat more problematic.

Blocking aggregation

When I first started thinking about Push LINQ, it was motivated by a scenario from the C# newsgroup: someone wanted to group a collection in a particular way, and then count how many items were in each group. This is effectively the "favourite colour voting" scenario outlined in the link at the top of this post. The problem to understand is that the normal Count() call is blocking: it fetches items from a collection until there aren’t any more; it’s in control of the execution flow, effectively. That means if you call it in a grouping construct, the whole group has to be available before you call Count(). So, you can’t stream an enormous data set, which is unfortunate.

In Push LINQ, I addressed this by making Count() return Future<int> instead of int. The whole query is evaluated, and then you can ask each future for its actual result. Unfortunately, that isn’t the approach that the Reactive Framework has taken – it still returns int from Count(). I don’t know the reason for this, but fortunately it’s somewhat fixable. We can’t change Observable of course, but we can add our own future-based extensions:

public static class ObservableEx
{
    public static Task<TResult> FutureAggregate<TSource, TResult>
        (this IObservable<TSource> source,
        TResult seed, Func<TResult, TSource, TResult> aggregation)
    {
        TaskCompletionSource<TResult> result = new TaskCompletionSource<TResult>();
        TResult current = seed;
        source.Subscribe(value => current = aggregation(current, value),
            error => result.SetException(error),
            () => result.SetResult(current));
        return result.Task;
    }

    public static Task<int> FutureMax(this IObservable<int> source)
    {
        // TODO: Make this generic and throw exception on
        // empty sequence. Left as an exercise for the reader.
        return source.FutureAggregate(int.MinValue, Math.Max);
    }

    public static Task<int> FutureMin(this IObservable<int> source)
    {
        // TODO: Make this generic and throw exception on
        // empty sequence. Left as an exercise for the reader.
        return source.FutureAggregate(int.MaxValue, Math.Min);
    }

    public static Task<int> FutureCount<T>(this IObservable<T> source)
    {
        return source.FutureAggregate(0, (count, _) => count + 1);
    }
}

This uses Task<T> from Parallel Extensions, which gives us an interesting ability, as we’ll see in a moment. It’s all fairly straightforward – TaskCompletionSource<T> makes it very easy to specify a value when we’ve finished, or indicate that an error occurred. As mentioned in the comments, the maximum/minimum implementations leave something to be desired, but it’s good enough for a blog post :)

Using the non-blocking aggregation operators

Now that we’ve got our extension methods, how can we use them? First I decided to do a demo which would count the number of lines in a file, and find the maximum and minimum line lengths:

private static IEnumerable<string> ReadLines(string filename)
{
    using (TextReader reader = File.OpenText(filename))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}

var subject = new Subject<string>();
var lengths = subject.Select(line => line.Length);
var min = lengths.FutureMin();
var max = lengths.FutureMax();
var count = lengths.FutureCount();
            
var source = ReadLines("../../Program.cs");
source.ToObservable(Scheduler.Now).Subscribe(subject);
Console.WriteLine("Count: {0}, Min: {1}, Max: {2}",
                  count.Result, min.Result, max.Result);

As you can see, we use the Result property of a task to find its eventual result – this call will block until the result is ready, however, so you do need to be careful about how you use it. Each line is only read from the file once, and pushed to all three observers, who carry their state around until the sequence is complete, whereupon they publish the result to the task.

I got this working fairly quickly – then went back to the "grouping lines by line length" problem I’d originally set myself. I want to group the lines of a file by their length (all lines of length 0, all lines of length 1 etc) and count each group. The result is effectively a histogram of line lengths. Constructing the query itself wasn’t a problem – but iterating through the results was. Fundamentally, I don’t understand the details of ToEnumerable yet, particularly the timing. I need to look into it more deeply, but I’ve got two alternative solutions for the moment.

The first is to implement my own ToList extension method. This simply creates a list and subscribes an observer which adds items to the list as it goes. There’s no attempt at "safety" here – if you access the list before the source sequence has completed, you’ll see whatever has been added so far. I am still just experimenting :) Here’s the implementation:

public static List<T> ToList<T>(this IObservable<T> source)
{
    List<T> ret = new List<T>();
    source.Subscribe(x => ret.Add(x));
    return ret;
}

Now we can construct a query expression, project each group using our future count, make sure we’ve finished pushing the source before we read the results, and everything is fine:

var subject = new Subject<string>();
var groups = from line in subject
             group line.Length by line.Length into grouped
             select new { Length = grouped.Key, Count = grouped.FutureCount() };
var results = groups.ToList();

var source = ReadLines("../../Program.cs");
source.ToObservable(Scheduler.Now).Subscribe(subject);
foreach (var group in results)
{
    Console.WriteLine("Length: {0}; Count: {1}", group.Length, group.Count.Result);
}

Note how the call to ToList is required before calling source.ToObservable(...).Subscribe – otherwise everything would have been pushed before we started collecting it.

All well and good… but there’s another way of doing it too. We’ve only got a single task being produced for each group – instead of waiting until everything’s finished before we dump the results to the console, we can use Task.ContinueWith to write it (the individual group result) out as soon as that group has been told that it’s finished. We force this extra action to occur on the same thread as the observer just to make things easier in a console app… but it all works very neatly:

var subject = new Subject<string>();
var groups = from line in subject
             group line.Length by line.Length into grouped
             select new { Length = grouped.Key, Count = grouped.FutureCount() };
                                    
groups.Subscribe(group =>
{
    group.Count.ContinueWith(
         x => Console.WriteLine("Length: {0}; Count: {1}",
                                group.Length, x.Result),
         TaskContinuationOptions.ExecuteSynchronously);
});
var source = ReadLines("../../Program.cs");
source.ToObservable(Scheduler.Now).Subscribe(subject);

Conclusion

That’s the lot, so far. It feels like I’m sort of in the spirit of Reactive Extensions, but that maybe I’m pushing it (no pun intended) in a direction which Erik and Wes either didn’t anticipate, or at least don’t view as particularly valuable/elegant. I very much doubt that they didn’t consider deferred aggregates – it’s much more likely that either I’ve missed some easy way of doing this, or there are good reasons why it’s a bad idea. I hope to find out which at some point… but in the meantime, I really ought to work out a more idiomatic example for C# in Depth.

An object lesson in blogging and accuracy; was: Efficient “vote counting” with LINQ to Objects – and the value of nothing

Well, this is embarrassing.

Yesterday evening, I excitedly wrote a blog post about an interesting little idea for making a particular type of LINQ query (basically vote counting) efficient. It was an idea that had occurred to me a few months back, but I hadn’t got round to blogging about it.

The basic idea was to take a completely empty struct, and use that as the element type in the results of a grouping query – as the struct was empty, it would take no space, therefore "huge" arrays could be created for no cost beyond the fixed array overhead, etc. I carefully checked that the type used for grouping did in fact implement ICollection<T> so that the Count method would be efficient; I wrote sample code which made sure my queries were valid… but I failed to check that the empty struct really took up no memory.

Fortunately, I have smart readers, a number of whom pointed out my mistake in very kind terms.

Ben Voigt gave the reason for the size being 1 in a comment:

The object identity rules require a unique address for each instance… identity can be shared with super- or sub- class objects (Empty Base Optimization) but the total size of the instance has to be at least 1.

This makes perfect sense – it’s just a shame I didn’t realise it before.

Live and learn, I guess – but apologies for the poorly researched post. I’ll attempt to be more careful next time.

What’s in a name?

T.S. Eliot had the right idea when he wrote “The naming of cats”:

The Naming of Cats is a difficult matter,
It isn’t just one of your holiday games

When you notice a cat in profound meditation,
The reason, I tell you, is always the same:
His mind is engaged in a rapt contemplation
Of the thought, of the thought, of the thought of his name:
His ineffable effable
Effanineffable
Deep and inscrutable singular Name.

Okay, so developers may not contemplate their own names much, but I know I’ve certainly spent a significant amount of time recently trying to work out the right name for various types and methods. It always feels like it’s just out of reach; tauntingly, tantalisingly close.

Recently I’ve been thinking a bit about what the goals might be in coming up with a good name. In particular, I seem to have been plagued with the naming problem more than usual in the last few weeks.

Operations on immutable types

A while ago I asked a question on Stack Overflow about naming a method which “adds” an item to an immutable collection. Of course, when I say “adds” I mean “returns a new collection whose contents are the old collection plus the new item.” There’s a really wide range of answers (currently 38 of them) which mostly seem to fall into four categories:

  • Use Add because it’s idiomatic for .NET collections. Developers should know that the type is immutable and act accordingly.
  • Use Cons because that’s the term functional programming has used for this exact operation for decades.
  • Use a new method name (Plus being my favourite at the moment) which will be obvious to non-functional developers, but without being so familiar that it suggests mutability.
  • Use a constructor taking the old collection and the new item.

Part of the reasoning for Add being okay is that I originally posted the question purely about “an immutable collection” – e.g. a type which would have a name like ImmutableList<T>. I then revealed my true intention (which I should have done from the start) – to use this in MiniBench, where the “collection” would actually be a TestSuite. Everything in MiniBench is immutable (it’s partly an exploration in functional programming, as it seems to fit very nicely) but I don’t want to have to name every single type as Immutable[Whatever]. There’s the argument that a developer should know at least a little bit about any API they’re using, and the immutability aspect is one of the first things they should know. However, MiniBench is arguably an extreme case, because it’s designed for sharing test code with people who’ve never seen it before.

I’m pretty sure I’m going to go with Plus in the end:

  • It’s close enough to Add to be familiar
  • It’s different enough to Add to suggest that it’s not quite the same thing as adding to a normal collection
  • It sounds like it returns something – a statement which just calls Plus without using the result sounds like it’s wrong (and indeed it would be)
  • It’s meaningful to everyone
  • I have a precedent in the Joda Time API

Another option is to overload the + operator, but I’m not really sure I’m ready to do that just yet. It would certainly leave brief code, but is that really the most important thing?
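
For what it’s worth, here’s the sort of shape I have in mind – hypothetical types, and TestSuite itself may well end up differing:

public sealed class ImmutableCollection<T>
{
    private readonly T[] items;

    public ImmutableCollection() : this(new T[0])
    {
    }

    private ImmutableCollection(T[] items)
    {
        this.items = items;
    }

    // Returns a new collection; the original is untouched
    public ImmutableCollection<T> Plus(T item)
    {
        T[] newItems = new T[items.Length + 1];
        items.CopyTo(newItems, 0);
        newItems[items.Length] = item;
        return new ImmutableCollection<T>(newItems);
    }

    // The operator form I'm not quite ready to commit to
    public static ImmutableCollection<T> operator +(
        ImmutableCollection<T> collection, T item)
    {
        return collection.Plus(item);
    }
}

A statement like suite.Plus(test); with the result discarded then looks wrong – which is exactly the point.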

Let’s look at a situation with some of the same issues…

LINQ operators

Work on MoreLINQ has progressed faster than expected, mostly because the project now has four members, and they’ve been expending quite a bit of energy on it. (I must do a proper consistency review at some point – in particular it would be nice to have the docs refer to the same concepts in the same way each time. I digress…)

Most of the discussion in the project hasn’t been about functionality – it’s been about naming. In fact, LINQ is particularly odd in this respect. If I had to guess at how the time has been spent (at least for the operators I’ve implemented) I’d go for:

  • 15% designing the behaviour
  • 20% writing the tests
  • 10% implementation
  • 5% writing the documentation (just XML docs)
  • 50% figuring out the best name

It really is that brutal – and for a lot of the operators we still haven’t got the “right” name yet, in my view. There’s generally too much we want to convey in a word or two. As an example, we’ve got an operator similar to the oft-implemented ForEach one, but which yields the input sequence back out again. Basically it takes an action, and for each element it calls the action and then yields the element. The use case is something like logging. We’ve gone through several names, such as Pipe, Tee, Via… and just this morning I asked a colleague who suggested Apply, just off the top of his head. It’s better than anything we’d previously thought of, but does it convey both the “apply an action” and “still yield the original sequence” aspects?
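
Whatever name wins, the implementation itself is trivial compared with the naming (a sketch, with argument validation omitted):

public static IEnumerable<T> Apply<T>(this IEnumerable<T> source, Action<T> action)
{
    foreach (T element in source)
    {
        // Call the action for its side effect (logging, say)...
        action(element);
        // ... then yield the element so the pipeline continues unchanged
        yield return element;
    }
}

So something like lines.Apply(line => Console.WriteLine(line)).Where(...) would log each element as it flowed past.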

The old advice of “each method should only do one thing” is all very well, and it clearly helps to make naming simpler, but with situations like this one there are just naturally more concepts which you want to get across in the name.

Let’s stay on the LINQ topic, but stray a bit further from the well-trodden path…

The heart of Push LINQ: IDataProducer

I’ve probably bored most of you with Push LINQ by now, and I’m not actively developing it at the moment, but there’s still one aspect which I’m deeply uncomfortable with: the core interface. IDataProducer represents a stream of data which can be observed. Basically clients subscribe to events, and their event handlers will be called when data is “produced” and when the stream ends.

I know IDataProducer is an awful name – but so far I haven’t found anything better. IObservable? Ick. Overused and isn’t descriptive. IPushEnumerable? Sounds like the client can iterate over the data, which they can’t. The actual event names (DataProduced/EndOfData) are okay but there must be something better than IDataProducer. (Various options have been suggested in the past – none of them have been so obviously “right” as to stick in my head…)

This situation is slightly different to the previous ones, however, simply because it’s such a pivotal type. You would think that the more important the type, the more important the name would be – but in some ways the reverse is true. You see, Push LINQ isn’t a terribly “obvious” framework. I say that without shame – it’s great at what it does, but it takes a few mental leaps before you really grok it. You’re really going to have to read some documentation or examples before you write your own queries.

Given that constraint, it doesn’t matter too much what the interface is called – it’s going to be explained to you before you need it. It doesn’t need to be discoverable – whereas when you’re picking method names to pop up in Intellisense, you really want the developer to be able to guess its purpose even before they hover over it and check the documentation.

I haven’t given up on IDataProducer (and I hope to be moving Push LINQ into MoreLINQ, by the way – working out a better name is one of the blockers) but it doesn’t feel like quite as much of a problem.

Read-only or not read-only?

This final example came up at work, just yesterday – after I’d started writing this post. I wanted to refactor some code to emphasize which methods only use the read-only side of an interface. This was purely for the sake of readability – I wanted to make it easier to reason about which areas of the code modified an object and which didn’t. It’s a custom collection – the details don’t matter, but for the sake of discussion let’s call it House and pretend we’re modelling the various things which might be in a house. (This is Java, hence House rather than IHouse.)

I’m explicitly not doing this for safety – I don’t mind the fact that the reference could be cast to a mutable interface. The point is just to make it self-documenting that if a method only has a parameter in the non-mutating form, it’s not going to change the contents of the house.

So, we have two interfaces, like this:

public interface NameMePlease
{
    Color getDoorColor();
    int getWindowCount();

    // This already returned a read-only collection
    Set<Furniture> getFurniture();
}

public interface House extends NameMePlease
{
    void setDoorColor(Color doorColor);
    void setWindowCount(int windows);
    void addFurniture(Furniture item);
}

Obviously the challenge is to find a name for NameMePlease. One option is to use something like ImmutableHouse or ReadOnlyHouse – but the inheritance hierarchy makes liars of both of those names. How can it be a ReadOnlyHouse if there are methods in an implementation which change it? The interface should say what you can do with the type, rather than specifying what you can’t do – unless part of the contract of the interface is that the implementation will genuinely prohibit changes.

Thinking of this “positive” aspect led me to ReadableHouse, which is what I’ve gone with for the moment. It states what you can do with it – read information. Again, this is a concept which Joda Time uses.

Another option is to make it just House, and change the mutable interface to MutableHouse or something similar. In this particular situation the refactoring involved would have been enormous. Simple to automate, but causing a huge check-in for relatively little benefit. Almost all uses are actually mutating ones. The consensus within the Google Java mailing list seems to be that this would have been the preferred option, all things being equal. One interesting data point was that although Joda Time uses ReadableInstant etc, the current proposals for the new date/time API which will be included in Java 7, designed by the author of Joda Time, don’t use this convention. Presumably the author found it didn’t work quite as well as he’d hoped, although I don’t know of any specific problems.

Conclusion

You’ll probably be unsurprised to hear that I don’t have a recipe for coming up with good names. However, in thinking about naming I’ve at least worked out a few points to think about:

  • Context is important: how discoverable does this need to be? Is accuracy more important than brevity? Do you have any example uses (e.g. through tests) which can help to see whether the code feels right or not?
  • Think of your audience. How familiar will they be with the rest of the code you’re writing? Are they likely to have a background in other areas of computer science where you could steal terminology? Can you make your name consistent with other common frameworks they’re likely to use? The reverse is true too: are you reusing a familiar name for a different concept, which could confuse readers?
  • Work out the information the name is trying to convey. For types, this includes working out how it participates in inheritance. Is it trying to advertise capabilities or restrictions?
  • Is it possible to make correct code look correct, and buggy code look wrong? This is rarely feasible, but it’s one of the main attractions of “Plus” in the benchmark case. (I believe this is one of the main selling points of true Hungarian Notation for variable naming, by the way. I’m not generally a fan, but I like this aspect.)

I may expand this list over time…

I think it’s fitting to close with a quote from Phil Karlton:

There are only two hard things in Computer Science: cache invalidation and naming things.

Almost all of us have to handle naming things. Let’s hope most of us don’t have to mess with cache invalidation as well.

Designing LINQ operators

I’ve started a small project (I’ll post a link when I’ve actually got something worthwhile to show) with some extra LINQ operators in – things which I think are missing from LINQ to Objects, basically. (I hope to include many of the ideas from an earlier blog post.) That, and a few Stack Overflow questions where I’ve effectively written extra LINQ operators and compared them with other solutions, have made me think about the desirable properties of a LINQ operator – or at least the things you should think about when implementing one. My thoughts so far:

Lazy/eager execution

If you’re returning a sequence (i.e. another IEnumerable<T> or similar) the execution should almost certainly be lazy, but the parameter checking should be eager. Unfortunately with the limitations of the (otherwise wonderful) C# iterator blocks, this usually means breaking the method into two, like this:

public static IEnumerable<T> Where<T>(this IEnumerable<T> source,
                                      Func<T, bool> predicate)
{
    // Eagerly executed
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return WhereImpl(source, predicate);
}

private static IEnumerable<T> WhereImpl<T>(IEnumerable<T> source,
                                           Func<T, bool> predicate)
{
    // Lazily executed
    foreach (T element in source)
    {
        if (predicate(element))
        {
            yield return element;
        }
    }
}

Obviously aggregates and conversions (Max, ToList etc) are generally eager anyway, within normal LINQ to Objects. (Just about everything in Push LINQ is lazy. They say pets look like their owners…)

Streaming/buffering

One of my favourite features of LINQ to Objects (and one which doesn’t get nearly the publicity of deferred execution) is that many of the operators stream the data. In other words, they only consume data when they absolutely have to, and they yield data as soon as they can. This means you can process vast amounts of data with very little memory usage, so long as you use the right operators. Of course, not every operator can stream (reversing requires buffering, for example) but where it’s possible, it’s really handy.

Unfortunately, the streaming/buffering nature of operators isn’t well documented in MSDN – and sometimes it’s completely wrong. As I’ve noted before, the docs for Enumerable.Intersect claim that it reads the whole of both sequences (first then second) before yielding any data. In fact it reads and buffers the whole of second, then streams first, yielding intersecting elements as it goes. I strongly encourage new LINQ operators to document their streaming/buffering behaviour (accurately!). This will limit future changes in the implementation admittedly (Intersect can be implemented in a manner where both inputs are streamed, for example) but in this case I think the extra guarantees provided by the documentation make up for that restriction.
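
To make the documented behaviour concrete, here’s a sketch which behaves like the real Intersect – buffer second, stream first – though it isn’t Microsoft’s code:

private static IEnumerable<T> IntersectImpl<T>(IEnumerable<T> first,
                                               IEnumerable<T> second)
{
    // Buffer the whole of "second" up front...
    HashSet<T> potentialElements = new HashSet<T>(second);
    // ... then stream "first", yielding intersecting elements as we go.
    foreach (T element in first)
    {
        // Remove guarantees each distinct result is only yielded once
        if (potentialElements.Remove(element))
        {
            yield return element;
        }
    }
}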

Once-only evaluation

When I said that reversing requires buffering earlier on, I was sort of lying. Here’s an implementation of Reverse which doesn’t buffer any data anywhere:

public static IEnumerable<T> StreamingReverse<T>(this IEnumerable<T> source)
{
    // Error checking omitted for brevity
    int count = source.Count();
    for (int i = count - 1; i >= 0; i--)
    {
        yield return source.ElementAt(i);
    }
}

If we assume we can read the sequence as often as we like, then we never need to buffer anything – just treat it as a random-access list. I hope I don’t have to tell you that’s a really, really bad idea. Leaving aside the blatant inefficiency even for sequences like lists which are cheap to iterate over, some sequences are inherently once-only (think about reading from a network stream) and some are inherently costly to iterate over (think about lines in a big log file – or the result of an ordering).

I suspect that developers using LINQ operators assume that they’ll only read the input data once. That’s a good assumption – wherever possible, we ought to make sure that it’s correct, and if we absolutely can’t help evaluating a sequence twice (and I can’t remember any times when I’ve really wanted to do that) we should document it in large, friendly letters.

Mind your complexity

In some ways, this falls out of “try to stream, and try to only read once” – if you’re not storing any data and you’re only reading each item once, it’s quite hard to come up with an operator which isn’t just O(n) for a single sequence. It is worth thinking about though – particularly as most of the LINQ operators can work with large amounts of data. For example, to find the smallest element in a sequence you can either sort the whole sequence and take the first element of the result or you can keep track of a “current minimum” and iterate through the whole sequence. Clearly the latter saves a lot of complexity (and doesn’t require buffering) – so don’t just take the first idea that comes into your head. (Or at least, start with that and then think how you could improve it.)
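
As a sketch of that second approach (the real Min has many overloads and handles nullable types and custom comparers, so this is illustrative only):

// Streaming, once-only, O(n), no buffering
public static T SimpleMin<T>(this IEnumerable<T> source) where T : IComparable<T>
{
    using (IEnumerator<T> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        T current = iterator.Current;
        while (iterator.MoveNext())
        {
            // Keep track of the "current minimum" as we go
            if (iterator.Current.CompareTo(current) < 0)
            {
                current = iterator.Current;
            }
        }
        return current;
    }
}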

Again, documenting the complexity of the operator is a good idea, and call particular attention to anything which is unintuitively expensive.

Conclusion

Okay, so there’s nothing earth-shattering here. But the more I use LINQ to answer Stack Overflow questions, and the more I invent new operators in the spirit of the existing ones, the more powerful I think it is. It’s amazing how powerful it can be, and how ridiculously simple the code (sometimes) looks afterwards. It’s not like the operator implementation is usually hard, either – it’s just a matter of thinking of the right concepts. I’m going to try to follow these principles when I implement my extra operator library, and I hope you’ll bear them in mind too, should you ever feel that LINQ to Objects doesn’t have quite the extension method you need…

You don’t have to use query expressions to use LINQ

LINQ is clearly gaining a fair amount of traction, given the number of posts I see about it on Stack Overflow. However, I’ve noticed an interesting piece of coding style: a lot of developers are using query expressions for every bit of LINQ they write, however trivial.

Now, don’t get the wrong idea – I love query expressions as a helpful piece of syntactic sugar. For instance, I’d always pick the query expression form over the “dot notation” form for something like this:

var query = from file in Directory.GetFiles(logDirectory, "*.log")
            from line in new LineReader(file)
            let entry = new LogEntry(line)
            where entry.Severity == Severity.Error
            select file + ": " + entry.Message;

(Yes, it’s yet another log entry example – it’s one of my favourite demos of LINQ, and particularly Push LINQ.) The equivalent code using just the extension methods would be pretty ugly, especially given the various range variables and transparent identifiers involved.

However, look at these two queries instead:

var query = from person in people
            where person.Salary > 10000m
            select person;

var dotNotation = people.Where(person => person.Salary > 10000m);

In this case, we’re just making a single method call. Why bother with three lines of query expression? If the query becomes more complicated later, it can easily be converted into a query expression at that point. The two queries are exactly the same, even though the syntax is different.

My guess is that there’s a “black magic” fear of LINQ – many developers know how to write query expressions, but aren’t confident about what they’re converted into (or even the basics of what the translation process is like in the first place). Most of the C# 3.0 and LINQ books that I’ve read do cover query expression translation to a greater or lesser extent, but it’s rarely given much prominence.

I suspect the black magic element is reinforced by the inherent “will it work?” factor of LINQ to SQL – you get to write the query in your favourite language, but you may well not be confident in it working until you’ve tried it; there will always be plenty of little gotchas which can’t be picked up at compile time. With LINQ to Objects, there’s a lot more certainty (at least in my experience). However, the query expression translation shouldn’t be part of what developers are wary of. It’s clearly defined in the spec (not that I’m suggesting that all developers should learn it via the spec) and benefits from being relatively dumb and therefore easy to predict.

So next time you’re writing a query expression, take a look at it afterwards – if it’s simple, try writing it without the extra syntactic sugar. It may just be sweet enough on its own.

November 19th: London .NET User Group, Push LINQ!

On November 19th, I’ll be speaking at the London .NET User Group about Push LINQ. I was quite pleasantly surprised by being able to explain it to some extent in Copenhagen, and this evening will be entirely about Push LINQ, so I’ll be able to go into a lot more detail. Skills Matter will be hosting the event (near Farringdon station). It starts at 6.30pm, and registration is now open.

Should be fun. Please come and heckle.