Reimplementing LINQ to Objects: Part 26d – Fixing the key selectors, and yielding early

I feel I need a voice over. "Previously, on reimplementing LINQ to Objects…" Well, we’d got as far as a working implementation of OrderedEnumerable which didn’t have terrible performance – unless you had an expensive key selector. Oh, and it didn’t make use of the fact that we may only want the first few results.

Executing key selectors only once

Our first problem is to do with the key selectors. For various reasons (mentioned in part 26b) life is better if we execute the key selector once per input element. While we can do that with lazy evaluation, it makes more sense in my opinion to do it up-front. That means we need to separate out the key selector from the key comparer – in other words, we need to get rid of the handy ProjectionComparer we used to simplify the arguments to OrderBy/ThenBy/etc.

If we’re going to keep the key selectors in a strongly typed way, that means our OrderedEnumerable (or at least some type involved in the whole business) needs to become generic in the key type. Let’s bite the bullet and give OrderedEnumerable a second type parameter for the key. Right away we have a slight problem: the "CreateOrderedEnumerable" method is generic, introducing a new type parameter TKey… so we shouldn’t use TKey as the name of the new type parameter for OrderedEnumerable. We could rename the type parameter in the generic method implementation, but I’m becoming a big believer in leaving the signatures of methods alone when I implement an interface. For type parameters it’s not too bad, but for normal parameters it can be awful if you mess around with the names – particularly for those using named arguments.

Thinking ahead, our single "key" type parameter in OrderedEnumerable could well end up being a composite key. After all, if we have OrderBy(…).ThenBy(…).ThenBy(…) we’re going to have to have some way of representing the key formed by the three selectors. It makes sense to use a "nested" key type, where the key type of OrderedEnumerable is always the "composite key so far". Thus I named the type parameter TCompositeKey, and introduced an appropriate field. Here’s the skeleton of the new class:

internal class OrderedEnumerable<TElement, TCompositeKey> : IOrderedEnumerable<TElement>
{
    private readonly IEnumerable<TElement> source;
    private readonly Func<TElement, TCompositeKey> compositeSelector;
    private readonly IComparer<TCompositeKey> compositeComparer;

    internal OrderedEnumerable(IEnumerable<TElement> source,
        Func<TElement, TCompositeKey> compositeSelector,
        IComparer<TCompositeKey> compositeComparer)
    {
        this.source = source;
        this.compositeSelector = compositeSelector;
        this.compositeComparer = compositeComparer;
    }

    // Interface implementations here
}

(I’m aware this is very "stream of consciousness" – I’m assuming that presenting the decisions in the order in which I addressed them is a good way of explaining the necessary changes. Apologies if the style doesn’t work for you.)

ThenBy and ThenByDescending don’t have to change at all – they were already just using the interface. OrderBy and OrderByDescending become a little simpler, as we don’t need to build the projection comparer. Here’s the new version of OrderBy:

public static IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    return new OrderedEnumerable<TSource, TKey>
        (source, keySelector, comparer ?? Comparer<TKey>.Default);
}

Lovely – we just call a constructor, basically.

So far, so good. Now what about the implementation of IOrderedEnumerable? We should expect this to get messy, because there are three types of key involved:

  • The current composite key type (TCompositeKey)
  • The secondary key type introduced by the new comparison (the TKey of CreateOrderedEnumerable)
  • The new composite key type which combines the two

Currently we don’t even have a type which can represent the composite key. We could use something like KeyValuePair<TKey, TValue>, but that doesn’t really give the right impression. Instead, let’s create our own simple type:

internal struct CompositeKey<TPrimary, TSecondary>
{
    private readonly TPrimary primary;
    private readonly TSecondary secondary;

    internal TPrimary Primary { get { return primary; } }
    internal TSecondary Secondary { get { return secondary; } }

    internal CompositeKey(TPrimary primary, TSecondary secondary)
    {
        this.primary = primary;
        this.secondary = secondary;
    }
}

Now we can easily create a projection from two key selectors to a new one which selects a composite key. However, we’ll need to do the same thing for a comparer. We could use the CompoundComparer class we created before, but that will end up with quite a bit of indirection. Instead, it would be nice to have a type to work directly with CompositeKey – something which knew it was dealing with comparers of different types, one for each part of the key.
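To make the selector half of that concrete, here’s a small illustrative sketch (the Person type and the two selectors are invented purely for the example):

// Hypothetical selectors: ordering people by last name, then by first name
Func<Person, string> primarySelector = p => p.LastName;
Func<Person, string> secondarySelector = p => p.FirstName;

// Compose them into a single selector producing a composite key
Func<Person, CompositeKey<string, string>> compositeSelector =
    p => new CompositeKey<string, string>(primarySelector(p), secondarySelector(p));

The comparer side needs the same kind of composition, which is where the next piece comes in.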

We could create a completely separate top-level type for that… but specifying the type parameters again seems a bit daft when we can reuse them by simply creating a nested class within CompositeKey:

internal struct CompositeKey<TPrimary, TSecondary>
{
    // Other members as shown above

    internal sealed class Comparer : IComparer<CompositeKey<TPrimary, TSecondary>>
    {
        private readonly IComparer<TPrimary> primaryComparer;
        private readonly IComparer<TSecondary> secondaryComparer;

        internal Comparer(IComparer<TPrimary> primaryComparer,
                          IComparer<TSecondary> secondaryComparer)
        {
            this.primaryComparer = primaryComparer;
            this.secondaryComparer = secondaryComparer;
        }

        public int Compare(CompositeKey<TPrimary, TSecondary> x,
                           CompositeKey<TPrimary, TSecondary> y)
        {
            int primaryResult = primaryComparer.Compare(x.Primary, y.Primary);
            if (primaryResult != 0)
            {
                return primaryResult;
            }
            return secondaryComparer.Compare(x.Secondary, y.Secondary);
        }
    }
}

This may look a little odd to begin with, but the two types really are quite deeply connected.

Now that we can compose keys in terms of both selection and comparison, we can implement CreateOrderedEnumerable:

public IOrderedEnumerable<TElement> CreateOrderedEnumerable<TKey>(
    Func<TElement, TKey> keySelector,
    IComparer<TKey> comparer,
    bool descending)
{
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    comparer = comparer ?? Comparer<TKey>.Default;
    if (descending)
    {
        comparer = new ReverseComparer<TKey>(comparer);
    }

    // Copy to a local variable so we don’t need to capture "this"
    Func<TElement, TCompositeKey> primarySelector = compositeSelector;
    Func<TElement, CompositeKey<TCompositeKey, TKey>> newKeySelector = 
        element => new CompositeKey<TCompositeKey, TKey>(primarySelector(element), keySelector(element));

    IComparer<CompositeKey<TCompositeKey, TKey>> newKeyComparer =
        new CompositeKey<TCompositeKey, TKey>.Comparer(compositeComparer, comparer);

    return new OrderedEnumerable<TElement, CompositeKey<TCompositeKey, TKey>>
        (source, newKeySelector, newKeyComparer);
}

I’m not going to pretend that the second half of the method is anything other than ghastly. I’m not sure I’ve ever written code which is so dense in type arguments. IComparer<CompositeKey<TCompositeKey, TKey>> is a particularly "fine" type. Ick.

However, it works – and once you’ve got your head round what each of the type parameters actually means at any one time, it’s not really complicated code – it’s just verbose and clunky.

The only bit which might require a bit of explanation is the primarySelector variable. I could certainly have just used compositeSelector within the lambda expression used to create the new key selector – it’s not like it’s going to change, after all. The memory benefits of not having a reference to "this" (where the intermediate OrderedEnumerable is likely to be eligible for garbage collection immediately, in a typical OrderBy(…).ThenBy(…) call) are almost certainly not worth worrying about. It just feels right to have both the primary and secondary key selectors captured by the same compiler-generated class, which is what happens with the current code. They’re both local variables, they’ll be captured together, all will be well.

I hope you can see the parallel between the old code and the new code. Previously we composed a new (element-based) comparer based on the existing comparer, and a projection comparer from the method parameters. Now we’re composing a new key selector and a new key comparer. It’s all the same idea, just maintaining the split between key selection and key comparison.

Now let’s sort…

So far, we haven’t implemented GetEnumerator – and that’s the only piece left. As soon as we’ve done that to our satisfaction, we’re finished with ordering.

There are several approaches to how we could sort. Here are a few of them:

  • Project each element to its key, and create a KeyValuePair for each item. Merge sort in the existing way to achieve stability. This will involve copying a lot of data around – particularly if the element and key types end up being large value types.
  • Project each element to a { key, index } pair, and create another composite comparer which uses the index as a tie-breaker to achieve stability. This still involves copying keys around, but it means we could easily use a built-in sort (such as List<T>.Sort).
  • Project each element to a key, and separately create an array of indexes (0, 1, 2, 3…). Sort the indexes by accessing the relevant key at any point, using indexes as tie-breakers. This requires a more fiddly sort, as we need to keep indexing into the indexes array.
  • Build up "chunks" of sorted data as we read it in, keeping some number of chunks and merging them appropriately when we want to. We can then yield the results without ever performing a full sort, by effectively performing the "merge" operation of merge sort, just yielding values instead of copying them to temporary storage. (Obviously this is trivial with 2 chunks, but can be extended to more.)
  • Do something involving a self-balancing binary tree :)

I decided to pick the middle option, using quicksort as the sorting algorithm. This comes with the normal problems of possibly picking bad pivots, but it’s usually a reasonable choice. I believe there are cunning ways of improving the worst-case performance, but I haven’t implemented any of those.

Here’s the non-quicksort part of the code, just to set the scene.

public IEnumerator<TElement> GetEnumerator()
{
    // First copy the elements into an array: don’t bother with a list, as we
    // want to use arrays for all the swapping around.
    int count;
    TElement[] data = source.ToBuffer(out count);

    int[] indexes = new int[count];
    for (int i = 0; i < indexes.Length; i++)
    {
        indexes[i] = i;
    }

    TCompositeKey[] keys = new TCompositeKey[count];
    for (int i = 0; i < keys.Length; i++)
    {
        keys[i] = compositeSelector(data[i]);
    }

    QuickSort(indexes, keys, 0, count - 1);

    for (int i = 0; i < indexes.Length; i++)
    {
        yield return data[indexes[i]];
    }
}

I could certainly have combined the first two loops – I just liked the separation provided in this code. One tiny micro-optimization point to note is that for each loop I’m using the Length property of the array rather than "count" as the upper bound, as I believe that will reduce the amount of array boundary checking the JIT will generate. I very much doubt that it’s relevant, admittedly :) I’ve left the code here as it is in source control – but looking at it now, I could certainly have used a foreach loop on the final yield part. We wouldn’t be able to later, admittedly… but I’ll come to that all in good time.

The actual quicksort part is reasonably standard except that I pass in both arrays – the indexes and the keys – whereas usually there’s just the one array being sorted. Here’s the code for both the recursive call and the partition part:

private void QuickSort(int[] indexes, TCompositeKey[] keys, int left, int right)
{
    if (right > left)
    {
        int pivot = left + (right - left) / 2;
        int pivotPosition = Partition(indexes, keys, left, right, pivot);
        QuickSort(indexes, keys, left, pivotPosition - 1);
        QuickSort(indexes, keys, pivotPosition + 1, right);
    }
}

private int Partition(int[] indexes, TCompositeKey[] keys, int left, int right, int pivot)
{
    // Remember the current index (into the keys/elements arrays) of the pivot location
    int pivotIndex = indexes[pivot];
    TCompositeKey pivotKey = keys[pivotIndex];

    // Swap the pivot value to the end
    indexes[pivot] = indexes[right];
    indexes[right] = pivotIndex;
    int storeIndex = left;
    for (int i = left; i < right; i++)
    {
        int candidateIndex = indexes[i];
        TCompositeKey candidateKey = keys[candidateIndex];
        int comparison = compositeComparer.Compare(candidateKey, pivotKey);
        if (comparison < 0 || (comparison == 0 && candidateIndex < pivotIndex))
        {
            // Swap storeIndex with the current location
            indexes[i] = indexes[storeIndex];
            indexes[storeIndex] = candidateIndex;
            storeIndex++;
        }
    }
    // Move the pivot to its final place
    int tmp = indexes[storeIndex];
    indexes[storeIndex] = indexes[right];
    indexes[right] = tmp;
    return storeIndex;
}

It’s interesting to observe how similar the quicksort and merge sort recursive parts are – both picking a midpoint, recursing on the left of it, recursing on the right of it, and performing some operation on the whole sublist. Of course the "some operation" is very different between partition and merge, and it occurs at a different time – but it’s an interesting parallel nonetheless.

One significant difference between merge sort and quicksort is the use of the pivot. Once Partition has returned where the pivot element ended up, quicksort doesn’t touch that element itself (we already know it will be in the right place). It recurses on the sublist entirely to the left of the pivot and the sublist entirely to the right of the pivot. Compare this with merge sort, which recurses on two sublists which together comprise the whole list for that call.

The overloading of the word "index" here is unfortunate, but that’s just the way it goes. Both sorts of "index" here really are indexes… you just need to keep an eye on which is which.

The final point to note is how we’re using the indexes in the comparison, as a tie-break to keep stability. It’s an ugly expression, but it does the job.

(As a small matter of language, I wasn’t sure whether to use indexes or indices. I far prefer the former, so I used it. Having just checked in the dictionary, it appears both are correct. This reminds me of when I was writing C# in Depth – I could never decide between appendixes and appendices. Blech.)

Now, do you want to hear the biggest surprise I received last night? After I’d fixed up the compile-time errors to arrive at the code above, it worked first time. I’m not kidding. I’m not quite sure how I pulled that off (merge sort didn’t take long either, but it did at least have a few tweaks to fix up) but it shocked the heck out of me. So, are we done? Well, not quite.

Yielding early

Just as a reminder, one of my aims was to be able to use iterator blocks to return some values to anyone iterating over the result stream without having to do all the sorting work. This means that in the case of calling OrderBy(…).Take(5) on a large collection, we can end up saving a lot of work… I hope!
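As a purely illustrative example of the kind of query that benefits (where "people" is just a hypothetical collection):

// Only the first five elements in key order ever need to be yielded,
// so most of the sorting work can be skipped if we yield early.
var topFive = people.OrderBy(p => p.LastName)
                    .Take(5);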

This is currently fairly normal quicksort code, leaving the "dual arrays" aspect aside… but it’s not quite amenable to early yielding. We’re definitely computing the earliest results first, due to the order of the recursion – but we can’t yield from the recursive method – iterator blocks just don’t do that.

So, we’ll have to fake the recursion. Fortunately, quicksort is only directly recursive – we don’t need to worry about mutually recursive routines: A calling B which might call C or it might call back to A, etc. Instead, we can just keep a Stack<T> of "calls" to quicksort that we want to make, and execute the appropriate code within our GetEnumerator() method, so we can yield at the right point. Now in the original code, quicksort has four parameters, so you might expect our Stack<T> to have those four values within T too… but no! Two of those values are just the keys and indexes… and we already have those in two local variables. We only need to keep track of "right" and "left". Again, for the sake of clarity I decided to implement this using a custom struct – nested within OrderedEnumerable as there’s no need for it to exist anywhere else:

private struct LeftRight
{
    internal int left, right;
    internal LeftRight(int left, int right)
    {
        this.left = left;
        this.right = right;
    }
}

Purists amongst you may curse at the use of internal fields rather than properties. I’m not bothered – this is a private struct, and we’re basically using it as a tuple. Heck, I would have used anonymous types if it weren’t for two issues:

  • I wanted to use Stack<T>, and there’s no way of creating one of those for an anonymous type (without introducing more generic methods to use type inference)
  • I wanted to use a struct – we’ll end up creating a lot of these values, and there’s simply no sense in them being individual objects on the heap. Anonymous types are always classes.

So, as a first step we can transform our code to use this "fake recursion" but still yield at the very end:

var stack = new Stack<LeftRight>();
stack.Push(new LeftRight(0, count - 1));
while (stack.Count > 0)
{
    LeftRight leftRight = stack.Pop();
    int left = leftRight.left;
    int right = leftRight.right;
    if (right > left)
    {
        int pivot = left + (right - left) / 2;
        int pivotPosition = Partition(indexes, keys, left, right, pivot);
        stack.Push(new LeftRight(pivotPosition + 1, right));
        stack.Push(new LeftRight(left, pivotPosition - 1));
    }
}

for (int i = 0; i < indexes.Length; i++)
{
    yield return data[indexes[i]];
}

We initially push a value of (0, count - 1) to simulate the call to QuickSort(0, count - 1) which started it all before. The code within the loop is very similar to the original QuickSort method, with three changes:

  • We have to grab the next value of LeftRight from the stack, and then separate it into left and right values
  • Instead of calls to QuickSort, we have calls to stack.Push
  • We’ve reversed the order of the recursive calls: in order to sort the left sublist first, we have to push it onto the stack last.

Happy so far? We’re getting very close now. All we need to do is work out when to yield. This is the bit which caused me the most headaches, until I worked out that the "if (right > left)" condition really meant "if we’ve got work to do"… and we’re interested in the exact opposite scenario – when we don’t have any work to do, as that means everything up to and including "right" is already sorted. There are two situations here: either right == left, i.e. we’re sorting one element, or right == left - 1, which will occur if we picked a pivot which was the maximum or minimum value in the list at the previous recursive step.

It’s taken me a little bit of thinking (and just running the code) to persuade myself that we will always naturally reach a situation where we end up seeing right == count - 1 and right <= left, i.e. a place where we know we’re completely done. But it’s okay – it does happen.

It’s not just a case of yielding the values between left and right though – because otherwise we’d never yield a pivot. Remember how I pointed out that quick sort missed out the pivot when specifying the sublists to recurse into? Well, that’s relevant here. Fortunately, it’s really easy to work out what to do. Knowing that everything up to and including "right" has been sorted means we just need to keep a cursor representing the next index to yield, and then just move that cursor up until it’s positioned beyond "right". The code is probably easier to understand than the description:

int nextYield = 0;

var stack = new Stack<LeftRight>();
stack.Push(new LeftRight(0, count - 1));
while (stack.Count > 0)
{
    LeftRight leftRight = stack.Pop();
    int left = leftRight.left;
    int right = leftRight.right;
    if (right > left)
    {
        int pivot = left + (right - left) / 2;
        int pivotPosition = Partition(indexes, keys, left, right, pivot);
        // Push the right sublist first, so that we *pop* the
        // left sublist first
        stack.Push(new LeftRight(pivotPosition + 1, right));
        stack.Push(new LeftRight(left, pivotPosition - 1));
    }
    else
    {
        while (nextYield <= right)
        {
            yield return data[indexes[nextYield]];
            nextYield++;
        }
    }
}

Tada! It works (at least according to my tests).

I have tried optimizing this a little further, to deal with the case when right == left + 1, i.e. we’re only sorting two elements. It feels like that ought to be cheaper to do explicitly than via pivoting and adding two pointless entries to the stack… but the code gets a lot more complicated (to the point where I had to fiddle significantly to get it working) and from what I’ve seen, it doesn’t make much performance difference. Odd. If this were a production-quality library to be used in performance-critical situations I’d go further in the testing, but as it is, I’m happy to declare victory at this point.

Performance

So, how well does it perform? I’ve only performed crude tests, and they perplex me somewhat. I’m sure that last night, when I was running the "yield at the end" code, my tests were running twice as slowly in Edulinq as in LINQ to Objects. Fair enough – this is just a hobby, Microsoft have no doubt put a lot of performance testing effort into this. (That hasn’t stopped them from messing up "descending" comparers, admittedly, as I found out last night to my amusement.) That was on my "meaty" laptop (which is 64-bit with a quad core i7). On my netbook this morning, the same Edulinq code seemed to be running slightly faster than LINQ to Objects. Odd.

This evening, having pulled the "early out" code from the source repository, the Edulinq implementation is running faster than the LINQ to Objects implementation even when the "early out" isn’t actually doing much good. That’s just plain weird. I blame my benchmarking methodology, which is far from rigorous. I’ve tweaked the parameters of my tests quite a bit, but I haven’t tried all kinds of different key and element types, etc. The basic results are very roughly:

  • When evaluating the whole ordered list, Edulinq appears to run about 10% faster than LINQ to Objects
  • When evaluating only the top 5 of a large ordered list, Edulinq can be much faster. How much faster depends on the size of the list of course, and it still has to perform the initial complete partitioning step – but on 100,000 items it’s regularly about 10x faster than LINQ to Objects.

That makes me happy :) Of course, the code is all open source, so if Microsoft wish to include the Edulinq implementation in .NET 5, they’re quite at liberty to do so, as long as they abide by the terms of the licence. I’m not holding my breath ;)

More seriously, I fully expect there are a bunch of scenarios where my knocked-up-in-an-evening code performs slower than that in the framework. Maybe my approach takes a lot more memory. Maybe it has worse locality of reference in some scenarios. There are all kinds of possibilities here. Full performance analysis was never meant to be the goal of Edulinq. I’m doing this in the spirit of learning more about LINQ – but it’s fun to try to optimize just a little bit. I’m going to delete the increasingly-inaccurately-named MergeSortTest project now – I may institute a few more benchmarks later on though. I’m also removing CompoundComparer and ProjectionComparer, which are no longer used. They’ll live on in part 26a though…

Conclusions

Well that was fun, wasn’t it? I’m pretty pleased with the result. The final code has some nasty generic complexity in it, but it’s not too bad if you keep all the types clear in your mind.

None of the remaining operators will be nearly as complex as this, unless I choose to implement AsQueryable (which I wasn’t planning on doing). On the other hand, as I’ve mentioned before, Max/Sum/etc have oodles of overloads. While I’ll certainly implement all of them, I’m sure I’ll only present the code for selected interesting overloads.

As a bit of light relief, I think I’ll tackle Reverse. That’s about as simple as it gets – although it could still present some interesting options.

Addendum

An earlier version of this post (and the merge sort implementation) had a flawed piece of code for choosing the pivot. Here are both the old and the new versions:

// Old code
int pivot = (left + right) / 2;

// New code
int pivot = left + (right - left) / 2;

The difference is whether or not the code can overflow when left and right are very large. Josh Bloch wrote about it back in 2006. A colleague alerted me to this problem shortly after posting, but it’s taken until now to correct it. (I fixed the source repository almost immediately, but deferred writing this addendum.) Why was I not too worried? Because .NET restricts each object to be less than 2GB in size, even in .NET 4.0, even on a 64-bit CLR. As we’ve created an array of integers, one per entry, that means we can only have just under (int.MaxValue / 4) elements. Within those limits, there’s no problem in the original pivot code. However, it’s still worth fixing of course – one never knows when the restriction will be lifted. The CLR team blogged about the issue back in 2005 (when the 64-bit CLR was new) – I haven’t seen any mentions of plans to remove the limitation, but I would imagine it’s discussed periodically.
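To make the failure mode concrete, here’s a quick illustration; the values are picked just for the example, and the arithmetic assumes the default unchecked context:

int left = 1500000000;
int right = 2000000000;

// left + right exceeds int.MaxValue, so it wraps to a negative number,
// leaving the "pivot" wildly out of range.
int badPivot = (left + right) / 2;          // -397483648

// The subtraction can't overflow for non-negative indexes, so this stays sane.
int goodPivot = left + (right - left) / 2;  // 1750000000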

One oddity about this is that the Array class itself has some API support for large arrays, such as the LongLength property. To be honest, I can’t see large arrays ever being particularly pleasant to work with – what would they return for the normal Length property, for example, or their implementation of IList<T> etc? I suspect we may see support for larger objects before we see support for arrays with more than int.MaxValue elements, but that’s a complete guess.

Reimplementing LINQ to Objects: Part 26c – Optimizing OrderedEnumerable

Part 26b left us with a working implementation of the ordering operators, with two caveats:

  • The sort algorithm used was awful
  • We were performing the key selection on every comparison, instead of once to start with

Today’s post is just going to fix the first bullet – although I’m pretty sure that fixing the second will require changing it again completely.

Choosing a sort algorithm

There are lots of sort algorithms available. In our case, we need the eventual algorithm to:

  • Work on arbitrary pair-based comparisons
  • Be stable
  • Go like the clappers :)
  • (Ideally) allow the first results to be yielded without performing all the sorting work, and without affecting the performance in cases where we do need all the results.

The final bullet is an interesting one to me: it’s far from unheard of to want to get the "top 3" results from an ordered query. In LINQ to Objects we can’t easily tell the Take operator about the OrderBy operator so that it could pass on the information, but we can potentially yield the first results before we’ve sorted everything. (In fact, we could add an extra interface specifically to enable this scenario, but it’s not part of normal LINQ to Objects, and could introduce horrible performance effects with innocent-looking query changes.)

If we decide to implement sorting in terms of a naturally stable algorithm, that limits the choices significantly. I was rather interested in timsort, and may one day set about implementing it – but it looked far too complicated to introduce just for the sake of Edulinq.

The best bet seemed to be merge sort, which is reasonably easy to implement and has reasonable efficiency too. It requires extra memory and a fair amount of copying, but we can probably cope with that.

We don’t have to use a stable sort, of course. We could easily regard our "key" as the user-specified key plus the original index, and use that index as a final tie-breaker when comparing elements. That gives a stable result while allowing us to use any sorting algorithm we want. This may well be the approach I take eventually – especially as quicksort would allow us to start yielding results early in a fairly simple fashion. For the moment though, I’ll stick with merge sort.
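As a rough sketch of that "key plus original index" idea (this isn’t Edulinq code, and the type names are invented for the illustration):

// Pair each key with the element's original position in the input.
internal struct KeyWithIndex<TKey>
{
    internal readonly TKey Key;
    internal readonly int Index;

    internal KeyWithIndex(TKey key, int index)
    {
        Key = key;
        Index = index;
    }
}

// Compares by key first, falling back to the original index so that equal
// keys keep their input order: any sort algorithm then behaves stably.
internal sealed class StableKeyComparer<TKey> : IComparer<KeyWithIndex<TKey>>
{
    private readonly IComparer<TKey> keyComparer;

    internal StableKeyComparer(IComparer<TKey> keyComparer)
    {
        this.keyComparer = keyComparer;
    }

    public int Compare(KeyWithIndex<TKey> x, KeyWithIndex<TKey> y)
    {
        int result = keyComparer.Compare(x.Key, y.Key);
        return result != 0 ? result : x.Index.CompareTo(y.Index);
    }
}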

Preparing for merge sort

Just looking at the algorithm for merge sort, it’s obvious that there will be a good deal of shuffling data around. As we want to make the implementation as fast as possible, it makes sense to use arrays to store the data. We don’t need dynamic space allocation (after we’ve read all the data in, anyway) or any of the other features associated with higher-level collections. I’m aware that arrays are considered (somewhat) harmful, but purely for the internals of an algorithm which does so much data access, I believe they’re the most appropriate solution.

We don’t even need our arrays to be the right size – assuming we need to read in all the data before we start processing it (which will be true for this implementation of merge sort, but not for some other algorithms I may consider in the future) it’s fine to use an oversized array as temporary storage – it’s never going to be seen by the users, after all.

We’ve already got code which reads in all the data into a possibly-oversized array though – in the optimized ToArray code. So my first step was to extract out that functionality into a new internal extension method. This has to return a buffer containing all the data and give us an indication of the size. In .NET 4 I could use Tuple to return both pieces of data, but we can also just use an out parameter – I’ve gone for the latter approach at the moment. Here’s the ToBuffer extension method:

internal static TSource[] ToBuffer<TSource>(this IEnumerable<TSource> source, out int count)
{
    // Optimize for ICollection<T>
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
    {
        count = collection.Count;
        TSource[] tmp = new TSource[count];
        collection.CopyTo(tmp, 0);
        return tmp;
    }

    // We’ll have to loop through, creating and copying arrays as we go
    TSource[] ret = new TSource[16];
    int tmpCount = 0;
    foreach (TSource item in source)
    {
        // Need to expand…
        if (tmpCount == ret.Length)
        {
            Array.Resize(ref ret, ret.Length * 2);
        }
        ret[tmpCount++] = item;
    }
    count = tmpCount;
    return ret;
}

Note that I’ve used a local variable to keep track of the count in the loop near the end, only copying it into the output variable just before returning. This is due to a possibly-unfounded performance concern: we don’t know where the variable will actually "live" in storage – and I’d rather not cause some arbitrary page of heap memory to be required all the way through the loop. This is a gross case of micro-optimization without evidence, and I’m tempted to remove it… but I thought I’d at least share my thinking.

This is only an internal API, so I’m trusting callers not to pass me a null "source" reference. It’s possible that it would be a useful operator to expose at some point, but not just now. (If it were public, I would definitely use a local variable in the loop – otherwise callers could get weird effects by passing in a variable which could be changed elsewhere – such as due to side-effects within the loop. That’s a totally avoidable problem, simply by using a local variable. For an internal API, I just need to make sure that I don’t do anything so silly.)

Now ToArray needs to be changed to call ToBuffer, which is straightforward:

public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    int count;
    TSource[] ret = source.ToBuffer(out count);
    // Now create another copy if we have to, in order to get an array of the
    // right size
    if (count != ret.Length)
    {
        Array.Resize(ref ret, count);
    }
    return ret;
}

Then we can prepare our OrderedEnumerable.GetEnumerator method for merging:

public IEnumerator<TElement> GetEnumerator()
{
    // First copy the elements into an array: don’t bother with a list, as we
    // want to use arrays for all the swapping around.
    int count;
    TElement[] data = source.ToBuffer(out count);
    TElement[] tmp = new TElement[count];
            
    MergeSort(data, tmp, 0, count - 1);
    for (int i = 0; i < count; i++)
    {
        yield return data[i];
    }
}

The "tmp" array is for use when merging – while there is an in-place merge sort, it’s more complex than the version where the "merge" step merges two sorted lists into a combined sorted list in temporary storage, then copies it back into the original list.

The arguments of 0 and count - 1 indicate that we want to sort the whole list – the parameters to my MergeSort method take the "left" and "right" boundaries of the sublist to sort – both of which are inclusive. Most of the time I’m more used to using exclusive upper bounds, but all the algorithm descriptions I found used inclusive upper bounds – so it made it easier to stick with that than try to "fix" the algorithm to use exclusive upper bounds everywhere. I think it highly unlikely that I’d get it all right without any off-by-one errors :)

Now all we’ve got to do is write an appropriate MergeSort method, and we’re done.

Implementing MergeSort

I won’t go through the details of how a merge sort works – read the Wikipedia article for a pretty good description. In brief though, the MergeSort method guarantees that it will leave the specified portion of the input data sorted. It does this by splitting that section in half, and recursively merge sorting each half. It then merges the two halves by walking along two cursors (one from the start of each subsection), finding the smaller of the two elements at each point, copying that element into the temporary array and advancing just that cursor. When it’s finished, the temporary storage will contain the sorted section, and it’s copied back to the "main" array. The recursion has to stop at some point, of course – and in my implementation it stops if the section has fewer than three elements.

Here’s the MergeSort method itself first:

// Note: right is *inclusive*
private void MergeSort(TElement[] data, TElement[] tmp, int left, int right)
{
    if (right > left)
    {
        if (right == left + 1)
        {
            TElement leftElement = data[left];
            TElement rightElement = data[right];
            if (currentComparer.Compare(leftElement, rightElement) > 0)
            {
                data[left] = rightElement;
                data[right] = leftElement;
            }
        }
        else
        {
            int mid = left + (right - left) / 2;
            MergeSort(data, tmp, left, mid);
            MergeSort(data, tmp, mid + 1, right);
            Merge(data, tmp, left, mid + 1, right);
        }
    }
}

The test for "right > left" is part of a vanilla merge sort (if the section either has one element or none, we don’t need to take any action), but I’ve optimized the common case of only two elements. All we need to do is swap the elements – and even then only if they’re currently in the wrong order. There’s no point in setting up all the guff of the two cursors – or even incurring the slight overhead of a method call – for that situation.

Other than that one twist, this is a pretty standard merge sort. Now for the Merge method, which is slightly more complicated (although still reasonably straightforward):

private void Merge(TElement[] data, TElement[] tmp, int left, int mid, int right)
{
    int leftCursor = left;
    int rightCursor = mid;
    int tmpCursor = left;
    TElement leftElement = data[leftCursor];
    TElement rightElement = data[rightCursor];
    // By never merging empty lists, we know we’ll always have valid starting points
    while (true)
    {
        // When equal, use the left element to achieve stability
        if (currentComparer.Compare(leftElement, rightElement) <= 0)
        {
            tmp[tmpCursor++] = leftElement;
            leftCursor++;
            if (leftCursor < mid)
            {
                leftElement = data[leftCursor];
            }
            else
            {
                // Only the right list is still active. Therefore tmpCursor must equal rightCursor,
                // so there’s no point in copying the right list to tmp and back again. Just copy
                // the already-sorted bits back into data.
                Array.Copy(tmp, left, data, left, tmpCursor - left);
                return;
            }
        }
        else
        {
            tmp[tmpCursor++] = rightElement;
            rightCursor++;
            if (rightCursor <= right)
            {
                rightElement = data[rightCursor];
            }
            else
            {
                // Only the left list is still active. Therefore we can copy the remainder of
                // the left list directly to the appropriate place in data, and then copy the
                // appropriate portion of tmp back.
                Array.Copy(data, leftCursor, data, tmpCursor, mid - leftCursor);
                Array.Copy(tmp, left, data, left, tmpCursor - left);
                return;
            }
        }
    }
}

Here, "mid" is the exclusive upper bound of the left subsection, and the inclusive lower bound of the right subsection… whereas "right" is the inclusive upper bound of the right subsection. Again, it’s possible that this is worth tidying up at some point to be more consistent, but it’s not too bad.

This time there’s a little bit more special-casing. We take the approach that whichever sequence runs out first (which we can detect as soon as the "currently advancing" cursor hits its boundary), we can optimize what still has to be copied. If the "left" sequence runs out first, then we know the remainder of the "right" sequence must already be in the correct place – so all we have to do is copy as far as we’ve written with tmpCursor back from the temporary array to the main array.

If the "right" sequence runs out first, then we can copy the rest of the "left" sequence directly into the right place (at the end of the section) and then again copy just what’s needed from the temporary array back to the main array.

This is as fast as I’ve managed to get it so far (without delving into too many of the more complicated optimizations available) – and I’m reasonably pleased with it. I have no doubt that it could be improved significantly, but I didn’t want to spend too much effort on it when I knew I’d be adapting everything for the key projection difficulty anyway.

Testing

I confess I don’t know the best way to test sorting algorithms. I have two sets of tests here:

  • A new project (MergeSortTest) where I actually implemented the sort before integrating it into OrderedEnumerable
  • All my existing OrderBy (etc) tests

The new project also acts as a sort of benchmark – although it’s pretty unscientific, and the key projection issue means the .NET implementation isn’t really comparable with the Edulinq one at the moment. Still, it’s a good indication of very roughly how well the implementation is doing. (It varies, interestingly enough… on my main laptop, it’s about 80% slower than LINQ to Objects; on my netbook it’s only about 5% slower. Odd, eh?) The new project sorts a range of sizes of input data, against a range of domain sizes (so with a small domain but a large size you’re bound to get equal elements – this helps to verify stability). The values which get sorted are actually doubles, but we only sort based on the integer part – so if the input sequence is 1.3, 3.5, 6.3, 3.1 then we should get an output sequence of 1.3, 3.5, 3.1, 6.3 – the 3.5 and 3.1 are in that order due to stability, as they compare equal under the custom comparer. (I’m performing the "integer only" part using a custom comparer, but we could equally have used OrderBy(x => (int) x)).
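The post doesn’t show that benchmark comparer, but from the description it would be something along these lines (my sketch, not necessarily the project’s code):

// Compares doubles by their integer parts only, so 3.5 and 3.1 count as
// equal; that equality is what lets the benchmark observe stability.
internal sealed class IntegerPartComparer : IComparer<double>
{
    public int Compare(double x, double y)
    {
        return ((int) x).CompareTo((int) y);
    }
}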

Conclusion

One problem (temporarily) down, one to go. I’m afraid that the code in part 26d is likely to end up being pretty messy in terms of generics – and even then I’m likely to talk about rather more options than I actually get round to coding.

Still, our simplistic model of OrderedEnumerable has served us well for the time being. Hopefully it’s proved more useful educationally this way – I suspect that if I’d dived into the final code right from the start, we’d all end up with a big headache.

Reimplementing LINQ to Objects: Part 26b – OrderBy{,Descending}/ThenBy{,Descending}

Last time we looked at IOrderedEnumerable<TElement> and I gave an implementation we could use in order to implement the public extension methods within LINQ. I’m still going to do that in this post, but it’s worth mentioning something else that’s coming up in another part (26d) – I’m going to revisit my OrderedEnumerable implementation.

There may be trouble ahead…

A comment on the previous post mentioned how my comparer executes the keySelector on each element every time it makes a comparison. I didn’t think of that as a particularly awful problem, until I thought of this sample query to rank people’s favourite colours:

var query = people.GroupBy(p => p.FavouriteColour)
                  .OrderByDescending(g => g.Count())
                  .Select(g => g.Key);

Eek. Now every time we compare two elements, we have to count everything in a group. Ironically, I believe that counting the items in a group is fast using the LINQ to Objects implementation, but not in mine – something I may fix later on. But with LINQ to Objects, this wouldn’t cause a problem in the first place!

There are ways to make this use an efficient key selector, of course – a simple Select before the OrderByDescending call would do fine… but it would be nicer if it wasn’t a problem in the first place. Basically we want to extract the keys for each element once, and then compare them repeatedly when we need to. This would also allow us to shuffle a sequence using code such as this:

Random rng = new Random(); // Or get it from elsewhere…
var shuffled = collection.OrderBy(x => rng.NextDouble());

I’m not advocating that way of shuffling, admittedly – but it would be nice if it didn’t cause significant problems, which it currently would, as the key selector is non-deterministic.

The interesting thing is that when I’ve finished today’s post, I believe the code will obey all the documented behaviour of LINQ to Objects: there’s nothing in the documentation about how often the key selector will be called. That doesn’t mean it’s a good idea to ignore this problem though, which is why I’ll revisit OrderedEnumerable later. However, that’s going to complicate the code somewhat… so while we’re still getting to grips with how everything hangs together, I’m going to stick to my inefficient implementation.

Meanwhile, back to the actual LINQ operators for the day…

What are they?

OrderBy, OrderByDescending, ThenBy and ThenByDescending all have very similar overloads:

public static IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)

public static IOrderedEnumerable<TSource> OrderByDescending<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IOrderedEnumerable<TSource> OrderByDescending<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)

public static IOrderedEnumerable<TSource> ThenBy<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IOrderedEnumerable<TSource> ThenBy<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)

public static IOrderedEnumerable<TSource> ThenByDescending<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IOrderedEnumerable<TSource> ThenByDescending<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)

They’re all extension methods, but ThenBy/ThenByDescending are extension methods on IOrderedEnumerable<T> instead of IEnumerable<T>.

We’ve already talked about what they do to some extent – each of them returns a sequence which is ordered according to the specified key. However, in terms of details:

  • The source and keySelector parameters can’t be null, and are validated eagerly.
  • The comparer parameter (where provided) can be null, in which case the default comparer for the key type is used.
  • They use deferred execution – the input sequence isn’t read until it has to be.
  • They read and buffer the entire input sequence when the result is iterated. Or rather, they buffer the original input sequence – as I mentioned last time, when a compound ordered sequence (source.OrderBy(…).ThenBy(…).ThenBy(…)) is evaluated, the final query will go straight to the source used for OrderBy, rather than sorting separately for each key.
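To make the deferred execution and buffering points concrete, here’s a small illustrative snippet; the Person type and the data are invented for the example:

var people = new List<Person>();
var query = people.OrderBy(p => p.LastName)
                  .ThenBy(p => p.FirstName);   // Nothing is read yet

people.Add(new Person("Jon", "Skeet"));        // Still picked up later

foreach (Person person in query)               // The source is buffered and
{                                              // sorted only at this point
    Console.WriteLine(person.LastName);
}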

What are we going to test?

I have tests for the following:

  • Deferred execution (using ThrowingEnumerable)
  • Argument validation
  • Ordering stability
  • Simple comparisons
  • Custom comparers
  • Null comparers
  • Ordering of null keys

In all of the tests which don’t go bang, I’m using an anonymous type as the source, with integer "Value" and "Key" properties. I’m ordering using the key, and then selecting the value – like this:

[Test]
public void OrderingIsStable()
{
    var source = new[]
    {
        new { Value = 1, Key = 10 },
        new { Value = 2, Key = 11 },
        new { Value = 3, Key = 11 },
        new { Value = 4, Key = 10 },
    };
    var query = source.OrderBy(x => x.Key)
                      .Select(x => x.Value);
    query.AssertSequenceEqual(1, 4, 2, 3);
}

For ThenBy/ThenByDescending I have multiple key properties so I can test the interaction between the primary and secondary orderings. For custom key comparer tests, I have an AbsoluteValueComparer which simply compares the absolute values of the integers provided.
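The AbsoluteValueComparer itself isn’t shown here; from that description it’s presumably something along these lines (my sketch):

internal sealed class AbsoluteValueComparer : IComparer<int>
{
    public int Compare(int x, int y)
    {
        // Fine for the test data; note that Math.Abs(int.MinValue) would overflow.
        return Math.Abs(x).CompareTo(Math.Abs(y));
    }
}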

The "Value" property is always presented in ascending order (from 1) to make it easier to keep track of, and the "Key" properties are always significantly larger so we can’t get confused between the two. I originally used strings for the keys in all tests, but then I found out that the default string comparer was culture-sensitive and didn’t behave how I expected it to. (The default string equality comparer uses ordinal comparisons, which are rather less brittle…) I still use strings for the keys in nullity tests, but there I’m specifying the ordinal comparer.

I wouldn’t claim the tests are exhaustive – by the time you’ve considered multiple orderings with possibly equal keys, different comparers etc the possibilities are overwhelming. I’m reasonably confident though (particularly after the tests found some embarrassing bugs in the implementation). I don’t think they’re hugely readable either – but I was very keen to keep the value separated from the key, rather than just ordering by "x => x" in tests. If anyone fancies cloning the repository and writing better tests, I’d be happy to merge them :)

What I deliberately don’t have yet is a test for how many times the key selector is executed: I’ll add one before post 26d, so I can prove we’re doing the right thing eventually.

Let’s implement them!

We’ve got two bits of implementation to do before we can run the tests:

  • The extension methods
  • The GetEnumerator() method of OrderedEnumerable

The extension methods are extremely easy. All of the overloads without comparers simply delegate to the ones with comparers (using Comparer<TKey>.Default) and the remaining methods look like this:

public static IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    return new OrderedEnumerable<TSource>(source,
        new ProjectionComparer<TSource, TKey>(keySelector, comparer));
}

public static IOrderedEnumerable<TSource> OrderByDescending<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    IComparer<TSource> sourceComparer = new ProjectionComparer<TSource, TKey>(keySelector, comparer);
    sourceComparer = new ReverseComparer<TSource>(sourceComparer);
    return new OrderedEnumerable<TSource>(source, sourceComparer);
}

public static IOrderedEnumerable<TSource> ThenBy<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    return source.CreateOrderedEnumerable(keySelector, comparer, false);
}

(To get ThenByDescending, just change the name of the method and change the last argument of CreateOrderedEnumerable to true.)
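Spelled out, that gives something like this (derived directly from the ThenBy code above):

public static IOrderedEnumerable<TSource> ThenByDescending<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    return source.CreateOrderedEnumerable(keySelector, comparer, true);
}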

All very easy. I’m pretty sure I’m going to want to change the OrderedEnumerable constructor to accept the key selector and key comparer in the future (in 26d), which will make the above code even simpler. That can wait a bit though.

Now for the sorting part in OrderedEnumerable. Remember that we need a stable sort, so we can’t just delegate to List<T>.Sort – at least, not without a bit of extra fiddling. (We could project to a type which contained the index, and add that onto the end of the comparer as a final tie-breaker.)

For the minute – and I swear it won’t stay like this – here’s the horribly inefficient (but easy to understand) implementation I’ve got:

public IEnumerator<TElement> GetEnumerator()
{
    // This is a truly sucky way of implementing it. It’s the simplest I could think of to start with.
    // We’ll come back to it!
    List<TElement> elements = source.ToList();
    while (elements.Count > 0)
    {
        TElement minElement = elements[0];
        int minIndex = 0;
        for (int i = 1; i < elements.Count; i++)
        {
            if (currentComparer.Compare(elements[i], minElement) < 0)
            {
                minElement = elements[i];
                minIndex = i;
            }
        }
        elements.RemoveAt(minIndex);
        yield return minElement;
    }
}

We simply copy the input to a list (which is something we may well do in the final implementation – we certainly need to suck it all in somehow) and then repeatedly find the minimum element (favouring earlier elements over later ones, in order to achieve stability), removing them as we go. It’s an O(n²) approach, but hey – we’re going for correctness first.

Conclusion

This morning, I was pretty confident this would be an easy and quick post to write. Since then, I’ve found pain in the following items:

  • Calling key selectors only once per element is more important than it might sound at first blush
  • The default sort order for string isn’t what I’d have guessed
  • My (committed!) extension methods were broken, because I hadn’t edited them properly after a cut and paste
  • Writing tests for situations where there are lots of combinations is irritating

So far these have only extended my estimated number of posts for this group of operators to 4 (26a-26d) but who knows what the next few days will bring…

Reimplementing LINQ to Objects: Part 26a – IOrderedEnumerable

Implementing OrderBy/OrderByDescending/ThenBy/ThenByDescending isn’t too bad, once you understand how the pattern works, but that’s a slight conceptual leap. I’ll take it one step at a time, first introducing some extra infrastructure (both public and internal), then implementing it simply and slowly, and finally optimizing it somewhat.

The sorting features of LINQ

As a query language, it’s natural that LINQ allows you to sort sequences. Of course, collections have had the ability to sort their contents since v1.0 but LINQ is different in various ways:

  • It expects you’ll usually want to sort via projections, rather than on a "natural" comparison of the whole value. For example, should I come before or after my wife in a sorted list of people? It depends on whether you’re ordering by first name, age, height, weight, or something else.
  • It allows you to easily specify whether you want the ordering to be from the minimum value to the maximum (ascending) or from maximum to minimum (descending).
  • It allows you to specify multiple ordering criteria – for example, order by last name and then first name.
  • It doesn’t sort in-place – instead, the sorting operators each return a new sequence which represents the results of the sort.
  • It allows you to order any sequence implementing IEnumerable<T> – rather than Array.Sort being separate from List<T>.Sort, for example.
  • The ordering operators in LINQ to Objects at least are stable – two equal items will be returned in the original order.
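As a quick illustration of those points in query form (the people collection and its properties are invented for the example):

// Order by last name descending, then first name ascending. Two people with
// the same last and first names keep their original relative order, thanks
// to the stability guarantee; the source itself is left untouched.
var ordered = people.OrderByDescending(p => p.LastName)
                    .ThenBy(p => p.FirstName);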

Other than stability (I’m at least unaware of any public, stable sort methods within the framework prior to .NET 3.5), all of this can be done using IComparer<T>, but it’s a bit of a pain. Just as a reminder, IComparer<T> is an interface with a single method:

int Compare(T x, T y)

Unlike IEqualityComparer<T>, an IComparer<T> can only be asked to compare two items to determine which should come before the other in an ordered sequence. There’s no notion of a hash code in IComparer<T>. All sorting in .NET is based on either IComparer<T> or Comparison<T>, which is the equivalent delegate. Like IEqualityComparer<T>, a helper class (Comparer<T>) exists mostly to provide a default comparison based on the IComparable<T> and IComparable interfaces. (Unlike EqualityComparer<T>.Default, there’s no "fall back" to a universal default comparison – if you try to compare two objects which have no comparison defined between them, an exception will be thrown.)
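A quick demonstration of that last difference (my own example, not from the post):

// string implements IComparable<string>, so the default comparer just works:
int result = Comparer<string>.Default.Compare("apple", "banana");   // Negative

// But there's no universal fallback: neither operand implements IComparable,
// so this throws an ArgumentException when Compare is called.
Comparer<object>.Default.Compare(new object(), new object());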

Composing comparers

Eventually, we’ll need our ordering implementation to be built on top of IComparer<T>, so I’ll introduce three helper classes to address the first three bullet points above. MiscUtil has more fully-developed versions of all of these classes – the code I’ve written for Edulinq is the bare minimum required for the project. Due to the way I’m testing (building the same tests against both Edulinq and the .NET framework version) I also don’t have any tests for these classes – but they’re simple enough that I’m pretty confident they’ll work. They don’t validate constructor arguments, as the public API is responsible for using them sensibly.

First, projection: we want to be able to specify a key selector and optionally a comparison between two keys – just as we already do for things like ToLookup and Join, but this time using IComparer<T> instead of IEqualityComparer<T>. Here’s a helper class to build a comparison for two values of a "source" element type based on their keys:

internal class ProjectionComparer<TElement, TKey> : IComparer<TElement>
{
    private readonly Func<TElement, TKey> keySelector;
    private readonly IComparer<TKey> comparer;

    internal ProjectionComparer(Func<TElement, TKey> keySelector,
        IComparer<TKey> comparer)
    {
        this.keySelector = keySelector;
        this.comparer = comparer ?? Comparer<TKey>.Default;
    }

    public int Compare(TElement x, TElement y)
    {
        TKey keyX = keySelector(x);
        TKey keyY = keySelector(y);
        return comparer.Compare(keyX, keyY);
    }
}

As an example of how we might use this, we might want to order a List<Person> by last name in a case-insensitive way. For that, we might use:

IComparer<Person> comparer = new ProjectionComparer<Person, string>
    (p => p.LastName, StringComparer.CurrentCultureIgnoreCase);

Again, in public code I’d provide simpler ways of constructing the comparer to take advantage of type inference – but it isn’t really worth it here.

As you can see, the comparer simply extracts the keys from both of the elements it’s trying to compare, and delegates to the key comparer.
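
If we did want that friendlier construction, a minimal sketch might look like this – a hypothetical non-generic factory class, not something that exists in Edulinq:

internal static class ProjectionComparer
{
    // Hypothetical factory method: type inference means callers don't need to
    // spell out TElement and TKey explicitly.
    internal static IComparer<TElement> Create<TElement, TKey>(
        Func<TElement, TKey> keySelector,
        IComparer<TKey> comparer)
    {
        return new ProjectionComparer<TElement, TKey>(keySelector, comparer);
    }
}

The earlier example would then become ProjectionComparer.Create((Person p) => p.LastName, StringComparer.CurrentCultureIgnoreCase), with both type arguments inferred.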

Next up is reversal: by reversing the order of the arguments into a comparer, we can effectively reverse the direction of a sort. Note that due to the stability of LINQ orderings, this is not quite the same as just reversing the final order of the elements: even when sorting by a key in a descending direction, elements with equal keys will be preserved in the original order. This actually drops out naturally when we let the comparer handle the direction – the ordering code can preserve stability without really knowing whether it’s sorting in ascending or descending direction. Here’s the reverse comparer:

internal class ReverseComparer<T> : IComparer<T>
{
    private readonly IComparer<T> forwardComparer;

    internal ReverseComparer(IComparer<T> forwardComparer)
    {
        this.forwardComparer = forwardComparer;
    }

    public int Compare(T x, T y)
    {
        return forwardComparer.Compare(y, x);
    }
}

It’s incredibly simple, but there are two important points to note:

  • It actually violates the contract of IComparer<T>, assuming that the forward comparer doesn’t. The interface documentation states that null values should compare less than any non-null references, and obviously if we’re reversing the original comparison, that won’t be the case. I consider this to be a flaw in the documentation, to be honest: it makes sense to allow the comparer itself to decide how to handle nullity. The LINQ ordering documentation doesn’t make any specific reference to this – it’s unclear whether sorting by a key in a descending manner will leave elements with a null key at the start or the end. It turns out that the .NET implementation acts like my reverse comparer: it treats null like any other value, effectively.
  • A naive implementation would call forwardComparer.Compare(x, y) and simply negate the return value. This would hide a subtle bug: if the forward comparer returned Int32.MinValue, so would the reverse comparer. In signed integer types, there are more negative numbers available than positive ones (e.g. the range of SByte is -128 to 127 inclusive) and negating the minimum value is effectively a no-op. By reversing the order of the arguments instead, we avoid the problem. (There’s a quick demonstration just after this list.)
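
Here’s that second point in isolation – a throwaway snippet rather than Edulinq code:

int minValue = int.MinValue;
Console.WriteLine(-minValue);             // Still -2147483648: the negation overflows silently
Console.WriteLine(-minValue == minValue); // True – so "negate the result" would report the wrong order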

Our final helper class tackles the problem of creating a compound comparer, composed of a primary comparison and a secondary one. The secondary comparison is only used if the primary comparison considers two items to be equal – so for example, if we were comparing people by last name and then first name, the primary comparison would be the last name comparer, and the secondary comparison would be the first name comparer. Comparing "John Smith" and "Fred Bloggs" doesn’t need to take the first names into account at all, but comparing "Dave Skeet" with "Jon Skeet" does.

If we want more than two comparers, we can simply treat one compound comparer as the primary comparison for another compound comparer. For example, suppose we want to order by last name, then first name, then age. That’s effectively "compound(compound(last name, first name), age)" in the obvious notation. Again, the code implementing a compound comparer is very simple, but will prove invaluable:

internal class CompoundComparer<T> : IComparer<T>
{
    private readonly IComparer<T> primary;
    private readonly IComparer<T> secondary;

    internal CompoundComparer(IComparer<T> primary,
        IComparer<T> secondary)
    {
        this.primary = primary;
        this.secondary = secondary;
    }

    public int Compare(T x, T y)
    {
        int primaryResult = primary.Compare(x, y);
        if (primaryResult != 0)
        {
            return primaryResult;
        }
        return secondary.Compare(x, y);
    }
}

(I could have used a conditional expression in the Compare method here, but this time I felt it wasn’t worth it.)
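
To make the nesting concrete, here’s a hedged sketch of the "last name, then first name, then age" ordering built from the helpers above – Person and its properties are the same hypothetical type as before, and a null key comparer just means "use the default":

IComparer<Person> byLastName = new ProjectionComparer<Person, string>(p => p.LastName, null);
IComparer<Person> byFirstName = new ProjectionComparer<Person, string>(p => p.FirstName, null);
IComparer<Person> byAge = new ProjectionComparer<Person, int>(p => p.Age, null);

// compound(compound(last name, first name), age)
IComparer<Person> comparer = new CompoundComparer<Person>(
    new CompoundComparer<Person>(byLastName, byFirstName), byAge);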

With these classes in place, we can move on to look at IOrderedEnumerable – and implement some of it.

Implementing IOrderedEnumerable<TElement>

When we eventually get to OrderBy, OrderByDescending, ThenBy and ThenByDescending (in the next post) they will all return IOrderedEnumerable<TElement>. By focusing on that first, we can actually make the LINQ operators themselves fairly trivial. The interface itself looks simple enough:

public interface IOrderedEnumerable<TElement> : IEnumerable<TElement>, IEnumerable
{
    IOrderedEnumerable<TElement> CreateOrderedEnumerable<TKey>(
        Func<TElement, TKey> keySelector,
        IComparer<TKey> comparer,
        bool descending);
}

The important point to understand is that most developers will never need to call CreateOrderedEnumerable directly themselves – or even care about the interface. What most of us care about is that we can call ThenBy and ThenByDescending after we’ve called OrderBy or OrderByDescending – and that’s what the interface enables. The ThenBy/ThenByDescending operators are extension methods on IOrderedEnumerable<TElement> instead of IEnumerable<TElement>, precisely because they have to be called on sequences which already have a primary ordering.

The CreateOrderedEnumerable method is precisely designed to create a new IOrderedEnumerable which represents the existing ordered sequence but with an extra secondary ordering applied.
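
For example, ThenBy can lean almost entirely on this method. Here’s a hedged sketch – argument validation elided, and the full version comes in the next post:

public static IOrderedEnumerable<TSource> ThenBy<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    // Argument validation elided: ask the existing ordered sequence
    // to apply a secondary, ascending ordering.
    return source.CreateOrderedEnumerable(keySelector, comparer, false);
}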

The tricky bit to get your head round is that this sort of composition doesn’t work the same way as our normal sequence chaining. Usually – for Where, Select etc – we fetch items from a "parent" sequence, perform some filtering or projection or whatever, and then yield the results. We can’t do that to provide a secondary ordering without information about the primary ordering – we’d effectively have to know where the boundaries between the "primary sort keys" occur, so that we only reordered within one key. Even if we could do that, it would be relatively painful from a performance perspective.

Instead, we need to create an IOrderedEnumerable implementation which remembers the original, unordered sequence – and then builds up a compound comparer as it goes. That way, we can then do all the ordering in one go, rather than each intermediate ordered sequence doing its own little bit.

This is possibly easier to show than to describe – my implementation of OrderedEnumerable is really fairly short. Here it is, just with the actual sorting code removed:

internal class OrderedEnumerable<TElement> : IOrderedEnumerable<TElement>
{
    private readonly IEnumerable<TElement> source;
    private readonly IComparer<TElement> currentComparer;

    internal OrderedEnumerable(IEnumerable<TElement> source,
        IComparer<TElement> comparer)
    {
        this.source = source;
        this.currentComparer = comparer;
    }

    public IOrderedEnumerable<TElement> CreateOrderedEnumerable<TKey>
        (Func<TElement, TKey> keySelector,
         IComparer<TKey> comparer,
         bool descending)
    {
        if (keySelector == null)
        {
            throw new ArgumentNullException("keySelector");
        }
        IComparer<TElement> secondaryComparer =
            new ProjectionComparer<TElement, TKey>(keySelector, comparer);
        if (descending)
        {
            secondaryComparer = new ReverseComparer<TElement>(secondaryComparer);
        }
        return new OrderedEnumerable<TElement>(source,
            new CompoundComparer<TElement>(currentComparer, secondaryComparer));
    }

    public IEnumerator<TElement> GetEnumerator()
    {
        // Perform the actual sorting here: basically load
        // "source" into a buffer, and then use a stable
        // sort algorithm with "currentComparer" to compare
        // any two elements
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

Note how when we create one IOrderedEnumerable from another, we don’t store a reference to the original – it can be garbage-collected as far as we’re concerned. We just care about the original (unordered) source, and the new compound comparer.

You can already see how useful our comparer helper classes are. They allow us to focus on a single abstraction – IComparer<T> – and build interesting comparisons based on it, just as LINQ to Objects focuses on IEnumerable<T> and gives us interesting ways to work with sequences. Each aspect of the comparer we build is distinct:

  • A projection comparer takes our key selector and key comparer, and builds a plain IComparer<TElement>
  • If the secondary ordering is descending, we create a reverse comparer to wrap the projection comparer
  • We then build a compound comparer from whatever comparison we were already using, and the newly created one

I suspect that this isn’t the most efficient way to perform the comparison, to be honest: there’s an awful lot of wrapping going on. However, I think it’s probably the clearest way to represent things – and that’s the goal of Edulinq.

Conclusion

We now have all the internal plumbing in place. There are two more tasks:

  • Implementing the extension methods themselves
  • Performing the sort

I’ll cover those in the next two posts (26b and 26c). In some ways, they’re likely to be less interesting than this post (unless I work out a particularly interesting sort algorithm to use). It’s the design of sorting in LINQ which I find particularly neat – the way that it’s tailored to what developers really need compared with the primitive Sort methods of old, and the elegant way in which those abilities are expressed. As ever, we’ve ended up with lots of small building blocks – but ones that can be combined to great effect.

Reimplementing LINQ to Objects: Part 25 – ToDictionary

This one ended up being genuinely easy to implement, although with lots of tests for different situations.

What is it?

ToDictionary has four overloads which look remarkably similar to the ones we used for ToLookup:

public static Dictionary<TKey, TSource> ToDictionary<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static Dictionary<TKey, TElement> ToDictionary<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector)

public static Dictionary<TKey, TSource> ToDictionary<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
        
public static Dictionary<TKey, TElement> ToDictionary<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)

Yes, it’s the ever-popular combination of an optional key comparer and an optional element selector when mapping values.

ToDictionary does exactly what it says on the tin: it converts the source sequence to a dictionary. Each source item is mapped to a key, and then either the source item itself is used as the value or the element selector is applied to obtain one. A custom equality comparer can be used to create a dictionary which is (for example) case-insensitive when fetching keys.
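
As a quick hedged sketch of those options in action (the data here is made up purely for illustration):

string[] words = { "Apple", "banana", "Cherry" };

// Key: first letter (as a string); value: word length; keys compared case-insensitively
Dictionary<string, int> lengthByInitial = words.ToDictionary(
    word => word.Substring(0, 1),
    word => word.Length,
    StringComparer.OrdinalIgnoreCase);

int length = lengthByInitial["a"]; // 5 – finds "Apple" despite the difference in case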

The main difference between ToDictionary and ToLookup is that a dictionary can only store a single value per key, rather than a whole sequence of values. Of course, another difference is that Dictionary<TKey, TValue> is a mutable concrete class, whereas ILookup<TKey, TElement> is an interface usually implemented in an immutable fashion.

The normal sort of note:

  • ToDictionary uses immediate execution: it reads the entire source sequence, and once it has returned, the dictionary is independent of the source
  • source, keySelector and elementSelector mustn’t be null
  • comparer can be null, in which case the default equality comparer for the key type will be used
  • No key can appear more than once ("more than once" includes non-identical but equal keys); attempting to add a duplicate causes an ArgumentException
  • Null keys are prohibited, causing an ArgumentNullException. (This is slightly odd, as the ToDictionary method itself doesn’t have a null argument at this point… really it should be a NullValueDerivedFromAnArgumentSomehowException, but that doesn’t exist)
  • Null values are allowed

Just a word on usage: one place I’ve seen ToDictionary used to reasonable effect is when the source is an IEnumerable<KeyValuePair<TKey, TValue>>, often from another dictionary. This lets you project all the values of a dictionary, or perhaps "trim" the dictionary in some ways… creating a new dictionary rather than modifying the existing one, of course. For example, here’s a way of creating a new Dictionary<string, Person> from an existing one, but only keeping the entries which are adults:

var adults = nameToPersonMap.Where(pair => pair.Value.Age >= 18)
                            .ToDictionary(pair => pair.Key, pair => pair.Value);

Of course, this is just one use – it can be a pretty handy operator, to be frank.

What are we going to test?

Just the "null argument" validation tests gives us ten tests to start with, then:

  • A test for each overload, always mapping a sequence of strings to a map using the first letter as the key; sometimes as a char and sometimes as a string (so we can easily use a case-insensitive comparer)
  • A test for null keys (ArgumentNullException)
  • A test for null values (no exception)
  • A test for duplicate keys (ArgumentException)
  • A test for a null comparer (uses the default)

Let’s implement it!

In my implementation, the first three overloads all call the last – just using a default equality comparer or an identity element selector where necessary. Then after the argument validation, the implementation is simple:

public static Dictionary<TKey, TElement> ToDictionary<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    // Argument validation elided
    Dictionary<TKey, TElement> ret = new Dictionary<TKey, TElement>(comparer ?? EqualityComparer<TKey>.Default);
    foreach (TSource item in source)
    {
        ret.Add(keySelector(item), elementSelector(item));
    }
    return ret;
}

There’s just one thing to be careful of – we have to use the Add method instead of the indexer to add entries to the dictionary. The Add method deals with duplicate and null keys in exactly the way we want, whereas the indexer would overwrite the existing entry if we had a duplicate key.
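
Just to spell that difference out with a plain Dictionary<TKey, TValue> (nothing Edulinq-specific here):

var map = new Dictionary<string, int>();
map["key"] = 1;
map["key"] = 2;     // Indexer: silently overwrites; the value is now 2
map.Add("key", 3);  // Add: throws ArgumentException because the key is already present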

There’s one thing my implementation doesn’t do which it could if we really wanted – it doesn’t optimize the case where we know the size beforehand, because the source implements ICollection<T>. We could use the Dictionary constructor which takes a capacity in that case, like this:

comparer = comparer ?? EqualityComparer<TKey>.Default;
ICollection<TSource> list = source as ICollection<TSource>;
var ret = list == null ? new Dictionary<TKey, TElement>(comparer)
                       : new Dictionary<TKey, TElement>(list.Count, comparer);

After all, we know the eventual size of the dictionary will be exactly the count of the input sequence, in the success case.

Is this worth it? I’m not sure. It’s not a huge change, admittedly… I’m quite tempted to put it in. Again, we’d have to benchmark it to see what difference it made (and I’d benchmark the .NET implementation too, of course) but a unit test won’t show it up. Thoughts welcome.

EDIT: I’ve now edited the code to include this optimization. I haven’t benchmarked it though…

Conclusion

Gah – even when I thought I had this wrapped up before starting to write the post, that final bit of potential optimization still appeared out of nowhere.

Idle thought: if we all had to write up our code in blog posts as we went along, we might discover some interesting alternative designs and implementations.

That’s it for tonight, and I may not have any time for posts tomorrow – we’ll see.

Reimplementing LINQ to Objects: Part 24 – ToArray

This really is a simple one. So simple I might even find the energy to implement ToDictionary as well tonight… we’ll see. (EDIT: Oops. It became slightly less simple in the end, as I came up with the third implementation. Oh well.)

What is it?

ToArray is basically the equivalent of ToList, but producing an array instead of a list. It has a single signature:

public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)

Just to recap:

  • It’s another extension method, which is useful when we want to use type inference.
  • It uses immediate execution – nothing is deferred, and the entire sequence is read before the method completes.
  • It performs simple argument validation: source can’t be null
  • It creates an independent but shallow copy of the collection (so any changes to the objects referenced by source will be visible in the result, but not changes to source itself – and vice versa)
  • It’s optimized for ICollection<T>, although in a slightly different way to ToList.

What are we going to test?

Exactly the same as we did for ToList, basically – with one bit of care required. In our test for the source and result being independent of each other, we can’t just create a new variable of type List<T> and call ToArray on it – because that would call the implementation in List<T> itself. I reckoned the easiest way of making this clear was to call the method directly as a normal static method, instead of using it as an extension method:

[Test]
public void ResultIsIndependentOfSource()
{
    List<string> source = new List<string> { "xyz", "abc" };
    // Make it obvious we’re not calling List<T>.ToArray
    string[] result = Enumerable.ToArray(source);
    result.AssertSequenceEqual("xyz", "abc");

    // Change the source: result won’t have changed
    source[0] = "xxx";
    Assert.AreEqual("xyz", result[0]);

    // And the reverse
    result[1] = "yyy";
    Assert.AreEqual("abc", source[1]);
}

One other interesting facet of the testing is that we can only partially test the optimization for ICollection<T>. We can make sure that it’s optimized as far as not using the iterator (as per the previous test), but there’s more optimization coming, and I haven’t worked out how to test for that yet. You’ll see what I mean when we come to implement it. Speaking of which…

Let’s implement it!

We can make all our tests pass with a cheat: whatever the input type, convert it to a List<T> and then call ToArray on the result. Heck, if we call ToList to do the initial conversion, that will even do the argument validation for us and use the right parameter name in the exception:

// Implementation by Mr Cheaty MacCheaterson
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    return source.ToList().ToArray();
}

Now I’m not averse to the general approach taken here – but there’s actually a bit more optimization we can do.

Remember how ICollection<T> exposes Count and CopyTo, which the List<T> constructor uses when the input implements ICollection<T>? Well, that means that building a List<T> is relatively cheap for a collection – but calling ToArray on the list will still mean copying all the data out again (as List<T>.ToArray can’t just return its own internal array – it has to create a copy). We can use exactly the same members ourselves, and avoid one level of copying:

// Somewhat better… though still not quite ideal
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
    {
        TSource[] ret = new TSource[collection.Count];
        collection.CopyTo(ret, 0);
        return ret;
    }
    return new List<TSource>(source).ToArray();
}

That’s pretty good now – except it still involves a copy from the List<T> into a new array every time. That’s almost always appropriate, in fact… because unless the resulting list happened to expand its array to exactly the right size, we’d need to make a copy anyway. After all, we can’t return an array that’s too big. However, we can optimize for the "just the right size" case if we basically implement List<T>’s array expansion ourselves, leading to this:

public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    // Optimize for ICollection<T>
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
    {
        TSource[] tmp = new TSource[collection.Count];
        collection.CopyTo(tmp, 0);
        return tmp;
    }
            
    // We’ll have to loop through, creating and copying arrays as we go
    TSource[] ret = new TSource[16];
    int count = 0;
    foreach (TSource item in source)
    {
        // Need to expand…
        if (count == ret.Length)
        {
            Array.Resize(ref ret, ret.Length * 2);
        }
        ret[count++] = item;
    }

    // Now create another copy if we have to, in order to get an array of the
    // right size
    if (count != ret.Length)
    {
        Array.Resize(ref ret, count);
    }
    return ret;
}

Is this level of optimization worth it? Probably not. I picked the starting size of 16 out of thin air (or dimly recalled initial counts for some collection or other – possibly Java’s ArrayList<T>). Maybe we should triple the capacity rather than double it, just for laughs. It’s all guesswork, really. The middle implementation feels like a more appropriate one to me, but with an obvious optimization just itching to be implemented, I thought I might as well provide the code. It’s reasonably obvious that it’s correct, but it’s just a bit long-winded for the marginal benefit over our second attempt.

It’s even more annoying that I can’t think of a way to test this easily – I could benchmark it of course, but that’s not the same as unit testing it… I can’t easily prove that we’re optimizing either the ICollection<T> or "correct guess at size" cases.

Conclusion

It’s always interesting to see what else springs to mind when I’m writing up the operator as opposed to just implementing it in Visual Studio. I’d got as far as the second implementation but not the third when I started this post.

It’s possible that in the end, the 4-overload ToDictionary operator will actually end up being simpler than ToArray. Who’d have thought?

Reimplementing LINQ to Objects: Part 23 – Take/Skip/TakeWhile/SkipWhile

I genuinely expected these operators to be simple. At the time of this writing, I’m onto my third implementation of SkipWhile, struggling to find code which obviously expresses what’s going on. I find it interesting how the simplest sounding operators sometimes end up being trickier to implement than one might think.

What are they?

Take, TakeWhile, Skip and SkipWhile form a natural group of operators. Together they are known as the partitioning operators. Here are the signatures:

public static IEnumerable<TSource> Take<TSource>(
    this IEnumerable<TSource> source,
    int count)

public static IEnumerable<TSource> TakeWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static IEnumerable<TSource> TakeWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)

public static IEnumerable<TSource> Skip<TSource>(
    this IEnumerable<TSource> source,
    int count)

public static IEnumerable<TSource> SkipWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static IEnumerable<TSource> SkipWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)

The behaviour is reasonably self-explanatory:

  • source.Take(n) returns a sequence of the first n elements of source
  • source.Skip(n) returns a sequence containing all but the first n elements of source
  • source.TakeWhile(predicate) returns a sequence of the first elements of source which match the given predicate; it stops as soon as the predicate fails to match
  • source.SkipWhile(predicate) returns a sequence of all but the first elements of source which match the given predicate; it starts yielding results as soon as the predicate fails to match

The only reason there are two overloads for TakeWhile and SkipWhile is so that you can provide a predicate which uses the index within the sequence, just like the overloads for Select and Where.

Each operator effectively partitions the input sequence into two parts, and yields either the first or second part. So for any "normal" collection (which returns the same results each time you iterate over it) the following are true:

  • source.Take(n).Concat(source.Skip(n)) is equivalent to source
  • source.TakeWhile(predicate).Concat(source.SkipWhile(predicate)) is equivalent to source (as noted in the comments, this also assumes the predicate is deterministic – it behaves the same way each time it’s given the same data, and has no side-effects); there’s a quick check of both properties just after this list
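
Here’s that quick check, using the test project’s AssertSequenceEqual helper:

int[] source = { 0, 1, 2, 3, 4 };

// Take/Skip partition at an index…
source.Take(2).Concat(source.Skip(2)).AssertSequenceEqual(0, 1, 2, 3, 4);

// …TakeWhile/SkipWhile partition at the first predicate failure
source.TakeWhile(x => x < 3)
      .Concat(source.SkipWhile(x => x < 3))
      .AssertSequenceEqual(0, 1, 2, 3, 4);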

A few notes on general behaviour:

  • The arguments are validated eagerly: source and predicate can’t be null
  • count can be negative (in which case it’s equivalent to 0) and can be larger than the collection too. Basically any value is valid.
  • Execution is deferred.
  • The source sequence is streamed – there’s no need to buffer anything.

What are we going to test?

One nice thing about the properties I described earlier is that it’s very easy to convert tests for Take/TakeWhile into tests for Skip/SkipWhile: you just change the expected sequence in Skip to be everything in the source sequence which wasn’t in the test for Take. For Take/Skip I’ve just used Enumerable.Range(0, 5) (i.e. 0, 1, 2, 3, 4) as the source sequence; for TakeWhile/SkipWhile I’ve used { "zero", "one", "two", "three", "four", "five" } – so each element is the word corresponding to the index. That makes it reasonably easy to write tests with a predicate involving the index.

I’ve tested:

  • Argument validation
  • Deferred execution
  • For Skip/Take:
    • A negative count
    • A count of 0
    • A count to split the sequence into non-empty parts
    • A count exactly equal to the source length (5)
    • A count larger than the source length
  • For SkipWhile/TakeWhile (both overloads in each case):
    • A predicate which fails on the first element
    • A predicate which matches some elements but not others
    • A predicate which matches all elements

The tests aren’t generally interesting, but as it can be slightly tricky to get your head round TakeWhile and SkipWhile to start with, here’s an example of each:

// In TakeWhileTest
[Test]
public void PredicateMatchingSomeElements()
{
    string[] source = { "zero", "one", "two", "three", "four", "five" };
    source.TakeWhile(x => x.Length < 5).AssertSequenceEqual("zero", "one", "two");
}

// In SkipWhileTest
[Test]
public void PredicateMatchingSomeElements()
{
    string[] source = { "zero", "one", "two", "three", "four", "five" };
    source.SkipWhile(x => x.Length < 5).AssertSequenceEqual("three", "four", "five");
}

There’s been some discussion on Twitter about the best kind of test here, with suggestions of varying the data instead of the predicate, and using sequences of Boolean values. Personally I prefer the tests the way they are – I find this sequence of strings easy to work with, and the fact that it is a sequence of strings means you can’t get the index/value parameter order wrong when providing a predicate which uses the indexes, unlike if you use an integer sequence. A Boolean sequence isn’t clear enough for me – the result isn’t obviously the first or second part of a sequence, whereas with these words, it’s obvious what’s going on… to me, at least. It’s all a matter of personal preference though, and it’s good to think about alternatives.

Let’s implement them!

Let me present this in chronological order. I started off by implementing Take and Skip, as they sounded the simplest. Indeed, they’re not too bad – here are the iterator block parts, after argument validation in the public method, of course:

private static IEnumerable<TSource> TakeImpl<TSource>(
    this IEnumerable<TSource> source,
    int count)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        for (int i = 0; i < count && iterator.MoveNext(); i++)
        {
            yield return iterator.Current;
        }
    }
}

private static IEnumerable<TSource> SkipImpl<TSource>(
    this IEnumerable<TSource> source,
    int count)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        for (int i = 0; i < count; i++)
        {
            if (!iterator.MoveNext())
            {
                yield break;
            }
        }
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}

These really aren’t too bad, although it’s interesting to note that Skip is more complicated than Take: Skip has to deal with both sections of the sequence (the "ignore" part and the "yield" part) whereas Take can return as soon as it’s finished with the "yield" part. We’ll see the same pattern in TakeWhile/SkipWhile.

Speaking of which, let’s look at them. There’s a decision to make in terms of whether to implement the overload whose predicate doesn’t use the index in terms of the one which does. It would certainly be simple, but I don’t like the double level of indirection with one delegate calling another. It may well be irrational, given that I’ve been introducing identity delegates elsewhere – but it just didn’t feel right. I’ve added it into the source code using conditional compilation just for completeness, but it’s not the default implementation.

Now, the "Impl" bits of TakeWhile really weren’t too bad:

private static IEnumerable<TSource> TakeWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    foreach (TSource item in source)
    {
        if (!predicate(item))
        {
            yield break;
        }
        yield return item;
    }
}

private static IEnumerable<TSource> TakeWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)
{
    int index = 0;
    foreach (TSource item in source)
    {
        if (!predicate(item, index))
        {
            yield break;
        }
        index++;
        yield return item;
    }
}

Again, we don’t need to worry about the second part of the sequence – the part we’re ignoring. We can just let the method complete, and the result iterator terminate. In particular, we don’t care about the value that caused the predicate to return false. Now let’s look at SkipWhile (again, just the interesting bits):

private static IEnumerable<TSource> SkipWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        while (iterator.MoveNext())
        {
            TSource item = iterator.Current;
            if (!predicate(item))
            {
                // Stop skipping now, and yield this item
                yield return item;
                break;
            }
        }
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}

private static IEnumerable<TSource> SkipWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        int index = 0;
        while (iterator.MoveNext())
        {
            TSource item = iterator.Current;
            if (!predicate(item, index))
            {
                // Stop skipping now, and yield this item
                yield return item;
                break;
            }
            index++;
        }
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}

The above code is my third attempt. You can look at the first two in the history of SkipWhile.cs – you may find one of those preferable. Other suggestions have been made, including ones using do/while… I must admit, the do/while option ends up being quite attractive, but again I have a general aversion to it as a construct. Part of the trickiness is that having found an element which makes the predicate return false, we have to yield that element before we move on to yielding the rest. It’s easy enough to do, but the extra yield return statement feels ugly.

One point to note is that if the "skipping" part finishes due to a call to MoveNext returning false (rather than the predicate failing) then we’ll call MoveNext again (at the start of the "yielding" loop). That’s explicitly called out as being okay in the IEnumerator.MoveNext() documentation, so I’m happy enough to use it.

I’m certainly not happy about those implementations, but I’m confident they at least work. Now, let’s revisit Skip and Take. Is there any way we can simplify them? Absolutely – all we need to do is use TakeWhile or SkipWhile with a predicate which uses the index and compares it to the count we’ve been given:

public static IEnumerable<TSource> Take<TSource>(
    this IEnumerable<TSource> source,
    int count)
{
    // Buggy! See addendum…
    return source.TakeWhile((x, index) => index < count);
}

public static IEnumerable<TSource> Skip<TSource>(
    this IEnumerable<TSource> source,
    int count)
{
    return source.SkipWhile((x, index) => index < count);
}

TakeWhile and SkipWhile even do the appropriate argument validation for us. As you can see, I’ve become less averse to implementing one operator using another over time… both implementations are still in the source, using conditional compilation, but currently this short form is the one which is "active". I could be persuaded to change my mind :)

An optimized Skip?

Although most of these operations can’t be sensibly optimized, it would make sense to optimize Skip when the source implements IList<T>. We can skip the skipping, so to speak, and go straight to the appropriate index. This wouldn’t spot the case where the source was modified between iterations, which may be one reason it’s not implemented in the framework as far as I’m aware. The optimization would look something like this:

var list = source as IList<TSource>;
if (list != null)
{
    // Make sure we don’t use a negative index
    count = Math.Max(count, 0);
    // Note that "count" is the count of items to skip
    for (int index = count; index < list.Count; index++)
    {
        yield return list[index];
    }
    yield break;
}

Should Edulinq do this? Answers on a postcard…

EDIT: There’s another optimization we can potentially make, even earlier. If we’re given a count which is zero or negative, Skip is effectively pointless – so we can just return the original reference. This would come just after the argument validation, not in the iterator block:

if (count <= 0)
{
    return source;
}

Returning the original reference has some implications in terms of not isolating the return value from the original sequence, but it’s worth considering. We could potentially do the same for Take, in the case where we know it’s a list and the requested count is greater than or equal to the size of the list.
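
A hedged sketch of that Take short-circuit might look like this – again it would live after the argument validation, before deferring to the iterator block, and it shares the same caveat about returning the original reference:

var list = source as IList<TSource>;
if (list != null && count >= list.Count)
{
    // We'd be yielding every element anyway, so just hand back the source itself
    return source;
}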

Using Skip and Take together

It’s worth mentioning the most common use of Skip and Take, as well as an easy-to-make mistake. Skip and Take are ideal for paging. For example:

var pageItems = allItems.Skip(pageNumber * itemsPerPage)
                        .Take(itemsPerPage);

This simply skips as many items as it needs to, and only "takes" one page-worth of data afterwards. Very simple… but also easy to get wrong.

In many cases with LINQ, you can reorder operations with no harm – or with a compile-time error if you get them the wrong way round. Let’s consider what happens if we get it wrong this time though:

// Bug! Bug! Bug!
var pageItems = allItems.Take(itemsPerPage)
                        .Skip(pageNumber * itemsPerPage);

This will work for the first page, but after that you’ll always end up displaying an empty page: it will take the first page-worth of data, and then skip it all, leaving nothing to return.

Conclusion

That was more work than I expected it to be. Still, we’ve actually not got very many LINQ operators still to cover. By my reckoning, we still have the following operators to implement:

  • Sum, Average, Min, Max – I’m not looking forward to these, as they all have lots of overloads, and are likely to be tedious to test and implement.
  • Cast and OfType
  • ToArray and ToDictionary
  • ElementAt and ElementAtOrDefault – I should possibly have covered these along with First/Single/Last etc
  • SequenceEqual
  • Zip (from .NET 4)
  • Contains
  • OrderBy/OrderByDescending/ThenBy/ThenByDescending (probably the most interesting remaining operators)
  • Reverse

That’s not too bad a list. I’ll probably try to knock off ToArray tonight – it should be pretty simple.

Addendum

It turns out that my "simplification" for Take introduced a bug. If we only need to take two items, we should know that after we’ve taken the second one… we don’t really need to ask the iterator to move to the third item, just to decide that we’ve finished. Doing so could have side effects – such as causing an exception to be thrown. That’s exactly what we do when we use TakeWhile – we don’t care what the first value after the stopping point is, but we do try to fetch it. Oops.
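
Here’s a hedged sketch of the kind of test which catches this – a correct Take(2) should never need to evaluate the third projection:

[Test]
public void TakeOnlyReadsAsManyElementsAsItNeeds()
{
    int[] source = { 1, 2, 0 };
    // Evaluating the third projection would divide by zero; Take(2) shouldn't get that far
    source.Select(x => 10 / x).Take(2).AssertSequenceEqual(10, 5);
}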

Fortunately the Mono tests picked this up for me, so I’ve added my own test for it, and gone back to the original implementation of Take which just used a counter. Both my own tests and Mono’s now pass :)

Reimplementing LINQ to Objects: Part 22 – GroupJoin

Another operator that was decidedly easy to implement – partly as I was able to just adapt everything from Join, including the tests. 15 minutes for tests + implementation… and no doubt considerably longer writing it up.

What is it?

After the complexity of GroupBy, GroupJoin has a refreshingly small list of overloads:

public static IEnumerable<TResult> GroupJoin<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, IEnumerable<TInner>, TResult> resultSelector)

public static IEnumerable<TResult> GroupJoin<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, IEnumerable<TInner>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

We can essentially ignore the first overload in the normal way – it just uses the default equality comparer for TKey. (You’ll be pleased to hear that I think we’re coming towards the end of the operators which use equality comparers, so I won’t need to repeat that sort of sentence many more times.)

The best way to get to grips with what GroupJoin does is to think of Join. There, the overall idea was that we looked through the "outer" input sequence, found all the matching items from the "inner" sequence (based on a key projection on each sequence) and then yielded pairs of matching elements. GroupJoin is similar, except that instead of yielding pairs of elements, it yields a single result for each "outer" item based on that item and the sequence of matching "inner" items.

The fact that a result is yielded for every outer item is important: even if there are no matches, a result is still returned, just with an empty sequence. I’ll come back to that later, thinking about "left joins".

In terms of deferred execution, input consumption etc, GroupJoin behaves very much like Join:

  • Arguments are validated eagerly: everything other than comparer has to be non-null
  • It uses deferred execution
  • When you start iterating over the result, the "inner" sequence is immediately read all the way through
  • The "outer" sequence is streamed: each time you read an item from the result sequence, one item from the outer sequence is read

What are we going to test?

To be absolutely honest, I cribbed the tests from Join, and just adapted the resultSelector and the expected results. It’s simple enough to do: I was already using string concatenation to piece together the two items (outer and inner) for the Join tests; for GroupJoin I’ve just modified that to use string.Join (not the same as Enumerable.Join!) to build a semi-colon-delimited string of all the inner matches for that particular outer element.

Here’s a sample test, in both the Join and GroupJoin versions:

[Test]
public void SimpleJoin()
{
    // We’re going to join on the first character in the outer sequence item
    // being equal to the second character in the inner sequence item
    string[] outer = { "first", "second", "third" };
    string[] inner = { "essence", "offer", "eating", "psalm" };

    var query = outer.Join(inner,
                           outerElement => outerElement[0],
                           innerElement => innerElement[1],
                           (outerElement, innerElement) => outerElement + ":" + innerElement);

    // Note: no matches for "third"
    query.AssertSequenceEqual("first:offer", "second:essence", "second:psalm");
}

[Test]
public void SimpleGroupJoin()
{
    // We’re going to join on the first character in the outer sequence item
    // being equal to the second character in the inner sequence item
    string[] outer = { "first", "second", "third" };
    string[] inner = { "essence", "offer", "eating", "psalm" };

    var query = outer.GroupJoin(inner,
                           outerElement => outerElement[0],
                           innerElement => innerElement[1],
                           (outerElement, innerElements) => outerElement + ":" + string.Join(";", innerElements));

    query.AssertSequenceEqual("first:offer", "second:essence;psalm", "third:");
}

Note how the match sequence for "first" contains only a single element and the match sequence for "second" contains two elements. There were no matches for "third", but we still get a result, just with an empty match sequence.

Let’s implement it!

I was able to copy the implementation from Join as well, pretty much – I only had to simplify it to yield a single value instead of using a foreach loop.

Obviously the comparer-free overload calls into the one with the comparer, which then validates the arguments and calls into an iterator block method. If you’ve read all the rest of the posts in this series, you don’t need to see that yet again – and if you haven’t, just go back and look at some. So, what does the iterator block do? It uses a lookup again. It builds a lookup from the inner sequence, and then we just have to iterate over the outer sequence, applying resultSelector to each "outer element / matching inner element sequence" pair:

private static IEnumerable<TResult> GroupJoinImpl<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, IEnumerable<TInner>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = inner.ToLookup(innerKeySelector, comparer);
    foreach (var outerElement in outer)
    {
        var key = outerKeySelector(outerElement);
        yield return resultSelector(outerElement, lookup[key]);
    }
}

Yup, Lookup makes things so simple it almost hurts… and that’s without even trying hard. I’m sure we could pay some more attention to the design of Lookup and Grouping if we really wanted to. (I may revisit them later… they’re clearly working for us at the moment.)

Query expressions and left joins

C# has direct support for GroupJoin when you use the "join … into" syntax. Note that despite the use of "into" this is not a query continuation. Instead, it’s just a way of giving a name to all the matches in the join. Here’s an example, finding all the words of a given length:

int[] outer = { 5, 3, 7 };
string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

var query = from x in outer
            join y in inner on x equals y.Length into matches
            select x + ":" + string.Join(";", matches);

query.AssertSequenceEqual("5:tiger", "3:bee;cat;dog", "7:giraffe");

Note that whereas in a simple "join" syntax, the "y" range variable would be available for the rest of the query, this time it’s only "matches" that’s available – basically "y" is only used to express the key selector (y.Length in this case).

Now think about left joins (or left outer joins) from SQL. From the linked Wikipedia page:

The result of a left outer join (or simply left join) for table A and B always contains all records of the "left" table (A), even if the join-condition does not find any matching record in the "right" table (B). This means that if the ON clause matches 0 (zero) records in B, the join will still return a row in the result—but with NULL in each column from B.

So, how can we mimic this? Well, we could just use GroupJoin directly, and deal with the fact that if there are no matches, the match sequence will be empty. That’s not quite like a left join though – it doesn’t capture the idea of getting a null value for the unmatched result. We can use DefaultIfEmpty for that quite easily though. Here’s an example adapted from the one above, where I introduce 4 into the outer sequence. There are no animals with 4 letters in the inner list, so for that item we end up with no values. I’ve used DefaultIfEmpty to make sure that we do get a value – and just to make it show up, I’ve specified a default value of the string literal "null" rather than just a null reference. Here’s the test, including the results:

int[] outer = { 5, 3, 4, 7 };
string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

var query = from x in outer
            join y in inner on x equals y.Length into matches
            select x + ":" + string.Join(";", matches.DefaultIfEmpty("null"));

query.AssertSequenceEqual("5:tiger", "3:bee;cat;dog", "4:null", "7:giraffe");

Okay, that’s getting closer… but we still need to deal with a whole sequence of results at a time. A normal left join still gives pairs of results… can we mimic that? Absolutely – just use GroupJoin and then SelectMany, represented in query expressions as a second "from" clause:

int[] outer = { 5, 3, 4, 7 };
string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

var query = from x in outer
            join y in inner on x equals y.Length into matches
            from z in matches.DefaultIfEmpty("null")
            select x + ":" + z;
query.AssertSequenceEqual("5:tiger", "3:bee", "3:cat", "3:dog", "4:null", "7:giraffe");

That’s now genuinely close to what a SQL left join would do.

Conclusion

That’s it for joins, I believe. I’m so glad I implemented ToLookup first – it made everything else trivial. Next up, I think we’ll look at Take/TakeWhile/Skip/SkipWhile, which should be pleasantly simple.

Reimplementing LINQ to Objects: Part 21 – GroupBy

Okay, after the brief hiatus earlier, we’re ready to group sequences.

What is it?

GroupBy has eight overloads, using up to four type parameters and up to five normal parameters. Those of a sensitive disposition may wish to turn away now:

public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector)

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TKey, IEnumerable<TSource>, TResult> resultSelector)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TKey, IEnumerable<TSource>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector)
        
public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

Okay, you can look back now. All the nastiness will go away soon, I promise. You see, this is yet another operator with overloads which effectively fill in defaults… although there’s a twist here. Here are the various parameters and their meanings:

  • source: the input sequence, as ever, of element type TSource.
  • keySelector: a delegate to apply to each element in the input sequence to obtain a key (type TKey). The key is what decides which group the element is associated with.
  • elementSelector (optional): a delegate to apply to each element (type TElement) to obtain the value which should be part of the relevant group. If it’s not present, you can think of it as being the identity conversion.
  • resultSelector (optional): a delegate to apply to each grouping to produce a final result (type TResult). See below for the difference between overloads which specify this and those which don’t.
  • comparer (optional): an equality comparer used to compare keys; groups are formed by equal keys. If unspecified or null, the default equality comparer for TKey is used.

Now, there’s only one tricky bit here: resultSelector. In overloads where this is present, the result is type IEnumerable<TResult>; otherwise it’s IEnumerable<IGrouping<TKey, TElement>>. That would make perfect sense if the type of resultSelector were Func<IGrouping<TKey, TElement>, TResult> – that would just make it effectively default to an identity conversion. However, the type is actually Func<TKey, IEnumerable<TElement>, TResult>. There’s not a lot of difference between the two logically: basically you’ve got a key and a sequence of elements in both cases – but the disparity reduces the elegance of both the design and the implementation in my view. In fact, it’s not wholly clear to me why the overloads with a resultSelector are required in the first place – you’ll see why later.

Anyway, the basic operation should be reasonably familiar by now. We’re going to look at each input value in turn, and map it to a key and a group element. We’ll usually return a sequence of groups, where each group consists of the elements produced by input values with an equal key. If we’ve specified a resultSelector, that delegate will be presented with each key and all the members of the group associated with that key, and the final result sequence will consist of the return values of the delegate.
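
To make the two shapes concrete, here’s a small hedged sketch using two of the overloads listed above:

string[] words = { "one", "two", "three", "four" };

// Without a resultSelector: a sequence of groupings
IEnumerable<IGrouping<int, string>> groups = words.GroupBy(word => word.Length);

// With a resultSelector: each key and its elements are projected straight to a result
IEnumerable<string> summaries = words.GroupBy(
    word => word.Length,
    (key, values) => key + ": " + string.Join(", ", values));
// summaries: "3: one, two", "5: three", "4: four"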

General behaviour:

  • The operator is implemented with deferred execution, but it’s not as deferred as some other operators. More below.
  • All arguments other than comparer must not be null; this is validated eagerly.
  • The source is read completely as soon as you start iterating over the results.

Basically this performs exactly the same steps as ToLookup, except:

  • GroupBy uses deferred execution. Aside from anything else, this means that if you change the contents of "source" later and iterate over the results again, you’ll see the changes (whereas the lookup returned by ToLookup is effectively independent of the source)
  • The return value is always a sequence (whether it’s a sequence of groups or the result of calling resultSelector). This means you can’t perform arbitrary lookups on it later.

I’ve already quoted from the documentation when it comes to the ordering of the groups and the elements within the groups, but for completeness, here it is again:

The IGrouping<TKey, TElement> objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping<TKey, TElement>. Elements in a grouping are yielded in the order they appear in source.

When a resultSelector is present, the last sentence should be reinterpreted to consider the IEnumerable<TElement> sequence presented to resultSelector.

What are we going to test?

First let’s talk about deferred execution. GroupBy does use deferred execution – but it’s not as deferred as usual. For most operators (such as Select) you can call GetEnumerator() on the result and it still won’t actually start iterating over the input sequence until you call MoveNext() on the result’s iterator. In GroupBy – at least in the "proper" implementation – it’s the call to GetEnumerator() that does all the work.

In other words, my usual test like this wouldn’t work:

// This won’t work for GroupBy in LINQ to Objects
using (new ThrowingEnumerable().GroupBy(x => x).GetEnumerator())
{
    // No exception
}

In fact, I would probably argue that on reflection, my existing tests which check that execution is deferred until the first call to MoveNext() are probably overzealous. For GroupBy, the tests are relatively liberal – they will work whether it’s GetEnumerator() or MoveNext() which starts iterating over the input sequence. This is handy, as my implementation is currently slightly more deferred than that of LINQ to Objects :) Here are the relevant tests:

[Test]
public void ExecutionIsPartiallyDeferred()
{
    // No exception yet…
    new ThrowingEnumerable().GroupBy(x => x);
    // Note that for LINQ to Objects, calling GetEnumerator() starts iterating
    // over the input sequence, so we’re not testing that…
}

[Test]
public void SequenceIsReadFullyBeforeFirstResultReturned()
{
    int[] numbers = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 };
    // Final projection will throw
    var query = numbers.Select(x => 10 / x);

    var groups = query.GroupBy(x => x);
    Assert.Throws<DivideByZeroException>(() =>
    {
        using (var iterator = groups.GetEnumerator())
        {
            iterator.MoveNext();
        }
    });
}

I also have one test for each of the four overloads which doesn’t specify a comparer. I haven’t got any tests to ensure that the comparer is used properly – I felt they would be more effort than they were worth.

Here’s the test for the most complicated overload, just to show how all the bits fit together:

[Test]
public void GroupByWithElementProjectionAndCollectionProjection()
{
    string[] source = { "abc", "hello", "def", "there", "four" };
    // This time "values" will be an IEnumerable<char>, the first character of each
    // source string contributing to the group
    var groups = source.GroupBy(x => x.Length,
                                x => x[0],
                                (key, values) => key + ":" + string.Join(";", values));

    groups.AssertSequenceEqual("3:a;d", "5:h;t", "4:f");
}

Let’s look at the parameters involved:

  • source: just an array of strings
  • keySelector: takes a string and maps it to its length. We will end up with three groups, as we have strings of length 3 ("abc" and "def"), 5 ("hello" and "there") and 4 ("four")
  • elementSelector: takes a string and maps it to its first character.
  • resultSelector: takes a key (the length) and a sequence of elements (the first characters of all the input strings of that length) and returns a string joining them all together

So the result for the first group is "3:a;d" – 3 is the key, ‘a’ and ‘d’ are the first letters of "abc" and "def", and they’re joined together according to the resultSelector.

Finally, I also have a test demonstrating the use of GroupBy in a query expression, showing that these two queries are equivalent:

string[] source = { "abc", "hello", "def", "there", "four" };

var query1 = source.GroupBy(x => x.Length, x => x[0]);

var query2 = from x in source
             group x[0] by x.Length;

Okay, enough of the tests… what about the real code?

Let’s implement it!

First, some interesting line counts:

  • Lines in public method signatures: 36
  • Lines implementing argument validation: 16
  • Lines just delegating from one public overload to another: 6
  • Lines of actual implementation: 7

Sad, isn’t it? Still, there are things worth talking about.

Firstly, let’s reduce the eight overloads to two: the two that have both elementSelector and comparer specified. The six overloads which miss one or both of those parameters can simply be implemented by providing an identity conversion or the default key comparer (there’s a sketch of one such delegation after the signatures below). So that leaves us with this:

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
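
As an illustration, one of the six simpler overloads might delegate like this – just a sketch, and not necessarily the exact Edulinq code:

public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)
{
    // Identity element selector, default key comparer; argument validation
    // happens in the overload we delegate to
    return source.GroupBy(keySelector, element => element, EqualityComparer<TKey>.Default);
}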

Now normally I like to implement a method with fewer parameters by calling one with more parameters… but there’s a slight problem in this case. We would have to provide a resultSelector which took a key and a sequence of elements, and created a grouping from them. We could do that just by reusing our Grouping class… but there seems little reason to do that. In particular, suppose we implement it the other way round: given a grouping, we can easily extract the key (via the Key property), and the grouping itself is a sequence of the elements. In other words, we can implement the more complex method in terms of the simpler one:

public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    if (resultSelector == null)
    {
        throw new ArgumentNullException("resultSelector");
    }
    // Let the other GroupBy overload do the rest of the argument validation
    return source.GroupBy(keySelector, elementSelector, comparer)
                 .Select(group => resultSelector(group.Key, group));
}

The difference may seem somewhat arbitrary to start with, until we provide the final overload that we’re delegating to. Surprise surprise, we can use ToLookup yet again:

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    if (elementSelector == null)
    {
        throw new ArgumentNullException("elementSelector");
    }
    return GroupByImpl(source, keySelector, elementSelector, comparer ?? EqualityComparer<TKey>.Default);
}

private static IEnumerable<IGrouping<TKey, TElement>> GroupByImpl<TSource, TKey, TElement>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = source.ToLookup(keySelector, elementSelector, comparer);
    foreach (var result in lookup)
    {
        yield return result;
    }
}

You see, ILookup<TKey, TElement> already implements IEnumerable<IGrouping<TKey, TElement>>, so we have no extra work to do. If we had implemented the two overloads the other way round by calling "new Grouping(key, sequence)" as a resultSelector, we’d have ended up with two "wrapper" layers of Grouping when we only need one. (Alternatively, we could have cast straight back to IGrouping, but that would have felt somewhat wrong, as well as requiring an execution-time check.)

Note that again, we could have used "yield foreach" to great effect if it had been available…

Now do you see why I believe this would have been more elegant if the type of resultSelector had been Func<IGrouping<TKey, TElement>, TResult> instead? That way we could have just treated resultSelector as another optional parameter with a "default" value of the identity conversion… and all seven simpler overloads could have delegated straight to the most complex one. Oh well.
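
Just to make the hypothetical concrete, the final overload’s signature would have been (emphatically not the real one – purely a sketch of that alternative design):

public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<IGrouping<TKey, TElement>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

… and each grouping-returning overload could then have delegated to it with an identity resultSelector, along these lines:

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    // The "default" resultSelector: just return each grouping as-is
    return source.GroupBy(keySelector, elementSelector, group => group, comparer);
}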

Conclusion

Now that we’ve done "Join" and "GroupBy", there’s another very obvious operator to implement: GroupJoin. This (in conjunction with DefaultIfEmpty) is how the effect of a left join is usually achieved in LINQ. Let’s see if ToLookup does what we need yet again…

Reimplementing LINQ to Objects: Part 20 – ToList

This morning I started writing the tests for GroupBy, prior to implementing it. That turned out to be a pain – really I wanted an easy way of getting at each element of the result (i.e. each result group). If only I had the ability to convert an arbitrary sequence into a list… I needed ToList. So, we enter a fairly brief diversion.

What is it?

ToList has a single, simple overload:

public static List<TSource> ToList<TSource>(this IEnumerable<TSource> source)

Fairly obviously, ToList converts the source sequence into a list. Some points to note:

  • The signature specifies List<T>, not just IList<T>. Of course it could return a subclass of List<T>, but there seems little point.
  • It uses immediate execution – nothing is deferred here
  • The parameter (source) mustn’t be null
  • It’s optimized for the case when source implements ICollection<T>
  • It always creates a new, independent list.

The last two points are worth a bit more discussion. Firstly, the optimization for ICollection<T> isn’t documented, but it makes a lot of sense:

  • List<T> stores its data in an array internally
  • ICollection<T> exposes a Count property so the List<T> can create an array of exactly the right size to start with
  • ICollection<T> exposes a CopyTo method so that the List<T> can copy all the elements into the newly created array in bulk

ToList always creates a new list for consistency. If it just returned the source parameter directly if it was already a List<T>, that would mean that changes to the source after calling ToList would sometimes be visible in the returned list and sometimes not… making it harder to reason about any code which used ToList.
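
For example (a trivial illustration rather than one of the real tests):

List<int> original = new List<int> { 1, 2, 3 };
List<int> copy = original.ToList();
original.Add(4);

// The copy is unaffected by later changes to the source
Console.WriteLine(copy.Count); // 3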

What are we going to test?

I have tests for the bullet points listed above, and one extra test just to prove that it can work with lazily evaluated sequences as well as simple collections like arrays. (The test uses a range followed by a projection.)

In order to test the optimization for ICollection<T>, I’ve implemented another collection like NonEnumerableList, but this time just NonEnumerableCollection. Again, this just delegates all ICollection<T> operations to a backing List<T>, but throws an exception if you try to call GetEnumerator(). The test then just calls ToList on a NonEnumerableCollection: as no exception is thrown, that proves that either the operation is optimized as we’d expect, or the exception is being swallowed. I think it’s reasonable to assume that exceptions aren’t swallowed in LINQ to Objects :)
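
For reference, here’s roughly what NonEnumerableCollection looks like, based on the description above – the real test helper may differ in detail (for instance, in which exception type it throws):

// Requires System.Collections and System.Collections.Generic
public class NonEnumerableCollection<T> : ICollection<T>
{
    private readonly List<T> backingList;

    public NonEnumerableCollection(IEnumerable<T> items)
    {
        backingList = new List<T>(items);
    }

    // All the "real" collection operations delegate to the backing list
    public void Add(T item) { backingList.Add(item); }
    public void Clear() { backingList.Clear(); }
    public bool Contains(T item) { return backingList.Contains(item); }
    public void CopyTo(T[] array, int arrayIndex) { backingList.CopyTo(array, arrayIndex); }
    public bool Remove(T item) { return backingList.Remove(item); }
    public int Count { get { return backingList.Count; } }
    public bool IsReadOnly { get { return false; } }

    // The whole point: any attempt to iterate blows up
    public IEnumerator<T> GetEnumerator()
    {
        throw new NotSupportedException("Iteration is not supported");
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        throw new NotSupportedException("Iteration is not supported");
    }
}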

Let’s implement it!

This will probably be the simplest implementation in the whole of Edulinq:

public static List<TSource> ToList<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return new List<TSource>(source);
}

You may well be wondering what happened to the optimization… well, it’s in the List<T> constructor. We just get it for free. Unfortunately that’s not documented either… so we end up with an implementation which implements one undocumented optimization if List<T> implements another undocumented optimization :)

We can’t actually do much better than that – we can’t use ICollection<T>.CopyTo ourselves, as we don’t have access to the underlying array of List<T>. We could perform some optimization by calling the List<T> constructor which specifies a capacity, and then call AddRange. That would at least prevent the list from having to resize itself, but it would still need to iterate over the whole collection instead of using the (potentially very fast) CopyTo method.
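
Here’s a sketch of that alternative, just to make it concrete (Edulinq itself sticks with the one-line constructor call above):

public static List<TSource> ToList<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
    {
        // Pre-size the list so it never needs to resize its internal array
        List<TSource> list = new List<TSource>(collection.Count);
        list.AddRange(collection);
        return list;
    }
    return new List<TSource>(source);
}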

Conclusion

You may be wondering why we even need ToList, if we could just create a list by calling the constructor directly. The difference is that in order to call a constructor, you need to specify the element type as the type argument. When we use ToList, we can take advantage of type inference. In many cases this is just extremely convenient, but for anonymous types it’s actually required. How can you end up with a strongly typed list of an anonymous type? It’s easy with ToList, like this:

var query = Enumerable.Range(0, 10)
                      .Select(x => new { Value = x, Doubled = x * 2 });
        
var list = query.ToList();

Try doing that without an extension method or something similar. It’s worth noting at this point that although there are similar methods for arrays and Dictionary (ToArray and ToDictionary), there’s no equivalent for HashSet. It’s incredibly easy to write, of course, and an obvious extension to LINQ to Objects – but it’s not in the standard library. Maybe for .NET 5…
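
For what it’s worth, a minimal ToHashSet could look like this – a sketch, not part of Edulinq or (at the time of writing) the framework; an overload taking an IEqualityComparer<T> would follow the same pattern:

public static HashSet<TSource> ToHashSet<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return new HashSet<TSource>(source);
}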

So, now that we’ve got ToList sorted, I can get back to GroupBy and its eight overloads – easy to implement, but hard to test simply and hard to describe clearly. Lucky me.