Reimplementing LINQ to Objects: Part 21 – GroupBy

Okay, after the brief hiatus earlier, we’re ready to group sequences.

What is it?

GroupBy has eight overloads, using up to four type parameters and up to five normal parameters. Those of a sensitive disposition may wish to turn away now:

public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector)

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TKey, IEnumerable<TSource>, TResult> resultSelector)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TKey, IEnumerable<TSource>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector)
        
public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

Okay, you can look back now. All the nastiness will go away soon, I promise. You see, this is yet another operator with overloads which effectively fill in defaults… although there’s a twist here. Here are the various parameters and their meanings:

  • source: the input sequence, as ever, of element type TSource.
  • keySelector: a delegate to apply to each element in the input sequence to obtain a key (type TKey). The key is what decides which group the element is associated with.
  • elementSelector (optional): a delegate to apply to each element (type TElement) to obtain the value which should be part of the relevant group. If it’s not present, you can think of it as being the identity conversion.
  • resultSelector (optional): a delegate to apply to each grouping to produce a final result (type TResult). See below for the difference between overloads which specify this and those which don’t.
  • comparer (optional): an equality comparer used to compare keys; groups are formed by equal keys. If unspecified or null, the default equality comparer for TKey is used.

Now, there’s only one tricky bit here: resultSelector. In overloads where this is present, the result is type IEnumerable<TResult>; otherwise it’s IEnumerable<IGrouping<TKey, TElement>>. That would make perfect sense if the type of resultSelector were Func<IGrouping<TKey, TElement>, TResult> – that would just make it effectively default to an identity conversion. However, the type is actually Func<TKey, IEnumerable<TElement>, TResult>. There’s not a lot of difference between the two logically: basically you’ve got a key and a sequence of elements in both cases – but the disparity reduces the elegance of both the design and the implementation in my view. In fact, it’s not wholly clear to me why the overloads with a resultSelector are required in the first place – you’ll see why later.

Anyway, the basic operation should be reasonably familiar by now. We’re going to look at each input value in turn, and map it to a key and a group element. We’ll usually return a sequence of groups, where each group consists of the elements produced by input values with an equal key. If we’ve specified a resultSelector, that delegate will be presented with each key and all the members of the group associated with that key, and the final result sequence will consist of the return values of the delegate.

General behaviour:

  • The operator is implemented with deferred execution, but it’s not as deferred as some other operators. More below.
  • All arguments other than comparer must not be null; this is validated eagerly.
  • The source is read completely as soon as you start iterating over the results.

Basically this performs exactly the same steps as ToLookup, except:

  • GroupBy uses deferred execution. Aside from anything else, this means that if you change the contents of "source" later and iterate over the results again, you’ll see the changes (whereas the lookup returned by ToLookup is effectively independent of the source)
  • The return value is always a sequence (whether it’s a sequence of groups or the result of calling resultSelector). This means you can’t perform arbitrary lookups on it later.

I’ve already quoted from the documentation when it comes to the ordering of the groups and the elements within the groups, but for completeness, here it is again:

The IGrouping<TKey, TElement> objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping<TKey, TElement>. Elements in a grouping are yielded in the order they appear in source.

When a resultSelector is present, the last sentence should be reinterpreted to consider the IEnumerable<TElement> sequence presented to resultSelector.

What are we going to test?

First let’s talk about deferred execution. GroupBy does use deferred execution – but that it’s not as deferred as usual. For most operators (such as Select) you can call GetEnumerator() on the result and it still won’t actually start iterating over the input sequence until you call MoveNext() on the result’s iterator. In GroupBy – at least in the "proper" implementation – it’s the call to GetEnumerator() that does all the work.

In other words, my usual test like this wouldn’t work:

// This won’t work for GroupBy in LINQ to Objects
using (new ThrowingEnumerable().GroupBy(x => x).GetEnumerator())
{
    // No exception
}

In fact, I would probably argue that on reflection, my existing tests which check that execution is deferred until the first call to MoveNext() are probably overzealous. For GroupBy, the tests are relatively liberal – they will work whether it’s GetEnumerator() or MoveNext() which starts iterating over the input sequence. This is handy, as my implementation is currently slightly more deferred than that of LINQ to Objects :) Here are the relevant tests:

[Test]
public void ExecutionIsPartiallyDeferred()
{
    // No exception yet…
    new ThrowingEnumerable().GroupBy(x => x);
    // Note that for LINQ to Objects, calling GetEnumerator() starts iterating
    // over the input sequence, so we’re not testing that…
}

[Test]
public void SequenceIsReadFullyBeforeFirstResultReturned()
{
    int[] numbers = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 };
    // Final projection will throw
    var query = numbers.Select(x => 10 / x);

    var groups = query.GroupBy(x => x);
    Assert.Throws<DivideByZeroException>(() =>
    {
        using (var iterator = groups.GetEnumerator())
        {
            iterator.MoveNext();
        }
    });
}

I also have one test for each of the four overloads which doesn’t specify a comparer. I haven’t got any tests to ensure that the comparer is used properly – I felt they would be more effort than they were worth.

Here’s the test for the most complicated overload, just to show how all the bits fit together:

[Test]
public void GroupByWithElementProjectionAndCollectionProjection()
{
    string[] source = { "abc", "hello", "def", "there", "four" };
    // This time "values" will be an IEnumerable<char>, the first character of each
    // source string contributing to the group
    var groups = source.GroupBy(x => x.Length,
                                x => x[0],
                                (key, values) => key + ":" + string.Join(";", values));

    groups.AssertSequenceEqual("3:a;d", "5:h;t", "4:f");
}

Let’s look at the parameters involved:

  • source: just an array of strings
  • keySelector: takes a string and maps it to its length. We will end up with three groups, as we have strings of length 3 ("abc" and "def"), 5 ("hello" and "there") and 4 ("four")
  • elementSelector: takes a string and maps it to its first character.
  • resultSelector: takes a key (the length) and a sequence of elements (the first characters of all the input strings of that length) and returns a string joining them all together

So the result for the first group is "3:a;d" – 3 is the key, ‘a’ and ‘d’ are the first letters of "abc" and "def", and they’re joined together according to the resultSelector.

Finally, I also have a test demonstrating the use of GroupBy in a query expression, showing that these two queries are equivalent:

string[] source = { "abc", "hello", "def", "there", "four" };

var query1 = source.GroupBy(x => x.Length, x => x[0]);

var query2 = from x in source
             group x[0] by x.Length;

Okay, enough of the tests… what about the real code?

Let’s implement it!

First, some interesting line counts:

  • Lines in public method signatures: 36
  • Lines implementing argument validation: 16
  • Lines just delegating from one public overload to another: 6
  • Lines of actual implementation: 7

Sad, isn’t it? Still, there are things worth talking about.

Firstly, let’s reduce the eight overloads to two: the two that both have elementSelector and comparer specified. The six overloads which miss one or other of those parameters can simply be implemented by providing an identity conversion or the default key comparer. So that leaves us with this:

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

Now normally I like to implement a method with fewer parameters by calling one with more parameters… but there’s a slight problem in this case. We would have to provide a resultSelector which took a key and a sequence of elements, and creating a grouping from it. We could do that just by reusing our Grouping class… but there seems little reason to do that. In particular, suppose we implement it the other way round: given a grouping, we can easily extract the key (via the Key property) and the grouping itself is a sequence of the elements. In other words, we can implement the more complex method in terms of the simpler one:

public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    if (resultSelector == null)
    {
        throw new ArgumentNullException("resultSelector");
    }
    // Let the other GroupBy overload do the rest of the argument validation
    return source.GroupBy(keySelector, elementSelector, comparer)
                 .Select(group => resultSelector(group.Key, group));
}

The difference may seem somewhat arbitrary to start with, until we provide the final overload that we’re delegating to. Surprise surprise, we can use ToLookup yet again:

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    if (elementSelector == null)
    {
        throw new ArgumentNullException("elementSelector");
    }
    return GroupByImpl(source, keySelector, elementSelector, comparer ?? EqualityComparer<TKey>.Default);
}

private static IEnumerable<IGrouping<TKey, TElement>> GroupByImpl<TSource, TKey, TElement>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = source.ToLookup(keySelector, elementSelector, comparer);
    foreach (var result in lookup)
    {
        yield return result;
    }
}

You see, ILookup<TKey, TElement> already implements IEnumerable<IGrouping<TKey, TValue>> so we have no extra work to do. If we had implemented the two overloads the other way round by calling "new Grouping(key, sequence)" as a resultSelector, we’d have ended up with two "wrapper" layers of Grouping when we only need one. (Alternatively, we could have cast straight back to IGrouping, but that would have felt somewhat wrong, as well as requiring an execution-time check.)

Note that again, we could have used "yield foreach" to great effect if it had been available…

Now do you see why I believe this would have been more elegant if the type of resultSelector had been Func<IGrouping<TKey, TElement>, TResult> instead? That way we could have just treated resultSelector as another optional parameter with a "default" value of the identity conversion… and all 7 simpler overloads could have delegated straight to the most complex one. Oh well.

Conclusion

Now we’ve done "Join" and "GroupBy", there’s another very obvious operator to implement: GroupJoin. This (in conjunction with DefaultIfEmpty) is how the effect of left joins are usually achieved in LINQ. Let’s see if ToLookup does what we need yet again…

14 thoughts on “Reimplementing LINQ to Objects: Part 21 – GroupBy”

  1. Is there any good reason for the standard GroupBy to consume the source upon the call to GetEnumerator? Or is this just an inconsistency in the standard implementation?

    Like

  2. @simon: It’s just a side-effect of the implementation, basically. I don’t think there’s anything *wrong* with it doing that.

    Like

  3. Regarding the reasoning behind the return value of the resultSelector overloads, I think it’s quite good that the LINQ design team chose to favor the millions of programmers who are *using* linq instead of the three (microsoft, mono team and you) who implment the API ;-)

    As a consumer of the API I think it’s quite convenient to be able to project the group to something useful directly in the GroupBy without having to append an extra .Select().

    Thanks for these articles by the way. It’s the first thing I read every morning when I start Flipboard and browse through my feeds. Keep up the good work

    Like

  4. @Isak: I don’t see how it would have been any harder to use a resultSelector of type Func<IGrouping, TResult> though.

    Also, I can’t remember the last time I actually *used* a GroupBy call with a resultSelector – and the C# compiler doesn’t use it either for “group by” in query expressions.

    Like

  5. @Jon: After re-reading your post I see your point now. I first thought you didn’t like that the GroupBy overloads had different return type but it’s just the resultSelector type parameters you’re concerned with.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s