Reimplementing LINQ to Objects: Part 22 – GroupJoin

Another operator that was decidedly easy to implement – partly as I was able to just adapt everything from Join, including the tests. 15 minutes for tests + implementation… and no doubt considerably long writing it up.

What is it?

After the complexity of GroupBy, GroupJoin has a refreshingly small list of overloads:

public static IEnumerable<TResult> GroupJoin<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, IEnumerable<TInner>, TResult> resultSelector)

public static IEnumerable<TResult> GroupJoin<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, IEnumerable<TInner>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

We can essentially ignore the first overload in the normal way – it just uses the default equality comparer for TKey. (You’ll be pleased to hear that I think we’re coming towards the end of the operators which use equality comparers, so I won’t need to repeat that sort of sentence many more times.)

The best way to get to grips with what GroupJoin does is to think of Join. There, the overall idea was that we looked through the "outer" input sequence, found all the matching items from the "inner" sequence (based on a key projection on each sequence) and then yielded pairs of matching elements. GroupJoin is similar, except that instead of yielding pairs of elements, it yields a single result for each "outer" item based on that item and the sequence of matching "inner" items.

The fact that a result is yielded for every outer item is important: even if there are no matches, a result is still returned, just with an empty sequence. I’ll come back to that later, thinking about "left joins".

In terms of deferred execution, input consumption etc, GroupJoin behaves very much like Join:

  • Arguments are validated eagerly: everything other than comparer has to be non-null
  • It uses deferred execution
  • When you start iterating over the result, the "inner" sequence is immediately read all the way through
  • The "outer" sequence is streamed: each time you read an item from the result sequence, one item from the outer sequence is read

What are we going to test?

To be absolutely honest, I cribbed the tests from Join, and just adapted the resultSelector and the expected results. It’s simple enough to do: I was already using string concatenation to piece together the two items (outer and inner) for the Join tests; for GroupJoin I’ve just modified that to use string.Join (not the same as Enumerable.Join!) to build a semi-colon string delimited string of all the inner matches for that particular outer element.

Here’s a sample test, in both the Join and GroupJoin versions:

[Test]
public void SimpleJoin()
{
    // We’re going to join on the first character in the outer sequence item
    // being equal to the second character in the inner sequence item
    string[] outer = { "first", "second", "third" };
    string[] inner = { "essence", "offer", "eating", "psalm" };

    var query = outer.Join(inner,
                           outerElement => outerElement[0],
                           innerElement => innerElement[1],
                           (outerElement, innerElement) => outerElement + ":" + innerElement);

    // Note: no matches for "third"
    query.AssertSequenceEqual("first:offer", "second:essence", "second:psalm");
}

[Test]
public void SimpleGroupJoin()
{
    // We’re going to join on the first character in the outer sequence item
    // being equal to the second character in the inner sequence item
    string[] outer = { "first", "second", "third" };
    string[] inner = { "essence", "offer", "eating", "psalm" };

    var query = outer.GroupJoin(inner,
                           outerElement => outerElement[0],
                           innerElement => innerElement[1],
                           (outerElement, innerElements) => outerElement + ":" + string.Join(";", innerElements));

    query.AssertSequenceEqual("first:offer", "second:essence;psalm", "third:");
}

Note how the match sequence for "first" contained only a single element and the match sequence for "second" contains two elements. There were no matches for "third", but we still get a result, just with an empty match sequence.

Let’s implement it!

I was able to copy the implementation from Join as well, pretty much – I only had to simplify it to yield a single value instead of using a foreach loop.

Obviously the comparer-free overload calls into the one with the comparer, which then validates the arguments and calls into an iterator block method. If you’ve read all the rest of the posts in this series, you don’t need to see that yet again – and if you haven’t, just go back and look at some. So, what does the iterator block do? It uses a lookup again. It builds a lookup from the inner sequence, and then we just have to iterate over the outer sequence, applying resultSelector to each "outer element / matching inner element sequence" pair:

private static IEnumerable<TResult> GroupJoinImpl<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, IEnumerable<TInner>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = inner.ToLookup(innerKeySelector, comparer);
    foreach (var outerElement in outer)
    {
        var key = outerKeySelector(outerElement);
        yield return resultSelector(outerElement, lookup[key]);
    }
}

Yup, Lookup makes things so simple it almost hurts… and that’s without even trying hard. I’m sure we could pay some more attention to the design of Lookup and Grouping if we really wanted to. (I may revisit them later… they’re clearly working for us at the moment.)

Query expressions and left joins

C# has direct support for GroupJoin when you use the "join … into" syntax. Note that despite the use of "into" this is not a query continuation. Instead, it’s just a way of giving a name to all the matches in the join. Here’s an example, finding all the words of a given length:

int[] outer = { 5, 3, 7 };
string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

var query = from x in outer
            join y in inner on x equals y.Length into matches
            select x + ":" + string.Join(";", matches);

query.AssertSequenceEqual("5:tiger", "3:bee;cat;dog", "7:giraffe");

Note that whereas in a simple "join" syntax, the "y" range variable would be available for the rest of the query, this time it’s only "matches" that’s available – basically "y" is only used to express the key selector (y.Length in this case).

Now think about left joins (or left outer joins) from SQL. From the linked Wikipedia page:

The result of a left outer join (or simply left join) for table A and B always contains all records of the "left" table (A), even if the join-condition does not find any matching record in the "right" table (B). This means that if the ON clause matches 0 (zero) records in B, the join will still return a row in the result—but with NULL in each column from B.

So, how can we mimic this? Well, we could just use GroupJoin directly, and deal with the fact that if there are no matches, the match sequence will be empty. That’s not quite like a left join though – it doesn’t capture the idea of getting a null value for the unmatched result. We can use DefaultIfEmpty for that quite easily though. Here’s an example adapted from the one above, where I introduce 4 into the outer sequence. There are no animals with 4 letters in the inner list, so for that item we end up with no values. I’ve used DefaultIfEmpty to make sure that we do get a value – and just to make it show up, I’ve specified a default value of the string literal "null" rather than just a null reference. Here’s the test, including the results:

int[] outer = { 5, 3, 4, 7 };
string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

var query = from x in outer
            join y in inner on x equals y.Length into matches
            select x + ":" + string.Join(";", matches.DefaultIfEmpty("null"));

query.AssertSequenceEqual("5:tiger", "3:bee;cat;dog", "4:null", "7:giraffe");

Okay, that’s getting closer… but we still need to deal with a whole sequence of results at a time. A normal left join still gives pairs of results… can we mimic that? Absolutely – just use GroupJoin and then SelectMany, represented in query expressions as a second "from" clause:

int[] outer = { 5, 3, 4, 7 };
string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

var query = from x in outer
            join y in inner on x equals y.Length into matches
            from z in matches.DefaultIfEmpty("null")
            select x + ":" + z;
query.AssertSequenceEqual("5:tiger", "3:bee", "3:cat", "3:dog", "4:null", "7:giraffe");

That’s now genuinely close to what a SQL left join would do.

Conclusion

That’s it for joins, I believe. I’m so glad I implemented ToLookup first – it made everything else trivial. Next up, I think we’ll look at Take/TakeWhile/Skip/SkipWhile, which should be pleasantly simple.

8 thoughts on “Reimplementing LINQ to Objects: Part 22 – GroupJoin”

  1. Just to make sure I got this right, if you wanted the query

    var query = from x in outer
    left join y in inner on x equals y.Length
    select x + “:” + y;

    would you do

    var query = from x in outer
    join y in inner on x equals y.Length into matches
    from y in matches.DefaultIfEmpty()
    select x + “:” + y;

    ?

    Also I want to mention I like the way that “null” has four letters and matches the pattern when you get “4:null”

    Like

  2. @configurator: I suspect I wouldn’t reuse the range variable “y” here (although it may well be valid – I haven’t tried) but the idea is basically right, yes.

    Like

  3. Actually, you can reuse the variable “y”, because it’s not in scope anymore after the group join. I personnally find it clearer than introducing a new variable with a different name

    Like

  4. @Thomas: Yup, I expected it would be valid for precisely that reason. I suspect I’d *really* change the original “y” for “tmp” or something like that, but then use “y” in the “from” clause, if you see what I mean.

    Like

Leave a comment