Reimplementing LINQ to Objects: Part 19 – Join

You might expect that as joining is such a fundamental concept in SQL, it would be a complex operation to both describe and implement in LINQ. Fortunately, nothing could be further from the truth…

What is it?

Join has a mere two overloads, with the familiar pattern of one taking a custom equality comparer and the other not:

public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector)

public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

These are daunting signatures to start with – four type parameters, and up to six normal method parameters (including the first "this" parameter indicating an extension method). Don’t panic. It’s actually quite simple to understand each individual bit:

  • We have two sequences (outer and inner). The two sequences can have different element types (TOuter and TInner).
  • For each sequence, there’s also a key selector – a projection from a single item to the key for that item. Note that although the key type can be different from the two sequence types, there’s only one key type – both key selectors have to return the same kind of key (TKey).
  • An optional equality comparer is used to compare keys.
  • A delegate (resultSelector) is used to project a pair of items whose keys are equal (one from each sequence) to the result type (TResult)

The idea is that we look through the two input sequences for pairs which correspond to equal keys, and yield one output element for each pair. This is an equijoin operation: we can only deal with equal keys, not pairs of keys which meet some arbitrary condition. It’s also an inner join in database terms – we will only see an item from one sequence if there’s a "matching" item from the other sequence. I’ll talk about mimicking left joins when I implement GroupJoin.

The documentation gives us details of the order in which we need to return items:

Join preserves the order of the elements of outer, and for each of these elements, the order of the matching elements of inner.

For the sake of clarity, it’s probably worth including a naive implementation which at least gives the right results in the right order:

// Bad implementation, only provided for the purposes of discussion
public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    // Argument validation omitted
    foreach (TOuter outerElement in outer)
    {
        TKey outerKey = outerKeySelector(outerElement);
        foreach (TInner innerElement in inner)
        {
            TKey innerKey = innerKeySelector(innerElement);
            if (comparer.Equals(outerKey, innerKey))
            {
                yield return resultSelector(outerElement, innerElement);
            }
        }
    }
}

Aside from the missing argument validation, there are two important problems with this:

  • It iterates over the inner sequence multiple times. I always advise anyone implementing a LINQ-like operator to only iterate over any input sequence once. There are sequences which are impossible to iterate over multiple times, or which may give different results each time. That’s bad news.
  • It always has a complexity of O(N * M) for N items in the inner sequence and M items in the outer sequence. Eek. Admittedly that’s always a possible complexity – two sequences which have the same key for all elements will always have that complexity – but in a typical situation we can do considerably better.

The real Join operator uses the same behaviour as Except and Intersect when it comes to how the input sequences are consumed:

  • Argument validation occurs eagerly – both sequences and all the "selector" delegates have to be non-null; the comparer argument can be null, leading to the default equality comparer for TKey being used.
  • The overall operation uses deferred execution: it doesn’t iterate over either input sequence until something starts iterating over the result sequence.
  • When MoveNext is called on the result sequence for the first time, it immediately consumes the whole of the inner sequence, buffering it.
  • The outer sequence is streamed – it’s only read one element at a time. By the time the result sequence has started yielding results from the second element of outer, it’s forgotten about the first element.

We’ve started veering towards an implementation already, so let’s think about tests.

What are we going to test?

I haven’t bothered with argument validation tests this time – even with cut and paste, the 10 tests required to completely check everything feels like overkill.

However, I have tested:

  • Joining two invalid sequences, but not using the results (i.e. testing deferred execution)
  • The way that the two sequences are consumed
  • Using a custom comparer
  • Not specifying a comparer
  • Using sequences of different types

See the source code for more details, but here’s a flavour – the final test:

[Test]
public void DifferentSourceTypes()
{
    int[] outer = { 5, 3, 7 };
    string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

    var query = outer.Join(inner,
                           outerElement => outerElement,
                           innerElement => innerElement.Length,
                           (outerElement, innerElement) => outerElement + ":" + innerElement);
    query.AssertSequenceEqual("5:tiger", "3:bee", "3:cat", "3:dog", "7:giraffe");
}

To be honest, the tests aren’t very exciting. The implementation is remarkably simple though.

Let’s implement it!

Trivial decision 1: make the comparer-less overload call the one with the comparer.

Trivial decision 2: use the split-method technique to validate arguments eagerly but defer the rest of the operation

That leaves us with the actual implementation in an iterator block, which is all I’m going to provide the code for here. So, what are we going to do?

Well, we know that we’re going to have to read in the whole of the "inner" sequence – but let’s wait a minute before deciding how to store it. We’re then going to iterate over the "outer" sequence. For each item in the outer sequence, we need to find the key and then work out all the "inner" sequence items which match that key. Now, the idea of finding all the items in a sequence which match a particular key should sound familiar to you – that’s exactly what a lookup is for. If we build a lookup from the inner sequence as our very first step, the rest becomes easy: we can fetch the sequence of matches, then iterate over them and yield the return value of calling result selector on the pair of elements.

All of this is easier to see in code than in words, so here’s the method:

private static IEnumerable<TResult> JoinImpl<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = inner.ToLookup(innerKeySelector, comparer);
    foreach (var outerElement in outer)
    {
        var key = outerKeySelector(outerElement);
        foreach (var innerElement in lookup[key])
        {
            yield return resultSelector(outerElement, innerElement);
        }
    }
}

Personally, I think this is rather beautiful… in particular, I like the way that it uses every parameter exactly once. Everything is just set up to work nicely.

But wait, there’s more… If you look at the nested foreach loops, that should remind you of something: for each outer sequence element, we’re computing a nested sequence, then applying a delegate to each pair, and yielding the result. That’s almost exactly the definition of SelectMany! If only we had a "yield foreach" or "yield!" I’d be tempted to use an implementation like this:

private static IEnumerable<TResult> JoinImpl<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = inner.ToLookup(innerKeySelector, comparer);
    // Warning: not really valid C#
    yield foreach outer.SelectMany(outerElement => lookup[outerKeySelector(outerElement)],
                                   resultSelector);
}

Unfortunately there’s no such thing as a "yield foreach" statement. We can’t just call SelectMany and return the result directly, because then we wouldn’t be deferring execution. The best we can sensibly do is loop:

private static IEnumerable<TResult> JoinImpl<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = inner.ToLookup(innerKeySelector, comparer);
    var results = outer.SelectMany(outerElement => lookup[outerKeySelector(outerElement)],
                                   resultSelector);
    foreach (var result in results)
    {
        yield return result;
    }
}

At that stage, I think I prefer the original form – which is still pretty simple, thanks to the lookup.

While I have avoided using other operators for implementations in the past, in this case it feels so completely natural, it would be silly not to use ToLookup to implement Join.

One final point before I wrap up for the night…

Query expressions

Most of the operators I’ve been implementing recently don’t have any direct mapping in C# query expressions. Join does, however. The final test I showed before can also be written like this:

int[] outer = { 5, 3, 7 };
string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

var query = from x in outer
            join y in inner on x equals y.Length
            select x + ":" + y;

query.AssertSequenceEqual("5:tiger", "3:bee", "3:cat", "3:dog", "7:giraffe");

The compiler will effectively generate the same code (although admittedly I’ve used shorter range variable names here – x and y instead of outerElement and innerElement respectively). In this case the resultSelector delegate it supplies is simply the final projection from the "select" clause – but if we had anything else (such as a where clause) between the join and the select, the compiler would introduce a transparent identifier to propagate the values of both x and y through the query. It looks like I haven’t blogged explicitly about transparent identifiers, although I cover them in C# in Depth. Maybe once I’ve finished actually implementing the operators, I’ll have a few more general posts on this sort of thing.

Anyway, the point is that for simple join clauses (as opposed to join…into) we’ve implemented everything we need to. Hoorah.

Conclusion

I spoiled some of the surprise around how easy Join would be to implement by mentioning it in the ToLookup post, but I’m still impressed by how neat it is. Again I should emphasize that this is due to the design of LINQ – it’s got nothing to do with my own prowess.

This won’t be the end of our use of lookups though… the other grouping constructs can all use them too. I’ll try to get them all out of the way before moving on to operators which feel a bit different…

Addendum

It turns out this wasn’t quite as simple as I’d expected. Although ToLookup and GroupBy handle null keys without blinking, Join and GroupJoin ignore them. I had to write an alternative version of ToLookup which ignores null keys while populating the lookup, and then replace the calls of "ToLookup" in the code above with calls to "ToLookupNoNullKeys". This isn’t documented anywhere, and is inconsistent with ToLookup/GroupBy. I’ve opened up a Connect issue about it, in the hope that it at least gets documented properly. (It’s too late to change the behaviour now, I suspect.)

7 thoughts on “Reimplementing LINQ to Objects: Part 19 – Join”

  1. You say: “We can’t just call SelectMany and return the result directly, because then we wouldn’t be deferring execution.” We wouldn’t, but SelectMany would. That is, if you did a “return outer.SelectMany(…)”, the enumeration still wouldn’t happen until the caller began enumerating the result, because the result would be the deferred sequence generated by SelectMany. So I think calling SelectMany and returning the result directly would still do what you want, while avoiding the loop. Am I missing something?

    Like

  2. @Ivan: I’m afraid you *are* missing something, yes – before our call to SelectMany, we’d have to call inner.ToLookup – and *that* would be executed eagerly.

    I tried it – my tests failed :)

    Like

  3. An interesting test would be to use MS Pex to verfiy that the simple nested-loop join algorithm behaves identically to the more sophisticated one based on hashing. That would be quite trivial to do with Pex and would provide quite a nice check that all join properties are met.

    Like

  4. How about this:

    private static IEnumerable JoinImpl(
    IEnumerable outer,
    IEnumerable inner,
    Func outerKeySelector,
    Func innerKeySelector,
    Func resultSelector,
    IEqualityComparer comparer)
    {
    var lookup = new Lazy<ILookup>(() => inner.ToLookup(innerKeySelector, comparer));
    return outer.SelectMany(outerElement => lookup.Value[outerKeySelector(outerElement)],
    resultSelector);
    }

    Like

Leave a comment