Reimplementing LINQ to Objects: Part 18 – ToLookup

I’ve had a request to implement Join, which seems a perfectly reasonable operator to aim towards. However, it’s going to be an awful lot easier to implement if we write ToLookup first. That will also help with GroupBy and GroupJoin, later on.

In the course of this post we’ll create a certain amount of infrastructure – some of which we may want to tweak later on.

What is it?

ToLookup has four overloads, with these signatures:

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector)

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)

Essentially these boil down to two required parameters and two optional ones:

  • The source is required, and must not be null
  • The keySelector is required, and must not be null
  • The elementSelector is optional, and defaults to an identity projection of a source element to itself. If it is specified, it must not be null.
  • The comparer is optional, and defaults to the default equality comparer for the key type. It may be null, which is equivalent to specifying the default equality comparer for the key type.

Now we can just consider the most general case – the final overload. However, in order to understand what ToLookup does, we have to know what ILookup<TKey, TElement> means – which in turn means knowing about IGrouping<TKey, TElement>:

public interface IGrouping<out TKey, out TElement> : IEnumerable<TElement>
{
    TKey Key { get; }
}

public interface ILookup<TKey, TElement> : IEnumerable<IGrouping<TKey, TElement>>
{
    int Count { get; }
    IEnumerable<TElement> this[TKey key] { get; }
    bool Contains(TKey key);
}

The generics make these interfaces seem somewhat scary at first sight, but they’re really not so bad. A grouping is simply a sequence with an associated key. This is a pretty simple concept to transfer to the real world – something like "Plays by Oscar Wilde" could be an IGrouping<string, Play> with a key of "Oscar Wilde". The key doesn’t have to be "embedded" within the element type though – it would also be reasonable to have an IGrouping<string, string> representing just the names of plays by Oscar Wilde.

A lookup is essentially a map or dictionary where each key is associated with a sequence of values instead of a single one. Note that the interface is read-only, unlike IDictionary<TKey, TValue>. As well as looking up a single sequence of values associated with a key, you can also iterate over the whole lookup in terms of groupings (instead of the key/value pair from a dictionary). There’s one other important difference between a lookup and a dictionary: if you ask a lookup for the sequence corresponding to a key which it doesn’t know about, it will return an empty sequence, rather than throwing an exception. (A key which the lookup does know about will never yield an empty sequence.)

One slightly odd point to note is that while IGrouping is covariant in TKey and TElement, ILookup is invariant in both of its type parameters. While TKey has to be invariant, it would be reasonable for TElement to be covariant – to go back to our "plays" example, an IGrouping<string, Play> could be sensibly regarded as an IGrouping<string, IWrittenWork> (with the obvious type hierarchy). However, the interface declarations above are the ones in .NET 4, so that’s what I’ve used in Edulinq.

Now that we understand the signature of ToLookup, let’s talk about what it actually does. Firstly, it uses immediate execution – the lookup returned by the method is effectively divorced from the original sequence; changes to the sequence after the method has returned won’t change the lookup. (Obviously changes to the objects within the original sequence may still be seen, depending on what the element selector does, etc.) The rest is actually fairly straightforward, when you consider what parameters we have to play with:

  • The keySelector is applied to each item in the input sequence. Keys are always compared for equality using the "comparer" parameter.
  • The elementSelector is applied to each item in the input sequence, to project the item to the value which will be returned within the lookup.

To demonstrate this a little further, here are two applications of ToLookup – one using the first overload, and one using the last. In each case we’re going to group plays by author and then display some information about them. Hopefully this will make some of the concepts I’ve described a little more concrete:

ILookup<string, Play> lookup = allPlays.ToLookup(play => play.Author);
foreach (IGrouping<string, Play> group in lookup)
{
    Console.WriteLine("Plays by {0}", group.Key);
    foreach (Play play in group)
    {
        Console.WriteLine("  {0} ({1})", play.Name, play.CopyrightDate);
    }
}

// Or…

ILookup<string, string> lookup = allPlays.ToLookup(play => play.Author,
                                                   play => play.Name,
                                                   StringComparer.OrdinalIgnoreCase);
foreach (IGrouping<string, string> group in lookup)
{
    Console.WriteLine("Plays by {0}:", group.Key);
    foreach (string playName in group)
    {
        Console.WriteLine("  {0}", playName);
    }
}
        
// And demonstrating looking up by a key:
IEnumerable<string> playsByWilde = lookup["Oscar Wilde"];
Console.WriteLine("Plays by Oscar Wilde: {0}",
                  string.Join("; ", playsByWilde));

Note that although I’ve used explicit typing for all the variables here, I would usually use implicit typing with var, just to keep the code more concise.

Now we just have the small matter of working out how to go from a key/element pair for each item, to a data structure which lets us look up sequences by key.

Before we move onto the implementation and tests, there are two aspects of ordering to consider:

  • How are the groups within the lookup ordered?
  • How are the elements within each group ordered?

The documentation for ToLookup is silent on these matters – but the docs for the closely-related GroupBy operator are more specific:

The IGrouping<TKey, TElement> objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping<TKey, TElement>. Elements in a grouping are yielded in the order they appear in source.

Admittedly "in an order based on…" isn’t as clear as it might be, but I think it’s reasonable to make sure that our implementation yields groups such that the first group returned has the same key as the first element in the source, and so on.

What are we going to test?

I’ve actually got relatively few tests this time. I test each of the overloads, but not terribly exhaustively. There are three tests worth looking at though. The first two show the "eager" nature of the operator, and the final one demonstrates the most complex overload by grouping people into families.

[Test]
public void SourceSequenceIsReadEagerly()
{
    var source = new ThrowingEnumerable();
    Assert.Throws<InvalidOperationException>(() => source.ToLookup(x => x));
}

[Test]
public void ChangesToSourceSequenceAfterToLookupAreNotNoticed()
{
    List<string> source = new List<string> { "abc" };
    var lookup = source.ToLookup(x => x.Length);
    Assert.AreEqual(1, lookup.Count);

    // Potential new key is ignored
    source.Add("x");
    Assert.AreEqual(1, lookup.Count);

    // Potential new value for existing key is ignored
    source.Add("xyz");
    lookup[3].AssertSequenceEqual("abc");
}

[Test]
public void LookupWithComparareAndElementSelector()
{
    var people = new[] {
        new { First = "Jon", Last = "Skeet" },
        new { First = "Tom", Last = "SKEET" }, // Note upper-cased name
        new { First = "Juni", Last = "Cortez" },
        new { First = "Holly", Last = "Skeet" },
        new { First = "Abbey", Last = "Bartlet" },
        new { First = "Carmen", Last = "Cortez" },                
        new { First = "Jed", Last = "Bartlet" }
    };

    var lookup = people.ToLookup(p => p.Last, p => p.First, StringComparer.OrdinalIgnoreCase);

    lookup["Skeet"].AssertSequenceEqual("Jon", "Tom", "Holly");
    lookup["Cortez"].AssertSequenceEqual("Juni", "Carmen");
    // The key comparer is used for lookups too
    lookup["BARTLET"].AssertSequenceEqual("Abbey", "Jed");

    lookup.Select(x => x.Key).AssertSequenceEqual("Skeet", "Cortez", "Bartlet");
}

I’m sure there are plenty of other tests I could have come up with. lf you want to see any ones in particular (either as an edit to this post if they already exist in source control, or as entirely new tests), please leave a comment.

EDIT: It turned out there were a few important tests missing… see the addendum.

Let’s implement it!

There’s quite a lot of code involved in implementing this – but it should be reusable later on, potentially with a few tweaks. Let’s get the first three overloads out of the way to start with:

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)
{
    return source.ToLookup(keySelector, element => element, EqualityComparer<TKey>.Default);
}

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    return source.ToLookup(keySelector, element => element, comparer);
}

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector)
{
    return source.ToLookup(keySelector, elementSelector, EqualityComparer<TKey>.Default);
}

Now we can just worry about the final one. Before we implement that though, we’re definitely going to need an implementation of IGrouping. Let’s come up with a really simple one to start with:

internal sealed class Grouping<TKey, TElement> : IGrouping<TKey, TElement>
{
    private readonly TKey key;
    private readonly IEnumerable<TElement> elements;

    internal Grouping(TKey key, IEnumerable<TElement> elements)
    {
        this.key = key;
        this.elements = elements;
    }

    public TKey Key { get { return key; } }

    public IEnumerator<TElement> GetEnumerator()
    {
        return elements.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

How do we feel about this? Well, groupings should be immutable. We have no guarantee that the "values" sequence won’t be changed after creation – but it’s an internal class, so we’ve just got to be careful about how we use it. We also have to be careful that we don’t use a mutable sequence type which allows a caller to get from the iterator (the IEnumerator<TElement> returned by GetEnumerator) back to the sequence and then mutate it. In our case we’re actually going to use List<T> to provide the sequences, and while List<T>.Enumerator is a public type, it doesn’t expose the underlying list. Of course a caller could use reflection to mess with things, but we’re not going to try to protect against that.

Okay, so now we can pair a sequence with a key… but we still need to implement ILookup. This is where there are multiple options. We want our lookup to be immutable, but there are various degrees of immutability we could use:

  • Mutable internally, immutable in public API
  • Mutable privately, immutable to the internal API
  • Totally immutable, even within the class itself

The first option is the simplest to implement, and it’s what I’ve gone for at the moment. I’ve created a Lookup class which is allows a key/element pair to be added to it from within the Edulinq assembly. It uses a Dictionary<TKey, List<TElement>> to map the keys to sequences efficiently, and a List<TKey> to remember the order in which we first saw the keys. Here’s the complete implementation:

internal sealed class Lookup<TKey, TElement> : ILookup<TKey, TElement>
{
    private readonly Dictionary<TKey, List<TElement>> map;
    private readonly List<TKey> keys;

    internal Lookup(IEqualityComparer<TKey> comparer)
    {
        map = new Dictionary<TKey, List<TElement>>(comparer);
        keys = new List<TKey>();
    }

    internal void Add(TKey key, TElement element)
    {
        List<TElement> list;
        if (!map.TryGetValue(key, out list))
        {
            list = new List<TElement>();
            map[key] = list;
            keys.Add(key);
        }
        list.Add(element);
    }

    public int Count
    {
        get { return map.Count; }
    }

    public IEnumerable<TElement> this[TKey key]
    {
        get
        {
            List<TElement> list;
            if (!map.TryGetValue(key, out list))
            {
                return Enumerable.Empty<TElement>();
            }
            return list.Select(x => x);
        }
    }

    public bool Contains(TKey key)
    {
        return map.ContainsKey(key);
    }

    public IEnumerator<IGrouping<TKey, TElement>> GetEnumerator()
    {
        return keys.Select(key => new Grouping<TKey, TElement>(key, map[key]))
                   .GetEnumerator();
    }
        
    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

All of this is really quite straightforward. Note that we provide an equality comparer to the constructor, which is then passed onto the dictionary – that’s the only thing that needs to know how to compare keys.

There are only two points of interest, really:

  • In the indexer, we don’t return the list itself – that would allow the caller to mutate it, after casting it to List<TElement>. Instead, we just call Select with an identity projection, as a simple way of inserting a sort of "buffering" layer between the list and the caller. There are other ways of doing this, of course – including implementing the indexer with an iterator block.
  • In GetEnumerator, we’re retaining the key order by using our list of keys and performing a lookup on each key.

We’re currently creating the new Grouping objects lazily – which will lead to fewer of them being created if the caller doesn’t actually iterate over the lookup, but more of them if the caller iterates over it several times. Again, there are alternatives here – but without any good information about where to make the trade-off, I’ve just gone for the simplest code which works for the moment.

One last thing to note about Lookup – I’ve left it internal. In .NET, it’s actually public – but the only way of getting at an instance of it is to call ToLookup and then cast the result. I see no particular reason to make it public, so I haven’t.

Now we’re finally ready to implement the last ToLookup overload – and it becomes pretty much trivial:

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    if (elementSelector == null)
    {
        throw new ArgumentNullException("elementSelector");
    }

    Lookup<TKey, TElement> lookup = new Lookup<TKey, TElement>(comparer ?? EqualityComparer<TKey>.Default);

    foreach (TSource item in source)
    {
        TKey key = keySelector(item);
        TElement element = elementSelector(item);
        lookup.Add(key, element);
    }
    return lookup;
}

When you look past the argument validation, we’re just creating a Lookup, populating it, and then returning it. Simple.

Thread safety

Something I haven’t addressed anywhere so far is thead safety. In particular, although all of this is nicely immutable when viewed from a single thread, I have a nasty feeling that theoretically, if the return value of our implementation of ToLookup was exposed to another thread, it could potentially observe the internal mutations we’re making here, as we’re not doing anything special in terms of the memory model here.

I’m basically scared by lock-free programming these days, unless I’m using building blocks provided by someone else. While investigating the exact guarantees offered here would be interesting, I don’t think it would really help with our understanding of LINQ as a whole. I’m therefore declaring thread safety to be out of scope for Edulinq in general :)

Conclusion

So that’s ToLookup. Two new interfaces, two new classes, all for one new operator… so far. We can reuse almost all of this in Join though, which will make it very simple to implement. Stay tuned…

Addendum

It turns out that I missed something quite important: ToLookup has to handle null keys, as do various other LINQ operators (GroupBy etc). We’re currently using a Dictionary<TKey, TValue> to organize the groups… and that doesn’t support null keys. Oops.

So, first steps: write some tests proving that it fails as we expect it to. Fetch by a null key of a lookup. Include a null key in the source of a lookup. Use GroupBy, GroupJoin and Join with a null key. Watch it go bang. That’s the easy part…

Now, we can do all the special-casing in Lookup itself – but it gets ugly. Our Lookup code was pretty simple before; it seems a shame to spoil it with checks everywhere. What we really need is a dictionary which does support null keys. Step forward NullKeyFriendlyDictionary, a new internal class in Edulinq. Now you might expect this to implement IDictionary<TKey, TValue>, but it turns out that’s a pain in the neck. We hardly use any of the members of Dictionary – TryGetValue, the indexer, ContainsKey, and the Count property. That’s it! So those are the only members I’ve implemented.

The class contains a Dictionary<TKey, TValue> to delegate most requests to, and it just handles null keys itself. Here’s a quick sample:

internal sealed class NullKeyFriendlyDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> map;
    private bool haveNullKey = false;
    private TValue valueForNullKey;

    internal NullKeyFriendlyDictionary(IEqualityComparer<TKey> comparer)
    {
        map = new Dictionary<TKey, TValue>(comparer);
    }

    internal bool TryGetValue(TKey key, out TValue value)
    {
        if (key == null)
        {
            // This will be default(TValue) if haveNullKey is false,
            // which is what we want.
            value = valueForNullKey;
            return haveNullKey;
        }
        return map.TryGetValue(key, out value);
    }

    // etc
}

There’s one potential flaw here, that I can think of: if you provide an IEqualityComparer<TKey> which treats some non-null key as equal to a null key, we won’t spot that. If your source then contains both those keys, we’ll end up splitting them into two groups instead of keeping them together. I’m not too worried about that – and I suspect there are all kinds of ways that could cause problems elsewhere anyway.

With this in place, and Lookup adjusted to use NullKeyFriendlyDictionary instead of just Dictionary, all the tests pass. Hooray!

At the same time as implementing this, I’ve tidied up Grouping itself – it now implements IList<T> itself, in a way which is immutable to the outside world. The Lookup now contains groups directly, and can return them very easily. The code is generally tidier, and anything using a group can take advantage of the optimizations applied to IList<T> – particularly the Count() operator, which is often applied to groups.

10 thoughts on “Reimplementing LINQ to Objects: Part 18 – ToLookup”

  1. A casual look says you’re perfectly safe since you don’t mutate after making the object public. The build, freeze, publish model is pretty axiomatic to threaded programming.

    Like

  2. @Simon: It’s not quite as simple as that though, because there are no explicit memory fences between the final mutate and the publish. At least, that’s my understanding of it…

    @configurator: Yes, I could do that. I suppose that would then make it quicker if someone only wanted the count from the lookup, too. I don’t know what LINQ to Objects does here. I could create a new grouping of course – that would be perfectly okay.

    Like

  3. @skeet: Damn, keep forgetting about fences, since at least on x86 (and therefore AMD64) you are safe since they will flush all the per core caches on any write specifically because of this pattern. Even so, I would say the client should be doing the fence since they are the one publishing. I’m not sure what the current theory on that is though – perhaps a fully immutable API could have more course to do that?

    Like

  4. I was actually going for readability – new ReadOnlyCollection(x) is much more readable than x.Select(x => x) and much clearer in what it’s trying to do.

    Like

    1. Where are you talking about returning from? I’m not sure what you mean. If you mean the Select(x =&gt; x), we could definitely use AsReadOnly.

      And I definitely want to keep an additional list, as I need to be able to return the entries in the original order in which I encountered the keys.

      Like

      1. Yes, I meant that part.
        I would return AsReadOnly() also to prevent not materialized enumerator to be returned.

        Having the requirement to persist order dwdinitelt makes sense here then. Thanks for pointing it out!

        Like

Leave a comment