Reimplementing LINQ to Objects: Part 25 – ToDictionary

This one ended up being genuinely easy to implement, although it needed lots of tests to cover different situations.

What is it?

ToDictionary has four overloads which look remarkably similar to the ones we used for ToLookup:

public static Dictionary<TKey, TSource> ToDictionary<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static Dictionary<TKey, TElement> ToDictionary<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector)

public static Dictionary<TKey, TSource> ToDictionary<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
        
public static Dictionary<TKey, TElement> ToDictionary<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)

Yes, it’s the ever-popular combination of an optional key comparer and an optional element selector when mapping values.

ToDictionary does exactly what it says on the tin: it converts the source sequence to a dictionary. Each source item is mapped to a key, and then either the source item itself is used as the value or the element selector is applied to obtain one. A custom equality comparer can be used to create a dictionary which is (for example) case-insensitive when fetching keys.
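
As a quick illustration (this example is mine, not from the original post), here's the case-insensitive variant in action:

string[] names = { "Joe", "Holly", "Tom" };
var map = names.ToDictionary(name => name, StringComparer.OrdinalIgnoreCase);

Console.WriteLine(map["JOE"]); // Prints "Joe" - lookups ignore case
// Note: adding "JOE" to the source would make ToDictionary throw an
// ArgumentException, as "Joe" and "JOE" are equal keys under this comparer.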

The main difference between ToDictionary and ToLookup is that a dictionary can only store a single value per key, rather than a whole sequence of values. Of course, another difference is that Dictionary<TKey, TValue> is a mutable concrete class, whereas ILookup<TKey, TElement> is an interface usually implemented in an immutable fashion.

The normal sort of note:

  • ToDictionary uses immediate execution: it reads the entire source sequence, and once it has returned, the dictionary is independent of the source
  • source, keySelector and elementSelector mustn’t be null
  • comparer can be null, in which case the default equality comparer for the key type will be used
  • No key can appear more than once (this includes non-identical but equal occurrences); attempting to add a duplicate key causes an ArgumentException
  • Null keys are prohibited, causing an ArgumentNullException. (This is slightly odd, as the ToDictionary method itself doesn’t have a null argument at this point… really it should be a NullValueDerivedFromAnArgumentSomehowException, but that doesn’t exist)
  • Null values are allowed

Just a word on usage: one place I’ve seen ToDictionary used to reasonable effect is when the source is an IEnumerable<KeyValuePair<TKey, TValue>>, often from another dictionary. This lets you project all the values of a dictionary, or perhaps "trim" the dictionary in some ways… creating a new dictionary rather than modifying the existing one, of course. For example, here’s a way of creating a new Dictionary<string, Person> from an existing one, but only keeping the entries which are adults:

var adults = nameToPersonMap.Where(pair => pair.Value.Age >= 18)
                            .ToDictionary(pair => pair.Key, pair => pair.Value);

Of course, this is just one use – it can be a pretty handy operator, to be frank.

What are we going to test?

Just the "null argument" validation tests gives us ten tests to start with, then:

  • A test for each overload, always mapping a sequence of strings to a map using the first letter as the key; sometimes as a char and sometimes as a string (so we can easily use a case-insensitive comparer)
  • A test for null keys (ArgumentNullException)
  • A test for null values (no exception)
  • A test for duplicate keys (ArgumentException)
  • A test for a null comparer (uses the default)

Let’s implement it!

In my implementation, the first three overloads all call the last – just using a default equality comparer or an identity element selector where necessary. Then after the argument validation, the implementation is simple:

public static Dictionary<TKey, TElement> ToDictionary<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    // Argument validation elided
    Dictionary<TKey, TElement> ret = new Dictionary<TKey, TElement>(comparer ?? EqualityComparer<TKey>.Default);
    foreach (TSource item in source)
    {
        ret.Add(keySelector(item), elementSelector(item));
    }
    return ret;
}

There’s just one thing to be careful of – we have to use the Add method instead of the indexer to add entries to the dictionary. The Add method deals with duplicate and null keys in exactly the way we want, whereas the indexer would overwrite the existing entry if we had a duplicate key.
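
To illustrate the difference (again, a sketch of my own rather than code from the post):

var map = new Dictionary<string, int>();
map["key"] = 1;
map["key"] = 2;    // Fine: the indexer silently overwrites, so the value is now 2
map.Add("key", 3); // Throws ArgumentException: the key is already present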

There’s one thing my implementation doesn’t do which it could if we really wanted – it doesn’t optimize the case where we know the size beforehand, because the source implements ICollection<T>. We could use the Dictionary constructor which takes a capacity in that case, like this:

comparer = comparer ?? EqualityComparer<TKey>.Default;
ICollection<TSource> list = source as ICollection<TSource>;
var ret = list == null ? new Dictionary<TKey, TElement>(comparer)
                       : new Dictionary<TKey, TElement>(list.Count, comparer);

After all, we know the eventual size of the dictionary will be exactly the count of the input sequence, in the success case.

Is this worth it? I’m not sure. It’s not a huge change, admittedly… I’m quite tempted to put it in. Again, we’d have to benchmark it to see what difference it made (and I’d benchmark the .NET implementation too, of course) but a unit test won’t show it up. Thoughts welcome.

EDIT: I’ve now edited the code to include this optimization. I haven’t benchmarked it though…

Conclusion

Gah – even when I thought I had this wrapped up before starting to write the post, that final bit of potential optimization still appeared out of nowhere.

Idle thought: if we all had to write up our code in blog posts as we went along, we might discover some interesting alternative designs and implementations.

That’s it for tonight, and I may not have any time for posts tomorrow – we’ll see.

Reimplementing LINQ to Objects: Part 24 – ToArray

This really is a simple one. So simple I might even find the energy to implement ToDictionary as well tonight… we’ll see. (EDIT: Oops. It became slightly less simple in the end, as I came up with the third implementation. Oh well.)

What is it?

ToArray is basically the equivalent of ToList, but producing an array instead of a list. It has a single signature:

public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)

Just to recap:

  • It’s another extension method, which is useful when we want to use type inference.
  • It uses immediate execution – nothing is deferred, and the entire sequence is read before the method completes.
  • It performs simple argument validation: source can’t be null
  • It creates an independent but shallow copy of the collection (so any changes to the objects referenced by source will be visible in the result, but not changes to source itself – and vice versa)
  • It’s optimized for ICollection<T>, although in a slightly different way to ToList.

What are we going to test?

Exactly the same as we did for ToList, basically – with one bit of care required. In our test for the source and result being independent of each other, we can’t just create a new variable of type List<T> and call ToArray on it – because that would call the implementation in List<T> itself. I reckoned the easiest way of making this clear was to call the method directly as a normal static method, instead of using it as an extension method:

[Test]
public void ResultIsIndependentOfSource()
{
    List<string> source = new List<string> { "xyz", "abc" };
    // Make it obvious we’re not calling List<T>.ToArray
    string[] result = Enumerable.ToArray(source);
    result.AssertSequenceEqual("xyz", "abc");

    // Change the source: result won’t have changed
    source[0] = "xxx";
    Assert.AreEqual("xyz", result[0]);

    // And the reverse
    result[1] = "yyy";
    Assert.AreEqual("abc", source[1]);
}

One other interesting facet of the testing is that we can only partially test the optimization for ICollection<T>. We can make sure that it’s optimized as far as not using the iterator (as per the previous test), but there’s more optimization coming, and I haven’t worked out how to test for that yet. You’ll see what I mean when we come to implement it. Speaking of which…

Let’s implement it!

We can make all our tests pass with a cheat: whatever the input type, convert it to a List<T> and then call ToArray on the result. Heck, if we call ToList to do the initial conversion, that will even do the argument validation for us and use the right parameter name in the exception:

// Implementation by Mr Cheaty MacCheaterson
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    return source.ToList().ToArray();
}

Now I’m not averse to the general approach taken here – but there’s actually a bit more optimization we can do.

Remember how ICollection<T> exposes Count and CopyTo, which the List<T> constructor uses when the input implements ICollection<T>? Well, that means that building a List<T> is relatively cheap for a collection – but calling ToArray on the list will still mean copying all the data out again (as List<T>.ToArray can’t just return its own internal array – it has to create a copy). We can use exactly the same members ourselves, and avoid one level of copying:

// Somewhat better… though still not quite ideal
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
    {
        TSource[] ret = new TSource[collection.Count];
        collection.CopyTo(ret, 0);
        return ret;
    }
    return new List<TSource>(source).ToArray();
}

That’s pretty good now – except it still involves a copy from the List<T> into a new array every time. That’s almost always appropriate, in fact… because unless the resulting list happened to expand its array to exactly the right size, we’d need to make a copy anyway. After all, we can’t return an array that’s too big. However, we can optimize for the "just the right size" case if we basically implement List<T>’s array expansion ourselves, leading to this:

public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    // Optimize for ICollection<T>
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
    {
        TSource[] tmp = new TSource[collection.Count];
        collection.CopyTo(tmp, 0);
        return tmp;
    }
            
    // We’ll have to loop through, creating and copying arrays as we go
    TSource[] ret = new TSource[16];
    int count = 0;
    foreach (TSource item in source)
    {
        // Need to expand…
        if (count == ret.Length)
        {
            Array.Resize(ref ret, ret.Length * 2);
        }
        ret[count++] = item;
    }

    // Now create another copy if we have to, in order to get an array of the
    // right size
    if (count != ret.Length)
    {
        Array.Resize(ref ret, count);
    }
    return ret;
}

Is this level of optimization worth it? Probably not. I picked the starting size of 16 out of thin air (or dimly recalled initial counts for some collection or other – possibly Java’s ArrayList<T>). Maybe we should triple the capacity rather than double it, just for laughs. It’s all guesswork, really. The middle implementation feels like a more appropriate one to me, but with an obvious optimization just itching to be implemented, I thought I might as well provide the code. It’s reasonably obviously correct, but it’s just a bit longwinded for the marginal benefit over our second attempt.

It’s even more annoying that I can’t think of a way to test this easily – I could benchmark it of course, but that’s not the same as unit testing it… I can’t easily prove that we’re optimizing either the ICollection<T> or "correct guess at size" cases.
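
For what it's worth, a crude benchmark along these lines would at least show whether the optimizations matter - this is a sketch of my own using Stopwatch, not anything from the Edulinq source:

// Requires using directives for System.Diagnostics and System.Linq.
// The variable is typed as ICollection<int> so that .ToArray() resolves to the
// extension method rather than List<T>.ToArray.
ICollection<int> source = Enumerable.Range(0, 1000000).ToList();

Stopwatch stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 100; i++)
{
    source.ToArray();
}
stopwatch.Stop();
Console.WriteLine("100 calls to ToArray: {0}ms", stopwatch.ElapsedMilliseconds);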

Conclusion

It’s always interesting to see what else springs to mind when I’m writing up the operator as opposed to just implementing it in Visual Studio. I’d got as far as the second implementation but not the third when I started this post.

It’s possible that in the end, the 4-overload ToDictionary operator will actually end up being simpler than ToArray. Who’d have thought?

Reimplementing LINQ to Objects: Part 23 – Take/Skip/TakeWhile/SkipWhile

I genuinely expected these operators to be simple. At the time of this writing, I’m onto my third implementation of SkipWhile, struggling to find code which obviously expresses what’s going on. I find it interesting how the simplest sounding operators sometimes end up being trickier to implement than one might think.

What are they?

Take, TakeWhile, Skip and SkipWhile form a natural group of operators. Together they are known as the partitioning operators. Here are the signatures:

public static IEnumerable<TSource> Take<TSource>(
    this IEnumerable<TSource> source,
    int count)

public static IEnumerable<TSource> TakeWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static IEnumerable<TSource> TakeWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)

public static IEnumerable<TSource> Skip<TSource>(
    this IEnumerable<TSource> source,
    int count)

public static IEnumerable<TSource> SkipWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static IEnumerable<TSource> SkipWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)

The behaviour is reasonably self-explanatory:

  • source.Take(n) returns a sequence of the first n elements of source
  • source.Skip(n) returns a sequence containing all but the first n elements of source
  • source.TakeWhile(predicate) returns a sequence of the first elements of source which match the given predicate; it stops as soon as the predicate fails to match
  • source.SkipWhile(predicate) returns a sequence of all but the first elements of source which match the given predicate; it starts yielding results as soon as the predicate fails to match

The only reason there are two overloads for TakeWhile and SkipWhile is so that you can provide a predicate which uses the index within the sequence, just like the overloads for Select and Where.
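
For instance, here's a quick sketch of the indexed overload in action (my example, reusing the test data described below):

string[] words = { "zero", "one", "two", "three", "four", "five" };
// Take words while each one is longer than its index: "four" (length 4) fails
// at index 4, so the result is "zero", "one", "two", "three"
var query = words.TakeWhile((word, index) => word.Length > index);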

Each operator effectively partitions the input sequence into two parts, and yields either the first or second part. So for any "normal" collection (which returns the same results each time you iterate over it) the following are true:

  • source.Take(n).Concat(source.Skip(n)) is equivalent to source
  • source.TakeWhile(predicate).Concat(source.SkipWhile(predicate)) is equivalent to source (as noted in the comments, this also assumes the predicate behaves the same way each time it’s given the same data, and has no side-effects)

A few notes on general behaviour:

  • The arguments are validated eagerly: source and predicate can’t be null
  • count can be negative (in which case it’s equivalent to 0) and can be larger than the collection too. Basically any value is valid.
  • Execution is deferred.
  • The source sequence is streamed – there’s no need to buffer anything.

What are we going to test?

One nice thing about the properties I described earlier is that it’s very easy to convert tests for Take/TakeWhile into tests for Skip/SkipWhile: you just change the expected sequence in Skip to be everything in the source sequence which wasn’t in the test for Take. For Take/Skip I’ve just used Enumerable.Range(0, 5) (i.e. 0, 1, 2, 3, 4) as the source sequence; for TakeWhile/SkipWhile I’ve used { "zero", "one", "two", "three", "four", "five" } – so each element is the word corresponding to the index. That makes it reasonably easy to write tests with a predicate involving the index.

I’ve tested:

  • Argument validation
  • Deferred execution
  • For Skip/Take:
    • A negative count
    • A count of 0
    • A count to split the sequence into non-empty parts
    • A count exactly equal to the source length (5)
    • A count larger than the source length
  • For SkipWhile/TakeWhile (both overloads in each case):
    • A predicate which fails on the first element
    • A predicate which matches some elements but not others
    • A predicate which matches all elements

The tests aren’t generally interesting, but as it can be slightly tricky to get your head round TakeWhile and SkipWhile to start with, here’s an example of each:

// In TakeWhileTest
[Test]
public void PredicateMatchingSomeElements()
{
    string[] source = { "zero", "one", "two", "three", "four", "five" };
    source.TakeWhile(x => x.Length < 5).AssertSequenceEqual("zero", "one", "two");
}

// In SkipWhileTest
[Test]
public void PredicateMatchingSomeElements()
{
    string[] source = { "zero", "one", "two", "three", "four", "five" };
    source.SkipWhile(x => x.Length < 5).AssertSequenceEqual("three", "four", "five");
}

There’s been some discussion on Twitter about the best kind of test here, with suggestions of varying the data instead of the predicate, and using sequences of Boolean values. Personally I prefer the tests the way they are – I find this sequence of strings easy to work with, and the fact that it is a sequence of strings means you can’t get the index/value parameter order wrong when providing a predicate which uses the indexes, unlike if you use an integer sequence. A Boolean sequence isn’t clear enough for me – the result isn’t obviously the first or second part of a sequence, whereas with these words, it’s obvious what’s going on… to me, at least. It’s all a matter of personal preference though, and it’s good to think about alternatives.

Let’s implement them!

Let me present this in chronological order. I started off by implementing Take and Skip, as they sounded the simplest. Indeed, they’re not too bad – here are the iterator block parts, after argument validation in the public method, of course:

private static IEnumerable<TSource> TakeImpl<TSource>(
    this IEnumerable<TSource> source,
    int count)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        for (int i = 0; i < count && iterator.MoveNext(); i++)
        {
            yield return iterator.Current;
        }
    }
}

private static IEnumerable<TSource> SkipImpl<TSource>(
    this IEnumerable<TSource> source,
    int count)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        for (int i = 0; i < count; i++)
        {
            if (!iterator.MoveNext())
            {
                yield break;
            }
        }
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}

These really aren’t too bad, although it’s interesting to note that Skip is more complicated than Take: Skip has to deal with both sections of the sequence (the "ignore" part and the "yield" part) whereas Take can return as soon as it’s finished with the "yield" part. We’ll see the same pattern in TakeWhile/SkipWhile.

Speaking of which, let’s look at them. There’s a decision to make in terms of whether to implement the overload whose predicate doesn’t use the index in terms of the one which does. It would certainly be simple, but I don’t like the double level of indirection with one delegate calling another. It may well be irrational, given that I’ve been introducing identity delegates elsewhere – but it just didn’t feel right. I’ve added it into the source code using conditional compilation just for completeness, but it’s not the default implementation.

Now, the "Impl" bits of TakeWhile really weren’t too bad:

private static IEnumerable<TSource> TakeWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    foreach (TSource item in source)
    {
        if (!predicate(item))
        {
            yield break;
        }
        yield return item;
    }
}

private static IEnumerable<TSource> TakeWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)
{
    int index = 0;
    foreach (TSource item in source)
    {
        if (!predicate(item, index))
        {
            yield break;
        }
        index++;
        yield return item;
    }
}

Again, we don’t need to worry about the second part of the sequence – the part we’re ignoring. We can just let the method complete, and the result iterator terminate. In particular, we don’t care about the value that caused the predicate to return false. Now let’s look at SkipWhile (again, just the interesting bits):

private static IEnumerable<TSource> SkipWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        while (iterator.MoveNext())
        {
            TSource item = iterator.Current;
            if (!predicate(item))
            {
                // Stop skipping now, and yield this item
                yield return item;
                break;
            }
        }
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}

private static IEnumerable<TSource> SkipWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        int index = 0;
        while (iterator.MoveNext())
        {
            TSource item = iterator.Current;
            if (!predicate(item, index))
            {
                // Stop skipping now, and yield this item
                yield return item;
                break;
            }
            index++;
        }
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}

The above code is my third attempt. You can look at the first two in the history of SkipWhile.cs – you may find one of those preferable. Other suggestions have been made, including ones using do/while… I must admit, the do/while option ends up being quite attractive, but again I have a general aversion to it as a construct. Part of the trickiness is that having found an element which makes the predicate return false, we have to yield that element before we move on to yielding the rest. It’s easy enough to do, but the extra yield return statement feels ugly.

One point to note is that if the "skipping" part finishes due to a call to MoveNext returning false (rather than the predicate failing) then we’ll call MoveNext again (at the start of the "yielding" loop). That’s explicitly called out as being okay in the IEnumerator.MoveNext() documentation, so I’m happy enough to use it.

I’m certainly not happy about those implementations, but I’m confident they at least work. Now, let’s revisit Skip and Take. Is there any way we can simplify them? Absolutely – all we need to do is use TakeWhile or SkipWhile with a predicate which uses the index and compares it to the count we’ve been given:

public static IEnumerable<TSource> Take<TSource>(
    this IEnumerable<TSource> source,
    int count)
{
    // Buggy! See addendum…
    return source.TakeWhile((x, index) => index < count);
}

public static IEnumerable<TSource> Skip<TSource>(
    this IEnumerable<TSource> source,
    int count)
{
    return source.SkipWhile((x, index) => index < count);
}

TakeWhile and SkipWhile even do the appropriate argument validation for us. As you can see, I’ve become less averse to implementing one operator using another over time… both implementations are still in the source, using conditional compilation, but currently this short form is the one which is "active". I could be persuaded to change my mind :)

An optimized Skip?

Although most of these operations can’t be sensibly optimized, it would make sense to optimize Skip when the source implements IList<T>. We can skip the skipping, so to speak, and go straight to the appropriate index. This wouldn’t spot the case where the source was modified between iterations, which may be one reason it’s not implemented in the framework as far as I’m aware. The optimization would look something like this:

var list = source as IList<TSource>;
if (list != null)
{
    // Make sure we don’t use a negative index
    count = Math.Max(count, 0);
    // Note that "count" is the count of items to skip
    for (int index = count; index < list.Count; index++)
    {
        yield return list[index];
    }
    yield break;
}

Should Edulinq do this? Answers on a postcard…

EDIT: There’s another optimization we can potentially make, even earlier. If we’re given a count which is zero or negative, Skip is effectively pointless – so we can just return the original reference. This would come just after the argument validation, not in the iterator block:

if (count <= 0)
{
    return source;
}

Returning the original reference has some implications in terms of not isolating the return value from the original sequence, but it’s worth considering. We could potentially do the same for Take, in the case where we know it’s a list and the requested count is greater than or equal to the size of the list.
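
That Take counterpart might look something like this - my extrapolation rather than anything from the post, with the same caveat about not isolating the result (and again, it would live after the argument validation, not in the iterator block):

ICollection<TSource> collection = source as ICollection<TSource>;
if (collection != null && count >= collection.Count)
{
    // We'd be yielding every element anyway, so just return the source itself
    return source;
}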

Using Skip and Take together

It’s worth mentioning the most common use of Skip and Take, as well as an easy-to-make mistake. Skip and Take are ideal for paging. For example:

var pageItems = allItems.Skip(pageNumber * itemsPerPage)
                        .Take(itemsPerPage);

This simply skips as many items as it needs to, and only "takes" one page-worth of data afterwards. Very simple… but also easy to get wrong.

In many cases with LINQ, you can reorder operations with no harm – or with a compile-time error if you get them the wrong way round. Let’s consider what happens if we get it wrong this time though:

// Bug! Bug! Bug!
var pageItems = allItems.Take(itemsPerPage)
                        .Skip(pageNumber * itemsPerPage);

This will work for the first page, but after that you’ll always end up displaying an empty page: it will take the first page-worth of data, and then skip it all, leaving nothing to return.

Conclusion

That was more work than I expected it to be. Still, we’ve actually not got very many LINQ operators still to cover. By my reckoning, we still have the following operators to implement:

  • Sum, Average, Min, Max – I’m not looking forward to these, as they all have lots of overloads, and are likely to be tedious to test and implement.
  • Cast and OfType
  • ToArray and ToDictionary
  • ElementAt and ElementAtOrDefault – I should possibly have covered these along with First/Single/Last etc
  • SequenceEqual
  • Zip (from .NET 4)
  • Contains
  • OrderBy/OrderByDescending/ThenBy/ThenByDescending (probably the most interesting remaining operators)
  • Reverse

That’s not too bad a list. I’ll probably try to knock off ToArray tonight – it should be pretty simple.

Addendum

It turns out that my "simplification" for Take introduced a bug. If we only need to take two items, we should know that after we’ve taken the second one… we don’t really need to ask the iterator to move to the third item, just to decide that we’ve finished. Doing so could have side effects – such as causing an exception to be thrown. That’s exactly what we do when we use TakeWhile – we don’t care what the first value after the stopping point is, but we do try to fetch it. Oops.

Fortunately the Mono tests picked this up for me, so I’ve added my own test for it, and gone back to the original implementation of Take which just used a counter. Both my own tests and Mono’s now pass :)
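
For reference, a test along these lines would catch the bug - this is my sketch of the idea, not necessarily the exact test from Mono or Edulinq:

[Test]
public void OnlyEnumerateTheGivenNumberOfElements()
{
    int[] source = { 1, 2, 0 };
    // If Take probes for a third element, the lazy projection will
    // divide by zero; the counter-based implementation never gets that far.
    source.Select(x => 10 / x)
          .Take(2)
          .AssertSequenceEqual(10, 5);
}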

Reimplementing LINQ to Objects: Part 22 – GroupJoin

Another operator that was decidedly easy to implement – partly as I was able to just adapt everything from Join, including the tests. 15 minutes for tests + implementation… and no doubt considerably longer writing it up.

What is it?

After the complexity of GroupBy, GroupJoin has a refreshingly small list of overloads:

public static IEnumerable<TResult> GroupJoin<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, IEnumerable<TInner>, TResult> resultSelector)

public static IEnumerable<TResult> GroupJoin<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, IEnumerable<TInner>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

We can essentially ignore the first overload in the normal way – it just uses the default equality comparer for TKey. (You’ll be pleased to hear that I think we’re coming towards the end of the operators which use equality comparers, so I won’t need to repeat that sort of sentence many more times.)

The best way to get to grips with what GroupJoin does is to think of Join. There, the overall idea was that we looked through the "outer" input sequence, found all the matching items from the "inner" sequence (based on a key projection on each sequence) and then yielded pairs of matching elements. GroupJoin is similar, except that instead of yielding pairs of elements, it yields a single result for each "outer" item based on that item and the sequence of matching "inner" items.

The fact that a result is yielded for every outer item is important: even if there are no matches, a result is still returned, just with an empty sequence. I’ll come back to that later, thinking about "left joins".

In terms of deferred execution, input consumption etc, GroupJoin behaves very much like Join:

  • Arguments are validated eagerly: everything other than comparer has to be non-null
  • It uses deferred execution
  • When you start iterating over the result, the "inner" sequence is immediately read all the way through
  • The "outer" sequence is streamed: each time you read an item from the result sequence, one item from the outer sequence is read

What are we going to test?

To be absolutely honest, I cribbed the tests from Join, and just adapted the resultSelector and the expected results. It’s simple enough to do: I was already using string concatenation to piece together the two items (outer and inner) for the Join tests; for GroupJoin I’ve just modified that to use string.Join (not the same as Enumerable.Join!) to build a semicolon-delimited string of all the inner matches for that particular outer element.

Here’s a sample test, in both the Join and GroupJoin versions:

[Test]
public void SimpleJoin()
{
    // We’re going to join on the first character in the outer sequence item
    // being equal to the second character in the inner sequence item
    string[] outer = { "first", "second", "third" };
    string[] inner = { "essence", "offer", "eating", "psalm" };

    var query = outer.Join(inner,
                           outerElement => outerElement[0],
                           innerElement => innerElement[1],
                           (outerElement, innerElement) => outerElement + ":" + innerElement);

    // Note: no matches for "third"
    query.AssertSequenceEqual("first:offer", "second:essence", "second:psalm");
}

[Test]
public void SimpleGroupJoin()
{
    // We’re going to join on the first character in the outer sequence item
    // being equal to the second character in the inner sequence item
    string[] outer = { "first", "second", "third" };
    string[] inner = { "essence", "offer", "eating", "psalm" };

    var query = outer.GroupJoin(inner,
                           outerElement => outerElement[0],
                           innerElement => innerElement[1],
                           (outerElement, innerElements) => outerElement + ":" + string.Join(";", innerElements));

    query.AssertSequenceEqual("first:offer", "second:essence;psalm", "third:");
}

Note how the match sequence for "first" contained only a single element and the match sequence for "second" contains two elements. There were no matches for "third", but we still get a result, just with an empty match sequence.

Let’s implement it!

I was able to copy the implementation from Join as well, pretty much – I only had to simplify it to yield a single value instead of using a foreach loop.

Obviously the comparer-free overload calls into the one with the comparer, which then validates the arguments and calls into an iterator block method. If you’ve read all the rest of the posts in this series, you don’t need to see that yet again – and if you haven’t, just go back and look at some. So, what does the iterator block do? It uses a lookup again. It builds a lookup from the inner sequence, and then we just have to iterate over the outer sequence, applying resultSelector to each "outer element / matching inner element sequence" pair:

private static IEnumerable<TResult> GroupJoinImpl<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, IEnumerable<TInner>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = inner.ToLookup(innerKeySelector, comparer);
    foreach (var outerElement in outer)
    {
        var key = outerKeySelector(outerElement);
        yield return resultSelector(outerElement, lookup[key]);
    }
}

Yup, Lookup makes things so simple it almost hurts… and that’s without even trying hard. I’m sure we could pay some more attention to the design of Lookup and Grouping if we really wanted to. (I may revisit them later… they’re clearly working for us at the moment.)

Query expressions and left joins

C# has direct support for GroupJoin when you use the "join … into" syntax. Note that despite the use of "into" this is not a query continuation. Instead, it’s just a way of giving a name to all the matches in the join. Here’s an example, finding all the words of a given length:

int[] outer = { 5, 3, 7 };
string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

var query = from x in outer
            join y in inner on x equals y.Length into matches
            select x + ":" + string.Join(";", matches);

query.AssertSequenceEqual("5:tiger", "3:bee;cat;dog", "7:giraffe");

Note that whereas in a simple "join" syntax, the "y" range variable would be available for the rest of the query, this time it’s only "matches" that’s available – basically "y" is only used to express the key selector (y.Length in this case).

Now think about left joins (or left outer joins) from SQL. From the linked Wikipedia page:

The result of a left outer join (or simply left join) for table A and B always contains all records of the "left" table (A), even if the join-condition does not find any matching record in the "right" table (B). This means that if the ON clause matches 0 (zero) records in B, the join will still return a row in the result—but with NULL in each column from B.

So, how can we mimic this? Well, we could just use GroupJoin directly, and deal with the fact that if there are no matches, the match sequence will be empty. That’s not quite like a left join though – it doesn’t capture the idea of getting a null value for the unmatched result. We can use DefaultIfEmpty for that quite easily though. Here’s an example adapted from the one above, where I introduce 4 into the outer sequence. There are no animals with 4 letters in the inner list, so for that item we end up with no values. I’ve used DefaultIfEmpty to make sure that we do get a value – and just to make it show up, I’ve specified a default value of the string literal "null" rather than just a null reference. Here’s the test, including the results:

int[] outer = { 5, 3, 4, 7 };
string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

var query = from x in outer
            join y in inner on x equals y.Length into matches
            select x + ":" + string.Join(";", matches.DefaultIfEmpty("null"));

query.AssertSequenceEqual("5:tiger", "3:bee;cat;dog", "4:null", "7:giraffe");

Okay, that’s getting closer… but we still need to deal with a whole sequence of results at a time. A normal left join still gives pairs of results… can we mimic that? Absolutely – just use GroupJoin and then SelectMany, represented in query expressions as a second "from" clause:

int[] outer = { 5, 3, 4, 7 };
string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

var query = from x in outer
            join y in inner on x equals y.Length into matches
            from z in matches.DefaultIfEmpty("null")
            select x + ":" + z;
query.AssertSequenceEqual("5:tiger", "3:bee", "3:cat", "3:dog", "4:null", "7:giraffe");

That’s now genuinely close to what a SQL left join would do.

Conclusion

That’s it for joins, I believe. I’m so glad I implemented ToLookup first – it made everything else trivial. Next up, I think we’ll look at Take/TakeWhile/Skip/SkipWhile, which should be pleasantly simple.

Reimplementing LINQ to Objects: Part 21 – GroupBy

Okay, after the brief hiatus earlier, we’re ready to group sequences.

What is it?

GroupBy has eight overloads, using up to four type parameters and up to five normal parameters. Those of a sensitive disposition may wish to turn away now:

public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector)

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TKey, IEnumerable<TSource>, TResult> resultSelector)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TKey, IEnumerable<TSource>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector)
        
public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

Okay, you can look back now. All the nastiness will go away soon, I promise. You see, this is yet another operator with overloads which effectively fill in defaults… although there’s a twist here. Here are the various parameters and their meanings:

  • source: the input sequence, as ever, of element type TSource.
  • keySelector: a delegate to apply to each element in the input sequence to obtain a key (type TKey). The key is what decides which group the element is associated with.
  • elementSelector (optional): a delegate to apply to each source element to obtain the value (of type TElement) which should be part of the relevant group. If it’s not present, you can think of it as being the identity conversion.
  • resultSelector (optional): a delegate to apply to each grouping to produce a final result (type TResult). See below for the difference between overloads which specify this and those which don’t.
  • comparer (optional): an equality comparer used to compare keys; groups are formed by equal keys. If unspecified or null, the default equality comparer for TKey is used.

Now, there’s only one tricky bit here: resultSelector. In overloads where this is present, the result is type IEnumerable<TResult>; otherwise it’s IEnumerable<IGrouping<TKey, TElement>>. That would make perfect sense if the type of resultSelector were Func<IGrouping<TKey, TElement>, TResult> – that would just make it effectively default to an identity conversion. However, the type is actually Func<TKey, IEnumerable<TElement>, TResult>. There’s not a lot of difference between the two logically: basically you’ve got a key and a sequence of elements in both cases – but the disparity reduces the elegance of both the design and the implementation in my view. In fact, it’s not wholly clear to me why the overloads with a resultSelector are required in the first place – you’ll see why later.
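
To make the relationship concrete, an overload with a resultSelector behaves just like the corresponding overload without one, followed by a Select - an illustrative sketch using the parameter names from the signatures above, which is exactly what the implementation later in this post does:

// These two queries are equivalent:
var withResultSelector = source.GroupBy(keySelector, elementSelector, resultSelector, comparer);
var withSelect = source.GroupBy(keySelector, elementSelector, comparer)
                       .Select(group => resultSelector(group.Key, group));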

Anyway, the basic operation should be reasonably familiar by now. We’re going to look at each input value in turn, and map it to a key and a group element. We’ll usually return a sequence of groups, where each group consists of the elements produced by input values with an equal key. If we’ve specified a resultSelector, that delegate will be presented with each key and all the members of the group associated with that key, and the final result sequence will consist of the return values of the delegate.

General behaviour:

  • The operator is implemented with deferred execution, but it’s not as deferred as some other operators. More below.
  • All arguments other than comparer must not be null; this is validated eagerly.
  • The source is read completely as soon as you start iterating over the results.

Basically this performs exactly the same steps as ToLookup, except:

  • GroupBy uses deferred execution. Aside from anything else, this means that if you change the contents of "source" later and iterate over the results again, you’ll see the changes (whereas the lookup returned by ToLookup is effectively independent of the source)
  • The return value is always a sequence (whether it’s a sequence of groups or the result of calling resultSelector). This means you can’t perform arbitrary lookups on it later.

I’ve already quoted from the documentation when it comes to the ordering of the groups and the elements within the groups, but for completeness, here it is again:

The IGrouping<TKey, TElement> objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping<TKey, TElement>. Elements in a grouping are yielded in the order they appear in source.

When a resultSelector is present, the last sentence should be reinterpreted to consider the IEnumerable<TElement> sequence presented to resultSelector.

What are we going to test?

First let’s talk about deferred execution. GroupBy does use deferred execution – but it’s not as deferred as usual. For most operators (such as Select) you can call GetEnumerator() on the result and it still won’t actually start iterating over the input sequence until you call MoveNext() on the result’s iterator. In GroupBy – at least in the "proper" implementation – it’s the call to GetEnumerator() that does all the work.

In other words, my usual test like this wouldn’t work:

// This won’t work for GroupBy in LINQ to Objects
using (new ThrowingEnumerable().GroupBy(x => x).GetEnumerator())
{
    // No exception
}

In fact, I would probably argue that on reflection, my existing tests which check that execution is deferred until the first call to MoveNext() are probably overzealous. For GroupBy, the tests are relatively liberal – they will work whether it’s GetEnumerator() or MoveNext() which starts iterating over the input sequence. This is handy, as my implementation is currently slightly more deferred than that of LINQ to Objects :) Here are the relevant tests:

[Test]
public void ExecutionIsPartiallyDeferred()
{
    // No exception yet…
    new ThrowingEnumerable().GroupBy(x => x);
    // Note that for LINQ to Objects, calling GetEnumerator() starts iterating
    // over the input sequence, so we’re not testing that…
}

[Test]
public void SequenceIsReadFullyBeforeFirstResultReturned()
{
    int[] numbers = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 };
    // Final projection will throw
    var query = numbers.Select(x => 10 / x);

    var groups = query.GroupBy(x => x);
    Assert.Throws<DivideByZeroException>(() =>
    {
        using (var iterator = groups.GetEnumerator())
        {
            iterator.MoveNext();
        }
    });
}

I also have one test for each of the four overloads which doesn’t specify a comparer. I haven’t got any tests to ensure that the comparer is used properly – I felt they would be more effort than they were worth.

Here’s the test for the most complicated overload, just to show how all the bits fit together:

[Test]
public void GroupByWithElementProjectionAndCollectionProjection()
{
    string[] source = { "abc", "hello", "def", "there", "four" };
    // This time "values" will be an IEnumerable<char>, the first character of each
    // source string contributing to the group
    var groups = source.GroupBy(x => x.Length,
                                x => x[0],
                                (key, values) => key + ":" + string.Join(";", values));

    groups.AssertSequenceEqual("3:a;d", "5:h;t", "4:f");
}

Let’s look at the parameters involved:

  • source: just an array of strings
  • keySelector: takes a string and maps it to its length. We will end up with three groups, as we have strings of length 3 ("abc" and "def"), 5 ("hello" and "there") and 4 ("four")
  • elementSelector: takes a string and maps it to its first character.
  • resultSelector: takes a key (the length) and a sequence of elements (the first characters of all the input strings of that length) and returns a string joining them all together

So the result for the first group is "3:a;d" – 3 is the key, ‘a’ and ‘d’ are the first letters of "abc" and "def", and they’re joined together according to the resultSelector.

Finally, I also have a test demonstrating the use of GroupBy in a query expression, showing that these two queries are equivalent:

string[] source = { "abc", "hello", "def", "there", "four" };

var query1 = source.GroupBy(x => x.Length, x => x[0]);

var query2 = from x in source
             group x[0] by x.Length;

Okay, enough of the tests… what about the real code?

Let’s implement it!

First, some interesting line counts:

  • Lines in public method signatures: 36
  • Lines implementing argument validation: 16
  • Lines just delegating from one public overload to another: 6
  • Lines of actual implementation: 7

Sad, isn’t it? Still, there are things worth talking about.

Firstly, let’s reduce the eight overloads to two: the two that both have elementSelector and comparer specified. The six overloads which miss one or other of those parameters can simply be implemented by providing an identity conversion or the default key comparer. So that leaves us with this:

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

Now normally I like to implement a method with fewer parameters by calling one with more parameters… but there’s a slight problem in this case. We would have to provide a resultSelector which took a key and a sequence of elements, and created a grouping from them. We could do that just by reusing our Grouping class… but there seems little reason to do that. In particular, suppose we implement it the other way round: given a grouping, we can easily extract the key (via the Key property) and the grouping itself is a sequence of the elements. In other words, we can implement the more complex method in terms of the simpler one:

public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    if (resultSelector == null)
    {
        throw new ArgumentNullException("resultSelector");
    }
    // Let the other GroupBy overload do the rest of the argument validation
    return source.GroupBy(keySelector, elementSelector, comparer)
                 .Select(group => resultSelector(group.Key, group));
}

The difference may seem somewhat arbitrary to start with, until we provide the final overload that we’re delegating to. Surprise surprise, we can use ToLookup yet again:

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    if (elementSelector == null)
    {
        throw new ArgumentNullException("elementSelector");
    }
    return GroupByImpl(source, keySelector, elementSelector, comparer ?? EqualityComparer<TKey>.Default);
}

private static IEnumerable<IGrouping<TKey, TElement>> GroupByImpl<TSource, TKey, TElement>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = source.ToLookup(keySelector, elementSelector, comparer);
    foreach (var result in lookup)
    {
        yield return result;
    }
}

You see, ILookup<TKey, TElement> already implements IEnumerable<IGrouping<TKey, TValue>> so we have no extra work to do. If we had implemented the two overloads the other way round by calling "new Grouping(key, sequence)" as a resultSelector, we’d have ended up with two "wrapper" layers of Grouping when we only need one. (Alternatively, we could have cast straight back to IGrouping, but that would have felt somewhat wrong, as well as requiring an execution-time check.)

Note that again, we could have used "yield foreach" to great effect if it had been available…

Now do you see why I believe this would have been more elegant if the type of resultSelector had been Func<IGrouping<TKey, TElement>, TResult> instead? That way we could have just treated resultSelector as another optional parameter with a "default" value of the identity conversion… and all 7 simpler overloads could have delegated straight to the most complex one. Oh well.
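
Under that hypothetical signature - to be clear, this is not how LINQ to Objects actually looks - even the simplest overload could have delegated straight to the most complex one, with group => group as the "default" result selector:

// Hypothetical: assumes resultSelector is Func<IGrouping<TKey, TElement>, TResult>
public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)
{
    return source.GroupBy(keySelector, element => element, group => group, null);
}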

Conclusion

Now we’ve done "Join" and "GroupBy", there’s another very obvious operator to implement: GroupJoin. This (in conjunction with DefaultIfEmpty) is how the effect of a left join is usually achieved in LINQ. Let’s see if ToLookup does what we need yet again…

Reimplementing LINQ to Objects: Part 20 – ToList

This morning I started writing the tests for GroupBy, prior to implementing it. That turned out to be a pain – really I wanted an easy way of getting at each element of the result (i.e. each result group). If only I had the ability to convert an arbitrary sequence into a list… I needed ToList. So, we enter a fairly brief diversion.

What is it?

ToList has a single, simple overload:

public static List<TSource> ToList<TSource>(this IEnumerable<TSource> source)

Fairly obviously, ToList converts the source sequence into a list. Some points to note:

  • The signature specifies List<T>, not just IList<T>. Of course it could return a subclass of List<T>, but there seems little point.
  • It uses immediate execution – nothing is deferred here
  • The parameter (source) mustn’t be null
  • It’s optimized for the case when source implements ICollection<T>
  • It always creates a new, independent list.

The last two points are worth a bit more discussion. Firstly, the optimization for ICollection<T> isn’t documented, but it makes a lot of sense:

  • List<T> stores its data in an array internally
  • ICollection<T> exposes a Count property so the List<T> can create an array of exactly the right size to start with
  • ICollection<T> exposes a CopyTo method so that the List<T> can copy all the elements into the newly created array in bulk

ToList always creates a new list for consistency. If it just returned the source parameter directly if it was already a List<T>, that would mean that changes to the source after calling ToList would sometimes be visible in the returned list and sometimes not… making it harder to reason about any code which used ToList.

What are we going to test?

I have tests for the bullet points listed above, and one extra test just to prove that it can work with lazily evaluated sequences as well as simple collections like arrays. (The test uses a range followed by a projection.)
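
That extra test looks something like this - a sketch reconstructing what's described, so the real Edulinq test may differ slightly:

[Test]
public void ConversionOfLazilyEvaluatedSequence()
{
    var query = Enumerable.Range(3, 3).Select(x => x * 2);
    var list = query.ToList();
    list.AssertSequenceEqual(6, 8, 10);
}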

In order to test the optimization for ICollection<T>, I’ve implemented another collection like NonEnumerableList, but this time just NonEnumerableCollection. Again, this just delegates all ICollection<T> operations to a backing List<T>, but throws an exception if you try to call GetEnumerator(). The test then just calls ToList on a NonEnumerableCollection: as no exception is thrown, that proves that either the operation is optimized as we’d expect, or the exception is being swallowed. I think it’s reasonable to assume that exceptions aren’t swallowed in LINQ to Objects :)
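
Here's roughly what that helper looks like - my reconstruction of the class as described, so the real Edulinq source may differ in detail:

// Requires using directives for System, System.Collections and
// System.Collections.Generic.
public sealed class NonEnumerableCollection<T> : ICollection<T>
{
    private readonly List<T> backing;

    public NonEnumerableCollection(params T[] items)
    {
        backing = new List<T>(items);
    }

    // All the "real" collection operations delegate to the backing list…
    public int Count { get { return backing.Count; } }
    public bool IsReadOnly { get { return false; } }
    public void Add(T item) { backing.Add(item); }
    public void Clear() { backing.Clear(); }
    public bool Contains(T item) { return backing.Contains(item); }
    public void CopyTo(T[] array, int arrayIndex) { backing.CopyTo(array, arrayIndex); }
    public bool Remove(T item) { return backing.Remove(item); }

    // …but any attempt to iterate throws, so a passing ToList test proves
    // that the Count/CopyTo path was used instead of the iterator.
    public IEnumerator<T> GetEnumerator() { throw new NotSupportedException(); }
    IEnumerator IEnumerable.GetEnumerator() { throw new NotSupportedException(); }
}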

Let’s implement it!

This will probably be the simplest implementation in the whole of Edulinq:

public static List<TSource> ToList<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return new List<TSource>(source);
}

You may well be wondering what happened to the optimization… well, it’s in the List<T> constructor. We just get it for free. Unfortunately that’s not documented either… so we end up with an implementation which implements one undocumented optimization if List<T> implements another undocumented optimization :)

We can’t actually do much better than that – we can’t use ICollection<T>.CopyTo ourselves, as we don’t have access to the underlying array of List<T>. We could perform some optimization by calling the List<T> constructor which specifies a capacity, and then call AddRange. That would at least prevent the list from having to resize itself, but it would still need to iterate over the whole collection instead of using the (potentially very fast) CopyTo method.

Conclusion

You may be wondering why we even need ToList, if we could just create a list by calling the constructor directly. The difference is that in order to call a constructor, you need to specify the element type as the type argument. When we use ToList, we can take advantage of type inference. In many cases this is just extremely convenient, but for anonymous types it’s actually required. How can you end up with a strongly typed list of an anonymous type? It’s easy with ToList, like this:

var query = Enumerable.Range(0, 10)
                      .Select(x => new { Value = x, Doubled = x * 2 });
        
var list = query.ToList();

Try doing that without an extension method or something similar. It’s worth noting at this point that although there are similar methods for arrays and Dictionary, there’s no equivalent for HashSet. It’s incredibly easy to write, of course, and an obvious extension to LINQ to Objects – but it’s not in the standard library. Maybe for .NET 5…
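
Just to show how easy it would be, here's a sketch of such an extension, following the same validation pattern as the rest of Edulinq:

public static HashSet<TSource> ToHashSet<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return new HashSet<TSource>(source);
}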

So, now that we’ve got ToList sorted, I can get back to GroupBy and its eight overloads – easy to implement, but hard to test simply and hard to describe clearly. Lucky me.