Category Archives: C#

Reimplementing LINQ to Objects: Part 25 – ToDictionary

This one ended up being genuinely easy to implement, although with lots of tests for different situations.

What is it?

ToDictionary has four overloads which look remarkably similar to the ones we used for ToLookup:

public static Dictionary<TKey, TSource> ToDictionary<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static Dictionary<TKey, TElement> ToDictionary<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector)

public static Dictionary<TKey, TSource> ToDictionary<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
        
public static Dictionary<TKey, TElement> ToDictionary<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)

Yes, it’s the ever-popular combination of an optional key comparer and an optional element selector when mapping values.

ToDictionary does exactly what it says on the tin: it converts the source sequence to a dictionary. Each source item is mapped to a key, and then either the source item itself is used as the value or the element selector is applied to obtain one. A custom equality comparer can be used to create a dictionary which is (for example) case-insensitive when fetching keys.
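For example, here's a quick sketch of the case-insensitive scenario (the data and names here are invented for illustration):

string[] words = { "Apple", "Banana" };
Dictionary<string, int> lengths = words.ToDictionary(
    word => word,                      // key: the word itself
    word => word.Length,               // value: its length
    StringComparer.OrdinalIgnoreCase); // custom key comparer

// The lookup ignores case, so this prints 5
Console.WriteLine(lengths["APPLE"]);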

The main difference between ToDictionary and ToLookup is that a dictionary can only store a single value per key, rather than a whole sequence of values. Of course, another difference is that Dictionary<TKey, TValue> is a mutable concrete class, whereas ILookup<TKey, TElement> is an interface usually implemented in an immutable fashion.

The normal sort of note:

  • ToDictionary uses immediate execution: it reads the entire source sequence, and once it has returned, the dictionary is independent of the source
  • source, keySelector and elementSelector mustn’t be null
  • comparer can be null, in which case the default equality comparer for the key type will be used
  • No key can appear more than once (including non-identical but equal occurrences); a duplicate causes an ArgumentException (demonstrated after this list)
  • Null keys are prohibited, causing an ArgumentNullException. (This is slightly odd, as the ToDictionary method itself doesn’t have a null argument at this point… really it should be a NullValueDerivedFromAnArgumentSomehowException, but that doesn’t exist)
  • Null values are allowed
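
Here's roughly what tests for the duplicate-key and null-key cases might look like – my sketch rather than the actual Edulinq tests, but the behaviour shown is exactly as per the notes above:

[Test]
public void DuplicateKeysThrow()
{
    // Both elements map to the key 'c'
    string[] source = { "cat", "cow" };
    Assert.Throws<ArgumentException>(() => source.ToDictionary(word => word[0]));
}

[Test]
public void NullKeysThrow()
{
    // The identity key selector yields a null key for the second element
    string[] source = { "cat", null };
    Assert.Throws<ArgumentNullException>(() => source.ToDictionary(word => word));
}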

Just a word on usage: one place I’ve seen ToDictionary used to reasonable effect is when the source is an IEnumerable<KeyValuePair<TKey, TValue>>, often from another dictionary. This lets you project all the values of a dictionary, or perhaps "trim" the dictionary in some way… creating a new dictionary rather than modifying the existing one, of course. For example, here’s a way of creating a new Dictionary<string, Person> from an existing one, but only keeping the entries which are adults:

var adults = nameToPersonMap.Where(pair => pair.Value.Age >= 18)
                            .ToDictionary(pair => pair.Key, pair => pair.Value);

Of course, this is just one use – it can be a pretty handy operator, to be frank.

What are we going to test?

Just the "null argument" validation tests gives us ten tests to start with, then:

  • A test for each overload, always mapping a sequence of strings to a map using the first letter as the key; sometimes as a char and sometimes as a string (so we can easily use a case-insensitive comparer)
  • A test for null keys (ArgumentNullException)
  • A test for null values (no exception)
  • A test for duplicate keys (ArgumentException)
  • A test for a null comparer (uses the default)

Let’s implement it!

In my implementation, the first three overloads all call the last – just using a default equality comparer or an identity element selector where necessary. Then after the argument validation, the implementation is simple:

public static Dictionary<TKey, TElement> ToDictionary<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    // Argument validation elided
    Dictionary<TKey, TElement> ret = new Dictionary<TKey, TElement>(comparer ?? EqualityComparer<TKey>.Default);
    foreach (TSource item in source)
    {
        ret.Add(keySelector(item), elementSelector(item));
    }
    return ret;
}

There’s just one thing to be careful of – we have to use the Add method instead of the indexer to add entries to the dictionary. The Add method deals with duplicate and null keys in exactly the way we want, whereas the indexer would silently overwrite the existing entry if we had a duplicate key.
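
To see the difference, consider this little sketch (invented data):

var map = new Dictionary<char, string>();
map['a'] = "first";
map['a'] = "second";    // Indexer silently overwrites: map['a'] is now "second"
map.Add('b', "third");  // Fine: 'b' is a new key
map.Add('a', "fourth"); // Throws ArgumentException: 'a' is already present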

There’s one thing my implementation doesn’t do which it could if we really wanted – it doesn’t optimize the case where we know the size beforehand, because the source implements ICollection<T>. We could use the Dictionary constructor which takes a capacity in that case, like this:

comparer = comparer ?? EqualityComparer<TKey>.Default;
ICollection<TSource> list = source as ICollection<TSource>;
var ret = list == null ? new Dictionary<TKey, TElement>(comparer)
                       : new Dictionary<TKey, TElement>(list.Count, comparer);

After all, we know the eventual size of the dictionary will be exactly the count of the input sequence, in the success case.

Is this worth it? I’m not sure. It’s not a huge change, admittedly… I’m quite tempted to put it in. Again, we’d have to benchmark it to see what difference it made (and I’d benchmark the .NET implementation too, of course) but a unit test won’t show it up. Thoughts welcome.

EDIT: I’ve now edited the code to include this optimization. I haven’t benchmarked it though…

Conclusion

Gah – even when I thought I had this wrapped up before starting to write the post, that final bit of potential optimization still appeared out of nowhere.

Idle thought: if we all had to write up our code in blog posts as we went along, we might discover some interesting alternative designs and implementations.

That’s it for tonight, and I may not have any time for posts tomorrow – we’ll see.

Reimplementing LINQ to Objects: Part 24 – ToArray

This really is a simple one. So simple I might even find the energy to implement ToDictionary as well tonight… we’ll see. (EDIT: Oops. It became slightly less simple in the end, as I came up with the third implementation. Oh well.)

What is it?

ToArray is basically the equivalent of ToList, but producing an array instead of a list. It has a single signature:

public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)

Just to recap:

  • It’s another extension method, which is useful when we want to use type inference.
  • It uses immediate execution – nothing is deferred, and the entire sequence is read before the method completes.
  • It performs simple argument validation: source can’t be null
  • It creates an independent but shallow copy of the collection (so any changes to the objects referenced by source will be visible in the result, but not changes to source itself – and vice versa)
  • It’s optimized for ICollection<T>, although in a slightly different way to ToList.

What are we going to test?

Exactly the same as we did for ToList, basically – with one bit of care required. In our test for the source and result being independent of each other, we can’t just create a new variable of type List<T> and call ToArray on it – because that would call the implementation in List<T> itself. I reckoned the easiest way of making this clear was to call the method directly as a normal static method, instead of using it as an extension method:

[Test]
public void ResultIsIndependentOfSource()
{
    List<string> source = new List<string> { "xyz", "abc" };
    // Make it obvious we’re not calling List<T>.ToArray
    string[] result = Enumerable.ToArray(source);
    result.AssertSequenceEqual("xyz", "abc");

    // Change the source: result won’t have changed
    source[0] = "xxx";
    Assert.AreEqual("xyz", result[0]);

    // And the reverse
    result[1] = "yyy";
    Assert.AreEqual("abc", source[1]);
}

One other interesting facet of the testing is that we can only partially test the optimization for ICollection<T>. We can make sure that it’s optimized as far as not using the iterator (as per the previous test), but there’s more optimization coming, and I haven’t worked out how to test for that yet. You’ll see what I mean when we come to implement it.
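
We can at least pin down the "not using the iterator" part with something like this – a sketch of mine using the NonEnumerableCollection helper described in the ToList post, which throws from GetEnumerator():

[Test]
public void ICollectionOptimizationIsUsed()
{
    // This only passes if ToArray uses Count/CopyTo rather than iterating
    var collection = new NonEnumerableCollection<string> { "hello", "there" };
    collection.ToArray().AssertSequenceEqual("hello", "there");
}

Speaking of which…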

Let’s implement it!

We can make all our tests pass with a cheat: whatever the input type, convert it to a List<T> and then call ToArray on the result. Heck, if we call ToList to do the initial conversion, that will even do the argument validation for us and use the right parameter name in the exception:

// Implementation by Mr Cheaty MacCheaterson
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    return source.ToList().ToArray();
}

Now I’m not averse to the general approach taken here – but there’s actually a bit more optimization we can do.

Remember how ICollection<T> exposes Count and CopyTo, which the List<T> constructor uses when the input implements ICollection<T>? Well, that means that building a List<T> is relatively cheap for a collection – but calling ToArray on the list will still mean copying all the data out again (as List<T>.ToArray can’t just return its own internal array – it has to create a copy). We can use exactly the same members ourselves, and avoid one level of copying:

// Somewhat better… though still not quite ideal
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
    {
        TSource[] ret = new TSource[collection.Count];
        collection.CopyTo(ret, 0);
        return ret;
    }
    return new List<TSource>(source).ToArray();
}

That’s pretty good now – except it still involves a copy from the List<T> into a new array every time. That’s almost always appropriate, in fact… because unless the resulting list happened to expand its array to exactly the right size, we’d need to make a copy anyway. After all, we can’t return an array that’s too big. However, we can optimize for the "just the right size" case if we basically implement List<T>’s array expansion ourselves, leading to this:

public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    // Optimize for ICollection<T>
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
    {
        TSource[] tmp = new TSource[collection.Count];
        collection.CopyTo(tmp, 0);
        return tmp;
    }
            
    // We’ll have to loop through, creating and copying arrays as we go
    TSource[] ret = new TSource[16];
    int count = 0;
    foreach (TSource item in source)
    {
        // Need to expand…
        if (count == ret.Length)
        {
            Array.Resize(ref ret, ret.Length * 2);
        }
        ret[count++] = item;
    }

    // Now create another copy if we have to, in order to get an array of the
    // right size
    if (count != ret.Length)
    {
        Array.Resize(ref ret, count);
    }
    return ret;
}

Is this level of optimization worth it? Probably not. I picked the starting size of 16 out of thin air (or dimly recalled initial counts for some collection or other – possibly Java’s ArrayList<T>). Maybe we should triple the capacity rather than double it, just for laughs. It’s all guesswork, really. The middle implementation feels like a more appropriate one to me, but with an obvious optimization just itching to be implemented, I thought I might as well provide the code. It’s reasonably obviously correct, but it’s just a bit longwinded for the marginal benefit over our second attempt.

It’s even more annoying that I can’t think of a way to test this easily – I could benchmark it of course, but that’s not the same as unit testing it… I can’t easily prove that we’re optimizing either the ICollection<T> or "correct guess at size" cases.

Conclusion

It’s always interesting to see what else springs to mind when I’m writing up the operator as opposed to just implementing it in Visual Studio. I’d got as far as the second implementation but not the third when I started this post.

It’s possible that in the end, the 4-overload ToDictionary operator will actually end up being simpler than ToArray. Who’d have thought?

Reimplementing LINQ to Objects: Part 23 – Take/Skip/TakeWhile/SkipWhile

I genuinely expected these operators to be simple. At the time of this writing, I’m onto my third implementation of SkipWhile, struggling to find code which obviously expresses what’s going on. I find it interesting how the simplest sounding operators sometimes end up being trickier to implement than one might think.

What are they?

Take, TakeWhile, Skip and SkipWhile form a natural group of operators. Together they are known as the partitioning operators. Here are the signatures:

public static IEnumerable<TSource> Take<TSource>(
    this IEnumerable<TSource> source,
    int count)

public static IEnumerable<TSource> TakeWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static IEnumerable<TSource> TakeWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)

public static IEnumerable<TSource> Skip<TSource>(
    this IEnumerable<TSource> source,
    int count)

public static IEnumerable<TSource> SkipWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static IEnumerable<TSource> SkipWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)

The behaviour is reasonably self-explanatory:

  • source.Take(n) returns a sequence of the first n elements of source
  • source.Skip(n) returns a sequence containing all but the first n elements of source
  • source.TakeWhile(predicate) returns a sequence of the first elements of source which match the given predicate; it stops as soon as the predicate fails to match
  • source.SkipWhile(predicate) returns a sequence of all but the first elements of source which match the given predicate; it starts yielding results as soon as the predicate fails to match

The only reason there are two overloads for TakeWhile and SkipWhile is so that you can provide a predicate which uses the index within the sequence, just like the overloads for Select and Where.

Each operator effectively partitions the input sequence into two parts, and yields either the first or second part. So for any "normal" collection (which returns the same results each time you iterate over it) the following are true:

  • source.Take(n).Concat(source.Skip(n)) is equivalent to source
  • source.TakeWhile(predicate).Concat(source.SkipWhile(predicate)) is equivalent to source (as noted in the comments, this also assumes the predicate behaves the same way each time it’s given the same data, without any side effects – see the example after this list)
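
For example, using the test helpers from this series (and a pure predicate):

var source = Enumerable.Range(0, 5);
source.Take(2).Concat(source.Skip(2)).AssertSequenceEqual(0, 1, 2, 3, 4);
source.TakeWhile(x => x < 3).Concat(source.SkipWhile(x => x < 3))
      .AssertSequenceEqual(0, 1, 2, 3, 4);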

A few notes on general behaviour:

  • The arguments are validated eagerly: source and predicate can’t be null
  • count can be negative (in which case it’s equivalent to 0) and can be larger than the collection too. Basically any value is valid (illustrated after this list).
  • Execution is deferred.
  • The source sequence is streamed – there’s no need to buffer anything.
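
A few examples of those edge cases (my sketch, using AssertSequenceEqual with no arguments to assert an empty sequence):

var source = Enumerable.Range(0, 5);
source.Take(-1).AssertSequenceEqual();               // negative count: nothing
source.Skip(-1).AssertSequenceEqual(0, 1, 2, 3, 4);  // negative count: everything
source.Take(100).AssertSequenceEqual(0, 1, 2, 3, 4); // count beyond the end: everything
source.Skip(100).AssertSequenceEqual();              // count beyond the end: nothing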

What are we going to test?

One nice thing about the properties I described earlier is that it’s very easy to convert tests for Take/TakeWhile into tests for Skip/SkipWhile: you just change the expected sequence in Skip to be everything in the source sequence which wasn’t in the test for Take. For Take/Skip I’ve just used Enumerable.Range(0, 5) (i.e. 0, 1, 2, 3, 4) as the source sequence; for TakeWhile/SkipWhile I’ve used { "zero", "one", "two", "three", "four", "five" } – so each element is the word corresponding to the index. That makes it reasonably easy to write tests with a predicate involving the index.

I’ve tested:

  • Argument validation
  • Deferred execution
  • For Skip/Take:
    • A negative count
    • A count of 0
    • A count to split the sequence into non-empty parts
    • A count exactly equal to the source length (5)
    • A count larger than the source length
  • For SkipWhile/TakeWhile (both overloads in each case):
    • A predicate which fails on the first element
    • A predicate which matches some elements but not others
    • A predicate which matches all elements

The tests aren’t generally interesting, but as it can be slightly tricky to get your head round TakeWhile and SkipWhile to start with, here’s an example of each:

// In TakeWhileTest
[Test]
public void PredicateMatchingSomeElements()
{
    string[] source = { "zero", "one", "two", "three", "four", "five" };
    source.TakeWhile(x => x.Length < 5).AssertSequenceEqual("zero", "one", "two");
}

// In SkipWhileTest
[Test]
public void PredicateMatchingSomeElements()
{
    string[] source = { "zero", "one", "two", "three", "four", "five" };
    source.SkipWhile(x => x.Length < 5).AssertSequenceEqual("three", "four", "five");
}

There’s been some discussion on Twitter about the best kind of test here, with suggestions of varying the data instead of the predicate, and using sequences of Boolean values. Personally I prefer the tests the way they are – I find this sequence of strings easy to work with, and the fact that it is a sequence of strings means you can’t get the index/value parameter order wrong when providing a predicate which uses the indexes, unlike if you use an integer sequence. A Boolean sequence isn’t clear enough for me – the result isn’t obviously the first or second part of a sequence, whereas with these words, it’s obvious what’s going on… to me, at least. It’s all a matter of personal preference though, and it’s good to think about alternatives.

Let’s implement them!

Let me present this in chronological order. I started off by implementing Take and Skip, as they sounded the simplest. Indeed, they’re not too bad – here are the iterator block parts, after argument validation in the public method, of course:

private static IEnumerable<TSource> TakeImpl<TSource>(
    this IEnumerable<TSource> source,
    int count)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        for (int i = 0; i < count && iterator.MoveNext(); i++)
        {
            yield return iterator.Current;
        }
    }
}

private static IEnumerable<TSource> SkipImpl<TSource>(
    this IEnumerable<TSource> source,
    int count)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        for (int i = 0; i < count; i++)
        {
            if (!iterator.MoveNext())
            {
                yield break;
            }
        }
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}

These really aren’t too bad, although it’s interesting to note that Skip is more complicated than Take: Skip has to deal with both sections of the sequence (the "ignore" part and the "yield" part) whereas Take can return as soon as it’s finished with the "yield" part. We’ll see the same pattern in TakeWhile/SkipWhile.

Speaking of which, let’s look at them. There’s a decision to make in terms of whether to implement the overload whose predicate doesn’t use the index in terms of the one which does. It would certainly be simple, but I don’t like the double level of indirection with one delegate calling another. It may well be irrational, given that I’ve been introducing identity delegates elsewhere – but it just didn’t feel right. I’ve added it into the source code using conditional compilation just for completeness, but it’s not the default implementation.
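
For reference, here's roughly what that conditionally-compiled version looks like for the simpler TakeWhile overload (a sketch; the real source may differ in details):

public static IEnumerable<TSource> TakeWhile<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Validate eagerly: the overload we delegate to can only check its own predicate
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    // One delegate invoking another – the indirection I'm not keen on
    return source.TakeWhile((value, index) => predicate(value));
}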

Now, the "Impl" bits of TakeWhile really weren’t too bad:

private static IEnumerable<TSource> TakeWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    foreach (TSource item in source)
    {
        if (!predicate(item))
        {
            yield break;
        }
        yield return item;
    }
}

private static IEnumerable<TSource> TakeWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)
{
    int index = 0;
    foreach (TSource item in source)
    {
        if (!predicate(item, index))
        {
            yield break;
        }
        index++;
        yield return item;
    }
}

Again, we don’t need to worry about the second part of the sequence – the part we’re ignoring. We can just let the method complete, and the result iterator terminate. In particular, we don’t care about the value that caused the predicate to return false. Now let’s look at SkipWhile (again, just the interesting bits):

private static IEnumerable<TSource> SkipWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        while (iterator.MoveNext())
        {
            TSource item = iterator.Current;
            if (!predicate(item))
            {
                // Stop skipping now, and yield this item
                yield return item;
                break;
            }
        }
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}

private static IEnumerable<TSource> SkipWhileImpl<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        int index = 0;
        while (iterator.MoveNext())
        {
            TSource item = iterator.Current;
            if (!predicate(item, index))
            {
                // Stop skipping now, and yield this item
                yield return item;
                break;
            }
            index++;
        }
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}

The above code is my third attempt. You can look at the first two in the history of SkipWhile.cs – you may find one of those preferable. Other suggestions have been made, including ones using do/while… I must admit, the do/while option ends up being quite attractive, but again I have a general aversion to it as a construct. Part of the trickiness is that having found an element which makes the predicate return false, we have to yield that element before we move on to yielding the rest. It’s easy enough to do, but the extra yield return statement feels ugly.

One point to note is that if the "skipping" part finishes due to a call to MoveNext returning false (rather than the predicate failing) then we’ll call MoveNext again (at the start of the "yielding" loop). That’s explicitly called out as being okay in the IEnumerator.MoveNext() documentation, so I’m happy enough to use it.

I’m certainly not happy about those implementations, but I’m confident they at least work. Now, let’s revisit Skip and Take. Is there any way we can simplify them? Absolutely – all we need to do is use TakeWhile or SkipWhile with a predicate which uses the index and compares it to the count we’ve been given:

public static IEnumerable<TSource> Take<TSource>(
    this IEnumerable<TSource> source,
    int count)
{
    // Buggy! See addendum…
    return source.TakeWhile((x, index) => index < count);
}

public static IEnumerable<TSource> Skip<TSource>(
    this IEnumerable<TSource> source,
    int count)
{
    return source.SkipWhile((x, index) => index < count);
}

TakeWhile and SkipWhile even do the appropriate argument validation for us. As you can see, I’ve become less averse to implementing one operator using another over time… both implementations are still in the source, using conditional compilation, but currently this short form is the one which is "active". I could be persuaded to change my mind :)

An optimized Skip?

Although most of these operations can’t be sensibly optimized, it would make sense to optimize Skip when the source implements IList<T>. We can skip the skipping, so to speak, and go straight to the appropriate index. This wouldn’t spot the case where the source was modified between iterations, which may be one reason it’s not implemented in the framework as far as I’m aware. The optimization would look something like this:

var list = source as IList<TSource>;
if (list != null)
{
    // Make sure we don’t use a negative index
    count = Math.Max(count, 0);
    // Note that "count" is the count of items to skip
    for (int index = count; index < list.Count; index++)
    {
        yield return list[index];
    }
    yield break;
}

Should Edulinq do this? Answers on a postcard…

EDIT: There’s another optimization we can potentially make, even earlier. If we’re given a count which is zero or negative, Skip is effectively pointless – so we can just return the original reference. This would come just after the argument validation, not in the iterator block:

if (count <= 0)
{
    return source;
}

Returning the original reference has some implications in terms of not isolating the return value from the original sequence, but it’s worth considering. We could potentially do the same for Take, in the case where we know it’s a list and the requested count is greater than or equal to the size of the list.
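
The equivalent for Take might look something like this – hypothetical code, with the same caveat about not isolating the result from the source:

// After argument validation, before the iterator block
var list = source as ICollection<TSource>;
if (list != null && count >= list.Count)
{
    // We've been asked for at least as many items as the list contains,
    // so the result is effectively the whole sequence
    return source;
}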

Using Skip and Take together

It’s worth mentioning the most common use of Skip and Take, as well as an easy-to-make mistake. Skip and Take are ideal for paging. For example:

var pageItems = allItems.Skip(pageNumber * itemsPerPage)
                        .Take(itemsPerPage);

This simply skips as many items as it needs to, and only "takes" one page-worth of data afterwards. Very simple… but also easy to get wrong.

In many cases with LINQ, you can reorder operations with no harm – or with a compile-time error if you get them the wrong way round. Let’s consider what happens if we get it wrong this time though:

// Bug! Bug! Bug!
var pageItems = allItems.Take(itemsPerPage)
                        .Skip(pageNumber * itemsPerPage);

This will work for the first page, but after that you’ll always end up displaying an empty page: it will take the first page-worth of data, and then skip it all, leaving nothing to return.

Conclusion

That was more work than I expected it to be. Still, we’ve actually not got very many LINQ operators still to cover. By my reckoning, we still have the following operators to implement:

  • Sum, Average, Min, Max – I’m not looking forward to these, as they all have lots of overloads, and are likely to be tedious to test and implement.
  • Cast and OfType
  • ToArray and ToDictionary
  • ElementAt and ElementAtOrDefault – I should possibly have covered these along with First/Single/Last etc
  • SequenceEqual
  • Zip (from .NET 4)
  • Contains
  • OrderBy/OrderByDescending/ThenBy/ThenByDescending (probably the most interesting remaining operators)
  • Reverse

That’s not too bad a list. I’ll probably try to knock off ToArray tonight – it should be pretty simple.

Addendum

It turns out that my "simplification" for Take introduced a bug. If we only need to take two items, we should know that after we’ve taken the second one… we don’t really need to ask the iterator to move to the third item, just to decide that we’ve finished. Doing so could have side effects – such as causing an exception to be thrown. That’s exactly what we do when we use TakeWhile – we don’t care what the first value after the stopping point is, but we do try to fetch it. Oops.

Fortunately the Mono tests picked this up for me, so I’ve added my own test for it, and gone back to the original implementation of Take which just used a counter. Both my own tests and Mono’s now pass :)

Reimplementing LINQ to Objects: Part 22 – GroupJoin

Another operator that was decidedly easy to implement – partly as I was able to just adapt everything from Join, including the tests. 15 minutes for tests + implementation… and no doubt considerably longer writing it up.

What is it?

After the complexity of GroupBy, GroupJoin has a refreshingly small list of overloads:

public static IEnumerable<TResult> GroupJoin<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, IEnumerable<TInner>, TResult> resultSelector)

public static IEnumerable<TResult> GroupJoin<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, IEnumerable<TInner>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

We can essentially ignore the first overload in the normal way – it just uses the default equality comparer for TKey. (You’ll be pleased to hear that I think we’re coming towards the end of the operators which use equality comparers, so I won’t need to repeat that sort of sentence many more times.)

The best way to get to grips with what GroupJoin does is to think of Join. There, the overall idea was that we looked through the "outer" input sequence, found all the matching items from the "inner" sequence (based on a key projection on each sequence) and then yielded pairs of matching elements. GroupJoin is similar, except that instead of yielding pairs of elements, it yields a single result for each "outer" item based on that item and the sequence of matching "inner" items.

The fact that a result is yielded for every outer item is important: even if there are no matches, a result is still returned, just with an empty sequence. I’ll come back to that later, thinking about "left joins".

In terms of deferred execution, input consumption etc, GroupJoin behaves very much like Join:

  • Arguments are validated eagerly: everything other than comparer has to be non-null
  • It uses deferred execution
  • When you start iterating over the result, the "inner" sequence is immediately read all the way through
  • The "outer" sequence is streamed: each time you read an item from the result sequence, one item from the outer sequence is read

What are we going to test?

To be absolutely honest, I cribbed the tests from Join, and just adapted the resultSelector and the expected results. It’s simple enough to do: I was already using string concatenation to piece together the two items (outer and inner) for the Join tests; for GroupJoin I’ve just modified that to use string.Join (not the same as Enumerable.Join!) to build a semicolon-delimited string of all the inner matches for that particular outer element.

Here’s a sample test, in both the Join and GroupJoin versions:

[Test]
public void SimpleJoin()
{
    // We’re going to join on the first character in the outer sequence item
    // being equal to the second character in the inner sequence item
    string[] outer = { "first", "second", "third" };
    string[] inner = { "essence", "offer", "eating", "psalm" };

    var query = outer.Join(inner,
                           outerElement => outerElement[0],
                           innerElement => innerElement[1],
                           (outerElement, innerElement) => outerElement + ":" + innerElement);

    // Note: no matches for "third"
    query.AssertSequenceEqual("first:offer", "second:essence", "second:psalm");
}

[Test]
public void SimpleGroupJoin()
{
    // We’re going to join on the first character in the outer sequence item
    // being equal to the second character in the inner sequence item
    string[] outer = { "first", "second", "third" };
    string[] inner = { "essence", "offer", "eating", "psalm" };

    var query = outer.GroupJoin(inner,
                           outerElement => outerElement[0],
                           innerElement => innerElement[1],
                           (outerElement, innerElements) => outerElement + ":" + string.Join(";", innerElements));

    query.AssertSequenceEqual("first:offer", "second:essence;psalm", "third:");
}

Note how the match sequence for "first" contains only a single element and the match sequence for "second" contains two elements. There were no matches for "third", but we still get a result, just with an empty match sequence.

Let’s implement it!

I was able to copy the implementation from Join as well, pretty much – I only had to simplify it to yield a single value instead of using a foreach loop.

Obviously the comparer-free overload calls into the one with the comparer, which then validates the arguments and calls into an iterator block method. If you’ve read all the rest of the posts in this series, you don’t need to see that yet again – and if you haven’t, just go back and look at some. So, what does the iterator block do? It uses a lookup again. It builds a lookup from the inner sequence, and then we just have to iterate over the outer sequence, applying resultSelector to each "outer element / matching inner element sequence" pair:

private static IEnumerable<TResult> GroupJoinImpl<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, IEnumerable<TInner>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = inner.ToLookup(innerKeySelector, comparer);
    foreach (var outerElement in outer)
    {
        var key = outerKeySelector(outerElement);
        yield return resultSelector(outerElement, lookup[key]);
    }
}

Yup, Lookup makes things so simple it almost hurts… and that’s without even trying hard. I’m sure we could pay some more attention to the design of Lookup and Grouping if we really wanted to. (I may revisit them later… they’re clearly working for us at the moment.)

Query expressions and left joins

C# has direct support for GroupJoin when you use the "join … into" syntax. Note that despite the use of "into" this is not a query continuation. Instead, it’s just a way of giving a name to all the matches in the join. Here’s an example, finding all the words of a given length:

int[] outer = { 5, 3, 7 };
string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

var query = from x in outer
            join y in inner on x equals y.Length into matches
            select x + ":" + string.Join(";", matches);

query.AssertSequenceEqual("5:tiger", "3:bee;cat;dog", "7:giraffe");

Note that whereas in a simple "join" syntax, the "y" range variable would be available for the rest of the query, this time it’s only "matches" that’s available – basically "y" is only used to express the key selector (y.Length in this case).

Now think about left joins (or left outer joins) from SQL. From the linked Wikipedia page:

The result of a left outer join (or simply left join) for table A and B always contains all records of the "left" table (A), even if the join-condition does not find any matching record in the "right" table (B). This means that if the ON clause matches 0 (zero) records in B, the join will still return a row in the result—but with NULL in each column from B.

So, how can we mimic this? Well, we could just use GroupJoin directly, and deal with the fact that if there are no matches, the match sequence will be empty. That’s not quite like a left join though – it doesn’t capture the idea of getting a null value for the unmatched result. We can use DefaultIfEmpty for that quite easily though. Here’s an example adapted from the one above, where I introduce 4 into the outer sequence. There are no animals with 4 letters in the inner list, so for that item we end up with no values. I’ve used DefaultIfEmpty to make sure that we do get a value – and just to make it show up, I’ve specified a default value of the string literal "null" rather than just a null reference. Here’s the test, including the results:

int[] outer = { 5, 3, 4, 7 };
string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

var query = from x in outer
            join y in inner on x equals y.Length into matches
            select x + ":" + string.Join(";", matches.DefaultIfEmpty("null"));

query.AssertSequenceEqual("5:tiger", "3:bee;cat;dog", "4:null", "7:giraffe");

Okay, that’s getting closer… but we still need to deal with a whole sequence of results at a time. A normal left join still gives pairs of results… can we mimic that? Absolutely – just use GroupJoin and then SelectMany, represented in query expressions as a second "from" clause:

int[] outer = { 5, 3, 4, 7 };
string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

var query = from x in outer
            join y in inner on x equals y.Length into matches
            from z in matches.DefaultIfEmpty("null")
            select x + ":" + z;
query.AssertSequenceEqual("5:tiger", "3:bee", "3:cat", "3:dog", "4:null", "7:giraffe");

That’s now genuinely close to what a SQL left join would do.

Conclusion

That’s it for joins, I believe. I’m so glad I implemented ToLookup first – it made everything else trivial. Next up, I think we’ll look at Take/TakeWhile/Skip/SkipWhile, which should be pleasantly simple.

Reimplementing LINQ to Objects: Part 21 – GroupBy

Okay, after the brief hiatus earlier, we’re ready to group sequences.

What is it?

GroupBy has eight overloads, using up to four type parameters and up to five normal parameters. Those of a sensitive disposition may wish to turn away now:

public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector)

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TKey, IEnumerable<TSource>, TResult> resultSelector)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TKey, IEnumerable<TSource>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector)
        
public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

Okay, you can look back now. All the nastiness will go away soon, I promise. You see, this is yet another operator with overloads which effectively fill in defaults… although there’s a twist here. Here are the various parameters and their meanings:

  • source: the input sequence, as ever, of element type TSource.
  • keySelector: a delegate to apply to each element in the input sequence to obtain a key (type TKey). The key is what decides which group the element is associated with.
  • elementSelector (optional): a delegate to apply to each element to obtain the value (type TElement) which should be part of the relevant group. If it’s not present, you can think of it as being the identity conversion.
  • resultSelector (optional): a delegate to apply to each grouping to produce a final result (type TResult). See below for the difference between overloads which specify this and those which don’t.
  • comparer (optional): an equality comparer used to compare keys; groups are formed by equal keys. If unspecified or null, the default equality comparer for TKey is used.

Now, there’s only one tricky bit here: resultSelector. In overloads where this is present, the result is type IEnumerable<TResult>; otherwise it’s IEnumerable<IGrouping<TKey, TElement>>. That would make perfect sense if the type of resultSelector were Func<IGrouping<TKey, TElement>, TResult> – that would just make it effectively default to an identity conversion. However, the type is actually Func<TKey, IEnumerable<TElement>, TResult>. There’s not a lot of difference between the two logically: basically you’ve got a key and a sequence of elements in both cases – but the disparity reduces the elegance of both the design and the implementation in my view. In fact, it’s not wholly clear to me why the overloads with a resultSelector are required in the first place – you’ll see why later.

Anyway, the basic operation should be reasonably familiar by now. We’re going to look at each input value in turn, and map it to a key and a group element. We’ll usually return a sequence of groups, where each group consists of the elements produced by input values with an equal key. If we’ve specified a resultSelector, that delegate will be presented with each key and all the members of the group associated with that key, and the final result sequence will consist of the return values of the delegate.

General behaviour:

  • The operator is implemented with deferred execution, but it’s not as deferred as some other operators. More below.
  • All arguments other than comparer must not be null; this is validated eagerly.
  • The source is read completely as soon as you start iterating over the results.

Basically this performs exactly the same steps as ToLookup, except:

  • GroupBy uses deferred execution. Aside from anything else, this means that if you change the contents of "source" later and iterate over the results again, you’ll see the changes (whereas the lookup returned by ToLookup is effectively independent of the source)
  • The return value is always a sequence (whether it’s a sequence of groups or the result of calling resultSelector). This means you can’t perform arbitrary lookups on it later.

I’ve already quoted from the documentation when it comes to the ordering of the groups and the elements within the groups, but for completeness, here it is again:

The IGrouping<TKey, TElement> objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping<TKey, TElement>. Elements in a grouping are yielded in the order they appear in source.

When a resultSelector is present, the last sentence should be reinterpreted to consider the IEnumerable<TElement> sequence presented to resultSelector.

What are we going to test?

First let’s talk about deferred execution. GroupBy does use deferred execution – but it’s not as deferred as usual. For most operators (such as Select) you can call GetEnumerator() on the result and it still won’t actually start iterating over the input sequence until you call MoveNext() on the result’s iterator. In GroupBy – at least in the "proper" implementation – it’s the call to GetEnumerator() that does all the work.

In other words, my usual test like this wouldn’t work:

// This won’t work for GroupBy in LINQ to Objects
using (new ThrowingEnumerable().GroupBy(x => x).GetEnumerator())
{
    // No exception
}

In fact, I would probably argue that on reflection, my existing tests which check that execution is deferred until the first call to MoveNext() are probably overzealous. For GroupBy, the tests are relatively liberal – they will work whether it’s GetEnumerator() or MoveNext() which starts iterating over the input sequence. This is handy, as my implementation is currently slightly more deferred than that of LINQ to Objects :) Here are the relevant tests:

[Test]
public void ExecutionIsPartiallyDeferred()
{
    // No exception yet…
    new ThrowingEnumerable().GroupBy(x => x);
    // Note that for LINQ to Objects, calling GetEnumerator() starts iterating
    // over the input sequence, so we’re not testing that…
}

[Test]
public void SequenceIsReadFullyBeforeFirstResultReturned()
{
    int[] numbers = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 };
    // Final projection will throw
    var query = numbers.Select(x => 10 / x);

    var groups = query.GroupBy(x => x);
    Assert.Throws<DivideByZeroException>(() =>
    {
        using (var iterator = groups.GetEnumerator())
        {
            iterator.MoveNext();
        }
    });
}

I also have one test for each of the four overloads which don’t specify a comparer. I haven’t got any tests to ensure that the comparer is used properly – I felt they would be more effort than they were worth.

Here’s the test for the most complicated overload, just to show how all the bits fit together:

[Test]
public void GroupByWithElementProjectionAndCollectionProjection()
{
    string[] source = { "abc", "hello", "def", "there", "four" };
    // This time "values" will be an IEnumerable<char>, the first character of each
    // source string contributing to the group
    var groups = source.GroupBy(x => x.Length,
                                x => x[0],
                                (key, values) => key + ":" + string.Join(";", values));

    groups.AssertSequenceEqual("3:a;d", "5:h;t", "4:f");
}

Let’s look at the parameters involved:

  • source: just an array of strings
  • keySelector: takes a string and maps it to its length. We will end up with three groups, as we have strings of length 3 ("abc" and "def"), 5 ("hello" and "there") and 4 ("four")
  • elementSelector: takes a string and maps it to its first character.
  • resultSelector: takes a key (the length) and a sequence of elements (the first characters of all the input strings of that length) and returns a string joining them all together

So the result for the first group is "3:a;d" – 3 is the key, ‘a’ and ‘d’ are the first letters of "abc" and "def", and they’re joined together according to the resultSelector.

Finally, I also have a test demonstrating the use of GroupBy in a query expression, showing that these two queries are equivalent:

string[] source = { "abc", "hello", "def", "there", "four" };

var query1 = source.GroupBy(x => x.Length, x => x[0]);

var query2 = from x in source
             group x[0] by x.Length;

Okay, enough of the tests… what about the real code?

Let’s implement it!

First, some interesting line counts:

  • Lines in public method signatures: 36
  • Lines implementing argument validation: 16
  • Lines just delegating from one public overload to another: 6
  • Lines of actual implementation: 7

Sad, isn’t it? Still, there are things worth talking about.

Firstly, let’s reduce the eight overloads to two: the two that both have elementSelector and comparer specified. The six overloads which miss one or other of those parameters can simply be implemented by providing an identity conversion or the default key comparer. So that leaves us with this:

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)

public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

Now normally I like to implement a method with fewer parameters by calling one with more parameters… but there’s a slight problem in this case. We would have to provide a resultSelector which took a key and a sequence of elements, and created a grouping from them. We could do that just by reusing our Grouping class… but there seems little reason to do that. In particular, suppose we implement it the other way round: given a grouping, we can easily extract the key (via the Key property) and the grouping itself is a sequence of the elements. In other words, we can implement the more complex method in terms of the simpler one:

public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    if (resultSelector == null)
    {
        throw new ArgumentNullException("resultSelector");
    }
    // Let the other GroupBy overload do the rest of the argument validation
    return source.GroupBy(keySelector, elementSelector, comparer)
                 .Select(group => resultSelector(group.Key, group));
}

The difference may seem somewhat arbitrary to start with, until we provide the final overload that we’re delegating to. Surprise surprise, we can use ToLookup yet again:

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    if (elementSelector == null)
    {
        throw new ArgumentNullException("elementSelector");
    }
    return GroupByImpl(source, keySelector, elementSelector, comparer ?? EqualityComparer<TKey>.Default);
}

private static IEnumerable<IGrouping<TKey, TElement>> GroupByImpl<TSource, TKey, TElement>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = source.ToLookup(keySelector, elementSelector, comparer);
    foreach (var result in lookup)
    {
        yield return result;
    }
}

You see, ILookup<TKey, TElement> already implements IEnumerable<IGrouping<TKey, TValue>> so we have no extra work to do. If we had implemented the two overloads the other way round by calling "new Grouping(key, sequence)" as a resultSelector, we’d have ended up with two "wrapper" layers of Grouping when we only need one. (Alternatively, we could have cast straight back to IGrouping, but that would have felt somewhat wrong, as well as requiring an execution-time check.)

Note that again, we could have used "yield foreach" to great effect if it had been available…

Now do you see why I believe this would have been more elegant if the type of resultSelector had been Func<IGrouping<TKey, TElement>, TResult> instead? That way we could have just treated resultSelector as another optional parameter with a "default" value of the identity conversion… and all 7 simpler overloads could have delegated straight to the most complex one. Oh well.

Conclusion

Now we’ve done "Join" and "GroupBy", there’s another very obvious operator to implement: GroupJoin. This (in conjunction with DefaultIfEmpty) is how the effect of left joins are usually achieved in LINQ. Let’s see if ToLookup does what we need yet again…

Reimplementing LINQ to Objects: Part 20 – ToList

This morning I started writing the tests for GroupBy, prior to implementing it. That turned out to be a pain – really I wanted an easy way of getting at each element of the result (i.e. each result group). If only I had the ability to convert an arbitrary sequence into a list… I needed ToList. So, we enter a fairly brief diversion.

What is it?

ToList has a single, simple overload:

public static List<TSource> ToList<TSource>(this IEnumerable<TSource> source)

Fairly obviously, ToList converts the source sequence into a list. Some points to note:

  • The signature specifies List<T>, not just IList<T>. Of course it could return a subclass of List<T>, but there seems little point.
  • It uses immediate execution – nothing is deferred here
  • The parameter (source) mustn’t be null
  • It’s optimized for the case when source implements ICollection<T>
  • It always creates a new, independent list.

The last two points are worth a bit more discussion. Firstly, the optimization for ICollection<T> isn’t documented, but it makes a lot of sense:

  • List<T> stores its data in an array internally
  • ICollection<T> exposes a Count property so the List<T> can create an array of exactly the right size to start with
  • ICollection<T> exposes a CopyTo method so that the List<T> can copy all the elements into the newly created array in bulk

ToList always creates a new list for consistency. If it just returned the source parameter directly if it was already a List<T>, that would mean that changes to the source after calling ToList would sometimes be visible in the returned list and sometimes not… making it harder to reason about any code which used ToList.
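
In other words:

List<int> original = new List<int> { 1, 2, 3 };
List<int> copy = original.ToList(); // always a fresh, independent list
original.Add(4);
// original is now { 1, 2, 3, 4 }, but copy is still { 1, 2, 3 }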

What are we going to test?

I have tests for the bullet points listed above, and one extra test just to prove that it can work with lazily evaluated sequences as well as simple collections like arrays. (The test uses a range followed by a projection.)

In order to test the optimization for ICollection<T>, I’ve implemented another collection like NonEnumerableList, but this time just NonEnumerableCollection. Again, this just delegates all ICollection<T> operations to a backing List<T>, but throws an exception if you try to call GetEnumerator(). The test then just calls ToList on a NonEnumerableCollection: as no exception is thrown, that proves that either the operation is optimized as we’d expect, or the exception is being swallowed. I think it’s reasonable to assume that exceptions aren’t swallowed in LINQ to Objects :)

Let’s implement it!

This will probably be the simplest implementation in the whole of Edulinq:

public static List<TSource> ToList<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return new List<TSource>(source);
}

You may well be wondering what happened to the optimization… well, it’s in the List<T> constructor. We just get it for free. Unfortunately that’s not documented either… so we end up with an implementation which implements one undocumented optimization if List<T> implements another undocumented optimization :)

We can’t actually do much better than that – we can’t use ICollection<T>.CopyTo ourselves, as we don’t have access to the underlying array of List<T>. We could perform some optimization by calling the List<T> constructor which specifies a capacity, and then call AddRange. That would at least prevent the list from having to resize itself, but it would still need to iterate over the whole collection instead of using the (potentially very fast) CopyTo method.

Conclusion

You may be wondering why we even need ToList, if we could just create a list by calling the constructor directly. The difference is that in order to call a constructor, you need to specify the element type as the type argument. When we use ToList, we can take advantage of type inference. In many cases this is just extremely convenient, but for anonymous types it’s actually required. How can you end up with a strongly typed list of an anonymous type? It’s easy with ToList, like this:

var query = Enumerable.Range(0, 10)
                      .Select(x => new { Value = x, Doubled = x * 2 });
        
var list = query.ToList();

Try doing that without an extension method or something similar. It’s worth noting at this point that although there are similar methods for arrays and Dictionary, there’s no equivalent for HashSet. It’s incredibly easy to write, of course, and an obvious extension to LINQ to Objects – but it’s not in the standard library. Maybe for .NET 5…
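
Just to show how easy it would be, here's a sketch of such an operator – with a hypothetical name, as no such method existed in the framework at the time of writing:

public static HashSet<TSource> ToHashSet<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    // HashSet<T>'s constructor does all the real work, just like List<T>'s
    return new HashSet<TSource>(source);
}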

So, now that we’ve got ToList sorted, I can get back to GroupBy and its eight overloads – easy to implement, but hard to test simply and hard to describe clearly. Lucky me.

Reimplementing LINQ to Objects: Part 19 – Join

You might expect that as joining is such a fundamental concept in SQL, it would be a complex operation to both describe and implement in LINQ. Fortunately, nothing could be further from the truth…

What is it?

Join has a mere two overloads, with the familiar pattern of one taking a custom equality comparer and the other not:

public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector)

public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

These are daunting signatures to start with – four type parameters, and up to six normal method parameters (including the first "this" parameter indicating an extension method). Don’t panic. It’s actually quite simple to understand each individual bit:

  • We have two sequences (outer and inner). The two sequences can have different element types (TOuter and TInner).
  • For each sequence, there’s also a key selector – a projection from a single item to the key for that item. Note that although the key type can be different from the two sequence types, there’s only one key type – both key selectors have to return the same kind of key (TKey).
  • An optional equality comparer is used to compare keys.
  • A delegate (resultSelector) is used to project a pair of items whose keys are equal (one from each sequence) to the result type (TResult)

The idea is that we look through the two input sequences for pairs which correspond to equal keys, and yield one output element for each pair. This is an equijoin operation: we can only deal with equal keys, not pairs of keys which meet some arbitrary condition. It’s also an inner join in database terms – we will only see an item from one sequence if there’s a "matching" item from the other sequence. I’ll talk about mimicking left joins when I implement GroupJoin.

The documentation gives us details of the order in which we need to return items:

Join preserves the order of the elements of outer, and for each of these elements, the order of the matching elements of inner.

For the sake of clarity, it’s probably worth including a naive implementation which at least gives the right results in the right order:

// Bad implementation, only provided for the purposes of discussion
public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    // Argument validation omitted
    foreach (TOuter outerElement in outer)
    {
        TKey outerKey = outerKeySelector(outerElement);
        foreach (TInner innerElement in inner)
        {
            TKey innerKey = innerKeySelector(innerElement);
            if (comparer.Equals(outerKey, innerKey))
            {
                yield return resultSelector(outerElement, innerElement);
            }
        }
    }
}

Aside from the missing argument validation, there are two important problems with this:

  • It iterates over the inner sequence multiple times. I always advise anyone implementing a LINQ-like operator to only iterate over any input sequence once. There are sequences which are impossible to iterate over multiple times, or which may give different results each time. That’s bad news.
  • It always has a complexity of O(N * M) for N items in the inner sequence and M items in the outer sequence. Eek. Admittedly that’s the worst case for any implementation – two sequences which share the same key across all their elements must produce N * M results – but in a typical situation we can do considerably better.

The real Join operator uses the same behaviour as Except and Intersect when it comes to how the input sequences are consumed:

  • Argument validation occurs eagerly – both sequences and all the "selector" delegates have to be non-null; the comparer argument can be null, leading to the default equality comparer for TKey being used.
  • The overall operation uses deferred execution: it doesn’t iterate over either input sequence until something starts iterating over the result sequence.
  • When MoveNext is called on the result sequence for the first time, it immediately consumes the whole of the inner sequence, buffering it.
  • The outer sequence is streamed – it’s only read one element at a time. By the time the result sequence has started yielding results from the second element of outer, it’s forgotten about the first element.

We’ve started veering towards an implementation already, so let’s think about tests.

What are we going to test?

I haven’t bothered with argument validation tests this time – even with cut and paste, the 10 tests required to check everything completely feel like overkill.

However, I have tested:

  • Joining two invalid sequences, but not using the results (i.e. testing deferred execution)
  • The way that the two sequences are consumed
  • Using a custom comparer
  • Not specifying a comparer
  • Using sequences of different types

See the source code for more details, but here’s a flavour – the final test:

[Test]
public void DifferentSourceTypes()
{
    int[] outer = { 5, 3, 7 };
    string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

    var query = outer.Join(inner,
                           outerElement => outerElement,
                           innerElement => innerElement.Length,
                           (outerElement, innerElement) => outerElement + ":" + innerElement);
    query.AssertSequenceEqual("5:tiger", "3:bee", "3:cat", "3:dog", "7:giraffe");
}

To be honest, the tests aren’t very exciting. The implementation is remarkably simple though.

Let’s implement it!

Trivial decision 1: make the comparer-less overload call the one with the comparer.

Trivial decision 2: use the split-method technique to validate arguments eagerly but defer the rest of the operation.

That leaves us with the actual implementation in an iterator block, which is all I’m going to provide the code for here. So, what are we going to do?

Well, we know that we’re going to have to read in the whole of the "inner" sequence – but let’s wait a minute before deciding how to store it. We’re then going to iterate over the "outer" sequence. For each item in the outer sequence, we need to find the key and then work out all the "inner" sequence items which match that key. Now, the idea of finding all the items in a sequence which match a particular key should sound familiar to you – that’s exactly what a lookup is for. If we build a lookup from the inner sequence as our very first step, the rest becomes easy: we can fetch the sequence of matches, then iterate over them and yield the return value of calling the result selector on each pair of elements.

All of this is easier to see in code than in words, so here’s the method:

private static IEnumerable<TResult> JoinImpl<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = inner.ToLookup(innerKeySelector, comparer);
    foreach (var outerElement in outer)
    {
        var key = outerKeySelector(outerElement);
        foreach (var innerElement in lookup[key])
        {
            yield return resultSelector(outerElement, innerElement);
        }
    }
}

Personally, I think this is rather beautiful… in particular, I like the way that it uses every parameter exactly once. Everything is just set up to work nicely.

But wait, there’s more… If you look at the nested foreach loops, that should remind you of something: for each outer sequence element, we’re computing a nested sequence, then applying a delegate to each pair, and yielding the result. That’s almost exactly the definition of SelectMany! If only we had a "yield foreach" or "yield!" I’d be tempted to use an implementation like this:

private static IEnumerable<TResult> JoinImpl<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = inner.ToLookup(innerKeySelector, comparer);
    // Warning: not really valid C#
    yield foreach outer.SelectMany(outerElement => lookup[outerKeySelector(outerElement)],
                                   resultSelector);
}

Unfortunately there’s no such thing as a "yield foreach" statement. We can’t just call SelectMany and return the result directly, because then we wouldn’t be deferring execution. The best we can sensibly do is loop:

private static IEnumerable<TResult> JoinImpl<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = inner.ToLookup(innerKeySelector, comparer);
    var results = outer.SelectMany(outerElement => lookup[outerKeySelector(outerElement)],
                                   resultSelector);
    foreach (var result in results)
    {
        yield return result;
    }
}

At that stage, I think I prefer the original form – which is still pretty simple, thanks to the lookup.

While I have generally avoided using other operators in my implementations so far, in this case it feels so natural that it would be silly not to use ToLookup to implement Join.

One final point before I wrap up for the night…

Query expressions

Most of the operators I’ve been implementing recently don’t have any direct mapping in C# query expressions. Join does, however. The final test I showed before can also be written like this:

int[] outer = { 5, 3, 7 };
string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

var query = from x in outer
            join y in inner on x equals y.Length
            select x + ":" + y;

query.AssertSequenceEqual("5:tiger", "3:bee", "3:cat", "3:dog", "7:giraffe");

The compiler will effectively generate the same code (although admittedly I’ve used shorter range variable names here – x and y instead of outerElement and innerElement respectively). In this case the resultSelector delegate it supplies is simply the final projection from the "select" clause – but if we had anything else (such as a where clause) between the join and the select, the compiler would introduce a transparent identifier to propagate the values of both x and y through the query. It looks like I haven’t blogged explicitly about transparent identifiers, although I cover them in C# in Depth. Maybe once I’ve finished actually implementing the operators, I’ll have a few more general posts on this sort of thing.
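To make that concrete, here’s roughly how the compiler would translate a query with a where clause between the join and the select. The real generated identifiers are unspeakable, so treat the anonymous type below as an approximation of the transparent identifier:

var query = from x in outer
            join y in inner on x equals y.Length
            where x > 3
            select x + ":" + y;

// ... is translated roughly as:
var query2 = outer.Join(inner,
                        x => x,
                        y => y.Length,
                        (x, y) => new { x, y }) // transparent identifier
                  .Where(z => z.x > 3)
                  .Select(z => z.x + ":" + z.y);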

Anyway, the point is that for simple join clauses (as opposed to join…into) we’ve implemented everything we need to. Hoorah.

Conclusion

I spoiled some of the surprise around how easy Join would be to implement by mentioning it in the ToLookup post, but I’m still impressed by how neat it is. Again I should emphasize that this is due to the design of LINQ – it’s got nothing to do with my own prowess.

This won’t be the end of our use of lookups though… the other grouping constructs can all use them too. I’ll try to get them all out of the way before moving on to operators which feel a bit different…

Addendum

It turns out this wasn’t quite as simple as I’d expected. Although ToLookup and GroupBy handle null keys without blinking, Join and GroupJoin ignore them. I had to write an alternative version of ToLookup which ignores null keys while populating the lookup, and then replace the calls of "ToLookup" in the code above with calls to "ToLookupNoNullKeys". This isn’t documented anywhere, and is inconsistent with ToLookup/GroupBy. I’ve opened up a Connect issue about it, in the hope that it at least gets documented properly. (It’s too late to change the behaviour now, I suspect.)
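For illustration, ToLookupNoNullKeys might look something like this sketch (reusing the internal Lookup class described in the ToLookup post; the real Edulinq code may differ):

internal static ILookup<TKey, TElement> ToLookupNoNullKeys<TSource, TKey, TElement>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    Lookup<TKey, TElement> lookup = new Lookup<TKey, TElement>(comparer ?? EqualityComparer<TKey>.Default);
    foreach (TSource item in source)
    {
        TKey key = keySelector(item);
        // The only difference from ToLookup: null keys are silently skipped.
        if (key != null)
        {
            lookup.Add(key, elementSelector(item));
        }
    }
    return lookup;
}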

Reimplementing LINQ to Objects: Part 18 – ToLookup

I’ve had a request to implement Join, which seems a perfectly reasonable operator to aim towards. However, it’s going to be an awful lot easier to implement if we write ToLookup first. That will also help with GroupBy and GroupJoin, later on.

In the course of this post we’ll create a certain amount of infrastructure – some of which we may want to tweak later on.

What is it?

ToLookup has four overloads, with these signatures:

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector)

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)

Essentially these boil down to two required parameters and two optional ones:

  • The source is required, and must not be null
  • The keySelector is required, and must not be null
  • The elementSelector is optional, and defaults to an identity projection of a source element to itself. If it is specified, it must not be null.
  • The comparer is optional, and defaults to the default equality comparer for the key type. It may be null, which is equivalent to specifying the default equality comparer for the key type.

Now we can just consider the most general case – the final overload. However, in order to understand what ToLookup does, we have to know what ILookup<TKey, TElement> means – which in turn means knowing about IGrouping<TKey, TElement>:

public interface IGrouping<out TKey, out TElement> : IEnumerable<TElement>
{
    TKey Key { get; }
}

public interface ILookup<TKey, TElement> : IEnumerable<IGrouping<TKey, TElement>>
{
    int Count { get; }
    IEnumerable<TElement> this[TKey key] { get; }
    bool Contains(TKey key);
}

The generics make these interfaces seem somewhat scary at first sight, but they’re really not so bad. A grouping is simply a sequence with an associated key. This is a pretty simple concept to transfer to the real world – something like "Plays by Oscar Wilde" could be an IGrouping<string, Play> with a key of "Oscar Wilde". The key doesn’t have to be "embedded" within the element type though – it would also be reasonable to have an IGrouping<string, string> representing just the names of plays by Oscar Wilde.

A lookup is essentially a map or dictionary where each key is associated with a sequence of values instead of a single one. Note that the interface is read-only, unlike IDictionary<TKey, TValue>. As well as looking up a single sequence of values associated with a key, you can also iterate over the whole lookup in terms of groupings (instead of the key/value pair from a dictionary). There’s one other important difference between a lookup and a dictionary: if you ask a lookup for the sequence corresponding to a key which it doesn’t know about, it will return an empty sequence, rather than throwing an exception. (A key which the lookup does know about will never yield an empty sequence.)
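To make that concrete, here’s a tiny illustrative example:

var lookup = new[] { "apple", "avocado", "banana" }.ToLookup(word => word[0]);

Console.WriteLine(string.Join(", ", lookup['a'])); // apple, avocado
Console.WriteLine(lookup['z'].Count());            // 0 – an empty sequence, not an exception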

One slightly odd point to note is that while IGrouping is covariant in TKey and TElement, ILookup is invariant in both of its type parameters. While TKey has to be invariant, it would be reasonable for TElement to be covariant – to go back to our "plays" example, an IGrouping<string, Play> could be sensibly regarded as an IGrouping<string, IWrittenWork> (with the obvious type hierarchy). However, the interface declarations above are the ones in .NET 4, so that’s what I’ve used in Edulinq.

Now that we understand the signature of ToLookup, let’s talk about what it actually does. Firstly, it uses immediate execution – the lookup returned by the method is effectively divorced from the original sequence; changes to the sequence after the method has returned won’t change the lookup. (Obviously changes to the objects within the original sequence may still be seen, depending on what the element selector does, etc.) The rest is actually fairly straightforward, when you consider what parameters we have to play with:

  • The keySelector is applied to each item in the input sequence. Keys are always compared for equality using the "comparer" parameter.
  • The elementSelector is applied to each item in the input sequence, to project the item to the value which will be returned within the lookup.

To demonstrate this a little further, here are two applications of ToLookup – one using the first overload, and one using the last. In each case we’re going to group plays by author and then display some information about them. Hopefully this will make some of the concepts I’ve described a little more concrete:

ILookup<string, Play> lookup = allPlays.ToLookup(play => play.Author);
foreach (IGrouping<string, Play> group in lookup)
{
    Console.WriteLine("Plays by {0}", group.Key);
    foreach (Play play in group)
    {
        Console.WriteLine("  {0} ({1})", play.Name, play.CopyrightDate);
    }
}

// Or…

ILookup<string, string> lookup = allPlays.ToLookup(play => play.Author,
                                                   play => play.Name,
                                                   StringComparer.OrdinalIgnoreCase);
foreach (IGrouping<string, string> group in lookup)
{
    Console.WriteLine("Plays by {0}:", group.Key);
    foreach (string playName in group)
    {
        Console.WriteLine("  {0}", playName);
    }
}
        
// And demonstrating looking up by a key:
IEnumerable<string> playsByWilde = lookup["Oscar Wilde"];
Console.WriteLine("Plays by Oscar Wilde: {0}",
                  string.Join("; ", playsByWilde));

Note that although I’ve used explicit typing for all the variables here, I would usually use implicit typing with var, just to keep the code more concise.

Now we just have the small matter of working out how to go from a key/element pair for each item, to a data structure which lets us look up sequences by key.

Before we move onto the implementation and tests, there are two aspects of ordering to consider:

  • How are the groups within the lookup ordered?
  • How are the elements within each group ordered?

The documentation for ToLookup is silent on these matters – but the docs for the closely-related GroupBy operator are more specific:

The IGrouping<TKey, TElement> objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping<TKey, TElement>. Elements in a grouping are yielded in the order they appear in source.

Admittedly "in an order based on…" isn’t as clear as it might be, but I think it’s reasonable to make sure that our implementation yields groups such that the first group returned has the same key as the first element in the source, and so on.

What are we going to test?

I’ve actually got relatively few tests this time. I test each of the overloads, but not terribly exhaustively. There are three tests worth looking at though. The first two show the "eager" nature of the operator, and the final one demonstrates the most complex overload by grouping people into families.

[Test]
public void SourceSequenceIsReadEagerly()
{
    var source = new ThrowingEnumerable();
    Assert.Throws<InvalidOperationException>(() => source.ToLookup(x => x));
}

[Test]
public void ChangesToSourceSequenceAfterToLookupAreNotNoticed()
{
    List<string> source = new List<string> { "abc" };
    var lookup = source.ToLookup(x => x.Length);
    Assert.AreEqual(1, lookup.Count);

    // Potential new key is ignored
    source.Add("x");
    Assert.AreEqual(1, lookup.Count);

    // Potential new value for existing key is ignored
    source.Add("xyz");
    lookup[3].AssertSequenceEqual("abc");
}

[Test]
public void LookupWithComparerAndElementSelector()
{
    var people = new[] {
        new { First = "Jon", Last = "Skeet" },
        new { First = "Tom", Last = "SKEET" }, // Note upper-cased name
        new { First = "Juni", Last = "Cortez" },
        new { First = "Holly", Last = "Skeet" },
        new { First = "Abbey", Last = "Bartlet" },
        new { First = "Carmen", Last = "Cortez" },                
        new { First = "Jed", Last = "Bartlet" }
    };

    var lookup = people.ToLookup(p => p.Last, p => p.First, StringComparer.OrdinalIgnoreCase);

    lookup["Skeet"].AssertSequenceEqual("Jon", "Tom", "Holly");
    lookup["Cortez"].AssertSequenceEqual("Juni", "Carmen");
    // The key comparer is used for lookups too
    lookup["BARTLET"].AssertSequenceEqual("Abbey", "Jed");

    lookup.Select(x => x.Key).AssertSequenceEqual("Skeet", "Cortez", "Bartlet");
}

I’m sure there are plenty of other tests I could have come up with. If you want to see any in particular (either as an edit to this post if they already exist in source control, or as entirely new tests), please leave a comment.

EDIT: It turned out there were a few important tests missing… see the addendum.

Let’s implement it!

There’s quite a lot of code involved in implementing this – but it should be reusable later on, potentially with a few tweaks. Let’s get the first three overloads out of the way to start with:

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)
{
    return source.ToLookup(keySelector, element => element, EqualityComparer<TKey>.Default);
}

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    return source.ToLookup(keySelector, element => element, comparer);
}

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector)
{
    return source.ToLookup(keySelector, elementSelector, EqualityComparer<TKey>.Default);
}

Now we can just worry about the final one. Before we implement that though, we’re definitely going to need an implementation of IGrouping. Let’s come up with a really simple one to start with:

internal sealed class Grouping<TKey, TElement> : IGrouping<TKey, TElement>
{
    private readonly TKey key;
    private readonly IEnumerable<TElement> elements;

    internal Grouping(TKey key, IEnumerable<TElement> elements)
    {
        this.key = key;
        this.elements = elements;
    }

    public TKey Key { get { return key; } }

    public IEnumerator<TElement> GetEnumerator()
    {
        return elements.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

How do we feel about this? Well, groupings should be immutable. We have no guarantee that the "elements" sequence won’t be changed after creation – but it’s an internal class, so we’ve just got to be careful about how we use it. We also have to be careful that we don’t use a mutable sequence type which allows a caller to get from the iterator (the IEnumerator<TElement> returned by GetEnumerator) back to the sequence and then mutate it. In our case we’re actually going to use List<T> to provide the sequences, and while List<T>.Enumerator is a public type, it doesn’t expose the underlying list. Of course a caller could use reflection to mess with things, but we’re not going to try to protect against that.

Okay, so now we can pair a sequence with a key… but we still need to implement ILookup. This is where there are multiple options. We want our lookup to be immutable, but there are various degrees of immutability we could use:

  • Mutable internally, immutable in public API
  • Mutable privately, immutable to the internal API
  • Totally immutable, even within the class itself

The first option is the simplest to implement, and it’s what I’ve gone for at the moment. I’ve created a Lookup class which allows a key/element pair to be added to it from within the Edulinq assembly. It uses a Dictionary<TKey, List<TElement>> to map the keys to sequences efficiently, and a List<TKey> to remember the order in which we first saw the keys. Here’s the complete implementation:

internal sealed class Lookup<TKey, TElement> : ILookup<TKey, TElement>
{
    private readonly Dictionary<TKey, List<TElement>> map;
    private readonly List<TKey> keys;

    internal Lookup(IEqualityComparer<TKey> comparer)
    {
        map = new Dictionary<TKey, List<TElement>>(comparer);
        keys = new List<TKey>();
    }

    internal void Add(TKey key, TElement element)
    {
        List<TElement> list;
        if (!map.TryGetValue(key, out list))
        {
            list = new List<TElement>();
            map[key] = list;
            keys.Add(key);
        }
        list.Add(element);
    }

    public int Count
    {
        get { return map.Count; }
    }

    public IEnumerable<TElement> this[TKey key]
    {
        get
        {
            List<TElement> list;
            if (!map.TryGetValue(key, out list))
            {
                return Enumerable.Empty<TElement>();
            }
            return list.Select(x => x);
        }
    }

    public bool Contains(TKey key)
    {
        return map.ContainsKey(key);
    }

    public IEnumerator<IGrouping<TKey, TElement>> GetEnumerator()
    {
        return keys.Select(key => new Grouping<TKey, TElement>(key, map[key]))
                   .GetEnumerator();
    }
        
    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

All of this is really quite straightforward. Note that we provide an equality comparer to the constructor, which is then passed onto the dictionary – that’s the only thing that needs to know how to compare keys.

There are only two points of interest, really:

  • In the indexer, we don’t return the list itself – that would allow the caller to mutate it, after casting it to List<TElement>. Instead, we just call Select with an identity projection, as a simple way of inserting a sort of "buffering" layer between the list and the caller. There are other ways of doing this, of course – including implementing the indexer with an iterator block (sketched after this list).
  • In GetEnumerator, we’re retaining the key order by using our list of keys and performing a lookup on each key.
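Just to show the sort of thing I mean, here’s a sketch of the iterator block alternative for the indexer – assuming the rest of Lookup stays exactly as shown above:

public IEnumerable<TElement> this[TKey key]
{
    get
    {
        List<TElement> list;
        if (!map.TryGetValue(key, out list))
        {
            return Enumerable.Empty<TElement>();
        }
        return IterateOver(list);
    }
}

// The compiler-generated sequence wraps the list without ever
// exposing it to the caller.
private static IEnumerable<TElement> IterateOver(List<TElement> list)
{
    foreach (TElement element in list)
    {
        yield return element;
    }
}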

We’re currently creating the new Grouping objects lazily – which will lead to fewer of them being created if the caller doesn’t actually iterate over the lookup, but more of them if the caller iterates over it several times. Again, there are alternatives here – but without any good information about where to make the trade-off, I’ve just gone for the simplest code which works for the moment.

One last thing to note about Lookup – I’ve left it internal. In .NET, it’s actually public – but the only way of getting at an instance of it is to call ToLookup and then cast the result. I see no particular reason to make it public, so I haven’t.

Now we’re finally ready to implement the last ToLookup overload – and it becomes pretty much trivial:

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    if (elementSelector == null)
    {
        throw new ArgumentNullException("elementSelector");
    }

    Lookup<TKey, TElement> lookup = new Lookup<TKey, TElement>(comparer ?? EqualityComparer<TKey>.Default);

    foreach (TSource item in source)
    {
        TKey key = keySelector(item);
        TElement element = elementSelector(item);
        lookup.Add(key, element);
    }
    return lookup;
}

When you look past the argument validation, we’re just creating a Lookup, populating it, and then returning it. Simple.

Thread safety

Something I haven’t addressed anywhere so far is thread safety. In particular, although all of this is nicely immutable when viewed from a single thread, I have a nasty feeling that theoretically, if the return value of our implementation of ToLookup were exposed to another thread, it could potentially observe the internal mutations we’re making, as we’re not doing anything special in terms of the memory model.

I’m basically scared by lock-free programming these days, unless I’m using building blocks provided by someone else. While investigating the exact guarantees offered here would be interesting, I don’t think it would really help with our understanding of LINQ as a whole. I’m therefore declaring thread safety to be out of scope for Edulinq in general :)

Conclusion

So that’s ToLookup. Two new interfaces, two new classes, all for one new operator… so far. We can reuse almost all of this in Join though, which will make it very simple to implement. Stay tuned…

Addendum

It turns out that I missed something quite important: ToLookup has to handle null keys, as do various other LINQ operators (GroupBy etc). We’re currently using a Dictionary<TKey, TValue> to organize the groups… and that doesn’t support null keys. Oops.

So, first steps: write some tests proving that it fails as we expect it to. Fetch by a null key of a lookup. Include a null key in the source of a lookup. Use GroupBy, GroupJoin and Join with a null key. Watch it go bang. That’s the easy part…

Now, we can do all the special-casing in Lookup itself – but it gets ugly. Our Lookup code was pretty simple before; it seems a shame to spoil it with checks everywhere. What we really need is a dictionary which does support null keys. Step forward NullKeyFriendlyDictionary, a new internal class in Edulinq. Now you might expect this to implement IDictionary<TKey, TValue>, but it turns out that’s a pain in the neck. We hardly use any of the members of Dictionary – TryGetValue, the indexer, ContainsKey, and the Count property. That’s it! So those are the only members I’ve implemented.

The class contains a Dictionary<TKey, TValue> to delegate most requests to, and it just handles null keys itself. Here’s a quick sample:

internal sealed class NullKeyFriendlyDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> map;
    private bool haveNullKey = false;
    private TValue valueForNullKey;

    internal NullKeyFriendlyDictionary(IEqualityComparer<TKey> comparer)
    {
        map = new Dictionary<TKey, TValue>(comparer);
    }

    internal bool TryGetValue(TKey key, out TValue value)
    {
        if (key == null)
        {
            // This will be default(TValue) if haveNullKey is false,
            // which is what we want.
            value = valueForNullKey;
            return haveNullKey;
        }
        return map.TryGetValue(key, out value);
    }

    // etc
}
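The remaining members follow the same pattern – here’s a sketch of how they might look, though the real Edulinq code may differ in detail:

internal TValue this[TKey key]
{
    get
    {
        TValue value;
        if (!TryGetValue(key, out value))
        {
            throw new KeyNotFoundException();
        }
        return value;
    }
    set
    {
        if (key == null)
        {
            haveNullKey = true;
            valueForNullKey = value;
        }
        else
        {
            map[key] = value;
        }
    }
}

internal bool ContainsKey(TKey key)
{
    return key == null ? haveNullKey : map.ContainsKey(key);
}

internal int Count
{
    // The null key (if present) lives outside the backing dictionary,
    // so it has to be counted separately.
    get { return map.Count + (haveNullKey ? 1 : 0); }
}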

There’s one potential flaw here that I can think of: if you provide an IEqualityComparer<TKey> which treats some non-null key as equal to a null key, we won’t spot that. If your source then contains both of those keys, we’ll end up splitting them into two groups instead of keeping them together. I’m not too worried about that – and I suspect there are all kinds of ways that could cause problems elsewhere anyway.

With this in place, and Lookup adjusted to use NullKeyFriendlyDictionary instead of just Dictionary, all the tests pass. Hooray!

At the same time as implementing this, I’ve tidied up Grouping itself – it now implements IList<T> itself, in a way which is immutable to the outside world. The Lookup now contains groups directly, and can return them very easily. The code is generally tidier, and anything using a group can take advantage of the optimizations applied to IList<T> – particularly the Count() operator, which is often applied to groups.

Reimplementing LINQ to Objects: Part 17 – Except

I’m really pleased with this one. Six minutes ago (at the time of writing this), I tweeted about the Intersect blog post. I then started writing the tests and implementation for Except… and I’m now done.

The tests are cut/paste/replace with a few tweaks – but it’s the implementation that I’m most pleased with. You’ll see what I mean later – it’s beautiful.

What is it?

Except is our final set operator, with the customary two overloads:

public static IEnumerable<TSource> Except<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)

public static IEnumerable<TSource> Except<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)

The result of calling Except is the elements of the first sequence which are not in the second sequence.

Just for completeness, here’s a quick summary of the behaviour:

  • The first overload uses the default equality comparer for TSource; so does the second if you pass in null as the comparer.
  • Neither "first" nor "second" can be null
  • The method does use deferred execution
  • The result sequence only contains distinct elements; even if the first sequence contains duplicates, the result sequence won’t

This time, the documentation doesn’t say how "first" and "second" will be used, other than to describe the result as a set difference. In practice though, it’s exactly the same as Intersect: when we ask for the first result, the "second" input sequence is fully evaluated, then the "first" input sequence is streamed.

What are we going to test?

I literally cut and pasted the tests for Intersect, did a find/replace on Intersect/Except, and then looked at the data in each test. In particular, I made sure that there were potential duplicates in the "first" input sequence which would have to be removed in the result sequence. I also tweaked the details of the data in the final tests shown before (which proved the way in which the two sequences were read), but the main thrust of the tests is the same.

Nothing to see here, move on…

Let’s implement it!

I’m not even going to bother showing the comparer-free overload this time. It just calls the other overload, as you’d expect. Likewise the argument validation part of the implementation is tedious. Let’s focus on the part which does the work. First, we’ll think back to Distinct and Intersect:

  • In Distinct, we started with an empty set and populated it as we went along, making sure we never returned anything already in the set
  • In Intersect, we started with a populated set (from the second input sequence) and removed elements from it as we went along, only returning elements where an equal element was previously in the set

Except is simply a cross between the two: from Distinct we keep the idea of a "banned" set of elements that we add to; from Intersect we take the idea of starting off with a set populated from the second input sequence. Here’s the implementation:

private static IEnumerable<TSource> ExceptImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    HashSet<TSource> bannedElements = new HashSet<TSource>(second, comparer);
    foreach (TSource item in first)
    {
        if (bannedElements.Add(item))
        {
            yield return item;
        }
    }
}

The only differences between this and Intersect are:

  • The name of the method (ExceptImpl instead of IntersectImpl)
  • The name of the local variable holding the set (bannedElements instead of potentialElements)
  • The method called in the loop (Add instead of Remove)

Isn’t that just wonderful? Perhaps it shouldn’t make me quite as happy as it does, but hey…
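If you want to see the duplicate-removal behaviour in action, here’s a quick illustrative snippet (using the AssertSequenceEqual helper from the test project):

int[] first = { 1, 2, 2, 3, 4, 3 };
int[] second = { 3, 1 };

// 1 and 3 are banned from the start; the duplicate 2 is banned
// as soon as the first 2 has been yielded.
first.Except(second).AssertSequenceEqual(2, 4);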

Conclusion

That concludes the set operators – and indeed my blogging for the night. It’s unsurprising that all of the set operators have used a set implementation internally… but I’ve been quite surprised at just how simple the implementations all were. Again, the joy of LINQ resides in the ability for such simple operators to be combined in useful ways.

Reimplementing LINQ to Objects: Part 16 – Intersect (and build fiddling)

Okay, this is more like it – after the dullness of Union, Intersect has a new pattern to offer… and one which we’ll come across repeatedly.

First, however, I should explain some more changes I’ve made to the solution structure…

Building the test assembly

I’ve just had an irritating time sorting out something I thought I’d fixed this afternoon. Fed up of accidentally testing against the wrong implementation, my two project configurations (“Normal LINQ” and “Edulinq implementation”) now target different libraries from the test project: only the “Normal LINQ” configuration refers to System.Core, and only the “Edulinq implementation” configuration refers to the Edulinq project itself. Or so I thought. Unfortunately, msbuild automatically adds System.Core in unless you’re careful. I had to add this into the “Edulinq implementation” property group part of my project file to avoid accidentally pulling in System.Core:

<AddAdditionalExplicitAssemblyReferences>false</AddAdditionalExplicitAssemblyReferences>

Unfortunately, at that point all the extension methods I’d written within the tests project – and the references to HashSet<T> – failed. I should have noticed them not failing before, and been suspicious. Hey ho.

Now I’m aware that you can add your own version of ExtensionAttribute, but I believe that can become a problem if at execution time you do actually load System.Core… which I will end up doing, as the Edulinq assembly itself references it (for HashSet<T> aside from anything else).

There may well be multiple solutions to this problem, but I’ve come up with one which appears to work:

  • Add an Edulinq.TestSupport project, and refer to that from both configurations of Edulinq.Tests
  • Make Edulinq.TestSupport refer to System.Core so it can use whatever collections it likes, as well as extension methods
  • Put all the non-test classes (those containing extension methods, and the weird and wonderful collections) into the Edulinq.TestSupport project.
  • Add a DummyClass to Edulinq.TestSupport in the namespace System.Linq, so that using directives within Edulinq.Tests don’t need to be conditionalized

All seems to be working now – and finally, if I try to refer to an extension method I haven’t implemented in Edulinq yet, it will fail to compile instead of silently using the System.Core one. Phew. Now, back to Intersect…

What is it?

Intersect has a now-familiar pair of overloads:

public static IEnumerable<TSource> Intersect<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)

public static IEnumerable<TSource> Intersect<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)

Fairly obviously, it computes the intersection of two sequences: the elements that occur in both “first” and “second”. Points to note:

  • Again, the first overload uses the default equality comparer for TSource; so does the second if you pass in null as the comparer.
  • Neither “first” nor “second” can be null
  • The method does use deferred execution – but in a way which may seem unusual at first sight
  • The result sequence only contains distinct elements – even if an element occurs multiple times in both input sequences, it will only occur once in the result

Now for the interesting bit – exactly when the two sequences are used. MSDN claims this:

When the object returned by this method is enumerated, Intersect enumerates first, collecting all distinct elements of that sequence. It then enumerates second, marking those elements that occur in both sequences. Finally, the marked elements are yielded in the order in which they were collected.

This is demonstrably incorrect. Indeed, I have a test which would fail under LINQ to Objects if this were the case. What actually happens is this:

  • Nothing happens until the first result element is requested. I know I’ve said so already, but it’s worth repeating.
  • As soon as the first element of the result is requested, the whole of the “second” input sequence is read, as well as the first (and possibly more) elements of the “first” input sequence – enough to return the first result, basically.
  • Results are read from the “first” input sequence as and when they are required. Only elements which were originally in the “second” input sequence and haven’t already been yielded are returned.

We’ll see this pattern of “greedily read the second sequence, but stream the first sequence” again when we look at Join… but let’s get back to the tests.

What are we going to test?

Obviously I’ve got simple tests including:

  • Argument validation
  • Elements which occur multiple times in each sequence
  • The overload without a comparer specified
  • Specifying a null comparer
  • Specifying a “custom” comparer (I’m using the case-insensitive string comparer again; news at 11)

However, the interesting tests are the ones which show how the sequences are actually consumed. Here they are:

[Test]
public void NoSequencesUsedBeforeIteration()
{
    var first = new ThrowingEnumerable();
    var second = new ThrowingEnumerable();
    // No exceptions!
    var query = first.Intersect(second);
    // Still no exceptions… we’re not calling MoveNext.
    using (var iterator = query.GetEnumerator())
    {
    }
}

[Test]
public void SecondSequenceReadFullyOnFirstResultIteration()
{
    int[] first = { 1 };
    var secondQuery = new[] { 10, 2, 0 }.Select(x => 10 / x);

    var query = first.Intersect(secondQuery);
    using (var iterator = query.GetEnumerator())
    {
        Assert.Throws<DivideByZeroException>(() => iterator.MoveNext());
    }
}

[Test]
public void FirstSequenceOnlyReadAsResultsAreRead()
{
    var firstQuery = new[] { 10, 2, 0, 2 }.Select(x => 10 / x);
    int[] second = { 1 };

    var query = firstQuery.Intersect(second);
    using (var iterator = query.GetEnumerator())
    {
        // We can get the first value with no problems
        Assert.IsTrue(iterator.MoveNext());
        Assert.AreEqual(1, iterator.Current);

        // Getting at the *second* value of the result sequence requires
        // reading from the first input sequence until the “bad” division
        Assert.Throws<DivideByZeroException>(() => iterator.MoveNext());
    }
}

The first test just proves that execution really is deferred. That’s straightforward.

The second test proves that the “second” input sequence is completely read as soon as we ask for our first result. If the operator had really read “first” and then started reading “second”, it could have yielded the first result (1) without throwing an exception… but it didn’t.

The third test proves that the “first” input sequence isn’t read in its entirety before we start returning results. We manage to read the first result with no problems – it’s only when we ask for the second result that we get an exception.

Let’s implement it!

I have chosen to implement the same behaviour as LINQ to Objects rather than the behaviour described by MSDN, because it makes more sense to me. In general, the “first” sequence is given more importance than the “second” sequence in LINQ: it’s generally the one which is streamed, with the second sequence being buffered if necessary.

Let’s start off with the comparer-free overload – as before, it will just call the other one:

public static IEnumerable<TSource> Intersect<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    return Intersect(first, second, EqualityComparer<TSource>.Default);
}

Now for the more interesting part. Obviously we’ll have an argument-validating method, but what should we do in the guts? I find the duality between this and Distinct fascinating: there, we started with an empty set of elements, and tried to add each source element to it, yielding the element if we successfully added it (meaning it wasn’t there before).

This time, we’ve effectively got a limited set of elements which we can yield, but we only want to yield each of them once – so as we see items, we can remove them from the set, yielding only if that operation was successful. The initial set is formed from the “second” input sequence, and then we just iterate over the “first” input sequence, removing and yielding appropriately:

public static IEnumerable<TSource> Intersect<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{           
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    return IntersectImpl(first, second, comparer ?? EqualityComparer<TSource>.Default);
}

private static IEnumerable<TSource> IntersectImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    HashSet<TSource> potentialElements = new HashSet<TSource>(second, comparer);
    foreach (TSource item in first)
    {
        if (potentialElements.Remove(item))
        {
            yield return item;
        }
    }
}

Ta-da – it all works as expected. I don’t know whether this is how Intersect really works in LINQ to Objects, but I expect it does something remarkably similar.

Next time we’ll see a cross between the two techniques.

Conclusion

After suffering from some bugs earlier today where my implementation didn’t follow the documentation, it’s nice to find some documentation which doesn’t follow the real implementation :)

Seriously though, there’s an efficiency point to be noted here. If you have two sequences, one long and one short, then it’s much more efficient (mostly in terms of space) to use longSequence.Intersect(shortSequence) than shortSequence.Intersect(longSequence). The latter will require the whole of the long sequence to be in memory at once.

Next up – and I might just manage it tonight – our final set operator, Except.