Reimplementing LINQ to Objects: Part 17 – Except

I’m really pleased with this one. Six minutes ago (at the time of writing this), I tweeted about the Intersect blog post. I then started writing the tests and implementation for Except… and I’m now done.

The tests are cut/paste/replace with a few tweaks – but it’s the implementation that I’m most pleased with. You’ll see what I mean later – it’s beautiful.

What is it?

Except is our final set operator, with the customary two overloads:

public static IEnumerable<TSource> Except<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)

public static IEnumerable<TSource> Except<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)

The result of calling Except is the elements of the first sequence which are not in the second sequence.

Just for completeness, here’s a quick summary of the behaviour:

  • The first overload uses the default equality comparer for TSource, and the second overload does if you pass in "null" as the comparer, too.
  • Neither "first" nor "second" can be null
  • The method does use deferred execution
  • The result sequence only contains distinct elements; even if the first sequence contains duplicates, the result sequence won’t

This time, the documentation doesn’t say how "first" and "second" will be used, other than to describe the result as a set difference. In practice though, it’s exactly the same as Intersect: when we ask for the first result, the "second" input sequence is fully evaluated, then the "first" input sequence is streamed.

What are we going to test?

I literally cut and paste the tests for Intersect, did a find/replace on Intersect/Except, and then looked at the data in each test. In particular, I made sure that there were potential duplicates in the "first" input sequence which would have to be removed in the result sequence. I also tweaked the details of the data in the final tests shown before (which proved the way in which the two sequences were read) but the main thrust of the tests are the same.

Nothing to see here, move on…

Let’s implement it!

I’m not even going to bother showing the comparer-free overload this time. It just calls the other overload, as you’d expect. Likewise the argument validation part of the implementation is tedious. Let’s focus on the part which does the work. First, we’ll think back to Distinct and Intersect:

  • In Distinct, we started with an empty set and populated it as we went along, making sure we never returned anything already in the set
  • In Intersect, we started with a populated set (from the second input sequence) and removed elements from it as we went along, only returning elements where an equal element was previously in the set

Except is simply a cross between the two: from Distinct we keep the idea of a "banned" set of elements that we add to; from Intersect we take the idea of starting off with a set populated from the second input element. Here’s the implementation:

private static IEnumerable<TSource> ExceptImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    HashSet<TSource> bannedElements = new HashSet<TSource>(second, comparer);
    foreach (TSource item in first)
    {
        if (bannedElements.Add(item))
        {
            yield return item;
        }
    }
}

The only differences between this and Intersect are:

  • The name of the method (ExceptImpl instead of IntersectImpl)
  • The name of the local variable holding the set (bannedElements instead of potentialElements)
  • The method called in the loop (Add instead of Remove)

Isn’t that just wonderful? Perhaps it shouldn’t make me quite as happy as it does, but hey…

Conclusion

That concludes the set operators – and indeed my blogging for the night. It’s unsurprising that all of the set operators have used a set implementation internally… but I’ve been quite surprised at just how simple the implementations all were. Again, the joy of LINQ resides in the ability for such simple operators to be combined in useful ways.

4 thoughts on “Reimplementing LINQ to Objects: Part 17 – Except”

  1. Super necro, I know, but I just wanted to point out that this implementation has the same hidden (but very serious) performance problem that Linq to Objects has: it’s quite common when using Except for the second sequence to be a HashSet. For example, it would be natural to put something like “adjacentNodes.Except(closedSet)” into a pathfinding algorithm. This implementation then wastes a bunch of time to build bannedElements as a new HashSet when it already has a HashSet. (The problem becomes particularly painful when this is run in a tight loop–again, such as pathfinding. Guess how I first discovered this problem?)

    A good optimization would be to check if it’s a HashSet and move to an optimized path where it checks “secondSet.Contains(item) || bannedElements.Add(item)” in the conditional. This could give significantly better performance than Linq to Objects under the right conditions.

    Like

    1. Argh, I wrote that conditional completely wrong, and can’t edit it now like on SO.

      Should be “(!secondSet.Contains(item)) && bannedElements.Add(item)”

      Like

Leave a comment