Reimplementing LINQ to Objects: Part 15 – Union

I’m continuing the set-based operators with Union. I may even have a couple of hours tonight – possibly enough to finish off Intersect and Except as well… let’s see.

What is it?

Union is another extension method with two overloads; one taking an equality comparer and one just using the default:

public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)

public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)

This "two overloads, one taking an equality comparer" pattern is familiar from Distinct; we’ll see that and Intersect and Except do exactly the same thing.

Simply put, Union returns the union of the two sequences – all items that are in either input sequence. The result sequence has no duplicates in even if one of the input sequences contains duplicates. (I’m using the term "duplicate" here to mean an element which is equal to another according to the equality comparer we’re using in the operator.)

Characteristics:

  • Union uses deferred execution: argument validation is basically all that happens when the method is first called; it only starts iterating over the input sequences when you iterate over the result sequence
  • Neither first nor second can be null; the comparer argument can be null, in which case the default equality comparer is used
  • The input sequences are only read as and when they’re needed; to return the first result element, only the first input element is read

It’s worth noting that the documentation for Union specifies a lot more than the Distinct documentation does:

When the object returned by this method is enumerated, Union enumerates first and second in that order and yields each element that has not already been yielded.

To me, that actually looks like a guarantee of the rules I proposed for Distinct. In particular, it’s guaranteeing that the implementation iterates over "first" before "second", and if it’s yielding elements as it goes, that guarantees that distinct elements will retain their order from the original input sequences. Whether others would read it in the same way or not, I can’t say… input welcome.

What are we going to test?

I’ve written quite a few tests for Union – possibly more than we really need, but they demonstrate a few points of usage. The tests are:

  • Arguments are validated eagerly
  • Finding the union of two sequences without specifying a comparer; both inputs have duplicate elements, and there’s one element in both
  • The same test as above but explicitly specifying null as the comparer, to force the default to be used
  • The same test as above but using a case-insensitive string comparer
  • Taking the union of an empty sequence with a non-empty one
  • Taking the union of a non-empty sequence with an empty one
  • Taking the union of two empty sequences
  • Proving that the first sequence isn’t used until we start iterating over the result sequence (using ThrowingEnumerable)
  • Proving that the second sequence isn’t used until we’ve exhausted the first

No new collections or comparers needed this time though – it’s all pretty straightforward. I haven’t written any tests for null elements this time – I’m convinced enough by what I saw when implementing Distinct to believe they won’t be a problem.

Let’s implement it!

First things first: we can absolutely implement the simpler overload in terms of the more complex one, and I’ll do the same for Except and Intercept. Here’s the Union method:

public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    return Union(first, second, EqualityComparer<TSource>.Default);
}

So how do we implement the more complex overload? Well, I’ve basically been a bit disappointed by Union in terms of its conceptual weight. It doesn’t really give us anything that the obvious combination of Concat and Distinct doesn’t – so let’s implement it that way first:

public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    return first.Concat(second).Distinct(comparer);
}

The argument validation can be implemented by Concat with no problems here – the parameters have the same name, so any exceptions thrown will be fine in every way.

That’s how I think about Union, but as I’ve mentioned before, I’d rather not actually see Concat and Distinct showing up in the stack trace – so here’s the fuller implementation.

public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    return UnionImpl(first, second, comparer ?? EqualityComparer<TSource>.Default);
}

private static IEnumerable<TSource> UnionImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    HashSet<TSource> seenElements = new HashSet<TSource>(comparer);
    foreach (TSource item in first)
    {
        if (seenElements.Add(item))
        {
            yield return item;
        }
    }
    foreach (TSource item in second)
    {
        if (seenElements.Add(item))
        {
            yield return item;
        }
    }
}

That feels like an absurd waste of code when we can achieve the same result so simply, admittedly. This is the first time my resolve against implementing one operator in terms of completely different ones has wavered. Just looking at it in black and white (so to speak), I’m close to going over the edge…

Conclusion

Union was a disappointingly bland operator in my view. (Maybe I should start awarding operators marks out of ten for being interesting, challenging etc.) It doesn’t feel like it’s really earned its place in LINQ, as calls to Concat/Distinct can replace it so easily. Admittedly as I’ve mentioned in several other places, a lot of operators can be implemented in terms of each other – but rarely quite so simply.

Still, I think Intersect and Except should be more interesting.

5 thoughts on “Reimplementing LINQ to Objects: Part 15 – Union”

  1. Union wins its place in LINQ by being an elevated first-class concept. If everyone just used Concat/Distinct, then every LINQ provider would have to know about that pattern to run a set union operation.

    If anything, I might be inclined to think that there should be additional first-class concepts, not fewer.

    Like

  2. As you pointed out, Distinct explicitly isn’t required to maintain order, whereas Union appears to have that requirement. That alone makes the source.Concat(second).Distinct(comparer) implementation suspect.

    I am uncertain how acceptable it is to rely on behavior in your own implementation that isn’t specified in the contract. Since your implementation of Distinct does maintain order, the shorter Union call would work — until a change is main in Distinct which breaks that implementation detail but doesn’t break the LINQ To Object spec. I would be concerned about that risk.

    Like

  3. @Blaise: While Union is calling Distinct which is documented (and tested) to maintain order, I don’t think it’s a problem to depend on it.

    In other words, my implementation of Union is depending on *my implementation* of Distinct, not just the documented version.

    Like

  4. Maybe they included Union because people are familiar with SQL’s UNION. And Concat is UNION ALL in SQL (at least in SQL Server).

    Would there actually be a API relation between IEnumerable and IQueryable?

    Like

  5. @Doeke: There’s absolutely an API relationship between IEnumerable and IQueryable, although there are ways in which it’s a troubled one. I’ll be looking at that in the future.

    Like

Leave a comment