Reimplementing LINQ to Objects: Part 14 – Distinct

I’m going to implement the set-based operators next, starting with Distinct.

What is it?

Distinct has two overloads, which differ only in the simplest possible way:

public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source)

public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)

The point of the operator is straightforward: the result sequence contains the same items as the input sequence, but with the duplicates removed – so an input of { 0, 1, 3, 1, 5 } will give a result sequence of { 0, 1, 3, 5 } – the second occurrence of 1 is ignored.

This time I’ve checked and double-checked the documentation – and in this case it really is appropriate to think of the first overload as just a simplified version of the second. If you don’t specify an equality comparer, the default comparer for the type will be used. The same will happen if you pass in a null reference for the comparer. The default equality comparer for a type can be obtained with the handy EqualityComparer<T>.Default property.

Just to recap, an equality comparer (represented by the IEqualityComparer<T> interface) is able to do two things:

  • Determine the hash code for a single item of type T
  • Compare any two items of type T for equality

It doesn’t have to give any sort of ordering – that’s what IComparer<T> is for, although that doesn’t have the ability to provide a hash code.

One interesting point about IEqualityComparer<T> is that the GetHashCode() method is meant to throw an exception if it’s provided with a null argument, but in practice the EqualityComparer<T>.Default implementations appear not to. This leads to an interesting question about Distinct: how should it handle null elements? It’s not documented either way, but in reality both the LINQ to Objects implementation and the simplest way of implementing it ourselves simply throws a NullReferenceException if you use a not-null-safe comparer and have a null element present. Note that the default equality comparer for any type (EqualityComparer<T>.Default) does cope with nulls.

There are other undocumented aspects of Distinct, too. Both the ordering of the result sequence and the choice of which exact element is returned when there are equal options are unspecified. In the case of ordering, it’s explicitly unspecified. From the documentation: "The Distinct method returns an unordered sequence that contains no duplicate values." However, there’s a natural approach which answers both of these questions. Distinct is specified to use deferred execution (so it won’t look at the input sequence until you start reading from the output sequence) but it also streams the results to some extent: to return the first element in the result sequence, it only needs to read the first element from the input sequence. Some other operators (such as OrderBy) have to read all their data before yielding any results.

When you implement Distinct in a way which only reads as much data as it has to, the answer to the ordering and element choice is easy:

  • The result sequence is in the same order as the input sequence
  • When there are multiple equal elements, it is the one which occurs earliest in the input sequence which is returned as part of the result sequence.

Remember that it’s perfectly possible to have elements which are considered equal under a particular comparer, but are still clearly different when looked at another way. The simplest example of this is case-insensitive string equality. Taking the above rules into account, the distinct sequence returned for { "ABC", "abc", "xyz" } with a case-insensitive comparer is { "ABC", "xyz" }.

What are we going to test?

All of the above :)

All the tests use sequences of strings for clarity, but I’m using four different comparers:

  • The default string comparer (which is a case-sensitive ordinal comparer)
  • The case-insensitive ordinal comparer
  • A comparer which uses object identity (so will treat two equal but distinct strings as different)
  • A comparer which explicitly doesn’t try to cope with null values

The tests assume that the undocumented aspects listed above are implemented with the rules that I’ve given. This means they’re over-sensitive, in that an implementation of Distinct which matches all the documented behaviour but returns elements in a different order would fail the tests. This highlights an interesting aspect of unit testing in general… what exactly are we trying to test? I can think of three options in our case:

  • Just the documented behaviour: anything conforming to that, however oddly, should pass
  • The LINQ to Objects behaviour: the framework implementation should pass all our tests, and then our implementation should as well
  • Our implementation’s known (designed) behaviour: we can specify that our implementation will follow particular rules above and beyond the documented contracts

In production projects, these different options are valid in different circumstances, depending on exactly what you’re trying to do. At the moment, I don’t have any known differences in behaviour between LINQ to Objects and Edulinq, although that may well change later in terms of optimizations.

None of the tests themselves are particularly interesting – although I find it interesting that I had to implement a deliberately fragile (but conformant) implementation of IEqualityComparer<T> in order to test Distinct fully.

Let’s implement it!

I’m absolutely confident in implementing the overload that doesn’t take a custom comparer using the one that does. We have two options for how to specify the custom comparer in the delegating call though – we could pass null or EqualityComparer<T>.Default, as the two are explicitly defined to behave the same way in the second overload. I’ve chosen to pass in EqualityComparer<T>.Default just for the sake of clarity – it means that anyone reading the first method doesn’t need to check the behavior of the second to understand what it will do.

We need to use the "private iterator block method" approach again, so that the arguments can be evaluated eagerly but still let the result sequence use deferred execution. The real work method uses HashSet<T> to keep track of all the elements we’ve already returned – it takes an IEqualityComparer<T> in its constructor, and the Add method adds an element to the set if there isn’t already an equal one, and returns whether or not it really had to add anything. All we have to do is iterate over the input sequence, call Add, and yield the item as part of the result sequence if Add returned true. Simple!

public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source)
{
    return source.Distinct(EqualityComparer<TSource>.Default);
}

public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)
{
    if (source == null
    {
        throw new ArgumentNullException("source");
    }
    return DistinctImpl(source, comparer ?? EqualityComparer<TSource>.Default);
}

private static IEnumerable<TSource> DistinctImpl<TSource>(
    IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)
{
    HashSet<TSource> seenElements = new HashSet<TSource>(comparer);
    foreach (TSource item in source)
    {
        if (seenElements.Add(item))
        {
            yield return item;
        }
    }
}

So what about the behaviour with nulls? Well, it seems that HashSet<T> just handles that automatically, if the comparer it uses does. So long as the comparer returns the same hash code each time it’s passed null, and considers null and null to be equal, it can be present in the sequence. Without HashSet<T>, we’d have had a much uglier implementation – especially as Dictionary<TKey, TValue> doesn’t allow null keys.

Conclusion

I’m frankly bothered by the lack of specificity in the documentation for Distinct. Should you rely on the ordering rules that I’ve given here? I think that in reality, you’re reasonably safe to rely on it – it’s the natural result of the most obvious implementation, after all. I wouldn’t rely on the same results when using a different LINQ provider, mind you – when fetching the results back from a database, for example, I wouldn’t be at all surprised to see the ordering change. And of course, the fact that t documentation explicitly states that the result is unordered should act as a bit of a deterrent from relying on this.

We’ll have to make similar decisions for the other set-based operators: Union, Intersect and Except. And yes, they’re very likely to use HashSet<T> too…

7 thoughts on “Reimplementing LINQ to Objects: Part 14 – Distinct”

  1. Couldn’t one argue that, given that *sets* don’t have an “order”, that to document one would be incorrect? Or am I being too much of a purist?

    Like

  2. @Damien: The problem is that what’s returned *is* a sequence, which can clearly be iterated. Without extra help, that’s *all* you can do with a sequence.

    If the return type were ISet that would be a different matter… but I think it would have been more useful for the documentation to specify the order, and describe the relationship between the sequence and a set – namely that the items in the sequence *form* a set, as there are no duplicates. You can do that but still document the order of the sequence.

    Like

  3. I think the order is unspecified precisely because of other providers like linq to any database where the order is very likely to be changed – therefore, Distinct returns an unordered sequence. If you want it ordered, OrderBy later!

    Like

  4. You didn’t use the var statement with declaring the seenElements local variabel. I’d rather use var, when constructor and declaration type are the same: shorter code (less horizontal scrolling) and it’s quicker to identify that left and right are the same. I included three variations below.

    Do you have any thoughts on that?

    HashSet seenElements = new HashSet(comparer);
    ISet seenElements = new HashSet(comparer);
    var seenElements = new HashSet(comparer);

    Like

  5. In the NullElementsAreNotPassedToComparer test you didn’t specify the comparer in the Distinct method call so EqualityComparer.Default was used. It seems that HashSet doesn’t handle null values and the specified comparer should do it instead. If we specify the correct comparer, we should get NullReferenceException.

    ?

    Like

  6. @Marcinn: Thanks, that was a bug in the tests and the post itself. I’ve fixed both now… and it’s highlighted another LinqBridge divergence from LINQ to Objects, interestingly enough…

    Like

Leave a comment