Reimplementing LINQ to Objects: Part 32 – Contains

After the dubious optimizations of ElementAt/ElementAtOrDefault yesterday, we meet an operator which is remarkably good at defying optimization. Sort of. Depending on how you feel it should behave.

What is it?

Contains has two overloads, which only differ by whether or not they take an equality comparer – just like Distinct, Intersect and the like:

public static bool Contains<TSource>(
    this IEnumerable<TSource> source,
    TSource value)

public static bool Contains<TSource>(
    this IEnumerable<TSource> source,
    TSource value,
    IEqualityComparer<TSource> comparer)

The operator simply returns a Boolean indicating whether or not "value" was found in "source". The salient points of its behaviour should be predictable now:

  • It uses immediate execution (as it’s returning a simple value instead of a sequence)
  • The source parameter cannot be null, and is validated immediately
  • The value parameter can be null: it’s valid to search for a null value within a sequence
  • The comparer parameter can be null, which is equivalent to passing in EquailtyComparer<TSource>.Default.
  • The overload without a comparer uses the default equality comparer too.
  • If a match is found, the method returns immediately without reading the rest of the input sequence.
  • There’s a documented optimization for ICollection<T> – but there are significant issues with it…

So far, so good.

What are we going to test?

Aside from argument validation, I have tests for the value being present in the source, and it not being present in the source… for the three options of "no comparer", "null comparer" and "specific comparer".

I then have one final test to validate that we return as soon as we’ve found a match, by giving a query which will blow up when the element after the match is computed.

Frankly none of the tests are earth-shattering, but in the spirit of giving you an idea of what they’re like, here’s one with a custom comparer – we use the same source and value for a "default comparer" test which doesn’t find the value as the case differs:

[Test]
public void MatchWithCustomComparer()
{
    // Default equality comparer is ordinal
    string[] source = { "foo", "bar", "baz" };
    Assert.IsTrue(source.Contains("BAR", StringComparer.OrdinalIgnoreCase));
}

Currently I don’t have a test for the optimization mentioned in the bullet points above, as I believe it’s broken. More later.

Let’s implement it!

To start with, let’s dispense with the overload without a comparer parameter: that just delegates to the other one by specifying EqualityComparer<TSource>.Default. Trivial. (Or so we might think. There’s more to this than meets the eye.)

I’ve got three implementations, but we’ll start with just two of them. Which one you pick would depend on whether you’re happy to use one operator to implement another. If you think that’s okay, it’s really simple:

public static bool Contains<TSource>(
    this IEnumerable<TSource> source,
    TSource value,
    IEqualityComparer<TSource> comparer)
{
    comparer = comparer ?? EqualityComparer<TSource>.Default;
    return source.Any(item => comparer.Equals(value, item));
}

"Any" has exactly the traits we want, including validation of the non-nullity of "source". It’s hardly complicated code if we don’t use Any though:

public static bool Contains<TSource>(
    this IEnumerable<TSource> source,
    TSource value,
    IEqualityComparer<TSource> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    comparer = comparer ?? EqualityComparer<TSource>.Default;
    foreach (TSource item in source)
    {
        if (comparer.Equals(value, item))
        {
            return true;
        }
    }
    return false;
}

Obviously there’s a slight penalty in using Any just because of executing a delegate on each iteration – and the extra memory requirement of building an object to capture the comparer. I haven’t measured the performance impact of this – again, it’s a candidate for benchmarking.

Can’t we optimize? (And why does LINQ to Objects think it can?)

The implementations above are all very well, but they feel ever so simplistic. With ElementAt, we were able to take advantage of the fact that an IList<T> allows us random access by index. Surely we’ve got similar collections which allow us to test for containment cheaply?

Well, yes and no. We’ve got IDictionary<TKey, TValue> which allows you to check for the presence of a particular key – but even it would be hard to test whether the sequence we’re looking at is the key sequence for some IDictionary<TSource, TValue>, and somehow get back to the dictionary.

ICollection<T> has a Contains method, but that doesn’t necessarily do the right thing. This is particularly troubling, as the MSDN documentation for the comparer-less overload has contradictory information:

(Summary)

Determines whether a sequence contains a specified element by using the default equality comparer.

(Remarks)

If the type of source implements ICollection<T>, the Contains method in that implementation is invoked to obtain the result. Otherwise, this method determines whether source contains the specified element.

Enumeration is terminated as soon as a matching element is found.

Elements are compared to the specified value by using the default equality comparer, Default.

Why is this troubling? Well, let’s look at a test:

[Test]
public void SetWithDifferentComparer()
{
    HashSet<string> sourceAsSet = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "foo", "bar", "baz" };
    IEnumerable<string> sourceAsSequence = sourceAsSet;
    Assert.IsTrue(sourceAsSet.Contains("BAR"));
    Assert.IsFalse(sourceAsSequence.Contains("BAR"));
    Assert.IsFalse(sourceAsSequence.Contains("BAR", StringComparer.Ordinal));
}

(This exact code won’t build in the Edulinq project configuration, as that doesn’t have a reference to the System.Core assembly which contains HashSet<T>. I’ve got a hack which allows me to run effectively this code though. See the source for details.)

Now this test looks correct to me: while we’re regarding the set as a set, it should use the set’s comparer and find "BAR" with a case-insensitive match. However, when we use it as a sequence in LINQ, it should obey the rules of Enumerable.Contains – which means that the middle call should use the default equality comparer for string. Under that equality comparer, "BAR" isn’t present.

It doesn’t: the above test fails on that middle call in LINQ to Objects, because HashSet<T> implements ICollection<T>. To fit in with the implementation, the documentation summary should actually be worded as something like:

"Determines whether a sequence contains a specified element by using the default equality comparer if the sequence doesn’t implement ICollection<T>, or whatever equality comparison the collection uses if it does implement ICollection<T>."

Now you may be saying to yourself that this is only like relying on IList<T> to fetch an item by index in a fashion consistent with iterating over with it – but I’d argue that any IList<T> implementation which didn’t do that was simply broken… whereas ICollection<T>.Contains is specifically documented to allow custom comparisons:

Implementations can vary in how they determine equality of objects; for example, List<T> uses Comparer<T>.Default, whereas Dictionary<TKey, TValue> allows the user to specify the IComparer<T> implementation to use for comparing keys.

Let’s leave aside the fact that those "Comparer<T>" and "IComparer<T>" should be "EqualityComparer<T>" and "IEqualityComparer<T>" respectively for the minute, and just note that it’s entirely reasonable for an implementation not to use the default equality comparer. That makes sense – but I believe it also makes sense for source.Contains(value) to be more predictable in terms of the equality comparer it uses.

Now I would certainly agree that having a method call which changes semantics based on whether the compile-time type of the source is IEnumerable<T> or ICollection<T> is undesirable too… but I’m not sure there is any particularly nice solution. The options are:

  • The current LINQ to Objects implementation where the comparer used is hard to predict.
  • The Edulinq implementation where the type’s default comparer is always used… if the compile-time type means that Enumerable.Contains is used in the first place.
  • Remove the comparer-less overload entirely, and force people to specify one. This is lousy for convenience.

Note that you might expect the overload which takes a comparer to work the same way if you pass in null as the comparer – but it doesn’t. That overload never delegates to ICollection<T>.Contains.

So: convenience, predictability, consistency. Pick any two. Isn’t API design fun? This isn’t even thinking about performance, of course…

It’s worth bearing in mind that even the current behaviour which is presumably meant to encourage consistency doesn’t work. One might expect that the following would always be equivalent for any sensible collection:

var first = source.Contains(value);
var second = source.Select(x => x).Contains(value);

… but of course the second line will always use EqualityComparer<T>.Default whereas the first may or may not.

(Just for fun, think about Dictionary<TKey, TValue> which implements ICollection<KeyValuePair<TKey, TValue>>; its explicitly-implemented ICollection<T>.Contains method will use its own equality comparer for the key, but the default equality comparer for the value part of the pair. Yay!)

So can we really not optimize?

I can think of exactly one situation which we could legitimately optimize without making the behaviour hard to predict. Basically we’re fine to ask the collection to do our work for us if we can guarantee it will use the right comparer. Ironically, List<T>.Contains has an overload which allows us to specify the equality comparer, so we could delegate to that – but it’s not going to be significantly faster than doing it ourselves. It’s still got to look through everything.

ISet<T> in .NET 4 doesn’t help us much – its API doesn’t talk about equality comparers. (This makes a certain amount of sense – consider SortedSet<T> which uses IComparer<T> instead of IEqualityComparer<T>. It wouldn’t make sense to ask a SortedSet<T> for an equality comparer – it couldn’t give you one, as it wouldn’t know how to produce a hash code.)

However, HashSet<T> does give us something to work with. You can ask a HashSet<T> which equality comparer it uses, so we could delegate to its implementation if and only if it would use the one we’re interested in. We can bolt that into our existing implementation pretty easily, after we’ve worked out the comparer to use:

HashSet<TSource> hashSet = source as HashSet<TSource>;
if (hashSet != null && comparer.Equals(hashSet.Comparer))
{
    return hashSet.Contains(value);
}

So is this worth including or not?

Pros:

  • It covers one of the biggest use cases for optimizing Contains; I suspect this is used more often than the LINQ implementation of Contains working over a dictionary.
  • So long as the comparer doesn’t override Equals in a bizarre way, it should be a true optimization with no difference in behaviour.
  • The optimization is applied for both overloads of Enumerable.Contains, not just the comparer-less one.

Cons:

  • It’s specific to HashSet<T> rather than an interface type. That makes it feel a little too specific to be a good target of optimization.
  • We’ve still got the issue of consistency in terms of sourceAsSet.Contains(value) vs sourceAsSequence.Contains(value)
  • There’s a tiny bit of overhead if the source isn’t a hash set, and a further overhead if it is a hash set but with the wrong comparer. I’m not too bothered about this.

It’s not the default implementation in Edulinq at the moment, but I could possibly be persuaded to include it. Likewise I have a conditionally-compiled version of Contains which is compatible with LINQ to Objects, with the "broken" optimization for the comparer-less overload; this is turned off by default too.

Conclusion

Gosh! I hadn’t expected Contains to be nearly this interesting. I’d worked out that optimization would be a pain, but I hadn’t expected it to be such a weird design choice.

This is the first time I’ve deliberately gone against the LINQ to Objects behaviour, other than the MS bug around descending orderings using "extreme" comparers. The option for compatibility is there, but I feel fairly strongly that this was a bad design decision on Microsoft’s part. A bad decision out of some fairly unpleasant alternatives, I grant you. I’m willing to be persuaded of its virtues, of course – and in particular I’d welcome discussion with the LINQ team around this. In particular, it’s always fun to hear about the history of design decisions.

Next up, Cast and OfType.

7 thoughts on “Reimplementing LINQ to Objects: Part 32 – Contains”

  1. Remembering there are other situations where “seq” and “seq.Select(x => x)” are not equivalent, I of the opinion that the LINQ to Objects behavior is reasonable here. Of course, this depends on whether you interpret “seq.Contains(item)” as using *the* default comparer or *a* default comparer. The fact it’s documented smells like there was specific reasoning behind it, but as you’ve demonstrated, these docs are not water-tight.

    Like

  2. @Simon: There are cases where the differences between seq and seq.Select are detectable, but they usually make more sense than this. It’s normally either something which is detectable but only subtly so (e.g. can you then cast the result back to List, or performance) or obvious from the result type (e.g. can you call ThenBy).

    I would argue that the Contains behaviour is far from obvious, and that it’s easy to write entirely reasonable-looking tests which receive a sequence, assert its contents, and then assert that the sequence doesn’t contain a certain value. Those tests may fail, as I’ve shown.

    I think using the single word “default” to mean “a default associated with the type” and “a default associated with the sequence” without *heavily* documenting the difference is dangerous.

    Like

  3. @skeet: Well… they do document it, don’t they? I’ll admit, I didn’t know of this behavior before-hand (if I knew everything about LINQ, I wouldn’t be following this, after all!), but if I ran into this in production, I believe I’d have to think pretty hard about which is preferable anyway.

    The real issue here is whether it’s worse for an extension method on a base interface with the same signature as a method on a derived interface to have different behavior – or for the extension method to have observably different behavior for otherwise indistinguishable argument values. I lean on the former, you on the latter. It just happens that this is the only overlap between LINQ extensions and the System.Collections.Generic interfaces due to the immutability of LINQ (Enumerable.Union can’t be implemented with ISet.UnionWith, for example).

    Like

  4. @Simon: They don’t document it *clearly*. To my mind, the documentation contradicts itself. The statement that “Elements are compared to the specified value by using the default equality comparer, Default” should be qualified with “if this method doesn’t delegate to ICollection” – and the summary statement would need tweaking too.

    I agree with your reformulation of the problem – and I offer this reasoning for my preference: I can tell at compile-time what the compile-time type of my expression is, so I can make sure I use the right one. I can’t tell at compile-time what the execution-time type of the expression will be without using horrible type testing.

    A colleague has suggested an alternative approach, btw – that IEnumerable should have an EqualityComparer property, so that any sequence advertises what *its* idea of equality is. That has some issues, certainly, but it’s an interesting idea.

    Like

  5. @Matt: I haven’t yet, but when I’m “done” with Edulinq, I’m going to try running all the tests against as many implementations as I can find :)

    That won’t show performance differences of course…

    Like

  6. I think the HashSet optimization makes far more sense than the ICollection one. I know if I ran into a production bug because of the ICollection difference I would be extremely upset… but only after I wasted ages trying to find it.

    ISet is the only thing that really should matter to Contains optimizations; but since we have to make sure we’re talking about the right comparer, it makes sense to use HashSet. And it is likely a very popular type to call Contains on, or to pass to methods which call Contains (when accepting IEnumerables)

    Like

Leave a comment