Reimplementing LINQ to Objects: Part 16 – Intersect (and build fiddling)

Okay, this is more like it – after the dullness of Union, Intersect has a new pattern to offer… and one which we’ll come across repeatedly.

First, however, I should explain some more changes I’ve made to the solution structure…

Building the test assembly

I’ve just had an irritating time sorting out something I thought I’d fixed this afternoon. Fed up of accidentally testing against the wrong implementation, my two project configurations (“Normal LINQ” and “Edulinq implementation”) now target different libraries from the test project: only the “Normal LINQ” configuration refers to System.Core, and only the “Edulinq implementation” configuration refers to the Edulinq project itself. Or so I thought. Unfortunately, msbuild automatically adds System.Core in unless you’re careful. I had to add this into the “Edulinq implementation” property group part of my project file to avoid accidentally pulling in System.Core:

<AddAdditionalExplicitAssemblyReferences>false</AddAdditionalExplicitAssemblyReferences>

Unfortunately, at that point all the extension methods I’d written within the tests project – and the references to HashSet<T> – failed. I should have noticed them not failing before, and been suspicious. Hey ho.

Now I’m aware that you can add your own version of ExtensionAttribute, but I believe that can become a problem if at execution time you do actually load System.Core… which I will end up doing, as the Edulinq assembly itself references it (for HashSet<T> aside from anything else).

There may well be multiple solutions to this problem, but I’ve come up with one which appears to work:

  • Add an Edulinq.TestSupport project, and refer to that from both configurations of Edulinq.Tests
  • Make Edulinq.TestSupport refer to System.Core so it can use whatever collections it likes, as well as extension methods
  • Put all the non-test classes (those containing extension methods, and the weird and wonderful collections) into the Edulinq.TestSupport project.
  • Add a DummyClass to Edulinq.TestSupport in the namespace System.Linq, so that using directives within Edulinq.Tests don’t need to be conditionalized

All seems to be working now – and finally, if I try to refer to an extension method I haven’t implemented in Edulinq yet, it will fail to compile instead of silently using the System.Core one. Phew. Now, back to Intersect…

What is it?

Intersect has a now-familiar pair of overloads:

public static IEnumerable<TSource> Intersect<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)

public static IEnumerable<TSource> Intersect<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)

Fairly obviously, it computes the intersection of two sequences: the elements that occur in both “first” and “second”. Points to note:

  • Again, the first overload uses the default equality comparer for TSource, and the second overload does if you pass in “null” as the comparer, too.
  • Neither “first” nor “second” can be null
  • The method does use deferred execution – but in a way which may seem unusual at first sight
  • The result sequence only contains distinct elements – even if an element occurs multiple times in both input sequences, it will only occur once in the result

Now for the interesting bit – exactly when the two sequences are used. MSDN claims this:

When the object returned by this method is enumerated, Intersect enumerates first, collecting all distinct elements of that sequence. It then enumerates second, marking those elements that occur in both sequences. Finally, the marked elements are yielded in the order in which they were collected.

This is demonstrably incorrect. Indeed, I have a test which would fail under LINQ to Objects if this were the case. What actually happens is this:

  • Nothing happens until the first result element is requested. I know I’ve said so already, but it’s worth repeating.
  • As soon as the first element of the result is requested, the whole of the “second” input sequence is read, as well as the first (and possibly more) elements of the “first” input sequence – enough to return the first result, basically.
  • Results are read from the “first” input sequence as and when they are required. Only elements which were originally in the “second” input sequence and haven’t already been yielded are returned.

We’ll see this pattern of “greedily read the second sequence, but stream the first sequence” again when we look at Join… but let’s get back to the tests.

What are we going to test?

Obviously I’ve got simple tests including:

  • Argument validation
  • Elements which occur multiple times in each sequence
  • The overload without a comparer specified
  • Specifying a null comparer
  • Specifying a “custom” comparer (I’m using the case-insensitive string comparer again; news at 11)

However, the interesting tests are the ones which show how the sequences are actually consumed. Here they are:

[Test]
public void NoSequencesUsedBeforeIteration()
{
    var first = new ThrowingEnumerable();
    var second = new ThrowingEnumerable();
    // No exceptions!
    var query = first.Intersect(second);
    // Still no exceptions… we’re not calling MoveNext.
    using (var iterator = query.GetEnumerator())
    {
    }
}

[Test]
public void SecondSequenceReadFullyOnFirstResultIteration()
{
    int[] first = { 1 };
    var secondQuery = new[] { 10, 2, 0 }.Select(x => 10 / x);

    var query = first.Intersect(secondQuery);
    using (var iterator = query.GetEnumerator())
    {
        Assert.Throws<DivideByZeroException>(() => iterator.MoveNext());
    }
}

[Test]
public void FirstSequenceOnlyReadAsResultsAreRead()
{
    var firstQuery = new[] { 10, 2, 0, 2 }.Select(x => 10 / x);
    int[] second = { 1 };

    var query = firstQuery.Intersect(second);
    using (var iterator = query.GetEnumerator())
    {
        // We can get the first value with no problems
        Assert.IsTrue(iterator.MoveNext());
        Assert.AreEqual(1, iterator.Current);

        // Getting at the *second* value of the result sequence requires
        // reading from the first input sequence until the “bad” division
        Assert.Throws<DivideByZeroException>(() => iterator.MoveNext());
    }
}

The first test just proves that execution really is deferred. That’s straightforward.

The second test proves that the “second” input sequence is completely read as soon as we ask for our first result. If the operator had really read “first” and then started reading “second”, it could have yielded the first result (1) without throwing an exception… but it didn’t.

The third test proves that the “first” input sequence isn’t read in its entirety before we start returning results. We manage to read the first result with no problems – it’s only when we ask for the second result that we get an exception.

Let’s implement it!

I have chosen to implement the same behaviour as LINQ to Objects rather than the behaviour described by MSDN, because it makes more sense to me. In general, the “first” sequence is given more importance than the “second” sequence in LINQ: it’s generally the one which is streamed, with the second sequence being buffered if necessary.

Let’s start off with the comparer-free overload – as before, it will just call the other one:

public static IEnumerable<TSource> Intersect<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    return Intersect(first, second, EqualityComparer<TSource>.Default);
}

Now for the more interesting part. Obviously we’ll have an argument-validating method, but what should we do in the guts? I find the duality between this and Distinct fascinating: there, we started with an empty set of elements, and tried to add each source element to it, yielding the element if we successfully added it (meaning it wasn’t there before).

This time, we’ve effectively got a limited set of elements which we can yield, but we only want to yield each of them once – so as we see items, we can remove them from the set, yielding only if that operation was successful. The initial set is formed from the “second” input sequence, and then we just iterate over the “first” input sequence, removing and yielding appropriately:

public static IEnumerable<TSource> Intersect<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{           
    if (first == null)
    {
        throw new ArgumentNullException(“first”);
    }
    if (second == null)
    {
        throw new ArgumentNullException(“second”);
    }
    return IntersectImpl(first, second, comparer ?? EqualityComparer<TSource>.Default);
}

private static IEnumerable<TSource> IntersectImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    HashSet<TSource> potentialElements = new HashSet<TSource>(second, comparer);
    foreach (TSource item in first)
    {
        if (potentialElements.Remove(item))
        {
            yield return item;
        }
    }
}

Next time we’ll see a cross between the two techniques.

Ta-da – it all works as expected. I don’t know whether this is how Intersect really works in LINQ to Objects, but I expect it does something remarkably similar.

Conclusion

After suffering from some bugs earlier today where my implementation didn’t follow the documentation, it’s nice to find some documentation which doesn’t follow the real implementation :)

Seriously though, there’s an efficiency point to be noted here. If you have two sequences, one long and one short, then it’s much more efficient (mostly in terms of space) to use longSequence.Intersect(shortSequence) than shortSequence.Intersect(longSequence). The latter will require the whole of the long sequence to be in memory at once.

Next up – and I might just manage it tonight – our final set operator, Except.

13 thoughts on “Reimplementing LINQ to Objects: Part 16 – Intersect (and build fiddling)”

  1. An optimisation for when the second is a subset of first and especially if first is large or expensive to iterate would be to check potentialElements.Count after a successful remove (and yield) and if the count is zero, break. Maybe this is too specific to carry the extra code/clarity cost?

    Eamon

    Like

  2. I’m glad to learn that the documentation is incorrect… I tried to implement this myself, and I spend a few minutes scratching my head when I read the remarks in the documentation. Perhaps this should be reported to MS…

    Like

  3. “it’s much more efficient (mostly in terms of space) to use longSequence.Intersect(shortSequence) than shortSequence.Intersect(longSequence)”

    What about time?

    Creating the hash set is O(n). Each Remove call is O(1) so we have two cases: one is O(n+m) and the other is O(m+n). As far as big-O is concerned, they’re equal. But are they really?

    This sounds like something that should be measured.

    Like

  4. The implementation of Intersect (both in Linq and Eduling) seem too restrictive when it comes to infinite sequences since one of the source sequences has to be fully iterated upfront.

    For example, the following test will throw an OutOfMemory exception where clearly there are items that could be yield returned.

    [Test]
    public void TestInfiniteSequence()
    {
    IEnumerable naturalNumbers = GetNaturalNumbers(); //1, 2, 3,…
    IEnumerable evenNumbers = GetEvenNumbers(); //0, 2, 4,…
    var query = naturalNumbers.Intersect(evenNumbers);
    using (var iterator = query.GetEnumerator())
    {
    Assert.IsTrue(iterator.MoveNext());
    }
    }

    I understant that any possible implementation that deals with above test case may be too inefficient to be worthwile. I am just curious to understand why Intersect is implemented the way it is.

    Like

  5. @payman: Well, we *could* keep two gradually-increasing-in-size sets of elements, one for each sequence, and call MoveNext() on each iterator in turn. That way, however:

    – It would be harder to predict the order; it’s not *specified* for Intersect, but actually it easily *could* be, and it’s easy to predict currently… in a way which follows how other query operators such as GroupBy work.

    – Unless we made it quite complicated (knowing not to bother to add to the “first” set when “second” had run out of elements and vice versa) we could run into similarly tricky situations with infinite sequences… a short second sequence with an infinite first sequence is fine with the current implementation, but could be awful with a naive “one each way” implementation.

    There are definitely pros and cons for each approach… and if you know that each sequence is *ordered* of course, that turns it into an entirely different matter. Maybe I’ll look at some alternative implementations when I’ve finished the other operators.

    Like

  6. It does get quite complicated, but it’s not *too* horrendous:

    private static IEnumerable IntersectImpl(IEnumerable first, IEnumerable second, IEqualityComparer comparer)
    {
    var intersection = new HashSet(comparer);
    var firstSet = new HashSet(comparer);
    var secondSet = new HashSet(comparer);

    using (var firstEnumerator = first.GetEnumerator())
    using (var secondEnumerator = second.GetEnumerator())
    {
    bool firstHasValues = firstEnumerator.MoveNext();
    bool secondHasValues = secondEnumerator.MoveNext();

    while (firstHasValues && secondHasValues)
    {
    T currentFirst = firstEnumerator.Current;
    if (secondSet.Contains(currentFirst) && !intersection.Contains(currentFirst))
    {
    yield return currentFirst;
    intersection.Add(currentFirst);
    }
    firstSet.Add(currentFirst);

    T currentSecond = secondEnumerator.Current;
    if (firstSet.Contains(currentSecond) && !intersection.Contains(currentSecond))
    {
    yield return currentSecond;
    intersection.Add(currentSecond);
    }
    secondSet.Add(currentSecond);

    firstHasValues = firstEnumerator.MoveNext();
    secondHasValues = secondEnumerator.MoveNext();
    }

    while (firstHasValues)
    {
    T currentFirst = firstEnumerator.Current;
    if (secondSet.Contains(currentFirst) && !intersection.Contains(currentFirst))
    {
    yield return currentFirst;
    intersection.Add(currentFirst);
    }

    firstHasValues = firstEnumerator.MoveNext();
    }

    while (secondHasValues)
    {
    T currentSecond = secondEnumerator.Current;
    if (firstSet.Contains(currentSecond) && !intersection.Contains(currentSecond))
    {
    yield return currentSecond;
    intersection.Add(currentSecond);
    }

    secondHasValues = secondEnumerator.MoveNext();
    }
    }
    }

    Like

  7. Configurator is joking of course, but the actual implementation of IntersectIterator (what the BCL calls IntersectImpl) is almost identical to yours except that Set is being used instead of HashSet (at least for 4.0, I’ve not checked 3.5), and there is a loop to add the members of second to the set, rather than passing them the Set’s Constructor.

    Like

  8. Shouldn’t line 8 in the test code block be a call to Intersect instead of Union? That would become:

    [Test]
    public void NoSequencesUsedBeforeIteration()
    {
    var first = new ThrowingEnumerable();
    var second = new ThrowingEnumerable();
    // No exceptions!
    var query = first.Intersect(second);
    // Still no exceptions… we’re not calling MoveNext.
    using (var iterator = query.GetEnumerator())
    {
    }
    }

    Like

Leave a comment