Logging enumeration flow

I’m currently reading Pro LINQ: Language Integrated Query in C# 2008 by Joe Rattz and yesterday I came across a claim about Enumerable.Intersect which didn’t quite ring true. I consulted MSDN and the documentation is exactly the same as the book. Here’s what it says:

When the object returned by this method is enumerated, Intersect enumerates first, collecting all distinct elements of that sequence. It then enumerates second, marking those elements that occur in both sequences. Finally, the marked elements are yielded in the order in which they were collected.

(first is the first parameter, the one which the method appears to be called on when using it as an extension method. second is the second parameter – the other sequence involved.)

This seems to be needlessly restrictive. In particular, it doesn’t allow you to work with an infinite sequence on either side. It also means loading the whole of both sequences into memory at the same time. Given the way that Join works, I was surprised to see this. So I thought I’d test it. This raised the question of how you trace the flow of a sequence – how do you know when data is being pulled from it? The obvious answer is to create a new sequence which fetches from the old one, logging as it goes. Fortunately this is really easy to implement:

using System;
using System.Collections.Generic;

public static class Extensions
{
    public static IEnumerable<T> WithLogging<T>(this IEnumerable<T> source,
                                                string name)
    {
        foreach (T element in source)
        {
            Console.WriteLine(“{0}: {1}”, name, element);
            yield return element;
        }
    }
}

We keep a name for the sequence so we can easily trace which sequence is being pulled from at what point. Now let’s apply this logging to a call to Intersect:

using System;
using System.Linq;
using System.Collections.Generic;

// Compile alongside the Extensions class

class Test
{
    static void Main()
    {
        var first = Enumerable.Range(1, 5).WithLogging(“first”);
        var second = Enumerable.Range(3, 5).WithLogging(“second”);
        foreach (int i in first.Intersect(second))
        {
            Console.WriteLine(“Intersect: {0}”, i);
        }
    }
}

As you can see, we’re intersecting the numbers 1-5 with the numbers 3-7 – the intersection should clearly be 3-5. We’ll see a line of output each time data is pulled from either first or second, and also when the result of Intersect yields a value. Given the documentation and the book, one would expect to see this output:

// Theoretical output. It doesn’t really do this
first: 1
first: 2
first: 3
first: 4
first: 5
second: 3
second: 4
second: 5
second: 6
second: 7
Intersect: 3
Intersect: 4
Intersect: 5

Fortunately, it actually works exactly how I’d expect: the second sequence is evaluated fully, then the first is evaluated in a streaming fashion, with results being yielded as they’re found. (This means that, if you’re sufficiently careful with the result, e.g. by calling Take with a suitably small value, the first sequence can be infinite.) Here’s the actual output demonstrating that:

// Actual output.
second: 3
second: 4
second: 5
second: 6
second: 7
first: 1
first: 2
first: 3
Intersect: 3
first: 4
Intersect: 4
first: 5
Intersect: 5

Initial Conclusion

There are two interesting points here, to my mind. The first is demonstrating that the documentation for Intersect is wrong – the real code is more sensible than the docs. That’s not as important as seeing how easy it is to log the flow of sequence data – as simple as adding a single extension method and calling it. (You could do it with a Select projection which writes the data and then yields the value of course, but I think this is neater.)

I’m hoping to finish reading Joe’s book this week, and write the review over the weekend, by the way.

Update (Sept. 11th 2008)

Frederik Siekmann replied to this post with a thrilling alternative implementation to stream the intersection, which takes alternating elements from the two sequences involved. It’s a bit more memory hungry (with three sets of elements to remember instead of just one) but it means that we can deal with two infinite streams, if we’re careful. Here’s a complete example:

using System;
using System.Collections.Generic;
using System.Linq;

public static class Extensions
{
    public static IEnumerable<T> AlternateIntersect<T>(this IEnumerable<T> first, IEnumerable<T> second)
    {
        var intersection = new HashSet<T>();
        var firstSet = new HashSet<T>();
        var secondSet = new HashSet<T>();
        using (IEnumerator<T> firstEnumerator = first.GetEnumerator())
        {
            using (IEnumerator<T> secondEnumerator = second.GetEnumerator())
            {
                bool firstHasValues = firstEnumerator.MoveNext();
                bool secondHasValues = secondEnumerator.MoveNext();
                while (firstHasValues && secondHasValues)
                {
                    T currentFirst = firstEnumerator.Current;
                    T currentSecond = secondEnumerator.Current;
                   
                    if (!intersection.Contains(currentFirst) &&
                        secondSet.Contains(currentFirst))
                    {
                        intersection.Add(currentFirst);
                        yield return currentFirst;
                    }
                    firstSet.Add(currentFirst);
                    if (!intersection.Contains(currentSecond) &&
                        firstSet.Contains(currentSecond))
                    {
                        intersection.Add(currentSecond);
                        yield return currentSecond;
                    }
                    secondSet.Add(currentSecond);
                    firstHasValues = firstEnumerator.MoveNext();
                    secondHasValues = secondEnumerator.MoveNext();
                }
                if (firstHasValues)
                {
                    do
                    {
                        T currentFirst = firstEnumerator.Current;
                        if (!intersection.Contains(currentFirst) &&
                            secondSet.Contains(currentFirst))
                        {
                            intersection.Add(currentFirst);
                            yield return currentFirst;
                        }
                    } while (firstEnumerator.MoveNext());
                }
                if (secondHasValues)
                {
                    do
                    {
                        T currentSecond = secondEnumerator.Current;
                        if (!intersection.Contains(currentSecond) &&
                            firstSet.Contains(currentSecond))
                        {
                            intersection.Add(currentSecond);
                            yield return currentSecond;
                        }
                    } while (secondEnumerator.MoveNext());
                }
            }
        }
    }
   
    public static IEnumerable<T> WithLogging<T>(this IEnumerable<T> source, string name)
    {
        foreach (T element in source)
        {
            Console.WriteLine(string.Format(“{0}: {1}”, name, element));
            yield return element;
        }
    }
}

class Test
{
    static void Main()
    {
        var positiveIntegers = Enumerable.Range(0, int.MaxValue);
       
        var multiplesOfTwo = positiveIntegers.Where(x => (x%2) == 0)
                                             .WithLogging(“Twos”);
        var multiplesOfThree = positiveIntegers.Where(x => (x%3) == 0)
                                               .WithLogging(“Threes”);
   
        foreach (int x in multiplesOfTwo.AlternateIntersect(multiplesOfThree).Take(10))
        {
            Console.WriteLine (“AlternateIntersect: {0}”, x);
        }
    }
}

Most of the code is the alternating intersection – the test at the end just shows intersection of the sequence 2, 4, 6, 8… with 3, 6, 9, 12… The output shows elements being taken from both sequences, and yielded when a match is found. It’s important that we limit the output in some way – in the above code we call Take(10) but anything which prevents the loop from just executing until we run out of memory is fine.

Twos: 0
Threes: 0
AlternateIntersect: 0
Twos: 2
Threes: 3
Twos: 4
Threes: 6
Twos: 6
Threes: 9
AlternateIntersect: 6
Twos: 8
Threes: 12
Twos: 10
Threes: 15
Twos: 12
Threes: 18
AlternateIntersect: 12
Twos: 14
Threes: 21
Twos: 16
Threes: 24
Twos: 18
Threes: 27
AlternateIntersect: 18
(etc)

That’s really neat. Quite how often it’ll be useful is a different matter, but I find this kind of thing fascinating to consider. Thanks Frederik!

9 thoughts on “Logging enumeration flow”

  1. > exactly how I’d expect: the second sequence is evaluated fully, then the first is evaluated in a streaming fashion, with results being yielded as they’re found.

    Why is it any more natural to fully evaluate second rather than first? Obviously, one of them has to be, but from just reading calling code it doesn’t seem to me to be any more intuitive that ‘second’ is evaluated and then each ‘first’ looked up in it, rather than the other way round.

    I realise we can’t get to the mathematical reality of intersect being commutative even when infinite sets are involved, since we don’t actually know whether the sets are infinite, but I can’t say I have any strong ‘feels more right’ feeling over which of these should work:

    {1, 2}.Intersect(AllNaturals)
    AllNaturals.Intersect({1, 2})

    It is of course a bit off that the docs disagree with the implementation here…

    Like

  2. It’s only a gut feeling as to why this feels more natural (as well as the fact that it’s documented that way for Join, so Intersection is just being consistent) but it’s possibly best explained with a metaphor.

    Imagine that LINQ is creating a physical pipe of data, e.g.

    source.Where(…).Select(…) etc is a series of pipes connected together in one straight line. Anything which brings in an extra sequence has a sort of “T” pipe with extra data coming in from the side. The “natural flow” for the output is with the rest of the straight line, i.e. the first sequence – so it feels reasonable that the streaming result of the output is related to the streaming of the first sequence. That means the second sequence has to be fully evaluated first.

    Does that make any sense whatsoever? If not, I can try drawing a picture and scanning it tonight… I think it’s more obvious visually than just in words.

    Jon

    Like

  3. “The “natural flow” for the output is with the rest of the straight line, i.e. the first sequence – so it feels reasonable that the streaming result of the output is related to the streaming of the first sequence. That means the second sequence has to be fully evaluated first.”

    I do agree that this seems more natural.

    Unfortunately though, the literal semantics of Intersect don’t imply any such thing. As such I think it’d be unwise to assume any particular order of evaluation. The documentation does not promise the behavior that actually occurs and while it’s unfortunate that it incorrectly _does_ document some other behavior, at least it’s less likely (though not impossible) for code that relies on the documented behavior to break (by virtue of it simply being less useful to rely on that).

    In other words, as useful as it might be to be able to use an infinite iterator as the second argument for Intersect, it sure seems dangerous to me for code to rely on that. It’s not documented to work that way, and code that relies on that behavior would break in a horrible way should the behavior change in the future.

    Like

  4. Peter: I take your point entirely. It would be somewhat different if it actually guaranteed the current behaviour. But yes, with the current docs it’s *far* from guaranteed to work. I had thought that Join documented the behaviour, but it doesn’t. It’s possible I was thinking of my book – which *does* document the streaming/buffered behaviour of Join but not Intersect!

    Like

  5. @Larry
    For the following two statement fragments;

    {1, 2}.Intersect(AllNaturals)

    AllNaturals.Intersect({1, 2})

    I think that the second fragment here seems more natural to read and write. The only reasoning I can come up with is that {1, 2} is an actual parameter, and AllNaturals is an object. An actual parameter is something you actually have, whereas an object is a representation of an abstraction. This reasoning is not scientific in any way, but it seems to explain my gut feeling at least.

    Like

  6. The book Pro LINQ: Language Integrated Query in C# 2008 Windows.Net) (Paperback) by Jr., Joseph C. Rattz (Author) has so many positive reviews on Amazon.com that it’s scary (or suspect)!

    Jon Skeet is impartial and will set the record straight I’m sure ;-)

    In the meantime, I will put this book in my Amazon shopping cart for future purchase, unless I hear it’s no good.

    RL

    Like

Leave a comment