Reimplementing LINQ to Objects: Part 35 – Zip

January 14, 2011 jonskeet 18 Comments

Zip will be a familiar operator to any readers who use Python. It was introduced in .NET 4 – it’s not entirely clear why it wasn’t part of the first release of LINQ, to be honest. Perhaps no-one thought of it as a useful operator until it was too late in the release cycle, or perhaps implementing it in the other providers (e.g. LINQ to SQL) took too long. Eric Lippert blogged about it in 2009, and I find it interesting to note that aside from braces, layout and names we’ve got exactly the same code. (I read the post at the time of course, but implemented it tonight without looking back at what Eric had done.) It’s not exactly surprising, given how trivial the implementation is. Anyway, enough chit-chat…

What is it?

Zip has a single signature, which isn’t terribly complicated:

public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(
    this IEnumerable<TFirst> first,
    IEnumerable<TSecond> second,
    Func<TFirst, TSecond, TResult> resultSelector)

Just from the signature, the name, and experience from the rest of this blog series it should be easy enough to guess what Zip does:

It uses deferred execution, not reading from either sequence until the result sequence is read
All three parameters must be non-null; this is validated eagerly
Both sequences are iterated over "at the same time": it calls GetEnumerator() on each sequence, then moves each iterator forward, then reads from it, and repeats.
The result selector is applied to each pair of items obtained in this way, and the result yielded
It stops when either sequence terminates
As a natural consequence of how the sequences are read, we don’t need to perform any buffering: we only care about one element from each sequence at a time.

There are really only two things that I could see might have been designed differently:

It could have just returned IEnumerable<Tuple<TFirst, TSecond>> but that would have been less efficient in many cases (in terms of the GC) and inconsistent with the rest of LINQ
It could have provided different options for what to do with sequences of differents lengths. For example:
- Throw an exception
- Use the default value of the shorter sequence type against the remaining items of the longer sequence
- Use a specified default value of the shorter sequence in the same way

I don’t have any problem with the design that’s been chosen here though.

What are we going to test?

There are no really interesting test cases here. We test argument validation, deferred execution, and the obvious "normal" cases. I do have tests where "first" is longer than "second" and vice versa.

The one test case which is noteworthy isn’t really present for the sake of testing at all – it’s to demonstrate a technique which can occasionally be handy. Sometimes we really want to perform a projection on adjacent pairs of elements. Unfortunately there’s no LINQ operator to do this naturally (although it’s easy to write one) but Zip can provide a workaround, so long as we don’t mind evaluating the sequence twice. (That could be a problem in some cases, but is fine in others.)

Obviously if you just zip a sequence with itself directly you get each element paired with the same one. We effectively need to "shift" or "delay" one sequence somehow. We can do this using Skip, as shown in this test:

[Test]
public void AdjacentElements()
{
    string[] elements = { "a", "b", "c", "d", "e" };
    var query = elements.Zip(elements.Skip(1), (x, y) => x + y);
    query.AssertSequenceEqual("ab", "bc", "cd", "de");
}

It always takes me a little while to work out whether I want to make first skip or second – but if we want the second element as the first element of second (try getting that right ten times in a row – it makes sense, honest!) means that we want to call Skip on the sequence used as the argument for second. Obviously it would work the other way round too – we’d just get the pairs presented with the values switched, so the results of the query above would be "ba", "cb" etc.

Let’s implement it!

Guess what? It’s yet another operator with a split implementation between the argument validation and the "real work". I’ll skip argument validation, and get into the tricky stuff. Are you ready? Sure you don’t want another coffee?

private static IEnumerable<TResult> ZipImpl<TFirst, TSecond, TResult>(
    IEnumerable<TFirst> first,
    IEnumerable<TSecond> second,
    Func<TFirst, TSecond, TResult> resultSelector)
{
    using (IEnumerator<TFirst> iterator1 = first.GetEnumerator())
    using (IEnumerator<TSecond> iterator2 = second.GetEnumerator())
    {
        while (iterator1.MoveNext() && iterator2.MoveNext())
        {
            yield return resultSelector(iterator1.Current, iterator2.Current);
        }
    }
}

Okay, so possibly "tricky stuff" was a bit of an overstatement. Just about the only things to note are:

I’ve "stacked" the using statements instead of putting the inner one in braces and indenting it. For using statements with different variable types, this is one way to keep things readable, although it can be a pain when tools try to reformat the code. (Also, I don’t usually omit optional braces like this. It does make me feel a bit dirty.)
I’ve used the "symmetric" approach again instead of a using statement with a foreach loop inside it. That wouldn’t be hard to do, but it wouldn’t be as simple.

That’s just about it. The code does exactly what it looks like, which doesn’t make for a very interesting blog post, but does make for good readability.

Conclusion

Two operators to go, one of which I might not even tackle fully (AsQueryable – it is part of Queryable rather than Enumerable, after all).

AsEnumerable should be pretty easy…

18 thoughts on “Reimplementing LINQ to Objects: Part 35 – Zip”

Alex Humphrey says:

January 14, 2011 at 4:01 pm

That’s another method I implemented in one of our libs which will have to go when we finally move to .NET 4.0. (the other is String.Join(String, IEnumerable) implemented as an extension method). It’s reassuring to see I got the same code as you and Eric.

I don’t think the using stacking is too dirty, although I understand that you do need to be aware that it only works by virtue of the fact the second using block is regarded as a single statement. It’s a shame there’s no syntax for declaring multiple variables of differing types in a single using block. It’s either using stacking or code creeping across your screen.

Another way might be a type like CompositeDisposable, although that’s probably overkill.

LikeLike

Reply
Strilanc says:

January 14, 2011 at 4:29 pm

C# doesn’t allow you to do “using x=E1, y=E2”? VB does.

Zip is the main operator I wish worked in query expressions. In particular, the naming is a lot worse with lambdas than it would be with some nice behind-the-scenes anonymous types.

IEnumerable Step(this IEnumerable sequence, int step)
{
return from e in sequence
zip i in naturals
where i % step == 0
select e
}

LikeLike

Reply
skeet says:

January 14, 2011 at 5:18 pm

@Strilanc: C# allows multiple variables to be declared in one using statement, but only if they’re of the same type.

LikeLike

Reply
Simon Buchan says:

January 14, 2011 at 11:46 pm

From memory, Visual Studio knows what’s going on and will *unindent* to that formatting for consecutive usings.

Also, I’m not sure how I feel about the different length behavior. This way does make stuff easier, but the BCL has tended towards ensuring you know what’s going on.

LikeLike

Reply
skeet says:

January 15, 2011 at 1:43 am

@Simon: I’m not quite sure what you mean about the “different length behaviour” – could you clarify?

LikeLike

Reply
Simon Buchan says:

January 15, 2011 at 2:21 am

@skeet: Sorry, I was a little unclear, the documented behavior of Zip when the sequences have different lengths. Not referencing your implementation.

LikeLike

Reply
configurator says:

January 15, 2011 at 1:33 pm

I was going to write a comment but Simon already wrote it.

VS treats “stacked” usings very nicely. And the different length behaviour is surprising and would cause loss of information, which isn’t a good combination. What’s wrong with getting default(T) for the rest?

LikeLike

Reply
James Miles says:

January 17, 2011 at 9:05 am

While they didn’t release Zip in the original release of LINQ. They included it as a “Custom Operator” in the samples. However it had a different name (Combine).

http://msdn.microsoft.com/en-us/vcsharp/aa336749#combine

LikeLike

Reply
Brandon Dimperio says:

February 1, 2011 at 4:39 pm

Zip done in LinqBridge compatible terms for VB as

Sub Main
dim elements() as string={“a”,”b”,”c”,”d”,”e”}
dim query = zipbyindex(elements,elements.Skip(1), Function (x,y) x+y)
‘query.AssertSequenceEqual(“ab”, “bc”, “cd”, “de”)
query.Dump()

End Sub

‘ Define other methods and classes here

Public Function ZipByIndex(Of TFirst, TSecond, TResult)(ByVal first As IEnumerable(Of TFirst), _
ByVal second As IEnumerable(Of TSecond), _
ByVal resultSelector As Func(Of TFirst, TSecond, TResult) _
) As IEnumerable(Of TResult)

Dim q = From f In first.Select(Function(item1, index) New With {index, item1}) _
Join s In second.Select(Function(item2, index) New With {index, item2}) _
On f.index Equals s.index _
Select resultSelector(f.item1, s.item2)
Return q
End Function

Where there other tests to pass or run? or any obvious issues you see with this?

LikeLike

Reply
skeet says:

February 1, 2011 at 4:50 pm

@Brandon: Using a join means the second sequence is completely consumed. Zip itself can be used with two infinite sequences – the result will be infinite too, but with no significant memory usage.

LikeLike

Reply
Brandon Dimperio says:

February 1, 2011 at 7:17 pm

alright, will note/comment that the work-around has scalability issues, anything else obvious or troubling (besides my saying Were instead of where above)?

LikeLike

Reply
skeet says:

February 1, 2011 at 7:28 pm

@Brandon: Changing the model of how it’s read fundamentally changes the operator IMO. Given that Zip is relatively easy to implement “normally”, why not just do that?

LikeLike

Reply
Brandon Dimperio says:

February 1, 2011 at 8:31 pm

Well I was trying to get this hack to work in 2.0 with linq bridge to make it pretty and do more than one zip in the same query a little less ugly.

Hacking LINQ Expressions: Select With Index

and had no luck since IQueryable has to be one of the extension methods for GetIndex to work.

LikeLike

Reply
Dennis Palmer says:

March 15, 2011 at 11:58 am

Is this a typo: “I do have tests where first is longer than short”?

Should it be “first is longer than second”?

LikeLike

Reply
skeet says:

March 15, 2011 at 1:09 pm

@Dennis: Yup, thanks – fixing now.

LikeLike

Reply
Steven Jeuris says:

April 4, 2011 at 5:06 pm

Thank you for this article! I could easily expand on it to create a zip for three sequences. :) Just wanted to point out that you forgot the ‘this’ of the first parameter.

LikeLike

Reply
skeet says:

April 5, 2011 at 12:27 am

@Steven: I didn’t – the code I’ve shown is the “impl” method which would be *called* by the actual extension method, after performing argument validation.

I deliberately left out the extension method itself as it’s tedious and exactly what you’d expect.

LikeLike

Reply
Steven Jeuris says:

April 5, 2011 at 4:54 am

@Jon: Oh, I missed that. :) I did argument checking using code contracts, so I placed everything directly in the extension method. Thanks for pointing it out.

LikeLike

Reply

Jon Skeet's coding blog

Reimplementing LINQ to Objects: Part 35 – Zip

What is it?

What are we going to test?

Let’s implement it!

Conclusion

18 thoughts on “Reimplementing LINQ to Objects: Part 35 – Zip”

Leave a comment Cancel reply

What is it?

What are we going to test?

Let’s implement it!

Conclusion

Share this:

Related

18 thoughts on “Reimplementing LINQ to Objects: Part 35 – Zip”

Leave a comment Cancel reply