I’ve been reviewing chapter 10 of C# in Depth, which is about extension methods. This is where I start introducing some of the methods in System.Linq.Enumerable, such as Where and Reverse. I introduce a few pieces of terminology in callouts – and while I believe I’m using this terminology correctly according to MSDN, I suspect that some of it isn’t quite what many developers expect… in particular, what does it mean for something to be "lazy"?
Let’s start off with a question: is Enumerable.Reverse lazy? Just so we’re all on the same page, here are the interesting bits of the behaviour of Reverse:
- The call to Reverse doesn’t fetch any items – it merely checks that you’ve not given it a null sequence, stores a reference to that sequence, and returns a new sequence.
- Once the first item is returned from the returned sequence, all the items in the original sequence are fetched and buffered. Obviously this is required, as the first item in the reversed sequence is the last item in the original sequence.
So, is that lazy or not?
One simple definition of lazy
This morning I tweeted on the somewhat ambiguous notion of something being "lazy" – and the responses I received were along the lines of "it’s deferred execution". You could potentially sum up this notion of laziness as:
An operation is lazy if it defers work until it is required in order to return a result.
By that definition, Reverse is lazy. Assuming we don’t want to perform special optimisations for IList<T> (which change the exact behaviour), Reverse does as little work as it can – it just so happens that when the first item is requested, it has to drain the source sequence.
The MSDN definition of lazy
MSDN describes deferred execution, lazy evaluation and eager evaluation somewhat differently. Admittedly the page I’m taking these definitions from is in the context of LINQ to XML, but that effectively means it’s describing LINQ to Objects. It defines them like this:
Deferred execution means that the evaluation of an expression is delayed until its realized value is actually required.
In lazy evaluation, a single element of the source collection is processed during each call to the iterator. This is the typical way in which iterators are implemented.
In eager evaluation, the first call to the iterator will result in the entire collection being processed. A temporary copy of the source collection might also be required. For example, the OrderBy method has to sort the entire collection before it returns the first element.
Now, I take slight issue with the definition of lazy evaluation here as it specifies that a single element of the source collection is processed on each call. That’s fine for Cast, OfType, Select and a few other operators – but it would preclude Where, which might have to pull several source elements before it finds one which matches the specified filter. I still think of Where as being lazy.
My definition of lazy
Thinking about this more, I believe the following definition of laziness is helpful:
(This space left intentionally blank.)
I don’t believe lazy is a useful term, basically. It’s like "strong typing" – you get some sort of woolly feeling about how something will behave if it’s lazy, but chances are you’ll need something more precise anyway.
For the purposes of LINQ to Objects-style operators which return IEnumerable<T> or any interface derived from it (including IOrderedEnumerable<T>), I propose the following definitions:
An operation uses deferred execution if no elements are fetched from the source sequence until the first element is fetched from the result sequence. This applies to basically all LINQ to Objects operators which return a sequence interface. (It doesn’t apply to ToArray or ToList of course.)
An operation is streaming if it only fetches elements from the source sequence as it requires them, and does not store those elements internally unless otherwise specified.
An operation is buffering if it drains the source sequence (i.e. fetches all the elements) when the first element of the result sequence is fetched, and stores the items internally.
I’m not even entirely comfortable with this: you could claim that Reverse is "streaming but with internal storage" – but that’s against the spirit of the definitions. Why did I mention the internal storage at all? Well, consider Distinct… that streams data in that it will the result sequence will return the first element immediately after reading the first element from the source sequence, and so on – but it has to remember all the elements it’s already returned, for obvious reasons.
Some operations are buffering in one input and streaming in another – for example, Join will read all of the "inner" sequence as soon as it’s asked for its first element, but streams the "outer" sequence.
Why does this matter?
Is this just another example of my pedantry and desire for precision? Not really. Consider my old favourite example of LINQ: processing huge log files. Suppose each log entry contains a user ID, and we’ve got a handy log reader which will iterate over all the log files, yielding one log entry at a time.
- Using entries.Reverse() would be disastrous if we ever actually used the result. We really, really don’t want to load everything into memory.
- Using entries.Select(x => x.UserId) would be fine, so long as we used the result without buffering it ourselves.
- Using entries.Select(x => x.UserId).Distinct() might be fine – it would depend on how many users we have. If you’re processing some company-internal application logs, that’s probably okay even if you’ve generated a huge number of log entries. If you’re processing FaceBook logs, you could have more of a problem.
Basically, you need to know what an operation will do with its input before you can decide whether or not it’s usable for a particular situation.
The best answer for this (IMO) is to document any such methods meticulously. Yes, you then lose the flexibility of changing the behaviour at a later date – but at least you’ve then provided something that can be used with confidence.
Note that Reactive Extensions has a similar problem, but in a slightly different form – in that case, the distinction between "hot" and "cold" observables can make a big difference, along with scheduling etc. Again, documentation is the key in my view.