Just how lazy are you?

I’ve been reviewing chapter 10 of C# in Depth, which is about extension methods. This is where I start introducing some of the methods in System.Linq.Enumerable, such as Where and Reverse. I introduce a few pieces of terminology in callouts – and while I believe I’m using this terminology correctly according to MSDN, I suspect that some of it isn’t quite what many developers expect… in particular, what does it mean for something to be "lazy"?

Let’s start off with a question: is Enumerable.Reverse lazy? Just so we’re all on the same page, here are the interesting bits of the behaviour of Reverse:

  • The call to Reverse doesn’t fetch any items – it merely checks that you’ve not given it a null sequence, stores a reference to that sequence, and returns a new sequence.
  • Once the first item is returned from the returned sequence, all the items in the original sequence are fetched and buffered. Obviously this is required, as the first item in the reversed sequence is the last item in the original sequence.

So, is that lazy or not?

One simple definition of lazy

This morning I tweeted on the somewhat ambiguous notion of something being "lazy" – and the responses I received were along the lines of "it’s deferred execution". You could potentially sum up this notion of laziness as:

An operation is lazy if it defers work until it is required in order to return a result.

By that definition, Reverse is lazy. Assuming we don’t want to perform special optimisations for IList<T> (which change the exact behaviour), Reverse does as little work as it can – it just so happens that when the first item is requested, it has to drain the source sequence.

The MSDN definition of lazy

MSDN describes deferred execution, lazy evaluation and eager evaluation somewhat differently. Admittedly the page I’m taking these definitions from is in the context of LINQ to XML, but that effectively means it’s describing LINQ to Objects. It defines them like this:

Deferred execution means that the evaluation of an expression is delayed until its realized value is actually required.

In lazy evaluation, a single element of the source collection is processed during each call to the iterator. This is the typical way in which iterators are implemented.

In eager evaluation, the first call to the iterator will result in the entire collection being processed. A temporary copy of the source collection might also be required. For example, the OrderBy method has to sort the entire collection before it returns the first element.

Now, I take slight issue with the definition of lazy evaluation here as it specifies that a single element of the source collection is processed on each call. That’s fine for Cast, OfType, Select and a few other operators – but it would preclude Where, which might have to pull several source elements before it finds one which matches the specified filter. I still think of Where as being lazy.

My definition of lazy

Thinking about this more, I believe the following definition of laziness is helpful:

(This space left intentionally blank.)

I don’t believe lazy is a useful term, basically. It’s like "strong typing" – you get some sort of woolly feeling about how something will behave if it’s lazy, but chances are you’ll need something more precise anyway.

For the purposes of LINQ to Objects-style operators which return IEnumerable<T> or any interface derived from it (including IOrderedEnumerable<T>), I propose the following definitions:

An operation uses deferred execution if no elements are fetched from the source sequence until the first element is fetched from the result sequence. This applies to basically all LINQ to Objects operators which return a sequence interface. (It doesn’t apply to ToArray or ToList of course.)

An operation is streaming if it only fetches elements from the source sequence as it requires them, and does not store those elements internally unless otherwise specified.

An operation is buffering if it drains the source sequence (i.e. fetches all the elements) when the first element of the result sequence is fetched, and stores the items internally.

I’m not even entirely comfortable with this: you could claim that Reverse is "streaming but with internal storage" – but that’s against the spirit of the definitions. Why did I mention the internal storage at all? Well, consider Distinct… that streams data in that it will the result sequence will return the first element immediately after reading the first element from the source sequence, and so on – but it has to remember all the elements it’s already returned, for obvious reasons.

Some operations are buffering in one input and streaming in another – for example, Join will read all of the "inner" sequence as soon as it’s asked for its first element, but streams the "outer" sequence.

Why does this matter?

Is this just another example of my pedantry and desire for precision? Not really. Consider my old favourite example of LINQ: processing huge log files. Suppose each log entry contains a user ID, and we’ve got a handy log reader which will iterate over all the log files, yielding one log entry at a time.

  • Using entries.Reverse() would be disastrous if we ever actually used the result. We really, really don’t want to load everything into memory.
  • Using entries.Select(x => x.UserId) would be fine, so long as we used the result without buffering it ourselves.
  • Using entries.Select(x => x.UserId).Distinct() might be fine – it would depend on how many users we have. If you’re processing some company-internal application logs, that’s probably okay even if you’ve generated a huge number of log entries. If you’re processing FaceBook logs, you could have more of a problem.

Basically, you need to know what an operation will do with its input before you can decide whether or not it’s usable for a particular situation.

The best answer for this (IMO) is to document any such methods meticulously. Yes, you then lose the flexibility of changing the behaviour at a later date – but at least you’ve then provided something that can be used with confidence.

Note that Reactive Extensions has a similar problem, but in a slightly different form – in that case, the distinction between "hot" and "cold" observables can make a big difference, along with scheduling etc. Again, documentation is the key in my view.

16 thoughts on “Just how lazy are you?”

  1. I don’t think calling Reverse “streaming with internal storage” is a good idea. I think you should draw the line between streaming and buffering in the fact that Reverse will *always* drain the entire sequence, while Distinct won’t.
    Think of infinite sequences. Reverse can’t handle them, but Distinct can. In fact, because streams are in some functional contexts treated like infinite lists, it makes even more sense for me to draw the distinction between streaming and buffering at the fact that an operator can or can’t work on streams, aka infinite lists.

    Like

  2. I agree with both the need to ditch the eager/lazy evaluation terms and to use streaming/buffering instead. I naturally tend toward using deferred execution,. streaming and buffering though this has been solidified since reading some of the MoreLinq code.

    That said, I also can see a need for finer granularity to explain what a call is doing and how besides just streaming and buffering. There’s something missing that makes one thing “Oh, so this one consumes the sequence as it goes whereas this one does it all at once” and something is needed that covers calls that buffer chunks in a streaming manner.

    Perhaps, the concept of stream-chunks is need to indicate exactly what is being streamed versus what is being buffered.

    Like

  3. I agree that the terms “streaming” and “buffering” are much more useful than “deferred”, “lazy”, or “eager”.

    But anyway, the actual behavior of the method might be different depending on the actual data source : for instance, for instance, Reverse is not buffering when working on a IList, since it can access each item by its index…

    Like

  4. This is where Ruby’s convention of suffixing method names with ! to indicate a side-effect is so nice. If LINQ methods which cause the underlying source to be run ended in !, you’d know right there in the editor.

    e.g.

    xml.Elements().Select(e => e.Value).Sort!();

    or

    lines.Run!(l => Console.WriteLine(l));

    Like

  5. I suggest the best thing to do is give O(n) definitions of time (items pulled from the source and perhaps items resulting if not one to one or less) and space (extra space required internally)

    * up front costs where appropriate (so on creation, first call, or xth call for something with complex behaviour)

    * total cost if you iterate over the entire result (if infinite then as a function of the number of elements requested)

    There will be some that don’t fit in this model, but they will almost certainly require more detailed descriptions.

    Where the costs are dependent not on source.Count() (or result.Count()) then define the variable in question and use it (plus an indication of the worst case, typically when x == source.Count())

    so for example:

    Where()

    let N = elements in source
    let M = matching elements in source

    * constant up front cost
    * constant space overhead
    * amortized O(1) per element retrieved
    * in worst case it will be O(N) for the first call (M == 0)
    * O(N) to exhaust the result

    Max()

    let N = number of elements in source

    * up front cost of O(N)
    * constant space overhead

    OrderBy()

    let N = elements in source

    * constant up front cost
    * first retrieval (depends on the sort – numbers for example only)
    * O(N) space overhead
    * O(N Log N) time
    * subsequent calls
    * O(1) per element retrieved
    * O(N) to exhaust the list

    OrderBy()

    let N = elements in source

    * constant up front cost
    * first retrieval (depends on the sort – numbers for example only)
    * O(N) space overhead
    * O(N Log N) time
    * subsequent calls
    * O(1) per element retrieved
    * O(N) to exhaust the list

    Concat()

    let N1 = elements in first source
    let N2 = elements in second source

    * constant up front cost
    * constant space cost
    * O(1) cost per retrieved element
    * O(N1 + N2) iteration cost

    Distinct()

    let N = elements in source
    let D = number of distinct elements

    * constant up front cost
    * O(D) space cost (worst case O(N))
    * O(1) cost per retrieved element (assuming good hash distribution – O(D) if not)
    * O(N) iteration cost

    They may choose to also state “this behaviour is not guaranteed, we may change it in future” but that expresses (to me) neatly what the cost implications are

    Like

  6. @Thomas: No, Reverse doesn’t optimise for the IList[T] case – as I’ve mentioned elsewhere, that would change the semantics in terms of what happens if you modify the list after starting to iterate over the results of Reverse. For the sake of consistency, it will *always* buffer the input sequence at the point of retrieving the first element.

    Like

  7. For buffering situations like Reverse(), I think the space semantics becomes a correctness issue. Ideally the type system should incorporate the information about streaming vs buffering in the type of the return value of the monadic operations, so that you are less likely to make stupid mistakes – put your filter before your reverse is quite different to having it after, even if you end up with the same results semantically.

    One could handle this to a degree with a LINQ to Objects runtime that was modeled around expression trees rather than lambdas. That would give it the opportunity to analyse the whole pipeline before running any of it. Such a runtime could use attributes on the various operations (like Reverse()) to inform it as to the performance characteristics etc. For it to make useful predictions, it would also need to know characteristics of the data, such as which filters (Where()) discriminate more (these should be front-loaded), but it could also be simply empirical.

    Has anyone written such a runtime? It looks like it could be an interesting side-project.

    Like

  8. Fun that you blog about this as there was a question on StackOverflow about this today that i tried to answer.

    And it’s true that MSDN should be a lot more clear on what exactly they mean the suggestion of ShuggyCoUk is interesting for this as it will allow to know the cost but also what interface to implement to optimize execution time : Maybe in face of a performance bottleneck i will be tempted to remove linq usage to fall back to manual management where in fact implementing IList instead of just ICollection will have been enough for everything to become blazing fast…

    Like

  9. Adding a bit more to ShuggyCoUk’s comment… I think they picked their three definitions of Lazy, Deferred and eager as dumbed down replacements for Big O notation.

    Their definitions allow someone to compare two potential calls and immediately know if it will traverse the entire list or not. Your “make it as optimized as possible” definition of Lazy doesn’t allow for that comparison. Lazy could still be expensive.

    Essentially you’ve argued that Lazy means optimized.

    I *do* think the definitions are dumbed down, but unfortunately, few people seem to know what Big O notation is anymore. I’d like to say that everyone who writes software has taken a class or two in algorithms and data structures… but then again… I read and answer questions for people on StackOverflow.com just like you do, so we both know that’s not the case.

    Like

  10. Jon…soory, I am SO lazy, that I could not even read your post

    You are 100% that there are far too many terms that provide a “woolly” feeling without providing any actual information. [See Eric Liperts post on “Begging the Question”]

    Like

  11. My concern is the use of a custom struct “Buffer” as seen in line 7 in .NET’s Reverse implementation….I think it masks the differences in collection types that implement IEnumerable but don’t support index access, but I’m not certain. I’ll have to look into this in more detail.

    Like

  12. Hi – Cant wait for the new book

    Anyway, I have a problem directly related to this issue, and wondered if you can give me some pointers.

    I am about to build a version of ObservableCollection (implementing INotifyPropertyChanged) as well as implementing IQueryable with a custom Linq provider.

    The task is to create a collection that can be used as a standard binding target for WPF applications, but gets data from a custom high-performance data service that only delivers rows of data via the concept of a ‘virtual viewport’ – i.e. you can only get a certain contiguous range of rows at any one call (as there may be millions or rows in the total server-side source).

    I played with the IList (non-generic) interface as well as IList, as it creates a special kind of CollectionViewSource called ListViewSource that uses a collections this[i] interface rather than enumerating the whole collection via IEnumerable – this enables me to hook up to the server-side viewport and maintain only a small collection of pages on the client (with some clevel caching algorithms)

    Now the harder part is the IQueryable

    I want to support the following expressions

    1) Where
    2) Select
    3) OrderBy
    4) OrderByDesc
    5) Count
    6) GroupBy

    and maybe some bepoke aggregate functions for calculating the greeks. The provider that will support the IQueryable will translate extression tree elements into a single string to pass to the server date (there is a bespoke string-based language for sorting/filtering etc).

    Now my big question is (finally), I must support lazy data access only, as Im trying to only bring back the rows required by the user interface via ListViewSource, and a situation where IEnumerable is invoked to run throught the entire collection must not happen.

    As Im writing my own Linq provider, I guess the implementation is down to me, but are there any ‘gotchas’ that I need to bear in mind ?

    Many thanks

    Dean

    Like

  13. I would say Reverse is completely lazy it is just that using pure sequence semantics, Reverse iteration is fundamentally going to have to complete iteration of the source sequence before it can return the first element. If the underlying sequence happens to also implement random-access lookup, the behavior can be different but that’s another matter.

    Also, we should distinguish between sequence operators and sequence iteration!

    A sequence operator can be “lazy”, while iteration over a sequence can be implemented in a more or less “lazy” way but the nature of LINQ is to be as lazy as possible; unfortunately, with Reverse “eagerness” appears to be part of the very function specification.

    Buffering is often a separate, orthogonal concern.

    One should also distinguish between explicit and implicit laziness. With C# iterators we have implicit laziness for sequences (at least all we need is yield return*) but not for expressions in general – although we can introduce explicit laziness through closures and what not.

    * In a more pure language, with no return keyword, I could imagine this syntax

    mySeq =
    0
    yield
    _++
    yield
    _ += _
    yield
    _ *= _

    where _ would represent the current value of the scope and yield would return a sequence element (same meaning as yield _).

    Enough babble from me…

    Like

  14. Reverse is lazy in the same way that Where is lazy. It just always needs the entire sequence to produce the first output element. That is rarely the case with Where, but it can surely happen.

    When something is “lazy, but eager” then something is definitely wrong with the semantics. Those should be symmetric concepts.

    I suspect that Distinct would have to deal with some issues as well if dealing with an infinite stream.

    Like

  15. ShuggyCoUk makes mention of the cost of enumerating the output sequence produced by a LINQ operator; this is something that’s not well covered in the documentation.

    For example, MSDN states that OrderBy uses deferred evaluation, but what that means is that there’s one O(n log n) hit the first time you call MoveNext. This could be undesirable if the ordering is immediately followed by something like .Take(10); strictly speaking, we don’t need to totally order the input to find the smallest ten elements.

    I thought it would be an interesting challenge to write a “lazier” OrderBy that uses deferred execution of the sort itself; I’ve blogged my results and the code at http://code.logos.com/blog/2010/04/a_truly_lazy_orderby_in_linq.html

    Like

Leave a comment