Reimplementing LINQ to Objects: Part 40 – Optimization

I’m not an expert in optimization, and most importantly I don’t have any real-world benchmarks to support this post, so please take it with a pinch of salt. That said, let’s dive into what optimizations are available in LINQ to Objects.

What do we mean by optimization?

Just as we think of refactoring as changing the internal structure of code without changing its externally visible behaviour, optimization is the art/craft/science/voodoo of changing the performance of code without changing its externally visible behaviour. Sort of.

This requires two definitions: "performance" and "externally visible behaviour". Neither is as simple as it sounds. In almost all cases, performance is a trade-off, whether it's speed vs memory, throughput vs latency, big-O complexity vs the constant factors within that complexity bound, or best case vs worst case.

In LINQ to Objects the "best case vs worst case" is the balance which we need to consider most often: in the cases where we can make a saving, how significant is that saving? How much does it cost in every case to make a saving some of the time? How often do we actually take the fast path?

Externally visible behaviour is even harder to pin down, because we definitely don’t mean all externally visible behaviour. Almost all the optimizations in Edulinq are visible if you work hard enough – indeed, that’s how I have unit tests for them. I can test that Count() uses the Count property of an ICollection<T> instead of iterating over it by creating an ICollection<T> implementation which works with Count but throws an exception if you try to iterate over it. I don’t think we care about that sort of change to externally visible behaviour.
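
To give an idea, here's a sketch of the sort of collection such a test might use – hypothetical code, not the actual Edulinq test helper:

using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical test double: Count works, but any attempt to iterate throws.
class NonEnumerableCollection<T> : ICollection<T>
{
    private readonly List<T> items = new List<T>();

    public int Count { get { return items.Count; } }
    public bool IsReadOnly { get { return false; } }
    public void Add(T item) { items.Add(item); }
    public void Clear() { items.Clear(); }
    public bool Contains(T item) { return items.Contains(item); }
    public void CopyTo(T[] array, int arrayIndex) { items.CopyTo(array, arrayIndex); }
    public bool Remove(T item) { return items.Remove(item); }

    // If Count() ever falls back to iterating, the test fails loudly.
    public IEnumerator<T> GetEnumerator() { throw new NotSupportedException(); }
    IEnumerator IEnumerable.GetEnumerator() { throw new NotSupportedException(); }
}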

What we really mean is, "If we use the code in a sensible way, will we get the same results with the optimized code as we would without the optimization?" Much better. Nothing woolly about the term "sensible" at all, is there? More realistically, we could talk about a system where every type adheres to the contracts of every interface it implements – that would at least get rid of the examples used for unit testing. Still, even the performance of a system is externally visible… it’s easy to tell the difference between an implementation of Count() which is optimized and one which isn’t, if you’ve got a list of 10 million items.

How can we optimize in LINQ to Objects?

Effectively we have one technique for significant optimization in LINQ to Objects: finding out that a sequence implements a more capable interface than IEnumerable<T>, and then using that interface. (Or in the case of my optimization for HashSet<T> and Contains, using a concrete type in the same way.) Count() is the most obvious example of this: if the sequence implements ICollection or ICollection<T>, then we can use the Count property on that interface and we’re done. No need to iterate at all.
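
In sketch form – ignoring argument validation and the overflow checking of the full implementation – the optimization looks something like this:

public static int Count<TSource>(this IEnumerable<TSource> source)
{
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        return genericCollection.Count;      // O(1)
    }
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        return nonGenericCollection.Count;   // O(1)
    }
    // Fall back to counting by iteration: O(n)
    int count = 0;
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        while (iterator.MoveNext())
        {
            count++;
        }
    }
    return count;
}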

These are generally good optimizations because they allow us to transform an O(n) computation into an O(1) computation. That’s a pretty big win when it’s applicable, and the cost of checking is reasonably small. So long as we hit the right type once every so often – and particularly if in those cases the sequences are long – then it’s a net gain. The performance characteristics are likely to be reasonably fixed here for any one program – it’s not like we’ll sometimes win and sometimes lose for a specific query… it’s that for some queries we win and some we won’t. If we were micro-optimizing, we might want a way of calling a "non-optimized" version which didn’t even bother trying the optimization, because we know it will always fail. I would regard such an approach as a colossal waste of effort in the majority of cases.

Some optimizations are slightly less obvious, particularly because they don’t offer a change in the big-O complexity, but can still make a significant difference. Take ToArray, for example. If we know the sequence is an IList<T> we can construct an array of exactly the right size and ask the list to copy the elements into it. Chances are that copy can be very efficient indeed, basically copying a whole block of bits from one place to another – and we know we won’t need any resizing. Compare that with building up a buffer, resizing periodically including copying all the elements we’ve already discovered. Every part of that process is going to be slower, but they’re both O(n) operations really. This is a good example of where big-O notation doesn’t tell the whole story. Again, the optimization is almost certainly a good one to make.
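
This isn't the exact Edulinq code, but it shows the shape of the two paths:

public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    IList<TSource> list = source as IList<TSource>;
    if (list != null)
    {
        // One allocation of exactly the right size, one bulk copy.
        TSource[] result = new TSource[list.Count];
        list.CopyTo(result, 0);
        return result;
    }
    // Slow path: build up a buffer, doubling its size as necessary.
    TSource[] buffer = new TSource[16];
    int count = 0;
    foreach (TSource item in source)
    {
        if (count == buffer.Length)
        {
            Array.Resize(ref buffer, count * 2);
        }
        buffer[count++] = item;
    }
    Array.Resize(ref buffer, count);   // Trim to the exact size
    return buffer;
}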

Then there are distinctly dodgy optimizations which can make a difference, but are unlikely to apply. My optimization for ElementAt and ElementAtOrDefault comes into play here. It’s fine to check whether an object implements IList<T>, and use the indexer if so. That’s an obvious win. But I have an extra optimization to exit quickly if we can find out that the given index is out of the bounds of the sequence. Unfortunately that optimization is only useful when:

  • The sequence implements ICollection<T> or ICollection (but remember it already has to implement IEnumerable<T> – and there aren’t many collections which implement the non-generic ICollection and the generic IEnumerable<T> without also implementing the generic ICollection<T>)
  • The sequence doesn’t implement IList<T> (which gets rid of almost all implementations of ICollection<T>)
  • The given index is actually greater than or equal to the size of the collection

All that comes at the cost of a couple of type checks… not a great cost, and we do potentially save an O(n) check for being given an index out of the bounds of the collection… but how often are we really going to make that win? This is where I’d love to have something like Dapper, but applied to LINQ to Objects and running in a significant number of real-world projects, just logging in as light a way as possible how often we win, how often we lose, and how big the benefit is.
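
For reference, here's roughly what ElementAt looks like with both the IList<T> fast path and the early exit – a sketch rather than the exact Edulinq code, and checking only the generic ICollection<T> for brevity:

public static TSource ElementAt<TSource>(this IEnumerable<TSource> source, int index)
{
    if (index < 0)
    {
        throw new ArgumentOutOfRangeException("index");
    }
    IList<TSource> list = source as IList<TSource>;
    if (list != null)
    {
        return list[index];   // The indexer does its own bounds checking
    }
    // The rarely-useful part: bail out without iterating if we can already
    // tell that the index is beyond the end of the collection.
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null && index >= collection.Count)
    {
        throw new ArgumentOutOfRangeException("index");
    }
    // Slow path: iterate, counting down to the desired index.
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        while (iterator.MoveNext())
        {
            if (index == 0)
            {
                return iterator.Current;
            }
            index--;
        }
    }
    throw new ArgumentOutOfRangeException("index");
}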

Finally, we come to the optimizations which don’t make sense to me… such as the optimization for First in both Mono and LinqBridge. Both of these projects check whether the sequence is a list, so that they can check the count and then use the indexer to fetch item 0 instead of calling GetEnumerator()/MoveNext()/Current. Now yes, there’s a chance this avoids creating an extra object (although not always, as we’ve seen before) – but they’re both O(1) operations which are likely to be darned fast. At this point not only is the payback very small (if it even exists) but the whole operation is likely to be so fast that the tiny check for whether the object implements IList<T> is likely to become more significant. Oh, and then there’s the extra code complexity – yes, that’s only relevant to the implementers, but I’d personally rather they spent their time on other things (like getting OrderByDescending to work properly… smirk). In other words, I think this is a bad target for optimization. At some point I’ll try to do a quick analysis of just how often the collection has to implement IList<T> in order for it to be worth doing this – and whether the improvement is even measurable.

Of course there are other micro-optimizations available. When we don’t need to fetch the current item (e.g. when skipping over items) let’s just call MoveNext() instead of also assigning the value of the Current property to a variable. I’ve done that in various places in Edulinq – not as an optimization strategy (I suspect it won’t make a significant difference) but for readability: it makes it clearer to the reader that we’re just moving along the iterator, not examining the contents as we go.
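
For example, the skipping part of a Skip iterator block can look something like this (a sketch of the approach, not necessarily the exact Edulinq code):

private static IEnumerable<TSource> SkipImpl<TSource>(IEnumerable<TSource> source, int count)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        // Just move along the iterator; we never look at Current while skipping.
        for (int i = 0; i < count; i++)
        {
            if (!iterator.MoveNext())
            {
                yield break;
            }
        }
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}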

The only other piece of optimization I think I’ve performed in Edulinq is the "yield the first results before sorting the rest" part of my quicksort implementation. I’m reasonably proud of that, at least conceptually. I don’t think it really fits into any other bucket – it’s just a matter of thinking about what we really need and when, deferring work just in case we never need to do it.

What can we not optimize in LINQ to Objects?

I’ve found a few optimizations in both Edulinq and other implementations which I believe to be invalid.

Here’s an example I happened to look at just this morning, when reviewing the code for Skip:

var list = source as IList<TSource>;
if (list != null)
{
    count = Math.Max(count, 0);
    // Note that "count" is the count of items to skip
    for (int index = count; index < list.Count; index++)
    {
        yield return list[index];
    }
    yield break;
}

If our sequence is a list, we can just skip straight to the right part of it and yield the items one at a time. That sounds great, but what if the list changes (or is even truncated!) while we’re iterating over it? An implementation working with the simple iterator would usually throw an exception, as the change would invalidate the iterator. This is definitely a behavioural change. When I first wrote about Skip, I included this as a "possible" optimization – and actually turned it on in the Edulinq source code. I now believe it to be a mistake, and have removed it completely.
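
To make the difference concrete, here's the kind of scenario (purely illustrative) in which the two implementations diverge:

List<int> list = new List<int> { 1, 2, 3, 4, 5 };
IEnumerable<int> skipped = list.Skip(2);

using (IEnumerator<int> iterator = skipped.GetEnumerator())
{
    iterator.MoveNext();   // Yields 3
    list.RemoveAt(0);      // Mutate the source mid-iteration
    // A plain iterator block delegating to List<T>'s enumerator throws
    // InvalidOperationException here; the IList<T> "optimization" above
    // silently carries on from index 3 and yields 5, never showing 4.
    iterator.MoveNext();
}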

Another example is Reverse, and how it should behave. The documentation is fairly unclear, but when I ran the tests, the Mono implementation used an optimization whereby if the sequence is a list, it will just return items from the tail end using the indexer. (This has now been fixed – the Mono team is quick like that!) Again, that means that changes made to the list while iterating will be reflected in the reversed sequence. I believe the documentation for Reverse ought to be clear that:

  • Execution is deferred: the input sequence isn’t read when the method is called.
  • When the result sequence is first read by the caller, a snapshot is taken, and that’s what’s used to return the data.
  • If the result sequence is read more than once (i.e. GetEnumerator is called more than once) then a new snapshot is created each time – so changes to the input sequence between calls to GetEnumerator on the result sequence will be observed.

Now this is still not as precise as it might be in terms of what "reading" a sequence entails – in particular, a simple implementation of Reverse (as per Edulinq) will actually take the snapshot on the first call to MoveNext() on the iterator returned by GetEnumerator() – but that’s probably not too bad. The snapshotting behaviour itself is important though, and should be made explicit in my opinion.

The problem with both of these "optimizations" is arguably that they’re applying list-based optimizations within an iterator block used for deferred execution. Optimizing for lists either upfront at the point of the initial method call or within an immediate execution operator (Count, ToList etc) is fine, because we assume the sequence won’t change during the course of the method’s execution. We can’t make that assumption with an iterator block, because the flow of the code is very different: our code is visited repeatedly based on the caller’s use of MoveNext().

Sequence identity

Another aspect of behaviour which isn’t well-specified is that of identity. When is it valid for an operator to return the input sequence itself as the result sequence?

In the Microsoft implementation, this can occur in two operators: AsEnumerable (which always returns the input sequence reference, even if it’s null) and Cast (which returns the original reference only if it actually implements IEnumerable<TResult>).

In Edulinq, I have two other operators which can return the input sequence: OfType (only if the original reference implements IEnumerable<TResult> and TResult is a non-nullable value type) and Skip (if you provide a count which is zero or negative). Are these valid optimizations? Let’s think about why we might not want them to be…

If you’re returning a sequence from one layer of your code to another, you usually want that sequence to be viewed only as a sequence. In particular, if it’s backed by a List<T>, you don’t want callers casting to List<T> and modifying the list. With any operator implemented by an iterator block, that’s fine – the object returned from the operator has no accessible reference to its input, and the type itself only implements IEnumerable<T> (and IEnumerator<T>, and IDisposable, etc – but not IList<T>). It’s not so good if the operator decides it’s okay to return the original reference.
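
Here's a hypothetical illustration, relying on the Skip(0) identity behaviour mentioned above:

List<string> secrets = new List<string> { "a", "b", "c" };

// If Skip(0) returns its input unchanged, a caller can get back at the
// "private" list and modify it...
IEnumerable<string> exposed = secrets.Skip(0);
List<string> sneaky = (List<string>) exposed;   // Cast succeeds: same reference
sneaky.Add("d");

// ...whereas an operator implemented with an iterator block hides the source:
IEnumerable<string> hidden = secrets.Where(x => true);
// (List<string>) hidden would throw InvalidCastException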

The C# language specification refers to this in the section about query expression translation: a no-op projection at the end of a query can be omitted if and only if there are other operators in the query. So a query expression of "from foo in bar select foo" will translate to "bar.Select(foo => foo)" but if we had a "where" clause in the query, the Select call would be removed. It’s worth noting that the call to "Cast" generated when you explicitly specify the type of a range variable is not enough to prevent the "no-op" projection from being generated… it’s almost as if the C# team "knows" that Cast can leak sequence identity whereas Where can’t.

Personally I think that the "hiding" of the input sequence should be guaranteed where it makes sense to do so, and explicitly not guaranteed otherwise. We could also add an operator – "HideIdentity", say – which would simply (and unconditionally) add an extra iterator block into the pipeline. That way library authors wouldn’t have to guess, and would have a clear way of expressing their intention. Using Select(x => x) or Skip(0) is not clear, and in the case of Skip(0) it would even be pointless when using Edulinq.
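
Such an operator would only take a few lines – this isn't part of Edulinq, just a sketch of the idea:

public static IEnumerable<TSource> HideIdentity<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return HideIdentityImpl(source);
}

// All this adds is a new iterator object between the caller and the source.
private static IEnumerable<TSource> HideIdentityImpl<TSource>(IEnumerable<TSource> source)
{
    foreach (TSource item in source)
    {
        yield return item;
    }
}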

As for whether my optimizations are valid – that’s up for debate, really. It seems hard to justify why leaking sequence identity would be okay for Cast but not okay for OfType, whereas I think there’s a better case for claiming that Skip should always hide sequence identity.

The Contains issue…

If you remember, I have a disagreement around what Contains should do when you don’t provide an equality comparer, and when the sequence implements ICollection<T>. I believe it should be consistent with the rest of LINQ to Objects, which always uses the default equality comparer for the element type when it needs one but the user hasn’t specified one. Everyone else (Microsoft, Mono, LinqBridge) has gone with delegating to the collection’s implementation of ICollection<T>.Contains. That plays well in terms of consistency: the result of calling Contains on that object doesn’t depend on its compile-time type. That’s a debate to go into in another post, but I just want to point out that this is not an example of optimization. In some cases it may be faster (notably for HashSet<T>) but it stands a very good chance of changing the behaviour. There is absolutely nothing to suggest that the equality comparer used by ICollection<T> should be the default one for the type – and in some cases it definitely isn’t.
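
A concrete (if contrived) illustration of the difference:

// The set's own comparer is case-insensitive; the default comparer for string isn't.
IEnumerable<string> names = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Jon" };

// This calls the Enumerable.Contains extension method (the static type is IEnumerable<string>).
// Delegating to ICollection<T>.Contains (Microsoft, Mono, LinqBridge) returns true;
// always using EqualityComparer<string>.Default (Edulinq) returns false.
bool found = names.Contains("jon");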

It’s therefore a matter of what result we want to get, not how to get that result faster. It’s correctness, not optimization – but both the LinqBridge and Mono tests which fail for Edulinq are called "Contains_CollectionOptimization_ReturnsTrueWithoutEnumerating" – and I think that shows a mistaken way of thinking about this.

Can we go further?

I’ve been considering a couple of optimizations which I believe to be perfectly legitimate, but which none of the implementations I’ve seen have used. One reason I haven’t implemented them myself yet is that they would reduce the effectiveness of all my unit tests. You see, I’ve generally used Enumerable.Range as a good way of testing a non-list-based sequence… but what’s to stop Range and Repeat from returning IList<T> implementations?

All the non-mutating members are easy to implement, and we can just throw exceptions from the mutating members (as other read-only collections do).
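
Here's a sketch of what such a list-based Range might look like – this isn't in Edulinq, it's just an illustration of how cheap the members are:

using System;
using System.Collections;
using System.Collections.Generic;

internal sealed class RangeList : IList<int>
{
    private readonly int start;
    private readonly int count;

    internal RangeList(int start, int count)
    {
        this.start = start;
        this.count = count;
    }

    public int Count { get { return count; } }
    public bool IsReadOnly { get { return true; } }

    public int this[int index]
    {
        get
        {
            if (index < 0 || index >= count)
            {
                throw new ArgumentOutOfRangeException("index");
            }
            return start + index;
        }
        set { throw new NotSupportedException(); }
    }

    public bool Contains(int item) { return item >= start && item < start + count; }
    public int IndexOf(int item) { return Contains(item) ? item - start : -1; }

    public void CopyTo(int[] array, int arrayIndex)
    {
        for (int i = 0; i < count; i++)
        {
            array[arrayIndex + i] = start + i;
        }
    }

    public IEnumerator<int> GetEnumerator()
    {
        for (int i = 0; i < count; i++)
        {
            yield return start + i;
        }
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    // The mutating members simply aren't supported, as with other read-only collections.
    public void Add(int item) { throw new NotSupportedException(); }
    public void Clear() { throw new NotSupportedException(); }
    public void Insert(int index, int item) { throw new NotSupportedException(); }
    public bool Remove(int item) { throw new NotSupportedException(); }
    public void RemoveAt(int index) { throw new NotSupportedException(); }
}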

Would this be more efficient? Well yes, if you ever performed a Count(), ElementAt(), ToArray(), ToList() etc operation on a range or a repeated element… but how often is that going to happen? I suspect it’s pretty rare – probably rare enough not to make it worth my time, particularly when you then consider all the tests that would have to be rewritten to use something other than Range when I wanted a non-list sequence…

Conclusion

Surprise, surprise – doing optimization well is difficult. When it’s obvious what can be done, it’s not obvious what should be done… and sometimes it’s not even obvious what’s valid in the first place.

Note that none of this has really talked about data structures and algorithms. I looked at some options when implementing ordering, and I’m still thinking about the best approach for implementing TopBy (probably either a heap or a self-balancing tree – something which could take advantage of the size being constant would be nice) – but in general the optimizations here haven’t required any cunning knowledge of computer science. That’s quite a good thing, because it’s many years since I’ve studied CS seriously…

I suspect that with this post more than almost any other, I’m likely to want to add extra items in the future (or amend mistakes which reveal my incompetence). Watch this space.

Next up, I think it would be worth revisiting query expressions from scratch. Anyone who’s read C# in Depth or has followed this blog for long enough is likely to be able to skip it, but I think the series would be incomplete without a quick dive into the compiler translations involved.

8 thoughts on “Reimplementing LINQ to Objects: Part 40 – Optimization”

  1. Definitely agree with your opinion on Contains() not being an optimization (still disagree on whether it is correct though :P).

    I just checked, the BCL doesn’t optimize Skip() for IList either. It feels weird to leave obvious and likely common performance wins on the table, but the LINQ documentation tries hard to make it clear that the results of LINQ queries are *queries*, not results, so I get the idea. Possibly an “IList/IEnumerable ViewRange(this IList, int index, int count)” method?

  2. If you *only* consider big-O, then heapsort is optimal for TopBy – O(n + x*log(n)) where n is the size of the sequence and x is the number of items you need.

    If you try to optimize it you can build the heap sequentially and you get a natural place to drop items, but then the complexity actually increases a bit – the heapify part turns from O(n) to O(n*log(x)), so the total is O(n*log(x) + x*log(x)), which is an improvement whenever:
    n + x*log(n) > n*log(x) + x*log(x)
    or in other words, never. My point is that it at least seems that heapsort can’t be optimized using the value from TopBy. Can self-balancing trees be optimized that way? Or perhaps I’m missing something big here?

  3. TopBy() could be implemented by reading the ‘count’ first items from source into an array ‘result’, then sorting it – probably using heap_push() for each element then heap_sort() – followed by merging the following elements into ‘result’.

    For small ‘count’s, binary search (upper_bound() for stability) then copy up insertion is probably pretty good, larger ‘count’s probably want to do something like merge sort, merging back and forward between two result arrays.

    I have a dirty version hacked up, but I’d like to do some perf tests to see if it actually can run faster at all. At a quick guess this would be O(count * source.Length), which seems pretty good?

  4. “That sounds great, but what if the list changes (or is even truncated!) while we’re iterating over it?”

    There’s some conflicting MSDN documentation in this case. IEnumerable.GetEnumerator, IEnumerator.Current, and IEnumerator all clearly state the enumerator behavior is “undefined,” whereas IEnumerator.MoveNext states its behavior is to throw InvalidOperationException.

    If you accept the “undefined” semantics, then the optimizations for Skip/Reverse/etc. become valid.

  5. (x = count, n = source.Length)

    @Simon: x * n seems inefficient to me when count is large. An optimal implementation would be limited to n*log(n) when count == source.Length. You mentioned binary search but you can’t do a binary search over an unsorted list. Did you mean a regular search? That does get O(x*n) up to O(n*n).

    Then this is just limited selection sort which is not as efficient as a limited heapsort.
    For mergesort if you keep only count items after each merge you get O(n*x) but limited by O(n*log(n)).

    I think heapsort, which would be O(n + x*log(n)), is quicker here, at least in terms of complexity, and it generally is almost as quick as mergesort when x == n.

  6. @configurator: result is maintained sorted. It’s probably clearer in pseudocode:

    TopBy(source, count, cmp):
        res := new[count]
        for i := 0 to count:
            if !source.MoveNext():
                return heap_sort(res)
            heap_push(res, source.Current)
        heap_sort(res)
        while source.MoveNext():
            x := source.Current
            insert(res, upper_bound(res, x), x)
        return res

    Remember that the use case of TopBy() is optimizing out sorting the entire input: count is treatable as constant compared to source length in intended usage. If count is very large, in the worst case it can degrade to source.OrderBy(cmp).Top(count), but I feel something better than insertion sort is possible instead.

  7. On optimizing TopBy:

    Have you (commenters included) considered using a different algorithm based on count (where count is the parameter to TopBy)?

    For example, if count < 10 use insertion sort, another sort in other cases?

    BTW, the OrderBy in Edulinq uses (a variation on) QuickSort right now with early yielding. Two common QuickSort optimizations haven't been implemented on that one: replacing the tail recursion by iteration (you need to push only one case to the stack) and switching to another sorting algorithm when the range to sort is small, e.g. insertion sort when the current range has less than 10 or so items left.

    I think that, when OrderBy has all these optimizations in the box, including the early yielding, it just _might_ not be worth it to implement TopBy.

    On the other hand, TopBy is likely to be called with a count that is small (typically less than what fits on a screen, say 50 or so) and probably significantly smaller than the number of items in the sequence.

    That's where even the best OrderBy will always consume more memory than a TopBy needs.

    Remember, optimization is not only about CPU cycles, it's also about memory usage and finding the correct balance between the two. Also don't forget that memory is speed: fewer CPU instructions to execute but using more memory might make a program slower. There are two reasons for that. First, wasted memory could have been used for something else, disk cache for example. Secondly, modern CPU architectures, with all their caching at multiple levels (including RAM as a cache for the virtual memory in your swap file), are faster on small amounts of memory, and with algorithms with good locality characteristics.

    There is another kind of optimization, which is not explored here: the kind that a developer-user explicitly asks for, because the developer-user knows that in this specific case certain assumptions are true that allow a specific kind of optimization.

    PLINQ is one of those, with its .AsParallel() method.

    Others are certainly possible. For example, I can imagine a .AssumeImmutable() that brings you into an implementation that assumes the source sequence will not change after constructing the query (i.e. not between constructing it and iterating it, not while iterating it, and not after iterating it). If you document that such an implementation will also not guarantee hiding sequence identity, lots of optimizations become possible. For example: .ToArray() on an array could just return the source, .Reverse() on an IList, etc. Turns out that this assumption is actually valid in many cases!
