Reimplementing LINQ to Objects: Part 42 – More optimization

A few parts ago, I jotted down a few thoughts on optimization. Three more topics on that general theme have occurred to me, one of them prompted by the comments.

User-directed optimizations

I mentioned last time that for micro-optimization purposes, we could derive a tiny benefit if there were operators which allowed us to turn off potential optimizations – effectively declare in the LINQ query that we believed the input sequence would never be an IList<T> or an ICollection<T>, so it wasn’t worth checking it. I still believe that level of optimization would be futile.

However, going the other way is entirely possible. Imagine if we could say, "There are probably a lot of items in this collection, and the operations I want to perform on them are independent and thread-safe. Feel free to parallelize them."

That’s exactly what Parallel LINQ gives you, of course. A simple call to AsParallel() somewhere in the query – often at the start, but it doesn’t have to be – enables parallelism. You need to be careful how you use this, of course, which is why it’s opt-in… and it gives you a fair amount of control in terms of degrees of potential parallelism, whether the results are required in the original order and so on.

In some ways my "TopBy" proposal is similar in a very small way, in that it gives information relatively early in the query, allowing the subsequent parts (ThenBy clauses) to take account of the extra information provided by the user. On the other hand, the effect is extremely localized – basically just for the sequence of clauses to do with ordering.

Related to the idea of parallelism is the idea of side-effects, and how they affect LINQ to Objects itself.

Side-effects and optimization

The optimizations in LINQ to Objects appear to make some assumptions about side-effects:

  • Iterating over a collection won’t cause any side-effects
  • Predicates may cause side-effects

Without the first point, all kinds of optimizations would effectively be inappropriate. As the simplest example, Count() won’t use an iterator – it will just take the count of the collection. What if this was an odd collection which mutated something during iteration, though? Or what if accessing the Count property itself had side-effects? At that point we’d be violating our principle of not changing observable behaviour by optimizing. Again, the optimizations are basically assuming "sensible" behaviour from collections.

There’s a rather more subtle possible cause of side-effects which I’ve never seen discussed. In some situations – most obviously Skip – an operator can be implemented to move over an iterator for a time without taking each "current" value. This is due to the separation of MoveNext() from Current. What if we were dealing with an iterator which had side-effects only when Current was fetched? It would be easy to write such a sequence – but again, I suspect there’s an implicit assumption that such sequences simply don’t exist, or that it’s reasonable for the behaviour of LINQ operators with respect to them to be left unspecified.

Predicates, on the other hand, might not be so sensible. Suppose we were computing "sequence.Last(x => 10 / x > 1)" on the sequence { 5, 0, 2 }. Iterating over the sequence forwards, we end up with a DivideByZeroException – whereas if we detected that the sequence was a list, and worked our way backwards from the end, we’d see that 10 / 2 > 1, and return that last element (2) immediately. Of course, exceptions aren’t the only kind of side-effect that a predicate can have: it could mutate other state. However, it’s generally easier to spot that and cry foul of it being a proper functional predicate than notice the possibility of an exception.

I believe this is the reason the predicated Last overload isn’t optimized. It would be nice if these assumptions were documented, however.

Assumptions about performance

There’s a final set of assumptions which the common ICollection<T>/IList<T> optimizations have all been making: that using the more "direct" members of the interfaces (specifically Count and the indexer) are more efficient than simply iterating. The interfaces make no such declarations: there’s no requirement that Count has to be O(1), for example. Indeed, it’s not even the case in the BCL. The first time you ask a "view between" on a sorted set for its count after the underlying set has changed, it has to count the elements again.

I’ve had this problem before, removing items from a HashSet in Java. The problem is that there’s no way of communicating this information in a standardized way. We could use attributes for everything, but it gets very complicated, and I strongly suspect it would be a complete pain to use. Basically, performance is one area where abstractions just don’t hold up – or rather, the abstractions aren’t designed to include performance characteristics.

Even if we knew the complexity of (say) Count that still wouldn’t help us necessarily. Suppose it’s an O(n) operation – that sounds bad, until you discover that for this particular horrible collection, each iteration step is also O(n) for some reason. Or maybe there’s a collection with an O(1) count but a horrible constant value, whereas iterating is really quick per item… so for small values of O(n), iteration would be faster. Then you’ve got to bear in mind how much processor time is needed trying to work out the fastest approach… it’s all bonkers.

So instead we make these assumptions, and for the most part they’re correct. Just be aware of their presence.

Conclusion

I have reached the conclusion that I’m tired, and need sleep. I might write about Queryable, IQueryable and query expressions next time.

4 thoughts on “Reimplementing LINQ to Objects: Part 42 – More optimization”

  1. “Basically, performance is one area where abstractions just don’t hold up – or rather, the abstractions aren’t designed to include performance characteristics.”

    This is one of the two biggest reasons why advanced C++ programmers resist moving to C#. If you ever get the time, check out the STL documentation (http://www.sgi.com/tech/stl/table_of_contents.html) and the Boost refinements to iterator concepts (http://www.boost.org/doc/libs/1_45_0/libs/iterator/doc/new-iter-concepts.html).

    There’s been many a time when I’ve looked at the MSDN documentation and missed the precision of the C++ standard.

    Like

  2. You certainly need to exercise care with optimisations of the form ‘var y = x as IFoo; if (y != null) { … }’

    For example, arrays implement the non-generic IList interface. Even multidimensional arrays, and non-zero-based arrays implement that interface. But in either of those cases, their indexer won’t necessarily work the way you expect, and their Count property might not return what you expect either (or it might just throw an exception). Although in either case, calling GetEnumerator returns a perfectly well-behaved iterator that does what you’d expect.

    Some interface contracts are explicitly documented as allowing implementers to throw a NotImplementedException. If your optimisation calls a method that might not be implemented you should probably catch it and back out of that optimisation path.

    Like

  3. The framework certainly uses assumptions, not only about performance, but even about correctness. For example, it assumes that calling Count on a collection will return the same value as counting when iterating. Like you said, for the most part these assumptions are correct.

    However, there is an assumption present in the .NET Framework as of version 1.0, and I have seen it being violated many times. The assumption is that calling GetHashCode multiple times on an object that is used as a key in a HashTable, Dictionary and the likes will always return the same value. In practice, this means that GetHashCode should only be overriden on immutable types. (I know, in theory it can be overriden correctly on mutable types as well, if it only takes into account the immutable part of them. Now go and define “immutable”…)

    Like I said, I can’t count the number of times I’ve seen this being violated. But it’s important for LINQ methods such as ToDictionary.

    Like

  4. @Kris: GetHashCode is documented to return the same value *so long as the object isn’t modified in a way which changes equality*:

    “The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object’s Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.”

    I think that’s reasonable – users need to understand that if they change a hash key in a material way, they’re not going to be able to find it again.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s