Reimplementing LINQ to Objects: Part 38 – What’s missing?

I mentioned before that the Zip operator was only introduced in .NET 4, so clearly there’s a little wiggle room for LINQ to Objects’ query operators to grow in number. This post mentions some of the ones I think are most sorely lacking – either because I’ve wanted them myself, or because I’ve seen folks on Stack Overflow want them for entirely reasonable use cases.

There is an issue with respect to other LINQ providers, of course: as soon as some useful operators are available for LINQ to Objects, there will be people who want to apply them to LINQ to SQL, the Entity Framework and the like. Worse, if they’re not included in Queryable with overloads based on expression trees, the LINQ to Objects implementation will silently get picked – leading to what looks like a lovely query performing like treacle while the client slurps up the entire database. If they are included in Queryable, then third party LINQ providers could end up with a nasty versioning problem. In other words, some care is needed and I’m glad I’m not the one who has to decide how new features are introduced.

I’ve deliberately not looked at the extra set of operators introduced in the System.Interactive part of Reactive Extensions… nor have I looked back over what we’ve implemented in MoreLINQ (an open source project I started specifically to create new operators). I figured it would be worth thinking about this afresh – but look at both of those projects for actual implementations instead of just ideas.

Currently there’s no implementation of any of this in Edulinq – but I could potentially create an "Edulinq.Extras" assembly which would make it all available. Let me know if any of these sound particularly interesting to see implemented.

FooBy

I love OrderBy and ThenBy, with their descending cousins. They’re so much cleaner than building a custom comparer which just compares a single property of each element. So why stop with ordering? There’s a whole bunch of operators which could do with some "FooBy" love. For example, imagine we have a list of files, and we want to find the longest one. We don’t want to perform a total ordering by size descending, nor do we want to find the maximum file size itself: we want the file with the maximum size. I’d like to be able to write that query as:

FileInfo biggestFile = files.MaxBy(file => file.Length);

Note that we can get a similar result by performing one pass to find the maximum length, and then another pass to find the file with that length. However, that’s inefficient and assumes we can read the sequence twice (and get the same results both times). There’s no need for that. We could get the same result using Aggregate with a pretty complicated aggregation, but I think this is a sufficiently common case to deserve its own operator.

We’d want to specify which value would be returned if multiple files had the same length (my suggestion would be the first one we encountered with that length) and we could also specify a key comparer to use. The signatures would look like this:

public static TSource MaxBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static TSource MaxBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
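
For what it’s worth, here’s a minimal sketch of how the comparer-based overload might be implemented, on the assumption that we return the first element we encounter with the maximal key and throw (as Max does) for an empty sequence. The simpler overload would presumably just call this one with Comparer<TKey>.Default.

public static TSource MaxBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    comparer = comparer ?? Comparer<TKey>.Default;

    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence contains no elements");
        }
        TSource max = iterator.Current;
        TKey maxKey = keySelector(max);
        while (iterator.MoveNext())
        {
            TSource candidate = iterator.Current;
            TKey candidateKey = keySelector(candidate);
            // Strictly greater than, so the first element with the maximal key wins
            if (comparer.Compare(candidateKey, maxKey) > 0)
            {
                max = candidate;
                maxKey = candidateKey;
            }
        }
        return max;
    }
}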

Now it’s not just Max and Min that gain from this "By" idea. It would be useful to apply the same idea to the set operators. The simplest of these to think about would be DistinctBy, but UnionBy, IntersectBy and ExceptBy would be reasonable too. In the case of ExceptBy and IntersectBy we could potentially take a key collection to indicate the keys of the elements we wanted to exclude/include, but it would probably be more consistent to force the two input sequences to be of the same type (as they would have to be for UnionBy, of course). ContainsBy might be useful, but that would effectively be a Select followed by a normal Contains – possibly not useful enough to merit its own operator.
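
Of these, DistinctBy is simple enough to sketch here. This follows the usual Edulinq pattern of eager argument validation followed by a separate iterator block; the DistinctByImpl name is just for illustration.

public static IEnumerable<TSource> DistinctBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    return DistinctByImpl(source, keySelector, comparer ?? EqualityComparer<TKey>.Default);
}

private static IEnumerable<TSource> DistinctByImpl<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    HashSet<TKey> seenKeys = new HashSet<TKey>(comparer);
    foreach (TSource item in source)
    {
        // Add returns false if the key was already present, so only the first
        // element to produce any given key is yielded.
        if (seenKeys.Add(keySelector(item)))
        {
            yield return item;
        }
    }
}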

TopBy and TopByDescending

These may sound like they belong in the FooBy section, but they’re somewhat different: they’re effectively specializations of OrderBy and OrderByDescending where you already know how many elements you want to preserve. The return type would be IOrderedEnumerable<T> so you could still use ThenBy/ThenByDescending as normal. That would make the following two queries equivalent – but the second might be a lot more efficient than the first:

var takeQuery = people.OrderBy(p => p.LastName)
                      .ThenBy(p => p.FirstName)
                      .Take(3);

var topQuery = people.TopBy(p => p.LastName, 3)
                     .ThenBy(p => p.FirstName);

An implementation could easily delegate to various different strategies depending on the number given – for example, if you asked for more than 10 values, it may not be worth doing anything more than a simple sort, restricting the output afterwards. If you asked for just the top 3 values, it could return an IOrderedEnumerable implementation specifically hard-coded to 3 values, etc.

Aside from anything else, if you were confident in what the implementation did (and that’s a very big "if") you could use a potentially huge input sequence with such a query – larger than you could fit into memory in one go. That’s fine if you’re only keeping the top three values you’ve seen so far, but would fail for a complete ordering, even one which was able to yield results before performing all the ordering: if it doesn’t know you’re going to stop after three elements, it can’t throw anything away.
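
To illustrate that bounded-memory approach, here’s a deliberately simplified sketch. It cheats in two ways: it returns IEnumerable<TSource> rather than IOrderedEnumerable<TSource> (so it doesn’t support ThenBy), and it always uses the default key comparer. A real implementation would need to capture the comparer in the same way that OrderBy does; the TopByImpl helper name is just for illustration.

public static IEnumerable<TSource> TopBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    int count)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    if (count <= 0)
    {
        throw new ArgumentOutOfRangeException("count");
    }
    return TopByImpl(source, keySelector, count);
}

private static IEnumerable<TSource> TopByImpl<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    int count)
{
    IComparer<TKey> comparer = Comparer<TKey>.Default;
    // Bounded buffer of the "best" elements seen so far, kept in ascending key order.
    // Memory use is O(count) however long the input sequence is.
    List<KeyValuePair<TKey, TSource>> buffer = new List<KeyValuePair<TKey, TSource>>(count + 1);
    foreach (TSource item in source)
    {
        TKey key = keySelector(item);
        int index = buffer.Count;
        // Walk backwards to find the insertion point; stopping at equal keys
        // keeps elements with equal keys in their original order (a stable sort).
        while (index > 0 && comparer.Compare(buffer[index - 1].Key, key) > 0)
        {
            index--;
        }
        if (index < count)
        {
            buffer.Insert(index, new KeyValuePair<TKey, TSource>(key, item));
            if (buffer.Count > count)
            {
                buffer.RemoveAt(count);
            }
        }
    }
    foreach (KeyValuePair<TKey, TSource> pair in buffer)
    {
        yield return pair.Value;
    }
}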

Perhaps this is too specialized an operator – but it’s an interesting one to think about. It’s worth noting that this probably only makes sense for LINQ to Objects, which never gets to see the whole query in one go. Providers like LINQ to SQL can optimize queries of the form OrderBy(…).ThenBy(…).Take(…) because by the time they need to translate the query into SQL, they will have an expression tree representation which includes the "Take" part.

TryFastCount and TryFastElementAt

One of the implementation details of Edulinq is its TryFastCount method, which basically encapsulates the logic around attempting to find the count of a sequence if it implements ICollection or ICollection<T>. Various built-in LINQ operators find this useful, and anyone writing their own operators has a reasonable chance of bumping into it as well. It seems pointless to duplicate the code all over the place… why not expose it? The signatures might look something like this:

public static bool TryFastCount<TSource>(
    this IEnumerable<TSource> source,
    out int count)

public static bool TryFastElementAt<TSource>(
    this IEnumerable<TSource> source,
    int index,
    out TSource value)

I would expect TryFastElementAt to use the indexer if the sequence implemented IList<T>, without performing any validation: that ought to be the responsibility of the caller. TryFastCount could use a Nullable<int> return type instead of the return value / out parameter split, but I’ve kept it consistent with the methods which exist elsewhere in the framework.
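
Here’s a sketch of what the bodies might look like. The details are my assumption rather than a guarantee of exactly what Edulinq’s internal helper does, and the non-generic ICollection check needs a using directive for System.Collections.

public static bool TryFastCount<TSource>(
    this IEnumerable<TSource> source,
    out int count)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        count = genericCollection.Count;
        return true;
    }
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        count = nonGenericCollection.Count;
        return true;
    }
    count = 0;
    return false;
}

public static bool TryFastElementAt<TSource>(
    this IEnumerable<TSource> source,
    int index,
    out TSource value)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    IList<TSource> list = source as IList<TSource>;
    if (list != null)
    {
        // Deliberately no range check: as described above, validating the index
        // is the caller's responsibility.
        value = list[index];
        return true;
    }
    value = default(TSource);
    return false;
}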

Scan and SelectAdjacent

These are related operators in that they deal with wanting a more global view than just the current element. Scan would act similarly to Aggregate – except that it would yield the accumulator value after each element. Here’s an example of keeping a running total:

// Signature:
public static IEnumerable<TAccumulate> Scan<TSource, TAccumulate>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func)
    

int[] source = new int[] { 3, 5, 2, 1, 4 };
var query = source.Scan(0, (current, item) => current + item);
query.AssertSequenceEqual(3, 8, 10, 11, 15);

There could be a more complicated overload with an extra conversion from TAccumulate to an extra TResult type parameter. That would let us write a Fibonacci sequence query in one line, if we really wanted to…
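
Here’s roughly what that extended overload might look like, along with the sort of one-liner it would enable. The Fibonacci query below is only an illustration (and arguably an abuse), using an anonymous type as the accumulator to carry the latest two values along:

// Possible signature for the extended overload:
public static IEnumerable<TResult> Scan<TSource, TAccumulate, TResult>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func,
    Func<TAccumulate, TResult> resultSelector)

// First 10 Fibonacci numbers: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55
var fibonacci = Enumerable.Range(0, 10)
                          .Scan(new { Current = 0, Next = 1 },
                                (acc, ignored) => new { Current = acc.Next, Next = acc.Current + acc.Next },
                                acc => acc.Current);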

The SelectAdjacent operator would simply present a selector function with pairs of adjacent items. Here’s a similar example, this time calculating the difference between each pair:

// Signature:
public static IEnumerable<TResult> SelectAdjacent<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TSource, TResult> selector)
    

int[] source = new int[] { 3, 5, 2, 1, 4 };
var query = source.SelectAdjacent((current, next) => next - current);
query.AssertSequenceEqual(2, -3, -1, 3);

One oddity here is that the result sequence always contains one item fewer than the source sequence. If we wanted to keep the length the same, there are various approaches we could take – but the best one would depend on the situation.

This sounds like a pretty obscure operator, but I’ve actually seen quite a few LINQ questions on Stack Overflow where it could have been useful. Is it useful often enough to deserve its own operator? Maybe… maybe not.
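
Whatever the answer, the core of the implementation is pleasantly simple. Here’s a sketch of the iterator block it might use, with argument validation assumed to live in a separate public method as usual (the SelectAdjacentImpl name is just for illustration):

private static IEnumerable<TResult> SelectAdjacentImpl<TSource, TResult>(
    IEnumerable<TSource> source,
    Func<TSource, TSource, TResult> selector)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        // An empty source (or one with a single element) yields nothing.
        if (!iterator.MoveNext())
        {
            yield break;
        }
        TSource previous = iterator.Current;
        while (iterator.MoveNext())
        {
            TSource current = iterator.Current;
            yield return selector(previous, current);
            previous = current;
        }
    }
}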

DelimitWith

This one is really just a bit of a peeve – but again, it’s a pretty common requirement. We often want to take a sequence and create a single string which is (say) a comma-delimited version. Yay, String.Join does exactly what we need – particularly in .NET 4, where there’s an overload taking IEnumerable<T> so you don’t need to convert it to a string array first. However, it’s still a static method on string – and the name "Join" also looks slightly odd in the context of a LINQ query, as it’s got nothing to do with a LINQ-style join.

Compare these two queries: which do you think reads better, and feels more "natural" in LINQ?

// Current state of play…
var names = string.Join(",",
                        people.Where(p => p.Age < 18)
                              .Select(p => p.FirstName));

// Using DelimitWith
var names = people.Where(p => p.Age < 18)
                  .Select(p => p.FirstName)
                  .DelimitWith(",");

I know which I prefer :)
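
A sketch of the implementation is equally straightforward. This version builds the result with a StringBuilder rather than delegating to the .NET 4 String.Join overload, converting each item via Append; treat it as an assumption about the shape of the operator rather than a definitive design:

public static string DelimitWith<TSource>(
    this IEnumerable<TSource> source,
    string delimiter)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    StringBuilder builder = new StringBuilder();
    bool first = true;
    foreach (TSource item in source)
    {
        // Only emit the delimiter between items, never before the first one
        if (!first)
        {
            builder.Append(delimiter);
        }
        builder.Append(item);
        first = false;
    }
    return builder.ToString();
}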

ToHashSet

(Added on February 23rd 2011.)

I’m surprised I missed this one first time round – I’ve bemoaned its omission in various places before now. It’s easy to create a list, dictionary, lookup or array from an anonymous type, but you can’t create a set that way. That’s mad, given how simple the relevant operator is, even with an overload for a custom equality comparer:

public static HashSet<TSource> ToHashSet<TSource>(
    this IEnumerable<TSource> source)
{
    return source.ToHashSet(EqualityComparer<TSource>.Default);
}

public static HashSet<TSource> ToHashSet<TSource>(
    this IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return new HashSet<TSource>(source, comparer ?? EqualityComparer<TSource>.Default);
}

This also makes it much simpler to create a HashSet in a readable way from an existing query expression, without either wrapping the whole query in the constructor call or using a local variable.
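
For example, compare these two ways of building a set of names (using a hypothetical people query in the same spirit as the earlier examples):

// Wrapping the whole query in the constructor call…
var adultNames = new HashSet<string>(people.Where(p => p.Age >= 18)
                                           .Select(p => p.FirstName));

// Reading top to bottom with ToHashSet
var adultNames = people.Where(p => p.Age >= 18)
                       .Select(p => p.FirstName)
                       .ToHashSet();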

Conclusion

These are just the most useful extra methods I thought of, based on the kinds of query folks on Stack Overflow have asked about. I think it’s interesting that some are quite general – MaxBy, ExceptBy, Scan and so on – whereas others (TopBy, SelectAdjacent and particularly DelimitWith) are simply aimed at making some very specific but common situations simpler. It feels to me like the more general operators really are missing from LINQ – they would fit quite naturally – but the more specific ones probably deserve to be in a separate static class, as "extras".

This is only scratching the surface of what’s possible, of course – System.Interactive.EnumerableEx in Reactive Extensions has loads of options. Some of them are deliberate parallels of the operators in Observable, but plenty make sense on their own too.

One operator you may have expected to see in this list is ForEach. This is a controversial topic, but Eric Lippert has written about it very clearly (no surprise there, then). Fundamentally LINQ is about querying a sequence, not taking action on it. ForEach breaks that philosophy, which is why I haven’t included it here. Usually a foreach statement is a perfectly good alternative, and makes the "action" aspect clearer.

25 thoughts on “Reimplementing LINQ to Objects: Part 38 – What’s missing?”

  1. I don’t think TopBy can implement the same behavior as the Take alternative – consider the case when the top last name is used by four people.

    Either: TopBy takes the 3rd equal names, yielding 4 entries, and you have a Take at the end; or TopBy returns the first 3 equal names (by input order), and gives a different answer to the alternative query.

    In the first case you might want a ThenTopBy method so you can trim each step.

  2. I like those operators. I implemented SelectAdjacent myself (called it Window2, Window3 and WindowN). WindowN was quite a challenge to implement because you have to have a pool of N arrays open at all times.

    I called my DelimitWith equivalent "Concat", but that was a poor naming choice – DelimitWith is much better. On the other hand, it allows the following use: new string[0].Concat();

    Another useful one was SplitBetween(Func).

  3. TopBy doesn’t work for the reason Douglas L already gave.

    Furthermore, the idea that it is more efficient than .OrderBy(…).Take(…) is based on the assumption that you implement it with insertion sort. That will indeed be more efficient if the number of elements you take is very small compared to the length of the input sequence. However, if the number of elements you take is close to the number of elements in the collection, that certainly is not true. Unfortunately, only ICollection sequences know how many elements they contain, which would allow choosing the most efficient algorithm.

  4. @Douglas L: I think you’ve forgotten the way that OrderBy(…).ThenBy(…) works. ThenBy is able to get at the original source and the comparer so far. So the result of TopBy(…).ThenBy(…) could easily be the same as OrderBy(…).ThenBy(…).Take(3).

    To put it another way: you could easily implement TopBy by performing a full sort and then only returning the top n elements. So it’s clearly possible to get the required behaviour. If I implement that, will you believe me? :)

    @Kris: You could change implementation as you go along, of course. In practice I suspect that it would only be used with a relatively small number (e.g. taking the top 10 or the top 100), at which point even if the algorithm you pick *isn’t* the most efficient, it’s not going to take very long. Compare that with the benefits available when picking the top 5 from a million elements…

  5. I don’t understand the requirement for the DelimitWith extension. What’s wrong with Aggregate()?

    // aggregate?
    var names = people.Where(p => p.Age < 18)
                      .Select(p => p.FirstName)
                      .Aggregate((p, q) => p + "," + q);

  6. @Anders: How about the fact that it’ll be shockingly inefficient? Do you often concatenate strings in a loop? :) You could use a StringBuilder instead, but then the Aggregate call becomes rather harder to read, and it becomes fiddly to avoid an extra comma.

  7. I’d like you to talk about IAsyncEnumerable, its LINQ operators, and their use with the new async/await feature. Async enumerables seem like a powerful abstraction. Thanks!

  8. You’re right of course about the equivalence of .OrderBy(…).ThenBy(…).Take(…) and .TopBy(…).ThenBy(…). Sorry about that, my mistake.

    I think it’s going to be tricky to have an implementation that is guaranteed to be at least as efficient as .OrderBy(…).ThenBy(…).Take(…) no matter what the length of the input sequence and the number of elements to take (complexity-wise, i.e. same O(…)).

    I’d love to have one though…

  9. Thinking about a possible implementation of .TopBy(…), I just realized we don’t need .TopBy(…) at all!

    What we “need” is a Take(…) (and possibly TakeWhile, Skip and SkipWhile) that optimize for IOrderedEnumerable!

  10. In fact, if you have First and/or FirstOrDefault optimized for IOrderedEnumerable, then you don’t need MaxBy(…) and MinBy(…) either. Just do an OrderBy(…) or OrderByDescending(…) followed by First() and you have it.

    In fact, such an IOrderedEnumerable optimized version of First(OrDefault) would be trivially easy to implement by using an IOrderedEnumerable optimized version of Take(1).

  11. @Kris: That would require a change to the IOrderedEnumerable interface itself. There’s nothing in the interface which would help you to implement a custom Take/First etc. My suggestion for TopBy doesn’t require a change in interface.

    I would also suggest that MaxBy is easier to read than TopByDescending(…, 1).First(). Just MHO though.

  12. Just like ThenBy is able to get at the original source and the comparer so far, so are Take and First and their variations.
    In other words, you shouldn’t compare MaxBy(…) to TopByDescending(…, 1).First(), but to OrderByDescending(…).First().
    The clear advantage here is that it doesn’t require any new operator. As such, there’s nothing to learn and existing code can benefit from the optimization.

  13. @Kris: Patrick has hit the nail on the head. The options are:

    – Create a new operator such as TopBy

    – Add more methods to IOrderedEnumerable

    – Special-case the implementation of IOrderedEnumerable which you happen to use in your library (urgh)

  14. @patrick: If we would want to add only a new Take method to the existing LINQ to Objects, without access to its source, you’re right of course. But that’s not what we are discussing here, is it?

    @skeet: These are indeed the three options.

    The first one has been discussed.

    It would certainly be possible to “extend” IOrderedEnumerable. Maybe not literally, but certainly in the sense that PLINQ does. Perfectly acceptable to me. The additional methods (along the lines of get source and get comparer) can even be internal. Which brings us to the third option:

    Special casing Take for the implementation of IOrderedEnumerable in your library is also possible. After all, (taking Edulinq as an example) OrderedEnumerable is an internal class. Why would a Take() that lives in the same assembly not be allowed to test for it? And use its internal methods?

    The only real difference between option 2 and 3 is the static return type of the OrderBy method (and the likes). In both cases, the runtime type lives in the same assembly as Take(). With option 2, Take can be statically specialised for it (i.e. have an overload specifically for it), but you probably wouldn’t do that anyway (for the same reason that there is no Count() overload for ICollection).

    Some readers might think that option 3 has the advantage of interoperability with System.Linq.Enumerable. However, naming ambiguities make that impractical. And again, it’s not the scenario we are discussing. We are discussing a full implementation from scratch, such as Edulinq (or a hypothetical next version of LINQ to Objects made by those that have the source code for it). In that implementation, we define what OrderBy() returns.

  15. @Kris: But given that we can do everything we need to *without* changing the design of IOrderedEnumerable and without relying on an internal implementation (which feels fragile to me) I think adding a new operator is the best approach.

  16. @skeet: Now we’re at the heart of the matter.

    It seems that we disagree on the cost of an additional operator. I think that it is very high.

    You pointed out some cost reasons in the introduction to this post. In addition, I think there is a cost for the implementors of the library (see http://blogs.msdn.com/b/ericlippert/archive/0001/01/01/53298.aspx) as well as for consumers related to additional complexity and learning curve.

    And like I said, with my suggestion existing code gets the benefit of the optimization. I think that’s a big advantage.

    Also, why is it a goal not to change the design of IOrderedEnumerable? And why does it feel fragile to you that an assembly relies on its own internal implementation?

    But to a large extent, this disagreement is based on subjective factors and opinions, and I certainly respect your view. I have for a long time, in fact – that’s why I read your blog.

  17. By far the most useful extension method I’ve written is DelimitWith (which I simply called Delimit).

    Mine returns void and takes a TextWriter. There’s an overload that returns a string and simply calls the real implementation with a StringWriter.

    Unendingly useful in one library I wrote that emits SQL Update and insert statements, and in another that creates large delimited text files.

    I even posted them in answer to a SO question one time.

  18. How about .Randomized(), which puts the elements in no particular order? The system will take care of creating a Random object and initializing it with a proper random seed.

    Useful applications would be:

    .Randomized().FirstOrDefault();
    .Randomized().Take(5);

    Or, if we do not want to actually randomize the sequence but just need a random element (or elements):
    Random(1)
    Random(4)

    These could be tweaked for performance if the source implements ICollection.

    Strange that they are not already there.

  19. @Bas: I would call it Shuffle, personally – but I would *require* the caller to pass in a Random or something similar.

    @Kris: There’s definitely a cost for the implementation either way. There are pros and cons for whether to put the logic in Take or not. My main reason *against* that idea is that it would then be really, really easy to mess up the optimization with a trivial-looking change – e.g. a Select between ThenBy and Take. If the “top 5”-ness is baked into the operator, it’s harder to break that.

    One benefit of TopBy is that it *can* be implemented today by a 3rd party library without also implementing OrderBy and ThenBy separately. You could implement TopBy and everything would just work ™.

    All of this is hypothetical of course – there would be different design options based on whether we were imagining ourselves starting from scratch, adding new operators into System.Core, or doing it as a third party.

  20. @skeet: It is indeed true that the possibility to implement TopBy() in a third party library is an important benefit. And the point about a Select() between OrderBy() and Take() is important as well.

    Also, the fact that in Edulinq you use (QuickSort with) early yielding is already a big plus. LINQ to Objects would probably benefit from a TopBy a lot more than Edulinq would, which is already somewhat optimized for this scenario.

  21. I’m not sure the naming of DelimitWith() precisely describes its behaviour. This may seem pedantic, but I always remember the Tom Christiansen (of Perl camel book fame) adage:

    hyphen-separated
    “quote-delimited”
    bang-terminated!

    So maybe SeparateWith() would be more precise if we accept these definitions. I suppose JoinWith() could cause confusion with the other LINQ join methods, although it would suggest the similarity to string.Join().
