Designing LINQ operators

I’ve started a small project (I’ll post a link when I’ve actually got something worthwhile to show) with some extra LINQ operators in – things which I think are missing from LINQ to Objects, basically. (I hope to include many of the ideas from an earlier blog post.) That, and a few Stack Overflow questions where I’ve effectively written extra LINQ operators and compared them with other solutions, have made me think about the desirable properties of a LINQ operator – or at least the things you should think about when implementing one. My thoughts so far:

Lazy/eager execution

If you’re returning a sequence (i.e. another IEnumerable<T> or similar) the execution should almost certainly be lazy, but the parameter checking should be eager. Unfortunately with the limitations of the (otherwise wonderful) C# iterator blocks, this usually means breaking the method into two, like this:

public static IEnumerable<T> Where<T>(this IEnumerable<T> source,
                                      Func<T, bool> predicate)
{
    // Eagerly executed
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return WhereImpl(source, predicate);
}

private static IEnumerable<T> WhereImpl<T>(IEnumerable<T> source,
                                           Func<T, bool> predicate)
{
    // Lazily executed
    foreach (T element in source)
    {
        if (predicate(element))
        {
            yield return element;
        }
    }
}

Obviously aggregates and conversions (Max, ToList etc) are generally eager anyway, within normal LINQ to Objects. (Just about everything in Push LINQ is lazy. They say pets look like their owners…)

Streaming/buffering

One of my favourite features of LINQ to Objects (and one which doesn’t get nearly the publicity of deferred execution) is that many of the operators stream the data. In other words, they only consume data when they absolutely have to, and they yield data as soon as they can. This means you can process vast amounts of data with very little memory usage, so long as you use the right operators. Of course, not every operator can stream (reversing requires buffering, for example) but where it’s possible, it’s really handy.
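
To make this concrete, here’s a small sketch – the Naturals helper is purely illustrative – of streaming operators coping with a sequence which could never be buffered:

// An effectively endless sequence – only usable with streaming operators
static IEnumerable<int> Naturals()
{
    for (int i = 0; ; i++)
    {
        yield return i;
    }
}

// Where and Take both stream, so this completes using almost no memory:
foreach (int x in Naturals().Where(n => n % 2 == 0).Take(5))
{
    Console.WriteLine(x); // 0, 2, 4, 6, 8
}

// Reverse, by contrast, has to buffer its entire input before yielding
// anything, so Naturals().Reverse() would never produce a result.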

Unfortunately, the streaming/buffering nature of operators isn’t well documented in MSDN – and sometimes it’s completely wrong. As I’ve noted before, the docs for Enumerable.Intersect claim that it reads the whole of both sequences (first then second) before yielding any data. In fact it reads and buffers the whole of second, then streams first, yielding intersecting elements as it goes. I strongly encourage authors of new LINQ operators to document the streaming/buffering behaviour (accurately!). Admittedly this limits future changes to the implementation (Intersect can be implemented in a manner where both inputs are streamed, for example), but in this case I think the extra guarantees provided by the documentation make up for that restriction.
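
Purely as an illustration of that behaviour (this is the shape of such an implementation, not the real framework source), an Intersect-style operator which buffers second and streams first might look like this:

private static IEnumerable<T> IntersectImpl<T>(IEnumerable<T> first,
                                               IEnumerable<T> second)
{
    // Buffered: the whole of second is read before anything is yielded
    HashSet<T> candidates = new HashSet<T>(second);
    // Streamed: elements of first are read one at a time, and matches are
    // yielded as soon as they're seen
    foreach (T element in first)
    {
        // Remove rather than Contains, so each distinct element is only
        // yielded once
        if (candidates.Remove(element))
        {
            yield return element;
        }
    }
}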

Once-only evaluation

When I said that reversing requires buffering earlier on, I was sort of lying. Here’s an implementation of Reverse which doesn’t buffer any data anywhere:

public static IEnumerable<T> StreamingReverse<T>(this IEnumerable<T> source)
{
    // Error checking omitted for brevity
    int count = source.Count();
    for (int i = count - 1; i >= 0; i--)
    {
        yield return source.ElementAt(i);
    }
}

If we assume we can read the sequence as often as we like, then we never need to buffer anything – just treat it as a random-access list. I hope I don’t have to tell you that’s a really, really bad idea. Leaving aside the blatant inefficiency even for sequences like lists which are cheap to iterate over, some sequences are inherently once-only (think about reading from a network stream) and some are inherently costly to iterate over (think about lines in a big log file – or the result of an ordering).
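
For comparison, here’s a sketch of a sensible Reverse (the name is just for this example, and error checking is omitted again): it reads the input exactly once, at the cost of buffering it.

public static IEnumerable<T> BufferingReverse<T>(this IEnumerable<T> source)
{
    // The input is read exactly once, up front...
    T[] buffer = source.ToArray();
    // ...and never touched again
    for (int i = buffer.Length - 1; i >= 0; i--)
    {
        yield return buffer[i];
    }
}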

I suspect that developers using LINQ operators assume that they’ll only read the input data once. That’s a good assumption – wherever possible, we ought to make sure that it’s correct, and if we absolutely can’t help evaluating a sequence twice (and I can’t remember any times when I’ve really wanted to do that) we should document it in large, friendly letters.

Mind your complexity

In some ways, this falls out of “try to stream, and try to only read once” – if you’re not storing any data and you’re only reading each item once, it’s quite hard to come up with an operator which isn’t just O(n) for a single sequence. It is worth thinking about though – particularly as most of the LINQ operators can work with large amounts of data. For example, to find the smallest element in a sequence you can either sort the whole sequence and take the first element of the result or you can keep track of a “current minimum” and iterate through the whole sequence. Clearly the latter saves a lot of complexity (and doesn’t require buffering) – so don’t just take the first idea that comes into your head. (Or at least, start with that and then think how you could improve it.)
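
As a rough sketch (the method names are just for illustration, and argument validation is omitted), the two approaches look something like this:

// Sorts the whole sequence – typically O(n log n), buffering everything –
// just to take its first element
public static T MinViaSort<T>(this IEnumerable<T> source)
{
    return source.OrderBy(x => x).First();
}

// O(n), streams, and keeps nothing beyond the current best candidate
public static T MinViaScan<T>(this IEnumerable<T> source)
{
    IComparer<T> comparer = Comparer<T>.Default;
    using (IEnumerator<T> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        T min = iterator.Current;
        while (iterator.MoveNext())
        {
            if (comparer.Compare(iterator.Current, min) < 0)
            {
                min = iterator.Current;
            }
        }
        return min;
    }
}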

Again, documenting the complexity of the operator is a good idea; in particular, call attention to anything which is unintuitively expensive.

Conclusion

Okay, so there’s nothing earth-shattering here. But the more I use LINQ to answer Stack Overflow questions, and the more I invent new operators in the spirit of the existing ones, the more impressed I am by it. It’s amazing how powerful it can be, and how ridiculously simple the code (sometimes) looks afterwards. It’s not that the operator implementations are usually hard, either – it’s just a matter of thinking of the right concepts. I’m going to try to follow these principles when I implement my extra operator library, and I hope you’ll bear them in mind too, should you ever feel that LINQ to Objects doesn’t have quite the extension method you need…

14 thoughts on “Designing LINQ operators”

  1. When .net newbies start using IEnumerable, they may think that Count is part of IEnumerable without realizing it is an extension method in the first place

    If you stick to the IEnumerable members, you should immediately realise you’re doing something odd

    BTW Jon, not related to this post but… I often write code like yours:

    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    especially when implementing WCF operations where I want to make sure the caller provides a valid data contract parameter

    I’ve always wanted to be able to do something like:

    // Operation DoSomething
    public void DoSomething(ADataContract parameter)
    {
        CheckOperationParameter(parameter);

    }

    private void CheckOperationParameter(object parameter)
    {
        if (parameter == null)
        {
            string localVariableName = …
            throw new FaultException(localVariableName);
        }
    }

    Theoretically it should be possible to retrieve the local variable’s name using reflection but I haven’t found an easy way of doing it yet…

  2. @Vincent: In fact in MiscUtil I have a ThrowIfNull extension method (or something like that) so I’d write:

    x.ThrowIfNull("x");

  3. Yes, but I’d like to have
    x.ThrowIfNull()

    And the variable name “x” to be retrieved automatically for display purposes

  4. @Vincent: Yes, I know that would be nice – but unless you did it in some post-compile step it’s basically impossible as far as I can tell.

  5. Thanks Jon

    It’s in these moments I realise that creating a tool such as Reflector involves more than reflection skills

  6. @Vincent: Oh yes. Reflector has to do a lot of clever work to guess at the source code. Did you use a lambda expression? An anonymous method? How do you tell the difference between for and foreach over an array? I bet there’s a ton of heuristics in there.

  7. Hi, Jon!

    I’ve been a fan of your blog for some time, but thought I’d comment regarding your note on LINQ buffering.

    I have many years of background in C++ iterator/algorithm details as well as Python generators, and so I find LINQ to be particularly interesting as the C# equivalent of these concepts. In particular, I’ve given some thought to how buffering can be improved for the initial set of LINQ operations.

    There are two main problems for LINQ, as I see it:
    1) There’s no way to distinguish between sequential and random-access. You can kind of say that IEnumerable would imply sequential, while IList implies random-access; but this raises two other problems:
    1a) The random-access nature is not preserved through LINQ operators; e.g., Enumerable.Reverse does not return IList, even when applied to an IList.
    1b) LINQ operators do not look for random-access input sequences, so they choose less efficient algorithms. i.e., Reverse requires buffering even when passed an array.
    2) Some algorithms are not present in LINQ. e.g., an Intersect (or any other set-related operation) on already-ordered input sequences would be much more efficient.

    I believe (2) is largely the result of LINQ being done to provide “SQL in C#” rather than a general-purpose sequence algorithm framework.

    I’ve played around with the idea of adding a bunch of LINQ operators in a library that would help to fill these gaps, and I believe it would require a new “LINQ namespace” (for lack of a better name – I’m referring to the facility provided by AsQueryable, AsParallel for PLINQ, etc). So, a new set of operators would work on IList interfaces.

    What do you think about the need for this?

  8. @Stephen: Interesting ideas. Would you be thinking of a “Listable” then?

    The main problem I see with that is either a) it becomes a mutating operation (which is nasty – I didn’t include “don’t mutate the input” in the guidance above, because with IEnumerable you really don’t have to) or b) you have to return a whole list. I suppose that *could* be lazily evaluated, but that’s quite tricky – it certainly couldn’t be done with anything as simple as an iterator block.

    Of course, *some* LINQ query operators notice if they’re working with an IList – certainly Count() does. I’m not sure beyond that.

    Your point 2 could certainly be done quite easily though – an “OrderedIntersect” or something like that which streams both input sequences, and throws if it ever sees anything out of order. (That does require an IComparer rather than an IEqualityComparer, of course.)

    As for the “need” for it, I’m not sure. I’ve found quite a few little gaps, mostly in terms of “I want to apply a projection for the sake of finding the max, but I want the original value rather than the projected one” but I’ve never really felt the need for the kind of things you’re mentioning. That’s not to say they wouldn’t be useful though – I’m not the only developer using LINQ :)

    Thanks very much for your input – food for thought, certainly.

  9. I agree that mutating operations should not be supported. C++ iterator categories did allow for this, but I never considered it a clean solution.

    As for lazily evaluating lists, I do it all the time. Yes, it’s more work than an iterator block, but with a common base class it’s not too bad. PowerCollections has one very similar to the one I use.

    I think I view LINQ as having more gaps than many people who have worked with it since it first came out. If LINQ is a way of saying “run this query on this input stream”, then it’s sufficient. But for me, I see LINQ as “execute these algorithms as the data streams through”, and when using that perspective, the gaps (both algorithm gaps and performance gaps) become more apparent.

    I’ve been using these types of algorithms for ~10 years. Back then, I called them “pipe algorithms” after the Unix shell pipe (|), because it represented a similar concept: running a stream of data through a series of algorithms. C++ iterator categories and Python generators discovered the same concept, followed now by LINQ.

    All of these eventually run into what I call the “tee” problem; Unix developed process substitution with redirection to named pipes, I wrote my own rather complex system for C++ algorithms… and you’re developing “Push” LINQ.

  10. Here’s a cool trick for getting the parameter name without duplicating it and making it visible to refactoring tools. It’s not mine, but I really can’t remember where I saw it…

    using System.Linq.Expressions;

    public static void ThrowIfNull<T>(Expression<Func<T>> expr) {
        Func<T> f = expr.Compile();
        if (f() == null) {
            var body = (MemberExpression) expr.Body;
            throw new ArgumentException(body.Member.Name + " is null");
        }
    }

    Example:
    void test(string param) {
        ThrowIfNull(() => param);
        // do stuff with a non-null param
    }

    For efficiency, an overload could be added which takes the parameter, so we don’t need to compile the expression.

  11. @Mike Rosenblum: The only reason to compile the expression is to execute it. If you pass in the parameter as well as the expression, you don’t need to execute it.
