A short case study in LINQ efficiency

I’ve been thinking a bit about how I’d use LINQ in real life (leaving DLinq and XLinq alone for the moment). One of the examples I came up with is a fairly common one – trying to find the element in a collection which has the maximum value for a certain property. Note that quite often I don’t just need to know the maximum value of the property itself – I need to know which element had that value. Now, it’s not at all hard to implement that in “normal” code, but using LINQ could potentially make the intention clearer. So, I tried to work out what the appropriate LINQ expression should look like.

I’ve come up with three ways of expressing this query in LINQ. For these examples, I’ve created a type NameAndSize which has (unsurprisingly) properties Name (a string) and Size (an int). For testing purposes, I’ve created a list of these items, with a variable list storing a reference to the list. All samples assume that the list is non-empty.
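
For reference, a minimal sketch of such a type (nothing more than simple read/write properties is assumed here) might look like this:

class NameAndSize
{
    public string Name { get; set; }
    public int Size { get; set; }
}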

Sort, and take first element

(from item in list
 orderby item.Size descending
 select item).First();

This orders by size descending, and takes the first element. The obvious disadvantage is that before finding the first element (which is the only one we care about) we have to sort all the elements – nasty! Assuming a reasonable sort, this operation is likely to be O(n log n).
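
For comparison, the same query in method syntax uses the standard OrderByDescending and First operators directly:

list.OrderByDescending(item => item.Size).First();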

Subselect in where clause

(from item in list
 where item.Size == list.Max(x => x.Size)
 select item).First();

This goes through the list, finding every element whose size is equal to the maximum one, and then takes the first of those elements. Unfortunately, the comparison recalculates the maximum size on every iteration. This makes it an O(n^2) operation.

Two selects

int maxSize = list.Max(x => x.Size);
NameAndSize max = (from item in list
                   where item.Size == maxSize
                   select item).First();

This is similar to the previous version, but solves the problem of the repeated calculation of the maximum size by doing it before anything else. This makes the whole operation O(n), but it’s still somewhat dissatisfying, as we’re having to iterate through the list twice.

The non-LINQ way

NameAndSize max = list[0];
foreach (NameAndSize item in list)
{
    if (item.Size > max.Size)
    {
        max = item;
    }
}

This keeps a reference to the “maximum element so far”. It only iterates through the list once, and is still O(n).

Benchmarks

Now, I’ve written a little benchmark which runs all of these except the “subselect” version, which was just too slow to run sensibly by the time I’d made the list large enough to get meaningful results for the other versions. Here are the results using a list of a million elements, averaged over five runs:
Sorting: 437ms
Two queries: 109ms
Non-LINQ: 38ms

After tripling the size of the list, the results were:
Sorting: 1437ms
Two queries: 324ms
Non-LINQ: 117ms

These results show the complexities being roughly as predicted above, and in particular show that it’s definitely cheaper to only iterate through the collection once than to iterate through it twice.
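
The harness itself isn’t shown here, but a minimal sketch of the kind of timing loop involved (the details beyond the five-run averaging are assumptions) might look like this:

// Requires System.Diagnostics for Stopwatch.
long total = 0;
for (int run = 0; run < 5; run++)
{
    Stopwatch stopwatch = Stopwatch.StartNew();
    NameAndSize max = (from item in list
                       orderby item.Size descending
                       select item).First();
    stopwatch.Stop();
    total += stopwatch.ElapsedMilliseconds;
}
Console.WriteLine("Sorting: {0}ms", total / 5);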

Now, this query is a fairly simple one, conceptually – it would be a shame if LINQ couldn’t cope with it efficiently. I suspect it could be solved by giving the Max operator another parameter, which specified what should be selected, as well as what should be used for comparisons. Then I could just use list.Max(item => item.Size, item => item). At that stage, the only loss in efficiency would be through invoking the delegates, which is a second-order problem (and one which is inherent in LINQ). Fortunately, the way LINQ works makes this really easy to try out – just write an extension class:

static class Extensions
{
    public static V Max<T, V>(this IEnumerable<T> source, 
                              Func<T, int> comparisonMap, 
                              Func<T, V> selectMap)
    {
        int maxValue = 0;
        V maxElement = default(V);
        // Tracks whether we've seen any element yet; the first element always
        // becomes the initial maximum, so the starting value of maxValue is irrelevant.
        bool gotAny = false;
        using (IEnumerator<T> enumerator = source.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                T sourceValue = enumerator.Current;
                int value = comparisonMap(sourceValue);
                if (!gotAny || value > maxValue)
                {
                    maxValue = value;
                    // The projection is only applied when a new maximum is found.
                    maxElement = selectMap(sourceValue);
                    gotAny = true;
                }
            }
        }
        if (!gotAny)
        {
            // Custom exception type; not part of the framework.
            throw new EmptySequenceException();
        }
        return maxElement;
    }
}

This gave results of 57ms and 169ms for the two list sizes used earlier – not quite as fast as the non-LINQ way, but much better than any of the others – and by far the simplest to express, too.
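
For completeness, the call site is then exactly the one-liner suggested above:

NameAndSize max = list.Max(item => item.Size, item => item);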

Lessons learned

  • You really need to think about the complexity of LINQ queries, and know where they will be executed (I suspect that DLinq would have coped with the “subselect” version admirably).
  • There’s still some more work to be done on the standard query operators to find efficient solutions to common use cases.
  • Even if the standard query operators don’t quite do what you want, it can be worthwhile to implement your own – and it’s not that hard to do so!

7 thoughts on “A short case study in LINQ efficiency”

  1. …and of course that says nothing about the amount of IL generated, # of temporary objects constructed (delegates, enumerators), and so forth. Watch that working set size go down the toilet. ;)

  2. Well, in this case I don’t think many actual delegates and enumerators were constructed – two or three of each, I suspect. Certainly nothing compared to the huge number of objects I had to put into the List in order to get meaningful figures.

    As most places in code don’t need to be as fast as they possibly can be, I’m not too worried about this aspect of things. Give me something I can maintain easily over something which gets the right answer speedily but can’t be changed for fear of collapse any day :)

    I should have noted – one of the nice things about my alternative version of Max is that the projection to go from a source object to a projection object is only executed when you find a new maximum. In my sample code this was actually on every iteration, but if I’d had the list going the other way round, it would only have ever called the projection delegate once.

  3. – I see the problem that the initial value for maxValue should be int.MinValue, or the method will report 0 as Max if only negative values are in the collection
    – why not use foreach?
    – I can’t find EmptySequenceException anywhere.
    – I think this would be more general (but here again the problem with negative values)

    static class Extensions
    {
        public static V Max<T, X, V>(this IEnumerable<T> source,
                                     Func<T, X> comparisonMap,
                                     Func<T, V> selectMap)
            where X : IComparable<X>
        {
            X maxValue = default(X);
            V maxElement = default(V);
            bool gotAny = false;
            foreach (T sourceValue in source)
            {
                X value = comparisonMap(sourceValue);
                if (!gotAny || value.CompareTo(maxValue) > 0)
                {
                    maxValue = value;
                    maxElement = selectMap(sourceValue);
                    gotAny = true;
                }
            }
            if (!gotAny)
            {
                throw new InvalidOperationException("source is empty");
            }
            return maxElement;
        }
    }

  4. @Ecki: It’s possible that there was an EmptySequenceException in the CTP back when I wrote this code originally (back in 2005) but it’s also possible I just created an exception for no particular reason.

    The initial value actually isn’t a problem – because we always take the first value as the real maximum. That means your version is simply more flexible than mine with no downsides. Hooray.

    I’ve absolutely no idea why I didn’t use foreach. I claim jetlag at the time :)

  5. I know it’s an old article but I wonder why you didn’t “simply” use the Aggregate extension; traversing a source accumulating a state (here the current max so far) is exactly what a fold is (and Aggregate is fold with another name)

    private static NameAndSize FindMaxWithAggregate(IList<NameAndSize> list)
    {
        return list.Aggregate((max, current) => current.Size > max.Size ? current : max);
    }

    1. This is exactly what I thought. For me, when I hear the problem I immediately expect to see Aggregate. When you are used to thinking of everything as map/reduce (or whatever the syntax is in your chosen language), you see the world a little differently. So I find this version more readable than any other because it fits the pattern I am expecting to see.
