Reimplementing LINQ to Objects: Part 30 – Average

This is the final aggregation operator, after which I suspect we won’t need to worry about floating point difficulties any more. Between this and the unexpected behaviour of Comparer<string>.Default, I’ve covered two of my "big three" pain points. It’s hard to see how I could get dates and times into Edulinq naturally; it’s even harder to see how time zones could cause problems. I’ve still got a few operators to go though, so you never know…

What is it?

Average has 20 overloads, all like the following but for long, decimal, float and double as well as int:

public static double Average(this IEnumerable<int> source)

public static double Average<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int> selector)

public static double? Average(this IEnumerable<int?> source)

public static double? Average<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int?> selector)

The operators acting on float sequences return float, and those acting on decimal sequences return decimal, with the corresponding nullable return types for the nullable sequences.

As before (for Min/Max/Sum), the overloads which take a selector are equivalent to just applying that selector to each element in the sequence.
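
To give a flavour of the other element types, here are the equivalent non-selector overloads for float and decimal; the selector overloads follow the same pattern as the int ones above:

public static float Average(this IEnumerable<float> source)

public static decimal Average(this IEnumerable<decimal> source)

public static float? Average(this IEnumerable<float?> source)

public static decimal? Average(this IEnumerable<decimal?> source)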

General behaviour – pretty much as you’d expect, I suspect:

  • Each operator calculates the arithmetic mean of a sequence of values.
  • source and selector can’t be null, and are validated immediately.
  • The operators all use immediate execution.
  • The operators all iterate over the entire input sequence, unless an exception is thrown (e.g. due to overflow).
  • The operators with a non-nullable return type throw InvalidOperationException if the input sequence is empty.
  • The operators with a nullable return type ignore any null input values, and return null if the input sequence is empty or contains only null values. If non-null values are present, the return value will be non-null.
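
A few quick examples (my own, not taken from the Edulinq test suite) to illustrate those last two bullets:

new int[0].Average();                  // throws InvalidOperationException
new int?[0].Average();                 // returns null
new int?[] { null, null }.Average();   // returns null
new int?[] { 1, null, 3 }.Average();   // returns 2.0 (the nulls are ignored)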

It all sounds pretty simple, doesn’t it? We just sum the numbers, and divide by the count. It’s not too complicated, but we have a couple of things to consider:

  • How should we count items – which data type should we use? Do we need to cope with more than int.MaxValue elements?
  • How should we sum items? Should we be able to find the average of { int.MaxValue, int.MaxValue } even though the sum clearly overflows the bounds of int?

Judging by the behaviour of my tests, I believe I've made the same decisions as LINQ to Objects. I use a long for the counter, always. I use a long total for the int/long overloads, a double total for the float/double overloads, and a decimal total for the decimal overloads. These aren't particularly tricky decisions once you've realised that you need to make them, but it would be very easy to implement the operators in a simplistic way without thinking about such things. (I'd probably have done so if it weren't for the comments around Sum this morning.)
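
As a concrete example of what the long accumulator buys us for the int overload (my own snippet, not part of the Edulinq code):

int[] values = { int.MaxValue, int.MaxValue };
// The intermediate sum (2 * int.MaxValue) overflows int, but fits comfortably
// in a long, so we get the exact answer back.
double average = values.Average();  // 2147483647.0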

What are we going to test?

I’ve only got in-depth tests for the int overloads, covering:

  • Argument validation
  • Empty sequences for nullable and non-nullable types
  • Sequences with only null values
  • Sequences of perfectly normal values :)
  • Projections for all the above
  • A sequence with just over int.MaxValue elements, to test we can count properly
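
For that last point, here's a sketch of roughly what such a test can look like; the name and exact shape of the real test in Edulinq may differ, and it's obviously slow to run:

[Test]
[Category("SlowTests")]
public void MoreThanInt32MaxValueElements()
{
    // Lazily produces int.MaxValue + 5 elements, all equal to 1, so the
    // element count itself no longer fits in an int.
    IEnumerable<int> query = Enumerable.Repeat(1, int.MaxValue)
                                       .Concat(Enumerable.Repeat(1, 5));
    Assert.AreEqual(1d, query.Average());
}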

Then I have a few extra tests for interesting situations. First I check the overflow behaviour of each type, using a common pattern of averaging a sequence of (max, max, -max, -max) where "max" is the maximum value for the sequence type. The results are:

  • For int we get the correct result of 0 because we’re accumulating over longs
  • For long we get an OverflowException when it tries to add the first two values together
  • For float we get the correct result of 0 because we’re accumulating over doubles
  • For double we get PositiveInfinity because that’s the result of the first addition
  • For decimal we get an OverflowException when it tries to add the first two values together
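
To make that pattern concrete, here's roughly what the int and long cases look like (the test names here are mine):

[Test]
public void Int32DoesNotOverflow()
{
    int[] source = { int.MaxValue, int.MaxValue, -int.MaxValue, -int.MaxValue };
    // The values cancel out in the long accumulator, so the result is 0.
    Assert.AreEqual(0d, source.Average());
}

[Test]
public void Int64Overflows()
{
    long[] source = { long.MaxValue, long.MaxValue, -long.MaxValue, -long.MaxValue };
    // The very first addition overflows the long accumulator.
    Assert.Throws<OverflowException>(() => source.Average());
}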

Additionally, I have a couple of floating-point-specific tests: namely further proof that we use a double accumulator when averaging floats, and the behaviour of Average in the presence of NaN values:

[Test]
public void SingleUsesDoubleAccumulator()
{
    // All the values in the array are exactly representable as floats,
    // as is the correct average… but intermediate totals aren’t.
    float[] array = { 20000000f, 1f, 1f, 2f };
    Assert.AreEqual(5000001f, array.Average());
}

[Test]
public void SequenceContainingNan()
{
    double[] array = { 1, 2, 3, double.NaN, 4, 5, 6 };
    Assert.IsNaN(array.Average());
}

I’m sure someone can think of some other interesting scenarios I should be considering :)

Let’s implement it!

This is another cut-and-paste job, but with more editing required – for each method, I needed to make sure I was using the right accumulator type, and I occasionally removed redundant casts. Still, the code follows pretty much the same pattern for all types. Here’s the int implementation:

public static double Average(this IEnumerable<int> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    checked
    {
        long count = 0;
        long total = 0;
        foreach (int item in source)
        {
            total += item;
            count++;
        }
        if (count == 0)
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        return (double)total / (double)count;
    }
}

public static double Average<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int> selector)
{
    return source.Select(selector).Average();
}

public static double? Average(this IEnumerable<int?> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    checked
    {
        long count = 0;
        long total = 0;
        foreach (int? item in source)
        {
            if (item != null)
            {
                count++;
                total += item.Value;
            }
        }
        return count == 0 ? (double?)null : (double)total / (double)count;
    }
}

public static double? Average<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int?> selector)
{
    return source.Select(selector).Average();
}

Salient points:

  • Again I’m using Select to make the implementation of the overloads with selectors trivial
  • I’ve cast both operands of the division when calculating the average, just for clarity; we could get away with casting only one of them.
  • In the case of the conditional operator, I could actually just cast one of the division operands to "double?" and then remove both of the other casts (sketched after this list)… again, I feel this version is clearer. (I could change my mind tomorrow, mind you…)
  • I’ve explicitly used checked blocks for int and long. For float and double we won’t get overflow anyway, and for decimal the checked/unchecked context is irrelevant.
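
For what it's worth, the alternative mentioned in the third bullet would look something like this:

// Casting one division operand to double? makes the whole division (and hence
// the conditional expression) nullable, so the other two casts can go.
return count == 0 ? null : (double?)total / count;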

There’s one optimization we can perform here. Consider this loop, for the nullable sequence:

long count = 0;
long total = 0;
foreach (int? item in source)
{
    if (item != null)
    {
        count++;
        total += item.Value; // This line can be optimized…
    }
}

The line I’ve highlighted seems perfectly reasonable, right? We’re trying to add the "real" non-null value wrapped inside the nullable value, and we know that there is a real value, because we’ve already checked that it isn’t the null value.

Now think about what the Value property actually does… it checks whether or not it’s the null value, and then returns the real value or throws an exception. But we know it won’t throw an exception, because we’ve checked it. We just want to get at the value – don’t bother with any more checks. That’s exactly what GetValueOrDefault() does. In the case where the value is non-null, GetValueOrDefault() and the Value property obviously do the same thing – but intuition tells me that GetValueOrDefault() can do it quicker, because it doesn’t actually need to check anything. It can just return the value of the underlying field – which will be the default value of the underlying type for a null wrapper value anyway.
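
Concretely, the body of the loop would change to something like this:

if (item != null)
{
    count++;
    // Behaves identically to item.Value here, because we already know the
    // item is non-null; it just skips the redundant check.
    total += item.GetValueOrDefault();
}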

I’ve benchmarked this, and on my laptop it’s about 5% faster than using Value. But… it’s such a grotty hack. I would feel dirty putting it in. Surely Value is the more readable code here – it just happens to be slower. As always, I’m undecided. There’s no behavioural difference, just a slight speed boost. Thoughts, folks?

Conclusion

I’m quite pleased to be shot of the Aggregate Operators Of Overload Doom. I’ve felt for a while that they’ve been hanging over me – I knew they’d be annoying in terms of cut and paste, but there have been more interesting situations to consider than I’d expected.

There’s not a lot left now. According to my previous list, I’ve got:

  • Cast and OfType
  • ElementAt and ElementAtOrDefault
  • SequenceEqual
  • Zip (from .NET 4)
  • Contains

However, that doesn’t include AsEnumerable and AsQueryable. I’m unsure at the moment what I’m doing with those… AsEnumerable is trivial, and probably worth doing… AsQueryable could prove interesting in terms of testing, as it requires expression trees (which are in System.Core, a library I’m not referencing from the tests when testing the Edulinq implementation). I’ll play around and see what happens :)
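
(For what it's worth, "trivial" really does mean trivial here: AsEnumerable exists purely to change the compile-time type of an expression, so a minimal sketch of it is just an identity method.)

// A minimal sketch: AsEnumerable only changes the compile-time type,
// so it can simply return its argument.
public static IEnumerable<TSource> AsEnumerable<TSource>(this IEnumerable<TSource> source)
{
    return source;
}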

Not sure what I’ll implement next, to be honest… we’ll see tomorrow!

13 thoughts on “Reimplementing LINQ to Objects: Part 30 – Average”

  1. AsQueryable also requires IQueryable which is in System.Core. And IQueryProvider although I’m not sure you’d need it in the tests. But it generally seems like a *lot* of work since you’d have to implement all of the operators as a forwarder to the IQueryable provider to get any use from it, and then implement the provider so it can call the Enumerable methods. Doesn’t seem like it’s worth it.

  2. @configurator: About IQueryable… I see what you mean, but it would be nice to get the *principle* of it blogged. Maybe I could just implement Select.

  3. Hey Jon – I’ve done a quick test and I see GetValueOrDefault being between 20% and 30% faster than Value, but only in release mode – in debug mode I saw the 5% difference you mentioned. Given that you would probably release this code without debug optimisations I would probably go for GetValueOrDefault.

  4. @rune: Well, for the moment I’m just reimplementing what LINQ to Objects does. We could potentially use BigInteger (from .NET 4) and some arbitrary precision floating point library, but that would cause performance issues…

  5. I typed this on an iPhone and haven't tested it, so I know you'll find some bugs. But could something like this work?

            long count = 0;
            double final = 0;
            foreach (int item in source)
            {
                count++;
                double sumThis = (item - final) / count;
                final += sumThis;
            }

  6. @Felipe: You gain the ability to withstand overflow, but at the cost of precision. Imagine the result of Enumerable.Repeat(1, 10000000).Average() using that code – what are the chances it’ll be exactly 1?

  7. @Jon well I was thinking more along the lines of

     count++;
     avgSoFar = avgSoFar + (current - avgSoFar) / count;

     This of course has the disadvantage of potential precision errors at every step. That could be a reason for not doing it (my original question), but I have no real-life experience with arbitrary precision in C#, so I wouldn't know the cost of using it; that was also part of my motivation for the question.

  8. “I use a long total for the int/long overloads, a double total for the float/long overloads, and a decimal total for the decimal overloads.”

    You’ve mentioned the long overloads twice, and haven’t mentioned the double overload at all. Can you correct the article please!

    Thanks :)
