Reimplementing LINQ to Objects: Part 13 – Aggregate

EDIT: I’ve had to edit this quite a bit now that a second bug was discovered… basically my implementation of the first overload was completely broken :(

Last night’s tweet asking for suggestions around which operator to implement next resulted in a win for Aggregate, so here we go.

What is it?

Aggregate has three overloads, which effectively allow two of the parameters to be defaulted:

public static TSource Aggregate<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, TSource, TSource> func)

public static TAccumulate Aggregate<TSource, TAccumulate>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func)

public static TResult Aggregate<TSource, TAccumulate, TResult>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func,
    Func<TAccumulate, TResult> resultSelector)

Aggregate is an extension method using immediate execution, returning a single result. The generalised behaviour is as follows:

  • Start off with a seed. For the first overload, this defaults to the first value of the input sequence. The seed is used as the first accumulator value.
  • For each item in the sequence, apply the aggregation function, which takes the current accumulator value and the newly found item, and returns a new accumulator value.
  • Once the sequence has been exhausted, optionally apply a final projection to obtain a result. If no projection has been specified, we can imagine that the identity function has been provided.
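
Just to make those steps concrete, here’s a sample call to the third overload (this is illustrative code only, not taken from the test suite):

int[] source = { 1, 2, 3 };
// Seed of 10; each step combines the accumulator with the next element:
// 10 + 1 = 11, then 11 + 2 = 13, then 13 + 3 = 16.
// The final projection then converts the int accumulator into a double.
double result = source.Aggregate(10, (acc, value) => acc + value, acc => acc / 2.0);
// result is 8.0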

The signatures make all of this look a bit more complicated because of the various type parameters involved. You can consider all the overloads as dealing with three different types, even though the first two actually have fewer type parameters:

  • TSource is the element type of the sequence, always.
  • TAccumulate is the type of the accumulator – and thus the seed. For the first overload where no seed is provided, TAccumulate is effectively the same as TSource.
  • TResult is the return type when there’s a final projection involved. For the first two overloads, TResult is effectively the same as TAccumulate (again, think of a default "identity projection" as being used when nothing else is specified).

In the first overload, which uses the first input element as the seed, an InvalidOperationException is thrown if the input sequence is empty.

What are we going to test?

Obviously the argument validation is reasonably simple to test – source, func and resultSelector can’t be null (there’s a sample validation test below). There are then two different approaches to testing the "success" cases.
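
Here’s the kind of thing I mean – the test name is my own invention, and the real suite would want one such test per argument and overload:

[Test]
public void NullFuncThrowsArgumentNullException()
{
    int[] source = { 1, 3 };
    Func<int, int, int> func = null;
    // The exception should be thrown eagerly, before any iteration occurs.
    Assert.Throws<ArgumentNullException>(() => source.Aggregate(func));
}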

We could work out exactly when each delegate should be called and with what values – effectively mock every step of the iteration. This would be a bit of a pain, but a very robust way of proceeding.

The alternative approach is just to take some sample data and aggregation function, work out what the result should be, and assert that result. If the result is sufficiently unlikely to be achieved by chance, this is probably good enough – and it’s a lot simpler to implement. Here’s a sample from the most complicated test, where we have a seed and a final projection:

[Test]
public void SeededAggregationWithResultSelector()
{
    int[] source = { 1, 4, 5 };
    int seed = 5;
    Func<int, int, int> func = (current, value) => current * 2 + value;
    Func<int, string> resultSelector = result => result.ToInvariantString();
    // First iteration: 5 * 2 + 1 = 11
    // Second iteration: 11 * 2 + 4 = 26
    // Third iteration: 26 * 2 + 5 = 57
    // Result projection: 57.ToString() = "57"
    Assert.AreEqual("57", source.Aggregate(seed, func, resultSelector));
}

Now admittedly I’m not testing this to the absolute full – I’m using the same types for TSource and TAccumulate – but frankly it gives me plenty of confidence that the implementation is correct.

EDIT: My result selector now calls ToInvariantString. It used to just call ToString, but as I’ve now been persuaded that there are some cultures where that wouldn’t give us the right results, I’ve implemented an extension method which effectively means that x.ToInvariantString() is equivalent to x.ToString(CultureInfo.InvariantCulture) – so we don’t need to worry about cultures with different numeric representations etc.
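
I won’t show the whole extension method here, but a minimal sketch looks something like this – the class name and generic constraint are my choices, and the real code may well differ:

using System;
using System.Globalization;

public static class FormattingExtensions
{
    // Formats a value using the invariant culture, so the result doesn't
    // depend on the current thread's culture settings.
    public static string ToInvariantString<T>(this T value) where T : IFormattable
    {
        return value.ToString(null, CultureInfo.InvariantCulture);
    }
}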

Just for the sake of completeness (I’ve convinced myself to improve the code while writing this blog post), here’s an example which sums integers, but results in a long – so it copes with a result which is bigger than Int32.MaxValue. I haven’t bothered with a final projection though:

[Test]
public void DifferentSourceAndAccumulatorTypes()
{
    int largeValue = 2000000000;
    int[] source = { largeValue, largeValue, largeValue };
    long sum = source.Aggregate(0L, (acc, value) => acc + value);
    Assert.AreEqual(6000000000L, sum);
    // Just to prove we haven’t missed off a zero…
    Assert.IsTrue(sum > int.MaxValue); 
}

Since I first wrote this post, I’ve also added tests for empty sequences (where the first overload should throw an exception) and a test which relies on the first overload using the first element of the sequence as the seed, rather than the default value of the input sequence’s element type.
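
I haven’t reproduced those tests above, but they look something like this (reconstructed for the sake of the post – the real tests may differ in naming and detail):

[Test]
public void FirstOverloadRejectsEmptySequence()
{
    int[] source = { };
    Assert.Throws<InvalidOperationException>(
        () => source.Aggregate((acc, value) => acc + value));
}

[Test]
public void FirstOverloadUsesFirstElementAsSeed()
{
    int[] source = { 5, 3, 2 };
    // If the implementation used default(int) = 0 as the seed, the result
    // would be 0; seeding with the first element gives 5 * 3 * 2 = 30.
    Assert.AreEqual(30, source.Aggregate((acc, value) => acc * value));
}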

Okay, enough about the testing… what about the real code?

Let’s implement it!

I’m still feeling my way around when it’s a good idea to implement one method by using another, but at the moment my gut feeling is that it’s okay to do so when:

  • You’re implementing one operator by reusing another overload of the same operator; in other words, no unexpected operators will end up in the stack trace of callers
  • There are no significant performance penalties for doing so
  • The observed behaviour is exactly the same – including argument validation
  • The code ends up being simpler to understand (obviously)

Contrary to an earlier version of this post, the first overload can’t be implemented in terms of the second or third ones, because of its behaviour regarding the seed and empty sequences.

public static TSource Aggregate<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, TSource, TSource> func)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (func == null)
    {
        throw new ArgumentNullException("func");
    }

    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Source sequence was empty");
        }
        TSource current = iterator.Current;
        while (iterator.MoveNext())
        {
            current = func(current, iterator.Current);
        }
        return current;
    }
}

It still makes sense to share an implementation between the second and third overloads though. There’s a choice: implement the second overload in terms of the third (giving it an identity projection), or implement the third in terms of the second (calling it and then applying the projection to the result). Applying an unnecessary identity projection does carry a performance penalty – but a tiny one. So which is more readable? I’m in two minds about this. I like code where the various overloads all delegate to a single "central" method where the real work occurs (which suggests implementing the second overload using the third), but equally I suspect I really think about aggregation in terms of getting the final value of the accumulator… with just a twist in the third overload, of an extra projection. I guess it depends on whether you think of the final projection as part of the general form or as an "extra" step.

For the moment, I’ve gone with the "keep all logic in one place" approach:

public static TAccumulate Aggregate<TSource, TAccumulate>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func)
{
    return source.Aggregate(seed, func, x => x);
}

public static TResult Aggregate<TSource, TAccumulate, TResult>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func,
    Func<TAccumulate, TResult> resultSelector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (func == null)
    {
        throw new ArgumentNullException("func");
    }
    if (resultSelector == null)
    {
        throw new ArgumentNullException("resultSelector");
    }
    TAccumulate current = seed;
    foreach (TSource item in source)
    {
        current = func(current, item);
    }
    return resultSelector(current);
}

The bulk of the "real work" method is argument validation – the actual iteration is almost painfully simple.
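
For comparison, the road not taken – making the second overload the one containing the loop, and implementing the third on top of it – would look something like this (just a sketch, not code I’ve committed):

public static TResult Aggregate<TSource, TAccumulate, TResult>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func,
    Func<TAccumulate, TResult> resultSelector)
{
    // resultSelector must still be validated eagerly, before any iteration;
    // source and func are validated by the overload we delegate to.
    if (resultSelector == null)
    {
        throw new ArgumentNullException("resultSelector");
    }
    return resultSelector(source.Aggregate(seed, func));
}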

Conclusion

The moral of today’s story is to read the documentation carefully – sometimes there’s unexpected behaviour to implement. I still don’t really know why this difference in behaviour exists… it feels to me as if the first overload really should behave like the second one, just with a default initial seed. EDIT: it seems that you need to read it really carefully. You know, every word of it. Otherwise you could make an embarrassing goof in a public blog post. <sigh>

The second moral should really be about the use of Aggregate – it’s a very generalised operator, and you can implement any number of other operators (Sum, Max, Min, Average etc) using it. In some ways it’s the scalar equivalent of SelectMany, just in terms of its diversity. Maybe I’ll show some later operators implemented using Aggregate…
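
Just to give a flavour of that – these are sketches rather than part of my implementation, and they gloss over details such as the overflow checking the real Sum performs – here are Sum and Max for ints expressed via Aggregate:

// Sum via the seeded overload: an empty sequence yields 0, as it should.
// (Unlike the real Sum, this performs unchecked arithmetic.)
public static int Sum(this IEnumerable<int> source)
{
    return source.Aggregate(0, (acc, value) => acc + value);
}

// Max via the unseeded overload: an empty sequence throws
// InvalidOperationException, just as the real Max does.
public static int Max(this IEnumerable<int> source)
{
    return source.Aggregate((max, value) => value > max ? value : max);
}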

Next up, there have been requests for some of the set-based operators – Distinct, Union, etc – so I’ll probably look at those soon.

6 thoughts on “Reimplementing LINQ to Objects: Part 13 – Aggregate”

  1. There’s another bug in your implementation of the first overload. The docs say: "The first element of source is used as the initial aggregate value." – you are using the default value.


  2. “I still don’t really know why this difference in behaviour exists… it feels to me as if the first overload really should behave like the second one, just with a default initial seed.”

    For one, it makes multiplying all of the elements of a sequence together much easier:

    numbers.Aggregate((acc, n) => acc * n)

    If the initial seed was default(int), the result would be zero.


  3. Came here to note the already discovered oversight.

    …which brings me to an observation: I’m kinda wondering how useful unit testing is in this scenario. It’s not entirely useless, but the tests here pretty much repeat the logic of the functions under test – so that’s a fine way to catch typos and accidental errors, but that’s about it: you’re almost certainly not going to catch logic errors, since the level at which the assertions are phrased is so similar to that of the code itself. And of course, specification (interpretation) errors are hopeless too.


  4. @Eamon: Well, unit tests certainly *have* discovered bugs in the implementations. Not when I’ve misunderstood the intended behaviour to start with, of course, but at other times they’ve caught things.

    They serve a second purpose though – they can make the meaning of an operator clearer than just reading a description or seeing the implementation. It’s not the case for *all* tests of course, but it definitely is with some. Admittedly the examples I’m giving aren’t the best that can be chosen for this purpose – I *am* speeding through the operators to some extent, and choosing an example well can be a very time-consuming process. I think even my hastily-constructed examples can serve to some extent though.

    Your point is taken though, and I’ve seen it more explicitly in production code elsewhere – when it feels like writing the test and writing the code are performing the same task twice, that’s a definite warning sign.


  5. Jacob’s example of multiplication can easily be solved by using a seed of 1 – but there are other functions which can’t be, e.g. Math.Pow. If you want to aggregate exponentiation – e.g. [2, 3, 4] should return (2 ^ 3) ^ 4 – you can’t do that with any seed. (I realise that’s a very silly requirement though.)
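
    For instance:

    new double[] { 2, 3, 4 }.Aggregate(Math.Pow)

    evaluates Math.Pow(Math.Pow(2, 3), 4) = 4096, since Math.Pow already matches the Func<double, double, double> shape the unseeded overload needs.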

