Reimplementing LINQ to Objects: Part 7 – Count and LongCount

Today’s post covers two operators in one, because they’re so incredibly similar… to the point where the implementation of one is a cut-and-paste of the other, merely changing the name, return type, and a couple of variables.

What are they?

Count and LongCount each have two overloads: one with a predicate, and one without. Here are all four signatures:

public static int Count<TSource>(
    this IEnumerable<TSource> source)

public static int Count<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static long LongCount<TSource>(
    this IEnumerable<TSource> source)

public static long LongCount<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

As you can see, the LongCount signatures are identical to Count except in terms of their return types, which are long (Int64) instead of int (Int32).

The overloads without a predicate parameter simply return the number of items in the source collection; the ones with a predicate return the number of items for which that predicate returns true.
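As a quick illustration, here’s the sort of thing you might write once the operators are implemented (the values are made up purely for this example):

int[] values = { 1, 2, 3, 4, 5 };

int total = values.Count();                   // 5
int evens = values.Count(x => x % 2 == 0);    // 2
long totalAsLong = values.LongCount();        // 5, but as a long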

Interesting aspects of the behaviour:

  • These are all extension methods on IEnumerable<T> – you might argue that for the versions without a predicate, it would have been better to extend the non-generic IEnumerable, as nothing actually requires the element type.
  • Where there’s no predicate, Count is optimized for ICollection<T> and (in .NET 4) ICollection – both of which have Count properties which are expected to be faster than iterating over the entire collection. LongCount is not optimized in the same way in the .NET implementation – I’ll discuss this in the implementation section.
  • No optimization is performed in the overloads with predicates, as basically there’s no way of telling how many items will "pass" the predicate without testing them.
  • All methods use immediate execution – nothing is deferred. (If you think about it, there’s nothing which can be deferred here, when we’re just returning an int or a long.)
  • All arguments are validated simply by checking that they’re non-null.
  • Both methods should throw OverflowException when given a collection with more items than they can return the count of… though this is a considerably larger number in the case of LongCount than Count, of course.

What are we going to test?

In some senses, there are only two "success" tests involved here: one without a predicate and one with. Those are easy enough to deal with, but we also want to exercise the optimized paths. That’s actually trickier than it might sound, as we want to test four situations:

  • A source which implements both ICollection<T> and ICollection (easy: use List<T>)
  • A source which implements ICollection<T> but not ICollection (reasonably easy, after a little work finding a suitable type: use HashSet<T>)
  • A source which implements ICollection but not ICollection<T>, while still implementing IEnumerable<T> (so that we can extend it) – tricky…
  • A source which doesn’t implement ICollection or ICollection<T> (easy: use Enumerable.Range which we’ve already implemented)

The third bullet is the nasty one. Obviously there are plenty of types which implement ICollection but not ICollection<T> (e.g. ArrayList), but because such a type doesn’t implement IEnumerable<T>, we can’t call the Count extension method on it. In the end I wrote my own SemiGenericCollection class.
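I won’t reproduce the whole class here, but a minimal sketch of the idea looks something like this (the real test class may differ in its details; the important point is simply which interfaces it implements):

using System;
using System.Collections;
using System.Collections.Generic;

// A collection implementing ICollection and IEnumerable<int>,
// but deliberately *not* ICollection<int>.
public class SemiGenericCollection : ICollection, IEnumerable<int>
{
    private readonly List<int> list = new List<int>();

    public void Add(int item)
    {
        list.Add(item);
    }

    // ICollection members
    public int Count { get { return list.Count; } }
    public bool IsSynchronized { get { return false; } }
    public object SyncRoot { get { return list; } }

    public void CopyTo(Array array, int index)
    {
        ((ICollection) list).CopyTo(array, index);
    }

    // IEnumerable<int> and IEnumerable members
    public IEnumerator<int> GetEnumerator()
    {
        return list.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}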

Once we’ve got sources for all those tests, we need to decide what we’re actually testing about them. Arguably we should test that the result is optimized, for example by checking that we never really enumerate the collection. That would require writing custom collections with GetEnumerator() methods which threw exceptions, but still returned a count from the Count property. I haven’t gone this far, but it’s another step we certainly could take.
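If we did want to go that far, a hypothetical test double along these lines would do the job (this isn’t part of the real test code):

using System;
using System.Collections;
using System.Collections.Generic;

// Reports a Count, but throws if it's ever actually enumerated,
// so any fall-back to iteration fails the test.
public class NonEnumerableCollection<T> : ICollection<T>
{
    private readonly List<T> list;

    public NonEnumerableCollection(params T[] items)
    {
        list = new List<T>(items);
    }

    public int Count { get { return list.Count; } }
    public bool IsReadOnly { get { return true; } }

    public IEnumerator<T> GetEnumerator()
    {
        throw new InvalidOperationException("Should not be enumerated");
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        throw new InvalidOperationException("Should not be enumerated");
    }

    // The remaining ICollection<T> members aren't needed for counting.
    public void Add(T item) { throw new NotSupportedException(); }
    public void Clear() { throw new NotSupportedException(); }
    public bool Contains(T item) { return list.Contains(item); }
    public void CopyTo(T[] array, int arrayIndex) { list.CopyTo(array, arrayIndex); }
    public bool Remove(T item) { throw new NotSupportedException(); }
}

A test would then simply assert that new NonEnumerableCollection<int>(1, 2, 3).Count() returns 3.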

For the overloads which take predicates, we don’t need to worry about the various collection interfaces as we’re not optimizing anyway.

The failure cases for null arguments are very simple, but there’s one other case to consider: overflow. For Count, I’ve implemented a test case to verify the overflow behaviour. Unfortunately we can’t run this test in the Edulinq implementation yet, as it requires Enumerable.Concat, but here it is for the record anyway:

[Test]
[Ignore("Takes an enormous amount of time!")]
public void Overflow()
{
    var largeSequence = Enumerable.Range(0, int.MaxValue)
                                  .Concat(Enumerable.Range(0, 1));
    Assert.Throws<OverflowException>(() => largeSequence.Count());
}

This guards against a bad implementation which overflows by simply wrapping the counter round to Int32.MinValue instead of throwing an exception.

As you can see, this test will be disabled even when it’s uncommented after we implement Concat, as it requires counting up to 2 billion – not great for a quick set of unit tests. Even that isn’t too bad, however, compared with the equivalent in LongCount which would have to count 2^63 items. Generating such a sequence isn’t difficult, but iterating over it all would take a very long time. We also need an equivalent test for the overload with a predicate – something I neglected until writing up this blog post, and finding a bug in the implementation as I did so :)
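That extra test ends up looking much like the one above; something along these lines (the exact name in the real code may differ), again relying on Concat and again far too slow to run routinely:

[Test]
[Ignore("Takes an enormous amount of time!")]
public void OverflowWithPredicate()
{
    var largeSequence = Enumerable.Range(0, int.MaxValue)
                                  .Concat(Enumerable.Range(0, 1));
    Assert.Throws<OverflowException>(() => largeSequence.Count(x => x >= 0));
}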

For LongCount, I merely have an equivalent test to the above which checks that the same sequence can have its length expressed as a long value.
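In sketch form (again, the details may not match the real test exactly), that’s:

[Test]
[Ignore("Takes an enormous amount of time!")]
public void LargeSequenceCanBeCountedAsLong()
{
    var largeSequence = Enumerable.Range(0, int.MaxValue)
                                  .Concat(Enumerable.Range(0, 1));
    Assert.AreEqual(int.MaxValue + 1L, largeSequence.LongCount());
}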

Let’s implement them!

We’ll look at the overload with a predicate first, as it’s actually simpler:

public static int Count<TSource>(this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }

    // No way of optimizing this
    checked
    {
        int count = 0;
        foreach (TSource item in source)
        {
            if (predicate(item))
            {
                count++;
            }
        }
        return count;
    }
}

Note that this time we’re not using an iterator block (we’re not returning a sequence), so we don’t need to split the implementation into two different methods just to get eager argument validation.

After the argument validation, the main part of the method is reasonably simple, with one twist: we’re performing the whole iteration within a "checked" context. This means that if the increment of count overflows, it will throw an OverflowException instead of wrapping round to a negative number. There are some other alternatives here:

  • We could have made just the increment statement checked instead of the whole second part of the method
  • We could have explicitly tested for count == int.MaxValue before incrementing, and thrown an exception in that case
  • We could just build the whole assembly in a checked context

I think it’s useful for this section of code to be explicitly checked – it makes it obvious that it really is a requirement for general correctness. You may well prefer to make only the increment operation checked – I personally believe that the current approach draws more attention to the checked-ness, but it’s definitely a subjective matter. It’s also possible that an explicit check could be faster, although I doubt it – I haven’t benchmarked either approach.
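For comparison, here’s roughly what the first two alternatives would look like. This is just a sketch with hypothetical method names, not the code Edulinq actually uses:

// Alternative 1: make only the increment checked
private static int CountWithCheckedIncrement<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    int count = 0;
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            checked { count++; }
        }
    }
    return count;
}

// Alternative 2: explicit test before incrementing, with no checked arithmetic
private static int CountWithExplicitTest<TSource>(
    IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    int count = 0;
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            if (count == int.MaxValue)
            {
                throw new OverflowException("Count exceeded Int32.MaxValue");
            }
            count++;
        }
    }
    return count;
}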

Other than the predicate-specific parts, all the code from the real implementation above also appears in the optimized Count implementation – so I won’t discuss it again. Here’s the full method:

public static int Count<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    // Optimization for ICollection<T>
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        return genericCollection.Count;
    }

    // Optimization for ICollection
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        return nonGenericCollection.Count;
    }

    // Do it the slow way – and make sure we overflow appropriately
    checked
    {
        int count = 0;
        using (var iterator = source.GetEnumerator())
        {
            while (iterator.MoveNext())
            {
                count++;
            }
        }
        return count;
    }
}

The only "new" code here is the optimization. There are effectively two equivalent blocks, just testing for different collection interface types, and using whichever one it finds first (if any). I don’t know whether the .NET implementation tests for ICollection or ICollection<T> first – I could test it by implementing both interfaces but returning different counts from each, of course, but that’s probably overkill. It doesn’t really matter for well-behaved collections other than the slight performance difference – we want to test the "most likely" interface first, which I believe is the generic one.

To optimize or not to optimize?

The LongCount implementations are exactly the same as those for Count, except using long instead of int.

Notably, I still use optimizations for ICollection and ICollection<T> – but I don’t believe the .NET implementation does so. (It’s easy enough to tell by creating a huge list of bytes and comparing the time taken for Count and LongCount.)
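For completeness, here’s a sketch of the parameterless LongCount, which is essentially the Count code from above with the counter type changed (the predicate overload changes in the same way):

public static long LongCount<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    // Same optimizations as Count; see the surrounding discussion
    // for whether that's appropriate.
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        return genericCollection.Count;
    }
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        return nonGenericCollection.Count;
    }

    // Do it the slow way, making sure we overflow appropriately
    checked
    {
        long count = 0;
        using (var iterator = source.GetEnumerator())
        {
            while (iterator.MoveNext())
            {
                count++;
            }
        }
        return count;
    }
}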

There’s an argument for using Array.GetLongLength when the source is an array… but I don’t think the current CLR supports arrays with more than Int32.MaxValue elements anyway, so it’s a bit of a non-issue other than for future-proofing. Beyond that, I’m not sure why the .NET implementation isn’t optimized. It’s not clear what an ICollection/ICollection<T> implementation is meant to return from its Count property if it has more than Int32.MaxValue elements anyway, to be honest.

Suggestions as to what I should have done are welcome… but I should probably point out that LongCount is more likely to be used against Queryable than Enumerable – it’s easy to imagine a service representing a collection (such as a database table) which can quickly tell you the count even when it’s very large. I would imagine there are relatively few cases where you have a collection to evaluate in-process and really want to iterate through the whole lot just to get the count.

Conclusion

These are our first LINQ operators which return scalar values instead of sequences – with the natural consequence that they’re simpler to understand in terms of control flow and timing. The methods simply execute – possibly with some optimization – and return their result. Nice and simple. Even so, we’ve seen there can be a few interesting aspects to consider, including questions around optimization which don’t necessarily have a good answer.

Next time, I think I’ll implement Concat – mostly so that I can uncomment the overflow tests for Count. That’s going back to an operator which returns a sequence, but it’s a really simple one…

20 thoughts on “Reimplementing LINQ to Objects: Part 7 – Count and LongCount”

  1. One thing I would have liked to see in Linq is a sort of LazyCount. Instead of returning an int it would return a ‘Counter’, which enumerated only as many elements as necessary to check comparisons.

    So X.LazyCount produces a Counter, and X.LazyCount > 5 makes it count up to 6.

  2. I personally find the separation into Count() and LongCount() a bit awkward as it requires callers to decide ahead of time whether they anticipate more than int.MaxValue items to be in the sequence. The principal argument for doing so is that it avoids casts (which could lose significant digits) when passing such values to BCL methods, almost all of which accept ints. It’s hard to say which is better – as either choice presents tradeoffs.

  3. Did you really copy-paste the implementation for these two methods? I would have used a generic helper function templated on the return value (int or long).

  4. @Thomas

    Actually, Take(5).Any() will always return true if there is at least 1 item. It’s equivalent to just Any().

    When there are 5 or more items in the sequence, Take(5) will return the first 5. When there are less than 5, Take(5) will return them all.

  5. ICollection.Count is statically known to fail in the operating conditions LongCount is designed for.

    It’s exactly like trying to use a room thermometer on the surface of the Moon. The operating range is nowhere near the expected conditions, even though it is contained within those conditions.

  6. @Thomas/Patrick: Yes, I did mean to say Skip instead of Take.

    @Motti: You’d have to specify the “increment” operation as a delegate, and the whole thing would end up being harder to read than the cut and paste version, IMO. Later on I’ll show how a lot of these operators can be implemented just using Aggregate.

  7. @Mihailik: But if you’re given an IEnumerable[T] and you want to get its count, knowing that it *might* be a very long collection, it would be reasonable to call LongCount… even though the source might *actually* be a collection with less than Int32.MaxValue elements.

    In such a situation, wouldn’t you want it to be optimized?

    I view LongCount as being designed for situations where the count *may* be out of the bounds of Count, not just for situations where it’s known that it *will* be out of those bounds.

    If we can reasonably assume that any implementation of ICollection/ICollection[T] will fail if you try to store more than Int32.MaxValue elements in it, then I believe the optimization is entirely appropriate. (After all, the overhead of those checks is going to be minuscule compared with the time taken to iterate over a very long sequence…)

  8. On the one hand, it is reasonable to expect that if ICollection is implemented, it is implemented correctly (which implies Count is within int range).

    On the other hand, when in doubt we have to rank the facts.

    The fact that ICollection is implemented, and all the reasoning flowing out of it, is derived second-rank knowledge.

    What is most important is that a living human being explicitly asked for LongCount. You can’t type it in by mistake — if you use LongCount you genuinely expect cases where LongCount WILL be out of int range.

    So if inside LongCount we detect that the underlying collection implements a non-64-bit-enabled interface — the safest decision is to see it as a quirk of the implementation (who knows, backward compatibility or some other trick?) and not bet on Count being within int range.

    Not ideal, but the best possible given the options.

  9. @Mihailik: You call LongCount if you expect that the count *may* be out of the range of Int32, but that doesn’t mean you expect it to be out of the range of Int32 on *every single call*.

    You may be calling it on a sequence where you don’t know whether the count will be within the range or not… and on one call it may be on a huge collection, with another call being a more limited one. The collection in that first call may not implement ICollection[T], with the one in the second call implementing it absolutely normally.

    Your reasoning is assuming that any implementation of ICollection[T] passed into LongCount is flawed… whereas I prefer to assume that the interface has been appropriately implemented. As soon as you start assuming that *those* interfaces haven’t been implemented correctly, what else are you going to question? Maybe IEnumerable[T] hasn’t been implemented properly either…

  10. OK, say somebody called LongCount and apparently the collection implements ICollection. Now you can’t have it both ways:

    You can either ignore hints inferred from ICollection implementation.

    Or you can ignore the explicit intent of the live human who wrote the code.

    Surely the former is more sensible? Given all the cloudy uncertainty surrounding ICollection implementation for >int.MaxValue.

    BTW, the only implementation of ICollection in the BCL where Count can exceed the int range is in fact broken in .NET 3.5 (fixed in 4.0). Although you can’t use BitArray with Enumerable.LongCount anyway, as it doesn’t implement IEnumerable[T].

  11. @Mihailik: The explicit intent is that the passed sequence *may* have more than Int32.MaxValue items – but as they may well have just been passed the collection as a sequence to start with, I see no reason to think that they *know* it will have.

    Are you suggesting that the live human is really saying, “Please ignore any indication that the sequence implements the ICollection interface or its generic counterpart – I know that it’s broken”? That seems unlikely to me.

    I suspect we’ll have to agree to differ about this…

  12. Shouldn’t the checked {} block only wrap around the count++ statement, and not the for loop?

    Otherwise if I intentionally overflow in my MoveNext() implementation, let’s say I have a PRNG generator, then it would throw because of that, not because of count overflowing int.

  13. @jason: “checked” only affects the code generated by the compiler within the block. It doesn’t affect what happens within code which is *called* by that method. It’s not like it sets and then resets a flag which the CLR checks on overflow.

  14. @Payman: I didn’t imply that the max value was 2^63+1 – I implied that you’d have to count 2^63+1 values to make it fail. Actually it should be just 2^63 for the reasons you’ve given – so I’ve fixed it.

  15. Thanks, Jon!

    There’s a small detail which was glossed over but I think is actually worth mentioning explicitly: the code which iterates over the entire sequence is not identical in the case where there is no predicate and the case where there is one. Where there’s a predicate, a foreach loop is used, whereas where there isn’t, there’s a using statement with a while loop calling MoveNext. I suppose the reason is that the latter saves on the calls to Current (and possibly other overhead involved in the foreach mechanism), which are only required when checking a predicate.
