Reimplementing LINQ to Objects: Part 24 – ToArray

This really is a simple one. So simple I might even find the energy to implement ToDictionary as well tonight… we’ll see. (EDIT: Oops. It became slightly less simple in the end, as I came up with the third implementation. Oh well.)

What is it?

ToArray is basically the equivalent of ToList, but producing an array instead of a list. It has a single signature:

public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)

Just to recap:

  • It’s another extension method, which is useful when we want to use type inference.
  • It uses immediate execution – nothing is deferred, and the entire sequence is read before the method completes.
  • It performs simple argument validation: source can’t be null
  • It creates an independent but shallow copy of the collection (so any changes to the objects referenced by source will be visible in the result, but not changes to source itself – and vice versa)
  • It’s optimized for ICollection<T>, although in a slightly different way to ToList.

What are we going to test?

Exactly the same as we did for ToList, basically – with one bit of care required. In our test for the source and result being independent of each other, we can’t just create a new variable of type List<T> and call ToArray on it – because that would call the implementation in List<T> itself. I reckoned the easiest way of making this clear was to call the method directly as a normal static method, instead of using it as an extension method:

[Test]
public void ResultIsIndependentOfSource()
{
    List<string> source = new List<string> { "xyz", "abc" };
    // Make it obvious we’re not calling List<T>.ToArray
    string[] result = Enumerable.ToArray(source);
    result.AssertSequenceEqual("xyz", "abc");

    // Change the source: result won’t have changed
    source[0] = "xxx";
    Assert.AreEqual("xyz", result[0]);

    // And the reverse
    result[1] = "yyy";
    Assert.AreEqual("abc", source[1]);
}

One other interesting facet of the testing is that we can only partially test the optimization for ICollection<T>. We can make sure that it’s optimized as far as not using the iterator (as per the previous test), but there’s more optimization coming, and I haven’t worked out how to test for that yet. You’ll see what I mean when we come to implement it. Speaking of which…

Let’s implement it!

We can make all our tests pass with a cheat: whatever the input type, convert it to a List<T> and then call ToArray on the result. Heck, if we call ToList to do the initial conversion, that will even do the argument validation for us and use the right parameter name in the exception:

// Implementation by Mr Cheaty MacCheaterson
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    return source.ToList().ToArray();
}

Now I’m not averse to the general approach taken here – but there’s actually a bit more optimization we can do.

Remember how ICollection<T> exposes Count and CopyTo, which the List<T> constructor uses when the input implements ICollection<T>? Well, that means that building a List<T> is relatively cheap for a collection – but calling ToArray on the list will still mean copying all the data out again (as List<T>.ToArray can’t just return its own internal array – it has to create a copy). We can use exactly the same members ourselves, and avoid one level of copying:

// Somewhat better… though still not quite ideal
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
    {
        TSource[] ret = new TSource[collection.Count];
        collection.CopyTo(ret, 0);
        return ret;
    }
    return new List<TSource>(source).ToArray();
}

That’s pretty good now – except it still involves a copy from the List<T> into a new array every time. That’s almost always appropriate, in fact… because unless the resulting list happened to expand its array to exactly the right size, we’d need to make a copy anyway. After all, we can’t return an array that’s too big. However, we can optimize for the "just the right size" case if we basically implement List<T>’s array expansion ourselves, leading to this:

public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    // Optimize for ICollection<T>
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
    {
        TSource[] tmp = new TSource[collection.Count];
        collection.CopyTo(tmp, 0);
        return tmp;
    }
            
    // We’ll have to loop through, creating and copying arrays as we go
    TSource[] ret = new TSource[16];
    int count = 0;
    foreach (TSource item in source)
    {
        // Need to expand…
        if (count == ret.Length)
        {
            Array.Resize(ref ret, ret.Length * 2);
        }
        ret[count++] = item;
    }

    // Now create another copy if we have to, in order to get an array of the
    // right size
    if (count != ret.Length)
    {
        Array.Resize(ref ret, count);
    }
    return ret;
}

Is this level of optimization worth it? Probably not. I picked the starting size of 16 out of thin air (or dimly recalled initial counts for some collection or other – possibly Java’s ArrayList<T>). Maybe we should triple the capacity rather than double it, just for laughs. It’s all guesswork, really. The middle implementation feels like a more appropriate one to me, but with an obvious optimization just itching to be implemented, I thought I might as well provide the code. It’s reasonably obviously correct, but it’s just a bit longwinded for the marginal benefit over our second attempt.

It’s even more annoying that I can’t think of a way to test this easily – I could benchmark it of course, but that’s not the same as unit testing it… I can’t easily prove that we’re optimizing either the ICollection<T> or "correct guess at size" cases.

Conclusion

It’s always interesting to see what else springs to mind when I’m writing up the operator as opposed to just implementing it in Visual Studio. I’d got as far as the second implementation but not the third when I started this post.

It’s possible that in the end, the 4-overload ToDictionary operator will actually end up being simpler than ToArray. Who’d have thought?

9 thoughts on “Reimplementing LINQ to Objects: Part 24 – ToArray”

  1. The factor used to grow an array can also be fiddled with, Microsoft’s STL uses 1.5 (apparently they did a lot of benchmarking). This also reminds me of cow-orker of mine who grew his container by a factor of 2.54 with the enlightening comment of “Grow by one inch”.

    Like

  2. @Motti: That’s what I meant by my “triple” comment. Of course, the data being used for benchmarking back in the days of STL may not be appropriate for modern workloads and CPUs (cache sizes etc). I believe List doubles, and that’s good enough for me :)

    Like

  3. If you want to optimize for the just right case you optimize for a very low probability case. When we assume an gaussian distribution around 4 (this is what List picks as default capacity) and assume FWHM of 4 we get a sigma of 1.7 and therefore a distribution where 99% of all list sizes are centered around 4 +- 5,1 (=3*sigma).
    So from a pure statistical perspective you are optimizing for a < 0.1% case.
    Of course I did assme some things like FWHM being 4 since and a gaussian distribution but the point is that if you really want to optimize this thing for a measurable effect then you should first count the number of elements and then copy it in one step into the right sized array.
    But then we have possibly optimized for memory consumption at the cost of more CPU utilization. As long as there is no real problem I would not optimize for one or the other since LINQ is not something which does play very well in highly optimized code where every CPU cycle counts as Joe Duffy eloquently explains at his blog: http://www.bluebytesoftware.com/blog/2010/09/06/ThePrematureOptimizationIsEvilMyth.aspx

    Yours,
    Alois Kraus

    Like

  4. I’ve never liked the CLR’s arrays. The weird dynamic polymorphism, that it’s only allowed for reference types, the difficulty building them and the impossibility of embedding them in structures (which matters for value types, eg Matrix) makes the whole thing a pain to deal with. It’s more annoying since a lot if it (looks like it) could be fixed back-compatibly, although I do have to say I do prefer that they are doing async. :)

    Like

  5. @Alois: counting the number of elements and then copying requires reading the sequence twice, which is a bad idea.

    @Simon: I think you’re talking about array covariance, which is indeed annoying. You *can* embed fixed sized arrays directly in value types in unsafe code, as of C# 2. Can’t say I’ve ever done it…

    Like

  6. Why are you testing against a situation that is completely impossible to create?
    (// Change the source: result won’t have changed)

    Also, you can test for the CopyTo optimization by implementing ICollection and saving a reference to the array passed to CopyTo().

    Like

  7. @SLaks: The test is to show the contrast between this and (say) Select where the result and the source are still “connected”. It’s the difference between a query and a result, effectively.

    Yes, as soon as you stop and think about it it’s almost bound to be true – but that doesn’t mean it’s not worth pointing out.

    Good idea about storing the array for the CopyTo optimization – not sure it’s worth implementing, but it’s neat anyway. I’ll include an edit to mention it.

    Like

  8. It was really interesting to see your third optimization. Did you consult Reflector, by any chance? The reason I ask is that I recall looking into the ToArray method a while back (I was specifically curious if it just called ToList().ToArray() and therefore if ToList were in fact always more efficient), and I was surprised and intrigued to find the Buffer class, an internal type almost identical to List except that ToArray returns its own private array in the event its size is perfect. So it appears that the developer who wrote ToArray had the same idea you did!

    Like

Leave a comment