Reimplementing LINQ to Objects: Part 9 – SelectMany

The next operator we’ll implement is actually the most important in the whole of LINQ. Most (all?) of the other operators returning sequences can be implemented via SelectMany. We’ll have a look at that at the end of this post, but let’s implement it first.

What is it?

SelectMany has 4 overloads, which look gradually more and more scary:

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TResult>> selector)

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, int, IEnumerable<TResult>> selector)

public static IEnumerable<TResult> SelectMany<TSource, TCollection, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)

public static IEnumerable<TResult> SelectMany<TSource, TCollection, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, int, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)

These aren’t too bad though. Really they’re just variations of the same operation, with two "optional" bits.

In every case, we start with an input sequence. We generate a subsequence from each element in the input sequence using a delegate which can optionally take a parameter with the index of the element within the original collection.

Now, we either return each element from each subsequence directly, or we apply another delegate which takes the original element in the input sequence and the element within the subsequence.

In my experience, uses of the overloads where the original selector delegate uses the index are pretty rare – but the others (the first and the third in the list above) are fairly common. In particular, the C# compiler uses the third overload whenever it comes across a "from" clause in a query expression, other than the very first "from" clause.

It helps to put this into a bit more context. Suppose we have a query expression like this:

var query = from file in Directory.GetFiles("logs")
            from line in File.ReadLines(file)
            select Path.GetFileName(file) + ": " + line;

That would be converted into a "normal" call like this:

var query = Directory.GetFiles("logs")
                     .SelectMany(file => File.ReadLines(file),
                                 (file, line) => Path.GetFileName(file) + ": " + line);

In this case the compiler has used our final "select" clause as the projection; if the query expression had continued with "where" clauses etc, it would have created a projection to just pass along "file" and "line" in an anonymous type. This is probably the most confusing bit of the query translation process, involving transparent identifiers. For the moment we’ll stick with the simple version above.
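
For example, if the query had continued with a "where" clause, like this:

var query = from file in Directory.GetFiles("logs")
            from line in File.ReadLines(file)
            where line.Length > 0
            select Path.GetFileName(file) + ": " + line;

the translation would look something like this sketch (the compiler’s generated identifier names are different and unspeakable, but the shape is right):

var query = Directory.GetFiles("logs")
                     .SelectMany(file => File.ReadLines(file),
                                 (file, line) => new { file, line })
                     .Where(tmp => tmp.line.Length > 0)
                     .Select(tmp => Path.GetFileName(tmp.file) + ": " + tmp.line);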

So, the SelectMany call above has three arguments really:

  • The source, which is a list of strings (the filenames returned from Directory.GetFiles)
  • An initial projection which converts from a single filename to a list of the lines of text within that file
  • A final projection which converts a (file, line) pair into a single string, just by separating them with ": ".

The result is a single sequence of strings – every line of every log file, prefixed with the filename in which it appeared. So writing out the results of the query might give output like this:

test1.log: foo
test1.log: bar
test1.log: baz
test2.log: Second log file
test2.log: Another line from the second log file

It can take a little while to get your head round SelectMany – at least it did for me – but it’s a really important one to understand.

A few more details of the behaviour before we go into testing:

  • The arguments are validated eagerly – everything has to be non-null.
  • Everything is streamed. So only one element of the input is read at a time, and then a subsequence is produced from that. Only one element is then read from the subsequence at a time, yielding the results as we go, before we move onto the next input element and thus the next subsequence etc.
  • Every iterator is closed when it’s finished with, just as you’d expect by now.

What are we going to test?

I’m afraid I’ve become lazy by this point. I can’t face writing yet more tests for null arguments. I’ve written a single test for each of the overloads. I found it hard to come up with a clear way of writing the tests, but here’s one example, for the most complicated overload:

[Test]
public void FlattenWithProjectionAndIndex()
{
    int[] numbers = { 3, 5, 20, 15 };
    var query = numbers.SelectMany((x, index) => (x + index).ToString().ToCharArray(),
                                   (x, c) => x + ": " + c);
    // 3 => "3: 3"
    // 5 => "5: 6"
    // 20 => "20: 2", "20: 2"
    // 15 => "15: 1", "15: 8"
    query.AssertSequenceEqual("3: 3", "5: 6", "20: 2", "20: 2", "15: 1", "15: 8");
}

So, to give a bit more explanation to this:

  • Each number is summed with its index (3+0, 5+1, 20+2, 15+3)
  • Each sum is turned into a string, and then converted into a char array. (We don’t really need the ToCharArray call, as string implements IEnumerable<char> already, but I thought it made it clearer.)
  • We combine each subsequence character with the original element it came from, in the form: "original value: subsequence character"

The comment shows the eventual results from each input, and the final test shows the complete result sequence.

Clear as mud? Hopefully it’s not too bad when you look at each step in turn. Okay, now let’s make it pass…

Let’s implement it!

We could implement the first three overloads in terms of calls to the final one – or more likely, a single "Impl" method without argument validation, called by all four public methods. For example, the simplest method could be implemented like this:

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TResult>> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    return SelectManyImpl(source,
                          (value, index) => selector(value),
                          (originalElement, subsequenceElement) => subsequenceElement);
}

However, I’ve decided to implement each of the methods separately – splitting them into the public extension method and a "SelectManyImpl" method with the same signature each time. I think that would make it simpler to step through the code if there were ever any problems… and it allows us to see the differences between the simplest and most complicated versions, too:

// Simplest overload
private static IEnumerable<TResult> SelectManyImpl<TSource, TResult>(
    IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TResult>> selector)
{
    foreach (TSource item in source)
    {
        foreach (TResult result in selector(item))
        {
            yield return result;
        }
    }
}

// Most complicated overload:
// – Original projection takes index as well as value
// – There’s a second projection for each original/subsequence element pair
private static IEnumerable<TResult> SelectManyImpl<TSource, TCollection, TResult>(
    IEnumerable<TSource> source,
    Func<TSource, int, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)
{
    int index = 0;
    foreach (TSource item in source)
    {
        foreach (TCollection collectionItem in collectionSelector(item, index++))
        {
            yield return resultSelector(item, collectionItem);
        }
    }
}

The correspondence between the two methods is pretty clear… but I find it helpful to have the first form, so that if I ever get confused about the fundamental point of SelectMany, it’s really easy to understand it based on the simple overload. It’s then not too big a jump to apply the two extra "optional" complications, and end up with the final method. The simple overload acts as a conceptual stepping stone, in a way.

Two minor points to note:

  • The first method could have been implemented with a "yield foreach selector(item)" if such an expression existed in C#. Using a similar construct in the more complicated form would be harder, and involve another call to Select, I suspect… probably more hassle than it would be worth. (There’s a sketch of what that might look like just after this list.)
  • I’m not explicitly using a "checked" block in the second form, even though "index" could overflow. I haven’t looked to see what the BCL does in this situation, to be honest – I think it unlikely that it will come up. For consistency I should probably use a checked block on every method which uses an index like this… or just turn arithmetic checking on for the whole assembly.
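
Just to illustrate that first point, with a hypothetical "yield foreach" the body of the complicated overload might collapse to something like this (imaginary syntax, obviously – and the Select call is the extra wrinkle mentioned above):

int index = 0;
foreach (TSource item in source)
{
    // Project each subsequence element to its final form, then yield the lot.
    yield foreach collectionSelector(item, index++)
                      .Select(c => resultSelector(item, c));
}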

Reimplementing operators using SelectMany

I mentioned early on in this post that many of the LINQ operators can be implemented via SelectMany. Just as a quick example of this, here are alternative implementations of Select, Where and Concat:

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    return source.SelectMany(x => Enumerable.Repeat(selector(x), 1));
}

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return source.SelectMany(x => Enumerable.Repeat(x, predicate(x) ? 1 : 0));
}

public static IEnumerable<TSource> Concat<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    return new[] { first, second }.SelectMany(x => x);
}

Select and Where use Enumerable.Repeat as a convenient way of creating a sequence with either a single element or none. You could alternatively create a new array instead. Concat just uses an array directly: if you think of SelectMany in terms of its flattening operation, Concat is a really natural fit. I suspect that Empty and Repeat are probably feasible with recursion, although the performance would become absolutely horrible.
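Just to illustrate the recursion point, here’s a sketch of Repeat built on SelectMany – with no argument validation, and emphatically not something you’d want to use: each call recurses eagerly, so a large count would blow the stack before iteration even started, and every level adds another layer of iterators.

public static IEnumerable<TResult> Repeat<TResult>(TResult element, int count)
{
    // Flatten a two-element sequence of sequences: the single element, then the rest.
    return count == 0
        ? new TResult[0]
        : new[] { new TResult[] { element }, Repeat(element, count - 1) }
              .SelectMany(x => x);
}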

Currently the above implementations are in the code using conditional compilation. If this becomes a popular thing for me to implement, I might consider breaking it into a separate project. Let me know what you think – my gut feeling is that we won’t actually gain much more insight than the above methods give us… just showing how flexible SelectMany is.

SelectMany is also important in a theoretical way, in that it’s what provides the monadic nature of LINQ. It’s the "Bind" operation in the LINQ monad. I don’t intend to say any more than that on the topic – read Wes Dyer’s blog post for more details, or just search for "bind monad SelectMany" for plenty of posts from people smarter than myself.

Conclusion

SelectMany is one of LINQ’s fundamental operations, and at first sight it’s a fearsome beast. As soon as you understand that the basic operation is a flattening projection just with a couple of optional twiddles, it’s easily tamed.

Next up I’ll implement All and Any – which are nice and easy to describe by comparison.

Reimplementing LINQ to Objects: Part 8 – Concat

After our quick visit to scalar return types with Count and LongCount, we’re back to an operator returning a sequence: Concat.

What is it?

Concat only has a single signature, which makes life simple:

public static IEnumerable<TSource> Concat<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)

The return value is simply a sequence containing the elements of the first sequence followed by the elements of the second sequence – the concatenation of the two sequences.

I sometimes think it’s a pity that there aren’t Prepend/Append methods which do the same thing but for a single extra element – this would be quite useful in situations such as having a drop down list of countries with an extra option of "None". It’s easy enough to use Concat for this purpose by creating a single-element array, but specific methods would be more readable in my view. MoreLINQ has extra Concat methods for this purpose, but Edulinq is only meant to implement the methods already in LINQ to Objects.
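
For the record, such methods are trivial to build out of Concat – something like this sketch (argument validation omitted):

public static IEnumerable<T> Append<T>(this IEnumerable<T> source, T element)
{
    return source.Concat(new[] { element });
}

public static IEnumerable<T> Prepend<T>(this IEnumerable<T> source, T element)
{
    return new[] { element }.Concat(source);
}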

As ever, some notes on the behaviour of Concat:

  • Arguments are validated eagerly: they must both be non-null
  • The result uses deferred execution: other than validation, the arguments aren’t used when the method is first called
  • Each sequence is only evaluated when it needs to be… if you stop iterating over the output sequence before the first input has been exhausted, the second input will remain unused

That’s basically it.

What are we going to test?

The actual concatenation part of the behaviour is very easy to test in a single example – we could potentially also demonstrate concatenation using empty sequences, but there’s no reason to suspect they would fail.

The argument validation is tested in the same way as normal, by calling the method with invalid arguments but not attempting to use the returned query.

Finally, there are a couple of tests to indicate the point at which each input sequence is used. This is achieved using the ThrowingEnumerable we originally used in the Where tests:

[Test]
public void FirstSequenceIsntAccessedBeforeFirstUse()
{
    IEnumerable<int> first = new ThrowingEnumerable();
    IEnumerable<int> second = new int[] { 5 };
    // No exception yet…
    var query = first.Concat(second);
    // Still no exception…
    using (var iterator = query.GetEnumerator())
    {
        // Now it will go bang
        Assert.Throws<InvalidOperationException>(() => iterator.MoveNext());
    }
}

[Test]
public void SecondSequenceIsntAccessedBeforeFirstUse()
{
    IEnumerable<int> first = new int[] { 5 };
    IEnumerable<int> second = new ThrowingEnumerable();
    // No exception yet…
    var query = first.Concat(second);
    // Still no exception…
    using (var iterator = query.GetEnumerator())
    {
        // First element is fine…
        Assert.IsTrue(iterator.MoveNext());
        Assert.AreEqual(5, iterator.Current);
        // Now it will go bang, as we move into the second sequence
        Assert.Throws<InvalidOperationException>(() => iterator.MoveNext());
    }
}

I haven’t written tests to check that iterators are disposed, etc – but each input sequence’s iterator should be disposed appropriately. In particular, it’s natural for the first sequence’s iterator to be disposed before the second sequence is iterated over at all.

Let’s implement it!

The implementation is reasonably simple, but it does make me hanker after F#… it’s the normal split between argument validation and iterator block implementation, but each part is really simple:

public static IEnumerable<TSource> Concat<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    return ConcatImpl(first, second);
}

private static IEnumerable<TSource> ConcatImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    foreach (TSource item in first)
    {
        yield return item;
    }
    foreach (TSource item in second)
    {
        yield return item;
    }
}

It’s worth just remembering at this point that this would still have been very annoying to implement without iterator blocks. Not really difficult as such, but we’d have had to remember which sequence we were currently iterating over (if any) and so on.

However, using F# we could have made this even simpler with the yield! expression, which yields a whole sequence instead of a single item. Admittedly in this case there aren’t significant performance benefits to using yield! (which there certainly can be in recursive situations) but it would just be more elegant to have the ability to yield an entire sequence in one statement. (Spec# has a similar construct called nested iterators, expressed using yield foreach.) I’m not going to pretend to know enough about the details of either F# or Spec# to draw more detailed comparisons, but we’ll see the pattern of "foreach item in a collection, yield the item" several more times before we’re done. Remember that we can’t extract that into a library method, as the "yield" expression needs special treatment by the C# compiler.

Conclusion

Even when presented with a simple implementation, I can still find room to gripe :) It would be nice to have nested iterators in C#, but to be honest the number of times I find myself frustrated by their absence is pretty small.

Concat is a useful operator, but it’s really only a very simple specialization of another operator: SelectMany. After all, Concat just flattens two sequences into one… whereas SelectMany can flatten a whole sequence of sequences, with even more generality available when required. I’ll implement SelectMany next, and show a few examples of how other operators can be implemented simply in terms of SelectMany. (We’ll see the same sort of ability for operators returning a single value when we implement Aggregate.)

Addendum: avoiding holding onto references unnecessarily

A comment suggested that we should set first to "null" after we’ve used it. That way, as soon as we’ve finished iterating over the collection, it may be eligible for garbage collection. That leads to an implementation like this:

private static IEnumerable<TSource> ConcatImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    foreach (TSource item in first)
    {
        yield return item;
    }
    // Avoid hanging onto a reference we don’t really need
    first = null;
    foreach (TSource item in second)
    {
        yield return item;
    }
}

Now normally I’d say this wouldn’t actually help – setting a local variable to null when it’s not used in the rest of the method doesn’t actually make any difference when the CLR is running in optimized mode, without a debugger attached: the garbage collector only cares about variables which might still be accessed in the rest of the method.

In this case, however, it makes a difference – because this isn’t a normal local variable. It ends up as an instance variable in the hidden class generated by the C# compiler… and the CLR can’t tell that the instance variable will never be used again.

Arguably we could remove our only reference to "first" at the start of GetEnumerator. We could write a method of the form:

public static T ReturnAndSetToNull<T>(ref T value) where T : class
{
    T tmp = value;
    value = null;
    return tmp;
}

and then call it like this:

foreach (TSource item in ReturnAndSetToNull(ref first))

I would certainly consider that overkill, particularly as it seems very likely that the iterator will still have a reference to the collection itself – but simply setting "first" to null after iterating over it makes sense to me.

I don’t believe that the "real" LINQ to Objects implementation does this, mind you. (At some point I’ll test it with a collection which has a finalizer.)

Reimplementing LINQ to Objects: Part 7 – Count and LongCount

Today’s post covers two operators in one, because they’re so incredibly similar… to the point that one implementation is a cut and paste of the other, merely changing the name, the return type, and a couple of variables.

What are they?

Count and LongCount each have two overloads: one with a predicate, and one without. Here are all four signatures:

public static int Count<TSource>(
    this IEnumerable<TSource> source)

public static int Count<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static long LongCount<TSource>(
    this IEnumerable<TSource> source)

public static long LongCount<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

As you can see, the LongCount signatures are identical to Count except in terms of their return types, which are long (Int64) instead of int (Int32).

The overloads without a predicate parameter simply return the number of items in the source collection; the ones with a predicate return the number of items for which that predicate returns true.

Interesting aspects of the behaviour:

  • These are all extension methods on IEnumerable<T> – you might argue that for the versions without a predicate, it would have been better to extend the non-generic IEnumerable, as nothing actually requires the element type.
  • Where there’s no predicate, Count is optimized for ICollection<T> and (in .NET 4) ICollection – both of which have Count properties which are expected to be faster than iterating over the entire collection. LongCount is not optimized in the same way in the .NET implementation – I’ll discuss this in the implementation section.
  • No optimization is performed in the overloads with predicates, as basically there’s no way of telling how many items will "pass" the predicate without testing them.
  • All methods use immediate execution – nothing is deferred. (If you think about it, there’s nothing which can be deferred here, when we’re just returning an int or a long.)
  • All arguments are validated simply by testing they’re non-null
  • Both methods should throw OverflowException when given a collection with more items than they can return the count of… though this is a considerably larger number in the case of LongCount than Count, of course.

What are we going to test?

In some senses, there are only two "success" tests involved here: one without a predicate and one with. Those are easy enough to deal with, but we also want to exercise the optimized paths. That’s actually trickier than it might sound, as we want to test four situations:

  • A source which implements both ICollection<T> and ICollection (easy: use List<T>)
  • A source which implements ICollection<T> but not ICollection (reasonably easy, after a little work finding a suitable type: use HashSet<T>)
  • A source which implements ICollection but not ICollection<T> but still implements IEnumerable<T> (so that we can extend it) – tricky…
  • A source which doesn’t implement ICollection or ICollection<T> (easy: use Enumerable.Range which we’ve already implemented)

The third bullet is the nasty one. Obviously there are plenty of implementations of ICollection but not ICollection<T> (e.g. ArrayList) but because it doesn’t implement IEnumerable<T>, we can’t call the Count extension method on it. In the end I wrote my own SemiGenericCollection class.
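
Such a class might look something like this (a sketch – the real SemiGenericCollection in the Edulinq source may differ in its details; it assumes the usual System.Collections and System.Collections.Generic using directives):

// Implements the non-generic ICollection and IEnumerable<int>, but not ICollection<int>
class SemiGenericCollection : ICollection, IEnumerable<int>
{
    private readonly List<int> list = new List<int>();

    public void Add(int item)
    {
        list.Add(item);
    }

    // ICollection members
    public int Count { get { return list.Count; } }
    public object SyncRoot { get { return this; } }
    public bool IsSynchronized { get { return false; } }

    public void CopyTo(Array array, int index)
    {
        ((ICollection)list).CopyTo(array, index);
    }

    // IEnumerable<int> (and the non-generic IEnumerable it inherits)
    public IEnumerator<int> GetEnumerator()
    {
        return list.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}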

Once we’ve got sources for all those tests, we need to decide what we’re actually testing about them. Arguably we should test that the result is optimized, for example by checking that we never really enumerate the collection. That would require writing custom collections with GetEnumerator() methods which threw exceptions, but still returned a count from the Count property. I haven’t gone this far, but it’s another step we certainly could take.

For the overloads which take predicates, we don’t need to worry about the various collection interfaces as we’re not optimizing anyway.

The failure cases for null arguments are very simple, but there’s one other case to consider: overflow. For Count, I’ve implemented a test case to verify the overflow behaviour. Unfortunately we can’t run this test in the Edulinq implementation yet, as it requires Enumerable.Concat, but here it is for the record anyway:

[Test]
[Ignore("Takes an enormous amount of time!")]
public void Overflow()
{
    var largeSequence = Enumerable.Range(0, int.MaxValue)
                                  .Concat(Enumerable.Range(0, 1));
    Assert.Throws<OverflowException>(() => largeSequence.Count());
}

This guards against a bad implementation which overflows by simply wrapping the counter round to Int32.MinValue instead of throwing an exception.

As you can see, this test will be disabled even when it’s uncommented after we implement Concat, as it requires counting up to 2 billion – not great for a quick set of unit tests. Even that isn’t too bad, however, compared with the equivalent in LongCount which would have to count 2^63 items. Generating such a sequence isn’t difficult, but iterating over it all would take a very long time. We also need an equivalent test for the overload with a predicate – something I neglected until writing up this blog post, and finding a bug in the implementation as I did so :)

For LongCount, I merely have an equivalent test to the above which checks that the same sequence can have its length expressed as a long value.
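
That test looks something like this sketch (the exact name and assertion in the project may differ):

[Test]
[Ignore("Takes an enormous amount of time!")]
public void LargeSequence()
{
    var largeSequence = Enumerable.Range(0, int.MaxValue)
                                  .Concat(Enumerable.Range(0, 1));
    Assert.AreEqual(int.MaxValue + 1L, largeSequence.LongCount());
}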

Let’s implement them!

We’ll look at the overload which does have a predicate first – as it’s actually simpler:

public static int Count<TSource>(this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }

    // No way of optimizing this
    checked
    {
        int count = 0;
        foreach (TSource item in source)
        {
            if (predicate(item))
            {
                count++;
            }
        }
        return count;
    }
}

Note that this time we’re not using an iterator block (we’re not returning a sequence), so we don’t need to split the implementation into two different methods just to get eager argument validation.

After the argument validation, the main part of the method is reasonably simple, with one twist: we’re performing the whole iteration within a "checked" context. This means that if the increment of count overflows, it will throw an OverflowException instead of wrapping round to a negative number. There are some other alternatives here:

  • We could have made just the increment statement checked instead of the whole second part of the method
  • We could have explicitly tested for count == int.MaxValue before incrementing, and thrown an exception in that case
  • We could just build the whole assembly in a checked context

I think it’s useful for this section of code to be explicitly checked – it makes it obvious that it really is a requirement for general correctness. You may well prefer to make only the increment operation checked – I personally believe that the current approach draws more attention to the checked-ness, but it’s definitely a subjective matter. It’s also possible that an explicit check could be faster, although I doubt it – I haven’t benchmarked either approach.
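
For comparison, the explicit-test alternative from the second bullet might look something like this (a sketch; the exception message is mine):

int count = 0;
foreach (TSource item in source)
{
    if (predicate(item))
    {
        // Test before incrementing, so we never wrap round
        if (count == int.MaxValue)
        {
            throw new OverflowException("Count exceeded Int32.MaxValue");
        }
        count++;
    }
}
return count;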

Other than the predicate-specific parts, all the above code also appears in the optimized Count implementation – so I won’t discuss those again. Here’s the full method:

public static int Count<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    // Optimization for ICollection<T>
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        return genericCollection.Count;
    }

    // Optimization for ICollection
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        return nonGenericCollection.Count;
    }

    // Do it the slow way – and make sure we overflow appropriately
    checked
    {
        int count = 0;
        using (var iterator = source.GetEnumerator())
        {
            while (iterator.MoveNext())
            {
                count++;
            }
        }
        return count;
    }
}

The only "new" code here is the optimization. There are effectively two equivalent blocks, just testing for different collection interface types, and using whichever one it finds first (if any). I don’t know whether the .NET implementation tests for ICollection or ICollection<T> first – I could test it by implementing both interfaces but returning different counts from each, of course, but that’s probably overkill. It doesn’t really matter for well-behaved collections other than the slight performance difference – we want to test the "most likely" interface first, which I believe is the generic one.

To optimize or not to optimize?

The LongCount implementations are exactly the same as those for Count, except using long instead of int.

Notably, I still use optimizations for ICollection and ICollection<T> – but I don’t believe the .NET implementation does so. (It’s easy enough to tell by creating a huge list of bytes and comparing the time taken for Count and LongCount.)
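
For completeness, here’s a sketch of the parameterless LongCount – it really is just Count with a larger counter, and the optimized paths simply return the int Count property value, implicitly converted to long:

public static long LongCount<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    // Optimization for ICollection<T>
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        return genericCollection.Count;
    }

    // Optimization for ICollection
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        return nonGenericCollection.Count;
    }

    // Count the slow way, throwing OverflowException rather than wrapping
    checked
    {
        long count = 0;
        using (var iterator = source.GetEnumerator())
        {
            while (iterator.MoveNext())
            {
                count++;
            }
        }
        return count;
    }
}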

There’s an argument for using Array.GetLongLength when the source is an array… but I don’t think the current CLR supports arrays with more than Int32.MaxValue elements anyway, so it’s a bit of a non-issue other than for future-proofing. Beyond that, I’m not sure why the .NET implementation isn’t optimized. It’s not clear what an ICollection/ICollection<T> implementation is meant to return from its Count property if it has more than Int32.MaxValue elements anyway, to be honest.

Suggestions as to what I should have done are welcome… but I should probably point out that LongCount is more likely to be used against Queryable than Enumerable – it’s easy to imagine a service representing a collection (such as a database table) which can quickly tell you the count even when it’s very large. I would imagine that there are relatively few cases where you have a collection to evaluate in-process where you really just want to iterate through the whole lot just to get the count.

Conclusion

These are our first LINQ operators which return scalar values instead of sequences – with the natural consequence that they’re simpler to understand in terms of control flow and timing. The methods simply execute – possibly with some optimization – and return their result. Nice and simple. Even so, we’ve seen there can be a few interesting aspects to consider, including questions around optimization which don’t necessarily have a good answer.

Next time, I think I’ll implement Concat – mostly so that I can uncomment the overflow tests for Count. That’s going back to an operator which returns a sequence, but it’s a really simple one…

Reimplementing LINQ to Objects: Part 6 – Repeat

A trivial method next, with even less to talk about than "Empty"… "Repeat". This blog post is merely a matter of completeness.

What is it?

"Repeat" is a static, generic non-extension method with a single overload:

public static IEnumerable<TResult> Repeat<TResult>(
    TResult element,
    int count)

It simply returns a sequence which contains the specified element, repeated "count" times. The only argument validation is that "count" has to be non-negative.

What are we going to test?

There’s really not a lot to test here. I’ve thought of 4 different scenarios:

  • A vanilla "repeat a string 3 times" sequence
  • An empty sequence (repeat an element 0 times)
  • A sequence containing null values (just to prove that "element" can be null)
  • A negative count to prove that argument validation occurs, and does so eagerly.

None of this is remotely exciting, I’m afraid.
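
For example, the vanilla case can be covered with a single test like this (a sketch, using a RepeatClass alias in the same way as the RangeClass alias described in part 4):

[Test]
public void SimpleRepetition()
{
    RepeatClass.Repeat("string", 3).AssertSequenceEqual("string", "string", "string");
}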

Let’s implement it!

Just about the only thing we could do wrong here is to put the argument validation directly in an iterator block… and we’ve implemented the "split method" pattern so many times already that we wouldn’t fall into that trap. So, here’s the code in all its tedious lack of glory:

public static IEnumerable<TResult> Repeat<TResult>(TResult element, int count)
{
    if (count < 0)
    {
        throw new ArgumentOutOfRangeException("count");
    }
    return RepeatImpl(element, count);
}

private static IEnumerable<TResult> RepeatImpl<TResult>(TResult element, int count)
{
    for (int i = 0; i < count; i++)
    {
        yield return element;
    }
}

That’s it. Um, interesting points to note… none.

Conclusion

There’s no sense in dragging this out. That’s the lot. Next up, Count and LongCount – which actually do have a couple of interesting points.

Reimplementing LINQ to Objects: Part 5 – Empty

Continuing with the non-extension methods, it’s time for possibly the simplest LINQ operator around: "Empty".

What is it?

"Empty" is a generic, static method with just a single signature and no parameters:

public static IEnumerable<TResult> Empty<TResult>()

It returns an empty sequence of the appropriate type. That’s all it does.

There’s only one bit of interesting behaviour: Empty is documented to cache an empty sequence. In other words, it returns a reference to the same empty sequence every time you call it (for the same type argument, of course).

What are we going to test?

There are really only two things we can test here:

  • The returned sequence is empty
  • The returned sequence is cached on a per type argument basis

I’m using the same approach as for Range to call the static method, but this time with an alias of EmptyClass. Here are the tests:

[Test]
public void EmptyContainsNoElements()
{
    using (var empty = EmptyClass.Empty<int>().GetEnumerator())
    {
        Assert.IsFalse(empty.MoveNext());
    }
}

[Test]
public void EmptyIsASingletonPerElementType()
{
    Assert.AreSame(EmptyClass.Empty<int>(), EmptyClass.Empty<int>());
    Assert.AreSame(EmptyClass.Empty<long>(), EmptyClass.Empty<long>());
    Assert.AreSame(EmptyClass.Empty<string>(), EmptyClass.Empty<string>());
    Assert.AreSame(EmptyClass.Empty<object>(), EmptyClass.Empty<object>());

    Assert.AreNotSame(EmptyClass.Empty<long>(), EmptyClass.Empty<int>());
    Assert.AreNotSame(EmptyClass.Empty<string>(), EmptyClass.Empty<object>());
}

Of course, that doesn’t verify that the cache isn’t per-thread, or something like that… but it’ll do.

Let’s implement it!

The implementation is actually slightly more interesting than the description so far may suggest. If it weren’t for the caching aspect, we could just implement it like this:

// Doesn’t cache the empty sequence
public static IEnumerable<TResult> Empty<TResult>()
{
    yield break;
}

… but we want to obey the (somewhat vaguely) documented caching aspect too. It’s not really hard, in the end. There’s a very handy fact that we can use: empty arrays are immutable. Arrays always have a fixed size, but normally there’s no way of making an array read-only… you can always change the value of any element. But an empty array doesn’t have any elements, so there’s nothing to change. So, we can reuse the same array over and over again, returning it directly to the caller… but only if we have an empty array of the right type.

At this point you may be expecting a Dictionary<Type, Array> or something similar… but there’s another useful trick we can take advantage of. If you need a per-type cache and the type is specified as a type argument, you can use static variables in a generic class, because each constructed type will have a distinct set of static variables.

Unfortunately, Empty is a generic method rather than a non-generic method in a generic type… so we’ve got to create a separate generic type to act as our cache for the empty array. That’s easy to do though, and the CLR takes care of initializing the type in a thread-safe way, too. So our final implementation looks like this:

public static IEnumerable<TResult> Empty<TResult>()
{
    return EmptyHolder<TResult>.Array;
}
        
private static class EmptyHolder<T>
{
    internal static readonly T[] Array = new T[0];       
}

That obeys all the caching we need, and is really simple in terms of lines of code… but it does mean you need to understand how generics work in .NET reasonably well. In some ways this is the opposite of the situation in the previous post – this is a sneaky implementation instead of the slower but arguably simpler dictionary-based one. In this case, I’m happy with the trade-off, because once you do understand how generic types and static variables work, this is simple code. It’s a case where simplicity is in the eye of the beholder.

Conclusion

So, that’s Empty. The next operator – Repeat – is likely to be even simpler, although it’ll have to be another split implementation…

Addendum

Due to the minor revolt over returning an array (which I still think is fine), here’s an alternative implementation:

public static IEnumerable<TResult> Empty<TResult>()
{
    return EmptyEnumerable<TResult>.Instance;
}

#if AVOID_RETURNING_ARRAYS
private class EmptyEnumerable<T> : IEnumerable<T>, IEnumerator<T>
{
    internal static readonly IEnumerable<T> Instance = new EmptyEnumerable<T>();

    // Prevent construction elsewhere
    private EmptyEnumerable()
    {
    }

    public IEnumerator<T> GetEnumerator()
    {
        return this;
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return this;
    }

    public T Current
    {
        get { throw new InvalidOperationException(); }
    }

    object IEnumerator.Current
    {
        get { throw new InvalidOperationException(); }
    }

    public void Dispose()
    {
        // No-op
    }

    public bool MoveNext()
    {
        return false; // There’s never a next entry
    }

    public void Reset()
    {
        // No-op
    }
}

#else
private static class EmptyEnumerable<T>
{
    internal static readonly T[] Instance = new T[0];       
}
#endif

Hopefully now everyone can build a version they’re happy with :)

Reimplementing LINQ to Objects: Part 4 – Range

This will be a short post, and there’ll probably be some more short ones coming up too. I think it makes sense to only cover multiple operators in a single post where they’re really similar. (Count and LongCount spring to mind.) I’m in your hands though – if you would prefer "chunkier" posts, please say so in the comments.

This post will deal with the Range generation operator.

What is it?

Range only has a single signature:

public static IEnumerable<int> Range(
    int start,
    int count)

Unlike most of LINQ, this isn’t an extension method – it’s a plain old static method. It returns an iterable object which will yield "count" integers, starting from "start" and incrementing each time – so a call to Enumerable.Range(6, 3) would yield 6, then 7, then 8.

As it doesn’t operate on an input sequence, there’s no sense in which it could stream or buffer its input, but:

  • The arguments need to be validated eagerly; the count can’t be negative, and it can’t be such that any element of the range could overflow Int32.
  • The values will be yielded lazily – Range should be cheap, rather than creating (say) an array of "count" elements and returning that.

How are we going to test it?

Testing a plain static method brings us a new challenge in terms of switching between the "normal" LINQ implementation and the Edulinq one. This is an artefact of the namespaces I’m using – the tests are in Edulinq.Tests, and the implementation is in Edulinq. "Parent" namespaces are always considered when the compiler tries to find a type, and they take priority over anything in using directives – even a using directive which tries to explicitly alias a type name.

The (slightly ugly) solution to this that I’ve chosen is to include a using directive to create an alias which couldn’t otherwise be resolved – in this case, RangeClass. The using directive will either alias RangeClass to System.Linq.Enumerable or Edulinq.Enumerable. The tests then all use RangeClass.Range. I’ve also changed how I’m switching between implementations – I now have two project configurations, one of which defines the NORMAL_LINQ preprocessor symbol, and the other of which doesn’t. The RangeTest class therefore contains:

#if NORMAL_LINQ
using RangeClass = System.Linq.Enumerable;
#else
using RangeClass = Edulinq.Enumerable;
#endif

There are alternatives to this approach, of course:

  • I could move the tests to a different namespace
  • I could make the project references depend on the configuration… so the "Normal LINQ" configuration wouldn’t reference the Edulinq implementation project, and the "Edulinq implementation" configuration wouldn’t reference System.Core. I could then just use Enumerable.Range with an appropriate using directive for System.Linq conditional on the NORMAL_LINQ preprocessor directive, as per the other tests.

I like the idea of the second approach, but it means manually tinkering with the test project file – Visual Studio doesn’t expose any way of conditionally including a reference. I may do this at a later date… thoughts welcome.

What are we going to test?

There isn’t much we can really test for ranges – I only have eight tests, none of which are particularly exciting:

  • A simple valid range should look right when tested with AssertSequenceEqual
  • The start value should be allowed to be negative
  • Range(Int32.MinValue, 0) is an empty range
  • Range(Int32.MaxValue, 1) yields just Int32.MaxValue
  • The count can’t be negative
  • The count can be zero
  • start+count-1 can’t exceed Int32.MaxValue (so Range(Int32.MaxValue, 2) isn’t valid)
  • start+count-1 can be Int32.MaxValue (so Range(Int32.MaxValue, 1) is valid)

The last two are tested with a few different examples each – a large start and a small count, a small start and a large count, and "fairly large" values for both start and count.
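
For example, one of the overflow tests looks something like this sketch (names may differ in the actual project):

[Test]
public void CountTooLarge()
{
    Assert.Throws<ArgumentOutOfRangeException>(() => RangeClass.Range(int.MaxValue, 2));
}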

Note that I don’t have any tests for lazy evaluation – while I could test that the returned value doesn’t implement any of the other collection interfaces, it would be a little odd to do so. On the other hand, we do have tests which have an enormous count – such that anything which really tried to allocate a collection of that size would almost certainly fail…

Let’s implement it!

It will surely be no surprise by now that we’re going to use a split implementation, with a public method which performs argument validation eagerly and then uses a private method with an iterator block to perform the actual iteration.

Having validated the arguments, we know that we’ll never overflow the bounds of Int32, so we can be pretty casual in the main part of the implementation.

public static IEnumerable<int> Range(int start, int count)
{
    if (count < 0)
    {
        throw new ArgumentOutOfRangeException("count");
    }
    // Convert everything to long to avoid overflows. There are other ways of checking
    // for overflow, but this way makes the code correct in the most obvious way.
    if ((long)start + (long)count - 1L > int.MaxValue)
    {
        throw new ArgumentOutOfRangeException("count");
    }
    return RangeImpl(start, count);
}

private static IEnumerable<int> RangeImpl(int start, int count)
{
    for (int i = 0; i < count; i++)
    {
        yield return start + i;
    }
}

Just a few points to note here:

  • Arguably it’s the combination of "start" and "count" which is invalid in the second check, rather than just count. It would possibly be nice to allow ArgumentOutOfRangeException (or ArgumentException in general) to blame multiple arguments rather than just one. However, using "count" here matches the framework implementation.
  • There are other ways of performing the second check, and I certainly didn’t have to make all the operands in the expression longs. However, I think this is the simplest code which is clearly correct based on the documentation. I don’t need to think about all kinds of different situations and check that they all work. The arithmetic will clearly be valid when using the Int64 range of values, so I don’t need to worry about overflow, and I don’t need to consider whether to use a checked or unchecked context.
  • There are also other ways of looping in the private iterator block method, but I think this is the simplest. Another obvious and easy alternative is to keep two values, one for the count of yielded values and the other for the next value to yield, and increment them both on each iteration. A more complex approach would be to use just one loop variable – but you can’t use "value < start + count" in case the final value is exactly Int32.MaxValue, and you can’t use "value <= start + count - 1" in case the arguments are (int.MinValue, 0). Rather than consider all the border cases, I’ve gone for an obviously-correct solution. If you really, really cared about the performance of Range, you’d want to investigate various other options. (The two-variable alternative is sketched below.)
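
Here’s that two-variable alternative, just to show the trade-off (a sketch):

private static IEnumerable<int> RangeImpl(int start, int count)
{
    int value = start;
    for (int i = 0; i < count; i++)
    {
        yield return value;
        // This increment can overflow after the final value has been yielded,
        // when start + count - 1 is exactly int.MaxValue. That's harmless in an
        // unchecked context (the wrapped value is never yielded), but it's
        // exactly the sort of border case the simpler loop avoids.
        value++;
    }
}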

Prior to writing up this post, I didn’t have good tests for Range(Int32.MaxValue, 1) and Range(Int32.MinValue, 0)… but as they could easily go wrong as mentioned above, I’ve now included them. I find it interesting how considering alternative implementations suggests extra tests.

Conclusion

"Range" was a useful method to implement in order to test some other operators – "Count" in particular. Now that I’ve started on the non-extension methods though, I might as well do the other two (Empty and Repeat). I’ve already implemented "Empty", and will hopefully be able to write it up today. "Repeat" shouldn’t take much longer, and then we can move on to "Count" and "LongCount".

I think this code is a good example of situations where it’s worth writing "dumb" code which looks like the documentation, rather than trying to write possibly shorter, possibly slightly more efficient code which is harder to think about. No doubt there’ll be more of that in later posts…

Reimplementing LINQ to Objects: Part 3 – “Select” (and a rename…)

It’s been a long time since I wrote part 1 and part 2 of this blog series, but hopefully things will move a bit more quickly now.

The main step forward is that the project now has a source repository on Google Code instead of just being a zip file on each blog post. I had to give the project a title at that point, and I’ve chosen Edulinq, hopefully for obvious reasons. I’ve changed the namespaces etc in the code, and the blog tag for the series is now Edulinq too. Anyway, enough of the preamble… let’s get on with reimplementing LINQ, this time with the Select operator.

What is it?

Like Where, Select has two overloads:

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, int, TResult> selector)

Again, they both operate the same way – but the second overload allows the index into the sequence to be used as part of the projection.

Simple stuff first: the method projects one sequence to another: the "selector" delegate is applied to each input element in turn, to yield an output element. Behaviour notes, which are exactly the same as Where (to the extent that I cut and paste these from the previous blog post, and just tweaked them):

  • The input sequence is not modified in any way.
  • The method uses deferred execution – until you start trying to fetch items from the output sequence, it won’t start fetching items from the input sequence.
  • Despite deferred execution, it will validate that the parameters aren’t null immediately.
  • It streams its results: it only ever needs to look at one result at a time.
  • It will iterate over the input sequence exactly once each time you iterate over the output sequence.
  • The "selector" function is called exactly once per yielded value.
  • Disposing of an iterator over the output sequence will dispose of the corresponding iterator over the input sequence.

What are we going to test?

The tests are very much like those for Where – except that in cases where we tested the filtering aspect of Where, we’re now testing the projection aspect of Select.

There are a few tests of some interest. Firstly, you can tell that the method is generic with 2 type parameters instead of 1 – it has type parameters of TSource and TResult. They’re fairly self-explanatory, but it means it’s worth having a test for the case where the type arguments are different – such as converting an int to a string:

[Test]
public void SimpleProjectionToDifferentType()
{
    int[] source = { 1, 5, 2 };
    var result = source.Select(x => x.ToString());
    result.AssertSequenceEqual("1", "5", "2");
}

Secondly, I have a test that shows what sort of bizarre situations you can get into if you include side effects in your query. We could have done this with Where as well of course, but it’s clearer with Select:

[Test]
public void SideEffectsInProjection()
{
    int[] source = new int[3]; // Actual values won’t be relevant
    int count = 0;
    var query = source.Select(x => count++);
    query.AssertSequenceEqual(0, 1, 2);
    query.AssertSequenceEqual(3, 4, 5);
    count = 10;
    query.AssertSequenceEqual(10, 11, 12);
}

Notice how we’re only calling Select once, but the results of iterating over the results change each time – because the "count" variable has been captured, and is being modified within the projection. Please don’t do things like this.

Thirdly, we can now write query expressions which include both "select" and "where" clauses:

[Test]
public void WhereAndSelect()
{
    int[] source = { 1, 3, 4, 2, 8, 1 };
    var result = from x in source
                 where x < 4
                 select x * 2;
    result.AssertSequenceEqual(2, 6, 4, 2);
}

There’s nothing mind-blowing about any of this, of course – hopefully if you’ve used LINQ to Objects at all, this should all feel very comfortable and familiar.

Let’s implement it!

Surprise surprise, we go about implementing Select in much the same way as Where. Again, I simply copied the implementation file and tweaked it a little – the two methods really are that similar. In particular:

  • We’re using iterator blocks to make it easy to return sequences
  • The semantics of iterator blocks mean that we have to separate the argument validation from the real work. (Since I wrote the previous post, I’ve learned that VB11 will have anonymous iterators, which will avoid this problem. Sigh. It just feels wrong to envy VB users, but I’ll learn to live with it.)
  • We’re using foreach within the iterator blocks to make sure that we dispose of the input sequence iterator appropriately – so long as our output sequence iterator is disposed or we run out of input elements, of course.

I’ll skip straight to the code, as it’s all so similar to Where. It’s also not worth showing you the version with an index – because it’s such a trivial difference.

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    return SelectImpl(source, selector);
}

private static IEnumerable<TResult> SelectImpl<TSource, TResult>(
    IEnumerable<TSource> source,
    Func<TSource, TResult> selector)
{
    foreach (TSource item in source)
    {
        yield return selector(item);
    }
}

Simple, isn’t it? Again, the real "work" method is even shorter than the argument validation.
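
For completeness, the implementation method for the indexed overload would just thread a counter through – something like this sketch:

private static IEnumerable<TResult> SelectImpl<TSource, TResult>(
    IEnumerable<TSource> source,
    Func<TSource, int, TResult> selector)
{
    int index = 0;
    foreach (TSource item in source)
    {
        yield return selector(item, index++);
    }
}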

Conclusion

While I don’t generally like boring my readers (which may come as a surprise to some of you) this was a pretty humdrum post, I’ll admit. I’ve emphasized "just like Where" several times to the point of tedium very deliberately though – because it makes it abundantly clear that there aren’t really as many tricky bits to understand as you might expect.

Something slightly different next time (which I hope will be in the next few days). I’m not quite sure what yet, but there’s an awful lot of methods still to choose from…

A Model/View to a Kill (Naked came the null delegate, part 5)

(I suggest you read the earlier parts of the story first. I’m not claiming it’ll make any more sense afterwards, mind you.)

Even though Seymour Sharpton’s brain was in a spinlock, a low-level interrupt brought him out of his stupor – namely, an enormous motorcycle bursting through the floor near the daemon. It was impossible to tell the form of the rider under the leather and helmet. When the biker spoke, the voice was digitally disguised but its authority was clear:

"Sharpton. Here, now. The rest of you: you know me. Follow us, and there’ll be trouble."

Algol hissed sharply, and then started cajoling Seymour: "Don’t go. Stay with us. My master didn’t mean what he said about, um, deletion. It was just a little joke. You’re safe here…"

But Seymour was already running towards the motorcycle. The biker had never stopped revving the engine, and the moment Seymour jumped on the rear seat, the wheels span briefly, kicking up clouds of dust before the bike raced through the warehouse and through the (fortunately still open) door.

The ride was fast, bumpy, and terrifying. Seymour hadn’t felt like this since he’d given up C, years ago. The roar of the engine drowned out any conversation, and he was left holding on for dear life until the bike came to a screeching halt in a deserted street. The biker dismounted and offered a hand to Seymour, who shook it nervously.

"Who are you?" he murmured, almost afraid to find out.

"My name is unimportant," replied the metallic voice, still masked by the helmet.

"It’s not Slartibartfast, is it?" Seymour had visions of being whisked away to see fjords. That would just about fit in with the rest of his strange evening.

"No, it’s… unspeakable."

"Ah, so you’re an anonymous type of guy, huh?"

"Anonymous, yes… guy, no." The biker removed her helmet and shook her head. Her long blonde hair swooshed from side to side, and time seemed to slow for Seymour as he watched her. She was a model he could view forever… although the idea of trying to control her seemed out of the question. Then several strands of hair were caught in the anonymous girl’s gently pouting mouth, and she spat them out hurriedly. "Damn it, I hate it when that happens. Anyway, you are lucky to be alive. You have no idea what our shady underworld contains… those zombies aren’t the worst of it by a long chalk."

"There’s more?" Seymour gulped.

"Worse than you can imagine. We’re lucky it’s a new moon tonight, for example. Otherwise this place would be heaving with were-clauses. Most of the month they just filter as normal, but come the full moon… urgh." She shuddered, and Seymour didn’t want to ask any further. The biker paused, and then continued.

"Then there’s the mutants. They’re harmless enough, but not for want of trying. They’ll lope after you, but they mutate almost without thinking about it. Totally dysfunctional. A quick kick to the monads will usually despatch them… But tonight, we have something else to fear." She looked around, cautiously. "The word on the street is that the Vimpires are in town. Every few years we think we’ve got rid of them… and then they come back, with their long and surprisingly dexterous fingers. You know how you can tell when the Vimpires are around?"

Seymour was spellbound. "How?"

"The mice all leave, in droves. The rats don’t care, but a Vimpires will torture a mouse just for the fun of it. But this time, there are rumours. There’s talk of a bigger plan afoot. The one thing the Vimpires are still afraid of is bright light. During the day, we’re safe… but imagine if there were no more days? Perpetual twilight – like "Breaking Dawn part 1" but forever."

"They wouldn’t!" Seymour gasped. He remembered that long night in the cinema only too well.

"They would. And they have allies… for the first time, the Eclipse posse and the Vimpires are joining forces. So we have to fight them. You’re not the first innocent man I’ve rescued tonight, and you won’t be the last. But I need to be sure of you… do you have references?"

"Um, what kind of references?"

"Anything to prove your identity. It’s a class war out there, Seymour… now what type of man are you? Where do your values lie? Oh, never mind… I’ll trust you for now. But Seymour, you need to be ready. Brace yourself. Are you in The Zone?"

"I don’t know what you mean… what zone are you talking about?"

"Ah, true, that was ambiguous. UTC or not UTC, that is the question. Whether ’tis nobler in in the mind to suffer the leap seconds and missing hours of outrageous chronology, or to take ARM against a sea of doubles, and by opposing end them?"

"What on earth are you babbling about?"

"No matter. All you need to know is this… the Vimpires are trying to extinguish the sun, but we’re going to stop them. It’s daylight saving time."

Continued in part 6 – The Great Destructor

Creative freedom, control, and the balance of power

Stephen Colebourne’s comment on my last blog post (adding 1 month -1 day to January 29th) has knocked me for six. To avoid thinking about how I might implement the preferred behaviour in Noda Time while still using Joda Time’s "engine", I’ve decided to write about something else which has been niggling at me.

For a long time, I’ve espoused the idea of "design for inheritance or prohibit it" – in other words, default to sealing classes and making methods non-virtual unless you have a good reason to do otherwise. I’ve usually attributed this phrase to Josh Bloch writing in Effective Java, but it could well have been around for a lot longer.

Whenever I’ve espoused this in public, it’s caused disagreement. Not universal disagreement (which would be easy to cope with; if everyone else thinks I’m wrong, that’s a very strong indicator that I’m really missing something) but a fairly even distribution of "absolutely" and "hell no". Most people seem to feel passionately one way or the other. This has led me to water down my line of "the C# designers should have made classes sealed by default" to "the C# designers shouldn’t have included a default for classes being sealed or not – make the developer specify it".

One thing I’ve found interesting is that the split between "make everything non-virtual" and "make everything virtual" isn’t one of "smart" vs "dumb". There are plenty of publicly admired developers on both sides of the fence, and my colleagues are reasonably evenly split too. However, I have detected a correlation in terms of programming preferences around type systems: I’ve generally found that those who are in favour of making everything virtual by default are also more likely to be enthusiastic about dynamic typing. That won’t be universally true of course, but I think one is likely to be a reasonably good predictor of the other.

Ultimately I think it’s about a balance, and different people place different amounts of value on the various pros and cons. It’s also about the relationship between different parties. Different pros and cons affect different parties in different ways.

A relatively inflexible API with a flexible implementation

I’m happy when I know everything that is going on in my code. I interact with other code through obvious dependencies: they are provided to me explicitly. You’re welcome to modify my code’s visible behaviour by implementing those dependencies in different ways, but my code should be fine as long as you abide by the contracts expressed in the dependencies (typically interfaces).

If I call one of my own non-virtual methods from within my code, I know what it’s going to do. If I have two non-virtual methods which could be implemented by one calling the other either way round, then it doesn’t matter which option I pick. I can change my mind later on, and no-one will be any the wiser. All the externally visible behaviour will be exactly the same. I don’t need to document which method calls which – just what the final results are.

If I create an immutable type and seal it, then all the immutability is within my control. If I’ve picked immutable types for my member variables, have Object as a base class, and make sure I don’t mutate anything myself, I’m fine. I can rely on my values being unchanging, and so can my callers. They can cache a value with impunity.
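
To make that concrete, here’s a minimal sketch of the kind of type I’m describing – the type and its members are hypothetical, but the pattern is what matters:

public sealed class Money
{
    private readonly string currency;
    private readonly decimal amount;

    public string Currency { get { return currency; } }
    public decimal Amount { get { return amount; } }

    public Money(string currency, decimal amount)
    {
        this.currency = currency;
        this.amount = amount;
    }

    // Sealed class + readonly fields of immutable types = values which
    // callers can cache and share with impunity.
    public Money Plus(decimal extra)
    {
        return new Money(currency, amount + extra);
    }
}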

Basically, everything is simple… unless you want to make one of my types behave slightly differently.

Flexibility with the risk of breakage

The above section sounds all nice and rosy… but what if you want to just tweak my type slightly? You only want to override one method – how much harm can that do? You’ve looked at the implementation and seen that nothing actually relies on it working exactly the way it does… and it doesn’t call any other public members. If my type implements an interface including the member you want to tweak, then you could potentially implement the interface and delegate almost all calls to an instance of the original type, but implement that one call differently. Of course, delegation is great in many cases – but can fail when there are complex relationships involved (such as when the delegated instance passes itself to something else). Basically there are identity issues.
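
Here’s a sketch of the identity problem I mean – all of the names are hypothetical. The wrapper tweaks one method, but the wrapped instance hands itself to a third party, and the tweak silently disappears:

using System;
using System.Collections.Generic;

public interface IEngine
{
    void Start();
    void RegisterWith(EngineMonitor monitor);
}

public class EngineMonitor
{
    private readonly List<IEngine> engines = new List<IEngine>();
    public void Add(IEngine engine) { engines.Add(engine); }
    public void StartAll()
    {
        foreach (IEngine engine in engines) { engine.Start(); }
    }
}

public sealed class Engine : IEngine
{
    public void Start() { }
    public void RegisterWith(EngineMonitor monitor)
    {
        // The engine registers *itself* – not whatever wrapped it.
        monitor.Add(this);
    }
}

public class LoggingEngine : IEngine
{
    private readonly IEngine inner;
    public LoggingEngine(IEngine inner) { this.inner = inner; }

    public void Start()
    {
        Console.WriteLine("Starting...");  // the one tweak we wanted
        inner.Start();
    }

    public void RegisterWith(EngineMonitor monitor)
    {
        // Delegation leaks identity: the monitor will later call Start on
        // the inner Engine directly, and our logging never happens.
        inner.RegisterWith(monitor);
    }
}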

It would be much simpler in this case if you could override my method. That might help your testing, or allow you to use my type in a way I’d never anticipated, achieving fabulous things. As Stroustrup wrote, "I wouldn’t like to build a tool that could only do what I had been able to imagine for it." Now I believe there’s a big difference between imagining a "big picture" which my component may be some tiny part of, and imagining a crazy use for the type itself, but the sentiment is still worth considering. Creative freedom is a nice thing to have, after all… who am I to stop you from achieving greatness?

The downside is that you’re opening yourself to the risk of your code breaking if I change my implementation details in another release. Maybe it would only mean your tests breaking – maybe it’s your production code. (While I don’t want to put too many words in the mouths of those who hold the opposite opinion to me, I believe a lot of their reason for wanting to be able to override methods is to make testing easier. Personally I prefer to construct test doubles which implement interfaces directly, but I do understand that’s not always feasible – especially if the component in question hasn’t been designed with testability in mind to start with.)

In many cases there’s genuinely little risk of that actually happening… but how tolerant are you of that risk?

Risk evaluation and propagation

When I wrote about what might break if the library producer changes their code, I mentioned your production code and your test code. There’s a much nastier risk though: you break someone else’s code which is relying on yours.

Suppose I produce library X, and you use it in library Y. Now Joe Developer uses both of our libraries in his application… except he uses a new version of X. Maybe it’s a bug-fix version, which is meant to have no externally-visible changes… except it changes how it uses its own methods, in a way which will break if you’ve overridden some methods in a particular way… and you’ve done so in library Y. As far as Joe Developer is concerned, his combination of X + Y is broken. Whose fault is it?

  • Mine for changing the behaviour of X in a seemingly sensible way?
  • Yours for overriding a member of X in Y in a way I hadn’t anticipated?
  • Joe’s for using a version of X which you hadn’t developed Y against?

Maybe all three. The trouble is, as the developer of Y you have no way of knowing how likely it is that I’ll change the details of my implementation in "seemingly harmless" ways. Indeed, you may even have performed some testing of Y against the version of X that Joe is using… but maybe Joe’s overriding some other members of the types from X and Y in ways that neither you nor I expected… and the combination could be complex to work out.

Now this all sounds like doom and gloom – but you need to remember that there must have been reasons for overriding those members to start with. Achieving the same goals without using inheritance could certainly have been considerably more complex, and introduced other bugs along the way. Using inheritance could have been a big win all round, right up until the point where everything screwed up… at which point it may still be recoverable, or it could be a knot which can’t be untied. You probably won’t know until the breakage happens, and you probably can’t accurately gauge the likelihood of it happening in the first place. It may well never happen.

Three options as library providers and consumers

It seems to me that when you’re building an API, there are three broad options available (obviously with nuanced positions between them):

  • Make every type unsealed, and every method virtual – but don’t make any guarantees about what will happen if someone overrides methods.
  • Make every type unsealed and every method virtual – but document/guarantee every internal interaction, so that anyone deriving from your class can predict how it will behave.
  • Make everything sealed or non-virtual unless you can foresee a reason for overriding it. Document what sort of overriding you expect to handle, and where the overridden methods will be called.

As the consumer of an API, you have various choices too:

  • Rely on undocumented behaviour, betting that you’ll save more time by doing so than you’ll lose by fixing breakage later
  • Only rely on documented behaviour when calling code, but rely on undocumented behaviour when overriding code, as this is typically less well documented anyway (very few APIs will specify exactly what’s deemed acceptable)
  • Only rely on documented behaviour

While these options are reasonably easy to describe, they again miss the oh-so-common situation: I’m consuming someone else’s types, but providing my own types to other developers. This mixed behaviour is where a lot of the complexity comes in, increasing the risk of breakage and increasing the cost of fixing the problem.

Conclusion

I still believe that designing for inheritance or prohibiting it makes sense if you want to provide a robust library which makes it hard for the consumer to abuse. However, I appreciate that others want the ability to abuse a library – and are willing to pay the price for that down the line. I’m concerned by the "3 party" scenario though – where a third developer can shoot your foot off by abusing my code.

Sadly, I can’t see this long-running argument coming any closer to resolution. Better mix-in support within C# would at least help, I believe – but delegation is no panacea either.

I want to be a pragmatic developer: I dislike the dogmatic prohibition of convenient practices for the sake of purely theoretical reasons as much as the next person… and I genuinely can see where it can be a pain not being able to override behaviour at times. However, I have yet to be convinced that a gung-ho, "It probably won’t break, honest!" attitude is really a good option in the long term. I hope I’m gaining an increasingly open mind though – and I hope that, at least by discussing this from slightly less religious viewpoints than normal, both camps can learn something from each other.

The joys of date/time arithmetic

(Cross-posted to my main blog and the Noda Time blog, in the hope that the overall topic is still of interest to those who aren’t terribly interested in Noda Time per se.)

I’ve been looking at the "period" part of Noda Time recently, trying to redesign the API to simplify it somewhat. This part of the API is what we use to answer questions such as:

  • What will the date be in 14 days?
  • How many hours are there between now and my next birthday?
  • How many years, months and days have I been alive for?

I’ve been taking a while to get round to this because there are some tricky choices to make. Date and time arithmetic is non-trivial – not because of complicated rules which you may be unaware of, but simply because of the way calendaring systems work. As ever, time zones make life harder too. This post won’t talk very much about the Noda Time API details, but will give the results of various operations as I currently expect to implement them.

The simple case: arithmetic on the instant time line

One of the key concepts to understand when working with time is that the usual human "view" on time isn’t the only possible one. We don’t have to break time up into months, days, hours and so on. It’s entirely reasonable (in many cases, at least) to consider time as just a number which progresses linearly. In the case of Noda Time, it’s the number of ticks (there are 10 ticks in a microsecond, 10,000 ticks in a millisecond, and 10 million ticks in a second) since midnight on January 1st 1970 UTC.

Leaving relativity aside, everyone around the world can agree on an instant, even if they disagree about everything else. If you’re talking over the phone (using a magic zero-latency connection) you may think you’re in different years, using different calendar systems, in different time zones – but still both think of "now" as "634266985845407773 ticks".

That makes arithmetic really easy – but also very limited. You can only add or subtract numbers of ticks, effectively. Of course you can derive those ticks from some larger units which have a fixed duration – for example, you could convert "3 hours" into ticks – but some other concepts don’t really apply. How would you add a month? The instant time line has no concept of months, and in most calendars different months have different durations (28-31 days in the ISO calendar, for example). Even the idea of a day is somewhat dubious – it’s convenient to treat a day as 24 hours, but you need to at least be aware that when you translate an instant into a calendar that a real person would use, days don’t always last for 24 hours, due to daylight saving changes.

Anyway, the basic message is that it’s easy to do arithmetic like this. In Noda Time we have the Instant structure for the position on the time line, and the Duration structure as a number of ticks which can be added to an Instant. This is the most appropriate pair of concepts to use to measure how much time has passed, without worrying about calendars and daylight saving: ideal for things like timeouts, cache purging and so on.
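
In code, the distinction looks something like this. (This is a sketch – the names below are the ones the released library ended up with, and the details were still in flux when I wrote this.)

// Instant + Duration: pure tick arithmetic, no calendar involved.
Instant now = SystemClock.Instance.GetCurrentInstant();
Instant timeout = now + Duration.FromMinutes(30);

// Larger fixed-length units are fine too, because they're still just ticks:
Instant later = now + Duration.FromHours(3);

// But there's deliberately no "now + 1 month" – the instant time line
// has no concept of months.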

Things start to get messy: local dates, times and date/times

The second type of arithmetic is what humans tend to actually think in. We talk about having a meeting in a month’s time, or how many days it is until Christmas (certainly my boys do, anyway). We don’t tend to consciously bring time zones into the equation – which is a good job, as we’ll see later.

Now just to make things clear, I’m not planning on talking about recurrent events – things like "the second Tuesday and the last Wednesday of every month". I’m not planning on supporting recurrences in Noda Time, and having worked on the calendar part of Google Mobile Sync for quite a while, I can tell you that they’re not fun. But even without recurrences, life is tricky.

Introducing periods and period arithmetic

The problem is that our units are inconsistent. I mentioned before that "a month" is an ambiguous length of time… but it doesn’t just change by the month, but potentially by the year as well: February is either 28 or 29 days long depending on the year. (I’m only considering the ISO calendar for the moment; that gives enough challenges to start with.)

If we have inconsistent units, we need to keep track of those units during arithmetic, and even request that the arithmetic be performed using specific units. So, it doesn’t really make sense to ask "how long is the period between June 10th 2010 and October 13th 2010" but it does make sense to ask "how many days are there between June 10th 2010 and October 13th 2010" or "how many years, months and days are there between June 10th 2010 and October 13th 2010".
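
A sketch of those two meaningful questions in code, again using the shape the released API ended up with:

var start = new LocalDate(2010, 6, 10);
var end = new LocalDate(2010, 10, 13);

// "How many days?" – a single unit:
Period days = Period.Between(start, end, PeriodUnits.Days);
// 125 days

// "How many years, months and days?" – a compound set of units:
Period ymd = Period.Between(start, end,
    PeriodUnits.Years | PeriodUnits.Months | PeriodUnits.Days);
// 0 years, 4 months, 3 days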

Once you’ve got a period – which I’ll describe as a collection of unit/value pairs, e.g. "0 years, 4 months and 3 days" (for the last example above) – you can still get unexpected behaviour. If you add that period to your original start date, you should get the original end date… but if you advance the start date by one day, you may not advance the end date by one day. It depends on how you handle things like "one month after January 30th 2010" – some valid options are:

  • Round down to the end of the month: February 28th
  • Round up to the start of the next month: March 1st
  • Work out how far we’ve overshot, and apply that to the next month: March 2nd
  • Throw an exception

All of these are justifiable. Currently, Noda Time will always take the first approach. I believe that JSR-310 (the successor to Joda Time) will allow the behaviour to be resolved according to a strategy provided by the user… it’s unclear to me at the moment whether we’ll want to go that far in Noda Time.
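
So the awkward example from above currently behaves like this (a sketch – PlusMonths is the shape the released API uses):

var date = new LocalDate(2010, 1, 30);
// Round down to the end of the month: there's no "January 30th + 1 month"
// slot in February, so we truncate.
var result = date.PlusMonths(1);   // February 28th 2010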

Arithmetic in Noda Time is easily described, but the consequences can be subtle. When adding or subtracting a period from something like a LocalDate, we simply iterate over all of the field/value pairs in the period, starting with the most significant, and add each one in turn. When finding the difference between two LocalDate values with a given set of field types (e.g. "months and days") we get as close as we can without overshooting using the most significant field, then the next field etc.

The "without overshooting" part means that if you add the result to the original start value, the result will always either be the target end value (if sufficiently fine-grained fields are available) or somewhere between the original start and the target end value. So "June 2nd 2010 to October 1st 2010 in months" gives a result of "3 months" even though if we chose "4 months" we’d only overshoot by a tiny amount.

Now we know what approach we’re taking, let’s look at some consequences.

Asymmetry and other oddities

It’s trivial to show some asymmetry just using a period of a single month. For example:

  • January 28th 2010 + 1 month = February 28th 2010
  • January 29th 2010 + 1 month = February 28th 2010
  • January 30th 2010 + 1 month = February 28th 2010
  • February 28th 2010 – 1 month = January 28th 2010

It gets even more confusing when we add days into the mix:

  • January 28th 2010 + 1 month + 1 day = March 1st 2010
  • January 29th 2010 + 1 month + 1 day = March 1st 2010
  • March 1st 2010 – 1 month – 1 day = January 31st 2010

And leap years:

  • March 30th 2013 – 1 year – 1 month – 10 days = February 19th 2012 (as "February 30th 2012" is truncated to February 29th 2012)
  • March 30th 2012 – 1 year – 1 month – 10 days = February 18th 2011 (as "February 30th 2011" is truncated to February 28th 2011)
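
A sketch of the second of those, with the fields applied most-significant-first as described earlier:

var date = new LocalDate(2012, 3, 30);
var period = Period.FromYears(1) + Period.FromMonths(1) + Period.FromDays(10);
// Years first (March 30th 2011), then months ("February 30th 2011",
// truncated to February 28th), then days:
var result = date - period;   // February 18th 2011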

Then we need to consider how rounding works when finding the difference between days… (forgive the pseudocode):

  • Between(January 31st 2010, February 28th 2010, Months & Days) = ?
  • Between(February 28th 2010, January 31st 2010, Months & Days) = -28 days

The latter case is relatively obvious – because if you take a whole month off February 28th 2010 you end up with January 28th 2010, which is an overshoot… but what about the first case?

Should we determine the number of months as "the largest number such that start + period <= end"? If so, we get a result of "1 month" – which makes sense given the first set of results in this section.

What worries me most about this situation is that I honestly don’t know offhand what the current implementation will do. I think it would be best to return "28 days" as there isn’t genuinely a complete month between the two… <tappety tappety>

Since writing the previous paragraph, I’ve tested it, and it returns 1 month and 0 days. I don’t know how hard it would be to change this behaviour or whether we want to. Whatever we do, however, we need to document it.

That’s really at the heart of this: we must make Noda Time predictable. Where there are multiple feasible results, there should be a simple way of doing the arithmetic by hand and getting the same results as Noda Time. Of course, picking the best option out of the ones available would be good – but I’d rather be consistent and predictable than "usually right" but unpredictably so.

Think it’s bad so far? It gets worse…

ZonedDateTime: send in the time zones… (well maybe next year?)

I’ve described the "instant time line" and its simplicity.

I’ve described the local date/time complexities, where there’s a calendar but there’s no time zone.

So far, the two worlds have been separate: you can’t add a Duration to a LocalDateTime (etc), and you can’t add a Period to an Instant. Unfortunately, sooner or later many applications will need ZonedDateTime.

Now, you can think of ZonedDateTime in two different ways:

  • It’s an Instant which knows about a calendar and a time zone
  • It’s a LocalDateTime which knows about a time zone and the offset from UTC

The "offset from UTC" part sounds redundant at first – but during daylight saving transitions the same LocalDateTime occurs at two different instants; the time zone is the same in both cases, but the offset is different.

The latter way of thinking is how we actually represent a ZonedDateTime internally, but it’s important to know that a ZonedDateTime still unambiguously maps to an Instant.

So, what should we be able to do with a ZonedDateTime in terms of arithmetic? I think the answer is that we should be able to add both Periods and Durations to a ZonedDateTime – but expect them to give different results.

When we add a Duration, that should work out the Instant represented by the current ZonedDateTime, advance it by the given duration, and return a new ZonedDateTime based on that result, with the same calendar and time zone. In other words, this is saying, "If I were to wait for the given duration, what date/time would I see afterwards?"

When we add a Period, that should add it to the LocalDateTime represented by the ZonedDateTime, and then return a new ZonedDateTime with the result, the original time zone and calendar, and whatever offset is suitable for the new LocalDateTime. (That’s deliberately woolly – I’ll come back to it.) This is the sort of arithmetic a real person would probably perform if you asked them to tell you what time it would be "three hours from now". Most people don’t take time zones into account…

In most cases, where a period can be represented as a duration (for example "three hours") the two forms of addition will give the same result. Around daylight saving transitions, however, they won’t. Let’s consider some calculations on Sunday November 7th 2010 in the "America/Los_Angeles" time zone. It had a daylight saving transition from UTC-7 to UTC-8 at 2am local time. In other words, the clock went 1:58, 1:59, 1:00. Let’s start at 12:30am (local time, offset = -7) and add a few different values:

  • 12:30am + 1 hour duration = 1:30am, offset = -7
  • 12:30am + 2 hours duration = 1:30am, offset = -8
  • 12:30am + 3 hours duration = 2:30am, offset = -8
  • 12:30am + 1 hour period = 1:30am, offset = ???
  • 12:30am + 2 hour period = 2:30am, offset = -8
  • 12:30am + 3 hour period = 3:30am, offset = -8
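
To make the contrast concrete, here’s a sketch of the two kinds of addition for the 2-hour rows above. The duration shape did make it into the released library; the period addition is expressed by dropping down to the LocalDateTime and mapping back into the zone, with InZoneLeniently standing in for whichever ambiguity-resolution strategy we end up with:

var zone = DateTimeZoneProviders.Tzdb["America/Los_Angeles"];
var start = new LocalDateTime(2010, 11, 7, 0, 30).InZoneStrictly(zone);

// Elapsed-time arithmetic: "wait two hours, then look at the clock".
var viaDuration = start + Duration.FromHours(2);
// => 1:30am, offset = -8 (we've crossed the fall-back transition)

// Calendar arithmetic: "what will the clock say two hours later?"
var viaPeriod = (start.LocalDateTime + Period.FromHours(2))
    .InZoneLeniently(zone);
// => 2:30am, offset = -8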

The ??? value is the most problematic one, because 1:30 occurs twice… when thinking of the time in a calendar-centric way, what should the result be? Options here:

  • Always use the earlier offset
  • Always use the later offset
  • Use the same offset as the start date/time
  • Use the offset in the direction of travel (so adding one hour from 12:30am would give 1:30am with an offset of -7, but subtracting one hour from 2:30am would give 1:30am with an offset of -8)
  • Throw an exception
  • Allow the user to pass in an argument which represents a strategy for resolving this

This is currently unimplemented in Noda Time, so I could probably choose whatever behaviour I want, but frankly none of them has much appeal.

At the other daylight saving transition, when the clocks go forward, we have the opposite problem: adding one hour to 12:30am can’t give 1:30am because that time never occurs. Options in this case include:

  • Return the first valid time after the transition (this has problems if we’re subtracting time, where we’d presumably want to return the latest valid time before the transition… but the transition has an exclusive lower bound, so there’s no such "latest valid time" really)
  • Add the offset difference, so we’d skip to 2:30am
  • Throw an exception
  • Allow the user to pass in a strategy

Again, nothing particularly appeals.

All of this is just involved in adding a period to a ZonedDateTime – then the same problems occur all over again when trying to find the period between them. What’s the difference (as a Period rather than a simple Duration) between 1:30am with an offset of -7 and 1:30am with an offset of -8? Nothing, or an hour? Again, at the moment I really don’t know the best course of action.

Conclusion

This post has ended up being longer than I’d expected, but hopefully you’ve got a flavour of the challenges we’re facing. Even without time zones getting involved, date and time arithmetic is pretty silly – and with time zones, it becomes very hard to reason about – and to work out what the "right" result to be returned by an API should be, let alone implement it.

Above all, it’s important to me that Noda Time is predictable and clearly documented. Very often, if a library doesn’t behave exactly the way you want it to, but you can tell what it’s going to do, you can work around that – but if you’re having to experiment to guess the behaviour, you’re on a hiding to nothing.