Category Archives: Java

All about java.util.Date

This post is an attempt to reduce the number of times I need to explain things in Stack Overflow comments. You may well be reading it via a link from Stack Overflow – I intend to refer to this post frequently in comments. Note that this post is mostly not about text handling – see my post on common mistakes in date/time formatting and parsing for more details on that.

There are few classes which cause so many similar questions on Stack Overflow as java.util.Date. There are four causes for this:

  • Date and time work is fundamentally quite complicated and full of corner cases. It’s manageable, but you do need to put some time into understanding it.
  • The java.util.Date class is awful in many ways (details given below).
  • It’s poorly understood by developers in general.
  • It’s been badly abused by library authors, adding further to the confusion.

TL;DR: java.util.Date in a nutshell

The most important things to know about java.util.Date are:

  • You should avoid it if you possibly can. Use java.time.* if possible, or the ThreeTen-Backport (java.time for older versions, basically) or Joda Time if you’re not on Java 8 yet. (There’s a short sketch of converting at an API boundary after this list.)
    • If you’re forced to use it, avoid the deprecated members. Most of them have been deprecated for nearly 20 years, and for good reason.
    • If you really, really feel you have to use the deprecated members, make sure you really understand them.
  • A Date instance represents an instant in time, not a date. Importantly, that means:
    • It doesn’t have a time zone.
    • It doesn’t have a format.
    • It doesn’t have a calendar system.
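
Here’s a minimal sketch of the “avoid it” advice in practice (assuming Java 8 is available): convert any legacy Date to java.time at the boundary of your code, and only apply a time zone explicitly when you need local values.

import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.Date;

public class Boundary {
    public static void main(String[] args) {
        Date legacy = new Date(0L); // some Date handed to us by an old API

        // Escape to java.time as early as possible...
        Instant instant = legacy.toInstant();

        // ...and apply a time zone explicitly when you need local values.
        // (The UK was an hour ahead of UTC year-round at the start of 1970.)
        ZonedDateTime london = instant.atZone(ZoneId.of("Europe/London"));
        System.out.println(london); // 1970-01-01T01:00+01:00[Europe/London]

        // Convert back at the other boundary only if you really have to.
        Date roundTripped = Date.from(instant);
    }
}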

Now, onto the details…

What’s wrong with java.util.Date?

java.util.Date (just Date from now on) is a terrible type, which explains why so much of it was deprecated in Java 1.1 (but is still being used, unfortunately).

Design flaws include:

  • Its name is misleading: it doesn’t represent a Date, it represents an instant in time. So it should be called Instant – as its java.time equivalent is.
  • It’s non-final: that encourages poor uses of inheritance such as java.sql.Date (which is meant to represent a date, and is also confusing due to having the same short name).
  • It’s mutable: date/time types are natural values which are usefully modeled by immutable types. The fact that Date is mutable (e.g. via the setTime method) means diligent developers end up creating defensive copies all over the place.
  • It implicitly uses the system-local time zone in many places – including toString() – which confuses many developers. More on this in the “What’s an instant in time?” section below.
  • Its month numbering is 0-based, copied from C. This has led to many, many off-by-one errors.
  • Its year numbering is 1900-based, also copied from C. Surely by the time Java came out we had an idea that this was bad for readability?
  • Its methods are unclearly named: getDate() returns the day-of-month, and getDay() returns the day-of-week. How hard would it have been to give those more descriptive names?
  • It’s ambiguous about whether or not it supports leap seconds: “A second is represented by an integer from 0 to 61; the values 60 and 61 occur only for leap seconds and even then only in Java implementations that actually track leap seconds correctly.” I strongly suspect that most developers (including myself) have made plenty of assumptions that the range for getSeconds() is actually in the range 0-59 inclusive.
  • It’s lenient for no obvious reason: “In all cases, arguments given to methods for these purposes need not fall within the indicated ranges; for example, a date may be specified as January 32 and is interpreted as meaning February 1.” How often is that useful? (The snippet below shows a few of these traps in action.)
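
To make some of those concrete, here’s a short demonstration of the deprecated members in action – a sketch purely to show the traps, not a recommendation to use them:

import java.util.Date;

public class DateTraps {
    public static void main(String[] args) {
        // Deprecated constructor: the year is 1900-based and the month
        // is 0-based, so this means 2015-06-25, not 115-05-25.
        Date date = new Date(115, 5, 25);

        System.out.println(date.getYear());  // 115 (meaning 2015)
        System.out.println(date.getMonth()); // 5 (meaning June)
        System.out.println(date.getDate());  // 25 (the day of month)
        System.out.println(date.getDay());   // 4 (day of week, 0 = Sunday; here Thursday)

        // Leniency in action: "June 31st" silently becomes July 1st
        System.out.println(new Date(115, 5, 31)); // Wed Jul 01 00:00:00 ... 2015
    }
}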

I could find more problems, but they would be getting pickier. That’s a plentiful list to be going on with. On the plus side:

  • It unambiguously represents a single value: an instant in time, with no associated calendar system, time zone or text format, to a precision of milliseconds.

Unfortunately even this one “good aspect” is poorly understood by developers. Let’s unpack it…

What’s an “instant in time”?

Note: I’m ignoring relativity and leap seconds for the whole of the rest of this post. They’re very important to some people, but for most readers they would just introduce more confusion.

When I talk about an “instant” I’m talking about the sort of concept that could be used to identify when something happened. (It could be in the future, but it’s easiest to think about in terms of a past occurrence.) It’s independent of time zone and calendar system, so multiple people using their “local” time representations could talk about it in different ways.

Let’s use a very concrete example of something that happened somewhere that doesn’t use any time zones we’re familiar with: Neil Armstrong walking on the moon. The moon walk started at a particular instant in time – if multiple people from around the world were watching at the same time, they’d all (pretty much) say “I can see it happening now” simultaneously.

If you were watching from mission control in Houston, you might have thought of that instant as “July 20th 1969, 9:56:20 pm CDT”. If you were watching from London, you might have thought of that instant as “July 21st 1969, 3:26:20 am BST”. If you were watching from Riyadh, you might have thought of that instant as “Jumādá 7th 1389, 5:56:20 am (+03)” (using the Umm al-Qura calendar). Even though different observers would see different times on their clocks – and even different years – they would still be considering the same instant. They’d just be applying different time zones and calendar systems to convert from the instant into a more human-centric concept.

So how do computers represent instants? They typically store an amount of time before or after a particular instant which is effectively an origin. Many systems use the Unix epoch, which is the instant represented in the Gregorian calendar in UTC as midnight at the start of January 1st 1970. That doesn’t mean the epoch is inherently “in” UTC – the Unix epoch could equally well be defined as “the instant at which it was 7pm on December 31st 1969 in New York”.

The Date class uses “milliseconds since the Unix epoch” – that’s the value returned by getTime(), and set by either the Date(long) constructor or the setTime() method. As the moon walk occurred before the Unix epoch, the value is negative: it’s actually -14159020000.

To demonstrate how Date interacts with the system time zone, let’s show the three time zones mentioned before – Houston (America/Chicago), London (Europe/London) and Riyadh (Asia/Riyadh). It doesn’t matter what the system time zone is when we construct the date from its epoch-millis value – that doesn’t depend on the local time zone at all. But if we use Date.toString(), that converts to the current default time zone to display the result. Changing the default time zone does not change the Date value at all. The internal state of the object is exactly the same. It still represents the same instant, but methods like toString(), getMonth() and getDate() will be affected. Here’s sample code to show that:

import java.util.Date;
import java.util.TimeZone;

public class Test {

    public static void main(String[] args) {
        // The default time zone makes no difference when constructing
        // a Date from a milliseconds-since-Unix-epoch value
        Date date = new Date(-14159020000L);

        // Display the instant in three different time zones
        TimeZone.setDefault(TimeZone.getTimeZone("America/Chicago"));
        System.out.println(date);

        TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"));
        System.out.println(date);

        TimeZone.setDefault(TimeZone.getTimeZone("Asia/Riyadh"));
        System.out.println(date);

        // Prove that the instant hasn't changed...
        System.out.println(date.getTime());
    }
}

The output is as follows:

Sun Jul 20 21:56:20 CDT 1969
Mon Jul 21 03:56:20 GMT 1969
Mon Jul 21 05:56:20 AST 1969
-14159020000

The “GMT” and “AST” abbreviations in the output here are highly unfortunate – java.util.TimeZone doesn’t have the right names for pre-1970 values in all cases. The times are right though.

Common questions

How do I convert a Date to a different time zone?

You don’t – because a Date doesn’t have a time zone. It’s an instant in time. Don’t be fooled by the output of toString(). That’s showing you the instant in the default time zone. It’s not part of the value.

If your code takes a Date as an input, any conversion from a “local time and time zone” to an instant has already occurred. (Hopefully it was done correctly…)

If you start writing a method with a signature like this, you’re not helping yourself:

// A method like this is always wrong
Date convertTimeZone(Date input, TimeZone fromZone, TimeZone toZone)
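
What you can do instead is keep the Date untouched, and apply a time zone only when extracting field values or text. A minimal sketch using Calendar, with the moon-walk instant from earlier:

import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

public class ZoneDemo {
    public static void main(String[] args) {
        Date date = new Date(-14159020000L);

        // The zone belongs to the Calendar, not to the Date
        Calendar calendar = Calendar.getInstance(TimeZone.getTimeZone("Europe/London"));
        calendar.setTime(date);

        System.out.println(calendar.get(Calendar.MONTH) + 1);    // 7 (months are 0-based...)
        System.out.println(calendar.get(Calendar.DAY_OF_MONTH)); // 21
        System.out.println(calendar.get(Calendar.HOUR_OF_DAY));  // 3
    }
}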

How do I convert a Date to a different format?

You don’t – because a Date doesn’t have a format. Don’t be fooled by the output of toString(). That always uses the same format, as described by the documentation.

To format a Date in a particular way, use a suitable DateFormat (potentially a SimpleDateFormat) – remembering to set the time zone to the appropriate zone for your use.
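
For example – a sketch using the moon-walk instant again, with some hedging around the exact zone abbreviations printed, given the naming problem mentioned earlier:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class FormatDemo {
    public static void main(String[] args) {
        Date moonWalk = new Date(-14159020000L);

        SimpleDateFormat format =
            new SimpleDateFormat("yyyy-MM-dd HH:mm:ss zzz", Locale.US);

        // The *format* has a time zone; the Date still doesn't.
        format.setTimeZone(TimeZone.getTimeZone("America/Chicago"));
        System.out.println(format.format(moonWalk)); // 1969-07-20 21:56:20 CDT

        format.setTimeZone(TimeZone.getTimeZone("Europe/London"));
        System.out.println(format.format(moonWalk)); // 1969-07-21 03:56:20 (abbreviation varies)
    }
}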

There’s a hole in my abstraction, dear Liza, dear Liza

I had an interesting day at work today. I thought my code had broken… but it turns out it was just a strange corner case which made it work very slowly. Usually when something interesting happens in my code it’s quite hard to blog about it, because of all the confidentiality issues involved. In this case, it’s extremely easy to reproduce the oddity in an entirely vanilla manner. All we need is the Java collections API.

I have a set – a HashSet, in fact. I want to remove some items from it… and many of the items may well not exist. In fact, in our test case, none of the items in the "removals" collection will be in the original set. This sounds – and indeed is – extremely easy to code. After all, we’ve got Set<T>.removeAll to help us, right?

Let’s make this concrete, and look at a little test. We specify the size of the "source" set and the size of the "removals" collection on the command line, and build both of them. The source set contains only non-negative integers; the removals collection contains only negative integers. We measure how long it takes to remove all the elements using System.currentTimeMillis(), which isn’t the world’s most accurate stopwatch but is more than adequate in this case, as you’ll see. Here’s the code:

import java.util.*;

public class Test
{
    public static void main(String[] args)
    {
        int sourceSize = Integer.parseInt(args[0]);
        int removalsSize = Integer.parseInt(args[1]);
        
        Set<Integer> source = new HashSet<Integer>();
        Collection<Integer> removals = new ArrayList<Integer>();
        
        for (int i = 0; i < sourceSize; i++)
        {
            source.add(i);
        }
        for (int i = 1; i <= removalsSize; i++)
        {
            removals.add(-i);
        }
        
        long start = System.currentTimeMillis();
        source.removeAll(removals);
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end – start) + "ms");
    }
}

Let’s start off by giving it an easy job: a source set of 100 items, and 100 to remove:

c:\Users\Jon\Test>java Test 100 100
Time taken: 1ms

Okay, so we hadn’t expected it to be slow… clearly we can ramp things up a bit. How about a source of one million items¹ and 300,000 items to remove?

c:\Users\Jon\Test>java Test 1000000 300000
Time taken: 38ms

Hmm. That still seems pretty speedy. Now I feel I’ve been a little bit cruel, asking it to do all that removing. Let’s make it a bit easier – 300,000 source items and 300,000 removals:

c:\Users\Jon\Test>java Test 300000 300000
Time taken: 178131ms

Excuse me? Nearly three minutes? Yikes! Surely it ought to be quicker to remove items from a smaller set than from the million-item set we handled in 38ms? Well, it does all make sense, eventually. HashSet<T> extends AbstractSet<T>, which includes this snippet in its documentation for the removeAll method:

This implementation determines which is the smaller of this set and the specified collection, by invoking the size method on each. If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection. If it is so contained, it is removed from this set with the iterator’s remove method. If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set’s remove method.

Now that sounds reasonable on the surface of it – iterate through the smaller collection, check for the presence in the bigger collection. However, this is where the abstraction is leaky. Just because we can ask for the presence of an item in a large collection doesn’t mean it’s going to be fast. In our case, the collections are the same size – but checking for the presence of an item in the HashSet is O(1) whereas checking in the ArrayList is O(N)… whereas the cost of iterating is going to be the same for each collection. Basically by choosing to iterate over the HashSet and check for presence in the ArrayList, we’ve got an O(M * N) solution overall instead of an O(N) solution. Ouch. The removeAll method is making an "optimization" based on assumptions which just aren’t valid in this case.

Then fix it, dear Henry, dear Henry, dear Henry

There are two simple ways of fixing the problem. The first is to simply change the type of the collection we’re removing from. Simply changing ArrayList<Integer> to HashSet<Integer> gets us back down to the 34ms range. We don’t even need to change the declared type of removals.
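
In the test program above, that really is the only change needed:

// contains() is now O(1) instead of O(N); the declared type can stay the same
Collection<Integer> removals = new HashSet<Integer>();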

The second approach is to change the API we use: if we know we want to iterate over removals and perform the lookup in source, that’s easy to do:

for (Integer value : removals)
{
    source.remove(value);
}

In fact, on my machine that performs slightly better than removeAll – it doesn’t need to check the return value of remove on each iteration, which removeAll does in order to return whether or not any items were removed. The above runs in about 28ms. (I’ve tested it with rather larger datasets, and it really is faster than the dual-hash-set approach.)

However, both of these approaches require comments in the source code to explain why we’re not using the most obvious code (a list and removeAll). I can’t complain about the documentation here – it says exactly what it will do. It’s just not obvious that you need to worry about it, until you run into such a problem.

So what should the implementation do? Arguably, it really needs to know what’s cheap in each of the collections it’s dealing with. The idea of probing for performance characteristics before deciding on a strategy is completely anathema to the clean abstractions we like to believe in with frameworks like the Java collections API… but maybe in this case it would be a good idea.


¹ Please perform Dr Evil impression while reading this. I’m watching you through your webcam, and I’ll be disappointed if I don’t see you put your little finger to your mouth.

Noda Time is born

There was an amazing response to yesterday’s post – not only did readers come up with plenty of names, but lots of people volunteered to help. As a result, I’m feeling under a certain amount of pressure for this project to actually take shape.

The final name chosen is Noda Time. We now have a Google Code Project and a Google Group (/mailing list). Now we just need some code…

I figured it would be worth explaining a bit more about my vision for the project. Obviously I’m only one contributor, and I’m expecting everyone to add their own views, but this can act as a starting point.

I want this project to be more than just a way of getting better date and time handling on .NET. I want it to be a shining example of how to build, maintain and deploy an open source .NET library. As some of you know, I have a few other open source projects on the go, and they have different levels of polish. Some have downloadable binaries, some don’t. They all have just-about-enough-to-get-started documentation, but not nearly enough, really. They have widely varying levels of test coverage. Some are easier to build than others, depending on what platform you’re using.

In some ways, I’m expecting the code to be the easy part of Noda Time. After all, the implementation is there already – we’ll have plenty of interesting design decisions to make in order to marry the concepts of Joda Time with the conventions of .NET, but that shouldn’t be too hard. Here are the trickier things, which need discussion, investigation and so forth:

  • What platforms do we support? Here’s my personal suggested list:
    • .NET 4.0
    • .NET 3.5
    • .NET 2.0 SP1 (requiring the service pack for DateTimeOffset)
    • Mono (versions TBD)
    • Silverlight 2, 3 and 4
    • Compact Framework 2.0 and 3.5
  • What do we ship, and how do we handle different platforms? For example, can we somehow use Code Contracts to give developers a better experience on .NET 4.0 without making it really hard to build for other versions of .NET? Can we take advantage of the availability of TimeZoneInfo in .NET 3.5 and still build fairly easily for earlier versions? Do developers want debug or release binaries? Can we build against the client profile of .NET 3.5/4.0?
  • What should we use to build? I’ve previously used NAnt for the overall build process and MSBuild for the code building part. While this has worked quite well, I’m nervous of the dependency on NAnt-Contrib library for the <msbuild> task, and generally being dependent on a build project whose last release was a beta nearly two years ago. Are there better alternatives?
  • How should documentation be created and distributed?
    • Is Sandcastle the best way of building docs? How easy is it to get it running so that any developer can build the docs at any time? (I’ve previously tried a couple of times, and failed miserably.)
    • Would Monodoc be a better approach?
    • How should non-API documentation be handled? Is the wiki which comes with the Google Code project good enough? Do we need to somehow suck the wiki into an offline format for distribution with the binaries?
  • What do we need to do in order to work in low-trust environments, and how easily can we test that?
  • What do we do about signing? Ship with a "public" snk file which anyone can build with, but have a private version which the team uses to validate a "known good" release? Or just have the private key and use deferred signing?
  • While the library itself will support i18n for things like date/time formatting, do we need to apply it to "developer only" messages such as exceptions?
  • I’m used to testing with NUnit and Rhino.Mocks, but they’re not the last word in testing on .NET – what should we use, and why? What about coverage?
  • Do we need any dependencies (e.g. logging)? If so, how do we handle versioning of those dependencies? How are we affected by various licences?

These are all interesting topics, but they’re not really specific to Noda Time. Information about them is available all over the place, but that’s just the problem – it’s all over the place. I would like there to be some sort of documentation saying, "These are the decisions you need to think about, here are the options we chose for Noda Time, and this is why we did so." I don’t know what form that documentation will take yet, but I’m considering an ebook.

As you can tell, I’m aiming pretty high with this project – especially as I won’t even be using Google’s 20% time on it. However, there’s little urgency in it for me personally. I want to work out how to do things right rather than how to do them quickly. If it takes me a bit of time to document various decisions, and the code itself ships later, so be it… it’ll make the next project that much speedier.

I’m expecting a lot of discussion in the group, and no doubt some significant disagreements. I’m expecting to have to ask a bunch of questions on Stack Overflow, revealing just how ignorant I am on a lot of the topics above (and more). I think it’ll be worth it though. I think it’s worth setting a goal:

In one year, I want this to be a first-class project which is the natural choice for any developers wanting to do anything more than the simplest of date/time handling on .NET. In one year, I want to have a guide to developing open source class libraries on .NET which tells you everything you need to know other than how to write the code itself.

A year may seem like a long time, but I’m sure everyone who has expressed an interest in the project has significant other commitments – I know I do. Getting there in a year is going to be a stretch – but I’m expecting it to be a very enlightening journey.

What’s in a name (again)?

I have possibly foolishly decided to stop resisting the urge to port Joda Time to .NET. For those of you who are unaware, "use Joda Time" is almost always the best answer to any question involving "how do I achieve X with java.util.Date/Calendar?" It’s a Java library for handling dates and times, and it rocks. There is a plan to include a somewhat redesigned version in some future edition of Java (JSR-310) but it’s uncertain whether this will ever happen.

Now, .NET only gained the ability to work with time zones other than UTC and the local time zone (using only managed code) with TimeZoneInfo in .NET 3.5 – it has a bit of catching up to do. It’s generally easier to work with the .NET BCL than the Java built-in libraries, but it’s still not a brilliant position to be in. I think .NET deserves good date/time support, and as no-one else appears to be porting Joda Time, I’m going to do it. (A few people have already volunteered to help. I don’t know how easily we’ll be able to divvy up the work, but we’ll see. I suspect the core may need to be done first, and then people can jump in to implement different chronologies etc. As a side-effect, I may try to use this project as a sort of case study in terms of porting, managing an open source project, and properly implementing a .NET library with useful versioning etc.)

The first two problems, however, are to do with naming. First, the project name. Contenders include:

  • Joda Time.NET (sounds like it would be an absolutely direct port; while I intend to port all the tricky bits directly, it’s going to be an idiomatic port with appropriate .NET bits. It’s also a bit of a mouthful.)
  • Noda Time (as suggested in the comments and in email)
  • TonyTime (after Tony the Pony)
  • CoffeeTime
  • TeaTime
  • A progression of BreakfastTime, CoffeeTime, LunchTime, TeaTime, DinnerTime and SupperTime for different versions (not a serious contender)
  • ParsleySageRosemaryAndThyme (not a serious contender)
  • A few other silly ones too

I suspect I’m going to go for CoffeeTime, but we’ll see.

The second problem is going to prove more awkward. I want to mostly copy the names given in Joda Time – aside from anything else, it’ll make it familiar to anyone who uses Joda Time in Java (such as me). Now one of the most commonly used classes in Joda is "DateTime". Using that name in my port would be a Bad Idea. Shadowing a name in the System namespace is likely to lead to very disgruntled users who may prove hard to regruntle before they abandon the library.

So what do I do? Go for the subtly different DateAndTime? Tie it to the library with CoffeeDateTime? Change it to Instant? (It’ll derive from AbstractInstant anyway – assuming I keep the same hierarchy instead of moving to a composition model and value types.)

Obviously this is a decision which the "team" can make, when we’ve got one… but it feels like a decision which is lurking round the corner in a hostile way.

What I find interesting is that these are two very different naming problems: one is trying to name something in a relatively arbitrary way – I know I want something reasonably short and memorable for the overall name, but beyond that it doesn’t matter too much. The other is trying to nail a very specific name which really has to convey its meaning clearly… but where the obvious name is already taken. Also interestingly, neither is a particularly good example of my most common issue with naming: attempting to come up with a two or three word noun for something that actually needs a whole sentence to describe it adequately.

Oh well – we’ll see what happens. In another blog post I’ll suggest some of the goals I have in terms of what I’m hoping to learn from the project, and how I’d like it to progress. In other words, expect a work of complete fiction…

If you’re interested in helping out with the project, please mail me directly (rather than adding comments here) and as soon as I’ve set the project up, I’ll invite you to the mailing list.

UPDATE: I’ve already got a few interested names, which is great. Rather than be dictatorial about this, I’ll put it to a vote of the people who are willing to help out on it.

A different approach to inappropriate defaults

I’ve had a couple of bug reports about my Protocol Buffers port – both nicely detailed, and one including a patch to fix it. (It’s only due to my lack of timeliness in actually submitting the change that the second bug report occurred. Oops.)

The bug was in text formatting (although it also affected parsing). I was using the default ToString behaviour for numbers, which meant that floats and doubles were being formatted as "50,15" in Germany instead of "50.15". The unit tests caught this, but only if you ran them on a machine with an appropriate default culture.

Aaargh. I’ve been struggling with a similar problem in a library I can’t change, which uses the system default time zone for various calculations in Java. When you’re running server code, the default time zone is almost never the one you want to use, and it certainly isn’t in my case.

A similar problem is Java’s decision to use the system default encoding in all kinds of bizarre places – FileReader doesn’t even let you specify the encoding, which makes it almost entirely useless in my view.
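
For reference, the usual workaround is to drop down to InputStreamReader and say what you mean – a sketch, assuming a UTF-8 file called data.txt:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class ExplicitEncoding {
    public static void main(String[] args) throws IOException {
        // FileReader would silently use the platform default encoding;
        // InputStreamReader lets us state the encoding explicitly.
        Reader reader = new InputStreamReader(
            new FileInputStream("data.txt"), Charset.forName("UTF-8"));
        try {
            BufferedReader buffered = new BufferedReader(reader);
            String line;
            while ((line = buffered.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            reader.close();
        }
    }
}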

So I’ve been wondering how we could fix this and problems like it. One option is to completely remove the defaults: you would always have to pass in a CultureInfo/Locale, a TimeZoneInfo/TimeZone, or an Encoding/Charset when calling any method which might be culturally sensitive.

Making life easier (in .NET)

It strikes me that .NET has a useful abstraction here: the assembly as the unit of deployment. (Java’s closest equivalent is probably a jar file, which gets messier.)

Within one assembly, I suspect in many cases you always want to make the same decision. For example, in protocol buffers I would like to use the invariant culture all the time. It would be nice if I could say that, and then get the right behaviour by default. Here are the options I’d like to be able to apply (for each of culture, time zone and character encoding – there may be others):

  • Use a culture-neutral default (the invariant culture, UTF-8, UTC)
  • Use a specific set of values (e.g. en-GB, Windows-1252, "Europe/London")
  • Use the system default values
  • Use whatever the calling assembly is using

Of course you should still have the option of specifying overrides on a per call basis, but I think this might be a way forward.

Thoughts? I realise it’s almost certainly too late for this to actually be implemented now, but would it have been a good idea? Or is it just an alternative source of confusion?

Language proliferation

I’ve always been aware that .NET supports multiple languages (obviously) and that Microsoft has been experimenting with this to some extent. It’s only recently struck me just to what extent this is the case though.

Here’s a list – almost certainly incomplete – of .NET languages from Microsoft alone.

Some of these are research languages which are more important for the ideas they’ve contributed to more mainstream ones at a later date than for anything else – but there’s still a lot of effort represented in the list.

In addition, there are third party languages targeting .NET, such as Boo, IronScheme and Scala. (Wikipedia lists loads of them.)

Now, think back to the time before .NET. Was Microsoft actively experimenting with languages back then? Plenty of people were trying things against the JVM, but Sun was pretty much absent from that party. .NET seems to be a "missing ingredient" that has allowed smart folk at Microsoft to let their imaginations loose in ways which they couldn’t previously. (Of course, not everyone in the language business at MS started there: Jim Hugunin was hired by Microsoft precisely because of his work on IronPython.)

I wonder how long this will continue.

Tower of Babel, or land of polyglots?

What does this mean for the average developer? Currently, if you’re writing a non-web application in .NET, you really only need to know a single language – and any of them will do. (Plus potentially SQL of course…) Compare this with web developers who have to be intimately familiar with HTML, CSS and JavaScript – and the differences between various implementations.

How long will it be before backend developers are expected to know a dynamic language, a static OO language and a functional language? Is the benefit of mixing several languages in a project worth the impedance mismatch and the increased skillset requirements? I’m not going to make any predictions on that front – I can certainly see the benefits of each of these approaches in certain situations. They’ve been designed to play well together, but there are bound to be limitations and oddities: times when you need to change how you write your F# so that it’s easily callable from C#, for example.

Whether or not you learn multiple languages to a professional level is one thing, but becoming familiar with them is a different matter. In the course of co-authoring Functional Programming for the Real World (where "co-author" is a bit of a stretch title – I’ve played more of an editorial role really, with the added bonus of picking on Tomas whenever I felt he was perhaps a little harsh towards C#) I’ve learned to appreciate many of F#’s qualities, but I don’t really know the language. If someone asked me to write a complete application in it (rather than just a toy experiment) I’d be reaching for books every other minute. I hope I’ll learn more over the course of time, but I doubt that I’ll ever be sufficiently experienced in it to put it on my CV. The same goes for IronPython, although I’m considerably more likely to need Python at work than I am F#. (Python is one of the three "approved" languages at Google, along with Java and C++.) None of this means that time spent in these languages is wasted: I’ll be able to apply a lot of what I’ve learned about F# to my C# coding, even if it will make me pine for things like pattern matching and asynchronous workflows periodically.

I think it’s pretty much a given that these days we all need to bring a wide range of technologies to bear in most jobs. While it used to be just about feasible in the .NET 1.1 days to have a pretty good grasp of all the major aspects (ASP.NET for sites and web services, ADO.NET, WinForms, Windows services, class libraries, interop) it’s just impossible these days. We learn something new when we need to – but usually against the background of a familiar language. How well would we cope if we had to learn whole new languages (to the level of being able to use them for production code) as often as we have to learn new libraries?

This worries me a little. I’m pleased to see that C# 4 is a much smaller change than the previous versions were. Admittedly I’d rather have had immutability support than dynamic, but that’s just me… and that’s the problem, too. While I worry about our ability to actually learn everything that’s becoming available, it’s all good stuff. Can there be "too much of a good thing"?

What I really don’t want to see is developers having to know multiple languages, and everyone knowing them poorly. I’m a big believer in having a thorough understanding of your language, so that even if everything else is new, you can rely on your understanding of that aspect of your code. It would be a shame if the pressure of knowing many languages turned many of us into cargo cult programmers. The utopia would be for us all to turn into language renaissance developers. I suspect the reality will be somewhere between the two.

Still, as long as I get to keep helping authors write about languages I know almost nothing about, I’m sure I’ll be happy…

What’s in a name?

T.S. Eliot had the right idea when he wrote “The naming of cats”:

The Naming of Cats is a difficult matter,
It isn’t just one of your holiday games

When you notice a cat in profound meditation,
The reason, I tell you, is always the same:
His mind is engaged in a rapt contemplation
Of the thought, of the thought, of the thought of his name:
His ineffable effable
Effanineffable
Deep and inscrutable singular Name.

Okay, so developers may not contemplate their own names much, but I know I’ve certainly spent a significant amount of time recently trying to work out the right name for various types and methods.  It always feels like it’s just out of reach; tauntingly, tantalisingly close.

Recently I’ve been thinking a bit about what the goals might be in coming up with a good name. In particular, I seem to have been plagued with the naming problem more than usual in the last few weeks.

Operations on immutable types

A while ago I asked a question on Stack Overflow about naming a method which “adds” an item to an immutable collection. Of course, when I say “adds” I mean “returns a new collection whose contents are those of the old collection plus the new item.” There’s a really wide range of answers (currently 38 of them) which mostly seem to fall into four categories:

  • Use Add because it’s idiomatic for .NET collections. Developers should know that the type is immutable and act accordingly.
  • Use Cons because that’s the term functional programming has used for this exact operation for decades.
  • Use a new method name (Plus being my favourite at the moment) which will be obvious to non-functional developers, but without being so familiar that it suggests mutability.
  • Use a constructor taking the old collection and the new item.

Part of the reasoning for Add being okay is that I originally posted the question purely about “an immutable collection” – e.g. a type which would have a name like ImmutableList<T>. I then revealed my true intention (which I should have done from the start) – to use this in MiniBench, where the “collection” would actually be a TestSuite. Everything in MiniBench is immutable (it’s partly an exploration in functional programming, as it seems to fit very nicely) but I don’t want to have to name every single type as Immutable[Whatever]. There’s the argument that a developer should know at least a little bit about any API they’re using, and the immutability aspect is one of the first things they should know. However, MiniBench is arguably an extreme case, because it’s designed for sharing test code with people who’ve never seen it before.

I’m pretty sure I’m going to go with Plus in the end:

  • It’s close enough to Add to be familiar
  • It’s different enough to Add to suggest that it’s not quite the same thing as adding to a normal collection
  • It sounds like it returns something – a statement which just calls Plus without using the result sounds like it’s wrong (and indeed it would be)
  • It’s meaningful to everyone
  • I have a precedent in the Joda Time API

Another option is to overload the + operator, but I’m not really sure I’m ready to do that just yet. It would certainly lead to brief code, but is that really the most important thing?
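
To show the shape of what I mean, here’s a minimal sketch in Java (matching the Joda Time precedent – MiniBench itself is .NET, and the real TestSuite has rather more to it; Runnable simply stands in for the real test type):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public final class TestSuite {
    private final List<Runnable> tests;

    public TestSuite(List<Runnable> tests) {
        // Defensive copy: no-one can mutate the suite via the original list
        this.tests = Collections.unmodifiableList(new ArrayList<Runnable>(tests));
    }

    // "Plus", not "Add": returns a new suite, leaving this one untouched.
    // A statement which calls plus and ignores the result is almost
    // certainly a bug - which is exactly the point of the name.
    public TestSuite plus(Runnable test) {
        List<Runnable> copy = new ArrayList<Runnable>(tests);
        copy.add(test);
        return new TestSuite(copy);
    }
}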

Let’s look at a situation with some of the same issues…

LINQ operators

Work on MoreLINQ has progressed faster than expected, mostly because the project now has four members, and they’ve been expending quite a bit of energy on it. (I must do a proper consistency review at some point – in particular it would be nice to have the docs refer to the same concepts in the same way each time. I digress…)

Most of the discussion in the project hasn’t been about functionality – it’s been about naming. In fact, LINQ is particularly odd in this respect. If I had to guess at how the time has been spent (at least for the operators I’ve implemented) I’d go for:

  • 15% designing the behaviour
  • 20% writing the tests
  • 10% implementation
  • 5% writing the documentation (just XML docs)
  • 50% figuring out the best name

It really is that brutal – and for a lot of the operators we still haven’t got the “right” name yet, in my view. There’s generally too much we want to convey in a word or two. As an example, we’ve got an operator similar to the oft-implemented ForEach one, but which yields the input sequence back out again. Basically it takes an action, and for each element it calls the action and then yields the element. The use case is something like logging. We’ve gone through several names, such as Pipe, Tee, Via… and just this morning I asked a colleague who suggested Apply, just off the top of his head. It’s better than anything we’d previously thought of, but does it convey both the “apply an action” and “still yield the original sequence” aspects?
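
Keeping this page’s examples in one language, here’s the operator sketched in Java terms – as it happens, java.util.stream later shipped exactly this operator under the name peek, which rather proves how hard the naming is:

import java.util.stream.Stream;

public class PeekDemo {
    public static void main(String[] args) {
        // Perform an action on each element as it flows past,
        // yielding the sequence unchanged (the "Pipe"/"Tee"/"Apply" idea).
        long remaining = Stream.of("a", "b", "c")
            .peek(x -> System.out.println("saw " + x)) // logging-style side effect
            .filter(x -> !x.equals("b"))
            .count();
        System.out.println(remaining); // 2
    }
}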

The old advice of “each method should only do one thing” is all very well, and it clearly helps to make naming simpler, but with situations like this one there are just naturally more concepts which you want to get across in the name.

Let’s stay on the LINQ topic, but stray a bit further from the well-trodden path…

The heart of Push LINQ: IDataProducer

I’ve probably bored most of you with Push LINQ by now, and I’m not actively developing it at the moment, but there’s still one aspect which I’m deeply uncomfortable with: the core interface. IDataProducer represents a stream of data which can be observed. Basically clients subscribe to events, and their event handlers will be called when data is “produced” and when the stream ends.

I know IDataProducer is an awful name – but so far I haven’t found anything better. IObservable? Ick. Overused and isn’t descriptive. IPushEnumerable? Sounds like the client can iterate over the data, which they can’t. The actual event names (DataProduced/EndOfData) are okay but there must be something better than IDataProducer. (Various options have been suggested in the past – none of them have been so obviously “right” as to stick in my head…)

This situation is slightly different to the previous ones, however, simply because it’s such a pivotal type. You would think that the more important the type, the more important the name would be – but in some ways the reverse is true. You see, Push LINQ isn’t a terribly “obvious” framework. I say that without shame – it’s great at what it does, but it takes a few mental leaps before you really grok it. You’re really going to have to read some documentation or examples before you write your own queries.

Given that constraint, it doesn’t matter too much what the interface is called – it’s going to be explained to you before you need it. It doesn’t need to be discoverable – whereas when you’re picking method names to pop up in Intellisense, you really want the developer to be able to guess its purpose even before they hover over it and check the documentation.

I haven’t given up on IDataProducer (and I hope to be moving Push LINQ into MoreLINQ, by the way – working out a better name is one of the blockers) but it doesn’t feel like quite as much of a problem.

Read-only or not read-only?

This final example came up at work, just yesterday – after I’d started writing this post. I wanted to refactor some code to emphasize which methods only use the read-only side of an interface. This was purely for the sake of readability – I wanted to make it easier to reason about which areas of the code modified an object and which didn’t. It’s a custom collection – the details don’t matter, but for the sake of discussion let’s call it House and pretend we’re modelling the various things which might be in a house. (This is Java, hence House rather than IHouse.)

I’m explicitly not doing this for safety – I don’t mind the fact that the reference could be cast to a mutable interface. The point is just to make it self-documenting that if a method only has a parameter in the non-mutating form, it’s not going to change the contents of the house.

So, we have two interfaces, like this:

public interface NameMePlease
{
    Color getDoorColor();
    int getWindowCount();

    // This already returned a read-only collection
    Set<Furniture> getFurniture();
}

public interface House extends NameMePlease
{
    void setDoorColor(Color doorColor);
    void setWindowCount(int windows);
    void addFurniture(Furniture item);
}

Obviously the challenge is to find a name for NameMePlease. One option is to use something like ImmutableHouse or ReadOnlyHouse – but the inheritance hierarchy makes liars of both of those names. How can it be a ReadOnlyHouse if there are methods in an implementation which change it? The interface should say what you can do with the type, rather than specifying what you can’t do – unless part of the contract of the interface is that the implementation will genuinely prohibit changes.

Thinking of this “positive” aspect led me to ReadableHouse, which is what I’ve gone with for the moment. It states what you can do with it – read information. Again, this is a concept which Joda Time uses.
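
So the read-only half becomes ReadableHouse, and a method signature can then document its own intent – a sketch using the interfaces above:

// Self-documenting: this method only needs read access, so it can't
// be the culprit when the contents of the house change unexpectedly.
public void printInventory(ReadableHouse house) {
    for (Furniture item : house.getFurniture()) {
        System.out.println(item);
    }
}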

Another option is to make it just House, and change the mutable interface to MutableHouse or something similar. In this particular situation the refactoring involved would have been enormous. Simple to automate, but causing a huge check-in for relatively little benefit. Almost all uses are actually mutating ones. The consensus within the Google Java mailing list seems to be that this would have been the preferred option, all things being equal. One interesting data point was that although Joda Time uses ReadableInstant etc, the current proposals for the new date/time API which will be included in Java 7, designed by the author of Joda Time, don’t use this convention. Presumably the author found it didn’t work quite as well as he’d hoped, although I don’t know of any specific problems.

Conclusion

You’ll probably be unsurprised to hear that I don’t have a recipe for coming up with good names. However, in thinking about naming I’ve at least worked out a few points to think about:

  • Context is important: how discoverable does this need to be? Is accuracy more important than brevity? Do you have any example uses (e.g. through tests) which can help to see whether the code feels right or not?
  • Think of your audience. How familiar will they be with the rest of the code you’re writing? Are they likely to have a background in other areas of computer science where you could steal terminology? Can you make your name consistent with other common frameworks they’re likely to use? The reverse is true too: are you reusing a familiar name for a different concept, which could confuse readers?
  • Work out the information the name is trying to convey. For types, this includes working out how it participates in inheritance. Is it trying to advertise capabilities or restrictions?
  • Is it possible to make correct code look correct, and buggy code look wrong? This is rarely feasible, but it’s one of the main attractions of “Plus” in the benchmark case. (I believe this is one of the main selling points of true Hungarian Notation for variable naming, by the way. I’m not generally a fan, but I like this aspect.)

I may expand this list over time…

I think it’s fitting to close with a quote from Phil Karlton:

There are only two hard things in Computer Science: cache invalidation and naming things.

Almost all of us have to handle naming things. Let’s hope most of us don’t have to mess with cache invalidation as well.

Redesigning System.Object/java.lang.Object

I’ve had quite a few discussions with a colleague about some failures of Java and .NET. The issue we keep coming back to is the root of the inheritance tree. There’s no doubt in my mind that having a tree with a single top-level class is a good thing, but it’s grown a bit too big for its boots.

Pretty much everything in this post applies to both .NET and Java, sometimes with a few small changes. Where it might be unclear, I’ll point out the changes explicitly – otherwise I’ll just use the .NET terminology.

What’s in System.Object?

Before we work out what we might be able to change, let’s look at what we’ve got. I’m only talking about instance methods. At the moment:

Life-cycle and type identity

There are three members which I believe really need to be left alone.

We need a parameterless constructor because (at least with the current system of chaining constructors to each other) we have to have some constructor, and I can’t imagine what parameter we might want to give it. I certainly find it hard to believe there’s a particular piece of state which really deserves to be a part of every object but which we’re currently missing.

I really don’t care that much about finalizers. Should the finalizer be part of Object itself, or should it just get handled automatically by the CLR if and only if it’s defined somewhere in the inheritance chain? Frankly, who cares. No doubt it makes a big difference to the implementation somewhere, but that’s not my problem. All I care about when it comes to finalizers is that when I have to write them it’s as easy as possible to do it properly, and that I don’t have to write them very often in the first place. (With SafeHandle, it should be a pretty rare occurrence in .NET, even when you’re dealing directly with unmanaged resources.)

GetType() (or getClass() in Java) is pretty important. I can’t see any particular alternative to having this within Object, unless you make it a static method somewhere else with an Object parameter. In fact, that would have the advantage of freeing up the name for use within your own classes. The functionality is sufficiently important (and really does apply to every object) that I think it’s worth keeping.

Comparison methods

Okay, time to get controversial. I don’t think every object should have to be able to compare itself with another object. Of course, most types don’t really support this anyway – we just end up with reference equality by default.

The trouble with comparisons is that everyone’s got a different idea of what makes something equal. There are some types where it really is obvious – there’s only one natural comparison. Integers spring to mind. There are other types which have multiple natural equality comparisons – floating point numbers (exact, within an absolute epsilon, and within a relative epsilon) and strings (ordinal, culture sensitive and/or case sensitive) are examples of this. Then there are composite types where you may or may not care about certain aspects – when comparing URLs, do I care about case? Do I care about fragments? For http, if the port number is explicitly specified as 80, is that different to a URL which is still http but leaves the port number implicit?

.NET represents these reasonably well already, with the IEquatable<T> interface saying “I know how to compare myself with an instance of type T, and how to produce a hashcode for myself” and IEqualityComparer<T> interface saying “I know how to compare two instances of T, and how to produce a hashcode for one instance of T.” Now suppose we didn’t have the (nongeneric!) Equals() method and GetHashCode() in System.Object. Any type which had a natural equality comparison would still let you compare it for equality by implementing IEquatable<T>.Equals – but anything else would either force you to use reference equality or an implementation of IEqualityComparer<T>.

Some of the principal consumers of equality comparisons are collections – particularly dictionaries (which is why it’s so important that the interfaces should include hashcode generation). With the current way that .NET generics work, it would be tricky to have a constraint on a constructor such that if you only specified the types, it would only work if the key type implemented IEquatable<T> – but it’s easy enough to do with static methods (on a non-generic type). Alternatively you could specify any type and an appropriate IEqualityComparer<T> to use for the keys. We’d need an IdentityComparer<T> to work just with references (and provide the equivalent functionality to Object.GetHashCode) but that’s not hard – and it would be absolutely obvious what the comparison was when you built the dictionary.
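
In Java terms, a hypothetical sketch of what that world might look like – none of these types exist, and all the names are invented for illustration:

// Equality as a strategy object, rather than a method on every object.
interface EqualityComparer<T> {
    boolean equals(T x, T y);
    int hashCode(T obj);
}

// Reference equality: the only comparison which genuinely applies to
// every object, equivalent to today's default Object behaviour.
final class IdentityComparer<T> implements EqualityComparer<T> {
    public boolean equals(T x, T y) {
        return x == y;
    }

    public int hashCode(T obj) {
        return System.identityHashCode(obj);
    }
}

// A map in this world must be told how to compare its keys, e.g.
//   new ExplicitHashMap<Key, Value>(new IdentityComparer<Key>());
// making the choice of comparison visible at construction time.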

Monitors and threading

This is possibly my biggest gripe. The fact that every object has a monitor associated with it was a mistake in Java, and was unfortunately copied in .NET. This promotes the bad practice of locking on “this” and on types – both of which are typically publicly accessible references. I believe that unless a reference is exposed explicitly for the purpose of locking (like ICollection.SyncRoot) then you should avoid locking on any reference which other code knows about. I typically have a private read-only variable for locking purposes. If you’re following these guidelines, it makes no sense to be able to lock on absolutely any reference – it would be better to make the Monitor class instantiable, and make Wait/Pulse/PulseAll instance members. (In Java this would mean creating a new class and moving Object.wait/notify/notifyAll members to that class.)
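
The “private lock” guideline from that paragraph, as a Java sketch:

public class Counter {
    // Private lock: no other code can see this reference, so no other
    // code can synchronize on it and interfere with our locking policy.
    private final Object lock = new Object();
    private int value;

    public void increment() {
        synchronized (lock) {
            value++;
        }
    }

    public int get() {
        synchronized (lock) {
            return value;
        }
    }
}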

This would lead to cleaner, more readable code in my view. I’d also do away with the “lock” statement in C#, making Monitor.Enter return a token implementing IDisposable – so “using” statements would replace locks, freeing up a keyword and giving the flexibility of having multiple overloads of Monitor.Enter. Arguably if one were redesigning all of this anyway, it would be worth looking at whether or not monitors should really be reentrant. Any time you use lock reentrancy, you’re probably not thinking hard enough about the design. Now there’s a nice overgeneralisation with which to end this section…
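
And a hypothetical Java flavour of that idea, with try-with-resources standing in for C#’s using statement (ReentrantLock does the real work; Monitor and LockToken are invented names):

import java.util.concurrent.locks.ReentrantLock;

final class Monitor {
    private final ReentrantLock lock = new ReentrantLock();

    // The token is the "IDisposable" of the piece: closing it releases the lock
    LockToken enter() {
        lock.lock();
        return new LockToken();
    }

    final class LockToken implements AutoCloseable {
        @Override
        public void close() {
            lock.unlock();
        }
    }
}

class Account {
    private final Monitor monitor = new Monitor();
    private int balance;

    void deposit(int amount) {
        // The monitor is released however we leave the block
        try (Monitor.LockToken token = monitor.enter()) {
            balance += amount;
        }
    }
}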

String representations

This is an interesting one. I’m genuinely on the fence here. I find ToString() (and the fact that it’s called implicitly in many circumstances) hugely useful, but it feels like it’s attempting to satisfy three different goals:

  • Giving a developer-readable representation when logging and debugging
  • Giving a user-readable representation as part of a formatted message in a UI
  • Giving a machine-readable format (although this is relatively rare for anything other than numeric types)

It’s interesting to note that Java and .NET differ as to which of these to use for numbers – Java plumps for “machine-readable” and .NET goes for “human-readable in the current thread’s culture”. Of course it’s clearer to explicitly specify the culture on both platforms.

The trouble is that very often, it’s not immediately clear which of these has been implemented. This leads to guidelines such as “don’t use ToString() other than for logging” on the grounds that at least if it’s implemented inappropriately, it’ll only be a log file which ends up with difficult-to-understand data.

Should this usage be explicitly stated – perhaps even codified in the name: “ToDebugString” or something similar? I will leave this for smarter minds to think about, but I think there’s enough value in the method to make it worth keeping.

MemberwiseClone

Again, I’m not sure on this one. It would perhaps be better as a static (generic!) method somewhere in a class whose name indicated “this is for sneaky runtime stuff”. After all, it constructs a new object without calling a constructor, and other funkiness. I’m less bothered by this than the other items though.

Conclusion

To summarise, in an ideal world:

  • Equals and GetHashCode would disappear from Object. Types would have to explicitly say that they could be compared.
  • Wait/Pulse/PulseAll would become instance methods in Monitor, which would be instantiated every time you want a lock.
  • ToString might be renamed to give clearer usage guidance.
  • MemberwiseClone might be moved to a different class.

Obviously it’s far too late for either Java or .NET to make these changes, but it’s always interesting to dream. Any more changes you’d like to see? Or violent disagreements with any of the above?

Data Structures and Algorithms: new free eBook available (first draft)

I’ve been looking at this for a while: Data Structures and Algorithms: Annotated reference with examples. It’s only in “first draft” stage at the moment, but the authors would love your feedback (as would I). Somehow I’ve managed to end up as the editor and proof-reader, although due to my holiday the version currently available doesn’t have many of my edits in. It’s a non-academic data structures and algorithms book, intended (as I see it, anyway) as a good starting point for those who know that they ought to be more aware of the data structures they use every day (lists, heaps etc) but don’t have an academic background in computer science.

An implementation will be available in both Java and C#, I believe.

Automatic lambda expressions

This morning I happened to show a colleague (Malcolm Rowe) the neat trick of using nullable types and the null-coalescing operator (??) to implement compound comparisons in C#. He asked whether it wouldn’t have been nicer to make this a library feature rather than a language feature. I’m all for putting features into libraries where possible, but there’s a problem in this case: the ?? operator doesn’t evaluate its right operand unless the left operand evaluates to null. This can’t be replicated in a library. Or can it?

The obvious way to lazily evaluate an expression is to turn it into a closure. So, we can write our coalescing method as:

public static T CoalesceNulls<T>(T lhs, Func<T> rhs)
{
    return lhs != null ? lhs : rhs();
}

 

That’s quite an efficient way of doing it, but the assymetry isn’t ideal. We can fix that by making the first argument a function too:

 

public static T CoalesceNulls<T>(Func<T> lhs, Func<T> rhs)
{
    T first = lhs();
    return first != null ? first : rhs();
}

One of the nice things you can do with the null-coalescing operator is use it for multiple expressions, e.g. a ?? b ?? c ?? d, which will evaluate a, then (if it’s null) evaluate b, then (if that’s null) evaluate c etc. With the symmetry present we can now make this into a parameter array:

public static T CoalesceNulls<T>(params Func<T>[] functions)
{
    T current = default(T);
    foreach (Func<T> func in functions)
    {
        current = func();
        if (current != null)
        {
            break;
        }
    }
    return current;
}

(In some ways it’s still more elegant to specify a bare T as the first parameter, as that ensures you’ve got at least one value to check. The change required is pretty obvious.)

Now there are only two problems: invoking the method, and the performance characteristics. I’m going to ignore the latter – yes, you could end up with a significant performance hit creating all these closures all the time if you use this in performance critical code. There’s not a lot of options available there, really. But what about the syntax for invoking the method?

In its current form, we can write the current a ?? b ?? c ?? d as CoalesceNulls(() => a, () => b, () => c, () => d). That’s a wee bit ugly. Malcolm blogged about an alternative where parameters could be marked (on the declaration side) as calling for deferred execution, and automatically converted into closures by the compiler.

Just like Malcolm, I’m not terribly keen on this – for instance, I really like the fact that in C# it’s always very clear when a parameter is being passed by reference, because there’s the ref keyword on the caller side as well as the declaration side. Going against this would just feel wrong.

If we have to change the language and introduce new keywords or symbols, we’re then back where we started – the whole idea was to avoid having to introduce ?? into the language. The plus side is that it could be “usage agnostic” – just because it could be used for a null-coalescing operator replacement doesn’t mean it would have to be. However, this feels like using a sledgehammer to crack a nut – the number of times this would be useful is pretty minimal. The same argument could be applied to the null-coalescing operator itself, of course, but I think its utility – and the way it fits into the language pretty seamlessly – justifies the decisions made here.

Still, it was an interesting thought experiment…

Update: Since originally writing this (a few days ago, before the blog outage), Malcolm has been informed that Scala has precisely this capability. Just another reason for getting round to learning Scala at some point…