Category Archives: Stack Overflow

Farewell, Daisy Shipton

This is more of a quick, explanatory “heads-up” post than anything else.

On March 31st 2018, I started an experiment: I created a new Stack Overflow user called “Daisy Shipton” with no picture and a profile that just read “Love coding in C#” (or similar). I wanted to see how a new user presenting with a traditionally-female name would be treated, while posting the same content that I normally would. This experiment was only a small part of my thinking around the culture of Stack Overflow, and I expect to write more on that subject, touching on the experience of “Daisy”, at another time.

I let a few people in on the secret as I went along – people who I fully expected to recognize my writing style fairly quickly. A single person emailed me to ask whether Daisy and I were the same person – well done to them for spotting it. (Once someone had the idea, the evidence was pretty compelling – the “Jon Skeet” account went into a decline in posting answers at the same time that the “Daisy Shipton” account was created, and Daisy just happened to post about C#, Noda Time, Protocol Buffers, time zones and Google Cloud Platform client libraries for .NET. I really wasn’t trying to cover my tracks.)

As Daisy reached a rep of about 12,000 points, there became little point in continuing the experiment, so I asked for “her” account to be merged into my regular one. So if you see comments on my posts referring to @DaisyShipton, that’s why.

There’s one aspect of experimentation that never happened: Daisy never asked a question. Next time I want to ask a question on Stack Overflow, I’ll probably create another account to see how a question I think is good is received when posted from a 1-rep account.

It’s been fun, but it’ll also be nice to only have one account to manage now…

Stack Overflow Culture

This blog post was most directly provoked by this tweet from my friend Rob Conery, explaining why he’s giving up contributing on Stack Overflow.

However, it’s been a long time coming. A while ago I started writing a similar post, but it got longer and longer without coming to any conclusion. I’m writing this one with a timebox of one hour, and then I’ll post whatever I’ve got. (I may then reformat it later.)

I’m aware of the mixed feelings many people have about Stack Overflow. Some consider it to be completely worthless, but I think more people view it as “a valuable resource, but a scary place to contribute due to potential hostility.” Others contribute on a regular basis, occasionally experiencing or witnessing hostility, but generally having a reasonable time.

This post talks about my experiences and my thoughts on where Stack Overflow has a problem, where I disagree with some of the perceived problems, and what can be done to improve the situation. This is a topic I wish I’d had time to talk about in more detail with the Stack Overflow team when I visited them in New York in February, but we were too busy discussing other important issues.

For a lot of this post I’ll talk about “askers” and “answerers”. This is a deliberate simplification for the sake of, well, simplicity. Many users are both askers and answerers, and a lot of the time I’ll write comments with a view to being an answerer, but without necessarily ending up writing an answer. Although any given user may take on different roles even in the course of an hour, for a single post each person usually has a single role. There are other roles of course – “commenter on someone else’s answer” for example – I’m not trying to be exhaustive here.

Differences in goals and expectations

Like most things in life, Stack Overflow works best when everyone has the same goal. We can all take steps towards that goal together. Conversely, when people in a single situation have different goals, that’s when trouble often starts.

On Stack Overflow, the most common disconnect is between these two goals:

  • Asker: minimize the time before I’m unblocked on the problem I’m facing
  • Answerer: maximize the value to the site of any given post, treating the site as a long-lasting resource

In my case, I often have a sub-goal of “try to help improve the diagnostic skill of software engineers so that they’re in a better position to solve their own problems.”

As an example, consider this question – invented, but not far-fetched:

Random keeps giving me the same numbers. Is it broken?

This is a low-quality question, in my view. (I’ll talk more about that later.) I know what the problem is likely to be, but to work towards my goal I want the asker to improve the question – I want to see their code, the results etc. If I’m right about the problem (creating multiple instances of System.Random in quick succession, which will all use the same system-time-based seed), I’d then almost certainly be able to close the question as a duplicate, and it could potentially be deleted. In its current form, it provides no benefit to the site. I don’t want to close the question as a duplicate without seeing that it really is a duplicate though.

Now from the asker’s perspective, none of that is important. If they know that I have an idea what the problem might be, their perspective is probably that I should just tell them so they can be unblocked. Why take another 10 minutes to reproduce the problem in a good question, if I can just give them the answer now? Worse, if they do take the time to do that and then I promptly close their question as a duplicate, it feels like wasted time.

Now if I ignored emotions, I’d argue that the time wasn’t wasted:

  • The asker learned that when they ask a clearer question, they get to their answer more quickly. (Assuming they follow the link to the duplicate and apply it.)
  • The asker learned that it’s worth searching for duplicate questions in their research phase, as that may mean they don’t need to ask the question at all.

But ignoring emotions is a really bad idea, because we’re all human. What may well happen in that situation – even if I’ve been polite throughout – is that the asker will decide that Stack Overflow is full of “traffic cop” moderators who only care about wielding power. I could certainly argue that that’s unfair – perhaps highlighting my actual goals – but that may not change anyone’s mind.

So that’s one problem. How does the Stack Overflow community agree what the goal of the site is, and then make that clearer to users when they ask a question? It’s worth noting that the tour page (which curiously doesn’t seem to be linked from the front page of the site any more) does include this text:

With your help, we’re working together to build a library of detailed answers to every question about programming.

I tend to put it slightly differently:

The goal of Stack Overflow is to create a repository of high-quality questions, and high-quality answers to those questions.

Is that actually a shared vision? If askers were aware of it, would that help? I’d like to hope so, although I doubt that it would completely stop all problems. (I don’t think anything would. The world isn’t a perfect place.)

Let’s move on to another topic where I disagree with some people: low-quality questions.

Yes, there are low-quality questions

I assert that even if it can’t be measured in a totally objective manner, there are high-quality questions and low-quality questions (and lots in between).

I view a high-quality question in the context of Stack Overflow as one which:

  • Asks a question, and is clear in what it’s asking for. It should be reasonably obvious whether any given attempted answer does answer the question. (That’s separate to whether the answer is correct.)
  • Avoids irrelevancies. This can be hard, but I view it as part of due diligence: if you’re encountering a problem as part of writing a web-app, you should at least try to determine whether the context of a web-app is relevant to the problem.
  • Is potentially useful to other people. This is where avoiding irrelevant aspects is important. Lots of people need to parse strings as dates; relatively few will need to parse strings as dates using framework X version Y in conjunction with a client written in COBOL, over a custom and proprietary network protocol.
  • Explains what the asker has already tried or researched, and where they’ve become stuck.
  • Where appropriate (which is often the case) contains a minimal example demonstrating the problem.
  • Is formatted appropriately. No whole-page paragraphs, no code that’s not formatted as code, etc.

There are lots of questions which meet all those requirements, or at least most of them.

I think it’s reasonable to assert that such a question is of higher quality than a question which literally consists of a link to a photo of a homework assignment, and that’s it. Yes, I’ve seen questions like that. They’re not often quite that bad, but if we really can’t agree that that is a low-quality question, I don’t know what we can agree on.

Of course, there’s a huge spectrum in between – but I think it’s important to accept that there are such things as low-quality questions, or at least to debate it and find out where we disagree.

Experience helps write good questions, but isn’t absolutely required

I’ve seen a lot of Meta posts complaining that Stack Overflow is too hard on newcomers, who can’t be expected to write a good question.

I would suggest that a newcomer who accepts the premise of the site and is willing to put in effort is likely to be able to come up with at least a reasonable question. It may take them longer to perform the research and write the question, and the question may well not be as crisp as one written by a more experienced developer in the same situation, but I believe that on the whole, newcomers are capable of writing questions of sufficient quality for Stack Overflow. They may not be aware of what they need to do or why, but that’s a problem with a different solution than just “we should answer awful questions which show no effort because the asker may be new to tech”.

One slightly separate issue is whether people have the diagnostic skills required to write genuinely good questions. This is a topic dear to my heart, and I really wish I had a good solution, but I don’t. I firmly believe that if we can help programmers become better at diagnostics, then that will be of huge benefit to them well beyond asking better Stack Overflow questions.

Some regular users behave like jerks on Stack Overflow, but most don’t

I’m certainly not going to claim that the Stack Overflow community is perfect. I have seen people being rude to people asking bad questions – and I’m not going to excuse that. If you catch me being rude, call me out on it. I don’t believe that requesting improvements to a question is rude in and of itself though. It can be done nicely, or it can be done meanly. I’m all for raising the level of civility on Stack Overflow, but I don’t think that has to be done at the expense of site quality.

I’d also say that I’ve experienced plenty of askers who react very rudely to being asked for more information. It’s far from one-way traffic. I think I’ve probably seen more rudeness in this direction than from answerers, in fact – although the questions usually end up being closed and deleted, so anyone just browsing the site casually is unlikely to see that.

My timebox is rapidly diminishing, so let me get to the most important point. We need to be nicer to each other.

Jon’s Stack Overflow Covenant

I’ve deliberately called this my covenant, because it’s not my place to try to impose it on anyone else. If you think it’s something you could get behind (maybe with modifications), that’s great. If Stack Overflow decides to adopt it somewhere in the site guidelines, they’re very welcome to take it and change it however they see fit.

Essentially, I see many questions as a sort of transaction between askers and answerers. As such, it makes sense to have a kind of contract – but that sounds more like business, so I’d prefer to think of a covenant of good faith.

As an answerer, I will…

  • Not be a jerk.
  • Remember that the person I’m responding to is a human being, with feelings.
  • Assume that the person I’m responding to is acting in good faith and wants to be helped.
  • Be clear that a comment on the quality of a question is not a value judgement on the person asking it.
  • Remember that sometimes, the person I’m responding to may feel they’re being judged, even if I don’t think I’m doing that.
  • Be clear in my comments about how a question can be improved, giving concrete suggestions for positive changes rather than emphasizing the negative aspects of the current state.
  • Be clear in my answers, remembering that not everyone has the same technical context that I do (so some terms may need links etc).
  • Take the time to present my answer well, formatting it as readably as I can.

As an asker, I will…

  • Not be a jerk.
  • Remember that anyone who responds to me is a human being, with feelings.
  • Assume that any person who responds to me is acting in good faith and trying to help me.
  • Remember that I’m asking people to give up their time, for free, to help me with a problem.
  • Respect the time of others by researching my question before asking, narrowing it down as far as I can, and then presenting as much information as I think may be relevant.
  • Take the time to present my question well, formatting it as readably as I can.

I hope that most of the time, I’ve already been following that. Sometimes I suspect I’ve fallen down. Hopefully by writing it out explicitly, and then reading it, I’ll become a better community member.

I think if everyone fully took something like this on board before posting anything on Stack Overflow, we’d be in a better place.

Surprise! Creating an instance of an open generic type

This is a brief post documenting a very weird thing I partly came up with on Stack Overflow today.

The context is this question. But to skip to the shock, we end up with code like this:

object x = GetWeirdValue();
// This line prints True. Be afraid - be very afraid!
Console.WriteLine(x.GetType().GetTypeInfo().IsGenericTypeDefinition);

That just shouldn’t happen. You shouldn’t be able to create an instance of an open type – a type that still contains generic type parameters. What does a List<T> (rather than a List<string> or List<int>) mean? It’s like creating an instance of an abstract class.

Before today, I’d have expected it to be impossible – the CLR should just not allow such an object to exist. I now know one – and only one – way to do it. While you can’t get normal field values for an open generic type, you can get constants… after all, they’re constant values, right? That’s fine for most constants, because those can’t be generic types – int, string etc. The only type of constant with a user-defined type is an enum. Enums themselves aren’t generic, of course… but what if it’s nested inside another generic type, like this:

class Generic<T>
{
    public enum GenericEnum
    {
        Foo = 0
    }
}

Now Generic<>.GenericEnum is an open type, because it’s nested in an open type. Using Enum.GetValues(typeof(Generic<>.GenericEnum)) fails in the expected way: the CLR complains that it can’t create instances of the open type. But if you use reflection to get at the constant field representing Foo, the CLR magically converts the underlying integer (which is what’s in the IL of course) into an instance of the open type.

Here’s the complete code:

using System;
using System.Reflection;

class Program
{
    static void Main(string[] args)
    {
        object x = GetWeirdValue();
        // This line prints True
        Console.WriteLine(x.GetType().GetTypeInfo().IsGenericTypeDefinition);
    }

    static object GetWeirdValue() =>
        typeof(Generic<>.GenericEnum).GetTypeInfo()
            .GetDeclaredField("Foo")
            .GetValue(null);

    class Generic<T>
    {
        public enum GenericEnum
        {
            Foo = 0
        }
    }
}

… and the corresponding project file, to prove it works for both the desktop and .NET Core…

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>netcoreapp1.0;net45</TargetFrameworks>
  </PropertyGroup>

</Project>

Use this at your peril. I expect that many bits of code dealing with reflection would be surprised if they were provided with a value like this…

It turns out I’m not the first one to spot this. (That would be pretty unlikely, admittedly.) Kirill Osenkov blogged two other ways of doing this, discovered by Vladimir Reshetnikov, back in 2014.

All about java.util.Date

This post is an attempt to reduce the number of times I need to explain things in Stack Overflow comments. You may well be reading it via a link from Stack Overflow – I intend to refer to this post frequently in comments. Note that this post is mostly not about text handling – see my post on common mistakes in date/time formatting and parsing for more details on that.

There are few classes which cause so many similar questions on Stack Overflow as java.util.Date. There are four causes for this:

  • Date and time work is fundamentally quite complicated and full of corner cases. It’s manageable, but you do need to put some time into understanding it.
  • The java.util.Date class is awful in many ways (details given below).
  • It’s poorly understood by developers in general.
  • It’s been badly abused by library authors, adding further to the confusion.

TL;DR: java.util.Date in a nutshell

The most important things to know about java.util.Date are:

  • You should avoid it if you possibly can. Use java.time.* if possible, or the ThreeTen-Backport (java.time for older versions, basically) or Joda Time if you’re not on Java 8 yet.
    • If you’re forced to use it, avoid the deprecated members. Most of them have been deprecated for nearly 20 years, and for good reason.
    • If you really, really feel you have to use the deprecated members, make sure you really understand them.
  • A Date instance represents an instant in time, not a date. Importantly, that means:
    • It doesn’t have a time zone.
    • It doesn’t have a format.
    • It doesn’t have a calendar system.

Now, onto the details…

What’s wrong with java.util.Date?

java.util.Date (just Date from now on) is a terrible type, which explains why so much of it was deprecated in Java 1.1 (but is still being used, unfortunately).

Design flaws include:

  • Its name is misleading: it doesn’t represent a Date, it represents an instant in time. So it should be called Instant – as its java.time equivalent is.
  • It’s non-final: that encourages poor uses of inheritance such as java.sql.Date (which is meant to represent a date, and is also confusing due to having the same short-name).
  • It’s mutable: date/time types are natural values which are usefully modeled by immutable types. The fact that Date is mutable (e.g. via the setTime method) means diligent developers end up creating defensive copies all over the place.
  • It implicitly uses the system-local time zone in many places – including toString() – which confuses many developers. More on this in the section on what an “instant in time” is.
  • Its month numbering is 0-based, copied from C. This has led to many, many off-by-one errors.
  • Its year numbering is 1900-based, also copied from C. Surely by the time Java came out we had an idea that this was bad for readability?
  • Its methods are unclearly named: getDate() returns the day-of-month, and getDay() returns the day-of-week. How hard would it have been to give those more descriptive names?
  • It’s ambiguous about whether or not it supports leap seconds: “A second is represented by an integer from 0 to 61; the values 60 and 61 occur only for leap seconds and even then only in Java implementations that actually track leap seconds correctly.” I strongly suspect that most developers (including myself) have written plenty of code assuming that getSeconds() always returns a value in the range 0-59 inclusive.
  • It’s lenient for no obvious reason: “In all cases, arguments given to methods for these purposes need not fall within the indicated ranges; for example, a date may be specified as January 32 and is interpreted as meaning February 1.” How often is that useful?
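Several of these flaws can be seen in a few lines. The following is only a sketch: the class and method names (DateQuirks, july20th1969, januaryThirtySecond, describe) are invented for illustration, and it deliberately calls the deprecated members in order to show what they return.

```java
import java.util.Date;

class DateQuirks {

    // Deprecated constructor: the year is 1900-based and the month is 0-based,
    // so (69, 6, 20) means July 20th 1969 in the default time zone.
    @SuppressWarnings("deprecation")
    static Date july20th1969() {
        return new Date(69, 6, 20);
    }

    // Lenient arguments: "January 32nd 2015" is silently accepted.
    @SuppressWarnings("deprecation")
    static Date januaryThirtySecond() {
        return new Date(115, 0, 32);
    }

    // Reports the raw values returned by the deprecated getters.
    @SuppressWarnings("deprecation")
    static String describe(Date date) {
        return "year=" + date.getYear()
            + " month=" + date.getMonth()
            + " date=" + date.getDate()    // day-of-month
            + " day=" + date.getDay();     // day-of-week, 0 = Sunday
    }

    public static void main(String[] args) {
        System.out.println(describe(july20th1969()));        // year=69 month=6 date=20 day=0
        System.out.println(describe(januaryThirtySecond())); // year=115 month=1 date=1 day=0
    }
}
```

Note that "month=1" in the second line is February: the lenient constructor has rolled January 32nd over into the next month, and the 0-based month makes that harder to spot.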

I could find more problems, but they would be getting pickier. That’s a plentiful list to be going on with. On the plus side:

  • It unambiguously represents a single value: an instant in time, with no associated calendar system, time zone or text format, to a precision of milliseconds.

Unfortunately even this one “good aspect” is poorly understood by developers. Let’s unpack it…

What’s an “instant in time”?

Note: I’m ignoring relativity and leap seconds for the whole of the rest of this post. They’re very important to some people, but for most readers they would just introduce more confusion.

When I talk about an “instant” I’m talking about the sort of concept that could be used to identify when something happened. (It could be in the future, but it’s easiest to think about in terms of a past occurrence.) It’s independent of time zone and calendar system, so multiple people using their “local” time representations could talk about it in different ways.

Let’s use a very concrete example of something that happened somewhere that doesn’t use any time zones we’re familiar with: Neil Armstrong walking on the moon. The moon walk started at a particular instant in time – if multiple people from around the world were watching at the same time, they’d all (pretty much) say “I can see it happening now” simultaneously.

If you were watching from mission control in Houston, you might have thought of that instant as “July 20th 1969, 9:56:20 pm CDT”. If you were watching from London, you might have thought of that instant as “July 21st 1969, 3:56:20 am BST”. If you were watching from Riyadh, you might have thought of that instant as “Jumādá 7th 1389, 5:56:20 am (+03)” (using the Umm al-Qura calendar). Even though different observers would see different times on their clocks – and even different years – they would still be considering the same instant. They’d just be applying different time zones and calendar systems to convert from the instant into a more human-centric concept.

So how do computers represent instants? They typically store an amount of time before or after a particular instant which is effectively an origin. Many systems use the Unix epoch, which is the instant represented in the Gregorian calendar in UTC as midnight at the start of January 1st 1970. That doesn’t mean the epoch is inherently “in” UTC – the Unix epoch could equally well be defined as “the instant at which it was 7pm on December 31st 1969 in New York”.

The Date class uses “milliseconds since the Unix epoch” – that’s the value returned by getTime(), and set by either the Date(long) constructor or the setTime() method. As the moon walk occurred before the Unix epoch, the value is negative: it’s actually -14159020000.
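As a quick check of that number, here is a small sketch that derives it from the UTC form of the instant (9:56:20 pm in Houston at UTC-5 is 02:56:20 UTC on July 21st). The class name MoonWalkInstant is made up for illustration; Instant.parse, toEpochMilli and Date.toInstant are standard java.time and java.util calls.

```java
import java.time.Instant;
import java.util.Date;

class MoonWalkInstant {

    // The moon walk instant, written in UTC.
    static final Instant MOON_WALK = Instant.parse("1969-07-21T02:56:20Z");

    public static void main(String[] args) {
        long millis = MOON_WALK.toEpochMilli();
        System.out.println(millis); // -14159020000

        // A Date is just a wrapper around that millisecond value.
        Date date = new Date(millis);
        System.out.println(date.getTime() == millis);           // true
        System.out.println(date.toInstant().equals(MOON_WALK)); // true
    }
}
```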

To demonstrate how Date interacts with the system time zone, let’s show the three time zones mentioned before – Houston (America/Chicago), London (Europe/London) and Riyadh (Asia/Riyadh). It doesn’t matter what the system time zone is when we construct the date from its epoch-millis value – that doesn’t depend on the local time zone at all. But if we use Date.toString(), that converts to the current default time zone to display the result. Changing the default time zone does not change the Date value at all. The internal state of the object is exactly the same. It still represents the same instant, but methods like toString(), getMonth() and getDate() will be affected. Here’s sample code to show that:

import java.util.Date;
import java.util.TimeZone;

public class Test {

    public static void main(String[] args) {
        // The default time zone makes no difference when constructing
        // a Date from a milliseconds-since-Unix-epoch value
        Date date = new Date(-14159020000L);

        // Display the instant in three different time zones
        TimeZone.setDefault(TimeZone.getTimeZone("America/Chicago"));
        System.out.println(date);

        TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"));
        System.out.println(date);

        TimeZone.setDefault(TimeZone.getTimeZone("Asia/Riyadh"));
        System.out.println(date);

        // Prove that the instant hasn't changed...
        System.out.println(date.getTime());
    }
}

The output is as follows:

Sun Jul 20 21:56:20 CDT 1969
Mon Jul 21 03:56:20 GMT 1969
Mon Jul 21 05:56:20 AST 1969
-14159020000

The “GMT” and “AST” abbreviations in the output here are highly unfortunate – java.util.TimeZone doesn’t have the right names for pre-1970 values in all cases. The times are right though.

Common questions

How do I convert a Date to a different time zone?

You don’t – because a Date doesn’t have a time zone. It’s an instant in time. Don’t be fooled by the output of toString(). That’s showing you the instant in the default time zone. It’s not part of the value.

If your code takes a Date as an input, any conversion from a “local time and time zone” to an instant has already occurred. (Hopefully it was done correctly…)

If you start writing a method with a signature like this, you’re not helping yourself:

// A method like this is always wrong
Date convertTimeZone(Date input, TimeZone fromZone, TimeZone toZone)
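What is usually wanted instead is a view of the same instant in a particular zone. A minimal sketch, assuming Java 8 or later so that java.time is available; the class InstantInZones and the helper inZone are hypothetical names, not part of any library.

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.Date;

class InstantInZones {

    // Hypothetical helper: views the instant held by a Date in the given zone.
    // The Date itself is neither modified nor "converted".
    static ZonedDateTime inZone(Date date, String zoneId) {
        return date.toInstant().atZone(ZoneId.of(zoneId));
    }

    public static void main(String[] args) {
        Date date = new Date(-14159020000L);
        System.out.println(inZone(date, "America/Chicago").toLocalDateTime()); // 1969-07-20T21:56:20
        System.out.println(inZone(date, "Europe/London").toLocalDateTime());   // 1969-07-21T03:56:20
        System.out.println(inZone(date, "Asia/Riyadh").toLocalDateTime());     // 1969-07-21T05:56:20
        System.out.println(date.getTime());                                    // -14159020000
    }
}
```

The result is a different type (ZonedDateTime) which does carry a time zone, rather than another Date pretending to.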

How do I convert a Date to a different format?

You don’t – because a Date doesn’t have a format. Don’t be fooled by the output of toString(). That always uses the same format, as described by the documentation.

To format a Date in a particular way, use a suitable DateFormat (potentially a SimpleDateFormat) – remembering to set the time zone to the appropriate zone for your use.
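For example, something along these lines. It is a sketch rather than a recommendation of any particular pattern; FormatDate and its format helper are invented names, and the explicit Locale.US is an assumption made to keep the output independent of the machine it runs on.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class FormatDate {

    // Hypothetical helper: formats the instant as seen in the requested zone.
    static String format(Date date, String pattern, String zoneId) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
        // Without this call, the system default time zone would be used.
        format.setTimeZone(TimeZone.getTimeZone(zoneId));
        return format.format(date);
    }

    public static void main(String[] args) {
        Date date = new Date(-14159020000L);
        System.out.println(format(date, "yyyy-MM-dd'T'HH:mm:ss", "UTC"));        // 1969-07-21T02:56:20
        System.out.println(format(date, "yyyy-MM-dd HH:mm", "America/Chicago")); // 1969-07-20 21:56
    }
}
```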

Common mistakes in date/time formatting and parsing

There are many, many questions on Stack Overflow about both parsing and formatting date/time values. (I use the term “date/time” to mean pretty much “any type of chronological information” – dates, times of day, instants in time etc.) Given how often the same kinds of mistakes are made, I thought it would be handy to have a blog post to refer to.

This post assumes you already know the basic operations of formatting and parsing, in terms of the appropriate types to use.

Pattern woes

There are three broad classes of issue here – one which is usually “just” a matter of carelessness, one which still surprises me in terms of sheer wrongness, and one which comes from assuming that patterns are portable between platforms.

Pattern capitalization issues

This is an insidious problem, because in some cases you may get the right values, but not all of the time. I suspect it usually comes up again due to copy and paste, but often from specifications rather than other code – in a specification, it’s pretty clear what "YYYY-MM-DD HH:MM:SS" means as a date/time format, but that doesn’t mean it’s the right pattern to put in code.

The main thing to do is read the documentation carefully. Of course, some platforms have clearer documentation than others, but most are at least “good enough”. For the Java APIs, the pattern specifiers are generally documented with the formatting classes themselves; for .NET’s built-in classes you want the custom date and time format strings and standard date and time format strings MSDN pages, and for Noda Time follow the various options from the text handling part of the user guide. (For other platforms, use your common sense. :)

The most common mistakes here are:

  • Using mm for months or MM for minutes, rather than vice versa. I’ve seen this mistake both ways round.
  • Using hh for “hour of day” when HH is intended. H is in the range 0-23; h is in the range 1-12. h is usually used singly (rather than requiring exactly two digits), and almost always in conjunction with an AM/PM specifier – as otherwise it’s ambiguous. H is usually used as HH, so that 5am is represented as “05” for example.
  • Using YYYY for year – in Java and Noda Time, Y is used for week-year rather than normal calendar year; it’s usually used in conjunction with “week of year” and “day of week”, but it’s much less common than yyyy.
  • Using DD for “day of month” when in Java it actually means “day of year”.
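The effect of each of these mistakes in Java can be seen by formatting one value with several patterns. This is a sketch using SimpleDateFormat; the class name PatternLetters, the sample instant and the choice of Locale.US and UTC are all assumptions made so that the results are repeatable.

```java
import java.text.SimpleDateFormat;
import java.time.Instant;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class PatternLetters {

    // Formats 3:05pm UTC on Monday December 29th 2014 with the given pattern.
    static String format(String pattern) {
        Date date = Date.from(Instant.parse("2014-12-29T15:05:00Z"));
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(date);
    }

    public static void main(String[] args) {
        System.out.println(format("yyyy-MM-dd HH:mm")); // 2014-12-29 15:05 (as intended)
        System.out.println(format("yyyy-mm-dd"));       // 2014-05-29 (minutes used as the month)
        System.out.println(format("yyyy-MM-dd hh:mm")); // 2014-12-29 03:05 (12-hour clock, no AM/PM)
        System.out.println(format("YYYY-MM-dd"));       // 2015-12-29 (week-year, not calendar year)
        System.out.println(format("yyyy-MM-DD"));       // 2014-12-363 (day of year)
    }
}
```

The week-year case is the nasty one: the same pattern gives the expected year for almost every date, and only goes wrong around the turn of the year.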

Broad pattern incompatibilities

I’m surprised by how often I see code like this:

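var text = "Tue, 5 May 2015 3:15pm";
// Throws FormatException: the pattern bears no resemblance to the text
var dateTime = DateTime.ParseExact(
    text,
    "yyyy-MM-dd'T'HH:mm:ss",
    CultureInfo.InvariantCulture);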

Here the pattern and the actual data are entirely different, and I get the impression that the author has copied the pattern from another piece of code without any thought about what the magic string "yyyy-MM-dd'T'HH:mm:ss" is there for.

I suspect it goes without saying for most readers, but you should never copy code from elsewhere into your own code without understanding how it works, or which parts you may potentially need to modify.

The result of this sort of error is usually a complete failure to parse, which is at least simpler to find than the “plausible but not quite correct” pattern issue.
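The same failure is easy to reproduce on other platforms. Here is a Java sketch using java.time, with the same text and the same mismatched pattern, followed by a pattern that does describe the text. The class PatternMismatch and its helpers are hypothetical, and the matching pattern is just one reasonable choice for this particular value.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.Locale;

class PatternMismatch {

    static final String TEXT = "Tue, 5 May 2015 3:15pm";

    // Parses TEXT with the given pattern, using English day and month names.
    static LocalDateTime parse(String pattern) {
        DateTimeFormatter formatter = new DateTimeFormatterBuilder()
            .parseCaseInsensitive() // the text uses "pm" rather than "PM"
            .appendPattern(pattern)
            .toFormatter(Locale.US);
        return LocalDateTime.parse(TEXT, formatter);
    }

    // Returns true if TEXT can be parsed with the given pattern.
    static boolean canParse(String pattern) {
        try {
            parse(pattern);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canParse("yyyy-MM-dd'T'HH:mm:ss")); // false
        System.out.println(parse("EEE, d MMM yyyy h:mma"));    // 2015-05-05T15:15
    }
}
```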

Pattern incompatibility issues

Some developers assume that a pattern which works in Java will work in Python, or the equivalent for any other pair of platforms. Don’t make this assumption. Always read the documentation – and if you’re porting code from one platform to another, you’ll need to “decode” the pattern with one set of documentation, then “encode” it with the other.

Time zone issues

Understanding time zones

There are two common issues when understanding what a time zone is to start with.

The first is to assume that a UTC offset (e.g. “+8 hours”) is the same as a time zone. This is an understandable mistake, given that a lot of documentation (from organizations which really should know better) misuses the terminology. The UTC offset is the difference between UTC and local time at a particular instant – so for example, while I’m writing this, I’m in the UK time zone which is currently at UTC+1. However, in the winter (in the same time zone) it will be at UTC+0. So if you have a value of (say) “2015-05-10T16:43:00+0100” that only tells you the UTC offset – it doesn’t tell you the time zone. There may well be multiple time zones with the same offset at that particular time, but which will have different offsets at different times.
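To make that concrete, here is a sketch comparing two zones which share an offset at the instant in the example value but not six months later. Africa/Lagos is simply an illustrative choice of a zone that stays at UTC+1 all year, and OffsetVersusZone and offsetAt are made-up names.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

class OffsetVersusZone {

    // Looks up the UTC offset in effect in a zone at a given instant.
    static ZoneOffset offsetAt(String zoneId, String instant) {
        return ZoneId.of(zoneId).getRules().getOffset(Instant.parse(instant));
    }

    public static void main(String[] args) {
        String summer = "2015-05-10T15:43:00Z"; // 16:43 local time at +01:00
        String winter = "2015-12-10T15:43:00Z";

        System.out.println(offsetAt("Europe/London", summer)); // +01:00
        System.out.println(offsetAt("Africa/Lagos", summer));  // +01:00
        System.out.println(offsetAt("Europe/London", winter)); // Z (UTC+0)
        System.out.println(offsetAt("Africa/Lagos", winter));  // +01:00
    }
}
```

Given only “+0100”, there is no way of telling which of those two zones (or any other) the value came from.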

The second mistake is to think that an abbreviation such as “EST” or “GMT” identifies a time zone. It doesn’t, in two ways:

  • A single time zone often uses multiple abbreviations over time. For example, “Pacific Time” varies between PST (Pacific Standard Time) and PDT (Pacific Daylight Time). It’s unfortunate that some people use the abbreviation for standard time even when they mean the general time zone – so even though currently (at the time of writing) Pacific Time is in PDT (UTC-7), some people would write the local time with “PST” at the end. Grr. Avoid abbreviations if you possibly can.
  • The same abbreviation may be used in multiple time zones, or even at different points in time to mean different things within the same time zone. For example, “BST” can mean British Summer Time in Europe/London (standard time of UTC+0, plus 1 hour of daylight saving time), British Standard Time in Europe/London (standard time of UTC+1, with no daylight saving time, around 1970 only) and Bougainville Standard Time in Pacific/Bougainville (UTC+11). Avoid abbreviations if you possibly can.

Using time zones in text formatting/parsing

First, you need to understand exactly what the library you’re using does with time zones, and what the types you’re using represent. One of the most common misconceptions here is with java.util.Date – this is just an instant in time, with no concept of a time zone or calendar system. The fact that the string returned from Date.toString always uses the system default time zone is unfortunately misleading in this respect, and causes developers to ask how to “convert” a Date from one time zone to another.

Next, you need to understand exactly what your data represents. In my experience, most textual data either specifies a date and/or time without a given time zone or it specifies a date and time with a UTC offset. When no time zone information is present, you may know the time zone it’s meant to refer to, or you may not. If you’re using a library which has multiple different types to represent different kinds of information (e.g. Joda Time, java.time or Noda Time) I personally find it clearest to parse to a type that most closely represents the information actually stated in the text, and then convert it to something else where appropriate.
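Here is one way that might look with java.time; treat it as a sketch rather than the only approach. Text with no zone information is parsed to a LocalDateTime and only then combined with a zone I happen to know from context (Europe/London is an assumption for the example), while text carrying an offset is parsed to an OffsetDateTime. Note that the default ISO parser expects the offset written as “+01:00”, with a colon.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneId;

class ParseToClosestType {
    // The text states a date and time only, so parse to a type with no zone,
    // then apply the zone that we know from context as a separate step.
    static Instant parseLocalThenApplyZone(String text, String knownZoneId) {
        LocalDateTime local = LocalDateTime.parse(text);
        return local.atZone(ZoneId.of(knownZoneId)).toInstant();
    }

    // The text states a date, time and UTC offset, so parse to a type holding those.
    static Instant parseWithOffset(String text) {
        return OffsetDateTime.parse(text).toInstant();
    }

    public static void main(String[] args) {
        System.out.println(parseLocalThenApplyZone("2015-05-10T16:43:00", "Europe/London"));
        System.out.println(parseWithOffset("2015-05-10T16:43:00+01:00"));
    }
}
```

Both calls end up at the same instant here, but the code makes it visible where the zone knowledge came from in each case.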

You definitely need to be aware when the parsing operation is going to impose any sort of time zone understanding on your data. This is the case with SimpleDateFormat in Java and with DateTime.ParseExact and friends in .NET. For SimpleDateFormat, unless you explicitly set a time zone (or the pattern includes a UTC offset), the system default time zone is used – this is usually not what you want. Parsing in .NET allows you to specify how you want the text to be understood, but you need to be careful. (The fact that DateTime sometimes represents a value in the system default time zone, sometimes a value in UTC, and sometimes a value with no associated time zone makes this all tricky.)
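For the SimpleDateFormat case, a hedged sketch of setting the time zone explicitly rather than inheriting the system default. The helper name and the choice to rethrow ParseException as an unchecked exception are mine, purely for brevity.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class ExplicitZoneParsing {
    // Parses text that contains no zone information, interpreting it in the
    // zone the caller names instead of whatever the system default happens to be.
    static Date parseInZone(String text, String zoneId) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone(zoneId));
        try {
            return format.parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date/time: " + text, e);
        }
    }

    public static void main(String[] args) {
        // The same text means different instants in different zones.
        System.out.println(parseInZone("2015-05-10 16:43:00", "UTC").getTime());
        System.out.println(parseInZone("2015-05-10 16:43:00", "America/New_York").getTime());
    }
}
```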

Locale / culture issues

Most libraries allow you to specify which culture to use when parsing (or formatting) data. This is a two-edged sword:

  • If you’re formatting a value to be displayed directly to an end user, that’s great: they can see the month name in their own language, etc. In this situation, you’ll typically use a “standard” format (e.g. “the short date/time format”)
  • If you’re formatting or parsing a value which is designed to be machine-readable (e.g. passed to a web service) then you almost certainly want the invariant culture instead of a user-specific culture. In this situation, you’ll typically use a “custom” format (e.g. “yyyy-MM-dd’T’HH:mm:ss”) or a specific culture-invariant format.

Culture can affect several aspects of handling conversions:

  • The calendar system used (e.g. the Gregorian calendar vs an Islamic calendar)
  • The “standard” formats used (e.g. month/day/year vs day/month/year)
  • The separators used (e.g. - vs / for date separators)
  • The month and day names used
  • The number system used
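To show both edges of the sword, here is a small Java sketch with hypothetical method names: one method formats for a person using the locale's own “standard” long date format, one formats for a machine using a fixed ISO-8601 format, and one shows how number separators vary. The exact user-facing text depends on the JDK's locale data.

```java
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

class CultureDemo {
    // For display to a person: a "standard" format in the user's locale.
    static String forUser(LocalDate date, Locale userLocale) {
        return date.format(
            DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG).withLocale(userLocale));
    }

    // For another program: a fixed ISO-8601 format that ignores the user's locale.
    static String forMachine(LocalDate date) {
        return date.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    // Grouping and decimal separators also depend on the locale.
    static String numberForUser(double value, Locale userLocale) {
        return NumberFormat.getNumberInstance(userLocale).format(value);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2015, 5, 10);
        System.out.println(forUser(date, Locale.US));
        System.out.println(forUser(date, Locale.FRENCH));
        System.out.println(forMachine(date));
        System.out.println(numberForUser(1234.5, Locale.US));
        System.out.println(numberForUser(1234.5, Locale.GERMANY));
    }
}
```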

Converting unnecessarily

As a final common problem, you may be performing more conversions than you should be. For example, if you’ve got a DateTime field in the database but you’re passing a value as a string in your SQL parameter (you are using parameterized SQL, right?) then you probably shouldn’t be. Most platforms allow parameters to be specified as the value in a “native” representation. Likewise when you fetch a value, don’t just call toString on it and then parse the result – if the value is a date/time value, it should already be in a native representation; a simple cast (or call to the type-specific method) should be enough.

Conclusion

Date/time text handling is fraught with problems, as a simple look at Stack Overflow shows. Be careful, make sure you know exactly what you’re converting from and to, and check exactly what you’re specifying vs what you’re leaving implicit.

The BobbyTables culture

I started writing a post like this a long time ago, but somehow never finished it.

Countless posts on Stack Overflow are vulnerable to SQL injection attacks. Along with several other users, I always raise this when it shows up – this is something that really just shouldn’t happen these days. It’s a well-understood issue, and parameterized SQL is a great solution in almost all cases. (No, it doesn’t work if you want to specify a column or table name dynamically. Yes, whitelisting is the solution there.)

The response usually falls into one of three camps:

  • Ah – I didn’t know about that. Great, I’ll fix it now. Thanks!
  • This is just a prototype. I’ll fix it for the real thing. (Ha! Like that ever happens.)
  • Well yes, in theory – but I’m just using numbers. That’s not a problem, is it?

Now personally I feel that you should just get into the habit of using parameterized queries all the time, even when you could get away without it. This post is a somewhat tongue-in-cheek counterargument to the last of these responses. If you haven’t seen Bobby Tables, you really should. It’s the best 10-second explanation of SQL injection that I’ve ever seen, and I almost always drop a link to it when I’m adding a comment on a vulnerable query on Stack Overflow.

So in honour of Bobby, here’s a little program. See if you can predict the output.

using System;
using System.Globalization;
using System.Threading;

class Test
{
    static void Main()
    {
        string sql = "SELECT * FROM Foo WHERE BarDate > '" + DateTime.Today + "'";
        // Imagine you're executing the query here...
        Console.WriteLine(sql);

        int bar = -10;
        sql = "SELECT * FROM Foo WHERE BarValue = " + bar;
        // Imagine you're executing the query here...
        Console.WriteLine(sql);
    }

    // Some other code here...
}

Does that look okay? Not great, admittedly – but not too bad, right? Well, the output of the program is:

SELECT * FROM Foo WHERE BarDate > '2014-08-08' OR ' '=' '
SELECT * FROM Foo WHERE BarValue = 1 OR 1=1 OR 1=10

Yikes! Our queries aren’t filtering out anything!

Of course, the black magic is in the “Some other code here” part:

static Test()
{
    InstallBobbyTablesCulture();
}

static void InstallBobbyTablesCulture()
{
    CultureInfo bobby = (CultureInfo) CultureInfo.InvariantCulture.Clone();
    bobby.DateTimeFormat.ShortDatePattern = @"yyyy-MM-dd'' OR ' '=''";
    bobby.DateTimeFormat.LongTimePattern = "";
    bobby.NumberFormat.NegativeSign = "1 OR 1=1 OR 1=";
    Thread.CurrentThread.CurrentCulture = bobby;
}

Neither numbers (well, negative numbers in this case) nor dates are safe. And of course if your database permissions aren’t set correctly, the queries could do a lot more than just remove any filtering. For extra fun, you can subvert some custom format strings – by changing the DateSeparator property, for example.

Even in sensible cultures, if the database expects you to use . for the decimal separator and you’re in a European culture that uses , instead, do you know how your database will behave? If you sanitize your input based on the numeric value, but then that isn’t the value that the database sees due to a string conversion, how comfortable are you that your application is still safe? It may not allow direct damage, but it could potentially reveal more data than you originally expected – which is definitely a form of vulnerability.
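The same decimal separator problem is easy to reproduce outside .NET. This deliberately bad Java sketch builds SQL text with culture-sensitive formatting; the Price column and the class name are invented for the example, and nothing here talks to a real database.

```java
import java.util.Locale;

class DecimalSeparatorDemo {
    // Anti-example: embeds a number into SQL text using locale-sensitive formatting.
    // With a parameterized query, no text conversion would happen at all.
    static String buildSql(double price, Locale locale) {
        return String.format(locale, "SELECT * FROM Foo WHERE Price > %.2f", price);
    }

    public static void main(String[] args) {
        System.out.println(buildSql(1.5, Locale.ROOT));    // ... WHERE Price > 1.50
        System.out.println(buildSql(1.5, Locale.GERMANY)); // ... WHERE Price > 1,50
    }
}
```

The second query is not the one you validated: the database sees “1,50”, and what it makes of that is anyone's guess.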

Now the chances of me getting onto your system and installing the Bobby Tables culture – let alone making it the system default – are pretty slim, and if that happens you’ve probably got bigger problems anyway… but it’s the principle of the thing. You don’t care about a text representation of your values: you just want to get them to the database intact.

Parameterized SQL: just say yes.

Anti-pattern: parallel collections

(Note that I’m not talking about "processing collections in parallel", which is definitely not an anti-pattern…)

I figured it was worth starting to blog about anti-patterns I see frequently on Stack Overflow. I realize that some or all of these patterns may be collected elsewhere, but it never hurts to express such things yourself… it’s a good way of internalizing information, aside from anything else. I don’t guarantee that my style of presenting these will stay consistent, but I’ll do what I can…

The anti-patterns themselves are likely to be somewhat language-agnostic, or at the very least common between Java and C#. I’m likely to post code samples in C#, but I don’t expect it to be much of a hindrance to anyone coming from a background in a similar language.

Context

You have related pieces of data about each of several items, and want to keep this data in memory. For example, you’re writing a game and have multiple players, each with a name, score and health.

Anti-pattern

Each kind of data is stored (all the names, all the scores, all the health values) in a separate collection. Typically I see this with arrays. Then each time you need to access related values, you need to make sure you’re using the same index for each collection.

So the code might look like this:

using System;

class Game
{
    string[] names;
    int[] scores;
    int[] health;

    public Game() 
    { 
        // Initialize all the values appropriately
    } 

    public void PrintScores() 
    { 
        for (int i = 0; i < names.Length; i++) 
        { 
            Console.WriteLine("{0}: {1}", names[i], scores[i]); 
        } 
    } 
}

Preferred approach

The code above fails to represent an entity which seems pretty obvious when you look at the description of the data: a player. Whenever you find yourself describing pieces of data which are closely related, you should make sure you have some kind of representation of that in your code. (In some cases an anonymous type is okay, but often you’ll want a separate named class.)

Once you’ve got that type, you can use a single collection, which makes the code much cleaner to work with.

using System;
using System.Collections.Generic;

class Player
{
    public string Name { get; private set; }
    public int Score { get; set; }
    public int Health { get; set; }

    public Player(string name, int health)
    {
        Name = name;
        Health = health;
    }

    // Consider other player-specific operations
}

class Game
{
    List<Player> players;

    public Game()
    {
        // Initialize players appropriately
    }

    public void PrintScores()
    {
        foreach (var player in players)
        {
            Console.WriteLine("{0}: {1}", player.Name, player.Score);
        }
    }
}

Note how we can now use a foreach loop to iterate over our players, because we don’t need to use the same index for two different collections.

Once you perform this sort of refactoring, you may well find that there are other operations within the Game class which would be better off in the Player class. For example, if you also had a Level property, and increasing that would automatically increase a player’s health and score, then it makes much more sense for that "level up" operation to be in Player than in Game. Without the Player concept, you’d have nowhere else to put the code, but once you’ve identified the relationship between the values, it becomes much simpler to work with.

It’s also much easier to modify a single collection than multiple ones. For example, if we wanted to add or remove a player, we now just need to make a change to a single collection, instead of making sure we perform the same operation to each "single value" collection in the original code. This may sound like a small deal, but it’s easy to make a mistake and miss out on one of the collections somewhere. Likewise if you need to add another related value – like the "level" value described above – it’s much easier to add that in one place than adding another collection and then making sure you do the right thing in every piece of code which changes any of the other collections.

Summary

Any time you find yourself with multiple collections sharing the same keys (whether those are simple list indexes or dictionary keys), think about whether you could have a single collection of a type which composes the values stored in each of the original collections. As well as making it easier to handle the collection data, you may find the new type allows you to encapsulate other operations more sensibly.

Update: what about performance?

As some readers have noted, this transformation can have an impact on performance. Originally, all the scores were kept close together in memory, as were all the health values, and so on. If you perform a bulk operation on the scores (finding the average score, for example) that locality of reference can have a significant impact. In some cases that may be enough justification to use the parallel collections instead… but this should be a conscious decision, having weighed up the pros and cons and measured the performance impact. Even then, I’d be tempted to encapsulate the parallel collections in a separate PlayerCollection type, allowing it to implement IEnumerable<Player> where useful. (If you wanted the Player to be mutable, you’d need it to be aware of PlayerCollection itself.)
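As a sketch of what that encapsulation might look like, here it is in Java (Iterable playing the role of IEnumerable). All the names are hypothetical, and the players handed out are immutable snapshots, which sidesteps the mutability question rather than answering it.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Immutable snapshot of one player's values at the time of iteration.
final class PlayerSnapshot {
    final String name;
    final int score;
    final int health;

    PlayerSnapshot(String name, int score, int health) {
        this.name = name;
        this.score = score;
        this.health = health;
    }
}

// Keeps the parallel arrays for locality, but hides them behind one type.
final class PlayerCollection implements Iterable<PlayerSnapshot> {
    private final String[] names;
    private final int[] scores;
    private final int[] health;

    PlayerCollection(String[] names, int[] scores, int[] health) {
        if (names.length != scores.length || names.length != health.length) {
            throw new IllegalArgumentException("All arrays must have the same length");
        }
        this.names = names.clone();
        this.scores = scores.clone();
        this.health = health.clone();
    }

    // Bulk operation that only touches the contiguous scores array.
    double averageScore() {
        long total = 0;
        for (int score : scores) {
            total += score;
        }
        return scores.length == 0 ? 0.0 : (double) total / scores.length;
    }

    @Override
    public Iterator<PlayerSnapshot> iterator() {
        return new Iterator<PlayerSnapshot>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < names.length;
            }

            @Override
            public PlayerSnapshot next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                PlayerSnapshot snapshot =
                    new PlayerSnapshot(names[index], scores[index], health[index]);
                index++;
                return snapshot;
            }
        };
    }
}
```

The index bookkeeping now lives in exactly one place, so the rest of the code can still work with players as a single concept.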

In almost all these anti-patterns, there will be cases where they’re the lesser of two evils – but novice developers need to be aware of them as anti-patterns to start with. As ever with performance trade-offs, I believe in first deciding on concrete performance goals, then implementing the code in the simplest possible way that meets the non-performance goals, measuring against the performance goals, and tweaking if necessary to achieve them, relying heavily on measurement.

Diagnosing issues with reversible data transformations

I see a lot of problems which look somewhat different at first glance, but all have the same cause:

  • Text is losing “special characters” when I transfer it from one computer to another
  • Decryption ends up with garbage
  • Compressed data can’t be decompressed
  • I can transfer text but not binary data

These are all cases of transforming and (usually) transferring data, and then performing the reverse transformation. Often there are multiple transformations involved, and they need to be carefully reversed in the appropriate order. For example:

  1. Convert text to binary using UTF-8
  2. Compress
  3. Encrypt
  4. Base64-encode
  5. Transfer (e.g. as text in XML)
  6. Base64-decode
  7. Decrypt
  8. Decompress
  9. Convert binary back to text using UTF-8
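A cut-down version of that chain, in a single Java program that goes in both directions, might look like the sketch below. It assumes GZIP for compression, leaves out the encryption and transfer steps to stay short, and uses only in-memory streams; the class and method names are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class RoundTrip {
    // Forward direction: text -> UTF-8 bytes -> GZIP -> Base64 text.
    static String encode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(utf8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(compressed.toByteArray());
    }

    // Reverse direction, undoing each step in the opposite order.
    static String decode(String base64) {
        byte[] compressed = Base64.getDecoder().decode(base64);
        ByteArrayOutputStream utf8 = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                utf8.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(utf8.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9 \u20ac5";
        String transferred = encode(original);
        System.out.println(transferred);
        System.out.println(original.equals(decode(transferred)));
    }
}
```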

The actual details of each question can be different, but the way I’d diagnose them is the same in each case. That’s what this post is about – partly so that I can just link to it when such questions arise. Although I’ve numbered the broad steps, it’s one of those constant iteration situations – you may well need to tweak the logging before you can usefully reduce the problem, and so on.

1. Reduce the problem as far as possible

This is just my normal advice for almost any problem, but it’s particularly relevant in this kind of question.

  • Start by assembling a complete program demonstrating nothing but the transformations. Using a single program which goes in both directions is simpler than producing two programs, one in each direction.
  • Remove pairs of transformations (e.g. encrypt/decrypt) at a time, until you’ve got the minimal set which demonstrates the problem
  • Avoid file IO if possible: hard-code short sample data which demonstrates the problem, and use in-memory streams (ByteArrayInputStream/ByteArrayOutputStream in Java; MemoryStream in .NET) for temporary results
  • If you’re performing encryption, hard-code a dummy key or generate it as part of the program.
  • Remove irrelevant 3rd party dependencies if possible (it’s simpler to reproduce an issue if I don’t need to download other libraries first)
  • Include enough logging (just console output, usually) to make it obvious where the discrepancy lies

In my experience, this is often enough to help you fix the problem for yourself – but if you don’t, you’ll be in a much better position for others to help you.

2. Make sure you’re really diagnosing the right data

It’s quite easy to get confused when comparing expected and actual (or before and after) data… particularly strings:

  • Character encoding issues can sometimes be hidden by spurious control characters being introduced invisibly
  • Fonts that can’t display all characters can make it hard to see the real data
  • Debuggers sometimes “helpfully” escape data for you
  • Variable-width fonts can make whitespace differences hard to spot

For diagnostic purposes, I find it useful to be able to log the raw UTF-16 code units which make up a string in both .NET and Java. For example, in .NET:

static void LogUtf16(string input)
{
    // Replace Console.WriteLine with your logging approach
    Console.WriteLine("Length: {0}", input.Length);
    foreach (char c in input)
    {
        Console.WriteLine("U+{0:x4}: {1}", (uint) c, c);
    }
}
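A rough Java equivalent is below. It builds the description as a string so that you can send it to whatever logging you use; the class and method names are my own.

```java
class Utf16Logger {
    // Describes a string as its length followed by one line per UTF-16 code unit.
    static String describeUtf16(String input) {
        StringBuilder builder = new StringBuilder();
        builder.append("Length: ").append(input.length()).append('\n');
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            builder.append(String.format("U+%04x: ", (int) c)).append(c).append('\n');
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // Replace System.out with your logging approach
        System.out.print(describeUtf16("a\u00e9"));
    }
}
```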

Binary data has different issues, mostly in terms of displaying it in some form to start with. Our diagnosis tools are primarily textual, so you’ll need to perform some kind of conversion to a string representation in order to see it at all. If you’re really trying to diagnose this as binary data (so you’re interested in the raw bytes) do not treat it as encoded text using UTF-8 or something similar. Hex is probably the simplest representation that allows differences to be pinpointed pretty simply. Again, logging the length of the data in question is a good first step.
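For the binary case, a minimal hex conversion in Java could look like this. It is a hand-rolled sketch with a hypothetical class name; on newer JDKs (17 and later) java.util.HexFormat does the same job.

```java
class HexDump {
    // Converts each byte to exactly two lowercase hex digits.
    static String toHex(byte[] data) {
        StringBuilder builder = new StringBuilder(data.length * 2);
        for (byte b : data) {
            builder.append(Character.forDigit((b >> 4) & 0xF, 16));
            builder.append(Character.forDigit(b & 0xF, 16));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 0x0f, (byte) 0xff, 0x10};
        System.out.println("Length: " + data.length);
        System.out.println(toHex(data)); // 000fff10
    }
}
```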

In both cases you might want to include a hash in your diagnostics. It doesn’t need to be a cryptographically secure hash in any way, shape or form. Any form of hash that is likely to change if the data changes is a good start. Just make sure you can trust your hashing code! (Every piece of code you write should be considered suspect – including whatever you decide to use for converting binary to hex, for example. Use trusted third parties and APIs provided with your target platform where possible. Even though adding an extra dependency for diagnostics makes it slightly harder for others to reproduce the problem, it’s better than the diagnostics themselves being suspect.)
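One cheap option, sketched in Java with the CRC32 class that ships with the platform (the describe helper is hypothetical). CRC-32 is in no way cryptographically secure, but it is very likely to change if the data changes, which is all that is needed here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.CRC32;

class DiagnosticHash {
    // Summarizes binary data as its length plus a CRC-32 checksum in hex.
    static String describe(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return String.format(Locale.ROOT, "length=%d crc32=%08x", data.length, crc.getValue());
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.println(describe(data)); // length=9 crc32=cbf43926
    }
}
```

Logging this one line before and after each pair of transformations makes mismatches easy to spot without reading full hex dumps.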

3. Analyze a clean, full, end-to-end log

This can be tricky when you’ve got multiple systems and platforms (which is why if you can possibly reproduce it in a single program it makes life simpler) but it’s really important to look at one log for a complete run.

Make sure you’re talking about the same data end-to-end. If you’re analyzing live traffic (which should be rare; unless the problem is very intermittent, this should all be done in test environments or a developer machine) or you have a shared test environment you need to be careful that you don’t use part of the data from one test and part of the data from another test. I know this sounds trivial, but it’s a really easy mistake to make. In particular, don’t assume that the data you’ll get from one part of the process will be the same run-to-run. In many cases it should be, but if the overall system isn’t working, then you already know that one of your expectations is invalid.

Compare “supposed to be equal” parts of the data. As per the steps in the introduction, there should be pairs of equal data, moving from the “top and bottom” of the transformation chain towards the middle. Initially, you shouldn’t care about whether you view the transformation as being correct – you’re only worried about whether the output is equal to the input. If you’ve managed to preserve all the data, the function of the transformation (encryption, compression etc) becomes relevant – but if you’re losing data, anything else is secondary. This is where the hash from the bottom of step 2 is relevant: you want to be able to determine whether the data is probably right as quickly as possible. Between “length” and “hash”, you should have at least some confidence, which will let you get to the most likely problem as quickly as possible.

4. Profit! (Conclusion…)

Once you’ve compared the results at each step, you should get an idea of which transformations are working and which aren’t. This may allow you to reduce the problem further, until you’ve just got a single transformation to diagnose. At that point, the problem becomes about encryption, or about compression, or about text encoding.

Depending on the situation you’re in, at this point you may be able to try multiple implementations or potentially multiple platforms to work out what’s wrong: for example, if you’re producing a zip file and then trying to decompress it, you might want to try using a regular decompression program to open your intermediate results, or decompress the results of compressing with a standard compression tool. Or if you’re trying to encrypt on Java and decrypt in C#, implement the other parts in each platform, so you can at least try to get a working “in-platform” solution – that may well be enough to find out which half has the problem.

To some extent, this whole blog post is about reducing the problem as far as possible, with some ideas of how to do that. I haven’t tried to warn you much about the problems you can run into in any particular domain, but you should definitely read Marc Gravell’s excellent post on “How many ways can you mess up IO?”; my earlier post on understanding the meaning of your data is pretty relevant too.

As this is a “hints and tips” sort of post, I’ll happily modify it to include reader contributions from comments. With any luck it’ll be a useful resource for multiple Stack Overflow questions in the months and years to come…

Stack Overflow question checklist

Note: this post is now available with a tinyurl of http://tinyurl.com/stack-checklist

My earlier post on how to write a good question is pretty long, and I suspect that even when I refer people to it, often they don’t bother reading it. So here’s a short list of questions to check after you’ve written a question (and to think about before you write the question):

  • Have you done some research before asking the question? 1
  • Have you explained what you’ve already tried to solve your problem?
  • Have you specified which language and platform you’re using, including version number where relevant?
  • If your question includes code, have you written it as a short but complete program? 2
  • If your question includes code, have you checked that it’s correctly formatted? 3
  • If your code doesn’t compile, have you included the exact compiler error?
  • If your question doesn’t include code, are you sure it shouldn’t?
  • If your program throws an exception, have you included the exception, with both the message and the stack trace?
  • If your program produces different results to what you expected, have you stated what you expected, why you expected it, and the actual results?
  • If your question is related to anything locale-specific (languages, time zones) have you stated the relevant information about your system (e.g. your current time zone)?
  • Have you checked that your question looks reasonable in terms of formatting?
  • Have you checked the spelling and grammar to the best of your ability? 4
  • Have you read the whole question to yourself carefully, to make sure it makes sense and contains enough information for someone coming to it without any of the context that you already know?

    If the answer to any of these questions is “no” you should take the time to fix up your question before posting. I realize this may seem like a lot of effort, but it will help you to get a useful answer as quickly as possible. Don’t forget that you’re basically asking other people to help you out of the goodness of their heart – it’s up to you to do all you can to make that as simple as possible.


    1 If you went from “something’s not working” to “asking a question” in less than 10 minutes, you probably haven’t done enough research.

    2 Ideally anyone answering the question should be able to copy your code, paste it into a text editor, compile it, run it, and observe the problem. Console applications are good for this – unless your question is directly about a user interface aspect, prefer to write a short console app. Remove anything not directly related to your question, but keep it complete enough to run.

    3 Try to avoid code which makes users scroll horizontally. You may well need to change how you split lines from how you have it in your IDE. Take the time to make it as clear as possible for those trying to help you.

    4 I realize that English isn’t the first language for many Stack Overflow users. We’re not looking for perfection – just some effort. If you know your English isn’t good, see if a colleague or friend can help you with your question before you post it.

Stack Overflow and personal emails

    This post is partly meant to be a general announcement, and partly meant to be something I can point people at in the future (rather than writing a short version of this on each email).

    These days, I get at least a few emails practically every day along the lines of:

    “I saw you on Stack Overflow, and would like you to answer this development question for me…”

    It’s clear that the author:

    • Is aware of Stack Overflow
    • Is aware that Stack Overflow is a site for development Q&A
    • Is aware that I answer questions on Stack Overflow

    … and yet they believe that the right way of getting me to answer a question is by emailing it to me directly. Sometimes it’s a link to a Stack Overflow question, sometimes it’s the question asked directly in email.

    In the early days of Stack Overflow, this wasn’t too bad. I’d get maybe one email like this a week. Nowadays, it’s simply too much.

    If you have a question worthy of Stack Overflow, ask it on Stack Overflow. If you’ve been banned from asking questions due to asking too many low-quality ones before, then I’m unlikely to enjoy answering your questions by email – learn what makes a good question instead, and edit your existing questions.

    If you’ve already asked the question on Stack Overflow, you should consider why you think it’s more worthy of my attention than everyone else’s questions. You should also consider what would happen if everyone who would like me to answer a question decided to email me.

    Of course in some cases it’s appropriate. If you’ve already asked a question, written it as well as you can, waited a while to see if you get any answers naturally, and if it’s in an area that you know I’m particularly experienced in (read: the C# language, basically) then that’s fine. If your question is about something from C# in Depth – a snippet which doesn’t work or some text you don’t understand, for example – then it’s entirely appropriate to mail me directly.

    Basically, ask yourself whether you think I will actually welcome the email. Is it about something you know I’m specifically interested in? Or are you just trying to get more attention to a question, somewhat like jumping a queue?

    I’m aware that it’s possible this post makes me look either like a grumpy curmudgeon or (worse) like an egocentric pseudo-celebrity. The truth is I’m just like everyone else, with very little time on my hands – time I’d like to spend as usefully and fairly as possible.