This year before NDC, I wrote an article for the conference edition of "The Developer" magazine. Follow that link to find the article in all its illustrated glory (along with many other fine articles, of course) – or read on for just the text.
Back when I used to post on newsgroups I would frequently be in the middle of a debate of the details of some behaviour or terminology, when one poster would say: “You’re just quibbling over semantics” as if this excused any and all previous inaccuracies. I would usually agree – I was indeed quibbling about semantics, but there’s no “just” about it.
Semantics is meaning, and that’s at the heart of communication – so for example, a debate over whether it’s correct to say that Java uses pass- by-reference1 is all about semantics. Without semantics, there’s nothing to talk about.
This has been going on for years, and I’m quite used to being the pedant in any conversation when it comes to terminology – it’s a topic close to my heart. But over the years – and importantly since my attention has migrated to Stack Overflow, which tends to be more about real problems developers are facing than abstract discussions – I’ve noticed that I’m now being picky in the same sort of way, but about the meaning of data instead of terminology.
Data under the microscope
When it comes down to it, all the data we use is just bits – 1s and 0s. We assemble order from the chaos by ascribing meaning to those bits… and not just once, but in a whole hierarchy. For example, take the bits 01001010 00000000:
- Taken as a little-endian 16-bit unsigned integer, they form a value of 74. • That 16-bit unsigned integer can be viewed as a UTF-16 code unit for the character ‘J’.
- That character might be the first character within a string.
- That string might be the target of a reference, which is the value for a field called “firstName”.
- That field might be within an instance of a class called “Person”.
- The instance of “Person” whose “firstName” field has a value
which is a reference to the string whose first character is ‘J’ might itself be the target of a reference, which is the value for a field called “author”, within an instance of a class called “Article”.
- The instance of “Article” whose “author” field (fill in the rest your- self…) might itself be the target of a reference which is part of a collection, stored (indirectly) via a field called “articles” in a class called “Magazine”.
As we’ve zoomed out from sixteen individual bits, at every level we’ve imposed meaning. Imagine all the individual bits of information which would be involved in a single instance of the Magazine with a dozen articles, an editorial, credits – and perhaps even images. Really imagine them, all written down next to each other, possibly without even the helpful gap between bytes that I included in our example earlier.
That’s the raw data. Everything else is “just” semantics.
So what does that have to do with me?
I’m sure I haven’t told you anything you don’t already know. Yes, we can impose meaning on these puny bits, with our awesome developer power. The trouble is that bits have a habit of rebelling if you try to impose the wrong kind of meaning on them… and we seem to do that quite a lot.
The most common example I see on Stack Overflow is treating text (strings) and binary data (image files, zip files, encrypted data) as if they were interchangeable. If you try to load a JPEG using StreamReader in .NET or FileReader in Java, you’re going to have problems. There are ways you can actually get away with it – usually by using the ISO-8859-1 encoding – but it’s a little bit like trying to drive down a road with a broken steering wheel, only making progress by bouncing off other obstacles.
While this is a common example, it’s far from the only one. Some of the problems which fall into this category might not obviously be due to the mishandling of data, but at a deep level they’re all quite similar:
- SQL injection attacks due to mingling code (SQL) with data (values) instead of using parameters to keep the two separate.
- The computer getting arithmetic “wrong” because the developer didn’t understand the meaning of floating binary point numbers, and should actually have used a floating decimal point type (such as System.Decimal or java.math. BigDecimal).
- String formatting issues due to treating the result of a previous string formatting operation as another format string – despite the fact that now it includes user data which could really have any kind of text in it.
- Double-encoding or double-unencoding of text data to make it safe for transport via a URL.
- Almost anything to do with dates and times, including – but certainly not limited to – the way that java.util.Date and System. DateTime values don’t inherently have a format. They’re just values.
The sheer bulk of questions which indicate a lack of understanding of the nature of data is enormous. Of course Stack Overflow only shows a tiny part of this – it doesn’t give much insight into the mountain of code which handles data correctly from the perspective of the types involved, but does entirely inappropriate things with those values from the perspective of the intended business meaning of those values.
It’s not all doom and gloom though. We have some simple but powerful weapons available in the fight against semantic drivel.
This article gives a good indication of why I’m a fan of statically typed languages. The type system can convey huge amounts of information about the nature of data, even if the business meaning of values of those types can be horribly overloaded.
Maybe it would be good if we distinguished between human-readable text which should usually be treated in a culture-sensitive way, and machine-parsable text which should usually be treated without reference to any culture. Those two types might have different operations available on them, for example – but it would almost certainly get messy very quickly.
For business-specific types though, it’s usually easy to make sure that each type is really only used for one concept, and only provides operations which are meaningful for that concept.
Naming is undoubtedly hard, and I suspect most developers have had the same joyless experiences that I have of struggling for ten minutes to come up with a good class or method name, only to end up with the one we first thought of which we really don’t like… but which is simply better than all the other options we’ve considered. Still, it’s worth the struggle.
Our names don’t have to be perfect, but there’s simply no excuse for names such as “Form1” or “MyClass” which seem almost designed to carry no useful information whatsoever. Often simply the act of naming something can communicate meaning. Don’t be afraid to extract local variables in code just to clarify the meaning of some otherwise-obscure expression.
I don’t think I’ve ever met a developer who actually enjoys writing documentation, but I think it’s hard to deny that it’s important. There tend to be very few things that are so precisely communicated just by the names of types, properties and methods that no further information is required. What is guaranteed about the data? How should it be used – and how should it not be used?
The form, style and detail level of documentation will vary from project to project, but don’t underestimate its value. Aside from behavioural details, ask yourself what meaning you’re imposing on or assuming about the data you’re dealing with… what would happen if someone else made different assumptions? What could go wrong, and how can you prevent it by expressing your own understanding clearly? This isn’t just important for large projects with big teams, either – it’s entirely possible that the person who comes to the code with a different viewpoint is going to be you, six months later.
I apologise if this all sounds like motherhood and apple pie. I’m not saying anything new, after all – of course we all agree that we need to understand the meaning of the our data. I’m really not trying to waste your time though: I want you to take a moment to appreciate just how much it matters that we understand the data we work with, and how many problems can be avoided by putting effort into communicating that effectively and reading what others have written about their data.
There are other approaches we can take beyond those I’ve listed above, too – much more technically exciting ones – around static code analysis, contracts, complicated annotations and the like. I have nothing against them, but just understanding the value of semantics is a crucial starting point, and once everyone on your team agrees on that, you’ll be in a much better position to move forward and agree on the meaning of your data itself. From there, the world’s your oyster – if you see what I mean.
1 It doesn’t; references are passed by value, as even my non-programmer wife knows by now. That’s how often this myth comes up.