But what does it all mean?

This year before NDC, I wrote an article for the conference edition of "The Developer" magazine. Follow that link to find the article in all its illustrated glory (along with many other fine articles, of course) – or read on for just the text.

Back when I used to post on newsgroups I would frequently be in the middle of a debate about the details of some behaviour or terminology, when one poster would say: “You’re just quibbling over semantics” as if this excused any and all previous inaccuracies. I would usually agree – I was indeed quibbling about semantics, but there’s no “just” about it.

Semantics is meaning, and that’s at the heart of communication – so for example, a debate over whether it’s correct to say that Java uses pass-by-reference¹ is all about semantics. Without semantics, there’s nothing to talk about.

This has been going on for years, and I’m quite used to being the pedant in any conversation when it comes to terminology – it’s a topic close to my heart. But over the years – and importantly since my attention has migrated to Stack Overflow, which tends to be more about real problems developers are facing than abstract discussions – I’ve noticed that I’m now being picky in the same sort of way, but about the meaning of data instead of terminology.

Data under the microscope

When it comes down to it, all the data we use is just bits – 1s and 0s. We assemble order from the chaos by ascribing meaning to those bits… and not just once, but in a whole hierarchy. For example, take the bits 01001010 00000000:

  • Taken as a little-endian 16-bit unsigned integer, they form a value of 74.
  • That 16-bit unsigned integer can be viewed as a UTF-16 code unit for the character ‘J’.
  • That character might be the first character within a string.
  • That string might be the target of a reference, which is the value for a field called “firstName”.
  • That field might be within an instance of a class called “Person”.
  • The instance of “Person” whose “firstName” field has a value which is a reference to the string whose first character is ‘J’ might itself be the target of a reference, which is the value for a field called “author”, within an instance of a class called “Article”.
  • The instance of “Article” whose “author” field (fill in the rest yourself…) might itself be the target of a reference which is part of a collection, stored (indirectly) via a field called “articles” in a class called “Magazine”.

As we’ve zoomed out from sixteen individual bits, at every level we’ve imposed meaning. Imagine all the individual bits of information which would be involved in a single instance of the Magazine with a dozen articles, an editorial, credits – and perhaps even images. Really imagine them, all written down next to each other, possibly without even the helpful gap between bytes that I included in our example earlier.
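
The layering described above can be sketched in a few lines of Java. This is purely illustrative – the class name and structure are mine, not part of any real system – but it shows the first two levels of meaning being imposed on the same two bytes:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Layers {
    public static void main(String[] args) {
        // The raw data: 01001010 00000000
        byte[] bits = { 0x4A, 0x00 };

        // First layer of meaning: a little-endian 16-bit unsigned integer.
        int value = ByteBuffer.wrap(bits)
                              .order(ByteOrder.LITTLE_ENDIAN)
                              .getShort() & 0xFFFF;
        System.out.println(value); // 74

        // Second layer: the same value viewed as a UTF-16 code unit.
        char c = (char) value;
        System.out.println(c); // J
    }
}
```

The higher layers (strings, fields, objects) are exactly what the language runtime keeps imposing for us, all the way up.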

That’s the raw data. Everything else is “just” semantics.

So what does that have to do with me?

I’m sure I haven’t told you anything you don’t already know. Yes, we can impose meaning on these puny bits, with our awesome developer power. The trouble is that bits have a habit of rebelling if you try to impose the wrong kind of meaning on them… and we seem to do that quite a lot.

The most common example I see on Stack Overflow is treating text (strings) and binary data (image files, zip files, encrypted data) as if they were interchangeable. If you try to load a JPEG using StreamReader in .NET or FileReader in Java, you’re going to have problems. There are ways you can actually get away with it – usually by using the ISO-8859-1 encoding – but it’s a little bit like trying to drive down a road with a broken steering wheel, only making progress by bouncing off other obstacles.
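
As a minimal Java sketch of why this corrupts data (the byte values here are my own, but every JPEG does start with 0xFF 0xD8, which is not valid UTF-8):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryAsText {
    public static void main(String[] args) {
        // The first bytes of a JPEG - binary data, not text.
        byte[] original = { (byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0 };

        // Decoding as UTF-8 replaces the invalid sequences with U+FFFD...
        String asText = new String(original, StandardCharsets.UTF_8);
        // ...so encoding again does not round-trip the original bytes.
        byte[] roundTripped = asText.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(original, roundTripped)); // false

        // ISO-8859-1 maps every byte value to a character and back, which
        // is how code sometimes "gets away with it" - lossless, but fragile.
        String latin = new String(original, StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(original,
            latin.getBytes(StandardCharsets.ISO_8859_1))); // true
    }
}
```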

While this is a common example, it’s far from the only one. Some of the problems which fall into this category might not obviously be due to the mishandling of data, but at a deep level they’re all quite similar:

  • SQL injection attacks due to mingling code (SQL) with data (values) instead of using parameters to keep the two separate.
  • The computer getting arithmetic “wrong” because the developer didn’t understand the meaning of floating binary point numbers, and should actually have used a floating decimal point type (such as System.Decimal or java.math.BigDecimal).
  • String formatting issues due to treating the result of a previous string formatting operation as another format string – despite the fact that now it includes user data which could really have any kind of text in it.
  • Double-encoding or double-unencoding of text data to make it safe for transport via a URL.
  • Almost anything to do with dates and times, including – but certainly not limited to – the way that java.util.Date and System.DateTime values don’t inherently have a format. They’re just values.
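
For the first bullet, this is roughly what the difference looks like in Java with JDBC – the table and column names are invented for illustration:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class Parameters {
    // Vulnerable: user data is mingled with the SQL itself, so data
    // can be reinterpreted as code.
    static String unsafeQuery(String name) {
        return "SELECT * FROM people WHERE first_name = '" + name + "'";
    }

    // Safe: the SQL is fixed; the value travels separately as a parameter
    // and can never be mistaken for part of the query.
    static PreparedStatement safeQuery(Connection conn, String name)
            throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
            "SELECT * FROM people WHERE first_name = ?");
        ps.setString(1, name);
        return ps;
    }
}
```

Passing the classic `' OR '1'='1` through `unsafeQuery` turns it into a query matching every row; through `safeQuery` it just searches for someone with that odd name.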
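
A sketch of the floating point bullet in Java – binary floating point can’t represent 0.1 exactly, whereas a decimal type means exactly what its digits say:

```java
import java.math.BigDecimal;

public class DecimalDemo {
    public static void main(String[] args) {
        // Binary floating point: 0.1 and 0.2 have no exact representation,
        // so the sum drifts away from the "obvious" answer.
        double d = 0.1 + 0.2;
        System.out.println(d == 0.3); // false

        // Decimal floating point: the digits mean what they appear to mean.
        // (Note the String constructor - new BigDecimal(0.1) would faithfully
        // preserve the binary approximation, which is rarely what you want.)
        BigDecimal b = new BigDecimal("0.1").add(new BigDecimal("0.2"));
        System.out.println(b.compareTo(new BigDecimal("0.3")) == 0); // true
    }
}
```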
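
And for the last bullet: a java.util.Date is just an instant in time; any format belongs to the formatter, never to the date. A small sketch (the patterns are arbitrary examples):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class DateDemo {
    public static void main(String[] args) {
        Date date = new Date(0); // the value: the Unix epoch

        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
        iso.setTimeZone(TimeZone.getTimeZone("UTC"));
        SimpleDateFormat us = new SimpleDateFormat("MM/dd/yyyy", Locale.ROOT);
        us.setTimeZone(TimeZone.getTimeZone("UTC"));

        // Two renderings of the same value - the date itself has neither.
        System.out.println(iso.format(date)); // 1970-01-01
        System.out.println(us.format(date));  // 01/01/1970
    }
}
```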

The sheer bulk of questions which indicate a lack of understanding of the nature of data is enormous. Of course Stack Overflow only shows a tiny part of this – it doesn’t give much insight into the mountain of code which handles data correctly from the perspective of the types involved, but does entirely inappropriate things with those values from the perspective of their intended business meaning.

It’s not all doom and gloom though. We have some simple but powerful weapons available in the fight against semantic drivel.

Types

This article gives a good indication of why I’m a fan of statically typed languages. The type system can convey huge amounts of information about the nature of data, even if the business meaning of values of those types can be horribly overloaded.

Maybe it would be good if we distinguished between human-readable text which should usually be treated in a culture-sensitive way, and machine-parsable text which should usually be treated without reference to any culture. Those two types might have different operations available on them, for example – but it would almost certainly get messy very quickly.
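
To make that distinction concrete, here’s a small Java sketch: the same upper-casing operation gives different results depending on culture, which is exactly why machine-parsable text wants an invariant rule. (The file name is just an example; the Turkish locale is the classic demonstration.)

```java
import java.util.Locale;

public class CultureDemo {
    public static void main(String[] args) {
        String id = "image.png";

        // Machine-parsable text: use an invariant rule (Locale.ROOT).
        System.out.println(id.toUpperCase(Locale.ROOT)); // IMAGE.PNG

        // Culture-sensitive text: in Turkish, 'i' upper-cases to a
        // dotted capital I (U+0130), which breaks machine comparisons.
        System.out.println(id.toUpperCase(new Locale("tr")));
    }
}
```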

For business-specific types though, it’s usually easy to make sure that each type is really only used for one concept, and only provides operations which are meaningful for that concept.

Meaningful names

Naming is undoubtedly hard, and I suspect most developers have had the same joyless experiences that I have of struggling for ten minutes to come up with a good class or method name, only to end up with the one we first thought of which we really don’t like… but which is simply better than all the other options we’ve considered. Still, it’s worth the struggle.

Our names don’t have to be perfect, but there’s simply no excuse for names such as “Form1” or “MyClass” which seem almost designed to carry no useful information whatsoever. Often simply the act of naming something can communicate meaning. Don’t be afraid to extract local variables in code just to clarify the meaning of some otherwise-obscure expression.

Documentation

I don’t think I’ve ever met a developer who actually enjoys writing documentation, but I think it’s hard to deny that it’s important. There tend to be very few things that are so precisely communicated just by the names of types, properties and methods that no further information is required. What is guaranteed about the data? How should it be used – and how should it not be used?

The form, style and detail level of documentation will vary from project to project, but don’t underestimate its value. Aside from behavioural details, ask yourself what meaning you’re imposing on or assuming about the data you’re dealing with… what would happen if someone else made different assumptions? What could go wrong, and how can you prevent it by expressing your own understanding clearly? This isn’t just important for large projects with big teams, either – it’s entirely possible that the person who comes to the code with a different viewpoint is going to be you, six months later.

Conclusion

I apologise if this all sounds like motherhood and apple pie. I’m not saying anything new, after all – of course we all agree that we need to understand the meaning of our data. I’m really not trying to waste your time though: I want you to take a moment to appreciate just how much it matters that we understand the data we work with, and how many problems can be avoided by putting effort into communicating that effectively and reading what others have written about their data.

There are other approaches we can take beyond those I’ve listed above, too – much more technically exciting ones – around static code analysis, contracts, complicated annotations and the like. I have nothing against them, but just understanding the value of semantics is a crucial starting point, and once everyone on your team agrees on that, you’ll be in a much better position to move forward and agree on the meaning of your data itself. From there, the world’s your oyster – if you see what I mean.


¹ It doesn’t; references are passed by value, as even my non-programmer wife knows by now. That’s how often this myth comes up.
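
For the avoidance of doubt, a small sketch of the distinction (the class and method names are mine): reassigning a parameter never affects the caller’s variable, which is what pass-by-value means – the reference itself is copied.

```java
public class PassByValue {
    static void reassign(StringBuilder sb) {
        sb = new StringBuilder("changed"); // only the local copy changes
    }

    static void mutate(StringBuilder sb) {
        sb.append(" mutated"); // both copies refer to the same object
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("original");
        reassign(sb);
        System.out.println(sb); // original
        mutate(sb);
        System.out.println(sb); // original mutated
    }
}
```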

22 thoughts on “But what does it all mean?”

  1. Completely agree with you on the use of statically typed data, as it helps avoid a lot of problems and makes the code largely bug free.

    And yes I just hate writing documentation, I think this task should be done by professional technical writers.

    Like

  2. Jon, I think this topic of applying meaning to data (overlaying interpretation onto information) is important but your blog post is the first public comment I have seen on the matter. (I have been searching for it for a while now).

    Question: Does this topic have a formal name in mathematics or in logic? I want to know so I can find out more about the subject.

    Interesting note: The captcha I have to respond to in order to post this comment is an example of using meaning as a filter to cast a jpg to a String to validate that I am not just another machine. Just sayin.

    Like

  3. I think the closest thing to a formal name is probably “type theory,” but that tends to be more focused on categorization as opposed to meaning (even though meaning has some overlap with categorization).

    I don’t think I’ve ever heard of a formal name for the study of data meaning. Even if you type “study of data” into a search engine, you don’t get anything particularly useful back as a discipline. Typing data discipline brings up lots of things that are completely unrelated.

    I think “data” is too general.

    Like

  4. Now that I’ve had a bit to think about it, I recall hearing somewhere that data is just data and that information is applied data. Meaningful data.

    Along this line, there are a number of disciplines like informatics, informetrics (wow that’s an obnoxious word to say), information science, and possibly a few others.

    It might not be a huge stretch to say that these are applicable to the discussion.

    Like

  5. On the topic of what is this field of study called, I recently emailed this to a friend. The point being that it applies to more than just computer science:

    Friend,
    Do you know the name of the branch of mathematics (or maybe it is philosophy) that has to do with the application of context to data?

    By that I mean things like mapping screen memory to pixels to get an image, or inkjet printer instructions to on/off commands to get an image, or taking an array of 8 bytes and mapping it to get the Double value of 123.45 or mapping the same 8 bytes to get the String “abcd”. It doesn’t just apply to computers, though. It maps punched holes to designs in Jacquard Looms or maps integer values to bank statements to represent account balance. It is what takes information theory and makes the theory applicable to _anything_. But I don’t see anybody talking about it.

    I am thinking there has to be some blogs or articles or something on this subject, but I can’t find the name of the field of study in order to start reading or asking questions about it.

    FWIW. – Paul

    Like

  6. @Joel – I like the idea of using “type theory” to evangelize the issue at hand. The type system and the types we model are our foundational constructs. The fidelity with which we specify both is the key to avoiding miscommunication and misunderstanding.

    Perhaps we could amplify the name and call it “semantic type theory”? Thus the name practices what it preaches.

    Like

  7. @all: etymology?

    @skeet,

    How does this help me since I’m already a “semantics nut”?

    I agree with you completely: when writing code and more generally striving to be a clear communicator in life.

    Often my colleagues write semantics discussions off as just “implementation details”, or “that’s semantics matt!” and I’m reduced to silence in favour of ‘more important’ discussions.

    How do I counter this? How can I sell the value of semantics and meaning to the everyday person -and- my code cutting peers?

    -MK

    Like

  8. I immensely enjoy writing documentation. This is the truth. I have to avoid XML comments while coding to not get caught up writing a treatise on a boolean flag or a complete exegesis on the name and manners of using a class.

    Like

  9. The philosophy behind this “semantic type theory” is also an interesting topic, which I think goes back to Plato. I believe Plato considered the world to be an imperfect reflection of perfect concepts such as circles, triangles and other geometric shapes.

    Even if this were the case, it would still be hard to categorize a real object, since real objects are not actual circles (or any other mathematical model).

    However, in terms of categorizing types of data, we can apply abstract rules (that do not apply to real objects) to determine whether the data fits the category (as described in “Data under the microscope”).

    This is because the data is already an abstract model of the reality — and therefore an approximation. For instance, data about a class of triangles could be described by three lengths, so long as the sum of any two of the lengths is more than the third.

    Just adding my two-penneth-worth.

    Like

  10. @John

    “and makes the code largely bug free”

    Famous last words, and completely untrue. It can help protect against a certain class of bugs, but the real serious bugs are caused by logical errors, not the kind of errors a type-checker will catch.

    Like

  11. @Svick: It’s because you have made the classic mistake of confusing double equals with single. Try this and see if it doesn’t work better:

    ((Oyster)World.Instance).Owner = User.Current;

    Jon: A great article, thanks! (At long last…someone more pedantic than I!) This is exactly the sort of discussion I point to when I insist that comments ARE code. Due to the limitations of formal code syntax, it is utterly impossible to convey to a compiler all of the vital semantic information that must be conveyed to the developers who read the code. Thus comments must be included, and kept up to date with the surrounding code, in any well-maintained code base. Otherwise semantic information is lost.

    Like

  12. I’m afraid I’m lost in your semantic definition of a well defined computer term “reference”. I am not well versed in Java, but just working with a well defined term can get you into confusing semantics wars with yourself. You have reference types, and you can pass a field by reference, whether the field type is reference or value. With a reference type, you can also pass it by reference even when you don’t specify the ref keyword (in C#). But the behaviour is completely different for reference types depending on whether you do or don’t use the ref keyword. I.e. I clearly know what pass by reference means, but that meaning can change by context, so I can’t say if we are talking about the same thing or not.

    Like

  13. EDIT: I had a typo in the first version of this comment (pass-by-reference when I meant pass-by-value, of all things). This was corrected in response to a later comment.
    @Ken: No, Java always uses pass-by-value, and C# uses pass-by-value *unless* you use out or ref. You claim you “clearly know what pass by reference means” but that’s not clear given the rest of your statement.
    In particular, using “ref” in C# with a parameter type which is already a reference type still makes a difference – the variable is passed by reference, rather than its value (*a* reference) being passed by value. Please read tinyurl.com/…/parameters.html

    Like

  14. Does anyone know the release date for c# in depth 3rd ed. ? Amazon has it listed as June 28th but they still have not started shipping.

    Thanks

    Like

  15. @skeet: C# uses pass-by-reference *unless* you use out or ref.

    No, not really. That would lead to the absurd conclusion that using ref doesn’t actually pass by reference.

    I know you know this, but in fact C# uses call-by-value unless you specify out or ref. Objects are reference types, but those object reference values are passed by value just like any other regular parameter. It just doesn’t make any sense to say that they are passed by reference, because while you can mutate an object parameter in the callee, you can’t change its identity. Call-by-reference allows the callee to change the identity of the parameter by storing a different object in its place.

    In fact the calling mechanisms in Java and C# are identical–the difference you suggest doesn’t exist. The calling mechanism in C is also the same, with arrays being an example of a reference type in C that are still passed by value, but the value of a C array (technically the lvalue, used in parameter passing and on the left sides of assignments) is a pointer.

    Like

  16. @Anonymous Pedant: Yes, that was a typo. It should have been “and C# uses pass-by-value *unless* you use out or ref”

    There *is* a big difference between Java and C# though, as Java doesn’t have any equivalent to ref or out.

    I’m going to see if I can edit my earlier comment to avoid misleading others, but I’ll note that I *have* edited it in order to avoid making your comment seem wrong!

    Like

Leave a comment