Code and data

I recently answered a Stack Overflow question which started off with a broken XPath expression by suggesting that the poster might be better off using LINQ to XML instead. The discussion which followed in the comments (around whether or not this was an appropriate answer) led me to think about the nature of code and data, and how important context is.

I don’t think there’s any particularly deep insight in this post – so I’ll attempt to keep it relatively short. However, you might like to think about how code and data interact in your own experience, and what the effects of this can be.

Code is data

Okay, so let’s start off with the obvious: all code is data, at some level. If it’s compiled code, it’s just binary data which a machine can execute. Put it on a machine which can’t execute it (the wrong architecture, say, or no suitable VM) and there’s nothing remarkable about it. It’s just a load of 1s and 0s. As source code, most languages are just plain text. Open up some source code written in C#, Ruby, Python, Java, C++ etc in Notepad and it’ll be readable. You may miss the syntax highlighting and so forth, but it’s still just text.

Code in the right context is more than just data

So what makes this data different to (say) a CSV file or a plain text story? It’s all in the context. When you load it into the right editor, or pass it to the right compiler, you get more information: in an editor you may see the aforementioned syntax highlighting, autocompletion, documentation for members you’re using; a compiler will either produce errors or a binary file. For something like Python or Ruby, you may want to feed the source into an interpreter instead of a compiler, but the principle is the same: the data takes on more meaning.

Code in the wrong code-related context is just data again

Now let’s think about typical places where you might put code (or something with similar characteristics) into the "wrong" context:

  • SQL statements
  • XSLT transformations
  • XPath expressions
  • XML or HTML text
  • Regular expressions

All of these languages have editors which understand them, and will help you avoid problems. All of them can also be embedded in other code – C#, for example. Indeed, almost all the regular expressions I’ve personally written have ended up in Java or C# code. At that point, there are two problems:

  • You may want to include text which doesn’t embed easily within the "host" language’s string literals (particularly double quotes, backslashes and newlines)
  • The code editor doesn’t understand the additional meaning of the text

The first problem is at least somewhat mitigated by C#’s support for verbatim string literals – only double quotes remain as a problem. But the second problem is the really big one. Visual Studio isn’t going to check that your regular expression or XPath expression looks valid. It’s not going to give you syntax highlighting for your SQL statement, much less IntelliSense on the columns present in your database. Admittedly such a thing might be possible, if the IDE looked ahead to find out where the text was going to be used – but I haven’t seen any IDE that advanced yet. (The closest I’ve seen is ReSharper noticing when you’re using a format string with the wrong number of parameters – that’s primitive but still really useful.)
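To make the verbatim string literal point concrete, here is a small sketch written as C# top-level statements. The patterns are made up for illustration; the point is only how the same text has to be spelled in each kind of literal.

```csharp
using System;
using System.Text.RegularExpressions;

// Regular literal: every backslash in the pattern has to be doubled.
string escaped = "^\\d+\\.\\d+$";

// Verbatim literal: backslashes are taken as-is...
string verbatim = @"^\d+\.\d+$";

// ...but a double quote still has to be written as two double quotes.
string quotedWord = @"""\w+""";

Console.WriteLine(escaped == verbatim);                      // True
Console.WriteLine(Regex.IsMatch("3.14", verbatim));          // True
Console.WriteLine(Regex.IsMatch("say \"hi\"", quotedWord));  // True
```

Either way, the compiler only sees a string; nothing checks that the contents form a valid pattern until the code runs.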

Of course, you could write your SQL (or XPath etc) in a dedicated editor, and then either copy and paste it into your code or embed it into your eventual binary and load it at execution time. Neither of these is particularly appealing. Copy and paste works well once, but then when you’re reading or modifying the code you lose the advantages you had unless you copy and paste it again. Embedding the file can work well in some cases – I use it liberally for test data in unit tests, for example – but I wouldn’t want it all over production code. It means that when reading the code, you have to refer to the external resource to work out what’s going to happen. In some cases that’s not too bad – it’s only like opening another class or method, I guess – but in other cases the shift of gears is too distracting.

When code is data, it’s easy to mix it with other data – badly

Within C# code, it’s easy to see the bits of data which sometimes occur in your code: string or numeric literals, typically. Maybe you subscribe to the "no magic values" philosophy, and only ever have literals (other than 0 or 1, typically) as values for constants. Well, that’s just a level of indirection – which in some ways hides the fact that you’ve still got magic values. If you’re only going to use a piece of data once, including it directly in-place actually adds to readability in my view. Anyway, without wishing to dive into that particular debate too deeply, the point is that the compiler (or whatever) will typically stop you from using that data as code – at least without being explicit about it. It will make sure that if you’re using a value, it really is a value. If you’re trying to use a variable, it had better be a variable. Putting a variable name in quotes means it’s just text, and using a word without the quotes will make the compiler complain unless you happen to have a variable with the right name.

Now compare that with embedding XPath within C#, where you might have:

var node = doc.SelectSingleNode("//foo/bar[@baz=xyz]");

Now it may be obvious to you that "xyz" is meant to be a value here, not the name of an attribute, an element, a function or anything like that… but it’s not obvious to Visual Studio, which won’t give you any warnings. This is only a special case of the previous issue of invalid code, of course, but it does lead onto a related issue… SQL injection attacks.
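For what it’s worth, the unquoted form is still valid XPath: a bare xyz is read as the name of a child element, so the predicate compares the attribute with the text of any xyz children. A small sketch, using a made-up sample document, shows the two forms behaving differently with no complaint from the compiler:

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<root><foo><bar baz=\"xyz\">hit</bar></foo></root>");

// Unquoted: xyz is parsed as a child element name. <bar> has no <xyz>
// children, so the comparison is false and nothing is selected.
var unquoted = doc.SelectSingleNode("//foo/bar[@baz=xyz]");

// Quoted: 'xyz' is a string literal, which is what was intended.
var quoted = doc.SelectSingleNode("//foo/bar[@baz='xyz']");

Console.WriteLine(unquoted == null);    // True
Console.WriteLine(quoted?.InnerText);   // hit
```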

When you’ve already got your "code" as a simple text value – a string literal containing your SQL statement, as an obvious example – it’s all too easy to start mixing that code/data with genuine data: a value entered by a user, for example. Hey, let’s just concatenate the two together. Or maybe use a format string, effectively mixing three languages (C#, SQL, the primitive string formatting "language" of string.Format) into a single statement. We all know the results, of course: nothing differentiates between the code/data and the genuine data, so if the user-entered value happens to look like SQL to drop a database table, we end up with Little Bobby Tables.

I’m sure 99% of my blog readers know the way to avoid SQL injection attacks: use parameterized SQL statements. Keep the data and the code separate, basically.
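As a reminder of what that looks like, here is a minimal sketch. The Students table, the CreateInsert helper and the @name marker are all assumptions for illustration (the marker syntax is SQL Server style and varies by provider), and the connection is whichever ADO.NET provider you already use.

```csharp
using System;
using System.Data;

string userInput = "Robert'); DROP TABLE Students;--";

// Concatenation: the user's text becomes part of the SQL itself.
string unsafeSql = "INSERT INTO Students (Name) VALUES ('" + userInput + "')";
Console.WriteLine(unsafeSql);
// INSERT INTO Students (Name) VALUES ('Robert'); DROP TABLE Students;--')

// Parameterized: the SQL text is fixed, and the value travels separately
// as a parameter, so it is never parsed as SQL.
// (Hypothetical helper; it is declared here but not called.)
static IDbCommand CreateInsert(IDbConnection connection, string name)
{
    IDbCommand command = connection.CreateCommand();
    command.CommandText = "INSERT INTO Students (Name) VALUES (@name)";

    IDbDataParameter parameter = command.CreateParameter();
    parameter.ParameterName = "@name";
    parameter.Value = name;
    command.Parameters.Add(parameter);

    return command;
}
```

The helper only builds the command; executing it needs an open connection from a real provider, which is left out here.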

Expressing the same ideas, but back in the "native" language

Going back to the start of all this, the above is why I like LINQ to XML. When I express a query using LINQ to XML, it’s often a lot longer than it would have been in the equivalent XPath – but I can tell where the data goes. I know where I’m using an element name, where I’m using an attribute name, and where I’m comparing or extracting values. If I miss out some quotes, chances are pretty high that the resulting code will be invalid, and it’ll be obvious where the problem is. I’m prepared to sacrifice brevity for the fact that I only work in a single language + library, instead of trying to embed one language within another.
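A rough LINQ to XML counterpart of the earlier XPath query might look like the sketch below. The sample document is made up, and the query is only approximately equivalent to the XPath form, but each name and value sits in its own string literal:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

var doc = XDocument.Parse(
    "<root><foo><bar baz=\"abc\">miss</bar><bar baz=\"xyz\">hit</bar></foo></root>");

var match = doc.Descendants("foo")   // element name
               .Elements("bar")      // element name
               .FirstOrDefault(bar => (string) bar.Attribute("baz") == "xyz"); // attribute name, then value

Console.WriteLine(match?.Value);     // hit
```

Dropping the quotes around xyz here would make it an identifier, and the compiler would reject it unless a variable of that name happened to exist.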

Likewise, building XML using LINQ to XML is much better than concatenating strings – I don’t need to worry about any nasty escaping issues, for example. LINQ to XML has been so nicely designed, it makes all kinds of things incredibly easy.
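A small sketch of the escaping point, with a made-up element name and value:

```csharp
using System;
using System.Xml.Linq;

string title = "Fish & Chips <large>";

// Concatenation: the ampersand and angle brackets go in unescaped,
// so the result is not well-formed XML.
string concatenated = "<item title=\"" + title + "\" />";
Console.WriteLine(concatenated);   // <item title="Fish & Chips <large>" />

// LINQ to XML: the name and the value are separate arguments, and
// escaping happens when the element is written out.
var element = new XElement("item", new XAttribute("title", title));
Console.WriteLine(element);

// Parsing the serialized form gives back the original value.
var roundTripped = XElement.Parse(element.ToString());
Console.WriteLine((string) roundTripped.Attribute("title") == title);   // True
```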

Regular expressions can sometimes be replaced by simple string operations. Where they can, I will often do so. I’d rather use a few IndexOf and Substring calls over a regular expression in general – but where the patterns I need get too tricky, I will currently fall back to regular expressions. I’m aware of ReadableRex but I haven’t looked at it in enough detail to say whether it can take the place of "normal" regular expressions in the way that LINQ to XML can so often take the place of XPath.
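As an illustration of the kind of trade-off involved, here is the same simple extraction done both ways. The input line and the pattern are made up for the example:

```csharp
using System;
using System.Text.RegularExpressions;

string line = "colour=blue";

// Plain string operations: find the separator, take what follows.
int separator = line.IndexOf('=');
string viaSubstring = separator >= 0 ? line.Substring(separator + 1) : "";

// The same extraction as a regular expression.
Match m = Regex.Match(line, @"^[^=]*=(.*)$");
string viaRegex = m.Success ? m.Groups[1].Value : "";

Console.WriteLine(viaSubstring);               // blue
Console.WriteLine(viaSubstring == viaRegex);   // True
```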

Of course, LINQ to SQL (and the Entity Framework) do something similar for SQL… although that’s slightly different, and has its own issues around predictability.

In all of these cases, however, the point is that by falling back to more verbose but more native-feeling code, some of the issues of embedding one language within another are removed. Code is still code, data is data again, and the two don’t get mixed up with each other.

Conclusion

If I ever manage to organize these thoughts in a more lucid way, I will probably just rewrite them as another (shorter) post. In the meantime, I’d urge you to think about where your code and data get uncomfortably close.

13 thoughts on “Code and data”

  1. I recently read the paper “Why It’s Nice to be Quoted: Quasiquoting for Haskell”, which describes a method for compilers to check that DSL fragments like regex are correct. Considering F#’s functional bent it seems likely these things should be available, if not already.

  2. “Embedding the file can work well in some cases – I use it liberally for test data in unit tests, for example – but I wouldn’t want it all over production code. It means that when reading the code, you have to refer to the external resource to work out what’s going to happen.”

    I get your hesitance, but IMHO “external resource” is something that happens all the time. It’s not really a problem. And externalizing code/data like this has a benefit similar to that when one follows the “no magic values” rule: you can give the defined code/data a _name_ that helps the code where it’s used become more readable.

    That said, inasmuch as a language such as C# provides language features that offer the same functionality as doing it some other way, I do agree that using the language features is preferable. LINQ to XML is a bit odd, because the System.Xml.Linq classes also understand XPath and so even when you’re writing LINQ it can be tempting to mix in some XPath too. But yes, if you’re good about keeping it all LINQ then it can be much clearer where the code ends and the data begins.

    Oddly enough, I thought this blog article was going to be about something else entirely. In particular, one of my favorite strategies is to try to generalize code so that it’s data-driven. There’s a lot of code for which this doesn’t work very well, but when it does it can IMHO make the code a lot easier to maintain.

    This is true even if the data winds up embedded in the code (e.g. a static array used for initializing some data structure). It works best when the generalized code is shorter and simpler; if the act of “generalization” actually winds up meaning that you put a bunch of special-cases into the code to handle a variety of anticipated needs, I think that just makes things harder to understand.

    But if the code can be generalized in a way that _really_ generalizes it, sort of like how generics (for example) generalizes code in a way that externalizes the case-specific aspects, data-driven code can be very valuable.

  3. “ReSharper noticing when you’re using a format string with the wrong number of parameters”

    I wish that instead of `Resources.MyString` being a string that you pass to `string.Format()`, it was a custom object with a `.Format()` method on it, which took the correct number of parameters.

    I further wish for something clever to make sure the types of those parameters are correct, at compile time.

  4. Going even further, it would be nice to see a theory on the equivalence of data and code, just as in physics there is the equivalence of mass and energy.

  5. Just for the record, IntelliJ IDEA 9 has support for syntax and error highlighting of RegExps, XML, HTML, XPath, SQL, and many others, and allows for others to be plugged in as well. I don’t know whether the latest version of ReSharper does as well.

    I think it makes sense to use the most appropriate language for a given problem. Regular expressions and XPath notation in particular can be far more expressive than the equivalent code, and when the output is a precise string of XML or SQL, then having an extra layer of abstraction in front of it isn’t always necessary.

  6. xmachina: Everything represented in a computer is basically binary data, that’s basic computer science (pretty much axiomatic).

    In general that means computer programs, binary code, Word documents, JPEG images, etc. *ARE* a form of data by definition.

    Theorizing equivalence of data and code is like theorizing strawberry flavoured ice cream and ice cream. One is a subset of the other.

  7. @Avi: Does it do this for embedded strings in Java code? If so, does it notice how you’re *using* the string to work out what highlighting to perform? Interesting. As for expressiveness… I agree that *sometimes* regexes and XPaths are obvious wins… but often it will depend on the experience of the team involved. There are many cases where a team of XPath gurus would obviously choose that route, whereas LINQ experts would choose LINQ to XML – and each would be making the correct choice for them.

    @Eber: This wasn’t particularly a dig at dynamic languages, no. It’s no secret that I’m generally a fan of static languages, but that’s unrelated to this post.

  8. The IntelliJ feature is called Language Injection (http://blogs.jetbrains.com/idea/tag/language-injection/). It can either be set manually, or is triggered based on the context of the String. For example, if a String is passed as the first argument of the Pattern.compile() or String.replaceAll() methods, it will treat it as a regular-expression.

    I agree exactly with what you wrote regarding expressiveness – that is what I was trying to say, that sometimes it is appropriate to use different notations, depending on the context and people involved.

  9. Would it perhaps be desirable to have syntax checking built into Visual Studio by specifying a “type” on a string? For example:

    string mySql = SQL@"SELECT FROM user";

    This would then give a compiler warning about the missing asterisk. Other default types could be JS@, XPATH@, REGEX@ etc.

    Obviously this isn’t going to work in cases where a statement of the other language is being built dynamically, but I think it would be a nice helper tool for (hopefully) valid statements.

  10. I am more than, say, 5 minutes late, but this is a topic I am interested in. What I used to do (when I was programming) was to create an entire language and system that implemented the functionality I wanted, and then code in that. For example, I created a script language for moving files around a large WAN (20 years ago). A similar project (I didn’t create this but extended it) was to maintain user data input forms with a standard word processor. That way, a non-programmer could create and edit the forms. No coding or compiling required.
    I think that creating constrained, special-purpose systems is more workable than trying to get a complex programming language to handle multiple levels of representation. Of course, you have to accept limitations in what the target system will be able to do, but lots of time is saved later. A customer service rep or end-user can design forms and reports, instead of a programmer. The secret is to set your expectations very low on the scope of what it can do. It will do some simple things extremely well and easily, and other things not at all. Mobile phone app, anyone?
