When is a string not a string?

As part of my “work” on the ECMA-334 TC49-TG2 technical group, standardizing C# 5 (which will probably be completed long after C# 6 is out… but it’s a start!) I’ve had the pleasure of being exposed to some of the interesting ways in which Vladimir Reshetnikov has tortured C#. This post highlights one of the issues he’s raised. As usual, it will probably never impact 99.999% of C# developers… but it’s a lovely little problem to look at.

Relevant specifications referenced in this post:
– The Unicode Standard, version 7.0.0 – in particular, chapter 3
– The C# 5 specification (Word document)
– ECMA-335 (CLI specification)

What is a string?

How would you define the string (or System.String) type? I can imagine a number of responses to that question, from vague to pretty specific, and not all well-defined:

  • “Some text”
  • A sequence of characters
  • A sequence of Unicode characters
  • A sequence of 16-bit characters
  • A sequence of UTF-16 code units

The last of these is correct. The C# 5 specification (section 1.3) states:

Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.

So far, so good. But that’s C#. What about IL? What does that use, and does it matter? It turns out that it does… Strings need to be represented in IL as constants, and the nature of that representation is important, not only in terms of the encoding used, but how the encoded data is interpreted. In particular, a sequence of UTF-16 code units isn’t always representable as a sequence of UTF-8 code units.

I feel ill (formed)

Consider the C# string literal "X\uD800Y". That is a string consisting of three UTF-16 code units:

  • 0x0058 – ‘X’
  • 0xD800 – High surrogate
  • 0x0059 – ‘Y’

That’s fine as a string – it’s even a Unicode string according to the spec (item D80). However, it’s ill-formed (item D84). That’s because the UTF-16 code unit 0xD800 doesn’t map to a Unicode scalar value (item D76) – the set of Unicode scalar values explicitly excludes the high/low surrogate code points.

Just in case you’re new to surrogate pairs: UTF-16 only deals in 16-bit code units, which means it can’t cope with the whole of Unicode (which ranges from U+0000 to U+10FFFF inclusive). If you want to represent a value greater than U+FFFF in UTF-16, you need to use two UTF-16 code units: a high surrogate (in the range 0xD800 to 0xDBFF) followed by a low surrogate (in the range 0xDC00 to 0xDFFF). So a high surrogate on its own makes no sense. It’s a valid UTF-16 code unit in itself, but it only has meaning when followed by a low surrogate.
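
If you want to see that pairing from C#, char.ConvertFromUtf32 and char.ConvertToUtf32 expose it directly. A quick sketch (the code point and formatting here are just for illustration):

// U+10000 needs two UTF-16 code units: a high surrogate followed by a low surrogate.
string s = char.ConvertFromUtf32(0x10000);
Console.WriteLine(s.Length);                                       // 2
Console.WriteLine(((uint) s[0]).ToString("x4"));                   // d800 (high surrogate)
Console.WriteLine(((uint) s[1]).ToString("x4"));                   // dc00 (low surrogate)
Console.WriteLine(char.ConvertToUtf32(s[0], s[1]).ToString("x"));  // 10000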

Show me some code!

So what does this have to do with C#? Well, string constants have to be represented in IL somehow. As it happens, there are two different representations: most of the time, UTF-16 is used, but attribute constructor arguments use UTF-8.

Let’s take an example:

using System;
using System.ComponentModel;
using System.Text;
using System.Linq;

[Description(Value)]
class Test
{
    const string Value = "X\ud800Y";

    static void Main()
    {
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }

    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}

The output of this code (under .NET) is:

Attribute: 0058 fffd fffd 0059
Constant: 0058 d800 0059

As you can see, the “constant” (Test.Value) has been preserved as a sequence of UTF-16 code units, but in the attribute property the lone surrogate has been decoded as two U+FFFD characters (U+FFFD being the Unicode replacement character, used to indicate broken data when decoding binary to text). Let’s dig a little deeper and look at the IL for the attribute and the constant:

.custom instance void [System]System.ComponentModel.DescriptionAttribute::.ctor(string)
= ( 01 00 05 58 ED A0 80 59 00 00 )

.field private static literal string Value
= bytearray (58 00 00 D8 59 00 )

The format of the constant (Value) is really simple – it’s just little-endian UTF-16. The format of the attribute is specified in ECMA-335 section II.23.3. Here, the meaning is:

  • Prolog (01 00)
  • Fixed arguments (for specified constructor signature)
    • 05 58 ED A0 80 59 (a single string argument as a SerString)
      • 05 (the length, i.e. 5, as a PackedLen)
      • 58 ED A0 80 59 (the UTF-8-encoded form of the string)
  • Number of named arguments (00 00)
  • Named arguments (there aren’t any)

The interesting part is the “UTF-8-encoded form of the string” here. It’s not valid UTF-8, because the input isn’t a well-formed string. The compiler has taken the high surrogate, determined that there isn’t a low surrogate after it, and just treated it as a value to be encoded in the normal UTF-8 way of encoding anything in the range U+0800 to U+FFFF inclusive.
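
Just to show what that “normal way” looks like, here’s a sketch of the standard three-byte UTF-8 pattern applied to 0xD800 by hand (this isn’t the compiler’s actual code, of course):

// Three-byte UTF-8 pattern: 1110xxxx 10xxxxxx 10xxxxxx
int codeUnit = 0xD800;
byte b1 = (byte) (0xE0 | (codeUnit >> 12));          // 0xED
byte b2 = (byte) (0x80 | ((codeUnit >> 6) & 0x3F));  // 0xA0
byte b3 = (byte) (0x80 | (codeUnit & 0x3F));         // 0x80

That matches the ED A0 80 in the IL above – and it’s precisely because those bytes encode a surrogate code point that the sequence isn’t valid UTF-8.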

It’s worth noting that if we had a full surrogate pair, UTF-8 would encode the single Unicode scalar value being represented, using 4 bytes. For example, if we change the declaration of Value to:

const string Value = "X\ud800\udc00Y";

then the UTF-8 bytes in the IL are 58 F0 90 80 80 59 – where F0 90 80 80 is the UTF-8 encoding for U+10000. That’s a well-formed string, and we get the same value for both the description attribute and the constant.
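
Because that version is well-formed, you can reproduce those bytes with the BCL directly – a minimal sketch, reusing the usings from the earlier example:

byte[] bytes = Encoding.UTF8.GetBytes("X\ud800\udc00Y");
Console.WriteLine(BitConverter.ToString(bytes));  // 58-F0-90-80-80-59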

So in our original example, the string constant (encoded as UTF-16 in the IL) is just decoded without checking whether or not it’s ill-formed, whereas the attribute argument (encoded as UTF-8) is decoded with extra validation, which detects the ill-formed code unit sequence and replaces it.

Encoding behaviour

So which approach is right? According to the Unicode specification (item C10) both could be fine:

When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.

and

Conformant processes cannot interpret ill-formed code unit sequences. However, the conformance clauses do not prevent processes from operating on code unit sequences that do not purport to be in a Unicode character encoding form. For example, for performance reasons a low-level string operation may simply operate directly on code units, without interpreting them as characters. See, especially, the discussion under D89.

It’s not at all clear to me whether either the attribute argument or the constant value “purports to be in a Unicode character encoding form”. In my experience, very few pieces of documentation or specification are clear about whether they expect a piece of text to be well-formed or not.

Additionally, System.Text.Encoding implementations can often be configured to determine how they behave when encoding or decoding ill-formed data. For example, Encoding.UTF8.GetBytes(Value) returns byte sequence 58 EF BF BD 59 – in other words, it spots the bad data and replaces it with U+FFFD as part of the encoding… so decoding this value will result in X U+FFFD Y with no problems. On the other hand, if you use new UTF8Encoding(true, true).GetBytes(Value), an exception will be thrown. The first constructor argument is whether or not to emit a byte order mark under certain circumstances; the second one is what dictates the encoding behaviour in the face of invalid data, along with the EncoderFallback and DecoderFallback properties.
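
To see both behaviours side by side, here’s a sketch using the same ill-formed Value constant as before (imagine it running inside the Test class from earlier; the strictUtf8 name is arbitrary):

// Encoding.UTF8 replaces the lone surrogate with U+FFFD (EF BF BD) while encoding...
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(Value)));  // 58-EF-BF-BD-59

// ...whereas a UTF8Encoding configured to throw on invalid data refuses to encode it at all.
var strictUtf8 = new UTF8Encoding(true, true);  // emit BOM where appropriate; throw on invalid data
try
{
    strictUtf8.GetBytes(Value);
}
catch (EncoderFallbackException e)
{
    Console.WriteLine("Strict encoding threw: " + e.Message);
}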

Language behaviour

So should this compile at all? Well, the language specification doesn’t currently prohibit it – but specifications can be changed :)

In fact, both csc and Roslyn do prohibit the use of ill-formed strings with certain attributes. For example, with DllImportAttribute:

[DllImport(Value)]
static extern void Foo();

This gives an error when Value is ill-formed:

error CS0591: Invalid value for argument to 'DllImport' attribute

There may be other attributes this is applied to as well; I’m not sure.

If we take it as read that the ill-formed value won’t be decoded back to its original form when the attribute is instantiated, I think it would be entirely reasonable to make it a compile-time failure – for attributes. (This is assuming that the runtime behaviour can’t be changed to just propagate the ill-formed string.)

What about the constant value though? Should that be allowed? Can it serve any purpose? Well, the precise value I’ve given is probably not terribly helpful – but it could make sense to have a string constant which ends with a high surrogate or starts with a low surrogate… because it can then be combined with another string to form a well-formed UTF-16 string. Of course, you should be very careful about this sort of thing – read Unicode Technical Report #36 (“Unicode Security Considerations”) for some thoroughly alarming possibilities.
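
For example (the constant names here are made up, purely for illustration):

// Each constant is ill-formed on its own...
const string Prefix = "X\ud800";   // ends with a lone high surrogate
const string Suffix = "\udc00Y";   // starts with a lone low surrogate

// ...but concatenating them produces a well-formed UTF-16 string: X, U+10000, Y.
string combined = Prefix + Suffix;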

Corollaries

One interesting aspect to all of this is that “string encoding arithmetic” doesn’t behave as you might expect it to. For example, consider this method:

// Bad code!
string SplitEncodeDecodeAndRecombine
    (string input, int splitPoint, Encoding encoding)
{
    byte[] firstPart = encoding.GetBytes(input.Substring(0, splitPoint));
    byte[] secondPart = encoding.GetBytes(input.Substring(splitPoint));
    return encoding.GetString(firstPart) + encoding.GetString(secondPart);            
}

You might expect that this would be a no-op so long as everything is non-null and splitPoint is within range… but if you happen to split in the middle of a surrogate pair, it’s not going to be happy. There may well be other potential problems lurking there, depending on things like normalization form – I don’t think so, but at this point I’m unwilling to bet too heavily on string behaviour.
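
As a concrete sketch of that failure mode, assuming the method above and the default (replacement) encoder fallback:

// "X" + U+10000 + "Y": four UTF-16 code units, with a surrogate pair in the middle.
string input = "X" + char.ConvertFromUtf32(0x10000) + "Y";

// Splitting at index 2 lands between the high and low surrogates, so each half is
// ill-formed on its own; UTF-8 encoding replaces each lone surrogate with U+FFFD.
string roundTripped = SplitEncodeDecodeAndRecombine(input, 2, Encoding.UTF8);
Console.WriteLine(roundTripped == input);  // False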

If you think the above code is unrealistic, just imagine partitioning a large body of text, whether that’s across network packets, files, or whatever. You might feel clever for realizing that without a bit of care you’d get binary data split between UTF-16 code units… but even handling that doesn’t save you. Yikes.

I’m tempted to swear off text data entirely at this point. Floating point is a nightmare, dates and times… well, you know my feelings about those. I wonder what projects are available that only need to deal with integers, and where all operations are guaranteed not to overflow. Let me know if you have any.

Conclusion

Text is hard.

18 thoughts on “When is a string not a string?”

  1. Which is why you use Encoding.GetEncoder() to get an Encoder and call the “Convert” method (which is not very well documented and behaves weirdly – see the “flush” and “completed” parameters). It’s the only truly “safe” way to process incoming textual data from a buffered source such as a FileStream or NetworkStream in one pass, without buffering the whole thing.

    I’ve had the pleasure of writing Encoder/Decoder class pairs from scratch, and boy are they awkward to use and even more difficult to write and unit test (too many combos).

    There’s a design decision in there where the Encoder will never write “half” surrogate pairs (it will back out before them and set completed to false). But the corresponding Decoder class will consume a “half” surrogate pair (if flush is false) and “prepend” it to the buffer on the next call, thus processing it properly.
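
    A rough sketch of that stateful behaviour on the decoding side, using Decoder.GetChars (the buffer sizes and split point here are arbitrary):

    // 58 F0 90 80 80 59: "X", U+10000 as four UTF-8 bytes, "Y".
    byte[] data = Encoding.UTF8.GetBytes("X" + char.ConvertFromUtf32(0x10000) + "Y");
    var decoder = Encoding.UTF8.GetDecoder();
    char[] chars = new char[data.Length];

    // Feed the bytes in two chunks, splitting in the middle of the 4-byte sequence.
    // With flush set to false, the decoder holds on to the incomplete sequence and
    // finishes it on the next call, so the surrogate pair comes out intact.
    int written = decoder.GetChars(data, 0, 3, chars, 0, false);
    written += decoder.GetChars(data, 3, data.Length - 3, chars, written, true);
    Console.WriteLine(new string(chars, 0, written));  // prints X, then U+10000, then Y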

  2. Thanks…learned something new today. :)

    I would prefer to see a more specific error message than “error CS0591: Invalid value for argument to ‘DllImport’ attribute”. The compiler is presumably going through the UTF8 encoding process when it fails, so it ought to be able to say more clearly that it’s a string-encoding issue here. I can imagine being the person trying to figure out what “invalid value” I’ve given when I’m looking at a string that I think is correct, but which in fact isn’t.

    I also think it’s a bit strange and unfortunate that they decided to encode these values as UTF8. I understand the space constraint issue, but they obviously didn’t do this for other strings, so it couldn’t have been that important. And it has clear potential to create maintainability issues.

    As far as whether to allow invalid character sequences in UTF16 string literals, I say those should be allowed, for at least two different reasons. One, it’s always been that way so changing it would break things (always a compelling reason :) ). Two, there are actually valid scenarios for having invalid character sequences: unit testing code that is supposed to catch invalid character sequences; tool-generated text where strings are split at arbitrary UTF16 boundaries; or even some hack where someone’s storing non-text data in a string for whatever reason (yeah, it’s ugly and awful but should it actually be prohibited?).

    By the way, from the editorial desk, a couple of minor corrections:

    1. I assume that the phrase “it can’t cope with the hold of Unicode” was intended to be “it can’t cope with the whole of Unicode”.
    2. The string literals in your examples appear to be missing the backslash character. E.g. “Xud800Y” where it should be “X\ud800Y”. I assume they are getting lost in some web-enabled escaping somewhere.

  3. The technical bits of this are a little above my (low) pay grade, but can any of this knowledge help with getting round MySQL’s awful habit of returning “Incorrect string value” whenever it’s passed a non-standard (?) character through a SQL statement (parameterized or not)? I’ve looked everywhere for a simple (?) solution to “clean up” user input to avoid this, without success… years ago I came across this article https://www.bluebox.net/insight/blog-article/getting-out-of-mysql-character-set-hell (which probably explains the issue better as well), but bloody ’ell, does it have to be this hard?!?!

  4. Integers are hard too. Not just overflow, but there are differences in encodings for them too (two’s complement versus a sign flag versus who-knows-what). Let’s just all try to work towards an amazing world of Boolean-only programming on nothing but bit flags, then we can be free of encoding problems…

    1. Bool is a very evil type in C#. You think it has only 2 possible values (and the compiler assumes as much to perform some optimizations), but at runtime it can have 256 distinct values (and it’s not very difficult to get non-normalized values in C#). So bitwise AND-ing of two different true values can result in false.

      1. Which is why I did specify “on bit flags”. ;) I would have suggested flags enums but knew how easy it was to accidentally do things like integer math on those.

  5. Can you elaborate on “You might feel clever for realizing that without a bit of care you’d get binary data split between UTF-16 code units… but even handling that doesn’t save you.”? Assuming that encoding is a Unicode variant, splitting at codepoint boundaries and encoding seems safe to me.

    Of course, if your intention is to display the encoded elements piecewise this is insufficient – but it should be enough for transmission.

    1. That’s the point though – splitting at code point boundaries isn’t the same thing as splitting at UTF-16 code units. If you have a value which includes a surrogate pair, you could split at a UTF-16 code unit boundary which is “in the middle of” a code point, because the code point takes two code units. So you do need to think (and write) in terms of code points, not just code units.
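
       A small sketch of the kind of check that requires (the example string and split adjustment are purely illustrative):

       string s = "A" + char.ConvertFromUtf32(0x1F600) + "B";  // A, a surrogate pair, B
       int split = 2;  // a valid code unit index, but in the middle of the pair
       if (char.IsLowSurrogate(s[split]))
       {
           split--;  // move the boundary back so neither half contains a lone surrogate
       }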

  6. I think .NET should have different types for sequences of UTF-16 code units and for well-formed Unicode strings, and it should be reflected that these belong to different levels of abstraction.
