How many 32-bit types might we want?

I was recently directed to an article on "tiny types" – an approach to static typing which introduces distinct types for the sake of code clarity, rather than to add particular behaviour to each type. As I understand it, they’re like type aliases with no conversions between the various types. (Unlike plain aliases, an object is genuinely an instance of the relevant tiny type – there’s no "alias erasure" of the kind a language-based solution could easily provide.)

I like the idea, and wish it were better supported in languages – but it led me to thinking more about the existing numeric types that we’ve got and how they’re used. In particular, I was considering how in C# the "byte" type is relatively rarely used as a number with a meaningful magnitude, on which arithmetic is performed. That does happen, but more often it’s used either as a component of other types (e.g. converting 4 bytes from a stream into an integer) or as a sequence of 8 bits.

It then struck me that the situations where we perform bitwise operations and the situations where we perform arithmetic operations are reasonably distinct. I wonder whether it would be worth having five separate types – which could be purely at the language level, if we wanted:

  • Float32 – regular IEEE-754 32-bit binary floating point type with arithmetic operations but no bitwise operations
  • Int32 – regular signed integer with arithmetic operations but no bitwise operations
  • UInt32 – unsigned integer with arithmetic operations but no bitwise operations
  • Bit32 – a bit sequence with bitwise operations but no arithmetic operations
  • Identity32 – a 32-bit value which only defines equality

The last type would be used for identities which happened to consist of 32 bits but where the actual bit values were only useful in terms of comparison with other identities. (In this case, an Identity64 might be more useful in practice.)

Explicit conversions which preserved the underlying bit pattern would be available, so you could easily generate a sequence of Identity32 values by having a simple Int32 counter, for example.
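
None of these types exist, of course, but to make the idea concrete, here’s a rough sketch of how an Identity32 might be approximated as a struct in today’s C#. All the details here are mine and purely illustrative – a genuine language feature wouldn’t need the wrapper at all:

using System;

// Illustrative sketch only: a 32-bit value defining equality and nothing else.
public struct Identity32 : IEquatable<Identity32>
{
    private readonly int bits;

    private Identity32(int bits)
    {
        this.bits = bits;
    }

    // An explicit conversion which preserves the underlying bit pattern.
    public static explicit operator Identity32(int value)
    {
        return new Identity32(value);
    }

    // Equality is the only operation defined; there's no arithmetic
    // and no bitwise manipulation.
    public bool Equals(Identity32 other)
    {
        return bits == other.bits;
    }

    public override bool Equals(object obj)
    {
        return obj is Identity32 && Equals((Identity32) obj);
    }

    public override int GetHashCode()
    {
        return bits;
    }
}

Generating a sequence of identities from a counter is then just a cast in a loop – Identity32 id = (Identity32) counter; – and any attempt to add, subtract or mask an Identity32 simply fails to compile.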

At the same time, I’d want to introduce bitwise operations for Bit8 and Bit16 values, rather than the "everything is promoted to 32 bits" approach that we currently have in Java and C#, reducing the number of pointless casts in code that performs bitwise operations.
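
To illustrate the current pain point: in both Java and C#, the result of a bitwise operation on two byte values is an int, forcing a cast just to get back to where you started. In C#:

byte first = 0x0F;
byte second = 0xF0;
// first | second is promoted to int, so a cast back to byte is
// required even though the result always fits in 8 bits.
byte combined = (byte) (first | second);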

The expected benefit would be increased friction when using a type in an "unexpected" way for the value being represented – you’d need to explicitly cast to the right type – while maintaining a low-friction path when using a value in the expected way.

I haven’t yet figured out what I’d want a Stream.Read(…) call to use. Probably Bit32, but I’m not sure yet.

Anyway, I figured I’d just throw it out there as a thought experiment… any reactions?

Diagnosing issues with reversible data transformations

I see a lot of problems which look somewhat different at first glance, but all have the same cause:

  • Text is losing “special characters” when I transfer it from one computer to another
  • Decryption ends up with garbage
  • Compressed data can’t be decompressed
  • I can transfer text but not binary data

These are all cases of transforming and (usually) transferring data, and then performing the reverse transformation. Often there are multiple transformations involved, and they need to be carefully reversed in the appropriate order. For example:

  1. Convert text to binary using UTF-8
  2. Compress
  3. Encrypt
  4. Base64-encode
  5. Transfer (e.g. as text in XML)
  6. Base64-decode
  7. Decrypt
  8. Decompress
  9. Convert binary back to text using UTF-8

The actual details of each question can be different, but the way I’d diagnose them is the same in each case. That’s what this post is about – partly so that I can just link to it when such questions arise. Although I’ve numbered the broad steps, it’s one of those constant iteration situations – you may well need to tweak the logging before you can usefully reduce the problem, and so on.

1. Reduce the problem as far as possible

This is just my normal advice for almost any problem, but it’s particularly relevant in this kind of question.

  • Start by assembling a complete program demonstrating nothing but the transformations – see the sketch at the end of this step. Using a single program which goes in both directions is simpler than producing two programs, one in each direction.
  • Remove one pair of transformations (e.g. encrypt/decrypt) at a time, until you’ve got the minimal set which demonstrates the problem
  • Avoid file IO if possible: hard-code short sample data which demonstrates the problem, and use in-memory streams (ByteArrayInputStream/ByteArrayOutputStream in Java; MemoryStream in .NET) for temporary results
  • If you’re performing encryption, hard-code a dummy key or generate it as part of the program.
  • Remove irrelevant third-party dependencies if possible (it’s simpler to reproduce an issue if I don’t need to download other libraries first)
  • Include enough logging (just console output, usually) to make it obvious where the discrepancy lies

In my experience, this is often enough to help you fix the problem for yourself – but if you don’t manage to, you’ll be in a much better position for others to help you.
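
As a concrete example of that first bullet, here’s a minimal sketch of a single program covering the whole chain from the introduction. The choice of AES and GZip, and the all-zero dummy key and IV, are mine purely for demonstration – substitute whatever your real code uses:

using System;
using System.IO;
using System.IO.Compression;
using System.Security.Cryptography;
using System.Text;

class RoundTrip
{
    static void Main()
    {
        // Hard-coded sample data and a dummy key/IV, per the advice above.
        string original = "caf\u00e9";
        byte[] key = new byte[16];
        byte[] iv = new byte[16];

        byte[] utf8 = Encoding.UTF8.GetBytes(original);
        Log("UTF-8", utf8);
        byte[] compressed = Compress(utf8);
        Log("Compressed", compressed);
        byte[] encrypted = ApplyAes(compressed, key, iv, encrypt: true);
        Log("Encrypted", encrypted);
        string transferred = Convert.ToBase64String(encrypted);
        Console.WriteLine("Base64: " + transferred);

        // Now reverse every transformation, in the opposite order.
        byte[] decoded = Convert.FromBase64String(transferred);
        Log("Decoded", decoded);
        byte[] decrypted = ApplyAes(decoded, key, iv, encrypt: false);
        Log("Decrypted", decrypted);
        byte[] decompressed = Decompress(decrypted);
        Log("Decompressed", decompressed);
        string roundTripped = Encoding.UTF8.GetString(decompressed);
        Console.WriteLine("Round-tripped: " + roundTripped);
        Console.WriteLine("Equal: " + (original == roundTripped));
    }

    static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(data, 0, data.Length);
            }
            return output.ToArray();
        }
    }

    static byte[] Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return output.ToArray();
        }
    }

    static byte[] ApplyAes(byte[] data, byte[] key, byte[] iv, bool encrypt)
    {
        using (var aes = Aes.Create())
        using (var transform = encrypt ? aes.CreateEncryptor(key, iv)
                                       : aes.CreateDecryptor(key, iv))
        {
            return transform.TransformFinalBlock(data, 0, data.Length);
        }
    }

    static void Log(string label, byte[] data)
    {
        Console.WriteLine("{0} ({1} bytes): {2}",
                          label, data.Length, BitConverter.ToString(data));
    }
}

With every intermediate value logged, removing a pair of transformations is just a matter of deleting two matching lines and re-running.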

2. Make sure you’re really diagnosing the right data

It’s quite easy to get confused when comparing expected and actual (or before and after) data… particularly strings:

  • Character encoding issues can sometimes be hidden by spurious control characters being introduced invisibly
  • Fonts that can’t display all characters can make it hard to see the real data
  • Debuggers sometimes “helpfully” escape data for you
  • Variable-width fonts can make whitespace differences hard to spot

For diagnostic purposes, I find it useful to be able to log the raw UTF-16 code units which make up a string in both .NET and Java. For example, in .NET:

static void LogUtf16(string input)
{
    // Replace Console.WriteLine with your logging approach
    Console.WriteLine("Length: {0}", input.Length);
    foreach (char c in input)
    {
        Console.WriteLine("U+{0:x4}: {1}", (uint) c, c);
    }
}
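
As a quick example of why this is useful, two strings which often render identically can consist of quite different code units – something that only shows up when you log them:

LogUtf16("\u00e9");  // Precomposed é: length 1, U+00e9
LogUtf16("e\u0301"); // e followed by a combining acute accent: length 2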

Binary data has different issues, mostly in terms of displaying it in some form to start with. Our diagnosis tools are primarily textual, so you’ll need to perform some kind of conversion to a string representation in order to see it at all. If you’re really trying to diagnose this as binary data (so you’re interested in the raw bytes), do not treat it as encoded text using UTF-8 or something similar. Hex is probably the simplest representation that allows differences to be pinpointed precisely. Again, logging the length of the data in question is a good first step.

In both cases you might want to include a hash in your diagnostics. It doesn’t need to be a cryptographically secure hash in any way, shape or form. Any form of hash that is likely to change if the data changes is a good start. Just make sure you can trust your hashing code! (Every piece of code you write should be considered suspect – including whatever you decide to use for converting binary to hex, for example. Use trusted third parties and APIs provided with your target platform where possible. Even though adding an extra dependency for diagnostics makes it slightly harder for others to reproduce the problem, it’s better than the diagnostics themselves being suspect.)
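
Putting the length, hex and hash suggestions together, a sketch of a suitable helper (assuming a using directive for System.Security.Cryptography) might look like this – deliberately using only framework-provided conversions, so that the diagnostic code itself is above suspicion:

static void LogBytes(string label, byte[] data)
{
    Console.WriteLine("{0}: {1} bytes", label, data.Length);
    // BitConverter.ToString produces hex such as "48-65-6C-6C-6F".
    Console.WriteLine("Hex:  {0}", BitConverter.ToString(data));
    using (var md5 = MD5.Create())
    {
        // MD5 is fine here: we don't need security, just a value which
        // almost certainly changes whenever the data changes.
        Console.WriteLine("Hash: {0}", BitConverter.ToString(md5.ComputeHash(data)));
    }
}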

3. Analyze a clean, full, end-to-end log

This can be tricky when you’ve got multiple systems and platforms (which is why if you can possibly reproduce it in a single program it makes life simpler) but it’s really important to look at one log for a complete run.

Make sure you’re talking about the same data end-to-end. If you’re analyzing live traffic (which should be rare; unless the problem is very intermittent, this should all be done in test environments or on a developer machine) or you have a shared test environment, you need to be careful that you don’t use part of the data from one test and part of the data from another test. I know this sounds trivial, but it’s a really easy mistake to make. In particular, don’t assume that the data you’ll get from one part of the process will be the same run-to-run. In many cases it should be, but if the overall system isn’t working, then you already know that one of your expectations is invalid.

Compare “supposed to be equal” parts of the data. As per the steps in the introduction, there should be pairs of equal data, moving from the “top and bottom” of the transformation chain towards the middle. Initially, you shouldn’t care about whether you view the transformation as being correct – you’re only worried about whether the output is equal to the input. If you’ve managed to preserve all the data, the function of the transformation (encryption, compression etc) becomes relevant – but if you’re losing data, anything else is secondary. This is where the hash from the bottom of step 2 is relevant: you want to be able to determine whether the data is probably right as quickly as possible. Between “length” and “hash”, you should have at least some confidence, which will let you get to the most likely problem as quickly as possible.

4. Profit! (Conclusion…)

Once you’ve compared the results at each step, you should get an idea of which transformations are working and which aren’t. This may allow you to reduce the problem further, until you’ve just got a single transformation to diagnose. At that point, the problem becomes about encryption, or about compression, or about text encoding.

Depending on the situation you’re in, at this point you may be able to try multiple implementations or potentially multiple platforms to work out what’s wrong: for example, if you’re producing a zip file and then trying to decompress it, you might want to try using a regular decompression program to open your intermediate results, or decompress the results of compressing with a standard compression tool. Or if you’re trying to encrypt in Java and decrypt in C#, implement the other parts in each platform, so you can at least try to get a working “in-platform” solution – that may well be enough to find out which half has the problem.

To a large extent, this whole post is simply about reducing the problem as far as possible, with some ideas of how to do that. I haven’t tried to warn you much about the problems you can run into in any particular domain, but you should definitely read Marc Gravell’s excellent post on “How many ways can you mess up IO?” – and my earlier post on understanding the meaning of your data is pretty relevant too.

As this is a “hints and tips” sort of post, I’ll happily modify it to include reader contributions from comments. With any luck it’ll be a useful resource for multiple Stack Overflow questions in the months and years to come…

A tale of two puzzles

As I begin to write this, I’m in a small cubicle in Philadelphia airport, on my way back from CodeMash – a wonderful conference (yet again) which I feel privileged to have attended. Personal top highlights definitely include Dustin Campbell’s talk on C# 6 (I’m practically dribbling with anticipation – bits please!) and playing Settlers of Catan on an enormous board. Kevin Pilch-Bisson’s talk on scriptcs was fabulous too, not least because he demonstrated its NuGet support using Noda Time. I’m very likely to use scriptcs as a tool to ease folks into the language gently if I ever get round to writing my introductory C# book. (To follow my own advice from the talk I gave, I should really just start writing it – whether the initial attempt is good or not, it’s better than having nothing.)

First puzzle

During Dustin’s talk, he set some puzzles around refactoring – where an "inline" operation had to be smarter than it originally appeared, for various reasons. The final puzzle particularly piqued my interest, as Dustin explained that various members of the C# team hadn’t actually worked out why the code behaved the way it did… the refactoring in Roslyn worked in that it didn’t change the behaviour, but more through a good general strategy than any attempt to handle this specific case.

So, the code in question is:

using System;

class X
{
    static int M(Func<int?, byte> x, object y) { return 1; }
    static int M(Func<X, byte> x, string y) { return 2; }

    const int Value = 1000;

    static void Main()
    {
        var a = M(X => (byte) X.Value, null);

        unchecked
        {
            Console.WriteLine(a);
            Console.WriteLine(M(X => (byte) X.Value, null));
        }
    }
}

This produces output of:

1
2

Is this correct? Is it a compiler bug from the old native compiler, faithfully reproduced in Roslyn? Why would moving the expression into an "unchecked" block cause different behaviour?

In an attempt to give you enough space and time to avoid accidentally reading the answer, which is explained below, I’d like to take the time to mention the source of this puzzle. A few years ago, Microsoft hired Vladimir Reshetnikov into the testing part of the C#/VB team. Vladimir had previously worked at JetBrains on ReSharper, so it’s not like he was new to C#. Every year when I see Kevin and Dustin at CodeMash, they give me more reports of the crazy things Vladimir has come up with – perversions of the language way beyond my usual fare. The stories are always highly entertaining, and I hope to meet Vladimir in person one day. He has a Twitter account which I’ve only just started following, but which I suspect I’ll find immensely distracting.

Explanation

I massively overthought this for a while. I then significantly underthought it for a while. I played around with the code by:

  • Removing the second parameter from M
  • Changing the name of the parameter
  • Changing the parameter types in various ways
  • Making X.Value non-const (just static readonly)
  • Changing the value of X.Value to 100
  • Changing the lambda expression to make a call to a method using (byte) X.Value instead of returning it directly
  • Using a statement lambda

These gave me clues which I then failed to follow up on properly – partly because I was looking for something complicated. I was considering the categorization of a "cast of a constant value to a different numeric type" and similar details – instead of following through the compiler rules in a systematic way.

The expression we need to focus on is this:

M(X => (byte) X.Value, null)

This is just a method call using a lambda expression, using overload resolution to determine which overload to call and type inference to determine the type arguments to the method. In a very crude description of overload resolution, we perform the following steps:

  • Determine which overloads are applicable (i.e. which ones make sense in terms of the supplied arguments and the corresponding parameters)
  • Compare the applicable overloads to each other in terms of "betterness"
  • If one applicable overload is "better" than all the others, use that (modulo a bit more final checking)
  • If there are no applicable overloads, or no single applicable overload is better than all the others, then the call is invalid and leads to a compile-time error

My mistake was jumping straight to the second bullet point, assuming that both overloads are valid in each case.

Overload 1 – a simple parameter

First let’s look at the first overload: the one where the first parameter is of type Func<int?, byte>. What does the lambda expression of X => (byte) X.Value mean when converted to that delegate type? Is it always valid?

The tricky part is working out what the simple-name X refers to as part of X.Value within the lambda expression. The important part of the spec here is the start of section 7.6.2 (simple names):

A simple-name is either of the form I or of the form I<A1, …, AK>, where I is a single identifier and <A1, …, AK> is an optional type-argument-list. When no type-argument-list is specified, consider K to be zero. The simple-name is evaluated and classified as follows:

  • If K is zero and the simple-name appears within a block and if the block’s (or an enclosing block’s) local variable declaration space (§3.3) contains a local variable, parameter or constant with name I, then the simple-name refers to that local variable, parameter or constant and is classified as a variable or value.

(The spec continues with other cases.) So X refers to the lambda expression parameter, which is of type int? – so X.Value refers to the underlying value of the parameter, in the normal way.

Overload 2 – a bit more complexity

What about the second overload? Here the first parameter to M is of type Func<X, byte>, so we’re trying to convert the same lambda expression to that type. Here, the same part of section 7.6.2 is used, but also section 7.6.4.1 comes into play when determining the meaning of the member access expression X.Value:

In a member access of the form E.I, if E is a single identifier, and if the meaning of E as a simple-name (§7.6.2) is a constant, field, property, local variable, or parameter with the same type as the meaning of E as a type-name (§3.8), then both possible meanings of E are permitted. The two possible meanings of E.I are never ambiguous, since I must necessarily be a member of the type E in both cases. In other words, the rule simply permits access to the static members and nested types of E where a compile-time error would otherwise have occurred.

It’s not spelled out explicitly here, but the reason that it can’t be ambiguous is that you can’t have two members of the same name within the same type (other than for method overloads). You can have method overloads that are both static and instance methods, but then normal method overloading rules are enforced.

So in this case, X.Value doesn’t involve the use of the parameter called X at all. Instead, it’s the constant called Value within the class X. Note that this is only the case because the type of the parameter is the same as the type that the parameter’s name refers to. (It isn’t quite the same as the names being the same. If you have a using directive introducing an alias, such as using Y = X; then the condition can be satisfied with different names.)
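
This is the same rule that permits the well-known "Color Color" pattern. Here’s a small standalone illustration of the condition in action – the type and member names are invented for the example:

class Color
{
    public static readonly Color Red = new Color();
    public int Brightness;

    static void Demo(Color Color)
    {
        // "Color" could mean either the parameter or the type; because the
        // parameter's type is exactly the type its name refers to, both
        // meanings are permitted. Static members resolve via the type...
        Color red = Color.Red;
        // ... and instance members resolve via the parameter.
        int brightness = Color.Brightness;
    }
}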

So what difference does unchecked make?

Now that we know what the conversions from the lambda expression to the two different delegate types mean, we can consider whether they’re always valid. Taking the overload resolution out of the picture, when is each of these lines valid at compile-time?

Func<int?, byte> foo = X => (byte) X.Value;
Func<X, byte> bar = X => (byte) X.Value;

Note that we’re not trying to consider whether they might throw an exception – clearly if you call foo(null) then that will fail. But will it at least compile?

The first line is always valid… but the second depends on your context, in terms of checked or unchecked arithmetic. For the most part, if you’re not explicitly in a checked or unchecked context, the language assumes a default context of unchecked, but allows "external factors" (such as a compiler flag) to change that default. The Microsoft compiler is unchecked by default, and you can make an assembly use checked arithmetic "globally" using the /checked compiler argument (or by going to the Advanced settings in the build properties of a Visual Studio project).

However, one exception to this "default" rule is constant expressions. From section 7.6.12 of the spec:

For constant expressions (expressions that can be fully evaluated at compile-time), the default overflow checking context is always checked. Unless a constant expression is explicitly placed in an unchecked context, overflows that occur during the compile-time evaluation of the expression always cause compile-time errors.

The cast of X.Value to byte is still a constant expression – and it’s one that overflows at compile-time, because 1000 is outside the range of byte. So unless we’re explicitly in an unchecked context, the code snippet above will fail for the second line.
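
You can see that rule in isolation, with no lambda expressions involved. Assuming the same constant, the Microsoft compiler accepts the first of these assignments but rejects the second with error CS0221:

const int Value = 1000;

byte valid = unchecked((byte) Value); // Compiles: explicitly unchecked context
byte invalid = (byte) Value;          // Error CS0221: the constant 1000 can't be converted to byte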

Back to overloading

Given all of that, how does it fit in with our problem? Well, at this point it’s reasonably simple – so long as we’re careful. Look at the first assignment, which occurs outside the unchecked statement:

var a = M(X => (byte) X.Value, null);

Which overloads are applicable here? The second argument isn’t a problem – the null literal is convertible to both object and string. So can we convert the first argument (the lambda expression) to the first parameter of each overload?

As noted earlier, it’s fine to convert it to a Func<int?, byte>. So the first method is definitely applicable. However, the second method requires us to convert the lambda expression to Func<X, byte>. We’re not in an explicitly unchecked context, so the conversion doesn’t work. Where the rule quoted above talks about "overflows that occur during the compile-time evaluation of the expression always cause compile-time errors" it doesn’t mean we actually see an error message when the compiler speculatively tries to convert the lambda expression to the Func<X, byte> – it just means that the conversion isn’t valid.

So, there’s only one applicable method, and that gets picked and executed. As a result, the value of a is 1.

Now in the second call (inside the unchecked statement) both methods are applicable: the conversion from the lambda expression to the Func<X, byte> is valid because the constant conversion occurs within an explicitly unchecked context. We end up with two applicable methods, and need to see if one of them is definitively better than the other. This comes down to section 7.5.3.2 of the C# 5 spec, which gives details of a "better function member" test. This in turn ends up talking about better conversions from arguments to parameter types. The lambda expression conversions are effectively equal, but the conversion of null to string is "better than" the conversion to object, because there’s an implicit conversion from string to object but not vice versa. (In that sense the string conversion is "more specific", to borrow Java terminology.) So the second overload is picked, and we print 2 to the console.
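
The "betterness" of the string conversion is easy to see in isolation, using a hypothetical Show method:

static void Show(object y) { Console.WriteLine("object"); }
static void Show(string y) { Console.WriteLine("string"); }

// Show(null) prints "string": both overloads are applicable, but the
// conversion of null to string is better, because string is implicitly
// convertible to object and not vice versa.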

Phew!

Summary

This all comes down to the cast from a constant to byte being valid in an explicitly unchecked context, but not otherwise.

One odd part is that I spotted that reasonably quickly, and assumed the problem was actually fairly straightforward. It’s only when I started writing it up for this blog post that I dug into the spec to check the exact rules for the meaning of simple names in different situations. I’d solved the crux of the puzzle, but the exact details were somewhat involved. Indeed, while writing this blog post I’ve started two separate emails to the C# team reporting potential (equal and opposite) bugs – before discarding those emails as I got further into the specification.

This is reasonably common, in my experience: the C# language has been well designed so that we can intuitively understand the meaning of code without knowing the details of the exact rules. Those rules can be pretty labyrinthine sometimes, because the language is very precisely specified – but most of the time you don’t need to read through the rules in detail.

Didn’t you mention two puzzles?

Yes, yes I did.

The second puzzle is much simpler to state – and much harder to solve. To quote Vladimir’s tweet verbatim:

C# quiz: write a valid C# program containing a sequence of three tokens `? null :` that remains valid after we remove the `null`.

Constructing a valid program containing "? null :" is trivial.

Constructing a valid program containing "? :" is slightly trickier, but not really hard.

Constructing a program which is valid both with and without the null token is too hard for me, at the moment. I’m stuck – and I know I’m in good company, as Kevin, Dustin and Bill Wagner were all stuck too, last time I heard. I know of two people who’ve solved this so far… but I’m determined to get it…