Diagnosing issues with reversible data transformations

I see a lot of problems which look somewhat different at first glance, but all have the same cause:

  • Text is losing “special characters” when I transfer it from one computer to another
  • Decryption ends up with garbage
  • Compressed data can’t be decompressed
  • I can transfer text but not binary data

These are all cases of transforming and (usually) transferring data, and then performing the reverse transformation. Often there are multiple transformations involved, and they need to be carefully reversed in the appropriate order. For example:

  1. Convert text to binary using UTF-8
  2. Compress
  3. Encrypt
  4. Base64-encode
  5. Transfer (e.g. as text in XML)
  6. Base64-decode
  7. Decrypt
  8. Decompress
  9. Convert binary back to text using UTF-8

The actual details of each question can be different, but the way I’d diagnose them is the same in each case. That’s what this post is about – partly so that I can just link to it when such questions arise. Although I’ve numbered the broad steps, it’s one of those constant iteration situations – you may well need to tweak the logging before you can usefully reduce the problem, and so on.

1. Reduce the problem as far as possible

This is just my normal advice for almost any problem, but it’s particularly relevant in this kind of question.

  • Start by assembling a complete program demonstrating nothing but the transformations. Using a single program which goes in both directions is simpler than producing two programs, one in each direction.
  • Remove pairs of transformations (e.g. encrypt/decrypt) at a time, until you’ve got the minimal set which demonstrates the problem
  • Avoid file IO if possible: hard-code short sample data which demonstrates the problem, and use in-memory streams (ByteArrayInputStream/ByteArrayOutputStream in Java; MemoryStream in .NET) for temporary results
  • If you’re performing encryption, hard-code a dummy key or generate it as part of the program.
  • Remove irrelevant 3rd party dependencies if possible (it’s simpler to reproduce an issue if I don’t need to download other libraries first)
  • Include enough logging (just console output, usually) to make it obvious where the discrepancy lies

In my experience, this is often enough to help you fix the problem for yourself – but if you don’t, you’ll be in a much better position for others to help you.

2. Make sure you’re really diagnosing the right data

It’s quite easy to get confused when comparing expected and actual (or before and after) data… particularly strings:

  • Character encoding issues can sometimes be hidden by spurious control characters being introduced invisibly
  • Fonts that can’t display all characters can make it hard to see the real data
  • Debuggers sometimes “helpfully” escape data for you
  • Variable-width fonts can make whitespace differences hard to spot

For diagnostic purposes, I find it useful to be able to log the raw UTF-16 code units which make up a string in both .NET and Java. For example, in .NET:

static void LogUtf16(string input)
{
    // Replace Console.WriteLine with your logging approach
    Console.WriteLine(“Length: {0}”, input.Length);
    foreach (char c in input)
    {
        Console.WriteLine(“U+{0:x4}: {1}”, (uint) c, c);
    }
}

Binary data has different issues, mostly in terms of displaying it in some form to start with. Our diagnosis tools are primarily textual, so you’ll need to perform some kind of conversion to a string representation in order to see it at all. If you’re really trying to diagnose this as binary data (so you’re interested in the raw bytes) do not treat it as encoded text using UTF-8 or something similar. Hex is probably the simplest representation that allows differences to be pinpointed pretty simply. Again, logging the length of the data in question is a good first step.

In both cases you might want to include a hash in your diagnostics. It doesn’t need to be a cryptographically secure hash in any way, shape or form. Any form of hash that is likely to change if the data changes is a good start. Just make sure you can trust your hashing code! (Every piece of code you write should be considered suspect – including whatever you decide to use for converting binary to hex, for example. Use trusted third parties and APIs provided with your target platform where possible. Even though adding an extra dependency for diagnostics makes it slightly harder for others to reproduce the problem, it’s better than the diagnostics themselves being suspect.)

3. Analyze a clean, full, end-to-end log

This can be tricky when you’ve got multiple systems and platforms (which is why if you can possibly reproduce it in a single program it makes life simpler) but it’s really important to look at one log for a complete run.

Make sure you’re talking about the same data end-to-end. If you’re analyzing live traffic (which should be rare; unless the problem is very intermittent, this should all be done in test environments or a developer machine) or you have a shared test environment you need to be careful that you don’t use part of the data from one test and part of the data from another test. I know this sounds trivial, but it’s a really easy mistake to make. In particular, don’t assume that the data you’ll get from one part of the process will be the same run-to-run. In many cases it should be, but if the overall system isn’t working, then you already know that one of your expectations is invalid.

Compare “supposed to be equal” parts of the data. As per the steps in the introduction, there should be pairs of equal data, moving from the “top and bottom” of the transformation chain towards the middle. Initially, you shouldn’t care about whether you view the transformation as being correct – you’re only worried about whether the output is equal to the input. If you’ve managed to preserve all the data, the function of the transformation (encryption, compression etc) becomes relevant – but if you’re losing data, anything else is secondary. This is where the hash from the bottom of step 2 is relevant: you want to be able to determine whether the data is probably right as quickly as possible. Between “length” and “hash”, you should have at least some confidence, which will let you get to the most likely problem as quickly as possible.

4. Profit! (Conclusion…)

Once you’ve compared the results at each step, you should get an idea of which transformations are working and which aren’t. This may allow you to reduce the problem further, until you’ve just got a single transformation to diagnose. At that point, the problem becomes about encryption, or about compression, or about text encoding.

Depending on the situation you’re in, at this point you may be able to try multiple implementations or potentially multiple platforms to work out what’s wrong: for example, if you’re producing a zip file and then trying to decompress it, you might want to try using a regular decompression program to open your intermediate results, or decompress the results of compressing with a standard compression tool. Or if you’re trying to encrypt on Java and decrypt in C#, implement the other parts in each platform, so you can at least try to get a working “in-platform” solution – that may well be enough to find out which half has the problem.

To some extent all this blog post is about is reducing the problem as far as possible, with some ideas of how to do that. I haven’t tried to warn you much about the problems you can run into in any particular domain, but you should definitely read Marc Gravell’s excellent post on “How many ways can you mess up IO?” and my earlier post on understanding the meaning of your data is pretty relevant too.

As this is a “hints and tips” sort of post, I’ll happily modify it to include reader contributions from comments. With any luck it’ll be a useful resource for multiple Stack Overflow questions in the months and years to come…

7 thoughts on “Diagnosing issues with reversible data transformations”

  1. I can’t remember whether it was you or someone else who, a while back, pointed out that a common mistake is to try to use e.g. UTF-8 encoding for steps 4 & 6, not realising that it’s not meant to be applied to arbitrary binary data.

    (or whether you even feel that this is sufficiently on topic to warn about in this post)

    Like

  2. Remove 3rd party dependencies if possible (it’s simpler to reproduce an issue if I don’t need to download other libraries first)

    Use trusted third parties and APIs provided with your target platform where possible.)

    Like

  3. @M: I’ve added a bit of clarifying text. Basically I’m trying to avoid two types of issues in the questions themselves:

    – Ones based on poor diagnostics where a hand-rolled hex converter has removed all leading 0s (for example).

    – Ones where 3rd party libraries are used for things which aren’t actually fundamental to the question, e.g. loading a key from a keystore (where just generating a new random key can demonstrate the problem easily).

    Like

Leave a comment