Diagnosing weird problems – a Stack Overflow case study

Earlier, I came across this Stack Overflow question. I solved it, tweeted it, but then thought it would serve as a useful case study into the mental processes I go through when trying to solve a problem – whether that’s on Stack Overflow, at work, or at home.

It’s definitely worth reading the original question, but the executive summary is:

When I compute the checksum/hash of c:WindowsSystem32Calc.exe using various tools and algorithms, those tools all give the same answer for each algorithm. When I try doing the same thing in Java, I get different results. What’s going on?

Now to start with, I’d like to shower a bit of praise on the author:

  • The post came with a short but utterly complete program to demonstrate the problem
  • The comments in the program showed the expected values and the actual values
  • The code was mostly pretty clean (clean enough for an SO post anyway)

In short, it had pretty much everything I ask for in a question. Yay! Additionally, the result seemed to be strange. The chances of any one of Java’s hashing algorithms being broken seem pretty slim, but four of them? Okay, now you’ve got me interested.

Reproducing the problem

Unless I can spot the error immediately, I usually try to reproduce the problem in a Stack Overflow post with a short but complete program. In this case, the program was already provided, so it was just a matter of copy/paste/compile/run. This one had the additional tweak that it was comparing the results of Java with the results of other tools, so I had to get hold of an MD5 sum tool first. (I chose to start with MD5 for no particular reason.) I happened to pick this one, but it didn’t really seem it would make much difference. (As it happens, that was an incorrect assumption, but hey…)

I ran md5sums on c:WindowsSystem32calc.exe, and got the same result as the poster. Handy.

I then ran the Java code, and again got the same result as the poster: step complete, we have a discrepancy between at least one tool (and MD5 isn’t that hard to get right) and Java.

Looking for obvious problems

The code has four main areas:

  • Reading a file
  • Updating digests
  • Converting bytes to hex
  • Storing and displaying results

Of these, all of the first three have common and fairly simple error modes. For example:

  • Failure to use the return value from InputStream.read()
  • Failure to update the digests using only the relevant portion of the data buffer
  • Failure to cope with Java’s signed bytes

The code for storing and displaying results seemed solid enough to ignore to start with, and brief inspection suggested that the first two failure modes had been avoided. While the hex code didn’t have any obvious problems either, it made sense to check it. I simply printed the result of hard-coding the “known good” CRC-32 value:

System.out.println(toHex(new byte[] {
    (byte) 0x8D, (byte) 0x8F, (byte) 0x5F, (byte) 0x8E
  }));

That produced the right result, so I ruled out that part of the code too. Even if it had errors in some cases, we know it’s capable of producing the right string for one of the values we know we should be returning, so it can’t be getting that value.

Initial checks around the file

I’m always suspicious of stream-handling code – or rather, I know how easily it can hide subtle bugs. Even though it looked okay, I thought I’d check – so I added a simple total to the code so I could see how many bytes had been hashed. I also removed all the hash algorithms other than MD5, just for simplicity:

MessageDigest md5 = MessageDigest.getInstance(“MD5”);

FileInputStream fis = new FileInputStream(file);
byte data[] = new byte[size];
int len = 0;
int total = 0;
while ((len = fis.read(data)) != -1) {
    md5.update(data, 0, len);
    total += len;
}
fis.close();
System.out.println(“Total bytes read: “ + total);

results.put(md5.getAlgorithm(), toHex(md5.digest()));

It’s worth noting that I haven’t tried to fix up bits of the code which I know I would change if I were actually doing a code review:

  • The stream isn’t being closed in a finally block, so we’ll have a dangling resource (until GC) if an IOException is thrown
  • The initial value of len is never read, and can be removed

Neither of these matters in terms of the problem at hand, and closing the file “properly” would make the sample code more complicated. (For the sake of just a short sample program, I’d be tempted to remove it entirely.)

The result showed the number of bytes being read as the command prompt did when I ran “dir c:WindowsSystem32Calc.exe” – so again, everything looks like it’s okay.

Desperate times call for stupid measures

Just on a whim, I decided to copy Calc.exe to a local folder (the current directory) and retry. After all, accessing a file in a system folder might have some odd notions applied to it. It’s hard to work out what, but… there’s nothing to lose just by trying a simple test. If it can rule anything out, and you’ve got no better ideas, you might as well try even the daftest idea.

I modified the source code to use the freshly-copied file, and it gave the same result. Hmm.

I then reran md5sums on the copied file… and it gave the same result as Java. In other words, running “md5sums c:WindowsSystem32Calc.exe” gave one result, but “md5sums CopyOfCalc.exe” gave a different result. At this point we’ve moved from Java looking like it’s behaving weirdly to md5sums looking suspicious.

Proving the root cause

At this point we’re sort of done – we’ve basically proved that the Java code produces the right hash for whatever data it’s given, but we’re left with the question of what’s happening on the file system. I had a hunch that it might be something to do with x86 vs x64 code (all of this was running on a 64-bit version of Windows 7) – so how do we test that assumption?

I don’t know if there’s a simple way of running an x86 version of the JVM, but I do know how to switch .NET code between x86 and x64 – you can do that for an assembly at build time. C# also makes the hashing and hex conversion simple, so I was able to knock up a very small app to show the problem:

using System;
using System.IO;
using System.Security.Cryptography;

class Test
{
    static void Main()
    {
        using (var md5 = MD5.Create())
        {
            string path = “c:/Windows/System32/Calc.exe”;
            var bytes = md5.ComputeHash(File.ReadAllBytes(path));
            Console.WriteLine(BitConverter.ToString(bytes));
        }
    }
}

(For a larger file I’d have used File.OpenRead instead, but then I’d have wanted to close the stream afterwards. Somehow it wasn’t worth correcting the existing possible handle leak in the Java code, but I didn’t want to write leaky code myself. So instead I’ve got code which reads the whole file into memory unnecessarily… )

You can choose the architecture to build against (usually AnyCPU, x86 or x64 – though it’s interesting to see that “arm” is also an option under .NET 4.5, for obvious reasons) either from Visual Studio or using the “/platform” command line option. This doesn’t change the IL emitted (as far as I’m aware) but it’s used for interop with native code – and in the case of executables, it also determines the version of the CLR which is used.

Building and running in x86 mode gave the same answer as the original “assumed to be good” tools; building and running in x64 mode gave the same answer as Java.

Explaining the root cause

At this point we’ve proved that the file system gives different results depending on whether you access it from a 64-bit process or a 32-bit process. The final piece of the puzzle was to find some something to explain why that happens. With all the evidence about what’s happening, it was now easy to search for more information, and I found this article giving satisfactory details. Basically, there are two different copies of the system executables on a 64 bit system: x86 ones which run under the 32-bit emulator, and x64 ones. They’re actually in different directories, but when a process opens a file in WindowsSystem32, the copy which matches the architecture of the process is used. It’s almost as if the WindowsSystem32 directory is a symlink which changes target depending on the current process.

A Stack Overflow comment on my answer gave a final nugget that this is called the “File System Redirector”.

Conclusion

Debugging sessions often feel a bit like this – particularly if you’re like me, and only resort to real debugging after unit testing has failed. It’s a matter of looking under all kinds of rocks, trying anything and everything, but keeping track of everything you try. At the end of process you should be able to explain every result you’ve seen, in an ideal world. (You may not be able to give evidence of the actual chain of events, but you should be able to construct a plausible chain of events which concurs with your theory.)

Be aware of areas which can often lead to problems, and check those first, gradually reducing the scope of “stuff you don’t understand” to a manageable level, until it disappears completely.

15 thoughts on “Diagnosing weird problems – a Stack Overflow case study”

  1. That was a great little puzzle. I’d never have guessed that Microsoft compiled two different versions of calc.exe of all things. Is there any point to having a 64-bit version of a small desktop calculator? Perhaps they simply run a build process that compiles all of the standard System32 programs for 64 bit.

    Also, “I came, I solved, I tweeted” is an excellent idea for a signature. :)

    Like

  2. The compiled two versions so that both 32 and 64 processes can execute a process called calc.exe and rely on the system path to locate the file. Given that the file system redirector exists there have to be two versions, or the executable would have to be moved out of system32.

    Like

  3. > gradually reducing the scope of “stuff you don’t understand” to a manageable level, until it disappears completely.

    This is IMO a problem. The programming world changes quickly. Things you used to understand perfectly, and have no reason to suspect may have changed, may have still changed. My most frustrating debugging experiences have been ones where the “stuff I don’t understand” *did* disappear, at least as far as I knew, yet the problem was still there.

    I would never have thought, for example, that “perhaps my signed integers don’t overflow like twos complement signed integers on my twos complement platform”, yet that turned out to be the problem one time (C and C++ leave overflow behaviour of signed integers undefined, and the GCC optimiser had changed).

    When you have no choice but to doubt everything, that leaves you with no clue where to look.

    In that twos complement example, one thing you can do is identify exactly which operations are giving unexpected results. That can take a fair amount of time, but is at least possible.

    But even if you identify which operation is misbehaving, that automatically mean you can figure out why it’s broken and how to fix it. Once in a while, all you can really do is ask others, and hope someone knows something that you don’t.

    Like

  4. There’s been many a time that I’ve thought of something that seemed like a silly idea and it held the answer to my debugging questions. The good times are when I pay attention to that instinct! It’s so easy to say, “Naw, couldn’t be that!” and then be chagrined when you come back to it hours later.

    Like

  5. @Steve: I meant “stuff you don’t understand about this specific instance”. I certainly didn’t mean as a permanent thing – the idea of reducing the scope of “areas of computing I don’t understand” to nothing is way beyond me :)

    Like

  6. This just shows how valuable it is to read the oldnewthing if you’re writing code for windows, even if you’re never writing a single line of win32 code.

    This topic comes up often enough in Raymond’s excellent posts that the problem was obvious to me after reading the initial post.

    Although the morale of the post is interesting in its own – gradually reducing the problem space while debugging is invaluable and something that lots of SO questions are sadly lacking (although in this case the problem is unintuitive enough that I wouldn’t blame the original user)

    Like

  7. You can install 2 different versions of the JRE/JDK and run the jar through one of those.

    I don’t think there is a compile time thingy to switch between x64 and x86 for Java.

    Like

  8. The last sentence of this post rings particularly poignant. A lot of troubleshooting comes down to eliminating ambiguity. Quickly eliminating large chunks of ambiguity leads to the fastest solution and the best troubleshooters are exceedingly talented at this skill.

    Like

  9. Nice write up of your journey!

    I am curious about your approach to use Search to find stuff.
    >>With all the evidence about what’s happening, it was now easy to search for more information, and I found this article giving satisfactory details

    Do you just start with some search words and refine them?
    Or do you try to imagine the finding page?
    Or do you have any advanced search settings that help finding faster?
    Or are you sitting so close to the search index that you have special access to it?

    Like

  10. @Peter: Looking back over my browser history, it looks like the relevant search was “windows file system system32 x86 processor” – judging by my history, I think I got lucky quite quickly.

    I don’t think I have any search tips that most blog readers wouldn’t already do naturally – I guess the main thing is “what term would be in a good result but not in a bad one?”

    Like

Leave a comment