Diagnosing weird problems – a Stack Overflow case study

March 16, 2012 jonskeet 15 Comments

Earlier, I came across this Stack Overflow question. I solved it, tweeted it, but then thought it would serve as a useful case study into the mental processes I go through when trying to solve a problem – whether that’s on Stack Overflow, at work, or at home.

It’s definitely worth reading the original question, but the executive summary is:

When I compute the checksum/hash of c:WindowsSystem32Calc.exe using various tools and algorithms, those tools all give the same answer for each algorithm. When I try doing the same thing in Java, I get different results. What’s going on?

Now to start with, I’d like to shower a bit of praise on the author:

The post came with a short but utterly complete program to demonstrate the problem
The comments in the program showed the expected values and the actual values
The code was mostly pretty clean (clean enough for an SO post anyway)

In short, it had pretty much everything I ask for in a question. Yay! Additionally, the result seemed to be strange. The chances of any one of Java’s hashing algorithms being broken seem pretty slim, but four of them? Okay, now you’ve got me interested.

Reproducing the problem

Unless I can spot the error immediately, I usually try to reproduce the problem in a Stack Overflow post with a short but complete program. In this case, the program was already provided, so it was just a matter of copy/paste/compile/run. This one had the additional tweak that it was comparing the results of Java with the results of other tools, so I had to get hold of an MD5 sum tool first. (I chose to start with MD5 for no particular reason.) I happened to pick this one, but it didn’t really seem it would make much difference. (As it happens, that was an incorrect assumption, but hey…)

I ran md5sums on c:WindowsSystem32calc.exe, and got the same result as the poster. Handy.

I then ran the Java code, and again got the same result as the poster: step complete, we have a discrepancy between at least one tool (and MD5 isn’t that hard to get right) and Java.

Looking for obvious problems

The code has four main areas:

Reading a file
Updating digests
Converting bytes to hex
Storing and displaying results

Of these, all of the first three have common and fairly simple error modes. For example:

Failure to use the return value from InputStream.read()
Failure to update the digests using only the relevant portion of the data buffer
Failure to cope with Java’s signed bytes

The code for storing and displaying results seemed solid enough to ignore to start with, and brief inspection suggested that the first two failure modes had been avoided. While the hex code didn’t have any obvious problems either, it made sense to check it. I simply printed the result of hard-coding the “known good” CRC-32 value:

System.out.println(toHex(new byte[] {
(byte) 0x8D, (byte) 0x8F, (byte) 0x5F, (byte) 0x8E
}));

That produced the right result, so I ruled out that part of the code too. Even if it had errors in some cases, we know it’s capable of producing the right string for one of the values we know we should be returning, so it can’t be getting that value.

Initial checks around the file

I’m always suspicious of stream-handling code – or rather, I know how easily it can hide subtle bugs. Even though it looked okay, I thought I’d check – so I added a simple total to the code so I could see how many bytes had been hashed. I also removed all the hash algorithms other than MD5, just for simplicity:

MessageDigest md5 = MessageDigest.getInstance(“MD5”);

FileInputStream fis = new FileInputStream(file);
byte data[] = new byte[size];
int len = 0;
int total = 0;
while ((len = fis.read(data)) != -1) {
md5.update(data, 0, len);
total += len;
}
fis.close();
System.out.println(“Total bytes read: “ + total);

results.put(md5.getAlgorithm(), toHex(md5.digest()));

It’s worth noting that I haven’t tried to fix up bits of the code which I know I would change if I were actually doing a code review:

The stream isn’t being closed in a finally block, so we’ll have a dangling resource (until GC) if an IOException is thrown
The initial value of len is never read, and can be removed

Neither of these matters in terms of the problem at hand, and closing the file “properly” would make the sample code more complicated. (For the sake of just a short sample program, I’d be tempted to remove it entirely.)

The result showed the number of bytes being read as the command prompt did when I ran “dir c:WindowsSystem32Calc.exe” – so again, everything looks like it’s okay.

Desperate times call for stupid measures

Just on a whim, I decided to copy Calc.exe to a local folder (the current directory) and retry. After all, accessing a file in a system folder might have some odd notions applied to it. It’s hard to work out what, but… there’s nothing to lose just by trying a simple test. If it can rule anything out, and you’ve got no better ideas, you might as well try even the daftest idea.

I modified the source code to use the freshly-copied file, and it gave the same result. Hmm.

I then reran md5sums on the copied file… and it gave the same result as Java. In other words, running “md5sums c:WindowsSystem32Calc.exe” gave one result, but “md5sums CopyOfCalc.exe” gave a different result. At this point we’ve moved from Java looking like it’s behaving weirdly to md5sums looking suspicious.

Proving the root cause

At this point we’re sort of done – we’ve basically proved that the Java code produces the right hash for whatever data it’s given, but we’re left with the question of what’s happening on the file system. I had a hunch that it might be something to do with x86 vs x64 code (all of this was running on a 64-bit version of Windows 7) – so how do we test that assumption?

I don’t know if there’s a simple way of running an x86 version of the JVM, but I do know how to switch .NET code between x86 and x64 – you can do that for an assembly at build time. C# also makes the hashing and hex conversion simple, so I was able to knock up a very small app to show the problem:

using System;
using System.IO;
using System.Security.Cryptography;

class Test
{
    static void Main()
    {
        using (var md5 = MD5.Create())
        {
            string path = “c:/Windows/System32/Calc.exe”;
            var bytes = md5.ComputeHash(File.ReadAllBytes(path));
            Console.WriteLine(BitConverter.ToString(bytes));
        }
    }
}

(For a larger file I’d have used File.OpenRead instead, but then I’d have wanted to close the stream afterwards. Somehow it wasn’t worth correcting the existing possible handle leak in the Java code, but I didn’t want to write leaky code myself. So instead I’ve got code which reads the whole file into memory unnecessarily… )

You can choose the architecture to build against (usually AnyCPU, x86 or x64 – though it’s interesting to see that “arm” is also an option under .NET 4.5, for obvious reasons) either from Visual Studio or using the “/platform” command line option. This doesn’t change the IL emitted (as far as I’m aware) but it’s used for interop with native code – and in the case of executables, it also determines the version of the CLR which is used.

Building and running in x86 mode gave the same answer as the original “assumed to be good” tools; building and running in x64 mode gave the same answer as Java.

Explaining the root cause

At this point we’ve proved that the file system gives different results depending on whether you access it from a 64-bit process or a 32-bit process. The final piece of the puzzle was to find some something to explain why that happens. With all the evidence about what’s happening, it was now easy to search for more information, and I found this article giving satisfactory details. Basically, there are two different copies of the system executables on a 64 bit system: x86 ones which run under the 32-bit emulator, and x64 ones. They’re actually in different directories, but when a process opens a file in WindowsSystem32, the copy which matches the architecture of the process is used. It’s almost as if the WindowsSystem32 directory is a symlink which changes target depending on the current process.

A Stack Overflow comment on my answer gave a final nugget that this is called the “File System Redirector”.

Conclusion

Debugging sessions often feel a bit like this – particularly if you’re like me, and only resort to real debugging after unit testing has failed. It’s a matter of looking under all kinds of rocks, trying anything and everything, but keeping track of everything you try. At the end of process you should be able to explain every result you’ve seen, in an ideal world. (You may not be able to give evidence of the actual chain of events, but you should be able to construct a plausible chain of events which concurs with your theory.)

Be aware of areas which can often lead to problems, and check those first, gradually reducing the scope of “stuff you don’t understand” to a manageable level, until it disappears completely.

15 thoughts on “Diagnosing weird problems – a Stack Overflow case study”

Chris Nahr says:

March 16, 2012 at 1:00 pm

That was a great little puzzle. I’d never have guessed that Microsoft compiled two different versions of calc.exe of all things. Is there any point to having a 64-bit version of a small desktop calculator? Perhaps they simply run a build process that compiles all of the standard System32 programs for 64 bit.

Also, “I came, I solved, I tweeted” is an excellent idea for a signature. :)

LikeLike

Reply
David Heffernan says:

March 16, 2012 at 4:03 pm

The compiled two versions so that both 32 and 64 processes can execute a process called calc.exe and rely on the system path to locate the file. Given that the file system redirector exists there have to be two versions, or the executable would have to be moved out of system32.

LikeLike

Reply
Steve says:

March 16, 2012 at 9:39 pm

> gradually reducing the scope of “stuff you don’t understand” to a manageable level, until it disappears completely.

This is IMO a problem. The programming world changes quickly. Things you used to understand perfectly, and have no reason to suspect may have changed, may have still changed. My most frustrating debugging experiences have been ones where the “stuff I don’t understand” *did* disappear, at least as far as I knew, yet the problem was still there.

I would never have thought, for example, that “perhaps my signed integers don’t overflow like twos complement signed integers on my twos complement platform”, yet that turned out to be the problem one time (C and C++ leave overflow behaviour of signed integers undefined, and the GCC optimiser had changed).

When you have no choice but to doubt everything, that leaves you with no clue where to look.

In that twos complement example, one thing you can do is identify exactly which operations are giving unexpected results. That can take a fair amount of time, but is at least possible.

But even if you identify which operation is misbehaving, that automatically mean you can figure out why it’s broken and how to fix it. Once in a while, all you can really do is ask others, and hope someone knows something that you don’t.

LikeLike

Reply
Michael Holt says:

March 17, 2012 at 12:19 am

There’s been many a time that I’ve thought of something that seemed like a silly idea and it held the answer to my debugging questions. The good times are when I pay attention to that instinct! It’s so easy to say, “Naw, couldn’t be that!” and then be chagrined when you come back to it hours later.

LikeLike

Reply
skeet says:

March 17, 2012 at 2:16 am

@Steve: I meant “stuff you don’t understand about this specific instance”. I certainly didn’t mean as a permanent thing – the idea of reducing the scope of “areas of computing I don’t understand” to nothing is way beyond me :)

LikeLike

Reply
Voo says:

March 18, 2012 at 2:14 pm

This just shows how valuable it is to read the oldnewthing if you’re writing code for windows, even if you’re never writing a single line of win32 code.

This topic comes up often enough in Raymond’s excellent posts that the problem was obvious to me after reading the initial post.

Although the morale of the post is interesting in its own – gradually reducing the problem space while debugging is invaluable and something that lots of SO questions are sadly lacking (although in this case the problem is unintuitive enough that I wouldn’t blame the original user)

LikeLike

Reply
Kristof says:

March 19, 2012 at 1:56 am

You can install 2 different versions of the JRE/JDK and run the jar through one of those.

I don’t think there is a compile time thingy to switch between x64 and x86 for Java.

LikeLike

Reply
Mark Lysaght (a.k.a. Binary Worrier) says:

March 20, 2012 at 6:47 am

Excellent article, thanks.
For future reference (if you don’t already know) Corflags.exe will tell you the platform target of a compiled assembly, and will also let you change the target.

http://stackoverflow.com/questions/1591557/how-to-tell-if-a-net-assembly-was-compiled-as-x86-x64-or-any-cpu

P.S. The captcha for adding comments appears to be broken on Chrome

LikeLike

Reply
Anonymous says:

March 20, 2012 at 11:52 am

Wow. Teaches you never to use Windows.

LikeLike

Reply
Christoph Engelhardt says:

March 21, 2012 at 9:10 am

Excellent article.
I encountered a similar problem that was caused by Registry Redirector on a machine where the Public Documents folder had been moved from its default location.

I wrote about this here: http://www.hiddenfeatures.net/2012/03/inconsistencies-when-accessing-the-registry-on-windows-x64/

And the bad news is: The problem is getting worse with Windows 8

LikeLike

Reply
Christoph Engelhardt says:

March 21, 2012 at 9:10 am

Excellent article.
I encountered a similar problem that was caused by Registry Redirector on a machine where the Public Documents folder had been moved from its default location.

I wrote about this here: http://www.hiddenfeatures.net/2012/03/inconsistencies-when-accessing-the-registry-on-windows-x64/

And the bad news is: The problem is getting worse with Windows 8

LikeLike

Reply
Johnny Stockton says:

March 21, 2012 at 11:22 pm

The last sentence of this post rings particularly poignant. A lot of troubleshooting comes down to eliminating ambiguity. Quickly eliminating large chunks of ambiguity leads to the fastest solution and the best troubleshooters are exceedingly talented at this skill.

LikeLike

Reply
Peter Gfader says:

April 2, 2012 at 9:05 am

Nice write up of your journey!

I am curious about your approach to use Search to find stuff.
>>With all the evidence about what’s happening, it was now easy to search for more information, and I found this article giving satisfactory details

Do you just start with some search words and refine them?
Or do you try to imagine the finding page?
Or do you have any advanced search settings that help finding faster?
Or are you sitting so close to the search index that you have special access to it?

LikeLike

Reply
skeet says:

April 3, 2012 at 3:24 am

@Peter: Looking back over my browser history, it looks like the relevant search was “windows file system system32 x86 processor” – judging by my history, I think I got lucky quite quickly.

I don’t think I have any search tips that most blog readers wouldn’t already do naturally – I guess the main thing is “what term would be in a good result but not in a bad one?”

LikeLike

Reply
Pingback: Dec 19, 2014 – Today I learned…

Jon Skeet's coding blog

Diagnosing weird problems – a Stack Overflow case study

Reproducing the problem

Initial checks around the file

Desperate times call for stupid measures

Proving the root cause

Explaining the root cause

Conclusion

15 thoughts on “Diagnosing weird problems – a Stack Overflow case study”

Leave a comment Cancel reply

Reproducing the problem

Initial checks around the file

Desperate times call for stupid measures

Proving the root cause

Explaining the root cause

Conclusion

Share this:

Related

15 thoughts on “Diagnosing weird problems – a Stack Overflow case study”

Leave a comment Cancel reply