Category Archives: Diagnostics

Diagnosing Portable Class Libraries

I’ve always found Portable Class Library (PCL) configuration to be a bit of a mystery. In simple cases, it’s simple: start a new PCL project in Visual Studio, select the environments you want to support, and away you go. But what’s going on under the hood, and what do all the options mean? How do I do this without Visual Studio?

Background: supporting Universal Windows Applications in Noda Time

Recently, I had a feature request for Noda Time to support Windows Phone 8.1 and Universal Windows Applications. This is something I haven’t done before, but it doesn’t sound too hard – it’s just a matter of changing an option in Visual Studio, surely… Unfortunately it’s not that simple for Noda Time. In order to support .NET 3.5 (and some other features which aren’t available in PCLs) the core NodaTime.dll project has multiple configurations which effectively split into “Desktop” and “PCL”. This has been achieved by hacking around with the .csproj file manually – nothing particularly clever, but the sort of thing which can’t be done directly in Visual Studio. This means that when I open the project properties in Visual Studio – even when I’m building a PCL version – it treats the project as a regular class library. (It confuses the heck out of ReSharper too in some cases, unfortunately.)
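To give a flavour of the kind of hackery involved, a conditional property group along these lines does the trick – this is a hypothetical illustration of the general technique rather than Noda Time’s actual project file:

<PropertyGroup Condition=" '$(Configuration)' == 'Signed Release Portable' ">
  <!-- Build this configuration as a PCL rather than a desktop class library -->
  <TargetFrameworkIdentifier>.NETPortable</TargetFrameworkIdentifier>
  <TargetFrameworkVersion>v4.0</TargetFrameworkVersion>
  <TargetFrameworkProfile>Profile2</TargetFrameworkProfile>
  <!-- Let the source adapt via #if PCL -->
  <DefineConstants>$(DefineConstants);PCL</DefineConstants>
</PropertyGroup>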

That’s not too hard to work around – I just need to know the name of the profile I want to support. A profile name is basically a single identifier which refers to the configuration of environments you want to support. Unfortunately, it’s pretty opaque – Noda Time 1.2 ships with “Profile2”, which means “.NET 4.0, Windows 8 (store apps), Silverlight 4, Windows Phone 7”. The simplest way to work out what profile I wanted was to create a new “regular” PCL in Visual Studio, unload the project, manually set the profile to Profile2 in the project file, reload the project, and then enable “Windows Phone 8.1”.

That then led to the next problem: Visual Studio 2013 (update 2) wanted to upgrade that sample project… because Profile2 is no longer really an option. You can’t manually set that combination in Visual Studio any more – it doesn’t display anything earlier than Silverlight 5 and Windows Phone 8. I now have two problems:

  • Given that Visual Studio changes the project file for me with a warning when I open it, how can I check what I’ve actually built against? Does msbuild silently do the same thing?
  • How can I find out what profiles are available, in order to select the most appropriate one?

Time to do some research…

Detecting a profile in a built assembly

First things first: if someone gives you a DLL, how can you detect what profile it was built against, and therefore which environments it can run in?

After hunting around in vain for a while, Petr K helped me out on Twitter: the crucial information is in the TargetFrameworkAttribute and its FrameworkName property, which is something like “.NETPortable,Version=v4.0,Profile=Profile2”. That can then be used with the FrameworkName class to parse out the profile name without writing any string manipulation code.
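As a minimal sketch of that in C# – assuming you’re happy to load the assembly in order to inspect it, and with names of my own choosing:

using System;
using System.Linq;
using System.Reflection;
using System.Runtime.Versioning;

class ProfileDetector
{
    static void Main(string[] args)
    {
        // Load the assembly we've been given and look for [assembly: TargetFramework(...)]
        Assembly assembly = Assembly.LoadFrom(args[0]);
        var attribute = assembly.GetCustomAttributes(typeof(TargetFrameworkAttribute), false)
                                .Cast<TargetFrameworkAttribute>()
                                .FirstOrDefault();
        if (attribute == null)
        {
            Console.WriteLine("No TargetFrameworkAttribute found");
            return;
        }
        // FrameworkName parses ".NETPortable,Version=v4.0,Profile=Profile2" for us
        var framework = new FrameworkName(attribute.FrameworkName);
        Console.WriteLine("Identifier: {0}", framework.Identifier); // .NETPortable
        Console.WriteLine("Version: {0}", framework.Version);       // 4.0
        Console.WriteLine("Profile: {0}", framework.Profile);       // Profile2
    }
}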

Then you just need to find the details of the profile…

Where are all the profiles?

Windows stores the profiles along with the reference assemblies, in the .NETPortable framework directory – on my machine for example, that’s “C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETPortable”. There are then subdirectories v4.0, v4.5 and v4.6 – although it’s not 100% clear what those refer to. Within each of those, there’s a Profile directory, which contains a bunch of Profile[n] directories. Then under each Profile[n] directory, a SupportedFrameworks directory contains XML documents describing the supported frameworks for that profile. Phew! So as an example:

C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETPortable\v4.0\Profile\Profile2

contains the profile information for Profile2, including files:

  • C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETPortable\v4.0\Profile\Profile2\SupportedFrameworks\.NET Framework 4.xml
  • C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETPortable\v4.0\Profile\Profile2\SupportedFrameworks\Silverlight.xml
  • C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETPortable\v4.0\Profile\Profile2\SupportedFrameworks\Windows 8.xml
  • C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETPortable\v4.0\Profile\Profile2\SupportedFrameworks\Windows Phone Silverlight 7.xml

Each of those XML files consists of a single Framework element, with a variety of attributes. Not all elements have the same set of attributes, although everything has an Identifier, Family, Profile, MinimumVersion and DisplayName as far as I’ve seen. Here’s an example – the “Windows Phone Silverlight 7.xml” file listed above:


<Framework
    Identifier="Silverlight"
    Profile="WindowsPhone*"
    MinimumVersion="4.0"
    Family="WindowsPhone"
    MinimumVisualStudioVersion="10.0"
    MaximumVisualStudioVersion="11.0"
    DisplayName="Windows Phone Silverlight"
    MinimumVersionDisplayName="7"
    PlatformArchitectures="AnyCPU" />

As noted by David Kean in comments, the “Identifier”, “Profile” and “MinimumVersion” are mapped to the corresponding properties in MSBuild; the combination of the three is unique for a target, but you need to be careful to look at all the relevant information… don’t just look at the Identifier attribute. Additionally, if there’s a MinimumVersionDisplayName, that’s probably the version you’ll be more familiar with – so in the example above, the displayed minimum version number is 7, which is what people think about for Windows Phone versions, even though the real minimum version number is 4.0 as far as the build is concerned.

Read David’s comment for more details – I don’t want to paraphrase it too much, in case I accidentally write something misleading.
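Putting the directory structure above to use, here’s a rough sketch in C# of listing every profile and the frameworks it supports – assuming the default install location from earlier, and with minimal error handling:

using System;
using System.IO;
using System.Xml.Linq;

class ProfileLister
{
    static void Main()
    {
        string root = @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETPortable";
        foreach (var versionDir in Directory.GetDirectories(root)) // v4.0, v4.5, v4.6
        {
            string profileRoot = Path.Combine(versionDir, "Profile");
            if (!Directory.Exists(profileRoot))
            {
                continue;
            }
            foreach (var profileDir in Directory.GetDirectories(profileRoot))
            {
                Console.WriteLine("{0} {1}", Path.GetFileName(versionDir), Path.GetFileName(profileDir));
                string supportedDir = Path.Combine(profileDir, "SupportedFrameworks");
                if (!Directory.Exists(supportedDir))
                {
                    continue;
                }
                foreach (var file in Directory.GetFiles(supportedDir, "*.xml"))
                {
                    // Each file contains a single Framework element
                    var framework = XDocument.Load(file).Root;
                    Console.WriteLine("  {0} {1}",
                        (string) framework.Attribute("DisplayName"),
                        (string) framework.Attribute("MinimumVersionDisplayName")
                            ?? (string) framework.Attribute("MinimumVersion"));
                }
            }
        }
    }
}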

What about the Xamarin frameworks?

As you can see above, Profile2 doesn’t include anything that looks like it’s a Xamarin framework. (Compare that with Profile136 for example, which includes Xamarin.iOS.xml and Xamarin.Android.xml.) But haven’t I previously said that Noda Time 1.2 works seamlessly in Xamarin.iOS and Xamarin.Android? Indeed I have. This is all very confusing, but as far as I’m aware, the Xamarin.* frameworks are only required if you want to build within Xamarin Studio and you’re not on a Windows machine. I could be wrong about that, but certainly there are some profiles which will work with some versions of Xamarin without a Xamarin-specific framework being on the list.

I would welcome more clarity on this, and if I learn any more about it (either in comments here or via other channels) I’ll update this post.

Where should the files go in NuGet packages?

The next issue to consider is that of NuGet packages. A NuGet package can contain multiple binaries for multiple target profiles – for example, the NodaTime NuGet package includes both the desktop and PCL versions. NuGet adds a reference to the right one when you install a package. How can it tell which binary is which? Through a directory structure. Here’s a piece of Noda Time’s nuspec package:

  <file src="bin\Signed Release\NodaTime.dll" target="lib\net35-Client" />
  <file src="bin\Signed Release\NodaTime.pdb" target="lib\net35-Client" />
  <file src="bin\Signed Release\NodaTime.xml" target="lib\net35-Client" />
  <file src="bin\Signed Release Portable\NodaTime.dll" target="lib\portable-win+net40+sl4+wp7" />
  <file src="bin\Signed Release Portable\NodaTime.pdb" target="lib\portable-win+net40+sl4+wp7" />
  <file src="bin\Signed Release Portable\NodaTime.xml" target="lib\portable-win+net40+sl4+wp7" />
  <file src="**\*.cs" target="src" />

As you can see, the PCL files end up in lib\portable-win+net40+sl4+wp7. That works for Profile2, but it’s not the right directory for the profile I’ll end up using. So how do I work out what directory to use in the next version? I can look at the NuGet source code, or this handy table of profiles and corresponding NuGet targets. (Interestingly, that lists the directory for Profile2 as portable-net4+sl4+netcore45+wp7, so I’m not entirely sure what’s going on. I suspect the table is more reliable than wherever I got the name from when preparing a Noda Time release.)

This all sounds a bit fiddly. Shouldn’t there be a tool to make it simple?

Yes, yes there should. As it happens, Stephen Cleary has already written one, although I didn’t realise this until I’d also written one myself. Indeed, the table linked above is the output from Stephen’s tool.

My own tool is called PclPal, and the source is available on GitHub. Don’t expect polished code (or tests) – this was very much a quick and dirty hack to make life slightly easier. It could be the basis of a more polished tool – possibly even a “PCL profile explorer” or something similar. Oh, and it requires the Roslyn End User Preview.

Currently it’s just a command line tool with three options:

  • “list” – lists the profiles on your system
  • “dll <assembly>” – shows the details of the profile a particular assembly was built against
  • “show <profile>” – shows the details of a specified profile

Note that currently I’m not doing anything in terms of NuGet package directories, although that’s probably the next step, if I take this any further. (I may just use Stephen’s tool, of course…)

Conclusion

The world of PCLs is a bit of a mess, to say the least. It’s possible that there’s lots of really clear, up-to-date documentation explaining everything somewhere – but I have yet to find it. I suspect the main problem is that this is a moving target in all kinds of ways. Versioning is never easy, and between all the different variations possible here, I’m not entirely surprised it’s confusing.

Hopefully this post will help to shed a bit more light on things – please feel very free to leave corrections, as I definitely don’t want to propagate any falsehoods. I’ve definitely learned a thing or two in the last few days, and with tools such as PclPal and Stephen’s PortableLibraryProfiles, life should be that little bit easier…

Diagnosing issues with reversible data transformations

I see a lot of problems which look somewhat different at first glance, but all have the same cause:

  • Text is losing “special characters” when I transfer it from one computer to another
  • Decryption ends up with garbage
  • Compressed data can’t be decompressed
  • I can transfer text but not binary data

These are all cases of transforming and (usually) transferring data, and then performing the reverse transformation. Often there are multiple transformations involved, and they need to be carefully reversed in the appropriate order. For example (a runnable sketch of this chain follows the list):

  1. Convert text to binary using UTF-8
  2. Compress
  3. Encrypt
  4. Base64-encode
  5. Transfer (e.g. as text in XML)
  6. Base64-decode
  7. Decrypt
  8. Decompress
  9. Convert binary back to text using UTF-8
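
Here’s a minimal round-trip of that whole chain in C#, assuming AES for the encryption step and gzip for the compression step purely for illustration – the only point is that each transformation is undone in the reverse order, with every intermediate value available for logging:

using System;
using System.IO;
using System.IO.Compression;
using System.Security.Cryptography;
using System.Text;

class RoundTrip
{
    static void Main()
    {
        string original = "Some text with special characters: \u00e9\u00df\u20ac";
        using (var aes = Aes.Create()) // generates a dummy key and IV for us
        {
            // Steps 1-4: text -> UTF-8 bytes -> gzip -> AES -> base64
            byte[] utf8 = Encoding.UTF8.GetBytes(original);
            byte[] compressed = Compress(utf8);
            byte[] encrypted = Transform(compressed, aes.CreateEncryptor());
            string transferred = Convert.ToBase64String(encrypted);

            // Step 5: "transfer" (here the string just stays in memory)

            // Steps 6-9: base64 -> AES decrypt -> gunzip -> UTF-8 -> text
            byte[] received = Convert.FromBase64String(transferred);
            byte[] decrypted = Transform(received, aes.CreateDecryptor());
            byte[] decompressed = Decompress(decrypted);
            string roundTripped = Encoding.UTF8.GetString(decompressed);

            Console.WriteLine(original == roundTripped); // True
        }
    }

    static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(data, 0, data.Length);
            }
            return output.ToArray();
        }
    }

    static byte[] Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return output.ToArray();
        }
    }

    static byte[] Transform(byte[] data, ICryptoTransform transform)
    {
        using (transform)
        {
            return transform.TransformFinalBlock(data, 0, data.Length);
        }
    }
}

With everything in one program like this, comparing (say) the length of the compressed data against the length of the decrypted data at the matching point in the chain becomes trivial.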

The actual details of each question can be different, but the way I’d diagnose them is the same in each case. That’s what this post is about – partly so that I can just link to it when such questions arise. Although I’ve numbered the broad steps, it’s one of those constant iteration situations – you may well need to tweak the logging before you can usefully reduce the problem, and so on.

1. Reduce the problem as far as possible

This is just my normal advice for almost any problem, but it’s particularly relevant in this kind of question.

  • Start by assembling a complete program demonstrating nothing but the transformations. Using a single program which goes in both directions is simpler than producing two programs, one in each direction.
  • Remove pairs of transformations (e.g. encrypt/decrypt) at a time, until you’ve got the minimal set which demonstrates the problem
  • Avoid file IO if possible: hard-code short sample data which demonstrates the problem, and use in-memory streams (ByteArrayInputStream/ByteArrayOutputStream in Java; MemoryStream in .NET) for temporary results
  • If you’re performing encryption, hard-code a dummy key or generate it as part of the program.
  • Remove irrelevant 3rd party dependencies if possible (it’s simpler to reproduce an issue if I don’t need to download other libraries first)
  • Include enough logging (just console output, usually) to make it obvious where the discrepancy lies

In my experience, this is often enough to help you fix the problem for yourself – but if you don’t, you’ll be in a much better position for others to help you.

2. Make sure you’re really diagnosing the right data

It’s quite easy to get confused when comparing expected and actual (or before and after) data… particularly strings:

  • Character encoding issues can sometimes be hidden by spurious control characters being introduced invisibly
  • Fonts that can’t display all characters can make it hard to see the real data
  • Debuggers sometimes “helpfully” escape data for you
  • Variable-width fonts can make whitespace differences hard to spot

For diagnostic purposes, I find it useful to be able to log the raw UTF-16 code units which make up a string in both .NET and Java. For example, in .NET:

static void LogUtf16(string input)
{
    // Replace Console.WriteLine with your logging approach
    Console.WriteLine("Length: {0}", input.Length);
    foreach (char c in input)
    {
        Console.WriteLine("U+{0:x4}: {1}", (uint) c, c);
    }
}

Binary data has different issues, mostly in terms of displaying it in some form to start with. Our diagnosis tools are primarily textual, so you’ll need to perform some kind of conversion to a string representation in order to see it at all. If you’re really trying to diagnose this as binary data (so you’re interested in the raw bytes) do not treat it as encoded text using UTF-8 or something similar. Hex is probably the simplest representation that allows differences to be pinpointed pretty simply. Again, logging the length of the data in question is a good first step.

In both cases you might want to include a hash in your diagnostics. It doesn’t need to be a cryptographically secure hash in any way, shape or form. Any form of hash that is likely to change if the data changes is a good start. Just make sure you can trust your hashing code! (Every piece of code you write should be considered suspect – including whatever you decide to use for converting binary to hex, for example. Use trusted third parties and APIs provided with your target platform where possible. Even though adding an extra dependency for diagnostics makes it slightly harder for others to reproduce the problem, it’s better than the diagnostics themselves being suspect.)
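For example, a .NET helper along these lines (a sketch; the equivalent is easy in Java too) logs the length, a hex dump and a hash, sticking to framework-provided conversions so that the diagnostic code itself is above suspicion:

using System;
using System.Security.Cryptography;

class BinaryDiagnostics
{
    static void LogBytes(string label, byte[] data)
    {
        Console.WriteLine("{0}: {1} bytes", label, data.Length);
        // BitConverter gives "8D-8F-5F-8E" style output - easy to diff by eye
        Console.WriteLine("  Hex:  {0}", BitConverter.ToString(data));
        using (var sha256 = SHA256.Create())
        {
            // Any stable hash is fine here; it doesn't need to be secure
            Console.WriteLine("  Hash: {0}", BitConverter.ToString(sha256.ComputeHash(data)));
        }
    }

    static void Main()
    {
        LogBytes("Sample", new byte[] { 0x8D, 0x8F, 0x5F, 0x8E });
    }
}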

3. Analyze a clean, full, end-to-end log

This can be tricky when you’ve got multiple systems and platforms (which is why if you can possibly reproduce it in a single program it makes life simpler) but it’s really important to look at one log for a complete run.

Make sure you’re talking about the same data end-to-end. If you’re analyzing live traffic (which should be rare; unless the problem is very intermittent, this should all be done in test environments or a developer machine) or you have a shared test environment you need to be careful that you don’t use part of the data from one test and part of the data from another test. I know this sounds trivial, but it’s a really easy mistake to make. In particular, don’t assume that the data you’ll get from one part of the process will be the same run-to-run. In many cases it should be, but if the overall system isn’t working, then you already know that one of your expectations is invalid.

Compare “supposed to be equal” parts of the data. As per the steps in the introduction, there should be pairs of equal data, moving from the “top and bottom” of the transformation chain towards the middle. Initially, you shouldn’t care about whether you view the transformation as being correct – you’re only worried about whether the output is equal to the input. If you’ve managed to preserve all the data, the function of the transformation (encryption, compression etc) becomes relevant – but if you’re losing data, anything else is secondary. This is where the hash from the bottom of step 2 is relevant: you want to be able to determine whether the data is probably right as quickly as possible. Between “length” and “hash”, you should have at least some confidence, which will let you get to the most likely problem as quickly as possible.

4. Profit! (Conclusion…)

Once you’ve compared the results at each step, you should get an idea of which transformations are working and which aren’t. This may allow you to reduce the problem further, until you’ve just got a single transformation to diagnose. At that point, the problem becomes about encryption, or about compression, or about text encoding.

Depending on the situation you’re in, at this point you may be able to try multiple implementations or potentially multiple platforms to work out what’s wrong: for example, if you’re producing a zip file and then trying to decompress it, you might want to try using a regular decompression program to open your intermediate results, or decompress the results of compressing with a standard compression tool. Or if you’re trying to encrypt on Java and decrypt in C#, implement the other parts in each platform, so you can at least try to get a working “in-platform” solution – that may well be enough to find out which half has the problem.

To some extent all this blog post is about is reducing the problem as far as possible, with some ideas of how to do that. I haven’t tried to warn you much about the problems you can run into in any particular domain, but you should definitely read Marc Gravell’s excellent post on “How many ways can you mess up IO?” and my earlier post on understanding the meaning of your data is pretty relevant too.

As this is a “hints and tips” sort of post, I’ll happily modify it to include reader contributions from comments. With any luck it’ll be a useful resource for multiple Stack Overflow questions in the months and years to come…

Diagnosing weird problems – a Stack Overflow case study

Earlier, I came across this Stack Overflow question. I solved it, tweeted it, but then thought it would serve as a useful case study into the mental processes I go through when trying to solve a problem – whether that’s on Stack Overflow, at work, or at home.

It’s definitely worth reading the original question, but the executive summary is:

When I compute the checksum/hash of c:\Windows\System32\Calc.exe using various tools and algorithms, those tools all give the same answer for each algorithm. When I try doing the same thing in Java, I get different results. What’s going on?

Now to start with, I’d like to shower a bit of praise on the author:

  • The post came with a short but utterly complete program to demonstrate the problem
  • The comments in the program showed the expected values and the actual values
  • The code was mostly pretty clean (clean enough for an SO post anyway)

In short, it had pretty much everything I ask for in a question. Yay! Additionally, the result seemed to be strange. The chances of any one of Java’s hashing algorithms being broken seem pretty slim, but four of them? Okay, now you’ve got me interested.

Reproducing the problem

Unless I can spot the error immediately, I usually try to reproduce the problem in a Stack Overflow post with a short but complete program. In this case, the program was already provided, so it was just a matter of copy/paste/compile/run. This one had the additional tweak that it was comparing the results of Java with the results of other tools, so I had to get hold of an MD5 sum tool first. (I chose to start with MD5 for no particular reason.) I happened to pick this one, but it didn’t really seem it would make much difference. (As it happens, that was an incorrect assumption, but hey…)

I ran md5sums on c:\Windows\System32\calc.exe, and got the same result as the poster. Handy.

I then ran the Java code, and again got the same result as the poster: step complete, we have a discrepancy between at least one tool (and MD5 isn’t that hard to get right) and Java.

Looking for obvious problems

The code has four main areas:

  • Reading a file
  • Updating digests
  • Converting bytes to hex
  • Storing and displaying results

Of these, all of the first three have common and fairly simple error modes. For example:

  • Failure to use the return value from InputStream.read()
  • Failure to update the digests using only the relevant portion of the data buffer
  • Failure to cope with Java’s signed bytes

The code for storing and displaying results seemed solid enough to ignore to start with, and brief inspection suggested that the first two failure modes had been avoided. While the hex code didn’t have any obvious problems either, it made sense to check it. I simply printed the result of hard-coding the “known good” CRC-32 value:

System.out.println(toHex(new byte[] {
    (byte) 0x8D, (byte) 0x8F, (byte) 0x5F, (byte) 0x8E
  }));

That produced the right result, so I ruled out that part of the code too. Even if it had errors in some cases, we know it’s capable of producing the right string for one of the values we know we should be returning, so it can’t be getting that value.

Initial checks around the file

I’m always suspicious of stream-handling code – or rather, I know how easily it can hide subtle bugs. Even though it looked okay, I thought I’d check – so I added a simple total to the code so I could see how many bytes had been hashed. I also removed all the hash algorithms other than MD5, just for simplicity:

MessageDigest md5 = MessageDigest.getInstance("MD5");

FileInputStream fis = new FileInputStream(file);
byte data[] = new byte[size];
int len = 0;
int total = 0;
while ((len = fis.read(data)) != -1) {
    md5.update(data, 0, len);
    total += len;
}
fis.close();
System.out.println("Total bytes read: " + total);

results.put(md5.getAlgorithm(), toHex(md5.digest()));

It’s worth noting that I haven’t tried to fix up bits of the code which I know I would change if I were actually doing a code review:

  • The stream isn’t being closed in a finally block, so we’ll have a dangling resource (until GC) if an IOException is thrown
  • The initial value of len is never read, and can be removed

Neither of these matters in terms of the problem at hand, and closing the file “properly” would make the sample code more complicated. (For the sake of just a short sample program, I’d be tempted to remove it entirely.)

The result showed the same number of bytes being read as the command prompt did when I ran “dir c:\Windows\System32\Calc.exe” – so again, everything looks like it’s okay.

Desperate times call for stupid measures

Just on a whim, I decided to copy Calc.exe to a local folder (the current directory) and retry. After all, accessing a file in a system folder might have some odd notions applied to it. It’s hard to work out what, but… there’s nothing to lose just by trying a simple test. If it can rule anything out, and you’ve got no better ideas, you might as well try even the daftest idea.

I modified the source code to use the freshly-copied file, and it gave the same result. Hmm.

I then reran md5sums on the copied file… and it gave the same result as Java. In other words, running “md5sums c:\Windows\System32\Calc.exe” gave one result, but “md5sums CopyOfCalc.exe” gave a different result. At this point we’ve moved from Java looking like it’s behaving weirdly to md5sums looking suspicious.

Proving the root cause

At this point we’re sort of done – we’ve basically proved that the Java code produces the right hash for whatever data it’s given, but we’re left with the question of what’s happening on the file system. I had a hunch that it might be something to do with x86 vs x64 code (all of this was running on a 64-bit version of Windows 7) – so how do we test that assumption?

I don’t know if there’s a simple way of running an x86 version of the JVM, but I do know how to switch .NET code between x86 and x64 – you can do that for an assembly at build time. C# also makes the hashing and hex conversion simple, so I was able to knock up a very small app to show the problem:

using System;
using System.IO;
using System.Security.Cryptography;

class Test
{
    static void Main()
    {
        using (var md5 = MD5.Create())
        {
            string path = "c:/Windows/System32/Calc.exe";
            var bytes = md5.ComputeHash(File.ReadAllBytes(path));
            Console.WriteLine(BitConverter.ToString(bytes));
        }
    }
}

(For a larger file I’d have used File.OpenRead instead, but then I’d have wanted to close the stream afterwards. Somehow it wasn’t worth correcting the existing possible handle leak in the Java code, but I didn’t want to write leaky code myself. So instead I’ve got code which reads the whole file into memory unnecessarily… )

You can choose the architecture to build against (usually AnyCPU, x86 or x64 – though it’s interesting to see that “arm” is also an option under .NET 4.5, for obvious reasons) either from Visual Studio or using the “/platform” command line option. This doesn’t change the IL emitted (as far as I’m aware) but it’s used for interop with native code – and in the case of executables, it also determines the version of the CLR which is used.
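For example, from a developer command prompt, building the same source twice is enough for this experiment (the output names here are just my own choice):

csc /platform:x86 /out:TestX86.exe Test.cs
csc /platform:x64 /out:TestX64.exe Test.cs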

Building and running in x86 mode gave the same answer as the original “assumed to be good” tools; building and running in x64 mode gave the same answer as Java.

Explaining the root cause

At this point we’ve proved that the file system gives different results depending on whether you access it from a 64-bit process or a 32-bit process. The final piece of the puzzle was to find something to explain why that happens. With all the evidence about what’s happening, it was now easy to search for more information, and I found this article giving satisfactory details. Basically, there are two different copies of the system executables on a 64-bit system: x86 ones which run under the 32-bit emulator, and x64 ones. They’re actually in different directories, but when a process opens a file in Windows\System32, the copy which matches the architecture of the process is used. It’s almost as if the Windows\System32 directory is a symlink which changes target depending on the current process.

A Stack Overflow comment on my answer gave a final nugget that this is called the “File System Redirector”.

Conclusion

Debugging sessions often feel a bit like this – particularly if you’re like me, and only resort to real debugging after unit testing has failed. It’s a matter of looking under all kinds of rocks, trying anything and everything, but keeping track of everything you try. At the end of the process you should be able to explain every result you’ve seen, in an ideal world. (You may not be able to give evidence of the actual chain of events, but you should be able to construct a plausible chain of events which concurs with your theory.)

Be aware of areas which can often lead to problems, and check those first, gradually reducing the scope of “stuff you don’t understand” to a manageable level, until it disappears completely.