Benchmarking: designing an API with unusual goals

In a couple of recent posts I’ve written about a benchmarking framework and the results it produced for using for vs foreach in loops. I’m pleased with what I’ve done so far, but I don’t think I’ve gone far enough yet. In particular, while it’s good at testing multiple algorithms against a single input, it’s not good at trying several different inputs to demonstrate the complexity vs input size. I wanted to rethink the design at three levels – what the framework would be capable of, how developers would use it, and then the fine-grained level of what the API would look like in terms of types, methods etc. These may all sound quite similar on the face of it, but this project is somewhat different to a lot of other coding I’ve done, mostly because I want to lower the barrier to entry as far as humanly possible.

Before any of this is meaningful, however, I really needed an idea of the fundamental goal. Why was I writing yet another benchmarking framework anyway? While I normally cringe at mission statements because they’re so badly formulated and used, I figured this time it would be helpful.

Minibench makes it easy for developers to write and share tests to investigate and measure code performance.

The words in bold (or for the semantically inclined, the strong words) are the real meat of it. It’s quite scary that even within a single sentence there are seven key points to address. Some are quite simple, others cause grief. Now let’s look at each of the areas of design in turn.

Each element of the design should either clearly contribute to the mission statement or help in a non-functional way (e.g. make the project feasible in a reasonable timeframe, avoid legal issues etc). I’m aware that with the length of this post, it sounds like I’m engaging in "big upfront design" but I’d like to think that it’s at least informed by my recent attempt, and that the design criteria here are statements of intent rather than implementation commitments. (Aargh, buzzword bingo… please persevere!)

What can it do?

As we’ve already said, it’s got to be able to measure code performance. That’s a pretty vague definition, however, so I’m going to restrict it a bit – the design is as much about saying what isn’t included as what is.

  • Each test will take the form of a single piece of code which is executed many times by the framework. It will have an input and an expected output. (Operations with no natural output can return a constant; I’m not going to make any special allowance for them.)
  • The framework should take the tedium out of testing. In particular I don’t want to have to run it several times to get a reasonable number of iterations. I suspect it won’t be feasible to get the framework to guess appropriate inputs, but that would be lovely if possible.
  • Only wall time is measured. There are loads of different metrics which could be applied: CPU execution time, memory usage, IO usage, lock contention – all kinds of things. Wall time (i.e. actual time elapsed, as measured by a clock on the wall) is by far the simplest to understand and capture, and it’s the one most frequently cited in newsgroup and forum questions in my experience.
  • The benchmark is uninstrumented. I’m not going to start rewriting your code dynamically. Frankly this is for reasons of laziness. A really professional benchmarking system might take your IL and wrap it in a timing loop within a single method, somehow enforcing that the result of each iteration is used. I don’t believe that’s worth my time and energy, as well as quite possibly being beyond my capabilities.
  • As a result of the previous bullet, the piece of code to be run lots of times needs to be non-trivial. The reality is that it’ll end up being called as a delegate. This is pretty quick, but if you’re just testing "is adding two doubles faster or slower than adding two floats" then you’ll need to put a bit more work in (e.g. having a loop in your own code as well).
  • As well as the use case of "which of these algorithms performs the best with this input?" I want to support "how does the performance vary as a function of the input?" This should support multiple algorithms at the same time as multiple inputs.
  • The output should be flexible but easy to describe in code. For single-input tests simple text output is fine (although the exact figures to produce can be interesting); for multiple inputs against multiple tests a graph would often be ideal. If I don’t have the energy to write a graphing output I should at least support writing to CSV or TSV so that a spreadsheet or graphing tool can do the heavy lifting.
  • The output should be useful – it should make it easy to compare the performance of different algorithms and/or inputs. It’s clear from the previous post here that just including the scaled score doesn’t give an obvious meaning. Some careful wording in the output, as well as labeled columns, may be required. This is emphatically not a dig at anyone confused by the last post – any confusion was my own fault.

Okay, that doesn’t sound too unreasonable. The next area is much harder, in my view.

How does a developer use it?

Possibly the most important word in the mission statement is share. The reason I started this project at all is that I was fed up with spending ages writing timing loops for benchmarks which I’d then post on newsgroups or Stack Overflow. That means there are two (overlapping) categories of user:

  • A developer writing a test. This needs to be easy, but that’s an aspect of design that I’m reasonably familiar with. I’m not saying I’m good at it, but at least I have some prior experience.
  • A developer reading a newsgroup/forum post, and wanting to run the benchark for themselves. This distribution aspect is the hard bit – or at least the bit requiring imagination. I want the barrier to running the code to be really, really low. I suspect that there’ll be a "fear of the unknown" to start with which is hard to conquer, but if the framework becomes widely used I want the reader’s reaction to be: "Ah, there’s a MiniBench for this. I’m confident that I can download and run this code with almost no effort."

This second bullet is the one that my friend Douglas and I have been discussing over the weekend, in some ways playing a game of one-upmanship: "I can think of an idea which is even easier than yours." It’s a really fun game to play. Things we’ve thought about so far:

  • A web page which lets you upload a full program (without the framework) and spits out a URL which can be posted onto Stack Overflow etc. The user would then choose from the following formats:
    • Single .cs file containing the whole program – just compile and run. (This would also be shown on the download page.)
    • Test code only – for those whole already have the framework
    • Batch file – just run it to extract/build/run the C# code.
    • NAnt project file containing the C# code embedded in it – just run NAnt
    • MSBuild project file – ditto but with msbuild.
    • Zipped project – open the project to load the test in one file and the framework code in other (possibly separate) .cs files
    • Zipped solution – open to load two projects: the test code in one and the framework in the other
  • A web page which which lets you upload your results and browse the results of others

Nothing’s finalised here, but I like the general idea. I’ve managed (fairly easily) to write a "self-building" batch file, but I haven’t tried with NAnt/MSBuild yet. I can’t imagine it’s that hard – but then I’m not sure how much value there is either. What I do want to try to aim for is users running the tests properly, first time, without much effort. Again, looking back at the last post, I want to make it obvious to users if they’re running under a debugger, which is almost always the wrong thing to be doing. (I’m pretty sure there’s an API for this somewhere, and if there’s not I’m sure I can work out an evil way of detecting it anyway.)

The main thing is the ease of downloading and running the benchmark. I can’t see how it could be much easier than "follow link; choose format and download; run batch file" – unless the link itself was to the batch file, of course. (That would make it harder to use for people who wanted to get the source in a more normal format, of course.)

Going back to the point of view of the developer writing the test, I need to make sure it’s easy enough for me to use from home, work and on the train. That may mean a web page where I can just type in code, the input and expected output, and let it fill in the rest of the code for me. It may mean compiling a source file against a library from the command line. It may mean compiling a source file against the source code of the framework from the command line, with the framework code all in one file. It may mean building in Visual Studio. I’d like to make all of these cases as simple as possible – which is likely to make it simple for other developers as well. I’m not planning on optimising the experience when it comes to writing a benchmark on my mobile though – that might be a step too far!

What should the API look like?

When we get down to the nitty-gritty of types and methods, I think what I’ve got is a good starting point. There are still a few things to think about though:

  • We nearly have the functionality required for running a suite with different inputs already – the only problem is that we’re specifying the input (and expected output) in the constructor rather than as parameters to the RunTests method. I could change that… but then we lose the benefit of type inference when creating the suite. I haven’t resolved this to my satisfaction yet :(
  • The idea of having the suite automatically set up using attributed methods appeals, although we’d still need a Main method to create the suite and format the output. The suite creation can be simplified, but the chances of magically picking the most appropriate output are fairly slim. I suppose it could go for the "scale to best by number of iterations and show all columns" option by default… that still leaves the input and expected output, of course. I’m sure I’ll have something like this as an option, but I don’t know how far it will go.
  • The "configuration" side of it is expressed as a couple of constants at the moment. These control the minimum amount of time to run tests for before we believe we’ll be able to guess how many iterations we’ll need to get close to the target time, and the target time itself. These are currently set at 2 seconds and 30 seconds respectively – but when running tests just to check that you’ve got the right output format etc, that’s far too long. I suspect I should make a test suite have a configuration, and default to those constants but allow them to be specified on the command line as well, or explicitly in code.
  • Why do we need to set the expected output? In many cases you can be pretty confident that at least one of the test cases will be correct – so it’s probably simpler just to run each test once and check that the results are the same for all of them, and take that as the expected output. If you don’t have to specify the expected output, it becomes easier to specify a sequence of inputs to test.
  • Currently BenchmarkResult is nongeneric. This makes things simpler internally – but should a result know the input that it was derived from? Or should the ResultSuite (which is also nongeneric) know the input that has been applied to all its functions? The information will certainly need to be somewhere so that it can be output appropriately in the multiple input case.

My main points of design focus around three areas:

  • Is it easy to pick up? The more flexible it is, with lots of options, the more daunting it may seem.
  • Is it flexible enough to be useful in a variety of situations? I don’t know what users will want to benchmark – and I don’t build the right tool, it will be worthless to them.
  • Is the resulting test code easy and brief enough to include in a forum post, with a link to the full program? Will readers understand it?

As you can see, these are aimed at three slightly different people: the first time test writer, the veteran test writer, and the first time test reader. Getting the balance between the three is tricky.

What’s next?

I haven’t started rewriting the framework yet, but will probably do so soon. This time I hope to do it in a rather more test-driven way, although of course the timing-specific elements will be tricky unless I start using a programmatic clock etc. I’d really like comments around this whole process:

  • Is this worth doing?
  • Am I asking the right questions?
  • Are my answers so far headed in the right direction?
  • What else haven’t I thought of?

8 thoughts on “Benchmarking: designing an API with unusual goals”

  1. “Again, looking back at the last post, I want to make it obvious to users if they’re running under a debugger, which is almost always the wrong thing to be doing. (I’m pretty sure there’s an API for this somewhere, and if there’s not I’m sure I can work out an evil way of detecting it anyway.)”

    System.Diagnostics.Debugger.IsAttached.

    The benchmark API looks quite good. Very few APIs are designed to be “fluent” these days, and this is a good example of one that doesn’t read like sentence soup.

    Like

  2. I think there’s a lot of value in something like this, especially if it’s backed by a social site like SO that could be very helpful for students learning and improving algorithms, i.e. I can see users earning badges for every time they write an algorithm faster than the previous bench for a given problem. I would focus most on the sharing aspect and go so far as to break down snippet code with CodeDOM techniques and wrap your API around that rather than uploading .dlls (too much noise)–then the process becomes a simple matter of sharing snippets that follow a convention, through twitter etc. with “live” site snippet editing, colorizing, etc. Not an easy problem, but definitely worthwhile.

    Like

  3. I recently had a need to determine whether a debugger was attached, and I did the same thing you did, started looking for an IsDebuggerAttached property. A quick Google search reminded me that the native method for determining whether a debugger is attached if you are hosting the CLR is called IsDebuggerAttached. While I have never used that method, I have read about it before; maybe that’s what we were both thinking of.

    I haven’t downloaded the benchmarking framework yet (just don’t have the time to explore all of the things I want to explore), but I think you are heading in the right direction. I can definitely see this becoming a standard in the community if its done well. One thing I would suggest is that, while wall time is fine for the first iteration, I think its worth considering how other performance measurements can be added in the future. It would be a shame to build up the framework and then limit it to one metric when plugging in other metrics probably wouldn’t be too difficult.

    Like

  4. @Daniel: I certainly wasn’t intending people to upload dlls – just the source for the program to drive the framework. There shouldn’t be much need for fluff round it.

    @David: While you can of course download the current version of the framework (and it would be useful to hear comments on it) just remember it’s only the work so far :) And yes, adding more metrics would be nice if they can be included relatively easily and without cluttering up the API.

    Like

  5. Would you consider putting up your ongoing work onto Codeplex or Sourceforge, so that the community can follow your efforts and possibly contribute? I would definitely be interested in following it.

    Like

  6. @David: Probably code.google.com, but yes, I don’t see why not. I’m not sure about the website side of it, as that will be part of the csharpindepth.com site I suspect – but we can talk about that later :)

    If anyone has any better name suggestions than Minibench, I’d be glad to hear them…

    Like

  7. Recently I got interested in layered APIs which provide for a flattened out learning curve. The basic idea is to have a “plumbing” layer which does the nitty-gritty bit-banging details and a “porcelain” layer on top of that to handle actual end-user use-cases.

    To be really flexible and allow a smooth transition from porcelain to advanced scenarios involving the plumbing, care must be taken so that the APIs “speak the same language”.

    I’m only beginning to grasp that concept and am still not sure about the details. e.g. whether two layers are enough, how to really structure that in code, etc.

    PS: yes, “plumbing” and “porcelain” are meant in the git sense.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s