What’s in a name?

T.S. Eliot had the right idea when he wrote “The naming of cats”:

The Naming of Cats is a difficult matter,
It isn’t just one of your holiday games

When you notice a cat in profound meditation,
The reason, I tell you, is always the same:
His mind is engaged in a rapt contemplation
Of the thought, of the thought, of the thought of his name:
His ineffable effable
Effanineffable
Deep and inscrutable singular Name.

Okay, so developers may not contemplate their own names much, but I know I’ve certainly spent a significant amount of time recently trying to work out the right name for various types and methods.  It always feels like it’s just out of reach; tauntingly, tantalisingly close.

Recently I’ve been thinking a bit about what the goals might be in coming up with a good name. In particular, I seem to have been plagued with the naming problem more than usual in the last few weeks.

Operations on immutable types

A while ago I asked a question on Stack Overflow about naming a method which “adds” an item to an immutable collection. Of course, when I say “adds” I mean “returns a new collection whose contents is the old collection and the new item.” There’s a really wide range of answers (currently 38 of them) which mostly seem to fall into four categories:

  • Use Add because it’s idiomatic for .NET collections. Developers should know that the type is immutable and act accordingly.
  • Use Cons because that’s the term functional programming has used for this exact operation for decades.
  • Use a new method name (Plus being my favourite at the moment) which will be obvious to non-functional developers, but without being so familiar that it suggests mutability.
  • Use a constructor taking the old collection and the new item.

Part of the reasoning for Add being okay is that I originally posted the question purely about “an immutable collection” – e.g. a type which would have a name like ImmutableList<T>. I then revealed my true intention (which I should have done from the start) – to use this in MiniBench, where the “collection” would actually be a TestSuite. Everything in MiniBench is immutable (it’s partly an exploration in functional programming, as it seems to fit very nicely) but I don’t want to have to name every single type as Immutable[Whatever]. There’s the argument that a developer should know at least a little bit about any API they’re using, and the immutability aspect is one of the first things they should know. However, MiniBench is arguably an extreme case, because it’s designed for sharing test code with people who’ve never seen it before.

I’m pretty sure I’m going to go with Plus in the end:

  • It’s close enough to Add to be familiar
  • It’s different enough to Add to suggest that it’s not quite the same thing as adding to a normal collection
  • It sounds like it returns something – a statement which just calls Plus without using the result sounds like it’s wrong (and indeed it would be)
  • It’s meaningful to everyone
  • I have a precedent in the Joda Time API

Another option is to overload the + operator, but I’m not really sure I’m ready to do that just yet. It would certainly leave brief code, but is that really the most important thing?

Let’s look at a situation with some of the same issues…

LINQ operators

Work on MoreLINQ has progressed faster than expected, mostly because the project now has four members, and they’ve been expending quite a bit of energy on it. (I must do a proper consistency review at some point – in particular it would be nice to have the docs refer to the same concepts in the same way each time. I digress…)

Most of the discussion in the project hasn’t been about functionality – it’s been about naming. In fact, LINQ is particularly odd in this respect. If I had to guess at how the time has been spent (at least for the operators I’ve implemented) I’d go for:

  • 15% designing the behaviour
  • 20% writing the tests
  • 10% implementation
  • 5% writing the documentation (just XML docs)
  • 50% figuring out the best name

It really is that brutal – and for a lot of the operators we still haven’t got the “right” name yet, in my view. There’s generally too much we want to convey in a word or two. As an example, we’ve got an operator similar to the oft-implemented ForEach one, but which yields the input sequence back out again. Basically it takes an action, and for each element it calls the action and then yields the element. The use case is something like logging. We’ve gone through several names, such as Pipe, Tee, Via… and just this morning I asked a colleague who suggested Apply, just off the top of his head. It’s better than anything we’d previously thought of, but does it convey both the “apply an action” and “still yield the original sequence” aspects?

The old advice of “each method should only do one thing” is all very well, and it clearly helps to make naming simpler, but with situations like this one there are just naturally more concepts which you want to get across in the name.

Let’s stay on the LINQ topic, but stray a bit further from the well-trodden path…

The heart of Push LINQ: IDataProducer

I’ve probably bored most of you with Push LINQ by now, and I’m not actively developing it at the moment, but there’s still one aspect which I’m deeply uncomfortable with: the core interface. IDataProducer represents a stream of data which can be observed. Basically clients subscribe to events, and their event handlers will be called when data is “produced” and when the stream ends.

I know IDataProducer is an awful name – but so far I haven’t found anything better. IObservable? Ick. Overused and isn’t descriptive. IPushEnumerable? Sounds like the client can iterate over the data, which they can’t. The actual event names (DataProduced/EndOfData) are okay but there must be something better than IDataProducer. (Various options have been suggested in the past – none of them have been so obviously “right” as to stick in my head…)

This situation is slightly different to the previous ones, however, simply because it’s such a pivotal type. You would think that the more important the type, the more important the name would be – but in some ways the reverse is true. You see, Push LINQ isn’t a terribly “obvious” framework. I say that without shame – it’s great at what it does, but it takes a few mental leaps before you really grok it. You’re really going to have to read some documentation or examples before you write your own queries.

Given that constraint, it doesn’t matter too much what the interface is called – it’s going to be explained to you before you need it. It doesn’t need to be discoverable – whereas when you’re picking method names to pop up in Intellisense, you really want the developer to be able to guess its purpose even before they hover over it and check the documentation.

I haven’t given up on IDataProducer (and I hope to be moving Push LINQ into MoreLINQ, by the way – working out a better name is one of the blockers) but it doesn’t feel like quite as much of a problem.

Read-only or not read-only?

This final example came up at work, just yesterday – after I’d started writing this post. I wanted to refactor some code to emphasize which methods only use the read-only side of an interface. This was purely for the sake of readability – I wanted to make it easier to reason about which areas of the code modified an object and which didn’t. It’s a custom collection – the details don’t matter, but for the sake of discussion let’s call it House and pretend we’re modelling the various things which might be in a house. (This is Java, hence House rather than IHouse.)

I’m explicitly not doing this for safety – I don’t mind the fact that the reference could be cast to a mutable interface. The point is just to make it self-documenting that if a method only has a parameter in the non-mutating form, it’s not going to change the contents of the house.

So, we have two interfaces, like this:

public interface NameMePlease
{
    Color getDoorColor();
    int getWindowCount();

    // This already returned a read-only collection
    Set<Furniture> getFurniture();
}

public interface House extends NameMePlease
{
    void setDoorColor(Color doorColor);
    void setWindowCount(int windows);
    void addFurniture(Furniture item);
}

Obviously the challenge is to find a name for NameMePlease. One option is to use something like ImmutableHouse or ReadOnlyHouse – but the inheritance hierarchy makes liars of both of those names. How can it be a ReadOnlyHouse if there are methods in an implementation which change it? The interface should say what you can do with the type, rather than specifying what you can’t do – unless part of the contract of the interface is that the implementation will genunely prohibit changes.

Thinking of this “positive” aspect led me to ReadableHouse, which is what I’ve gone with for the moment. It states what you can do with it – read information. Again, this is a concept which Joda Time uses.

Another option is to make it just House, and change the mutable interface to MutableHouse or something similar. In this particular situation the refactoring involved would have been enormous. Simple to automate, but causing a huge check-in for relatively little benefit. Almost all uses are actually mutating ones. The consensus within the Google Java mailing list seems to be that this would have been the preferred option, all things being equal. One interesting data point was that although Joda Time uses ReadableInstant etc, the current proposals for the new date/time API which will be included in Java 7, designed by the author of Joda Time, don’t use this convention. Presumably the author found it didn’t work quite as well as he’d hoped, although I don’t have know of any specific problems.

Conclusion

You’ll probably be unsurprised to hear that I don’t have a recipe for coming up with good names. However, in thinking about naming I’ve at least worked out a few points to think about:

  • Context is important: how discoverable does this need to be? Is accuracy more important than brevity? Do you have any example uses (e.g. through tests) which can help to see whether the code feels right or not?
  • Think of your audience. How familiar will they be with the rest of the code you’re writing? Are they likely to have a background in other areas of computer science where you could steal terminology? Can you make your name consistent with other common frameworks they’re likely to use? The reverse is true too: are you reusing a familiar name for a different concept, which could confuse readers?
  • Work out the information the name is trying to convey. For types, this includes working out how it participates in inheritance. Is it trying to advertise capabilities or restrictions?
  • Is it possible to make correct code look correct, and buggy code look wrong? This is rarely feasible, but it’s one of the main attractions of “Plus” in the benchmark case. (I believe this is one of the main selling points of true Hungarian Notation for variable naming, by the way. I’m not generally a fan, but I like this aspect.)

I may expand this list over time…

I think it’s fitting to close with a quote from Phil Karlton:

There are only two hard things in Computer Science: cache invalidation and naming things.

Almost all of us have to handle naming things. Let’s hope most of us don’t have to mess with cache invalidation as well.

Answering technical questions helpfully

I’m unsure of whether this should be a blog post or an article, so I’ll probably make it both. I’ve probably written most of it before in Stack Overflow answers, but as I couldn’t find anything when I was looking earlier (to answer a question about Stack Overflow answers – now deleted, so only available for those over 10k rep) I figured it was time to write something in a medium I had more control over.

This is not a guide to getting huge amounts of reputation on Stack Overflow. As it happens, following the guidelines here is likely to result in decent rep,  but I’m sure there are various somewhat underhand ways of gaining reputation without actually being helpful. In other words, I’m not going to explain how you might game the system, but just share my views on how to work with the system for the benefit of all involved. A lot of this isn’t specific to Stack Overflow, but some of the advice isn’t applicable to some forums.

Hypocrisy warning: I’m not going to claim I always follow everything in this list. I try, but that’s not the same thing – and quite often time pressures will compromise the quality of an answer. Oh, and don’t read anything into the ordering here. It’s how it happened to come out, that’s all.

Read the question

All too often I’ve written what I thought was a great answer… only to reread the question and find out that it wasn’t going to help the questioner at all.

Now, questions are often written pretty badly, leaving out vital information, being vague about the problem, including lots of irrelevant code but leaving out the crucial bit which probably contains the bug, etc. It’s at least worth commenting on the question to ask for more information, but just occasionally you can apply psychic debugging to answer the question anyway. If you do have to make assumptions when writing an answer, state them explicitly. That will not only reduce the possibility of miscommunication, but will also point out the areas which need further clarification.

Code is king

I can’t believe I nearly posted this article without mentioning sample code.

Answers with sample code are gold… if the code is appropriate. A few rules of thumb:

  • Make sure it compiles, assuming it’s meant to. This isn’t always possible – for example, I often post at work from a machine without .NET on it, or on the train from a laptop without Java. If you can’t get a compiler to make sure your code is valid, be extra careful in terms of human inspection.
  • Snippets are okay, but complete programs rock. If you’re not already comfortable with writing short console apps, practise it. Often you can write a complete app in about 15 lines which doesn’t just give the solution, but shows it working. Imagine the level of confidence that gives to anyone reading your answer. Get rid of anything you don’t need – brevity is really helpful. (In C# for example that usually means getting rid of [STAThread], namespace declarations and unused using directives.)
  • Take a bit of care over formatting. If possible, try to prevent the code wrapping. That’s not always realistically feasible, but it’s a nice ideal. Make sure the spacing at least looks okay – you may want to use a 2-space indent if your code involves a lot of nesting.
  • If you skip best practices, add a comment. For example, you might include // Insert error handling here to indicate that production code really should check the return value etc. This doesn’t include omitting the STAThread attribute and working without a namespace – those are just reasonable assumptions and unlikely to be copied wholesale, whereas the main body of the code may well be. If your code leaks resources, someone’s production code may do so as well…

Code without an explanation is rarely useful, however. At least provide a sentence or two to explain what’s going on.

Answer the question and highlight side-issues

Other developers don’t always do things the way we’d like them to. Questions often reflect this, basically asking how to do something which (in our view) shouldn’t be attempted in the first place. It may completely infeasible, or it may just be a really bad idea.

Occasionally, the idea is so awful – and possibly harmful to users, especially when it comes to security questions – that the best response is just to explain (carefully and politely) why this is a really bad thing to do. Usually, however, it’s better to answer the question and give details of better alternatives at the same time. Personally I prefer to give these alternatives before the answer to the question asked, as I suspect it makes it more likely that the questioner will read the advice and take it on board. Don’t forget that the more persuasive you can be, the more likely it is they’ll abandon their original plans. In other words, “Don’t do this!” isn’t nearly as useful as “Don’t do this because…”

It’s okay to guess, but be honest

This may be controversial. I’ve certainly been downvoted twice on SO for having the temerity to post an answer without being 100% sure that it’s the right one – and (worse?) for admitting as much.

Sometimes there are questions which are slightly outside your own area of expertise, but they feel an awful lot like a situation you’ve been in before. In this kind of case, you can often write an answer which may well help – and would at least suggest something for the questioner to investigate a a possible answer to their problem. Sometimes you may be way off base, which is why it’s worth explaining in your answer that you are applying a bit of educated guesswork. If another answer is posted by an expert in the topic, it may well be worth the questioner trying their solution first… but at least you’re providing an alternative if they run out of other possibilities.

Now, somewhat contradictory…

Raise the overall accuracy level

It should go without saying that a correct answer is more helpful than an incorrect one. There are plenty of entirely inaccurate answers on Stack Overflow – and on newsgroups, and basically every online community-based resource I’ve ever seen. This isn’t surprising, but the best ways to counter it are:

  • Challenge inaccurate information
  • Provide accurate information yourself

One of the key aspects of this is to provide evidence. If you make an objective statement without any sort of disclaimer about your uncertainty, that should mean you’ve got good reason to believe you’re correct. That doesn’t necessarily mean you need to provide evidence, but as soon as there’s disagreement, evidence is king. If you want to assert that your code is faster than some other code, write a benchmark (carefully!). If you want to prove that an object can be collected at a certain point in time, write a test to show it. Short but complete programs are great for this, and can stop an argument dead in its tracks.

Another source of evidence is documentation and specifications. Be aware that they’re not always accurate, but I generally believe documentation unless I have a specific reason to doubt it.

Provide links to related resources

There have been a few questions on Stack Overflow as to whether it’s appropriate to link to other resources on the web. My own opinion is that it’s absolutely appropriate and can add a lot of value to an answer. In particular, I like to link to:

  • MSDN and JavaDoc documentation, or the equivalent for other platforms. With MSDN URLs, if they end in something like http://msdn.microsoft.om/foo(VS80).aspx, take the bit in brackets out of the URL (leaving http://msdn.microsoft.om/foo.aspx in this case). That way the link will always be to the most recent version of the documentation, and it doesn’t give WMD as many problems either.
  • Language specifications, in particular those for C# and Java. I generally link to the Word document for the C# spec, which has the disadvantage that I can’t link to a specific section, and it won’t just open in the browser. On the other hand, I find it easier to navigate than the MSDN version, and I’ve seen inaccuracies in MSDN.
  • My own articles and blog posts (unsurprisingly :)
  • Wikipedia
  • Other resources which are unlikely to become unavailable

The point about the resources not becoming unavailable is important – one of the main arguments against linking is that the page might go away in the future. That’s not a particularly compelling argument for most reference material IMO, but it is relevant in other cases. Either way, it’s worth including some sort of summary of what you’re linking to – a link on its own doesn’t really invite the reader to follow it, whereas a quick description of what they’ll find there provides more incentive.

One quick tip: In Chrome I have an MSDN “search engine” set up and in Firefox a keyword bookmark. In both cases the URL is http://msdn.microsoft.com/en-us/library/%s.aspx – this makes it easier to get to the MSDN page for a particular namespace, type or member. For example, by typing “msdn system.io.fileinfo” I get straight to the FileInfo page. It doesn’t work for generics, however. At some point I’d like to make this simpler somehow…

Care about your reader: spelling, grammar and style matter

I’m lucky: I’m a native English speaker, and I have a reasonably good natural command of English. Having said that, I still take a certain amount of care when writing answers: I’ll often rewrite a sentence several times until I feel it works. I’ve noticed that answers with correct spelling and grammar are generally upvoted more than ones with essentially the same content but less careful presentation.

I’m not recommending style over substance; I’m saying that both are important. You could be putting forward the most insightful comment in the world, but if you can’t communicate it effectively to your readers, it’s not going to help anyone. Having said that…

A time-limited answer may be better than no answer at all

I answer Stack Overflow questions in whatever spare time I have: waiting for a build, on the train, taking a break from editing etc. I frequently see a question which would take a good 15 minutes or more to answer properly – but I only have 30 seconds. If the question already has answers, there’s probably no sensible contribution I can make, but if the question is completely unanswered, I sometimes add a very short answer with the most important points I can think of at the time.

Ideally, I’d then go back later and edit the answer to make it more complete – but at least I may have given the questioner something to think about or an option to try. Usually these “quickies” are relatively fruitless, but occasionally I’ll come back to find that the answer was accepted: the slight nudge in the answer was all that was required. If I’m feeling diligent at that point I’ll still complete the answer, but the point is that a brief answer is usually better than nothing.

Don’t be afraid to delete (or edit heavily) useless answers

It’s almost inevitable that if you post enough answers, one of them will be less than helpful. It may start off being a good one, but if a later answer includes all the information from your answer and more, or explains it in a better way, it’s just clutter.

Likewise if you make a wild stab in the dark about the cause of a problem, and that guess turns out to be wrong, your answer could become actively unhelpful. Usually the community will let you know this in comments or voting, but sometimes you have to recognise it on your own.

Be polite

It’s a shame that I have to include this, and it’s even more of a shame that I need to take better notice of it myself. However boneheaded a question is, there’s no need to be rude. You can express your dismay at a question, and even express dismay at someone failing to read your answer properly, without resorting to inflammatory language. In the end, no-one really wins from a flame war. Remember that there’s a real person at the other end of the network connection, and they’re probably just as frustrated with you as you are with them. If things are getting out of hand, just write a polite note explaining that you don’t think it’s productive to discuss the topic any more, and walk away. (This can be surprisingly difficult advice to heed.)

Next time I get too “involved” in a question, could someone please direct me to this point? And don’t take “But I’m right, darnit!”  as an excuse.

Don’t “answer and run”

Sometimes an answer is very, very nearly spot on – but that final 1% is all the difference between the reader understanding fully and having a dangerous misunderstanding of the topic.

Stack Overflow makes it pretty easy to see comments to your answers: monitor this carefully so you can respond and clarify where appropriate. Many web forums have an “email me if this thread is updated” option – again, this is useful. It’s really frustrating (as a questioner) to be left hanging with an answer which looks like it might solve a real headache, if only you could just get the ear of the author for a moment.

Having said all of this:

Have fun

In my experience the most useful users are the ones who are obviously passionate about helping others. Don’t do it out of some misguided sense of “duty” – no-one’s paying you to do this (I assume) and no-one can reasonably complain if you just haven’t got the time or energy to answer their question. Save yourself for a time when you can be more enthusiastic and enjoy what you’re doing. Go ahead, have a cup of coffee and watch a Youtube video or something instead. You don’t have to check whether there are any new questions!

Benchmarking: designing an API with unusual goals

In a couple of recent posts I’ve written about a benchmarking framework and the results it produced for using for vs foreach in loops. I’m pleased with what I’ve done so far, but I don’t think I’ve gone far enough yet. In particular, while it’s good at testing multiple algorithms against a single input, it’s not good at trying several different inputs to demonstrate the complexity vs input size. I wanted to rethink the design at three levels – what the framework would be capable of, how developers would use it, and then the fine-grained level of what the API would look like in terms of types, methods etc. These may all sound quite similar on the face of it, but this project is somewhat different to a lot of other coding I’ve done, mostly because I want to lower the barrier to entry as far as humanly possible.

Before any of this is meaningful, however, I really needed an idea of the fundamental goal. Why was I writing yet another benchmarking framework anyway? While I normally cringe at mission statements because they’re so badly formulated and used, I figured this time it would be helpful.

Minibench makes it easy for developers to write and share tests to investigate and measure code performance.

The words in bold (or for the semantically inclined, the strong words) are the real meat of it. It’s quite scary that even within a single sentence there are seven key points to address. Some are quite simple, others cause grief. Now let’s look at each of the areas of design in turn.

Each element of the design should either clearly contribute to the mission statement or help in a non-functional way (e.g. make the project feasible in a reasonable timeframe, avoid legal issues etc). I’m aware that with the length of this post, it sounds like I’m engaging in "big upfront design" but I’d like to think that it’s at least informed by my recent attempt, and that the design criteria here are statements of intent rather than implementation commitments. (Aargh, buzzword bingo… please persevere!)

What can it do?

As we’ve already said, it’s got to be able to measure code performance. That’s a pretty vague definition, however, so I’m going to restrict it a bit – the design is as much about saying what isn’t included as what is.

  • Each test will take the form of a single piece of code which is executed many times by the framework. It will have an input and an expected output. (Operations with no natural output can return a constant; I’m not going to make any special allowance for them.)
  • The framework should take the tedium out of testing. In particular I don’t want to have to run it several times to get a reasonable number of iterations. I suspect it won’t be feasible to get the framework to guess appropriate inputs, but that would be lovely if possible.
  • Only wall time is measured. There are loads of different metrics which could be applied: CPU execution time, memory usage, IO usage, lock contention – all kinds of things. Wall time (i.e. actual time elapsed, as measured by a clock on the wall) is by far the simplest to understand and capture, and it’s the one most frequently cited in newsgroup and forum questions in my experience.
  • The benchmark is uninstrumented. I’m not going to start rewriting your code dynamically. Frankly this is for reasons of laziness. A really professional benchmarking system might take your IL and wrap it in a timing loop within a single method, somehow enforcing that the result of each iteration is used. I don’t believe that’s worth my time and energy, as well as quite possibly being beyond my capabilities.
  • As a result of the previous bullet, the piece of code to be run lots of times needs to be non-trivial. The reality is that it’ll end up being called as a delegate. This is pretty quick, but if you’re just testing "is adding two doubles faster or slower than adding two floats" then you’ll need to put a bit more work in (e.g. having a loop in your own code as well).
  • As well as the use case of "which of these algorithms performs the best with this input?" I want to support "how does the performance vary as a function of the input?" This should support multiple algorithms at the same time as multiple inputs.
  • The output should be flexible but easy to describe in code. For single-input tests simple text output is fine (although the exact figures to produce can be interesting); for multiple inputs against multiple tests a graph would often be ideal. If I don’t have the energy to write a graphing output I should at least support writing to CSV or TSV so that a spreadsheet or graphing tool can do the heavy lifting.
  • The output should be useful – it should make it easy to compare the performance of different algorithms and/or inputs. It’s clear from the previous post here that just including the scaled score doesn’t give an obvious meaning. Some careful wording in the output, as well as labeled columns, may be required. This is emphatically not a dig at anyone confused by the last post – any confusion was my own fault.

Okay, that doesn’t sound too unreasonable. The next area is much harder, in my view.

How does a developer use it?

Possibly the most important word in the mission statement is share. The reason I started this project at all is that I was fed up with spending ages writing timing loops for benchmarks which I’d then post on newsgroups or Stack Overflow. That means there are two (overlapping) categories of user:

  • A developer writing a test. This needs to be easy, but that’s an aspect of design that I’m reasonably familiar with. I’m not saying I’m good at it, but at least I have some prior experience.
  • A developer reading a newsgroup/forum post, and wanting to run the benchark for themselves. This distribution aspect is the hard bit – or at least the bit requiring imagination. I want the barrier to running the code to be really, really low. I suspect that there’ll be a "fear of the unknown" to start with which is hard to conquer, but if the framework becomes widely used I want the reader’s reaction to be: "Ah, there’s a MiniBench for this. I’m confident that I can download and run this code with almost no effort."

This second bullet is the one that my friend Douglas and I have been discussing over the weekend, in some ways playing a game of one-upmanship: "I can think of an idea which is even easier than yours." It’s a really fun game to play. Things we’ve thought about so far:

  • A web page which lets you upload a full program (without the framework) and spits out a URL which can be posted onto Stack Overflow etc. The user would then choose from the following formats:
    • Single .cs file containing the whole program – just compile and run. (This would also be shown on the download page.)
    • Test code only – for those whole already have the framework
    • Batch file – just run it to extract/build/run the C# code.
    • NAnt project file containing the C# code embedded in it – just run NAnt
    • MSBuild project file – ditto but with msbuild.
    • Zipped project – open the project to load the test in one file and the framework code in other (possibly separate) .cs files
    • Zipped solution – open to load two projects: the test code in one and the framework in the other
  • A web page which which lets you upload your results and browse the results of others

Nothing’s finalised here, but I like the general idea. I’ve managed (fairly easily) to write a "self-building" batch file, but I haven’t tried with NAnt/MSBuild yet. I can’t imagine it’s that hard – but then I’m not sure how much value there is either. What I do want to try to aim for is users running the tests properly, first time, without much effort. Again, looking back at the last post, I want to make it obvious to users if they’re running under a debugger, which is almost always the wrong thing to be doing. (I’m pretty sure there’s an API for this somewhere, and if there’s not I’m sure I can work out an evil way of detecting it anyway.)

The main thing is the ease of downloading and running the benchmark. I can’t see how it could be much easier than "follow link; choose format and download; run batch file" – unless the link itself was to the batch file, of course. (That would make it harder to use for people who wanted to get the source in a more normal format, of course.)

Going back to the point of view of the developer writing the test, I need to make sure it’s easy enough for me to use from home, work and on the train. That may mean a web page where I can just type in code, the input and expected output, and let it fill in the rest of the code for me. It may mean compiling a source file against a library from the command line. It may mean compiling a source file against the source code of the framework from the command line, with the framework code all in one file. It may mean building in Visual Studio. I’d like to make all of these cases as simple as possible – which is likely to make it simple for other developers as well. I’m not planning on optimising the experience when it comes to writing a benchmark on my mobile though – that might be a step too far!

What should the API look like?

When we get down to the nitty-gritty of types and methods, I think what I’ve got is a good starting point. There are still a few things to think about though:

  • We nearly have the functionality required for running a suite with different inputs already – the only problem is that we’re specifying the input (and expected output) in the constructor rather than as parameters to the RunTests method. I could change that… but then we lose the benefit of type inference when creating the suite. I haven’t resolved this to my satisfaction yet :(
  • The idea of having the suite automatically set up using attributed methods appeals, although we’d still need a Main method to create the suite and format the output. The suite creation can be simplified, but the chances of magically picking the most appropriate output are fairly slim. I suppose it could go for the "scale to best by number of iterations and show all columns" option by default… that still leaves the input and expected output, of course. I’m sure I’ll have something like this as an option, but I don’t know how far it will go.
  • The "configuration" side of it is expressed as a couple of constants at the moment. These control the minimum amount of time to run tests for before we believe we’ll be able to guess how many iterations we’ll need to get close to the target time, and the target time itself. These are currently set at 2 seconds and 30 seconds respectively – but when running tests just to check that you’ve got the right output format etc, that’s far too long. I suspect I should make a test suite have a configuration, and default to those constants but allow them to be specified on the command line as well, or explicitly in code.
  • Why do we need to set the expected output? In many cases you can be pretty confident that at least one of the test cases will be correct – so it’s probably simpler just to run each test once and check that the results are the same for all of them, and take that as the expected output. If you don’t have to specify the expected output, it becomes easier to specify a sequence of inputs to test.
  • Currently BenchmarkResult is nongeneric. This makes things simpler internally – but should a result know the input that it was derived from? Or should the ResultSuite (which is also nongeneric) know the input that has been applied to all its functions? The information will certainly need to be somewhere so that it can be output appropriately in the multiple input case.

My main points of design focus around three areas:

  • Is it easy to pick up? The more flexible it is, with lots of options, the more daunting it may seem.
  • Is it flexible enough to be useful in a variety of situations? I don’t know what users will want to benchmark – and I don’t build the right tool, it will be worthless to them.
  • Is the resulting test code easy and brief enough to include in a forum post, with a link to the full program? Will readers understand it?

As you can see, these are aimed at three slightly different people: the first time test writer, the veteran test writer, and the first time test reader. Getting the balance between the three is tricky.

What’s next?

I haven’t started rewriting the framework yet, but will probably do so soon. This time I hope to do it in a rather more test-driven way, although of course the timing-specific elements will be tricky unless I start using a programmatic clock etc. I’d really like comments around this whole process:

  • Is this worth doing?
  • Am I asking the right questions?
  • Are my answers so far headed in the right direction?
  • What else haven’t I thought of?

For vs Foreach on arrays and lists

As promised earlier in the week, here are the results of benchmarking for and foreach.

For each of int and double, I created an array and a List<T>, filled it with random data (the same for the list as the array) and ran each of the following ways of summing the collection:

  • A simple for loop, testing the index against the array length / list count each time
  • A for loop with an initial variable remembering the length of the array, then comparing the index against the variable each time
  • A foreach loop against the collection with the type known at compile and JIT time
  • A foreach loop against the collection as IEnumerable<T>
  • Enumerable.Sum

I won’t show the complete code in this post, but it’s you can download it and then build it against the benchmarking framework. Here’s a taste of what it looks like – the code for a list instead of an array, and double instead of int is pretty similar:

List<int> intList = Enumerable.Range(0, Size)
                              .Select(x => rng.Next(100))
                              .ToList();
int[] intArray = intList.ToArray();

var intArraySuite = TestSuite.Create(“int[]”, intArray, intArray.Sum())
    .Add(input => { int sum = 0;
        for (int i = 0; i < input.Length; i++) sum += input[i];
        return sum;
    }, “For”)
    .Add(input => { int sum = 0; int length = input.Length;
        for (int i = 0; i < length; i++) sum += input[i];
        return sum;
    }, “ForHoistLength”)
    .Add(input => { int sum = 0;
        foreach (int d in input) sum += d;
        return sum;
    }, “ForEach”)
    .Add(IEnumerableForEach)
    .Add(Enumerable.Sum, “Enumerable.Sum”)
    .RunTests();

static int IEnumerableForEach(IEnumerable<int> input)
{
    int sum = 0;
    foreach (int d in input)
    {
        sum += d;
    }
    return sum;
}

(I don’t normally format code quite like that, and wouldn’t even use a lambda for that sort of code – but it shows everything quite compactly for the sake of blogging.)

Before I present the results, a little explanation:

  • I considered int and double entirely separately – so I’m not comparing the int[] results against the double[] results for example.
  • I considered array and list results together – so I am comparing iterating over an int[] with iterating over a List<int>.
  • The result for each test is a normalized score, where 1.0 means “the best of the int summers” or “the best of the double summers” and other scores for that type of summation show how much slower that test ran (i.e. a score of 2.0 would mean it was twice as slow – it got through half as many iterations).
  • I’m not currently writing out the number of iterations each one completes – that might be interesting to see how much faster it is to sum ints than doubles.

Happy with that? Here are the results…

——————– Doubles ——————–
============ double[] ============
For                 1.00
ForHoistLength      1.00
ForEach             1.00
IEnumerableForEach 11.47
Enumerable.Sum     11.57

============ List<double> ============
For                 1.99
ForHoistLength      1.44
ForEach             3.19
IEnumerableForEach 18.78
Enumerable.Sum     18.61

——————– Ints ——————–
============ int[] ============
For                 1.00
ForHoistLength      2.03
ForEach             1.36
IEnumerableForEach 15.22
Enumerable.Sum     15.73

============ List<int> ============
For                 2.82
ForHoistLength      3.49
ForEach             4.78
IEnumerableForEach 25.71
Enumerable.Sum     26.03

I found the results interesting to say the least. Observations:

  • When summing a double[] any of the obvious ways are good.
  • When summing an int[] there’s a slight benefit to using a for loop, but don’t try to optimise it yourself – I believe the JIT recognizes the for loop pattern and removes array bounds checking, but not when the length is hoisted. Note the lack of difference when summing doubles – I suspect that this is because the iteration part is more significant when summing ints because integer addition is blindingly fast. This is important – adding integers is about as little work as you’re liikely to do in a loop; if you’re doing any real work (even as trivial as adding two doubles together) the difference between for and foreach is negligible
  • Our IEnumerableForEach method has pretty much the same performance as Enumerable.Sum – which isn’t really surprising, as it’s basically the same code. (At some point I might include Marc Gravell’s generic operators to see how they do.)
  • Using a general IEnumerable<T> instead of the specific List<T> makes a pretty huge difference to the performance – I assume this is because the JIT inlines the List<T> code, and it doesn’t need to create an object because List<T>.Enumerator is a value type. (The enumerator will get boxed in the general version, I believe.)
  • When using a for loop over a list, hosting the length in the for loop helped in the double version, but hindered in the int version. I’ve no idea why this happens.

If anyone fancies running this on their own box and letting me know if they get very different results, that would be really interesting. Likewise let me know if you want me to add any more tests into the mix.

Programming is hard

One of the answers to my “controversial opinions” question on Stack Overflow claims that “programming is so easy a five year old can do it.

I’m sure there are some aspects of programming which a five year old can do. Other parts are apparently very hard though. Today I came to the following conclusion:

  • If your code deals with arbitrary human text, it’s probably broken. (Have you taken the Turkey test recently?)
  • If your code deals with floating point numbers, it’s probably broken.
  • If your code deals with concurrency (whether that means database transactions, threading, whatever), it’s probably broken.
  • If your code deals with dates, times and time zones, it’s probably broken. (Time zones in particular are hideous.)
  • If your code has a user interface with anything other than a fixed size and a fixed set of labels (no i18n), it’s probably broken.

You know what I like working with? Integers. They’re nice and predictable. Give me integers, and I can pretty much predict how they’ll behave. So long as they don’t overflow. Of have an architecture-and-processor-dependent size.

Maybe you think I’m too cynical. I think I’m rather bullish, actually. After all, I used the word “probably” on all of those bullet points.

The thing that amazes me is that despite all this hardness, despite us never really achieving perfection, programs seem to work well enough most of the time. It’s a bit like a bicycle – it really shouldn’t work. I mean, if you’d never seen one working, and someone told you that:

  • You can’t really balance when it’s stationary. You have to go at a reasonable speed to stay stable.
  • Turning is a lot easier if you lean towards the ground.
  • The bit of the bike which is actually in contact with the ground is always stationary, even when the bike itself is moving.
  • You stop by squeezing bits of rubber onto the wheels.

Would you not be a touch skeptical? Likewise when I see the complexity of software and our collective failure to cope with it, I’m frankly astonished that I can even write this blog post.

It’s been a hard day. I achieved a small victory over one of the bullet points in the first list today. It took hours, and I pity my colleague who’s going to code review it (I’ve made it as clear as I can, but some things are just designed to mess with your head) – but it’s done. I feel simultaneously satisfied in a useful day’s work, and depressed at the need for it.

Benchmarking made easy

While I was answering a Stack Overflow question on the performance implications of using a for loop instead of a foreach loop (or vice versa) I promised to blog about the results – particularly as I was getting different results to some other posters.

On Saturday I started writing the bigger benchmark (which I will post about in the fullness of time) and used a technique that I’d used when answering a different question: have a single timing method and pass it a delegate, expected input and expected output. You can ask a delegate for the associated method and thus find out its name (for normal methods, anyway – anonymous functions won’t give you anything useful, of course) so that’s all the information you really need to run the test.

I’ve often shied away from using delegates for benchmarking on the grounds of it interfering with the results – including the code inline with the iteration and timing obviously has a bit less overhead. However, the CLR is so fast at delegate invocation these days that it’s really not an issue for benchmarks where each iteration does any real work at all.

It’s still a pain to have to write that testing infrastructure each time, however. A very long time ago I wrote a small attribute-based framework. It worked well enough, but I’ve found myself ignoring it – I’ve barely used it despite writing many, many benchmarks (mostly for newsgroup, blog and Stack Overflow posts) over the course of the years. I’m hoping that the new framework will prove more practical.

There are a few core concepts and (as always) a few assumptions:

  • A benchmark test is a function which takes a single value and returns a single value. This is expressed generically, of course, so you can make the input and output whatever type you like. A test also has a descriptive name, although this can often be inferred as the name of the function itself. The function will be called many times, the exact number being automatically determined.
  • A test suite is a collection of benchmark tests and another descriptive name, as well as the input to supply to each test and the expected output.
  • A benchmark result has a duration and an iteration count, as well as the descriptive name of the test which was run to produce the result. Results can be scaled so that either the duration or the iteration count matches another result. Likewise a result has a score, which is simply the duration (in ticks, but it’s pretty arbitrary) divided by the iteration count. Again, the score can be retrieved in a scaled fashion, using a specified result as a “standard” with a scaled score of 1.0. Lower is always better.
  • A result suite is simply the result of running a test suite. A result suite can be scaled, which is equivalent to building a new result suite with scaled copies of each original result. The result suite contains the smarts to display the results.
  • Running a test consists of two phases. First, we guess roughly how fast the function is. We run 1 iteration, then 2, then 4, then 8 etc – until it takes at least 3 seconds. At that point we scale up the number of iterations so that the real test will last around 30 seconds. This is the one we record. The final iteration of each set is tested for correctness based on the expected output. Currently the 3 and 30 second targets are hard-coded; I could perhaps make them parameters somewhere, but I don’t want to overcomplicate things.

Now for the interesting bit (from my point of view, anyway): I decided that this would be a perfect situation to try playing with a functional style. As a result, everything in the framework is immutable. When you “add” a test to a test suite, it returns a new test suite with the extra test. Running the test suite returns the result suite; scaling the result suite returns a new result suite; scaling a result returns a new result etc.

The one downside of this (beyond a bit of inefficiency in list copying) is that C# collection initializers only work with mutable collections. They also only work with direct constructor calls, whereas generic type inference doesn’t apply to constructors. In the end, the “static generic factory method” combined with simple Add method calls yields quite nice results, even though I can’t use a collection initializer:

double[] array = …; // Code to generate random array of doubles

var results = TestSuite.Create(“Array”, array, array.Sum())
                       .Add(ArrayFor)
                       .Add(ArrayForEach)
                       .Add(Enumerable.Sum, “LINQ Enumerable.Sum”)
                       .RunTests()
                       .ScaleByBest(ScalingMode.VaryDuration);

results.Display(ResultColumns.NameAndDuration);

This is a pretty small amount of extra code to write, beyond the code we actually want to benchmark (the ArrayFor and ArrayForEach methods in particular). No looping by iteration count, no guessing at the number of iterations and rerunning it until it lasts a reasonable amount of time, etc.

My only regret is that I haven’t written this in a test-driven way. There are currently no unit tests at all. Such is the way of projects that start off as “let’s just knock something together” and end up being rather bigger than originally intended.

At some point I’ll make it all downloadable from my main C# page, in normal source form, binary form, and also a “single source file” form so you can compile your benchmark with just csc /o+ /debug- Bench*.cs to avoid checking the assembly filename each time you use it. For the moment, here’s a zip of the source code and a short sample program, should you find them useful. Obviously it’s early days – there’s a lot more that I could add. Feedback would help!

Next time (hopefully fairly soon) I’ll post the for/foreach benchmark and results.

RFID: What I really want it for

This isn’t really coding related, but it’s technology related at least. There’s been a lot of fuss made about how great or awful RFID is and will be in the future, in terms of usefuless and privacy invasion respectively. There’s one use which I haven’t seen discussed, but which seems pretty obvious to me – but with further enhancements available.

Basically, I want RFID on clothes to tell me useful stuff. Suppose each item of clothing were uniquely tagged, and you had a bunch of scanners in your home linked up to one system which stored metadata about the clothes. Suddenly the following tasks become easier:

  • Working out which wash cycle to use for a whole washing basket. Can’t see the dark sock hidden in the white wash? The RFID scanner could alert you to it. Likewise tumbledrying – no need to check each item separately, just attach the scanner over the tumbledryer door and wait for it to beep as you try to put something in which shouldn’t be tumbled.
  • Separating clothes to put them away after they’re clean and dry. Admittedly this is more of a chore for our household than others (with twin boys and an older brother, where some items of clothing could belong to any of them) but it would be really, really useful for us.
  • Remembering who you’ve borrowed which clothes from. Kids grow out of clothes very quickly; we have a number of friends who lend us clothes from their children, and likewise we lend plenty of clothes to them and others. You’ve then got to remember what you borrowed from who, which can become tricky when you’ve got a loft full of bags of clothing. Wouldn’t it be nice to just “label” all the clothes in a bag with the owner’s name when you first receive them, and then just pass a scanner near everything quickly later on?

The privacy issues of all of this would have to be worked out carefully (as the simplest solution would allow your movements to be traced just by which clothes you’re wearing) but if it does end up being possible, I’ll be hugely grateful for this.

(Next up, benchmarking of for vs foreach in various situations. In other words, back to our regular schedule :)

Designing LINQ operators

I’ve started a small project (I’ll post a link when I’ve actually got something worthwhile to show) with some extra LINQ operators in – things which I think are missing from LINQ to Objects, basically. (I hope to include many of the ideas from an earlier blog post.) That, and a few Stack Overflow questions where I’ve effectively written extra LINQ operators and compared them with other solutions, have made me think about the desirable properties of a LINQ operator – or at least the things you should think about when implementing one. My thoughts so far:

Lazy/eager execution

If you’re returning a sequence (i.e. another IEnumerable<T> or similar) the execution should almost certainly be lazy, but the parameter checking should be eager. Unfortunately with the limitations of the (otherwise wonderful) C# iterator blocks, this usually means breaking the method into two, like this:

public static IEnumerable<T> Where(this IEnumerable<T> source,
                                   Func<T, bool> predicate)
{
    // Eagerly executed
    if (source == null)
    {
        throw new ArgumentNullException(“source”);
    }
    if (predicate == null)
    {
        throw new ArgumentNullException(“predicate”);
    }
    return WhereImpl(source, predicate);
}

private static IEnumerable<T> WhereImpl(IEnumerable<T> source,
                                        Func<T, bool> predicate)
{
    // Lazily executed
    foreach (T element in source)
    {
        if (predicate(element))
        {
            yield return element;
        }
    }
}

Obviously aggregates and conversions (Max, ToList etc) are generally eager anyway, within normal LINQ to Objects. (Just about everything in Push LINQ is lazy. They say pets look like their owners…)

Streaming/buffering

One of my favourite features of LINQ to Objects (and one which doesn’t get nearly the publicity of deferred execution) is that many of the operators stream the data. In other words, they only consume data when they absolutely have to, and they yield data as soon as they can. This means you can process vast amounts of data with very little memory usage, so long as you use the right operators. Of course, not every operator can stream (reversing requires buffering, for example) but where it’s possible, it’s really handy.

Unfortunately, the streaming/buffering nature of operators isn’t well documented in MSDN – and sometimes it’s completely wrong. As I’ve noted before, the docs for Enumerable.Intersect claim that it reads the whole of both sequences (first then second) before yielding any data. In fact it reads and buffers the whole of second, then streams first, yielding intersecting elements as it goes. I strongly encourage new LINQ operators to document their streaming/buffering behaviour (accurately!). This will limit future changes in the implementation admittedly (Intersect can be implemented in a manner where both inputs are streamed, for example) but in this case I think the extra guarantees provided by the documentation make up for that restriction.

Once-only evaluation

When I said that reversing requires buffering earlier on, I was sort of lying. Here’s an implementation of Reverse which doesn’t buffer any data anywhere:

public static IEnumerable<T> StreamingReverse<T>(this IEnumerable<T> source)
{
    // Error checking omitted for brevity
    int count = source.Count();
    for (int i = count-1; i >= 0; i–)
    {
        yield return source.ElementAt(i);
    }
}

If we assume we can read the sequence as often as we like, then we never need to buffer anything – just treat it as a random-access list. I hope I don’t have to tell you that’s a really, really bad idea. Leaving aside the blatant inefficiency even for sequences like lists which are cheap to iterate over, some sequences are inherently once-only (think about reading from a network stream) and some are inherently costly to iterate over (think about lines in a big log file – or the result of an ordering).

I suspect that developers using LINQ operators assume that they’ll only read the input data once. That’s a good assumption – wherever possible, we ought to make sure that it’s correct, and if we absolutely can’t help evaluating a sequence twice (and I can’t remember any times when I’ve really wanted to do that) we should document it in large, friendly letters.

Mind your complexity

In some ways, this falls out of “try to stream, and try to only read once” – if you’re not storing any data and you’re only reading each item once, it’s quite hard to come up with an operator which isn’t just O(n) for a single sequence. It is worth thinking about though – particularly as most of the LINQ operators can work with large amounts of data. For example, to find the smallest element in a sequence you can either sort the whole sequence and take the first element of the result or you can keep track of a “current minimum” and iterate through the whole sequence. Clearly the latter saves a lot of complexity (and doesn’t require buffering) – so don’t just take the first idea that comes into your head. (Or at least, start with that and then think how you could improve it.)

Again, documenting the complexity of the operator is a good idea, and call particular attention to anything which is unintuitively expensive.

Conclusion

Okay, so there’s nothing earth-shattering here. But the more I use LINQ to answer Stack Overflow questions, and the more I invent new operators in the spirit of the existing ones, the more powerful I think it is. It’s amazing how powerful it can be, and how ridiculously simple the code (sometimes) looks afterwards. It’s not like the operator implementation is usually hard, either – it’s just a matter of thinking of the right concepts. I’m going to try to follow these principles when I implement my extra operator library, and I hope you’ll bear them in mind too, should you ever feel that LINQ to Objects doesn’t have quite the extension method you need…

Quick rant: why isn’t there an Exception(string, params object[]) constructor?

This Stack Overflow question has reminded me of something I often wish existed in common exception constructors – an overload taking a format string and values. For instance, it would be really nice to be able to write:

throw new IOException(“Expected to read {0} bytes but only {1} were available”,
                      requiredSize, bytesRead);

Of course, with no way of explicitly inheriting constructors (which I almost always want for exceptions, and almost never want for anything else) it would mean yet another overload to copy and paste from another exception, but the times when I’ve actually written it in my own exceptions it’s been hugely handy, particularly for tricky cases where you’ve got a lot of data to include in the message. (You’d also want an overload taking a nested exception first as well, adding to the baggage…)

Stack Overflow reputation and being a micro-celebrity

I’ve considered writing a bit about this before, but not done so for fear of looking like a jerk. I still think I may well end up looking like a jerk, but this is all stuff I’m interested in and I’ll enjoy writing about it, so on we go. Much of this is based on experiences at and around Stack Overflow, and it’s more likely to be interesting to you if you’re a regular there or at least know the basic premises and mechanics. Even then you may well not be particularly interested – as much as anything, this post is to try to get some thoughts out of my system so I can stop thinking about how I would blog about it. If you don’t want the introspection, but want to know how to judge my egotism, skipping to the summary is probably a good plan. If you really don’t care at all, that’s probably a healthy sign. Quit now while you’re ahead.

What is a micro-celebrity?

A couple of minutes ago, I thought I might have been original with the term “micro-celebrity” but I’m clearly not. I may well not use the term the same way other people do, however, so here’s my rough definition solely for the purposes of this post:

A micro-celebrity is someone who gains a significant level of notoriety within a relatively limited community on the internet, usually with a positive feedback loop.

Yes, it’s woolly. Not to worry.

I would consider myself to have been a micro-celebrity in five distinct communities over the course of the last 14 years:

  • The alt.books.stephen-king newsgroup
  • The mostly web-based community around Team17’s series of “Worms” games (well, the first few, on the PC only)
  • The comp.lang.java.* newsgroups
  • The microsoft.public.dotnet.languages.csharp newsgroup
  • Stack Overflow

The last has been far and away the most blatant case. This is roughly how it goes – or at least how it’s gone in each of the above cases:

  • Spend some time in the community, post quite a lot. Shouting loudly works remarkably well on the internet – if you’re among the most prolific writers in a group, you will get noticed. Admittedly it helps to try hard to post well-written and interesting thoughts.
  • After a while, a few people will refer to you in their other conversations. For instance, if someone in the Java newsgroup was talking about “objects being passed by reference”, another poster might say something like “Don’t let Jon Skeet hear you talking like that.”
  • Play along with it, just a bit. Don’t blow your own trumpet, but equally don’t discourage it. A few wry comments to show that you don’t mind often go down well.
  • Sooner or later, you will find yourself not just mentioned in another topic, but being the topic of conversation yourself. At this point, it’s no longer an inside joke that just the core members of the group “get” – you’re now communal property, and almost any regular will be part of the joke.

One interesting thing you might have noticed about the above is that it doesn’t really take very much skill. It takes a fair amount of time, and ideally you should have some reasonable thoughts and the ability to express yourself clearly, but you certainly don’t have to be a genius. Good job, really.

How much do you care?

This is obviously very personal, and I’m only speaking for myself (as ever).

It’s undeniably an ego boost. Just about every day there’s something on Stack Overflow to laugh about in reference to me. How could I not enjoy that? How could it not inflate my sense of self-worth just a little bit? I could dismiss it as being entirely silly and meaningless – which it is, ultimately – but it’s still fun and I get a kick out of it. And yes, I’m sorry to say I bore/annoy my colleagues and wife with the latest Stack Overflow news, because I’ve always been the selfish kind of person who wants to talk about what they’re up to instead of asking the other person about their interests. This is an unfortunate trait which has relatively little to do with the micro-celebrity business.

One very good thing for keeping my ego in check is that at Google, I converse with people who are smarter than me every day, whether at coffee, breakfast, lunch or just while coding. There’s no sense of anyone trying to make anyone else feel small, but it’s pretty obvious that I’m nothing special when it comes to Google. Now, I don’t want to put on too much false modesty – I know I have a reasonable amount of experience, and I happen to know two very popular platforms reasonably well (which really helps on Stack Overflow- being the world’s greatest guru on MUMPS isn’t going to get you much love), and perhaps most importantly I can communicate pretty well. All of these are good things, and I’m proud of my career and particularly C# in Depth…

… but let’s get real here. The Jon Skeet Facts page isn’t really about me. It’s about making geek jokes where the exact subject is largely irrelevant. It could very easily have been about someone else with little change in the humour. Admittedly the funniest answers (to my mind) are the ones which do have some bearing on me (particularly the one about having written a book on C# 5.0 already) – but that doesn’t mean there’s anything really serious in it. I hope it’s pretty obvious to everyone that I’m not a genius programmer. I’d like to think I’m pretty good, but I’m not off-the-charts awesome by any means. (In terms of computer science, I’m nothing special at all and I have a really limited range of languages/paradigms. I’m trying to do something about those, but it’s hard when there’s always another question to answer.)

It’s worth bearing in mind the “micro” part of micro-celebrity. I suspect that if we somehow got all the C# developers in the world together and asked them whether they’d heard of Jon Skeet, fewer than 0.1% would say yes. (That’s a complete guess, by the way. I have really no idea. The point is I’m pretty sure it’s a small number.) Compared with the sea of developers, the set of Stack Overflow regulars is a very small pond.

What I care far more about than praise and fandom is the idea of actually helping people and making a difference. A couple of days ago I had an email from someone saying that C# in Depth had helped them in an interview: they were able to write more elegant code because now they grok lambda expressions. How cool is that? Yes, I know it’s all slightly sickening in a “you do a lot of good work for charity” kind of way – but I suspect it’s what drives most Stack Overflow regulars. Which leads me on to reputation…

What does Stack Overflow reputation mean to you?

In virtually every discussion about the Stack Overflow reputation system and its caps, I try to drop in the question of “what’s the point of reputation? What does it mean to you?” It’s one of those questions which everyone needs to answer for themselves. Jeff Atwood’s answer is that reputation is how much the system trusts you. My own answers:

  • It’s a daily goal. Making sure I always get to 200 is a fun little task, and then trying to get accepted answers is a challenge.
  • It’s measurable data, and you can play with graphs and stats. Hey we’re geeks – it’s fun to play with numbers, however irrelevant they are.
  • It’s an indication of helpfulness to some extent. It plays to my ego in terms of both pride of knowledge and the fulfillment of helping people.
  • It’s useful as an indicator of community trust for the system to use, which is probably more important to Jeff than it is to me.
  • It’s a game. This is the most important aspect. I love games. I’m fiercely competitive, and will always try to work out all the corners of a game’s system – things like it being actually somewhat useless getting accepted answers before you’ve reached the 200 limit. I don’t necessarily play to the corners of the game (I would rather post a useful but unpopular answer than a popular but harmful one, for serious questions) but I enjoy working them out. I would be interested to measure my levels of testosterone when typing furiously away at an answer, hoping to craft something useful before anyone else does. I’m never going to be “macho” physically, but I can certainly be an alpha geek. So long as it doesn’t go too far, I think it’s a positive thing.

I sometimes sense (perhaps inaccurately) that Jeff and Joel are frustrated with people getting too hung up about reputation. It’s really unimportant in the grand scheme of things – rep in itself isn’t as much of a net contribution to the world’s happiness as the way that Stack Overflow connects people with questions to people with relevant answers really, really quickly. But rep is one of the things that makes Stack Overflow so “sticky” as a website. It’s not that I wouldn’t answer questions if the reputation system went down – after all, I’ve been answering questions on newsgroups for years, for the other reasons mentioned – but the reputation system certainly helps. Yes, it’s probably taking advantage of a competitive streak which is in some ways ugly… but the result is a good one.

One downside of the whole micro-celebrity thing – and in particular of being the top rep earner – is that various suggestions (such as changing the rep limit algorithm and introducing a monthly league) make me look really selfish. It’s undeniable that both of the suggestions work in my favour. I happen to believe that both work in the community’s favour too, but I can certainly see why people might get the wrong idea about my motivation. I don’t remember thinking of any suggestions which would work against my personal interests but in the interests of the community. If I do, I’m pretty sure I’ll post them with no hesitation.

Summary

Yes, I like the attention of being a micro-celebrity. It would be ridiculous to deny it, and I don’t think it says much more about me than the fact that I’m human.

Yes, I like competing for reputation, even though it’s blatantly obvious that the figure doesn’t reflect programming prowess. It’s part of the fuel for my addiction to Stack Overflow.

With this out of the way, I hope to return to more technical blog posts. If anything interesting comes up in the comments, I’ll probably edit this post rather than writing a new one.