Category Archives: CSharpDev

Migrating from Visual Studio 2010 beta 1 to beta 2 – solution file change required

Having installed Visual Studio 2010 beta 2 on my freshly-reinstalled netbook (now with Windows 7 and an SSD – yummy) I found that my solution file from Visual Studio 2010 beta 1 wasn’t recognised properly: double-clicking on the file didn’t do anything. Opening the solution file manually was absolutely fine, but slightly less convenient than being able to double-click.

After a bit of investigation, I’ve found the solution. Manually edit the solution file, and change the first few lines from this:

Microsoft Visual Studio Solution File, Format Version 11.00
# Visual Studio 10

to this:

Microsoft Visual Studio Solution File, Format Version 11.00
# Visual Studio 2010

It’s just a case of changing "10" to "2010".
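If you have several solutions to fix, the edit is easy to script. The following is only a rough sketch: SolutionHeaderFixer and FixHeader are names invented for this example, and writing the file back as UTF-8 with a byte order mark is an assumption about what Visual Studio expects, so keep a backup of the original file.

```csharp
using System;
using System.IO;
using System.Text;

static class SolutionHeaderFixer
{
    // Rewrites only the first "# Visual Studio 10" comment line;
    // every other line is returned unchanged.
    public static string[] FixHeader(string[] lines)
    {
        string[] result = (string[])lines.Clone();
        for (int i = 0; i < result.Length; i++)
        {
            if (result[i].Trim() == "# Visual Studio 10")
            {
                result[i] = "# Visual Studio 2010";
                break;
            }
        }
        return result;
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("Usage: SolutionHeaderFixer <path to .sln file>");
            return;
        }
        string path = args[0];
        // Assumption: the solution file is UTF-8 with a byte order mark.
        File.WriteAllLines(path, FixHeader(File.ReadAllLines(path)), Encoding.UTF8);
    }
}
```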

Hopefully between this and the linked SuperUser post, this should avoid others feeling the same level of bafflement :)

For vs Foreach on arrays and lists

As promised earlier in the week, here are the results of benchmarking for and foreach.

For each of int and double, I created an array and a List<T>, filled it with random data (the same for the list as the array) and ran each of the following ways of summing the collection:

  • A simple for loop, testing the index against the array length / list count each time
  • A for loop with an initial variable remembering the length of the array, then comparing the index against the variable each time
  • A foreach loop against the collection with the type known at compile and JIT time
  • A foreach loop against the collection as IEnumerable<T>
  • Enumerable.Sum

I won’t show the complete code in this post, but you can download it and then build it against the benchmarking framework. Here’s a taste of what it looks like – the code for a list instead of an array, and double instead of int is pretty similar:

List<int> intList = Enumerable.Range(0, Size)
                              .Select(x => rng.Next(100))
                              .ToList();
int[] intArray = intList.ToArray();

var intArraySuite = TestSuite.Create("int[]", intArray, intArray.Sum())
    .Add(input => { int sum = 0;
        for (int i = 0; i < input.Length; i++) sum += input[i];
        return sum;
    }, "For")
    .Add(input => { int sum = 0; int length = input.Length;
        for (int i = 0; i < length; i++) sum += input[i];
        return sum;
    }, "ForHoistLength")
    .Add(input => { int sum = 0;
        foreach (int d in input) sum += d;
        return sum;
    }, "ForEach")
    .Add(IEnumerableForEach)
    .Add(Enumerable.Sum, "Enumerable.Sum")
    .RunTests();

static int IEnumerableForEach(IEnumerable<int> input)
{
    int sum = 0;
    foreach (int d in input)
    {
        sum += d;
    }
    return sum;
}

(I don’t normally format code quite like that, and wouldn’t even use a lambda for that sort of code – but it shows everything quite compactly for the sake of blogging.)

Before I present the results, a little explanation:

  • I considered int and double entirely separately – so I’m not comparing the int[] results against the double[] results for example.
  • I considered array and list results together – so I am comparing iterating over an int[] with iterating over a List<int>.
  • The result for each test is a normalized score, where 1.0 means “the best of the int summers” or “the best of the double summers” and other scores for that type of summation show how much slower that test ran (i.e. a score of 2.0 would mean it was twice as slow – it got through half as many iterations).
  • I’m not currently writing out the number of iterations each one completes – that might be interesting to see how much faster it is to sum ints than doubles.

Happy with that? Here are the results…

-------------------- Doubles --------------------
============ double[] ============
For                 1.00
ForHoistLength      1.00
ForEach             1.00
IEnumerableForEach 11.47
Enumerable.Sum     11.57

============ List<double> ============
For                 1.99
ForHoistLength      1.44
ForEach             3.19
IEnumerableForEach 18.78
Enumerable.Sum     18.61

-------------------- Ints --------------------
============ int[] ============
For                 1.00
ForHoistLength      2.03
ForEach             1.36
IEnumerableForEach 15.22
Enumerable.Sum     15.73

============ List<int> ============
For                 2.82
ForHoistLength      3.49
ForEach             4.78
IEnumerableForEach 25.71
Enumerable.Sum     26.03

I found the results interesting to say the least. Observations:

  • When summing a double[] any of the obvious ways are good.
  • When summing an int[] there’s a slight benefit to using a for loop, but don’t try to optimise it yourself – I believe the JIT recognizes the for loop pattern and removes array bounds checking, but not when the length is hoisted. Note the lack of difference when summing doubles – I suspect that this is because the iteration part is more significant when summing ints because integer addition is blindingly fast. This is important – adding integers is about as little work as you’re likely to do in a loop; if you’re doing any real work (even as trivial as adding two doubles together) the difference between for and foreach is negligible.
  • Our IEnumerableForEach method has pretty much the same performance as Enumerable.Sum – which isn’t really surprising, as it’s basically the same code. (At some point I might include Marc Gravell’s generic operators to see how they do.)
  • Using a general IEnumerable<T> instead of the specific List<T> makes a pretty huge difference to the performance – I assume this is because the JIT inlines the List<T> code, and it doesn’t need to create an object because List<T>.Enumerator is a value type. (The enumerator will get boxed in the general version, I believe.)
  • When using a for loop over a list, hoisting the length in the for loop helped in the double version, but hindered in the int version. I’ve no idea why this happens.
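To make the point about List<T>.Enumerator a little more concrete, here is a small sketch that looks at the enumerator seen through each static type. EnumeratorKinds and its method names are invented for this example; it shows only the types involved and says nothing about the timings.

```csharp
using System;
using System.Collections.Generic;

static class EnumeratorKinds
{
    // foreach over a variable typed as List<int> binds to the public
    // GetEnumerator() method, whose return type is the nested struct.
    public static bool DirectEnumeratorIsStruct()
    {
        var list = new List<int> { 1, 2, 3 };
        List<int>.Enumerator direct = list.GetEnumerator();
        return direct.GetType().IsValueType;
    }

    // Through IEnumerable<int>, the result is typed as IEnumerator<int>.
    // That is a reference type, so a struct enumerator has to be boxed
    // to be returned this way.
    public static Type InterfaceEnumeratorType()
    {
        IEnumerable<int> sequence = new List<int> { 1, 2, 3 };
        using (IEnumerator<int> viaInterface = sequence.GetEnumerator())
        {
            return viaInterface.GetType();
        }
    }

    static void Main()
    {
        Console.WriteLine(DirectEnumeratorIsStruct());
        Console.WriteLine(InterfaceEnumeratorType());
    }
}
```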

If anyone fancies running this on their own box and letting me know if they get very different results, that would be really interesting. Likewise let me know if you want me to add any more tests into the mix.

Benchmarking made easy

While I was answering a Stack Overflow question on the performance implications of using a for loop instead of a foreach loop (or vice versa) I promised to blog about the results – particularly as I was getting different results to some other posters.

On Saturday I started writing the bigger benchmark (which I will post about in the fullness of time) and used a technique that I’d used when answering a different question: have a single timing method and pass it a delegate, expected input and expected output. You can ask a delegate for the associated method and thus find out its name (for normal methods, anyway – anonymous functions won’t give you anything useful, of course) so that’s all the information you really need to run the test.
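As a small illustration of asking a delegate for its method name, here is a sketch. DelegateNames, NameOf and SumWithFor are hypothetical names for this example and are not part of the framework.

```csharp
using System;
using System.Linq;

static class DelegateNames
{
    public static int SumWithFor(int[] input)
    {
        int sum = 0;
        for (int i = 0; i < input.Length; i++) sum += input[i];
        return sum;
    }

    // Delegate.Method exposes the MethodInfo of the target method.
    public static string NameOf<TInput, TOutput>(Func<TInput, TOutput> test)
    {
        return test.Method.Name;
    }

    static void Main()
    {
        // A method group conversion keeps the original method name.
        Console.WriteLine(NameOf<int[], int>(SumWithFor)); // SumWithFor

        // A lambda gets a compiler-generated name, which is no use as a label.
        Console.WriteLine(NameOf<int[], int>(x => x.Sum()));
    }
}
```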

I’ve often shied away from using delegates for benchmarking on the grounds of it interfering with the results – including the code inline with the iteration and timing obviously has a bit less overhead. However, the CLR is so fast at delegate invocation these days that it’s really not an issue for benchmarks where each iteration does any real work at all.

It’s still a pain to have to write that testing infrastructure each time, however. A very long time ago I wrote a small attribute-based framework. It worked well enough, but I’ve found myself ignoring it – I’ve barely used it despite writing many, many benchmarks (mostly for newsgroup, blog and Stack Overflow posts) over the course of the years. I’m hoping that the new framework will prove more practical.

There are a few core concepts and (as always) a few assumptions:

  • A benchmark test is a function which takes a single value and returns a single value. This is expressed generically, of course, so you can make the input and output whatever type you like. A test also has a descriptive name, although this can often be inferred as the name of the function itself. The function will be called many times, the exact number being automatically determined.
  • A test suite is a collection of benchmark tests and another descriptive name, as well as the input to supply to each test and the expected output.
  • A benchmark result has a duration and an iteration count, as well as the descriptive name of the test which was run to produce the result. Results can be scaled so that either the duration or the iteration count matches another result. Likewise a result has a score, which is simply the duration (in ticks, but it’s pretty arbitrary) divided by the iteration count. Again, the score can be retrieved in a scaled fashion, using a specified result as a “standard” with a scaled score of 1.0. Lower is always better.
  • A result suite is simply the result of running a test suite. A result suite can be scaled, which is equivalent to building a new result suite with scaled copies of each original result. The result suite contains the smarts to display the results.
  • Running a test consists of two phases. First, we guess roughly how fast the function is. We run 1 iteration, then 2, then 4, then 8 etc – until it takes at least 3 seconds. At that point we scale up the number of iterations so that the real test will last around 30 seconds. This is the one we record. The final iteration of each set is tested for correctness based on the expected output. Currently the 3 and 30 second targets are hard-coded; I could perhaps make them parameters somewhere, but I don’t want to overcomplicate things.
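The two-phase approach in the last bullet can be sketched roughly as follows. This is not the framework's code: BenchmarkSketch, Time and Run are names made up for the illustration, the durations are parameters rather than hard-coded, and the method assumes minimumSample is greater than zero.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class BenchmarkSketch
{
    // Runs the test the given number of times and returns the elapsed time.
    // Only the result of the final iteration is checked against the expected output.
    static TimeSpan Time<TInput, TOutput>(Func<TInput, TOutput> test, TInput input,
                                          TOutput expected, long iterations)
    {
        TOutput last = default(TOutput);
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (long i = 0; i < iterations; i++)
        {
            last = test(input);
        }
        stopwatch.Stop();
        if (!EqualityComparer<TOutput>.Default.Equals(last, expected))
        {
            throw new InvalidOperationException("Test produced the wrong result");
        }
        return stopwatch.Elapsed;
    }

    // Phase 1: double the iteration count until one run lasts at least minimumSample.
    // Phase 2: scale the count so that the recorded run should last roughly target.
    // Returns the score of the recorded run: ticks per iteration, lower is better.
    public static double Run<TInput, TOutput>(Func<TInput, TOutput> test, TInput input,
        TOutput expected, TimeSpan minimumSample, TimeSpan target)
    {
        long iterations = 1;
        TimeSpan sample = Time(test, input, expected, iterations);
        while (sample < minimumSample)
        {
            iterations *= 2;
            sample = Time(test, input, expected, iterations);
        }
        long scaled = (long)(iterations * (target.TotalSeconds / sample.TotalSeconds));
        scaled = Math.Max(scaled, 1);
        TimeSpan duration = Time(test, input, expected, scaled);
        return (double)duration.Ticks / scaled;
    }

    static void Main()
    {
        int[] data = Enumerable.Range(0, 1000).ToArray();
        // Shortened targets so the sketch finishes quickly; the post uses 3 and 30 seconds.
        double score = Run(input => input.Sum(), data, data.Sum(),
                           TimeSpan.FromMilliseconds(100), TimeSpan.FromSeconds(1));
        Console.WriteLine("Ticks per iteration: " + score);
    }
}
```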

Now for the interesting bit (from my point of view, anyway): I decided that this would be a perfect situation to try playing with a functional style. As a result, everything in the framework is immutable. When you “add” a test to a test suite, it returns a new test suite with the extra test. Running the test suite returns the result suite; scaling the result suite returns a new result suite; scaling a result returns a new result etc.
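For anyone unfamiliar with the style, this is roughly what an immutable "add" looks like. ImmutableSuite is a hypothetical, cut-down stand-in that only stores test names; the real test suite also holds delegates, the input and the expected output.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, cut-down suite: just names, no delegates or results.
sealed class ImmutableSuite
{
    private readonly List<string> testNames;

    public ImmutableSuite() : this(new List<string>()) { }

    private ImmutableSuite(List<string> testNames)
    {
        this.testNames = testNames;
    }

    public int Count { get { return testNames.Count; } }

    // Copies the existing list, so the suite this is called on is never modified.
    public ImmutableSuite Add(string testName)
    {
        List<string> copy = new List<string>(testNames);
        copy.Add(testName);
        return new ImmutableSuite(copy);
    }
}

static class ImmutableSuiteDemo
{
    static void Main()
    {
        ImmutableSuite empty = new ImmutableSuite();
        ImmutableSuite two = empty.Add("For").Add("ForEach");
        Console.WriteLine(empty.Count); // 0
        Console.WriteLine(two.Count);   // 2
    }
}
```

The list copy on every Add is the inefficiency mentioned below; for a handful of tests it hardly matters.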

The one downside of this (beyond a bit of inefficiency in list copying) is that C# collection initializers only work with mutable collections. They also only work with direct constructor calls, whereas generic type inference doesn’t apply to constructors. In the end, the “static generic factory method” combined with simple Add method calls yields quite nice results, even though I can’t use a collection initializer:

double[] array = …; // Code to generate random array of doubles

var results = TestSuite.Create("Array", array, array.Sum())
                       .Add(ArrayFor)
                       .Add(ArrayForEach)
                       .Add(Enumerable.Sum, "LINQ Enumerable.Sum")
                       .RunTests()
                       .ScaleByBest(ScalingMode.VaryDuration);

results.Display(ResultColumns.NameAndDuration);

This is a pretty small amount of extra code to write, beyond the code we actually want to benchmark (the ArrayFor and ArrayForEach methods in particular). No looping by iteration count, no guessing at the number of iterations and rerunning it until it lasts a reasonable amount of time, etc.

My only regret is that I haven’t written this in a test-driven way. There are currently no unit tests at all. Such is the way of projects that start off as “let’s just knock something together” and end up being rather bigger than originally intended.

At some point I’ll make it all downloadable from my main C# page, in normal source form, binary form, and also a “single source file” form so you can compile your benchmark with just csc /o+ /debug- Bench*.cs to avoid checking the assembly filename each time you use it. For the moment, here’s a zip of the source code and a short sample program, should you find them useful. Obviously it’s early days – there’s a lot more that I could add. Feedback would help!

Next time (hopefully fairly soon) I’ll post the for/foreach benchmark and results.

Designing LINQ operators

I’ve started a small project (I’ll post a link when I’ve actually got something worthwhile to show) with some extra LINQ operators in – things which I think are missing from LINQ to Objects, basically. (I hope to include many of the ideas from an earlier blog post.) That, and a few Stack Overflow questions where I’ve effectively written extra LINQ operators and compared them with other solutions, have made me think about the desirable properties of a LINQ operator – or at least the things you should think about when implementing one. My thoughts so far:

Lazy/eager execution

If you’re returning a sequence (i.e. another IEnumerable<T> or similar) the execution should almost certainly be lazy, but the parameter checking should be eager. Unfortunately with the limitations of the (otherwise wonderful) C# iterator blocks, this usually means breaking the method into two, like this:

public static IEnumerable<T> Where<T>(this IEnumerable<T> source,
                                      Func<T, bool> predicate)
{
    // Eagerly executed
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return WhereImpl(source, predicate);
}

private static IEnumerable<T> WhereImpl<T>(IEnumerable<T> source,
                                           Func<T, bool> predicate)
{
    // Lazily executed
    foreach (T element in source)
    {
        if (predicate(element))
        {
            yield return element;
        }
    }
}

Obviously aggregates and conversions (Max, ToList etc) are generally eager anyway, within normal LINQ to Objects. (Just about everything in Push LINQ is lazy. They say pets look like their owners…)
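To see why the split matters, here is a sketch of what happens when the checks live inside the iterator block itself. LazyCheckedWhere is a deliberately flawed, hypothetical method written for this comparison; the built-in Enumerable.Where is used for contrast.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ValidationTiming
{
    // Single iterator block: nothing in the body runs, including the
    // null checks, until the caller starts iterating.
    public static IEnumerable<T> LazyCheckedWhere<T>(IEnumerable<T> source,
                                                     Func<T, bool> predicate)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (predicate == null) throw new ArgumentNullException("predicate");
        foreach (T element in source)
        {
            if (predicate(element)) yield return element;
        }
    }

    static void Main()
    {
        // No exception here, even though source is null.
        IEnumerable<int> query = LazyCheckedWhere<int>(null, x => x > 0);
        try
        {
            foreach (int value in query) { }
        }
        catch (ArgumentNullException e)
        {
            // The failure only surfaces on the first MoveNext() call.
            Console.WriteLine("Thrown during iteration: " + e.ParamName);
        }

        try
        {
            // The built-in operator validates its arguments at the call itself.
            Enumerable.Where<int>(null, x => x > 0);
        }
        catch (ArgumentNullException e)
        {
            Console.WriteLine("Thrown at the call: " + e.ParamName);
        }
    }
}
```

With the lazy check, the exception can end up a long way from the code that passed the bad argument, which makes it much harder to track down.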

Streaming/buffering

One of my favourite features of LINQ to Objects (and one which doesn’t get nearly the publicity of deferred execution) is that many of the operators stream the data. In other words, they only consume data when they absolutely have to, and they yield data as soon as they can. This means you can process vast amounts of data with very little memory usage, so long as you use the right operators. Of course, not every operator can stream (reversing requires buffering, for example) but where it’s possible, it’s really handy.

Unfortunately, the streaming/buffering nature of operators isn’t well documented in MSDN – and sometimes it’s completely wrong. As I’ve noted before, the docs for Enumerable.Intersect claim that it reads the whole of both sequences (first then second) before yielding any data. In fact it reads and buffers the whole of second, then streams first, yielding intersecting elements as it goes. I strongly encourage new LINQ operators to document their streaming/buffering behaviour (accurately!). This will limit future changes in the implementation admittedly (Intersect can be implemented in a manner where both inputs are streamed, for example) but in this case I think the extra guarantees provided by the documentation make up for that restriction.
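As an example of the kind of documentation I mean, here is a sketch of an intersection operator which states its behaviour up front. It is not the framework's implementation: BufferSecondIntersect is a name invented for this example, and it simply follows the "buffer second, stream first" behaviour described above.

```csharp
using System;
using System.Collections.Generic;

static class IntersectSketch
{
    /// <summary>
    /// Returns the distinct elements of first which also appear in second.
    /// Execution is deferred. On the first MoveNext() call the whole of
    /// second is read and buffered in a set; first is then streamed, and
    /// each matching element is yielded as soon as it is seen.
    /// Each input is read at most once.
    /// </summary>
    public static IEnumerable<T> BufferSecondIntersect<T>(IEnumerable<T> first,
                                                          IEnumerable<T> second)
    {
        if (first == null) throw new ArgumentNullException("first");
        if (second == null) throw new ArgumentNullException("second");
        return BufferSecondIntersectImpl(first, second);
    }

    private static IEnumerable<T> BufferSecondIntersectImpl<T>(IEnumerable<T> first,
                                                               IEnumerable<T> second)
    {
        HashSet<T> candidates = new HashSet<T>(second);
        foreach (T item in first)
        {
            // Remove returns true only the first time, so results are distinct.
            if (candidates.Remove(item))
            {
                yield return item;
            }
        }
    }

    static void Main()
    {
        int[] first = { 1, 2, 2, 3, 4 };
        int[] second = { 4, 2, 5 };
        Console.WriteLine(string.Join(", ", BufferSecondIntersect(first, second))); // 2, 4
    }
}
```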

Once-only evaluation

When I said that reversing requires buffering earlier on, I was sort of lying. Here’s an implementation of Reverse which doesn’t buffer any data anywhere:

public static IEnumerable<T> StreamingReverse<T>(this IEnumerable<T> source)
{
    // Error checking omitted for brevity
    int count = source.Count();
    for (int i = count-1; i >= 0; i--)
    {
        yield return source.ElementAt(i);
    }
}

If we assume we can read the sequence as often as we like, then we never need to buffer anything – just treat it as a random-access list. I hope I don’t have to tell you that’s a really, really bad idea. Leaving aside the blatant inefficiency even for sequences like lists which are cheap to iterate over, some sequences are inherently once-only (think about reading from a network stream) and some are inherently costly to iterate over (think about lines in a big log file – or the result of an ordering).

I suspect that developers using LINQ operators assume that they’ll only read the input data once. That’s a good assumption – wherever possible, we ought to make sure that it’s correct, and if we absolutely can’t help evaluating a sequence twice (and I can’t remember any times when I’ve really wanted to do that) we should document it in large, friendly letters.

Mind your complexity

In some ways, this falls out of “try to stream, and try to only read once” – if you’re not storing any data and you’re only reading each item once, it’s quite hard to come up with an operator which isn’t just O(n) for a single sequence. It is worth thinking about though – particularly as most of the LINQ operators can work with large amounts of data. For example, to find the smallest element in a sequence you can either sort the whole sequence and take the first element of the result or you can keep track of a “current minimum” and iterate through the whole sequence. Clearly the latter saves a lot of complexity (and doesn’t require buffering) – so don’t just take the first idea that comes into your head. (Or at least, start with that and then think how you could improve it.)

Again, documenting the complexity of the operator is a good idea, and call particular attention to anything which is unintuitively expensive.
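Here is a sketch of the "current minimum" approach with its complexity written down. MinSketch is a hypothetical name, and the IComparable<T> constraint is a simplification chosen for the example rather than how the built-in Min is declared.

```csharp
using System;
using System.Collections.Generic;

static class MinimumSketch
{
    /// <summary>
    /// Returns the smallest element of the sequence.
    /// Executes immediately, reads the source exactly once, and buffers
    /// nothing: O(n) time and O(1) extra space.
    /// </summary>
    public static T MinSketch<T>(IEnumerable<T> source) where T : IComparable<T>
    {
        if (source == null) throw new ArgumentNullException("source");
        using (IEnumerator<T> iterator = source.GetEnumerator())
        {
            if (!iterator.MoveNext())
            {
                throw new InvalidOperationException("Sequence was empty");
            }
            T min = iterator.Current;
            while (iterator.MoveNext())
            {
                T candidate = iterator.Current;
                if (candidate.CompareTo(min) < 0)
                {
                    min = candidate;
                }
            }
            return min;
        }
    }

    static void Main()
    {
        Console.WriteLine(MinSketch(new[] { 5, 3, 8 })); // 3
    }
}
```

Sorting first and taking the first element would typically cost O(n log n) time and need the whole sequence in memory.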

Conclusion

Okay, so there’s nothing earth-shattering here. But the more I use LINQ to answer Stack Overflow questions, and the more I invent new operators in the spirit of the existing ones, the more powerful I think it is. It’s amazing how powerful it can be, and how ridiculously simple the code (sometimes) looks afterwards. It’s not like the operator implementation is usually hard, either – it’s just a matter of thinking of the right concepts. I’m going to try to follow these principles when I implement my extra operator library, and I hope you’ll bear them in mind too, should you ever feel that LINQ to Objects doesn’t have quite the extension method you need…

You don’t have to use query expressions to use LINQ

LINQ is clearly gaining a fair amount of traction, given the number of posts I see about it on Stack Overflow. However, I’ve noticed an interesting piece of coding style: a lot of developers are using query expressions for every bit of LINQ they write, however trivial.

Now, don’t get the wrong idea – I love query expressions as a helpful piece of syntactic sugar. For instance, I’d always pick the query expression form over the “dot notation” form for something like this:

var query = from file in Directory.GetFiles(logDirectory, "*.log")
            from line in new LineReader(file)
            let entry = new LogEntry(line)
            where entry.Severity == Severity.Error
            select file + ": " + entry.Message;

(Yes, it’s yet another log entry example – it’s one of my favourite demos of LINQ, and particularly Push LINQ.) The equivalent code using just the extension methods would be pretty ugly, especially given the various range variables and transparent identifiers involved.

However, look at these two queries instead:

var query = from person in people
            where person.Salary > 10000m
            select person;

var dotNotation = people.Where(person => person.Salary > 10000m);

In this case, we’re just making a single method call. Why bother with three lines of query expression? If the query becomes more complicated later, it can easily be converted into a query expression at that point. The two queries are exactly the same, even though the syntax is different.

My guess is that there’s a “black magic” fear of LINQ – many developers know how to write query expressions, but aren’t confident about what they’re converted into (or even the basics of what the translation process is like in the first place). Most of the C# 3.0 and LINQ books that I’ve read do cover query expression translation to a greater or lesser extent, but it’s rarely given much prominence.

I suspect the black magic element is reinforced by the inherent “will it work?” factor of LINQ to SQL – you get to write the query in your favourite language, but you may well not be confident in it working until you’ve tried it; there will always be plenty of little gotchas which can’t be picked up at compile time. With LINQ to Objects, there’s a lot more certainty (at least in my experience). However, the query expression translation shouldn’t be part of what developers are wary of. It’s clearly defined in the spec (not that I’m suggesting that all developers should learn it via the spec) and benefits from being relatively dumb and therefore easy to predict.
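To show how mechanical the translation is, here is a sketch with a where clause and a projection, written both ways. The Person class and the sample data are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for the example.
sealed class Person
{
    public string Name { get; set; }
    public decimal Salary { get; set; }
}

static class TranslationDemo
{
    public static IEnumerable<string> QueryExpression(IEnumerable<Person> people)
    {
        return from person in people
               where person.Salary > 10000m
               select person.Name;
    }

    // What the compiler turns the query expression above into:
    // one method call per clause, with the range variable as the lambda parameter.
    public static IEnumerable<string> DotNotation(IEnumerable<Person> people)
    {
        return people.Where(person => person.Salary > 10000m)
                     .Select(person => person.Name);
    }

    static void Main()
    {
        var people = new List<Person>
        {
            new Person { Name = "Ann", Salary = 12000m },
            new Person { Name = "Bob", Salary = 9000m }
        };
        Console.WriteLine(QueryExpression(people).SequenceEqual(DotNotation(people))); // True
    }
}
```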

So next time you’re writing a query expression, take a look at it afterwards – if it’s simple, try writing it without the extra syntactic sugar. It may just be sweet enough on its own.

Value types and parameterless constructors

There have been a couple of questions on StackOverflow about value types and parameterless constructors:

I learned quite a bit when answering both of these. When a further question about the default value of a type (particularly with respect to generics) came up, I thought it would be worth delving into a bit more depth. Very little of this is actually relevant most of the time, but it’s interesting nonetheless.

I won’t go over most of the details I discovered in my answer to the first question, but if you’re interested in the IL generated by the statement “x = new Guid();” then have a look there for more details.

Let’s start off with the first and most important thing I’ve learned about value types recently:

Yes, you can write a parameterless constructor for a value type in .NET

I very carefully wrote “in .NET” there – “in C#” would have been incorrect. I had always believed that the CLI spec prohibited value types from having parameterless constructors. (The C# spec used the terminology in a slightly different way – it treats all value types as having a parameterless constructor. This makes the language more consistent for the most part, but it does give rise to some interesting behaviour which we’ll see later on.)

It turns out that if you write your value type in IL, you can provide your own parameterless constructor with custom code without ilasm complaining at all. It’s possible that other languages targeting the CLI allow you to do this as well, but as I don’t know any, I’ll stick to IL. Unfortunately I don’t know IL terribly well, so I thought I’d just start off with some C# and go from there:

public struct Oddity
{
    public Oddity(int removeMe)
    {
        System.Console.WriteLine("Oddity constructor called");
    }
}

I compiled that into its own class library, and then disassembled it with ildasm /out:Oddity.il Oddity.dll. After changing the constructor to be parameterless (and removing a few comments and some compiler-generated assembly attributes) I ended up with this IL:

.assembly extern mscorlib
{
  .publickeytoken = (B7 7A 5C 56 19 34 E0 89 )  
  .ver 2:0:0:0
}
.assembly Oddity
{
  .hash algorithm 0x00008004
  .ver 0:0:0:0
}
.module Oddity.dll
.imagebase 0x00400000
.file alignment 0x00000200
.stackreserve 0x00100000
.subsystem 0x0003
.corflags 0x00000001

.class public sequential ansi sealed beforefieldinit Oddity
       extends [mscorlib]System.ValueType
{
  .pack 0
  .size 1
  .method public hidebysig specialname rtspecialname 
          instance void  .ctor() cil managed
  {
    .maxstack  8
    IL_0000:  nop
    IL_0001:  ldstr      "Oddity constructor called"
    IL_0006:  call       void [mscorlib]System.Console::WriteLine(string)
    IL_000b:  nop
    IL_000c:  ret
  }
}

I reassembled this with ilasm /dll /out:Oddity.dll Oddity.il. So far, so good. We have a value type with a custom constructor in a class library. It doesn’t do anything particularly clever – it just logs that it’s been called. That’s enough for our test program.

When does the parameterless constructor get called?

There are various things one could investigate about parameterless constructors, but I’m mostly interested in when they get called. The test application is reasonably simple, but contains lots of cases – each writes to the console what it’s about to do, then does something which might call the constructor. Without further ado:

using System;
using System.Runtime.CompilerServices;

class Test
{
    static Oddity staticField;
    Oddity instanceField;
   
    static void Main()
    {
        Report("Declaring local variable");
        Oddity localDeclarationOnly;
        // No variables within the value, so we can use it
        // without initializing anything
        Report("Boxing");
        object o = localDeclarationOnly;
        // Just make sure it’s really done it
        Report(o.ToString());
        Report("new Oddity() – set local variable");
        Oddity local = new Oddity();
        Report("Create instance of Test – contains instance variable");
        Test t = new Test();
        Report("new Oddity() – set instance field");
        t.instanceField = new Oddity();
        Report("new Oddity() – set static field");
        staticField = new Oddity();
        Report("new Oddity[10]");
        o = new Oddity[10];
        Report("Passing argument to method");
        MethodWithParameter(local);
        GenericMethod<Oddity>();
        GenericMethod2<Oddity>();
        Report("Activator.CreateInstance(typeof(Oddity))");
        Activator.CreateInstance(typeof(Oddity));
        Report("Activator.CreateInstance<Oddity>()");
        Activator.CreateInstance<Oddity>();
    }
   
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void MethodWithParameter(Oddity oddity)
    {
        // No need to do anything
    }
   
    static void GenericMethod<T>() where T : new()
    {
        Report("default(T) in generic method with new() constraint");
        T t = default(T);
        Report("new T() in generic method with new() constraint");
        t = new T();
    }
   
    static void GenericMethod2<T>() where T : struct
    {
        Report("default(T) in generic method with struct constraint");
        T t = default(T);
        Report("new T() in generic method with struct constraint");
        t = new T();
    }

    static void Report(string text)
    {
        Console.WriteLine(text);
    }
}

And here are the results:

Declaring local variable
Boxing
Oddity
new Oddity() – set local variable
Oddity constructor called
Create instance of Test – contains instance variable
new Oddity() – set instance field
Oddity constructor called
new Oddity() – set static field
Oddity constructor called
new Oddity[10]
Passing argument to method
default(T) in generic method with new() constraint
new T() in generic method with new() constraint
default(T) in generic method with struct constraint
new T() in generic method with struct constraint
Activator.CreateInstance(typeof(Oddity))
Oddity constructor called
Activator.CreateInstance<Oddity>()
Oddity constructor called

So, to split these out:

Operations which do call the constructor

  • new Oddity() – whatever we’re storing the result in. This isn’t much of a surprise. What may surprise you is that it gets called even if you compile Test.cs against the original Oddity.dll (without the custom parameterless constructor) and then just rebuild Oddity.dll.
  • Activator.CreateInstance<T>() and Activator.CreateInstance(Type). I wouldn’t be particularly surprised by this either way.

Operations which don’t call the constructor

  • Just declaring a variable, whether local, static or instance
  • Boxing
  • Creating an array – good job, as this could be a real performance killer
  • Using default(T) in a generic method - this one didn't surprise me
  • Using new T() in a generic method – this one really did surprise me. Not only is it counterintuitive, but in IL it just calls Activator.CreateInstance<T>(). What’s the difference between this and calling Activator.CreateInstance<Oddity>()? I really don’t understand.

Conclusions

Well, I’m still glad that C# doesn’t let us define our own parameterless constructors for value types, given the behaviour. The main reason for using it – as far as I’ve seen – is to make sure that the “default value” for a type is sensible. Given that it’s possible to get a usable value of the type without the constructor being called, this wouldn’t work anyway. Writing such a constructor would be like making a value type mutable – almost always a bad idea.

However, it’s nice to know it’s possible, just on the grounds that learning new things is always a good thing. And at least next time someone asks a similar question, I’ll have somewhere to point them…

Redesigning System.Object/java.lang.Object

I’ve had quite a few discussions with a colleague about some failures of Java and .NET. The issue we keep coming back to is the root of the inheritance tree. There’s no doubt in my mind that having a tree with a single top-level class is a good thing, but it’s grown a bit too big for its boots.

Pretty much everything in this post applies to both .NET and Java, sometimes with a few small changes. Where it might be unclear, I’ll point out the changes explicitly – otherwise I’ll just use the .NET terminology.

What’s in System.Object?

Before we work out what we might be able to change, let’s look at what we’ve got. I’m only talking about instance methods. At the moment:

Life-cycle and type identity

There are three members which I believe really need to be left alone.

We need a parameterless constructor because (at least with the current system of chaining constructors to each other) we have to have some constructor, and I can’t imagine what parameter we might want to give it. I certainly find it hard to believe there’s a particular piece of state which really deserves to be a part of every object but which we’re currently missing.

I really don’t care that much about finalizers. Should the finalizer be part of Object itself, or should it just get handled automatically by the CLR if and only if it’s defined somewhere in the inheritance chain? Frankly, who cares. No doubt it makes a big difference to the implementation somewhere, but that’s not my problem. All I care about when it comes to finalizers is that when I have to write them it’s as easy as possible to do it properly, and that I don’t have to write them very often in the first place. (With SafeHandle, it should be a pretty rare occurrence in .NET, even when you’re dealing directly with unmanaged resources.)

GetType() (or getClass() in Java) is pretty important. I can’t see any particular alternative to having this within Object, unless you make it a static method somewhere else with an Object parameter. In fact, that would have the advantage of freeing up the name for use within your own classes. The functionality is sufficiently important (and really does apply to every object) that I think it’s worth keeping.

Comparison methods

Okay, time to get controversial. I don’t think every object should have to be able to compare itself with another object. Of course, most types don’t really support this anyway – we just end up with reference equality by default.

The trouble with comparisons is that everyone’s got a different idea of what makes something equal. There are some types where it really is obvious – there’s only one natural comparison. Integers spring to mind. There are other types which have multiple natural equality comparisons – floating point numbers (exact, within an absolute epsilon, and within a relative epsilon) and strings (ordinal, culture sensitive and/or case sensitive) are examples of this. Then there are composite types where you may or may not care about certain aspects – when comparing URLs, do I care about case? Do I care about fragments? For http, if the port number is explicitly specified as 80, is that different to a URL which is still http but leaves the port number implicit?

.NET represents these reasonably well already, with the IEquatable<T> interface saying “I know how to compare myself with an instance of type T, and how to produce a hashcode for myself” and IEqualityComparer<T> interface saying “I know how to compare two instances of T, and how to produce a hashcode for one instance of T.” Now suppose we didn’t have the (nongeneric!) Equals() method and GetHashCode() in System.Object. Any type which had a natural equality comparison would still let you compare it for equality by implementing IEquatable<T>.Equals – but anything else would either force you to use reference equality or an implementation of IEqualityComparer<T>.

Some of the principal consumers of equality comparisons are collections – particularly dictionaries (which is why it’s so important that the interfaces should include hashcode generation). With the current way that .NET generics work, it would be tricky to have a constraint on a constructor such that if you only specified the types, it would only work if the key type implemented IEquatable<T>, but it’s easy enough to do with static methods (on a non-generic type). Alternatively you could specify any type and an appropriate IEqualityComparer<T> to use for the keys. We’d need an IdentityComparer<T> to work just with references (and provide the equivalent functionality to Object.GetHashCode) but that’s not hard – and it would be absolutely obvious what the comparison was when you built the dictionary.
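As a rough sketch of what that might look like: the IdentityComparer<T> below follows the name used above, while DictionaryFactory and its two methods are hypothetical names of my own. It leans on RuntimeHelpers.GetHashCode, which gives the identity-based hash code even when a type overrides GetHashCode.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Sketch of a comparer which only ever considers references,
// ignoring any Equals/GetHashCode overrides on T.
public sealed class IdentityComparer<T> : IEqualityComparer<T> where T : class
{
    public bool Equals(T x, T y) => ReferenceEquals(x, y);

    // Identity-based hash code, regardless of what T overrides.
    public int GetHashCode(T obj) => RuntimeHelpers.GetHashCode(obj);
}

// Hypothetical non-generic type hosting the static factory methods.
public static class DictionaryFactory
{
    // Only compiles when the key type declares its own natural equality.
    public static Dictionary<TKey, TValue> Create<TKey, TValue>()
        where TKey : IEquatable<TKey>
    {
        return new Dictionary<TKey, TValue>(EqualityComparer<TKey>.Default);
    }

    // The caller states explicitly that keys are compared by reference.
    public static Dictionary<TKey, TValue> CreateByIdentity<TKey, TValue>()
        where TKey : class
    {
        return new Dictionary<TKey, TValue>(new IdentityComparer<TKey>());
    }
}
```

With this in place, two separately constructed strings with the same contents are a single key in a dictionary from Create<string, int>(), but two distinct keys in one from CreateByIdentity<string, int>(). Either way, the choice is visible at the point of construction.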

Monitors and threading

This is possibly my biggest gripe. The fact that every object has a monitor associated with it was a mistake in Java, and was unfortunately copied in .NET. This promotes the bad practice of locking on “this” and on types – both of which are typically publicly accessible references. I believe that unless a reference is exposed explicitly for the purpose of locking (like ICollection.SyncRoot) then you should avoid locking on any reference which other code knows about. I typically have a private read-only variable for locking purposes. If you’re following these guidelines, it makes no sense to be able to lock on absolutely any reference – it would be better to make the Monitor class instantiable, and make Wait/Pulse/PulseAll instance members. (In Java this would mean creating a new class and moving Object.wait/notify/notifyAll members to that class.)
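For reference, this is the sort of pattern I mean by a private read-only variable for locking. The Counter class and the padlock field name are just illustrative.

```csharp
using System;

// Illustrative class: the lock is an implementation detail which
// no other code can see, so no other code can contend for it.
public sealed class Counter
{
    private readonly object padlock = new object();
    private int count;

    // Increments the counter under the private lock and returns the new value.
    public int Increment()
    {
        lock (padlock)
        {
            count++;
            return count;
        }
    }
}
```

Contrast this with lock (this) or lock (typeof(Counter)), where any code holding the same reference could take the same lock and deadlock you without either side realising.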

This would lead to cleaner, more readable code in my view. I’d also do away with the “lock” statement in C#, making Monitor.Enter return a token implementing IDisposable – so “using” statements would replace locks, freeing up a keyword and giving the flexibility of having multiple overloads of Monitor.Enter. Arguably if one were redesigning all of this anyway, it would be worth looking at whether or not monitors should really be reentrant. Any time you use lock reentrancy, you’re probably not thinking hard enough about the design. Now there’s a nice overgeneralisation with which to end this section…
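Here is a minimal sketch of how an instantiable monitor with a disposable token might look, built on top of the existing static Monitor methods. SimpleMonitor and Token are hypothetical names, and this is only one possible shape for such an API; it is still reentrant, because the underlying Monitor is.

```csharp
using System;
using System.Threading;

// Hypothetical instantiable monitor: the object being locked on is
// never exposed, and Wait/Pulse/PulseAll are instance members.
public sealed class SimpleMonitor
{
    private readonly object padlock = new object();

    // Acquires the lock and returns a token which releases it on Dispose.
    public IDisposable Enter()
    {
        Monitor.Enter(padlock);
        return new Token(padlock);
    }

    // As with the static methods, the caller must already hold the lock.
    public void Wait() => Monitor.Wait(padlock);
    public void Pulse() => Monitor.Pulse(padlock);
    public void PulseAll() => Monitor.PulseAll(padlock);

    private sealed class Token : IDisposable
    {
        private readonly object padlock;
        private bool released;

        public Token(object padlock)
        {
            this.padlock = padlock;
        }

        // Releases the lock once; later calls are no-ops.
        public void Dispose()
        {
            if (released)
            {
                return;
            }
            released = true;
            Monitor.Exit(padlock);
        }
    }
}
```

Usage then becomes using (monitor.Enter()) { ... } instead of lock (padlock) { ... }. One design cost is an extra allocation per acquisition in this naive version; a struct token could avoid that, at the price of a few subtleties of its own.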

String representations

This is an interesting one. I’m genuinely on the fence here. I find ToString() (and the fact that it’s called implicitly in many circumstances) hugely useful, but it feels like it’s attempting to satisfy three different goals:

  • Giving a developer-readable representation when logging and debugging
  • Giving a user-readable representation as part of a formatted message in a UI
  • Giving a machine-readable format (although this is relatively rare for anything other than numeric types)

It’s interesting to note that Java and .NET differ as to which of these to use for numbers – Java plumps for “machine-readable” and .NET goes for “human-readable in the current thread’s culture”. Of course it’s clearer to explicitly specify the culture on both platforms.
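A small sketch of what being explicit about the culture looks like on the .NET side. NumberFormattingDemo and its method names are hypothetical; the ToString overload taking an IFormatProvider and the CultureInfo members are the standard framework ones.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class NumberFormattingDemo
{
    // User-readable: formats using the conventions of the named culture.
    public static string Format(double value, string cultureName)
    {
        return value.ToString(CultureInfo.GetCultureInfo(cultureName));
    }

    // Machine-readable: the invariant culture does not vary by thread or machine.
    public static string FormatInvariant(double value)
    {
        return value.ToString(CultureInfo.InvariantCulture);
    }
}
```

FormatInvariant(1.5) gives "1.5" everywhere, whereas Format(1.5, "fr-FR") would normally use a comma as the decimal separator. A plain 1.5.ToString() gives whichever of those the current thread's culture happens to dictate.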

The trouble is that very often, it’s not immediately clear which of these has been implemented. This leads to guidelines such as “don’t use ToString() other than for logging” on the grounds that at least if it’s implemented inappropriately, it’ll only be a log file which ends up with difficult-to-understand data.

Should this usage be explicitly stated – perhaps even codified in the name: “ToDebugString” or something similar? I will leave this for smarter minds to think about, but I think there’s enough value in the method to make it worth keeping.

MemberwiseClone

Again, I’m not sure on this one. It would perhaps be better as a static (generic!) method somewhere in a class whose name indicated “this is for sneaky runtime stuff”. After all, it constructs a new object without calling a constructor, and other funkiness. I’m less bothered by this than the other items though.
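To show roughly what I have in mind, here is a sketch which reaches the existing protected method via reflection. RuntimeTricks, ShallowClone and Sample are hypothetical names; a real implementation would presumably be provided by the runtime itself rather than going through reflection.

```csharp
using System;
using System.Reflection;

// Hypothetical home for "sneaky runtime stuff".
public static class RuntimeTricks
{
    // Object.MemberwiseClone is protected, so this sketch uses reflection to call it.
    private static readonly MethodInfo CloneMethod =
        typeof(object).GetMethod("MemberwiseClone",
            BindingFlags.Instance | BindingFlags.NonPublic);

    // Creates a shallow copy of the given object without running any constructor.
    public static T ShallowClone<T>(T original) where T : class
    {
        if (original == null)
        {
            throw new ArgumentNullException(nameof(original));
        }
        return (T)CloneMethod.Invoke(original, null);
    }
}

// Illustrative type which counts how often its constructor runs.
public sealed class Sample
{
    public static int ConstructorCalls;
    public int Value;

    public Sample(int value)
    {
        Value = value;
        ConstructorCalls++;
    }
}
```

Cloning a Sample this way produces a second object with the same Value, but ConstructorCalls stays at 1, which is exactly the kind of funkiness that arguably deserves a scarier-sounding home than Object.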

Conclusion

To summarise, in an ideal world:

  • Equals and GetHashCode would disappear from Object. Types would have to explicitly say that they could be compared.
  • Wait/Pulse/PulseAll would become instance methods in Monitor, which would be instantiated every time you want a lock.
  • ToString might be renamed to give clearer usage guidance.
  • MemberwiseClone might be moved to a different class.

Obviously it’s far too late for either Java or .NET to make these changes, but it’s always interesting to dream. Any more changes you’d like to see? Or violent disagreements with any of the above?

The Snippy Reflector add-in

Those of you who’ve read C# in Depth will know about Snippy – a little tool which makes it easy to build complete programs from small snippets of code.

I’m delighted to say that reader Jason Haley has taken the source code for Snippy and built an add-in for Reflector. This will make it much simpler to answer questions like this one about struct initialization, where you really want to see the IL generated for a snippet. Here’s a screenshot to show what it does:

This is really cool – if you want to dabble to see what the C# compiler does in particular situations, check it out. It comes as just the DLL, or a zipped version. Thanks for putting in all this work, Jason :)

Update: Jason now has his own (more detailed) blog entry too.

Copenhagen C# talk videos now up

The videos from my one day talk about C# in Copenhagen are now on the MSDN community site. There are eight sessions, varying between about 25 minutes and 50 minutes in length. I haven’t had time to watch them yet, but when I do I’ll submit brief summaries so you can quickly get to the bits you’re most interested in. (As far as I’m aware, they’re only available via Silverlight, which I realise isn’t going to be convenient for everyone.)

Feedback is very welcome.

.NET 4.0’s game-changing feature? Maybe contracts…

Update: As Chris Nahr pointed out, there’s a blog post by Melitta Andersen of the BCL team explaining this in more detail.

Obviously I’ve been looking at the proposed C# 4.0 features pretty carefully, and I promise I’ll blog more about them at some later date – but yesterday I watched a PDC video which blew me away.

As ever, a new version of .NET means more than just language changes – Justin van Patten has written an excellent blog post about what to expect in the core of the framework. There are nice things in there – tuples and BigInteger, for example – but it was code contracts that really caught my eye.

Remember Spec#? Well, as far as I can tell the team behind it realised that people don’t really want to have to learn a new language – but if the goodness of Design By Contract can be put into a library, then everyone can use it. Enter CodeContracts.

Actual examples are relatively few and far between at the moment, but the basic idea is that you write your contracts at the start of methods – not in attributes, presumably because that’s too limiting in terms of what you can express – and then a post-build tool will “understand” those contracts, find potential issues, and do a bit of code-rewriting where appropriate (e.g. to move the post-condition testing to the end points of the method). Object invariants can also be expressed as separate methods.

Rather than guess at the syntax in this blog post I highly recommend you watch the PDC 2008 video on both this and Pex (an intelligent code explorer and test generator). The teams have clearly thought through a lot of significant issues:

  • Contracts can be enforced at runtime, or stripped out for release builds. (I’ll be interested to see whether I can keep the pre-condition checks in the release build, just removing invariants and post-conditions etc.)
  • If you’re stripping out the contracts, you can still have them in a separate assembly – so if you supply a library to someone, they can still have all the Design by Contract goodness available, and see where they’re potentially violating your preconditions.
  • Contracts will be automatically documented in the generated XML documentation file (although this has yet to be implemented, I believe).
  • Interfaces can be associated with contract classes where the contracts are expressed. (They couldn’t be in the interface, as they require method bodies.)
  • Pex will be able to generate tests in MS Test, NUnit and MbUnit. (Hooray! This got a massive cheer at PDC.)
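For a rough flavour of the style, here is a sketch using the System.Diagnostics.Contracts types found in the framework. It is based on that API rather than on the exact syntax shown in the video, so treat it as an assumption, and the Account class is entirely hypothetical. Note that the Contract calls here are conditional on the CONTRACTS_FULL symbol, so without that symbol and the rewriting tool they compile away to nothing.

```csharp
using System;
using System.Diagnostics.Contracts;

// Hypothetical class, for illustration only.
public sealed class Account
{
    private int balance;

    public int Balance
    {
        get { return balance; }
    }

    public void Deposit(int amount)
    {
        // Pre-condition, written at the start of the method.
        Contract.Requires(amount > 0);

        balance += amount;
    }

    public int Withdraw(int amount)
    {
        // Pre-conditions.
        Contract.Requires(amount > 0);
        Contract.Requires(amount <= Balance);
        // Post-condition: written up front, checked at the exit points after rewriting.
        Contract.Ensures(Contract.Result<int>() == Contract.OldValue(Balance) - amount);

        balance -= amount;
        return balance;
    }

    // Object invariant, expressed as a separate method.
    [ContractInvariantMethod]
    private void ObjectInvariant()
    {
        Contract.Invariant(balance >= 0);
    }
}
```

The appeal is that the contracts are ordinary method calls, so no new language syntax is required; the cost is that nothing is enforced unless the tooling is run over the build.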

Now I should point out that I haven’t tried any of this – I’ve just watched a video which was very slick and obviously used a well-tested scenario. If this genuinely works, however, I think it could change the way mainstream developers approach coding just as LINQ is changing the way we see data. (Obviously there’s nothing fundamentally new about DbC – but there’s a difference between it existing and it being mainstream.)

I’m really, really excited about this :) Definitely time to boot up the VPC image when I get a moment…