Book idea

Having just glanced at the clock, now is the ideal time to post about an idea I had a little while ago – a book (or blog, or something) about C# (or maybe C# and Java) which I’d only write between midnight and one in the morning.

It would contain only those things which seemed like really good ideas at the time – but which might seem insane at other times. Most of these ideas are probably useless, but may contain a germ of interest. While I don’t always have those ideas between midnight and one, that’s the time of night when they seem most potent, and when I’d be most likely to be ready to write enthusiastically about them. The coding equivalent of “beer goggles” if you will.

A couple of ideas I’ve had which would probably qualify:

Extension interfaces

If C# 3.0 is going to allow us to pretend to add methods to classes, why shouldn’t it allow us to pretend that classes implement interfaces they don’t? My original reason for wanting this was to get rid of some of the ugliness in the suggested new XML APIs: there’s a method which takes an array of objects, even though only a handful of types are catered for. Unfortunately, those types don’t have an interface in common, so all the checking has to be done at runtime. If you could pretend that they all implement the same interface, just for the purposes of the API, you could declare the method as taking an array of the interface type. Of course, this is much less straightforward than converting what looks like an instance method call into a static method call…
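To make that a bit more concrete, here’s a rough sketch of the kind of API ugliness I mean – the type and member names are invented for illustration, not taken from the actual XML API proposal:

using System;

// Two content types with no interface in common.
class TextContent { }
class CommentContent { }

class Element
{
    // Today: accept anything, and check the types at execution time.
    public void Add(params object[] content)
    {
        foreach (object item in content)
        {
            if (!(item is TextContent || item is CommentContent))
            {
                throw new ArgumentException("Unsupported content type");
            }
            // ... handle the known types
        }
    }

    // With extension interfaces, you could imagine declaring (elsewhere)
    // that TextContent and CommentContent implement IElementContent purely
    // for the purposes of this API, and then writing:
    //
    //     public void Add(params IElementContent[] content) { ... }
    //
    // so that passing an unsupported type would fail at compile time.
}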

Conditional returns

This came up when implementing Equals for several types in quick succession. All of them followed a very similar pattern, and there were similar things needed at the start of each implementation – simple checks for nullity, reference identity etc. It would be interesting to have a sort of “nullable return” for methods which had a non-nullable value type return type – I could write return? expression; where the expression was a nullable form of the return type, and it would only return if the expression was non-null. There are bits of this which appeal, and bits which seem horrible – but the main problem I have with it is that I suspect I would rarely use it outside Equals implementations. (If this isn’t a clear enough description, the sketch below should give the rough idea.)
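Roughly, every implementation started with the same dance – something like this (Money is just an invented example type):

using System;

public sealed class Money : IEquatable<Money>
{
    readonly string currency;
    readonly decimal amount;

    public Money(string currency, decimal amount)
    {
        this.currency = currency;
        this.amount = amount;
    }

    public bool Equals(Money other)
    {
        // The boilerplate repeated at the start of every implementation:
        if (other == null)
        {
            return false;
        }
        if (ReferenceEquals(this, other))
        {
            return true;
        }
        // The hypothetical feature would collapse those checks into
        // something like:
        //     return? StandardChecks(this, other);
        // where StandardChecks returns a bool? which is null whenever
        // the simple checks don't give a definite answer.
        return currency == other.currency && amount == other.amount;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Money);
    }

    public override int GetHashCode()
    {
        return (currency == null ? 0 : currency.GetHashCode()) ^ amount.GetHashCode();
    }
}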

Yay! I’m not the only one who doesn’t like designers…

For a long time I’ve disliked “designer-generated” code. My preference when writing a Windows Forms (or Swing) app is to work out what it should look like on paper, possibly prototype just the UI in a designer (for the look of it, not the code) and then start with an empty file for real code.

That way, I can end up with a UI which is built up in logical stages (significant UI construction often takes several hundred lines of code – it’s handy to be able to put that in multiple methods with descriptive names, etc), can have code re-use (if all the buttons I create have similar properties, I can write a method to take care of the common stuff), doesn’t put extra fields in for no good reason (how often do labels actually change their text?) and various other things.
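As a rough illustration of the sort of thing I mean (all the names here are invented – it’s a sketch, not code from a real project):

using System.Drawing;
using System.Windows.Forms;

public class OrderForm : Form
{
    public OrderForm()
    {
        SuspendLayout();
        CreateButtons();
        // ... CreateGrid(), CreateMenus() etc., each a descriptively named method
        ResumeLayout(false);
    }

    void CreateButtons()
    {
        Button ok = CreateStandardButton("OK", 10);
        ok.Click += delegate { /* accept the order */ };

        Button cancel = CreateStandardButton("Cancel", 95);
        cancel.Click += delegate { Close(); };
    }

    // One place for all the properties the buttons have in common.
    Button CreateStandardButton(string text, int left)
    {
        Button button = new Button();
        button.Text = text;
        button.Size = new Size(75, 23);
        button.Location = new Point(left, 10);
        Controls.Add(button);
        return button;
    }
}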

I’ve always regarded myself as slightly odd in that respect. However, it seems that Charles Petzold – yes, that Charles Petzold – feels the same way, and for pretty much the same reasons. Like me, he feels that XAML could help things in terms of autogeneration, as the autogenerated “code” may well look very much like what I’d have written myself, and what’s more, the designer should still be able to understand the XAML after I’ve changed it. Altogether good things.

Anyway, this is all by way of introducing a wonderful article about all of this (and other things): Does Visual Studio Rot The Mind?

In case anyone’s wondering what my take on Intellisense is: the VS.NET 2003 Intellisense seems to get in the way as much as it helps. I expect VS 2005 is better, but I haven’t used it enough to know. Eclipse’s equivalent is much, much nicer than VS.NET 2003, partly because I have a finer degree of control over when it pops up. It’s also brilliant at guessing what parameters I want to pass to methods (really rather surprisingly so at times), and possibly most important of all, it knows how to display more than one overload at a time. VS 2005 beta 2 doesn’t; I’m hoping for a pleasant surprise when the real thing arrives, but we’ll see. To understand what I mean, type Convert.ToString( into Visual Studio. Oh great, I can see 36 overloads – one at a time. Eclipse would give a larger tooltip-style window, with scrollbars. Visual Studio knows how to do that when it’s offering you different method names altogether, but as soon as it comes to overloads, it decides that one-at-a-time is the way to go. Aargh. Anyway, enough ranting…

Corner cases in Java and C#

Every language has a few interesting corner cases – bits of surprising behaviour
which can catch you out if you’re unlucky. I’m not talking about the kind of thing
that all developers should really be aware of – the inefficiencies of repeatedly
concatenating strings, etc. I’m talking about things which you would never suspect
until you bump into them. Both C#/.NET and Java have some oddities in this respect,
and as most are understandable even to a developer who is used to the other, I thought
I’d lump them together.

Interned boxing – Java 1.5

Java 1.5 introduced autoboxing of primitive types – something .NET has had
from the start. In Java, however, there’s a slight difference – the boxed
types have been available for a long time, and are proper named reference
types just as you’d write elsewhere. In this example, we’ll look at
int boxing to java.lang.Integer. What would you
expect the results of the following operation to be?

Object x = 5;
Object y = 5;
boolean equality = (x==y);

Personally, I’d expect the answer to be false. We’re testing for reference equality
here, after all – and when you box two values, they’ll end up in different boxes,
even if the values are the same, right? Wrong. Java 1.5 (or rather, Sun’s current
implementation of Java 1.5) has a sort of cache of interned values between -128 and
127 inclusive. The language specification explicitly states that programmers shouldn’t
rely on two boxed values of the same original value being different (or being the
same, of course). Goodness only knows whether or not this actually yields performance
improvements in real life, but it can certainly cause confusion. I only ran into it
when I had a unit test which incorrectly asserted reference equality rather than
value equality between two boxed values. The tests worked for ages, until I added
something which took the value I needed to test against above 127.
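For contrast, the C# equivalent boxes the value twice into two distinct objects, so (at least on the framework versions I’ve used) the reference comparison is false:

using System;

class BoxingDemo
{
    static void Main()
    {
        object x = 5;
        object y = 5;
        // Two separate boxing operations, two separate objects:
        // this prints False.
        Console.WriteLine(x == y);
    }
}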

Lazy initialisation and the static constructor – C#

One of the things which is sometimes important about the pattern I usually use when
implementing a singleton
is that it’s only initialised when it’s first used – or is it? After a newsgroup
question asked why the supposedly lazy pattern wasn’t working, I investigated a little,
finding out that there’s a big difference between using an initialiser directly on
the static field declaration, and creating a static constructor which assigns the value.
Full details on my beforefieldinit
page.
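Very roughly (the beforefieldinit page has the real details), the difference is between these two forms:

// Gets the beforefieldinit flag: the CLR is allowed to run the type
// initialiser earlier than the first use, so laziness isn't guaranteed.
public sealed class Singleton1
{
    public static readonly Singleton1 Instance = new Singleton1();
    private Singleton1() { }
}

// An explicit static constructor removes beforefieldinit, so the type
// is only initialised when it's actually used.
public sealed class Singleton2
{
    public static readonly Singleton2 Instance;

    static Singleton2()
    {
        Instance = new Singleton2();
    }

    private Singleton2() { }
}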

The old new object – .NET

I always believed that using new with a reference type would give me
a reference to a brand new object. Not quite so – the overload for the String
constructor which takes a char[] as its single parameter will return
String.Empty if you pass it an empty array. Strange but true.
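A quick way to see it (this is what I observe; I certainly wouldn’t rely on it):

using System;

class EmptyStringDemo
{
    static void Main()
    {
        string fresh = new string(new char[0]);
        // Prints True here: the "new" string is the very same
        // reference as String.Empty.
        Console.WriteLine(object.ReferenceEquals(fresh, string.Empty));
    }
}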

When is == not reflexive? – .NET

Floating point numbers have been
the cause of many headaches over the years. It’s relatively well known that “not a number” is not equal
to itself (i.e. if x=double.NaN, then x==x is false).
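For completeness, that one looks like this:

using System;

class NaNDemo
{
    static void Main()
    {
        double x = double.NaN;
        Console.WriteLine(x == x);        // False
        Console.WriteLine(x.Equals(x));   // True - Equals treats NaN as equal to itself
    }
}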

It’s slightly more surprising when two values which look like they really, really should be equal just
aren’t. Here are a couple of sample programs:

using System;
public class Oddity1
{
    public static void Main()
    {
        double two = double.Parse("2");
        double a = double.Epsilon/two;
        double b = 0;
        Console.WriteLine(a==b);
        Console.WriteLine(Math.Abs(b-a) < double.Epsilon);
    }
}

On my computer, the above (compiled and run from the command line) prints out True twice.
If you comment out the last line, however, it prints False – but only under .NET 1.1.
Here’s another:

using System;

class Oddity2
{
    static float member;

    static void Main()
    {
        member = Calc();
        float local = Calc();
        Console.WriteLine(local==member);
        member = local;
    }

    static float Calc()
    {
        float f1 = 2.82323f;
        float f2 = 2.3f;
        return f1*f2;
    }
}

This time it prints out True until you comment out the last
line, which changes the result to False. This occurs on both .NET 1.1 and 2.0.

The reason for these problems is really the same – it’s a case of when the JIT decides to
truncate the result down to the right number of bits. Most CPUs work on 80-bit floating point
values natively, and provide ways of converting to and from 32 and 64 bit values. Now, if you
compare a value which has been calculated in 80 bits without truncation with a value which has
been calculated in 80 bits, truncated to 32 or 64, and then expanded to 80 again, you can run
into problems. The act of commenting or uncommenting the extra lines in the above changes what
the JIT is allowed to do at what point, hence the change in behaviour. Hopefully this will
persuade you that comparing floating point values directly isn’t a good idea, even in cases
which look safe.
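If you really do need to compare values which may have been through different calculation paths, one option is an explicit tolerance – crude, and the tolerance has to be chosen to suit the situation, but at least it’s predictable:

using System;

class ToleranceDemo
{
    // True if a and b are within the given tolerance of each other.
    static bool NearlyEqual(double a, double b, double tolerance)
    {
        return Math.Abs(a - b) <= tolerance;
    }

    static void Main()
    {
        double x = 0.1 + 0.2;
        Console.WriteLine(x == 0.3);                      // False
        Console.WriteLine(NearlyEqual(x, 0.3, 0.000001)); // True
    }
}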

That’s all I can think of for the moment, but I’ll blog some more examples as and when I see/remember
them. If you enjoy this kind of thing, you’d probably like
Java Puzzlers
– whether or not you use Java itself. (A lot of the puzzles there map directly to C#, and even those which
don’t are worth looking at just for getting into the mindset which spots that kind of thing.)

A short case study in LINQ efficiency

I’ve been thinking a bit about how I’d use LINQ in real life (leaving DLinq and XLinq alone for the moment). One of the examples I came up with is a fairly common one – trying to find the element in a collection which has the maximum value for a certain property. Note that quite often I don’t just need to know the maximum value of the property itself – I need to know which element had that value. Now, it’s not at all hard to implement that in “normal” code, but using LINQ could potentially make the intention clearer. So, I tried to work out what the appropriate LINQ expression should look like.

I’ve come up with three ways of expressing what I’m interested in in LINQ. For these examples, I’ve created a type NameAndSize which has (unsurprisingly) properties Name (a string) and Size (an int). For testing purposes, I’ve created a list of these items, with a variable list storing a reference to the list. All samples assume that the list is non-empty.
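For reference, NameAndSize itself is as simple as it sounds – something along these lines is all the samples need:

class NameAndSize
{
    readonly string name;
    readonly int size;

    public NameAndSize(string name, int size)
    {
        this.name = name;
        this.size = size;
    }

    public string Name { get { return name; } }
    public int Size { get { return size; } }
}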

Sort, and take first element

(from item in list
orderby item.Size descending
select item).First();

This orders by size descending, and takes the first element. The obvious disadvantage of this is that before finding the first element (which is the only one we care about) we have to sort all the elements – nasty! Assuming a reasonable sort, this operation is likely to be O(n log(n)).

Subselect in where clause

(from item in list
where item.Size==list.Max(x=>x.Size)
select item).First();

This goes through the list, finding every element whose size is equal to the maximum one, and then takes the first of those elements. Unfortunately, the comparison calculates the maximum size on every iteration. This makes it an O(n^2) operation.

Two selects

int maxSize = list.Max(x=>x.Size);
NameAndSize max = (from item in list
where item.Size==maxSize
select item).First();

This is similar to the previous version, but solves the problem of the repeated calculation of the maximum size by doing it before anything else. This makes the whole operation O(n), but it’s still somewhat dissatisfying, as we’re having to iterate through the list twice.

The non-LINQ way

NameAndSize max = list[0];
foreach (NameAndSize item in list)
{
    if (item.Size > max.Size)
    {
        max = item;
    }
}

This keeps a reference to the “maximum element so far”. It only iterates through the list once, and is still O(n).

Benchmarks

Now, I’ve written a little benchmark which runs all of these except the “subselect” version, which was just too slow to run by the time I’d made the list large enough to get meaningful results for the other versions. Here are the results using a list of a million elements, averaged over five runs (there’s a sketch of the kind of harness involved after the figures):
Sorting: 437ms
Two queries: 109ms
Non-LINQ: 38ms

After tripling the size of the list, the results were:
Sorting: 1437ms
Two queries: 324ms
Non-LINQ: 117ms

These results show the complexities being roughly as predicted above, and in particular show that it’s definitely cheaper to only iterate through the collection once than to iterate through it twice.
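The benchmark itself is nothing clever – a minimal harness along the following lines would do, although this isn’t the exact code used for the timings above:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class Benchmark
{
    delegate void TestCase();

    static void Main()
    {
        // Build a list of a million elements with pseudo-random sizes.
        Random rng = new Random(12345);
        List<NameAndSize> list = new List<NameAndSize>();
        for (int i = 0; i < 1000000; i++)
        {
            list.Add(new NameAndSize("item" + i, rng.Next(1000000)));
        }

        Time("Non-LINQ", delegate
        {
            NameAndSize max = list[0];
            foreach (NameAndSize item in list)
            {
                if (item.Size > max.Size)
                {
                    max = item;
                }
            }
        });
        // ... same pattern for the sorting and two-query versions,
        // averaging each one over several runs.
    }

    static void Time(string name, TestCase test)
    {
        Stopwatch sw = Stopwatch.StartNew();
        test();
        sw.Stop();
        Console.WriteLine("{0}: {1}ms", name, sw.ElapsedMilliseconds);
    }
}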

Now, this query is a fairly simple one, conceptually – it would be a shame if LINQ couldn’t cope with it efficiently. I suspect it could be solved by giving the Max operator another parameter, which specified what should be selected, as well as what should be used for comparisons. Then I could just use list.Max(item => item.Size, item=>item). At that stage, the only loss in efficiency would be through invoking the delegates, which is a second-order problem (and one which is inherent in LINQ). Fortunately, the way LINQ works makes this really easy to try out – just write an extension class:

static class Extensions
{
    public static V Max<T,V>(this IEnumerable<T> source, 
                             Func<T,int> comparisonMap, 
                             Func<T,V> selectMap)
    {
        int maxValue=0;
        V maxElement=default(V);
        bool gotAny = false;
        using (IEnumerator<T> enumerator = source.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                T sourceValue = enumerator.Current;
                int value = comparisonMap(sourceValue);
                if (!gotAny || value > maxValue)
                {
                    maxValue = value;
                    maxElement = selectMap(sourceValue);
                    gotAny = true;
                }
            }
        }
        if (!gotAny)
        {
            throw new EmptySequenceException();
        }
        return maxElement;
    }
}

This gave results of 57ms and 169ms for the two list sizes used earlier – not quite as fast as the non-LINQ way, but much better than any of the others – and by far the simplest to express, too.
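For completeness, using it looks like this (same list variable as before):

NameAndSize max = list.Max(item => item.Size, item => item);
Console.WriteLine("Largest: {0} ({1})", max.Name, max.Size);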

Lessons learned

  • You really need to think about the complexity of LINQ queries, and know where they will be executed (I suspect that DLinq would have coped with the “subselect” version admirably).
  • There’s still some more work to be done on the standard query operators to find efficient solutions to common use cases.
  • Even if the standard query operators don’t quite do what you want, it can be worthwhile to implement your own – and it’s not that hard to do so!