The irritation of bad names

A couple of days ago I accidentally derailed the comments on Eric Lippert’s blog post about unused "using" directives. The reason that such redundant code doesn’t generate a warning in Visual Studio is that it’s what you get to start with: the default templates include using directives whether or not you actually need them. This led me to rant somewhat about other aspects of Visual Studio’s behaviour which sacrifice long-term goodness in favour of short-term efficiency. Almost all the subsequent comments (at the time of writing this post) are concerned with my rant rather than Eric’s post. Some agree with me, some don’t – but it’s only now that I’ve spotted the bigger picture behind my annoyances.

All of them are to do with names and the defaults provided. I’ve blogged before about how hard it is to find a good name – it’s a problem I run into time and time again, and the ability to rename something is one of the most important refactorings around.

If you don’t know, ask

Now if it’s hard to find a good name, it stands to reason that anything the IDE can generate automatically is likely to be a poor name… such as "Form1", "textBox1" or "button1_Click". And yet, in various situations, Visual Studio will happily generate such names, and it can sometimes be a small but significant pain to correct it.

The situation which causes me personally a lot of pain is copying a file. For C# in Depth, I have a lot of very small classes, each with a Main method. When I’m evolving an example, I often want to take the existing code and just change it slightly, but in a new file. So I might have a file called OrderByName.cs containing a class called OrderByName. (I agree this would normally be a bad class name, but in the context of a short but complete example it’s okay.) I want to just select the file, hit copy and paste, and be asked for a new filename. The class within the file would then be renamed for me as well. As an aside, this is the behaviour Eclipse has in its Java tooling.

In reality, I’d end up with a new file called "Copy of OrderByName.cs", still containing a class called OrderByName. Renaming the file wouldn’t offer to rename the class, as the filename wouldn’t match the class name. Renaming the class by just changing it and then hitting Ctrl-. would also rename the original class, which is intensely annoying. You’re basically stuck doing it manually with find and replace, as far as I can see. There may be some automated aid available, but at the very least it’s non-obvious.

Now the question is: why would I ever want a file called "Copy of OrderByName.cs"? That’s always going to be the wrong name, so why doesn’t Visual Studio ask me for the right name? It could provide a default so I can skip through if I really want to (and probably an "Apply to all items" if I’m copying multiple files) but at least give me the chance to specify the right filename at the crucial point. Once it knows the right new filename before it’s got a broken build, I would hope it would be easy to then apply the new name to the class too.

The underlying point is that if you absolutely have to have a name for something, and there’s no way a sensible suggestion can be provided, the user should be consulted. I know there’s a lot of discussion these days about not asking the user pointless questions, but this isn’t a pointless question… at least when it comes to filenames.

If you don’t need a name, don’t use one

I’m not a UI person, so some of this section may be either outdated or at least only partially applicable. In particular, I believe WPF does a better job than the Windows Forms designer.

Names have two main purposes, in my view. They can provide semantic meaning to aid the reader, even if a name isn’t strictly required (think of the "introduce local variable" refactoring) and they can be used for identification.

Now suppose I’m creating a label on a form. If I’m using the designer, I can probably see the text on the label – its meaning is obvious. I quite possibly don’t have to refer to the label anywhere in code, unless I’m changing the value programmatically… so why does it need a name? If you really think it needs a name, is "label1" ever going to be the right name – the one you’d have come up with as the most meaningful one you could think of?

In the comments in Eric’s blog, someone pointed out that being prompted for a name every time you dragged a control onto the designer would interrupt workflow… and I quite agree. Many of those controls won’t need names. However, as soon as they do need a name, prompting for the name at that point (or just typing it into the property view) isn’t nearly such a distraction… indeed, I’d suggest it’s actually guiding the developer in question to crystallize their thoughts about the purpose of that control.

Conclusion

Okay, this has mostly been more ranting – but at least it’s now on my blog, and I’ve been able to give a little bit more detail about the general problem I see in Visual Studio – a problem which leads to code containing utterly useless names.

The fundamental principle is that I want every name in my code to be a meaningful one. The IDE should use two approaches to help me with that goal:

  • Don’t give a name to anything that doesn’t deserve or need one
  • If a name is really necessary, and you can’t guess it from the rest of the context, ask the user

I don’t expect anything to change, but it’s good to have it off my chest.

Type initialization changes in .NET 4.0

This morning, while checking out an email I’d received about my brain-teasers page, I discovered an interesting change to the CLR in .NET 4.0. At least, I think it’s interesting. It’s possible that different builds of the CLR have exhibited different behaviour for a while – I only have 32-bit versions of Windows installed, so that’s what I’m looking at for this whole post. (Oh, and all testing was done under .NET 4.0b2 – it could still change before release.)

Note: to try any of this code, build in release mode. Running in the debugger or even running a debug build without the debugger may well affect the behaviour.

Precise initialization: static constructors

I’ve written before about static constructors in C# causing types to be initialized immediately before the type is first used, either by constructing an instance or referring to a static member. In other words, consider the following program:

using System;

class StaticConstructorType
{
    private static int x = Log();
    
    // Force "precise" initialization
    static StaticConstructorType() {}
    
    private static int Log()
    {
        Console.WriteLine("Type initialized");
        return 0;
    }
    
    public static void StaticMethod() {}
}

class StaticConstructorTest
{
    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("No args");
        }
        else
        {
            StaticConstructorType.StaticMethod();
        }
    }
}

Note how the static variable x is initialized using a method that writes to the console. This program is guaranteed to write exactly one line to the console: StaticConstructorType will not be initialized unless you give a command line argument to force it into the “else” branch. The C# compiler controls this via the beforefieldinit flag: because the type has a static constructor, the flag is omitted, which forces the precise timing.
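
As an aside, you can observe the flag directly via reflection. Here’s a quick check – just a sketch, with a made-up helper name, assuming it’s compiled alongside the class above – which prints False while the static constructor is present, and True once you remove it:

using System;
using System.Reflection;

static class BeforeFieldInitCheck
{
    internal static void Dump()
    {
        // False while StaticConstructorType has a static constructor;
        // True if you remove the static constructor and recompile.
        bool beforeFieldInit =
            (typeof(StaticConstructorType).Attributes & TypeAttributes.BeforeFieldInit) != 0;
        Console.WriteLine(beforeFieldInit);
    }
}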

So far, so boring. We know exactly when the type will be initialized – I’m going to call this “precise” initialization. This behaviour hasn’t changed – and couldn’t, without breaking backward compatibility. Now let’s consider what happens without the static constructor.

Eager initialization: .NET 3.5

Let’s take the previous program and just remove the (code-less) static constructor – and change the name of the type, for clarity:

using System;

class Eager
{
    private static int x = Log();
    
    private static int Log()
    {
        Console.WriteLine("Type initialized");
        return 0;
    }
    
    public static void StaticMethod() {}
}

class EagerTest
{
    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("No args");
        }
        else
        {
            Eager.StaticMethod();
        }
    }
}

Under .NET 3.5, this either writes both “Type initialized” and “No args” (if you don’t pass any command line arguments) or just “Type initialized” (if you do). In other words, the type initialization is eager. In my experience, a type is initialized at the start of execution of the first method which refers to that type.

So what about .NET 4.0? Under .NET 4.0, the above code will never print “Type initialized”.

If you don’t pass in a command line argument, you see “No args” as you might expect… if you do, there’s no output at all. The type is being initialized extremely lazily. Let’s see how far we can push it…

Lazy initialization: .NET 4.0

The CLR guarantees that the type initializer will be run at some point before the first reference to any static field. If you don’t use a static field, the type doesn’t have to be initialized… and it looks like .NET 4.0 obeys that in a fairly lazy way. Another test app:

using System;

class Lazy
{
    private static int x = Log();
    private static int y = 0;
    
    private static int Log()
    {
        Console.WriteLine("Type initialized");
        return 0;
    }
    
    public static void StaticMethod()
    {
        Console.WriteLine("In static method");
    }

    public static void StaticMethodUsingField()
    {
        Console.WriteLine("In static method using field");
        Console.WriteLine("y = {0}", y);
    }
    
    public void InstanceMethod()
    {
        Console.WriteLine("In instance method");
    }
}

class LazyTest
{
    static void Main(string[] args)
    {
        Console.WriteLine("Before static method");
        Lazy.StaticMethod();
        Console.WriteLine("Before construction");
        Lazy lazy = new Lazy();
        Console.WriteLine("Before instance method");
        lazy.InstanceMethod();
        Console.WriteLine("Before static method using field");
        Lazy.StaticMethodUsingField();
        Console.WriteLine("End");
    }
}

This time the output is:

Before static method
In static method
Before construction
Before instance method
In instance method
Before static method using field
Type initialized
In static method using field
y = 0
End

As you can see, the type initialized when StaticMethodUsingField is called. It’s not as lazy as it could be – the first line of the method could execute before the type is initialized. Still, being able to construct an instance and call a method on it without triggering the type initializer is slightly surprising.

I’ve got one final twist… what would you expect this program to do?

using System;

class CachingSideEffect
{
    private static int x = Log();

    private static int Log()
    {
        Console.WriteLine("Type initialized");
        return 0;
    }
    
    public CachingSideEffect()
    {
        Action action = () => Console.WriteLine("Action");
    }
}

class CachingSideEffectTest
{
    static void Main(string[] args)
    {
        new CachingSideEffect();
    }
}

In .NET 4.0, using the Microsoft C# 4 compiler, this does print “Type initialized”… because the C# compiler has created a static field in which to cache the action. The lambda expression doesn’t capture any variables, so the same delegate instance can be reused every time. That involves caching it in a static field, triggering type initialization. If you change the action to use Console.WriteLine(this) then it can’t cache the delegate, and the constructor no longer triggers initialization.
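
For reference, the modified constructor is just this – a sketch of the change described above:

public CachingSideEffect()
{
    // Capturing "this" means the compiler can no longer cache the delegate in a
    // static field, so constructing an instance doesn't touch any static state -
    // and the type initializer isn't triggered.
    Action action = () => Console.WriteLine(this);
}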

This bit is completely implementation-specific in terms of the C# compiler, but I thought it might tickle your fancy anyway.

Conclusion

I’d like to stress that none of this should cause your code any problems. The somewhat eager initialization of types without static constructors was entirely legitimate according to the C# and CLR specs, and so is the new lazy behaviour of .NET 4.0. If your code assumed that just calling a static method, or creating an instance, would trigger initialization, then that’s your own fault to some extent. That doesn’t stop it being an interesting change to spot though :)

LINQ to Rx: second impressions

My previous post had the desired effect: it generated discussion on the LINQ to Rx forum, and Erik and Wes very kindly sent me a very detailed response too. There’s no better way to cure ignorance than to display it to the world.

Rather than regurgitating the email verbatim, I’ve decided to try to write it in my own words, with extra thoughts where appropriate. That way if I’ve misunderstood anything, I may be corrected – and the very act of trying to explain all this is likely to make me explore it more deeply than I would otherwise.

I’m leaving out the bits I don’t yet understand. One of the difficulties with LINQ to Rx at the moment is that the documentation is somewhat sparse – there are loads of videos, and at least there is a CHM file for each of the assemblies bundled in the framework, but many methods just have a single sentence of description. This is entirely understandable – the framework is still in flux, after all. I’d rather have the bits but sparse docs than immaculate docs for a framework I can’t play with – but it makes it tricky to go deeper unless you’ve got time to experiment extensively. There’s an rxwiki site which looks like it may be the community’s attempt to solve this problem – but it needs a bit more input, I think. When I get a bit of time to breathe, I’d like to try to contribute there.

The good news is that I don’t think there were any mechanical aspects that I got definitively wrong in what I wrote… but the bad news is that I wasn’t thinking in Rx terms. We’ll look at the different aspects separately.

Subscriptions and collections

My first "complaint" was about the way that IEnumerable<T>.ToObservable() worked. Just to recap, I was expecting a three stage startup process:

  • Create the observable
  • Subscribe any observers
  • Tell the observable to "start"

Instead, as soon as an observer subscribes, the observable publishes everything in the sequence to it (on a different thread, by default). Separate calls to Subscribe make the observable iterate over the sequence multiple times.
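
Here’s a minimal sketch of what I mean (depending on the build you may need to pass a scheduler to ToObservable, and the two sets of output may interleave):

var source = new[] { 1, 2, 3 }.ToObservable();

// Each call to Subscribe iterates over the underlying sequence again -
// two subscriptions, two iterations.
source.Subscribe(x => Console.WriteLine("First subscriber: " + x));
source.Subscribe(x => Console.WriteLine("Second subscriber: " + x));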

Now, my original viewpoint makes sense if you think of Subscribe as being like an event subscription. It feels like something which should be passive: another viewer turning on their television to watch a live broadcast.

However, as soon as you think of IObservable.Subscribe as being the dual of IEnumerable.GetEnumerator, the Rx way makes more sense. Each call to GetEnumerator starts the sequence from scratch, and so does each call to Subscribe. This is more like inserting a disc into the DVD player – you’re still watching something, but there’s a more active element to it. You put the DVD in, it starts playing. I guess following the analogy further would make my suggested model more like a PVR :)

Additionally, this "subscription as an action" model makes more sense of methods like Return and Repeat, and also works better as a reusable object: my own idea of "now push the collection" feels dreadfully stateful: why can’t I push the collection twice? What happens if an observer subscribes after I’ve pushed?

I suspect this will trip up many an Rx neophyte; the video Wes recorded on hot and cold observables should help. Admittedly I’d already watched it before writing the blog post, so I’ve no excuse…

The subscription model can effectively be modified via composition though; using Subject (as per the blog post), AsyncSubject (which remembers the first value it sees, and only yields that), BehaviorSubject (which remembers the last value it’s seen), and ReplaySubject (which remembers everything it sees, optionally limited by a buffer) you can do quite a bit.
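
As a tiny illustration of the last of those (a sketch only – the others follow the same pattern with different caching policies):

var subject = new ReplaySubject<int>();
subject.OnNext(1);
subject.OnNext(2);

// A late subscriber still sees everything the subject has already observed...
subject.Subscribe(x => Console.WriteLine("Replayed: " + x));

// ...as well as anything pushed afterwards.
subject.OnNext(3);
subject.OnCompleted();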

Wes included in his email a StartableObservable which one could start and stop. I’d come up with a slightly similar idea at home, an ObservableSequence (not nearly such a good name), but mine was limited to sequences: effectively it made the steps listed above explicit for a pull sequence. The code Wes provided was completely isolated from IEnumerable<T> – you would create a StartableObservable from any existing observable, then subscribe to it, then start it. It uses a Subject to collect the subscriptions – starting the observable merely subscribes the subject to the original observable passed into the constructor.
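
I can’t reproduce Wes’s code here, but a minimal sketch of that idea – my own code, with my own member names – might look something like this:

public sealed class StartableObservable<T> : IObservable<T>
{
    private readonly IObservable<T> source;
    private readonly Subject<T> subject = new Subject<T>();
    private IDisposable connection;

    public StartableObservable(IObservable<T> source)
    {
        this.source = source;
    }

    public IDisposable Subscribe(IObserver<T> observer)
    {
        // Passive: just register the observer with the subject.
        return subject.Subscribe(observer);
    }

    public void Start()
    {
        // Active: the underlying source now starts pushing into the subject.
        connection = source.Subscribe(subject);
    }

    public void Stop()
    {
        if (connection != null)
        {
            connection.Dispose();
            connection = null;
        }
    }
}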

The difference between Wes’s solution and mine is more fundamental than whether his is more general-purpose than mine or not (although it clearly is). Wes didn’t have to go outside the world of Rx at all. All the building blocks were there, he just put them together – and ended up with another building block, ready to be used with the rest. That’s a common theme in this blog post :)

Asynchronous aggregation

I did get one thing right in my previous post: my suggestion that there should be asynchronous versions of the aggregation operators is apparently not a bad one. We may see the framework come with such things in the future… but they won’t revolve around Task<T>.

What do we have to represent an asynchronous computation? Why, IObservable<T> of course. It will present the result at some future point in time. Ideally, I suppose you would deal with the count (or maximum line length, or whatever) by reacting to it asynchronously too. If necessary though, you can always just take the value produced and stash it somewhere… which is exactly what an AsyncSubject does, as mentioned above. You can get the value from that by just calling First() on it, which will block if the value hasn’t been seen yet – and you don’t need to worry about "missing" it, because of the caching within the subject.

When I started this blog post, I didn’t understand Prune, but I’ve found that writing about the whole process has made it somewhat clearer to me. Calling Prune on an observable returns an AsyncSubject – one which also unsubscribes itself from the original observable when the subject is disposed, allowing a more orderly cleanup. So, all we need to do is call Prune on the result of our asynchronous aggregation, and we’re away.

That’s one part of the "non-Rx" framework removed… what else can we take out of the code from the previous blog post? Well, if you look at the FutureAggregate method I posted, it does two things: maintains a running aggregate, and publishes the last result (via a Task<T>). Now the "maintain a running aggregate" looks remarkably like Scan, doesn’t it? All the future aggregates (FutureCount etc) can be built from one new building block: an observable which subscribes to an existing one, and yields the last value it sees before completion.
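
That building block might look something like the sketch below – the names are mine, and I’m assuming the Observable.Create signature from current Rx builds rather than whatever the preview exposes. (For an empty source it yields nothing; handling that is left as an exercise.)

public static IObservable<T> LastValueBeforeCompletion<T>(this IObservable<T> source)
{
    return Observable.Create<T>(observer =>
    {
        T last = default(T);
        bool seenAny = false;
        return source.Subscribe(
            value => { last = value; seenAny = true; },
            observer.OnError,
            () =>
            {
                if (seenAny)
                {
                    observer.OnNext(last);
                }
                observer.OnCompleted();
            });
    });
}

// An asynchronous Count is then just Scan followed by the new building block.
public static IObservable<int> ObservableCount<T>(this IObservable<T> source)
{
    return source.Scan(0, (count, _) => count + 1).LastValueBeforeCompletion();
}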

I’ll check with Wes whether he’s happy for me to share his code – if so, I’ll put that and the original code into a zip file so it’s easy to compare the dull version with the shiny one.

Conclusion

It’s not enough to be able to think about Rx. To really appreciate it, you’ve got to be able to think in Rx. As I’d written a sort of "mini-Rx" before, I was arrogant enough to assume I already knew how to think in observable sequences… but apparently not. (To be fair to myself, it’s been a while and Push LINQ didn’t try to do anything genuinely asynchronous.)

I’m certainly not "in the zone" yet when it comes to Rx… but I think I can see it in the distance now. I’m heartily glad I raised my concerns over asynchronous aggregation – partly as encouragement to the team to consider including them in the framework, but mostly because it’s helped me appreciate the framework a lot better. With any luck, these two somewhat "stream of consciousness" posts may have helped you too.

Now to go over what I wrote last night for the book, and see how much of it was rubbish :)

First encounters with Reactive Extensions

I’ve been researching Reactive Extensions for the last few days, with an eye to writing a short section in chapter 12 of the second edition of C# in Depth. (This is the most radically changed chapter from the first edition; it will be covering LINQ to SQL, IQueryable, LINQ to XML, Parallel LINQ, Reactive Extensions, and writing your own LINQ to Objects operators.) I’ve watched various videos from Channel 9, but today was the first time I actually played with it. I’m half excited, and half disappointed.

My excited half sees that there’s an awful lot to experiment with, and loads to learn about join patterns etc. I’m also looking forward to trying genuine events (mouse movements etc) – so far my tests have been to do with collections.

My disappointed half thinks it’s missing something. You see, Reactive Extensions shares some concepts with my own Push LINQ library… except it’s had smarter people (no offense meant to Marc Gravell) working harder on it for longer. I’d expect it to be easier to use, and make it a breeze to do anything you could do in Push LINQ. Unfortunately, that’s not quite the case.

Subscription model

First, the way that subscription is handled for collections seems slightly odd. I’ve been imagining two kinds of observable sources:

  • Genuine "event streams" which occur somewhat naturally – for instance, mouse movement events. Subscribing to such an observable wouldn’t do anything to it other than adding subscribers.
  • Collections (and the like) where the usual use case is "set up the data pipeline, then tell it to go". In that case calling Subscribe should just add the relevant observers, but not actually "start" the sequence – after all, you may want to add more observers (we’ll see an example of this in a minute).

In the latter case, I could imagine an extension method to IEnumerable<T> called ToObservable which would return a StartableObservable<T> or something like that – you’d subscribe what you want, and then call Start on the StartableObservable<T>. That’s not what appears to happen though – if you call ToObservable(), you get an implementation which iterates over the source sequence as soon as anything subscribes to it – which just doesn’t feel right to me. Admittedly it makes life easy in the case where that’s really all you want to do, but it’s a pain otherwise.

There’s a way of working round this in Reactive Extensions: there’s Subject<T> which is both an observer and an observable. You can create a Subject<T>, Subscribe all the observers you want (so as to set up the data pipeline) and then subscribe the subject to the real data source. It’s not exactly hard, but it took me a while to work out, and it feels a little unwieldy. The next issue was somewhat more problematic.

Blocking aggregation

When I first started thinking about Push LINQ, it was motivated by a scenario from the C# newsgroup: someone wanted to group a collection in a particular way, and then count how many items were in each group. This is effectively the "favourite colour voting" scenario outlined in the link at the top of this post. The problem to understand is that the normal Count() call is blocking: it fetches items from a collection until there aren’t any more; it’s in control of the execution flow, effectively. That means if you call it in a grouping construct, the whole group has to be available before you call Count(). So, you can’t stream an enormous data set, which is unfortunate.
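
In plain LINQ to Objects terms, the sort of query I mean looks something like this (the file name is just for illustration):

// Blocking version: File.ReadAllLines reads the whole file up front, and even
// with a lazy reader, GroupBy would still buffer every line before any Count()
// call could return - so the whole data set has to fit in memory.
var histogram = File.ReadAllLines("some-big-file.txt")
                    .GroupBy(line => line.Length)
                    .Select(g => new { Length = g.Key, Count = g.Count() });

foreach (var entry in histogram)
{
    Console.WriteLine("Length: {0}; Count: {1}", entry.Length, entry.Count);
}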

In Push LINQ, I addressed this by making Count() return Future<int> instead of int. The whole query is evaluated, and then you can ask each future for its actual result. Unfortunately, that isn’t the approach that the Reactive Framework has taken – it still returns int from Count(). I don’t know the reason for this, but fortunately it’s somewhat fixable. We can’t change Observable of course, but we can add our own future-based extensions:

using System;
using System.Threading.Tasks;

public static class ObservableEx
{
    public static Task<TResult> FutureAggregate<TSource, TResult>
        (this IObservable<TSource> source,
        TResult seed, Func<TResult, TSource, TResult> aggregation)
    {
        TaskCompletionSource<TResult> result = new TaskCompletionSource<TResult>();
        TResult current = seed;
        source.Subscribe(value => current = aggregation(current, value),
            error => result.SetException(error),
            () => result.SetResult(current));
        return result.Task;
    }

    public static Task<int> FutureMax(this IObservable<int> source)
    {
        // TODO: Make this generic and throw exception on
        // empty sequence. Left as an exercise for the reader.
        return source.FutureAggregate(int.MinValue, Math.Max);
    }

    public static Task<int> FutureMin(this IObservable<int> source)
    {
        // TODO: Make this generic and throw exception on
        // empty sequence. Left as an exercise for the reader.
        return source.FutureAggregate(int.MaxValue, Math.Min);
    }

    public static Task<int> FutureCount<T>(this IObservable<T> source)
    {
        return source.FutureAggregate(0, (count, _) => count + 1);
    }
}

This uses Task<T> from Parallel Extensions, which gives us an interesting ability, as we’ll see in a moment. It’s all fairly straightforward – TaskCompletionSource<T> makes it very easy to specify a value when we’ve finished, or indicate that an error occurred. As mentioned in the comments, the maximum/minimum implementations leave something to be desired, but it’s good enough for a blog post :)

Using the non-blocking aggregation operators

Now that we’ve got our extension methods, how can we use them? First I decided to do a demo which would count the number of lines in a file, and find the maximum and minimum line lengths:

private static IEnumerable<string> ReadLines(string filename)
{
    using (TextReader reader = File.OpenText(filename))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}

var subject = new Subject<string>();
var lengths = subject.Select(line => line.Length);
var min = lengths.FutureMin();
var max = lengths.FutureMax();
var count = lengths.FutureCount();
            
var source = ReadLines("../../Program.cs");
source.ToObservable(Scheduler.Now).Subscribe(subject);
Console.WriteLine("Count: {0}, Min: {1}, Max: {2}",
                  count.Result, min.Result, max.Result);

As you can see, we use the Result property of a task to find its eventual result – this call will block until the result is ready, however, so you do need to be careful about how you use it. Each line is only read from the file once, and pushed to all three observers, who carry their state around until the sequence is complete, whereupon they publish the result to the task.

I got this working fairly quickly – then went back to the "grouping lines by line length" problem I’d originally set myself. I want to group the lines of a file by their length (all lines of length 0, all lines of length 1 etc) and count each group. The result is effectively a histogram of line lengths. Constructing the query itself wasn’t a problem – but iterating through the results was. Fundamentally, I don’t understand the details of ToEnumerable yet, particularly the timing. I need to look into it more deeply, but I’ve got two alternative solutions for the moment.

The first is to implement my own ToList extension method. This simply creates a list and subscribes an observer which adds items to the list as it goes. There’s no attempt at "safety" here – if you access the list before the source sequence has completed, you’ll see whatever has been added so far. I am still just experimenting :) Here’s the implementation:

public static List<T> ToList<T>(this IObservable<T> source)
{
    List<T> ret = new List<T>();
    source.Subscribe(x => ret.Add(x));
    return ret;
}

Now we can construct a query expression, project each group using our future count, make sure we’ve finished pushing the source before we read the results, and everything is fine:

var subject = new Subject<string>();
var groups = from line in subject
             group line.Length by line.Length into grouped
             select new { Length = grouped.Key, Count = grouped.FutureCount() };
var results = groups.ToList();

var source = ReadLines("../../Program.cs");
source.ToObservable(Scheduler.Now).Subscribe(subject);
foreach (var group in results)
{
    Console.WriteLine("Length: {0}; Count: {1}", group.Length, group.Count.Result);
}

Note how the call to ToList is required before calling source.ToObservable(...).Subscribe – otherwise everything would have been pushed before we started collecting it.

All well and good… but there’s another way of doing it too. We’ve only got a single task being produced for each group – instead of waiting until everything’s finished before we dump the results to the console, we can use Task.ContinueWith to write it (the individual group result) out as soon as that group has been told that it’s finished. We force this extra action to occur on the same thread as the observer just to make things easier in a console app… but it all works very neatly:

var subject = new Subject<string>();
var groups = from line in subject
             group line.Length by line.Length into grouped
             select new { Length = grouped.Key, Count = grouped.FutureCount() };
                                    
groups.Subscribe(group =>
{
    group.Count.ContinueWith(
         x => Console.WriteLine("Length: {0}; Count: {1}",
                                group.Length, x.Result),
         TaskContinuationOptions.ExecuteSynchronously);
});
var source = ReadLines("../../Program.cs");
source.ToObservable(Scheduler.Now).Subscribe(subject);

Conclusion

That’s the lot, so far. It feels like I’m sort of in the spirit of Reactive Extensions, but that maybe I’m pushing it (no pun intended) in a direction which Erik and Wes either didn’t anticipate, or at least don’t view as particularly valuable/elegant. I very much doubt that they didn’t consider deferred aggregates – it’s much more likely that either I’ve missed some easy way of doing this, or there are good reasons why it’s a bad idea. I hope to find out which at some point… but in the meantime, I really ought to work out a more idiomatic example for C# in Depth.