All posts by jonskeet


A simple extension method, but a beautiful one

This came up a little while ago in a newsgroup question, and Marc Gravell and I worked out a solution between us. I’ve finally included it in MiscUtil (although not released it yet – there’s a lot of stuff ready to go when we’ve finalised namespaces and updated the website etc) but I thought I’d share it here.

How often have you written code to do something like counting word frequencies, or grouping items into lists? I know a lot of this can be solved with LINQ if you’re using .NET 3.5, but in .NET 2.0 we’ve always been nearly there. Dictionaries have provided a lot of the necessary facilities, but there’s always the bit of code which needs to check whether or not we’ve already seen the key, and populate the dictionary with a suitable initial value if not – a count of 0, or an empty list for example.

There’s something that 0 and “empty list” have in common. They’re both the results of calling new TValue() for their respective TValue types of int and List<Whatever>. Can you see what’s coming? A generic extension method for dictionaries whose values are of a type which has a parameterless constructor, which returns the value associated with a key if there is one, or a new value (which is also inserted into the dictionary) otherwise. It’s really simple, but it’ll avoid duplication all over the place:

Note: This code has been updated due to comments below. Comments saying “Use TryGetValue” referred to the old version!


public static TValue GetOrCreate<TKey, TValue>(this IDictionary<TKey, TValue> dictionary,
                                               TKey key)
    where TValue : new()
{
    TValue ret;
    if (!dictionary.TryGetValue(key, out ret))
    {
        ret = new TValue();
        dictionary[key] = ret;
    }
    return ret;
}

The usage of it might look something like this:


var dict = new Dictionary<string,int>();

foreach (string word in someText)
{
    dict[word] = dict.GetOrCreate(word)+1;
}

I’m not going to claim this will set the world on fire, but I know I’m fed up with writing the kind of code which is in GetOrCreate, and maybe you are too.

Additional overloads are available to specify either a value to use when the key is missing, or a delegate to invoke to create a value.
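
Those overloads are only a small tweak to the method above. Something like this – the signatures are guessed, and the released MiscUtil versions may differ:

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryExtensions
{
    // Returns the existing value, or stores and returns defaultValue.
    public static TValue GetOrCreate<TKey, TValue>(
        this IDictionary<TKey, TValue> dictionary, TKey key, TValue defaultValue)
    {
        TValue ret;
        if (!dictionary.TryGetValue(key, out ret))
        {
            ret = defaultValue;
            dictionary[key] = ret;
        }
        return ret;
    }

    // Returns the existing value, or stores and returns the result of
    // calling valueProvider. Useful when creating the value is expensive.
    public static TValue GetOrCreate<TKey, TValue>(
        this IDictionary<TKey, TValue> dictionary, TKey key, Func<TValue> valueProvider)
    {
        TValue ret;
        if (!dictionary.TryGetValue(key, out ret))
        {
            ret = valueProvider();
            dictionary[key] = ret;
        }
        return ret;
    }
}
```

The delegate version only invokes the provider on a miss, so no work is wasted when the key is already present.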

C# 4, part 1: Looking back at the past

Everyone else is speculating about what’s going to be in C# 4 (and various possibilities are coming out of MS), so I thought it would be wise to start my own series of wishlist posts before I miss the boat completely.

In this first post, I’m not going to look at the future at all – I’m going to look at mistakes of the past. When I say “mistake” I of course mean “things I would have done differently had I been a language designer with 20/20 hindsight”. Of course, there’s a lot of room for argument :)

Mistakes in C# 1

  • Lack of separate getter/setter access for properties. This came in C# 2, but it should have been obvious that it was highly desirable long before C# 1 came out.
  • Lack of generics – ish. Don’t worry, I’m not going to claim that all the features of C# 2 and 3 should have been in C# 1, but if generics had been in at the start we could have avoided having the non-generic collections (and interfaces) completely. Mind you, I’m glad that the .NET team took their time instead of including the bodged (IMO) generics of Java 5.
  • Classes not being sealed by default. I’ve believed for a long time that allowing inheritance incurs a design cost (and it’s not like I’m unique in that respect). C# fixed a mistake of Java by making methods non-virtual by default; the same should be true for classes in my view.
  • Enums just being named numbers. Again, I’ve blogged about this before, but it’s worth mentioning again. It’s possible to work around the lack of this feature (as the blog post readers pointed out!) but framework and language support would have been very welcome.
  • The “\x” character escape sequence. Fortunately it’s rarely used, but it’s so error-prone. Quick, how different are “\x8Good compiler” and “\x8Bad compiler”? What’s the first character in each string? (This will appear soon on my brainteasers page.)
  • The switch statement. There are lots of ways in which this could have been better designed. VB addresses some of them (such as making it easier to express multiple matching values) but there are other ways in which this construct needed overhauling. Fallthrough is (rightly) prohibited, so why not just force braces round the code in the case block instead of requiring a break statement? Aside from anything else, that would fix the somewhat bizarre scoping rules.
  • Wacky overload resolution. I entirely understand the point that introducing new methods in a base type shouldn’t change the behaviour in derived types – but if you’ve explicitly chosen to override that method, that should be more easily callable than it is. (See the first example of the brainteasers page to see what I’m talking about.)
  • The “lock” keyword, and associated issues. Basically, the IDisposable pattern should have been used for locking, and not every object should have a monitor associated with it. Developers should keep a close eye on what’s being locked, and being able to lock on everything takes away from this. Likewise “lock” creates a keyword for little purpose (and one which would otherwise be useful as a variable name etc).
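
To spell out the \x problem from the list above: the escape greedily consumes up to four hex digits. In “\x8Bad compiler” the ‘B’, ‘a’ and ‘d’ are all valid hex digits, so they get swallowed into the escape; in “\x8Good compiler” the ‘G’ stops it after one digit:

```csharp
using System;

class EscapeDemo
{
    static void Main()
    {
        string good = "\x8Good compiler"; // \x8 (backspace), then "Good compiler"
        string bad = "\x8Bad compiler";   // \x8Bad is ONE character: U+8BAD

        Console.WriteLine((int)good[0]); // 8
        Console.WriteLine((int)bad[0]);  // 35757 (0x8BAD)
    }
}
```

Two strings that look almost identical in source, with completely different first characters – which is exactly why \u (always four digits) is safer.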

Mistakes in C# 2

  • Lack of partial methods. I’m really only saying this because it broke the format of C# in Depth slightly. I’ve introduced partial methods along with partial types because they logically fit in with them, and they don’t fit in with any of the other features of C# 3 particularly. This is just a matter of not working out all of how partial types would be used – or at least not doing so early enough. (For all I know partial methods were on the table before C# 2 shipped – I wouldn’t be surprised.)
  • Possibly the lack of generic variance. This is certainly a big issue of understanding which is often raised as a question. On the other hand, I suspect that if/when it becomes available, it will raise just as many questions in terms of the detail anyway…
  • The System.Nullable class. It’s really only there as an historical accident, and I know it’s not a C# issue as such – but even so… (Note for extra clarity – I’m fine with nullable types and the System.Nullable<T> struct. It’s just the supporting class that I don’t like.)
  • InternalsVisibleTo requiring the whole public key for strongly signed assemblies instead of the public key token (contrary to the documentation). Ick.

Mistakes in C# 3

  • It’s a real shame that readonly automatic properties aren’t in C# 3. I suspect they’ll come in C# 4 (and they’ll be on my wishlist in future posts) but I think it’s reasonable to wonder why they weren’t included in C# 3. Immutability is a known pattern of goodness, and although C# 4 may well contain any number of more significant improvements towards making it easier, readonly automatic properties would have been a good start.
  • The way that extension methods are found. This issue was raised time and time again before release, and I’ve never heard a good defence of finding them by whole namespace, instead of allowing developers to say “use the extension methods found in this class, please”. As it is, anyone writing their own extension methods is likely to end up with whole namespaces devoted to a single type. It’s very odd.

Of course that’s not to say there aren’t other things I’d like to see – but these are more “features which were slightly misdesigned” rather than “features which I really want”.

I’m not trying to take anything away from the language designers – C# is still easily my favourite language in terms of its design, particularly in C# 3, but nobody’s perfect :)

Next time I’ll start giving my opinions of features that other people are calling for.

Immutability and inheritance

In my book, I present an example of a Range<T> class to demonstrate iterator blocks. The range allows you to iterate over each element within it in the obvious fashion. There’s an abstract base class, and then a couple of concrete classes derived from that – enough to show the pattern. The base class is abstract because there’s a single abstract method, GetNextValue, which is required to take a current value and return the next one in the sequence. How this occurs depends on the types involved – in the case of a range of DateTime elements, it will add a TimeSpan each time, for instance, whereas an Int32Range will just add an int. Here’s a class diagram of how it looks:

The requirements for the code in the book were very simplistic, in order to be able to present all the code on the printed page. However, I wanted to expand this in MiscUtil and “do the job properly”. In particular, I wanted to be able to:

  • Reverse a range
  • Make a range exclusive (at the “far end” – a half-open interval)
  • Make an exclusive range inclusive
  • Do all of this while keeping the immutable nature of the code from the book

When trying to implement this, I discovered it was actually quite tricky. In particular, when using inheritance I ran into some obstacles:

  • Unless we use the original range as a proxy, creating a new range based on the original is tricky. We basically need to clone, and that’s fraught in various ways. MemberwiseClone will work in many situations, but it’s inelegant – and we can’t keep the fields marked readonly and still modify the cloned copy.
  • Reversing a range using just the original type constraint of T : IComparable<T> is a bit of a pain. You need to keep remembering which way to compare things. This is a bit of an aside, but using an IComparer<T> instead is a lot simpler – it’s really easy to build a new IComparer<T> which proxies to the original one and reverses the order of the parameters.
  • There’s no guarantee that just because the base class has no mutable data, the derived class will do likewise.

In addition, I realised I was using inheritance in a way that went against what I’d written near the end of the book: when using inheritance in a very limited way, consider using delegates instead. A Range<T> only needs extra behaviour to be specified in terms of comparisons (IComparer<T>) and how to take a step from one value to the next. The latter can easily be represented as a Func<T,T> in .NET 3.5.

My new design has a single sealed, immutable class:
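
In case the class diagram doesn’t make it across, the rough shape of the design is below. This is a simplified sketch rather than the MiscUtil code itself:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public sealed class Range<T> : IEnumerable<T>
{
    private readonly T start;
    private readonly T end;
    private readonly IComparer<T> comparer;
    private readonly Func<T, T> step;  // may be null: range then can't be iterated
    private readonly bool includesEnd;

    public Range(T start, T end, IComparer<T> comparer, Func<T, T> step, bool includesEnd)
    {
        this.start = start;
        this.end = end;
        this.comparer = comparer;
        this.step = step;
        this.includesEnd = includesEnd;
    }

    public T Start { get { return start; } }
    public T End { get { return end; } }

    // "Mutating" operations return new instances, preserving immutability
    // without any cloning trickery.
    public Range<T> Exclusive()
    {
        return new Range<T>(start, end, comparer, step, false);
    }

    public Range<T> Inclusive()
    {
        return new Range<T>(start, end, comparer, step, true);
    }

    public IEnumerator<T> GetEnumerator()
    {
        if (step == null)
        {
            throw new NotSupportedException("No step function specified");
        }
        // Compare <= 0 for inclusive ranges, < 0 for exclusive ones
        for (T value = start;
             comparer.Compare(value, end) < (includesEnd ? 1 : 0);
             value = step(value))
        {
            yield return value;
        }
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

Because the class is sealed and every field is readonly, each “modifying” operation just builds a fresh instance from the constructor – no MemberwiseClone required.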

There are still a few ways in which this isn’t ideal:

  • You can specify a null step function, in which case you can’t iterate over the range. I’d prefer the type to not implement IEnumerable<T> if it can’t do the job properly.
  • You have to specify a reverse step function if you want to iterate over the reverse of the range.
  • There are a heck of a lot of constructor overloads.

Now, none of these are horrendous, and I think it’s a lot nicer than it was before. I’ve currently got an additional non-generic Range class with a bunch of overloaded methods for creating ranges of various types. I can’t think of a decent name for these methods at the moment, so currently you’d write:

  • Range.Of(1, 5) // 1, 2, 3, 4, 5
  • Range.Of(1, 5).Exclusive() // 1, 2, 3, 4
  • Range.Of(1, 5, 2) // 1, 3, 5
  • Range.Of(DateTime.Today, DateTime.Now, TimeSpan.FromMinutes(1)) // Midnight, 1 minute past etc

I think it might be nicer to use extension methods for these, to allow:

  • 1.To(5)
  • 1.To(5).Exclusive()
  • 1.To(5).Step(2)
  • DateTime.Today.To(DateTime.Now).Step(TimeSpan.FromMinutes(1))

In order to do this nicely I may need to expose the comparer in Range<T> as well, but I don’t think that’s really a problem. Thoughts on this are welcome.
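
The extension methods themselves are easy to sketch. To keep this example self-contained I’ve used a deliberately minimal, int-only range type (all the names here are hypothetical), since it’s only the 1.To(5) syntax that matters:

```csharp
using System;
using System.Collections.Generic;

// Minimal immutable stand-in for the real generic range, just to
// demonstrate the extension-method syntax.
public sealed class IntRange
{
    private readonly int start, end, step;
    private readonly bool inclusive;

    public IntRange(int start, int end, int step, bool inclusive)
    {
        this.start = start;
        this.end = end;
        this.step = step;
        this.inclusive = inclusive;
    }

    public IntRange Exclusive() { return new IntRange(start, end, step, false); }
    public IntRange Step(int newStep) { return new IntRange(start, end, newStep, inclusive); }

    public IEnumerable<int> Values()
    {
        int limit = inclusive ? end : end - 1;
        for (int value = start; value <= limit; value += step)
        {
            yield return value;
        }
    }
}

public static class RangeExtensions
{
    // The method that makes "1.To(5)" legal.
    public static IntRange To(this int start, int end)
    {
        return new IntRange(start, end, 1, true);
    }
}
```

With that in place, 1.To(5).Step(2).Values() yields 1, 3, 5 and 1.To(5).Exclusive().Values() yields 1, 2, 3, 4.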

Anyway, the broad point of this post (other than to hopefully relieve my own insomnia – and possibly yours too) is that immutability and inheritance don’t mix terribly nicely, especially when you want to effectively clone an instance and modify some aspects. That’s not terribly surprising, but it is interesting – and it fits in with the experience that inheritance doesn’t mix terribly nicely with equality comparisons, either.

Bridging gaps, and finding my role

Warning: this post won’t teach you anything technical. It’s about how I see myself. That may be of interest to you, or it may not. If not, feel free to skip it knowing you’re not missing anything else.

One of the great problems of the world today is undoubtedly this problem of not being able to talk to scientists, because we don’t understand science. They can’t talk to us because they don’t understand anything else, poor dears. (Michael Flanders)

For a while, I’ve made myself slightly miserable (only slightly – I’m generally a very happy guy) by seeing just how impoverished my own understanding of computing is compared with my “heroes” in the industry: Eric Lippert, Joe Duffy, and Wes Dyer to name but three. I always learn a lot from their blogs, but often I don’t manage to take it all in. I understand enough about Wes’s post about monads to realise that I’ve probably implemented a monad with my (incredibly primitive) Future classes in “push LINQ” – but my grasp of them is tenuous at best. I understand enough about threading to be able to reason about concurrency in my day-to-day life, but I’m never going to have Joe’s depth of knowledge of either the Windows-specific stuff or the underlying principles and theories. I can hold an interesting (to me) conversation with Eric over email, but I suspect that if we were talking in real life I’d have to constantly ask him to slow down.

This used to bother me. I used to almost feel that it was unfair that others were so much smarter than me. Yes, I know how arrogant that sounds even in terms of ambition, and I’m not going to flinch away from the term. However, I’m a bit more comfortable with my place in life now. You see, just because they’re so much smarter than me doesn’t mean I’m useless. I want to be a bridge.

People have occasionally used the word “expert” about me, entirely inappropriately. I’m not an expert in threading, or floating point, or text encoding. I know a bit more about those topics than most developers, and sometimes that’s all that’s required to be labeled as an expert these days. After the last ten months, I could probably agree with the label when applied to C# as a language, although certainly not .NET as a framework, or even the act of developing in C#. I happen to have read the spec more closely than most people, and retained that information reasonably well, that’s all.

The trouble is, real experts can be hard to understand sometimes for the “average developer” (again, I know how that sounds; I’m not putting myself above this “average developer” in all or even many respects, just on the topics I happen to write about). Don’t get me wrong: it’s not because they lack communication skills (although that’s sometimes the case). It’s that a lot of what they talk about is at a level of depth which requires a lot of background knowledge for one to properly understand it. This is where I believe I can play a useful role – I like explaining the most useful bits of what I’ve understood from the experts, but in a way which hopefully any interested developer can understand.

C# in Depth is a good manifestation of that. If you want the unadulterated “truth” about C# (at least in theory) you look at the spec. But that’s frankly a pain to read, and there’s very little to distinguish the bits which are really important from the corner cases which you’re unlikely to ever encounter. I hope my book provides more depth than the level of knowledge most C# developers already have (and more depth than most other books provide), but without going so deep as to be impenetrably difficult to understand.

Having identified this role, I’m really quite happy to try to fulfil it. I just hope I can keep doing so. It’s a lot of fun.

Types of Parallelism

When I was at TechEd, Joe Duffy mentioned task parallelism and data parallelism a number of times. It was easy enough to follow what he meant, but I had to keep consciously thinking about the terms instead of instinctively knowing what they meant. This post is intended to ram the point home to me as much as anyone else – there’s nothing like writing things up to make them stick in your head.

I thought I had a modicum of success with my “real life example” in the post about “push” LINQ, so I’ll use another one here. The difference is that I won’t be presenting any actual code in this post, just concepts.

Example situation: the soup factory

Suppose you run a soup factory, producing tins of tomato soup. Now, to simplify things massively, we’re going to assume there are three tasks:

  1. Create empty tins. Each empty tin takes 5 minutes to make, and has 400ml volume.
  2. Cook the soup in a big saucepan. It takes one person 1 hour (of constant stirring etc) to cook the soup, and the saucepan holds 20 litres.
  3. Fill the tins with soup. It takes 2 minutes to fill a tin and seal it.

Obviously the numbers here are entirely arbitrary. They just give us something to play with. Now let’s look at different ways we can run the soup factory.

No parallelism: a single worker

We’ll start off with the simplest situation – a single worker, who will just perform one task at a time. Let’s look at how he might spend his time, starting at 8am:

Time          Activity
8:00-8:05     Make tin 1
8:05-8:10     Make tin 2
…
12:05-12:10   Make tin 50
12:10-13:10   Cook soup
13:10-13:12   Fill tin 1
13:12-13:14   Fill tin 2
…
14:48-14:50   Fill tin 50

Now, that’s just one way of doing it. He could have cooked the soup first, then made tin 1, filled tin 1, made tin 2, filled tin 2 etc. The only dependency is that we have both an empty tin and some soup before we fill the tin. Let’s give our worker a colleague and see how things can improve.

Just data parallelism: two workers doing the same thing

When we add a second worker, we can organize things any number of ways, as we’ll see in a minute. However, we’ll stick to a simple way at the moment. Each of the workers will deal with 25 tins, and one will relax while the other is cooking.

Time          Worker 1      Worker 2
8:00-8:05     Make tin 1    Make tin 2
8:05-8:10     Make tin 3    Make tin 4
…
10:00-10:05   Make tin 49   Make tin 50
10:05-11:05   Cook soup     Relax
11:05-11:07   Fill tin 1    Fill tin 2
11:07-11:09   Fill tin 3    Fill tin 4
…
11:53-11:55   Fill tin 49   Fill tin 50

This shows data parallelism. Both the “making tins” and “filling tins” tasks involve 50 tins which are each independent. We don’t need to wait for one tin to be full before starting another – with two people working, we can go twice as quickly. Note that this doesn’t hold for the cooking – two people cooking wouldn’t be able to get the work done any quicker.

With 50 people working like this, we’d have a total time of 1 hour and 7 minutes – five minutes of everyone making tins, one hour of soup cooking, and then two minutes of everyone filling tins. Adding extra workers at that point wouldn’t help. Now let’s try something different with our two workers.
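
In code, data parallelism is just the same operation applied to disjoint slices of the data by different workers. Here’s a bare-bones sketch with raw threads (all the names are mine, and a real program would typically use a thread pool rather than creating threads by hand):

```csharp
using System;
using System.Threading;

public class SoupFactory
{
    // Each "tin" is just its capacity in ml. Making one tin is the same work
    // regardless of which worker does it - that's what makes it data-parallel.
    public static void MakeTins(int[] tins, int from, int to)
    {
        for (int i = from; i < to; i++)
        {
            tins[i] = 400;
        }
    }

    static void Main()
    {
        int[] tins = new int[50];

        // Data parallelism: two workers, each making half of the tins.
        // The slices don't overlap, so no locking is needed.
        Thread worker1 = new Thread(() => MakeTins(tins, 0, 25));
        Thread worker2 = new Thread(() => MakeTins(tins, 25, 50));
        worker1.Start();
        worker2.Start();
        worker1.Join();
        worker2.Join();

        Console.WriteLine(Array.TrueForAll(tins, t => t == 400)); // prints True
    }
}
```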

Task parallelism: one person cooking while the other makes tins

We’ve seen that the cooking is the bottleneck – in our previous example, we were wasting time while the soup was cooking. We don’t want workers relaxing on the job! Let’s try a different kind of parallelism now – task parallelism, where instead of breaking a single task into multiple independent bits, we perform two wholly different tasks at the same time.

We’ll have one worker make tins while the other cooks soup. They can both fill tins afterwards.

Time          Worker 1      Worker 2
8:00-8:05     Make tin 1    Start cooking soup
8:05-8:10     Make tin 2    Still cooking…
…
9:00-9:05     Make tin 13   Finished cooking – relax
…
12:05-12:10   Make tin 50   Still relaxing
12:10-12:12   Fill tin 1    Fill tin 2
12:12-12:14   Fill tin 3    Fill tin 4
…
12:58-13:00   Fill tin 49   Fill tin 50

This is actually worse than the previous time – but could be improved by the second worker helping to make tins after the soup has finished cooking. This time, if we went up to 51 workers we’d have the fastest possible time for a single batch – 1 hour and 2 minutes.

Dependencies and pipelines

I think that’s enough detailed examples – it should be enough to demonstrate the difference between the two types of parallelism. However, I’d like to consider one more aspect: dependencies. When using task parallelism, you need to be aware of what dependencies are involved. As noted earlier, there’s no dependency between making tins and cooking the soup, but filling the tins depends on there being an empty tin available, and being soup available. That’s what makes the minimum possible time for a single batch 1 hour and 2 minutes rather than just the “longest single task” of 1 hour for cooking the soup.

So, does that mean that n batches of soup will take n * (1 hour and 2 minutes) to create? No – because we can form a pipeline for the work. Suppose we can start cooking another load of soup while the first one is being used to fill tins. We can get our average time for a batch down to an hour, using 7 people: 1 person always cooking, 4 people always making tins, 1 person always filling tins and 1 person making a couple of tins and then filling for the rest of the hour. The time taken for the first batch is relatively high, but thereafter we can keep producing a new batch of 50 full tins every hour.
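
A pipeline in code usually means a producer/consumer queue between the stages: the cook pushes finished batches onto a queue, and a filler pulls them off as they arrive. A hand-rolled sketch using Monitor.Wait/Pulse (all names are mine; later framework versions offer BlockingCollection for exactly this job):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public class SoupPipeline
{
    private readonly Queue<int> batches = new Queue<int>(); // each entry = ml of soup
    private bool finished;
    private int tinsFilled;

    // One thread cooks batch after batch; another fills tins from whatever
    // batches are ready. The queue between them is the pipeline.
    public int Run(int batchCount)
    {
        Thread cook = new Thread(() => Cook(batchCount));
        Thread filler = new Thread(Fill);
        cook.Start();
        filler.Start();
        cook.Join();
        filler.Join();
        return tinsFilled;
    }

    private void Cook(int batchCount)
    {
        for (int i = 0; i < batchCount; i++)
        {
            lock (batches)
            {
                batches.Enqueue(20000); // a 20-litre batch, ready for filling
                Monitor.Pulse(batches);
            }
        }
        lock (batches)
        {
            finished = true; // no more soup is coming
            Monitor.Pulse(batches);
        }
    }

    private void Fill()
    {
        while (true)
        {
            int batch;
            lock (batches)
            {
                while (batches.Count == 0 && !finished)
                {
                    Monitor.Wait(batches); // nothing to fill from yet
                }
                if (batches.Count == 0)
                {
                    return; // the cook has finished and the queue is drained
                }
                batch = batches.Dequeue();
            }
            tinsFilled += batch / 400; // 50 tins per 20-litre batch
        }
    }

    static void Main()
    {
        Console.WriteLine(new SoupPipeline().Run(3)); // prints 150
    }
}
```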

Few processes in computing are as straightforward as this “assembly line” – but it’s a useful model to remind us of the two types of parallelism possible.

Extension methods on lambda expressions don’t work, unfortunately

Over the Christmas holidays, I thought I’d experiment with something I’d been thinking about a little – sorting a generic IList<T>. Now, before anyone gets huffy, I’m well aware of OrderBy in LINQ to Objects. However, sometimes you want to sort collections in-place, and as IList<T> provides random access, there’s no reason we shouldn’t be able to. Now, I do like the way that OrderBy allows multiple criteria to be specified, whether they should be applied in an ascending or descending fashion, and by way of just “compare by this projection” rather than having to actually implement the comparison yourself. I thought I could probably make use of those ideas again.

Unlike in LINQ to Objects, the sort would occur immediately, which means I couldn’t use quite the chained syntax of OrderBy(...).ThenBy(...).ThenByDescending(...) but my plan was to allow code like this:

List<Person> myList = …;

myList.SortBy((person => person.Name).Ascending(),
              (person => person.DateOfBirth).Descending(),
              (person => person.SocialSecurityNumber).Ascending());

Because each projection could have a different output type, that would only work for as many criteria as I’d written overloads for. An alternative would be:

Comparer<Person> sorter = Comparer.By(person => person.Name)
                          .ThenByDescending(person => person.DateOfBirth)
                          .ThenBy(person => person.SocialSecurityNumber);

sorter.Sort(myList);

I did like the idea of the Ascending and Descending extension methods though, operating on Func<T1,T2>. Unfortunately, the dot operator doesn’t work on lambda expressions, even though the expression itself is implicitly convertible to Func<T1,T2>.

My plan isn’t entirely scuppered – the latter syntax will still work, and there are probably some alternatives I can work out. I think there are nice possibilities around extension methods and delegates though, and it’s a shame they’re not useful for lambda expressions. Ah well.
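
To be clear about what does and doesn’t work: the extension method itself is fine – it’s only calling it directly on a lambda that the compiler rejects, because a lambda expression has no type of its own until it’s converted. Give it a type first (via a variable or a cast) and everything works. Here’s a sketch, where Ascending and Descending are my guessed shapes, returning a Comparison<T>:

```csharp
using System;
using System.Collections.Generic;

public static class FuncExtensions
{
    // Turn a projection into an ascending comparison on the projected keys.
    public static Comparison<T> Ascending<T, TKey>(this Func<T, TKey> projection)
        where TKey : IComparable<TKey>
    {
        return (x, y) => projection(x).CompareTo(projection(y));
    }

    // Same projection, reversed order of comparison.
    public static Comparison<T> Descending<T, TKey>(this Func<T, TKey> projection)
        where TKey : IComparable<TKey>
    {
        return (x, y) => projection(y).CompareTo(projection(x));
    }
}

class Demo
{
    static void Main()
    {
        // (s => s.Length).Ascending() doesn't compile: the lambda has no type,
        // so there's nothing for the compiler to look up extension methods on.
        // Either of these works, though:
        //   ((Func<string, int>)(s => s.Length)).Ascending()
        Func<string, int> byLength = s => s.Length;

        List<string> words = new List<string> { "ccc", "a", "bb" };
        words.Sort(byLength.Ascending());
        Console.WriteLine(string.Join(",", words.ToArray())); // a,bb,ccc
    }
}
```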

LINQ to Objects – not just for in-memory collections

I’ve just seen LINQ to Objects described as the LINQ provider for “in-memory collections” again. It’s a fairly frequent occurrence, and I may have done it myself on occasion. It doesn’t do LINQ to Objects justice. An example I’ve used in a few places is a query which runs over log files. Something along the lines of:

var query = from file in Directory.GetFiles(@"c:\logs", "*.log")
            from line in new LineReader(file)
            let entry = new LogEntry(line)
            where entry.Severity == Severity.Critical
            select entry;

Where’s the in-memory collection here? I suppose there’s the array of log file names, but that’s about it. LINQ to Objects isn’t restricted to datasets which fit comfortably in memory. The above query could process many gigs of data very easily, limited basically by disk speed (and date/time parsing speed in my experience, but that’s a topic for another post).
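
For reference, LineReader is essentially just a lazily-evaluated sequence of lines. This is a simplified version of the idea, not the MiscUtil implementation itself:

```csharp
using System.Collections;
using System.Collections.Generic;
using System.IO;

// An enumerable sequence of lines from a file, opened lazily and streamed:
// only one line ever needs to be in memory at a time.
public sealed class LineReader : IEnumerable<string>
{
    private readonly string filename;

    public LineReader(string filename)
    {
        this.filename = filename;
    }

    public IEnumerator<string> GetEnumerator()
    {
        // The file isn't opened until iteration actually starts, and the
        // using statement closes it even if iteration is abandoned early.
        using (StreamReader reader = File.OpenText(filename))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

Because nothing ever materialises the whole file, the query above streams through gigabytes of logs in constant memory.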

What LINQ to Objects does is in-process querying of enumerable sequences of data. More of a mouthful than “querying in-memory collections” but more accurate, IMO.

Rant over. Well, for a couple of minutes.

“Push” LINQ revisited – next attempt at an explanation

Marc Gravell and I have now implemented a lot of LINQ standard query operators on the “push” model of IDataProducer as opposed to the “pull” model of IEnumerable. My good friend Douglas Leeder (who doesn’t use C#) has been with me this weekend, and through explaining the “big picture” to him in various ways, and taking his feedback, I think I’ve now got a good way of communicating it. Voting.

It’s a “real life analogy” which is always dangerous – don’t think of it too literally – I’m not claiming that it’s meant to be an absolute 1:1 correspondence. However, I think it neatly demonstrates the problem and some of the benefits of the solution we’ve come up with.

In order to make all this concrete, all of the code is real, and can be downloaded as a zip of a VS2008 solution. It contains a binary of an unreleased version of MiscUtil which is where the DataProducer stuff currently lives.

Real life situation

Let’s suppose we’re trying to find out what the favourite colour is of everyone in the world. Now, for the purposes of the demo code, there are only four colours and six people in the world: that makes the diagrams nice and easy, and we can see results simply too. Extending the data to the rest of the real world is left as an exercise to the reader. We may also want additional information, such as the average ages of people voting for particular colours.

Here’s our complete sample data – the five members of my family, and Douglas:

Name      Age   Favourite colour
Jon       31    Blue
Douglas   28    red
Holly     31    Purple
Tom       4     Pink
Robin     1     RED
William   1     blue

Note how the colours are specified with variations of case. We’ll use that later as an example of why you might need to specify a “key comparer”.

There are various ways of implementing this in LINQ, and for each model we’ll provide code and think about how it would work in the real world.

Model 1: “Pull” model

This is the model which “normal” LINQ uses – you only ever pull data, using an IEnumerable<T>. Here’s a simple query expression which gives the answers we want (admittedly unordered – we’ll ignore that for now):


var query = from voter in Voter.AllVoters()
            group voter by voter.FavouriteColour.ToUpper() into grouped
            select new { Colour = grouped.Key, Votes = grouped.Count() };

foreach (var entry in query)
{
    Console.WriteLine("Colour {0} has {1} votes", entry.Colour, entry.Votes);
}

There are two problems here.

Firstly, we’re using ToUpper() to get round the “RED != red” problem. This is not only bad in terms of internationalisation, but it also loses data. We really want to get the original string as the key, and then use a case-insensitive comparer. We can do this by a manual call to GroupBy instead of using query expressions – there’s an overload which takes an IEqualityComparer.
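
To make that concrete, here’s what the manual GroupBy call looks like with StringComparer.OrdinalIgnoreCase (an ordinal rather than culture-sensitive comparison, which also sidesteps the internationalisation problems of ToUpper):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VoteCounter
{
    // The GroupBy overload taking an IEqualityComparer<string> merges
    // "RED" and "red" without losing the original casing: the key of each
    // group is the first spelling seen.
    public static Dictionary<string, int> CountVotes(IEnumerable<string> colours)
    {
        return colours.GroupBy(c => c, StringComparer.OrdinalIgnoreCase)
                      .ToDictionary(g => g.Key, g => g.Count());
    }

    static void Main()
    {
        var votes = CountVotes(new[] { "Blue", "red", "Purple", "Pink", "RED", "blue" });
        foreach (var entry in votes)
        {
            Console.WriteLine("Colour {0} has {1} votes", entry.Key, entry.Value);
        }
    }
}
```

With the sample data, “Blue” and “red” each get two votes, keeping their originally-seen capitalisation.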

Secondly, the result of the “group … by” keeps all the voter data temporarily. It has to all be available at the same time before the “select” kicks in. This runs counter to the normal “streaming” idea of LINQ. This is inherent in the nature of the “pull” model, as I’ll explain in a minute.

Now, let’s see what this looks like in the real world. People come into a room through a door, and a “grouper” asks them for their favourite colour. The grouper then tells each voter (immediately) which corner of the room to stand in. The result at the end of the grouping is this:

After the initial grouping, another person goes to each group in turn, finding out their key and doing a head count. That group is then free to go. The important thing is that this person can’t do their job until all the data is in, because they’ve got to be able to see everyone in order to count them.

Improvement to pull model: just keep a token presence

The fact that we used “group voter by …” meant that the result of the grouping still involved whole people. As we’re only going to do a head count, we only need something saying “There was a person here.” We can change our original query to do that quite easily:


var query = from voter in Voter.AllVoters()
            group 1 by voter.FavouriteColour.ToUpper() into grouped
            select new { Colour = grouped.Key, Votes = grouped.Count() };

foreach (var entry in query)
{
    Console.WriteLine("Colour {0} has {1} votes", entry.Colour, entry.Votes);
}


This time, after the grouping takes place, the room looks like this:

The use of 1 here is purely incidental: we could have used ‘group “Spartacus” by …’ and the results would have been the same. It’s just something which can be counted.

Now, there’s good and bad here:

  • We’re not taking as much memory here. If voters have large amounts of data attached to them, we’ve reduced our requirements significantly.
  • We still have one object per voter, all in memory at the same time. Think “population of the world”.
  • We’ve lost our age data, which would make any extra aggregation impossible.

Model 2: “Push” model

The problem with the pull model is that each aggregator always wants to be the only thing pulling. The call to MoveNext will block until more data is available. That’s a real problem when you want to have multiple aggregators (one vote counter per colour). We could do a complicated threading manoeuvre, with each colour getting its own thread and the “grouper” pushing items out to relevant threads. Again though, that doesn’t scale – the four extra threads in our example aren’t too bad, but imagine other groupings with potentially thousands of keys.

The alternative is to change the model. Instead of having a greedy aggregator pulling data, we change to aggregators who observe data being pushed past them, and also observe a special “all the data has now been pushed” signal. Before we look at the code to do this, let’s think about what it could be like in real life. We don’t know how many different colours will be voted on, but we know what we need to do with each one: count the number of votes for them. In detail, the situation would be something like this:

  • The grouper stands just inside the door of the room, and “pulls” voters in the normal way
  • For any voter:
    • Ask the voter which colour they wish to vote for
    • Check to see if that colour is a “new” one. If it is, create a “counter” person for that colour, and position them by an exit in the room. (We create new exits as we go. We’ll assume there’s a sledgehammer at the ready.)
    • Send the voter past the relevant “counter” person, through the exit near them
    • Each counter just counts how many voters they see going past them
  • When all voters have been pulled, tell each of the counters and ask them how many people they saw
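
In code, the push model boils down to an interface with a “data pushed past you” notification and an “all the data has now been pushed” signal. This is a sketch of the idea rather than the actual MiscUtil definitions, which may differ in detail:

```csharp
using System;

// The shape of the push model: observers subscribe, then data flows past.
public interface IDataProducer<T>
{
    event Action<T> DataProduced;
    event Action EndOfData;
}

public class DataProducer<T> : IDataProducer<T>
{
    public event Action<T> DataProduced;
    public event Action EndOfData;

    public void Produce(T item)
    {
        Action<T> handler = DataProduced;
        if (handler != null) handler(item);
    }

    public void End()
    {
        Action handler = EndOfData;
        if (handler != null) handler();
    }
}

class PushDemo
{
    static void Main()
    {
        DataProducer<int> producer = new DataProducer<int>();
        int count = 0, sum = 0;

        // Two "observers" watching the same stream of data going past -
        // neither of them blocks, so no extra threads are needed.
        producer.DataProduced += x => count++;
        producer.DataProduced += x => sum += x;
        producer.EndOfData += () => Console.WriteLine("{0} items, sum {1}", count, sum);

        foreach (int x in new[] { 1, 2, 3 }) producer.Produce(x);
        producer.End(); // prints "3 items, sum 6"
    }
}
```

Each “counter person” from the analogy is just another event handler: adding one more aggregator costs a delegate, not a thread.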

We never have more than one voter in the room at once:

Let’s have a look at the code involved now.

Using the “push” model

There are two sides to the code here: the code that the LINQ user has to write, and the code Marc Gravell and I have implemented. We’ll look at the client code in a few different scenarios.

1) GroupWithPipeline in the middle of normal LINQ

Keeping to the normal “start with a data source, do something, then select” model involves stepping away from query expressions. We’ve got a new extension method on IEnumerable<T> called GroupWithPipeline, which takes a key selector (just like the normal GroupBy) and what to do with the results of each grouping. Here’s the new code (which requires a using directive for MiscUtil.Linq.Extensions):

var query = Voter.AllVoters()
                 .GroupWithPipeline(voter => voter.FavouriteColour.ToUpper(),
                                    voters => voters.Count())
                 .Select(grouped => new { Colour = grouped.Key, Votes = grouped.Value });

foreach (var entry in query)
{
    Console.WriteLine("Colour {0} has {1} votes", entry.Colour, entry.Votes);
}

How about making this a bit smarter now? Let’s try to also work out the minimum and maximum ages of the voters for each colour. Conceptually this is just a case of adding extra observers along with each vote counter in the “real life” model above. The code is remarkably simple:

var query = Voter.AllVoters()
                 .GroupWithPipeline(voter => voter.FavouriteColour.ToUpper(),
                                    voters => voters.Count(),
                                    voters => voters.Min(voter => voter.Age),
                                    voters => voters.Max(voter => voter.Age))
                 .Select(grouped => new { Colour = grouped.Key,
                                          Votes = grouped.Value1,
                                          MinAge = grouped.Value2,
                                          MaxAge = grouped.Value3});

foreach (var entry in query)
{
    Console.WriteLine("Colour {0} has {1} votes. Age range: {2}-{3}", entry.Colour, entry.Votes, entry.MinAge, entry.MaxAge);
}

The fact that it uses "Value1", "Value2" and "Value3" isn't ideal, but unfortunately there's no way round that as far as we've worked out – at least for this approach.
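For the curious, here's roughly how the single-aggregate overload of GroupWithPipeline could work. This is an illustrative sketch, not the real MiscUtil implementation – the Future and DataProducer stand-ins at the top are minimal assumptions, just enough to show the one-producer-per-key mechanism:

```csharp
using System;
using System.Collections.Generic;

// Minimal stand-ins for the MiscUtil types, just enough for this sketch.
public class Future<T> { public T Value { get; set; } }

public class DataProducer<T>
{
    public event Action<T> DataProduced;
    public event Action EndOfData;
    public void Produce(T item) { if (DataProduced != null) DataProduced(item); }
    public void End() { if (EndOfData != null) EndOfData(); }
}

public static class GroupWithPipelineSketch
{
    // One producer per key; aggregators are hooked up the first time
    // a key is seen, then all the data flows through in a single pass.
    public static IEnumerable<KeyValuePair<TKey, TResult>> GroupWithPipeline<TElement, TKey, TResult>(
        this IEnumerable<TElement> source,
        Func<TElement, TKey> keySelector,
        Func<DataProducer<TElement>, Future<TResult>> resultSelector)
    {
        var producers = new Dictionary<TKey, DataProducer<TElement>>();
        var futures = new List<KeyValuePair<TKey, Future<TResult>>>();

        foreach (TElement element in source)
        {
            TKey key = keySelector(element);
            DataProducer<TElement> producer;
            if (!producers.TryGetValue(key, out producer))
            {
                // New key: create a producer ("new exit") and let the
                // caller's aggregator subscribe to it before any data flows.
                producer = new DataProducer<TElement>();
                producers[key] = producer;
                futures.Add(new KeyValuePair<TKey, Future<TResult>>(key, resultSelector(producer)));
            }
            producer.Produce(element);
        }

        // Signal "no more data" so each aggregator can finalise its result.
        foreach (var producer in producers.Values)
        {
            producer.End();
        }
        foreach (var pair in futures)
        {
            yield return new KeyValuePair<TKey, TResult>(pair.Key, pair.Value.Value);
        }
    }
}
```

Note that because this is an iterator block, nothing happens until the caller starts enumerating – which is exactly why it slots into a normal LINQ pipeline.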

2) Using DataProducer directly for multiple aggregates

GroupWithPipeline uses a few types internally which you can use directly instead: DataProducer (implementing IDataProducer) and Future (implementing IFuture). If I go into the details here, I'll never get this posted – but that may come in another post if there's enough interest. However, let's have a look at how it can be used. First, let's find the results of a few aggregates of our voters, this time without any groupings:

// Create the data source to watch
DataProducer<Voter> voters = new DataProducer<Voter>();

// Add the aggregators
IFuture<int> total = voters.Count();
IFuture<int> adults = voters.Count(voter => voter.Age >= 18);
IFuture<int> children = voters.Where(voter => voter.Age < 18).Count();
IFuture<int> youngest = voters.Min(voter => voter.Age);
IFuture<int> oldest = voters.Select(voter => voter.Age).Max();

// Push all the data through
voters.ProduceAndEnd(Voter.AllVoters());

// Write out the results
Console.WriteLine("Total voters: {0}", total.Value);
Console.WriteLine("Adult voters: {0}", adults.Value);
Console.WriteLine("Child voters: {0}", children.Value);
Console.WriteLine("Youngest voter age: {0}", youngest.Value);
Console.WriteLine("Oldest voter age: {0}", oldest.Value);

The output of the code is what you’d expect, but there are a few things to note:

  1. Each aggregate returns an IFuture<int> instead of an int. This is because we set up the aggregators before we produce any data. We need to use the Value property to get the actual value back after we’ve produced the data.
  2. Just to hammer the point home, we must set up the aggregators (calling Count etc) before we produce the data (in ProduceAndEnd). Otherwise the aggregators won’t have any data to work with.
  3. We can chain operators together (Select and then Max, or Where and then Count) just as with normal LINQ.
  4. We’re applying multiple aggregates, but the data is only being produced once. This just can’t be done with normal LINQ. ProduceAndEnd takes an IEnumerable<T> which could be another LINQ query – something fetching large amounts of data from files, etc. Everything will be streamed appropriately.
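To see why the ordering in point 2 matters, it helps to sketch how an aggregator like Count might be wired up underneath. Again, these are assumed stand-ins rather than the real MiscUtil code (which handles far more, including the richer IFuture), but they show the mechanism:

```csharp
using System;
using System.Collections.Generic;

// Stand-ins for the MiscUtil types (assumptions, not the real code).
public class Future<T> { public T Value { get; set; } }

public class DataProducer<T>
{
    public event Action<T> DataProduced;
    public event Action EndOfData;

    public void ProduceAndEnd(IEnumerable<T> items)
    {
        foreach (T item in items)
        {
            if (DataProduced != null) DataProduced(item); // push each item
        }
        if (EndOfData != null) EndOfData(); // signal completion, exactly once
    }
}

public static class AggregatorSketch
{
    // Count subscribes *before* any data flows; the future is only
    // completed when the end-of-data signal arrives.
    public static Future<int> Count<T>(this DataProducer<T> source)
    {
        var result = new Future<int>();
        int count = 0;
        source.DataProduced += item => count++;
        source.EndOfData += () => result.Value = count;
        return result;
    }
}
```

An aggregator set up after ProduceAndEnd would simply never see any events – hence the rule that everything must be subscribed first.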

3) Using DataProducer in query expressions

This part wouldn't have been available when I started writing this post. I hadn't quite realised the power of the pattern yet. By implementing GroupBy on IDataProducer, we can perform the original grouping in a query expression, in a pretty normal kind of way… except that this time we can apply multiple aggregates, never buffering the data beyond the results of the aggregation:

DataProducer<Voter> voters = new DataProducer<Voter>();

var query = from voter in voters
            group voter by voter.FavouriteColour.ToUpper() into grouped
            select new { Colour = grouped.Key, 
                         Votes = grouped.Count(),
                         MinAge = grouped.Min(voter => voter.Age),
                         MaxAge = grouped.Max(voter => voter.Age)};

var results = query.AsEnumerable();

voters.ProduceAndEnd(Voter.AllVoters());

foreach (var entry in results)
{
    Console.WriteLine("Colour {0} has {1} votes. Age range: {2}-{3}",
                      entry.Colour, entry.Votes.Value, 
                      entry.MinAge.Value, entry.MaxAge.Value);
}

There’s just one tricky bit in here – you must call AsEnumerable before the data is produced, otherwise the aggregators will stream all their data with nothing watching for the results. In fact, AsEnumerable builds a list internally – the final results are buffered, but only those results. There’s really not a lot that can be done about that.
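That buffering is easy to picture. Assuming the DataProducer described earlier exposes a DataProduced event (an assumption on my part – the real MiscUtil member may differ), a sketch of AsEnumerable is almost a one-liner:

```csharp
using System;
using System.Collections.Generic;

public static class AsEnumerableSketch
{
    // Hypothetical producer shape, as assumed elsewhere in this post.
    public class DataProducer<T>
    {
        public event Action<T> DataProduced;
        public void Produce(T item) { if (DataProduced != null) DataProduced(item); }
    }

    public static IEnumerable<T> AsEnumerable<T>(this DataProducer<T> source)
    {
        // The list fills up as items are produced; the caller must only
        // enumerate it after production has finished.
        var buffer = new List<T>();
        source.DataProduced += buffer.Add;
        return buffer;
    }
}
```

Subscribing is what makes the before/after rule bite: call AsEnumerable after ProduceAndEnd and the buffer stays empty forever.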

So, there we go. That may or may not be a bit clearer now. I'm still learning the power of the pattern, its potential uses, and the best ways of explaining it. Feedback is very welcome, both on the technical front and about the explanation. I'm absolutely convinced that it's a useful pattern in some situations (though not all). All the code will be released as part of MiscUtil eventually, of course – we're still tidying it up and producing a bit more documentation at the moment.

Last post about the book (nearly) and a new source of information…

It’s about time I stopped using this blog to mention the book every couple of posts. Fortunately, I now have a new blog of sorts – well, a news page with an RSS feed. It’s part of the book’s web site – completely independent of Manning’s page for the book (which includes a link to their forum for it).

The web site is present in skeletal form – there are placeholders for everything I currently intend to include there, but none of the real content yet. That will be the work of the next couple of months.

The book itself is now almost out of my hands – it’s gone for copy editing and technical review, so I’m in reactive mode instead of proactive. It’s been nice to have time to play with some code for a change :) In particular, I’m quite pleased with the RSS generator. I could have found a third party library, I’m sure – but the mixture of LINQ to SQL and LINQ to XML sorts it out in a mere 34 lines, which is pretty neat.

Anyway, I expect I’ll post again here when I finally get my hands on a proper printed copy of the book, but until then I promise to keep quiet :)

Boxing day

Here in the UK (and possibly elsewhere – I’m too lazy to check Wikipedia) December 26th is called Boxing Day. I want to know why only this one aspect of .NET is given its own public holiday. Expanding the opportunities to language features as well as those of the runtime, I’d like to see:

  • Null coalescing day (“Take 3 nulls into the shower? Not me. I just coalesce and go!”)
  • Type inference day (not popular with HR)
  • Garbage collection day (as a proper holiday, not just when the bins are collected)
  • Just-In-Time day (arguably every day for procrastinators such as myself)
  • Lambda expression day (get those lambdas out of the closet)

I could go on, but it’s a pretty thin joke and dinner is ready.