Category Archives: LINQ

LambdaExpression – source code would be nice

Just taking a quick break from proof-reading to post a thought I had yesterday. Visual LINQ uses ToString() to convert an expression tree’s body into readable text. In some cases it works brilliantly, reproducing the original source code exactly – but in other cases it’s far from useful. For instance, from this expression tree representation:

(Convert(word.get_Chars(0)) = 113)

I suspect you wouldn’t have guessed that the original code was:

word[0] == 'q'

Personally I don’t find it terrifically obvious, even though I can see how it works. Here’s a thought though – suppose the LambdaExpression class had a Source property which the compiler could optionally provide as a constructor argument, so that you could get at the original source code if you needed it. The use in Visual LINQ is obvious, but would it be useful elsewhere?

Well, suppose that LINQ to SQL couldn’t translate a particular query expression into SQL. Wouldn’t it be nice if it could report the part of the query expression it was having trouble with in the language it was originally written in? So a VB expression would give source in VB, a C# expression would give source in C# etc.

All of this would have to be optional, and I suspect some people would have intellectual property concerns about their source leaking out (most of which would be silly, due to the logic effectively being available with a call to ToString() anyway). I think it would be quite handy in a few situations though.

Visual LINQ: Watch query expressions as they happen!

Critical link (in case you can’t find it): Source Code Download

Update: Dmitry Lyalin has put together a screencast of Visual LINQ in action – it gives a much better idea of what it’s like than static pictures do. There’s music, but no speech – so you won’t be missing any important information if you mute it. 

I was going to save this until it was rather more polished, but I’ve just started reading the first proofs of C# in Depth, so it’s unlikely I’ll have much time for coding in the near future. I’m too excited about the results of Monday night to keep them until the book’s done :)

After my Human LINQ experiment, I was considering how it might be possible to watch queries occurring on a computer. I had the idea of turning LINQ to Objects into WPF animations… and it just about works. The initial version took about 3 hours on Monday night, and it’s had a few refinements since. It’s very definitely not finished, but I’ll go into that in a minute.

The basic idea is that you write a nearly-normal query expression like this:

 

VisualSource<string> words = new VisualSource<string>
    (new[] { "the", "quick", "brown", "fox", "jumped",
             "over", "the", "lazy", "dog" },
     canvas, "words");

var query = from word in words
            where word.Length > 3
            select new { Word = word, Length = word.Length };

pipeline = query.End();

… and then watch it run. At the moment, the code is embedded in the constructor for MainWindow, but obviously it needs to be part of the UI at some point in the future. To explain the above code a little bit further, the VisualSource class displays a data source on screen, and calling End() on a query creates a data sink which will start fetching data as soon as you hit the relevant “Go!” button in the UI. Speaking of which, here’s what you see when you start it:

When you tell it to go, words descend from the source, and hit the “where” clause:

As you can see, “the” doesn’t meet the criterion of “word.Length > 3”, so the answer “False” is fading up. Fast forward a few seconds, and we’ll see the first passing result has reached the bottom, with the next word on its way down:

Results accumulate at the bottom so you can see what’s going on:

To be honest, it’s better seen “live” as an animation… but the important thing is that none of the text above is hand-specified (other than the data and the source name). If you change the query and rebuild/run, the diagram will change – so long as I’ve actually implemented the bits you use. So far, you can use:

  • select
  • where
  • group (with trivial or non-trivial elementSelector)
  • multiple “from” clauses (invoking SelectMany)

Select and Where have the most polish applied to them – they’ve got relatively fancy animations. Grouping works, but it appears to be just swallowing data until it spews it all out later – I want to build up what it’s going to return visually as it’s doing it. SelectMany could be a lot prettier than it is.

So, what’s wrong with it? Well:

  • Ordering isn’t implemented
  • Join isn’t implemented
  • GroupJoin isn’t implemented
  • The animation could do with tweaking (a lot!)
  • The code itself is fairly awful
  • The UI is unpolished (but functional) – my WPF skills are sadly lacking
  • It would be nice to lay the query out more sensibly (it gets messy with multiple sources for multiple “from” clauses)
  • There’s no way for the user to enter a query (or their own data sources) yet

Despite all of that though, I’m really pleased with it. It uses expression trees to create a textual representation of the logic, then compiles them to delegates for the actual projection etc. A bit of jiggery pokery was needed to make anonymous types look nice, and I dare say there’ll be more of that to come.

Interested yet? The source code is available so you can play with it yourself. Let me know if you plan to make any significant improvements, and we could consider starting a SourceForge project for it.

Human LINQ

Last night I gave a talk about C# 3 and LINQ, organised by Iterative Training and NxtGenUG. I attempted to cover all the features of C# 3 and the basics of LINQ in about an hour and a half or so. It’s quite a brutal challenge, and obviously I wasn’t able to go into much detail about anything. It went down reasonably well, but I can’t help feeling there’s a lot of room for improvement. That said, there was one part of the talk which really did go well, and made the appropriate points effectively.

I had demonstrated the following LINQ query in code:

using System;
using System.Linq;

class Test
{
    static void Main(string[] args)
    {
        var words = new[] {"the", "quick", "brown", "fox",
                "jumped", "over", "the", "lazy", "dog"};

        var query = from x in words
                    where x.Length > 3
                    where x[0] != 'q'
                    select x.ToUpper();

        foreach (string word in query)
        {
            Console.WriteLine(word);
        }
    }
}

I know it would have been easy to combine the two “where” clauses, but separating them helped with the second part of the demonstration.

In the pizza break, I had prepared sheets of paper – some with the words on (“the”, “quick” etc), and some with clauses (“from x in words”, “where x.Length > 3” etc). I asked for 5 volunteers, who I arranged in a line, facing the rest of the audience. I stood at “audience left” with the sheets of words, gave the next person the “from” clause, then the first “where” clause etc. The person at the far end didn’t have a sheet of paper, but they were acting as the foreach loop.

I suspect you can see what’s coming – we ran possibly the slowest LINQ query in the world. However, we did it reasonably accurately: when the next piece of information was needed, a person would turn to their neighbour and request it; the request would pass up the line until it got to me, whereupon I’d hand over a sheet of paper with a word on. If a “where” clause filtered out a word, they just dropped the piece of paper. When a word reached the far end, the guy shouted it out.

With this, I was able to illustrate:

  • Deferred execution (nothing starts happening until the foreach loop executes)
  • Streaming (only a single word was ever on the move at once)

Next we added an “orderby” clause just before the end. Sure enough, we then saw buffering in action – the guy representing the ordering couldn’t return any data to the “select” clause until he’d actually got all the data.

Finally, we removed the “orderby” clause again, but added a call to Count(). We didn’t have time to go into a lot of detail, but I think people understood why that led to immediate execution rather than the deferred execution we had earlier.

I suspect I’m not the first person to do something like this, but I’m still really pleased with it. If you’re ever talking about LINQ and people’s eyes are glazing over, it’s a fun little demo. It wasn’t perfect though; there are things I’d change:

  • Put the upper case version of the word on the back of the paper. We had to imagine the result of the projection.
  • Having two “where” clauses is useful for the first demo, but slows things down after that.
  • Possibly use fewer words – it takes quite a while, and having been through it three times, the audience may grow impatient.
  • Explain deferred execution more in terms of the result type – it’ll make it easier to contrast with immediate execution.

Overall, it was a really fun night. I did a little interview with Dave McMahon afterwards, which should go up on the NxtGenUG site at some point. I suspect I was talking rather too quickly for the whole time, but we’ll see how it pans out.

Extension methods on lambda expressions don’t work, unfortunately

Over the Christmas holidays, I thought I’d experiment with something I’d been thinking about a little – sorting a generic IList<T>. Now, before anyone gets huffy, I’m well aware of OrderBy in LINQ to Objects. However, sometimes you want to sort collections in-place, and as IList<T> provides random access, there’s no reason we shouldn’t be able to. Now, I do like the way that OrderBy allows multiple criteria to be specified, whether they should be applied in an ascending or descending fashion, and by way of just “compare by this projection” rather than having to actually implement the comparison yourself. I thought I could probably make use of those ideas again.

Unlike in LINQ to Objects, the sort would occur immediately, which means I couldn’t use quite the chained syntax of OrderBy(...).ThenBy(...).ThenByDescending(...) but my plan was to allow code like this:

List<Person> myList = …;

myList.SortBy((person => person.Name).Ascending(),
              (person => person.DateOfBirth).Descending(),
              (person => person.SocialSecurityNumber).Ascending());

Because each of the projections involved might have a different output type, that would only work for as many overloads as I’d implement. An alternative would be:

Comparer<Person> sorter = Comparer.By(person => person.Name)
                          .ThenByDescending(person => person.DateOfBirth)
                          .ThenBy(person => person.SocialSecurityNumber);

sorter.Sort(myList);

I did like the idea of the Ascending and Descending extension methods though, operating on Func<T1,T2>. Unfortunately, the dot operator doesn’t work on lambda expressions, even though the expression itself is implicitly convertible to Func<T1,T2>.

My plan isn’t entirely scuppered – the latter syntax will still work, and there are probably some alternatives I can work out. I think there are nice possibilities around extension methods and delegates though, and it’s a shame they’re not useful for lambda expressions. Ah well.

LINQ to Objects – not just for in-memory collections

I’ve just seen LINQ to Objects described as the LINQ provider for “in-memory collections” again. It’s a fairly frequent occurrence, and I may have done it myself on occasion. It doesn’t do LINQ to Objects justice. An example I’ve used in a few places is a query which runs over log files. Something along the lines of:

var query = from file in Directory.GetFiles(@"c:\logs", "*.log")
            from line in new LineReader(file)
            let entry = new LogEntry(line)
            where entry.Severity == Severity.Critical
            select entry;

Where’s the in-memory collection here? I suppose there’s the array of log file names, but that’s about it. LINQ to Objects isn’t restricted to datasets which fit comfortably in memory. The above query could process many gigs of data very easily, limited basically by disk speed (and date/time parsing speed in my experience, but that’s a topic for another post).

What LINQ to Objects does is in-process querying of enumerable sequences of data. More of a mouthful than “querying in-memory collections” but more accurate, IMO.

Rant over. Well, for a couple of minutes.

“Push” LINQ revisited – next attempt at an explanation

Marc Gravell and I have now implemented a lot of LINQ standard query operators on the “push” model of IDataProducer as opposed to the “pull” model of IEnumerable. My good friend Douglas Leeder (who doesn’t use C#) has been with me this weekend, and through explaining the “big picture” to him in various ways, and taking his feedback, I think I’ve now got a good way of communicating it. Voting.

It’s a “real life analogy” which is always dangerous – don’t think of it too literally – I’m not claiming that it’s meant to be an absolute 1:1 correspondence. However, I think it neatly demonstrates the problem and some of the benefits of the solution we’ve come up with.

In order to make all this concrete, all of the code is real, and can be downloaded as a zip of a VS2008 solution. It contains a binary of an unreleased version of MiscUtil which is where the DataProducer stuff currently lives.

Real life situation

Let’s suppose we’re trying to find out what the favourite colour is of everyone in the world. Now, for the purposes of the demo code, there are only four colours and six people in the world: that makes the diagrams nice and easy, and we can see results simply too. Extending the data to the rest of the real world is left as an exercise to the reader. We may also want additional information, such as the average ages of people voting for particular colours.

Here’s our complete sample data – the five members of my family, and Douglas:

Name      Age  Favourite colour
Jon       31   Blue
Douglas   28   red
Holly     31   Purple
Tom       4    Pink
Robin     1    RED
William   1    blue

Note how the colours are specified with variations of case. We’ll use that later as an example of why you might need to specify a “key comparer”.

There are various ways of implementing this in LINQ, and for each model we’ll provide code and think about how it would work in the real world.

Model 1: “Pull” model

This is the model which “normal” LINQ uses – you only ever pull data, using an IEnumerable<T>. Here’s a simple query expression which gives the answers we want (admittedly unordered – we’ll ignore that for now):

 

var query = from voter in Voter.AllVoters()
            group voter by voter.FavouriteColour.ToUpper() into grouped
            select new { Colour = grouped.Key, Votes = grouped.Count() };

foreach (var entry in query)
{
    Console.WriteLine("Colour {0} has {1} votes", entry.Colour, entry.Votes);
}

There are two problems here.

Firstly, we’re using ToUpper() to get round the “RED != red” problem. This is not only bad in terms of internationalisation, but it also loses data. We really want to get the original string as the key, and then use a case-insensitive comparer. We can do this by a manual call to GroupBy instead of using query expressions – there’s an overload which takes an IEqualityComparer.

Secondly, the result of the “group … by” keeps all the voter data temporarily. It all has to be available at the same time before the “select” kicks in. This runs counter to the normal “streaming” idea of LINQ, and is inherent in the nature of the “pull” model, as I’ll explain in a minute.

Now, let’s see what this looks like in the real world. People come into a room through a door, and a “grouper” asks them for their favourite colour. The grouper then tells each voter (immediately) which corner of the room to stand in. The result at the end of the grouping is this:

After the initial grouping, another person goes to each group in turn, finding out their key and doing a head count. That group is then free to go. The important thing is that this person can’t do their job until all the data is in, because they’ve got to be able to see everyone in order to count them.

Improvement to pull model: just keep a token presence

The fact that we used “group voter by …” meant that the result of the grouping still involved whole people. As we’re only going to do a head count, we only need something saying “There was a person here.” We can change our original query to do that quite easily:

 

var query = from voter in Voter.AllVoters()
            group 1 by voter.FavouriteColour.ToUpper() into grouped
            select new { Colour = grouped.Key, Votes = grouped.Count() };

foreach (var entry in query)
{
    Console.WriteLine("Colour {0} has {1} votes", entry.Colour, entry.Votes);
}

 

This time, after the grouping takes place, the room looks like this:

The use of 1 here is purely incidental: we could have used ‘group “Spartacus” by …’ and the results would have been the same. It’s just something which can be counted.

Now, there’s good and bad here:

  • We’re not taking as much memory here. If voters have large amounts of data attached to them, we’ve reduced our requirements significantly.
  • We still have one object per voter, all in memory at the same time. Think “population of the world”.
  • We’ve lost our age data, which would make any extra aggregation impossible.

Model 2: “Push” model

The problem with the pull model is that each aggregator always wants to be the only thing pulling. The call to MoveNext will block until more data is available. That’s a real problem when you want to have multiple aggregators (one vote counter per colour). We could do a complicated threading manoeuvre, with each colour getting its own thread and the “grouper” pushing items out to relevant threads. Again though, that doesn’t scale – the four extra threads in our example aren’t too bad, but imagine other groupings with potentially thousands of keys.

The alternative is to change the model. Instead of having a greedy aggregator pulling data, we change to aggregators who observe data being pushed past them, and also observe a special “all the data has now been pushed” signal. Before we look at the code to do this, let’s think about what it could be like in real life. We don’t know how many different colours will be voted on, but we know what we need to do with each one: count the number of votes for them. In detail, the situation would be something like this:

  • The grouper stands just inside the door of the room, and “pulls” voters in the normal way
  • For any voter:
    • Ask the voter which colour they wish to vote for
    • Check to see if that colour is a “new” one. If it is, create a “counter” person for that colour, and position them by an exit in the room. (We create new exits as we go. We’ll assume there’s a sledgehammer at the ready.)
    • Send the voter past the relevant “counter” person, through the exit near them
    • Each counter just counts how many voters they see going past them
  • When all voters have been pulled, tell each of the counters and ask them how many people they saw

We never have more than one voter in the room at once:

Let’s have a look at the code involved now.

Using the “push” model

There are two sides to the code here: the code that the LINQ user has to write, and the code Marc Gravell and I have implemented. We’ll look at the client code in a few different scenarios.

1) GroupWithPipeline in the middle of normal LINQ

Keeping to the normal “start with a data source, do something, then select” model involves stepping away from query expressions. We’ve got a new extension method on IEnumerable<T> called GroupWithPipeline, which takes a key selector (just like the normal GroupBy) and what to do with the results of each grouping. Here’s the new code (which requires a using directive for MiscUtil.Linq.Extensions):

 

var query = Voter.AllVoters()
                 .GroupWithPipeline(voter => voter.FavouriteColour.ToUpper(),
                                    voters => voters.Count())
                 .Select(grouped => new { Colour = grouped.Key, Votes = grouped.Value });

foreach (var entry in query)
{
    Console.WriteLine("Colour {0} has {1} votes", entry.Colour, entry.Votes);
}

How about making this a bit smarter now? Let’s try to also work out the minimum and maximum ages of the voters for each colour. Conceptually this is just a case of adding extra observers along with each vote counter in the “real life” model above. The code is remarkably simple:

 

var query = Voter.AllVoters()
                 .GroupWithPipeline(voter => voter.FavouriteColour.ToUpper(),
                                    voters => voters.Count(),
                                    voters => voters.Min(voter => voter.Age),
                                    voters => voters.Max(voter => voter.Age))
                 .Select(grouped => new { Colour = grouped.Key,
                                          Votes = grouped.Value1,
                                          MinAge = grouped.Value2,
                                          MaxAge = grouped.Value3 });

foreach (var entry in query)
{
    Console.WriteLine("Colour {0} has {1} votes. Age range: {2}-{3}",
                      entry.Colour, entry.Votes, entry.MinAge, entry.MaxAge);
}

The fact that it uses “Value1”, “Value2” and “Value3” isn’t ideal, but unfortunately there’s no way round that as far as we’ve worked out – for this part.

2) Using DataProducer directly for multiple aggregates

GroupWithPipeline uses a few types internally which you can use directly instead: DataProducer (implementing IDataProducer) and Future (implementing IFuture). If I go into the details here, I’ll never get this posted – but that may come in another post if there’s enough interest. However, let’s have a look at how it can be used. First, let’s find the results of a few aggregates of our voters, this time without any groupings:

 

// Create the data source to watch
DataProducer<Voter> voters = new DataProducer<Voter>();

// Add the aggregators
IFuture<int> total = voters.Count();
IFuture<int> adults = voters.Count(voter => voter.Age >= 18);
IFuture<int> children = voters.Where(voter => voter.Age < 18).Count();
IFuture<int> youngest = voters.Min(voter => voter.Age);
IFuture<int> oldest = voters.Select(voter => voter.Age).Max();

// Push all the data through
voters.ProduceAndEnd(Voter.AllVoters());

// Write out the results
Console.WriteLine("Total voters: {0}", total.Value);
Console.WriteLine("Adult voters: {0}", adults.Value);
Console.WriteLine("Child voters: {0}", children.Value);
Console.WriteLine("Youngest voter age: {0}", youngest.Value);
Console.WriteLine("Oldest voter age: {0}", oldest.Value);

The output of the code is what you’d expect, but there are a few things to note:

  1. Each aggregate returns an IFuture<int> instead of an int. This is because we set up the aggregators before we produce any data. We need to use the Value property to get the actual value back after we’ve produced the data.
  2. Just to hammer the point home, we must set up the aggregators (calling Count etc) before we produce the data (in ProduceAndEnd). Otherwise the aggregators won’t have any data to work with.
  3. We can chain operators together (Select and then Max, or Where and then Count) just as with normal LINQ.
  4. We’re applying multiple aggregates, but the data is only being produced once. This just can’t be done with normal LINQ. ProduceAndEnd takes an IEnumerable<T> which could be another LINQ query – something fetching large amounts of data from files, etc. Everything will be streamed appropriately.

3) Using DataProducer in query expressions

This part wouldn’t have been available when I started writing this post – I hadn’t quite realised the power of the pattern yet. By implementing GroupBy on IDataProducer, we can perform the original grouping in a query expression, in a pretty normal kind of way… except that this time we can apply multiple aggregates, never buffering the data beyond the results of the aggregation:

 

DataProducer<Voter> voters = new DataProducer<Voter>();

var query = from voter in voters
            group voter by voter.FavouriteColour.ToUpper() into grouped
            select new { Colour = grouped.Key, 
                         Votes = grouped.Count(),
                         MinAge = grouped.Min(voter => voter.Age),
                         MaxAge = grouped.Max(voter => voter.Age)};

var results = query.AsEnumerable();

voters.ProduceAndEnd(Voter.AllVoters());

foreach (var entry in results)
{
    Console.WriteLine("Colour {0} has {1} votes. Age range: {2}-{3}",
                      entry.Colour, entry.Votes.Value, 
                      entry.MinAge.Value, entry.MaxAge.Value);
}

There’s just one tricky bit in here – you must call AsEnumerable before the data is produced, otherwise the aggregators will stream all their data with nothing watching for the results. In fact, AsEnumerable builds a list internally – the final results are buffered, but only those results. There’s really not a lot that can be done about that.

So, there we go. That may or may not be a bit clearer now. I’m still learning both the power of the pattern, its potential uses, and the best ways of explaining it. Feedback is very welcome, both on the technical front and about the explanation. I’m absolutely convinced that it’s a useful pattern in some situations (though not all). All the code will be released as part of MiscUtil eventually, of course – we’re still tidying it up and producing a bit more documentation at the moment.

C# in Depth: All chapters available in MEAP!

Rather excitingly, all the chapters of C# in Depth are now available for early access. The following chapters have recently been added:

10: Extension methods

Without extension methods, LINQ just couldn’t work in an elegant form. Extension methods are basically a way of faking instance methods by providing static methods with an implicit “this” parameter. Importantly, they can work on interfaces, which means you can make an interface appear to have many more methods than implementations actually have to provide. Although they’re primarily provided for the sake of LINQ, extension methods can improve code readability in other spheres too – when used cautiously. In this chapter, we look at extension methods in the non-LINQ world, and get our first glance at some of the methods in System.Linq.Enumerable.

11: Query expressions and LINQ to Objects

If you ask someone who has just read a page or two about C# 3 what the new features are, they’ll almost certainly mention query expressions. This is the “from x in y where z select foo” type of expression which looks almost like SQL but isn’t. The amazing thing about query expressions is how little they impact the rest of the language: they are pure syntactic sugar, being effectively translated into other source code before being compiled. That allows some really neat tricks, and is the basis for how LINQ handles multiple data sources.

In this chapter we look at query expressions and the standard query operators which support them, via LINQ to Objects. This is “in-process” LINQ, often used with in-memory collections but more generally available for anything implementing IEnumerable or IEnumerable<T>.

12: LINQ beyond collections

We’ve all seen LINQ to SQL demos, and gone “ooh” and “ahh” (or possibly “I could do that in Rails too, y’know”) at the appropriate time. In this chapter I emphatically don’t try to teach you LINQ to SQL, but instead take you on a whistle-stop tour of lots of different LINQ providers:

  • LINQ to SQL
  • LINQ to DataSet
  • LINQ to XML
  • LINQ to NHibernate
  • LINQ to ActiveDirectory
  • LINQ to Entities
  • Parallel LINQ

I also give a bit of insight into how “true” LINQ providers like LINQ to SQL (I don’t really count LINQ to XML or LINQ to DataSet – they’re just providing IEnumerable<T>s for LINQ to Objects to work with) work, using IQueryable.

As you can tell from the scope of the chapter, I don’t try to go into many details – just enough to give the flavour, and hopefully show the “big picture”. I believe Microsoft is really trying something very ambitious with a mostly-unified query framework, and with any luck this chapter leaves that as the lasting impression.

13: Elegant code in the new era

I’ve taken a look at a lot of my technical books to see how they end – and really they don’t, properly. The last line of the last chapter could often have been set anywhere else. My final chapter is very short, but tries to give an impression of where I think software development is going, particularly in terms of C#.

Appendix A: LINQ standard query operators

Although some of the standard query operators are covered in chapter 11, there are plenty which aren’t. This appendix is really just a grouped list of the operators with some brief examples of what they do. Handy as a reference guide – one reviewer apparently said to another, “Holy crap! I want this on my wall!”

 

What next?

So, now that everything is available in MEAP, it’s all done, right? Well, not quite. I’m currently indexing and putting together final revisions – where the word “final” is pretty loose. It will then be passed to my technical reviewer (whose name I shouldn’t reveal just yet, but who I’m proud to have on board – even if I’m dreading the errors they’ll find) and the copy editor, who I believe will work effectively in parallel. After that (and final approval) it will go into production, then to press. The due date is still late March or April at the moment, I believe.

My current indexing/revising task is a real slog (which is why I’m taking a break by writing this blog entry) but I think it’s the last big push – it should get easier when I’m done on this bit. Right, back to chapter 5…

A cautionary parallel tale: ordering isn’t simple

A little while ago, I wrote about my silly project to test Parallel LINQ – a Mandelbrot generator. In the last week, two things have happened to make this more of a reality. Firstly, the December CTP of Parallel FX has been released. Secondly, my old laptop died, “forcing” me to get a new laptop, which just happens to have a dual core processor.

So, it should just be a case of running it, right? Well, not quite. First let’s have a look at the query expression again, in its serial form:

 

var query = from row in Enumerable.Range(0, ImageHeight)
            from col in Enumerable.Range(0, ImageWidth)
            select ComputeMandelbrotIndex(row, col);

 

And here’s what should be generated.

That’s the result of running without any parallelism, in other words. Now, I realised from the start that we would need ordering, but my first experiment was just to call AsParallel() to see what would happen:

 

var query = from row in Enumerable.Range(0, ImageHeight).AsParallel()
            from col in Enumerable.Range(0, ImageWidth)
            select ComputeMandelbrotIndex(row, col);

 

As expected, that didn’t produce quite the result we wanted:

Well, that’s okay. I wanted to prove that ordering was necessary, and indeed that’s fairly obvious from the result. There are horizontal blocks, returned out of order. Easily fixable, right? After all, I posted what I thought would be the solution with the original post. We just need to give the appropriate option as a method parameter:

 

var query = from row in Enumerable.Range(0, ImageHeight).AsParallel(ParallelQueryOptions.PreserveOrdering)
            from col in Enumerable.Range(0, ImageWidth)
            select ComputeMandelbrotIndex(row, col);

 

Not so fast. It certainly changed things, but not quite as hoped:

I haven’t yet analysed quite why we now have the rows in order but the columns out of order. However, I haven’t quite managed to fix it with the code in its original form. I have managed to fix it by reducing it from two (implicit) loops to one:

 

var query = from row in Enumerable.Range(0, ImageHeight).AsParallel(ParallelQueryOptions.PreserveOrdering)
            select ComputeMandelbrotRow(row);

byte[] data = new byte[ImageHeight * ImageWidth];

int rowStart = 0;
foreach (byte[] row in query)
{
    Array.Copy(row, 0, data, rowStart, ImageWidth);
    rowStart += ImageWidth;
}

 

Instead of getting all the results in one go (with a call to ToArray()) we now have to reassemble all the data into a block. Still, it achieves the desired result. I should point out that PFX has better ways of doing this, Parallel.For being an obvious starting point from the little I know of it. At some point I’ll get round to reading about them, but at the moment life’s too short. I should also say that I don’t expect that any of the pictures indicate a bug in Parallel LINQ, merely my understanding of it.

Update: Explanation and a workaround

Having asked about this on the Connect forum for PFX, Igor Ostrovsky has explained that this is by design – currently only outer ordering is preserved when requested. The issue is still open, however – it’s possible that it will change before release.

In the meantime, Nicholas Palladinos has come up with an alternative solution which I rather like. I’ve refactored it a bit, but the basic idea is to turn the two sequences into one before parallelisation:

 

var points = from row in Enumerable.Range(0, ImageHeight)
             from col in Enumerable.Range(0, ImageWidth)
             select new { row, col };

var query = from point in points.AsParallel(ParallelQueryOptions.PreserveOrdering)
            select ComputeMandelbrotIndex(point.row, point.col);

 

That works really well – in fact, more than twice as fast as the serial version, on my 2-core box!

Don’t call us, we’ll call you – push enumerators

Update: I’ve got a new and simpler design now. I’m leaving this in for historical interest, but please see the entry about the new design for more recent information.

This post is going to be hard to write, simply because I can’t remember ever writing quite such bizarre code before. When I find something difficult to keep in my own head, explaining it to others is somewhat daunting, especially when blogging is so much more restrictive than face-to-face discussion. Oh, and you’ll find it hard if you’re not familiar with lambda expression syntax (x => x.Foo etc). Just thought I’d warn you.

It’s possibly easiest to explain with an example. It’s one I’m hoping to use in a magazine article – but I certainly won’t be trying to explain this in the article. Imagine you’ve got a lot of log entries on disk – by which I mean hundreds of millions. You certainly don’t want all of that in memory. However, each of the log entries contains a customer ID, and you want to find out for each customer how many entries there are. Here’s a LINQ query which would work but be horribly inefficient, loading everything into memory:

var query = from entry in logEntryReader
            group entry by entry.Customer into entriesByCustomer
            let count = entriesByCustomer.Count()
            orderby count descending
            select new { Customer = entriesByCustomer.Key, Count = count };

Now, it’s easy to improve this somewhat just by changing the “group entry by” to “group 1 by” – that way the entries themselves are thrown away. However, you’ve still got some memory per entry – a huge great enumeration of 1s to count after grouping.

The problem is that you can’t tell “group … by” how to aggregate the sequence associated with each key. This isn’t just because there’s no syntax to express it – it’s to do with the nature of IEnumerable itself. You see, the “pull” nature of IEnumerable is a problem. While a thread is waiting for more data, it will just block. Normally, an aggregator (like Count) just picks data off a sequence until it reaches the end, then returns the result. How can that work when there are multiple sequences involved (one for each customer)?

There are three answers to this:

1) Write your own group-and-count method. This is pretty straightforward, and potentially useful in many situations. It’s also fairly easy to understand. You just iterate through a sequence, and keep a dictionary of key to int, increasing each key’s count as you see elements. (There’s a sketch of this after the list of options.) This is the pragmatic solution when faced with a specific problem – but it feels like there should be something better, something that lets us group and then specify the processing in terms of standard query operators.

2) Create a new thread and producer/consumer style IEnumerable for each key. Clearly this doesn’t scale.

3) Invert control of enumerations: put the producer in the driving seat instead of the consumer. This is the approach we’re talking about for the rest of the post.

A word on the term “asynchronous”

I don’t know whether my approach could truly be called asynchronous. What I haven’t done is make any of the code thread-aware at all, or even thread-safe. All the processing of multiple sequences happens in a single thread. I also don’t have the full BeginXXX, EndXXX using IAsyncResult pattern. I started down that line, but it ended up being a lot more complicated than what I’ve got now.

I’m pretty sure that what I’ve been writing is along the lines of CSPs (Communicating Sequential Processes) but I wouldn’t in any way claim that it’s a CSP framework, either.

However, you may find that it helps to think about asynchronous APIs like Stream.BeginRead when looking at the rest of the code. In particular, reading a stream asynchronously has the same “say I’m interested in data, react to data, request some more” pattern.

Keeping the Count aggregator in mind, what we want to do is maintain a private count, and request some data. When we get called back to say there is more data, we increment our count (ignoring the data) and request some more. When we are told that there’s no more data, we can return the count.

With that said, here’s the interface for what I’ve called IPushEnumerator. The name is open for change – I’ve been through a few options, and I’m still not comfortable with it. Please feel free to suggest another one! Note that there isn’t an IPushEnumerable – again, I started off with one, but found it didn’t make sense. Maybe someone smarter than me will come up with a way of it fitting.

IPushEnumerator

/// <summary>
/// An enumerator which works in an inverse manner – the consumer requests
/// to be notified when a value is available, and then the enumerator
/// will call back to the consumer when the producer makes some data
/// available or indicates that all the data has been produced.
/// </summary>
public interface IPushEnumerator<T>
{
    /// <summary>
    /// Fetch the current value. Not valid until
    /// a callback with a True argument has occurred,
    /// or after a callback with a False argument.
    /// </summary>
    T Current { get; }

    /// <summary>
    /// Requests notification (via the specified callback) when more
    /// data is available or when the end of the data has been reached.
    /// The argument passed to the callback indicates which of these
    /// conditions has been met – logically the result of MoveNext
    /// on a normal IEnumerator.
    /// </summary>
    void BeginMoveNext(Action<bool> callback);
}

That bit is relatively easy to understand. I can ask to be called back when there’s data, and typically I’ll fetch the data within the callback and ask for more.

So far, so good. But what’s going to create these in the first place? How do we interface with LINQ? Time for an extension method.

Enumerable.GroupWithPush

I wanted to create an extension to IEnumerable<T> which had a “LINQ feel” to it. It should be quite like GroupBy, but then allow the processing of the subsequences to be expressed in a LINQ-like way. (Actual C# query expressions aren’t terribly useful in practice because there isn’t specific syntax for the kind of operators which turn out to be useful with this approach.) We’ll want to have type parameters for the original sequence (TElement), the key used for grouping (TKey) and the results of whatever processing is performed on each sequence (TResult).

So, the first parameter of our extension method is going to be an IEnumerable<TElement>. We’ll use a Func<TElement,TKey> to map source elements to keys. We could optionally allow an IEqualityComparer<TKey> too – but I’m certainly not planning on supporting as many overloads as Enumerable.GroupBy does. The final parameter, however, needs to be something to process the subsequence. The first thought would be Func<IPushEnumerator<TElement>,TResult> – until you start trying to implement the extension method or indeed the delegate doing the processing.

You see, given an IPushEnumerator<TElement> you really don’t want to return a result. Not just yet. After all, you don’t have the data yet, just a way of being given the data. What you want to return is the means of the caller obtaining the result after all the data has been provided. This is where we need to introduce a Future<T>.

Future<T>

If you don’t know about the idea of a future, it’s basically an IOU for a result. In proper threading libraries, futures allow the user to find out whether a computation has completed or not, wait for the result etc. My implementation of Future<T> is not that smart. It’s not smart at all. Here it is:

/// <summary>
/// Poor-man’s version of a Future. This wraps a result which *will* be
/// available in the future. It’s up to the caller/provider to make sure
/// that the value has been specified by the time it’s requested.
/// </summary>
public class Future<T>
{
    T value;
    bool valueSet = false;

    public T Value 
    {
        get
        {
            if (!valueSet)
            {
                throw new InvalidOperationException("No value has been set yet");
            }
            return value;
        }
        set
        {
            valueSet = true;
            this.value = value;
        }
    }
}

With this in place, we can reveal the actual signature of GroupWithPush:

public static IEnumerable<KeyValuePair<TKey, TResult>> GroupWithPush<TElement, TKey, TResult>
    (this IEnumerable<TElement> source,
     Func<TElement, TKey> mapping,
     Func<IPushEnumerator<TElement>, Future<TResult>> pipeline)

I shall leave you to mull over that – I don’t know about you, but signatures of generic methods always take me a little while to decode.

The plan is to then implement extension methods on IPushEnumerator<T> so that we can write code like this:

var query = logEntryReader.GroupWithPush(entry => entry.Customer,
                                         sequence => sequence.Count());

foreach (var result in query)
{
    Console.WriteLine("Customer {0}: {1} entries",
                      result.Key,
                      result.Value);
}

Okay, so how do we implement these operators? Let’s give an example – Count being a pretty simple case.

Implementing Count()

Let’s start off by looking at a possible Count implementation for a normal sequence, to act as a sort of model for the implementation in the weird and wacky land of futures and push enumerators:

public static int Count<T>(IEnumerable<T> source)
{
    int count = 0;
    foreach (T item in source)
    {
        count++;
    }
    return count;
}

Now, we’ve got two problems. Firstly, we’re not going to return the count – we’re going to return a Future. Secondly, we certainly can’t use foreach on an IPushEnumerator<T> – the whole point is to avoid blocking while we wait for data. However, the concept of “for each element in a sequence” is useful – so let’s see whether we can do something similar with another extension method, then come back and use it in Count.

Implementing ForEach()

Warning: this code hurts my head, and I wrote it. Even the idea of it hurts my head a bit. The plan is to implement a ForEach method which takes two delegates – one which is called for each item in the enumerator, and one which is called after all the data has been processed. It will return without blocking, but it will call BeginMoveNext first, using a delegate of its own. That delegate will be called when data is provided, and it will in turn call the delegates passed in as parameters, before calling BeginMoveNext again, etc.

Ready?

public static void ForEach<T>(this IPushEnumerator<T> source, 
                              Action<T> iteration,
                              Action completion)
{
    Action<bool> moveNextCallback = null;
            
    moveNextCallback = dataAvailable =>
         {
             if (dataAvailable)
             {
                 iteration(source.Current);
                 source.BeginMoveNext(moveNextCallback);
             }
             else
             {
                 completion();
             }
         };

    source.BeginMoveNext(moveNextCallback);
}

What I find particularly disturbing is that moveNextCallback is self-referential – it calls BeginMoveNext passing itself as the parameter. (Interestingly, you still need to assign it to null first, otherwise the compiler complains that it might be used without being assigned. I seem to remember reading a blog post about this before now, and thinking that I’d never ever run into such a situation. Hmm.)

Nasty as ForEach is in terms of implementation, it’s not too bad to use.

Implementing Count() – the actual code

The translation of the original Count is now relatively straightforward. We prepare the Future wrapper for the result, and indicate that we want to iterate through all the entries, counting them and then setting the result value when we’ve finished (which will be long after the method first returns, don’t forget).

public static Future<int> Count<T>(this IPushEnumerator<T> source)
{
    Future<int> ret = new Future<int>();
    int count = 0;

    source.ForEach(t => count++, 
                   () => ret.Value = count);
    return ret;
}

We’re nearly there now. All we need to do is complete the original GroupWithPush method:

Implementing GroupWithPush

There are three phases to GroupWithPush, as mentioned before: pushing the data to the consumers (creating those consumers as required based on the keys we see); telling all the consumers that we’ve finished; retrieving the results. It’s probably easiest just to show the code – it’s actually not too hard to understand.

public static IEnumerable<KeyValuePair<TKey, TResult>> GroupWithPush<TElement, TKey, TResult>
    (this IEnumerable<TElement> source,
     Func<TElement, TKey> mapping,
     Func<IPushEnumerator<TElement>, Future<TResult>> pipeline)
{
    var enumerators = new Dictionary<TKey, SingleSlotPushEnumerator<TElement>>();
    var results = new Dictionary<TKey, Future<TResult>>();
    // Group the data, pushing it to the enumerators at the same time.
    foreach (TElement element in source)
    {
        TKey key = mapping(element);
        SingleSlotPushEnumerator<TElement> push;
        if (!enumerators.TryGetValue(key, out push))
        {
            push = new SingleSlotPushEnumerator<TElement>();
            results[key] = pipeline(push);
            enumerators[key] = push;
        }
        push.Push(element);
    }
    // Indicate to all the enumerators that we’ve finished
    foreach (SingleSlotPushEnumerator<TElement> push in enumerators.Values)
    {
        push.End();
    }
    // Collect the results, converting Future<T> into T for each one.
    foreach (var result in results)
    {
        yield return new KeyValuePair<TKey, TResult>(result.Key, result.Value.Value);
    }
}

I haven’t introduced SingleSlotPushEnumerator before, but as you can imagine, it’s an implementation of IPushEnumerator, with Push() and End() methods to provide data or indicate the end of the data stream. It’s not terribly interesting to see, in my view.

Conclusion

So, that’s what I’ve been looking at and thinking about for the last few evenings. I’ve implemented quite a few of the standard query operators, although not all of them are worth doing. I’m not currently viewing this as anything more than an interesting exercise, partly in terms of seeing how far I can push the language, but if anyone thinks it’s worth pursuing further (e.g. as a complete implementation as far as sensibly possible, either in MiscUtil or on SourceForge) I’d be very happy to hear your ideas. Frankly, I’d be glad and slightly surprised just to find out that anyone made it this far.

Oh, exercise for the reader – draw out a sequence diagram of how all this behaves :)

I love LINQ: Simplifying a tedious task

As mentioned in my previous post, I’ve been putting together the code samples for C# in Depth. Now, these are spread across several projects in a few solutions. They’re referred to in the book as things like “Listing 6.2” but I’ve given the files “real” names in the projects. When you run any project with multiple executable listings in it, the project offers you options of which listing to run, showing both the type name and the listing, which is embedded using System.ComponentModel.DescriptionAttribute, e.g. [Description("Listing 12.4")]. A few listings have a description which is more than just “Listing x.y” – for instance, “Listing 6.1/6.2/6.3”. These are displayed with no problems.

Now, the problem for the reader would be finding a particular listing – it’s not always obvious from the code what the type name should be, particularly when there are many variations on a theme. Clearly some sort of map is required. Ideally it should be a file looking something like this:

Chapter 1:
1.1: Foo.cs
1.2: Bar.cs
1.3/1.4: Baz.cs

Chapter 2:
2.1: Gronkle.cs

It’s easy enough to work out the directory for any particular file – the projects are helpfully named “Chapter3” and the like. So, the next thing is to create this file. I really didn’t want to do that by hand. After all, there are about 150 listings in the book – and I’ve already done the work of attributing them all. Ah… we could do it programmatically. Sounds like a bit of a slog…

… but it’s a problem which is ideally suited to LINQ. It’s also fairly ideally suited to regular expressions, much as I hate to admit it. The regular expression in question is reasonably complex, but thanks to Jesse Houwing’s advice on adding comments to regular expressions, the results aren’t too bad. Here’s the finished code – which of course is part of the downloadable source code itself.

using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

namespace Chapter11.Queries
{
    /// <summary>
    /// The listings are scattered around .cs files within various directories.
    /// This class uses LINQ to find all classes with a suitable Description
    /// attribute, groups them by chapters and orders them by chapter and listing number.
    /// </summary>
    class DisplayListingsMap
    {
        static readonly Regex ListingPattern = new Regex(
            @"# First match the start of the attribute, up to the bit we're interested in
            \[Description\(""Listing\s
            # The 'text' group is the whole of the description after Listing
            (?<text>
            # The 'chapter' group is the first set of digits in the description, before a dot
            (?<chapter>\d+)\.
            # The 'listing' group is the second set of digits in the description
            (?<listing>\d+)
            # After that we don't care - stop the 'text' group at the double quote
            [^""]*)
            # Now match the end of the attribute
            ""\)\]",
            RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

        static void Main()
        {
            DirectoryInfo directory = new DirectoryInfo(@"..\..\..\..");

            var query = from file in directory.GetFiles("*.cs", SearchOption.AllDirectories)
                        let match = ListingPattern.Match(File.ReadAllText(file.FullName))
                        where match.Success
                        let Details = new
                        {
                            Text = match.Groups["text"].Value,
                            Chapter = int.Parse(match.Groups["chapter"].Value),
                            Listing = int.Parse(match.Groups["listing"].Value)
                        }
                        orderby Details.Chapter, Details.Listing
                        group new { File = file, Description = Details.Text } by Details.Chapter;

            foreach (var chapter in query)
            {
                Console.WriteLine("Chapter {0}", chapter.Key);
                foreach (var listing in chapter)
                {
                    Console.WriteLine("{0}: {1}", listing.Description, listing.File.Name);
                }
                Console.WriteLine();                
            }
        }
    }
}

Isn’t it cool? The regex works out the listing number (first x.y part only) and sorts on that, grouping by chapter – then we just display the results. There are other ways of skinning the same cat – such as grouping and then ordering “inside” and “outside” a chapter separately – but they’ll all boil down to the same sort of thing.