Category Archives: LINQ

A cautionary parallel tale: ordering isn’t simple

A little while ago, I wrote about my silly project to test Parallel LINQ – a Mandelbrot generator. In the last week, two things have happened to make this more of a reality. Firstly, the December CTP of Parallel FX has been released. Secondly, my old laptop died, “forcing” me to get a new laptop, which just happens to have a dual core processor.

So, it should just be a case of running it, right? Well, not quite. First let’s have a look at the query expression again, in its serial form:

 

var query = from row in Enumerable.Range(0, ImageHeight)
            from col in Enumerable.Range(0, ImageWidth)                                             
            select ComputeMandelbrotIndex(row, col);

 

And here’s what should be generated.

That’s the result of running without any parallelism, in other words. Now, I realised from the start that we would need ordering, but my first experiment was just to call AsParallel() to see what would happen:

 

var query = from row in Enumerable.Range(0, ImageHeight).AsParallel()
            from col in Enumerable.Range(0, ImageWidth)                                             
            select ComputeMandelbrotIndex(row, col);

 

As expected, that didn’t produce quite the result we wanted:

Well, that’s okay. I wanted to prove that ordering was necessary, and indeed that’s fairly obvious from the result. There are horizontal blocks, returned out of order. Easily fixable, right? After all, I posted what I thought would be the solution with the original post. We just need to give the appropriate option as a method parameter:

 

var query = from row in Enumerable.Range(0, ImageHeight).AsParallel(ParallelQueryOptions.PreserveOrdering)
            from col in Enumerable.Range(0, ImageWidth)                                             
            select ComputeMandelbrotIndex(row, col);

 

Not so fast. It certainly changed things, but not quite as hoped:

I haven’t yet analysed quite why we now have the rows in order but the columns out of order. However, I haven’t quite managed to fix it with the code in its original form. I have managed to fix it by reducing it from two (implicit) loops to one:

 

var query = from row in Enumerable.Range(0, ImageHeight).AsParallel(ParallelQueryOptions.PreserveOrdering)                                             
            select ComputeMandelbrotRow(row);

byte[] data = new byte[ImageHeight * ImageWidth];

int rowStart = 0;
foreach (byte[] row in query)
{
    Array.Copy(row, 0, data, rowStart, ImageWidth);
    rowStart += ImageWidth;
}

 

Instead of getting all the results in one go (with a call to ToArray()) we now have to reassemble all the data into a block. Still, it achieves the desired result. I should point out that PFX has better ways of doing this, Parallel.For being an obvious starting point from the little I know of it. At some point I’ll get round to reading about them, but at the moment life’s too short. I should also say that I don’t expect that any of the pictures indicate a bug in Parallel LINQ, merely my understanding of it.
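
For reference, the row-based helper is trivial – here's a plausible sketch of it (ComputeMandelbrotIndex and ImageWidth are the pieces the original per-pixel query already relies on; the real helper may differ slightly):

static byte[] ComputeMandelbrotRow(int row)
{
    // Compute one whole row serially; only the rows themselves are parallelised.
    byte[] result = new byte[ImageWidth];
    for (int col = 0; col < ImageWidth; col++)
    {
        result[col] = ComputeMandelbrotIndex(row, col);
    }
    return result;
}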

Update: Explanation and a workaround

Having asked about this on the Connect forum for PFX, Igor Ostrovsky has explained that this is by design – currently only outer ordering is preserved when requested. The issue is still open, however – it’s possible that it will change before release.

In the meantime, Nicholas Palladinos has come up with an alternative solution which I rather like. I’ve refactored it a bit, but the basic idea is to turn the two sequences into one before parallelisation:

 

var points = from row in Enumerable.Range(0, ImageHeight)
             from col in Enumerable.Range(0, ImageWidth)
             select new { row, col };

var query = from point in points.AsParallel(ParallelQueryOptions.PreserveOrdering)                                              
            select ComputeMandelbrotIndex(point.row, point.col);

 

That works really well – in fact, more than twice as fast as the serial version, on my 2-core box!

Don’t call us, we’ll call you – push enumerators

Update: I’ve got a new and simpler design now. I’m leaving this in for historical interest, but please see the entry about the new design for more recent information.

This post is going to be hard to write, simply because I can’t remember ever writing quite such bizarre code before. When I find something difficult to keep in my own head, explaining it to others is somewhat daunting, especially when blogging is so much more restrictive than face-to-face discussion. Oh, and you’ll find it hard if you’re not familiar with lambda expression syntax (x => x.Foo etc). Just thought I’d warn you.

It’s possibly easiest to explain with an example. It’s one I’m hoping to use in a magazine article – but I certainly won’t be trying to explain this in the article. Imagine you’ve got a lot of log entries on disk – by which I mean hundreds of millions. You certainly don’t want all of that in memory. However, each of the log entries contains a customer ID, and you want to find out for each customer how many entries there are. Here’s a LINQ query which would work but be horribly inefficient, loading everything into memory:

var query = from entry in logEntryReader
            group entry by entry.Customer into entriesByCustomer
            let count = entriesByCustomer.Count()
            orderby count descending
            select new { Customer = entriesByCustomer.Key, Count = count };

Now, it’s easy to improve this somewhat just by changing the “group entry by” to “group 1 by” – that way the entries themselves are thrown away. However, you’ve still got some memory per entry – a huge great enumeration of 1s to count after grouping.
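
For clarity, that small change looks like this (the same query as above, just discarding the entries themselves):

var query = from entry in logEntryReader
            group 1 by entry.Customer into entriesByCustomer
            let count = entriesByCustomer.Count()
            orderby count descending
            select new { Customer = entriesByCustomer.Key, Count = count };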

The problem is that you can’t tell “group … by” how to aggregate the sequence associated with each key. This isn’t just because there’s no syntax to express it – it’s to do with the nature of IEnumerable itself. You see, the “pull” nature of IEnumerable is a problem. While a thread is waiting for more data, it will just block. Normally, an aggregator (like Count) just picks data off a sequence until it reaches the end, then returns the result. How can that work when there are multiple sequences involved (one for each customer)?

There are three answers to this:

1) Write your own group-and-count method. This is pretty straightforward, and potentially useful in many situations. It’s also fairly easy to understand. You just iterate through a sequence, and keep a dictionary of key to int, increasing each key’s count as you see elements (there’s a sketch of this after the list). This is the pragmatic solution when faced with a specific problem – but it feels like there should be something better, something that lets us group and then specify the processing in terms of standard query operators.

2) Create a new thread and producer/consumer style IEnumerable for each key. Clearly this doesn’t scale.

3) Invert control of enumerations: put the producer in the driving seat instead of the consumer. This is the approach we’re talking about for the rest of the post.
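
Before moving on, here's roughly what option 1 might look like – a minimal sketch of a hand-rolled group-and-count (the helper name and signature are mine, assuming a key selector like the one used in the query above):

public static Dictionary<TKey, int> GroupAndCount<TElement, TKey>(
    IEnumerable<TElement> source,
    Func<TElement, TKey> keySelector)
{
    var counts = new Dictionary<TKey, int>();
    foreach (TElement element in source)
    {
        TKey key = keySelector(element);
        int count;
        counts.TryGetValue(key, out count); // leaves count at 0 for a previously unseen key
        counts[key] = count + 1;
    }
    return counts;
}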

A word on the term “asynchronous”

I don’t know whether my approach could truly be called asynchronous. What I haven’t done is make any of the code thread-aware at all, or even thread-safe. All the processing of multiple sequences happens in a single thread. I also don’t have the full BeginXXX, EndXXX using IAsyncResult pattern. I started down that line, but it ended up being a lot more complicated than what I’ve got now.

I’m pretty sure that what I’ve been writing is along the lines of CSPs (Communicating Sequential Processes) but I wouldn’t in any way claim that it’s a CSP framework, either.

However, you may find that it helps to think about asynchronous APIs like Stream.BeginRead when looking at the rest of the code. In particular, reading a stream asynchronously has the same “say I’m interested in data, react to data, request some more” pattern.

Keeping the Count aggregator in mind, what we want to do is maintain a private count, and request some data. When we get called back to say there is more data, we increment our count (ignoring the data) and request some more. When we are told that there’s no more data, we can return the count.

With that said, here’s the interface for what I’ve called IPushEnumerator. The name is open for change – I’ve been through a few options, and I’m still not comfortable with it. Please feel free to suggest another one! Note that there isn’t an IPushEnumerable – again, I started off with one, but found it didn’t make sense. Maybe someone smarter than me will come up with a way of it fitting.

IPushEnumerator

/// <summary>
/// An enumerator which works in an inverse manner – the consumer requests
/// to be notified when a value is available, and then the enumerator
/// will call back to the consumer when the producer makes some data
/// available or indicates that all the data has been produced.
/// </summary>
public interface IPushEnumerator<T>
{
    /// <summary>
    /// Fetch the current value. Not valid until
    /// a callback with a True argument has occurred,
    /// or after a callback with a False argument has occurred.
    /// </summary>
    T Current { get; }

    /// <summary>
    /// Requests notification (via the specified callback) when more
    /// data is available or when the end of the data has been reached.
    /// The argument passed to the callback indicates which of these
    /// conditions has been met – logically the result of MoveNext
    /// on a normal IEnumerator.
    /// </summary>
    void BeginMoveNext(Action<bool> callback);
}

That bit is relatively easy to understand. I can ask to be called back when there’s data, and typically I’ll fetch the data within the callback and ask for more.

So far, so good. But what’s going to create these in the first place? How do we interface with LINQ? Time for an extension method.

Enumerable.GroupWithPush

I wanted to create an extension to IEnumerable<T> which had a “LINQ feel” to it. It should be quite like GroupBy, but then allow the processing of the subsequences to be expressed in a LINQ-like way. (Actual C# query expressions aren’t terribly useful in practice because there isn’t specific syntax for the kind of operators which turn out to be useful with this approach.) We’ll want to have type parameters for the original sequence (TElement), the key used for grouping (TKey) and the results of whatever processing is performed on each sequence (TResult).

So, the first parameter of our extension method is going to be an IEnumerable<TElement>. We’ll use a Func<TElement,TKey> to map source elements to keys. We could optionally allow an IEqualityComparer<TKey> too – but I’m certainly not planning on supporting as many overloads as Enumerable.GroupBy does. The final parameter, however, needs to be something to process the subsequence. The first thought would be Func<IPushEnumerator<TElement>,TResult> – until you start trying to implement the extension method or indeed the delegate doing the processing.

You see, given an IPushEnumerator<TElement> you really don’t want to return a result. Not just yet. After all, you don’t have the data yet, just a way of being given the data. What you want to return is the means of the caller obtaining the result after all the data has been provided. This is where we need to introduce a Future<T>.

Future<T>

If you don’t know about the idea of a future, it’s basically an IOU for a result. In proper threading libraries, futures allow the user to find out whether a computation has completed or not, wait for the result etc. My implementation of Future<T> is not that smart. It’s not smart at all. Here it is:

/// <summary>
/// Poor-man’s version of a Future. This wraps a result which *will* be
/// available in the future. It’s up to the caller/provider to make sure
/// that the value has been specified by the time it’s requested.
/// </summary>
public class Future<T>
{
    T value;
    bool valueSet = false;

    public T Value 
    {
        get
        {
            if (!valueSet)
            {
                throw new InvalidOperationException("No value has been set yet");
            }
            return value;
        }
        set
        {
            valueSet = true;
            this.value = value;
        }
    }
}

With this in place, we can reveal the actual signature of GroupWithPush:

public static IEnumerable<KeyValuePair<TKey, TResult>> GroupWithPush<TElement, TKey, TResult>
    (this IEnumerable<TElement> source,
     Func<TElement, TKey> mapping,
     Func<IPushEnumerator<TElement>, Future<TResult>> pipeline)

I shall leave you to mull over that – I don’t know about you, but signatures of generic methods always take me a little while to decode.

The plan is to then implement extension methods on IPushEnumerator<T> so that we can write code like this:

var query = logEntryReader.GroupWithPush(entry => entry.Customer,
                                         sequence => sequence.Count());

foreach (var result in query)
{
    Console.WriteLine ("Customer {0}: {1} entries",
                       result.Key,
                       result.Value);
}

Okay, so how do we implement these operators? Let’s give an example – Count being a pretty simple case.

Implementing Count()

Let’s start off by looking at a possible Count implementation for a normal sequence, to act as a sort of model for the implementation in the weird and wacky land of futures and push enumerators:

public static int Count<T>(IEnumerable<T> source)
{
    int count = 0;
    foreach (T item in source)
    {
        count++;
    }
    return count;
}

Now, we’ve got two problems. Firstly, we’re not going to return the count – we’re going to return a Future. Secondly, we certainly can’t use foreach on an IPushEnumerator<T> – the whole point is to avoid blocking while we wait for data. However, the concept of “for each element in a sequence” is useful – so let’s see whether we can do something similar with another extension method, then come back and use it in Count.

Implementing ForEach()

Warning: this code hurts my head, and I wrote it. Even the idea of it hurts my head a bit. The plan is to implement a ForEach method which takes two delegates – one which is called for each item in the enumerator, and one which is called after all the data has been processed. It will return without blocking, but it will call BeginMoveNext first, using a delegate of its own. That delegate will be called when data is provided, and it will in turn call the delegates passed in as parameters, before calling BeginMoveNext again, etc.

Ready?

public static void ForEach<T>(this IPushEnumerator<T> source, 
                              Action<T> iteration,
                              Action completion)
{
    Action<bool> moveNextCallback = null;
            
    moveNextCallback = dataAvailable =>
         {
             if (dataAvailable)
             {
                 iteration(source.Current);
                 source.BeginMoveNext(moveNextCallback);
             }
             else
             {
                 completion();
             }
         };

    source.BeginMoveNext(moveNextCallback);
}

What I find particularly disturbing is that moveNextCallback is self-referential – it calls BeginMoveNext passing itself as the parameter. (Interestingly, you still need to assign it to null first, otherwise the compiler complains that it might be used without being assigned. I seem to remember reading a blog post about this before now, and thinking that I’d never ever run into such a situation. Hmm.)

Nasty as ForEach is in terms of implementation, it’s not too bad to use.

Implementing Count() – the actual code

The translation of the original Count is now relatively straightforward. We prepare the Future wrapper for the result, and indicate that we want to iterate through all the entries, counting them and then setting the result value when we’ve finished (which will be long after the method first returns, don’t forget).

public static Future<int> Count<T>(this IPushEnumerator<T> source)
{
    Future<int> ret = new Future<int>();
    int count = 0;

    source.ForEach(t => count++, 
                   () => ret.Value = count);
    return ret;
}

We’re nearly there now. All we need to do is complete the original GroupWithPush method:

Implementing GroupWithPush

There are three phases to GroupWithPush, as mentioned before: pushing the data to the consumers (creating those consumers as required based on the keys we see); telling all the consumers that we’ve finished; retrieving the results. It’s probably easiest just to show the code – it’s actually not too hard to understand.

public static IEnumerable<KeyValuePair<TKey, TResult>> GroupWithPush<TElement, TKey, TResult>
    (this IEnumerable<TElement> source,
     Func<TElement, TKey> mapping,
     Func<IPushEnumerator<TElement>, Future<TResult>> pipeline)
{
    var enumerators = new Dictionary<TKey, SingleSlotPushEnumerator<TElement>>();
    var results = new Dictionary<TKey, Future<TResult>>();
    // Group the data, pushing it to the enumerators at the same time.
    foreach (TElement element in source)
    {
        TKey key = mapping(element);
        SingleSlotPushEnumerator<TElement> push;
        if (!enumerators.TryGetValue(key, out push))
        {
            push = new SingleSlotPushEnumerator<TElement>();
            results[key] = pipeline(push);
            enumerators[key] = push;
        }
        push.Push(element);
    }
    // Indicate to all the enumerators that we’ve finished
    foreach (SingleSlotPushEnumerator<TElement> push in enumerators.Values)
    {
        push.End();
    }
    // Collect the results, converting Future<T> into T for each one.
    foreach (var result in results)
    {
        yield return new KeyValuePair<TKey, TResult>(result.Key, result.Value.Value);
    }
}

I haven’t introduced SingleSlotPushEnumerator before, but as you can imagine, it’s an implementation of IPushEnumerator, with Push() and End() methods to provide data or indicate the end of the data stream. It’s not terribly interesting to see, in my view.
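
For completeness, though, a single-threaded implementation only needs to remember the most recent callback – something like this sketch (my guess at the shape; the real class may well differ):

public class SingleSlotPushEnumerator<T> : IPushEnumerator<T>
{
    T current;
    Action<bool> callback;

    public T Current { get { return current; } }

    public void BeginMoveNext(Action<bool> callback)
    {
        // Just remember who to notify; the data arrives later via Push or End.
        this.callback = callback;
    }

    public void Push(T item)
    {
        current = item;
        // The consumer will typically call BeginMoveNext again from within this callback.
        callback(true);
    }

    public void End()
    {
        callback(false);
    }
}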

Conclusion

So, that’s what I’ve been looking at and thinking about for the last few evenings. I’ve implemented quite a few of the standard query operators, although not all of them are worth doing. I’m not currently viewing this as anything more than an interesting exercise, partly in terms of seeing how far I can push the language, but if anyone thinks it’s worth pursuing further (e.g. as a complete implementation as far as sensibly possible, either in MiscUtil or on SourceForge) I’d be very happy to hear your ideas. Frankly, I’d be glad and slightly surprised just to find out that anyone made it this far.

Oh, exercise for the reader – draw out a sequence diagram of how all this behaves :)

I love LINQ: Simplifying a tedious task

As mentioned in my previous post, I’ve been putting together the code samples for C# in Depth. Now, these are spread across several projects in a few solutions. They’re referred to in the book as things like “Listing 6.2” but I’ve given the files “real” names in the projects. When you run any project with multiple executable listings in it, the project offers you options of which listing to run, showing both the type name and the listing, which is embedded using System.ComponentModel.DescriptionAttribute, e.g. [Description("Listing 12.4")]. A few listings have a description which is more than just “Listing x.y” – for instance, “Listing 6.1/6.2/6.3”. These are displayed with no problems.

Now, the problem for the reader would be finding a particular listing – it’s not always obvious from the code what the type name should be, particularly when there are many variations on a theme. Clearly some sort of map is required. Ideally it should be a file looking something like this:

Chapter 1:
1.1: Foo.cs
1.2: Bar.cs
1.3/1.4: Baz.cs

Chapter 2:
2.1: Gronkle.cs

It’s easy enough to work out the directory for any particular file – the projects are helpfully named “Chapter3” and the like. So, the next thing is to create this file. I really didn’t want to do that by hand. After all, there are about 150 listings in the book – and I’ve already done the work of attributing them all. Ah… we could do it programmatically. Sounds like a bit of a slog…

… but it’s a problem which is ideally suited to LINQ. It’s also fairly ideally suited to regular expressions, much as I hate to admit it. The regular expression in question is reasonably complex, but thanks to Jesse Houwing’s advice on adding comments to regular expressions, the results aren’t too bad. Here’s the finished code – which of course is part of the downloadable source code itself.

using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

namespace Chapter11.Queries
{
    /// <summary>
    /// The listings are scattered around .cs files within various directories.
    /// This class uses LINQ to find all classes with a suitable Description
    /// attribute, groups them by chapters and orders them by chapter and listing number.
    /// </summary>
    class DisplayListingsMap
    {
        static readonly Regex ListingPattern = new Regex(
            @"# First match the start of the attribute, up to the bit we're interested in
            \[Description\(""Listing\s
            # The 'text' group is the whole of the description after Listing
            (?<text>
            # The 'chapter' group is the first set of digits in the description, before a dot
            (?<chapter>\d+)\.
            # The 'listing' group is the second set of digits in the description
            (?<listing>\d+)
            # After that we don't care - stop the 'text' group at the double quote
            [^""]*)
            # Now match the end of the attribute
            ""\)\]",
            RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

        static void Main()
        {
            DirectoryInfo directory = new DirectoryInfo(@"..\..\..\..");

            var query = from file in directory.GetFiles("*.cs", SearchOption.AllDirectories)
                        let match = ListingPattern.Match(File.ReadAllText(file.FullName))
                        where match.Success
                        let Details = new
                        {
                            Text = match.Groups["text"].Value,
                            Chapter = int.Parse(match.Groups["chapter"].Value),
                            Listing = int.Parse(match.Groups["listing"].Value)
                        }
                        orderby Details.Chapter, Details.Listing
                        group new { File = file, Description=Details.Text } by Details.Chapter;

            foreach (var chapter in query)
            {
                Console.WriteLine("Chapter {0}", chapter.Key);
                foreach (var listing in chapter)
                {
                    Console.WriteLine("{0}: {1}", listing.Description, listing.File.Name);
                }
                Console.WriteLine();                
            }
        }
    }
}

Isn’t it cool? The regex works out the listing number (first x.y part only) and sorts on that, grouping by chapter – then we just display the results. There are other ways of skinning the same cat – such as grouping and then ordering “inside” and “outside” a chapter separately – but they’ll all boil down to the same sort of thing.
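
For instance, the group-then-order-separately shape might look something like this (just a sketch, reusing the same pattern and anonymous types as above):

var alternative = from file in directory.GetFiles("*.cs", SearchOption.AllDirectories)
                  let match = ListingPattern.Match(File.ReadAllText(file.FullName))
                  where match.Success
                  let details = new
                  {
                      Text = match.Groups["text"].Value,
                      Chapter = int.Parse(match.Groups["chapter"].Value),
                      Listing = int.Parse(match.Groups["listing"].Value)
                  }
                  group new { File = file, Description = details.Text, details.Listing } by details.Chapter
                  into chapter
                  orderby chapter.Key
                  select new { Chapter = chapter.Key,
                               Listings = chapter.OrderBy(entry => entry.Listing) };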

LINQ to Silliness: Generating a Mandelbrot with parallel potential

I’ve been writing about LINQ recently, and in particular I’ve written a small amount about Parallel LINQ. (Don’t get excited – it’s only about a page, just to mention it as a sort of “meta-provider” for LINQ.) I was wondering what to use to demonstrate it – what general task can we perform which could take a lot of CPU?

Well, I used to be quite into fractals, and I’ve written Mandelbrot set generators in various languages. I hadn’t done it in C# before now, however. Calculating the colour of each pixel is completely independent of all the other pixels – it’s an “embarrassingly parallelizable” task. So, a great task for PLINQ. Here’s the “normal LINQ” code:

 

var query = from row in Enumerable.Range(0, ImageHeight)
            from col in Enumerable.Range(0, ImageWidth)
            select ComputeMandelbrotIndex(row, col);

byte[] data = query.ToArray();

Changing this into a parallel query is really simple – although we do need to preserve the ordering of the results:

var query = from row in Enumerable.Range(0, ImageHeight).AsParallel(QueryOptions.PreserveOrdering)
            from col in Enumerable.Range(0, ImageWidth)
            select ComputeMandelbrotIndex(row, col);

byte[] data = query.ToArray();

Without being able to actually use PLINQ yet, I can’t tell how awful the order preservation is – Joe warns that it’s costly, but we’ll see. This is on a pretty giant sequence of data, of course… An alternative would be to parallelize a row at a time, but that loses some of the purity of the solution. This is a very, very silly way of parallelizing the task, but it’s got a certain quirky appeal.

Of course, there’s then the code for ComputeMandelbrotIndex and displaying a bitmap from it – the full code is available for download (it’s a single C# file – just compile and run). Enjoy.
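
If you're curious about the shape of that method without downloading anything, it's the standard escape-time loop – roughly along these lines (a sketch only: the viewport mapping, iteration limit and palette handling in the real code will differ):

const int MaxIterations = 255;

static byte ComputeMandelbrotIndex(int row, int col)
{
    // Map the pixel onto a region of the complex plane (roughly -2.25..0.75 by -1.5..1.5).
    double x = -2.25 + 3.0 * col / ImageWidth;
    double y = -1.5 + 3.0 * row / ImageHeight;

    double re = 0, im = 0;
    for (int iteration = 1; iteration <= MaxIterations; iteration++)
    {
        double newRe = re * re - im * im + x;
        im = 2 * re * im + y;
        re = newRe;
        if (re * re + im * im > 4)
        {
            // The point escaped: how quickly it did so determines its colour.
            return (byte) iteration;
        }
    }
    // Never escaped within the limit - treat as inside the set.
    return 0;
}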

Update!

This blog post has been picked up by Nick Palladinos, who has written his own Parallel LINQ provider (much kudos for that – unfortunately for me the blog is in Greek, which I don’t understand). Apparently on a dual core processor the parallelised version of the Mandelbrot generator is indeed about twice as fast – it works! Unfortunately I can’t tell as my laptop only has a single core… it’s very exciting though :)

A short case study in LINQ efficiency

I’ve been thinking a bit about how I’d use LINQ in real life (leaving DLinq and XLinq alone for the moment). One of the examples I came up with is a fairly common one – trying to find the element in a collection which has the maximum value for a certain property. Note that quite often I don’t just need to know the maximum value of the property itself – I need to know which element had that value. Now, it’s not at all hard to implement that in “normal” code, but using LINQ could potentially make the intention clearer. So, I tried to work out what the appropriate LINQ expression should look like.

I’ve come up with three ways of expressing what I’m interested in in LINQ. For these examples, I’ve created a type NameAndSize which has (unsurprisingly) properties Name (a string) and Size (an int). For testing purposes, I’ve created a list of these items, with a variable list storing a reference to the list. All samples assume that the list is non-empty.
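
For reference, the test type is nothing more exciting than this (a minimal sketch matching that description):

class NameAndSize
{
    string name;
    int size;

    public string Name { get { return name; } set { name = value; } }
    public int Size { get { return size; } set { size = value; } }
}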

Sort, and take first element

(from item in list
 orderby item.Size descending
 select item).First();

This orders by size descending, and takes the first element. The obvious disadvantage of this is that before finding the first element (which is the only one we care about) we have to sort all the elements – nasty! Assuming a reasonable sort, this operation is likely to be O(n log n).

Subselect in where clause

(from item in list
 where item.Size == list.Max(x => x.Size)
 select item).First();

This goes through the list, finding every element whose size is equal to the maximum one, and then takes the first of those elements. Unfortunately, the comparison calculates the maximum size on every iteration. This makes it an O(n^2) operation.

Two selects

int maxSize = list.Max(x => x.Size);
NameAndSize max = (from item in list
                   where item.Size == maxSize
                   select item).First();

This is similar to the previous version, but solves the problem of the repeated calculation of the maximum size by doing it before anything else. This makes the whole operation O(n), but it’s still somewhat dissatisfying, as we’re having to iterate through the list twice.

The non-LINQ way

NameAndSize max = list[0];
foreach (NameAndSize item in list)
{
    if (item.Size > max.Size)
    {
        max = item;
    }
}

This keeps a reference to the “maximum element so far”. It only iterates through the list once, and is still O(n).

Benchmarks

Now, I’ve written a little benchmark which runs all of these except the “embedded select” version which was just too slow to run sensibly by the time I’d made the list large enough to get sensible results for the other versions. Here are the results using a list of a million elements, averaged over five runs:
Sorting: 437ms
Two queries: 109ms
Non-LINQ: 38ms

After tripling the size of the list, the results were:
Sorting: 1437ms
Two queries: 324ms
Non-LINQ: 117ms

These results show the complexities being roughly as predicted above, and in particular show that it’s definitely cheaper to only iterate through the collection once than to iterate through it twice.

Now, this query is a fairly simple one, conceptually – it would be a shame if LINQ couldn’t cope with it efficiently. I suspect it could be solved by giving the Max operator another parameter, which specified what should be selected, as well as what should be used for comparisons. Then I could just use list.Max(item => item.Size, item=>item). At that stage, the only loss in efficiency would be through invoking the delegates, which is a second-order problem (and one which is inherent in LINQ). Fortunately, the way LINQ works makes this really easy to try out – just write an extension class:

static class Extensions
{
    public static V Max<T,V>(this IEnumerable<T> source, 
                             Func<T,int> comparisonMap, 
                             Func<T,V> selectMap)
    {
        int maxValue=0;
        V maxElement=default(V);
        bool gotAny = false;
        using (IEnumerator<T> enumerator = source.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                T sourceValue = enumerator.Current;
                int value = comparisonMap(sourceValue);
                if (!gotAny || value > maxValue)
                {
                    maxValue = value;
                    maxElement = selectMap(sourceValue);
                    gotAny = true;
                }
            }
        }
        if (!gotAny)
        {
            throw new EmptySequenceException();
        }
        return maxElement;
    }
}
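
Calling it then looks just like the line suggested earlier:

NameAndSize max = list.Max(item => item.Size, item => item);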

This gave results of 57ms and 169ms for the two list sizes used earlier – not quite as fast as the non-LINQ way, but much better than any of the others – and by far the simplest to express, too.

Lessons learned

  • You really need to think about the complexity of LINQ queries, and know where they will be executed (I suspect that DLinq would have coped with the “subselect” version admirably).
  • There’s still some more work to be done on the standard query operators to find efficient solutions to common use cases.
  • Even if the standard query operators don’t quite do what you want, it can be worthwhile to implement your own – and it’s not that hard to do so!

LINQ, DLinq, XLinq and C# 3.0

Some readers may already be aware of Project LINQ – .NET Language Integrated Query. There have been various posts about it (and in particular the effect it has on C# itself) on the C# newsgroup, and many of those have involved a certain amount of speculation. This is understandable, as it’s still very much under development, and although a sample initial implementation has been available since the PDC, I’d be surprised if things didn’t change.

I haven’t downloaded the PDC implementation, and don’t intend to for a while. I have, however, finally had a chance to read the specs and overviews of LINQ, DLinq, XLinq and C# 3.0. I haven’t pored over them, and again don’t intend to just yet – I believe that features like these are best investigated in anger. Once I have a use for them, I’ll probably look more closely at them. However, here is my (literally – I’m typing in a plane here) 34,000 foot view of what’s going on. This won’t include any code samples, not even ones copied from the documentation, so you may well find it useful to read the Microsoft documents before proceeding much further.

The basic premise

Imperative programming languages such as C, C++, Java, C#, VB and VB.NET have traditionally made it hard to work with relational databases, and made it possibly even harder to work with XML. The latter should come as a surprise in a way – XML was designed much later than SQL, and should have benefitted from a lot of hindsight in terms of technology design. Don’t get me wrong – I’m not anti-XML per se, but the APIs for working with it have generally sucked, particularly the W3C ones. There are half-decent APIs available in .NET, and various open source libraries in Java-land such as Dom4J and JDom which improve the situation, but don’t feel like they’ve quite cracked it.

Both XML and relational databases have what is commonly called an impedance mismatch with object-oriented languages – they just don’t think about things in the same way. Database tables are related to each other rather than entities being composed of each other, and object identity and equality pose really significant problems when trying to map data from one paradigm to the other. Just as with XML manipulation, there have been various attempts to solve (or at least greatly help) the mismatch problem, often with libraries or tools called Object Relational Mappers (ORM). There are many different ORM tools available – probably more for Java than .NET, possibly due to the longer timescale of availability of Java. Beyond the sphere of Java and .NET, I only know of one other ORM tool, which is ActiveRecord for Ruby, usually used with the Rails framework for web applications. I’m sure there are plenty of others available though – no need to comment on my ignorance here!

The Powers-That-Be in the Java world are trying to semi-standardise ORM with the EJB 3.0 specification, which I believe is currently at the public review stage. In theory, this should mean that marking up some Java classes with annotations (attributes to .NET folks) will allow the same classes to be used with multiple ORMs. I suspect that this facility won’t be used for multiple-ORM support very often, as you tend to need to know which ORM you’re targeting for other reasons, but it does mean that the bigwigs of the ORM world have got together to try to work out at least a common way of talking about ORM. I should say at this point that the only ORM I’ve had any real experience with is Hibernate, which is generally excellent, although rough around the edges in a few places. It has a .NET equivalent, NHibernate, which I haven’t used.

So, what does this have to do with LINQ? Well, all of the above projects and tools have a problem – you don’t tend to get much language support for them, as they’ve had to work with the language features available to them, rather than directing the future of the languages they target. LINQ, on the other hand, has added many new features to C# (and VB 9.0) which should greatly add to the potential safety and efficiency of the solution. LINQ is neither XML-specific nor SQL-specific, but instead introduces various language features which target general querying, with DLinq and XLinq hooking those up to SQL and XML respectively. It is worth noting at this point that LINQ itself doesn’t try to be an ORM system at all – only half of DLinq is particularly LINQ-related, and that’s (naturally) the querying side. The update side of things requires no new language features, and looks somewhat like using Hibernate with annotations, at least at first glance.

The language features

So, what new language features does LINQ effectively require? I’m only going to cover the C# side of things here – VB 9.0 has gained some features supporting XLinq (of somewhat dubious value at a first glance, but I’ll let VB fans work out whether they’re actually good or not), but I won’t address how the new VB looks.

  • Lambda expressions: these look pretty powerful and (more importantly) useful – and not just as part of LINQ. Whether expression trees (where a lambda expression ends up as data rather than compiler code) will be good or not (at least outside LINQ, which I suspect pretty much requires them) remains to be seen. Again though, there’s a lot of potential.
  • Extension methods: ouch. I can see why it’s being done, and I can see why it will be potentially very useful, but I suspect it will be abused hideously. The worst thing is, I can see times when I might well abuse it and like it at the time – System.String doesn’t have a method which would be handy? Heck, make it an extension. Fine at write-time, but not so fun at maintenance time. This is just a gut feeling at this stage, but I’m frankly a little worried. If my team were using C# 3.0, I’d want any language extensions to be code-reviewed by two people (other than the original developer) rather than just one (or if pair programming was going on, at least one extra pair of eyes).
  • Object initializers: yes, yes, yes. Great stuff.
  • Anonymous types: these could be interesting. They certainly make sense in LINQ, but they potentially have use outside it too. How many times have you had local variables which were essentially linked to each other, but that relationship could only be expressed through naming? I’m thinking of situations like searching for the minimum element in a collection (which admittedly LINQ would make a piece of cake anyway) – you typically want to remember the minimum value and the index of the element which had that value. Coupling the two into a real type is too much effort, but with an anonymous type? There are possibilities there (there’s a small sketch after this list). I’ll have to see how it reads and how maintainable it is before I can really pass judgement.
  • Implicitly typed arrays: yes and no. Part of me screams that it’ll make things easier – and part screams that it’ll make things harder to read at a glance. I think “use with care” is the watch-phrase here.
  • Implicitly typed local variables: Hmm. Very much “use with care”. I don’t want to see variables being implicitly typed all over the place, as they are in the specifications. They’re obviously necessary for anonymous types, but I’m not sure about their use outside that situation. There’s been a fair amount of discussion about this on the newsgroups, with no clear “winning” side. I think we’re all guessing really – we need to use this in anger, and wait for a year or two to see what the maintenance implications are.
  • Query expressions themselves: not sure. I can see why it’s nice to have them in the language – particularly having gone through an HQL (Hibernate Query Language) run/see error/debug/repeat cycle a few times, but at the same time it feels like it’s going a bit overboard. I think I’ll need to see real examples from real code in both query syntax and “dot notation” (calling normal methods) before making up my mind on this. I should note that this attitude is significantly more “friendly” towards query expressions than my first reactions, which were along the lines of “no, no, no, get that SQL out of my C# code.” That suggests I may well come to like it over time – but I’m not quite there yet.
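
To make the anonymous types point above slightly more concrete, here's the kind of thing I have in mind (just a sketch; values is assumed to be some list of ints):

// Tracking a minimum value and its index together, without declaring a named type for the pair.
var best = new { Value = int.MaxValue, Index = -1 };
for (int i = 0; i < values.Count; i++)
{
    if (values[i] < best.Value)
    {
        best = new { Value = values[i], Index = i };
    }
}
Console.WriteLine("Minimum {0} at index {1}", best.Value, best.Index);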

Concerns

I’m worried about C# expanding this quickly. I don’t know what the C# 3.0 timeframe is, but C# 2.0 isn’t out of the door yet, and we don’t know how developers are going to cope with generics in the real world yet. Introducing several new and pretty major language features at this stage seems premature – although of course they’re not really being introduced just yet. Java has tended to go in the opposite direction, only allowing change very slowly. This hasn’t always had positive results (there are some aspects of generics in Java which are truly awful compared with .NET generics – although it has its advantages) but it has generally given a lot of time for people to think about things and give feedback. Hopefully the reason for the C# 3.0 draft specs being made available at this stage is to get as much feedback as possible as early as possible.

Having said I’m worried about C# growing too quickly, there are ways in which I wish it would grow which don’t seem to be addressed at all – including ones which are present in prototype form in MS research labs (Spec# in particular). Things I’d like to see:

  • Simpler property definition, for properties which really are just setting variables and returning them – making thread-safety available by specifying a lock to apply would be nice too, if it didn’t interfere with the simplicity too much.
  • Design-by-contract – not so much for the “provability” side of things (which I’m sure would be great too, but which I have no direct experience of) but more for getting rid of all the irritating code I need to write just to verify arguments etc. This is an ideal target for machine-generated code. Proving the result side of the contract would be great too – not just “this parameter must not be null” but “the result of this operation is never null”.
  • Aspect-oriented programming – in a very similar vein to design-by-contract, I’m sure that AOP could have great benefits for cross-cutting concerns, and would work much better as a language construct than in libraries which need to do nasty code manipulation.

I’m sure there are more that I’ve thought of over the years, but these are my biggest gripes at the moment. Compared with the changes which are being made, they’re possibly relatively small, too. You can bet that I’ll be asking the C# team about the possibility of their inclusion while I’m at the MVP summit! (Don’t expect any results to be posted here though – I’m afraid it’ll almost certainly all be under NDA.)

Conclusion

However far away C# 3.0 may be, it has great promise – as well as a few big holes which the over-zealous developer wishing to use new features wherever possible may end up falling into. We’ll see how things shape up over time. My battery is running low, so until I’m near power again, goodbye…