Group pipelining returns: new and improved design

Last night’s blog post provoked a flurry of emails between myself and Marc Gravell. Looking back, trying to base the pipeline on a pseudo-asynchronous version of IEnumerable<T> was a mistake. We’ve now got a much more attractive interface to write extensions against:

 

public interface IDataProducer<T>
{
    event Action<T> DataProduced;
    event Action EndOfData;
}

Why is this so much better? A few reasons:

  1. Acting on all the data is much easier: just subscribe to the events. No need for a weird ForEach.
  2. It fits the normal processing model much better – we tend to want to process the data, not whether or not there are more elements on their way.
  3. It allows unsubscription when we’re not interested in any more data.
  4. It allows multiple subscribers.

The last point is very, very interesting. It means we can implement GroupWithPipeline (current new name – I suspect it won’t last forever though) to take multiple pipelines, producing a KeyValueTuple with, say, three values in, each of different types, still strongly typed. If we don’t care about the type safety, we can use “params…” to deal with as many pipelines as we want – the client then needs to cast, which isn’t ideal, but isn’t too bad.

As an idea of what I mean by this, consider this call to find the Max, Min and Average values, all without buffering:

 

var query = someSequenceOfOrders
                .GroupWithPipeline (entry => entry.Customer,
                                    seq => seq.Min(entry => entry.OrderSize),
                                    seq => seq.Average(entry => entry.OrderSize),
                                    seq => seq.Max(entry => entry.OrderSize));
                .Select(x => new { Customer = x.Key,
                                   Min = x.Value1,
                                   Avereage = x.Value2,
                                   Max = x.Value3 });
                                   
foreach (var result in query)
{
    Console.WriteLine (“Customer {0}: {1}/{2}/{3}”
                       result.Customer,
                       result.Min,
                       result.Average,
                       result.Max);
}

We specify three different pipelines, all of which will be applied to the same sequence of data. The fact that we’ve specified OrderSize three times is unfortunate – a new overload to transform the entries passed to the pipeline is probably in order – but it’s all doable.

This sort of “in pipeline” multiple aggregation is very, very cool IMO. It’s turned the whole idea from “interesting” to “useful enough to get into MiscUtil”.

I haven’t actually written Min, Max or Average yet – although Marc has, I believe. (We’re collaborating and sharing source, but not working off a common source control system yet. It’s all a bit ad hoc.) What I know he’s done which is possibly even more useful than all of this to start with is used expression trees to implement generic maths. This is only execution time checked, which is unfortunate, but I don’t believe that will be a problem in real life.

The upshot is that the above code will work with any type with appropriate operators defined. No need for loads of overloads for decimal, long, int, float, double etc – it will just work. If you’re worried about performance, you can relax – it performs very, very well, which was a bit of a surprise to both of us.

More of that in another post though… I wanted to share the new design because it’s so much nicer than the deeply complicated stuff I was working with yesterday.

Don’t call us, we’ll call you – push enumerators

Update: I’ve got a new and simpler design now. I’m leaving this in for historical interest, but please see the entry about the new design for more recent information.

This post is going to be hard to write, simply because I can’t remember ever writing quite such bizarre code before. When I find something difficult to keep in my own head, explaining it to others is somewhat daunting, especially when blogging is so much more restrictive than face-to-face discussion. Oh, and you’ll find it hard if you’re not familiar with lambda expression syntax (x => x.Foo etc). Just thought I’d warn you.

It’s possibly easiest to explain with an example. It’s one I’m hoping to use in a magazine article – but I certainly won’t be trying to explain this in the article. Imagine you’ve got a lot of log entries on disk – by which I mean hundreds of millions. You certainly don’t want all of that in memory. However, each of the log entries contains a customer ID, and you want to find out for each customer how many entries there are. Here’s a LINQ query which would work but be horribly inefficient, loading everything into memory:

var query = from entry in logEntryReader
            group entry by entry.Customer into entriesByCustomer
            let count = entriesByCustomer.Count()
            orderby count descending
            select new { Customer = entriesByCustomer.Key, Count = count };

Now, it’s easy to improve this somewhat just by changing the “group entry by” to “group 1 by” – that way the entries themselves are thrown away. However, you’ve still got some memory per entry – a huge great enumeration of 1s to count after grouping.

The problem is that you can’t tell “group … by” how to aggregate the sequence associated with each key. This isn’t just because there’s no syntax to express it – it’s to do with the nature of IEnumerable itself. You see, the “pull” nature of IEnumerable is a problem. While a thread is waiting for more data, it will just block. Normally, an aggregator (like Count) just picks data off a sequence until it reaches the end, then returns the result. How can that work when there are multiple sequences involved (one for each customer)?

There are three answers to this:

1) Write your own group-and-count method. This is pretty straightforward, and potentially useful in many situations. It’s also fairly easy to understand. You just iterate through a sequence, and keep a dictionary of key to int, increasing each key’s count as you see elements. This is the pragmatic solution when faced with a specific problem – but it feels like there should be something better, something that lets us group and then specify the processing in terms of standard query operators.

2) Create a new thread and producer/consumer style IEnumerable for each key. Clearly this doesn’t scale.

3) Invert control of enumerations: put the producer in the driving seat instead of the consumer. This is the approach we’re talking about for the rest of the post.

A word on the term “asynchronous”

I don’t know whether my approach could truly be called asynchronous. What I haven’t done is make any of the code thread-aware at all, or even thread-safe. All the processing of multiple sequences happens in a single thread. I also don’t have the full BeginXXX, EndXXX using IAsyncResult pattern. I started down that line, but it ended up being a lot more complicated than what I’ve got now.

I’m pretty sure that what I’ve been writing is along the lines of CSPs (Communicating Sequential Processes) but I wouldn’t in any way claim that it’s a CSP framework, either.

However, you may find that it helps to think about asynchronous APIs like Stream.BeginRead when looking at the rest of the code. In particular, reading a stream asynchronously has the same “say I’m interested in data, react to data, request some more” pattern.

Keeping the Count aggregator in mind, what we want to do is maintain a private count, and request some data. When we get called back to say there is more data, we increment our count (ignoring the data) and request some more. When we are told that there’s no more data, we can return the count.

With that said, here’s the interface for what I’ve called IPushEnumerator. The name is open for change – I’ve been through a few options, and I’m still not comfortable with it. Please feel free to suggest another one! Note that there isn’t an IPushEnumerable – again, I started off with one, but found it didn’t make sense. Maybe someone smarter than me will come up with a way of it fitting.

IPushEnumerator

/// <summary>
/// An enumerator which works in an inverse manner – the consumer requests
/// to be notified when a value is available, and then the enumerator
/// will call back to the consumer when the producer makes some data
/// available or indicates that all the data has been produced.
/// </summary>
public interface IPushEnumerator<T>
{
    /// <summary>
    /// Fetch the current value. Not valid until
    /// a callback with a True argument has occurred,
    /// or after a callback 
    /// </summary>
    T Current { get; }

    /// <summary>
    /// Requests notification (via the specified callback) when more
    /// data is available or when the end of the data has been reached.
    /// The argument passed to the callback indicates which of these
    /// conditions has been met – logically the result of MoveNext
    /// on a normal IEnumerator.
    /// </summary>
    void BeginMoveNext(Action<bool> callback);
}

That bit is relatively easy to understand. I can ask to be called back when there’s data, and typically I’ll fetch the data within the callback and ask for more.

So far, so good. But what’s going to create these in the first place? How do we interface with LINQ? Time for an extension method.

Enumerable.GroupWithPush

I wanted to create an extension to IEnumerable<T> which had a “LINQ feel” to it. It should be quite like GroupBy, but then allow the processing of the subsequences to be expressed in a LINQ-like way. (Actual C# query expressions aren’t terribly useful in practice because there isn’t specific syntax for the kind of operators which turn out to be useful with this approach.) We’ll want to have type parameters for the original sequence (TElement), the key used for grouping (TKey) and the results of whatever processing is performed on each sequence (TResult).

So, the first parameter of our extension method is going to be an IEnumerable<TElement>. We’ll use a Func<TElement,TKey> to map source elements to keys. We could optionally allow an IEqualityComparer<TKey> too – but I’m certainly not planning on supporting as many overloads as Enumerable.GroupBy does. The final parameter, however, needs to be something to process the subsequence. The first thought would be Func<IPushEnumerator<TElement>,TResult> – until you start trying to implement the extension method or indeed the delegate doing the processing.

You see, given an IPushEnumerator<TElement> you really don’t want to return a result. Not just yet. After all, you don’t have the data yet, just a way of being given the data. What you want to return is the means of the caller obtaining the result after all the data has been provided. This is where we need to introduce a Future<T>.

Future<T>

If you don’t know about the idea of a future, it’s basically an IOU for a result. In proper threading libraries, futures allow the user to find out whether a computation has completed or not, wait for the result etc. My implementation of Future<T> is not that smart. It’s not smart at all. Here it is:

/// <summary>
/// Poor-man’s version of a Future. This wraps a result which *will* be
/// available in the future. It’s up to the caller/provider to make sure
/// that the value has been specified by the time it’s requested.
/// </summary>
public class Future<T>
{
    T value;
    bool valueSet = false;

    public T Value 
    {
        get
        {
            if (!valueSet)
            {
                throw new InvalidOperationException(“No value has been set yet”);
            }
            return value;
        }
        set
        {
            valueSet = true;
            this.value = value;
        }
    }
}

With this in place, we can reveal the actual signature of GroupWithPush:

public static IEnumerable<KeyValuePair<TKey, TResult>> GroupWithPush<TElement, TKey, TResult>
    (this IEnumerable<TElement> source,
     Func<TElement, TKey> mapping,
     Func<IPushEnumerator<TElement>, Future<TResult>> pipeline)

I shall leave you to mull over that – I don’t know about you, but signatures of generic methods always take me a little while to decode.

The plan is to then implement extension methods on IPushEnumerator<T> so that we can write code like this:

var query = logEntryReader.GroupWithCount(entry => entry.Customer,
                                          sequence => sequence.Count());

foreach (var result in query)
{
    Console.WriteLine (“Customer {0}: {1} entries”,
                       result.Key,
                       result.Value);
}

Okay, so how do we implement these operators? Let’s give an example – Count being pretty a simple case.

Implementing Count()

Let’s start off by looking at a possible Count implementation for a normal sequence, to act as a sort of model for the implementation in the weird and wacky land of futures and push enumerators:

public static int Count<T>(IEnumerable<T> source)
{
    int count = 0;
    foreach (T item in source)
    {
        count++;
    }
    return count;
}

Now, we’ve got two problems. Firstly, we’re not going to return the count – we’re going to return a Future. Secondly, we certainly can’t use foreach on an IPushEnumerator<T> – the whole point is to avoid blocking while we wait for data. However, the concept of “for each element in a sequence” is useful – so let’s see whether we can do something similar with another extension method, then come back and use it in Count.

Implementing ForEach()

Warning: this code hurts my head, and I wrote it. Even the idea of it hurts my head a bit. The plan is to implement a ForEach method which takes two delegates – one which is called for each item in the enumerator, and one which is called after all the data has been processed. It will return without blocking, but it will call BeginMoveNext first, using a delegate of its own. That delegate will be called when data is provided, and it will in turn call the delegates passed in as parameters, before calling BeginMoveNext again, etc.

Ready?

public static void ForEach<T>(this IPushEnumerator<T> source, 
                              Action<T> iteration,
                              Action completion)
{
    Action<bool> moveNextCallback = null;
            
    moveNextCallback = dataAvailable =>
         {
             if (dataAvailable)
             {
                 iteration(source.Current);
                 source.BeginMoveNext(moveNextCallback);
             }
             else
             {
                 completion();
             }
         };

    source.BeginMoveNext(moveNextCallback);
}

What I find particularly disturbing is that moveNextCallback is self-referential – it calls BeginMoveNext passing itself a the parameter. (Interestingly, you still need to assign it to null first, otherwise the compiler complains that it might be used without being assigned. I seem to remember reading a blog post about this before now, and thinking that I’d never ever run into such a situation. Hmm.)

Nasty as ForEach is in terms of implementation, it’s not too bad to use.

Implementing Count() – the actual code

The translation of the original Count is now relatively straightforward. We prepare the Future wrapper for the result, and indicate that we want to iterate through all the entries, counting them and then setting the result value when we’ve finished (which will be long after the method first returns, don’t forget).

public static Future<int> Count<T>(this IPushEnumerator<T> source)
{
    Future<int> ret = new Future<int>();
    int count = 0;

    source.ForEach(t => count++, 
                   () => ret.Value = count);
    return ret;
}

We’re nearly there now. All we need to do is complete the original GroupWithPush method:

Implementing GroupWithPush

There are three phases to GroupWithPush, as mentioned before: pushing the data to the consumers (creating those consumers as required based on the keys we see); telling all the consumers that we’ve finished; retrieving the results. It’s probably easiest just to show the code – it’s actually not too hard to understand.

public static IEnumerable<KeyValuePair<TKey, TResult>> GroupWithPush<TElement, TKey, TResult>
    (this IEnumerable<TElement> source,
     Func<TElement, TKey> mapping,
     Func<IPushEnumerator<TElement>, Future<TResult>> pipeline)
{
    var enumerators = new Dictionary<TKey, SingleSlotPushEnumerator<TElement>>();
    var results = new Dictionary<TKey, Future<TResult>>();
    // Group the data, pushing it to the enumerators at the same time.
    foreach (TElement element in source)
    {
        TKey key = mapping(element);
        SingleSlotPushEnumerator<TElement> push;
        if (!enumerators.TryGetValue(key, out push))
        {
            push = new SingleSlotPushEnumerator<TElement>();
            results[key] = pipeline(push);
            enumerators[key] = push;
        }
        push.Push(element);
    }
    // Indicate to all the enumerators that we’ve finished
    foreach (SingleSlotPushEnumerator<TElement> push in enumerators.Values)
    {
        push.End();
    }
    // Collect the results, converting Future<T> into T for each one.
    foreach (var result in results)
    {
        yield return new KeyValuePair<TKey, TResult>(result.Key, result.Value.Value);
    }
}

I haven’t introduced SingleSlotPushEnumerator before, but as you can imagine, it’s an implementation of IPushEnumerator, with Push() and End() methods to provide data or indicate the end of the data stream. It’s not terribly interesting to see, in my view.

Conclusion

So, that’s what I’ve been looking at and thinking about for the last few evenings. I’ve implemented quite a few of the standard query operators, although not all of them are worth doing. I’m not currently viewing this as anything more than an interesting exercise, partly in terms of seeing how far I can push the language, but if anyone thinks it’s worth pursuing further (e.g. as a complete implementation as far as sensibly possible, either in MiscUtil or on SourceForge) I’d be very happy to hear your ideas. Frankly, I’d be glad and slightly surprised just to find out that anyone made it this far.

Oh, exercise for the reader – draw out a sequence diagram of how all this behaves :)

VS2005 and VS2008 co-existence

This is a “notepad” type post – I’m writing it as I go along, and I may write up the conclusions later on as a full article for one of the other web sites. However, the plan is to investigate how best to work in an environment where both VS2005 and VS2008 exist.

Environment under consideration

VS2005 has taken a long time to penetrate the market. I believe this is for three main reasons:

  • Using VS2005 means using .NET 2.0, which means a deployment headache
  • Using VS2005 means upgrading projects and solutions in a largely “magical” way
  • It is impossible to use a VS2005 project/solution with VS2003

VS2008 reduces all of these problems:

  • VS2008 can target .NET 2.0, 3.0 or 3.5; if you don’t need LINQ, WPF etc, you can just target the same platform as you did with VS2005.
  • The changes between VS2005 projects/solutions and VS2008 ones is minimal, and easily verifiable. (For a start, you can easily do a diff – something which wasn’t feasible from VS2003 to VS2005.)
  • It is possible to share projects between VS2005 and VS2008, even if different solution files are required.

In many cases it’s desirable to have a few developers trying out a new tool before a whole company decides to move over to it. In this case, I’m going to imagine I’m working at a company which is considering using VS2008, but wants to try it for a while first. It has many VS2005 projects/solutions, and the “bleeding edge” developers have to be able to develop on that codebase. Any similarity between this and my actual job may or may not be purely coincidental – I don’t really want to go into what Audatex‘s development plans are on a public blog!

Test environment

For the test, I’m using two laptops – one with just VS2005SP1 and one running both VS2005SP1 and VS2008. All installations are of Visual Studio Professional. We’ll call the “just VS2005SP1” laptop “Dull” and the “VS2005SP1 and VS2008” laptop “Shiny”. All tests will be done with plain console projects, just to keep sanity intact. I realise there may be other issues with other project types. Both laptops have the fabulous ReSharper installed for VS2005, but I’m hoping that won’t make any difference. I never took a backup when converting a solution or project. I haven’t included .suo files, as they shouldn’t be relevant.

Test 1: Simple upgrade

Dull: Create new solution (Test); add console project (Project1).
Shiny: Copy solution; open with VS2008; compare solution and project files.

Results:

Test.sln: Differences on lines 2 and 3. The original was:

Microsoft Visual Studio Solution File, Format Version 9.00
# Visual Studio 2005

After conversion, the format version became 10.00, and the 2005 became 2008.

Project1.csproj: After conversion, the first line has an additional ToolsVersion="3.5" attribute in the Project element. There is also an extra piece of XML in the PropertyGroup element:

<FileUpgradeFlags>
</FileUpgradeFlags>
<OldToolsVersion>2.0</OldToolsVersion>
< UpgradeBackupLocation>
</UpgradeBackupLocation>

Test 2: Move project file back to Dull

Dull: Copy just the project file from the upgraded Test1 back onto Dull, and reopen solution

Results: Solution and project both opened with no issues.

Test 3: Move solution file back to Dull

Dull: Copy the solution file from the upgraded Test1 back onto Dull, and reopen solution

Results: Unsurprisingly, this failed with an error message: “The selected file is a solution file, but was created by a newer version of this application and cannot be opened.”

Test 4: Add a project in VS2005

Dull: Add a new project (Project2)
Shiny: Copy over project directory; in VS2008 select “Add existing project”

Results: Same differences in Project2.csproj as before in Project1.csproj

Test 5: Add a project in VS2008

Shiny: Add a new project (Project3)
Dull: Copy over project directory; in VS2005 select “Add existing project”

Note that the new project was targeting .NET 2.0; I haven’t tried targeting other frameworks, but I’d imagine there could well be issues there.

Results: Initially, an error message complains about failing to find Microsoft.CSharp.targets. This is due to a difference in the project files, near the bottom. The VS2008 project has this:

<Import Project=”$(MSBuildToolsPath)Microsoft.CSharp.targets” />

A project created in VS2005 has this instead:

<Import Project=”$(MSBuildBinPath)Microsoft.CSharp.targets” />

Note the difference between “Bin” and “Tools”. After editing the file manually, both VS2005 and VS2008 are able to open it, and neither one tries to change the file back again.

Test 6: Using C# 3 features

Shiny: Use var (local variable type inference) in a code file
Dull: Attempt to compile under VS2005.

Results: It comes as no surprise that we have code which builds under VS2008 but not under VS2005. Not good.

Test 7: Preventing VS2008 from using C# 3 features

Shiny: Edit Project3.csproj to set the ToolsVersion attribute to 2.0 instead of 3.5, then rebuild

Results: Disappointingly, it still builds under VS2008. However, other users have reported this as doing the right thing. Currently I don’t know what’s going on here.

Test 8: Trying harder…

Shiny: Edit Project3.csproj to add a LangVersion element. Inside the PropertyGroup element, I added this and rebuilt, based on guesswork from the csc options:

<LangVersion>ISO-2</LangVersion>

Dull: After removing the C# 3 code, copy the project file to test if VS2005 still understands it.

Results: Aargh! VS2005 now complains that LangVersion can only be ISO-1 or Default. Stalemate.

Conclusion

  • It’s possible to share project files but not solution files between VS2005 and VS2008.
  • If you upgrade a solution file by mistake, it’s very easy to fix it by hand.
  • If you decide to maintain different solution files, if there are big changes in one it may be easiest to just make them in one solution, then upgrade again.
  • Creating a project in VS2005 and then importing it into VS2008 is seamless; the other way round has slight issues which are fixable by hand.
  • I don’t know of a way of forcing VS2008 to only use C# 2 while at the same time maintaining VS2005 compatibility.

C# in Depth: Chapters 8 and 9 now in MEAP

I should have written this a little while ago, but it slipped my mind. Despite that, I think it’s pretty exciting for anyone who’s actually following along in the early access edition (and yes, such people do exist). The reason? These are the first C# 3 chapters. Here’s what they’ve got in them:

Chapter 8: Cutting fluff with a smart compiler

This is a bit like chapter 7, in that it’s a collection of small features. The difference between the two chapters is that the features in C# 3 mostly work together towards LINQ, so there’s more of an obvious connection between them. In this chapter we look at:

  • Automatically implemented properties (a.k.a. automatic properties)
  • Implicit typing of local variables (the “var” keyword – note that this is not dynamic typing)
  • Simplified initialization (object initializers, collection initializers)
  • Implicitly typed arrays
  • Anonymous types

Chapter 9: Lambda expressions and expression trees

If you have been following C# 3, you’ll know what these topics are – and if you don’t, I certainly can’t do them justice in a short blog post. In some senses, these are actually simple concepts: an easier way of creating a delegate instance, and a way of representing code as data, optionally using the same syntax. The implications of this are massive, though. We could have LINQ without anonymous types, even though it would be more painful. With that out of the way we wouldn’t really need implicitly typed local variables or arrays. But without expression trees and lambda expressions, LINQ would be dead in the water.

It takes more than just the concepts being there in the first place though. When was the last time you looked at the rules of type inference for generic methods? If your answer is “never” then you’re probably in the company of 99% of C# developers – and that’s perfectly reasonable, to be honest. However, the rules have changed for C# 3 to make them more powerful – to let the compiler do more work for us. The last part of chapter 9 deals with the changes. I should warn you, it’s not pretty – but it’s worth knowing that the rules are explained (with a little bit of hand waving) in a format which is slightly more user friendly than the specification.

 

So, that’s the most recent two chapters. The next to chapters to be released actually conclude the “strictly C#” part of the book, leaving just LINQ providers and the conclusion – we’re nearly there!

Getting started with F#

Updated 13th November: Robert pickering has answered the questions. I’ve included his answers inline. Thanks Robert!

The talks I went to at TechEd today (Friday) were mostly related to concurrency and F#. I’ve decided that although it’s fairly unlikely I’ll use it in a serious manner, it’s worth learning F# mostly to become more comfortable with functional programming in general. Apart from anything else, this is likely to help in terms of LINQ and general concurrency.

I was fortunate enough to win a copy of Robert Pickering’s “Foundations of F#” today at the final talk, answering a “pop quiz” question set by Joe Duffy. (Whoever claims that there’s no point in a C# developer knowing details of the CLR memory model is thus proven wrong in a most bizarre way.) I’ll probably buy Don Syme’s “Expert F#” when it comes out. I’ve been the Foundations book on the plane, where I’m currently hunched over my laptop, desperately hoping it doesn’t run out of power before we land. My situation has one important impact on this post: I haven’t actually written any F# yet. As it happens, with the VS2008 release being pretty imminent, I’ll probably wait until that’s out before installing F#.

In a way, my complete inexperience with the language is a good thing, for the sake of what I’m writing here: these are my raw impressions of F# based solely on what I’ve read. Now, some of the points I’m going to make here are potentially about the book instead of F# – I want to make it perfectly clear that I’m not trying to criticize the book. In general, it’s been an easy read so far (I’ve read about 70 pages) and my main problem with it is that there are typos. Normally that’s fine, but there are some cases where I don’t know whether I’m missing something about F# or whether there’s just a typo. Errors like this are to be expected, however hard we try to avoid them. I know that I’m still finding typos in chapters of C# in Depth that I’ve read several times, and I’m sure there will be some in the finished product. Anyway, with that disclaimer, along with the reiteration that these are just first impressions, here are my thoughts so far. They’re numbered for the sake of ease of reference, and as I find out answers to any questions, I’ll include them at the end of the point in italics.

  1. Byte literal strings – what encoding is used? This feels like a bad idea, although I’m sure it’s useful sometimes. (At least the normal strings are plain .NET strings; as I understand it IronRuby uses non-Unicode strings internally by default and only converts to/from Unicode when calling .NET code. The things we do in the name of compatibility…)

    Robert: It doesn’t say what encode is used in the language specification, however I strongly supect its UTF-8 as the spec says that unicode characters are allowed but “A”B comes out as [|65uy|]. I think they were added to help out Ocaml users who are used to having mutable strings thus providing an easier migration for ocaml programs which take advantage of mutable strings (although having non-mutable strings generally feels like a good design choice for an FP).

  2. #light is used at the start of every listing , and Robert explains that he’s not going to explain the non-light syntax. I know this is only a single book, but isn’t that at least a good indiciation that perhaps it should be the default syntax style?

    Robert: The light syntax really isn’t that different from the non-light syntax, it just means you can miss out certainty tokens such as “in” and semi-colon (;) when the context is clear from the whitespacing. This small change does however make listings look a lot neater and forces you to make whitespacing reflect the structure of the program (which is a very good thing). The choice not to explain it was made for two reasons 1) I though explaining would be more confusing that just telling beginners not to worry about it 2) it provides beginners very clear guidance to what syntax they should be using. Should the F# compiler be light by default? I can see the advantages, but adds a little extra complexity for people trying to do F#/Ocaml cross compilation.

  3. List comprehensions: why do I need [] round the range when declaring a value, but not for iteration? I can do “for i in 1..5” but not “let x = 1..5”. Isn’t this inconsistent?

    Robert: This is because the surrounding bracket types denote the type of collection [] is list [||] is array and {} is Seq/IEnumerable, this is directly copied from creating a literal list of these types. Also you can’t use say “for i in 1..5” you need to say “for i in do …” which you can also do in list comprehension [for i in 1..5 -> (i, i*i) ] for a list of tuples of squares.

  4. The “when” clause in a for loop – I’m sure it’s to look like OCaml, but coming at a time when the non-functional world is used to “where”, it’s unfortunate. I’m not saying that F# has made the wrong decision here – it’s just a pain when two different histories collide.

    Robert: No comment :)

  5. P28 has ‘let objList = [box 1; box 2.0; box “three”]’. The box operator here isn’t fully explained as far as I can see – but the main interesting idea is that a string could be boxed. In “classic” .NET boxing refers to creating a reference type instance out of a value type value – but string is already a reference type. What’s going on here?

    Robert: box is not just for boxing structs, it also performs an upcast to type object. A result of F#’s type inference is that there is no implicit upcasts as there are in C#. F# provides the box keyword and :> operator to compensate for this

    Jon: That seems pretty unfortunate to me, when box already has a clearly defined meaning in .NET. I suspect this will confuse quite a few people.

  6. P37 looks like it’s using a C-style pre-decrement: ‘| x -> luc (x – 1) + luc (–x – 2)’ – is that –x just a typo for x?

    Robert: –x is just a strange and embarrassing typo (already listed in the errata)

  7. Another possible typo: P41, first listing the final line is ‘| [] -> ()’ where in the previous example the result had been [] instead of (). That makes more sense to me, as otherwise the function returns unit where otherwise it’s returning a list. Typo, or am I missing something?

    Robert: This is correct, in the other rules of this pattern matching we’re printing to the console, which has type unit as a result, so we need to use unit for the final rule as well.

    Jon: Oops. Doh!

  8. Union/record types: as a C# guy, my natural question is how these things look in IL. Is a union just a name and a value, and if so is it in a struct or a class? Will have to dig out reflector when I’ve got the compiler…

    Robert: Record types are classes with properties for the each of the fields. The union type is represented as class with inner classes to represent each of the cases. The outer class provides methods and properties for working with each of the cases from C#. There is a good description of what a union type looks like in the final chapter of the book.

  9. There are quotes in the output at the bottom of P48, from calling print_any: ‘”one”, “two”, “three”, “four”‘ – are these genuinely in the output? I haven’t found a specific description of print_any yet, but I don’t think we’ve seen quotes round all strings. I could be wrong though :)

    Robert: print_any tries to reconstruct its input into its literal equivalent, so strings come out quoted listed look like [“one”; “two”; “three”]

    Jon: Ah, handy. A bit like Groovy’s inspect() method.

  10. Another terminology clash: “raise” for exceptions rather than “throw” is likely to confusion me for a while; “raise” sounds like an event to my C#-tuned ears.

    Robert: Again, no comment

  11. Supposedly OCaml is really efficient when it comes to exceptions. Now, I usually find that when people talk about exceptions being expensive in .NET they’re basing that on experience under the debugger. Are OCaml exceptions really significantly faster than under .NET? It’s far from impossible, but I’d be interested to know.

    Robert: I believe OCaml exceptions are a lot faster that .NET exceptions, but then they don’t carry any of the debugging information that .NET exceptions do. It is reasonably common to use exceptions as follow control in OCaml which is a bit of a no no in .NET

  12. Why is “lazy” a keyword but “force” isn’t? It feels odd for part of a feature to be in the language but its other side (which is necessary, as far as I can see) being in the library instead.

    Robert: I believe lazy is a keyword because it helps tidy up the precedence when you are creating lazy values. It maps directly to a function in the F# libraries that can be used instead. I believe this is a design decision inherited from OCaml.

  13. “unit” is an interesting name for what I’d normally think of as “void” – I wonder what the history is here? “void” isn’t as descriptive as it might be, but “unit” is even less obvious…

    Robert: Wikipedia has some more info here: http://en.wikipedia.org/wiki/Unit_type but is doesn’t cover the history of the name.

  14. It took me a little while to get the hang of () really being like void instead of like null – so () as a function return is the equivalent of “return;” which requires a void return type – it’s not “return null;”. Nothing in the book suggests that it is null – that was my own misunderstanding, and I don’t know what caused it.

    Robert: No comment :)

  15. P59 talks about mutable record types, and states that “this operation [updating a record] changes the contents of the record’s field rather than changing the record itself”. Now, I suspect the difference being alluded to is the same as “changing a value within a reference type instance’s data is not the same as making a variable refer to a different instance” – but if it’s not, I don’t know what it is meant to be saying.

    Robert: Yes, you are change the value with the reference

  16. P67/68: Do while and for loops require a “done” terminator or not? The text claims they do, but the examples don’t include it. My guess is that the language specification changed while the book was being written, and that the examples are correct, but I’d appreciate clarification :)

    Robert: They only require done when there’s no #light declaration and yes #light declarations were added half way though writing the book

  17. P69: I assume “for x in y” works where y is any sequence (i.e. anything implementing IEnumerable)?

    Robert: Yes, its works for anything that implements IEnumerable<T> or IEnumerable.

  18. P70: Calling methods and specifying parameter names – in the example, the parameter names are actually in the same order as they’re declared in the method itself. Is this always the case? Can you reorder the use of the parameters? Note that (at least in C#) reordering could have an effect on what the eventual arguments were, if the evaluation of the arguments has side effects. I’d really like to see this feature in C# though – it would save on all those comments explaining which parameter method means what!

    Robert: Just tested:
    System.Console.WriteLine(arg = [| box 1 |], format = “{0}”)
    And it compiles okay, so named arguments can be reordered.

    Jon: Ooh, I’ve got feature envy :)

  19. Why can’t I use a .NET method as a value, just as I can use an F# function value? If the method were overloaded I might have to provide some type information to the compiler to tell it which overload I mean, but otherwise I should be able to use it just like anything else without wrapping it. I assume there’s a deep implementation reason why this is impossible, but it seems a bit of a shame.

    Robert: Actually you can now, it’s another case of the language spec changing while [the book was being written]

    Jon: Cool. Always good to hear news like that.

  20. P73: Indexers aren’t always called Item – that’s just the default name for the default indexer. You can have other named ones, although C# will only use whichever has the appropriate attribute applied. (Even in C# you can change the name emitted when you declare an indexer though.)

    Robert: I’ve just read sections 10.9 of C# 3.0 specification and I see no reference to attributes or the ability to change the indexer name. Also section 10.3.9.3 “Member names reserved for indexers” seems to suggest that Item is the only reserved name. What am I missing?

    Jon: The attribute in question is System.Reflection.DefaultMemberAttribute – but it appears that you can’t apply it in C# when there’s already an indexer, contrary to my previous belief. However, if you use IL which specifies a DefaultMemberAttribute other than Item, it works fine.

  21. Also on P73, there’s this line: ‘let temp = new ResizeArray<string>() in’ – is the ‘in’ part here a typo, or is it another bit of syntax that I’ve missed somewhere?

    Robert: “in” is optional here because of the #light syntax

  22. What’s the :> operator shown  on P77?

    Update: P82 explains that it’s the upcasting operator

  23. What is ‘try … match’ referred to on P77?

    Robert: This is try … match is the equivalent of try … catch

  24. Is box actually a keyword or not? It’s not listed in the list of keywords, but it looks like it should be…

    Robert: box is a function not an keyword, its defined in Microsoft.Fsharp.Core.Operators if you want to see its definition (prim-types.fs).

I don’t intend to make notes as I read the rest of the book – it takes too long. However, I hope any F# evangelists find these initial reactions useful to some extent.

Joe Duffy speaks as well as he writes

(This is being written on Thursday evening, but won’t be posted until Friday morning, just to explain any apparent oddities in timing. Quite appropriate, given the topic.)

I’m currently at TechEd in Barcelona, representing Iterative Training and generally trying to raise our profile a bit. (Mentioning the book to everyone I meet doesn’t hurt either.) All the sessions I’ve been to so far have been pretty good (or better) – but the real treat was hearing Joe Duffy speak. In a single session, he covered threading from why it’s necessary (and becoming more so) to some of the details of the memory model. Obviously this meant that for any single person, either some of the talk would be already known or some of the talk would be way beyond them – but even so, it had a really good feeling to it. We saw a little bit of what’s coming up in ParallelFX – and although Joe can’t say at the moment whether it has any of the Coordination and Concurrency Runtime stuff in it, what he presented certainly sounded pretty similar. Oh, and the CTP will be released really soon, with any luck. (I really hope so – I’ve got a sample in my book which I need to check before it goes to print!)

Anyway, the bottom line is that I’m still completely in awe of Joe. I don’t know his age, but I’m sure he’s not that much older than me – but his confidence and authority on what is fundamentally a massive and difficult topic is incredible.

Tomorrow should be a lot of fun – it’s a shorter day than today, but it looks like I might finally get to learn some F#…

I love LINQ: Simplifying a tedious task

As mentioned in my previous post, I’ve been putting together the code samples for C# in Depth. Now, these are spread across several projects in a few solutions. They’re referred to in the book as things like “Listing 6.2” but I’ve given the files “real” names in the projects. When you run any project with multiple executable listings in it, the project offers you options of which listing to run, showing both the type name and the listing, which is embedded using System.ComponentModel.DescriptionAttribute, e.g. [Description("Listing 12.4")]. A few listings have a description which is more than just “Listing x.y” – for instance, “Listing 6.1/6.2/6.3”. These are displayed with no problems.

Now, the problem for the reader would be finding a particular listing – it’s not always obvious from the code what the type name should be, particularly when there are many variations on a theme. Clearly some sort of map is required. Ideally it should be a file looking something like this:

Chapter 1:
1.1: Foo.cs
1.2: Bar.cs
1.3/1.4: Baz.cs

Chapter 2:
2.1: Gronkle.cs

It’s easy enough to work out the directory for any particular file – the projects are helpfully named “Chapter3” and the like. So, the next thing is to create this file. I really didn’t want to do that by hand. After all, there are about 150 listings in the book – and I’ve already done the work of attributing them all. Ah… we could do it programmatically. Sounds like a bit of a slog…

… but it’s a problem which is ideally suited to LINQ. It’s also fairly ideally suited to regular expressions, much as I hate to admit it. The regular expression in question is reasonably complex, but thanks to Jesse Houwing’s advice on adding comments to regular expressions, the results aren’t too bad. Here’s the finished code – which of course is part of the downloadable source code itself.

using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

namespace Chapter11.Queries
{
    /// <summary>
    /// The listings are scattered around .cs files within various directories.
    /// This class uses LINQ to find all classes with a suitable Description
    /// attribute, groups them by chapters and orders them by chapter and listing number.
    /// </summary>
    class DisplayListingsMap
    {
        static readonly Regex ListingPattern = new Regex(
            @"# First match the start of the attribute, up to the bit we're interested in
            [Description(""Listing 
            # The 'text' group is the whole of the description after Listing
            (?<text>
            # The 'chapter' group is the first set of digits in the description, before a dot
            (?<chapter>d+).
            # The chapter group is the second set of digits in the description
            (?<listing>d+)
            # After that we don't care - stop the 'text' group at the double quote
            [^""]*)
            # Now match the end of the attribute
            "")]",
            RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

        static void Main()
        {
            DirectoryInfo directory = new DirectoryInfo(@"........");

            var query = from file in directory.GetFiles("*.cs", SearchOption.AllDirectories)
                        let match = ListingPattern.Match(File.ReadAllText(file.FullName))
                        where match.Success
                        let Details = new
                        {
                            Text = match.Groups["text"].Value,
                            Chapter = int.Parse(match.Groups["chapter"].Value),
                            Listing = int.Parse(match.Groups["listing"].Value)
                        }
                        orderby Details.Chapter, Details.Listing
                        group new { File = file, Description=Details.Text } by Details.Chapter;

            foreach (var chapter in query)
            {
                Console.WriteLine("Chapter {0}", chapter.Key);
                foreach (var listing in chapter)
                {
                    Console.WriteLine("{0}: {1}", listing.Description, listing.File.Name);
                }
                Console.WriteLine();                
            }
        }
    }
}

Isn’t it cool? The regex works out the listing number (first x.y part only) and sorts on that, grouping by chapter – then we just display the results. There are other ways of skinning the same cat – such as grouping and then ordering “inside” and “outside” a chapter separately – but they’ll all boil down to the same sort of thing.

Book news

The book is coming along well, and here are a few snippets which may be of interest:

  • It’s now on Amazon
  • All the chapters and the appendix have been written and given a first set of edits
  • We’re going to “final review” stage soon – that doesn’t mean the text is being finalized just yet, but it means this is probably the last round of peer review
  • I’ve been putting together the downloadable source code (see next post for some fun)
  • I’m hoping that the next couple of chapters will turn up in MEAP soon
  • Daniel Moth very kindly let me plug it at the recent UK MVP Open Day
  • There’s another plug on the flyers I’ll be giving out promoting Iterative Training at TechEd Developer in Barcelona next week
  • Manning are doing a 25% discount when you buy LINQ in Action and C# in Depth together

The last point is particularly cool – it’s something I’ve been suggesting for a while, as the books complement each other very nicely.