Category Archives: C#

Non-nullable reference types

I suspect this has been done several times before, but on my way home this evening I considered what would be needed for the reverse of nullable value types – non-nullable reference types. I’ve had a play with an initial implementation in MiscUtil (not in the currently released version) and it basically boils down (after removing comments, equality etc) to this:

public struct NonNullable<T> where T : class
{
    private readonly T value;

    public NonNullable(T value)
    {
        if (value == null)
        {
            throw new ArgumentNullException("value");
        }
        this.value = value;
    }

    public T Value
    {
        get
        {
            if (value == null)
            {
                throw new NullReferenceException();
            }
            return value;
        }
    }

    public static implicit operator NonNullable<T>(T value)
    {
        return new NonNullable<T>(value);
    }

    public static implicit operator T(NonNullable<T> wrapper)
    {
        return wrapper.Value;
    }
}

Currently I’ve got both conversion operators as implicit, which isn’t quite as safe/obvious as it is with nullable value types, but it does make life simpler. Obviously this is experimental code more than anything else, so I may change my mind later.

The simple use case is something like this (along with testing error cases):

[Test]
public void Demo()
{
    SampleMethod("hello"); // No problems
    try
    {
        SampleMethod(null);
        Assert.Fail("Expected exception");
    }
    catch (ArgumentNullException)
    {
        // Expected
    }
    try
    {
        SampleMethod(new NonNullable<string>());
        Assert.Fail("Expected exception");
    }
    catch (NullReferenceException)
    {
        // Expected
    }
}

private static void SampleMethod(NonNullable<string> text)
{
    // Shouldn’t get here with usual conversions, but could do
    // through default construction. The conversion to string
    // will throw an exception anyway, so we’re guaranteed
    // that foo is non-null afterwards.
    string foo = text;
    Assert.IsNotNull(foo);
}

I don’t intend to use this within MiscUtil at the moment. Note that it’s a struct, which has a few implications:

  • It has the same memory footprint as the normal reference type. No GC pressure and no extra dereferencing. This is basically the main reason for making it a struct.
  • It ends up being boxed to a new object. Boo hiss – ideally the boxing conversion would box it to the wrapped reference, and you could “unbox” (not really unboxing, of course) from a non-null reference to an instance of the struct.
  • It’s possible to create instances wrapping null references, by calling the parameterless constructor or using the default value of fields etc. This is very annoying.

The good news is that it’s self-documenting and easy to use – the conversions will happen where required, doing appropriate checking (reasonably cheaply). The bad news is the lack of support from either the CLR (boxing behaviour above) or language (I’d love to specify string! text as the parameter in the sample method above.) I suspect that if Microsoft were to do this properly, they’d want to avoid all of the problems above – but I’m not sure how…
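
To make the boxing and default-construction annoyances concrete, here's a quick demonstration, compiled alongside the NonNullable<T> struct above:

using System;

class NonNullableGotchas
{
    static void Main()
    {
        NonNullable<string> wrapper = "hello";

        // Boxing gotcha: the struct gets boxed as a NonNullable<string>,
        // rather than being "unwrapped" to the underlying string reference.
        object boxed = wrapper;
        Console.WriteLine(boxed is string);              // False
        Console.WriteLine(boxed is NonNullable<string>); // True

        // Default-construction gotcha: array elements (and fields) get the
        // default value of the struct, which wraps a null reference.
        NonNullable<string>[] array = new NonNullable<string>[1];
        try
        {
            string oops = array[0]; // conversion to string throws
        }
        catch (NullReferenceException)
        {
            Console.WriteLine("Default instance wrapped a null reference");
        }
    }
}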

Anyway, just a little experiment, but an interesting one – any thoughts?

Formatting strings

A while ago I wrote an article about StringBuilder and a reader mailed me to ask about the efficiency of using String.Format instead. This reminded me of a bone I have to pick with the BCL.

Whenever we make a call to String.Format, it has to parse the format string. That doesn’t sound too bad, but string formatting can be used a heck of a lot – and the format is almost always hard-coded in some way. It may be loaded from a resource file instead of being embedded directly in the source code, but it’s not going to change after the application has started.

I put together a very crude benchmark which joins two strings together, separating them with just a space. The test uses String.Format first, and then concatenation. (I’ve tried it both ways round, however, and the results are the same.)

using System;
using System.Diagnostics;

public static class Test
{
    const int Iterations=10000000;
    const int PieceSize=10;
   
    static void Main()
    {
        string first = GenerateRandomString();
        string second = GenerateRandomString();
        int total=0;
   
        Stopwatch sw = Stopwatch.StartNew();
        for (int i=0; i < Iterations; i++)
        {
            string x = String.Format("{0} {1}", first, second);
            total += x.Length;
        }
        sw.Stop();
        Console.WriteLine("Format: {0}", sw.ElapsedMilliseconds);
        GC.Collect();
       
        sw = Stopwatch.StartNew();
        for (int i=0; i < Iterations; i++)
        {
            // Equivalent to first + " " + second
            string x = String.Concat(first, " ", second);
            total += x.Length;
        }
        sw.Stop();
        Console.WriteLine("Concat: {0}", sw.ElapsedMilliseconds);
        if (total != Iterations * 2 * (PieceSize*2 + 1))
        {
            Console.WriteLine("Incorrect total: {0}", total);
        }
    }
   
    private static readonly Random rng = new Random();
    private static string GenerateRandomString()
    {
        char[] ret = new char[PieceSize];
        for (int j=0; j < ret.Length; j++)
        {
            ret[j] = (char) ('A' + rng.Next(26));
        }
        return new string(ret);
    }
}

And the results (on my very slow Eee)…

Format: 14807
Concat: 3567

As you can see, Format takes significantly longer than Concat. I strongly suspect that this is largely due to having to parse the format string on each iteration. That won’t be the whole of the cost – String needs to examine the format specifier for each string as well, in case there’s padding, etc – but again that could potentially be optimised.

I propose a FormatString class with a pair of Format methods, one of which takes a culture and one of which doesn’t. We could then hoist our format strings out of the code itself and make them static readonly FormatString variables. I’m not saying it would do a huge amount to aid performance, but it could shave off a little time here and there, as well as making it even more obvious what the format string is used for.
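
To be clear, no such class exists today – but here’s a rough sketch of the sort of thing I mean. It only copes with simple {n} placeholders (no alignment, format specifiers or escaped braces), which is enough to show the "parse once, format many times" idea:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical sketch only - not a real BCL class.
public sealed class FormatString
{
    // Literal pieces of text; between pieces[i] and pieces[i + 1] comes argument holes[i].
    private readonly string[] pieces;
    private readonly int[] holes;

    public FormatString(string format)
    {
        List<string> pieceList = new List<string>();
        List<int> holeList = new List<int>();
        int start = 0;
        int open;
        while ((open = format.IndexOf('{', start)) != -1)
        {
            int close = format.IndexOf('}', open);
            pieceList.Add(format.Substring(start, open - start));
            holeList.Add(int.Parse(format.Substring(open + 1, close - open - 1)));
            start = close + 1;
        }
        pieceList.Add(format.Substring(start));
        pieces = pieceList.ToArray();
        holes = holeList.ToArray();
    }

    public string Format(params object[] args)
    {
        return Format(CultureInfo.CurrentCulture, args);
    }

    public string Format(IFormatProvider provider, params object[] args)
    {
        // All the parsing has already been done - just stitch the result together.
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < holes.Length; i++)
        {
            builder.Append(pieces[i]);
            builder.Append(Convert.ToString(args[holes[i]], provider));
        }
        builder.Append(pieces[pieces.Length - 1]);
        return builder.ToString();
    }
}

The format string in the benchmark could then be hoisted into a field such as static readonly FormatString Joiner = new FormatString("{0} {1}"); and reused with Joiner.Format(first, second), without any per-call parsing.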

Copenhagen talk on C# – what do you want to hear about?

I’ve created a Google moderator page for the C# talk I’ll be giving in Copenhagen. I don’t know whether there will be internet access at the event itself (for people to create and vote up/down questions during the talk) but at least as there’s a month before the event, people can ask questions now and I’ll do my best to make sure I answer them.

If you haven’t looked at Google moderator yet, it’s a very handy way of keeping track of questions during lectures etc. It’s almost a shame that people don’t tend to have laptops and internet access in church – it would be very handy to be able to add questions for the preacher during the sermon :)

Book Review: Programming C# 3.0 by Jesse Liberty and Donald Xie

Resources

Disclaimer

One reader commented that a previous book review was too full of “this is only my personal opinion” and other such disclaimers. I think it’s still important to declare the situation, but I can see how it can get annoying if done throughout the review. So instead, I’ve lumped everything together here. Please bear these points in mind while reading the whole review:

  • Obviously this book competes with C# in Depth, although probably not very much.
  • I was somewhat prejudiced against the book by seeing that the sole 5-star review for it on Amazon was by Jesse Liberty himself. Yes, he wanted to explain why he wrote the book and why he’s proud of it, but giving yourself a review isn’t the right way to go about it.
  • I’ve seen a previous edition of the book (for C# 2.0) and been unimpressed at the coverage of some of the new features.
  • I’m a nut for technical accuracy, particularly when it comes to terminology. More on this later, but if you don’t mind reading (and then presumably using) incorrect terminology, you’re likely to have a lot better time with this book than I did.
  • I suspect I have higher expectations for established, prolific authors such as Jesse Liberty than for newcomers to the world of writing.
  • I’m really not the target market for this book.

Okay, with all that out of the way, let’s get cracking.

Contents and target audience

According to the preface, Programming C# 3.0 (PC# from now on) is for people learning C# for the first time, or brushing up on it. There’s an expectation that you probably already know another language – it wouldn’t be impossible to learn C# from the book without any prior development experience, but the preface explicitly acknowledges that it would be reasonably tough. That’s a fair comment – probably fair for any book, in fact. I have yet to read anything which made me think it would be a wonderful way to teach someone to program from absolute scratch. Likewise the preface recommends C# 3.0 in a Nutshell for a more detailed look at the language, for more expert readers. Again, that’s reasonable – it’s clearly not aiming to go into the same level of depth as Accelerated C# 2008 or C# in Depth.

The book is split into 4 parts:

  • The C# language: pretty much what you’d expect, except that not all of the language coverage is in this part (most of the new features of C# 3.0 are in the second part) and some non-language coverage is included (regular expressions and collections) – about 270 pages
  • C# and Data: LINQ, XML (the DOM API and a bit of LINQ to XML), database access (ADO.NET and LINQ to SQL) – about 100 pages
  • Programming with C#: introductions to ASP.NET, WPF and Windows Forms – about 85 pages
  • The CLR and the .NET Framework: attributes, reflection, threading, I/O and interop – about 110 pages

As you can tell, the bulk of it is in the language part, which is fine by me and reflects the title accurately. I’ll focus on that part of the book in this review, and the first chapter of part 2, which deals with the LINQ parts of C# 3.0. To be honest, I don’t think the rest of the book actually adds much value, simply because they skim over the surface of their topics so lightly. Part 3 would make a reasonable series of videos – and indeed that’s how it’s written, basically in the style of “Open Visual Studio, start a new WinForms project, now drag a control over here” etc. I’ve never been fond of that style for a book, although it works well in screencasts.

The non-LINQ database and XML chapters in part 2 seemed relatively pointless too – I got the feeling that they’d been present in older editions and so had just stayed in by default. With the extra space available from cutting these, a much better job could have been done on LINQ to SQL and LINQ to XML. The latter gets particularly short-changed in PC#, with a mere 4 pages devoted to it! (C# in Depth is much less of a “libraries” book but I still found over 6 pages to devote to it. Not a lot, I’ll grant you.)

Part 4 has potential, and is more useful than the previous parts – reflection, threading, IO and interop are all important topics (although I’d probably drop interop in favour of internationalization or something similar) – but they’re just not handled terribly well. The threading chapter talks about using lock or Monitor, but never states that lock is just shorthand for try/finally blocks which use Monitor; no mention is made of the memory model or volatility; aborting threads is demonstrated but not warned about; the examples always lock on this without explaining that it’s generally thought to be a bad idea. The IO chapter uses TextReader (usually via StreamReader) but never mentions the crucial topic of character encodings (it uses Encoding.ASCII but without really explaining it) – and most damning of all, as far as I can tell there’s not a single using statement in the entire chapter. There are calls to Close() at the end of each example, and there’s a very brief mention saying that you should always explicitly close streams – but without saying that you should use a using statement or try/finally for this purpose.
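
For reference, these are the two expansions I’d have liked to see spelled out. This is only a rough sketch – the code the compiler actually generates differs slightly between versions, and data.txt is just a made-up file name:

using System;
using System.IO;
using System.Threading;

class ExpansionSketch
{
    static readonly object padlock = new object();
    static int count;

    static void IncrementWithLock()
    {
        lock (padlock)
        {
            count++;
        }
    }

    // Roughly what the compiler turns the lock statement into.
    static void IncrementExpanded()
    {
        Monitor.Enter(padlock);
        try
        {
            count++;
        }
        finally
        {
            Monitor.Exit(padlock);
        }
    }

    // Similarly, a using statement is shorthand for try/finally disposal -
    // which is why explicit Close() calls alone aren't enough when an
    // exception is thrown part way through.
    static string ReadFirstLine()
    {
        using (TextReader reader = new StreamReader("data.txt"))
        {
            return reader.ReadLine();
        }
    }

    static string ReadFirstLineExpanded()
    {
        TextReader reader = new StreamReader("data.txt");
        try
        {
            return reader.ReadLine();
        }
        finally
        {
            if (reader != null)
            {
                reader.Dispose();
            }
        }
    }
}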

Okay, enough on those non-language topics – let’s look at the bulk of the book, which is about the language.

Language coverage

PC# starts from scratch, so it’s got the whole language to cover in about 300 pages. It would be unreasonable to expect it to provide as much attention to detail as C# in Depth, which (for the most part) only looks at the new features of C# 2.0 and 3.0. (On the other hand, if the remaining 260 pages had been given to the language as well, a lot more ground could have been covered.) It’s also worth bearing in mind that the book is not aimed at confident/competent C# developers – it’s written for newcomers, and delving into tricky issues like generic variance would be plain mean. However, I’m still not impressed with what’s been left out:

  • There’s no mention of nullable types as far as I can tell – indeed, the list of operators omits the null-coalescing operator (??).
  • Generics are really only talked about in the context of collections – despite the fact that to understand any LINQ documentation, you really will need to understand generic delegates. Generic constraints are likewise only mentioned in the context of collections, and only what I call a “derivation type constraint” (e.g. T : IComparable<T>) (as far as I can tell the spec doesn’t give this a name). There’s no coverage of default(T) – although the “default value of a type” is mentioned elsewhere, with an incorrect explanation.
  • Collection initializers aren’t explained as far as I can tell, although I seem to recall seeing one in an example. They’re not mentioned in the index.
  • Iterator blocks (and the yield contextual keyword) are likewise absent from the index, although there’s definitely one example of yield return when IEnumerable<T> is covered. The coverage given is minimal, with no mention of the completely different way that this executes compared with normal methods.
  • Query expression coverage is limited: although from, where, select, orderby, join and group are covered, there’s no mention of let, the difference between join and join ... into, explicitly typed range variables, or query continuations. The translation process isn’t really explained clearly, and the text pretty much states that it will always use extension methods.
  • Expression trees aren’t referenced to my knowledge; there’s one piece of text which attempts to mention them but just calls them “expressions” – which are of course entirely different. We’ll come onto terminology in a minute.
  • Only the simplest (and admittedly most common by a huge margin) form of using directives is shown – no extern aliases, no namespace aliases, not even using Foo = System.Console;
  • Partial methods aren’t mentioned.
  • Implicitly typed arrays aren’t covered.
  • Static classes may be mentioned in passing (not sure) but not really explained.
  • Object initializers are shown in one form only, ditto anonymous object initializer expressions.
  • Only field-like events are shown. The authors spend several pages on an example of bad code which just has a public delegate variable, and then try to blame delegates for the problem (which is really having a public variable). The solution is (of course) to use an event, but there’s little to no explanation of the nature of events as pairs of methods, a bit like properties but with subscribe/unsubscribe behaviour instead of data fetch/mutate (see the sketch just after this list).
  • Anonymous methods and lambda expressions are covered, but with very little text about the closure aspect of them. This is about it: “[…] and the anonymous method has access to the variables in the scope in which they are defined:” (followed by an example which doesn’t demonstrate the use of such variables at all).
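
To illustrate the “pairs of methods” point from the events bullet, here’s a short sketch contrasting a field-like event with explicitly declared add/remove accessors. The publisher classes are hypothetical, purely for illustration:

using System;

class FieldLikePublisher
{
    // Field-like event: the compiler generates the backing field and the add/remove pair.
    public event EventHandler SecondChanged;

    public void RaiseSecondChanged()
    {
        EventHandler handler = SecondChanged;
        if (handler != null)
        {
            handler(this, EventArgs.Empty);
        }
    }
}

class ExplicitPublisher
{
    // The backing delegate field stays private...
    private EventHandler secondChanged;

    // ...and the event exposes only a subscribe/unsubscribe pair,
    // much as a property exposes a get/set pair.
    public event EventHandler SecondChanged
    {
        add { secondChanged += value; }
        remove { secondChanged -= value; }
    }

    public void RaiseSecondChanged()
    {
        EventHandler handler = secondChanged;
        if (handler != null)
        {
            handler(this, EventArgs.Empty);
        }
    }
}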

I suspect there’s more, but you get the general gist. I’m not saying that all of these should have been covered and in great detail, but really – no mention of nullable types at all? Is it really more appropriate in a supposed language book to spend several pages building an asynchronous file server than to actually list all the operators accurately?

Okay, I’m clearly beginning to rant by now. The limited coverage is annoying, but it’s not that bad. Yes, I think the poor/missing coverage of generics and nullable types is a real problem, but it’s not enough to get me really cross. It’s the massive abuse of terminology which winds me up.

Accuracy

I’ll say this for PC# – if you ignore the terminology abuse, it’s mostly accurate. There are definitely “saying something incorrect” issues (e.g. an implication that ref/out can only be used with value type parameters; the statement that reference types in an array aren’t initialized to their default value (they are – the default value is null); the claim that extension methods can only access public members of target types (they have the same access as normal – so if the extension method is in the same assembly as the target type, for instance, it could access internal members)) but the biggest problem is that of terminology – along with sloppy code, including its formatting.

The authors confuse objects, values, variables, expressions, parameters, arguments and all kinds of other things. These have well-defined meanings, and they’re there for a reason. They do have footnotes explaining that they’re deliberately using the wrong terminology – but that doesn’t make it any better. Here are the three footnotes, and my responses to them:

The terms argument and parameter are often used interchangably, though some programmers insist on differentiating between the parameter declaration and the arguments passed in when the method is invoked.

Just because others abuse terms doesn’t mean it’s right for a book to do so. It’s not that programmers insist on differentiating between the two – the specification does. Now, to lighten things up a bit I’ll acknowledge that this one isn’t always easy to deal with. There are plenty of times where I’ve tried really hard to use the right term and just not ended up with a satisfactory bit of wording. However, at least I’ve tried – and where it’s easy, I’ve done the right thing. I wish the authors had the same attitude. (They do the same with the conditional operator, calling it “the ternary operator”. It’s a ternary operator. Having three operands is part of its nature – it’s not a description of its behaviour. Again, lots of other people get this wrong. Perhaps if all books got it right, more developers would too.) Next up:

Throughout this book, I use the term object to refer to reference and value types. There is some debate in the fact that Microsoft has implemented the value types as though they inherited from the root class Object (and thus, you may call all of Object’s methods on any value type, including the built-in types such as int.)

To me, this pretty much reads as “I’m being sloppy, but I’ve got half an excuse.” It’s true that the C# specification isn’t clear on this point – although the CLI spec is crystal clear. Personally, it just feels wrong to talk about the value 5 as an object. It’s an object when it’s boxed, of course (and if you call any Object methods on a value type which haven’t been overridden by that type, it gets boxed at that point) but otherwise I really don’t think of it as an object. An instance of the type, yes – but not an object. So yes, I’ll acknowledge that there’s a little wiggle room here – but I believe it’s going to confuse readers more than it helps them.

It’s the “confusing readers more than it helps them” part which is important. I’m not above a little bit of shortcutting myself – in C# in Depth, I refer to automatically implemented properties as “automatic properties” (after explicitly saying what I’m doing) and I refer to the versions of C# as 1, 2 and 3 instead of 1.0, 1.2, 2.0 and 3.0. In both these cases, I believe it adds to the readability of the book without giving any room for confusion. That’s very different from what’s going on in PC#, in my view. I’ve saved the most galling example of this for last:

As noted earlier, btnUpdate and btnDelete are actually variables that refer to the unnamed instances on the heap. For simplicity, we’ll refer to these as the names of the objects, keeping in mind that this is just short-hand for “the name of the variables that refer to the unnamed instances on the heap.”

This one’s the killer. It sounds relatively innocuous until you see the results. Things like this (from P63):

ListBox myListBox;  // instantiate a ListBox object

No, that code doesn’t instantiate anything. It declares a variable – and that’s all. The comment isn’t non-sensical – the idea of some code which does instantiate a ListBox object clearly makes sense – but it’s not what’s happening in this code (in C# – it would in C++, which makes it even more confusing). That’s just one example – the same awful sloppiness (which implies something completely incorrect) permeates the whole book. Time and time again we’re told about instances being created when they’re not. From P261:

The Clock class must then create an instance of this delegate, which it does on the following line:

public SecondChangeHandler SecondChanged;

Why do I care about this so much? Because I see the results of it on the newsgroups, constantly. How can I blame developers for failing to communicate properly about the problems they’re having if their source of learning is so sloppy and inaccurate? How can they get an accurate mental model of the language if they’re being told that objects are being instantiated when they’re not? Communication and a clear mental model are very important to me. They’re why I get riled up when people perpetuate myths about where structs “live” or how parameters are passed. PC# had me clenching my fists on a regular basis.

These are examples where the authors apparently knew they were abusing the terminology. There are other examples where I believe it’s a genuine mistake – calling anonymous methods “anonymous delegates” or “statements that evaluate to a value are called expressions” (statements are made up of expressions, and expressions don’t have to return a value). I can certainly sympathise with this. Quite where they got the idea that HTML was derived from “Structured Query Markup Language” I don’t know – the word “Query” should have been a red flag – but these things happen.

In other places the authors are just being sloppy without either declaring that they’re going to be, or just appearing to make typos. In particular, they’re bad at distinguishing between language, framework and runtime. For instance:

  • “C# combines the power and complexity of regular expression syntax […]” – no, C# itself neither knows nor cares about regular expressions. They’re in the framework.
  • (When talking about iterator blocks) “All the bookkeeping for keeping track of which element is next, resetting the iterator, and so forth is provided for you by the Framework.” – No, this time it is the C# compiler which is doing all the work. (It doesn’t support reset though.)
  • “Strings can also be created using verbatim string literals, which start the at (@) symbol. This tells the String constructor that the string should be used verbatim […]” – No, the String constructor doesn’t know about verbatim string literals. They’re handled by the C# compiler.
  • “The .NET CLR provides isolated storage to allow the application developer to store data on a per-user basis.” I very much doubt that the CLR code has any idea about this. I expect it to be in the framework libraries.

Again, if books don’t get this right, how do we expect developers to distinguish between the three? Admittedly sometimes it can be tricky to decide where responsibility lies – but there are plenty of clearcut cases where PC# is just wrong. I doubt that the authors really don’t know the difference – they just don’t seem to think it’s important to get it right.

Code

I’m mostly going to point out the shortcomings of the code, but on the plus side I believe almost all of it will basically work. There’s one point at which the authors have both a method and a variable with the same name (which is already in the unconfirmed errata) and a few other niggles, but they’re relatively rare. However:

  • The code frequently ignores naming conventions. Method and class names sometimes start with lower case, and there’s frequent use of horrible names beginning with “my” or “the”.
  • The authors often present several pages of code together, and then take them apart section by section. This isn’t the only book to do this by a long chalk, but I wonder – does anyone really benefit from having the whole thing in a big chunk? Isn’t it better to present small, self-contained examples?
  • As mentioned before, the uses of using statements are few and far between.
  • The whitespace is all over the place. The indentation level changes all the time, and sometimes there are outdents in the middle of blocks. Occasionally newlines have actually been missed out, and in other cases (particularly at the start of class bodies) there are two blank lines for no reason at all. (The latter is very odd in a book, where vertical whitespace is seen as extremely valuable.) Sometimes there’s excessive (to my mind) spacing. Just as an example (which is explicitly labelled as non-compiling code, so I’m not faulting it at all for that):
    using System.Console;
    class Hello
    {
        static void Main()
        {
        WriteLine("Hello World");
       }
    }

    I promise you that’s exactly how it appears in the book. Now this may have started out as a fault of the type-setter, but the authors should have picked it up before publication, IMO. I could understand there being a few issues like this (proof-reading code really is hard) but not nearly as many as there are.

  • There are examples of mutable structs (or rather, there’s at least one example), and no warning at all that mutable value types are a really, really bad idea.

Again, I don’t want to give the impression I’m an absolute perfectionist when it comes to code in a book. For the sake of keeping things simple, sometimes authors don’t seal types where they should, or make them immutable etc. I’m not really looking for production-ready code, and indeed I made this very point in one of the notes for C# in Depth. However, I draw the line at using statements, which are important and easy to get right without distracting the reader. Likewise giving variables good names – counter rather than ctr, and avoiding those the and my prefixes – makes a competent reader more comfortable and can transfer good habits to the novice via osmosis.

Writing style and content ordering

Time for some good news – when you look beyond the terminology, this is a really easy book to read. I don’t mean that everything in it is simplistic, but the style rarely gets in the way. It’s not dry, and some of the real-world analogies are very good. This may well be Jesse Liberty’s experience as a long-standing author making itself apparent.

In common with many O’Reilly books, there are two icons which usually signify something worth paying special attention to: a set of paw prints indicating a hint or tip, and a mantrap indicating a commonly encountered issue to be aware of. Given the rest of the review, I suspect you’d be surprised if I agreed with all of the points made in these extra notes – and indeed there are some issues – but most of them are good.

Likewise there are also notes for the sake of existing Java and C++ developers, which make sense and are useful.

I don’t agree with some of the choices made in terms of how and when to present some concepts. I found the way of explaining query expressions confusing, as it interleaved “here’s a new part of query expressions” with “here’s a new feature (e.g. anonymous types, extension methods).” It will come as no surprise to anyone who’s read C# in Depth that I prefer the approach of presenting all the building blocks first, and then showing how query expressions use all those features. There’s a note explaining why the authors have done what they’ve done, but I don’t buy it. One important thing with the “building blocks first” approach is to present a preliminary example or two, to give an idea of where we’re headed. I’ve forgotten to do that in the past (in a talk) and regretted it – but I don’t regret the overall way of tackling the topic.

On a slightly different note, I would have presented some of the earlier topics in a different order too. For instance, I regard structs and interfaces as more commonly used and fundamental topics than operator overloading. (While C# developers tend not to create their own structs often, they use them all the time. When was the last time you wrote a program without an int in it?) This is a minor nit – and one which readers may remember I also mentioned for Accelerated C# 2008.

There’s one final point I’d like to make, but which doesn’t really fit anywhere else – it’s about Jesse Liberty’s dedication. Most people dedicate books to friends, colleagues etc. Here’s Jesse’s:

This book is dedicated to those who come out, loud, and in your face and in the most inappropriate places. We will look back at this time and shake our heads in wonder. In 49 states, same-sex couples are denied the right to marry, though incarcerated felons are not. In 36 states, you can legally be denied housing just for being q-u-e-e-r. In more than half the states, there is no law protecting LGBT children from harassment in school, and the suicide rate among q-u-e-e-r teens is 400 percent higher than among straight kids. And, we are still kicking gay heroes out of the military despite the fact that the Israelis and our own NSA, CIA, and FBI are all successfully integrated. So yes, this dedication is to those of us who are out, full-time.

(I’ve had to spell out q-u-e-e-r as otherwise the blog software replaces it with asterisks. Grr.) I’m straight, but I support Jesse’s sentiment 100%. I can’t remember when I first started taking proper notice of the homophobia in the world, but it was probably at university. This dedication does nothing to help or hinder the reader with C#, but to my mind it still makes it a better book.

Conclusion

In short, I’m afraid I wouldn’t recommend Programming C# 3.0 to potential readers. There are much better books out there: ones which won’t make it harder for the reader to talk about their code with others, in particular. It’s not all bad by any means, but the mixture of sloppy use of terminology and poor printed code is enough of a problem to make me give a general thumbs down.

Next up will be CLR via C#, by Jeffrey Richter.

Response from Jesse Liberty

As normal, I mailed the author (in this case just Jesse Liberty – I confess I didn’t look for Donald Xie’s email address) and very promptly received a nice response. He asked me to add the following as his reaction:

I believe the book is very good for most real-world programmers and the publisher and I are dedicated to making the next revision a far better book, by correcting some of the problems you point out, and by beefing up the coverage of the newer features of the language.

Also as normal, I’ll be emailing Jesse with a list of the errors I found, so hopefully they can be corrected for the next edition.

Book review: Pro LINQ – Language Integrated Query in C# 2008, by Joe Rattz

I’m trying something slightly different this time. Joe (the author) has reacted to specific points of my review, and I think it makes sense to show those reactions. I’d originally hoped to present them so that you could toggle them on or off, but this blog server apparently wants to strip out scripts etc, so the comments are now permanently visible.

Resources

Introduction and disclaimer

As usual, I first need to give the disclaimer that as the author of a somewhat-competing book, I may be biased and certainly will have different criteria to most people. In this case the competition aspect is less direct than normal – this book is “LINQ with the C# 3 bits as necessary” whereas my book is “C# 2 and 3 with LINQ API where necessary”. However, it’s still perfectly possible that a potential reader may try to choose between the two books and buy just one. If you’re in that camp, I suggest you try to find an impartial opinion instead of trusting my review.

A second disclaimer is needed this time: I didn’t buy my copy of this book; it was sent to me by Apress at the request of Joe Rattz, specifically for review (and because Joe’s a nice guy). I hope readers of my other reviews will be confident that this won’t change the honest nature of the review; where there are mistakes or possible improvements, I’m happy to point them out.

Content, audience and overall approach

This book is simply aimed at existing C# developers who want to learn LINQ. There’s an assumption that you’re already reasonably confident in C# 2 – knowledge of generics is taken as read, for example – but there is brief coverage of using iterator blocks to return sequences. No prior experience of LINQ is required, but the LINQ to XML and LINQ to SQL sections assume (not unreasonably) that you already know XML and SQL.

The book is divided into five parts:

  • Introduction and C# 3.0 features (50 pages)
  • LINQ to Objects (130 pages)
  • LINQ to XML (152 pages)
  • LINQ to DataSet (42 pages)
  • LINQ to SQL (204 pages)

The approach to the subject matter changes somewhat through the book. Sometimes it’s a concept-by-concept “tutorial style” approach, but for most of the book (particularly the LINQ to Objects and LINQ to XML parts) it reads more like an API reference. Joe recommends that readers tackle the book from cover to cover, but that falls down a bit in the more reference-oriented sections.

[Joe] Early in the development of my book, a friend asked me if it was going to be a tutorial-style book or a reference book. I initially found the question odd because I never really viewed books as being exclusively one or the other. Perhaps I am different than most readers, but when I buy a programming book, I usually read a bit, start coding, and then refer to the book as a reference when needed. This is how I envision my book being used by readers and the type of book I would like for it to be. I see it as both a tutorial and a reference. I want it to be a book that gets used repeatedly, not read once and shelved. Some books work better for this than others. I rarely read a programming book cover to cover because I just don’t have time for that. I think ultimately, most authors write the book they would want to read, and that is what I did. I hope that if someone buys my book, in two years it will be tattered and worn from use as a reference, as well as read cover to cover.

I would disagree that the majority of the book reads like an API reference. Certainly, chapters 4 and 5 (deferred and nondeferred operators) work better as a reference because there isn’t a lot of connective context between the approximately 50 different standard query operators. At best it would be an eclectic tutorial with little continuity. So I decided to make those two chapters (the ones covering the standard query operators) function more like a reference. I knew that I (and hopefully my readers) would refer to it time and time again for information about the operators, and based on most of the reviews I have seen, this appears to have been a good choice. I know I refer to it myself quite frequently. I would not consider the chapters on LINQ to XML to be reference oriented although I could see why someone might feel they are. My discussion of LINQ to XML is tutorial based as I approach the different tasks a developer would need to accomplish when working with XML, such as how to construct XML, how to output XML, how to input XML, how to traverse XML, etc. However, within a task, like traversing XML, I do list the API calls and discuss them, so this is probably why it feels reference-like to some readers, and will function pretty well as a reference.

For example, take the ordering operators in LINQ – OrderBy, ThenBy, OrderByDescending and ThenByDescending. (Interestingly, one of the Amazon reviews picks up on the same example. I already had it in mind before reading that review.) These four LINQ to Objects operators take 15 pages to cover because every method overload is used, but a lot of it is effectively repeated between different examples. I think more depth could have been achieved in a shorter space by talking about the group as a whole – we only really need to see what happens when a custom comparison is used once, not four times – whereas every example of ThenBy/ThenByDescending used an identity projection, instead of showing how you can make the secondary ordering use some completely different projection (without necessarily using a custom comparer). Likewise I don’t remember seeing anything about tertiary orderings, or what the descending orderings tend to do with nulls, or emphasis on the fact that descending orderings aren’t just reversed ascending orderings (due to the stability of the sort – the stability was mentioned, but not this important corollary). Having an example for each overload is useful for a reference work, but not for a “read through from start to finish” book.
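
By way of example, this is roughly the single compact illustration I had in mind – primary, secondary and tertiary orderings, each using a different projection, with no custom comparer in sight. The sample type and data are made up, of course:

using System;
using System.Linq;

class OrderingDemo
{
    // Made-up sample type, purely for illustration.
    class Person
    {
        public string FirstName { get; set; }
        public string LastName { get; set; }
        public int Age { get; set; }
    }

    static void Main()
    {
        var people = new[]
        {
            new Person { FirstName = "Ann", LastName = "Jones", Age = 25 },
            new Person { FirstName = "Bob", LastName = "Smith", Age = 30 },
            new Person { FirstName = "Cat", LastName = "Smith", Age = 30 },
            new Person { FirstName = "Dan", LastName = "Smith", Age = 25 }
        };

        // Order by surname, then by age descending, then by first name -
        // three different projections, no custom comparer needed.
        var query = people.OrderBy(p => p.LastName)
                          .ThenByDescending(p => p.Age)
                          .ThenBy(p => p.FirstName);

        foreach (var person in query)
        {
            Console.WriteLine("{0} {1} ({2})", person.FirstName, person.LastName, person.Age);
        }
    }
}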

The set operators (Distinct, Except, Intersect, Union and SequenceEqual) as applied to DataSets suffer a similar problem – the five descriptions of why custom comparers are needed are all basically the same, and could be dealt with once. In particular, one paragraph is repeated verbatim for each operator. Again, that’s fine for a reference – but cutting and pasting like this makes for an irritating read when you see the exact same text several times in one reading session.

[Joe] A few readers have complained about some of the redundancies that you have pointed out, but I think most of the readers have appreciated my attempt to provide material for each operator/method. I think one of the words you will see most often in the Amazon reviews is “thorough”.

Now, it’s important that I don’t give the wrong impression here. This is certainly not just a reference book, and there’s enough introduction to topics to help readers along. If I’d been coming to C# 3 and LINQ without any other information, I think I’d have followed things, for the most part. (I’m not a fan of the way Joe presented the query expression translations, but I’m enormously pleased that he did it at all. I think I might have got lost at that point, which was unfortunately early in the book. It might have been better as just an appendix.) Anyone reading the book thoroughly should come away with a competent knowledge of LINQ and the ability to use it profitably. They may well be less comfortable with the new features of C# 3, as they’re only covered briefly – but that’s entirely appropriate given the title and target of the book. (To be blunt and selfish, I’m entirely in favour of books which leave room for more depth at a language level – that should be a good thing for sales of my own book!)

[Joe] Jon, if you only knew how difficult it was getting those query expression translations into the book. ;-) You can read in my acknowledgments where I specifically thank Katie Stence and her team for them. They were a very painful effort and in hindsight, I probably would not include them if I were to start the book from scratch. I agree with you that the translations are complex, as the book states. Perhaps the most important part of that section is when I state “Allow me to provide a word of warning. The soon to be described translation steps are quite complicated. Do not allow this to discourage you. You no more need to fully understand the translation steps to write LINQ queries than you need to know how the compiler translates the foreach statement to use it. They are here to provide additional translation information should you need it, which should be rarely, or never.”

However, I would personally have preferred to see a more conceptual approach which spent more time focused on getting the ideas through at a deep level and less time making sure that every overload was covered. After all, MSDN does a reasonable job as a reference – and the book’s web site could have contained an example for every overload if necessary without everything making it into print. The kind of thing I’d have liked to see explored more fully is the buffering vs streaming nature of data flow in LINQ. Some operators – Select and Where, for example – stream all their data through. They never look at more than one item of data at a time. Others (Reverse and OrderBy, for example) have to buffer up all the data in the sequence before yielding any of it. Still others use two sequences, and may buffer one sequence and stream the other – Join and Intersect work that way at the moment, although as we saw in my last blog post Intersect can be implemented in a way which streams both sequences (but still needs to keep a buffer of data it’s already seen). When you’re working with an infinite (or perhaps just very large – much bigger than memory) sequence you really need to be aware of this distinction, but it isn’t covered in Pro LINQ as far as I remember. In the interests of balance, I should point out that the difference between immediate and deferred execution is explained, repeatedly and clearly – including the semi-immediate execution which can occur sometimes in LINQ to SQL.

[Joe] I wanted my book to cover each overload because I can’t read MSDN in the bathroom, or when at the beach without an internet connection, or when curled up in a chair by the fireplace. I also wanted to provide examples for every method and overload because I find it frustrating when a book shows the simplest one and I have to figure out the one I need. Granted, depth could be added too, but you have to draw the line somewhere. Apress (at the time, not sure if this is still the plan) has the concept of three levels of book; Foundations, Pro, and Expert. I considered some information beyond the scope of the Pro level that my book is aimed at. The buffering versus streaming issue is an interesting one and would make an excellent additional column in Table 3-1, if I can get it to fit.

I’m unable to really judge the depth to which LINQ to SQL was explored, given that a lot of it was beyond my own initial knowledge (which is a good thing!). I’m slightly perturbed by the idea that it can be comprehensively tackled in a couple of hundred pages, whereas books on other ORMs are often much bigger and tackle topics such as session lifetimes and caching in much more depth. I suspect this is more due to the technologies than the writing here – LINQ to SQL is a relatively feature-poor ORM compared with, say, Hibernate – but a bit more attention to “here are options to consider when writing an application” would have been welcome.

Accuracy and code style

Most of Pro LINQ is pretty accurate. Joe is occasionally a bit off in terms of terminology, but that probably bothers most readers less than it bothers me. There are a few things which changed between the beta version of VS2008 against which the book was clearly developed and the release version, which affect the new features of C# 3. For instance, automatically implemented properties aren’t mentioned at all (and would have been much nicer to see in examples than public fields) and collection initializers are described with the old restrictions (the collection type has to implement ICollection<T>) rather than the new ones (the collection type has to implement IEnumerable and have appropriate Add methods). Other errors include trusting the documentation too much (witness the behaviour of Intersect) and an inconsistency (stating correctly that OrderBy is stable on one page, then incorrectly warning that it’s unstable on another). In my normal fashion, I’ll give Joe an exhaustive list of everything I’ve found and leave it up to him to see which he’d like to fix for the next printing, but overall Pro LINQ does pretty well. I suspect this may be partly due to covering a great deal of area but with relatively little depth and some repetition – Accelerated C# had a higher error rate, but was delving into more treacherous waters, for example.

[Joe] Since my book is not meant to be a C# 3.0 book, but rather a LINQ book, I only cover the new C# 3.0 features which were added to support LINQ. Since automatic properties were not one of those features, I do not cover them. You may notice that my chapter dedicated to the new C# 3.0 features is titled C# 3.0 Language Enhancements For LINQ. Just for your reader’s knowledge, the ordering is now specified to be stable. Initially it was unstable, and was later changed to be stable but I was told it would be specified to be unstable, but apparently at some point, the specification was changed to be stable. My book was updated but apparently I missed a spot.

Most of the advice given throughout the book is reasonable, although I take issue with one significant area. Joe recommends using the OfType operator instead of the Cast operator, because when a nongeneric collection contains the “wrong type of object,” OfType will silently skip it whereas Cast will throw an exception. I recommend using Cast for exactly the same reason! If I’ve got an object of an unexpected type in my collection, I want to know about it as soon as possible. Throwing an exception tells me what’s going on immediately, instead of hiding the problem. It’s usually the better behaviour, unless you explicitly have reason to believe that you will legitimately have objects of different types in the collection and you really want to only find objects of the specified type.

[Joe] Yes, I should have known better than to provide that advice (prefer OfType to Cast) without more explanation, more disclaimers, and more caveats. My preference would be to use Cast in development and debug built code for the exact reasons you mention, but to use OfType in production code. I would prefer my applications to handle unexpected data more gracefully in production than I would in development.
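
For anyone who hasn’t met the two operators, here’s a minimal sketch of the behavioural difference we’re debating:

using System;
using System.Collections;
using System.Linq;

class CastVsOfType
{
    static void Main()
    {
        // A nongeneric collection containing an element of the "wrong" type.
        ArrayList list = new ArrayList { "first", "second", 3, "fourth" };

        // OfType silently skips the unexpected element.
        foreach (string item in list.OfType<string>())
        {
            Console.WriteLine(item); // first, second, fourth
        }

        // Cast throws as soon as the unexpected element is reached.
        try
        {
            foreach (string item in list.Cast<string>())
            {
                Console.WriteLine(item);
            }
        }
        catch (InvalidCastException)
        {
            Console.WriteLine("Cast threw - the bad element wasn't silently ignored");
        }
    }
}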

As well as “headline” pieces of advice which are advertised right up to the table of contents, there are many hints and tips along the way, most of which really do add value. I believe they’d actually add more value if they weren’t sometimes buried within reference-like material – but as we’ve already seen, my personal preference is for a more narrative style of book anyway.

The code examples are in “snippet” form (i.e. without using directives, Main method declarations etc) but are complete aside from that. At the start of each chapter there’s a detailed list of which namespaces and references are involved, so there’s no guesswork required. In fact, I’d expect most of them to work in Snippy given an appropriate environment. Some examples are a bit longwinded – we only really need to see the 7 lines showing the list of presidents once or twice, not over and over again – but that’s a minor issue. Another niggle is Joe’s choices when it comes to a few bits of coding convention. There are various areas where we differ, but a few repeatedly bothered me: overuse (to my mind) of parentheses, “old-style” delegate creation (i.e. something.Click += new EventHandler(Foo) instead of just something.Click += Foo) and the explicit specification of type parameters on LINQ operators which don’t need them. Here’s one example which demonstrates the first and the last of these issues – as well as introducing an unnecessary cast:

// This is the code in the book (in listing 7-30)
XElement outOfPrintParticipant = xDocument
  .Element("BookParticipants")
  .Elements("BookParticipant")
  .Where(e => ((string)((XElement)e).Element("FirstName")) == "Joe"
           && ((string)((XElement)e).Element("LastName")) == "Rattz")
  .Single<XElement>();

// This is what I’d have preferred
XElement outOfPrintParticipant = xDocument
  .Element("BookParticipants")
  .Elements("BookParticipant")
  .Where(e => (string)e.Element("FirstName") == "Joe"
           && (string)e.Element("LastName") == "Rattz")
  .Single();

Check out the penultimate line of the original – a whopping 5 opening brackets and 6 closing ones. This issue looks even worse to me when it’s used to make return and throw look like method calls:

// From GetStringFromDb (P388)
throw (new Exception(
            String.Format("Unexpected exception executing query [{0}].", sqlQuery)));

// (Insert more code here) – same listing

return (result);

These just look odd and wrong. Of course they’re perfectly valid, but not pleasant to read in my view. On a more minor matter, Joe tends to close SQL connections, commands etc with an explicit try/finally block instead of the more idiomatic (to my mind) using statement, but again that probably bothers me more than others.

The source code is all available on the web site, and it’s easy to find each listing. (The zip file is about 10 times larger than it needs to be because it contains all the bin/obj directories with all the compiled code in rather than just the source, but that’s a tiny niggle.)

Writing style

Joe’s writing style is very informal – or at least, while most of the text is in “normal” formal prose, there are plenty of informal pieces of writing there too. As readers of my book will know, I’m much the same – I try to keep things from getting too dry, despite that being the natural state for technical teaching. I have no idea how well I succeed for most readers, but Joe certainly manages. He occasionally takes it a little too far for my personal taste, usually around listing outputs. They’re often introduced as if Joe didn’t really know what the output would be, with a kind of “wow, it worked, who’d have thought?” comment afterwards. I suspect I’ve got some of this in my book too, but Joe lays it on a little too thickly for my liking. I don’t know whether it would be fairer to present a “medium-level” example of this rather than one which really grated, but this is the one (from page 257) which made such an impression that I remembered it over 300 pages later:

This should output the language attribute. Let’s see:


language="English"

Groovy! I have never actually written the word groovy before. I had to let the spelling checker spell it for me.

Now, I really want to stress that that’s a “worst case” rather than the average case, and indeed many listings don’t have anything “cutesy” about them. I just wanted to give an example of the kind of thing that didn’t work for me.

[Joe] Let me see if I get this straight. So you are saying you got to learn something about LINQ and how to spell groovy, and it stuck for over 300 pages and you are upset? Man, you know how to spell groovy now, what’s the problem? 8-D Would it annoy you less if I told you that is a reference to Austin Powers? My book is riddled with references to movies and TV shows, and that one is for Austin Powers. Maybe you didn’t catch that, or maybe you don’t like Austin Powers, or maybe you just still don’t like it. One reader was irritated when I said “Dude, Sweet” because he didn’t recognize that as a reference to Dude, Where’s My Car. I have references to Office Space, Arrested Development, Bottle Rocket, Seinfeld, The Matrix, Wargames, Tron, etc. In fact, on page 455, I actually use the word “moo” instead of “moot” in reference to Friends. My copy editor actually corrected that for me, but once I explained it, she let me have it back. So if you see something goofy, like “groovy” just know it is a reference to something and begin your investigation in your spare time. And if you see an error, it is intentional to make sure you are paying attention. ;-) As you have already pointed out, technical writing can be dry. I made an effort to inject humor into the book in the form of references to pop culture, most specifically movies and television. Sometimes the reference is in a comment like “groovy”, and sometimes it’s in the sample data like a character’s name. Like any comedian, every joke or reference can’t be a hit with everyone. I will say though that I have heard more from those that recognized the references and appreciated them (which helps carry a reader through the lesser interesting parts) than I have from those that found them annoying.

What really did work was including hints and tips which explicitly said where Joe had received unexpected results with slightly different code. If anything is unexpected to the author, it may well be unexpected to readers too, so I really appreciated reading that sort of thing. (It would be wearing if Joe were stupid and expected all kinds of silly results, but that’s not the case at all.)

Conclusion

Pro LINQ is a good book. It has enough niggles to keep me from using superlatives about it, but it’s good nonetheless. It’s Joe’s first book (just like C# in Depth is the first one I can truly call “mine”) and I hope he writes more. Having read it from cover to cover, I think it’ll be more useful as a reference for individual methods (when MSDN doesn’t quite cut it) than to reread whole chapters, but that’s not a problem. My slight complaints above certainly don’t stop it from being a book I’m pleased to own.

[Joe] I’ll take it as a compliment that you think my book would be useful for those times that MSDN isn’t good enough!

This is the first LINQ book I’ve reviewed – I already have LINQ in Action, which is also on the list to review at some point. (I’ve read large chunks of it in soft copy, but I haven’t been through the finished hard copy yet.) It will be interesting to see how the two compare. Next up will probably be “Programming C# 3.0” by Jesse Liberty, however.

Logging enumeration flow

I’m currently reading Pro LINQ: Language Integrated Query in C# 2008 by Joe Rattz and yesterday I came across a claim about Enumerable.Intersect which didn’t quite ring true. I consulted MSDN and the documentation is exactly the same as the book. Here’s what it says:

When the object returned by this method is enumerated, Intersect enumerates first, collecting all distinct elements of that sequence. It then enumerates second, marking those elements that occur in both sequences. Finally, the marked elements are yielded in the order in which they were collected.

(first is the first parameter, the one which the method appears to be called on when using it as an extension method. second is the second parameter – the other sequence involved.)

This seems to be needlessly restrictive. In particular, it doesn’t allow you to work with an infinite sequence on either side. It also means loading the whole of both sequences into memory at the same time. Given the way that Join works, I was surprised to see this. So I thought I’d test it. This raised the question of how you trace the flow of a sequence – how do you know when data is being pulled from it? The obvious answer is to create a new sequence which fetches from the old one, logging as it goes. Fortunately this is really easy to implement:

using System;
using System.Collections.Generic;

public static class Extensions
{
    public static IEnumerable<T> WithLogging<T>(this IEnumerable<T> source,
                                                string name)
    {
        foreach (T element in source)
        {
            Console.WriteLine("{0}: {1}", name, element);
            yield return element;
        }
    }
}

We keep a name for the sequence so we can easily trace which sequence is being pulled from at what point. Now let’s apply this logging to a call to Intersect:

using System;
using System.Linq;
using System.Collections.Generic;

// Compile alongside the Extensions class

class Test
{
    static void Main()
    {
        var first = Enumerable.Range(1, 5).WithLogging("first");
        var second = Enumerable.Range(3, 5).WithLogging("second");
        foreach (int i in first.Intersect(second))
        {
            Console.WriteLine("Intersect: {0}", i);
        }
    }
}

As you can see, we’re intersecting the numbers 1-5 with the numbers 3-7 – the intersection should clearly be 3-5. We’ll see a line of output each time data is pulled from either first or second, and also when the result of Intersect yields a value. Given the documentation and the book, one would expect to see this output:

// Theoretical output. It doesn’t really do this
first: 1
first: 2
first: 3
first: 4
first: 5
second: 3
second: 4
second: 5
second: 6
second: 7
Intersect: 3
Intersect: 4
Intersect: 5

Fortunately, it actually works exactly how I’d expect: the second sequence is evaluated fully, then the first is evaluated in a streaming fashion, with results being yielded as they’re found. (This means that, if you’re sufficiently careful with the result, e.g. by calling Take with a suitably small value, the first sequence can be infinite.) Here’s the actual output demonstrating that:

// Actual output.
second: 3
second: 4
second: 5
second: 6
second: 7
first: 1
first: 2
first: 3
Intersect: 3
first: 4
Intersect: 4
first: 5
Intersect: 5
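
To show the practical upshot of that ordering, here’s a quick extra example (reusing the WithLogging extension above) which intersects an infinite first sequence with a small second one. The Naturals method is purely illustrative – the point is that the whole thing terminates only because Take stops pulling results once it has enough.

using System;
using System.Collections.Generic;
using System.Linq;

// Compile alongside the Extensions class
class InfiniteDemo
{
    // An unbounded sequence of non-negative integers
    static IEnumerable<int> Naturals()
    {
        for (int i = 0; ; i++)
        {
            yield return i;
        }
    }

    static void Main()
    {
        var first = Naturals().WithLogging("first");
        var second = new[] { 2, 5 }.WithLogging("second");

        // Intersect buffers "second" (which is small) and then streams "first",
        // so Take(2) stops the enumeration once 2 and 5 have been found.
        foreach (int i in first.Intersect(second).Take(2))
        {
            Console.WriteLine("Intersect: {0}", i);
        }
    }
}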

Initial Conclusion

There are two interesting points here, to my mind. The first is demonstrating that the documentation for Intersect is wrong – the real code is more sensible than the docs. The second – and more important – is seeing how easy it is to log the flow of sequence data: as simple as adding a single extension method and calling it. (You could do it with a Select projection which writes the data and then yields the value, of course, but I think this is neater.)
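
For the curious, here’s a minimal sketch of how an Intersect with the observed behaviour might be written. This is just my guess at the general shape – it’s not the actual BCL source, and the real thing no doubt differs in detail (argument names, comparer support and so on).

using System;
using System.Collections.Generic;

public static class IntersectSketch
{
    // Buffers "second" into a set, then streams "first", yielding each distinct
    // element which also appears in "second", in the order "first" produces them.
    public static IEnumerable<T> StreamingIntersect<T>(this IEnumerable<T> first,
                                                       IEnumerable<T> second)
    {
        // Validate eagerly, before any deferred execution kicks in
        if (first == null)
        {
            throw new ArgumentNullException("first");
        }
        if (second == null)
        {
            throw new ArgumentNullException("second");
        }
        return StreamingIntersectImpl(first, second);
    }

    private static IEnumerable<T> StreamingIntersectImpl<T>(IEnumerable<T> first,
                                                            IEnumerable<T> second)
    {
        var candidates = new HashSet<T>(second);
        foreach (T element in first)
        {
            // Remove ensures each matching element is only yielded once
            if (candidates.Remove(element))
            {
                yield return element;
            }
        }
    }
}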

I’m hoping to finish reading Joe’s book this week, and write the review over the weekend, by the way.

Update (Sept. 11th 2008)

Frederik Siekmann replied to this post with a thrilling alternative implementation to stream the intersection, which takes alternating elements from the two sequences involved. It’s a bit more memory hungry (with three sets of elements to remember instead of just one) but it means that we can deal with two infinite streams, if we’re careful. Here’s a complete example:

using System;
using System.Collections.Generic;
using System.Linq;

public static class Extensions
{
    public static IEnumerable<T> AlternateIntersect<T>(this IEnumerable<T> first, IEnumerable<T> second)
    {
        var intersection = new HashSet<T>();
        var firstSet = new HashSet<T>();
        var secondSet = new HashSet<T>();
        using (IEnumerator<T> firstEnumerator = first.GetEnumerator())
        {
            using (IEnumerator<T> secondEnumerator = second.GetEnumerator())
            {
                bool firstHasValues = firstEnumerator.MoveNext();
                bool secondHasValues = secondEnumerator.MoveNext();
                while (firstHasValues && secondHasValues)
                {
                    T currentFirst = firstEnumerator.Current;
                    T currentSecond = secondEnumerator.Current;
                   
                    if (!intersection.Contains(currentFirst) &&
                        secondSet.Contains(currentFirst))
                    {
                        intersection.Add(currentFirst);
                        yield return currentFirst;
                    }
                    firstSet.Add(currentFirst);
                    if (!intersection.Contains(currentSecond) &&
                        firstSet.Contains(currentSecond))
                    {
                        intersection.Add(currentSecond);
                        yield return currentSecond;
                    }
                    secondSet.Add(currentSecond);
                    firstHasValues = firstEnumerator.MoveNext();
                    secondHasValues = secondEnumerator.MoveNext();
                }
                if (firstHasValues)
                {
                    do
                    {
                        T currentFirst = firstEnumerator.Current;
                        if (!intersection.Contains(currentFirst) &&
                            secondSet.Contains(currentFirst))
                        {
                            intersection.Add(currentFirst);
                            yield return currentFirst;
                        }
                    } while (firstEnumerator.MoveNext());
                }
                if (secondHasValues)
                {
                    do
                    {
                        T currentSecond = secondEnumerator.Current;
                        if (!intersection.Contains(currentSecond) &&
                            firstSet.Contains(currentSecond))
                        {
                            intersection.Add(currentSecond);
                            yield return currentSecond;
                        }
                    } while (secondEnumerator.MoveNext());
                }
            }
        }
    }
   
    public static IEnumerable<T> WithLogging<T>(this IEnumerable<T> source, string name)
    {
        foreach (T element in source)
        {
            Console.WriteLine(string.Format("{0}: {1}", name, element));
            yield return element;
        }
    }
}

class Test
{
    static void Main()
    {
        var positiveIntegers = Enumerable.Range(0, int.MaxValue);
       
        var multiplesOfTwo = positiveIntegers.Where(x => (x%2) == 0)
                                             .WithLogging("Twos");
        var multiplesOfThree = positiveIntegers.Where(x => (x%3) == 0)
                                               .WithLogging("Threes");
   
        foreach (int x in multiplesOfTwo.AlternateIntersect(multiplesOfThree).Take(10))
        {
            Console.WriteLine("AlternateIntersect: {0}", x);
        }
    }
}

Most of the code is the alternating intersection – the test at the end just shows the intersection of the sequence 0, 2, 4, 6… with 0, 3, 6, 9… The output shows elements being taken from both sequences, and a value being yielded when a match is found. It’s important that we limit the output in some way – in the above code we call Take(10), but anything which prevents the loop from just executing until we run out of memory is fine.

Twos: 0
Threes: 0
AlternateIntersect: 0
Twos: 2
Threes: 3
Twos: 4
Threes: 6
Twos: 6
Threes: 9
AlternateIntersect: 6
Twos: 8
Threes: 12
Twos: 10
Threes: 15
Twos: 12
Threes: 18
AlternateIntersect: 12
Twos: 14
Threes: 21
Twos: 16
Threes: 24
Twos: 18
Threes: 27
AlternateIntersect: 18
(etc)

That’s really neat. Quite how often it’ll be useful is a different matter, but I find this kind of thing fascinating to consider. Thanks Frederik!

Developer Developer Developer day 7: voting now open

Every so often, Microsoft in Reading (in the UK) hosts a “Developer Developer Developer” day where members of the community present talks on all kinds of topics. There are no Microsoft speakers – it’s a big community event.

The next one is on November 22nd. Applications to speak have now been closed, and voting (for attendees to choose which sessions they’d like to hear) is now open. I’ve applied to present on three different topics: Protocol Buffers (with Marc Gravell), implementing LINQ to Objects in 60 minutes, and Push LINQ. There are far more potential sessions than time available, so I may well not end up presenting at all, but there’s a great range of subjects on offer.

If you’re in the Reading area, please come along – it should be a fab event!

Data Structures and Algorithms: new free eBook available (first draft)

I’ve been looking at this for a while: Data Structures and Algorithms: Annotated reference with examples. It’s only in “first draft” stage at the moment, but the authors would love your feedback (as would I). Somehow I’ve managed to end up as the editor and proof-reader, although due to my holiday the version currently available doesn’t have many of my edits in. It’s a non-academic data structures and algorithms book, intended (as I see it, anyway) as a good starting point for those who know that they ought to be more aware of the data structures they use every day (lists, heaps etc) but don’t have an academic background in computer science.

An implementation will be available in both Java and C#, I believe.

Core .NET refcard now available

Same drill as before – and same registration requirements (although if you registered before, you don’t need to register again). The Core .NET refcard touches on some of the topics which I personally need to refer to MSDN for most often. It covers:

  • Common .NET types, aliases and sizes
  • String literals and escape sequences
  • Format strings (general, numeric, date/time)
  • Working with dates and times
  • Text encodings
  • Threading
  • Using the new features of C# 3.0 / VB 9.0 in .NET 2.0 projects

It’s only 6 pages long, which should give you an idea of the depth of coverage on each of these topics, but it’s meant to be handy reference material.

Enjoy!

Lessons learned from Protocol Buffers, part 4: static interfaces

Warning: During this entire post, I will use the word static to mean “relating to a type instead of an instance”. This isn’t a strictly accurate use but I believe it’s what most developers actually think of when they hear the word.

A few members of the interfaces in Protocol Buffers have no logical reason to act on instances of their types. The message interface has members to return the message’s type descriptor (the PB equivalent of System.Type), the default instance for the message type, and a builder for the message type. The builder interface copies the first two of these, and also has a method to create a builder for a particular field. None of these touch any instance data.

In most cases this doesn’t actually cause any difficulties – we usually have an instance available when we’re in the PB library code, and the generated types have static properties for the default instance and the type descriptor anyway. Even so, it feels messy to have interface members which rely only on the type of the implementation and not on any of the actual data of the instance.

I’ve wondered before now about the possibility of having static members in interfaces – usually when thinking about plug-in architectures – but there’s always been the problem of working out how to specify the type on which to call the members. Variables and other expressions usually refer to values rather than types, and System.Type doesn’t help as it provides no compile-time knowledge of the type being referred to.

There’s one big exception to this, however: generic type parameters. I don’t know why it had never occurred to me before, but this is a great fit for the ability to safely call static methods on types which are unknown at compile-time. Furthermore, it could provide a great way of enforcing the presence of constructors with appropriate signatures, and even operators. Before I get too far ahead of myself, let’s tie the simple case to a concrete example.

Creating builders from nothing

In my previous post I gave an example of a method which ideally wanted to return a new message given a CodedInputStream and an ExtensionRegistry (both types within Protocol Buffers, the details of which are unimportant to this example). The current code looks like this:

private static TMessage BuildImpl<TMessage2, TBuilder> (Func<TBuilder> builderBuilder,
                                                        CodedInputStream input,
                                                        ExtensionRegistry registry)
    where TBuilder : IBuilder<TMessage2, TBuilder>
    where TMessage2 : TMessage, IMessage<TMessage2, TBuilder>
{
    TBuilder builder = builderBuilder();
    input.ReadMessage(builder, registry);
    return builder.Build();
}

For the purposes of this discussion I’ll simplify it a little, making it a generic method in a non-generic type.

private static TMessage BuildImpl<TMessage, TBuilder> (Func<TBuilder> builderBuilder,
                                                       CodedInputStream input,
                                                       ExtensionRegistry registry)
    where TBuilder : IBuilder<TMessage, TBuilder>
    where TMessage : IMessage<TMessage, TBuilder>
{
    TBuilder builder = builderBuilder();
    input.ReadMessage(builder, registry);
    return builder.Build();
}

The first parameter is a function which will return us a builder. We can’t simply add a new() constraint to TBuilder as not all generated builders will have a public constructor. However, we do know that the TMessage type has a CreateBuilder() method because it implements IMessage<TMessage, TBuilder>. Unfortunately we don’t have an instance of TMessage to call CreateBuilder() on! Really, we’d like to be able to change the code to this:

private static TMessage BuildImpl<TMessage, TBuilder> (CodedInputStream input, ExtensionRegistry registry)
    where TBuilder : IBuilder<TMessage, TBuilder>
    where TMessage : IMessage<TMessage, TBuilder>
{
    TBuilder builder = TMessage.CreateBuilder();
    input.ReadMessage(builder, registry);
    return builder.Build();
}

That’s currently impossible, but only because we can’t specify static methods in interfaces. Suppose we could write:

public interface IMessage<TMessage, TBuilder>
    where TMessage : IMessage<TMessage, TBuilder>
    where TBuilder : IBuilder<TMessage, TBuilder>
{
    static TBuilder CreateBuilder();

    // Other methods as before
}

Wouldn’t that be useful? The value would almost entirely be for generic types or methods where the type parameter is constrained to specify the relevant interface, but that could arguably still be very handy.
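
In the meantime, the closest we can get (beyond explicitly passing a delegate in, as BuildImpl does above) is reflection. Here’s a purely hypothetical sketch – it assumes each concrete message type happens to expose a public static CreateBuilder method with the right signature, which may well not match the real generated code – caching a delegate so the reflection cost is only paid once per constructed type:

using System;
using System.Reflection;

// Assumes the IMessage<,> and IBuilder<,> interfaces shown earlier.
internal static class BuilderFactory<TMessage, TBuilder>
    where TBuilder : IBuilder<TMessage, TBuilder>
    where TMessage : IMessage<TMessage, TBuilder>
{
    // Hypothetical: looks up a public static CreateBuilder() on the concrete
    // message type and caches it as a delegate in a static field.
    internal static readonly Func<TBuilder> CreateBuilder =
        (Func<TBuilder>) Delegate.CreateDelegate(
            typeof(Func<TBuilder>),
            typeof(TMessage).GetMethod("CreateBuilder",
                                       BindingFlags.Public | BindingFlags.Static,
                                       null, Type.EmptyTypes, null));
}

It’s workable, but it’s stringly-typed and only fails at execution time – which is exactly the gap a static interface member would close.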

Operators and constructors

At this point hopefully the idea I mentioned earlier of being able to specify operators and constructors is quite obvious. For instance, we could make all the existing numeric types implement IArithmetic<T> (where T was the same type, e.g. int : IArithmetic<int>):

public interface IArithmetic<T>
{
    static T operator +(T left, T right);
    static T operator -(T left, T right);
    static T operator /(T left, T right);
    static T operator *(T left, T right);
}

// Used in LINQ to Objects, for example:
public static T Sum<T>(this IEnumerable<T> source) where T : IArithmetic<T>
{
    T total = default(T);
    foreach (T element in source)
    {
        total += element; // Interface says we can do total + element
    }
    return total;
}
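
For contrast, the closest we can manage today is to pass the operation in explicitly. This is a sketch of the general workaround, not how the BCL’s existing Sum overloads are actually written:

using System;
using System.Collections.Generic;

public static class SumExtensions
{
    // Today's workaround: the caller supplies the addition operation, because
    // there's no way to constrain T to "has a suitable operator +".
    public static T Sum<T>(this IEnumerable<T> source, Func<T, T, T> add)
    {
        T total = default(T);
        foreach (T element in source)
        {
            total = add(total, element);
        }
        return total;
    }
}

// Usage: int total = new[] { 1, 2, 3 }.Sum((x, y) => x + y);  // 6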

Plug-ins for a particular program could implement IPlugin:

public interface IPlugin
{
    static new (PluginHost host);

    // Normal plug-in members here
    string Title { get; }
}

// Used within a PluginHost like this…
public T CreatePlugin<T>() where T : IPlugin
{
    T plugin = new T(this);
    log.Info("Loaded plugin {0}", plugin.Title);
    return plugin;
}

In fact, I’d imagine it would make sense to define a whole family of IConstructable interfaces, along the same lines as the Func and Action delegate families:

public interface IConstructable
{
    static new();
}

public interface IConstructable<T>
{
    static new(T arg);
}

public interface IConstructable<T1, T2>
{
    static new(T1 arg1, T2 arg2);
}

// etc

Inheritance raises its ugly head

There’s a fly in the ointment here. Normally, if a base type implements an interface, a type derived from it effectively implements the interface too. With static members, that would no longer hold in all cases. You can get away with it for straightforward static methods/properties, in the same way that people often write code such as UTF8Encoding.UTF8 when they really just mean Encoding.UTF8. However, it doesn’t work for constructors – they aren’t inherited, so you can’t guarantee that Banana has a parameterless constructor just because Fruit does.
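
To make that concrete, here’s a tiny example using the Fruit/Banana names from above (illustrative only, of course):

public class Fruit
{
    // Fruit has a public parameterless constructor...
    public Fruit()
    {
    }
}

public class Banana : Fruit
{
    // ...but Banana only has this one: constructors aren't inherited.
    public Banana(string variety)
    {
    }
}

class Demo
{
    static void Main()
    {
        Fruit fruit = new Fruit();                // Fine
        Banana banana = new Banana("Cavendish");  // Fine
        // Banana oops = new Banana();            // Compile-time error - so Banana
        // couldn't satisfy a hypothetical "static new()" member even though Fruit could.
    }
}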

This is not only a problem for one concrete class deriving from another, but also for abstract implementations of interfaces. This happens a lot in Protocol Buffers; often the interface is partially implemented by the abstract class, with the final “leaf” classes in the inheritance tree implementing outstanding members and occasionally overriding earlier implementations for the sake of efficiency. Should we be able to specify an abstract static method in the abstract class, making sure that there’s an appropriate implementation by the time we hit a concrete class? As it happens that would be useful elsewhere in the Protocol Buffer library, but I’ll admit it’s slightly messy. I suspect there are ways round all of these issues, even if they might sometimes involve restricting the feature to particular, common use cases. However, every niggle would add its own piece of complexity.

There may well be other issues which would prove challenging – and other interesting aspects such as what explicit interface implementation would mean, if anything, in the context of static members. Language experts may well be able to reel off great lists of problems – I’d be very interested to hear the ensuing discussions.

Conclusion

I believe that static interface members could prove very useful in generic algorithms, particularly if operators and constructors were allowed as well as the existing member types available in interfaces. There are significant sticking points to be carefully considered, and I wouldn’t like to prejudge the outcome of such deliberations in terms of whether the feature would be useful enough to merit the additional language complexity involved. It feels odd to implement an interface member which is effectively only of use when the implementing type is being used as the type argument for a generic type or method, but as developers learn to think more generically that may be less of a restriction than it currently seems.

This post is the last in my current batch talking about Protocol Buffers. There may well be more, but it’s unlikely now that the Protocol Buffer port is almost entirely complete. Most of these posts have involved generics, and the current limitations of what C# allows us to express in them. I do not intend to give the impression that I’m dissatisfied with C# – I’ve just found it interesting to take a look at what lies beyond the current boundaries of the language. The one aspect of these posts which I would definitely like to see addressed is that of covariant return types – the Java implementation of Protocol Buffers is significantly simpler in many ways purely due to this one point.