Book Review: Effective C# (2nd edition) by Bill Wagner

Disclaimer

Just in case you’re unaware, I’m the author of another C# book, C# in Depth. Although Effective C# is somewhat different to my book, they certainly share a target audience. To that extent, Bill and I are competitors. I try hard to stay unbiased in reviews, but it’s probably impossible. Bear this in mind while reading. I should also note that I didn’t buy my copy of Effective C#; it was kindly sent to me by Pearson, for the purpose of reviewing.

Content and target audience

Effective C# is a style guide for C# developers – but not at the low level of "put your braces here, use PascalCase for method names"; instead, it’s at the design level. As far as I can tell, the aim isn’t to be complete, just to cover the most important aspects of style. (Hey, otherwise there wouldn’t be any need for More Effective C#, right?) There are 50 mostly-self-contained items, totalling about 300 pages to digest – which is a nice size of book, in my opinion. It’s not daunting, and the items can definitely be bitten off one at a time.

Looking down the table of contents, the items are divided into six categories: "C# language idioms", ".NET resource management", "Expressing Designs in C#", "Working with the Framework", "Dynamic Programming in C#", and "Miscellaneous". Broadly speaking these contain the sorts of thing you’d expect – although it’s worth pointing out that a significant chunk of "Working with the Framework" is given over to Parallel Extensions, which may not be obvious from the title. (It’s a really good source of information on PFX, by the way.)

This is not a tutorial on C#. If you don’t know C# reasonably well already (including generics, lambda expressions and so on) you should read another book first, and then come back to Effective C# in order to get the most out of it.

Comment from Bill: generics and lambda expressions (and LINQ) are covered in some detail in More Effective C#. It’s a bit strange that as of the 2nd edition, Effective C# covers a newer version of the language than More Effective C#. I tried hard to make sure neither book expects a reader to have read the other, but the organization of both books as a whole does show the hazards of hitting a moving target.

That’s not to say that there’s no explanation of C# – for example, Bill goes into a few details about the "dynamic" type from C# 4, as well as overloading and how optional parameters work. But these are meant to just cover some poorly-understood (or very new) aspects of the language, rather than teaching you from the beginning. The balance here feels just right to me – I believe most professional C# developers will learn details of C# they weren’t aware of before, but won’t be confused by the basics that Bill left out.

Accuracy, opinion and explanation

My copy of Effective C# has plenty of ink annotations now. They broadly fall into five categories:

  • "Ooh, I’d never thought of that" – aspects of C# which were genuinely new to me
  • "Hell, yes!" – things I agree with 100%, and which will help developers a lot
  • "Um, I disagree" – points where Bill and I would go probably different routes, presumably due to different experiences and having worked in different contexts. (It’s possible that when put in the same context, we’d do the same thing, of course.)
  • "No, that’s technically incorrect" – a few areas which are outright wrong, or were correct for previous versions of the framework/CLR, but aren’t correct now
  • "That’s not what that term means" (or "that’s not the right term for the concept you’re trying to get across") – it should come as no surprise to regular readers that I’m a bit of a pedant when it comes to terminology

The majority of my annotations are of the third category – disagreements. That’s not because I disagree with most of the book; it’s just that the second category is reserved for vehement agreement. I haven’t bothered to note every sentence that I’m just fine with.

The good news is that in areas where we disagree, Bill does an admirable job of stating his case. I disagree with some of his arguments – or I can give counter-examples, or merely place different value on some of the pros and cons – but the important thing is that the reasoning is there. If it doesn’t apply to your context, evaluate the advice accordingly.

It’s entirely reasonable for there to be quite a bit of disagreement, as much of the book is opinion. It’s obviously founded in a great deal of experience (and I should note that Bill has spent a lot more time as a professional C# developer than I have), but it’s still opinion. I rather wish that the book was a wiki, so that these items could be debated, amended etc, as per my dream book – I think that would make it even more valuable.

There are relatively few absolutely incorrect statements, and even on the terminology front there are really just two things which have bugged me repeatedly. Bill uses "implicit properties" for "automatically implemented properties"; I’ve usually heard developers use the abbreviated form "automatic properties", but "implicit" is new to me. Likewise the book talks about "overriding ==" instead of "overloading ==" reasonably frequently. It’s a shame I was too busy with my own book to participate in the technical review for Effective C#, as I suspect that on at least some of these points, Bill would have been happy to amend the text accordingly. I shall, of course, transcribe my comments and send them to him.

Comment from Bill: I’ll make those corrections for the subsequent printings.

What’s missing?

There are some areas which I wish Bill had touched on or emphasized more. Topics such as numbers, text and chronological values could have been given some space, as they frequently confuse folks (and are full of pitfalls; see Humanity: Epic Fail for more of my thoughts on this). I would personally have placed more importance on the mantra of "value types should be immutable" – it’s certainly talked about, but in the context of "preferring" atomic, immutable value types – and of preferring value types over reference types in rather more situations than I personally would. In terms of importance in avoiding shooting yourself in the foot, making sure all structs are immutable comes near the top of the list in my view.

"More Effective C#" doesn’t cover those areas as far as I can tell from the table of contents, but it does go into details about generics and various aspects of C# 3 and LINQ, which are clearly part of any modern C# developer’s toolkit. I certainly intend to get hold of the book to see what else I have to learn from Bill.

I think it might have been nice to have a few sections at an even higher level than the specific items in Effective C#. Topics such as:

  • Don’t trust me: I don’t know your context. Even the smartest folks can only give advice in fairly general terms in books. Don’t apply this advice blindly; weigh up the arguments presented and work out how they apply to your actual code.
  • The importance of testing. It’s possible that this was mentioned, but I don’t recall it. Perhaps it’s a little on the opinionated side (see previous point…) but for significant code bases, testing should be deeply ingrained in whatever process you’re using. Note that although it’s worth trying to keep high standards in test code, it often has a very different look and feel to production code, and different "best practices" may apply.
  • Encouraging "working with the language" – if you find yourself fighting the language, you may discover you can keep winning battles but losing the war. Typically changing your design to represent more idiomatic C# will make life more pleasant for everyone.
  • Performance: how you might decide when and how much to care.

Very few of these would be C#-specific, of course – which may be why Bill left them out. You could easily fill a whole book like that, and it would probably be horrible to read – full of platitudes rather than concrete advice. I personally think there’s room for some discussion of this kind though.

Comment from Bill: The ultimate goal was to have the book be that ‘nice size’ you mention above. I agree that all of those concepts are important, but many of them are not C# specific (or even .NET specific), and I felt they were better covered elsewhere. However, "working with the language" is one area where I feel I do cover. There are only a small number of negative titles (e.g. "avoid" something or "do not" do something), and in those cases I tried to recommend alternatives where you would find yourself "working with the language".

Conclusion

I like Effective C# a lot. I like the fact that I’ve disagreed with a number of the points raised, and in disagreeing I’ve found myself thinking about why I disagree and in what situations each point of view may be appropriate. I worry a little about inexperienced readers who may be tempted to treat this (or any other book) as an ultimate truth to be quoted out of context and used to beat other developers into inappropriate solutions… but hopefully most readers won’t be like that.

Comment from Bill: I also hope that most readers avoid that. Thank you for pointing out that I’ve tried very hard to explain when my advice applies, and when it doesn’t.  That is critical.

It’s definitely encouraged me to try to write a somewhat similar book at some point… possibly not with the same organization, and probably dealing with some very different topics – but it’s good to see that it can work. Whether I can pull it off as well as Bill remains to be seen, of course.

I’ll look forward to reading More Effective C# – although my pile of books to review is groaning somewhat, and I should probably go through another of those first :)

Non-iterable collection initializers

Yesterday on Stack Overflow, I mentioned that sometimes I make a type implement IEnumerable just so that I can use collection initializers with it. In such a situation, I use explicit interface implementation (despite not really needing to – I’m not implementing IEnumerable<T>) and leave it throwing a NotImplementedException. (EDIT: As noted in the comments, throwing NotSupportedException would probably be more appropriate. In many cases it would actually be pretty easy to implement this in some sort of meaningful fashion… although I quite like throwing an exception to indicate that it’s not really intended to be treated as a sequence.)

Why would I do such a crazy thing? Because sometimes it’s helpful to be able to construct a "collection" of items easily, even if you only want the class itself to really treat it as a collection. As an example, in a benchmarking system you might want to be able to add a load of tests individually, but you never want to ask the "test collection" what tests are in it… you just want to run the tests. The only iteration is done internally.
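
Here’s a minimal sketch of the sort of thing I mean – the type and member names are purely illustrative:

public class BenchmarkSuite : IEnumerable
{
    private readonly List<Action> tests = new List<Action>();

    // Called by the collection initializer
    public void Add(Action test)
    {
        tests.Add(test);
    }

    public void RunAll()
    {
        foreach (Action test in tests)
        {
            test();
        }
    }

    // Only implemented to satisfy the compiler; never meant to be called
    IEnumerator IEnumerable.GetEnumerator()
    {
        throw new NotSupportedException();
    }
}

var suite = new BenchmarkSuite
{
    () => Console.WriteLine("First test"),
    () => Console.WriteLine("Second test")
};
suite.RunAll();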

Now, there’s an alternative to collection initializers here: parameter arrays. You can add a "params string[]" or whatever as the final constructor parameter, and simply use the constructor. That works fine in many cases, but it falls down in others:

  • If you want to be able to add different types of values, without just using "params object[]". For example, suppose we wanted to restrict our values to int, string, DateTime and Guid… you can’t do that in a compile-time-safe way using a parameter array.
  • If you want to be able to construct composite values from two or more parts, without having to explicitly construct that composite value each time. Think about the difference in readability between using a collection initializer for a dictionary and explicitly constructing a KeyValuePair<TKey, TValue> for each entry.
  • If you want to be able to use generics to force aspects of type safety. The Add method can be generic, so you could, for example, force two parameters for a single entry to both be of T, but different entries could have different types. This is pretty unusual, but I needed it just the other day :) (See the sketch after this list.)
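
As an illustration of that last point, here’s a sketch of a generic Add method forcing each pair to be of a single type, while different entries can vary (again with an illustrative type that abuses IEnumerable):

public class PairCollection : IEnumerable
{
    public void Add<T>(T first, T second)
    {
        // Store the pair somehow...
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        throw new NotSupportedException();
    }
}

var pairs = new PairCollection
{
    { 1, 2 },          // T inferred as int
    { "foo", "bar" }   // T inferred as string
    // { 1, "foo" }    // Wouldn't compile: no single T fits both arguments
};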

Now, it’s a bit of a hack to have to "not quite implement" IEnumerable. I’ve come up with two alternative options. These have the additional benefit of not requiring the method to always be called Add any more. I suspect it still would be in most cases, but flexibility is a bonus.

Option 1: Class level attribute

Instead of just relying on IEnumerable, the compiler could detect an attribute applied to the class, specifying the single method name for all collection initializer methods:

[CollectionInitializerMethod("AddValue")]
public class RestrictedValues
{
    public void AddValue(int x) { … }

    public void AddValue(string y) { … }
}

var values = new RestrictedValues
{
    3, "value", 10
};

Option 2: Method level attributes

In this case, each method eligible for use in a collection initializer would be decorated with the attribute:

public class RestrictedValues
{
    [CollectionInitializerMethod]
    public void AddInt32(int x) { … }

    [CollectionInitializerMethod]
    public void AddString(string y) { … }
}

var values = new RestrictedValues
{
    3, "value", 10
};

This has the disadvantage that the compiler would need to look at every method in the target class when it found a collection initializer.

Obviously both of these can be made backwardly compatible very easily: the presence of an implementation of IEnumerable with no attributes present would just fall back to using Add.

Option 3: Compiler and language simplicity

(I’ve added this in response to Peter’s comment.)

Okay, let’s stick with the Add method. All we need is another way of indicating that you should be able to use collection initializers with a type:

[AllowCollectionInitializers]
public class RestrictedValues
{
    public void Add(int x) { … }

    public void Add(string y) { … }
}

At this point, the changes required to the compiler (and language spec) are really minimal. In the bit of code which detects whether or not you can use a collection initializer, you just need to change from "does this type implement IEnumerable" to "does this type implement IEnumerable or have the relevant attribute defined". I can’t think of many possible language changes which would be more localized than that.

And another thing…

One final point. I’d still like the ability for collection initializer methods to return a value, with that value being used for subsequent elements of the initialization, and the final return value becoming the result of the whole initializer. Any method with a void return type would be treated as if it returned "this". This would allow you to build immutable collections easily.
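
To make that concrete, here’s a purely hypothetical sketch of what the compiler might generate for an imagined ImmutableList type, if value-returning collection initializer methods were supported:

// Hypothetical translation of:
//   var list = new ImmutableList<int> { 1, 2, 3 };
ImmutableList<int> tmp = new ImmutableList<int>();
tmp = tmp.Add(1);  // Each Add returns a new collection...
tmp = tmp.Add(2);  // ...which is used for the next element
tmp = tmp.Add(3);
var list = tmp;    // The last-returned value is the final result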

Likewise you could decorate methods with a [PropertyInitializerMethod] to allow initialization of immutable types with "WithXyz" methods. Admittedly there’s less need for this now that we have optional parameters and named arguments in C# 4 – a lot of the benefits of object initializers are available just with constructors.

Anyway, just a few slightly odd thoughts around initialization for you to ponder over the weekend…

You are all individuals! (I’m not…)

I’ve been meaning to post this for a while, but recently a couple of events have coincided, reminding me about the issue.

First, Joe Duffy blogged in defence of premature optimization. Second, I started reading Bill Wagner’s Effective C#, 2nd edition, which contains advice such as "make almost all your types serializable". Now, let’s be clear: I have a great deal of respect for both of these gentlemen… but in both cases I think there’s a problem: to some extent they’re assuming a certain type of development.

In some cases, you really, really want to understand the nuts and bolts of every bit of performance. If, for example, you’re writing a parallelization library to be part of the .NET framework. For Noda Time I’m pretty obsessed with performance, too – I really want it to be very fast indeed. And to be clear, Joe does give a certain amount of balance in the article… but I think it’s probably still biased due to his background of working on libraries where performance really, really matters. For many developers, it’s vastly preferable to have the in-house HR web app used by 100 people take a little bit more time to process each request than to take an extra few days of developer work (cumulative) making sure that every little bit of it is as fast as possible. And many of the questions I’ve seen on Stack Overflow are asking for micro-optimizations which are really, really unlikely to matter. (EDIT: Just to be clear, there’s a lot of stuff I agree with in Joe’s post, but I think enough of us find correctness hard enough to start with, without having to consider every possible performance hit of every statement. At least some of the time.)

Likewise for a certain class of development, it probably does make sense to make most types serializable. If most of your objects are modelling data, serialization really will be a major factor. For other people, it won’t be. Most of my working life has been spent writing code which really doesn’t need to serialize anything… or which uses Protocol Buffers for serialization, in order to preserve portability, compactness and flexible versioning. Very few of my types should really be serializable using the platform-default binary serialization (whether in Java or .NET). Relatively few of them need to be serializable at all.

Finally, I’ll mention another example I’ve probably been guilty of: the assumption that a "public API" really can’t be changed without major hassle. An example of this is making a "public const" in C#, and later wanting to change the value of it. "No," I hear you cry… "Make it a public static readonly field instead, to avoid callers baking the value into their compiled code." Absolutely. If you’re in a situation where you may well not know all of your callers, or can’t recompile them all on every deployment, that’s great advice. But I suspect a lot of developers work in environments where they can recompile everything – where the only code which calls their code is written within the same company, and deployed all in one go.
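
A quick sketch of the difference, with made-up names – the value of a const is baked into calling assemblies when they are compiled, whereas a static readonly field is read at execution time:

// Library assembly
public static class Limits
{
    public const int MaxUsers = 100;           // Baked into callers at their compile time
    public static readonly int MaxItems = 50;  // Looked up at execution time
}

// Client assembly, compiled against the library above
Console.WriteLine(Limits.MaxUsers);  // Compiled as Console.WriteLine(100)
Console.WriteLine(Limits.MaxItems);  // Reads the field, so sees any redeployed value

// Ship a new library version with MaxUsers = 200 without recompiling the
// client, and it will still print 100.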

In short, we’re not all writing system libraries. We’re not all writing data-driven business apps. We’re not all writing the same kind of code at all. Good "one size fits all" advice is pretty rare, and "we" (the community preaching best practices etc) should take that into account more often. I absolutely include myself in that chastisement, too.

Reply to a reply… tweaking refactoring

This is a reply to Ben Alabaster’s blog post, which is itself a reply. You can follow the trail yourself. I’ll assume you’ve read the post – I’m not going to go over anything already written there, other than my comments.

I took issue with three aspects of Ben’s refactoring:

  • The use of float for currency
  • The fact that "BaseRate" effectively doesn’t have a well-defined unit; in some cases it’s "dollars per hour" and in others it’s "dollars per pay cheque, regardless of hours" smells
  • The use of properties for the pay functions

I’ll tackle the last one first, because I was stupid. I suggested using public static readonly fields instead of properties. This was dumb. All we need is a simple static class with public static methods – we can use them with exactly the same source code as before for the Employee construction, but without all the fancy lambda expressions:

public static class PayCalculations
{
    public static float BasicWithoutOvertime(float hours, float baseRate)
    {
        return hours * baseRate;
    }

    public static float BasicWithOvertime(float hours, float baseRate)
    {
        if (hours < 40) return hours * baseRate;
        return ((hours - 40f) * 1.5f + 40f) * baseRate;
    }

    public static float Salary(float hours, float baseRate)
    {
        return baseRate;
    }

    public static float Contractor(float hours, float baseRate)
    {
        /* Base rate */
        float subtotal = Math.Min(hours, 40) * baseRate;
        hours -= Math.Min(hours, 40);
        /* Time plus a half */
        if (hours > 0) subtotal += 1.5f * Math.Min(hours, 20) * baseRate;
        hours -= Math.Min(hours, 20);
        /* Double time */
        if (hours > 0) subtotal += 2.0f * Math.Min(hours, 20) * baseRate;
        hours -= Math.Min(hours, 20);
        /* Double time plus a half */
        if (hours > 0) subtotal += 2.5f * hours * baseRate;

        return subtotal;
    }
}

Less code, less nesting, less use of fancy C# 3 features… generally nicer. The construction code remains the same, because it uses method group conversions to build the delegates.
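
For reference, the construction code looks something like this – a sketch, assuming the Employee constructor takes a pay function of type Func<float, float, float> along with the names:

List<Employee> employees = new List<Employee>
{
    // Method group conversions build the delegates for us
    new Employee("John", "MacIntyre", PayCalculations.BasicWithoutOvertime),
    new Employee("Ben", "Alabaster", PayCalculations.BasicWithOvertime),
    new Employee("Cory", "Fowler", PayCalculations.Salary),
    new Employee("John", "Doe", PayCalculations.Contractor)
};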

Fixing the "float should be decimal" problem is easy, of course. Let’s move on to the units and "wrong" values. The problem is that the BaseRate property means different things for different employees, and in some cases it’s not even needed at all. That’s a reasonably strong indicator that it’s in the wrong place. Let’s accept that all employees’ pay may depend on the number of hours they’ve worked that week, but that’s all. Everything else depends on the particular contract that the employee is using, and that can vary. So let’s put the variance into what creates the function – so we can build a "salaried employee on $2935 week" function, a "per hour worker on $40.25 basic without overtime" etc. This is effectively like creating an IPayContract interface and multiple implementations, then creating instances of those implementations which have specific values. Except we’re using delegates… so having ripped out the lambda expressions, I’m going to put them back in :) But this time we’re just going to use a Func<decimal, decimal> as we only to know how much to pay given a certain number of hours worked. (The first decimal here could potentially be float or double instead, but if anyone ever did claim to work 3.1 hours, they’d probably want pay that reflected it.)

Here are the pay calculations:

public static class PayCalculations
{
    public static Func<decimal, decimal> BasicWithoutOvertime(decimal dollarsPerHour)
    {
        return hours => dollarsPerHour * hours;
    }

    public static Func<decimal, decimal> BasicWithOvertime(decimal dollarsPerHour)
    {
        // Use an alternative approach just for LOLs
        return hours => {
            decimal basicHours = Math.Min(hours, 40);
            decimal overtimeHours = Math.Max(hours - 40, 0);
            return (basicHours * dollarsPerHour) + (overtimeHours * dollarsPerHour * 1.5m);
        };
    }

    public static Func<decimal, decimal> Salary(decimal dollarsPerWeek)
    {
        // This *looks* like the units are wrong… but it’s okay, see text.
        return hours => dollarsPerWeek;
    }

    public static Func<decimal, decimal> Contractor(decimal baseRate)
    {
        return hours => {
            // 0-40 hours
            decimal basicHours = Math.Min(hours, 40);
            // 40-60 hours
            decimal overtime = Math.Min(Math.Max(hours - 40, 0), 20);
            // 60-80 hours
            decimal doubleTime = Math.Min(Math.Max(hours - 60, 0), 20);
            // 80+ hours
            decimal chargingThroughTheNoseTime = Math.Max(hours - 80, 0);

            return (basicHours * baseRate)
                 + (overtime * baseRate * 1.5m)
                 + (doubleTime * baseRate * 2m)
                 + (chargingThroughTheNoseTime * baseRate * 2.5m);
        };
    }
}

And now, when we construct the employees, we don’t have to specify a base rate which was only meaningful in some cases – instead, we give that value to the pay calculator instead:

List<Employee> employees = new List<Employee>
{
    new Employee("John", "MacIntyre", PayCalculations.BasicWithoutOvertime(40.25m)),
    new Employee("Ben", "Alabaster", PayCalculations.BasicWithOvertime(40.25m)),
    new Employee("Cory", "Fowler", PayCalculations.Salary(2935m)),
    new Employee("John", "Doe", PayCalculations.Contractor(150m)),
    new Employee("Jane", "Doe", hours => 3500m),
    new Employee("Joe", "Bloggs", hours => 34.25m * Math.Max(hours, 15))
};

Now, look at the Salary method and the comment in it… I’m still concerned about the units. We’re meant to be returning a simple dollar value (and in another system I’d probably bake that information into the types used) but we’ve still got dollarsPerWeek. What’s wrong here? Well, it all boils down to an assumption: we’re running this once a week. We’ve got a constant of 1 week… so we could rewrite the method like this:

public static Func<decimal, decimal> Salary(decimal dollarsPerWeek)
{
    decimal weeksWorked = 1m; // We run payroll once per week
    return hours => dollarsPerWeek * weeksWorked;
}

Now the units work… although it looks a bit silly. Of course, it makes our assumption very explicit – and easy to challenge. Maybe we actually run payroll once per month… in which case the fact that we’re expressing the salary in dollars per week is wrong – but very obviously wrong, which is a good thing.

Conclusion

It doesn’t feel right to have a blog post with no conclusion. Lessons applied here:

  • Remember all the tools available, not just the shiny ones. Using method group conversions made the initial "constant function" approach simpler to understand.
  • Units are important! If the same field effectively represents different units for different instances, there’s something wrong
  • If a field is only relevant for some instances of a type, it’s probably in the wrong place
  • Don’t use floats for currency. See any number of Stack Overflow questions for reasons why :)

EDIT: As noted by Barry Dorrans, there’s a lot of scope for introducing constants in here, for further goodness. I’m not going to bother updating the code just for Barry though. That way madness lies.

Query expression syntax: continuations

In this Stack Overflow question, I used a query continuation from a select clause, and one commenter expressed surprise, being unaware of what "select … into" meant. He asked for any references beyond the MSDN "into" page, and I didn’t know of any specific ones. So, here’s a very quick guide.

When "into" is used after either a "group x by y" or "select x" clause, it’s called a query continuation. (Note that "join … into" clauses are not query continuations; they’re very different.) A query continuation effectively says, "I’ve finished one query, and I want to do another one with the results… but all in one expression." This query:

var query = from x in y
            // other query clauses here
            select x.SomeProperty into z
            // other query clauses here (involving z)
            select z.Result;

has *exactly* the same behaviour as this (leaving aside the visible local variable):

var tmp = from x in y
          // other query clauses here
          select x.SomeProperty;

var query = from z in tmp
            // other query clauses here (involving z)
            select z.Result;

Indeed the specification is written in terms of a transformation a bit like that. Note that the query continuation starts a clean slate in terms of range variables – after the "into z" part, x is not in scope.

Personally I usually split a query up into two statements with a local variable (i.e. the second form) rather than using a "select … into" query continuation, but it’s useful to know about them. I find I use "group … into" much more often, and I suspect others do too – it’s relatively rare to see a "group by" clause ending a LINQ query on its own.
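
For example, here’s the kind of "group … into" continuation I mean, over a hypothetical words sequence:

var query = from word in words
            group word by word.Length into grouped
            orderby grouped.Key
            select new { Length = grouped.Key, Count = grouped.Count() };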

Mini-post: abstractions vs repetition; a driving analogy

Driving Tom back from a children’s party this afternoon, I was thinking about Noda Time.

I’ve been planning to rework the parsing/formatting API, so that each chronological type (ZonedDateTime, LocalDateTime, LocalDate, LocalTime) has its own formatter and parser pair. I suspect this will involve quite a bit of similar code between the various classes… but code which is easy to understand and easy to test in its simple form.

The question which is hard to answer before the first implementation is whether it will be worth trying to abstract out that similar code to avoid repetition. In my experience, quite often something that sounds like a good idea in terms of abstraction ends up becoming significantly more complex than the "just implement each class separately" approach… but we’ll see.

Around this time, I ended up stuck behind a car which was going at 20mph between speed bumps, decreasing to 10-15mph near the speed bumps, of which there were quite a few. I could have tried overtaking, but visibility wasn’t great, and I wasn’t far from home anyway. I regard overtaking as a somewhat extreme measure when I’m not on a dual carriageway. It was frustrating, but I knew I wouldn’t be losing very much. The risks involved in overtaking (combined with the possibility that I’d end up stuck behind someone else) weren’t worth the small benefit of getting past the car in front.

It struck me that the two situations were somewhat similar. I know from experience that trying to extract a general purpose abstraction to avoid repetition can be risky: corner cases where the abstraction just doesn’t work can hide themselves until quite late in the proceedings, or you may find that you end up with nearly as much repetition anyway due to something else. The repetitive code is a burden, but one which is more easily reckoned with.

I suspect I won’t try too hard to abstract out lots of the code for formatting and parsing in Noda Time. I’m sure there’ll be quite a lot which can be easily factored out, but anything which is non-obvious to start with can wait for another time. For the moment, I’ll drive slowly but steadily, without any flashy moves.

Very quick interlude: I know that CAPTCHAs aren’t working on this blog if you’re using Chrome

This is just a quick post to hopefully stem the tide of people commenting (in blog comments and Twitter messages) saying that this blog is broken when it comes to CAPTCHAs and Chrome. I know, but there’s nothing I can do about it – I don’t run the software on the blog. I haven’t looked into why it’s not working… if you open the image in a new tab, it displays that fine. Weird, but such is life… I’m hoping that an update to either Community Server or Chrome will fix it eventually.

Reimplementing LINQ to Objects: Part 2 – “Where”

Warning: this post is quite long. Although I’ve chosen a simple operator to implement, we’ll encounter a few of the corner cases and principles involved in LINQ along the way. This will also be a somewhat experimental post in terms of format, as I try to work out the best way of presenting the material.

We’re going to implement the "Where" clause/method/operator. It’s reasonably simple to understand in general, but goes into all of the deferred execution and streaming bits which can cause problems. It’s generic, but only uses one type parameter (which is a big deal, IMO – the more type parameters a method has, the harder I find it to understand in general). Oh, and it’s a starting point for query expressions, which is a bonus.

What is it?

"Where" has two overloads:

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)

Before I go into what it actually does, I’ll point out a few things which are going to be common across nearly all of the LINQ operators we’ll be implementing:

  • These are extension methods – they’re declared in a top-level, non-nested, static class and the first parameter has a "this" modifier. They can be invoked roughly as if they were instance methods on the type of the first parameter.
  • They’re generic methods – in this case there’s just a single type parameter (TSource) which indicates the type of sequence we’re dealing with. So for (say) a list of strings, TSource would be string.
  • They take generic delegates in the Func<…> family. These are usually specified with lambda expressions, but any other way of providing a delegate will work fine too.
  • They deal with sequences. These are represented by IEnumerable<T>, with an iterator over a sequence being represented by IEnumerator<T>.

I fully expect that most readers will be comfortable with all of these concepts, so I won’t go into them in more detail. If any of the above makes you nervous, please familiarise yourself with them before continuing, otherwise you’re likely to have a hard time.

The purpose of "Where" is to filter a sequence. It takes an input sequence and a predicate, and returns another sequence. The output sequence is of the same element type (so if you put in a sequence of strings, you’ll get a sequence of strings out) and it will only contain elements from the input sequence which pass the predicate. (Each item will be passed to the predicate in turn. If the predicate returns true, the item will be part of the output sequence; otherwise it won’t.)

Now, a few important details about the behaviour:

  • The input sequence is not modified in any way: this isn’t like List<T>.RemoveAll, for example.
  • The method uses deferred execution – until you start trying to fetch items from the output sequence, it won’t start fetching items from the input sequence
  • Despite deferred execution, it will validate that the parameters aren’t null immediately
  • It streams its results: it only ever needs to look at one result at a time, and will yield it without keeping a reference to it. This means you can apply it to an infinitely long sequence (for example a sequence of random numbers)
  • It will iterate over the input sequence exactly once each time you iterate over the output sequence
  • Disposing of an iterator over the output sequence will dispose of the corresponding iterator over the input sequence. (In case you didn’t realise, the foreach statement in C# uses a try/finally block to make sure the iterator is always disposed however the loop finishes.)

Many of these points will be true for a lot of our other operators too.

The overload which takes a Func<TSource, int, bool> lets the predicate use the index within the sequence as well as the value. The index always starts at 0, and increments by 1 each time regardless of previous results from the predicate.
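
For example, this sketch keeps only the elements at even positions within the sequence:

int[] source = { 10, 20, 30, 40, 50 };
// The predicate receives (value, index); the result is 10, 30, 50
var evenPositions = source.Where((value, index) => index % 2 == 0);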

What are we going to test?

Ideally, we’d like to test all of the points above. The details of streaming and how many times the sequence is iterated over are frankly a pain to deal with, unfortunately. Given how much there is to implement already, we’ll come back to those.

Let’s have a look at some tests. First, here’s a simple "positive" test – we’re starting with an array of integers, and using a lambda expression to only include elements less than 4 in the output. (The word "filter" is omnipresent but unfortunate. It’s easier to talk about "filtering out" elements than "including" them, but the predicate is expressed in a positive way.)

[Test]
public void SimpleFiltering()
{
    int[] source = { 1, 3, 4, 2, 8, 1 };
    var result = source.Where(x => x < 4);
    result.AssertSequenceEqual(1, 3, 2, 1);
}

I’ve kept the TestExtensions from MoreLINQ, despite NUnit coming with CollectionAssert. I find the extension methods easier to work with for three reasons:

  • They’re extension methods, which helps to reduce the clutter
  • They can use a parameter array for the expected output, which makes the test simpler to express
  • The message is clearer when the assertion fails

Basically, AssertSequenceEqual does what you’d expect it to – it checks that the actual result (usually expressed as the variable you call the method on) matches the expected result (usually expressed as a parameter array).
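
If you’re curious, a minimal sketch of such an extension method might look like this – the real MoreLINQ helper produces far more informative failure messages:

internal static class TestExtensions
{
    internal static void AssertSequenceEqual<T>(
        this IEnumerable<T> actual, params T[] expected)
    {
        // NUnit compares the two sequences element by element
        CollectionAssert.AreEqual(expected, actual);
    }
}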

So far, so good. Now let’s check argument validation:

[Test]
public void NullSourceThrowsNullArgumentException()
{
    IEnumerable<int> source = null;
    Assert.Throws<ArgumentNullException>(() => source.Where(x => x > 5));
}

[Test]
public void NullPredicateThrowsNullArgumentException()
{
    int[] source = { 1, 3, 7, 9, 10 };
    Func<int, bool> predicate = null;
    Assert.Throws<ArgumentNullException>(() => source.Where(predicate));
}

I’m not bothering to check the name within the ArgumentNullException, but importantly I’m testing that the arguments are being validated immediately. I’m not trying to iterate over the result – so if the validation is deferred, the test will fail.

The final interesting test for the moment is also around deferred execution, using a helper class called ThrowingEnumerable. This is a sequence which blows up with an InvalidOperationException if you ever try to iterate over it. Essentially, we want to check two things:

  • Just calling Where doesn’t start iterating over the source sequence
  • When we call GetEnumerator() to get an iterator and then MoveNext() on that iterator, we should start iterating, causing an exception to be thrown.

We’ll need to do something similar for other operators, so I’ve written a small helper method in ThrowingEnumerable:

internal static void AssertDeferred<T>(
    Func<IEnumerable<int>, IEnumerable<T>> deferredFunction)
{
    ThrowingEnumerable source = new ThrowingEnumerable();
    var result = deferredFunction(source);
    using (var iterator = result.GetEnumerator())
    {
        Assert.Throws<InvalidOperationException>(() => iterator.MoveNext());
    }
}

Now we can use that to check that Where really defers execution:

[Test]
public void ExecutionIsDeferred()
{
    ThrowingEnumerable.AssertDeferred(src => src.Where(x => x > 0));
}
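
In case you’re wondering, ThrowingEnumerable itself can be tiny – something like this sketch:

internal sealed class ThrowingEnumerable : IEnumerable<int>
{
    public IEnumerator<int> GetEnumerator()
    {
        // Any attempt to iterate blows up immediately
        throw new InvalidOperationException();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}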

These tests have all dealt with the simpler predicate overload – where the predicate is only passed the item, not the index. The tests involving the index are very similar.

Let’s implement it!

With all the tests passing when running against the real LINQ to Objects, it’s time to implement it in our production code. We’re going to use iterator blocks, which were introduced in C# 2 to make it easier to implement IEnumerable<T>. I have a couple of articles you can read if you want more background details… or read chapter 6 of C# in Depth (either edition). These give us deferred execution for free… but that can be a curse as well as a blessing, as we’ll see in a minute.

At its heart, the implementation is going to look something like this:

// Naive implementation
public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

Simple, isn’t it? Iterator blocks allow us to write the code pretty much how we’d describe it: we iterate over each item in the source, and if the predicate returns true for that particular item, we yield (include) it in the output sequence.

Lo and behold, some of our tests pass already. Now we just need argument validation. That’s easy, right? Let’s give it a go:

// Naive validation – broken!
public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

Hmm. Our validation tests still seem to be red, and putting a breakpoint on the "throw" statements doesn’t help… they’re not getting hit. What’s going on?

I’ve given a few pretty broad hints already. The problem is deferred execution. Until we start trying to iterate over the result, none of our code will run. Our tests deliberately don’t iterate over the result, so validation is never performed.

We’ve just hit a design flaw in C#. Iterator blocks in C# simply don’t work nicely when you want to split execution between "immediate" (usually for validation) and "deferred". Instead, we have to split our implementation into two: a normal method for validation, which then calls the iterator method for the deferred processing:

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return WhereImpl(source, predicate);
}

private static IEnumerable<TSource> WhereImpl<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

It’s ugly, but it works: all our index-less tests go green. From here, it’s a short step to implement the version using an index as well:

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return WhereImpl(source, predicate);
}

private static IEnumerable<TSource> WhereImpl<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)
{
    int index = 0;
    foreach (TSource item in source)
    {
        if (predicate(item, index))
        {
            yield return item;
        }
        index++;
    }
}

Now the bar is green, and we’re done. Hang on though… we haven’t used it every way we can yet.

Query expressions

So far, we’ve been calling the method directly (although as an extension method) – but LINQ also provides us with query expressions. Here’s our "SimpleFiltering" test rewritten to use a query expression:

[Test]
public void QueryExpressionSimpleFiltering()
{
    int[] source = { 1, 3, 4, 2, 8, 1 };
    var result = from x in source
                 where x < 4
                 select x;
    result.AssertSequenceEqual(1, 3, 2, 1);
}

(Note that the name here is different from the one in the downloadable code, to stop the blog server software from blocking the name of the method. Grr.)

That will produce exactly the same code as our earlier test. The compiler basically translates this form into the previous one, leaving the condition (x < 4) as a lambda expression and then converting it appropriately (into a delegate in this case). You may be surprised that this works as we have no Select method yet…  but in this case we have a "no-op" select projection; we’re not actually performing a real transformation. In that case – and so long as there’s something else in the query, in this case our "where" clause – the compiler effectively omits the "select" clause, so it doesn’t matter that we haven’t implemented it. If you changed "select x" to "select x * 2", it would fail to compile against our Where-only LINQ implementation.
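
Spelling out the translation for this particular query: the compiler ends up generating exactly the call from the earlier test:

var result = source.Where(x => x < 4);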

The fact that query expressions are just based on patterns like this is a very powerful feature for flexibility – it’s how LINQ to Rx is able to only implement the operators that make sense in that environment, for example. Similarly, there’s nothing in the C# compiler that "knows" about IEnumerable<T> when it comes to query expressions – which is how entirely separate interfaces such as IObservable<T> work just as well.

What have we learned?

There’s been a lot to take in here, in terms of both implementation and core LINQ principles:

  • LINQ to Objects is based on extension methods, delegates and IEnumerable<T>
  • Operators use deferred execution where appropriate and stream their data where possible
  • Operators don’t mutate the original source, but instead return a new sequence which will return the appropriate data
  • Query expressions are based on compiler translations of patterns; you don’t need to implement any more of the pattern than the relevant query expression requires
  • Iterator blocks are great for implementing deferred execution…
  • … but make eager argument validation a pain

Code download

Linq-To-Objects-2.zip

Many people have asked for a source repository for the project, and that makes sense. I’m putting together a source repository for it now; it’s likely to be done before I post the next part.

Reimplementing LINQ to Objects: Part 1 – Introduction

About a year and a half ago, I gave a talk at a DDD day in Reading, attempting to reimplement as much of LINQ to Objects as possible in an hour. Based on the feedback from the session, I went far too fast… and I was still a long way from finishing. However, I still think it’s an interesting exercise, so I thought I’d do it again in a more leisurely way, blogging as I go. Everything will be under the "Edulinq" tag, so that’s the best way to get all the parts in order, without any of my other posts.

General approach

The plan is to reimplement the whole of LINQ to Objects, explaining each method (or group of methods) in a blog post. I’m going to try to make the code itself production quality, but I’m not going to include any XML documentation – if I’m already writing up how things work, I don’t want to do everything twice. I’ll include optimizations where appropriate, hopefully doing better than LINQ to Objects itself.

The approach is going to be fairly simple: for each LINQ method, I’ll write some unit tests (most of which I won’t show in the blog posts) and make sure they run against the normal .NET implementation. I’ll then comment out the using directive for System.Linq, and introduce one for JonSkeet.Linq instead. The tests will fail, I’ll implement the methods, and gradually they’ll go green again. It’s not quite the normal TDD pattern, but it works pretty well.

I will write up a blog entry for each LINQ operator, probably including all of the production code but only "interesting" tests. I’ll highlight important patterns as they come up – that’s half the point of the exercise, of course.

At the end of each post, I’ll include a link to download the "code so far". For the sake of anyone looking at these posts in the future, I’m going to keep the downloads numbered separately, rather than updating a single download over time. I’m hoping that the code will simply grow, but I dare say there’ll be some modifications along the way too.

The aim is not to end up with LINQBridge: I’m going to be targeting .NET 3.5 (mostly so I can use extension methods without creating my own attribute) and I’m certainly not going to start worrying about installers and the like. The purpose of all of this is purely educational: if you follow along with these blog posts, with any luck you’ll have a deeper understanding of LINQ in general and LINQ to Objects in particular. For example, topics like deferred execution are often misunderstood: looking at the implementation can clarify things pretty well.

Testing

The unit tests will be written using NUnit (just for the sake of my own familiarity). Fairly obviously, one of the things we’ll need to test quite a lot is whether two sequences are equal. We’ll do this using the TestExtensions class from MoreLINQ (which I’ve just copied into the project). The netbook on which I’ll probably write most of this code only has C# Express 2010 on it, so I’m going to use the external NUnit GUI. I’ve set this in the project file as the program to start – something you can’t do from C# Express directly, but editing the project file to include this is easy:

<StartAction>Program</StartAction>
<StartProgram>C:\Program Files\NUnit-2.5.7.10213\bin\net-2.0\nunit-x86.exe</StartProgram>

It’s a bit of a grotty hack, but it works. The "additional command line parameters" are then set to just JonSkeet.Linq.Tests.dll – the current directory is the bin/debug directory by default, so everything’s fine. Obviously if you want to run the tests yourself and you have ReSharper or something similar, you can see the results integrated into Visual Studio.

Although I’m hoping to write reasonably production-level code, I doubt that there’ll be as many unit tests as I’d really write against production code. I fully expect the number of lines of test code to dwarf the number of lines of production code even so. There are simply vast numbers of potential corner cases… and quite a few overloads in several cases. Remember that the goal here is to examine interesting facets of LINQ.

Code layout

Just like the real LINQ to Objects, I’ll be creating an enormous static Enumerable class… but I’ll do so using partial classes, with one method name (but multiple overloads) per file. So Where will be implemented in Where.cs and tested in WhereTest.cs, for example.
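
As a sketch of that layout, using the JonSkeet.Linq namespace mentioned earlier:

// Where.cs
namespace JonSkeet.Linq
{
    public static partial class Enumerable
    {
        // Overloads of Where<TSource> will live here
    }
}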

First code drop

The first zip file is available: Linq-To-Objects-1.zip. It doesn’t contain any production code yet – just 4 tests for Where, so I could check that NUnit was working properly for me. Next stop… implementing Where.

Presentation preparation

As most of you know, I occasionally talk about C# at conferences, user groups or basically anywhere that people won’t attack me. A while ago I rejected PowerPoint in favour of a rather less formal approach: hand-drawn slides. Quite a few people have now asked me about how they’re prepared – even occasionally making the assumption that my awful handwriting is actually a real font – so I figured it’s worth a blog post.

My process is both primitive and complex…

1. Draw the slides

I use plain A4 printer paper and a black flipchart marker to draw the slides. Where I need one bit overlaid on top of another (e.g. to cross something out), I’ll draw the two sections separately, to merge later. Given that I’m fiddling around anyway, it doesn’t matter if there are extra bits on the paper – so if I make a small mistake, I’ll often keep going on the same sheet.

I rarely make a textual plan of my presentations beforehand – I think of roughly where I want to go, and then just draw. Often some slides will be reordered or deleted for the final presentation. I find a good fat pen in my hand helps the creative process.

2. Scan the paper

I have a relatively ancient but still trusty flatbed scanner (a Xerox 2400, if you’re interested) which I use to scan everything in. A document scanner would be more convenient – scanning 25 slides individually is very tedious – but I don’t really have the space at the moment.

I scan in greyscale at 400dpi, into Paintshop Pro v8.10. It may be 7 years old, but it does the job perfectly adequately. I then re-orient them to landscape mode and save them as BMP files. Yes, it’s as daft as it sounds, but necessary for the next step. These files are pretty big – about 15MB – and large in terms of canvas size too… but here’s an idea of what they look like (but as a PNG, scaled down significantly):

Slide before alterations

At this point I’m done with Paintshop Pro, so I close it down…

3. Convert into SVG

I used to edit the slides as PNG files in Paintshop Pro, but this had various issues:

  • Colouring was a pain
  • Resizing didn’t work terribly well
  • Cutting and pasting shapes wasn’t as easy as it might have been

For line drawings like mine, I suspect SVG is the ideal format. I found InkScape, a free tool which works pretty well… but I need to convert the images first. InkScape does actually have a conversion tool built in, but I’ve never quite got it to work properly – whereas using potrace directly (I believe InkScape uses potrace internally) means I can batch the conversions and control them as far as I need to.

When I discovered potrace, I was amazed by it. It really does exactly what it says on the tin – it converts bitmaps into SVGs. Unfortunately it’s limited in terms of its input formats – hence the BMPs – but that’s fine.

For the record, here’s the potrace command I use:

potrace -b svg -r 400 -t 8 -O 0.4 -k 0.7 %*

I can’t remember what all the arguments mean offhand, but you can find out if you want to :)

4. Edit in InkScape

So, now I have black and white SVGs which I can edit in InkScape. I generally perform the following tweaks:

  • Remove any crud left from my dirty scanner – often there are extra shapes which need deleting
  • Resize everything and rotate shapes – usually to straighten them
  • Apply colour; I usually stick to red, black, blue and green with occasional other colours where necessary. A bit of colour’s great, but we’re not going for works of art here… and many colours look rubbish on some projectors.

Here’s the result of editing the above picture – assuming your browser can handle SVG:

Slide as an SVG

5. Convert to PNG

So now we’re done, right? Well, no… because I haven’t found any way of presenting nicely from SVGs. Aside from anything else, they take a long time to render, and I really need to be able to flick from slide to slide whenever I want to. However, I’ve found that PNGs work well as a display format. I use ImageMagick to convert from SVG to PNG usually… although very occasionally it crashes, and I have to use InkScape to export to PNG, which it’s perfectly capable of doing. (I use ImageMagick mostly so that I can do it easily in a batch.)

I convert to a 1024×768 (-ish) PNG, so that it won’t be too time-consuming to render, and I don’t need to worry about scaling it. (Most projectors I’ve used have been that size.) Here’s the PNG for the same image again – shrunk to 600×436 just to avoid taking up too much room:

Final converted slide

Even if you couldn’t see the SVG before, you should now be able to spot that not only is the image coloured in (and with a lot better contrast than the original) but I’ve tweaked the man to be falling instead of flying. (The aim of the slide is to indicate that C# should push you into the pit of success.)

6. Present!

At this point, I have three directories: one of the original BMPs, one of the SVGs, and one of the PNGs. I could then create a PowerPoint presentation consisting of all of the images… but I’ve found that PowerPoint takes an awfully long time to render images, and it would also be a bit of a pain to add them in the first place.

Instead, I use FastStone Image Viewer to present a slideshow of the images. I’ve tweaked a few options to make the images appear nice and quickly against a white background. It’s easiest if the image files are in alphabetical order, so I tend to go for filenames such as "010 Introduction.png" – always initially leaving 10 between each number, so that I can easily slot in an extra slide or two if necessary.

So that’s my process… four pieces of free software and one very old commercial one, a fair amount of time, and a slightly odd result.

Looking to the future…

Obviously this isn’t an ideal process. I now have a Wacom Bamboo Fun tablet, and the hope is that eventually I’ll be able to draw with it directly into InkScape, cutting out huge amounts of the process. However, so far I’m completely inept with it. If you think my drawings are bad and my handwriting is ridiculous when I’m using a pen and paper, you should see it with the tablet… but maybe that will change. I’d like to think so.

I suspect my style may change over time too, although I’m not sure in what way. Different presentations already get different treatment – for some talks I work entirely in Visual Studio, occasionally breaking to talk without coding, but without any slides… for others I’ll have slides but no code at all. If you ever happen to see me present though, please give me feedback afterwards – good or bad. It’s an area I’d love to improve on, and feedback is the best way of that happening.