Category Archives: C#

C# 4, part 2: Ideas from other community members

There has been a fair amount of speculation online about what should be in C# 4. I’ve taken the list below from a few posts, primarily those by Ayende and Jeremy Miller. I’ve deliberately left out the ideas that Microsoft have mentioned that they’re at least considering – they’ll come in the next post.

Mixins

I suspect everyone has a different idea of what these mean, but I’ll say what I’d like. I want to be able to implement an interface by proxying all calls (other than those I’ve actually implemented) to a particular member variable, as an aid to favouring composition over inheritance. As an example, here’s a class which implements IList<string> but makes sure that only strings with length 5 or more can be added:

 

public class LongStringList : IList<string>
{
    [MixinImplementation(typeof(IList<string>))]
    readonly IList<string> m_list = new List<string>();

    public void Add(string item)
    {
        if (item.Length < 5)
        {
        throw new ArgumentException("Only strings of length >= 5 allowed");
        }

        m_list.Add(item);
    }
}

 

(Yes, you’d need to implement the setter as well to prevent other values being replaced. This is just sample code to give the rough flavour – and the syntax is pretty much made up as an example too. I’m not hung up on the syntax; I just want the functionality.)

You could of course derive from System.Collections.ObjectModel.Collection&lt;string&gt; – but this prevents you from deriving from any other class, and fixes the inheritance forever. If you only really want to provide your clients with an IList&lt;string&gt; implementation, it’s nicer not to pin anything down. At a later date you could manually implement more of the interface members instead of proxying them, without changing any of the calling code.

Symbols

I don’t see the benefit over normal string interning here. That could just be because of a poor description of symbols in Ruby, admittedly… but I suspect any other benefit wouldn’t meet the “it’s got to be really useful in many situations” bar.

Hashes

I’ve only extensively used one language with hashes built in: Groovy. While I agree it’s nice occasionally, I don’t think it’s worth bending the language out of shape for as we’ve now got collection initializers anyway:

 

var hash = new Dictionary<string,int>
{
    { "First", 1 },
    { "Second", 2 }
};

Automatic delegation

To be honest I don’t really know what Jeremy means here – although it’s possible that he means what I understand as mixins. Ah the joys of loose terminology.

Metaprogramming

I only have vague ideas of what metaprogramming is all about, and those are mostly through Ayende’s blog. I can see that it’s almost certainly very powerful, but I’m not sure I want it in C#. I don’t want C# to turn into a massive box with every nifty feature ever considered. It’s possible I could be turned on this one, if someone showed me it working really nicely.

Macros

Ick, no. I’ve seen what macros tend to be used for. I’m sure there are nice shiny reasons for them, but certainly in the C/C++ form I’d be heavily against them.

Update: Ian Griffiths mailed me drawing my attention to LISP macros and how different they are to C/C++ macros. The way Ian described it sounds similar to what I understand of the metaprogramming that Ayende wants to do. I can see why it’s a powerful tool… but personally I think I’d rather keep it away from a mainstream language like C#. I’ll be writing another blog post to explain my view on this, because it’s worthy of a much fuller discussion.

Everything virtual by default

Absolutely not! Yes, it would make mocking easier – but then making everything public by default would probably make things easier too. Inheritance is hard to control properly, and should only be done with very careful design. As I wrote in the previous post, I’d prefer classes to be sealed by default, i.e. a step in the opposite direction. Oh, there’s the performance implication too, which is one of the reasons Java needs a much more complicated multi-pass JIT – to allow even virtual methods to be inlined until they’re actually overridden. The performance part is much, much less important than the “inheritance is powerful but easy to misuse” argument.

Not only should the default not change at this point, but it was the right default to start with.

Instant Data Type

This would basically be a way of using anonymous types at a higher level – returning them with a return type of var, for instance. I don’t support that proposal per se, but I can see a benefit in having “named anonymous types” – classes which have the same behaviour as anonymous types (in terms of immutability, equality, hash codes etc) but in a named manner. Something like this:

 

public class Person = new {string Name, DateTime DateOfBirth}

Person p = new Person { Name = "Jon", DateOfBirth = 19.June(1976) };
Person p2 = new Person { Name = "Jon", DateOfBirth = 19.June(1976) };

Assert.IsTrue(p.Equals(p2));
Assert.AreEqual(p.GetHashCode(), p2.GetHashCode());
// etc

 

Again, the syntax isn’t terribly important to me – but the ability to define very simple immutable data objects is nice. It could also improve the readability of some LINQ code as you could make the meaning of the (currently anonymous) tuple clear in the name.
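To make the semantics concrete, here’s roughly the kind of class I’d expect that one-line declaration to expand to. This is a hand-written sketch of my own – the real generated code could well differ:

using System;

public sealed class Person
{
    public string Name { get; private set; }
    public DateTime DateOfBirth { get; private set; }

    // The sketch uses a constructor to stay compilable today; the proposal
    // would presumably let "new Person { Name = ..., DateOfBirth = ... }"
    // populate the properties while keeping them externally read-only.
    public Person(string name, DateTime dateOfBirth)
    {
        Name = name;
        DateOfBirth = dateOfBirth;
    }

    // Value-based equality, just like an anonymous type.
    public override bool Equals(object obj)
    {
        Person other = obj as Person;
        return other != null &&
               Equals(Name, other.Name) &&
               DateOfBirth == other.DateOfBirth;
    }

    // A hash code consistent with Equals, again mirroring anonymous types.
    public override int GetHashCode()
    {
        int hash = 17;
        hash = hash * 31 + (Name == null ? 0 : Name.GetHashCode());
        hash = hash * 31 + DateOfBirth.GetHashCode();
        return hash;
    }
}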

A few anticipated comebacks:

  1. Clash with object initializers: yes, it looks like it’s setting properties rather than passing them in as constructor arguments. That’s unfortunate, and maybe parentheses would be better than braces here. That would require named parameters though. (I’ll come onto those in another post!)
  2. Why not just refactor the anonymous type to a named type? ReSharper lets you do this! Indeed it does – but then you’ve got a complete class to maintain. Given a single line of code, I know the features of the Person class. I can add a new property (breaking existing uses, of course) without having to make sure I get the equality and hash code implementations right manually, etc. I prefer simplicity of language expression over just saving typing by using snippets etc – that’s why I like automatic properties.
  3. It can’t use quite the same implementation as anonymous types. Indeed, anonymous types are quite interesting in terms of the number of types actually generated in the IL, due to sharing of generics. I don’t think it would be a great loss in this case though.
  4. The use still isn’t as brief as with anonymous types, due to needing to specify the name. True, but unavoidable, I think.

MemberInfo (infoof)

I don’t think the C# team have actually stated that this is even potentially on the table, but one of the lovely things about having Eric Lippert as a tech reviewer for the book is I get to hear all kinds of anecdotes about what’s been considered before. Some of them will be on the book’s website in the notes section. In this case, I don’t think it’s a problem to reveal that the C# team have considered this before as an infoof operator (commonly pronounced “in-foof” of course).

I could go for this idea – it would certainly make reflection simpler in a number of cases.
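In the meantime, the closest approximation I know of is to fish the member out of an expression tree – very much a workaround of my own sketching, not anything the C# team have proposed:

using System;
using System.Linq.Expressions;
using System.Reflection;

public static class InfoOf
{
    // Extracts the MemberInfo from a simple member-access lambda.
    public static MemberInfo Member<T, TResult>(Expression<Func<T, TResult>> expression)
    {
        MemberExpression member = expression.Body as MemberExpression;
        if (member == null)
        {
            throw new ArgumentException("Expected a member access expression");
        }
        return member.Member;
    }
}

So InfoOf.Member&lt;string, int&gt;(s =&gt; s.Length) hands back the MemberInfo (a PropertyInfo in this case) for string.Length – with compile-time checking of the member name, but at the cost of building an expression tree at execution time. An infoof operator would do the same job with no runtime cost.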

Method Interception and IDynamicObject

I’ve lumped these two together as they’re similar (in my view) – they’re leading down the road to a dynamic language. I can appreciate the benefits of dynamic languages, but that doesn’t mean I think every language ought to be dynamic. I’d pass on these two.

Static interfaces

I’m not entirely sure what Ayende means on this front, but I know I’ve seen a number of requests for the ability to declare that a type definitely has a given static method. Indeed, I’ve wanted it myself a few times. However, I’m not sure how I’d go about using it. Interfaces by their current nature are used when we’ve got an instance. We already know how to pass references etc around – but not types, other than as either type parameters or Type objects.

Now, having just written it I wonder whether that’s what Ayende means – if a type parameter is constrained to implement a particular interface, any static methods within that interface could be called using the type parameter. I can see the use in a few situations, but I’d need to be convinced that it was common enough to warrant a language change. The bar wouldn’t be too high for me on this one though, as I think we could use very natural syntax without having to make up anything significantly new.

Aspect-Oriented Programming

Ooh, tricky one. I’m definitely undecided on this. I can see benefits, but also drawbacks in terms of how obvious the flow of the code is, etc – all the normal objections.

I think I’d welcome additions to the framework and/or runtime to make AOP support simpler, but then leave it to IoC containers etc to actually implement, rather than embedding AOP directly in the language.

Design by Contract

There are parts of DbC that I’d really like to see in the language, or possibly as a language/framework mixture where the framework describes certain common attributes (invariants, non-null arguments etc) and then each compiler takes note of the same attributes. I would really, really like to get rid of having manually-written trivial argument checking in my code. I don’t think I’d immediately want to go as far as Spec# though, in terms of trying to deduce correctness. I wouldn’t like to say why, beyond unfamiliarity (which I know isn’t a good reason). Again, I could possibly be persuaded.

IDisposable implementation support

Good idea. It’s a pain to implement IDisposable properly – some help would be welcome. It would probably need to be flexible enough to allow the developer to say whether a finalizer was required or not, and possibly some other things – but in principle, I’m in favour.
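For reference, here’s the kind of boilerplate I’d love the language to write for me – the conventional pattern, with the virtual Dispose(bool) method and an optional finalizer:

using System;

public class ResourceHolder : IDisposable
{
    private bool disposed;

    public void Dispose()
    {
        Dispose(true);
        // No need for the finalizer to run once we've been disposed explicitly.
        GC.SuppressFinalize(this);
    }

    protected virtual void Dispose(bool disposing)
    {
        if (!disposed)
        {
            if (disposing)
            {
                // Release managed resources here.
            }
            // Release unmanaged resources here.
            disposed = true;
        }
    }

    // Only actually needed when the type holds unmanaged resources directly.
    ~ResourceHolder()
    {
        Dispose(false);
    }
}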

Constructor inheritance

Aargh, no. Constructors effectively belong to the type rather than instances of the type, so they’re not inherited in the same way. They’re a bit like static members – and I know we can call static members as if they were inherited as normal (e.g. UnicodeEncoding.ASCII), but it’s generally a bad idea to do so in my view.

Also consider the lack of control. System.Object has a parameterless constructor – so should all types do so as well, given that they all inherit (directly or indirectly) from System.Object? What would new FileStream() really mean? I suppose one possibility would be to mark your type as intentionally inheriting constructors – which is all very well until the base class adds a new constructor you don’t want, and you don’t realise it until it’s too late. On this one the complexities and disadvantages outweigh the advantages for me.

“Const correctness”

I haven’t actually seen anyone asking for this specifically for C# 4, but it’s been a general feature request pretty much forever. Again, I can see the benefits but:

  1. I suspect it’s the kind of thing you really need to get right in V1.0 for it to be genuinely useful.
  2. I still haven’t seen an easy way to express “this is an immutable reference to a mutable list of immutable objects of a particular type”. Basically you need to express “constness” for every level down the composition hierarchy, which isn’t simple.

 

Conclusion

Just to wrap up, here are the features above in a “yes, maybe, no” categorization (just my own view, of course):

  • Yes: Mixins, instant data types, IDisposable implementation, design by contract (partial), infoof
  • Maybe: Automatic delegation, metaprogramming, static interfaces
  • No: Symbols, hashes, everything virtual by default, macros, constructor inheritance, AOP, method interception and IDynamicObject

Next time (which may be tonight if I’m feeling energetic) I’ll look at what Microsoft has hinted at.

A simple extension method, but a beautiful one

This came up a little while ago in a newsgroup question, and Marc Gravell and I worked out a solution between us. I’ve finally included it in MiscUtil (although not released it yet – there’s a lot of stuff ready to go when we’ve finalised namespaces and updated the website etc) but I thought I’d share it here.

How often have you written code to do something like counting word frequencies, or grouping items into lists? I know a lot of this can be solved with LINQ if you’re using .NET 3.5, but in .NET 2.0 we’ve always been nearly there. Dictionaries have provided a lot of the necessary facilities, but there’s always the bit of code which needs to check whether or not we’ve already seen the key, and populate the dictionary with a suitable initial value if not – a count of 0, or an empty list for example.

There’s something that 0 and “empty list” have in common: they’re both the results of calling new TValue() for their respective TValue types of int and List&lt;Whatever&gt;. Can you see what’s coming? A generic extension method for dictionaries whose values are of a type with a parameterless constructor, which returns the value associated with a key if there is one, or a new value (which is also inserted into the dictionary) otherwise. It’s really simple, but it’ll avoid duplication all over the place:

Note: This code has been updated due to comments below. Comments saying “Use TryGetValue” referred to the old version!

 

public static TValue GetOrCreate<TKey, TValue>(this IDictionary<TKey, TValue> dictionary,
                                               TKey key)
    where TValue : new()
{
    TValue ret;
    if (!dictionary.TryGetValue(key, out ret))
    {
        ret = new TValue();
        dictionary[key] = ret;
    }
    return ret;
}

The usage of it might look something like this:

 

var dict = new Dictionary<string,int>();

foreach (string word in someText)
{
    dict[word] = dict.GetOrCreate(word)+1;
}

I’m not going to claim this will set the world on fire, but I know I’m fed up with writing the kind of code which is in GetOrCreate, and maybe you are too.

Additional overloads are available to specify either a value to use when the key is missing, or a delegate to invoke to create a value.
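Those overloads might look something like this – a sketch of the idea rather than the exact MiscUtil signatures:

public static TValue GetOrCreate<TKey, TValue>(this IDictionary<TKey, TValue> dictionary,
                                               TKey key, TValue missingValue)
{
    TValue ret;
    if (!dictionary.TryGetValue(key, out ret))
    {
        ret = missingValue;
        dictionary[key] = ret;
    }
    return ret;
}

public static TValue GetOrCreate<TKey, TValue>(this IDictionary<TKey, TValue> dictionary,
                                               TKey key, Func<TValue> valueProvider)
{
    TValue ret;
    if (!dictionary.TryGetValue(key, out ret))
    {
        ret = valueProvider();
        dictionary[key] = ret;
    }
    return ret;
}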

C# 4, part 1: Looking back at the past

Everyone else is speculating about what’s going to be in C# 4 (and various possibilities are coming out of MS), so I thought it would be wise to start my own series of wishlist posts before I miss the boat completely.

In this first post, I’m not going to look at the future at all – I’m going to look at mistakes of the past. When I say “mistake” I of course mean “things I would have done differently had I been a language designer with 20/20 hindsight”. Of course, there’s a lot of room for argument :)

Mistakes in C# 1

  • Lack of separate getter/setter access for properties. This came in C# 2, but it should have been obvious that it was highly desirable long before C# 1 came out.
  • Lack of generics – ish. Don’t worry, I’m not going to claim that all the features of C# 2 and 3 should have been in C# 1, but if generics had been in at the start we could have avoided having the non-generic collections (and interfaces) completely. Mind you, I’m glad that the .NET team took their time instead of including the bodged (IMO) generics of Java 5.
  • Classes not being sealed by default. I’ve believed for a long time that allowing inheritance incurs a design cost (and it’s not like I’m unique in that respect). C# fixed a mistake of Java by making methods non-virtual by default; the same should be true for classes in my view.
  • Enums just being named numbers. Again, I’ve blogged about this before, but it’s worth mentioning again. It’s possible to work around the lack of this feature (as the blog post readers pointed out!) but framework and language support would have been very welcome.
  • The “\x” escape sequence. Fortunately it’s rarely used, but it’s so error-prone. Quick, how different are "\x8Good compiler" and "\x8Bad compiler"? What’s the first character in each string? (This will appear soon on my brainteasers page.)
  • The switch statement. There are lots of ways in which this could have been better designed. VB addresses some of them (such as making it easier to express multiple matching values) but there are other ways in which this construct needed overhauling. Fallthrough is (rightly) prohibited, so why not just force braces round the code in the case block instead of requiring a break statement? Aside from anything else, that would fix the somewhat bizarre scoping rules.
  • Wacky overload resolution. I entirely understand the point that introducing new methods in a base type shouldn’t change the behaviour in derived types – but if you’ve explicitly chosen to override that method, that should be more easily callable than it is. (See the first example of the brainteasers page to see what I’m talking about.)
  • The “lock” keyword, and associated issues. Basically, the IDisposable pattern should have been used for locking, and not every object should have a monitor associated with it. Developers should keep a close eye on what’s being locked, and being able to lock on everything takes away from this. Likewise “lock” creates a keyword for little purpose (and one which would otherwise be useful as a variable name etc). A sketch of the IDisposable-based alternative follows this list.
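Here’s a minimal sketch of what I mean (MiscUtil’s SyncLock works along broadly similar lines, though the details differ):

using System;
using System.Threading;

public sealed class SyncLock
{
    private readonly object monitor = new object();

    // Acquires the monitor and returns a token which releases it on Dispose.
    public IDisposable Lock()
    {
        Monitor.Enter(monitor);
        return new LockToken(monitor);
    }

    private sealed class LockToken : IDisposable
    {
        private readonly object monitor;

        internal LockToken(object monitor)
        {
            this.monitor = monitor;
        }

        public void Dispose()
        {
            Monitor.Exit(monitor);
        }
    }
}

Client code then reads using (padlock.Lock()) { … } – the scope of the lock is explicit, and only objects deliberately created as locks can be locked on.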

Mistakes in C# 2

  • Lack of partial methods. I’m really only saying this because it broke the format of C# in Depth slightly. I’ve introduced partial methods along with partial types because they logically fit in with them, and they don’t fit in with any of the other features of C# 3 particularly. This is just a matter of not working out all of how partial types would be used – or at least not doing so early enough. (For all I know partial methods were on the table before C# 2 shipped – I wouldn’t be surprised.)
  • Possibly the lack of generic variance. This is certainly a big issue of understanding which is often raised as a question. On the other hand, I suspect that if/when it becomes available, it will raise just as many questions in terms of the detail anyway…
  • The System.Nullable class. It’s really only there as an historical accident, and I know it’s not a C# issue as such – but even so… (Note for extra clarity – I’m fine with nullable types and the System.Nullable<T> struct. It’s just the supporting class that I don’t like.)
  • InternalsVisibleTo requiring the whole public key for strongly signed assemblies instead of the public key token (contrary to the documentation). Ick.

Mistakes in C# 3

  • It’s a real shame that readonly automatic properties aren’t in C# 3. I suspect they’ll come in C# 4 (and they’ll be on my wishlist in future posts) but I think it’s reasonable to wonder why they weren’t included in C# 3. Immutability is a known pattern of goodness, and although C# 4 may well contain any number of more significant improvements towards making it easier, readonly automatic properties would have been a good start.
  • The way that extension methods are found. This issue was raised time and time again before release, and I’ve never heard a good defence of finding them by whole namespace, instead of allowing developers to say “use the extension methods found in this class, please”. As it is, anyone writing their own extension methods is likely to end up with whole namespaces devoted to a single type. It’s very odd.

Of course that’s not to say there aren’t other things I’d like to see – but these are more “features which were slightly misdesigned” rather than “features which I really want”.

I’m not trying to take anything away from the language designers – C# is still easily my favourite language in terms of its design, particularly in C# 3, but nobody’s perfect :)

Next time I’ll start giving my opinions of features that other people are calling for.

Immutability and inheritance

In my book, I present an example of a Range&lt;T&gt; class to demonstrate iterator blocks. The range allows you to iterate over each element within it in the obvious fashion. There’s an abstract base class, and then a couple of concrete classes derived from that – enough to show the pattern. The base class is abstract because there’s a single abstract method, GetNextValue, which is required to take a current value and return the next one in the sequence. How this occurs depends on the types involved – in the case of a range of DateTime elements, it will add a TimeSpan each time, for instance, whereas an Int32Range will just add an int. Here’s a class diagram of how it looks:

The requirements for the code in the book were very simplistic, in order to be able to present all the code on the printed page. However, I wanted to expand this in MiscUtil and “do the job properly”. In particular, I wanted to be able to:

  • Reverse a range
  • Make a range exclusive (at the “far end” – a half-open interval)
  • Make an exclusive range inclusive
  • Do all of this while keeping the immutable nature of the code from the book

When trying to implement this, I discovered it was actually quite tricky. In particular, when using inheritance I ran into some obstacles:

  • Unless we use the original range as a proxy, creating a new range based on the original is tricky. We basically need to clone, and that’s fraught in various ways. MemberwiseClone will work in many situations, but it’s inelegant – and we can’t keep the fields marked readonly and still modify the cloned copy.
  • Reversing a range using just the original type constraint of T : IComparable<T> is a bit of a pain. You need to keep remembering which way to compare things. This is a bit of an aside, but using an IComparer<T> instead is a lot simpler – it’s really easy to build a new IComparer<T> which proxies to the original one and reverses the order of the parameters.
  • There’s no guarantee that just because the base class has no mutable data, the derived class will do likewise.

In addition, I realised I was using inheritance in a way that went against what I’d written near the end of the book: when using inheritance in a very limited way, consider using delegates instead. A Range<T> only needs extra behaviour to be specified in terms of comparisons (IComparer<T>) and how to take a step from one value to the next. The latter can easily be represented as a Func<T,T> in .NET 3.5.

My new design has a single sealed, immutable class:
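In rough sketch form (my reconstruction – names and details guessed rather than lifted from MiscUtil) the shape is something like this:

using System;
using System.Collections;
using System.Collections.Generic;

public sealed class Range<T> : IEnumerable<T>
{
    private readonly T start;
    private readonly T end;
    private readonly IComparer<T> comparer;
    private readonly Func<T, T> step; // may be null, making iteration unsupported
    private readonly bool includesEnd;

    public Range(T start, T end, IComparer<T> comparer, Func<T, T> step, bool includesEnd)
    {
        this.start = start;
        this.end = end;
        this.comparer = comparer;
        this.step = step;
        this.includesEnd = includesEnd;
    }

    // Each "modification" builds a new instance, preserving immutability.
    public Range<T> Exclusive()
    {
        return new Range<T>(start, end, comparer, step, false);
    }

    public Range<T> Step(Func<T, T> stepFunction)
    {
        return new Range<T>(start, end, comparer, stepFunction, includesEnd);
    }

    public IEnumerator<T> GetEnumerator()
    {
        // Ideally this check would happen eagerly rather than on first MoveNext.
        if (step == null)
        {
            throw new NotSupportedException("No step function specified");
        }
        T value = start;
        while (comparer.Compare(value, end) < 0)
        {
            yield return value;
            value = step(value);
        }
        if (includesEnd && comparer.Compare(value, end) == 0)
        {
            yield return value;
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}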

There are still a few ways in which this isn’t ideal:

  • You can specify a null step function, in which case you can’t iterate over the range. I’d prefer the type to not implement IEnumerable<T> if it can’t do the job properly.
  • You have to specify a reverse step function if you want to iterate over the reverse of the range.
  • There are a heck of a lot of constructor overloads.

Now, none of these are horrendous, and I think it’s a lot nicer than it was before. I’ve currently got an additional non-generic Range class with a bunch of overloaded methods for creating ranges of various types. I can’t think of a decent name for these methods at the moment, so currently you’d write:

  • Range.Of(1, 5) // 1, 2, 3, 4, 5
  • Range.Of(1, 5).Exclusive() // 1, 2, 3, 4
  • Range.Of(1, 5, 2) // 1, 3, 5
  • Range.Of(DateTime.Today, DateTime.Now, TimeSpan.FromMinutes(1)) // Midnight, 1 minute past etc

I think it might be nicer to use extension methods for these, to allow:

  • 1.To(5)
  • 1.To(5).Exclusive()
  • 1.To(5).Step(2)
  • DateTime.Today.To(DateTime.Now).Step(TimeSpan.FromMinutes(1))
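Building on the Range&lt;T&gt; sketch above, the extension methods could be as simple as this (again just a sketch, showing a couple of representative types):

using System;
using System.Collections.Generic;

public static class RangeExtensions
{
    // 1.To(5) – an inclusive range over ints, stepping by 1 unless told otherwise.
    public static Range<int> To(this int start, int end)
    {
        return new Range<int>(start, end, Comparer<int>.Default, x => x + 1, true);
    }

    // DateTime.Today.To(DateTime.Now) – no sensible default step here,
    // so Step must be called before iterating.
    public static Range<DateTime> To(this DateTime start, DateTime end)
    {
        return new Range<DateTime>(start, end, Comparer<DateTime>.Default, null, true);
    }

    // 1.To(5).Step(2) – replaces the step function with a fixed increment.
    public static Range<int> Step(this Range<int> range, int amount)
    {
        return range.Step(x => x + amount);
    }

    public static Range<DateTime> Step(this Range<DateTime> range, TimeSpan amount)
    {
        return range.Step(d => d + amount);
    }
}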

In order to do this nicely I may need to expose the comparer in Range<T> as well, but I don’t think that’s really a problem. Thoughts on this are welcome.

Anyway, the broad point of this post (other than to hopefully relieve my own insomnia – and possibly yours too) is that immutability and inheritance don’t mix terribly nicely, especially when you want to effectively clone an instance and modify some aspects. That’s not terribly surprising, but it is interesting – and it fits in with the experience that inheritance doesn’t mix terribly nicely with equality comparisons, either.

Bridging gaps, and finding my role

Warning: this post won’t teach you anything technical. It’s about how I see myself. That may be of interest to you, or it may not. If not, feel free to skip it knowing you’re not missing anything else.

One of the great problems of the world today is undoubtedly this problem of not being able to talk to scientists, because we don’t understand science. They can’t talk to us because they don’t understand anything else, poor dears. (Michael Flanders)

For a while, I’ve made myself slightly miserable (only slightly – I’m generally a very happy guy) by seeing just how impoverished my own understanding of computing is compared with my “heroes” in the industry: Eric Lippert, Joe Duffy, and Wes Dyer to name but three. I always learn a lot from their blogs, but often I don’t manage to take it all in. I understand enough about Wes’s post about monads to realise that I’ve probably implemented a monad with my (incredibly primitive) Future classes in “push LINQ” – but my grasp of them is tenuous at best. I understand enough about threading to be able to reason about concurrency in my day-to-day life, but I’m never going to have Joe’s depth of knowledge of either the Windows-specific stuff or the underlying principles and theories. I can hold an interesting (to me) conversation with Eric over email, but I suspect that if we were talking in real life I’d have to constantly ask him to slow down.

This used to bother me. I used to almost feel that it was unfair that others were so much smarter than me. Yes, I know how arrogant that sounds even in terms of ambition, and I’m not going to flinch away from the term. However, I’m a bit more comfortable with my place in life now. You see, just because they’re so much smarter than me doesn’t mean I’m useless. I want to be a bridge.

People have occasionally used the word “expert” about me, entirely inappropriately. I’m not an expert in threading, or floating point, or text encoding. I know a bit more about those topics than most developers, and sometimes that’s all that’s required to be labeled as an expert these days. After the last ten months, I could probably agree with the label when applied to C# as a language, although certainly not .NET as a framework, or even the act of developing in C#. I happen to have read the spec more closely than most people, and retained that information reasonably well, that’s all.

The trouble is, real experts can be hard to understand sometimes for the “average developer” (again, I know how that sounds; I’m not putting myself above this “average developer” in all or even many respects, just on the topics I happen to write about). Don’t get me wrong: it’s not because they lack communication skills (although that’s sometimes the case). It’s that a lot of what they talk about is at a level of depth which requires a lot of background knowledge for one to properly understand it. This is where I believe I can play a useful role – I like explaining the most useful bits of what I’ve understood from the experts, but in a way which hopefully any interested developer can understand.

C# in Depth is a good manifestation of that. If you want the unadulterated “truth” about C# (at least in theory) you look at the spec. But that’s frankly a pain to read, and there’s very little to distinguish the bits which are really important from the corner cases which you’re unlikely to ever encounter. I hope my book provides more depth than the level of knowledge most C# developers already have (and more depth than most other books provide), but without going so deep as to be impenetrably difficult to understand.

Having identified this role, I’m really quite happy to try to fulfil it. I just hope I can keep doing so. It’s a lot of fun.

Extension methods on lambda expressions don’t work, unfortunately

Over the Christmas holidays, I thought I’d experiment with something I’d been thinking about a little – sorting a generic IList<T>. Now, before anyone gets huffy, I’m well aware of OrderBy in LINQ to Objects. However, sometimes you want to sort collections in-place, and as IList<T> provides random access, there’s no reason we shouldn’t be able to. Now, I do like the way that OrderBy allows multiple criteria to be specified, whether they should be applied in an ascending or descending fashion, and by way of just “compare by this projection” rather than having to actually implement the comparison yourself. I thought I could probably make use of those ideas again.

Unlike in LINQ to Objects, the sort would occur immediately, which means I couldn’t use quite the chained syntax of OrderBy(...).ThenBy(...).ThenByDescending(...) but my plan was to allow code like this:

List<Person> myList = …;

myList.SortBy((person => person.Name).Ascending(),
              (person => person.DateOfBirth).Descending(),
              (person => person.SocialSecurityNumber).Ascending());

Because each projection might have a different output type, that would only work for as many overloads as I’d implement. An alternative would be:

Comparer<Person> sorter = Comparer.By(person => person.Name)
                          .ThenByDescending(person => person.DateOfBirth)
                          .ThenBy(person => person.SocialSecurityNumber);

sorter.Sort(myList);

I did like the idea of the Ascending and Descending extension methods though, operating on Func<T1,T2>. Unfortunately, the dot operator doesn’t work on lambda expressions, even though the expression itself is implicitly convertible to Func<T1,T2>.
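To make the problem concrete, here’s a sketch – Ascending is my hypothetical extension method, not anything in the framework:

using System;

public static class FuncExtensions
{
    // Builds an ascending comparison from a key-selecting projection.
    public static Comparison<T> Ascending<T, TKey>(this Func<T, TKey> keySelector)
        where TKey : IComparable<TKey>
    {
        return (x, y) => keySelector(x).CompareTo(keySelector(y));
    }
}

class Demo
{
    static void Main()
    {
        // This doesn't compile: a lambda expression has no type of its own,
        // so the compiler has nothing to look Ascending up against.
        // var broken = (s => s.Length).Ascending();

        // Casting to a delegate type first works, but rather spoils the elegance.
        var working = ((Func<string, int>)(s => s.Length)).Ascending();
    }
}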

My plan isn’t entirely scuppered – the latter syntax will still work, and there are probably some alternatives I can work out. I think there are nice possibilities around extension methods and delegates though, and it’s a shame they’re not useful for lambda expressions. Ah well.

LINQ to Objects – not just for in-memory collections

I’ve just seen LINQ to Objects described as the LINQ provider for “in-memory collections” again. It’s a fairly frequent occurrence, and I may have done it myself on occasion. It doesn’t do LINQ to Objects justice. An example I’ve used in a few places is a query which runs over log files. Something along the lines of:

var query = from file in Directory.GetFiles(@"c:\logs", "*.log")
            from line in new LineReader(file)
            let entry = new LogEntry(line)
            where entry.Severity == Severity.Critical
            select entry;

Where’s the in-memory collection here? I suppose there’s the array of log file names, but that’s about it. LINQ to Objects isn’t restricted to datasets which fit comfortably in memory. The above query could process many gigs of data very easily, limited basically by disk speed (and date/time parsing speed in my experience, but that’s a topic for another post).
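LineReader is a MiscUtil class rather than a framework one; a minimal version might look like this (a sketch, not the actual MiscUtil implementation):

using System.Collections;
using System.Collections.Generic;
using System.IO;

public sealed class LineReader : IEnumerable<string>
{
    private readonly string filename;

    public LineReader(string filename)
    {
        this.filename = filename;
    }

    // Streams the file lazily: only one line needs to be in memory at a time,
    // and the reader is disposed however the iteration terminates.
    public IEnumerator<string> GetEnumerator()
    {
        using (StreamReader reader = File.OpenText(filename))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}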

What LINQ to Objects does is in-process querying of enumerable sequences of data. More of a mouthful than “querying in-memory collections” but more accurate, IMO.

Rant over. Well, for a couple of minutes.

“Push” LINQ revisited – next attempt at an explanation

Marc Gravell and I have now implemented a lot of LINQ standard query operators on the “push” model of IDataProducer as opposed to the “pull” model of IEnumerable. My good friend Douglas Leeder (who doesn’t use C#) has been with me this weekend, and through explaining the “big picture” to him in various ways, and taking his feedback, I think I’ve now got a good way of communicating it. Voting.

It’s a “real life analogy” which is always dangerous – don’t think of it too literally – I’m not claiming that it’s meant to be an absolute 1:1 correspondence. However, I think it neatly demonstrates the problem and some of the benefits of the solution we’ve come up with.

In order to make all this concrete, all of the code is real, and can be downloaded as a zip of a VS2008 solution. It contains a binary of an unreleased version of MiscUtil which is where the DataProducer stuff currently lives.

Real life situation

Let’s suppose we’re trying to find out what the favourite colour is of everyone in the world. Now, for the purposes of the demo code, there are only four colours and six people in the world: that makes the diagrams nice and easy, and we can see results simply too. Extending the data to the rest of the real world is left as an exercise to the reader. We may also want additional information, such as the average ages of people voting for particular colours.

Here’s our complete sample data – the five members of my family, and Douglas:

| Name    | Age | Favourite colour |
|---------|-----|------------------|
| Jon     | 31  | Blue             |
| Douglas | 28  | red              |
| Holly   | 31  | Purple           |
| Tom     | 4   | Pink             |
| Robin   | 1   | RED              |
| William | 1   | blue             |

Note how the colours are specified with variations of case. We’ll use that later as an example of why you might need to specify a “key comparer”.

There are various ways of implementing this in LINQ, and for each model we’ll provide code and think about how it would work in the real world.

Model 1: “Pull” model

This is the model which “normal” LINQ uses – you only ever pull data, using an IEnumerable<T>. Here’s a simple query expression which gives the answers we want (admittedly unordered – we’ll ignore that for now):

 

var query = from voter in Voter.AllVoters()
            group voter by voter.FavouriteColour.ToUpper() into grouped
            select new { Colour = grouped.Key, Votes = grouped.Count() };

foreach (var entry in query)
{
    Console.WriteLine("Colour {0} has {1} votes", entry.Colour, entry.Votes);
}

There are two problems here.

Firstly, we’re using ToUpper() to get round the “RED != red” problem. This is not only bad in terms of internationalisation, but it also loses data. We really want to get the original string as the key, and then use a case-insensitive comparer. We can do this by a manual call to GroupBy instead of using query expressions – there’s an overload which takes an IEqualityComparer.
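Something like this – the overload exists today, and each group’s key keeps whichever casing it happened to see first:

var query = Voter.AllVoters()
                 .GroupBy(voter => voter.FavouriteColour,
                          StringComparer.CurrentCultureIgnoreCase)
                 .Select(grouped => new { Colour = grouped.Key, Votes = grouped.Count() });

With our sample data the keys come out as “Blue”, “red”, “Purple” and “Pink” – the original strings, compared case-insensitively.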

Secondly, the result of the “group … by” keeps all the voter data temporarily. It has to all be available at the same time before the “select” kicks in. This runs counter to the normal “streaming” idea of LINQ. This is inherent in the nature of the “pull” model, as I’ll explain in a minute.

Now, let’s see what this looks like in the real world. People come into a room through a door, and a “grouper” asks them for their favourite colour. The grouper then tells each voter (immediately) which corner of the room to stand in. The result at the end of the grouping is this:

After the initial grouping, another person goes to each group in turn, finding out their key and doing a head count. That group is then free to go. The important thing is that this person can’t do their job until all the data is in, because they’ve got to be able to see everyone in order to count them.

Improvement to pull model: just keep a token presence

The fact that we used “group voter by …” meant that the result of the grouping still involved whole people. As we’re only going to do a head count, we only need something saying “There was a person here.” We can change our original query to do that quite easily:

 

var query = from voter in Voter.AllVoters()
            group 1 by voter.FavouriteColour.ToUpper() into grouped
            select new { Colour = grouped.Key, Votes = grouped.Count() };

foreach (var entry in query)
{
    Console.WriteLine("Colour {0} has {1} votes", entry.Colour, entry.Votes);
}

 

This time, after the grouping takes place, the room looks like this:

The use of 1 here is purely incidental: we could have used ‘group “Spartacus” by …’ and the results would have been the same. It’s just something which can be counted.

Now, there’s good and bad here:

  • We’re not taking as much memory here. If voters have large amounts of data attached to them, we’ve reduced our requirements significantly.
  • We still have one object per voter, all in memory at the same time. Think “population of the world”.
  • We’ve lost our age data, which would make any extra aggregation impossible.

Model 2: “Push” model

The problem with the pull model is that each aggregator always wants to be the only thing pulling. The call to MoveNext will block until more data is available. That’s a real problem when you want to have multiple aggregators (one vote counter per colour). We could do a complicated threading manoeuvre, with each colour getting its own thread and the “grouper” pushing items out to relevant threads. Again though, that doesn’t scale – the four extra threads in our example aren’t too bad, but imagine other groupings with potentially thousands of keys.

The alternative is to change the model. Instead of having a greedy aggregator pulling data, we change to aggregators who observe data being pushed past them, and also observe a special “all the data has now been pushed” signal. Before we look at the code to do this, let’s think about what it could be like in real life. We don’t know how many different colours will be voted on, but we know what we need to do with each one: count the number of votes for them. In detail, the situation would be something like this:

  • The grouper stands just inside the door of the room, and “pulls” voters in the normal way
  • For any voter:
    • Ask the voter which colour they wish to vote for
    • Check to see if that colour is a “new” one. If it is, create a “counter” person for that colour, and position them by an exit in the room. (We create new exits as we go. We’ll assume there’s a sledgehammer at the ready.)
    • Send the voter past the relevant “counter” person, through the exit near them
    • Each counter just counts how many voters they see going past them
  • When all voters have been pulled, tell each of the counters and ask them how many people they saw

We never have more than one voter in the room at once:

Let’s have a look at the code involved now.

Using the “push” model

There are two sides to the code here: the code that the LINQ user has to write, and the code Marc Gravell and I have implemented. We’ll look at the client code in a few different scenarios.
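Before diving in, it’s worth seeing the heart of the implementation side – simplified sketches of the two key interfaces (the real MiscUtil declarations differ in detail):

using System;

// The producer pushes each item to its observers as it arrives, then signals
// that the data has finished. Aggregators subscribe to these two events.
// The concrete DataProducer<T> implements this and adds the pushing methods,
// such as the ProduceAndEnd call used below.
public interface IDataProducer<T>
{
    event Action<T> DataProduced;
    event Action EndOfData;
}

// A future is a value which only becomes available once all the data
// has been pushed through; reading Value too early is an error.
public interface IFuture<T>
{
    T Value { get; }
}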

1) GroupWithPipeline in the middle of normal LINQ

Keeping to the normal “start with a data source, do something, then select” model involves stepping away from query expressions. We’ve got a new extension method on IEnumerable<T> called GroupWithPipeline, which takes a key selector (just like the normal GroupBy) and what to do with the results of each grouping. Here’s the new code (which requires a using directive for MiscUtil.Linq.Extensions):

 

var query = Voter.AllVoters()
                 .GroupWithPipeline(voter => voter.FavouriteColour.ToUpper(),
                                    voters => voters.Count())
                 .Select(grouped => new { Colour = grouped.Key, Votes = grouped.Value });

foreach (var entry in query)
{
    Console.WriteLine("Colour {0} has {1} votes", entry.Colour, entry.Votes);
}

How about making this a bit smarter now? Let’s try to also work out the minimum and maximum ages of the voters for each colour. Conceptually this is just a case of adding extra observers along with each vote counter in the “real life” model above. The code is remarkably simple:

 

var query = Voter.AllVoters()
                 .GroupWithPipeline(voter => voter.FavouriteColour.ToUpper(),
                                    voters => voters.Count(),
                                    voters => voters.Min(voter => voter.Age),
                                    voters => voters.Max(voter => voter.Age))
                 .Select(grouped => new { Colour = grouped.Key,
                                          Votes = grouped.Value1,
                                          MinAge = grouped.Value2,
                                          MaxAge = grouped.Value3});

foreach (var entry in query)
{
    Console.WriteLine("Colour {0} has {1} votes. Age range: {2}-{3}", entry.Colour, entry.Votes, entry.MinAge, entry.MaxAge);
}

The fact that it uses “Value1”, “Value2” and “Value3” isn’t ideal, but unfortunately there’s no way round that as far as we’ve worked out – for this part.

2) Using DataProducer directly for multiple aggregates

GroupWithPipeline uses a few types internally which you can use directly instead: DataProducer&lt;T&gt; (implementing IDataProducer&lt;T&gt;) and Future&lt;T&gt; (implementing IFuture&lt;T&gt;). If I go into the details here, I’ll never get this posted – but that may come in another post if there’s enough interest. However, let’s have a look at how it can be used. First, let’s find the results of a few aggregates of our voters, this time without any groupings:

 

// Create the data source to watch
DataProducer<Voter> voters = new DataProducer<Voter>();

// Add the aggregators
IFuture<int> total = voters.Count();
IFuture<int> adults = voters.Count(voter => voter.Age >= 18);
IFuture<int> children = voters.Where(voter => voter.Age < 18).Count();
IFuture<int> youngest = voters.Min(voter => voter.Age);
IFuture<int> oldest = voters.Select(voter => voter.Age).Max();

// Push all the data through
voters.ProduceAndEnd(Voter.AllVoters());

// Write out the results
Console.WriteLine("Total voters: {0}", total.Value);
Console.WriteLine("Adult voters: {0}", adults.Value);
Console.WriteLine("Child voters: {0}", children.Value);
Console.WriteLine("Youngest voter age: {0}", youngest.Value);
Console.WriteLine("Oldest voter age: {0}", oldest.Value);

The output of the code is what you’d expect, but there are a few things to note:

  1. Each aggregate returns an IFuture<int> instead of an int. This is because we set up the aggregators before we produce any data. We need to use the Value property to get the actual value back after we’ve produced the data.
  2. Just to hammer the point home, we must set up the aggregators (calling Count etc) before we produce the data (in ProduceAndEnd). Otherwise the aggregators won’t have any data to work with.
  3. We can chain operators together (Select and then Max, or Where and then Count) just as with normal LINQ.
  4. We’re applying multiple aggregates, but the data is only being produced once. This just can’t be done with normal LINQ. ProduceAndEnd takes an IEnumerable<T> which could be another LINQ query – something fetching large amounts of data from files, etc. Everything will be streamed appropriately.

3) Using DataProducer in query expressions

This part wouldn’t have been available when I started writing this post – I hadn’t quite realised the power of the pattern yet. By implementing GroupBy on IDataProducer, we can perform the original grouping in a query expression, in a pretty normal kind of way… except that this time we can apply multiple aggregates, never buffering the data beyond the results of the aggregation:

 

DataProducer<Voter> voters = new DataProducer<Voter>();

var query = from voter in voters
            group voter by voter.FavouriteColour.ToUpper() into grouped
            select new { Colour = grouped.Key, 
                         Votes = grouped.Count(),
                         MinAge = grouped.Min(voter => voter.Age),
                         MaxAge = grouped.Max(voter => voter.Age)};

var results = query.AsEnumerable();

voters.ProduceAndEnd(Voter.AllVoters());

foreach (var entry in results)
{
    Console.WriteLine("Colour {0} has {1} votes. Age range: {2}-{3}",
                      entry.Colour, entry.Votes.Value,
                      entry.MinAge.Value, entry.MaxAge.Value);
}

There’s just one tricky bit in here – you must call AsEnumerable before the data is produced, otherwise the aggregators will stream all their data with nothing watching for the results. In fact, AsEnumerable builds a list internally – the final results are buffered, but only those results. There’s really not a lot that can be done about that.

So, there we go. That may or may not be a bit clearer now. I’m still learning the power of the pattern, its potential uses, and the best ways of explaining it. Feedback is very welcome, both on the technical front and about the explanation. I’m absolutely convinced that it’s a useful pattern in some situations (though not all). All the code will be released as part of MiscUtil eventually, of course – we’re still tidying it up and producing a bit more documentation at the moment.

Last post about the book (nearly) and a new source of information…

It’s about time I stopped using this blog to mention the book every couple of posts. Fortunately, I now have a new blog of sorts – well, a news page with an RSS feed. It’s part of the book’s web site – completely independent of Manning’s page for the book (which includes a link to their forum for it).

The web site is present in skeletal form – there are placeholders for everything I currently intend to include there, but none of the real content yet. That will be the work of the next couple of months.

The book itself is now almost out of my hands – it’s gone for copy editing and technical review, so I’m in reactive mode instead of proactive. It’s been nice to have time to play with some code for a change :) In particular, I’m quite pleased with the RSS generator. I could have found a third party library, I’m sure – but the mixture of LINQ to SQL and LINQ to XML sorts it out in a mere 34 lines, which is pretty neat.

Anyway, I expect I’ll post again here when I finally get my hands on a proper printed copy of the book, but until then I promise to keep quiet :)

C# in Depth: All chapters available in MEAP!

Rather excitingly, all the chapters of C# in Depth are now available for early access. The following chapters have recently been added:

10: Extension methods

Without extension methods, LINQ just couldn’t work in an elegant form. Extension methods are basically a way of faking instance methods by providing static methods with an implicit “this” parameter. Importantly, they can work on interfaces, which means you can make an interface appear to have many more methods than implementations actually have to provide. Although they’re primarily provided for the sake of LINQ, extension methods can improve code readability in other spheres too – when used cautiously. In this chapter, we look at extension methods in the non-LINQ world, and get our first glance at some of the methods in System.Linq.Enumerable.
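As a trivial illustration (my own example rather than one from the chapter):

using System;

public static class StringExtensions
{
    // The "this" modifier is all it takes: Reverse becomes callable
    // as if it were an instance method on string.
    public static string Reverse(this string text)
    {
        char[] chars = text.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }
}

// Usage: "hello".Reverse() returns "olleh".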

11: Query expressions and LINQ to Objects

If you ask someone who has just read a page or two about C# 3 what the new features are, they’ll almost certainly mention query expressions. These are the “from x in y where z select foo” type of expression which looks almost like SQL but isn’t. The amazing thing about query expressions is how little they impact the rest of the language: they are pure syntactic sugar, being effectively translated into other source code before being compiled. That allows some really neat tricks, and is the basis for how LINQ handles multiple data sources.

In this chapter we look at query expressions and the standard query operators which support them, via LINQ to Objects. This is “in-process” LINQ, often used with in-memory collections but more generally available for anything implementing IEnumerable or IEnumerable<T>.
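As a quick taste of the translation, assuming words is an IEnumerable&lt;string&gt; (my example, not one from the book):

// The query expression...
var query = from word in words
            where word.Length > 3
            select word.ToUpper();

// ...is what the compiler mechanically turns into this method-call form:
var translated = words.Where(word => word.Length > 3)
                      .Select(word => word.ToUpper());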

12: LINQ beyond collections

We’ve all seen LINQ to SQL demos, and gone “ooh” and “ahh” (or possibly “I could do that in Rails too, y’know”) at the appropriate time. In this chapter I emphatically don’t try to teach you LINQ to SQL, but instead take you on a whistle-stop tour of lots of different LINQ providers:

  • LINQ to SQL
  • LINQ to DataSet
  • LINQ to XML
  • LINQ to NHibernate
  • LINQ to ActiveDirectory
  • LINQ to Entities
  • Parallel LINQ

I also give a bit of insight into how “true” LINQ providers like LINQ to SQL (I don’t really count LINQ to XML or LINQ to DataSet – they’re just providing IEnumerable<T>s for LINQ to Objects to work with) work, using IQueryable.

As you can tell from the scope of the chapter, I don’t try to go into many details – just enough to give the flavour, and hopefully show the “big picture”. I believe Microsoft is really trying something very ambitious with a mostly-unified query framework, and with any luck this chapter leaves that as the lasting impression.

13: Elegant code in the new era

I’ve taken a look at a lot of my technical books to see how they end – and really they don’t, properly. The last line of the last chapter could often have been set anywhere else. My final chapter is very short, but tries to give an impression of where I think software development is going, particularly in terms of C#.

Appendix A: LINQ standard query operators

Although some of the standard query operators are covered in chapter 11, there are plenty which aren’t. This appendix is really just a grouped list of the operators with some brief examples of what they do. Handy as a reference guide – one reviewer apparently said to another, “Holy crap! I want this on my wall!”

 

What next?

So, now that everything is available in MEAP, it’s all done, right? Well, not quite. I’m currently indexing and putting together final revisions – where the word “final” is pretty loose. It will then be passed to my technical reviewer (whose name I shouldn’t reveal just yet, but who I’m proud to have on board – even if I’m dreading the errors they’ll find) and the copy editor, who I believe will work effectively in parallel. After that (and final approval) it will go into production, then to press. The due date is still late March or April at the moment, I believe.

My current indexing/revising task is a real slog (which is why I’m taking a break by writing this blog entry) but I think it’s the last big push – it should get easier when I’m done on this bit. Right, back to chapter 5…