Category Archives: Wacky Ideas

Non-iterable collection initializers

September 18, 2010 jonskeet 15 Comments

Yesterday on Stack Overflow, I mentioned that sometimes I make a type implement IEnumerable just so that I can use collection initializers with it. In such a situation, I use explicit interface implementation (despite not really needing to – I’m not implementing IEnumerable<T>) and leave it throwing a NotImplementedException. (EDIT: As noted in the comments, throwing NotSupportedException would probably be more appropriate. In many cases it would actually be pretty easy to implement this in some sort of meaningful fashion… although I quite like throwing an exception to indicate that it’s not really intended to be treated as a sequence.)

Why would I do such a crazy thing? Because sometimes it’s helpful to be able to construct a "collection" of items easily, even if you only want the class itself to really treat it as a collection. As an example, in a benchmarking system you might want to be able to add a load of tests individually, but you never want to ask the "test collection" what tests are in it… you just want to run the tests. The only iteration is done internally.

Now, there’s an alternative to collection initializers here: parameter arrays. You can add a "params string[]" or whatever as the final constructor parameter, and simply use the constructor. That works fine in many cases, but it falls down in others:

If you want to be able to add different types of values, without just using "params object[]". For example, suppose we wanted to restrict our values to int, string, DateTime and Guid… you can’t do that in a compile-time-safe way using a parameter array.
If you want to be able to constructor composite values from two or more parts, without having to explicitly construct that composite value each time. Think about the difference in readability between using a collection initializer for a dictionary and explicitly constructing a KeyValuePair<TKey, TValue> for each entry.
If you want to be able to use generics to force aspects of type safety. The Add method can be generic, so you could, for example, force two parameters for a single entry to both be of T, but different entries could have different types. This is pretty unusual, but I needed it just the other day :)

Now, it’s a bit of a hack to have to "not quite implement" IEnumerable. I’ve come up with two alternative options. These have the additional benefit of not requiring the method to always be called Add any more. I suspect it still would be in most cases, but flexibility is a bonus.

Option 1: Class level attribute

Instead of just relying on IEnumerable, the compiler could detect an attribute applied to the class, specifying the single method name for all collection initializer methods:

[CollectionInitializerMethod("AddValue")]
public class RestrictedValues
{
public void AddValue(int x) { … }

public void AddValue(string y) { … }
}

var values = new RestrictedValues
{
3, "value", 10
};

Option 2: Method level attributes

In this case, each method eligible for use in a collection initializer would be decorated with the attribute:

public class RestrictedValues
{
[CollectionInitializerMethod]
public void AddInt32(int x) { … }

[CollectionInitializerMethod]
public void AddString(string y) { … }
}

var values = new RestrictedValues
{
3, "value", 10
};

This has the disadvantage that the compiler would need to look at every method in the target class when it found a collection initializer.

Obviously both of these can be made backwardly compatible very easily: the presence of an implementation of IEnumerable with no attributes present would just fall back to using Add.

Option 3: Compiler and language simplicity

(I’ve added this in response to Peter’s comment.)

Okay, let’s stick with the Add method. All we need is another way of indicating that you should be able to use collection initializers with a type:

[AllowCollectionInitializers]
public class RestrictedValues
{
public void Add(int x) { … }

public void Add(string y) { … }
}

At this point, the changes required to the compiler (and language spec) are really minimal. In the bit of code which detects whether or not you can use a collection initializer, you just need to change from "does this type implement IEnumerable" to "does this type implement IEnumerable or have the relevant attribute defined". I can’t think of many possible language changes which would be more localized than that.

And another thing…

One final point. I’d still like the ability for collection initializers to return a value, and for that value to be used for subsequent elements of the initialization – with the final return value being the last-returned value. Any methods with a void return value would be treated as if they returned "this". This would allow you to build immutable collections easily.

Likewise you could decorate methods with a [PropertyInitializerMethod] to allow initialization of immutable types with "WithXyz" methods. Admittedly there’s less need for this now that we have optional parameters and named arguments in C# 4 – a lot of the benefits of object initializers are available just with constructors.

Anyway, just a few slightly odd thoughts around initialization for you to ponder over the weekend…

General, Wacky Ideas

Code and data

August 29, 2010 jonskeet 13 Comments

In a recent Stack Overflow question, I answered a question which started off with a broken XPath expression by suggesting that that poster might be better off using LINQ to XML instead. The discussion which followed in the comments (around whether or not this was an appropriate answer) led me to think about the nature of code and data, and how important context is.

I don’t think there’s any particularly deep insight in this post – so I’ll attempt to keep it relatively short. However, you might like to think about how code and data interact in your own experience, and what the effects of this can be.

Code is data

Okay, so let’s start off with the obvious: all code is data, at some level. If it’s compiled code, it’s just binary data which a machine can execute. Put it on another machine with no VM, and there’s nothing remarkable about it. It’s just a load of 1s and 0s. As source code, most languages are just plain text. Open up some source code written in C#, Ruby, Python, Java, C++ etc in Notepad and it’ll be readable. You may miss the syntax highlighting and so forth, but it’s still just text.

Code in the right context is more than just data

So what makes this data different to (say) a CSV file or a plain text story? It’s all in the context. When you load it into the right editor, or pass it to the right compiler, you get more information: in an editor you may see the aforementioned syntax highlighting, autocompletion, documentation for members you’re using; a compiler will either produce errors or a binary file. For something like Python or Ruby, you may want to feed the source into an interpreter instead of a compiler, but the principle is the same: the data takes on more meaning.

Code in the wrong code-related context is just data again

Now let’s think about typical places where you might put code (or something with similar characteristics) into the "wrong" context:

SQL statements
XSLT transformations
XPath expressions
XML or HTML text
Regular expressions

All of these languages have editors which understand them, and will help you avoid problems. All of these are also possible to embed in other code – C#, for example. Indeed, almost all the regular expressions I’ve personally written have ended up in Java or C# code. At that point, there are two problems:

You may want to include text which doesn’t embed easily within the "host" language’s string literals (particularly double quotes, backslashes and newlines)
The code editor doesn’t understand the additional meaning to the text

The first problem is at least somewhat mitigated by C#’s support for verbatim string literals – only double quotes remain as a problem. But the second problem is the really big one. Visual Studio isn’t going to check that your regular expression or XPath expression looks valid. It’s not going to give you syntax highlighting for your SQL statement, much less IntelliSense on the columns present in your database. Admittedly such a thing might be possible, if the IDE looked ahead to find out where the text was going to be used – but I haven’t seen any IDE that advanced yet. (The closest I’ve seen is ReSharper noticing when you’re using a format string with the wrong number of parameters – that’s primitive but still really useful.)

Of course, you could write your SQL (or XPath etc) in a dedicated editor, and then either copy and paste it into your code or embed it into your eventual binary and load it at execution time. Neither of these is particularly appealing. Copy and paste works well once, but then when you’re reading or modifying the code you lose the advantages you had unless you copy and paste it again. Embedding the file can work well in some cases – I use it liberally for test data in unit tests, for example – but I wouldn’t want it all over production code. It means that when reading the code, you have to refer to the external resource to work out what’s going to happen. In some cases that’s not too bad – it’s only like opening another class or method, I guess – but in other cases the shift of gears is too distracting.

When code is data, it’s easy to mix it with other data – badly

Within C# code, it’s easy to see the bits of data which sometimes occur in your code: string or numeric literals, typically. Maybe you subscribe to the "no magic values" philosophy, and only ever have literals (other than 0 or 1, typically) as values for constants. Well, that’s just a level of indirection – which in some ways hides the fact that you’ve still got magic values. If you’re only going to use a piece of data once, including it directly in-place actually adds to readability in my view. Anyway, without wishing to dive into that particular debate too deeply, the point is that the compiler (or whatever) will typically stop you from using that data as code – at least without being explicit about it. It will make sure that if you’re using a value, it really is a value. If you’re trying to use a variable, it had better be a variable. Putting a variable name in quotes means it’s just text, and using a word without the quotes will make the compiler complain unless you happen to have a variable with the right name.

Now compare that with embedding XPath within C#, where you might have:

var node = doc.SelectSingleNode("//foo/bar[@baz=xyz]");

Now it may be obvious to you that "xyz" is meant to be a value here, not the name of an attribute, an element, a function or anything like that… but it’s not obvious to Visual Studio, which won’t give you any warnings. This is only a special case of the previous issue of invalid code, of course, but it does lead onto a related issue… SQL injection attacks.

When you’ve already got your "code" as a simple text value – a string literal containing your SQL statement, as an obvious example – it’s all too easy to start mixing that code/data with genuine data data: a value entered by a user, for example. Hey, let’s just concatenate the two together. Or maybe use a format string, effectively mixing three languages (C#, SQL, the primitive string formatting "language" of string.Format) into a single statement. We all know the results, of course: nothing differentiates between the code/data and the genuine data, so if the user-entered value happens to look like SQL to drop a database table, we end up with Little Bobby Tables.

I’m sure 99% of my blog readers know the way to avoid SQL injection attacks: use parameterized SQL statements. Keep the data and the code separate, basically.

Expressing the same ideas, but back in the "native" language

Going back to the start of all this, the above is why I like LINQ to XML. When I express a query using LINQ to XML, it’s often a lot longer than it would have been in the equivalent XPath – but I can tell where the data goes. I know where I’m using an element name, where I’m using an attribute name, and where I’m comparing or extracting values. If I miss out some quotes, chances are pretty high that the resulting code will be invalid, and it’ll be obvious where the problem is. I’m prepared to sacrifice brevity for the fact that I only work in a single language + library, instead of trying to embed one language within another.

Likewise building XML using LINQ to XML is much better than concatenating strings – I don’t need to worry about any nasty escaping issues, for example. LINQ to XML has been so nicely design, it makes all kinds of things incredibly easy.

Regular expressions can sometimes be replaced by simple string operations. Where they can, I will often do so. I’d rather use a few IndexOf and Substring calls over a regular expression in general – but where the patterns I need get too tricky, I will currently fall back to regular expressions. I’m aware of ReadableRex but I haven’t looked at it in enough detail to say whether it can take the place of "normal" regular expressions in the way that LINQ to XML can so often take the place of XPath.

Of course, LINQ to SQL (and the Entity Framework) do something similar for SQL… although that’s slightly different, and has its own issues around predictability.

In all of these cases, however, the point is that by falling back to more verbose but more native-feeling code, some of the issues of embedding one language within another are removed. Code is still code, data is data again, and the two don’t get mixed up with each other.

Conclusion

If I ever manage to organize these thoughts in a more lucid way, I will probably just rewrite them as another (shorter) post. In the meantime, I’d urge you to think about where your code and data get uncomfortably close.

C#, Wacky Ideas

Iterate, damn you!

July 27, 2010 jonskeet 19 Comments

Do you know the hardest thing about presenting code with surprising results? It’s hard to do so without effectively inviting readers to look for the trick. Not that that’s always enough – I failed the last of Neal and Eric’s C# puzzlers at NDC, for example. (If you haven’t already watched the video, please do so now. It’s far more entertaining than this blog post.) Anyway, this one may be obvious to some of you, but there are some interesting aspects even when you’ve got the twist, as it were.

What does the following code print?

using System;
using System.Collections.Generic;

public class WeirdIterators
{
    static void ShowNext(IEnumerator<int> iterator)
    {
        if (iterator.MoveNext())
        {
            Console.WriteLine(iterator.Current);
        }
        else
        {
            Console.WriteLine("Done");
        }
    }

    static void Main()
    {
        List<int> values = new List<int> { 1, 2 };
        using (var iterator = values.GetEnumerator())
        {
            ShowNext(iterator);
            ShowNext(iterator);
            ShowNext(iterator);
        }
    }
}

If you guessed "1, 2, Done" despite the title of the post and the hints that it was surprising, then you’re at least brave and firm in your beliefs. I suspect most readers will correctly guess that it prints "1, 1, 1" – but I also suspect some of you won’t have worked out why.

Let’s look at the signature of List<T>.GetEnumerator(). We’d expect it to be

public IEnumerator<T> GetEnumerator()

right? That’s what IEnumerable<T> says it’ll look like. Well, no. List<T> uses explicit interface implementation for IEnumerable<T>. The signature actually looks like this:

public List<T>.Enumerator GetEnumerator()

Hmm… that’s an unfamiliar type. Let’s have another look in MSDN…

[SerializableAttribute]
public struct Enumerator : IEnumerator<T>,
IDisposable, IEnumerator

(It’s nested in List<T> of course.) Now that’s odd… it’s a struct. You don’t see many custom structs around, beyond the familiar ones in the System namespace. And hang on, don’t iterators fundamentally have to be mutable.

Ah. "Mutable value type" – a phrase to strike terror into the hearts of right-headed .NET developers everywhere.

So what’s going on? If we’re losing all the changes to the value, why is it printing "1, 1, 1" instead of throwing an exception due to printing out Current without first moving?

Well, we’re fetching the iterator into a variable of type List<int>.Enumerator, and then calling ShowNext() three times. On each call, the value is boxed (creating a copy), and the reference to the box is passed to ShowNext().

Within ShowNext(), the value within the box changes when we call MoveNext() – which is how it’s able to get the real first element with Current. So that mutation isn’t lost… until we return from the method. The box is now eligible for garbage collection, and no change has been made to the iterator variable’s value. On the next call to ShowNext(), a new box is created and we see the first item again…

How can we fix it?

There are various things we can do to fix the code – or at least, to make it display "1, 2, Done". We can then find other ways of breaking it again :)

Change the type of the `values` variable

How does the compiler work out the type of the iterator variable? Why, it looks at the return type of values.GetEnumerator(). And how does it find that? It looks at the type of the values variable, and then finds the GetEnumerator() method. In this case it finds List<int>.GetEnumerator(), so it makes the iterator variable type List<int>.Enumerator.

If suppose just change values to be of type IList<int> (or IEnumerable<int>, or ICollection<int>):

IList<int> values = new List<int> { 1, 2 };

The compiler uses the interface implementation of GetEnumerator() on List<T>. Now that could return a different type entirely – but it actually returns a boxed List<T>.Enumerator. We can see that by just printing out iterator.GetType().

So if it’s just returning the same value as before, why does it work?

Well, this time we’re boxing once – the iterator gets boxed on its way out of the GetEnumerator() method, and the same box is used for all three calls to ShowNext(). No extra copies are created, and the changes within the box don’t get lost.

Change the type of the `iterator` variable

This is exactly the same as the previous fix – except we don’t need to change the type of values. We can just explicitly state the type of iterator:

using (IEnumerator<int> iterator = values.GetEnumerator())

The reason this works is the same as before – we box once, and the changes within the box don’t get lost.

Pass the `iterator` variable by reference

The initial problem was due to the mutations involved in ShowNext() getting lost due to repeated boxing. We’ve seen how to solve it by reducing the number of boxing operations down to one, but can we remove them entirely?

Well yes, we can. If we want changes to the value of the parameter in ShowNext() to be propagated back to the caller, we just need to pass the variable by reference. When passing by reference the parameter and argument types have to match exactly of course, so we can’t leave the iterator variable being type List<T>.Enumerator without changing the parameter type. Now we could explicitly change the type of the parameter to List<T>.Enumerator – but that would tie our implementation down rather a lot, wouldn’t it? Let’s use generics instead:

static void ShowNext<T>(ref T iterator)
where T : IEnumerator<int>

Now we can pass iterator by reference and the compiler will infer the type. The interface members (MoveNext() and Current) will be called using constrained calls, so there’s no boxing involved…

… except that when you try to just change the method calls to use ref, it doesn’t work – because apparently you can’t pass a "using variable" by reference. I’d never come across that rule before. Interesting. Fortunately, we can (roughly) expand out the using statement ourselves, like this:

var iterator = values.GetEnumerator();
try
{
    ShowNext(ref iterator);
    ShowNext(ref iterator);
    ShowNext(ref iterator);
}
finally
{
    iterator.Dispose();
}

Again, this fixes the problem – and this time there’s no boxing involved.

Let’s quickly look at one more example of it not working, before I finish…

Dynamic typing to the anti-rescue

What happens if we change the type of iterator to dynamic (and set everything else back the way it was)? I’ll readily admit, I really didn’t know what was going to happen here. There are two competing forces:

The dynamic type is often really just object behind the scenes… so it will be boxed once, right? That means the changes within the box won’t get lost. (This would give "1, 2, Done")
The dynamic type is in many ways meant to act as if you’d declared a variable of the type which it actually turns out to be at execution time – so in this case it should work as if the variable was of type List<int>.Enumerator, just like our original code. (This would give "1, 1, 1")

What actually happens? I believe it actually boxes the value returned from GetEnumerator() – and then the C# binder ~~DLR~~ makes sure that the value type behaviour is preserved by copying the box before passing it to ShowNext(). In other words, both bits of intuition are right, but the second effectively overrules the first. Wow. (See the comment below from Chris Burrows for more information about this. I’m sure he’s right that it’s the only design that makes sense. This is a pretty pathological example in various ways.)

Conclusion

Just say "no" to mutable value types. They do weird things.

(Fortunately the vast majority of the time this particular one won’t be a problem – it’s rare to use iterators explicitly in the first place, and when you do you very rarely pass them to another method.)

Books, C#, LINQ, Wacky Ideas

First encounters with Reactive Extensions

January 16, 2010 jonskeet 14 Comments

I’ve been researching Reactive Extensions for the last few days, with an eye to writing a short section in chapter 12 of the second edition of C# in Depth. (This is the most radically changed chapter from the first edition; it will be covering LINQ to SQL, IQueryable, LINQ to XML, Parallel LINQ, Reactive Extensions, and writing your own LINQ to Objects operators.) I’ve watched various videos from Channel 9, but today was the first time I actually played with it. I’m half excited, and half disappointed.

My excited half sees that there’s an awful lot to experiment with, and loads to learn about join patterns etc. I’m also looking forward to trying genuine events (mouse movements etc) – so far my tests have been to do with collections.

My disappointed half thinks it’s missing something. You see, Reactive Extensions shares some concepts with my own Push LINQ library… except it’s had smarter people (no offense meant to Marc Gravell) working harder on it for longer. I’d expect it to be easier to use, and make it a breeze to do anything you could do in Push LINQ. Unfortunately, that’s not quite the case.

Subscription model

First, the way that subscription is handled for collections seems slightly odd. I’ve been imagining two kinds of observable sources:

Genuine "event streams" which occur somewhat naturally – for instance, mouse movement events. Subscribing to such an observable wouldn’t do anything to it other than adding subscribers.
Collections (and the like) where the usual use case is "set up the data pipeline, then tell it to go". In that case calling Subscribe should just add the relevant observers, but not actually "start" the sequence – after all, you may want to add more observers (we’ll see an example of this in a minute).

In the latter case, I could imagine an extension method to IEnumerable<T> called ToObservable which would return a StartableObservable<T> or something like that – you’d subscribe what you want, and then call Start on the StartableObservable<T>. That’s not what appears to happen though – if you call ToObservable(), you get an implementation which iterates over the source sequence as soon as anything subscribes to it – which just doesn’t feel right to me. Admittedly it makes life easy in the case where that’s really all you want to do, but it’s a pain otherwise.

There’s a way of working round this in Reactive Extensions: there’s Subject<T> which is both an observer and an observable. You can create a Subject<T>, Subscribe all the observers you want (so as to set up the data pipeline) and then subscribe the subject to the real data source. It’s not exactly hard, but it took me a while to work out, and it feels a little unwieldy. The next issue was somewhat more problematic.

Blocking aggregation

When I first started thinking about Push LINQ, it was motivated by a scenario from the C# newsgroup: someone wanted to group a collection in a particular way, and then count how many items were in each group. This is effectively the "favourite colour voting" scenario outlined in the link at the top of this post. The problem to understand is that the normal Count() call is blocking: it fetches items from a collection until there aren’t any more; it’s in control of the execution flow, effectively. That means if you call it in a grouping construct, the whole group has to be available before you call Count(). So, you can’t stream an enormous data set, which is unfortunate.

In Push LINQ, I addressed this by making Count() return Future<int> instead of int. The whole query is evaluated, and then you can ask each future for its actual result. Unfortunately, that isn’t the approach that the Reactive Framework has taken – it still returns int from Count(). I don’t know the reason for this, but fortunately it’s somewhat fixable. We can’t change Observable of course, but we can add our own future-based extensions:

public static class ObservableEx
{
    public static Task<TResult> FutureAggregate<TSource, TResult>
        (this IObservable<TSource> source,
        TResult seed, Func<TResult, TSource, TResult> aggregation)
    {
        TaskCompletionSource<TResult> result = new TaskCompletionSource<TResult>();
        TResult current = seed;
        source.Subscribe(value => current = aggregation(current, value),
            error => result.SetException(error),
            () => result.SetResult(current));
        return result.Task;
    }

    public static Task<int> FutureMax(this IObservable<int> source)
    {
        // TODO: Make this generic and throw exception on
        // empty sequence. Left as an exercise for the reader.
        return source.FutureAggregate(int.MinValue, Math.Max);
    }

    public static Task<int> FutureMin(this IObservable<int> source)
    {
        // TODO: Make this generic and throw exception on
        // empty sequence. Left as an exercise for the reader.
        return source.FutureAggregate(int.MaxValue, Math.Min);
    }

    public static Task<int> FutureCount<T>(this IObservable<T> source)
    {
        return source.FutureAggregate(0, (count, _) => count + 1);
    }
}

This uses Task<T> from Parallel Extensions, which gives us an interesting ability, as we’ll see in a moment. It’s all fairly straightforward – TaskCompletionSource<T> makes it very easy to specify a value when we’ve finished, or indicate that an error occurred. As mentioned in the comments, the maximum/minimum implementations leave something to be desired, but it’s good enough for a blog post :)

Using the non-blocking aggregation operators

Now that we’ve got our extension methods, how can we use them? First I decided to do a demo which would count the number of lines in a file, and find the maximum and minimum line lengths:

public static List<T> ToList<T>(this IObservable<T> source)
{
    List<T> ret = new List<T>();
    source.Subscribe(x => ret.Add(x));
    return ret;
}
private static IEnumerable<string> ReadLines(string filename)
{
    using (TextReader reader = File.OpenText(filename))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
…
var subject = new Subject<string>();
var lengths = subject.Select(line => line.Length);
var min = lengths.FutureMin();
var max = lengths.FutureMax();
var count = lengths.FutureCount();

var source = ReadLines("../../Program.cs");
source.ToObservable(Scheduler.Now).Subscribe(subject);
Console.WriteLine("Count: {0}, Min: {1}, Max: {2}",
                  count.Result, min.Result, max.Result);

As you can see, we use the Result property of a task to find its eventual result – this call will block until the result is ready, however, so you do need to be careful about how you use it. Each line is only read from the file once, and pushed to all three observers, who carry their state around until the sequence is complete, whereupon they publish the result to the task.

I got this working fairly quickly – then went back to the "grouping lines by line length" problem I’d originally set myself. I want to group the lines of a file by their length (all lines of length 0, all lines of length 1 etc) and count each group. The result is effectively a histogram of line lengths. Constructing the query itself wasn’t a problem – but iterating through the results was. Fundamentally, I don’t understand the details of ToEnumerable yet, particularly the timing. I need to look into it more deeply, but I’ve got two alternative solutions for the moment.

The first is to implement my own ToList extension method. This simply creates a list and subscribes an observer which adds items to the list as it goes. There’s no attempt at "safety" here – if you access the list before the source sequence has completed, you’ll see whatever has been added so far. I am still just experimenting :) Here’s the implementation:

public static List<T> ToList<T>(this IObservable<T> source)
{
    List<T> ret = new List<T>();
    source.Subscribe(x => ret.Add(x));
    return ret;
}

Now we can construct a query expression, project each group using our future count, make sure we’ve finished pushing the source before we read the results, and everything is fine:

Note how the call to ToList is required before calling source.ToObservable(...).Subscribe – otherwise everything would have been pushed before we started collecting it.

All well and good… but there’s another way of doing it too. We’ve only got a single task being produced for each group – instead of waiting until everything’s finished before we dump the results to the console, we can use Task.ContinueWith to write it (the individual group result) out as soon as that group has been told that it’s finished. We force this extra action to occur on the same thread as the observer just to make things easier in a console app… but it all works very neatly:

Conclusion

That’s the lot, so far. It feels like I’m sort of in the spirit of Reactive Extensions, but that maybe I’m pushing it (no pun intended) in a direction which Erik and Wes either didn’t anticipate, or at least don’t view as particularly valuable/elegant. I very much doubt that they didn’t consider deferred aggregates – it’s much more likely that either I’ve missed some easy way of doing this, or there are good reasons why it’s a bad idea. I hope to find out which at some point… but in the meantime, I really ought to work out a more idiomatic example for C# in Depth.

C#, Evil code, Stack Overflow, Wacky Ideas

“Magic” null argument testing

December 9, 2009 jonskeet 51 Comments

Warning: here be dragons. I don’t think this is the right way to check for null arguments, but it was an intriguing idea.

Today on Stack Overflow, I answered a question about checking null arguments. The questioner was already using an extension similar to my own one in MiscUtil, allowing code like this:

public void DoSomething(string name)
{
name.ThrowIfNull("name");

// Normal code here
}

That’s all very well, but it’s annoying to have to repeat the name part. Now in an ideal world, I’d say it would be nice to add an attribute to the parameter and have the check performed automatically (and when PostSharp works with .NET 4.0, I’m going to give that a go, mixing Code Contracts and AOP…) – but for the moment, how far can we go with extension methods?

I stand by my answer from that question – the code above is the simplest way to achieve the goal for the moment… but another answer raised the interesting prospect of combining anonymous types, extension methods, generics, reflection and manually-created expression trees. Now that’s a recipe for hideous code… but it actually works.

The idea is to allow code like this:

public void DoSomething(string name, string canBeNull, int foo, Stream input)
{
new { name, input }.CheckNotNull();

// Normal code here
}

That should check name and input, in that order, and throw an appropriate ArgumentNullException – including parameter name – if one of them is null. It uses the fact that projection initializers in anonymous types use the primary expression’s name as the property name in the generated type, and the value of that expression ends up in the instance. Therefore, given an instance of the anonymous type initializer like the above, we have both the name and value despite having only typed it in once.

Now obviously this could be done with normal reflection – but that we be slow as heck. No, we want to effectively find the properties once, and generate strongly typed delegates to perform the property access. That sounds like a job for Delegate.CreateDelegate, but it’s not quite that simple… to create the delegate, we’d need to know (at compile time) what the property type is. We could do that with another generic type, but we can do better than that. All we really need to know about the value is whether or not it’s null. So given a "container" type T, we’d like a bunch of delegates, one for each property, returning whether that property is null for a specified instance – i.e. a Func<T, bool>. And how do we build delegates at execution time with custom logic? We use expression trees…

I’ve now implemented this, along with a brief set of unit tests. The irony is that the tests took longer than the implementation (which isn’t very unusual) – and so did writing it up in this blog post. I’m not saying that it couldn’t be improved (and indeed in .NET 4.0 I could probably make the delegate throw the relevant exception itself) but it works! I haven’t benchmarked it, but I’d expect it to be nearly as fast as manual tests – insignificant in methods that do real work. (The same wouldn’t be true using reflection every time, of course.)

The full project including test cases is now available, but here’s the (almost completely uncommented) "production" code.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using System.Linq.Expressions;

public static class Extensions
{
    public static void CheckNotNull<T>(this T container) where T : class
    {
        if (container == null)
        {
            throw new ArgumentNullException("container");
        }
        NullChecker<T>.Check(container);
    }

    private static class NullChecker<T> where T : class
    {
        private static readonly List<Func<T, bool>> checkers;
        private static readonly List<string> names;

        static NullChecker()
        {
            checkers = new List<Func<T, bool>>();
            names = new List<string>();
            // We can’t rely on the order of the properties, but we
            // can rely on the order of the constructor parameters
            // in an anonymous type – and that there’ll only be
            // one constructor.
            foreach (string name in typeof(T).GetConstructors()[0]
                                             .GetParameters()
                                             .Select(p => p.Name))
            {
                names.Add(name);
                PropertyInfo property = typeof(T).GetProperty(name);
                // I’ve omitted a lot of error checking, but here’s
                // at least one bit…
                if (property.PropertyType.IsValueType)
                {
                    throw new ArgumentException
                        ("Property " + property + " is a value type");
                }
                ParameterExpression param = Expression.Parameter(typeof(T), "container");
                Expression propertyAccess = Expression.Property(param, property);
                Expression nullValue = Expression.Constant(null, property.PropertyType);
                Expression equality = Expression.Equal(propertyAccess, nullValue);
                var lambda = Expression.Lambda<Func<T, bool>>(equality, param);
                checkers.Add(lambda.Compile());
            }
        }

        internal static void Check(T item)
        {
            for (int i = 0; i < checkers.Count; i++)
            {
                if (checkers[i](item))
                {
                    throw new ArgumentNullException(names[i]);
                }
            }
        }
    }
}

Oh, and just as a miracle – the expression tree worked first time. I’m no Marc Gravell, but I’m clearly improving :)

Update: Marc Gravell pointed out that the order of the results of Type.GetProperties isn’t guaranteed – something I should have remembered myself. However, the order of the constructor parameters will be the same as in the anonymous type initialization expression, so I’ve updated the code above to reflect that. Marc also showed how it could almost all be put into a single expression tree which returns either null (for no error) or the name of the "failing" parameter. Very clever :)

Books, C#, C# 4, Wacky Ideas

Contract classes and nested types within interfaces

October 31, 2009 jonskeet 8 Comments

I’ve just been going through some feedback for the draft copy of the second edition of C# in Depth. In the contracts section, I have an example like this:

[ContractClass(typeof(ICaseConverterContracts))]
public interface ICaseConverter
{
string Convert(string text);
}

[ContractClassFor(typeof(ICaseConverter))]
internal class ICaseConverterContracts : ICaseConverter
{
    string ICaseConverter.Convert(string text)
    {
        Contract.Requires(text != null);
        Contract.Ensures(Contract.Result<string>() != null);
        return default(string);
    }

private ICaseConverterContracts() {}
}

public class InvariantUpperCaseFormatter : ICaseConverter
{
    public string Convert(string text)
    {
        return text.ToUpperInvariant();
    }
}

The point is to demonstrate how contracts can be specified for interfaces, and then applied automatically to implementations. In this case, ICaseConverter is the interface, ICaseConverterContracts is the contract class which specifies the contract for the interface, and InvariantUpperCaseFormatter is the real implementation. The binary rewriter effectively copies the contract into each implementation, so you don’t need to duplicate the contract in the source code.

The reader feedback asked where the contract class code should live – should it go in the same file as the interface itself, or in a separate file as normal? Now normally, I’m firmly of the "one top-level type per file" persuasion, but in this case I think it makes sense to keep the contract class with the interface. It has no meaning without reference to the interface, after all – it’s not a real implementation to be used in the normal way. It’s essentially metadata. This does, however, leave me feeling a little bit dirty. What I’d really like to be able to do is nest the contract class inside the interface, just like I do with other classes which are tightly coupled to an "owner" type. Then the code would look like this:

[ContractClass(typeof(ICaseConverterContracts))]
public interface ICaseConverter
{
string Convert(string text);

    [ContractClassFor(typeof(ICaseConverter))]
    internal class ICaseConverterContracts : ICaseConverter
    {
        string ICaseConverter.Convert(string text)
        {
            Contract.Requires(text != null);
            Contract.Ensures(Contract.Result<string>() != null);
            return default(string);
        }

private ICaseConverterContracts() {}
}
}

public class InvariantUpperCaseFormatter : ICaseConverter
{
    public string Convert(string text)
    {
        return text.ToUpperInvariant();
    }
}

That would make me feel happier – all the information to do with the interface would be specified within the interface type’s code. It’s possible that with that as a convention, the Code Contracts tooling could cope without the attributes – if interface IFoo contains a nested class IFooContracts which implements IFoo, assume it’s a contract class and handle it appropriately. That would be sweet.

You know the really galling thing? I’m pretty sure VB does allow nested types in interfaces…

C#, Parallelization, Stack Overflow, Wacky Ideas

Iterating atomically

October 23, 2009 jonskeet 33 Comments

The IEnumerable<T> and IEnumerator<T> interfaces in .NET are interesting. They crop up an awful lot, but hardly anyone ever calls them directly – you almost always use a foreach loop to iterate over the collection. That hides all the calls to GetEnumerator(), MoveNext() and Current. Likewise iterator blocks hide the details when you want to implement the interfaces. However, sometimes details matter – such as for this recent Stack Overflow question. The question asks how to create a thread-safe iterator – one that can be called from multiple threads. This is not about iterating over a collection n times independently on n different threads – this is about iterating over a collection once without skipping or duplicating. Imagine it’s some set of jobs that we have to complete. We assume that the iterator itself is thread-safe to the extent that calls from different threads at different times, with intervening locks will be handled reasonably. This is reasonable – basically, so long as it isn’t going out of its way to be thread-hostile, we should be okay. We also assume that no-one is trying to write to the collection at the same time.

Sounds easy, right? Well, no… because the IEnumerator<T> interface has two members which we effectively want to call atomically. In particular, we don’t want the collection { “a”, “b” } to be iterated like this:

Thread 1	Thread 2
MoveNext()
	MoveNext()
Current
	Current

That way we’ll end up not processing the first item at all, and the second item twice.

There are two ways of approaching this problem. In both cases I’ve started with IEnumerable<T> for consistency, but in fact it’s IEnumerator<T> which is the interesting bit. In particular, we’re not going to be able to iterate over our result anyway, as each thread needs to have the same IEnumerator<T> – which it won’t do if each of them uses foreach (which calls GetEnumerator() to start with).

Fix the interface

First we’ll try to fix the interface to look how it should have looked to start with, at least from the point of view of atomicity. Here are the new interfaces:

public interface IAtomicEnumerable<T>
{
IAtomicEnumerator<T> GetEnumerator();
}

public interface IAtomicEnumerator<T>
{
bool TryMoveNext(out T nextValue);
}

One thing you may notice is that we’re not implementing IDisposable. That’s basically because it’s a pain to do so when you think about a multi-threaded environment. Indeed, it’s possibly one of the biggest arguments against something of this nature. At what point do you dispose? Just because one thread finished doesn’t mean that the rest of them have… don’t forget that “finish” might mean “an exception was thrown while processing the job, I’m bailing out”. You’d need some sort of co-ordinator to make sure that everyone is finished before you actually do any clean-up. Anyway, the nice thing about this being a blog post is we can ignore that little thorny issue :)

The important point is that we now have a single method in IAtomicEnumerator<T> – TryMoveNext, which works the way you’d expect it to. It atomically attempts to move to the next item, returns whether or not it succeeded, and sets an out parameter with the next value if it did succeed. Now there’s no chance of two threads using the method and stomping on each other’s values (unless they’re silly and use the same variable for the out parameter).

It’s reasonably easy to wrap the standard interfaces in order to implement this interface:

/// <summary>
/// Wraps a normal IEnumerable[T] up to implement IAtomicEnumerable[T].
/// </summary>
public sealed class AtomicEnumerable<T> : IAtomicEnumerable<T>
{
private readonly IEnumerable<T> original;

    public AtomicEnumerable(IEnumerable<T> original)
    {
        this.original = original;
    }

    public IAtomicEnumerator<T> GetEnumerator()
    {
        return new AtomicEnumerator(original.GetEnumerator());
    }

    /// <summary>
    /// Implementation of IAtomicEnumerator[T] to wrap IEnumerator[T].
    /// </summary>
    private sealed class AtomicEnumerator : IAtomicEnumerator<T>
    {
        private readonly IEnumerator<T> original;
        private readonly object padlock = new object();

        internal AtomicEnumerator(IEnumerator<T> original)
        {
            this.original = original;
        }

        public bool TryMoveNext(out T value)
        {
            lock (padlock)
            {
                bool hadNext = original.MoveNext();
                value = hadNext ? original.Current : default(T);
                return hadNext;
            }
        }
    }
}

Just ignore the fact that I never dispose of the original IEnumerator<T> :)

We use a simple lock to make sure that MoveNext() and Current always happen together – that nothing else is going to call MoveNext() between our TryMoveNext() calling it, and it fetching the current value.

Obviously you’d need to write your own code to actually use this sort of iterator, but it would be quite simple:

T value;
while (iterator.TryMoveNext(out value))
{
// Use value
}

However, you may already have code which wants to use an IEnumerator<T>. Let’s see what else we can do.

Using thread local variables to fake it

.NET 4.0 has a very useful type called ThreadLocal<T>. It does basically what you’d expect it to, with nice features such as being able to supply a delegate to be executed on each thread to provide the initial value. We can use a thread local to make sure that so long as we call both MoveNext() and Current atomically when we’re asked to move to the next element, we can get back the right value for Current later on. It has to be thread local because we’re sharing a single IEnumerator<T> across multiple threads – each needs its own separate storage.

This is also the approach we’d use if we wanted to wrap an IAtomicEnumerator<T> in an IEnumerator<T>, by the way. Here’s the code to do it:

public class ThreadSafeEnumerable<T> : IEnumerable<T>
{
private readonly IEnumerable<T> original;

    public ThreadSafeEnumerable(IEnumerable<T> original)
    {
        this.original = original;
    }

    public IEnumerator<T> GetEnumerator()
    {
        return new ThreadSafeEnumerator(original.GetEnumerator());
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }

    private sealed class ThreadSafeEnumerator : IEnumerator<T>
    {
        private readonly IEnumerator<T> original;
        private readonly object padlock = new object();
        private readonly ThreadLocal<T> current = new ThreadLocal<T>();

        internal ThreadSafeEnumerator(IEnumerator<T> original)
        {
            this.original = original;
        }

        public bool MoveNext()
        {
            lock (padlock)
            {
                bool ret = original.MoveNext();
                if (ret)
                {
                    current.Value = original.Current;
                }
                return ret;
            }
        }

        public T Current
        {
            get { return current.Value; }
        }

        public void Dispose()
        {
            original.Dispose();
            current.Dispose();
        }

        object IEnumerator.Current
        {
            get { return Current; }
        }

        public void Reset()
        {
            throw new NotSupportedException();
        }
    }
}

I’m going to say it one last time – we’re broken when it comes to disposal. There’s no way of safely disposing of the original iterator at “just the right time” when everyone’s finished with it. Oh well.

Other than that, it’s quite simple. This code has the serendipitous property of actually implementing IEnumerator<T> slightly better than C#-compiler-generated implementations from iterator blocks – if you call the Current property without having called MoveNext(), this will throw an InvalidOperationException, just as the documentation says it should. (It doesn’t do the same at the end, admittedly, but that’s fixable if we really wanted to be pedantic.

Conclusion

I found this an intriguing little problem. I think there are better ways of solving the bigger picture – a co-ordinator which takes care of disposing exactly once, and which possibly mediates the original iterator etc is probably the way forward… but I enjoyed thinking about the nitty gritty.

Generally speaking, I prefer the first of these approaches. Thread local variables always feel like a bit of a grotty hack to me – they can be useful, but it’s better to avoid them if you can. It’s interesting to see how an interface can be inherently thread-friendly or not.

One last word of warning – this code is completely untested. It builds, and I can’t immediately see why it wouldn’t work, but I’m making no guarantees…

C#, LINQ, Wacky Ideas

An object lesson in blogging and accuracy; was: Efficient “vote counting” with LINQ to Objects – and the value of nothing

September 20, 2009 jonskeet 11 Comments

Well, this is embarrassing.

Yesterday evening, I excitedly wrote a blog post about an interesting little idea for making a particular type of LINQ query (basically vote counting) efficient. It was an idea that had occurred to me a few months back, but I hadn’t got round to blogging about it.

The basic idea was to take a completely empty struct, and use that as the element type in the results of a grouping query – as the struct was empty, it would take no space, therefore "huge" arrays could be created for no cost beyond the fixed array overhead, etc. I carefully checked that the type used for grouping did in fact implement ICollection<T> so that the Count method would be efficient; I wrote sample code which made sure my queries were valid… but I failed to check that the empty struct really took up no memory.

Fortunately, I have smart readers, a number of whom pointed out my mistake in very kind terms.

Ben Voigt gave the reason for the size being 1 in a comment:

The object identity rules require a unique address for each instance… identity can be shared with super- or sub- class objects (Empty Base Optimization) but the total size of the instance has to be at least 1.

This makes perfect sense – it’s just a shame I didn’t realise it before.

Live and learn, I guess – but apologies for the poorly researched post. I’ll attempt to be more careful next time.

C#, Stack Overflow, Wacky Ideas

Generic constraints for enums and delegates

September 10, 2009 jonskeet 56 Comments

As most readers probably know, C# prohibits generic type constraints from referring to System.Object, System.Enum, System.Array, System.Delegate and System.ValueType. In other words, this method declaration is illegal:

public static T[] GetValues<T>() where T : struct, System.Enum
{
return (T[]) Enum.GetValues(typeof(T));
}

This is a pity, as such a method could be useful. (In fact there are better things we can do… such as returning a read-only collection. That way we don’t have to create a new array each time the method is called.) As far as I can tell, there is no reason why this should be prohibited. Eric Lippert has stated that he believes the CLR doesn’t support this – but I think he’s wrong. I can’t remember the last time I had cause to believe Eric to be wrong about something, and I’m somewhat nervous of even mentioning it, but section 10.1.7 of the CLI spec (ECMA-335) partition II (p40) specifically gives examples of type parameter constraints involving System.Delegate and System.Enum. It introduces the table with "The following table shows the valid combinations of type and special constraints for a representative set of types." It was only due to reading this table that I realized that the value type constraint on the above is required (or a constructor constraint would do equally well) – otherwise System.Enum itself satisfies the constraint, which would be a Bad Thing.

It’s possible (but unlikely) that the CLI doesn’t fully implement this part of the CLR spec. I’m hoping that Eric’s just wrong on this occasion, and that actually there’s nothing to stop the C# language from allowing such constraints in the future. (It would be nice to get keyword support, such that a constraint of "T : enum" would be equivalent to the above, but hey…)

The good news is that ilasm/ildasm have no problem with this. The better news is that if you add a reference to a library which uses those constraints, the C# compiler applies them sensibly, as far as I can tell…

Introducing UnconstrainedMelody

(Okay, the name will almost surely have to change. But I like the idea of it removing the constraints of C# around which constraints are valid… and yet still being in the key of C#. Better suggestions welcome.)

I have a plan – I want to write a utility library which does useful things for enums and delegates (and arrays if I can think of anything sensible to do with them). It will be written in C#, with methods like this:

public static T[] GetValues<T>() where T : struct, IEnumConstraint
{
return (T[]) Enum.GetValues(typeof(T));
}

(IEnumConstraint has to be an interface of course, as otherwise the constraint would be invalid.)

As a post-build step, I will:

Run ildasm on the resulting binary
Replace every constraint using EnumConstraint with System.Enum
Run ilasm to build the binary again

If anyone has a simple binary rewriter (I’ve looked at PostSharp and CCI; both look way more complicated than the above) which would do this, that would be great. Otherwise ildasm/ilasm will be fine. It’s not like consumers will need to perform this step.

As soon as the name is finalized I’ll add a project on Google Code. Once the infrastructure is in place, adding utility methods should be very straightforward. Suggestions for utility methods would be useful, or just join the project when it’s up and running.

Am I being silly? Have I overlooked something?

A couple of hours later…

Okay, I decided not to wait for a better name. The first cut – which does basically nothing but validate the idea, and the fact that I can still unit test it – is in. The UnconstrainedMelody Google Code project is live!

Books, C#, Wacky Ideas

The “dream book” for C# and .NET

August 20, 2009 jonskeet 34 Comments

This morning I showed my hand a little on Twitter. I’ve had a dream for a long time about the ultimate C# book. It’s a dream based on Effective Java, which is my favourite Java book, along with my experiences of writing C# in Depth.

Effective Java is written by Josh Bloch, who is an absolute giant in the Java world… and that’s both the problem and the opportunity. There’s no-one of quite the equivalent stature in the .NET world. Instead, there are many very smart people, a lot of whom blog and some of whom have their own books.

There are "best practices" books, of course: Microsoft’s own Framework Design Guidelines, and Bill Wagner’s Effective C# and More Effective C# being the most obvious examples. I’m in no way trying to knock these books, but I feel we could do even better. The Framework Design Guidelines (also available free to browse on MSDN) are really about how to create a good API – which is important, but not the be-all-and-end-all for many application developers who aren’t trying to ship a reusable class library and may well have different concerns. They want to know how to use the language most effectively, as well as the core types within the framework.

Bill’s books – and many others which cover the core framework, such as CLR via C#, Accelerated C# 2008 and C# 3.0 in a Nutshell – give plenty of advice, but often I’ve felt it’s a little one-sided. Each of these books is the work of a single person (or brothers in the case of Nutshell). Reading them, I’ve often wanted to give present a different point of view – or alternatively, to give a hearty "hear, hear." I believe that a book giving guidance would benefit greatly from being more of a conversation: where the authors all agree on something, that’s great; where they differ, it would be good to hear about the pros and cons of various approaches. The reader can then weigh up those factors as they apply to each particular real-world scenario.

Scope

So what would such a book contain? Opinions will vary of course, but I would like to see:

Effective ways of using language features such as lambda expressions, generic type inference (and indeed generics in general), optional parameters, named arguments and extension methods. Assume that the reader knows roughly what C# does, but give some extra details around things like iterator blocks and anonymous functions.
Guidance around class design (in a similar fashion to the FDG, but with more input from others in the community)
Core framework topics (again, assume the basics are understood):

Resource management (disposal etc)
Exceptions
Collections (including LINQ fundamentals)
Streams
Text (including internationalization)
Numeric types
Time-related APIs
Concurrency
Contracts
AppDomains
Security
Performance

I would prefer to avoid anything around the periphery of .NET (WPF, WinForms, ASP.NET, WCF) – I believe those are better handled in different topics.

Obstacles and format

There’s one big problem with this idea, but I think it may be a saving grace too. Many of the leading authors work for different publishers. Clearly no single publisher is going to attract all the best minds in the C# and .NET world. So how could this work in practice? Well…

Imagine a web site for the book, paid for jointly by all interested publishers. The web site would be the foremost delivery mechanism for the content, both to browse and probably to download in formats appropriate for offline reading (PDF etc). The content would be edited in a collaborative style obviously, but exactly how that would work is a detail to be thrashed out. If you’ve read the annotated C# or CLI specifications, they have about the right feel – opinions can be attributed in places, but not everything has a label.

Any contributing publisher could also take the material and publish it as hard copy if they so wished. Quite how this would work – with potentially multiple hard copy editions of the same content – would be interesting to see. There’s another reason against hard copy ever appearing though, which is that it would be immovable. I’d like to see this work evolve as new features appear and as more best practices are discovered. Publishers could monetize the web site via adverts, possibly according to how much they’re kicking into the site.

I don’t know how the authors would get paid, admittedly, and that’s another problem. Would this cannibalize the sales of the books listed earlier? It wouldn’t make them redundant – certainly not for the Nutshell type of book, which teaches the basics as well as giving guidance. It would hit Effective C# harder, I suspect – and I apologise to Bill Wagner in advance; if this ever takes off and it hurts his bottom line, I’m very sorry – I think it’s in a good cause though.

Dream Team

So who would contribute to this? Part of me would like to say "anyone and everyone" in a Wikipedia kind of approach – but I think that practically, it makes sense for industry experts to take their places. (A good feedback/comments mechanism for anyone to use would be crucial, however.) Here’s a list which isn’t meant to be exhaustive, but would make me happy – please don’t take offence if your name isn’t on here but should be, and I wouldn’t expect all of these people to be interested anyway.

Anders Hejlsberg
Eric Lippert
Mads Torgersen
Don Box
Brad Abrams
Krzysztof Cwalina
Joe Duffy
Vance Morrison
Rico Mariani
Erik Meijer
Don Symes
Wes Dyer
Jeff Richter
Joe and Ben Albahari
Andrew Troelsen
Bill Wagner
Trey Nash
Mark Michaelis
Jon Skeet (yeah, I want to contribute if I can)

I imagine "principal" authors for specific topics (e.g. Joe Duffy for concurrency) but with all the authors dropping in comments in other places too.

Dream or reality?

I have no idea whether this will ever happen or not. I’d dearly love it to, and I’ve spoken to a few people before today who’ve been encouraging about the idea. I haven’t been putting any work into getting it off the ground – don’t worry, it’s not been delaying the second edition of C# in Depth. One day though, one day…

Am I being hopelessly naïve to even consider such a venture? Is the scope too broad? Is the content valuable but not money-making? We’ll see.

Option 1: Class level attribute

Option 2: Method level attributes

Option 3: Compiler and language simplicity

And another thing…

Code is data

Code in the right context is more than just data

When code is data, it’s easy to mix it with other data – badly

Expressing the same ideas, but back in the "native" language

Conclusion

How can we fix it?

Change the type of the values variable

Change the type of the iterator variable

Pass the iterator variable by reference

Dynamic typing to the anti-rescue

Conclusion

Subscription model

Blocking aggregation

Using the non-blocking aggregation operators

Conclusion

Fix the interface

Conclusion

Introducing UnconstrainedMelody

A couple of hours later…

Scope

Obstacles and format

Dream Team

Dream or reality?

Change the type of the `values` variable

Change the type of the `iterator` variable

Pass the `iterator` variable by reference