Data Structures and Algorithms: new free eBook available (first draft)

August 29, 2008 jonskeet 1 Comment

I’ve been looking at this for a while: Data Structures and Algorithms: Annotated reference with examples. It’s only in “first draft” stage at the moment, but the authors would love your feedback (as would I). Somehow I’ve managed to end up as the editor and proof-reader, although due to my holiday the version currently available doesn’t have many of my edits in. It’s a non-academic data structures and algorithms book, intended (as I see it, anyway) as a good starting point for those who know that they ought to be more aware of the data structures they use every day (lists, heaps etc) but don’t have an academic background in computer science.

An implementation will be available in both Java and C#, I believe.

Core .NET refcard now available

August 29, 2008 jonskeet Leave a comment

Same drill as before – and same registration requirements (although if you registered before, you don’t need to register again). The Core .NET refcard touches on some of the topics which I personally need to refer to MSDN for most often. It covers:

Common .NET types, aliases and sizes
String literals and escape sequences
Format strings (general, numeric, date/time)
Working with dates and times
Text encodings
Threading
Using the new features of C# 3.0 / VB 9.0 in .NET 2.0 projects

It’s only 6 pages long, which should give you an idea of the depth of coverage on each of these topics, but it’s meant to be handy reference material.

Enjoy!

C#, CSharpDev, CSharpDevCenter, Google, Protocol Buffers

Lessons learned from Protocol Buffers, part 4: static interfaces

August 29, 2008 jonskeet 13 Comments

Warning: During this entire post, I will use the word static to mean “relating to a type instead of an instance”. This isn’t a strictly accurate use but I believe it’s what most developers actually think of when they hear the word.

A few members of the interfaces in Protocol Buffers have no logical reason to act on instances of their types. The message interface has members to return the message’s type descriptor (the PB equivalent of System.Type), the default instance for the message type, and a builder for the message type. The builder interface copies the first two of these, and also has a method to create a builder for a particular field. None of these touch any instance data.

In most cases this doesn’t actually cause any difficulties – we usually have an instance available when we’re in the PB library code, and the generated types have static properties for the default instance and the type descriptor anyway. Even so, it feels messy to have interface members which rely only on the type of the implementation and not on any of the actual data of the instance.

I’ve wondered before now about the possibility of having static members in interfaces – usually when thinking about plug-in architectures – but there’s always been the problem of working out how to specify the type on which to call the members. Variables and other expressions usually refer to values rather than types, and System.Type doesn’t help as it provides no compile-time knowledge of the type being referred to.

There’s one big exception to this, however: generic type parameters. I don’t know why it had never occurred to me before, but this is a great fit for the ability to safely call static methods on types which are unknown at compile-time. Furthermore, it could provide a great way of enforcing the presence of constructors with appropriate signatures, and even operators. Before I get too far ahead of myself, let’s tie the simple case to a concrete example.

Creating builders from nothing

In my previous post I gave an example of a method which ideally wanted to return a new message given a CodedInputStream and an ExtensionRegistry (both types within Protocol Buffers, the details of which are unimportant to this example). The current code looks like this:

private static TMessage BuildImpl<TMessage2, TBuilder> (Func<TBuilder> builderBuilder,
                                                        CodedInputStream input,
                                                        ExtensionRegistry registry)
    where TBuilder : IBuilder<TMessage2, TBuilder>
    where TMessage2 : TMessage, IMessage<TMessage2, TBuilder>
{
    TBuilder builder = builderBuilder();
    input.ReadMessage(builder, registry);
    return builder.Build();
}

For the purposes of this discussion I’ll simplify it a little, making it a generic method in a non-generic type.

private static TMessage BuildImpl<TMessage, TBuilder> (Func<TBuilder> builderBuilder,
                                                       CodedInputStream input,
                                                       ExtensionRegistry registry)
    where TBuilder : IBuilder<TMessage, TBuilder>
    where TMessage : IMessage<TMessage, TBuilder>
{
    TBuilder builder = builderBuilder();
    input.ReadMessage(builder, registry);
    return builder.Build();
}

The first parameter is a function which will return us a builder. We can’t simply add a new() constraint to TBuilder as not all generated builders will have a public constructor. However, we do know that the TMessage type has a CreateBuilder() method because it implements IMessage<TMessage, TBuilder>. Unfortunately we don’t have an instance of TMessage to call CreateBuilder() on! Really, we’d like to be able to change the code to this:

private static TMessage BuildImpl<TMessage, TBuilder> (CodedInputStream input, ExtensionRegistry registry)
    where TBuilder : IBuilder<TMessage, TBuilder>
    where TMessage : IMessage<TMessage, TBuilder>
{
    TBuilder builder = TMessage.CreateBuilder();
    input.ReadMessage(builder, registry);
    return builder.Build();
}

That’s currently impossible, but only because we can’t specify static methods in interfaces. Suppose we could write:

public interface IMessage<TMessage, TBuilder>
    where TMessage : IMessage<TMessage, TBuilder>
    where TBuilder : IBuilder<TMessage, TBuilder>
{
    static TBuilder CreateBuilder();

    // Other methods as before
}

Wouldn’t that be useful? The value would almost entirely be for generic types or methods where the type parameter is constrained to specify the relevant interface, but that could arguably still be very handy.
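
For comparison, here is a minimal end-to-end sketch of the delegate-based version that does compile today. It is not the real library code: the interfaces are cut down to one or two members each, and MergeFrom(string) is a made-up stand-in for reading from a CodedInputStream with an ExtensionRegistry.

```csharp
using System;

// Cut-down, hypothetical versions of the message and builder interfaces.
public interface IMessage<TMessage, TBuilder>
    where TMessage : IMessage<TMessage, TBuilder>
    where TBuilder : IBuilder<TMessage, TBuilder>
{
    TBuilder CreateBuilderForType();
}

public interface IBuilder<TMessage, TBuilder>
    where TMessage : IMessage<TMessage, TBuilder>
    where TBuilder : IBuilder<TMessage, TBuilder>
{
    TBuilder MergeFrom(string data); // stand-in for reading from a stream
    TMessage Build();
}

public sealed class Person : IMessage<Person, Person.Builder>
{
    public string Name { get; private set; }

    private Person() { }

    public static Builder CreateBuilder() { return new Builder(); }
    public Builder CreateBuilderForType() { return new Builder(); }

    public sealed class Builder : IBuilder<Person, Builder>
    {
        private string name;

        public Builder MergeFrom(string data) { name = data; return this; }

        public Person Build()
        {
            Person person = new Person();
            person.Name = name;
            return person;
        }
    }
}

public static class Parser
{
    // No instance of TMessage is available, so the caller passes a factory.
    public static TMessage BuildImpl<TMessage, TBuilder>(Func<TBuilder> builderBuilder, string input)
        where TMessage : IMessage<TMessage, TBuilder>
        where TBuilder : IBuilder<TMessage, TBuilder>
    {
        TBuilder builder = builderBuilder();
        builder.MergeFrom(input);
        return builder.Build();
    }
}

public static class Program
{
    public static void Main()
    {
        // The static CreateBuilder method is handed over as a delegate.
        Person person = Parser.BuildImpl<Person, Person.Builder>(Person.CreateBuilder, "Fred");
        Console.WriteLine(person.Name); // prints "Fred"
    }
}
```

The call site has to name both type arguments and supply the factory itself, which is exactly the noise a static interface member would remove.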

Operators and constructors

At this point hopefully the idea I mentioned earlier of being able to specify operators and constructors is quite obvious. For instance, we could make all the existing numeric types implement IArithmetic<T> (where T was the same type, e.g. int : IArithmetic<int>):

public interface IArithmetic<T>
{
    static T operator +(T left, T right);
    static T operator -(T left, T right);
    static T operator /(T left, T right);
    static T operator *(T left, T right);
}

// Used in LINQ to Objects, for example:
public static T Sum<T>(this IEnumerable<T> source) where T : IArithmetic<T>
{
    T total = default(T);
    foreach (T element in source)
    {
        total += element; // Interface says we can do total + element
    }
    return total;
}
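
Until something like IArithmetic<T> exists, one common workaround is to have the caller pass in the operation as a delegate. The sketch below assumes a made-up GenericMath class and is only meant to show the shape of that workaround, not anything in LINQ to Objects itself.

```csharp
using System;
using System.Collections.Generic;

public static class GenericMath
{
    // T has no operator constraint, so the caller supplies the addition.
    // Starting from default(T) suits numeric value types (it is zero for them).
    public static T Sum<T>(IEnumerable<T> source, Func<T, T, T> add)
    {
        T total = default(T);
        foreach (T element in source)
        {
            total = add(total, element);
        }
        return total;
    }
}

public static class Program
{
    public static void Main()
    {
        int total = GenericMath.Sum(new[] { 1, 2, 3 }, (x, y) => x + y);
        Console.WriteLine(total); // prints 6
    }
}
```

It works, but every call site has to repeat the lambda, and nothing ties the delegate to the type in the way an interface constraint would.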

Plug-ins for a particular program could implement IPlugin:

public interface IPlugin
{
    static new(PluginHost host);

    // Normal plug-in members here
    string Title { get; }
}

// Used within a PluginHost like this…
public T CreatePlugin<T>() where T : IPlugin
{
    T plugin = new T(this);
    log.Info("Loaded plugin {0}", plugin.Title);
    return plugin;
}

In fact, I’d imagine it would make sense to define a whole family of IConstructable interfaces, along the same lines as the Func and Action delegate families:

public interface IConstructable
{
    static new();
}

public interface IConstructable<T>
{
    static new(T arg);
}

public interface IConstructable<T1, T2>
{
    static new(T1 arg1, T2 arg2);
}

// etc

Inheritance raises its ugly head

There’s a fly in the ointment here. Normally, if a base type implements an interface, that means a type derived from it will effectively implement the interface too. With static members, that no longer holds in every case. You can get away with it for straightforward static methods/properties, in the same way that people often write code such as UTF8Encoding.UTF8 when they really just mean Encoding.UTF8. However, it doesn’t work for constructors – they aren’t inherited, so you can’t guarantee that Banana has a parameterless constructor just because Fruit does.
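
The constructor point can be checked with a little reflection. In this sketch the Fruit and Banana classes and the helper method are invented for illustration; the only library call is Type.GetConstructor, which looks for a public instance constructor with the given parameter types.

```csharp
using System;

public class Fruit
{
    public Fruit() { }
}

public class Banana : Fruit
{
    // Declaring only this constructor leaves Banana without a parameterless one.
    public Banana(int ripeness) { }
}

public static class Program
{
    public static bool HasPublicParameterlessConstructor(Type type)
    {
        return type.GetConstructor(Type.EmptyTypes) != null;
    }

    public static void Main()
    {
        Console.WriteLine(HasPublicParameterlessConstructor(typeof(Fruit)));  // prints True
        Console.WriteLine(HasPublicParameterlessConstructor(typeof(Banana))); // prints False
    }
}
```

This is the same reason a new() constraint satisfied by Fruit says nothing about Banana.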

This is not only a problem for one concrete class deriving from another, but also for abstract implementations of interfaces. This happens a lot in Protocol Buffers; often the interface is partially implemented by the abstract class, with the final “leaf” classes in the inheritance tree implementing outstanding members and occasionally overriding earlier implementations for the sake of efficiency. Should we be able to specify an abstract static method in the abstract class, making sure that there’s an appropriate implementation by the time we hit a concrete class? As it happens that would be useful elsewhere in the Protocol Buffer library, but I’ll admit it’s slightly messy. I suspect there are ways round all of these issues, even if they might sometimes involve restricting the feature to particular, common use cases. However, every niggle would add its own piece of complexity.

There may well be other issues which would prove challenging – and other interesting aspects such as what explicit interface implementation would mean, if anything, in the context of static members. Language experts may well be able to reel off great lists of problems – I’d be very interested to hear the ensuing discussions.

Conclusion

I believe that static interface members could prove very useful in generic algorithms, particularly if operators and constructors were allowed as well as the existing member types available in interfaces. There are significant sticking points to be carefully considered, and I wouldn’t like to prejudge the outcome of such deliberations in terms of whether the feature would be useful enough to merit the additional language complexity involved. It feels odd to implement an interface member which is effectively only of use when the implementing type is being used as the type argument for a generic type or method, but as developers learn to think more generically that may be less of a restriction than it currently seems.

This post is the last in my current batch talking about Protocol Buffers. There may well be more, but it’s unlikely now that the Protocol Buffer port is almost entirely complete. Most of these posts have involved generics, and the current limitations of what C# allows us to express in them. I do not intend to give the impression that I’m dissatisfied with C# – I’ve just found it interesting to take a look at what lies beyond the current boundaries of the language. The one aspect of these posts which I would definitely like to see addressed is that of covariant return types – the Java implementation of Protocol Buffers is significantly simpler in many ways purely due to this one point.

C#, CSharpDev, CSharpDevCenter, Google, Protocol Buffers

Lessons learned from Protocol Buffers, part 3: generic type relationships

August 29, 2008 jonskeet 1 Comment

In part 2 of this series we saw how the message and builder interfaces were self-referential in order to allow the implementation types to be part of the API. That’s one sort of relationship, but in this post we’ll see how the two interfaces relate to each other. If you remember from part 1, every generated message type has a corresponding builder type. As it happens, this is implemented with a nested type, so if you had a Person message, the generated types would be Person and Person.Builder (in a specified namespace, of course).

Without any interfaces involved, this would be very simple. The types would just look like this (with more members, of course):

public class Person
{
    public static Builder CreateBuilder() { … }

    public Builder CreateBuilderForType() { … }

    public class Builder
    {
        public Builder() { … }

        public Person Build() { … }
    }
}

You may well be wondering why there are two methods for creating a builder. The static method is convenient for code which knows it’s dealing with the Person message. The instance method ends up being part of the message interface, which makes it useful for code which can work with any message. In addition, the constructor for Person.Builder is accessible in the C# version. In the original Java code the only way of creating a builder is via the methods in the message class; I decided to remove this restriction for the sake of making the oh-so-readable object initializer syntax available in C# 3.

Redesigning the interfaces to refer to each other

In part 2 we created self-referential interfaces for the message and builder interfaces which looked like this:

public interface IMessage<TMessage> where TMessage : IMessage<TMessage>
{
    …
}

public interface IBuilder<TBuilder> where TBuilder : IBuilder<TBuilder>
{
    …
}

The constraints on the type parameters allow us to make the API very specific, and we can use the same trick again when we relate the builder and message types together. The step where we introduce a new type parameter to each of them is straightforward:

public interface IMessage<TMessage, TBuilder> where TMessage : IMessage<TMessage, TBuilder>
{
    …
}

public interface IBuilder<TMessage, TBuilder> where TBuilder : IBuilder<TMessage, TBuilder>
{
    …
}

Unfortunately without any restrictions on the “foreign” type parameter in each interface, we don’t get enough information to make everything work. We need to tie the two types together more tightly, like this:

public interface IMessage<TMessage, TBuilder>
    where TMessage : IMessage<TMessage, TBuilder>
    where TBuilder : IBuilder<TMessage, TBuilder>
{
    …
}

public interface IBuilder<TMessage, TBuilder>
    where TMessage : IMessage<TMessage, TBuilder>
    where TBuilder : IBuilder<TMessage, TBuilder>
{
    …
}

To make this concrete for Person and Person.Builder we end up with implementations like this:

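// The nested type has to be written as Person.Builder here, because the
// base list of Person is outside the scope of Person's own members.
public class Person : IMessage<Person, Person.Builder>
{
    public static Builder CreateBuilder() { … }

    public Builder CreateBuilderForType() { … }

    public class Builder : IBuilder<Person, Builder>
    {
        public Builder() { … }

        public Person Build() { … }
    }
}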

This works, but it’s really ugly. Any generic methods wanting to take a TMessage type parameter implementing IMessage<TMessage, TBuilder> have to also have a TBuilder type parameter, and the two constraints need to be expressed each time. It’s a real pain. In fact, I’ve got an IMessage<TMessage> interface which contains almost nothing in it (and which the more generic interface extends). This allows me to get hold of the message type (and use it in the API), inferring the builder type by reflection. That’s a pain too, frankly. It’s a particular nuisance because when I do infer the builder type, I haven’t actually got any compile-time constraint that lets any other code know that it’s the right builder type for the message type. In one specific case it’s led to this horrific method (in a type generic in TMessage):

private static TMessage BuildImpl<TMessage2, TBuilder> (Func<TBuilder> builderBuilder,
                                                        CodedInputStream input,
                                                        ExtensionRegistry registry)
    where TBuilder : IBuilder<TMessage2, TBuilder>
    where TMessage2 : TMessage, IMessage<TMessage2, TBuilder>
{
    TBuilder builder = builderBuilder();
    input.ReadMessage(builder, registry);
    return builder.Build();
}

Fortunately this is hidden from public view – and the only reason to do it at all is to enable a pleasant API of MessageStreamIterator<TMessage> : IEnumerable<TMessage> where TMessage : IMessage<TMessage>. The result of the evil method above is exactly what the caller is likely to want, otherwise I wouldn’t put up with it. However, that sort of excuse has been coming up far too much in the PB implementation, so I’ve had a quick think about what could be done about it.

Contemplating a more expressive language

I should really prefix this section by saying that I’m not actually suggesting this as a way forward for C# or .NET. (I suspect it would take more work in the CLR as well as just in the language; I don’t know enough about CLR generics to say for sure, but I’d be surprised if this were feasible.) I haven’t encountered many situations where I’ve wanted anything like this, and the extra complexity in the language would be quite high, I suspect. Suppose an interface could contain extra type parameters, including constraints, in the body of the interface:

// Purely imaginary syntax!
public interface IMessage<TMessage> where TMessage : IMessage<TMessage>
{
    <TBuilder> where TBuilder : IBuilder<TBuilder>, TBuilder.TMessage : TMessage

    // Normal methods, which could use TBuilder
}

public interface IBuilder<TBuilder> where TBuilder : IBuilder<TBuilder>
{
    <TMessage> where TMessage : IMessage<TMessage>, TMessage.TBuilder : TBuilder

    // Normal methods, which could use TMessage
}

There are various ways in which the interface implementation could indicate the type of TBuilder. The syntax itself isn’t particularly interesting – it’s the extra information which is conveyed which is the important bit. I’ve dithered between this being a step forward and it not. At first glance it looks no better than having both type parameters in the interface declaration, but I believe it would genuinely make a difference. For instance, the above evil method could be written as:

private static TMessage BuildImpl(Func<TMessage.TBuilder> builderBuilder,
                                  CodedInputStream input,
                                  ExtensionRegistry registry)
{
    TMessage.TBuilder builder = builderBuilder();
    input.ReadMessage(builder, registry);
    return builder.Build();
}

This time there’s no need for the method to be generic, because the type is already generic in the message type. Furthermore, we can call this method with no reflection. All other APIs which have previously had to specify two type parameters can now just specify the one. Apart from anything else, this leaves more scope for type inference in generic methods – passing either a message or a builder to a generic method happens occasionally, but it’s very rare to pass in both.

We’ve essentially expressed the relationship between the message type and the builder type a little more explicitly, so that we can guarantee it exists (and use it) at compile time. That’s at the heart of the problem to start with – without a second type parameter in the initial interface declaration, in the current language there’s no way of expressing a close relationship with another type.

Conclusion

I don’t think it would be fair to say that C# really lets us down here – it happens not to support a pretty rare scenario, and that’s fair enough. I’d be interested to know whether any other languages allow the same sort of concepts to be expressed more pleasantly. The ugly solution I’ve presented here does at least work, and it’s nearly invisible to most users, who are likely to just reference the concrete generated types. I’m not happy with the verbosity which has become necessary in many places, but it’s in a good cause. It’s interesting to note that the Java API doesn’t use this sort of doubly-generic relationship: again, covariant return types allow the concrete message and builder types to express their APIs directly and still implement a more general interface at the same time.

In the next part I’ll look at another possibility which would make interfaces and generics a more powerful combination: static interface methods.

General

Holiday blogging

August 29, 2008 jonskeet Leave a comment

Assuming I manage to get the publication order right (and also assuming that you’re reading this blog in publication order to start with) you’ll see a number of posts seemingly written very shortly after this one. As I type this, I am on holiday in Southwold in Suffolk – a delightful seaside town which unfortunately has not been blessed with a transmitter for my mobile data provider. I’d rather hoped to be able to pick up a signal, but have been disappointed so far. I’m currently half way through the eight day holiday, and being out of my normal communication circles is somewhat odd.

Apologies to all who have tried to email me during this time – as well as any newsgroup discussions where I have suddenly gone silent. Normal service should now be resumed for the foreseeable future.

C#, CSharpDev, CSharpDevCenter, Protocol Buffers

Lessons learned from Protocol Buffers, part 2: self-referential generic types

August 20, 2008 jonskeet 3 Comments

In the first part of this series we saw that a message type and its builder are closely related. The tricky bit comes when we want to define an interface describing messages and builders. Although some members clearly depend on the data being built (the first and last name in the person example from part 1, for instance) others apply to all messages or all builders. For instance, a message can always provide you with a suitable builder, and a builder always allows you to build it to create the actual message. Likewise the message and builder types also have methods which return other instances of themselves – you can ask any message for the default message of the same type, or clone a builder. Many common builder methods effectively return this (i.e. the same builder) – but the declared return type needs to be the concrete type involved, not just the interface, otherwise you couldn’t then use the returned builder to set properties without casting.

(Aside: some of the members of the common interface would be more pleasant if they could be declared statically. We’ll look at that later in the series.)

We have two slightly different issues here: defining the interface to allow members to return the concrete types, and tying builders and messages together. This post will just talk about the first of these issues. Enjoy the luxury of only having to think about one type parameter at a time – it won’t last long.

First encounters of the self-referential kind

I first came across a generic constraint which referred to itself back in the early days of Java 5. Here’s the declaration for java.lang.Enum:

public abstract class Enum<E extends Enum<E>>

Assuming you’re more comfortable in C#, I’ll translate that into C# syntax:

public abstract class Enum<T> where T : Enum<T>

The constraint is easier to read than to understand. Any concrete, constructed class deriving from this will be an “enum of something” where that something is itself an “enum of something”.

Now, Java puts additional restrictions on the Enum class (it’s like System.Delegate in C# – you can’t explicitly derive from it yourself; you have to let the compiler do it for you). However, the syntax is perfectly valid in “normal” code. Typically when you encounter this kind of type constraint, you satisfy it in new classes by using the same class as the type argument for T. So, in the Enum example we might have:

public sealed class Currency : Enum<Currency>
{
    // Code
}

public sealed class Status : Enum<Status>
{
    // Code
}

There’s nothing to actually stop you from declaring class Status : Enum<Currency> – it’s just not normally useful. Likewise you can leave the derived type as a generic one, but again that’s atypical. I don’t know any way to enforce the usual implementation – short of building it into the language, as Java did – but it’s generally not a problem.

Back to Protocol Buffers

So why is this useful? Well, moving on from enums let’s look at the builder interface in Protocol Buffers. Here’s part of it – somewhat simplified, admittedly:

public interface IBuilder<TBuilder> where TBuilder : IBuilder<TBuilder>
{
    TBuilder Clear();
    TBuilder Clone();
    TBuilder ClearField(FieldDescriptor field);
    TBuilder AddRepeatedField(FieldDescriptor field, object value);
    TBuilder SetUnknownFields(UnknownFieldSet unknownFields);
    TBuilder MergeUnknownFields(UnknownFieldSet unknownFields);
    TBuilder MergeFrom(ByteString data);
    TBuilder MergeFrom(CodedInputStream input);
    TBuilder MergeFrom(CodedInputStream input, ExtensionRegistry registry);
}

None of those methods mention the actual message directly – for that we need another type parameter, as we’ll see in the next post – but all of them return a TBuilder. As it happens, the interface documentation requires that all the methods return the same reference back, just as StringBuilder methods do, but you could equally create an interface around immutable types, expecting each operation to return a new value. For instance, you could create an IArithmetic<T> interface such that int could implement IArithmetic<int>, double could implement IArithmetic<double> etc. You can then chain multiple operations together, e.g. 5.Add(10).Multiply(2) and know that you’re always within the world of integers.
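
As a rough sketch of the immutable flavour: int can’t be made to implement a new interface, so this example assumes a made-up WholeNumber wrapper struct, and the Add and Multiply member names are likewise just illustrative.

```csharp
using System;

// Hypothetical interface: each operation returns T, so a chain stays in one type.
public interface IArithmetic<T> where T : IArithmetic<T>
{
    T Add(T other);
    T Multiply(T other);
}

// Hypothetical immutable wrapper around int.
public struct WholeNumber : IArithmetic<WholeNumber>
{
    private readonly int value;

    public WholeNumber(int value) { this.value = value; }

    public int Value { get { return value; } }

    // Each operation returns a new value rather than mutating this one.
    public WholeNumber Add(WholeNumber other) { return new WholeNumber(value + other.value); }
    public WholeNumber Multiply(WholeNumber other) { return new WholeNumber(value * other.value); }
}

public static class Program
{
    // Generic code only sees the interface, but still gets a T back.
    public static T Square<T>(T x) where T : IArithmetic<T>
    {
        return x.Multiply(x);
    }

    public static void Main()
    {
        WholeNumber result = new WholeNumber(5).Add(new WholeNumber(10)).Multiply(new WholeNumber(2));
        Console.WriteLine(result.Value);                     // prints 30
        Console.WriteLine(Square(new WholeNumber(4)).Value); // prints 16
    }
}
```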

It’s important to note that the return type of each of the methods in our builder interface is TBuilder, not IBuilder<TBuilder>. I point this out mostly because the latter is what I originally had. After all, it’s often best to expose fairly general return types. That works fine while you’re only using operations within the interface, but often clients know more detail about the concrete type and want to use that information. For instance, you might want to be able to write:

Person.Builder builder = …; // Get a builder from somewhere
builder = builder.Clear()
                 .SetFirstName("Fred")
                 .SetLastName("Jones")
                 .Clone();

Here SetFirstName() and SetLastName() aren’t members of the interface, but Clear() and Clone() are. We can mix and match like this (and finally reassign the builder variable) because the interface is as strongly typed as it is. Code which only knows about the interface can still do whatever it likes, because it knows that TBuilder implements IBuilder<TBuilder>. In particular, that means it’s fine for some of the interface to be implemented by an abstract class – in Protocol Buffers there can be quite a deep inheritance tree for messages and builders, and a lot of the methods (particularly the merging ones) can be written in terms of the others. (Yes, that suggests that an extension method might be appropriate – but leaving it in the interface allows for particular implementations to override the general one, which can be important for optimisation. There’s also the matter of making the whole thing play nicely for people who are still stuck with .NET 2.0 and Visual Studio 2005.)
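
To see the mix-and-match chaining compile, here is a small self-contained sketch. The interface is trimmed to two members, and PersonBuilder and BuilderUtil are stand-in names rather than the generated Person.Builder or anything in the real library.

```csharp
using System;

// Trimmed-down builder interface: every method returns TBuilder.
public interface IBuilder<TBuilder> where TBuilder : IBuilder<TBuilder>
{
    TBuilder Clear();
    TBuilder Clone();
}

// Hypothetical concrete builder, passing itself as the type argument.
public sealed class PersonBuilder : IBuilder<PersonBuilder>
{
    public string FirstName { get; private set; }
    public string LastName { get; private set; }

    public PersonBuilder SetFirstName(string value) { FirstName = value; return this; }
    public PersonBuilder SetLastName(string value) { LastName = value; return this; }

    public PersonBuilder Clear() { FirstName = null; LastName = null; return this; }

    public PersonBuilder Clone()
    {
        return new PersonBuilder().SetFirstName(FirstName).SetLastName(LastName);
    }
}

public static class BuilderUtil
{
    // Code which only knows the interface still hands back the concrete type.
    public static TBuilder ClearedCopy<TBuilder>(TBuilder builder)
        where TBuilder : IBuilder<TBuilder>
    {
        return builder.Clone().Clear();
    }
}

public static class Program
{
    public static void Main()
    {
        PersonBuilder builder = new PersonBuilder()
            .Clear()               // interface member
            .SetFirstName("Fred")  // concrete-type member
            .SetLastName("Jones")
            .Clone();              // interface member again
        Console.WriteLine(builder.FirstName + " " + builder.LastName); // prints "Fred Jones"

        // No cast needed: type inference gives us a PersonBuilder back.
        PersonBuilder empty = BuilderUtil.ClearedCopy(builder);
        Console.WriteLine(empty.FirstName == null); // prints True
    }
}
```

If Clear() and Clone() were declared to return IBuilder<TBuilder> instead, the SetFirstName call in the middle of the chain would not compile without a cast.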

A small diversion via Java

It’s interesting to note that while my C# port is largely a port of the Java code, there are significant differences around how generics are used. This is understandable given how different generics are in .NET and Java. (My preference being heavily towards the .NET side – but there are moments when Java has its advantages.) However, one aspect of Java which is used to great effect is covariant return types. In the Java protocol buffers, the Message and Builder interfaces aren’t generic at all. For instance, the equivalent of the earlier part of the builder interface is just this:

public interface Builder
{
    Builder clear();
    Builder clone();
    Builder clearField(FieldDescriptor field);
    Builder addRepeatedField(FieldDescriptor field, Object value);
    Builder setUnknownFields(UnknownFieldSet unknownFields);
    Builder mergeUnknownFields(UnknownFieldSet unknownFields);
    Builder mergeFrom(ByteString data);
    Builder mergeFrom(CodedInputStream input);
    Builder mergeFrom(CodedInputStream input, ExtensionRegistry registry);
}

Does that mean we can’t chain operations together any more, mixing and matching “concrete-type-specific” methods (such as the setters for first name and last name) with the interface methods? Not at all – because where Person.Builder implements (say) clear() it can do it like this:

public Person.Builder clear()
{
    // Implementation
    return this;
}

At that point, everything which knows at compile-time that it’s calling Person.Builder.clear() knows that it returns a Person.Builder – whereas code which only knows that it’s calling the interface method only knows that it will return some implementation of the interface. (Apologies for the naming here – it’s unfortunate from a clarity standpoint that both the interface and the implementation are called Builder, but I thought it would be worth being faithful to the real code on this point.)

It’s just about possible to do this in C# as well, with explicit interface implementation. Again, I went that way to start with – and it was a disaster. In the intermediate abstract classes I was having to cast to the interface sometimes, not cast at other times, declare new abstract protected methods of ClearImpl etc. It was simply awful. I’ve gone back to my school of thought which is that explicit interface implementation is handy when it’s absolutely required (or where you deliberately want to make it hard to call certain members), but should be largely avoided.
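
For a single class with no inheritance in the way, the explicit interface implementation approach looks harmless enough. This is only a sketch with invented names (a non-generic IBuilder and a standalone PersonBuilder), and it deliberately leaves out the intermediate abstract classes where the trouble described above starts.

```csharp
using System;

// Non-generic interface in the style of the Java version.
public interface IBuilder
{
    IBuilder Clear();
}

public sealed class PersonBuilder : IBuilder
{
    public string FirstName { get; private set; }

    public PersonBuilder SetFirstName(string value) { FirstName = value; return this; }

    // Public method with the specific return type...
    public PersonBuilder Clear() { FirstName = null; return this; }

    // ...plus an explicit implementation which satisfies the interface
    // by delegating to the public method.
    IBuilder IBuilder.Clear() { return Clear(); }
}

public static class Program
{
    public static void Main()
    {
        // Callers with the concrete type keep it across the chain.
        PersonBuilder builder = new PersonBuilder().SetFirstName("Fred").Clear().SetFirstName("Jo");

        // Callers with only the interface get the interface type back.
        IBuilder weak = builder;
        IBuilder same = weak.Clear();
        Console.WriteLine(ReferenceEquals(builder, same)); // prints True
    }
}
```

Every interface member needs this pair of methods, which gives some idea of how quickly it gets out of hand in a deeper hierarchy.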

In fact, I do have a non-generic interface for both messages and builders, but where types would be involved I’ve renamed the methods to things like WeakClear and WeakBuild. These Weak* methods are only defined in terms of the non-generic interfaces, and are mostly used in cases where we really don’t know at compile time what kind of message we’re dealing with, even in a generic sense. Life would, however, be much simpler if only we had covariant return types in C#.

Conclusion

Self-referential generic types shouldn’t be used more widely than they really need to be – they can be tough to get your head round. However, they can be useful when you want to maintain a strongly typed API which needs to talk in terms of itself. One redeeming feature of the complexity in Protocol Buffers is that most of it is in the implementation: users of Person and Person.Builder really don’t need to know or care about the interfaces for most of the time. So long as they use a strongly typed expression to start with, they’ll keep that strong typing and be presented with appropriate members to call as if the interfaces and intermediate abstract classes didn’t even exist. It’s an API which gets out of your way when you’re not interested in it, which is always a nice sign.

While trying a number of schemes I’ve learned that there can often be a lot of subtly different options available, and their benefits and drawbacks aren’t always obvious until you try them. Oh, and covariant return types would be very welcome, and explicit interface implementation should generally be avoided where possible :)

Next time I’ll reveal a bit more about the real interfaces in my PB port. Bear in mind that messages need to know about their builders, and vice versa…

C#, CSharpDev, CSharpDevCenter, Protocol Buffers

Lessons learned from Protocol Buffers, part 1: messages, builders and immutability

August 20, 2008 jonskeet 5 Comments

My port of the Protocol Buffers project has proved pretty interesting. I thought I’d share some of the lessons I’ve learned along the way, as well as some of the frustrations at concepts I still can’t express in C#.

This was originally all going to be in one post, but I’m becoming acutely aware of how long some posts can grow. I don’t know about you, but I find very long blog posts quite intimidating, so I’ve decided to split them up into individual topics. You’ll still probably need to read the posts in order to understand them though – and this introductory post is the most important one in that respect.

Messages and Builders

The Protocol Buffers project (or PB for short) is basically another serialization technology, putting emphasis on efficiency, platform neutrality, and backward/forward compatibility. The normal set of steps in using PB is something like this:

Write a .proto file describing your data in terms of messages.
Run protoc to generate C# (and Java/C++ if you so wish).
In your application, use the builder associated with the message type to create an instance of a message.
Serialize the data to a stream.
At some other point in the application (or a different app) deserialize the data.

The idea is that builders are mutable, while the messages they build are immutable. You can use builders either with Set* methods which return the same builder again, or properties which can be used within object initializers. For example:

// Syntax available in C# 2
Person john = new Person.Builder()
    .SetFirstName("John")
    .SetLastName("Doe")
    .Build();

// Using an object initializer
Person jane = new Person.Builder
    { FirstName="Jane", LastName="Doe" }
    .Build();

Of course, you don’t have to do all the building in one expression; it’s just a handy option in many cases.

As you can see, the builder is generated as a nested type of the message. That’s handy, as it means the builder has access to the private members of the message. To avoid lots of data copying we employ popsicle immutability – the builder directly manipulates the message until it’s built, at which point it makes sure that nothing will change it afterwards. If that makes you uncomfortable in terms of it not being “true” immutability, I sympathise – but I also give String as a counterexample; StringBuilder works in exactly this way, modifying a string directly until it exposes it to the outside world.
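To make the nested-builder idea concrete, here is a minimal hand-written sketch of the pattern. It is not the code that protoc generates: the two fields, the Current helper and the choice to "freeze" the message by dropping the builder's reference to it are all assumptions made for illustration.

```csharp
using System;

// Hand-written sketch of popsicle immutability; real generated code will differ.
public sealed class Person
{
    // Private state: only code inside Person (including the nested Builder) can write it.
    private string firstName = "";
    private string lastName = "";

    private Person() { }

    public string FirstName { get { return firstName; } }
    public string LastName { get { return lastName; } }

    public sealed class Builder
    {
        // The message under construction; set to null once it has been handed out.
        private Person message = new Person();

        // Properties, so the builder can be used with an object initializer.
        public string FirstName
        {
            get { return Current.firstName; }
            set { Current.firstName = value; }
        }

        public string LastName
        {
            get { return Current.lastName; }
            set { Current.lastName = value; }
        }

        // Fluent variants return the same builder so that calls can be chained.
        public Builder SetFirstName(string value) { FirstName = value; return this; }
        public Builder SetLastName(string value) { LastName = value; return this; }

        public Person Build()
        {
            Person result = Current;
            message = null; // assumption: freeze by dropping the only mutating reference
            return result;
        }

        // Guards every access so a built message can never be changed via this builder.
        private Person Current
        {
            get
            {
                if (message == null)
                {
                    throw new InvalidOperationException("Build() has already been called");
                }
                return message;
            }
        }
    }
}
```

With this sketch, both styles shown earlier work against the same builder, no data is copied when Build() is called, and any attempt to use the builder afterwards throws rather than silently changing the message.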

Other than the copying – and the fact that all the code exists explicitly, and the caller has to know about the builder – this is quite similar to the suggestion I made about C# immutability a while ago. One point which makes it all simpler is that every data type in Protocol Buffers is itself immutable – so we don’t need to worry about deep copies and the like.

Unfortunately the current implementation doesn’t support collection initializers – if you have a repeated field in your message, you have to call Add* to populate it. The Add* methods return the builder just like the Set* methods, so you can still do it all in one expression, but it’s not terribly neat. Using a collection initializer compiles, but fails at execution time because the properties for repeated fields always return immutable lists. This is by design, to stop callers from creating a builder, fetching the list property, calling Build and then adding to the list. A better solution (and one which I plan to implement soon) is to have a PopsicleList<T> which is initially mutable but which will become immutable at the appropriate time (i.e. when Build() is called). At that point we’ll be able to write:

Person jane = new Person.Builder
    { FirstName="Jane", LastName="Doe",
      Friends = { "Tom", "Dick", "Harry" } }
    .Build();
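As a rough idea of how such a list could work, here is one possible sketch. PopsicleList<T> does not exist yet, so everything here is an assumption: the Freeze() method name, the choice of NotSupportedException, and the toy FriendsBuilder class, which stands in for a generated builder and is purely hypothetical.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical sketch: mutable until Freeze() is called, read-only afterwards.
public sealed class PopsicleList<T> : IList<T>
{
    private readonly List<T> items = new List<T>();
    private bool frozen;

    // Intended to be called by the owning builder from its Build() method.
    public void Freeze() { frozen = true; }

    private void EnsureMutable()
    {
        if (frozen)
        {
            throw new NotSupportedException("List is read-only");
        }
    }

    public T this[int index]
    {
        get { return items[index]; }
        set { EnsureMutable(); items[index] = value; }
    }

    public int Count { get { return items.Count; } }
    public bool IsReadOnly { get { return frozen; } }

    // Mutating members check the frozen flag first.
    public void Add(T item) { EnsureMutable(); items.Add(item); }
    public void Insert(int index, T item) { EnsureMutable(); items.Insert(index, item); }
    public bool Remove(T item) { EnsureMutable(); return items.Remove(item); }
    public void RemoveAt(int index) { EnsureMutable(); items.RemoveAt(index); }
    public void Clear() { EnsureMutable(); items.Clear(); }

    // Read-only members are always available.
    public bool Contains(T item) { return items.Contains(item); }
    public int IndexOf(T item) { return items.IndexOf(item); }
    public void CopyTo(T[] array, int arrayIndex) { items.CopyTo(array, arrayIndex); }
    public IEnumerator<T> GetEnumerator() { return items.GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

// Toy stand-in for a generated builder. A get-only property is enough for a
// collection initializer, which simply calls Add on whatever the getter returns.
public sealed class FriendsBuilder
{
    private readonly PopsicleList<string> friends = new PopsicleList<string>();

    public IList<string> Friends { get { return friends; } }

    public IList<string> Build()
    {
        friends.Freeze(); // from here on, the list handed out is read-only
        return friends;
    }
}
```

With these types, new FriendsBuilder { Friends = { "Tom", "Dick", "Harry" } }.Build() returns a three-element list, and a later call to Add on either the result or the builder's Friends property throws NotSupportedException. That closes the loophole of fetching the list, calling Build and then adding to it, without copying the list at build time.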

There’s quite a lot more to messages and builders than this – things like the reflection-like API to query properties of the message based on fields in the message descriptor – but what I’ve described so far ought to be enough for what I want to talk about, most of which relates to generics. In the next part, I’ll talk about self-referential generic types.

Pre-Copenhagen interview

August 19, 2008 jonskeet Leave a comment

Brian Rasmussen has just posted an interview we did by email, as a sort of precursor to my talk in Copenhagen. It’s nice to occasionally write down “where I am” in terms of my feelings about Java, C# and my own career. There’s a bit of technical content, but it’s mostly stuff about me as a person, just to dampen expectations suitably.

I’m really, really looking forward to giving the talk now. Nearly two and a half months is a long time to wait…

Update: If you tried to get to the link earlier on and failed, try again – it’s back up.

Speaking in Copenhagen, October 30th

August 13, 2008 jonskeet 2 Comments

I should have announced this earlier, but I’m delighted to report that on October 30th I’ll be speaking at a C# event in Copenhagen. Brian Rasmussen has organised a one day seminar which basically consists of me talking about C# all day and fielding questions.

That sounds like more fun for me than anyone else, but apparently enough people disagree that the event is already fully booked. Still, if you want to sign up in case anyone drops out, the registration page has the details.

My plan is to make it a verbal edition of C# in Depth, with as much interaction as possible. I’ll try to tackle C# 2 in the morning and C# 3 in the afternoon, possibly with some fun using Push LINQ and Protocol Buffers at the end, just to show how flexible LINQ is (even in-process LINQ). However, I’m hoping that my agenda will be derailed by the audience asking lots of questions and leading me down interesting alleys. That’s usually been the way of things in the past, and it makes the whole experience much more fun.

If you’re coming, please mail me with the kind of topics you’d like covered. The more input I get, the better the event is likely to be. I’m looking forward to it a lot…

Visual Studio 2008 SP1 and .NET 3.5 SP1 both out now

August 11, 2008 jonskeet 2 Comments

I suspect this will be pretty widely advertised fairly quickly, but both Visual Studio 2008 SP1 and .NET 3.5 SP1 are available for download. Personally I’ve had problems signing into the MSDN subscriptions site and going to the downloads page, but the direct links work fine. Both are fairly small files which then download more stuff when you execute them. The .NET 3.5 SP1 download doesn’t require you to have .NET 3.5 installed beforehand.

Update (13th August): Patrick Smacchia has a great post showing the differences (in terms of numbers rather than features) between 3.5 and 3.5SP1.

Jon Skeet's coding blog

Monthly Archives: August 2008
