Vista and External Memory Devices

Update – read the first two comments. I'm leaving the rest of the article as it is in order to avoid revisionism. The solution is in the first two comments though.

According to the Windows Vista feature page, Vista is going to be able to use external memory devices (USB flash drives and the like to you and me) to act as extra memory to save having to go to the hard disk. I've heard this mentioned at a few places, and it's always asserted that EMDs are slower than memory but "much, much faster" than disks. This has just been stated as a fact that everyone would just go along with. I've been a bit skeptical myself, so I thought I'd write a couple of very simple benchmarks. I emphasise the fact that they're very simple because it could well be that I'm missing something very important.

Here are three classes. Writer just writes out however many blocks of 1MB data you ask it to, to whichever file you ask it to. Reader simply reads a whole file in 1MB chunks. RandomReader reads however many 1MB chunks you ask it to, seeking randomly within the file between each read.

Writer

using System;
using System.IO;

public class Writer
{
    static void Main(string[] args)
    {
        Random rng = new Random();
        
        byte[] buffer = new byte[1024*1024];
        
        DateTime start = DateTime.Now;
        using (FileStream stream = new FileStream (args[0], FileMode.Create))
        {
            for (int i=0; i < int.Parse(args[1]); i++)
            {
                rng.NextBytes(buffer);
                Console.Write(".");
                stream.Write(buffer, 0, buffer.Length);
            }
        }
        DateTime end = DateTime.Now;
        Console.WriteLine();
        Console.WriteLine (end-start);
    }
}

Reader

using System;
using System.IO;

public class Reader
{
    static void Main(string[] args)
    {
        byte[] buffer = new byte[1024*1024];
        
        DateTime start = DateTime.Now;
        int total=0;
        using (FileStream stream = new FileStream (args[0], FileMode.Open))
        {
            int read;
            while ( (read=stream.Read (buffer, 0, buffer.Length)) > 0)
            {
                total += read;
                Console.Write(".");
            }
        }
        DateTime end = DateTime.Now;
        Console.WriteLine();
        Console.WriteLine (end-start);
        Console.WriteLine (total);
    }
}

RandomReader

using System;
using System.IO;

public class RandomReader
{
    static void Main(string[] args)
    {
        byte[] buffer = new byte[1024*1024];
        
        Random rng = new Random();
        DateTime start = DateTime.Now;
        int total=0;
        using (FileStream stream = new FileStream (args[0], FileMode.Open))
        {
            int length = (int) stream.Length;
            for (int i=0; i < int.Parse(args[1]); i++)
            {
                stream.Position = rng.Next(length-buffer.Length);                
                total += stream.Read (buffer, 0, buffer.Length);
                Console.Write(".");
            }
        }
        DateTime end = DateTime.Now;
        Console.WriteLine();
        Console.WriteLine (end-start);
        Console.WriteLine (total);
    }
}

I have five devices I can test: a 128MB Creative Muvo (USB), a 1GB PNY USB flash drive, a Viking 512MB SD card, my laptop hard disk (fairly standard 60GB Hitachi drive) and a LaCie 150GB USB hard disk. (All USB devices are USB 2.0.) The results are below. This is pretty rough and ready – I was more interested in the orders of magnitude than exact figures, hence the low precision given. All figures are in MB/s.

Drive Write Stream read Random read
Internal HDD 17.8 24 22
External HDD 14 20 22
SD card 2.3 7 8.3
1GB USB stick 3.3 10 10
128MB USB stick 1.9 2.9 3.5

Where possible, I tried to reduce the effects of caching by mixing the tests up, so I never ran two tests on the same location in succession. Some of the random reads will almost certainly have overlapped each other within a test, which I assume is the reason for some of the tests showing faster seek+read than streaming reads.

So, what's wrong with this picture? Why does MS claim that flash memory is much faster than hard disks, when my flash drives appear to be much slower than my laptop and external drives? (Note that laptop disks aren't noted for their speed, and I don't have a particularly fancy one.) It doesn't appear to be the USB bus – the external hard disk is fine. The 1GB stick and the SD card are both pretty new, although admittedly cheap. I doubt that either of them are worse quality than the majority of flash drives in the hands of the general public now, and I don't expect the average speed to radically increase between now and the Vista launch, in terms of what people actually own.

I know my tests don't accurately mimic how data will be accessed by Vista – but how is it so far out? I don't believe MS would have invested what must have been a substantial amount of resource into this feature without conducting rather more accurate benchmarks than my crude ones. I'm sure I'm missing something big, but what is it? And if flash can genuinely work so much faster than hard disks, why do flash cards perform so badly in simple file copying etc?

Inheritance Tax


Introduction

There aren’t many technical issues that my technical lead (Stuart) and I disagree on.
However, one of them is inheritance and making things virtual. Stuart tends to favour
making things virtual on the grounds that you never know when you might need to inherit from
a class and override something. My argument is that unless a class is explicitly designed
for inheritance in the first place, you can get into a big mess very quickly. Desiging a
class for inheritance is not a simple matter, and in particular it ties your
implementation down significantly. Composition/aggregation usually works better in
my view. This is not to say that inheritance isn’t useful – like regular expressions,
inheritance of implementation is incredibly powerful and I certainly wouldn’t dream of
being without it. However, I find it’s best used sparingly. (Inheritance of interface is a
different matter – I happily use interfaces all the time, and they don’t suffer from the
same problems.) I suspect that much of my wariness is due to a bad experience I had with
java.util.Properties – so I’ll take that as a worked example.

Note: I’ll use the terms “derived type” and “subclass” (along with their related
equivalents) interchangably. This post is aimed at both C# and Java developers, and I can’t
get the terminology right for both at the same time. I’ve tended to go with whatever sounds
most natural at the time.

For those of you who aren’t Java programmers, a bit of background about the class.
Properties represents a “string to string” map, with strongly typed methods
(getProperty and setProperty) along with methods to save and
load the map. So far, so good.

Something we can all agree on…

The very first problem with Properties itself is that it extends
Hashtable, which is an object to object map. Is a string to string map
actually an object to object map? This is actually a question which has come up a lot
recently with respect to generics. In both C# and Java, List<String>
is not viewed as a subtype of List<Object>, for instance. This can
be a pain, but is logical when it comes to writable lists – you can add any object
to a list of objects, but you can only add a string to a list of strings. Co-variance
of type parameters would work for a read-only list, but isn’t currently available in C#.
Contravariance would work for a write-only list (you could view a list of objects as a list
of strings if you’re only writing to it), although that situation is less common, not to
mention less intuitive. I believe the CLR itself supports non-variance, covariance and
contravariance, but it’s not available in C# yet. Arguably generics is a complicated
enough topic already, without bringing in further difficulties just yet – we’ll have to
live with the restrictions for the moment. (Java supports both types of variance to
some extent with the ? extends T and ? super T syntax. Java’s
generics are very different to those in .NET, however.)

Anyway, java.util.Properties existed long before generics were a twinkle
in anyone’s eye. The typical “is-a” question which is usually taught for
determining whether or not to derive from another class wasn’t asked carefully enough in
this case. I believe it’s important to ask the question with Liskov’s Substitution Principle
in mind – is the specialization you’re going to make entirely compatible with
the more general contract? Can/should an instance of the derived type be used as if it were
just an instance of the base type?

The answer to the “can/should” question is “no” in the case of Properties, but
in two potentially different ways. If Properties overrides put (the
method in Hashtable used to add/change entries in the map) to prevent non-string
keys and values from being added, then it can’t be used as a general purpose Hashtable
– it’s breaking the general contract. If it doesn’t override put then a
Properties instance merely shouldn’t be used as a general purpose
Hashtable – in particular, you could get surprises if one piece of code added
a string key with a non-string value, treating it just as a Hashtable, and then
another piece of code used getProperty to try to retrieve the value of that key.

Furthermore, what happens if Hashtable changes? Suppose another method is added which
modifies the internal structure. It wouldn’t be unreasonable to create an add method
which adds a new key/value pair to the map only if the key isn’t already present. Now, if
Properties overrides put, it should really override add as
well – but the cost of checking for new methods which should potentially be overridden every time a
new version comes out is very high.

The fact that Properties derived from Hashtable
also means that its threading mechanisms are forever tied to those of Hashtable.
There’s no way of making it use a HashMap internally and managing the thread
safety within the class itself, as might be desirable. The public interface of
Properties shouldn’t be tied to the fact that it’s implemented using
Hashtable, but the fact that that implementation was achieved using
inheritance means it’s out in the open, and can’t be changed later (without abandoning
making use of the published inheritance).

So, hopefully we can all agree that in the case of java.util.Hashtable and
java.util.Properties at least, the choice to use inheritance instead of aggregation
was a mistake. So far, I believe Stuart would agree.

Attempting to specialize

Now for the tricky bit. I believe that if you’re going to allow a method to be overridden
(and methods are virtual by default in Java – fortunately not so in C#) then you need to document
not only what the current implementation does, but what it’s called from within the rest of
the class. A good example to demonstrate this comes from Properties again.

A long time ago, I wrote a subclass of Properties which had a sort of hierarchy.
If you had keys "X", "foo.bar" and
"foo.baz" you could ask an instance of this hierarchical properties type for a
submap (which would be another instance of the same type) for "foo". The returned
map would have keys "bar" and "baz". We used this kind of hierarchy
for configuration. If you’re thinking that XML would have been a better fit, you’re right.
(XML didn’t actually exist at the time, and I don’t know if there were any SGML libraries around
for Java. Either way, this was a reasonably simple way of organising configuration.

Now the question of whether or not I should have been deriving from Properties
in the first place is an interesting one. I don’t think there’s any reason anyone couldn’t or
shouldn’t use an instance of the PeramonProperties (as it was unfortunately called)
class as a normal Properties object, and it certainly helped when it came to other
APIs which wanted to use a parameter of type Properties. As it happens, I believe
we did run into a versioning problem, in terms of wanting to override a method of
Properties which only appeared in Java version 1.2, but only when compiling against
1.2. It’s certainly not crystal clear to me now whether we did the right thing or not – there
were definite advantages, and it wasn’t as obviously wrong as the inheritance from Hashtable
to Properties, but it wasn’t plain sailing either.

I needed to override getProperty – but I wanted to do it in the simplest possible way.
There are two overloads for getProperty, one of which takes a default value and one
of which just assumes a default value of null. (The default is returned if the key isn’t
present in the map.) Now, consider three possible implementations of getProperties in
Properties (get is a method in Hashtable which returns
the associated value or null. I’m leaving aside the issue of what to do if a non-string
value has been put in the map.)

First version: non-defaulting method delegates to defaulting

public String getProperty (String key)
{
    return getProperty (key, null);
}
    
public String getProperty (String key, String defaultValue)
{
    String value = (String) get(key);
    return (value == null ? defaultValue : value);
}

Second version: defaulting method delegates to non-defaulting

public String getProperty (String key)
{
    return (String) get(key);
}
    
public String getProperty (String key, String defaultValue)
{
    String value = getProperty (key);
    return (value == null ? defaultValue : value);
}

Third version: just calling base methods

public String getProperty (String key)
{
    return (String) get(key);
}

public String getProperty (String key, String defaultValue)
{
    String value = (String) get(key);
    return (value == null ? defaultValue : value);
}

Now, when overriding getProperty myself, it matters a great deal what the implementation
is – because I’m likely to want to call one of the base overloads, and if that in turn calls
my overridden getProperty, we’ve just blown up the stack. An alternative is to override
get instead, but can I absolutely rely on Properties calling get?
What if in a future version of Java, Hashtable adds an overload for get which
takes a default value, and Properties gets updated to use that instead of the signature
of get that I’ve overridden?

There’s a pattern in all of the worrying above – it involves needing to know the implementation
of a the class in order to override anything sensibly. That should make two parties nervous – the
ones relying on the implementation, and the ones providing the implementation. The ones
relying on it first have to find out what the implementation currently is. This is hard enough
sometimes even when you’ve got the source – Properties is a pretty straightforward
class, but if you’ve got a deep inheritance hierarchy with a lot of interaction going on it can
be a pain to work out what eventually calls what. Try doing it without the source and you’re in
real trouble). The ones providing the implementation should be nervous because they’ve now effectively
exposed something which they may want to change later. In the example of Hashtable providing
get with an overload taking a default value, it wouldn’t be unreasonable for the
authors of Properties to want to make use of that – but because they can’t change the
implementation of the class without potentially breaking other classes which have overridden
get, they’re stuck with their current implementation.

Of course, that’s assuming that both parties involved are aware of the risks. If the author
of the base class doesn’t understand the perils of inheritance, they could easily change the
implementation to still fulfill the interface contract, but break existing subclasses. They
could have all the unit tests required to prove that the implementation was, in itself, correct –
but that wouldn’t help the poor subclass which was relying on a particular implementation.
If the author of the subclass doesn’t understand the potential problems – particularly if
the way they first overrode methods just happened to work, so they weren’t as aware as they
might be that they were relying on a specific implementation – then they may not
do quite as much checking as they should when a new version of the base class comes out.

Does this kill inheritance?

Having proclaimed doom and gloom so far, I’d like to emphasise that I’m not trying
to say that inheritance should never be used. There are many times when it’s fabulously
useful – although in most of those cases an interface would be just as useful from a client’s
point of view, possibly with a base class providing a “default implementation” for use where
appropriate without making life difficult for radically different implementations (such as
mocks :)

So, how can inheritance be used safely? Here are a few suggestions – they’re not absolute
rules, and if you’re careful I’m sure it’s possible to have a working system even if you
break all of them. I’d just be a bit nervous when trying to change things in that state…

  • Don’t make methods virtual unless you really need to. Unless you can think of a reason
    why someone would want to override the behaviour, don’t let them. The downside of this
    is that it makes it harder to provide mock objects deriving from your type – but interfaces
    are generally a better answer here.
  • If you have several methods doing a similar thing and you want to make them virtual,
    consider making one method virtual (possibly a protected method) and making all
    the others call the virtual method. That gives a single point of access for derived classes.
  • When you’ve decided to make a method virtual, document all other paths that will call
    that method. (For instance, in the case above, you would document that all the similar
    methods call the virtual one.) In some cases it may be reasonable to not document the
    details of when the method won’t be called (for instance, if a particular
    parameter value will always result in the same return value for one overload of a method,
    you may not need to call anything else). Likewise it may be reasonable to only document
    the callers on the virtual method itself, rather than on each method that calls it.
    However, both of these can affect an implementation. This documentation becomes
    part of the interface of your class – once you’ve stated that one method will call
    another (and implicitly that other methods won’t call the virtual method) any
    change to that is a breaking change in the same way that changing the acceptable parameters
    or the return value is. You should also consider documenting what the base implementation
    of the method does (and in particular what other methods it calls within the same class) –
    quite often, an override will want to call the base implementation, but it can be difficult
    to know how safe this is to do or at what point to call it unless you know what the
    implementation really does.
  • When overriding a method, be very careful which other methods in the base class you
    call – check the documentation to make sure you won’t be causing an infinitely
    recursive loop. If you’re deriving from one of your own types and the documentation
    isn’t explicit enough, now would be a very good time to improve it. You might also
    want to make a note in the base class that you’re overriding the method in the specific
    class so that you can refer to the overriding method if you want to change the base class
    implementation.
  • If you make any assumptions when overriding a method, consider writing unit tests to document
    those assumptions. For instance, if you assume that calling method X will result in a call to
    your overridden method Y, consider testing that path as well as the path where method Y is
    called directly. This will help to give you more confidence if the base type is upgraded to
    a newer version. (This shouldn’t be considered a replacement for careful checking when
    the base type is upgraded to a new version though – indeed, you may want to add extra tests
    due to an expanding API etc.)
  • Take great care when adding a new virtual method in Java, as any existing derived class which
    happens to have a method of the same name will automatically override it, usually
    with unintended consequences. If you’re using Java 1.5/5.0, you can use the @Override
    annotation to specify that you intend to override a method. Some IDEs (such as Eclipse) have
    options to make any override which doesn’t have the @Override annotation result
    in a compile-time error or warning. This gives a similar degree of safety to C#’s requirement
    to use the override modifier – although there’s still no way of providing a “new”
    method which has the same signature as a base type method but without overriding it.
  • If you upgrade the version of a type you’re using as a base type, check for any changes in
    the documentation, particularly any methods you’ve overridden. Look at any new methods which
    you’d expect to call your overridden method – and any you’d expect not to!

Many of these considerations have different effects depending on the consumer of the type.
If you’re writing a class library for use outside your development team or organisation,
life is harder than in a situation where you can easily find out all the uses of a particular
type or method. You’ll need to think harder about what might genuinely be useful to override
up-front rather than waiting until you have a need before making a method virtual (and then
checking all existing uses to ensure you won’t break anything). You may also want to give more
guidance – perhaps even a sample subclass – on how you envisage a method being overridden.

Conclusion

You should be very aware of the consequences of making a method virtual. C# (fortunately in my view)
makes methods non-virtual by default. In an interview
Anders Hejlsberg explained the reasons for that decision, some of which are along the same lines as
those described here. Java treats methods as virtual by default, using Hotspot to get round the performance
implications and largely ignoring the problems described here (with the @Override annotation
coming late in the day as a partial safety net). Like many powerful tools, inheritance of implementation
should be used with care.