Of memory and strings

This post was provoked by a recent Stack Overflow question which asked whether there was an efficient representation of ASCII strings in .NET.

In particular, the questioner wanted to store hundreds of thousands – possibly millions – of strings in memory, and knowing (or assuming) that they all consisted of ASCII characters, he wanted to avoid the waste of space that comes from storing each character in a .NET string as a UTF-16 code unit.

My answer to the question mostly consisted of saying that I didn’t think it would be worth the effort, and giving some reasons. But the more reasons I gave, the more I thought about the subtleties involved, and that it’s actually quite an interesting case study into memory use in .NET.

How are we going to measure object size?

If we’re going to work out any sort of benefit from a more compact string representation, we’ll need to be able to calculate how much memory our objects are taking to start with. Rather than work this out in a purely theoretical way, I’ve been running tests using code like this:

using System;

class Program
{
static void Main(string[] args)
{
int size = 10000000;
object[] array = new object[size];

long before = GC.GetTotalMemory(true);
for (int i = 0; i < size; i++)
{
array[i] = new object();
}
long after = GC.GetTotalMemory(true);

double diff = after – before;

Console.WriteLine(“Per object: “ + diff / size);

// Stop the GC from messing up our measurements
GC.KeepAlive(array);
}
}

Obviously that doesn’t take into account factors such as memory used by JITting, or anything that could be going on in other threads, but by using a suitably large number of objects, and by performing the division in floating point arithmetic (to avoid a slight variation making an 11.99999 come out as an 11, when it’s really just a “12 with something else going on”, we can work out the size of objects pretty clearly. The sample above measures the size of a vanilla object, but the code can be adapted very easily.

The first important thing to point out is that C# doesn’t guarantee the results of this – it isn’t responsible for determining how all of an object is laid out in memory; that’s the runtime’s job. While there are attributes to affect the layout and padding of the data members of a type in memory, there are other aspects that are out of your control. In this post I won’t use any of the layout attributes – we’ll just use the defaults.

Not all runtimes are created equal, either. On my laptop I’ve got Mono 2.8, .NET 3.5 and .NET 4, with the two versions of .NET each having the 32-bit (x86) and 64-bit (x64) CLRs. For the sake of simplicity, I’m going to stick with .NET 4 for this post, but I’ll give results for both the x64 and x86 CLRs. To test each of them, I’m compiling with “/platform:x64” or “/platform:x86”.

Some simple starter measurements

Before I start creating my own types, let’s try a few built-in types, including strings:

 Type x86 size x64 size object 12 24 object[] 16 + length * 4 32 + length * 8 int[] 12 + length * 4 28 + length * 4 byte[] 12 + length 24 + length string 14 + length * 2 26 + length * 2

Note that all the x86 sizes are rounded up to the nearest 4 bytes, and all x64 sizes are rounded up to the nearest 8 bytes.

The string numbers are interesting, because strings are the only non-array types in .NET which vary in size. A long string consists of a single large object in memory. Compare this with Java, where a String is a “normal” type in terms of memory consumption, containing an offset and length into a char array – so a long string consists of a small object referring to a large char array. This distinction will be very important when we come to build an AsciiString type. Another point about measuring string sizes is that it’s relatively tricky to measure the size of an empty string – because even if you use the “new string(‘x’, 0)” constructor, the result is still cached – the same reference is returned each time.

You might be forgiven for looking at the numbers above and thinking that the “overhead” of an object is 12 bytes in x86 and 24 in x64… but that’s not quite right. Let’s build our own straightforward classes and measure them…

Custom classes containing primitives

Here are six classes, all of which are measured with the same simple test code:

class Empty {}
class OneInt32 { int x; }
class TwoInt32 { int x, y; }
class ThreeInt32 { int x, y, z; }

class Mixed1
{
int x;
byte b1, b2, b3, b4;
int y, z;
}

class Mixed2
{
int x;
byte b1;
int y, z;
byte b2, b3, b4;
}

The last case is mostly to check how the CLR handles an “awkward” class declaration, where the int variables won’t naturally be aligned on 4-byte boundaries. The results look odd at first, but we’ll make sense of them in a minute:

 Type x86 size x64 size Empty 12 24 OneInt32 12 24 TwoInt32s 16 24 ThreeInt32s 20 32 Mixed1 24 32 Mixed2 24 32

A few interesting things to note here:

• There’s a “base” overhead of 8 bytes per object in x86 and 16 per object in x64… given that we can store an Int32 of “real” data in x86 and still have an object size of 12, and likewise we can store two Int32s of real data in x64 and still have an object of x64.
• There’s a “minimum” size of 12 bytes and 24 bytes respectively. In other words, you can’t have a type which is just the overhead. Note how the “Empty” class takes up the same size as creating instances of Object… there’s effectively some spare room, because the CLR doesn’t like operating on an object with no data. (Note that a struct with no fields takes up space too, even for local variables.)
• The x86 objects are padded to 4 byte boundaries; on x64 it’s 8 bytes (just as before)
• By default, the CLR is happy to pack fields pretty densely – Mixed2 only took as much space as ThreeInt32. My guess is that it reorganized the in-memory representation so that the bytes all came after the ints… and that’s what a quick bit of playing around with unsafe pointers suggests too… but I’m not sufficiently comfortable with this sort of thing to say for sure. Frankly, I don’t care… so long as it all works, what we’re interested in is the overall size, not the precise layout.

So what does an ASCII string look like?

In this blog post I’m not actually going to implement an ASCII string at all (well, not much). I’m merely pointing out what the data structures would look like. However, it’s worth working out what desirable qualities it should have. As far as possible, it should feel like System.String. In particular:

• It should be immutable.
• It should have fast access to individual characters, and the length.
• It should mostly “feel” like an immutable reference type, in that passing a value of type AsciiString around should be cheap, like copying a reference.
• It should use as little memory as possible… less than the equivalent string, or it’s pointless.
• One caveat to this: in theory that could mean storing 8 characters in every 7 bytes, as ASCII really only uses 7 bits per character. I’m not going to those extremes, but you can think about them if you want.

We’re going to store the characters as a byte array. We have three options as to exactly how we handle that byte array:

• We could go the Java way, where several strings share references to the same array. Each string then has an offset and a length to say which bit of the array they’re interested in.
• Pros: Substring becomes really cheap
• Cons: You can end up having just a tiny substring responsible for keeping a huge character array alive
• We could go the .NET way, where each string has its own character data, but the buffer may be longer than necessary… so it stores the length too. (A bit like a List<T>.)
• Pros: Can potentially make building strings cheap, if you just keep whatever potentially oversized buffer you’ve already got.
• Cons: Wasted space for the unused part of the array, and a field for the length.
• We could just have a byte array of exactly the right size – and it already knows its size.

I’m going to assume the third option here. So all the data our type needs is a byte array. That’s going to be pretty cheap… we hope. Let’s look at what we can build.

Option 1: A simple class

To give a flavour of the implementation, I’ve decided to implement four members for each option:

• A way of creating an AsciiString from a regular string
• The Substring overload with both a start and length
• The Length property
• The indexer returning a char

Hopefully that will give enough of an idea of what’s going on to be useful. Note that these aren’t production-quality implementations at all… none of the code has ever been run at all. I have made sure it compiles, so just be grateful for that :)

using System;
using System.Text;

public sealed class AsciiString
{

public AsciiString(string text)
{
// TODO: Rather more data validation etc!
data = Encoding.ASCII.GetBytes(text);
}

private AsciiString(byte[] data)
{
// This constructor is private: we can trust that it’s been called
// by code which isn’t going to modify the contents of the array
// afterwards.
this.data = data;
}

public AsciiString Substring(int startIndex, int length)
{
if (startIndex < 0 || startIndex > data.Length)
{
throw new ArgumentOutOfRangeException(“startIndex”);
}
if (startIndex + length > data.Length)
{
throw new ArgumentOutOfRangeException(“length”);
}
byte[] newData = new byte[length];
Buffer.BlockCopy(data, startIndex, newData, 0, length);
return new AsciiString(newData);
}

public int Length { get { return data.Length; } }

public char this[int position] { get { return (char) data[position]; } }
// etc…
}

Hopefully this is pretty straightforward – it’s meant to be the most “obvious” solution. Note that we’ve not got the nice locality of reference which the real String class has – it’s possible that the an AsciiString could end up with its backing array a long way away in memory, so a indexer operation for a single character could end up with three cache misses – one for the AsciiString object, one for part of the data array storing the length (for argument validation) and one for the part of the data array containing the character we’re looking for. That may be unlikely, and it’s not the kind of thing I normally think about – but it’s probably the kind of thing the BCL team pay a lot of attention to.

We get the same “immutable reference type” behaviour present in the normal string type, however – you can have a null AsciiString reference just as normal, any assignments will just be reference assignments, etc.

What about the size though? There are two objects to consider:

• The array, of size 12 + length or 24 + length (x86 and x64 respectively; rounded up to 4 or 8 bytes as well)
• The object itself, of size 12 or 24

So we’ve got a total size of 24 + length or 48 + length, depending on architecture. To show how the break-even point works, here’s a little table showing the sizes of string and AsciiString for various sizes on both architectures:

 Length string-x86 string-x64 AsciiString-x86 AsciiString-x64 0 16 32 24 48 1 16 32 28 56 2 20 32 28 56 3 20 32 28 56 4 24 40 28 56 5 24 40 32 56 6 28 40 32 56 7 28 40 32 56 8 32 48 32 56 9 32 48 36 64 10 36 48 36 64 … .. … … … 16 48 64 40 64 24 64 80 48 72 32 80 96 56 80

As you can see, the break-even point in x86 is at length 10; in x64 it’s at length 16. After that, we start winning – as we’d expect. The penalty for very small strings is quite hefty though – you’d really better hope you didn’t have lots of single-character strings, taking 56 bytes each in x64.

Let’s see if we can do better…

Option 2: A simple struct

A lot of the overhead here has come from the fact that we’ve got an object which only has a single field. The field is all we’re interested in… why are we bothering with all the overhead of the object? Let’s make it a struct instead, effectively inlining that field wherever we use the type. Assignment, passing arguments to methods etc will still only be copying a reference – it’s just the reference will be the byte array rather than a wrapper object.

It all sounds good, but there are two snags:

• The value can never be null; that at least diverges from the familiar string behaviour
• We won’t be able to prevent code from creating an instance of our struct with new AsciiString() – and that won’t be good.

We can actually pit these two downsides against each other by making the “default” value a pseudo-null value… we can even throw NullReferenceException just as if it were a reference type. We don’t even need to do any work in order to get that NullReferenceException – every member is going to use the data array anyway, and dereferencing that will automatically throw an exception. We might want to change things around a bit to make that the very first thing that can throw an exception, but that’s all.

It’s nasty, but it appeals very slightly. In an evil kind of way. It makes things slightly more familiar, but at the cost of being generally weird in other ways.

We still need to be able to check whether an AsciiString value is the logical null value. I’ll add an IsNull property for that purpose. (An alternative would be HasValue, but that would be confusing with Nullable<T>.)

Most of the code remains untouched – it looks like this:

public struct AsciiString
{
public bool IsNull { get { return data == null; } }

// Remainder of code as before
}

Now let’s look at the sizes, which should be a lot more favourable than before. Note that I had to change the size-checking code to create an array of type AsciiStruct[] instead of object[] to avoid boxing. Should we take the size of the array itself into consideration when computing the size of the AsciiString? We haven’t when working out the size of string… in each case the size of any individual element will be the size of a reference. For the table below, I haven’t included it… but bear in mind that this form of measurement would count the size of most value types (int etc) as 0. It just goes to show that when you talk about the size of a data type, you really need to be very precise in what you mean.

 Length string-x86 string-x64 AsciiString-x86 AsciiString-x64 0 16 32 12 24 1 16 32 16 32 2 20 32 16 32 3 20 32 16 32 4 24 40 16 32 5 24 40 20 32 6 28 40 20 32 7 28 40 20 32 8 32 48 20 32 9 32 48 24 40 10 36 48 24 40 … .. … … … 16 48 64 28 40 24 64 80 36 48 32 80 96 44 56

This time, unsurprisingly, AsciiString is always more space-efficient than the normal string. It just takes a certain amount of holding our noses. Speaking of which…

Option 3: Extension methods on byte[]

Suppose we really, really want to have “proper” null references. We don’t really need the struct. We could treat any byte array as an array of ASCII characters, with extension methods like this:

public static class ByteArrayExtensions
{
public static byte[] Substring(this byte[] data, int startIndex, int length)
{
if (startIndex < 0 || startIndex > data.Length)
{
throw new ArgumentOutOfRangeException(“startIndex”);
}
if (startIndex + length > data.Length)
{
throw new ArgumentOutOfRangeException(“length”);
}
byte[] newData = new byte[length];
Buffer.BlockCopy(data, startIndex, newData, 0, length);
return newData;
}
}

The size is the same as with option 2 – in both cases there’s just the byte array, basically. This option is truly horrible in many ways though:

• You can no longer tell in your code (just through typing) what’s meant to be an AsciiString and what isn’t
• Kiss immutability goodbye
• We can’t guarantee that all the characters will be valid ASCII any more
• We can’t add extension properties or extension indexers
• We can’t make it implement the interfaces we want it to

Obviously, we’d never take this route. I just thought I’d include it for a regular dose of evil-ness.

Conclusion

Implementing an ASCII-only string representation sounds like it should be an obvious win in terms of memory, at the cost of doing a lot of work that’s already been done for us in String. However, the most obvious implementation takes a while to break even in memory usage, compared with the normal string type, due to the “special” nature of string within the CLR. We can’t mimic the “stretchiness” of string ourselves. The BCL/CLR teams could, of course, if they really wanted to. I’m not holding my breath though.

If we’re dead keen on saving space at the cost of some design wonkiness, we can use a value type instead of a reference type. Other than nullity, it works pretty well… but you have all the disadvantages which go with value types, such as the unavoidable parameterless constructor and the need to watch out for boxing.

Aside from anything else, I hope this was useful as a delve into how much space objects actually take up in .NET – and as a way of highlighting the extra memory used when running in the x64 CLR.