Of memory and strings

This post was provoked by a recent Stack Overflow question which asked whether there was an efficient representation of ASCII strings in .NET.

In particular, the questioner wanted to store hundreds of thousands – possibly millions – of strings in memory, and knowing (or assuming) that they all consisted of ASCII characters, he wanted to avoid the waste of space that comes from storing each character in a .NET string as a UTF-16 code unit.

My answer to the question mostly consisted of saying that I didn’t think it would be worth the effort, and giving some reasons. But the more reasons I gave, the more I thought about the subtleties involved, and that it’s actually quite an interesting case study into memory use in .NET.

How are we going to measure object size?

If we’re going to work out any sort of benefit from a more compact string representation, we’ll need to be able to calculate how much memory our objects are taking to start with. Rather than work this out in a purely theoretical way, I’ve been running tests using code like this:

using System;

class Program
{
    static void Main(string[] args)
    {
        int size = 10000000;
        object[] array = new object[size];

        long before = GC.GetTotalMemory(true);
        for (int i = 0; i < size; i++)
        {
            array[i] = new object();
        }
        long after = GC.GetTotalMemory(true);

        double diff = after - before;

        Console.WriteLine("Per object: " + diff / size);

        // Stop the GC from messing up our measurements
        GC.KeepAlive(array);
    }
}

Obviously that doesn’t take into account factors such as memory used by JITting, or anything that could be going on in other threads, but by using a suitably large number of objects, and by performing the division in floating point arithmetic (to avoid a slight variation making an 11.99999 come out as an 11, when it’s really just a “12 with something else going on”), we can work out the size of objects pretty clearly. The sample above measures the size of a vanilla object, but the code can be adapted very easily.
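To show what that adaptation might look like, here’s a sketch of the same technique wrapped in a reusable helper (the `SizeMeasurer` name and the factory-delegate design are my own, not anything from the original test code):

```csharp
using System;

public static class SizeMeasurer
{
    // Measures the approximate per-instance size of whatever the
    // factory produces, using the same GC.GetTotalMemory technique
    // as the sample above.
    public static double Measure(Func<object> factory, int count)
    {
        object[] array = new object[count];

        long before = GC.GetTotalMemory(true);
        for (int i = 0; i < count; i++)
        {
            array[i] = factory();
        }
        long after = GC.GetTotalMemory(true);

        // Stop the GC from messing up our measurements
        GC.KeepAlive(array);
        return (after - before) / (double) count;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine("Per object: " +
            SizeMeasurer.Measure(() => new object(), 10000000));
    }
}
```

Measuring a different type is then just a matter of passing a different lambda, e.g. `SizeMeasurer.Measure(() => new byte[16], 1000000)`.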

The first important thing to point out is that C# doesn’t guarantee the results of this – it isn’t responsible for determining how all of an object is laid out in memory; that’s the runtime’s job. While there are attributes to affect the layout and padding of the data members of a type in memory, there are other aspects that are out of your control. In this post I won’t use any of the layout attributes – we’ll just use the defaults.

Not all runtimes are created equal, either. On my laptop I’ve got Mono 2.8, .NET 3.5 and .NET 4, with the two versions of .NET each having the 32-bit (x86) and 64-bit (x64) CLRs. For the sake of simplicity, I’m going to stick with .NET 4 for this post, but I’ll give results for both the x64 and x86 CLRs. To test each of them, I’m compiling with “/platform:x64” or “/platform:x86”.

Some simple starter measurements

Before I start creating my own types, let’s try a few built-in types, including strings:

| Type | x86 size | x64 size |
|------|----------|----------|
| object | 12 | 24 |
| object[] | 16 + length * 4 | 32 + length * 8 |
| int[] | 12 + length * 4 | 28 + length * 4 |
| byte[] | 12 + length | 24 + length |
| string | 14 + length * 2 | 26 + length * 2 |

Note that all the x86 sizes are rounded up to the nearest 4 bytes, and all x64 sizes are rounded up to the nearest 8 bytes.

The string numbers are interesting, because strings are the only non-array types in .NET which vary in size. A long string consists of a single large object in memory. Compare this with Java, where a String is a “normal” type in terms of memory consumption, containing an offset and length into a char array – so a long string consists of a small object referring to a large char array. This distinction will be very important when we come to build an AsciiString type. Another point about measuring string sizes is that it’s relatively tricky to measure the size of an empty string – because even if you use the "new string('x', 0)" constructor, the result is still cached – the same reference is returned each time.
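That caching is easy to observe directly (on the CLRs I’ve tried, at least – I’m not aware of it being guaranteed anywhere):

```csharp
using System;

class EmptyStringCaching
{
    static void Main()
    {
        string a = new string('x', 0);
        string b = new string('x', 0);

        // The constructor hands back the cached empty string rather
        // than allocating a new zero-length instance each time.
        Console.WriteLine(object.ReferenceEquals(a, b));
        Console.WriteLine(object.ReferenceEquals(a, string.Empty));
    }
}
```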

You might be forgiven for looking at the numbers above and thinking that the “overhead” of an object is 12 bytes in x86 and 24 in x64… but that’s not quite right. Let’s build our own straightforward classes and measure them…

Custom classes containing primitives

Here are six classes, all of which are measured with the same simple test code:

class Empty {}
class OneInt32 { int x; }
class TwoInt32s { int x, y; }
class ThreeInt32s { int x, y, z; }

class Mixed1
{
    int x;
    byte b1, b2, b3, b4;
    int y, z;
}

class Mixed2
{
    int x;
    byte b1;
    int y, z;
    byte b2, b3, b4;
}

The last case is mostly to check how the CLR handles an “awkward” class declaration, where the int variables won’t naturally be aligned on 4-byte boundaries. The results look odd at first, but we’ll make sense of them in a minute:

| Type | x86 size | x64 size |
|------|----------|----------|
| Empty | 12 | 24 |
| OneInt32 | 12 | 24 |
| TwoInt32s | 16 | 24 |
| ThreeInt32s | 20 | 32 |
| Mixed1 | 24 | 32 |
| Mixed2 | 24 | 32 |

A few interesting things to note here:

  • There’s a “base” overhead of 8 bytes per object in x86 and 16 per object in x64… given that we can store an Int32 of “real” data in x86 and still have an object size of 12, and likewise we can store two Int32s of real data in x64 and still have an object size of 24.
  • There’s a “minimum” size of 12 bytes and 24 bytes respectively. In other words, you can’t have a type which is just the overhead. Note how the “Empty” class takes up the same size as creating instances of Object… there’s effectively some spare room, because the CLR doesn’t like operating on an object with no data. (Note that a struct with no fields takes up space too, even for local variables.)
  • The x86 objects are padded to 4 byte boundaries; on x64 it’s 8 bytes (just as before)
  • By default, the CLR is happy to pack fields pretty densely – Mixed2 only took as much space as Mixed1. My guess is that it reorganized the in-memory representation so that the bytes all came after the ints… and that’s what a quick bit of playing around with unsafe pointers suggests too… but I’m not sufficiently comfortable with this sort of thing to say for sure. Frankly, I don’t care… so long as it all works, what we’re interested in is the overall size, not the precise layout.

So what does an ASCII string look like?

In this blog post I’m not actually going to implement an ASCII string at all (well, not much). I’m merely pointing out what the data structures would look like. However, it’s worth working out what desirable qualities it should have. As far as possible, it should feel like System.String. In particular:

  • It should be immutable.
  • It should have fast access to individual characters, and the length.
  • It should mostly “feel” like an immutable reference type, in that passing a value of type AsciiString around should be cheap, like copying a reference.
  • It should use as little memory as possible… less than the equivalent string, or it’s pointless.
    • One caveat to this: in theory that could mean storing 8 characters in every 7 bytes, as ASCII really only uses 7 bits per character. I’m not going to those extremes, but you can think about them if you want.
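For the curious, here’s roughly what that extreme would look like – packing 8 characters into every 7 bytes. This is only a sketch (the names are mine, and nothing later in the post uses it):

```csharp
using System;

public static class SevenBitPacking
{
    // Packs ASCII characters (7 significant bits each) into a byte
    // array, MSB first, so that 8 characters fit into 7 bytes.
    public static byte[] Pack(string text)
    {
        byte[] packed = new byte[(text.Length * 7 + 7) / 8];
        int bitPos = 0;
        foreach (char c in text)
        {
            for (int bit = 6; bit >= 0; bit--, bitPos++)
            {
                if ((c & (1 << bit)) != 0)
                {
                    packed[bitPos / 8] |= (byte) (1 << (7 - bitPos % 8));
                }
            }
        }
        return packed;
    }

    // Recovers the character at the given index. Still O(1), but with
    // a constant factor of seven bit tests per character.
    public static char Unpack(byte[] packed, int index)
    {
        int value = 0;
        for (int bitPos = index * 7; bitPos < index * 7 + 7; bitPos++)
        {
            value <<= 1;
            if ((packed[bitPos / 8] & (1 << (7 - bitPos % 8))) != 0)
            {
                value |= 1;
            }
        }
        return (char) value;
    }
}

class Demo
{
    static void Main()
    {
        byte[] packed = SevenBitPacking.Pack("Hello");
        Console.WriteLine(packed.Length + " bytes for 5 characters");
    }
}
```

As you can see, even the sketch is fiddly – which is part of why I’m not going to those extremes in the rest of the post.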

We’re going to store the characters as a byte array. We have three options as to exactly how we handle that byte array:

  • We could go the Java way, where several strings share references to the same array. Each string then has an offset and a length to say which bit of the array they’re interested in.
    • Pros: Substring becomes really cheap
    • Cons: You can end up having just a tiny substring responsible for keeping a huge character array alive
  • We could go the .NET way, where each string has its own character data, but the buffer may be longer than necessary… so it stores the length too. (A bit like a List<T>.)
    • Pros: Can potentially make building strings cheap, if you just keep whatever potentially oversized buffer you’ve already got.
    • Cons: Wasted space for the unused part of the array, and a field for the length.
  • We could just have a byte array of exactly the right size – and it already knows its size.

I’m going to assume the third option here. So all the data our type needs is a byte array. That’s going to be pretty cheap… we hope. Let’s look at what we can build.

Option 1: A simple class

To give a flavour of the implementation, I’ve decided to implement four members for each option:

  • A way of creating an AsciiString from a regular string
  • The Substring overload with both a start and length
  • The Length property
  • The indexer returning a char

Hopefully that will give enough of an idea of what’s going on to be useful. Note that these aren’t production-quality implementations at all… none of the code has ever been run at all. I have made sure it compiles, so just be grateful for that :)

using System;
using System.Text;

public sealed class AsciiString
{
    private readonly byte[] data;

    public AsciiString(string text)
    {
        // TODO: Rather more data validation etc!
        data = Encoding.ASCII.GetBytes(text);
    }

    private AsciiString(byte[] data)
    {
        // This constructor is private: we can trust that it’s been called
        // by code which isn’t going to modify the contents of the array
        // afterwards.
        this.data = data;
    }

    public AsciiString Substring(int startIndex, int length)
    {
        if (startIndex < 0 || startIndex > data.Length)
        {
            throw new ArgumentOutOfRangeException("startIndex");
        }
        if (length < 0 || startIndex + length > data.Length)
        {
            throw new ArgumentOutOfRangeException("length");
        }
        byte[] newData = new byte[length];
        Buffer.BlockCopy(data, startIndex, newData, 0, length);
        return new AsciiString(newData);
    }

    public int Length { get { return data.Length; } }

    public char this[int position] { get { return (char) data[position]; } }
    // etc…
}

Hopefully this is pretty straightforward – it’s meant to be the most “obvious” solution. Note that we’ve not got the nice locality of reference which the real String class has – it’s possible that an AsciiString could end up with its backing array a long way away in memory, so an indexer operation for a single character could end up with three cache misses – one for the AsciiString object, one for the part of the data array storing the length (for argument validation) and one for the part of the data array containing the character we’re looking for. That may be unlikely, and it’s not the kind of thing I normally think about – but it’s probably the kind of thing the BCL team pay a lot of attention to.

We get the same “immutable reference type” behaviour present in the normal string type, however – you can have a null AsciiString reference just as normal, any assignments will just be reference assignments, etc.

What about the size though? There are two objects to consider:

  • The array, of size 12 + length or 24 + length (x86 and x64 respectively; rounded up to 4 or 8 bytes as well)
  • The object itself, of size 12 or 24

So we’ve got a total size of 24 + length or 48 + length, depending on architecture. To show how the break-even point works, here’s a little table showing the sizes of string and AsciiString for various sizes on both architectures:

| Length | string x86 | string x64 | AsciiString x86 | AsciiString x64 |
|--------|------------|------------|-----------------|-----------------|
| 0 | 16 | 32 | 24 | 48 |
| 1 | 16 | 32 | 28 | 56 |
| 2 | 20 | 32 | 28 | 56 |
| 3 | 20 | 32 | 28 | 56 |
| 4 | 24 | 40 | 28 | 56 |
| 5 | 24 | 40 | 32 | 56 |
| 6 | 28 | 40 | 32 | 56 |
| 7 | 28 | 40 | 32 | 56 |
| 8 | 32 | 48 | 32 | 56 |
| 9 | 32 | 48 | 36 | 64 |
| 10 | 36 | 48 | 36 | 64 |
| … | | | | |
| 16 | 48 | 64 | 40 | 64 |
| 24 | 64 | 80 | 48 | 72 |
| 32 | 80 | 96 | 56 | 80 |

As you can see, the break-even point in x86 is at length 10; in x64 it’s at length 16. After that, we start winning – as we’d expect. The penalty for very small strings is quite hefty though – you’d really better hope you didn’t have lots of single-character strings, taking 56 bytes each in x64.
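The table can be reproduced straight from the formulas we’ve already derived – the constants come from the measurements earlier, with rounding to 4-byte (x86) or 8-byte (x64) boundaries, and the AsciiString figures add the 12 or 24 bytes for the wrapper object (a sketch; the method names are mine):

```csharp
using System;

public static class BreakEven
{
    // Rounds a raw size up to the nearest multiple: 4 bytes on x86,
    // 8 bytes on x64.
    public static int RoundUp(int size, int multiple)
    {
        return (size + multiple - 1) / multiple * multiple;
    }

    // string: 14 + 2 * length (x86), 26 + 2 * length (x64), rounded.
    public static int StringX86(int length) { return RoundUp(14 + 2 * length, 4); }
    public static int StringX64(int length) { return RoundUp(26 + 2 * length, 8); }

    // AsciiString: the byte array plus the 12/24-byte wrapper object.
    public static int AsciiX86(int length) { return RoundUp(12 + length, 4) + 12; }
    public static int AsciiX64(int length) { return RoundUp(24 + length, 8) + 24; }

    static void Main()
    {
        for (int length = 0; length <= 32; length++)
        {
            Console.WriteLine("{0}: {1} {2} {3} {4}", length,
                StringX86(length), StringX64(length),
                AsciiX86(length), AsciiX64(length));
        }
    }
}
```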

Let’s see if we can do better…

Option 2: A simple struct

A lot of the overhead here has come from the fact that we’ve got an object which only has a single field. The field is all we’re interested in… why are we bothering with all the overhead of the object? Let’s make it a struct instead, effectively inlining that field wherever we use the type. Assignment, passing arguments to methods etc will still only be copying a reference – it’s just the reference will be the byte array rather than a wrapper object.

It all sounds good, but there are two snags:

  • The value can never be null; that at least diverges from the familiar string behaviour
  • We won’t be able to prevent code from creating an instance of our struct with new AsciiString() – and that won’t be good.

We can actually pit these two downsides against each other by making the “default” value a pseudo-null value… we can even throw NullReferenceException just as if it were a reference type. We don’t even need to do any work in order to get that NullReferenceException – every member is going to use the data array anyway, and dereferencing that will automatically throw an exception. We might want to change things around a bit to make that the very first thing that can throw an exception, but that’s all.

It’s nasty, but it appeals very slightly. In an evil kind of way. It makes things slightly more familiar, but at the cost of being generally weird in other ways.

We still need to be able to check whether an AsciiString value is the logical null value. I’ll add an IsNull property for that purpose. (An alternative would be HasValue, but that would be confusing with Nullable<T>.)

Most of the code remains untouched – it looks like this:

public struct AsciiString
{
    private readonly byte[] data;
    public bool IsNull { get { return data == null; } }

    // Remainder of code as before
}
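Here’s a quick sketch of how that pseudo-null value behaves in practice (with just enough of the struct repeated to be self-contained):

```csharp
using System;
using System.Text;

public struct AsciiString
{
    private readonly byte[] data;

    public AsciiString(string text)
    {
        data = Encoding.ASCII.GetBytes(text);
    }

    public bool IsNull { get { return data == null; } }

    // Dereferencing the null data array throws NullReferenceException
    // for us - no explicit check required.
    public int Length { get { return data.Length; } }
}

class Demo
{
    static void Main()
    {
        AsciiString s = default(AsciiString);
        Console.WriteLine(s.IsNull);

        try
        {
            int ignored = s.Length;
        }
        catch (NullReferenceException)
        {
            // Just what a null string reference would have done.
            Console.WriteLine("Threw NullReferenceException");
        }
    }
}
```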

Now let’s look at the sizes, which should be a lot more favourable than before. Note that I had to change the size-checking code to create an array of type AsciiString[] instead of object[] to avoid boxing. Should we take the size of the array itself into consideration when computing the size of the AsciiString? We haven’t when working out the size of string… in each case the size of any individual element will be the size of a reference. For the table below, I haven’t included it… but bear in mind that this form of measurement would count the size of most value types (int etc) as 0. It just goes to show that when you talk about the size of a data type, you really need to be very precise in what you mean.

| Length | string x86 | string x64 | AsciiString x86 | AsciiString x64 |
|--------|------------|------------|-----------------|-----------------|
| 0 | 16 | 32 | 12 | 24 |
| 1 | 16 | 32 | 16 | 32 |
| 2 | 20 | 32 | 16 | 32 |
| 3 | 20 | 32 | 16 | 32 |
| 4 | 24 | 40 | 16 | 32 |
| 5 | 24 | 40 | 20 | 32 |
| 6 | 28 | 40 | 20 | 32 |
| 7 | 28 | 40 | 20 | 32 |
| 8 | 32 | 48 | 20 | 32 |
| 9 | 32 | 48 | 24 | 40 |
| 10 | 36 | 48 | 24 | 40 |
| … | | | | |
| 16 | 48 | 64 | 28 | 40 |
| 24 | 64 | 80 | 36 | 48 |
| 32 | 80 | 96 | 44 | 56 |

This time, unsurprisingly, AsciiString is always more space-efficient than the normal string. It just takes a certain amount of holding our noses. Speaking of which…

Option 3: Extension methods on byte[]

Suppose we really, really want to have “proper” null references. We don’t really need the struct. We could treat any byte array as an array of ASCII characters, with extension methods like this:

public static class ByteArrayExtensions
{
    public static byte[] Substring(this byte[] data, int startIndex, int length)
    {
        if (startIndex < 0 || startIndex > data.Length)
        {
            throw new ArgumentOutOfRangeException("startIndex");
        }
        if (length < 0 || startIndex + length > data.Length)
        {
            throw new ArgumentOutOfRangeException("length");
        }
        byte[] newData = new byte[length];
        Buffer.BlockCopy(data, startIndex, newData, 0, length);
        return newData;
    }
}

The size is the same as with option 2 – in both cases there’s just the byte array, basically. This option is truly horrible in many ways though:

  • You can no longer tell in your code (just through typing) what’s meant to be an AsciiString and what isn’t
  • Kiss immutability goodbye
  • We can’t guarantee that all the characters will be valid ASCII any more
  • We can’t add extension properties or extension indexers
  • We can’t make it implement the interfaces we want it to

Obviously, we’d never take this route. I just thought I’d include it for a regular dose of evil-ness.

Conclusion

Implementing an ASCII-only string representation sounds like it should be an obvious win in terms of memory, at the cost of doing a lot of work that’s already been done for us in String. However, the most obvious implementation takes a while to break even in memory usage, compared with the normal string type, due to the “special” nature of string within the CLR. We can’t mimic the “stretchiness” of string ourselves. The BCL/CLR teams could, of course, if they really wanted to. I’m not holding my breath though.

If we’re dead keen on saving space at the cost of some design wonkiness, we can use a value type instead of a reference type. Other than nullity, it works pretty well… but you have all the disadvantages which go with value types, such as the unavoidable parameterless constructor and the need to watch out for boxing.

Aside from anything else, I hope this was useful as a delve into how much space objects actually take up in .NET – and as a way of highlighting the extra memory used when running in the x64 CLR.

44 thoughts on “Of memory and strings”

  1. For even more evilness, there is another trick you have missed.

    In your string class/struct you can include 4 byte values which are the first 4 bytes of the string (or 8 byte values if you want to optimize for x64).

    This way if your string is short you do not need the array at all.

    This saves enough memory that you can turn your string back into a class and still be more efficient than System.String in all cases except the empty string. But in this case you only need one instance, right?
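Something like this, perhaps (just a sketch – the names and exact layout are invented):

```csharp
using System;
using System.Text;

public sealed class CompactAsciiString
{
    private readonly int length;
    private readonly byte b0, b1, b2, b3;   // first four characters, stored inline
    private readonly byte[] rest;           // null when length <= 4: no array at all

    public CompactAsciiString(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        length = bytes.Length;
        if (length > 0) b0 = bytes[0];
        if (length > 1) b1 = bytes[1];
        if (length > 2) b2 = bytes[2];
        if (length > 3) b3 = bytes[3];
        if (length > 4)
        {
            // Only the fifth character onwards needs the array.
            rest = new byte[length - 4];
            Buffer.BlockCopy(bytes, 4, rest, 0, length - 4);
        }
    }

    public int Length { get { return length; } }

    public char this[int index]
    {
        get
        {
            if (index < 0 || index >= length)
            {
                throw new ArgumentOutOfRangeException("index");
            }
            switch (index)
            {
                case 0: return (char) b0;
                case 1: return (char) b1;
                case 2: return (char) b2;
                case 3: return (char) b3;
                default: return (char) rest[index - 4];
            }
        }
    }
}
```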


  2. @Oliver: Ooh, that’s nice. I like it. Sounds good to me. Too tired to update the post with that idea though :)

    (And the code *would* end up being pretty messy…)


  3. I recall that in a Channel 9 interview someone from the CLR team said they’re considering a more compact representation of System.String in a future version of CLR — using byte[]-like representation in System.String for ASCII-only strings. Can’t remember exactly which show/who it was. If that were in the system, we wouldn’t have to invent anything to take advantage of it…in the future.

    BTW, object size in CLR can be measured with SOS extension via !dumpobj command, which is exact and needs no guesswork. Using SOS in either Visual Studio or WinDbg wouldn’t be all that hard. Don’t know if there’s any equivalent for Mono, though.
    http://msdn.microsoft.com/en-us/library/bb190764.aspx


  4. @Andrew: I specifically talk about that just before the table in option 2. If we were going to include the array there, we should also include the same for the references we need to for the *class* variables. In both cases, the space required is the same (the size of a reference) so I omitted it here, but noted the omission.


  5. Why not using a trie or even better a Huffman/prefix code? That would make for a very compact memory representation.

    I admit I haven’t read the article to the end, so I may be guilty of having missed a requirement.


  6. @Francesco: I can’t see how a trie would be useful here… I normally think of a trie as a way of performing prefix mappings… not storing a single string.

    While it would be possible to compress each string internally, it would affect performance (goodbye O(1) indexing etc) and I suspect that for small strings it would do more harm than good.


  7. @skeet: The trie could be used to compress the whole string set vs. making every individual string small by itself.

    But now that I’ve read the SO question I understand it doesn’t make sense in this case (to start with the OP doesn’t have a plain string list but a set of records).


  8. It would be interesting to see a bit more description of how we could go “.NET way” in the implementation. As I understand, strings store reference to their “first character”, but the rest of chars are just stored aligned to it. It’s like C array (without the overhead). We would then probably need to use GCHandle to keep this memory pinned, which could result in memory fragmentation, etc., etc.


  9. @configurator: Sorry I didn’t reply to your comment before: yes, every ASCII character has the same value in Unicode, so the conversion is fine.

    @Appu: No, it’s meant to represent a string of ASCII characters. The byte array used internally is an implementation detail. Note how the indexer returns a char rather than a byte.

    @Francesco: Yes, if we wanted to store a set of strings together, that sort of thing would make a lot more sense. An interesting problem, but not the one I was trying to solve this time :)


  10. For some reason I remembered that when you use the String(char, length) constructor it doesn’t really allocate length chars. But either I misremembered or the behaviour has since changed.

new string('x', int.MaxValue); // this throws OutOfMemoryException although I was sure it wouldn’t


  11. @Oliver, @skeet
    (referring to the first comment)
    How would you implement such a class without introducing new instance variables?
    New instance variables would consume extra space.


  12. Reading the Stackoverflow question, I think the key to his problem is that each record has many string fields… Could you take advantage of this?


  13. @OAlbrecht: I’m slightly confused about Oliver’s suggestion as well. Introducing yet another byte array for only the first 4 bytes? But we still would need the other big byte array for a case when it’s not actually less than 4 bytes – so why would that be better?

    I really seem to be missing something.


  14. @ShdNx: Oliver wasn’t suggesting another *array*. He was suggesting 4 or 8 byte variables “inline” in the object… so that we can avoid the overhead of creating an array at all, if the string is very short.

    (If we *are* creating an array, it could then be 4 bytes shorter than it would otherwise be… although it would complicate things significantly.)


  15. @skeet: ah, I see. Thanks!

    That would lead to some… hmm… with your choice of words: evil code, though I certainly see its advantages. Very cool! :)


  16. I’ve had to solve a problem very close to this in the past. What I’ve done is store all the character data as UTF8 encoded in a separate byte array, and store the indexes into that array in a separate bit-packed array. So let’s say we don’t have any more than 1,000,000 strings; we don’t need more than 20 bits for that. So I’d store the index into the byte array in a separate byte array which used calculations, bit shifting and masks to essentially act as an array of 20-bit integers. Similarly for length; let’s say none are longer than 100, so 7 bits are sufficient.

    This way, you pay no overhead at all per-string, just a fixed overhead; to refer to any given string, you just need its index, which may itself not even be 32 bits, if needed.

    It’s awkward to go much further without actually compressing the text data, but that can be a win and still have amortized O(1) indexing etc. – e.g. take the text in chunks of 100 items, and compress those chunks, and when indexing find out which chunk to decompress and index into it, etc.
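The bit-packed index array part could be sketched something like this (a simplified, general-width version – the names are invented, and the real thing would hard-code the 20-bit and 7-bit widths):

```csharp
using System;

// An array of n-bit unsigned integers stored densely in a byte array,
// so e.g. 1,000,000 20-bit indexes take 2.5 MB instead of 4 MB.
public class PackedIntArray
{
    private readonly byte[] data;
    private readonly int bits;

    public PackedIntArray(int count, int bits)
    {
        this.bits = bits;
        data = new byte[(count * bits + 7) / 8];
    }

    public int this[int index]
    {
        get
        {
            int value = 0;
            int bitPos = index * bits;
            for (int i = 0; i < bits; i++, bitPos++)
            {
                value <<= 1;
                if ((data[bitPos / 8] & (1 << (7 - bitPos % 8))) != 0)
                {
                    value |= 1;
                }
            }
            return value;
        }
        set
        {
            // Write the value MSB-first, clearing or setting each bit.
            int bitPos = index * bits;
            for (int i = bits - 1; i >= 0; i--, bitPos++)
            {
                int mask = 1 << (7 - bitPos % 8);
                if ((value & (1 << i)) != 0)
                {
                    data[bitPos / 8] |= (byte) mask;
                }
                else
                {
                    data[bitPos / 8] &= (byte) ~mask;
                }
            }
        }
    }
}

class Demo
{
    static void Main()
    {
        var offsets = new PackedIntArray(10, 20);
        offsets[3] = 123456;
        Console.WriteLine(offsets[3]);
    }
}
```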


  17. “The field is all we’re interested in… why are we bothering with all the overhead of the object? Let’s make it a struct instead, effectively inlining that field wherever we use the type. Assignment, passing arguments to methods etc will still only be copying a reference – it’s just the reference will be the byte array rather than a wrapper object.”

    I did not know that a struct behaves like that (although I thought I knew the difference between value types and reference types). I did some Google searches but I cannot find a mention of it.
    Could you please elaborate on this a little more? How is it “effectively inlining that field”?


  18. As @Francesco suggested some type of compression might make sense in usage of large strings. So just as @skeet said with option 1 there will be a break even point where it makes sense. This is pretty common in algorithms so once that particular break even point is found for strings in memory, where the advantages out weigh the penalties, it could be included. O(1) probably isn’t something likely to be done on your string when the size gets very large.


  19. So:

    string | 14 + length * 2 | 26 + length * 2

    implies the string type is in fact an array with special C# syntax? That would be an interesting article to explore the inner workings of the string class at the IL level.

    Thanks for the article.


  20. Presumably there would be a lot more on the AsciiString interface and in its associated infrastructure – for example we need sugar to simplify the interop with native code (which presumably would be needed) and all sorts of things to aid with the manipulation of such strings (for example *trim and AsciiStringBuilder etc.). And some operators to make things like += work. Then there are the comparers and converters and all manner of more tangentially associated code. Suddenly you have a massive project on your hands. On the plus side I don’t see .NET putting too many barriers in your way – it does look possible to create such an immutable type and have most or all of the support that a built-in String provides.


  21. @Jonathan: Unless required, I wouldn’t worry about native code interop to be honest.

    Operators and such are fine, although note that String itself doesn’t supply a + operator – it relies on the C# compiler (and presumably other languages) using String.Concat appropriately.

    The interesting thing here (IMO) is that we can’t really mimic String’s memory allocation.


  22. Regarding your “mixed” custom classes, I have suspected in other contexts that the CLR stores fields in alphabetical order, by name. If true, your mixed examples are not really so mixed because you named your byte fields b1, b2, … and your int fields x, y, z.


  23. @Chris: That’s a good point. If I get time, I’ll try rerunning the tests with different variable names. I really hope the CLR is smarter than that though :)


