Object pooling and thread safety

A few days ago, I was watching Dustin Campbell’s excellent talk from TechEd 2013, “Essential truths everyone should know about performance in a large managed code base”. I highly recommend it, and it made me think a bit more about Noda Time’s performance. (It doesn’t take much to make me think about Noda Time’s performance, admittedly.)

Base situation: no pooling

Most of our formatting goes through something called SteppedPattern – that’s basically one multicast delegate (with lots of actions) for formatting, and a list of delegates for parsing. The Format method was really simple before this week:

public string Format(TResult value)
{
    StringBuilder builder = new StringBuilder();
    // This will call all the actions in the multicast delegate.
    formatActions(value, builder);
    return builder.ToString();
}

Very simple – but it allocates a new StringBuilder on each call, which is somewhat unnecessary. Allocation is cheap, but not free – and string formatting is something which could reasonably happen an awful lot.

In the talk, Dustin showed an example where introducing a single object pool – also for StringBuilder – had made a marked difference. Unfortunately, in my case it’s not as simple as we might like. Or at least, I’m not confident that it is. Let’s look at the options I’ve considered so far – in no particular order.

Option 1: Lock-free using Interlocked

After construction, a formatter can be used on multiple threads, so some care is needed to make sure we don’t end up using one StringBuilder from multiple threads at a time. Interlocked.Exchange makes this reasonably straightforward:

private StringBuilder cachedBuilder = new StringBuilder();
...
public string Format(TResult value)
{
    // Try to borrow the cached builder, in a thread-safe way.
    // (If another thread is using the builder, this will return null.)
    StringBuilder builder = Interlocked.Exchange(ref cachedBuilder, null);
    if (builder == null)
    {
        builder = new StringBuilder();
    }
    else
    {
        builder.Length = 0;
    }
    // This will call all the actions in the multicast delegate.
    formatActions(value, builder);
    string ret = builder.ToString();
    // Whether we managed to borrow the cached builder or not,
    // we can replace it with the one we've used, now that we
    // know what to return.
    Volatile.Write(ref cachedBuilder, builder);
    return ret;
}

(See comments for discussion on the Volatile.Write call, which was previously just an assignment statement, along with the use of Interlocked.Exchange compared with an earlier version using Interlocked.CompareExchange.)
I’m pretty confident that this does achieve the goal of only using the builder in one thread at a time. If an exception is thrown while formatting, that’s fine – we don’t end up putting anything back in the cache, and the next caller will allocate a new one.
I don’t believe we need to use Exchange a second time when writing to cachedBuilder – we’re certainly fine if it takes a while for that write to be seen by other threads, as the only downside would be not using the cached builder when we might want to.

But is that enough to make this code thread-safe? Is it possible that a write from one thread would only be seen in another thread after the second thread has borrowed the builder? Is it possible that the write to cachedBuilder actually occurs before the call to builder.ToString() completes, allowing another thread to start messing with the builder before we’ve got our result?

My guess is that it’s actually okay – but lock-free code fundamentally scares me, and I’m very cautious about writing it. (Elsewhere in Noda Time we do have some lock-free mutable data – but usually in the form of array-based caches where the element type is guaranteed to have atomic writes, and is immutable.)

So while this is “clever” and may be fine, it makes me nervous, which is not a good place to be.
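As an aside, the array-based cache pattern mentioned above can be sketched roughly as follows. This is a hypothetical illustration with invented names, not the actual Noda Time code: because reference writes are atomic and the cached values are immutable, the worst case under contention is that two threads compute the same value and one write wins.

```csharp
using System;

// Hypothetical sketch of a lock-free array-based cache of immutable values.
// Reference assignment is atomic in .NET, so a reader never sees a
// half-written reference; this relies on the immutability of the values
// (and on the .NET memory model's ordering of writes) for safe publication.
public sealed class FixedArrayCache<T> where T : class
{
    private readonly T[] slots;
    private readonly Func<int, T> factory;

    public FixedArrayCache(int size, Func<int, T> factory)
    {
        this.slots = new T[size];
        this.factory = factory;
    }

    public T Get(int index)
    {
        // A stale read just means we recompute unnecessarily; the result
        // is an equal immutable value, so correctness is unaffected.
        T value = slots[index];
        if (value == null)
        {
            value = factory(index);
            slots[index] = value;
        }
        return value;
    }
}
```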

Option 2: Locking

I’m much more confident about using threads when locking is involved. I find it easier to reason about what’s allowed and what’s not.

private readonly StringBuilder pooledStringBuilder = new StringBuilder();
private bool pooledBuilderInUse = false;
...

public string Format(TResult value)
{
    StringBuilder builder = null;
    lock (pooledStringBuilder)
    {
        if (!pooledBuilderInUse)
        {
            pooledBuilderInUse = true;
            builder = pooledStringBuilder;
            builder.Length = 0;
        }
    }
    try
    {
        builder = builder ?? new StringBuilder();
        // This will call all the actions in the multicast delegate.
        formatActions(value, builder);
        return builder.ToString();
    }
    finally
    {
        if (builder == pooledStringBuilder)
        {
            lock (pooledStringBuilder)
            {
                pooledBuilderInUse = false;
            }
        }
    }
}

Unlike the previous solution, this time we always know what the builder is (and never change it) – that gives us something to lock on. What we change within the lock is whether or not the builder is in use. This should be safe across multiple threads, but it’s ever so verbose – and in my benchmarks, the results weren’t terribly clear.

Option 3: ThreadStatic

The obvious alternative to sharing one builder between multiple threads is to have a single builder per thread. There are two approaches to this in .NET: [ThreadStatic] and ThreadLocal<T>.

ThreadStatic is the older mechanism, and is simply a matter of decorating a static field with the [ThreadStatic] attribute. Each thread has a separate variable, as if by magic. So in principle, the implementation just looks like this:

[ThreadStatic]
private static StringBuilder cachedBuilder;

public string Format(TResult value)
{
    StringBuilder builder = cachedBuilder;
    if (builder == null)
    {
        builder = new StringBuilder();
        cachedBuilder = builder;
    }
    else
    {
        builder.Length = 0;
    }
    // This will call all the actions in the multicast delegate.
    formatActions(value, builder);
    return builder.ToString();
}

Unfortunately we still need to check for nullity on every call, as any initialization expression for the variable will only be executed by the first thread which uses the class. Still, the code is reasonably simple.

There’s a big “gotcha” with this though: our code is now not re-entrant. If one SteppedPattern has a format action which calls format on another SteppedPattern, we’ll end up trying to use the same builder for both operations, resetting the builder in the middle.

Fortunately, SteppedPattern already has a method (FormatPartial) designed for calling one formatter within another. In the very few places where this is required, I’m confident that it’s being called correctly – but I’m not keen on making my code fragile like this. It’s fairly safe from third party intervention – there are very few places (if any) where formatting code calls out to any other code which could include a format call. Just writing this paragraph, I’ve come up with a couple of worrying parts (fetching values from a CultureInfo, and fetching the ID of a time zone) but I think they’re safe… the question is whether I’ve missed something similar in another case. Either way, it would have to be pretty pathological code to cause a problem, and I consider the chances of this being a problem in the real world to be essentially zero.

However, I still want to guard against accidental re-entrancy within Noda Time itself – so I’ve got a debug-only version of Format which performs a crude re-entrancy check using another [ThreadStatic] variable. (And I’ve got a debug-only test to check that the check works, of course.) So long as my unit tests all still pass in a debug configuration, I’m reasonably confident. (You can see the check and the test in the revision that added this code. It may be gone in the latest version, of course – I’m still on the fence about all of this.)
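A sketch of what such a guard could look like (the field and message names here are invented; the real check lives in the Noda Time repository):

```csharp
using System;
using System.Text;

// Hypothetical sketch of a SteppedPattern-like formatter with a
// debug-only re-entrancy check around the [ThreadStatic] builder.
public sealed class SteppedPatternSketch<TResult>
{
    private readonly Action<TResult, StringBuilder> formatActions;

    [ThreadStatic]
    private static StringBuilder cachedBuilder;

#if DEBUG
    // Crude guard: set while a format is in progress on this thread.
    [ThreadStatic]
    private static bool isFormatting;
#endif

    public SteppedPatternSketch(Action<TResult, StringBuilder> formatActions)
    {
        this.formatActions = formatActions;
    }

    public string Format(TResult value)
    {
#if DEBUG
        if (isFormatting)
        {
            throw new InvalidOperationException("Re-entrant Format call detected");
        }
        isFormatting = true;
        try
        {
#endif
            StringBuilder builder = cachedBuilder ?? (cachedBuilder = new StringBuilder());
            builder.Length = 0;
            formatActions(value, builder);
            return builder.ToString();
#if DEBUG
        }
        finally
        {
            isFormatting = false;
        }
#endif
    }
}
```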

There’s a second issue with [ThreadStatic], which is its level of support in the PCL. We just recently updated our PCL profiles, and I think using [ThreadStatic] would have been a problem if we’d continued to support Windows Phone 7, as that didn’t include [ThreadStatic] as far as I can tell. We’re fine in the current profile, but this is the first time I’ve knowingly added code which prevents us from going back to a WP7-enabled profile.

Option 4: ThreadLocal

ThreadLocal is a more recent solution to the thread-local-storage problem. It’s generally a cleaner solution:

  • The variable doesn’t have to be static
  • You can specify an initialization delegate to be executed on first use in any particular thread
  • It’s clear from the type itself what you’re doing, rather than using an attribute which makes a variable act completely differently from normal

That means we can have one StringBuilder per formatter per thread (if we want; we could still make it a static variable if we wanted exactly one StringBuilder for each thread), and simplify the code within the Format method:

private ThreadLocal<StringBuilder> cachedBuilder =
    new ThreadLocal<StringBuilder>(() => new StringBuilder());

public string Format(TResult value)
{
    StringBuilder builder = cachedBuilder.Value;
    builder.Length = 0;
    // This will call all the actions in the multicast delegate.
    formatActions(value, builder);
    return builder.ToString();
}

However, I haven’t tested this code, as ThreadLocal<T> requires .NET 4.0 – whereas Noda Time targets .NET 3.5.

Conclusion

You may have noticed that I haven’t provided any performance figures in this post. That’s deliberate, because the results I’ve seen have had too much variance so far. The basic upshot of this is that I don’t yet have enough evidence of the benefit to justify the complexity – it’s just one StringBuilder per format operation, after all, and there’s plenty of other work going on at the same time. I believe there is a benefit, but it’s pretty marginal – and the level of benefit depends on which of the above options is taken.
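The benchmarks mentioned here are, in essence, Stopwatch loops along the following lines (illustrative only – the method, iteration counts, and formatting work are invented stand-ins, and the real measurements use Noda Time’s own benchmark infrastructure):

```csharp
using System;
using System.Diagnostics;
using System.Text;

static class FormatBenchmark
{
    // Stand-in for a pattern's format actions: allocate, append, ToString.
    internal static string FormatWithNewBuilder(int value)
    {
        StringBuilder builder = new StringBuilder();
        builder.Append(value);
        return builder.ToString();
    }

    static void Main()
    {
        const int iterations = 1000000;
        // Warm up the JIT before timing anything.
        for (int i = 0; i < 1000; i++)
        {
            FormatWithNewBuilder(i);
        }
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            FormatWithNewBuilder(i);
        }
        stopwatch.Stop();
        Console.WriteLine("New builder per call: {0}ms", stopwatch.ElapsedMilliseconds);
    }
}
```

Runs of loops like this show a lot of variance from GC timing and other work on the machine, which is exactly why the figures haven’t been conclusive.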

I’m intrigued by the first option though, and whether experts would consider it to be safe. I still want to see a really good specification for the .NET memory model (which I know to be stronger than the one in the ECMA CLI specification) – I know it’s something that Microsoft has wanted to get round to writing up, but I don’t know if it’s actually happened… I fully expect to see contradictory opinions in comments :)

52 thoughts on “Object pooling and thread safety”

  1. The approach I took in C# a few years ago (with the aim of avoiding garbage collection pauses) was to use ThreadStatic and a Stack for the objects in the pool. This adds some small amount of overhead, but allows a lot of flexibility. It also has the bonus of growing up to your needs, because if you use more than one object of that type on that thread, you’re likely to do it again as code rarely runs just once.


  2. Lovely article and blog. It looks good. I’m seeing an issue with your rendering in the latest Chrome. The code isn’t indented, the code for “Option 3” isn’t formatted, and I see “##Conclusion” instead of the rendered version. I’d love to hear how the mark-down blog is working out on WordPress.


  3. Lovely article and blog. It looks good. I’m seeing an issue with your rendering in the latest Chrome. None of the code is indented, the code for “Option 3” isn’t formatted, and I see “##Conclusion” instead of the rendered version. I’d love to hear how the mark-down blog is working out on WordPress. Any plugins?

    I used the Customizr theme for my code blog. It is nice and wide with a clean look.


    1. The indentation issue was due to the wordpress.com editor just collapsing it all – I’ve fixed that now. Likewise the “##Conclusion” was due to a heading notation that apparently WP doesn’t support.

      Will have a look at Customizr.


        1. No, I’m already using Markdown – but there are a few quirks to the WP implementation. (And I can’t change which plugins I use, given that it’s hosted wordpress.com.)


  4. Does the allocation cost as much as all the copying? I once had some code with several layers of format methods. I got a huge speed-up by providing ‘void AppendFormat(StringBuilder)’ methods. That way an entire hierarchy of calls could all share the same builder, with a single final ToString occurring at the end at the top level. I still quite like that API. I must admit I’m slightly terrified of your proposed solutions – I’d want to see miraculous benchmark results to feel the complexity and risk from all this shared mutable state was justified.
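    The commenter’s approach can be sketched like this (hypothetical types and values, just to illustrate the shape – only the outermost caller allocates a builder and calls ToString):

    ```csharp
    using System.Text;

    // Each layer appends into the caller's builder instead of allocating its own.
    class TimePart
    {
        private readonly int hours, minutes;
        public TimePart(int hours, int minutes) { this.hours = hours; this.minutes = minutes; }

        public void AppendFormat(StringBuilder builder)
        {
            builder.Append(hours.ToString("00")).Append(':').Append(minutes.ToString("00"));
        }
    }

    class Timestamp
    {
        private readonly TimePart time = new TimePart(13, 5);

        public void AppendFormat(StringBuilder builder)
        {
            builder.Append('T');
            time.AppendFormat(builder);  // shares the same builder
        }

        // A single allocation and a single ToString, at the top level only.
        public override string ToString()
        {
            StringBuilder builder = new StringBuilder();
            AppendFormat(builder);
            return builder.ToString();
        }
    }
    ```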


  5. Interlocked.Exchange() is overkill — all you need is a store barrier before the write to cachedBuilder. The cheapest way to get that is to use Volatile.Write(ref cachedBuilder, builder); (or just mark the field as volatile)

    Now, in the .NET 2.0 days, the .NET framework actually guaranteed that every write is a volatile write, so your code would be safe. (this was a guarantee of the MS .NET implementation, not part of the official language specs) But that’s back when .NET pretty much only ran on x86, where the hardware guarantees it doesn’t reorder writes. I don’t know if the guarantee is still valid on ARM. Lots of .NET code would be subtly broken if it isn’t, but on the other hand, ARM CPUs aren’t really powerful enough to pay an additional performance penalty on every write to keep up the .NET 2.0 guarantee.


  6. I also have a bad feeling about (1). Seems like there is a missing barrier. For the .NET memory models, I find Joe Duffy’s book and blog to be the best source, though they can be difficult to understand.

    I’d consider a fifth option (ConcurrentBag):

    private static ConcurrentBag<StringBuilder> cachedBuilders =
        new ConcurrentBag<StringBuilder>();
    
    public string Format(TResult value)
    {
      StringBuilder builder;
      if (!cachedBuilders.TryTake(out builder))
        builder = new StringBuilder();
      builder.Length = 0;
      // This will call all the actions in the multicast delegate.
      formatActions(value, builder);
      var result = builder.ToString();
      cachedBuilders.Add(builder);
      return result;
    }
    

    This will create new SBs only if all the existing ones are already being used.


    1. Yes, I agree that the first one is probably missing a barrier – but now I don’t want to change the code in the post!

      I can’t use ConcurrentBag for the same reason as option 4 – but I would expect this may be more expensive than just creating a builder, to be honest.


    2. Eeep, if this optimisation is to play nice with memory, then don’t use a ConcurrentBag – they are not good with memory – their focus is on avoiding thread clashes on data and on throughput performance, so they work by having a bag per thread for good concurrency and cache awareness – not so good with general memory use.

      Also more generally, I’d stay away from anything that does threadlocal/static stuff (ConcurrentBag included) in a library that may be used in the async/await world of tasks. This is because an async/task chain may wander from thread to thread as and when threads are available for execution. Thread-local stuff doesn’t naturally follow the path of execution when it moves between threads. To help with this, and make it seem like it does, the framework starts lifting and moving the thread-local stuff around with your tasks between threads, so that it still behaves as you might expect.

      However, synchronising this ExecutionContext comes with a performance cost. On the other hand, if you don’t do any thread-local stuff (or impersonation, security etc) then the ambient execution context doesn’t need to flow across await points, as it doesn’t contain anything that would differ between threadpool threads, so you don’t have this penalty. In general code this is a fairly lightweight cost; but for server-side stuff or awaiting in tight loops it’s bad – so I’d certainly stay away from it in a library!

      Basically, if I called .ToString() on a library method I wouldn’t expect all my unrelated awaits to suddenly start paying a synchronisation penalty – which is what would happen in this case.


      1. While there are occasions where a lock-free algorithm is important, I wouldn’t consider creating independent things to be one. Instead, keep them independent. ConcurrentBag has the right type of idea; however if you go core-local rather than thread-local it will work a lot better and work well with tasks and await. Also, since they are independent, and it’s the world of strings, an uncontested lock probably has enough performance while being clearer. So I’d go for something like:


        class Formatter
        {
            [DllImport("kernel32.dll"), SuppressUnmanagedCodeSecurity]
            private static extern int GetCurrentProcessorNumber();
        
            private static readonly StringBuilder[] pooledBuilders = new StringBuilder[Environment.ProcessorCount];
            private static readonly object[] pooledLocks = new object[Environment.ProcessorCount];
        
            static Formatter() 
            {
                for (var i = 0; i < pooledLocks.Length; i++)
                {
                    pooledLocks[i] = new object();
                }
            }
            private static StringBuilder GetBuilder()
            {
                var i = GetCurrentProcessorNumber();
                if (pooledBuilders[i] == null)
                {
                    return new StringBuilder();
                }
                else
                {
                    lock (pooledLocks[i])
                    {
                        StringBuilder builder = pooledBuilders[i] ?? new StringBuilder();
                        pooledBuilders[i] = null;
                        return builder;
                    }
                }
            }
            private static void ReleaseBuilder(StringBuilder builder)
            {
                var i = GetCurrentProcessorNumber();
                lock (pooledLocks[i])
                {
                    builder.Length = 0;
                    pooledBuilders[i] = builder;
                }
            }
            public string Format(TResult value)
            {
                StringBuilder builder = GetBuilder();
                try
                {
                    // This will call all the actions in the multicast delegate.
                    formatActions(value, builder);
                    return builder.ToString();
                }
                finally
                {
                    ReleaseBuilder(builder);
                }
            }
        }
        



  7. How I would write Option 2 … minimal time in the first lock, and the second lock isn’t necessary since no one else can be using pooledStringBuilder inside the if:

    private readonly StringBuilder pooledStringBuilder = new StringBuilder();
    private bool pooledBuilderInUse = false;
    ...
    
    public string Format(TResult value)
    {
        bool useOwnBuilder;
        lock (pooledStringBuilder)
        {
            useOwnBuilder = pooledBuilderInUse;
            pooledBuilderInUse = true;
        }
        try
        {
            StringBuilder builder;
            if (useOwnBuilder)
            {
                builder = new StringBuilder();
            }
            else
            {
                builder = pooledStringBuilder;
                builder.Length = 0;
            }
            // This will call all the actions in the multicast delegate.
            formatActions(value, builder);
            return builder.ToString();
        }
        finally
        {
            if (!useOwnBuilder)
            {
                pooledBuilderInUse = false;
            }
        }
    }
    


        1. Their lock means they get a read barrier – but you need a write barrier too, on the writing side. (Think of the write barrier as a bit like flushing a write cache, and the read barrier as a bit like flushing a read cache. That’s not entirely accurate, but it’s a starting point.)


    1. Also, the builder could be cleared after use rather than before, reducing memory use slightly, making the code a little more compact, and following the principle that one who makes a mess is responsible for cleaning it up:

      StringBuilder builder = useOwnBuilder ? new StringBuilder() : pooledStringBuilder;

      finally
      {
          if (!useOwnBuilder)
          {
              builder.Length = 0;
              pooledBuilderInUse = false;
          }
      }


  8. I can’t seem to figure out that Interlocked.CompareExchange in option 1.

    StringBuilder builder = Interlocked.CompareExchange(ref cachedBuilder, cachedBuilder, null);

    If I’m understanding it right on MSDN (http://msdn.microsoft.com/en-us/library/h7etff8w(v=vs.110).aspx), if the first and third argument are equal, then the second argument is assigned to the first. The initial value of the first argument is always returned.

    So if the first and third arguments ARE NOT equal, nothing happens and cachedBuilder doesn’t change. If the first and third arguments ARE equal, the second argument (cachedBuilder) is assigned to cachedBuilder and again, cachedBuilder doesn’t change. I can’t seem to figure out how it’s ever going to return null to indicate that it is in use.

    I even wrote a program simulating the first 2 options. Option 2 works exactly as I would think, but Option 1 doesn’t ever seem to want to create a new instance. Multiple calls on different threads are constantly walking all over each other. What am I missing?

    Thanks.


    1. Yes, you’re right – oops! I got arguments 2 and 3 the wrong way round. Basically the idea is to replace the value with null if it was the builder that’s returned. Will fix – thanks very much for spotting it.


      1. I always write CompareExchange like this, I’ve had that same problem happen to me a few times!

        StringBuilder builder =
        Interlocked.CompareExchange(ref cachedBuilder, value: null, comparand: cachedBuilder);


  9. I have a slightly out-of-band question — since you’re trying to optimize, have you tested initializing the StringBuilder with a longer capacity?

    Simple date/time formats aren’t an issue, but longer formats like extended ISO 8601 will bust through the 16-character StringBuilder default and force an ExpandByABlock call. Depending on the format string you’re using for testing, this might even have an impact on the test results.
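    A sketch of the suggestion (the capacity value here is illustrative, not a measured optimum):

    ```csharp
    using System.Text;

    // The default StringBuilder capacity is 16 characters; an extended
    // ISO-8601 value such as "2014-01-07T12:34:56.1234567" is longer,
    // so a default-sized builder has to expand at least once per format.
    // Pre-sizing to cover the longest expected output avoids that.
    StringBuilder builder = new StringBuilder(64);
    ```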


    1. Not very recently, but I think I did at one point. Worth trying again, certainly… I even have a sample template value, so I could just format that and use it as the default size.


    1. Maybe you should clear pooled stringbuilders after use so that they don’t continue to consume arbitrary amounts of memory. Or, is the buffer size always small?
    2. I’d only write back the current string builder if it was taken from the pool. If not we can save one volatile write that will also pull in the cache line and mark it dirty. It will also write to the GC card table (does that thing even exist? I think they are now using OS write watch).


      1. They’re always small, in reality. But yes, they could be cleared after use instead of before next use.
      2. Yes, that’s definitely an option – if I ever go back to this :)


  10. You’re better off associating the lifetime with whoever is calling Format so that you can manage the problem at a higher level if necessary.

    The kind of speculative changes you posit in the post need very strong justification. Consider design changes ahead of this kind of micro-optimization where possible – this thing sounds a bit like an internal interface, or at least something that could be made to work in two different ways to accommodate backward compatibility.

    The cost of allocating a string builder per call should be quite low – it should just be the extra copying that’s costing you. GC shouldn’t be a problem since the object is neither large nor dies in middle age. Going with a pooling approach requires more than a demonstrated benefit in a micro-benchmark, where GC costs aren’t amortized across real life work, it needs to be a reported real-world performance problem, reproduced by more than one entity.

    And even then, an alternate approach – where you let your callers hand you a StringBuilder or TextWriter or similar – is still better.


    1. Unfortunately, trying to get performance measures for “real life work” is pretty tricky for an open source project – we’re working in the dark, to a large extent. I’m aware of BCL parsing and formatting speed causing issues in a previous project, but I haven’t actually worked on any projects using Noda Time, so I’m somewhat removed.

      I generally agree with all your points – although I’d say we’ve already done quite a lot of the design work for optimizations. (Noda Time 2.0 is massively faster than 1.x in general due to some significant rearchitecture.)

      I do like the idea of allowing a StringBuilder to be passed in though – we’ve already got that as a “partial pattern” which isn’t exposed everywhere, but it should be trivial to make that more widely available…


  11. I should add that it is in principle very possible for a future CLR to perform escape analysis, see that the StringBuilder doesn’t escape this call, and compose the finally-returned string object in place. That kind of optimization could be a big potential win to a lot of end-user programs that do a lot of string generation. If you did all this work without concrete reasons, you’ll have harmed the codebase in the medium term.


    1. General point taken – but in this case the StringBuilder is passed to a somewhat-arbitrary delegate (from the CLR’s perspective) so I doubt that the escape analysis could actually do anything.


  12. I wanted to comment to warn against optimizations like this, but see that barrkel already made all the points I wanted to make (in a much better way).


  13. In solution 1 wouldn’t Interlocked.CompareExchange throw an exception if the first argument is null? at least MSDN says so…


    1. Yes, but it never would be null… it’s always going to refer to the cachedBuilder variable. It doesn’t matter if the value of the variable is null – it’s just saying it can’t be a null pointer, i.e. you can’t try to modify a variable that doesn’t exist.


  14. I haven’t spent a lot of time analyzing it, but I think you just want Interlocked.Exchange when checking out the builder object from the cache, not CompareExchange.


    1. You’re right, and it introduces a bug. I explained below (accidentally; wish I could delete my own comment using my phone).

      I caused the bug to appear by testing with 26 threads. Each thread wrote a single character, waited for 10 ms, and checked to make sure the length was exactly 1 before checking the builder back in. It failed within seconds; two threads had indeed obtained the same instance and written into it.
      On the second run, it failed after fifteen seconds. Another thread had cleared the buffer right before the length was measured.
      I also showed that, often, cachedBuilder != Volatile.Read(ref cachedBuilder) at the top of the method.

      Changing to Ben Voigt’s suggestion prevents the entire issue.

      Locks would have solved the staleness easily enough. The things we do (and will continue to do) to be efficient!


  15. Oh, good catch. According to theory it’s actually possible the code could be buggy. The read of cachedBuilder, right before the CompareExchange call, might be stale and hold a value that is neither null nor the current value at the ref address.
    Net effect: exchange is not done because the stale instance does not match the current instance, but the current instance is returned anyway. This could allow two threads to obtain the same StringBuilder instance.

    To fix the staleness problem, should be Interlocked.CompareExchange(ref cachedBuilder, null, Volatile.Read(ref cachedBuilder)) – or more efficiently and without the race condition, Interlocked.Exchange(ref cachedBuilder, null).

    Terse stuff.

