Implementing Synchronization Primitives for Boost on Windows Platforms

by Anthony Williams

Introduction

The project to rewrite the Boost thread library started in 2005, as part of the drive to have the whole of Boost covered by the Boost Software License; Boost.Thread was under a different (though similar) license, and the original author could no longer be contacted. Though this issue has now been resolved, as the original author was contacted and agreed to the license change, efforts were underway to reimplement the library, and they continue unabated. The reimplementation effort is driven in part due to the proposals for threading to be added to the next version of the C++ Standard, scheduled to be released in 2009.

This article describes the Windows implementations of mutexes for boost in CVS on the thread_rewrite branch at the time of writing. As discussions continue, and alternative implementations are proposed, the final version used in Boost release 1.35 may differ from that described here.

What's a mutex

Mutex is short for "mutual exclusion", and they are used to protect data — only one thread is permitted to "own" the mutex at one time, so only one thread is permitted to access that data at once, the rest are excluded. The other important job performed by a mutex is to ensure that when a thread owns the mutex, the values it sees for the protected data are exactly the same as the values seen by the last thread to own the mutex, including any writes made by that thread. This is particularly important on systems with multiple CPUs, where the threads might be running on different CPUs, and the data will reside in the associated CPU caches for some indeterminate amount of time without specific instructions to ensure cache coherency, and visibility of the data between CPUs.

Memory visibility

Without explicit instructions to say otherwise, data loaded from main memory is generally kept in the CPU cache for as long as possible, since retrieval from main memory is a slow operation, and the CPU will stall if the data it needs is not in the cache. In addition, data may be stored in the CPU cache before being written to main memory. On a single-CPU, single-core system this is not a problem, but on systems with multiple CPUs, or CPUs with multiple cores, this can be a problem if data is to be shared, as stored data may not make it to main memory for other CPUs/cores to see, and they may not reload it from main memory anyway, if that memory location is stored in their cache.

With CPUs that can reorder instruction execution, reads and writes might not necessarily happen in the order anticipated, which therefore compounds the problem.

Consequently, in general, without special instructions, data written by one CPU may or may not be visible to other CPUs, and writes that appear sequential on one CPU may become visible on other CPUs in any order.

There are special CPU instructions which can force the CPU to synchronize the cache with main memory, and which restrict the reordering of instructions. These are called memory barrier instructions, and come in several flavours, which determine the level of synchronization done.

Native Windows Synchronization Objects

Windows provides an assortment of synchronization objects: CRITICAL_SECTIONs, Semaphores, Events and Mutexes. Though all of them can be used to ensure that only one thread at a time has access to a given block of data, CRITICAL_SECTIONs and Mutexes are the only ones designed for the purpose.

A CRITICAL_SECTION is a relatively light-weight object, and can only be used within a single process. Use of a CRITICAL_SECTION requires use of the CRITICAL_SECTION-specific API. In contrast, the other synchronization objects are kernel objects, and can be used to provide synchronization between processes, as well as within a single process. All these objects can be used with the generic HANDLE-based APIs, such as WaitForSingleObject and WaitForMultipleObjectsEx.

Atomic Operations

As well as the synchronization objects described above, the Windows API provides a set of atomic operations, commonly implemented as compiler intrinsics that compile down to a single processor instruction. The most basic of these is the InterlockedExchange API, which atomically swaps two 32-bit values, such that all CPUs and cores in the system will see either the original value, or the new value.

This property is common to all the atomic APIs (all of which share the Interlocked prefix), which include increment, decrement, add, and compare-and-exchange. The latter comes in both 32-bit, and (on Windows Vista, Windows Server 2003, and 64-bit platforms) 64-bit variants, and is the basic workhorse of thread synchronization — all other operations can be built on it.

The semantics of these basic atomic operations also include the full memory barrier semantics necessary for synchronization between threads running on different CPUs — all writes to memory that occur in the source code before the atomic operation will be completed (and made visible to other CPUs) before the atomic operation, and all reads that in the source code after the atomic operation will not proceed until the atomic operation has complete.

Newer versions of Windows such as Windows Server 2003 and Windows Vista also provide additional variants of some of the atomic operations that only provide a write barrier (the xxxRelease variants), or a read barrier (the xxxAcquire variants), but these are not used for the implementation under discussion.

Choosing a mutex implementation

The simplest way to implement a mutex would be to write a wrapper class for one of the native Windows synchronization objects; after all, that's what they're there for. Unfortunately, they all have their problems. The Mutex, Event and Semaphore are kernel objects, so every synchronization call requires a context switch to the kernel. This can be rather expensive, especially so when there is no contention.

Boost mutexes are only designed for use within a single process, so the CRITICAL_SECTION looks appealing. Unfortunately, there are problems with this, too. The first of these is that it requires explicit initialization, which means it cannot reliably be used as part of an object with static storage duration — the standard static-initialization-order problem is compounded by the potential of race conditions, especially if the mutex is used as a local static. On most compilers, dynamic initialization of objects with static storage duration is not thread-safe, so two threads may race to run the initialization, potentially leading to the initialization being run twice, or one thread proceding without waiting for the initialization being run by the other thread to complete.

Race Conditions

In multithreaded code, a race condition is anything where the outcome depends on the relative ordering of execution of instructions on two threads. They are particularly problematic where there are two separate operations that need to be done together: if the thread performing the operations is interrupted after doing the first part, and before doing the second, then another thread may see the overall operation as half-done. Not only that, but without synchronization, due to memory visibility issues, the thread might see the result of the second operation, and not the result of the first. If the second thread is also operating on the same data, this may then end in a garbled mess.

As a concrete example, consider a 64-bit counter on a 32-bit system. This is therefore implemented as a pair of 32-bit unsigned integers, one for the high part, and one for the low part. If we have two threads incrementing this counter without synchronization, we have a potential for problems:

Thread A reads low part=0xffffffff
Thread A read high part=0
Thread A increments counter
Thread B reads low part=0xffffffff
Thread A writes low part=0
Thread A writes high part=1
Thread B reads high part=1 // oops thread B now sees low=0xffffffff, high=1
Thread B increments counter
Thread B writes low part=0
Thread B writes high part=2

Thread B has now just increased the counter by 2³², rather than by 1. If this counter is a reference count, then the same issue in reverse when decrementing the count could cause the reference-counted object to be freed whilst other threads still had references.

Race conditions can be the causes of hard-to-find bugs, since they don't replicate well, and often disappear entirely in the debugger.

The second problem is that you can't do a timed wait on a CRITICAL_SECTION, which means we need another solution for a mutex that supports timed waits, anyway.

There is also another problem with using CRITICAL_SECTIONs as a high-performance mutex, which is that a thread unlocking the CRITICAL_SECTION will hand-off ownership to a waiting thread. More on this below.

The need for speed

The primary reason for using anything other than a Windows Mutex kernel object is the need for speed — otherwise, the ease of implementation when using a Mutex object would make it an obvious choice. The fastest way to lock a mutex is using atomic operations — do an atomic exchange of the "locked" flag to set it, and check the old value; if it was already locked, this thread has to wait, otherwise it has the lock — but the problem comes with the "this thread has to wait" part. A waiting thread has to consume as near to zero CPU time as possible until the mutex becomes unlocked, in order to not slow down the running threads.

long flag;

void lock()
{
    while(!InterlockedExchange(&flag,1))
    {
        wait();
    }
}

void unlock()
{
    InterlockedExchange(&flag,0);
    wake_a_waiting_thread();
}

The simple answer, which is the one used by the Windows CRITICAL_SECTION, is to use a Windows auto-reset Event to handle contention. When a thread tries to lock the mutex, and finds it already locked, then it blocks on the event. When the thread that owns the lock unlocks it, if there is a thread blocked on the Event, it signals the event to wake the waiting thread. This also handles contention and priority neatly, in that Windows will wake exactly one thread waiting on an auto-reset Event, and the kernel can choose the highest priority thread to wake.

The hard part here is the "if there is a thread blocked on the Event" part. Windows won't tell us, so we have to maintain a count of waiting threads in the mutex. We could just set the Event every time, rather than keeping a count, but that would lead to a potentially-unnecessary kernel API call on unlock, and would also cause the first thread to enter the wait loop afterwards to wake immediately, as the Event would stay set. This spurious wake up would just consume CPU for no gain.

Keeping Count

The simplest way to keep count is just to increment the flag rather than always set it to 1. If we just incremented it to 1, then we're the only thread, so we've got the lock. Otherwise, someone else has the lock, so we must wait.

On unlock, we can decrement the count. If it hits zero, it was just us, so we do nothing, otherwise we must wake another thread, and that thread now has the lock:

void lock()
{
    if(InterlockedIncrement(&flag)!=1)
    {
        wait_on_event();
    }
}

void unlock()
{
    if(InterlockedDecrement(&flag)!=0)
    {
        wake_a_waiting_thread();
    }
}

Hand-off

This is essentially how CRITICAL_SECTIONs work, and is a nice simple scheme. Unfortunately, it suffers from a problem called "hand-off". When the current thread unlocks the mutex, if there are any other threads waiting, it "hands off" the mutex to one of these even if that thread is lower priority than the unlocking thread. If the mutex is being locked and unlocked in a loop, the high priority thread may come round to the lock again, before the lower priority thread (which now owns the mutex) wakes up. In tight loops, this is quite likely, since the high priority thread won't yield to the lower priority thread until it either blocks or hits the end of its timeslice. This means that the high priority thread now has to wait for the lower priority thread, even when it could have continued.

To solve this, we need a different scheme, where the mutex is truly unlocked by unlock(), however many waiting threads there are.

Avoiding hand-off with compare-and-swap

The solution I've implemented is to reserve one bit of the flag for the "locked" state, whilst keeping the remaining bits for the count of "interested" threads. Consequently, this can no longer be implemented as a single atomic instruction, since it is not a simple addition, or even a simple mask. This is where compare-and-exchange/compare-and-swap (CAS) comes into its own: by using CAS in a loop, you can do any operation you like to a single 32-bit field (or 64-bit field, on those platforms that support it). CAS only sets the field to the new value if it was equal to the specified "old value", but it returns the old one, whatever it was. That way you know whether your set succeeded, and you can decide whether to try again, and what the new value should be based on the actual current value.

Consequently, try_lock() looks like this:

bool try_lock()
{
    long old_flag=0;
    do
    {
        long const new_flag=(old_flag+1)|lock_flag_value;
        long const current_flag=
            InterlockedCompareExchange(&flag,new_flag,old_flag);

        if(current_flag==old_flag)
        {
            return true;
        }
        old_flag=current_flag;
    }
    while(!(old_flag&lock_flag_value));
    return false;
}

Which basically says: take the current value, increment the "interested threads" count and set the lock flag. If we succeed in this, we've got the lock. If not, the state must have changed, so try again, unless another thread has the lock, in which case we're done.

Plain lock() is essentially a try_lock() in a loop, waiting on an event. The first time through, we increment the "interested threads" count, whether or not we get the lock, to mark that we're waiting. If the lock flag was set already, then we wait on the event. When we wake, we just try and set the lock flag. If it was still set, we wait on the event again.

bool lock()
{
    long old_flag=0;
    while(true)
    {
        long const new_flag=(old_flag+1)|lock_flag_value;
        long const current_flag=
            InterlockedCompareExchange(&flag,new_flag,old_flag);

        if(current_flag==old_flag)
        {
            break;
        }
        old_flag=current_flag;
    }
    while(old_flag&lock_flag_value)
    {
        wait_on_event();
        do
        {
            long const new_flag=old_flag|lock_flag_value;
            long const current_flag=
                InterlockedCompareExchange(&flag,new_flag,old_flag);

            if(current_flag==old_flag)
            {
                break;
            }
            old_flag=current_flag;
        }
        while(!(old_flag&lock_flag_value));
    }
}

In contrast, the unlock() code is quite straight-forward, since the modification to the flag value is simple: clear the lock flag, and take one off the count. Since we know the lock flag is always set, this is a simple subtraction, which we can do with the atomic exchange-and-add operation. If the old value returned by the exchange-and-add shows there was another thread waiting, signal the event to wake it.

void unlock()
{
    long const offset=lock_flag_value+1;
    long const old_flag_value=InterlockedExchangeAdd(&flag,-offset);
    if(old_flag_value!=offset)
    {
        wake_a_waiting_thread();
    }
}

Timing out

Adding a timeout to the lock function is relatively straight-forward with this scheme. We need to add a timeout to the wait_on_event() call, and if it times out, then we decrease the count of "interested threads" and return false to indicate that we haven't got the lock. If the wait doesn't time out, then we proceed just like lock(), returning true if we do get the lock.

bool timed_lock(timeout_type timeout)
{
    long old_flag=0;
    while(true)
    {
        long const new_flag=(old_flag+1)|lock_flag_value;
        long const current_flag=
            InterlockedCompareExchange(&flag,new_flag,old_flag);

        if(current_flag==old_flag)
        {
            break;
        }
        old_flag=current_flag;
    }
    while(old_flag&lock_flag_value)
    {
        if(!wait_on_event(milliseconds_remaining(timeout)))
        {
            InterlockedDecrement(&flag);
            return false;
        }
        do
        {
            long const new_flag=old_flag|lock_flag_value;
            long const current_flag=
                InterlockedCompareExchange(&flag,new_flag,old_flag);

            if(current_flag==old_flag)
            {
                break;
            }
            old_flag=current_flag;
        }
        while(!(old_flag&lock_flag_value));
    }
    return true;
}

The only issue here is the fact that the timeout on the wait and the decrement of the count are not a single atomic operation. Consequently, if this thread is the only waiting thread, and the current owner unlocks the mutex between the wait timing out and the flag being decremented, then the event will be signalled, even though there are no waiting threads. This will cause a spurious wake-up of the next thread to wait on the mutex, which is unfortunate, but not a disaster. The alternative, which is that rather than just decrementing the count, we try and acquire the lock, and only decrement the count if we don't get it, also has the potential for spurious wake-ups. If the timing-out thread is not the only waiting thread, but the mutex is unlocked between the timeout and the decrement, then another of the waiting threads will wake. If we acquire the lock instead of just decrementing the count, then the other waiting thread will have woken for no reason.

Events and Initialization

Up to now I've glossed over the wait_on_event() and wake_a_waiting_thread() calls, but they're quite important to the whole scheme. They're also rather simple:

bool wait_on_event(unsigned milliseconds=INFINITE)
{
    return WaitForSingleObject(get_event(),milliseconds)==0;
}

void wake_a_waiting_thread()
{
    SetEvent(get_event());
}

Since we're using an auto-reset event, the event object is automatically reset when the first waiting thread wakes from the WaitForSingleObject call. The complication is get_event(). In order to permit static initialization, our mutex cannot have a constructor, so the members have to be initialized with fixed values — this was one of the problems with CRITICAL_SECTIONs. Therefore, we need to create the event object in the get_event() call, in such a way that if multiple threads call get_event() concurrently, they all return the same event object. We do this with yet more atomic operations — first we read the current Event handle; if it's NULL, then we create a new Event, and try and swap it into place. If we succeeded, then we're using the Event we created, otherwise another thread beat us to it, so we use that, and destroy the one we created.

HANDLE event;

HANDLE get_event()
{
    HANDLE* current_event=
        InterlockedCompareExchangePointer(&event,0,0);
    if(current_event)
    {
        return current_event;
    }
    HANDLE* new_event=CreateEvent(NULL,false,false,NULL);
    if(!new_event)
    {
        throw thread_resource_error();
    }
    current_event=
        InterlockedCompareExchangePointer(&event,new_event,0);
    if(!current_event)
    {
        return new_event;
    }
    else
    {
        CloseHandle(new_event);
        return current_event;
    }
}

Initialization is thus straight-forward: for objects with static storage duration, zero initialization will suffice, and for objects with non-static storage duration, explicit initialization to zero using an aggregate initializer will suffice.

void f()
{
    static mutex sm; // zero init works OK for static
    mutex m={0}; // aggregate initialization required for auto
    static mutex sm2={0}; // aggregate init works OK for static, too
}

This is important, as any use of a constructor would require dynamic initialization, which occurs at runtime, and therefore may be subject to race conditions. This is particularly a problem for objects with static storage duration at block scope, since the constructor is run the first time control passes through the definition of the object; if multiple threads are running concurrently, it is possible for two threads to believe they are "the first", and thus both run the constructor. It is also possible that one thread will start running the constructor, and another thread will then proceed before the constructor has completed.

Cleaning up

Since such a mutex uses a Windows Event object, it needs to tidy up after itself, otherwise you have a resource leak, which is less than ideal.

For automatic and dynamic mutexes, clean-up is easy: the destructor should destroy the Event. For statics, clean-up is a bit more complicated.

Destructors for objects with static storage duration are called in the reverse order of their construction. Unfortunately, this is not necessarily the best order, especially for block-scope statics. Block-scope statics have the additional problem of a potential race condition on "the first time through", which can lead to the destructor being run multiple times on popular compilers, which is undefined behaviour — compilers often use the equivalent of a call to atexit in order to ensure that destruction is done in the reverse order of construction, and the initialization race that may cause the constructor to be run twice may also cause the destructor to be registered twice. If any threads continue after main, this problem is compounded, as the unpredictable destruction order means that the mutex may be accessed after it has been destroyed, again leading to undefined behaviour.

We do know, however, that Windows Objects allocated by a program are freed when the program exits, so for objects of static storage duration, we can get by without a destructor, which neatly sidesteps the "access after destruction" and "multiple destruction" issues. It does make it hard to use the same class for static objects and non-static objects, though.

Copping Out

The current boost spec requires that instances of the mutex type are usable without an explicit initializer. This means that it is not possible to satisfy the requirements for both static and non-static storage duration without resorting to compiler-specific techniques, or some form of "named mutex" technique that doesn't require storing state within the mutex.

The current implementation of boost::mutex cops out, and has a default constructor and a destructor. This makes it dangerous to use as a block-scope static, but yields correct behaviour for objects with non-static storage duration, and objects of namespace scope, provided care is taken with initialization order.

Though this is the same as for previous boost releases, this situation is less than ideal, however, and the search continues for a way of ensuring correct initialization in all cases.

It has been suggested that boost adds a static_mutex, which would then be portable to other platforms, but this is not available at this time, and building a safe-for-all-uses mutex would be preferable.

The basic_timed_mutex used in the Windows implementation has no constructor or destructor, and could therefore safely be used with static storage duration as described above. Use at non-static storage duration requires calling the init() and destroy() member functions in place of the constructor and destructor. This class is a detail of the current thread_rewrite Windows implementation, however, and its use is not therefore recommended.

Generalizing to Other Platforms

Whilst the implementation described here is Windows-specific, the majority of the code is just using the InterlockedXXX functions. These functions are generally just compiler intrinsics for single CPU instructions, so could easily be implemented on non-Windows platforms with a bit of inline assembly, or by replacing them with the equivalent calls to OS functions.

The larger bit of non-portability comes from the use of Event objects. These are essentially just binary semaphores, so can easily be replaced by semaphores on those systems that support them (e.g. POSIX systems). POSIX semaphores are not quite ideal, though — they would have to be dynamically allocated using new or malloc in get_event, and they aren't limited to just set/unset, so there is the potential of spurious wake-ups. This wouldn't prevent the scheme described from working, since it allows for spurious wake-ups, but it would be less than ideal. A condition variable could be used instead, but that also has the potential for spurious wake-ups, and might have more overhead, since it requires use of an OS mutex in addition.

Future Plans

Work is still under way to identify a solution to the initialization and clean up problems. Under the new C++ Standard, initialization may be easier, as there is a proposal under consideration for generalized constant expressions, which would enable simple default constructors to be used for static initialization, and thus solve the race conditions due to initialization. Unfortunately, this does not solve the corresponding clean up problems. There are also proposals under consideration to address thread-safe initialization of static objects.

In any case, the new C++ Standard won't be published before 2009, and it will then be a few years before compilers supporting it become common, so this is still an important issue.

Conclusion

Implementing synchronization primitives is hard. Ensuring they are correct is hard enough, but fairness issues and initialization problems just make the whole thing harder.

Thankfully, most of the time we can just rely on libraries such as Boost, and not have to think about the issues. It does mean that as the implementor of such a library it is even more important to get things right.

As a library user, understanding the issues involved can help us to see the reason behind particular restrictions the library places on us, and can help us write better code.

Acknowledgements

Thanks to Peter Dimov and Alexander Terekhov who pointed out the hand-off problem to me, and suggested using a swap-based method to avoid it. Thanks also to Roland Schwarz for reviewing the code, and proposing alternative initialization and implementation techniques.

Just Software Solutions

About Us

Technical Writings

Subscribe to Blog