Multithreading and Concurrency

My book, C++ Concurrency in Action, contains a detailed description of the C++11 threading facilities, and techniques for designing concurrent code.

The just::thread implementation of the new C++0x thread library is available for Microsoft Visual Studio 2005, Microsoft Visual Studio 2008, Microsoft Visual Studio 2010, g++ 4.5.2 and g++ 4.6.1 on Windows, g++ 4.3, 4.4, 4.5 and 4.6 on Linux, and g++ 4.3, 4.4 and 4.5 on MacOSX. Order your copy today.

Multithreading in C++0x part 4: Protecting Shared Data

Saturday, 04 April 2009

This is the fourth of a series of blog posts introducing the new C++0x thread library. The first three parts covered starting threads in C++0x with simple functions, starting threads with function objects and additional arguments, and starting threads with member functions and reference arguments.

If you've read the previous parts of the series then you should be comfortable with starting threads to perform tasks "in the background", and waiting for them to finish. You can accomplish a lot of useful work like this, passing in the data to be accessed as parameters to the thread function, and then retrieving the result when the thread has completed. However, this won't do if you need to communicate between the threads whilst they are running — accessing shared memory concurrently from multiple threads causes undefined behaviour if either thread modifies the data. What you need here is some way of ensuring that the accesses are mutually exclusive, so only one thread can access the shared data at a time.

Mutual Exclusion with std::mutex

Mutexes are conceptually simple. A mutex is either "locked" or "unlocked", and threads try and lock the mutex when they wish to access some protected data. If the mutex is already locked then any other threads that try and lock the mutex will have to wait. Once the thread is done with the protected data it unlocks the mutex, and another thread can lock the mutex. If you make sure that threads always lock a particular mutex before accessing a particular piece of shared data then other threads are excluded from accessing the data for as long as the mutex is locked. This prevents concurrent access from multiple threads, and avoids the undefined behaviour of data races. The simplest mutex provided by C++0x is std::mutex.

Now, whilst std::mutex has member functions for explicitly locking and unlocking, by far the most common use case in C++ is where the mutex needs to be locked for a specific region of code. This is where the std::lock_guard<> template comes in handy, as it provides exactly this behaviour. The constructor locks the mutex, and the destructor unlocks the mutex, so to lock a mutex for the duration of a block of code, just construct a std::lock_guard<> object as a local variable at the start of the block. For example, to protect a shared counter you can use std::lock_guard<> to ensure that the mutex is locked for either an increment or a query operation, as in the following example:

std::mutex m;
unsigned counter=0;

unsigned increment()
{
    std::lock_guard<std::mutex> lk(m);
    return ++counter;
}
unsigned query()
{
    std::lock_guard<std::mutex> lk(m);
    return counter;
}

This ensures that access to counter is serialized — if more than one thread calls query() concurrently then all but one will block until the first has exited the function, and the remaining threads will then have to take turns. Likewise, if more than one thread calls increment() concurrently then all but one will block. Since both functions lock the same mutex, if one thread calls query() and another calls increment() at the same time then one or other will have to block. This mutual exclusion is the whole point of a mutex.
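
As a quick demonstration (this driver is my own sketch, not part of the original post), we can hammer the counter from two threads; since every access goes through the mutex, the final count is always exactly the total number of increments:

#include <thread>
#include <mutex>
#include <iostream>

std::mutex m;
unsigned counter=0;

unsigned increment()
{
    std::lock_guard<std::mutex> lk(m);
    return ++counter;
}

void increment_many()
{
    for(unsigned i=0;i<100000;++i)
        increment();
}

int main()
{
    std::thread t1(increment_many);
    std::thread t2(increment_many);
    t1.join();
    t2.join();
    // safe to read counter directly: both threads have been joined
    std::cout<<counter<<std::endl; // always prints 200000
}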

Exception Safety and Mutexes

Using std::lock_guard<> to lock the mutex has additional benefits over manually locking and unlocking when it comes to exception safety. With manual locking, you have to ensure that the mutex is unlocked correctly on every exit path from the region where you need the mutex locked, including when the region exits due to an exception. Suppose for a moment that instead of protecting access to a simple integer counter we were protecting access to a std::string, and appending parts on the end. Appending to a string might have to allocate memory, and thus might throw an exception if the memory cannot be allocated. With std::lock_guard<> this still isn't a problem — if an exception is thrown, the mutex is still unlocked. To get the same behaviour with manual locking we have to use a catch block, as shown below:

std::mutex m;
std::string s;

void append_with_lock_guard(std::string const& extra)
{
    std::lock_guard<std::mutex> lk(m);
    s+=extra;
}

void append_with_manual_lock(std::string const& extra)
{
    m.lock();
    try
    {
        s+=extra;
        m.unlock();
    }
    catch(...)
    {
        m.unlock();
        throw;
    }
}

If you had to do this for every function which might throw an exception it would quickly get unwieldy. Of course, you still need to ensure that the code is exception-safe in general — it's no use automatically unlocking the mutex if the protected data is left in a state of disarray.

Next time

Next time we'll take a look at the std::unique_lock<> template, which provides more options than std::lock_guard<>.

Subscribe to the RSS feed or email newsletter for this blog to be sure you don't miss the rest of the series.

Try it out

If you're using Microsoft Visual Studio 2008 or g++ 4.3 or 4.4 on Ubuntu Linux you can try out the examples from this series using our just::thread implementation of the new C++0x thread library. Get your copy today.


Multithreading in C++0x part 3: Starting Threads with Member Functions and Reference Arguments

Thursday, 26 February 2009

This is the third of a series of blog posts introducing the new C++0x thread library. The first two parts covered Starting Threads in C++0x with simple functions, and starting threads with function objects and additional arguments.

If you've read the previous parts of the series then you've seen how to start threads with functions and function objects, with and without additional arguments. However, the function objects and arguments are always copied into the thread's internal storage. What if you wish to run a member function other than the function call operator, or pass a reference to an existing object?

The C++0x library can handle both these cases: the use of member functions with std::thread requires an additional argument for the object on which to invoke the member function, and references are handled with std::ref. Let's take a look at some examples.

Invoking a member function on a new thread

Starting a new thread which runs a member function of an existing object is straightforward: you just pass a pointer to the member function, and a value to use as the this pointer for the object, to the std::thread constructor.

#include <thread>
#include <iostream>

class SayHello
{
public:
    void greeting() const
    {
        std::cout<<"hello"<<std::endl;
    }
};

int main()
{
    SayHello x;
    std::thread t(&SayHello::greeting,&x);
    t.join();
}

You can of course pass additional arguments to the member function too:

#include <thread>
#include <iostream>

class SayHello
{
public:
    void greeting(std::string const& message) const
    {
        std::cout<<message<<std::endl;
    }
};

int main()
{
    SayHello x;
    std::thread t(&SayHello::greeting,&x,"goodbye");
    t.join();
}

Now, the preceding examples both use a plain pointer to a local object as the this argument; if you're going to do that, you need to ensure that the object outlives the thread, otherwise there will be trouble. An alternative is to use a heap-allocated object and a reference-counted pointer such as std::shared_ptr<SayHello> to ensure that the object stays around as long as the thread does:

#include <thread>
#include <iostream>
#include <memory> // for std::shared_ptr

// SayHello defined as in the previous example

int main()
{
    std::shared_ptr<SayHello> p(new SayHello);
    std::thread t(&SayHello::greeting,p,"goodbye");
    t.join();
}

So far, everything we've looked at has involved copying the arguments and thread functions into the internal storage of a thread even if those arguments are pointers, as in the this pointers for the member functions. What if you want to pass in a reference to an existing object, and a pointer just won't do? That is the task of std::ref.

Passing function objects and arguments to a thread by reference

Suppose you have an object that implements the function call operator, and you wish to invoke it on a new thread. The catch is that you want to invoke the function call operator on this particular object rather than on a copy of it. You could use the member function support to call operator() explicitly, but that seems a bit of a mess given that it is callable already. This is the first instance in which std::ref can help — if x is a callable object, then std::ref(x) is too, so we can pass std::ref(x) as our function when we start the thread, as below:

#include <thread>
#include <iostream>
#include <functional> // for std::ref

class PrintThis
{
public:
    void operator()() const
    {
        std::cout<<"this="<<this<<std::endl;
    }
};

int main()
{
    PrintThis x;
    x();
    std::thread t(std::ref(x));
    t.join();
    std::thread t2(x);
    t2.join();
}

In this case, the function call operator just prints the address of the object. The exact form and values of the output will vary, but the principle is the same: this little program should output three lines. The first two should be the same, whilst the third is different, as it invokes the function call operator on a copy of x. For one run on my system it printed the following:

this=0x7fffb08bf7ef
this=0x7fffb08bf7ef
this=0x42674098

Of course, std::ref can be used for other arguments too — the following code will print "x=43":

#include <thread>
#include <iostream>
#include <functional>

void increment(int& i)
{
    ++i;
}

int main()
{
    int x=42;
    std::thread t(increment,std::ref(x));
    t.join();
    std::cout<<"x="<<x<<std::endl;
}

When passing in references like this (or pointers for that matter), you need to be careful not only that the referenced object outlives the thread, but also that appropriate synchronization is used. In this case it is fine, because we only access x before we start the thread and after it is done, but concurrent access would need protection with a mutex.
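
To illustrate what that protection might look like, here's a sketch of my own (not from the original post) in which both the main thread and the new thread update x whilst the new thread is running:

#include <thread>
#include <iostream>
#include <functional>
#include <mutex>

std::mutex m;

void add_ten(int& i)
{
    for(unsigned count=0;count<10;++count)
    {
        std::lock_guard<std::mutex> lk(m); // serialize access to i
        ++i;
    }
}

int main()
{
    int x=42;
    std::thread t(add_ten,std::ref(x));
    for(unsigned count=0;count<10;++count)
    {
        std::lock_guard<std::mutex> lk(m); // lock the same mutex in main
        ++x;
    }
    t.join();
    std::cout<<"x="<<x<<std::endl; // safe without the lock: the thread has finished
}

Both threads lock m around every access to x, so this reliably prints "x=62".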

Next time

That wraps up all the variations on starting threads; next time we'll look at using mutexes to protect data from concurrent modification.

Subscribe to the RSS feed or email newsletter for this blog to be sure you don't miss the rest of the series.

Try it out

If you're using Microsoft Visual Studio 2008 or g++ 4.3 or 4.4 on Ubuntu Linux you can try out the examples from this series using our just::thread implementation of the new C++0x thread library. Get your copy today.


Multithreading in C++0x part 2: Starting Threads with Function Objects and Arguments

Tuesday, 17 February 2009

This is the second of a series of blog posts introducing the new C++0x thread library. If you missed the first part, it covered Starting Threads in C++0x with simple functions.

If you read part 1 of this series, then you've seen how easy it is to start a thread in C++0x: just construct an instance of std::thread, passing in the function you wish to run on the new thread. Though this is good, it would be quite limiting if new threads were constrained to run plain functions without any arguments — all the information needed would have to be passed via global variables, which would be incredibly messy. Thankfully, this is not the case. Not only can you run function objects on your new thread, as well as plain functions, but you can pass arguments in too.

Running a function object on another thread

In keeping with the rest of the C++ standard library, you're not limited to plain functions when starting threads — the std::thread constructor can also be called with instances of classes that implement the function-call operator. Let's say "hello" from our new thread using a function object:

#include <thread>
#include <iostream>

class SayHello
{
public:
    void operator()() const
    {
        std::cout<<"hello"<<std::endl;
    }
};

int main()
{
    std::thread t((SayHello()));
    t.join();
}

If you're wondering about the extra parentheses around the SayHello constructor call, this is to avoid what's known as C++'s most vexing parse: without the parentheses, the declaration is taken to be a declaration of a function called t which takes a pointer-to-a-function-with-no-parameters-returning-an-instance-of-SayHello, and which returns a std::thread object, rather than an object called t of type std::thread. There are a few other ways to avoid the problem. Firstly, you could create a named variable of type SayHello and pass that to the std::thread constructor:

int main()
{
    SayHello hello;
    std::thread t(hello);
    t.join();
}

Alternatively, you could use copy initialization:

int main()
{
    std::thread t=std::thread(SayHello());
    t.join();
}

And finally, if you're using a full C++0x compiler then you can use the new initialization syntax with braces instead of parentheses:

int main()
{
    std::thread t{SayHello()};
    t.join();
}

In this case, this is exactly equivalent to our first example with the double parentheses.

Anyway, enough about initialization. Whichever option you use, the idea is the same: your function object is copied into internal storage accessible to the new thread, and the new thread invokes your operator(). Your class can of course have data members and other member functions too, and this is one way of passing data to the thread function: pass it in as a constructor argument and store it as a data member:

#include <thread>
#include <iostream>
#include <string>

class Greeting
{
    std::string message;
public:
    explicit Greeting(std::string const& message_):
        message(message_)
    {}
    void operator()() const
    {
        std::cout<<message<<std::endl;
    }
};

int main()
{
    std::thread t(Greeting("goodbye"));
    t.join();
}

In this example, our message is stored as a data member in the class, so when the Greeting instance is copied into the thread the message is copied too, and this example will print "goodbye" rather than "hello".

This example also demonstrates one way of passing information into the new thread aside from the function to call — include it as data members of the function object. If this makes sense in terms of the function object then it's ideal, otherwise we need an alternative technique.

Passing Arguments to a Thread Function

As we've just seen, one way to pass arguments in to the thread function is to package them in a class with a function call operator. Well, there's no need to write a special class every time; the standard library provides an easy way to do this in the form of std::bind. The std::bind function template takes a variable number of parameters. The first is always the function or callable object which needs the parameters, and the remainder are the parameters to pass when calling the function. The result is a function object that stores copies of the supplied arguments, with a function call operator that invokes the bound function. We could therefore use this to pass the message to write to our new thread:

#include <thread>
#include <iostream>
#include <string>
#include <functional>

void greeting(std::string const& message)
{
    std::cout<<message<<std::endl;
}

int main()
{
    std::thread t(std::bind(greeting,"hi!"));
    t.join();
}

This works well, but we can actually do better than that — we can pass the arguments directly to the std::thread constructor and they will be copied into the internal storage for the new thread and supplied to the thread function. We can thus write the preceding example more simply as:

#include <thread>
#include <iostream>
#include <string>

void greeting(std::string const& message)
{
    std::cout<<message<<std::endl;
}

int main()
{
    std::thread t(greeting,"hi!");
    t.join();
}

Not only is this code simpler, it's also likely to be more efficient as the supplied arguments can be copied directly into the internal storage for the thread rather than first into the object generated by std::bind, which is then in turn copied into the internal storage for the thread.

Multiple arguments can be supplied just by passing further arguments to the std::thread constructor:

#include <thread>
#include <iostream>

void write_sum(int x,int y)
{
    std::cout<<x<<" + "<<y<<" = "<<(x+y)<<std::endl;
}

int main()
{
    std::thread t(write_sum,123,456);
    t.join();
}

The std::thread constructor is a variadic template, so it can take any number of arguments up to the compiler's internal limit, but if you need to pass more than a couple of parameters to your thread function then you might like to rethink your design.
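
If you do find yourself with a long parameter list, one common tidy-up (a suggestion of mine, not from the original post) is to bundle the values into a single struct and pass that by value:

#include <thread>
#include <iostream>
#include <string>

struct task_params // hypothetical parameter bundle
{
    std::string name;
    int iterations;
    double factor;
};

void do_task(task_params params)
{
    std::cout<<params.name<<": "<<params.iterations*params.factor<<std::endl;
}

int main()
{
    task_params params={"scale",10,1.5};
    std::thread t(do_task,params); // one argument copied in, instead of three
    t.join();
}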

Next time

We're not done with starting threads just yet — there's a few more nuances to passing arguments which we haven't covered. In the third part of this series we'll look at passing references, and using class member functions as the thread function.

Subscribe to the RSS feed or email newsletter for this blog to be sure you don't miss the rest of the series.

Try it out

If you're using Microsoft Visual Studio 2008 or g++ 4.3 or 4.4 on Ubuntu Linux you can try out the examples from this series using our just::thread implementation of the new C++0x thread library. Get your copy today.


Multithreading in C++0x part 1: Starting Threads

Tuesday, 10 February 2009

This is the first of a series of blog posts introducing the new C++0x thread library.

Concurrency and multithreading are all about running multiple pieces of code in parallel. If you have the hardware for it in the form of a nice shiny multi-core CPU or a multi-processor system then this code can run truly in parallel, otherwise it is interleaved by the operating system — a bit of one task, then a bit of another. This is all very well, but somehow you have to specify what code to run on all these threads.

High level constructs such as the parallel algorithms in Intel's Threading Building Blocks manage the division of code between threads for you, but we don't have any of these in C++0x. Instead, we have to manage the threads ourselves. The tool for this is std::thread.

Running a simple function on another thread

Let's start by running a simple function on another thread, which we do by constructing a new std::thread object, and passing in the function to the constructor. std::thread lives in the <thread> header, so we'd better include that first.

#include <thread>

void my_thread_func()
{}

int main()
{
    std::thread t(my_thread_func);
}

If you compile and run this little app, it won't do a lot: though it starts a new thread, the thread function is empty. Let's make it do something, such as print "hello":

#include <thread>
#include <iostream>

void my_thread_func()
{
    std::cout<<"hello"<<std::endl;
}

int main()
{
    std::thread t(my_thread_func);
}

If you compile and run this little application, what happens? Does it print hello like we wanted? Well, actually there's no telling. It might do or it might not. I ran this simple application several times on my machine, and the output was unreliable: sometimes it output "hello", with a newline; sometimes it output "hello" without a newline, and sometimes it didn't output anything. What's up with that? Surely a simple app like this ought to behave predictably?

Waiting for threads to finish

Well, actually, no, this app does not have predictable behaviour. The problem is we're not waiting for our thread to finish. When the execution reaches the end of main() the program is terminated, whatever the other threads are doing. Since thread scheduling is unpredictable, we cannot know how far the other thread has got. It might have finished, it might have output the "hello", but not processed the std::endl yet, or it might not have even started. In any case it will be abruptly stopped as the application exits.

If we want to reliably print our message, we have to ensure that our thread has finished. We do that by joining with the thread by calling the join() member function of our thread object:

#include <thread>
#include <iostream>

void my_thread_func()
{
    std::cout<<"hello"<<std::endl;
}

int main()
{
    std::thread t(my_thread_func);
    t.join();
}

Now, main() will wait for the thread to finish before exiting, and the code will output "hello" followed by a newline every time. This highlights a general point: if you want a thread to have finished by a certain point in your code you have to wait for it. As well as ensuring that threads have finished by the time the program exits, this is also important if a thread has access to local variables: we want the thread to have finished before the local variables go out of scope.
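
To make the point about local variables concrete, here's a small sketch of my own (it uses the argument passing covered in part 2 of this series): the new thread writes through a pointer to a local variable in main(), so the join() must happen before that variable goes out of scope:

#include <thread>
#include <iostream>

void square(int* p)
{
    *p=(*p)*(*p);
}

int main()
{
    int value=7;
    std::thread t(square,&value);
    t.join(); // must finish before value goes out of scope
    std::cout<<value<<std::endl; // prints 49
}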

Next Time

In this article we've looked at running simple functions on a separate thread, and waiting for the thread to finish. However, when you start a thread you aren't just limited to simple functions with no arguments: in the second part of this series we will look at how to start a thread with function objects, and how to pass arguments to the thread.

Subscribe to the RSS feed or email newsletter for this blog to be sure you don't miss the rest of the series.

Try it out

If you're using Microsoft Visual Studio 2008 or g++ 4.3 or 4.4 on Ubuntu Linux you can try out the examples from this series using our just::thread implementation of the new C++0x thread library. Get your copy today.


Managing Threads with a Vector

Wednesday, 10 December 2008

One of the nice things about C++0x is the support for move semantics that comes from the new Rvalue Reference language feature. Since this is a language feature, it means that we can easily have types that are movable but not copyable without resorting to std::auto_ptr-like hackery. One such type is the new std::thread class. A thread of execution can only be associated with one std::thread object at a time, so std::thread is not copyable, but it is movable — this allows you to transfer ownership of a thread of execution between objects, and return std::thread objects from functions. The important point for today's blog post is that it allows you to store std::thread objects in containers.
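
Here's a minimal sketch of my own showing what movable-but-not-copyable means for std::thread in practice:

#include <thread>
#include <utility> // for std::move

void some_function()
{}

std::thread make_worker()
{
    std::thread t(some_function);
    return t; // ownership of the thread of execution transfers to the caller
}

int main()
{
    std::thread t1=make_worker();
    std::thread t2=std::move(t1); // t1 no longer owns a thread
    // std::thread t3=t2;         // error: std::thread is not copyable
    t2.join();
}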

Move-aware containers

The C++0x standard containers are required to be move-aware, and move objects rather than copy them when changing their position within the container. For existing copyable types that don't have a specific move constructor or move-assignment operator that takes an rvalue reference this amounts to the same thing — when a std::vector is resized, or an element is inserted in the middle, the elements will be copied to their new locations. The important difference is that you can now store types that only have a move-constructor and move-assignment operator in the standard containers because the objects are moved rather than copied.

This means that you can now write code like:

std::vector<std::thread> v;

v.push_back(std::thread(some_function));

and it all "just works". This is good news for managing multiple threads where the number of threads is not known until run-time — if you're tuning the number of threads to the number of processors, using std::thread::hardware_concurrency() for example. It also means that you can then use the std::vector<std::thread> with the standard library algorithms such as std::for_each:

void do_join(std::thread& t)
{
    t.join();
}

void join_all(std::vector<std::thread>& v)
{
    std::for_each(v.begin(),v.end(),do_join);
}

If you need an extra thread because one of your threads is blocked waiting for something, you can just use insert() or push_back() to add a new thread to the vector. Of course you can also just move threads into or out of the vector by indexing the elements directly:

std::vector<std::thread> v(std::thread::hardware_concurrency());

for(unsigned i=0;i<v.size();++i)
{
    v[i]=std::thread(do_work);
}

In fact, many of the examples in my book use std::vector<std::thread> for managing the threads, as it's the simplest way to do it.

Other containers work too

It's not just std::vector that's required to be move-aware — all the other standard containers are too. This means you can have a std::list<std::thread>, or a std::deque<std::thread>, or even a std::map<int,std::thread>. In fact, the whole C++0x standard library is designed to work with move-only types such as std::thread.
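
For instance, here's a quick sketch of mine with threads keyed by an integer in a std::map:

#include <thread>
#include <map>

void do_work()
{}

int main()
{
    std::map<int,std::thread> threads;
    threads[0]=std::thread(do_work); // move-assigned into the map
    threads[1]=std::thread(do_work);
    for(std::map<int,std::thread>::iterator it=threads.begin();
        it!=threads.end();++it)
    {
        it->second.join();
    }
}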

Try it out today

Wouldn't it be nice if you could try it out today, and get used to using containers of std::thread objects without having to wait for a C++0x compiler? Well, you can — the 0.6 beta of the just::thread C++0x Thread Library released last Friday provides a specialization of std::vector<std::thread> so that you can write code like in these examples and it will work with Microsoft Visual Studio 2008. Sign up at the just::thread Support Forum to download it today.


Peterson's lock with C++0x atomics

Friday, 05 December 2008

Bartosz Milewski shows an implementation of Peterson's locking algorithm in his latest post on C++ atomics and memory ordering. Dmitriy V'jukov posted an alternative implementation in the comments. Also in the comments, Bartosz says:

"So even though I don't have a formal proof, I believe my implementation of Peterson lock is correct. For all I know, Dmitriy's implementation might also be correct, but it's much harder to prove."

I'd like to offer an analysis of both algorithms below, to see if they are correct. However, before we start I'd also like to highlight a comment that Bartosz made in his conclusion:

"Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude."

This is something I wholeheartedly agree with. If you weren't convinced by my previous post on Memory Models and Synchronization, maybe the proof below will convince you to stick to memory_order_seq_cst (the default) unless you really need to do otherwise.

C++0x memory ordering recap

In C++0x, we have to think about things in terms of the happens-before and synchronizes-with relationships described in the Standard — it's no good saying "it works on my CPU" because different CPUs have different default ordering constraints on basic operations such as load and store. In brief, those relationships are:

Synchronizes-with
An operation A synchronizes-with an operation B if A is a store to some atomic variable m, with an ordering of std::memory_order_release, or std::memory_order_seq_cst, B is a load from the same variable m, with an ordering of std::memory_order_acquire or std::memory_order_seq_cst, and B reads the value stored by A.
Happens-before
An operation A happens-before an operation B if:
  • A is performed on the same thread as B, and A is before B in program order, or
  • A synchronizes-with B, or
  • A happens-before some other operation C, and C happens-before B.
There are a few more nuances to do with std::memory_order_consume, but this is enough for now.

If all your operations use std::memory_order_seq_cst, then there is the additional constraint of total ordering, as I mentioned before, but neither of the implementations in question use any std::memory_order_seq_cst operations, so we can leave that aside for now.
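
To make these definitions concrete before we dive in, here's a minimal illustration of my own (not part of either implementation under discussion) of a store-release synchronizing-with a load-acquire:

#include <thread>
#include <cstdatomic> // named <atomic> in the final C++ standard
#include <cassert>

std::atomic<int> data(0);
std::atomic<bool> ready(false);

void writer()
{
    data.store(42,std::memory_order_relaxed);
    ready.store(true,std::memory_order_release); // A: store-release
}

void reader()
{
    while(!ready.load(std::memory_order_acquire)) // B: load-acquire
        continue; // spin
    // B read the value stored by A, so A synchronizes-with B, and the
    // store to data happens-before the load below: the assert cannot fire.
    assert(data.load(std::memory_order_relaxed)==42);
}

int main()
{
    std::thread t1(writer);
    std::thread t2(reader);
    t1.join();
    t2.join();
}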

Now, let's look at the implementations.

Bartosz's implementation

I've extracted the code for Bartosz's implementation from his posts, and it is shown below:

class Peterson_Bartosz
{
private:
    // indexed by thread ID, 0 or 1
    std::atomic<bool> _interested[2];
    // who's yielding priority?
    std::atomic<int> _victim;
public:
    Peterson_Bartosz()
    {
       _victim.store(0, std::memory_order_release);
       _interested[0].store(false, std::memory_order_release);
       _interested[1].store(false, std::memory_order_release);
    }
    void lock()
    {
       int me = threadID; // either 0 or 1
       int he = 1 - me; // the other thread
       _interested[me].exchange(true, std::memory_order_acq_rel);
       _victim.store(me, std::memory_order_release);
       while (_interested[he].load(std::memory_order_acquire)
           && _victim.load(std::memory_order_acquire) == me)
          continue; // spin
    }
    void unlock()
    {
        int me = threadID;
        _interested[me].store(false,std::memory_order_release);
    }
};

There are three things to prove with Peterson's lock:

  • If thread 0 successfully acquires the lock, then thread 1 will not do so;
  • If thread 0 acquires the lock and then releases it, then thread 1 will successfully acquire the lock;
  • If thread 0 fails to acquire the lock, then thread 1 does so.

Let's look at each in turn.

If thread 0 successfully acquires the lock, then thread 1 will not do so

Initially _victim is 0, and the _interested variables are both false. The call to lock() from thread 0 will then set _interested[0] to true, and _victim to 0.

The loop then checks _interested[1], which is still false, so we break out of the loop, and the lock is acquired.

So, what about thread 1? Thread 1 now comes along and tries to acquire the lock. It sets _interested[1] to true, and _victim to 1, and then enters the while loop. This is where the fun begins.

The first thing we check is _interested[0]. Now, we know this was set to true in thread 0 as it acquired the lock, but the important thing is: does the CPU running thread 1 know that? Is it guaranteed by the memory model?

For it to be guaranteed by the memory model, we have to prove that the store to _interested[0] from thread 0 happens-before the load from thread 1. This is trivially true if we read true in thread 1, but that doesn't help: we need to prove that we can't read false. We therefore need to find a variable which was stored by thread 0, and loaded by thread 1, and our search comes up empty: _interested[1] is loaded by thread 1 as part of the exchange call, but it is not written by thread 0, and _victim is written by thread 1 without reading the value stored by thread 0. Consequently, there is no ordering guarantee on the read of _interested[0], and thread 1 may also break out of the while loop and acquire the lock.

This implementation is thus broken. Let's now look at Dmitriy's implementation.

Dmitriy's implementation

Dmitriy posted his implementation in the comments using the syntax for his Relacy Race Detector tool, but it's trivially convertible to C++0x syntax. Here is the C++0x version of his code:

std::atomic<int> flag0(0),flag1(0),turn(0);

void lock(unsigned index)
{
    if (0 == index)
    {
        flag0.store(1, std::memory_order_relaxed);
        turn.exchange(1, std::memory_order_acq_rel);

        while (flag1.load(std::memory_order_acquire)
            && 1 == turn.load(std::memory_order_relaxed))
            std::this_thread::yield();
    }
    else
    {
        flag1.store(1, std::memory_order_relaxed);
        turn.exchange(0, std::memory_order_acq_rel);

        while (flag0.load(std::memory_order_acquire)
            && 0 == turn.load(std::memory_order_relaxed))
            std::this_thread::yield();
    }
}

void unlock(unsigned index)
{
    if (0 == index)
    {
        flag0.store(0, std::memory_order_release);
    }
    else
    {
        flag1.store(0, std::memory_order_release);
    }
}

So, how does this code fare?

If thread 0 successfully acquires the lock, then thread 1 will not do so

Initially the turn, flag0 and flag1 variables are all 0. The call to lock() from thread 0 will then set flag0 to 1, and turn to 1. These variables are essentially equivalent to the variables in Bartosz's implementation, but turn is set to 0 when _victim is set to 1, and vice-versa. That doesn't affect the logic of the code.

The loop then checks flag1, which is still 0, so we break out of the loop, and the lock is acquired.

So, what about thread 1? Thread 1 now comes along and tries to acquire the lock. It sets flag1 to 1, and turn to 0, and then enters the while loop. This is where the fun begins.

As before, the first thing we check is flag0. Now, we know this was set to 1 in thread 0 as it acquired the lock, but the important thing is: does the CPU running thread 1 know that? Is it guaranteed by the memory model?

Again, for it to be guaranteed by the memory model, we have to prove that the store to flag0 from thread 0 happens-before the load from thread 1. This is trivially true if we read 1 in thread 1, but that doesn't help: we need to prove that we can't read 0. We therefore need to find a variable which was stored by thread 0, and loaded by thread 1, as before.

This time our search is successful: turn is set using an exchange operation, which is a read-modify-write operation. Since it uses std::memory_order_acq_rel memory ordering, it is both a load-acquire and a store-release. If the load part of the exchange reads the value written by thread 0, we're home dry: turn is stored with a similar exchange operation with std::memory_order_acq_rel in thread 0, so the store from thread 0 synchronizes-with the load from thread 1.

This means that the store to flag0 from thread 0 happens-before the exchange on turn in thread 1, and thus happens-before the load in the while loop. The load in the while loop thus reads 1 from flag0, and proceeds to check turn.

Now, since the store to turn from thread 0 happens-before the store from thread 1 (we're relying on that for the happens-before relationship on flag0, remember), we know that the value to be read will be the value we stored in thread 1: 0. Consequently, we keep looping.

OK, so if the store to turn in thread 1 reads the value stored by thread 0 then thread 1 will stay out of the lock, but what if it doesn't read the value store by thread 0? In this case, we know that the exchange call from thread 0 must have seen the value written by the exchange in thread 1 (writes to a single atomic variable always become visible in the same order for all threads), which means that the write to flag1 from thread 1 happens-before the read in thread 0 and so thread 0 cannot have acquired the lock. Since this was our initial assumption (thread 0 has acquired the lock), we're home dry — thread 1 can only acquire the lock if thread 0 didn't.

If thread 0 acquires the lock and then releases it, then thread 1 will successfully acquire the lock

OK, so we've got as far as thread 0 acquiring the lock and thread 1 waiting. What happens if thread 0 now releases the lock? It does this simply by writing 0 to flag0. The while loop in thread 1 checks flag0 every time round, and breaks out if the value read is 0. Therefore, thread 1 will eventually acquire the mutex. Of course, there is no guarantee when it will acquire the mutex — it might take arbitrarily long for the write to flag0 to make its way to thread 1, but it will get there in the end. Since flag0 is never written by thread 1, it doesn't matter whether thread 0 has already released the lock when thread 1 starts waiting, or whether thread 1 is already waiting — the while loop will still terminate, and thread 1 will acquire the lock in both cases.

That just leaves our final check.

If thread 0 fails to acquire the lock, then thread 1 does so

We've essentially already covered this when we checked that thread 1 doesn't acquire the lock if thread 0 does, but this time we're going in reverse. If thread 0 doesn't acquire the lock, it is because it sees flag1 as 1 and turn as 1. Since flag1 is only written by thread 1, if it is 1 then thread 1 must have at least called lock(). If thread 1 has called unlock then eventually flag1 will be read as 0, so thread 0 will acquire the lock. So, let's assume for now that thread 1 hasn't got that far, so flag1 is still 1. The next check is for turn to be 1. This is the value written by thread 0. If we read it as 1 then either the write to turn from thread 1 has not yet become visible to thread 0, or the write happens-before the write by thread 0, so the write from thread 0 overwrote the old value.

If the write from thread 1 happens-before the write from thread 0 then thread 1 will eventually see turn as 1 (since the last write is by thread 0), and thus thread 1 will acquire the lock. On the other hand, if the write to turn from thread 0 happens-before the write to turn from thread 1, then thread 0 will eventually see the turn as 0 and acquire the lock. Therefore, for thread 0 to be stuck waiting the last write to turn must have been by thread 0, which implies thread 1 will eventually get the lock.

Therefore, Dmitriy's implementation works.

Differences, and conclusion

The key difference between the implementations other than the naming of the variables is which variable the exchange operation is applied to. In Bartosz's implementation, the exchange is applied to _interested[me], which is only ever written by one thread for a given value of me. In Dmitriy's implementation, the exchange is applied to the turn variable, which is the variable updated by both threads. It therefore acts as a synchronization point for the threads. This is the key to the whole algorithm — even though many of the operations in Dmitriy's implementation use std::memory_order_relaxed, whereas Bartosz's implementation uses std::memory_order_acquire and std::memory_order_release everywhere, the single std::memory_order_acq_rel on the exchange on the right variable is enough.

I'd like to finish by repeating Bartosz's statement about relaxed memory orderings:

"Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude."


Memory Models and Synchronization

Monday, 24 November 2008

I have read a couple of posts on memory models over the past couple of weeks: one from Jeremy Manson on What Volatile Means in Java, and one from Bartosz Milewski entitled Who ordered sequential consistency?. Both of these cover a Sequentially Consistent memory model — in Jeremy's case because sequential consistency is required by the Java Memory Model, and in Bartosz' case because he's explaining what it means to be sequentially consistent, and why we would want that.

In a sequentially consistent memory model, there is a single total order of all atomic operations which is the same across all processors in the system. You might not know what the order is in advance, and it may change from execution to execution, but there is always a total order.

This is the default for the new C++0x atomics, and required for Java's volatile, for good reason — it is considerably easier to reason about the behaviour of code that uses sequentially consistent orderings than code that uses a more relaxed ordering.
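
For example (a sketch of mine, not from either of the posts being discussed), the total order is exactly what rules out the "both threads read zero" outcome in the classic store/load test:

#include <thread>
#include <cstdatomic> // named <atomic> in the final C++ standard
#include <cassert>

std::atomic<int> x(0),y(0);
int r1,r2;

void thread1()
{
    x.store(1); // memory_order_seq_cst by default
    r1=y.load();
}

void thread2()
{
    y.store(1);
    r2=x.load();
}

int main()
{
    std::thread t1(thread1);
    std::thread t2(thread2);
    t1.join();
    t2.join();
    // one of the stores must come first in the single total order,
    // so at least one thread must see the other thread's store
    assert(r1==1 || r2==1); // cannot fire under sequential consistency
}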

The thing is, C++0x atomics are only sequentially consistent by default — they also support more relaxed orderings.

Relaxed Atomics and Inconsistent Orderings

I briefly touched on the properties of relaxed atomic operations in my presentation on The Future of Concurrency in C++ at ACCU 2008 (see the slides). The key point is that relaxed operations are unordered. Consider this simple example with two threads:

#include <thread>
#include <cstdatomic>
std::atomic<int> x(0),y(0);

void thread1()
{
    x.store(1,std::memory_order_relaxed);
    y.store(1,std::memory_order_relaxed);
}

void thread2()
{
    int a=y.load(std::memory_order_relaxed);
    int b=x.load(std::memory_order_relaxed);
    if(a==1)
        assert(b==1);
}

std::thread t1(thread1);
std::thread t2(thread2);

All the atomic operations here are using memory_order_relaxed, so there is no enforced ordering. Therefore, even though thread1 stores x before y, there is no guarantee that the writes will reach thread2 in that order: even if a==1 (implying thread2 has seen the result of the store to y), there is no guarantee that b==1, and the assert may fire.

If we add more variables and more threads, then each thread may see a different order for the writes. Some of the results can be even more surprising than that, even with two threads. The C++0x working paper features the following example:

void thread1()
{
    int r1=y.load(std::memory_order_relaxed);
    x.store(r1,std::memory_order_relaxed);
}
void thread2()
{
    int r2=x.load(std::memory_order_relaxed);
    y.store(42,std::memory_order_relaxed);
    assert(r2==42);
}

There's no ordering between threads, so thread1 might see the store to y from thread2, and thus store the value 42 in x. The fun part comes because the load from x in thread2 can be reordered after everything else (even the store that occurs after it in the same thread) and thus load the value 42! Of course, there's no guarantee about this, so the assert may or may not fire — we just don't know.

Acquire and Release Ordering

Now you've seen quite how scary life can be with relaxed operations, it's time to look at acquire and release ordering. This provides pairwise synchronization between threads — the thread doing a load sees all the changes made before the corresponding store in another thread. Most of the time, this is actually all you need — you still get the "two cones" effect described in Jeremy's blog post.

With acquire-release ordering, independent reads of variables written independently can still give different orders in different threads, so if you do that sort of thing then you still need to think carefully. For example:

std::atomic<int> x(0),y(0);

void thread1()
{
    x.store(1,std::memory_order_release);
}

void thread2()
{
    y.store(1,std::memory_order_release);
}

void thread3()
{
    int a=x.load(std::memory_order_acquire);
    int b=y.load(std::memory_order_acquire);
}

void thread4()
{
    int c=x.load(std::memory_order_acquire);
    int d=y.load(std::memory_order_acquire);
}

Yes, thread3 and thread4 have the same code, but I separated them out to make it clear we've got two separate threads. In this example, the stores are on separate threads, so there is no ordering between them. Consequently the reader threads may see the writes in either order, and you might get a==1 and b==0 or vice versa, or both 1 or both 0. The fun part is that the two reader threads might see opposite orders, so you have a==1 and b==0, but c==0 and d==1! With sequentially consistent code, both threads must see consistent orderings, so this would be disallowed.

Summary

The details of relaxed memory models can be confusing, even for experts. If you're writing code that uses bare atomics, stick to sequential consistency until you can demonstrate that this is causing an undesirable impact on performance.

There's a lot more to the C++0x memory model and atomic operations than I can cover in a blog post — I go into much more depth in the chapter on atomics in my book.


Deadlock Detection with just::thread

Wednesday, 12 November 2008

One of the biggest problems with multithreaded programming is the possibility of deadlocks. In the excerpt from my book published over at codeguru.com (Deadlock: The problem and a solution) I discuss various ways of dealing with deadlock, such as using std::lock when acquiring multiple locks at once and acquiring locks in a fixed order.

Following such guidelines requires discipline, especially on large code bases, and occasionally we all slip up. This is where the deadlock detection mode of the just::thread library comes in: if you compile your code with deadlock detection enabled and a deadlock occurs, the library will display a stack trace of the deadlocked threads and the locations at which the synchronization objects involved in the deadlock were locked.

Let's look at the following simple code as an example.

#include <thread>
#include <mutex>
#include <iostream>

std::mutex io_mutex;

void thread_func()
{
    std::lock_guard<std::mutex> lk(io_mutex);
    std::cout<<"Hello from thread_func"<<std::endl;
}

int main()
{
    std::thread t(thread_func);
    std::lock_guard<std::mutex> lk(io_mutex);
    std::cout<<"Hello from main thread"<<std::endl;
    t.join();
    return 0;
}

Now, it is obvious just from looking at the code that there's a potential deadlock here: the main thread holds the lock on io_mutex across the call to t.join(). Therefore, if the main thread manages to lock the io_mutex before the new thread does then the program will deadlock: the main thread is waiting for thread_func to complete, but thread_func is blocked on the io_mutex, which is held by the main thread!

Compile the code and run it a few times: eventually you should hit the deadlock. In this case, the program will output "Hello from main thread" and then hang. The only way out is to kill the program.

Now compile the program again, but this time with _JUST_THREAD_DEADLOCK_CHECK defined — you can either define this in your project settings, or define it in the first line of the program with #define. It must be defined before any of the thread library headers are included in order to take effect. This time the program doesn't hang — instead it displays a message box with the title "Deadlock Detected!" looking similar to the following:

[Screenshot: deadlock stack traces]

Of course, you need to have debug symbols for your executable to get meaningful stack traces.

Anyway, this message box shows three stack traces. The first is labelled "Deadlock detected in thread 2 at:", and tells us that the deadlock was found in the call to std::thread::join from main, on line 19 of our source file (io_deadlock.cpp). Now, it's important to note that "line 19" is actually where execution will resume when join returns rather than the call site, so in this case the call to join is on line 18. If the next statement was also on line 18, the stack would report line 18 here.

The next stack trace is labelled "Thread 1 blocked at:", and tells us where the thread we're trying to join with is blocked. In this case, it's blocked in the call to mutex::lock from the std::lock_guard constructor called from thread_func returning to line 10 of our source file (the constructor is on line 9).

The final stack trace completes the circle by telling us where that mutex was locked. In this case the label says "Waiting for object locked on thread 2 at:", and the stack trace tells us it was the std::lock_guard constructor in main returning to line 17 of our source file.

This is all the information we need to see the deadlock in this case, but in more complex cases we might need to go further up the call stack, particularly if the deadlock occurs in a function called from lots of different threads, or the mutex being used in the function depends on its parameters.

The just::thread deadlock detection can help there too: if you're running the application from within the IDE, or you've got a Just-in-Time debugger installed then the application will now break into the debugger. You can then use the full capabilities of your debugger to examine the state of the application when the deadlock occurred.
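
Incidentally, once the deadlock has been identified, the fix in this particular example is simply to narrow the scope of the lock so that the main thread releases io_mutex before joining. A corrected main() might look like this (my sketch, not from the original post):

int main()
{
    std::thread t(thread_func);
    {
        std::lock_guard<std::mutex> lk(io_mutex);
        std::cout<<"Hello from main thread"<<std::endl;
    } // io_mutex is released here, before the join
    t.join(); // thread_func can now lock io_mutex and complete
    return 0;
}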

Try It Out

You can download a sample Visual C++ Express 2008 project for this example, which you can use with our just::thread implementation of the new C++0x thread library. The code should also work with g++.

just::thread doesn't just work with Microsoft Visual Studio 2008 — it's also available for g++ 4.3 on Ubuntu Linux. Get your copy today and try out the deadlock detection feature risk free with our 30-day money-back guarantee.


Implementing a Thread-Safe Queue using Condition Variables (Updated)

Tuesday, 16 September 2008

One problem that comes up time and again with multi-threaded code is how to transfer data from one thread to another. For example, one common way to parallelize a serial algorithm is to split it into independent chunks and make a pipeline — each stage in the pipeline can be run on a separate thread, and each stage adds the data to the input queue for the next stage when it's done. For this to work properly, the input queue needs to be written so that data can safely be added by one thread and removed by another thread without corrupting the data structure.

Basic Thread Safety with a Mutex

The simplest way of doing this is just to wrap a non-thread-safe queue and protect it with a mutex (the examples use the types and functions from the upcoming 1.35 release of Boost):

template<typename Data>
class concurrent_queue
{
private:
    std::queue<Data> the_queue;
    mutable boost::mutex the_mutex;
public:
    void push(const Data& data)
    {
        boost::mutex::scoped_lock lock(the_mutex);
        the_queue.push(data);
    }

    bool empty() const
    {
        boost::mutex::scoped_lock lock(the_mutex);
        return the_queue.empty();
    }

    Data& front()
    {
        boost::mutex::scoped_lock lock(the_mutex);
        return the_queue.front();
    }
    
    Data const& front() const
    {
        boost::mutex::scoped_lock lock(the_mutex);
        return the_queue.front();
    }

    void pop()
    {
        boost::mutex::scoped_lock lock(the_mutex);
        the_queue.pop();
    }
};

This design is subject to race conditions between calls to empty, front and pop if there is more than one thread removing items from the queue, but in a single-consumer system (as being discussed here), this is not a problem. There is, however, a downside to such a simple implementation: if your pipeline stages are running on separate threads, they likely have nothing to do if the queue is empty, so they end up with a wait loop:

    while(some_queue.empty())
    {
        boost::this_thread::sleep(boost::posix_time::milliseconds(50));
    }

Though the sleep avoids the high CPU consumption of a direct busy wait, there are still some obvious downsides to this formulation. Firstly, the thread has to wake every 50ms or so (or whatever the sleep period is) in order to lock the mutex, check the queue, and unlock the mutex, forcing a context switch. Secondly, the sleep period imposes a limit on how fast the thread can respond to data being added to the queue — if the data is added just before the call to sleep, the thread will wait at least 50ms before checking for data. On average, the thread will only respond to data after about half the sleep time (25ms here).

Waiting with a Condition Variable

As an alternative to continuously polling the state of the queue, the sleep in the wait loop can be replaced with a condition variable wait. If the condition variable is notified in push when data is added to an empty queue, then the waiting thread will wake. This requires access to the mutex used to protect the queue, so needs to be implemented as a member function of concurrent_queue:

template<typename Data>
class concurrent_queue
{
private:
    boost::condition_variable the_condition_variable;
public:
    void wait_for_data()
    {
        boost::mutex::scoped_lock lock(the_mutex);
        while(the_queue.empty())
        {
            the_condition_variable.wait(lock);
        }
    }
    void push(Data const& data)
    {
        boost::mutex::scoped_lock lock(the_mutex);
        bool const was_empty=the_queue.empty();
        the_queue.push(data);
        if(was_empty)
        {
            the_condition_variable.notify_one();
        }
    }
    // rest as before
};

There are three important things to note here. Firstly, the lock variable is passed as a parameter to wait — this allows the condition variable implementation to atomically unlock the mutex and add the thread to the wait queue, so that another thread can update the protected data whilst the first thread waits.

Secondly, the condition variable wait is still inside a while loop — condition variables can be subject to spurious wake-ups, so it is important to check the actual condition being waited for when the call to wait returns.

Be careful when you notify

Thirdly, the call to notify_one comes after the data is pushed on the internal queue. This avoids the waiting thread being notified if the call to the_queue.push throws an exception. As written, the call to notify_one is still within the protected region, which is potentially sub-optimal: the waiting thread might wake up immediately it is notified, and before the mutex is unlocked, in which case it will have to block when the mutex is reacquired on the exit from wait. By rewriting the function so that the notification comes after the mutex is unlocked, the waiting thread will be able to acquire the mutex without blocking:

template<typename Data>
class concurrent_queue
{
public:
    void push(Data const& data)
    {
        boost::mutex::scoped_lock lock(the_mutex);
        bool const was_empty=the_queue.empty();
        the_queue.push(data);

        lock.unlock(); // unlock the mutex

        if(was_empty)
        {
            the_condition_variable.notify_one();
        }
    }
    // rest as before
};

Reducing the locking overhead

Though the use of a condition variable has improved the pushing and waiting side of the interface, the interface for the consumer thread still has to perform excessive locking: wait_for_data, front and pop all lock the mutex, yet they will be called in quick succession by the consumer thread.

By changing the consumer interface to a single wait_and_pop function, the extra lock/unlock calls can be avoided:

template<typename Data>
class concurrent_queue
{
public:
    void wait_and_pop(Data& popped_value)
    {
        boost::mutex::scoped_lock lock(the_mutex);
        while(the_queue.empty())
        {
            the_condition_variable.wait(lock);
        }
        
        popped_value=the_queue.front();
        the_queue.pop();
    }

    // rest as before
};

Using a reference parameter to receive the result is used to transfer ownership out of the queue in order to avoid the exception safety issues of returning data by-value: if the copy constructor of a by-value return throws, then the data has been removed from the queue, but is lost, whereas with this approach, the potentially problematic copy is performed prior to modifying the queue (see Herb Sutter's Guru Of The Week #8 for a discussion of the issues). This does, of course, require that an instance of Data can be created by the calling code in order to receive the result, which is not always the case. In those cases, it might be worth using something like boost::optional to avoid this requirement.
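
For example, a sketch of that variant (mine, assuming Boost.Optional is available) keeps the copy-before-pop ordering whilst letting the caller pass in an empty optional rather than a default-constructed Data:

#include <boost/optional.hpp>

template<typename Data>
class concurrent_queue
{
public:
    void wait_and_pop(boost::optional<Data>& popped_value)
    {
        boost::mutex::scoped_lock lock(the_mutex);
        while(the_queue.empty())
        {
            the_condition_variable.wait(lock);
        }
        // the potentially-throwing copy still happens before the pop:
        popped_value=the_queue.front();
        the_queue.pop();
    }

    // rest as before
};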

Handling multiple consumers

As well as removing the locking overhead, the combined wait_and_pop function has another benefit — it automatically allows for multiple consumers. Whereas the fine-grained nature of the separate functions makes them subject to race conditions without external locking (one reason why the authors of the SGI STL advocate against making things like std::vector thread-safe — you need external locking to do many common operations, which makes the internal locking just a waste of resources), the combined function safely handles concurrent calls.
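
To see the kind of race the separate functions invite, consider the check-then-act sequence a consumer would need with the fine-grained interface (an illustrative fragment only, not code from the post; q is a queue with the earlier separate empty/front/pop functions):

if(!q.empty())              // another consumer may pop the last item...
{
    Data d=q.front();       // ...between the check and this call, so
    q.pop();                // front() and pop() may run on an empty queue
}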

If multiple threads are popping entries from a queue with plenty of data, they just get serialized inside wait_and_pop, and everything works fine. If the queue is empty, then each thread in turn will block waiting on the condition variable. When a new entry is added to the queue, one of the threads will wake and take the value, whilst the others keep blocking. If more than one thread wakes (e.g. with a spurious wake-up), or a new thread calls wait_and_pop concurrently, the while loop ensures that only one thread will do the pop, and the others will wait.

Update: As commenter David notes below, using multiple consumers does have one problem: if there are several threads waiting when data is added, only one is woken. This is exactly what you want if only one item is pushed onto the queue, but if multiple items are pushed then it would be better if more than one thread could wake. There are two solutions: use notify_all() instead of notify_one() when waking threads, or call notify_one() whenever any data is added to the queue, even if the queue is not currently empty. If all threads are notified then the surplus threads will treat it as a spurious wake and resume waiting if there isn't enough data for them. If we notify with every push() then exactly the right number of threads is woken. The latter is my preferred option: condition variable notify calls are pretty lightweight when there are no threads waiting. The revised code looks like this:

template<typename Data>
class concurrent_queue
{
public:
    void push(Data const& data)
    {
        boost::mutex::scoped_lock lock(the_mutex);
        the_queue.push(data);
        lock.unlock();
        the_condition_variable.notify_one();
    }
    // rest as before
};

There is one benefit that the separate functions give over the combined one: the ability to check for an empty queue, and do something else if it is. empty itself still works in the presence of multiple consumers, but the value it returns is transitory: whether it was true or false, there is no guarantee that it still applies by the time the thread calls wait_and_pop. For this reason it is worth adding an additional function, try_pop, which returns true if there was a value to retrieve (in which case it retrieves it), or false to indicate that the queue was empty.

template<typename Data>
class concurrent_queue
{
public:
    bool try_pop(Data& popped_value)
    {
        boost::mutex::scoped_lock lock(the_mutex);
        if(the_queue.empty())
        {
            return false;
        }
        
        popped_value=the_queue.front();
        the_queue.pop();
        return true;
    }

    // rest as before
};

By removing the separate front and pop functions, our simple, naive implementation has become a usable multiple-producer, multiple-consumer concurrent queue.

The Final Code

Here is the final code for a simple thread-safe multiple producer, multiple consumer queue:

#include <queue>
#include <boost/thread/mutex.hpp>
#include <boost/thread/condition_variable.hpp>

template<typename Data>
class concurrent_queue
{
private:
    std::queue<Data> the_queue;
    mutable boost::mutex the_mutex;
    boost::condition_variable the_condition_variable;
public:
    void push(Data const& data)
    {
        boost::mutex::scoped_lock lock(the_mutex);
        the_queue.push(data);
        lock.unlock(); // unlock before notifying, so the woken thread can lock the mutex without blocking
        the_condition_variable.notify_one();
    }

    bool empty() const
    {
        boost::mutex::scoped_lock lock(the_mutex);
        return the_queue.empty();
    }

    bool try_pop(Data& popped_value)
    {
        boost::mutex::scoped_lock lock(the_mutex);
        if(the_queue.empty())
        {
            return false;
        }
        
        popped_value=the_queue.front();
        the_queue.pop();
        return true;
    }

    void wait_and_pop(Data& popped_value)
    {
        boost::mutex::scoped_lock lock(the_mutex);
        while(the_queue.empty()) // loop: guard against spurious wake-ups
        {
            the_condition_variable.wait(lock);
        }
        
        popped_value=the_queue.front();
        the_queue.pop();
    }

};
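
As a closing illustration (my own sketch, not part of the original post), here is how a program with one producer and two consumers might use the queue; the negative sentinel value used for shutdown is an assumption of the sketch:

#include <iostream>
#include <boost/thread.hpp>

concurrent_queue<int> q; // the queue defined above

void consumer()
{
    for(;;)
    {
        int value=0;
        q.wait_and_pop(value);  // blocks until an item is available
        if(value<0)             // hypothetical sentinel meaning "shut down"
            return;
        std::cout<<value<<"\n";
    }
}

void producer()
{
    for(int i=0;i<100;++i)
    {
        q.push(i);
    }
    q.push(-1); // one sentinel per consumer
    q.push(-1);
}

int main()
{
    boost::thread c1(consumer),c2(consumer),p(producer);
    p.join();
    c1.join();
    c2.join();
}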

Posted by Anthony Williams

The Intel x86 Memory Ordering Guarantees and the C++ Memory Model

Tuesday, 26 August 2008

The July 2008 version of the Intel 64 and IA-32 Architecture documents includes the information from the memory ordering white paper I mentioned before. This makes it clear that on x86/x64 systems the preferred implementation of the C++0x atomic operations is as follows (which has been confirmed in discussions with Intel engineers):

Memory Ordering             Store            Load
std::memory_order_relaxed   MOV [mem],reg    MOV reg,[mem]
std::memory_order_acquire   n/a              MOV reg,[mem]
std::memory_order_release   MOV [mem],reg    n/a
std::memory_order_seq_cst   XCHG [mem],reg   MOV reg,[mem]

As you can see, plain MOV is enough for even sequentially-consistent loads if a LOCKed instruction such as XCHG is used for the sequentially-consistent stores.
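
To make the mapping concrete, here is a sketch (my own illustration; real programs should rely on the compiler's std::atomic rather than hand-written assembly) of how the seq_cst row might be lowered using GCC inline assembly:

// seq_cst store: XCHG with a memory operand is implicitly LOCKed on x86,
// which provides the full barrier the table relies on
void seq_cst_store(int* mem,int value)
{
    __asm__ __volatile__("xchg %0,%1"
                         : "+r"(value), "+m"(*mem)
                         :
                         : "memory");
}

// seq_cst load: a plain MOV suffices, provided every seq_cst store
// uses a LOCKed instruction such as XCHG
int seq_cst_load(int const* mem)
{
    int value;
    __asm__ __volatile__("mov %1,%0"
                         : "=r"(value)
                         : "m"(*mem)
                         : "memory");
    return value;
}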

One thing to watch out for is the Non-Temporal SSE instructions (MOVNTI, MOVNTQ, etc.), which by their very nature (i.e. non-temporal) don't follow the normal cache-coherency rules. Therefore non-temporal stores must be followed by an SFENCE instruction in order for their results to be seen by other processors in a timely fashion.
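
For example, using the SSE intrinsics, a non-temporal store can be paired with the required fence like this (a sketch; the function name is mine):

#include <emmintrin.h> // SSE2: _mm_stream_si32; also pulls in _mm_sfence

void stream_store(int* dest,int value)
{
    _mm_stream_si32(dest,value); // emits MOVNTI: a non-temporal store that bypasses the cache
    _mm_sfence();                // SFENCE: makes the store visible to other processors
}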

Additionally, if you're writing drivers which deal with memory pages marked WC (Write-Combining) then additional fence instructions will be required to ensure visibility between processors. However, if you're not programming with WC pages then this shouldn't be a problem.

Posted by Anthony Williams
