Multithreading and Concurrency
My book, C++ Concurrency in Action is currently available under the Manning Early Access Program. Buy it directly from the publisher now and get to see new chapters as I write them, plus a copy of the complete book when it is printed.
The just::thread implementation of the new C++0x thread library is available for Microsoft Visual Studio 2008. Order your copy today.
Multithreading in C++0x part 4: Protecting Shared Data
Saturday, 04 April 2009
This is the fourth of a series of blog posts introducing the new C++0x thread library. The first three parts covered starting threads in C++0x with simple functions, starting threads with function objects and additional arguments, and starting threads with member functions and reference arguments.
If you've read the previous parts of the series then you should be comfortable with starting threads to perform tasks "in the background", and waiting for them to finish. You can accomplish a lot of useful work like this, passing in the data to be accessed as parameters to the thread function, and then retrieving the result when the thread has completed. However, this won't do if you need to communicate between the threads whilst they are running — accessing shared memory concurrently from multiple threads causes undefined behaviour if either thread modifies the data. What you need here is some way of ensuring that the accesses are mutually exlusive, so only one thread can access the shared data at a time.
Mutual Exclusion with std::mutex
Mutexes are conceptually simple. A mutex is either "locked" or
"unlocked", and threads try and lock the mutex when they wish to
access some protected data. If the mutex is already locked then any
other threads that try and lock the mutex will have to wait. Once the
thread is done with the protected data it unlocks the mutex, and
another thread can lock the mutex. If you make sure that threads
always lock a particular mutex before accessing a particular piece of
shared data then other threads are excluded from accessing the data
until as long as another thread has locked the mutex. This prevents
concurrent access from multiple threads, and avoids the undefined
behaviour of data races. The simplest mutex provided by C++0x is
std::mutex.
Now, whilst std::mutex
has member functions for explicitly locking and unlocking, by far the
most common use case in C++ is where the mutex needs to be locked for
a specific region of code. This is where the std::lock_guard<>
template comes in handy by providing for exactly this scenario. The
constructor locks the mutex, and the destructor unlocks the mutex, so
to lock a mutex for the duration of a block of code, just construct a
std::lock_guard<>
object as a local variable at the start of the block. For example, to
protect a shared counter you can use std::lock_guard<>
to ensure that the mutex is locked for either an increment or a query
operation, as in the following example:
std::mutex m;
unsigned counter=0;
unsigned increment()
{
std::lock_guard<std::mutex> lk(m);
return ++counter;
}
unsigned query()
{
std::lock_guard<std::mutex> lk(m);
return counter;
}
This ensures that access to counter is
serialized — if more than one thread calls
query() concurrently then all but one will block until
the first has exited the function, and the remaining threads will then
have to take turns. Likewise, if more than one thread calls
increment() concurrently then all but one will
block. Since both functions lock the same mutex, if one
thread calls query() and another calls
increment() at the same time then one or other will have
to block. This mutual exclusion is the whole point of a
mutex.
Exception Safety and Mutexes
Using std::lock_guard<>
to lock the mutex has additional benefits over manually locking and
unlocking when it comes to exception safety. With manual locking, you
have to ensure that the mutex is unlocked correctly on every exit path
from the region where you need the mutex locked, including when
the region exits due to an exception. Suppose for a moment that
instead of protecting access to a simple integer counter we were
protecting access to a std::string, and appending parts
on the end. Appending to a string might have to allocate memory, and
thus might throw an exception if the memory cannot be
allocated. With std::lock_guard<>
this still isn't a problem — if an exception is thrown, the
mutex is still unlocked. To get the same behaviour with manual locking
we have to use a catch block, as shown below:
std::mutex m;
std::string s;
void append_with_lock_guard(std::string const& extra)
{
std::lock_guard<std::mutex> lk(m);
s+=extra;
}
void append_with_manual_lock(std::string const& extra)
{
m.lock();
try
{
s+=extra;
m.unlock();
}
catch(...)
{
m.unlock();
throw;
}
}
If you had to do this for every function which might throw an exception it would quickly get unwieldy. Of course, you still need to ensure that the code is exception-safe in general — it's no use automatically unlocking the mutex if the protected data is left in a state of disarray.
Next time
Next time we'll take a look at the std::unique_lock<>
template, which provides more options than std::lock_guard<>.
Subscribe to
the RSS feed
for this blog to be sure you don't miss the rest
of the series.
Multithreading in C++0x Series
Here are the posts in this series so far:
- Multithreading in C++0x Part 1: Starting Threads
- Multithreading in C++0x Part 2: Starting Threads with Function Objects and Arguments
- Multithreading in C++0x Part 3: Starting Threads with Member Functions and Reference Arguments
- Multithreading in C++0x Part 4: Protecting Shared Data
Posted by Anthony Williams
[/ threading /] permanent link
Tags: concurrency, multithreading, C++0x, thread
Stumble It!
| Submit to Reddit
| Submit to DZone ![]()
If you liked this post, why not subscribe to the RSS feed
or Follow me on Twitter?
Multithreading in C++0x part 3: Starting Threads with Member Functions and Reference Arguments
Thursday, 26 February 2009
This is the third of a series of blog posts introducing the new C++0x thread library. The first two parts covered Starting Threads in C++0x with simple functions, and starting threads with function objects and additional arguments.
If you've read the previous parts of the series then you've seen how to start threads with functions and function objects, with and without additional arguments. However, the function objects and arguments are always copied into the thread's internal storage. What if you wish to run a member function other than the function call operator, or pass a reference to an existing object?
The C++0x library can handle both these cases: the use of member
functions with std::thread
requires an additional argument for the object on which to invoke the
member function, and references are handled with
std::ref. Let's take a look at some examples.
Invoking a member function on a new thread
Starting a new thread which runs a member function of an existing
object: you just pass a pointer to the member function and a value to
use as the this pointer for the object in to the std::thread
constructor.
#include <thread>
#include <iostream>
class SayHello
{
public:
void greeting() const
{
std::cout<<"hello"<<std::endl;
}
};
int main()
{
SayHello x;
std::thread t(&SayHello::greeting,&x);
t.join();
}
You can of course pass additional arguments to the member function too:
#include <thread>
#include <iostream>
class SayHello
{
public:
void greeting(std::string const& message) const
{
std::cout<<message<<std::endl;
}
};
int main()
{
SayHello x;
std::thread t(&SayHello::greeting,&x,"goodbye");
t.join();
}
Now, the preceding examples both a plain pointer to a local object
for the this argument; if you're going to do that, you
need to ensure that the object outlives the thread, otherwise there
will be trouble. An alternative is to use a heap-allocated object and
a reference-counted pointer such as
std::shared_ptr<SayHello> to ensure that the object
stays around as long as the thread does:
#include <>
int main()
{
std::shared_ptr<SayHello> p(new SayHello);
std::thread t(&SayHello::greeting,p,"goodbye");
t.join();
}
So far, everything we've looked at has involved copying the
arguments and thread functions into the internal storage of a thread
even if those arguments are pointers, as in the this
pointers for the member functions. What if you want to pass in a
reference to an existing object, and a pointer just won't do?
That is the task of std::ref.
Passing function objects and arguments to a thread by reference
Suppose you have an object that implements the function call
operator, and you wish to invoke it on a new thread. The thing is you
want to invoke the function call operator on this particular
object rather than copying it. You could use the member function
support to call operator() explicitly, but that seems a
bit of a mess given that it is callable already. This is the
first instance in which std::ref can help — if
x is a callable object, then std::ref(x) is
too, so we can pass std::ref(x) as our function when we
start the thread, as below:
#include <thread>
#include <iostream>
#include <functional> // for std::ref
class PrintThis
{
public:
void operator()() const
{
std::cout<<"this="<<this<<std::endl;
}
};
int main()
{
PrintThis x;
x();
std::thread t(std::ref(x));
t.join();
std::thread t2(x);
t2.join();
}
In this case, the function call operator just prints the address of
the object. The exact form and values of the output will vary, but the
principle is the same: this little program should output three
lines. The first two should be the same, whilst the third is
different, as it invokes the function call operator on a copy
of x. For one run on my system it printed the following:
this=0x7fffb08bf7ef this=0x7fffb08bf7ef this=0x42674098
Of course, std::ref can be used for other arguments
too — the following code will print "x=43":
#include <thread>
#include <iostream>
#include <functional>
void increment(int& i)
{
++i;
}
int main()
{
int x=42;
std::thread t(increment,std::ref(x));
t.join();
std::cout<<"x="<<x<<std::endl;
}
When passing in references like this (or pointers for that matter),
you need to be careful not only that the referenced object outlives
the thread, but also that appropriate synchronization is used. In this
case it is fine, because we only access x before we start
the thread and after it is done, but concurrent access would need
protection with a mutex.
Next time
That wraps up all the variations on starting threads; next time we'll look at using mutexes to protect data from concurrent modification.
Subscribe to
the RSS feed
for this blog to be sure you don't miss the rest
of the series.
Multithreading in C++0x Series
Here are the posts in this series so far:
- Multithreading in C++0x Part 1: Starting Threads
- Multithreading in C++0x Part 2: Starting Threads with Function Objects and Arguments
- Multithreading in C++0x Part 3: Starting Threads with Member Functions and Reference Arguments
- Multithreading in C++0x Part 4: Protecting Shared Data
Posted by Anthony Williams
[/ threading /] permanent link
Tags: concurrency, multithreading, C++0x, thread
Stumble It!
| Submit to Reddit
| Submit to DZone ![]()
If you liked this post, why not subscribe to the RSS feed
or Follow me on Twitter?
Multithreading in C++0x part 2: Starting Threads with Function Objects and Arguments
Tuesday, 17 February 2009
This is the second of a series of blog posts introducing the new C++0x thread library. If you missed the first part, it covered Starting Threads in C++0x with simple functions.
If you read part 1 of this series, then you've seen how easy it is
to start a thread in C++0x: just construct an instance of std::thread,
passing in the function you wish to run on the new thread. Though this
is good, it would be quite limiting if new threads were constrained to
run plain functions without any arguments — all the information
needed would have to be passed via global variables, which would be
incredibly messy. Thankfully, this is not the case. Not only can you
run function objects on your new thread, as well as plain functions,
but you can pass arguments in too.
Running a function object on another thread
In keeping with the rest of the C++ standard library, you're not
limited to plain functions when starting threads — the std::thread
constructor can also be called with instances of classes that
implement the function-call operator. Let's say "hello" from our new
thread using a function object:
#include <thread>
#include <iostream>
class SayHello
{
public:
void operator()() const
{
std::cout<<"hello"<<std::endl;
}
};
int main()
{
std::thread t((SayHello()));
t.join();
}
If you're wondering about the extra parentheses around the
SayHello constructor call, this is to avoid what's known
as C++'s most vexing parse: without the parentheses, the
declaration is taken to be a declaration of a function called
t which takes a
pointer-to-a-function-with-no-parameters-returning-an-instance-of-SayHello,
and which returns a std::thread
object, rather than an object called t of type
std::thread. There
are a few other ways to avoid the problem. Firstly, you could create a
named variable of type SayHello and pass that to the
std::thread
constructor:
int main()
{
SayHello hello;
std::thread t(hello);
t.join();
}
Alternatively, you could use copy initialization:
int main()
{
std::thread t=std::thread(SayHello());
t.join();
}
And finally, if you're using a full C++0x compiler then you can use the new initialization syntax with braces instead of parentheses:
int main()
{
std::thread t{SayHello()};
t.join();
}
In this case, this is exactly equivalent to our first example with the double parentheses.
Anyway, enough about initialization. Whichever option you use, the
idea is the same: your function object is copied into internal storage
accessible to the new thread, and the new thread invokes your
operator(). Your class can of course have data members
and other member functions too, and this is one way of passing data to
the thread function: pass it in as a constructor argument and store it
as a data member:
#include <thread>
#include <iostream>
#include <string>
class Greeting
{
std::string message;
public:
explicit Greeting(std::string const& message_):
message(message_)
{}
void operator()() const
{
std::cout<<message<<std::endl;
}
};
int main()
{
std::thread t(Greeting("goodbye"));
t.join();
}
In this example, our message is stored as a data member in the
class, so when the Greeting instance is copied into the
thread the message is copied too, and this example will print
"goodbye" rather than "hello".
This example also demonstrates one way of passing information in to the new thread aside from the function to call — include it as data members of the function object. If this makes sense in terms of the function object then it's ideal, otherwise we need an alternate technique.
Passing Arguments to a Thread Function
As we've just seen, one way to pass arguments in to the thread
function is to package them in a class with a function call
operator. Well, there's no need to write a special class every time;
the standard library provides an easy way to do this in the form of
std::bind. The std::bind function template
takes a variable number of parameters. The first is always the
function or callable object which needs the parameters, and the
remainder are the parameters to pass when calling the function. The
result is a function object that stores copies of the supplied
arguments, with a function call operator that invokes the bound
function. We could therefore use this to pass the message to write to
our new thread:
#include <thread>
#include <iostream>
#include <string>
#include <functional>
void greeting(std::string const& message)
{
std::cout<<message<<std::endl;
}
int main()
{
std::thread t(std::bind(greeting,"hi!"));
t.join();
}
This works well, but we can actually do better than that — we
can pass the arguments directly to the std::thread
constructor and they will be copied into the internal storage for the
new thread and supplied to the thread function. We can thus write the
preceding example more simply as:
#include <thread>
#include <iostream>
#include <string>
void greeting(std::string const& message)
{
std::cout<<message<<std::endl;
}
int main()
{
std::thread t(greeting,"hi!");
t.join();
}
Not only is this code simpler, it's also likely to be more
efficient as the supplied arguments can be copied directly into the
internal storage for the thread rather than first into the object
generated by std::bind, which is then in turn copied into
the internal storage for the thread.
Multiple arguments can be supplied just by passing further
arguments to the std::thread
constructor:
#include <thread>
#include <iostream>
void write_sum(int x,int y)
{
std::cout<<x<<" + "<<y<<" = "<<(x+y)<<std::endl;
}
int main()
{
std::thread t(write_sum,123,456);
t.join();
}
The std::thread
constructor is a variadic template, so it can take any number of
arguments up to the compiler's internal limit, but if you need to pass
more than a couple of parameters to your thread function then you
might like to rethink your design.
Next time
We're not done with starting threads just yet — there's a few more nuances to passing arguments which we haven't covered. In the third part of this series we'll look at passing references, and using class member functions as the thread function.
Subscribe to
the RSS feed
for this blog to be sure you don't miss the rest
of the series.
Multithreading in C++0x Series
Here are the posts in this series so far:
- Multithreading in C++0x Part 1: Starting Threads
- Multithreading in C++0x Part 2: Starting Threads with Function Objects and Arguments
- Multithreading in C++0x Part 3: Starting Threads with Member Functions and Reference Arguments
- Multithreading in C++0x Part 4: Protecting Shared Data
Posted by Anthony Williams
[/ threading /] permanent link
Tags: concurrency, multithreading, C++0x, thread
Stumble It!
| Submit to Reddit
| Submit to DZone ![]()
If you liked this post, why not subscribe to the RSS feed
or Follow me on Twitter?
Multithreading in C++0x part 1: Starting Threads
Tuesday, 10 February 2009
This is the first of a series of blog posts introducing the new C++0x thread library.
Concurrency and multithreading is all about running multiple pieces of code in parallel. If you have the hardware for it in the form of a nice shiny multi-core CPU or a multi-processor system then this code can run truly in parallel, otherwise it is interleaved by the operating system — a bit of one task, then a bit of another. This is all very well, but somehow you have to specify what code to run on all these threads.
High level constructs such as the parallel algorithms in Intel's Threading Building
Blocks manage the division of code between threads for you, but we
don't have any of these in C++0x. Instead, we have to manage the
threads ourselves. The tool for this is std::thread.
Running a simple function on another thread
Let's start by running a simple function on another thread, which
we do by constructing a new std::thread
object, and passing in the function to the constructor. std::thread
lives in the <thread> header, so we'd better
include that first.
#include <thread>
void my_thread_func()
{}
int main()
{
std::thread t(my_thread_func);
}
If you compile and run this little app, it won't do a lot: though it starts a new thread, the thread function is empty. Let's make it do something, such as print "hello":
#include <thread>
#include <iostream>
void my_thread_func()
{
std::cout<<"hello"<<std::endl;
}
int main()
{
std::thread t(my_thread_func);
}
If you compile and run this little application, what happens? Does
it print hello like we wanted? Well, actually there's no
telling. It might do or it might not. I ran this simple application
several times on my machine, and the output was unreliable: sometimes
it output "hello", with a newline; sometimes it output "hello"
without a newline, and sometimes it didn't output
anything. What's up with that? Surely a simple app like this ought to
behave predictably?
Waiting for threads to finish
Well, actually, no, this app does not have
predictable behaviour. The problem is we're not waiting for our thread
to finish. When the execution reaches the end of main()
the program is terminated, whatever the other threads are doing. Since
thread scheduling is unpredictable, we cannot know how far the other
thread has got. It might have finished, it might have output the
"hello", but not processed the std::endl
yet, or it might not have even started. In any case it will be
abruptly stopped as the application exits.
If we want to reliably print our message, we have to
ensure that our thread has finished. We do that by joining
with the thread by calling the join()
member function of our thread object:
#include <thread>
#include <iostream>
void my_thread_func()
{
std::cout<<"hello"<<std::endl;
}
int main()
{
std::thread t(my_thread_func);
t.join();
}
Now, main() will wait for the thread to finish before
exiting, and the code will output "hello" followed by a newline every
time. This highlights a general point: if you want a thread to
have finished by a certain point in your code you have to wait for
it. As well as ensuring that threads have finished by the
time the program exits, this is also important if a thread has access
to local variables: we want the thread to have finished before the
local variables go out of scope.
Next Time
In this article we've looked at running simple functions on a separate thread, and waiting for the thread to finish. However, when you start a thread you aren't just limited to simple functions with no arguments: in the second part of this series we will look at how to start a thread with function objects, and how to pass arguments to the thread.
Subscribe to
the RSS feed
for this blog to be sure you don't miss the rest
of the series.
Multithreading in C++0x Series
Here are the posts in this series so far:
- Multithreading in C++0x Part 1: Starting Threads
- Multithreading in C++0x Part 2: Starting Threads with Function Objects and Arguments
- Multithreading in C++0x Part 3: Starting Threads with Member Functions and Reference Arguments
- Multithreading in C++0x Part 4: Protecting Shared Data
Posted by Anthony Williams
[/ threading /] permanent link
Tags: concurrency, multithreading, C++0x, thread
Stumble It!
| Submit to Reddit
| Submit to DZone ![]()
If you liked this post, why not subscribe to the RSS feed
or Follow me on Twitter?
Managing Threads with a Vector
Wednesday, 10 December 2008
One of the nice things about C++0x is the support for move
semantics that comes from the new Rvalue
Reference language feature. Since this is a language feature, it
means that we can easily have types that are movable but not
copyable without resorting to std::auto_ptr-like
hackery. One such type is the new std::thread class. A
thread of execution can only be associated with one
std::thread object at a time, so std::thread
is not copyable, but it is movable — this
allows you to transfer ownership of a thread of execution between
objects, and return std::thread objects from
functions. The important point for today's blog post is that it allows
you to store std::thread objects in containers.
Move-aware containers
The C++0x standard containers are required to be move-aware, and
move objects rather than copy them when changing
their position within the container. For existing copyable types that
don't have a specific move constructor or move-assignment
operator that takes an rvalue reference this amounts to the same
thing — when a std::vector is resized, or an
element is inserted in the middle, the elements will be copied to
their new locations. The important difference is that you can now
store types that only have a move-constructor and
move-assignment operator in the standard containers because the
objects are moved rather than copied.
This means that you can now write code like:
std::vector<std::thread> v; v.push_back(std::thread(some_function));
and it all "just works". This is good news for managing multiple
threads where the number of threads is not known until run-time
— if you're tuning the number of threads to the number of
processors, using
std::thread::hardware_concurrency() for example. It also
means that you can then use the
std::vector<std::thread> with the standard library
algorithms such as std::for_each:
void do_join(std::thread& t)
{
t.join();
}
void join_all(std::vector<std::thread>& v)
{
std::for_each(v.begin(),v.end(),do_join);
}
If you need an extra thread because one of your threads is blocked
waiting for something, you can just use insert() or
push_back() to add a new thread to the
vector. Of course you can also just move threads into or
out of the vector by indexing the elements directly:
std::vector<std::thread> v(std::thread::hardware_concurrency());
for(unsigned i=0;i<v.size();++i)
{
v[i]=std::thread(do_work);
}
In fact, many of the examples in my
book use std::vector<std::thread> for managing
the threads, as it's the simplest way to do it.
Other containers work too
It's not just std::vector that's required to be
move-aware — all the other standard containers are too. This
means you can have a std::list<std::thread>, or a
std::deque<std::thread>, or even a
std::map<int,std::thread>. In fact, the whole C++0x
standard library is designed to work with move-only types such as
std::thread.
Try it out today
Wouldn't it be nice if you could try it out today, and get used to
using containers of std::thread objects without having to
wait for a C++0x compiler? Well, you can — the
0.6 beta of the just::thread C++0x
Thread Library released last Friday provides a specialization of
std::vector<std::thread> so that you can write code
like in these examples and it will work with Microsoft Visual Studio
2008. Sign up at the just::thread
Support Forum to download it today.
Posted by Anthony Williams
[/ threading /] permanent link
Tags: threads, C++0x, vector, multithreading, concurrency
Stumble It!
| Submit to Reddit
| Submit to DZone ![]()
If you liked this post, why not subscribe to the RSS feed
or Follow me on Twitter?
Peterson's lock with C++0x atomics
Friday, 05 December 2008
Bartosz Milewski shows an implementation of Peterson's locking algorithm in his latest post on C++ atomics and memory ordering. Dmitriy V'jukov posted an alternative implementation in the comments. Also in the comments, Bartosz says:
"So even though I don't have a formal proof, I believe my implementation of Peterson lock is correct. For all I know, Dmitriy's implementation might also be correct, but it's much harder to prove."
I'd like to offer an analysis of both algorithms to see if they are correct, below. However, before we start I'd also like to highlight a comment that Bartosz made in his conclusion:
"Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude."
This is something I wholeheartedly agree with. If you weren't
convinced by my previous post on Memory
Models and Synchronization, maybe the proof below will convince
you to stick to memory_order_seq_cst (the default) unless
you really need to do otherwise.
C++0x memory ordering recap
In C++0x, we have to think about things in terms of the happens-before and synchronizes-with relationships described in the Standard — it's no good saying "it works on my CPU" because different CPUs have different default ordering constraints on basic operations such as load and store. In brief, those relationships are:
- Synchronizes-with
- An operation A synchronizes-with an
operation B if A is a store to some
atomic variable
m, with an ordering ofstd::memory_order_release, orstd::memory_order_seq_cst, B is a load from the same variablem, with an ordering ofstd::memory_order_acquireorstd::memory_order_seq_cst, and B reads the value stored by A. - Happens-before
- An operation A happens-before an
operation B if:
- A is performed on the same thread as B, and A is before B in program order, or
- A synchronizes-with B, or
- A happens-before some other operation C, and C happens-before B.
std::memory_order_consume, but this is enough for now.
If all your operations use std::memory_order_seq_cst,
then there is the additional constraint of total ordering, as I
mentioned before, but neither of the implementations in question use
any std::memory_order_seq_cst operations, so we
can leave that aside for now.
Now, let's look at the implementations.
Bartosz's implementation
I've extracted the code for Bartosz's implementation from his posts, and it is shown below:
class Peterson_Bartosz
{
private:
// indexed by thread ID, 0 or 1
std::atomic<bool> _interested[2];
// who's yielding priority?
std::atomic<int> _victim;
public:
Peterson_Bartosz()
{
_victim.store(0, std::memory_order_release);
_interested[0].store(false, std::memory_order_release);
_interested[1].store(false, std::memory_order_release);
}
void lock()
{
int me = threadID; // either 0 or 1
int he = 1 ? me; // the other thread
_interested[me].exchange(true, std::memory_order_acq_rel);
_victim.store(me, std::memory_order_release);
while (_interested[he].load(std::memory_order_acquire)
&& _victim.load(std::memory_order_acquire) == me)
continue; // spin
}
void unlock()
{
int me = threadID;
_interested[me].store(false,std::memory_order_release);
}
}
There are three things to prove with Peterson's lock:
- If thread 0 successfully acquires the lock, then thread 1 will not do so;
- If thread 0 acquires the lock and then releases it, then thread 1 will successfully acquire the lock;
- If thread 0 fails to acquire the lock, then thread 1 does so.
Let's look at each in turn.
If thread 0 successfully acquires the lock, then thread 1 will not do so
Initially _victim is 0, and the
_interested variables are both false. The
call to lock() from thread 0 will then set
_interested[0] to true, and
_victim to 0.
The loop then checks _interested[1], which is still
false, so we break out of the loop, and the lock is
acquired.
So, what about thread 1? Thread 1 now comes along and tries to
acquire the lock. It sets _interested[1] to
true, and _victim to 1, and then enters the
while loop. This is where the fun begins.
The first thing we check is _interested[0]. Now, we
know this was set to true in thread 0 as it acquired the
lock, but the important thing is: does the CPU running thread 1 know
that? Is it guaranteed by the memory model?
For it to be guaranteed by the memory model, we have to prove that
the store to _interested[0] from thread 0
happens-before the load from thread 1. This is trivially true
if we read true in thread 1, but that doesn't help: we
need to prove that we can't read false. We
therefore need to find a variable which was stored by thread 0, and
loaded by thread 1, and our search comes up empty:
_interested[1] is loaded by thread 1 as part of the
exchange call, but it is not written by thread 0, and
_victim is written by thread 1 without reading the value
stored by thread 0. Consequently, there is no ordering
guarantee on the read of _interested[0], and thread 1
may also break out of the while loop and acquire
the lock.
This implementation is thus broken. Let's now look at Dmitriy's implementation.
Dmitriy's implementation
Dmitriy posted his implementation in the comments using the syntax for his Relacy Race Detector tool, but it's trivially convertible to C++0x syntax. Here is the C++0x version of his code:
std::atomic<int> flag0(0),flag1(0),turn(0);
void lock(unsigned index)
{
if (0 == index)
{
flag0.store(1, std::memory_order_relaxed);
turn.exchange(1, std::memory_order_acq_rel);
while (flag1.load(std::memory_order_acquire)
&& 1 == turn.load(std::memory_order_relaxed))
std::this_thread::yield();
}
else
{
flag1.store(1, std::memory_order_relaxed);
turn.exchange(0, std::memory_order_acq_rel);
while (flag0.load(std::memory_order_acquire)
&& 0 == turn.load(std::memory_order_relaxed))
std::this_thread::yield();
}
}
void unlock(unsigned index)
{
if (0 == index)
{
flag0.store(0, std::memory_order_release);
}
else
{
flag1.store(0, std::memory_order_release);
}
}
So, how does this code fare?
If thread 0 successfully acquires the lock, then thread 1 will not do so
Initially the turn, flag0 and
flag1 variables are all 0. The call to
lock() from thread 0 will then set flag0 to
1, and turn to 1. These variables are essentially
equivalent to the variables in Bartosz's implementation, but
turn is set to 0 when _victim is set to 1,
and vice-versa. That doesn't affect the logic of the code.
The loop then checks flag1, which is still 0, so we
break out of the loop, and the lock is acquired.
So, what about thread 1? Thread 1 now comes along and tries to
acquire the lock. It sets flag1 to 1, and
turn to 0, and then enters the while
loop. This is where the fun begins.
As before, the first thing we check is flag0. Now, we
know this was set to 1 in thread 0 as it acquired the lock, but the
important thing is: does the CPU running thread 1 know that? Is it
guaranteed by the memory model?
Again, for it to be guaranteed by the memory model, we have to
prove that the store to flag0 from thread 0
happens-before the load from thread 1. This is trivially true
if we read 1 in thread 1, but that doesn't help: we need to prove that
we can't read 0. We therefore need to find a variable which
was stored by thread 0, and loaded by thread 1, as before.
This time our search is successful: turn is set using
an exchange operation, which is a read-modify-write
operation. Since it uses std::memory_order_acq_rel memory
ordering, it is both a load-acquire and a store-release. If the load
part of the exchange reads the value written by thread 0,
we're home dry: turn is stored with a similar
exchange operation with
std::memory_order_acq_rel in thread 0, so the store from
thread 0 synchronizes-with the load from thread 1.
This means that the store to flag0 from thread 0
happens-before the exchange on turn
in thread 1, and thus happens-before the load in
the while loop. The load in the
while loop thus reads 1 from flag0, and
proceeds to check turn.
Now, since the store to turn from thread 0
happens-before the store from thread 1 (we're relying on that
for the happens-before relationship on flag0,
remember), we know that the value to be read will be the value we
stored in thread 1: 0. Consequently, we keep looping.
OK, so if the store to turn in thread 1 reads the
value stored by thread 0 then thread 1 will stay out of the lock, but
what if it doesn't read the value store by thread 0? In this
case, we know that the exchange call from thread 0 must
have seen the value written by the exchange in thread 1
(writes to a single atomic variable always become visible in the same
order for all threads), which means that the write to
flag1 from thread 1 happens-before the read in
thread 0 and so thread 0 cannot have acquired the lock. Since this was
our initial assumption (thread 0 has acquired the lock), we're home
dry — thread 1 can only acquire the lock if thread 0 didn't.
If thread 0 acquires the lock and then releases it, then thread 1 will successfully acquire the lock
OK, so we've got as far as thread 0 acquiring the lock and thread 1
waiting. What happens if thread 0 now releases the lock? It does this
simply by writing 0 to flag0. The while loop
in thread 1 checks flag0 every time round, and breaks out
if the value read is 0. Therefore, thread 1 will eventually acquire
the mutex. Of course, there is no guarantee when it will
acquire the mutex — it might take arbitrarily long for the the
write to flag0 to make its way to thread 1, but it will
get there in the end. Since flag0 is never written by
thread 1, it doesn't matter whether thread 0 has already released the
lock when thread 1 starts waiting, or whether thread 1 is already
waiting — the while loop will still terminate, and
thread 1 will acquire the lock in both cases.
That just leaves our final check.
If thread 0 fails to acquire the lock, then thread 1 does so
We've essentially already covered this when we checked that thread
1 doesn't acquire the lock if thread 0 does, but this time we're going
in reverse. If thread 0 doesn't acquire the lock, it is because it
sees flag1 as 1 and turn as 1. Since
flag1 is only written by thread 1, if it is 1 then thread
1 must have at least called lock(). If thread 1 has
called unlock then eventually flag1 will be
read as 0, so thread 0 will acquire the lock. So, let's assume for now
that thread 1 hasn't got that far, so flag1 is still
1. The next check is for turn to be 1. This is the value
written by thread 0. If we read it as 1 then either the write to
turn from thread 1 has not yet become visible to thread
0, or the write happens-before the write by thread 0, so the
write from thread 0 overwrote the old value.
If the write from thread 1 happens-before the write from
thread 0 then thread 1 will eventually see turn as 1
(since the last write is by thread 0), and thus thread 1 will acquire
the lock. On the other hand, if the write to turn from
thread 0 happens-before the write to turn from
thread 1, then thread 0 will eventually see the turn as 0
and acquire the lock. Therefore, for thread 0 to be stuck waiting the
last write to turn must have been by thread 0, which
implies thread 1 will eventually get the lock.
Therefore, Dmitriy's implementation works.
Differences, and conclusion
The key difference between the implementations other than the
naming of the variables is which variable the exchange
operation is applied to. In Bartosz's implementation, the
exchange is applied to _interested[me],
which is only ever written by one thread for a given value of
me. In Dmitriy's implementation, the
exchange is applied to the turn variable,
which is the variable updated by both threads. It therefore acts as a
synchronization point for the threads. This is the key to the whole
algorithm — even though many of the operations in Dmitriy's
implementation use std::memory_order_relaxed, whereas
Bartosz's implementation uses std::memory_order_acquire
and std::memory_order_release everywhere, the single
std::memory_order_acq_rel on the exchange on
the right variable is enough.
I'd like to finish by repeating Bartosz's statement about relaxed memory orderings:
"Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude."
Posted by Anthony Williams
[/ threading /] permanent link
Tags: concurrency, C++0x, atomics, memory model, synchronization
Stumble It!
| Submit to Reddit
| Submit to DZone ![]()
If you liked this post, why not subscribe to the RSS feed
or Follow me on Twitter?
Memory Models and Synchronization
Monday, 24 November 2008
I have read a couple of posts on memory models over the couple of weeks: one from Jeremy Manson on What Volatile Means in Java, and one from Bartosz Milewski entitled Who ordered sequential consistency?. Both of these cover a Sequentially Consistent memory model — in Jeremy's case because sequential consistency is required by the Java Memory Model, and in Bartosz' case because he's explaining what it means to be sequentially consistent, and why we would want that.
In a sequentially consistent memory model, there is a single total order of all atomic operations which is the same across all processors in the system. You might not know what the order is in advance, and it may change from execution to execution, but there is always a total order.
This is the default for the new C++0x atomics, and required for
Java's volatile, for good reason — it is
considerably easier to reason about the behaviour of code that uses
sequentially consistent orderings than code that uses a more relaxed
ordering.
The thing is, C++0x atomics are only sequentially consistent by default — they also support more relaxed orderings.
Relaxed Atomics and Inconsistent Orderings
I briefly touched on the properties of relaxed atomic operations in my presentation on The Future of Concurrency in C++ at ACCU 2008 (see the slides). The key point is that relaxed operations are unordered. Consider this simple example with two threads:
#include <thread>
#include <cstdatomic>
std::atomic<int> x(0),y(0);
void thread1()
{
x.store(1,std::memory_order_relaxed);
y.store(1,std::memory_order_relaxed);
}
void thread2()
{
int a=y.load(std::memory_order_relaxed);
int b=x.load(std::memory_order_relaxed);
if(a==1)
assert(b==1);
}
std::thread t1(thread1);
std::thread t2(thread2);
All the atomic operations here are using
memory_order_relaxed, so there is no enforced
ordering. Therefore, even though thread1 stores
x before y, there is no guarantee that the
writes will reach thread2 in that order: even if
a==1 (implying thread2 has seen the result
of the store to y), there is no guarantee that
b==1, and the assert may fire.
If we add more variables and more threads, then each thread may see a different order for the writes. Some of the results can be even more surprising than that, even with two threads. The C++0x working paper features the following example:
void thread1()
{
int r1=y.load(std::memory_order_relaxed);
x.store(r1,std::memory_order_relaxed);
}
void thread2()
{
int r2=x.load(std::memory_order_relaxed);
y.store(42,std::memory_order_relaxed);
assert(r2==42);
}
There's no ordering between threads, so thread1 might
see the store to y from thread2, and thus
store the value 42 in x. The fun part comes because the
load from x in thread2 can be reordered
after everything else (even the store that occurs after it in the same
thread) and thus load the value 42! Of course, there's no guarantee
about this, so the assert may or may not fire — we
just don't know.
Acquire and Release Ordering
Now you've seen quite how scary life can be with relaxed operations, it's time to look at acquire and release ordering. This provides pairwise synchronization between threads — the thread doing a load sees all the changes made before the corresponding store in another thread. Most of the time, this is actually all you need — you still get the "two cones" effect described in Jeremy's blog post.
With acquire-release ordering, independent reads of variables written independently can still give different orders in different threads, so if you do that sort of thing then you still need to think carefully. e.g.
std::atomicx(0),y(0); void thread1() { x.store(1,std::memory_order_release); } void thread2() { y.store(1,std::memory_order_release); } void thread3() { int a=x.load(std::memory_order_acquire); int b=y.load(std::memory_order_acquire); } void thread4() { int c=x.load(std::memory_order_acquire); int d=y.load(std::memory_order_acquire); }
Yes, thread3 and thread4 have the same
code, but I separated them out to make it clear we've got two separate
threads. In this example, the stores are on separate threads, so there
is no ordering between them. Consequently the reader threads may see
the writes in either order, and you might get a==1 and
b==0 or vice versa, or both 1 or both 0. The fun part is
that the two reader threads might see opposite
orders, so you have a==1 and b==0, but
c==0 and d==1! With sequentially consistent
code, both threads must see consistent orderings, so this would be
disallowed.
Summary
The details of relaxed memory models can be confusing, even for experts. If you're writing code that uses bare atomics, stick to sequential consistency until you can demonstrate that this is causing an undesirable impact on performance.
There's a lot more to the C++0x memory model and atomic operations than I can cover in a blog post — I go into much more depth in the chapter on atomics in my book.
Posted by Anthony Williams
[/ threading /] permanent link
Stumble It!
| Submit to Reddit
| Submit to DZone ![]()
If you liked this post, why not subscribe to the RSS feed
or Follow me on Twitter?
Deadlock Detection with just::thread
Wednesday, 12 November 2008
One of the biggest problems with multithreaded programming is the
possibility of deadlocks. In the excerpt from my
book published over at codeguru.com (Deadlock:
The problem and a solution) I discuss various ways of dealing with
deadlock, such as using std::lock when acquiring multiple
locks at once and acquiring locks in a fixed order.
Following such guidelines requires discipline, especially on large
code bases, and occasionally we all slip up. This is where the
deadlock detection mode of the
just::thread library comes in: if you compile your
code with deadlock detection enabled then if a deadlock occurs the
library will display a stack trace of the deadlock threads
and the locations at which the synchronization
objects involved in the deadlock were locked.
Let's look at the following simple code for an example.
#include <thread>
#include <mutex>
#include <iostream>
std::mutex io_mutex;
void thread_func()
{
std::lock_guard<std::mutex> lk(io_mutex);
std::cout<<"Hello from thread_func"<<std::endl;
}
int main()
{
std::thread t(thread_func);
std::lock_guard<std::mutex> lk(io_mutex);
std::cout<<"Hello from main thread"<<std::endl;
t.join();
return 0;
}
Now, it is obvious just from looking at the code that there's a
potential deadlock here: the main thread holds the lock on
io_mutex across the call to
t.join(). Therefore, if the main thread manages to lock
the io_mutex before the new thread does then the program
will deadlock: the main thread is waiting for thread_func
to complete, but thread_func is blocked on the
io_mutex, which is held by the main thread!
Compile the code and run it a few times: eventually you should hit the deadlock. In this case, the program will output "Hello from main thread" and then hang. The only way out is to kill the program.
Now compile the program again, but this time with
_JUST_THREAD_DEADLOCK_CHECK defined — you can
either define this in your project settings, or define it in the first
line of the program with #define. It must be defined
before any of the thread library headers are included
in order to take effect. This time the program doesn't hang —
instead it displays a message box with the title "Deadlock Detected!"
looking similar to the following:
Of course, you need to have debug symbols for your executable to get meaningful stack traces.
Anyway, this message box shows three stack traces. The first is
labelled "Deadlock detected in thread 2 at:", and tells us that the
deadlock was found in the call to std::thread::join from
main, on line 19 of our source file
(io_deadlock.cpp). Now, it's important to note that "line 19" is
actually where execution will resume when join returns
rather than the call site, so in this case the call to
join is on line 18. If the next statement was also on
line 18, the stack would report line 18 here.
The next stack trace is labelled "Thread 1 blocked at:", and tells
us where the thread we're trying to join with is blocked. In this
case, it's blocked in the call to mutex::lock from the
std::lock_guard constructor called from
thread_func returning to line 10 of our source file (the
constructor is on line 9).
The final stack trace completes the circle by telling us where that
mutex was locked. In this case the label says "Waiting for object
locked on thread 2 at:", and the stack trace tells us it was the
std::lock_guard constructor in main
returning to line 17 of our source file.
This is all the information we need to see the deadlock in this case, but in more complex cases we might need to go further up the call stack, particularly if the deadlock occurs in a function called from lots of different threads, or the mutex being used in the function depends on its parameters.
The just::thread deadlock
detection can help there too: if you're running the application from
within the IDE, or you've got a Just-in-Time debugger installed then
the application will now break into the debugger. You can then use the
full capabilities of your debugger to examine the state of the
application when the deadlock occurred.
Try it yourself: download the beta of the just::thread C++0x
thread library today. You can also download sample Visual C++
Express 2008 project for this
example.
Posted by Anthony Williams
[/ threading /] permanent link
Tags: multithreading, deadlock, c++
Stumble It!
| Submit to Reddit
| Submit to DZone ![]()
If you liked this post, why not subscribe to the RSS feed
or Follow me on Twitter?
Implementing a Thread-Safe Queue using Condition Variables (Updated)
Tuesday, 16 September 2008
One problem that comes up time and again with multi-threaded code is how to transfer data from one thread to another. For example, one common way to parallelize a serial algorithm is to split it into independent chunks and make a pipeline — each stage in the pipeline can be run on a separate thread, and each stage adds the data to the input queue for the next stage when it's done. For this to work properly, the input queue needs to be written so that data can safely be added by one thread and removed by another thread without corrupting the data structure.
Basic Thread Safety with a Mutex
The simplest way of doing this is just to put wrap a non-thread-safe queue, and protect it with a mutex (the examples use the types and functions from the upcoming 1.35 release of Boost):
template<typename Data>
class concurrent_queue
{
private:
std::queue<Data> the_queue;
mutable boost::mutex the_mutex;
public:
void push(const Data& data)
{
boost::mutex::scoped_lock lock(the_mutex);
the_queue.push(data);
}
bool empty() const
{
boost::mutex::scoped_lock lock(the_mutex);
return the_queue.empty();
}
Data& front()
{
boost::mutex::scoped_lock lock(the_mutex);
return the_queue.front();
}
Data const& front() const
{
boost::mutex::scoped_lock lock(the_mutex);
return the_queue.front();
}
void pop()
{
boost::mutex::scoped_lock lock(the_mutex);
the_queue.pop();
}
};
This design is subject to race conditions between calls to empty, front and pop if there
is more than one thread removing items from the queue, but in a single-consumer system (as being discussed here), this is not a
problem. There is, however, a downside to such a simple implementation: if your pipeline stages are running on separate threads,
they likely have nothing to do if the queue is empty, so they end up with a wait loop:
while(some_queue.empty())
{
boost::this_thread::sleep(boost::posix_time::milliseconds(50));
}
Though the sleep avoids the high CPU consumption of a direct busy wait, there are still some obvious downsides to
this formulation. Firstly, the thread has to wake every 50ms or so (or whatever the sleep period is) in order to lock the mutex,
check the queue, and unlock the mutex, forcing a context switch. Secondly, the sleep period imposes a limit on how fast the thread
can respond to data being added to the queue — if the data is added just before the call to sleep, the thread
will wait at least 50ms before checking for data. On average, the thread will only respond to data after about half the sleep time
(25ms here).
Waiting with a Condition Variable
As an alternative to continuously polling the state of the queue, the sleep in the wait loop can be replaced with a condition
variable wait. If the condition variable is notified in push when data is added to an empty queue, then the waiting
thread will wake. This requires access to the mutex used to protect the queue, so needs to be implemented as a member function of
concurrent_queue:
template<typename Data>
class concurrent_queue
{
private:
boost::condition_variable the_condition_variable;
public:
void wait_for_data()
{
boost::mutex::scoped_lock lock(the_mutex);
while(the_queue.empty())
{
the_condition_variable.wait(lock);
}
}
void push(Data const& data)
{
boost::mutex::scoped_lock lock(the_mutex);
bool const was_empty=the_queue.empty();
the_queue.push(data);
if(was_empty)
{
the_condition_variable.notify_one();
}
}
// rest as before
};
There are three important things to note here. Firstly, the lock variable is passed as a parameter to
wait — this allows the condition variable implementation to atomically unlock the mutex and add the thread to the
wait queue, so that another thread can update the protected data whilst the first thread waits.
Secondly, the condition variable wait is still inside a while loop — condition variables can be subject to
spurious wake-ups, so it is important to check the actual condition being waited for when the call to wait
returns.
Be careful when you notify
Thirdly, the call to notify_one comes after the data is pushed on the internal queue. This avoids the
waiting thread being notified if the call to the_queue.push throws an exception. As written, the call to
notify_one is still within the protected region, which is potentially sub-optimal: the waiting thread might wake up
immediately it is notified, and before the mutex is unlocked, in which case it will have to block when the mutex is reacquired on
the exit from wait. By rewriting the function so that the notification comes after the mutex is unlocked, the
waiting thread will be able to acquire the mutex without blocking:
template<typename Data>
class concurrent_queue
{
public:
void push(Data const& data)
{
boost::mutex::scoped_lock lock(the_mutex);
bool const was_empty=the_queue.empty();
the_queue.push(data);
lock.unlock(); // unlock the mutex
if(was_empty)
{
the_condition_variable.notify_one();
}
}
// rest as before
};
Reducing the locking overhead
Though the use of a condition variable has improved the pushing and waiting side of the interface, the interface for the consumer
thread still has to perform excessive locking: wait_for_data, front and pop all lock the
mutex, yet they will be called in quick succession by the consumer thread.
By changing the consumer interface to a single wait_and_pop function, the extra lock/unlock calls can be avoided:
template<typename Data>
class concurrent_queue
{
public:
void wait_and_pop(Data& popped_value)
{
boost::mutex::scoped_lock lock(the_mutex);
while(the_queue.empty())
{
the_condition_variable.wait(lock);
}
popped_value=the_queue.front();
the_queue.pop();
}
// rest as before
};
Using a reference parameter to receive the result is used to transfer ownership out of the queue in order to avoid the exception
safety issues of returning data by-value: if the copy constructor of a by-value return throws, then the data has been removed from
the queue, but is lost, whereas with this approach, the potentially problematic copy is performed prior to modifying the queue (see
Herb Sutter's Guru Of The Week #8 for a discussion of the issues). This does, of
course, require that an instance Data can be created by the calling code in order to receive the result, which is not
always the case. In those cases, it might be worth using something like boost::optional to avoid this requirement.
Handling multiple consumers
As well as removing the locking overhead, the combined wait_and_pop function has another benefit — it
automatically allows for multiple consumers. Whereas the fine-grained nature of the separate functions makes them subject to race
conditions without external locking (one reason why the authors of the SGI
STL advocate against making things like std::vector thread-safe — you need external locking to do many common
operations, which makes the internal locking just a waste of resources), the combined function safely handles concurrent calls.
If multiple threads are popping entries from a full queue, then they just get serialized inside wait_and_pop, and
everything works fine. If the queue is empty, then each thread in turn will block waiting on the condition variable. When a new
entry is added to the queue, one of the threads will wake and take the value, whilst the others keep blocking. If more than one
thread wakes (e.g. with a spurious wake-up), or a new thread calls wait_and_pop concurrently, the while
loop ensures that only one thread will do the pop, and
the others will wait.
Update: As commenter David notes below, using multiple consumers does have one problem: if there are several
threads waiting when data is added, only one is woken. Though this is exactly what you want if only one item is pushed onto the
queue, if multiple items are pushed then it would be desirable if more than one thread could wake. There are two solutions to this:
use notify_all() instead of notify_one() when waking threads, or to call notify_one()
whenever any data is added to the queue, even if the queue is not currently empty. If all threads are notified then the extra
threads will see it as a spurious wake and resume waiting if there isn't enough data for them. If we notify with every
push() then only the right number of threads are woken. This is my preferred option: condition variable notify calls
are pretty light-weight when there are no threads waiting. The revised code looks like this:
template<typename Data>
class concurrent_queue
{
public:
void push(Data const& data)
{
boost::mutex::scoped_lock lock(the_mutex);
the_queue.push(data);
lock.unlock();
the_condition_variable.notify_one();
}
// rest as before
};
There is one benefit that the separate functions give over the combined one — the ability to check for an empty queue, and
do something else if the queue is empty. empty itself still works in the presence of multiple consumers, but the value
that it returns is transitory — there is no guarantee that it will still apply by the time a thread calls
wait_and_pop, whether it was true or false. For this reason it is worth adding an additional
function: try_pop, which returns true if there was a value to retrieve (in which case it retrieves it), or
false to indicate that the queue was empty.
template<typename Data>
class concurrent_queue
{
public:
bool try_pop(Data& popped_value)
{
boost::mutex::scoped_lock lock(the_mutex);
if(the_queue.empty())
{
return false;
}
popped_value=the_queue.front();
the_queue.pop();
return true;
}
// rest as before
};
By removing the separate front and pop functions, our simple naive implementation has now become a
usable multiple producer, multiple consumer concurrent queue.
The Final Code
Here is the final code for a simple thread-safe multiple producer, multiple consumer queue:
template<typename Data>
class concurrent_queue
{
private:
std::queue<Data> the_queue;
mutable boost::mutex the_mutex;
boost::condition_variable the_condition_variable;
public:
void push(Data const& data)
{
boost::mutex::scoped_lock lock(the_mutex);
the_queue.push(data);
lock.unlock();
the_condition_variable.notify_one();
}
bool empty() const
{
boost::mutex::scoped_lock lock(the_mutex);
return the_queue.empty();
}
bool try_pop(Data& popped_value)
{
boost::mutex::scoped_lock lock(the_mutex);
if(the_queue.empty())
{
return false;
}
popped_value=the_queue.front();
the_queue.pop();
return true;
}
void wait_and_pop(Data& popped_value)
{
boost::mutex::scoped_lock lock(the_mutex);
while(the_queue.empty())
{
the_condition_variable.wait(lock);
}
popped_value=the_queue.front();
the_queue.pop();
}
};
Posted by Anthony Williams
[/ threading /] permanent link
Tags: threading, thread safe, queue, condition variable
Stumble It!
| Submit to Reddit
| Submit to DZone ![]()
If you liked this post, why not subscribe to the RSS feed
or Follow me on Twitter?
The Intel x86 Memory Ordering Guarantees and the C++ Memory Model
Tuesday, 26 August 2008
The July 2008 version of the Intel 64 and IA-32 Architecture documents includes the information from the memory ordering white paper I mentioned before. This makes it clear that on x86/x64 systems the preferred implementation of the C++0x atomic operations is as follows (which has been confirmed in discussions with Intel engineers):
| Memory Ordering | Store | Load |
|---|---|---|
| std::memory_order_relaxed | MOV [mem],reg | MOV reg,[mem] |
| std::memory_order_acquire | n/a | MOV reg,[mem] |
| std::memory_order_release | MOV [mem],reg | n/a |
| std::memory_order_seq_cst | XCHG [mem],reg | MOV reg,[mem] |
As you can see, plain MOV is enough for even
sequentially-consistent loads if a LOCKed instruction
such as XCHG is used for the sequentially-consistent
stores.
One thing to watch out for is the Non-Temporal SSE instructions
(MOVNTI, MOVNTQ, etc.), which by their
very nature (i.e. non-temporal) don't follow the normal
cache-coherency rules. Therefore non-temporal stores must be
followed by an SFENCE instruction in order for their
results to be seen by other processors in a timely fashion.
Additionally, if you're writing drivers which deal with memory pages marked WC (Write-Combining) then additional fence instructions will be required to ensure visibility between processors. However, if you're programming with WC pages then this shouldn't be a problem.
Posted by Anthony Williams
[/ threading /] permanent link
Tags: intel, x86, c++, threading, memory ordering, memory model
Stumble It!
| Submit to Reddit
| Submit to DZone ![]()
If you liked this post, why not subscribe to the RSS feed
or Follow me on Twitter?