4eb962f025
[SVN r81976]
296 lines
13 KiB
Plaintext
296 lines
13 KiB
Plaintext
[library Boost.Lockfree
|
||
[quickbook 1.4]
|
||
[authors [Blechmann, Tim]]
|
||
[copyright 2008-2011 Tim Blechmann]
|
||
[category algorithms]
|
||
[purpose
|
||
lockfree concurrent data structures
|
||
]
|
||
[id lockfree]
|
||
[dirname lockfree]
|
||
[license
|
||
Distributed under the Boost Software License, Version 1.0.
|
||
(See accompanying file LICENSE_1_0.txt or copy at
|
||
[@http://www.boost.org/LICENSE_1_0.txt])
|
||
]
|
||
]
|
||
|
||
[c++]
|
||
|
||
|
||
[/ Images ]
|
||
|
||
[def _note_ [$images/note.png]]
|
||
[def _alert_ [$images/caution.png]]
|
||
[def _detail_ [$images/note.png]]
|
||
[def _tip_ [$images/tip.png]]
|
||
|
||
[/ Links ]
|
||
|
||
[def _lockfree_ [^boost.lockfree]]
|
||
|
||
[section Introduction & Motivation]
|
||
|
||
[h2 Introduction & Terminology]
|
||
|
||
The term *non-blocking* denotes concurrent data structures, which do not use traditional synchronization primitives like
|
||
guards to ensure thread-safety. Maurice Herlihy and Nir Shavit (compare [@http://books.google.com/books?id=pFSwuqtJgxYC
|
||
"The Art of Multiprocessor Programming"]) distinguish between 3 types of non-blocking data structures, each having different
|
||
properties:
|
||
|
||
* data structures are *wait-free*, if every concurrent operation is guaranteed to be finished in a finite number of
|
||
steps. It is therefore possible to give worst-case guarantees for the number of operations.
|
||
|
||
* data structures are *lock-free*, if some concurrent operations are guaranteed to be finished in a finite number of
|
||
steps. While it is in theory possible that some operations never make any progress, it is very unlikely to happen in
|
||
practical applications.
|
||
|
||
* data structures are *obstruction-free*, if a concurrent operation is guaranteed to be finished in a finite number of
|
||
steps, unless another concurrent operation interferes.
|
||
|
||
|
||
Some data structures can only be implemented in a lock-free manner, if they are used under certain restrictions. The
|
||
relevant aspects for the implementation of _lockfree_ are the number of producer and consumer threads. *Single-producer*
|
||
(*sp*) or *multiple producer* (*mp*) means that only a single thread or multiple concurrent threads are allowed to add
|
||
data to a data structure. *Single-consumer* (*sc*) or *Multiple-consumer* (*mc*) denote the equivalent for the removal
|
||
of data from the data structure.
|
||
|
||
|
||
[h2 Properties of Non-Blocking Data Structures]
|
||
|
||
Non-blocking data structures do not rely on locks and mutexes to ensure thread-safety. The synchronization is done completely in
|
||
user-space without any direct interaction with the operating system [footnote Spinlocks do not
|
||
directly interact with the operating system either. However it is possible that the owning thread is preempted by the
|
||
operating system, which violates the lock-free property.]. This implies that they are not prone to issues like priority
|
||
inversion (a low-priority thread needs to wait for a high-priority thread).
|
||
|
||
Instead of relying on guards, non-blocking data structures require *atomic operations* (specific CPU instructions executed
|
||
without interruption). This means that any thread either sees the state before or after the operation, but no
|
||
intermediate state can be observed. Not all hardware supports the same set of atomic instructions. If it is not
|
||
available in hardware, it can be emulated in software using guards. However this has the obvious drawback of losing the
|
||
lock-free property.
|
||
|
||
|
||
[h2 Performance of Non-Blocking Data Structures]
|
||
|
||
When discussing the performance of non-blocking data structures, one has to distinguish between *amortized* and
|
||
*worst-case* costs. The definition of 'lock-free' and 'wait-free' only mention the upper bound of an operation. Therefore
|
||
lock-free data structures are not necessarily the best choice for every use case. In order to maximise the throughput of an
|
||
application one should consider high-performance concurrent data structures [footnote
|
||
[@http://threadingbuildingblocks.org/ Intel's Thread Building Blocks library] provides many efficient concurrent data structures,
|
||
which are not necessarily lock-free.].
|
||
|
||
Lock-free data structures will be a better choice in order to optimize the latency of a system or to avoid priority inversion,
|
||
which may be necessary in real-time applications. In general we advise to consider if lock-free data structures are necessary or if
|
||
concurrent data structures are sufficient. In any case we advice to perform benchmarks with different data structures for a
|
||
specific workload.
|
||
|
||
|
||
[h2 Sources of Blocking Behavior]
|
||
|
||
Apart from locks and mutexes (which we are not using in _lockfree_ anyway), there are three other aspects, that could violate
|
||
lock-freedom:
|
||
|
||
[variablelist
|
||
[[Atomic Operations]
|
||
[Some architectures do not provide the necessary atomic operations in natively in hardware. If this is not
|
||
the case, they are emulated in software using spinlocks, which by itself is blocking.
|
||
]
|
||
]
|
||
|
||
[[Memory Allocations]
|
||
[Allocating memory from the operating system is not lock-free. This makes it impossible to implement true
|
||
dynamically-sized non-blocking data structures. The node-based data structures of _lockfree_ use a memory pool to allocate the
|
||
internal nodes. If this memory pool is exhausted, memory for new nodes has to be allocated from the operating system. However
|
||
all data structures of _lockfree_ can be configured to avoid memory allocations (instead the specific calls will fail).
|
||
This is especially useful for real-time systems that require lock-free memory allocations.
|
||
]
|
||
]
|
||
|
||
[[Exception Handling]
|
||
[The C++ exception handling does not give any guarantees about its real-time behavior. We therefore do
|
||
not encourage the use of exceptions and exception handling in lock-free code.]
|
||
]
|
||
]
|
||
|
||
[h2 Data Structures]
|
||
|
||
_lockfree_ implements three lock-free data structures:
|
||
|
||
[variablelist
|
||
[[[classref boost::lockfree::queue]]
|
||
[a lock-free multi-produced/multi-consumer queue]
|
||
]
|
||
|
||
[[[classref boost::lockfree::stack]]
|
||
[a lock-free multi-produced/multi-consumer stack]
|
||
]
|
||
|
||
[[[classref boost::lockfree::spsc_queue]]
|
||
[a wait-free single-producer/single-consumer queue (commonly known as ringbuffer)]
|
||
]
|
||
]
|
||
|
||
[h3 Data Structure Configuration]
|
||
|
||
The data structures can be configured with [@boost:/libs/parameter/doc/html/index.html Boost.Parameter]-style templates:
|
||
|
||
[variablelist
|
||
[[[classref boost::lockfree::fixed_sized]]
|
||
[Configures the data structure as *fixed sized*. The internal nodes are stored inside an array and they are addressed by
|
||
array indexing. This limits the possible size of the queue to the number of elements that can be addressed by the index
|
||
type (usually 2**16-2), but on platforms that lack double-width compare-and-exchange instructions, this is the best way
|
||
to achieve lock-freedom.
|
||
]
|
||
]
|
||
|
||
[[[classref boost::lockfree::capacity]]
|
||
[Sets the *capacity* of a data structure at compile-time. This implies that a data structure is fixed-sized.
|
||
]
|
||
]
|
||
|
||
[[[classref boost::lockfree::allocator]]
|
||
[Defines the allocator. _lockfree_ supports stateful allocator and is compatible with [@boost:/libs/interprocess/index.html Boost.Interprocess] allocators.]
|
||
]
|
||
]
|
||
|
||
|
||
[endsect]
|
||
|
||
[section Examples]
|
||
|
||
[h2 Queue]
|
||
|
||
The [classref boost::lockfree::queue boost::lockfree::queue] class implements a multi-writer/multi-reader queue. The
|
||
following example shows how integer values are produced and consumed by 4 threads each:
|
||
|
||
[import ../examples/queue.cpp]
|
||
[queue_example]
|
||
|
||
The program output is:
|
||
|
||
[pre
|
||
produced 40000000 objects.
|
||
consumed 40000000 objects.
|
||
]
|
||
|
||
|
||
[h2 Stack]
|
||
|
||
The [classref boost::lockfree::stack boost::lockfree::stack] class implements a multi-writer/multi-reader stack. The
|
||
following example shows how integer values are produced and consumed by 4 threads each:
|
||
|
||
[import ../examples/stack.cpp]
|
||
[stack_example]
|
||
|
||
|
||
The program output is:
|
||
|
||
[pre
|
||
produced 4000000 objects.
|
||
consumed 4000000 objects.
|
||
]
|
||
|
||
[h2 Waitfree Single-Producer/Single-Consumer Queue]
|
||
|
||
The [classref boost::lockfree::spsc_queue boost::lockfree::spsc_queue] class implements a wait-free single-producer/single-consumer queue. The
|
||
following example shows how integer values are produced and consumed by 2 separate threads:
|
||
|
||
[import ../examples/spsc_queue.cpp]
|
||
[spsc_queue_example]
|
||
|
||
|
||
The program output is:
|
||
|
||
[pre
|
||
produced 10000000 objects.
|
||
consumed 10000000 objects.
|
||
]
|
||
|
||
[endsect]
|
||
|
||
|
||
[section Rationale]
|
||
|
||
[section Data Structures]
|
||
|
||
The implementations are implementations of well-known data structures. The queue is based on
|
||
[@http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.3574 Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms by Michael Scott and Maged Michael],
|
||
the stack is based on [@http://books.google.com/books?id=YQg3HAAACAAJ Systems programming: coping with parallelism by R. K. Treiber]
|
||
and the spsc_queue is considered as 'folklore' and is implemented in several open-source projects including the linux kernel. All
|
||
data structures are discussed in detail in [@http://books.google.com/books?id=pFSwuqtJgxYC "The Art of Multiprocessor Programming" by Herlihy & Shavit].
|
||
|
||
[endsect]
|
||
|
||
[section Memory Management]
|
||
|
||
The lock-free [classref boost::lockfree::queue] and [classref boost::lockfree::stack] classes are node-based data structures,
|
||
based on a linked list. Memory management of lock-free data structures is a non-trivial problem, because we need to avoid that
|
||
one thread frees an internal node, while another thread still uses it. _lockfree_ uses a simple approach not returning any memory
|
||
to the operating system. Instead they maintain a *free-list* in order to reuse them later. This is done for two reasons:
|
||
first, depending on the implementation of the memory allocator freeing the memory may block (so the implementation would not
|
||
be lock-free anymore), and second, most memory reclamation algorithms are patented.
|
||
|
||
[endsect]
|
||
|
||
[section ABA Prevention]
|
||
|
||
The ABA problem is a common problem when implementing lock-free data structures. The problem occurs when updating an atomic
|
||
variable using a =compare_exchange= operation: if the value A was read, thread 1 changes it to say C and tries to update
|
||
the variable, it uses =compare_exchange= to write C, only if the current value is A. This might be a problem if in the meanwhile
|
||
thread 2 changes the value from A to B and back to A, because thread 1 does not observe the change of the state. The common way to
|
||
avoid the ABA problem is to associate a version counter with the value and change both atomically.
|
||
|
||
_lockfree_ uses a =tagged_ptr= helper class which associates a pointer with an integer tag. This usually requires a double-width
|
||
=compare_exchange=, which is not available on all platforms. IA32 did not provide the =cmpxchg8b= opcode before the pentium
|
||
processor and it is also lacking on many RISC architectures like PPC. Early X86-64 processors also did not provide a =cmpxchg16b=
|
||
instruction.
|
||
On 64bit platforms one can work around this issue, because often not the full 64bit address space is used. On X86_64 for example,
|
||
only 48bit are used for the address, so we can use the remaining 16bit for the ABA prevention tag. For details please consult the
|
||
implementation of the =boost::lockfree::detail::tagged_ptr= class.
|
||
|
||
For lock-free operations on 32bit platforms without double-width =compare_exchange=, we support a third approach: by using a
|
||
fixed-sized array to store the internal nodes we can avoid the use of 32bit pointers, but instead 16bit indices into the array
|
||
are sufficient. However this is only possible for fixed-sized data structures, that have an upper bound of internal nodes.
|
||
|
||
[endsect]
|
||
|
||
[section Interprocess Support]
|
||
|
||
The _lockfree_ data structures have basic support for [@boost:/libs/interprocess/index.html Boost.Interprocess]. The only
|
||
problem is the blocking emulation of lock-free atomics, which in the current implementation is not guaranteed to be interprocess-safe.
|
||
|
||
[endsect]
|
||
|
||
[endsect]
|
||
|
||
[xinclude autodoc.xml]
|
||
|
||
[section Appendices]
|
||
|
||
[section Supported Platforms & Compilers]
|
||
|
||
_lockfree_ has been tested on the following platforms:
|
||
|
||
* g++ 4.4, 4.5 and 4.6, linux, x86 & x86_64
|
||
* clang++ 3.0, linux, x86 & x86_64
|
||
|
||
[endsect]
|
||
|
||
[section Future Developments]
|
||
|
||
* More data structures (set, hash table, dequeue)
|
||
* Backoff schemes (exponential backoff or elimination)
|
||
|
||
[endsect]
|
||
|
||
[section References]
|
||
|
||
# [@http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.3574 Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms by Michael Scott and Maged Michael],
|
||
In Symposium on Principles of Distributed Computing, pages 267–275, 1996.
|
||
# [@http://books.google.com/books?id=pFSwuqtJgxYC M. Herlihy & Nir Shavit. The Art of Multiprocessor Programming], Morgan Kaufmann Publishers, 2008
|
||
|
||
[endsect]
|
||
|
||
[endsect]
|