The new implementation of 8 and 16-bit ops uses the lbarx/stbcx and
lharx/sthcx instructions available in Power8 and later architectures.
This allows using smaller storage types, similar to those used by
compiler intrinsics.
Also added detection of 128-bit instructions lqarx/stqcx, which can
later be used to implement 128-bit ops.
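For illustration, a minimal sketch of an 8-bit exchange built on these instructions (GCC inline assembly, assuming a Power8-capable target; the function name and constraints are illustrative, not the actual Boost.Atomic code):

    #include <cstdint>

    inline std::uint8_t exchange_u8(volatile std::uint8_t* p, std::uint8_t v)
    {
        std::uint8_t original;
        __asm__ __volatile__
        (
            "1:\n\t"
            "lbarx %0,0,%2\n\t"   // load byte from *p and set the reservation
            "stbcx. %1,0,%2\n\t"  // store v to *p if the reservation still holds
            "bne- 1b\n\t"         // retry if the reservation was lost
            : "=&b" (original)
            : "b" (v), "b" (p)
            : "cr0", "memory"
        );
        return original;
    }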
Use ldrexb/w and strexb/w on ARMv7 and later to implement byte/word-wide
atomic ops. On older ARM versions we still have to use the 32-bit
widening implementation.
Also allowed immediate constants in some of the operations to improve
generated code.
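As a sketch of how the byte-wide case could look with these instructions and an immediate-friendly operand constraint (GCC inline assembly for ARMv7; illustrative only, not the library's exact code):

    #include <cstdint>

    inline std::uint8_t fetch_add_u8(volatile std::uint8_t* p, std::uint8_t v)
    {
        std::uint32_t original, result, tmp;
        __asm__ __volatile__
        (
            "1:\n\t"
            "ldrexb %[original], [%[ptr]]\n\t"          // load byte and mark exclusive access
            "add %[result], %[original], %[value]\n\t"  // compute the new value
            "strexb %[tmp], %[result], [%[ptr]]\n\t"    // store if still exclusive
            "teq %[tmp], #0\n\t"
            "bne 1b\n\t"                                // retry on contention
            : [original] "=&r" (original), [result] "=&r" (result), [tmp] "=&r" (tmp)
            : [ptr] "r" (p), [value] "Ir" (v)           // "Ir" permits an immediate constant operand
            : "cc", "memory"
        );
        return static_cast< std::uint8_t >(original);
    }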
Common ARM code extracted to a separate header to reuse with extra ops.
This allows for more flexibility in register allocation and potentially
more efficient code. Also, the temporary register was not exactly
customizable in the previous code, so it should have been cleaned up
anyway.
In order to support more flexible definition of the extra operations for
different platforms, define extra_operations as an addon to the existing
operations template. The extra_operations template will be used only by
the non-standard operations added by Boost.Atomic.
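Roughly, the layering looks like the following sketch (parameter and member names are illustrative, not the exact Boost.Atomic internals):

    #include <cstddef>

    template< std::size_t Size, bool Signed >
    struct operations;   // the existing core operations template

    // Addon layer: derives from a core operations specialization and adds the
    // non-standard operations on top, either with specialized code or as CAS
    // loops over the Base primitives.
    template< typename Base, std::size_t Size, bool Signed >
    struct extra_operations :
        public Base
    {
        typedef typename Base::storage_type storage_type;

        static void opaque_negate(storage_type volatile& storage) noexcept;
        static void opaque_complement(storage_type volatile& storage) noexcept;
    };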
This is an attempt to improve generated code in the calling application that
involves CAS in a tight loop. The necessity to cast between the value type and
the storage type for the `expected` argument results in inefficient code
that involves copying of the expected value and also saving the CAS result on
the stack. This has been observed at least with gcc 6.3 with a tight loop
on the user's side.
When we can ensure that the storage type can safely alias other types, and the
value type has the same size as the storage type, we can simplify CAS by
performing type punning on the `expected` reference instead of copying it back
and forth.
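A minimal sketch of the idea under those assumptions (same-size value and storage types, with the storage type allowed to alias; the names below are illustrative, not the actual Boost.Atomic code):

    #include <atomic>
    #include <cstdint>

    typedef std::int32_t value_type;
    typedef std::uint32_t storage_type;  // a may_alias-qualified type in the real code

    inline bool cas_punned(std::atomic< storage_type >& storage, value_type& expected, value_type desired)
    {
        // No round-trip copy of `expected`: the reference is reinterpreted directly,
        // so a failed CAS writes the observed value straight into the caller's variable.
        return storage.compare_exchange_strong(
            reinterpret_cast< storage_type& >(expected),
            static_cast< storage_type >(desired));
    }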
1. Expose value_type and difference_type (where present) to user's code.
2. Prohibit arithmetic operations on pointers to non-object types. In
particular, arithmetic operations such as fetch_add/fetch_sub will no longer
compile for pointers to cv void, pointers to functions and pointers to
non-static class members (see the sketch below).
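A hypothetical illustration of items 1 and 2 (not the library's exact code): the arithmetic operations are only declared when the pointee is an object type, so e.g. fetch_add on a pointer to void or to a function fails to compile.

    #include <cstddef>
    #include <type_traits>

    template< typename T >
    class atomic_pointer_ops
    {
    public:
        typedef T* value_type;                  // item 1: exposed to user's code
        typedef std::ptrdiff_t difference_type;

        // item 2: only enabled for object pointee types
        template< typename U = T >
        typename std::enable_if< std::is_object< U >::value, value_type >::type
        fetch_add(difference_type v) noexcept;
    };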
Also, use C++11 <type_traits> when possible instead of Boost.TypeTraits to
reduce dependencies. Cleaned up value_arg_type internal type usage for more
efficient argument passing.
In some of the asm blocks eax was modified as a result of cmpxchg8b but that
was not reflected in the register constraints. This could cause incorrect code
to be generated.
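For reference, a sketch of the corrected constraint pattern (32-bit x86, GCC inline assembly; illustrative, not the exact code in the library): cmpxchg8b loads the current memory value into edx:eax on failure, so the pair has to be declared as an in/out operand rather than input-only.

    #include <cstdint>

    inline bool cas64(volatile std::uint64_t* mem, std::uint64_t& expected, std::uint64_t desired)
    {
        bool success;
        __asm__ __volatile__
        (
            "lock; cmpxchg8b %[dest]\n\t"
            "sete %[result]\n\t"
            : "+A" (expected),                      // edx:eax is both read and written
              [dest] "+m" (*mem),
              [result] "=q" (success)
            : "b" ((std::uint32_t)desired),         // ebx: low half of the new value
              "c" ((std::uint32_t)(desired >> 32))  // ecx: high half of the new value
            : "memory"
        );
        return success;
    }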
Also, for non-gcc compilers that do not allow mfence availability to be auto-detected (e.g. Oracle Studio), the instruction is assumed to be supported (since SSE2 is supported by virtually every x86 CPU now). This can be changed by defining BOOST_ATOMIC_NO_MFENCE.
1. Although the compiler is supposed to support __atomic and __sync intrinsics like gcc, it does not define any macros that would allow detecting this. In particular, it is not possible to deduce which sizes of atomic operands are supported in hardware except by checking the architecture macros.
2. The compiler does not define any macros that allow deducing the target x86 CPU, which makes it impossible to know whether the CPU supports cmpxchg8b/cmpxchg16b/mfence.
Because of that, this commit changes handling of this compiler in the following way:
1. On SPARC the compiler will use the gcc_sparc backend. Also, enabled the backend for the SPARCv8+ architecture as it appears to be almost a 32-bit equivalent to SPARCv9 and does support cas/casx instructions.
2. On x86 the compiler will use the gcc_x86 backend. By default cmpxchg8b/cmpxchg16b are assumed to be supported unless BOOST_ATOMIC_NO_CMPXCHG8B/BOOST_ATOMIC_NO_CMPXCHG16B is defined (see the sketch below). The mfence instruction requires SSE2 and, although ubiquitous these days, it will still be detected as not supported for now.
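A sketch of the resulting defaults for item 2 (the opt-out macros are the ones named above; the internal capability macro names and header layout below are illustrative, not the exact Boost.Atomic headers):

    #if !defined(BOOST_ATOMIC_NO_CMPXCHG8B)
    // Assume cmpxchg8b unless the user explicitly disables it.
    #define BOOST_ATOMIC_DETAIL_X86_HAS_CMPXCHG8B 1
    #endif

    #if defined(__x86_64__) && !defined(BOOST_ATOMIC_NO_CMPXCHG16B)
    // cmpxchg16b is only meaningful on 64-bit targets; same opt-out scheme.
    #define BOOST_ATOMIC_DETAIL_X86_HAS_CMPXCHG16B 1
    #endif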
Fix for #10446. Some platforms (e.g. Raspberry Pi) only support atomic ops of a particular size and nothing smaller. Use extending arithmetic operations for these platforms. Also, make sure bools are always treated as 8-bit values, even if the actual type is larger. This makes bool handling in atomic<>, atomic_flag and the lock pool more consistent.
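As an illustration of the extending approach (a sketch assuming the byte value lives in the low bits of a 32-bit storage unit; not the library's actual code), an 8-bit fetch_add can be built on a 32-bit CAS as follows:

    #include <atomic>
    #include <cstdint>

    inline std::uint8_t fetch_add_u8(std::atomic< std::uint32_t >& storage, std::uint8_t v)
    {
        std::uint32_t old_val = storage.load(std::memory_order_relaxed);
        std::uint32_t new_val;
        do
        {
            // Apply the addition to the low byte only and keep the upper bytes intact.
            new_val = (old_val & ~std::uint32_t(0xffu)) | ((old_val + v) & 0xffu);
        }
        while (!storage.compare_exchange_weak(old_val, new_val,
            std::memory_order_seq_cst, std::memory_order_relaxed));
        return static_cast< std::uint8_t >(old_val);
    }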
Platform selection now works in two stages. First, the compiler is tested for
a supported configuration. If that fails, the OS is tested. Lastly, if
nothing succeeds, the emulation backend is selected.
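The selection order roughly corresponds to a preprocessor cascade along these lines (the conditions and header names are an illustrative outline, not the actual platform header):

    // 1. Compiler stage: pick a compiler/architecture specific backend.
    #if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    #include <boost/atomic/detail/ops_gcc_x86.hpp>
    // 2. OS stage: fall back to an OS-assisted backend.
    #elif defined(__linux__) && defined(__arm__)
    #include <boost/atomic/detail/ops_linux_arm.hpp>
    // 3. Last resort: lock-based emulation.
    #else
    #include <boost/atomic/detail/ops_emulated.hpp>
    #endif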
memory_order_consume is promoted to memory_order_acquire on these
architectures as they have a weaker memory model than other
architectures. GCC seems to behave the same way. Added nonessential
checks to compiler barriers so that the behavior is closer to thread
fences.
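The promotion itself amounts to something like the following sketch (an illustrative helper, not the library's exact code):

    #include <boost/memory_order.hpp>

    inline boost::memory_order promote_consume(boost::memory_order order) noexcept
    {
        // On these architectures memory_order_consume is treated as memory_order_acquire.
        return order == boost::memory_order_consume ? boost::memory_order_acquire : order;
    }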
Forwarding the "signed" property allows to specialize operations
implementation, which may be desired for some architectures. This also
eliminates the formally implementation-defined result of unsigned-to-
target type.
Also, for x86 implementations, removed compiler barriers since the
compilers already treat intrinsics/asm blocks as such.
Added implementation of atomic ops with gcc __atomic intrinsics.
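For example, with that backend a fetch_add reduces to a direct intrinsic call along these lines (a sketch, not the actual backend code):

    #include <cstdint>

    inline std::uint32_t fetch_add_seq_cst(std::uint32_t* storage, std::uint32_t v)
    {
        // GCC __atomic intrinsic; the memory order is passed as a constant.
        return __atomic_fetch_add(storage, v, __ATOMIC_SEQ_CST);
    }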
Implemented the unified atomic interface. Extracted atomic_flag to a
separate header. atomic_flag now follows standard requirements for
static initialization, if possible.
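When static initialization is available, the intended usage pattern is along these lines (a sketch of typical user code):

    #include <boost/atomic/atomic_flag.hpp>

    // Statically initialized to the clear state, as the standard requires.
    static boost::atomic_flag guard = BOOST_ATOMIC_FLAG_INIT;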
Cleanup of the success flag in CAS operations has been reworked. The
flag is automatically cleared by default and only set when the operation
succeeds. Also minor code reformatting and added __volatile__ specifications
to prevent the compiler from moving the assembler code around.