[/
  Copyright Oliver Kowalke 2017.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt
]

[#numa]
[section:numa NUMA]

Modern micro-processors contain integrated memory controllers that are connected
via channels to the memory. Memory access can be organized in two ways:[br]
Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA).

In contrast to UMA, which provides a centralized pool of memory (and thus does
not scale beyond a certain number of processors), a NUMA architecture divides the
memory into local and remote memory relative to the micro-processor.[br]
Local memory is directly attached to the processor's integrated memory controller.
Memory connected to the memory controller of another micro-processor (multi-socket
systems) is considered remote memory. If a memory controller accesses remote memory,
it has to traverse the interconnect[footnote On x86 the interconnection is implemented
by Intel's QuickPath Interconnect (QPI) and AMD's HyperTransport.] and
connect to the remote memory controller.[br]
Accessing remote memory thus adds latency overhead compared to local memory access.
Because of the different memory locations, a NUMA-system experiences ['non-uniform]
memory access time.[br]
As a consequence, the best performance is achieved by keeping memory access
local.

[$../../../../libs/fiber/doc/NUMA.png [align center]]


[heading NUMA support in Boost.Fiber]

Because operating systems expose only a subset of the NUMA functionality,
Boost.Fiber provides only a minimalistic NUMA API.

[important In order to enable NUMA support, the b2 property `numa=on` must be
specified and the additional library `libboost_fiber_numa.so` must be linked.]

[important On Windows, MinGW using the pthread implementation is not supported.]

[table Supported functionality/operating systems
    [
        []
        [AIX]
        [FreeBSD]
        [HP/UX]
        [Linux]
        [Solaris]
        [Windows]
    ]
    [
        [pin thread]
        [+]
        [+]
        [+]
        [+]
        [+]
        [+]
    ]
    [
        [logical CPUs/NUMA nodes]
        [+]
        [+]
        [+]
        [+]
        [+]
        [+[footnote Windows organizes logical cpus in groups of 64; Boost.Fiber
        maps {group-id, cpu-id} to a scalar equivalent to Linux's cpu ID
        (64 * group ID + cpu ID).]]
    ]
    [
        [NUMA node distance]
        [-]
        [-]
        [-]
        [+]
        [-]
        [-]
    ]
    [
        [tested on]
        [AIX 7.2]
        [FreeBSD 11]
        [-]
        [Arch Linux (4.10.13)]
        [OpenIndiana HIPSTER]
        [Windows 10]
    ]
]

In order to keep memory access as local as possible, the NUMA topology must be
evaluated.

        std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
        for ( auto n : topo) {
            std::cout << "node: " << n.id << " | ";
            std::cout << "cpus: ";
            for ( auto cpu_id : n.logical_cpus) {
                std::cout << cpu_id << " ";
            }
            std::cout << "| distance: ";
            for ( auto d : n.distance) {
                std::cout << d << " ";
            }
            std::cout << std::endl;
        }
        std::cout << "done" << std::endl;

output:

        node: 0 | cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23 | distance: 10 21
        node: 1 | cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31 | distance: 21 10
        done

The example shows that the system consists of 2 NUMA-nodes and that 16 logical
cpus belong to each NUMA-node. The distance measures the cost of accessing the
memory of another NUMA-node. A NUMA-node always has a distance of `10` to itself
(the lowest possible value).[br]
The position in the array corresponds to the NUMA-node ID.

Some work-loads benefit from pinning threads to logical cpus. For instance, the
scheduling algorithm __numa_work_stealing__ pins the thread that runs the fiber
scheduler to a logical cpu. This prevents the operating system scheduler from
moving the thread to another logical cpu that might run other fiber scheduler(s),
or from migrating the thread to a logical cpu that is part of another NUMA-node.

        void thread( std::uint32_t cpu_id, std::uint32_t node_id, std::vector< boost::fibers::numa::node > const& topo) {
            // thread registers itself at work-stealing scheduler
            boost::fibers::use_scheduling_algorithm< boost::fibers::numa::algo::work_stealing >( cpu_id, node_id, topo);
            ...
        }

        // evaluate the NUMA topology
        std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
        // start-thread runs on NUMA-node `0`
        auto node = topo[0];
        // start-thread is pinned to the first cpu ID in the list of logical cpus of NUMA-node `0`
        auto start_cpu_id = * node.logical_cpus.begin();
        // start worker-threads first
        std::vector< std::thread > threads;
        for ( auto & node : topo) {
            for ( std::uint32_t cpu_id : node.logical_cpus) {
                // exclude start-thread
                if ( start_cpu_id != cpu_id) {
                    // spawn thread
                    threads.emplace_back( thread, cpu_id, node.id, std::cref( topo) );
                }
            }
        }
        // start-thread registers itself on work-stealing scheduler
        boost::fibers::use_scheduling_algorithm< boost::fibers::numa::algo::work_stealing >( start_cpu_id, node.id, topo);
        ...

The example evaluates the NUMA topology with `boost::fibers::numa::topology()`
and spawns a thread for each logical cpu. Each spawned thread installs the
NUMA-aware work-stealing scheduler. The scheduler pins the thread to the
logical cpu that was specified at construction.[br]
If the local queue of one thread runs out of ready fibers, the thread tries to
steal a ready fiber from another thread running on a logical cpu that belongs to
the same NUMA-node (local memory access). If no fiber could be stolen, the
thread tries to steal fibers from logical cpus that are part of other NUMA-nodes
(remote memory access).


[heading Synopsis]

        #include <boost/fiber/numa/pin_thread.hpp>
        #include <boost/fiber/numa/topology.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        struct node {
            std::uint32_t                   id;
            std::set< std::uint32_t >       logical_cpus;
            std::vector< std::uint32_t >    distance;
        };
        bool operator<( node const&, node const&) noexcept;

        std::vector< node > topology();

        void pin_thread( std::uint32_t);
        void pin_thread( std::uint32_t, std::thread::native_handle_type);

        }}}

        #include <boost/fiber/numa/algo/work_stealing.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {
        namespace algo {

        class work_stealing;

        }}}}


[ns_class_heading numa..node]

        #include <boost/fiber/numa/topology.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        struct node {
            std::uint32_t                   id;
            std::set< std::uint32_t >       logical_cpus;
            std::vector< std::uint32_t >    distance;
        };
        bool operator<( node const&, node const&) noexcept;

        }}}

[ns_data_member_heading numa..node..id]

        std::uint32_t id;

[variablelist
[[Effects:] [ID of the NUMA-node]]
]

[ns_data_member_heading numa..node..logical_cpus]

        std::set< std::uint32_t > logical_cpus;

[variablelist
[[Effects:] [set of logical cpu IDs belonging to the NUMA-node]]
]

[ns_data_member_heading numa..node..distance]

        std::vector< std::uint32_t > distance;

[variablelist
[[Effects:] [The distance between NUMA-nodes describes the cost of accessing
remote memory.]]
[[Note:] [A NUMA-node has a distance of `10` to itself; remote NUMA-nodes
have a distance > `10`. The index in the array corresponds to the ID `id`
of the NUMA-node. At the moment only Linux returns the correct distances;
on all other operating systems remote NUMA-nodes get a default value of
`20`.]]
]

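For instance, the distance table can be printed like this (a minimal sketch,
relying only on the `node` members documented above):

        std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
        for ( auto const& n : topo) {
            // the index into `distance` is the ID of the other NUMA-node
            for ( std::uint32_t id = 0; id < n.distance.size(); ++id) {
                std::cout << "node " << n.id << " -> node " << id
                          << ": distance " << n.distance[id] << std::endl;
            }
        }
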
[ns_operator_heading numa..node..operator_less..operator<]

        bool operator<( node const& lhs, node const& rhs) noexcept;

[variablelist
[[Returns:] [`true` if `lhs != rhs` is true and the
implementation-defined total order of `node::id` values places `lhs` before
`rhs`, `false` otherwise.]]
[[Throws:] [Nothing.]]
]

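Because `operator<` defines a total order over `node::id`, NUMA-nodes can be
stored in ordered containers; a small sketch:

        std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
        // `operator<` orders the nodes, so a std::set keeps them sorted by node ID
        std::set< boost::fibers::numa::node > nodes( topo.begin(), topo.end());
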

[ns_function_heading numa..topology]

        #include <boost/fiber/numa/topology.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        std::vector< node > topology();

        }}}

[variablelist
[[Effects:] [Evaluates the NUMA topology.]]
[[Returns:] [a vector of NUMA-nodes describing the NUMA architecture of the
system (each element represents a NUMA-node).]]
[[Throws:] [`system_error`]]
]

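A hedged usage sketch (assuming `system_error` refers to `std::system_error`):

        try {
            std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
            std::cout << topo.size() << " NUMA-node(s) detected" << std::endl;
        } catch ( std::system_error const& e) {
            std::cerr << "evaluating NUMA topology failed: " << e.what() << std::endl;
        }
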

[ns_function_heading numa..pin_thread]

        #include <boost/fiber/numa/pin_thread.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        void pin_thread( std::uint32_t cpu_id);
        void pin_thread( std::uint32_t cpu_id, std::thread::native_handle_type h);

        }}}

[variablelist
[[Effects:] [The first version pins the current thread to the logical cpu with
ID `cpu_id`, i.e. the operating system scheduler will not migrate the thread to
another logical cpu. The second version pins the thread with the native handle
`h` to the logical cpu with ID `cpu_id`.]]
[[Throws:] [`system_error`]]
]

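A minimal sketch of both overloads (the choice of cpu ID is illustrative only):

        std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
        // pin the calling thread to the first logical cpu of the first NUMA-node
        std::uint32_t cpu_id = * topo[0].logical_cpus.begin();
        boost::fibers::numa::pin_thread( cpu_id);

        // pin another thread via its native handle
        std::thread t( []{ /* ... */ });
        boost::fibers::numa::pin_thread( cpu_id, t.native_handle() );
        t.join();
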

[ns_class_heading numa..work_stealing]

This class implements __algo__; the thread running this scheduler is pinned to the given
logical cpu. If the local ready-queue runs out of ready fibers, ready fibers are stolen
from other schedulers that run on logical cpus that belong to the same NUMA-node (local
memory access).[br]
If no ready fibers can be stolen from the local NUMA-node, the algorithm selects
schedulers running on other NUMA-nodes (remote memory access).[br]
The victim scheduler (from which a ready fiber is stolen) is selected at random.

        #include <boost/fiber/numa/algo/work_stealing.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {
        namespace algo {

        class work_stealing : public algorithm {
        public:
            work_stealing( std::uint32_t cpu_id,
                           std::uint32_t node_id,
                           std::vector< boost::fibers::numa::node > const& topo,
                           bool suspend = false);

            work_stealing( work_stealing const&) = delete;
            work_stealing( work_stealing &&) = delete;

            work_stealing & operator=( work_stealing const&) = delete;
            work_stealing & operator=( work_stealing &&) = delete;

            virtual void awakened( context *) noexcept;

            virtual context * pick_next() noexcept;

            virtual bool has_ready_fibers() const noexcept;

            virtual void suspend_until( std::chrono::steady_clock::time_point const&) noexcept;

            virtual void notify() noexcept;
        };

        }}}}

[heading Constructor]

        work_stealing( std::uint32_t cpu_id, std::uint32_t node_id,
                       std::vector< boost::fibers::numa::node > const& topo,
                       bool suspend = false);

[variablelist
[[Effects:] [Constructs the work-stealing scheduling algorithm. The thread is
pinned to the logical cpu with ID `cpu_id`. If the local ready-queue runs out
of ready fibers, ready fibers are stolen from other schedulers using `topo`
(which represents the NUMA-topology of the system).]]
[[Throws:] [`system_error`]]
[[Note:][If `suspend` is set to `true`, then the scheduler suspends if no ready
fiber could be stolen. The scheduler will be woken up if a sleeping fiber times
out or if it is notified remotely (by another thread or fiber scheduler).]]
]

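A sketch of installing the algorithm with `suspend` enabled (`cpu_id`, `node_id`
and `topo` are assumed to be set up as in the example above):

        // the scheduler suspends the thread instead of busy-waiting
        // when no ready fiber can be stolen
        boost::fibers::use_scheduling_algorithm< boost::fibers::numa::algo::work_stealing >(
            cpu_id, node_id, topo, true);
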
[ns_member_heading numa..work_stealing..awakened]

        virtual void awakened( context * f) noexcept;

[variablelist
[[Effects:] [Enqueues fiber `f` onto the shared ready queue.]]
[[Throws:] [Nothing.]]
]

[ns_member_heading numa..work_stealing..pick_next]

        virtual context * pick_next() noexcept;

[variablelist
[[Returns:] [the fiber at the head of the ready queue, or `nullptr` if the
queue is empty.]]
[[Throws:] [Nothing.]]
[[Note:] [Placing ready fibers onto the tail of the shared queue, and returning
them from the head of that queue, shares the thread between ready fibers in
round-robin fashion.]]
]

[ns_member_heading numa..work_stealing..has_ready_fibers]

        virtual bool has_ready_fibers() const noexcept;

[variablelist
[[Returns:] [`true` if the scheduler has fibers ready to run.]]
[[Throws:] [Nothing.]]
]

[ns_member_heading numa..work_stealing..suspend_until]

        virtual void suspend_until( std::chrono::steady_clock::time_point const& abs_time) noexcept;

[variablelist
[[Effects:] [Informs `work_stealing` that no ready fiber will be available until
time-point `abs_time`. This implementation blocks in
[@http://en.cppreference.com/w/cpp/thread/condition_variable/wait_until
`std::condition_variable::wait_until()`].]]
[[Throws:] [Nothing.]]
]

[ns_member_heading numa..work_stealing..notify]

        virtual void notify() noexcept;

[variablelist
[[Effects:] [Wakes up a pending call to [member_link
work_stealing..suspend_until], as some fibers might be ready. This implementation
wakes `suspend_until()` via
[@http://en.cppreference.com/w/cpp/thread/condition_variable/notify_all
`std::condition_variable::notify_all()`].]]
[[Throws:] [Nothing.]]
]

[endsect]