[/
Copyright Oliver Kowalke 2017.
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at
http://www.boost.org/LICENSE_1_0.txt
]
[#numa]
[section:numa NUMA]
Modern micro-processors contain integrated memory controllers that are connected
via channels to the memory. Memory access can be organized in two ways:[br]
Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA).
In contrast to UMA, which provides a centralized pool of memory (and thus does
not scale beyond a certain number of processors), a NUMA architecture divides the
memory into local and remote memory relative to the micro-processor.[br]
Local memory is directly attached to the processor's integrated memory controller.
Memory connected to the memory controller of another micro-processor (in multi-socket
systems) is considered remote memory. If a memory controller accesses remote memory,
it has to traverse the interconnect[footnote On x86 the interconnection is implemented
by Intel's Quick Path Interconnect (QPI) and AMD's HyperTransport.] and
connect to the remote memory controller.[br]
Accessing remote memory therefore adds latency compared to accessing local memory.
Because of the different memory locations, a NUMA-system experiences ['non-uniform]
memory access time.[br]
As a consequence, the best performance is achieved by keeping memory access
local.
[$../../../../libs/fiber/doc/NUMA.png [align center]]
[heading NUMA support in Boost.Fiber]
Because operating systems expose only a subset of the NUMA functionality,
Boost.Fiber provides only a minimalistic NUMA API.
[important In order to enable NUMA support, the b2 property `numa=on` must be specified
and the additional library `libboost_fiber_numa.so` must be linked.]
[important On Windows, MinGW builds using the pthread implementation are not supported.]
[table Supported functionality/operating systems
[
[]
[AIX]
[FreeBSD]
[HP/UX]
[Linux]
[Solaris]
[Windows]
]
[
[pin thread]
[+]
[+]
[+]
[+]
[+]
[+]
]
[
[logical CPUs/NUMA nodes]
[+]
[+]
[+]
[+]
[+]
[+[footnote Windows organizes logical cpus in groups of 64; boost.fiber maps
{group-id, cpu-id} to a scalar equivalent to Linux's cpu ID (64 * group ID + cpu ID).]]
]
[
[NUMA node distance]
[-]
[-]
[-]
[+]
[-]
[-]
]
[
[tested on]
[AIX 7.2]
[FreeBSD 11]
[-]
[Arch Linux (4.10.13)]
[OpenIndiana HIPSTER]
[Windows 10]
]
]
In order to keep memory access as local as possible, the NUMA topology must be evaluated.

        std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
        for ( auto n : topo) {
            std::cout << "node: " << n.id << " | ";
            std::cout << "cpus: ";
            for ( auto cpu_id : n.logical_cpus) {
                std::cout << cpu_id << " ";
            }
            std::cout << "| distance: ";
            for ( auto d : n.distance) {
                std::cout << d << " ";
            }
            std::cout << std::endl;
        }
        std::cout << "done" << std::endl;

output:

        node: 0 | cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23 | distance: 10 21
        node: 1 | cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31 | distance: 21 10
        done

The example shows that the system consists of two NUMA-nodes, with 16 logical cpus
belonging to each NUMA-node. The distance measures the cost of accessing the memory
of another NUMA-node. A NUMA-node always has a distance of `10` to itself (the lowest
possible value).[br]
The position in the array corresponds to the NUMA-node ID.
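Since the index into `node::distance` corresponds to the NUMA-node ID, the distance
array can be used, for instance, to select the closest remote NUMA-node. A minimal
sketch; the helper `closest_remote_node()` is hypothetical and not part of Boost.Fiber:

        #include <cstdint>
        #include <limits>
        #include <boost/fiber/numa/topology.hpp>

        // hypothetical helper, not part of Boost.Fiber: returns the ID of the
        // remote NUMA-node with the smallest distance to node `n`
        std::uint32_t closest_remote_node( boost::fibers::numa::node const& n) {
            std::uint32_t best_id = n.id;
            std::uint32_t best_distance = std::numeric_limits< std::uint32_t >::max();
            for ( std::uint32_t id = 0; id < n.distance.size(); ++id) {
                // skip the node itself; its self-distance is always `10`
                if ( id != n.id && n.distance[id] < best_distance) {
                    best_distance = n.distance[id];
                    best_id = id;
                }
            }
            return best_id;
        }
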
Some work-loads benefit from pinning threads to logical cpus. For instance, the scheduling
algorithm __numa_work_stealing__ pins the thread that runs the fiber scheduler to
a logical cpu. This prevents the operating system scheduler from moving the thread to another
logical cpu that might run other fiber scheduler(s), or from migrating the thread to a logical
cpu that is part of another NUMA-node.

        void thread( std::uint32_t cpu_id, std::uint32_t node_id,
                     std::vector< boost::fibers::numa::node > const& topo) {
            // thread registers itself at the work-stealing scheduler
            boost::fibers::use_scheduling_algorithm< boost::fibers::numa::algo::work_stealing >( cpu_id, node_id, topo);
            ...
        }

        // evaluate the NUMA topology
        std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
        // start-thread runs on NUMA-node `0`
        auto node = topo[0];
        // start-thread is pinned to the first cpu ID in the list of logical cpus of NUMA-node `0`
        auto start_cpu_id = * node.logical_cpus.begin();
        // start worker-threads first
        std::vector< std::thread > threads;
        for ( auto & node : topo) {
            for ( std::uint32_t cpu_id : node.logical_cpus) {
                // exclude start-thread
                if ( start_cpu_id != cpu_id) {
                    // spawn thread
                    threads.emplace_back( thread, cpu_id, node.id, std::cref( topo) );
                }
            }
        }
        // start-thread registers itself at the work-stealing scheduler
        boost::fibers::use_scheduling_algorithm< boost::fibers::numa::algo::work_stealing >( start_cpu_id, node.id, topo);
        ...

The example evaluates the NUMA topology with `boost::fibers::numa::topology()`
and spawns a thread for each logical cpu. Each spawned thread installs the
NUMA-aware work-stealing scheduler. The scheduler pins the thread to the
logical cpu that was specified at construction.[br]
If the local queue of one thread runs out of ready fibers, the thread tries to
steal a ready fiber from another thread running on a logical cpu that belongs to
the same NUMA-node (local memory access). If no fiber could be stolen, the
thread tries to steal fibers from logical cpus that are part of other NUMA-nodes (remote
memory access).
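The local-before-remote ordering can be derived from the topology data alone. The
following sketch is illustrative only, not Boost.Fiber's actual implementation; the
helper `steal_order()` is hypothetical and merely shows how the logical cpus of
potential victims could be ordered so that cpus of the own NUMA-node come first:

        #include <cstdint>
        #include <vector>
        #include <boost/fiber/numa/topology.hpp>

        // illustrative sketch only, not Boost.Fiber's implementation: order the
        // logical cpus of potential victims so that cpus of the own NUMA-node
        // (local memory access) precede cpus of remote NUMA-nodes
        std::vector< std::uint32_t > steal_order( std::uint32_t own_node_id,
                std::vector< boost::fibers::numa::node > const& topo) {
            std::vector< std::uint32_t > victims;
            // the index in `topo` corresponds to the NUMA-node ID
            for ( std::uint32_t cpu_id : topo[own_node_id].logical_cpus) {
                victims.push_back( cpu_id);
            }
            // cpus of remote NUMA-nodes are appended afterwards
            for ( auto & n : topo) {
                if ( n.id != own_node_id) {
                    for ( std::uint32_t cpu_id : n.logical_cpus) {
                        victims.push_back( cpu_id);
                    }
                }
            }
            return victims;
        }
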
[heading Synopsis]

        #include <boost/fiber/numa/pin_thread.hpp>
        #include <boost/fiber/numa/topology.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        struct node {
            std::uint32_t                   id;
            std::set< std::uint32_t >       logical_cpus;
            std::vector< std::uint32_t >    distance;
        };

        bool operator<( node const&, node const&) noexcept;

        std::vector< node > topology();

        void pin_thread( std::uint32_t);
        void pin_thread( std::uint32_t, std::thread::native_handle_type);

        }}}

        #include <boost/fiber/numa/algo/work_stealing.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {
        namespace algo {

        class work_stealing;

        }}}}

[ns_class_heading numa..node]

        #include <boost/fiber/numa/topology.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        struct node {
            std::uint32_t                   id;
            std::set< std::uint32_t >       logical_cpus;
            std::vector< std::uint32_t >    distance;
        };

        bool operator<( node const&, node const&) noexcept;

        }}}

[ns_data_member_heading numa..node..id]

        std::uint32_t id;

[variablelist
[[Effects:] [ID of the NUMA-node]]
]
[ns_data_member_heading numa..node..logical_cpus]

        std::set< std::uint32_t > logical_cpus;

[variablelist
[[Effects:] [set of logical cpu IDs belonging to the NUMA-node]]
]
[ns_data_member_heading numa..node..distance]

        std::vector< std::uint32_t > distance;

[variablelist
[[Effects:] [The distance between NUMA-nodes describes the cost of accessing the
remote memory.]]
[[Note:] [A NUMA-node has a distance of `10` to itself; remote NUMA-nodes
have a distance > `10`. The index in the array corresponds to the ID `id`
of the NUMA-node. At the moment only Linux returns the correct distances;
on all other operating systems remote NUMA-nodes get a default value of
`20`.]]
]
[ns_operator_heading numa..node..operator_less..operator<]

        bool operator<( node const& lhs, node const& rhs) noexcept;

[variablelist
[[Returns:] [`true` if `lhs != rhs` is true and the
implementation-defined total order of `node::id` values places `lhs` before
`rhs`, `false` otherwise.]]
[[Throws:] [Nothing.]]
]
[ns_function_heading numa..topology]

        #include <boost/fiber/numa/topology.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        std::vector< node > topology();

        }}}

[variablelist
[[Effects:] [Evaluates the NUMA topology.]]
[[Returns:] [a vector of NUMA-nodes describing the NUMA architecture of the
system (each element represents a NUMA-node).]]
[[Throws:] [`system_error`]]
]
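A minimal usage sketch; catching `std::system_error` is an assumption derived from
the `Throws:` clause above:

        #include <iostream>
        #include <system_error>
        #include <vector>
        #include <boost/fiber/numa/topology.hpp>

        int main() {
            try {
                // evaluate the NUMA topology of the system
                std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
                std::cout << topo.size() << " NUMA-node(s) found" << std::endl;
            } catch ( std::system_error const& e) {
                std::cerr << "evaluating the topology failed: " << e.what() << std::endl;
                return 1;
            }
            return 0;
        }
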
[ns_function_heading numa..pin_thread]

        #include <boost/fiber/numa/pin_thread.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        void pin_thread( std::uint32_t cpu_id);
        void pin_thread( std::uint32_t cpu_id, std::thread::native_handle_type h);

        }}}

[variablelist
[[Effects:] [The first version pins the current thread to the logical cpu with ID `cpu_id`, i.e.
the operating system scheduler will not migrate the thread to another logical cpu.
The second version pins the thread with the native handle `h` to the logical cpu with ID `cpu_id`.]]
[[Throws:] [`system_error`]]
]
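A brief sketch of both overloads; the cpu IDs `0` and `1` are placeholders (on a
real system the IDs reported by `topology()` should be used):

        #include <thread>
        #include <boost/fiber/numa/pin_thread.hpp>

        int main() {
            // pin the current thread to the logical cpu with ID `0`
            boost::fibers::numa::pin_thread( 0);

            // pin another thread, identified by its native handle,
            // to the logical cpu with ID `1`
            std::thread t( []{ /* ... */ });
            boost::fibers::numa::pin_thread( 1, t.native_handle() );
            t.join();
            return 0;
        }
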
[ns_class_heading numa..work_stealing]
This class implements __algo__; the thread running this scheduler is pinned to the given
logical cpu. If the local ready-queue runs out of ready fibers, ready fibers are stolen
from other schedulers that run on logical cpus that belong to the same NUMA-node (local
memory access).[br]
If no ready fibers can be stolen from the local NUMA-node, the algorithm selects
schedulers running on other NUMA-nodes (remote memory access).[br]
The victim scheduler (from which a ready fiber is stolen) is selected at random.

        #include <boost/fiber/numa/algo/work_stealing.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {
        namespace algo {

        class work_stealing : public algorithm {
        public:
            work_stealing( std::uint32_t cpu_id,
                           std::uint32_t node_id,
                           std::vector< boost::fibers::numa::node > const& topo,
                           bool suspend = false);

            work_stealing( work_stealing const&) = delete;
            work_stealing( work_stealing &&) = delete;

            work_stealing & operator=( work_stealing const&) = delete;
            work_stealing & operator=( work_stealing &&) = delete;

            virtual void awakened( context *) noexcept;

            virtual context * pick_next() noexcept;

            virtual bool has_ready_fibers() const noexcept;

            virtual void suspend_until( std::chrono::steady_clock::time_point const&) noexcept;

            virtual void notify() noexcept;
        };

        }}}}

[heading Constructor]

        work_stealing( std::uint32_t cpu_id, std::uint32_t node_id,
                       std::vector< boost::fibers::numa::node > const& topo,
                       bool suspend = false);

[variablelist
[[Effects:] [Constructs the work-stealing scheduling algorithm. The thread is pinned to the logical cpu with ID
`cpu_id`. If the local ready-queue runs out of ready fibers, ready fibers are stolen from other schedulers
using `topo` (which represents the NUMA-topology of the system).]]
[[Throws:] [`system_error`]]
[[Note:][If `suspend` is set to `true`, the scheduler suspends if no ready fiber could be stolen.
The scheduler will be woken up if a sleeping fiber times out or if it is notified from remote (another thread or
fiber scheduler).]]
]
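For example, a thread might install the scheduler with `suspend` enabled so that it
blocks instead of busy-waiting while no ready fiber is available. A minimal sketch,
assuming the topology contains at least one NUMA-node with at least one logical cpu:

        #include <cstdint>
        #include <vector>
        #include <boost/fiber/operations.hpp>
        #include <boost/fiber/numa/algo/work_stealing.hpp>
        #include <boost/fiber/numa/topology.hpp>

        int main() {
            std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
            // pick the first NUMA-node and its first logical cpu
            auto & node = topo[0];
            std::uint32_t cpu_id = * node.logical_cpus.begin();
            // install the NUMA-aware work-stealing scheduler on this thread;
            // `suspend == true` lets the thread block instead of busy-waiting
            boost::fibers::use_scheduling_algorithm< boost::fibers::numa::algo::work_stealing >(
                    cpu_id, node.id, topo, true);
            // ... launch fibers ...
            return 0;
        }
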
[ns_member_heading numa..work_stealing..awakened]

        virtual void awakened( context * f) noexcept;

[variablelist
[[Effects:] [Enqueues fiber `f` onto the shared ready queue.]]
[[Throws:] [Nothing.]]
]
[ns_member_heading numa..work_stealing..pick_next]

        virtual context * pick_next() noexcept;

[variablelist
[[Returns:] [the fiber at the head of the ready queue, or `nullptr` if the
queue is empty.]]
[[Throws:] [Nothing.]]
[[Note:] [Placing ready fibers onto the tail of the shared queue, and returning them
from the head of that queue, shares the thread between ready fibers in
round-robin fashion.]]
]
[ns_member_heading numa..work_stealing..has_ready_fibers]

        virtual bool has_ready_fibers() const noexcept;

[variablelist
[[Returns:] [`true` if scheduler has fibers ready to run.]]
[[Throws:] [Nothing.]]
]
[ns_member_heading numa..work_stealing..suspend_until]

        virtual void suspend_until( std::chrono::steady_clock::time_point const& abs_time) noexcept;

[variablelist
[[Effects:] [Informs `work_stealing` that no ready fiber will be available until
time-point `abs_time`. This implementation blocks in
[@http://en.cppreference.com/w/cpp/thread/condition_variable/wait_until
`std::condition_variable::wait_until()`].]]
[[Throws:] [Nothing.]]
]
[ns_member_heading numa..work_stealing..notify]

        virtual void notify() noexcept;

[variablelist
[[Effects:] [Wake up a pending call to [member_link
work_stealing..suspend_until], some fibers might be ready. This implementation
wakes `suspend_until()` via
[@http://en.cppreference.com/w/cpp/thread/condition_variable/notify_all
`std::condition_variable::notify_all()`].]]
[[Throws:] [Nothing.]]
]
[endsect]