[/
  Copyright Oliver Kowalke 2017.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt
]

[#numa]
[section:numa NUMA]

Modern micro-processors contain integrated memory controllers that are connected
via channels to the memory. Memory access can be organized in two ways:[br]
Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA).
In contrast to UMA, which provides a centralized pool of memory (and thus does
not scale beyond a certain number of processors), a NUMA architecture divides the
memory into local and remote memory relative to the micro-processor.[br]
Local memory is directly attached to the processor's integrated memory controller.
Memory connected to the memory controller of another micro-processor (multi-socket
systems) is considered remote memory. If a memory controller accesses remote memory,
it has to traverse the interconnect[footnote On x86 the interconnect is implemented
by Intel's QuickPath Interconnect (QPI) and AMD's HyperTransport.] and
connect to the remote memory controller.[br]
Thus accessing remote memory adds additional latency overhead compared to local
memory access.
Because of the different memory locations, a NUMA system experiences ['non-uniform]
memory access times.[br]
As a consequence, the best performance is achieved by keeping memory access
local.
[$../../../../libs/fiber/doc/NUMA.png [align center]]

[heading NUMA support in Boost.Fiber]
Because operating systems expose only a subset of the NUMA functionality,
Boost.Fiber provides only a minimalistic NUMA API.

[important In order to enable NUMA support, the b2 property `numa=on` must be specified
and the program must be linked against the additional library `libboost_fiber_numa.so`.]
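For example, a Linux build and link step might look like the following sketch;
the toolset and the source file name `app.cpp` are assumptions that depend on
your installation:

        b2 toolset=gcc numa=on
        g++ -pthread app.cpp -lboost_fiber_numa -lboost_fiber -lboost_context
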
[important On Windows, MinGW builds using the pthread implementation are not supported.]

[table Supported functionality/operating systems
    [
        []
        [AIX]
        [FreeBSD]
        [HP/UX]
        [Linux]
        [Solaris]
        [Windows]
    ]
    [
        [pin thread]
        [+]
        [+]
        [+]
        [+]
        [+]
        [+]
    ]
    [
        [logical CPUs/NUMA nodes]
        [+]
        [+]
        [+]
        [+]
        [+]
        [+[footnote Windows organizes logical cpus in groups of 64; Boost.Fiber maps
        {group-id, cpu-id} to a scalar equivalent to the cpu ID of Linux (64 * group ID + cpu ID).]]
    ]
    [
        [NUMA node distance]
        [-]
        [-]
        [-]
        [+]
        [-]
        [-]
    ]
    [
        [tested on]
        [AIX 7.2]
        [FreeBSD 11]
        [-]
        [Arch Linux (4.10.13)]
        [OpenIndiana HIPSTER]
        [Windows 10]
    ]
]
In order to keep memory access as local as possible, the NUMA topology must be evaluated.

        std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
        for ( auto n : topo) {
            std::cout << "node: " << n.id << " | ";
            std::cout << "cpus: ";
            for ( auto cpu_id : n.logical_cpus) {
                std::cout << cpu_id << " ";
            }
            std::cout << "| distance: ";
            for ( auto d : n.distance) {
                std::cout << d << " ";
            }
            std::cout << std::endl;
        }
        std::cout << "done" << std::endl;
output:

        node: 0 | cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23 | distance: 10 21
        node: 1 | cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31 | distance: 21 10
        done

The example shows that the system consists of two NUMA-nodes; 16 logical cpus belong
to each NUMA-node. The distance measures the cost of accessing the memory of another NUMA-node.
A NUMA-node always has a distance of `10` to itself (the lowest possible value).[br]
The position in the array corresponds to the NUMA-node ID.
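For instance, the `distance` array can be used to select the cheapest remote NUMA-node.
The following is a minimal sketch; it assumes, as stated above, that node IDs are dense
and equal to the index in the vector returned by `topology()`:

        #include <cstdint>
        #include <iostream>
        #include <vector>

        #include <boost/fiber/numa/topology.hpp>

        int main() {
            std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
            if ( 1 < topo.size() && 1 < topo[0].distance.size() ) {
                // distances from NUMA-node `0` to all nodes; index == node ID
                auto const& d = topo[0].distance;
                std::uint32_t nearest = 1;
                for ( std::uint32_t id = 2; id < d.size(); ++id) {
                    if ( d[id] < d[nearest]) {
                        nearest = id; // cheaper remote memory access
                    }
                }
                std::cout << "nearest remote node of node 0: " << nearest << std::endl;
            }
            return 0;
        }
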
Some workloads benefit from pinning threads to logical cpus. For instance, the scheduling
algorithm __numa_work_stealing__ pins the thread that runs the fiber scheduler to
a logical cpu. This prevents the operating system scheduler from moving the thread to another
logical cpu that might run other fiber scheduler(s), or from migrating the thread to a logical
cpu belonging to another NUMA-node.

        void thread( std::uint32_t cpu_id, std::uint32_t node_id, std::vector< boost::fibers::numa::node > const& topo) {
            // thread registers itself at work-stealing scheduler
            boost::fibers::use_scheduling_algorithm< boost::fibers::algo::numa::work_stealing >( cpu_id, node_id, topo);
            ...
        }

        // evaluate the NUMA topology
        std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
        // start-thread runs on NUMA-node `0`
        auto node = topo[0];
        // start-thread is pinned to the first cpu ID in the list of logical cpus of NUMA-node `0`
        auto start_cpu_id = * node.logical_cpus.begin();
        // start worker-threads first
        std::vector< std::thread > threads;
        for ( auto & node : topo) {
            for ( std::uint32_t cpu_id : node.logical_cpus) {
                // exclude start-thread
                if ( start_cpu_id != cpu_id) {
                    // spawn thread
                    threads.emplace_back( thread, cpu_id, node.id, std::cref( topo) );
                }
            }
        }
        // start-thread registers itself on work-stealing scheduler
        boost::fibers::use_scheduling_algorithm< boost::fibers::algo::numa::work_stealing >( start_cpu_id, node.id, topo);
        ...

The example evaluates the NUMA topology with `boost::fibers::numa::topology()`
and spawns a thread for each logical cpu. Each spawned thread installs the
NUMA-aware work-stealing scheduler. The scheduler pins the thread to the
logical cpu that was specified at construction.[br]
If the local queue of one thread runs out of ready fibers, the thread tries to
steal a ready fiber from another thread running on a logical cpu that belongs to
the same NUMA-node (local memory access). If no fiber could be stolen, the
thread tries to steal fibers from logical cpus that are part of other NUMA-nodes
(remote memory access).
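Expanded into a complete program, this pattern might look like the sketch below;
the termination handshake via a fiber-level condition variable and the number of
launched fibers are illustrative choices, not requirements of the library:

        #include <cstdint>
        #include <functional>
        #include <iostream>
        #include <mutex>
        #include <thread>
        #include <vector>

        #include <boost/fiber/all.hpp>
        #include <boost/fiber/numa/algo/work_stealing.hpp>
        #include <boost/fiber/numa/topology.hpp>

        static boost::fibers::mutex mtx;
        static boost::fibers::condition_variable cnd;
        static bool done = false;

        // worker-thread: registers at the NUMA-aware scheduler, then sleeps on a
        // fiber-level condition-variable until the start-thread signals `done`
        void worker( std::uint32_t cpu_id, std::uint32_t node_id,
                     std::vector< boost::fibers::numa::node > const& topo) {
            boost::fibers::use_scheduling_algorithm<
                boost::fibers::algo::numa::work_stealing >( cpu_id, node_id, topo);
            std::unique_lock< boost::fibers::mutex > lk( mtx);
            cnd.wait( lk, []{ return done; });
        }

        int main() {
            auto topo = boost::fibers::numa::topology();
            auto node = topo[0];
            std::uint32_t start_cpu_id = * node.logical_cpus.begin();
            // spawn a worker-thread for every logical cpu except the start-thread's
            std::vector< std::thread > threads;
            for ( auto & n : topo) {
                for ( std::uint32_t cpu_id : n.logical_cpus) {
                    if ( start_cpu_id != cpu_id) {
                        threads.emplace_back( worker, cpu_id, n.id, std::cref( topo) );
                    }
                }
            }
            boost::fibers::use_scheduling_algorithm<
                boost::fibers::algo::numa::work_stealing >( start_cpu_id, node.id, topo);
            // launch some fibers; they might be stolen by any of the worker-threads
            std::vector< boost::fibers::fiber > fibers;
            for ( int i = 0; i < 16; ++i) {
                fibers.emplace_back( [i]{
                        std::cout << "fiber " << i << " on thread "
                                  << std::this_thread::get_id() << "\n";
                    });
            }
            for ( auto & f : fibers) { f.join(); }
            // signal the worker-threads to terminate
            {
                std::unique_lock< boost::fibers::mutex > lk( mtx);
                done = true;
            }
            cnd.notify_all();
            for ( auto & t : threads) { t.join(); }
            return 0;
        }
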
[heading Synopsis]

        #include <boost/fiber/numa/pin_thread.hpp>
        #include <boost/fiber/numa/topology.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        struct node {
            std::uint32_t id;
            std::set< std::uint32_t > logical_cpus;
            std::vector< std::uint32_t > distance;
        };

        bool operator<( node const&, node const&) noexcept;

        std::vector< node > topology();

        void pin_thread( std::uint32_t);
        void pin_thread( std::uint32_t, std::thread::native_handle_type);

        }}}
        #include <boost/fiber/numa/algo/work_stealing.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {
        namespace algo {

        class work_stealing;

        }}}}
[ns_class_heading numa..node]

        #include <boost/fiber/numa/topology.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        struct node {
            std::uint32_t id;
            std::set< std::uint32_t > logical_cpus;
            std::vector< std::uint32_t > distance;
        };

        bool operator<( node const&, node const&) noexcept;

        }}}
[ns_data_member_heading numa..node..id]

        std::uint32_t id;

[variablelist
[[Effects:] [ID of the NUMA-node]]
]

[ns_data_member_heading numa..node..logical_cpus]

        std::set< std::uint32_t > logical_cpus;

[variablelist
[[Effects:] [set of logical cpu IDs belonging to the NUMA-node]]
]
[ns_data_member_heading numa..node..distance]

        std::vector< std::uint32_t > distance;

[variablelist
[[Effects:] [The distance between NUMA-nodes describes the cost of accessing the
remote memory.]]
[[Note:] [A NUMA-node has a distance of `10` to itself; remote NUMA-nodes
have a distance > `10`. The index in the array corresponds to the ID `id`
of the NUMA-node. At the moment only Linux returns the correct distances;
on all other operating systems remote NUMA-nodes get a default value of
`20`.]]
]
[ns_operator_heading numa..node..operator_less..operator<]

        bool operator<( node const& lhs, node const& rhs) noexcept;

[variablelist
[[Returns:] [`true` if `lhs != rhs` is true and the
implementation-defined total order of `node::id` values places `lhs` before
`rhs`, `false` otherwise.]]
[[Throws:] [Nothing.]]
]
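Since `operator<` provides a total order over `node::id`, nodes can be kept in
ordered containers; a minimal sketch:

        #include <set>

        #include <boost/fiber/numa/topology.hpp>

        int main() {
            auto topo = boost::fibers::numa::topology();
            // `operator<` orders nodes by their ID, so they can be stored
            // in an ordered container such as std::set
            std::set< boost::fibers::numa::node > nodes( topo.begin(), topo.end() );
            return 0;
        }
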
[ns_function_heading numa..topology]

        #include <boost/fiber/numa/topology.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        std::vector< node > topology();

        }}}

[variablelist
[[Effects:] [Evaluates the NUMA topology.]]
[[Returns:] [a vector of NUMA-nodes describing the NUMA architecture of the
system (each element represents a NUMA-node).]]
[[Throws:] [`system_error`]]
]
[ns_function_heading numa..pin_thread]

        #include <boost/fiber/numa/pin_thread.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        void pin_thread( std::uint32_t cpu_id);
        void pin_thread( std::uint32_t cpu_id, std::thread::native_handle_type h);

        }}}

[variablelist
[[Effects:] [The first version pins the current thread to the logical cpu with ID `cpu_id`, i.e.
the operating system scheduler will not migrate the thread to another logical cpu.
The second version pins the thread with the native handle `h` to the logical cpu with ID `cpu_id`.]]
[[Throws:] [`system_error`]]
]
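A minimal usage sketch, pinning the calling thread to the first logical cpu of
NUMA-node `0`:

        #include <cstdint>

        #include <boost/fiber/numa/pin_thread.hpp>
        #include <boost/fiber/numa/topology.hpp>

        int main() {
            auto topo = boost::fibers::numa::topology();
            // pick the first logical cpu of NUMA-node `0`
            std::uint32_t cpu_id = * topo[0].logical_cpus.begin();
            // pin the calling thread; the operating system scheduler will
            // no longer migrate it to another logical cpu
            boost::fibers::numa::pin_thread( cpu_id);
            return 0;
        }
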
[ns_class_heading numa..work_stealing]

This class implements __algo__; the thread running this scheduler is pinned to the given
logical cpu. If the local ready-queue runs out of ready fibers, ready fibers are stolen
from other schedulers that run on logical cpus that belong to the same NUMA-node (local
memory access).[br]
If no ready fibers can be stolen from the local NUMA-node, the algorithm selects
schedulers running on other NUMA-nodes (remote memory access).[br]
The victim scheduler (from which a ready fiber is stolen) is selected at random.
        #include <boost/fiber/numa/algo/work_stealing.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {
        namespace algo {

        class work_stealing : public algorithm {
        public:
            work_stealing( std::uint32_t cpu_id,
                           std::uint32_t node_id,
                           std::vector< boost::fibers::numa::node > const& topo,
                           bool suspend = false);

            work_stealing( work_stealing const&) = delete;
            work_stealing( work_stealing &&) = delete;

            work_stealing & operator=( work_stealing const&) = delete;
            work_stealing & operator=( work_stealing &&) = delete;

            virtual void awakened( context *) noexcept;

            virtual context * pick_next() noexcept;

            virtual bool has_ready_fibers() const noexcept;

            virtual void suspend_until( std::chrono::steady_clock::time_point const&) noexcept;

            virtual void notify() noexcept;
        };

        }}}}
[heading Constructor]

        work_stealing( std::uint32_t cpu_id, std::uint32_t node_id,
                       std::vector< boost::fibers::numa::node > const& topo,
                       bool suspend = false);

[variablelist
[[Effects:] [Constructs the work-stealing scheduling algorithm. The thread is pinned to the logical cpu with ID
`cpu_id`. If the local ready-queue runs out of ready fibers, ready fibers are stolen from other schedulers
using `topo` (the NUMA topology of the system).]]
[[Throws:] [`system_error`]]
[[Note:][If `suspend` is set to `true`, the scheduler suspends if no ready fiber could be stolen.
The scheduler will be woken up if a sleeping fiber times out or if it was notified from remote (another thread or
fiber scheduler).]]
]
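For example, a thread could install the scheduler with `suspend` enabled, so that
it blocks instead of busy-waiting while no ready fiber is available. A fragment,
assuming `cpu_id`, `node.id` and `topo` as in the earlier example:

        // let the scheduler block this thread while no ready fiber is
        // available, instead of spinning
        boost::fibers::use_scheduling_algorithm<
            boost::fibers::algo::numa::work_stealing >( cpu_id, node.id, topo, true);
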
[ns_member_heading numa..work_stealing..awakened]

        virtual void awakened( context * f) noexcept;

[variablelist
[[Effects:] [Enqueues fiber `f` onto the shared ready queue.]]
[[Throws:] [Nothing.]]
]
[ns_member_heading numa..work_stealing..pick_next]

        virtual context * pick_next() noexcept;

[variablelist
[[Returns:] [the fiber at the head of the ready queue, or `nullptr` if the
queue is empty.]]
[[Throws:] [Nothing.]]
[[Note:] [Placing ready fibers onto the tail of the shared queue, and returning them
from the head of that queue, shares the thread between ready fibers in
round-robin fashion.]]
]
[ns_member_heading numa..work_stealing..has_ready_fibers]

        virtual bool has_ready_fibers() const noexcept;

[variablelist
[[Returns:] [`true` if scheduler has fibers ready to run.]]
[[Throws:] [Nothing.]]
]
[ns_member_heading numa..work_stealing..suspend_until]

        virtual void suspend_until( std::chrono::steady_clock::time_point const& abs_time) noexcept;

[variablelist
[[Effects:] [Informs `work_stealing` that no ready fiber will be available until
time-point `abs_time`. This implementation blocks in
[@http://en.cppreference.com/w/cpp/thread/condition_variable/wait_until
`std::condition_variable::wait_until()`].]]
[[Throws:] [Nothing.]]
]
[ns_member_heading numa..work_stealing..notify]

        virtual void notify() noexcept;

[variablelist
[[Effects:] [Wake up a pending call to [member_link
work_stealing..suspend_until], some fibers might be ready. This implementation
wakes `suspend_until()` via
[@http://en.cppreference.com/w/cpp/thread/condition_variable/notify_all
`std::condition_variable::notify_all()`].]]
[[Throws:] [Nothing.]]
]
[endsect]