[/
  Copyright Oliver Kowalke 2017.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt
]

[#numa]
[section:numa NUMA]
Modern micro-processors contain integrated memory controllers that are connected
via channels to the memory. Memory access is organized in one of two ways:[br]
Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA).
In contrast to UMA, which provides a centralized pool of memory (and thus does
not scale after a certain number of processors), a NUMA architecture divides the
memory into local and remote memory relative to the micro-processor.[br]
Local memory is directly attached to the processor's integrated memory controller.
Memory connected to the memory controller of another micro-processor (multi-socket
systems) is considered remote memory. If a memory controller accesses remote memory,
it has to traverse the interconnect[footnote On x86 the interconnection is implemented
by Intel's Quick Path Interconnect (QPI) and AMD's HyperTransport.] and
connect to the remote memory controller.[br]
Accessing remote memory therefore adds latency overhead compared to local memory access.
Because of the different memory locations, a NUMA-system experiences ['non-uniform]
memory access time.[br]
As a consequence, the best performance is achieved by keeping the memory access
local.
[$../../../../libs/fiber/doc/NUMA.png [align center]]

[heading NUMA support in Boost.Fiber]

Because only a subset of the NUMA functionality is exposed by the various operating
systems, Boost.Fiber provides only a minimalistic NUMA API.

[important In order to enable NUMA support, the b2 property `numa=on` must be specified
and the additional library `libboost_fiber_numa.so` must be linked.]

[important MinGW with the pthread implementation is not supported on Windows.]
[table Supported functionality/operating systems
    [
        []
        [AIX]
        [FreeBSD]
        [HP/UX]
        [Linux]
        [Solaris]
        [Windows]
    ]
    [
        [pin thread]
        [+]
        [+]
        [+]
        [+]
        [+]
        [+]
    ]
    [
        [logical CPUs/NUMA nodes]
        [+]
        [+]
        [+]
        [+]
        [+]
        [+[footnote Windows organizes logical cpus in groups of 64; Boost.Fiber maps
        {group ID, cpu ID} to a scalar equivalent to the Linux cpu ID
        (64 * group ID + cpu ID).]]
    ]
    [
        [NUMA node distance]
        [-]
        [-]
        [-]
        [+]
        [-]
        [-]
    ]
    [
        [tested on]
        [AIX 7.2]
        [FreeBSD 11]
        [-]
        [Arch Linux (4.10.13)]
        [OpenIndiana HIPSTER]
        [Windows 10]
    ]
]
In order to keep the memory access as local as possible, the NUMA topology must be
evaluated.

        std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
        for ( auto n : topo) {
            std::cout << "node: " << n.id << " | ";
            std::cout << "cpus: ";
            for ( auto cpu_id : n.logical_cpus) {
                std::cout << cpu_id << " ";
            }
            std::cout << "| distance: ";
            for ( auto d : n.distance) {
                std::cout << d << " ";
            }
            std::cout << std::endl;
        }
        std::cout << "done" << std::endl;
output:

        node: 0 | cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23 | distance: 10 21
        node: 1 | cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31 | distance: 21 10
        done
The example shows that the system consists of two NUMA-nodes; 16 logical cpus belong
to each NUMA-node. The distance measures the cost of accessing the memory of another
NUMA-node. A NUMA-node always has a distance of `10` to itself (the lowest possible
value).[br]
The position in the array corresponds to the NUMA-node ID.
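
For instance, the topology data can be used to map a logical cpu back to the
NUMA-node it belongs to. The helper `node_of_cpu` below is not part of the library;
it is only a sketch of how the returned data might be used.

        #include <boost/fiber/numa/topology.hpp>
        #include <cstdint>
        #include <stdexcept>

        // hypothetical helper, not part of Boost.Fiber: find the NUMA-node
        // that contains a given logical cpu
        boost::fibers::numa::node node_of_cpu( std::uint32_t cpu_id) {
            for ( auto & n : boost::fibers::numa::topology() ) {
                // `logical_cpus` is a std::set; count() > 0 tests membership
                if ( n.logical_cpus.count( cpu_id) > 0) {
                    return n;
                }
            }
            throw std::runtime_error{ "cpu ID not found in NUMA topology" };
        }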
Some workloads benefit from pinning threads to logical cpus. For instance, the
scheduling algorithm __numa_work_stealing__ pins the thread that runs the fiber
scheduler to a logical cpu. This prevents the operating system scheduler from moving
the thread to another logical cpu that might run other fiber scheduler(s), or from
migrating the thread to a logical cpu that is part of another NUMA-node.
        void thread( std::uint32_t cpu_id, std::uint32_t node_id, std::vector< boost::fibers::numa::node > const& topo) {
            // thread registers itself at work-stealing scheduler
            boost::fibers::use_scheduling_algorithm< boost::fibers::algo::numa::work_stealing >( cpu_id, node_id, topo);
            ...
        }

        // evaluate the NUMA topology
        std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
        // start-thread runs on NUMA-node `0`
        auto node = topo[0];
        // start-thread is pinned to the first cpu ID in the list of logical cpus of NUMA-node `0`
        auto start_cpu_id = * node.logical_cpus.begin();
        // start worker-threads first
        std::vector< std::thread > threads;
        for ( auto & node : topo) {
            for ( std::uint32_t cpu_id : node.logical_cpus) {
                // exclude start-thread
                if ( start_cpu_id != cpu_id) {
                    // spawn thread
                    threads.emplace_back( thread, cpu_id, node.id, std::cref( topo) );
                }
            }
        }
        // start-thread registers itself on work-stealing scheduler
        boost::fibers::use_scheduling_algorithm< boost::fibers::algo::numa::work_stealing >( start_cpu_id, node.id, topo);
        ...
The example evaluates the NUMA topology with `boost::fibers::numa::topology()`
and spawns a thread for each logical cpu. Each spawned thread installs the
NUMA-aware work-stealing scheduler. The scheduler pins the thread to the
logical cpu that was specified at construction.[br]
If the local queue of one thread runs out of ready fibers, the thread tries to
steal a ready fiber from another thread running on a logical cpu that belongs to
the same NUMA-node (local memory access). If no fiber could be stolen, the
thread tries to steal fibers from logical cpus that are part of other NUMA-nodes
(remote memory access).
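
Once every thread runs the NUMA-aware scheduler, fibers are launched as usual and
the work-stealing algorithm distributes them over the pinned threads. A minimal
sketch (the lambda body is a placeholder):

        // launch fibers from the start-thread; idle threads steal them
        std::vector< boost::fibers::fiber > fibers;
        for ( int i = 0; i < 100; ++i) {
            fibers.emplace_back( []{ /* do some work */ });
        }
        // wait for all fibers to complete
        for ( auto & f : fibers) {
            f.join();
        }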
[heading Synopsis]

        #include <boost/fiber/numa/pin_thread.hpp>
        #include <boost/fiber/numa/topology.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        struct node {
            std::uint32_t id;
            std::set< std::uint32_t > logical_cpus;
            std::vector< std::uint32_t > distance;
        };

        bool operator<( node const&, node const&) noexcept;

        std::vector< node > topology();

        void pin_thread( std::uint32_t);
        void pin_thread( std::uint32_t, std::thread::native_handle_type);

        }}}

        #include <boost/fiber/numa/algo/work_stealing.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {
        namespace algo {

        class work_stealing;

        }}}}
[ns_class_heading numa..node]

        #include <boost/fiber/numa/topology.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        struct node {
            std::uint32_t id;
            std::set< std::uint32_t > logical_cpus;
            std::vector< std::uint32_t > distance;
        };

        bool operator<( node const&, node const&) noexcept;

        }}}
[ns_data_member_heading numa..node..id]

        std::uint32_t id;

[variablelist
[[Effects:] [ID of the NUMA-node]]
]

[ns_data_member_heading numa..node..logical_cpus]

        std::set< std::uint32_t > logical_cpus;

[variablelist
[[Effects:] [set of logical cpu IDs belonging to the NUMA-node]]
]
[ns_data_member_heading numa..node..distance]

        std::vector< std::uint32_t > distance;

[variablelist
[[Effects:] [The distance between NUMA-nodes describes the cost of accessing the
remote memory.]]
[[Note:] [A NUMA-node has a distance of `10` to itself; remote NUMA-nodes
have a distance > `10`. The index in the array corresponds to the ID `id`
of the NUMA-node. At the moment only Linux returns the correct distances;
on all other operating systems remote NUMA-nodes get a default value of
`20`.]]
]
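
As a sketch, the distances on the two-node system shown above could be read like
this (assuming the topology contains at least two nodes):

        std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
        // index into `distance` by the ID of the remote NUMA-node
        std::uint32_t local  = topo[0].distance[0]; // 10, a node's distance to itself
        std::uint32_t remote = topo[0].distance[1]; // e.g. 21 on the system above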
[ns_operator_heading numa..node..operator_less..operator<]

        bool operator<( node const& lhs, node const& rhs) noexcept;

[variablelist
[[Returns:] [`true` if `lhs != rhs` is true and the
implementation-defined total order of `node::id` values places `lhs` before
`rhs`, `false` otherwise.]]
[[Throws:] [Nothing.]]
]
[ns_function_heading numa..topology]

        #include <boost/fiber/numa/topology.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        std::vector< node > topology();

        }}}

[variablelist
[[Effects:] [Evaluates the NUMA topology.]]
[[Returns:] [a vector of NUMA-nodes describing the NUMA architecture of the
system (each element represents a NUMA-node).]]
[[Throws:] [`system_error`]]
]
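
A minimal usage sketch; because `topology()` may throw `system_error`, the call is
wrapped in a try/catch (shown here with `std::system_error`):

        #include <boost/fiber/numa/topology.hpp>
        #include <iostream>
        #include <system_error>

        int main() {
            try {
                std::vector< boost::fibers::numa::node > topo = boost::fibers::numa::topology();
                std::cout << "found " << topo.size() << " NUMA-node(s)" << std::endl;
            } catch ( std::system_error const& e) {
                std::cerr << "evaluating the NUMA topology failed: " << e.what() << std::endl;
            }
            return 0;
        }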
[ns_function_heading numa..pin_thread]

        #include <boost/fiber/numa/pin_thread.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {

        void pin_thread( std::uint32_t cpu_id);
        void pin_thread( std::uint32_t cpu_id, std::thread::native_handle_type h);

        }}}

[variablelist
[[Effects:] [The first variant pins the calling thread to the logical cpu with ID
`cpu_id`, i.e. the operating system scheduler will not migrate the thread to another
logical cpu. The second variant pins the thread with the native handle `h` to the
logical cpu with ID `cpu_id`.]]
[[Throws:] [`system_error`]]
]
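
A minimal usage sketch: the first variant pins the calling thread, the second pins
another thread through its native handle (the cpu IDs `0` and `1` are assumptions;
they must exist on the system):

        #include <boost/fiber/numa/pin_thread.hpp>
        #include <thread>

        // pin the calling thread to logical cpu `0`
        boost::fibers::numa::pin_thread( 0);

        // pin another thread to logical cpu `1` via its native handle
        std::thread t{ []{ /* ... */ } };
        boost::fibers::numa::pin_thread( 1, t.native_handle() );
        t.join();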
[ns_class_heading numa..work_stealing]

This class implements __algo__; the thread running this scheduler is pinned to the given
logical cpu. If the local ready-queue runs out of ready fibers, ready fibers are stolen
from other schedulers that run on logical cpus belonging to the same NUMA-node (local
memory access).[br]
If no ready fibers can be stolen from the local NUMA-node, the algorithm selects
schedulers running on other NUMA-nodes (remote memory access).[br]
The victim scheduler (from which a ready fiber is stolen) is selected at random.

        #include <boost/fiber/numa/algo/work_stealing.hpp>

        namespace boost {
        namespace fibers {
        namespace numa {
        namespace algo {

        class work_stealing : public algorithm {
        public:
            work_stealing( std::uint32_t cpu_id,
                           std::uint32_t node_id,
                           std::vector< boost::fibers::numa::node > const& topo,
                           bool suspend = false);

            work_stealing( work_stealing const&) = delete;
            work_stealing( work_stealing &&) = delete;

            work_stealing & operator=( work_stealing const&) = delete;
            work_stealing & operator=( work_stealing &&) = delete;

            virtual void awakened( context *) noexcept;

            virtual context * pick_next() noexcept;

            virtual bool has_ready_fibers() const noexcept;

            virtual void suspend_until( std::chrono::steady_clock::time_point const&) noexcept;

            virtual void notify() noexcept;
        };

        }}}}
[heading Constructor]

        work_stealing( std::uint32_t cpu_id, std::uint32_t node_id,
                       std::vector< boost::fibers::numa::node > const& topo,
                       bool suspend = false);

[variablelist
[[Effects:] [Constructs the work-stealing scheduling algorithm. The thread is pinned
to the logical cpu with ID `cpu_id`. If the local ready-queue runs out of ready
fibers, ready fibers are stolen from other schedulers using `topo` (which represents
the NUMA topology of the system).]]
[[Throws:] [`system_error`]]
[[Note:][If `suspend` is set to `true`, the scheduler suspends if no ready fiber
could be stolen. The scheduler will be woken up if a sleeping fiber times out or it
is notified from remote (another thread or fiber scheduler).]]
]
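
A minimal sketch of installing the algorithm with `suspend` enabled, so the thread
blocks instead of spinning while no ready fiber is available (`cpu_id`, `node_id`
and `topo` as in the example above):

        boost::fibers::use_scheduling_algorithm<
                boost::fibers::algo::numa::work_stealing >( cpu_id, node_id, topo, true);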
[ns_member_heading numa..work_stealing..awakened]

        virtual void awakened( context * f) noexcept;

[variablelist
[[Effects:] [Enqueues fiber `f` onto the shared ready queue.]]
[[Throws:] [Nothing.]]
]
[ns_member_heading numa..work_stealing..pick_next]

        virtual context * pick_next() noexcept;

[variablelist
[[Returns:] [the fiber at the head of the ready queue, or `nullptr` if the
queue is empty.]]
[[Throws:] [Nothing.]]
[[Note:] [Placing ready fibers onto the tail of the shared queue, and returning them
from the head of that queue, shares the thread between ready fibers in
round-robin fashion.]]
]
[ns_member_heading numa..work_stealing..has_ready_fibers]

        virtual bool has_ready_fibers() const noexcept;

[variablelist
[[Returns:] [`true` if the scheduler has fibers ready to run.]]
[[Throws:] [Nothing.]]
]
[ns_member_heading numa..work_stealing..suspend_until]

        virtual void suspend_until( std::chrono::steady_clock::time_point const& abs_time) noexcept;

[variablelist
[[Effects:] [Informs `work_stealing` that no ready fiber will be available until
time-point `abs_time`. This implementation blocks in
[@http://en.cppreference.com/w/cpp/thread/condition_variable/wait_until
`std::condition_variable::wait_until()`].]]
[[Throws:] [Nothing.]]
]
[ns_member_heading numa..work_stealing..notify]

        virtual void notify() noexcept;

[variablelist
[[Effects:] [Wakes up a pending call to [member_link
work_stealing..suspend_until], as some fibers might be ready. This implementation
wakes `suspend_until()` via
[@http://en.cppreference.com/w/cpp/thread/condition_variable/notify_all
`std::condition_variable::notify_all()`].]]
[[Throws:] [Nothing.]]
]

[endsect]