- [/
- Copyright Oliver Kowalke 2017.
- Distributed under the Boost Software License, Version 1.0.
- (See accompanying file LICENSE_1_0.txt or copy at
- http://www.boost.org/LICENSE_1_0.txt
- ]
- [#tuning]
- [section:tuning Tuning]
- [heading Disable synchronization]
- With [link cross_thread_sync `BOOST_FIBERS_NO_ATOMICS`] defined at the
- compiler's command line, synchronization between fibers (in different
- threads) is disabled. This is acceptable if the application is single threaded
- and/or fibers are not synchronized between threads.
- [heading Memory allocation]
- The memory allocation algorithm is significant for performance in a
- multithreaded environment, especially for __boost_fiber__, where fiber stacks
- are allocated on the heap. The default user-level memory allocator (UMA) of
- glibc is ptmalloc2, but it can be replaced by another UMA that fits the
- concrete workload better. For instance, Google's
- [@http://goog-perftools.sourceforge.net/doc/tcmalloc.html TCmalloc] achieves
- better performance in the ['skynet] microbenchmark than glibc's default
- memory allocator.
- [heading Scheduling strategies]
- The fibers in a thread are coordinated by a fiber manager. Fibers trade control
- cooperatively, rather than preemptively.
- Depending on the work-load, several strategies of scheduling the fibers are
- possible [footnote 1024cores.net:
- [@http://www.1024cores.net/home/scalable-architecture/task-scheduling-strategies Task Scheduling Strategies]]
- that can be implemented on behalf of __algo__.
- * work-stealing: ready fibers are held in a local queue; when the
- fiber-scheduler's local queue runs out of ready fibers, it randomly
- selects another fiber-scheduler and tries to steal a ready fiber from the
- victim (implemented in __work_stealing__ and __numa_work_stealing__)
- * work-requesting: ready fibers are held in a local queue; when the
- fiber-scheduler's local queue runs out of ready fibers, it randomly
- selects another fiber-scheduler and requests a ready fiber; the victim
- fiber-scheduler sends a ready fiber back
- * work-sharing: ready fibers are held in a global queue; fiber-schedulers
- concurrently push and pop ready fibers to/from the global queue
- (implemented in __shared_work__)
- * work-distribution: fibers that become ready are proactively distributed to
- idle fiber-schedulers or fiber-schedulers with low load
- * work-balancing: a dedicated (helper) fiber-scheduler periodically collects
- information about all fiber-schedulers running in other threads and
- re-distributes ready fibers among them
- [heading TTAS locks]
- Boost.Fiber internally uses spinlocks to protect critical regions if fibers
- running on different threads interact.
- Spinlocks are implemented as TTAS (test-test-and-set) locks, i.e. the spinlock
- tests the lock before calling an atomic exchange. This strategy helps to reduce
- the cache line invalidations triggered by acquiring/releasing the lock.
- [heading Spin-wait loop]
- A lock is considered under contention if a thread repeatedly fails to acquire
- the lock because some other thread was faster.
- Waiting for a short time lets other threads finish before trying to enter the
- critical section again. While busy waiting on the lock, relaxing the CPU (via
- pause/yield mnemonic) gives the CPU a hint that the code is in a spin-wait loop.
- * prevents expensive pipeline flushes (speculatively executed load and compare
- instructions are not pushed to pipeline)
- * another hardware thread (simultaneous multithreading) can get a time slice
- * it does delay a few CPU cycles, but this is necessary to prevent starvation
- It is obvious that this strategy is useless on single-core systems because the
- lock can only be released if the thread gives up its time slice in order to let
- other threads run. The macro BOOST_FIBERS_SPIN_SINGLE_CORE replaces the CPU
- hints (pause/yield mnemonic) by informing the operating system
- (via `std::this_thread::yield()`) that the thread gives up its time slice so
- that the operating system switches to another thread.
- [heading Exponential back-off]
- The macro BOOST_FIBERS_RETRY_THRESHOLD determines how many times the CPU
- iterates in the spin-wait loop before yielding the thread or blocking in
- futex-wait.
- The spinlock tracks how many times the thread failed to acquire the lock.
- The higher the contention, the longer the thread should back-off.
- A ["Binary Exponential Backoff] algorithm together with a randomized contention
- window is utilized for this purpose.
- BOOST_FIBERS_CONTENTION_WINDOW_THRESHOLD determines the upper limit of the
- contention window (expressed as the exponent for basis of two).
- [heading Speculative execution (hardware transactional memory)]
- Boost.Fiber uses spinlocks to protect critical regions that can be used
- together with transactional memory (see section [link speculation Speculative
- execution]).
- [note TSX is enabled if property `htm=tsx` is specified at the b2 command line
- and `BOOST_USE_TSX` is applied to the compiler.]
- [note A TSX-transaction will be aborted if the floating point state is modified
- inside a critical region. As a consequence, floating point operations, e.g.
- store/load of floating point related registers during a fiber (context) switch,
- are disabled.]
- [heading NUMA systems]
- Modern multi-socket systems are usually designed as [link numa NUMA systems].
- A suitable fiber scheduler like __numa_work_stealing__ reduces
- remote memory access (latency).
- [heading Parameters]
- [table Parameters that might be defined at the compiler's command line
- [
- [Parameter]
- [Default value]
- [Effect on Boost.Fiber]
- ]
- [
- [BOOST_FIBERS_NO_ATOMICS]
- [-]
- [no multithreading support, all atomics removed, no synchronization
- between fibers running in different threads]
- ]
- [
- [BOOST_FIBERS_SPINLOCK_STD_MUTEX]
- [-]
- [`std::mutex` used inside spinlock]
- ]
- [
- [BOOST_FIBERS_SPINLOCK_TTAS]
- [+]
- [spinlock with test-test-and-set on shared variable]
- ]
- [
- [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE]
- [-]
- [spinlock with test-test-and-set on shared variable, adaptive retries
- while busy waiting]
- ]
- [
- [BOOST_FIBERS_SPINLOCK_TTAS_FUTEX]
- [-]
- [spinlock with test-test-and-set on shared variable, suspend on
- futex after a certain number of retries]
- ]
- [
- [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE_FUTEX]
- [-]
- [spinlock with test-test-and-set on shared variable, adaptive retries
- while busy waiting, suspend on futex after a certain number of retries]
- ]
- [
- [BOOST_FIBERS_SPINLOCK_TTAS + BOOST_USE_TSX]
- [-]
- [spinlock with test-test-and-set and speculative execution (Intel TSX
- required)]
- ]
- [
- [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE + BOOST_USE_TSX]
- [-]
- [spinlock with test-test-and-set on shared variable, adaptive retries
- while busy waiting and speculative execution (Intel TSX required)]
- ]
- [
- [BOOST_FIBERS_SPINLOCK_TTAS_FUTEX + BOOST_USE_TSX]
- [-]
- [spinlock with test-test-and-set on shared variable, suspend on
- futex after a certain number of retries and speculative execution
- (Intel TSX required)]
- ]
- [
- [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE_FUTEX + BOOST_USE_TSX]
- [-]
- [spinlock with test-test-and-set on shared variable, adaptive retries
- while busy waiting, suspend on futex after a certain number of retries
- and speculative execution (Intel TSX required)]
- ]
- [
- [BOOST_FIBERS_SPIN_SINGLE_CORE]
- [-]
- [on single core machines with multiple threads, yield thread
- (`std::this_thread::yield()`) after collisions]
- ]
- [
- [BOOST_FIBERS_RETRY_THRESHOLD]
- [64]
- [max number of retries while busy spinning, then the fallback is used]
- ]
- [
- [BOOST_FIBERS_CONTENTION_WINDOW_THRESHOLD]
- [16]
- [max size of collisions window, expressed as exponent for the basis of
- two]
- ]
- [
- [BOOST_FIBERS_SPIN_BEFORE_SLEEP0]
- [32]
- [max number of retries that relax the processor before the thread
- sleeps for 0s]
- ]
- [
- [BOOST_FIBERS_SPIN_BEFORE_YIELD]
- [64]
- [max number of retries where the thread sleeps for 0s before yielding the
- thread (`std::this_thread::yield()`)]
- ]
- ]
- [endsect]