1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889 |
- [/
- Copyright Oliver Kowalke 2016.
- Distributed under the Boost Software License, Version 1.0.
- (See accompanying file LICENSE_1_0.txt or copy at
- http://www.boost.org/LICENSE_1_0.txt
- ]
- [section:performance Performance]
- Performance measurements were taken using `std::chrono::highresolution_clock`,
- with overhead corrections.
- The code was compiled with gcc-6.3.1, using build options:
- variant = release, optimization = speed.
- Tests were executed on dual Intel XEON E5 2620v4 2.2GHz, 16C/32T, 64GB RAM,
- running Linux (x86_64).
- Measurements headed 1C/1T were run in a single-threaded process.
- The [@https://github.com/atemerev/skynet microbenchmark ['syknet]] from
- Alexander Temerev was ported and used for performance measurements.
- At the root the test spawns 10 threads-of-execution (ToE), e.g.
- actor/goroutine/fiber etc.. Each spawned ToE spawns additional 10 ToEs ...
- until *1,000,000* ToEs are created. ToEs return back their ordinal numbers
- (0 ... 999,999), which are summed on the previous level and sent back upstream,
- until reaching the root. The test was run 10-20 times, producing a range of
- values for each measurement.
- [table time per actor/erlang process/goroutine (other languages) (average over 1,000,000)
- [
- [Haskell | stack-1.4.0/ghc-8.0.1]
- [Go | go1.8.1]
- [Erlang | erts-8.3]
- ]
- [
- [0.05 \u00b5s - 0.06 \u00b5s]
- [0.42 \u00b5s - 0.49 \u00b5s]
- [0.63 \u00b5s - 0.73 \u00b5s]
- ]
- ]
- Pthreads are created with a stack size of 8kB while `std::thread`'s use the
- system default (1MB - 2MB). The microbenchmark could *not* be *run* with 1,000,000
- threads because of *resource exhaustion* (pthread and std::thread).
- Instead the test runs only at *10,000* threads.
- [table time per thread (average over 10,000 - unable to spawn 1,000,000 threads)
- [
- [pthread]
- [`std::thread`]
- [`std::async`]
- ]
- [
- [54 \u00b5s - 73 \u00b5s]
- [52 \u00b5s - 73 \u00b5s]
- [106 \u00b5s - 122 \u00b5s]
- ]
- ]
- The test utilizes 16 cores with Symmetric MultiThreading enabled (32 logical
- CPUs). The fiber stacks are allocated by __fixedsize_stack__.
- As the benchmark shows, the memory allocation algorithm is significant for
- performance in a multithreaded environment. The tests use glibc[s] memory
- allocation algorithm (based on ptmalloc2) as well as Google[s]
- [@http://goog-perftools.sourceforge.net/doc/tcmalloc.html TCmalloc] (via
- linkflags="-ltcmalloc").[footnote
- Tais B. Ferreira, Rivalino Matias, Autran Macedo, Lucio B. Araujo
- ["An Experimental Study on Memory Allocators in Multicore and
- Multithreaded Applications], PDCAT [,]11 Proceedings of the 2011 12th
- International Conference on Parallel and Distributed Computing, Applications
- and Technologies, pages 92-98]
- In the __work_stealing__ scheduling algorithm, each thread has its own local
- queue. Fibers that are ready to run are pushed to and popped from the local
- queue. If the queue runs out of ready fibers, fibers are stolen from the local
- queues of other participating threads.
- [table time per fiber (average over 1.000.000)
- [
- [fiber (16C/32T, work stealing, tcmalloc)]
- [fiber (1C/1T, round robin, tcmalloc)]
- ]
- [
- [0.05 \u00b5s - 0.09 \u00b5s]
- [1.69 \u00b5s - 1.79 \u00b5s]
- ]
- ]
- [endsect]
|