90 lines
3.3 KiB
Plaintext
90 lines
3.3 KiB
Plaintext
[/
|
|
Copyright Oliver Kowalke 2016.
|
|
Distributed under the Boost Software License, Version 1.0.
|
|
(See accompanying file LICENSE_1_0.txt or copy at
|
|
http://www.boost.org/LICENSE_1_0.txt
|
|
]
|
|
|
|
[section:performance Performance]
|
|
|
|
Performance measurements were taken using `std::chrono::highresolution_clock`,
|
|
with overhead corrections.
|
|
The code was compiled with gcc-6.3.1, using build options:
|
|
variant = release, optimization = speed.
|
|
Tests were executed on dual Intel XEON E5 2620v4 2.2GHz, 16C/32T, 64GB RAM,
|
|
running Linux (x86_64).
|
|
|
|
Measurements headed 1C/1T were run in a single-threaded process.
|
|
|
|
The [@https://github.com/atemerev/skynet microbenchmark ['syknet]] from
|
|
Alexander Temerev was ported and used for performance measurements.
|
|
At the root the test spawns 10 threads-of-execution (ToE), e.g.
|
|
actor/goroutine/fiber etc.. Each spawned ToE spawns additional 10 ToEs ...
|
|
until *1,000,000* ToEs are created. ToEs return back their ordinal numbers
|
|
(0 ... 999,999), which are summed on the previous level and sent back upstream,
|
|
until reaching the root. The test was run 10-20 times, producing a range of
|
|
values for each measurement.
|
|
|
|
[table time per actor/erlang process/goroutine (other languages) (average over 1,000,000)
|
|
[
|
|
[Haskell | stack-1.4.0/ghc-8.0.1]
|
|
[Go | go1.8.1]
|
|
[Erlang | erts-8.3]
|
|
]
|
|
[
|
|
[0.05 \u00b5s - 0.06 \u00b5s]
|
|
[0.42 \u00b5s - 0.49 \u00b5s]
|
|
[0.63 \u00b5s - 0.73 \u00b5s]
|
|
]
|
|
]
|
|
|
|
Pthreads are created with a stack size of 8kB while `std::thread`'s use the
|
|
system default (1MB - 2MB). The microbenchmark could *not* be *run* with 1,000,000
|
|
threads because of *resource exhaustion* (pthread and std::thread).
|
|
Instead the test runs only at *10,000* threads.
|
|
|
|
[table time per thread (average over 10,000 - unable to spawn 1,000,000 threads)
|
|
[
|
|
[pthread]
|
|
[`std::thread`]
|
|
[`std::async`]
|
|
]
|
|
[
|
|
[54 \u00b5s - 73 \u00b5s]
|
|
[52 \u00b5s - 73 \u00b5s]
|
|
[106 \u00b5s - 122 \u00b5s]
|
|
]
|
|
]
|
|
|
|
The test utilizes 16 cores with Symmetric MultiThreading enabled (32 logical
|
|
CPUs). The fiber stacks are allocated by __fixedsize_stack__.
|
|
|
|
As the benchmark shows, the memory allocation algorithm is significant for
|
|
performance in a multithreaded environment. The tests use glibc[s] memory
|
|
allocation algorithm (based on ptmalloc2) as well as Google[s]
|
|
[@http://goog-perftools.sourceforge.net/doc/tcmalloc.html TCmalloc] (via
|
|
linkflags="-ltcmalloc").[footnote
|
|
Tais B. Ferreira, Rivalino Matias, Autran Macedo, Lucio B. Araujo
|
|
["An Experimental Study on Memory Allocators in Multicore and
|
|
Multithreaded Applications], PDCAT [,]11 Proceedings of the 2011 12th
|
|
International Conference on Parallel and Distributed Computing, Applications
|
|
and Technologies, pages 92-98]
|
|
|
|
In the __work_stealing__ scheduling algorithm, each thread has its own local
|
|
queue. Fibers that are ready to run are pushed to and popped from the local
|
|
queue. If the queue runs out of ready fibers, fibers are stolen from the local
|
|
queues of other participating threads.
|
|
|
|
[table time per fiber (average over 1.000.000)
|
|
[
|
|
[fiber (16C/32T, work stealing, tcmalloc)]
|
|
[fiber (1C/1T, round robin, tcmalloc)]
|
|
]
|
|
[
|
|
[0.05 \u00b5s - 0.09 \u00b5s]
|
|
[1.69 \u00b5s - 1.79 \u00b5s]
|
|
]
|
|
]
|
|
|
|
[endsect]
|