377 lines
16 KiB
Plaintext
377 lines
16 KiB
Plaintext
[template perf[name value] [value]]
|
|
[template para[text] '''<para>'''[text]'''</para>''']
|
|
|
|
[mathpart perf Performance]
|
|
|
|
[section:perf_over2 Performance Overview]
|
|
[performance_overview]
|
|
[endsect]
|
|
|
|
[section:interp Interpreting these Results]
|
|
|
|
In all of the following tables, the best performing
|
|
result in each row, is assigned a relative value of "1" and shown
|
|
in bold, so a score of "2" means ['"twice as slow as the best
|
|
performing result".] Actual timings in nano-seconds per function call
|
|
are also shown in parenthesis. To make the results easier to read, they
|
|
are color-coded as follows: the best result and everything within 20% of
|
|
it is green, anything that's more than twice as slow as the best result is red,
|
|
and results in between are blue.
|
|
|
|
Result were obtained on a system
|
|
with an Intel core i7 4710MQ with 16Gb RAM and running
|
|
either Windows 8.1 or Xubuntu Linux.
|
|
|
|
[caution As usual with performance results these should be taken with a large pinch
|
|
of salt: relative performance is known to shift quite a bit depending
|
|
upon the architecture of the particular test system used. Further
|
|
more, our performance results were obtained using our own test data:
|
|
these test values are designed to provide good coverage of our code and test
|
|
all the appropriate corner cases. They do not necessarily represent
|
|
"typical" usage: whatever that may be!
|
|
]
|
|
|
|
[endsect] [/section:interp Interpreting these Results]
|
|
|
|
[section:getting_best Getting the Best Performance from this Library: Compiler and Compiler Options]
|
|
|
|
By far the most important thing you can do when using this library
|
|
is turn on your compiler's optimisation options. As the following
|
|
table shows the penalty for using the library in debug mode can be
|
|
quite large. In addition switching to 64-bit code has a small but noticeable
|
|
improvement in performance, as does switching to a different compiler
|
|
(Intel C++ 15 in this example).
|
|
|
|
[table_Compiler_Option_Comparison_on_Windows_x64]
|
|
|
|
[endsect] [/section:getting_best Getting the Best Performance from this Library: Compiler and Compiler Options]
|
|
|
|
[section:tradoffs Trading Accuracy for Performance]
|
|
|
|
There are a number of [link policy Policies] that can be used to trade accuracy for performance:
|
|
|
|
* Internal promotion: by default functions with `float` arguments are evaluated at `double` precision
|
|
internally to ensure full precision in the result. Similarly `double` precision functions are
|
|
evaluated at `long double` precision internally by default. Changing these defaults can have a significant
|
|
speed advantage at the expense of accuracy, note also that evaluating using `float` internally may result in
|
|
numerical instability for some of the more complex algorithms, we suggest you use this option with care.
|
|
* Target accuracy: just because you choose to evaluate at `double` precision doesn't mean you necessarily want
|
|
to target full 16-digit accuracy, if you wish you can change the default (full machine precision) to whatever
|
|
is "good enough" for your particular use case.
|
|
|
|
For example, suppose you want to evaluate `double` precision functions at `double` precision internally, you
|
|
can change the global default by passing `-DBOOST_MATH_PROMOTE_DOUBLE_POLICY=false` on the command line, or
|
|
at the point of call via something like this:
|
|
|
|
double val = boost::math::erf(my_argument, boost::math::policies::make_policy(boost::math::policies::promote_double<false>()));
|
|
|
|
However, an easier option might be:
|
|
|
|
#include <boost/math/special_functions.hpp> // Or any individual special function header
|
|
|
|
namespace math{
|
|
|
|
namespace precise{
|
|
//
|
|
// Define a Policy for accurate evaluation - this is the same as the default, unless
|
|
// someone has changed the global defaults.
|
|
//
|
|
typedef boost::math::policies::policy<> accurate_policy;
|
|
//
|
|
// Invoke BOOST_MATH_DECLARE_SPECIAL_FUNCTIONS to declare
|
|
// functions that use the above policy. Note no trailing
|
|
// ";" required on the macro call:
|
|
//
|
|
BOOST_MATH_DECLARE_SPECIAL_FUNCTIONS(accurate_policy)
|
|
|
|
|
|
}
|
|
|
|
namespace fast{
|
|
//
|
|
// Define a Policy for fast evaluation:
|
|
//
|
|
using namespace boost::math::polcies;
|
|
typedef policy<promote_double<false> > fast_policy;
|
|
//
|
|
// Invoke BOOST_MATH_DECLARE_SPECIAL_FUNCTIONS:
|
|
//
|
|
BOOST_MATH_DECLARE_SPECIAL_FUNCTIONS(fast_policy)
|
|
|
|
}
|
|
|
|
}
|
|
|
|
And now one can call:
|
|
|
|
math::accurate::tgamma(x);
|
|
|
|
For the "accurate" version of tgamma, and:
|
|
|
|
math::fast::tgamma(x);
|
|
|
|
For the faster version.
|
|
|
|
Had we wished to change the target precision (to 9 decimal places) as well as the evaluation type used, we might have done:
|
|
|
|
namespace math{
|
|
namespace fast{
|
|
//
|
|
// Define a Policy for fast evaluation:
|
|
//
|
|
using namespace boost::math::polcies;
|
|
typedef policy<promote_double<false>, digits10<9> > fast_policy;
|
|
//
|
|
// Invoke BOOST_MATH_DECLARE_SPECIAL_FUNCTIONS:
|
|
//
|
|
BOOST_MATH_DECLARE_SPECIAL_FUNCTIONS(fast_policy)
|
|
|
|
}
|
|
}
|
|
|
|
One can do a similar thing with the distribution classes:
|
|
|
|
#include <boost/math/distributions.hpp> // or any individual distribution header
|
|
|
|
namespace math{ namespace fast{
|
|
//
|
|
// Define a policy for fastest possible evaluation:
|
|
//
|
|
using namespace boost::math::polcies;
|
|
typedef policy<promote_float<false> > fast_float_policy;
|
|
//
|
|
// Invoke BOOST_MATH_DECLARE_DISTRIBUTIONS
|
|
//
|
|
BOOST_MATH_DECLARE_DISTRIBUTIONS(float, fast_float_policy)
|
|
|
|
}} // namespaces
|
|
|
|
//
|
|
// And use:
|
|
//
|
|
float p_val = cdf(math::fast::normal(1.0f, 3.0f), 0.25f);
|
|
|
|
Here's how these options change the relative performance of the distributions on Linux:
|
|
|
|
[table_Distribution_performance_comparison_for_different_performance_options_with_GNU_C_version_5_1_0_on_linux]
|
|
|
|
[endsect] [/section:tradoffs Trading Accuracy for Performance]
|
|
|
|
[section:multiprecision Cost of High-Precision Non-built-in Floating-point]
|
|
|
|
Using user-defined floating-point like __multiprecision has a very high run-time cost.
|
|
|
|
To give some flavour of this:
|
|
|
|
[table:linpack_time Linpack Benchmark
|
|
[[floating-point type] [speed Mflops]]
|
|
[[double] [2727]]
|
|
[[__float128] [35]]
|
|
[[multiprecision::float128] [35]]
|
|
[[multiprecision::cpp_bin_float_quad] [6]]
|
|
]
|
|
|
|
[endsect] [/section:multiprecision Cost of High-Precision Non-built-in Floating-point]
|
|
|
|
|
|
[section:tuning Performance Tuning Macros]
|
|
|
|
There are a small number of performance tuning options
|
|
that are determined by configuration macros. These should be set
|
|
in boost/math/tools/user.hpp; or else reported to the Boost-development
|
|
mailing list so that the appropriate option for a given compiler and
|
|
OS platform can be set automatically in our configuration setup.
|
|
|
|
[table
|
|
[[Macro][Meaning]]
|
|
[[BOOST_MATH_POLY_METHOD]
|
|
[Determines how polynomials and most rational functions
|
|
are evaluated. Define to one
|
|
of the values 0, 1, 2 or 3: see below for the meaning of these values.]]
|
|
[[BOOST_MATH_RATIONAL_METHOD]
|
|
[Determines how symmetrical rational functions are evaluated: mostly
|
|
this only effects how the Lanczos approximation is evaluated, and how
|
|
the `evaluate_rational` function behaves. Define to one
|
|
of the values 0, 1, 2 or 3: see below for the meaning of these values.
|
|
]]
|
|
[[BOOST_MATH_MAX_POLY_ORDER]
|
|
[The maximum order of polynomial or rational function that will
|
|
be evaluated by a method other than 0 (a simple "for" loop).
|
|
]]
|
|
[[BOOST_MATH_INT_TABLE_TYPE(RT, IT)]
|
|
[Many of the coefficients to the polynomials and rational functions
|
|
used by this library are integers. Normally these are stored as tables
|
|
as integers, but if mixed integer / floating point arithmetic is much
|
|
slower than regular floating point arithmetic then they can be stored
|
|
as tables of floating point values instead. If mixed arithmetic is slow
|
|
then add:
|
|
|
|
#define BOOST_MATH_INT_TABLE_TYPE(RT, IT) RT
|
|
|
|
to boost/math/tools/user.hpp, otherwise the default of:
|
|
|
|
#define BOOST_MATH_INT_TABLE_TYPE(RT, IT) IT
|
|
|
|
Set in boost/math/config.hpp is fine, and may well result in smaller
|
|
code.
|
|
]]
|
|
]
|
|
|
|
The values to which `BOOST_MATH_POLY_METHOD` and `BOOST_MATH_RATIONAL_METHOD`
|
|
may be set are as follows:
|
|
|
|
[table
|
|
[[Value][Effect]]
|
|
[[0][The polynomial or rational function is evaluated using Horner's
|
|
method, and a simple for-loop.
|
|
|
|
Note that if the order of the polynomial
|
|
or rational function is a runtime parameter, or the order is
|
|
greater than the value of `BOOST_MATH_MAX_POLY_ORDER`, then
|
|
this method is always used, irrespective of the value
|
|
of `BOOST_MATH_POLY_METHOD` or `BOOST_MATH_RATIONAL_METHOD`.]]
|
|
[[1][The polynomial or rational function is evaluated without
|
|
the use of a loop, and using Horner's method. This only occurs
|
|
if the order of the polynomial is known at compile time and is less
|
|
than or equal to `BOOST_MATH_MAX_POLY_ORDER`. ]]
|
|
[[2][The polynomial or rational function is evaluated without
|
|
the use of a loop, and using a second order Horner's method.
|
|
In theory this permits two operations to occur in parallel
|
|
for polynomials, and four in parallel for rational functions.
|
|
This only occurs
|
|
if the order of the polynomial is known at compile time and is less
|
|
than or equal to `BOOST_MATH_MAX_POLY_ORDER`.]]
|
|
[[3][The polynomial or rational function is evaluated without
|
|
the use of a loop, and using a second order Horner's method.
|
|
In theory this permits two operations to occur in parallel
|
|
for polynomials, and four in parallel for rational functions.
|
|
This differs from method "2" in that the code is carefully ordered
|
|
to make the parallelisation more obvious to the compiler: rather than
|
|
relying on the compiler's optimiser to spot the parallelisation
|
|
opportunities.
|
|
This only occurs
|
|
if the order of the polynomial is known at compile time and is less
|
|
than or equal to `BOOST_MATH_MAX_POLY_ORDER`.]]
|
|
]
|
|
|
|
The performance test suite generates a report for your particular compiler showing which method is likely to work best,
|
|
the following tables show the results for MSVC-14.0 and GCC-5.1.0 (Linux). There's not much to choose between
|
|
the various methods, but generally loop-unrolled methods perform better. Interestingly, ordering the code
|
|
to try and "second guess" possible optimizations seems not to be such a good idea (method 3 below).
|
|
|
|
[table_Polynomial_Method_Comparison_with_Microsoft_Visual_C_version_14_0_on_Windows_x64]
|
|
|
|
[table_Rational_Method_Comparison_with_Microsoft_Visual_C_version_14_0_on_Windows_x64]
|
|
|
|
[table_Polynomial_Method_Comparison_with_GNU_C_version_5_1_0_on_linux]
|
|
|
|
[table_Rational_Method_Comparison_with_GNU_C_version_5_1_0_on_linux]
|
|
|
|
[endsect] [/section:tuning Performance Tuning Macros]
|
|
|
|
[section:comp_compilers Comparing Different Compilers]
|
|
|
|
By running our performance test suite multiple times, we can compare the effect of different compilers: as
|
|
might be expected, the differences are generally small compared to say disabling internal use of `long double`.
|
|
However, there are still gains to be main, particularly from some of the commercial offerings:
|
|
|
|
[table_Compiler_Comparison_on_Windows_x64]
|
|
|
|
[table_Compiler_Comparison_on_linux]
|
|
|
|
[endsect] [/section:comp_compilers Comparing Different Compilers]
|
|
|
|
[section:comparisons Comparisons to Other Open Source Libraries]
|
|
|
|
We've run our performance tests both for our own code, and against other
|
|
open source implementations of the same functions. The results are
|
|
presented below to give you a rough idea of how they all compare.
|
|
In order to give a more-or-less level playing field our test data
|
|
was screened against all the libraries being tested, and any
|
|
unsupported domains removed, likewise for any test cases that gave large errors
|
|
or unexpected non-finite values.
|
|
|
|
[caution
|
|
You should exercise extreme caution when interpreting
|
|
these results, relative performance may vary by platform, by compiler options settings,
|
|
the tests use data that gives good code coverage of /our/ code, but which may skew the
|
|
results towards the corner cases. Finally, remember that different
|
|
libraries make different choices with regard to performance verses
|
|
numerical stability.
|
|
]
|
|
|
|
The first results compare standard library functions to Boost equivalents with MSVC-14.0:
|
|
|
|
[table_Library_Comparison_with_Microsoft_Visual_C_version_14_0_on_Windows_x64]
|
|
|
|
On Linux with GCC, we can also compare to the TR1 functions, and to GSL and RMath:
|
|
|
|
[table_Library_Comparison_with_GNU_C_version_5_3_0_on_linux]
|
|
|
|
And finally we can compare the statistical distributions to GSL, RMath and DCDFLIB:
|
|
|
|
[table_Distribution_performance_comparison_with_GNU_C_version_5_3_0_on_linux]
|
|
|
|
[endsect] [/section:comparisons Comparisons to Other Open Source Libraries]
|
|
|
|
[section:perf_test_app The Performance Test Applications]
|
|
|
|
Under ['boost-path]\/libs\/math\/reporting\/performance you will find
|
|
some reasonable comprehensive performance test applications for this library.
|
|
|
|
In order to generate the tables you will have seen in this documentation (or others
|
|
for your specific compiler) you need to invoke `bjam` in this directory, using a C++11
|
|
capable compiler. Note that
|
|
results extend/overwrite whatever is already present in
|
|
['boost-path]\/libs\/math\/reporting\/performance\/doc\/performance_tables.qbk,
|
|
you may want to delete this file before you begin so as to make a fresh start for
|
|
your particular system.
|
|
|
|
The programs produce results in Boost's Quickbook format which is not terribly
|
|
human readable. If you configure your user-config.jam to be able to build Docbook
|
|
documentation, then you will also get a full summary of all the data in HTML format
|
|
in ['boost-path]\/libs\/math\/reporting\/performance\/html\/index.html. Assuming
|
|
you're on a 'nix-like platform the procedure to do this is to first install the
|
|
`xsltproc`, `Docbook DTD`, and `Bookbook XSL` packages. Then:
|
|
|
|
* Copy ['boost-path]\/tools\/build\/example\/user-config.jam to your home directory.
|
|
* Add `using xsltproc ;` to the end of the file (note the space surrounding each token, including the final ";", this is important!)
|
|
This assumes that `xsltproc` is in your path.
|
|
* Add `using boostbook : path-to-xsl-stylesheets : path-to-dtd ;` to the end of the file. The `path-to-dtd` should point
|
|
to version 4.2.x of the Docbook DTD, while `path-to-xsl-stylesheets` should point to the folder containing the latest XSLT stylesheets.
|
|
Both paths should use all forward slashes even on Windows.
|
|
|
|
At this point you should be able to run the tests and generate the HTML summary, if GSL, RMath or libstdc++ are
|
|
present in the compilers path they will be automatically tested. For DCDFLIB you will need to place the C
|
|
source in ['boost-path]\/libs\/math\/reporting\/performance\/third_party\/dcdflib.
|
|
|
|
If you want to compare multiple compilers, or multiple options for one compiler, then you will
|
|
need to invoke `bjam` multiple times, once for each compiler. Note that in order to test
|
|
multiple configurations of the same compiler, each has to be given a unique name in the test
|
|
program, otherwise they all edit the same table cells. Suppose you want to test GCC with
|
|
and without the -ffast-math option, in this case bjam would be invoked first as:
|
|
|
|
bjam toolset=gcc -a cxxflags=-std=gnu++11
|
|
|
|
Which would run the tests using default optimization options (-O3), we can then run again
|
|
using -ffast-math:
|
|
|
|
bjam toolset=gcc -a cxxflags='-std=gnu++11 -ffast-math' define=COMPILER_NAME='"GCC with -ffast-math"'
|
|
|
|
In the command line above, the -a flag forces a full rebuild, and the preprocessor define COMPILER_NAME needs to be set
|
|
to a string literal describing the compiler configuration, hence the double quotes - one for the command line, one for the
|
|
compiler.
|
|
|
|
[endsect] [/section:perf_test_app The Performance Test Applications]
|
|
|
|
[endmathpart] [/mathpart perf Performance]
|
|
|
|
[/
|
|
Copyright 2006 John Maddock and Paul A. Bristow.
|
|
Distributed under the Boost Software License, Version 1.0.
|
|
(See accompanying file LICENSE_1_0.txt or copy at
|
|
http://www.boost.org/LICENSE_1_0.txt).
|
|
]
|
|
|
|
|