
[template perf[name value] [value]]
[template para[text] '''<para>'''[text]'''</para>''']

[mathpart perf Performance]

[section:perf_over2 Performance Overview]
[performance_overview]
[endsect]

[section:interp Interpreting these Results]

In all of the following tables, the best performing result in each row
is assigned a relative value of "1" and shown in bold, so a score of "2"
means ['"twice as slow as the best performing result"]. Actual timings in
nanoseconds per function call are also shown in parentheses. To make the
results easier to read, they are color-coded as follows: the best result,
and everything within 20% of it, is green; anything more than twice as
slow as the best result is red; and results in between are blue.

Results were obtained on a system with an Intel Core i7 4710MQ with 16GB
of RAM, running either Windows 8.1 or Xubuntu Linux.

[caution As usual with performance results, these should be taken with a large pinch
of salt: relative performance is known to shift quite a bit depending
upon the architecture of the particular test system used. Furthermore,
our performance results were obtained using our own test data:
these test values are designed to provide good coverage of our code and test
all the appropriate corner cases. They do not necessarily represent
"typical" usage, whatever that may be!
]
[endsect] [/section:interp Interpreting these Results]

[section:getting_best Getting the Best Performance from this Library: Compiler and Compiler Options]

By far the most important thing you can do when using this library
is to turn on your compiler's optimisation options. As the following
table shows, the penalty for using the library in debug mode can be
quite large. In addition, switching to 64-bit code gives a small but noticeable
improvement in performance, as does switching to a different compiler
(Intel C++ 15 in this example).

[table_Compiler_Option_Comparison_on_Windows_x64]

[endsect] [/section:getting_best Getting the Best Performance from this Library: Compiler and Compiler Options]
[section:tradoffs Trading Accuracy for Performance]

There are a number of [link policy Policies] that can be used to trade accuracy for performance:

* Internal promotion: by default, functions with `float` arguments are evaluated at `double` precision
internally to ensure full precision in the result. Similarly, `double` precision functions are
evaluated at `long double` precision internally by default. Changing these defaults can give a significant
speed advantage at the expense of accuracy. Note also that evaluating at `float` precision internally may result in
numerical instability for some of the more complex algorithms, so we suggest you use this option with care.
* Target accuracy: just because you choose to evaluate at `double` precision doesn't mean you necessarily want
to target full 16-digit accuracy. If you wish, you can change the default (full machine precision) to whatever
is "good enough" for your particular use case.

For example, suppose you want to evaluate `double` precision functions at `double` precision internally. You
can change the global default by passing `-DBOOST_MATH_PROMOTE_DOUBLE_POLICY=false` on the command line, or
at the point of call via something like this:

   double val = boost::math::erf(my_argument, boost::math::policies::make_policy(boost::math::policies::promote_double<false>()));
However, an easier option might be:

   #include <boost/math/special_functions.hpp> // Or any individual special function header

   namespace math{
   namespace precise{
   //
   // Define a Policy for accurate evaluation - this is the same as the default, unless
   // someone has changed the global defaults.
   //
   typedef boost::math::policies::policy<> accurate_policy;
   //
   // Invoke BOOST_MATH_DECLARE_SPECIAL_FUNCTIONS to declare
   // functions that use the above policy.  Note no trailing
   // ";" required on the macro call:
   //
   BOOST_MATH_DECLARE_SPECIAL_FUNCTIONS(accurate_policy)

   }

   namespace fast{
   //
   // Define a Policy for fast evaluation:
   //
   using namespace boost::math::policies;
   typedef policy<promote_double<false> > fast_policy;
   //
   // Invoke BOOST_MATH_DECLARE_SPECIAL_FUNCTIONS:
   //
   BOOST_MATH_DECLARE_SPECIAL_FUNCTIONS(fast_policy)

   }
   }

And now one can call:

   math::precise::tgamma(x);

for the "accurate" version of tgamma, and:

   math::fast::tgamma(x);

for the faster version.
Had we wished to change the target precision (to 9 decimal places) as well as the evaluation type used, we might have done:

   namespace math{
   namespace fast{
   //
   // Define a Policy for fast evaluation:
   //
   using namespace boost::math::policies;
   typedef policy<promote_double<false>, digits10<9> > fast_policy;
   //
   // Invoke BOOST_MATH_DECLARE_SPECIAL_FUNCTIONS:
   //
   BOOST_MATH_DECLARE_SPECIAL_FUNCTIONS(fast_policy)

   }
   }
One can do a similar thing with the distribution classes:

   #include <boost/math/distributions.hpp> // or any individual distribution header

   namespace math{ namespace fast{
   //
   // Define a policy for fastest possible evaluation:
   //
   using namespace boost::math::policies;
   typedef policy<promote_float<false> > fast_float_policy;
   //
   // Invoke BOOST_MATH_DECLARE_DISTRIBUTIONS
   //
   BOOST_MATH_DECLARE_DISTRIBUTIONS(float, fast_float_policy)

   }} // namespaces

   //
   // And use:
   //
   float p_val = cdf(math::fast::normal(1.0f, 3.0f), 0.25f);

Here's how these options change the relative performance of the distributions on Linux:

[table_Distribution_performance_comparison_for_different_performance_options_with_GNU_C_version_5_1_0_on_linux]

[endsect] [/section:tradoffs Trading Accuracy for Performance]
[section:multiprecision Cost of High-Precision Non-built-in Floating-point]

Using user-defined floating-point types like __multiprecision has a very high run-time cost.
To give some flavour of this:

[table:linpack_time Linpack Benchmark
[[floating-point type] [speed Mflops]]
[[double] [2727]]
[[__float128] [35]]
[[multiprecision::float128] [35]]
[[multiprecision::cpp_bin_float_quad] [6]]
]

[endsect] [/section:multiprecision Cost of High-Precision Non-built-in Floating-point]
[section:tuning Performance Tuning Macros]

There are a small number of performance tuning options
that are determined by configuration macros. These should be set
in boost/math/tools/user.hpp, or else reported to the Boost development
mailing list so that the appropriate option for a given compiler and
OS platform can be set automatically in our configuration setup.

[table
[[Macro][Meaning]]
[[BOOST_MATH_POLY_METHOD]
[Determines how polynomials and most rational functions
are evaluated. Define to one
of the values 0, 1, 2 or 3: see below for the meaning of these values.]]
[[BOOST_MATH_RATIONAL_METHOD]
[Determines how symmetrical rational functions are evaluated: mostly
this only affects how the Lanczos approximation is evaluated, and how
the `evaluate_rational` function behaves. Define to one
of the values 0, 1, 2 or 3: see below for the meaning of these values.
]]
[[BOOST_MATH_MAX_POLY_ORDER]
[The maximum order of polynomial or rational function that will
be evaluated by a method other than 0 (a simple "for" loop).
]]
[[BOOST_MATH_INT_TABLE_TYPE(RT, IT)]
[Many of the coefficients of the polynomials and rational functions
used by this library are integers. Normally these are stored in tables
as integers, but if mixed integer / floating-point arithmetic is much
slower than regular floating-point arithmetic, then they can be stored
as tables of floating-point values instead. If mixed arithmetic is slow
then add:

   #define BOOST_MATH_INT_TABLE_TYPE(RT, IT) RT

to boost/math/tools/user.hpp; otherwise the default of:

   #define BOOST_MATH_INT_TABLE_TYPE(RT, IT) IT

set in boost/math/config.hpp is fine, and may well result in smaller
code.
]]
]
The values to which `BOOST_MATH_POLY_METHOD` and `BOOST_MATH_RATIONAL_METHOD`
may be set are as follows:

[table
[[Value][Effect]]
[[0][The polynomial or rational function is evaluated using Horner's
method and a simple for-loop.
Note that if the order of the polynomial
or rational function is a runtime parameter, or the order is
greater than the value of `BOOST_MATH_MAX_POLY_ORDER`, then
this method is always used, irrespective of the value
of `BOOST_MATH_POLY_METHOD` or `BOOST_MATH_RATIONAL_METHOD`.]]
[[1][The polynomial or rational function is evaluated without
the use of a loop, using Horner's method. This only occurs
if the order of the polynomial is known at compile time and is less
than or equal to `BOOST_MATH_MAX_POLY_ORDER`.]]
[[2][The polynomial or rational function is evaluated without
the use of a loop, using a second-order Horner's method.
In theory this permits two operations to occur in parallel
for polynomials, and four in parallel for rational functions.
This only occurs
if the order of the polynomial is known at compile time and is less
than or equal to `BOOST_MATH_MAX_POLY_ORDER`.]]
[[3][The polynomial or rational function is evaluated without
the use of a loop, using a second-order Horner's method.
This differs from method "2" in that the code is carefully ordered
to make the parallelisation more obvious to the compiler, rather than
relying on the compiler's optimiser to spot the parallelisation
opportunities.
This only occurs
if the order of the polynomial is known at compile time and is less
than or equal to `BOOST_MATH_MAX_POLY_ORDER`.]]
]
The performance test suite generates a report for your particular compiler showing which method is likely to work best;
the following tables show the results for MSVC-14.0 and GCC-5.1.0 (Linux). There's not much to choose between
the various methods, but generally the loop-unrolled methods perform better. Interestingly, ordering the code
to try to "second-guess" possible optimisations seems not to be such a good idea (method 3 below).

[table_Polynomial_Method_Comparison_with_Microsoft_Visual_C_version_14_0_on_Windows_x64]
[table_Rational_Method_Comparison_with_Microsoft_Visual_C_version_14_0_on_Windows_x64]
[table_Polynomial_Method_Comparison_with_GNU_C_version_5_1_0_on_linux]
[table_Rational_Method_Comparison_with_GNU_C_version_5_1_0_on_linux]

[endsect] [/section:tuning Performance Tuning Macros]
[section:comp_compilers Comparing Different Compilers]

By running our performance test suite multiple times, we can compare the effect of different compilers: as
might be expected, the differences are generally small compared to, say, disabling the internal use of `long double`.
However, there are still gains to be made, particularly from some of the commercial offerings:

[table_Compiler_Comparison_on_Windows_x64]
[table_Compiler_Comparison_on_linux]

[endsect] [/section:comp_compilers Comparing Different Compilers]
[section:comparisons Comparisons to Other Open Source Libraries]

We've run our performance tests both for our own code and against other
open source implementations of the same functions. The results are
presented below to give you a rough idea of how they all compare.
In order to give a more-or-less level playing field, our test data
was screened against all the libraries being tested, and any
unsupported domains were removed; likewise for any test cases that gave large errors
or unexpected non-finite values.

[caution
You should exercise extreme caution when interpreting
these results: relative performance may vary by platform and by compiler option settings,
and the tests use data that gives good code coverage of /our/ code, but which may skew the
results towards the corner cases. Finally, remember that different
libraries make different choices with regard to performance versus
numerical stability.
]

The first results compare standard library functions to Boost equivalents with MSVC-14.0:

[table_Library_Comparison_with_Microsoft_Visual_C_version_14_0_on_Windows_x64]

On Linux with GCC, we can also compare to the TR1 functions, and to GSL and RMath:

[table_Library_Comparison_with_GNU_C_version_5_3_0_on_linux]

And finally we can compare the statistical distributions to GSL, RMath and DCDFLIB:

[table_Distribution_performance_comparison_with_GNU_C_version_5_3_0_on_linux]

[endsect] [/section:comparisons Comparisons to Other Open Source Libraries]
[section:perf_test_app The Performance Test Applications]

Under ['boost-path]\/libs\/math\/reporting\/performance you will find
some reasonably comprehensive performance test applications for this library.
In order to generate the tables you have seen in this documentation (or others
for your specific compiler), you need to invoke `bjam` in this directory using a
C++11-capable compiler. Note that the
results extend/overwrite whatever is already present in
['boost-path]\/libs\/math\/reporting\/performance\/doc\/performance_tables.qbk;
you may want to delete this file before you begin so as to make a fresh start for
your particular system.

The programs produce results in Boost's Quickbook format, which is not terribly
human-readable. If you configure your user-config.jam to be able to build Docbook
documentation, then you will also get a full summary of all the data in HTML format
in ['boost-path]\/libs\/math\/reporting\/performance\/html\/index.html. Assuming
you're on a 'nix-like platform, the procedure to do this is to first install the
`xsltproc`, `Docbook DTD`, and `Docbook XSL` packages. Then:

* Copy ['boost-path]\/tools\/build\/example\/user-config.jam to your home directory.
* Add `using xsltproc ;` to the end of the file (note the spaces surrounding each token,
including the final ";": this is important!). This assumes that `xsltproc` is in your path.
* Add `using boostbook : path-to-xsl-stylesheets : path-to-dtd ;` to the end of the file. The `path-to-dtd` should point
to version 4.2.x of the Docbook DTD, while `path-to-xsl-stylesheets` should point to the folder containing the latest XSLT stylesheets.
Both paths should use all forward slashes, even on Windows.
At this point you should be able to run the tests and generate the HTML summary; if GSL, RMath or libstdc++ are
present in the compiler's path they will be automatically tested. For DCDFLIB you will need to place the C
source in ['boost-path]\/libs\/math\/reporting\/performance\/third_party\/dcdflib.

If you want to compare multiple compilers, or multiple options for one compiler, then you will
need to invoke `bjam` multiple times, once for each compiler. Note that in order to test
multiple configurations of the same compiler, each has to be given a unique name in the test
program, otherwise they all edit the same table cells. Suppose you want to test GCC with
and without the -ffast-math option; in this case bjam would be invoked first as:

   bjam toolset=gcc -a cxxflags=-std=gnu++11

which runs the tests using the default optimization options (-O3). We can then run again
using -ffast-math:

   bjam toolset=gcc -a cxxflags='-std=gnu++11 -ffast-math' define=COMPILER_NAME='"GCC with -ffast-math"'

In the command line above, the -a flag forces a full rebuild, and the preprocessor define COMPILER_NAME needs to be set
to a string literal describing the compiler configuration, hence the double quotes: one set for the command line and one for the
compiler.

[endsect] [/section:perf_test_app The Performance Test Applications]

[endmathpart] [/mathpart perf Performance]
[/
Copyright 2006 John Maddock and Paul A. Bristow.
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at
http://www.boost.org/LICENSE_1_0.txt).
]