<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="GENERATOR" content="Microsoft FrontPage 6.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<link rel="stylesheet" type="text/css" href="../../../boost.css">
<title>The Boost Statechart Library - Performance</title>
</head>
<body link="#0000FF" vlink="#800080">
<table border="0" cellpadding="7" cellspacing="0" width="100%" summary=
"header">
<tr>
<td valign="top" width="300">
<h3><a href="../../../index.htm"><img alt="C++ Boost" src=
"../../../boost.png" border="0" width="277" height="86"></a></h3>
</td>
<td valign="top">
<h1 align="center">The Boost Statechart Library</h1>
<h2 align="center">Performance</h2>
</td>
</tr>
</table>
<hr>
<dl class="index">
<dt><a href="#SpeedVersusScalabilityTradeoffs">Speed versus scalability
tradeoffs</a></dt>
<dt><a href="#MemoryManagementCustomization">Memory management
customization</a></dt>
<dt><a href="#RttiCustomization">RTTI customization</a></dt>
<dt><a href="#ResourceUsage">Resource usage</a></dt>
</dl>
<h2><a name="SpeedVersusScalabilityTradeoffs" id=
"SpeedVersusScalabilityTradeoffs">Speed versus scalability
tradeoffs</a></h2>
<p>Quite a bit of effort has gone into making the library fast for small,
simple machines <b>and</b> scalable at the same time (this applies only to
<code>state_machine&lt;&gt;</code>; there is still some room for optimizing
<code>fifo_scheduler&lt;&gt;</code>, especially for multi-threaded builds).
While I believe it should perform reasonably in most applications, the
scalability does not come for free. Small, carefully handcrafted state
machines will thus easily outperform equivalent Boost.Statechart machines.
To get a picture of how big the gap is, I implemented a simple benchmark in
the BitMachine example. The Handcrafted example is a handcrafted variant of
the 1-bit-BitMachine implementing the same benchmark.</p>
  47. <p>I tried to create a fair but somewhat unrealistic <b>worst-case</b>
  48. scenario:</p>
  49. <ul>
  50. <li>For both machines exactly one object of the only event is allocated
  51. before starting the test. This same object is then sent to the machines
  52. over and over</li>
  53. <li>The Handcrafted machine employs GOF-visitor double dispatch. The
  54. states are preallocated so that event dispatch &amp; transition amounts
  55. to nothing more than two virtual calls and one pointer assignment</li>
  56. </ul>
  57. <p>The Benchmarks - running on a 3.2GHz Intel Pentium 4 - produced the
  58. following dispatch and transition times per event:</p>
  59. <ul>
  60. <li>Handcrafted:
  61. <ul>
  62. <li>MSVC7.1: 10ns</li>
  63. <li>GCC3.4.2: 20ns</li>
  64. <li>Intel9.0: 20ns</li>
  65. </ul>
  66. </li>
  67. <li>1-bit-BitMachine with customized memory management:
  68. <ul>
  69. <li>MSVC7.1: 150ns</li>
  70. <li>GCC3.4.2: 320ns</li>
  71. <li>Intel9.0: 170ns</li>
  72. </ul>
  73. </li>
  74. </ul>
<p>Although this is a big difference, I still think it will not be
noticeable in most&nbsp;real-world applications. No matter whether an
application uses handcrafted or Boost.Statechart machines, it will...</p>
<ul>
<li>almost never run into a situation where a state machine is swamped
with as many events as in the benchmarks. A state machine will almost
always spend a good deal of time waiting for events (which typically come
from a human operator, from machinery or from electronic devices over
often comparatively slow I/O channels). Parsers are just about the only
application of FSMs where this is not the case. However, parser FSMs are
usually not directly specified on the state machine level but on a higher
one that is better suited for the task. Examples of such higher levels
are: Boost.Regex, Boost.Spirit, XML Schemas, etc. Moreover, the nature of
parsers allows for a number of optimizations that are not possible in a
general-purpose FSM framework.<br>
Bottom line: While it is possible to implement a parser with this
library, it is almost never advisable to do so because other approaches
lead to better performing and more expressive code</li>
<li>often run state machines in their own threads. This adds considerable
locking and thread-switching overhead. Performance tests with the
PingPong example, where two asynchronous state machines exchange events,
gave the following times to process one event and perform the resulting
in-state reaction (using the library with
<code>boost::fast_pool_allocator&lt;&gt;</code>):
<ul>
<li>Single-threaded (no locking and waiting): 840ns / 840ns</li>
<li>Multi-threaded with one thread (the scheduler uses mutex locking
but never has to wait for events): 6500ns / 4800ns</li>
<li>Multi-threaded with two threads (both schedulers use mutex
locking and exactly one always waits for an event): 14000ns /
7000ns</li>
</ul>
<p>As mentioned above, there definitely is some room to improve the
timings for the asynchronous machines. Moreover, these are very crude
benchmarks, designed to show the overhead of locking and thread context
switching. The overhead in a real-world application will typically be
smaller and other operating systems can certainly do better in this
area. However, I strongly believe that on most platforms the threading
overhead is usually larger than the time that Boost.Statechart spends
for event dispatch and transition. Handcrafted machines will inevitably
have the same overhead, making raw single-threaded dispatch and
transition speed much less important</p>
</li>
<li>almost always allocate events with <code>new</code> and destroy them
after consumption. This will add a few cycles, even if event memory
management is customized</li>
<li>often use state machines that employ orthogonal states and other
advanced features. This forces the handcrafted machines to use more
elaborate and thus more time-consuming book-keeping</li>
</ul>
<p>Therefore, in real-world applications event dispatch and transition do
not normally constitute a bottleneck, and the relative gap between
handcrafted and Boost.Statechart machines also becomes much smaller than in
the worst-case scenario.</p>
<h3>Detailed performance data</h3>
<p>In an effort to identify the main performance bottleneck, the example
program "Performance" has been written. It measures the time that is spent
to process one event in different BitMachine variants. In contrast to the
BitMachine example, which uses only transitions, Performance uses a varying
number of in-state reactions together with transitions. The only difference
between in-state reactions and transitions is that the former neither enter
nor exit any states. Apart from that, the same amount of code needs to be
run to dispatch an event and execute the resulting action.</p>
<p>The following diagrams show the average time the library spends to
process one event, depending on the percentage of in-state reactions
employed. 0% means that all triggered reactions are transitions; 100% means
that all triggered reactions are in-state reactions. I draw the following
conclusions from these measurements:</p>
<ul>
<li>The fairly linear course of the curves suggests that the measurements
with a 100% in-state reaction ratio are accurate and not merely a product
of optimizations in the compiler. Such optimizations might have been
possible due to the fact that in the 100% case it is known at
compile-time that the current state will never change</li>
<li>The data points with 100% in-state reaction ratio and speed-optimized
RTTI show that modern compilers seem to inline the complex-looking
dispatch code so aggressively that dispatch is reduced to little more
than what it actually is: one virtual function call followed by a linear
search for a suitable reaction. For instance, in the case of the 1-bit
BitMachine, Intel9.0 seems to produce dispatch code that is as
efficient as the two virtual function calls in the Handcrafted
machine</li>
<li>On all compilers and in all variants, the time spent in event dispatch
is dwarfed by the time spent to exit the current state and enter the
target state. It is worth noting that BitMachine is a flat and
non-orthogonal state machine, representing a close-to-worst case.
Real-world machines will often exit and enter multiple states during a
transition, which further dwarfs pure dispatch time. This makes the
implementation of constant-time dispatch (as requested by a few people
during formal review) an undertaking with little merit. Instead, the
future optimization effort will concentrate on state entry and
state exit</li>
<li>Intel9.0 seems to have problems optimizing/inlining code as soon as
the amount of code grows beyond a certain threshold. Unlike with the other
two compilers, I needed to compile the tests for the 1, 2, 3 and 4-bit
BitMachine into separate executables to get good performance. Even then,
performance was disproportionately bad for the 4-bit BitMachine, and it
was much worse when I compiled all 4 tests into the same executable. This
surely looks like a bug in the compiler</li>
</ul>
<h4>Out of the box</h4>
<p>The library is used as is, without any optimizations/modifications.</p>
<p><img alt="PerformanceNormal1" src="PerformanceNormal1.gif" border="0"
width="371" height="284"><img alt="PerformanceNormal2" src=
"PerformanceNormal2.gif" border="0" width="371" height="284"><img alt=
"PerformanceNormal3" src="PerformanceNormal3.gif" border="0" width="371"
height="284"><img alt="PerformanceNormal4" src="PerformanceNormal4.gif"
border="0" width="371" height="284"></p>
<h4>Native RTTI</h4>
<p>The library is used with <code><a href=
"configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a></code>
defined.</p>
<p><img alt="PerformanceNative1" src="PerformanceNative1.gif" border="0"
width="371" height="284"><img alt="PerformanceNative2" src=
"PerformanceNative2.gif" border="0" width="371" height="284"><img alt=
"PerformanceNative3" src="PerformanceNative3.gif" border="0" width="371"
height="284"><img alt="PerformanceNative4" src="PerformanceNative4.gif"
border="0" width="371" height="284"></p>
<h4>Customized memory management</h4>
<p>The library is used with customized memory management
(<code>boost::fast_pool_allocator</code>).</p>
<p><img alt="PerformanceCustom1" src="PerformanceCustom1.gif" border="0"
width="371" height="284"><img alt="PerformanceCustom2" src=
"PerformanceCustom2.gif" border="0" width="371" height="284"><img alt=
"PerformanceCustom3" src="PerformanceCustom3.gif" border="0" width="371"
height="284"><img alt="PerformanceCustom4" src="PerformanceCustom4.gif"
border="0" width="371" height="284"></p>
<h3><a name="DoubleDispatch" id="DoubleDispatch">Double dispatch</a></h3>
<p>At the heart of every state machine lies an implementation of double
dispatch. This is due to the fact that the incoming event <b>and</b> the
active state define exactly which <a href=
"definitions.html#Reaction">reaction</a> the state machine will produce.
For each event dispatch, one virtual call is followed by a linear search
for the appropriate reaction, using one RTTI comparison per reaction. The
following alternatives were considered but rejected:</p>
<ul>
<li><a href=
"http://www.objectmentor.com/resources/articles/acv.pdf">Acyclic
visitor</a>: This double-dispatch variant satisfies all scalability
requirements but performs badly due to costly inheritance tree
cross-casts. Moreover, a state must store one v-pointer for <b>each</b>
reaction, which slows down construction and makes memory management
customization inefficient. In addition, C++ RTTI must inevitably be
turned on, with negative effects on executable size. Boost.Statechart
originally employed acyclic visitor and was about 4 times slower than it
is now (MSVC7.1 on Intel Pentium M). The dispatch speed might be better
on other platforms but the other negative effects will remain</li>
<li><a href="http://en.wikipedia.org/wiki/Visitor_pattern">GOF
Visitor</a>: The GOF Visitor pattern inevitably makes the whole machine
depend upon all events. That is, whenever a new event is added there is
no way around recompiling the whole state machine. This is contrary to
the scalability requirements</li>
<li>Single two-dimensional array of function pointers: To satisfy
requirement 6, it should be possible to spread a single state machine
over several translation units. This however means that the dispatch
table must be filled at runtime and the different translation units must
somehow make themselves "known", so that their part of the state machine
can be added to the table. There simply is no way to do this
automatically <b>and</b> portably. The only portable way that a state
machine distributed over several translation units could employ
table-based double dispatch relies on the user: the programmer(s) would
somehow have to <b>manually</b> tie together the various pieces of the
state machine. Not only does this scale badly, but it is also quite
error-prone</li>
</ul>
<h2><a name="MemoryManagementCustomization" id=
"MemoryManagementCustomization">Memory management customization</a></h2>
<p>Out of the box, everything (event objects, state objects, internal data,
etc.) is allocated through <code>std::allocator&lt; void &gt;</code> (the
default for the Allocator template parameter). This should be satisfactory
for applications meeting the following prerequisites:</p>
<ul>
<li>There are no deterministic reaction time (hard real-time)
requirements</li>
<li>The application will never run long enough for heap fragmentation to
become a problem. This is of course an issue for all long-running
programs, not only the ones employing this library. However, it should be
noted that fragmentation problems could show up earlier than with
traditional FSM frameworks</li>
</ul>
<p>Should an application not meet these prerequisites, Boost.Statechart's
memory management can be customized as follows:</p>
<ul>
<li>By passing a model of the standard Allocator concept to the class
templates that support a corresponding parameter
(<code>event&lt;&gt;</code>, <code>state_machine&lt;&gt;</code>,
<code>asynchronous_state_machine&lt;&gt;</code>,
<code>fifo_scheduler&lt;&gt;</code>, <code>fifo_worker&lt;&gt;</code>).
This redirects all allocations to the passed custom allocator and should
satisfy the needs of just about any project</li>
<li>Additionally, it is possible to <b>separately</b> customize
<b>state</b> memory management by overloading <code>operator new()</code>
and <code>operator delete()</code> for all state classes, but this is
probably only useful under rare circumstances</li>
</ul>
<h2><a name="RttiCustomization" id="RttiCustomization">RTTI
customization</a></h2>
<p>RTTI is used for event dispatch and
<code>state_downcast&lt;&gt;()</code>. Currently, there are exactly two
options:</p>
<ol>
<li>By default, a speed-optimized internal implementation is
employed</li>
<li>The library can be instructed to use native C++ RTTI instead by
defining <code><a href=
"configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a></code></li>
</ol>
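<p>Option 2 amounts to a one-line configuration change. The macro should
be defined consistently in all translation units of the program, most
reliably via the compiler command line:</p>

```cpp
// Either on the command line, e.g.: -DBOOST_STATECHART_USE_NATIVE_RTTI
// ...or in code, before including any Boost.Statechart header:
#define BOOST_STATECHART_USE_NATIVE_RTTI
#include <boost/statechart/state_machine.hpp>
```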
<p>There are two reasons to favor option 2:</p>
<ul>
<li>When a state machine (or parts of it) is compiled into a DLL,
problems could arise from the use of the internal RTTI mechanism (see the
FAQ item "<a href="faq.html#Dll">How can I compile a state machine into a
dynamic link library (DLL)?</a>"). Using option 2 is one way to work
around such problems (on some platforms, it seems to be the only
way)</li>
<li>State and event objects need to store one pointer less, meaning that
in the best case the memory footprint of a state machine object could
shrink by 15% (an empty event is typically 30% smaller, which can be an
advantage when there are bursts of events rather than a steady flow).
However, on most platforms executable size grows when C++ RTTI is turned
on. So, given the small per-machine-object savings, this only makes sense
in applications where both of the following conditions hold:
<ul>
<li>Event dispatch will never become a bottleneck</li>
<li>There is a need to reduce the memory allocated at runtime (at the
cost of a larger executable)</li>
</ul>
<p>Obvious candidates are embedded systems where the executable resides
in ROM. Other candidates are applications running a large number of
identical state machines, where this measure could even reduce the
<b>overall</b> memory footprint</p>
</li>
</ul>
<h2><a name="ResourceUsage" id="ResourceUsage">Resource usage</a></h2>
<h3>Memory</h3>
<p>On a 32-bit box, one empty active state typically needs less than 50
bytes of memory. Even <b>very</b> complex machines will usually have fewer
than 20 simultaneously active states, so just about every machine should
run with less than one kilobyte of memory (not counting event queues).
Obviously, the per-machine memory footprint grows with whatever
state-local members the user adds.</p>
<h3>Processor cycles</h3>
<p>The following ranking should give a rough picture of which feature will
consume how many cycles:</p>
<ol>
<li><code>state_cast&lt;&gt;()</code>: By far the most cycle-consuming
feature. Searches linearly for a suitable state, using one
<code>dynamic_cast</code> per visited state</li>
<li>State entry and exit: Profiling of the fully optimized
1-bit-BitMachine suggested that roughly three quarters of the total event
processing time is spent destructing the exited state and constructing
the entered state. Obviously, transitions where the <a href=
"definitions.html#InnermostCommonContext">innermost common context</a> is
"far" from the leaf states and/or with lots of orthogonal states can
easily cause the destruction and construction of quite a few states,
leading to significant amounts of time spent on a transition</li>
<li><code>state_downcast&lt;&gt;()</code>: Searches linearly for the
requested state, using one virtual call and one RTTI comparison per
visited state</li>
<li>Deep history: For all innermost states inside a state passing either
<code>has_deep_history</code> or <code>has_full_history</code> to its
state base class, a binary search through the (usually small) history map
must be performed on each exit. History slot allocation is performed
exactly once, at first exit</li>
<li>Shallow history: For all direct inner states of a state passing
either <code>has_shallow_history</code> or <code>has_full_history</code>
to its state base class, a binary search through the (usually small)
history map must be performed on each exit. History slot allocation is
performed exactly once, at first exit</li>
<li>Event dispatch: One virtual call followed by a linear search for a
suitable <a href="definitions.html#Reaction">reaction</a>, using one RTTI
comparison per visited reaction</li>
<li>Orthogonal states: One additional virtual call for each exited state
<b>if</b> there is more than one active leaf state before a transition.
It should also be noted that the worst-case event dispatch time is
multiplied in the presence of orthogonal states. For example, if two
orthogonal leaf states are added to a given state configuration, the
worst-case time is tripled</li>
</ol>
<hr>
<p><a href="http://validator.w3.org/check?uri=referer"><img border="0" src=
"../../../doc/images/valid-html401.png" alt="Valid HTML 4.01 Transitional"
height="31" width="88"></a></p>
<p>Revised
<!--webbot bot="Timestamp" s-type="EDITED" s-format="%d %B, %Y" startspan -->03 December, 2006<!--webbot bot="Timestamp" endspan i-checksum="38512" --></p>
<p><i>Copyright &copy; 2003-<!--webbot bot="Timestamp" s-type="EDITED" s-format="%Y" startspan -->2006<!--webbot bot="Timestamp" endspan i-checksum="770" -->
<a href="contact.html">Andreas Huber D&ouml;nni</a></i></p>
<p><i>Distributed under the Boost Software License, Version 1.0. (See
accompanying file <a href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</a> or
copy at <a href=
"http://www.boost.org/LICENSE_1_0.txt">http://www.boost.org/LICENSE_1_0.txt</a>)</i></p>
</body>
</html>