<html>

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage 6.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<link rel="stylesheet" type="text/css" href="../../../boost.css">
<title>The Boost Statechart Library - Performance</title>
</head>

<body link="#0000ff" vlink="#800080">

<table border="0" cellpadding="7" cellspacing="0" width="100%" summary="header">
<tr>
<td valign="top" width="300">
<h3><a href="../../../index.htm">
<img alt="C++ Boost" src="../../../boost.png" border="0" width="277" height="86"></a></h3>
</td>
<td valign="top">
<h1 align="center">The Boost Statechart Library</h1>
<h2 align="center">Performance</h2>
</td>
</tr>
</table>
<hr>
<dl class="index">
<dt><a href="#Speed versus scalability tradeoffs">Speed versus scalability
tradeoffs</a></dt>
<dt><a href="#Memory management customization">Memory management
customization</a></dt>
<dt><a href="#RTTI customization">RTTI customization</a></dt>
<dt><a href="#Resource usage">Resource usage</a></dt>
</dl>
<h2><a name="Speed versus scalability tradeoffs">Speed versus scalability
tradeoffs</a></h2>
<p>Quite a bit of effort has gone into making the library fast for small
simple machines <b>and</b> scalable at the same time (this applies only to
<code>state_machine<></code>; there is still some room for optimizing <code>
fifo_scheduler<></code>, especially for multi-threaded builds). While I
believe it should perform reasonably in most applications, the scalability
does not come for free. Small, carefully handcrafted state machines will thus
easily outperform equivalent Boost.Statechart machines. To get a picture of how big
the gap is, I implemented a simple benchmark in the BitMachine example. The
Handcrafted example is a handcrafted variant of the 1-bit-BitMachine
implementing the same benchmark.</p>
<p>I tried to create a fair but somewhat unrealistic <b>worst-case</b>
scenario:</p>
<ul>
<li>For both machines, exactly one object of the only event is allocated
before starting the test. This same object is then sent to the machines
over and over</li>
<li>The Handcrafted machine employs GOF-visitor double dispatch. The states
are preallocated so that event dispatch & transition amounts to nothing more
than two virtual calls and one pointer assignment</li>
</ul>
<p>The benchmarks - compiled with MSVC7.1 (single-threaded), running on
3.2GHz Intel Pentium 4 / 1.6GHz Pentium M - produced the following
dispatch and transition times per event:</p>
<ul>
<li>Handcrafted: 10 nanoseconds / 10 nanoseconds</li>
<li>1-bit-BitMachine with customized memory management: 130ns / 220ns</li>
</ul>
<p>Although this is a big difference, I still think it will not be noticeable
in most real-world applications. No matter whether an application uses
handcrafted or Boost.Statechart machines, it will...</p>
<ul>
<li>almost never run into a situation where a state machine is swamped with
as many events as in the benchmarks. A state machine will almost always spend a good deal of time waiting for events
(which typically come from a human operator, from machinery or
from electronic devices over often comparatively slow I/O channels).
Parsers are just about the only application of FSMs where this is not the
case. However, parser FSMs are usually not directly specified on the state
machine level but on a higher one that is better suited for the task.
Examples of such higher levels are: Boost.Regex, Boost.Spirit, XML Schemas,
etc. Moreover, the nature of parsers allows for a number of optimizations
that are not possible in a general-purpose FSM framework.<br>
Bottom line: While it is possible to implement a parser with this library,
it is almost never advisable to do so because other approaches lead to
better-performing and more expressive code</li>
<li>often run state machines in their own threads. This adds considerable
locking and thread-switching overhead. Performance tests with the PingPong
example, where two asynchronous state machines exchange events, gave the
following times to process one event and perform the resulting in-state
reaction (using the library with <code>boost::fast_pool_allocator<></code>):<ul>
<li>Single-threaded (no locking and waiting): 840ns / 840ns</li>
<li>Multi-threaded with one thread (the scheduler uses mutex locking but
never has to wait for events): 6500ns / 4800ns</li>
<li>Multi-threaded with two threads (both schedulers use mutex locking and
exactly one always waits for an event): 14000ns / 7000ns</li>
</ul>
<p>As mentioned above, there definitely is some room to improve the
timings for the asynchronous machines. Moreover, these are very crude
benchmarks, designed to show the overhead of locking and thread context
switching. The overhead in a real-world application will typically be
smaller, and other operating systems can certainly do better in this area.
However, I strongly believe that on most platforms the threading overhead is
usually larger than the time that Boost.Statechart spends on event dispatch
and transition. Handcrafted machines will inevitably have the same overhead,
making raw single-threaded dispatch and transition speed much less important</li>
<li>almost always allocate events with <code>new</code> and destroy them
after consumption. This will add a few cycles, even if event memory
management is customized</li>
<li>often use state machines that employ orthogonal states and other
advanced features. This forces the handcrafted machines to use more
adequate and more time-consuming book-keeping</li>
</ul>
<p>Therefore, in real-world applications event dispatch and transition do not
normally constitute a bottleneck, and the relative gap between handcrafted and
Boost.Statechart machines also becomes much smaller than in the worst-case scenario.</p>
<p>BitMachine measurements with more states and with different levels of
optimization:</p>
<table border="3" width="100%" id="AutoNumber2" cellpadding="2">
<tr>
<td width="25%" rowspan="2"><b>Machine configuration<br>
# states / # outgoing transitions per state</b></td>
<td width="75%" colspan="3"><b>Event dispatch & transition time [nanoseconds]<br>
<font color="#FF0000">MSVC 7.1: 3.2GHz Pentium 4 / 1.6GHz Pentium M</font><br>
<font color="#0000FF">GCC 3.4.2: 3.2GHz Pentium 4 / 1.6GHz Pentium M</font></b></td>
</tr>
<tr>
<td width="25%">Out of the box</td>
<td width="25%">Same as out of the box but with <code>
<a href="configuration.html#Application Defined Macros">
BOOST_STATECHART_USE_NATIVE_RTTI</a></code> defined</td>
<td width="25%">Same as out of the box but with customized memory
management</td>
</tr>
<tr>
<td width="25%">2 / 1</td>
<td width="25%"><font color="#FF0000">410 / 460</font><br>
<font color="#0000FF">540 / 480</font></td>
<td width="25%"><font color="#FF0000">490 / 570</font><br>
<font color="#0000FF">510 / 500</font></td>
<td width="25%"><font color="#FF0000">130 / 220</font><br>
<font color="#0000FF">320 / 230</font></td>
</tr>
<tr>
<td width="25%">4 / 2</td>
<td width="25%"><font color="#FF0000">440 / 470</font><br>
<font color="#0000FF">560 / 480</font></td>
<td width="25%"><font color="#FF0000">530 / 640</font><br>
<font color="#0000FF">570 / 550</font></td>
<td width="25%"><font color="#FF0000">160 / 240</font><br>
<font color="#0000FF">330 / 240</font></td>
</tr>
<tr>
<td width="25%">8 / 3</td>
<td width="25%"><font color="#FF0000">450 / 470</font><br>
<font color="#0000FF">580 / 510</font></td>
<td width="25%"><font color="#FF0000">580 / 700</font><br>
<font color="#0000FF">610 / 630</font></td>
<td width="25%"><font color="#FF0000">180 / 250</font><br>
<font color="#0000FF">340 / 260</font></td>
</tr>
<tr>
<td width="25%">16 / 4</td>
<td width="25%"><font color="#FF0000">490 / 480</font><br>
<font color="#0000FF">710 / 670</font></td>
<td width="25%"><font color="#FF0000">720 / 790</font><br>
<font color="#0000FF">770 / 750</font></td>
<td width="25%"><font color="#FF0000">230 / 260</font><br>
<font color="#0000FF">460 / 360</font></td>
</tr>
<tr>
<td width="25%">32 / 5</td>
<td width="25%"><font color="#FF0000">590 / 520</font><br>
<font color="#0000FF">790 / 690</font></td>
<td width="25%"><font color="#FF0000">820 / 880</font><br>
<font color="#0000FF">920 / 910</font></td>
<td width="25%"><font color="#FF0000">340 / 280</font><br>
<font color="#0000FF">590 / 470</font></td>
</tr>
</table>
<h3>Double dispatch</h3>
<p>At the heart of every state machine lies an implementation of double
dispatch. This is due to the fact that the incoming event <b>and</b> the
active state define exactly which <a href="definitions.html#Reaction">reaction</a>
the state machine will produce. For each event dispatch, one virtual call is
followed by a linear search for the appropriate reaction, using one RTTI
comparison per reaction. The following alternatives were considered but
rejected:</p>
<ul>
<li><a href="http://www.objectmentor.com/resources/articles/acv.pdf">Acyclic
visitor</a>: This double-dispatch variant satisfies all scalability
requirements but performs badly due to costly inheritance tree cross-casts.
Moreover, a state must store one v-pointer for <b>each</b> reaction, which
slows down construction and makes memory management customization
inefficient. In addition, C++ RTTI must inevitably be turned on, with
negative effects on executable size. Boost.Statechart originally employed acyclic
visitor and was about 4 times slower than it is now (MSVC7.1 on Intel
Pentium M). The dispatch speed might be better on other platforms but the
other negative effects will remain</li>
<li>
<a href="http://www.isbiel.ch/~due/courses/c355/slides/patterns/visitor.pdf">
GOF Visitor</a>: The GOF Visitor pattern inevitably makes the whole machine
depend upon all events. That is, whenever a new event is added there is no
way around recompiling the whole state machine. This is contrary to the
scalability requirements</li>
<li>Two-dimensional array of function pointers: To satisfy requirement 6, it
should be possible to spread a single state machine over several translation
units. This however means that the dispatch table must be filled at runtime
and the different translation units must somehow make themselves "known", so
that their part of the state machine can be added to the table. There simply
is no way to do this automatically <b>and</b> portably. The only portable
way that a state machine distributed over several translation units could
employ table-based double dispatch relies on the user. The programmer(s)
would somehow have to <b>manually</b> tie together the various pieces of the
state machine. Not only does this scale badly, but it is also quite
error-prone</li>
</ul>
<h2><a name="Memory management customization">Memory management customization</a></h2>
<p>Out of the box, all internal data is allocated on the normal heap. This
should be satisfactory for applications meeting the following prerequisites:</p>
<ul>
<li>There are no deterministic reaction time (hard real-time) requirements</li>
<li>The application will never run long enough for heap fragmentation to
become a problem. This is of course an issue for all long-running programs,
not only the ones employing this library. However, it should be noted that
fragmentation problems could show up earlier than with traditional FSM
frameworks</li>
</ul>
<p>Should an application not meet these prerequisites, customization of
all memory management (not just Boost.Statechart's) should be considered, which is
supported as follows:</p>
<ul>
<li>By passing a class offering a <code>std::allocator<></code> interface
for the <code>Allocator</code> parameter of the <code>state_machine</code>
class template</li>
<li>By replacing the <code>simple_state</code>, <code>state</code> and <code>
event</code> class templates with ones that have a customized <code>operator
new()</code> and <code>operator delete()</code>. This can be as easy as
deriving your customized class templates from both the framework-supplied class
templates <b>and</b> your preferred small-object/deterministic/constant-time
allocator base class</li>
</ul>
<p><code>simple_state<></code> and <code>state<></code> subtype objects are
constructed and destructed only by the state machine. It would therefore be
possible to use the <code>state_machine<></code> allocator instead of forcing
the user to overload <code>operator new()</code> and <code>operator delete()</code>.
However, a lot of systems employ at most one instance of a particular state
machine, which means that a) there is at most one object of a particular state
and b) this object is always constructed, accessed and destructed by one and
the same thread. We can exploit these facts with a much simpler (and faster)
<code>new</code>/<code>delete</code> implementation (for example, see
UniqueObject.hpp in the BitMachine example). However, this is only possible as
long as we have the freedom to customize memory management for state classes
separately.</p>
<h2><a name="RTTI customization">RTTI customization</a></h2>
<p>RTTI is used for event dispatch and <code>state_downcast<>()</code>.
Currently, there are exactly two options:</p>
<ol>
<li>By default, a speed-optimized internal implementation is employed</li>
<li>The library can be instructed to use native C++ RTTI instead by defining
<code><a href="configuration.html#Application Defined Macros">
BOOST_STATECHART_USE_NATIVE_RTTI</a></code></li>
</ol>
<p>Just about the only reason to favor option 2 is the fact that state and event
objects need to store one pointer less, meaning that in the best case the
memory footprint of a state machine object could shrink by 15% (an empty event
is typically 30% smaller, which can be an advantage when there are bursts of
events rather than a steady flow). However, on most platforms executable size
grows when C++ RTTI is turned on. So, given the small per-machine-object
savings, option 2 only makes sense in applications where both of the following
conditions hold:</p>
<ul>
<li>Event dispatch will never become a
bottleneck</li>
<li>There is a need to reduce the memory allocated at runtime (at the cost
of a larger executable)</li>
</ul>
<p>Obvious candidates are embedded systems where the executable resides in
ROM. Other candidates are applications running a large number of identical
state machines, where this measure could even reduce the <b>overall</b> memory
footprint.</p>
<h2><a name="Resource usage">Resource usage</a></h2>
<h3>Memory</h3>
<p>On a 32-bit box, one empty active state typically needs less than 50 bytes
of memory. Even <b>very</b> complex machines will usually have less than 20
simultaneously active states, so just about every machine should run with less
than one kilobyte of memory (not counting event queues). Obviously, the
per-machine memory footprint is offset by whatever state-local members the
user adds.</p>
<h3>Processor cycles</h3>
<p>The following ranking should give a rough picture of which feature will
consume how many cycles:</p>
<ol>
<li><code>state_cast<>()</code>: By far the most cycle-consuming feature.
Searches linearly for a suitable state, using one <code>dynamic_cast</code>
per visited state</li>
<li>State entry and exit: Profiling of the fully optimized 1-bit-BitMachine
suggested that roughly half of the total dispatch time is spent destructing the
exited state and constructing the entered state. Obviously, transitions
where the <a href="definitions.html#Innermost common context">innermost
common context</a> is "far" from the leaf states and/or with lots of
orthogonal states can easily cause the destruction and construction of quite
a few states, leading to significant amounts of time spent on a transition</li>
<li><code>state_downcast<>()</code>: Searches linearly for the requested
state, using one virtual call and one RTTI comparison per visited state</li>
<li>Deep history: For all innermost states inside a state passing either
<code>has_deep_history</code> or <code>has_full_history</code> to its
state base class, a binary search
through the (usually small) history map must be performed on each exit.
History slot allocation is performed exactly once, at first exit</li>
<li>Shallow history: For all direct inner states of a state passing either
<code>has_shallow_history</code> or <code>has_full_history</code> to its
state base class, a binary search
through the (usually small) history map must be performed on each exit.
History slot allocation is performed exactly once, at first exit</li>
<li>Event dispatch: One virtual call followed by a linear search for a
suitable <a href="definitions.html#Reaction">reaction</a>, using one RTTI
comparison per visited reaction</li>
<li>Orthogonal states: One additional virtual call for each exited state <b>
if</b> there is more than one active leaf state before a transition. It should
also be noted that the worst-case event dispatch time is multiplied in the
presence of orthogonal states. For example, if two orthogonal leaf states
are added to a given state configuration, the worst-case time is tripled</li>
</ol>
<hr>
<p>Revised
<!--webbot bot="Timestamp" s-type="EDITED" s-format="%d %B, %Y" startspan -->04 June, 2005<!--webbot bot="Timestamp" endspan i-checksum="19916" --></p>
<p><i>&copy; Copyright <a href="mailto:ahd6974-spamgroupstrap@yahoo.com">Andreas Huber D&uuml;nni</a>
2003-2005. <font color="#FF0000"><b>The link refers to a
<a href="http://en.wikipedia.org/wiki/Honeypot">spam honeypot</a>. Please remove the words spam and trap
to obtain my real address.</b></font></i></p>
<p><i>Distributed under the Boost Software License, Version 1.0. (See
accompanying file <a href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</a> or
copy at <a href="http://www.boost.org/LICENSE_1_0.txt">
http://www.boost.org/LICENSE_1_0.txt</a>)</i></p>

</body>

</html>