<html>

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage 6.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<link rel="stylesheet" type="text/css" href="../../../boost.css">
<title>The Boost Statechart Library - Performance</title>
</head>

<body link="#0000ff" vlink="#800080">

<table border="0" cellpadding="7" cellspacing="0" width="100%" summary="header">
<tr>
<td valign="top" width="300">
<h3><a href="../../../index.htm">
<img alt="C++ Boost" src="../../../boost.png" border="0" width="277" height="86"></a></h3>
</td>
<td valign="top">
<h1 align="center">The Boost Statechart Library</h1>
<h2 align="center">Performance</h2>
</td>
</tr>
</table>
<hr>
<dl class="index">
<dt><a href="#Speed versus scalability tradeoffs">Speed versus scalability
tradeoffs</a></dt>
<dt><a href="#Memory management customization">Memory management
customization</a></dt>
<dt><a href="#RTTI customization">RTTI customization</a></dt>
<dt><a href="#Resource usage">Resource usage</a></dt>
</dl>
<h2><a name="Speed versus scalability tradeoffs">Speed versus scalability
tradeoffs</a></h2>
<p>Quite a bit of effort has gone into making the library fast for small
simple machines <b>and</b> scalable at the same time (this applies only to
<code>state_machine<></code>; there is still some room for optimizing <code>
fifo_scheduler<></code>, especially for multi-threaded builds). While I
believe it should perform reasonably in most applications, the scalability
does not come for free. Small, carefully handcrafted state machines will thus
easily outperform equivalent Boost.Statechart machines. To get a picture of how big
the gap is, I implemented a simple benchmark in the BitMachine example. The
Handcrafted example is a handcrafted variant of the 1-bit-BitMachine
implementing the same benchmark.</p>
<p>I tried to create a fair but somewhat unrealistic <b>worst-case</b>
scenario:</p>
<ul>
<li>For both machines, exactly one object of the only event is allocated
before starting the test. This same object is then sent to the machines
over and over</li>
<li>The Handcrafted machine employs GOF-visitor double dispatch. The states
are preallocated so that event dispatch & transition amounts to nothing more
than two virtual calls and one pointer assignment</li>
</ul>
<p>The benchmarks - compiled with MSVC7.1 (single-threaded), running on
3.2GHz Intel Pentium 4 / 1.6GHz Pentium M - produced the following
dispatch and transition times per event:</p>
<ul>
<li>Handcrafted: 10 nanoseconds / 10 nanoseconds</li>
<li>1-bit-BitMachine with customized memory management: 130ns / 220ns</li>
</ul>
<p>Although this is a big difference, I still think it will not be noticeable
in most real-world applications. No matter whether an application uses
handcrafted or Boost.Statechart machines, it will...</p>
<ul>
<li>almost never run into a situation where a state machine is swamped with
as many events as in the benchmarks. A state machine will almost always spend a good deal of time waiting for events
(which typically come from a human operator, from machinery or
from electronic devices over often comparatively slow I/O channels).
Parsers are just about the only application of FSMs where this is not the
case. However, parser FSMs are usually not directly specified on the state
machine level but on a higher one that is better suited for the task.
Examples of such higher levels are: Boost.Regex, Boost.Spirit, XML Schemas,
etc. Moreover, the nature of parsers allows for a number of optimizations
that are not possible in a general-purpose FSM framework.<br>
Bottom line: While it is possible to implement a parser with this library,
it is almost never advisable to do so because other approaches lead to
better-performing and more expressive code</li>
<li>often run state machines in their own threads. This adds considerable
locking and thread-switching overhead. Performance tests with the PingPong
example, where two asynchronous state machines exchange events, gave the
following times to process one event and perform the resulting in-state
reaction (using the library with <code>boost::fast_pool_allocator<></code>):<ul>
<li>Single-threaded (no locking and waiting): 840ns / 840ns</li>
<li>Multi-threaded with one thread (the scheduler uses mutex locking but
never has to wait for events): 6500ns / 4800ns</li>
<li>Multi-threaded with two threads (both schedulers use mutex locking and
exactly one always waits for an event): 14000ns / 7000ns</li>
</ul>
<p>As mentioned above, there definitely is some room to improve the
timings for the asynchronous machines. Moreover, these are very crude
benchmarks, designed to show the overhead of locking and thread context
switching. The overhead in a real-world application will typically be
smaller, and other operating systems can certainly do better in this area.
However, I strongly believe that on most platforms the threading overhead is
usually larger than the time that Boost.Statechart spends on event dispatch
and transition. Handcrafted machines will inevitably have the same overhead,
making raw single-threaded dispatch and transition speed much less important</li>
<li>almost always allocate events with <code>new</code> and destroy them
after consumption. This will add a few cycles, even if event memory
management is customized</li>
<li>often use state machines that employ orthogonal states and other
advanced features. This forces the handcrafted machines to use more
adequate and more time-consuming book-keeping</li>
</ul>
<p>Therefore, in real-world applications event dispatch and transition do not
normally constitute a bottleneck, and the relative gap between handcrafted and
Boost.Statechart machines also becomes much smaller than in the worst-case scenario.</p>
<p>BitMachine measurements with more states and with different levels of
optimization:</p>
<table border="3" width="100%" id="AutoNumber2" cellpadding="2">
<tr>
<td width="25%" rowspan="2"><b>Machine configuration<br>
# states / # outgoing transitions per state</b></td>
<td width="75%" colspan="3"><b>Event dispatch & transition time [nanoseconds]<br>
<font color="#FF0000">MSVC 7.1: 3.2GHz Pentium 4 / 1.6GHz Pentium M</font><br>
<font color="#0000FF">GCC 3.4.2: 3.2GHz Pentium 4 / 1.6GHz Pentium M</font></b></td>
</tr>
<tr>
<td width="25%">Out of the box</td>
<td width="25%">Same as out of the box but with <code>
<a href="configuration.html#Application Defined Macros">
BOOST_STATECHART_USE_NATIVE_RTTI</a></code> defined</td>
<td width="25%">Same as out of the box but with customized memory
management</td>
</tr>
<tr>
<td width="25%">2 / 1</td>
<td width="25%"><font color="#FF0000">410 / 460</font><br>
<font color="#0000FF">540 / 480</font></td>
<td width="25%"><font color="#FF0000">490 / 570</font><br>
<font color="#0000FF">510 / 500</font></td>
<td width="25%"><font color="#FF0000">130 / 220</font><br>
<font color="#0000FF">320 / 230</font></td>
</tr>
<tr>
<td width="25%">4 / 2</td>
<td width="25%"><font color="#FF0000">440 / 470</font><br>
<font color="#0000FF">560 / 480</font></td>
<td width="25%"><font color="#FF0000">530 / 640</font><br>
<font color="#0000FF">570 / 550</font></td>
<td width="25%"><font color="#FF0000">160 / 240</font><br>
<font color="#0000FF">330 / 240</font></td>
</tr>
<tr>
<td width="25%">8 / 3</td>
<td width="25%"><font color="#FF0000">450 / 470</font><br>
<font color="#0000FF">580 / 510</font></td>
<td width="25%"><font color="#FF0000">580 / 700</font><br>
<font color="#0000FF">610 / 630</font></td>
<td width="25%"><font color="#FF0000">180 / 250</font><br>
<font color="#0000FF">340 / 260</font></td>
</tr>
<tr>
<td width="25%">16 / 4</td>
<td width="25%"><font color="#FF0000">490 / 480</font><br>
<font color="#0000FF">710 / 670</font></td>
<td width="25%"><font color="#FF0000">720 / 790</font><br>
<font color="#0000FF">770 / 750</font></td>
<td width="25%"><font color="#FF0000">230 / 260</font><br>
<font color="#0000FF">460 / 360</font></td>
</tr>
<tr>
<td width="25%">32 / 5</td>
<td width="25%"><font color="#FF0000">590 / 520</font><br>
<font color="#0000FF">790 / 690</font></td>
<td width="25%"><font color="#FF0000">820 / 880</font><br>
<font color="#0000FF">920 / 910</font></td>
<td width="25%"><font color="#FF0000">340 / 280</font><br>
<font color="#0000FF">590 / 470</font></td>
</tr>
</table>
<h3>Double dispatch</h3>
<p>At the heart of every state machine lies an implementation of double
dispatch. This is due to the fact that the incoming event <b>and</b> the
active state define exactly which <a href="definitions.html#Reaction">reaction</a>
the state machine will produce. For each event dispatch, one virtual call is
followed by a linear search for the appropriate reaction, using one RTTI
comparison per reaction. The following alternatives were considered but
rejected:</p>
<ul>
<li><a href="http://www.objectmentor.com/resources/articles/acv.pdf">Acyclic
visitor</a>: This double-dispatch variant satisfies all scalability
requirements but performs badly due to costly inheritance tree cross-casts.
Moreover, a state must store one v-pointer for <b>each</b> reaction, which
slows down construction and makes memory management customization
inefficient. In addition, C++ RTTI must inevitably be turned on, with
negative effects on executable size. Boost.Statechart originally employed acyclic
visitor and was about 4 times slower than it is now (MSVC7.1 on Intel
Pentium M). The dispatch speed might be better on other platforms but the
other negative effects will remain</li>
<li>
<a href="http://www.isbiel.ch/~due/courses/c355/slides/patterns/visitor.pdf">
GOF Visitor</a>: The GOF Visitor pattern inevitably makes the whole machine
depend upon all events. That is, whenever a new event is added there is no
way around recompiling the whole state machine. This is contrary to the
scalability requirements</li>
<li>Two-dimensional array of function pointers: To satisfy requirement 6, it
should be possible to spread a single state machine over several translation
units. This however means that the dispatch table must be filled at runtime
and the different translation units must somehow make themselves "known", so
that their part of the state machine can be added to the table. There simply
is no way to do this automatically <b>and</b> portably. The only portable
way that a state machine distributed over several translation units could
employ table-based double dispatch relies on the user. The programmer(s)
would somehow have to <b>manually</b> tie together the various pieces of the
state machine. Not only does this scale badly, but it is also quite
error-prone</li>
</ul>
<h2><a name="Memory management customization">Memory management customization</a></h2>
<p>Out of the box, all internal data is allocated on the normal heap. This
should be satisfactory for applications meeting the following prerequisites:</p>
<ul>
<li>There are no deterministic reaction time (hard real-time) requirements</li>
<li>The application will never run long enough for heap fragmentation to
become a problem. This is of course an issue for all long-running programs,
not only the ones employing this library. However, it should be noted that
fragmentation problems could show up earlier than with traditional FSM
frameworks</li>
</ul>
<p>Should an application not meet these prerequisites, customization of
all memory management (not just Boost.Statechart's) should be considered, which is
supported as follows:</p>
<ul>
<li>By passing a class offering a <code>std::allocator<></code> interface
for the <code>Allocator</code> parameter of the <code>state_machine</code>
class template</li>
<li>By replacing the <code>simple_state</code>, <code>state</code> and <code>
event</code> class templates with ones that have a customized <code>operator
new()</code> and <code>operator delete()</code>. This can be as easy as
deriving your customized class templates from both the framework-supplied class
templates <b>and</b> your preferred small-object/deterministic/constant-time
allocator base class</li>
</ul>
<p><code>simple_state<></code> and <code>state<></code> subtype objects are
constructed and destructed only by the state machine. It would therefore be
possible to use the <code>state_machine<></code> allocator instead of forcing
the user to overload <code>operator new()</code> and <code>operator delete()</code>.
However, a lot of systems employ at most one instance of a particular state
machine, which means that a) there is at most one object of a particular state
and b) this object is always constructed, accessed and destructed by one and
the same thread. We can exploit these facts with a much simpler (and faster)
<code>new</code>/<code>delete</code> implementation (for example, see
UniqueObject.hpp in the BitMachine example). However, this is only possible as
long as we have the freedom to customize memory management for state classes
separately.</p>
<h2><a name="RTTI customization">RTTI customization</a></h2>
<p>RTTI is used for event dispatch and <code>state_downcast<>()</code>.
Currently, there are exactly two options:</p>
<ol>
<li>By default, a speed-optimized internal implementation is employed</li>
<li>The library can be instructed to use native C++ RTTI instead by defining
<code><a href="configuration.html#Application Defined Macros">
BOOST_STATECHART_USE_NATIVE_RTTI</a></code></li>
</ol>
<p>Just about the only reason to favor option 2 is the fact that state and event
objects need to store one pointer less, meaning that in the best case the
memory footprint of a state machine object could shrink by 15% (an empty event
is typically 30% smaller, which can be an advantage when there are bursts of
events rather than a steady flow). However, on most platforms executable size
grows when C++ RTTI is turned on. So, given the small per-machine-object
savings, option 2 only makes sense in applications where both of the following
conditions hold:</p>
<ul>
<li>Event dispatch will never become a
bottleneck</li>
<li>There is a need to reduce the memory allocated at runtime (at the cost
of a larger executable)</li>
</ul>
<p>Obvious candidates are embedded systems where the executable resides in
ROM. Other candidates are applications running a large number of identical
state machines, where this measure could even reduce the <b>overall</b> memory
footprint.</p>
<h2><a name="Resource usage">Resource usage</a></h2>
<h3>Memory</h3>
<p>On a 32-bit box, one empty active state typically needs less than 50 bytes
of memory. Even <b>very</b> complex machines will usually have less than 20
simultaneously active states, so just about every machine should run with less
than one kilobyte of memory (not counting event queues). Obviously, the
per-machine memory footprint is offset by whatever state-local members the
user adds.</p>
<h3>Processor cycles</h3>
<p>The following ranking should give a rough picture of which feature will
consume how many cycles:</p>
<ol>
<li><code>state_cast<>()</code>: By far the most cycle-consuming feature.
Searches linearly for a suitable state, using one <code>dynamic_cast</code>
per visited state</li>
<li>State entry and exit: Profiling of the fully optimized 1-bit-BitMachine
suggested that roughly half of the total dispatch time is spent destructing the
exited state and constructing the entered state. Obviously, transitions
where the <a href="definitions.html#Innermost common context">innermost
common context</a> is "far" from the leaf states and/or with lots of
orthogonal states can easily cause the destruction and construction of quite
a few states, leading to significant amounts of time spent on a transition</li>
<li><code>state_downcast<>()</code>: Searches linearly for the requested
state, using one virtual call and one RTTI comparison per visited state</li>
<li>Deep history: For all innermost states inside a state passing either
<code>has_deep_history</code> or <code>has_full_history</code> to its
state base class, a binary search
through the (usually small) history map must be performed on each exit.
History slot allocation is performed exactly once, at first exit</li>
<li>Shallow history: For all direct inner states of a state passing either
<code>has_shallow_history</code> or <code>has_full_history</code> to its
state base class, a binary search
through the (usually small) history map must be performed on each exit.
History slot allocation is performed exactly once, at first exit</li>
<li>Event dispatch: One virtual call followed by a linear search for a
suitable <a href="definitions.html#Reaction">reaction</a>, using one RTTI
comparison per visited reaction</li>
<li>Orthogonal states: One additional virtual call for each exited state <b>
if</b> there is more than one active leaf state before a transition. It should
also be noted that the worst-case event dispatch time is multiplied in the
presence of orthogonal states. For example, if two orthogonal leaf states
are added to a given state configuration, the worst-case time is tripled</li>
</ol>
<hr>
<p>Revised
<!--webbot bot="Timestamp" s-type="EDITED" s-format="%d %B, %Y" startspan -->04 June, 2005<!--webbot bot="Timestamp" endspan i-checksum="19916" --></p>
<p><i>&copy; Copyright <a href="mailto:ahd6974-spamgroupstrap@yahoo.com">Andreas Huber D&uuml;nni</a>
2003-2005. <font color="#FF0000"><b>The link refers to a
<a href="http://en.wikipedia.org/wiki/Honeypot">spam honeypot</a>. Please remove the words spam and trap
to obtain my real address.</b></font></i></p>
<p><i>Distributed under the Boost Software License, Version 1.0. (See
accompanying file <a href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</a> or
copy at <a href="http://www.boost.org/LICENSE_1_0.txt">
http://www.boost.org/LICENSE_1_0.txt</a>)</i></p>

</body>

</html>