994d4e48cc
[SVN r44163]
355 lines
15 KiB
HTML
355 lines
15 KiB
HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
|
<html>
|
|
<head>
|
|
<meta content=
|
|
"HTML Tidy for Windows (vers 1st February 2003), see www.w3.org"
|
|
name="generator">
|
|
<title>
|
|
Basic Concepts
|
|
</title>
|
|
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
|
|
<link rel="stylesheet" href="theme/style.css" type="text/css">
|
|
</head>
|
|
<body>
|
|
<table width="100%" border="0" background="theme/bkd2.gif" cellspacing="2">
|
|
<tr>
|
|
<td width="10"></td>
|
|
<td width="85%">
|
|
<font size="6" face="Verdana, Arial, Helvetica, sans-serif"><b>Basic
|
|
Concepts</b></font>
|
|
</td>
|
|
<td width="112">
|
|
<a href="http://spirit.sf.net"><img src="theme/spirit.gif"
|
|
width="112" height="48" align="right" border="0"></a>
|
|
</td>
|
|
</tr>
|
|
</table><br>
|
|
<table border="0">
|
|
<tr>
|
|
<td width="10"></td>
|
|
<td width="30">
|
|
<a href="../index.html"><img src="theme/u_arr.gif" border="0"></a>
|
|
</td>
|
|
<td width="30">
|
|
<a href="quick_start.html"><img src="theme/l_arr.gif" border="0"></a>
|
|
</td>
|
|
<td width="30">
|
|
<a href="organization.html"><img src="theme/r_arr.gif" border="0">
|
|
</a>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
There are a few fundamental concepts that need to be understood well: 1)
|
|
The <strong>Parser</strong>, 2) <strong>Match</strong>, 3) The
|
|
<strong>Scanner</strong>, and 4) <strong>Semantic Actions</strong>. These
|
|
basic concepts interact with one another, and the functionalities of each
|
|
interweave throughout the framework to make it one coherent whole.
|
|
</p>
|
|
<table width="48%" border="0" align="center">
|
|
<tr>
|
|
<td height="211">
|
|
<img src="theme/intro1.png">
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<h2>
|
|
The Parser
|
|
</h2>
|
|
<p>
|
|
Central to the framework is the parser. The parser does the actual work
|
|
of recognizing a linear input stream of data read sequentially from start
|
|
to end by the scanner. The parser attempts to match the input following a
|
|
well-defined set of specifications known as grammar rules. The parser
|
|
reports the success or failure to its client through a match object. When
|
|
successful, the parser calls a client-supplied semantic action. Finally,
|
|
the semantic action extracts structural information depending on the data
|
|
passed by the parser and the hierarchical context of the parser it is
|
|
attached to.
|
|
</p>
|
|
<p>
|
|
Parsers come in different flavors. The Spirit framework comes bundled
|
|
with an extensive set of pre-defined parsers that perform various parsing
|
|
tasks from the trivial to the complex. The parser, as a concept, has a
|
|
public conceptual interface contract. Following the contract, anyone can
|
|
write a conforming parser that will play along well with the framework's
|
|
predefined components. We shall provide a blueprint detailing the
|
|
conceptual interface of the parser later.
|
|
</p>
|
|
<p>
|
|
Clients of the framework generally do not need to write their own
|
|
hand-coded parsers at all. Spirit has an immense repertoire of
|
|
pre-defined parsers covering all aspects of syntax and semantic analysis.
|
|
We shall examine this repertoire of parsers in the following sections. In
|
|
the rare case where a specific functionality is not available, it is
|
|
extremely easy to write a user-defined parser. The ease in writing a
|
|
parser entity is the main reason for Spirit's extensibility.
|
|
</p>
|
|
<h2>
|
|
Primitives and Composites
|
|
</h2>
|
|
<p>
|
|
Spirit parsers fall into two categories: <b>primitives</b> and
|
|
<b>composites</b>. These two categories are more or less synonymous to
|
|
terminals and non-terminals in parsing lingo. Primitives are
|
|
non-decomposable atomic units. Composites on the other hand are parsers
|
|
that are composed of other parsers which can in turn be a primitive or
|
|
another composite. To illustrate, consider the Spirit expression:
|
|
</p>
|
|
|
|
<pre><code><font color="#000000"> </font></code><code><span class="identifier">real_p</span> <span class=
|
|
"special">>></span> <span class="special">*(</span><span class=
|
|
"literal">','</span> <span class="special">>></span> <span class=
|
|
"identifier">real_p</span><span class="special">)</span></code>
|
|
</pre>
|
|
<p>
|
|
<tt><tt>real_p</tt></tt> is a primitive parser that can parse real
|
|
numbers. The quoted comma <tt class="quotes">','</tt> in the expression
|
|
is a shortcut and is equivalent to <tt>ch_p<span class=
|
|
"operators">(</span><span class="quotes">','</span><span class=
|
|
"operators">)</span></tt>, which is another primitive parser that
|
|
recognizes single characters.
|
|
</p>
|
|
<p>
|
|
The expression above corresponds to the following parse tree:
|
|
</p>
|
|
<table width="29%" border="0" align="center">
|
|
<tr>
|
|
<td>
|
|
<img src="theme/intro7.png">
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
The expression:
|
|
</p>
|
|
|
|
<pre><code><font color="#000000"> </font></code><span class=
|
|
"literal">','</span> <span class="special">>></span> <span class=
|
|
"identifier">real_p</span>
|
|
</pre>
|
|
<p>
|
|
composes a <b>sequence</b> parser. The <tt>sequence</tt> parser is a
|
|
composite parser comprising two parsers: the one on its left hand side
|
|
(lhs), <tt>ch_p<span class="operators">(</span><span class=
|
|
"quotes">','</span><span class="operators">)</span></tt> ; and the other
|
|
on its right hand side (rhs), <tt>real_p</tt>. This composite parser,
|
|
when called, calls its lhs and rhs in sequence and reports a successful
|
|
match only if both are successful.
|
|
</p>
|
|
<table width="14%" border="0" align="center">
|
|
<tr>
|
|
<td>
|
|
<img src="theme/intro2.png">
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
The <tt>sequence</tt> parser is a binary composite. It is composed of two
|
|
parsers. There are unary composites as well. Unary composites hold only a
|
|
single subject. Like the binary composite, the unary composite may change
|
|
the behavior of its embedded subject. One particular example is the
|
|
<b>Kleene star</b>. The Kleene star, when called to parse, calls its sole
|
|
subject zero or more times. "Zero or more" implies that the Kleene star
|
|
always returns a successful match, possibly matching the null string: "".
|
|
</p>
|
|
<p>
|
|
The expression:
|
|
</p>
|
|
|
|
<pre><code><font color="#000000"> </font></code><code><span class=
|
|
"special">*(</span><span class="literal">','</span> <span class=
|
|
"special">>></span> <span class=
|
|
"identifier">real_p</span><span class="special">)</span></code>
|
|
</pre>
|
|
<p>
|
|
wraps the whole sequence composite above inside a <tt>kleene_star</tt>.
|
|
</p>
|
|
<table width="17%" border="0" align="center">
|
|
<tr>
|
|
<td>
|
|
<img src="theme/intro3.png">
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
Finally, the full expression composes a <tt>real_p</tt> primitive parser
|
|
and the <tt>kleene_star</tt> we have above into another higher level
|
|
<tt>sequence</tt> parser composite.
|
|
</p>
|
|
<table width="34%" border="0" align="center">
|
|
<tr>
|
|
<td>
|
|
<img src="theme/intro4.png">
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
A few simple classes, when composed and structured in a hierarchy, form a
|
|
very powerful object-oriented recursive-descent parsing engine. These
|
|
classes provide the infrastructure needed for the construction of
|
|
more-complex parsers. The final parser composite is a non-deterministic
|
|
recursive-descent parser with infinite look-ahead.
|
|
</p>
|
|
<p>
|
|
Top-down descent traverses the hierarchy. The outer <tt>sequence</tt>
|
|
calls the leftmost <tt>real_p</tt> parser. If successful, the
|
|
<tt>kleene_star</tt> is called next. The <tt>kleene_star</tt> calls the
|
|
inner <tt>sequence</tt> repeatedly in a loop until it fails to match, or
|
|
the input is exhausted. Inside, <tt>ch_p(',')</tt> and then
|
|
<tt>real_p</tt> are called in sequence. The following diagram illustrates
|
|
what is happening, somewhat reminiscent of Pascal syntax diagrams.
|
|
</p>
|
|
<table width="37%" border="0" align="center">
|
|
<tr>
|
|
<td>
|
|
<img src="theme/intro5.png">
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<p>
|
|
The flexibility of object embedding and composition combined with
|
|
recursion opens up a unique approach to parsing. Subclasses are free to
|
|
form aggregates and algorithms of arbitrary complexity. Complex parsers
|
|
can be created with the composition of only a few primitive classes.
|
|
</p>
|
|
<p>
|
|
The framework is designed to be fully open-ended and extensible. New
|
|
primitives or composites, from the trivial to the complex, may be added
|
|
any time. Composition happens (statically) at compile time. This is
|
|
possible through the expressive flexibility of C++ expression templates
|
|
and template meta-programming.
|
|
</p>
|
|
<p>
|
|
The result is a composite composed of primitives and smaller composites.
|
|
This embedding strategy gives us the ability to build hierarchical
|
|
structures that fully model EBNF expressions of arbitrary complexity.
|
|
Later on, we shall see more primitive and composite building blocks.
|
|
</p>
|
|
<h2>
|
|
The Scanner
|
|
</h2>
|
|
<p>
|
|
Like the parser, the scanner is also an abstract concept. The task of the
|
|
scanner is to feed the sequential input data stream to the parser. The
|
|
scanner is composed of two STL conforming forward iterators, first and
|
|
last, where first is held by reference and last, by value. The first
|
|
iterator is held by reference to allow re-positioning by the parser. A
|
|
set of policies governs how the scanner behaves. Parsers extract data
|
|
from the scanner and position the iterator appropriately through its
|
|
member functions.
|
|
</p>
|
|
<p>
|
|
Knowledge of the intricacies of these policies is not required at all in
|
|
most cases. However, knowledge of the scanner's basic API is required to
|
|
write fully-conforming Spirit parsers. The scanner's API will be outlined
|
|
in a separate section. In addition, for the power users and the
|
|
adventurous among us, a full section will be devoted to covering the
|
|
scanner policies. The scanner policies make Spirit very flexible and
|
|
extensible. For instance, some of the policies may be modified to filter
|
|
data. A practical example is a scanner policy that does not distinguish
|
|
upper and lower case whereby making it useful for parsing case
|
|
insensitive input. Another example is a scanner policy that strips white
|
|
spaces from the input.
|
|
</p>
|
|
<h2>
|
|
The Match
|
|
</h2>
|
|
<p>
|
|
The parser has a conceptual parse member function taking in a scanner and
|
|
returning a match object. The primary function of the match object is to
|
|
report parsing success (or failure) back to the parser's caller; i.e., it
|
|
evaluates to true if the parse function is successful, false otherwise.
|
|
If the parse is successful, the match object may also be queried to
|
|
report the number of characters matched (using <tt>match.length()</tt>).
|
|
The length is non-negative if the match is successful, and the typical
|
|
length of a parse failure is -1. A zero length is perfectly valid and
|
|
still represents a successful match.
|
|
</p>
|
|
<p>
|
|
Parsers may have attribute data associated with it. For example, the
|
|
real_p parser has a numeric datum associated with it. This attribute is
|
|
the parsed number. This attribute is passed on to the returned match
|
|
object. The match object may be queried to get this attribute. This datum
|
|
is valid only when the match is successful.
|
|
</p>
|
|
<h2>
|
|
Semantic Actions
|
|
</h2>
|
|
<p>
|
|
A composite parser forms a hierarchy. Parsing proceeds from the topmost
|
|
parent parser which delegates and apportions the parsing task to its
|
|
children recursively to its children's children and so on until a
|
|
primitive is reached. By attaching semantic actions to various points in
|
|
this hierarchy, in effect we can transform the flat linear input stream
|
|
into a structured representation. This is essentially what parsers do.
|
|
</p>
|
|
<p>
|
|
Recall our example above:
|
|
</p>
|
|
|
|
<pre><code><font color="#000000"> </font></code><code><span class=
|
|
"identifier">real_p</span> <span class=
|
|
"special">>></span> <span class="special">*(</span><span class=
|
|
"literal">','</span> <span class="special">>></span> <span class=
|
|
"identifier">real_p</span><span class="special">)</span></code>
|
|
</pre>
|
|
<p>
|
|
By hooking a function (or functor) into the real_p parsers, we can
|
|
extract the numbers from the input:
|
|
</p>
|
|
|
|
<pre><code><font color="#000000"> </font></code><span class=
|
|
"identifier">real_p</span><span class="special">[&</span><span class=
|
|
"identifier">f</span><span class="special">]</span> <span class=
|
|
"special">>></span> <span class="special">*(</span><span class=
|
|
"literal">','</span> <span class="special">>></span> <span class=
|
|
"identifier">real_p</span><span class="special">[&</span><span class=
|
|
"identifier">f</span><span class="special">])</span>
|
|
</pre>
|
|
<table width="41%" border="0" align="center">
|
|
<tr>
|
|
<td>
|
|
<img src="theme/intro6.png">
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p> where <tt>f</tt> is a function that takes in a single argument. The <tt><span class="operators">[&</span>f<span class=
|
|
"operators">]</span></tt> hooks the parser with the function such that when
|
|
<tt>real_p</tt> recognizes a valid number, the function <tt>f</tt> is called.
|
|
It is up to the function then to do what is appropriate. For example, it can
|
|
stuff the numbers in a vector. Or perhaps, if the grammar is changed slightly
|
|
by replacing <tt class="quotes">','</tt> with <tt class="quotes">'+'</tt>, then
|
|
we have a primitive calculator that computes sums. The function <tt>f</tt> then
|
|
can then be made to add all incoming numbers.<br>
|
|
</p>
|
|
<table border="0">
|
|
<tr>
|
|
<td width="10"></td>
|
|
<td width="30">
|
|
<a href="../index.html"><img src="theme/u_arr.gif" border="0"></a>
|
|
</td>
|
|
<td width="30">
|
|
<a href="quick_start.html"><img src="theme/l_arr.gif" border="0"></a>
|
|
</td>
|
|
<td width="30">
|
|
<a href="organization.html"><img src="theme/r_arr.gif" border="0">
|
|
</a>
|
|
</td>
|
|
</tr>
|
|
</table><br>
|
|
<hr size="1">
|
|
<p class="copyright">
|
|
Copyright © 1998-2003 Joel de Guzman<br>
|
|
<br>
|
|
<font size="2">Use, modification and distribution is subject to the Boost
|
|
Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or
|
|
copy at http://www.boost.org/LICENSE_1_0.txt)</font>
|
|
</p>
|
|
<p>
|
|
|
|
</p>
|
|
</body>
|
|
</html>
|