83a792d7ed
[SVN r67619]
114 lines
3.9 KiB
Plaintext
114 lines
3.9 KiB
Plaintext
[/==============================================================================
|
|
Copyright (C) 2001-2011 Joel de Guzman
|
|
Copyright (C) 2001-2011 Hartmut Kaiser
|
|
|
|
Distributed under the Boost Software License, Version 1.0. (See accompanying
|
|
file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
|
|
===============================================================================/]
|
|
[section Parsing Expression Grammar]
|
|
|
|
Parsing Expression Grammars (PEG) [footnote Bryan Ford: Parsing Expression
|
|
Grammars: A Recognition-Based Syntactic Foundation,
|
|
[@http://pdos.csail.mit.edu/~baford/packrat/popl04/]] are a derivative of
|
|
Extended Backus-Naur Form (EBNF) [footnote Richard E. Pattis: EBNF: A Notation
|
|
to Describe Syntax, [@http://www.cs.cmu.edu/~pattis/misc/ebnf.pdf]]
|
|
with a different interpretation, designed to represent a recursive descent
|
|
parser. A PEG can be directly represented as a recursive-descent parser.
|
|
|
|
Like EBNF, PEG is a formal grammar for describing a formal language in
|
|
terms of a set of rules used to recognize strings of this language.
|
|
Unlike EBNF, PEGs have an exact interpretation. There is only one valid
|
|
parse tree (see __ast__) for each PEG grammar.
|
|
|
|
[heading Sequences]
|
|
|
|
Sequences are represented by juxtaposition like in EBNF:
|
|
|
|
a b
|
|
|
|
The PEG expression above states that, in order for this to succeed,
|
|
`b` must follow `a`. Here's the syntax diagram:
|
|
|
|
[:__sd_sequence__]
|
|
|
|
Here's a trivial example:
|
|
|
|
'x' digit
|
|
|
|
which means the character `x` must be followed by a digit.
|
|
|
|
[note In __qi__, we use the `>>` for sequences since C++ does not
|
|
allow juxtaposition, while in __karma__ we use the `<<` instead.]
|
|
|
|
[heading Alternatives]
|
|
|
|
Alternatives are represented in PEG using the slash:
|
|
|
|
a / b
|
|
|
|
[note In __qi__ and __karma__, we use the `|` for alternatives just as in EBNF.]
|
|
|
|
Alternatives allow for choices. The expression above reads: try to match
|
|
`a`. If `a` succeeds, success, if not try to match `b`. This is a bit of
|
|
a deviation from the usual EBNF interpretation where you simply match
|
|
`a` *or* `b`. Here's the syntax diagram:
|
|
|
|
[:__sd_choice__]
|
|
|
|
PEGs allow for ambiguity in the alternatives. In the expression above,
|
|
both `a` or `b` can both match an input string. However, only the first
|
|
matching alternative is valid. As noted, there can only be one valid
|
|
parse tree. [/FIXME: $$$ explain more about this $$$]
|
|
|
|
[heading Loops]
|
|
|
|
Again, like EBNF, PEG uses the regular-expression Kleene star and the
|
|
plus loops:
|
|
|
|
a*
|
|
a+
|
|
|
|
[note __qi__ and __karma__ use the prefix star and plus since there is no
|
|
postfix star or plus in C++.]
|
|
|
|
Here are the syntax diagrams:
|
|
|
|
[:__sd_kleene__]
|
|
[:__sd_plus__]
|
|
|
|
The first, called the Kleene star, matches zero or more of its subject
|
|
`a`. The second, plus, matches one ore more of its subject `a`.
|
|
|
|
Unlike EBNF, PEGs have greedy loops. It will match as much as it can
|
|
until its subject fails to match without regard to what follows. The
|
|
following is a classic example of a fairly common EBNF/regex expression
|
|
failing to match in PEG:
|
|
|
|
alnum* digit
|
|
|
|
In PEG, alnum will eat as much alpha-numeric characters as it can
|
|
leaving nothing more left behind. Thus, the trailing digit will get
|
|
nothing. Loops are simply implemented in recursive descent code as
|
|
for/while loops making them extremely efficient. That is a definite
|
|
advantage. On the other hand, those who are familiar with EBNF and regex
|
|
behavior might find the behavior a major gotcha. PEG provides a couple
|
|
of other mechanisms to circumvent this. We will see more of these other
|
|
mechanisms shortly.
|
|
|
|
[heading Difference]
|
|
|
|
In some cases, you may want to restrict a certain expression. You can
|
|
think of a PEG expression as a match for a potentially infinite set of
|
|
strings. The difference operator allows you to restrict this set:
|
|
|
|
a - b
|
|
|
|
The expression reads: match `a` but not `b`.
|
|
|
|
[note There is no difference operator in __karma__, as the concept does not
|
|
make sense in the context of output generation.]
|
|
|
|
[endsect]
|
|
|
|
|