d8a15a2197
[SVN r46980]
[/
|
|
/ Copyright (c) 2008 Eric Niebler
|
|
/
|
|
/ Distributed under the Boost Software License, Version 1.0. (See accompanying
|
|
/ file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
|
|
/]
|
|
|
|
[section Semantic Actions and User-Defined Assertions]
|
|
|
|
[h2 Overview]
|
|
|
|
Imagine you want to parse an input string and build a `std::map<>` from it. For
|
|
something like that, matching a regular expression isn't enough. You want to
|
|
/do something/ when parts of your regular expression match. Xpressive lets
|
|
you attach semantic actions to parts of your static regular expressions. This
|
|
section shows you how.
|
|
|
|
[h2 Semantic Actions]
|
|
|
|
Consider the following code, which uses xpressive's semantic actions to parse
|
|
a string of word/integer pairs and stuff them into a `std::map<>`. It is
|
|
described below.
|
|
|
|
#include <map>
#include <string>
|
|
#include <iostream>
|
|
#include <boost/xpressive/xpressive.hpp>
|
|
#include <boost/xpressive/regex_actions.hpp>
|
|
using namespace boost::xpressive;
|
|
|
|
int main()
|
|
{
|
|
std::map<std::string, int> result;
|
|
std::string str("aaa=>1 bbb=>23 ccc=>456");
|
|
|
|
// Match a word and an integer, separated by =>,
|
|
// and then stuff the result into a std::map<>
|
|
sregex pair = ( (s1= +_w) >> "=>" >> (s2= +_d) )
|
|
[ ref(result)[s1] = as<int>(s2) ];
|
|
|
|
// Match one or more word/integer pairs, separated
|
|
// by whitespace.
|
|
sregex rx = pair >> *(+_s >> pair);
|
|
|
|
if(regex_match(str, rx))
|
|
{
|
|
std::cout << result["aaa"] << '\n';
|
|
std::cout << result["bbb"] << '\n';
|
|
std::cout << result["ccc"] << '\n';
|
|
}
|
|
|
|
return 0;
|
|
}
|
|
|
|
This program prints the following:
|
|
|
|
[pre
|
|
1
|
|
23
|
|
456
|
|
]
|
|
|
|
The regular expression `pair` has two parts: the pattern and the action. The
|
|
pattern says to match a word, capturing it in sub-match 1, and an integer,
|
|
capturing it in sub-match 2, separated by `"=>"`. The action is the part in
|
|
square brackets: `[ ref(result)[s1] = as<int>(s2) ]`. It says to take sub-match
|
|
1 and use it to index into the `result` map, and assign to it the result of
|
|
converting sub-match 2 to an integer.
|
|
|
|
[note To use semantic actions with your static regexes, you must
|
|
`#include <boost/xpressive/regex_actions.hpp>`]
|
|
|
|
How does this work? Just like the rest of the static regular expression, the part
|
|
between brackets is an expression template. It encodes the action and executes
|
|
it later. The expression `ref(result)` creates a lazy reference to the `result`
|
|
object. The larger expression `ref(result)[s1]` is a lazy map index operation.
|
|
Later, when this action is getting executed, `s1` gets replaced with the
|
|
first _sub_match_. Likewise, when `as<int>(s2)` gets executed, `s2` is replaced
|
|
with the second _sub_match_. The `as<>` action converts its argument to the
|
|
requested type using Boost.Lexical_cast. The effect of the whole action is to
|
|
insert a new word/integer pair into the map.
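
To see what the semantic action buys you, it helps to look at the same parse written without one. The sketch below uses the standard library's `std::regex` instead of xpressive, and the names `parse_pairs` and `demo_parse_pairs` are made up for this illustration. The loop body does by hand what the action expresses declaratively.

```cpp
#include <cassert>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper (not part of xpressive): fills a map from
// "word=>integer" pairs by walking the matches by hand.
std::map<std::string, int> parse_pairs(std::string const &str)
{
    std::map<std::string, int> result;
    static std::regex const pair("(\\w+)=>(\\d+)");

    std::sregex_iterator it(str.begin(), str.end(), pair), end;
    for(; it != end; ++it)
    {
        // This statement plays the role of the semantic action
        // [ ref(result)[s1] = as<int>(s2) ].
        result[(*it)[1].str()] = std::stoi((*it)[2].str());
    }
    return result;
}

void demo_parse_pairs()
{
    std::map<std::string, int> result = parse_pairs("aaa=>1 bbb=>23 ccc=>456");
    assert(result.size() == 3);
    assert(result["aaa"] == 1 && result["bbb"] == 23 && result["ccc"] == 456);
}
```

One difference is worth noting: this loop picks up pairs anywhere in the input, whereas the xpressive example calls _regex_match_ and therefore requires the whole string to consist of pairs.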
|
|
|
|
[note There is an important difference between the function `boost::ref()` in
|
|
`<boost/ref.hpp>` and `boost::xpressive::ref()` in
|
|
`<boost/xpressive/regex_actions.hpp>`. The first returns a plain
|
|
`reference_wrapper<>` which behaves in many respects like an ordinary
|
|
reference. By contrast, `boost::xpressive::ref()` returns a /lazy/ reference
|
|
that you can use in expressions that are executed lazily. That is why we can
|
|
say `ref(result)[s1]`, even though `result` doesn't have an `operator[]` that
|
|
would accept `s1`.]
|
|
|
|
In addition to the sub-match placeholders `s1`, `s2`, etc., you can also use
|
|
the placeholder `_` within an action to refer back to the string matched by
|
|
the sub-expression to which the action is attached. For instance, you can use
|
|
the following regex to match a bunch of digits, interpret them as an integer
|
|
and assign the result to a local variable:
|
|
|
|
int i = 0;
|
|
// Here, _ refers back to all the
|
|
// characters matched by (+_d)
|
|
sregex rex = (+_d)[ ref(i) = as<int>(_) ];
|
|
|
|
[h3 Lazy Action Execution]
|
|
|
|
What does it mean, exactly, to attach an action to part of a regular expression
|
|
and perform a match? When does the action execute? If the action is part of a
|
|
repeated sub-expression, does the action execute once or many times? And if the
|
|
sub-expression initially matches, but ultimately fails because the rest of the
|
|
regular expression fails to match, is the action executed at all?
|
|
|
|
The answer is that by default, actions are executed /lazily/. When a sub-expression
|
|
matches a string, its action is placed on a queue, along with the current
|
|
values of any sub-matches to which the action refers. If the match algorithm
|
|
must backtrack, actions are popped off the queue as necessary. Only after the
|
|
entire regex has matched successfully are the actions actually executed. They
|
|
are executed all at once, in the order in which they were added to the queue,
|
|
as the last step before _regex_match_ returns.
|
|
|
|
For example, consider the following regex that increments a counter whenever
|
|
it finds a digit.
|
|
|
|
int i = 0;
|
|
std::string str("1!2!3?");
|
|
// count the exciting digits, but not the
|
|
// questionable ones.
|
|
sregex rex = +( _d [ ++ref(i) ] >> '!' );
|
|
regex_search(str, rex);
|
|
assert( i == 2 );
|
|
|
|
The action `++ref(i)` is queued three times: once for each found digit. But
|
|
it is only /executed/ twice: once for each digit that precedes a `'!'`
|
|
character. When the `'?'` character is encountered, the match algorithm
|
|
backtracks, removing the final action from the queue.
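
The queue-then-execute behavior can be modeled in a few lines of ordinary C++. The following is only a toy, assuming a C++11 compiler: `match_digits` is a hypothetical hand-written matcher for this one pattern, anchored at the start of the input, and it is not how xpressive is implemented. It shows an action being queued on a tentative match, dropped on backtracking, and run only after the overall match has succeeded.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy matcher for the pattern +( _d [action] >> '!' ), anchored at the
// start of the input. It models deferred actions only.
bool match_digits(std::string const &str, std::function<void()> const &action)
{
    std::vector<std::function<void()> > queue;
    std::size_t pos = 0;
    std::size_t groups = 0;

    while(pos < str.size() && std::isdigit(static_cast<unsigned char>(str[pos])))
    {
        queue.push_back(action);      // digit matched: queue the action
        if(pos + 1 < str.size() && str[pos + 1] == '!')
        {
            pos += 2;                 // the whole group matched
            ++groups;
        }
        else
        {
            queue.pop_back();         // group failed: backtrack
            break;
        }
    }

    if(groups == 0)
        return false;                 // no match: nothing is executed

    for(std::size_t n = 0; n < queue.size(); ++n)
        queue[n]();                   // match succeeded: run in order
    return true;
}

void demo_match_digits()
{
    int i = 0;
    bool matched = match_digits("1!2!3?", [&i]() { ++i; });
    assert(matched && i == 2);
}
```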
|
|
|
|
[h3 Immediate Action Execution]
|
|
|
|
When you want semantic actions to execute immediately, you can wrap the
|
|
sub-expression containing the action in a [^[funcref boost::xpressive::keep keep()]].
|
|
`keep()` turns off backtracking for its sub-expression, but it also causes
|
|
any actions queued by the sub-expression to execute at the end of the `keep()`.
|
|
It is as if the sub-expression in the `keep()` were compiled into an
|
|
independent regex object, and matching the `keep()` is like a separate invocation
|
|
of `regex_search()`. It matches characters and executes actions but never backtracks
|
|
or unwinds. For example, imagine the above example had been written as follows:
|
|
|
|
int i = 0;
|
|
std::string str("1!2!3?");
|
|
// count all the digits.
|
|
sregex rex = +( keep( _d [ ++ref(i) ] ) >> '!' );
|
|
regex_search(str, rex);
|
|
assert( i == 3 );
|
|
|
|
We have wrapped the sub-expression `_d [ ++ref(i) ]` in `keep()`. Now, whenever
|
|
this regex matches a digit, the action will be queued and then immediately
|
|
executed before we try to match a `'!'` character. In this case, the action
|
|
executes three times.
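
The difference between the two counts can be reproduced with a small stand-alone model. This is a sketch under stated assumptions, not xpressive code: `count_digits` is a hypothetical function that walks the input by hand, and its `immediate` flag stands in for wrapping the digit and its action in `keep()`.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Toy model of +( _d [ ++i ] >> '!' ). When `immediate` is true the
// increment runs as soon as a digit matches, as it would inside keep();
// otherwise it is deferred until the group has matched.
int count_digits(std::string const &str, bool immediate)
{
    int i = 0;
    int pending = 0;  // stands in for the queue of deferred actions
    std::size_t pos = 0;

    while(pos < str.size() && std::isdigit(static_cast<unsigned char>(str[pos])))
    {
        if(immediate)
            ++i;                       // executed right away, never undone
        else
            ++pending;                 // queued for later

        if(pos + 1 >= str.size() || str[pos + 1] != '!')
        {
            if(!immediate)
                --pending;             // backtrack: drop the queued action
            break;
        }
        pos += 2;
    }
    return i + pending;                // deferred actions run at the end
}

void demo_count_digits()
{
    assert(count_digits("1!2!3?", false) == 2); // lazy execution
    assert(count_digits("1!2!3?", true) == 3);  // keep()-style execution
}
```

The immediate variant counts the final digit even though its group never completes, which is exactly the side effect to keep in mind when reaching for `keep()`.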
|
|
|
|
[note Like `keep()`, actions within [^[funcref boost::xpressive::before before()]]
|
|
and [^[funcref boost::xpressive::after after()]] are also executed early when their
|
|
sub-expressions have matched.]
|
|
|
|
[h3 Lazy Functions]
|
|
|
|
So far, we've seen how to write semantic actions consisting of variables and
|
|
operators. But what if you want to be able to call a function from a semantic
|
|
action? Xpressive provides a mechanism to do this.
|
|
|
|
The first step is to define a function object type. Here, for instance, is a
|
|
function object type that calls `push()` on its argument:
|
|
|
|
struct push_impl
|
|
{
|
|
// Result type, needed for tr1::result_of
|
|
typedef void result_type;
|
|
|
|
template<typename Sequence, typename Value>
|
|
void operator()(Sequence &seq, Value const &val) const
|
|
{
|
|
seq.push(val);
|
|
}
|
|
};
|
|
|
|
The next step is to use xpressive's `function<>` template to define a function
|
|
object named `push`:
|
|
|
|
// Global "push" function object.
|
|
function<push_impl>::type const push = {{}};
|
|
|
|
The initialization looks a bit odd, but this is because `push` is being
|
|
statically initialized. That means it doesn't need to be constructed
|
|
at runtime. We can use `push` in semantic actions as follows:
|
|
|
|
std::stack<int> ints;
|
|
// Match digits, cast them to an int
|
|
// and push it on the stack.
|
|
sregex rex = (+_d)[push(ref(ints), as<int>(_))];
|
|
|
|
You'll notice that doing it this way causes member function invocations
|
|
to look like ordinary function invocations. You can choose to write your
|
|
semantic action in a different way that makes it look a bit more like
|
|
a member function call:
|
|
|
|
sregex rex = (+_d)[ref(ints)->*push(as<int>(_))];
|
|
|
|
Xpressive recognizes the use of the `->*` and treats this expression
|
|
exactly the same as the one above.
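
If the idea of a lazy function is new to you, a rough analogy from the standard library may help. In the sketch below, which assumes a C++11 compiler, `std::bind` captures a call to a `push_impl` object without running it, and the call happens only when the bound object is invoked. This is an analogy only: xpressive's `function<>` builds an expression template rather than a `std::bind` object, and `demo_deferred_push` is a name invented for this example.

```cpp
#include <cassert>
#include <functional>
#include <stack>

// Same shape as the push_impl above, repeated so that the sketch is
// self-contained.
struct push_impl
{
    typedef void result_type;

    template<typename Sequence, typename Value>
    void operator()(Sequence &seq, Value const &val) const
    {
        seq.push(val);
    }
};

void demo_deferred_push()
{
    std::stack<int> ints;

    // Building the call does not run it ...
    std::function<void()> deferred = std::bind(push_impl(), std::ref(ints), 42);
    assert(ints.empty());

    // ... invoking it later does.
    deferred();
    assert(ints.size() == 1 && ints.top() == 42);
}
```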
|
|
|
|
When your function object must return a type that depends on its
|
|
arguments, you can use a `result<>` member template instead of the
|
|
`result_type` typedef. Here, for example, is a `first` function object
|
|
that returns the `first` member of a `std::pair<>` or _sub_match_:
|
|
|
|
// Function object that returns the
|
|
// first element of a pair.
|
|
struct first_impl
|
|
{
|
|
template<typename Sig> struct result {};
|
|
|
|
template<typename This, typename Pair>
|
|
struct result<This(Pair)>
|
|
{
|
|
typedef typename remove_reference<Pair>
|
|
::type::first_type type;
|
|
};
|
|
|
|
template<typename Pair>
|
|
typename Pair::first_type
|
|
operator()(Pair const &p) const
|
|
{
|
|
return p.first;
|
|
}
|
|
};
|
|
|
|
// OK, use as first(s1) to get the begin iterator
|
|
// of the sub-match referred to by s1.
|
|
function<first_impl>::type const first = {{}};
|
|
|
|
[h3 Referring to Local Variables]
|
|
|
|
As we've seen in the examples above, we can refer to local variables within
|
|
an action using `xpressive::ref()`. Any such variables are held by reference
|
|
by the regular expression, and care should be taken to avoid letting those
|
|
references dangle. For instance, in the following code, the reference to `i`
|
|
is left to dangle when `bad_voodoo()` returns:
|
|
|
|
sregex bad_voodoo()
|
|
{
|
|
int i = 0;
|
|
sregex rex = +( _d [ ++ref(i) ] >> '!' );
|
|
// ERROR! rex refers by reference to a local
|
|
// variable, which will dangle after bad_voodoo()
|
|
// returns.
|
|
return rex;
|
|
}
|
|
|
|
When writing semantic actions, it is your responsibility to make sure that
|
|
none of the references dangle. One way to do that would be to make the
|
|
variables shared pointers that are held by the regex by value.
|
|
|
|
sregex good_voodoo(boost::shared_ptr<int> pi)
|
|
{
|
|
// Use val() to hold the shared_ptr by value:
|
|
sregex rex = +( _d [ ++*val(pi) ] >> '!' );
|
|
// OK, rex holds a reference count to the integer.
|
|
return rex;
|
|
}
|
|
|
|
In the above code, we use `xpressive::val()` to hold the shared pointer by
|
|
value. That's not normally necessary because local variables appearing in
|
|
actions are held by value by default, but in this case, it is necessary. Had
|
|
we written the action as `++*pi`, it would have executed immediately. That's
|
|
because `++*pi` is not an expression template, but `++*val(pi)` is.
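
The ownership rule at work here is the same one that governs any deferred call that copies a shared pointer. As a minimal sketch, assuming a C++11 compiler, the hypothetical `make_counter_action` below returns a closure that stands in for the regex: because the closure holds its own copy of the pointer, the integer stays alive for as long as the closure does.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for a regex that owns a deferred action.
std::function<void()> make_counter_action(std::shared_ptr<int> pi)
{
    // The closure copies the shared_ptr, so it shares ownership of the
    // int, much as val(pi) stores a copy inside the regex.
    return [pi]() { ++*pi; };
}

void demo_counter_action()
{
    std::shared_ptr<int> count = std::make_shared<int>(0);
    std::function<void()> action = make_counter_action(count);

    // One owner here, one inside the action.
    assert(count.use_count() == 2);

    action();
    action();
    assert(*count == 2);
}
```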
|
|
|
|
It can be tedious to wrap all your variables in `ref()` and `val()` in your
|
|
semantic actions. Xpressive provides the `reference<>` and `value<>` templates
|
|
to make things easier. The following table shows the equivalencies:
|
|
|
|
[table reference<> and value<>
|
|
[[This ...][... is equivalent to this ...]]
|
|
[[``int i = 0;
|
|
|
|
sregex rex = +( _d [ ++ref(i) ] >> '!' );``][``int i = 0;
|
|
reference<int> ri(i);
|
|
sregex rex = +( _d [ ++ri ] >> '!' );``]]
|
|
[[``boost::shared_ptr<int> pi(new int(0));
|
|
|
|
sregex rex = +( _d [ ++*val(pi) ] >> '!' );``][``boost::shared_ptr<int> pi(new int(0));
|
|
value<boost::shared_ptr<int> > vpi(pi);
|
|
sregex rex = +( _d [ ++*vpi ] >> '!' );``]]
|
|
]
|
|
|
|
As you can see, when using `reference<>`, you need to first declare a local
|
|
variable and then declare a `reference<>` to it. These two steps can be combined
|
|
into one using `local<>`.
|
|
|
|
[table local<> vs. reference<>
|
|
[[This ...][... is equivalent to this ...]]
|
|
[[``local<int> i(0);
|
|
|
|
sregex rex = +( _d [ ++i ] >> '!' );``][``int i = 0;
|
|
reference<int> ri(i);
|
|
sregex rex = +( _d [ ++ri ] >> '!' );``]]
|
|
]
|
|
|
|
We can use `local<>` to rewrite the above example as follows:
|
|
|
|
local<int> i(0);
|
|
std::string str("1!2!3?");
|
|
// count the exciting digits, but not the
|
|
// questionable ones.
|
|
sregex rex = +( _d [ ++i ] >> '!' );
|
|
regex_search(str, rex);
|
|
assert( i.get() == 2 );
|
|
|
|
Notice that we use `local<>::get()` to access the value of the local
|
|
variable. Also, beware that `local<>` can be used to create a dangling
|
|
reference, just as `reference<>` can.
|
|
|
|
[h3 Referring to Non-Local Variables]
|
|
|
|
At the beginning of this
|
|
section, we used a regex with a semantic action to parse a string of
|
|
word/integer pairs and stuff them into a `std::map<>`. That required that
|
|
the map and the regex be defined together and used before either could
|
|
go out of scope. What if we wanted to define the regex once and use it
|
|
to fill lots of different maps? We would rather pass the map into the
|
|
_regex_match_ algorithm than embed a reference to it directly in
|
|
the regex object. What we can do instead is define a placeholder and use
|
|
that in the semantic action instead of the map itself. Later, when we
|
|
call one of the regex algorithms, we can bind the reference to an actual
|
|
map object. The following code shows how.
|
|
|
|
// Define a placeholder for a map object:
|
|
placeholder<std::map<std::string, int> > _map;
|
|
|
|
// Match a word and an integer, separated by =>,
|
|
// and then stuff the result into a std::map<>
|
|
sregex pair = ( (s1= +_w) >> "=>" >> (s2= +_d) )
|
|
[ _map[s1] = as<int>(s2) ];
|
|
|
|
// Match one or more word/integer pairs, separated
|
|
// by whitespace.
|
|
sregex rx = pair >> *(+_s >> pair);
|
|
|
|
// The string to parse
|
|
std::string str("aaa=>1 bbb=>23 ccc=>456");
|
|
|
|
// Here is the actual map to fill in:
|
|
std::map<std::string, int> result;
|
|
|
|
// Bind the _map placeholder to the actual map
|
|
smatch what;
|
|
what.let( _map = result );
|
|
|
|
// Execute the match and fill in result map
|
|
if(regex_match(str, what, rx))
|
|
{
|
|
std::cout << result["aaa"] << '\n';
|
|
std::cout << result["bbb"] << '\n';
|
|
std::cout << result["ccc"] << '\n';
|
|
}
|
|
|
|
This program displays:
|
|
|
|
[pre
|
|
1
|
|
23
|
|
456
|
|
]
|
|
|
|
We use `placeholder<>` here to define `_map`, which stands in for a
|
|
`std::map<>` variable. We can use the placeholder in the semantic action as if
|
|
it were a map. Then, we define a _match_results_ object and bind an actual map
|
|
to the placeholder with "`what.let( _map = result );`". The _regex_match_ call
|
|
behaves as if the placeholder in the semantic action had been replaced with a
|
|
reference to `result`.
|
|
|
|
[note Placeholders in semantic actions are not /actually/ replaced at runtime
|
|
with references to variables. The regex object is never mutated in any way
|
|
during any of the regex algorithms, so they are safe to use in multiple
|
|
threads.]
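
One way to picture late binding is as an action whose target is a parameter rather than a captured variable. The sketch below is a plain C++11 model, not xpressive's mechanism, and the names `map_action`, `make_store_action` and `demo_late_binding` are invented for it. The action is built once and never modified; the map it writes to is supplied each time it runs.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

typedef std::map<std::string, int> int_map;

// The action names no particular map. Its first parameter plays the
// role of the _map placeholder and is supplied only when it is run.
typedef std::function<void(int_map &, std::string const &, int)> map_action;

map_action make_store_action()
{
    return [](int_map &target, std::string const &key, int value)
    {
        target[key] = value;
    };
}

void demo_late_binding()
{
    map_action const store = make_store_action(); // built once, never modified

    int_map first, second;
    store(first, "aaa", 1);    // "bind" the placeholder to first
    store(second, "bbb", 23);  // same action, different map

    assert(first.size() == 1 && first["aaa"] == 1);
    assert(second.size() == 1 && second["bbb"] == 23);
}
```

Because the action itself holds no state about the target, sharing it between callers is harmless, which mirrors the reason a regex with placeholders can be shared.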
|
|
|
|
The syntax for late-bound action arguments is a little different if you are
|
|
using _regex_iterator_ or _regex_token_iterator_. The regex iterators accept
|
|
an extra constructor parameter for specifying the argument bindings. There is
|
|
a `let()` function that you can use to bind variables to their placeholders.
|
|
The following code demonstrates how.
|
|
|
|
// Define a placeholder for a map object:
|
|
placeholder<std::map<std::string, int> > _map;
|
|
|
|
// Match a word and an integer, separated by =>,
|
|
// and then stuff the result into a std::map<>
|
|
sregex pair = ( (s1= +_w) >> "=>" >> (s2= +_d) )
|
|
[ _map[s1] = as<int>(s2) ];
|
|
|
|
// The string to parse
|
|
std::string str("aaa=>1 bbb=>23 ccc=>456");
|
|
|
|
// Here is the actual map to fill in:
|
|
std::map<std::string, int> result;
|
|
|
|
// Create a regex_iterator to find all the matches
|
|
sregex_iterator it(str.begin(), str.end(), pair, let(_map=result));
|
|
sregex_iterator end;
|
|
|
|
// step through all the matches, and fill in
|
|
// the result map
|
|
while(it != end)
|
|
++it;
|
|
|
|
std::cout << result["aaa"] << '\n';
|
|
std::cout << result["bbb"] << '\n';
|
|
std::cout << result["ccc"] << '\n';
|
|
|
|
This program displays:
|
|
|
|
[pre
|
|
1
|
|
23
|
|
456
|
|
]
|
|
|
|
[h2 User-Defined Assertions]
|
|
|
|
You are probably already familiar with regular expression /assertions/. In
|
|
Perl, some examples are the [^^] and [^$] assertions, which you can use to
|
|
match the beginning and end of a string, respectively. Xpressive lets you
|
|
define your own assertions. A custom assertion is a condition which must be
|
|
true at a point in the match in order for the match to succeed. You can check
|
|
a custom assertion with xpressive's _check_ function.
|
|
|
|
There are a couple of ways to define a custom assertion. The simplest is to
|
|
use a function object. Let's say that you want to ensure that a sub-expression
|
|
matches a sub-string that is either 3 or 6 characters long. The following
|
|
struct defines such a predicate:
|
|
|
|
// A predicate that is true IFF a sub-match is
|
|
// either 3 or 6 characters long.
|
|
struct three_or_six
|
|
{
|
|
bool operator()(ssub_match const &sub) const
|
|
{
|
|
return sub.length() == 3 || sub.length() == 6;
|
|
}
|
|
};
|
|
|
|
You can use this predicate within a regular expression as follows:
|
|
|
|
// match words of 3 characters or 6 characters.
|
|
sregex rx = (bow >> +_w >> eow)[ check(three_or_six()) ] ;
|
|
|
|
The above regular expression will find whole words that are either 3 or 6
|
|
characters long. The `three_or_six` predicate accepts a _sub_match_ that refers
|
|
back to the part of the string matched by the sub-expression to which the
|
|
custom assertion is attached.
|
|
|
|
[note The custom assertion participates in determining whether the match
|
|
succeeds or fails. Unlike actions, which execute lazily, custom assertions
|
|
execute immediately while the regex engine is searching for a match.]
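
For comparison, here is how the same filtering might look with the standard library's `std::regex`, which has no custom assertions. In this sketch the hypothetical helper `three_or_six_words` finds every whole word and applies the length test afterwards. For this simple pattern the result is the same, but in general it is not equivalent: a check that runs after matching cannot make the engine backtrack and try another alternative, which is what an xpressive assertion can do.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects whole words whose length is 3 or 6.
std::vector<std::string> three_or_six_words(std::string const &str)
{
    std::vector<std::string> words;
    static std::regex const word("\\b\\w+\\b");

    std::sregex_iterator it(str.begin(), str.end(), word), end;
    for(; it != end; ++it)
    {
        std::ssub_match const &sub = (*it)[0];
        // Same test as the three_or_six predicate, applied after the fact.
        if(sub.length() == 3 || sub.length() == 6)
            words.push_back(sub.str());
    }
    return words;
}

void demo_three_or_six_words()
{
    std::vector<std::string> words =
        three_or_six_words("one three sixsix abcdefg");
    assert(words.size() == 2);
    assert(words[0] == "one" && words[1] == "sixsix");
}
```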
|
|
|
|
Custom assertions can also be defined inline using the same syntax as for
|
|
semantic actions. Below is the same custom assertion written inline:
|
|
|
|
// match words of 3 characters or 6 characters.
|
|
sregex rx = (bow >> +_w >> eow)[ check(length(_)==3 || length(_)==6) ] ;
|
|
|
|
In the above, `length()` is a lazy function that calls the `length()` member
|
|
function of its argument, and `_` is a placeholder that receives the
|
|
`sub_match`.
|
|
|
|
Once you get the hang of writing custom assertions inline, they can be
|
|
very powerful. For example, you can write a regular expression that
|
|
only matches valid dates (for some suitably liberal definition of the
|
|
term ["valid]).
|
|
|
|
int const days_per_month[] =
|
|
{31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
|
|
|
|
mark_tag month(1), day(2);
|
|
// find a valid date of the form month/day/year.
|
|
sregex date =
|
|
(
|
|
// Month must be between 1 and 12 inclusive
|
|
(month= _d >> !_d) [ check(as<int>(_) >= 1
|
|
&& as<int>(_) <= 12) ]
|
|
>> '/'
|
|
// Day must be between 1 and 31 inclusive
|
|
>> (day= _d >> !_d) [ check(as<int>(_) >= 1
|
|
&& as<int>(_) <= 31) ]
|
|
>> '/'
|
|
// Only consider years between 1970 and 2038
|
|
>> (_d >> _d >> _d >> _d) [ check(as<int>(_) >= 1970
|
|
&& as<int>(_) <= 2038) ]
|
|
)
|
|
// Ensure the month actually has that many days!
|
|
[ check( ref(days_per_month)[as<int>(month)-1] >= as<int>(day) ) ]
|
|
;
|
|
|
|
smatch what;
|
|
std::string str("99/99/9999 2/30/2006 2/28/2006");
|
|
|
|
if(regex_search(str, what, date))
|
|
{
|
|
std::cout << what[0] << std::endl;
|
|
}
|
|
|
|
The above program prints out the following:
|
|
|
|
[pre
|
|
2/28/2006
|
|
]
|
|
|
|
Notice how the inline custom assertions are used to range-check the values for
|
|
the month, day and year. The regular expression doesn't match `"99/99/9999"` or
|
|
`"2/30/2006"` because they are not valid dates. (There is no 99th month, and
|
|
February doesn't have 30 days.)
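
It can help to read the four checks as an ordinary function. The sketch below is a plain C++ restatement of the conditions, with `plausible_date` and `demo_plausible_date` as made-up names. Like the regex, it is deliberately liberal and accepts February 29 in every year.

```cpp
#include <cassert>

// Plain-function version of the checks the date regex performs.
bool plausible_date(int month, int day, int year)
{
    // Days per month, January first. February is given 29 days and
    // November 30.
    static int const days_per_month[] =
        {31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};

    if(month < 1 || month > 12) return false;    // month range check
    if(day < 1 || day > 31) return false;        // day range check
    if(year < 1970 || year > 2038) return false; // year range check

    // Ensure the month actually has that many days.
    return days_per_month[month - 1] >= day;
}

void demo_plausible_date()
{
    assert(!plausible_date(99, 99, 9999));
    assert(!plausible_date(2, 30, 2006));
    assert(plausible_date(2, 28, 2006));
}
```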
|
|
|
|
[endsect]
|