xpressive/doc/actions.qbk
2008-07-02 04:20:43 +00:00

[/
/ Copyright (c) 2008 Eric Niebler
/
/ Distributed under the Boost Software License, Version 1.0. (See accompanying
/ file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
/]
[section Semantic Actions and User-Defined Assertions]
[h2 Overview]
Imagine you want to parse an input string and build a `std::map<>` from it. For
something like that, matching a regular expression isn't enough. You want to
/do something/ when parts of your regular expression match. Xpressive lets
you attach semantic actions to parts of your static regular expressions. This
section shows you how.
[h2 Semantic Actions]
Consider the following code, which uses xpressive's semantic actions to parse
a string of word/integer pairs and stuffs them into a `std::map<>`. It is
described below.

    #include <map>
    #include <string>
    #include <iostream>
    #include <boost/xpressive/xpressive.hpp>
    #include <boost/xpressive/regex_actions.hpp>
    using namespace boost::xpressive;

    int main()
    {
        std::map<std::string, int> result;
        std::string str("aaa=>1 bbb=>23 ccc=>456");

        // Match a word and an integer, separated by =>,
        // and then stuff the result into a std::map<>
        sregex pair = ( (s1= +_w) >> "=>" >> (s2= +_d) )
            [ ref(result)[s1] = as<int>(s2) ];

        // Match one or more word/integer pairs, separated
        // by whitespace.
        sregex rx = pair >> *(+_s >> pair);

        if(regex_match(str, rx))
        {
            std::cout << result["aaa"] << '\n';
            std::cout << result["bbb"] << '\n';
            std::cout << result["ccc"] << '\n';
        }

        return 0;
    }

This program prints the following:
[pre
1
23
456
]
The regular expression `pair` has two parts: the pattern and the action. The
pattern says to match a word, capturing it in sub-match 1, and an integer,
capturing it in sub-match 2, separated by `"=>"`. The action is the part in
square brackets: `[ ref(result)[s1] = as<int>(s2) ]`. It says to take sub-match
1 and use it to index into the `result` map, and assign to it the result of
converting sub-match 2 to an integer.
[note To use semantic actions with your static regexes, you must
`#include <boost/xpressive/regex_actions.hpp>`]
How does this work? Just like the rest of the static regular expression, the part
between brackets is an expression template. It encodes the action and executes
it later. The expression `ref(result)` creates a lazy reference to the `result`
object. The larger expression `ref(result)[s1]` is a lazy map index operation.
Later, when this action is getting executed, `s1` gets replaced with the
first _sub_match_. Likewise, when `as<int>(s2)` gets executed, `s2` is replaced
with the second _sub_match_. The `as<>` action converts its argument to the
requested type using Boost.Lexical_cast. The effect of the whole action is to
insert a new word/integer pair into the map.
[note There is an important difference between the function `boost::ref()` in
`<boost/ref.hpp>` and `boost::xpressive::ref()` in
`<boost/xpressive/regex_actions.hpp>`. The first returns a plain
`reference_wrapper<>` which behaves in many respects like an ordinary
reference. By contrast, `boost::xpressive::ref()` returns a /lazy/ reference
that you can use in expressions that are executed lazily. That is why we can
say `ref(result)[s1]`, even though `result` doesn't have an `operator[]` that
would accept `s1`.]
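To make the idea of a lazy reference more concrete, here is a minimal sketch in plain standard C++. It is not how xpressive implements `ref()`. The names `lazy_ref` and `make_lazy_ref` are hypothetical, and `std::function` stands in for xpressive's expression templates. The only point it illustrates is that applying an operator to a lazy reference builds an action object instead of touching the variable.

```cpp
#include <functional>

// Hypothetical stand-in for a lazy reference; not xpressive's real type.
template<typename T>
class lazy_ref
{
public:
    explicit lazy_ref(T &target) : target_(&target) {}

    // Applying ++ does not touch the target. It returns a deferred
    // action that performs the increment when it is finally invoked.
    std::function<void()> operator++() const
    {
        T *p = target_;
        return [p]() { ++*p; };
    }

    // Assignment from a value is deferred in the same way.
    std::function<void()> operator=(T const &value) const
    {
        T *p = target_;
        return [p, value]() { *p = value; };
    }

private:
    T *target_;
};

// Hypothetical helper that mirrors the spelling ref(x) used in the text.
template<typename T>
lazy_ref<T> make_lazy_ref(T &target)
{
    return lazy_ref<T>(target);
}
```

With this sketch, `++make_lazy_ref(i)` leaves `i` unchanged until the returned action is called, which is the behavior the note above attributes to `boost::xpressive::ref()`.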
In addition to the sub-match placeholders `s1`, `s2`, etc., you can also use
the placeholder `_` within an action to refer back to the string matched by
the sub-expression to which the action is attached. For instance, you can use
the following regex to match a bunch of digits, interpret them as an integer
and assign the result to a local variable:

    int i = 0;
    // Here, _ refers back to all the
    // characters matched by (+_d)
    sregex rex = (+_d)[ ref(i) = as<int>(_) ];

[h3 Lazy Action Execution]
What does it mean, exactly, to attach an action to part of a regular expression
and perform a match? When does the action execute? If the action is part of a
repeated sub-expression, does the action execute once or many times? And if the
sub-expression initially matches, but ultimately fails because the rest of the
regular expression fails to match, is the action executed at all?
The answer is that by default, actions are executed /lazily/. When a sub-expression
matches a string, its action is placed on a queue, along with the current
values of any sub-matches to which the action refers. If the match algorithm
must backtrack, actions are popped off the queue as necessary. Only after the
entire regex has matched successfully are the actions actually executed. They
are executed all at once, in the order in which they were added to the queue,
as the last step before _regex_match_ returns.
For example, consider the following regex that increments a counter whenever
it finds a digit.

    int i = 0;
    std::string str("1!2!3?");
    // count the exciting digits, but not the
    // questionable ones.
    sregex rex = +( _d [ ++ref(i) ] >> '!' );
    regex_search(str, rex);
    assert( i == 2 );

The action `++ref(i)` is queued three times: once for each found digit. But
it is only /executed/ twice: once for each digit that precedes a `'!'`
character. When the `'?'` character is encountered, the match algorithm
backtracks, removing the final action from the queue.
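The queue-and-backtrack behavior described above can be modeled with a small hand-written matcher. This is only a sketch in plain standard C++: the function name `match_exciting_digits` is hypothetical, the pattern is hard-coded, and xpressive's real matcher is far more general. It shows actions being queued on a partial match, dropped on backtracking, and run only once the overall match has succeeded.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hand-written stand-in for the pattern +( _d [ action ] >> '!' ),
// anchored at the start of the input. Illustrative only.
// Returns true if at least one "digit followed by '!'" group matched.
bool match_exciting_digits(std::string const &input,
                           std::function<void()> const &action)
{
    std::vector<std::function<void()> > queue;  // pending actions
    std::size_t pos = 0;
    std::size_t groups = 0;

    while(pos < input.size() &&
          std::isdigit(static_cast<unsigned char>(input[pos])))
    {
        // The digit matched, so queue the action (do not run it yet).
        queue.push_back(action);

        if(pos + 1 < input.size() && input[pos + 1] == '!')
        {
            pos += 2;   // the whole group matched; try another one
            ++groups;
        }
        else
        {
            // The group failed after the digit: backtrack and drop
            // the action that was queued for it.
            queue.pop_back();
            break;
        }
    }

    if(groups == 0)
        return false;   // no match, so no action ever runs

    // The overall match succeeded: run the surviving actions in order.
    for(std::size_t k = 0; k < queue.size(); ++k)
        queue[k]();
    return true;
}
```

Called on `"1!2!3?"` with an action that increments a counter, this sketch queues three actions and runs two of them, matching the count in the example above.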
[h3 Immediate Action Execution]
When you want semantic actions to execute immediately, you can wrap the
sub-expression containing the action in a [^[funcref boost::xpressive::keep keep()]].
`keep()` turns off back-tracking for its sub-expression, but it also causes
any actions queued by the sub-expression to execute at the end of the `keep()`.
It is as if the sub-expression in the `keep()` were compiled into an
independent regex object, and matching the `keep()` is like a separate invocation
of `regex_search()`. It matches characters and executes actions but never backtracks
or unwinds. For example, imagine the above example had been written as follows:

    int i = 0;
    std::string str("1!2!3?");
    // count all the digits.
    sregex rex = +( keep( _d [ ++ref(i) ] ) >> '!' );
    regex_search(str, rex);
    assert( i == 3 );

We have wrapped the sub-expression `_d [ ++ref(i) ]` in `keep()`. Now, whenever
this regex matches a digit, the action will be queued and then immediately
executed before we try to match a `'!'` character. In this case, the action
executes three times.
[note As with `keep()`, actions within [^[funcref boost::xpressive::before before()]]
and [^[funcref boost::xpressive::after after()]] are executed early, as soon as their
sub-expressions have matched.]
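One way to picture what happens at the end of a `keep()` is as a commit point on the action queue. The class below is a hypothetical model in plain standard C++, not an xpressive type. The names `action_queue`, `mark`, `rollback` and `commit` are all assumptions made for the sketch. Actions that have been committed have already run, so later backtracking cannot remove them.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical queue of pending actions; not an xpressive type.
class action_queue
{
public:
    // Queue an action for later execution.
    void push(std::function<void()> const &action)
    {
        pending_.push_back(action);
    }

    // Remember the current queue depth, for example on entry to a
    // sub-expression, so we can return to it later.
    std::size_t mark() const { return pending_.size(); }

    // Backtrack: discard every action queued since the mark.
    void rollback(std::size_t mark)
    {
        if(mark < pending_.size())
            pending_.erase(pending_.begin() + mark, pending_.end());
    }

    // Commit, modeling the end of a keep(): run the actions queued
    // since the mark right away, then remove them from the queue.
    void commit(std::size_t mark)
    {
        for(std::size_t k = mark; k < pending_.size(); ++k)
            pending_[k]();
        rollback(mark);
    }

private:
    std::vector<std::function<void()> > pending_;
};
```

In this model, a `rollback()` after a `commit()` finds nothing left to discard, which is why the counter in the `keep()` example reaches three even though the last `'!'` never matches.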
[h3 Lazy Functions]
So far, we've seen how to write semantic actions consisting of variables and
operators. But what if you want to be able to call a function from a semantic
action? Xpressive provides a mechanism to do this.
The first step is to define a function object type. Here, for instance, is a
function object type that calls `push()` on its argument:

    struct push_impl
    {
        // Result type, needed for tr1::result_of
        typedef void result_type;

        template<typename Sequence, typename Value>
        void operator()(Sequence &seq, Value const &val) const
        {
            seq.push(val);
        }
    };

The next step is to use xpressive's `function<>` template to define a function
object named `push`:

    // Global "push" function object.
    function<push_impl>::type const push = {{}};

The initialization looks a bit odd, but this is because `push` is being
statically initialized. That means it doesn't need to be constructed
at runtime. We can use `push` in semantic actions as follows:

    std::stack<int> ints;
    // Match digits, cast them to an int
    // and push it on the stack.
    sregex rex = (+_d)[push(ref(ints), as<int>(_))];

You'll notice that doing it this way causes member function invocations
to look like ordinary function invocations. You can choose to write your
semantic action in a different way that makes it look a bit more like
a member function call:

    sregex rex = (+_d)[ref(ints)->*push(as<int>(_))];

Xpressive recognizes the use of the `->*` and treats this expression
exactly the same as the one above.
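If it helps to see why a function object needs wrapping before it can appear in an action, here is a rough analogue in plain standard C++. The wrapper `lazy_function` and the object `lazy_push` are hypothetical names, and `std::function` again stands in for xpressive's expression templates. This is not how `function<>` is implemented. The sketch only shows that calling the wrapped object records the call instead of performing it.

```cpp
#include <functional>
#include <stack>

// Same shape as the push_impl function object in the text.
struct push_impl
{
    typedef void result_type;

    template<typename Sequence, typename Value>
    void operator()(Sequence &seq, Value const &val) const
    {
        seq.push(val);
    }
};

// Hypothetical wrapper standing in for the role function<> plays:
// calling it records the callee and its arguments instead of
// running the callee.
template<typename Impl>
struct lazy_function
{
    template<typename Sequence, typename Value>
    std::function<void()> operator()(Sequence &seq, Value const &val) const
    {
        Sequence *target = &seq;  // the sequence is held by reference
        // The value is copied into the deferred call.
        return [target, val]() { Impl()(*target, val); };
    }
};

// Hypothetical name, chosen so it is not mistaken for xpressive's push.
lazy_function<push_impl> const lazy_push = {};
```

Here `lazy_push(ints, 42)` returns an action and leaves `ints` empty. The value is pushed only when that action is invoked, just as the real `push` runs only when the regex executes its queued actions.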
When your function object must return a type that depends on its
arguments, you can use a `result<>` member template instead of the
`result_type` typedef. Here, for example, is a `first` function object
that returns the `first` member of a `std::pair<>` or _sub_match_:

    // Function object that returns the
    // first element of a pair.
    struct first_impl
    {
        template<typename Sig> struct result {};

        template<typename This, typename Pair>
        struct result<This(Pair)>
        {
            typedef typename remove_reference<Pair>
                ::type::first_type type;
        };

        template<typename Pair>
        typename Pair::first_type
        operator()(Pair const &p) const
        {
            return p.first;
        }
    };

    // OK, use as first(s1) to get the begin iterator
    // of the sub-match referred to by s1.
    function<first_impl>::type const first = {{}};

[h3 Referring to Local Variables]
As we've seen in the examples above, we can refer to local variables within
an action using `xpressive::ref()`. Any such variables are held by reference
by the regular expression, and care should be taken to avoid letting those
references dangle. For instance, in the following code, the reference to `i`
is left to dangle when `bad_voodoo()` returns:

    sregex bad_voodoo()
    {
        int i = 0;
        sregex rex = +( _d [ ++ref(i) ] >> '!' );
        // ERROR! rex refers by reference to a local
        // variable, which will dangle after bad_voodoo()
        // returns.
        return rex;
    }

When writing semantic actions, it is your responsibility to make sure that
all the references do not dangle. One way to do that would be to make the
variables shared pointers that are held by the regex by value.

    sregex good_voodoo(boost::shared_ptr<int> pi)
    {
        // Use val() to hold the shared_ptr by value:
        sregex rex = +( _d [ ++*val(pi) ] >> '!' );
        // OK, rex holds a reference count to the integer.
        return rex;
    }

In the above code, we use `xpressive::val()` to hold the shared pointer by
value. That's not normally necessary because local variables appearing in
actions are held by value by default, but in this case, it is necessary. Had
we written the action as `++*pi`, it would have executed immediately. That's
because `++*pi` is not an expression template, but `++*val(pi)` is.
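The same lifetime rule applies to any deferred computation, not just to regexes. The sketch below restates the `good_voodoo()` idea in plain standard C++, assuming C++11's `std::shared_ptr` in place of `boost::shared_ptr` and a `std::function` in place of the regex. The name `make_counting_action` is hypothetical.

```cpp
#include <functional>
#include <memory>

// Analogue of good_voodoo(): the returned action holds a copy of the
// shared_ptr, so the integer stays alive for as long as the action does.
std::function<void()> make_counting_action(std::shared_ptr<int> pi)
{
    // Capturing pi by value plays the role of val(pi) in the text.
    return [pi]() { ++*pi; };
}
```

Because the action owns a reference count, the caller's own `shared_ptr` can go out of scope without leaving the action with a dangling reference. Capturing a local `int` by reference instead would reproduce the `bad_voodoo()` problem.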
It can be tedious to wrap all your variables in `ref()` and `val()` in your
semantic actions. Xpressive provides the `reference<>` and `value<>` templates
to make things easier. The following table shows the equivalencies:
[table reference<> and value<>
[[This ...][... is equivalent to this ...]]
[[``int i = 0;
sregex rex = +( _d [ ++ref(i) ] >> '!' );``][``int i = 0;
reference<int> ri(i);
sregex rex = +( _d [ ++ri ] >> '!' );``]]
[[``boost::shared_ptr<int> pi(new int(0));
sregex rex = +( _d [ ++*val(pi) ] >> '!' );``][``boost::shared_ptr<int> pi(new int(0));
value<boost::shared_ptr<int> > vpi(pi);
sregex rex = +( _d [ ++*vpi ] >> '!' );``]]
]
As you can see, when using `reference<>`, you need to first declare a local
variable and then declare a `reference<>` to it. These two steps can be combined
into one using `local<>`.
[table local<> vs. reference<>
[[This ...][... is equivalent to this ...]]
[[``local<int> i(0);
sregex rex = +( _d [ ++i ] >> '!' );``][``int i = 0;
reference<int> ri(i);
sregex rex = +( _d [ ++ri ] >> '!' );``]]
]
We can use `local<>` to rewrite the above example as follows:

    local<int> i(0);
    std::string str("1!2!3?");
    // count the exciting digits, but not the
    // questionable ones.
    sregex rex = +( _d [ ++i ] >> '!' );
    regex_search(str, rex);
    assert( i.get() == 2 );

Notice that we use `local<>::get()` to access the value of the local
variable. Also, beware that `local<>` can be used to create a dangling
reference, just as `reference<>` can.
[h3 Referring to Non-Local Variables]
At the beginning of this section, we used a regex with a semantic action
to parse a string of
word/integer pairs and stuff them into a `std::map<>`. That required that
the map and the regex be defined together and used before either could
go out of scope. What if we wanted to define the regex once and use it
to fill lots of different maps? We would rather pass the map into the
_regex_match_ algorithm than embed a reference to it directly in
the regex object. What we can do is define a placeholder and use
that in the semantic action instead of the map itself. Later, when we
call one of the regex algorithms, we can bind the reference to an actual
map object. The following code shows how.

    // Define a placeholder for a map object:
    placeholder<std::map<std::string, int> > _map;

    // Match a word and an integer, separated by =>,
    // and then stuff the result into a std::map<>
    sregex pair = ( (s1= +_w) >> "=>" >> (s2= +_d) )
        [ _map[s1] = as<int>(s2) ];

    // Match one or more word/integer pairs, separated
    // by whitespace.
    sregex rx = pair >> *(+_s >> pair);

    // The string to parse
    std::string str("aaa=>1 bbb=>23 ccc=>456");

    // Here is the actual map to fill in:
    std::map<std::string, int> result;

    // Bind the _map placeholder to the actual map
    smatch what;
    what.let( _map = result );

    // Execute the match and fill in result map
    if(regex_match(str, what, rx))
    {
        std::cout << result["aaa"] << '\n';
        std::cout << result["bbb"] << '\n';
        std::cout << result["ccc"] << '\n';
    }

This program displays:
[pre
1
23
456
]
We use `placeholder<>` here to define `_map`, which stands in for a
`std::map<>` variable. We can use the placeholder in the semantic action as if
it were a map. Then, we define a _match_results_ struct and bind an actual map
to the placeholder with "`what.let( _map = result );`". The _regex_match_ call
behaves as if the placeholder in the semantic action had been replaced with a
reference to `result`.
[note Placeholders in semantic actions are not /actually/ replaced at runtime
with references to variables. The regex object is never mutated in any way
during any of the regex algorithms, so they are safe to use in multiple
threads.]
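A simple mental model for late binding is that the binding lives in a per-match environment that is handed to the action when it runs, while the pattern itself stays untouched. The sketch below expresses that model in plain standard C++. The names `match_env`, `late_bound_action` and `make_store_action` are hypothetical, and nothing here reflects xpressive's internal data structures.

```cpp
#include <functional>
#include <map>
#include <string>

typedef std::map<std::string, int> word_map;

// Hypothetical per-match environment: it carries the binding, so the
// action (and the pattern that owns it) never has to change.
struct match_env
{
    word_map *bound_map;  // what the placeholder currently refers to
};

// An action that reaches the map only through whatever environment
// it is handed at execution time.
typedef std::function<void(match_env const &)> late_bound_action;

late_bound_action make_store_action(std::string const &key, int value)
{
    return [key, value](match_env const &env)
    {
        (*env.bound_map)[key] = value;
    };
}
```

One immutable action can then fill two different maps, depending only on the environment passed in. That is the property that lets a single regex with placeholders be shared across threads, each with its own _match_results_ and its own bindings.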
The syntax for late-bound action arguments is a little different if you are
using _regex_iterator_ or _regex_token_iterator_. The regex iterators accept
an extra constructor parameter for specifying the argument bindings. There is
a `let()` function that you can use to bind variables to their placeholders.
The following code demonstrates how.

    // Define a placeholder for a map object:
    placeholder<std::map<std::string, int> > _map;

    // Match a word and an integer, separated by =>,
    // and then stuff the result into a std::map<>
    sregex pair = ( (s1= +_w) >> "=>" >> (s2= +_d) )
        [ _map[s1] = as<int>(s2) ];

    // The string to parse
    std::string str("aaa=>1 bbb=>23 ccc=>456");

    // Here is the actual map to fill in:
    std::map<std::string, int> result;

    // Create a regex_iterator to find all the matches
    sregex_iterator it(str.begin(), str.end(), pair, let(_map=result));
    sregex_iterator end;

    // step through all the matches, and fill in
    // the result map
    while(it != end)
        ++it;

    std::cout << result["aaa"] << '\n';
    std::cout << result["bbb"] << '\n';
    std::cout << result["ccc"] << '\n';

This program displays:
[pre
1
23
456
]
[h2 User-Defined Assertions]
You are probably already familiar with regular expression /assertions/. In
Perl, some examples are the [^^] and [^$] assertions, which you can use to
match the beginning and end of a string, respectively. Xpressive lets you
define your own assertions. A custom assertion is a condition which must be
true at a point in the match in order for the match to succeed. You can check
a custom assertion with xpressive's _check_ function.
There are a couple of ways to define a custom assertion. The simplest is to
use a function object. Let's say that you want to ensure that a sub-expression
matches a sub-string that is either 3 or 6 characters long. The following
struct defines such a predicate:

    // A predicate that is true IFF a sub-match is
    // either 3 or 6 characters long.
    struct three_or_six
    {
        bool operator()(ssub_match const &sub) const
        {
            return sub.length() == 3 || sub.length() == 6;
        }
    };

You can use this predicate within a regular expression as follows:

    // match words of 3 characters or 6 characters.
    sregex rx = (bow >> +_w >> eow)[ check(three_or_six()) ] ;

The above regular expression will find whole words that are either 3 or 6
characters long. The `three_or_six` predicate accepts a _sub_match_ that refers
back to the part of the string matched by the sub-expression to which the
custom assertion is attached.
[note The custom assertion participates in determining whether the match
succeeds or fails. Unlike actions, which execute lazily, custom assertions
execute immediately while the regex engine is searching for a match.]
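To see what it means for an assertion to take part in the search, consider this hand-written analogue in plain standard C++. The functions `is_word_char` and `find_word_if` are hypothetical, and the whole-word scan is hard-coded where xpressive would use `bow >> +_w >> eow`. The predicate is called in the middle of the search, and a `false` result sends the search on to the next candidate.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>

// Word characters: letters, digits and underscore.
bool is_word_char(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Hand-written stand-in for searching with a whole-word pattern that
// has a custom assertion attached. Illustrative only. Returns the
// first whole word accepted by pred, or an empty string if none is.
std::string find_word_if(
    std::string const &text,
    std::function<bool(std::string const &)> const &pred)
{
    std::size_t pos = 0;
    while(pos < text.size())
    {
        // Skip characters that cannot start a word.
        if(!is_word_char(text[pos]))
        {
            ++pos;
            continue;
        }

        // Consume one whole word.
        std::size_t end = pos;
        while(end < text.size() && is_word_char(text[end]))
            ++end;
        std::string const word = text.substr(pos, end - pos);

        // The assertion runs right here, during the search. If it
        // fails, this candidate is rejected and the search moves on.
        if(pred(word))
            return word;
        pos = end;
    }
    return std::string();
}
```

Given a predicate that accepts lengths 3 and 6, the sketch skips over `quick` and `brown` in `"a quick brown fox"` and returns `fox`. A lazily executed action could not influence the result in this way, because it would not run until the match had already been decided.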
Custom assertions can also be defined inline using the same syntax as for
semantic actions. Below is the same custom assertion written inline:

    // match words of 3 characters or 6 characters.
    sregex rx = (bow >> +_w >> eow)[ check(length(_)==3 || length(_)==6) ] ;

In the above, `length()` is a lazy function that calls the `length()` member
function of its argument, and `_` is a placeholder that receives the
`sub_match`.
Once you get the hang of writing custom assertions inline, they can be
very powerful. For example, you can write a regular expression that
only matches valid dates (for some suitably liberal definition of the
term ["valid]).

    int const days_per_month[] =
        {31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};

    mark_tag month(1), day(2);
    // find a valid date of the form month/day/year.
    sregex date =
        (
            // Month must be between 1 and 12 inclusive
            (month= _d >> !_d)     [ check(as<int>(_) >= 1
                                        && as<int>(_) <= 12) ]
        >>  '/'
            // Day must be between 1 and 31 inclusive
        >>  (day=   _d >> !_d)     [ check(as<int>(_) >= 1
                                        && as<int>(_) <= 31) ]
        >>  '/'
            // Only consider years between 1970 and 2038
        >>  (_d >> _d >> _d >> _d) [ check(as<int>(_) >= 1970
                                        && as<int>(_) <= 2038) ]
        )
        // Ensure the month actually has that many days!
        [ check( ref(days_per_month)[as<int>(month)-1] >= as<int>(day) ) ]
    ;

    smatch what;
    std::string str("99/99/9999 2/30/2006 2/28/2006");

    if(regex_search(str, what, date))
    {
        std::cout << what[0] << std::endl;
    }

The above program prints out the following:
[pre
2/28/2006
]
Notice how the inline custom assertions are used to range-check the values for
the month, day and year. The regular expression doesn't match `"99/99/9999"` or
`"2/30/2006"` because they are not valid dates. (There is no 99th month, and
February doesn't have 30 days.)
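The logic carried by the assertions in the date regex can also be written as an ordinary function, which may make the order of the checks easier to follow. This is a sketch with hypothetical names, `month_lengths` and `is_plausible_date`. The table uses the real calendar month lengths, with February given 29 days in keeping with the liberal definition of "valid" used above.

```cpp
// Month lengths, January first. February is given 29 days, so leap
// years are not distinguished from ordinary years.
int const month_lengths[] =
    {31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};

// Plain-function analogue of the assertions in the date regex.
bool is_plausible_date(int month, int day, int year)
{
    // Range checks, one per inline assertion on a sub-match.
    if(month < 1 || month > 12)
        return false;
    if(day < 1 || day > 31)
        return false;
    if(year < 1970 || year > 2038)
        return false;

    // Cross-field check. It is safe to index the table here because
    // the month has already been range-checked.
    return month_lengths[month - 1] >= day;
}
```

The function rejects 99/99/9999 on the month range check and 2/30/2006 on the cross-field check, and it accepts 2/28/2006. It also accepts 2/29 in any year, which is the same leniency the regex shows.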
[endsect]