164 lines
7.9 KiB
Plaintext
164 lines
7.9 KiB
Plaintext
[/
|
|
Copyright 2006-2007 John Maddock.
|
|
Distributed under the Boost Software License, Version 1.0.
|
|
(See accompanying file LICENSE_1_0.txt or copy at
|
|
http://www.boost.org/LICENSE_1_0.txt).
|
|
]
|
|
|
|
[section:intro Introduction and Overview]
|
|
|
|
Regular expressions are a form of pattern-matching that are often used in
|
|
text processing; many users will be familiar with the Unix utilities grep, sed
|
|
and awk, and the programming language Perl, each of which make extensive use
|
|
of regular expressions. Traditionally C++ users have been limited to the
|
|
POSIX C API's for manipulating regular expressions, and while Boost.Regex does
|
|
provide these API's, they do not represent the best way to use the library.
|
|
For example Boost.Regex can cope with wide character strings, or search and
|
|
replace operations (in a manner analogous to either sed or Perl), something
|
|
that traditional C libraries can not do.
|
|
|
|
The class [basic_regex] is the key class in this library; it represents a
|
|
"machine readable" regular expression, and is very closely modeled on
|
|
`std::basic_string`, think of it as a string plus the actual state-machine
|
|
required by the regular expression algorithms. Like `std::basic_string` there
|
|
are two typedefs that are almost always the means by which this class is referenced:
|
|
|
|
namespace boost{
|
|
|
|
template <class charT,
|
|
class traits = regex_traits<charT> >
|
|
class basic_regex;
|
|
|
|
typedef basic_regex<char> regex;
|
|
typedef basic_regex<wchar_t> wregex;
|
|
|
|
}
|
|
|
|
To see how this library can be used, imagine that we are writing a credit
|
|
card processing application. Credit card numbers generally come as a string
|
|
of 16-digits, separated into groups of 4-digits, and separated by either a
|
|
space or a hyphen. Before storing a credit card number in a database
|
|
(not necessarily something your customers will appreciate!), we may want to
|
|
verify that the number is in the correct format. To match any digit we could
|
|
use the regular expression \[0-9\], however ranges of characters like this are
|
|
actually locale dependent. Instead we should use the POSIX standard
|
|
form \[\[:digit:\]\], or the Boost.Regex and Perl shorthand for this \\d (note
|
|
that many older libraries tended to be hard-coded to the C-locale,
|
|
consequently this was not an issue for them). That leaves us with the
|
|
following regular expression to validate credit card number formats:
|
|
|
|
[pre (\d{4}\[- \]){3}\d{4}]
|
|
|
|
Here the parenthesis act to group (and mark for future reference)
|
|
sub-expressions, and the {4} means "repeat exactly 4 times". This is an
|
|
example of the extended regular expression syntax used by Perl, awk and egrep.
|
|
Boost.Regex also supports the older "basic" syntax used by sed and grep,
|
|
but this is generally less useful, unless you already have some basic regular
|
|
expressions that you need to reuse.
|
|
|
|
Now let's take that expression and place it in some C++ code to validate the
|
|
format of a credit card number:
|
|
|
|
bool validate_card_format(const std::string& s)
|
|
{
|
|
static const boost::regex e("(\\d{4}[- ]){3}\\d{4}");
|
|
return regex_match(s, e);
|
|
}
|
|
|
|
Note how we had to add some extra escapes to the expression: remember that
|
|
the escape is seen once by the C++ compiler, before it gets to be seen by
|
|
the regular expression engine, consequently escapes in regular expressions
|
|
have to be doubled up when embedding them in C/C++ code. Also note that
|
|
all the examples assume that your compiler supports argument-dependent
|
|
lookup, if yours doesn't (for example VC6), then you will have to add some
|
|
`boost::` prefixes to some of the function calls in the examples.
|
|
|
|
Those of you who are familiar with credit card processing, will have realized
|
|
that while the format used above is suitable for human readable card numbers,
|
|
it does not represent the format required by online credit card systems; these
|
|
require the number as a string of 16 (or possibly 15) digits, without any
|
|
intervening spaces. What we need is a means to convert easily between the two
|
|
formats, and this is where search and replace comes in. Those who are familiar
|
|
with the utilities sed and Perl will already be ahead here; we need two
|
|
strings - one a regular expression - the other a "format string" that provides
|
|
a description of the text to replace the match with. In Boost.Regex this
|
|
search and replace operation is performed with the algorithm [regex_replace],
|
|
for our credit card example we can write two algorithms like this to
|
|
provide the format conversions:
|
|
|
|
// match any format with the regular expression:
|
|
const boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
|
|
const std::string machine_format("\\1\\2\\3\\4");
|
|
const std::string human_format("\\1-\\2-\\3-\\4");
|
|
|
|
std::string machine_readable_card_number(const std::string s)
|
|
{
|
|
return regex_replace(s, e, machine_format, boost::match_default | boost::format_sed);
|
|
}
|
|
|
|
std::string human_readable_card_number(const std::string s)
|
|
{
|
|
return regex_replace(s, e, human_format, boost::match_default | boost::format_sed);
|
|
}
|
|
|
|
Here we've used marked sub-expressions in the regular expression to split out
|
|
the four parts of the card number as separate fields, the format string then
|
|
uses the sed-like syntax to replace the matched text with the reformatted version.
|
|
|
|
In the examples above, we haven't directly manipulated the results of
|
|
a regular expression match, however in general the result of a match contains
|
|
a number of sub-expression matches in addition to the overall match. When the
|
|
library needs to report a regular expression match it does so using an instance
|
|
of the class [match_results], as before there are typedefs of this class for
|
|
the most common cases:
|
|
|
|
namespace boost{
|
|
|
|
typedef match_results<const char*> cmatch;
|
|
typedef match_results<const wchar_t*> wcmatch;
|
|
typedef match_results<std::string::const_iterator> smatch;
|
|
typedef match_results<std::wstring::const_iterator> wsmatch;
|
|
|
|
}
|
|
|
|
The algorithms [regex_search] and [regex_match] make use of [match_results]
|
|
to report what matched; the difference between these algorithms is that
|
|
[regex_match] will only find matches that consume /all/ of the input text,
|
|
where as [regex_search] will search for a match anywhere within the text being matched.
|
|
|
|
Note that these algorithms are not restricted to searching regular C-strings,
|
|
any bidirectional iterator type can be searched, allowing for the
|
|
possibility of seamlessly searching almost any kind of data.
|
|
|
|
For search and replace operations, in addition to the algorithm [regex_replace]
|
|
that we have already seen, the [match_results] class has a `format` member that
|
|
takes the result of a match and a format string, and produces a new string
|
|
by merging the two.
|
|
|
|
For iterating through all occurrences of an expression within a text,
|
|
there are two iterator types: [regex_iterator] will enumerate over the
|
|
[match_results] objects found, while [regex_token_iterator] will enumerate
|
|
a series of strings (similar to perl style split operations).
|
|
|
|
For those that dislike templates, there is a high level wrapper class
|
|
[RegEx] that is an encapsulation of the lower level template code - it
|
|
provides a simplified interface for those that don't need the full
|
|
power of the library, and supports only narrow characters, and the
|
|
"extended" regular expression syntax. This class is now deprecated as
|
|
it does not form part of the regular expressions C++ standard library proposal.
|
|
|
|
The POSIX API functions: [regcomp], [regexec], [regfree] and [regerr],
|
|
are available in both narrow character and Unicode versions, and are
|
|
provided for those who need compatibility with these API's.
|
|
|
|
Finally, note that the library now has
|
|
[link boost_regex.background.locale run-time localization support],
|
|
and recognizes the full POSIX regular expression syntax - including
|
|
advanced features like multi-character collating elements and equivalence
|
|
classes - as well as providing compatibility with other regular expression
|
|
libraries including GNU and BSD4 regex packages, PCRE and Perl 5.
|
|
|
|
[endsect]
|
|
|
|
|