123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313 |
- [/
- / Copyright (c) 2008 Eric Niebler
- /
- / Distributed under the Boost Software License, Version 1.0. (See accompanying
- / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
- /]
- [section Grammars and Nested Matches]
- [h2 Overview]
- One of the key benefits of representing regexes as C++ expressions is the ability to easily refer to other C++
- code and data from within the regex. This enables programming idioms that are not possible with other regular
- expression libraries. Of particular note is the ability for one regex to refer to another regex, allowing you
- to build grammars out of regular expressions. This section describes how to embed one regex in another by value
- and by reference, how regex objects behave when they refer to other regexes, and how to access the tree of results
- after a successful parse.
- [h2 Embedding a Regex by Value]
- The _basic_regex_ object has value semantics. When a regex object appears on the right-hand side in the definition
- of another regex, it is as if the regex were embedded by value; that is, a copy of the nested regex is stored by
- the enclosing regex. The inner regex is invoked by the outer regex during pattern matching. The inner regex
- participates fully in the match, back-tracking as needed to make the match succeed.
- Consider a text editor that has a regex-find feature with a whole-word option. You can implement this with
- xpressive as follows:
- find_dialog dlg;
- if( dialog_ok == dlg.do_modal() )
- {
- std::string pattern = dlg.get_text(); // the pattern the user entered
- bool whole_word = dlg.whole_word.is_checked(); // did the user select the whole-word option?
- sregex re = sregex::compile( pattern ); // try to compile the pattern
- if( whole_word )
- {
- // wrap the regex in begin-word / end-word assertions
- re = bow >> re >> eow;
- }
- // ... use re ...
- }
- Look closely at this line:
- // wrap the regex in begin-word / end-word assertions
- re = bow >> re >> eow;
- This line creates a new regex that embeds the old regex by value. Then, the new regex is assigned back to
- the original regex. Since a copy of the old regex was made on the right-hand side, this works as you might
- expect: the new regex has the behavior of the old regex wrapped in begin- and end-word assertions.
- [note Note that `re = bow >> re >> eow` does ['not] define a recursive regular expression, since regex
- objects embed by value by default. The next section shows how to define a recursive regular expression by
- embedding a regex by reference.]
- [h2 Embedding a Regex by Reference]
- If you want to be able to build recursive regular expressions and context-free grammars, embedding a regex
- by value is not enough. You need to be able to make your regular expressions self-referential. Most regular
- expression engines don't give you that power, but xpressive does.
- [tip The theoretical computer scientists out there will correctly point out that a self-referential
- regular expression is not "regular", so in the strict sense, xpressive isn't really a ['regular] expression engine
- at all. But as Larry Wall once said, "the term '''[regular expression]''' has grown with the capabilities of our
- pattern matching engines, so I'm not going to try to fight linguistic necessity here."]
- Consider the following code, which uses the `by_ref()` helper to define a recursive regular expression that
- matches balanced, nested parentheses:
- sregex parentheses;
- parentheses // A balanced set of parentheses ...
- = '(' // is an opening parenthesis ...
- >> // followed by ...
- *( // zero or more ...
- keep( +~(set='(',')') ) // of a bunch of things that are not parentheses ...
- | // or ...
- by_ref(parentheses) // a balanced set of parentheses
- ) // (ooh, recursion!) ...
- >> // followed by ...
- ')' // a closing parenthesis
- ;
- Matching balanced, nested tags is an important text processing task, and it is one that "classic" regular
- expressions cannot do. The `by_ref()` helper makes it possible. It allows one regex object to be embedded
- in another ['by reference]. Since the right-hand side holds `parentheses` by reference, assigning the right-hand
- side back to `parentheses` creates a cycle, which will execute recursively.
- [h2 Building a Grammar]
- Once we allow self-reference in our regular expressions, the genie is out of the bottle and all manner of
- fun things are possible. In particular, we can now build grammars out of regular expressions. Let's have
- a look at the text-book grammar example: the humble calculator.
- sregex group, factor, term, expression;
- group = '(' >> by_ref(expression) >> ')';
- factor = +_d | group;
- term = factor >> *(('*' >> factor) | ('/' >> factor));
- expression = term >> *(('+' >> term) | ('-' >> term));
- The regex `expression` defined above does something rather remarkable for a regular expression: it matches
- mathematical expressions. For example, if the input string were `"foo 9*(10+3) bar"`, this pattern would
- match `"9*(10+3)"`. It only matches well-formed mathematical expressions, where the parentheses are
- balanced and the infix operators have two arguments each. Don't try this with just any regular expression
- engine!
- Let's take a closer look at this regular expression grammar. Notice that it is cyclic: `expression` is
- implemented in terms of `term`, which is implemented in terms of `factor`, which is implemented in terms
- of `group`, which is implemented in terms of `expression`, closing the loop. In general, the way to define
- a cyclic grammar is to forward-declare the regex objects and embed by reference those regular expressions
- that have not yet been initialized. In the above grammar, there is only one place where we need to reference
- a regex object that has not yet been initialized: the definition of `group`. In that place, we use
- `by_ref()` to embed `expression` by reference. In all other places, it is sufficient to embed the other
- regex objects by value, since they have already been initialized and their values will not change.
- [tip [*Embed by value if possible]
- \n\n
- In general, prefer embedding regular expressions by value rather than by reference. It
- involves one less indirection, making your patterns match a little faster. Besides, value semantics
- are simpler and will make your grammars easier to reason about. Don't worry about the expense of "copying"
- a regex. Each regex object shares its implementation with all of its copies.]
- [h2 Dynamic Regex Grammars]
- Using _regex_compiler_, you can also build grammars out of dynamic regular expressions.
- You do that by creating named regexes, and referring to other regexes by name. Each
- _regex_compiler_ instance keeps a mapping from names to regexes that have been created
- with it.
- You can create a named dynamic regex by prefacing your regex with `"(?$name=)"`, where
- /name/ is the name of the regex. You can refer to a named regex from another regex with
- `"(?$name)"`. The named regex does not need to exist yet at the time it is referenced
- in another regex, but it must exist by the time you use the regex.
- Below is a code fragment that uses dynamic regex grammars to implement the calculator
- example from above.
- using namespace boost::xpressive;
- using namespace regex_constants;
- sregex expr;
- {
- sregex_compiler compiler;
- syntax_option_type x = ignore_white_space;
- compiler.compile("(? $group = ) \\( (? $expr ) \\) ", x);
- compiler.compile("(? $factor = ) \\d+ | (? $group ) ", x);
- compiler.compile("(? $term = ) (? $factor )"
- " ( \\* (? $factor ) | / (? $factor ) )* ", x);
- expr = compiler.compile("(? $expr = ) (? $term )"
- " ( \\+ (? $term ) | - (? $term ) )* ", x);
- }
- std::string str("foo 9*(10+3) bar");
- smatch what;
- if(regex_search(str, what, expr))
- {
- // This prints "9*(10+3)":
- std::cout << what[0] << std::endl;
- }
- As with static regex grammars, nested regex invocations create nested
- match results (see /Nested Results/ below). The result is a complete parse tree
- for string that matched. Unlike static regexes, dynamic regexes are always
- embedded by reference, not by value.
- [h2 Cyclic Patterns, Copying and Memory Management, Oh My!]
- The calculator examples above raises a number of very complicated memory-management issues. Each of the
- four regex objects refer to each other, some directly and some indirectly, some by value and some by
- reference. What if we were to return one of them from a function and let the others go out of scope?
- What becomes of the references? The answer is that the regex objects are internally reference counted,
- such that they keep their referenced regex objects alive as long as they need them. So passing a regex
- object by value is never a problem, even if it refers to other regex objects that have gone out of scope.
- Those of you who have dealt with reference counting are probably familiar with its Achilles Heel: cyclic
- references. If regex objects are reference counted, what happens to cycles like the one created in the
- calculator examples? Are they leaked? The answer is no, they are not leaked. The _basic_regex_ object has some tricky
- reference tracking code that ensures that even cyclic regex grammars are cleaned up when the last external
- reference goes away. So don't worry about it. Create cyclic grammars, pass your regex objects around and
- copy them all you want. It is fast and efficient and guaranteed not to leak or result in dangling references.
- [h2 Nested Regexes and Sub-Match Scoping]
- Nested regular expressions raise the issue of sub-match scoping. If both the inner and outer regex write
- to and read from the same sub-match vector, chaos would ensue. The inner regex would stomp on the
- sub-matches written by the outer regex. For example, what does this do?
- sregex inner = sregex::compile( "(.)\\1" );
- sregex outer = (s1= _) >> inner >> s1;
- The author probably didn't intend for the inner regex to overwrite the sub-match written by the outer
- regex. The problem is particularly acute when the inner regex is accepted from the user as input. The
- author has no way of knowing whether the inner regex will stomp the sub-match vector or not. This is
- clearly not acceptable.
- Instead, what actually happens is that each invocation of a nested regex gets its own scope. Sub-matches
- belong to that scope. That is, each nested regex invocation gets its own copy of the sub-match vector to
- play with, so there is no way for an inner regex to stomp on the sub-matches of an outer regex. So, for
- example, the regex `outer` defined above would match `"ABBA"`, as it should.
- [h2 Nested Results]
- If nested regexes have their own sub-matches, there should be a way to access them after a successful
- match. In fact, there is. After a _regex_match_ or _regex_search_, the _match_results_ struct behaves
- like the head of a tree of nested results. The _match_results_ class provides a `nested_results()`
- member function that returns an ordered sequence of _match_results_ structures, representing the
- results of the nested regexes. The order of the nested results is the same as the order in which
- the nested regex objects matched.
- Take as an example the regex for balanced, nested parentheses we saw earlier:
- sregex parentheses;
- parentheses = '(' >> *( keep( +~(set='(',')') ) | by_ref(parentheses) ) >> ')';
- smatch what;
- std::string str( "blah blah( a(b)c (c(e)f (g)h )i (j)6 )blah" );
- if( regex_search( str, what, parentheses ) )
- {
- // display the whole match
- std::cout << what[0] << '\n';
- // display the nested results
- std::for_each(
- what.nested_results().begin(),
- what.nested_results().end(),
- output_nested_results() );
- }
- This program displays the following:
- [pre
- ( a(b)c (c(e)f (g)h )i (j)6 )
- (b)
- (c(e)f (g)h )
- (e)
- (g)
- (j)
- ]
- Here you can see how the results are nested and that they are stored in the order in which they
- are found.
- [tip See the definition of [link boost_xpressive.user_s_guide.examples.display_a_tree_of_nested_results output_nested_results] in the
- [link boost_xpressive.user_s_guide.examples Examples] section.]
- [h2 Filtering Nested Results]
- Sometimes a regex will have several nested regex objects, and you want to know which result corresponds
- to which regex object. That's where `basic_regex<>::regex_id()` and `match_results<>::regex_id()`
- come in handy. When iterating over the nested results, you can compare the regex id from the results to
- the id of the regex object you're interested in.
- To make this a bit easier, xpressive provides a predicate to make it simple to iterate over just the
- results that correspond to a certain nested regex. It is called `regex_id_filter_predicate`, and it is
- intended to be used with _iterator_. You can use it as follows:
- sregex name = +alpha;
- sregex integer = +_d;
- sregex re = *( *_s >> ( name | integer ) );
- smatch what;
- std::string str( "marsha 123 jan 456 cindy 789" );
- if( regex_match( str, what, re ) )
- {
- smatch::nested_results_type::const_iterator begin = what.nested_results().begin();
- smatch::nested_results_type::const_iterator end = what.nested_results().end();
- // declare filter predicates to select just the names or the integers
- sregex_id_filter_predicate name_id( name.regex_id() );
- sregex_id_filter_predicate integer_id( integer.regex_id() );
- // iterate over only the results from the name regex
- std::for_each(
- boost::make_filter_iterator( name_id, begin, end ),
- boost::make_filter_iterator( name_id, end, end ),
- output_result
- );
- std::cout << '\n';
- // iterate over only the results from the integer regex
- std::for_each(
- boost::make_filter_iterator( integer_id, begin, end ),
- boost::make_filter_iterator( integer_id, end, end ),
- output_result
- );
- }
- where `output_results` is a simple function that takes a `smatch` and displays the full match.
- Notice how we use the `regex_id_filter_predicate` together with `basic_regex<>::regex_id()` and
- `boost::make_filter_iterator()` from the _iterator_ to select only those results
- corresponding to a particular nested regex. This program displays the following:
- [pre
- marsha
- jan
- cindy
- 123
- 456
- 789
- ]
- [endsect]
|