[/
 / Copyright (c) 2008 Eric Niebler
 /
 / Distributed under the Boost Software License, Version 1.0. (See accompanying
 / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
 /]

[section Static Regexes]

[h2 Overview]

The feature that really sets xpressive apart from other C/C++ regular
expression libraries is the ability to author a regular expression using C++
expressions. xpressive achieves this through operator overloading, using a
technique called ['expression templates] to embed a mini-language dedicated
to pattern matching within C++. These "static regexes" have many advantages
over their string-based brethren. In particular, static regexes:

* are syntax-checked at compile-time; they will never fail at run-time due to
  a syntax error.
* can naturally refer to other C++ data and code, including other regexes,
  making it simple to build grammars out of regular expressions and bind
  user-defined actions that execute when parts of your regex match.
* are statically bound for better inlining and optimization. Static regexes
  require no state tables, virtual functions, byte-code or calls through
  function pointers that cannot be resolved at compile time.
* are not limited to searching for patterns in strings. You can declare a
  static regex that finds patterns in an array of integers, for instance.

Since we compose static regexes using C++ expressions, we are constrained by
the rules for legal C++ expressions. Unfortunately, that means that
"classic" regular expression syntax cannot always be mapped cleanly into
C++. Rather, we map the regex ['constructs], picking new syntax that is
legal C++.

[h2 Construction and Assignment]

You create a static regex by assigning one to an object of type _basic_regex_.
For instance, the following defines a regex that can be used to find patterns
in objects of type `std::string`:

    sregex re = '$' >> +_d >> '.' >> _d >> _d;

Assignment works similarly.

[h2 Character and String Literals]

In static regexes, character and string literals match themselves. For
instance, in the regex above, `'$'` and `'.'` match the characters `'$'` and
`'.'` respectively. Don't be confused by the fact that [^$] and [^.] are
meta-characters in Perl. In xpressive, literals always represent themselves.

When using literals in static regexes, you must take care that at least one
operand is not a literal. For instance, the following are ['not] valid
regexes:

    sregex re1 = 'a' >> 'b';         // ERROR!
    sregex re2 = +'a';               // ERROR!

The two operands to the binary `>>` operator are both literals, and the
operand of the unary `+` operator is also a literal, so these statements
will call the native C++ binary right-shift and unary plus operators,
respectively. That's not what we want. To get operator overloading to kick
in, at least one operand must be a user-defined type. We can use xpressive's
`as_xpr()` helper function to "taint" an expression with regex-ness, forcing
operator overloading to find the correct operators. The two regexes above
should be written as:

    sregex re1 = as_xpr('a') >> 'b'; // OK
    sregex re2 = +as_xpr('a');       // OK

[h2 Sequencing and Alternation]

As you've probably already noticed, sub-expressions in static regexes must
be separated by the sequencing operator, `>>`. You can read this operator as
"followed by".

    // Match an 'a' followed by a digit
    sregex re = 'a' >> _d;

Alternation works just as it does in Perl with the `|` operator. You can
read this operator as "or". For example:

    // match a digit character or a word character one or more times
    sregex re = +( _d | _w );

[h2 Grouping and Captures]

In Perl, parentheses `()` have special meaning. They group, but as a
side-effect they also create back\-references like [^$1] and [^$2]. In C++,
parentheses only group \-\- there is no way to give them side\-effects. To
get the same effect, we use the special `s1`, `s2`, etc. tokens. Assigning
to one creates a back-reference. You can then use the back-reference later
in your expression, like using [^\1] and [^\2] in Perl. For example,
consider the following regex, which finds matching HTML tags:

    "<(\\w+)>.*?</\\1>"

In static xpressive, this would be:

    '<' >> (s1= +_w) >> '>' >> -*_ >> "</" >> s1 >> '>'

Notice how you capture a back-reference by assigning to `s1`, and then you
use `s1` later in the pattern to find the matching end tag.

[tip [*Grouping without capturing a back-reference] \n\n In
xpressive, if you just want grouping without capturing a back-reference, you
can just use `()` without `s1`. That is the equivalent of Perl's [^(?:)]
non-capturing grouping construct.]

[h2 Case-Insensitivity and Internationalization]

Perl lets you make part of your regular expression case-insensitive by using
the [^(?i:)] pattern modifier. xpressive also has a case-insensitivity
pattern modifier, called `icase`. You can use it as follows:

    sregex re = "this" >> icase( "that" );

In this regular expression, `"this"` will be matched exactly, but `"that"`
will be matched irrespective of case.

Case-insensitive regular expressions raise the issue of
internationalization: how should case-insensitive character comparisons be
evaluated? Also, many character classes are locale-specific. Which
characters are matched by `digit` and which are matched by `alpha`? The
answer depends on the `std::locale` object the regular expression object is
using. By default, all regular expression objects use the global locale. You
can override the default by using the `imbue()` pattern modifier, as
follows:

    std::locale my_locale = /* initialize a std::locale object */;
    sregex re = imbue( my_locale )( +alpha >> +digit );

This regular expression will evaluate `alpha` and `digit` according to
`my_locale`. See the section on [link boost_xpressive.user_s_guide.localization_and_regex_traits
Localization and Regex Traits] for more information about how to customize
the behavior of your regexes.

[h2 Static xpressive Syntax Cheat Sheet]

The table below lists the familiar regex constructs and their equivalents in
static xpressive.

[def _s1_       [globalref boost::xpressive::s1 s1]]
[def _bos_      [globalref boost::xpressive::bos bos]]
[def _eos_      [globalref boost::xpressive::eos eos]]
[def _b_        [globalref boost::xpressive::_b _b]]
[def _n_        [globalref boost::xpressive::_n _n]]
[def _ln_       [globalref boost::xpressive::_ln _ln]]
[def _d_        [globalref boost::xpressive::_d _d]]
[def _w_        [globalref boost::xpressive::_w _w]]
[def _s_        [globalref boost::xpressive::_s _s]]
[def _alnum_    [globalref boost::xpressive::alnum alnum]]
[def _alpha_    [globalref boost::xpressive::alpha alpha]]
[def _blank_    [globalref boost::xpressive::blank blank]]
[def _cntrl_    [globalref boost::xpressive::cntrl cntrl]]
[def _digit_    [globalref boost::xpressive::digit digit]]
[def _graph_    [globalref boost::xpressive::graph graph]]
[def _lower_    [globalref boost::xpressive::lower lower]]
[def _print_    [globalref boost::xpressive::print print]]
[def _punct_    [globalref boost::xpressive::punct punct]]
[def _space_    [globalref boost::xpressive::space space]]
[def _upper_    [globalref boost::xpressive::upper upper]]
[def _xdigit_   [globalref boost::xpressive::xdigit xdigit]]
[def _set_      [globalref boost::xpressive::set set]]
[def _repeat_   [funcref boost::xpressive::repeat repeat]]
[def _range_    [funcref boost::xpressive::range range]]
[def _icase_    [funcref boost::xpressive::icase icase]]
[def _before_   [funcref boost::xpressive::before before]]
[def _after_    [funcref boost::xpressive::after after]]
[def _keep_     [funcref boost::xpressive::keep keep]]

[table Perl syntax vs. Static xpressive syntax
    [[Perl] [Static xpressive] [Meaning]]
    [[[^.]] [[globalref boost::xpressive::_ `_`]] [any character (assuming Perl's /s modifier).]]
    [[[^ab]] [`a >> b`] [sequencing of [^a] and [^b] sub-expressions.]]
    [[[^a|b]] [`a | b`] [alternation of [^a] and [^b] sub-expressions.]]
    [[[^(a)]] [`(_s1_= a)`] [group and capture a back-reference.]]
    [[[^(?:a)]] [`(a)`] [group and do not capture a back-reference.]]
    [[[^\1]] [`_s1_`] [a previously captured back-reference.]]
    [[[^a*]] [`*a`] [zero or more times, greedy.]]
    [[[^a+]] [`+a`] [one or more times, greedy.]]
    [[[^a?]] [`!a`] [zero or one time, greedy.]]
    [[[^a{n,m}]] [`_repeat_<n,m>(a)`] [between [^n] and [^m] times, greedy.]]
    [[[^a*?]] [`-*a`] [zero or more times, non-greedy.]]
    [[[^a+?]] [`-+a`] [one or more times, non-greedy.]]
    [[[^a??]] [`-!a`] [zero or one time, non-greedy.]]
    [[[^a{n,m}?]] [`-_repeat_<n,m>(a)`] [between [^n] and [^m] times, non-greedy.]]
    [[[^^]] [`_bos_`] [beginning of sequence assertion.]]
    [[[^$]] [`_eos_`] [end of sequence assertion.]]
    [[[^\b]] [`_b_`] [word boundary assertion.]]
    [[[^\B]] [`~_b_`] [not word boundary assertion.]]
    [[[^\\n]] [`_n_`] [literal newline.]]
    [[[^.]] [`~_n_`] [any character except a literal newline (without Perl's /s modifier).]]
    [[[^\\r?\\n|\\r]] [`_ln_`] [logical newline.]]
    [[[^\[^\\r\\n\]]] [`~_ln_`] [any single character not a logical newline.]]
    [[[^\w]] [`_w_`] [a word character, equivalent to set\[alnum | '_'\].]]
    [[[^\W]] [`~_w_`] [not a word character, equivalent to ~set\[alnum | '_'\].]]
    [[[^\d]] [`_d_`] [a digit character.]]
    [[[^\D]] [`~_d_`] [not a digit character.]]
    [[[^\s]] [`_s_`] [a space character.]]
    [[[^\S]] [`~_s_`] [not a space character.]]
    [[[^\[:alnum:\]]] [`_alnum_`] [an alpha-numeric character.]]
    [[[^\[:alpha:\]]] [`_alpha_`] [an alphabetic character.]]
    [[[^\[:blank:\]]] [`_blank_`] [a horizontal white-space character.]]
    [[[^\[:cntrl:\]]] [`_cntrl_`] [a control character.]]
    [[[^\[:digit:\]]] [`_digit_`] [a digit character.]]
    [[[^\[:graph:\]]] [`_graph_`] [a graphable character.]]
    [[[^\[:lower:\]]] [`_lower_`] [a lower-case character.]]
    [[[^\[:print:\]]] [`_print_`] [a printing character.]]
    [[[^\[:punct:\]]] [`_punct_`] [a punctuation character.]]
    [[[^\[:space:\]]] [`_space_`] [a white-space character.]]
    [[[^\[:upper:\]]] [`_upper_`] [an upper-case character.]]
    [[[^\[:xdigit:\]]] [`_xdigit_`] [a hexadecimal digit character.]]
    [[[^\[0-9\]]] [`_range_('0','9')`] [characters in range `'0'` through `'9'`.]]
    [[[^\[abc\]]] [`as_xpr('a') | 'b' | 'c'`] [characters `'a'`, `'b'`, or `'c'`.]]
    [[[^\[abc\]]] [`(_set_= 'a','b','c')`] [['same as above]]]
    [[[^\[0-9abc\]]] [`_set_[ _range_('0','9') | 'a' | 'b' | 'c' ]`] [characters `'a'`, `'b'`, `'c'` or in range `'0'` through `'9'`.]]
    [[[^\[0-9abc\]]] [`_set_[ _range_('0','9') | (_set_= 'a','b','c') ]`] [['same as above]]]
    [[[^\[^abc\]]] [`~(_set_= 'a','b','c')`] [not characters `'a'`, `'b'`, or `'c'`.]]
    [[[^(?i:['stuff])]] [`_icase_(`[^['stuff]]`)`] [match ['stuff] disregarding case.]]
    [[[^(?>['stuff])]] [`_keep_(`[^['stuff]]`)`] [independent sub-expression, match ['stuff] and turn off backtracking.]]
    [[[^(?=['stuff])]] [`_before_(`[^['stuff]]`)`] [positive look-ahead assertion, match if before ['stuff] but don't include ['stuff] in the match.]]
    [[[^(?!['stuff])]] [`~_before_(`[^['stuff]]`)`] [negative look-ahead assertion, match if not before ['stuff].]]
    [[[^(?<=['stuff])]] [`_after_(`[^['stuff]]`)`] [positive look-behind assertion, match if after ['stuff] but don't include ['stuff] in the match. (['stuff] must be constant-width.)]]
    [[[^(?<!['stuff])]] [`~_after_(`[^['stuff]]`)`] [negative look-behind assertion, match if not after ['stuff]. (['stuff] must be constant-width.)]]
    [[[^(?P<['name]>['stuff])]] [`_mark_tag_ `[^['name]]`(`['n]`);`\n ...\n `(`[^['name]]`= `[^['stuff]]`)`] [Create a named capture.]]
    [[[^(?P=['name])]] [`_mark_tag_ `[^['name]]`(`['n]`);`\n ...\n [^['name]]] [Refer back to a previously created named capture.]]
]

\n

[endsect]