- [/
- Copyright 2006-2007 John Maddock.
- Distributed under the Boost Software License, Version 1.0.
- (See accompanying file LICENSE_1_0.txt or copy at
- http://www.boost.org/LICENSE_1_0.txt).
- ]
- [section:basic_extended POSIX Extended Regular Expression Syntax]
- [h3 Synopsis]
- The POSIX-Extended regular expression syntax is supported by the POSIX
- C regular expression APIs, and variations are used by the utilities
- `egrep` and `awk`. You can construct POSIX extended regular expressions in
- Boost.Regex by passing the flag `extended` to the regex constructor, for example:
- // e1 is a case sensitive POSIX-Extended expression:
- boost::regex e1(my_expression, boost::regex::extended);
- // e2 is a case-insensitive POSIX-Extended expression:
- boost::regex e2(my_expression, boost::regex::extended|boost::regex::icase);
- [#boost_regex.posix_extended_syntax][h3 POSIX Extended Syntax]
- In POSIX-Extended regular expressions, all characters match themselves except for
- the following special characters:
- [pre .\[{}()\\\*+?|^$]
- [h4 Wildcard:]
- The single character '.' when used outside of a character set will match
- any single character except:
- * The NUL character when the flag `match_not_dot_null` is passed to the
- matching algorithms.
- * The newline character when the flag `match_not_dot_newline` is passed
- to the matching algorithms.
- [h4 Anchors:]
- A '^' character shall match the start of a line when used as the first
- character of an expression, or the first character of a sub-expression.
- A '$' character shall match the end of a line when used as the
- last character of an expression, or the last character of a sub-expression.
- [h4 Marked sub-expressions:]
- A section beginning `(` and ending `)` acts as a marked sub-expression.
- Whatever matched the sub-expression is split out in a separate field
- by the matching algorithms. Marked sub-expressions can also be repeated,
- or referred to by a back-reference.
- [h4 Repeats:]
- Any atom (a single character, a marked sub-expression, or a character class)
- can be repeated with the `*`, `+`, `?`, and `{}` operators.
- The `*` operator will match the preceding atom /zero or more times/, for
- example the expression `a*b` will match any of the following:
- [pre
- b
- ab
- aaaaaaaab
- ]
- The `+` operator will match the preceding atom /one or more times/,
- for example the expression `a+b` will match any of the following:
- [pre
- ab
- aaaaaaaab
- ]
- But will not match:
- [pre
- b
- ]
- The `?` operator will match the preceding atom /zero or one times/, for
- example the expression `ca?b` will match any of the following:
- [pre
- cb
- cab
- ]
- But will not match:
- [pre
- caab
- ]
- An atom can also be repeated with a bounded repeat:
- `a{n}` Matches 'a' repeated /exactly n times/.
- `a{n,}` Matches 'a' repeated /n or more times/.
- `a{n,m}` Matches 'a' repeated /between n and m times inclusive/.
- For example:
- [pre ^a{2,3}\$]
- Will match either of:
- aa
- aaa
- But neither of:
- a
- aaaa
- It is an error to use a repeat operator if the preceding construct cannot
- be repeated, for example:
- a(*)
- Will raise an error, as there is nothing for the `*` operator to be applied to.
- [h4 Back references:]
- An escape character followed by a digit /n/, where /n/ is in the range 1-9,
- matches the same string that was matched by sub-expression /n/. For example
- the expression:
- [pre ^(a\*)\[\^a\]\*\\1\$]
- Will match the string:
- aaabbaaa
- But not the string:
- aaabba
- [caution The POSIX standard does not support back-references for "extended"
- regular expressions; this is a compatible extension to that standard.]
- [h4 Alternation]
- The `|` operator will match either of its arguments, so for example:
- `abc|def` will match either "abc" or "def".
- Parentheses can be used to group alternations, for example: `ab(d|ef)`
- will match either of "abd" or "abef".
- [h4 Character sets:]
- A character set is a bracket-expression starting with \[ and ending with \];
- it defines a set of characters, and matches any single character that is
- a member of that set.
- A bracket expression may contain any combination of the following:
- [h5 Single characters:]
- For example `[abc]`, will match any of the characters 'a', 'b', or 'c'.
- [h5 Character ranges:]
- For example `[a-c]` will match any single character in the range 'a' to 'c'.
- By default, for POSIX-Extended regular expressions, a character /x/ is
- within the range /y/ to /z/, if it collates within that range; this
- results in locale-specific behavior. This behavior can be turned
- off by unsetting the `collate`
- [link boost_regex.ref.syntax_option_type option flag] - in which case whether
- a character appears within a range is determined by comparing the code
- points of the characters only.
- [h5 Negation:]
- If the bracket-expression begins with the ^ character, then it matches the
- complement of the characters it contains, for example `[^a-c]` matches
- any character that is not in the range `a-c`.
- [h5 Character classes:]
- An expression of the form `[[:name:]]` matches the named character class "name",
- for example `[[:lower:]]` matches any lower case character.
- See [link boost_regex.syntax.character_classes character class names].
- [h5 Collating Elements:]
- An expression of the form `[[.col.]]` matches the collating element /col/.
- A collating element is any single character, or any sequence of
- characters that collates as a single unit. Collating elements may
- also be used as the end point of a range, for example: `[[.ae.]-c]`
- matches the character sequence "ae", plus any single character
- in the range "ae"-c, assuming that "ae" is treated as a single
- collating element in the current locale.
- Collating elements may be used in place of escapes (which are not
- normally allowed inside character sets), for example `[[.^.]abc]`
- would match any one of the characters 'a', 'b', 'c' or '^'.
- As an extension, a collating element may also be specified via its
- [link boost_regex.syntax.collating_names symbolic name], for example:
- [[.NUL.]]
- matches a NUL character.
- [h5 Equivalence classes:]
- An expression of the form `[[=col=]]` matches any character or collating element
- whose primary sort key is the same as that for collating element /col/.
- As with collating elements, the name /col/ may be a
- [link boost_regex.syntax.collating_names symbolic name]. A primary
- sort key is one that ignores case, accentuation, or locale-specific tailorings;
- so for example `[[=a=]]` matches any of the characters:
- a, '''À''', '''Á''', '''Â''',
- '''Ã''', '''Ä''', '''Å''', A, '''à''', '''á''',
- '''â''', '''ã''', '''ä''' and '''å'''.
- Unfortunately, the implementation of this relies on the platform's
- collation and localisation support; this feature cannot be relied
- upon to work portably across all platforms, or even all locales on one platform.
- [h5 Combinations:]
- All of the above can be combined in one character set declaration,
- for example: `[[:digit:]a-c[.NUL.]]`.
- [h4 Escapes]
- The POSIX standard defines no escape sequences for POSIX-Extended
- regular expressions, except that:
- * Any special character preceded by an escape shall match itself.
- * The effect of any ordinary character being preceded by an escape is undefined.
- * An escape inside a character class declaration shall match itself: in
- other words the escape character is not "special" inside a character
- class declaration; so `[\^]` will match either a literal '\\' or a '^'.
- However, that's rather restrictive, so the following standard-compatible
- extensions are also supported by Boost.Regex:
- [h5 Escapes matching a specific character]
- The following escape sequences are all synonyms for single characters:
- [table
- [[Escape][Character]]
- [[\\a][\\a]]
- [[\\e][0x1B]]
- [[\\f][\\f]]
- [[\\n][\\n]]
- [[\\r][\\r]]
- [[\\t][\\t]]
- [[\\v][\\v]]
- [[\\b][\\b (but only inside a character class declaration).]]
- [[\\cX][An ASCII escape sequence - the character whose code point is X % 32]]
- [[\\xdd][A hexadecimal escape sequence - matches the single character whose code point is 0xdd.]]
- [[\\x{dddd}][A hexadecimal escape sequence - matches the single character whose code point is 0xdddd.]]
- [[\\0ddd][An octal escape sequence - matches the single character whose code point is 0ddd.]]
- [[\\N{Name}][Matches the single character which has the symbolic name ['Name]. For example `\\N{newline}` matches the single character \\n.]]
- ]
- [h5 "Single character" character classes:]
- Any escaped character /x/, where /x/ is the name of a character class, shall
- match any character that is a member of that class; the escaped
- upper-case form /X/ shall match any character not in that class.
- The following are supported by default:
- [table
- [[Escape sequence][Equivalent to]]
- [[`\d`][`[[:digit:]]`]]
- [[`\l`][`[[:lower:]]`]]
- [[`\s`][`[[:space:]]`]]
- [[`\u`][`[[:upper:]]`]]
- [[`\w`][`[[:word:]]`]]
- [[`\D`][`[^[:digit:]]`]]
- [[`\L`][`[^[:lower:]]`]]
- [[`\S`][`[^[:space:]]`]]
- [[`\U`][`[^[:upper:]]`]]
- [[`\W`][`[^[:word:]]`]]
- ]
- [h5 Character Properties]
- The character property names in the following table are all equivalent to the
- names used in character classes.
- [table
- [[Form][Description][Equivalent character set form]]
- [[`\pX`][Matches any character that has the property X.][`[[:X:]]`]]
- [[`\p{Name}`][Matches any character that has the property Name.][`[[:Name:]]`]]
- [[`\PX`][Matches any character that does not have the property X.][`[^[:X:]]`]]
- [[`\P{Name}`][Matches any character that does not have the property Name.][`[^[:Name:]]`]]
- ]
- For example `\pd` matches any "digit" character, as does `\p{digit}`.
- [h5 Word Boundaries]
- The following escape sequences match the boundaries of words:
- [table
- [[Escape][Meaning]]
- [[`\<`][Matches the start of a word.]]
- [[`\>`][Matches the end of a word.]]
- [[`\b`][Matches a word boundary (the start or end of a word).]]
- [[`\B`][Matches only when not at a word boundary.]]
- ]
- [h5 Buffer boundaries]
- The following match only at buffer boundaries: a "buffer" in this
- context is the whole of the input text that is being matched against
- (note that ^ and $ may match embedded newlines within the text).
- [table
- [[Escape][Meaning]]
- [[\\\`][Matches at the start of a buffer only.]]
- [[\\'][Matches at the end of a buffer only.]]
- [[`\A`][Matches at the start of a buffer only (the same as \\\`).]]
- [[`\z`][Matches at the end of a buffer only (the same as \\').]]
- [[`\Z`][Matches an optional sequence of newlines at the end of a buffer:
- equivalent to the regular expression `\n*\z`]]
- ]
- [h5 Continuation Escape]
- The sequence `\G` matches only at the end of the last match found, or at
- the start of the text being matched if no previous match was found.
- This escape is useful if you're iterating over the matches contained within
- a text, and you want each subsequent match to start where the last one ended.
- [h5 Quoting escape]
- The escape sequence `\Q` begins a "quoted sequence": all the subsequent
- characters are treated as literals, until either the end of the
- regular expression or `\E` is found. For example the expression: `\Q\*+\Ea+`
- would match either of:
- \*+a
- \*+aaa
- [h5 Unicode escapes]
- [table
- [[Escape][Meaning]]
- [[`\C`][Matches a single code point: in Boost.Regex this has exactly the same effect as a "." operator.]]
- [[`\X`][Matches a combining character sequence: that is any non-combining character followed by a sequence of zero or more combining characters.]]
- ]
- [h5 Any other escape]
- Any other escape sequence matches the character that is escaped,
- for example \\@ matches a literal '@'.
- [h4 Operator precedence]
- The order of precedence of operators, from highest to lowest, is as follows:
- # Collation-related bracket symbols `[==] [::] [..]`
- # Escaped characters `\`
- # Character set (bracket expression) `[]`
- # Grouping `()`
- # Single-character-ERE duplication `* + ? {m,n}`
- # Concatenation
- # Anchoring `^ $`
- # Alternation `|`
- [h4 What Gets Matched]
- When there is more than one way to match a regular expression, the
- "best" possible match is obtained using the
- [link boost_regex.syntax.leftmost_longest_rule leftmost-longest rule].
- [h3 Variations]
- [h4 Egrep]
- When an expression is compiled with the
- [link boost_regex.ref.syntax_option_type flag `egrep`] set, then the
- expression is treated as a newline separated list of
- [link boost_regex.posix_extended_syntax POSIX-Extended expressions],
- a match is found if any of the
- expressions in the list match, for example:
- boost::regex e("abc\ndef", boost::regex::egrep);
- will match either of the POSIX-Extended expressions "abc" or "def".
- As its name suggests, this behavior is consistent with the Unix utility `egrep`,
- and with `grep` when used with the `-E` option.
- [h4 awk]
- When an expression is compiled with the
- [link boost_regex.ref.syntax_option_type flag `awk`] set, then in addition to the
- [link boost_regex.posix_extended_syntax POSIX-Extended features] the
- escape character is
- special inside a character class declaration.
- In addition, some escape sequences that are not defined as part of the
- POSIX-Extended specification are required to be supported; however, Boost.Regex
- supports these by default anyway.
- [h3 Options]
- There are a [link boost_regex.ref.syntax_option_type.syntax_option_type_extended variety of flags]
- that may be combined with the `extended` and `egrep` options when
- constructing the regular expression. In particular, note that the
- [link boost_regex.ref.syntax_option_type.syntax_option_type_extended `newline_alt`]
- option alters the syntax, while the
- [link boost_regex.ref.syntax_option_type.syntax_option_type_extended `collate`, `nosubs`
- and `icase` options] modify how the case and locale sensitivity are to be applied.
- [h3 References]
- [@http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap09.html
- IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Base Definitions and Headers, Section 9, Regular Expressions.]
- [@http://www.opengroup.org/onlinepubs/000095399/utilities/grep.html
- IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Shells and Utilities, Section 4, Utilities, egrep.]
- [@http://www.opengroup.org/onlinepubs/000095399/utilities/awk.html
- IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Shells and Utilities, Section 4, Utilities, awk.]
- [endsect]