- [/
- Copyright 2006-2007 John Maddock.
- Distributed under the Boost Software License, Version 1.0.
- (See accompanying file LICENSE_1_0.txt or copy at
- http://www.boost.org/LICENSE_1_0.txt).
- ]
- [section:icu Working With Unicode and ICU String Types]
- [section:intro Introduction to using Regex with ICU]
- The header:
- <boost/regex/icu.hpp>
- contains the data types and algorithms necessary for working with regular
- expressions in a Unicode aware environment.
- In order to use this header you will need the
- [@http://www.ibm.com/software/globalization/icu/ ICU library], and you will need
- to have built the Boost.Regex library with
- [link boost_regex.install.building_with_unicode_and_icu_su ICU support enabled].
- The header will enable you to:
- * Create regular expressions that treat Unicode strings as sequences of UTF-32 code points.
- * Create regular expressions that support various Unicode data properties, including character classification.
- * Transparently search Unicode strings that are encoded as either UTF-8, UTF-16 or UTF-32.
- [endsect]
- [section:unicode_types Unicode regular expression types]
- Header `<boost/regex/icu.hpp>` provides a regular expression traits class that
- handles UTF-32 characters:
- class icu_regex_traits;
- and a regular expression type based upon that:
- typedef basic_regex<UChar32,icu_regex_traits> u32regex;
- The type `u32regex` is the regular expression type to use for all Unicode
- regular expressions; internally it uses UTF-32 code points, but it can be
- created from, and used to search, UTF-8 or UTF-16 encoded strings
- as well as UTF-32 ones.
- The constructors and `assign` member functions of `u32regex` require UTF-32
- encoded strings, but there is a series of overloaded functions called
- `make_u32regex` which allow regular expressions to be created from
- UTF-8, UTF-16, or UTF-32 encoded strings:
- template <class InputIterator>
- u32regex make_u32regex(InputIterator i,
- InputIterator j,
- boost::regex_constants::syntax_option_type opt);
- [*Effects]: Creates a regular expression object from the iterator sequence \[i,j).
- The character encoding of the sequence is determined based upon sizeof(*i):
- 1 implies UTF-8, 2 implies UTF-16, and 4 implies UTF-32.
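- The width-based rule above can be pictured as a switch on the size of the
- iterator's value type. The following is only a minimal sketch of that idea,
- not the code Boost.Regex uses; the names `encoding`, `encoding_for_width` and
- `guess_encoding` are hypothetical and exist purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical names for illustration; not part of Boost.Regex.
enum class encoding { utf8, utf16, utf32, unknown };

// Map the width of one code unit to the encoding the rule above assumes.
constexpr encoding encoding_for_width(std::size_t width)
{
    return width == 1 ? encoding::utf8
         : width == 2 ? encoding::utf16
         : width == 4 ? encoding::utf32
         : encoding::unknown;
}

// Deduce the assumed encoding from an iterator type, mirroring sizeof(*i).
template <class InputIterator>
encoding guess_encoding(InputIterator)
{
    using unit = typename std::iterator_traits<InputIterator>::value_type;
    return encoding_for_width(sizeof(unit));
}
```

- Under this sketch a `std::string::const_iterator` or a `const char*` selects
- UTF-8, while iterators over `char16_t` and `char32_t` select UTF-16 and UTF-32.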
- u32regex make_u32regex(const char* p,
- boost::regex_constants::syntax_option_type opt
- = boost::regex_constants::perl);
- [*Effects]: Creates a regular expression object from the Null-terminated
- UTF-8 character sequence /p/.
- u32regex make_u32regex(const unsigned char* p,
- boost::regex_constants::syntax_option_type opt
- = boost::regex_constants::perl);
- [*Effects]: Creates a regular expression object from the Null-terminated UTF-8 character sequence p.
- u32regex make_u32regex(const wchar_t* p,
- boost::regex_constants::syntax_option_type opt
- = boost::regex_constants::perl);
- [*Effects]: Creates a regular expression object from the Null-terminated character sequence p. The character encoding of the sequence is determined based upon sizeof(wchar_t): 1 implies UTF-8, 2 implies UTF-16, and 4 implies UTF-32.
- u32regex make_u32regex(const UChar* p,
- boost::regex_constants::syntax_option_type opt
- = boost::regex_constants::perl);
- [*Effects]: Creates a regular expression object from the Null-terminated UTF-16 character sequence p.
- template<class C, class T, class A>
- u32regex make_u32regex(const std::basic_string<C, T, A>& s,
- boost::regex_constants::syntax_option_type opt
- = boost::regex_constants::perl);
- [*Effects]: Creates a regular expression object from the string s. The character encoding of the string is determined based upon sizeof(C): 1 implies UTF-8, 2 implies UTF-16, and 4 implies UTF-32.
- u32regex make_u32regex(const UnicodeString& s,
- boost::regex_constants::syntax_option_type opt
- = boost::regex_constants::perl);
- [*Effects]: Creates a regular expression object from the UTF-16 encoded string s.
- [endsect]
- [section:unicode_algo Unicode Regular Expression Algorithms]
- The regular expression algorithms [regex_match], [regex_search] and [regex_replace]
- all expect the character sequence upon which they operate to be
- encoded in the same character encoding as the regular expression object
- with which they are used. For Unicode regular expressions that behavior is
- undesirable: while we may want to process the data in UTF-32 "chunks", the
- actual data is much more likely to be encoded as either UTF-8 or UTF-16.
- Therefore the header `<boost/regex/icu.hpp>` provides a series of thin wrappers
- around these algorithms, called `u32regex_match`, `u32regex_search`, and
- `u32regex_replace`. These wrappers use iterator adapters internally to
- make external UTF-8 or UTF-16 data look as though it is really a UTF-32 sequence
- that can then be passed on to the "real" algorithm.
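- To make the iterator-adapter idea concrete, here is a minimal sketch of an
- adapter that walks UTF-8 bytes but yields UTF-32 code points. It is not the
- adapter Boost.Regex uses: the names `utf8_decoding_iterator` and `decode_utf8`
- are hypothetical, and the sketch assumes well-formed input and does no validation.

```cpp
#include <cassert>
#include <string>

// Hypothetical adapter for illustration only; it is not the adapter that
// Boost.Regex uses. Assumes well-formed UTF-8 and performs no validation.
class utf8_decoding_iterator
{
public:
    explicit utf8_decoding_iterator(const char* pos) : pos_(pos) {}

    // Decode the code point whose first byte is at the current position.
    char32_t operator*() const
    {
        const unsigned char lead = static_cast<unsigned char>(*pos_);
        const int len = sequence_length(lead);
        // Keep only the payload bits of the lead byte.
        char32_t cp = (len == 1) ? lead : (lead & (0x7Fu >> len));
        // Each continuation byte contributes its low six bits.
        for (int k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(pos_[k]) & 0x3Fu);
        return cp;
    }

    // Step over one whole code point, however many bytes it occupies.
    utf8_decoding_iterator& operator++()
    {
        pos_ += sequence_length(static_cast<unsigned char>(*pos_));
        return *this;
    }

    bool operator==(const utf8_decoding_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const utf8_decoding_iterator& o) const { return pos_ != o.pos_; }

private:
    // Number of bytes in the sequence introduced by this lead byte.
    static int sequence_length(unsigned char lead)
    {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x6) return 2;
        if ((lead >> 4) == 0xE) return 3;
        return 4;
    }

    const char* pos_;
};

// Hypothetical helper: view a UTF-8 string as a sequence of code points.
std::u32string decode_utf8(const std::string& s)
{
    std::u32string out;
    utf8_decoding_iterator i(s.data()), end(s.data() + s.size());
    for (; i != end; ++i)
        out.push_back(*i);
    return out;
}
```

- An algorithm written against UTF-32 can consume such an iterator directly,
- so the underlying bytes never need to be copied into a separate UTF-32 buffer.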
- [h4 u32regex_match]
- For each [regex_match] algorithm defined by `<boost/regex.hpp>`,
- `<boost/regex/icu.hpp>` defines an overloaded algorithm called
- `u32regex_match` that takes the same arguments and will
- accept UTF-8, UTF-16 or UTF-32 encoded data, as well as an
- ICU `UnicodeString`, as input.
- Example: match a password, encoded in a UTF-16 UnicodeString:
- //
- // Find out if *password* meets our password requirements,
- // as defined by the regular expression *requirements*.
- //
- bool is_valid_password(const UnicodeString& password, const UnicodeString& requirements)
- {
-    return boost::u32regex_match(password, boost::make_u32regex(requirements));
- }
- Example: match a UTF-8 encoded filename:
- //
- // Extract filename part of a path from a UTF-8 encoded std::string and return the result
- // as another std::string:
- //
- std::string get_filename(const std::string& path)
- {
-    boost::u32regex r = boost::make_u32regex("(?:\\A|.*\\\\)([^\\\\]+)");
-    boost::smatch what;
-    if(boost::u32regex_match(path, what, r))
-    {
-       // extract $1 as a std::string:
-       return what.str(1);
-    }
-    else
-    {
-       throw std::runtime_error("Invalid pathname");
-    }
- }
- [h4 u32regex_search]
- For each [regex_search] algorithm defined by `<boost/regex.hpp>`,
- `<boost/regex/icu.hpp>` defines an overloaded algorithm called
- `u32regex_search` that takes the same arguments and will
- accept UTF-8, UTF-16 or UTF-32 encoded data, as well as an ICU
- `UnicodeString`, as input.
- Example: search for a character sequence in a specific language block:
- UnicodeString extract_greek(const UnicodeString& text)
- {
-    // searches through some UTF-16 encoded text for a block encoded in Greek,
-    // this expression is imperfect, but the best we can do for now - searching
-    // for specific scripts is actually pretty hard to do right.
-    //
-    // Here we search for a character sequence that begins with a Greek letter,
-    // and continues with characters that are either not-letters ( [^[:L*:]] )
-    // or are characters in the Greek character block ( [\\x{370}-\\x{3FF}] ).
-    //
-    boost::u32regex r = boost::make_u32regex(
-       L"[\\x{370}-\\x{3FF}](?:[^[:L*:]]|[\\x{370}-\\x{3FF}])*");
-    boost::u16match what;
-    if(boost::u32regex_search(text, what, r))
-    {
-       // extract $0 as a UnicodeString:
-       return UnicodeString(what[0].first, what.length(0));
-    }
-    else
-    {
-       throw std::runtime_error("No Greek found!");
-    }
- }
- [h4 u32regex_replace]
- For each [regex_replace] algorithm defined by `<boost/regex.hpp>`,
- `<boost/regex/icu.hpp>` defines an overloaded algorithm called
- `u32regex_replace` that takes the same arguments and will
- accept UTF-8, UTF-16 or UTF-32 encoded data, as well as an ICU
- `UnicodeString`, as input. The input sequence and the format string specifier
- passed to the algorithm can be encoded independently (for example one can
- be in UTF-8, the other in UTF-16), but the result string / output iterator
- argument must use the same character encoding as the text being searched.
- Example: Credit card number reformatting:
- //
- // Take a credit card number as a string of digits,
- // and reformat it as a human readable string with "-"
- // separating each group of four digits;
- // note that we're mixing a UTF-32 regex with a UTF-16
- // string and a UTF-8 format specifier, and it still all
- // just works:
- //
- const boost::u32regex e = boost::make_u32regex(
-    "\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
- const char* human_format = "$1-$2-$3-$4";
- UnicodeString human_readable_card_number(const UnicodeString& s)
- {
-    return boost::u32regex_replace(s, e, human_format);
- }
- [endsect]
- [section:unicode_iter Unicode Aware Regex Iterators]
- [h4 u32regex_iterator]
- Type `u32regex_iterator` is in all respects the same as [regex_iterator]
- except that since the regular expression type is always `u32regex`
- it only takes one template parameter (the iterator type). It also calls
- `u32regex_search` internally, allowing it to interface correctly with
- UTF-8, UTF-16, and UTF-32 data:
- template <class BidirectionalIterator>
- class u32regex_iterator
- {
- // for members see regex_iterator
- };
- typedef u32regex_iterator<const char*> utf8regex_iterator;
- typedef u32regex_iterator<const UChar*> utf16regex_iterator;
- typedef u32regex_iterator<const UChar32*> utf32regex_iterator;
- In order to simplify the construction of a `u32regex_iterator` from a string,
- there are a series of non-member helper functions called `make_u32regex_iterator`:
- u32regex_iterator<const char*>
- make_u32regex_iterator(const char* s,
- const u32regex& e,
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- u32regex_iterator<const wchar_t*>
- make_u32regex_iterator(const wchar_t* s,
- const u32regex& e,
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- u32regex_iterator<const UChar*>
- make_u32regex_iterator(const UChar* s,
- const u32regex& e,
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- template <class charT, class Traits, class Alloc>
- u32regex_iterator<typename std::basic_string<charT, Traits, Alloc>::const_iterator>
- make_u32regex_iterator(const std::basic_string<charT, Traits, Alloc>& s,
- const u32regex& e,
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- u32regex_iterator<const UChar*>
- make_u32regex_iterator(const UnicodeString& s,
- const u32regex& e,
- regex_constants::match_flag_type m = regex_constants::match_default);
- Each of these overloads returns an iterator that enumerates all occurrences
- of expression /e/, in text /s/, using match_flags /m/.
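- Because `u32regex_iterator` mirrors [regex_iterator], the enumeration loop has
- the same shape as one written with the standard library's `std::sregex_iterator`.
- The sketch below uses `std::regex` on ASCII text purely as a stand-in so that it
- compiles without ICU; the helper name `collect_matches` is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: gathers the text of every match of e found in text.
std::vector<std::string> collect_matches(const std::string& text, const std::regex& e)
{
    std::vector<std::string> out;
    // A default-constructed iterator acts as the end-of-sequence marker.
    for (std::sregex_iterator i(text.begin(), text.end(), e), j; i != j; ++i)
        out.push_back(i->str());   // i->str() is the whole match, like (*i)[0]
    return out;
}
```

- With the ICU types, the same loop would start from the iterator returned by
- `make_u32regex_iterator(s, e, m)` and compare against a default-constructed
- `u32regex_iterator` of the same type.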
- Example: search for international currency symbols, along with their associated numeric value:
- void enumerate_currencies(const std::string& text)
- {
-    // enumerate and print all the currency symbols, along
-    // with any associated numeric values:
-    const char* re =
-       "([[:Sc:]][[:Cf:][:Cc:][:Z*:]]*)?"
-       "([[:Nd:]]+(?:[[:Po:]][[:Nd:]]+)?)?"
-       "(?(1)"
-          "|(?(2)"
-             "[[:Cf:][:Cc:][:Z*:]]*"
-          ")"
-          "[[:Sc:]]"
-       ")";
-    boost::u32regex r = boost::make_u32regex(re);
-    boost::u32regex_iterator<std::string::const_iterator>
-       i(boost::make_u32regex_iterator(text, r)), j;
-    while(i != j)
-    {
-       std::cout << (*i)[0] << std::endl;
-       ++i;
-    }
- }
- Calling
- [/this doesn't format correctly as code:]
- [pre enumerate_currencies(" $100.23 or '''£'''198.12 ");]
- Yields the output:
- [pre
- $100.23
- '''£'''198.12
- ]
- Provided of course that the input is encoded as UTF-8.
- [h4 u32regex_token_iterator]
- Type `u32regex_token_iterator` is in all respects the same as [regex_token_iterator]
- except that since the regular expression type is always `u32regex` it only
- takes one template parameter (the iterator type). It also calls
- `u32regex_search` internally, allowing it to interface correctly with UTF-8,
- UTF-16, and UTF-32 data:
- template <class BidirectionalIterator>
- class u32regex_token_iterator
- {
- // for members see regex_token_iterator
- };
- typedef u32regex_token_iterator<const char*> utf8regex_token_iterator;
- typedef u32regex_token_iterator<const UChar*> utf16regex_token_iterator;
- typedef u32regex_token_iterator<const UChar32*> utf32regex_token_iterator;
- In order to simplify the construction of a `u32regex_token_iterator` from a string,
- there are a series of non-member helper functions called `make_u32regex_token_iterator`:
- u32regex_token_iterator<const char*>
- make_u32regex_token_iterator(
- const char* s,
- const u32regex& e,
- int sub,
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- u32regex_token_iterator<const wchar_t*>
- make_u32regex_token_iterator(
- const wchar_t* s,
- const u32regex& e,
- int sub,
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- u32regex_token_iterator<const UChar*>
- make_u32regex_token_iterator(
- const UChar* s,
- const u32regex& e,
- int sub,
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- template <class charT, class Traits, class Alloc>
- u32regex_token_iterator<typename std::basic_string<charT, Traits, Alloc>::const_iterator>
- make_u32regex_token_iterator(
- const std::basic_string<charT, Traits, Alloc>& s,
- const u32regex& e,
- int sub,
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- u32regex_token_iterator<const UChar*>
- make_u32regex_token_iterator(
- const UnicodeString& s,
- const u32regex& e,
- int sub,
- regex_constants::match_flag_type m = regex_constants::match_default);
- Each of these overloads returns an iterator that enumerates all occurrences of
- marked sub-expression /sub/ in regular expression /e/, found in text /s/, using
- match_flags /m/.
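- The effect of the /sub/ argument can be illustrated with the standard library's
- `std::sregex_token_iterator`, which takes the same kind of index. This is only
- an ASCII stand-in that compiles without ICU, not the ICU-aware iterator itself;
- the helper name `collect_submatch` is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: for every match of e in text, collect the text of
// marked sub-expression number sub.
std::vector<std::string> collect_submatch(const std::string& text,
                                          const std::regex& e,
                                          int sub)
{
    std::vector<std::string> out;
    // A default-constructed token iterator marks the end of the sequence.
    for (std::sregex_token_iterator i(text.begin(), text.end(), e, sub), j; i != j; ++i)
        out.push_back(i->str());
    return out;
}
```

- For the expression `(\w+)=(\w+)` applied to `a=1 b=2`, index 1 selects the keys
- and index 2 selects the values.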
- template <std::size_t N>
- u32regex_token_iterator<const char*>
- make_u32regex_token_iterator(
- const char* p,
- const u32regex& e,
- const int (&submatch)[N],
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- template <std::size_t N>
- u32regex_token_iterator<const wchar_t*>
- make_u32regex_token_iterator(
- const wchar_t* p,
- const u32regex& e,
- const int (&submatch)[N],
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- template <std::size_t N>
- u32regex_token_iterator<const UChar*>
- make_u32regex_token_iterator(
- const UChar* p,
- const u32regex& e,
- const int (&submatch)[N],
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- template <class charT, class Traits, class Alloc, std::size_t N>
- u32regex_token_iterator<typename std::basic_string<charT, Traits, Alloc>::const_iterator>
- make_u32regex_token_iterator(
- const std::basic_string<charT, Traits, Alloc>& p,
- const u32regex& e,
- const int (&submatch)[N],
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- template <std::size_t N>
- u32regex_token_iterator<const UChar*>
- make_u32regex_token_iterator(
- const UnicodeString& s,
- const u32regex& e,
- const int (&submatch)[N],
- regex_constants::match_flag_type m = regex_constants::match_default);
- Each of these overloads returns an iterator that, for every match of regular
- expression /e/ found in the text (/p/ or /s/), enumerates one sub-expression for
- each index listed in /submatch/, using match_flags /m/.
- u32regex_token_iterator<const char*>
- make_u32regex_token_iterator(
- const char* p,
- const u32regex& e,
- const std::vector<int>& submatch,
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- u32regex_token_iterator<const wchar_t*>
- make_u32regex_token_iterator(
- const wchar_t* p,
- const u32regex& e,
- const std::vector<int>& submatch,
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- u32regex_token_iterator<const UChar*>
- make_u32regex_token_iterator(
- const UChar* p,
- const u32regex& e,
- const std::vector<int>& submatch,
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- template <class charT, class Traits, class Alloc>
- u32regex_token_iterator<typename std::basic_string<charT, Traits, Alloc>::const_iterator>
- make_u32regex_token_iterator(
- const std::basic_string<charT, Traits, Alloc>& p,
- const u32regex& e,
- const std::vector<int>& submatch,
- regex_constants::match_flag_type m = regex_constants::match_default);
-
- u32regex_token_iterator<const UChar*>
- make_u32regex_token_iterator(
- const UnicodeString& s,
- const u32regex& e,
- const std::vector<int>& submatch,
- regex_constants::match_flag_type m = regex_constants::match_default);
- Each of these overloads returns an iterator that, for every match of regular
- expression /e/ found in the text (/p/ or /s/), enumerates one sub-expression for
- each index listed in /submatch/, using match_flags /m/.
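- When several indices are supplied, the tokens come out grouped by match, in the
- order the indices were listed. The sketch below shows this with the standard
- library's `std::sregex_token_iterator` and a `std::vector<int>`; it is an ASCII
- stand-in that compiles without ICU, and the helper name `collect_fields` is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: for every match of e in text, emit one token per
// index listed in submatch, in the order the indices are given.
std::vector<std::string> collect_fields(const std::string& text,
                                        const std::regex& e,
                                        const std::vector<int>& submatch)
{
    std::vector<std::string> out;
    // A default-constructed token iterator marks the end of the sequence.
    for (std::sregex_token_iterator i(text.begin(), text.end(), e, submatch), j; i != j; ++i)
        out.push_back(i->str());
    return out;
}
```

- With the expression `(\w+)=(\w+)`, the text `a=1 b=2` and the indices 1 and 2,
- the tokens alternate between key and value for each match in turn.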
- Example: search for international currency symbols, along with their associated numeric value:
- void enumerate_currencies2(const std::string& text)
- {
-    // enumerate and print all the currency symbols, along
-    // with any associated numeric values:
-    const char* re =
-       "([[:Sc:]][[:Cf:][:Cc:][:Z*:]]*)?"
-       "([[:Nd:]]+(?:[[:Po:]][[:Nd:]]+)?)?"
-       "(?(1)"
-          "|(?(2)"
-             "[[:Cf:][:Cc:][:Z*:]]*"
-          ")"
-          "[[:Sc:]]"
-       ")";
-    boost::u32regex r = boost::make_u32regex(re);
-    boost::u32regex_token_iterator<std::string::const_iterator>
-       i(boost::make_u32regex_token_iterator(text, r, 1)), j;
-    while(i != j)
-    {
-       std::cout << *i << std::endl;
-       ++i;
-    }
- }
- [endsect]
- [endsect]