- [/
- Copyright 2006-2007 John Maddock.
- Distributed under the Boost Software License, Version 1.0.
- (See accompanying file LICENSE_1_0.txt or copy at
- http://www.boost.org/LICENSE_1_0.txt).
- ]
- [section:locale Localization]
- Boost.Regex provides extensive support for run-time localization. The
- localization model used can be split into two parts: front-end and back-end.
- Front-end localization deals with everything that the user sees -
- error messages, and the regular expression syntax itself. For example, a
- French application could change \[\[:word:\]\] to \[\[:mot:\]\] and \\w to \\m.
- Modifying the front-end locale requires active support from the developer,
- who must provide the library with a message catalogue to load, containing the
- localized strings. The front-end locale is affected by the LC_MESSAGES category only.
- Back-end localization deals with everything that occurs after the expression
- has been parsed - in other words everything that the user does not see or
- interact with directly. It deals with case conversion, collation, and character
- class membership. The back-end locale does not require any intervention from
- the developer - the library will acquire all the information it requires for
- the current locale from the underlying operating system / run time library.
- This means that if the program user does not interact with regular
- expressions directly - for example if the expressions are embedded in your
- C++ code - then no explicit localization is required, as the library will
- take care of everything for you. For example, embedding the expression
- \[\[:word:\]\]+ in your code will always match a whole word: if the
- program is run on a machine with, say, a Greek locale, then it
- will still match a whole word, but in Greek characters rather than Latin ones.
- The back-end locale is affected by the LC_CTYPE and LC_COLLATE categories.
- There are three separate localization mechanisms supported by Boost.Regex:
- [h4 Win32 localization model.]
- This is the default model when the library is compiled under Win32, and is
- encapsulated by the traits class `w32_regex_traits`. When this model is in
- effect each [basic_regex] object gets its own LCID. By default this is
- the user's default setting as returned by `GetUserDefaultLCID`, but you can
- call `imbue` on the `basic_regex` object to set its locale to some other
- LCID if you wish. All the settings used by Boost.Regex are acquired directly
- from the operating system, bypassing the C run time library. Front-end
- localization requires a resource dll containing a string table with the
- user-defined strings. The traits class exports the function:
-    static std::string set_message_catalogue(const std::string& s);
- which needs to be called with a string identifying the name of the resource
- dll, before your code compiles any regular expressions (but not necessarily
- before you construct any `basic_regex` instances):
-    boost::w32_regex_traits<char>::set_message_catalogue("mydll.dll");
- The library provides full Unicode support under NT. Under Windows 9x
- the library degrades gracefully: characters 0 to 255 are supported, and the
- remainder are treated as "unknown" graphic characters.
- [h4 C localization model.]
- This model has been deprecated in favor of the C++ locale model for all non-Windows
- compilers that support it. This model is encapsulated by the traits class
- `c_regex_traits`; Win32 users can force this model to take effect by
- defining the pre-processor symbol BOOST_REGEX_USE_C_LOCALE. When this model is
- in effect there is a single global locale, as set by `setlocale`. All settings
- are acquired from your run time library; consequently Unicode support is
- dependent upon your run time library implementation.
- Front-end localization is not supported.
- Note that calling `setlocale` invalidates all compiled regular expressions.
- Calling `setlocale(LC_ALL, "C")` will make this library behave equivalently to
- most traditional regular expression libraries, including version 1 of this library.
- [h4 C++ localization model.]
- This model is the default for non-Windows compilers.
- When this model is in effect each instance of [basic_regex] has its own
- instance of `std::locale`. Class [basic_regex] also has a member function
- `imbue`, which allows the locale for the expression to be set on a
- per-instance basis. Front-end localization requires a POSIX message catalogue,
- which will be loaded via the `std::messages` facet of the expression's locale.
- The traits class exports the symbol:
-    static std::string set_message_catalogue(const std::string& s);
- which needs to be called with a string identifying the name of the
- message catalogue, before your code compiles any regular expressions
- (but not necessarily before you construct any `basic_regex` instances):
-    boost::cpp_regex_traits<char>::set_message_catalogue("mycatalogue");
- Note that calling `basic_regex<>::imbue` will invalidate any expression
- currently compiled in that instance of [basic_regex].
- Finally, note that if you build the library with a non-default localization model,
- then the appropriate pre-processor symbol (BOOST_REGEX_USE_C_LOCALE or
- BOOST_REGEX_USE_CPP_LOCALE) must be defined both when you build the support
- library, and when you include `<boost/regex.hpp>` or `<boost/cregex.hpp>`
- in your code. The best way to ensure this is to add the #define to
- `<boost/regex/user.hpp>`.
- [h4 Providing a message catalogue]
- In order to localize the front end of the library, you need to provide the
- library with the appropriate message strings contained either in a resource
- dll's string table (Win32 model), or a POSIX message catalogue (C++ model).
- In the latter case the messages must appear in message set zero of the
- catalogue. The messages and their ids are as follows:
-
- [table
- [[Message id][Meaning][Default value]]
- [[101][The character used to start a sub-expression.]["(" ]]
- [[102][The character used to end a sub-expression declaration.][")" ]]
- [[103][The character used to denote an end of line assertion.]["$" ]]
- [[104][The character used to denote the start of line assertion.]["^" ]]
- [[105][The character used to denote the "match any character expression".]["." ]]
- [[106][The match zero or more times repetition operator.]["*" ]]
- [[107][The match one or more repetition operator.]["+" ]]
- [[108][The match zero or one repetition operator.]["?" ]]
- [[109][The character set opening character.]["\[" ]]
- [[110][The character set closing character.]["\]" ]]
- [[111][The alternation operator.]["|" ]]
- [[112][The escape character.]["\\" ]]
- [[113][The hash character (not currently used).]["#" ]]
- [[114][The range operator.]["-" ]]
- [[115][The repetition operator opening character.]["{" ]]
- [[116][The repetition operator closing character.]["}" ]]
- [[117][The digit characters.]["0123456789" ]]
- [[118][The character which when preceded by an escape character represents the word boundary assertion.]["b" ]]
- [[119][The character which when preceded by an escape character represents the non-word boundary assertion.]["B" ]]
- [[120][The character which when preceded by an escape character represents the word-start boundary assertion.]["<" ]]
- [[121][The character which when preceded by an escape character represents the word-end boundary assertion.][">" ]]
- [[122][The character which when preceded by an escape character represents any word character.]["w" ]]
- [[123][The character which when preceded by an escape character represents a non-word character.]["W" ]]
- [[124][The character which when preceded by an escape character represents a start of buffer assertion.]["`A" ]]
- [[125][The character which when preceded by an escape character represents an end of buffer assertion.]["'z" ]]
- [[126][The newline character. ]["\\n" ]]
- [[127][The comma separator.]["," ]]
- [[128][The character which when preceded by an escape character represents the bell character.]["a" ]]
- [[129][The character which when preceded by an escape character represents the form feed character.]["f" ]]
- [[130][The character which when preceded by an escape character represents the newline character.]["n" ]]
- [[131][The character which when preceded by an escape character represents the carriage return character.]["r" ]]
- [[132][The character which when preceded by an escape character represents the tab character.]["t" ]]
- [[133][The character which when preceded by an escape character represents the vertical tab character.]["v" ]]
- [[134][The character which when preceded by an escape character represents the start of a hexadecimal character constant.]["x" ]]
- [[135][The character which when preceded by an escape character represents the start of an ASCII escape character.]["c" ]]
- [[136][The colon character.][":" ]]
- [[137][The equals character.]["=" ]]
- [[138][The character which when preceded by an escape character represents the ASCII escape character.]["e" ]]
- [[139][The character which when preceded by an escape character represents any lower case character.]["l" ]]
- [[140][The character which when preceded by an escape character represents any non-lower case character.]["L" ]]
- [[141][The character which when preceded by an escape character represents any upper case character.]["u" ]]
- [[142][The character which when preceded by an escape character represents any non-upper case character.]["U" ]]
- [[143][The character which when preceded by an escape character represents any space character.]["s" ]]
- [[144][The character which when preceded by an escape character represents any non-space character.]["S" ]]
- [[145][The character which when preceded by an escape character represents any digit character.]["d" ]]
- [[146][The character which when preceded by an escape character represents any non-digit character.]["D" ]]
- [[147][The character which when preceded by an escape character represents the end quote operator.]["E" ]]
- [[148][The character which when preceded by an escape character represents the start quote operator.]["Q" ]]
- [[149][The character which when preceded by an escape character represents a Unicode combining character sequence.]["X" ]]
- [[150][The character which when preceded by an escape character represents any single character.]["C" ]]
- [[151][The character which when preceded by an escape character represents end of buffer operator.]["Z" ]]
- [[152][The character which when preceded by an escape character represents the continuation assertion.]["G" ]]
- [[153][The character which when preceded by (? indicates a zero width negated forward lookahead assert.]["!" ]]
- ]
- Custom error messages are loaded as follows:
- [table
- [[Message ID][Error message ID][Default string ]]
- [[201][REG_NOMATCH]["No match" ]]
- [[202][REG_BADPAT]["Invalid regular expression" ]]
- [[203][REG_ECOLLATE]["Invalid collation character" ]]
- [[204][REG_ECTYPE]["Invalid character class name" ]]
- [[205][REG_EESCAPE]["Trailing backslash" ]]
- [[206][REG_ESUBREG]["Invalid back reference" ]]
- [[207][REG_EBRACK]["Unmatched \[ or \[^" ]]
- [[208][REG_EPAREN]["Unmatched ( or \\(" ]]
- [[209][REG_EBRACE]["Unmatched \\{" ]]
- [[210][REG_BADBR]["Invalid content of \\{\\}" ]]
- [[211][REG_ERANGE]["Invalid range end" ]]
- [[212][REG_ESPACE]["Memory exhausted" ]]
- [[213][REG_BADRPT]["Invalid preceding regular expression" ]]
- [[214][REG_EEND]["Premature end of regular expression" ]]
- [[215][REG_ESIZE]["Regular expression too big" ]]
- [[216][REG_ERPAREN]["Unmatched ) or \\)" ]]
- [[217][REG_EMPTY]["Empty expression" ]]
- [[218][REG_E_UNKNOWN]["Unknown error" ]]
- ]
- Custom character class names are loaded as follows:
- [table
- [[Message ID][Description][Equivalent default class name ]]
- [[300][The character class name for alphanumeric characters.]["alnum" ]]
- [[301][The character class name for alphabetic characters.]["alpha" ]]
- [[302][The character class name for control characters.]["cntrl" ]]
- [[303][The character class name for digit characters.]["digit" ]]
- [[304][The character class name for graphics characters.]["graph" ]]
- [[305][The character class name for lower case characters.]["lower" ]]
- [[306][The character class name for printable characters.]["print" ]]
- [[307][The character class name for punctuation characters.]["punct" ]]
- [[308][The character class name for space characters.]["space" ]]
- [[309][The character class name for upper case characters.]["upper" ]]
- [[310][The character class name for hexadecimal characters.]["xdigit" ]]
- [[311][The character class name for blank characters.]["blank" ]]
- [[312][The character class name for word characters.]["word" ]]
- [[313][The character class name for Unicode characters.]["unicode" ]]
- ]
- Finally, custom collating element names are loaded starting from message
- id 400, and terminating when the first load thereafter fails. Each message
- looks something like: "tagname string" where tagname is the name used
- inside \[\[.tagname.\]\] and string is the actual text of the collating element.
- Note that the value of collating element \[\[.zero.\]\] is used for the
- conversion of strings to numbers - if you replace this with another value then
- that will be used for string parsing. For example, use the Unicode
- character 0x0660 for \[\[.zero.\]\] if you want to use Unicode Arabic-Indic
- digits in your regular expressions in place of Latin digits.
- Note that the POSIX defined names for character classes and collating elements
- are always available, even if custom names are defined; in contrast,
- custom error messages and custom syntax messages replace the default ones.
- [endsect]