//
// Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
//
// Distributed under the Boost Software License, Version 1.0. (See
// accompanying file LICENSE_1_0.txt or copy at
// http://www.boost.org/LICENSE_1_0.txt)
//
// vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 filetype=cpp.doxygen

/*!
\page rationale Design Rationale

- \ref rationale_why
- \ref why_icu
- \ref why_icu_wrapper
- \ref why_icu_api_is_hidden
- \ref why_gnu_gettext
- \ref why_posix_names
- \ref why_linear_chunks
- \ref why_abstract_api
- \ref why_no_special_character_type

\section rationale_why Why is it needed?

Why do we need a localization library, when standard C++ facets (should) provide most of the required functionality:

- Case conversion is done using the \c std::ctype facet
- Collation is supported by \c std::collate and has nice integration with \c std::locale
- There are \c std::num_put , \c std::num_get , \c std::money_put , \c std::money_get , \c std::time_put and \c std::time_get for numbers, time, and currency formatting and parsing.
- There is a \c std::messages class that supports localized message formatting.

So why do we need such a library if we have all the functionality within the standard library?

Almost every(!) facet has design flaws:

- \c std::collate supports only one level of collation, not allowing you to choose whether case- or accent-sensitive comparisons should be performed.

- \c std::ctype , which is responsible for case conversion, assumes that all conversions can be done on a per-character basis. This is probably correct for many languages, but it isn't correct in general.
  \n
  -# Case conversion may change a string's length. For example, the German word "grüßen" should be converted to "GRÜSSEN" in upper case: the letter "ß" should be converted to "SS", but the \c toupper function works on a single-character basis.
  -# Case conversion is context-sensitive. For example, the Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the Greek letter "Σ" is converted to "σ" or to "ς", depending on its position in the word.
  -# Case conversion cannot assume that a character is a single code point. That assumption is incorrect for both the UTF-8 and UTF-16 encodings, where individual code points are represented by up to 4 \c char 's or two \c wchar_t 's on the Windows platform. This makes \c std::ctype totally useless with these encodings.

- \c std::numpunct and \c std::moneypunct do not specify the code points for digit representation at all, so they cannot format numbers with the digits used under Arabic locales. For example, the number "103" is expected to be displayed as "١٠٣" in the \c ar_EG locale.
  \n
  \c std::numpunct and \c std::moneypunct assume that the thousands separator is a single character. This is untrue for the UTF-8 encoding, where only the Unicode 0-0x7F range can be represented as a single character. As a result, localized numbers can't be represented correctly under locales that use the Unicode "EN SPACE" character for the thousands separator, such as Russian.
  \n
  This actually causes real problems under the GCC and SunStudio compilers, where formatting numbers under a Russian locale creates invalid UTF-8 sequences.

- \c std::time_put and \c std::time_get have several flaws:
  -# They assume that the calendar is always Gregorian, by using \c std::tm for time representation, ignoring the fact that in many countries dates may be displayed using different calendars.
  -# They always use a global time zone, not allowing specification of the time zone for formatting. The standard \c std::tm doesn't even include a timezone field at all.
  -# \c std::time_get is not symmetric with \c std::time_put , so you cannot parse dates and times created with \c std::time_put . (This issue is addressed in C++0x and in some STL implementations, such as the Apache standard C++ library.)

- \c std::messages does not provide support for plural forms, making it impossible to correctly localize such simple strings as "There are X files in the directory".

Also, many features are not really supported by \c std::locale at all: timezones (as mentioned above), text boundary analysis, number spelling, and many others. So it is clear that the standard C++ locales are problematic for real-world applications.
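To make the contrast concrete, here is a minimal sketch of the string-level case conversion Boost.Locale offers instead. It assumes the \c boost::locale::to_upper function and the \c boost::locale::generator class described elsewhere in this documentation; the \c de_DE.UTF-8 locale name is only illustrative.

\code
#include <boost/locale.hpp>
#include <iostream>

int main()
{
    // Build a locale carrying Boost.Locale facets; the locale name is an
    // illustrative assumption and depends on what the backend supports.
    boost::locale::generator gen;
    std::locale de = gen("de_DE.UTF-8");

    // Whole-string conversion: "grüßen" becomes "GRÜSSEN" (the string grows),
    // which a per-character std::toupper call cannot produce.
    // Assumes the source file and the locale both use UTF-8.
    std::cout << boost::locale::to_upper("grüßen", de) << std::endl;
}
\endcode

Because the conversion works on whole strings under an explicit \c std::locale object, length changes and context-sensitive mappings such as the German and Greek examples above are handled naturally.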
\section why_icu Why use an ICU wrapper instead of ICU?

ICU is a very good localization library, but it has several serious flaws:

- It is absolutely unfriendly to C++ developers. It ignores popular C++ idioms (the STL, RTTI, exceptions, etc.), instead mostly mimicking the Java API.
- It provides support for only one kind of string, UTF-16, when some users may want other Unicode encodings. For example, for XML or HTML processing UTF-8 is much more convenient, and UTF-32 is easier to use. Also, there is no support for "narrow" encodings that are still very popular, such as the ISO-8859 encodings.

Boost.Locale, in contrast, provides direct integration with \c iostream and allows a more natural way of formatting data. For example:

\code
    cout << "You have " << as::currency << 134.45
         << " in your account as of " << as::datetime << std::time(0) << endl;
\endcode
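The line above is only a fragment; a small sketch of how such a stream is typically set up follows. The \c en_US.UTF-8 locale name is an assumption, and the \c generator class is described elsewhere in this documentation.

\code
#include <boost/locale.hpp>
#include <ctime>
#include <iostream>

using namespace boost::locale;

int main()
{
    // The as::currency / as::datetime manipulators take effect only on a stream
    // imbued with a locale generated by Boost.Locale; the locale name is illustrative.
    generator gen;
    std::locale::global(gen("en_US.UTF-8"));
    std::cout.imbue(std::locale());

    std::cout << "You have " << as::currency << 134.45
              << " in your account as of " << as::datetime << std::time(0) << std::endl;
}
\endcode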