//
// Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
//
// Distributed under the Boost Software License, Version 1.0. (See
// accompanying file LICENSE_1_0.txt or copy at
// http://www.boost.org/LICENSE_1_0.txt)
//
// vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 filetype=cpp.doxygen
/*!
\page charset_handling Character Set Conversions

\section codecvt Convenience Interface

Boost.Locale provides the \ref boost::locale::conv::to_utf() "to_utf", \ref boost::locale::conv::from_utf() "from_utf" and
\ref boost::locale::conv::utf_to_utf() "utf_to_utf" functions in
the \c boost::locale::conv namespace. They are simple,
convenient functions for converting strings between
UTF-8/16/32 and other encodings.

For example:
\code
    using namespace boost::locale::conv;

    // latin1_string is an existing std::string encoded in Latin1
    std::string  utf8_string  = to_utf<char>(latin1_string,"Latin1");
    std::wstring wide_string  = to_utf<wchar_t>(latin1_string,"Latin1");
    // A new name is used here so that latin1_string is not redeclared
    std::string  latin1_again = from_utf(wide_string,"Latin1");
    std::string  utf8_string2 = utf_to_utf<char>(wide_string);
\endcode

\c to_utf and \c from_utf accept either an explicit encoding name such as "Latin1" or "ISO-8859-8",
or a \c std::locale from which the encoding is fetched.
All three functions also take a policy parameter that tells them how to behave if the
conversion can't be performed (that is, an illegal or unsupported character is found).
By default they skip all illegal characters and do the best they
can. However, it is possible to ask them to throw
a \ref boost::locale::conv::conversion_error "conversion_error" exception
by passing the \c stop flag:
\code
    std::wstring s=to_utf<wchar_t>("\xFF\xFF","UTF-8",stop);
    // Throws because this string is illegal in UTF-8
\endcode

\section codecvt_codecvt std::codecvt facet

Boost.Locale provides stream codepage conversion facets based on the \c std::codecvt facet.
This allows conversion between wide-character encodings and 8-bit encodings like UTF-8, ISO-8859 or Shift-JIS.

Most compilers provide such facets, but:

- Under Windows, MSVC does not support UTF-8 encodings at all.
- Under Linux, the encodings are supported only if the required locales are generated. For example,
  it may be impossible to create a \c he_IL.CP1255 locale even when the \c he_IL locale is available.

Thus Boost.Locale provides an option to generate code-page conversion facets for use with
Boost.Iostreams filters or \c std::wfstream. For example:
\code
    using namespace boost::locale;

    std::locale loc = generator().generate("he_IL.UTF-8");
    std::wofstream file;
    file.imbue(loc);
    file.open("hello.txt");
    file << L"שלום!" << std::endl;
\endcode

This creates a file \c hello.txt encoded as UTF-8 with "שלום!" (shalom) in it.

\section codecvt_iostreams_integration Integration with Boost.Iostreams

You can use the \c std::codecvt facet directly, but this is quite tricky and
requires accurate buffer and error management.

Instead, you can use the \c boost::iostreams::code_converter class for stream-oriented
conversions between the wide-character set and narrow locale character set.

This is a sample program that converts wide to narrow characters for an arbitrary
stream:
\code
#include <boost/iostreams/stream.hpp>
#include <boost/iostreams/categories.hpp>
#include <boost/iostreams/code_converter.hpp>

#include <boost/locale.hpp>
#include <iostream>

namespace io = boost::iostreams;

// Device that consumes the converted text.
// In our case it just writes to standard output
class consumer {
public:
    typedef char char_type;
    typedef io::sink_tag category;
    std::streamsize write(const char* s, std::streamsize n)
    {
        std::cout.write(s,n);
        return n;
    }
};

int main()
{
    // the device that converts wide characters
    // to narrow
    typedef io::code_converter<consumer> converter_device;
    // the stream that uses this device
    typedef io::stream<converter_device> converter_stream;

    consumer cons;
    // set up our converter to work
    // with he_IL.UTF-8 locale
    converter_device dev;
    boost::locale::generator gen;
    dev.imbue(gen("he_IL.UTF-8"));
    dev.open(cons);
    converter_stream stream;
    stream.open(dev);
    // Now wide characters that are written
    // to the stream would be given to
    // our consumer as narrow characters
    // in UTF-8 encoding
    stream << L"שלום" << std::flush;
}
\endcode
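
For comparison, the sketch below shows the kind of bookkeeping that direct use of \c std::codecvt involves:
a fixed output buffer that has to be refilled, a result code that has to be checked after every call,
and a final \c unshift step. It uses only the standard library. \c narrow_with_codecvt is a hypothetical helper,
and the buffer size and the choice to throw on errors are illustrative assumptions, not Boost.Locale behavior.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a wide string to narrow characters
// using the std::codecvt facet installed in loc.
std::string narrow_with_codecvt(const std::wstring& in, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    std::mbstate_t state = std::mbstate_t(); // initial conversion state
    std::string out;
    char buf[16];                            // small on purpose, to exercise refills

    const wchar_t* from = in.data();
    const wchar_t* const from_end = from + in.size();

    while (from != from_end) {
        const wchar_t* from_next = from;
        char* to_next = buf;
        const std::codecvt_base::result r =
            cvt.out(state, from, from_end, from_next, buf, buf + sizeof(buf), to_next);

        // noconv is not meaningful for a wchar_t to char conversion,
        // so it is treated as a failure here
        if (r == std::codecvt_base::error || r == std::codecvt_base::noconv)
            throw std::runtime_error("codecvt: conversion failed");
        // Guard against looping forever if nothing was consumed or produced
        if (from_next == from && to_next == buf)
            throw std::runtime_error("codecvt: no progress");

        out.append(buf, static_cast<std::size_t>(to_next - buf));
        from = from_next; // ok or partial: continue with the remaining input
    }

    // Let the facet emit any closing sequence and return to the initial state
    char* to_next = buf;
    if (cvt.unshift(state, buf, buf + sizeof(buf), to_next) == std::codecvt_base::error)
        throw std::runtime_error("codecvt: unshift failed");
    out.append(buf, static_cast<std::size_t>(to_next - buf));
    return out;
}
```

Calling it with \c std::locale::classic() converts plain ASCII text. In principle the same loop applies to a locale
carrying a Boost.Locale facet, which is the work that \c code_converter does on your behalf.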

\section codecvt_limitations Limitations of std::codecvt

The Standard does not specify the contents of \c std::mbstate_t, which would be needed to save
intermediate code-page conversion states. It leaves the definition up to the compiler implementation, making it
impossible to reimplement <tt>std::codecvt<wchar_t,char,mbstate_t></tt> for stateful encodings.

Thus, Boost.Locale's \c codecvt facet implementation may be used with stateless encodings like UTF-8,
ISO-8859, and Shift-JIS, but not with stateful encodings like UTF-7 or SCSU.
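
To illustrate what "stateless" buys in practice: in UTF-8 the lead byte alone determines the length of a sequence,
so code that processes data in chunks only has to hold back an incomplete trailing sequence and needs no hidden
shift state. The sketch below uses only the standard library. \c utf8_sequence_length and \c complete_prefix_length
are hypothetical helpers, not part of Boost.Locale, and the sketch does not validate continuation bytes.

```cpp
#include <cstddef>
#include <string>

// Length of a UTF-8 sequence as implied by its lead byte,
// or 0 if the byte cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation byte or invalid lead
}

// Number of leading bytes of chunk that form complete sequences.
// The remaining bytes belong to a sequence that continues in the next chunk.
std::size_t complete_prefix_length(const std::string& chunk)
{
    std::size_t pos = 0;
    while (pos < chunk.size()) {
        std::size_t len = utf8_sequence_length(static_cast<unsigned char>(chunk[pos]));
        if (len == 0)
            len = 1;                      // step over a stray byte; the error policy deals with it later
        if (pos + len > chunk.size())
            break;                        // incomplete sequence at the end of the chunk
        pos += len;
    }
    return pos;
}
```

A stateful encoding offers no such shortcut, because the meaning of a byte depends on shift sequences seen earlier
in the stream.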

\b Recommendation: Prefer the Unicode UTF-8 encoding for \c char based strings and files in your application.

\note
The implementation of codecvt for single byte encodings like ISO-8859-X and for UTF-8 is very efficient
and allows fast conversion of the content. However, its performance may be sub-optimal for
double-width encodings like Shift-JIS, due to the lack of usable conversion state described above.

*/