- http://www.linuxfromscratch.org/blfs/view/svn/introduction/locale-issues.html
- "The POSIX standard mandates that the filename encoding is the encoding implied by the current LC_CTYPE locale category."
- -------
- http://mail.nl.linux.org/linux-utf8/2001-02/msg00103.html
- From: Markus Kuhn
- Tom Tromey wrote on 2001-02-05 00:36 UTC:
- > Kai> IMAO, a *real* filesystem should use some encoding of ISO 10646 -
- > Kai> UTF-8, UTF-16, or UTF-32 are all viable options. The same should
- > Kai> be true for the kernel filename interfaces.
- >
- > I like this, but what should I do right now?
- The POSIX kernel file system interface is engraved into stone and
- extremely unlikely to change. File names are arbitrary binary strings,
- with only the '/' and '\0' bytes having any special semantics. You can
- use arbitrary coded character sets on it as long as they do not
- introduce '/' and '\0' bytes spuriously. Writers and readers have to
- somehow agree on what encoding to use and the only really practical way
- is to use the same encoding on all systems that share files. Eventually,
- everyone will be using UTF-8 for file names on POSIX systems. Right now,
- I would recommend users to use only ASCII for filenames, as this is
- already UTF-8 and therefore simplifies migration. Using the ISO 8859,
- JIS, etc. filenames should soon be considered deprecated practice.
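As an illustration of the point above (a sketch, not part of the quoted message): the check below encodes a single path component and reports whether the resulting bytes contain '/' or '\0'. The function names `is_safe_component` and `is_ascii_only` are hypothetical, chosen for this example.

```python
def is_safe_component(name: str, encoding: str) -> bool:
    """Return True if `name`, encoded with `encoding`, contains no '/' or NUL byte.

    `name` is meant to be one path component, so neither byte should appear.
    """
    try:
        raw = name.encode(encoding)
    except UnicodeEncodeError:
        # The encoding cannot represent this name at all.
        return False
    return b"/" not in raw and b"\x00" not in raw


def is_ascii_only(name: str) -> bool:
    """Return True if the name is pure ASCII, which is also valid UTF-8."""
    return name.isascii()
```

UTF-8 and ISO 8859-1 pass this check for ordinary names, while a 16-bit encoding such as UTF-16 fails it: every ASCII letter gains a zero byte, and a character like U+012F yields a 0x2F ('/') byte.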
- > I work on libgcj, the runtime component of gcj, the Java front end to
- > GCC. In libgcj of course we use UCS-2 everywhere, since that is what
- > Java does. Currently, for Unixy systems, we assume that all file
- > names are UTF-8.
- The best solution is to assume that the file names are in the
- locale-specific multi-byte encoding. Simply use mbrtowc and wcrtomb to
- convert between Unicode and the locale-dependent multi-byte encoding
- used in file names and text files if the ISO C 99 symbol
- __STDC_ISO_10646__ is defined (which guarantees that wchar_t = UCS). On
- Linux, this has been the case since glibc 2.2.
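The same idea can be sketched in Python, where the codec machinery stands in for the mbrtowc/wcrtomb pair named above. This is an analogue under stated assumptions, not the C code the message describes: `decode_filename` and `encode_filename` are hypothetical helpers, and the default encoding is whatever `locale.getpreferredencoding(False)` reports for the current locale.

```python
import locale
from typing import Optional


def locale_encoding() -> str:
    """Name of the encoding implied by the current locale (no setlocale call)."""
    return locale.getpreferredencoding(False)


def decode_filename(raw: bytes, encoding: Optional[str] = None) -> str:
    """Bytes from the file system -> Unicode string (the mbrtowc direction)."""
    # Strict decoding: bytes that are invalid in the encoding raise
    # UnicodeDecodeError instead of being silently replaced.
    return raw.decode(encoding or locale_encoding())


def encode_filename(name: str, encoding: Optional[str] = None) -> bytes:
    """Unicode string -> bytes for the file system (the wcrtomb direction)."""
    return name.encode(encoding or locale_encoding())
```

Passing the encoding explicitly, for example `decode_filename(raw, "shift_jis")`, makes the behaviour independent of the environment. Python's own `os.fsdecode` and `os.fsencode` do a similar job for real file names, with a more lenient error handler.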
- > (Actually, we do something notably worse, which is
- > assume that file names are Java-style UTF-8, with the weird encoding
- > for \u0000.)
- \u0000 = NUL was never a character allowed in filenames under POSIX.
- Raise an exception if someone tries to use it in a filename. Problem
- solved.
- I never understood, why Java found it necessary to introduce two
- distinct ASCII NUL characters.
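A minimal sketch of the "raise an exception" advice, again outside the quoted message. The names `checked_filename` and `has_modified_utf8_nul` are hypothetical. The second helper looks for the two-byte sequence 0xC0 0x80, which is how Java's modified UTF-8 encodes U+0000 and which never occurs in standard UTF-8.

```python
# Java's modified UTF-8 encodes U+0000 as these two bytes.
MODIFIED_UTF8_NUL = b"\xc0\x80"


def checked_filename(name: str, encoding: str = "utf-8") -> bytes:
    """Encode a file name, refusing any name that contains U+0000."""
    if "\x00" in name:
        raise ValueError("U+0000 is not allowed in a POSIX file name")
    return name.encode(encoding)


def has_modified_utf8_nul(raw: bytes) -> bool:
    """Return True if UTF-8 data contains the Java-style encoding of U+0000."""
    return MODIFIED_UTF8_NUL in raw
```

Rejecting the name before encoding means the "weird" two-byte NUL never has to be written to disk in the first place.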
- ------
- Interesting idea: use iconv to create Shift-JIS or other multi-byte (MBCS) test cases.
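One way to generate such test cases, sketched with Python's built-in codecs rather than the iconv tool itself (on the command line the rough equivalent would be `iconv -f UTF-8 -t SHIFT_JIS`). The helper name `mbcs_test_cases` is hypothetical.

```python
from typing import Dict, Iterable, Tuple


def mbcs_test_cases(
    names: Iterable[str], encodings: Iterable[str]
) -> Dict[Tuple[str, str], bytes]:
    """Encode every name in every encoding, skipping unencodable combinations."""
    encodings = list(encodings)  # allow reuse across names
    cases: Dict[Tuple[str, str], bytes] = {}
    for name in names:
        for enc in encodings:
            try:
                cases[(name, enc)] = name.encode(enc)
            except UnicodeEncodeError:
                # This encoding has no representation for the name.
                continue
    return cases
```

A useful Shift-JIS case is a name containing 表, whose second byte is 0x5C (the ASCII backslash). Code that scans file names byte by byte for special characters tends to trip over it.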