16099b4c7d
[SVN r77554]
http://www.linuxfromscratch.org/blfs/view/svn/introduction/locale-issues.html

"The POSIX standard mandates that the filename encoding is the encoding implied by the current LC_CTYPE locale category."

-------

http://mail.nl.linux.org/linux-utf8/2001-02/msg00103.html

From: Markus Kuhn

Tom Tromey wrote on 2001-02-05 00:36 UTC:
> Kai> IMAO, a *real* filesystem should use some encoding of ISO 10646 -
> Kai> UTF-8, UTF-16, or UTF-32 are all viable options. The same should
> Kai> be true for the kernel filename interfaces.
>
> I like this, but what should I do right now?

The POSIX kernel file system interface is engraved into stone and
extremely unlikely to change. File names are arbitrary binary strings,
with only the '/' and '\0' bytes having any special semantics. You can
use arbitrary coded character sets on it as long as they do not
introduce '/' and '\0' bytes spuriously. Writers and readers have to
somehow agree on what encoding to use and the only really practical way
is to use the same encoding on all systems that share files. Eventually,
everyone will be using UTF-8 for file names on POSIX systems. Right now,
I would recommend users to use only ASCII for filenames, as this is
already UTF-8 and therefore simplifies migration. Using the ISO 8859,
JIS, etc. filenames should soon be considered deprecated practice.

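Note: the "no spurious '/' or '\0' bytes" condition above can be checked mechanically. The following is a minimal Python sketch, not anything from the quoted mail; the function name is_safe_component is hypothetical, and the encodings are just examples chosen to show one that passes and one that fails.

```python
# Bytes that have special meaning to the POSIX kernel file name interface.
SPECIAL_BYTES = (b"/", b"\0")

def is_safe_component(name: str, encoding: str) -> bool:
    """Return True if `name`, encoded with `encoding`, contains neither
    a '/' byte nor a NUL byte (hypothetical helper for illustration)."""
    raw = name.encode(encoding)
    return not any(special in raw for special in SPECIAL_BYTES)

if __name__ == "__main__":
    japanese = "\u65e5\u672c\u8a9e.txt"  # a Japanese file name plus ".txt"
    print(is_safe_component(japanese, "utf-8"))       # multi-byte units are all >= 0x80
    print(is_safe_component(japanese, "shift_jis"))   # trail bytes never equal 0x2F or 0x00
    print(is_safe_component("abc.txt", "utf-16-le"))  # every ASCII letter gains a 0x00 byte
```

UTF-16 fails the check even for plain ASCII names, which is one reason it is not usable as a byte-level file name encoding on POSIX.
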
> I work on libgcj, the runtime component of gcj, the Java front end to
> GCC. In libgcj of course we use UCS-2 everywhere, since that is what
> Java does. Currently, for Unixy systems, we assume that all file
> names are UTF-8.

The best solution is to assume that the file names are in the
locale-specific multi-byte encoding. Simply use mbrtowc and wcrtomb to
convert between Unicode and the locale-dependent multi-byte encoding
used in file names and text files if the ISO C 99 symbol
__STDC_ISO_10646__ is defined (which guarantees that wchar_t = UCS). On
Linux, this has been the case since glibc 2.2.

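Note: the mail recommends the C functions mbrtowc and wcrtomb. As a rough analogue only, the Python sketch below converts between file name bytes and Unicode text using the encoding implied by the current locale. The helper names are hypothetical, and the optional codeset parameter is an assumption added so the behaviour can be shown without changing the process locale.

```python
import locale

def filename_to_text(raw: bytes, codeset: str = "") -> str:
    """Decode file name bytes with the locale's encoding (or `codeset`).
    Strict decoding: invalid byte sequences raise UnicodeDecodeError."""
    codeset = codeset or locale.getpreferredencoding(False)
    return raw.decode(codeset)

def text_to_filename(name: str, codeset: str = "") -> bytes:
    """Encode Unicode text into the locale's multi-byte encoding (or `codeset`)."""
    codeset = codeset or locale.getpreferredencoding(False)
    return name.encode(codeset)

if __name__ == "__main__":
    name = "\u00e9.txt"  # e-acute followed by ".txt"
    # The same name yields different bytes under different locale encodings,
    # so writer and reader must agree on the encoding.
    print(text_to_filename(name, "utf-8"))    # b'\xc3\xa9.txt'
    print(text_to_filename(name, "latin-1"))  # b'\xe9.txt'
```

Python's own os.fsencode and os.fsdecode do a similar job with a more forgiving error handler; the strict version here is closer in spirit to treating a failed conversion as an error.
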
> (Actually, we do something notably worse, which is
> assume that file names are Java-style UTF-8, with the weird encoding
> for \u0000.)

\u0000 = NUL was never a character allowed in filenames under POSIX.
Raise an exception if someone tries to use it in a filename. Problem
solved.

I never understood, why Java found it necessary to introduce two
distinct ASCII NUL characters.

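Note: a small Python sketch of the two points above, under stated assumptions. encode_filename is a hypothetical helper that rejects U+0000 as the reply suggests. The constants show the "two NULs": standard UTF-8 encodes U+0000 as the single byte 00, while Java's modified UTF-8 uses the two-byte form C0 80, which a strict UTF-8 decoder refuses.

```python
def encode_filename(name: str, codeset: str = "utf-8") -> bytes:
    """Encode a file name, raising ValueError if it contains U+0000."""
    if "\0" in name:
        raise ValueError("file name contains U+0000 (NUL)")
    return name.encode(codeset)

STANDARD_NUL = "\0".encode("utf-8")  # the single byte 0x00
JAVA_MODIFIED_NUL = b"\xc0\x80"      # overlong two-byte form used by Java

def is_strict_utf8(raw: bytes) -> bool:
    """Return True if `raw` is accepted by Python's strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

if __name__ == "__main__":
    print(encode_filename("notes.txt"))
    print(STANDARD_NUL, is_strict_utf8(JAVA_MODIFIED_NUL))
    try:
        encode_filename("bad\0name")
    except ValueError as exc:
        print("rejected:", exc)
```
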
------

Interesting idea: use iconv to create Shift-JIS or other MBCS (multi-byte
character set) test cases.
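One possible way to generate such test data is sketched below. It uses Python's built-in shift_jis codec as a stand-in for the iconv tool, so the exact bytes iconv produces on a given platform are an assumption to verify. The helper name mbcs_test_names and the sample names are hypothetical. The first sample is chosen because its second Shift-JIS byte is 0x5C, the ASCII backslash, which is a classic trap for byte-oriented path parsing.

```python
def mbcs_test_names(names, encoding="shift_jis"):
    """Return (text, raw_bytes) pairs for each sample file name."""
    return [(name, name.encode(encoding)) for name in names]

SAMPLES = [
    "\u8868.txt",              # U+8868 encodes as 0x95 0x5C in Shift-JIS
    "\u65e5\u672c\u8a9e.txt",  # an ordinary three-character Japanese name
]

if __name__ == "__main__":
    for text, raw in mbcs_test_names(SAMPLES):
        # Show the code points and the encoded bytes in hex.
        print(ascii(text), raw.hex())
```

Each pair gives both the expected Unicode text and the raw bytes, so a test can check conversion in either direction.
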