On 3/10/06, Michal S. wrote:
their native filename format, then we’re going to be much better.
That will, however, break some assumptions by really stupid
programs.)
Why the hell utf-16? It is no longer compatible with ascii, yet 16
bits are far from sufficient to cover current unicode. So you still
get multiword characters. It is not even dword aligned for fast
processing by current cpus. I would like utf-8 for compatibility, and
utf-32 for easy string processing. But I do not see much use for
utf-16.
UTF-16 is actually pretty performant and the implementation of wchar_t
on MacOS X and Windows is (you guessed it!) UTF-16. The filesystems for
these two operating systems (which have far better Unicode
support than anything else) both use UTF-16 as the native filename
encoding (this is true for HFS+, NTFS4, and NTFS5). The only difference
between what MacOS X does and Windows does for this is that Apple chose
to use decomposed characters instead of composed characters (e.g.,
LOWERCASE E + COMBINING ACUTE ACCENT instead of LOWERCASE E ACUTE
ACCENT).
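The composed/decomposed difference is easy to see in a minimal Python sketch. Python's unicodedata module and its standard NFC/NFD forms are my choice of illustration here, not anything taken from HFS+ or NTFS; as far as I know, what HFS+ stores is close to NFD but not identical to it. The helper utf16_units is a hypothetical name.

```python
import unicodedata

# Two spellings of the same visible character.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (precomposed)
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT


def utf16_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


# The raw strings differ until both are normalized to a common form.
to_decomposed = unicodedata.normalize("NFD", composed)
to_composed = unicodedata.normalize("NFC", decomposed)

print(composed == decomposed)
print(to_decomposed == decomposed, to_composed == composed)
print(utf16_units(composed), utf16_units(decomposed))
```

A filename comparison that skips normalization would treat the two spellings as different names, which is why the choice matters once files move between the two systems.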
Look at the performance numbers for ICU4C: it’s pretty damn good. UTF-32
isn’t exactly space conservative (since with UTF-16 most of the BMP
can be represented with a single wchar_t, and only a few need surrogates
taking up exactly two wchar_ts, whereas every character would take up
a full four-byte uint32_t under UTF-32). ICU4C uses UTF-16 internally. Exclusively.
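A minimal Python sketch of the code-unit arithmetic, using Python's built-in codecs rather than ICU4C. The two sample characters and the helper name code_units are my own picks for illustration.

```python
def code_units(text, encoding, unit_size):
    """Count fixed-size code units in text for a BOM-less encoding."""
    return len(text.encode(encoding)) // unit_size


bmp_char = "\u65e5"          # U+65E5, a CJK ideograph inside the BMP
astral_char = "\U0001D11E"   # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

for label, ch in (("BMP", bmp_char), ("non-BMP", astral_char)):
    utf16 = code_units(ch, "utf-16-le", 2)   # 16-bit units
    utf32 = code_units(ch, "utf-32-le", 4)   # 32-bit units
    print(label, "UTF-16 units:", utf16, "UTF-32 units:", utf32)

# The non-BMP character is stored as a high/low surrogate pair.
surrogate_pair = astral_char.encode("utf-16-be")
print(surrogate_pair.hex())
```

In bytes, the BMP character costs 2 in UTF-16 against 4 in UTF-32, and the non-BMP character costs 4 in both.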
On 3/10/06, Anthony DeRobertis wrote:
Austin Z. wrote:
Unix support for Unicode is still in the stone ages because of the
nonsense that POSIX put on Unix ages ago. (When Unix filesystems can
write UTF-16 as their native filename format, then we’re going to be
much better. That will, however, break some assumptions by really
stupid programs.)
Ummm, no. UTF-16 filenames would break every correctly-implemented
UNIX program: UTF-16 allows the octet 0x00, which has always been the
end-of-string marker.
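The 0x00 point can be shown with a small Python sketch. The filename is made up, and c_strlen is a hypothetical stand-in for how a byte-oriented C routine would find the end of a string; it is not a real API.

```python
def c_strlen(buf):
    """Length a NUL-terminated byte-string routine would report for buf."""
    idx = buf.find(b"\x00")
    return len(buf) if idx == -1 else idx


name = "readme.txt"                 # hypothetical filename, plain ASCII
as_utf8 = name.encode("utf-8")
as_utf16 = name.encode("utf-16-le")

print(b"\x00" in as_utf8, b"\x00" in as_utf16)
print(c_strlen(as_utf8), c_strlen(as_utf16))
```

Every ASCII character in UTF-16 carries a zero octet, so a byte-oriented program stops after the first letter, while the UTF-8 form passes through untouched.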
You’re right. And I’m saying that I don’t care. People need to stop
thinking in terms of bytes (octets) and start thinking in terms of
characters. I’ll say it flat out here: the POSIX filesystem definition
is going to badly limit what can be done with Unix systems. One could do
what I think Apple has done and provide two filesystem
interfaces that are synchronized. The native interface – and the more
efficient one – will be using UTF-16 because that’s what HFS+ speaks.
The secondary interface (that also works on UFS filesystems) would
translate to UTF-8 and/or follow the nonsensical POSIX rules for native
encodings.
Personally, my file names have been in UTF-8 for quite some time now,
and it works well: What exactly is this ‘stone age’ you refer to?
Change an environment variable and watch the programs that had worked
so well with Unicode break. That is the stone age that I refer to.
I’m also guessing that you don’t do much with long Japanese filenames or
deep paths that involve anything except US-ASCII (a subset of UTF-8).
UTF-8 can take multiple octets to represent a character. So can UTF-16,
UTF-32, and every other variation of Unicode.
This last statement is true only because you use the term “octet.” It’s
a useless term here, because UTF-8 only has any level of efficiency for
US-ASCII. Even if you step to European content, UTF-8 is no longer
perfectly efficient, and when you step to Asian content, UTF-8 is so
bloody inefficient that most folks who have to deal with it would rather
work in a native encoding (EUC-JP or SJIS, anyone?) which is 1…2 bytes
or do everything in UTF-16.
Depending on content, a string in UTF-8 can consume more octets than
the same string in UTF-16, or vice versa.
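A rough Python sketch of how the octet counts move with content. The three sample strings and the helper encoded_sizes are my own choices, and Python's euc_jp codec stands in for the native Japanese encoding.

```python
samples = {
    "ascii": "filename",
    "german": "Gr\u00fc\u00dfe",        # "Grüße"
    "japanese": "\u65e5\u672c\u8a9e",   # "日本語"
}


def encoded_sizes(text):
    """Octets needed for text in UTF-8 and in BOM-less UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


for label, text in samples.items():
    print(label, encoded_sizes(text))

# The same Japanese string in a native two-byte encoding.
euc_jp_size = len(samples["japanese"].encode("euc_jp"))
print("japanese euc_jp:", euc_jp_size)
```

For these samples UTF-8 wins on the ASCII and German strings and loses on the Japanese one, where it needs three octets per character against two for UTF-16 and EUC-JP.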
Ah! But wait. I can see an advantage to UTF-16. With UTF-8, you don’t
get to have the fun of picking between big- and little-endian!
Are people always this stupid when it comes to things that they clearly
don’t understand? Yes, UTF-16 may have the problem of not knowing if
you’re dealing with UTF-16BE or UTF-16LE, but it’s my understanding that
this is only an issue when you’re dealing with both on the same
system. Additionally, most platforms specify a default. It’s been a
while (almost a year), but I think that ICU4C defaults to UTF-16BE
internally, not just UTF-16.
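A small Python sketch of the byte-order question, using Python's codecs module rather than ICU4C, so it says nothing about what ICU4C actually defaults to. The sample text is arbitrary.

```python
import codecs

text = "A"                               # U+0041
big_endian = text.encode("utf-16-be")    # octets 00 41
little_endian = text.encode("utf-16-le") # octets 41 00

# With a byte order mark in front, the generic "utf-16" decoder
# works out the byte order by itself and strips the mark.
from_be = (codecs.BOM_UTF16_BE + big_endian).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + little_endian).decode("utf-16")
print(from_be == text, from_le == text)

# Without a mark or an agreed default, a wrong guess still decodes,
# just to a different character (here U+4100 instead of U+0041).
wrong_guess = big_endian.decode("utf-16-le")
print(hex(ord(wrong_guess)))
```

So the ambiguity only bites when unmarked data crosses between producers that disagree on byte order; a platform default or a byte order mark removes it.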
There. Problem solved.
If you’re going to babble on about Unicode, it’d be nice if you knew
more than the knee-jerk stuff you’ve posted so far. Either of you.
-austin