Unicode in Ruby

On 3/12/06, Austin Z. [email protected] wrote:

instead of LOWERCASE E ACUTE ACCENT).
simple type. You cannot have character arrays. And no library can
completely wrap this inconsistency and isolate you from dealing with
it.

If you’re simply dealing with text, you don’t need arrays of characters.
Frankly, if you don’t care what Windows, OS X, and ICU use, then you’re
completely ignorant of the real world and what is useful and necessary
for Unicode.

The native encoding is bound to be different between platforms. I want
to use an encoding that I like on all platforms, and convert the
strings for filenames or whatever to fit the current platform. That is
why I do not care what a particular platform you name uses.

By the way, you are wrong – you can have arrays of characters. It’s
just that those characters are not guaranteed to be a fixed length. It
will be the same with Ruby moving forward.

Yes, you can have arrays of strings. Nice. But to turn a text string
into a string of characters you have to turn it into an array of
strings, instead of just indexing an array of basic types that
represent the characters.

And there is a need to look at the actual characters at times. There
are programs that actually process the text, not only save what the
user entered in a web form. I can think of text editors, terminal
emulators, and linguistic tools. I am sure there are others.

Even if the library is performant with multiword characters it is
complex. That means more prone to errors. Both in itself and in the
software that interfaces it.

Nice theory. What reduces the number of errors is no longer thinking in
terms of arrays of characters, but in terms of text strings.

Or strings of strings of 16-bit words, packed? No, thanks. I want to
avoid that.

You say that UTF-16 is more space-conserving for languages like
Japanese. Nice. But I do not care. I guess text consumes a very small
portion of memory on my system, both RAM and hard drive. I do not care
if that doubles or quadruples. In the very few cases when I want to
save space (i.e. when sending email attachments) I can use gzip. It can
even compress repetitive text, which no encoding can.
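
(For illustration, a minimal sketch of that gzip point in modern Ruby,
using the bundled zlib bindings; the exact byte counts are illustrative,
but repetitive text shrinks far more under compression than under any
change of encoding.)

    require 'zlib'

    # Highly repetitive text: 10,000 copies of a short phrase.
    text = "the same phrase over and over " * 10_000

    compressed = Zlib::Deflate.deflate(text)

    puts "original:   #{text.bytesize} bytes"       # 300,000 bytes
    puts "compressed: #{compressed.bytesize} bytes" # a few hundred bytes

    # Round trip to show nothing is lost:
    raise "mismatch" unless Zlib::Inflate.inflate(compressed) == text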

If you don’t care, then why are you arguing here? The Japanese – which
would include Matz – do care.

I do not care about the space inefficiency. Be it inefficiency in
storing Czech text, Japanese text, English text, or any other. It has
nothing to do with the fact I do not speak Japanese.

I think that most of my RAM and hard drive space is consumed by stuff
other than text. For that reason I do not care about the relative
efficiency of text encoding. It will have minimal impact on the
performance or amount of memory consumed on the system. And there is
always the possibility to compress the text.

subset of UTF-8).
Hmm, so you call the possibility to choose your encoding living in
the stone age. I would call it living in reality. There are various
encodings out there.

Yes, it’s the stone age. The filesystem should allow you to see things
in UTF-8 or SJIS or EUC-JP if you want, but internally it should be
using something a hell of a lot smarter than those encodings. This is
what HFS+ and Windows allow.

Well, the libc could store the strings in some utf-* encoding on the
disk, and translate that based on the current locale. I wonder if that
is against POSIX or not.
But it is not done, and it is wrong. There are problems of this kind
on Windows as well. It is still not recommended to use non-ASCII
characters in filenames around here…

Thanks

Michal

Hi,

I’m writing a cross-platform app in Ruby that will include text
editing and also p2p chat. I’d like to handle extended character sets.
Do we have any recipes or best practices for handling non-ASCII
character encodings in present-day Ruby v1.8.4?

How are folks currently handling non-ASCII “wide” character encodings
in Ruby?

Thanks,

Bill
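
(A partial answer for 1.8.4, for reference: the usual workarounds at the
time were the global $KCODE switch, the bundled jcode library, and
/u-flagged regexps. The sketch below is a 1.8-era idiom from memory and
assumes the string data is UTF-8; check the helper names against your
exact version.)

    # Ruby 1.8.x has no character-aware String, so the common recipe is to
    # mark strings as UTF-8 for the regexp engine and work through regexps.
    $KCODE = 'u'        # treat strings as UTF-8 in regexps and inspect
    require 'jcode'     # adds multibyte helpers such as jlength and each_char

    s = "h\xc3\xa9llo"  # "héllo" spelled out as UTF-8 bytes

    puts s.length       # 6 (bytes)
    puts s.jlength      # 5 (characters)

    # Split into characters with a /u regexp instead of indexing bytes:
    chars = s.scan(/./mu)
    puts chars[1]       # "é"

    s.each_char { |c| print c, ' ' }
    puts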

On 3/11/06, Michal S. [email protected] wrote:

completely wrap this inconsistency and isolate you from dealing with
it.

If you’re simply dealing with text, you don’t need arrays of characters.
Frankly, if you don’t care what Windows, OS X, and ICU use, then you’re
completely ignorant of the real world and what is useful and necessary
for Unicode.

By the way, you are wrong – you can have arrays of characters. It’s
just that those characters are not guaranteed to be a fixed length. It
will be the same with Ruby moving forward.

Even if the library is performant with multiword characters it is
complex. That means more prone to errors. Both in itself and in the
software that interfaces it.

Nice theory. What reduces the number of errors is no longer thinking in
terms of arrays of characters, but in terms of text strings.

You say that UTF-16 is more space-conserving for languages like
Japanese. Nice. But I do not care. I guess text consumes a very small
portion of memory on my system, both RAM and hard drive. I do not care
if that doubles or quadruples. In the very few cases when I want to
save space (i.e. when sending email attachments) I can use gzip. It can
even compress repetitive text, which no encoding can.

If you don’t care, then why are you arguing here? The Japanese – which
would include Matz – do care.

Hmm, so you call the possibility to choose your encoding living in
the stone age. I would call it living in reality. There are various
encodings out there.

Yes, it’s the stone age. The filesystem should allow you to see things
in UTF-8 or SJIS or EUC-JP if you want, but internally it should be
using something a hell of a lot smarter than those encodings. This is
what HFS+ and Windows allow.

other weird encodings is historical. What do you mean by efficiency?
If you want space efficiency, use compression. If you want speed, use
UTF-32 or a similar encoding that does not have to deal with special
cases.

You’d be half-right. The historical reason is that most programs still
don’t deal with Unicode properly. On Unix/POSIX this is mostly because
of the brain-dead nonsense related to locales. On Windows this is mostly
because of entrenched behaviours.

However, there is significant resistance to Unicode in Asian countries
because of politics, and to UTF-8 in particular because of its
inefficiency both in processing and storage. UTF-32 is equally
inefficient in storage for all languages, and UTF-16 is the balance
between those two. That’s why UTF-16 was chosen for HFS+ and NTFS.

system. Additionally, most platforms specify a default. It’s been a
while (almost a year), but I think that ICU4C defaults to UTF-16BE
internally, not just UTF-16.
IIRC there are even byte-order marks. If you insert one in every
string you can get them identified at any time without doubt :-)

You’re right. There are. They’re one of the mistakes with UTF-16, IMO.

But do not trust me on that. I do not know anything about Unicode, and
I want to sidestep the issue by using an encoding that is easy to work
with, even for the ignorant :-P

Which is why I want what Matz is providing. Something where it doesn’t
matter what encoding you have, but something where Ruby provides the
ability natively to switch between these encodings.

-austin
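
(That per-string encoding model is roughly what later shipped as M17N in
Ruby 1.9. A minimal sketch in modern Ruby, not something the 1.8 of this
thread can do natively:)

    # Every String carries its own encoding and can be transcoded explicitly.
    s = "h\u00E9llo"                # "héllo"

    puts s.encoding                 # => UTF-8
    puts s.length                   # 5 characters
    puts s.bytesize                 # 6 bytes

    utf16 = s.encode('UTF-16LE')    # transcode to another encoding
    puts utf16.bytesize             # 10 bytes

    raw = utf16.dup.force_encoding('BINARY')  # same bytes, reinterpreted
    puts raw.bytesize               # still 10; only the label changed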

On 3/13/06, Michal S. [email protected] wrote:

use, then you’re completely ignorant of the real world and what is
useful and necessary for Unicode.
The native encoding is bound to be different between platforms. I want
to use an encoding that I like on all platforms, and convert the
strings for filenames or whatever to fit the current platform. That is
why I do not care what a particular platform you name uses.

I think you’re just confused here, Michal.

By the way, you are wrong – you can have arrays of characters.
It’s just that those characters are not guaranteed to be a fixed
length. It will be the same with Ruby moving forward.
Yes, you can have arrays of strings. Nice. But to turn a text string
into a string of characters you have to turn it into an array of
strings, instead of just indexing an array of basic types that
represent the characters.

And there is a need to look at the actual characters at times. There
are programs that actually process the text, not only save what the
user entered in a web form. I can think of text editors, terminal
emulators, and linguistic tools. I am sure there are others.

NO! This is where you’re 100% wrong. Text editors, terminal emulators,
and linguistic tools especially should never be looking at the raw
bytes underneath the character strings. They should be dealing with the
characters as discrete entities.

This is what I’m talking about. Byte arrays as characters are nonsense
in today’s world. If you don’t have an encoding attached to something,
then you can’t possibly know what it means.

Even if the library is performant with multiword characters it is
complex. That means more prone to errors. Both in itself and in the
software that interfaces it.
Nice theory. What reduces the number of errors is no longer thinking in
terms of arrays of characters, but in terms of text strings.
Or strings of strings of 16-bit words, packed? No, thanks. I want to
avoid that.

Um. You’re confused here. It’s a text string with a UTF-16 encoding.

nothing to do with the fact I do not speak Japanese.
I think that most of my RAM and hard drive space is consumed by stuff
other than text. For that reason I do not care about the relative
efficiency of text encoding. It will have minimal impact on the
performance or amount of memory consumed on the system. And there is
always the possibility to compress the text.

Then you are willfully ignorant of the concerns of a lot of people.

Hmm, so you call the possibility to choose your encoding living in
the stone age. I would call it living in reality. There are various
encodings out there.
Yes, it’s the stone age. The filesystem should allow you to see
things in UTF-8 or SJIS or EUC-JP if you want, but internally it
should be using something a hell of a lot smarter than those
encodings. This is what HFS+ and Windows allow.
Well, the libc could store the strings in some utf-* encoding on the
disk, and translate that based on the current locale. I wonder if that
is against POSIX or not.

It’s against POSIX because POSIX specifies that filenames on disk are
sequences of single-byte characters and that 0x00 is the filename
terminator. POSIX, though, is stupid.

But it is not done, and it is wrong. There are problems of this kind
on Windows as well. It is still not recommended to use non-ASCII
characters in filenames around here…

Actually, that’s only if you’re using stupid programs. Unfortunately,
that includes Ruby right now. When Matz gets the M17N strings checked
into Ruby 1.9, I will be working toward significantly improving the
Windows filesystem handling so that full Unicode is supported.

-austin

On 3/13/06, Anthony DeRobertis [email protected] wrote:

UTF-8 can take multiple octets to represent a character. So can
UTF-16, UTF-32, and every other variation of Unicode.
This last statement is true only because you use the term “octet.”
You’re correct; that isn’t what I meant to say. Something along the
lines of the following is better worded:

    UTF-8 can take more than one octet to represent a
    character; UTF-16 can take more than two; UTF-32
    more than four; etc.

No. UTF-32 does not have surrogates. Unicode is perfectly
representable in either 20 or 21 bits. A single character is always
representable in a uint32_t sized space with UTF-32.

POSIX is outdated and needs to be scrapped or fixed. Preferably the
former. Preferably by people who know what they’re doing – and not
the folks behind the GNU libc.

-austin

Austin Z. wrote:

On 3/10/06, Anthony DeRobertis [email protected] wrote:

Ummm, no. UTF-16 filenames would break every correctly-implemented
UNIX program: UTF-16 allows the octet 0x00, which has always been
the end-of-string marker.

You’re right. And I’m saying that I don’t care.

Well, I suspect most other people want to maintain backwards
compatibility. Hence the existence of UTF-8.
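
(A small modern-Ruby illustration of that backwards-compatibility point:
UTF-16 text is full of 0x00 octets, which NUL-terminated filename APIs
read as end-of-string, while UTF-8 only ever produces 0x00 for an actual
NUL character.)

    name = "readme.txt"

    utf8  = name.encode('UTF-8')
    utf16 = name.encode('UTF-16LE')

    p utf8.bytes.include?(0)   # false: no stray NULs in UTF-8
    p utf16.bytes.include?(0)  # true: every ASCII character carries a 0x00 byte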

People need to stop
thinking in terms of bytes (octets) and start thinking in terms of
characters. I’ll say it flat out here: the POSIX filesystem definition
is going to badly limit what can be done with Unix systems.

Why? POSIX gives nearly binary-transparent file names; the only
exception is the single octet 0x00. Considering the 1:1 mapping between
UTF-8 and other Unicode encodings, how can the choice of one or another
“badly limit” what can be done?

Personally, my file names have been in UTF-8 for quite some time now,
and it works well: What exactly is this ‘stone age’ you refer to?

Change an environment variable and watch the programs that had worked
so well with Unicode break. That is the stone age that I refer to.

dd if=/dev/urandom of=/lib/ld-linux.so.2 and watch all my programs
break, too. What’s your point?

It is always possible to break a computer system if you try hard enough
(or, all too often, not hard at all); but if the user actively attempts
to make his machine malfunction, that’s not the OS’s problem.

I’m also guessing that you don’t do much with long Japanese filenames
or deep paths that involve anything except US-ASCII (a subset of
UTF-8).

Well, I have Japanese file names (though not that many in the grand
scheme of things), and have a lot of files and directories named in
non-US-ASCII. Yeah, I know that file name length and path length limits
suck, but that’s an implementation limitation of e.g. ext3, nothing
fundamental.

UTF-8 can take multiple octets to represent a character. So can
UTF-16, UTF-32, and every other variation of Unicode.

This last statement is true only because you use the term “octet.”

You’re correct; that isn’t what I meant to say. Something along the
lines of the following is better worded:

    UTF-8 can take more than one octet to represent a
    character; UTF-16 can take more than two; UTF-32
    more than four; etc.

It’s a useless term here, because UTF-8 only has any level of
efficiency for US-ASCII.

English, I’ve heard, is a rather common language.

Even if you step to European content, UTF-8
is no longer perfectly efficient,

Of course not — but still generally better than UTF-16, I think.
Spanish, I’ve heard, is also a rather common language.

and when you step to Asian content,
UTF-8 is so bloody inefficient that most folks who have to deal with
it would rather work in a native encoding (EUC-JP or SJIS, anyone?)
which is 1-2 bytes or do everything in UTF-16.

Yes, for CJK, UTF-8 is fairly inefficient: roughly 50% bigger than
UTF-16 (three bytes per character instead of two).

OTOH, it has some nice advantages over UTF-16, like being backwards
compatible with C strings, being resynchronizable (if an octet is
lost), not having byte-order issues, etc.
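
(For concreteness, a modern-Ruby sketch of the size trade-off being
argued: Latin text roughly doubles in UTF-16, while CJK text is half
again as large in UTF-8 as in UTF-16.)

    ascii    = "hello, world"                    # 12 characters of plain ASCII
    japanese = "\u3053\u3093\u306B\u3061\u306F"  # "konnichiwa", 5 CJK characters

    [ascii, japanese].each do |s|
      utf8  = s.encode('UTF-8').bytesize
      utf16 = s.encode('UTF-16LE').bytesize
      puts "#{s.length} chars -> UTF-8: #{utf8} bytes, UTF-16: #{utf16} bytes"
    end
    # 12 chars -> UTF-8: 12 bytes, UTF-16: 24 bytes
    # 5 chars  -> UTF-8: 15 bytes, UTF-16: 10 bytes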

Now, honestly, what portion of your hard disk is taken up by file names?

From: “Austin Z.” [email protected]

On 3/13/06, Anthony DeRobertis [email protected] wrote:

    UTF-8 can take more than one octet to represent a
    character; UTF-16 can take more than two; UTF-32
    more than four; etc.

No. UTF-32 does not have surrogates. Unicode is perfectly
representable in either 20 or 21 bits. A single character is always
representable in a uint32_t sized space with UTF-32.

Hi, I have zero background in non-ASCII character representations,
but the following post has been echoing in my head as a data point
for… can’t believe it’s been three-and-a-half years:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/46284

Does that have any relation to your current context? Curt seems to
be talking not of surrogates, but saying that “combining characters”
mean variable-length issues still exist with UTF-32?

Regards,

Bill

On 3/14/06, Bill K. [email protected] wrote:

for… can’t believe it’s been three-and-a-half years:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/46284

Does that have any relation to your current context? Curt seems to
be talking not of surrogates, but saying that “combining characters”
mean variable-length issues still exist with UTF-32?

Yes and no. When you use combining characters, each of the combining
characters (such as COMBINING CEDILLA or COMBINING ACUTE ACCENT) is a
distinct character. If I understand the Unicode standard correctly –
which is perhaps questionable – you can go either direction. But I
had forgotten (temporarily) about combining characters. For the most
part, Apple chooses to use them and Microsoft chooses not to use them
in native representations wherever possible. Where it becomes
difficult is when you need to combine characters that do not otherwise
have canonical forms. At that point, yes, UTF-32 can have multiple
uint32_t elements creating one character. I think that for most
languages, though, the use of combining characters is not necessary.

I withdraw my absolute, though. If you’re creating a meaningful glyph
with combining characters, you can have multiple uint32_t elements
creating that glyph in UTF-32. Without combining characters, however,
UTF-32 is perfectly representational of all glyphs possible with
Unicode.

-austin
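
(A modern-Ruby sketch of that caveat: a precomposed “é” is one code
point, while the decomposed form is two, and that stays true even when
the string is held as UTF-32.)

    composed   = "\u00E9"     # é as a single precomposed code point
    decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

    puts composed == decomposed   # false: different code point sequences
    puts composed.length          # 1
    puts decomposed.length        # 2 code points, one visible glyph

    # Even in UTF-32 the decomposed glyph occupies two 32-bit elements:
    p decomposed.encode('UTF-32BE').bytesize / 4   # => 2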

I don’t get it, guys. Supporting (not exclusively using) Unicode
transparently should be a no-brainer for a serious programming language
these days. I love Ruby, but multi-byte strings are a pain. And they are
everywhere. There’s no logic in resisting. There are more chars in the
world than on your keyboard. Even in the US, there are official and
*correct* chars for quotation marks not in the US-ASCII set. Using the
inch-sign for quotes is plain wrong. Come on, we’re in the 21st century
and the world is a global place. Open source people should know that
best. It can’t be so difficult technically - others do it, why can’t
you?

All we want is a Unicode safe Ruby.

Best,
Andreas

Austin Z. wrote:

No. UTF-32 does not have surrogates. Unicode is perfectly
representable in either 20 or 21 bits. A single character is always
representable in a uint32_t sized space with UTF-32.

Depends on what you call a character; in the technical way Unicode uses
the term, yes, UTF-32 can represent every character at present.

In the way that users understand characters (what the Unicode standard
calls a “grapheme”) - the way text-processing software needs to
manipulate characters - no, it can’t.

A “d” with two combining marks (three code points) is not three
characters to the user.

POSIX is outdated and needs to be scrapped or fixed.

So far, you have provided no evidence of this, just assertions that
somehow UTF-8 is horribly limiting.

On 3/13/06, Austin Z. [email protected] wrote:

characters. Frankly, if you don’t care what Windows, OS X, and ICU
use, then you’re completely ignorant of the real world and what is
useful and necessary for Unicode.
The native encoding is bound to be different between platforms. I want
to use an encoding that I like on all platforms, and convert the
strings for filenames or whatever to fit the current platform. That is
why I do not care what a particular platform you name uses.

I think you’re just confused here, Michal.
In what way?

are programs that actually process the text, not only save what the
user entered in a web form. I can think of text editors, terminal
emulators, and linguistic tools. I am sure there are others.

NO! This is where you’re 100% wrong. Text editors, terminal emulators,
and linguistic tools especially should never be looking at the raw
bytes underneath the character strings. They should be dealing with the
characters as discrete entities.

I am saying I want to look at characters, not that I want to look at
bytes.
And I am saying that looking at entities that happen to be all the
same size makes things much simpler than looking at strings packed
into another string without separators. And multiword characters are
word strings, nothing else.

This is what I’m talking about. Byte arrays as characters are nonsense
in today’s world. If you don’t have an encoding attached to something,
then you can’t possibly know what it means.

No, it is not.
Sure, I am not for byte arrays or chunks of data of unknown encoding.

Even if the library is performant with multiword characters it is
complex. That means more prone to errors. Both in itself and in the
software that interfaces it.
Nice theory. What reduces the number of errors is no longer thinking in
terms of arrays of characters, but in terms of text strings.
Or strings of strings of 16-bit words, packed? No, thanks. I want to
avoid that.

Um. You’re confused here. It’s a text string with a UTF-16 encoding.

Yes, it is a text string, which is a string of characters packed one
after another, which happen to be themselves strings of 16-bit words.
Give me the 100th character.
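
(For what it is worth, this is exactly the operation that becomes
trivial once strings are character-aware. A modern-Ruby sketch; in 1.8
you would have to scan from the start with a /u regexp instead.)

    # String#[] indexes characters, not bytes or 16-bit words.
    text = ("\u00E9" * 50) + ("x" * 50)   # fifty "é" followed by fifty "x"

    puts text.length        # 100 characters
    puts text.bytesize      # 150 bytes ("é" is two bytes in UTF-8)

    puts text[99]           # "x"  -- the 100th character
    puts text.chars[99]     # "x"  -- same thing via an explicit character array
    puts text.getbyte(99)   # 169  -- byte 100 is still inside the 50th "é"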

storing Czech text, Japanese text, English text, or any other. It has
nothing to do with the fact I do not speak Japanese.

I think that most of my RAM and hard drive space is consumed by stuff
other than text. For that reason I do not care about the relative
efficiency of text encoding. It will have minimal impact on the
performance or amount of memory consumed on the system. And there is
always the possibility to compress the text.

Then you are willfully ignorant of the concerns of a lot of people.

OK, you are concerned about the space consumed by text. I wonder how
large a portion of your RAM is used for text. Or how large a portion of
your hard drive is used by text for which you can choose the encoding.
I have lots of C sources, but I suspect that C compilers won’t accept
wide characters anytime soon. And anything but a byte encoding is quite
pointless for C sources. Most of the stuff is identifiers that can be
only 7-bit ASCII anyway.

Thanks

Michal

On 3/14/06, Andreas [email protected] wrote:

All we want is a Unicode safe Ruby.

You have it.

You’ll have something even better in Ruby 2.0.

You will not have it natively in Ruby 1.8.

-austin

On 3/14/06, Bill K. [email protected] wrote:

representable in a uint32_t sized space with UTF-32.

Well, in some languages you get characters like “LATIN CAPITAL LETTER
A WITH ACUTE”.
In a string you can either get the above or “LATIN CAPITAL LETTER A”
followed by “COMBINING ACUTE ACCENT” or some such. The latter is the
decomposed form.

And there are libraries for normalizing/composing/decomposing Unicode
strings.

Thanks

Michal
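
(Modern Ruby, from 2.2 on, ships this as String#unicode_normalize; in
the 1.8 era you would reach for ICU or a similar external library. A
minimal sketch:)

    decomposed = "A\u0301"   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

    composed = decomposed.unicode_normalize(:nfc)   # -> "\u00C1", one code point
    puts decomposed.length   # 2
    puts composed.length     # 1

    # And back to the decomposed form:
    p composed.unicode_normalize(:nfd) == decomposed   # true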

On 3/14/06, Shawn A. [email protected] wrote:

As long as we’re discussing Unicode here,
I am testing a large C++ application and would like to wrap it with
Ruby. However, this application makes wide use of Unicode and wchar_t
in almost all its method calls. Can anyone help my feeble mind
understand how to do this?

Would someone be able to point me in the direction of some example
code or such for both calling into the C++ code with a wchar_t
argument, and getting wchar_t’s back from calls? Apparently SWIG
won’t touch this?

This is exactly the thing that is not supported right now. But you
might be able to convert the wide character strings to something else
using iconv.
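
(A rough 1.8-era sketch using the bundled iconv library. The width and
byte order of wchar_t are platform-dependent (UTF-16LE on Windows,
usually UTF-32 on Unix), so the encoding names below are assumptions to
adjust for your platform.)

    require 'iconv'   # standard library in Ruby 1.8 (removed in later versions)

    # Suppose the wrapped C++ code hands us raw wchar_t bytes (here, "hi"
    # as little-endian UTF-16, the Windows convention).
    wide_bytes = "h\000i\000"

    utf8 = Iconv.conv('UTF-8', 'UTF-16LE', wide_bytes)
    puts utf8                                     # => "hi"

    # Going the other way before passing a Ruby string into the wrapper:
    back = Iconv.conv('UTF-16LE', 'UTF-8', utf8)
    puts back.length                              # 4 bytes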

And you could possibly use icu4r to work with wide strings directly if
it happens to use the same wide characters. But I suspect you would
have to write some glue code to put it all together.

Swig is supposed to make such argument conversions easier.

Thanks

Michal


Support the freedom of music!
Maybe it’s a weird genre … but weird is not illegal.
Maybe next time they will send a special forces commando
to your picnic … because they think you are weird.
www.music-versus-guns.org http://en.policejnistat.cz

As long as we’re discussing Unicode here,
I am testing a large C++ application and would like to wrap it with
Ruby. However, this application makes wide use of Unicode and wchar_t
in almost all its method calls. Can anyone help my feeble mind
understand how to do this?

Would someone be able to point me in the direction of some example
code or such for both calling into the C++ code with a wchar_t
argument, and getting wchar_t’s back from calls? Apparently SWIG
won’t touch this?

Thanks,
Shawn