Forum: Ruby unicode in ruby

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- and Ruby-related community platforms.
Richard G. (Guest)
on 2006-03-08 13:48
(Received via mailing list)
i'm using IO.foreach to parse the lines in a file. now i'm trying to get
it to work with unicode encoded files. does ruby support unicode? how do
i compare a variable with a unicode constant string?

the script goes something like:

IO.foreach("myfile.txt") { |line|
   if line.downcase[0,2] == "id"
     # ... handle the matching line ...
   end
}
Michal S. (Guest)
on 2006-03-08 14:46
(Received via mailing list)
On 3/8/06, Richard G. <removed_email_address@domain.invalid> wrote:
> i'm using IO.foreach to parse the lines in a file. now i'm trying to get
> it to work with unicode encoded files. does ruby support unicode? how do
> i compare a variable with a unicode constant string?
>
> the script goes something like:
>
> IO.foreach("myfile.txt") { |line|
>    if line.downcase[0,2] == "id"

To get Unicode downcase you probably want icu4r. To handle just the
cases you are interested in, you could write your own. However, the []
operator of Ruby strings returns bytes, not characters.
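
For reference, a short sketch of that byte-oriented behaviour under
Ruby 1.8 (the sample line is made up):

  line = "IDentität: 42"
  line[0]              # => 73 -- a Fixnum byte value, not a character
  line.downcase[0, 2]  # => "id" -- a two-*byte* slice; safe here only
                       #    because the prefix being tested is pure ASCII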

hth

Michal

--
             Support the freedom of music!
Maybe it's a weird genre  ..  but weird is *not* illegal.
Maybe next time they will send a special forces commando
to your picnic .. because they think you are weird.
 www.music-versus-guns.org  http://en.policejnistat.cz
unknown (Guest)
on 2006-03-08 14:55
(Received via mailing list)
Michal S. <removed_email_address@domain.invalid> wrote:

>
> On 3/8/06, Richard G. <removed_email_address@domain.invalid> wrote:

> i'm using IO.foreach [.. no \n ]


you don't make use of "\n" at uni-berlin.de when wrapping?

It could be more readable ;-)
Richard G. (Guest)
on 2006-03-08 20:15
(Received via mailing list)
so, you guys are telling me a language developed since the year 2000
doesn't support unicode strings natively? in my opinion, that's a pretty
glaring problem.
Logan C. (Guest)
on 2006-03-08 20:21
(Received via mailing list)
On Mar 8, 2006, at 1:13 PM, Richard G. wrote:

> so, you guys are telling me a language developed since the year
> 2000 doesn't support unicode strings natively? in my opinion,
> that's a pretty glaring problem.
>

Ruby doesn't really support any strings natively. It just happens to
have a bytevector class that acts a lot like a string ;) Having said
that, have you tried:
$KCODE="u" # Assumes the source file is encoded as UTF-8; affects
literal strings, regexps, etc.

If your source file is UTF16 or some other non-UTF8 encoding you'll
have to use iconv to get into UTF8 to compare with the literals in
your source.
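
A rough sketch of that conversion using the standard iconv library
(assuming, hypothetically, that myfile.txt is UTF-16LE without a
byte-order mark):

  $KCODE = "u"   # treat literals/regexps in this source as UTF-8
  require 'iconv'

  utf16 = File.open("myfile.txt", "rb") { |f| f.read }
  utf8  = Iconv.conv("UTF-8", "UTF-16LE", utf16)
  utf8.each_line do |line|
    puts line if line.downcase[0, 2] == "id"
  end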
Michal S. (Guest)
on 2006-03-08 20:27
(Received via mailing list)
On 3/8/06, Richard G. <removed_email_address@domain.invalid> wrote:
> so, you guys are telling me a language developed since the year 2000
> doesn't support unicode strings natively? in my opinion, that's a pretty
> glaring problem.

For me it is a problem as well. But getting unicode right is hard.
Look at the size of the icu library and the size of ruby itself.
Anyway, unicode regexps are planned for ruby 2.0 iirc.

Thanks

Michal


Michal S. (Guest)
on 2006-03-08 20:33
(Received via mailing list)
On 3/8/06, Logan C. <removed_email_address@domain.invalid> wrote:
> that, have you tried:
> $KCODE="u" # Assumes the source file is encoded as UTF-8; affects
> literal strings, regexps, etc.
>
> If your source file is UTF16 or some other non-UTF8 encoding you'll
> have to use iconv to get into UTF8 to compare with the literals in
> your source.

err, no, that is not what people want when they speak about downcase in
Unicode.
Sure, you can write a string encoded in UTF-8 in your source, and
verify it is byte-identical to another string. That is about all you
get this way.
I suspect regexps won't work right with multibyte characters; for
downcase or case-insensitive regexps you would even need to know the
language.
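
Under Ruby 1.8 the behaviour looks roughly like this (a sketch; the u
flag switches a regexp into UTF-8 mode):

  $KCODE = "u"
  "école" =~ /\A./u     # the u flag makes . consume one UTF-8 character
  $&                    # => "é" -- two bytes, matched as one character

  "ÉCOLE" =~ /école/iu  # => nil -- 1.8's case folding is ASCII-only,
                        #    so /i does not equate "É" with "é"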

Thanks

Michal


Daniel H. (Guest)
on 2006-03-08 20:55
(Received via mailing list)
On Mar 8, 2006, at 7:24 PM, Michal S. wrote:

> Anyway, unicode regexps are planned for ruby 2.0 iirc.

Unicode strings are also planned for Ruby 2 (possibly implemented
already?).

-- Daniel
Eric J. (Guest)
on 2006-03-08 21:04
(Received via mailing list)
Logan C. <removed_email_address@domain.invalid> writes:

> Ruby doesn't really support any strings natively. It just happens to
> have a bytevector class that acts a lot like a string ;)

.... that acts a lot like a string /of ASCII chars/, actually. Rather
anachronistic, imho.

I can't consider that "il était une fois".length == 18 is the way it
should be with a string in a modern language.

Of course, tweaking with -K and jcode and/or other third-party
modules and/or various hacks allows some enhancements (we have a
jlength method that seems to work), but that's still far from ideal
(case methods support only ASCII chars, etc.)
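
A minimal sketch of the jcode behaviour mentioned above, under Ruby 1.8:

  $KCODE = "u"
  require 'jcode'   # the pre-Unicode multibyte helpers

  s = "il était une fois"
  s.length    # => 18 -- bytes ("é" is two bytes in UTF-8)
  s.jlength   # => 17 -- characters, courtesy of jcode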

Waiting for plain support in Rite (much more important to me than
the "end" issues...).
rtilley (Guest)
on 2006-03-08 21:20
(Received via mailing list)
Eric J. wrote:

> Waiting for a plain support in Rite (much more important to me than
> the "end" issues...)

Speaking of Rite... is there a timeline on its release yet? One year?
Two years? More?
Daniel H. (Guest)
on 2006-03-08 21:26
(Received via mailing list)
On Mar 8, 2006, at 8:18 PM, rtilley wrote:

> Speaking of Rite... is there a timeline on its release yet? One
> year? Two years? More?

http://www.atdot.net/yarv/
http://redhanded.hobix.com/cult/yarvMergedMatz.html

-- Daniel
rtilley (Guest)
on 2006-03-08 21:41
(Received via mailing list)
Daniel H. wrote:
> http://www.atdot.net/yarv/
> http://redhanded.hobix.com/cult/yarvMergedMatz.html
>
> -- Daniel

I'm new to Ruby... I did not know that Rite was tied to YARV. Thanks for
the links!
Richard G. (Guest)
on 2006-03-10 04:10
(Received via mailing list)
guess i'll wait till then. thanks for the info guys.
Richard G. (Guest)
on 2006-03-10 04:22
(Received via mailing list)
exactly. utf-8 doesn't mean one byte per char necessarily.

how have folks solved this problem when writing web sites in rails?
PJ Hyett (Guest)
on 2006-03-10 04:40
(Received via mailing list)
It's a huge f*cking pain in the ass. We've been trying to convert
Wayfaring.com over to UTF8 off and on for about a month and it's
completely useless. Either you start the site using UTF8 (using crappy
hacks IMO) or forgetaboutit. We're about to break ground on a new site
and I almost don't want to do it until ruby 2.0 comes out with the
unicode support built in.

-PJ
http://pjhyett.com
Austin Z. (Guest)
on 2006-03-10 06:35
(Received via mailing list)
On 3/8/06, Richard G. <removed_email_address@domain.invalid> wrote:
> so, you guys are telling me a language developed since the year 2000
> doesn't support unicode strings natively? in my opinion, that's a
> pretty glaring problem.

Please note that Ruby itself is ten years old. Unicode has only
*recently* (the last three or four years, with the release of Windows
XP) become a major factor, especially in Japan. Unix support for Unicode
is still in the stone ages because of the nonsense that POSIX put on
Unix ages ago. (When Unix filesystems can write UTF-16 as their native
filename format, then we're going to be much better off. That will,
however, break some assumptions by really stupid programs.)

I've been following what Matz has had to say and have recently done
quite a bit of work with Unicode. The reality is that Unicode is hard,
and there are cultural and other reasons for Ruby *not* to have Unicode
(UTF-16 or UTF-8) strings by default. I think that Matz's plan for M17N
strings is far superior to assuming Unicode by default.

Basically, Ruby will have the capabilities to work with UTF-8, UTF-16,
and probably the ISO-8859-* encodings natively, as well as the existing
SJIS and EUC-JP support. I wouldn't be surprised if it also includes
other EUC-* encodings. Essentially, you'll be able to do:

  s = "école"             # raw bytes 0xC3 0xA9 0x63 ... (a UTF-8 source file)
  s.encoding              # -> :raw (or something like that)
  s.encoding = :iso8859_1 # renders as "Ã©cole"
  s.encoding = :utf8      # renders as "école"
  s.capitalize!           # => "École" (bytes become 0xC3 0x89 ...)
  s.encoding = :iso8859_1 # renders as "Ã‰cole"

More than that, using the same string:

  s[0] # "Ã" -- the single byte 0xC3 under ISO-8859-1
  s.encoding = :utf8
  s[0] # "É" -- one character, two bytes, under UTF-8

It is a single byte string throughout; the comments just show how it
renders under each encoding label. The point is, though, that going from
the raw encoding -- which may be the default, or the default may itself
be settable -- shouldn't cause any byte conversions. I suspect that Matz
will have a different way to get at the underlying bytes, but that's
what will be happening for Ruby 2.0.
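
As an aside, Ruby 1.8 has no String#encoding at all; the closest you can
get today to inspecting the above is the raw byte view, e.g. (a small
sketch):

  s = "\xC3\xA9cole"   # "école" as UTF-8 bytes
  s.unpack("C*")       # => [195, 169, 99, 111, 108, 101] -- the bytes
                       #    stay the same however you *interpret* them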

The last indication I had seen suggested that M17N strings were closer,
but not yet done. I'm looking forward to them.


-austin
Michal S. (Guest)
on 2006-03-10 15:08
(Received via mailing list)
On 3/10/06, Austin Z. <removed_email_address@domain.invalid> wrote:
> filename format, then we're going to be much better. That will, however,
> break some assumptions by really stupid programs.)
>

Why the hell utf-16? It is no longer compatible with ascii, yet 16
bits are far from sufficient to cover current unicode. So you still
get multiword characters. It is not even dword aligned for fast
processing by current cpus.
I would like utf-8 for compatibility, and utf-32 for easy string
processing. But I do not see much use for utf-16.

Thanks

Michal

Anthony DeRobertis (Guest)
on 2006-03-10 22:59
(Received via mailing list)
Austin Z. wrote:

> Unix support for
> Unicode is still in the stone ages because of the nonsense that POSIX
> put on Unix ages ago. (When Unix filesystems can write UTF-16 as their
> native filename format, then we're going to be much better. That will,
> however, break some assumptions by really stupid programs.)

Ummm, no. UTF-16 filenames would break *every* correctly-implemented
UNIX program: UTF-16 allows the octet 0x00, which has always been the
end-of-string marker.

Personally, my file names have been in UTF-8 for quite some time now,
and it works well: What exactly is this 'stone age' you refer to?

UTF-8 can take multiple octets to represent a character. So can UTF-16,
UTF-32, and every other variation of Unicode.

Depending on content, a string in UTF-8 can consume more octets than
the same string in UTF-16, or vice versa.

Ah! But wait. I can see an advantage to UTF-16. With UTF-8, you don't
get to have the fun of picking between big- and little-endian!
Austin Z. (Guest)
on 2006-03-11 06:42
(Received via mailing list)
On 3/10/06, Michal S. <removed_email_address@domain.invalid> wrote:
>> their native filename format, then we're going to be much better.
>> That will, however, break some assumptions by really stupid
>> programs.)
> Why the hell utf-16? It is no longer compatible with ascii, yet 16
> bits are far from sufficient to cover current unicode. So you still
> get multiword characters. It is not even dword aligned for fast
> processing by current cpus. I would like utf-8 for compatibility, and
> utf-32 for easy string processing. But I do not see much use for
> utf-16.

UTF-16 is actually pretty performant and the implementation of wchar_t
on MacOS X and Windows is (you guessed it!) UTF-16. The filesystems for
both of these operating systems (which have *far* superior Unicode
support than anything else) both use UTF-16 as the native filename
encoding (this is true for HFS+, NTFS4, and NTFS5). The only difference
between what MacOS X does and Windows does for this is that Apple chose
to use decomposed characters instead of composed characters (e.g.,
LOWERCASE E + COMBINING ACUTE ACCENT instead of LOWERCASE E ACUTE
ACCENT).

Look at the performance numbers for ICU4C: it's pretty damn good. UTF-32
isn't exactly space conservative (since with UTF-16 *most* of the BMP
can be represented with a single wchar_t, and only a few need surrogates
taking up exactly *two* wchar_ts, whereas *all* characters would take up
a full four-byte uint32_t under UTF-32). ICU4C uses UTF-16 internally.
Exclusively.
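
A quick illustration of surrogates using the standard iconv library
(a sketch):

  require 'iconv'
  utf8  = [0x1D11E].pack("U")   # U+1D11E, MUSICAL SYMBOL G CLEF
  utf16 = Iconv.conv("UTF-16BE", "UTF-8", utf8)
  utf16.length                  # => 4 -- two 16-bit units: the
                                #    surrogate pair D834 DD1E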

On 3/10/06, Anthony DeRobertis <removed_email_address@domain.invalid> wrote:
> Austin Z. wrote:
>> Unix support for Unicode is still in the stone ages because of the
>> nonsense that POSIX put on Unix ages ago. (When Unix filesystems can
>> write UTF-16 as their native filename format, then we're going to be
>> much better. That will, however, break some assumptions by really
>> stupid programs.)
> Ummm, no. UTF-16 filenames would break *every* correctly-implemented
> UNIX program: UTF-16 allows the octet 0x00, which has always been the
> end-of-string marker.

You're right. And I'm saying that I don't care. People need to stop
thinking in terms of bytes (octets) and start thinking in terms of
characters. I'll say it flat out here: the POSIX filesystem definition
is going to badly limit what can be done with Unix systems. One could do
what I *think* Apple has done and provide two filesystem
interfaces that are synchronized. The native interface -- and the more
efficient one -- would use UTF-16 because that's what HFS+ speaks.
The secondary interface (which also works on UFS filesystems) would
translate to UTF-8 and/or follow the nonsensical POSIX rules for native
encodings.

> Personally, my file names have been in UTF-8 for quite some time now,
> and it works well: What exactly is this 'stone age' you refer to?

Change an environment variable and watch the programs that had worked
so well with Unicode break. *That* is the stone age that I refer to.
I'm also guessing that you don't do much with long Japanese filenames or
deep paths that involve *anything* except US-ASCII (a subset of UTF-8).

> UTF-8 can take multiple octets to represent a character. So can UTF-16,
> UTF-32, and every other variation of Unicode.

This last statement is true only because you use the term "octet." It's
a useless term here, because UTF-8 is only really efficient for
US-ASCII. Even if you step to European content, UTF-8 is no longer
perfectly efficient, and when you step to Asian content, UTF-8 is so
bloody inefficient that most folks who have to deal with it would rather
work in a native encoding (EUC-JP or SJIS, anyone?) which is 1..2 bytes
per character, or do everything in UTF-16.

> Depending on content, a string in UTF-8 can consume more octets than
> the same string in UTF-16, or vice versa.
>
> Ah! But wait. I can see an advantage to UTF-16. With UTF-8, you don't
> get to have the fun of picking between big- and little-endian!

Are people always this stupid when it comes to things that they clearly
don't understand? Yes, UTF-16 may have the problem of not knowing if
you're dealing with UTF-16BE or UTF-16LE, but it's my understanding that
this is *only* an issue when you're dealing with both on the same
system. Additionally, most platforms specify a default. It's been a
while (almost a year), but I think that ICU4C defaults to UTF-16BE
internally, not just UTF-16.

There. Problem solved.

If you're going to babble on about Unicode, it'd be nice if you knew
more than the knee-jerk stuff you've posted so far. Either of you.

-austin
Michal S. (Guest)
on 2006-03-12 00:03
(Received via mailing list)
On 3/11/06, Austin Z. <removed_email_address@domain.invalid> wrote:
> >> put on Unix ages ago. (When Unix filesystems can write UTF-16 as
> UTF-16 is actually pretty performant and the implementation of wchar_t
> isn't exactly space conservative (since with UTF-16 *most* of the BMP
> can be represented with a single wchar_t, and only a few need surrogates
> taking up exactly *two* wchar_ts, whereas *all* characters would take up
> four uint32_t under UTF-32). ICU4C uses UTF-16 internally. Exclusively.

I do not care what Windows, OS X, or ICU uses. I care what I want to
use. Even if most characters are encoded with a single word, you have to
cope with multiword characters. That means that a character is not a
simple type. You cannot have character arrays. And no library can
completely wrap this inconsistency and isolate you from dealing with
it.

Even if the library is performant with multiword characters it is
complex. That means more prone to errors. Both in itself and in the
software that interfaces it.

You say that utf-16 is more space-conserving for languages like
Japanese. Nice. But I do not care. I guess text consumes a very small
portion of memory on my system. Both RAM and hard drive.  I do not care
if that doubles or quadruples. In the very few cases when I want to
save space (i.e. when sending email attachments) I can use gzip. It can
even compress repetitive text, which no encoding can.


> > end-of-string marker.
> encodings.
>
> > Personally, my file names have been in UTF-8 for quite some time now,
> > and it works well: What exactly is this 'stone age' you refer to?
>
> Change an environment variable and watch the programs that had worked
> so well with Unicode break. *That* is the stone age that I refer to.
> I'm also guessing that you don't do much with long Japanese filenames or
> deep paths that involve *anything* except US-ASCII (a subset of UTF-8).

Hmm, so you call the possibility to choose your encoding living in
the stone age. I would call it living in reality. There are various
encodings out there.

> or do everything in UTF-16.
No, I suspect the reason for using EUC-JP, SJIS, or ISO-8859-*, and
other weird encodings is historical.
What do you mean by efficiency? If you want space efficiency use
compression. If you want speed, use utf-32 or similar encoding that
does not have to deal with special cases.

> this is *only* an issue when you're dealing with both on the same
> system. Additionally, most platforms specify a default. It's been a
> while (almost a year), but I think that ICU4C defaults to UTF-16BE
> internally, not just UTF-16.

iirc there are even byte-order marks. If you insert one in every
string you can get them identified at any time without doubt :)

But do not trust me on that. I do not know anything about unicode, and
I want to sidestep the issue by using an encoding that is easy to work
with, even for ignorants :P


Thanks

Michal
Austin Z. (Guest)
on 2006-03-12 23:19
(Received via mailing list)
On 3/11/06, Michal S. <removed_email_address@domain.invalid> wrote:
>>
> completely wrap this inconsistency and isolate you from dealing with
> it.

If you're simply dealing with text, you don't need arrays of characters.
Frankly, if you don't care what Windows, OS X, and ICU use, then you're
completely ignorant of the real world and what is useful and necessary
for Unicode.

By the way, you are wrong -- you *can* have arrays of characters. It's
just that those characters are not guaranteed to be a fixed length. It
will be the same with Ruby moving forward.

> Even if the library is performant with multiword characters it is
> complex. That means more prone to errors. Both in itself and in the
> software that interfaces it.

Nice theory. What reduces the number of errors is no longer thinking in
terms of arrays of characters, but in terms of text strings.

> You say that utf-16 is more space-conserving for languages like
> Japanese. Nice. But I do not care. I guess text consumes a very small
> portion of memory on my system. Both RAM and hard drive.  I do not care
> if that doubles or quadruples. In the very few cases when I want to
> save space (i.e. when sending email attachments) I can use gzip. It can
> even compress repetitive text, which no encoding can.

If you don't care, then why are you arguing here? The Japanese -- which
would include Matz -- *do* care.

> Hmm, so you call the possibility to choose your encoding living in
> the stone age. I would call it living in reality. There are various
> encodings out there.

Yes, it's the stone age. The filesystem should allow you to see things
in UTF-8 or SJIS or EUC-JP if you want, but internally it should be
using something a hell of a lot smarter than those encodings. This is
what HFS+ and Windows allow.

> other weird encodings is historical. What do you mean by efficiency?
> If you want space efficiency use compression. If you want speed, use
> utf-32 or similar encoding that does not have to deal with special
> cases.

You'd be half-right. The historical reason is that most programs still
don't deal with Unicode properly. On Unix/POSIX this is mostly because
of the brain-dead nonsense related to locales. On Windows this is mostly
because of entrenched behaviours.

However, there is significant resistance to Unicode in Asian countries
because of politics, and to UTF-8 in particular because of its
inefficiency both in processing and storage. UTF-32 is equally
inefficient in storage for *all* languages, and UTF-16 is the balance
between those two. That's why UTF-16 was chosen for HFS+ and NTFS.

>> system. Additionally, most platforms specify a default. It's been a
>> while (almost a year), but I think that ICU4C defaults to UTF-16BE
>> internally, not just UTF-16.
> iirc there are even byte-order marks. If you insert one in every
> string you can get them identified at any time without doubt :)

You're right. There are. They're one of the mistakes with UTF-16, IMO.

> But do not trust me on that. I do not know anything about unicode, and
> I want to sidestep the issue by using an encoding that is easy to work
> with, even for ignorants :P

Which is why I want what Matz is providing. Something where it doesn't
matter what encoding you have, but something where Ruby provides the
ability *natively* to switch between these encodings.

-austin
Michal S. (Guest)
on 2006-03-13 18:45
(Received via mailing list)
On 3/12/06, Austin Z. <removed_email_address@domain.invalid> wrote:
> >> instead of LOWERCASE E ACUTE ACCENT).
> > simple type. You cannot have character arrays. And no library can
> > completely wrap this inconsistency and isolate you from dealing with
> > it.
>
> If you're simply dealing with text, you don't need arrays of characters.
> Frankly, if you don't care what Windows, OS X, and ICU use, then you're
> completely ignorant of the real world and what is useful and necessary
> for Unicode.

The native encoding is bound to be different between platforms. I want
to use an encoding that I like on all platforms, and convert the
strings for filenames or whatever to fit the current platform. That is
why I do not care what a particular platform you name uses.

>
> By the way, you are wrong -- you *can* have arrays of characters. It's
> just that those characters are not guaranteed to be a fixed length. It
> will be the same with Ruby moving forward.

Yes, you can have arrays of strings. Nice. But to turn a text string
into a string of characters you have to turn it into an array of
strings. Instead of just indexing an array of basic types that
represent the characters.

And there is a need to look at the actual characters at times. There
are programs that actually process the text, not only save what the
user entered in a web form. I can think of text editors, terminal
emulators, and linguistic tools. I am sure there are others.

>
> > Even if the library is performant with multiword characters it is
> > complex. That means more prone to errors. Both in itself and in the
> > software that interfaces it.
>
> Nice theory. What reduces the number of errors is no longer thinking in
> terms of arrays of characters, but in terms of text strings.

Or strings of strings of 16-bit words, packed? No, thanks. I want to
avoid that.

>
> > You say that utf-16 is more space-conserving for languages like
> > Japanese. Nice. But I do not care. I guess text consumes a very small
> > portion of memory on my system. Both RAM and hard drive.  I do not care
> > if that doubles or quadruples. In the very few cases when I want to
> > save space (i.e. when sending email attachments) I can use gzip. It can
> > even compress repetitive text, which no encoding can.
>
> If you don't care, then why are you arguing here? The Japanese -- which
> would include Matz -- *do* care.

I do not care about the space inefficiency. Be it inefficiency in
storing Czech text, Japanese text, English text, or any other. It has
nothing to do with the fact I do not speak Japanese.

I think that most of my RAM and hard drive space is consumed by stuff
other than text. For that reason I do not care about the relative
efficiency of text encoding. It will have minimal impact on the
performance or amount of memory consumed on the system. And there is
always the possibility to compress the text.

> >> subset of UTF-8).
> > Hmm, so you call the possibility to choose your encoding living in
> > the stone age. I would call it living in reality. There are various
> > encodings out there.
>
> Yes, it's the stone age. The filesystem should allow you to see things
> in UTF-8 or SJIS or EUC-JP if you want, but internally it should be
> using something a hell of a lot smarter than those encodings. This is
> what HFS+ and Windows allow.
>

Well, the libc could store the strings in some utf-* encoding on the
disk, and translate that based on the current locale. I wonder if that
is against POSIX or not.
But it is not done, and it is wrong. There are problems of this kind
on Windows as well. It is still not recommended to use non-ASCII
characters in filenames around here...

Thanks

Michal
Bill K. (Guest)
on 2006-03-13 20:17
(Received via mailing list)
Hi,

> I do not care about the space inefficiency. Be it inefficiency in
> storing Czech text, Japanese text, English text, or any other. It has
> nothing to do with the fact I do not speak Japanese.

I'm writing a cross-platform app in Ruby that will include text editing
and also p2p chat.  I'd like to handle extended character sets.  Do we
have any recipes or best practices for handling non-ASCII character
encodings in present-day Ruby 1.8.4?

How are folks currently handling non-ASCII "wide" character encodings
in Ruby?


Thanks,

Bill
Austin Z. (Guest)
on 2006-03-13 20:48
(Received via mailing list)
On 3/13/06, Michal S. <removed_email_address@domain.invalid> wrote:
>> use, then you're completely ignorant of the real world and what is
>> useful and necessary for Unicode.
> The native encoding is bound to be different between platforms. I want
> to use an encoding that I like on all platforms, and convert the
> strings for filenames or whatever to fit the current platform. That is
> why I do not care what a particular platform you name uses.

I think you're just confused here, Michal.

>> By the way, you are wrong -- you *can* have arrays of characters.
>> It's just that those characters are not guaranteed to be a fixed
>> length. It will be the same with Ruby moving forward.
> Yes, you can have arrays of strings. Nice. But to turn a text string
> into a string of characters you have to turn it into an array of
> strings. Instead of just indexing an array of basic types that
> represent the characters.

> And there is a need to look at the actual characters at times. There
> are programs that actually process the text, not only save what the
> user entered in a web form. I can think of text editors, terminal
> emulators, and linguistic tools. I am sure there are others.

NO! This is where you're 100% wrong. Text editors, terminal emulators,
and linguistic tools *especially* should never be looking at the raw
bytes underneath the character strings. They should be dealing with the
characters as discrete entities.

This is what I'm talking about. Byte arrays as characters are nonsense
in today's world. If you don't have an encoding attached to something,
then you can't *possibly* know what it means.

>>> Even if the library is performant with multiword characters it is
>>> complex. That means more prone to errors. Both in itself and in the
>>> software that interfaces it.
>> Nice theory. What reduces the number of errors is no longer thinking in
>> terms of arrays of characters, but in terms of text strings.
> Or strings of strings of 16-bit words, packed? No, thanks. I want to
> avoid that.

Um. You're confused here. It's a text string with a UTF-16 encoding.

> nothing to do with the fact I do not speak Japanese.
> I think that most of my RAM and hard drive space is consumed by stuff
> other than text. For that reason I do not care about the relative
> efficiency of text encoding. It will have minimal impact on the
> performance or amount of memory consumed on the system. And there is
> always the possibility to compress the text.

Then you are willfully ignorant of the concerns of a lot of people.

>>> Hmm, so you call the possibility to choose your encoding living in
>>> the stone age. I would call it living in reality. There are various
>>> encodings out there.
>> Yes, it's the stone age. The filesystem should allow you to see
>> things in UTF-8 or SJIS or EUC-JP if you want, but internally it
>> should be using something a hell of a lot smarter than those
>> encodings. This is what HFS+ and Windows allow.
> Well, the libc could store the strings in some utf-* encoding on the
> disk, and translate that based on the current locale. I wonder if that
> is against POSIX or not.

It's against POSIX because POSIX specifies that filenames are stored
as sequences of single-byte characters with 0x00 as the filename
terminator. POSIX, though, is stupid.

> But it is not done, and it is wrong. There are problems of this kind
> on Windows as well. It is still not recommended to use non-ASCII
> characters in filenames around here...

Actually, that's only if you're using stupid programs. Unfortunately,
that includes Ruby right now. When Matz gets the M17N strings checked
into Ruby 1.9, I will be working toward significantly improving the
Windows filesystem handling so that full Unicode is supported.

-austin
Anthony DeRobertis (Guest)
on 2006-03-14 01:34
(Received via mailing list)
Austin Z. wrote:

> On 3/10/06, Anthony DeRobertis <removed_email_address@domain.invalid> wrote:
>
>> Ummm, no. UTF-16 filenames would break *every* correctly-implemented
>> UNIX program: UTF-16 allows the octet 0x00, which has always been
>> the end-of-string marker.
>
> You're right. And I'm saying that I don't care.

Well, I suspect most other people want to maintain backwards
compatibility. Hence the existence of UTF-8.

> People need to stop
> thinking in terms of bytes (octets) and start thinking in terms of
> characters. I'll say it flat out here: the POSIX filesystem definition
> is going to badly limit what can be done with Unix systems.

Why? POSIX gives nearly binary-transparent file names; the only
exception is the single octet 0x00. Considering the 1:1 mapping between
UTF-8 and other Unicode encodings, how can the choice of one or another
"badly limit" what can be done?

>> Personally, my file names have been in UTF-8 for quite some time now,
>> and it works well: What exactly is this 'stone age' you refer to?
>
> Change an environment variable and watch the programs that had worked
> so well with Unicode break. *That* is the stone age that I refer to.

dd if=/dev/urandom of=/lib/ld-linux.so.2 and watch all my programs
break, too. What's your point?

It is always possible to break a computer system if you try hard enough
(or, all too often, not hard at all); but if the user actively attempts
to make his machine malfunction, that's not the OS's problem.

> I'm also guessing that you don't do much with long Japanese filenames
> or deep paths that involve *anything* except US-ASCII (a subset of
> UTF-8).

Well, I have Japanese file names (though not that many in the grand
scheme of things), and have a lot of files and directories named in
non-US-ASCII. Yeah, I know that file name length and path length limits
suck, but that's an implementation limitation of e.g. ext3, nothing
fundamental.

>
>> UTF-8 can take multiple octets to represent a character. So can
>> UTF-16, UTF-32, and every other variation of Unicode.
>
> This last statement is true only because you use the term "octet."

You're correct; that isn't what I meant to say. Something along the
lines of the following is better worded:

        UTF-8 can take more than one octet to represent a
        character; UTF-16 can take more than two; UTF-32
        more than four; etc.

> It's a useless term here, because UTF-8 only has any level of
> efficiency for US-ASCII.

English, I've heard, is a rather common language.

> Even if you step to European content, UTF-8
> is no longer perfectly efficient,

Of course not --- but still generally better than UTF-16, I think.
Spanish, I've heard, is also a rather common language.

> and when you step to Asian content,
> UTF-8 is so bloody inefficient that most folks who have to deal with
> it would rather work in a native encoding (EUC-JP or SJIS, anyone?)
> which is 1..2 bytes or do everything in UTF-16.

Yes, for CJK, UTF-8 is fairly inefficient: a full 50% bigger than
UTF-16 (three octets per character versus two).

OTOH, it has some nice advantages over UTF-16, like being backwards
compatible with C strings, being resynchronizable (if an octet is
lost), not having byte-order issues, etc.

Now, honestly, what portion of your hard disk is taken up by file names?
Austin Z. (Guest)
on 2006-03-14 08:09
(Received via mailing list)
On 3/13/06, Anthony DeRobertis <removed_email_address@domain.invalid> wrote:
> >> UTF-8 can take multiple octets to represent a character. So can
> >> UTF-16, UTF-32, and every other variation of Unicode.
> > This last statement is true only because you use the term "octet."
> You're correct; that isn't what I meant to say. Something along the
> lines of the following is better worded:
>
>         UTF-8 can take more than one octet to represent a
>         character; UTF-16 can take more than two; UTF-32
>         more than four; etc.

No. UTF-32 does not have surrogates. Unicode is perfectly
representable in either 20 or 21 bits. A single character is *always*
representable in a uint32_t sized space with UTF-32.

POSIX is outdated and needs to be scrapped or fixed. Preferably the
former. Preferably by people who know what they're doing -- and not
the folks behind the GNU libc.

-austin
Bill K. (Guest)
on 2006-03-14 08:33
(Received via mailing list)
From: "Austin Z." <removed_email_address@domain.invalid>
>
> On 3/13/06, Anthony DeRobertis <removed_email_address@domain.invalid> wrote:
>>
>>         UTF-8 can take more than one octet to represent a
>>         character; UTF-16 can take more than two; UTF-32
>>         more than four; etc.
>
> No. UTF-32 does not have surrogates. Unicode is perfectly
> representable in either 20 or 21 bits. A single character is *always*
> representable in a uint32_t sized space with UTF-32.

Hi, I have zero background in non-ASCII character representations,
but the following post has been echoing in my head as a data point
for... can't believe it's been three-and-a-half years:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/...

Does that have any relation to your current context?  Curt seems to
be talking not of surrogates, but saying that "combining characters"
mean variable-length issues still exist with UTF-32?


Regards,

Bill
Austin Z. (Guest)
on 2006-03-14 16:01
(Received via mailing list)
On 3/14/06, Bill K. <removed_email_address@domain.invalid> wrote:
> for... can't believe it's been three-and-a-half years:
>
> http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/...
>
> Does that have any relation to your current context?  Curt seems to
> be talking not of surrogates, but saying "combining characters"
> mean variable-length issues still exist with UTF-32 ?

Yes and no. When you use combining characters, each of the combining
characters (such as COMBINING CEDILLA or COMBINING ACCENT ACUTE) is a
distinct character. If I understand the Unicode standard correctly --
which is perhaps questionable -- you can go either direction. But I
had forgotten (temporarily) about combining characters. For the most
part, Apple chooses to use them and Microsoft chooses not to use them
in native representations wherever possible. Where it becomes
difficult is when you need to combine characters that do not otherwise
have canonical forms. At *that* point, yes, UTF-32 can have multiple
uint32_t elements creating one character. I think that for most
languages, though, the use of combining characters is not necessary.

I withdraw my absolute, though. If you're creating a meaningful glyph
with combining characters, you *can* have multiple uint32_t elements
creating that glyph in UTF-32. Without combining characters, however,
UTF-32 is perfectly representational of all glyphs possible with
Unicode.

-austin
Andreas (Guest)
on 2006-03-14 16:05
I don't get it, guys. Supporting (not exclusively using) Unicode
transparently should be a no-brainer for a serious programming language
these days. I love Ruby, but multi-byte strings are a pain. And they are
everywhere. There's no logic in resisting. There are more chars in the
world than on your keyboard. Even in the US, there are official and
*correct* chars for quotation marks not in the US-ASCII set. Using the
inch sign for quotes is plain wrong. Come on, we're in the 21st century
and the world is a global place. OpenSource people should know that
best. It can't be so difficult technically - others do it, why can't
you?

All we want is a Unicode safe Ruby.

Best,
Andreas
Anthony DeRobertis (Guest)
on 2006-03-14 17:11
(Received via mailing list)
Austin Z. wrote:

> No. UTF-32 does not have surrogates. Unicode is perfectly
> representable in either 20 or 21 bits. A single character is *always*
> representable in a uint32_t sized space with UTF-32.

Depends on what you call a character; in the technical way Unicode uses
the term, yes, UTF-32 can represent every character at present.

In the way that users understand characters (what the unicode standard
calls a "grapheme") â?? the way text-processing software needs to
manipulate characters â?? no it can't.

dÌ?Ì? is not three characters to the user.

> POSIX is outdated and needs to be scrapped or fixed.

So far, you have provided no evidence of this, just assertions that
somehow UTF-8 is horribly limiting.
Michal S. (Guest)
on 2006-03-14 17:39
(Received via mailing list)
On 3/13/06, Austin Z. <removed_email_address@domain.invalid> wrote:
> >> characters. Frankly, if you don't care what Windows, OS X, and ICU
> >> use, then you're completely ignorant of the real world and what is
> >> useful and necessary for Unicode.
> > The native encoding is bound to be different between platforms. I want
> > to use an encoding that I like on all platforms, and convert the
> > strings for filenames or whatever to fit the current platform. That is
> > why I do not care what a particular platform you name uses.
>
> I think you're just confused here, Michal.
In what way?
> > are programs that actually process the text, not only save what the
> > user entered in a web form. I can think of text editors, terminal
> > emulators, and linguistic tools. I am sure there are others.
>
> NO! This is where you're 100% wrong. Text editors, terminal emulators,
> and linguistic tools *especially* should never be looking at the raw
> bytes underneath the character strings. They should be dealing with the
> characters as discrete entities.

I am saying I want to look at characters, not that I want to look at
bytes.
And I am saying that looking at entities that happen to be all the
same size makes things much simpler than looking at strings packed
into another string without separators. And multiword characters are
word strings, nothing else.

>
> This is what I'm talking about. Byte arrays as characters are nonsense
> in today's world. If you don't have an encoding attached to something,
> then you can't *possibly* know what it means.

No, it is not.
Sure, I am not for byte arrays or chunks of data of unknown encoding.

>
> >>> Even if the library is performant with multiword characters it is
> >>> complex. That means more prone to errors. Both in itself and in the
> >>> software that interfaces it.
> >> Nice theory. What reduces the number of errors is no longer thinking in
> >> terms of arrays of characters, but in terms of text strings.
> > Or strings of strings of 16-bit words, packed? No, thanks. I want to
> > avoid that.
>
> Um. You're confused here. It's a text string with a UTF-16 encoding.

Yes, it is a text string, which is a string of characters packed one
after another, which happen themselves to be strings of 16-bit words.
Give me the 100th character.

> > storing Czech text, Japanese text, English text, or any other. It has
> > nothing to do with the fact I do not speak Japanese.
>
> > I think that most of my ram and hardrive space is consumed by other
> > stuff than text. For that reason I do not care about the relative
> > efficiency of text encoding. It will have minimal impact on the
> > performance or amounts of memory consumed on the system. And there is
> > always the possibility to compress the text.
>
> Then you are willfully ignorant of the concerns of a lot of people.

OK, you are concerned about the space consumed by text. I wonder how
large a portion of your RAM is used for text. Or how large a portion of
your hard drive is used by text for which you can choose the encoding.
I got lots of C sources, but I suspect that C compilers won't accept
wide characters anytime soon. And anything but a byte encoding is quite
pointless for C sources. Most of the stuff is identifiers that can be
only 7-bit ASCII anyway.


Thanks

Michal
Michal S. (Guest)
on 2006-03-14 17:45
(Received via mailing list)
On 3/14/06, Bill K. <removed_email_address@domain.invalid> wrote:
> > representable in a uint32_t sized space with UTF-32.
>
well, in some languages you get characters like "LATIN CAPITAL LETTER
A WITH ACUTE".
In a string you can either get the above or "LATIN CAPITAL LETTER A"
followed by "COMBINING ACUTE ACCENT" or some such. This is the
decomposed form.

And there are libraries for normalizing/composing/decomposing unicode
strings.
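
A tiny sketch of the difference, written as raw UTF-8 bytes in Ruby 1.8:

  composed   = "\xC3\x81"    # U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
  decomposed = "A\xCC\x81"   # U+0041 followed by U+0301 COMBINING ACUTE ACCENT

  composed == decomposed     # => false -- same rendered glyph, different
                             #    bytes; comparing them needs a
                             #    normalization pass (e.g. via one of
                             #    those libraries)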

Thanks

Michal
Austin Z. (Guest)
on 2006-03-14 17:57
(Received via mailing list)
On 3/14/06, Andreas <removed_email_address@domain.invalid> wrote:
>
> All we want is a Unicode safe Ruby.

You have it.

You'll have something even better in Ruby 2.0.

You will *not* have it natively in Ruby 1.8.

-austin
Shawn A. (Guest)
on 2006-03-14 18:50
(Received via mailing list)
As long as we're discussing unicode here,
I am testing a large C++ application and would like to wrap it with
Ruby.  However, this application makes wide use of unicode and wchar_t
in almost all its method calls.  Can anyone help my feeble mind
understand how to do this?

Would someone be able to point me in the direction of some example
code or such for both calling into the C++ code with a wchar_t
argument, and getting wchar_t's back from calls?  Apparently SWIG
won't touch this?

Thanks,
Shawn
Michal S. (Guest)
on 2006-03-16 00:18
(Received via mailing list)
On 3/14/06, Shawn A. <removed_email_address@domain.invalid> wrote:
> As long as we're discussing unicode here,
> I am testing a large C++ application and would like to wrap it with
> Ruby.  However, this application makes wide use of unicode and wchar_t
> in almost all its method calls.  Can anyone help my feeble mind
> understand how to do this?
>
> Would someone be able to point me in the direction of some example
> code or such for both calling into the C++ code with a wchar_t
> argument, and getting wchar_t's back from calls?  Apparently SWIG
> won't touch this?

This is exactly the thing that is not supported right now. But you
might be able to convert the wide character strings to something else
using iconv.
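
Something along these lines might serve as glue (a sketch only; the
wide-side encoding name depends on the platform's wchar_t -- typically
UTF-32 on Linux, UTF-16 on Windows -- and on what your iconv build
supports):

  require 'iconv'

  # Hypothetical helpers: the C++ side hands wchar_t data back and
  # forth as raw byte strings.
  def wide_to_utf8(bytes, from = "UTF-32LE")
    Iconv.conv("UTF-8", from, bytes)
  end

  def utf8_to_wide(str, to = "UTF-32LE")
    Iconv.conv(to, "UTF-8", str)
  end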

And you could possibly use icu4r to work with wide strings directly if
it happens to use the same wide characters. But I suspect you would
have to write some glue code to put it all together.

Swig is supposed to make such argument conversions easier.

Thanks

Michal


This topic is locked and cannot be replied to.