Unicode in Ruby


#1

i’m using IO.foreach to parse the lines in a file. now i’m trying to get
it to work with unicode encoded files. does ruby support unicode? how do
i compare a variable with a unicode constant string?

the script goes something like:

IO.foreach("myfile.txt") { |line|
  if line.downcase[0,2] == "id"


#2

On 3/8/06, Richard G. removed_email_address@domain.invalid wrote:

i’m using IO.foreach to parse the lines in a file. now i’m trying to get
it to work with unicode encoded files. does ruby support unicode? how do
i compare a variable with a unicode constant string?

the script goes something like:

IO.foreach("myfile.txt") { |line|
  if line.downcase[0,2] == "id"

To get Unicode downcase you probably want icu4r. To handle only the cases
you are interested in you could write your own. However, the []
operator of Ruby strings returns bytes, not characters.
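For context: that was true of the Ruby 1.8 current at the time; modern Ruby indexes characters instead. A quick sketch of the distinction (the string literal is just an example):

```ruby
s = "école"        # five characters, six bytes in UTF-8

puts s[0]          # modern Ruby: "é", a whole character
puts s.bytes[0]    # 195 (0xC3) -- the first raw byte, which is
                   # roughly what 1.8's s[0] handed back
```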

hth

Michal


Support the freedom of music!
Maybe it’s a weird genre … but weird is not illegal.
Maybe next time they will send a special forces commando
to your picnic … because they think you are weird.
www.music-versus-guns.org http://en.policejnistat.cz


#3

Michal S. removed_email_address@domain.invalid wrote:

On 3/8/06, Richard G. removed_email_address@domain.invalid wrote:

i’m using IO.foreach [… no \n ]

you don’t make use of “\n” at uni-berlin.de when wrapping ?

could be more readable :wink:


#4

On Mar 8, 2006, at 1:13 PM, Richard G. wrote:

so, you guys are telling me a language developed since the year
2000 doesn’t support unicode strings natively? in my opinion,
that’s a pretty glaring problem.

Ruby doesn’t really support any strings natively. It just happens to
have a bytevector class that acts a lot like a string :wink: Having said
that, have you tried:

$KCODE = "u" # assumes the source file is encoded as UTF-8; affects
literal strings, regexps, etc.

If your source file is UTF-16 or some other non-UTF-8 encoding you’ll
have to use iconv to convert it to UTF-8 before comparing with the
literals in your source.
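On current Ruby, String#encode has replaced iconv for this; a minimal sketch of converting a UTF-16 line before comparison (the line content is made up):

```ruby
# Simulate a line read from a UTF-16LE file, then convert it to
# UTF-8 so it can be compared against UTF-8 source literals.
utf16_line = "ID 42\n".encode("UTF-16LE")
line = utf16_line.encode("UTF-8")

puts line.downcase[0, 2] == "id"   # => true
```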


#5

On 3/8/06, Richard G. removed_email_address@domain.invalid wrote:

so, you guys are telling me a language developed since the year 2000
doesn’t support unicode strings natively? in my opinion, that’s a pretty
glaring problem.

For me it is a problem as well. But getting Unicode right is hard.
Look at the size of the ICU library versus the size of Ruby itself.
Anyway, Unicode regexps are planned for Ruby 2.0, IIRC.
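Those Unicode-aware regexps did eventually land (with Ruby 1.9's Oniguruma engine); a small sketch of what they allow:

```ruby
# \p{} property classes and case-folding /i both understand
# non-ASCII letters on modern Ruby.
puts("école" =~ /\A\p{Ll}/)   # => 0 -- é is a lowercase letter
puts("ÉCOLE" =~ /école/i)     # => 0 -- /i folds accented characters too
```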

Thanks

Michal




#6

so, you guys are telling me a language developed since the year 2000
doesn’t support unicode strings natively? in my opinion, that’s a pretty
glaring problem.


#7

On 3/8/06, Logan C. removed_email_address@domain.invalid wrote:

Having said that, have you tried:
$KCODE = "u" # assumes the source file is encoded as UTF-8; affects
literal strings, regexps, etc.

If your source file is UTF16 or some other non-UTF8 encoding you’ll
have to use iconv to get into UTF8 to compare with the literals in
your source.

Err, no, that is not what people want when they speak about downcase in
Unicode.
Sure, you can write a string encoded in UTF-8 in your source and
verify it is byte-identical to another string. That is about all you
get this way.
I suspect regexps won’t work right with multibyte characters; for
downcase or case-insensitive regexps you would even need to know the
language.
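For the record, modern Ruby (2.4+) grew exactly this: full Unicode case mapping, with optional language-specific rules for the cases that depend on the language. A sketch:

```ruby
puts "ÉCOLE".downcase             # => "école" -- full Unicode mapping
puts "İstanbul".downcase(:turkic) # => "istanbul" -- Turkish dotted-I rule
```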

Thanks

Michal




#8

On Mar 8, 2006, at 7:24 PM, Michal S. wrote:

Anyway, unicode regexps are planned for ruby 2.0 iirc.

Unicode strings are also planned for Ruby 2 (possibly implemented
already?).

– Daniel


#9

Logan C. removed_email_address@domain.invalid writes:

Ruby doesn’t really support any strings natively. It just happens to
have a bytevector class that acts a lot like a string :wink:

… that acts a lot like a string /of ASCII chars/, actually. Rather
anachronistic, imho.

I can’t consider that “il était une fois”.length == 18 is the way it
should be with a string in a modern language.
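On a modern Ruby the distinction is visible directly (1.8's length counted bytes; today's counts characters):

```ruby
s = "il était une fois"
puts s.length     # => 17 characters
puts s.bytesize   # => 18 bytes in UTF-8 -- the number 1.8 reported
```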

Of course, tweaking with -K and jcode and/or other third-party
modules and/or various hacks allows some enhancements (we have a
jlength method that seems to work), but that’s nothing to write home
about either (case methods support only ASCII chars, etc.).

Waiting for plain support in Rite (much more important to me than
the “end” issues…).


#10

Eric J. wrote:

Waiting for a plain support in Rite (much more important to me than
the “end” issues…)

Speaking of Rite… is there a timeline on its release yet? One year?
Two years? More?


#11

On Mar 8, 2006, at 8:18 PM, rtilley wrote:

Speaking of Rite… is there a timeline on its release yet? One
year? Two years? More?

http://www.atdot.net/yarv/
http://redhanded.hobix.com/cult/yarvMergedMatz.html

– Daniel


#12

Daniel H. wrote:

http://www.atdot.net/yarv/
http://redhanded.hobix.com/cult/yarvMergedMatz.html

– Daniel

I’m new to Ruby… I did not know that Rite was tied to YARV. Thanks for
the links!


#13

guess i’ll wait till then. thanks for the info guys.


#14

exactly. utf-8 doesn’t mean one byte per char necessarily.

how have folks solved this problem when writing web sites in rails?


#15

It’s a huge f*cking pain in the ass. We’ve been trying to convert
Wayfaring.com over to UTF-8 off and on for about a month and it’s
completely useless. Either you start the site using UTF-8 (using crappy
hacks IMO) or forgetaboutit. We’re about to break ground on a new site
and I almost don’t want to do it until Ruby 2.0 comes out with the
Unicode support built in.

-PJ
http://pjhyett.com


#16

On 3/10/06, Austin Z. removed_email_address@domain.invalid wrote:

filename format, then we’re going to be much better. That will, however,
break some assumptions by really stupid programs.)

Why the hell UTF-16? It is no longer compatible with ASCII, yet 16
bits are far from sufficient to cover current Unicode, so you still
get multiword characters. It is not even dword-aligned for fast
processing by current CPUs.
I would like UTF-8 for compatibility and UTF-32 for easy string
processing, but I do not see much use for UTF-16.
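The trade-off is easy to see by re-encoding the same text with modern Ruby's String#encode (sample strings made up):

```ruby
s = "école"   # BMP-only text
%w[UTF-8 UTF-16BE UTF-32BE].each do |enc|
  puts "#{enc}: #{s.encode(enc).bytesize} bytes"
end
# UTF-8: 6, UTF-16BE: 10, UTF-32BE: 20

# And the multiword case objected to above: a character outside the
# BMP needs a surrogate pair (four bytes) even in UTF-16.
puts "\u{1D11E}".encode("UTF-16BE").bytesize   # => 4
```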

Thanks

Michal




#17

Austin Z. wrote:

Unix support for
Unicode is still in the stone ages because of the nonsense that POSIX
put on Unix ages ago. (When Unix filesystems can write UTF-16 as their
native filename format, then we’re going to be much better. That will,
however, break some assumptions by really stupid programs.)

Ummm, no. UTF-16 filenames would break every correctly-implemented
UNIX program: UTF-16 allows the octet 0x00, which has always been the
end-of-string marker.

Personally, my file names have been in UTF-8 for quite some time now,
and it works well: What exactly is this ‘stone age’ you refer to?

UTF-8 can take multiple octets to represent a character. So can UTF-16,
UTF-32, and every other variation of Unicode.

Depending on content, a string in UTF-8 can consume more octets than
the same string in UTF-16, or vice versa.

Ah! But wait. I can see an advantage to UTF-16. With UTF-8, you don’t
get to have the fun of picking between big- and little-endian!


#18

On 3/8/06, Richard G. removed_email_address@domain.invalid wrote:

so, you guys are telling me a language developed since the year 2000
doesn’t support unicode strings natively? in my opinion, that’s a
pretty glaring problem.

Please note that Ruby itself is ten years old. Unicode has only
recently (the last three or four years, with the release of Windows
XP) become a major factor, especially in Japan. Unix support for Unicode
is still in the stone ages because of the nonsense that POSIX put on
Unix ages ago. (When Unix filesystems can write UTF-16 as their native
filename format, then we’re going to be much better. That will, however,
break some assumptions by really stupid programs.)

I’ve been following what Matz has had to say and have recently done
quite a bit of work with Unicode. The reality is that Unicode is hard,
and there are cultural and other reasons for Ruby not to have Unicode
(UTF-16 or UTF-8) strings by default. I think that Matz’s plan for M17N
strings is far superior to assuming Unicode by default.

Basically, Ruby will have the capabilities to work with UTF-8, UTF-16,
and probably the ISO-8859-* encodings natively, as well as the existing
SJIS and EUC-JP support. I wouldn’t be surprised if it also includes
other EUC-* encodings. Essentially, you’ll be able to do:

s = "école"              # raw bytes C3 A9 63 6F 6C 65 (UTF-8 source)
s.encoding               # -> :raw (or something like that)
s.encoding = :iso8859_1  # reads as "Ã©cole" -- same bytes, reinterpreted
s.encoding = :utf8       # reads as "école" again
s.capitalize!            # "École"
s.encoding = :iso8859_1  # reads as "Ã‰cole"

More than that, using the same string:

s[0]                     # "Ã" -- the single byte 0xC3 under ISO-8859-1
s.encoding = :utf8
s[0]                     # "é" -- the two bytes 0xC3 0xA9

I’ve shown everything as a byte string here. The point is, though, that
going from the raw encoding – which may be the default, or the default
may be able to be set – shouldn’t cause any byte conversions. I suspect
that Matz will have a different way to get at the underlying bytes, but
that’s what will be happening for Ruby 2.0.
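What eventually shipped in Ruby 1.9+ is close to this: force_encoding retags the bytes without conversion (the s.encoding= above), while encode really converts them. A sketch:

```ruby
s = "école".dup                  # UTF-8 source: bytes C3 A9 63 6F 6C 65
s.force_encoding("ISO-8859-1")   # retag only: same six bytes, now six
                                 # Latin-1 characters reading "Ã©cole"
puts s.length                    # => 6
t = s.encode("UTF-8")            # a real conversion: bytes do change
puts t.bytesize                  # => 8 ("Ã©cole" as genuine UTF-8)
```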

The last indication I had seen suggested that M17N strings were closer,
but not yet done. I’m looking forward to them.

-austin


#19

On 3/10/06, Michal S. removed_email_address@domain.invalid wrote:

their native filename format, then we’re going to be much better.
That will, however, break some assumptions by really stupid
programs.)
Why the hell utf-16? It is no longer compatible with ascii, yet 16
bits are far from sufficient to cover current unicode. So you still
get multiword characters. It is not even dword aligned for fast
processing by current cpus. I would like utf-8 for compatibility, and
utf-32 for easy string processing. But I do not see much use for
utf-16.

UTF-16 is actually pretty performant, and the implementation of wchar_t
on MacOS X and Windows is (you guessed it!) UTF-16. The filesystems for
both of these operating systems (which have far superior Unicode
support to anything else) both use UTF-16 as the native filename
encoding (this is true for HFS+, NTFS4, and NTFS5). The only difference
between what MacOS X does and Windows does here is that Apple chose
to use decomposed characters instead of composed characters (e.g.,
LATIN SMALL LETTER E + COMBINING ACUTE ACCENT instead of LATIN SMALL
LETTER E WITH ACUTE).

Look at the performance numbers for ICU4C: it’s pretty damn good. UTF-32
isn’t exactly space-conservative (with UTF-16, most of the BMP
can be represented with a single wchar_t, and only a few characters need
surrogates taking up exactly two wchar_ts, whereas every character would
take up a four-byte uint32_t under UTF-32). ICU4C uses UTF-16 internally.
Exclusively.

On 3/10/06, Anthony DeRobertis removed_email_address@domain.invalid wrote:

Austin Z. wrote:

Unix support for Unicode is still in the stone ages because of the
nonsense that POSIX put on Unix ages ago. (When Unix filesystems can
write UTF-16 as their native filename format, then we’re going to be
much better. That will, however, break some assumptions by really
stupid programs.)
Ummm, no. UTF-16 filenames would break every correctly-implemented
UNIX program: UTF-16 allows the octet 0x00, which has always been the
end-of-string marker.

You’re right. And I’m saying that I don’t care. People need to stop
thinking in terms of bytes (octets) and start thinking in terms of
characters. I’ll say it flat out here: the POSIX filesystem definition
is going to badly limit what can be done with Unix systems. One could do
what I think Apple has done and provide two filesystem interfaces
that are kept synchronized. The native interface – the more
efficient one – would use UTF-16, because that’s what HFS+ speaks.
The secondary interface (which also works on UFS filesystems) would
translate to UTF-8 and/or follow the nonsensical POSIX rules for native
encodings.

Personally, my file names have been in UTF-8 for quite some time now,
and it works well: What exactly is this ‘stone age’ you refer to?

Change an environment variable and watch programs that had worked so
well with Unicode break. That is the stone age I refer to.
I’m also guessing that you don’t do much with long Japanese filenames or
deep paths that involve anything except US-ASCII (a subset of UTF-8).

UTF-8 can take multiple octets to represent a character. So can UTF-16,
UTF-32, and every other variation of Unicode.

This last statement is true only because you use the term “octet.” It’s
a useless term here, because UTF-8 is only really efficient for
US-ASCII. Even for European content UTF-8 is no longer
perfectly efficient, and for Asian content UTF-8 is so
bloody inefficient that most folks who have to deal with it would rather
work in a native encoding (EUC-JP or SJIS, anyone?), which takes 1-2
bytes per character, or do everything in UTF-16.
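The size claim is easy to check on a modern Ruby (the sample string is made up):

```ruby
s = "こんにちは"                     # five Japanese characters
puts s.encode("UTF-8").bytesize     # => 15 -- three bytes each
puts s.encode("UTF-16BE").bytesize  # => 10 -- two bytes each
puts s.encode("EUC-JP").bytesize    # => 10 -- the 1-2 byte native encoding
```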

Depending on content, a string in UTF-8 can consume more octects than
the same string in UTF-16, or vice versa.

Ah! But wait. I can see an advantage to UTF-16. With UTF-8, you don’t
get to have the fun of picking between big- and little-endian!

Are people always this stupid when it comes to things that they clearly
don’t understand? Yes, UTF-16 has the problem of not knowing whether
you’re dealing with UTF-16BE or UTF-16LE, but my understanding is that
this is only an issue when you’re dealing with both on the same
system. Additionally, most platforms specify a default. It’s been a
while (almost a year), but I think that ICU4C defaults to UTF-16BE
internally, not just UTF-16.

There. Problem solved.

If you’re going to babble on about Unicode, it’d be nice if you knew
more than the knee-jerk stuff you’ve posted so far. Either of you.

-austin


#20

On 3/11/06, Austin Z. removed_email_address@domain.invalid wrote:

put on Unix ages ago. (When Unix filesystems can write UTF-16 as
UTF-16 is actually pretty performant and the implementation of wchar_t
isn’t exactly space conservative (since with UTF-16 most of the BMP
can be represented with a single wchar_t, and only a few need surrogates
taking up exactly two wchar_ts, whereas all characters would take up
four uint32_t under UTF-32). ICU4C uses UTF-16 internally. Exclusively.

I do not care what Windows, OS X, or ICU uses. I care what I want to
use. Even if most characters are encoded as a single word, you have to
cope with multiword characters. That means a character is not a
simple type: you cannot have character arrays, and no library can
completely wrap this inconsistency and isolate you from dealing with
it.

Even if a library is performant with multiword characters, it is
complex. That means more prone to errors, both in itself and in the
software that interfaces with it.

You say that UTF-16 is more space-conserving for languages like
Japanese. Nice, but I do not care. Text consumes a very small
portion of memory on my system, both RAM and hard drive; I do not care
if that doubles or quadruples. In the very few cases when I do want to
save space (i.e., when sending email attachments) I can use gzip, which
can even compress repetitive text, something no encoding can do.
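The compression point is fair; a sketch with Ruby's standard zlib bindings (the sample text is made up):

```ruby
require "zlib"

text = "répétitif " * 100               # highly repetitive UTF-8 text
packed = Zlib::Deflate.deflate(text)
puts packed.bytesize < text.bytesize    # => true -- no encoding does this
```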


Personally, my file names have been in UTF-8 for quite some time now,
and it works well: What exactly is this ‘stone age’ you refer to?

Change an environment variable and watch programs that had worked so
well with Unicode break. That is the stone age I refer to.
I’m also guessing that you don’t do much with long Japanese filenames or
deep paths that involve anything except US-ASCII (a subset of UTF-8).

Hmm, so you call the possibility of choosing your encoding living in
the stone age. I would call it living in reality. There are various
encodings out there.

or do everything in UTF-16.

No, I suspect the reason for using EUC-JP, SJIS, ISO-8859-*, and
other weird encodings is historical.
What do you mean by efficiency? If you want space efficiency, use
compression. If you want speed, use UTF-32 or a similar encoding that
does not have to deal with special cases.

this is only an issue when you’re dealing with both on the same
system. Additionally, most platforms specify a default. It’s been a
while (almost a year), but I think that ICU4C defaults to UTF-16BE
internally, not just UTF-16.

IIRC there are even byte-order marks. If you insert one in every
string you can get them identified at any time without doubt :slight_smile:
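There are indeed: the byte-order mark is just U+FEFF serialized at the front of the stream, so the byte order identifies itself. A sketch:

```ruby
# The same code point serializes differently under the two byte orders.
puts "\uFEFF".encode("UTF-16BE").bytes.inspect   # => [254, 255]
puts "\uFEFF".encode("UTF-16LE").bytes.inspect   # => [255, 254]
```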

But do not trust me on that. I do not know anything about Unicode, and
I want to sidestep the issue by using an encoding that is easy to work
with, even for the ignorant :stuck_out_tongue:

Thanks

Michal