Unicode

Zephyr_P · September 17, 2007, 1:25pm

I hate to discuss something related to the development timeline, I know
its tenable, but When will it be reasonable to expect Unicode support
from Ruby?

Zephyr_P · September 17, 2007, 2:44pm

Zephyr P. wrote:

I hate to discuss something related to the development timeline, I know
its tenable, but When will it be reasonable to expect Unicode support
from Ruby?

I was just looking at the source code for 1.8.6 this weekend. The C
syntax that’s being used is pre-ANSI-C (which means in 1988, it was
“old” syntax).

Rotsa Ruck.

Todd

Zephyr_P · September 17, 2007, 2:54pm

I hate to discuss something related to the development timeline, I know
its tenable, but When will it be reasonable to expect Unicode support
from Ruby?

I was just looking at the source code for 1.8.6 this weekend. The C
syntax that’s being used is pre-ANSI-C (which means in 1988, it was
“old” syntax).

Apples and oranges. Unicode libraries like iconv use C linkage, so they
can bond with most C implementations regardless of their compliance. (C
linkage is very weak and simplistic.) All Cs can handle 8-bit strings, and
can be programmed to use 16-bit strings, which are the requirements for
UTF-8 and UTF-16.

Like most languages, Ruby’s source is in a primitive form of C to
maximize the number of compilers, and hence the number of platforms and
hardwares, that it runs on. I would suspect - unless Matz is an even greater
genius than average - that Ruby’s C style has been carefully retrofitted,
after the language passed its first few version ticks.

Rotsa Ruck.

Racial slur noted.

Zephyr_P · September 17, 2007, 4:52pm

Todd B. wrote:

For the record, this was NOT intended to slur anything. It was not my
intent, nor is my nature, to slur. However, reading this in hindsight,
it certainly could be taken this way. Please accept my apologies.

Oh my apologies too - Scooby Doo is quite over my head. All I could
imagine was Matz in a kimono serving Sake.

Zephyr_P · September 17, 2007, 5:02pm

Hi,

In message “Re: Unicode”
on Mon, 17 Sep 2007 22:50:12 +0900, Todd B.
[email protected] writes:

|Lotsa luck getting something like Unicode implemented when the
|underlying C constructs are using such an outdated syntax as ruby’s does.

The old K&R style has nothing to do with the language’s Unicode support.
If you think it does, please elaborate.

It just reflects the history of the language. When I started
developing Ruby, the old Sun CC compiler did not understand the new style,
and I wanted Ruby to run on that platform, which I was using then.

For your information, the next release (1.9) finally abandoned the old
style.

          matz.

Zephyr_P · September 17, 2007, 3:50pm

Phlip wrote:

Racial slur noted.

You got a problem with Scooby Doo?

For the record, this was NOT intended to slur anything. It was not my
intent, nor is my nature, to slur. However, reading this in hindsight,
it certainly could be taken this way. Please accept my apologies.

Now, I’ll rephrase.

Lotsa luck getting something like Unicode implemented when the
underlying C constructs are using such an outdated syntax as ruby’s does.

But, as Phlip implies, it’s just a simple matter of programming.

Todd

Zephyr_P · September 17, 2007, 5:59pm

Hi,

In message “Re: Unicode”
on Tue, 18 Sep 2007 00:50:36 +0900, Todd B.
[email protected] writes:

|I’m new to C programming, but not new to programming. Therefore, my
|assumption (yes, assumption) was that using whatever compiler switches
|were necessary to accept the old-style syntax would obviate the
|opportunity to bring in “modern” libraries with unicode support, and/or
|prohibit those aspects of the language that would enable the use of
|unicode features.

Even though the old style has some drawbacks (fewer type checks, for
example), it does not have the linkage problems you were worried about.

          matz.

Zephyr_P · September 21, 2007, 11:21am

On 15/09/2007, Zephyr P. [email protected] wrote:

I hate to discuss something related to the development timeline, I know
its tenable, but When will it be reasonable to expect Unicode support
from Ruby?

Ruby has unicode support. Sort of. Regexes work in UTF-8 when $KCODE
is set to “U” (and the default is “N” even in UTF-8 locales, and if
you specify the -K option in the .rb file it overrides the option
specified on the command line, heh).
The non-regex methods do not work but you can convert the string with
str.scan(/./)[0] or str.unpack “U*”, and use stuff like each, reverse,
[], …
You have to remember to convert the string back, though.
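
The unpack/pack round trip described above can be sketched roughly as follows. This is only an illustration: the sample string and variable names are made up, and it assumes the input bytes are valid UTF-8.

```ruby
# "café" written as raw UTF-8 bytes; the é is the two bytes C3 A9.
str = "caf\xC3\xA9"

# Decode the UTF-8 bytes into an array of integer code points.
codepoints = str.unpack("U*")

# Array methods now operate on whole characters, not on bytes.
reversed = codepoints.reverse

# Convert back: re-encode the code points as a UTF-8 string.
result = reversed.pack("U*")

puts codepoints.inspect
puts result
```

Reversing the raw bytes instead would split the two-byte é and leave an invalid sequence, which is why the conversion step matters.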

Thanks

Michal

Zephyr_P · September 22, 2007, 1:04pm

Michal S. wrote:
On 15/09/2007, Zephyr P. [email protected] wrote:

I hate to discuss something related to the development timeline, I know
its tenable, but When will it be reasonable to expect Unicode support
from Ruby?

Ruby has unicode support. Sort of. Regexes work in UTF-8 when $KCODE
is set to “U” (and the default is “N” even in UTF-8 locales, and if
you specify the -K option in the .rb file it overrides the option
specified on the command line, heh).
The non-regex methods do not work but you can convert the string with
str.scan(/./)[0] or str.unpack “U*”, and use stuff like each, reverse,
[], …
You have to remember to convert the string back, though.

Thanks

Michal

… or you may use the /re/u regex option to handle UTF-8 encoded
strings (cf. http://snippets.dzone.com/posts/show/4527 ).
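
A small sketch of the /u option, assuming the string holds valid UTF-8. The sample string and variable names are hypothetical.

```ruby
# "café" written as raw UTF-8 bytes; the last character is two bytes (C3 A9).
str = "caf\xC3\xA9"

# With the u option, . matches one UTF-8 character rather than one byte.
chars = str.scan(/./u)

# Work on the array of characters, then join to get a string back.
reversed = chars.reverse.join

puts chars.length
puts reversed
```

This avoids setting $KCODE globally, since the option applies only to the one regex.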

Cheers,

j.k.

Zephyr_P · September 25, 2007, 11:00pm

On Sep 14, 2007, at 9:05 PM, Zephyr P. wrote:

I hate to discuss something related to the development timeline, I
know its tenable, but When will it be reasonable to expect Unicode
support from Ruby?

Ruby has some UTF-8 support today. Support will increase with the
m17n support though.

See last question and answer here:

http://blog.grayproductions.net/articles/the_ruby_vm_episode_iv

James Edward G. II

Zephyr_P · September 25, 2007, 11:05pm

Zephyr P. wrote:

I hate to discuss something related to the development timeline, I know
its tenable, but When will it be reasonable to expect Unicode support from
Ruby?

“Unicode” is not an encoding. Are you asking for UTF-8, UTF-16, or
something
else?

Zephyr_P · September 28, 2007, 11:49pm

On 9/21/07, Michal S. [email protected] wrote:

str.scan(/./)[0] or str.unpack “U*”, and use stuff like each, reverse,
[], …
You have to remember to convert the string back, though.

What about UTF-16?

http://blogs.gnome.org/sudaltsov/2007/09/22/ruby-part2/

Zephyr_P · September 29, 2007, 3:08am

On Sep 28, 2007, at 4:49 PM, Felipe C. wrote:

you specify the -K option in the .rb file it overrides the option

–
Felipe C.

Go to unicode.org
There you can read a full explanation (or a brief one) about why you
don’t need to worry about UTF-16
UTF-8 is all you need.
Unicode is something everyone needs to read up on at some point.
I have to read up on it every now and then because my brain leaks.

Zephyr_P · September 17, 2007, 5:50pm

Yukihiro M. wrote:

Hi,

The old K&R style has nothing to do with the language’s Unicode support.
If you think it does, please elaborate.

It just reflects the history of the language. When I started
developing Ruby, the old Sun CC compiler did not understand the new style,
and I wanted Ruby to run on that platform, which I was using then.

For your information, the next release (1.9) finally abandoned the old
style.
          matz.

Thanks Matz.

I’m new to C programming, but not new to programming. Therefore, my
assumption (yes, assumption) was that using whatever compiler switches
were necessary to accept the old-style syntax would obviate the
opportunity to bring in “modern” libraries with unicode support, and/or
prohibit those aspects of the language that would enable the use of
unicode features.

So, apparently, they (“they” being unicode support and the
syntax/compiler switches) are not related, and that’s great.

By the way, as an aside, I really like the language you developed and
have made available. I primarily use Ruby with SketchUp (a 3D modeling
program - http://www.sketchup.com) for extending the functionality of
the product. (SketchUp has a Ruby API) I was looking at the source to
see what it would take to implement a debugger that would work with Ruby
while running under SketchUp. I would like to step through expression
evaluation as the script runs.

(Big aspirations for a new C programmer like myself!)

Todd

Zephyr_P · September 29, 2007, 9:14pm

Yes but what about stuff already encoded in UTF-16?

That’s why I said read up on unicode!
After you read that stuff you’ll understand why it’s no problem.
I’m not going to explain it. Many people understand it, but might
make mistakes when explaining it.
Read the unicode stuff carefully. It’s vital for many things.

The only thing you might run into is BOM or Endian-ness, but it’s
doubtful it will be an issue in most cases.

This might get you started.
FAQ - UTF-8, UTF-16, UTF-32 & BOM

Even Joel Spolsky wrote a brief bit on unicode… mostly trumpeting
how programmers need to know it and how few actually do.
The short version is that UTF-16 is basically wasteful. It uses 2
bytes for lower-level code-points (the stuff also known as ASCII
range) where UTF-8 does not.

You really need to spend an afternoon reading about unicode. It
should be required in any computer science program as part of an
encoding course, Americans in particular are often the ones who know
the least about it…

Zephyr_P · September 29, 2007, 2:08pm

On 9/29/07, John J. [email protected] wrote:

What about UTF-16?
Unicode is something everyone needs to read up on at some point.
I have to read up on it every now and then because my brain leaks.

Yes but what about stuff already encoded in UTF-16?

Zephyr_P · September 29, 2007, 9:32pm

On Sep 29, 2007, at 2:13 PM, John J. wrote:

The short version is that UTF-16 is basically wasteful.

That’s not always accurate:

$ iconv -f utf-8 -t utf-16 japanese_prose_in_utf8.txt > japanese_prose_in_utf16.txt
Firefly:~/Desktop$ wc japanese_prose_in_utf8.txt
14 66 5921 japanese_prose_in_utf8.txt
Firefly:~/Desktop$ wc japanese_prose_in_utf16.txt
16 45 3968 japanese_prose_in_utf16.txt
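
A rough way to reproduce the byte-count comparison without iconv, assuming a Ruby that has String#encode (1.9 or later, so not the 1.8 series discussed in this thread). The sample strings are made up for illustration.

```ruby
# Sample strings (illustrative only): pure ASCII, and the Japanese
# word for "Japanese language" written with \u escapes.
samples = { "ascii" => "Ruby", "japanese" => "\u65E5\u672C\u8A9E" }

sizes = {}
samples.each do |label, text|
  utf8_bytes  = text.encode("UTF-8").bytesize
  utf16_bytes = text.encode("UTF-16LE").bytesize # LE variant: no BOM is added
  sizes[label] = [utf8_bytes, utf16_bytes]
  puts "#{label}: UTF-8 #{utf8_bytes} bytes, UTF-16 #{utf16_bytes} bytes"
end
```

ASCII-heavy text comes out smaller in UTF-8, while text made mostly of CJK characters comes out smaller in UTF-16, which is consistent with the byte counts wc reports above.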

James Edward G. II

Zephyr_P · September 30, 2007, 1:36am

On Sep 29, 2007, at 2:29 PM, James Edward G. II wrote:

Firefly:~/Desktop$ wc japanese_prose_in_utf16.txt
16 45 3968 japanese_prose_in_utf16.txt

James Edward G. II

Interesting that you would generate more lines, fewer words, and
fewer bytes (probably explained by fewer words…).
wc defines words as whitespace-delimited, which is extremely interesting
considering that Japanese uses no whitespace except in page layout.
Grammar does not dictate any whitespace at all. At most, in Japanese
prose you might have one whitespace between sentences, perhaps only
between “paragraphs”.

I don’t know how iconv handles things. man iconv says it uses iswspace
(3) which is in wctype.h but I always hate reading those headers.
I tried using iconv on a file in utf-8 to utf-16 and then back again.
Results are similar, but interestingly, it’s no indication of file
size. Files are the same size.
I then tried the same with some code in C++ and similar results occurred.
It would seem to be a whitespace issue. I didn’t realize this, but it
does look like utf-8 is generating fewer whitespace characters while
generating a bigger file…?
I’m curious what the deal is there.

In theory utf-8 should do better than utf-16 for characters in the
ASCII range…
at least that was my understanding. And assuming code files are
largely ASCII character sets…
hmm…!?

Zephyr_P · September 30, 2007, 4:47am

Hi,

On 9/29/07, John J. [email protected] wrote:

doubtful it will be an issue in most cases.

This might get you started.
FAQ - UTF-8, UTF-16, UTF-32 & BOM

Even Joel Spolsky wrote a brief bit on unicode… mostly trumpeting
how programmers need to know it and how few actually do.
The short version is that UTF-16 is basically wasteful. It uses 2
bytes for lower-level code-points (the stuff also known as ASCII
range) where UTF-8 does not.

As you suggested I read the article:

I didn’t find anything new. It’s just explaining character sets in a
rather non-specific way. ASCII uses 7 bits, so it can store 128
characters, so it can’t store all the characters in the world, so
other character sets are needed (really? I would have never guessed
that). UTF-16 basically stores characters in 2 bytes (that means more
characters in the world). UTF-8 also allows more characters, but it doesn’t
necessarily need 2 bytes: it uses 1, and if the character is beyond
127 then it will use 2 or more bytes. In the original design this could be
extended up to 6 bytes; the current standard stops at 4.
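
The variable-width behaviour summarised above can be checked with a short sketch. The code points are picked purely for illustration, and String#bytesize assumes Ruby 1.8.7 or later.

```ruby
# Code points chosen for illustration: A, é, 日, and one outside the BMP.
codepoints = [0x41, 0xE9, 0x65E5, 0x1F600]

# pack("U") encodes a single code point as UTF-8; bytesize counts its bytes.
sizes = codepoints.map { |cp| [cp].pack("U").bytesize }

puts sizes.inspect # prints [1, 2, 3, 4]
```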

So what exactly am I looking for here?

You really need to spend an afternoon reading about unicode. It
should be required in any computer science program as part of an
encoding course, Americans in particular are often the ones who know
the least about it…

What is there to know about Unicode? There’s a couple of character
sets, use UTF-8, and remember that one character != one byte. Is there
anything else for practical purposes?

I’m sorry if I’m being rude, but I really don’t like when people tell
me to read stuff I already know.

My question is still there:

Let’s say I want to rename a file “fooobar”, and remove the third “o”,
but it’s UTF-16, and Ruby only supports UTF-8, so I remove the “o” and
of course there will still be a 0x00 in there. That’s if the string is
recognized at all.

Why is there no issue with UTF-16 if only UTF-8 is supported?

I don’t mind reading some more if I can actually find the answer.

Best regards.
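
For the “fooobar” renaming case raised above, one possible approach is to transcode before and after editing. This is a sketch under assumptions: it needs a Ruby with String#encode (1.9 or later; on 1.8 the iconv library would play the same role), the name is held in a plain string rather than read from disk, and UTF-16LE is an assumed byte order.

```ruby
# Pretend this came from somewhere that hands us UTF-16LE bytes.
name_utf16 = "fooobar".encode("UTF-16LE")

# Transcode to UTF-8 first, so string methods see characters, not byte pairs.
name_utf8 = name_utf16.encode("UTF-8")

# Remove the third "o".
edited = name_utf8.sub("fooo", "foo")

# Transcode back to UTF-16LE before handing the name back.
edited_utf16 = edited.encode("UTF-16LE")

puts edited
puts edited_utf16.bytesize
```

Because the edit happens on the UTF-8 copy, no stray 0x00 byte is left behind: the result is six characters at two bytes each.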

Zephyr_P · September 30, 2007, 1:51am

On Sep 29, 2007, at 2:29 PM, James Edward G. II wrote:

Firefly:~/Desktop$ wc japanese_prose_in_utf16.txt
16 45 3968 japanese_prose_in_utf16.txt

James Edward G. II

Scratch that! I must’ve gone cross-eyed!
My C++ code was indeed a smaller file in utf-8 than in utf-16, as I
expected!
Interestingly, *nixes apparently use utf-32 internally regardless of
the source encoding… very interesting