Which encoding causes fewest problems in Ruby 1.8.2?

hozer · June 11, 2006, 3:12am

I posted a similar question in the rails group but this is more specific
to ruby 1.8.2.

I read that ruby has problems with multibyte charsets. And I read that
there might be some problems with ISO-8859-15 related to REXML. And I
read that regex might have problems with ISO-8859-1.

Given the above problems (or rumors), which encoding is recommended for
use with ruby 1.8.2?

UTF-8
ISO-8859-1
ISO-8859-15

I’m certain both UTF-8 and ISO-8859-15 will support all the characters
I’ll ever use, and ISO-8859-1 only lacks a couple of characters I might
use on very rare occasions, so I’m just looking for the charset that will
cause the fewest problems with Ruby.

Thanks in advance for any suggestions.

hozer · June 11, 2006, 12:58pm

On 6/11/06, Jim S. [email protected] wrote:

> I posted a similar question in the rails group but this is more specific
> to ruby 1.8.2.
>
> I read that ruby has problems with multibyte charsets. And I read that
> there might be some problems with ISO-8859-15 related to REXML. And I
> read that regex might have problems with ISO-8859-1.
>
> Given the above problems (or rumors), which encoding is recommended for
> use with ruby 1.8.2?

None. They all cause problems. With utf-8 most string functions won’t
work correctly (probably including regexps). There are special
extensions to work around this to some extent.

ISO-8859-1 and ISO-8859-15 should be pretty much the same. They are
simple 8-bit encodings, so the string functions that expect 1-byte
characters work. They won’t allow you to use slightly more exotic
characters (like Greek letters for maths, …).

hozer · June 12, 2006, 5:17am

On 6/11/06, Yukihiro M. [email protected] wrote:

> |ISO-8859-15
>
> String and Regexp handle all of them for most of the cases. But
> upper/lower case handling for non-ASCII alphabets is not supported.
> Use -Ku for UTF-8 and -Kn for ISO-8859-*.

Length and indexing do not work very well with utf-8.

~ $ irb -Ku
irb(main):001:0> $KCODE
=> "UTF8"
irb(main):002:0> a='α-ω'
=> "α-ω"
irb(main):003:0> r=/[β-ω]/
=> /[β-ω]/
irb(main):004:0> a.length
=> 5
irb(main):005:0> a[0..0]
=> "\316"
irb(main):006:0> a[0..1]
=> "α"

Fortunately, the regexps work.

irb(main):007:0> a =~ r
=> 3

So you could use a.scan /./ to calculate length or index characters in a
string.

irb(main):008:0> a.scan /./
=> ["α", "-", "ω"]
irb(main):009:0> (a.scan /./).length
=> 3
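
As a rough sketch of the scan-based workaround above, the helpers below wrap it in small methods. The names utf8_chars, utf8_length and utf8_char_at are made up for illustration and are not part of Ruby or of this thread. The explicit /u flag is an assumption standing in for running with -Ku, and /m is added so that "." also matches newlines. The sample string is built from code points and assumes the string in the session is alpha, hyphen, omega.

```ruby
# Hypothetical helpers (not from the thread): treat a UTF-8 string as a
# sequence of characters by going through Regexp instead of relying on
# byte-oriented String#length and String#[].

# Split a UTF-8 string into an array of one-character strings.
def utf8_chars(str)
  str.scan(/./mu)
end

# Number of characters (not bytes) in a UTF-8 string.
def utf8_length(str)
  utf8_chars(str).length
end

# Character at the given zero-based index, or nil if out of range.
def utf8_char_at(str, index)
  utf8_chars(str)[index]
end

# Assumed sample: alpha, hyphen, omega, built from code points so the
# source file itself stays plain ASCII.
s = [0x3B1, 0x2D, 0x3C9].pack('U*')

puts utf8_length(s)              # prints 3
puts s.unpack('C*').length       # prints 5, the byte count
puts utf8_char_at(s, 2)
```

Each call rescans the whole string, so this is fine for short strings but gets slow when indexing repeatedly into a long one; in that case it is probably better to call utf8_chars once and keep the array.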

Michal

hozer · June 11, 2006, 2:14pm

Hi,

In message “Re: Which encoding causes fewest problems in Ruby 1.8.2?”
on Sun, 11 Jun 2006 10:12:45 +0900, Jim S. [email protected]
writes:

|Given the above problems (or rumors), which encoding is recommended for
|use with ruby 1.8.2?
|
|UTF-8
|ISO-8859-1
|ISO-8859-15

String and Regexp handle all of them for most of the cases. But
upper/lower case handling for non-ASCII alphabets is not supported.
Use -Ku for UTF-8 and -Kn for ISO-8859-*.

						matz.

hozer · June 12, 2006, 2:35pm

On 6/12/06, Yukihiro M. [email protected] wrote:

> Hi,
>
> In message “Re: Which encoding causes fewest problems in Ruby 1.8.2?”
> on Mon, 12 Jun 2006 12:14:48 +0900, “Michal S.” [email protected] writes:
>
> |Length and indexing do not work very well with utf-8.
>
> I know. Operations on characters should be based on Regexp.

I am sure you do know.

But it is not what I call ‘String handles them all most of the cases’.

Thanks

Michal

hozer · June 12, 2006, 7:39am

Hi,

In message “Re: Which encoding causes fewest problems in Ruby 1.8.2?”
on Mon, 12 Jun 2006 12:14:48 +0900, “Michal S.”
[email protected] writes:

|Length and indexing do not work very well with utf-8.

I know. Operations on characters should be based on Regexp.

						matz.

hozer · June 12, 2006, 4:29pm

Hi,

In message “Re: Which encoding causes fewest problems in Ruby 1.8.2?”
on Mon, 12 Jun 2006 21:33:10 +0900, “Michal S.”
[email protected] writes:

|> I know. Operations on characters should be based on Regexp.

|But it is not what I call ‘String handles them all most of the cases’.

OK, then I’d say ‘String handles them all for most of the cases if your
operations are based on Regexp’.

						matz.
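
To illustrate what "operations based on Regexp" might look like beyond counting characters, here is a minimal sketch of taking the first n characters of a UTF-8 string and reversing it. utf8_first and utf8_reverse are hypothetical names invented for this example, and the /u flag is again an assumption standing in for -Ku. The sample string is assumed to be alpha, hyphen, omega.

```ruby
# Hypothetical helpers (not from the thread): character-wise slicing and
# reversal done through Regexp, so multibyte characters are never split.

# First n characters (n >= 0) of a UTF-8 string. The anchored pattern
# matches at most n characters from the start of the string.
def utf8_first(str, n)
  str[/\A.{0,#{Integer(n)}}/mu]
end

# Reverse a UTF-8 string character by character rather than byte by byte.
def utf8_reverse(str)
  str.scan(/./mu).reverse.join
end

# Assumed sample: alpha, hyphen, omega, built from code points.
s = [0x3B1, 0x2D, 0x3C9].pack('U*')

puts utf8_first(s, 2)
puts utf8_reverse(s)
```

The same pattern (scan into characters, operate on the array, join) should extend to most other per-character operations, at the cost of an extra pass over the string.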