Forum: Ruby Which encoding causes fewest problems in Ruby 1.8.2?

Jim S. (Guest)
on 2006-06-11 05:12
I posted a similar question in the rails group but this is more specific
to ruby 1.8.2.

I read that ruby has problems with multibyte charsets.  And I read that
there might be some problems with ISO-8859-15 related to REXML.  And I
read that regex might have problems with ISO-8859-1.

Given the above problems (or rumors), which encoding is recommended for
use with ruby 1.8.2?

UTF-8
ISO-8859-1
ISO-8859-15

I'm certain both UTF-8 and ISO-8859-15 will support all the characters
I'll ever use.  And ISO-8859-1 only lacks a couple characters I might
use on very rare occasions so I'm just looking for a charset that will
cause fewest problems with Ruby.

Thanks in advance for any suggestions.
Michal S. (Guest)
on 2006-06-11 14:58
(Received via mailing list)
On 6/11/06, Jim S. <removed_email_address@domain.invalid> wrote:
> I posted a similar question in the rails group but this is more specific
> to ruby 1.8.2.
>
> I read that ruby has problems with multibyte charsets.  And I read that
> there might be some problems with ISO-8859-15 related to REXML.  And I
> read that regex might have problems with ISO-8859-1.
>
> Given the above problems (or rumors), which encoding is recommended for
> use with ruby 1.8.2?

None. They all cause problems. With utf-8 most string functions won't
work correctly (probably including regexps). There are special
extensions to work around this to some extent.

ISO-8859-1 and ISO-8859-15 should be pretty much the same. They are
simple 8-bit so the string functions that expect 1-byte characters
work. They won't allow you to use slightly more exotic characters
(like greek letters for maths, ...).
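To illustrate the byte-width difference described above, here is a minimal sketch that builds the same character (e-acute, U+00E9) in both encodings and counts its bytes. It uses only pack and unpack so that it does not depend on how a given Ruby version defines String#length; the variable names are illustrative only.

```ruby
# The same character, e-acute (U+00E9), built in two encodings.
latin1 = [0xE9].pack('C')   # ISO-8859-1: the single byte 0xE9
utf8   = [0xE9].pack('U')   # UTF-8: the two bytes 0xC3 0xA9

# unpack('C*') lists the raw bytes, so its size is the byte count.
latin1_bytes = latin1.unpack('C*')
utf8_bytes   = utf8.unpack('C*')

puts latin1_bytes.length   # 1
puts utf8_bytes.length     # 2
```

Byte-oriented string functions see one unit per character in the first case and two in the second, which is why the 8-bit charsets are the less surprising choice for them.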
Yukihiro M. (Guest)
on 2006-06-11 16:14
(Received via mailing list)
Hi,

In message "Re: Which encoding causes fewest problems in Ruby 1.8.2?"
    on Sun, 11 Jun 2006 10:12:45 +0900, Jim S.
<removed_email_address@domain.invalid> writes:

|Given the above problems (or rumors), which encoding is recommended for
|use with ruby 1.8.2?
|
|UTF-8
|ISO-8859-1
|ISO-8859-15

String and Regexp handle all of them in most cases.  But
upper/lower case handling for non-ASCII alphabets is not supported.
Use -Ku for UTF-8 and -Kn for ISO-8859-*.

							matz.
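Since case conversion for non-ASCII letters is not built in, one possible workaround is a small lookup table applied character by character through a UTF-8 regexp. This is only a sketch: UPCASE_TABLE and upcase_utf8 are hypothetical names, and the table covers just two letters as an example.

```ruby
# Hypothetical lookup table; extend it with the letters your data uses.
# Code points: e-acute U+00E9 -> U+00C9, u-umlaut U+00FC -> U+00DC.
UPCASE_TABLE = {
  [0xE9].pack('U') => [0xC9].pack('U'),
  [0xFC].pack('U') => [0xDC].pack('U')
}

# Hypothetical helper: upcase a UTF-8 string one character at a time.
# /./mu matches one whole UTF-8 character (newlines included); characters
# missing from the table fall back to the built-in upcase.
def upcase_utf8(str)
  str.gsub(/./mu) { |ch| UPCASE_TABLE[ch] || ch.upcase }
end

cafe = [0x63, 0x61, 0x66, 0xE9].pack('U*')   # "caf" followed by e-acute
puts upcase_utf8(cafe)
```

For the ISO-8859-* charsets the same idea would need a byte-based table instead, since each character there is a single byte.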
Michal S. (Guest)
on 2006-06-12 07:17
(Received via mailing list)
On 6/11/06, Yukihiro M. <removed_email_address@domain.invalid> wrote:
> |ISO-8859-15
>
> String and Regexp handle all of them in most cases.  But
> upper/lower case handling for non-ASCII alphabets is not supported.
> Use -Ku for UTF-8 and -Kn for ISO-8859-*.
>

Length and indexing do not work very well with utf-8.

~ $ irb -Ku
irb(main):001:0> $KCODE
=> "UTF8"
irb(main):002:0>  a='α-ω'
=> "α-ω"
irb(main):003:0> r=/[β-ω]/
=> /[β-ω]/
irb(main):004:0> a.length
=> 5
irb(main):005:0> a[0..0]
=> "\316"
irb(main):006:0> a[0..1]
=> "α"

Fortunately, the regexps work.

irb(main):007:0> a =~ r
=> 3

So you could use a.scan /./ to calculate length or index characters in a
string.

irb(main):008:0> a.scan /./
=> ["α", "-", "ω"]
irb(main):009:0> (a.scan /./).length
=> 3

Michal
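The scan-based approach above can be sketched as a short script. It builds a three-character Greek string (alpha, hyphen, omega) from code points so the source file's encoding does not matter, and it passes the u flag on the regexp explicitly rather than relying on -Ku. The variable names are illustrative only.

```ruby
# alpha, hyphen, omega: three characters, five bytes in UTF-8.
a = [0x3B1, 0x2D, 0x3C9].pack('U*')

bytes = a.unpack('C*')   # raw bytes: what length and [] count in 1.8
chars = a.scan(/./mu)    # one array element per UTF-8 character

puts bytes.length        # 5
puts chars.length        # 3
puts chars[2]            # the third character (omega)
puts chars[0, 2].join    # the first two characters
```

Indexing into the array from scan gives whole characters, at the cost of building a temporary array for every lookup.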
Yukihiro M. (Guest)
on 2006-06-12 09:39
(Received via mailing list)
Hi,

In message "Re: Which encoding causes fewest problems in Ruby 1.8.2?"
    on Mon, 12 Jun 2006 12:14:48 +0900, "Michal S."
<removed_email_address@domain.invalid> writes:

|Length and indexing do not work very well with utf-8.

I know.  Operations on characters should be based on Regexp.

							matz.
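As a rough sketch of what Regexp-based character operations might look like, the two helpers below take the first n characters of a UTF-8 string and report a match position in characters rather than bytes. Both first_chars and char_index are hypothetical names, not part of Ruby.

```ruby
# Hypothetical helper: first n characters of a UTF-8 string.
# \A anchors at the start and .{0,n} takes up to n characters; the u flag
# makes "." match a whole UTF-8 character and m lets it match newlines.
def first_chars(str, n)
  str[/\A.{0,#{n}}/mu]
end

# Hypothetical helper: character offset of the first match, or nil.
# The text before the match is counted character by character with scan.
def char_index(str, pattern)
  m = pattern.match(str)
  m && m.pre_match.scan(/./mu).length
end

a = [0x3B1, 0x2D, 0x3C9].pack('U*')   # alpha, hyphen, omega
puts first_chars(a, 2)                # alpha and hyphen
puts char_index(a, /-/u)              # 1
```

Here the hyphen sits at byte offset 2 but character offset 1, which is the distinction these helpers are meant to hide.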
Michal S. (Guest)
on 2006-06-12 16:35
(Received via mailing list)
On 6/12/06, Yukihiro M. <removed_email_address@domain.invalid> wrote:
> Hi,
>
> In message "Re: Which encoding causes fewest problems in Ruby 1.8.2?"
>     on Mon, 12 Jun 2006 12:14:48 +0900, "Michal S."
> <removed_email_address@domain.invalid> writes:
>
> |Length and indexing do not work very well with utf-8.
>
> I know.  Operations on characters should be based on Regexp.
>

I am sure you do know :)

But it is not what I call 'String handles them all most of the cases'.

Thanks

Michal
Yukihiro M. (Guest)
on 2006-06-12 18:29
(Received via mailing list)
Hi,

In message "Re: Which encoding causes fewest problems in Ruby 1.8.2?"
    on Mon, 12 Jun 2006 21:33:10 +0900, "Michal S."
<removed_email_address@domain.invalid> writes:

|> I know.  Operations on characters should be based on Regexp.

|But it is not what I call 'String handles them all most of the cases'.

OK, then I'd say 'String handles them all most of the cases if your
operations are based on Regexp'. ;-)

							matz.