Forum: Ruby Problem with String encoding when modifying it in C method

Iñaki Baz Castillo (Guest)
on 2009-04-03 18:41
(Received via mailing list)
Hi, I've added a method "multi_capitalize" to the String class. The
method is implemented in C and basically modifies the string:

  "record-roUTE".multi_capitalize => "Record-Route"

The problem is that after the method execution, the new String has
ASCII-8BIT encoding, while the original string had UTF-8 (using Ruby
1.9.1).

--------------------------------------------------------------------------------
irb> hname = "record-rouTE-€"
"record-rouTE-€"

irb> hname.encoding
#<Encoding:UTF-8>

irb> hname2 = hname.multi_capitalize
"Record-Route-\xE2\x82\xAC"                <------- !!!

irb> hname2.encoding
#<Encoding:ASCII-8BIT>                     <------- !!!

irb> hname2.force_encoding("utf-8")
"Record-Route-€"

irb> hname2.encoding
#<Encoding:UTF-8>
--------------------------------------------------------------------------------
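
The session above suggests that the bytes survive intact and only the encoding tag is lost. A minimal Ruby sketch of that diagnosis follows; `as_binary` is a hypothetical stand-in for whatever the C method returns (its source is not shown in this thread), and `as_utf8` mirrors the `force_encoding` call from the session:

```ruby
# encoding: utf-8

# Hypothetical stand-in for the C method's return value:
# a copy with the same bytes, tagged ASCII-8BIT (binary).
def as_binary(str)
  str.dup.force_encoding("ASCII-8BIT")
end

# Relabel the bytes as UTF-8. force_encoding does not transcode,
# it only changes the encoding tag attached to the string.
def as_utf8(str)
  str.dup.force_encoding("UTF-8")
end

original = "Record-Route-€"
binary   = as_binary(original)

puts binary.encoding == Encoding::ASCII_8BIT     # the tag changed...
puts binary.bytes.to_a == original.bytes.to_a    # ...but the bytes did not
puts as_utf8(binary) == original                 # retagging restores the string
```

If this matches what the C method does, no conversion is needed, only the correct tag on the returned string.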

What should I add to my C method to maintain the UTF-8 encoding
after the changes in the string?
Could I invoke the C "force_encoding()" function from the C code
before returning the modified string? If so, how do I invoke it?

Thanks a lot.
Andre Nathan (Guest)
on 2009-04-03 20:18
(Received via mailing list)
On Sat, 2009-04-04 at 01:39 +0900, Iñaki Baz Castillo wrote:
> Could I invoke the C "force_encoding()" function from the C code
> before returning the modified string? How to invoke it?

You can call it as (untested):

  rb_funcall(str, rb_intern("force_encoding"), 1, rb_str_new2("utf-8"));

I'm not sure how to make your multi_capitalize method do the right
thing, but reading the source of rb_str_capitalize_bang in
string.c may help.

Best,
Andre
Iñaki Baz Castillo (Guest)
on 2009-04-03 20:34
(Received via mailing list)
On Friday, 03 April 2009, Andre Nathan wrote:
> string.c helps.
Thanks a lot, I will check it.
Iñaki Baz Castillo (Guest)
on 2009-04-03 21:01
(Received via mailing list)
On Friday, 03 April 2009, Iñaki Baz Castillo wrote:
> > thing, but maybe reading the source of rb_str_capitalize_bang in
> > string.c helps.
>
> Thanks a lot, I will check it.

Yes, rb_str_capitalize_bang handles a lot of stuff related to encoding:

    c = rb_enc_codepoint(s, send, enc);
    if (rb_enc_islower(c, enc)) {
        rb_enc_mbcput(rb_enc_toupper(c, enc), s, enc);
        modify = 1;
    }
    s += rb_enc_codelen(c, enc);

so this is the way :)
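
The snippet above walks the string one code point at a time using the string's own encoding, which is why multibyte characters survive. A pure-Ruby reference can be handy for checking the C method's output. The sketch below is hypothetical: it assumes that multi_capitalize capitalizes each hyphen-separated token, which is inferred only from the single example in this thread.

```ruby
# encoding: utf-8

# Hypothetical pure-Ruby reference for multi_capitalize: capitalize each
# hyphen-separated token. String methods operate on characters, not bytes,
# and the result keeps the encoding of its parts, so "€" stays intact.
# The -1 limit keeps trailing empty tokens, so "a-" stays "A-".
def multi_capitalize(str)
  str.split("-", -1).map { |token| token.capitalize }.join("-")
end

puts multi_capitalize("record-roUTE")
puts multi_capitalize("record-rouTE-€").encoding == Encoding::UTF_8
```

Note that on Ruby 1.9 `capitalize` only changes ASCII letters, whereas newer versions also handle non-ASCII letters, so results for accented input can differ between versions.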

Thanks a lot.
Takehiro Kubo (kubo)
on 2009-04-04 12:34
(Received via mailing list)
Hi,

On Sat, Apr 4, 2009 at 1:39 AM, Iñaki Baz Castillo <ibc@aliax.net> wrote:
> Hi, I've added a method "multi_capitalize" to String class. This
> method is done in C and basically modifies the string:
>
>  "record-roUTE".multi_capitalize => "Record-Route"
>
> The problem is that after the method execution, the new String has
> ASCII-8BIT encoding, while the original string had UTF-8 (using Ruby
> 1.9.1).

    /* needs #include <ruby/encoding.h> */
    rb_encoding *enc = rb_enc_get(original_string);

    /* create a new string with the same encoding as the original string */
    return rb_enc_str_new(char_pointer, length, enc);

rb_str_new() makes an ASCII-8BIT string.
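
The same idea can be tried from Ruby before touching the C code. This is only a Ruby-level analogue, not the C API: `Array#pack("C*")` plays the role of rb_str_new() (it yields an ASCII-8BIT string), and `rebuild_like` is a hypothetical helper that tags the result with the source string's encoding instead of hardcoding UTF-8.

```ruby
# encoding: utf-8

# Rebuild a string from raw bytes and give it the source string's encoding.
# pack("C*") returns an ASCII-8BIT string, so the tag must be set explicitly.
def rebuild_like(source, bytes)
  bytes.pack("C*").force_encoding(source.encoding)
end

source = "record-rouTE-€"
bytes  = "Record-Route-€".bytes.to_a   # bytes the C code would have produced

result = rebuild_like(source, bytes)
puts result
puts result.encoding == source.encoding
```

Taking the encoding from the source string means the method also behaves sensibly for input that is not UTF-8, for example ISO-8859-1.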
Iñaki Baz Castillo (Guest)
on 2009-04-04 12:39
(Received via mailing list)
On Saturday, 04 April 2009, KUBO Takehiro wrote:
> > 1.9.1).
>
>     rb_encoding *enc = rb_enc_get(original_string);
>
>     /* create a new string with the same encoding as the original string */
>     return rb_enc_str_new(char_pointer, length, enc);
>
> rb_str_new() makes an ASCII-8BIT string.

Thanks.