Problem with String encoding when modifying it in C method


#1

Hi, I’ve added a method “multi_capitalize” to the String class. The
method is implemented in C and basically modifies the string:

"record-roUTE".multi_capitalize => "Record-Route"

The problem is that after the method execution, the new String has
ASCII-8BIT encoding, while the original string had UTF-8 (using Ruby
1.9.1).


irb> hname = "record-rouTE-€"
=> "record-rouTE-€"

irb> hname.encoding
=> #<Encoding:UTF-8>

irb> hname2 = hname.multi_capitalize
=> "Record-Route-\xE2\x82\xAC" <------- !!!

irb> hname2.encoding
=> #<Encoding:ASCII-8BIT> <------- !!!

irb> hname2.force_encoding("utf-8")
=> "Record-Route-€"

irb> hname2.encoding
=> #<Encoding:UTF-8>
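
The session above suggests that the bytes themselves survive the call and only the encoding tag is lost, which is why force_encoding is enough to repair the result. A minimal standalone sketch of that idea, assuming the method only changes the case of ASCII letters: the helper name multi_capitalize_bytes is hypothetical (it is not the code from this thread), and it works on a plain byte buffer without any Ruby API.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the poster's actual code: capitalizes each
 * '-'-separated token in place, touching ASCII bytes only.
 * Assumes an ASCII-compatible encoding such as UTF-8. */
static void multi_capitalize_bytes(char *buf, size_t len)
{
    int at_token_start = 1;
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)buf[i];

        if (c < 0x80) {
            /* ASCII byte: upper-case at a token start, lower-case elsewhere. */
            buf[i] = (char)(at_token_start ? toupper(c) : tolower(c));
        }
        /* Bytes >= 0x80 are part of a multibyte UTF-8 sequence and are
         * left untouched, so the buffer stays valid UTF-8. */
        at_token_start = (c == '-');
    }
}
```

Called on a buffer holding the UTF-8 bytes of "record-rouTE-€", this changes only the ASCII letters and leaves the three bytes of "€" (E2 82 AC) as they were. The content is therefore still valid UTF-8; what goes missing in the Ruby method is only the encoding associated with the new String object.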

What should I add to my C method to maintain the UTF-8 encoding
after the changes in the string?
Could I invoke the C “force_encoding()” function from the C code
before returning the modified string? How do I invoke it?

Thanks a lot.


#2

On Sat, 2009-04-04 at 01:39 +0900, Iñaki Baz C. wrote:

> Could I invoke the C “force_encoding()” function from the C code
> before returning the modified string? How to invoke it?

You can call it as (untested):

rb_funcall(str, rb_intern("force_encoding"), 1, rb_str_new2("utf-8"));

I’m not sure how to make your multi_capitalize method do the right
thing, but maybe reading the source of rb_str_capitalize_bang in
string.c helps.

Best,
Andre


#3

On Friday, 3 April 2009, Andre N. wrote:

> string.c helps.

Thanks a lot, I will check it.


#4

On Friday, 3 April 2009, Iñaki Baz C. wrote:

> > thing, but maybe reading the source of rb_str_capitalize_bang in
> > string.c helps.
>
> Thanks a lot, I will check it.

Yes, rb_str_capitalize_bang handles a lot of encoding-related work:

c = rb_enc_codepoint(s, send, enc);
if (rb_enc_islower(c, enc)) {
    rb_enc_mbcput(rb_enc_toupper(c, enc), s, enc);
    modify = 1;
}
s += rb_enc_codelen(c, enc);

so this is the way :)

Thanks a lot.


#5

On Saturday, 4 April 2009, KUBO Takehiro wrote:

> > 1.9.1).
>
> rb_encoding *enc = rb_enc_get(original_string);
>
> /* create a new string with the same encoding as the original string */
> return rb_enc_str_new(char_pointer, length, enc);
>
> rb_str_new() makes an ASCII-8BIT string.

Thanks.


#6

Hi,

On Sat, Apr 4, 2009 at 1:39 AM, Iñaki Baz C. wrote:

> Hi, I’ve added a method “multi_capitalize” to String class. This
> method is done in C and basically modifies the string:
>
> "record-roUTE".multi_capitalize => "Record-Route"
>
> The problem is that after the method execution, the new String has
> ASCII-8BIT encoding, while the original string had UTF-8 (using Ruby
> 1.9.1).

rb_encoding *enc = rb_enc_get(original_string);

/* create a new string with the same encoding as the original string */
return rb_enc_str_new(char_pointer, length, enc);

rb_str_new() makes an ASCII-8BIT string.
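
Putting the suggestion above into a whole method might look like the following. This is only a sketch, not the original poster's code: the function name string_multi_capitalize and the extension name in Init_multi_capitalize are hypothetical, and the loop assumes an ASCII-compatible encoding such as UTF-8 and changes ASCII letters only. It has to be built as a Ruby extension (for example with mkmf); it does not compile as a standalone program.

```c
#include <ruby.h>
#include <ruby/encoding.h>
#include <ctype.h>

/* Hypothetical implementation: returns a capitalized copy that carries
 * the same encoding as the receiver. */
static VALUE
string_multi_capitalize(VALUE self)
{
    rb_encoding *enc = rb_enc_get(self);   /* encoding of the original */
    long len = RSTRING_LEN(self);
    /* Copy the bytes into a new String tagged with that encoding;
     * rb_str_new() would tag the copy as ASCII-8BIT instead. */
    VALUE result = rb_enc_str_new(RSTRING_PTR(self), len, enc);
    char *p = RSTRING_PTR(result);
    int at_token_start = 1;
    long i;

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)p[i];

        if (c < 0x80) {
            /* ASCII byte: upper-case at a token start, lower-case elsewhere. */
            p[i] = (char)(at_token_start ? toupper(c) : tolower(c));
        }
        /* Bytes >= 0x80 (multibyte sequences) are left untouched. */
        at_token_start = (c == '-');
    }
    return result;
}

/* The Init_ name must match the extension's file name (assumed here). */
void
Init_multi_capitalize(void)
{
    rb_define_method(rb_cString, "multi_capitalize",
                     string_multi_capitalize, 0);
}
```

With the encoding carried over at creation time, no force_encoding call should be needed on the Ruby side. For case changes on non-ASCII characters, the rb_enc_codepoint / rb_enc_mbcput approach from rb_str_capitalize_bang quoted earlier in the thread is the one to follow.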