Embedded Ruby: UTF-8 strings not working right

alcina · October 18, 2013, 2:28am

Hi!

I have the ruby vm embedded via the public ruby_* functions and execute
scripts via rb_eval_string_protect. The version I am using is ruby
2.1.0dev (2013-09-27 trunk 43059), and I’m working on 64-bit Linux
(Fedora 17). So far, it has been working okay for the most part. However,
I recently discovered that UTF-8 strings apparently aren’t recognized
properly by my scripts.

As an example, if I run the following code via the normal ruby
executable:

p "éàß"

I get:

> "éàß"

back, as expected. However, doing the same thing via my embedded C++
app:

#include "ruby.h"

int main(int argc, char**argv)
{
    ruby_setup();
    const char rbScript[] = "p \"éàß\"";
    rb_eval_string_protect(rbScript, 0);
    return 0;
}

I get:

> "\xC3\xA9\xC3\xA0\xC3\x9F"

I feel like I’m just missing one tiny init function somewhere that sets
up proper UTF-8 support, but even after going through the main()
function of the ruby executable and all the functions it calls, I can’t
seem to find this magic init procedure.

This is especially problematic for me because if I try iterating over
the individual chars of a string, I get each single byte of the UTF-8
sequence instead of the full multibyte character.
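
For illustration, the symptom can be reproduced in plain Ruby. This is only a sketch and rests on an assumption: that the embedded literal has the right bytes but is tagged as binary (ASCII-8BIT), which is what the \x escapes in the output above suggest. The helper name retag_utf8 is made up for this example.

```ruby
# encoding: utf-8

# Hypothetical helper: returns a copy of str with the same bytes,
# but tagged as UTF-8 instead of binary.
def retag_utf8(str)
  str.dup.force_encoding(Encoding::UTF_8)
end

# Assumption: the embedded VM ends up with a string like this one,
# i.e. valid UTF-8 bytes carrying the binary (ASCII-8BIT) encoding.
raw = "éàß".b

p raw                             # inspected as \x escapes, byte by byte
p raw.each_char.to_a.length       # 6: each_char yields single bytes
p retag_utf8(raw).each_char.to_a  # three elements, one per character
```

The bytes are identical in both cases; only the encoding tag on the string decides whether each_char walks bytes or characters.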

Can someone help me out? Thanks in advance!

Jonas

jkt · October 20, 2013, 12:59am

I have finally managed to work around this strange behavior in Ruby. For
anyone that might come across the same problem, here’s what you have to
do:

Make sure to call

rb_enc_set_default_external(rb_enc_from_encoding(rb_utf8_encoding()));

after ‘ruby_setup()’.

Prepend “#encoding:utf-8\n” to all scripts you will be eval-ing.

These two steps ensure all string literals defined in the scripts will
be UTF-8 by default. However, if you additionally rely on the Marshal
module, it will still return binary (ASCII-8BIT) strings on load, unless
they were serialized with an explicit encoding instance variable. To
change that, you have to patch the marshal.c file yourself:

diff --git a/marshal.c b/marshal.c
index 4cba05d..dfce6ee 100644
--- a/marshal.c
+++ b/marshal.c
@@ -1312,7 +1312,9 @@ r_unique(struct load_arg *arg)
 static VALUE
 r_string(struct load_arg *arg)
 {
-    return r_bytes(arg);
+    VALUE str = r_bytes(arg);
+    rb_enc_associate(str, rb_utf8_encoding());
+    return str;
 }
 
 static VALUE
Thanks,
Jonas