Ruby 'C' Extensions and Unicode

Hi,

I am working on enhancing the IBM_DB Ruby driver (the database driver for
DB2 and Informix) by adding Unicode support.

I have had no luck googling for documents or links that describe the
Ruby C extension APIs that can be used to tap the Unicode support of
Ruby 1.9 to:

  1. Convert a Ruby String (Unicode) object received in the extension API
     into a wchar_t buffer (like rb_str2cstr in ruby-1.8)

  2. Convert a wchar_t* to a Ruby String object (like rb_str_new2 in ruby-1.8).

  3. Convert String objects between different formats (UCS-2, UCS-4).

Could somebody shed some light on the above queries?

Along with the above, could you also tell me whether Ruby is by default
compiled to use UCS-2, UCS-4, or some other string format, and how I can
determine programmatically, inside the extension, which format is in use.

Thanks

Praveen

On Tue, Feb 9, 2010 at 9:15 PM, Praveen [email protected]
wrote:

Hi,

I am working on enhancing the IBM_DB Ruby driver (the database driver for
DB2 and Informix) by adding Unicode support.

I have had no luck googling for documents or links that describe the
Ruby C extension APIs that can be used to tap the Unicode support of
Ruby 1.9 to:

Look at ruby-1.9.1-pxxx/include/ruby/encoding.h and
ruby-1.9.1-pxxx/string.c.
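
For reference, a few of the declarations in those files that are relevant
here (signatures quoted from memory, so please double-check them against
your copy of the headers):

rb_encoding *rb_enc_find(const char *name);
rb_encoding *rb_enc_get(VALUE obj);
VALUE rb_str_export_to_enc(VALUE str, rb_encoding *enc);
VALUE rb_str_conv_enc(VALUE str, rb_encoding *from, rb_encoding *to);
VALUE rb_external_str_new_with_enc(const char *ptr, long len, rb_encoding *enc);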

  1. Convert a Ruby String (Unicode) object received in the extension API
     into a wchar_t buffer (like rb_str2cstr in ruby-1.8)

There is no generic way, because the encoding of wchar_t is
platform-dependent. As far as I know it is UCS-2 on Windows, UCS-4 on
Linux, and locale-dependent on Solaris.

If it is UCS-2,
rb_encoding *ucs2_enc = rb_enc_find("UCS-2");
VALUE ucs2_string = rb_str_export_to_enc(string, ucs2_enc);
const char *ucs2_cstr = StringValueCStr(ucs2_string);

  2. Convert a wchar_t* to a Ruby String object (like rb_str_new2 in ruby-1.8).

If the wchar_t encoding is UCS-2,
rb_encoding *ucs2_enc = rb_enc_find("UCS-2");
VALUE ucs2_string = rb_external_str_new_with_enc(cstr, len, ucs2_enc);

  3. Convert String objects between different formats (UCS-2, UCS-4).

rb_encoding *ucs2_enc = rb_enc_find("UCS-2");
rb_encoding *ucs4_enc = rb_enc_find("UCS-4");
VALUE ucs4_string = rb_str_conv_enc(ucs2_string, ucs2_enc, ucs4_enc);

On Tue, Feb 9, 2010 at 10:06 PM, KUBO Takehiro [email protected] wrote:

rb_encoding *ucs2_enc = rb_enc_find("UCS-2");
rb_encoding *ucs4_enc = rb_enc_find("UCS-4");

Sorry, UCS-2 and UCS-4 are not defined in ruby 1.9.1.
Use UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE instead.
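
Putting the three steps together with encoding names that do exist in
1.9.1, a minimal sketch (variable names are only illustrative) might look
like:

/* 1. Ruby String -> UTF-16LE byte buffer */
rb_encoding *utf16_enc = rb_enc_find("UTF-16LE");
VALUE utf16_str = rb_str_export_to_enc(string, utf16_enc);
const char *utf16_ptr = RSTRING_PTR(utf16_str); /* may contain NUL bytes */
long utf16_len = RSTRING_LEN(utf16_str);        /* length in bytes */

/* 2. UTF-16LE byte buffer -> Ruby String */
VALUE back = rb_external_str_new_with_enc(utf16_ptr, utf16_len, utf16_enc);

/* 3. UTF-16LE String -> UTF-32LE String */
rb_encoding *utf32_enc = rb_enc_find("UTF-32LE");
VALUE utf32_str = rb_str_conv_enc(back, utf16_enc, utf32_enc);

Note that StringValueCStr() is not suitable for UTF-16 data, because the
bytes usually contain embedded NULs; use RSTRING_PTR()/RSTRING_LEN()
instead.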

Hi Kubo,

Thanks for the information.

I will give it a try and get back to you on how I progress [with
doubts/success].

Thanks

Praveen

Forgot to mention. I am on ruby version

ruby 1.9.1p0 (2009-01-30 revision 21907) [x86_64-linux].

Let me know if you require any info

Thanks

Praveen

Hi Kubo,

I tried proceeding with the above-mentioned APIs. However, I am seeing
some interesting behavior, and I am not sure I am using the right constructs.

Below is the Ruby script I am using:

======================================
#encoding: utf-8

puts "Results in C extension"
puts "----------------------"
require 'ibm_db'
str = "insert into woods (name) values ('GÜHRING文')"

conn = IBM_DB.connect 'DRIVER={IBM DB2 ODBC DRIVER};DATABASE=devdb;HOSTNAME=9.124.159.74;PORT=50000;PROTOCOL=TCPIP;UID=db2admin;PWD=db2admin;', '', ''
stmt = IBM_DB.exec conn, str
IBM_DB.close conn

print "----------------------\n\n"

puts "Results in Ruby script"
puts "----------------------"

puts "str.length is :#{str.length}"
puts "str.bytesize: #{str.bytesize}"
puts "Forcing encoding"
str1 = str.force_encoding("UTF-16LE")
puts "str.length is :#{str1.length}"
puts "str.bytesize: #{str1.bytesize}"

In the script above, IBM_DB is the C extension module. However, the
database call has nothing to do with the Unicode API usage; I have just
reused the module to try out the Unicode support.

The snippet in the C extension that uses the Unicode functions is shown
below:

======================================
VALUE ibm_db_exec(int argc, VALUE *argv, VALUE self){
  VALUE connection, stmt, options;
  VALUE stmt_ucs2;

  rb_scan_args(argc, argv, "21", &connection, &stmt, &options);
  if (!NIL_P(stmt)) {
    rb_encoding *enc_received;
    rb_encoding *ucs2_enc = rb_enc_find("UTF-16LE");
    rb_encoding *ucs4_enc = rb_enc_find("UTF-32LE");

    enc_received = rb_enc_from_index(ENCODING_GET(stmt));

    printf("\nString in received format: %s\n", RSTRING_PTR(stmt));
    printf("\nrb_str_length is: %d\n", rb_str_length(stmt));
    printf("\nRSTRING_LEN is: %d\n", RSTRING_LEN(stmt));
    printf("\nEncoding format received: %s\n", enc_received->name);

    stmt_ucs2 = rb_str_export_to_enc(stmt, ucs2_enc);

    printf("\nString in utf16 format: %s\n", RSTRING_PTR(stmt_ucs2));
    printf("\nrb_str_length is: %d\n", rb_str_length(stmt_ucs2));
    printf("\nRSTRING_LEN is: %d\n", RSTRING_LEN(stmt_ucs2));
    printf("\nEncoding after conversion: %s\n", ucs2_enc->name);
  }
  return Qnil;
}

======================================

Running the above Ruby script produces the following output:

======================================

Results in C extension

String in received format: insert into woods (name) values
('GÜHRING文')

rb_str_length is: 89

RSTRING_LEN is: 47

Encoding format received: UTF-8

String in utf16 format: i     # expected, since printf stops at the first NUL byte of UTF-16LE

rb_str_length is: 89

RSTRING_LEN is: 88

Encoding after conversion: UTF-16LE

Results in Ruby script

str.length is :44
str.bytesize: 47
Forcing encoding
str.length is :24
str.bytesize: 47

======================================

I am not sure why there is a difference between the length of the
original string [44] (UTF-8) and the length after changing the encoding
[24] (to UTF-16LE). Something similar shows up in the C extension output:
the bytesize and the length are almost the same (within +1 or -1), and
the length differs between encoding formats.

Could you tell me what I am doing wrong?

Along with this, in a C extension is there any API I can call to check
whether a given string is in a particular encoding, or should I use
rb_enc_from_index and then read the struct member name to determine it
myself in the extension?

Thanks

Praveen

Hi,

2010/2/16 Praveen [email protected]:

puts "Results in C extension"

 if (!NIL_P(stmt)) {

rb_str_length is: 89

======================================

I am not sure why there is a difference between the length of the
original string [44] (UTF-8) and the length after changing the encoding
[24] (to UTF-16LE). Something similar shows up in the C extension output:
the bytesize and the length are almost the same (within +1 or -1), and
the length differs between encoding formats.

89 is not an integer but a VALUE. A Fixnum n is stored in a VALUE as
2*n + 1, so the VALUE 89 corresponds to the integer 44.

Could you tell me what I am doing wrong?

You should use String#encode instead of String#force_encoding.
force_encoding only relabels the existing bytes with the new encoding
(so the 47 UTF-8 bytes get reinterpreted as UTF-16LE code units, which
is why the length drops to 24), whereas encode actually transcodes the
content, like this:

puts "Converting encoding"
str1 = str.encode("UTF-16LE")
puts "str.length is :#{str1.length}"
puts "str.bytesize: #{str1.bytesize}"
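
At the C level the same distinction applies. Roughly (a sketch, with str
assumed to be the UTF-8 statement string):

/* force_encoding: relabel the same bytes, no transcoding */
rb_enc_associate(str, rb_enc_find("UTF-16LE"));

/* encode: transcode the bytes into UTF-16LE */
VALUE utf16 = rb_str_conv_enc(str, rb_utf8_encoding(), rb_enc_find("UTF-16LE"));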

Along with this, in a C extension is there any API I can call to check
whether a given string is in a particular encoding, or should I use
rb_enc_from_index and then read the struct member name to determine it
myself in the extension?

Using rb_enc_get is simpler than rb_enc_from_index, like this:
enc_received = rb_enc_get(stmt);
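
To test for a particular encoding you do not need to read the struct
member yourself. Encodings are registered once, so comparing the
rb_encoding pointers should be enough (a sketch):

rb_encoding *enc = rb_enc_get(stmt);
if (enc == rb_utf8_encoding()) {
    /* stmt is tagged as UTF-8 */
}
else if (enc == rb_enc_find("UTF-16LE")) {
    /* stmt is tagged as UTF-16LE */
}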

And, rb_str_length returns not an integer but a VALUE, so you should
use NUM2INT, like this:
printf("\nrb_str_length is: %d\n", NUM2INT(rb_str_length(stmt)));
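
RSTRING_LEN, on the other hand, already yields a C long, so there only
the printf format needs adjusting:
printf("\nRSTRING_LEN is: %ld\n", RSTRING_LEN(stmt));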

Regards,

Park H.

I'm not familiar with the enc_ APIs.

But I think the easiest way is to use

rb_funcall(some_str, rb_intern("encode") …
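
Spelled out, that call could look something like this (the target
encoding literal is just an example):

VALUE encoded = rb_funcall(some_str, rb_intern("encode"), 1, rb_str_new2("UTF-16LE"));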

Praveen wrote:

Forgot to mention. I am on ruby version

ruby 1.9.1p0 (2009-01-30 revision 21907) [x86_64-linux].

Let me know if you require any info

Thanks

Praveen

Hi,

I wanted to know whether there is any function in the C extension API
(Ruby 1.9) that can be used to convert a string's encoding to the
encoding specified by the user (in their environment, or by setting
#encoding: at the beginning of the .rb file).

I did find two functions, namely rb_str_export and rb_str_export_locale,
but I am not sure which one will convert strings correctly to the format
the user has set.

Could somebody guide me?

Thanks

Praveen

Thanks All for your help!!

Will Keep posted on how it goes.

Thanks

Praveen

Praveen wrote:

I wanted to know whether there is any function in the C extension API
(Ruby 1.9) that can be used to convert a string's encoding to the
encoding specified by the user (in their environment, or by setting
#encoding: at the beginning of the .rb file).

Those are two different things.

The encoding guessed from the environment is Encoding.default_external
(or Encoding.find("external")). So you can do:

str2 = str1.encode("external")

The encoding specified in the #encoding line is called the source
encoding, and it's the encoding which string literals like "abc"
(usually) get automatically.

Maybe it would be helpful first to review the concepts in ‘pure ruby’,
then map them to C. There are some details at:

http://blog.grayproductions.net/articles/ruby_19s_string (plus see other
articles linked from the table of contents)

Warning: it’s a large and complex topic.

I did find two functions, namely rb_str_export and rb_str_export_locale,
but I am not sure which one will convert strings correctly to the format
the user has set.

I suggest you start by calling the ruby-level functions, like
String#encode. If there is a particular need to call the underlying C
functions directly then you can look at the ruby source code for
String#encode and see what it calls.
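
For what it's worth, my reading of string.c is that rb_str_export
converts to Encoding.default_external and rb_str_export_locale converts
to the locale encoding, so the former is the one that matches "the
encoding guessed from the environment" above. A C-level sketch of that
conversion:

/* convert str to the user's default external encoding */
rb_encoding *ext_enc = rb_default_external_encoding();
VALUE exported = rb_str_export_to_enc(str, ext_enc);
/* or simply: VALUE exported = rb_str_export(str); */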