String encoding problem with ruby 1.9.2p0

I am trying to make my http://booh.org/ app work with ruby 1.9.2p0.

There’s a problem retrieving String data from a Gtk::TreeIter. Ruby
says its encoding is ASCII-8BIT which I believe happens when the
encoding was not marked at String creation, and then if it contains
non ASCII characters it fails (on me, it fails in REXML).

-=-=—=-=—=-=—=-=—=-=–
#! /usr/bin/ruby
# encoding: UTF-8

require 'gtk2'
require 'rexml/document'
include REXML

def check_with_rexml(string)
  xmldoc = Document.new("<root/>")  # placeholder root; the original markup was lost in the archive
  xmldoc << XMLDecl.new(XMLDecl::DEFAULT_VERSION, 'UTF-8')
  xmldoc.root.attributes['key'] = string
end

Gtk.init
treestore = Gtk::TreeStore.new(String)
iter = treestore.append(nil)

puts "\n*** PURE ASCII TEST ***"
pureascii = "pure ascii"
iter[0] = pureascii
puts "my string has encoding " + pureascii.encoding.to_s + " and retrieved from iter has encoding " + iter[0].encoding.to_s
check_with_rexml(iter[0])

puts "\n*** WITH ACCENT TEST ***"
withaccent = "finalisé"
iter[0] = withaccent
puts "my string has encoding " + withaccent.encoding.to_s + " and retrieved from iter has encoding " + iter[0].encoding.to_s
check_with_rexml(iter[0])
-=-=—=-=—=-=—=-=—=-=–

=>

*** PURE ASCII TEST ***
my string has encoding UTF-8 and retrieved from iter has encoding
ASCII-8BIT

*** WITH ACCENT TEST ***
my string has encoding UTF-8 and retrieved from iter has encoding
ASCII-8BIT
/usr/lib/ruby/1.9.1/rexml/text.rb:131:in `=~': incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string) (Encoding::CompatibilityError)
from /usr/lib/ruby/1.9.1/rexml/text.rb:131:in `!~'
from /usr/lib/ruby/1.9.1/rexml/text.rb:131:in `check'
from /usr/lib/ruby/1.9.1/rexml/attribute.rb:153:in `element='
from /usr/lib/ruby/1.9.1/rexml/element.rb:1104:in `[]='
from /tmp/rb19encodingpb.rb:11:in `check_with_rexml'
from /tmp/rb19encodingpb.rb:28:in `<main>'
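
The failure above can be reproduced without GTK+ or REXML. The following is a minimal sketch, assuming only that the binding hands back the original UTF-8 bytes tagged as ASCII-8BIT. The helper match_result and the one-character pattern are illustrative; this is not the pattern REXML actually uses.

```ruby
# Hypothetical helper: classify the outcome of a regexp match attempt.
def match_result(string, regexp)
  string =~ regexp ? :match : :no_match
rescue Encoding::CompatibilityError
  :incompatible
end

# Illustrative stand-in for a UTF-8 regexp that contains a non-ASCII character.
accent = Regexp.new("\u00e9")

# The same bytes under two different encoding tags.
utf8   = "finalis\u00e9"
binary = utf8.dup.force_encoding(Encoding::ASCII_8BIT)  # assumed shape of what rg2 returns

# ASCII-only bytes stay compatible with the regexp even when tagged ASCII-8BIT.
ascii_binary = "pure ascii".dup.force_encoding(Encoding::ASCII_8BIT)

puts match_result(utf8, accent)
puts match_result(ascii_binary, accent)
puts match_result(binary, accent)  # non-ASCII bytes tagged ASCII-8BIT are rejected

# Re-tagging the bytes (no transcoding involved) makes the string usable again.
relabeled = binary.dup.force_encoding(Encoding::UTF_8)
puts match_result(relabeled, accent)
```

This matches the behaviour reported above: the pure ASCII test passes even though the String is tagged ASCII-8BIT, and only the accented String triggers the error.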

This is ruby 1.9.2p0 (2010-08-18 revision 29036), with rg2 0.19.2 and
0.90.0.
Thanks.


Guillaume C. - http://zarb.org/~gc/

Hi,

In [email protected]
“[ruby-gnome2-devel-en] String encoding problem with ruby 1.9.2p0” on
Thu, 23 Sep 2010 15:49:10 +0200,
Guillaume C. [email protected] wrote:

I am trying to make my http://booh.org/ app work with ruby 1.9.2p0.

Thanks.

There’s a problem retrieving String data from a Gtk::TreeIter. Ruby
says its encoding is ASCII-8BIT which I believe happens when the
encoding was not marked at String creation, and then if it contains
non ASCII characters it fails (on me, it fails in REXML).

String encoding support is one of the important TODO items.
We need help for the work.

Thanks,

kou

I am trying to make my http://booh.org/ app work with ruby 1.9.2p0.

Thanks.

It’s the fault of one of my users :cry:

There’s a problem retrieving String data from a Gtk::TreeIter. Ruby
says its encoding is ASCII-8BIT which I believe happens when the
encoding was not marked at String creation, and then if it contains
non ASCII characters it fails (on me, it fails in REXML).

String encoding support is one of the important TODO items.
We need help for the work.

I see. I may take a look at it, but time is scarce, as it is for you
I guess. RG2 definitely lacks manpower, but that is a recurring problem
with few solutions (one workaround might be to concentrate more on
“core” bindings without spending too much time on sub-bindings with
fewer users?).


Guillaume C. - Guillaume Cottenceau

There’s a problem retrieving String data from a Gtk::TreeIter. Ruby
says its encoding is ASCII-8BIT which I believe happens when the
encoding was not marked at String creation, and then if it contains
non ASCII characters it fails (on me, it fails in REXML).

String encoding support is one of the important TODO items.
We need help for the work.

I have looked at the source code. I feared that all Ruby String
creations from GTK+ would miss setting the encoding, which is the
case, they seem to be (nearly) all ASCII-8BIT :confused:

So this simple GtkLabel test shows ASCII-8BIT encoding :confused:

require 'gtk2'

f = "bla"
l = Gtk::Label.new
l.text = f
puts f.encoding       # encoding of the original Ruby String
puts l.text.encoding  # encoding of the String read back from GTK+

Not good…

So my proposal is the following: every Ruby String created by rg2 must
have an encoding set; otherwise problems similar to mine will crop up
here and there (operations failing on an ASCII-8BIT String obtained
from rg2).

But, as GTK+ returns gchar*, we don’t know what encoding to use. They
may fall into these categories:

  • Ruby String passed to GTK+ then retrieved from GTK+ (my case in
    GtkTreeStore, cases when reading GtkLabel, etc)

  • default stuff from GTK+ (not much I guess, and most probably in
    US-ASCII - does anybody know for sure?)

So if we always use a single encoding when passing Ruby Strings to
GTK+, and this encoding is US-ASCII compatible, then we can use the
same encoding when retrieving from GTK+. I propose UTF-8, which has
been “standard” within GTK+ since 2.0, and is quite standard elsewhere
too (sadly, not so standard in Japan, I heard?). Anyway, it would
always be better than the current situation, no?

So my patch converts Ruby String to UTF-8 if it’s not already in UTF-8
before passing to GTK+; and properly creates an UTF-8 Ruby String when
retrieving from GTK+.
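
The two directions described above can be sketched in plain Ruby. This is only an illustration of the principle, not the patch itself (which changes C macros): the helper names to_gtk and from_gtk are hypothetical, and the round trip through a gchar* is simulated by dropping the encoding tag.

```ruby
# Hypothetical Ruby-level stand-ins for the C macros discussed in this thread.

# Outgoing direction (the RVAL2CSTR side): make sure GTK+ receives UTF-8 bytes.
def to_gtk(string)
  string.encoding == Encoding::UTF_8 ? string : string.encode(Encoding::UTF_8)
end

# Incoming direction (the CSTR2RVAL side): GTK+ returns raw bytes; tag them UTF-8.
def from_gtk(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
end

latin1 = "finalis\u00e9".encode(Encoding::ISO_8859_1)
sent   = to_gtk(latin1)

# Simulate the trip through a gchar*: only the bytes survive, not the tag.
raw      = sent.dup.force_encoding(Encoding::ASCII_8BIT)
received = from_gtk(raw)

puts received.encoding
puts received == "finalis\u00e9"
```

Because every String is transcoded to UTF-8 on the way in, tagging the bytes as UTF-8 on the way out is safe for the round-trip case.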

Notice: RVAL2CSTR would now be an alias to RVAL2CSTR_ACCEPT_NIL, and I
think that is safer because it’s always better to properly handle
the NIL case.

Notice: I also patched README because the partial modules building
documentation was broken.

What do you think of my patch?

Now, this patch properly works on the simple GtkLabel test, and my
GtkTreeStore based test posted in this thread earlier. If the
principle is fine with you, I think we must convert all rb_str_new2 of
sourcecode to CSTR2RVAL, and StringValuePtr to RVAL2CSTR (are there
other macros/ruby internals used to convert between Ruby String and
C?), to fix other similar problems.

Thanks,

On 26 September 2010 07:43, Guillaume C. [email protected]
wrote:

I propose to use UTF-8 which is
“standard” within GTK+ since 2.0, and quite standard elsewhere too
(sadly, not much standard in Japan, I heard?). Anyways, it would
always be better than current situation, no?

Perhaps I’m wrong, but I thought all strings in GTK+ were UTF-8.
I don’t know where I read it, but I’m pretty sure that other encodings
don’t work properly.

As for Japan, I think there are 3 main standards. On older Unix
machines they tend to use EUC. On older Windows boxes they
use Shift-JIS. But on modern Unix-like machines I think most people
use UTF-8 while on modern Windows boxes they use UCS2.

Having said that, when I use ruby gnome with ruby version 1.8.x
on Windows, the strings I get back from the IME are definitely
UTF-8 (tested on both Windows XP and Windows 7). So I’m
relatively sure that GTK+ is storing the string as UTF-8 somehow.

If that’s true then it should be OK to simply mark UTF-8 as the
encoding for all GTK+ strings.

    MikeC

Hi,

In [email protected]
“Re: [ruby-gnome2-devel-en] String encoding problem with ruby 1.9.2p0”
on Sun, 26 Sep 2010 00:43:56 +0200,
Guillaume C. [email protected] wrote:

I have looked at the source code. I feared that all Ruby String
creations from GTK+ would miss setting the encoding, which is the
case, they seem to be (nearly) all ASCII-8BIT :confused:

Yes. None of those Strings have encoding information.

GTK+, and this encoding is US-ASCII compatible, then we can use this
encoding when retrieving from GTK+; I propose to use UTF-8 which is
“standard” within GTK+ since 2.0, and quite standard elsewhere too
(sadly, not much standard in Japan, I heard?). Anyways, it would
always be better than current situation, no?

So my patch converts Ruby String to UTF-8 if it’s not already in UTF-8
before passing to GTK+; and properly creates an UTF-8 Ruby String when
retrieving from GTK+.

I almost agree with your patch but I have some comments.

  • It’s true that GTK+ uses UTF-8 as its internal/external
    encoding, as Mike said. It’s OK to convert Ruby Strings to
    UTF-8 before passing them to GTK+.

  • But there are some cases where GTK+ doesn’t return
    UTF-8, e.g. g_convert()(*1), g_utf8_to_utf16()(*2) and
    so on. So we need to provide an API for specifying the
    encoding, like CSTR2RVAL_ENCODING().

    (*1)
    GLib – 2.0
    (*2)
    GLib – 2.0

  • UTF-8 is the standard encoding in Japan, now. :slight_smile:

  • Please wait to change this after 0.90.2 is
    released. I’ll release 0.90.2 soon. :slight_smile:
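
The second point above (results that are not UTF-8) can be illustrated with a small Ruby sketch. The helper cstr_to_string is hypothetical and only mimics the idea of CSTR2RVAL with an optional encoding argument; the UTF-16 bytes are produced with String#encode as a stand-in for g_utf8_to_utf16(), and little-endian byte order is an assumption.

```ruby
# Hypothetical Ruby analogue of CSTR2RVAL with an optional explicit encoding.
def cstr_to_string(bytes, encoding = Encoding::UTF_8)
  bytes.dup.force_encoding(encoding)
end

# Stand-in for the bytes returned by g_utf8_to_utf16() (little-endian assumed).
raw = "finalis\u00e9".encode(Encoding::UTF_16LE).force_encoding(Encoding::ASCII_8BIT)

wrong = cstr_to_string(raw)                      # blindly tagged UTF-8
right = cstr_to_string(raw, Encoding::UTF_16LE)  # tagged with the real encoding

puts wrong.valid_encoding?  # the UTF-16 bytes are not valid UTF-8
puts right.valid_encoding?
puts right.encode(Encoding::UTF_8) == "finalis\u00e9"
```

A default of UTF-8 covers the common case, while the explicit argument handles the exceptions by hand.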

Notice: RVAL2CSTR would be now an alias to RVAL2CSTR_ACCEPT_NIL, and I
think it is more safe because it’s always better to properly handle
NIL case.

We need RVAL2CSTR to keep its nil check because there are many
APIs that don’t accept NULL as a gchar * argument. We will
need a function that does StringValuePtr() with encoding handling.
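
The distinction between the strict and the nil-accepting conversion can be sketched in Ruby as follows. Both helper names are hypothetical stand-ins for the C macros, and nil plays the role of NULL here.

```ruby
# Hypothetical Ruby analogues of RVAL2CSTR and RVAL2CSTR_ACCEPT_NIL.

# Strict variant: for C functions that must never receive NULL.
def rval2cstr(value)
  raise TypeError, "expected a String, got nil" if value.nil?
  value.to_str.encode(Encoding::UTF_8)
end

# Lenient variant: nil is passed through (it would become NULL on the C side).
def rval2cstr_accept_nil(value)
  value.nil? ? nil : rval2cstr(value)
end

puts rval2cstr("bla").encoding
puts rval2cstr_accept_nil(nil).inspect

begin
  rval2cstr(nil)
rescue TypeError => e
  puts "rejected: #{e.message}"
end
```

Keeping the two variants separate means encoding handling can be added to both without changing which APIs tolerate NULL.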

Notice: I also patched README because the partial modules building
documentation was broken.

What do you think of my patch?

Please commit the README patch for now, and hold the other patches.

Now, this patch properly works on the simple GtkLabel test, and my
GtkTreeStore based test posted in this thread earlier. If the
principle is fine with you, I think we must convert all rb_str_new2 of
sourcecode to CSTR2RVAL, and StringValuePtr to RVAL2CSTR (are there
other macros/ruby internals used to convert between Ruby String and
C?), to fix other similar problems.

It seems OK to me. But we must not forget that there are some
exceptions like g_convert(). We need to handle those
exceptions by hand.

Thanks,

kou

On Sun, Sep 26, 2010 at 2:08 AM, Mike C. [email protected]
wrote:

On 26 September 2010 07:43, Guillaume C. [email protected] wrote:

I propose to use UTF-8 which is
“standard” within GTK+ since 2.0, and quite standard elsewhere too
(sadly, not much standard in Japan, I heard?). Anyways, it would
always be better than current situation, no?

Perhaps I’m wrong, but I thought all strings in GTK+ were UTF-8.
I don’t know where I read it, but I’m pretty sure that other encodings
don’t work properly.

IIRC, it’s related to Pango mainly. For example, if your system is in
ISO-8859-1, then files gotten from, say, GtkFileChooser will be in
that encoding, not UTF-8 - that’s a point I forgot to handle in my
yesterday’s patch, actually :confused:


Guillaume C. - Guillaume Cottenceau

Mike C. wrote:

On 26 September 2010 07:43, Guillaume C. [email protected]
wrote:

I propose to use UTF-8 which is
“standard” within GTK+ since 2.0, and quite standard elsewhere too
(sadly, not much standard in Japan, I heard?). Anyways, it would
always be better than current situation, no?

Perhaps I’m wrong, but I thought all strings in GTK+ were UTF-8.
I don’t know where I read it, but I’m pretty sure that other encodings
don’t work properly.

As for Japan, I think there are 3 main standards. On older Unix
machines they tend to use EUC. On older Windows boxes they
use Shift-JIS. But on modern Unix-like machines I think most people
use UTF-8 while on modern Windows boxes they use UCS2.

Having said that, when I use ruby gnome with ruby version 1.8.x
on Windows, the strings I get back from the IME are definitely
UTF-8 (tested on both Windows XP and Windows 7). So I’m
relatively sure that GTK+ is storing the string as UTF-8 somehow.

If that’s true then it should be OK to simply mark UTF-8 as the
encoding for all GTK+ strings.

    MikeC

I’m not sure about the encoding. But with ruby-gnome 0.19.4 and ruby
1.9.2p0, the textview example fails, and testgtk also fails on encoding:

C:\Users\swong24\Development\ruby-gtk2-0.19.4\gtk\sample\testgtk>testgtk.rb
<internal:lib/rubygems/custom_require>:29:in `require': C:/Users/swong24/Development/ruby-gtk2-0.19.4/gtk/sample/testgtk/labels.rb:54: invalid multibyte char (US-ASCII) (SyntaxError)
C:/Users/swong24/Development/ruby-gtk2-0.19.4/gtk/sample/testgtk/labels.rb:54: invalid multibyte char (US-ASCII)
C:/Users/swong24/Development/ruby-gtk2-0.19.4/gtk/sample/testgtk/labels.rb:54: syntax error, unexpected $end, expecting ')'
...!\nThis one is underlined in µùѵ£¼Φ¬₧πü«σà Ñτö¿quite a funky...
... ^
from <internal:lib/rubygems/custom_require>:29:in `require'
from C:/Users/swong24/Development/ruby-gtk2-0.19.4/gtk/sample/testgtk/testgtk.rb:48:in `<main>'

  • Ruby String passed to GTK+ then retrieved from GTK+ (my case in
    GtkTreeStore, cases when reading GtkLabel, etc)

  • default stuff from GTK+ (not much I guess, and much probably in
    US-ASCII - some people have actual ideas about it?)

I forgot to handle cases of Strings got from GTK+ to represent file
paths (from GtkFileChooser for example). There, the system encoding is
used and we must handle that specifically.

So my patch converts Ruby String to UTF-8 if it’s not already in UTF-8
before passing to GTK+; and properly creates an UTF-8 Ruby String when
retrieving from GTK+.

I almost agree with your patch but I have some comments.

Thanks for the fast review :slight_smile:

(*2) GLib – 2.0
True enough! I am not too fluent in glib stuff :confused:

  • UTF-8 is the standard encoding in Japan, now. :slight_smile:

Great :slight_smile: But it represents an overhead for you guys, no? E.g. even
katakanas seem to need 3 bytes in UTF-8 :confused:

  • Please wait to change this after 0.90.2 is
    released. I’ll release 0.90.2 soon. :slight_smile:

Yes, of course. It has been too much time, and this is too
“sensitive”, for me to commit without your approval!

Notice: RVAL2CSTR would be now an alias to RVAL2CSTR_ACCEPT_NIL, and I
think it is more safe because it’s always better to properly handle
NIL case.

We need RVAL2CSTR to keep its nil check because there are many
APIs that don’t accept NULL as a gchar * argument. We will
need a function that does StringValuePtr() with encoding handling.

Sorry, I didn’t properly understand that. I see now.

Now, this patch properly works on the simple GtkLabel test, and my
GtkTreeStore based test posted in this thread earlier. If the
principle is fine with you, I think we must convert all rb_str_new2 of
sourcecode to CSTR2RVAL, and StringValuePtr to RVAL2CSTR (are there
other macros/ruby internals used to convert between Ruby String and
C?), to fix other similar problems.

It seems OK to me. But we must not forget that there are some
exceptions like g_convert(). We need to handle those
exceptions by hand.

Ok, I’ll try to improve my patch soon.


Guillaume C. - Guillaume Cottenceau

 - default stuff from GTK+ (not much I guess, and much probably in
US-ASCII - some people have actual ideas about it?)

I forgot to handle cases of Strings got from GTK+ to represent file
paths (from GtkFileChooser for example). There, the system encoding is
used and we must handle that specifically.

The GtkFileChooser documentation confirms what I remembered:
“filenames are always returned in the character set specified by the
G_FILENAME_ENCODING environment variable”. Also, “while you can pass
the result of gtk_file_chooser_get_filename() to open(2) or fopen(3),
you may not be able to directly set it as the text of a GtkLabel
widget unless you convert it first to UTF-8, which all GTK+ widgets
expect. You should use g_filename_to_utf8() to convert filenames into
strings that can be passed to GTK+ widgets.”

Also, I wasn’t sure how File methods behave in Ruby 1.9.
My experiments show that the bytes of the String passed are used
as they are in its own encoding, i.e. without conversion. For example,
on a UTF-8 system, File.new("mémé") isn’t the same as
File.new("mémé".encode("ISO-8859-1")) (the latter gives ENOENT).
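
The difference comes down to the bytes each String carries. A short sketch, using only String#encode and String#bytes (no file is touched here):

```ruby
name_utf8   = "m\u00e9m\u00e9"
name_latin1 = name_utf8.encode(Encoding::ISO_8859_1)

# Same characters, different byte sequences.
puts name_utf8.bytes.map { |b| format("%02x", b) }.join(" ")    # 6d c3 a9 6d c3 a9
puts name_latin1.bytes.map { |b| format("%02x", b) }.join(" ")  # 6d e9 6d e9
```

If the bytes are handed to the operating system unchanged, as the experiment above suggests, the two Strings name two different files on a system whose filenames are stored as UTF-8.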

So, I think Strings representing files got from GTK+ in rg2 should:

1- have an associated encoding, so that they can be displayed in GTK+
without the programmer needing to do conversions (because my new
RVAL2CSTR will send UTF-8 char* to GtkLabel etc)
2- use the encoding expected by the operating system so that File.new
will work

The first step is to use g_filename_to_utf8(), so that we are 100%
confident we can create a correct String (by tagging these bytes with
the Ruby UTF-8 encoding).

The second step is to convert the String to another encoding in order
to achieve item (2) above. After looking at the GLib documentation, it
seems that using the first item from g_get_filename_charsets() should
be fine.
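
The two steps can be sketched in Ruby as follows. This is not the patch itself: the helper filename_for_ruby is hypothetical, raw_utf8 stands for the bytes returned by g_filename_to_utf8(), and filename_charset stands for the first entry of g_get_filename_charsets(). Neither GLib function is actually called.

```ruby
# Hypothetical helper illustrating the two steps; no GLib call is made here.
def filename_for_ruby(raw_utf8, filename_charset)
  utf8 = raw_utf8.dup.force_encoding(Encoding::UTF_8)  # step 1: tag the bytes as UTF-8
  utf8.encode(filename_charset)                        # step 2: transcode for the OS
end

# Untagged bytes, as they would come back from the C side.
raw = "m\u00e9m\u00e9".dup.force_encoding(Encoding::ASCII_8BIT)

path = filename_for_ruby(raw, "ISO-8859-1")
puts path.encoding

# The String still knows its encoding, so it can be converted back for display.
puts path.encode(Encoding::UTF_8) == "m\u00e9m\u00e9"
```

The resulting String carries the bytes the operating system expects, while its encoding tag lets the outgoing conversion turn it back into UTF-8 for widgets.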

I have made a patch for doing this. I have tested it on a UTF-8 system
with a file whose name contains an accent, with default settings and
also by forcing the filename encoding to ISO-8859-1 (by launching the
program with LC_ALL=en_US in the environment), and it all works great!
The patch is “string-encoding-for-files.diff” and my little test
program was:

-=-=—=-=—=-=—=-=–
#! /usr/bin/ruby

require 'gtk2'

Gtk.init

fc = Gtk::FileChooserDialog.new("Foo",
                                nil,
                                Gtk::FileChooser::ACTION_OPEN,
                                nil,
                                [Gtk::Stock::OK, Gtk::Dialog::RESPONSE_ACCEPT],
                                [Gtk::Stock::CANCEL, Gtk::Dialog::RESPONSE_CANCEL])
if fc.run == Gtk::Dialog::RESPONSE_ACCEPT
  puts fc.filename
  puts fc.filename.encoding
  puts File.exist?(fc.filename)  # File.exists? in the original mail; same check
end
-=-=—=-=—=-=—=-=–

Of course, the patch is not committable right now: if the principle of
the patch is fine, it should probably be placed into glib2 and shared
across the other APIs of RG2 returning filenames, and the opposite
process should be used for APIs receiving filenames.

What do you think of that analysis and patch?

 * But there are some cases that GTK+ doesn’t return
  UTF-8. e.g. g_convert()(*1), g_utf8_to_utf16()(*2) and
  so on. So, we need to provide an API for specify
  encoding like CSTR2RVAL_ENCODING().

  (*1) GLib – 2.0
  (*2) GLib – 2.0

I am not sure… Normally, I would say the charset/encoding
conversions should be done in Ruby itself. Also, specifying a charset
to convert “from” doesn’t make sense in Ruby 1.9 because Ruby Strings
already have an encoding. What is the use of these functions in Ruby
1.9, where Strings already carry an encoding and allow encoding
conversions? In my opinion, there should be no binding for these
functions anymore. Or maybe I am missing something; what do you think
we should do about these?
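
As an illustration of that argument, here is a sketch of the conversions done purely with String#encode from Ruby 1.9. The comparison with g_convert() is only approximate, and the sample text is arbitrary.

```ruby
text = "finalis\u00e9"  # tagged UTF-8

# The String knows its own encoding, so only the target needs to be named.
latin1 = text.encode(Encoding::ISO_8859_1)
puts latin1.bytesize  # one byte per character in ISO-8859-1

# When only untagged bytes are available, the source encoding can be passed
# explicitly as the second argument, much like the "from" charset of g_convert().
raw  = latin1.dup.force_encoding(Encoding::ASCII_8BIT)
back = raw.encode(Encoding::UTF_8, Encoding::ISO_8859_1)
puts back == text
```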

Notice: RVAL2CSTR would be now an alias to RVAL2CSTR_ACCEPT_NIL, and I
think it is more safe because it’s always better to properly handle
NIL case.

We need RVAL2CSTR to keep its nil check because there are many
APIs that don’t accept NULL as a gchar * argument. We will
need a function that does StringValuePtr() with encoding handling.

Sorry, I didn’t properly understand that. I see now.

Please see the patch “string-encoding-glib2.diff”: when nil is passed,
RVAL2CSTR behaves the same as before.

Thanks,

Hi,

Thanks for your proposals!
But could we split them into separate threads?

  1. FileChooser encoding
  2. g_convert() related things
  3. RVAL2CSTR encoding support

I would like to work through them in the order 3., 1., 2.

First, could you please send a mail to this ML about 3. only?
I’ll reply to that mail with the result of my patch review.

I’m sorry for the inconvenience, but please help me. I can’t
process a long English mail in one go with my English skills
and limited hobby time. :-<

Thanks,

kou

In [email protected]
“Re: [ruby-gnome2-devel-en] String encoding problem with ruby 1.9.2p0”
on Sun, 26 Sep 2010 22:58:43 +0200,