Forum: Ruby-Gnome 2 Clipboard problem with utf-8 and ascii-8bit

Posted by Rafael Garrido (rg_rg)
on 2010-02-11 09:34
Hi,

I need to count the number of characters entered. As an example I used
the "clipboard.rb" that comes with gtk-demo.

If we enter UTF-8 characters (for example the character ¿) in the
first "Gtk::Entry", paste the contents into the second "Gtk::Entry" and
check the encoding, it is always ASCII-8BIT. If we count the
number of characters in the string, Ruby returns 2 in this case.
By contrast, if we call force_encoding("utf-8") on the pasted content
and count again, Ruby returns 1.
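
The observation above can be reproduced without GTK. This is a minimal sketch, assuming Ruby 1.9 or later; the method name count_as is made up for this illustration. The two bytes of ¿ are counted once as binary (ASCII-8BIT) and once as UTF-8:

```ruby
# The UTF-8 encoding of "¿" is the two bytes 0xC2 0xBF.
# Array#pack("C*") yields a binary (ASCII-8BIT) string.
BYTES = [0xC2, 0xBF].pack("C*")

# Hypothetical helper: count characters after tagging a copy of the
# bytes with the given encoding. force_encoding changes only the tag,
# never the bytes themselves.
def count_as(bytes, encoding)
  bytes.dup.force_encoding(encoding).length
end

puts count_as(BYTES, Encoding::BINARY)   # every byte counts as one character
puts count_as(BYTES, Encoding::UTF_8)    # the two bytes form one character
```

Encoding::BINARY is an alias for ASCII-8BIT, so the first count corresponds to what the pasted string reports.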

The question: is there any parameter we can use when reading the
contents of "Gtk::Entry" so that the string is not tagged as
ASCII-8BIT by default?

Thanks

Rafael
Posted by Mike Charlton (Guest)
on 2010-02-11 11:24
(Received via mailing list)
If I remember correctly, Ruby 1.8 stores strings as bytes.  If you
have a multibyte encoding like UTF-8, the size isn't counted
in characters.  It is counted in bytes.  So when you store
something like ¿ it takes 2 bytes and the length is 2.

In order to count the characters you need to know the encoding
of the string.  There are several different ways to find the length
using different classes (all of which I have forgotten about -- sorry).
But usually you want to index into the string by character.
You can't really do this with Unicode strings in Ruby 1.8.  Instead
I usually split the string into an array like this:

       TO_A_RE = Regexp.new('\s*',nil,'U')

       string = "こんにちわ"
       a = string.split(TO_A_RE)
       size = a.size

The TO_A_RE regexp is set for unicode (the 'U' option).  You
can set it to other values for other encodings.

Now why is your Gtk::Entry getting 8-bit ASCII characters?
My guess is that the locale you are using is not Unicode,
but actually 8-bit ASCII.  I don't know what platform you
are using, but if you are on Linux you could use
es_ES.utf8 (guessing that you are Spanish???).  That
should fix the problem.

Note that Ruby 1.9 is completely different.  If I understand
correctly, the encoding is stored with the string and you
can actually index characters in multi-byte strings.  But
I haven't used it yet.
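
For comparison, here is a short sketch of the Ruby 1.9 behavior described above, where the encoding travels with the string. The method name describe_string is invented for this example, and the literal is written with \u escapes so the result does not depend on the source file's encoding:

```ruby
# こんにちわ written with \u escapes; such a literal is always UTF-8.
GREETING = "\u3053\u3093\u306B\u3061\u308F"

# Hypothetical helper: collect what Ruby 1.9+ knows about a string.
def describe_string(str)
  {
    :encoding => str.encoding.name,  # the tag stored with the string
    :chars    => str.length,         # counted in characters
    :bytes    => str.bytesize,       # counted in bytes
    :second   => str[1]              # indexing is by character, not byte
  }
end

p describe_string(GREETING)
```

Here length and bytesize differ because each of the five characters occupies three bytes in UTF-8.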

I hope that helps.  Using different string encodings in
Ruby 1.8 is considerably harder than it should be.

          MikeC
Posted by Rafael Garrido (rg_rg)
on 2010-02-11 11:46
Sorry for not giving enough information about my environment:
---
  ruby 1.9.1p376 (2009-12-07 revision 26041) [i686-linux]
  LANG=es_ES.UTF-8
  LC_COLLATE=es_ES.UTF-8
---

The characters display correctly in the GTK application, but the problem
appears when checking the encoding of the string in Ruby:
  string = "¿"
  string.encoding -> this always returns ASCII-8BIT instead of utf-8
If we count the characters:
  string.length -> 2 but should be 1

Posted by Mike Charlton (Guest)
on 2010-02-11 12:55
(Received via mailing list)
On 11 February 2010 19:46, Rg Rg <ruby-forum-incoming@andreas-s.net> 
wrote:
> The characters display correctly in GTK application, but the problem is
> to check with Ruby which is the encoding of the string:
>  string = "¿"
>  string.encoding -> this always returns ASCII-8BIT instead of utf-8
> If we count the characters:
>  string.length -> 2 but should be 1

Ah...  It looks like the encoding is being detected wrongly.
This will be a Ruby 1.9 problem, not a GTK problem.  I'm
afraid I don't know enough to help...

       MikeC
Posted by Simon Arnaud (Guest)
on 2010-02-11 14:00
(Received via mailing list)
2010/2/11 Rg Rg <ruby-forum-incoming@andreas-s.net>:
> Sorry for not giving enough information about my environment:
> ---
>  ruby 1.9.1p376 (2009-12-07 revision 26041) [i686-linux]
>  LANG=es_ES.UTF-8
>  LC_COLLATE=es_ES.UTF-8
> ---
>

I can't reproduce this behavior.

#$ ruby -v
ruby 1.9.1p376 (2009-12-07 revision 26041) [x86_64-linux]
#$ cat encoding.rb
str = "É"
puts str.encoding
puts str.size
puts str.length
#$ ruby encoding.rb
encoding.rb:1: invalid multibyte char (US-ASCII)
encoding.rb:1: invalid multibyte char (US-ASCII)
#$ ruby -Ku encoding.rb
UTF-8
1
1
#$

If you don't want to specify -Ku, you can add the magic comment
"# encoding: utf-8" to one of the first two lines of the file, see
http://blog.nuclearsquid.com/writings/ruby-1-9-encodings

What you describe is the behavior of 1.8, not 1.9. Double-check that.
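
As an illustration of the magic comment, here is a minimal sketch of encoding.rb with the comment in place. The helper name encoding_report is an assumption made for this example. Ruby 2.0 and later already default to UTF-8 source, so the comment only changes the outcome on 1.9:

```ruby
# encoding: utf-8
# The magic comment above has to be on the first line of the file
# (or on the second, when the first is a #! line).

# Hypothetical helper that reports how Ruby sees a string.
def encoding_report(str)
  [str.encoding.name, str.length, str.bytesize]
end

p __ENCODING__            # source encoding of this file
p encoding_report("É")    # encoding name, characters, bytes
```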
Posted by Rafael Garrido (rg_rg)
on 2010-02-11 16:33
Simon Arnaud wrote:
> 2010/2/11 Rg Rg <ruby-forum-incoming@andreas-s.net>:
>> Sorry for not giving enough information about my environment:
>> ---
>>  ruby 1.9.1p376 (2009-12-07 revision 26041) [i686-linux]
>>  LANG=es_ES.UTF-8
>>  LC_COLLATE=es_ES.UTF-8
>> ---
>>
> 
> I don't have this behavior.
> 
> #$ ruby -v
> ruby 1.9.1p376 (2009-12-07 revision 26041) [x86_64-linux]
> #$ cat encoding.rb
> str = "É"
> puts str.encoding
> puts str.size
> puts str.length
> #$ ruby encoding.rb
> encoding.rb:1: invalid multibyte char (US-ASCII)
> encoding.rb:1: invalid multibyte char (US-ASCII)
> #$ ruby -Ku encoding.rb
> UTF-8
> 1
> 1
> #$
> 
> If you don't want to specify -Ku, you can add encoding: utf-8 to the
> first two lines, see
> http://blog.nuclearsquid.com/writings/ruby-1-9-encodings
> 
> What you describe is the behavior of 1.8, not 1.9. Double check that.


I think I didn't explain myself well. This behavior only happens to me 
when reading the string from GTK; if I run the same example as you, I 
get the same result as you.

Taking as example the "./Gtk-demo/clipboard.rb" in the following 
section:

     button.signal_connect('clicked', entry) do |w, e|
       clipboard = e.get_clipboard(Gdk::Selection::CLIPBOARD)
       clipboard.request_text do |board, text, data|
         e.text = text  # The UTF-8 text is displayed correctly
       end
     end


If we change the above code, replacing "e.text = text" with 
"e.text = text.encoding.to_s", the result is ASCII-8BIT.
If we show the length of the string instead, the result is incorrect 
because the string is not treated as UTF-8.
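
Until the binding tags these strings itself, the callback can retag the text before counting. This is a minimal sketch, assuming the bytes handed back by GTK really are UTF-8; as_utf8 is a hypothetical helper, not part of ruby-gnome2:

```ruby
# Hypothetical helper: return a copy of the text tagged as UTF-8 when
# its bytes form valid UTF-8, and the untouched original otherwise.
def as_utf8(text)
  return text if text.nil? || text.encoding == Encoding::UTF_8
  candidate = text.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : text
end

# Stand-in for the string received in the request_text block:
# the bytes of "¿" tagged as binary (ASCII-8BIT).
pasted = [0xC2, 0xBF].pack("C*")
puts as_utf8(pasted).length
```

Inside the request_text block this would read e.text = as_utf8(text).length.to_s. The validity check keeps genuinely binary data from being mislabeled.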

Posted by Simon Arnaud (Guest)
on 2010-02-12 10:46
(Received via mailing list)
2010/2/11 Rg Rg <ruby-forum-incoming@andreas-s.net>:
> I think that I haven't explained well, this behavior only happens to me
> when reading the chain of GTK, if I do the same example that you the
> behavior is the same as you.

Ah, yes, I tested a little with a Gtk::Entry and it gives back a
string tagged as ASCII-8BIT.

Definitely a bug somewhere, but I don't know if it is in Ruby or in
ruby-gnome2.

I tried to look at the sources, but didn't understand where the getter
was defined.

Let's hope Kou can have a look at it, or give directions where to look.

Simon
Posted by Kouhei Sutou (Guest)
on 2010-02-12 13:51
(Received via mailing list)
Hi,

In <78f5e3ec1002120140r2c429a93sdb65298fef67aea5@mail.gmail.com>
  "Re: [ruby-gnome2-devel-en] Clipboard problem with utf-8 and 
ascii-8bit" on Fri, 12 Feb 2010 10:40:10 +0100,
  Simon Arnaud <mazwak@gmail.com> wrote:

> I tried to look at the sources, but didn't understand where the getter
> was defined.
> 
> Let's hope Kou can have a look at it, or give directions where to look.

Gtk::Clipboard#request_text passes UTF-8-encoded text to the
callback in trunk. We need more work on
encoding, e.g. we should set the UTF-8 encoding on the text
returned by Gtk::Entry#text.

I've added the RBG_STRING_SET_UTF8_ENCODING() macro. Could someone
do the UTF-8 encoding work in trunk? If someone sends a patch,
could someone please review it and commit it to trunk.

Thanks,
--
kou
Posted by Rafael Garrido (rg_rg)
on 2010-02-15 12:53
Kouhei Sutou wrote:
> Hi,
> 
> In <78f5e3ec1002120140r2c429a93sdb65298fef67aea5@mail.gmail.com>
>   "Re: [ruby-gnome2-devel-en] Clipboard problem with utf-8 and 
> ascii-8bit" on Fri, 12 Feb 2010 10:40:10 +0100,
>   Simon Arnaud <mazwak@gmail.com> wrote:
> 
>> I tried to look at the sources, but didn't understand where the getter
>> was defined.
>> 
>> Let's hope Kou can have a look at it, or give directions where to look.
> 
> Gtk::Clipbord#request_text passes UTF-8 encoding text to
> callback in trunk. We need more work around
> encoding. e.g. we should set UTF-8 encoding to a text
> returned by Gtk::Entry#text.
> 
> I've add RBG_STRING_SET_UTF8_ENCODING() macro. Could someone
> UTF-8 encoding set work in trunk? If someone sends a patch,
> please someone reviews and commit it into trunk.
> 
> Thanks,
> --
> kou


As I need to work with "Gtk::Entry#text", I have patched the file 
rbgtkentry.c to use your macro.
I haven't read much of the ruby-gnome code, so the patch is probably not 
quite correct, but at least it works for me.


-------------------------------------

--- gtk/src/rbgtkentry.c.old    2010-02-15 12:32:01.000000000 +0100
+++ gtk/src/rbgtkentry.c        2010-02-15 12:31:29.000000000 +0100
@@ -135,6 +135,18 @@
 }
 #endif

+static VALUE
+entry_request_text(self)
+    VALUE self;
+{
+    VALUE vtext = Qnil;
+    const gchar *text;
+    text = gtk_entry_get_text(_SELF(self));
+    vtext = CSTR2RVAL(text);
+    RBG_STRING_SET_UTF8_ENCODING(vtext) ;
+    return vtext;
+}
+
 void
 Init_gtk_entry()
 {
@@ -149,6 +161,7 @@
 #endif
     rb_define_method(gEntry, "layout_index_to_text_index", entry_layout_index_to_text_index, 1);
     rb_define_method(gEntry, "text_index_to_layout_index", entry_text_index_to_layout_index, 1);
+    rb_define_method(gEntry, "text", entry_request_text, 0);

 #if GTK_CHECK_VERSION(2, 12, 0)
     rb_define_method(gEntry, "cursor_hadjustment",

-------------------------------------

Thanks