Clipboard problem with utf-8 and ascii-8bit

Hi,

I need to count the number of characters entered. As an example I used the
“clipboard.rb” that comes with gtk-demo.

If we enter UTF-8 characters (for example the character ¿) in the
first “Gtk::Entry”, paste the contents into the second “Gtk::Entry” and
check the encoding, we always see ASCII-8BIT. If we count the
number of characters in the string, Ruby returns 2 in this case.
By contrast, if we call force_encoding("utf-8") on the pasted content and
count again, Ruby returns 1.

The question: is there any parameter to use when reading the contents of
“Gtk::Entry” so that the text is not tagged as ASCII-8BIT by default?

Thanks

Rafael
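
The difference between the two counts can be reproduced without GTK. The following is only a minimal sketch, assuming Ruby 1.9 or later; the variable names are made up for illustration, and the byte string stands in for whatever the Gtk::Entry hands back.

```ruby
# encoding: utf-8

# The two bytes of "¿" in UTF-8, tagged as binary. This only imitates
# the string that the thread reports coming back from GTK.
raw = "\xC2\xBF".dup.force_encoding(Encoding::ASCII_8BIT)
puts raw.encoding == Encoding::ASCII_8BIT  # true
puts raw.length                            # counts bytes

# force_encoding relabels the same bytes as UTF-8; it does not convert them.
utf8 = raw.dup.force_encoding(Encoding::UTF_8)
puts utf8.valid_encoding?                  # true, the bytes are valid UTF-8
puts utf8.length                           # counts characters
puts utf8.bytesize                         # still two bytes
```

Since force_encoding only changes the label, it is safe here only because the bytes really are UTF-8.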

If I remember correctly, Ruby 1.8 stores strings as bytes. If you
have a multibyte encoding like UTF-8, the size is counted
in bytes, not in characters. So when you store
something like ¿ it takes 2 bytes and the length is 2.

In order to count the characters you need to know the encoding
of the string. There are several different ways to find the length
using different classes (all of which I have forgotten, sorry).
But usually you want to index into the string by character.
You can’t really do this with multibyte strings in Ruby 1.8. Instead I usually
split the string into an array like this:

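   # Ruby 1.8 only: the third argument 'U' makes the regexp UTF-8 aware
   TO_A_RE = Regexp.new('\s*',nil,'U')

   string = "こんにちわ"
   a = string.split(TO_A_RE)   # one array element per character
   size = a.size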

The TO_A_RE regexp is set for unicode (the ‘U’ option). You
can set it to other values for other encodings.

Now why is your Gtk::Entry returning ASCII-8BIT strings?
My guess is that the locale you are using is not a UTF-8 locale,
but an 8-bit one. I don’t know what platform you
are using, but if you are on Linux you could use
es_ES.utf8 (guessing that you are Spanish?). That
should fix the problem.

Note that Ruby 1.9 is completely different. If I understand
correctly, the encoding is stored with the string and you
can actually index characters in multi-byte strings. But
I haven’t used it yet.

I hope that helps. Using different string encodings in
Ruby 1.8 is considerably harder than it should be.

      MikeC
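
For comparison, the Ruby 1.9 behaviour mentioned above looks roughly like this. It is a minimal sketch, assuming Ruby 1.9 or later and a UTF-8 source file; it reuses the sample string from the split example.

```ruby
# encoding: utf-8

string = "こんにちわ"

puts string.encoding          # the encoding travels with the string
puts string.length            # 5 characters
puts string.bytesize          # 15 bytes, three per character in UTF-8
puts string[1]                # indexing is by character: ん
puts string.chars.to_a.size   # 5, same count as the split approach
```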

Sorry for not giving enough information about my environment:

ruby 1.9.1p376 (2009-12-07 revision 26041) [i686-linux]
LANG=es_ES.UTF-8
LC_COLLATE=es_ES.UTF-8

The characters display correctly in the GTK application, but the problem
shows up when I check the encoding of the string in Ruby:
string = "¿"
string.encoding -> this always returns ASCII-8BIT instead of UTF-8
If we count the characters:
string.length -> 2, but it should be 1

2010/2/11 Rg Rg:

Sorry for not giving enough information about my environment:

 ruby 1.9.1p376 (2009-12-07 revision 26041) [i686-linux]
 LANG=es_ES.UTF-8
 LC_COLLATE=es_ES.UTF-8

I don’t have this behavior.

#$ ruby -v
ruby 1.9.1p376 (2009-12-07 revision 26041) [x86_64-linux]
#$ cat encoding.rb
str = "É"
puts str.encoding
puts str.size
puts str.length
#$ ruby encoding.rb
encoding.rb:1: invalid multibyte char (US-ASCII)
encoding.rb:1: invalid multibyte char (US-ASCII)
#$ ruby -Ku encoding.rb
UTF-8
1
1
#$

If you don’t want to specify -Ku, you can add the magic comment
"# encoding: utf-8" within the first two lines of the file, see
http://blog.nuclearsquid.com/writings/ruby-1-9-encodings

What you describe is the behavior of 1.8, not 1.9. Double-check that.
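
As a minimal sketch of the magic-comment alternative to -Ku, assuming Ruby 1.9 or later, the test script could be written as follows. On Ruby 2.0 and later the source encoding already defaults to UTF-8, so the comment is only strictly needed on 1.9.

```ruby
# encoding: utf-8
# The comment above must sit on the first line of the file
# (or the second, when the first is a shebang line).

str = "É"
puts str.encoding   # string literals take the source file's encoding
puts str.size
puts str.length
```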

Simon A. wrote:


What you describe is the behavior of 1.8, not 1.9. Double check that.

I think I didn’t explain it well. This behavior only happens to me
when reading the string from GTK; if I run the same example as you, I get
the same result as you.

Take as an example the following section of
“./Gtk-demo/clipboard.rb”:

 button.signal_connect('clicked', entry) do |w, e|
   clipboard = e.get_clipboard(Gdk::Selection::CLIPBOARD)
   clipboard.request_text do |board, text, data|
     e.text = text  # The UTF-8 text is displayed correctly
   end
 end

If we change the above code and replace “e.text = text” with
“e.text = text.encoding.to_s”, the result is ASCII-8BIT.
If we show the length of the string instead, the result is incorrect,
because the string is counted in bytes rather than as UTF-8 characters.
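
Until the binding tags the string itself, one possible workaround is to relabel the text inside the callback. The sketch below is only an illustration: as_utf8 is a hypothetical helper, not part of Ruby-GNOME2, and the byte string imitates what the callback receives, so the example runs without GTK.

```ruby
# encoding: utf-8

# Hypothetical helper (not a Ruby-GNOME2 API): returns the text tagged as
# UTF-8. Assumes the bytes really are UTF-8, which GTK uses internally.
def as_utf8(text)
  return text if text.nil? || text.encoding == Encoding::UTF_8
  text.dup.force_encoding(Encoding::UTF_8)
end

# Stand-in for the callback argument: the UTF-8 bytes of "¿Qué?"
# tagged as ASCII-8BIT.
from_gtk = "\xC2\xBFQu\xC3\xA9?".dup.force_encoding(Encoding::ASCII_8BIT)

fixed = as_utf8(from_gtk)
puts from_gtk.length   # byte count
puts fixed.length      # character count

# Inside the clipboard callback this might be used as:
#   e.text = as_utf8(text).length.to_s
```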

2010/2/11 Rg Rg:

I think I didn’t explain it well. This behavior only happens to me
when reading the string from GTK; if I run the same example as you, I get
the same result as you.

Ah, yes, I tested a little with a Gtk::Entry and it gives back a
string tagged as ASCII-8BIT.

Definitely a bug somewhere, but I don’t know if it is in Ruby or in ruby-gnome2.

I tried to look at the sources, but didn’t understand where the getter
was defined.

Let’s hope Kou can have a look at it, or give directions where to look.

Simon

Hi,

In “Re: [ruby-gnome2-devel-en] Clipboard problem with utf-8 and
ascii-8bit” on Fri, 12 Feb 2010 10:40:10 +0100,
Simon A. wrote:

I tried to look at the sources, but didn’t understand where the getter
was defined.

Let’s hope Kou can have a look at it, or give directions where to look.

Gtk::Clipboard#request_text passes UTF-8 encoded text to the
callback in trunk. We need more work on
encoding, e.g. we should set the UTF-8 encoding on the text
returned by Gtk::Entry#text.

I’ve added the RBG_STRING_SET_UTF8_ENCODING() macro. Could someone
work on setting the UTF-8 encoding in trunk? If someone sends a patch,
could someone please review it and commit it to trunk?

Thanks,

kou

On 11 February 2010 19:46, Rg Rg wrote:

The characters display correctly in the GTK application, but the problem
shows up when I check the encoding of the string in Ruby:
 string = "¿"
 string.encoding → this always returns ASCII-8BIT instead of UTF-8
If we count the characters:
 string.length → 2, but it should be 1

Ah… It looks like the encoding is being detected wrongly.
This would be a Ruby 1.9 problem, not a GTK problem. I’m
afraid I don’t know enough to help…

   MikeC

Kouhei S. wrote:


Gtk::Clipboard#request_text passes UTF-8 encoded text to the
callback in trunk. We need more work on
encoding, e.g. we should set the UTF-8 encoding on the text
returned by Gtk::Entry#text.

I’ve added the RBG_STRING_SET_UTF8_ENCODING() macro. Could someone
work on setting the UTF-8 encoding in trunk? If someone sends a patch,
could someone please review it and commit it to trunk?


Since I need “Gtk::Entry#text” to work, I have patched the file
rbgtkentry.c to use your macro.
I haven’t read much of the ruby-gnome code, so the patch is probably not quite
correct, but at least it works for me.


--- gtk/src/rbgtkentry.c.old 2010-02-15 12:32:01.000000000 +0100
+++ gtk/src/rbgtkentry.c 2010-02-15 12:31:29.000000000 +0100
@@ -135,6 +135,18 @@
 }
 #endif
 
+static VALUE
+entry_request_text(self)
+    VALUE self;
+{
+    VALUE vtext = Qnil;
+    const gchar *text;
+    text = gtk_entry_get_text(_SELF(self));
+    vtext = CSTR2RVAL(text);
+    RBG_STRING_SET_UTF8_ENCODING(vtext) ;
+    return vtext;
+}
+
 void
 Init_gtk_entry()
 {
@@ -149,6 +161,7 @@
 #endif
     rb_define_method(gEntry, "layout_index_to_text_index", entry_layout_index_to_text_index, 1);
     rb_define_method(gEntry, "text_index_to_layout_index", entry_text_index_to_layout_index, 1);
+    rb_define_method(gEntry, "text", entry_request_text, 0);
 
 #if GTK_CHECK_VERSION(2, 12, 0)
     rb_define_method(gEntry, "cursor_hadjustment",


Thanks