Iconv and incompatible encodings

Hi all,

All this talk of Unicode handling has brought up a problem I run into
every so often.

Is there any way to use the Iconv library to lossily convert between
partially incompatible encodings? In other words, if, for example, I’ve
got a UTF-8 string that I need to convert down to 7-bit ASCII, and I
don’t especially care what happens to the extended characters (short of
a single character being mapped to a single character - ideally one I
can specify), is there any way of forcing the recode?

I realise I could use the failure message to figure out which characters
in the input string are incompatible and blank them out case-by-case in
a loop, but that seems awfully wasteful.

Any ideas?

On 27/06/06, Alex Y. wrote:

Is there any way to use the Iconv library to lossily convert between
partially incompatible encodings? In other words, if, for example, I’ve
got a UTF-8 string that I need to convert down to 7-bit ASCII, and I
don’t especially care what happens to the extended characters (short of
a single character being mapped to a single character - ideally one I
can specify), is there any way of forcing the recode?

Yes, there is. Add //IGNORE to the destination encoding to ignore
unavailable characters, or //TRANSLIT to transliterate them into
combinations of ASCII characters (e.g. `e for è).

E.g.:

#!/usr/bin/env ruby
$KCODE = 'u'
require 'iconv'

s = 'caffè'

ic_ignore = Iconv.new('US-ASCII//IGNORE', 'UTF-8')
puts ic_ignore.iconv(s) # => caff

ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s) # => caff`e

//TRANSLIT will raise an exception on characters it can’t
transliterate, however; this can be solved by using
'//IGNORE//TRANSLIT' together (in that order).

Paul.
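For readers on Ruby 1.9 or later, where String#encode is built in (Iconv was later removed from the standard library), the "replace with a character I can specify" part of the question can be sketched as below. This is only an illustration under that assumption, not something from the thread itself. String#encode does not transliterate the way //TRANSLIT does, so the last line uses an explicit :fallback hash instead.

```ruby
# Assumption: Ruby 1.9+ (String#encode). "\u00E8" is the letter e with a grave accent.
s = "caff\u00E8"

# Replace every character that has no US-ASCII equivalent with a chosen substitute.
replaced = s.encode('US-ASCII', 'UTF-8', undef: :replace, invalid: :replace, replace: '?')
puts replaced  # => caff?

# An empty replacement drops such characters, similar in effect to //IGNORE.
dropped = s.encode('US-ASCII', 'UTF-8', undef: :replace, invalid: :replace, replace: '')
puts dropped   # => caff

# A :fallback hash maps specific characters; anything not listed still raises.
mapped = s.encode('US-ASCII', 'UTF-8', fallback: { "\u00E8" => 'e' })
puts mapped    # => caffe
```

The replacement string is applied once per unconvertible character, so a single input character always maps to a single output character when the replacement is one character long.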

Paul B. wrote:

combinations of ASCII characters (e.g. `e for è).
puts ic_ignore.iconv(s) # => caff

Ooh, that’s nice. Thanks for that. I guess it’s wishful thinking to
hope for:

puts ic_ignore.iconv(s) # => caffe

Paul B. wrote:

require 'unicode'
or via gems.

Hah! That’s fantastic :)

I love the smell of Ruby in the morning.

On 27/06/06, Alex Y. wrote:

Ooh, that’s nice. Thanks for that. I guess it’s wishful thinking to
hope for:

puts ic_ignore.iconv(s) # => caffe

Your wish is granted:

#!/usr/bin/env ruby
$KCODE = 'u'
require 'unicode'

s = 'caffè'

puts Unicode.normalize_KD(s).gsub(/[^\x00-\x7F]/n,'') # => caffe

It works by decomposing characters and rejecting everything above
position 127, which includes all the now-separated accents.

This does require the slightly-flaky unicode library (work is underway
to update it). You can get that from http://www.yoshidam.net/Ruby.html
or via gems.

Paul.
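To make the decomposition step visible, here is a small sketch. It assumes Ruby 2.2 or later, where String#unicode_normalize is part of the standard library and can stand in for Unicode.normalize_KD from the unicode gem; that substitution is an assumption of this example, not what the thread used.

```ruby
# Assumption: Ruby 2.2+ (String#unicode_normalize).
s = "caff\u00E8"  # "caffe" with a grave accent on the final letter

decomposed = s.unicode_normalize(:nfkd)

# U+00E8 is split into a plain "e" (U+0065) plus a combining grave accent (U+0300).
p decomposed.codepoints.last(2)  # => [101, 768]

# Dropping everything above position 127 removes the now-separate accent.
ascii = decomposed.gsub(/[^\x00-\x7F]/, '')
puts ascii  # => caffe
```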

Paul B. wrote:

s = 'caffè'

ic_ignore = Iconv.new('US-ASCII//IGNORE', 'UTF-8')
puts ic_ignore.iconv(s) # => caff

ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s) # => caff`e

//TRANSLIT will raise an exception on characters it can’t
transliterate, however; this can be solved by using
'//IGNORE//TRANSLIT' together (in that order).

Can anyone else get this to work? Instead of “caff`e” I just get “caff?”

Daniel

Paul B. wrote:

On 10/07/06, Daniel DeLorme wrote:

Can anyone else get this to work? Instead of “caff`e” I just get “caff?”

What’s your platform?

ubuntu breezy with ruby 1.8.4
iconv 2.3.5

On 10/07/06, Daniel DeLorme wrote:

Can anyone else get this to work? Instead of “caff`e” I just get “caff?”

What’s your platform?

Paul.

On Oct 13, 2008, at 12:48 PM, Davi Barbosa wrote:

I already checked with str.each_byte {|x| puts x} and the strings are
exactly the same. Does anyone have any idea why I get two different
answers from Iconv?

My system:
$ irb --version
irb 0.9.5(05/04/13)
$ ruby --version
ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]

I have ENV['LANG']==en_US.UTF-8 in both cases.

Try adding the -KU switch to Ruby, to put it in UTF-8 mode.

James Edward G. II

James G. wrote:

Try adding the -KU switch to Ruby, to put it in UTF-8 mode.

Thank you for your really fast answer. I tried this:
$ ruby -KU -e "require 'iconv'; puts Iconv.conv('US-ASCII//TRANSLIT','UTF-8','éèêë')"
???

$ ruby -e "$KCODE='u'; require 'iconv'; puts Iconv.conv('US-ASCII//TRANSLIT','UTF-8','éèêë')"
???

and for information, I also have:
$ echo 'éèêë' | iconv -t ASCII//TRANSLIT -f UTF-8
eeee

$ iconv --version
iconv (GNU libc) 2.7

irb(main):002:0> 'é'.each_byte {|x| puts x}
195
169
=> "\303\251"

$ ruby -e "'é'.each_byte {|x| puts x}"
195
169

and finally, the weirdest part: irb doesn’t work if I pipe the input in:
$ echo "require 'iconv'; puts Iconv.conv('US-ASCII//TRANSLIT','UTF-8','é'); 'é'.each_byte{|x| puts x}" | irb
require 'iconv'; puts Iconv.conv('US-ASCII//TRANSLIT','UTF-8','é'); 'é'.each_byte{|x| puts x}
?
195
169
"\303\251"

Hello,
I was going crazy with this problem. I searched a lot and found other
people with the same problem: Iconv works in irb but not in a Ruby
script.
The solution was to take another route. For example, Daniel L.
(http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/306663)
made 3 suggestions. The first one is to use the Ruby-GNOME2 library:

require 'gtk2'
ascii = GLib.convert(string, "ASCII//translit", "UTF-8")

Not only did this work for me, but Iconv itself also started to work as
expected! For instance:
require 'iconv'
require 'gtk2'
puts Iconv.conv("ASCII//translit","UTF-8","áàâä")
gives 'aaaa'.

The second solution:
ascii = %x{echo "#{str}" | iconv -f "ISO-8859-1" -t "US-ASCII//TRANSLIT"}
also worked here.

The problem is that I’m not writing a standalone Ruby script; I’m building
a web page with mod_ruby. So %x{} gives an 'Insecure operation' error and
"require 'gtk2'" gives:
/var/www/dev/q/test.rbx:12: Cannot open display:
/usr/lib/ruby/1.8/gtk2.rb:12
./lib.rb:31:in `require'

His last suggestion is to write your own wrapper, which of course I have
not tried. In the end, I used the hack:
Unicode.normalize_KD(string).gsub(/[^\x00-\x7F]/n,'')
as described earlier in this thread, and it seems to work fine for
removing accents (but I’m not sure whether the result is an ASCII
string).
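On the open question of whether the result is really ASCII: since the gsub removes every character above position 127, whatever remains can only be 7-bit. A quick way to confirm this is sketched below, assuming Ruby 2.2+ with String#unicode_normalize in place of the unicode gem. The helper name strip_to_ascii is made up for this example. Note the side effect shown at the end: letters that have no decomposition are removed entirely rather than approximated.

```ruby
# Assumption: Ruby 2.2+; strip_to_ascii is a hypothetical helper name.
def strip_to_ascii(str)
  str.unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]/, '')
end

# e-acute, e-grave, e-circumflex, e-diaeresis
accented = strip_to_ascii("\u00E9\u00E8\u00EA\u00EB")
puts accented              # => eeee
puts accented.ascii_only?  # => true

# U+00F8 (o with stroke) has no decomposition, so it simply disappears.
danish = strip_to_ascii("sm\u00F8rrebr\u00F8d")
puts danish                # => smrrebrd
```

The string keeps its UTF-8 encoding tag even though it holds only ASCII bytes; calling encode('US-ASCII') on it afterwards changes the tag without touching the content.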

Hi,

At Tue, 4 Nov 2008 04:04:23 +0900,
Davi Barbosa wrote in [ruby-talk:319309]:

require 'gtk2'
ascii = GLib.convert(string, "ASCII//translit", "UTF-8")

Not only did this work for me, but Iconv itself also started to work as
expected! For instance:
require 'iconv'
require 'gtk2'
puts Iconv.conv("ASCII//translit","UTF-8","áàâä")
gives 'aaaa'.

GNU libiconv seems to need the locale set.
The issue would be fixed by the following patch.

Index: configure.in
===================================================================
--- configure.in (revision 20103)
+++ configure.in (working copy)
@@ -559,5 +559,5 @@ AC_CHECK_HEADERS(stdlib.h string.h unist
                  syscall.h pwd.h grp.h a.out.h utime.h memory.h direct.h sys/resource.h \
                  sys/mkdev.h sys/utime.h netinet/in_systm.h float.h ieeefp.h pthread.h \
-                 ucontext.h intrinsics.h)
+                 ucontext.h intrinsics.h locale.h)
 
 dnl Check additional types.
Index: main.c
===================================================================
--- main.c (revision 20103)
+++ main.c (working copy)
@@ -12,4 +12,7 @@
 
 #include "ruby.h"
+#ifdef HAVE_LOCALE_H
+#include <locale.h>
+#endif
 
 #ifdef __human68k__
@@ -35,4 +38,7 @@ main(argc, argv)
     char **argv;
 {
+#ifdef HAVE_LOCALE_H
+    setlocale(LC_CTYPE, "");
+#endif
 #ifdef _WIN32
     NtInitialize(&argc, &argv);

As this patch is not going to be accepted (which is fair enough):
http://redmine.ruby-lang.org/issues/show/1528

an alternative, lightweight solution is to write your own extension, as below:

--- locale.c -------------------------------------------

#include <locale.h>
#include <ruby.h>

#ifndef RSTRING_PTR
#define RSTRING_PTR(str) RSTRING(str)->ptr
#endif

VALUE Locale = Qnil;

VALUE method_setlocale(VALUE self, VALUE category, VALUE locale);

void Init_locale() {
    Locale = rb_define_module("Locale");
    rb_define_module_function(Locale, "setlocale", method_setlocale, 2);
    rb_define_const(Locale, "LC_CTYPE", INT2NUM(0));
    rb_define_const(Locale, "LC_NUMERIC", INT2NUM(1));
    rb_define_const(Locale, "LC_TIME", INT2NUM(2));
    rb_define_const(Locale, "LC_COLLATE", INT2NUM(3));
    rb_define_const(Locale, "LC_MONETARY", INT2NUM(4));
    rb_define_const(Locale, "LC_MESSAGES", INT2NUM(5));
    rb_define_const(Locale, "LC_ALL", INT2NUM(6));
}

VALUE method_setlocale(VALUE self, VALUE category, VALUE locale) {
    int c = NUM2INT(category);
    char *r;
    if (locale == Qnil) {
        r = setlocale(c, NULL);
    } else {
        Check_Type(locale, T_STRING);
        r = setlocale(c, RSTRING_PTR(locale));
    }
    return r == NULL ? Qnil : rb_str_new2(r);
}

--- extconf.rb -------------------------------------------

require 'mkmf'
extension_name = 'locale'
dir_config(extension_name)
create_makefile(extension_name)

… and use:
require 'locale'
Locale::setlocale Locale::LC_CTYPE, ''

This is what I do in one of my projects and it works fine with Iconv,
note I use LC_CTYPE not LC_ALL to not affect numbers or dates
formatting.

Regards,

Adam S. | nanoant.com
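For readers who cannot compile a C extension, the same setlocale() call can be reached from Ruby itself through Fiddle. The sketch below rests on several assumptions that are not from the thread: a Ruby that ships Fiddle, a Unix-like system where Fiddle.dlopen(nil) exposes the C library, and glibc's numeric value for LC_CTYPE. The helper name c_setlocale is made up for this example. Treat it as an illustration of the idea rather than a portable replacement for the extension above.

```ruby
require 'fiddle'

# Assumption: 0 is LC_CTYPE on glibc. Other C libraries use different
# numbers, which is why hard-coding category values is fragile.
LC_CTYPE_GLIBC = 0

# Handle for symbols already loaded into the running process (Unix-like systems).
libc = Fiddle.dlopen(nil)
SETLOCALE = Fiddle::Function.new(
  libc['setlocale'],
  [Fiddle::TYPE_INT, Fiddle::TYPE_VOIDP],
  Fiddle::TYPE_VOIDP
)

# Hypothetical helper: pass nil to query the current locale, or a String
# such as '' to take the locale from the environment.
# Returns nil when setlocale() itself returns NULL.
def c_setlocale(category, locale = nil)
  ptr = Fiddle::Pointer.new(SETLOCALE.call(category, locale).to_i)
  ptr.null? ? nil : ptr.to_s
end

puts c_setlocale(LC_CTYPE_GLIBC).inspect      # query only
puts c_setlocale(LC_CTYPE_GLIBC, '').inspect  # adopt the LANG / LC_* settings
```

As with the extension, only LC_CTYPE is touched here, so number and date formatting are left alone.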

Hi,

At Thu, 4 Jun 2009 00:28:25 +0900,
Adam S. wrote in [ruby-talk:338275]:

An alternative, lightweight solution is to write your own extension, as below:

First of all, the option after // is a GNU iconv local extension.

Second, extconf.rb must check for the necessary header, and
whether each category is defined.

# extconf.rb

require 'mkmf'
extension_name = 'locale'
header = "locale.h"
dir_config(extension_name)
if have_header(header)
  lc = %w[CTYPE NUMERIC TIME COLLATE MONETARY MESSAGES ALL]
  lc = lc.delete_if {|n| !have_macro("LC_#{n}", header)}.
    collect {|n| "def(#{n.downcase}, LC_#{n})"}.
    join(' ')
  $defs << "-Dforeach_categories(def)=\"#{lc}\""
  create_header
  create_makefile(extension_name)
end

Next, StringValueCStr() is much better than Check_Type().

… and use:
require 'locale'
Locale::setlocale Locale::LC_CTYPE, ''

And it feels too redundant. I guess Locale.ctype = '' would be
easy.

/* locale.c */
#include <locale.h>
#include "ruby.h"

static VALUE
rb_setlocale(int category, VALUE locale)
{
    char *r = setlocale(category, StringValueCStr(locale));
    return r ? rb_str_new2(r) : Qnil;
}

static VALUE
rb_getlocale(int category)
{
    char *r = setlocale(category, NULL);
    return r ? rb_str_new2(r) : Qnil;
}

#define funcs(n, c) \
    static VALUE rb_getlocale_##n(VALUE self) {return rb_getlocale(c);} \
    static VALUE rb_setlocale_##n(VALUE self, VALUE val) {return rb_setlocale(c, val);} \
    /* end of funcs */

foreach_categories(funcs)

void
Init_locale(void)
{
    VALUE locale = rb_define_module("Locale");
#define methods(n, c) \
    rb_define_singleton_method(locale, #n, rb_getlocale_##n, 0); \
    rb_define_singleton_method(locale, #n"=", rb_setlocale_##n, 1); \
    /* end of methods */

    foreach_categories(methods);
}

Nobuyoshi,

Thanks, your solution is really more Ruby-way. I just wonder why
“setlocale” isn’t a part of Ruby standard library. Since Ruby maps/wraps
most of the standard (POSIX) functions (especially those available on
Windows too), this one should be also taken into consideration.

First of all, the option after // is a GNU iconv local extension.

Sure, I know that, but it doesn’t mean it is EVIL, does it? Still, it is
very useful for creating permalinks and removing accented characters
simply, w/o using any third-party libraries and so on, but it is unusable
until we call POSIX setlocale, which isn’t present in the Ruby API.

Second, extconf.rb must check for the necessary header, and
whether each category is defined.

Still it should be present on every system (AFAIK it is), since quoting
man: “The setlocale() function conforms to ISO/IEC 9899:1999 (``ISO
C99’’).”. Is there anyone who checks whether <stdio.h> exists?

And it feels too redundant. I guess Locale.ctype = '' would be easy.

Sure yours is better. Mine didn’t consider fact that some of constants
may have different values on different systems.

If it were to be included in the standard library, I’d keep the
Locale::setlocale method as well, since you can combine types there and
also check the return value: nil means the call failed and a String
means it succeeded, and the documentation doesn’t explicitly say that
the returned string is exactly the one that was passed. So with a simple
Locale::ctype= we may miss some important feedback.

Cheers,
Adam.

Nobuyoshi N. wrote:

It affects library functions, such as printf(), and can cause
problems.

Of course, but FileUtils.rm_rf '/' can do harm as well, and it is still
included in Ruby. So I don’t really get why I can’t hook into setlocale
from Ruby. This function is present on every recent OS, and accessible
to every C, C++, Perl or Python programmer. Sure, I know it changes the
way printf works and so on, but Ruby is for sober developers, isn’t it?
If setlocale causes trouble for someone, my answer is “don’t use it”,
not “prohibit it”.

we call POSIX setlocale, which isn’t present in Ruby API.

You can’t rely on it if you want to write a portable script.

I don’t care. ascii//translit//IGNORE is available on Linux and Mac OS X,
and maybe many others, and that is enough for me. Kernel#fork AFAIK
doesn’t work on Windows, but would you tell me not to use it because my
script won’t be portable?

Not all systems conform to C99, and not all categories are
available on all systems. For instance, mingw and perhaps
mswin don’t have LC_MESSAGES.

Right. It’s just that checking for the existence of <locale.h> seems a bit
paranoid to me :), but that’s just my point of view.

Sure yours is better. Mine didn’t consider fact that some of constants
may have different values on different systems.

Of course, otherwise why they need the macros?

:)

You mean setters should raise an exception on error?

IMHO yes they should if setlocale returns NULL.

Anyway, it is rather a lot for someone who just wants to call:

setlocale LC_CTYPE, ''

But if we want to get a real interface for setlocale, yours is the
perfect one to have it included in official stdlib.

Cheers,
Adam.

Hi,

At Sun, 7 Jun 2009 00:24:55 +0900,
Adam S. wrote in [ruby-talk:338586]:

Thanks, your solution is really more Ruby-way. I just wonder why
“setlocale” isn’t a part of Ruby standard library. Since Ruby maps/wraps
most of the standard (POSIX) functions (especially those available on
Windows too), this one should be also taken into consideration.

It affects library functions, such as printf(), and can cause
problems.

First of all, the option after // is a GNU iconv local extension.

Sure, I know that, but it doesn’t mean it is EVIL, does it? Still, it is
very useful for creating permalinks and removing accented characters
simply, w/o using any third-party libraries and so on, but it is unusable
until we call POSIX setlocale, which isn’t present in the Ruby API.

You can’t rely on it if you want to write a portable script.

Second, extconf.rb must check for the necessary header, and
whether each category is defined.

Still it should be present on every system (AFAIK it is), since quoting
man: “The setlocale() function conforms to ISO/IEC 9899:1999 (``ISO
C99’’).”. Is there anyone who checks whether <stdio.h> exists?

Not all systems conform to C99, and not all categories are
available on all systems. For instance, mingw and perhaps
mswin don’t have LC_MESSAGES.

And it feels too redundant. I guess Locale.ctype = '' would be easy.

Sure yours is better. Mine didn’t consider fact that some of constants
may have different values on different systems.

Of course, otherwise why they need the macros?

If it were to be included in the standard library, I’d keep the
Locale::setlocale method as well, since you can combine types there and
also check the return value: nil means the call failed and a String
means it succeeded, and the documentation doesn’t explicitly say that
the returned string is exactly the one that was passed. So with a simple
Locale::ctype= we may miss some important feedback.

You mean setters should raise an exception on error?

#include <locale.h>
#include "ruby.h"

static VALUE
rb_setlocale(int category, const char *locale)
{
    char *r = setlocale(category, locale);
    if (!r) rb_raise(rb_eRuntimeError, "setlocale");
    return rb_str_new2(r);
}

static inline VALUE
locale_set(int category, VALUE locale)
{
    return rb_setlocale(category, StringValueCStr(locale));
}

static inline VALUE
locale_get(int category)
{
    return rb_setlocale(category, NULL);
}

#define funcs(n, c) \
    static VALUE rb_getlocale_##n(VALUE self) {return locale_get(c);} \
    static VALUE rb_setlocale_##n(VALUE self, VALUE val) {return locale_set(c, val);} \
    /* end of funcs */

foreach_categories(funcs)

void
Init_locale(void)
{
    VALUE locale = rb_define_module("Locale");
#define methods(n, c) \
    rb_define_singleton_method(locale, #n, rb_getlocale_##n, 0); \
    rb_define_singleton_method(locale, #n"=", rb_setlocale_##n, 1); \
    /* end of methods */

    foreach_categories(methods);
}

I also have a problem with iconv. I’m on Linux (configured with UTF-8
as usual) and under irb I get:
irb(main):016:0> Iconv.conv("US-ASCII//TRANSLIT","UTF-8",'éèêë')
=> "eeee"

But when I try the same in ruby or mod_ruby I get '???', for example:
$ ruby -e "require 'iconv'; puts Iconv.conv('US-ASCII//TRANSLIT','UTF-8','éèêë')"
???
I already checked with str.each_byte {|x| puts x} and the strings are
exactly the same. Does anyone have any idea why I get two different
answers from Iconv?

My system:
$ irb --version
irb 0.9.5(05/04/13)
$ ruby --version
ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]

I have ENV['LANG']==en_US.UTF-8 in both cases.