How are people making use of Iconv?

Wilson_B · December 21, 2005, 6:17am

Since Iconv jumped out of the pond and chewed on my leg the other
week, I’ve been toying with the idea of a character-set conversion
library implemented totally in Ruby, with identical behavior on every
platform.
However, I’m only using Iconv for simple things, like converting my
music tags from Shift-JIS to UTF-8.

What ‘serious’ things are people using this for? Are there any unit
tests? Any gems on RubyForge I can download containing projects that
make use of Iconv? What do you hate about Iconv?

Thanks,
–Wilson.

Wilson_B · December 21, 2005, 11:56am

Wilson B. wrote:

Since Iconv jumped out of the pond and chewed on my leg the other
week, I’ve been toying with the idea of a character-set conversion
library implemented totally in Ruby, with identical behavior on every
platform.
However, I’m only using Iconv for simple things, like converting my
music tags from Shift-JIS to UTF-8.

Well, that’s all that Iconv is supposed to be used for.

What ‘serious’ things are people using this for? Are there any unit
tests? Any gems on RubyForge I can download containing projects that
make use of Iconv?

Rails uses Iconv, at least in ActionMailer.

What do you hate about Iconv?

I dislike that Iconv raises an exception when it finds characters it
cannot convert. I would prefer if it could be made to ignore invalid
characters and just try to make the best of the text.

Wilson_B · December 21, 2005, 3:55pm

Andreas S. wrote:
[snipped]

I dislike that Iconv raises an exception when it finds characters it
cannot convert. I would prefer if it could be made to ignore invalid
characters and just try to make the best of the text.

Seconded, Thirded, and Quadrupled.

Iconv needs an “as close as I could get with transliteration and ignoring
invalid characters” mode.

We’re doing something comparable in Raggle by trapping the exception and
stripping out the invalid character. Obviously this doesn’t work
properly for multibyte characters, and you won’t be able to use a lookup
table for arbitrary source encodings, but it’s a start.

begin
  # convert element_text to native charset (note: in this case we're
  # converting from utf-8 to the native charset, but the only thing
  # about the code that's utf-8 specific is the assumption about
  # character width and the unicode lookup table below)
  ret = $iconv.iconv(element_text) << $iconv.iconv(nil)
rescue Iconv::IllegalSequence => e
  # save the portion of the string that was successful, the
  # invalid character, and the remaining (pending) string
  success_str = e.success
  ch, pending_str = e.failed.split(//, 2)
  ch_int = ch.to_i

  # see if we have a map for that character
  if String::UNICODE_LUT.has_key?(ch_int)
    # we have a mapping for this character, so convert it and
    # re-process the string

    # log status
    err_str = _('converting unicode')
    $log.warn(meth) { "#{err_str} ##{ch_int}" }

    # create new string, with the bad character mapped
    element_text = success_str + String::UNICODE_LUT[ch_int] + pending_str
  else
    if $config['iconv_munge_illegal']
      # munge the illegal character with a safe string

      # log status
      err_str = _('munging unicode')
      $log.warn(meth) { "#{err_str} ##{ch_int}" }

      # create new string, with the bad character munged
      munge_str = $config['unicode_munge_str']
      element_text = success_str + munge_str + pending_str
    else
      # just drop the character altogether

      # log status
      err_str = _('dropping unicode')
      $log.warn(meth) { "#{err_str} ##{ch_int}" }

      # create new string, sans the bad character
      element_text = success_str + pending_str
    end
  end
  retry
end

Not a perfect solution, but it helps a bit.

Wilson_B · December 21, 2005, 4:56pm

On 12/21/05, Paul D. wrote:

We’re doing something comparable in Raggle by trapping the exception and
stripping out the invalid character. Obviously this doesn’t work
properly for multibyte characters, and you won’t be able to use a lookup
table for arbitrary source encodings, but it’s a start.

What if String just had a couple of new methods on it:
String#transcode(from_encoding, to_encoding)
…and
String#transcode!(from_encoding, to_encoding)
…and the “modifies receiver” version returned true or false,
depending on whether it managed to convert every character?
Then you could do:
unless some_string.transcode!('Shift-JIS', 'UTF-8')
  puts "Some characters got mangle-fied!"
end

Is that a mess? I kinda like it, at first glance.
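
For illustration, the proposed interface could be sketched on top of String#encode, which is built into Ruby 1.9 and later. This is only a sketch under assumptions: String#transcode and String#transcode! are the hypothetical methods suggested above, not part of Ruby or Iconv; '?' is an arbitrary replacement character; and Ruby's own converter spells the encoding name Shift_JIS rather than Shift-JIS.

```ruby
# Hypothetical String#transcode / String#transcode!, sketched with the
# built-in String#encode rather than Iconv. Neither method exists in Ruby.
class String
  # Returns a converted copy. Bytes that are invalid in from_encoding, and
  # characters with no equivalent in to_encoding, become '?'.
  def transcode(from_encoding, to_encoding)
    dup.force_encoding(from_encoding)
       .encode(to_encoding, invalid: :replace, undef: :replace, replace: '?')
  end

  # Converts the receiver in place. Returns true if every character was
  # converted, false if anything had to be replaced.
  def transcode!(from_encoding, to_encoding)
    source = dup.force_encoding(from_encoding)
    clean = true
    converted =
      begin
        source.encode(to_encoding) # strict attempt: raises on any problem
      rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
        clean = false
        source.transcode(from_encoding, to_encoding) # lenient second pass
      end
    replace(converted)
    clean
  end
end

good = "\x93\xFA\x96\x7B".dup # "日本" in Shift_JIS
bad  = "\x93\xFA\xFF".dup     # 0xFF is not a valid Shift_JIS byte

puts good.transcode!('Shift_JIS', 'UTF-8')
puts "Some characters got mangle-fied!" unless bad.transcode!('Shift_JIS', 'UTF-8')
```

The strict-then-lenient double pass is just the simplest way to learn whether anything was lost; a real implementation would probably track that in a single pass.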

Wilson_B · December 21, 2005, 6:35pm

Wilson B. wrote:

What if String just had a couple of new methods on it:
String#transcode(from_encoding, to_encoding)
…and
String#transcode!(from_encoding, to_encoding)
…and the “modifies receiver” version returned true or false,
depending on whether it managed to convert every character?
Then you could do:
unless some_string.transcode!('Shift-JIS', 'UTF-8')
  puts "Some characters got mangle-fied!"
end

Is that a mess? I kinda like it, at first glance.

I know a future version of Ruby (2.0?) will make a distinction between
strings as arrays of bytes and strings as sets of characters with an
encoding (with the former being an obvious superset of the latter), so
I’m not sure how well that method would work with the new way of
handling strings.

That said, I like the idea, although I’d like an optional block to
handle unknown characters. I’d also add a hash as an optional third
argument which allows you to toggle transliteration, munging, and
exception behavior.
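
One way to picture the optional block and options hash, sketched with String#encode (built into Ruby 1.9 and later) instead of Iconv. Everything named here is an assumption made for illustration: transcode_with_options, the :unknown and :munge_with keys, and the tiny transliteration table are all hypothetical. Note that in this sketch the block only sees characters that are valid but have no equivalent in the target charset; invalid byte sequences still raise when a block is given.

```ruby
# Hypothetical helper: the method name and the :unknown / :munge_with option
# keys are made up for this sketch.
def transcode_with_options(str, from_encoding, to_encoding, options = {}, &handler)
  source = str.dup.force_encoding(from_encoding)

  encode_options =
    if handler
      # The block receives each character that has no equivalent in the
      # target charset and returns the text to use in its place.
      { fallback: handler }
    else
      case options.fetch(:unknown, :raise)
      when :raise
        {} # let String#encode raise on the first problem
      when :munge
        { invalid: :replace, undef: :replace,
          replace: options.fetch(:munge_with, '?') }
      when :drop
        { invalid: :replace, undef: :replace, replace: '' }
      else
        raise ArgumentError, "unsupported :unknown mode: #{options[:unknown].inspect}"
      end
    end

  source.encode(to_encoding, **encode_options)
end

# A tiny transliteration table, standing in for a real lookup table.
translit = { "\u00E9" => 'e', "\u2019" => "'" }

text = "caf\u00E9"
puts transcode_with_options(text, 'UTF-8', 'US-ASCII', unknown: :munge)
puts transcode_with_options(text, 'UTF-8', 'US-ASCII', unknown: :drop)
puts transcode_with_options(text, 'UTF-8', 'US-ASCII') { |ch| translit.fetch(ch, '?') }
```

Letting the block win over the hash keeps the common case short: pass a block for transliteration, or a single option for munge-or-drop behavior.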

Wilson_B · December 21, 2005, 8:27pm

Paul D. writes:

Andreas S. wrote:
[snipped]

I dislike that Iconv raises an exception when it finds characters it
cannot convert. I would prefer if it could be made to ignore invalid
characters and just try to make the best of the text.

Seconded, Thirded, and Quadrupled.

Iconv needs an “as close as I could get with transliteration and ignoring
invalid characters” mode.

Can’t you just use //IGNORE?

Wilson_B · December 21, 2005, 8:36pm

Christian N. wrote:

Iconv needs an “as close as I could get with transliteration and ignoring
invalid characters” mode.

Can’t you just use //IGNORE?

You sir, are a genius. That works great here.

Wilson_B · December 21, 2005, 8:33pm

Christian N. wrote:

Iconv needs an “as close as I could get with transliteration and ignoring
invalid characters” mode.

Can’t you just use //IGNORE?

I wasn’t aware of “//IGNORE”. I’ll check it out. Thanks!