How are people making use of Iconv?

Wilson B. (Guest)
on 2005-12-21 07:17
(Received via mailing list)
Since Iconv jumped out of the pond and chewed on my leg the other
week, I've been toying with the idea of a character-set conversion
library implemented totally in Ruby, with identical behavior on every
platform.
However, I'm only using Iconv for simple things, like converting my
music tags from Shift-JIS to UTF-8.

What 'serious' things are people using this for? Are there any unit
tests?  Any gems on RubyForge I can download containing projects that
make use of Iconv? What do you hate about Iconv?

Thanks,
--Wilson.
Andreas S. (Guest)
on 2005-12-21 12:56
Wilson B. wrote:
> Since Iconv jumped out of the pond and chewed on my leg the other
> week, I've been toying with the idea of a character-set conversion
> library implemented totally in Ruby, with identical behavior on every
> platform.
> However, I'm only using Iconv for simple things, like converting my
> music tags from Shift-JIS to UTF-8.

Well, that's all that Iconv is supposed to be used for.

> What 'serious' things are people using this for? Are there any unit
> tests?  Any gems on RubyForge I can download containing projects that
> make use of Iconv?

Rails uses Iconv, at least in ActionMailer.

> What do you hate about Iconv?

I dislike that Iconv raises an exception when it finds characters it can
not convert. I would prefer if it could be made to ignore invalid
characters and just try to make the best of the text.
Paul D. (Guest)
on 2005-12-21 16:55
(Received via mailing list)
* Andreas S. (removed_email_address@domain.invalid) wrote:
[snipped]
> I dislike that Iconv raises an exception when it finds characters it can
> not convert. I would prefer if it could be made to ignore invalid
> characters and just try to make the best of the text.

Seconded, Thirded, and Quadrupled.

Iconv needs an "as close as I could get with transliteration and ignoring
invalid characters" mode.

We're doing something comparable in Raggle by trapping the exception and
stripping out the invalid character.  Obviously this doesn't work
properly for multibyte characters, and you won't be able to use a lookup
table for arbitrary source encodings, but it's a start.

    begin
      # convert element_text to native charset (note: in this case we're
      # converting from utf-8 to the native charset, but the only thing
      # about the code that's utf-8 specific is the assumption about
      # character width and the unicode lookup table below)
      ret = $iconv.iconv(element_text) << $iconv.iconv(nil)
    rescue Iconv::IllegalSequence => e
      # save the portion of the string that was successful, the
      # invalid character, and the remaining (pending) string
      success_str = e.success
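      ch, pending_str = e.failed.split(//, 2)
      # look up by Unicode code point; ch.to_i would parse ch as a decimal
      # number and yield 0 for any non-digit (this assumes ch is a complete
      # UTF-8 character, i.e. $KCODE is set to 'u')
      ch_int = ch.unpack('U').first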

      # see if we have a map for that character
      if String::UNICODE_LUT.has_key?(ch_int)
        # we have a mapping for this character, so convert it and
        # re-process the string

        # log status
        err_str = _('converting unicode')
        $log.warn(meth) { "#{err_str} ##{ch_int}" }

        # create new string, with the bad character mapped
        element_text = success_str + String::UNICODE_LUT[ch_int] + pending_str
      else
        if $config['iconv_munge_illegal']
          # munge the illegal character with a safe string

          # log status
          err_str = _('munging unicode')
          $log.warn(meth) { "#{err_str} ##{ch_int}" }

          # create new string, with the bad character munged
          munge_str = $config['unicode_munge_str']
          element_text = success_str + munge_str + pending_str
        else
          # just drop the character altogether

          # log status
          err_str = _('dropping unicode')
          $log.warn(meth) { "#{err_str} ##{ch_int}" }

          # create new string, sans the bad character
          element_text = success_str + pending_str
        end
      end
      retry
    end

Not a perfect solution, but it helps a bit.
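
For comparison, here is a minimal sketch of the same lookup / munge / drop logic on Ruby 1.9 or later, where String#encode accepts a fallback: callable that is invoked for each character the target charset lacks. Because the callable receives whole characters, the multibyte problem mentioned above does not arise. UNICODE_LUT below is a made-up three-entry table and MUNGE_STR stands in for the config value; neither is Raggle's actual data, and to_native is a hypothetical name.

```ruby
# Hypothetical lookup table keyed by character (not Raggle's real table).
UNICODE_LUT = {
  "\u2018" => "'",    # left curly single quote
  "\u2019" => "'",    # right curly single quote
  "\u2026" => "..."   # horizontal ellipsis
}.freeze

# Assumed stand-in for $config['unicode_munge_str']; set to nil to drop
# unmapped characters instead of munging them.
MUNGE_STR = '?'

# Convert UTF-8 text to the given charset. Characters missing from the
# target are looked up in UNICODE_LUT first, then munged or dropped.
def to_native(text, charset = 'US-ASCII')
  text.encode(charset, 'UTF-8',
              fallback: ->(ch) { UNICODE_LUT.fetch(ch) { MUNGE_STR || '' } })
end

puts to_native("It\u2019s\u2026 caf\u00e9")   # prints: It's... caf?
```

Invalid byte sequences in the input still raise here; only characters with no equivalent in the target charset go through the fallback.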
Wilson B. (Guest)
on 2005-12-21 17:56
(Received via mailing list)
On 12/21/05, Paul D. <removed_email_address@domain.invalid> wrote:
>
> We're doing something comparable in Raggle by trapping the exception and
> stripping out the invalid character.  Obviously this doesn't work
> properly for multibyte characters, and you won't be able to use a lookup
> table for arbitrary source encodings, but it's a start.
>
<snip interesting code>

What if String just had a couple of new methods on it:
String#transcode(from_encoding, to_encoding)
..and
String#transcode!(from_encoding, to_encoding)
..and the "modifies receiver" version returned true or false,
depending on whether it managed to convert every character?
Then you could do:
unless some_string.transcode!('Shift-JIS', 'UTF-8')
  puts "Some characters got mangle-fied!"
end

Is that a mess? I kinda like it, at first glance.
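
A rough sketch of that idea, assuming Ruby 1.9 or later and building on String#encode rather than Iconv. The transcode helper below is hypothetical (it is not a core String method), and instead of modifying the receiver it returns the converted string together with the true/false flag.

```ruby
# Hypothetical helper, not a core String method. Returns [converted, clean],
# where clean is false if any character had to be replaced.
def transcode(str, from, to, replacement = '?')
  [str.encode(to, from), true]
rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
  # Strict conversion failed: redo it leniently and report the loss.
  lossy = str.encode(to, from, invalid: :replace, undef: :replace,
                               replace: replacement)
  [lossy, false]
end

# A snowman (U+2603) has no ISO-8859-1 equivalent, so clean comes back false.
converted, clean = transcode("caf\u00e9 \u2603", 'UTF-8', 'ISO-8859-1')
puts "Some characters got mangle-fied!" unless clean
```

Returning a pair avoids monkey-patching String; a String#transcode! wrapper could be layered on top with replace() if the in-place form is preferred.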
Paul D. (Guest)
on 2005-12-21 19:35
(Received via mailing list)
* Wilson B. (removed_email_address@domain.invalid) wrote:
> [snipped]
> What if String just had a couple of new methods on it:
> String#transcode(from_encoding, to_encoding)
> ..and
> String#transcode!(from_encoding, to_encoding)
> ..and the "modifies receiver" version returned true or false,
> depending on whether it managed to convert every character?
> Then you could do:
> unless some_string.transcode!('Shift-JIS', 'UTF-8')
>   puts "Some characters got mangle-fied!"
> end
>
> Is that a mess? I kinda like it, at first glance.

I know a future version of Ruby (2.0?) will make a distinction between
strings as arrays of bytes and strings as sets of characters with an
encoding (with the former being an obvious superset of the latter), so
I'm not sure how well that method would work with the new way of
handling strings.

That said, I like the idea, although I'd like an optional block to
handle unknown characters.  I'd also add a hash as an optional third
argument which allows you to toggle transliteration, munging, and
exception behavior.
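
One possible shape for such an API, sketched on top of String#encode (Ruby 1.9 or later). The method name flexible_transcode and the option keys :on_unknown and :munge are invented for illustration. Transliteration is left out because, as far as I know, String#encode has no general counterpart to iconv's //TRANSLIT.

```ruby
# Hypothetical API sketch; the name and option keys are assumptions.
#   :on_unknown => :raise (default), :munge, or :drop
#   :munge      => replacement text used when :on_unknown is :munge
# A block, if given, is called with each character the target charset
# lacks and must return the text to use in its place.
def flexible_transcode(str, from, to, options = {}, &handler)
  enc_opts =
    if handler
      { fallback: handler }
    else
      case options.fetch(:on_unknown, :raise)
      when :munge
        { invalid: :replace, undef: :replace,
          replace: options.fetch(:munge, '?') }
      when :drop
        { invalid: :replace, undef: :replace, replace: '' }
      else
        {}   # strict: let String#encode raise
      end
    end
  str.encode(to, from, **enc_opts)
end

src = "caf\u00e9 \u2603"
puts flexible_transcode(src, 'UTF-8', 'US-ASCII', on_unknown: :munge)
puts flexible_transcode(src, 'UTF-8', 'US-ASCII') { |ch| "[U+%04X]" % ch.ord }
```

With the block form, unknown characters are handed to the caller one at a time, so the caller decides between a lookup table, a placeholder, or raising.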
Christian N. (Guest)
on 2005-12-21 21:27
(Received via mailing list)
Paul D. <removed_email_address@domain.invalid> writes:

> * Andreas S. (removed_email_address@domain.invalid) wrote:
> [snipped]
>> I dislike that Iconv raises an exception when it finds characters it can
>> not convert. I would prefer if it could be made to ignore invalid
>> characters and just try to make the best of the text.
>
> Seconded, Thirded, and Quadrupled.
>
> Iconv needs an "as close as I could get with transliteration and ignoring
> invalid characters" mode.

Can't you just use //IGNORE?
Paul D. (Guest)
on 2005-12-21 21:33
(Received via mailing list)
* Christian N. (removed_email_address@domain.invalid) wrote:
> > Iconv needs an "as close as I could get with transliteration and ignoring
> > invalid characters" mode.
>
> Can't you just use //IGNORE?

I wasn't aware of "//IGNORE".  I'll check it out.  Thanks!
Paul D. (Guest)
on 2005-12-21 21:36
(Received via mailing list)
* Christian N. (removed_email_address@domain.invalid) wrote:
> > Iconv needs an "as close as I could get with transliteration and ignoring
> > invalid characters" mode.
>
> Can't you just use //IGNORE?

You, sir, are a genius.  That works great here.
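
For anyone finding this later, a minimal sketch of the //IGNORE approach: the suffix is appended to the target encoding name passed to Iconv. The helper name lossy_convert is made up, and the String#encode branch is an assumed fallback for Ruby versions where the iconv extension is not installed (it ships with 1.8 and 1.9 and is a separate gem from 2.0 on).

```ruby
# Hypothetical helper: convert text and silently skip characters that
# cannot be represented in the target encoding.
def lossy_convert(text, from, to)
  require 'iconv'
  # "//IGNORE" on the target name asks iconv to discard unconvertible input.
  Iconv.conv("#{to}//IGNORE", from, text)
rescue LoadError
  # Assumed fallback when iconv is unavailable: drop bad characters
  # via String#encode instead.
  text.encode(to, from, invalid: :replace, undef: :replace, replace: '')
end
```

Calling lossy_convert(str, 'UTF-8', 'ISO-8859-1') on text containing characters outside Latin-1 should return the string without them, but //IGNORE handling varies between iconv implementations, and some still report an error after skipping, so test it on each platform you care about.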