Forum: Ruby open-uri and utf8

Kjell Olsen (Guest)
on 2007-02-27 03:50
(Received via mailing list)
I'm trying to make requests to Google Translate [http://google.com/translate_t]
to translate words through a little Ruby script. I can't get open-uri to work
on URLs with accented characters (åéîòü etc.). Just calling open() gives:

URI::InvalidURIError: bad URI(is not URI?): http://google.com/translate_t?langpair=en|fr&text=élire
         from /usr/local/lib/ruby/1.8/uri/common.rb:432:in `split'
         from /usr/local/lib/ruby/1.8/uri/common.rb:481:in `parse'
         from /usr/local/lib/ruby/1.8/open-uri.rb:29:in `open'
         from (irb):2
         from :0

URI.encode()ing the URL breaks the characters down into gunk (é => %C3%A9),
which the translator doesn't understand.

I've tried fooling with $KCODE and switching to net/http, neither to
any avail. Anyone have a hand to give me? I'll paste the code in full
below in case anyone wants to see it.

-kjell

--- translate.rb ------------

#!/usr/local/bin/ruby
%w[rubygems open-uri hpricot active_support readline].each {|lib| require lib}
$KCODE = 'u'

# can't handle â/é/utf8 chars because open-uri won't let us have
# special chars in a query, nor will net/http
class GoogleTranslator
   attr_reader :doc
   @@langs = 'fr|en'

   def initialize(text, langs=@@langs)
     @text, @langs = text.chomp, langs
     @doc = Hpricot(open(URI.encode("http://google.com/translate_t?langpair=#{@langs}&text=#{@text}")))
   end

   def result
     @result ||= @doc.search('#result_box').inner_html
     write_history_line
     @result
   end

   # keep a file with all the translations, just for kicks.
   def write_history_line
     `echo "#{"[#{@langs}]\t#{@text}\t\t->\t#{@result}"}" | cat >> '/Users/kjell/mess/2007/03/translation-history.txt'`
   end

   class << self
     include Readline

     def prompt; "[#{@@langs}] > "; end
     def interact!
       while text = readline(prompt, true)
         # either switch languages or spit out a translation
         (text =~ /lang: (.*)/) ? @@langs = $1 : puts("#{GoogleTranslator.new(text).result}")
       end
     end
   end
end

# if called with an argument, translate that argument; else set up for interaction
ARGV[0] ? puts(GoogleTranslator.new(*ARGV[0..1]).result) : GoogleTranslator.interact!
Brian C. (Guest)
on 2007-02-27 10:41
(Received via mailing list)
On Tue, Feb 27, 2007 at 10:50:11AM +0900, Kjell Olsen wrote:
> I'm trying to make requests to Google Translate [http://google.com/translate_t]
> to translate words through a little Ruby script. I can't get open-uri to work
> on URLs with accented characters (åéîòü etc.).

URLs cannot contain accented characters directly; they must be escaped into
%xx hex form. See RFC 2396:

   Data must be escaped if it does not have a representation using an
   unreserved character; this includes data that does not correspond to
   a printable character of the US-ASCII coded character set, or that
   corresponds to any US-ASCII character that is disallowed, as
   explained below.
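
To make the escaping rule concrete, here is a minimal sketch. It assumes a
modern Ruby, where URI.encode (used in the script above) was removed in
Ruby 3.0, so URI.encode_www_form_component stands in for it.
escape_query_value is a hypothetical helper name, not part of any library.

```ruby
require 'uri'

# Percent-encode a single query-string value, one byte at a time.
# Assumption: URI.encode_www_form_component replaces the removed URI.encode.
def escape_query_value(text)
  URI.encode_www_form_component(text)
end

word = "\u00E9lire"            # "élire" as a UTF-8 string
puts escape_query_value(word)  # the two UTF-8 bytes of é become %C3%A9
```

Note that this method targets form-style query strings, so a space comes
out as "+" rather than "%20".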

> URI.encode()ing the url breaks the characters down into gunk (é => %C3%A9)
> which the translator doesn't understand.

Looks like you and Google disagree about which character set to use. %C3%A9
looks like a two-byte UTF-8 character to me. Perhaps Google expects
ISO-8859-1; have a look at their API documentation, if they provide any. You
should be able to use iconv to convert from one to the other.

In ISO-8859-1, é is the single byte 0xE9, which will URI.encode to a single
%xx escape (%E9).
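
A rough sketch of that conversion, with some assumptions spelled out: it uses
String#encode from modern Ruby in place of iconv (which left the standard
library in Ruby 2.0), the host, path and parameter names are simply copied
from the original post, and whether the service really wants ISO-8859-1 is
only a guess. translate_url is a hypothetical helper name.

```ruby
require 'uri'

# Build the request URL with the text transcoded to `charset` before escaping.
# Assumptions: String#encode stands in for iconv; ISO-8859-1 is a guess at
# what the server expects; URL and parameter names come from the original post.
def translate_url(text, langpair = "en|fr", charset = "ISO-8859-1")
  bytes = text.encode(charset)  # raises if a character has no mapping in charset
  # encode_www_form escapes the raw bytes of each value, including the "|"
  query = URI.encode_www_form("langpair" => langpair, "text" => bytes)
  "http://google.com/translate_t?#{query}"
end

url = translate_url("\u00E9lire")  # "élire"
puts url
URI.parse(url)                     # parses now that the URL is plain ASCII
```

Passing "UTF-8" as the third argument gives the %C3%A9 form instead, which
makes it easy to try both against the server.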

HTH,

Brian.