Forum: Ruby Open-uri with non-ascii character

Posted by Soichi Ishida (soichi)
on 2013-01-06 04:02
ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-darwin10.8.0]

I want to parse a page like

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=日
The URL contains a non-ASCII character in its query string.  In this
particular case, it's a Chinese character.

If I try to open this page like

doc = Nokogiri::HTML(open(query)).read

it gives an error

/Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/uri/common.rb:176:in `split': bad URI(is not URI?): http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=日 (URI::InvalidURIError)
  from /Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/uri/common.rb:211:in `parse'
  from /Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/uri/common.rb:747:in `parse'
  from /Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/open-uri.rb:32:in `open'
  from split_words_and_search_using_api.rb:23:in `<main>'
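
The trace shows the failure comes from URI parsing, before any request is sent. A minimal sketch that reproduces this without Nokogiri or the network, assuming only the standard `uri` library (the constant `RAW_URL` and the helper `parse_failure` are hypothetical names used for illustration):

```ruby
require 'uri'

# U+65E5 is the character from the example URL, written as an escape
# so that this source file stays pure ASCII.
RAW_URL = "http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=\u65E5"

# Hypothetical helper: returns the exception raised by URI.parse,
# or nil when the string parses cleanly.
def parse_failure(url)
  URI.parse(url)
  nil
rescue URI::InvalidURIError => e
  e
end

puts RAW_URL.ascii_only?             # false
puts parse_failure(RAW_URL).class    # URI::InvalidURIError
```

The wording of the exception message differs between Ruby versions, so the sketch only checks the exception class.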

Somehow, I need to convert the character (UTF-8) into a form that is
valid in a URL.

Could anybody suggest how to do that?

soichi
Posted by Carlo E. Prelz (Guest)
on 2013-01-06 08:17
(Received via mailing list)
Subject: Open-uri with non-ascii character
  Date: Sun 06 Jan 13 12:03:01PM +0900

Quoting Soichi Ishida (lists@ruby-forum.com):

> I want to parse a page like
>
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=日
> the url contains non-ascii character as a query.  In this particular
> case, it's Chinese.
>
> If I try to open this page like
>
> doc = Nokogiri::HTML(open(query)).read

Try this (query must contain the correct UTF-8):

require 'webrick/httputils'

..
..

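# Treat the UTF-8 string as raw bytes so that each byte gets percent-encoded.
query.force_encoding('binary')
query = WEBrick::HTTPUtils.escape(query)
# Read the response body first, then hand it to Nokogiri.
doc = Nokogiri::HTML(open(query).read)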
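
For comparison, a sketch of a different route that encodes only the query value instead of escaping the whole URL. It swaps in `URI.encode_www_form` from the standard `uri` library rather than WEBrick; the constant `UNIHAN_ENDPOINT` and the helper `unihan_url` are hypothetical names, not part of any library:

```ruby
require 'uri'

# Base address taken from the URL in the original post.
UNIHAN_ENDPOINT = 'http://www.unicode.org/cgi-bin/GetUnihanData.pl'

# Hypothetical helper: builds the lookup URL with the character
# percent-encoded as UTF-8 bytes in the query string.
def unihan_url(character)
  "#{UNIHAN_ENDPOINT}?#{URI.encode_www_form('codepoint' => character)}"
end

url = unihan_url("\u65E5")   # U+65E5, the character from the original post
puts url
# http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E6%97%A5

# The encoded string is plain ASCII, so URI.parse accepts it.
puts URI.parse(url).query    # codepoint=%E6%97%A5
```

Note that `URI.encode_www_form` uses form encoding, so a space in the value would become `+`; that makes no difference for a single character like this one.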

Carlo
Posted by Soichi Ishida (soichi)
on 2013-01-06 10:22
Thanks. Now I can open the site, so I will be able to parse it.

soichi
Posted by tamouse mailing lists (Guest)
on 2013-01-06 12:07
(Received via mailing list)
On Sat, Jan 5, 2013 at 9:03 PM, Soichi Ishida <lists@ruby-forum.com> 
wrote:
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=儿
> the url contains non-ascii character as a query.  In this particular
> case, it's Chinese.

Interestingly enough, if you look at what gets sent through as the
link above, it's:

http://www.unicode.org/cgi-bin/GetUnihanData.pl?co...

which is also what you'd obtain via:

URI.escape("http://www.unicode.org/cgi-bin/GetUnihanData.pl?co...)
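
`URI.escape` was already flagged as obsolete in Ruby 1.9 and was removed in Ruby 3.0, so the sketch below does not call it. It only illustrates, under that assumption, what percent-encoding the non-ASCII part of a string produces: each UTF-8 byte becomes `%XX`, and ASCII characters such as `:`, `/`, `?` and `=` are left alone. The helper name `percent_encode_non_ascii` is hypothetical:

```ruby
# Hypothetical helper, for illustration only: replaces every non-ASCII
# character with the percent-encoded form of its UTF-8 bytes.
def percent_encode_non_ascii(text)
  text.gsub(/[^[:ascii:]]/) do |char|
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

# U+65E5 is the character from the original post; its UTF-8 bytes
# are E6 97 A5.
puts percent_encode_non_ascii("codepoint=\u65E5")
# codepoint=%E6%97%A5
```

Unlike a form encoder, this leaves reserved ASCII characters untouched, so it is only suitable when the rest of the URL is already well formed.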