Open-uri with non-ascii character

ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-darwin10.8.0]

I want to parse a page like

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=日

The URL contains a non-ASCII character in its query string. In this
particular case, it’s Chinese.

If I try to open this page like

doc = Nokogiri::HTML(open(query)).read

it gives an error

/Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/uri/common.rb:176:in `split': bad URI(is not URI?): http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=日 (URI::InvalidURIError)
    from /Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/uri/common.rb:211:in `parse'
    from /Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/uri/common.rb:747:in `parse'
    from /Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/open-uri.rb:32:in `open'
    from split_words_and_search_using_api.rb:23:in `'

Somehow I need to convert the UTF-8 character into a form that is
valid in a URL.

Could anybody suggest how to do that?

soichi

Subject: Open-uri with non-ascii character
Date: Sun 06 Jan 13 12:03:01PM +0900

Quoting Soichi I.:

I want to parse a page like

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=???
the url contains non-ascii character as a query. In this particular
case, it’s Chinese.

If I try to open this page like

doc = Nokogiri::HTML(open(query)).read

Try this (query must contain the correct UTF-8):

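require 'open-uri'
require 'nokogiri'
require 'webrick/httputils'

# Treat the string as raw bytes so each byte of the UTF-8 character is escaped
query.force_encoding('binary')
query = WEBrick::HTTPUtils.escape(query)
# Read the response body first, then hand it to Nokogiri
doc = Nokogiri::HTML(open(query).read)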
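
As an alternative to escaping the whole URL, here is a minimal sketch that percent-encodes only the query value, using URI.encode_www_form_component from the standard library. The names BASE_URL and unihan_url are hypothetical and not part of the original script; the sketch also assumes the character is passed in as correct UTF-8.

```ruby
require 'uri'

# Hypothetical constant and helper, introduced only for illustration.
BASE_URL = 'http://www.unicode.org/cgi-bin/GetUnihanData.pl'

# Build the lookup URL, percent-encoding only the query value.
def unihan_url(char)
  "#{BASE_URL}?codepoint=#{URI.encode_www_form_component(char)}"
end

url = unihan_url('日')
puts url
# => http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E6%97%A5

# The escaped URL parses without raising URI::InvalidURIError.
puts URI.parse(url).query
```

The resulting string can then be passed to open in place of the raw query. Encoding only the value leaves the rest of the URL untouched, which avoids escaping characters that are already valid.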

Carlo

On Sat, Jan 5, 2013 at 9:03 PM, Soichi I. wrote:

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=儿
the url contains non-ascii character as a query. In this particular
case, it’s Chinese.

Interestingly enough, if you look at what gets sent through as the
link above, it’s:

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E5%84%BF

which is also what you’d obtain via:

URI.escape("http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=儿")
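
To illustrate what the escaping does to the non-ASCII part of this URL: each byte of the character's UTF-8 encoding is replaced by %XX, where XX is the byte in hexadecimal. The sketch below is only an illustration; escape_non_ascii is a hypothetical name, and unlike URI.escape it leaves every ASCII character alone, including ones that may be unsafe in a URL.

```ruby
# Hypothetical helper: percent-encode every non-ASCII byte of a string
# and leave ASCII bytes unchanged.
def escape_non_ascii(url)
  url.each_byte.map { |b| b < 0x80 ? b.chr : format('%%%02X', b) }.join
end

puts escape_non_ascii('http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=儿')
# => http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E5%84%BF
```

The character 儿 is U+513F, whose UTF-8 encoding is the three bytes E5 84 BF, hence the three escaped bytes in the result.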

Thanks. Now I can open the site.
I will be able to parse it then.

soichi