ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-darwin10.8.0]

I want to parse a page like

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=日

The URL contains a non-ASCII character in its query string. In this particular case, it's Chinese.

If I try to open this page like

doc = Nokogiri::HTML(open(query)).read

it gives an error:

/Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/uri/common.rb:176:in `split': bad URI(is not URI?): http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=日 (URI::InvalidURIError)
        from /Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/uri/common.rb:211:in `parse'
        from /Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/uri/common.rb:747:in `parse'
        from /Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/open-uri.rb:32:in `open'
        from split_words_and_search_using_api.rb:23:in `<main>'

Somehow, I need to convert the character (UTF-8) into a form that is valid in a URL. Could anybody suggest how to do that?

soichi
on 2013-01-06 04:02
on 2013-01-06 08:17
Subject: Open-uri with non-ascii character
Date: Sun 06 Jan 13 12:03:01PM +0900

Quoting Soichi Ishida (lists@ruby-forum.com):

> I want to parse a page like
>
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=???
> the url contains non-ascii character as a query. In this particular
> case, it's Chinese.
>
> If I try to open this page like
>
> doc = Nokogiri::HTML(open(query)).read

Try this (query must contain the correct UTF-8):

require 'webrick/httputils'
..
..
# treat the string as raw bytes so every non-ASCII byte gets escaped
query.force_encoding('binary')
query=WEBrick::HTTPUtils.escape(query)
# read the response body first, then hand it to Nokogiri
doc=Nokogiri::HTML(open(query).read)

Carlo
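To make the escaping step above more concrete, here is a rough sketch of what it amounts to for this URL: each byte of the UTF-8 character is replaced by %XX. The helper name escape_non_ascii is hypothetical and is not part of WEBrick; it only handles non-ASCII bytes and leaves everything else (including spaces) untouched, so it is an illustration rather than a replacement for a real escaping routine.

```ruby
# Hypothetical helper (not a WEBrick API): percent-encode every byte
# outside the ASCII range and keep the ASCII parts of the URL
# (scheme, host, '?', '=') exactly as they are.
def escape_non_ascii(url)
  url.bytes.map { |b| b < 128 ? b.chr : format('%%%02X', b) }.join
end

# "\u65E5" is the character 日 from the original question.
query = "http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=\u65E5"
puts escape_non_ascii(query)
# prints http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E6%97%A5
```

The three bytes E6 97 A5 are the UTF-8 encoding of U+65E5, which is why the query string must hold correct UTF-8 before it is escaped.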
on 2013-01-06 12:07
On Sat, Jan 5, 2013 at 9:03 PM, Soichi Ishida <lists@ruby-forum.com> wrote:

> http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=儿
> the url contains non-ascii character as a query. In this particular
> case, it's Chinese.

Interestingly enough, if you look at what gets sent through as the link above, it's:

http://www.unicode.org/cgi-bin/GetUnihanData.pl?co...

which is also what you'd obtain via:

URI.escape("http://www.unicode.org/cgi-bin/GetUnihanData.pl?co...)
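For readers on newer Rubies: URI.escape was long marked obsolete and was removed in Ruby 3.0, so a possible alternative is to escape only the parameter value with URI.encode_www_form_component and build the URL around it. The sketch below assumes the `codepoint` parameter from the question; the method name unihan_url is made up for illustration. It does not touch the network, it only shows that the resulting string parses as a URI.

```ruby
require 'uri'

# Hypothetical helper: build the lookup URL for a single character by
# percent-encoding just the value of the codepoint parameter.
def unihan_url(char)
  base = 'http://www.unicode.org/cgi-bin/GetUnihanData.pl'
  "#{base}?codepoint=#{URI.encode_www_form_component(char)}"
end

url = unihan_url("\u65E5")  # "\u65E5" is 日
uri = URI.parse(url)         # parses without URI::InvalidURIError
puts uri.query               # prints codepoint=%E6%97%A5
```

The escaped string can then be passed to open (or URI.open on current Ruby versions) in place of the raw URL.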