Hello all, I’m using open-uri combined with hpricot to make a basic
web crawler that scrapes for different links that I need. It seems to
be working perfectly, but I have encountered the following bug when
this type of link is encountered:
irb(main):015:0> URI.parse('http://hello.com/a.php?%1')
URI::InvalidURIError: bad URI(is not URI?): http://hello.com/a.php?%1
        from c:/ruby/lib/ruby/1.8/uri/common.rb:436:in `split'
        from c:/ruby/lib/ruby/1.8/uri/common.rb:485:in `parse'
        from (irb):15
Can anyone illuminate why this is a problem? Thanks!
On Feb 26, 2008, at 2:50 PM, Steve H. wrote:
> Can anyone illuminate why this is a problem? Thanks!
Probably because %1 looks like a partially escaped character. Try:
?%251
Where %25 is an escaped %
-Rob
Rob B. http://agileconsultingllc.com
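To make the suggestion above concrete, here is a minimal sketch that escapes only the percent signs that do not begin a valid two-digit hex escape, then parses the result. The helper name `escape_stray_percents` is hypothetical (it is not part of the URI module), and the regex is an assumption about what counts as a "stray" percent sign.

```ruby
require 'uri'

# Hypothetical helper (not part of Ruby's URI library): replace every "%"
# that is NOT followed by two hex digits with "%25", the escaped form of "%".
# Well-formed escapes such as "%20" are left untouched.
def escape_stray_percents(url)
  url.gsub(/%(?![0-9A-Fa-f]{2})/, '%25')
end

fixed = escape_stray_percents('http://hello.com/a.php?%1')
uri = URI.parse(fixed)   # parses without raising URI::InvalidURIError
puts uri.query           # prints %251
```

Note that this changes the string that is sent: the server receives `%251`, which decodes to the literal text `%1`.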
On Feb 26, 12:20 pm, Rob B. wrote:
> Probably because %1 looks like a partially escaped character. Try:
> ?%251
> Where %25 is an escaped %
> -Rob
I appreciate the reply. This is a bit unfortunate, I am developing a
tool which has to handle URIs the same way the browser does. While I
realize that is not a “correct” URI, the browser still fetches the
pages without a problem. In some sense, I wish I could mirror the
functionality of the browser fetch using the URI module. Anyhow, thank
you for your help!
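One way to approximate that browser-like tolerance is to try the strict parse first and fall back to a repaired string only when it fails. This is a rough sketch, not how any browser actually works: `lenient_parse` is a hypothetical name, and the two repairs it applies (stray percent signs and literal spaces) are assumptions about which problems show up in scraped links.

```ruby
require 'uri'

# Hypothetical helper: parse strictly when possible; on InvalidURIError,
# repair the most common problems and try once more.
def lenient_parse(url)
  URI.parse(url)
rescue URI::InvalidURIError
  repaired = url
    .gsub(/%(?![0-9A-Fa-f]{2})/, '%25') # "%" not starting a %XX escape
    .gsub(' ', '%20')                   # literal spaces
  URI.parse(repaired)                   # may still raise if the URL is beyond repair
end

uri = lenient_parse('http://hello.com/a.php?%1')
puts uri.host
```

Links that are already valid pass through unchanged, so the repair step only touches URLs the parser would otherwise reject.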
Steve H. wrote:
> On Feb 26, 12:20 pm, Rob B. wrote:
>> Probably because %1 looks like a partially escaped character. Try:
>> ?%251
>> Where %25 is an escaped %
>> -Rob
> I appreciate the reply. This is a bit unfortunate, I am developing a
> tool which has to handle URIs the same way the browser does. While I
> realize that is not a "correct" URI, the browser still fetches the
> pages without a problem. In some sense, I wish I could mirror the
> functionality of the browser fetch using the URI module. Anyhow, thank
> you for your help!
Maybe this helps:
URI.escape('http://hello.com/a.php?%1')
=> "http://hello.com/a.php?%251"
Regards,
Siep
I would also like to know how to handle invalid URIs in Mechanize. Is
there any way to override URL checking?
On Feb 26, 2008, at 13:20 PM, Steve H. wrote:
> I appreciate the reply. This is a bit unfortunate, I am developing a
> tool which has to handle URIs the same way the browser does. While I
> realize that is not a "correct" URI, the browser still fetches the
> pages without a problem. In some sense, I wish I could mirror the
> functionality of the browser fetch using the URI module. Anyhow, thank
> you for your help!
What about Mechanize?