Forum: Ruby-core [ruby-trunk - Bug #8241][Open] If uri host-part has underscore ( '_' ), 'URI#parse' raise 'URI::Inva

Sangmin Ryu (neocoin)
on 2013-04-09 14:03
(Received via mailing list)
Issue #8241 has been reported by neocoin (Sangmin Ryu).

----------------------------------------
Bug #8241: If uri host-part has underscore ( '_' ), 'URI#parse' raise
'URI::InvalidURIError'
https://bugs.ruby-lang.org/issues/8241

Author: neocoin (Sangmin Ryu)
Status: Open
Priority: Normal
Assignee: akira (akira yamada)
Category: core
Target version:
ruby -v: ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-darwin11.4.2]


First of all, I apologize if filing this issue is rude.
I did not know where to report this simple but critical problem.
I found it a long time ago (one or two years),
but the problem is very clear and the solution is very simple.
Until now I have just worked around it with a monkey patch.


If the host part of a URI contains an underscore ( '_' ), 'URI.parse'
raises 'URI::InvalidURIError'.

ex)
=begin
> require 'uri'
> URI.parse 'http://test_strin.helo.com'
URI::InvalidURIError: the scheme http does not accept registry part: test_strin.helo.com (or bad hostname?)
from .../.rbenv/versions/1.9.3-p125/lib/ruby/1.9.1/uri/generic.rb:213:in `initialize'

> e = URI.parse('http://test_string.hello.com') rescue $!
=> #<URI::InvalidURIError: the scheme http does not accept registry part: test_string.hello.com (or bad hostname?)>
> puts e.backtrace
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/generic.rb:214:in `initialize'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/http.rb:84:in `initialize'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/common.rb:214:in `new'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/common.rb:214:in `parse'
.../.rbenv/versions/2.0.0-p0/lib/ruby/2.0.0/uri/common.rb:747:in `parse'

vs

> URI.parse('http://teststring.hello.com')
=> #<URI::HTTP:0x007fbf31c1a078 URL:http://teststring.hello.com>
=end
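
For anyone who wants to check the session above on their own interpreter, here is a minimal sketch. The helper name `probe_host` is made up for illustration and is not part of the uri library. The result for the underscore host depends on the Ruby version and its URI parser, so no output is asserted for it here.

```ruby
require 'uri'

# Hypothetical helper (not part of the uri library): returns the parsed host,
# or the name of the error class if URI.parse rejects the string.
def probe_host(str)
  URI.parse(str).host
rescue URI::InvalidURIError => e
  e.class.name
end

puts probe_host('http://teststring.hello.com')  # plain host, parses fine
puts probe_host('http://test_string.hello.com') # result depends on the Ruby version
```

On the versions reported in this issue (1.9.3 and 2.0.0) the second call would be expected to print the error class name rather than a host.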

This problem is caused by the hostname regex pattern used by 'URI.split'
in uri/common.rb

https://bugs.ruby-lang.org/projects/ruby-trunk/rep...
( https://github.com/ruby/ruby/blob/trunk/lib/uri/co... )

=begin
[26] pry(main)> URI.split('http://teststring.hello.com')
=> ["http", nil, "teststring.hello.com", nil, nil, "", nil, nil, nil]
# normal: the host is returned in the third element

[27] pry(main)> URI.split('http://test_string.hello.com')
=> ["http", nil, nil, nil, "test_string.hello.com", "", nil, nil, nil]
# wrong: the host is nil and the name ends up in the fifth (registry) element
=end
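
The positional arrays returned by `URI.split` are easier to read with field names attached. This is only a sketch: `SPLIT_FIELDS` and `labelled_split` are names invented here, and the field order is the one documented for `URI.split`.

```ruby
require 'uri'

# Order of the elements in the array returned by URI.split.
SPLIT_FIELDS = %i[scheme userinfo host port registry path opaque query fragment].freeze

# Hypothetical helper: pairs each element of URI.split with its field name.
def labelled_split(str)
  SPLIT_FIELDS.zip(URI.split(str)).to_h
end

p labelled_split('http://teststring.hello.com')
```

With the behaviour reported in this issue, the underscore host would appear under `:registry` instead of `:host`.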

Source position:
https://bugs.ruby-lang.org/projects/ruby-trunk/rep...
( https://github.com/ruby/ruby/blob/trunk/lib/uri/co... )

=begin
      # hostname      = *( domainlabel "." ) toplabel [ "." ]
      # reg-name      = *( unreserved / pct-encoded / sub-delims ) # RFC3986
      unless hostname
        ret[:HOSTNAME] = hostname = "(?:[a-zA-Z0-9\\-.]|%\\h\\h)+"
      end
=end

As the source comment shows, 'reg-name' in RFC 3986 is defined as
'*( unreserved / pct-encoded / sub-delims )'.

The definition of 'unreserved' in RFC 3986 (
http://tools.ietf.org/html/rfc3986#section-2.3 ) is:

> unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

But the hostname regex pattern allows only '-' and '.'; it leaves out '_' and '~'.

Please check RFC 3986 and extend the hostname pattern to cover reg-name, as below.

=begin
        ret[:HOSTNAME] = hostname = "(?:[a-zA-Z0-9\\-._~]|%\\h\\h)+"
=end
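
The difference between the two pattern strings can be checked in isolation, without touching uri/common.rb. This is a rough sketch: the constant names and the `whole_match?` helper are made up for illustration, and only the two pattern strings come from the report above.

```ruby
# The two pattern strings quoted in the report.
CURRENT_HOSTNAME  = "(?:[a-zA-Z0-9\\-.]|%\\h\\h)+"
PROPOSED_HOSTNAME = "(?:[a-zA-Z0-9\\-._~]|%\\h\\h)+"

# Hypothetical helper: true if the pattern matches the whole host string.
def whole_match?(pattern, host)
  Regexp.new("\\A#{pattern}\\z").match?(host)
end

%w[teststring.hello.com test_string.hello.com].each do |host|
  puts "#{host}: current=#{whole_match?(CURRENT_HOSTNAME, host)} " \
       "proposed=#{whole_match?(PROPOSED_HOSTNAME, host)}"
end
```

Only the proposed pattern accepts the underscore host; both accept the plain one.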
Sangmin Ryu (neocoin)
on 2013-04-09 14:06
(Received via mailing list)
Issue #8241 has been updated by neocoin (Sangmin Ryu).

File edit_hostname_pattern.patch added


naruse (Yui NARUSE) (Guest)
on 2013-04-09 14:09
(Received via mailing list)
Issue #8241 has been updated by naruse (Yui NARUSE).


uri.rb is currently based on RFC 2373, and we plan to fix it based on
the URL spec.
http://url.spec.whatwg.org/
Sangmin Ryu (neocoin)
on 2013-04-09 14:33
(Received via mailing list)
Issue #8241 has been updated by neocoin (Sangmin Ryu).


naruse (Yui NARUSE) wrote:
> uri.rb is currently based on RFC 2373, and we plan to fix it based on the URL spec.
> http://url.spec.whatwg.org/

Thanks for the feedback.

RFC 2373 covers only IPv6 addressing. It does not include the whole URI
definition.
( http://tools.ietf.org/html/rfc2373 )

So the RFC 3986-based comment in uri/common.rb is right. Please check.
naruse (Yui NARUSE) (Guest)
on 2013-04-09 15:17
(Received via mailing list)
Issue #8241 has been updated by naruse (Yui NARUSE).


neocoin (Sangmin Ryu) wrote:
> naruse (Yui NARUSE) wrote:
> > uri.rb is currently based on RFC 2373, and we plan to fix it based on the URL spec.
> > http://url.spec.whatwg.org/
>
> Thanks for the feedback.
>
> RFC 2373 covers only IPv6 addressing. It does not include the whole URI definition.
> ( http://tools.ietf.org/html/rfc2373 )
>
> So the RFC 3986-based comment in uri/common.rb is right. Please check.

Oops, I meant RFC 2396. http://www.ietf.org/rfc/rfc2396.txt

In RFC 2396, the host of the http scheme is defined in section 3.2.2,
Server-based Naming Authority.
It says:

      server        = [ [ userinfo "@" ] hostport ]
      userinfo      = *( unreserved | escaped |
                         ";" | ":" | "&" | "=" | "+" | "$" | "," )
      hostport      = host [ ":" port ]
      host          = hostname | IPv4address
      hostname      = *( domainlabel "." ) toplabel [ "." ]
      domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
      toplabel      = alpha | alpha *( alphanum | "-" ) alphanum
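
The hostname rules quoted above can be transcribed into a regular expression to see why an underscore is rejected under RFC 2396. This is only a sketch of the grammar, not the pattern that uri/common.rb actually uses, and all constant and method names are invented for illustration.

```ruby
# Regexp fragments transcribed from the RFC 2396 section 3.2.2 grammar.
ALPHA       = "[a-zA-Z]"
ALPHANUM    = "[a-zA-Z0-9]"
# domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
DOMAINLABEL = "(?:#{ALPHANUM}(?:(?:#{ALPHANUM}|-)*#{ALPHANUM})?)"
# toplabel    = alpha | alpha *( alphanum | "-" ) alphanum
TOPLABEL    = "(?:#{ALPHA}(?:(?:#{ALPHANUM}|-)*#{ALPHANUM})?)"
# hostname    = *( domainlabel "." ) toplabel [ "." ]
RFC2396_HOSTNAME = /\A(?:#{DOMAINLABEL}\.)*#{TOPLABEL}\.?\z/

# Hypothetical helper: true if name is a hostname under the grammar above.
def rfc2396_hostname?(name)
  RFC2396_HOSTNAME.match?(name)
end

puts rfc2396_hostname?('teststring.hello.com')
puts rfc2396_hostname?('test_string.hello.com')
```

Under this grammar a label may contain only letters, digits, and inner hyphens, so 'test_string' is not a valid domainlabel.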
neocoin (Sangmin Ryu) (Guest)
on 2013-04-09 17:08
(Received via mailing list)
Issue #8241 has been updated by neocoin (Sangmin Ryu).


naruse (Yui NARUSE) wrote:
> > So the RFC 3986-based comment in uri/common.rb is right. Please check.
>       host          = hostname | IPv4address
>       hostname      = *( domainlabel "." ) toplabel [ "." ]
>       domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
>       toplabel      = alpha | alpha *( alphanum | "-" ) alphanum

Yes, you are right. I also checked RFC 2396 (published in August 1998),
which the comments in 'uri/common.rb' refer to.
That document was the starting point for the generic URI syntax.
In January 2005, RFC 3986 was published by the co-authors of RFC 2396.
(See also
http://en.wikipedia.org/wiki/Uniform_Resource_Iden...
)
As a result, RFC 3986 is the current standard.

I think many web service companies (for example, dynamic DNS providers
or blog companies that give users private addresses) treat RFC 3986 as
the standard.

When I wrote a web crawler in Ruby, I saw that second-level domains (the
'google' part of google.com) generally do not contain an underscore or
a tilde. I know that DNS hosting services do not permit an underscore in
a second-level domain.
But many third-level domains do contain an underscore (the 'hello_world'
part of hello_world.google.com).

So I checked the URI spec in RFC 3986 several years ago, and have now
posted this issue.


Find the following in http://tools.ietf.org/html/rfc3986#appendix-A

Appendix A.  Collected ABNF for URI
...
 host          = IP-literal / IPv4address / reg-name
...
 reg-name      = *( unreserved / pct-encoded / sub-delims )
...
 unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
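
For comparison, the three reg-name rules quoted above can be written out as a regular expression. This is a rough sketch under stated assumptions: the sub-delims and pct-encoded character sets are taken from RFC 3986 itself (they are not quoted in this thread), and the constant and method names are made up for illustration.

```ruby
# unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED  = '[a-zA-Z0-9\-._~]'
# pct-encoded = "%" HEXDIG HEXDIG   (from RFC 3986, not quoted in this thread)
PCT_ENCODED = '%\h\h'
# sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
SUB_DELIMS  = "[!$&'()*+,;=]"
# reg-name    = *( unreserved / pct-encoded / sub-delims )
REG_NAME = /\A(?:#{UNRESERVED}|#{PCT_ENCODED}|#{SUB_DELIMS})*\z/

# Hypothetical helper: true if name is a reg-name under the grammar above.
def rfc3986_reg_name?(name)
  REG_NAME.match?(name)
end

puts rfc3986_reg_name?('test_string.hello.com')
puts rfc3986_reg_name?('host/with/slash')
```

Note that reg-name is a purely syntactic rule: it is much more permissive than DNS naming rules and even allows the empty string.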



See also:
Python's urlparse module also takes RFC 3986 into account.
http://docs.python.org/2/library/urlparse.html
unknown (Guest)
on 2014-06-26 12:11
(Received via mailing list)
Issue #8241 has been updated by Yui NARUSE.

Related to Bug #9974: Regression: URI.parse allows invalid URIs added




---Files--------------------------------
edit_hostname_pattern.patch (152 Bytes)