Found a Ruby bug in the URI class, what do I do?

Ben Johnson (benjohnson)
on 2008-11-20 23:19
I'm pretty sure this is a bug, and it seems so obvious that I'm thinking
I might be doing something wrong. Check it out:


>> require "uri"
=> true

>> URI.parse("http://whatever.domain.com")
=> #<URI::HTTP:0x2b2b7c URL:http://whatever.domain.com>

>> URI.parse("http://whatever_again.domain.com")
URI::InvalidURIError: the scheme http does not accept registry part: whatever_again.domain.com (or bad hostname?)
  from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/generic.rb:195:in `initialize'
  from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/http.rb:78:in `initialize'
  from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:488:in `new'
  from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:488:in `parse'
  from (irb):45


When you add an underscore to the subdomain it raises an exception, even
though that is a perfectly valid host name. I dug deeper and found the
following:


>> URI.split("http://whatever.domain.com")
=> ["http", nil, "whatever.domain.com", nil, nil, "", nil, nil, nil]

>> URI.split("http://whatever_again.domain.com")
=> ["http", nil, nil, nil, "whatever_again.domain.com", "", nil, nil, nil]


Notice it's not recognizing the host in the second URI as a host, which
is really strange; it's actually saying it's the registry.
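To make the two arrays above easier to compare, here is a small sketch that labels each position of the URI.split result. The names SPLIT_FIELDS and labeled_split are made up for this example; the field order is the one URI.split documents (scheme, userinfo, host, port, registry, path, opaque, query, fragment).

```ruby
require "uri"

# Names for the nine positions in the array returned by URI.split.
# SPLIT_FIELDS and labeled_split are hypothetical helpers for illustration.
SPLIT_FIELDS = [:scheme, :userinfo, :host, :port, :registry,
                :path, :opaque, :query, :fragment].freeze

# Pair each position with its name so the result is easier to read.
def labeled_split(uri)
  Hash[SPLIT_FIELDS.zip(URI.split(uri))]
end

parts = labeled_split("http://whatever.domain.com")
puts parts[:scheme]  # prints http
puts parts[:host]    # prints whatever.domain.com
```

With the underscore URI on the Ruby 1.8 install shown above, the same helper would report the name under :registry instead of :host.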

Anyways, can anyone shed some light on this or do you have any ideas how
I can easily patch this?

Check out the split method source:
http://www.ruby-doc.org/stdlib/libdoc/uri/rdoc/cla...

That doesn't look too easy to patch, since it's one big regular expression
that breaks up the string.

Thanks for your help!
Hassan Schroeder (Guest)
on 2008-11-21 00:24
(Received via mailing list)
On Thu, Nov 20, 2008 at 2:16 PM, Ben Johnson <bjohnson@binarylogic.com>
wrote:

>>> URI.parse("http://whatever_again.domain.com")
> URI::InvalidURIError: the scheme http does not accept registry part:
> whatever_again.domain.com (or bad hostname?)

> When you add an underscore the sub domain it raises an exception, even
> though that is a perfectly valid host name.

No, it's not. Check the DNS RFCs: A-Z, a-z, 0-9 and the hyphen are
the only legal characters.

FWIW,
Brian Candler (candlerb)
on 2008-11-21 10:06
Hassan Schroeder wrote:
> On Thu, Nov 20, 2008 at 2:16 PM, Ben Johnson <bjohnson@binarylogic.com>
> wrote:
>
>>>> URI.parse("http://whatever_again.domain.com")
>> URI::InvalidURIError: the scheme http does not accept registry part:
>> whatever_again.domain.com (or bad hostname?)
>
>> When you add an underscore the sub domain it raises an exception, even
>> though that is a perfectly valid host name.
>
> No, it's not. Check the DNS RFCs: A-Z, a-z, 0-9 and the hyphen are
> the only legal characters.

That's not strictly a DNS limitation. However there are old (pre-DNS)
RFCs that say those are the only legal characters in a "hostname".

The key DNS RFCs are 1034 and 1035. RFC 1034 gives a very liberal
definition of a domain name (binary labels, each 1 to 63 bytes long,
maximum 255 bytes in total).

However, it then makes a recommendation:

"3.5. Preferred name syntax

The DNS specifications attempt to be as general as possible in the rules
for constructing domain names.  The idea is that the name of any
existing object can be expressed as a domain name with minimal changes.
However, when assigning a domain name for an object, the prudent user
will select a name which satisfies both the rules of the domain system
and any existing rules for the object, whether these rules are published
or implied by existing programs.

For example, when naming a mail domain, the user should satisfy both the
rules of this memo and those in RFC-822.  When creating a new host name,
the old rules for HOSTS.TXT should be followed.  This avoids problems
when old software is converted to use domain names.

The following syntax will result in fewer problems with many
applications that use domain names (e.g., mail, TELNET)."

(then gives a BNF description of letters/digits/hyphens)
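
For illustration, here is a rough Ruby sketch of that letters/digits/hyphens rule. The names LDH_LABEL and preferred_syntax? are made up for this example, and the check only approximates the section 3.5 grammar (a label starts with a letter, ends with a letter or digit, and is at most 63 characters long).

```ruby
# One label in the "preferred name syntax" of RFC 1034 section 3.5:
# starts with a letter, ends with a letter or digit, and has only
# letters, digits, or hyphens in between.
# LDH_LABEL and preferred_syntax? are hypothetical names for this sketch.
LDH_LABEL = /\A[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?\z/

def preferred_syntax?(name)
  labels = name.split(".", -1)
  return false if labels.empty?
  # Every label must fit the pattern and the 63-character limit.
  labels.all? { |label| label.length <= 63 && !(label =~ LDH_LABEL).nil? }
end

puts preferred_syntax?("whatever.domain.com")        # prints true
puts preferred_syntax?("whatever_again.domain.com")  # prints false
```

So the underscore name fails the recommendation, even though nothing in the DNS itself stops it from being stored and resolved.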

In practice - there *are* hosts out there which have underscores in
their hostnames, so a library which forbids this on the basis of ancient
HOSTS.TXT rules causes real problems.

RFC 2822 (for E-mail addresses) is much more liberal.
David Madison (daveola)
on 2012-12-02 02:24
Attachment: urifix.rb (467 Bytes)
Here's a workaround that you can put in your script that fixes
URI.parse.

I didn't feel like rewriting the entire split, so I escaped out the '_'
in the URL with a legal string.

This will work great as long as you don't find a domain that has "_"
*and* the string "UNDERLINEuriSplitISbrokenUNDERLINE" in it.  That seems
like a reasonable premise for a kludgy hack fix.  :)

See attached file if you have problems with cut-and-paste because of the
line wrap:

require 'uri'

# Fix for broken URI.parse (doesn't allow '_' in subdomains)
module URI
  class << self
    alias origsplit split
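    def split(uri)
      # Work on a copy so the caller's string is not modified in place.
      uri = uri.dup
      # Swap a '_' in the host part for a placeholder the parser accepts.
      # The pattern is anchored, so only one '_' (the last one before the
      # first '/') is replaced.
      return origsplit(uri) unless uri.gsub!(/^([^:]+:\/\/[^\/]+)_/, '\1UNDERLINEuriSplitISbrokenUNDERLINE')
      fix = origsplit(uri)
      # Restore the '_' in the host component.
      fix[2].gsub!(/UNDERLINEuriSplitISbrokenUNDERLINE/, '_')
      fix
    end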
  end
end

puts URI.parse("http://whatever.domain.com")
puts URI.parse("http://whatever_again.domain.com")
tamouse mailing lists (Guest)
on 2012-12-02 03:22
(Received via mailing list)
On Sat, Dec 1, 2012 at 7:24 PM, David Madison <lists@ruby-forum.com>
wrote:
> line wrap:
>       fix = origsplit(uri)
> http://www.ruby-forum.com/attachment/7919/urifix.rb

Underscores aren't valid in domain names..., just a-z, 0-9, and -
David Madison (daveola)
on 2012-12-02 13:09
tamouse mailing lists wrote in post #1087487:
> Underscores aren't valid in domain names..., just a-z, 0-9, and -


1) See Brian Candler's informative post above mine.
2) URIs are used for more than just domains/http
3) There are *many* subdomains that use underscore across the web.
Fight them, not me.  I need to write software that works with those
subdomains.
tamouse mailing lists (Guest)
on 2012-12-02 16:29
(Received via mailing list)

I see his point, and yours, but I would say your patch is not solving
the actual underlying problem.

Handling an underscore in a domain name as you are is expedient for the
immediate issue you seem to have; however, it is not a fix.

The DNS specs are broad, as they should be, in what is considered
valid as far as what the DNS can store. However, the DNS specs are not
what is authoritative for URI syntax; that is covered by RFC 3986,
with the ABNF defined in Appendix A. Even this, though, is complicated
by IDNA (RFC 5891).
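
As a rough illustration of the RFC 3986 side, the reg-name rule in Appendix A can be written as a single Ruby regexp. This is only a sketch: REG_NAME and reg_name? are made-up names, not part of the uri library, and it ignores the IP-literal and IPv4address alternatives of the host rule as well as IDNA.

```ruby
# Sketch of the reg-name rule from RFC 3986 Appendix A:
#   reg-name   = *( unreserved / pct-encoded / sub-delims )
#   unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
#   sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
# REG_NAME and reg_name? are hypothetical names for this example.
REG_NAME = /\A(?:[A-Za-z0-9\-._~!$&'()*+,;=]|%\h\h)*\z/

def reg_name?(host)
  !(host =~ REG_NAME).nil?
end

puts reg_name?("whatever_again.domain.com")  # prints true
puts reg_name?("not a host")                 # prints false
```

Under that grammar the underscore is an ordinary unreserved character, which is why a parser following RFC 3986 would accept the host that started this thread.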

In any case, the URI module needs to be rethought in a way to handle
both of these considerations, not just patched to allow the
underscore.
David Madison (daveola)
on 2012-12-02 23:00
tamouse mailing lists wrote in post #1087523:
> Handling an underscore in a domain name as you are is expedient to the
> immediate issue you seem to have, however it is not a fix.

Well - it's not a proper fix from the perspective that URI is broken
and needs to be rethought, but it is a fix to the problem of:
  "I am using URI and I need to support sub_domain names and it
   looks like the maintainers of URI aren't planning on changing
   anytime soon"

:)
tamouse mailing lists (Guest)
on 2012-12-03 19:25
(Received via mailing list)
On Sun, Dec 2, 2012 at 4:00 PM, David Madison <lists@ruby-forum.com>
wrote:
> :)

This might offer a better workaround:

Change the regexp used to parse the hostname by creating a new parser
object:

1.9.3-p194 :047 > u = URI::Parser.new(:HOSTNAME =>
"(?:[a-zA-Z0-9\\-._~]|%\\h\\h)+")
 => #<URI::Parser:0x8728bcc>
1.9.3-p194 :048 > uri = u.parse("http://www_w.example.com")
 => #<URI::HTTP:0x87af21c URL:http://www_w.example.com>
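
The same idea can be wrapped in a helper that only uses the relaxed parser when the stock one refuses the URI. This is a sketch under a few assumptions: parse_lenient and LENIENT_PARSER are made-up names, and it uses URI::RFC2396_Parser, which as far as I know is the name Ruby 2.2 and later give to the class that the 1.9.3 session above calls URI::Parser (on 1.9.3 itself you would keep URI::Parser).

```ruby
require "uri"

# A parser whose HOSTNAME pattern also accepts '_' and '~'.
# Assumption: URI::RFC2396_Parser is available (Ruby 2.2 or later);
# on 1.9.3 the equivalent class is URI::Parser, as in the session above.
LENIENT_PARSER = URI::RFC2396_Parser.new(
  :HOSTNAME => "(?:[a-zA-Z0-9\\-._~]|%\\h\\h)+"
)

# Hypothetical helper: try the stock parser first, and fall back to the
# relaxed one only if the URI is rejected.
def parse_lenient(str)
  URI.parse(str)
rescue URI::InvalidURIError
  LENIENT_PARSER.parse(str)
end

uri = parse_lenient("http://www_w.example.com")
puts uri.host
```

Keeping the fallback separate means the rest of the program still gets the default behaviour for every URI the stock parser already accepts.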

Interesting thing: I was perusing the commit logs for uri/common.rb.
On Dec 5th, 2010, a change was applied that would have made the
hostname regexp match the one specified in RFC 3986; then, about 1.5
hours later, it was changed back. I wonder what their thinking was...