Found a ruby bug in the URI class, what do I do?

benjohnson · November 20, 2008, 11:19pm

I’m pretty sure this is a bug, and it seems so obvious that I’m thinking
I might be doing something wrong. Check it out:

require "uri"
=> true

URI.parse("http://whatever.domain.com")
=> #<URI::HTTP:0x2b2b7c URL:http://whatever.domain.com>

URI.parse("http://whatever_again.domain.com")
URI::InvalidURIError: the scheme http does not accept registry part: whatever_again.domain.com (or bad hostname?)
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/generic.rb:195:in `initialize'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/http.rb:78:in `initialize'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:488:in `new'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:488:in `parse'
from (irb):45

When you add an underscore to the subdomain it raises an exception, even
though that is a perfectly valid host name. I dug deeper and found the
following:

URI.split("http://whatever.domain.com")
=> ["http", nil, "whatever.domain.com", nil, nil, "", nil, nil, nil]

URI.split("http://whatever_again.domain.com")
=> ["http", nil, nil, nil, "whatever_again.domain.com", "", nil, nil, nil]

Notice it’s not recognizing the host in the second URI as a host, which
is really strange; it’s actually saying it’s the registry.
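
For anyone following along, labeling the nine slots of that array makes the difference easier to spot. This is only a sketch: the component order is the one given in the URI.split documentation, and it calls URI::RFC2396_Parser directly (the name the older parser has carried since Ruby 2.2) on the assumption that it still splits the way the 1.8 parser above does. COMPONENTS and labeled_split are made-up names for illustration.

```ruby
require "uri"

# Component order as documented for URI.split.
COMPONENTS = %i[scheme userinfo host port registry path opaque query fragment].freeze

# Hypothetical helper: pair each slot returned by split with its name.
def labeled_split(url)
  parser = URI::RFC2396_Parser.new
  COMPONENTS.zip(parser.split(url)).to_h
end

["http://whatever.domain.com", "http://whatever_again.domain.com"].each do |url|
  parts = labeled_split(url)
  puts "#{url}: host=#{parts[:host].inspect} registry=#{parts[:registry].inspect}"
end
```

The first URL should report a host and no registry, the second the reverse, which is what triggers the "does not accept registry part" error.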

Anyways, can anyone shed some light on this or do you have any ideas how
I can easily patch this?

Check out the split method source:
http://www.ruby-doc.org/stdlib/libdoc/uri/rdoc/classes/URI.html#M009379

That doesn’t look too easy to patch since it’s one big regular expression
that breaks up the string.

Thanks for your help!

benjohnson · November 21, 2008, 12:24am

On Thu, Nov 20, 2008 at 2:16 PM, Ben J. wrote:

URI.parse("http://whatever_again.domain.com")
URI::InvalidURIError: the scheme http does not accept registry part:
whatever_again.domain.com (or bad hostname?)

When you add an underscore to the subdomain it raises an exception, even
though that is a perfectly valid host name.

No, it’s not. Check the DNS RFCs: A-Z, a-z, 0-9 and the hyphen are
the only legal characters.

FWIW,

benjohnson · November 21, 2008, 10:06am

Hassan S. wrote:

On Thu, Nov 20, 2008 at 2:16 PM, Ben J. wrote:

URI.parse("http://whatever_again.domain.com")
URI::InvalidURIError: the scheme http does not accept registry part:
whatever_again.domain.com (or bad hostname?)

When you add an underscore to the subdomain it raises an exception, even
though that is a perfectly valid host name.

No, it’s not. Check the DNS RFCs: A-Z, a-z, 0-9 and the hyphen are
the only legal characters.

That’s not strictly a DNS limitation. However, there are old (pre-DNS)
RFCs that say those are the only legal characters in a “hostname”.

The key DNS RFCs are 1034 and 1035. RFC 1034 gives a very liberal
definition of a domain name (binary labels, each 1 to 63 bytes long,
maximum 255 bytes in total).
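
To make those limits concrete, here is a rough sketch of a length-only check. It assumes the 255-byte total is counted the way RFC 1034 describes it (label bytes plus one length byte per label, plus the zero-length root label). The method name dns_length_ok? and the two constants are hypothetical, not part of any Ruby library, and the check deliberately ignores which characters appear in the labels.

```ruby
MAX_LABEL_BYTES = 63
MAX_NAME_BYTES  = 255

# Hypothetical helper: checks only the RFC 1034 size limits, not the characters.
def dns_length_ok?(name)
  # A trailing dot just names the root explicitly; drop it before splitting.
  labels = name.chomp(".").split(".", -1)
  return false if labels.empty?
  return false unless labels.all? { |label| label.bytesize.between?(1, MAX_LABEL_BYTES) }

  # One length byte per label, plus one byte for the terminating root label.
  wire_length = labels.sum { |label| label.bytesize + 1 } + 1
  wire_length <= MAX_NAME_BYTES
end

puts dns_length_ok?("whatever_again.domain.com")
puts dns_length_ok?("#{'a' * 64}.example.com")
```

Note that the underscore name passes here: nothing in these size rules cares about the characters used.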

However, it then makes a recommendation:

"3.5. Preferred name syntax

The DNS specifications attempt to be as general as possible in the rules
for constructing domain names. The idea is that the name of any
existing object can be expressed as a domain name with minimal changes.
However, when assigning a domain name for an object, the prudent user
will select a name which satisfies both the rules of the domain system
and any existing rules for the object, whether these rules are published
or implied by existing programs.

For example, when naming a mail domain, the user should satisfy both the
rules of this memo and those in RFC-822. When creating a new host name,
the old rules for HOSTS.TXT should be followed. This avoids problems
when old software is converted to use domain names.

The following syntax will result in fewer problems with many
applications that use domain names (e.g., mail, TELNET)."

(then gives a BNF description of letters/digits/hyphens)
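
As a rough illustration of that BNF, the preferred syntax can be written as a regexp per label: start with a letter, end with a letter or digit, and allow only letters, digits and hyphens in between. This is a sketch of my reading of section 3.5, not something the uri library exposes; LDH_LABEL and preferred_syntax? are made-up names.

```ruby
# One label under the RFC 1034 section 3.5 "preferred name syntax".
LDH_LABEL = /\A[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?\z/

# Hypothetical helper: true when every dot-separated label fits the preferred syntax.
def preferred_syntax?(hostname)
  labels = hostname.split(".", -1)
  !labels.empty? && labels.all? { |label| label.match?(LDH_LABEL) }
end

puts preferred_syntax?("whatever.domain.com")
puts preferred_syntax?("whatever_again.domain.com")
```

A name with an underscore fails this check even though, per the paragraphs above, the DNS itself can store it.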

In practice - there are hosts out there which have underscores in
their hostnames, so a library which forbids this on the basis of ancient
HOSTS.TXT rules causes real problems.

RFC 2822 (for E-mail addresses) is much more liberal.

benjohnson · December 2, 2012, 2:24am

Here’s a workaround that you can put in your script that fixes
URI.parse.

I didn’t feel like rewriting the entire split, so I escaped out the ‘_’
in the URL with a legal string.

This will work great as long as you don’t find a domain that has “_”
and the string “UNDERLINEuriSplitISbrokenUNDERLINE” in it. That seems
like a reasonable premise for a kludgy hack fix.

See attached file if you have problems with cut-and-paste because of the
line wrap:

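require 'uri'

# Fix for broken URI.parse (doesn't allow '_' in subdomains)
module URI
  class << self
    alias origsplit split

    def split(uri)
      # Swap an underscore in the host for a placeholder the stock regexp accepts.
      # gsub! returns nil when nothing matched, so other URIs pass straight through.
      # The pattern is anchored, so only one underscore (the last one in the host) is handled.
      return origsplit(uri) unless uri.gsub!(/^([^:]+:\/\/[^\/]+)_/, '\1UNDERLINEuriSplitISbrokenUNDERLINE')
      fix = origsplit(uri)
      # Put the underscore back into the host component (index 2).
      fix[2].gsub!(/UNDERLINEuriSplitISbrokenUNDERLINE/, '_')
      fix
    end
  end
end

puts URI.parse("http://whatever.domain.com")
puts URI.parse("http://whatever_again.domain.com")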

benjohnson · December 2, 2012, 3:22am

On Sat, Dec 1, 2012 at 7:24 PM, David M. wrote:

line wrap:
fix = origsplit(uri)
http://www.ruby-forum.com/attachment/7919/urifix.rb

–
Posted via http://www.ruby-forum.com/.

Underscores aren’t valid in domain names…, just a-z, 0-9, and -

benjohnson · December 2, 2012, 1:09pm

tamouse mailing lists wrote in post #1087487:

Underscores aren’t valid in domain names…, just a-z, 0-9, and -

See Brian C.'s informative post above mine.
URIs are used for more than just domains/http.
There are many subdomains across the web that use underscores.
Fight them, not me. I need to write software that works with those
subdomains.

benjohnson · December 2, 2012, 4:29pm

On Sun, Dec 2, 2012 at 6:09 AM, David M. wrote:

I see his point, and yours, but I would say your patch is not solving
the actual underlying problem.

Handling an underscore in a domain name as you are is expedient to the
immediate issue you seem to have, however it is not a fix.

The DNS specs are broad, as they should be, in what is considered
valid as far as what the DNS can store. However, the DNS specs are not
what is authoritative for URI syntax; that is covered by RFC 3986,
with the BNF defined in Appendix A. Even this, though, is complicated
by IDNA (RFC 5891).
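
For reference, RFC 3986 Appendix A defines reg-name as zero or more unreserved, pct-encoded or sub-delims characters, and the unreserved set includes the underscore. A minimal sketch of that production as a Ruby regexp follows. REG_NAME and reg_name? are illustrative names, not part of the uri library, and this checks syntax only; it says nothing about IDNA or about whether the name resolves.

```ruby
# reg-name    = *( unreserved / pct-encoded / sub-delims )
# unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
# sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
# pct-encoded = "%" HEXDIG HEXDIG
REG_NAME = /\A(?:[A-Za-z0-9\-._~!$&'()*+,;=]|%\h\h)*\z/

# Hypothetical helper: true when the string fits the reg-name production.
def reg_name?(host)
  host.match?(REG_NAME)
end

puts reg_name?("whatever_again.domain.com")
puts reg_name?("exa mple.com")
```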

In any case, the URI module needs to be rethought in a way to handle
both of these considerations, not just patched to allow the
underscore.

benjohnson · December 2, 2012, 11:00pm

tamouse mailing lists wrote in post #1087523:

Handling an underscore in a domain name as you are is expedient to the
immediate issue you seem to have, however it is not a fix.

Well - it’s not a proper fix from the perspective that URI is broken
and needs to be rethought, but it is a fix to the problem of:
“I am using URI and I need to support sub_domain names and it
looks like the maintainers of URI aren’t planning on changing
anytime soon”

benjohnson · December 3, 2012, 7:25pm

On Sun, Dec 2, 2012 at 4:00 PM, David M. wrote:

This might offer a better workaround:

Change the regexp used to parse the hostname by creating a new parser
object:

1.9.3-p194 :047 > u = URI::Parser.new(:HOSTNAME => '(?:[a-zA-Z0-9\-._~]|%\h\h)+')
=> #<URI::Parser:0x8728bcc>
1.9.3-p194 :048 > uri = u.parse("http://www_w.example.com")
=> #<URI::HTTP:0x87af21c URL:http://www_w.example.com>
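
The same idea as a standalone script, in case it helps. This is a sketch under a couple of assumptions: it uses URI::RFC2396_Parser, the name this parser class has had since Ruby 2.2, and it assumes the class still accepts the :HOSTNAME option the way URI::Parser does in the 1.9.3 session above. The pattern is kept in a single-quoted string so the backslashes in \- and \h reach the regexp engine unchanged. UNDERSCORE_HOSTNAME, LENIENT_PARSER and parse_lenient are hypothetical names.

```ruby
require "uri"

# Hostname pattern that also admits "_" and "~" (the RFC 3986 unreserved set).
UNDERSCORE_HOSTNAME = '(?:[a-zA-Z0-9\-._~]|%\h\h)+'

# A separate parser object, so URI.parse and URI.split themselves stay untouched.
LENIENT_PARSER = URI::RFC2396_Parser.new(HOSTNAME: UNDERSCORE_HOSTNAME)

# Hypothetical helper wrapping the custom parser.
def parse_lenient(url)
  LENIENT_PARSER.parse(url)
end

uri = parse_lenient("http://www_w.example.com/index.html")
puts uri.host
puts uri.path
```

Compared with aliasing URI.split, this keeps the change local to the code that asks for it, at the cost of having to call the custom parser explicitly.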

Interesting thing: I was perusing the commit logs for uri/common.rb.
On Dec 5th, 2010, a change was applied that would have made the
hostname regexp match the one specified in RFC 3986; then, about 1.5
hours later, it was changed back. I wonder what their thinking was…