Bug in URI.parse?

bfsog · August 29, 2007, 4:34pm

I have looked through the archives for the mailing list, but didn’t see
this issue addressed. So here goes.

My local machine is named 3beers-wrk. I achieved this by putting an
entry in my local hosts file (since I’m on Windows, this would be in
\windows\system32\drivers\etc\hosts). However, I have also seen the
issue below reproduce with a machine whose name is similar, say
12345-server.

Below is a capture of my session with the shell, then with irb:

H:>ping 3beers-wrk

Pinging 3beers-wrk [127.0.0.1] with 32 bytes of data:

Reply from 127.0.0.1: bytes=32 time<1ms TTL=128
…

H:>irb
irb(main):001:0> require 'uri'
=> true
irb(main):002:0> URI.parse("http://3beers-wrk.tsi.lan")
=> #<URI::HTTP:0x1612dc0 URL:http://3beers-wrk.tsi.lan>
irb(main):003:0> URI.parse("http://3beers-wrk")
URI::InvalidURIError: the scheme http does not accept registry part: 3beers-wrk (or bad hostname?)
    from c:/p4/workgroup-1.0/workgroup-support/ruby/lib/ruby/1.8/uri/generic.rb:194:in `initialize'
    from c:/p4/workgroup-1.0/workgroup-support/ruby/lib/ruby/1.8/uri/http.rb:46:in `initialize'
    from c:/p4/workgroup-1.0/workgroup-support/ruby/lib/ruby/1.8/uri/common.rb:484:in `new'
    from c:/p4/workgroup-1.0/workgroup-support/ruby/lib/ruby/1.8/uri/common.rb:484:in `parse'
    from (irb):3
    from c:/p4/workgroup-1.0/workgroup-support/ruby/lib/ruby/1.8/uri/http.rb:57

URI.parse() is not properly parsing the non-qualified name 3beers-wrk
(though it does properly parse a fully-qualified hostname), which seems
to be in line with the grammar set forth in RFC2396. However, the RFC
makes the following comment: “In practice, however, the host component
may be a local domain literal” (section 3.2). This suggests that the
above URI is entirely valid. Further, this URI is acceptable to Web
browsers.

Is this a bug? How can I handle this seemingly valid URI?

Thanks!

Andrew

bfsog · August 29, 2007, 6:56pm

Good question! I’ve yet to figure out a good way to handle the
InvalidURIError exceptions myself. URI.parse is pretty useless unless
you have a way to handle the errors the URI classes raise.
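
One way to keep a rejected URI from ending the program is to rescue the exception at the call site. The following is only a minimal sketch: `safe_parse` is a hypothetical helper name, not part of the URI library, and which strings get rejected depends on the parser shipped with your Ruby version, so nothing is assumed here about a host such as 3beers-wrk.

```ruby
require 'uri'

# Hypothetical helper (not part of the URI library): returns a URI object,
# or nil when URI.parse rejects the string.
def safe_parse(text)
  URI.parse(text)
rescue URI::InvalidURIError
  nil
end

uri = safe_parse("http://www.example.com/index.html")
puts uri.host if uri

# A string containing a space is not a valid URI, so the helper returns nil
# here instead of letting the exception propagate.
puts "could not parse" if safe_parse("http://bad host/").nil?
```

Returning nil keeps the caller in charge of what to do next (log, ask the user again, fall back to a default).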

bfsog · August 29, 2007, 7:25pm

On Aug 29, 8:33 am, “Andrew B.” [email protected] wrote:

Is this a bug? How can I handle this seemingly valid URI?

It looks like URI.parse doesn’t like the leading number:

irb(main):001:0> require 'uri'
irb(main):003:0> URI.parse("http://xshare")
=> #<URI::HTTP:0x16fd906 URL:http://xshare>
irb(main):004:0> URI.parse("http://xshare-foo")
=> #<URI::HTTP:0x16fc498 URL:http://xshare-foo>
irb(main):006:0> URI.parse("http://3qshare")
URI::InvalidURIError: the scheme http does not accept registry part: 3qshare (or bad hostname?)
    from C:/ruby/lib/ruby/1.8/uri/generic.rb:195:in `initialize'
    from C:/ruby/lib/ruby/1.8/uri/http.rb:78:in `initialize'
    from C:/ruby/lib/ruby/1.8/uri/common.rb:488:in `new'
    from C:/ruby/lib/ruby/1.8/uri/common.rb:488:in `parse'
    from (irb):6

I couldn’t tell you what the proper behavior is.

Regards,

Dan

bfsog · August 29, 2007, 8:29pm

That is true, and it is due to the following regular expressions from
uri/common.rb:

# domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"

# toplabel = alpha | alpha *( alphanum | "-" ) alphanum
TOPLABEL = "(?:[#{ALPHA}](?:[-#{ALNUM}]*[#{ALNUM}])?)"

# hostname = *( domainlabel "." ) toplabel [ "." ]
HOSTNAME = "(?:#{DOMLABEL}\\.)*#{TOPLABEL}\\.?"

So a valid hostname consists of optional DOMLABELs in front of a
TOPLABEL. The TOPLABEL must start with a letter and end in a letter or
digit, with letters, digits and hyphens in between.

That is consistent with RFC 1035 (DOMAIN NAMES - IMPLEMENTATION AND
SPECIFICATION) [http://www.ietf.org/rfc/rfc1035.txt]:

“The labels must follow the rules for ARPANET host names. They must
start with a letter, end with a letter or digit, and have as interior
characters only letters, digits, and hyphen. There are also some
restrictions on the length. Labels must be 63 characters or less.”
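
To see the effect of that grammar without touching the library, the patterns can be rebuilt in isolation. This is only a sketch: the module name `StrictHostname` and the `ALPHA`/`ALNUM` character classes are assumptions made for the example, and the pattern is anchored here so that the whole string must be a hostname.

```ruby
# Stand-alone re-creation of the hostname grammar discussed above.
# StrictHostname, ALPHA and ALNUM are assumptions made for this sketch.
module StrictHostname
  ALPHA = "a-zA-Z"
  ALNUM = "#{ALPHA}\\d"

  # A label that may start with a letter or a digit.
  DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  # The rightmost label, which must start with a letter.
  TOPLABEL = "(?:[#{ALPHA}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  HOSTNAME = "(?:#{DOMLABEL}\\.)*#{TOPLABEL}\\.?"

  # Anchor the pattern so the whole string must be a hostname.
  PATTERN = /\A#{HOSTNAME}\z/

  def self.valid?(name)
    PATTERN.match(name) ? true : false
  end
end

%w[3beers-wrk.tsi.lan 3beers-wrk xshare xshare-foo 3qshare 401k.com www.example.4bad].each do |name|
  puts "#{name}: #{StrictHostname.valid?(name)}"
end
```

An unqualified name is its own rightmost label, so the "must start with a letter" rule applies to it directly; that is why 3beers-wrk fails on its own while 3beers-wrk.tsi.lan passes.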

The error thrown by URI.parse is a little odd in this context, but it
can be explained as follows:

In the URI.parse chain, the URI is checked against a longer regular
expression that only partly matches the hostname, but also other URI
parts (such as userinfo, the scheme etc.). The hostname part doesn’t
match here because it’s dealing with an invalid hostname. The URI
registry part does match your invalid hostname, so this information is
passed on in the array of matched URI parts for the registry.
This array is then checked in Generic.new. That constructor finds the
string passed for the registry, but the class is hard coded to not use
registries:

USE_REGISTRY = false

# DOC: FIXME!
def self.use_registry
  self::USE_REGISTRY
end

And in the constructor:

if @registry && !self.class.use_registry
  raise InvalidURIError,
    "the scheme #{@scheme} does not accept registry part: #{@registry} (or bad hostname?)"
end
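
The chain described above can be mimicked in a few lines. This is a deliberately simplified sketch, not the library's code: the module `RegistrySketch`, its methods and its own `InvalidURIError` class are all hypothetical, and the real parser matches the whole URI with one larger expression rather than classifying the authority separately.

```ruby
# Simplified imitation of the hostname-or-registry fallback.
# Everything in this module is hypothetical and exists only for illustration.
module RegistrySketch
  class InvalidURIError < StandardError; end

  ALNUM    = "a-zA-Z\\d"
  DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  TOPLABEL = "(?:[a-zA-Z](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  HOSTNAME = /\A(?:#{DOMLABEL}\.)*#{TOPLABEL}\.?\z/

  USE_REGISTRY = false

  # A string that fits the hostname grammar is a host;
  # anything else is handed on as a registry name.
  def self.split_authority(authority)
    HOSTNAME.match(authority) ? [authority, nil] : [nil, authority]
  end

  # Mirrors the constructor check: a registry part is rejected outright.
  def self.host_for(scheme, authority)
    host, registry = split_authority(authority)
    if registry && !USE_REGISTRY
      raise InvalidURIError,
        "the scheme #{scheme} does not accept registry part: #{registry} (or bad hostname?)"
    end
    host
  end
end

puts RegistrySketch.host_for("http", "3beers-wrk.tsi.lan")

begin
  RegistrySketch.host_for("http", "3beers-wrk")
rescue RegistrySketch::InvalidURIError => e
  puts e.message
end
```

The message mentions a registry because the rejected hostname was never recognised as a hostname in the first place.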

To sum up: a hostname of 3beers-wrk is invalid as an ARPANET host
according to the RFC, so the correct solution would be to rename the
host.

Hope that helps,

Felix

bfsog · August 29, 2007, 8:43pm

Excuse me: a top-level label of 3beers-wrk is invalid.

Felix

bfsog · August 29, 2007, 9:10pm

While I believe that Felix’s analysis is valid, the problem is that
there are valid, real domains that start with numbers, and URI should
parse those, and in fact, it generally does.

irb(main):002:0> require 'uri'
=> true
irb(main):003:0> URI.parse('http://slashdot.org')
=> #<URI::HTTP:0x2fee3c URL:http://slashdot.org>
irb(main):004:0> URI.parse('http://401k.com')
=> #<URI::HTTP:0x2fca24 URL:http://401k.com>
irb(main):006:0> URI.parse('http://www.3com.com')
=> #<URI::HTTP:0x2f7b64 URL:http://www.3com.com>
irb(main):007:0> URI.parse('https://401k.fidelity.com')
=> #<URI::HTTPS:0x2f5364 URL:https://401k.fidelity.com>

All of these are real domains for real websites, and thus, the
suggestion of “rename the host” would not work very well.

The problem is probably better illustrated by this example:

irb(main):005:0> URI.parse('http://www.example.4bad')
URI::InvalidURIError: the scheme http does not accept registry part: www.example.4bad (or bad hostname?)
    from /usr/local/lib/ruby/1.8/uri/generic.rb:195:in `initialize'
    from /usr/local/lib/ruby/1.8/uri/http.rb:78:in `initialize'
    from /usr/local/lib/ruby/1.8/uri/common.rb:488:in `new'
    from /usr/local/lib/ruby/1.8/uri/common.rb:488:in `parse'
    from (irb):5

Here, the top-level domain starts with a digit, and that is not
allowed. We will most likely never see such a beast out in the
world. So the workaround for Dan’s example, 3qshare, would be to
specify the domain name along with the hostname.

But I would contend that this is a bug in URI. My suggestion would
be that the regex for HOSTNAME be:

HOSTNAME = "#{DOMLABEL}(?:(?:\\.#{DOMLABEL})*\\.#{TOPLABEL}\\.?)?"

(I’ll admit I’m not that familiar with this regex notation, so I’m
winging it; apologies for any mistakes.) The point is that the
hostname may be specified without a domain, and if so, it must still be
parsed. If the hostname is either a fully qualified hostname or just
a domain name, then the format of the top-level domain must be checked
and enforced, with optional (sub)domains in between.

Of course, I’m working from what Felix gave above; I haven’t gone
through uri/common.rb to any significant extent, so there may be other
things that this suggestion would cause to break.

Coey


bfsog · August 29, 2007, 10:17pm

If it is a bug, change TOPLABEL in common.rb to this:

TOPLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"

Thanks to my friendly MySQL admin.

Stephen B. IV

bfsog · August 29, 2007, 10:49pm

Ok lots of good responses, thanks! A few comments:

Felix: while URI.parse() is behaving according to the two cited RFCs, I
think it is missing an important use case. In “http://3beers-wrk”,
“3beers-wrk” isn’t a domain name, is it? It is an unqualified host name
(I assume we’d pick the host name up from context). Now, the RFC also
suggests that a host name must follow these rules (starting with a
letter, etc.), and furthermore, that all components of a domain name
must follow this convention, which suggests that the regexp in
common.rb is also incorrect.

Also, the solution of “rename the host” is a non-solution when dealing
with customers who are using an otherwise perfectly acceptable hostname
(I haven’t found a tool yet that will balk at a hostname beginning with
a number).

Now, I’m not sure if the RFCs have been replaced by newer versions -
that would take some digging.

So, John, I’d say that this is a bug in URI.parse, since it follows
neither the published RFCs nor the practical implementation of them
today (as Coey points out). And if it follows neither, it’s really not
a very good general-purpose function in the Ruby library and so should
be fixed.

Andrew

bfsog · August 29, 2007, 9:45pm

I wouldn’t call it a bug exactly; it does what it was written to do.
Instead, let’s just say that URI.parse isn’t very robust. It doesn’t
handle lots of real-world situations in the way you would expect. You
would expect some sort of message saying the TLD (top-level domain) is
missing or bad, and you would not expect this to end your program
abruptly.

A good URI parser will also accept IP addresses, since those are also
valid, at least in the sense that they are real, do exist, and are
likely to be used or entered by users.

Another problem is the way it handles URLs missing the www or the
http:// or https:// prefix. Strictly speaking these should be
required, but that clearly is not the reality of URLs in the world or
of how humans use them. People have become accustomed to using what
are officially partial or bad URLs.

Most web browsers will accept a simple string and attempt to find it,
even if it means adding a TLD.

ARPANET is pretty pointless now.

I’ve begun my own script to check whether a URL is correct, but only
for the human-readable variety. One of the biggest problems is the
transitory nature of URLs: they can change or disappear without notice.
Another problem is the path after a TLD. The path can be nearly
anything, and its start can only be identified as the first single /
after the apparent TLD.
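
As a small illustration of the missing-prefix case, user input can be normalised before it is handed to a parser. This is only a sketch under assumptions: `with_default_scheme` and `SCHEME_PREFIX` are hypothetical names, and defaulting to http is a guess about what the user meant, much like a browser makes.

```ruby
# Hypothetical helper: prepend a default scheme when the input has none.
# Matches a leading "scheme://" such as http:// or https://.
SCHEME_PREFIX = /\A[a-z][a-z0-9+.\-]*:\/\//i

def with_default_scheme(text, scheme = "http")
  text = text.strip
  text =~ SCHEME_PREFIX ? text : "#{scheme}://#{text}"
end

puts with_default_scheme("www.example.com/path")
puts with_default_scheme("https://example.com")
```

The helper only rewrites the string; whether the result is accepted is still up to the parser it is passed to.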

bfsog · August 29, 2007, 11:20pm

Andrew B. wrote:

    I think it is missing an important use case. In “http://3beers-wrk”,
    “3beers-wrk” isn’t a domain name, is it? It is an unqualified host
    name

That’s fair - it does mention that single unqualified hostnames should
work. I don’t have enough time right now at work to look at the RFC for
those - I’m not even sure there is one for them - and what that defines
as naming standards; that might be worth investigating.

    (I assume we’d pick the host name up from context). Now, the RFC
    also suggests that a host name must follow these rules (starting
    with a letter, etc.), and furthermore, that all components of a
    domain name must follow this convention, which suggests that the
    regexp in common.rb is also incorrect.

I think it does act correctly for qualified domain names, which is
important.

    Also, the solution of “rename the host” is a non-solution when
    dealing with customers who are using an otherwise perfectly
    acceptable hostname (I haven’t found a tool yet that will balk at a
    hostname beginning with a number).

That’s true :o)

    Now, I’m not sure if the RFCs have been replaced by newer versions -
    that would take some digging.

I’m relatively certain they have not.

    So, John, I’d say that this is a bug in URI.parse, since it follows
    neither the published RFCs nor the practical implementation of them
    today (as Coey points out). And if it follows neither, it’s really
    not a very good general-purpose function in the Ruby library and so
    should be fixed.

Together with Stephen B. IV’s suggestion to change TOPLABEL: if it is a
bug, maybe you should file it on the core mailing list and enquire?
Here’s a better fix:

$ ruby -v
ruby 1.8.5 (2006-08-25) [i486-linux]

diff for uri/common.rb:

56c56
< HOSTNAME = "(?:(?:#{DOMLABEL}\\.)+#{TOPLABEL}\\.?)|(?:#{DOMLABEL}?)"
---
> HOSTNAME = "(?:#{DOMLABEL}\\.)*#{TOPLABEL}\\.?"

If it’s a qualified domain name, enforce things as they were. If there
are no sub-domains or domains in front of a top-level domain, accept
the sub-domain naming standards (can start with a number) for a single,
unqualified hostname.

With that change:

irb(main):001:0> require 'uri'
=> true
irb(main):002:0> URI.parse('http://www.example.com')
=> #<URI::HTTP:0xfdbdf1726 URL:http://www.example.com>
irb(main):003:0> URI.parse('http://2.example.com')
=> #<URI::HTTP:0xfdbdf03ee URL:http://2.example.com>
irb(main):004:0> URI.parse('http://2test')
=> #<URI::HTTP:0xfdbdef250 URL:http://2test>
irb(main):005:0> URI.parse('http://2test.4bad')
URI::InvalidURIError: the scheme http does not accept registry part: 2test.4bad (or bad hostname?)
    from /usr/lib/ruby/1.8/uri/generic.rb:194:in `initialize'
    from /usr/lib/ruby/1.8/uri/http.rb:46:in `initialize'
    from /usr/lib/ruby/1.8/uri/common.rb:484:in `new'
    from /usr/lib/ruby/1.8/uri/common.rb:484:in `parse'
    from (irb):5
    from :0
irb(main):006:0>

Which should make everyone happy.
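
The relaxed pattern can also be tried in isolation, without editing the library. This is a sketch under assumptions: the module name `RelaxedHostname` and the `ALPHA`/`ALNUM` character classes are made up for the example, and the alternation is wrapped in a group and anchored so that it is tested against the whole string.

```ruby
# Stand-alone check of the relaxed HOSTNAME alternation.
# RelaxedHostname, ALPHA and ALNUM are assumptions made for this sketch.
module RelaxedHostname
  ALPHA = "a-zA-Z"
  ALNUM = "#{ALPHA}\\d"

  DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  TOPLABEL = "(?:[#{ALPHA}](?:[-#{ALNUM}]*[#{ALNUM}])?)"

  # Qualified names keep the strict top label; a single label may start
  # with a digit.
  HOSTNAME = "(?:(?:#{DOMLABEL}\\.)+#{TOPLABEL}\\.?)|(?:#{DOMLABEL}?)"

  # Group the alternation so the anchors apply to both branches.
  PATTERN = /\A(?:#{HOSTNAME})\z/

  def self.valid?(name)
    PATTERN.match(name) ? true : false
  end
end

%w[www.example.com 2.example.com 2test 3beers-wrk 2test.4bad].each do |name|
  puts "#{name}: #{RelaxedHostname.valid?(name)}"
end
```

Because the last DOMLABEL is optional in the pattern as posted, the empty string matches as well; a stricter variant could drop that trailing `?`.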

Unfortunately, you will have to edit your uri/common.rb file directly
for that. These are declared as constants, so you can override them by
redeclaring all the modules involved (you’ll have to redeclare several
patterns and regular expressions), but you will trigger warnings that
way.

Hope that helps,

Felix
