Forum: Ruby Creating a canonicalized url

Posted by Dan Cuddeford (cudds)
on 2008-01-24 13:14
Hello there guys,

I'm trying to track down an easy way to canonicalize a URL from within
Ruby. I've been looking around for this, but all I can find are some
procedural hacks such as

    # canonicalize the url
    if ($url -notmatch "^[a-z]+://") { $url = "http://$url" }

which isn't going to take into account everything according to RFC 2396

    * Remove all leading and trailing dots.
    * Replace consecutive dots with a single dot.
    * If the hostname can be parsed as an IP address, it should be
      normalized to 4 dot-separated decimal values. The client should
      handle any legal IP address encoding, including octal, hex, and
      fewer than 4 components.
    * Lowercase the whole string.
    * The sequences "/../" and "/./" in the path should be resolved, by
      replacing "/./" with "/", and removing "/../" along with the
      preceding path component.
    * Runs of consecutive slashes should be replaced with a single
      slash character.
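
For reference, the hostname rules in the list above can be sketched in
plain Ruby roughly as follows. This is only a sketch: the helper names
canonicalize_host and parse_ipv4 are made up for illustration, and the
treatment of short forms (the last component fills the remaining bytes,
the way the classic inet_aton does) is an assumption about what "fewer
than 4 components" means.

```ruby
# Hypothetical helper: strip leading/trailing dots, collapse runs of
# dots, lowercase, and rewrite IPv4-looking hosts as a dotted quad.
def canonicalize_host(host)
  cleaned = host.gsub(/\A\.+|\.+\z/, '').squeeze('.').downcase
  parse_ipv4(cleaned) || cleaned
end

# Hypothetical helper: returns "a.b.c.d" if +host+ is an IPv4 address
# written with 1 to 4 decimal, octal (leading 0) or hex (leading 0x)
# components, otherwise nil. Expects an already lowercased host.
def parse_ipv4(host)
  parts = host.split('.', -1)
  return nil unless (1..4).cover?(parts.size)

  numbers = parts.map do |part|
    return nil unless part =~ /\A(?:0x[0-9a-f]+|[0-9]+)\z/
    begin
      Integer(part)          # honours the 0x and leading-0 prefixes
    rescue ArgumentError     # e.g. "08" is not valid octal
      return nil
    end
  end

  *leading, last = numbers
  return nil if leading.any? { |n| n > 255 }
  # Assumption: the last component covers all remaining bytes.
  return nil if last >= 256**(4 - leading.size)

  value = last
  leading.each_with_index { |n, i| value += n << (8 * (3 - i)) }
  [24, 16, 8, 0].map { |shift| (value >> shift) & 0xff }.join('.')
end

puts canonicalize_host('..WWW.Ruby-Lang..ORG.')  # => www.ruby-lang.org
puts canonicalize_host('0x7f.1')                 # => 127.0.0.1
```

Note that a purely numeric host such as "123" is treated as an address
here, which may or may not be what you want.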

So is there a method out there for this?
Posted by Rob Biedenharn (Guest)
on 2008-01-24 14:24
(Received via mailing list)
On Jan 24, 2008, at 7:14 AM, Dan Cuddeford wrote:

>    *   Remove all leading and trailing dots
> path component.
> # Runs of consecutive slashes should be replaced with a single slash
> character.
>
> So is there a method out there for this?

I'd start by looking at URI, in particular URI.parse.

$ fri URI#parse
------------------------------------------------------------- URI::parse
      URI::parse(uri)
------------------------------------------------------------------------
      Synopsis
        URI::parse(uri_str)

      Args
      +uri_str+: String with URI.

      Description
      Creates one of the URI's subclasses instance from the string.

      Raises
      URI::InvalidURIError

        Raised if URI given is not a correct one.

      Usage
        require 'uri'

        uri = URI.parse("http://www.ruby-lang.org/")
        p uri
        # => #<URI::HTTP:0x202281be URL:http://www.ruby-lang.org/>
        p uri.scheme
        # => "http"
        p uri.host
        # => "www.ruby-lang.org"

As for the "Lowercase the whole string" part, only the domain is
required to be case-insensitive.  It is possible for the underlying
web server to ignore case when finding a path, but the URI is not
necessarily a reference to the same resource if the case is altered.

-Rob

Rob Biedenharn    http://agileconsultingllc.com
Rob@AgileConsultingLLC.com
Posted by Jean-François Trân (Guest)
on 2008-01-24 15:07
(Received via mailing list)
2008/1/24, Rob Biedenharn <Rob@agileconsultingllc.com>:

> As for the "Lowercase the whole string" part, only the domain is
> required to be case-insensitive.  It is possible for the underlying
> web server to ignore case when finding a path, but the URI is not
> necessarily a reference to the same resource if the case is altered.

There's URI#normalize and URI#normalize! to downcase the host
part of the url.

   -- Jean-François.
Posted by Dan Cuddeford (cudds)
on 2008-01-24 15:12
Thanks for your help - I'll let you know how I get on
Posted by Dan Cuddeford (cudds)
on 2008-01-24 15:23
So it seems using the two together

    require 'uri'

    uri = URI.parse("http://www.ruBy-lang.org/ARSE")

    can = uri.normalize
    p can

    p can.host

    p can.path

means the path keeps its case sensitivity but the host is normalized.

I think that's it - however,

try it with ruby-lang..org and

/usr/lib/ruby/1.8/uri/generic.rb:195:in `initialize': the scheme http 
does not accept registry part: www.ruBy-lang..org (or bad hostname?) 
(URI::InvalidURIError)
        from /usr/lib/ruby/1.8/uri/http.rb:78:in `initialize'
        from /usr/lib/ruby/1.8/uri/common.rb:488:in `new'
        from /usr/lib/ruby/1.8/uri/common.rb:488:in `parse'
        from canon.rb:3

So I guess it needs a bit of error checking beforehand.
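
One way that up-front checking might look is sketched below. The helper
name safe_normalize is made up for illustration, and collapsing runs of
dots in the host before parsing is an assumption taken from the list in
the first post, not something URI does for you. The trace above is from
Ruby 1.8; other versions of the URI library may accept or reject that
host differently, so the sketch does not rely on it.

```ruby
require 'uri'

# Hypothetical helper: returns a normalized URI object, or nil when the
# string cannot be parsed as a URI at all.
def safe_normalize(url)
  # Collapse runs of dots in the authority part (between "scheme://"
  # and the first "/", "?" or "#") before handing the string to URI.
  cleaned = url.sub(%r{\A([a-z][a-z0-9+.-]*://)([^/?#]*)}i) do
    $1 + $2.squeeze('.')
  end
  URI.parse(cleaned).normalize
rescue URI::InvalidURIError
  nil
end

p safe_normalize("http://www.ruBy-lang..org/ARSE")
p safe_normalize("http://bad host/")   # contains a space, so nil
```

Returning nil keeps the example short; raising your own error with the
offending string in the message would work just as well.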
Posted by Rob Biedenharn (Guest)
on 2008-01-24 17:06
(Received via mailing list)
On Jan 24, 2008, at 9:23 AM, Dan Cuddeford wrote:

>  p can.host
> /usr/lib/ruby/1.8/uri/generic.rb:195:in `initialize': the scheme http
> does not accept registry part: www.ruBy-lang..org (or bad hostname?)
> (URI::InvalidURIError)
>        from /usr/lib/ruby/1.8/uri/http.rb:78:in `initialize'
>        from /usr/lib/ruby/1.8/uri/common.rb:488:in `new'
>        from /usr/lib/ruby/1.8/uri/common.rb:488:in `parse'
>        from canon.rb:3
>
> So I guess it needs a bit or error checking before hand.

require 'uri'

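def canonicalize(uri)
   # Accept either a URI object or anything that converts to a string.
   u = uri.kind_of?(URI) ? uri : URI.parse(uri.to_s)
   u.normalize!   # downcases the scheme and host
   newpath = u.path.dup
   # Collapse "segment/.." pairs until the path stops changing. A bare
   # `while gsub!` never terminates on a run like "/../..", because the
   # block hands that match back unchanged.
   loop do
     collapsed = newpath.gsub(%r{([^/]+)/\.\./?}) { |match|
       $1 == '..' ? match : ''
     }
     break if collapsed == newpath
     newpath = collapsed
   end
   # Drop "/./" segments and a trailing "/.".
   newpath = newpath.gsub(%r{/\./}, '/').sub(%r{/\.\z}, '/')
   u.path = newpath
   u.to_s
end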

canonicalize('http://www.Ruby-Lang.ORG/ARSE/done/../../rear/./end/.')
=> "http://www.ruby-lang.org/rear/end/"
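
An alternative to the regexp approach is to walk the path segment by
segment, which also takes care of the "runs of consecutive slashes"
rule from the original list. A rough sketch, assuming an absolute path;
the helper name clean_path is made up for illustration:

```ruby
require 'uri'

# Hypothetical helper: resolves "." and ".." segments and collapses
# repeated slashes in an absolute path.
def clean_path(path)
  # A path ending in "/", "/." or "/.." still names a directory.
  trailing_slash = path.end_with?('/', '/.', '/..')
  stack = []
  path.split('/').each do |segment|
    case segment
    when '', '.' then next       # empty segments come from "//"
    when '..'    then stack.pop  # ".." above the root is dropped
    else stack.push(segment)
    end
  end
  result = '/' + stack.join('/')
  result += '/' if trailing_slash && result != '/'
  result
end

u = URI.parse('http://www.Ruby-Lang.ORG//ARSE/done/../../rear/./end/.')
u.normalize!
u.path = clean_path(u.path)
puts u.to_s   # => http://www.ruby-lang.org/rear/end/
```

Using a stack means a stray leading ".." is simply discarded rather
than needing a special case.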

-Rob

Rob Biedenharn    http://agileconsultingllc.com
Rob@AgileConsultingLLC.com
Posted by Dan Cuddeford (cudds)
on 2008-01-24 17:38
Wow - thanks for the answer mate!
Posted by Jörg W Mittag (Guest)
on 2008-01-24 20:30
(Received via mailing list)
Dan Cuddeford wrote:
> Wow - thanks for the answer mate!

There's also the Addressable Gem: <http://Addressable.RubyForge.Org/>.

It's intended as a standards compliant replacement for the stdlib's
URI library. Take a look into the test directory of that sucker: over
440 Unit Tests (actually, Object Examples) for a frickin' URI parser!
(See: <http://Addressable.RubyForge.Org/specdoc/>) That guy is nuts!
That code's gotta be as rock-solid as it gets.

Oh, and back to the topic at hand: it has a normalize method built in:

  begin
    require 'rubygems'
    gem 'addressable'
  rescue LoadError; end
  require 'addressable/uri'
  uri = Addressable::URI.heuristic_parse(
    'www.Ruby-Lang..ORG/ARSE/done/../../r e a r/./end/.#exit')
  uri.normalize!
  puts uri.display_uri
  # => http://www.ruby-lang..org/r%20e%20a%20r/end/#exit

jwm
Posted by Dan Cuddeford (cudds)
on 2008-01-25 16:27
Jörg W Mittag wrote:
>   puts uri.display_uri # => 
> http://www.ruby-lang..org/r%20e%20a%20r/end/#exit
> 
> jwm

Nice but shouldn't it go to ruby-lang.org?
Posted by Jörg W Mittag (Guest)
on 2008-01-26 01:25
(Received via mailing list)
Dan Cuddeford wrote:
> Jörg W Mittag wrote:
>>   puts uri.display_uri # => 
>> http://www.ruby-lang..org/r%20e%20a%20r/end/#exit
> Nice but shouldn't it go to ruby-lang.org?

I'm not sure. I just scanned RFC 3986 and RFC 1034 and I'm not even sure
that's a valid URI host part to begin with. *If* it's invalid, then
there's not much a URI normalizer can do, right?

However, I could be wrong. Reading RFCs is not exactly my specialty.

jwm