Hello there guys,
I'm trying to track down an easy way to canonicalize a URL from within
Ruby. I've been looking around for this but all I can find are some
procedural hacks such as
# canonicalize the url
if ($url -notmatch "^[a-z]+://") { $url = "http://$url" }
which doesn't take into account everything according to RFC 2396:
* Remove all leading and trailing dots
* Replace consecutive dots with a single dot.
* If the hostname can be parsed as an IP address, it should be
normalized to 4 dot-separated decimal values. The client should handle
any legal IP address encoding, including octal, hex, and fewer than 4
components.
* Lowercase the whole string.
* The sequences "/../" and "/./" in the path should be resolved, by
replacing "/./" with "/", and removing "/../" along with the preceding
path component.
* Runs of consecutive slashes should be replaced with a single slash
character.
So is there a method out there for this?
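A minimal sketch of the hostname rules listed above, for illustration only. The helper names `clean_host` and `normalize_ip` are hypothetical and not part of any library, and the IP handling assumes the usual inet_aton-style convention in which the last component fills whatever bytes remain.

```ruby
# Hypothetical helper: lowercase the host, collapse runs of dots,
# then strip leading and trailing dots.
def clean_host(host)
  host.downcase.gsub(/\.{2,}/, '.').gsub(/\A\.+|\.+\z/, '')
end

# Hypothetical helper: returns a dotted-decimal string if the host looks
# like an IPv4 address in decimal, octal (leading 0) or hex (leading 0x)
# form with 1 to 4 components; returns nil otherwise.
def normalize_ip(host)
  parts = host.split('.', -1)
  return nil if parts.empty? || parts.size > 4

  nums = parts.map do |part|
    return nil unless part =~ /\A(0x[0-9a-f]+|0[0-7]*|[1-9][0-9]*)\z/i
    Integer(part) # Integer() understands the 0x and leading-0 prefixes
  end

  last = nums.pop
  return nil if nums.any? { |n| n > 255 }

  remaining = 4 - nums.size            # bytes the last component must fill
  return nil if last >= 256**remaining

  value = nums.inject(0) { |acc, n| (acc << 8) | n }
  value = (value << (8 * remaining)) | last
  [24, 16, 8, 0].map { |shift| (value >> shift) & 0xff }.join('.')
end
```

With these definitions, `clean_host('..www.Ruby-Lang..ORG.')` gives `"www.ruby-lang.org"`, and `normalize_ip('0x7f.1')` gives `"127.0.0.1"`, while an ordinary hostname makes `normalize_ip` return nil.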
on 2008-01-24 13:14
on 2008-01-24 14:24
On Jan 24, 2008, at 7:14 AM, Dan Cuddeford wrote:
> * Remove all leading and trailing dots
> path component.
> # Runs of consecutive slashes should be replaced with a single slash
> character.
>
> So is there a method out there for this?

I'd start looking at URI, in particular, URI#parse.

$ fri URI#parse
------------------------------------------------------------- URI::parse
     URI::parse(uri)
------------------------------------------------------------------------
     Synopsis
       URI::parse(uri_str)

     Args
     +uri_str+: String with URI.

     Description
     Creates one of the URI's subclasses instance from the string.

     Raises
     URI::InvalidURIError
       Raised if URI given is not a correct one.

     Usage
       require 'uri'

       uri = URI.parse("http://www.ruby-lang.org/")
       p uri
       # => #<URI::HTTP:0x202281be URL:http://www.ruby-lang.org/>
       p uri.scheme
       # => "http"
       p uri.host
       # => "www.ruby-lang.org"

As for the "Lowercase the whole string" part, only the domain is
required to be case-insensitive. It is possible for the underlying
web server to ignore case when finding a path, but the URI is not
necessarily a reference to the same resource if the case is altered.

-Rob

Rob Biedenharn http://agileconsultingllc.com
Rob@AgileConsultingLLC.com
on 2008-01-24 15:07
2008/1/24, Rob Biedenharn <Rob@agileconsultingllc.com>:
> As for the "Lowercase the whole string" part, only the domain is
> required to be case-insensitive. It is possible for the underlying
> web server to ignore case when finding a path, but the URI is not
> necessarily a reference to the same resource if the case is altered.

There's URI#normalize and URI#normalize! to downcase the host part of
the url.

-- Jean-François.
on 2008-01-24 15:23
So it seems using the two together
require 'uri'
uri = URI.parse("http://www.ruBy-lang.org/ARSE")
can = uri.normalize
p can
p can.host
p can.path
means the path keeps its case sensitivity but the host is normalized.
I think that's it - however,
try it with ruby-lang..org and
/usr/lib/ruby/1.8/uri/generic.rb:195:in `initialize': the scheme http
does not accept registry part: www.ruBy-lang..org (or bad hostname?)
(URI::InvalidURIError)
from /usr/lib/ruby/1.8/uri/http.rb:78:in `initialize'
from /usr/lib/ruby/1.8/uri/common.rb:488:in `new'
from /usr/lib/ruby/1.8/uri/common.rb:488:in `parse'
from canon.rb:3
So I guess it needs a bit of error checking beforehand.
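A rough sketch of what that error checking could look like, assuming it is acceptable to collapse repeated dots in the host before parsing and to get nil back for anything URI still rejects. The name `parse_leniently` is made up for this example. The trace above is from Ruby 1.8; other Ruby versions may accept the double-dot host without raising, so the rescue is there as a safety net rather than a guarantee of the same behaviour.

```ruby
require 'uri'

# Hypothetical helper: tidy the authority part, then parse and normalize.
# Returns a URI object, or nil if the string still cannot be parsed.
def parse_leniently(str)
  # Collapse runs of dots, but only between "scheme://" and the next /, ? or #.
  cleaned = str.sub(%r{\A([a-z][a-z0-9+.-]*://)([^/?#]*)}i) do
    scheme, authority = $1, $2
    scheme + authority.gsub(/\.{2,}/, '.')
  end
  URI.parse(cleaned).normalize
rescue URI::InvalidURIError
  nil
end
```

Called with "http://www.ruBy-lang..org/ARSE", this returns a URI whose host is lowercased with the double dot collapsed and whose path keeps its case.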
on 2008-01-24 17:06
On Jan 24, 2008, at 9:23 AM, Dan Cuddeford wrote:
> p can.host
> /usr/lib/ruby/1.8/uri/generic.rb:195:in `initialize': the scheme http
> does not accept registry part: www.ruBy-lang..org (or bad hostname?)
> (URI::InvalidURIError)
> from /usr/lib/ruby/1.8/uri/http.rb:78:in `initialize'
> from /usr/lib/ruby/1.8/uri/common.rb:488:in `new'
> from /usr/lib/ruby/1.8/uri/common.rb:488:in `parse'
> from canon.rb:3
>
> So I guess it needs a bit or error checking before hand.

require 'uri'

def canonicalize(uri)
  u = uri.kind_of?(URI) ? uri : URI.parse(uri.to_s)
  u.normalize!
  newpath = u.path
  while newpath.gsub!(%r{([^/]+)/\.\./?}) { |match|
      $1 == '..' ? match : ''
    } do end
  newpath = newpath.gsub(%r{/\./}, '/').sub(%r{/\.\z}, '/')
  u.path = newpath
  u.to_s
end

canonicalize('http://www.Ruby-Lang.ORG/ARSE/done/../../rear/./end/.')
=> "http://www.ruby-lang.org/rear/end/"

-Rob

Rob Biedenharn http://agileconsultingllc.com
Rob@AgileConsultingLLC.com
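The original list also asked for runs of consecutive slashes to be collapsed, which the regex approach above does not attempt. As an alternative sketch (not Rob's method), the path can be split into segments and rebuilt with a stack. The name `clean_path` is hypothetical, and keeping a trailing slash when the path ends in "/", "/." or "/.." is an assumption made to match the output shown above.

```ruby
# Hypothetical helper: resolve "." and ".." segments and collapse
# repeated slashes in an absolute path.
def clean_path(path)
  # Remember whether the result should end in a slash.
  trailing = path.end_with?('/') || path =~ %r{/\.\.?\z}
  out = []
  path.split('/').each do |seg|
    case seg
    when '', '.' then next      # empty segment (from "//") or "."
    when '..'    then out.pop   # drop the preceding component, if any
    else out << seg
    end
  end
  result = '/' + out.join('/')
  result += '/' if trailing && result != '/'
  result
end
```

For the path in the example above, `clean_path('/ARSE/done/../../rear/./end/.')` gives `"/rear/end/"`, and `clean_path('//a///b')` gives `"/a/b"`.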
on 2008-01-24 20:30
Dan Cuddeford wrote:
> Wow - thanks for the answer mate!
There's also the Addressable Gem: <http://Addressable.RubyForge.Org/>.
It's intended as a standards compliant replacement for the stdlib's
URI library. Take a look into the test directory of that sucker: over
440 Unit Tests (actually, Object Examples) for a frickin' URI parser!
(See: <http://Addressable.RubyForge.Org/specdoc/>) That guy is nuts!
That code's gotta be as rock-solid as it gets.
Oh, and back to the topic at hand: it has a normalize method built in:
begin
require 'rubygems'
gem 'addressable'
rescue LoadError; end
require 'addressable/uri'
uri = Addressable::URI.heuristic_parse('www.Ruby-Lang..ORG/ARSE/done/../../r e a r/./end/.#exit')
uri.normalize!
puts uri.display_uri # => http://www.ruby-lang..org/r%20e%20a%20r/end/#exit
jwm
on 2008-01-25 16:27
Jörg W Mittag wrote:
> puts uri.display_uri # =>
> http://www.ruby-lang..org/r%20e%20a%20r/end/#exit
>
> jwm

Nice but shouldn't it go to ruby-lang.org?
on 2008-01-26 01:25
Dan Cuddeford wrote:
> Jörg W Mittag wrote:
>> puts uri.display_uri # =>
>> http://www.ruby-lang..org/r%20e%20a%20r/end/#exit
> Nice but shouldn't it go to ruby-lang.org?

I'm not sure. I just scanned RfC3986 and RfC1034 and I'm not even sure
that's a valid URI host part to begin with. *If* it's invalid, then
there's not much a URI normalizer can do, right?

However, I could be wrong. Reading RfCs is not exactly my specialty.

jwm