Problem with URI.parse

kaens · April 26, 2006, 5:09pm

Alright, I think I may have stumbled upon a bug, correct me if I’m
wrong.

I just wrote a script to pull linked-to files of a certain type off of
webpages. Granted, I’m still pretty new to Ruby... this really has me
stumped though. I’ve googled around, and looked through all the docs on
the relevant classes and I’m not getting anywhere with this.

I have a class that has two main ways of pulling the files - in one, you
give it an absolute URL to a page, and it searches through looking for
links ending with an extension, creates a list, and calls wget to get
them. In the other, it looks through a page of links to pages to see if
they have the filetype.

When I pass it a page using the absolute way, it works fine. If I pass
it in the other way, I get

/usr/lib/ruby/1.8/uri/common.rb:432:in `split': bad URI(is not URI?): "www.thisistheurlblahexample.example" (URI::InvalidURIError)
        from /usr/lib/ruby/1.8/uri/common.rb:481:in `parse'

Alright. So it’s not a properly formed URI, right…or something. Thing
is, if I copy and paste the erroneous URL into the call for the absolute
method, it works fine. I figure I must be missing something critical
here - like data changing when it’s passed from function to function
within a class? I don’t know.

Here’s what I’m thinking is the relevant code: (sorry if it’s shoddy,
again I am pretty new at Ruby)

require 'net/http'
require 'uri'

#if you want to grab multiple filetypes, make filetype the form of "(f1|f2|f3)"
class Grabber
  def initialize(url, filetype)
    @url = url
    @filetype = filetype
    @filelist = String::new
  end

  def linkCrawl
    page = Net::HTTP.get URI.parse("#{@url}")
    page.each do |line|
      #check for a link
      if line =~ %r|<a\shref\s*="http.*">|i
        #make the list of files from that page
        #sorry about this, people. I know it's ugly. was trying things here to
        #see if it made a difference if I changed the string beforehand instead
        #of calling it inline. it didn't, and it really shouldn't...
        #frustration
        line.gsub!(%r|<a\shref=|, "")
        line.gsub!(%r|^\s+|, "")
        print "line: #{line}\n"
        createList("#{line}")
      end
    end
    print "filelist: #{@filelist}\n"
    exec "wget #{@filelist}"
  end

  def createList(url)
    page = Net::HTTP.get URI.parse(url)
    page.each do |line|
      #check for a link containing one of the filetypes
      if line =~ %r|.*<a\shref\s*=".*#{@filetype}".*>.*|i
        #strip the url of the filetype out of the html
        @filelist.concat "#{line.slice(%r|<a\shref=".*#{@filetype}"|i).gsub!(%r|<a href=|i, "")} "
      end
    end
  end

  def grabFiles
    createList("#{@url}")
    exec "wget #{@filelist}"
  end
end

test = Grabber::new("http://urlgoeshere.com", "(f1|f2|f3)")
test.linkCrawl

##the one below here always works
#test = Grabber::new("http:urlgoeshere.com", "(f1|f2|f3)")
#test.grabFiles

kaens · April 26, 2006, 7:06pm

On Apr 26, 2006, at 5:09 PM, Jeremiah D. wrote:

> Alright, I think I may have stumbled upon a bug, correct me if I’m
> wrong.
>
> I just wrote a script to pull linked-to files of a certain type off of
> webpages. Granted, I’m still pretty new to Ruby... this really has me
> stumped though. I’ve googled around, and looked through all the docs on
> the relevant classes and I’m not getting anywhere with this.

You really shouldn’t be trying to parse html with regular
expressions, there are a few libraries to do this available. Instead
of using an external program (wget), you can also download the URI in
ruby.

Trying your code I get:

/uri/common.rb:432:in `split': bad URI(is not URI?): here (URI::InvalidURIError)

which seems reasonable.
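
A likely reason the string is rejected: the two gsub! calls in linkCrawl only strip `<a href=` and leading whitespace, so the quotes, the link text and the closing tag are still on the line when it reaches URI.parse. Here is a small sketch of that, not the original script. The sample line of HTML and the example.com address are made up for illustration.

```ruby
require 'uri'

# Made-up line of HTML standing in for one line of the fetched page.
line = %(  <a href="http://www.example.com/files/">here</a>\n)

# The same two substitutions linkCrawl performs.
line.gsub!(%r|<a\shref=|, "")
line.gsub!(%r|^\s+|, "")

# The line still carries the quotes, the link text and the closing tag,
# so URI.parse refuses it.
begin
  URI.parse(line)
  parse_failed = false
rescue URI::InvalidURIError
  parse_failed = true
end

# Capturing only what sits between the quotes leaves something parseable.
href = line[/"([^"]*)"/, 1]
parsed = URI.parse(href)
```

Printing the line with `p line` instead of `print` shows stray quotes and newlines that are easy to miss in normal output.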

Here is an example using RubyfulSoup:

require 'open-uri'
require 'fileutils'
require 'uri'
require 'rubygems' # http://docs.rubygems.org/
require 'rubyful_soup' # Rubyful Soup: "The brush has got entangled in it!"
                       # -- sudo gem install rubyful_soup

class Grabber
  def initialize(uri, file_types=[])
    @uri = uri
    @file_types_re = %r{#{Regexp.union(*file_types)}$}
  end

  def grab_files
    find_uris.each do |link|
      begin
        data = open(link) { |a| a.read }
        file_path = link.host + link.path
        FileUtils.mkdir_p(File.dirname(file_path))
        open(file_path, 'wb') { |f| f.write(data) }
      rescue Exception => e
        $stderr.puts "#{e.class}: #{e}"
      end
    end
  end

  def find_uris
    soup = BeautifulSoup.new(open(@uri) { |f| f.read })
    soup.find_all('a') { |a| a['href'] =~ @file_types_re }.map do |a|
      uri = URI.parse(a['href'])
      # Create an absolute uri
      uri.host ? uri : URI.join(@uri, uri)
    end
  end
end

Grabber.new('http://google.com', %w{html}).grab_files

__END__

– Daniel
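
To make the file-type filter in the example above easier to follow, here is a short sketch of what the `Regexp.union` expression builds and which hrefs it accepts. The file names are invented, and the stricter pattern at the end is an assumption of mine, not part of the code above.

```ruby
file_types = %w{html pdf}

# The same expression as in Grabber#initialize: "html or pdf, at end of line".
file_types_re = %r{#{Regexp.union(*file_types)}$}

matches_pdf    = !!("docs/manual.pdf" =~ file_types_re)
matches_png    = !!("images/logo.png" =~ file_types_re)
# No dot is required, so anything that merely ends in "pdf" also matches.
matches_no_dot = !!("downloads/notapdf" =~ file_types_re)

# A stricter variant (an assumption): require a literal dot before the
# extension and anchor at the very end of the string.
strict_re = /\.(?:#{Regexp.union(*file_types).source})\z/
strict_pdf    = !!("docs/manual.pdf" =~ strict_re)
strict_no_dot = !!("downloads/notapdf" =~ strict_re)
```

With the default `file_types=[]`, `Regexp.union` is called with no arguments and returns a pattern that never matches, so no links would be collected.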

kaens · April 26, 2006, 8:27pm

On Apr 26, 2006, at 8:09 AM, Jeremiah D. wrote:

> I have a class that has two main ways of pulling the files - in
> /usr/lib/ruby/1.8/uri/common.rb:432:in `split': bad URI(is not URI?): "www.thisistheurlblahexample.example" (URI::InvalidURIError)
>         from /usr/lib/ruby/1.8/uri/common.rb:481:in `parse'
>
> Alright. So it’s not a properly formed URI, right…or something.
> Thing is, if I copy and paste the erroneous URL into the call for the
> absolute method, it works fine. I figure I must be missing something
> critical here - like data changing when it’s passed from function to
> function within a class? I don’t know.

A URI has a protocol scheme.
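
For illustration, a small sketch of what the scheme changes. It assumes a current Ruby, where URI.parse accepts a scheme-less string but treats it as a relative path with no host (Ruby 1.8 may behave differently). `ensure_scheme` is a hypothetical helper, and the example.com addresses are placeholders.

```ruby
require 'uri'

with_scheme    = URI.parse("http://www.example.com/index.html")
without_scheme = URI.parse("www.example.com/index.html")

# With a scheme, the host is recognised. Without one, the whole string is
# taken as a path, so there is no host for Net::HTTP to connect to.
scheme_a = with_scheme.scheme
host_a   = with_scheme.host
scheme_b = without_scheme.scheme
host_b   = without_scheme.host

# Hypothetical helper: prepend a default scheme when none is present.
def ensure_scheme(str, default = "http")
  str =~ %r{\A[a-z][a-z0-9+.\-]*://}i ? str : "#{default}://#{str}"
end

fixed = URI.parse(ensure_scheme("www.example.com/index.html"))
```

Checking `uri.host` before calling Net::HTTP.get is a cheap way to catch scheme-less strings early.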

>     @url = url
>     @filetype = filetype
>     @filelist = String::new

      @filelist = ''

>   end
>
>   def linkCrawl
>     page = Net::HTTP.get URI.parse("#{@url}")

      page = Net::HTTP.get URI.parse(@url)

>         #frustration

If it is ugly you should fix it. Don’t leave broken windows.

>         line.gsub!(%r|<a\shref=|, "")
>         line.gsub!(%r|^\s+|, "")
>         print "line: #{line}\n"

          puts "line: #{line}"

>         createList("#{line}")

          createList line

>       end
>     end
>     print "filelist: #{@filelist}\n"

      puts "filelist: #{@filelist}"

>     exec "wget #{@filelist}"
>   end
>
>   def createList(url)

      url = @url + url

>     end
>   end
>
>   def grabFiles
>     createList("#{@url}")

      createList @url
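
The `url = @url + url` suggestion covers links that are relative to the page they were found on. Plain concatenation only works if the base ends in a slash and the link is a bare relative path. URI.join from the standard library resolves the link the way a browser would. A minimal sketch, with made-up addresses:

```ruby
require 'uri'

base = "http://www.example.com/downloads/index.html"

# A path relative to the page's directory.
relative = URI.join(base, "files/report.pdf").to_s

# A path relative to the site root.
rooted = URI.join(base, "/archive/old.zip").to_s

# An already absolute link is returned unchanged.
absolute = URI.join(base, "http://mirror.example.org/a.tar.gz").to_s

# For comparison, what string concatenation produces with the same base.
concatenated = base + "files/report.pdf"
```

URI.join raises URI::InvalidURIError for malformed input, so the href still has to be extracted cleanly from the HTML first.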

–
Posted via http://www.ruby-forum.com/.

–
Eric H. - http://blog.segment7.net
This implementation is HODEL-HASH-9600 compliant

http://trackmap.robotcoop.com