Forum: Ruby Problem with URI.parse

Jeremiah D. (Guest)
on 2006-04-26 19:09
Alright, I think I may have stumbled upon a bug, correct me if I'm
wrong.

I just wrote a script to pull linked-to files of a certain type off of
webpages. Granted, I'm still pretty new to Ruby... this really has me
stumped though. I've googled around, and looked through all the docs on
the relevant classes and I'm not getting anywhere with this.

I have a class that has two main ways of pulling the files - in one, you
give it an absolute URL to a page, and it searches through looking for
links ending with an extension, creates a list, and calls wget to get
them. In the other, it looks through a page of links to pages to see if
they have the filetype.

When I pass it a page using the absolute way, it works fine. If I pass
it in the other way, I get

"usr/lib/ruby/1.8/uri/common.rb:432:in `split': bad URI(is not URI?):
"www.thisistheurlblahexample.example"  (URI::InvalidURIError)
        from /usr/lib/ruby/1.8/uri/common.rb:481:in `parse'"

Alright. So it's not a properly formed URI, right...or something. Thing
is, if I copy and paste the erroneous url into the call for the absolute
method, it works fine. I figure I must be missing something critical
here - like data changing when it's passed from function to function
within a class? I don't know.

Here's what I'm thinking is the relevant code: (sorry if it's shoddy,
again I am pretty new at Ruby)

require 'net/http'
require 'uri'

#if you want to grab multiple filetypes, make filetype of the form "(f1|f2|f3)"
class Grabber
  def initialize(url, filetype)
    @url = url
    @filetype = filetype
    @filelist = String::new
  end

  def linkCrawl
    page = Net::HTTP.get URI.parse("#{@url}")
    page.each do |line|
      #check for a link
      if line =~ %r|<a\shref\s*="http.*">|i
        #make the list of files from that page
        #sorry about this, people. I know it's ugly. was trying things here to
        #see if it made a difference if I changed the string beforehand instead
        #of calling it inline. it didn't, and it really shouldn't.....
        #frustration
        line.gsub!(%r|<a\shref=|, "")
        line.gsub!(%r|^\s+|, "")
        print "line: #{line}\n"
        createList("#{line}")
      end
    end
    print "filelist: #{@filelist}\n"
    exec "wget #{@filelist}"
  end

  def createList(url)
    page = Net::HTTP.get URI.parse(url)
    page.each do |line|
      #check for a link containing one of the filetypes
      if line =~ %r|.*<a\shref\s*=".*#{@filetype}".*>.*|i
        #strip the url of the filetype out of the html
        @filelist.concat "#{line.slice(%r|<a\shref=".*#{@filetype}"|i).gsub!(%r|<a href=|i, "")} "
      end
    end
  end

  def grabFiles
    createList("#{@url}")
    exec "wget #{@filelist}"
  end
end

test = Grabber::new("http://urlgoeshere.com", "(f1|f2|f3)")
test.linkCrawl

##the one below here always works
#test = Grabber::new("http://urlgoeshere.com", "(f1|f2|f3)")
#test.grabFiles
Daniel H. (Guest)
on 2006-04-26 21:06
(Received via mailing list)
On Apr 26, 2006, at 5:09 PM, Jeremiah D. wrote:

> Alright, I think I may have stumbled upon a bug, correct me if I'm
> wrong.
>
> I just wrote a script to pull linked-to files of a certain type off of
> webpages. Granted, I'm still pretty new to Ruby... this really has me
> stumped though. I've googled around, and looked through all the docs on
> the relevant classes and I'm not getting anywhere with this.

You really shouldn't try to parse HTML with regular expressions; there
are a few libraries available for this. Instead of using an external
program (wget), you can also download the URI in Ruby.

Trying your code I get:

     (URI::InvalidURIError)/uri/common.rb:432:in `split': bad URI(is not URI?): <A HREF="http://www.google.com/">here</A>

which seems reasonable.
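
To make that concrete, here is a minimal sketch of what goes wrong and one way around it. The sample line is the one from the error above; the `valid_uri?` and `extract_href` helpers and the regex are assumptions made for illustration, not part of the original script, and a regex like this is only good enough for a quick demonstration.

```ruby
require 'uri'

# Hypothetical helper (not in the original script): true if URI.parse
# accepts the text, false if it raises URI::InvalidURIError.
def valid_uri?(text)
  URI.parse(text)
  true
rescue URI::InvalidURIError
  false
end

# Hypothetical helper: pull the quoted href value out of a line of HTML,
# or return nil when the line contains no such link.
def extract_href(line)
  match = line.match(/<a\s+href\s*=\s*"([^"]*)"/i)
  match && match[1]
end

line = '<A HREF="http://www.google.com/">here</A>'

# The whole line is markup, not a URI, so the parser rejects it.
puts valid_uri?(line)

# Only the href value should be handed to URI.parse.
href = extract_href(line)
puts valid_uri?(href)
puts URI.parse(href).host
```

In the original linkCrawl, the gsub! calls strip the leading `<a href=` but leave the quotes and the rest of the tag in place, which is why the string passed on is still not a URI.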

Here is an example using RubyfulSoup:

require 'open-uri'
require 'fileutils'
require 'uri'
require 'rubygems' # http://docs.rubygems.org/
require 'rubyful_soup' # http://www.crummy.com/software/RubyfulSoup/ -- sudo gem install rubyful_soup

class Grabber
   def initialize(uri, file_types=[])
     @uri = uri
     @file_types_re = %r{#{Regexp.union(*file_types)}$}
   end

   def grab_files
     find_uris.each do |link|
       begin
         data = open(link) { |a| a.read }
         file_path = link.host + link.path
         FileUtils.mkdir_p(File.dirname(file_path))
         open(file_path, 'wb') { |f| f.write(data) }
       rescue StandardError => e
         $stderr.puts "#{e.class}: #{e}"
       end
     end
   end

   def find_uris
     soup = BeautifulSoup.new(open(@uri) { |f| f.read })
     soup.find_all('a') { |a| a['href'] =~ @file_types_re }.map do |a|
       uri = URI.parse(a['href'])
       # Create an absolute uri
       uri.host ? uri : URI.join(@uri, uri)
     end
   end
end

Grabber.new('http://google.com', %w{html}).grab_files

__END__

-- Daniel
Eric H. (Guest)
on 2006-04-26 22:27
(Received via mailing list)
On Apr 26, 2006, at 8:09 AM, Jeremiah D. wrote:

> I have a class that has two main ways of pulling the files - in
> "usr/lib/ruby/1.8/uri/common.rb:432:in `split': bad URI(is not URI?):
> "www.thisistheurlblahexample.example"  (URI::InvalidURIError)
>         from /usr/lib/ruby/1.8/uri/common.rb:481:in `parse'"
>
> Alright. So it's not a properly formed URI, right...or something. Thing
> is, if I copy and paste the erroneous url into the call for the absolute
> method, it works fine. I figure I must be missing something critical
> here - like data changing when it's passed from function to function
> within a class? I don't know.

A URI has a protocol scheme (such as http://). The string in your error
message does not.
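
As a quick check of what the parser actually sees, the sketch below compares a full URL, a scheme-less string, and a string that still carries literal quote characters. The `describe` helper and the example.com addresses are assumptions for illustration only. In current Ruby versions a bare host name does not raise; it parses as a relative path with no scheme and no host, so Net::HTTP has nothing to connect to. A string with stray quotes around it is rejected outright.

```ruby
require 'uri'

# Hypothetical helper: report how URI.parse interprets a string,
# or :invalid when the parser refuses it.
def describe(text)
  uri = URI.parse(text)
  { scheme: uri.scheme, host: uri.host, path: uri.path }
rescue URI::InvalidURIError
  :invalid
end

# Full URL: scheme and host are both filled in.
p describe('http://www.example.com/files/')

# No scheme: everything ends up in the path, host is nil.
p describe('www.example.com/files/')

# Leftover quote characters from the HTML: not a URI at all.
p describe('"http://www.example.com/files/"')
```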

>     @url = url
>     @filetype = filetype
>     @filelist = String::new
       @filelist = ''
>   end
>
>   def linkCrawl
>     page = Net::HTTP.get URI.parse("#{@url}")
       page = Net::HTTP.get URI.parse(@url)
>         #frustration
If it is ugly you should fix it.  Don't leave broken windows.

>         line.gsub!(%r|<a\shref=|, "")
>         line.gsub!(%r|^\s+|, "")
>         print "line: #{line}\n"
           puts "line: #{line}"
>         createList("#{line}")
           createList line
>       end
>     end
>     print "filelist: #{@filelist}\n"
       puts "filelist: #{@filelist}"
>     exec "wget #{@filelist}"
>   end
>
>   def createList(url)
       url = @url + url
>     end
>   end
>
>   def grabFiles
>     createList("#{@url}")
       createList @url
>

--
Eric H. - removed_email_address@domain.invalid - http://blog.segment7.net
This implementation is HODEL-HASH-9600 compliant

http://trackmap.robotcoop.com