ScrAPI HTTPNoAccessError

Hi,

I’m having some problems using scrAPI. I’m getting
HTTPNoAccessErrors on certain URLs.

The program searches a page (http://en.wikiquote.org/wiki/List_of_films)
for all of the links on it that go to pages with movie quotes on them.

It then loops through the list, pulling out the details from each page
using this method:

def self.scrapemovies
  Scraper::Base.parser :html_parser

  urlarray = Movie.findurls

  moviescraper = Scraper.define do
    process "h1", :name => :text
    process "p:nth-child(4)", :description => :text
    result :description, :name
  end

  urlarray.each do |url|
    fullurl = "http://en.wikiquote.org#{url}"
    movieurl = URI.parse(fullurl)
    data = moviescraper.scrape(movieurl)
    movie = Movie.new
    movie.url = fullurl
    movie.name = data.name
    movie.description = data.description
    movie.save
  end
end

This worked OK until it got to
http://en.wikiquote.org/wiki/20,000_Leagues_Under_the_Sea, which gave me
the HTTP error, apparently because it had a comma in the URL.

I wrote a little bit of code in the Movie.findurls method that just
stripped out any URLs with commas or parentheses in them, as a bodge just
to get things working, but I’m even getting the error on this URL:
http://en.wikiquote.org/wiki/27_Dresses, which is very odd, because it
worked fine on the previous one, which was:
http://en.wikiquote.org/wiki/25th_Hour.

I can’t see the difference between them - I’ve tried manually visiting
the page, and it’s fine.
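One way to narrow this down would be to keep the loop running when a page fails and record which URLs raised, so the failing ones can be compared side by side. The sketch below is only an illustration: scrape_all is a hypothetical helper (not part of scrAPI), the lambda stands in for moviescraper.scrape so the example runs without scrAPI or a network connection, and it assumes scrAPI's errors descend from StandardError.

```ruby
# scrape_all is a hypothetical helper, not part of scrAPI. It yields each URL
# to the given block and records any exception instead of stopping the run.
def scrape_all(urls)
  results = {}
  failures = {}
  urls.each do |url|
    begin
      results[url] = yield(url)
    rescue StandardError => e
      # Keep the error class and message so failures can be compared later.
      failures[url] = "#{e.class}: #{e.message}"
    end
  end
  [results, failures]
end

# Stand-in for moviescraper.scrape; it fails on one URL purely for illustration.
fake_scrape = lambda do |url|
  raise "no access" if url.include?("27_Dresses")
  "scraped #{url}"
end

results, failures = scrape_all(["/wiki/25th_Hour", "/wiki/27_Dresses"]) do |url|
  fake_scrape.call(url)
end

failures.each { |url, reason| puts "#{url} failed with #{reason}" }
```

In the real loop the block would call moviescraper.scrape(URI.parse(fullurl)), and the failures hash would show whether the errors follow a pattern in the URLs or something else, such as their position in the run.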

I’m assuming that I need to do some sort of cleverer parsing on the URLs
(so that I can include the ones with commas and parentheses too).
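For what it's worth, commas and parentheses are permitted characters in a URI path under RFC 3986, so URI.parse on its own should accept these URLs unchanged. The quick check below uses the two URLs from this post plus a made-up third one with parentheses (an assumption for illustration only); it says nothing about what scrAPI or the server does with them afterwards.

```ruby
require "uri"

# The first two URLs come from the post; the third is a hypothetical example
# with parentheses, included only to exercise that case.
urls = [
  "http://en.wikiquote.org/wiki/20,000_Leagues_Under_the_Sea",
  "http://en.wikiquote.org/wiki/27_Dresses",
  "http://en.wikiquote.org/wiki/Alien_(film)"
]

# URI.parse raises URI::InvalidURIError on input it cannot parse.
parsed = urls.map { |u| URI.parse(u) }

parsed.each { |uri| puts "#{uri.host} #{uri.path}" }
```

If all three parse cleanly, the parsing step is probably not where the error comes from.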

Has the Scraper::Base.parser :html_parser line got anything to do with
it? I couldn’t get the Tidy plugin to work properly, but I’m not sure
that it’s got anything to do with the URL parsing anyway.

I’m totally stuck - thanks in advance for any help.

Jules.

I should also add -

Before I got the findurls method to just strip out any URLs with
non-standard characters, I tried this line:

fullurl.gsub!(",","%2C")

This replaced the commas with their percent-encoded form. It didn’t work
either, nor did wrapping the whole URL in CGI.escape("").
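The CGI.escape attempt probably failed for a separate reason: applied to a whole URL, it also escapes the ":" and "/" characters, so the result no longer starts with a scheme and host. A minimal sketch of the difference follows, where escape_path is a hypothetical helper (not part of scrAPI) that escapes each path segment on its own. Since replacing the comma with %2C did not fix the error, this may well not cure the HTTPNoAccessError either; it only shows the two encodings side by side.

```ruby
require "cgi"

base = "http://en.wikiquote.org"
path = "/wiki/20,000_Leagues_Under_the_Sea"

# Escaping the whole URL also escapes ":" and "/", so the "http://" prefix is lost.
whole = CGI.escape(base + path)

# escape_path is a hypothetical helper: it escapes each path segment separately
# so the "/" separators stay intact. Note that CGI.escape turns spaces into "+",
# which is form encoding; the paths here use underscores, so that does not arise.
def escape_path(path)
  path.split("/", -1).map { |segment| CGI.escape(segment) }.join("/")
end

segmentwise = base + escape_path(path)

puts whole
puts segmentwise
```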

The scrAPI documentation isn’t particularly helpful about what
format the URL needs to be in.
