I am trying to spider a site using Hpricot, but I keep getting an out-of-buffer error. It will only let me do about two sites at a time. Is there a way to clear the buffer after I process each page so I won't blow the buffer?
on 2007-01-25 06:11
on 2007-01-25 08:47
Can you post the code you are using?

On 1/24/07, email@example.com <firstname.lastname@example.org> wrote:
> I am trying to spider a site using Hpricot, but I keep getting out of
> buffer error. It will only let me do about two sites at a time, is
> there a way to clear the buffer after I process each page so I won't
> blow the buffer?

--
Thanks,
-Steve
http://www.stevelongdo.com
on 2007-01-25 13:45
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'active_record'

ActiveRecord::Base.establish_connection(
  # connection info
)

class Major < ActiveRecord::Base
  has_many :courses
end

class Course < ActiveRecord::Base
  belongs_to :major
end

def scrape(url)
  doc = Hpricot(open(url))
  tables = (doc/"table")
  (tables/"tr").each do |major|
    createMajor major
  end
end

def createMajor(data)
  newMajor = Major.new
  newMajor.title = data.search("td").first.inner_html
  newMajor.abbrev = data.search("acronym").inner_html
  newMajor.link_to = data.search("a").to_s.split('"')
  puts newMajor.save
end

def courses(url)
  puts url
  doc = Hpricot(open("http://courses.tamu.edu/" + url.to_s))
  courses = (doc/"//td[@class='sectionheading']")
  courses.each do |course|
    createCourse course
  end
end

def createCourse(data)
  course = data.inner_html.strip.split(' ')
  major = course
  course_no = course
  puts major, course_no
  course.pop
  course_name = course.slice!(3, course.length).join(' ')
  puts course_name
end

AllMajors = Major.find(:all, :limit => 3, :offset => 0)
AllMajors.each do |course|
  courses(course.link_to)
end

#scrape(url goes here)

This is what I was last testing with. I had it where scrape would call courses, but that broke the buffer before I even got any output. This version outputs the data from two pages and then breaks.
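For reference, here is a minimal sketch of the kind of splitting createCourse seems to be aiming for, written in plain Ruby so it can be tried without Hpricot or a database. The helper name parse_course_heading and the heading shape "ACCT 209 - Intro Accounting" are assumptions made for illustration only; the real section headings on the course pages may look different, and this does not address the buffer error itself.

```ruby
# Hypothetical helper (not part of the posted script): split a section
# heading into major, course number, and course name.
# Assumes a heading shaped like "ACCT 209 - Intro Accounting", which is a guess.
def parse_course_heading(text)
  words = text.strip.split(' ')
  major = words[0]
  course_no = words[1]
  # Everything after the course number; empty if the heading is that short.
  rest = words[2..-1] || []
  # Drop a lone "-" separator if one follows the course number.
  rest = rest[1..-1] || [] if rest.first == '-'
  { :major => major, :course_no => course_no, :name => rest.join(' ') }
end

heading = parse_course_heading("  ACCT 209 - Intro Accounting ")
puts heading[:major]
puts heading[:course_no]
puts heading[:name]
```

Returning the three parts from one function, rather than printing them as they are found, makes it easier to check the parsing on a few sample strings before pointing the script at live pages.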
on 2007-01-26 12:55
Anyone? Someone has to know something about the buffer.