Buffer problem

casper_the_ghost · January 25, 2007, 6:11am

I am trying to spider a site using Hpricot, but I keep getting an
"out of buffer" error. It will only let me process about two pages at a
time. Is there a way to clear the buffer after I process each page so I
won't blow the buffer?
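
For reference, a minimal sketch of what "one page at a time" could look like, so that no file handle or page body is carried over between iterations and one bad page does not stop the run. It is an illustration under assumptions, not a confirmed fix for the Hpricot error: `each_page`, `PAGES`, and `FAKE_FETCH` are hypothetical names, and the fetcher is a stand-in so the sketch runs without network access (a real one would wrap open-uri's `open`). If the error comes from a single oversized tag rather than from anything accumulating between pages, this loop alone would not help; later Hpricot releases are documented as having a `Hpricot.buffer_size` setting for that case, which is not verified against the version used here.

```ruby
require 'stringio'

# Process pages strictly one at a time. `fetch` is a hypothetical callable
# that returns an IO-like object for a URL; the block receives the URL and
# the page body as a String. Returns a hash of url => error message.
def each_page(urls, fetch)
  failures = {}
  urls.each do |url|
    io = nil
    begin
      io = fetch.call(url)
      yield url, io.read
    rescue StandardError => e
      # Record the failure and move on to the next page.
      failures[url] = e.message
    ensure
      # Always release the handle before the next iteration.
      io.close if io && !io.closed?
    end
  end
  failures
end

# Stand-in pages and fetcher (assumptions for the sketch only).
PAGES = {
  'a.html' => '<table><tr><td>AERO</td></tr></table>',
  'b.html' => '<table><tr><td>CHEM</td></tr></table>'
}
FAKE_FETCH = lambda do |url|
  StringIO.new(PAGES.fetch(url) { raise ArgumentError, "no such page: #{url}" })
end

seen = []
failed = each_page(%w[a.html missing.html b.html], FAKE_FETCH) do |url, _html|
  seen << url
end
```

In real use the block body would be where the page is handed to the parser; the failures hash shows which URLs need a second look.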

casper_the_ghost · January 25, 2007, 8:47am

Can you post the code you are using?

–
Thanks,
-Steve
http://www.stevelongdo.com

casper_the_ghost · January 26, 2007, 12:55pm

Anyone? Someone has to know something about the buffer.

casper_the_ghost · January 25, 2007, 1:45pm

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'active_record'

ActiveRecord::Base.establish_connection(
  # connection info
)

class Major < ActiveRecord::Base
  has_many :courses
end

class Course < ActiveRecord::Base
  belongs_to :major
end

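# Fetch the index page and create one Major per row of the seventh table.
def scrape(url)
  doc = Hpricot(open(url))
  tables = (doc/"table")
  (tables[6]/"tr").each do |major|
    createMajor major
  end
end

# Build and save a Major from one table row.
def createMajor(data)
  newMajor = Major.new
  newMajor.title = data.search("td").first.inner_html
  newMajor.abbrev = data.search("acronym").inner_html
  # Pull the href out of the first link by splitting on double quotes.
  newMajor.link_to = data.search("a").to_s.split('"')[1]
  puts newMajor.save
end

# Fetch one major's course listing and process each section heading.
def courses(url)
  puts url
  doc = Hpricot(open("http://courses.tamu.edu/" + url.to_s))
  courses = (doc/"//td[@class='sectionheading']")
  courses.each do |course|
    createCourse course
  end
end

# Split a section heading into major, course number, and course name.
def createCourse(data)
  course = data.inner_html.strip.split(' ')
  major = course[0]
  course_no = course[1]
  puts major, course_no
  course.pop
  # Guard against short headings, where the slice would be nil.
  course_name = (course[3..-1] || []).join(' ')
  puts course_name
end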

all_majors = Major.find(:all, :limit => 3, :offset => 0)
all_majors.each do |major|
  # courses takes a single argument, so pass only the link.
  courses(major.link_to)
end
# scrape(url goes here)

This is what I last tested with. I had it so that scrape would call
courses, but that blew the buffer before I even got any output. This
version outputs the data from two pages and then breaks.
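
To help narrow down where it breaks, the heading-splitting step in createCourse can be checked on its own, without Hpricot or the network. The sketch below is a hypothetical helper (`parse_heading` is not part of the script above), and the heading layout is only inferred from the indices the script uses: first word is the major, second is the course number, the third word is skipped, and the last word is dropped. The sample heading is made up for illustration.

```ruby
# Hypothetical helper mirroring createCourse's string handling.
# Assumed layout (inferred, not confirmed):
#   "<MAJOR> <NUMBER> <skipped word> <NAME ...> <dropped last word>"
def parse_heading(text)
  words = text.strip.split(' ')
  return nil if words.length < 2

  # Drop the last word, as createCourse does with pop.
  body = words[0...-1]
  {
    major: words[0],
    course_no: words[1],
    # Everything from the fourth word on; empty if the heading is short.
    name: (body[3..-1] || []).join(' ')
  }
end

# Made-up sample heading, for illustration only.
sample = parse_heading('  AERO 201 500 INTRO TO AEROSPACE ENGR (3)  ')
short  = parse_heading('AERO 201')
empty  = parse_heading('AERO')
```

If this part behaves on the real headings, the problem is more likely in the fetch or parse step than in the string handling.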