Mechanize out of buffer space

casper_the_ghost · January 26, 2007, 3:16pm

I am trying to scrape a site and then its child pages to get data that I
relate in tables. The only problem is that I keep getting an “OUT OF BUFFER
SPACE” error. Is there a way to clear the buffer after each iteration,
or am I doing something wrong?

Here’s the code:
require 'rubygems'
require 'mechanize'
require 'active_record'

ActiveRecord::Base.establish_connection(
  # connection goes here
)

class Major < ActiveRecord::Base
  has_many :courses
end

class Course < ActiveRecord::Base
  belongs_to :major
end

class Sections
  def scrape(url)
    agent = WWW::Mechanize.new
    page = agent.get(url)
    table = (page/'//table')[6]
    (table/"tr").each do |major|
      @newMajor = Major.new
      @newMajor.title = (major/'//td').first.inner_html
      @newMajor.abbrev = (major/'acronym').inner_html
      @newMajor.link_to = (major/'a').to_s.split('"')[1]
      # title, abbrev and link_to are attributes of @newMajor, not local variables
      puts @newMajor.title, @newMajor.abbrev, @newMajor.link_to
    end
  end
end

class Classes
  attr_writer :major_id
  def scrape(url)
    agent = WWW::Mechanize.new
    page = agent.get("http://courses.tamu.edu/"+url.to_s)
    (page/"//td[@class='sectionheading']").each do |course|
      course = course.inner_html.strip.split(' ')
      course.pop
      @newCourse = Course.new
      @newCourse.major_id = @major_id
      @newCourse.course_no = course[1]
      @newCourse.name = course.slice!(3,course.length).join(' ')
      @newCourse.save
    end
  end
end

AllMajors = Major.find(:all)
AllMajors.each do |course|
  start = Time.now
  newClass = Classes.new
  newClass.major_id = course.id
  newClass.scrape(course.link_to)
  puts "Added courses for #{course.title}"
  finish = Time.now
  puts "Took #{finish-start} seconds"
end
puts "Finished scraping courses"
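
The string handling inside Classes#scrape can be tried on its own, without the network or the database, which helps rule it out when chasing the buffer error. Below is a minimal sketch of that step. The helper name parse_course_heading and the sample heading text are hypothetical; the real layout of the site's section headings is not shown in this thread.

```ruby
# Hypothetical helper mirroring the split/pop/slice logic in Classes#scrape.
# The heading layout is assumed, not taken from the real site.
def parse_course_heading(text)
  words = text.strip.split(' ')
  words.pop                          # drop the trailing token, as the original does
  name_words = words[3..-1] || []    # nil when fewer than three words remain
  { :course_no => words[1], :name => name_words.join(' ') }
end

# Made-up heading, only to show the shape of the result.
heading = "  ACCT 209 501 SURVEY OF ACCT PRIN X "
p parse_course_heading(heading)
```

Unlike the slice! call in the original, this version returns an empty name instead of raising when a heading has fewer than three words left after the pop.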

casper_the_ghost · January 26, 2007, 6:10pm

After delving into the actual Hpricot source, it turns out there’s a
predefined buffer size, and you can’t change it without editing the
source and recompiling.
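
Since the limit is fixed, one way to find out which page trips it is to measure the markup before handing it to the parser. The sketch below is only a diagnostic under stated assumptions: it assumes the limit applies to a single tag (including its attributes) rather than to the whole page, and ASSUMED_LIMIT is a placeholder value, not a number confirmed in this thread. The method names are hypothetical. Later Hpricot releases reportedly expose a Hpricot.buffer_size= setter, so it is worth checking whether your installed version has it before patching the source.

```ruby
# Hypothetical diagnostic: find the largest single tag in an HTML string.
# ASSUMED_LIMIT is an assumption; check it against your parser's source.
ASSUMED_LIMIT = 16 * 1024

# Returns the longest "<...>" run in the document, or nil if there is none.
def largest_tag(html)
  html.scan(/<[^>]*>/).max_by { |tag| tag.bytesize }
end

# True when the longest tag is bigger than the assumed buffer limit.
def oversized?(html, limit = ASSUMED_LIMIT)
  tag = largest_tag(html)
  !tag.nil? && tag.bytesize > limit
end

small = "<html><body><a href='x'>ok</a></body></html>"
big   = "<a href='#{'x' * 20_000}'>too long</a>"
puts oversized?(small)
puts oversized?(big)
```

Running a check like this over each fetched page body before parsing would point at the specific page, and the specific tag, that needs a larger buffer.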