Hpricot parsing

mrcfab3 · April 19, 2009, 6:12pm

Ruby newbie here

Have successfully used hpricot to scrape the correct div from the desired page
http://www.montgomeryadvertiser.com/section/obits using

doc = Hpricot(uri above)
…
@grab1 = doc.search("//div[@class='article-bodytext']")

target data is in following logical form

<h3>name of funeral home</h3>
<p>deceased1</p>
advertising crap
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>

I’m struggling to iterate thru this div, plucking out an array or hash from which
I can feed a database, with each record being a funeral home and a person.
I was thinking I could go thru each of the @grab1 elements, process it
according to tag type, and establish the “record” logic simply by
knowing that a new record starts with each new h3 tag.
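
The single-pass idea described above can be sketched without any parser at all. This is only a sketch under assumptions: the `elements` array of `[tag, text]` pairs and the `group_by_heading` helper are hypothetical stand-ins for whatever gets pulled out of the div, and the `div` tag on the advert line is a guess.

```ruby
# Hypothetical input: one [tag, text] pair per element inside the div,
# in document order. The "div" tag on the advert is an assumption.
elements = [
  ["h3",  "name of funeral home"],
  ["p",   "deceased1"],
  ["div", "advertising crap"],
  ["h3",  "funeral home 2"],
  ["p",   "deceased 2"],
  ["p",   "deceased 3"],
]

# Walk the elements once, remembering the most recent h3 and emitting
# one record per p that follows it. Anything else is ignored.
def group_by_heading(elements)
  current_home = nil
  records = []

  elements.each do |tag, text|
    case tag
    when "h3"
      current_home = text
    when "p"
      records << { :funeral_home => current_home, :deceased => text } if current_home
    end
  end

  records
end

records = group_by_heading(elements)
records.each { |r| puts "#{r[:funeral_home]}: #{r[:deceased]}" }
```

Note that a p following an advert is still attributed to the current funeral home here; whether that is right depends on the real page.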

Any help for a newbie with first Ruby script?

Thx

mrcfab3 · April 20, 2009, 1:20am

Marc F. wrote:

Ruby newbie here

Have successfully used hpricot to scrape the correct div from the desired page
http://www.montgomeryadvertiser.com/section/obits using

doc = Hpricot(uri above)
…
@grab1 = doc.search("//div[@class='article-bodytext']")

target data is in following logical form

<h3>name of funeral home</h3>
<p>deceased1</p>
advertising crap
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>

I’m struggling to iterate thru this div…
I [want to insert a record into a table with each] record being a funeral home and person.
I was thinking I could go thru each of the @grab1 elements and process
according to tag type:

These methods seem like the ones you need:

elm.next_sibling (skips the newlines in the html)
elm.name

How about this:

require "rubygems"
require "hpricot"

str = <<ENDOFSTRING
<h3>name of funeral home</h3>
<p>deceased1</p>
advertising crap
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
ENDOFSTRING

doc = Hpricot(str)
h3_tags = doc.search("h3")

h3_tags.each do |h3|
  elm = h3

  # Walk the siblings after this h3 until something other than a p turns up.
  while elm = elm.next_sibling
    break if elm.name != "p"

    puts h3.inner_text
    puts "\t #{elm.inner_text}"
  end
end

--output:--
name of funeral home
	 deceased1
funeral home 2
	 deceased 2
funeral home 2
	 deceased 3

mrcfab3 · April 20, 2009, 1:40am

7stud – wrote:

h3_tags.each do |h3|
  elm = h3

  while elm = elm.next_sibling
    break if elm.name != "p"
    puts h3.inner_text
    puts "\t #{elm.inner_text}"
  end

end

To avoid having to look up the inner_text of the funeral home for each
deceased person at that funeral home, this would be more efficient:

h3_tags.each do |elm|
  funeral_home = elm.inner_text

  while elm = elm.next_sibling
    break if elm.name != "p"

    puts funeral_home
    puts "\t #{elm.inner_text}"

  end
end
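
Since the goal is to feed a database rather than print, the same sibling walk could collect records instead. A rough sketch, with the assumptions flagged: `Node` is a hypothetical stand-in that only mimics the three calls used in this thread (`name`, `inner_text`, `next_sibling`) so the snippet runs without Hpricot, and `build_siblings` and `collect_records` are illustrative helper names. With the real library one would presumably pass in the result of `doc.search("h3")`.

```ruby
# Hypothetical stand-in for a parsed element; it mimics only the calls
# used in this thread: name, inner_text and next_sibling.
Node = Struct.new(:name, :inner_text, :next_sibling)

# Build a sibling chain from [tag, text] pairs; returns the nodes in order.
def build_siblings(pairs)
  nodes = pairs.map { |tag, text| Node.new(tag, text, nil) }
  nodes.each_cons(2) { |left, right| left.next_sibling = right }
  nodes
end

# For each h3, walk the following siblings and collect one record per p,
# stopping at the first element that is not a p.
def collect_records(h3_tags)
  records = []
  h3_tags.each do |h3|
    funeral_home = h3.inner_text
    elm = h3
    while (elm = elm.next_sibling)
      break if elm.name != "p"
      records << { :funeral_home => funeral_home, :deceased => elm.inner_text }
    end
  end
  records
end

nodes = build_siblings([
  ["h3",  "name of funeral home"],
  ["p",   "deceased1"],
  ["div", "advertising crap"], # the tag here is a guess
  ["h3",  "funeral home 2"],
  ["p",   "deceased 2"],
  ["p",   "deceased 3"],
])

records = collect_records(nodes.select { |n| n.name == "h3" })
records.each { |r| puts "#{r[:funeral_home]}\t#{r[:deceased]}" }
```

Each hash could then be handed to whatever insert call the database layer offers.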

mrcfab3 · April 20, 2009, 1:49am

Thanks so much 7-stud

I had been fixated on next_child thinking that next_sibling would skip
over the “p” tags. I really appreciate your thoughtfulness to provide a
working code snippet.

Marc

mrcfab3 · April 20, 2009, 4:05am

Makes me wonder if REXML, Hpricot or Nokogiri has a to_hash method… not yet
found.
I’d also be glad to know.

mrcfab3 · April 20, 2009, 4:25am

Wang J. wrote:

Makes me wonder if REXML, Hpricot or Nokogiri has a to_hash method… not yet
found.

Try to write it. I hope I’m wrong, but I suspect that starting will be
easy, and hitting your own target XML will be easy…

…but making it generic enough to publish will be another story!
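
For what it's worth, a first attempt might look like the sketch below, which also hints at why a generic version gets hard. Assumptions: `element_to_hash` is a made-up helper, not something REXML provides, and the sample XML (element names included) is invented for illustration.

```ruby
require "rexml/document"

# Hypothetical helper (not part of REXML): leaf elements become their
# stripped text, other elements become a hash of child name => array of values.
# Attributes, mixed text and the relative order of differently named
# siblings are all dropped, which is where "generic" starts to hurt.
def element_to_hash(element)
  children = element.elements.to_a
  return element.text.to_s.strip if children.empty?

  children.each_with_object({}) do |child, hash|
    (hash[child.name] ||= []) << element_to_hash(child)
  end
end

# Invented sample document.
xml = <<XML
<obits>
  <home>
    <name>funeral home 2</name>
    <deceased>deceased 2</deceased>
    <deceased>deceased 3</deceased>
  </home>
</obits>
XML

doc = REXML::Document.new(xml)
result = { doc.root.name => element_to_hash(doc.root) }
p result
```

Every child name maps to an array here, even when there is only one child, which keeps the helper short but makes the result clumsy to use.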