Hpricot parsing

Ruby newbie here

I have successfully used Hpricot to scrape the correct div from the desired page,
http://www.montgomeryadvertiser.com/section/obits, using

doc = Hpricot(uri above)

@grab1 = doc.search("//div[@class='article-bodytext']")

The target data is in the following logical form:

name of funeral home

deceased1

advertising crap

funeral home 2

deceased 2

deceased 3

I’m struggling to iterate through this div and pluck out an array or hash
that I can feed to a database, with each record being a funeral home and
a person. I was thinking I could go through each of the @grab1 elements,
process them according to tag type, and establish the “record” logic
simply by knowing that a new record starts with each new h3 tag.

Any help for a newbie with first Ruby script?

Thx

Marc F. wrote:

I’m struggling to iterate thru this div…
I [want to insert a record into a table with each] record being a funeral home and person.
I was thinking I could go thru each of the @grab1 elements and process
according to tag type:

These methods seem like the ones you need:

elm.next_sibling (skips the newlines in the html)
elm.name

How about this:

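require "rubygems"
require 'hpricot'

# The HTML tags were lost from the sample data in this post. The h3 and p
# tags are reconstructed from the code below; the div around the
# advertising text is an assumption (any tag other than p behaves the same).
str = <<ENDOFSTRING
<h3>name of funeral home</h3>
<p>deceased1</p>
<div>advertising crap</div>
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
ENDOFSTRING

doc = Hpricot(str)
h3_tags = doc.search("h3")

h3_tags.each do |h3|
  elm = h3

  # Walk the siblings after this h3; stop at the first one that is not a p.
  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts h3.inner_text
    puts "\t #{elm.inner_text}"
  end
end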

--output:--
name of funeral home
         deceased1
funeral home 2
         deceased 2
funeral home 2
         deceased 3

7stud -- wrote:

h3_tags.each do |h3|
  elm = h3

  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts h3.inner_text
    puts "\t #{elm.inner_text}"
  end
end

To avoid having to look up the inner_text of the funeral home for each
deceased person at that funeral home, this would be more efficient:

h3_tags.each do |elm|
  funeral_home = elm.inner_text

  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts funeral_home
    puts "\t #{elm.inner_text}"
  end
end
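
If the goal is records for a database rather than printed lines, the same walk can collect hashes instead. What follows is only a sketch of that grouping logic in plain Ruby, with no Hpricot calls: it assumes each element has already been reduced to a [tag name, text] pair (for example from elm.name and elm.inner_text). The build_records helper, the hash keys, and the "div" tag used for the advertising text are all made up for illustration.

```ruby
# Each pair stands in for one child element of the scraped div:
# [elm.name, elm.inner_text]. The "div" tag for the advertising
# block is an assumption; the original tag is not known.
ELEMENTS = [
  ["h3",  "name of funeral home"],
  ["p",   "deceased1"],
  ["div", "advertising crap"],
  ["h3",  "funeral home 2"],
  ["p",   "deceased 2"],
  ["p",   "deceased 3"]
]

# Hypothetical helper: returns one hash per (funeral home, person) pair.
def build_records(elements)
  records = []
  funeral_home = nil
  collecting = false

  elements.each do |tag, text|
    case tag
    when "h3"
      # A new h3 starts a new group of records.
      funeral_home = text.strip
      collecting = true
    when "p"
      records << { funeral_home: funeral_home, deceased: text.strip } if collecting
    else
      # Any other tag ends the current group, like the break in the loop above.
      collecting = false
    end
  end

  records
end

records = build_records(ELEMENTS)
records.each { |r| puts "#{r[:funeral_home]}\t#{r[:deceased]}" }
```

Each hash in records could then be handed to whatever database layer is in use, one insert per hash.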

Thanks so much 7-stud

I had been fixated on next_child thinking that next_sibling would skip
over the “p” tags. I really appreciate your thoughtfulness to provide a
working code snippet.

Marc

Makes me wonder if REXML, Hpricot or Nokogiri has a to_hash method… not yet found.
I’d also be glad to know.
2009/4/20 Marc F.

Wang J. wrote:

Makes me wonder if REXML, Hpricot or Nokogiri has a to_hash method… not yet found.

Try to write it. I hope I’m wrong, but I suspect that starting will be
easy, and hitting your own target XML will be easy…

…but making it generic enough to publish will be another story!
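
To give a feel for how the easy part might start, here is a rough sketch using REXML. The element_to_hash helper is made up for this example and is not part of REXML, Hpricot or Nokogiri. It deliberately ignores attributes and mixed text/element content, which is exactly where a generic version gets hard. The sample XML layout is also an assumption.

```ruby
require 'rexml/document'

# Hypothetical helper, not a library method: turn an element into
# nested hashes keyed by child tag name. Attributes and text mixed
# in between child elements are ignored.
def element_to_hash(element)
  children = element.elements.to_a

  # A leaf element is represented by its stripped text.
  return element.text.to_s.strip if children.empty?

  children.each_with_object({}) do |child, hash|
    value = element_to_hash(child)

    if hash.key?(child.name)
      # A repeated tag name turns the entry into an array of values.
      hash[child.name] = [hash[child.name]] unless hash[child.name].is_a?(Array)
      hash[child.name] << value
    else
      hash[child.name] = value
    end
  end
end

xml = <<XML
<home>
  <name>funeral home 2</name>
  <deceased>deceased 2</deceased>
  <deceased>deceased 3</deceased>
</home>
XML

result = element_to_hash(REXML::Document.new(xml).root)
p result
```

Note that a tag appearing once yields a plain value while a repeated tag yields an array, so callers have to handle both shapes. That kind of inconsistency is one reason a publishable to_hash is harder than it looks.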
