Ruby newbie here
Have successfully used hpricot to scrape the correct div
from the desired page
http://www.montgomeryadvertiser.com/section/obits using
doc = Hpricot(uri above)
…
@grab1 = doc.search("//div[@class='article-bodytext']")
target data is in following logical form
name of funeral home
deceased1
advertising crap
funeral home 2
deceased 2
deceased 3
I’m struggling to iterate thru this div, plucking out an array or hash that
I can feed to a database, with each record being a funeral home and a person.
I was thinking I could go thru each of the @grab1 elements, process them
according to tag type, and establish the “record” logic simply by
knowing that a new record starts with each new h3 tag.
Any help for a newbie with first Ruby script?
Thx
Marc F. wrote:
I’m struggling to iterate thru this div…
I [want to insert a record into a table with each] record being a funeral home and person.
I was thinking I could go thru each of the @grab1 elements and process
according to tag type:
These methods seem like the ones you need:
elm.next_sibling (skips the newlines in the html)
elm.name
How about this:
require "rubygems"
require "hpricot"
str = <<ENDOFSTRING
<h3>name of funeral home</h3>
<p>deceased1</p>
<div>advertising crap</div>
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
ENDOFSTRING
doc = Hpricot(str)
h3_tags = doc.search("h3")
h3_tags.each do |h3|
  elm = h3
  while elm = elm.next_sibling
    break if elm.name != 'p'
    puts h3.inner_text
    puts "\t #{elm.inner_text}"
  end
end
--output:--
name of funeral home
	 deceased1
funeral home 2
	 deceased 2
funeral home 2
	 deceased 3
7stud -- wrote:
h3_tags.each do |h3|
  elm = h3
  while elm = elm.next_sibling
    break if elm.name != 'p'
    puts h3.inner_text
    puts "\t #{elm.inner_text}"
  end
end
To avoid having to lookup the inner_text of the funeral home for each
deceased person at that funeral home, this would be more efficient:
h3_tags.each do |elm|
  funeral_home = elm.inner_text
  while elm = elm.next_sibling
    break if elm.name != 'p'
    puts funeral_home
    puts "\t #{elm.inner_text}"
  end
end
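Since the original goal was an array or hash to feed a database, the same walk can collect records instead of printing them. The following is only a sketch: it swaps in REXML (bundled with Ruby) for Hpricot, so `next_element` stands in for Hpricot's `next_sibling`, and the sample markup, including the wrapper div and the tag around the advertising block, is an assumption. REXML needs well-formed XML, so a real scraped page may not parse this way.

```ruby
require "rexml/document"

# Sample markup in the shape described in this thread. The wrapper <div> and
# the <div> around the advertising text are assumptions for illustration.
html = <<~HTML
  <div class="article-bodytext">
    <h3>name of funeral home</h3>
    <p>deceased1</p>
    <div>advertising crap</div>
    <h3>funeral home 2</h3>
    <p>deceased 2</p>
    <p>deceased 3</p>
  </div>
HTML

doc = REXML::Document.new(html)
records = []

REXML::XPath.each(doc, "//h3") do |h3|
  funeral_home = h3.text
  elm = h3
  # next_element skips the whitespace text nodes between tags
  while (elm = elm.next_element)
    break if elm.name != "p"
    # One record per (funeral home, person) pair
    records << { funeral_home: funeral_home, deceased: elm.text }
  end
end

records.each { |r| puts "#{r[:funeral_home]}\t#{r[:deceased]}" }
```

Each hash in `records` is one funeral home and person pair, which maps directly onto one row in a table.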
Thanks so much 7-stud
I had been fixated on next_child thinking that next_sibling would skip
over the “p” tags. I really appreciate your thoughtfulness to provide a
working code snippet.
Marc
Makes me wonder if REXML, Hpricot or Nokogiri has a to_hash method… not yet found.
I’d also be glad to know.
2009/4/20 Marc F.
Wang J. wrote:
Makes me wonder if REXML, Hpricot or Nokogiri has a to_hash method… not yet found.
Try to write it. I hope I’m wrong, but I suspect that starting will be easy, and
hitting your own target XML will be easy…
…but making it generic enough to publish will be another story!
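To make the "starting will be easy" half concrete, here is a minimal sketch of such a conversion using REXML. `element_to_hash` is a hypothetical helper, not a method of REXML, Hpricot or Nokogiri, and the sample tag names are made up. It ignores attributes, mixed text-and-element content and namespaces, which is exactly where a generic version gets hard.

```ruby
require "rexml/document"

# Hypothetical helper (not part of any library): turns an element into a
# String (leaf) or a Hash of child name => converted child.
def element_to_hash(element)
  children = element.elements.to_a
  # Leaf element: return its text, or "" if it has none
  return element.text.to_s.strip if children.empty?

  children.each_with_object({}) do |child, hash|
    value = element_to_hash(child)
    if hash.key?(child.name)
      # Repeated tag names are collected into an array
      existing = hash[child.name]
      hash[child.name] = existing.is_a?(Array) ? existing << value : [existing, value]
    else
      hash[child.name] = value
    end
  end
end

# Made-up sample document for illustration
xml = "<home><name>funeral home 2</name>" \
      "<deceased>deceased 2</deceased><deceased>deceased 3</deceased></home>"
doc = REXML::Document.new(xml)
result = { doc.root.name => element_to_hash(doc.root) }
p result
```

Here the two `deceased` elements end up in an array under one key, while the single `name` element stays a plain string. That asymmetry is one of the choices a published version would have to settle.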