Forum: Ruby hpricot parsing

Marc F. (Guest)
on 2009-04-19 20:12
Ruby newbie here

Have successfully used hpricot to scrape correct <div> from desired page
http://www.montgomeryadvertiser.com/section/obits using

doc = Hpricot(uri above)
...
@grab1 = doc.search("//div[@class='article-bodytext']")

target data is in following logical form

<div>
<h3>name of funeral home</h3>
<p>deceased1</p>
<div>advertising crap</div>
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
</div>

I'm struggling to iterate through this div and pluck out an array or hash
that I can feed to a database, with each record being a funeral home and a
person. I was thinking I could go through each of the @grab1 elements,
process it according to tag type, and establish the "record" logic simply
by knowing that a new record starts with each new h3 tag.

Any help for a newbie with first Ruby script?


Thx
7stud -. (Guest)
on 2009-04-20 03:20
Marc F. wrote:
> Ruby newbie here
>
> Have successfully used hpricot to scrape correct <div> from desired page
> http://www.montgomeryadvertiser.com/section/obits using
>
> doc = Hpricot(uri above)
> ...
> @grab1 = doc.search("//div[@class='article-bodytext']")
>
> target data is in following logical form
>
> <div>
> <h3>name of funeral home</h3>
> <p>deceased1</p>
> <div>advertising crap</div>
> <h3>funeral home 2</h3>
> <p>deceased 2</p>
> <p>deceased 3</p>
> </div>
>
> I'm struggling to iterate thru this div..
> I [want to insert a record into a table with each] record being a funeral home and person.
> I was thinking I could go thru each of the @grab1 elements and process
> according to tag type:

These methods seem like the ones you need:

elm.next_sibling  (skips the newlines in the html)
elm.name

How about this:

require "rubygems"
require 'hpricot'

str =<<ENDOFSTRING
<div>
  <h3>name of funeral home</h3>
  <p>deceased1</p>
  <div>advertising crap</div>
  <h3>funeral home 2</h3>
  <p>deceased 2</p>
  <p>deceased 3</p>
</div>
ENDOFSTRING

doc = Hpricot(str)
h3_tags = doc.search("h3")

h3_tags.each do |h3|
  elm = h3

  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts h3.inner_text
    puts "\t #{elm.inner_text}"
  end

end


--output:--
name of funeral home
         deceased1
funeral home 2
         deceased 2
funeral home 2
         deceased 3
7stud -. (Guest)
on 2009-04-20 03:40
7stud -- wrote:
> h3_tags.each do |h3|
>   elm = h3
>
>   while elm = elm.next_sibling
>     break if elm.name != 'p'
>
>     puts h3.inner_text
>     puts "\t #{elm.inner_text}"
>   end
>
> end
>
>

To avoid having to look up the inner_text of the funeral home for each
deceased person at that funeral home, this would be more efficient:

h3_tags.each do |elm|
  funeral_home = elm.inner_text

  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts funeral_home
    puts "\t #{elm.inner_text}"
  end
end
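
If the goal is to collect records for a database rather than print them, one possible variation is to walk the children of the div once and start a new group at each h3, as the original question suggested. The sketch below is only an illustration: it swaps in REXML (which ships with Ruby) instead of Hpricot, it assumes the markup is well-formed enough for an XML parser, and the helper name obituary_records and the :home / :person keys are made up for this example.

```ruby
require 'rexml/document'

# Sample markup from the thread; it is well-formed, so REXML can parse it.
html = <<~HTML
  <div>
    <h3>name of funeral home</h3>
    <p>deceased1</p>
    <div>advertising crap</div>
    <h3>funeral home 2</h3>
    <p>deceased 2</p>
    <p>deceased 3</p>
  </div>
HTML

# Hypothetical helper (not part of any library): walks the child elements
# once and emits one { home, person } hash per <p>, using the most recent
# <h3> as the funeral home.
def obituary_records(markup)
  container = REXML::Document.new(markup).root
  records = []
  current_home = nil

  container.elements.each do |elm|
    case elm.name
    when 'h3'
      current_home = elm.text.to_s.strip
    when 'p'
      # Skip any <p> that appears before the first <h3>.
      records << { home: current_home, person: elm.text.to_s.strip } if current_home
    end
    # Any other tag (such as the advertising <div>) is ignored.
  end

  records
end

records = obituary_records(html)
records.each { |r| puts "#{r[:home]}\t#{r[:person]}" }
```

Unlike the next_sibling loop, this version does not stop at the first non-p tag, so a p that follows an advertising div is still attached to the preceding funeral home. Whether that is the desired behaviour depends on the real page.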
Marc F. (Guest)
on 2009-04-20 03:49
Thanks so much 7-stud

I had been fixated on next_child thinking that next_sibling would skip
over the "p" tags.  I really appreciate your thoughtfulness to provide a
working code snippet.

Marc
Wang J. (Guest)
on 2009-04-20 06:05
(Received via mailing list)
Makes me wonder if REXML, Hpricot or Nokogiri has a to_hash method... not
found yet. I'd also be glad to know.
Phlip (Guest)
on 2009-04-20 06:25
(Received via mailing list)
Wang J. wrote:

> Makes me wonder if REXML, Hpricot or Nokogiri has a to_hash method... not
> found yet.

Try to write it. I hope I'm wrong, but I suspect that starting will be
easy, and hitting your own target XML will be easy...

...but making it generic enough to publish will be another story!
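
To make that point concrete, here is a rough sketch of what a naive converter might look like. It uses REXML; element_to_hash is a hypothetical helper written for this example, not a method of REXML, Hpricot or Nokogiri, and it deliberately ignores attributes and mixed content.

```ruby
require 'rexml/document'

# Hypothetical helper, not part of REXML: converts an element into nested
# hashes keyed by child tag name. A childless element becomes its stripped
# text. Repeated child names are collected into an array. Attributes and
# mixed text/element content are ignored.
def element_to_hash(element)
  children = element.elements.to_a
  return element.text.to_s.strip if children.empty?

  children.each_with_object({}) do |child, hash|
    value = element_to_hash(child)
    if hash.key?(child.name)
      existing = hash[child.name]
      # Arrays only ever come from this branch, so is_a?(Array) is safe here.
      hash[child.name] = existing.is_a?(Array) ? existing << value : [existing, value]
    else
      hash[child.name] = value
    end
  end
end

doc = REXML::Document.new("<div><h3>home 1</h3><p>a</p><h3>home 2</h3><p>b</p></div>")
p element_to_hash(doc.root)
```

On markup shaped like the obituary div, the h3 values and the p values end up in two separate arrays, so the information about which person belongs to which funeral home is lost. That is one reason a generic to_hash is harder than it first looks.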