Forum: Ruby hpricot parsing

Marc Farber (mrcfab3)
on 2009-04-19 18:12
Ruby newbie here

Have successfully used hpricot to scrape correct <div> from desired page
http://www.montgomeryadvertiser.com/section/obits using

doc = Hpricot(uri above)
...
@grab1 = doc.search("//div[@class='article-bodytext']")

target data is in following logical form

<div>
<h3>name of funeral home</h3>
<p>deceased1</p>
<div>advertising crap</div>
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
</div>

I'm struggling to iterate through this div and build an array or hash
that I can feed into a database, with each record being a funeral home
and a person. I was thinking I could go through each of the @grab1
elements, process them according to tag type, and establish the "record"
logic simply by knowing that a new record starts with each new h3 tag.

Any help for a newbie with first Ruby script?


Thx
7stud -- (7stud)
on 2009-04-20 01:20
Marc Farber wrote:
> Ruby newbie here
>
> Have successfully used hpricot to scrape correct <div> from desired page
> http://www.montgomeryadvertiser.com/section/obits using
>
> doc = Hpricot(uri above)
> ...
> @grab1 = doc.search("//div[@class='article-bodytext']")
>
> target data is in following logical form
>
> <div>
> <h3>name of funeral home</h3>
> <p>deceased1</p>
> <div>advertising crap</div>
> <h3>funeral home 2</h3>
> <p>deceased 2</p>
> <p>deceased 3</p>
> </div>
>
> I'm struggling to iterate thru this div..
> I [want to insert a record into a table with each] record being a
> funeral home and person.
> I was thinking I could go thru each of the @grab1 elements and process
> according to tag type:

These methods seem like the ones you need:

elm.next_sibling  (skips the newlines in the html)
elm.name

How about this:

require "rubygems"
require 'hpricot'

str =<<ENDOFSTRING
<div>
  <h3>name of funeral home</h3>
  <p>deceased1</p>
  <div>advertising crap</div>
  <h3>funeral home 2</h3>
  <p>deceased 2</p>
  <p>deceased 3</p>
</div>
ENDOFSTRING

doc = Hpricot(str)
h3_tags = doc.search("h3")

h3_tags.each do |h3|
  elm = h3

  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts h3.inner_text
    puts "\t #{elm.inner_text}"
  end

end


--output:--
name of funeral home
         deceased1
funeral home 2
         deceased 2
funeral home 2
         deceased 3
7stud -- (7stud)
on 2009-04-20 01:40
7stud -- wrote:
> h3_tags.each do |h3|
>   elm = h3
>
>   while elm = elm.next_sibling
>     break if elm.name != 'p'
>
>     puts h3.inner_text
>     puts "\t #{elm.inner_text}"
>   end
>
> end
>
>

To avoid having to look up the inner_text of the funeral home for each
deceased person at that funeral home, this would be more efficient:

h3_tags.each do |elm|
  funeral_home = elm.inner_text

  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts funeral_home
    puts "\t #{elm.inner_text}"
  end
end
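
If you want records you can feed to a database rather than printed output, here is a rough sketch of the single-pass idea from the original question: walk the children of the div in order, remember the most recent h3, and emit one record per p. Node is a hypothetical stand-in I made up so the snippet runs without Hpricot; it only assumes each element responds to name and inner_text, as the elements in the loops above do. The method name group_by_heading and the hash keys are also just illustrative.

```ruby
# Hypothetical stand-in for a parsed element; it only needs to respond
# to #name and #inner_text, like the elements used in the loops above.
Node = Struct.new(:name, :inner_text)

# Walk the children in document order. Each h3 starts a new funeral
# home, each p becomes one record under the current funeral home, and
# any other tag (such as the advertising div) is ignored.
def group_by_heading(nodes)
  records = []
  funeral_home = nil

  nodes.each do |node|
    case node.name
    when 'h3'
      funeral_home = node.inner_text.strip
    when 'p'
      next if funeral_home.nil? # a p before any h3 has no funeral home
      records << { funeral_home: funeral_home, person: node.inner_text.strip }
    end
  end

  records
end

children = [
  Node.new('h3', 'name of funeral home'),
  Node.new('p', 'deceased1'),
  Node.new('div', 'advertising crap'),
  Node.new('h3', 'funeral home 2'),
  Node.new('p', 'deceased 2'),
  Node.new('p', 'deceased 3')
]

records = group_by_heading(children)
records.each { |r| puts "#{r[:funeral_home]}: #{r[:person]}" }
```

One difference from the loops above: those stop at the first sibling that is not a p, so a p that follows an advertising div under the same h3 would be dropped, while this version skips the div and keeps going. Which behavior is right depends on the real page.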
Marc Farber (mrcfab3)
on 2009-04-20 01:49
Thanks so much 7-stud

I had been fixated on next_child thinking that next_sibling would skip
over the "p" tags.  I really appreciate your thoughtfulness to provide a
working code snippet.

Marc
Wang Jian (Guest)
on 2009-04-20 04:05
(Received via mailing list)
Makes me wonder if REXML, Hpricot or Nokogiri has a to_hash method... I
haven't found one yet. I'd also be glad to know.
Phlip (Guest)
on 2009-04-20 04:25
(Received via mailing list)
Wang Jian wrote:

> Makes me wonder if REXML, Hpricot or Nokogiri has a to_hash method...not yet
> found.

Try to write it. I hope I'm wrong, but I suspect that starting will be
easy, and hitting your own target XML will be easy...

...but making it generic enough to publish will be another story!
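
To make that concrete, here is a naive sketch using REXML from the standard distribution. The method name element_to_hash is made up for illustration; it is not part of REXML, Hpricot or Nokogiri. It assumes well-formed XML and ignores attributes and mixed text/element content.

```ruby
require 'rexml/document'

# Naive, illustrative conversion (not a library method): a leaf element
# becomes its stripped text, and a parent becomes a hash that maps each
# child tag name to an array of converted children.
def element_to_hash(element)
  return element.text.to_s.strip unless element.has_elements?

  result = {}
  element.elements.each do |child|
    (result[child.name] ||= []) << element_to_hash(child)
  end
  result
end

xml = <<~XML
  <div>
    <h3>name of funeral home</h3>
    <p>deceased1</p>
    <div>advertising crap</div>
    <h3>funeral home 2</h3>
    <p>deceased 2</p>
    <p>deceased 3</p>
  </div>
XML

root = REXML::Document.new(xml).root
hash = element_to_hash(root)
hash.each { |tag, values| puts "#{tag}: #{values.inspect}" }
```

Grouping by tag name throws away document order, so the hash no longer says which p belongs to which h3. For the obituary page that is exactly the information needed, which is one reason a generic to_hash is harder than it looks.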