Using Nokogiri

I'm trying to scrape some data off websites using Nokogiri.

require 'rubygems'
require 'open-uri'
require 'nokogiri' # using the latest 1.4.0

url = 'http://www.whateverwebsitenameis.org'

doc = Nokogiri::HTML(open(url))

This gets me data off the website I want to scrape.

The segment of the site I want looks like this (from FF 'view source'):


Association Detail

<div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL

DIRECTORY RESULTS

  1. Some Institute name
  2. some address
     city, st zip
  3. United States
  4. Phone:
  5. (123) 456-7890<Br>

10
11) Web address: www.xyz.org

<br><br>

<A href="javascript:history.back();">Back to Search Results</a>

<A href="AssociationSearch.cfm">Search Again</a>
---------------------------------------------------------------------------------

I want to scrape and collect the data between lines 1-11, i.e., name,
address, city, st, zip, United States, phone number; and from line 11 I
want the website url: 'http://www.xyz.org'

I can find the beginning of this section of code by doing this:

doc.css('h2').each { |elem| puts elem.content }

which displays 'Association Detail'

I am having problems using this as the starting point to parse the
data in lines 1-11, which contain the specific 'Association Detail'
details. I've tried it with 'xpath' and 'search' according to the
example here: http://rdoc.info/projects/tenderlove/nokogiri

but there's something I'm just not getting correctly when I try to use
other elements to get info from.

My system is Windows XP, Ruby 1.8.6, Nokogiri 1.4.0

Thanks in advance for any help.

jzakiya wrote:

I'm trying to scrape some data off websites using nokogiri [...]

doc.css('h2').each { |elem| puts elem.content }
which displays 'Association Detail' [...]
You aren't really searching by css, which would involve things like
searching for tags based on their 'class' attribute or 'id' attribute.
Because the <h2> tag doesn't have any attributes, you are simply
searching by tag name, so you could do this instead:

doc.xpath('//h2').each do |h2|
  puts h2.content
end

That uses xpath notation to find all h2 tags on the page. Then you
might write something like this:

doc = Nokogiri::HTML.parse(html)

doc.xpath('//h2').each do |h2|
  if h2.content == "Association Detail"
    puts "---"
    puts h2.next.content
    puts "---"
  end
end

Knowing you can do that will enable you to write something like this:

results = []

doc.xpath('//h2').each do |h2|
  if h2.content == "Association Detail"
    curr_elmt = h2
    while (curr_elmt = curr_elmt.next)
      curr_content = curr_elmt.content
      results << curr_content
      break if curr_content.include?("Web address:")
    end
  end
end

results.each do |result|
  puts "--start--"
  puts result
  puts "--end--"
  puts
end

output=

--start--
DETAIL
DIRECTORY RESULTS
--end--

--start--
Some Institute name
--end--

--start--

--end--

--start--

--end--

--start--

some address city, st zip

United States

  Phone:

    (123) 456-7890

) Web address: www.xyz.orgBack to Search Results
a>Search Again
--end--

As you can see, the html is pretty bad, so your results aren’t that
great. You will have to figure out how to extract the data you need
from those strings.
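One way to start that extraction (a rough sketch of my own, not from the thread; the regex and the sample strings are assumptions modeled on the output above):

```ruby
# Sketch: normalize the raw strings collected by the sibling-walking loop
# above, then pull out a field with a plain regex. The sample data mimics
# the --start--/--end-- output shown earlier.
raw = [
  "DETAIL\nDIRECTORY RESULTS",
  "Some Institute name",
  "\n some address\n city, st zip\n\nUnited States\n\n  Phone:\n\n    (123) 456-7890\n"
]

# Collapse runs of whitespace into single spaces and drop empty strings.
clean = raw.map { |s| s.gsub(/\s+/, ' ').strip }.reject(&:empty?)

# The phone number is whichever piece matches a (nnn) nnn-nnnn pattern.
phone = clean.map { |s| s[/\(\d{3}\)\s*\d{3}-\d{4}/] }.compact.first

puts clean.inspect
puts "phone: #{phone}"
```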

7stud – wrote:

jzakiya wrote:

I’m trying to scrape some data off websites using nokogiri

I chopped off the top of my code, which looks like this:

require 'rubygems'
require 'nokogiri'

html =<<ENDOFHTML

Association Detail

<div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL

DIRECTORY RESULTS

<b>Some Institute name</b><Br><br>

some address
city, st zip

United States <Br>

  Phone:

    (123) 456-7890<Br>

<br>

) Web address: www.xyz.org

<br><br>

<A href="javascript:history.back();">Back to Search Results</

a>

doc = Nokogiri::HTML.parse(html)

This should get what you want:

prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
xpaths = {
  :name      => "#{prefix}b/text()",
  :addr      => "#{prefix}text()[2]",
  :citystzip => "#{prefix}text()[3]",
  :country   => "#{prefix}text()[4]",
  :phone     => "#{prefix}text()[5]",
}
xpaths.each do |data, xpath|
  puts "#{data} = " + doc.search(xpath).to_s.strip
end

– Mark.

7stud – wrote:

Argh. Now I’ve chomped off the bottom of the html. This is what I used:

require 'rubygems'
require 'nokogiri'

html =<<ENDOFHTML

Association Detail

<div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL

DIRECTORY RESULTS

<b>Some Institute name</b><Br><br>

some address
city, st zip

United States <Br>

  Phone:

    (123) 456-7890<Br>

<br>

) Web address: www.xyz.org

<br><br>

<A href="javascript:history.back();">Back to Search Results</

a>

<A href="AssociationSearch.cfm">Search Again</a>
ENDOFHTML

doc = Nokogiri::HTML.parse(html)

…rest of code

Mark T. wrote:

This should get what you want:

prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
xpaths = {
  :name      => "#{prefix}b/text()",
  :addr      => "#{prefix}text()[2]",
  :citystzip => "#{prefix}text()[3]",
  :country   => "#{prefix}text()[4]",
  :phone     => "#{prefix}text()[5]",
}
xpaths.each do |data, xpath|
  puts "#{data} = " + doc.search(xpath).to_s.strip
end

– Mark.

I was wondering if you could answer some xpath questions? I would think
that in this xpath:

div[@class="sectionHeaderText"]/following-sibling::text()[2]

the part:

div[@class="sectionHeaderText"]/following-sibling

would be the <b> tag. Then:

div[@class="sectionHeaderText"]/following-sibling::text()

would be the <b> tag's text, or "Some Institute name". So then the
following [2]:

div[@class="sectionHeaderText"]/following-sibling::text()[2]

doesn't seem applicable. And in fact, when I run your code, it doesn't
work:

addr =
citystzip =
name = Some Institute name
country =
phone =

===========

html =<<ENDOFHTML

Association Detail

<div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL

DIRECTORY RESULTS

<b>Some Institute name</b><Br><br>

some address
city, st zip

United States <Br>

  Phone:

    (123) 456-7890<Br>

<br>

) Web address: www.xyz.org

<br><br>

<A href="javascript:history.back();">Back to Search Results</

a>

<A href="AssociationSearch.cfm">Search Again</a>
ENDOFHTML

doc = Nokogiri::HTML.parse(html)

prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
xpaths = {
  :name      => "#{prefix}b/text()",
  :addr      => "#{prefix}text()[2]",
  :citystzip => "#{prefix}text()[3]",
  :country   => "#{prefix}text()[4]",
  :phone     => "#{prefix}text()[5]",
}
xpaths.each do |key, val|
  puts "#{key} = " + doc.search(val).to_s.strip
end


7stud’s approach works, but Mark’s doesn’t (currently).
Here’s the file I created which will get me all the raw
data I want (still have to process to get to final form).

file: scrape.rb

require 'rubygems'
require 'open-uri'
require 'nokogiri'

def scrape(id)
  id  = id.to_s
  url = "http://www.xyz.org/../../..ID=#{id}"
  doc = Nokogiri::HTML.parse(open(url))

  results = []

  doc.xpath('//h2').each do |h2|
    if h2.content == "Association Detail"
      curr_elmt = h2
      while (curr_elmt = curr_elmt.next)
        curr_content = curr_elmt.content.gsub(/\n|\t|\r/, '').squeeze(' ').strip
        results << curr_content unless curr_content.empty?
        break if curr_content.include?("Back to Search Results")
      end
    end
  end

  results.each do |result|
    puts "--start--"
    puts result
    puts "--end--"
  end
  results
end

So I just ‘require’ this file, and can then do:

info = scrape 1234

where 'info' is the array 'results'. I can then process
that to my heart's delight.
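For instance, a minimal labeling pass might look like this (a sketch of my own; the field order is assumed from the sample output earlier in the thread, and label_results is a made-up helper name):

```ruby
# Sketch: give names to the positional entries that scrape() returns.
# Assumes results come back in the order seen in the thread's output.
def label_results(results)
  {
    :name      => results[1],
    :addr      => results[2],
    :citystzip => results[3],
    :country   => results[4],
    # Phone is whichever entry contains a (nnn) nnn-nnnn pattern.
    :phone     => results.map { |s| s[/\(\d{3}\)\s*\d{3}-\d{4}/] }.compact.first
  }
end

sample = ["DETAIL DIRECTORY RESULTS", "Some Institute name", "some address",
          "city, st zip", "United States", "Phone:", "(123) 456-7890",
          ") Web address: www.xyz.org"]
p label_results(sample)
```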

Thanks 7stud for your help.
I would, however, like to know if Mark’s way can be made to work too.

Jabari

7stud’s approach works, but Mark’s doesn’t (currently).

Strange… it works for me.

mark@ubuntu:~$ ruby -v
ruby 1.8.7 (2009-06-12 patchlevel 174) [i486-linux]

Nokogiri 1.4.0
libxslt 1.1.24-2ubuntu2

Here’s the entire working program:

require 'rubygems'
require 'nokogiri'

html =<<ENDOFHTML

Association Detail

<div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL

DIRECTORY RESULTS

<b>Some Institute name</b><Br><br>

some address
city, st zip

United States <Br>

  Phone:

    (123) 456-7890<Br>

<br>

) Web address: www.xyz.org

<br><br>

<A href="javascript:history.back();">Back to Search Results</

a>

<A href="AssociationSearch.cfm">Search Again</a>
ENDOFHTML

doc = Nokogiri::HTML.parse(html)

prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
xpaths = {
  :name      => "#{prefix}b/text()",
  :addr      => "#{prefix}text()[2]",
  :citystzip => "#{prefix}text()[3]",
  :country   => "#{prefix}text()[4]",
  :phone     => "#{prefix}text()[5]",
}
xpaths.each do |k, xpath|
  puts "#{k} = " + doc.search(xpath).to_s.strip
end

Output:

addr = some address
citystzip = city, st zip
country = United States
phone = Phone:

    (123) 456-7890

name = Some Institute name

Mark T. wrote:

As I just posted in another message, it works for me. I wonder what’s
different about my environment. Are you using Nokogiri 1.4.0?

Yes, however I get a warning message that informs me that I’m using an
outdated version of libxml2:

$ ruby -v
ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-darwin8.11.1]

$ nokogiri -v
HI. You’re using libxml2 version 2.6.16 which is over 4 years old and
has
plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure,
you
upgrade your version of libxml2 and re-install nokogiri. If you like
using
libxml2 version 2.6.16, but don’t like this warning, please define the
constant
I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requring
nokogiri.

/usr/local/lib/ruby/gems/1.8/gems/nokogiri-1.4.0/lib/nokogiri/xml/builder.rb:272:
warning: parenthesize argument(s) for future version

nokogiri: 1.4.0
warnings: []

libxml:
compiled: 2.6.16
loaded: 2.6.16
binding: extension

So it could be something with that, or maybe it has something to do with
the fact that ruby 1.8.7 back ports some stuff from ruby 1.9.

On Nov 9, 12:37 am, 7stud – [email protected] wrote:

the part:

div[@class="sectionHeaderText"]/following-sibling

would be the <b> tag.

Not quite. following-sibling:: is an axis specifier that needs to be
followed by a node test. Therefore following-sibling::text() is the set
of all text nodes after the div. After that, it's just a matter of
indexing.

doesn’t seem applicable. And in fact, when I run your code, it doesn’t
work:

As I just posted in another message, it works for me. I wonder what’s
different about my environment. Are you using Nokogiri 1.4.0?


OK, when I put Mark's code in a file and ran it (versus entering it in
an irb session) it DOES work. However, it doesn't capture the website
url, which 7stud's approach does. I haven't figured out how to do it
with this approach, and merely adding more items to xpaths doesn't
work.

So Mark, how can your approach be used to capture the url at the end
of the data section?

Here’s the file I used with Mark’s approach:

file: scrape1.rb

require 'rubygems'
require 'open-uri'
require 'nokogiri'

def scrape(id)
  id  = id.to_s
  url = "http://www.xyz.org/../../..ID=#{id}"
  doc = Nokogiri::HTML.parse(open(url))

  prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
  xpaths = {
    :name      => "#{prefix}b/text()",
    :addr      => "#{prefix}text()[2]",
    :citystzip => "#{prefix}text()[3]",
    :country   => "#{prefix}text()[4]",
    :phone     => "#{prefix}text()[5]",
    :web       => "#{prefix}text()[6]",
    :url       => "#{prefix}text()[7]"
  }

  results = {}
  xpaths.each do |data, xpath|
    results[data] = doc.search(xpath).to_s.gsub(/\n|\t|\r/, '').squeeze(' ').strip
    puts "#{data} = " + results[data]
  end
  results
end

And use as before: info = scrape 1234

Jabari

doc = Nokogiri::HTML.parse(open(url))

prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
xpaths = {
  :name      => "#{prefix}b/text()",
  :addr      => "#{prefix}text()[2]",
  :citystzip => "#{prefix}text()[3]",
  :country   => "#{prefix}text()[4]",
  :phone     => "#{prefix}text()[5]",
  :web       => "#{prefix}text()[6]",
  :url       => "#{prefix}text()[7]"

You'll need to modify that last line. Unlike the other items, the URL
is not in a text node; it is the href attribute of the first <a>
element. So try:

:url => "#{prefix}a[1]/@href"

On Nov 9, 10:51 pm, 7stud – [email protected] wrote:

$ nokogiri -v

cool! I didn’t know about that.

/usr/local/lib/ruby/gems/1.8/gems/nokogiri-1.4.0/lib/nokogiri/xml/builder.rb:272:
warning: parenthesize argument(s) for future version

nokogiri: 1.4.0
warnings: []

libxml:
compiled: 2.6.16
loaded: 2.6.16
binding: extension

This is most likely the problem.

Mine reports:
libxml:
loaded: 2.7.5
binding: extension
compiled: 2.7.5

with no warnings.

Can you install a newer version of libxml2? As you can see from the
libxml2 NEWS file, your version dates back to 2004 with
tons of bug fixes (including XPath fixes) since.


On Nov 10, 10:29 pm, Mark T. [email protected] wrote:

You’ll need to modify that last line. Unlike the other items, the URL
is not in a text node, it is the href attribute of the first
element. So try:

:url => "#{prefix}a[1]/@href"

Yes, this allows me to capture the url I want (and sometimes ones I
don’t want), and I’m able to post-process xpaths to get everything I
need.

xpaths = {
  :name      => "#{prefix}b/text()",
  :addr      => "#{prefix}text()[2]",
  :citystzip => "#{prefix}text()[3]",
  :country   => "#{prefix}text()[4]",
  :phone     => "#{prefix}text()[5]",
  :url       => "#{prefix}a[1]/@href"
}

Now, I just need to understand completely WHY/HOW it works. :)

Jabari

Mark T. wrote:

Can you install a newer version of libxml2? As you can see from the
libxml2 NEWS file, your version dates back to 2004 with tons of bug
fixes (including XPath fixes) since.

I’ve looked into installing newer versions of libxml2 and libxslt, but
it looks complicated and fraught with danger for mac os x.



On Nov 12, 11:52 am, jzakiya [email protected] wrote:

Now, I just need to understand completely WHY/HOW it works. :)

Let’s take the first one as an example. I noticed that everything was
after a div with the class “sectionHeaderText”, so I started with
that:

//div[@class="sectionHeaderText"]

The double slash is a wildcard that means the div can be anywhere. The
part in brackets is called a predicate, and it constrains the
expression. I like to think of it as a “such that” clause. So you can
read the above as “a div such that the class is
‘sectionHeaderText’.” (Actually, it’s the set of all divs for which it
is true, so if you had multiple divs with the same class, it would
return them all)

Then I noticed that the items you wanted were not children of the div.
The div closes before you get to the text you want. Even <br> tags are
considered to be siblings, though they are self-closing. Therefore almost
everything you want is at the same nesting depth, or in XPath
terminology, they are siblings. "following-sibling" is an XPath
"axis" (see the W3Schools XPath tutorial for details on these
things). The name, though, was inside a <b> element, so I used the XPath
expression to get the following sibling that happens to be a <b>
element:

//div[@class="sectionHeaderText"]/following-sibling::b

Then, how you get text from within a node is the XPath function text()
which means all the text between tags, including whitespace.

//div[@class="sectionHeaderText"]/following-sibling::b/text()

And there you have the name.

Now, the other things were text nodes between <br> elements. You could
pull them all by asking for the set of text-node siblings of the div:

//div[@class="sectionHeaderText"]/following-sibling::text()

But when you get more stuff than you want like that, you can index
them like an array:

//div[@class="sectionHeaderText"]/following-sibling::text()[2]

and that happens to pull the street address.

So hopefully you see how the XPaths were put together. Usually they
are a bit simpler, but like 7stud said, it was pretty crappy HTML.

– Mark.

jzakiya wrote:

Now, I just need to understand completely WHY/HOW it works. :slight_smile:

Here is a pretty good basic XPath tutorial:

http://www.w3schools.com/XPath/xpath_nodes.asp