Help with net/http

Dobai-Pataky_BSSSSl · December 9, 2010, 9:43pm

I am trying to screen scrape a webpage and pull out the name, address,
city, state, zip and phone on a site that lists apartments for rent.

Here is my code:

require 'net/http'
require 'uri'

temparray = []

url = URI.parse('http://www.apartment-directory.info')
res = Net::HTTP.start(url.host, url.port) { |http|
  http.get('/connecticut/0')
}

puts res.body

# Keep only the table-cell lines, with double quotes stripped
res.body.each_line { |line|
  line.gsub!(/"/, '')
  temparray.push(line) if line =~ /<td\svalign=top/
}

# Strip the unwanted markup from each kept line
temparray.each do |j|
  # j.gsub!(/<a\shref=\/map.*<\/a>/, '')
  j.gsub!(/\shref=\/map\//, '')
  j.gsub!(/\d+\sclass=map>Map It!/, '')
  j.gsub!(/<\/td>/, '')
  j.gsub!(/<td\svalign=top>/, '')
  j.gsub!(/<td\svalign=top\snowrap>/, '')
  j.gsub!(/<tr\sbgcolor=white>/, '<br>')
  j.gsub!(/MapIt!/, ', ')
  j.gsub!(/\(/, ', (')
  j.gsub!(/<\/tr>/, '')

  puts j
end

I am able to grab the HTML from the page. I then gsub! out the "
characters and push each line that matches <td valign=top onto an array.
I then iterate through the array and try to remove what I don’t want with
more gsub! calls. The output still has HTML tags in it and looks good if
I write it to an HTML page (you can see the output here:
http://www.holy-name.org/ct.html), but I really need to remove the HTML
tags and get just the important facts into a CSV file. Since there are 4
elements in the array for each record, the only way I could get it to
work on a web page was to add a <br> between records.

Is there a better way to pull out the pertinent info and avoid all the
HTML tags?

thanks

atomic

magnifiedplaid · December 9, 2010, 10:03pm

Nokogiri provides a great interface for accessing the data trapped
inside markup.

Try something like:

page = Nokogiri::HTML res.body
data = []
page.xpath("//xpath/to/table").each do |node|
data << node.xpath("./rel/xpath/to/data/text()")
end

Alex S. | Sr. Quality Engineer | hi5 Networks, Inc.

magnifiedplaid · December 10, 2010, 6:28am

Thanks Alex.

I tried following the instructions to install the Nokogiri gem but it
gave me a few errors. I tried linking the libraries during the install:

[server01][/]$ gem install nokogiri -- --with-xml2-lib=/usr/local/lib \
  --with-xml2-include=/usr/local/include/libxml2 \
  --with-xslt-lib=/usr/local/lib \
  --with-xslt-include=/usr/local/include/libxslt
Building native extensions. This could take a while…
Successfully installed nokogiri-1.4.4
1 gem installed
Installing ri documentation for nokogiri-1.4.4…

No definition for get_options

No definition for set_options

No definition for parse_memory

No definition for parse_file

No definition for parse_with
Installing RDoc documentation for nokogiri-1.4.4…

No definition for get_options

No definition for set_options

No definition for parse_memory

No definition for parse_file

No definition for parse_with

As a test, I created a test file with the following code:

require 'open-uri'
doc = Nokogiri::HTML(open("http://www.anysite.com/"))

But when I run it, I get the following, so I don’t think the gem is
installed correctly:

[server01][/usr/bin]$ ./test.rb
./test.rb:7: uninitialized constant Nokogiri (NameError)

Would you be able to suggest anything to help me get Nokogiri installed
and working?

thanks

atomic

magnifiedplaid · December 10, 2010, 10:47am

I didn’t realize that, Jesus, but it didn’t help with my installation.
When I run the test script, here’s what I get:

[server01][/usr/bin]$ ./test.rb
./test.rb:6:in `require': no such file to load -- nokogiri (LoadError)
from ./test.rb:6

thanks

atomic

magnifiedplaid · December 10, 2010, 9:20am

On Fri, Dec 10, 2010 at 6:28 AM, A. Mcbomb wrote:

Successfully installed nokogiri-1.4.4

[server01][/usr/bin]$ ./test.rb
./test.rb:7: uninitialized constant Nokogiri (NameError)

Would you be able to suggest anything to help me get Nokogiri installed
and working?

You have to require ‘nokogiri’

Jesus.

magnifiedplaid · December 10, 2010, 10:53am

On Fri, Dec 10, 2010 at 10:48 AM, A. Mcbomb wrote:

I didn’t realize that, Jesus, but it didn’t help with my installation.
When I run the test script, here’s what I get:

[server01][/usr/bin]$ ./test.rb
./test.rb:6:in `require': no such file to load -- nokogiri (LoadError)
from ./test.rb:6

Did you require rubygems before requiring nokogiri? The typical ways
are:

export RUBYOPT=rubygems

or calling ruby -rubygems ./test.rb

or adding require 'rubygems' to your script (there have been
discussions here about why this is not recommended, especially for
library code).

In general, to use a gem you have to require rubygems before requiring
the gem.
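
A minimal sketch of that load order, assuming a Ruby 1.8-era setup like the one in this thread. On Ruby 1.9 and later RubyGems is loaded automatically, so the explicit require is harmless but not needed. The helper name `try_require` is made up for illustration.

```ruby
# On Ruby 1.8, RubyGems must be loaded before any gem can be required.
# On Ruby 1.9+ it is already loaded, so this line changes nothing there.
require 'rubygems'

# Hypothetical helper: returns true if the library loads,
# false if Ruby raises LoadError ("no such file to load").
def try_require(name)
  require name
  true
rescue LoadError
  false
end

if try_require('nokogiri')
  puts "nokogiri loaded, version #{Nokogiri::VERSION}"
else
  puts 'nokogiri not found: check `gem list nokogiri` and that rubygems is loaded'
end
```

If the second message is printed even though `gem list nokogiri` shows the gem, the gem was probably installed for a different Ruby than the one running the script.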

Jesus.

magnifiedplaid · December 10, 2010, 11:38am

That definitely helped, Jesus…thanks.

Here is what I get now when I run the test script:

[server01][/usr/bin]$ ./test.rb
HI. You're using libxml2 version 2.6.16 which is over 4 years old and has
plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure, you
upgrade your version of libxml2 and re-install nokogiri. If you like using
libxml2 version 2.6.16, but don't like this warning, please define the constant
I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requring
nokogiri.

This sounds like I should upgrade…would you recommend just simply
replacing the one file (libxml2) and then reinstall the gem or are there
more files that I should replace such as libxslt as well?

thank you so much for helping me to get this going Jesus!

atomic

magnifiedplaid · December 10, 2010, 12:18pm

On Fri, Dec 10, 2010 at 11:39 AM, A. Mcbomb wrote:

thank you so much for helping me to get this going Jesus!

What OS (and version) are you on? I have a pretty old version of
Ubuntu (8.10) and have libxml2.so.2.6.32.
To correctly upgrade a library, please use your OS facilities (apt,
yum or whatever).

Jesus.

magnifiedplaid · December 10, 2010, 1:47pm

Here’s what my server is running:

Linux version 2.6.9-42.0.3.EL.wh1smp (root@wdl70144) (gcc version 3.4.6
20060404 (Red Hat 3.4.6-11)) #1 SMP Fri Aug 14 15:48:17 MDT 2009

The problem I run into is that since this is a shared hosting server,
they don’t allow me to add RPMs to the server.

Do you know of a way to update the library with a binary file, for
instance?
Which library do I need, so I can look around?

thanks again Jesus

atomic

magnifiedplaid · December 10, 2010, 8:53pm

Installing gems local to your user account might help get around some
issues. Depending on what your host allows/supports, you might look into
using RVM to manage your Ruby installations and gems.

–Scott

magnifiedplaid · December 10, 2010, 11:57pm

On Fri, Dec 10, 2010 at 8:52 PM, Scott H. wrote:

Installing gems local to your user account might help get around some
issues. Depending on what your host allows/supports, you might look into
using RVM to manage your Ruby installations and gems.

The problem is not the gem, it is the libxml2 dependency. I don’t know
how to install a library locally for RedHat, maybe the OP can
investigate that. And then you have to tell nokogiri to use that local
version. I haven’t looked into it, maybe it’s easy, I don’t know.
Maybe some other person on the list can help the OP further.

Jesus.

magnifiedplaid · December 11, 2010, 1:41am

Hang on! It is working now. As I was writing my last post, I realized I
had been using:

page.xpath("//tr//td/a") and changed it to page.xpath("//tr/td")

and tried that after my last post.

I get the following output, which is good except for the Â characters.
What is the best way to get rid of those and combine each record on one
line, separated only by commas?

MapÂ It!Â Â

90 Gerrish Avenue

East Haven,Â Â CTÂ Â 06512

(203) 466-2605

Avalon Bay Communities

MapÂ It!Â Â

66 Glenbrook Road No. 200

Stamford,Â Â CTÂ Â 06902

(203) 357-0986

Avalon Grove Luxury Apartments

thanks again,

atomic
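
The Â characters are most likely non-breaking spaces (U+00A0). In UTF-8 they are the two bytes C2 A0, which show up as "Â" plus a space when the terminal treats the output as Latin-1. Here is a rough sketch of one way to clean them up and join a record, assuming Ruby 1.9 or later. The helper names `clean_cell` and `to_record` are made up, and the cell list is typed in by hand from the output above rather than scraped.

```ruby
NBSP = "\u00A0" # non-breaking space, UTF-8 bytes C2 A0

# Hypothetical helper: turn non-breaking spaces into ordinary spaces,
# collapse runs of whitespace, and trim both ends.
def clean_cell(text)
  text.gsub(NBSP, ' ').gsub(/\s+/, ' ').strip
end

# Hypothetical helper: clean every cell, drop empty cells and the
# "Map It!" link text, then join what is left with commas.
def to_record(cells)
  cells.map { |c| clean_cell(c) }
       .reject { |c| c.empty? || c == 'Map It!' }
       .join(',')
end

# Cells for one record, hand-copied from the output above (assumed order).
cells = [
  'Avalon Bay Communities',
  "Map\u00A0It!\u00A0\u00A0",
  '66 Glenbrook Road No. 200',
  "Stamford,\u00A0\u00A0CT\u00A0\u00A006902",
  '(203) 357-0986'
]

puts to_record(cells)
# prints: Avalon Bay Communities,66 Glenbrook Road No. 200,Stamford, CT 06902,(203) 357-0986
```

The non-breaking spaces are replaced first because Ruby's `\s` only matches ASCII whitespace. Note that the city cell still contains its own comma, so for a real CSV file that field would need to be split or quoted.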

magnifiedplaid · December 11, 2010, 2:28am

You might also consider the mechanize library:
http://mechanize.rubyforge.org/mechanize/GUIDE_rdoc.html

e.g.

require 'rubygems'
require 'mechanize'

Mechanize.new.get('http://www.apartment-directory.info/alabama/0') do |page|
  page.search('//tr').each do |tr|
    tds = tr.search('./td')
    puts tds[0].text.chomp rescue nil
    puts tds[2].text.chomp rescue nil
    puts tds[3].text.chomp rescue nil
    puts
  end
end

This sample script as-is is too greedy; it loops over every row of
every table instead of just the interesting one.

$ ruby i.rb
[some garbage from other tables]
…
Aquadome Apartment
1619 8th Street Southwest
Decatur,AL35601

Arbor Park Apartments
175 Sloan Avenue East
Talladega,AL35160

Arbor Place Apartments
515 Fox Run Parkway No. 9A
Opelika,AL36801

Arbor Pointe Apartments
100 Dairy Road
Mobile,AL36612

Arboretum Apartments
1800 Arboretum Circle
Birmingham,AL35216

Arbors On Taylor
485 Taylor Road
Montgomery,AL36117

Arrow Head Apartments
129 South Union Avenue
Ozark,AL36360
…

magnifiedplaid · December 11, 2010, 1:33am

I got one of my servers updated and I’m now running Nokogiri without
errors which is great news.

Here is my new code:

require 'rubygems'
require 'net/http'
require 'nokogiri'

url = URI.parse('http://www.apartment-directory.info')
res = Net::HTTP.start(url.host, url.port) { |http|
  http.get('/connecticut/0')
}

page = Nokogiri::HTML res.body
page.xpath("//tr//td/a").each do |node|
  puts node.text
end

This returns some of the data that I need but not all of it.
I do not understand this line:

page.xpath("//tr/td")

I know it is supposed to be the path to the data I need but I’m not sure
how I can get to all the data I need from the URL, it seems like some of
the data is between tags that I can’t figure out.

This is one record from the webpage in HTML:

22 Glenbrook Road Condo Associates Map It! 22 Glenbrook Road Stamford, CT 06902 (203) 327-4028

I need to be able to get the following information out of one record:

22 Glenbrook Road Condo Associates,22 Glenbrook Road,Stamford,CT,06902,(203) 327-4028

I thought that if I configured Nokogiri with:
page.xpath("//tr/td")

…it would get me inside these table tags, but it’s not working.

Can you possibly point out where I’m going wrong?

thanks for the help,

atomic
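
Once the cell text is available, the comma-separated record asked for above can be built with Ruby's standard `csv` library. This is only a sketch: the helper names `split_city_state_zip` and `csv_line` are made up, and it assumes the city cell always looks like "City, ST 12345".

```ruby
require 'csv'

# Hypothetical helper: split "City, ST 12345" into [city, state, zip].
# Returns nil when the text does not have that shape.
def split_city_state_zip(text)
  m = text.match(/\A(.+?),\s*([A-Z]{2})\s+(\d{5}(?:-\d{4})?)\z/)
  m && [m[1], m[2], m[3]]
end

# Hypothetical helper: build one CSV line from the four text cells.
# If the city cell cannot be split, city, state and zip are left empty.
def csv_line(name, street, city_line, phone)
  city, state, zip = split_city_state_zip(city_line)
  CSV.generate_line([name, street, city, state, zip, phone])
end

print csv_line('22 Glenbrook Road Condo Associates',
               '22 Glenbrook Road',
               'Stamford, CT 06902',
               '(203) 327-4028')
# prints: 22 Glenbrook Road Condo Associates,22 Glenbrook Road,Stamford,CT,06902,(203) 327-4028
```

Using `CSV.generate_line` rather than a plain `join(',')` means a field that happens to contain a comma or a quote is quoted automatically.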

magnifiedplaid · December 11, 2010, 2:53am

On Fri, Dec 10, 2010 at 4:38 PM, A. Mcbomb wrote:

page = Nokogiri::HTML res.body
page.xpath("//tr//td/a").each do |node|
  puts node.text
end

This returns some of the data that I need but not all of it.
I do not understand this line:

page.xpath("//tr/td")

That’s not what you’re using.

I know it is supposed to be the path to the data I need but I’m not sure
how I can get to all the data I need from the URL, it seems like some of
the data is between tags that I can’t figure out.

If you did use //tr/td you would get all the information in the
table,
only some of which is within anchor (a) tags.
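
To make the difference between the two expressions concrete, here is a small sketch. It deliberately uses REXML from the standard distribution instead of Nokogiri so that it runs without installing a gem; the same two XPath strings can be passed to Nokogiri's `xpath`. The table row is a simplified guess at the page's markup, not copied from the site, and the href values are invented.

```ruby
require 'rexml/document'

# Hypothetical, simplified row: the real page's markup may differ.
html = <<XML
<table>
  <tr>
    <td><a href="/listing/1">22 Glenbrook Road Condo Associates</a></td>
    <td><a href="/map/1">Map It!</a></td>
    <td>22 Glenbrook Road</td>
    <td>Stamford, CT 06902</td>
    <td>(203) 327-4028</td>
  </tr>
</table>
XML

doc = REXML::Document.new(html)

cells   = REXML::XPath.match(doc, '//tr/td')    # every cell of every row
anchors = REXML::XPath.match(doc, '//tr//td/a') # only links directly inside a cell

puts "cells:   #{cells.size}"
puts "anchors: #{anchors.size}"

# Text of each cell: the link text when the cell holds a link,
# otherwise the cell's own text.
cell_texts = cells.map do |td|
  link = td.elements['a']
  (link ? link.text : td.text).to_s.strip
end
puts cell_texts.inspect
```

With this sample row, `//tr//td/a` only reaches the name and the "Map It!" link, while `//tr/td` also reaches the street, city and phone cells.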

magnifiedplaid · December 11, 2010, 2:51am

I never heard of mechanize but I see from the doco that it requires
Nokogiri to run. I have a working copy of Nokogiri and just did a
successful ‘gem install mechanize’ but when I ran your basic script, I
get:

[root@trebek2 bin]# ./mechanize.rb
./mechanize.rb:7: uninitialized constant Mechanize (NameError)
from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `gem_original_require'
from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
from ./mechanize.rb:5

Am I missing something?

atomic

magnifiedplaid · December 11, 2010, 4:59am

On Fri, Dec 10, 2010 at 8:52 PM, A. Mcbomb wrote:

from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
from ./mechanize.rb:5

Am I missing something?

Hmm… I’m not sure there… oh wait… Could it be confused since
your file is (also) named ‘mechanize’?:

$ cp i.rb mechanize.rb
$ ruby mechanize.rb
./mechanize.rb:5: uninitialized constant Mechanize (NameError)
from /Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in `gem_original_require'
from /Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in `require'
from mechanize.rb:3

Yeah, I think that’s it. Try renaming your script.

magnifiedplaid · December 11, 2010, 8:56pm

My last question though is, what is the easiest way to get rid of all
the garbage that I don’t want from the other tables?

Try to narrow down the xpath used to pull stuff out. E.g.

//table[2]/tbody/tr

to only get the rows from the second table.

magnifiedplaid · December 11, 2010, 2:01pm

I renamed the script and it worked!
Pretty nice…thanks.

My last question though is, what is the easiest way to get rid of all
the garbage that I don’t want from the other tables?

thanks a lot

atomic
