Hey there everyone. I’m having a slight problem using Mechanize. I’m
trying to scrape yellowpages.com and extract information about each
business listing. I’m extracting all the information I want, except for
one small portion: the business’s website. It is the href inside of a
link that I am trying to scrape. As far as I know, I’m following the
correct xpath rules, but I can’t seem to get the part I want. One
tricky thing that I’ve had to deal with is that not every listing has a
website. The website link and the “learn more” link are very similar,
xpath-wise, so I have to use an if statement to check the inner text of
both of them to make sure that I’m extracting the right one.
Jethrow, thanks but that’s not quite what I need. I need to extract
this link’s href attribute, which is the website of the business. I’m
using xpath, using the “…/a/@href” method which I believe is the
correct one. But it just doesn’t extract anything! Any other ideas?
A leading “/” means that you want the xpath search to begin from the
root of the document. “./” means to start from the context node, in
this case website.
Thanks Mike. I’ve updated the code. Putting “./” in my code messed up
what the code grabbed, so I eliminated all leading “/” and it worked.
However, the website part still doesn’t work. It looks like it is
grabbing what is inside of the a tags, instead of grabbing the href
address. A typical listing output looks like:
McDonald’s
1213 State St # B,
Santa Barbara
CA
93101
(805) 962-6976 »
?
Website
[“Restaurants”, “Fast Food Restaurants”, “American Restaurants”]
The three lines: », ?, and Website are all grabbed from the website
= website.search(…) line, but it’s grabbing the wrong thing! Do you
have any more suggestions?
I still haven’t figured this out. Perhaps I should phrase the question
a different way…
What is the preferred method of extracting the href attribute from a
link? I’ve tried doing it using .search() and searching for the xml @href attribute. For some reason that’s not working for me.
Is there a different way of extracting this attribute, without using
.search and an xml path? I’m sure mechanize has some other method
too…
With this and a local version of the page I was able to get the info you
want:
#!/bin/env ruby19
require 'nokogiri'
# Read the saved page as UTF-8.
raw = File.read("restaurants.html", mode: "r:UTF-8")
puts raw.encoding
raw.force_encoding 'UTF-8'
doc = Nokogiri.parse raw
# Visit each listing block on the page.
doc.xpath('//div[@class="listing_content"]').each do |listing|
  puts '----------------------------------------'
end
Robert, thank you so much for your help. I’m just starting out in ruby,
so I have a lot to learn! I’ve attached my code with the changes you
recommended. It works wonderfully! Are there any other changes or
optimizations I could make?
Thanks again for your help. This script above is actually part of a
larger script, but I was just showing you the smaller portion for
clarity. I’ve attached the whole script, in order to put the for loop
in context. I’m using the for loop to grab multiple pages in
succession. Hopefully this makes it more clear.
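As a side note on the for loop: a range with map or each is the more common Ruby idiom for walking through numbered pages. A small sketch follows. The base URL and page count are placeholders, since the real search URL isn't shown here.

```ruby
# Hypothetical base URL and page count; substitute the real search URL.
BASE_URL = 'http://www.example.com/search?page=%d'
PAGE_COUNT = 3

# Range#map replaces a counting for loop and collects the results directly.
urls = (1..PAGE_COUNT).map { |page_number| format(BASE_URL, page_number) }

urls.each do |url|
  # With Mechanize this is where agent.get(url) and the per-listing
  # scraping would go; here the URL is only printed.
  puts url
end
```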
I’ve made all the changes you recommended, except consolidating the puts
statements. Could you explain how I would do that? Thank you!
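One possible reading of "consolidating the puts statements" is passing several values to a single puts call, or building the block of output once with a heredoc. A quick sketch, using values from the sample listing above in place of the scraped fields (the variable names are made up):

```ruby
# Values from the sample listing, standing in for the scraped fields.
name   = "McDonald's"
street = '1213 State St # B,'
city   = 'Santa Barbara'
phone  = '(805) 962-6976'

# Instead of one puts per field, puts accepts several arguments
# (or an array) and prints each one on its own line.
puts name, street, city, phone

# A heredoc is another option when the output has a fixed layout.
report = <<~TEXT
  #{name}
  #{street}
  #{city}
  #{phone}
TEXT
puts report
```

Both forms print the same four lines; the heredoc is easier to rearrange if the layout grows.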