Ruby screen scraping

Hi,

I’m looking at creating a Ruby script that will first access our
CruiseControl page on localhost and examine the page to see the values
on it, so basically telling us whether the build succeeded or failed.

Does anyone have any opinions on what might be the best way to approach
this task? I’ve been looking at a number of different packages,
including HTree.

Thanks

Hi,

On 11/19/06, Chris G. [email protected] wrote:

I’m looking at creating a Ruby script that will first access our
CruiseControl page on localhost and examine the page to see the values
on it, so basically telling us whether the build succeeded or failed.

If you want screen scraping, I would tell you to look at why’s
excellent Hpricot HTML parser. It’s really simple to use and very
effective.

http://code.whytheluckystiff.net/hpricot/
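
For example, something along these lines should already get you at the
build status (the URL and the selector here are just placeholders -
you’d adjust them to whatever your CruiseControl page actually serves):

require 'rubygems'
require 'hpricot'
require 'open-uri'

# Placeholder URL - point this at your CruiseControl page.
doc = Hpricot(open("http://localhost:8080/"))

# Assumes the status sits in a <td class="status"> cell - adjust the
# selector to the real markup.
status = (doc/"td.status").first
puts(status ? status.inner_text : "no status cell found")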

Cheers,
Alvim.

For HTML scraping I recommend scrAPI.

gem install scrapi

homepage:
http://blog.labnotes.org/category/scrapi/

Example scraper:

Scraper.define do
  attr_accessor :title, :author, :pub_date, :content

  process "div#GuardianArticle > h1", :title => :text
  process "div#GuardianArticle > font[size=2] > b" do |element|
    @author = element.children[0].content
    @pub_date = element.children[2].content.strip
  end
  process "div#GuardianArticleBody", :content => :text
end
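
If I remember correctly, Scraper.define returns a scraper class that you
then run against a URI, roughly like this (untested sketch - please
check the scrAPI docs for the exact call; the URL is made up):

require 'rubygems'
require 'scrapi'
require 'uri'

title_scraper = Scraper.define do
  process "title", :page_title => :text
  result :page_title
end

# hypothetical URL, just for illustration
puts title_scraper.scrape(URI.parse("http://example.com/"))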

Chris G. wrote:

Hi,

I’m looking at creating a Ruby script that will first access our
CruiseControl page on localhost and examine the page to see the values
on it, so basically telling us whether the build succeeded or failed.

Once you have the page (open-uri if you know the URL exactly, or
WWW::Mechanize if you need to navigate there, i.e. fill text fields,
click buttons etc.), I recommend checking out these possibilities:

  1. regular expressions
  2. Hpricot
  3. scrAPI
  4. Rubyful Soup

Regular expressions would be the most old-school solution; in some cases
such a wrapper is the most robust one (but since you are in control of
the generated page, as I understood, robustness is possibly not an
issue).

If you can’t do it with regexps, Hpricot will most probably be adequate
(I would need to see the concrete page).

Finally, if neither of the above works, you should try scrAPI - and
though I don’t think you will still be stuck after this point, Rubyful
Soup is another possibility to check out.
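
To give you an idea of the regexp route, a minimal sketch could look
like this - the URL and the ‘passed’/‘failed’ wording are just guesses,
so adjust them to whatever your CruiseControl page actually shows:

require 'open-uri'

# Placeholder URL - point this at your CruiseControl status page.
page = open("http://localhost:8080/").read

# Assumes the page mentions "passed" or "failed" somewhere; adjust the
# patterns to your actual markup.
if page =~ /passed/i
  puts "build succeeded"
elsif page =~ /failed/i
  puts "build failed"
else
  puts "could not determine build status"
end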

Peter
__
http://www.rubyrailways.com

Thanks guys, I’ll look into both of them.

Another question I would have is: how would I then get this scraped info
inserted into a MySQL database called, say, “build” and a table called
“results”?

For now, could you base answers on the following HTree code?

require 'open-uri'
require 'htree'
require 'rexml/document'

url = "http://www.google.com/search?q=ruby"   # a Google search for "ruby"
open(url) { |page|
  page_content = page.read
  doc = HTree(page_content).to_rexml
  doc.root.each_element('//a[@class="l"]') { |elem|
    puts elem.attribute('href').value
  }
}

which returns:

C:\>ruby script2.rb
http://www.ruby-lang.org/
http://www.ruby-lang.org/en/20020101.html
http://www.rubycentral.com/book/
http://www.w3.org/TR/ruby/
http://poignantguide.net/

Cheers.

Chris G. wrote:

require 'rexml/document'

url = "http://www.google.com/search?q=ruby"
open(url) { |page|
  page_content = page.read
  doc = HTree(page_content).to_rexml
  doc.root.each_element('//a[@class="l"]') { |elem|
    puts elem.attribute('href').value
  }
}
Something along the lines of:

require 'mysql'

dbh = Mysql.real_connect("localhost", "chris", "", "build")
dbh.query("INSERT INTO results VALUES ('whatever')")

Cheers,

Peter
__
http://www.rubyrailways.com

Thanks for the help.

I’ll get on with it and see how it goes :)

OK, here is the full code:

require 'open-uri'
require 'htree'
require 'rexml/document'
require 'mysql'

url = "http://www.google.com/search?q=ruby"   # a Google search for "ruby"
results = []

open(url) { |page|
  page_content = page.read
  doc = HTree(page_content).to_rexml
  doc.root.each_element('//a[@class="l"]') { |elem|
    results << elem.attribute('href').value
  }

  dbh = Mysql.real_connect("localhost", "peter", "****", "build")

  results.each do |result|
    dbh.query("INSERT INTO result VALUES ('#{result}')")
  end
}
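
This assumes the “build” database already contains a matching one-column
table, created with something along these lines (adjust the column name
and type to your own schema):

CREATE TABLE result (url VARCHAR(255));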

HTH,

Peter
__
http://www.rubyrailways.com

Wow, thanks for that code.

One question though: does the name of the field in the table which the
scraped information is going to be inserted into need to be specified
in the code? Or is it already, and I’m missing something here?

Ah, that’s great.

Thanks again for your help :)

OK, that code all works great, but I have one last question :)

This is allowing me to scrape the values of the class attributes on tags
and other attributes like that. My question is: how would I modify the
code in order to get it to capture a block of text such as:

<p>this is text that i want to scrape</p>

any ideas?

thanks.

Chris G. wrote:

Wow, thanks for that code.

Welcome :)

One question though: does the name of the field in the table which the
scraped information is going to be inserted into need to be specified
in the code? Or is it already, and I’m missing something here?

My code assumed that the table has one column (e.g. ‘url’ in this case)
and that the values are inserted into that column.

Otherwise, if you have more columns, you can do this:

INSERT INTO people (name, age) VALUES ('Peter S.', '23');

You can also do

INSERT INTO people VALUES ('Peter S.', '23');

but in this case you have to be sure that the columns in your DB are in
the same order as in your insert query. In the first example you don’t
have to care about the column ordering in the DB, as long as the mapping
between the column names (first pair of parentheses) and the values
(second pair of parentheses) is OK.
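
So, in the script above, you could make the query explicit about the
column along these lines (this assumes the column is called ‘url’ -
substitute whatever name you actually used):

results.each do |result|
  # naming the column means the table's column order doesn't matter
  dbh.query("INSERT INTO result (url) VALUES ('#{result}')")
end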

HTH,
Peter

__
http://www.rubyrailways.com

Chris G. wrote:

OK, that code all works great, but I have one last question :)

This is allowing me to scrape the values of the class attributes on tags
and other attributes like that. My question is: how would I modify the
code in order to get it to capture a block of text such as:

<p>this is text that i want to scrape</p>

Hmm, this is hard to tell just from this example. If you need ALL the
<p>s, then those can be queried by this XPath:

//p

I am not sure what you are using now, but in Hpricot this would be:

doc = Hpricot(open("http://stuff.com/"))
results = doc/"//p"
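
To get at the text of the paragraphs rather than the element objects,
something like this should work:

results.each { |para| puts para.inner_text }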

If you are still using HTree, query this XPath there for the same
results.

However, I guess you want something more sophisticated than ALL the
<p>s. Well, this is where the trouble begins with screen scraping: you
need to figure out rules which extract *exactly* what you want - it is
usually not that hard to come up with rules that extract a bit more or a
bit less, but much harder to find exactly the right ones...

To solve this problem, you need to tell us what you want - i.e. an
example page, and the set of objects you would like to extract.

Cheers,
Peter

__
http://www.rubyrailways.com

On Nov 19, 2006, at 8:50 PM, Paul L. wrote:

<p>this is text that i want to scrape</p>

This is why it is a bad idea to adopt a package or library to accomplish
something that is easier to accomplish with a few lines of code […]
Then you abandon the library and write normal code.

In Ruby, writing normal code is so easy that the traditional cautions
against adopting miraculous libraries should be amplified tenfold.

I hope you’re not arguing that HTML should be parsed with simple
regular expressions instead of a real parser. I think most would
agree with me when I say that strategy seldom holds up for long.

James Edward G. II

Chris G. wrote:

OK, that code all works great, but I have one last question :)

This is allowing me to scrape the values of the class attributes on tags
and other attributes like that. My question is: how would I modify the
code in order to get it to capture a block of text such as:

<p>this is text that i want to scrape</p>

any ideas?

Really simple:

array = page_content.scan(%r{<p>(.*?)</p>}m).flatten

Returns an array, each cell of which is a paragraph from the original
page.
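
Plugged into the open-uri code from earlier in the thread, it would look
roughly like this (the URL is a placeholder):

require 'open-uri'

page_content = open("http://localhost:8080/").read   # placeholder URL
paragraphs = page_content.scan(%r{<p>(.*?)</p>}m).flatten
paragraphs.each { |para| puts para }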

This is why it is a bad idea to adopt a package or library to accomplish
something that is easier to accomplish with a few lines of code, or even
one line as in this case.

At first the library seems as though it can do anything, with no need to
understand what is actually going on. Pretty quickly you encounter
something the library cannot do, and you have to … understand what is
going on. Then you abandon the library and write normal code.

In Ruby, writing normal code is so easy that the traditional cautions
against adopting miraculous libraries should be amplified tenfold.

James Edward G. II wrote:

On Nov 19, 2006, at 8:50 PM, Paul L. wrote:

Chris G. wrote:

OK that code all works great but i have one last question :slight_smile:

/ …

Returns an array, each cell of which is a paragraph from the
original page.

/ …

In Ruby, writing normal code is so easy that the traditional cautions
against adopting miraculous libraries should be amplified tenfold.

I hope you’re not arguing that HTML should be parsed with simple
regular expressions instead of a real parser. I think most would
agree with me when I say that strategy seldom holds up for long.

That depends on the complexity of the problem to be solved, and the
reliability of the source page’s HTML formatting.

For a page that can pass validation of one kind or another, or that is
XHTML, the simplest kinds of parsers provide terrific results. For
legacy pages and those that can be expected to have “relaxed” syntax,
more robust parsers are required.

But I must say I regularly see requests here for parsers that can be
expected to do anything, but as often as not, IMHO, such a library
represents too much complexity for the majority of routine HTML/XML
parsing tasks involving Web pages and documents that are often
generated, not hand-written.

This thread is an example. Beginning with the generic equivalent of “Is
there a library that can …”, followed almost immediately by “Great! But
how do I make it do this …”, requesting a really trivial extraction step
that can be accomplished in a single line of Ruby.

I find this rather ironic, since Ruby is meant to provide an easy way to
create solutions to everyday problems. One then sees a blizzard of
libraries whose purpose is to shield the user from the complexities of
the language, in such a way that the remedy is often more complex than
the problem it is meant to solve.

In this thread, the OP started out by examining the alternatives among
specialized libraries meant to address the general problem, but
apparently never considered writing code to solve the problem directly.
After choosing a library, the OP realized he didn’t see an obvious way
to solve the original problem - extracting specific content from the
source pages.

As to modern XHTML Web pages that can pass a validator, I know from
direct recent experience that they yield to the simplest parser design,
and can be relied on to produce a tree of organized content, stripped of
tags and XHTML-specific formatting, in a handful of lines of Ruby code.
It is hard to justify bringing out the big guns for a task like this,
when one could instead use a small self-documenting routine such as I
suggested.

In the bad old days of assembly and comparatively heavy, inflexible
languages like C, C++ and the like, it is easy to see why people would
be motivated to create specialized libraries to solve generic problems
just once for all time. In fact, the argument can be made that Ruby is
just such a library of generics, broadly speaking an
extension/amplification of the STL project.

Now we see people writing easy-to-use application libraries, each
composed using the easy-to-use Ruby library, but that are sometimes
harder to sort out, or make practical use of, than a short bit of code
would have been.

Lest my readers think I am going overboard here on a topic dear to my
heart, let me quote the OP once again: “any ideas?”

In other words, after choosing a library and playing with it for a
while, he found himself back in square one, unable to solve the original
problem.

To quote one of my favorite authors (William Burroughs), it seems people
are busy inventing cures for which there are no diseases.

Hola,

In Ruby, writing normal code is so easy that the traditional cautions
against adopting miraculous libraries should be amplified tenfold.

I hope you’re not arguing that HTML should be parsed with simple regular
expressions instead of a real parser. I think most would agree with me
when I say that strategy seldom holds up for long.

I could not agree more with James here. HTML scraping is one of the most
tedious tasks these days. Paul, how far would your scraper get with this
‘HTML’:

<p>This is a para.
<p>This is another...

With Hpricot, this code

require 'rubygems'
require 'hpricot'

doc = Hpricot(open("1.html").read)
results = doc/"//p"

works without any problems.

Of course I absolutely understand your viewpoint, but messed up HTML, as
you have seen, can make a real difference…

Peter

__
http://www.rubyrailways.com

Peter S. wrote:

/ …

Of course I absolutely understand your viewpoint, but messed up HTML, as
you have seen, can make a real difference…

I agree completely (see my other post on this topic), but it appears the
OP was trying to read machine-generated Web content, presumably with
reliable syntax.

I agree completely (see my other post on this topic), but it appears the OP
was trying to read machine-generated Web content, presumably with reliable
syntax.

Then you are right, of course. I guess the problem is in the definition
of the term ‘screen scraping’ (or ‘web extraction’ or ‘web mining’ or
‘html extraction’ - people cannot even agree on its name).

For me, ‘screen scraping’ means the whole complex process: navigating to
the document, parsing it into something meaningful and querying the
objects of the parsed structure. In general, I am assuming that none of
these steps is trivial - maybe because I have been working for a web
extraction company for years now and have seen all kinds of nice tricks
from the other side (a.k.a. the anti-scrape camp).

Of course, if you define screen scraping as the last step only (i.e. you
have a parsed model, e.g. a well-formed page, and you need to query
that), then regular expressions are always the first thing to consider.

Since the OP was referring to a machine-generated page, I think the
latter applies - so yep, as long as he needs all the <p>s only, regular
expressions are probably the easiest way to pull them out.

Peter

__
http://www.rubyrailways.com

On 20/nov/06, at 03:50, Paul L. wrote:

array = page_content.scan(%r{<p>(.*?)</p>}m).flatten

Please note that the P end tag isn’t required in HTML 4.01:
http://www.w3.org/TR/html4/struct/text.html#h-9.3.1
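
So a pattern that also copes with unclosed paragraphs would have to stop
at the next <p> (or the end of the document) instead of requiring </p> -
roughly something like this, as an untested sketch and still no
substitute for a real parser:

array = page_content.scan(%r{<p[^>]*>(.*?)(?=</?p[\s>]|\z)}mi).flatten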