Ruby screen scraping

Gabriele M. wrote:

On 20/nov/06, at 03:50, Paul L. wrote:

array = page_content.scan(%r{<p>(.*?)</p>}m).flatten

Please note that the P end tag isn’t required in HTML 4.01 (see the spec
section “Paragraphs, Lines, and Phrases”).
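Because the close tag is optional in HTML 4.01, a slightly more forgiving
pattern can stop at either an explicit </p> or the next opening <p>. A
hedged sketch, not from the thread (page_content is assumed to hold the
fetched HTML):

paragraphs = page_content.scan(%r{<p[^>]*>(.*?)(?=</p>|<p[\s>]|\z)}mi)
paragraphs = paragraphs.flatten.map(&:strip)

The lookahead ends each capture at the close tag when present, or at the
next paragraph (or end of input) otherwise.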

Yes, I’ve just been converting all my site pages to XHTML, so I encountered
this difference big-time. My solution made some assumptions, one being the
OP’s request –

My question is, how would I modify the code in order to get it to capture,
say, a block of text such as:


this is text that i want to scrape

any ideas?

– was based on his knowledge that the pages in fact contained paragraphs
enclosed by <p> and </p>.

The other assumption I made was based on context – it seems the pages in
question are machine-generated, so presumably can be relied on to have
consistent syntax.

On Nov 20, 2006, at 2:45 AM, Paul L. wrote:

In this thread, the OP started out by examining the alternatives among
specialized libraries meant to address the general problem, but apparently
never considered writing code to solve the problem directly.

Starting out by looking for a library that does the hard work for you
is a good first step, I would say. Do we really want to be
discouraging that?

As to modern XHTML Web pages that can pass a validator, I know from direct
recent experience that they yield to the simplest parser design, and can be
relied on to produce a tree of organized content, stripped of tags and
XHTML-specific formatting, in a handful of lines of Ruby code.
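A minimal sketch of the kind of thing being claimed here, under the
assumption that the page really is well-formed XHTML (Paul’s own parser is
not shown in the thread; REXML is used below only because valid XHTML is
legal XML, and the file name is made up):

require 'rexml/document'

doc = REXML::Document.new(File.read('page.xhtml'))

# Walk the document in order and keep each element's text, dropping the
# markup. This yields a flat, document-ordered list rather than a full
# tree, which is often enough for scraping.
content = []
doc.root.each_recursive do |element|
  text = element.texts.map(&:value).join.strip
  content << text unless text.empty?
end

puts content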

I’ve seen valid XHTML that wouldn’t be much fun to parse. You still
need to worry about whitespace, namespaces, the kind of quoting used,
CDATA sections, …

James Edward G. II
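On the namespace point above: even a perfectly valid XHTML page declares a
default namespace, and a namespace-aware XPath query generally needs that
namespace mapped to a prefix before unadorned paths like //p will match. A
hedged sketch (the markup is made up):

require 'rexml/document'
include REXML

xhtml = <<XML
<html xmlns="http://www.w3.org/1999/xhtml">
  <body><p>hello</p></body>
</html>
XML

doc = Document.new(xhtml)

# Without a mapping the unprefixed query will usually come back empty,
# because //p means "p in no namespace".
XPath.first(doc, '//p')

# Mapping the XHTML namespace to a prefix makes the query reliable.
XPath.first(doc, '//x:p', 'x' => 'http://www.w3.org/1999/xhtml').text  #=> "hello"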

On Nov 20, 2006, at 11:35 AM, Paul L. wrote:

That was the design goal of XHTML – to be easy to parse, to be consistent –
assuming the syntax is followed.

But if you use an already developed parser, you gain all their work
on edge cases, all their testing efforts, all their optimization
work, etc.

I see what you are saying about knowing you can count on the data, but
your messages are filled with a lot of “as long as you are sure”
conditions. Dropping a bunch of those conditions is just one more
advantage to using a library.

You say you are always surprised when people build up all this hefty
library code when a simple regex will do, but I’m always shocked when
I can replace hundreds of lines of code by loading and making use of
a library. If we have to err on one side of that, I would prefer it
be on the library using side.

That said, I guess we’ll just have to agree to disagree. Thanks for
the intelligent and civil debate.

James Edward G. II

James Edward G. II wrote:

On Nov 20, 2006, at 2:45 AM, Paul L. wrote:

In this thread, the OP started out by examining the alternatives among
specialized libraries meant to address the general problem, but apparently
never considered writing code to solve the problem directly.

Starting out by looking for a library that does the hard work for you
is a good first step, I would say. Do we really want to be
discouraging that?

IMHO yes, when it doesn’t solve the problem at hand. This is obviously a
matter of personal taste, but I always try coding a solution first, or at
least endeavor to understand what such a solution would entail, before
going shopping for a library. It’s based on KISS, and large libraries that
can be relied on to solve any problem except the problem the adopter faces,
fail the KISS principle.

/ …

I’ve seen valid XHTML that wouldn’t be much fun to parse. You still
need to worry about whitespace, namespaces, the kind of quoting used,
CDATA sections, …

These are all relatively easy to parse. Even the CDATA sections are clearly
and consistently delimited, so can be reliably skipped over and
encapsulated. That was the design goal of XHTML – to be easy to parse, to
be consistent – assuming the syntax is followed.
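A quick illustration of what “clearly delimited” buys you (a hedged sketch;
xhtml is assumed to hold the page source): a CDATA section always runs from
<![CDATA[ to the first ]]>, and the terminator cannot legally appear inside
it, so a single non-greedy pattern can lift the sections out or strip them.

cdata_blocks  = xhtml.scan(/<!\[CDATA\[(.*?)\]\]>/m).flatten
without_cdata = xhtml.gsub(/<!\[CDATA\[.*?\]\]>/m, '')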

I just converted my 500-page Web site to XHTML, and at the end of the
project I found that I could parse any page on the site using a very simple
parser. This was about the time the pages also began passing XHTML
validation tests.

But this is all by the way. The point is the OP adopted a powerful library,
only to discover he still couldn’t solve his original problem, and if we
assume (as I did) that the pages are machine-generated and meet reasonable
syntax standards, the one-line solution I posted will meet his
requirements.

Your earlier point, that a Web page picked at random might be virtually
unparseable, is certainly true, and solutions like we are discussing assume
a high degree of cooperation between the page generator and the parser.

Chris G. wrote:

Turns out I actually ended up abandoning HTree and the rest. I used
net/http in order to fetch the page and then took the table of the page
that I was interested in examining and converted that using rexml. I
have now been able to grab the values that I wanted using XPath :)
If you are keen on XPaths, why not:

table = XPath.first(doc, "//table[@class='index' and @width='100%']")

then use ‘table’ instead of ‘converted_data’…

or even

module_name = XPath.first(doc, "//table[@class='index' and @width='100%']//td[@class='data']/a")

etc.

(Untested since I don’t have your doc, but it should more or less work)

Cheers,
Peter

__
http://www.rubyrailways.com

James Edward G. II wrote:

and consistently delimited, so can be reliably skipped over and
encapsulated. That was the design goal of XHTML – to be easy to parse, to
be consistent – assuming the syntax is followed.

But if you use an already developed parser, you gain all their work
on edge cases, all their testing efforts, all their optimization
work, etc.

Yes, all to the good, if the feature set is needed and if the target
environment can support the library. And if the library actually solves the
original problem.

I see what you are saying about knowing you can count on the data, but
your messages are filled with a lot of “as long as you are sure”
conditions. Dropping a bunch of those conditions is just one more
advantage to using a library.

Yes, unless the library serves no purpose and occupies memory and machine
cycles better spent elsewhere. Without a library, you have to work out the
problem directly. With a library, you have to work out the problems caused
by the library.

My personal favorite for this dichotomy is REXML, which apparently can do
anything, unless you have something specific in mind, then IMHO you are
better off writing your own code to parse XML data sets. It isn’t as though
XML is a dark and mysterious world that is beyond the reasoning powers of
mere mortals. If it were, the designers of the scheme were wasting their
time.

In the beginning, we had all sorts of weak and limited dataset protocols.
These weaknesses are well addressed by XML, but some think XML is too
complicated to manipulate directly. So libraries like REXML get created.
But the libraries often turn out to be so complex and difficult to put into
service that in some cases one is better off writing one’s own
generator/parser for the simpler applications of XML.
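A hedged illustration of the “simpler applications” case: a flat,
machine-generated record format where each field occurs once per record and
carries no attributes or entities. The tag names and sample data are made
up, and this only holds for markup that stays that simple.

record = '<book><title>Pickaxe</title><year>2004</year></book>'

# Pull the text between a pair of matching tags, or nil when absent.
def field(xml, tag)
  xml[%r{<#{tag}>(.*?)</#{tag}>}m, 1]
end

puts field(record, 'title')   #=> Pickaxe
puts field(record, 'year')    #=> 2004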

The complexity referenced above seems to arise from an irresistible
tendency to put every feature into a library, with the side effect that
important and trivial/esoteric features often get mixed up together in the
documentation and the interface, and the library ends up too large to
justify for simple processing tasks.

Maybe now someone will write a library to bring REXML under control. Ad
infinitum.

You say you are always surprised when people build up all this hefty
library code when a simple regex will do,

No, not always, those are not my words. But in a case like this, where the
library accomplishes everything except what the OP actually wanted, yes.
Please note that I only made this plain-code argument after the OP
explained that he had put the library in place, had run it through its
paces, only to discover that he still couldn’t solve the original problem.

but I’m always shocked when
I can replace hundreds of lines of code by loading and making use of
a library. If we have to err on one side of that, I would prefer it
be on the library using side.

For myself, I prefer to know what is going on. As I said, it’s just a
personal preference.

That said, I guess we’ll just have to agree to disagree. Thanks for
the intelligent and civil debate.

You’re welcome (if I read you correctly). Such an exchange is always
possible, some might say likely, between two people who both want it that
way.

An aside with some small relevance. It’s just possible that the Linux
kernel maintainers’ tendency to adopt existing libraries over laboriously
writing fresh code will spawn a huge legal battle with Microsoft, who
clearly intend to argue (and who are now arguing) that it is their
intellectual property embedded in Linux, and therefore all those Linux
users are actually Microsoft customers.

I can see how this post may be interpreted, so I want to say I hope no one
is misled. If I were really intent on avoiding libraries, I would write
everything in assembly. My disdain for libraries is fully constrained by
reality and pragmatism, and there are plenty of libraries that I use with
something that approaches reckless abandon.

But … when a specialized library can’t solve a problem that is soluble with
one line of tautological Ruby code, I’m more than willing to speak up.

Turns out I actually ended up abandoning HTree and the rest. I used
net/http in order to fetch the page and then took the table of the page
that I was interested in examining and converted that using rexml. I
have now been able to grab the values that I wanted using XPath :)

require 'net/http'
require 'uri'
require 'rexml/document'
include REXML

def fetch(uri_str, limit = 10)
  fail 'http redirect too deep' if limit.zero?
  puts "Trying: #{uri_str}"
  response = Net::HTTP.get_response(URI.parse(uri_str))
  case response
  when Net::HTTPSuccess
    response
  when Net::HTTPRedirection
    fetch(response['location'], limit - 1)
  else
    response.error!
  end
end

response = fetch('http://10.37.150.55:8080')

scraped_data = response.body

# find the start and end of the table markup (the tag strings are assumed)
table_start_pos = scraped_data.index('<table')
#puts table_start_pos

table_end_pos = scraped_data.index('</table>') + 9
#puts table_end_pos

height = table_end_pos - table_start_pos

gathered_data = response.body[table_start_pos,height]

converted_data = REXML::Document.new gathered_data
#puts converted_data

module_name = XPath.first(converted_data, "//td[@class='data']/a")
puts module_name

build_status = XPath.first(converted_data, "//td[2]/em")
puts build_status.text

last_failure = XPath.first(converted_data, "//tbody/tr/td[3]")
puts last_failure.text

last_success = XPath.first(converted_data, "//tbody/tr/td[4]")
puts last_success.text

build_number = XPath.first(converted_data, "//tbody/tr/td[5]")
puts build_number.text

On 11/19/06, Chris G. [email protected] wrote:

Chris,

There are many ways to accomplish this as others have pointed out. When I
approached a similar task three years ago, I was working in the Java world
and would have loved to have some of the tools available for Ruby today.
However, I believe that the technique I used has merit in some situations
today.

I was screen scraping realtor sites for data (to find the perfect house),
because I was dissatisfied with the searching and data mining capabilities
of the sites. I was mining multiple sites, so the technique had to be
flexible but also resilient because I did not control the source sites (and
they would often change their layout). My first attempt used XPaths to try
and get to the data, however that was futile since developers would often
change the site’s layout and even small changes would break the logic
(e.g. changing nesting of tables, or adding styling around data).

After taking a step back and considering the situation from a fresh
perspective, I scrapped the idea of using xml style data location in
something that seemed too fluid, too fragile.

My second approach was much more resilient: I used simple regular
expressions to zoom in and find the data. After studying the source html I
was able to discover a way to easily get to any data for the sites I was
working on.

The basic approach was this:

  1. I would use a regular expression to search into the html for something
     to get me close to the data, something that seemed to be consistent
     and unlikely to change (a reference point).

  2. I would then extract a reasonable number of characters before and/or
     after the reference point, based on where the data is located. It is
     not necessary to know exactly; just gather a conservative amount
     beyond what you think you need.

  3. Repeat from step 1 if needed, or use a regular expression to extract
     the data desired from this subsection of data extracted in step 2 (a
     short Ruby sketch of these steps follows below).
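A hedged sketch of those three steps in Ruby (Jeff’s original code was Java
and is not shown; the anchor pattern, window size, final pattern, and the
listing_html variable below are made-up examples):

def scrape_near(html, anchor_regex, window = 500)
  # Step 1: find a stable reference point close to the data.
  anchor = html.index(anchor_regex) or return nil
  # Step 2: grab a conservative chunk of text following the reference point.
  chunk = html[anchor, window]
  # Step 3: pull the value out of the much smaller chunk.
  chunk[/\$[\d,]+/]              # e.g. the first price-looking token
end

price = scrape_near(listing_html, /Asking\s+Price/i)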

I wrapped these basic ideas into a few simple methods to make it easy, and
it turned out to be a very successful approach. I found it easy to add new
sites, and it turned out to be a very robust technique that was very
forgiving of changing fluff html. It was pretty easy to find a reference
point in the html that was consistent, and once there the data was close
by, so I’d extract a healthy chunk and then it was pretty easy to search in
this smaller amount of data. Use logging for each step to help you while
you are fine tuning the approach. But once I switched over to this
approach, I never had to revisit the code once I set it up for a site, it
just worked. Low tech, simple, but surprisingly effective.

Of course, after many months of daily operation mining the realtor sites, I
eventually found the perfect house and abandoned the code; it had served
its purpose well. So I don’t have anything concrete to offer you (and it
was in Java), but if any of the other methods mentioned by the others don’t
quite meet your needs or end up being too fragile, you might consider a
variation on this approach for your own data extraction. It is especially
flexible for scraping sites which tend to vary over time. In your case it
sounds like you have control over the source, so many methods would work
for you; however, don’t forget that there may be some variation over time
if you ever upgrade (CruiseControl).

Hope it helps you or others that are pursuing this task!

Blessings,

Jeff B.
MasterView project developer, http://masterview.org/
Inspired Horizons Training and Consultancy http://inspiredhorizons.com/