Screen scraping via regex vs. htmltools (vs. REXML)

I’ve finally reimplemented the screen scraper I mentioned on
http://groups.google.com/group/comp.lang.ruby/browse_thread/thread/76e8bbd4a9e48277/396cb7ea35eab14f#396cb7ea35eab14f
using regexes and no external libraries. It is, as Daz suggested, many
times faster than REXML. My question is whether it would be smarter
(faster?, easier to code?) to use htmltools or HTMLTree::Parser
instead.

Any other comments on ways to make the code faster, cleaner, and more
Ruby-like? Finally, can you please tell me why I can’t get strip to
work if I comment out the third gsub! in table_clean and uncomment the
e.strip line instead? (It doesn’t remove the leading space in the
second element of the last six lines.) By contrast, that gsub! does
what I want.
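A minimal illustration of the two calls in question (plain Ruby, no external libraries; the sample string is fabricated):

```ruby
# strip returns a NEW string and leaves the receiver untouched;
# gsub! (and strip!) modify the string in place.
e = " hello "
stripped = e.strip            # non-destructive: returns a copy
# stripped == "hello", but e is still " hello "

e.gsub!(/(^\s|\s$)/, "")      # destructive: edits e itself
# e == "hello"
```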

Thanks very much in advance for any advice you can offer on which tools
to use.

The program parses out all of the rows and then looks for the right
kinds of cells inside. It constructs two two-dimensional arrays of the
results.

require 'mechanize'
agent = WWW::Mechanize.new { |a| a.log = Logger.new(STDERR) }
page = agent.get('http://www.dankohn.com/uamileage.html').body

def table_clean(table)
  table.each { |row|
    row.each { |e|
      e.gsub!(/<.*?>|&nbsp;/m, "")
      e.gsub!(/\s+/, " ")
      e.gsub!(/(^\s|\s$)/, "")
      #~ e.strip
    }
  }
end

miletable = []
summarytable = []
row = /<tr.*?>(.*?)<\/tr>/m
milecells = /
  <td.*?class="t4">(.*?)<\/td>\s*
  <td.*?class="t4">(.*?)<\/td>\s*
  <td.*?class="t4">(.*?)<\/td>\s*
  <td.*?>(.*?)<\/td>\s*
  <td.*?class="t4">(.*?)<\/td>
/mx
summarycells = /
  <td.*?class="t3".*?>(.*?)<\/td>\s*
  <td.*?class="t3".*?>(.*?)<\/td>
/mx
activitycells = /
  <td.*?class="t4".*?>(.*?)<\/td>\s*
  <td.*?colspan=("4"|4).*?>(.*?)<\/td>
/mx
page.scan(row) { |e|
  rowtext = e.to_s
  rowtext.scan(milecells) {
    miletable << [$1, $2, $3, $4, $5]
  }
  rowtext.scan(summarycells) {
    summarytable << [$1, $2]
  }
  rowtext.scan(activitycells) {
    summarytable << [$1, $3]
  }
}
table_clean(miletable)
table_clean(summarytable)
miletable.each { |e| print e.join(":"), "\n" }
summarytable.each { |e| print e.join(":"), "\n" }
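For reference, the cleanup pass can be exercised in isolation on a fabricated cell, with no network access needed:

```ruby
# Same three substitutions as table_clean, applied to one sample cell:
# drop tags and &nbsp; entities, collapse whitespace runs, trim the ends.
e = "<td class=\"t4\">\n  12,500 miles\n</td>"
e.gsub!(/<.*?>|&nbsp;/m, "")
e.gsub!(/\s+/, " ")
e.gsub!(/(^\s|\s$)/, "")
# e is now "12,500 miles"
```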

      - dan

Dan K. wrote:

I’ve finally reimplemented the screen scraper I mentioned on
http://groups.google.com/group/comp.lang.ruby/browse_thread/thread/76e8bbd4a9e48277/396cb7ea35eab14f#396cb7ea35eab14f
using regexes and no external libraries. It is, as Daz suggested, many
times faster than REXML. My question is whether it would be smarter
(faster?, easier to code?) to use htmltools or HTMLTree::Parser
instead.

The code in your post seems to use Mechanize.
If you are using agent.get to fetch the HTML, then you’ve already
parsed it using htmltools and REXML. You can register callback objects
that are invoked when the parsing process encounters matching nodes.
Mechanize does this automatically for certain nodes (form stuff, I
think), but you can use watch_for_set= to define a set of nodes to
watch for.
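The registration idea can be sketched in plain Ruby — the names here are hypothetical and this is not Mechanize’s actual API; it only illustrates the tag-name-to-callback-class pattern:

```ruby
# A toy dispatcher: map element names to wrapper classes, and wrap each
# "parsed" node in the registered class, the way a watch_for_set-style
# hook hands you custom objects instead of raw nodes.
class CellWatcher                       # hypothetical callback object
  attr_reader :text
  def initialize(node)
    @text = node[:text]
  end
end

watch_for = { "td" => CellWatcher }     # element name => wrapper class

nodes = [ { :name => "td", :text => "1234" },   # fabricated parse events
          { :name => "br", :text => nil } ]

collected = []
nodes.each do |node|
  klass = watch_for[node[:name]]
  collected << klass.new(node) if klass # only watched elements are kept
end
# collected holds one CellWatcher wrapping the td's text
```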

This is what I use to construct the product pages for rubystuff.com from
the multiple CafePress pages that contain the images, prices, and
product description. I tell Mechanize to watch for img, tr, and td
elements, and it constructs sets of custom objects of just the parts of
the source HTML matching certain criteria. Then I extract the data,
create RSS feeds, and turn those into a set of aggregated HTML pages.

What I like about this is that the parse process gives me business
objects, with (hopefully) self-explanatory behavior. For example, I can
ask one of these objects for ‘product_id’ or ‘description’; the object
encapsulates the assorted XPath/regex code needed to get that from the
source HTML node, making the main part of the app easier to maintain.
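A stripped-down version of that idea, with hypothetical names (the regexes stand in for whatever XPath/regex code the real objects encapsulate):

```ruby
# A product "business object" wrapping one chunk of source HTML.
# Callers ask for product_id or description; the extraction details
# stay inside the class, out of the main app.
class Product
  def initialize(html)
    @html = html
  end

  def product_id
    @html[/pid=(\d+)/, 1]                            # capture group 1
  end

  def description
    @html[/<span class="desc">(.*?)<\/span>/m, 1]
  end
end

prod = Product.new('<a href="buy?pid=42"><span class="desc">Ruby mug</span></a>')
# prod.product_id  => "42"
# prod.description => "Ruby mug"
```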

James B.

http://www.ruby-doc.org - Ruby Help & Documentation
Ruby Code & Style - Writers wanted
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys
http://www.30secondrule.com - Building Better Tools

Thanks for the response, James. My next question was actually about
debugging Mechanize
http://groups.google.com/group/comp.lang.ruby/msg/04fc7473b08c16fc.
Would you mind emailing me your scraping code, as I’ve been suffering
from a lack of examples to copy?

Also, are you sure Mechanize parses the whole page with get? It
doesn’t wait for a find?

      - dan

“James B.” [email protected] wrote in message

This is what I use to construct the product pages for rubystuff.com from

Any chance you could make that code available? Sounds like a useful
example.

Is Mechanize also a good option for writing acceptance tests, compared
to Watir?

Thanks.

itsme213 wrote:

WATIR exposes the HTML DOM as seen by IE, which is not the raw HTML
source returned from the server (but perhaps someone more up on the
latest WATIR knows otherwise). Mechanize will get you the source HTML,
albeit sanitized for REXML parsing.

I find WATIR most useful for walking through a series of pages where
automated typing and clicking is essential. Pretty much every Web app
I’ve written in the last 9 months uses WATIR (plus my own custom DSL on
top of it) for functional testing. Major time saver.

I use Mechanize for data snarfing and occasional feed building.

James B.


Dan K. wrote:

Thanks for the response, James. My next question was actually about
debugging Mechanize
http://groups.google.com/group/comp.lang.ruby/msg/04fc7473b08c16fc.
Would you mind emailing me your scraping code, as I’ve been suffering
from a lack of examples to copy?

Also, are you sure Mechanize parses the whole page with get? It
doesn’t wait for a find?

Don’t think so, but I might be wrong. My code calls agent.get, then
goes right into looping over the collected nodes.

I’ll see about putting my code together as an example.

As for debugging Mechanize, I’ve found it helpful to go to the lib
source and stick in some STDERR.puts calls to inspect request and
response data to be sure things are getting passed around as expected.

After that, unit tests are helpful.

James
