Forum: Ruby Screen scraping via regex vs. htmltools (vs. REXML)

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately no longer have the time to support and maintain the forum. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- and Ruby-related community platforms.
dan (Guest)
on 2005-12-02 18:29
(Received via mailing list)
I've finally reimplemented the screen scraper I mentioned on
<http://groups.google.com/group/comp.lang.ruby/brow...
using regexes and no external libraries.  It is, as Daz suggested, many
times faster than REXML.  My question is whether it would be smarter
(faster? easier to code?) to use htmltools or HTMLTree::Parser
instead.

Any other comments on ways to make the code faster, cleaner, and more
Ruby-like?  Finally, can you please tell me why I can't get strip to
work, if I switch the commenting for lines 15 and 16?  (It doesn't
remove the leading space in the second element of the last 6 lines.)
By contrast, the gsub on line 15 does what I want.
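[Editor's note on the strip question: String#strip returns a new, stripped copy and leaves the receiver untouched, so calling it inside the block has no effect on the array elements; the bang form strip! mutates the string in place. A minimal sketch:]

```ruby
# strip returns a NEW string; the elements of the array are unchanged.
row = [" a ", " b "]
row.each { |e| e.strip }
row  # => [" a ", " b "]

# strip! modifies each string in place (returning nil when nothing changed).
row.each { |e| e.strip! }
row  # => ["a", "b"]
```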

Thanks very much in advance for any advice you can offer on which tools
to use.


# The program parses out all of the rows and then looks
# for the right kinds of cells inside.  It constructs
# 2 two-dimensional arrays of the results.

require 'mechanize'
agent = WWW::Mechanize.new{|a| a.log = Logger.new(STDERR) }
page = agent.get('http://www.dankohn.com/uamileage.html').body

def table_clean(table)
  table.each { |row|
    row.each { |e|
      e.gsub!(/<.*?>|&nbsp;/m, "")  # drop tags and non-breaking spaces
      e.gsub!(/\s+/, " ")           # collapse runs of whitespace
      e.gsub!(/(^\s|\s$)/, "")      # trim leading/trailing space
      #~ e.strip
    }
  }
end

miletable = []
summarytable = []
row = /<tr>(.*?)<\/tr>/m
milecells = /
	<td.*?class="t4">(.*?)<\/td>\s*
	<td.*?class="t4">(.*?)<\/td>\s*
	<td.*?class="t4">(.*?)<\/td>\s*
	<td.*?>(.*?)<\/td>\s*
	<td.*?class="t4">(.*?)<\/td>
	/mx
summarycells = /
	<td.*?class="t3".*?>(.*?)<\/td>\s*
	<td.*?class="t3".*?>(.*?)<\/td>
	/mx
activitycells = /
	<td.*?class="t4".*?>(.*?)<\/td>\s*
	<td.*?colspan=("4"|4).*?>(.*?)<\/td>
	/mx
page.scan(row) { |e|
  rowtext = e.to_s
  rowtext.scan(milecells)     { miletable    << [$1, $2, $3, $4, $5] }
  rowtext.scan(summarycells)  { summarytable << [$1, $2] }
  rowtext.scan(activitycells) { summarytable << [$1, $3] }
}
table_clean(miletable)
table_clean(summarytable)
miletable.each {|e| print e.join(":"),"\n"}
summarytable.each {|e| print e.join(":"),"\n"}

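[Editor's note on making the loop more Ruby-like: String#scan with a block yields each match's capture groups as an array, which avoids reading the $1..$5 globals by hand. A sketch against an invented sample row — the pattern mirrors summarycells above, but the HTML fragment itself is made up:]

```ruby
# Hypothetical sample row standing in for the fetched page.
rowtext = '<td class="t3" align="left">Date</td> <td class="t3">Miles</td>'

summarycells = /
  <td.*?class="t3".*?>(.*?)<\/td>\s*
  <td.*?class="t3".*?>(.*?)<\/td>
  /mx

summarytable = []
# scan with a block yields each match's captures as an array,
# so the per-column globals can be skipped entirely.
rowtext.scan(summarycells) { |cells| summarytable << cells }
summarytable  # => [["Date", "Miles"]]
```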

          - dan
james_b (Guest)
on 2005-12-02 19:14
(Received via mailing list)
Dan Kohn wrote:
> I've finally reimplemented the screen scraper I mentioned on
> 
<http://groups.google.com/group/comp.lang.ruby/brow...
> using regexes and no external libraries.  It is, as Daz suggested, many
> times faster than REXML.  My question is whether it would be smarter
> (faster?, easier to code?) to use htmltools or HTMLTree::Parser
> instead.

The code in your post seems to use Mechanize.
If you are using agent.get to fetch the HTML, then you've already parsed
the HTML using htmltools & REXML.  You can register callback objects
that are invoked when the parsing process encounters matching nodes.
Mechanize does this automatically for certain nodes (form stuff, I
think), but you can use watch_for_set= {} to define a set of nodes to
watch for.

This is what I use to construct the product pages for rubystuff.com from
the multiple CafePress pages that contain the images, prices, and
product description.  I tell Mechanize to watch for img, tr, and td
elements, and it constructs sets of custom objects of just the parts of
the source HTML matching certain criteria.  Then I extract the data,
create RSS feeds, and turn those into a set of aggregated HTML pages.

What I like about this is that the parse process gives me business
objects, with (hopefully) self-explanatory behavior.  For example, I can
ask one of these objects for 'product_id' or 'description'; the object
encapsulates the assorted XPath/regex code needed to get that from the
source HTML node, making the main part of the app easier to maintain.
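[Editor's note: the thread doesn't include James's actual code; the class and field names below are invented for illustration, but a node-wrapping "business object" along the lines he describes might look like this:]

```ruby
# Illustrative only: ProductNode, its fields, and the HTML fragment are
# hypothetical.  The idea is that each accessor encapsulates whatever
# regex/XPath work is needed to pull one field out of the raw node text,
# so the main application only ever sees named methods.
class ProductNode
  def initialize(html)
    @html = html
  end

  def product_id
    @html[/id="prod-(\d+)"/, 1]
  end

  def description
    @html[/<span class="desc">(.*?)<\/span>/m, 1]
  end
end

node = ProductNode.new('<div id="prod-42"><span class="desc">Ruby mug</span></div>')
node.product_id   # => "42"
node.description  # => "Ruby mug"
```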


James Britt

--

http://www.ruby-doc.org       - Ruby Help & Documentation
http://www.artima.com/rubycs/ - Ruby Code & Style: Writers wanted
http://www.rubystuff.com      - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com     - Playing with Better Toys
http://www.30secondrule.com   - Building Better Tools
dan (Guest)
on 2005-12-02 19:59
(Received via mailing list)
Thanks for the response, James.  My next question was actually about
debugging Mechanize
<http://groups.google.com/group/comp.lang.ruby/msg/....
Would you mind emailing me your scraping code, as I've been suffering
from a lack of examples to copy?

Also, are you sure Mechanize parses the whole page with get?  It
doesn't wait for a find?

          - dan
itsme213 (Guest)
on 2005-12-02 20:07
(Received via mailing list)
"James Britt" <james_b@neurogami.com> wrote in message

> This is what I use to construct the product pages for rubystuff.com from

Any chance you could make that code available? Sounds like a useful
example.

Is Mechanize also a good option for writing acceptance tests, compared
to
Watir?

Thanks.
james_b (Guest)
on 2005-12-02 21:09
(Received via mailing list)
Dan Kohn wrote:
> Thanks for the response, James.  My next question was actually about
> debugging Mechanize
> <http://groups.google.com/group/comp.lang.ruby/msg/....
> Would you mind emailing me your scraping code, as I've been suffering
> from a lack of examples to copy?
>
> Also, are you sure Mechanize parses the whole page with get?  It
> doesn't wait for a find?

Don't think so, but I might be wrong.  My code calls agent.get, then
goes right into looping over the collected nodes.

I'll see about putting my code together as an example.

As for debugging Mechanize, I've found it helpful to go to the lib
source and stick in some STDERR.puts calls to inspect request and
response data to be sure things are getting passed around as expected.

After that, unit tests are helpful.



James
james_b (Guest)
on 2005-12-02 21:17
(Received via mailing list)
itsme213 wrote:
>
WATIR exposes the HTML DOM as seen by IE, which is not the raw HTML
source returned from the server (but perhaps someone more up on the
latest WATIR knows otherwise).  Mechanize will get you the source HTML,
albeit sanitized for REXML parsing.

I find WATIR most useful for walking through a series of pages where
automated typing and clicking is essential.  Pretty much every Web app
I've written in the last 9 months uses WATIR (plus my own custom DSL on
top of it) for functional testing.  Major time saver.

I use Mechanize for data snarfing and occasional feed building.



James Britt


