Any other comments on ways to make the code faster, cleaner, and more
Ruby-like? Finally, can you please tell me why I can’t get strip to
work, if I switch the commenting for lines 15 and 16? (It doesn’t
remove the leading space in the second element of the last 6 lines.)
By contrast, the gsub on line 15 does what I want.
Thanks very much in advance for any advice you can offer on which tools
to use.
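Without seeing your lines 15 and 16 I can only guess, but a very common gotcha is that String#strip returns a *new* string rather than modifying the receiver, so calling it inside a loop and discarding the result does nothing. A small illustration (the sample data is made up):

```ruby
cells = ["foo", " bar"]

# strip returns a NEW string; each discards it, so cells is unchanged
cells.each { |c| c.strip }
cells          # still ["foo", " bar"]

# Either keep the return value with map...
stripped = cells.map { |c| c.strip }
stripped       # ["foo", "bar"]

# ...or mutate the strings in place with strip!
cells.each { |c| c.strip! }
cells          # ["foo", "bar"]

# gsub has the same behavior -- it only "works" when you use its result:
" bar".gsub(/^\s+/, "")   # "bar"
```

If your gsub line assigned its result (or used gsub!) and your strip line didn’t, that would explain the difference you’re seeing.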
The program parses out all of the rows and then looks
for the right kinds of cells inside.
The code in your post seems to use Mechanize.
If you are using agent.get to fetch the page, then the HTML has already
been parsed using htmltools & REXML. You can register callback objects
that are invoked when the parsing process encounters matching nodes.
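That callback style can be sketched with REXML’s stream parser, which ships with Ruby (the CellCollector class and the sample markup here are invented for illustration):

```ruby
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

# Hypothetical listener: collects the text of every td element it sees.
class CellCollector
  include REXML::StreamListener
  attr_reader :cells

  def initialize
    @cells = []
    @in_td = false
  end

  def tag_start(name, attrs)
    @in_td = true if name == 'td'
  end

  def text(data)
    @cells << data.strip if @in_td
  end

  def tag_end(name)
    @in_td = false if name == 'td'
  end
end

listener = CellCollector.new
html = "<table><tr><td> foo </td><td>bar</td></tr></table>"
REXML::Parsers::StreamParser.new(html, listener).parse
listener.cells   # ["foo", "bar"]
```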
Mechanize does this automatically for certain nodes (form stuff, I
think), but you can assign watch_for_set = {} to define the set of nodes
to watch for.
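From memory of the old WWW::Mechanize API, watch_for_set takes a hash mapping tag names to the class that should wrap each matching node; the exact shape may differ in your version, so treat this as an untested sketch with an invented wrapper class and URL:

```ruby
require 'mechanize'   # on older installs this was require 'www/mechanize'

agent = WWW::Mechanize.new

# Hypothetical wrapper; Mechanize instantiates one per matching node.
class Cell
  attr_reader :node
  def initialize(node)
    @node = node
  end
end

# Watch img, tr, and td elements (hash shape assumed from the old API).
agent.watch_for_set = { 'img' => Cell, 'tr' => Cell, 'td' => Cell }

page = agent.get('http://www.example.com/products.html')  # hypothetical URL
# The collected Cell objects should now be available on the page,
# keyed by tag name, ready for looping.
```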
This is what I use to construct the product pages for rubystuff.com from
the multiple CafePress pages that contain the images, prices, and
product description. I tell Mechanize to watch for img, tr, and td
elements, and it constructs sets of custom objects of just the parts of
the source HTML matching certain criteria. Then I extract the data,
create RSS feeds, and turn those into a set of aggregated HTML pages.
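The scrape-then-republish step above can be sketched with Ruby’s bundled rss library; the feed titles and URLs here are invented:

```ruby
require 'rss'

# Build a minimal RSS 2.0 feed from scraped data (all values invented).
feed = RSS::Maker.make("2.0") do |maker|
  maker.channel.title       = "Products"
  maker.channel.link        = "http://www.example.com/"
  maker.channel.description = "Aggregated product pages"

  maker.items.new_item do |item|
    item.title = "Ruby Mug"
    item.link  = "http://www.example.com/mug"
  end
end

feed.items.size   # 1
puts feed.to_s    # the XML, ready to write to disk or feed to a template
```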
What I like about this is that the parse process gives me business
objects, with (hopefully) self-explanatory behavior. For example, I can
ask one of these objects for ‘product_id’ or ‘description’; the object
encapsulates the assorted XPath/regex code needed to get that from the
source HTML node, making the main part of the app easier to maintain.
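A business object of that sort might look like the following; the Product class, attribute names, and sample markup are all hypothetical, but they show the encapsulation idea:

```ruby
require 'rexml/document'

# Hypothetical wrapper: hides the XPath/strip code needed to pull
# fields out of one product's HTML fragment.
class Product
  def initialize(node)
    @node = node
  end

  def product_id
    REXML::XPath.first(@node, ".//td[@class='id']").text.strip
  end

  def description
    REXML::XPath.first(@node, ".//td[@class='desc']").text.strip
  end
end

row = REXML::Document.new(
  "<tr><td class='id'> 42 </td><td class='desc'>A mug</td></tr>"
).root
product = Product.new(row)
product.product_id    # "42"
product.description   # "A mug"
```

The main loop of the app then only ever talks to Product, so a change in the source markup means editing one class, not every caller.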
WATIR exposes the HTML DOM as seen by IE, which is not the raw HTML
source returned from the server (but perhaps someone more up on the
latest WATIR knows otherwise). Mechanize will get you the source HTML,
albeit sanitized for REXML parsing.
I find WATIR most useful for walking through a series of pages where
automated typing and clicking is essential. Pretty much every Web app
I’ve written in the last 9 months uses WATIR (plus my own custom DSL on
top of it) for functional testing. Major time saver.
I use Mechanize for data snarfing and occasional feed building.
Also, are you sure Mechanize parses the whole page with get? It
doesn’t wait for a find?
Don’t think so, but I might be wrong. My code calls agent.get, then
goes right into looping over the collected nodes.
I’ll see about putting my code together as an example.
As for debugging Mechanize, I’ve found it helpful to go to the lib
source and stick in some STDERR.puts calls to inspect request and
response data to be sure things are getting passed around as expected.
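The same STDERR tap works in your own code, too. A minimal illustration (the method and data are invented): print what a step receives and returns without disturbing its return value.

```ruby
# Throwaway debug tap around a parsing step.
def parse_row(row)
  STDERR.puts "parse_row <- #{row.inspect}"
  cells = row.split("|").map(&:strip)
  STDERR.puts "parse_row -> #{cells.inspect}"
  cells
end

parse_row("SKU-1 | Ruby Mug")   # returns ["SKU-1", "Ruby Mug"], logging both lines
```

Because the output goes to STDERR, it stays out of anything you write to STDOUT and is easy to grep for and delete later.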