James Edward G. II wrote:
On Nov 19, 2006, at 8:50 PM, Paul L. wrote:
Chris G. wrote:
OK that code all works great but i have one last question
Returns an array, each cell of which is a paragraph from the
In Ruby, writing normal code is so easy that the traditional cautions
against adopting miraculous libraries should be amplified tenfold.
I hope you’re not arguing that HTML should be parsed with simple
regular expressions instead of a real parser. I think most would
agree with me when I say that strategy seldom holds up for long.
That depends on the complexity of the problem to be solved, and the
reliability of the source page’s HTML formatting.
For a page that can pass validation of one kind or another or that is
the simplest kinds of parsers provide terrific results. For legacy pages
and those that can be expected to have “relaxed” syntax, more robust
parsers are required.
But I must say I regularly see requests here for parsers that can be
expected to do anything, but often as not and IMHO, such a library
represents too much complexity for the majority of routine HTML/XML
tasks with Web pages and documents that are often generated, not
hand-written.
This thread is an example. Beginning with the generic equivalent of “Is
there a library that can …” followed almost immediately by “Great! But
how do I make it do this …”, requesting a really trivial extraction
that can be accomplished in a single line of Ruby.
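
To make the "single line of Ruby" claim concrete, here is a minimal sketch of that kind of extraction. It is not the code posted earlier in this thread; the sample page and the method name `paragraphs` are made up for illustration, and it assumes well-formed markup with no nested <p> tags.

```ruby
# Hypothetical sample page; the markup is illustrative only.
page = "<html><body><p>First paragraph.</p>\n<p>Second\nparagraph.</p></body></html>"

# Returns an array, each element of which is the text of one <p> element.
# The /m flag lets "." match newlines, so multi-line paragraphs are captured.
# "\b" after "p" keeps the pattern from matching tags such as <pre>.
def paragraphs(html)
  html.scan(%r{<p\b[^>]*>(.*?)</p>}m).flatten
end

paragraphs(page).each { |para| puts para }
```

The body of the method is the one line in question. On legacy pages with unclosed <p> tags this would likely miss content, which is where a more tolerant parser earns its keep.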
I find this rather ironic, since Ruby is meant to provide an easy way to
create solutions to everyday problems. One then sees a blizzard of
libraries whose purpose is to shield the user from the complexities of
the language, in a way that the remedy is often more complex than the
problem it is meant to solve.
In this thread, the OP started out by examining the alternatives among
specialized libraries meant to address the general problem, but
never considered writing code to solve the problem directly. After
choosing a library, the OP realized he didn’t see an obvious way to solve the
original problem – extracting specific content from the source pages.
As to modern XHTML Web pages that can pass a validator, I know from
recent experience that they yield to the simplest parser design, and can
be relied on to produce a tree of organized content, stripped of tags and
XHTML-specific formatting, in a handful of lines of Ruby code. It is
hard to justify bringing out the big guns for a task like this, when one
could instead use a small self-documenting routine such as I suggested.
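
As a rough illustration of the "handful of lines" idea (again, a sketch and not the routine suggested earlier in the thread), the following builds a nested-array tree from well-formed XHTML. The method name `parse_tree`, the "root" label, and the sample markup are assumptions made for the example. It assumes every tag is properly closed and that the input contains no comments, CDATA sections, or ">" characters inside attribute values.

```ruby
# Minimal sketch: build a nested-array tree from well-formed XHTML.
# Each node is [tag_name, child, child, ...]; text children are plain strings.
def parse_tree(xhtml)
  root = ["root"]
  stack = [root]
  # Split the input into tags and runs of text.
  xhtml.scan(/<[^>]+>|[^<]+/) do |token|
    case token
    when %r{\A</}                 # closing tag: step back up to the parent
      stack.pop if stack.size > 1
    when %r{\A<[!?]}              # doctype or processing instruction: skip
      next
    when %r{\A<(\w+)[^>]*/>\z}    # self-closing tag: a leaf node
      stack.last << [$1]
    when %r{\A<(\w+)}             # opening tag: start a new child node
      node = [$1]
      stack.last << node
      stack.push(node)
    else                          # text: keep it unless it is only whitespace
      text = token.strip
      stack.last << text unless text.empty?
    end
  end
  root
end

# Hypothetical input, for illustration only.
xhtml = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.<br /></p></body></html>"
tree = parse_tree(xhtml)
p tree
```

This version keeps tag names as node labels so the structure stays visible; dropping them to leave only the text is a small change. It makes no attempt to recover from "relaxed" syntax, which is exactly the case where a more robust parser is warranted.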
In the bad old days of assembly and comparatively heavy, inflexible
languages like C, C++ and the like, it is easy to see why people would
be motivated to create specialized libraries to solve generic problems just
once for all time. In fact, the argument can be made that Ruby is just
a library of generics, broadly speaking an extension/amplification of
Now we see people writing easy-to-use application libraries, each
using the easy-to-use Ruby library, but that are sometimes harder to
figure out, or make practical use of, than a short bit of code would have been.
Lest my readers think I am going overboard here on a topic dear to my
heart, let me quote the OP once again:
In other words, after choosing a library and playing with it for a
while, he found himself back at square one, unable to solve the original problem.
To quote one of my favorite authors (William Burroughs), it seems people
are busy inventing cures for which there are no diseases.