James Edward G. II wrote:
On Nov 19, 2006, at 8:50 PM, Paul L. wrote:
Chris G. wrote:
OK that code all works great but i have one last question 
/ …
Returns an array, each cell of which is a paragraph from the
original page.
/ …
In Ruby, writing normal code is so easy that the traditional cautions
against adopting miraculous libraries should be amplified tenfold.
I hope you’re not arguing that HTML should be parsed with simple
regular expressions instead of a real parser. I think most would
agree with me when I say that strategy seldom holds up for long.
That depends on the complexity of the problem to be solved, and the
reliability of the source page’s HTML formatting.
For a page that can pass validation of one kind or another, or that is
XHTML, the simplest kinds of parsers provide terrific results. For
legacy pages and those that can be expected to have “relaxed” syntax,
more robust parsers are required.
But I must say I regularly see requests here for parsers that can be
expected to do anything. As often as not, IMHO, such a library
represents too much complexity for the majority of routine HTML/XML
parsing tasks, which involve Web pages and documents that are often
generated, not hand-written.
This thread is an example. It began with the generic equivalent of “Is
there a library that can …”, followed almost immediately by “Great! But
how do I make it do this …”, a request for a really trivial extraction
step that can be accomplished in a single line of Ruby.
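To make the "single line" claim concrete, here is a minimal sketch of such an extraction. It assumes the page is well-formed XHTML with no nested paragraphs; the sample markup and the variable names are hypothetical, not taken from the OP's pages.

```ruby
# Hypothetical sample input; any well-formed XHTML string would do.
page = <<~XHTML
  <html><body>
  <p>First paragraph.</p>
  <p class="note">Second <em>paragraph</em>.</p>
  </body></html>
XHTML

# The one-liner: capture the body of each <p>...</p>, then strip inline tags.
paragraphs = page.scan(%r{<p\b[^>]*>(.*?)</p>}m).flatten.map { |s| s.gsub(/<[^>]+>/, "").strip }

puts paragraphs.inspect
```

On sloppier markup (unclosed <p> tags, a stray ">" inside an attribute) this will miss or mangle content, which is where a more robust parser earns its keep.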
I find this rather ironic, since Ruby is meant to provide an easy way to
create solutions to everyday problems. One then sees a blizzard of
libraries whose purpose is to shield the user from the complexities of
the language, yet the remedy is often more complex than the problem it
is meant to solve.
In this thread, the OP started out by examining the alternatives among
specialized libraries meant to address the general problem, but
apparently never considered writing code to solve the problem directly.
After choosing a library, the OP realized he didn’t see an obvious way
to solve the original problem – extracting specific content from the
source pages.
As to modern XHTML Web pages that can pass a validator, I know from
direct recent experience that they yield to the simplest parser design,
and can be relied on to produce a tree of organized content, stripped of
tags and XHTML-specific formatting, in a handful of lines of Ruby code.
It is hard to justify bringing out the big guns for a task like this,
when one could instead use a small self-documenting routine such as I
suggested.
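As an illustration only (this is not the routine mentioned above, which isn't reproduced here), a tree builder of the kind described might look like the sketch below. The method name parse_xhtml and the nested-array tree shape are assumptions made for the example, and it relies on the input being valid XHTML: every tag closed, no comments or CDATA, no ">" inside attribute values.

```ruby
# Minimal sketch: turn well-formed XHTML into a nested-array tree.
# Each element becomes [tag_name, child, child, ...]; text nodes are Strings.
# parse_xhtml is a hypothetical name; attributes are discarded on purpose.
def parse_xhtml(source)
  root = ["root"]
  stack = [root]
  # Split the source into tags and the text runs between them.
  source.scan(/<[^>]+>|[^<]+/) do |token|
    if token.start_with?("</")            # closing tag: go up one level
      stack.pop
    elsif token.start_with?("<?", "<!")   # XML declaration or doctype: skip
      next
    elsif token.start_with?("<")          # opening or self-closing tag
      node = [token[/\A<\s*([\w:-]+)/, 1]]
      stack.last << node
      stack.push(node) unless token.end_with?("/>")
    else                                  # text: keep it unless it is blank
      text = token.strip
      stack.last << text unless text.empty?
    end
  end
  root
end

tree = parse_xhtml("<html><body><p>Hello <em>world</em></p><br /></body></html>")
p tree
```

The stack mirrors the nesting of the document, so the whole thing fits in one pass. The trade-off is that it trusts the markup completely; a page with unclosed tags would unbalance the stack, which is the case for a more robust parser.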
In the bad old days of assembly and comparatively heavy, inflexible
languages like C, C++ and the like, it is easy to see why people would
be motivated to create specialized libraries to solve generic problems
just once for all time. In fact, the argument can be made that Ruby is
just such a library of generics, broadly speaking an
extension/amplification of the STL project.
Now we see people writing easy-to-use application libraries, each
composed using the easy-to-use Ruby library, but that are sometimes
harder to sort out, or make practical use of, than a short bit of code
would have been.
Lest my readers think I am going overboard here on a topic dear to my
heart, let me quote the OP once again:
any ideas?
In other words, after choosing a library and playing with it for a
while, he found himself back at square one, unable to solve the original
problem.
To quote one of my favorite authors (William Burroughs), it seems people
are busy inventing cures for which there are no diseases.