Peter S. wrote:
> of tidy-up engine, so you can turn invalid HTML into XML.
That depends on what data you are after, and where you want to look for
it.
If, for example, you just want to get a list of css files referenced in
a page, then regexen would likely be simpler and faster than the
tidy-up approach.
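For what it's worth, that regex route can be a one-liner. A sketch, assuming each <link> tag sits on its own line with a quoted href (the sample markup is made up):

```ruby
# List the .css files referenced in a page with a plain regex,
# assuming quoted href attributes inside <link> tags.
html = <<-HTML
<html><head>
<link rel="stylesheet" type="text/css" href="/css/main.css">
<link rel="stylesheet" href="print.css" media="print">
</head><body></body></html>
HTML

# scan with one capture group returns [["/css/main.css"], ["print.css"]]
css_files = html.scan(/<link[^>]+href=["']([^"']+\.css)["']/i).flatten
puts css_files
```

Fragile if the markup changes, of course, but no tidy pass and no parser.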
> I recommend this one:
> http://tidy.rubyforge.org/
> After this step you have reduced the problem of arbitrary (possibly
> invalid) HTML parsing to XML parsing which is definitely easier, e.g.
> with REXML.
Sort of. I’ve seen tidy make some odd assumptions about what the
“correct” output should be, based on surreal HTML input. And this can
throw off the XML manipulation code.
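When tidy does cooperate, though, the REXML side is pleasantly short. A sketch, with `xhtml` standing in for tidy's (well-formed) output:

```ruby
require 'rexml/document'

# Stand-in for tidy's cleaned-up output.
xhtml = <<-XML
<html><head><title>Example</title></head>
<body><p class="note">hello</p></body></html>
XML

doc = REXML::Document.new(xhtml)
# XPath queries against the tree, from Ruby's standard library.
title = doc.elements['//title'].text
notes = REXML::XPath.match(doc, "//p[@class='note']").map { |e| e.text }
puts title
```

It's exactly this code that tidy's odd "corrections" can quietly break: the elements are still there, just not where your XPath expects them.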
> soup = BeautifulSoup.new(page)
I’ve just been trying out BeautifulSoup to parse some nasty del.icio.us
markup (it has an XHTML DOCTYPE, but is painfully broken).
I had been using some simple regex iteration over the source, but they
changed that page layout, my app broke, and I thought perhaps I’d give
BeautifulSoup another shot. But I realized why I stopped using it in
the first place: it’s way too slow. (Or at least way slower than my
hand-rolled hacks.)
I’ve tried a number of ways, over various applications, to extract stuff
from HTML. If I can get predictable XML right off, then that’s a big
help; I can pass it into a stream parser, or use a DOM if the file isn’t
too large.
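The stream-parser route, for reference: REXML's StreamListener fires callbacks as tags arrive, so nothing like a full tree is ever built. A sketch that collects anchor hrefs (the markup and class name are mine):

```ruby
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

# Collect href attributes from <a> tags as the parser streams past them.
class LinkCollector
  include REXML::StreamListener
  attr_reader :hrefs
  def initialize; @hrefs = []; end
  def tag_start(name, attrs)
    # In stream parsing, attrs arrives as a name => value hash.
    @hrefs << attrs['href'] if name == 'a' && attrs['href']
  end
end

xml = %(<doc><a href="one.html">1</a><a href="two.html">2</a></doc>)
listener = LinkCollector.new
REXML::Parsers::StreamParser.new(xml, listener).parse
puts listener.hrefs.inspect
```

Memory stays flat no matter how big the file gets, which is the whole point.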
When handed broken markup, I’ve found that many times the problem is in
only one or two places, most often the header (with malformed empty
elements). Much time can be saved by grabbing a subset of the raw HTML
(with some simple stateful line-by-line iteration) and cleaning up what
I actually need (and often that extracted subset is proper XML all by
itself).
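The stateful iteration I mean is nothing fancy. A sketch, assuming (hypothetically) the page brackets the part you care about with recognizable marker lines:

```ruby
# Grab only the lines between two marker comments, skipping the
# (often malformed) header entirely. The markers here are invented.
html_lines = <<-HTML.lines
<html><head><title>x</title>
<meta name=viewport content=width=device-width>
</head><body>
<!-- data-start -->
<table><tr><td>42</td></tr></table>
<!-- data-end -->
</body></html>
HTML

inside = false
subset = []
html_lines.each do |line|
  inside = false if line.include?('data-end')   # stop before the end marker
  subset << line if inside
  inside = true  if line.include?('data-start') # start after the start marker
end
puts subset.join
```

In practice the markers are whatever the page reliably provides (a table id, a comment, a heading), and the extracted chunk often parses as XML with no tidy pass at all.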
There is a real cost to making the parsing/cleaning code highly robust,
and if you can make certain assumptions about the source text (and live
with the risks that things can change), you can often make the app
faster/simpler.
--
James B.
Judge a man by his questions, rather than his answers.