I am having a complicated issue here. I am trying to fetch a page from
Froogle and parse it via Hpricot to collect data from the products in
the search results.
The problem is that the HTML on Froogle is seriously broken. I need to
get the table row (tr) for each product, and then look in each of that
row’s td’s for data. But Google’s HTML is full of unclosed table tags,
which makes Hpricot freak out. Hpricot thinks the tr’s are
empty:
“<tr valign="top">\n”
So I guess the question is how do I make Hpricot cope with this markup?
It obviously works great in the browser. Are there any tools that will
convert a string of html to a valid XML or DOM equivalent? It must be
possible because web browsers handle it all the time.
What I need to be able to do:
html = open('http://foo.com/').read
html = html.clean_markup
html = Hpricot(html)
Here is an oversimplified example of Froogle’s malformed markup:
Here’s a better illustration of the problem, from irb:
pp Hpricot('<table>
<tr>
<td>foo
<td>bar
</table>')
=> #<Hpricot::Doc
{elem <table>
{emptyelem <tr>}
{elem <td>
{text "foo"}}
{elem <td>
{text "bar"}}
}>
The <tr> is empty, and the <td>'s are considered direct children of
<table>. So the selector "table tr td" won't work. There is no way to
group td's by row in this case.
The problem is that the HTML on Froogle is seriously broken.
Agreed!
html = html.clean_markup
html = Hpricot(html)
I had a similar problem last week and ended up doing exactly what you
are proposing, i.e. a pre-processing step to clean up the HTML before
feeding it to Hpricot.
Here is an oversimplified example of Froogle’s malformed markup:
foo
bar
baz
boo
I believe there are Ruby libraries for cleaning up HTML though I’m
not familiar with them. Perhaps you could just treat it as a long
string and walk over it doing the following:
1. Scan forward until you find a tag (either opening or closing).
2. If the tag is a known potentially-broken one ('<tr>', '<td>',
   etc.), set a flag for that tag to indicate it is open (or push
   it onto a per-tag stack somewhere). Clear the flag (or pop the
   stack) if/when you see the matching closing tag.
3. When you see that tag again, if it hasn’t been closed in the
   meantime, insert the closing tag yourself and clear your flag (pop
   your stack).
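The walk described above could be sketched roughly like this. It is
only a sketch under a few assumptions of mine, not a tested cleaner:
the method name close_open_tags, the tag list (td and tr, innermost
first) and the "table" container are all made up for illustration. It
also goes slightly beyond the per-tag flag idea: when an outer tag such
as tr reopens, any td still open inside it is closed first, and
anything left open is closed at the end of the table. Nested tables and
attribute values containing ">" are not handled.

```ruby
# Tags whose end tags are often omitted, innermost first. Both constants
# are assumptions for the table case discussed in this thread.
OPTIONAL_END = %w[td tr].freeze
CONTAINER    = "table"

# Returns a copy of html with missing </td> and </tr> tags inserted.
def close_open_tags(html)
  unclosed = {} # tag name => true while <tag> has been seen but not </tag>

  # Build closing tags for every currently open tag in names, clearing flags.
  flush = lambda do |names|
    names.select { |n| unclosed.delete(n) }.map { |n| "</#{n}>" }.join
  end

  # Walk over every tag in the string; everything between tags is untouched.
  html.gsub(%r{<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>}) do |tag|
    closing = ($1 == "/")
    name    = $2.downcase
    level   = OPTIONAL_END.index(name)

    if level && !closing
      # Tag opened again: close it, and anything nested inside it, first.
      inserted = flush.call(OPTIONAL_END[0..level])
      unclosed[name] = true
      inserted + tag
    elsif level
      # Explicit end tag: close anything nested inside, then clear the flag.
      inserted = flush.call(OPTIONAL_END[0...level])
      unclosed.delete(name)
      inserted + tag
    elsif closing && name == CONTAINER
      # End of the table: close whatever is still open.
      flush.call(OPTIONAL_END) + tag
    else
      tag
    end
  end
end
```

The result is still a plain string, so it can be handed straight to
Hpricot(...) as a pre-processing step. Markup that already has all of
its end tags should pass through unchanged.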
Take a look at scrapi - if not to actually use then to steal Assaf’s
ideas. =) I THINK he has some sort of way to pre-process HTML with
Tidy in there; might want to crib those ideas.
On Nov 17, 2006, at 9:51 AM, Thomas, Mark - BLS CTR wrote:
The end tags for <tr> and <td> can be omitted.
Unless the DTD declaration claims it to be something newer than HTML
4.01, it is fine.
I would say this is a bug in Hpricot.
Mark.
You can use RubyfulSoup to deal with HTML even when it isn’t
completely correct. It is packaged as a gem, but I unpacked it into
the plugin directory and it’s working for me. (Hpricot didn’t exist
at the time or I might have tried it.)
Take a look at scrapi - if not to actually use then to steal Assaf’s
ideas. =) I THINK he has some sort of way to pre-process HTML with
Tidy in there; might want to crib those ideas.
We also use tidy for cleaning up invalid XHTML in the MasterView
project. Note that it also requires the tidy library to be available on
the server. It is available for both Windows and *nix.
It works well at cleaning up invalid XHTML, and the Ruby tidy wrapper is
simple to use. The only disadvantage is that you need to have the lib
available and you need to set the path to the lib so that it can load
it. I wish that could be automated somehow, because it is a manual
setup step.