I am having a complicated issue here. I am trying to fetch a page from
Froogle and parse it via Hpricot to collect data from the products in
the search results.
The problem is that the HTML on Froogle is seriously broken. I need to
get the table row (tr) for each product, and then look in each of those
rows’ td’s for data. But Google’s HTML is full of unclosed table tags,
which makes Hpricot freak out. Hpricot thinks the tr’s are
So I guess the question is how do I make Hpricot cope with this markup?
It obviously works great in the browser. Are there any tools that will
convert a string of HTML to a valid XML or DOM equivalent? It must be
possible because web browsers handle it all the time.
What I need to be able to do:
require 'open-uri'
require 'hpricot'
html = open('http://foo.com/').read
html = html.clean_markup  # hypothetical cleanup step; this is what I am looking for
html = Hpricot(html)
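To make it clearer what I mean by clean_markup, here is a rough pure-Ruby sketch of the behaviour I am after. The method name and everything in it are my own invention, not something Hpricot provides. It only closes missing table, tr, td and th tags and drops stray closing ones; it ignores comments, scripts and every other kind of breakage, so it is nowhere near a real HTML repair tool.

```ruby
# Hypothetical String#clean_markup: a sketch that closes unclosed table tags.
class String
  def clean_markup
    stack = []        # table-related tags that are currently open
    out = String.new  # the repaired markup

    # Close open tags until the innermost open tag is one of the stoppers.
    close_until = lambda do |stoppers|
      out << "</#{stack.pop}>" until stack.empty? || stoppers.include?(stack.last)
    end

    # Split into tags and text; the capture group keeps the tags in the list.
    split(/(<[^>]+>)/).each do |token|
      m = token.match(/\A<(\/?)(table|tr|td|th)\b/i)
      unless m
        out << token  # text or an unrelated tag: copy through untouched
        next
      end

      name = m[2].downcase
      if m[1].empty?
        # Opening tag: first close whatever it implicitly ends.
        close_until.call(%w[table]) if name == 'tr'
        close_until.call(%w[tr table]) if %w[td th].include?(name)
        stack.push(name)
        out << token
      else
        # Closing tag: only honour it if a matching tag is open in this table.
        scope = name == 'table' ? stack : stack.reverse.take_while { |t| t != 'table' }
        next unless scope.include?(name)  # stray closing tag: drop it
        out << "</#{stack.pop}>" until stack.last == name
        stack.pop
        out << token
      end
    end

    # Close anything still open at the end of the input.
    out << "</#{stack.pop}>" until stack.empty?
    out
  end
end

puts "<table><tr><td>a<td>b<tr><td>c</table>".clean_markup
```

Something along these lines, but robust enough for real pages, is what I am hoping already exists.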
Here is an oversimplified example of Froogle’s malformed markup: