Hpricot help - parsing malformed HTML

squeegy · November 17, 2006, 1:06am

I am having a complicated issue here. I am trying to fetch a page from
Froogle and parse it via Hpricot to collect data from the products in
the search results.

sample page: http://froogle.google.com/froogle?q=magnets&btnG=Search

The problem is that the HTML on Froogle is seriously broken. I need to
get the table row (tr) for each product, and then look in each of that
row’s td’s for data. But Google’s HTML is full of unclosed tags for
their tables, which makes Hpricot freak out. Hpricot thinks the tr’s are
empty:

“<tr valign="top">\n”

So I guess the question is how do I make Hpricot cope with this markup?
It obviously works great in the browser. Are there any tools that will
convert a string of html to a valid XML or DOM equivalent? It must be
possible because web browsers handle it all the time.

What I need to be able to do:

require 'open-uri'
require 'hpricot'

html = open('http://foo.com/').read
html = html.clean_markup  # does not exist; this is the cleanup step I am looking for
html = Hpricot(html)

Here is an oversimplified example of Froogle’s malformed markup:

<table>
  <tr>
    <td>foo</td>
    <td>bar</td>
  <tr>
    <td>baz</td>
    <td>boo</td>
</table>

squeegy · November 17, 2006, 1:12am

Alex W. wrote:

I am having a complicated issue here. I am trying to fetch a page from
Froogle and parse it via Hpricot to collect data from the products in
the search results.

sample page: http://froogle.google.com/froogle?q=magnets&btnG=Search

The problem is that the HTML on Froogle is seriously broken. I need to
get the table row (tr) for each product, and then look in each of that
row’s td’s for data. But Google’s HTML is full of unclosed tags for
their tables, which makes Hpricot freak out. Hpricot thinks the tr’s are
empty:

“<tr valign="top">\n”

Here’s a better illustration of the problem, from irb:

pp Hpricot('<table><tr><td>foo</td><td>bar</td></table>')

=> #<Hpricot::Doc
 {elem <table>
  {emptyelem <tr>}
  {elem <td> {text "foo"} </td>}
  {elem <td> {text "bar"} </td>}
 </table>}>

The <tr> is empty, and the <td>'s are considered direct children of
<table>. So the selector "table tr td" won't work. There is no way to
group td's by row in this case.
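
That said, a parse result like the one above still lists the nodes in document order, so one possible workaround is to regroup the flat siblings by hand: start a new row at every tr and collect the td's that follow it. The sketch below only illustrates that idea. `Node` is a made-up stand-in for a parsed element and `group_cells_by_row` is a hypothetical helper, not part of Hpricot; with a real parser you would feed it the children of the table element instead.

```ruby
# Stand-in for a parsed element (an assumption for this sketch); a real
# parser would supply its own node objects with a name and some text.
Node = Struct.new(:name, :text)

# Group a flat list of sibling nodes into rows: every "tr" starts a new row,
# and the "td" nodes that follow it (up to the next "tr") are that row's cells.
def group_cells_by_row(siblings)
  siblings
    .slice_before { |node| node.name == "tr" }
    .map { |chunk| chunk.select { |node| node.name == "td" }.map(&:text) }
    .reject(&:empty?)
end

flat = [
  Node.new("tr", nil), Node.new("td", "foo"), Node.new("td", "bar"),
  Node.new("tr", nil), Node.new("td", "baz"), Node.new("td", "boo")
]

rows = group_cells_by_row(flat)
# rows is [["foo", "bar"], ["baz", "boo"]]
```

This relies on the empty tr elements staying in place as row markers, which is what the irb output suggests but is not something I have checked against other inputs.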

squeegy · November 17, 2006, 9:39am

On 17 Nov 2006, at 00:06, Alex W. wrote:

The problem is that the HTML on Froogle is seriously broken.

Agreed!

html = html.clean_markup
html = Hpricot(html)

I had a similar problem last week and ended up doing exactly what you
are proposing, i.e. a pre-processing step to clean up the HTML before
feeding it to Hpricot.

Here is an oversimplified example of Froogle’s malformed markup:

<table>
  <tr>
    <td>foo</td>
    <td>bar</td>
  <tr>
    <td>baz</td>
    <td>boo</td>
</table>

I believe there are Ruby libraries for cleaning up HTML though I’m
not familiar with them. Perhaps you could just treat it as a long
string and walk over it doing the following:

1. Scan forward until you find a tag (either opening or closing).
2. If the tag is a known potentially-broken one (’<tr>’, ’<td>’, etc),
   set a flag for that tag to indicate it is open (or push it onto a
   per-tag stack somewhere). Clear the flag (or pop the stack) if/when
   you see the matching closing tag.
3. When you see that tag again, if it hasn’t been closed in the
   meantime, insert the closing tag yourself and clear your flag (pop
   your stack).
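
The steps above can be sketched in plain Ruby. This is a rough illustration under several assumptions, not a tested cleanup library. The names `close_open_table_tags`, `DEPTH` and `TAG` are made up for this example. It only knows about tr and td. It goes slightly beyond the steps above by also closing an open td when a new tr starts and by closing whatever is still open at </table>. It assumes there are no nested tables and no ">" inside attribute values.

```ruby
# Nesting depth of the tags this sketch repairs (a td lives inside a tr).
DEPTH = { "tr" => 0, "td" => 1 }.freeze

# Matches an opening or closing tag and captures the slash and the tag name.
TAG = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/

def close_open_table_tags(html)
  open_tags = [] # stack of repairable tags still waiting for an end tag

  # Pops every open tag at the given depth or deeper and returns the
  # end tags that were missing for them.
  close_down_to = lambda do |depth|
    closers = +""
    closers << "</#{open_tags.pop}>" while open_tags.any? && DEPTH[open_tags.last] >= depth
    closers
  end

  html.gsub(TAG) do |tag|
    closing = !Regexp.last_match(1).empty?
    name = Regexp.last_match(2).downcase

    if name == "table" && closing
      # End of the table: close anything that is still open.
      close_down_to.call(0) + tag
    elsif !DEPTH.key?(name)
      tag # not a tag we repair, pass it through untouched
    elsif closing
      # Close deeper tags first, then accept this end tag.
      missing = close_down_to.call(DEPTH[name] + 1)
      open_tags.pop if open_tags.last == name
      missing + tag
    else
      # Same tag (or an outer one) opened again: close the previous one.
      missing = close_down_to.call(DEPTH[name])
      open_tags.push(name)
      missing + tag
    end
  end
end

puts close_open_table_tags("<table><tr><td>foo<td>bar<tr><td>baz<td>boo</table>")
# prints <table><tr><td>foo</td><td>bar</td></tr><tr><td>baz</td><td>boo</td></tr></table>
```

Markup that already has its end tags passes through unchanged, so the result could be handed to the parser either way.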

I think it will be easier to do than it sounds.

Hope that helps,
Andy

squeegy · November 17, 2006, 4:10pm

Take a look at scrapi - if not to actually use then to steal Assaf’s
ideas. =) I THINK he has some sort of way to pre-process HTML with
Tidy in there; might want to crib those ideas.

–

I think it is inevitable that people program poorly. Training will not
substantially help matters. We have to learn to live with it. – Alan
Perlis

squeegy · November 17, 2006, 3:56pm

Andrew S. wrote:

On 17 Nov 2006, at 00:06, Alex W. wrote:

The problem is that the HTML on Froogle is seriously broken.

Agreed!

Disagree!

The example given is not malformed. It’s perfectly acceptable HTML 4.01.
The end tags for <tr> and <td> can be omitted.

Unless the DTD declaration claims it to be something newer than HTML
4.01, it is fine.

I would say this is a bug in Hpricot.

Mark.

squeegy · November 17, 2006, 4:10pm

On Nov 17, 2006, at 9:51 AM, Thomas, Mark - BLS CTR wrote:

The end tags for <tr> and <td> can be omitted.

Unless the DTD declaration claims it to be something newer than HTML
4.01, it is fine.

I would say this is a bug in Hpricot.

Mark.

You can use RubyfulSoup to deal with HTML even when it isn’t
completely correct. It is packaged as a gem, but I unpacked it into
the plugin directory and it’s working for me. (Hpricot didn’t exist
at the time or I might have tried it.)

#Rubyful Soup
#Elixir and Tonic
#“The Screen-Scraper’s Friend”
#v1.0.4
#http://www.crummy.com/software/RubyfulSoup/

#Rubyful Soup is a port to the Ruby language and idiom of the Python
#library Beautiful Soup.
#See http://www.crummy.com/software/BeautifulSoup/ for details on the
#original.

-Rob

Rob B. http://agileconsultingllc.com

squeegy · November 17, 2006, 4:42pm

On 11/17/06, Michael C. wrote:

Take a look at scrapi - if not to actually use then to steal Assaf’s
ideas. =) I THINK he has some sort of way to pre-process HTML with
Tidy in there; might want to crib those ideas.

We also use tidy for cleaning up invalid XHTML in the MasterView project.

You can get the ruby tidy wrapper here
http://rubyforge.org/projects/tidy
http://tidy.rubyforge.org/ (for usage info)

Note that it also requires the tidy library to be available on the
server. It is available for both Windows and *nix.

It works well at cleaning up invalid XHTML and the ruby tidy wrapper is
simple to use. The only disadvantage is that you need to have the lib
available and you need to set the path to the lib so that it can load
it. I wish that could be automated somehow, because it is a manual setup
step.

Jeff

squeegy · November 17, 2006, 6:21pm

Jeff B. wrote:

On 11/17/06, Michael C. wrote:

Take a look at scrapi - if not to actually use then to steal Assaf’s
ideas. =) I THINK he has some sort of way to pre-process HTML with
Tidy in there; might want to crib those ideas.

We also use tidy for cleaning up invalid XHTML in the MasterView project.

You can get the ruby tidy wrapper here
http://rubyforge.org/projects/tidy
http://tidy.rubyforge.org/ (for usage info)

Jeff

I seem to be having some luck with tidy, cleaning the HTML before I send
it to Hpricot.

This little code snippet seems to handle keeping Tidy.path assigned.
I just have to include the Linux and Windows tidy libs in my /lib
directory.

require 'tidy'
# Point the wrapper at the tidy library bundled for the current platform.
if RUBY_PLATFORM =~ /mswin/
  Tidy.path = "#{RAILS_ROOT}/lib/tidy.dll"
else
  Tidy.path = "#{RAILS_ROOT}/lib/tidy"
end

Thanks for the tip!