Parsing HTML into a tree

Hello,

I have an HTML document which I need to parse into a tree. I was
looking for a suitable module to use, and found two candidates:

  1. Hpricot. While it looks like a nice and fast HTML parser, its
    interface appears to be completely unsuitable for parsing an HTML
    file into a tree. Is this possible, and am I missing something?

  2. htree. It looks closer to what I need, but its documentation is very
    poor (almost nonexistent). When I ‘pp’ a parsed HTree document I see a
    representation, but how can I actually traverse the tree? At the
    moment I’m using reflection to “disassemble” the htree tree structure.
    There must be a better way! Can someone please provide an example of
    how to recursively print out the tree, telling for each node what kind
    of node it is?

Are there other options?

Thanks in advance
Eli

Hi Eli,

I am not sure what you are trying to do.

> Can someone please provide an example of
> how to recursively print out the tree, telling for each node what kind
> of node it is?

gg.html (sample file)

<html>
<head>
   <title>Test Doc</title>
</head>
<body>
<div id="main">
Main div Content
<div id="sub">
Sub div content
<span> Test span content</span>
</div>
</div>
</body>
</html>


gg.rb
~~~~
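require "htree"

# Parse the HTML read from standard input into an HTree document.
tree = HTree.parse(STDIN)

# Visit every element, printing its qualified name followed by
# the text contained in it and in its descendants.
tree.traverse_all_element do |e|
   puts e.name
   puts e.extract_text
end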

Output
~~~~~~
imayam:~/work/temp/htree-0.6/test gg$ ruby gg.rb  < gg.html
{http://www.w3.org/1999/xhtml}html
           Test Doc
                         Main div Content
                                 Sub div content
                                  Test span content
{http://www.w3.org/1999/xhtml}head
           Test Doc
{http://www.w3.org/1999/xhtml}title
Test Doc
{http://www.w3.org/1999/xhtml}body
                         Main div Content
                                 Sub div content
                                  Test span content
{http://www.w3.org/1999/xhtml}div
                         Main div Content
                                 Sub div content
                                  Test span content
{http://www.w3.org/1999/xhtml}div
                                 Sub div content
                                  Test span content
{http://www.w3.org/1999/xhtml}span
Test span content

You can also traverse using 'traverse_some_element', 'each_child',
'traverse_text', etc. You can also convert the HTree to REXML and traverse
that instead. Some context on what you are actually trying to achieve
would be helpful.
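
For the REXML route, here is a rough sketch of a recursive walk that
reports the kind of each node. It rests on a few assumptions: the sample
markup is parsed directly with REXML (it happens to be well-formed XML)
rather than converted from an HTree, 'describe_tree' is a made-up helper
name and not part of either library, and whitespace-only text nodes are
skipped to keep the listing short.

```ruby
require "rexml/document"

# Sample markup from gg.html, inlined so the sketch is self-contained.
SAMPLE = <<~HTML
  <html>
  <head>
     <title>Test Doc</title>
  </head>
  <body>
  <div id="main">
  Main div Content
  <div id="sub">
  Sub div content
  <span> Test span content</span>
  </div>
  </div>
  </body>
  </html>
HTML

# Hypothetical helper (not a REXML or htree method): collects one line
# per node, indented by depth, naming the kind given by REXML's node_type.
def describe_tree(node, depth = 0, lines = [])
  indent = "  " * depth
  case node.node_type
  when :document
    lines << "#{indent}document"
    node.each { |child| describe_tree(child, depth + 1, lines) }
  when :element
    lines << "#{indent}element: #{node.name}"
    node.each { |child| describe_tree(child, depth + 1, lines) }
  when :text
    # Skip whitespace-only text between tags.
    text = node.value.strip
    lines << "#{indent}text: #{text.inspect}" unless text.empty?
  else
    # Comments, processing instructions, and so on.
    lines << "#{indent}#{node.node_type}"
  end
  lines
end

puts describe_tree(REXML::Document.new(SAMPLE))
```

Branching on the node kind in one place keeps the recursion in a single
method; a real HTML file that is not well-formed XML would need to go
through an HTML parser first.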

Cheers,
Ganesh G.