Hpricot - Trying to do a few things...are they possible?

I’ve been messing with Hpricot and I’m trying to do a few things that
apparently aren’t documented or available as part of Hpricot. Can
someone verify the following…

  1. Is there a simple way to determine the element’s current path /
    location? For example, if I find a text node, is there a simple way to
    determine the path of that text node so I can find it again later using
    that path / location as a parameter to the search method? I assume I
    can use the parent method to find the parent and recurse through until
    I get to the root node…is there an easier way?

  2. Is there a simple way to find all elements with non-empty text
    nodes? It appears that Hpricot is focused on providing methods for
    finding something if you know the element tag / attributes / classes /
    etc. I’ve been using traverse_text which requires going through every
    text node and filtering out the ones that are empty / whitespace. Is
    there an easier way to find all elements with non-empty text nodes?

This is in reference to parsing HTML pages which may or may not be
well-formed.

All in all - I really like Hpricot. I was using REXML and tidy before,
but this is a lot simpler and faster!

Thanks to _why the lucky stiff for a great little HTML parser…

On Mon, Oct 02, 2006 at 11:38:05PM +0900, HH wrote:
} I’ve been messing with Hpricot and I’m trying to do a few things that
} apparently aren’t documented or available as part of Hpricot. Can
} someone verify the following…
}
} 1) Is there a simple way to determine the element’s current path /
} location? For example, if I find a text node, is there a simple way to
} determine the path of that text node so I can find it again later using
} that path / location as a parameter to the search method? I assume I
} can use the parent method to find the parent and recurse through until
} I get to the root node…is there an easier way?

I have been using the recursive (well, iterative, actually) approach. I
suspect that is the way to do it, since the tree structure is
intentionally simple and is designed to let you move nodes around
arbitrarily. Maintaining a node’s path independent of its structural
location is inefficient at best and impossible at worst.
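
For illustration, the iterative parent walk could look roughly like the
sketch below. It does not use Hpricot itself: PathNode is a hypothetical
stand-in, and the sketch assumes only that each node exposes a tag name
and a parent. Treat it as an outline of the idea, not as Hpricot's API.

```ruby
# Hypothetical stand-in for a parsed node; NOT an Hpricot class.
# Assumption: a real node exposes a tag name and a parent reference.
PathNode = Struct.new(:name, :parent)

# Walk up the parent chain iteratively, collecting tag names from the
# node back to the root, then join them into a slash-separated path.
def path_for(node)
  names = []
  while node
    names.unshift(node.name)
    node = node.parent
  end
  "/" + names.join("/")
end

# Build a small chain by hand: html > body > table > tr > td
html  = PathNode.new("html", nil)
body  = PathNode.new("body", html)
table = PathNode.new("table", body)
tr    = PathNode.new("tr", table)
td    = PathNode.new("td", tr)

puts path_for(td)  # prints /html/body/table/tr/td
```

The path is computed on demand from the node's current position, so it
stays correct even after nodes have been moved around in the tree.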

} 2) Is there a simple way to find all elements with non-empty text
} nodes? It appears that Hpricot is focused on providing methods for
} finding something if you know the element tag / attributes / classes /
} etc. I’ve been using traverse_text which requires going through every
} text node and filtering out the ones that are empty / whitespace. Is
} there an easier way to find all elements with non-empty text nodes?

nodes = []
# Collect the parent of every text node that is not empty or whitespace-only.
doc.traverse_text { |t| nodes << t.parent unless t.content.to_s.strip.empty? }

} This is in reference to parsing HTML pages which may or may not be
} well-formed.

I’ve found Hpricot to be remarkably resilient in parsing questionable
HTML.

} All in all - I really like Hpricot. I was using REXML and tidy before,
} but this is a lot simpler and faster!
}
} Thanks to _why the lucky stiff for a great little HTML parser…

I’ll second that.
–Greg

Gregory – I appreciate your reply. I found it to be very helpful.

One more question that pertains to finding the path recursively…if
done this way, the path itself is not necessarily unique to the text
node. It’s quite possible to have 2 different text nodes with the same
path (e.g. they are in the same table, but in different rows). The
only way to distinguish between text nodes would be to include some
sort of index to account for multiple children under the same parent or
other instances that would create a situation with the same path.

For example, you could have two text nodes that are in the first cell
of a table but in different rows with the path:

/html/body/table/tr/td

To truly select a particular text node, you would have to know the
index of the row in order to get to that text node:

/html/body/table/tr[1]/td
/html/body/table/tr[2]/td

I’m assuming there is no easy way to determine the index as well as the
path of a particular text node…

Any ideas would be greatly appreciated.

Thanks again!
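
One possible way to get such an index, sketched under assumptions: while
walking up the parent chain, count each node's position among its
same-named siblings. TreeNode below is a hypothetical stand-in, not an
Hpricot class; the sketch assumes only that a node exposes its name, its
parent, and its parent's children.

```ruby
# Hypothetical stand-in for a parsed node; NOT an Hpricot class.
class TreeNode
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = []
    parent.children << self if parent
  end
end

# One path step: the tag name, plus a 1-based index when the node has
# same-named siblings (e.g. "tr[2]").
def indexed_step(node)
  return node.name unless node.parent
  same = node.parent.children.select { |c| c.name == node.name }
  return node.name if same.size == 1
  "#{node.name}[#{same.index { |c| c.equal?(node) } + 1}]"
end

# Walk up to the root, building an indexed path along the way.
def indexed_path(node)
  steps = []
  while node
    steps.unshift(indexed_step(node))
    node = node.parent
  end
  "/" + steps.join("/")
end

html  = TreeNode.new("html")
body  = TreeNode.new("body", html)
table = TreeNode.new("table", body)
row1  = TreeNode.new("tr", table)
row2  = TreeNode.new("tr", table)
cell1 = TreeNode.new("td", row1)
cell2 = TreeNode.new("td", row2)

puts indexed_path(cell1)  # prints /html/body/table/tr[1]/td
puts indexed_path(cell2)  # prints /html/body/table/tr[2]/td
```

The index is only emitted where siblings share a name, which matches
the tr[1] / tr[2] example above. Whether the resulting string is
accepted by Hpricot's search method is not something this sketch
verifies.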