Hpricot - best way to parse based on comments

jerome · November 20, 2006, 11:52pm

I am trying to parse some files that contain comments like this:

images, text, etc…

Interesting text of site here.

I am wondering how to go about extracting the data within the comments
block using Hpricot. I am not aware of a way to refer to commented HTML
through CSS or XPath selectors.

Thanks for any ideas!

Jerome

jerome · November 21, 2006, 12:50am

On 11/20/06, Jerome wrote:

I am trying to parse some files that contain comments like this:
…
I am not aware of a way to refer to commented HTML
through CSS or XPath selectors.

The XPath comment() node test selects all comments.

For example, with xmlstarlet (the XPath expression follows the -m flag):
keith@devel ~ $ xml sel -t -m '//comment()' -v '.' -n simple.xml
one comment
two comment

keith@devel ~ $ cat simple.xml

HTH,
Keith
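
For anyone who would rather stay in Ruby, here is a minimal sketch of the same comment() selection using REXML. The XML string is a hypothetical stand-in for simple.xml, whose contents are not shown above, and the variable names are only illustrative. REXML expects well-formed XML, so this may not cope with real-world HTML.

```ruby
require "rexml/document"

# Hypothetical stand-in for simple.xml (the real file is not shown in the post).
xml = <<~XML
  <root>
    <!-- one comment -->
    <item>text</item>
    <!-- two comment -->
  </root>
XML

doc = REXML::Document.new(xml)

# //comment() selects every comment node in the document.
# Comment#string is the text between the comment delimiters.
comments = REXML::XPath.match(doc, "//comment()").map { |c| c.string.strip }

puts comments
```

Note that this returns the text of the comments themselves, not whatever markup sits between two comments.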

jerome · November 24, 2006, 8:54pm

Jerome wrote:

Interesting text of site here.

I am wondering how to go about extracting the data within the comments
block using Hpricot.

The best and easiest way to parse this file using Hpricot with your
required specification … is not to use Hpricot.

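# Marker strings that delimit the block to extract
start_mark = ""
end_mark = ""

data = File.read(page_path)

# Non-greedy, multiline match; escape the markers so they are matched literally
output = data.scan(%r{#{Regexp.escape(start_mark)}(.*?)#{Regexp.escape(end_mark)}}m)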

All done, finished, no poring over documentation, no considering
rewriting the library to get it to do what you actually want, done.
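
A self-contained sketch of the scan approach, in case it helps to see it run end to end. The two marker strings and the sample page are assumptions made up for illustration, since the real markers are not shown in the thread; extract_blocks is likewise a hypothetical helper name.

```ruby
# Hypothetical markers; substitute whatever comments delimit the block in your pages.
START_MARK = "<!-- start content -->"
END_MARK   = "<!-- end content -->"

# Return the text found between each start/end marker pair, whitespace-trimmed.
def extract_blocks(data, start_mark = START_MARK, end_mark = END_MARK)
  # Regexp.escape keeps the markers literal; /m lets . match newlines;
  # .*? is non-greedy so each block stops at the nearest end marker.
  pattern = /#{Regexp.escape(start_mark)}(.*?)#{Regexp.escape(end_mark)}/m
  # scan returns one single-element array per match, hence flatten.
  data.scan(pattern).flatten.map(&:strip)
end

# Made-up sample page standing in for File.read(page_path).
page = <<~HTML
  <html><body>
  <img src="logo.png">
  <!-- start content -->
  Interesting text of site here.
  <!-- end content -->
  </body></html>
HTML

blocks = extract_blocks(page)
puts blocks
```

The non-greedy match matters when a page holds more than one marked block: a greedy .* would run from the first start marker to the last end marker.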

By the way, did I mention that inserting new data into the same page
structure is about the same level of difficulty?
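
A rough sketch of what that insertion might look like, assuming the block is delimited by a pair of marker comments. The marker strings, the sample page, and the replace_block helper are all hypothetical; this is one way to do it with String#sub, not necessarily what the poster had in mind.

```ruby
# Replace whatever sits between the markers with new_text, keeping the markers.
def replace_block(data, new_text, start_mark, end_mark)
  pattern = /(#{Regexp.escape(start_mark)}).*?(#{Regexp.escape(end_mark)})/m
  # The block form of sub keeps backslashes in new_text from being
  # interpreted as backreferences.
  data.sub(pattern) { "#{$1}\n#{new_text}\n#{$2}" }
end

# Hypothetical markers and page content, for illustration only.
start_mark = "<!-- start content -->"
end_mark   = "<!-- end content -->"
page = "<p>intro</p>\n#{start_mark}\nold text\n#{end_mark}\n<p>footer</p>\n"

updated = replace_block(page, "new text", start_mark, end_mark)
puts updated
```

sub only touches the first marked block; gsub would rewrite every block on the page.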