Need help with Hpricot

alex-osu3 · October 8, 2008, 9:55pm

Hi all,

I am trying to extract ONLY “Slang” and “A close companion or comrade.”
from the following part of a webpage with Hpricot. The page contains a
lot of JavaScript, and I don’t know the path or tag for the target.

Thanks,

Li

side·kick

(sīd'kĭk') Pronunciation Key

n.

Slang

A close companion or comrade.

alex-osu3 · October 9, 2008, 8:56pm

On Oct 8, 3:53 pm, Li Chen wrote:

I am trying to extract ONLY “Slang” and “A close companion or comrade.”
from the following part of a webpage with Hpricot. The page contains a
lot of JavaScript, and I don’t know the path or tag for the target.

There’s not a whole lot of HTML structure there. If you can
definitively target the enclosing element with Hpricot, you can use
regular expressions to find the appropriate comments and grab the text
that follows them.
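As a rough sketch of that regex approach: the fragment below is hypothetical, since the original markup did not survive in this thread. The SUBHEAD and BOF_DEF comment markers are assumed from the XPath expressions further down, and the surrounding `<td>` and `<i>` tags are guesses for illustration only.

```ruby
# Hypothetical fragment: the real page markup is not shown in the thread,
# so the tags and comment markers here are assumptions.
html = <<~HTML
  <td>
    <!-- SUBHEAD --><i>Slang</i>
    <!-- BOF_DEF -->A close companion or comrade.
  </td>
HTML

# Text inside the first <i> element right after the SUBHEAD comment.
label = html[/<!--[^>]*SUBHEAD[^>]*-->\s*<i>(.*?)<\/i>/m, 1]

# Text between the BOF_DEF comment and the next tag or comment.
definition = html[/<!--[^>]*BOF_DEF[^>]*-->\s*([^<]+)/m, 1]

puts label
puts definition.strip
```

This only works while the comments sit directly in front of the text you want, so it is fragile if the page layout changes.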

You can get a little more specific with XPath expressions. The
following sample code (requires libxml-ruby) extracts the two values
from your sample HTML:

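require 'xml'  # provided by libxml-ruby
html = %Q(your_html_here)
doc = XML::HTMLParser.string(html).parse
# Text of the first <i> element after the comment containing "SUBHEAD"
puts doc.find('//comment()[contains(.,"SUBHEAD")]/following::i/text()').first
# First text node after the comment containing "BOF_DEF"
puts doc.find('//comment()[contains(.,"BOF_DEF")]/following::text()').first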

alex-osu3 · October 9, 2008, 9:33pm

Hi Mark T.:

Thank you for the suggestion.

I also searched the forum and found an earlier post that helped me get
the job done. The idea is to 1) use regular expressions to remove
non-standard HTML content such as JavaScript, and 2) let Hpricot handle
what remains. It works pretty well for me.
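For anyone finding this later, the two steps can be sketched roughly as follows. The page fragment is made up for illustration (the real markup is not shown in this thread), and this is not the code from the earlier post. To keep the sketch free of gem dependencies, the second step just strips the remaining tags where the cleaned string would normally be handed to Hpricot.

```ruby
# Hypothetical page fragment; the tags and scripts are assumptions.
html = <<~HTML
  <div>
    <script type="text/javascript">document.write("<b>ad</b>");</script>
    <i>Slang</i>
    <script>var x = 1;</script>
    <p>A close companion or comrade.</p>
  </div>
HTML

# Step 1: remove <script>...</script> blocks (case-insensitive, across newlines).
cleaned = html.gsub(/<script\b[^>]*>.*?<\/script>/im, "")

# Step 2: the cleaned markup would normally go to the parser, e.g. Hpricot(cleaned).
# Here the remaining tags are simply stripped and whitespace is collapsed.
text = cleaned.gsub(/<[^>]+>/, " ").split.join(" ")

puts text
```

A regex like this can miss unusual script markup, so it is best treated as a pre-cleaning pass rather than a full sanitizer.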

Here is the title and author of that post/reply:

Re: HTML parser Hpricot? and how to get all text
Posted by SpringFlowers AutumnMoon (winterheat) on 03.11.2007 09:10

Li