Need help with Hpricot


#1

Hi all,

I am trying to extract ONLY "Slang" and "A close companion or comrade." from
the following web page (only part of it is shown) with Hpricot. There is a lot
of JavaScript in the page, and I don't know the path or tag for the target.

Thanks,

Li

side·kick       (sīd'kĭk')  Pronunciation Key

n.  

Slang

A close companion or comrade.



#2

On Oct 8, 3:53 pm, Li Chen removed_email_address@domain.invalid wrote:

I am trying to extract ONLY "Slang" and "A close companion or comrade." from
the following web page (only part of it is shown) with Hpricot. There is a lot
of JavaScript in the page, and I don't know the path or tag for the target.

There’s not a whole lot of HTML structure there. If you can
definitively target the

with Hpricot, you can use regular
expressions to find the appropriate comments and grab the following
text.
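
As a rough illustration of the regular-expression idea, here is a minimal sketch in plain Ruby. The page's real markup was not shown, so the sample HTML below is hypothetical: the comment names SUBHEAD and BOF_DEF are borrowed from the XPath example in this reply, and the surrounding tags and the helper name text_after_comment are assumptions made only for this example.

```ruby
# Hypothetical markup: the real page was not shown, so the comment names
# (SUBHEAD, BOF_DEF) are borrowed from the XPath example in this thread and
# the surrounding tags are guesses.
SAMPLE_HTML = <<~HTML
  <td>
    <!-- SUBHEAD --><i>Slang</i>
    <!-- BOF_DEF -->A close companion or comrade.<!-- EOF_DEF -->
  </td>
HTML

# Return the first run of text that follows a comment containing `marker`,
# skipping any tags that sit between the comment and the text.
# Returns nil when no such comment is found.
def text_after_comment(html, marker)
  pattern = /<!--[^>]*#{Regexp.escape(marker)}[^>]*-->\s*(?:<[^>]+>\s*)*([^<]+)/
  match = html.match(pattern)
  match && match[1].strip
end

label = text_after_comment(SAMPLE_HTML, "SUBHEAD")
definition = text_after_comment(SAMPLE_HTML, "BOF_DEF")
puts label
puts definition
```

If the real comments or tags differ from this guess, the pattern would need adjusting; regular expressions are brittle against markup changes, which is one reason to prefer a parser where possible.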

You can get a little more specific with XPath expressions. The
following sample code (requires libxml-ruby) extracts the two values
from your sample HTML:

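require 'xml'
html = %Q(your_html_here)
doc = XML::HTMLParser.string(html).parse
# text of the first <i> element after the comment containing "SUBHEAD"
puts doc.find('//comment()[contains(.,"SUBHEAD")]/following::i/text()').first
# first text node after the comment containing "BOF_DEF"
puts doc.find('//comment()[contains(.,"BOF_DEF")]/following::text()').first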


#3

Hi Mark T.:

Thank you for the suggestion.

I also searched the forum and found an earlier post that helped me get the
job done. The idea is to 1) use regular expressions to remove
unconventional HTML content such as JavaScript, and 2) let Hpricot
handle what remains. It works pretty well for me.
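
The first step might look roughly like the sketch below. It is not the code from the earlier post; the sample fragment and the helper name strip_scripts are hypothetical, and it assumes the unwanted content sits in ordinary script elements.

```ruby
# Hypothetical page fragment; the real page was not shown in the thread.
RAW_HTML = <<~HTML
  <div>
    <script type="text/javascript">document.write("<b>ad</b>");</script>
    <i>Slang</i>
    <script>var x = 1 < 2;</script>
    <p>A close companion or comrade.</p>
  </div>
HTML

# Remove every <script>...</script> block. The /m flag lets "." span
# newlines, /i ignores tag case, and ".*?" stops at the nearest closing tag.
def strip_scripts(html)
  html.gsub(%r{<script\b[^>]*>.*?</script\s*>}mi, "")
end

cleaned = strip_scripts(RAW_HTML)
puts cleaned
# `cleaned` can then be handed to an HTML parser, for example
# Hpricot(cleaned) if the hpricot gem is installed.
```

The cleaning step uses only the standard library, so it can be checked on its own before the parser is involved.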

Here is the title and author of that post/reply:

Re: HTML parser Hpricot? and how to get all text
Posted by SpringFlowers AutumnMoon (winterheat) on 03.11.2007 09:10

Li