Forum: Ruby need help with Hpricot

Li CN (alex-osu3)
on 2008-10-08 21:55
Hi all,

I am trying to extract ONLY "Slang" and "A close companion or comrade." from
the following web page fragment with Hpricot. There is a lot of JavaScript
in it, and I don't know which path or tag to target.

Thanks,

Li




<td><b>side·kick</b> &nbsp;
    <script type="text/javascript">
    ............................
.............................................
    </script><noscript><a
href="http://dictionary.reference.com/audio.html/ahd4WAV...
target="_blank"><img src="http://cache.lexico.com/g/d/speaker.gif"
border="0" /></a></noscript> &nbsp; &nbsp;&nbsp;(sīd'kĭk') &nbsp;<a
href="http://cache.lexico.com/help/ahd4/pronkey.html" class="pronkey"
title="Click for guide to symbols." onclick="ahdpop();return
false;">Pronunciation Key</a>&nbsp;
<br />

<!--BOF_HEAD-->
n.&nbsp;&nbsp;<!--EOF_HEAD-->
<!--BOF_SUBHEAD-->
<i>Slang</i>
<br />
<!--EOF_SUBHEAD-->
<!--BOF_DEF-->
A close companion or comrade.
<br />
<!--EOF_DEF-->
<br />
</td>
Mark Thomas (Guest)
on 2008-10-09 20:56
(Received via mailing list)
On Oct 8, 3:53 pm, Li Chen <chen_...@yahoo.com> wrote:
> I am trying to extract ONLY "Slang" and "A close companion or comrade." from
> the following web page fragment with Hpricot. There is a lot of JavaScript
> in it, and I don't know which path or tag to target.

There's not a whole lot of HTML structure there. If you can
definitively target the <td> with Hpricot, you can use regular
expressions to find the appropriate comments and grab the following
text.
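
A minimal sketch of that regex idea, using only the Ruby standard library. It assumes the <td> fragment is already in a string and that the BOF_/EOF_ comment markers appear exactly as in the sample; between_markers is a hypothetical helper name, not part of Hpricot.

```ruby
# Sketch only: assumes the <td> contents are already available as a string and
# that the BOF_/EOF_ comment markers look exactly like the ones in the sample.
td_html = <<~HTML
  <!--BOF_HEAD-->
  n.&nbsp;&nbsp;<!--EOF_HEAD-->
  <!--BOF_SUBHEAD-->
  <i>Slang</i>
  <br />
  <!--EOF_SUBHEAD-->
  <!--BOF_DEF-->
  A close companion or comrade.
  <br />
  <!--EOF_DEF-->
HTML

# Hypothetical helper: returns the text between <!--BOF_name--> and
# <!--EOF_name-->, with tags removed and whitespace collapsed.
def between_markers(html, name)
  marker = Regexp.escape(name)
  match = html.match(/<!--BOF_#{marker}-->(.*?)<!--EOF_#{marker}-->/m)
  return nil unless match
  match[1].gsub(/<[^>]+>/, ' ').gsub(/\s+/, ' ').strip
end

label      = between_markers(td_html, 'SUBHEAD')
definition = between_markers(td_html, 'DEF')
puts label
puts definition
```

This depends entirely on the comment markers staying stable, so it would break if the site changed them.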

You can get a little more specific with XPath expressions. The
following sample code (requires libxml-ruby) extracts the two values
from your sample HTML:

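require 'xml'  # provided by the libxml-ruby gem
html = %Q(your_html_here)
doc = XML::HTMLParser.string(html).parse
# Text of the first <i> that follows the SUBHEAD comment
puts doc.find('//comment()[contains(.,"SUBHEAD")]/following::i/text()').first
# First text node that follows the BOF_DEF comment
puts doc.find('//comment()[contains(.,"BOF_DEF")]/following::text()').first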
Li CN (alex-osu3)
on 2008-10-09 21:33
Hi Mark Thomas:

Thank you for the suggestion.

I also searched the forum and found an earlier post that helped me get the
job done. Its ideas are: 1) use regular expressions to remove
unconventional HTML such as JavaScript blocks, and 2) let Hpricot
handle what remains. It works pretty well for me.
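
Step 1 could look roughly like the sketch below. The pattern and the strip_scripts name are assumptions for illustration, not the code from the earlier post, and the sample string is a shortened stand-in for the real page. Step 2 is only indicated in a comment so the snippet runs without the Hpricot gem.

```ruby
# Step 1 only: drop <script> and <noscript> blocks before parsing.
# The pattern is an assumption; it is not taken from the earlier post.
def strip_scripts(html)
  html.gsub(%r{<(script|noscript)\b[^>]*>.*?</\1>}mi, '')
end

# Shortened stand-in for the real page fragment.
sample = <<~HTML
  <td><b>sidekick</b>
  <script type="text/javascript">
  var x = "<i>not a label</i>";
  </script><noscript><a href="#"><img src="speaker.gif" /></a></noscript>
  <i>Slang</i>
  </td>
HTML

cleaned = strip_scripts(sample)
puts cleaned

# Step 2 (not run here) would hand `cleaned` to Hpricot, for example
# something along the lines of Hpricot(cleaned).at('i').inner_text
```

Removing the scripts first matters because markup inside JavaScript strings can otherwise be mistaken for real elements.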

Here is the title and author of that post/reply:

   Re: HTML parser Hpricot? and how to get all text
    Posted by SpringFlowers AutumnMoon (winterheat) on 03.11.2007 09:10

Li