A simple Hpricot text setter

If anyone is trying to use Hpricot to clean up the actual content of
a site while leaving the markup alone, they might find the following
tiny method useful:

    class Hpricot::Text
      # Adds a simple Hpricot method to change
      # the text embedded in an HTML document.
      #
      # Example of use:
      #
      #   body.traverse_text do |text|
      #     text_out = text.to_s
      #     # manipulate text_out
      #     text.set(text_out)
      #   end
      def set(string)
        @content = string
        self.raw_string = string
      end
    end

The trick is to set both @content in Hpricot::Text and @raw_string in
its parent.
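
One plausible reading of why both assignments matter, sketched with toy classes rather than Hpricot itself: if a node caches the source text it was parsed from and serializes from that cache, updating only the content leaves the output stale. ToyLeaf, ToyText, output, and set_content_only are hypothetical names for illustration, and the idea that output comes from the cached raw string is an assumption here, not something confirmed from the Hpricot source.

```ruby
# Toy stand-ins, NOT Hpricot's real classes.
class ToyLeaf
  # Cached copy of the text as it appeared in the source document.
  attr_accessor :raw_string

  # Assumption: serialization uses the cached source text.
  def output
    raw_string
  end
end

class ToyText < ToyLeaf
  def initialize(string)
    @content = string
    self.raw_string = string
  end

  def to_s
    @content
  end

  # Updates the content but forgets the cache in the parent class.
  def set_content_only(string)
    @content = string
  end

  # Updates both, mirroring the shape of the set method above.
  def set(string)
    @content = string
    self.raw_string = string
  end
end

stale = ToyText.new("old")
stale.set_content_only("new")
puts stale.to_s    # the content has changed
puts stale.output  # but the serialized form has not

fresh = ToyText.new("old")
fresh.set("new")
puts fresh.output  # content and serialized form agree
```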

The folly of mistaking a paradox for a discovery, a metaphor for a
proof, a torrent of verbiage for a spring of capital truths, and
oneself for an oracle, is inborn in us.
-Paul Valery, poet and philosopher (1871-1945)

On Fri, Aug 11, 2006 at 03:19:13AM +0900, Chris G. wrote:

      #     text_out = text.to_s
      #     # manipulate text_out
      #     text.set(text_out)
      #   end
      def set(string)
        @content = string
        self.raw_string = string
      end
    end

You can also use Elements#inner_html= and Element#inner_html= for this.

(body/:a).inner_html = "New Link Text"

Also: set, html, remove, append, prepend, before, after, and wrap, which
all work just like their jQuery cousins.[1]

Thank you for using Hpricot; it helps all the horses’ hearts when you do.

_why

[1] jQuery API Documentation

On Sat, Aug 12, 2006 at 11:23:14AM +0900, Chris G. wrote:

What may be related is that the file text.rb is at:
/usr/local/lib/ruby/gems/1.8/gems/hpricot-0.3/lib/hpricot/text.rb
but it is not actually being required anywhere in Hpricot. When I
tried to require it manually, I found that it was requiring files
that gem didn’t give me. This is all in Hpricot 0.3.

Okay, yeah, you’ll need the latest Hpricot (0.4.43):

gem install hpricot --source code.whytheluckystiff.net

Also, don’t forget to remove require_gem 'hpricot' and use, instead,
require 'hpricot'.

_why

On Aug 11, 2006, at 5:20 PM, why the lucky stiff wrote:

body.traverse_text do |text|

You can also use Elements#inner_html= and Element#inner_html= for
this.

(body/:a).inner_html = "New Link Text"

Also: set, html, remove, append, prepend, before, after, and wrap,
which all work just like their jQuery cousins.[1]

Thanks for responding, why, and thanks very much for Hpricot.

I’m a long way from completely understanding Hpricot but I did try to
use inner_html in what I thought was the correct way.

Here is a little sample program:

require 'rubygems'
require_gem 'hpricot'

doc = Hpricot(open('TestFile.html'))
body = doc.search('body')
body.each {|elmnt| elmnt.inner_html}
body.inner_html
(body/:a).inner_html = "New Link Text"
puts doc

The output is:
testHpricot.rb:6: undefined method `inner_html' for #<Hpricot::Elem:0x7546bc> (NoMethodError)
        from testHpricot.rb:6:in `each'
        from testHpricot.rb:6

If I comment out the body.each… line I get:

testHpricot.rb:7: undefined method `inner_html' for
#<Hpricot::Elements:0x753d48> (NoMethodError)

If I comment out that line, I get:

testHpricot.rb:8: undefined method `inner_html=' for []:Array
(NoMethodError)

What may be related is that the file text.rb is at:
/usr/local/lib/ruby/gems/1.8/gems/hpricot-0.3/lib/hpricot/text.rb
but it is not actually being required anywhere in Hpricot. When I
tried to require it manually, I found that it was requiring files
that gem didn’t give me. This is all in Hpricot 0.3.

Thanks again for both your time and Hpricot.

On Aug 14, 2006, at 9:29 AM, why the lucky stiff wrote:

Okay, yeah, you’ll need the latest Hpricot (0.4.43):

gem install hpricot --source code.whytheluckystiff.net

Also, don’t forget to remove require_gem 'hpricot' and use, instead,
require 'hpricot'.

_why

You seem to be making great progress with Hpricot, committing changes
every day.

Yep, ‘require_gem’ no longer works. Just using ‘require’ seems better.

I don’t know that I communicated my idea behind adding a set method
for Hpricot::Text. There are times when one wants to scan and
potentially change everything that’s not markup. The markup should
be left unchanged or modified only in trivial ways such as changing
the order of attribute declarations.
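
To make the idea concrete without depending on Hpricot, here is a minimal sketch that rewrites only the runs of text between tags and copies the tags through untouched. The method name map_text and the regular expression are assumptions for illustration only. A regex is not an HTML parser (a ">" inside an attribute value, a comment, or a script would fool it), which is the reason to prefer a real traversal over the parsed tree.

```ruby
# Hypothetical helper, not part of Hpricot: yield every run of text that
# sits between tags to the block, and copy the tags through unchanged.
def map_text(html)
  html.gsub(/(<[^>]*>)|([^<]+)/) do
    tag, text = $1, $2
    # A tag is passed through as-is; text is replaced by the block's result.
    tag || yield(text)
  end
end

html = '<p class="x">hello <a href="a.html">world</a></p>'
result = map_text(html) { |text| text.upcase }
puts result
```

Only "hello " and "world" reach the block here; the tag names and attribute values are never handed to it.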

Hpricot::Traverse#traverse_text is great for finding all the stuff
that’s not markup, the PCDATA, in an HTML file. I just added a
method to change that data.

You suggested using inner_html= but the only way I can see that
working is to parse the tree looking for those elements which only
have Hpricot::Text children and then using inner_html= on them. But
that would involve essentially recreating
Hpricot::Traverse#traverse_text to find such elements, although the
common code could mostly be factored out.
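
A rough sketch of that tree walk, using toy structs instead of Hpricot's own node classes so it runs on its own. TextNode, ElemNode, and text_only_elements are hypothetical names, and the tree is built by hand rather than parsed.

```ruby
# Toy node types standing in for a parsed document (not Hpricot classes).
TextNode = Struct.new(:content)
ElemNode = Struct.new(:name, :children)

# Collect every element whose children are all text nodes. Those are the
# elements whose inner HTML could be replaced without touching markup.
def text_only_elements(node, found = [])
  return found unless node.is_a?(ElemNode)
  kids = node.children
  if !kids.empty? && kids.all? { |kid| kid.is_a?(TextNode) }
    found << node
  else
    kids.each { |kid| text_only_elements(kid, found) }
  end
  found
end

tree = ElemNode.new("body", [
  ElemNode.new("p", [TextNode.new("one")]),
  ElemNode.new("div", [
    TextNode.new("two"),
    ElemNode.new("a", [TextNode.new("three")])
  ])
])

names = text_only_elements(tree).map(&:name)
puts names.inspect
```

The text "two" sits in a div with mixed children, so this walk never offers it for replacement. Visiting each text node directly does not have that gap.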

On Aug 16, 2006, at 12:25 PM, why the lucky stiff wrote:

Okay, I get it. I guess I need to get //div[contains(text(), '…')]
working.

Works for me!

Be assured, traverse_text will stick around.

Thanks why!

On Wed, Aug 16, 2006 at 01:04:46PM +0900, Chris G. wrote:

Hpricot::Traverse#traverse_text is great for finding all the stuff
that’s not markup, the PCDATA, in an HTML file. I just added a
method to change that data.

Okay, I get it. I guess I need to get //div[contains(text(), '…')]
working. Be assured, traverse_text will stick around.

_why