HTML to plain text


#1

Okay, I have played with Hpricot and I am a convert. Amazing stuff.

I am struggling to get up to speed and I can’t find what must be a basic
function. I’ve scraped the FAA site, and they store all their data
wrapped in td’s, wrapped in tr’s, wrapped in tables. Thank you,
Hpricot.

Now that I have "<b>Manufacturer</b>", isn’t there a simple call to get
rid of the last bit of HTML?

Thanks,
–Colin


#2

Hi Colin, consult the API docs for Hpricot’s inner_text:

require 'rubygems'
require 'hpricot'
require 'open-uri'
doc = open('http://www.google.com/ncr') { |io| Hpricot io }
doc.inner_text

Regards
Florian


#3

On Jun 24, 1:40 pm, “Colin S.” removed_email_address@domain.invalid wrote:

Thanks,
–Colin

It looks like you’re looking for the inner_text method.

HTH,
Chris


#4

On 6/24/07, Todd B. removed_email_address@domain.invalid wrote:

The following does:

require 'rubygems'
require 'hpricot'
html_string = '<b>Manufacturer</b>'
html_data = Hpricot html_string
html_element = html_data / "b"
puts html_element.inner_html

Another “jumped too soon” moment.

In the code above, I didn’t point out that html_element should be
plural, since / returns a collection of every matching element. It
still works, but the more accurate version would be:

require 'rubygems'
require 'hpricot'
html_string = '<b>Manufacturer</b>'
html_data = Hpricot html_string
# / returns all matching elements
html_elements = html_data / "b"
# at returns only the first match
first_b_element = html_data.at "b"
first_b_element_also = (html_data / "b").first
puts first_b_element.inner_html

Todd


#5

On 6/24/07, Florian Aßmann removed_email_address@domain.invalid wrote:

Hi Colin, consult the API docs for Hpricot’s inner_text:

require 'rubygems'
require 'hpricot'
require 'open-uri'
doc = open('http://www.google.com/ncr') { |io| Hpricot io }
doc.inner_text
^^^^^^^
This code (above) doesn’t work on my system.

The following does:

require 'rubygems'
require 'hpricot'
html_string = '<b>Manufacturer</b>'
html_data = Hpricot html_string
html_element = html_data / "b"
puts html_element.inner_html

Todd
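

For anyone reading along without Hpricot installed, the difference between keeping the inner markup and keeping only the text can be sketched with REXML, which ships with Ruby. This is not Hpricot's API, just an illustration of the two kinds of output discussed in the thread. The sample markup is an assumed stand-in for one table cell like those described in #1, and REXML expects well-formed input, so it is not a general HTML scraper.

```ruby
require 'rexml/document'

# Assumed sample markup: one cell shaped like the tables described in #1.
html_string = '<table><tr><td><b>Manufacturer</b></td></tr></table>'

doc = REXML::Document.new(html_string)
cell = REXML::XPath.first(doc, '//td')

# Markup kept: serialize the cell's children, tags included.
cell_markup = cell.children.map(&:to_s).join

# Markup dropped: join every text node found below the cell.
cell_text = REXML::XPath.match(cell, './/text()').map(&:value).join

puts cell_markup
puts cell_text
```

The first value still contains the b tags, while the second is the plain string Colin was asking for.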