Html to plain text

banry · June 24, 2007, 9:41pm

Okay, I have played with Hpricot and I am a convert. Amazing stuff.

I am struggling to get up to speed and I can’t find what must be a basic
function. I’ve scraped the FAA site and they store all their stuff
wrapped in td’s, wrapped in tr’s, wrapped in tables. Thank you
Hpricot.

Now that I have "<b>Manufacturer</b>", isn’t there a simple call to get
rid of the last bit of HTML?

Thanks,
–Colin

banry · June 24, 2007, 10:03pm

Hi Colin, consult api doc for Hpricot.inner_text:

require 'rubygems'
require 'hpricot'
require 'open-uri'
doc = open( 'Google' ) { |io| Hpricot io }
doc.inner_text

Regards
Florian

banry · June 24, 2007, 10:06pm

On Jun 24, 1:40 pm, "Colin S." wrote:

Thanks,
–Colin

It looks like you’re looking for the inner_text method.

HTH,
Chris

banry · June 24, 2007, 10:32pm

On 6/24/07, Todd B. wrote:

The following does:

require 'rubygems'
require 'hpricot'
html_string = '<b>Manufacturer</b>'
html_data = Hpricot html_string
html_element = html_data / "b"
puts html_element.inner_html

Another "jump too soon" moment.

In the above code, I didn’t point out that html_element should be
plural, since html_data / "b" returns every matching element rather
than a single one. It still works, but the grammatically correct
version would be:

require 'rubygems'
require 'hpricot'
html_string = '<b>Manufacturer</b>'
html_data = Hpricot html_string
# all matching <b> elements
html_elements = html_data / "b"
# only the first match
first_b_element = html_data.at "b"
# the same element, taken from the collection
first_b_element_also = (html_data / "b").first
puts first_b_element.inner_html

Todd

banry · June 24, 2007, 10:17pm

On 6/24/07, Florian Aßmann wrote:

Hi Colin, consult api doc for Hpricot.inner_text:

require 'rubygems'
require 'hpricot'
require 'open-uri'
doc = open( 'Google' ) { |io| Hpricot io }
doc.inner_text
^^^^^^^
This code (above) doesn’t work on my system.

The following does:

require 'rubygems'
require 'hpricot'
html_string = '<b>Manufacturer</b>'
html_data = Hpricot html_string
html_element = html_data / "b"
puts html_element.inner_html

Todd
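
For readers without Hpricot at hand, here is a minimal sketch of the same idea (keep the text, drop the tags) using REXML from Ruby's standard library instead of Hpricot. This is a swapped-in technique, not what the posters above used: plain_text is a hypothetical helper name, and REXML only accepts well-formed markup, so treat it as an illustration rather than a general HTML parser.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of Hpricot): returns the text content of a
# well-formed markup fragment with every tag removed.
def plain_text(fragment)
  # Wrap the fragment in a dummy root so several top-level nodes still parse.
  doc = REXML::Document.new("<root>#{fragment}</root>")
  # Collect all text nodes in document order and join their values.
  REXML::XPath.match(doc, '//text()').map(&:value).join
end

puts plain_text('<td><b>Manufacturer</b></td>')
```

Nested tags such as the td and b above are all dropped in one pass, which is roughly the behaviour the inner_text suggestion in this thread is after.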