Noob Question - String Manipulation

jcernelli · May 5, 2006, 7:27pm

Hey, I’m pretty new to Ruby and am trying to absorb the “ruby way” of
things as much as I can. What’s the best way to solve this problem:

I have a string that contains HTML formatting. All I want is the plain
text.

jcernelli · May 5, 2006, 7:48pm

On 5/5/06, Joe C. wrote:

Hey, I’m pretty new to Ruby and am trying to absorb the “ruby way” of
things as much as I can. What’s the best way to solve this problem:

I have a string that contains HTML formatting. All I want is the plain
text.

I don’t know of any libraries offhand that can do this (CGI only has
escape/unescape), but it’s fairly simple:

html_string.gsub(/<[^>]+>/, "")

Replacing that regex with something better, probably.

jcernelli · May 5, 2006, 7:57pm

I’ve been using

html_string.gsub(/<.*?>/,"")

for this. But it’s always seemed more a “perl way” than a “ruby way”.
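For simple markup the two patterns above behave the same, but they are not interchangeable. Here is a minimal sketch comparing them; the helper names `strip_tags_negated` and `strip_tags_lazy` and the sample strings are made up for illustration and are not from the thread.

```ruby
# Two regex-based tag strippers from the posts above, wrapped in
# hypothetical helper methods so they can be compared side by side.

# Negated character class: "<", one or more non-">" characters, ">".
def strip_tags_negated(html)
  html.gsub(/<[^>]+>/, "")
end

# Non-greedy dot: "<", as few characters as possible, ">".
# Without the /m flag, "." does not match a newline in Ruby.
def strip_tags_lazy(html)
  html.gsub(/<.*?>/, "")
end

simple    = "<p>Hello, <b>world</b>!</p>"
multiline = "<a\nhref=\"page.html\">link</a>"

puts strip_tags_negated(simple)          # Hello, world!
puts strip_tags_lazy(simple)             # Hello, world!
puts strip_tags_negated(multiline)       # link
p    strip_tags_lazy(multiline)          # opening tag survives the newline
```

The negated class matches across line breaks because `[^>]` accepts a newline, whereas the lazy version would need the `/m` flag to do the same.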

jcernelli · May 5, 2006, 8:15pm

Joseph Michaels wrote:

I have a string that contains HTML formatting. All I want is the plain
text.
html_string.gsub(/<[^>]+>/, "")
Replacing that regex with something better, probably.

gsubbing breaks down for more complex test cases, such as things
containing source code, or problematic attribute strings (e.g. ).

If you really want to be accurate, I suggest using an XML or HTML
parsing tool, such as Mechanize or RubyfulSoup.

Pistos
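To make the failure modes concrete, here is a small sketch. The sample inputs and the `QUOTE_AWARE_TAG` pattern are my own illustration, not something posted in the thread, and the quote-aware regex is only a patch for one case, not a substitute for a parser.

```ruby
# Plain tag-stripping regex from earlier in the thread.
NAIVE_TAG = /<[^>]+>/

# Hypothetical quote-aware variant: inside a tag, a quoted attribute
# value is consumed as a unit, so a ">" inside quotes does not end the tag.
QUOTE_AWARE_TAG = /<(?:"[^"]*"|'[^']*'|[^>"'])*>/

img  = '<img alt="a > b" src="x.png">Caption'
code = "if a < b and c > d"

p img.gsub(NAIVE_TAG, "")         # attribute debris is left behind
p img.gsub(QUOTE_AWARE_TAG, "")   # "Caption"
p code.gsub(NAIVE_TAG, "")        # "< b and c >" is eaten as if it were a tag
p code.gsub(QUOTE_AWARE_TAG, "")  # same problem
```

The second pair shows why patching the regex only goes so far: text with unescaped comparison operators still gets mangled, which is the kind of input where a real HTML parser earns its keep.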

jcernelli · May 5, 2006, 8:50pm

On Sat, 2006-05-06 at 02:28 +0900, Joe C. wrote:

Hey, I’m pretty new to Ruby and am trying to absorb the “ruby way” of
things as much as I can. What’s the best way to solve this problem:

I have a string that contains HTML formatting. All I want is the plain
text.

There’s more to this than meets the eye - it’s often best to hand off
the hard stuff to someone else:

# Shell out to lynx and return the page rendered as plain text.
def fetchtext(uri)
  `lynx --dump #{uri}`
end

puts fetchtext('www.google.com')

=>

[1]Personalised Home | [2]Sign in

Google

Web [3]Images [4]Groups [5]News [6]Froogle [7]more »

… [snipped] …
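One caveat with the `fetchtext` helper above: interpolating the URI into a backtick command hands it to the shell, so a URI from an untrusted source could run arbitrary commands. A possible variant, assuming `lynx` is installed and on the PATH, passes the arguments as a list instead. The names `lynx_command` and `fetch_text` are hypothetical.

```ruby
require "open3"

# Build the argument list for lynx. Passing separate arguments (rather
# than one interpolated string) keeps the URI away from the shell.
def lynx_command(uri)
  ["lynx", "--dump", uri]
end

# Returns the page rendered as plain text.
# Raises Errno::ENOENT if lynx is not installed, or a RuntimeError
# if lynx exits with a non-zero status.
def fetch_text(uri)
  output, status = Open3.capture2(*lynx_command(uri))
  raise "lynx exited with status #{status.exitstatus}" unless status.success?
  output
end
```

Usage would look like `puts fetch_text("http://www.google.com")`. This also works for HTML you already have in a string only if you write it to a file first, so it suits fetching pages better than cleaning up fragments.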