Noob Question - String Manipulation


#1

Hey, I’m pretty new to Ruby and am trying to absorb the “ruby way” of
things as much as I can. What’s the best way to solve this problem:

I have a string that contains HTML formatting. All I want is the plain
text.


#2

On 5/5/06, Joe C. removed_email_address@domain.invalid wrote:

Hey, I’m pretty new to Ruby and am trying to absorb the “ruby way” of
things as much as I can. What’s the best way to solve this problem:

I have a string that contains HTML formatting. All I want is the plain
text.

I don’t know of any libraries offhand that can do this (CGI only has
escape/unescape), but it’s fairly simple:

html_string.gsub(/<[^>]+>/, "")

Replacing that regex with something better, probably.
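
For what it's worth, here is a minimal sketch of how that one-liner might be combined with CGI's unescape. The helper name strip_tags is made up for illustration, and decoding entities afterwards is an assumption about what "plain text" should mean here, not something the regex does on its own.

```ruby
require "cgi"

# Hypothetical helper (name not from this thread): remove anything that
# looks like a tag, then decode entities such as &amp; and &lt;.
def strip_tags(html)
  text = html.gsub(/<[^>]+>/, "")  # drop "<" ... ">" runs
  CGI.unescapeHTML(text)           # turn &amp; back into &
end

puts strip_tags("<p>Fish &amp; <b>chips</b></p>")
# prints: Fish & chips
```

This only holds up for simple, well-behaved markup.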


#3

I’ve been using

html_string.gsub(/<.*?>/,"")

for this. But it’s always seemed more a “perl way” than a “ruby way”.
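
One difference between the two patterns that may be worth knowing: in Ruby, "." does not match a newline unless the regex has the /m flag, while [^>] does. A small sketch with a made-up input string:

```ruby
# Hypothetical input: a tag that is broken across two lines.
html = "<p>Hello <a\nhref=\"x.html\">world</a></p>"

negated = html.gsub(/<[^>]+>/, "")  # [^>] also matches "\n"
lazy    = html.gsub(/<.*?>/, "")    # "." stops at "\n", so the <a ...> tag survives
lazy_m  = html.gsub(/<.*?>/m, "")   # /m lets "." match "\n" as well

p negated  # "Hello world"
p lazy     # "Hello <a\nhref=\"x.html\">world"
p lazy_m   # "Hello world"
```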


#4

Joseph Michaels wrote:

I have a string that contains HTML formatting. All I want is the plain
text.
html_string.gsub(/<[^>]+>/, "")
Replacing that regex with something better, probably.

gsubbing breaks down for more complex test cases, such as things
containing source code, or problematic attribute strings (e.g. ).
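
To make the failure modes concrete, here is a small sketch. Both input strings are hypothetical, chosen only to show the kind of breakage being described.

```ruby
TAG = /<[^>]+>/

# An attribute value containing ">" ends the match too early,
# so the rest of the tag leaks into the output.
attr_case = '<img alt="a > b" src="x.png">Caption'
p attr_case.gsub(TAG, "")   # " b\" src=\"x.png\">Caption"

# Text that merely contains "<" and ">" (for example source code)
# is mistaken for a tag and partly deleted.
code_case = "if a < b && c > d"
p code_case.gsub(TAG, "")   # "if a  d"
```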

If you really want to be accurate, I suggest using an XML or HTML
parsing tool, such as Mechanize or RubyfulSoup.

Pistos


#5

On Sat, 2006-05-06 at 02:28 +0900, Joe C. wrote:

Hey, I’m pretty new to Ruby and am trying to absorb the “ruby way” of
things as much as I can. What’s the best way to solve this problem:

I have a string that contains HTML formatting. All I want is the plain
text.

There’s more to this than meets the eye - it’s often best to hand off
the hard stuff to someone else :)

def fetchtext(uri)
  # Backticks run the command in a shell and return its output as a string.
  `lynx --dump #{uri}`
end

puts fetchtext('www.google.com')

=>

[1]Personalised Home | [2]Sign in

Google

Web [3]Images [4]Groups [5]News [6]Froogle [7]more »

… [snipped] …
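
A possible variation on the same idea, in case the URI comes from somewhere untrusted: interpolating it into backticks passes it through a shell. The sketch below uses the array form of IO.popen, which runs the program directly. The method name fetch_text and the command keyword are assumptions added for illustration, not part of the post above.

```ruby
# Hypothetical variant of fetchtext. The `command` keyword is an assumption,
# added so the external program can be swapped out.
def fetch_text(uri, command: ["lynx", "--dump"])
  # The array form bypasses the shell, so characters like ";" in uri
  # are passed to the program as plain text.
  IO.popen([*command, uri], &:read)
rescue Errno::ENOENT
  nil  # the program is not installed
end

# Usage (requires lynx on the PATH):
# puts fetch_text("http://www.google.com")
```

It returns nil rather than raising when the program is missing, which may or may not be what you want.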