Forum: Ruby Noob Question - String Manipulation...

Joe C. (Guest)
on 2006-05-05 21:27
Hey, I'm pretty new to Ruby and am trying to absorb the "ruby way" of
things as much as I can.  What's the best way to solve this problem:

I have a string that contains HTML formatting.  All I want is the plain
text.
Joseph Michaels (Guest)
on 2006-05-05 21:48
(Received via mailing list)
On 5/5/06, Joe C. <removed_email_address@domain.invalid> wrote:
> Hey, I'm pretty new to Ruby and am trying to absorb the "ruby way" of
> things as much as I can.  What's the best way to solve this problem:
>
> I have a string that contains HTML formatting.  All I want is the plain
> text.
>

I don't know of any libraries offhand that can do this (CGI only has
escape/unescape), but it's fairly simple:

html_string.gsub(/<[^>]+>/, "")

You'll probably want to replace that regex with something more robust.
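
For illustration, here is a minimal sketch of that one-liner in use. The helper name strip_tags and the sample string are made up for this example and are not from the thread:

```ruby
# Hypothetical helper wrapping the gsub call from the post above.
def strip_tags(html)
  # Remove anything that looks like a tag: "<", one or more
  # characters that are not ">", then ">".
  html.gsub(/<[^>]+>/, "")
end

# Made-up sample input.
html_string = "<p>Hello, <b>world</b>!</p>"

puts strip_tags(html_string)
# prints: Hello, world!
```

Note that this only deletes tags. Entities such as &amp; are left in the text as they are.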
Francisco O. (Guest)
on 2006-05-05 21:57
(Received via mailing list)
I've been using

html_string.gsub(/<.*?>/,"")

for this, but it has always seemed more of a "Perl way" than a "Ruby way".
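
The two regexes in this thread are close but not identical, and a small sketch may make the difference clearer. By default "." does not match a newline in Ruby, so the lazy pattern misses a tag that wraps across lines unless the /m flag is added, while the negated character class [^>] matches newlines. The helper names and sample input below are invented for this example:

```ruby
# Hypothetical helpers wrapping the two regexes from the posts above.
def strip_lazy(html)
  html.gsub(/<.*?>/, "")
end

def strip_negated(html)
  html.gsub(/<[^>]+>/, "")
end

# Made-up input: a tag whose attribute wraps onto a second line.
multiline = "<a\nhref=\"x\">link</a>"

p strip_lazy(multiline)          # "<a\nhref=\"x\">link"  (opening tag survives)
p strip_negated(multiline)       # "link"
p multiline.gsub(/<.*?>/m, "")   # "link"  (/m lets "." match newlines)
```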
Pistos C. (Guest)
on 2006-05-05 22:15
Joseph Michaels wrote:
>> I have a string that contains HTML formatting.  All I want is the plain
>> text.
> html_string.gsub(/<[^>]+>/, "")
> Replacing that regex with something better, probably.

Using gsub breaks down on more complex input, such as HTML that contains
source code, or attribute values with a literal ">" in them (e.g. <sometag
someattr="some>str">).
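
A quick sketch of that failure, using the attribute example above wrapped around some made-up text (the helper name strip_tags is hypothetical):

```ruby
# Hypothetical helper using the regex quoted earlier in this post.
def strip_tags(html)
  html.gsub(/<[^>]+>/, "")
end

# The ">" inside the quoted attribute value ends the match too early.
tricky = '<sometag someattr="some>str">text</sometag>'

puts strip_tags(tricky)
# prints: str">text
```

The regex has no notion of quoting, so the tail of the attribute value leaks into the output.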

If you really want to be accurate, I suggest using an XML or HTML
parsing tool, such as Mechanize or RubyfulSoup.
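
To show the kind of state a parser keeps that a single regex does not, here is a toy quote-aware scanner. It is not Mechanize or RubyfulSoup and it is not a substitute for either: it ignores comments, script blocks, and entities, and it treats any bare "<" in the text as the start of a tag. The method name is made up for this sketch:

```ruby
# Toy illustration only: tracks whether we are inside a tag and,
# within a tag, whether we are inside a quoted attribute value.
def strip_tags_quote_aware(html)
  out = +""        # unfrozen empty string to collect the text
  in_tag = false
  quote = nil      # the quote character currently open, if any

  html.each_char do |ch|
    if in_tag
      if quote
        quote = nil if ch == quote      # closing quote
      elsif ch == '"' || ch == "'"
        quote = ch                      # opening quote
      elsif ch == ">"
        in_tag = false                  # a real end of tag
      end
    elsif ch == "<"
      in_tag = true
    else
      out << ch
    end
  end

  out
end

puts strip_tags_quote_aware('<sometag someattr="some>str">text</sometag>')
# prints: text
```

A real parser also has to cope with malformed markup, which is a good reason to use a library for anything beyond simple input.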

Pistos
Ross B. (Guest)
on 2006-05-05 22:50
(Received via mailing list)
On Sat, 2006-05-06 at 02:28 +0900, Joe C. wrote:
> Hey, I'm pretty new to Ruby and am trying to absorb the "ruby way" of
> things as much as I can.  What's the best way to solve this problem:
>
> I have a string that contains HTML formatting.  All I want is the plain
> text.
>

There's more to this than meets the eye - it's often best to hand off
the hard stuff to someone else :)

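def fetchtext(uri)
  # Pass the arguments as an array so the URI never goes through a shell.
  # Requires the lynx text browser to be installed and on the PATH.
  IO.popen(['lynx', '--dump', uri], &:read)
end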

puts fetchtext('www.google.com')
# =>
#                     [1]Personalised Home | [2]Sign in
#
#                                   Google
#
#    Web    [3]Images    [4]Groups    [5]News    [6]Froogle    [7]more »
#
# ... [snipped] ...