Forum: Ruby Noob Question - String Manipulation...

Joe Cairns (diregnome)
on 2006-05-05 19:27
Hey, I'm pretty new to Ruby and am trying to absorb the "ruby way" of
things as much as I can.  What's the best way to solve this problem:

I have a string that contains HTML formatting.  All I want is the plain
text.
Joseph Michaels (Guest)
on 2006-05-05 19:48
(Received via mailing list)
On 5/5/06, Joe Cairns <joe.cairns@gmail.com> wrote:
> Hey, I'm pretty new to Ruby and am trying to absorb the "ruby way" of
> things as much as I can.  What's the best way to solve this problem:
>
> I have a string that contains HTML formatting.  All I want is the plain
> text.
>

I don't know of any libraries offhand that can do this (CGI only has
escape/unescape), but it's fairly simple:

html_string.gsub(/<[^>]+>/, "")

You'd probably want to replace that regex with something more robust.
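To make that concrete, here is a minimal sketch that wraps the gsub in a method and then decodes entities with CGI.unescapeHTML, since stripping tags alone leaves things like &amp; in the text. The method name strip_tags_naive is made up for illustration, and this is still the naive regex approach, not a real HTML parser.

```ruby
require "cgi"

# Hypothetical helper: removes anything that looks like a tag, then
# decodes HTML entities. Naive on purpose; it does not understand HTML.
def strip_tags_naive(html_string)
  without_tags = html_string.gsub(/<[^>]+>/, "")
  CGI.unescapeHTML(without_tags)
end

puts strip_tags_naive("<p>Fish &amp; <b>chips</b></p>")
# prints: Fish & chips
```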
Francisco Ortiz (Guest)
on 2006-05-05 19:57
(Received via mailing list)
I've been using

html_string.gsub(/<.*?>/,"")

for this. But it's always seemed more of a "Perl way" than a "Ruby way".
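The two regexes posted so far are not quite interchangeable. A small sketch of one difference: in Ruby, "." does not match a newline unless the /m flag is given, while the negated class [^>] does, so a tag that spans two lines is handled differently. The sample string is invented for illustration.

```ruby
# A tag whose attributes wrap onto a second line (made-up sample input).
html = "<a\n  href=\"/home\">Home</a>"

negated_class  = html.gsub(/<[^>]+>/, "")  # [^>] matches the newline too
lazy_dot       = html.gsub(/<.*?>/, "")    # "." stops at the newline
lazy_dot_multi = html.gsub(/<.*?>/m, "")   # /m lets "." match newlines

p negated_class   # "Home"
p lazy_dot        # "<a\n  href=\"/home\">Home"
p lazy_dot_multi  # "Home"
```

So if you prefer the lazy form, adding /m brings it in line with the negated-class version for multi-line tags.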
Pistos Christou (pistos)
on 2006-05-05 20:15
Joseph Michaels wrote:
>> I have a string that contains HTML formatting.  All I want is the plain
>> text.
> html_string.gsub(/<[^>]+>/, "")
> Replacing that regex with something better, probably.

Using gsub breaks down on more complex input, such as HTML that contains
source code, or attribute values that themselves contain a ">" (e.g.
<sometag someattr="some>str">).
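A quick sketch of both failure modes, using the regex from earlier in the thread. The first input is the attribute example above; the second is an invented snippet with unescaped comparison operators standing in for "source code".

```ruby
TAG_REGEX = /<[^>]+>/

# The ">" inside the quoted attribute ends the match too early,
# so the tail of the attribute leaks into the output.
attr_case = '<sometag someattr="some>str">text</sometag>'
p attr_case.gsub(TAG_REGEX, "")   # "str\">text"

# Unescaped "<" and ">" in the text are mistaken for a tag,
# so real content is deleted.
code_case = '<p>if a < b && b > c</p>'
p code_case.gsub(TAG_REGEX, "")   # "if a  c"
```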

If you really want to be accurate, I suggest using an XML or HTML
parsing tool, such as Mechanize or RubyfulSoup.
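To illustrate what a parser keeps track of that a single regex does not, here is a rough hand-rolled sketch that remembers whether it is inside a tag and inside a quoted attribute value. This is not the API of Mechanize or RubyfulSoup; strip_tags_quote_aware is a hypothetical name, and the scanner still ignores comments, script blocks, entities and stray "<" characters, which is exactly why a real parsing library is the better choice.

```ruby
# Hypothetical helper: drops tags, treating ">" inside a quoted
# attribute value as part of the tag rather than its end.
def strip_tags_quote_aware(html)
  text   = +""
  in_tag = false
  quote  = nil # the quote character currently open, if any

  html.each_char do |ch|
    if in_tag
      if quote
        quote = nil if ch == quote       # closing quote
      elsif ch == '"' || ch == "'"
        quote = ch                       # opening quote
      elsif ch == ">"
        in_tag = false                   # real end of tag
      end
    elsif ch == "<"
      in_tag = true
    else
      text << ch
    end
  end

  text
end

p strip_tags_quote_aware('<sometag someattr="some>str">text</sometag>')
# "text"
```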

Pistos
Ross Bamford (Guest)
on 2006-05-05 20:50
(Received via mailing list)
On Sat, 2006-05-06 at 02:28 +0900, Joe Cairns wrote:
> Hey, I'm pretty new to Ruby and am trying to absorb the "ruby way" of
> things as much as I can.  What's the best way to solve this problem:
>
> I have a string that contains HTML formatting.  All I want is the plain
> text.
>

There's more to this than meets the eye - it's often best to hand off
the hard stuff to someone else :)

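require 'shellwords'

# Backticks go through the shell, so escape the URI before interpolating it.
def fetchtext(uri)
  `lynx --dump #{Shellwords.escape(uri)}`
end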

puts fetchtext('www.google.com')
# =>
#                     [1]Personalised Home | [2]Sign in
#
#                                   Google
#
#    Web    [3]Images    [4]Groups    [5]News    [6]Froogle    [7]more »
#
# ... [snipped] ...
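Since the original question was about a string already in memory rather than a URL, a variation on the same idea is to pipe the string to lynx on standard input. This is only a sketch: it assumes lynx is installed and that your build supports the -dump, -nolist and -stdin options, and html_to_text is a made-up name.

```ruby
require "open3"

# Assumed lynx options: -dump renders to plain text, -nolist omits the
# numbered link list, -stdin reads the document from standard input.
LYNX_COMMAND = %w[lynx -dump -nolist -stdin].freeze

# Hypothetical helper: hands the HTML to lynx and returns its text output.
# Passing the command as an array avoids the shell entirely.
def html_to_text(html)
  output, status = Open3.capture2(*LYNX_COMMAND, stdin_data: html)
  raise "lynx exited with status #{status.exitstatus}" unless status.success?
  output
end
```

Called as html_to_text(html_string), it raises Errno::ENOENT if lynx is not on the PATH.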