Forum: Ruby need script: convert html-text to text

keal (Guest)
on 2006-01-04 12:29
I have html-text. I have to convert this text to simple text without
html-tags.
Gene T. (Guest)
on 2006-01-04 12:45
(Received via mailing list)
keal wrote:
> i have html-text. i have to convert this text to simple text without
> html-tags.

path o'least resistance

lynx -dump www.myurl
or use links2, or w3m -dump www.myurl

or high-falutin solution
http://groups.google.com/group/comp.lang.ruby/brow...
Robert K. (Guest)
on 2006-01-04 12:54
(Received via mailing list)
keal wrote:
> i have html-text. i have to convert this text to simple text without
> html-tags.

This is a very low-cost variant. I guess the lynx approach is much more
effective and complete:

ruby -pe '$_.gsub!(%r{</?.*?>}, "")' index.html
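For HTML that is already held in a string, the same idea could be sketched in plain Ruby as below. This is only an approximation under stated assumptions, not a real HTML parser: strip_tags and ENTITIES are hypothetical names, only five named entities are decoded, and malformed markup is not handled. Unlike a -p one-liner, which sees one line at a time, it works on the whole document, so tags that span several lines are removed as well.

```ruby
# Minimal tag stripper: a sketch, not a full HTML parser.
# ENTITIES and strip_tags are hypothetical names used for illustration.
ENTITIES = {
  'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"', 'nbsp' => ' '
}.freeze

def strip_tags(html)
  text = html.dup
  # Drop comments, then script and style elements together with their contents.
  text.gsub!(/<!--.*?-->/m, '')
  text.gsub!(%r{<(script|style)\b[^>]*>.*?</\1\s*>}mi, '')
  # Replace every remaining tag with a space. [^>] also matches newlines,
  # so a tag may span several lines.
  text.gsub!(/<[^>]*>/, ' ')
  # Decode a small, deliberately incomplete set of entities in one pass.
  text.gsub!(/&(amp|lt|gt|quot|nbsp);/) { ENTITIES[$1] }
  # Collapse runs of whitespace left behind by the removed markup.
  text.gsub(/\s+/, ' ').strip
end

html = "<html><head><style>p { color: red }</style></head>\n" \
       "<body><p>Fish &amp; <b>chips</b></p></body></html>"
puts strip_tags(html) # prints: Fish & chips
```

Replacing each tag with a space keeps adjacent words from running together, at the cost of a stray space when a tag sits inside a word. Layout, links and tables are lost entirely, which is where a text browser does better.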

Kind regards

    robert
Ross B. (Guest)
on 2006-01-04 13:00
(Received via mailing list)
On Wed, 04 Jan 2006 10:30:03 -0000, keal wrote:

> i have html-text. i have to convert this text to simple text without
> html-tags.
>

It's tricky; there's more to it than you'd think. The best way is
probably to use Lynx, or another browser, to do it for you, e.g.:

	# Shell out to lynx and return its plain-text rendering of the page.
	def plain(url)
	  `lynx -dump "#{url}"`
	end

	text = plain('http://www.google.com/')
	puts text

Outputs:

                      [1]Personalised Home | [2]Sign in

   [3]A picture of the Braille letters spelling out "Google." Happy Birthday
                               Louis Braille!

     Web    [4]Images    [5]Groups    [6]News    [7]Froogle    [8]more »

> ... [snip] ...

Of course you'll need lynx for that to work, but you can use others too.
Try a Google search.
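
The backtick call above passes the URL through a shell, so a URL containing quotes, backticks or $(...) could end up running arbitrary commands. A hedged alternative, assuming lynx is on the PATH, is to use Open3.capture2 from the standard library and hand over the arguments separately. The name dump_page and the browser keyword are hypothetical, introduced only for illustration; -dump is the only flag used, as in the posts above.

```ruby
require 'open3'

# Returns the plain-text rendering of +url+, or nil when the browser is
# missing or exits with an error. +browser+ is a hypothetical parameter
# that defaults to lynx.
def dump_page(url, browser: 'lynx')
  # Passing the arguments separately bypasses the shell, so quotes or
  # backticks inside the URL are never interpreted.
  output, status = Open3.capture2(browser, '-dump', url)
  status.success? ? output : nil
rescue Errno::ENOENT
  nil # the browser executable was not found
end

# Example call (needs lynx and network access):
#   puts dump_page('http://www.google.com/')
```

Returning nil rather than raising is a design choice for this sketch; a caller who needs to tell "not installed" from "page failed to load" would want to keep the two cases apart.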

Cheers,