Need script: convert html-text to text


#1

I have HTML text. I need to convert it to plain text without the
HTML tags.


#2

keal wrote:

I have HTML text. I need to convert it to plain text without the
HTML tags.



path o’least resistance

lynx -dump www.myurl
or use links2, or w3m -dump www.myurl

or high-falutin solution
http://groups.google.com/group/comp.lang.ruby/browse_frm/thread/e0fb1207f1814c77/37cd5e35a1ffb8d7?q=strip+HTML+tags&rnum=7#37cd5e35a1ffb8d7


#3

keal wrote:

I have HTML text. I need to convert it to plain text without the
HTML tags.

This is a very low-cost variant. I guess the lynx approach is much more
effective and complete:

ruby -pe '$_.gsub!(%r{</?.*?>}, "")' index.html
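
For anyone who wants the same idea inside a script rather than as a one-liner, here is a minimal sketch. The name strip_tags and the small ENTITIES table are made up for illustration and are not from any library. It is still a naive regex approach, so malformed markup or a ">" inside an attribute value will trip it up.

```ruby
# Hypothetical helper (not from the thread or any library):
# naive regex-based tag stripping for simple, well-formed HTML.
ENTITIES = {
  '&amp;' => '&', '&lt;' => '<', '&gt;' => '>',
  '&quot;' => '"', '&#39;' => "'", '&nbsp;' => ' '
}.freeze

def strip_tags(html)
  # Drop <script> and <style> elements together with their contents.
  text = html.gsub(%r{<(script|style)\b.*?</\1>}mi, '')
  # Drop HTML comments.
  text = text.gsub(/<!--.*?-->/m, '')
  # Replace every remaining tag with a space so adjacent words do not fuse.
  text = text.gsub(%r{</?[^>]*>}m, ' ')
  # Decode a handful of common entities in a single pass.
  text = text.gsub(/&(?:amp|lt|gt|quot|#39|nbsp);/) { |e| ENTITIES[e] }
  # Collapse runs of whitespace and trim the ends.
  text.gsub(/\s+/, ' ').strip
end

puts strip_tags('<p>Hello <b>world</b> &amp; friends</p>')
```

Replacing tags with a space keeps words on either side of a block tag apart, at the cost of splitting a word that has inline markup in the middle of it.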

Kind regards

robert

#4

On Wed, 04 Jan 2006 10:30:03 -0000, keal wrote:

I have HTML text. I need to convert it to plain text without the
HTML tags.

It’s tricky, there’s more to it than you’d think. The best way is
probably to use Lynx, or another browser, to do it for you, e.g.:

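# Render a page as plain text by shelling out to lynx (must be installed).
def plain(url)
  # An argument list bypasses the shell, so odd characters in the URL are safe.
  IO.popen(['lynx', '-dump', url], &:read)
end

text = plain('http://www.google.com/')
puts text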

Outputs:

                  [1]Personalised Home | [2]Sign in

[3]A picture of the Braille letters spelling out “Google.” Happy Birthday Louis Braille!

 Web    [4]Images    [5]Groups    [6]News    [7]Froogle    [8]more »

… [snip] …

Of course you’ll need lynx for that to work, but you can use others too.
Try a Google search.
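
If you can't be sure which text browser is installed, a rough sketch like the one below tries each in turn. The names DUMPERS and plain_with are made up for illustration; the only commands assumed are the `lynx -dump` and `w3m -dump` invocations already mentioned in this thread.

```ruby
require 'open3'

# Hypothetical list of candidate commands, tried in order.
DUMPERS = [%w[lynx -dump], %w[w3m -dump]].freeze

# Returns the text produced by the first command that runs successfully,
# or nil if none of them is installed or all of them fail.
def plain_with(dumpers, url)
  dumpers.each do |cmd|
    begin
      # The argument-list form runs the program directly, without a shell.
      out, _err, status = Open3.capture3(*cmd, url)
      return out if status.success?
    rescue Errno::ENOENT
      next # this program is not installed; try the next one
    end
  end
  nil
end

text = plain_with(DUMPERS, 'http://www.google.com/')
puts text || 'no text browser found'
```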

Cheers,