Would a good HTML parser be Hpricot? I wonder if anyone knows an easy
way for it to get all text of an HTML file? (removing all formatting
tags).
SpringFlowers AutumnMoon wrote:
Would a good HTML parser be Hpricot?
It definitely is.
I wonder if anyone knows an easy
way for it to get all text of an HTML file? (removing all formatting
tags).
It looks like #inner_text removes all tags and what remains is the plain
text content. Note that it won't convert <br>'s and <p>'s to newlines -
it really just strips tags. If you want more sophisticated text results,
you should iterate over the elements, and implement your logic for
specific ones.
mortee
2007/10/29, SpringFlowers AutumnMoon:
Would a good HTML parser be Hpricot?
It is a good and fast HTML and XML parser.
I wonder if anyone knows an easy
way for it to get all text of an HTML file? (removing all formatting
tags).
Mortee's suggestion is a quick way to do it. If you need more information
about it, take a look at http://code.whytheluckystiff.net/hpricot or ask on
Hpricot's mailing list.
SpringFlowers AutumnMoon wrote:
Would a good HTML parser be Hpricot?
It’s extremely good; try it and see!
I wonder if anyone knows an easy
way for it to get all text of an HTML file? (removing all formatting
tags).
.each_element( './/text()' ){}.join() might do it.
by the way
require 'hpricot'
doc = Hpricot("hello world")
p doc.search("").inner_text
won’t work… i am not sure if it is the Win installer of Ruby… but it
is the most recent Win installer.
it says
scraper2.rb:6: undefined method `inner_text' for
#<Hpricot::Elements:0x348dbc4> (NoMethodError)
and doc.to_plain_text() won’t work either…
SpringFlowers AutumnMoon wrote:
won’t work… i am not sure if it is the Win installer of Ruby… but it
$ uname -s
CYGWIN_NT-5.1
$ gem list hpricot
*** LOCAL GEMS ***
hpricot (0.6, 0.5)
a swift, liberal HTML parser with a fantastic library
$ irb
irb(main):001:0> require 'hpricot'
=> true
irb(main):002:0> d = Hpricot("hello world")
=> #<Hpricot::Doc {elem "hello " {elem "world" } }>
irb(main):003:0> d.inner_text
=> "hello world"
C:>systeminfo
…
OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 2 Build 2600
…
C:>gem list hpricot
*** LOCAL GEMS ***
hpricot (0.6, 0.5, 0.4)
a swift, liberal HTML parser with a fantastic library
C:>irb
irb(main):001:0> require 'hpricot'
=> true
irb(main):002:0> d = Hpricot("hello world")
=> #<Hpricot::Doc {elem "hello " {elem "world" } }>
irb(main):003:0> d.inner_text
=> "hello world"
mortee
Phlip wrote:
SpringFlowers AutumnMoon wrote:
Would a good HTML parser be Hpricot?
It’s extremely good; try it and see!
I wonder if anyone knows an easy
way for it to get all text of an HTML file? (removing all formatting
tags).
.each_element( './/text()' ){}.join() might do it.
Does anyone know where to go from:
require 'hpricot'
doc = Hpricot("hello world")
and what I can do to get "hello world"?
in
http://code.whytheluckystiff.net/hpricot/wiki/HpricotChallenge#StripallHTMLtags
it says just use
str=doc.to_s
print str.gsub(/<\/?[^>]*>/, "")
but can’t the < > be nested in some HTML code? If it is nested then
the above won’t work, it seems.
On 10/30/07, mortee wrote:
irb(main):001:0> require 'hpricot'
=> true
irb(main):002:0> d = Hpricot("hello world")
=> #<Hpricot::Doc {elem "hello " {elem "world" } }>
irb(main):003:0> d.inner_text
=> "hello world"
mortee
yup, mine is
C:>gem list hpricot
*** LOCAL GEMS ***
hpricot (0.4)
a swift, liberal HTML parser with a fantastic library
and d.inner_text or d.text both won’t work.
SpringFlowers AutumnMoon wrote:
in
http://code.whytheluckystiff.net/hpricot/wiki/HpricotChallenge#StripallHTMLtags
it says just use
str=doc.to_s
print str.gsub(/<\/?[^>]*>/, "")
but can't the < > be nested in some HTML code? If it is nested then
the above won't work, it seems.
What do you mean by nested? I would consider your example as containing
nested tags:
hello world"
and the regex removes all the tags from that string. html can look like
this:
<h2
hel<b></h2llo<h1<b>>worl
What do you want to do with that string?
On 10/30/07, 7stud -- wrote:
<h2
hel<b></h2llo<h1<b>>worl
I just wonder whether there could be a case, with styles, single quotes,
double quotes and so on, where a < or > appears somewhere inside a
start tag. It is just hard to say.
Also, removing the tags won't remove the CSS styles or the JavaScript
either.
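To make that worry concrete, here is a minimal sketch in plain Ruby (no Hpricot involved) that runs the tag-stripping regex from this thread on two inputs: ordinary nested tags, and a start tag whose quoted attribute value contains a `>`. The helper name `strip_tags` and both sample strings are made up for illustration.

```ruby
# Hypothetical helper wrapping the tag-stripping regex discussed above.
def strip_tags(html)
  html.gsub(/<\/?[^>]*>/, "")
end

# Ordinary nested tags: every tag is matched and removed on its own.
nested = "<p>hello <b>world</b></p>"
puts strip_tags(nested)

# A ">" inside a quoted attribute value ends the match too early,
# so the rest of the start tag leaks into the output.
tricky = '<a title="a > b" href="#">link</a>'
puts strip_tags(tricky)
```

The first call prints the plain text, while the second leaves part of the start tag behind. So plain nesting is not a problem for the regex, but a `>` inside an attribute value is, and a real parser is the safer choice for input like that.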
kendear wrote:
irb(main):001:0> require 'hpricot'
yup, mine is
C:>gem list hpricot
*** LOCAL GEMS ***
hpricot (0.4)
a swift, liberal HTML parser with a fantastic library
and d.inner_text or d.text both won't work.
Does something prevent you from upgrading?
mortee
however, the CSS and Javascript lines are not
removed. So I think I can gsub the CSS and Javascript blocks with the
multiline regexp gsub.
I wonder though if there is a quick way, that will do what the lynx on
UNIX does… just print out a plain and readable text page.
I got it to work up to this point:
require 'open-uri'
require 'hpricot'
c = open('http://www.google.com').read
# drop <style> and <script> blocks, contents included
c.gsub!(/<style.*?<\/style.*?>/m, " ")
c.gsub!(/<script.*?<\/script.*?>/m, " ")
c.gsub!(/<(span|tr|td| ).*?>/, " ")
c.gsub!(/<(br|p|div|table).*?>/, "\n")
d = Hpricot(c).inner_text
# collapse spaces and tabs only, so the newlines added above survive
d.gsub!(/[ \t]+/, " ")
d.gsub!(/\n+/, "\n")
print d
But it is not so pretty, and it does not filter out the non-printable
characters either.
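As a rough illustration of the same idea, here is a self-contained sketch that uses only the Ruby standard library. It is not what lynx does and it is not an Hpricot feature. The helper name `html_to_text`, the list of tags treated as line breaks, and the sample page are all assumptions made for this example. It also filters non-printable characters, which the script above does not.

```ruby
require 'cgi'

# Hypothetical helper: rough, regex-only HTML-to-text conversion.
def html_to_text(html)
  text = html.dup
  # Drop whole <script> and <style> blocks, contents included.
  text.gsub!(/<(script|style)\b.*?<\/\1\s*>/mi, " ")
  # Turn line-breaking tags into newlines before stripping the rest.
  text.gsub!(/<\/?(br|p|div|tr|table|h[1-6]|li)\b[^>]*>/i, "\n")
  # Strip every remaining tag.
  text.gsub!(/<[^>]*>/, "")
  # Decode basic entities such as &amp; and &lt;.
  text = CGI.unescapeHTML(text)
  # Remove non-printable characters, keeping tabs and newlines.
  text.gsub!(/[^[:print:]\t\n]/, "")
  # Collapse spaces and tabs, trim around newlines, squeeze blank lines.
  text.gsub!(/[ \t]+/, " ")
  text.gsub!(/ *\n */, "\n")
  text.gsub!(/\n{2,}/, "\n")
  text.strip
end

page = <<~HTML
  <html><head><style>b { color: red }</style>
  <script>var x = 1 < 2;</script></head>
  <body><p>hello <b>world</b></p><div>Tom &amp; Jerry</div></body></html>
HTML

puts html_to_text(page)
```

For the sample page this prints "hello world" and "Tom & Jerry" on separate lines. Because it still relies on regexes, it has the usual weaknesses, for example a `>` inside an attribute value. Parsing the document first and walking its elements would be more robust.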
mortee wrote:
kendear wrote:
irb(main):001:0> require 'hpricot'
yup, mine is
C:>gem list hpricot
*** LOCAL GEMS ***
hpricot (0.4)
a swift, liberal HTML parser with a fantastic library
and d.inner_text or d.text both won't work.
Does something prevent you from upgrading?
I finally got the time to upgrade to Hpricot 0.6
so now, the following
require 'net/http'
require 'hpricot'
r = ""
Net::HTTP.start("www.google.com") do |http|
  r = http.get("/")
end
c = Hpricot(r.body)
p c.to_plain_text
will work, and so will
p c.inner_text
as the last line. however, the CSS and Javascript lines are not
removed. So I think I can gsub the CSS and Javascript blocks with the
multiline regexp gsub.
I wonder though if there is a quick way, that will do what the lynx on
UNIX does… just print out a plain and readable text page.