Print - and strip text between tags using Nokogiri

paul-au · December 16, 2012, 12:10am

I’m a Ruby newbie trying to write a program to process thousands of HTML
files, extracting pertinent text and inserting it into a MySQL database.
Ruby seems ideally suited to the task in general, and I’ve already used
Nokogiri to extract comment text. What I need to do next is to print -
and then ultimately delete or strip - the text between “pre” tags.

Picture some html like this:

My Title

My Heading

From:Me
Date: Wed Dec 05 2012 - 18:17:49 EST

text line 1
text line 2
text line 3

very important text
more important text
would you believe even more important text?

I basically need to do 2 things: 1) to print only the text between the 2
“pre” tags, and then 2) to print all of the non-tagged text between the
“body” comments - minus the text between the “pre” tags. I’ve been
messing with this for a couple of hours - unsuccessfully - but I’m still
convinced that this is the right tool for the job.

paul-au · December 16, 2012, 12:29am

On Sun, Dec 16, 2012 at 12:10 AM, Paul M. wrote:

My Title

I basically need to do 2 things: 1) to print only the text between the 2
“pre” tags, and then 2) to print all of the non-tagged text between the
“body” comments - minus the text between the “pre” tags. I’ve been
messing with this for a couple of hours - unsuccessfully - but I’m still
convinced that this is the right tool for the job.

If you need to do more HTML and XML manipulation, learning XPath is a
good investment! You can look here for a start:
http://www.w3schools.com/Xpath/default.asp

One way to achieve what you want:

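require 'nokogiri'

# Sample markup; the tags are reconstructed from the discussion and are an
# assumption.
text = <<HTML
<html>
<head><title>My Title</title></head>
<body>
<h1>My Heading</h1>
<p>
<strong>From:</strong>Me<br>
<strong>Date:</strong> Wed Dec 05 2012 - 18:17:49 EST
</p>
<!-- body="start" -->
<p>
text line 1<br>
text line 2<br>
text line 3
</p>
<pre>
very important text
more important text
would you believe even more important text?
</pre>
<!-- body="end" -->
</body>
</html>
HTML

dom = Nokogiri.HTML(text)

# Text inside the <pre> element
puts dom.xpath('/html/body//pre/text()').map(&:to_s)

puts '---'

# All body text that is not inside a <pre> element
puts dom.xpath('/html/body//text()[not(ancestor::pre)]').map(&:to_s)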

You can also process nodes individually if you replace “.map…” with
“.each” and a block which receives the node and does something with
it.

Kind regards

robert

paul-au · December 16, 2012, 3:23am

Thank you for the swift reply! I tried running the above against my
“test.html” snippet and ended up getting the following:

pablo@cochituate=> ./extract_text.rb ./test.html

./test.html

paul-au · December 16, 2012, 4:21am

On Sat, Dec 15, 2012 at 8:23 PM, Paul M. wrote:

Thank you for the swift reply! I tried running the above against my
“test.html” snippet and ended up getting the following:

pablo@cochituate=> ./extract_text.rb ./test.html

./test.html


You passed the file name string to Nokogiri, not the file’s contents.

paul-au · December 16, 2012, 11:37am

There should be a way to match the text of the first comment, but I
couldn’t get this to work:

comment()[text()='body="start"']

paul-au · December 16, 2012, 11:18am

Robert K. wrote in post #1089225:

puts dom.xpath('/html/body//pre/text()').map(&:to_s)

Calling map() is redundant because puts calls to_s on its arguments.
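
A quick way to convince yourself, sketched without Nokogiri: `FakeNode` and `captured` are hypothetical helpers, and a plain Array stands in for the node set (the assumption being that puts treats a node set the same way it treats an array).

```ruby
require 'stringio'

# Hypothetical stand-in for a parsed node: anything that responds to to_s.
class FakeNode
  def initialize(text)
    @text = text
  end

  def to_s
    @text
  end
end

nodes = [FakeNode.new('very important text'), FakeNode.new('more important text')]

# Capture what puts writes so the two variants can be compared.
def captured
  out = StringIO.new
  yield out
  out.string
end

with_map    = captured { |io| io.puts nodes.map(&:to_s) }
without_map = captured { |io| io.puts nodes }

puts with_map == without_map
```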

to print all of the non-tagged text between the
“body” comments

Your html doesn’t even test your requirements because there is no text
after the body="end" comment. And there is no non-tagged text:

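require 'nokogiri'

# Sample markup; the tags are reconstructed from the discussion and the
# output below, so treat the exact structure as an assumption.
html = <<HTML
<html>
<head><title>My Title</title></head>
<body>
<h1>My Heading</h1>
<p>
<strong>From:</strong>Me<br>
<strong>Date:</strong> Wed Dec 05 2012 - 18:17:49 EST
</p>
<!-- body="start" -->
<p>
text line 1
<br>
text line 2
<br>
text line 3
</p>
<p></p>
<pre>
very important text
more important text
would you believe even more important text?
</pre>
<p>
<!-- body="end" -->
text line 4
<br>
text line 5
</p>
</body>
</html>
HTML

doc = Nokogiri.HTML(html)

my_xpath = "/html/body/comment()[1]/following-sibling::*"

doc.xpath(my_xpath).each do |node|
  puts node.name
  puts node.text
  puts '*' * 20
end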

--output:--
p

text line 1

text line 2

text line 3

********************
p

********************
pre

very important text
more important text
would you believe even more important text?

********************
p

text line 4

text line 5
********************

doc = Nokogiri.HTML(html)

my_xpath = "/html/body/comment()[1]/following-sibling::*[not(self::pre)]"

catch :found_ending_text do
  doc.xpath(my_xpath).each do |node|
    node.children.each do |child|
      text = child.text
      throw :found_ending_text if text.include? %q{body="end"}
      next if text.empty?
      puts text.strip
    end
  end
end

--output:--
text line 1
text line 2
text line 3

paul-au · December 16, 2012, 11:52am

This ugly xpath will select the comment based on its text:

comment()[. =' body="start" ']

my_xpath = %Q{/html/body/comment()[. =' body="start" ']/following-sibling::*[not(self::pre)]}

paul-au · December 17, 2012, 12:57am

7stud - that’s perfect. Thank you so much!

paul-au · December 16, 2012, 5:11pm

I want to thank everyone for their quick replies and helpful
suggestions. I realized that I should probably be using the real - and
admittedly poorly-formed - HTML for this question and not the test HTML
I’ve tried to concoct for this example. The real HTML was generated by
the Hypermail program, basically converting an email from mbox form to
HTML. Here is one such file:

haiku_archive: watching the news

watching the news

From: Paul David Mena ([email protected])
Date: Fri Dec 14 2012 - 18:51:14 EST

watching the news
I feel guilty
for being alive

--
Paul David Mena
--------------------
[email protected]

My ultimate goal is to extract all of the text between the
<!-- body="start" --> and <!-- body="end" --> comments, but not what is
between the two “pre” tags. So far I’ve been able to extract all of that
text but not exclude the “pre” text, using the following code:

#!/usr/bin/env ruby

require "rubygems"
require "nokogiri"

class PlainTextExtractor < Nokogiri::XML::SAX::Document

  attr_reader :plaintext

  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @plaintext = ""
  end

  # This method is called whenever a comment occurs and
  # the comment's text is passed in as a string.
  def comment(string)
    case string.strip      # strip leading and trailing whitespace
    when /^body="start"/   # match starting comment
      @interesting = true
    when /^body="end"/     # match closing comment
      @interesting = false
    end
  end

  # This callback method is called with any string between
  # a tag.
  def characters(string)
    @plaintext << string if @interesting
  end
end

# write to the screen

pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]

puts pte.plaintext

# write to a file

begin
  file = File.open("snippet.txt", "w")
  file.write pte.plaintext
rescue IOError => e
  # some error occurred, e.g. the directory is not writable
ensure
  file.close unless file.nil?
end

# get the date written

fname = ARGV[0]
start_column = 3
end_column = 5

target_range = (start_column-1)...(end_column-1)

IO.foreach(fname) do |line|
  if line.match(/Date:<\/strong>/)
    pieces = line.split(" ")
    puts pieces[target_range].join("-")
  end
end

# remove blank lines from file

fh = File.open('snippet.txt')
while !fh.eof
  line = fh.readline.chomp
  # remove leading and trailing blanks
  line.strip!
  # skip empty lines
  next if line == ''
  # convert tab chars to blanks
  line.gsub!(/\t/, ' ')
  # substitute a single blank for a sequence of blanks
  line.squeeze!(' ')
  # add code to process line if needed
  puts line
end
fh.close
exit(0)

The output is as follows:

pablo@cochituate=> ./extract_haiku.rb
/export/www/html/haikupoet/archive/0925.html
watching the news
I feel guilty
for being alive

Paul David Mena

[email protected]

Basically I want to omit the signature (everything below the “--”,
inclusive), which is wrapped in the “pre” tags.

paul-au · December 17, 2012, 2:01am

Paul M. wrote in post #1089283:

# remove blank lines from file

fh = File.open('snippet.txt')
while !fh.eof
  line = fh.readline.chomp
  # remove leading and trailing blanks
  line.strip!
  # skip empty lines
  next if line == ''
  # convert tab chars to blanks
  line.gsub!(/\t/, ' ')
  # substitute a single blank for a sequence of blanks
  line.squeeze!(' ')
  # add code to process line if needed
  puts line
end
fh.close
exit(0)

I forgot to mention. There are several ways to read line by line from a
file, but your loop is particularly ugly. If you use my favorite:

IO.foreach(fname) do |line|
  line = line.chomp
  ...
  ...
end

…an added benefit is that the file is automatically closed when the
block exits.

paul-au · December 17, 2012, 7:12am

7stud – wrote in post #1089314:

I forgot to mention. There are several ways to read line by line from a
file,

Here are some others:

f = File.new(fname)

f.each do |line|
  line = line.chomp

end

f.close

====

File.open(fname) do |f|
  while line = f.gets  # gets() returns nil at eof
    ...
  end
end  # file is automatically closed here
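
Putting the block form together with the blank-line cleanup from the original script might look like the sketch below. The temporary file and its contents are invented stand-ins for snippet.txt, and `cleaned` is just an illustrative name.

```ruby
require 'tempfile'

# Hypothetical input standing in for snippet.txt.
file = Tempfile.new('snippet')
file.write("watching the news\n\n\tI  feel   guilty\n   \nfor being alive\n")
file.close

cleaned = []
File.foreach(file.path) do |line|
  line = line.strip                            # drop the newline and surrounding blanks
  next if line.empty?                          # skip blank lines
  cleaned << line.tr("\t", ' ').squeeze(' ')   # tabs to blanks, collapse runs of blanks
end

puts cleaned
```

Using the non-bang `strip` and `squeeze` sidesteps the fact that the bang versions return nil when nothing changed, which matters as soon as the calls are chained.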

paul-au · December 16, 2012, 10:17pm

My ultimate goal is to extract all of the text between the
<!-- body="start" --> and <!-- body="end" --> comments

Yet none of your example html is set up to test that requirement, because
there is no text after the body="end" comment.

require "nokogiri"

class PlainTextExtractor < Nokogiri::XML::SAX::Document
  attr_reader :plaintext

  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @pre = false
    @plaintext = ""
  end

  # Track whether the parser is currently inside a <pre> element
  def start_element(name, attrs = [])
    if name == "pre"
      @pre = true
    end
  end

  def end_element(name, attrs = [])
    if name == "pre"
      @pre = false
    end
  end

  # This method is called whenever a comment occurs and
  # the comment's text is passed in as a string.
  def comment(string)
    case string.strip      # strip leading and trailing whitespace
    when /^body="start"/   # match starting comment
      @interesting = true
    when /^body="end"/     # match closing comment
      @interesting = false
    end
  end

  # This callback method is called with any string between
  # a tag.
  def characters(string)
    if @interesting and not @pre
      @plaintext << string
    end
  end
end

pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]

p pte.plaintext

--output:--
"\n\nwatching the news\n\nI feel guilty\n\nfor being alive\n\n"
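
If the blank lines in that string are unwanted, one way to tidy it up without a temporary file is sketched here. The string is copied from the output above; `haiku` is an illustrative name.

```ruby
# The string printed above, copied from the output.
plaintext = "\n\nwatching the news\n\nI feel guilty\n\nfor being alive\n\n"

# Split into lines, trim each one, and drop the empty ones.
haiku = plaintext.lines.map(&:strip).reject(&:empty?)

puts haiku
```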

paul-au · December 17, 2012, 9:12am

On Sun, Dec 16, 2012 at 11:18 AM, 7stud wrote:

Robert K. wrote in post #1089225:

puts dom.xpath('/html/body//pre/text()').map(&:to_s)

Calling map() is redundant because puts calls to_s on its arguments.

Right, thanks for the reminder! That was an artifact of IRB testing.

to print all of the non-tagged text between the
“body” comments

Your html doesn’t even test your requirements because there is no text
after the body="end" comment. And there is no non-tagged text:

I overlooked the comment thing.

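#!/usr/bin/ruby

require 'nokogiri'

require 'irb'

# Sample markup; the tags are reconstructed from the discussion and are an
# assumption.
text = <<HTML
<html>
<head><title>My Title</title></head>
<body>
<h1>My Heading</h1>
<p>
<strong>From:</strong>Me<br>
<strong>Date:</strong> Wed Dec 05 2012 - 18:17:49 EST
</p>
<!-- body="start" -->
<p>
text line 1<br>
text line 2<br>
text line 3
</p>
<pre>
very important text
more important text
would you believe even more important text?
</pre>
<!-- body="end" -->
not to print
</body>
</html>
HTML

dom = Nokogiri.HTML(text)

puts dom.xpath('/html/body//pre/text()')

puts '---'

puts dom.xpath('//text()[contains(preceding::comment(),"start") and
contains(following::comment(),"end") and not(ancestor::pre)]')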

Kind regards

robert
