Print - and strip text between tags using Nokogiri

I’m a Ruby N. trying to write a program to process thousands of HTML
files, extracting pertinent text and inserting it into a MySQL database.
Ruby seems ideally suited to the task in general, and I’ve already used
Nokogiri to extract comment text. What I need to do next is to print -
and then ultimately delete or strip - the text between “pre” tags.

Picture some html like this:

My Title

My Heading

From:Me
Date: Wed Dec 05 2012 - 18:17:49 EST

text line 1
text line 2
text line 3

very important text
more important text
would you believe even more important text?

I basically need to do 2 things: 1) to print only the text between the 2
“pre” tags, and then 2) to print all of the non-tagged text between the
“body” comments - minus the text between the “pre” tags. I’ve been
messing with this for a couple of hours - unsuccessfully - but I’m still
convinced that this is the right tool for the job.

On Sun, Dec 16, 2012 at 12:10 AM, Paul M. [email protected]
wrote:

My Title

I basically need to do 2 things: 1) to print only the text between the 2
“pre” tags, and then 2) to print all of the non-tagged text between the
“body” comments - minus the text between the “pre” tags. I’ve been
messing with this for a couple of hours - unsuccessfully - but I’m still
convinced that this is the right tool for the job.

If you need to do more HTML and XML manipulation, learning XPath is a
good investment! You can look here for a start:
http://www.w3schools.com/Xpath/default.asp

One way to achieve what you want:

require 'nokogiri'

text = <<HTML

My Title

My Heading

From:Me
Date: Wed Dec 05 2012 - 18:17:49 EST

text line 1
text line 2
text line 3

very important text
more important text
would you believe even more important text?

HTML

dom = Nokogiri.HTML(text)

puts dom.xpath('/html/body//pre/text()').map(&:to_s)

puts '---'

puts dom.xpath('/html/body//text()[not(ancestor::pre)]').map(&:to_s)

You can also process nodes individually if you replace “.map…” with
“.each” and a block which receives the node and does something with
it.

Kind regards

robert

Thank you for the swift reply! I tried running the above against my
“test.html” snippet and ended up getting the following:

pablo@cochituate=> ./extract_text.rb ./test.html

./test.html

On Sat, Dec 15, 2012 at 8:23 PM, Paul M. [email protected] wrote:

Thank you for the swift reply! I tried running the above against my
“test.html” snippet and ended up getting the following:

pablo@cochituate=> ./extract_text.rb ./test.html

./test.html



You passed the file name string to Nokogiri, not the file's contents.

There should be a way to match the text of the first comment, but I
couldn’t get this to work:

comment()[text()='body="start"']

Robert K. wrote in post #1089225:

puts dom.xpath('/html/body//pre/text()').map(&:to_s)

Calling map() is redundant because puts calls to_s on its arguments.
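
A quick way to convince yourself, without Nokogiri at all: "FakeNode" below is an invented stand-in for a parsed node, used only to show that puts flattens a collection and calls to_s on each element.

```ruby
require 'stringio'

# Stand-in for a parsed node; this is not a Nokogiri class.
FakeNode = Struct.new(:content) do
  def to_s
    content
  end
end

nodes = [FakeNode.new('very important text'), FakeNode.new('more important text')]

with_map = StringIO.new
with_map.puts nodes.map(&:to_s)   # explicit conversion first

without_map = StringIO.new
without_map.puts nodes            # puts converts each element itself

puts with_map.string == without_map.string
```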

  1. to print all of the non-tagged text between the
    “body” comments

Your html doesn’t even test your requirements because there is no text
after the body="end" comment. And there is no non-tagged text:

require 'nokogiri'

html = <<HTML

My Title

My Heading

From:Me
Date: Wed Dec 05 2012 - 18:17:49 EST

text line 1
text line 2
text line 3

very important text
more important text
would you believe even more important text?

text line 4
text line 5
HTML

doc = Nokogiri.HTML(html)

my_xpath = "/html/body/comment()[1]/following-sibling::*"

doc.xpath(my_xpath).each do |node|
  puts node.name
  puts node.text
  puts '*' * 20
end

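--output:--
p

text line 1

text line 2

text line 3

********************
p

********************
pre

very important text
more important text
would you believe even more important text?

********************
p

********************
p

text line 4

text line 5

********************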
doc = Nokogiri.HTML(html)

my_xpath = "/html/body/comment()[1]/following-sibling::*[not(self::pre)]"

catch :found_ending_text do
  doc.xpath(my_xpath).each do |node|
    node.children.each do |child|
      text = child.text
      throw :found_ending_text if text.include? %q{body="end"}
      next if text.empty?
      puts text.strip
    end
  end
end

--output:--
text line 1
text line 2
text line 3

This ugly xpath will select the comment based on its text:

comment()[. = ' body="start" ']

my_xpath = %Q{/html/body/comment()[. = ' body="start" ']/following-sibling::*[not(self::pre)]}

7stud - that’s perfect. Thank you so much!

I want to thank everyone for their quick replies and helpful
suggestions. I realized that I should probably be using the real - and
admittedly poorly-formed - HTML for this question and not the test HTML
I’ve tried to concoct for this example. The real HTML was generated by
the Hypermail program, basically converting an email from mbox form to
HTML. Here is one such file:

haiku_archive: watching the news

watching the news

From: Paul David Mena ([email protected])
Date: Fri Dec 14 2012 - 18:51:14 EST


watching the news
I feel guilty
for being alive

--
Paul David Mena
--------------------
[email protected]

My ultimate goal is to extract all of the comment text between the
body="start" and body="end" comments, but not what is between the
two “pre” tags. So far I’ve been able to extract all of the comment
text but not exclude the “pre” text, using the following code:

#!/usr/bin/env ruby

require "rubygems"
require "nokogiri"

class PlainTextExtractor < Nokogiri::XML::SAX::Document

  attr_reader :plaintext

  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @plaintext = ""
  end

  # This method is called whenever a comment occurs and
  # the comments text is passed in as string.
  def comment(string)
    case string.strip # strip leading and trailing whitespaces
    when /^body="start"/ # match starting comment
      @interesting = true
    when /^body="end"/
      @interesting = false # match closing comment
    end
  end

  # This callback method is called with any string between
  # a tag.
  def characters(string)
    @plaintext << string if @interesting
  end
end

# write to the screen
pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]

puts pte.plaintext

# write to a file
begin
  file = File.open("snippet.txt", "w")
  file.write pte.plaintext
rescue IOError => e
  #some error occur, dir not writable etc.
ensure
  file.close unless file == nil
end

# get the date written
fname = ARGV[0]
start_column = 3
end_column = 5

target_range = (start_column-1)...(end_column-1)

IO.foreach(fname) do |line|
  if line.match(/Date:<\/strong>/)
    pieces = line.split(" ")
    puts pieces[target_range].join("-")
  end
end

# remove blank lines from file
fh = File.open('snippet.txt')
while( !fh.eof)
  line = fh.readline.chomp
  # remove leading and trailing blanks
  line.strip!
  # skip empty lines
  next if line == ''
  # convert tab chars to blanks
  line.gsub!(/\t/,' ')
  # substitute a single blank for a sequence of blanks
  line.squeeze!(' ')
  # add code to process line if needed
  puts line
end
fh.close
exit(0)

The output is as follows:

pablo@cochituate=> ./extract_haiku.rb /export/www/html/haikupoet/archive/0925.html
watching the news
I feel guilty
for being alive

Paul David Mena

[email protected]

Basically I want to omit the signature (everything below the “--”,
inclusive), which is wrapped in the “pre” tags.

Paul M. wrote in post #1089283:

# remove blank lines from file
fh = File.open('snippet.txt')
while( !fh.eof)
  line = fh.readline.chomp
  # remove leading and trailing blanks
  line.strip!
  # skip empty lines
  next if line == ''
  # convert tab chars to blanks
  line.gsub!(/\t/,' ')
  # substitute a single blank for a sequence of blanks
  line.squeeze!(' ')
  # add code to process line if needed
  puts line
end
fh.close
exit(0)

I forgot to mention. There are several ways to read line by line from a
file, but your loop is particularly ugly. If you use my favorite:

IO.foreach(fname) do |line|
  line = line.chomp

end

…an added benefit is that the file is automatically closed when the
block exits.
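
As a rough sketch of how the blank-line cleanup might look with IO.foreach: the temporary file and its contents are invented stand-ins for snippet.txt, and "cleaned" is just an illustrative name.

```ruby
require 'tempfile'

# Hypothetical input standing in for snippet.txt.
file = Tempfile.new('snippet')
file.write("watching the news\n\n\tI   feel guilty \n\nfor being alive\n")
file.close

cleaned = []
IO.foreach(file.path) do |line|
  line = line.strip    # drop the newline plus leading/trailing blanks
  next if line.empty?  # skip blank lines
  # turn tabs into blanks, then collapse runs of blanks
  cleaned << line.tr("\t", ' ').squeeze(' ')
end

puts cleaned
file.unlink
```

Using the non-bang strip, tr and squeeze avoids relying on the return values of strip! and friends, which are nil when nothing changed.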

7stud – wrote in post #1089314:

I forgot to mention. There are several ways to read line by line from a
file,

Here are some others:

f = File.new(fname)

f.each do |line|
  line = line.chomp

end

f.close

====

File.open(fname) do |f|
  while line = f.gets # gets() returns nil at eof

  end
end # file is automatically closed here

My ultimate goal is to extract all of the comment text between the
body="start" and body="end" comments

Yet none of your example HTML is set up to test that requirement, because
there is no text after the body="end" comment.

require "nokogiri"

class PlainTextExtractor < Nokogiri::XML::SAX::Document
  attr_reader :plaintext

  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @pre = false
    @plaintext = ""
  end

  def start_element(name, attrs = [])
    if name == "pre"
      @pre = true
    end
  end

  def end_element(name, attrs = [])
    if name == "pre"
      @pre = false
    end
  end

  # This method is called whenever a comment occurs and
  # the comments text is passed in as string.
  def comment(string)
    case string.strip # strip leading and trailing whitespaces
    when /^body="start"/ # match starting comment
      @interesting = true
    when /^body="end"/
      @interesting = false # match closing comment
    end
  end

  # This callback method is called with any string between
  # a tag.
  def characters(string)
    if @interesting and not @pre
      @plaintext << string
    end
  end
end

pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]

p pte.plaintext

--output:--
"\n\nwatching the news\n\nI feel guilty\n\nfor being alive\n\n"

On Sun, Dec 16, 2012 at 11:18 AM, 7stud – [email protected] wrote:

Robert K. wrote in post #1089225:

puts dom.xpath('/html/body//pre/text()').map(&:to_s)

Calling map() is redundant because puts calls to_s on its arguments.

Right, thanks for the reminder! That was an artifact of IRB testing. :-)

  1. to print all of the non-tagged text between the
    “body” comments

Your html doesn’t even test your requirements because there is no text
after the body="end" comment. And there is no non-tagged text:

I overlooked the comment thing.

#!/usr/bin/ruby

require 'nokogiri'

require 'irb'

text = <<HTML

My Title

My Heading

From:Me
Date: Wed Dec 05 2012 - 18:17:49 EST

text line 1
text line 2
text line 3

very important text
more important text
would you believe even more important text?

not to print
HTML

dom = Nokogiri.HTML(text)

puts dom.xpath('/html/body//pre/text()')

puts '---'

puts dom.xpath('//text()[contains(preceding::comment(),"start") and contains(following::comment(),"end") and not(ancestor::pre)]')

Kind regards

robert