I want to thank everyone for their quick replies and helpful
suggestions. I realized that I should probably be using the real - and
admittedly poorly-formed - HTML for this question and not the test HTML
I’ve tried to concoct for this example. The real HTML was generated by
the Hypermail program, basically converting an email from mbox form to
HTML. Here is one such file:
haiku_archive: watching the news
watching the news
From: Paul David Mena (
[email protected])
Date: Fri Dec 14 2012 - 18:51:14 EST
watching the news
I feel guilty
for being alive
--
Paul David Mena
--------------------
[email protected]
My ultimate goal is to extract all of the text between the comments
<!-- body="start" --> and <!-- body="end" --> but not what is between the
two “pre” tags. So far I’ve been able to extract all of that
text but not exclude the “pre” text, using the following code:
#!/usr/bin/env ruby
require "rubygems"
require "nokogiri"
class PlainTextExtractor < Nokogiri::XML::SAX::Document
attr_reader :plaintext
# Initialize the state-of-interest flag to false
def initialize
@interesting = false
@plaintext = ""
end
# This method is called whenever a comment occurs, and
# the comment's text is passed in as a string.
def comment(string)
case string.strip # strip leading and trailing whitespaces
when /^body="start"/ # match starting comment
@interesting = true
when /^body="end"/ # match closing comment
@interesting = false
end
end
# This callback method is called with any text that appears
# between tags.
def characters(string)
@plaintext << string if @interesting
end
end
# write to the screen
pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]
puts pte.plaintext
# write to a file
begin
file = File.open("snippet.txt", "w")
file.write pte.plaintext
rescue IOError => e
# some error occurred, e.g. the directory is not writable
ensure
file.close unless file.nil?
end
# get the date written
fname = ARGV[0]
start_column = 3
end_column = 5
target_range = (start_column-1)...(end_column-1)
IO.foreach(fname) do |line|
if line.match(/Date:<\/strong>/)
pieces = line.split(" ")
puts pieces[target_range].join("-")
end
end
# remove blank lines from file
fh = File.open('snippet.txt')
while( !fh.eof)
line = fh.readline.chomp
# remove leading and trailing blanks
line.strip!
# skip empty lines
next if line == ''
# convert tab chars to blanks
line.gsub!(/\t/, ' ')
# substitute a single blank for a sequence of blanks
line.squeeze!(' ')
# add code to process line if needed
puts line
end
fh.close
exit(0)
The output is as follows:
pablo@cochituate=> ./extract_haiku.rb
/export/www/html/haikupoet/archive/0925.html
watching the news
I feel guilty
for being alive
Paul David Mena
[email protected]
Basically I want to omit the signature (everything below the “--”,
inclusive), which is wrapped in the “pre” tags.
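
To make the goal concrete, here is a minimal sketch of a text-based workaround. It trims the already-extracted text instead of tracking the “pre” tags in the SAX handler. It assumes the signature always starts on a line containing only "--", which may not hold for every archived message. The strip_signature helper is hypothetical and uses only plain Ruby:

```ruby
# Hypothetical helper, not part of the script above.
# Assumption: the signature begins at the first line that consists
# only of "--" (trailing whitespace is ignored).
def strip_signature(text)
  # Keep lines up to, but not including, the signature delimiter.
  kept = text.lines.take_while { |line| line.rstrip != "--" }
  kept.join
end

sample = "watching the news\nI feel guilty\nfor being alive\n--\nPaul David Mena\n"
puts strip_signature(sample)
```

In the script above this could be applied to pte.plaintext before it is written to snippet.txt. It would not help for messages that have other “pre” content above the signature.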