Nokogiri help parsing HTML

I’m relatively new to Ruby (and therefore Nokogiri) and am trying to
parse some HTML that will ultimately be written to a MySQL database. In
the interim, I’m writing it to a text file for troubleshooting purposes.

Here’s the relevant piece of the HTML I’d like to parse:

From: Paul David Mena <pauldavidmena_at_gmail.com>
Date: Tue, 26 Mar 2013 18:13:21 -0400

Line 1
Line 2
Line 3

--
Paul David Mena
--------------------
pauldavidmena_at_gmail.com
Received on Tue Mar 26 2013 - 22:13:23 EDT

My goal is to strip out everything between the “address” and “pre” tags
and to output only:

Line 1

Line 2

Line 3

My code, however, is stripping out one or the other, depending upon
where I place the definition. Here is the code:

#!/usr/bin/env ruby

require "nokogiri"

class PlainTextExtractor < Nokogiri::XML::SAX::Document
  attr_reader :plaintext

  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @pre = false
    @address = false
    @plaintext = ""
  end

  def start_element(name, attrs = [])
    if name == "address"
      @address = true
    end
  end

  def end_element(name, attrs = [])
    if name == "address"
      @address = false
    end
  end

  def start_element(name, attrs = [])
    if name == "pre"
      @pre = true
    end
  end

  def end_element(name, attrs = [])
    if name == "pre"
      @pre = false
    end
  end

  # This method is called whenever a comment occurs and
  # the comments text is passed in as string.
  def comment(string)
    case string.strip # strip leading and trailing whitespaces
    when /^body="start"/ # match starting comment
      @interesting = true
    when /^body="end"/
      @interesting = false # match closing comment
    end
  end

  # This callback method is called with any string between
  # a tag.
  def characters(string)
    if @interesting and not @pre
      if @interesting and not @address
        @plaintext << string
      end
    end
  end
end

fname = ARGV[0]
start_column = 4
end_column = 6

target_range = (start_column-1)...(end_column-1)
IO.foreach(fname) do |line|
  if line.match(/Date<\/dfn>/)
    pieces = line.split(" ")
    @date_string = pieces[target_range].join("-")
    puts @date_string
  end
end

pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]

puts pte.plaintext

begin
  file = File.open("snippet.txt", "w")
  file.write(@date_string)
  file.write(pte.plaintext)
rescue IOError => e
  # some error occurred, dir not writable etc.
ensure
  file.close unless file == nil
end
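
A likely reason for the "one or the other" behaviour: Ruby keeps only the most recent definition of a method name, so the second start_element/end_element pair replaces the first. The sketch below illustrates this in plain Ruby under stated assumptions. The class names TwoDefinitions and SkipTagsHandler are hypothetical, the callbacks are invoked by hand rather than by Nokogiri's SAX parser, and the body="start"/body="end" comment handling is left out.

```ruby
# Defining the same method twice: only the last definition survives.
class TwoDefinitions
  attr_reader :seen

  def initialize
    @seen = []
  end

  def start_element(name, attrs = [])
    @seen << :address if name == "address"
  end

  # This definition silently replaces the one above.
  def start_element(name, attrs = [])
    @seen << :pre if name == "pre"
  end
end

# A single pair of callbacks that tracks both tags (hypothetical helper).
class SkipTagsHandler
  SKIPPED_TAGS = %w[address pre].freeze
  attr_reader :plaintext

  def initialize
    @inside = Hash.new(false)   # tag name => currently inside that tag?
    @plaintext = String.new
  end

  def start_element(name, attrs = [])
    @inside[name] = true if SKIPPED_TAGS.include?(name)
  end

  def end_element(name)
    @inside[name] = false if SKIPPED_TAGS.include?(name)
  end

  def characters(string)
    # Keep text only when we are not inside any skipped tag.
    @plaintext << string unless @inside.value?(true)
  end
end

broken = TwoDefinitions.new
broken.start_element("address")
broken.start_element("pre")
p broken.seen   # only the "pre" branch ever runs

handler = SkipTagsHandler.new
handler.start_element("address")
handler.characters("From: someone")
handler.end_element("address")
handler.characters("Line 1\n")
handler.start_element("pre")
handler.characters("-- signature")
handler.end_element("pre")
puts handler.plaintext
```

To use the same idea with Nokogiri, the handler would inherit from Nokogiri::XML::SAX::Document as in the script above.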

On Tue, Mar 26, 2013 at 11:40 PM, Paul M. wrote:

[…snip…]

My goal is to strip out everything between the “address” and “pre” tags
and to output only:

OK, so you want every node that is a sibling of both address and pre and
lies between those two. I have found this Stack Overflow answer:

which applied to your problem:

1.9.2p290 :001 > require 'nokogiri'
=> true
1.9.2p290 :002 > s = <<END
1.9.2p290 :003">
1.9.2p290 :004">


[…snip…]
1.9.2p290 :031 > doc = Nokogiri::HTML(s)
1.9.2p290 :039 > doc.xpath("//address/following-sibling::node()[count(.|//pre/preceding-sibling::node())=count(//pre/preceding-sibling::node())]")
=> [#<Nokogiri::XML::Text:0xdb0114 "\n">,
#<Nokogiri::XML::Element:0xdaff84 name="p"
children=[#<Nokogiri::XML::Text:0xdafc28 "\nLine 1\n">,
#<Nokogiri::XML::Element:0xdafa5c name="br">,
#<Nokogiri::XML::Text:0xdaf728 "\nLine 2\n">,
#<Nokogiri::XML::Element:0xdaf5ac name="br">,
#<Nokogiri::XML::Text:0xdaf228 "\nLine 3\n">,
#<Nokogiri::XML::Element:0xdaf0ac name="br">]>,
#<Nokogiri::XML::Element:0xdaea80 name="p">]

will return a node set that contains the required nodes.

Hope this helps,

Jesus.

On Wed, Mar 27, 2013 at 12:05 AM, Jesús Gabriel y Galán wrote:

[…snip…]

will return a node set that contains the required nodes.

Your version also outputs <p> tags, doesn’t it? A modified version of
yours:

irb(main):036:0> dom.xpath('//address/following-sibling::*//text()').each {|n| p n}
#<Nokogiri::XML::Text:0x...fc01de63e "\nLine 1\n">
#<Nokogiri::XML::Text:0x...fc01de4d6 "\nLine 2\n">
#<Nokogiri::XML::Text:0x...fc01de36e "\nLine 3\n">
#<Nokogiri::XML::Text:0x...fc01de206 "\n">
#<Nokogiri::XML::Text:0x...fc01ddf04 "\n">
=> 0

irb(main):037:0> dom.xpath('//address/following-sibling::p//text()').each {|n| p n}
#<Nokogiri::XML::Text:0x...fc01de63e "\nLine 1\n">
#<Nokogiri::XML::Text:0x...fc01de4d6 "\nLine 2\n">
#<Nokogiri::XML::Text:0x...fc01de36e "\nLine 3\n">
#<Nokogiri::XML::Text:0x...fc01de206 "\n">
=> 0

Here’s another approach: find everything under <div class="mail"> but
not under <address>:

irb(main):032:0> dom.xpath('//div[@class="mail"]//text()[not(ancestor::address)]').each {|n| p n}
#<Nokogiri::XML::Text:0x...fc01c07c4 "\n">
#<Nokogiri::XML::Text:0x...fc01de7b0 "\n">
#<Nokogiri::XML::Text:0x...fc01de63e "\nLine 1\n">
#<Nokogiri::XML::Text:0x...fc01de4d6 "\nLine 2\n">
#<Nokogiri::XML::Text:0x...fc01de36e "\nLine 3\n">
#<Nokogiri::XML::Text:0x...fc01de206 "\n">
#<Nokogiri::XML::Text:0x...fc01ddf04 "\n">
=> 0

TIMTOWTDI :-)

Kind regards

robert

Thanks to all for the help. The following seems to do most of what I
want:

dom.xpath('//address/following-sibling::p//text()').each {|n| p n}

My next task is to capture only the desired text, and to write it to a
file. Specifically:

Line 1\n
Line 2\n
Line 3\n

The above code writes the following to standard out:

#<Nokogiri::XML::Text:0xba8f9c "\nLine 1\n">
#<Nokogiri::XML::Text:0xba8eac "\nLine 2\n">
#<Nokogiri::XML::Text:0xba8dd0 "\nLine 3\n">

So close!

Isn’t that as simple as “p n.text”?
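
For reference, p prints the inspect form of its argument, while puts prints the text itself. A small sketch in plain Ruby, using an ordinary String as a stand-in for the value returned by a text node's text method:

```ruby
text = "\nLine 1\n"         # stand-in for n.text on a text node

inspected = text.inspect    # the string that `p text` prints
cleaned   = text.strip      # surrounding whitespace removed

p text                      # shows quotes and \n escapes
puts cleaned                # shows just the line
```

So p n.text still shows quotes and escaped newlines; puts n.text.strip gives the bare line.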

I should probably include the whole revised program for context. The
argument is the path to an HTML file that contains relevant text between
the body=start and body=end tags.

#!/usr/bin/env ruby

# some initializations
@interesting = false
@my_text = ""

# read the file between the two "body" tags and stash in "my_text"
fname = ARGV[0]

IO.foreach(fname) do |line|
  if line.match(/body="start"/)
    @interesting = true
  end

  # meanwhile let's grab the date string and process it
  start_column = 4
  end_column = 6

  target_range = (start_column-1)...(end_column-1)

  if line.match(/Date<\/dfn>/)
    pieces = line.split(" ")
    @date_string = pieces[target_range].join("-")
  end

  if line.match(/body="end"/)
    @interesting = false
  end

  if @interesting
    @my_text << line
  end
end

puts @my_text

require "nokogiri"

doc = Nokogiri::HTML(@my_text)
doc.xpath('//address/following-sibling::p//text()').each {|n| p n}

puts doc

begin
  file = File.open("snippet.txt", "w")
  file.write(@date_string)
  file.write(doc)
rescue IOError => e
  # some error occurred, dir not writable etc.
ensure
  file.close unless file == nil
end

How about this?

doc = Nokogiri::HTML(@my_text)

output = @date_string + $/
doc.xpath('//address/following-sibling::p//text()').each { |line| output << ( line.text.strip + $/ ) }

File.write("snippet.txt", output)

This actually worked perfectly:

require "nokogiri"

doc = Nokogiri::HTML(@my_text)
output = @date_string + $/
doc.xpath('//address/following-sibling::p//text()').each { |line| output << ( line.text.strip + $/ ) }

File.write("snippet.txt", output)
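
The output-building step can be tried without any HTML at all. In this sketch the date string and the raw lines are hard-coded assumptions, and the file goes to a temporary directory rather than snippet.txt. $/ is Ruby's input record separator, "\n" by default.

```ruby
require "tmpdir"

date_string = "26-Mar-2013"                          # assumed, already extracted
raw_lines   = ["\nLine 1\n", "\nLine 2\n", "\nLine 3\n"]

# Start with the date, then append each stripped line plus a newline.
output = date_string + $/
raw_lines.each { |line| output << line.strip << $/ }

written = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "snippet.txt")
  File.write(path, output)    # opens, writes and closes the file in one call
  written = File.read(path)
end

puts written
```

File.write takes care of closing the file, which is why the begin/ensure block from the earlier version is no longer needed.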

Just out of curiosity I had a go at writing this myself, with the
exception of that complicated xpath because I don’t really understand
xpath yet :-)

This is what I came up with:

require 'nokogiri'
doc = Nokogiri::HTML File.read(ARGV[0])
output = doc.css('span[@id="date"]').first.text[/\d+ \w+ \d+/].gsub(' ', '-') + $/
path = '//address/following-sibling::p//text()'
doc.xpath(path).each { |line| output << line.text.strip << $/ }
File.write("snippet.txt", output)
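
The date handling in the script above can be looked at in isolation. This sketch assumes the text of the date element resembles the Date line from the first post. String#[] with a regular expression returns the first match, or nil if nothing matches.

```ruby
# Assumed text of the date element, taken from the first post.
date_text = "Date: Tue, 26 Mar 2013 18:13:21 -0400"

# First run of "digits, word, digits" separated by single spaces.
date_part = date_text[/\d+ \w+ \d+/]

# Replace the spaces with hyphens.
date_string = date_part.gsub(" ", "-")

puts date_string
```

If the element's text had no such run, date_part would be nil and the gsub call would raise NoMethodError, so a real script may want to check for that.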

You still have @my_text and @date_string there, which leads me to
suspect that’s only the last part of your script.
The example I gave is the entire script…

Since you’re using Nokogiri anyway, you’d be better off using that for
the whole process rather than looping through the HTML “manually”. This
is the sort of thing it’s worth getting in the habit of: using the tools
available to their fullest potential.
You can do the whole thing in 5 lines (barring error-checking) as I
demonstrated earlier.

Joel P. wrote in post #1103626:

Since you’re using Nokogiri anyway, you’d be better off using that for
the whole process rather than looping through the HTML “manually”. This
is the sort of thing it’s worth getting in the habit of: using the tools
available to their fullest potential.
You can do the whole thing in 5 lines (barring error-checking) as I
demonstrated earlier.

I completely missed that in your earlier post! It definitely makes
sense to use Nokogiri to do exactly what it’s good at.

Thanks!

Joel P. wrote in post #1103448:

You still have @my_text and @date_string there, which leads me to
suspect that’s only the last part of your script.
The example I gave is the entire script…

You’re right. Here’s the whole thing:

#!/usr/bin/env ruby

# some initializations
@interesting = false
@my_text = ""

# read the file between the two "body" tags
fname = ARGV[0]

IO.foreach(fname) do |line|
  if line.match(/body="start"/)
    @interesting = true
  end

  # meanwhile let's grab the date string and process it
  start_column = 4
  end_column = 6

  target_range = (start_column-1)...(end_column-1)

  if line.match(/Date<\/dfn>/)
    pieces = line.split(" ")
    @date_string = pieces[target_range].join("-")
  end

  if line.match(/body="end"/)
    @interesting = false
  end

  if @interesting
    @my_text << line
  end
end

require "nokogiri"

doc = Nokogiri::HTML(@my_text)
output = @date_string + $/
doc.xpath('//address/following-sibling::p//text()').each { |line| output << ( line.text.strip + $/ ) }

File.write("snippet.txt", output)

That makes quite a difference! Here’s what I have after ripping out the
old logic and extracting the date using Nokogiri:

#!/usr/bin/env ruby

require "nokogiri"

# get the date
doc = Nokogiri::HTML File.read(ARGV[0])
output = doc.css('span[@id="date"]').first.text[/\d+ \w+ \d+/].gsub(' ', '-') + $/

# get the remaining text
path = '//address/following-sibling::p//text()'
doc.xpath(path).each { |line| output << line.text.strip << $/ }

File.write("snippet.txt", output)

Looks good. Ruby’s pretty amazing when it comes to finding simple ways
to do complex things.
Are there any parts of that code you need clarifying? It’ll help to
understand all the methods used here so you can write your own more
easily in future.
For example, you can test regular expressions here:

I do have a follow-up question, if you don’t mind. I can see how the
“address” tag is stripped, but not the “pre” tag. Amazing how much
heavy lifting is accomplished with a simple (or, at least to me, not so
simple) line of code.

It really helped to run the code in IRB to see what Nokogiri was doing.
It will take me a little bit longer to wrap my mind around how Ruby does
regular expressions, but it certainly seems worth the effort.

Thanks for the link, and for all of the help!

I’m not sure what you mean by stripping the tags. Firstly, the xpath is
looking for <p> after <address>, which doesn’t include <pre> in your
example. Secondly, if you ask Nokogiri for the “text”, it won’t include
any HTML tags.