Nokogiri help parsing HTML

paul-au · March 26, 2013, 11:39pm

I’m relatively new to Ruby (and therefore Nokogiri) and am trying to
parse some HTML that will ultimately be written to a MySQL database. In
the interim, I’m writing it to a text file for troubleshooting purposes.

Here’s the relevant piece of the HTML I’d like to parse:

From: Paul David Mena <pauldavidmena_at_gmail.com>
Date: Tue, 26 Mar 2013 18:13:21 -0400

Line 1
Line 2
Line 3

--
Paul David Mena
--------------------
pauldavidmena_at_gmail.com

Received on Tue Mar 26 2013 - 22:13:23 EDT

My goal is to strip out everything between the “address” and “pre” tags
and to output only:

Line 1

Line 2

Line 3

My code, however, is stripping out one or the other, depending upon
where I place the definition. Here is the code:

#!/usr/bin/env ruby

require "nokogiri"

class PlainTextExtractor < Nokogiri::XML::SAX::Document
  attr_reader :plaintext

  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @pre = false
    @address = false
    @plaintext = ""
  end

  def start_element(name, attrs = [])
    if name == "address"
      @address = true
    end
  end

  def end_element(name, attrs = [])
    if name == "address"
      @address = false
    end
  end

  def start_element(name, attrs = [])
    if name == "pre"
      @pre = true
    end
  end

  def end_element(name, attrs = [])
    if name == "pre"
      @pre = false
    end
  end

  # This method is called whenever a comment occurs and
  # the comments text is passed in as string.
  def comment(string)
    case string.strip # strip leading and trailing whitespaces
    when /^body="start"/ # match starting comment
      @interesting = true
    when /^body="end"/
      @interesting = false # match closing comment
    end
  end

  # This callback method is called with any string between
  # a tag.

fname = ARGV[0]
start_column = 4
end_column = 6

target_range = (start_column-1)..(end_column-1)
IO.foreach(fname) do |line|
  if line.match(/Date<\/dfn>/)
    pieces = line.split(" ")
    @date_string = pieces[target_range].join("-")
    # puts @date_string
  end
end

pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]

# puts pte.plaintext

begin
  file = File.open("snippet.txt", "w")
  file.write(@date_string)
  file.write(pte.plaintext)
rescue IOError => e
  # some error occurred, dir not writable etc.
ensure
  file.close unless file == nil
end
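
A possible explanation for the "one or the other" behaviour described above: when a Ruby class defines a method twice, the later definition replaces the earlier one, so only the last start_element and end_element would take effect. The sketch below uses hypothetical Tracker and CombinedTracker classes (they are not part of the script) to show the effect, and one way of folding both checks into a single definition.

```ruby
# Hypothetical classes for illustration; only the redefinition behaviour matters.
class Tracker
  attr_reader :seen

  def initialize
    @seen = []
  end

  def start_element(name)
    @seen << "address" if name == "address"
  end

  # This second definition silently replaces the one above.
  def start_element(name)
    @seen << "pre" if name == "pre"
  end
end

tracker = Tracker.new
tracker.start_element("address")
tracker.start_element("pre")
p tracker.seen # only the "pre" check ran

# Folding both checks into one definition keeps both.
class CombinedTracker
  attr_reader :seen

  def initialize
    @seen = []
  end

  def start_element(name)
    case name
    when "address" then @seen << "address"
    when "pre"     then @seen << "pre"
    end
  end
end

combined = CombinedTracker.new
combined.start_element("address")
combined.start_element("pre")
p combined.seen
```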

paul-au · March 27, 2013, 12:06am

On Tue, Mar 26, 2013 at 11:40 PM, Paul M. [email protected]
wrote:

From: Paul David Mena [...]
Line 3

My goal is to strip out everything between the “address” and “pre” tags
and to output only:

OK, so you want every node that is a sibling of both address and pre and
sits between the two. I found this Stack Overflow answer:

which, applied to your problem, gives:

1.9.2p290 :001 > require 'nokogiri'
=> true
1.9.2p290 :002 > s = <<END
1.9.2p290 :003">
1.9.2p290 :004">

[…snip…]
1.9.2p290 :031 > doc = Nokogiri::HTML(s)
1.9.2p290 :039 > doc.xpath("//address/following-sibling::node()[count(.|//pre/preceding-sibling::node())=count(//pre/preceding-sibling::node())]")
=> [#<Nokogiri::XML::Text:0xdb0114 "\n">,
#<Nokogiri::XML::Element:0xdaff84 name="p"
children=[#<Nokogiri::XML::Text:0xdafc28 "\nLine 1\n">,
#<Nokogiri::XML::Element:0xdafa5c name="br">,
#<Nokogiri::XML::Text:0xdaf728 "\nLine 2\n">,
#<Nokogiri::XML::Element:0xdaf5ac name="br">,
#<Nokogiri::XML::Text:0xdaf228 "\nLine 3\n">,
#<Nokogiri::XML::Element:0xdaf0ac name="br">]>,
#<Nokogiri::XML::Element:0xdaea80 name="p">]

will return a node set that contains the required nodes.

Hope this helps,

Jesus.

paul-au · March 27, 2013, 9:25am

On Wed, Mar 27, 2013 at 12:05 AM, Jesús Gabriel y Galán
[email protected] wrote:

On Tue, Mar 26, 2013 at 11:40 PM, Paul M. [email protected] wrote:

1.9.2p290 :001 > require 'nokogiri'
#<Nokogiri::XML::Element:0xdaff84 name="p"
children=[#<Nokogiri::XML::Text:0xdafc28 "\nLine 1\n">,
#<Nokogiri::XML::Element:0xdafa5c name="br">,
#<Nokogiri::XML::Text:0xdaf728 "\nLine 2\n">,
#<Nokogiri::XML::Element:0xdaf5ac name="br">,
#<Nokogiri::XML::Text:0xdaf228 "\nLine 3\n">,
#<Nokogiri::XML::Element:0xdaf0ac name="br">]>,
#<Nokogiri::XML::Element:0xdaea80 name="p">]

will return a node set that contains the required nodes.

Your version also outputs

tags, doesn’t it? A modified version of
yours

irb(main):036:0> dom.xpath('//address/following-sibling::*//text()').each {|n| p n}
#<Nokogiri::XML::Text:0x..fc01de63e "\nLine 1\n">
#<Nokogiri::XML::Text:0x..fc01de4d6 "\nLine 2\n">
#<Nokogiri::XML::Text:0x..fc01de36e "\nLine 3\n">
#<Nokogiri::XML::Text:0x..fc01de206 "\n">
#<Nokogiri::XML::Text:0x..fc01ddf04 "\n">
=> 0

irb(main):037:0> dom.xpath('//address/following-sibling::p//text()').each {|n| p n}
#<Nokogiri::XML::Text:0x..fc01de63e "\nLine 1\n">
#<Nokogiri::XML::Text:0x..fc01de4d6 "\nLine 2\n">
#<Nokogiri::XML::Text:0x..fc01de36e "\nLine 3\n">
#<Nokogiri::XML::Text:0x..fc01de206 "\n">
=> 0

Here’s another approach: find everything under <div class="mail"> but
not under <address>:

irb(main):032:0> dom.xpath('//div[@class="mail"]//text()[not(ancestor::address)]').each {|n| p n}
#<Nokogiri::XML::Text:0x..fc01c07c4 "\n">
#<Nokogiri::XML::Text:0x..fc01de7b0 "\n">
#<Nokogiri::XML::Text:0x..fc01de63e "\nLine 1\n">
#<Nokogiri::XML::Text:0x..fc01de4d6 "\nLine 2\n">
#<Nokogiri::XML::Text:0x..fc01de36e "\nLine 3\n">
#<Nokogiri::XML::Text:0x..fc01de206 "\n">
#<Nokogiri::XML::Text:0x..fc01ddf04 "\n">
=> 0

TIMTOWTDI

Kind regards

robert

paul-au · March 27, 2013, 3:57pm

Thanks to all for the help. The following seems to do most of what I
want:

dom.xpath('//address/following-sibling::p//text()').each {|n| p n}

My next task is to capture only the desired text, and to write it to a
file. Specifically:

Line 1\n
Line 2\n
Line 3\n

The above code writes the following to standard out:

#<Nokogiri::XML::Text:0xba8f9c "\nLine 1\n">
#<Nokogiri::XML::Text:0xba8eac "\nLine 2\n">
#<Nokogiri::XML::Text:0xba8dd0 "\nLine 3\n">

So close!

paul-au · March 27, 2013, 4:18pm

Isn’t that as simple as “p n.text”?
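
One caveat worth sketching: n.text returns an ordinary Ruby String, and p prints a String in its inspect form, quotes and escaped newlines included, whereas puts prints the characters themselves. The example below uses a literal string as a stand-in for what n.text would return for the first text node; no Nokogiri is involved.

```ruby
# Stand-in for what n.text would return for the first text node.
text = "\nLine 1\n"

p text            # inspect form: surrounding quotes and escaped newlines
puts text.strip   # just the characters, with surrounding whitespace removed

inspected = text.inspect
stripped  = text.strip
```

So for plain output, something along the lines of puts n.text.strip is probably closer to what is wanted than p n.text.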

paul-au · March 27, 2013, 5:44pm

I should probably include the whole revised program for context. The
argument is the path to an HTML file that contains relevant text between
the body=start and body=end tags.

#!/usr/bin/env ruby

# some initializations
@interesting = false
@my_text = ""

# read the file between the two "body" tags and stash in "my_text"
fname = ARGV[0]

IO.foreach(fname) do |line|
  if line.match(/body="start"/)
    @interesting = true
  end

  # meanwhile let's grab the date string and process it
  start_column = 4
  end_column = 6

  target_range = (start_column-1)..(end_column-1)

  if line.match(/Date<\/dfn>/)
    pieces = line.split(" ")
    @date_string = pieces[target_range].join("-")
  end

  if line.match(/body="end"/)
    @interesting = false
  end

  if @interesting
    @my_text << line
  end
end

# puts @haiku_text

require "nokogiri"

doc = Nokogiri::HTML(@my_text)
doc.xpath('//address/following-sibling::p//text()').each {|n| p n}

# puts doc

begin
  file = File.open("snippet.txt", "w")
  file.write(@date_string)
  file.write(doc)
rescue IOError => e
  # some error occurred, dir not writable etc.
ensure
  file.close unless file == nil
end
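
For anyone following the date handling in that script: it splits the line on spaces and takes "columns" 4 to 6, so whether the range includes its end point matters. A small sketch, where the shape of the date line is an assumption (the real markup is not shown in the thread):

```ruby
# Assumed shape of the date line; the real markup is not shown in the thread.
line = '<span id="date"><dfn>Date</dfn>: Tue, 26 Mar 2013 18:13:21 -0400</span>'

start_column = 4
end_column   = 6
pieces = line.split(" ")

# Two dots include the end index; three dots stop one short of it.
inclusive = pieces[(start_column - 1)..(end_column - 1)].join("-")
exclusive = pieces[(start_column - 1)...(end_column - 1)].join("-")

puts inclusive
puts exclusive
```

With this assumed input the two-dot range yields day, month and year, while the three-dot range drops the year.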

paul-au · March 27, 2013, 5:56pm

How about this?

doc = Nokogiri::HTML(@my_text)

output = @date_string + $/
doc.xpath('//address/following-sibling::p//text()').each { |line| output << ( line.text.strip + $/ ) }

File.write("snippet.txt", output)
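
In case the pieces of that snippet are unfamiliar: $/ is Ruby's input record separator, which is "\n" unless something has changed it, and File.write opens, writes and closes the file in a single call. A minimal sketch with made-up stand-in strings instead of Nokogiri nodes, writing to a temporary directory rather than snippet.txt:

```ruby
require "tmpdir"

# $/ is Ruby's input record separator; it is "\n" unless something changes it.
separator = $/

# Stand-ins for @date_string and for the text of each matched node.
output = "26-Mar-2013" + separator
["\nLine 1\n", "\nLine 2\n", "\nLine 3\n"].each do |raw|
  output << (raw.strip + separator)
end

written = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "snippet.txt") # temporary path, for illustration only
  File.write(path, output)             # opens, writes and closes in one call
  written = File.read(path)
end

puts written
```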

paul-au · March 27, 2013, 7:05pm

This actually worked perfectly:

require "nokogiri"

doc = Nokogiri::HTML(@my_text)
output = @date_string + $/
doc.xpath('//address/following-sibling::p//text()').each { |line| output << ( line.text.strip + $/ ) }

File.write("snippet.txt", output)

paul-au · March 27, 2013, 6:55pm

Just out of curiosity I had a go at writing this myself, with the
exception of that complicated xpath, because I don’t really understand
xpath yet.

This is what I came up with:

require 'nokogiri'
doc = Nokogiri::HTML File.read(ARGV[0])
output = doc.css('span[@id="date"]').first.text[/\d+ \w+ \d+/].gsub(' ','-') + $/
path = '//address/following-sibling::p//text()'
doc.xpath(path).each { |line| output << line.text.strip << $/ }
File.write("snippet.txt", output)
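
The date line in that script leans on String#[] with a regular expression, which returns the first match. A sketch of just that step, assuming the element's text looks like the Date line of the sample message in the first post:

```ruby
# Assumed text of the date element, based on the sample message in the first post.
date_text = "Date: Tue, 26 Mar 2013 18:13:21 -0400"

# String#[] with a regexp returns the first match (or nil if nothing matches).
matched = date_text[/\d+ \w+ \d+/]
date_string = matched.gsub(" ", "-")

puts matched
puts date_string
```

If the pattern does not match, matched is nil and the gsub call raises, so real input may need a guard.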

paul-au · March 28, 2013, 9:35am

You still have @my_text and @date_string there, which leads me to
suspect that’s only the last part of your script.
The example I gave is the entire script…

paul-au · March 29, 2013, 10:28am

Since you’re using Nokogiri anyway, you’d be better off using that for
the whole process rather than looping through the HTML “manually”. This
is the sort of thing it’s worth getting in the habit of: using the tools
available to their fullest potential.
You can do the whole thing in 5 lines (barring error-checking) as I
demonstrated earlier.

paul-au · March 29, 2013, 4:17pm

Joel P. wrote in post #1103626:

Since you’re using Nokogiri anyway, you’d be better off using that for
the whole process rather than looping through the HTML “manually”. This
is the sort of thing it’s worth getting in the habit of: using the tools
available to their fullest potential.
You can do the whole thing in 5 lines (barring error-checking) as I
demonstrated earlier.

I completely missed that in your earlier post! It definitely makes
sense to use Nokogiri to do exactly what it’s good at.

Thanks!

paul-au · March 28, 2013, 1:07pm

Joel P. wrote in post #1103448:

You still have @my_text and @date_string there, which leads me to
suspect that’s only the last part of your script.
The example I gave is the entire script…

You’re right. Here’s the whole thing:

#!/usr/bin/env ruby

# some initializations
@interesting = false
@my_text = ""

# read the file between the two "body" tags
fname = ARGV[0]

IO.foreach(fname) do |line|
  if line.match(/body="start"/)
    @interesting = true
  end

  # meanwhile let's grab the date string and process it
  start_column = 4
  end_column = 6

  target_range = (start_column-1)..(end_column-1)

  if line.match(/Date<\/dfn>/)
    pieces = line.split(" ")
    @date_string = pieces[target_range].join("-")
  end

  if line.match(/body="end"/)
    @interesting = false
  end

  if @interesting
    @my_text << line
  end
end

require "nokogiri"

doc = Nokogiri::HTML(@my_text)
output = @date_string + $/
doc.xpath('//address/following-sibling::p//text()').each { |line| output << ( line.text.strip + $/ ) }

File.write("snippet.txt", output)

paul-au · March 29, 2013, 5:39pm

That makes quite a difference! Here’s what I have after ripping out the
old logic and extracting the date using Nokogiri:

#!/usr/bin/env ruby

require "nokogiri"

# get the date
doc = Nokogiri::HTML File.read(ARGV[0])
output = doc.css('span[@id="date"]').first.text[/\d+ \w+ \d+/].gsub(' ','-') + $/

# get the remaining text
path = '//address/following-sibling::p//text()'
doc.xpath(path).each { |line| output << line.text.strip << $/ }

File.write("snippet.txt", output)

paul-au · March 29, 2013, 5:50pm

Looks good. Ruby’s pretty amazing when it comes to finding simple ways
to do complex things.
Are there any parts of that code you need clarified? It’ll help to
understand all the methods used here so you can write your own more
easily in future.
For example, you can test regular expressions here:

paul-au · March 29, 2013, 11:49pm

I do have a follow-up question, if you don’t mind. I can see how the
“address” tag is stripped, but not the “pre” tag. Amazing how much
heavy lifting is accomplished with a simple (or, at least to me, not so
simple) line of code.

paul-au · March 29, 2013, 5:55pm

It really helped to run the code in IRB to see what Nokogiri was doing.
It will take me a little bit longer to wrap my mind around how Ruby does
regular expressions, but it certainly seems worth the effort.

Thanks for the link, and for all of the help!

paul-au · March 30, 2013, 12:02am

I’m not sure what you mean by stripping the tags. Firstly, the xpath is
looking for <p> after <address>, which doesn’t include <pre> in your
example. Secondly, if you ask Nokogiri for the “text”, it won’t include
any html tags.
