Nokogiri gem issues

Hello all,

I was recently pointed towards the Nokogiri gem to find all
HTML elements with a particular class, rather than attempting my own
regular expression. (Thanks John-John T. and Hassan S.!!!)

It works perfectly on my local machine (OS X Lion and Passenger), but
when I deployed it to my server (CentOS 5.5 and Passenger) Nokogiri
does not seem to grab all the elements of the HTML file.

Here is my method:

=========================
def find_editable
  # source_code() returns the location of a file within the app
  code = Nokogiri::HTML(open(source_code(FtpAction::DOWNLOAD)))

  Rails.logger.info "===== code ===="
  Rails.logger.info code.inspect

  elements = []
  num = 0

  # I tried using xpath, but was not able to get it to grab elements
  # w/ 'class="my-class icon"', only 'class="my-class"'
  code.css('.my-class').each do |el|
    attrs = []
    el.attributes.each { |attr| attrs << { attr[0] => attr[1].value } }
    elements << { :element => el.name, :attributes => attrs,
                  :content => el.content, :count => num += 1 }
  end

  Rails.logger.info "==== elements ===="
  Rails.logger.info elements.inspect

  # Returns the array of hashes built above, each containing:
  #   element    (html tag name)
  #   attributes (classes, ids and anything else)
  #   content    (actual text within the element, e.g. "Hello World!")
  #   count      (running number of the match)
  elements
end

Below is the output from the log files. You don’t really need to go
through all of the local output; the whole .html file is being parsed
there. If anyone has any idea why Nokogiri works locally but not on my
server, I would appreciate any help you can provide.

I also checked that the full .html file is present on the server:
reading it with File.open do |file| printed out the whole file.

Thank you in advance.

Output from Rails.logger.info:

Server (Centos 5.5)
===== code ====
#<Nokogiri::HTML::Document:0x5d14890 name="document"
children=[#<Nokogiri::XML::DTD:0x5d14624 name="html">]>
==== elements ====
[]

Locally (Lion OS X)
===== code ====
#<Nokogiri::HTML::Document:0x81f508d0 name=“document”
children=[#<Nokogiri::XML::DTD:0x81f4fd2c name=“html”>,
#<Nokogiri::XML::Element:0x81f4fcb4 name=“html”
attributes=[#<Nokogiri::XML::Attr:0x81f4b5b0 name=“lang” value=“en”>]
children=[#<Nokogiri::XML::Element:0x81f4aa84 name=“head”
children=[#<Nokogiri::XML::Element:0x81f4a3b8 name=“meta”
attributes=[#<Nokogiri::XML::Attr:0x81f4a0fc name=“charset”
value=“utf-8”>]>, #<Nokogiri::XML::Element:0x81f4a264 name=“meta”
attributes=[#<Nokogiri::XML::Attr:0x81f49918 name=“http-equiv”
value=“X-UA-Compatible”>, #<Nokogiri::XML::Attr:0x81f49904
name=“content” value=“IE=edge,chrome=1”>]>,
#<Nokogiri::XML::Element:0x81f499a4 name=“title”
children=[#<Nokogiri::XML::Text:0x81f47e4c “Mut8 Test Site”>]>,
#<Nokogiri::XML::Element:0x81f47cd0 name=“meta”
attributes=[#<Nokogiri::XML::Attr:0x81f47bf4 name=“name”
value=“description”>, #<Nokogiri::XML::Attr:0x81f47be0
name=“content”>]>, #<Nokogiri::XML::Element:0x81f47c80 name=“meta”
attributes=[#<Nokogiri::XML::Attr:0x81f47320 name=“name”
value=“author”>, #<Nokogiri::XML::Attr:0x81f4730c name=“content”>]>]>,
#<Nokogiri::XML::Element:0x81f4677c name=“body”
children=[#<Nokogiri::XML::Text:0x81f46470 “\n \t”>,
#<Nokogiri::XML::Element:0x81f46420 name=“header”
children=[#<Nokogiri::XML::Element:0x81f46128 name=“hgroup”
children=[#<Nokogiri::XML::Element:0x81f45ea8 name=“h1”
children=[#<Nokogiri::XML::Text:0x81f45c14 “Mut8 Testing”>]>,
#<Nokogiri::XML::Text:0x81f45aac "\n \t ">,
#<Nokogiri::XML::Element:0x81f45a5c name=“h2”
children=[#<Nokogiri::XML::Text:0x81f45764 “In Progress…”>]>,
#<Nokogiri::XML::Text:0x81f455fc "\n \t ">]>]>,
#<Nokogiri::XML::Element:0x81f453cc name=“div”
attributes=[#<Nokogiri::XML::Attr:0x81f4514c name=“id” value=“content”>]
children=[#<Nokogiri::XML::Text:0x81f4307c "\n \t ">,
#<Nokogiri::XML::Element:0x81f4302c name=“h3”
children=[#<Nokogiri::XML::Text:0x81f42d34 “My Page Title Here!”>]>,
#<Nokogiri::XML::Text:0x81f42bcc "\n \t ">,
#<Nokogiri::XML::Element:0x81f42b7c name=“p”
attributes=[#<Nokogiri::XML::Attr:0x81f42a28 name=“id” value=“villa”>,
#<Nokogiri::XML::Attr:0x81f42a14 name=“class” value=“bob my-class”>]
children=[#<Nokogiri::XML::Text:0x81f42410 “Lorem ipsum dolor sit amet,
consectetur adipiscing elit. Nulla eu ipsum urna, et molestie mi. \n \t
\t Aliquam adipiscing, massa et fermentum ullamcorper, neque nunc
consectetur enim, imperdiet \n \t \t porta lacus est non turpis. Nam
id nisi vitae enim scelerisque ullamcorper vel nec magna. \n \t \t
Morbi erat augue, mattis non imperdiet ac, dignissim in velit.”>]>,
#<Nokogiri::XML::Text:0x81f422a8 "\n\n\t ">,
#<Nokogiri::XML::Element:0x81f42258 name=“p”
attributes=[#<Nokogiri::XML::Attr:0x81f42104 name=“class”
value=“my-class icon”>] children=[#<Nokogiri::XML::Text:0x81f41cb8
“Pellentesque dapibus, nisl non venenatis vehicula, quam tortor placerat
lacus, hendrerit \n\t commodo nunc nisl non jusdddto. Donec erat
nulla, facilisis fringilla vestibulum et, iaculis \n\t eu metus. Sed
aliquet ultrices nunc quis pulvinar. Quisque facilisis dolor sed mauris
\n\t sagittis blandit. Quisque tortor libero, vestibulum quis semper
a, gravida quis nisl. \n\t Maecenas quam eros, blandit malesuada
imperdiet quis, volutpat sit amet nisl.”>]>,
#<Nokogiri::XML::Text:0x81f41b50 “\n \t”>]>,
#<Nokogiri::XML::Text:0x81f419e8 “\n \t”>,
#<Nokogiri::XML::Element:0x81f41998 name=“footer”
children=[#<Nokogiri::XML::Element:0x81f41664 name=“p”
attributes=[#<Nokogiri::XML::Attr:0x81f41574 name=“class”
value=“my-class”>] children=[#<Nokogiri::XML::Text:0x81f40ee4 “This is
my footer info”>]>, #<Nokogiri::XML::Text:0x81f40d7c “\n \t”>]>]>]>]>
==== elements ====
[{:content=>“Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nulla eu ipsum urna, et molestie mi. Aliquam adipiscing, massa et
fermentum ullamcorper, neque nunc consectetur enim, imperdiet porta
lacus est non turpis. Nam id nisi vitae enim scelerisque ullamcorper vel
nec magna. Morbi erat augue, mattis non imperdiet ac, dignissim in
velit.”, :attributes=>[{“class”=>“bob my-class”}, {“id”=>“villa”}],
:count=>1, :element=>“p”}, {:content=>“Pellentesque dapibus, nisl non
venenatis vehicula, quam tortor placerat lacus, hendrerit commodo nunc
nisl non jusdddto. Donec erat nulla, facilisis fringilla vestibulum et,
iaculis eu metus. Sed aliquet ultrices nunc quis pulvinar. Quisque
facilisis dolor sed mauris sagittis blandit. Quisque tortor libero,
vestibulum quis semper a, gravida quis nisl. Maecenas quam eros, blandit
malesuada imperdiet quis, volutpat sit amet nisl.”,
:attributes=>[{“class”=>“my-class icon”}], :count=>2, :element=>“p”},
{:content=>“This is my footer info”,
:attributes=>[{“class”=>“my-class”}], :count=>3, :element=>“p”}]

On Fri, Aug 26, 2011 at 9:06 AM, Keith R. wrote:

> I was recently pointed towards the Nokogiri gem to find all
> html elements with a particular class, rather than attempting my own
> regular expression.
>
> It works perfectly on my local machine, (Lion OS X and passenger), but
> when I deployed it to my server (Centos 5.5 and passenger) Nokogiri
> seems to not grab all the elements of the html file.

I’d guess your server has an older (buggier) libxml2 than your Mac;
you can check versions using xmllint --version.

If it’s older, try upgrading and see if that fixes it.
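
Note that xmllint reports the system libxml2, which is not necessarily the copy Nokogiri was compiled against or loads at runtime. A minimal sketch for asking Nokogiri directly, assuming the `libxml` entry of `Nokogiri::VERSION_INFO` carries `compiled` and `loaded` keys (true of the C-extension builds, but treat the key names as an assumption for your version). The method name `nokogiri_libxml_versions` is made up for illustration:

```ruby
# Hypothetical helper: report the libxml2 versions Nokogiri knows about.
# Returns nil if the nokogiri gem is not installed.
def nokogiri_libxml_versions
  require 'nokogiri'
  # Assumption: VERSION_INFO['libxml'] is a hash with 'compiled' and 'loaded'
  # keys; it may be absent on non-libxml2 builds such as JRuby.
  libxml = Nokogiri::VERSION_INFO['libxml'] || {}
  { :compiled => libxml['compiled'], :loaded => libxml['loaded'] }
rescue LoadError
  nil
end

p nokogiri_libxml_versions
```

If the two values differ from each other, or from what xmllint prints, the gem may need to be rebuilt after libxml2 is upgraded.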


Hassan S.

twitter: @hassan

I ran $ yum list updates, and the only available update that was
close to libxml2 was ‘libxml2-python’.

I updated that and it still does not work. When I checked the version,
the output was this:

xmllint --version
xmllint: using libxml version 20626
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1
FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv
ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug

Are there any other dependencies that I may need to update as well?

Thank you for the reply and your help

On Fri, Aug 26, 2011 at 12:04 PM, Keith R. wrote:

> I did $ yum list updates and the only one that said there was an update
> that was close to libxml2 was ‘libxml2-python’.

I don’t think that’s relevant :)

> I updated that and it still does not work. When I checked the version,
> the output was this:
>
> xmllint --version
> xmllint: using libxml version 20626

On my Mac (Snow Leopard): version 20708
On a very old SuSE box (home office spare): version 20620
On an Ubuntu server provisioned in the last year: version 20626

So I would bet that difference could still account for your problem.
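
For comparing these numbers: the integer xmllint prints packs the version as major * 10000 + minor * 100 + patch, so 20626 is libxml2 2.6.26. A small sketch of the decoding; the helper name `decode_libxml_version` is made up for illustration:

```ruby
# Hypothetical helper: turn the integer printed by `xmllint --version`
# (e.g. 20626) into a dotted version string (e.g. "2.6.26").
def decode_libxml_version(number)
  n = Integer(number.to_s, 10)        # accept both 20626 and "20626"
  major, rest  = n.divmod(10_000)
  minor, patch = rest.divmod(100)
  "#{major}.#{minor}.#{patch}"
end

# The three versions mentioned in this thread
%w[20708 20620 20626].each do |raw|
  puts "#{raw} => #{decode_libxml_version(raw)}"
end
```

Read this way, the Mac is on the 2.7 series while the other machines mentioned here are on 2.6.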

I rarely use package managers like yum; usually it seems faster and
easier to just install/update from source :)

You might want to research the change history but I would probably
just update and see what happens…

FWIW,

Hassan S.

twitter: @hassan

Keith R. wrote in post #1018677:

> # I tried using xpath, but was not able to get it to grab elements
> # w/ 'class="my-class icon"', only 'class="my-class"'

If you give up on installing a newer version of libxml (I’ve tried it in
the past and found it impossible), you can use xpath() to do what you
want:

Test

Hello world

<div class="editor_fruit red">Apple</div>

<div class="article_editor">
  <div>Not this node</div>
  <div class="hide editor">Papillon</div>
</div>

require 'nokogiri'

f = File.open('2html.htm')
doc = Nokogiri::HTML(f)

# Match any element whose space-separated class list contains the token "editor"
results = doc.xpath("//*[contains(concat(' ', @class, ' '), ' editor ')]")

results.each do |el|
  p [
    el.attributes['class'].value,
    el.children[0].text
  ]
end

--output:--
["editor highlight", "Hello world"]
["hide editor", "Papillon"]

The forum software may wrap the XPath string in an unfortunate place.
The contains() call should sit on one line and look like this:

contains( concat(' ', @class, ' '), ' editor ' )
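
The padding with spaces is what makes this a whole-word match on the class list. A plain-Ruby sketch of the same test, without Nokogiri, using the class values from the sample above; the helper name `class_token?` is made up for illustration:

```ruby
# Hypothetical helper mirroring contains(concat(' ', @class, ' '), ' token '):
# pad the class attribute with spaces, then look for the space-padded token.
def class_token?(class_attr, token)
  " #{class_attr} ".include?(" #{token} ")
end

classes = ["editor highlight", "editor_fruit red", "article_editor", "hide editor"]

classes.each do |cls|
  naive  = cls.include?("editor")       # like contains(@class, 'editor')
  padded = class_token?(cls, "editor")  # like the concat() version
  puts "#{cls.inspect}: naive=#{naive} padded=#{padded}"
end
```

The naive substring test accepts all four values, while the padded one accepts only "editor highlight" and "hide editor". Like the XPath expression, this sketch assumes the class names are separated by single spaces, not tabs or newlines.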

Just an update…

I removed the older versions of libxml2 and libxslt and reinstalled them
from source, and it appears to be working.

Thank you again for all your help.