Nokogiri bug or intended effect?

nuby2ruby · May 4, 2010, 12:44am

I’m trying to parse this (poorly formatted) page, and when I look at the
page I see:

Name: ZITO, PEDRO OSVALDO

When I look at the source I get:

Name: ZITO, PEDRO OSVALDO

When I parse the page I get:

page.search("/html/body/table[3]/tr[1]/td[4]/table/tr[1]/td[1]/table/tr[3]/td[2]/table/tr[1]/td[1]/table/tr[2]/td[1]").first
=> #<Nokogiri::XML::Element:0x1eb1f76 name="td"
attributes=[#<Nokogiri::XML::Attr:0x1eb1eea name="colspan" value="4">]
children=[#<Nokogiri::XML::Element:0x1eb0d24 name="font"
attributes=[#<Nokogiri::XML::Attr:0x1eb0c7a name="face" value="Arial">,
#<Nokogiri::XML::Attr:0x1eb0c70 name="size" value="2">]
children=[#<Nokogiri::XML::Text:0x1eb0694 "
\r\n ">, #<Nokogiri::XML::Element:0x1eb066c name="b"
children=[#<Nokogiri::XML::Text:0x1eb0496 "Name: ">]>,
#<Nokogiri::XML::Text:0x1eb03c4 "ZITO,PEDROOSVALDO
">]>, #<Nokogiri::XML::Text:0x1eaf636 " \r\n
\r\n ">]>

Notice that in the #<Nokogiri::XML::Text:0x1eb03c4 "ZITO,PEDROOSVALDO
"> node, all the spaces in the name have been removed.

Here’s what I’m using:

Nokogiri::LIBXML_VERSION
=> "2.7.3"
macbook-pro:~ jeremywoertink$ ruby -v
ruby 1.8.6 (2009-06-08 patchlevel 369) [universal-darwin9.0]

Does anyone have any ideas? My guess is that it's an encoding issue. There are
other areas in the pages where I have to do string.gsub("\302\240", "").

page.meta_encoding
=> nil
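
For reference, "\302\240" is the UTF-8 byte sequence for a non-breaking space (U+00A0). If the gaps in the name are non-breaking spaces too (an assumption; nothing above confirms it), replacing them with "" would squish the words together, while replacing them with a regular space would keep them apart. A minimal sketch in plain Ruby, where normalize_spaces is a hypothetical helper name and not part of Nokogiri:

```ruby
# The two bytes of U+00A0 (non-breaking space) in UTF-8,
# the same sequence used in the gsub call above.
NBSP = "\302\240"

# Hypothetical helper: swap non-breaking spaces for regular spaces,
# collapse runs of whitespace, and trim the ends.
def normalize_spaces(text)
  text.gsub(NBSP, " ").gsub(/[ \t\r\n]+/, " ").strip
end

# Assumed sample input: the name with non-breaking spaces between words.
raw = "ZITO,#{NBSP}PEDRO#{NBSP}OSVALDO\r\n"
puts normalize_spaces(raw)
```

The same helper could be applied to whatever .text or .content returns for the node.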

Thanks,

~Jeremy

nuby2ruby · May 4, 2010, 4:06am

Try using the .content() or .text() methods to get the text content of
the nodes.

nuby2ruby · May 4, 2010, 10:13pm

G_ F_ wrote:

Try using the .content() or .text() methods to get the text content of
the nodes.

Yeah, I tried that. It just returns the name all squished. Any other
ideas?

nuby2ruby · May 5, 2010, 5:21am

If you post this question to nokogiri-talk with a reproducible test
case, I think you’ll quickly get a response from the helpful nokogiri
community.


nuby2ruby · May 5, 2010, 5:50pm

Cool, I’ll try that. Thanks man.

~Jeremy

Mike D. wrote:

If you post this question to nokogiri-talk with a reproducible test
case, I think you’ll quickly get a response from the helpful nokogiri
community.