I’m trying to parse this (poorly formatted) page, and when I look at the
page I see:
Name: ZITO, PEDRO OSVALDO
When I look at the source I get:
Name: ZITO, PEDRO OSVALDO
When I parse the page I get:
page.search("/html/body/table[3]/tr[1]/td[4]/table/tr[1]/td[1]/table/tr[3]/td[2]/table/tr[1]/td[1]/table/tr[2]/td[1]").first
=> #<Nokogiri::XML::Element:0x1eb1f76 name="td"
attributes=[#<Nokogiri::XML::Attr:0x1eb1eea name="colspan" value="4">]
children=[#<Nokogiri::XML::Element:0x1eb0d24 name="font"
attributes=[#<Nokogiri::XML::Attr:0x1eb0c7a name="face" value="Arial">,
#<Nokogiri::XML::Attr:0x1eb0c70 name="size" value="2">]
children=[#<Nokogiri::XML::Text:0x1eb0694 "
\r\n ">, #<Nokogiri::XML::Element:0x1eb066c name=“b”
children=[#<Nokogiri::XML::Text:0x1eb0496 "Name: ">]>,
#<Nokogiri::XML::Text:0x1eb03c4 "ZITO,PEDROOSVALDO
">]>, #<Nokogiri::XML::Text:0x1eaf636 " \r\n
\r\n ">]>
Notice that in the text node #<Nokogiri::XML::Text:0x1eb03c4 "ZITO,PEDROOSVALDO
"> all the spaces in the name have been removed.
Here’s what I’m using:
Nokogiri::LIBXML_VERSION
=> "2.7.3"
macbook-pro:~ jeremywoertink$ ruby -v
ruby 1.8.6 (2009-06-08 patchlevel 369) [universal-darwin9.0]
Does anyone have any ideas? My guess is that it may be an encoding issue. There are
other areas in the pages where I have to do string.gsub("\302\240", "").
page.meta_encoding
=> nil
Thanks,
~Jeremy
Try using the .content() or .text() methods to get the text content of
the nodes.
G_ F_ wrote:
Try using the .content() or .text() methods to get the text content of
the nodes.
Yeah, I tried that. It just returns the name all squished. Any other
ideas?
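One thing worth checking, given the string.gsub("\302\240", "") mentioned in the original post: the bytes \302\240 are the UTF-8 encoding of U+00A0, the non-breaking space. If the words of the name are separated by non-breaking spaces in the page source (an assumption, the thread does not confirm it), then replacing them with an empty string would produce exactly this squished result, while replacing them with an ordinary space would keep the words apart. A minimal sketch in plain Ruby, where normalize_nbsp and the sample string are hypothetical and not taken from the actual page:

```ruby
# UTF-8 bytes of U+00A0 (non-breaking space), written as in the original post.
NBSP = "\302\240"

# Hypothetical helper: turn each non-breaking space into an ordinary
# space instead of deleting it.
def normalize_nbsp(text)
  text.gsub(NBSP, " ")
end

# Assumed sample input; the real page source may differ.
raw = "ZITO,#{NBSP}PEDRO#{NBSP}OSVALDO"

puts normalize_nbsp(raw)  # words stay separated by ordinary spaces
puts raw.gsub(NBSP, "")   # words run together
```

If the text nodes still come back without any separator at all, the characters are being lost earlier, and a reproducible test case would help narrow that down.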
If you post this question to nokogiri-talk with a reproducible test case, I
think you’ll quickly get a response from the helpful nokogiri community.
On May 3, 2010 6:45 PM, "Jeremy W." wrote:
[original post quoted in full]
Cool, I’ll try that. Thanks man.
~Jeremy
Mike D. wrote:
If you post this question to nokogiri-talk with a reproducible test case, I
think you’ll quickly get a response from the helpful nokogiri community.