Nokogiri help

I keep getting this error
“encoding error : output conversion failed due to conv error, bytes 0xA0
0x69 0x64 0xC2
I/O error : encoder error”

whenever I try to pass my HTML string to Nokogiri::HTML.

When I write that doc to a file, some of the spaces, or possibly letters,
come out as weird-looking characters.

Here is my (partial) output in irb:

templates = templates_page.search('/html/body/table[3]/tr[2]/td[2]//a')
=> footerhas-sub-sectionsheaderhomeitem-pagejustpro-888-519-5878left-navsearchsection-page

link = templates.first
=> footer

page = @app.browser.click(link)
=> #WWW::Mechanize::Page......

template_body = page.search('/html//body/form//pre')
=>

DIV?id?“footer”
DIV?id?“footer-icons”
IMG?src?"/lib/yhst-72759769340912/yahoo.gif"
IMG?src?"/lib/yhst-72759769340912/secure.gif"
DIV?id?“copyright”
TEXT?@copyright
LINEBREAK?

t = template_body.to_html.gsub(/[;|\s][a-zA-Z-]+[&|\s]/m) { |match|
?> if match != " return "

      val = %{|#{match.scan(/[a-zA-Z-]+/).first}}
      match.gsub(/[a-zA-Z-]+/, val)
    end
  }

=> “

<a
href=“javascript:document.f3.SLID.value=‘F16’;%20document.f3.submit();”
title=“Select” onmouseover=“window.status=‘Select’;true;”
onmouseout=“window.status=’’;”>DIV\240id\240<a
href=“javascript:document.f3.SLID.value=‘F17’;%20document.f3.submit();”
title=“Select” onmouseover=“window.status=‘Select’;true;”
onmouseout=“window.status=’’;”>“footer”\n <a
href=“javascript:document.f3.SLID.value=‘F18’;%20document.f3.submit();”
title=“Select” onmouseover=“window.status=‘Select’;true;”
onmouseout=“window.status=’’;”>DIV\240id\240<a
href=“javascript:document.f3.SLID.value=‘F19’;%20document.f3.submit();”
title=“Select” onmouseover=“window.status=‘Select’;true;”
onmouseout=“window.status=’’;”>“footer-icons”\n <a
href=“javascript:document.f3.SLID.value=‘F20’;%20document.f3.submit();”
title=“Select” onmouseover=“window.status=‘Select’;true;”
onmouseout=“window.status=’’;”>IMG\240src\240<a
href=“javascript:document.f3.SLID.value=‘F21’;%20document.f3.submit();”
title=“Select” onmouseover=“window.status=‘Select’;true;”
onmouseout=“window.status=’’;”>”/lib/yhst-72759769340912/yahoo.gif"\n
<a
href=“javascript:document.f3.SLID.value=‘F22’;%20document.f3.submit();”
title=“Select” onmouseover=“window.status=‘Select’;true;”
onmouseout=“window.status=’’;”>IMG\240src\240<a
href=“javascript:document.f3.SLID.value=‘F23’;%20document.f3.submit();”
title=“Select” onmouseover=“window.status=‘Select’;true;”
onmouseout=“window.status=’’;”>"/lib/yhst-72759769340912/secure.gif"\n
<a
href=“javascript:document.f3.SLID.value=‘F24’;%20document.f3.submit();”
title=“Select” onmouseover=“window.status=‘Select’;true;”
onmouseout=“window.status=’’;”>DIV\240id\240<a
href=“javascript:document.f3.SLID.value=‘F25’;%20document.f3.submit();”
title=“Select” onmouseover=“window.status=‘Select’;true;”
onmouseout=“window.status=’’;”>“copyright”\n <a
href=“javascript:document.f3.SLID.value=‘F26’;%20document.f3.submit();”
title=“Select” onmouseover=“window.status=‘Select’;true;”
onmouseout=“window.status=’’;”>TEXT\240<a
href=“javascript:document.f3.SLID.value=‘F27’;%20document.f3.submit();”
title=“Select” onmouseover=“window.status=‘Select’;true;”
onmouseout=“window.status=’’;”>@copyright\n <a
href=“javascript:document.f3.SLID.value=‘F28’;%20document.f3.submit();”
title=“Select” onmouseover=“window.status=‘Select’;true;”
onmouseout=“window.status=’’;”>LINEBREAK\240\n
\n"

doc = Nokogiri::HTML(<<-eohtml)
#{t}
eohtml
encoding error : output conversion failed due to conv error, bytes
0xA0 0x69 0x64 0xC2
I/O error : encoder error
=>

Is there something I should do differently?

Thanks,

~Jeremy W.

On Wed, Nov 18, 2009 at 08:30:10AM +0900, Jeremy W. wrote:

I keep getting this error
“encoding error : output conversion failed due to conv error, bytes 0xA0
0x69 0x64 0xC2
I/O error : encoder error”

This is most definitely an encoding problem with the source document.

If the source document hasn’t declared an encoding in the meta tags,
then libxml2 must guess the encoding of the document. Sometimes it gets
it wrong, and it looks like you’ve found one of those times.

I suggest attempting to parse the document outside Mechanize. Check the
encoding returned in the server headers, and use that when parsing.

Check the actual document source for an encoding, and try that.
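
A minimal sketch of those two checks, using only Ruby's standard library. The helper names `charset_from_header` and `charset_from_meta` are hypothetical (they are not part of Nokogiri or Mechanize), and the regular expressions are deliberately loose, so treat this as a starting point rather than a robust sniffer.

```ruby
# Hypothetical helpers: neither is part of Nokogiri or Mechanize.

# Pull the charset parameter out of a Content-Type header value, if present.
def charset_from_header(content_type)
  return nil if content_type.nil?
  match = content_type.match(/charset\s*=\s*"?([^\s;"]+)/i)
  match && match[1]
end

# Look for a charset declaration in the first 2 KB of the document itself.
# The bytes are treated as binary so invalid sequences cannot break the regex.
def charset_from_meta(html)
  head = html.byteslice(0, 2048).force_encoding(Encoding::BINARY)
  match = head.match(/<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_\-]+)/i)
  match && match[1]
end

# A bare "text/html" header with no charset, and a document with no meta tag.
header  = "text/html"
charset = charset_from_header(header) ||
          charset_from_meta("<html><body>no meta tag</body></html>")
puts charset.inspect
```

If either helper finds a name, it can be handed to the parser as the third argument, as in `Nokogiri::HTML(html, nil, charset)`. If both return nil, the parser is left to guess.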

You may also need to make an educated guess. For example, some people
will create documents containing UTF-8 characters, but then declare the
document as using ISO-8859-1 encoding. :(
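
One way to make that guess in plain Ruby is sketched below. The helper `to_utf8_guess` is hypothetical, and the fallback to ISO-8859-1 is an assumption chosen because the failing bytes 0xA0 0x69 0x64 read as a non-breaking space followed by "id" in that encoding; a different fallback may suit other pages better.

```ruby
# Hypothetical helper: return the input as UTF-8, guessing the source encoding.
# Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
def to_utf8_guess(bytes, fallback = Encoding::ISO_8859_1)
  utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?
  bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

# 0xA0 0x69 0x64 from the error message, in context.
raw   = "DIV\xA0id\xA0".b
fixed = to_utf8_guess(raw)

puts fixed.valid_encoding?     # the result is valid UTF-8
puts fixed.include?("\u00A0")  # 0xA0 is now a real non-breaking space
```

A lone 0xA0 byte is never valid UTF-8, so the validity check fails and the fallback is used. Input that is already valid UTF-8 passes through untouched.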

On Thu, Nov 19, 2009 at 01:54:47PM +0900, Jeremy W. wrote:

so, where content-encoding is gzip, is this what “should” be UTF-8?

No. That means they are just not specifying a character encoding. Was
there one in the HTML document itself?

I just updated my libxml2 as well so I’m using libxml2 @2.7.3_0
(active). Is there an attribute I can set somewhere that will allow me
to parse the page using the gzip encoding?

No. It should be unzipped before sending to the parser.
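
For a page fetched by hand, the unzip step could look roughly like this. `decoded_body` is a hypothetical helper and the compressed string is a stand-in for a real response body; as far as I know Mechanize normally inflates gzip responses itself, so this matters mainly when the body comes from somewhere else.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper: inflate the body only when the header says it is gzipped.
def decoded_body(body, content_encoding)
  return body unless content_encoding.to_s.downcase == "gzip"
  Zlib::GzipReader.new(StringIO.new(body)).read
end

# Stand-in for a compressed HTTP response body.
compressed = Zlib.gzip("<html><body>DIV id footer</body></html>")
html = decoded_body(compressed, "gzip")
puts html
```

Note that gzip is only a transfer compression: the inflated bytes still have whatever character encoding the page was written in, so the charset question remains after unzipping.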

Thanks for the help man!

No problem. :)

I checked out the page response, and this is what I got back

page.response
=> {“cache-control”=>“private”, “connection”=>“close”,
“p3p”=>“policyref=“http://p3p.yahoo.com/w3c/p3p.xml”, CP=“CAO DSP COR
CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi
PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE GOV””,
“date”=>“Thu, 19 Nov 2009 04:39:19 GMT”, “content-type”=>“text/html”,
“content-encoding”=>“gzip”, “set-cookie”=>“B=c4mf1f55g9ivn&b=3&s=9v;
expires=Tue, 02-Jun-2037 20:00:00 GMT; path=/; domain=.yahoo.com”}

so, where content-encoding is gzip, is this what “should” be UTF-8?

I just updated my libxml2 as well so I’m using libxml2 @2.7.3_0
(active). Is there an attribute I can set somewhere that will allow me
to parse the page using the gzip encoding?

Thanks for the help man!

~Jeremy

Aaron P. wrote:

On Thu, Nov 19, 2009 at 01:54:47PM +0900, Jeremy W. wrote:

No. That means they are just not specifying a character encoding. Was
there one in the HTML document itself?

No, there’s just a page full of crap >.< For example, here is the
first line when you view source:

Yahoo! Store Editor<table

Oh yeah! 2 HTML tags!!

Ok, well I got a bit of a start at least.

Thanks.

~Jeremy