I checked out the page response, and this is what I got back
page.response
=> {"cache-control"=>"private", "connection"=>"close",
"p3p"=>"policyref=\"http://p3p.yahoo.com/w3c/p3p.xml\", CP=\"CAO DSP COR
CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi
PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE GOV\"",
"date"=>"Thu, 19 Nov 2009 04:39:19 GMT", "content-type"=>"text/html",
"content-encoding"=>"gzip", "set-cookie"=>"B=c4mf1f55g9ivn&b=3&s=9v;
expires=Tue, 02-Jun-2037 20:00:00 GMT; path=/; domain=.yahoo.com"}
I’m getting an encoding error when writing out the contents of this
page. The content-encoding is showing gzip. Anyone know a way I can tell
mechanize to use a different encoding when parsing a page? Or possibly
another way I can do this?
Thanks,
~Jeremy
Morning Jeremy,
On Mon, Nov 23, 2009 at 9:21 AM, Jeremy W. [email protected] wrote:
I’m getting an encoding error when writing out the contents of this
page. The content-encoding is showing gzip. Anyone know a way I can tell
mechanize to use a different encoding when parsing a page? Or possibly
another way I can do this?
You are getting the encoding error because you aren’t dealing with a
string in this case but rather a gzipped string. You could technically
go to the request object and tell it you won’t accept gzip encoded
responses - but I personally find that distasteful because gzipping the
pages saves everyone bandwidth and if we’re scraping data we shouldn’t
be a nuisance (IMO). What you want to do is use the Zlib::Inflate class
to convert the response to a regular string
(http://www.ruby-doc.org/stdlib/libdoc/zlib/rdoc/classes/Zlib/Inflate.html#M001974).
That should solve your problem.
John
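A minimal sketch of decompressing a gzip body with Ruby's standard zlib library, for illustration only. The `html` and `gzipped` values are made-up stand-ins for a real response body, and `Zlib.gzip` assumes Ruby 2.4 or later. One caveat: `Zlib::Inflate.inflate` expects zlib framing by default, so gzip-framed data generally needs `Zlib::GzipReader` or an explicit window-bits argument. If the HTTP library has already decompressed the body (an assumption worth checking for your Mechanize version), inflating it a second time will fail.

```ruby
require "zlib"
require "stringio"

# Hypothetical stand-in for a gzip-encoded response body.
html    = "<html><body>hello</body></html>"
gzipped = Zlib.gzip(html) # Ruby 2.4+

# Option 1: GzipReader understands the gzip header and trailer.
plain = Zlib::GzipReader.new(StringIO.new(gzipped)).read

# Option 2: Inflate with MAX_WBITS + 32 auto-detects gzip or zlib framing.
auto = Zlib::Inflate.new(Zlib::MAX_WBITS + 32).inflate(gzipped)

# With default settings, Inflate expects zlib framing and rejects gzip data.
default_failed = false
begin
  Zlib::Inflate.inflate(gzipped)
rescue Zlib::DataError
  default_failed = true
end

puts plain
puts "default Inflate rejected gzip data: #{default_failed}"
```

Either option returns the original HTML here; which one fits depends on whether the bytes you hold are still compressed.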
Hmmm, interesting. I didn’t think about that. I’m not familiar with the
class though. I ran through it and I got an error:
Zlib::DataError: incorrect header check
from (irb):84:in `inflate'
from (irb):84
from :0
I did notice when I print out the string, there are a lot of “\240” in
the string like
"
<a
href="javascript:document.f5.SLID.value='F99';%20document.f5.submit();"
title="Select" onmouseover="window.status='Select'; return true;"
onmouseout="window.status='';">DIV\240id\240<a
href="javascript:document.f5.SLID.value='F100';%20document.f5.submit();"
title="Select" onmouseover="window.status='Select'; return true;"
onmouseout="window.status='';">.…"
I think these are where I’m getting messed up. Does anyone know a good
site that lists these characters? I think \240 might be a tab character,
but I want to check it against some list just to see.
Thanks,
~Jeremy
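On the "\240" question: that is an octal escape for a single byte, 160 decimal or 0xA0 hex, whereas a tab is "\t" (octal \011). In ISO-8859-1 (Latin-1), byte 0xA0 is the non-breaking space. The sketch below assumes the page is Latin-1, which the headers above do not confirm since the content-type carries no charset; the sample string is taken from the snippet in the post.

```ruby
# "\240" is an octal escape for one byte.
byte = "\240".bytes.first # 160 decimal, 0xA0 hex
tab  = "\t".bytes.first   # 9 decimal, octal \011

# Assumption: the page bytes are ISO-8859-1 (Latin-1).
# Under that assumption, 0xA0 maps to U+00A0 NO-BREAK SPACE.
raw  = "DIV\240id\240".dup.force_encoding("ISO-8859-1")
utf8 = raw.encode("UTF-8")

# Replace non-breaking spaces with ordinary spaces before writing out.
cleaned = utf8.gsub("\u00A0", " ")

puts byte
puts tab
puts cleaned.inspect
```

If the page turns out to use a different charset, the `force_encoding` argument would need to change accordingly.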