Rafa_F
August 17, 2015, 3:43am
1
Using Ruby 2.2.1
When I run this Craigslist scraper script (see excerpt below) with ARGV
input strings such as “Casio” it works perfectly.
When I run it with an ARGV string that has doubled letters,
for example “Hammond”, it fails with the error listed below.
(eval):14:in `===': invalid byte sequence in UTF-8 (ArgumentError)
from (eval):14:in `block (2 levels) in links_with'
from (eval):13:in `each'
from (eval):13:in `all?'
from (eval):13:in `block in links_with'
from (eval):12:in `each'
from (eval):12:in `find_all'
from (eval):12:in `links_with'
from Craig_Search.rb:47:in `block (2 levels) in <main>'
from Craig_Search.rb:40:in `each'
from Craig_Search.rb:40:in `block in <main>'
from Craig_Search.rb:29:in `each'
from Craig_Search.rb:29:in `<main>'
Can anyone tell me what’s wrong or how to avoid it?
The code has been working for some time without error, but it seems
some recent gem update “broke” it. Wish I could say which one.
Excerpt from script Craig_Search.rb
require 'rubygems'
require 'mechanize'
require 'date'
require 'net/ftp'
…some more code…
# craigslist cities to search along with search terms
cities = ["daytona", "ocala", "orlando", "cfl"]
search_words = ARGV.dup
agent = Mechanize.new
link_file = "/home/raymon/Documents/CraigsList_all_links.htm"
remote_file2 = "CraigsList_all_links.htm"
line 29 ----->>> search_words.each { |word|
  # replace spaces in search word with underscore for the links
  mod_word = word.gsub(/[ ]/, "_")
  # file to be created
  web_file = "/home/raymon/Documents/CraigsList_" + mod_word + "_Links.htm"
  # open the local file for writing and begin creating webpage for later
  # upload to server
  # we do want to update the complete page each time -- no appending
  my_file = File.open(web_file, 'w')
line 40 ----->>> cities.each { |city|
    # write the headings to the file
    my_file.puts("<h3 class=\"blue\">" + city.capitalize + "</h3>")
    url = "http://" + city + ".craigslist.org/search/?areaID=238&subAreaID=&query=" + word + "&catAbb=sss"
    page = agent.get(url)
line 47 ----->>> found_link = page.links_with(:text => /#{word}/i)
    found_link.each { |link| ...............rest of code
Strangely, it works without error if I search on “Kimball” (two L’s)
but still fails on “Hammond” (two M’s)… What gives?
Anyone with ideas?
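The trace ends in `Regexp#===` raising inside `links_with`, which is what Ruby does when a regular expression is matched against a string tagged as UTF-8 that contains bytes that are not valid UTF-8. One plausible reading (an assumption, not confirmed anywhere in this thread) is that one of the listings returned for “Hammond” contains such bytes while the results for “Casio” and “Kimball” do not, which would make the doubled letters a coincidence. A minimal sketch of that failure and of one possible workaround using `String#scrub` (available since Ruby 2.1) follows; `FakeLink` and `matching_links` are hypothetical stand-ins, not part of the Mechanize API.

```ruby
# Hypothetical stand-in for a Mechanize link; only #text is used here.
FakeLink = Struct.new(:text)

# Simulated link texts. The second one contains a lone 0xE9 byte
# (Latin-1 "e acute"), which is not valid UTF-8 even though the string
# is tagged as UTF-8.
links = [
  FakeLink.new("Casio keyboard"),
  FakeLink.new("Hammond orgue \xE9lectrique"),
  FakeLink.new("Kimball piano")
]

# Regexp.escape keeps characters in the search word from being read as
# regex syntax.
pattern = /#{Regexp.escape("Hammond")}/i

# Matching the raw text is expected to fail the same way as in the trace.
begin
  links.select { |link| pattern === link.text }
rescue ArgumentError => e
  puts "raw match failed: #{e.message}"
end

# Workaround sketch: replace invalid bytes with "?" before matching.
def matching_links(links, pattern)
  links.select do |link|
    text = link.text.to_s
    text = text.scrub("?") unless text.valid_encoding?
    pattern === text
  end
end

found = matching_links(links, pattern)
puts found.map { |link| link.text.scrub("?") }
```

In the script above, the same filter could presumably be applied to `page.links` in place of `page.links_with(:text => ...)`; that substitution is untested here.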
Seems like a problem with Mechanize.
Maybe you want to check out:
opened 04:19PM - 22 Sep 14 UTC
closed 09:11PM - 31 Mar 21 UTC
While reading a website, I got the above error. Below is the transaction log. What's strange about this is the Content-Encoding appears to be 'UTF-8', but Mechanize clearly logs that it is 'utf8'. Any thoughts on how I can make it work?
```
Net::HTTP::Get: /dashboard
request-header: accept => */*
request-header: user-agent => Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/536.26.14 (KHTML, like Gecko) Version/6.0.1 Safari/536.26.14 BookTrakr
request-header: accept-encoding => gzip,deflate,identity
request-header: accept-charset => ISO-8859-1,utf-8;q=0.7,*;q=0.7
request-header: accept-language => en-us,en;q=0.5
request-header: cookie => at-main=5|gfsWc2E79m6dQwDZbK3anNoRpqeyH+40Hxb8D/zkIF8hsL3M1G6j6hrMepPa2pa2EQ8EjRhHdur+uX9+63KP+RQXxvb/obdo22h0ogmlO1qZflq1YLV5tZ82BC23gdgnAphedat4IIXk8a8PoKexzsa8SUgQAOByTosDKiGVOvH6q9R5zc6SZE1wTuMkgQBT/Gukdkw2XH2SRxnyDFPEuheFZ906mC+H; kdp-lc-main=en_US; lc-main=en_US; sess-at-main=sW1rTLqkt18lzG2mi/og+giDdhEVGLsqckXc68GrpDE=; session-id=181-5515808-6319153; session-id-time=1412007142l; session-token=Xr0Wgr9hzWCb8W8mpafCtCy3YKnDcTu9hD6Mxf3sYaiUKRi61Juz+MlnRaUEcsJKe1v2mMiJqku/gIs1bdpSMniYDnEsX7h0D0bg7dDTYD5LLgXKBtF61aF2CZcFuMZz1lf+eEwlpY5Z/CoP0VDqPP9gFe/Y7gqoeMKRJLMltB4rx9nQw1lwHnZnkJey8XxdLLeySQi7tNxraBt68CR/Js5V2Xxnpz6rxWsW34nz/+w=; ubid-main=183-8767582-2813450; x-main=fj4eh2R3sfisrj1Kux7AOsWeUSGB4v5f
request-header: host => kdp.amazon.com
status: Net::HTTPOK 1.1 200 OK
response-header: date => Mon, 22 Sep 2014 16:12:22 GMT
response-header: server => Server
response-header: x-amz-id-1 => 1P9XK5NHQK5FVJ7EHR89
response-header: x-amz-id-2 => zf81SPO+thBvIshnOqjPnXlJiCAk+Xzs4EXgm8YX5OOMiYkK6kqj0uSoWgBH+w0S2fSdyK60TT8=
response-header: x-frame-options => DENY
response-header: cache-control => no-cache, no-store, must-revalidate, max-age=0
response-header: pragma => no-cache
response-header: expires => Thu, 01 Jan 1970 00:00:00 GMT
response-header: x-ua-compatible => IE=edge,chrome=1
response-header: content-type => text/html;charset=UTF-8
response-header: content-language => en-US
response-header: set-cookie => session-id=181-5515808-6319153; Domain=.amazon.com; Expires=Mon, 29-Sep-2014 16:12:22 GMT; Path=/, session-id-time=1412007142l; Domain=.amazon.com; Expires=Mon, 29-Sep-2014 16:12:22 GMT; Path=/, ubid-main=183-8767582-2813450; Domain=.amazon.com; Expires=Sun, 17-Sep-2034 16:12:22 GMT; Path=/, session-token="AGeKCdjEFIKMg7uEGNZxJOHRzkH0jJcc7VpTQyZhU/C2RPCBr0YXsmMkNXw1px8l/Rs+ox6/aCm1AMBKSt+hCmqXPW2fDC0c+PJN6cwFFVaa0ib10S3SZbU1HvnzH6fPOTLz3ZoZKtvtn5DFazLnUEUZJ/+qd743gtIdehhrJklFTYi25NVdAltAuTrm18ADkxw8mwb7UAn3/b3PAlunMj4SeJdGb3v8sqykaaeH658="; Version=1; Domain=.amazon.com; Max-Age=600; Expires=Mon, 22-Sep-2014 16:22:22 GMT; Path=/
response-header: vary => User-Agent
response-header: transfer-encoding => chunked
Read 6102 bytes (6102 total)
Read 18 bytes (6120 total)
Read 1721 bytes (7841 total)
Read 351 bytes (8192 total)
Read 6915 bytes (15107 total)
Read 5 bytes (15112 total)
Read 1272 bytes (16384 total)
Read 441 bytes (16825 total)
Read 7273 bytes (24098 total)
Read 6 bytes (24104 total)
Read 472 bytes (24576 total)
Read 1241 bytes (25817 total)
Read 6951 bytes (32768 total)
Read 2041 bytes (34809 total)
Read 5225 bytes (40034 total)
Read 926 bytes (40960 total)
Read 2841 bytes (43801 total)
Read 4425 bytes (48226 total)
Read 926 bytes (49152 total)
Read 3641 bytes (52793 total)
Read 3625 bytes (56418 total)
Read 926 bytes (57344 total)
Read 4441 bytes (61785 total)
Read 3751 bytes (65536 total)
Read 3515 bytes (69051 total)
Read 1726 bytes (70777 total)
Read 2951 bytes (73728 total)
Read 4315 bytes (78043 total)
Read 1726 bytes (79769 total)
Read 2151 bytes (81920 total)
Read 5115 bytes (87035 total)
Read 1726 bytes (88761 total)
Read 1351 bytes (90112 total)
Read 5915 bytes (96027 total)
Read 1726 bytes (97753 total)
Read 551 bytes (98304 total)
Read 6715 bytes (105019 total)
Read 1477 bytes (106496 total)
Read 7273 bytes (113769 total)
Read 247 bytes (114016 total)
Read 672 bytes (114688 total)
Read 6594 bytes (121282 total)
Read 1598 bytes (122880 total)
Read 1841 bytes (124721 total)
Read 5425 bytes (130146 total)
Read 926 bytes (131072 total)
Read 2641 bytes (133713 total)
Read 4625 bytes (138338 total)
Read 926 bytes (139264 total)
Read 3441 bytes (142705 total)
Read 3825 bytes (146530 total)
Read 926 bytes (147456 total)
Read 2338 bytes (149794 total)
saved cookie: session-id=181-5515808-6319153
saved cookie: session-id-time=1412007142l
saved cookie: ubid-main=183-8767582-2813450
saved cookie: session-token=AGeKCdjEFIKMg7uEGNZxJOHRzkH0jJcc7VpTQyZhU/C2RPCBr0YXsmMkNXw1px8l/Rs+ox6/aCm1AMBKSt+hCmqXPW2fDC0c+PJN6cwFFVaa0ib10S3SZbU1HvnzH6fPOTLz3ZoZKtvtn5DFazLnUEUZJ/+qd743gtIdehhrJklFTYi25NVdAltAuTrm18ADkxw8mwb7UAn3/b3PAlunMj4SeJdGb3v8sqykaaeH658=
form encoding: utf8
from_native_charset: Encoding::ConverterNotFoundError: form encoding: "utf8" string: search
Encoding::ConverterNotFoundError: code converter not found (UTF-8 to utf8)
```
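The last line of that log says Ruby has no converter registered under the name "utf8"; the name Ruby registers is "UTF-8". A minimal sketch of normalizing the charset name before converting is below. `CHARSET_ALIASES` and `normalize_charset` are hypothetical helpers written for illustration, not something Mechanize provides, and this is not necessarily how the issue was resolved upstream.

```ruby
# Hypothetical alias table; only the spelling seen in the log is covered.
CHARSET_ALIASES = { "utf8" => "UTF-8" }.freeze

# Returns the mapped name for a known misspelling, or the input unchanged.
def normalize_charset(name)
  CHARSET_ALIASES.fetch(name.to_s.strip.downcase, name)
end

text = "search"

# Converting to the unregistered name is expected to raise, as in the log.
begin
  text.encode("utf8")
rescue Encoding::ConverterNotFoundError => e
  puts "encode failed: #{e.message}"
end

# With the name normalized, the conversion goes through.
converted = text.encode(normalize_charset("utf8"))
puts converted.encoding
```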
opened 04:47AM - 13 Sep 13 UTC
closed 07:32AM - 22 Aug 16 UTC
in lib/mechanize/form.rb in file_to_multipart(file):
``` ruby
body = "Content-Disposition: form-data; name=\"" +
"#{mime_value_quote(file.name)}\"; " +
"filename=\"#{mime_value_quote(file_name)}\"\r\n" +
"Content-Transfer-Encoding: binary\r\n"
```
body.encoding will be UTF-8
and later
``` ruby
body <<
if file.file_data.respond_to? :read
"\r\n#{file.file_data.read}\r\n"
else
"\r\n#{file.file_data}\r\n"
end
```
The file data will be in ASCII-8BIT, so concatenating it to body will fail.
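To illustrate the point above: Ruby raises `Encoding::CompatibilityError` when a UTF-8 string and an ASCII-8BIT string are concatenated and both contain non-ASCII bytes (if either side is pure ASCII, the concatenation succeeds). The sketch below reproduces that with made-up values and shows one possible workaround, building the body as binary with `String#b`. The header text and file bytes are illustrative assumptions, not taken from Mechanize.

```ruby
# UTF-8 header with a non-ASCII file name (hypothetical example value).
header = "Content-Disposition: form-data; name=\"upload\"; " \
         "filename=\"r\u00e9sum\u00e9.bin\"\r\n"

# Binary file data with bytes above 0x7F, tagged ASCII-8BIT.
file_data = "\xFF\xD8\xFF\xE0".b

# Appending binary data to the UTF-8 header fails because both sides
# contain non-ASCII bytes in different encodings.
begin
  header.dup << "\r\n" << file_data
rescue Encoding::CompatibilityError => e
  puts "concat failed: #{e.message}"
end

# Workaround sketch: treat the whole multipart body as raw bytes.
body = header.b
body << "\r\n" << file_data << "\r\n"
puts body.encoding
```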
Best regards.
So if you want to solve this problem, start by reading the Mechanize code,
understand it, and try to figure out what is wrong. Simple as that.
raymonb
September 17, 2015, 1:48am
6
Damián M. González wrote in post #1178271:
So if you want to solve this problem, start by reading the Mechanize code,
understand it, and try to figure out what is wrong. Simple as that.
If this is a Mechanize problem, shouldn’t they fix it?
raymonb
September 18, 2015, 10:43pm
7
Of course, but remember that it is open source.
raymonb
September 10, 2015, 2:48am
8
Damián M. González wrote in post #1178271:
So if you want to solve this problem, start by reading the Mechanize code,
understand it, and try to figure out what is wrong. Simple as that.
Sorry …that is beyond my capabilities.
Sure wish someone had something specific to suggest.
oh well…