Beginner Help - Encoding error exporting/writing to CSV

I’ve looked all over the internet and can’t seem to figure out what I
need to do to get past this error. I admit, I’ve only been playing with
Ruby for a couple of days so bear with me.

I created a script to read the info and first post from a list of forum
threads. I can get all of the data in arrays and export to a CSV
perfectly fine with small batches. However, I know the forum has some
Japanese and other non-English posts, and I assume one of those is
what’s causing this problem.

After my list of HTML pages has been parsed and the script starts writing
the data to CSV, I get this about halfway through:

F:/ruby/lib/ruby/1.9.1/csv.rb:1729:in `join': incompatible character encodings: UTF-8 and ISO-8859-1 (Encoding::CompatibilityError)
from F:/ruby/lib/ruby/1.9.1/csv.rb:1729:in `<<'
from threads.rb:89:in `block (2 levels) in <main>'
from threads.rb:80:in `each'
from threads.rb:80:in `block in <main>'
from F:/OfflineExplorerPortable/ruby/lib/ruby/1.9.1/csv.rb:1354:in `open'
from threads.rb:77:in `<main>'

I tried a dozen different things I read online about encoding to try to
fix it, but they either didn’t do anything or raised other method
errors. It’s probably simple but it’s beyond me at the moment. Can
anyone give the FNG a little help? I’d appreciate it!
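
For reference, the error in the trace above can be reproduced without CSV at all. The sketch below is a minimal illustration with made-up sample strings (not data from the forum): joining a UTF-8 string and an ISO-8859-1 string that both contain non-ASCII characters raises the same exception, and transcoding one of them first avoids it.

```ruby
# encoding: utf-8

# Made-up sample data: one UTF-8 string and one ISO-8859-1 string,
# both containing non-ASCII characters.
utf8   = "日本語"
latin1 = "caf\xE9".b.force_encoding("ISO-8859-1") # bytes for "café" in ISO-8859-1

begin
  [utf8, latin1].join(",")
  outcome = :joined
rescue Encoding::CompatibilityError => e
  outcome = e.class
  puts e.message
end

# Transcoding the ISO-8859-1 string to UTF-8 first lets the join succeed.
fixed = [utf8, latin1.encode("UTF-8")].join(",")
puts fixed
```

If this is what is happening in the script, at least one field reaching the CSV writer is tagged as ISO-8859-1 while the others are UTF-8.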

Can you show us the code? It would be easier to help.

All I can recommend now is to add .encode('utf-8') (or
.force_encoding('utf-8')) everywhere you accept external input.
Command line arguments? Downloaded website content? Files read from
disk? Mark their encodings explicitly.

– Matma R.
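
The two methods mentioned above do different things, which matters when choosing between them. A small sketch, using a single made-up byte as input: force_encoding only changes the label on the existing bytes, while encode converts the bytes from the current encoding to the target one.

```ruby
# One raw byte: the letter "é" as stored in ISO-8859-1.
bytes = "\xE9".b

# force_encoding relabels the bytes without converting them.
# 0xE9 on its own is not valid UTF-8, so the result is a broken string.
relabeled = bytes.dup.force_encoding("UTF-8")
puts relabeled.valid_encoding?

# Labelling the bytes with their real encoding and then calling encode
# converts them into the two-byte UTF-8 form of "é".
latin1 = bytes.dup.force_encoding("ISO-8859-1")
utf8   = latin1.encode("UTF-8")
puts utf8.bytes.inspect
```

So force_encoding fits data that already is UTF-8 but carries the wrong label, and encode fits data that really is in another encoding.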

Thank you for your response.

Bartosz Dziewoński wrote in post #1074162:

All I can recommend now is to add .encode('utf-8') (or
.force_encoding('utf-8')) everywhere you accept external input.

I’ve tried that and I must be doing something wrong, because I keep
getting undefined method errors when I do.

Can you show us the code? It would be easier to help.

I’ll warn you and say that it’s not very elegant but it seems to get the
data correctly:

# encoding: utf-8

require 'nokogiri'
require 'open-uri'
require 'csv'

#define arrays
@thread = Array.new
@filename = Array.new
@postid = Array.new
@title = Array.new
@date = Array.new
@filedate = Array.new
@member = Array.new
@memberurl = Array.new
@content = Array.new

#pull in file/URL list
files = CSV.read("files1.csv")
(0..files.length - 1).each do |index|
  #print out current URL or File
  puts files[index][0]
  #save it to array
  @filename << files[index][0]
  #load HTML to Nokogiri
  doc = Nokogiri::HTML(open(files[index][0]))

  #find the first post and ID
  threadurl = doc.css('a[name="1"]').map { |link| link['href'] }
  test = threadurl.to_s
  test = test[2..-3]
  #isolate the main/first post ID
  post_id = test.split("#")[1]
  post_id = post_id[4..-1]

  #make other versions of post ID references if needed
  postmessageid = "post_message_" + post_id
  postmenu = "postmenu_" + post_id
  pstid = "post" + post_id

  #find the Date of first post
  fdate = doc.css('td[class="thead"]')[2].content
  fdate = fdate.strip

  #get member name
  membername = doc.css('a[class="bigusername"]')

  #get post content
  contentid = "div#post_message_" + post_id
  postcontent = doc.css(contentid)

  #write all other arrays
  @title << doc.at_css("title").text[0..-28]
  @filedate << fdate
  @member << membername[0].text
  @memberurl << membername[0]['href']
  @content << postcontent
end

CSV.open("output.csv", "wb:UTF-8") do |row|
  row << ["Thread Title", "Filename - Thread URL", "Date", "Member",
          "Member URL", "Content"]
  #(0..urls.length - 1).each do |index|
  (0..files.length - 1).each do |index|
    row << [
      @title[index],
      @filename[index],
      @filedate[index],
      @member[index],
      @memberurl[index],
      @content[index]]
  end
end

Just to point out, the input CSV really isn’t a multi-column CSV. It’s
really just a single list of local files and HTTP URLs. It’s just CSV
so I can easily open it in Excel.

Bartosz Dziewoński wrote in post #1074257:

Uh, that’s certainly bad. Either you’re not using Ruby 1.9 (that would
be weird…), or the objects you are operating on are not the objects
you think they are (you should get information about them in the error
message, e.g. “NoMethodError: undefined method `encode' for
5:Fixnum”).

It would just be like:

postcontent.force_encoding('utf-8')

right? Or do I have to pass the output of that into another
variable/string?

This is what I get when I use the above:

threads.rb:64:in `block in <main>': undefined method `force_encoding' for #<Nokogiri::XML::NodeSet:0x142b048> (NoMethodError)
from threads.rb:21:in `each'
from threads.rb:21:in `<main>'

I skimmed the code; possible places where you are not setting the
encoding are 'files = CSV.read("files1.csv")' or 'doc =
Nokogiri::HTML(open(files[index][0]))'. Also, this is different code
from the one that gives the error in the first post (there is no line 77
in it).

Nope, it’s the same code, but I’ve cleaned up some comments and spaces
since. I’ll try to get the error message from the exact code above.

When I stick the force_encoding at the end of the Nokogiri call (since
that’s where the text is coming out from), I get:

threads.rb:20:in `<main>': undefined method `force_encoding' for #<Array:0xb3aae0> (NoMethodError)
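
Both NoMethodErrors above point the same way: force_encoding and encode are String methods, so they fail on containers such as an Array or a Nokogiri NodeSet. A rough sketch of one way around that, using a hypothetical helper named to_utf8 and made-up values (it does not use Nokogiri itself): turn each value into a String first, then transcode it.

```ruby
# Made-up values standing in for scraped fields: an ISO-8859-1 string,
# a plain ASCII string, and a non-String object.
values = ["caf\xE9".b.force_encoding("ISO-8859-1"), "plain", 42]

# Hypothetical helper (not part of any library): convert anything to a
# String, then transcode that String to UTF-8.
def to_utf8(value)
  value.to_s.encode("UTF-8")
end

# Calling the String method on the container fails, as in the traces above.
begin
  values.force_encoding("UTF-8")
rescue NoMethodError => e
  err_message = e.message
  puts err_message
end

# Calling it on each element works.
converted = values.map { |v| to_utf8(v) }
puts converted.inspect
```

For a NodeSet, something like postcontent.text (or postcontent.to_s for the markup) should yield a String that such a helper can take; which of the two is wanted depends on whether the HTML tags should end up in the CSV.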

2012/9/1 Allan A.:

Bartosz Dziewoński wrote in post #1074162:

All I can recommend now is to add .encode('utf-8') (or
.force_encoding('utf-8')) everywhere you accept external input.

I’ve tried that and I must be doing something wrong, because I keep
getting undefined method errors when I do.

Uh, that’s certainly bad. Either you’re not using Ruby 1.9 (that would
be weird…), or the objects you are operating on are not the objects
you think they are (you should get information about them in the error
message, e.g. “NoMethodError: undefined method `encode' for
5:Fixnum”).

I skimmed the code; possible places where you are not setting the
encoding are 'files = CSV.read("files1.csv")' or 'doc =
Nokogiri::HTML(open(files[index][0]))'. Also, this is different code
from the one that gives the error in the first post (there is no line 77
in it).

– Matma R.
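
As an illustration of setting the encoding at the point of input, here is a small sketch for the CSV side. It assumes, purely for the example, that the list file was saved as ISO-8859-1; the temporary file and its contents are made up. The encoding option of CSV.read takes an "external:internal" pair, so the strings come back already transcoded.

```ruby
require "csv"
require "tempfile"

# Made-up input: a one-column list written as ISO-8859-1 bytes.
file = Tempfile.new(["files", ".csv"])
file.binmode
file.write("caf\xE9.html\n".b)
file.close

# "ISO-8859-1:UTF-8" reads the file as ISO-8859-1 and returns UTF-8 strings.
rows = CSV.read(file.path, encoding: "ISO-8859-1:UTF-8")
name = rows[0][0]
puts name
puts name.encoding

file.unlink
```

Nokogiri::HTML also accepts an encoding as its third argument, which may be worth trying for pages whose declared charset is missing or wrong; that part is not exercised here.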

This is the error from the above code. This is after about 1390 lines
being written to the output.csv file:

F:/ruby/lib/ruby/1.9.1/csv.rb:1729:in `join': incompatible character encodings: UTF-8 and ISO-8859-1 (Encoding::CompatibilityError)
from F:/ruby/lib/ruby/1.9.1/csv.rb:1729:in `<<'
from threads2.rb:87:in `block (2 levels) in <main>'
from threads2.rb:78:in `each'
from threads2.rb:78:in `block in <main>'
from F:/ruby/lib/ruby/1.9.1/csv.rb:1354:in `open'
from threads2.rb:75:in `<main>'

I checked the thread that was number 1391 in the source list and it has a
double quote in the title (6"). That’s probably breaking it, right? Do I
have to somehow escape/filter it, or is there some other setting that
will help?
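
On the double-quote question, a quick check suggests the quote itself is unlikely to be the culprit: the standard csv library quotes such fields and doubles the embedded quote when writing. The title below is made up for the test.

```ruby
require "csv"

# Made-up title containing a double quote, like the 6" in thread 1391.
line = CSV.generate_line(['Monitor 6" stand', "plain"])
puts line

# Reading the line back returns the original field unchanged.
parsed = CSV.parse_line(line)
puts parsed.inspect
```

Since the exception complains about encodings rather than malformed CSV, it seems more likely that the same thread contains a non-ASCII character in a field that is still tagged as ISO-8859-1.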