Problems with Accent in Ruby 1.9+ (Latin Characters)

dubstep · August 11, 2011, 12:43pm

Hi folks,

First post here, thanks for the help.

I’m seeing strange behavior with my Ruby 1.9.2 code. When I try to print
Latin characters with accents, I get the following error:

incompatible character encodings: UTF-8 and IBM437

The code follows below, and a screenshot is attached:

-----------BEGIN OF CODE--------

# encoding: UTF-8
# Html Inspect
# Read an HTML page
# Count the number of lines
# Count the number of words
# Count the number of characters
# Find the page title
# Find the subtitles
# Show a summary of the page
arquivo = File.readlines("pagina.html")
linhas = arquivo.size
pagina = arquivo.join
quantidade_palavras = pagina.split.length
quantidade_caracteres = pagina.length
titulo = []
arquivo.each do |line|
  if line.match(//)
    titulo << line
  end
end
puts "#{linhas} linhas"
puts "#{quantidade_palavras} palavras"
puts "#{quantidade_caracteres} caracteres"
puts "Título: #{titulo.join}"
----------END OF CODE----------

----------BEGIN OF OUTPUT------
C:\Ruby192\bin\ruby.exe -e $stdout.sync=true;$stderr.sync=true;load($0=ARGV.shift) C:/Users/Sandbox.Marco-PC/RubymineProjects/TempApp/main.rb
C:/Users/Sandbox.Marco-PC/RubymineProjects/TempApp/main.rb:25:in `<top (required)>': incompatible character encodings: UTF-8 and IBM437 (Encoding::CompatibilityError)
from -e:1:in `load'
from -e:1:in `<main>'
1483 linhas
25939 palavras
354934 caracteres
Process finished with exit code 1
--------END OF OUTPUT----------

mffreire · August 11, 2011, 1:01pm

Your text is in US-ASCII.
Try one of these:

put "# encoding: ASCII" at the beginning of the source file
or
convert your text to UTF-8
- Notepad++ (set encoding ASCII / convert to UTF-8)
- iconv

Test your execution with
p __ENCODING__

Good luck

mffreire · August 11, 2011, 1:01pm

On Thu, Aug 11, 2011 at 12:43 PM, Marco Floriano wrote:

I’m seeing strange behavior with my Ruby 1.9.2 code. When I try to print
Latin characters with accents, I get the following error:

incompatible character encodings: UTF-8 and IBM437

The terminal used cannot display UTF-8 characters.

That’s a limitation of the terminal your IDE is using. In a normal
command prompt, you can use the “chcp” command to see and change the
current codepage (like IBM437, or Windows-1250/1251). Which one you
need depends on your language, but most Western European languages are
served with codepage 1252.
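As an alternative to switching the codepage, a script can transcode its own output to whatever the console expects. The following is only a sketch, assuming the console uses IBM437 (the codepage named in the error); the helper name for_console is made up for illustration and is not part of the original program.

```ruby
# Hypothetical helper: convert a UTF-8 string to the console's codepage,
# substituting "?" for characters that codepage cannot represent.
def for_console(text, codepage = "IBM437")
  text.encode(codepage, :invalid => :replace, :undef => :replace, :replace => "?")
end

# "Título"; a \u escape always produces a UTF-8 string.
titulo = "T\u00EDtulo"
converted = for_console(titulo)

p converted.encoding      # the encoding the string is now tagged with
p converted.bytes.to_a    # byte 0xA1 (161) is "í" in codepage 437
```

The assumed codepage is a default argument, so a different console setting only needs a different second argument.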

I think there’s a way to change the codepage permanently, but I’m not
sure how.

–
Phillip G.

phgaw.posterous.com | twitter.com/phgaw | gplus.to/phgaw

A method of solution is perfect if we can foresee from the start,
and even prove, that following that method we shall attain our aim.
– Leibniz

mffreire · August 11, 2011, 1:10pm

On Thu, Aug 11, 2011 at 1:02 PM, Regis d’Aubarede wrote:

Your text is in US-ASCII.

Accented characters such as “í” aren’t part of ASCII.

–
Phillip G.


mffreire · August 11, 2011, 5:40pm

Marco Floriano wrote in post #1016136:

C:\Ruby192\bin\ruby.exe -e $stdout.sync=true;
$stderr.sync=true;load($0=ARGV.shift) C:/Users/Sandbox.Marco-PC/
RubymineProjects/TempApp/main.rb
C:/Users/Sandbox.Marco-PC/RubymineProjects/TempApp/main.rb:25:in `<top
(required)>': incompatible character encodings: UTF-8 and IBM437
(Encoding::CompatibilityError)

Well, here it says the problem is in line 25 of main.rb. The screenshot
said it was in line 32 of main.rb.

So the first thing is: can you identify what line that is? The code you
pasted in had only 25 lines.

The rules for encodings in ruby 1.9.x are labyrinthine. I tried to
understand them once: see

https://github.com/candlerb/string19/blob/master/string19.rb


However it appears that your application is using UTF-8 for the source
encoding (for string literals), and IBM437 for text read from files, and
then it sometimes crashes when these two meet.
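A small sketch of that collision, which also shows why it only happens sometimes. The sample bytes and the helper name combine are hypothetical; the IBM437 tag is forced by hand here to imitate what the file read did in the original program.

```ruby
# Bytes as they might come from a file that Ruby tagged IBM437.
# Hypothetical sample data: 0xA1 is "í" in codepage 437.
from_file = "T\xA1tulo".b.force_encoding("IBM437")

# A UTF-8 string, like a literal in a source file declared "# encoding: UTF-8".
# The \u escape makes it UTF-8 regardless of this snippet's own encoding.
label = "T\u00EDtulo: "

# Returns the combined string, or a description of the error if the
# two encodings cannot be mixed.
def combine(a, b)
  a + b
rescue Encoding::CompatibilityError => e
  "#{e.class}: #{e.message}"
end

clash = combine(label, from_file)       # both sides contain non-ASCII characters
plain = combine("Title: ", from_file)   # an ASCII-only left side is compatible

puts clash
puts plain.encoding
```

When one side is pure ASCII, Ruby treats the two strings as compatible, so the error appears only once accented characters show up on both sides.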

The source encoding is UTF-8 because you declared it thus in the
#encoding line. This part at least is sane.

The encoding for data read from files is IBM437 because ruby has
silently guessed this, based on settings in your environment.
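To see what Ruby has guessed on a given machine, something like the following can be run. It is a sketch that uses a scratch file instead of pagina.html, and it assumes Encoding.default_internal has not been set (the usual case).

```ruby
require "tempfile"

# Write the UTF-8 bytes of "Título" to a scratch file, then read them back
# without telling Ruby which encoding to use.
encoding_of_line = nil
Tempfile.create("pagina") do |f|
  f.binmode
  f.write("T\xC3\xADtulo\n".b)
  f.flush
  encoding_of_line = File.readlines(f.path).first.encoding
end

# With no Encoding.default_internal set, strings read from files are tagged
# with Encoding.default_external, which Ruby derives from the environment.
p Encoding.default_external
p encoding_of_line
```

The bytes in the file are the same on every machine; only the tag Ruby attaches to them changes with the environment.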

You can either meddle with your environment to fix the problem - however
your program will then work on your machine but may not work on someone
else’s. Or you can change your code to something like this:

arquivo = File.readlines("pagina.html", :encoding=>"UTF-8")

Unfortunately, ruby doesn’t “crash early” for errors like this. Without
this incantation, your program will sometimes work and sometimes crash,
depending on the data you feed into it. I cannot stand ruby 1.9 for this
reason. Fortunately for me, ruby 1.8 still exists.
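A sketch of the explicit-encoding read in action, again with a scratch file standing in for pagina.html. The file contents are made-up sample data.

```ruby
require "tempfile"

lines = nil
Tempfile.create("pagina") do |f|
  f.binmode
  # Hypothetical page content; the bytes C3 AD are "í" in UTF-8.
  f.write("<title>T\xC3\xADtulo</title>\n".b)
  f.flush
  # Tag everything read from the file as UTF-8, whatever the environment says.
  lines = File.readlines(f.path, :encoding => "UTF-8")
end

first = lines.first
p first.encoding
p first.valid_encoding?

# Interpolating into another UTF-8 string now works.
puts "T\u00EDtulo: #{first}"
```

This only labels the bytes as UTF-8; it assumes the file really was saved as UTF-8, which valid_encoding? can help confirm.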

mffreire · August 11, 2011, 9:08pm

I think that Phillip G. is right.
I tried to reproduce this issue, but I only get an
“invalid multibyte char” exception:

invalid multibyte char (US-ASCII) (SyntaxError)
encoding.rb:2: invalid multibyte char (US-ASCII)

The “incompatible character encodings: UTF-8 and IBM437” error seems
to happen at the “puts” to stdout.

On Windows, I do all my work with 1.9.2 in UTF-8, and I have forgotten about 1.8…
Issues arise when I test compatibility with JRuby!

mffreire · August 12, 2011, 2:36am

Regis d’Aubarede wrote in post #1016237:

I think that Phillip G. is right.
I tried to reproduce this issue, but I only get an
“invalid multibyte char” exception:

invalid multibyte char (US-ASCII) (SyntaxError)
encoding.rb:2: invalid multibyte char (US-ASCII)

The “incompatible character encodings: UTF-8 and IBM437” error seems
to happen at the “puts” to stdout.

On Windows, I do all my work with 1.9.2 in UTF-8, and I have forgotten about 1.8…
Issues arise when I test compatibility with JRuby!

Have you tried running your script directly from the command line?

But before that, run: chcp 1252

That will activate the Latin (Western European) codepage, which can
represent the accented characters in your UTF-8 strings.

You can permanently change that with a registry edit.

Hope that helps

Luis L.

mffreire · August 12, 2011, 10:22am

Regis d’Aubarede wrote in post #1016237:

I think that Phillip G. is right.
I tried to reproduce this issue, but I only get an
“invalid multibyte char” exception:

invalid multibyte char (US-ASCII) (SyntaxError)
encoding.rb:2: invalid multibyte char (US-ASCII)

That’s a syntax error, i.e. your program itself can’t be parsed. In
this case, I think you have an accented character in a string literal,
but you have forgotten to put

#encoding: UTF-8

at the top of your file (did you remove it?). Without such a line, the
source encoding defaults to US-ASCII, which means that any character
with the top bit set is forbidden entirely.

One of the problems with ruby 1.9 and encodings is that nobody
understands it properly, and as a result lots of people give incorrect
advice. Of course, you can just try making random changes to your
program in the hope that it will help, and sometimes you will stumble on
something which works.

But, let’s assume you don’t want to do it by guesswork.

So:

(1) Put #encoding: UTF-8 as the first line of your file (or the second,
if you have a Unix shebang line). Then all your string literals will be
tagged UTF-8.

(2) Use :encoding=>"UTF-8" when you open or read any file. Then all
your read-in strings will be tagged UTF-8.

(3) Then you can concatenate these strings, or interpolate one into the
other, and it will work. ruby 1.9 may still crash, but in fewer
circumstances (e.g. if you try a regexp match against your string and
the string contains invalid UTF-8 characters). There are some operations
you can do on invalid strings successfully, and some which crash. See
the link on github I gave before.
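For the remaining case in step (3), one possible guard is to check a string before matching against it. This is only a sketch; the helper name safe_match and the sample byte sequences are hypothetical.

```ruby
# Hypothetical helper: run a regexp only on strings whose bytes are valid
# for their tagged encoding; return nil otherwise instead of risking an error.
def safe_match(text, pattern)
  return nil unless text.valid_encoding?
  text.match(pattern)
end

good   = "T\xC3\xADtulo".b.force_encoding("UTF-8")   # valid UTF-8 bytes for "Título"
broken = "T\xEDtulo".b.force_encoding("UTF-8")       # a lone 0xED byte is not valid UTF-8

p good.valid_encoding?
p broken.valid_encoding?
p safe_match(good, /tulo/)
p safe_match(broken, /tulo/)
```

A string like broken typically comes from a file saved in a single-byte encoding and then read with :encoding=>"UTF-8", so a false result from valid_encoding? is a hint that the file is not UTF-8 after all.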

On Windows, I do all my work with 1.9.2 in UTF-8, and I have forgotten about 1.8…
Issues arise when I test compatibility with JRuby!

JRuby has both 1.8 and 1.9 compatibility modes, or at least it did the
last time I looked at it a while back.