Reading an UTF-8 encoded file

Une_BSSSSvue · March 10, 2010, 3:40pm

if i read a UTF-8 encoded file and output it to the terminal, i do not get
the same result with ruby 1.8.x and ruby 1.9

with 1.8 i get “é” correctly, with 1.9 i get the wrong “Ã©” even if i
specify the encoding with:
open(FILE, "r:UTF-8") do …

what did i misunderstand?

Une_BSSSSvue · March 10, 2010, 10:19pm

Une Bévue wrote:

if i read and output to terminal an UTF-8 encoded file, i do not have
the same result with ruby 1.8.x and ruby 1.9

with 1.8 i get “ï¿½” correctly, with 1.9 i get it wrong “Ã©” even if i
specify the encoding by :
open(FILE, "r:UTF-8") do …

In my web browser onto ruby-forum, I see what you say is the “correct”
symbol as invalid above, and the “wrong” symbol is a valid one.

Are you in irb, or running code in a .rb file? Are you using “puts” or
are you looking at the string values as returned by irb, after the =>
prompt?

In either case, show your actual code. Beware that things behave
strangely in irb with 1.9. Some of the oddities I noticed in irb are
documented in

github.com/candlerb/string19/blob/master/string19.rb

from about line 1648.

what did i misunderstand?

Remember that encodings by themselves don’t actually change the sequence
of bytes. If your code is something like this:

open("somefile.txt") do |f|
  while line = f.gets
    puts line
  end
end

and you run it as a .rb script, I would expect it to work the same in
both 1.8 and 1.9. That is, it should read lines and squirt them back out
to stdout unchanged. No transcoding is done. If they appear wrongly, it
would be because the encoding of the file contents is not the same as
the encoding of your terminal.

Furthermore, it makes no difference in 1.9 if you do this:

open("somefile.txt", "r:UTF-8") do |f|
  while line = f.gets
    puts line
  end
end

In ruby 1.9, all this means is that the string ‘line’ will be tagged as
being UTF-8, rather than some encoding picked up from the environment.
However by default, the same sequence of bytes will be squirted out.
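
To make the "tag only" point concrete, here is a minimal sketch, assuming stock Ruby 1.9 or later (not MacRuby) with Encoding.default_internal left unset. The temporary file and the helper names read_tagged and compare_tags are invented for the illustration.

```ruby
require "tempfile"

# Hypothetical helper: read the first line of +path+ with the given open
# mode and return its encoding tag together with its raw bytes.
def read_tagged(path, mode)
  File.open(path, mode) do |f|
    line = f.gets
    [line.encoding, line.bytes.to_a]
  end
end

# Hypothetical helper: write "vérité" as UTF-8 bytes to a temp file, then
# read it back once tagged as UTF-8 and once tagged as ISO-8859-1.
def compare_tags
  Tempfile.create("sig") do |tmp|
    tmp.binmode
    tmp.write("v\xC3\xA9rit\xC3\xA9\n")
    tmp.flush
    [read_tagged(tmp.path, "r:UTF-8"), read_tagged(tmp.path, "r:ISO-8859-1")]
  end
end

utf8, latin1 = compare_tags
p utf8.first, latin1.first     # the two tags differ
p utf8.last == latin1.last     # the bytes are the same
```

Only the label on the string changes between the two reads; nothing is converted, so whatever "puts" sends to the terminal is byte-for-byte what was in the file.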

However in 1.9 you can cause the string to be transcoded, if:

(1) you specify a different internal and external encoding when reading
the data (so it gets transcoded on input); or

(2) you specify an external encoding when writing the data (so it gets
transcoded on output)
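
The two cases above can be sketched as follows. This is only an illustration under assumptions: stock Ruby 1.9 or later, a throwaway temp file, and an invented helper name transcode_demo.

```ruby
require "tempfile"

# Hypothetical helper: returns the string obtained by transcoding on input
# and the bytes produced by transcoding on output.
def transcode_demo
  Tempfile.create("latin1") do |tmp|
    tmp.binmode
    tmp.write("\xE9\n".b)        # "é" stored as one ISO-8859-1 byte
    tmp.flush

    # (1) external ISO-8859-1, internal UTF-8: converted while reading
    on_input = File.open(tmp.path, "r:ISO-8859-1:UTF-8") { |f| f.gets }

    # (2) external ISO-8859-1 on a write handle: converted while writing
    File.open(tmp.path, "w:ISO-8859-1") { |f| f.write("\u00E9\n") }
    on_output = File.binread(tmp.path)

    [on_input, on_output.bytes.to_a]
  end
end

on_input, out_bytes = transcode_demo
p on_input.encoding   # the string read in case (1) is tagged UTF-8
p on_input.bytes.to_a # and its bytes have changed from E9 to C3 A9
p out_bytes           # case (2) stored the single byte E9 again
```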

HTH,

Brian

Une_BSSSSvue · March 11, 2010, 7:41am

Brian C. wrote:

Are you in irb, or running code in a .rb file? Are you using “puts” or
are you looking at the string values as returned by irb, after the =>
prompt?

In either case, show your actual code. Beware that things behave
strangely in irb with 1.9.

first, thanks for your reply )))

i’m not using irb but rather an rb file launched from Terminal

here is the code (ruby 1.9) :

#! /usr/local/bin/macruby
# encoding: utf-8

SIGNATURES_FILE = "/Users/yt/dev/Signature/signatures.txt"

open(SIGNATURES_FILE, "r:UTF-8") do |file|
  p file.internal_encoding
  file.each do |line|
    p [line.encoding.name, line]
  end
end
open(SIGNATURES_FILE) do |file|
  p file.internal_encoding
  file.each do |line|
    p [line.encoding.name, line]
  end
end

resulting in :
zsh-% ./essai_macruby.rb
nil
["US-ASCII", "-- \n"]
["US-ASCII", "Â« Un banquier est toujours en libertÃ© provisoire Â» \n"]
["US-ASCII", "(Henri PoincarÃ© )\n"]

…

["US-ASCII", "la minute de vÃ©ritÃ© risque de se faire longtemps
attendre. Â» \n"]
["US-ASCII", "(Pierre Dac)\n"]
nil
["US-ASCII", "-- \n"]
["US-ASCII", "Â« Un banquier est toujours en libertÃ© provisoire Â» \n"]
["US-ASCII", "(Henri PoincarÃ© )\n"]
[
…

["US-ASCII", "la minute de vÃ©ritÃ© risque de se faire longtemps
attendre. Â» \n"]
["US-ASCII", "(Pierre Dac)\n"]
zsh-%

so both methods (with and without “r:UTF-8”) see the file as
US-ASCII although it is really UTF-8 encoded.

now the “equivalent” test using ruby 1.8.* :

#! /usr/bin/env ruby

SIGNATURES_FILE = "/Users/yt/dev/Signature/signatures.txt"

open(SIGNATURES_FILE) do |file|
  file.each do |line|
    puts line
  end
end

run from Term :

zsh-% ./essai.rb

« Un banquier est toujours en liberté provisoire »
(Henri Poincaré )

…

« Pour ceux qui vont chercher midi à quatorze heures,
la minute de vérité risque de se faire longtemps attendre. »
(Pierre Dac)

accented chars are correct now; notice i have to use “puts” instead
of “p” to get the chars, otherwise i get the bytes escaped as
"v\303\251rit\303\251".

Une_BSSSSvue · March 11, 2010, 10:54am

Use ‘puts’ instead of ‘p’ and it may work. That is, I suspect
String#inspect is doing some mangling.

You really should look at your postings in ruby-forum:
http://www.ruby-forum.com/topic/205792

Wherever you say ruby 1.9 is giving the ‘wrong’ output it is correct,
and where you say ruby 1.8 is giving the ‘right’ output it is wrong. I
have a suspicion that there is a mismatch between the file content and
the terminal.
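
For what such a mismatch looks like in practice, here is a minimal sketch, assuming stock Ruby 1.9 or later. The helper name misread_as_latin1 is made up; it simulates a reader that takes UTF-8 bytes for ISO-8859-1.

```ruby
# encoding: UTF-8

# Hypothetical helper: relabel the UTF-8 bytes of +text+ as ISO-8859-1
# (no bytes change), then convert that misreading to UTF-8 for display.
def misread_as_latin1(text)
  text.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

puts misread_as_latin1("\u00E9")   # prints Ã©  (from é)
puts misread_as_latin1("\u00AB")   # prints Â«  (from «)
```

The two bytes C3 A9 that make up “é” in UTF-8 are the characters “Ã” and “©” in ISO-8859-1, which is the pattern seen in the outputs quoted in this thread.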

What if you just type “cat /Users/yt/dev/Signature/signatures.txt” at
the terminal?

accented chars are correct now; notice i have to use “puts” instead
of “p” to get the chars, otherwise i get the bytes escaped as
"v\303\251rit\303\251".

Yes, String#inspect in ruby 1.8 will mangle all values over 128 into
escaped form. String#inspect in ruby 1.9 behaves differently, and
doesn’t always mangle them.
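
As a small illustration, assuming stock Ruby 2.0 or later (for String#b) and a UTF-8 source file: the octal escapes that 1.8 prints are simply the UTF-8 bytes of “é”, so the escaped form and the readable form are the same string. The constant name ESCAPED is invented for the example.

```ruby
# encoding: UTF-8

# The string exactly as Ruby 1.8's inspect displayed it, with octal escapes.
ESCAPED = "v\303\251rit\303\251"

puts ESCAPED == "v\u00E9rit\u00E9"   # true: same bytes, same text
p ESCAPED.b                          # "v\xC3\xA9rit\xC3\xA9" (binary view, hex escapes)
```

How inspect renders the UTF-8 tagged string itself depends on the encodings in effect, which is why “p” and “puts” can disagree.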

However, I just noticed ‘macruby’ in your scripts. Are you actually
running MacRuby, or genuine Matz Ruby Interpreter 1.9 ? If it’s macruby
all bets are off - I thought it was a completely different interpreter
written from scratch. I have no Mac here to compare behaviour with, and
I have no idea what variation of 1.9 encoding rules MacRuby has
implemented.

In particular, I’m surprised that your program sees strings tagged as
“US-ASCII” rather than “UTF-8” when you explicitly opened the file with
external encoding of UTF-8. This makes me very suspicious of your actual
ruby platform.

Try adding this line to your code to get info about the Ruby platform:
p Object.constants.grep(/RUBY/).map { |n| [n, Object.const_get(n)] }

Regards,

Brian.

P.S. For comparison, here’s what I get with an oldish ruby pre-1.9.2
under Linux. Try these on your system.

File.open("/etc/passwd","r:ISO-8859-1").gets.encoding
=> #<Encoding:ISO-8859-1>
File.open("/etc/passwd","r:UTF-8").gets.encoding
=> #<Encoding:UTF-8>
RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"

Une_BSSSSvue · March 11, 2010, 2:50pm

Brian C. wrote:

What if you just type “cat /Users/yt/dev/Signature/signatures.txt” at
the terminal?

i got the correct chars :
zsh-% cat /Users/yt/dev/Signature/signatures.txt

« Un banquier est toujours en liberté provisoire »
(Henri Poincaré )

…

« Pour ceux qui vont chercher midi à quatorze heures,
la minute de vérité risque de se faire longtemps attendre. »
(Pierre Dac)

running MacRuby, or genuine Matz Ruby Interpreter 1.9 ? If it’s macruby
all bets are off - I thought it was a completely different interpreter
written from scratch. I have no Mac here to compare behaviour with, and
I have no idea what variation of 1.9 encoding rules MacRuby has
implemented.

In particular, I’m surprised that your program sees strings tagged as
“US-ASCII” rather than “UTF-8” when you explicitly opened the file with
external encoding of UTF-8. This makes me very suspicious of your actual
ruby platform.

right now, that’s to say using puts in place of p, i get the right
chars.
But those strings are still tagged as “US-ASCII”…

Try adding this line to your code to get info about the Ruby platform:
p Object.constants.grep(/RUBY/).map { |n| [n, Object.const_get(n)] }

it seems to be an “old” Ruby 1.9.0 :

[[:RUBY_VERSION, "1.9.0"], [:RUBY_RELEASE_DATE, "2008-06-03"],
[:RUBY_PLATFORM, "universal-darwin10.0"], [:RUBY_PATCHLEVEL, 0],
[:RUBY_REVISION, 0], [:RUBY_DESCRIPTION, "MacRuby version 0.5 (ruby
1.9.0) [universal-darwin10.0, x86_64]"], [:RUBY_COPYRIGHT, "MacRuby -
Copyright (C) 2007-2008 Apple Inc."], [:RUBY_ENGINE, "macruby"],
[:RUBY_ARCH, "x86_64"], [:MACRUBY_VERSION, "0.5"], [:MACRUBY_REVISION,
"svn revision 3380 from
http://svn.macosforge.org/repository/ruby/MacRuby/branches/0.5"]]

thanks again !

Une_BSSSSvue · March 11, 2010, 3:25pm

Une Bévue wrote:

right now, that’s to say using puts in place of p, i get the right
chars.
But those strings are still tagged as “US-ASCII”…

however, when doing some splitting with:

def get_signatures
  t = "".force_encoding("UTF-8")
  open(SIGNATURES_FILE, "r:UTF-8") do |file|
  #open(SIGNATURES_FILE) do |file|
    file.each do |line|
      t += line.force_encoding("UTF-8")
    end
  end
  #File.open(SIGNATURES_FILE, "r:UTF-8").each {|l| t += l }
  return t.split(NEEDLE)
end

(notice i’ve forced the encoding)

signatures = get_signatures
c = signatures.count
puts "Nombre de signatures : #{c}"

r = rand(c)
puts "Signature aléatoire (n° #{r}) :"
signature = NEEDLE + signatures[r]
puts signature

the output is wrong in that case ???

Nombre de signatures : 29
Signature aléatoire (n° 1) :

Â« Un banquier est toujours en libertÃ© provisoire Â»
(Henri PoincarÃ© )
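
For comparison only, here is a sketch of how the same split could be written on stock Ruby 1.9 or later, where File.read accepts an encoding option and no force_encoding is needed. It says nothing about MacRuby 0.5. The value of NEEDLE is not shown in this thread; "-- \n" is an assumption taken from the first line of the output above, and the sample file and the helper sample_signatures are invented.

```ruby
require "tempfile"

NEEDLE = "-- \n"   # assumption: the thread never shows the real value

# Read the whole file tagged as UTF-8 and split it on the separator.
def get_signatures(path)
  File.read(path, encoding: "UTF-8").split(NEEDLE)
end

# Hypothetical helper: build a two-signature sample file and split it.
def sample_signatures
  Tempfile.create("signatures") do |tmp|
    tmp.write("-- \n\u00AB A \u00BB\n-- \n\u00AB B \u00BB\n")
    tmp.flush
    get_signatures(tmp.path)
  end
end

signatures = sample_signatures
p signatures.map { |s| s.encoding.name }.uniq   # every piece is tagged UTF-8
puts NEEDLE + signatures[1]
```

Because the file starts with the separator, the first element is an empty string, which would explain why signature n° 1 is the first real one.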

Une_BSSSSvue · March 11, 2010, 3:27pm

Brian C. wrote:

Ah, so it’s not ruby 1.9, it’s macruby 0.5. I’m afraid I’ll have to
defer to the Mac experts here.

ok i’ll ask there even if it is based upon ruby 1.9 :
[:RUBY_VERSION, “1.9.0”], [:RUBY_RELEASE_DATE, “2008-06-03”]

Une_BSSSSvue · March 11, 2010, 4:07pm

If forcing the encoding of individual strings makes Ruby output them
differently, then I expect that STDOUT must have an external encoding
set.

Try this:

p STDOUT.external_encoding

What do you get? If you get something other than nil, it means that puts
will transcode characters from the tagged encoding to this encoding.

In ruby 1.9, STDOUT.external_encoding is nil unless you set it
explicitly.

STDOUT.external_encoding
=> nil

Encoding.default_external
=> #<Encoding:UTF-8>

RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"

Perhaps the version of 1.9.0 which macruby forked from was different in
this regard though.
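
A rough sketch of the effect described above, assuming stock Ruby 1.9 or later and using a temp file instead of STDOUT so the bytes can be inspected. The helper name write_with_external is invented.

```ruby
require "tempfile"

# Hypothetical helper: write "é" (UTF-8) through a write handle, optionally
# giving the handle an external encoding first. Returns what
# external_encoding reported and the bytes that reached the file.
def write_with_external(encoding_name)
  Tempfile.create("out") do |tmp|
    reported = nil
    File.open(tmp.path, "w") do |f|
      f.set_encoding(encoding_name) if encoding_name
      reported = f.external_encoding
      f.write("\u00E9")
    end
    [reported, File.binread(tmp.path).bytes.to_a]
  end
end

p write_with_external(nil)            # no external encoding: bytes pass through
p write_with_external("ISO-8859-1")   # external encoding set: transcoded on write
```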

Une_BSSSSvue · March 11, 2010, 3:17pm

Ah, so it’s not ruby 1.9, it’s macruby 0.5. I’m afraid I’ll have to
defer to the Mac experts here.

I see there that macruby has its own mailing lists:
http://www.macruby.org/contact-us.html

Une_BSSSSvue · March 11, 2010, 5:55pm

Brian C. wrote:

Perhaps the version of 1.9.0 which macruby forked from was different in
this regard though.

Perfectly right, because i read on a MacRuby web page
(http://www.macruby.org/documentation/overview.html) :

Primitives Classes
The primitive Ruby classes (String, Array, and Hash) have been
re-implemented on top of their Cocoa equivalents (respectively,
NSString, NSArray, and NSDictionary).

As an example, String is no longer a class, but a pointer (alias) to
NSMutableString. All strings in MacRuby are genuine Cocoa strings and
can be passed (without conversion) to underlying C or Objective-C APIs
that expect Cocoa strings.

The whole String interface was re-implemented on top of NSString. This
means that you can call any method of String on any Cocoa string.
Because Cocoa strings can be either mutable and immutable, if you try to
call a method that is supposed to modify its receiver on an immutable
string, a runtime exception will be raised.

Une_BSSSSvue · March 11, 2010, 6:15pm

Une Bévue wrote:

Brian C. wrote:

Perhaps the version of 1.9.0 which macruby forked from was different in
this regard though.

Perfectly right, because i read on a MacRuby web page
(http://www.macruby.org/documentation/overview.html) :

Then i’ve installed ruby 1.9 :
zsh-% ruby1.9 -v
ruby 1.9.1p376 (2009-12-07 revision 26041) [i386-darwin10]

and, when using it without forcing the encoding i get the right chars…

then i’m sure the prob comes from MacRuby.

Une_BSSSSvue · March 11, 2010, 5:43pm

Brian C. wrote:

If forcing the encoding of individual strings makes Ruby output them
differently, then I expect that STDOUT must have an external encoding
set.

Try this:

p STDOUT.external_encoding

What do you get? If you get something other than nil, it means that puts
will transcode characters from the tagged encoding to this encoding.

I got nil ))

Perhaps the version of 1.9.0 which macruby forked from was different in
this regard though.

yes right, I’ll ask to the MacRuby list.

In fact, i also have a builtin ruby 1.8.x but i’d rather make use of
1.9 because i do have to count UTF-8 chars and i know this is internal
with ruby 1.9 and because i might design an UI it’s better using MacRuby
because it is written on top of Obj-C and Cocoa.
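
A tiny illustration of that difference, assuming stock Ruby 1.9 or later; the constant name WORD is invented. Relabelling the string as binary mimics the byte-oriented view that 1.8 has.

```ruby
# encoding: UTF-8

WORD = "v\u00E9rit\u00E9"   # "vérité"

puts WORD.length                              # 6 characters
puts WORD.bytesize                            # 8 bytes
puts WORD.dup.force_encoding("BINARY").length # 8: length counted in bytes, as in 1.8
```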

Une_BSSSSvue · March 11, 2010, 11:00pm

Brian C. wrote:

Perhaps the version of 1.9.0 which macruby forked from was different in
this regard though.

yes, i got this answer from the MacRuby list :
1.9 encodings in trunk have very little support for now, but we
significantly improved them in a branch that might get merged into trunk
in a few days (maybe today). I will post an update here once it’s done.
