Encoding question

aris · September 17, 2012, 11:52pm

I am new to Ruby and am playing around with it a bit at the moment. I
have a large text file containing data with French accents and German
umlauts. The content of this file (some hundred thousand lines) should be
stored in a table in a Postgres database. When I open the file on
Windows with an editor called Notepad++ it displays the data correctly.
When I look at the output from File.foreach(…) { |line| puts line }, I get
garbage for any non-ASCII character. When I try to store the records in
Postgres I get an error as soon as data with non-ASCII characters
should be inserted.

I use RubyMine as IDE and receive the following output with the
following code:

File.foreach("somefile.txt") do |line|
  if counter > 0 then
    record = line.split(";")
    @az_addidnr = record[1]
    az_chnr = record[2]
    az_adr1 = record[6]
    puts "record data: #{@az_addidnr} | #{az_chnr} | #{az_adr1}"
    conn.exec_prepared('stmt1', [@az_addidnr, az_chnr, az_adr1])
  end
end

OUTPUT:

record data: 512999 | CH21702301867 | Garage de la Moli�re SA
Uncaught exception: FEHLER: ungültige Byte-Sequenz für Kodierung
»UTF8«: 0xe87265
(in English: ERROR: invalid byte sequence for encoding "UTF8": 0xe87265)

I also tried az_adr1 = record[6].encode("ISO-8859-1")

If I try az_adr1 = record[6].encode("ASCII") I get:
Uncaught exception: U+00DE to US-ASCII in conversion from CP850 to UTF-8
to US-ASCII

Could anybody please explain the following to me:
How can I find out what encoding is used in a text file?
What kind of conversion do I need to a) get correct output and b)
be able to insert the record into PostgreSQL?

Many thanks for your help.

Tom

tbaustert · September 18, 2012, 2:26am

In the code example you used, there’s no external encoding being
defined, so the default is being used, which is generally the default
encoding of the operating system (windows is likely Windows-1252, mac
is likely MacRoman, linux is likely UTF-8). Based on your comments,
the default encoding is not appropriate.

There’s not really a great way to determine the encoding of a file.
Generally the encoding is well defined by some standard, contract or
other out-of-band mechanism. Once you do know the encoding, the proper
way to open up a file is to declare the external encoding of the file,
like this -

f = File.open('somefile.txt', 'r:iso-8859-1')

Then when you read content from the file, the data in the file will be
transcoded from ‘iso-8859-1’ to the default internal encoding of the
Ruby interpreter instance (generally UTF-8). You can define what
internal encoding to use by further qualifying the open, like this -

f = File.open('somefile.txt', 'r:iso-8859-1:utf-8')

This will open the file with an external encoding of iso-8859-1 and an
internal encoding of utf-8.

Check out this article for some more information -
Working with Encodings in Ruby 1.9.

tbaustert · September 18, 2012, 9:21am

you could use it with foreach too:

File.foreach("somefile.txt", mode: "r:iso-8859-1:utf-8") do
…
end
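
For illustration, here is a minimal, self-contained sketch of that foreach call. It first writes a small ISO-8859-1 sample file so it can be run anywhere; the sample line, the temp file and the helper name read_records are assumptions made up for this example, not the original data.

```ruby
require "tempfile"

# Hypothetical helper: read a semicolon-separated file stored as ISO-8859-1
# and return its fields as UTF-8 strings.
def read_records(path)
  records = []
  File.foreach(path, mode: "r:iso-8859-1:utf-8") do |line|
    records << line.chomp.split(";")
  end
  records
end

# Build a sample file containing the byte 0xE8, which is "è" in ISO-8859-1.
sample = Tempfile.new("somefile")
sample.binmode
sample.write("512999;Garage de la Moli\xE8re SA\n".b)
sample.close

read_records(sample.path).each do |record|
  puts "record data: #{record[0]} | #{record[1]}"
end
```

Because the fields arrive as UTF-8, they should be acceptable to a database connection that expects UTF-8, provided the file really is ISO-8859-1.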

tbaustert · September 18, 2012, 7:29pm

I don’t know yet, how to determine the encoding of those uploaded
files…

Well, imagine you are interviewing for a job as a translator. The
employer has some documents he wants you to translate into English, and
he wants to hire you to translate them. Before you accept the job, you
ask, “What languages are the documents written in?” The employer
replies, “All kinds of different languages! Can you finish by next
week?” How would you respond to that answer? Wouldn’t you think to
yourself, “This guy is crazy.”

The bottom line is: you have to know what ‘language’ the document you
are reading is written in, so that you can ‘hire’ the appropriate
translator, i.e. tell ruby what encoding to use.

tbaustert · September 18, 2012, 8:54am

Hi Nathan,
Thanks for your help. It is a bit of a trial-and-error thing! This worked
for me:

f = File.open("somefile.txt", "r:iso-8859-1:utf-8")
f.each do |line|
…
end

In real life I will have to process files (CSV) that are uploaded from
anywhere, produced by any system (Mac, Linux, Windows) using whatever
software like different versions of Excel, OpenOffice Calc or extracts
from databases…

I don’t know yet, how to determine the encoding of those uploaded
files…

Tom

tbaustert · September 19, 2012, 8:57am

Thank you all for your helpful hints and comments!

tbaustert · September 19, 2012, 2:58am

On Tue, Sep 18, 2012 at 1:54 AM, Thomas B. wrote:

anywhere, produced by any system (Mac, Linux, Windows) using whatever
software like different versions of Excel, OpenOffice Calc or extracts
from databases…

If it’s an upload via HTTP, you might get lucky and get a ‘charset’ in
the Content-Type header. You might also be able to make a good guess
based on the operating systems and the applications that create the
files. For example, if the CSV is from Windows and Excel, it’s likely
that the encoding is Windows-1252. On that example, it’s important to
note that Windows-1252 matches ISO-8859-1 except in the byte range
0x80-0x9F, where ISO-8859-1 has control codes and Windows-1252 has
printable characters such as the euro sign and curly quotes. So if you
parse a file as ISO-8859-1 and it works most of the time for files from
Windows+Excel, but the output occasionally contains odd or missing
characters, it’s because the CSV uses some of those few characters
that differ.

Check out Windows-1252 - Wikipedia and
ISO/IEC 8859-1 - Wikipedia.
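
One common fallback when nothing else is known can be sketched like this: treat the bytes as UTF-8 if they are valid UTF-8, otherwise assume Windows-1252. This is only a heuristic under that stated assumption, not real detection, and the helper name to_utf8_guess is made up for the example.

```ruby
# Hypothetical helper: return the given raw bytes as a UTF-8 string,
# guessing the source encoding. A heuristic, not real detection.
def to_utf8_guess(bytes)
  utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Not valid UTF-8: assume Windows-1252 (an assumption that may be wrong).
  # Bytes that Windows-1252 leaves undefined become replacement characters.
  bytes.dup.force_encoding(Encoding::Windows_1252)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

puts to_utf8_guess("Moli\xC3\xA8re".b)  # bytes that are already UTF-8
puts to_utf8_guess("Moli\xE8re".b)      # same byte in Latin-1 and Windows-1252
puts to_utf8_guess("5 \x80".b)          # 0x80 is the euro sign in Windows-1252
```

The fallback cannot tell Windows-1252 apart from other single-byte encodings such as MacRoman, so an out-of-band hint (a charset header, or knowledge of the producing application) should still win when available.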

tbaustert · September 20, 2012, 9:58am

Nathan B. wrote in post #1076380:

Once you do know the encoding, the proper
way to open up a file is to declare the external encoding of the file,
like this -

f = File.open('somefile.txt', 'r:iso-8859-1')

That is correct.

Then when you read content from the file, the data in the file will be
transcoded from ‘iso-8859-1’ to the default internal encoding of the
Ruby interpreter instance (generally UTF-8).

That is incorrect. But encodings in ruby are such a damned mess that I’m
not surprised that many people don’t understand it.

No transcoding takes place in the above example, regardless of
where it runs; the data is read into the string as-is. However, every
String which you read from the file using f.gets or f.getc is marked to
say that it is encoded using UTF-8. (But not Strings which you read
using f.read)

You can define what
internal encoding to use by further qualifying the open, like this -

f = File.open('somefile.txt', 'r:iso-8859-1:utf-8')

Yes, in that case, ruby will transcode.

Check out this article for some more information -
Working with Encodings in Ruby 1.9.

For as many of the gory details as I managed to work out before giving
up, you can also try

github.com/candlerb/string19/blob/master/string19.rb

#!/usr/bin/env ruby19
# encoding: UTF-8
# This document is Copyright (C) Brian Candler 2009 and released under a
# Creative Commons Attribution-NonCommercial 3.0 Unported License.

############# CONTENTS ###################

# -1. PREAMBLE
#  0. INTRODUCTION
#  1. ENCODINGS
#  2. PROPERTIES OF ENCODINGS
#  3. STRING, FILE AND REGEXP ENCODINGS
#  4. VALID AND FIXED ENCODINGS
#  5. COMPATIBLE OBJECTS
#  6. STRING CONCATENATION
#  7. THE BINARY / ASCII-8BIT ENCODING
#  8. SINGLE CHARACTERS
#  9. EQUALITY AND COLLATION
# 10. HASH AND EQL?
# 11. UPPER AND LOWER CASE


tbaustert · September 20, 2012, 11:28pm

On Tue, Sep 18, 2012 at 8:54 AM, Thomas B. wrote:

I don’t know yet, how to determine the encoding of those uploaded
files…

If the encoding is unknown you can only make a guess. Have a look at

tbaustert · September 20, 2012, 10:03am

Brian C. wrote in post #1076773:

f = File.open('somefile.txt', 'r:iso-8859-1')
…
No transcoding takes place in the above example, regardless of
where it runs; the data is read into the string as-is. However, every
String which you read from the file using f.gets or f.getc is marked to
say that it is encoded using UTF-8.

I meant to say ISO-8859-1 not UTF-8 there.

The point remains: the string keeps the same sequence of bytes, i.e. the
ISO-8859-1 encoded characters. If you want it transcoded to UTF-8, you
have to ask for this by giving a second encoding.
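
A small sketch may make the difference between the two open modes concrete. It assumes Encoding.default_internal is unset, as it is for a plain ruby invocation; the temp file and the helper name first_char_info are made up for the example.

```ruby
require "tempfile"

# Hypothetical helper: read the first character of `path` with the given
# open mode and return its bytes together with its encoding tag.
def first_char_info(path, mode)
  File.open(path, mode) do |f|
    c = f.getc
    [c.bytes.to_a, c.encoding]
  end
end

file = Tempfile.new("latin1")
file.binmode
file.write("\xE8".b)  # "è" stored as a single ISO-8859-1 byte
file.close

p first_char_info(file.path, "r:iso-8859-1")        # external encoding only
p first_char_info(file.path, "r:iso-8859-1:utf-8")  # external and internal
```

Under that assumption, the first call should report the single byte 232 tagged ISO-8859-1, and the second the two bytes 195 and 168 tagged UTF-8. If Encoding.default_internal has been set (for example with the -E switch), the first call may transcode as well.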

tbaustert · September 21, 2012, 1:47am

On Thu, Sep 20, 2012 at 2:58 AM, Brian C. wrote:

transcoded from ‘iso-8859-1’ to the default internal encoding of the
Ruby interpreter instance (generally UTF-8).

That is incorrect. But encodings in ruby are such a damned mess that I’m
not surprised that many people don’t understand it.

Indeed …

1.9.3-p194 :015 > f = File.open('hello.txt', 'r:windows-1252')
=> #<File:hello.txt>
1.9.3-p194 :016 > c = f.read
=> "hello world!"
1.9.3-p194 :017 > c.encoding
=> #<Encoding:Windows-1252>

Looking at the documentation, I see what I missed
(Class: IO (Ruby 1.9.3)). I suppose that
makes sense.

No transcoding takes place in the above example, regardless of
where it runs; the data is read into the string as-is. However, every
String which you read from the file using f.gets or f.getc is marked to
say that it is encoded using UTF-8. (But not Strings which you read
using f.read)

I’m not seeing this behavior.

1.9.3-p194 :001 > f = File.open('hello.txt', 'r:windows-1252')
=> #<File:hello.txt>
1.9.3-p194 :002 > c = f.getc
=> "h"
1.9.3-p194 :003 > c.encoding
=> #<Encoding:Windows-1252>

tbaustert · September 21, 2012, 3:27pm

Nathan B. wrote in post #1076883:

No transcoding takes place in the above example, regardless of
where it runs; the data is read into the string as-is. However, every
String which you read from the file using f.gets or f.getc is marked to
say that it is encoded using UTF-8. (But not Strings which you read
using f.read)

I’m not seeing this behavior.

1.9.3-p194 :001 > f = File.open('hello.txt', 'r:windows-1252')
=> #<File:hello.txt>
1.9.3-p194 :002 > c = f.getc
=> "h"
1.9.3-p194 :003 > c.encoding
=> #<Encoding:Windows-1252>

The letter “c” has not been transcoded. It happens to be the same in
windows-1252 and UTF-8, the single byte 0x63. However if you try it with
a high character >0x80 you’ll see that it is not transcoded into UTF-8.

Note that a string in ruby 1.9 is a vector of two things: an array of
bytes, and an encoding tag claiming that this string is encoded in a
particular way. c.encoding only shows you the tag.
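
As a rough illustration of that bytes-plus-tag model: force_encoding only changes the tag, while encode produces new bytes. The helper name tag_and_bytes is made up for this sketch.

```ruby
# Hypothetical helper: show both halves of a String, its tag and its bytes.
def tag_and_bytes(str)
  [str.encoding.name, str.bytes.to_a]
end

raw = "\xE8".b                                  # one raw byte

tagged     = raw.dup.force_encoding("ISO-8859-1")  # same byte, different tag
transcoded = tagged.encode("UTF-8")                # different bytes, different tag

p tag_and_bytes(tagged)
p tag_and_bytes(transcoded)
```

The tagged string still holds the single byte 232, whereas the transcoded one holds the two bytes 195 and 168 that UTF-8 uses for the same character.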

tbaustert · September 21, 2012, 3:28pm

I meant the letter “h” and the byte 0x68. Too tired.

tbaustert · September 22, 2012, 5:50am

On Fri, Sep 21, 2012 at 8:27 AM, Brian C. wrote:

=> #<File:hello.txt>
bytes, and an encoding tag claiming that this string is encoded in a
particular way. c.encoding only shows you the tag.

I’m aware of the implementation details of Ruby 1.9’s String. What
I’ve been trying to figure out for a bit now is all of the
idiosyncrasies of the standard library APIs. As such, I’m very curious
about these details, so I performed a few more experiments.
Interestingly I’m still not seeing this behavior. Could this have
changed at some point between 1.9.0 and 1.9.3-p194? Am I running into
something OS-specific?

I ran two tests: one with the inverted exclamation mark, which is
code point U+00A1, and one with the euro sign, which is code point U+20AC. I
used these two characters, as the inverted exclamation has the same
code point value in Unicode, ISO-8859-1 and Windows-1252, but the byte
value is two bytes in UTF-8 and one byte in Windows-1252; the euro
sign is in Unicode and Windows-1252, but at different code points.

For the euro sign, I created a Windows-1252 text file with a single
byte of 0x80 (the code point value) and then opened up IRB and ran the
following.

1.9.3-p194 :001 > f = File.open('euro_win1252.txt', 'r:windows-1252')
=> #<File:euro_win1252.txt>
1.9.3-p194 :002 > c = f.getc
=> "\x80"
1.9.3-p194 :003 > c.encoding
=> #<Encoding:Windows-1252>
1.9.3-p194 :004 > ct = c.encode('utf-8')
=> "€"

For the inverted exclamation point, I created a Windows-1252 text file
with a single byte of 0xA1 (the code point value) and then opened up
IRB and ran the following.

1.9.3-p194 :001 > f = File.open('inverted_win1252.txt', 'r:windows-1252')
=> #<File:inverted_win1252.txt>
1.9.3-p194 :002 > c = f.getc
=> "\xA1"
1.9.3-p194 :003 > c.encoding
=> #<Encoding:Windows-1252>
1.9.3-p194 :004 > ct = c.encode('utf-8')
=> "¡"

Am I not working through this correctly?

FWIW - I’m running my tests on Mac OS X 10.8.2 with Ruby 1.9.3-p194
installed via RVM.

Thanks.

tbaustert · September 22, 2012, 3:42pm

I think we’re talking about different things now. What I was testing
was this statement you made a few posts earlier -

However, every
String which you read from the file using f.gets or f.getc is marked to
say that it is encoded using UTF-8. (But not Strings which you read
using f.read)

I think I misread a follow-up that you corrected. The behavior I’m
seeing is that when a File is opened with a defined external encoding
and no internal encoding defined, then all reads to that file will
produce Strings that maintain the defined external encoding.
Transcoding would only happen if an internal encoding is defined. Is
that correct?

Thanks,
-Nathan

tbaustert · September 23, 2012, 12:06pm

Nathan B. wrote in post #1077086:

I think I misread a follow-up that you corrected. The behavior I’m
seeing is that when a File is opened with a defined external encoding
and no internal encoding defined, then all reads to that file will
produce Strings that maintain the defined external encoding.
Transcoding would only happen if an internal encoding is defined. Is
that correct?

That’s correct. Sorry, my brain has not been working 100% for my
first-time postings.

tbaustert · September 22, 2012, 11:57am

Nathan B. wrote in post #1077055:

I’m aware of the implementation details of Ruby 1.9’s String. What
I’ve been trying to figure out for a bit now is all of the
idiosyncrasies of the standard library APIs.

Ah, well that’s an open-ended question. Ruby’s standard library is very
large, and none of the encoding-related behaviour is documented. But
File.open / getc are pretty fundamental to encoding behaviour.

As such, I’m very curious
about these details, so I performed a few more experiments.
Interestingly I’m still not seeing this behavior. Could this have
changed at some point between 1.9.0 and 1.9.3-p194? Am I running into
something OS-specific?

I don’t think so, unless it’s behaviour of irb. I think you are just
misinterpreting the results, and being confused by String#inspect.

Look at
c.bytes.to_a
and
c.unpack("H*")
to see what’s really in the String.

I ran two tests: one with the inverted exclamation mark, which is
code point U+00A1, and one with the euro sign, which is code point U+20AC. I
used these two characters, as the inverted exclamation has the same
code point value in Unicode, ISO-8859-1 and Windows-1252, but the byte
value is two bytes in UTF-8 and one byte in Windows-1252; the euro
sign is in Unicode and Windows-1252, but at different code points.

For the euro sign, I created a Windows-1252 text file with a single
byte of 0x80 (the code point value) and then opened up IRB and ran the
following.

1.9.3-p194 :001 > f = File.open('euro_win1252.txt', 'r:windows-1252')
=> #<File:euro_win1252.txt>
1.9.3-p194 :002 > c = f.getc
=> "\x80"

That’s the single byte you expected. However String#inspect has some
hard-coded behaviour which treats bytes in the range 0x80-0x9f (I think)
as unprintable, and therefore substitutes hex representation. “puts c”
will squirt the string directly at the terminal, and because your
terminal is UTF-8 but the string is invalid UTF-8, it will be
unprintable. Your terminal will probably substitute some special
character.

1.9.3-p194 :003 > c.encoding
=> #<Encoding:Windows-1252>
1.9.3-p194 :004 > ct = c.encode('utf-8')
=> "€"

You’ve transcoded it. Now ct contains three bytes (E2 82 AC), which is the
UTF-8 representation of that character. Then you’ve sent it to the screen.

By default ruby does no transcoding on output (i.e. it does not take
into account the encoding of your terminal). Your terminal is in fact
UTF-8, and so those three bytes get displayed as the one character you’re
sending.

(Because you’re running OSX, your terminal is almost certainly UTF-8;
mine is anyway)

For the inverted exclamation point, I created a Windows-1252 text file
with a single byte of 0xA1 (the code point value) and then opened up
IRB and ran the following.

1.9.3-p194 :001 > f = File.open('inverted_win1252.txt', 'r:windows-1252')
=> #<File:inverted_win1252.txt>
1.9.3-p194 :002 > c = f.getc
=> "\xA1"

Again that’s one byte; for some reason String#inspect or irb is showing
it in hex representation and I don’t know why in this case. But if it
didn’t, it would be unprintable on a UTF-8 terminal.

1.9.3-p194 :003 > c.encoding
=> #<Encoding:Windows-1252>
1.9.3-p194 :004 > ct = c.encode('utf-8')
=> "¡"

Now ct contains 2 bytes, the UTF-8 representation of that character, and
your terminal displays it properly.

Anyway, try running something like this from the command line and see if
it’s any clearer, because it eliminates any possible interaction with
irb.

File.open("inverted_win1252.txt", "wb") do |f|
  f.write "\xA1"
end
File.open("inverted_win1252.txt", "r:windows-1252") do |f|
  c = f.getc
  puts c.bytes.to_a
  puts c.unpack("H*")
  puts c.encoding
  puts c.inspect
  puts c
  ct = c.encode("utf-8")
  puts ct.bytes.to_a
  puts ct.unpack("H*")
  puts ct.encoding
  puts ct.inspect
  puts ct
end

Regards,

Brian.