Encoding of strings received from db

mare · June 26, 2007, 10:53pm

I swear I searched this forum and couldn’t find the solution!

Anyway, the problem:
I have a db (PostgreSQL) with data encoded in UTF-8. I want to check
this data, so I hoped to write a script that asserts the values found
there. The problem is that they do not match, since the strings obtained
from the db do not appear to be UTF-8. I tried iconv, but the resulting
strings were even messier. I’m using the ‘postgres’ gem to connect to the db.
Any solutions?

mare · June 27, 2007, 2:04pm

Dear mare,

does the information here:

http://groups.google.ca/group/rubyonrails-core/browse_thread/thread/db8d0c594cb4cc73

help ?
Otherwise, I’m not sure whether, for Postgres, something like
Mysql.escape(string) could be of help.
Maybe you could post an example string showing what you want to achieve
and what goes wrong…

Best regards,

Axel

mare · June 27, 2007, 4:37pm

Axel E. wrote:

Dear mare,

does the information here:

http://groups.google.ca/group/rubyonrails-core/browse_thread/thread/db8d0c594cb4cc73

help ?

Unfortunately it doesn’t, because rails has some kind of configuration
(I was only playing with rails, never digging deeper into it, so I’m not
sure what it looks like). As for my problem, I’m trying to do this:

db = PGconn.connect('localhost', 5432, '', '', 'isys', 'postgres', 'postgres')
res = db.exec('select * from "ADDRESS"')
puts res[1][1] == "Südstrasse"

and make the output “true”. But I can’t, because res[1][1] (second
column of the second row) is in some other encoding and instead of
Südstrasse (“Sudstrasse” where “u” has umlauts - the two dots above the
letter) I get SÃ¼dstrasse (“u” with umlauts is presented as two
letters).

I’ve tried using:

Iconv.new('UTF-8', 'ISO-8859-1').iconv(res[1][1])

but now every byte that I have instead of umlauted u is transformed into
another two characters - instead of u with umlauts, I now have 4
characters.
I believe that the problem lies in ‘postgres’ gem and that it should be
configured somehow, but I don’t have the slightest idea how…
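
A note for readers on Ruby 1.9 or later, where Iconv is no longer the usual tool: the "4 characters instead of one" effect described above is what happens when bytes that are already UTF-8 are treated as ISO-8859-1 and converted to UTF-8 a second time. The sketch below only illustrates that effect with String#encode and force_encoding; the variable names are made up for the example and nothing here touches the database.

```ruby
# Sketch for Ruby 1.9+; all names are illustrative only.

utf8 = "S\u00FCdstrasse"   # "Südstrasse" as proper UTF-8

# Pretend the UTF-8 bytes are ISO-8859-1 and convert them to UTF-8 again.
# This mirrors Iconv.new('UTF-8', 'ISO-8859-1') applied to UTF-8 input.
double = utf8.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts utf8.bytesize      # 11: the u-umlaut takes 2 bytes
puts double.bytesize    # 13: the u-umlaut has become 4 bytes
puts double == "S\u00C3\u00BCdstrasse"   # true: it now reads "SÃ¼dstrasse"

# The damage is reversible: go back to ISO-8859-1 bytes and relabel as UTF-8.
repaired = double.encode("ISO-8859-1").force_encoding("UTF-8")
puts repaired == utf8   # true
```

If a string already looks right when dumped byte by byte (\303\274 for the umlaut), converting it again can only make things worse.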

mare · June 27, 2007, 5:10pm

Dear Marko,

I am pretty sure that nothing is wrong here - it’s merely
a question of what encoding settings you give your editor.
I’ve tried the following:

Open an editor, write the word “Südstrasse” into it, and
save it in UTF8 encoding as “a.txt” - so the “ü” gets displayed
correctly when you have UTF8 encoding set.
Then execute the following script, which is saved in ISO-8859-1 encoding
(that matters for the “ü” in “Südstrasse”):

require "iconv"
text = IO.readlines("a.txt").to_s   # Ruby 1.8: Array#to_s joins the lines
p text                              # => "S\303\274dstrasse"
result = Iconv.conv("ISO-8859-1", "UTF-8", text)
p result                            # => "S\374dstrasse"
compare_to = "Südstrasse"

p compare_to == result   # => true, because of the conversion made in Iconv
p compare_to == text     # => false, because of the lack of conversion

Best regards,

Axel
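
For anyone trying the experiment above on Ruby 1.9 or later, here is a rough equivalent without Iconv and without the file on disk. It is only a sketch: the two string literals stand in for the UTF-8 file contents and for the ISO-8859-1 script literal, and the byte escapes are written out so the result does not depend on how this snippet itself is saved.

```ruby
# Sketch for Ruby 1.9+; the literals stand in for a.txt and the script.

text = "S\u00FCdstrasse"              # UTF-8: the umlaut is \xC3\xBC
result = text.encode("ISO-8859-1")    # ISO-8859-1: the umlaut is \xFC

# What an ISO-8859-1 encoded script would contain for "Südstrasse".
compare_to = "S\xFCdstrasse".dup.force_encoding("ISO-8859-1")

p text.bytes.to_a[1, 2]    # [195, 188], i.e. \303\274
p result.bytes.to_a[1]     # 252, i.e. \374
p compare_to == result     # true, because of the conversion
p compare_to == text       # false, because of the lack of conversion
```

The comparison only succeeds when both sides hold the same bytes in the same encoding, which is the point of the original experiment.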

mare · June 27, 2007, 5:28pm

On Jun 27, 2007, at 9:37 , Marko Marjanovic wrote:

Unfortunately it doesn’t, because rails has some kind of configuration
[...]
another two characters - instead of u with umlauts, I now have 4
characters.
I believe that the problem lies in ‘postgres’ gem and that it
should be configured somehow, but I don’t have the slightest idea how…

I think your issue is that your Ruby script doesn’t know what
encoding is coming back from the database. Try setting $KCODE = 'u'
in your script.

$ cat db_encoding_test.rb
#!/usr/local/bin/ruby -w
$KCODE = 'u'
require 'rubygems'
require 'postgres'
require 'dbi'

DBI.connect('dbi:Pg:test:localhost:54824', 'postgres') do |dbh|
  sth = dbh.prepare(<<-EOS)
    SELECT * FROM streets;
  EOS
  sth.execute
  sth.fetch do |row|
    p row
    p row.first == 'Südstrasse'
  end
end
$ ./db_encoding_test.rb
["Südstrasse"]
true

Michael G.
grzm seespotcode net
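
A side note for later Ruby versions: $KCODE belongs to Ruby 1.8 and has no effect from 1.9 on, where every String carries its own encoding instead. The sketch below shows the comparable situation there, under the assumption (not taken from this thread) that the bytes arrive tagged as binary; the `raw` string is a stand-in for a value fetched from the database.

```ruby
# Sketch for Ruby 2.0+; `raw` is a hypothetical stand-in for a fetched value.

expected = "S\u00FCdstrasse"       # "Südstrasse" as a UTF-8 string

raw = "S\xC3\xBCdstrasse".b        # same bytes, but tagged as binary
p raw.bytes == expected.bytes      # true: the bytes are identical
p raw == expected                  # false: the encodings are not comparable

# Tell Ruby what the bytes mean instead of converting them.
tagged = raw.dup.force_encoding("UTF-8")
p tagged.valid_encoding?           # true
p tagged == expected               # true
```

force_encoding only relabels the bytes, while encode rewrites them; for data that is already UTF-8, relabelling is the one that applies.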

mare · June 27, 2007, 7:48pm

Thanks a lot, guys!
It seems that the SciTE editor that I used for creating the script somehow
screwed up the encoding. When I checked, I found out that it’s UTF-8-Y or
something like that, instead of UTF-8. I edited the script in JEdit,
fixed its encoding, added $KCODE = 'u', just for extra safety :), and
everything was fine.
Thanks again!
Cheers
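
For anyone who lands here with a similar editor problem: it is not clear from the thread what SciTE's "UTF-8-Y" mode actually wrote, so the following is only a generic sketch for inspecting how a file was saved. It assumes Ruby 2.0 or later, and `utf8_report` is a made-up helper name. It checks two common culprits: a leading byte order mark and bytes that are not valid UTF-8.

```ruby
# Sketch for Ruby 2.0+; utf8_report is a hypothetical helper.

# Takes the raw bytes of a file and reports two things:
#   :bom        - whether the data starts with the UTF-8 byte order mark
#   :valid_utf8 - whether the bytes form valid UTF-8 at all
def utf8_report(bytes)
  data = bytes.b
  {
    bom: data.start_with?("\xEF\xBB\xBF".b),
    valid_utf8: data.dup.force_encoding("UTF-8").valid_encoding?
  }
end

plain  = utf8_report("S\xC3\xBCdstrasse")              # clean UTF-8
marked = utf8_report("\xEF\xBB\xBFS\xC3\xBCdstrasse")  # UTF-8 with a BOM
latin1 = utf8_report("S\xFCdstrasse")                  # ISO-8859-1 bytes

p plain    # bom false, valid_utf8 true
p marked   # bom true,  valid_utf8 true
p latin1   # bom false, valid_utf8 false
```

To check a real script, the bytes can be read with File.binread("script.rb") and passed to the helper.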