Ruby1.9: Encoding problems (how to use #force_encoding ?)

Hi, I'm using the geo_location Ruby gem, which returns a hash with the
geolocation of a given IP.

I use Ruby 1.9 and UTF-8 works fine, but in this case, when the "city"
has "strange" symbols, the gem gives the string encoded in ASCII-8BIT.

For example:

Alarc'n  (theoretically it should be "Alarcón")

I need to send this string to a server which mandates UTF-8 usage, so
sending it as is fails.

I've tried to convert the encoding but received an error:

result.encode "UTF-8"
=> `encode': "\xF3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)

I've also tried with force_encoding:
result.force_encoding "UTF-8"

and then the "result" string is converted to UTF-8 (I've checked
result.encoding), but it's still not valid for the server, and when
printing it I see the same as before.

I need all of this just for a simple demo, so it would be fine for me
just to delete the invalid UTF-8 chars from the result string, but I
don't know how to do it.

Any help please?

Iñaki Baz C. wrote:

Hi, I'm using the geo_location Ruby gem, which returns a hash with the
geolocation of a given IP.

Lots of gems are not ruby-1.9 compatible. You should probably report
problems to the author, ideally with a patch which fixes it, and a test
case which reproduces it.

I use Ruby 1.9 and UTF-8 works fine, but in this case, when the "city"
has "strange" symbols, the gem gives the string encoded in ASCII-8BIT.

All data read from a socket is tagged as ASCII-8BIT by default. That’s
probably what’s happening in the library you’re using.

I need to send this string to a server which mandates UTF-8 usage, so
sending it as is fails.

That doesn’t make much sense. A string, when it hits a socket, is just a
stream of bytes. So you should be sending the same stream of bytes as
you receive.

I've tried to convert the encoding but received an error:

result.encode "UTF-8"
=> `encode': "\xF3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)

That’s correct. Transcoding tries to transcode (replace characters one
at a time), and these high characters in ASCII-8BIT have no Unicode
equivalents.
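
The failure can be reproduced without the gem. The sketch below also shows one way out, under a loud assumption the thread never confirms: that the gem's bytes are really ISO-8859-1 (Latin-1), where the byte 0xF3 is "ó". The variable names are made up for illustration.

```ruby
# Bytes as the gem might return them, tagged ASCII-8BIT (binary).
raw = "Alarc\xF3n".dup.force_encoding("ASCII-8BIT")

error = nil
begin
  raw.encode("UTF-8")   # binary has no character mapping for 0xF3
rescue Encoding::UndefinedConversionError => e
  error = e             # same error class as in the question
end

# ASSUMPTION: the bytes are ISO-8859-1. Retag a copy with the real
# source encoding first, then transcode that copy to UTF-8.
city = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
```

If the assumption holds, `city` ends up as a valid UTF-8 string in which 0xF3 has become the two bytes 0xC3 0xB3.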

I've also tried with force_encoding:
result.force_encoding "UTF-8"

and then the "result" string is converted to UTF-8 (I've checked
result.encoding)

It’s not converted, it’s just tagged as being a string of UTF-8
characters, which it sounds like it is.
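
A small sketch of what "just tagged" means in practice. It uses the single byte \xF3 from the error message as sample data; whether the gem's real output looks like this is an assumption.

```ruby
raw    = "Alarc\xF3n".dup.force_encoding("ASCII-8BIT")
tagged = raw.dup.force_encoding("UTF-8")

same_bytes = (tagged.bytes == raw.bytes)  # retagging leaves the bytes untouched
valid      = tagged.valid_encoding?       # a lone 0xF3 is not well-formed UTF-8
```

For this sample, `same_bytes` is true and `valid` is false, which would explain why the server still rejects the string and why it prints the same as before.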

but it’s also not valid for the server

Again, doesn’t mean much without seeing the code which is trying to
submit this to the server.

I need all of this just for a simple demo, so it would be fine for me
just to delete the invalid UTF-8 chars from the result string, but I
don't know how to do it.

str.force_encoding("ASCII-8BIT") # if not already
str.gsub!(/[^\x20-\x7e]/,'')
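
Applied to a sample string, the two lines above behave as sketched here. The second variant is an alternative not proposed in the thread: it uses the replacement options of String#encode so that the result is tagged UTF-8 rather than ASCII-8BIT.

```ruby
str = "Alarc\xF3n".dup.force_encoding("ASCII-8BIT")

# Variant 1: keep printable ASCII only (result stays tagged ASCII-8BIT).
stripped = str.gsub(/[^\x20-\x7e]/, "")

# Variant 2: transcode and drop every byte that has no mapping.
dropped = str.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
```

Both variants lose the accented letter entirely, which the question says is acceptable for a demo.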

Dear Iñaki,

maybe you can use CGI.escape for your encoding problems?

I fetched the Spanish wikipedia page for Alarcón like so:

# encoding: utf-8

require "cgi"
require 'open-uri'

search_what=CGI.escape("Alarcón")
page="Buscar - Wikipedia, la enciclopedia libre"
open(page){ |f| print f.read }

Best regards,

Axel

On Sep 2, 2009, at 9:59 AM, Iñaki Baz C. wrote:

require 'open-uri'

search_what=CGI.escape("Alarcón")
page="Buscar - Wikipedia, la enciclopedia libre"
open(page){ |f| print f.read }

Thanks, but the problem is that the geo_location Ruby gem returns a
wrong string (encoded in ASCII-8BIT) since it contains invalid chars
for ASCII-8BIT encoding, so Ruby fails when trying to convert it to
another encoding :(

There are no invalid characters in ASCII-8BIT. It's a catch-all
Encoding. So that's definitely not the problem… ;)
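
This is easy to check: a string holding every possible byte value is still a valid ASCII-8BIT string. A minimal sketch:

```ruby
all_bytes = (0..255).to_a.pack("C*")    # pack("C*") returns an ASCII-8BIT string

binary_ok = all_bytes.valid_encoding?   # every byte is acceptable as binary
utf8_ok   = all_bytes.dup.force_encoding("UTF-8").valid_encoding?
```

Here `binary_ok` is true, while the same bytes retagged as UTF-8 are not valid, so validity depends on the tag and not on the bytes alone.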

James Edward G. II

On Wednesday, September 2, 2009, James Edward G. II wrote:

There are no invalid characters in ASCII-8BIT. It's a catch-all
Encoding. So that's definitely not the problem… ;)

Ok, that’s a good point.
I’ll try it.

"That's correct. Transcoding tries to transcode (replace characters one
at a time), and these high characters in ASCII-8BIT have no Unicode
equivalents."

It isn't true; in fact, \xc characters are Unicode code points (in UTF-8
encoding) and not ASCII-2 characters. You have a Unicode String with
UTF-8 encoding whose encoding is incorrectly set to ASCII-2. To solve
this problem, try this:

begin
  str.encode! Encoding::UTF_8 if str.encoding != Encoding::UTF_8
rescue Encoding::UndefinedConversionError
  # string incorrectly encoded, try force
  str.force_encoding Encoding::UTF_8
end

Pedro G. wrote in post #1040715:

"That's correct. Transcoding tries to transcode (replace characters one
at a time), and these high characters in ASCII-8BIT have no Unicode
equivalents."

It isn’t true

Yes, it is true, because it’s exactly what the Ruby encoding
“ASCII_8BIT” means. It allows you to use \x80 to \xFF without defining
what character set those are in. Hence these characters cannot be
transcoded, since it’s undefined what they are.

(Also, why are you resurrecting a 2-year-old thread?)

, in fact, \xc characters are Unicode code points (in UTF-8
encoding) and not ASCII-2 characters. You have a Unicode String with
UTF-8 encoding and encoding incorrectly set to ASCII-2, to solve this
problem try this:

What do you mean by ASCII-2? Standard ASCII is only a 7-bit character
set. There are a whole bunch of 8-bit extensions to ASCII, e.g.
ISO-8859-1, Windows-1252 etc. They all define different character sets
for \x80 to \xff. The encoding “ASCII_8BIT” makes no assertion about
what these high characters are.
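
To illustrate that the same high byte means different things in different 8-bit character sets, here is a short sketch that decodes 0xF3 under two of them. ISO-8859-7 (Greek) was picked only as a contrasting example; it does not come from the thread.

```ruby
byte = "\xF3".dup.force_encoding("ASCII-8BIT")

# Same byte, two different claims about what character set it belongs to.
latin1 = byte.dup.force_encoding("ISO-8859-1").encode("UTF-8")  # "ó" (U+00F3)
greek  = byte.dup.force_encoding("ISO-8859-7").encode("UTF-8")  # "σ" (U+03C3)
```

Ruby cannot pick between these on its own, which is why a string tagged ASCII-8BIT has to be retagged with its real encoding before it can be transcoded.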

begin
  str.encode! Encoding::UTF_8 if str.encoding != Encoding::UTF_8
rescue Encoding::UndefinedConversionError
  # string incorrectly encoded, try force
  str.force_encoding Encoding::UTF_8
end

That’s wrong, and just shows you don’t understand the problem.
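
One concrete weakness of the rescue branch is that force_encoding never raises, so bytes that are not UTF-8 pass through silently. A more defensive sketch checks the result with valid_encoding? and only then falls back to a guess. The helper name `to_utf8` and the ISO-8859-1 default are assumptions for illustration, not something anyone in the thread proposed.

```ruby
# Hypothetical helper: return a UTF-8 copy of str.
# ASSUMPTION: bytes that are not valid UTF-8 are in `assumed` (default Latin-1).
def to_utf8(str, assumed = Encoding::ISO_8859_1)
  candidate = str.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?   # bytes already were UTF-8

  # Otherwise interpret the bytes as `assumed` and transcode them.
  str.dup.force_encoding(assumed).encode(Encoding::UTF_8)
end

from_latin1 = to_utf8("Alarc\xF3n".dup.force_encoding("ASCII-8BIT"))
from_utf8   = to_utf8("Alarc\xC3\xB3n".dup.force_encoding("ASCII-8BIT"))
```

Both calls yield the same valid UTF-8 string, but the guess is only as good as the assumed source encoding, so the real fix is still to learn what encoding the gem's data is in.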