Forum: Ruby How does one transform UTF-8 encoded characters to ASCII?

Wes Gamble (weyus)
on 2006-05-24 22:22
I'm a little embarrassed about asking this, but here goes...

I am using HTMLEntities.decode_entities on a string of HTML.

I don't understand how to make my text, which now contains UTF-8
characters, display correctly in say, Notepad.  All of the entities are
preceded by the character A-circumflex.  My guess is that Notepad
doesn't know how to handle UTF-8, for example.
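
A small sketch of where the A-circumflex probably comes from, assuming the decoded entities are characters such as the copyright sign (U+00A9) and that the viewer reads the file as Windows-1252. The sample string is made up for illustration; String#encode and force_encoding need Ruby 1.9 or later.

```ruby
# U+00A9 (copyright sign), as produced by decoding "&copy;".
# A \u escape always yields a UTF-8 string.
decoded = "\u00A9 2006"

# In UTF-8, U+00A9 is stored as the two bytes C2 A9.
utf8_bytes = decoded.bytes.first(2).map { |b| format("%02X", b) }

# Simulate a viewer that wrongly assumes Windows-1252: relabel the same
# bytes as Windows-1252, then convert to UTF-8 so the result can be printed.
misread = decoded.dup.force_encoding("Windows-1252").encode("UTF-8")

puts utf8_bytes.join(" ")  # C2 A9
puts misread               # Â© 2006
```

Byte C2 is A-circumflex in Windows-1252, and every character from U+0080 to U+00BF starts with that byte in UTF-8, which would explain why it shows up in front of each entity.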

Do I need to unpack the string with 'U' and then repack the result with
'A'?

Thanks,
Wes
Wes Gamble (weyus)
on 2006-05-24 22:51
OK, I have found the iconv library; however, I am still having trouble.

What is the default text encoding for Ruby?  I assume it's gotten from
the OS, right?  So if I'm on Windows XP in the US, it's probably
ISO-8859-1?

Any help is appreciated.

Wes
Paul Battley (Guest)
on 2006-05-25 00:24
(Received via mailing list)
On 25/05/06, Wes Gamble <weyus@att.net> wrote:
> > I don't understand how to make my text, which now contains UTF-8
> > characters, display correctly in say, Notepad.  All of the entities are
> > preceded by the character A-circumflex.  My guess is that Notepad
> > doesn't know how to handle UTF-8, for example.

Windows Notepad does handle UTF-8, but requires the presence of a BOM
(the three bytes "\xef\xbb\xbf") at the start of the file to read it
properly. Other, more competent applications may allow you to select
the appropriate encoding, or may even automatically detect it.
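
A minimal sketch of writing a UTF-8 file that starts with those three bytes, assuming Ruby 1.9 or later. The file name and sample text are hypothetical.

```ruby
require "tmpdir"

bom  = "\uFEFF"                      # encodes to the bytes EF BB BF in UTF-8
text = "caf\u00E9 \u00A9 2006\n"     # "café © 2006"
path = File.join(Dir.tmpdir, "bom_example.txt")  # hypothetical output file

# Binary mode avoids newline translation; the string is already UTF-8.
File.open(path, "wb:UTF-8") { |f| f.write(bom + text) }

first_three = File.binread(path, 3).bytes
puts first_three.map { |b| format("%02X", b) }.join(" ")  # EF BB BF
```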

> OK I have found the iconv library, however, I am still having trouble.
>
> What is the default text encoding for Ruby?  I assume it's gotten from
> the OS, right?  So if I'm on Windows XP in the US, it's probably
> ISO-8859-1?

Windows XP uses UTF-16 internally, I believe, but retains the concept
of a legacy code page to allow non-Unicode-aware applications to run.
English Windows uses Windows-1252 (mostly, but not completely the same
as ISO-8859-1). Ruby on Windows uses the legacy code page to
communicate with the operating system, so things like file names will
be in Windows-1252.
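
To illustrate the "mostly, but not completely the same" part, here is a short sketch (Ruby 1.9 or later assumed) that decodes the single byte 0x80 under both encodings. The two differ only in the range 0x80 to 0x9F.

```ruby
byte = "\x80".b   # one raw byte, 0x80, with no text encoding attached yet

as_cp1252 = byte.dup.force_encoding("Windows-1252").encode("UTF-8")
as_latin1 = byte.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts as_cp1252                          # €
puts format("U+%04X", as_cp1252.ord)    # U+20AC
puts format("U+%04X", as_latin1.ord)    # U+0080
```

In Windows-1252 the byte is the euro sign; in ISO-8859-1 it is an invisible control character.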

Paul.
Wes Gamble (weyus)
on 2006-05-25 01:27
I don't really care about Notepad.

I can't get VIM to show this text correctly.

I want to be able to convert (what I believe to be) UTF-8 into
Windows-1252 successfully.

Wes
Sam Roberts (Guest)
on 2006-05-25 03:39
(Received via mailing list)
On Thu, May 25, 2006 at 08:27:46AM +0900, Wes Gamble wrote:
> Paul Battley wrote:
> > Windows Notepad does handle UTF-8, but requires the presence of a BOM
> > (the three bytes "\xef\xbb\xbf") at the start of the file to read it
> > properly. Other, more competent applications may allow you to select
> > the appropriate encoding, or may even automatically detect it.
> I don't really care about Notepad.
>
> I can't get VIM to show this text correctly.

vim handles UTF-8 (if your terminal does!), but also wants a BOM.

You might be able to tell it that it's UTF-8 even without the BOM
(for example, :e ++enc=utf-8 reopens the current file as UTF-8).

> I want to be able to convert (what I believe to be) UTF-8 into
> Windows-1252 successfully.

Doesn't iconv take two args, the input and output character sets?

Sam
Logan Capaldo (Guest)
on 2006-05-25 04:45
(Received via mailing list)
On May 24, 2006, at 9:36 PM, Sam Roberts wrote:

>
>
>

require 'iconv'
from = 'UTF-8'
to = 'CP1252'
# Iconv.new takes the target encoding first, then the source encoding.
iconvertor = Iconv.new(to, from)
s = "String in utf8"
string_in_windows1252 = iconvertor.iconv(s)
iconvertor.close
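
For readers on a newer Ruby: the iconv library was dropped from the standard library in Ruby 2.0 (it survives as a separate gem), and String#encode covers the same ground from 1.9 onward. A minimal sketch, with a made-up sample string:

```ruby
utf8 = "r\u00E9sum\u00E9"               # "résumé" as a UTF-8 string

cp1252 = utf8.encode("Windows-1252")    # "CP1252" is accepted as an alias

puts utf8.bytesize     # 8  (each é takes two bytes in UTF-8)
puts cp1252.bytesize   # 6  (each é is the single byte E9 in Windows-1252)
puts cp1252.encoding   # Windows-1252
```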
Wes Gamble (weyus)
on 2006-05-25 05:13
Logan Capaldo wrote:
> On May 24, 2006, at 9:36 PM, Sam Roberts wrote:
>
>>
>>
>>
>
> require 'iconv'
> from = 'UTF-8'
> to = 'CP1252'
> iconvertor = Iconv.new(to, from)
> s = "String in utf8"
> string_in_windows1252 = iconvertor.iconv(s)
> iconvertor.close

Is CP1252 the correct name of the character encoding?  I was using
windows-1252.

Where is a list of the canonical names of character encodings?

Also, what is a BOM?

Thanks,
Wes
Wes Gamble (weyus)
on 2006-05-25 05:18
I would also like to know which encoding is in effect before I start
doing all of this converting.

If I look at the $KCODE variable to try and figure that out and the
value is "NONE", what does that mean?  Can I assume I am using "ASCII"
or "US-ASCII"?
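
As far as I know, a $KCODE of "NONE" in Ruby 1.8 only means that Ruby treats strings as plain bytes with no multibyte awareness; it says nothing about what the data actually is. From Ruby 1.9 onward $KCODE is obsolete and every string carries its own encoding, which can be inspected directly. A rough sketch (the printed values depend on your platform and locale, so none are shown):

```ruby
s = "plain text"

puts s.encoding                  # encoding attached to this string
puts __ENCODING__                # encoding of the current source file
puts Encoding.default_external   # default for data read from files and the OS
puts s.valid_encoding?           # are the bytes valid in s.encoding?
puts s.ascii_only?               # is every byte 7-bit ASCII?
```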

Thanks,
Wes
Logan Capaldo (Guest)
on 2006-05-25 05:28
(Received via mailing list)
On May 24, 2006, at 11:13 PM, Wes Gamble wrote:

>> iconvertor = Iconv.new(to, from)
>> s = "String in utf8"
>> string_in_windows1252 = iconvertor.iconv(s)
>> iconvertor.close
>
> Is CP1252 the correct name of the character encoding?  I was using
> windows-1252?
>
> Where is a list of the canonical names of character encodings?
>
Canonical depends on whether you're using GNU libiconv [1] or
the local implementation. If you used windows-1252 and it didn't
throw an exception, you're probably OK.



> Also, what is a BOM?
>
Byte order mark. Usually used with UTF-16, it's an initial sequence
of bytes that lets a program know whether the file is big- or
little-endian.
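
A small sketch of how a program might check for that initial sequence, assuming Ruby 1.9 or later. The detect_bom helper is hypothetical and only covers UTF-8 and UTF-16; a UTF-32LE BOM begins with the same two bytes as UTF-16LE, so real code would need to check for that first.

```ruby
# Byte sequences that U+FEFF takes in each encoding form.
boms = {
  "UTF-8"    => [0xEF, 0xBB, 0xBF],
  "UTF-16BE" => [0xFE, 0xFF],
  "UTF-16LE" => [0xFF, 0xFE]
}

# Hypothetical helper: returns the encoding name implied by a leading BOM,
# or nil when the data does not start with one of the sequences above.
detect_bom = lambda do |data|
  found = boms.find { |_name, sig| data.bytes.first(sig.length) == sig }
  found && found.first
end

puts detect_bom.call("\uFEFFhello")                      # UTF-8
puts detect_bom.call("\uFEFFhello".encode("UTF-16LE"))   # UTF-16LE
puts detect_bom.call("hello").inspect                    # nil
```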

[1] http://www.gnu.org/software/libiconv/
Wes Gamble (weyus)
on 2006-05-25 07:51
If I do an Iconv from windows-1252 to UTF-8 and
then I operate on my UTF-8 string to do something to it and
then I do an Iconv back to windows-1252 before I write to a file,

THEN

I should not have any need to place a BOM into the string, correct?

The BOM is only to help applications understand that the encoding of the
text is UTF-8, correct?
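
The round trip described above might look like this on Ruby 1.9 or later, using String#encode rather than Iconv. The sample bytes and the edit step are made up for illustration.

```ruby
# Bytes as they might come from a Windows-1252 file: "caf" followed by E9 (é).
raw = "caf\xE9".b

# 1. Label the bytes with their real encoding, then convert to UTF-8.
utf8 = raw.force_encoding("Windows-1252").encode("UTF-8")

# 2. Work on the text as UTF-8.
edited = utf8 + "s"

# 3. Convert back before writing; Windows-1252 output carries no BOM.
out = edited.encode("Windows-1252")

puts utf8                 # café
puts out.bytes.inspect    # [99, 97, 102, 233, 115]
```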

Wes
Wes Gamble (weyus)
on 2006-05-25 08:04
One last thing - I believe that one of my issues is that when I attempt
to convert BACK to Windows-1252/ISO-8859-1 or whatever, some of my UTF-8
chars will not convert because they don't exist in the target character
encodings.  If that's true, then I could write the UTF-8 encoded data
into a file provided that I preceded it with the BOM.
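
A sketch of what that failure looks like with String#encode on Ruby 1.9 or later, plus one way to degrade gracefully instead. The sample text is made up; the euro sign exists in Windows-1252, the snowman does not.

```ruby
text = "price: 5\u20AC, snowman: \u2603"

begin
  text.encode("Windows-1252")
rescue Encoding::UndefinedConversionError => e
  # error_char is the character that has no equivalent in the target.
  puts "cannot convert: #{e.error_char.inspect}"
end

# Alternative: substitute a placeholder for anything unrepresentable.
lossy = text.encode("Windows-1252", undef: :replace, replace: "?")
puts lossy.encode("UTF-8")   # price: 5€, snowman: ?
```

Writing the data out as UTF-8 instead avoids the loss entirely.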

Does that sound right?

Man, when I got up this morning I didn't know I'd be learning how to
convert character encodings in Ruby.  Oh joy!!!!

Wes
Michal Suchanek (Guest)
on 2006-05-25 17:04
(Received via mailing list)
On 5/24/06, Wes Gamble <weyus@att.net> wrote:
> I'm a little embarrassed about asking this, but here goes...
>
> I am using HTMLEntities.decode_entities on a string of HTML.
>
> I don't understand how to make my text, which now contains UTF-8
> characters, display correctly in say, Notepad.  All of the entities are
> preceded by the character A-circumflex.  My guess is that Notepad
> doesn't know how to handle UTF-8, for example.

At least on Windows XP, Notepad can handle various encodings. The
problem is convincing it to use the right encoding. There is no
obvious way.
One way that might work:
Save a text as UTF-8 in Notepad. Notepad inserts a mark at the
beginning of the text. You can then copy the mark to the beginning of
any of your texts and they would be readable in Notepad. But they might
no longer parse as valid HTML, Ruby, or whatever.
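
If the mark does end up in a file that a Ruby program later reads, it can be dropped on the way in. A minimal sketch, assuming Ruby 1.9.2 or later, where the "BOM|UTF-8" open mode is available; the file name is hypothetical.

```ruby
require "tmpdir"

path = File.join(Dir.tmpdir, "with_bom.txt")   # hypothetical file name
File.binwrite(path, "\uFEFF<html></html>\n")

plain    = File.open(path, "r:UTF-8")     { |f| f.read }  # BOM stays in the string
stripped = File.open(path, "r:BOM|UTF-8") { |f| f.read }  # BOM is consumed on open

puts plain.start_with?("\uFEFF")      # true
puts stripped.start_with?("<html>")   # true
```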

Or just drop notepad and use a text editor.

HTH

Michal
Charlie Bowman (Guest)
on 2006-05-25 17:13
(Received via mailing list)
Sorry to jump into this thread, but I have the exact opposite problem.
Open Office uses UTF-8 and I need to be able to copy text from the OO
document into a web form.  Is this possible without saving the document
as a plain text file first?

On Fri, 2006-05-26 at 00:03 +0900, Michal Suchanek wrote:

> At least on Windows XP the notepad can handle various encodings. The
> HTH
>
> Michal

Charlie Bowman
http://www.recentrambles.com
Wes Gamble (weyus)
on 2006-05-25 17:20
I assume that you should be able to use Iconv to transform the string.
There are examples of its usage earlier in this thread.