How does one transform UTF-8 encoded characters to ASCII?

weyus · May 24, 2006, 10:22pm

I’m a little embarrassed about asking this, but here goes…

I am using HTMLEntities.decode_entities on a string of HTML.

I don’t understand how to make my text, which now contains UTF-8
characters, display correctly in say, Notepad. All of the entities are
preceded by the character A-circumflex. My guess is that Notepad
doesn’t know how to handle UTF-8, for example.
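
The A-circumflex symptom can be reproduced in a few lines. This is only a sketch, and it assumes a modern Ruby (1.9 or later) with the built-in String#encode, which did not exist when this thread was written. The sample text is made up. It shows that a decoded entity such as the copyright sign becomes the two bytes C2 A9 in UTF-8, and that a viewer which reads those bytes as Windows-1252 shows an extra "Â" in front of the character.

```ruby
# Assumption: Ruby 1.9+ (String#encode); sample text is hypothetical.

# U+00A9 (copyright sign) is what an entity such as &copy; decodes to.
utf8 = "\u00A9 2006"

# In UTF-8 the character occupies two bytes: 0xC2 0xA9.
utf8_bytes = utf8.bytes.first(2)

# Reinterpret the same bytes as Windows-1252, which is what a viewer does
# when it guesses the wrong encoding, then express the result as UTF-8.
misread = utf8.dup.force_encoding('Windows-1252').encode('UTF-8')

puts utf8_bytes.map { |b| format('%02X', b) }.join(' ')
puts misread
```

So the text itself is fine; it is only being displayed with the wrong encoding.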

Do I need to unpack the string with ‘U’ and then repack the result with
‘A’?

Thanks,
Wes

weyus · May 24, 2006, 10:51pm

OK, I have found the iconv library; however, I am still having trouble.

What is the default text encoding for Ruby? I assume it’s gotten from
the OS, right? So if I’m on Windows XP in the US, it’s probably
ISO-8859-1?

Any help is appreciated.

Wes

weyus · May 25, 2006, 12:24am

On 25/05/06, Wes G. wrote:

I don’t understand how to make my text, which now contains UTF-8
characters, display correctly in say, Notepad. All of the entities are
preceded by the character A-circumflex. My guess is that Notepad
doesn’t know how to handle UTF-8, for example.

Windows Notepad does handle UTF-8, but requires the presence of a BOM
(the three bytes “\xef\xbb\xbf”) at the start of the file to read it
properly. Other, more competent applications may allow you to select
the appropriate encoding, or may even automatically detect it.
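
A minimal sketch of writing such a file, assuming Ruby 2.0 or later (for String#b) and using a throwaway temp file and made-up sample text. It writes the three BOM bytes first and then the UTF-8 text, all in binary mode so nothing is translated on the way out.

```ruby
require 'tempfile'

# Assumption: Ruby 2.0+; file name and sample text are hypothetical.
bom  = "\xEF\xBB\xBF".b        # the three BOM bytes, as raw bytes
text = "caf\u00E9 \u2122"      # sample UTF-8 text

file = Tempfile.new(['bom_demo', '.txt'])
file.binmode                   # write bytes exactly, no newline translation
file.write(bom)
file.write(text.b)
file.close

# Read the first three bytes back to confirm the BOM is in place.
first_three = File.binread(file.path, 3).bytes
puts first_three.map { |b| format('%02X', b) }.join(' ')
```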

OK I have found the iconv library, however, I am still having trouble.

What is the default text encoding for Ruby? I assume it’s gotten from
the OS, right? So if I’m on Windows XP in the US, it’s probably
ISO-8859-1?

Windows XP uses UTF-16 internally, I believe, but retains the concept
of a legacy code page to allow non-Unicode-aware applications to run.
English Windows uses Windows-1252 (mostly, but not completely the same
as ISO-8859-1). Ruby on Windows uses the legacy code page to
communicate with the operating system, so things like file names will
be in Windows-1252.
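
To illustrate "mostly, but not completely the same", here is a small sketch. It assumes a modern Ruby (2.0+) and its built-in transcoder rather than the iconv of 2006, and the helper name decode is made up for this example. The byte 0xE9 decodes to the same letter in both encodings, while 0x80 is the euro sign in Windows-1252 but a control character in ISO-8859-1.

```ruby
# Assumption: Ruby 2.0+; `decode` is a hypothetical helper for this example.
byte_e9 = "\xE9".b   # same letter in both encodings
byte_80 = "\x80".b   # differs between the two

# Interpret raw bytes in the given encoding and return them as UTF-8.
def decode(bytes, encoding)
  bytes.dup.force_encoding(encoding).encode('UTF-8')
end

same_a = decode(byte_e9, 'Windows-1252')
same_b = decode(byte_e9, 'ISO-8859-1')
euro   = decode(byte_80, 'Windows-1252')  # U+20AC EURO SIGN
c1     = decode(byte_80, 'ISO-8859-1')    # U+0080, a C1 control character

puts same_a == same_b
puts euro.codepoints.inspect
puts c1.codepoints.inspect
```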

Paul.

weyus · May 25, 2006, 3:39am

On Thu, May 25, 2006 at 08:27:46AM +0900, Wes G. wrote:

Paul B. wrote:

Windows Notepad does handle UTF-8, but requires the presence of a BOM
(the three bytes “\xef\xbb\xbf”) at the start of the file to read it
properly. Other, more competent applications may allow you to select
the appropriate encoding, or may even automatically detect it.

I don’t really care about Notepad.

I can’t get VIM to show this text correctly.

Vim handles UTF-8 (if your terminal does!), but also wants a BOM.

You might be able to tell it that it’s UTF-8 even without the BOM.

I want to be able to convert (what I believe to be) UTF-8 into
Windows-1252 successfully.

Doesn’t iconv take two args, the input and output character sets?

Sam

weyus · May 25, 2006, 4:45am

On May 24, 2006, at 9:36 PM, Sam R. wrote:

require 'iconv'
from = 'UTF-8'
to = 'CP1252'
# Iconv.new takes the target encoding first, then the source encoding
iconvertor = Iconv.new(to, from)
s = "String in utf8"
string_in_windows1252 = iconvertor.iconv(s)
iconvertor.close
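
For readers on a newer Ruby: iconv was later dropped from the standard library, and the same conversion can be sketched with String#encode instead. This swaps in a different API from the one used above and assumes Ruby 1.9 or later; the sample string is made up.

```ruby
# Assumption: Ruby 1.9+ (String#encode replaces iconv here).
utf8 = "caf\u00E9"                     # "cafe" with e-acute, as UTF-8

# The argument is the target encoding; the source encoding is taken
# from the string itself.
cp1252 = utf8.encode('Windows-1252')

puts utf8.bytesize                     # e-acute takes two bytes in UTF-8
puts cp1252.bytesize                   # and one byte in Windows-1252
puts cp1252.bytes.last.to_s(16)
```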

weyus · May 25, 2006, 1:27am

I don’t really care about Notepad.

I can’t get VIM to show this text correctly.

I want to be able to convert (what I believe to be) UTF-8 into
Windows-1252 successfully.

Wes

weyus · May 25, 2006, 5:18am

I would also like to know which encoding is in effect before I start
doing all of this converting.

If I look at the $KCODE variable to try and figure that out and the
value is “NONE”, what does that mean? Can I assume I am using “ASCII”
or “US-ASCII”?

Thanks,
Wes

weyus · May 25, 2006, 5:13am

Logan C. wrote:

On May 24, 2006, at 9:36 PM, Sam R. wrote:

require 'iconv'
from = 'UTF-8'
to = 'CP1252'
# Iconv.new takes the target encoding first, then the source encoding
iconvertor = Iconv.new(to, from)
s = "String in utf8"
string_in_windows1252 = iconvertor.iconv(s)
iconvertor.close

Is CP1252 the correct name of the character encoding? I was using
windows-1252?

Where is a list of the canonical names of character encodings?

Also, what is a BOM?

Thanks,
Wes

weyus · May 25, 2006, 5:28am

On May 24, 2006, at 11:13 PM, Wes G. wrote:

iconvertor = Iconv.new(to, from)
s = "String in utf8"
string_in_windows1252 = iconvertor.iconv(s)
iconvertor.close

Is CP1252 the correct name of the character encoding? I was using
windows-1252?

Where is a list of the canonical names of character encodings?

Canonical depends on whether you’re using GNU libiconv [1] or
the local implementation. If you used windows-1252 and it didn’t
throw an exception, you’re probably OK.
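
The same "try it and see whether it raises" check can be sketched against Ruby's own encoding registry. This assumes Ruby 1.9 or later and its built-in Encoding class, which keeps a name list separate from iconv's, so it says nothing about which names a given iconv accepts.

```ruby
# Assumption: Ruby 1.9+ built-in Encoding class, not iconv's name table.
cp  = Encoding.find('CP1252')
win = Encoding.find('Windows-1252')

puts cp == win            # both names resolve to the same encoding
puts win.names.inspect    # all names Ruby knows for it

# An unknown name raises, much like iconv rejecting a bad name.
begin
  Encoding.find('no-such-encoding')
  unknown_raised = false
rescue ArgumentError => e
  unknown_raised = true
  puts "unknown: #{e.message}"
end
```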

Also, what is a BOM?

Byte order mark. Usually used with UTF-16, it’s an initial sequence
of bytes that lets a program know whether the file is big- or little-endian.
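
As a quick illustration, assuming Ruby 1.9 or later: the BOM is the single code point U+FEFF, and its byte sequence depends on the encoding, which is how a reader can tell the byte order apart.

```ruby
# Assumption: Ruby 1.9+ (String#encode).
bom = "\uFEFF"   # the byte order mark as a single code point

bom_bytes = {}
%w[UTF-8 UTF-16LE UTF-16BE].each do |name|
  bom_bytes[name] = bom.encode(name).bytes
  hex = bom_bytes[name].map { |b| format('%02X', b) }.join(' ')
  puts "#{name}: #{hex}"
end
```

In UTF-8 there is no byte order to signal, so there the mark only acts as an encoding signature.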

[1] libiconv - GNU Project - Free Software Foundation (FSF)

weyus · May 25, 2006, 7:51am

If I do an Iconv from windows-1252 to UTF-8 and
then I operate on my UTF-8 string to do something to it and
then I do an Iconv back to windows-1252 before I write to a file,

THEN

I should not have any need to place a BOM into the string, correct?

The BOM is only to help applications understand that the encoding of the
text is UTF-8, correct?
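
The round trip described here can be sketched as follows. This assumes Ruby 2.0 or later and uses String#encode in place of Iconv; the sample bytes and the reverse step are made up to stand in for "do something to it". No BOM appears at any point, because the bytes that get written are Windows-1252 again.

```ruby
# Assumption: Ruby 2.0+; sample data and the `reverse` step are hypothetical.
original = "caf\xE9".b.force_encoding('Windows-1252')  # bytes as read from a legacy file

utf8    = original.encode('UTF-8')       # step 1: Windows-1252 to UTF-8
changed = utf8.reverse                   # step 2: some character-aware operation
back    = changed.encode('Windows-1252') # step 3: back to Windows-1252

puts back.bytes.map { |b| format('%02X', b) }.join(' ')
```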

Wes

weyus · May 25, 2006, 5:04pm

On 5/24/06, Wes G. wrote:

I’m a little embarrassed about asking this, but here goes…

I am using HTMLEntities.decode_entities on a string of HTML.

I don’t understand how to make my text, which now contains UTF-8
characters, display correctly in say, Notepad. All of the entities are
preceded by the character A-circumflex. My guess is that Notepad
doesn’t know how to handle UTF-8, for example.

At least on Windows XP, Notepad can handle various encodings. The
problem is convincing it to use the right encoding. There is no
obvious way.
One way that might work:
Save a text as UTF-8 in Notepad. Notepad inserts a mark at the
beginning of the text. You can then copy the mark to the beginning of
any of your texts and they would be readable in Notepad. But they would no
longer parse as valid HTML, Ruby, or whatever.

Or just drop notepad and use a text editor.

HTH

Michal

weyus · May 25, 2006, 5:13pm

Sorry to jump into this thread but I have the exact opposite problem.
Open Office uses UTF-8 and I need to be able to copy text from the OO
document into a web form. Is this possible without saving the document
as a plain text file first?

On Fri, 2006-05-26 at 00:03 +0900, Michal S. wrote:

At least on Windows XP the notepad can handle various encodings. The
HTH

Michal

Charlie B.
http://www.recentrambles.com

weyus · May 25, 2006, 8:04am

One last thing - I believe that one of my issues is that when I attempt
to convert BACK to 1252/ISO-8859 or whatever, some of my UTF-8
chars will not convert because they don’t exist in the target character
encoding. If that’s true, then I could write the UTF-8 encoded data
into a file provided that I preceded it with the BOM.
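
A small sketch of that failure mode, assuming Ruby 1.9 or later and String#encode rather than Iconv, with made-up sample text. The euro sign exists in Windows-1252 but the two kanji do not, so a strict conversion raises, while a lossy one substitutes a placeholder.

```ruby
# Assumption: Ruby 1.9+; sample text is hypothetical.
utf8 = "price: 5\u20AC, \u65E5\u672C"

# Strict conversion: raises for characters missing from the target encoding.
begin
  utf8.encode('Windows-1252')
  strict_failed = false
rescue Encoding::UndefinedConversionError => e
  strict_failed = true
  puts "cannot convert: #{e.message}"
end

# Lossy conversion: replace each unconvertible character with "?".
lossy = utf8.encode('Windows-1252', undef: :replace, replace: '?')
puts lossy.encode('UTF-8')
```

Keeping the data in UTF-8 avoids the loss entirely, which is the option being considered above.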

Does that sound right?

Man, when I got up this morning I didn’t know I’d be learning how to
convert character encodings in Ruby. Oh joy!!!

Wes

weyus · May 25, 2006, 5:20pm

Charlie B. wrote:

Sorry to jump into this thread but I have the exact opposite problem.
Open Office uses UTF-8 and I need to be able to copy text from the OO
document into a web form. Is this possible without saving the document
as a plain text file first?

I assume that you should be able to use Iconv to transform the string.
There are examples of its usage earlier in this thread.