How does one transform UTF-8 encoded characters to ASCII?


#1

I’m a little embarrassed about asking this, but here goes…

I am using HTMLEntities.decode_entities on a string of HTML.

I don’t understand how to make my text, which now contains UTF-8
characters, display correctly in say, Notepad. All of the entities are
preceded by the character A-circumflex. My guess is that Notepad
doesn’t know how to handle UTF-8, for example.
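
For anyone puzzled by the A-circumflex symptom: it is what you would expect to see when UTF-8 bytes are displayed as Windows-1252. Many characters produced by HTML entities (such as the copyright sign or a non-breaking space) take two bytes in UTF-8, the first being 0xC2, and 0xC2 on its own is A-circumflex in the legacy code page. A minimal sketch, assuming Ruby 1.9 or later (where strings carry an encoding); the helper name misread_as_cp1252 is made up for illustration:

```ruby
# Reinterpret the UTF-8 bytes of a string as Windows-1252, the way an editor
# that assumes the legacy code page would, and return the result as UTF-8
# so that it can be printed.
# Note: a few byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined in
# Windows-1252 and would make the final encode raise.
def misread_as_cp1252(utf8_text)
  utf8_text.dup.force_encoding("Windows-1252").encode("UTF-8")
end

copyright = "\u00A9" # the character that the &copy; entity decodes to

# The two bytes that UTF-8 uses for this one character: C2 A9
puts copyright.bytes.map { |b| "%02X" % b }.join(" ")

# What a Windows-1252 viewer shows: A-circumflex followed by the sign itself
puts misread_as_cp1252(copyright)
```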

Do I need to unpack the string with ‘U’ and then repack the result with
‘A’?

Thanks,
Wes


#2

OK I have found the iconv library, however, I am still having trouble.

What is the default text encoding for Ruby? I assume it’s gotten from
the OS, right? So if I’m on Windows XP in the US, it’s probably
ISO-8859-1?

Any help is appreciated.

Wes


#3

On 25/05/06, Wes G. wrote:

I don’t understand how to make my text, which now contains UTF-8
characters, display correctly in say, Notepad. All of the entities are
preceded by the character A-circumflex. My guess is that Notepad
doesn’t know how to handle UTF-8, for example.

Windows Notepad does handle UTF-8, but requires the presence of a BOM
(the three bytes “\xef\xbb\xbf”) at the start of the file to read it
properly. Other, more competent applications may allow you to select
the appropriate encoding, or may even automatically detect it.
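
As a rough sketch of the BOM approach, assuming Ruby 1.9 or later: write the three BOM bytes first, then the UTF-8 text. The helper name write_utf8_with_bom and the file name are only placeholders.

```ruby
require "tmpdir"

# The three bytes "\xef\xbb\xbf", held as a raw byte string.
UTF8_BOM = "\xEF\xBB\xBF".b

# Write +text+ as UTF-8 preceded by a BOM, so that Notepad recognises
# the file as UTF-8. Binary mode stops Ruby from translating anything.
def write_utf8_with_bom(path, text)
  File.open(path, "wb") do |f|
    f.write(UTF8_BOM)
    f.write(text.encode("UTF-8"))
  end
end

path = File.join(Dir.tmpdir, "bom_example.txt") # hypothetical file name
write_utf8_with_bom(path, "caf\u00E9 \u00A9 2006\n")
```

Bear in mind that tools which do not expect a BOM will see those three bytes as part of the content.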

OK I have found the iconv library, however, I am still having trouble.

What is the default text encoding for Ruby? I assume it’s gotten from
the OS, right? So if I’m on Windows XP in the US, it’s probably
ISO-8859-1?

Windows XP uses UTF-16 internally, I believe, but retains the concept
of a legacy code page to allow non-Unicode-aware applications to run.
English Windows uses Windows-1252 (mostly, but not completely the same
as ISO-8859-1). Ruby on Windows uses the legacy code page to
communicate with the operating system, so things like file names will
be in Windows-1252.
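
To illustrate the "mostly, but not completely the same" part: the euro sign exists in Windows-1252 (as byte 0x80) but has no place in ISO-8859-1. A small sketch, assuming Ruby 1.9 or later and its built-in String#encode; bytes_in is a made-up helper name:

```ruby
# Return the bytes of +text+ in +encoding+, or nil when some character
# has no representation in that encoding.
def bytes_in(text, encoding)
  text.encode(encoding).bytes
rescue Encoding::UndefinedConversionError
  nil
end

euro = "\u20AC" # the euro sign

p bytes_in(euro, "Windows-1252") # a single byte, 128 (0x80)
p bytes_in(euro, "ISO-8859-1")   # nil, because ISO-8859-1 has no euro sign
```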

Paul.


#4

On Thu, May 25, 2006 at 08:27:46AM +0900, Wes G. wrote:

Paul B. wrote:

Windows Notepad does handle UTF-8, but requires the presence of a BOM
(the three bytes “\xef\xbb\xbf”) at the start of the file to read it
properly. Other, more competent applications may allow you to select
the appropriate encoding, or may even automatically detect it.
I don’t really care about Notepad.

I can’t get VIM to show this text correctly.

Vim handles UTF-8 (if your terminal does!), but may not detect it
without a BOM.

You should be able to tell it that the file is UTF-8 even without the
BOM, for example by reopening it with :e ++enc=utf-8.

I want to be able to convert (what I believe to be) UTF-8 into
Windows-1252 successfully.

Doesn’t iconv take two args, the input and output character sets?

Sam


#5

On May 24, 2006, at 9:36 PM, Sam R. wrote:

require 'iconv'
from = 'UTF-8'
to = 'CP1252'
# Iconv.new takes the target encoding first, then the source encoding
iconvertor = Iconv.new(to, from)
s = "String in utf8"
string_in_windows1252 = iconvertor.iconv(s)
iconvertor.close
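
For readers on a newer Ruby: Iconv was removed from the standard library in Ruby 2.0, and from 1.9 onwards the same conversion can be sketched with the built-in String#encode instead. The method name utf8_to_cp1252 is just an illustration.

```ruby
# Convert a UTF-8 string to Windows-1252 with String#encode.
# As with Iconv, the target encoding comes first, then the source.
def utf8_to_cp1252(s)
  s.encode("Windows-1252", "UTF-8")
end

converted = utf8_to_cp1252("na\u00EFve")
puts converted.encoding      # Windows-1252
puts converted.bytes.inspect # one byte per character
```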


#6

I don’t really care about Notepad.

I can’t get VIM to show this text correctly.

I want to be able to convert (what I believe to be) UTF-8 into
Windows-1252 successfully.

Wes


#7

I would also like to know which encoding is in effect before I start
doing all of this converting.

If I look at the $KCODE variable to try to figure that out and the
value is “NONE”, what does that mean? Can I assume I am using “ASCII”
or “US-ASCII”?
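
A note for later readers: $KCODE belongs to Ruby 1.8. Assuming Ruby 1.9 or later, every string carries its own encoding, and you can check whether a run of bytes is plausible UTF-8 before converting it. A minimal sketch; valid_utf8? is a made-up helper name:

```ruby
# Report whether the given bytes form a valid UTF-8 sequence.
# This cannot prove the text *is* UTF-8, only that it could be.
def valid_utf8?(bytes)
  bytes.dup.force_encoding("UTF-8").valid_encoding?
end

puts Encoding.default_external # encoding Ruby assumes for files and other I/O
puts "text".encoding           # encoding of a string literal in this file

puts valid_utf8?("caf\xC3\xA9".b) # the UTF-8 bytes of "cafe" with an acute e
puts valid_utf8?("caf\xE9".b)     # the Windows-1252 bytes of the same word
```

The last two lines print true and false respectively, since a lone 0xE9 byte is not a complete UTF-8 sequence.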

Thanks,
Wes


#8

Logan C. wrote:

On May 24, 2006, at 9:36 PM, Sam R. wrote:

require 'iconv'
from = 'UTF-8'
to = 'CP1252'
iconvertor = Iconv.new(to, from)
s = "String in utf8"
string_in_windows1252 = iconvertor.iconv(s)
iconvertor.close

Is CP1252 the correct name of the character encoding? I was using
windows-1252?

Where is a list of the canonical names of character encodings?

Also, what is a BOM?

Thanks,
Wes


#9

On May 24, 2006, at 11:13 PM, Wes G. wrote:

iconvertor = Iconv.new(to, from)
s = "String in utf8"
string_in_windows1252 = iconvertor.iconv(s)
iconvertor.close

Is CP1252 the correct name of the character encoding? I was using
windows-1252?

Where is a list of the canonical names of character encodings?

Canonical depends on whether you’re using GNU libiconv [1] or
the local implementation. If you used windows-1252 and it didn’t
throw an exception, you’re probably OK.
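
If you are on Ruby 1.9 or later and using its built-in transcoding rather than iconv, you can ask Ruby which names it accepts. This is only a sketch of that lookup; it says nothing about the names your iconv build knows (running iconv -l on the command line should list those).

```ruby
# Look an encoding up by one of its names; the lookup ignores case.
cp1252 = Encoding.find("CP1252")

puts cp1252.name          # the name Ruby treats as primary
puts cp1252.names.inspect # every alias Ruby accepts for this encoding

# Both spellings used in this thread resolve to the same encoding object.
puts Encoding.find("windows-1252") == cp1252

# All encoding names known to this Ruby that mention 1252.
puts Encoding.name_list.grep(/1252/i).inspect
```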

Also, what is a BOM?

Byte order mark. Usually used with UTF-16, it’s an initial sequence
of bytes that lets a program know whether the file is big- or little-endian.
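
To make that concrete: the BOM is the single character U+FEFF, and its byte pattern depends on the encoding. A short sketch, assuming Ruby 1.9 or later; bom_bytes is a made-up helper name:

```ruby
# Return the bytes of the BOM character (U+FEFF) in the given encoding,
# formatted as hexadecimal.
def bom_bytes(encoding)
  "\uFEFF".encode(encoding).bytes.map { |b| "%02X" % b }.join(" ")
end

%w[UTF-8 UTF-16BE UTF-16LE].each do |name|
  puts "#{name}: #{bom_bytes(name)}"
end
# UTF-8:    EF BB BF
# UTF-16BE: FE FF
# UTF-16LE: FF FE
```

In UTF-8 there is no byte order to signal, so the three bytes only act as a signature saying "this is UTF-8".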

[1] http://www.gnu.org/software/libiconv/


#10

If I do an Iconv from windows-1252 to UTF-8 and
then I operate on my UTF-8 string to do something to it and
then I do an Iconv back to windows-1252 before I write to a file,

THEN

I should not have any need to place a BOM into the string, correct?

The BOM is only to help applications understand that the encoding of the
text is UTF-8, correct?
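
A sketch of that round trip, assuming Ruby 1.9 or later and String#encode in place of Iconv; edit_cp1252 is a made-up helper name. No BOM appears anywhere, because the bytes written at the end are plain Windows-1252.

```ruby
# Take Windows-1252 bytes, hand the text to the block as UTF-8,
# and convert whatever the block returns back to Windows-1252.
def edit_cp1252(raw_bytes)
  text = raw_bytes.dup.force_encoding("Windows-1252").encode("UTF-8")
  text = yield(text)
  text.encode("Windows-1252")
end

original = "caf\xE9".b # "cafe" with an acute e, as stored in a 1252 file
result = edit_cp1252(original) { |t| t.sub("caf", "Caf") }

puts result.encoding      # Windows-1252
puts result.bytes.inspect # still one byte per character, no BOM
```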

Wes


#11

On 5/24/06, Wes G. wrote:

I’m a little embarrassed about asking this, but here goes…

I am using HTMLEntities.decode_entities on a string of HTML.

I don’t understand how to make my text, which now contains UTF-8
characters, display correctly in say, Notepad. All of the entities are
preceded by the character A-circumflex. My guess is that Notepad
doesn’t know how to handle UTF-8, for example.

At least on Windows XP, Notepad can handle various encodings. The
problem is convincing it to use the right one. There is no
obvious way.
One way that might work:
Save a text as UTF-8 in Notepad. Notepad inserts a mark at the
beginning of the text. You can then copy the mark to the beginning of
any of your texts and they would be readable in Notepad. But they would no
longer parse as valid HTML, Ruby, or whatever.

Or just drop Notepad and use a text editor.

HTH

Michal


#12

Sorry to jump into this thread but I have the exact opposite problem.
Open Office uses utf8 and I need to be able to copy text from the OO
document into a web form. Is this possible without saving the document
as a plain text file first?

On Fri, 2006-05-26 at 00:03 +0900, Michal S. wrote:

At least on Windows XP the notepad can handle various encodings. The
HTH

Michal

Charlie B.
http://www.recentrambles.com


#13

One last thing - I believe that one of my issues is that when I attempt
to convert BACK to 1252/ISO-8859 or whatever, some of my UTF-8
chars will not convert because they don’t exist in the target character
encoding. If that’s true, then I could write the UTF-8 encoded data
into a file, provided that I precede it with the BOM.

Does that sound right?
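
One way to check that suspicion, sketched for Ruby 1.9 or later with String#encode rather than Iconv: find the characters that have no Windows-1252 equivalent, and decide whether to substitute them or keep the file as UTF-8. The helper names below are made up for illustration.

```ruby
# True when the single character +ch+ exists in Windows-1252.
def fits_cp1252?(ch)
  ch.encode("Windows-1252")
  true
rescue Encoding::UndefinedConversionError
  false
end

# List the distinct characters of +text+ that cannot be converted.
def unconvertible_chars(text)
  text.each_char.reject { |ch| fits_cp1252?(ch) }.uniq
end

# Convert anyway, substituting "?" for anything that does not fit.
def to_cp1252_lossy(text)
  text.encode("Windows-1252", undef: :replace, replace: "?")
end

# A word with an accent, an arrow, and two Japanese characters.
sample = "caf\u00E9 \u2192 \u65E5\u672C"

puts unconvertible_chars(sample).length   # 3: the arrow and the two kanji
puts to_cp1252_lossy(sample).bytes.inspect
```

If the list comes back empty, converting back to Windows-1252 loses nothing; otherwise writing UTF-8 (with a BOM if Notepad has to read it) keeps everything intact.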

Man, when I got up this morning I didn’t know I’d be learning how to
convert character encodings in Ruby. Oh joy!!!

Wes


#14

Charlie B. wrote:

Sorry to jump into this thread but I have the exact opposite problem.
Open Office uses utf8 and I need to be able to copy text from the OO
document into a web form. Is this possible without saving the document
as a plain text file first?

I assume that you should be able to use Iconv to transform the string.
There are examples of its usage earlier in this thread.