Unicode

On Sep 29, 2007, at 9:47 PM, Felipe C. wrote:

Read the unicode stuff carefully. It’s vital for many things.
The short version is that UTF-16 is basically wasteful. UTF-16 stores
characters in 2 bytes…
Let’s say I want to rename a file “fooobar”, and remove the third “o”…

Felipe C.

Hmm… you should consider converting it to utf-8 via iconv.
There is a gem for iconv.
This will keep your data intact, but you might need to convert it
back to utf-16 later.
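The suggested round-trip might look like the sketch below. The thread predates Ruby 1.9, where the Iconv standard library would have done the conversion; `String#encode` (Ruby 1.9+) performs the same conversions and is used here so the sketch runs on current Rubies:

```ruby
# Hypothetical round-trip: UTF-16LE data -> UTF-8 for editing -> back.
utf16 = "fooobar".encode("UTF-16LE")   # pretend this came from a file
utf8  = utf16.encode("UTF-8")          # now safe to manipulate as UTF-8
utf8.sub!("ooo", "oo")                 # drop the third "o" cleanly
back  = utf8.encode("UTF-16LE")        # hand back to UTF-16 consumers
puts utf8                              # => foobar
```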

I believe filenames on Windows are actually utf-8;
files’ contents are generally written in utf-16.

Could be wrong on this…
but test it and see!
Try to open a file with non-ascii range characters in irb and see
what happens.
If it fails, no harm done.
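The suggested experiment can be sketched outside irb as well; the filename `naïve.txt` here is just a made-up example of a non-ASCII name:

```ruby
require "tmpdir"

# Write and re-read a file whose name contains a non-ASCII character.
Dir.mktmpdir do |dir|
  path = File.join(dir, "naïve.txt")   # hypothetical non-ASCII filename
  File.write(path, "hello")
  puts File.read(path)                 # if this fails, no harm done
end
```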

oh, and Mr. Contreras,
I did not mean to say RTFM to you. Sorry if it seemed like that.

On 30/09/2007, Felipe C. [email protected] wrote:

Read the unicode stuff carefully. It’s vital for many things.
The short version is that UTF-16 is basically wasteful. UTF-16 stores
characters in 2 bytes (that means more characters in the world); UTF-8
also allows more characters but doesn’t necessarily need 2 bytes: it
uses 1, and if the character is beyond 127 then it uses 2 bytes. This
whole thing can be extended up to 6 bytes.
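The widths described above are easy to check with `String#bytesize`. (For the record, the original UTF-8 design did allow sequences up to 6 bytes; RFC 3629 later capped UTF-8 at 4 bytes per codepoint.)

```ruby
# Byte counts for single characters under each encoding.
p "A".bytesize                     # 1 byte in UTF-8 (ASCII range)
p "é".bytesize                     # 2 bytes (codepoint above 127)
p "あ".bytesize                    # 3 bytes
p "A".encode("UTF-16LE").bytesize  # 2 bytes: UTF-16 never uses fewer
```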

So what exactly am I looking for here?

UTF-8 and UTF-16 are pretty much the same. They encode a single
character using one or more units, where those units are 8-bit or
16-bit wide respectively. The only things you buy by converting to
utf-16 are space efficiency for codepoints that need nearly 16 bits to
represent (such as Japanese characters) and endianness issues. Note
that some characters may (and some must) be composed of multiple
codepoints (a base character codepoint plus additional accent
codepoint(s)).
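The multi-codepoint characters mentioned above can be observed directly; `String#unicode_normalize` (Ruby 2.2+) folds the decomposed form back into a single codepoint:

```ruby
precomposed = "\u00E9"   # "é" as a single codepoint
combined    = "e\u0301"  # "e" plus a combining acute accent
p precomposed.length     # 1 codepoint
p combined.length        # 2 codepoints, one visible character
# NFC normalization merges the base + accent pair into one codepoint:
p combined.unicode_normalize(:nfc) == precomposed   # true
```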

I’m sorry if I’m being rude, but I really don’t like when people tell
me to read stuff I already know.

My question is still there:

Let’s say I want to rename a file “fooobar” and remove the third “o”,
but it’s UTF-16, and Ruby only supports UTF-8, so when I remove the
“o” of course there will still be a stray 0x00 in there. That’s if the
string is recognized at all.

Why is there no issue with UTF-16 if only UTF-8 is supported?
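The breakage being described can be reproduced by treating the UTF-16 bytes as a plain byte string, which is effectively what a UTF-8-only tool would do:

```ruby
# "fooobar" in UTF-16LE: each ASCII character is a byte pair like "f\x00".
raw = "fooobar".encode("UTF-16LE").force_encoding("BINARY")
p raw.bytesize       # 14 bytes

# A byte-oriented edit removes one "o" byte but leaves its 0x00 partner:
mangled = raw.sub("o", "")
p mangled.bytesize   # 13 bytes: an odd count, so no longer valid UTF-16
p mangled.force_encoding("UTF-16LE").valid_encoding?   # false
```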

If you handle UTF-16 as something else you break it regardless of the
language support. If you know (or have a way to find out) it’s UTF-16
you can convert it. If there is no way to find out, all language
support is moot.
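One common “way to find out” is a byte-order mark at the start of the data. A crude sniff might look like this; `guess_encoding` is a hypothetical helper, not a standard API:

```ruby
# Crude detection sketch: many UTF-16 files begin with a byte-order mark.
# (guess_encoding is a made-up helper for illustration.)
def guess_encoding(bytes)
  return "UTF-16LE" if bytes.start_with?("\xFF\xFE".b)
  return "UTF-16BE" if bytes.start_with?("\xFE\xFF".b)
  "unknown"   # BOM-less data genuinely cannot be told apart this way
end

p guess_encoding("\xFF\xFEf\x00".b)   # "UTF-16LE"
p guess_encoding("plain text".b)      # "unknown"
```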

Thanks
Michal

On Sep 30, 2007, at 12:22 AM, John J. wrote:

There is a gem for iconv

The iconv library is a standard library shipped with Ruby.

James Edward G. II

Felipe C. wrote:

So what exactly am I looking for here?

ASCII is a 7-Bit Encoding with 128 characters in the set.

Most PCs these days use an 8-bit byte. I’m no rocket scientist when it
comes to CPU architectures or character encodings, but I would think
the machine’s byte or word size would be the most logical choices…

Most of my files are in UTF-8 or ISO 8859-1 (and probably some
Windows-1252).
As far as I know UTF-8 and Latin 1 are compatible in the first 128
characters because of ASCII’s widespread adoption.
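That overlap is easy to verify: the first 128 codepoints encode to identical bytes in both encodings, and they diverge immediately above 127:

```ruby
# The ASCII range encodes identically in UTF-8 and ISO 8859-1 (Latin 1).
ascii = (0..127).map { |c| c.chr("UTF-8") }.join
p ascii.encode("UTF-8").bytes == ascii.encode("ISO-8859-1").bytes  # true

# Beyond 127 they diverge: "é" is one byte in Latin 1, two in UTF-8.
p "é".encode("ISO-8859-1").bytesize   # 1
p "é".bytesize                        # 2
```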

Since I may have missed the original message… What is the problem
again?

TerryP.

On Oct 1, 2007, at 9:08 AM, James Edward G. II wrote:

On Sep 30, 2007, at 12:22 AM, John J. wrote:

There is a gem for iconv

The iconv library is a standard library shipped with Ruby.

James Edward G. II

Sure enough!
Just got so used to requiring rubygems with nearly everything…