Yet another Unicode string hack

Hi everyone.

I'm implementing yet another Unicode string hack: I'm trying to
rewire the String class so that it acts like the Ruby 2.0 String
class (see http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html).

String literals will act as byte buffers, just as they always have.
However, when creating a string object with the constructor, you can
optionally specify the encoding of the input string:

String.new("\352\260\200", "utf-8")

The default encoding is nil if $KCODE is not set or is set to
"none", and "utf-8" if $KCODE == 'u'. If the encoding is nil, string
objects act just like the old Ruby strings we all know and love. If
the encoding is set to a specific charset, the string's instance
methods behave sensibly according to that encoding. Here is a summary
of what I'm thinking (a short usage sketch follows):

String#encoding gives the character encoding name (e.g. "utf-8").
String#[index] returns a one-character string if the encoding is set.
If the encoding is not set, it returns a Fixnum as it used to.
String#[] is always encoding-aware if the encoding is set.
String#slice is always a byte-buffer operation, regardless of the
encoding.
String#size always returns the number of bytes in the string.
String#length returns the number of characters in the string,
according to the encoding specified. If the encoding is not set, it
is the same as String#size.
String#+ will return a utf-8-encoded string if the two strings'
encodings do not match.

*, <<, <=>, ==, =~, capitalize, casecmp, center, chomp, chop,
count, delete, downcase, each, each_line, eql?, gsub, match, succ,
scan, split, strip, sub, upcase, and upto will all be encoding-aware
if the encoding is set.
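
Here is the usage sketch (the #=> values show the behavior I intend;
"\352\260\200" is the Hangul syllable "ga", three bytes in utf-8):

$KCODE = 'u'
s = String.new("\352\260\200", "utf-8")

s.encoding    #=> "utf-8"
s.size        #=> 3   (bytes)
s.length      #=> 1   (characters)
s[0]          #=> "\352\260\200"  (a one-character string, not a Fixnum)

old = "\352\260\200"   # plain literal, encoding is nil
old.length    #=> 3    (same as size)
old[0]        #=> 234  (a Fixnum, as in today's Ruby)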

The reason I'm differentiating between 'size' and 'length' is that
some libraries (like Rails) depend on them returning the byte size of
the string. Maybe we can establish a custom: 'size' for the byte
size and 'length' for the number of characters. The same reasoning
goes for '[]' and 'slice'.
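
To make the '[]' / 'slice' split concrete, here is how I picture them
differing on a two-character, six-byte utf-8 string (again only a
sketch of the intended behavior):

s = String.new("\352\260\200\353\202\230", "utf-8")

s[1]            #=> "\353\202\230"  (character-based indexing)
s.slice(0, 3)   #=> "\352\260\200"  (byte-based, regardless of encoding)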

For now, it will support only the utf-8 encoding, as Ruby's regexp
engine doesn't seem to support encodings other than ASCII and UTF-8.
(I could use Iconv internally to convert the encoding to utf-8 for
each method call, but at the moment I think that's too costly to be
worth it.)
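
For reference, this is the standard Ruby 1.8 regexp behavior I'm
relying on: the u flag makes /./ match a whole utf-8 character
instead of a single byte.

ga_na = "\352\260\200\353\202\230"   # two utf-8 characters, six bytes

ga_na.scan(/./u)   #=> ["\352\260\200", "\353\202\230"]
ga_na.scan(/./n)   #=> six one-byte strings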

I would love to get some feedback on this. I really want to create
something I can depend on until Ruby 2.0 is released.

Thanks!

Daesan

Dae San H.
[email protected]

Dae San H. wrote:

String.new("\352\260\200", "utf-8")

This is a dead-end approach, alas. Not a single library in the world
will tell Strings to magically become UTF-8.
I implemented a different solution using an accessor that gives you a
character-friendly proxy, in the newest version of my plugin. It
feels somewhat nicer to me.

http://julik.nl/code/unicode-hacks/index.html
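
The idea, roughly (this is only an illustration of the proxy
approach; CharProxy and the chars accessor are made-up names here,
not the plugin's actual API):

class CharProxy
  def initialize(str)
    @str = str
  end

  # Character count, by scanning whole utf-8 sequences (Ruby 1.8).
  def length
    @str.scan(/./mu).length
  end

  # Character-based indexing; returns a one-character string.
  def [](index)
    @str.scan(/./mu)[index]
  end
end

class String
  def chars
    CharProxy.new(self)   # the String itself stays a plain byte buffer
  end
end

"\352\260\200\353\202\230".length         #=> 6 (bytes, unchanged)
"\352\260\200\353\202\230".chars.length   #=> 2 (characters)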

What you are trying to achieve is subclassing - you can see where
that leads here:

http://thraxil.org/users/anders/posts/2005/11/01/unicodification/

exactly because “some flag does not get set somewhere” and such.