Hi everyone. I'm implementing yet another set of unicode string hacks. I'm trying to rewire the String class so that it acts like the Ruby 2.0 String class. (see http://redhanded.hobix.com/inspect/futurismUnicode...)

String literals will act as byte buffers, just as they used to. However, when creating a string object with the constructor, you can optionally specify the encoding of the input string:

  String.new("\352\260\200", "utf-8")

The default encoding is nil if $KCODE is not set or is set to "none", and 'utf-8' if $KCODE == 'u'. If the encoding is nil, string objects act just like the old ruby strings we all know and love. If the encoding is set to a specific charset, the string's instance methods act more reasonably according to that encoding.

Here is a summary of what I'm thinking:

- String#encoding gives the character encoding name (e.g. "utf-8").
- String#[index] returns a character string if an encoding is set. If no encoding is set, it returns a fixnum as it used to.
- String#[] is always encoding aware if an encoding is set.
- String#slice is always a byte buffer operation, regardless of the encoding.
- String#size always returns the number of bytes in the string.
- String#length returns the number of characters in the string according to the specified encoding. If no encoding is set, it is the same as String#size.
- String#+ returns a utf-8 encoded string if the two strings' encodings do not match.
- *, <<, <=>, ==, =~, capitalize, casecmp, center, chomp, chop, count, delete, downcase, each, each_line, eql?, gsub, match, succ, scan, split, strip, sub, upcase, upto will all be encoding aware if an encoding is set.

The reason I'm differentiating between 'size' and 'length' is that some libraries (like rails) depend on them returning the byte size of the string. Maybe we can establish a convention that 'size' means the byte size and 'length' means the number of characters. The same reasoning goes for '[]' and 'slice'.
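To make the size/length and []/slice split concrete, here is a minimal sketch of the proposed behaviour, written as a standalone wrapper class so it runs on any modern Ruby. The class name CharString and the internals are my assumptions for illustration, not the actual patch:

```ruby
# CharString: a byte string plus an optional encoding label
# (hypothetical sketch of the proposal, utf-8 only).
class CharString
  attr_reader :encoding

  def initialize(bytes, encoding = nil)
    @bytes = bytes
    @encoding = encoding
  end

  # size always reports raw bytes, per the proposal
  def size
    @bytes.bytesize
  end

  # length counts characters when an encoding is set,
  # and falls back to the byte count otherwise
  def length
    return size if @encoding.nil?
    @bytes.unpack('U*').length   # decode utf-8 codepoints
  end

  # [] yields a one-character string when encoding aware,
  # and a byte value otherwise
  def [](index)
    return @bytes.getbyte(index) if @encoding.nil?
    [@bytes.unpack('U*')[index]].pack('U')
  end
end

s = CharString.new("\352\260\200", "utf-8")  # one Hangul syllable, 3 bytes
s.size    # => 3
s.length  # => 1
```

With no encoding argument, the same input reports length 3, matching the old byte-buffer behaviour.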
For now, it will support only the utf-8 encoding, as ruby's regexp doesn't seem to support encodings other than ascii and utf-8. (I could use iconv internally to convert each string to utf-8 on every method call, but at the moment I think that's probably too costly and not worth it.) I would love to get some feedback on this. I really want to create something I can depend on until Ruby 2.0 is released. Thanks!

Daesan

Dae San Hwang
email@example.com
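A rough illustration of the regexp point: Ruby's regexp engine can split utf-8 text into characters via the /u flag, which is what makes utf-8 the one encoding that is cheap to support without an iconv round-trip. (On 1.8 this relied on $KCODE = 'u'; on modern Ruby the byte string must be tagged first, which I do here so the snippet stays runnable.)

```ruby
bytes = "\352\260\200\352\260\200"        # two Hangul syllables, 6 raw bytes
utf8  = bytes.dup.force_encoding('UTF-8') # modern-Ruby stand-in for $KCODE='u'
chars = utf8.scan(/./u)                   # /u makes . match one character
chars.length  # => 2
```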
on 2006-06-14 12:32
on 2006-06-15 02:48
Dae San Hwang wrote:
> String.new("\352\260\200", "utf-8")

This is a dead-end approach, alas. Not a single library in the world will tell Strings to magically become UTF-8. I implemented a different solution using an accessor that gives you a character-friendly proxy, in the newest version of my plugin. It feels somewhat nicer to me. http://julik.nl/code/unicode-hacks/index.html

What you are trying to achieve is subclassing - you can see where that leads here: http://thraxil.org/users/anders/posts/2005/11/01/u... and here http://thraxil.org/users/anders/posts/2005/11/01/u... exactly because "some flag does not get set somewhere" and such.
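A hypothetical illustration of the proxy idea (the names here are mine, not the actual API of the linked plugin): the plain String stays a byte buffer, and an accessor hands back a character-aware view on demand, so no flag ever has to be carried on the String itself.

```ruby
# CharProxy: a character-aware view over a utf-8 byte string
# (illustrative only; not the linked plugin's real classes).
class CharProxy
  def initialize(str)
    @str = str
  end

  def length
    @str.unpack('U*').length   # count utf-8 codepoints
  end

  def [](index)
    [@str.unpack('U*')[index]].pack('U')
  end
end

class String
  # character-friendly view; String itself is untouched otherwise
  def char_view
    CharProxy.new(self)
  end
end

"\352\260\200abc".char_view.length  # => 4 (one Hangul syllable + "abc")
```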