A plan for another unicode string hack

Dae_San_H · June 14, 2006, 4:50pm

Hi everyone.

I’m implementing yet another unicode string hack. I’m trying to
rewire the String class so that it acts like the Ruby 2.0 String class.
(see http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html)

String literals will act as byte buffers, just as they used to.
However, when creating a string object with the constructor, you can
optionally specify the encoding of the input string.

String.new("\352\260\200", "utf-8")

The default encoding is nil if $KCODE is not set or is set to
“none”, and ‘utf-8’ if $KCODE == ‘u’. If the encoding is nil,
string objects will act just like the old ruby strings we all know
and love. If the encoding is set to a specific charset, the string’s
instance methods will act more reasonably according to that encoding.
The following is a summary of what I’m thinking:

String#encoding gives character encoding name (e.g. “utf-8”)
String#[index] returns character string if encoding is set. If the
encoding is not set, it returns fixnum as it used to.
String#[] is always encoding aware if encoding is set.
String#slice is always byte buffer operation regardless of the
encoding.
String#size always returns the number of bytes in the string.
String#length returns the number of characters in the string
according to the encoding specified. If the encoding is not set, it’s
same as String#size.
String#+ will return a utf-8 encoded string if the two strings’
encodings do not match.

*, <<, <=>, ==, =~, capitalize, casecmp, center, chomp, chop,
count, delete, downcase, each, each_line, eql?, gsub, match, succ,
scan, split, strip, sub, upcase, upto will all be encoding aware if
an encoding is set.
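
To make the proposed split concrete, here is a minimal sketch of the semantics listed above. EncodedString is a hypothetical wrapper class, not the planned patch to String itself. It handles only nil and "utf-8", it decodes characters with unpack("U*") (one possible approach, not a decided one), and it is written against a current Ruby so that it runs today. Ruby 1.8 would need different primitives.

```ruby
# Hypothetical wrapper illustrating the proposed semantics; the real hack
# would reopen String instead. Only nil and "utf-8" encodings are handled.
class EncodedString
  attr_reader :encoding

  def initialize(bytes, encoding = nil)
    @bytes = bytes.dup.force_encoding("BINARY") # keep a raw byte buffer
    @encoding = encoding
  end

  # size: always the number of bytes.
  def size
    @bytes.bytesize
  end

  # length: number of characters if an encoding is set, else same as size.
  def length
    @encoding ? codepoints.length : size
  end

  # []: a one-character string if an encoding is set, a byte value otherwise.
  def [](index)
    return @bytes.getbyte(index) unless @encoding
    cp = codepoints[index]
    cp && [cp].pack("U")
  end

  # slice: always a byte-level operation, regardless of encoding.
  def slice(index)
    @bytes.getbyte(index)
  end

  private

  # Decode the buffer as UTF-8 code points.
  def codepoints
    @bytes.unpack("U*")
  end
end

s = EncodedString.new("\352\260\200", "utf-8")
s.size    # counts bytes
s.length  # counts characters
```

Only single-index access is shown; ranges and the other methods in the list would need the same treatment.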

The reason I’m differentiating between ‘size’ and ‘length’ is that
some libraries (like rails) depend on them returning the byte size of
the string. Maybe we can establish a convention that ‘size’ is for byte
size and ‘length’ is for the number of characters. The same reasoning
goes for ‘[]’ and ‘slice’.

For now, it will support only utf-8 encoding as ruby’s regexp doesn’t
seem to support encodings other than ascii and utf-8. (I could use
iconv to convert encoding internally to utf-8 for each method call,
but at the moment, I think it’s probably too costly and not worth it.)
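
As a rough illustration of the "convert to utf-8 internally on each call" idea, the sketch below transcodes a string to UTF-8 before doing character-level work. It uses String#encode from later Ruby versions as a stand-in for iconv, and utf8_length is a hypothetical helper name, not part of the proposal.

```ruby
# Hypothetical helper: count characters by first transcoding to UTF-8.
# String#encode stands in for iconv here.
def utf8_length(bytes, source_encoding)
  tagged = bytes.dup.force_encoding(source_encoding) # label the raw bytes
  utf8 = tagged.encode("UTF-8")                      # the per-call conversion
  utf8.length
end

euc_kr = "\260\241".b # U+AC00 encoded as EUC-KR (two bytes)
utf8_length(euc_kr, "EUC-KR")
```

The conversion runs on every call, which is the cost the post is worried about.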

I would love to get some feedback on this. Matz’s feedback would be
especially welcome, since I want to make this as forward compatible
as possible with Ruby 2.0.

Thanks!

Daesan

Dae San H.

Dae_San_H · June 14, 2006, 4:54pm

On 6/14/06, Dae San H. wrote:

String#size always returns the number of bytes in the string.
String#length returns the number of characters in the string
according to the encoding specified. If the encoding is not set, it’s
same as String#size.

This is a bad change.

#size and #length are synonymous now and should remain so. Add a new
method, like #character_count or something like that.

-austin

Dae_San_H · June 14, 2006, 4:57pm

On Jun 14, 2006, at 10:47 AM, Dae San H. wrote:

The reason I’m differentiating between ‘size’ and ‘length’ is
because some libraries (like rails) depend on them returning the
byte size of the string. Maybe we can establish a convention that
‘size’ is for byte size and ‘length’ is for the number of characters.
Same reasoning goes for ‘[]’ and ‘slice’.

I like these very much, although the choice between [] and slice seems
arbitrary (i.e. you could have swapped their meanings and it would
have made just as much sense). #size vs. #length is perfect, and #[]
being a Fixnum when there is no encoding but a character when there
is one is equally brilliant. I salute you sir!

Dae_San_H · June 14, 2006, 5:00pm

“A” == Austin Z. writes:

A> #size and #length are synonymous now and should remain so. Add a new
A> method, like #character_count or something like that.

Say this to matz

svg% cat b.rb
#!./ruby -ku
a = String.new("Peut-être qu'on n'était pas encore là…", "utf-8")
p a.length
p a.size
svg%

svg% ./b.rb
39
42
svg%

old ruby_m17n implementation

Guy Decoux

Dae_San_H · June 14, 2006, 5:11pm

On 6/14/06, ts wrote:

“A” == Austin Z. writes:
A> #size and #length are synonymous now and should remain so. Add a new
A> method, like #character_count or something like that.

Say this to matz

I will. Matz, please see above.

The problem I have with this change is that I know that in my code I
have used #length and #size interchangeably depending on which reads
better in context.

It’s not a good, clear, and understandable change. It will forever
require looking in ri or other resources to remember which one counts
characters and which one counts bytes.

-austin

Dae_San_H · June 14, 2006, 5:21pm

On 6/14/06, Austin Z. wrote:

… in my code I
have used #length and #size interchangeably depending on which reads
better in context.

I’ve never been a fan of the Ruby practice of having many names for
the same thing, but I’m willing to be convinced. Can you give me an
example of two string variables where getting the number of characters
reads better with “length” for one and “size” for the other?

Dae_San_H · June 14, 2006, 5:49pm

On Jun 14, 2006, at 11:56 PM, Logan C. wrote:

seem arbitrary (i.e. you could have swapped their meanings and it
would have made just as much sense). #size vs. #length is perfect.
and #[] being a Fixnum when there was no encoding but a character
when there is is equally brilliant. I salute you sir!

Thanks for the kind words.

The reason I picked [] for the encoding aware method is that
String#[index] will be used to extract the letter and not the byte in
Ruby 2.0, as mentioned in
http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html

so that “abc”[0] returns “a” instead of fixnum 97

A way to get the Nth byte of a byte buffer is probably still necessary,
and String#slice seemed to be the logical choice, I thought.

Dae San H.

Dae_San_H · June 14, 2006, 6:11pm

On Jun 15, 2006, at 12:30 AM, Austin Z. wrote:

reads better with “length” for one and “size” for the other?

It’s all code context. “name.length” reads better than “name.size” and
“box.size” reads better than “box.length”. Remember, in Ruby you
don’t know whether you’re dealing with a String, Array, or Hash (or
something else) when you’re dealing with simple method calls.
Similarly, I will use #map most of the time, but sometimes I’ll use
#collect.

Are you sure that “box” happened to be a variable for a string
object?

In any case, these are well-established names and having them differ
would be problematic. That said, I’ll have to fix stuff in Ruby 2
for PDF::Writer because I’m currently doing byte counting, not
character counting.

My proposed change won’t disturb anyone’s existing code unless you
also set $KCODE to ‘u’ in that code. If you did set $KCODE to
‘u’ in your previous projects, you don’t have to apply this hack
(which hasn’t been implemented yet) to those projects.

Matz has said several times that he will maximize the breakage moving
to Ruby 2.0. If Matz is going to make these changes for Ruby 2.0, (as
implied in Guy Decoux’s posting) I think I will just follow along. My
goal is to provide Ruby 2.0 forward compatible unicode support until
the move is complete.

Dae San H.

Dae_San_H · June 14, 2006, 6:18pm

On 6/14/06, Dae San H. wrote:

My proposed change won’t disturb anyone’s existing codes unless you
set $KCODE to be ‘u’ as well in that code. If you did set $KCODE to
‘u’ in your previous projects, you don’t have to apply this hack
(which hasn’t been implemented yet) to that project.

Um. PDF::Writer is a library, and I think that I use both depending on
how the code reads.

Matz has said several times that he will maximize the breakage moving
to Ruby 2.0. If Matz is going to make these changes for Ruby 2.0, (as
implied in Guy Decoux’s posting) I think I will just follow along. My
goal is to provide Ruby 2.0 forward compatible unicode support until
the move is complete.

Yes, I understand. Making #size and #length return different values
is a mistake. Without referring to documentation, how would you know
which returns the number of characters and which one returns the
number of bytes?

They should always either return characters or bytes (preferably
characters) and a separate call should be introduced for the
alternative meaning. One that is explicit in its name to match its
meaning.
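
One way to read this suggestion: keep #size and #length as synonyms, and give the other measure a name that says what it counts. Later Ruby versions provide String#bytesize alongside #length and #size, so the sketch below uses that. It assumes such a Ruby (not the 1.8 under discussion here), and the variable name is only illustrative.

```ruby
# Assumes a Ruby where strings carry an encoding and String#bytesize exists.
name = "\352\260\200".dup.force_encoding("UTF-8") # one character, three bytes

name.length   # number of characters
name.size     # same as length: the two stay synonymous
name.bytesize # number of bytes, under an explicit name
```

With this arrangement nobody has to remember which of two near-synonyms counts bytes.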

-austin

Dae_San_H · June 14, 2006, 5:32pm

On 6/14/06, Mark V. wrote:

On 6/14/06, Austin Z. wrote:

… in my code I
have used #length and #size interchangeably depending on which reads
better in context.

I’ve never been a fan of the Ruby practice of having many names for
the same thing, but I’m willing to be convinced. Can you give me an
example of two string variables where getting the number of characters
reads better with “length” for one and “size” for the other?

It’s all code context. “name.length” reads better than “name.size” and
“box.size” reads better than “box.length”. Remember, in Ruby you
don’t know whether you’re dealing with a String, Array, or Hash (or
something else) when you’re dealing with simple method calls.
Similarly, I will use #map most of the time, but sometimes I’ll use
#collect.

In any case, these are well-established names and having them differ
would be problematic. That said, I’ll have to fix stuff in Ruby 2
for PDF::Writer because I’m currently doing byte counting, not
character counting.

-austin

Dae_San_H · June 14, 2006, 6:53pm

On 14/06/06, Dae San H. wrote:

For now, it will support only utf-8 encoding as ruby’s regexp doesn’t
seem to support encodings other than ascii and utf-8. (I could use
iconv to convert encoding internally to utf-8 for each method call,
but at the moment, I think it’s probably too costly and not worth it.)

Regexp also supports EUC (which seems to work for EUC-KR as well as
EUC-JP, incidentally) and Shift_JIS. Nevertheless, I think that
starting with UTF-8 is the way to go.

I would love to get some feedback on this. Matz’s feedback will be
especially great since I want to make this as much forward compatible
as possible with Ruby 2.0.

I think it’s a great idea. If you want any implementation assistance,
I’d be glad to help (I’ve done quite a bit of Unicode hacking in
Ruby).

Paul.

Dae_San_H · June 14, 2006, 6:24pm

On Jun 14, 2006, at 12:17 PM, Austin Z. wrote:

They should always either return characters or bytes (preferably
characters) and a separate call should be introduced for the
alternative meaning. One that is explicit in its name to match its
meaning.

+1

Gary W.

Dae_San_H · June 14, 2006, 8:02pm

Dae San H. wrote:

String literals will act as byte buffers, just as they used to. However,
when creating string object by using constructor, you can optionally
specify the encoding of the input string.

String.new("\352\260\200", "utf-8")

I’d like to have a different interface, using named parameters.

String.new("\352\260\200", encoding: "utf-8")

or

String.new("\352\260\200", :encoding => "utf-8")

That way it’s easier to extend String later on.

Cheers,
Daniel

Dae_San_H · June 14, 2006, 8:56pm

On 6/14/06, Dae San H. wrote:

[index] will be used to extract the letter and not the byte in Ruby
2.0 as mentioned in
http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html

so that “abc”[0] returns “a” instead of fixnum 97

This behaviour - of [] returning different values depending on the
argument has always made me a bit crazy. Does anyone know why it was
done that way?

Dae_San_H · June 14, 2006, 8:59pm

On 6/14/06, Leslie V. wrote:

Same reasoning goes for ‘[]’ and ‘slice’.

The reason I picked [] for encoding aware method is because
String#[index] will be used to extract the letter and not the byte in
Ruby 2.0 as mentioned in
http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html

so that “abc”[0] returns “a” instead of fixnum 97

This behaviour - of [] returning different values depending on the
argument has always made me a bit crazy. Does anyone know why it was
done that way?

…returning different type values I mean…

Dae_San_H · June 15, 2006, 6:14am

Dae San H. wrote:

The reason I’m differentiating between ‘size’ and ‘length’ is because
some libraries (like rails) depend on them returning the byte size of
the string. Maybe we can establish a convention that ‘size’ is for byte
size and ‘length’ is for the number of characters. Same reasoning goes
for ‘[]’ and ‘slice’.

Good idea. This separation of ‘length’ and ‘size’ methods is quite
reasonable, in my opinion.

Dae_San_H · June 15, 2006, 10:12am

Leslie V. wrote:

On 6/14/06, Leslie V. wrote:

On 6/14/06, Dae San H. wrote:

so that “abc”[0] returns “a” instead of fixnum 97

This behaviour - of [] returning different values depending on the
argument has always made me a bit crazy. Does anyone know why it was
done that way?

…returning different type values I mean…

I’ve heard it’s due to be fixed by end of next year.

Right now, to Ruby’s strings, a character is a byte, represented by a Fixnum.

The new Ruby character will be a string:

?c #=> "c"
"c"[0] #=> "c"
"c"[0].ord #=> 99

Cheers,
Dave

Dae_San_H · June 15, 2006, 11:56am

Matz has said several times that he will maximize the breakage moving
to Ruby 2.0.

If that is the case, is there a reason why we should continue using
String for bytes then? We could have a ByteBuffer class or something
similar. IO objects would return ByteBuffer; ByteBuffer.to_s would
return the string equivalent in the current encoding, and
ByteBuffer.to_s(‘my encoding’) into the required encoding.
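
A minimal sketch of what such a class might look like. ByteBuffer, its constructor, the UTF-8 default standing in for "the current encoding", and the reading of to_s(encoding) as "interpret the bytes in that encoding" are all assumptions made for illustration. It uses methods from later Ruby versions (String#b, #getbyte, #force_encoding).

```ruby
# Hypothetical ByteBuffer: holds raw bytes and only becomes a String
# when an encoding is supplied (or the assumed default is used).
class ByteBuffer
  DEFAULT_ENCODING = "UTF-8" # stand-in for "the current encoding"

  def initialize(bytes)
    @bytes = bytes.b # binary copy with no character semantics
  end

  # Number of bytes held.
  def size
    @bytes.bytesize
  end

  # Byte value at the given offset, or nil when out of range.
  def [](offset)
    @bytes.getbyte(offset)
  end

  # Interpret the bytes as text in the given encoding.
  def to_s(encoding = DEFAULT_ENCODING)
    @bytes.dup.force_encoding(encoding)
  end
end

buf = ByteBuffer.new("\352\260\200")
buf.size                 # bytes
buf.to_s.length          # characters under the default encoding
buf.to_s("ISO-8859-1")   # same bytes, read as a single-byte encoding
```

Keeping byte access and character access on separate classes means neither needs the size/length or []/slice split discussed earlier in the thread.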

Anselm


Dae_San_H · June 15, 2006, 12:12pm

Hi!

If that is the case, is there a reason why we should continue using
String for bytes then? We could have a ByteBuffer class or something
similar. IO objects would return ByteBuffer; ByteBuffer.to_s would
return the string equivalent in the current encoding, and
ByteBuffer.to_s(‘my encoding’) into the required encoding.

+1, that would be very nice. On some other platforms (like Java or
.NET), the programmer doesn’t know about the bytes (only the length in
terms of chars) unless he’s willing to dig into them (using a specific
class).

Dae_San_H · June 15, 2006, 12:12pm

On 6/15/06, Anselm H. wrote:

the required encoding.

Will you volunteer to go through all the source code of the Ruby core
library (about 350 KSLOC of C and Ruby), and inspect, fix and test all
the consequent issues? How long could it take?