A plan for another unicode string hack

Hi everyone.

I’m implementing yet another Unicode string hack. I’m trying to
rewire the String class so that it acts like the Ruby 2.0 String class.
(see http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html)

String literals will act as byte buffers, just as they used to.
However, when creating a string object with the constructor, you can
optionally specify the encoding of the input string.

String.new("\352\260\200", "utf-8")

The default encoding is nil if $KCODE is not set or is set to
“none”, and ‘utf-8’ if $KCODE == ‘u’. If the encoding is nil, string
objects will act just like the old Ruby strings we all know and love.
If the encoding is set to a specific charset, the string’s instance
methods will act more reasonably according to that encoding.
The following is a summary of what I’m thinking:

String#encoding gives the character encoding name (e.g. “utf-8”).
String#[index] returns a character string if the encoding is set. If
the encoding is not set, it returns a Fixnum as it used to.
String#[] is always encoding-aware if the encoding is set.
String#slice is always a byte buffer operation regardless of the
encoding.
String#size always returns the number of bytes in the string.
String#length returns the number of characters in the string
according to the specified encoding. If the encoding is not set, it’s
the same as String#size.
String#+ will return a utf-8 encoded string if the two strings’
encodings do not match.

*, <<, <=>, ==, =~, capitalize, casecmp, center, chomp, chop,
count, delete, downcase, each, each_line, eql?, gsub, match, succ,
scan, split, strip, sub, upcase, upto will all be encoding-aware if
the encoding is set.
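As a rough illustration of the semantics summarized above, here is a minimal sketch written against a current Ruby (1.9 or later), where strings already carry an encoding. The wrapper class `EncodedString` is a hypothetical name used only for this sketch. It is not the proposed patch to String itself, and it covers only encoding, size, length, [] and slice.

```ruby
# Hypothetical wrapper that mimics the proposed behaviour.
# Assumption: runs on Ruby 1.9+; the class name is illustrative only.
class EncodedString
  attr_reader :encoding

  def initialize(bytes, encoding = nil)
    # Keep the raw bytes as a binary buffer, whatever the literal's encoding was.
    @bytes = bytes.dup.force_encoding(Encoding::BINARY)
    @encoding = encoding
  end

  # Always the number of bytes.
  def size
    @bytes.bytesize
  end

  # Number of characters if an encoding is set, otherwise the same as size.
  def length
    encoding ? decoded.length : size
  end

  # A one-character string if an encoding is set, otherwise the byte value.
  def [](index)
    encoding ? decoded[index] : @bytes.getbyte(index)
  end

  # Always a byte buffer operation.
  def slice(index)
    @bytes.getbyte(index)
  end

  private

  # View of the same bytes tagged with the declared encoding.
  def decoded
    @bytes.dup.force_encoding(encoding)
  end
end

s = EncodedString.new("\352\260\200", "utf-8")
s.size      # => 3
s.length    # => 1
s.slice(0)  # => 234

raw = EncodedString.new("abc")
raw.length  # => 3
raw[0]      # => 97
```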

The reason I’m differentiating between ‘size’ and ‘length’ is that
some libraries (like Rails) depend on them returning the byte size of
the string. Maybe we can establish a convention that ‘size’ is for the
byte size and ‘length’ is for the number of characters. The same
reasoning goes for ‘[]’ and ‘slice’.

For now, it will support only utf-8 encoding as ruby’s regexp doesn’t
seem to support encodings other than ascii and utf-8. (I could use
iconv to convert encoding internally to utf-8 for each method call,
but at the moment, I think it’s probably too costly and not worth it.)
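For a sense of what the convert-on-every-call approach would involve, here is a minimal sketch. It assumes a current Ruby (1.9 or later) and uses the built-in `String#encode` rather than iconv. The helper name `with_utf8` is hypothetical. Each call pays for two conversions, which is the cost being weighed above.

```ruby
# Hypothetical helper: run a block on a UTF-8 copy of the data, then
# convert the block's result back to the original encoding.
# Assumption: Ruby 1.9+ with its built-in transcoders (no iconv needed).
def with_utf8(bytes, encoding)
  utf8 = bytes.dup.force_encoding(encoding).encode(Encoding::UTF_8)
  yield(utf8).encode(encoding)
end

# Two Hangul syllables (U+AC00, U+B098) stored as EUC-KR bytes.
euc = "\uAC00\uB098".encode("EUC-KR")

# Reverse by character, not by byte, via a round trip through UTF-8.
reversed = with_utf8(euc.b, "EUC-KR") { |s| s.reverse }
puts reversed.encode("UTF-8") == "\uB098\uAC00"
```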

I would love to get some feedback on this. Matz’s feedback would be
especially welcome since I want to make this as forward-compatible
as possible with Ruby 2.0.

Thanks!

Daesan

Dae San H.
[email protected]

On 6/14/06, Dae San H. [email protected] wrote:

String#size always returns the number of bytes in the string.
String#length returns the number of characters in the string
according to the encoding specified. If the encoding is not set, it’s
same as String#size.

This is a bad change.

#size and #length are synonymous now and should remain so. Add a new
method, like #character_count or something like that.

-austin

On Jun 14, 2006, at 10:47 AM, Dae San H. wrote:

The reason I’m differentiating between ‘size’ and ‘length’ is
that some libraries (like Rails) depend on them returning the
byte size of the string. Maybe we can establish a convention that
‘size’ is for the byte size and ‘length’ is for the number of
characters. The same reasoning goes for ‘[]’ and ‘slice’.

I like these very much. Although the choice between [] and slice seems
arbitrary (i.e. you could have swapped their meanings and it would
have made just as much sense), #size vs. #length is perfect. And #[]
being a Fixnum when there was no encoding but a character when there
is one is equally brilliant. I salute you, sir!

“A” == Austin Z. [email protected] writes:

A> #size and #length are synonymous now and should remain so. Add a new
A> method, like #character_count or something like that.

Say this to matz :)

svg% cat b.rb
#!./ruby -ku
a = String.new("Peut-être qu’on n’était pas encore là …", "utf-8")
p a.length
p a.size
svg%

svg% ./b.rb
39
42
svg%

old ruby_m17n implementation

Guy Decoux

On 6/14/06, ts [email protected] wrote:

“A” == Austin Z. [email protected] writes:
A> #size and #length are synonymous now and should remain so. Add a new
A> method, like #character_count or something like that.

Say this to matz :)

I will. Matz, please see above. ;)

The problem I have with this change is that I know that in my code I
have used #length and #size interchangeably depending on which reads
better in context.

It’s not a good, clear, and understandable change. It will forever
require looking in ri or other resources to remember which one counts
characters and which one counts bytes.

-austin

On 6/14/06, Austin Z. [email protected] wrote:

… in my code I
have used #length and #size interchangeably depending on which reads
better in context.

I’ve never been a fan of the Ruby practice of having many names for
the same thing, but I’m willing to be convinced. Can you give me an
example of two string variables where getting the number of characters
reads better with “length” for one and “size” for the other?

On Jun 14, 2006, at 11:56 PM, Logan C. wrote:

seems arbitrary (i.e. you could have swapped their meanings and it
would have made just as much sense), #size vs. #length is perfect.
And #[] being a Fixnum when there was no encoding but a character
when there is one is equally brilliant. I salute you, sir!

Thanks for the kind words.

The reason I picked [] for the encoding-aware method is that
String#[index] will be used to extract the letter and not the byte in
Ruby 2.0, as mentioned in
http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html

so that “abc”[0] returns “a” instead of fixnum 97

A way to get the Nth byte of a byte buffer is probably still necessary,
and String#slice seemed to be the logical one, I thought.
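For comparison, here is how the character and byte accessors look on a current Ruby (1.9 or later). In that Ruby, byte access uses the separate methods `getbyte` and `byteslice` rather than `slice`, so this sketch shows the same split under different names, not the proposal itself.

```ruby
# Assumption: Ruby 1.9+; these are the built-in String methods there.
s = "abc"
s[0]          # => "a"   (a one-character string)
s[0].ord      # => 97    (its code point)
s.getbyte(0)  # => 97    (the first byte as an integer)

# With a multibyte character the two views differ.
ga = "\uAC00"           # one Hangul syllable, three bytes in UTF-8
ga[0] == ga             # => true  (the whole character)
ga.getbyte(0)           # => 234   (only the first byte, 0xEA)
ga.byteslice(0, 1).bytesize  # => 1
```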

Dae San H.
[email protected]

On Jun 15, 2006, at 12:30 AM, Austin Z. wrote:

reads better with “length” for one and “size” for the other?

It’s all code context. “name.length” reads better than “name.size” and
“box.size” reads better than “box.length”. Remember, in Ruby you
don’t know whether you’re dealing with a String, Array, or Hash (or
something else) when you’re dealing with simple method calls.
Similarly, I will use #map most of the time, but sometimes I’ll use
#collect.

Are you sure that “box” happened to be a variable for a string
object? ;)

In any case, these are well-established names and having them differ
would be problematic. That said, I’ll have to fix stuff in Ruby 2
for PDF::Writer because I’m currently doing byte counting, not
character counting.

My proposed change won’t disturb anyone’s existing code unless you
set $KCODE to ‘u’ as well in that code. If you did set $KCODE to
‘u’ in your previous projects, you don’t have to apply this hack
(which hasn’t been implemented yet) to that project.

Matz has said several times that he will maximize the breakage moving
to Ruby 2.0. If Matz is going to make these changes for Ruby 2.0, (as
implied in Guy Decoux’s posting) I think I will just follow along. My
goal is to provide Ruby 2.0 forward compatible unicode support until
the move is complete.

Dae San H.
[email protected]

On 6/14/06, Dae San H. [email protected] wrote:

My proposed change won’t disturb anyone’s existing code unless you
set $KCODE to ‘u’ as well in that code. If you did set $KCODE to
‘u’ in your previous projects, you don’t have to apply this hack
(which hasn’t been implemented yet) to that project.

Um. PDF::Writer is a library, and I think that I use both depending on
how the code reads.

Matz has said several times that he will maximize the breakage moving
to Ruby 2.0. If Matz is going to make these changes for Ruby 2.0, (as
implied in Guy Decoux’s posting) I think I will just follow along. My
goal is to provide Ruby 2.0 forward compatible unicode support until
the move is complete.

Yes, I understand. Making #size and #length return different values
is a mistake. Without referring to documentation, how would you know
which returns the number of characters and which one returns the
number of bytes?

They should always either return characters or bytes (preferably
characters) and a separate call should be introduced for the
alternative meaning. One that is explicit in its name to match its
meaning.
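A minimal sketch of that alternative, assuming a current Ruby (1.9 or later): #size and #length stay synonymous, and explicitly named methods carry the two meanings. The names `character_count` and `byte_count` are hypothetical, taken from the suggestion earlier in the thread. Current Ruby itself went a similar way, with `length` and `size` both counting characters and `bytesize` counting bytes.

```ruby
# Hypothetical explicit names; not part of Ruby's String API.
class String
  # Number of characters in the string's encoding.
  def character_count
    length
  end

  # Number of bytes in the underlying buffer.
  def byte_count
    bytesize
  end
end

word = "Peut-\u00EAtre"   # "Peut-être"; the e-circumflex takes two bytes in UTF-8
word.character_count      # => 9
word.byte_count           # => 10
word.size == word.length  # => true
```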

-austin

On 6/14/06, Mark V. [email protected] wrote:

On 6/14/06, Austin Z. [email protected] wrote:

… in my code I
have used #length and #size interchangeably depending on which reads
better in context.
I’ve never been a fan of the Ruby practice of having many names for
the same thing, but I’m willing to be convinced. Can you give me an
example of two string variables where getting the number of characters
reads better with “length” for one and “size” for the other?

It’s all code context. “name.length” reads better than “name.size” and
“box.size” reads better than “box.length”. Remember, in Ruby you
don’t know whether you’re dealing with a String, Array, or Hash (or
something else) when you’re dealing with simple method calls.
Similarly, I will use #map most of the time, but sometimes I’ll use
#collect.

In any case, these are well-established names and having them differ
would be problematic. That said, I’ll have to fix stuff in Ruby 2
for PDF::Writer because I’m currently doing byte counting, not
character counting.

-austin

On 14/06/06, Dae San H. [email protected] wrote:

For now, it will support only utf-8 encoding as ruby’s regexp doesn’t
seem to support encodings other than ascii and utf-8. (I could use
iconv to convert encoding internally to utf-8 for each method call,
but at the moment, I think it’s probably too costly and not worth it.)

Regexp also supports EUC (which seems to work for EUC-KR as well as
EUC-JP, incidentally) and Shift_JIS. Nevertheless, I think that
starting with UTF-8 is the way to go.

I would love to get some feedback on this. Matz’s feedback will be
especially great since I want to make this as much forward compatible
as possible with Ruby 2.0.

I think it’s a great idea. If you want any implementation assistance,
I’d be glad to help (I’ve done quite a bit of Unicode hacking in
Ruby).

Paul.

On Jun 14, 2006, at 12:17 PM, Austin Z. wrote:

They should always either return characters or bytes (preferably
characters) and a separate call should be introduced for the
alternative meaning. One that is explicit in its name to match its
meaning.

+1

Gary W.

Dae San H. wrote:

String literals will act as byte buffers, just as they used to. However,
when creating string object by using constructor, you can optionally
specify the encoding of the input string.

String.new("\352\260\200", "utf-8")

I’d like to have a different interface, using named parameters.

String.new("\352\260\200", encoding: "utf-8")

or

String.new("\352\260\200", :encoding => "utf-8")

That way it’s easier to extend String later on.

Cheers,
Daniel

On 6/14/06, Dae San H. [email protected] wrote:

[index] will be used to extract the letter and not the byte in Ruby
2.0 as mentioned in
http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html

so that “abc”[0] returns “a” instead of fixnum 97

This behaviour - of [] returning different values depending on the
argument has always made me a bit crazy. Does anyone know why it was
done that way?

On 6/14/06, Leslie V. [email protected] wrote:

Same reasoning goes for ‘[]’ and ‘slice’.
The reason I picked [] for the encoding-aware method is that
String#[index] will be used to extract the letter and not the byte in
Ruby 2.0, as mentioned in
http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html

so that “abc”[0] returns “a” instead of fixnum 97

This behaviour - of [] returning different values depending on the
argument has always made me a bit crazy. Does anyone know why it was
done that way?

…returning different type values I mean…


Dae San H. wrote:

The reason I’m differentiating between ‘size’ and ‘length’ is that
some libraries (like Rails) depend on them returning the byte size of
the string. Maybe we can establish a convention that ‘size’ is for the
byte size and ‘length’ is for the number of characters. The same
reasoning goes for ‘[]’ and ‘slice’.

Good idea. This separation of ‘length’ and ‘size’ methods is quite
reasonable, in my opinion.

Leslie V. wrote:

On 6/14/06, Leslie V. [email protected] wrote:

On 6/14/06, Dae San H. [email protected] wrote:

so that “abc”[0] returns “a” instead of fixnum 97

This behaviour - of [] returning different values depending on the
argument has always made me a bit crazy. Does anyone know why it was
done that way?

…returning different type values I mean…

I’ve heard it’s due to be fixed by end of next year.

Now, to Ruby’s strings, a character is a byte, represented by a Fixnum.

The new Ruby character will be a string:

?c #=> "c"
"c"[0] #=> "c"
"c"[0].ord #=> 99

Cheers,
Dave

Matz has said several times that he will maximize the breakage moving
to Ruby 2.0.

If that is the case, is there a reason why we should continue using
String for bytes then? We could have a ByteBuffer class or something
similar. IO objects would return ByteBuffer; ByteBuffer.to_s would
return the string equivalent in the current encoding, and
ByteBuffer.to_s(‘my encoding’) would return it in the required
encoding.
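A minimal sketch of what such a class might look like, assuming a current Ruby (1.9 or later). Everything here is hypothetical: the class body, the choice of `Encoding.default_external` as the "current encoding", and the reading of `to_s(encoding)` as "interpret these bytes as that encoding" rather than "transcode".

```ruby
# Hypothetical ByteBuffer: an opaque run of bytes with no character semantics.
class ByteBuffer
  def initialize(bytes)
    @bytes = bytes.b.freeze   # binary copy of the input
  end

  # Number of bytes held.
  def size
    @bytes.bytesize
  end

  # The byte at the given index, as an integer.
  def [](index)
    @bytes.getbyte(index)
  end

  # A character string over the same bytes, tagged with the given encoding.
  # Assumption: the "current encoding" is Encoding.default_external.
  def to_s(encoding = Encoding.default_external)
    @bytes.dup.force_encoding(encoding)
  end
end

buf = ByteBuffer.new("\352\260\200")
buf.size                  # => 3
buf[0]                    # => 234
buf.to_s("UTF-8").length  # => 1
```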

Anselm


Hi!

If that is the case, is there a reason why we should continue using
String for bytes then? We could have a ByteBuffer class or something
similar. IO objects would return ByteBuffer; ByteBuffer.to_s would
return the string equivalent in the current encoding, and
ByteBuffer.to_s(‘my encoding’) would return it in the required
encoding.

+1, that would be very nice. On some other platforms (like Java or
.NET), the programmer doesn’t know about the bytes (only the length in
terms of chars) unless he’s willing to dig into them (using a specific
class).

On 6/15/06, Anselm H. [email protected] wrote:

the required encoding.

Will you volunteer to go through all the source code of the Ruby core
library (about 350 KSLOC of C and Ruby), and inspect, fix and test all
the consequent issues? How long could it take? :)