A plan for another unicode string hack

Dae_San_H · June 15, 2006, 1:29pm

On 15-jun-2006, at 12:10, Dmitry S. wrote:

similar. IO
library
(about 350 KSLOC of C and Ruby), inspect, fix and test all the
consequent
issues? How long time could it take?

I know it might sound terrible, but if Ruby as a language is to
progress and prosper, this (patching string handling) will have to be
done, and the sooner the better. The same will have to be done with
Python 3000 very soon.

It’s good to break bad string handling that was wrong to start with.

Dae_San_H · June 15, 2006, 2:04pm

On 15-jun-2006, at 6:13, Suraj N. Kurapati wrote:

and ‘slice’.

Good idea. This separation of ‘length’ and ‘size’ methods is quite
reasonable, in my opinion.

To the original poster: frankly, I don’t see the point of doing this
all over again. If you want Unicode handling that way, just grab it
from my plugin. The trouble is that when you have to work with
external libraries, they will not cooperate. I was using this String
class in the wild for a few months, so trust me. It’s not simply
because I “felt” like removing this functionality; it simply Broke
A Lot Of Stuff In A Variety Of Subtle Ways.

Separation of “size” and “length” is senseless because they are
aliases in Ruby. It would be sensible to have “byte_” prefixed methods
for byte access, just as I had in my hacks plugin a while ago. It
worked too.
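
A minimal sketch of what “byte_” prefixed accessors could look like. The names `byte_length` and `byte_slice` are hypothetical (they are not taken from the plugin mentioned above), and the sketch leans on `String#bytesize` and `String#byteslice`, which exist only in later Ruby releases (1.9 and up), not in the Ruby 1.8 discussed in this thread:

```ruby
# Hypothetical byte_-prefixed accessors; the names are illustrative only.
class String
  # Number of bytes, regardless of how many characters they encode.
  def byte_length
    bytesize
  end

  # Substring addressed by byte offset and byte count.
  def byte_slice(start, count)
    byteslice(start, count)
  end
end

s = "h\u00e9llo"          # "héllo": 5 characters, 6 bytes in UTF-8
puts s.length             # character count
puts s.byte_length        # byte count
puts s.byte_slice(0, 1)   # first byte, which is the ASCII letter "h"
```

With this shape, `length` and `size` can stay aliases, and any code that needs bytes has to say so in the method name.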

What is this? Curiosity, or do you just want to delve into the dirty
swamp of character handling for pure entertainment?

Dae_San_H · June 15, 2006, 2:15pm

On 6/14/06, Austin Z. wrote:

to Ruby 2.0. If Matz is going to make these changes for Ruby 2.0, (as
implied in Guy Decoux’s posting) I think I will just follow along. My
goal is to provide Ruby 2.0 forward compatible unicode support until
the move is complete.

Yes, I understand. Making #size and #length return different values
is a mistake. Without referring to documentation, how would you know
which returns the number of characters and which one returns the
number of bytes?

Well, to me it is quite intuitive that length gives the number of
characters, and size returns the amount of space needed to store the
object.
The problem is that for other objects these would still be equivalent.
But the subject contains the word ‘hack’, mind you.

They should always either return characters or bytes (preferably
characters) and a separate call should be introduced for the
alternative meaning. One that is explicit in its name to match its
meaning.

I think that more descriptive aliases would be welcome as well.

Thanks

Michal

Dae_San_H · June 15, 2006, 3:39pm

On Jun 15, 2006, at 1:17 AM, Austin Z. wrote:

On 6/14/06, Dae San H. wrote:

My proposed change won’t disturb anyone’s existing code unless that
code also sets $KCODE to ‘u’. If you did set $KCODE to ‘u’ in your
previous projects, you don’t have to apply this hack (which hasn’t
been implemented yet) to those projects.

Um. PDF::Writer is a library, and I think that I use both depending on
how the code reads.

I see your point now. Then I guess I won’t use $KCODE to set the
default encoding. That way, only strings with an explicit encoding
will exhibit the new behavior.
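
For comparison, here is a small sketch of per-string encoding tags as they ended up working in later Ruby releases (1.9 and up). It is not the hack proposed in this thread, only an illustration of the idea that the same bytes behave differently depending on whether the string carries an explicit encoding:

```ruby
raw  = "\xC3\xA9".b                      # two bytes, tagged as binary
text = raw.dup.force_encoding("UTF-8")   # same two bytes, tagged as UTF-8

puts raw.length      # binary data is counted byte by byte
puts text.length     # the UTF-8 tag makes the two bytes one character
puts text.bytesize   # the underlying bytes are unchanged
```

Code that never tags its strings keeps the old byte-oriented view, which is the backward-compatibility property being discussed.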

Dae San H.

Dae_San_H · June 15, 2006, 3:20pm

On 6/15/06, Julian ‘Julik’ Tarkhanov wrote:

size
work with external libraries they
will not cooperate. I was using this String class in the wild for a
few months, so trust me. It’s not simply
because I “felt” like removing this functionality - it simply Broke
A Lot Of Stuff In A Variety Of Subtle Ways.

It needs to be fixed for Ruby 2.0 anyway. IO and some networking stuff
would need to be fixed to use byte_size, I guess.
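
A rough sketch of why IO and networking code cares about the byte count. The HTTP-style header is only an illustrative choice, and `bytesize` is the name this accessor has in later Ruby releases (the `byte_size` spelling above is the poster’s guess):

```ruby
body = "caf\u00e9"   # "café": 4 characters, 5 bytes in UTF-8

# A wire-level length has to count bytes, not characters.
header  = "Content-Length: #{body.bytesize}\r\n\r\n"
message = header + body

puts body.length       # characters
puts body.bytesize     # bytes actually written to the socket
puts message.bytesize  # header bytes plus body bytes
```

If the header were built from a character-counting `size`, the receiver would stop reading one byte short of the end of the body.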

For me IO is sufficient for now.

Thanks

Michal

Dae_San_H · June 15, 2006, 4:29pm

On Jun 15, 2006, at 9:03 PM, Julian ‘Julik’ Tarkhanov wrote:

To the original poster - frankly I don’t see the point of doing
this all over again. If you want to have unicode handling
that way just grab it from my plugin. It’s just that when you have
to work with external libraries they
will not cooperate. I was using this String class in the wild for a
few months, so trust me. It’s not simply
because I “felt” like removing this functionality - it simply Broke
A Lot Of Stuff In A Variety Of Subtle Ways.

Hi Julian. I have tried your plugin in the past and I appreciate your
efforts toward better Unicode support in Ruby. The reason I’m proposing
a different hack is that people have been advising against using your
Unicode hack because of its incompatibilities with other libraries. So
I figured that we need a way of differentiating between plain old
strings and new hacked strings with an explicit encoding. (I got the
hint and inspiration from
http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html) That way
it can be backward compatible with existing libraries and yet forward
compatible with Ruby 2.0.

Separation of “size” and “length” is senseless because they are
aliases in Ruby. It would be sensible
to have “byte_” prefixed methods for byte access, just as I had in
my hacks plugin a while ago. It worked too.

My proposal for differentiating between ‘size’ and ‘length’ arose from
a personal itch. I have always appreciated Ruby’s intuitiveness, and I
think ‘size’ is an intuitively better name for the byte size of a
string, while ‘length’ is better suited to give the length of a string
in characters. I might be being compulsive here, but I think this kind
of attention to detail is what has earned Ruby its reputation as a
programmer-friendly language. Guy Decoux has pointed out that Matz once
considered this change himself, and obviously many people on the forum
welcome it. (An equal number of people voted against it as well; it is
7:7 at the moment.)

Some people have pointed out that ‘size’ doesn’t give the byte size in
other classes like Array or Hash, but what matters here is the
context. ‘size’ meaning byte size in the context of a string object is
pretty damn intuitive, in my opinion. Ruby has used ‘size’ and
‘length’ to mean the same thing in the past, but I believe Matz made
that decision consciously, thinking that people would prefer ‘size’
when using a string as a byte buffer and ‘length’ when using it as a
character string. (I can’t know what Matz was thinking when he designed
the String API; that’s just my guess.) Regardless of my feelings on
this issue, I will follow whatever Matz decides for the Ruby 2.0 String
API, since one of my goals here is to provide as much forward
compatibility as possible.

Thanks to everyone who replied. I appreciate all your comments and
will post back when I get something working.

Best regards,

Daesan

Dae San H.

Dae_San_H · June 16, 2006, 1:33am

On 15-jun-2006, at 16:26, Dae San H. wrote:

Hi Julian. I have tried your plugin in the past and I appreciate
your efforts toward better Unicode support in Ruby. The reason I’m
proposing a different hack is that people have been advising
against using your Unicode hack because of its incompatibilities
with other libraries. So I figured that we need a way of
differentiating between plain old strings and new hacked strings
with an explicit encoding. (I got the hint and inspiration from
http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html) That
way it can be backward compatible with existing libraries and yet
forward compatible with Ruby 2.0.

It will be interesting to see what you come up with, especially when
you pass a “flagged” string to routines such as CGI.escape, which
cannot tolerate a codepoint-based String#size.
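
To make the concern concrete, here is a small sketch using `CGI.escape` from a later Ruby release (1.9 and up, with UTF-8 strings). Percent-encoding works on bytes, so a routine that sizes or walks its input by characters would get the arithmetic wrong:

```ruby
require "cgi"

s = "\u00e9"              # "é": 1 character, 2 bytes in UTF-8
escaped = CGI.escape(s)

puts escaped              # each input byte becomes a %XX triplet
puts s.length             # characters in the input
puts s.bytesize           # bytes in the input
puts escaped.length       # three output characters per input byte
```

The output length follows the byte count of the input, not its character count, which is why a character-based `size` trips up byte-oriented routines.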

give the length of a string. I might be being compulsive here but I
think this kind of attention to detail has earned Ruby the title of
the programmer-friendly language. Guy Decoux has pointed
out that Matz has considered this change himself once and obviously
many people on the forum welcome this change. (An equal number of
people voted against it as well, 7:7 at the moment.)

I’m really eager to see if it works out for you. Please keep us posted.

Dae_San_H · August 3, 2006, 2:23pm

On 6/15/06, Dave B. wrote:

I’ve heard it’s due to be fixed by end of next year.

Now, to Ruby’s strings, a character is a byte, represented by a Fixnum.

The new Ruby character will be a string:

?c #=> "c"
"c"[0] #=> "c"
"c"[0].ord #=> 99
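
A short sketch of the same behaviour extended to a multi-byte character, assuming Ruby 1.9 or later (where these semantics did land). `getbyte` and `bytes` are the explicit byte-level accessors in those releases:

```ruby
ch = ?c                 # character literal: a one-character String
c  = "c"
puts ch                 # the String "c"
puts c[0]               # indexing yields a String, not an Integer
puts c[0].ord           # code point of the character
puts c.getbyte(0)       # explicit byte access still yields an Integer

e = "\u00e9"            # "é": one character, two bytes in UTF-8
puts e[0].ord           # code point, not a byte value
p e.bytes               # the individual bytes
```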

Yahoo!