String comparison. Why does Ruby consider this true?

swengineer · June 18, 2010, 7:46pm

When I compare the following strings in Ruby, I get “true”:

puts 'Xeo' < 'ball'

When I make 'Xeo' start with a lowercase letter, I get “false”:

puts 'xeo' < 'ball'

The second result is clear, but why do I get true when I capitalize 'Xeo'?

Thanks.

swengineer · June 18, 2010, 7:50pm

The “Learn to Program” book by Chris P. mentions that computers order
capital letters before lowercase letters. Does that explain it?

Thanks.

swengineer · June 18, 2010, 7:55pm

Because '<' compares the strings character by character. As it turns
out, 'X' < 'b' is true, while 'x' < 'b' is false. This is because in the
basic (ASCII) character set, the uppercase letters have lower values
than the lowercase letters. See http://www.asciitable.com/

-Jonathan N.

swengineer · June 18, 2010, 8:45pm

Well, this used to be easy to show, but since ASCII has apparently been
abandoned and I don’t know Unicode, I have to resort to hacky things
like this to explain it.

# Map each one-character ASCII string to its numeric value.
$chars = (1...128).inject(Hash.new) { |chars, num| chars[num.chr] = num; chars }

def to_number_array(str)
  str.split(//).map { |char| $chars[char] }
end

to_number_array 'Xeo'  # => [88, 101, 111]
to_number_array 'xeo'  # => [120, 101, 111]
to_number_array 'ball' # => [98, 97, 108, 108]
to_number_array 'ABC'  # => [65, 66, 67]
to_number_array 'abc'  # => [97, 98, 99]

In this case, $chars is a hash that takes a one-character string and
returns its ASCII value. So the method receives a String and returns an
array in which each element is the ASCII value of the corresponding
character.

Then, to understand why one string is less than or greater than the
other, go through index by index, comparing the numbers at that index.
At the first index where the two strings (or in this case, the array
representations I made) have different numbers, whichever has the
smaller number is considered less than the other. If you run out of
indexes on one of them first, then that one comes before the other. If
you run out of indexes on both simultaneously, then they are equal.
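
The index-by-index walk described above can be sketched roughly as follows. This is only an illustration: compare_number_arrays is a made-up helper name, not anything Ruby provides, and unpack('C*') is used here simply to get the numbers on both 1.8 and 1.9.

```ruby
# Hypothetical helper (not part of Ruby): compare two arrays of character
# values index by index and return -1, 0, or 1, the way <=> does.
def compare_number_arrays(left, right)
  left.each_with_index do |value, index|
    return 1 if index >= right.length   # right ran out first, so right comes first
    return -1 if value < right[index]   # first difference decides the result
    return 1 if value > right[index]
  end
  # Every index of left matched; left comes first if right still has values.
  left.length < right.length ? -1 : 0
end

xeo  = 'Xeo'.unpack('C*')    # [88, 101, 111]
ball = 'ball'.unpack('C*')   # [98, 97, 108, 108]

p compare_number_arrays(xeo, ball)                    # 88 < 98, so 'Xeo' comes first
p compare_number_arrays('xeo'.unpack('C*'), ball)     # 120 > 98, so 'xeo' comes last
p compare_number_arrays(ball, 'ballet'.unpack('C*'))  # 'ball' runs out first
```

For these inputs the helper should agree with what 'Xeo' <=> 'ball' and friends return.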

swengineer · June 18, 2010, 11:09pm

%w[Xeo xeo ball ABC abc].sort.each{|word| p word => word.codepoints.to_a }
{"ABC"=>[65, 66, 67]}
{"Xeo"=>[88, 101, 111]}
{"abc"=>[97, 98, 99]}
{"ball"=>[98, 97, 108, 108]}
{"xeo"=>[120, 101, 111]}
=> ["ABC", "Xeo", "abc", "ball", "xeo"]

swengineer · June 18, 2010, 11:17pm

That’s an artifact of the old ASCII encoding. Uppercase letters came
first, so they have lower integer values than lowercase letters.

swengineer · June 18, 2010, 11:21pm

Thanks, but it doesn’t seem to work on 1.8:

RUBY_VERSION # => "1.8.7"

%w[Xeo xeo ball ABC abc].sort.each{|word| p word => word.codepoints.to_a } # =>

~> -:3: undefined method `codepoints' for "ABC":String (NoMethodError)
~> from -:3:in `each'
~> from -:3

And the 1.8 ways to get it don’t work on 1.9 (e.g. "a"[0]).
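
To make the "a"[0] difference concrete, here is a small sketch of what 1.9 and later do. On 1.8 the same index expression returned the number 97, which is why code written for one version surprises you on the other.

```ruby
# Ruby 1.9 and later: indexing a string returns a one-character String.
p "a"[0]    # "a"

# String#ord (1.9 and later) returns the integer value of the first character.
p "a".ord   # 97
```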

swengineer · June 18, 2010, 8:02pm

Uppercase letters come before lowercase letters.

You can look at the implementation in the source (start at
rb_str_cmp()), but if you dig deeply enough, it comes down to the way
the standard C library function memcmp() works. It compares bytes. And
an ASCII ‘X’ is represented by a smaller value (88) than an ASCII ‘b’
(98). So ‘Xeo’ is less than ‘ball’.
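
A quick way to see the byte values involved from Ruby itself, without reading the C source, is something like the sketch below. It only looks at the bytes; it does not call memcmp() directly. Array#<=> happens to compare element by element too, so for these two strings it gives the same answer as comparing the strings.

```ruby
left  = 'Xeo'
right = 'ball'

left_bytes  = left.bytes.to_a    # [88, 101, 111]
right_bytes = right.bytes.to_a   # [98, 97, 108, 108]

p left_bytes <=> right_bytes     # -1, decided by 88 < 98 at the first byte
p left <=> right                 # -1
p left < right                   # true
```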

Kirk H.
Developer
Engine Y.

swengineer · June 18, 2010, 11:26pm

I thought Unicode started with ASCII anyway, so I don’t think that
solves it…

Yes, here:

swengineer · June 19, 2010, 9:05am

%w[Xeo xeo ball ABC abc].sort.each{|word| p word => word.unpack('C*') }
{"ABC"=>[65, 66, 67]}
{"Xeo"=>[88, 101, 111]}
{"abc"=>[97, 98, 99]}
{"ball"=>[98, 97, 108, 108]}
{"xeo"=>[120, 101, 111]}
=> ["ABC", "Xeo", "abc", "ball", "xeo"]

There is always a way to make things work on both, it’s just that I
don’t care much about 1.8 anymore.

swengineer · June 19, 2010, 9:59am

Well, a lot of systems still ship with it. Snow Leopard, for example,
ships with 1.8.7, so while this is a legitimate personal decision, I
think it is good to be aware of one’s audience. For example, since
Abder-rahman is having difficulty understanding String comparison, it is
probably fair to assume he isn’t initiated enough to understand why the
example that is supposed to help him ends up breaking (if he is on 1.8).
That could be very discouraging for someone new who comes to the mailing
list for a better understanding, only to find that the answers given by
the people who know what they are doing won’t even run.

Anyway, I really do like your solution. It is elegant and uniform; thank
you for providing it.

swengineer · June 21, 2010, 12:10pm

Josh C. wrote:

Well, this used to be easy to show, but since ASCII has apparently been
abandoned and I don’t know Unicode, I have to resort to hacky things
like this to explain it.

$chars = (1...128).inject(Hash.new) { |chars, num| chars[num.chr] = num; chars }

def to_number_array(str)
  str.split(//).map { |char| $chars[char] }
end

to_number_array 'Xeo'  # => [88, 101, 111]
to_number_array 'xeo'  # => [120, 101, 111]
to_number_array 'ball' # => [98, 97, 108, 108]
to_number_array 'ABC'  # => [65, 66, 67]
to_number_array 'abc'  # => [97, 98, 99]

Except that this is irrelevant, because even ruby 1.9 does not compare
strings by codepoints. It compares them byte-by-byte using memcmp. See
rb_str_cmp_m() and rb_str_cmp() in string.c

It’s a designed-in side effect of the UTF-8 encoding that higher
codepoints sort after lower ones. The table under “Description” in the
Wikipedia article on UTF-8 illustrates this.

However this does not work for other encodings. Try this for size:

s1 = 97.chr("UTF-8")
=> "a"
s2 = 257.chr("UTF-8")
=> "ā"
s1 < s2
=> true

s1 = 97.chr("UTF-16LE")
=> "a\x00"
s2 = 257.chr("UTF-16LE")
=> "\x01\x01"
s1 < s2
=> false

Yes: that’s the same two unicode codepoints, but sorting in different
order. For encodings like UTF-16LE, where the least-significant byte
comes before the most-significant byte, you get an almost arbitrary
ordering.
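
To see where the different answers come from, it may help to print the bytes next to each comparison. The sketch below is an illustration rather than part of the test above: it uses String#encode on a recent Ruby instead of Integer#chr, and UTF-16BE is added only as a contrast case.

```ruby
low  = "a"        # U+0061
high = "\u0101"   # U+0101, the same codepoint as 257.chr("UTF-8")

results = {}
%w[UTF-8 UTF-16LE UTF-16BE].each do |name|
  s1 = low.encode(name)
  s2 = high.encode(name)
  results[name] = s1 < s2
  puts "#{name}: #{s1.bytes.to_a.inspect} < #{s2.bytes.to_a.inspect} is #{results[name]}"
end
# UTF-8:    [97] < [196, 129] is true
# UTF-16LE: [97, 0] < [1, 1] is false
# UTF-16BE: [0, 97] < [1, 1] is true
```

In UTF-16LE the low-order byte comes first, so 97 is compared against 1 and the codepoint order is lost. UTF-16BE happens to agree with codepoint order for this pair, but that is not a claim about every pair of characters.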

Proviso: I tested this with
ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]

ruby 1.9.x string encoding rules are (a) undocumented, and (b) subject
to arbitrary changes between patchlevels, hence YMMV.

swengineer · June 21, 2010, 12:27pm

Michael F. wrote:

%w[Xeo xeo ball ABC abc].sort.each{|word| p word => word.unpack('C*') }
{"ABC"=>[65, 66, 67]}
{"Xeo"=>[88, 101, 111]}
{"abc"=>[97, 98, 99]}
{"ball"=>[98, 97, 108, 108]}
{"xeo"=>[120, 101, 111]}
=> ["ABC", "Xeo", "abc", "ball", "xeo"]

There is always a way to make things work on both, it’s just that I
don’t care much about 1.8 anymore.

That does work the same on both, but it doesn’t give codepoints.

$ irb --simple-prompt

"groß".unpack("C*")
=> [103, 114, 111, 195, 159]

RUBY_VERSION
=> "1.8.6"

$ irb19 --simple-prompt

"groß".unpack('C*')
=> [103, 114, 111, 195, 159]

"groß".codepoints.to_a
=> [103, 114, 111, 223]

RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"