String comparison. Why does Ruby consider this true?

When I try for example to compare the following strings in Ruby, I get
“true”.

puts 'Xeo' < 'ball'

When I make ‘Xeo’ start with a lowercase letter, I get ‘false’

puts 'xeo' < 'ball'

The second statement is clear, but why when I capitalize ‘Xeo’ I get
true?

Thanks.

Abder-rahman Ali wrote:

The “Learn to Program” book by Chris P. mentions that computers order
capital letters before lowercase letters. So, can that explain it?

Thanks.

On Fri, Jun 18, 2010 at 11:46 AM, Abder-rahman Ali wrote:

true?

Thanks.

Because ‘<’ compares the strings character by character and stops at
the first position where they differ. As it turns out, 'X' < 'b' is
true, while 'x' < 'b' is false. This is because in the basic character
set (ASCII), the uppercase letters have lower values than the lowercase
letters. See http://www.asciitable.com/

-Jonathan N.
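
A minimal sketch of the point above, assuming Ruby 1.9 or later (for String#ord); the variable names are only for illustration. It looks up the numeric values of the first characters and then runs the two comparisons from the question.

```ruby
# Assumes Ruby 1.9+ (String#ord); variable names are illustrative.
upper_x = 'X'.ord   # value of uppercase X
lower_x = 'x'.ord   # value of lowercase x
lower_b = 'b'.ord   # value of lowercase b

p [upper_x, lower_x, lower_b]

# The first characters already differ, so they decide the result.
capitalized = 'Xeo' < 'ball'
lowercased  = 'xeo' < 'ball'

p capitalized
p lowercased
```

Since 'X' has a smaller value than 'b', the capitalized comparison comes out true; 'x' has a larger value than 'b', so the lowercase one comes out false.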

On Fri, Jun 18, 2010 at 12:46 PM, Abder-rahman Ali wrote:

true?

Thanks.

Well, this used to be easy to show, but apparently since ASCII has been
abandoned, and I don’t know Unicode, I have to resort to hacky things
like this to explain it.

# Map each one-character ASCII string to its numeric value.
$chars = (1...128).inject(Hash.new) { |chars, num| chars[num.chr] = num; chars }

# Turn a string into an array of the numeric values of its characters.
def to_number_array(str)
  str.split(//).map { |char| $chars[char] }
end

to_number_array 'Xeo'  # => [88, 101, 111]
to_number_array 'xeo'  # => [120, 101, 111]
to_number_array 'ball' # => [98, 97, 108, 108]
to_number_array 'ABC'  # => [65, 66, 67]
to_number_array 'abc'  # => [97, 98, 99]

In this case, $chars is a hash that takes a one-character string and
returns its ASCII value. So the method receives a String and returns an
array where each element is the ASCII value of the corresponding
character.

Then to understand why one would be less than or greater than the other,
go through index by index, comparing the numbers at that index. At the
first index where the two strings (or in this case, the array
representations that I made) have different numbers, whichever has the
smaller number is considered less than the other. If you run out of
indexes on one of them first, then that one comes before the other. If
you run out of indexes on both simultaneously, then they are equal.
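
The steps above can be sketched as a small method. This is only an illustration of the description, not how Ruby implements it internally; compare_by_index is a hypothetical helper name, and it uses unpack('C*') to get the numbers so that it does not depend on the $chars hash.

```ruby
# Hypothetical helper (not part of Ruby): walks both strings index by index.
# Returns -1 if a sorts first, 1 if b sorts first, 0 if they are equal.
def compare_by_index(a, b)
  a_nums = a.unpack('C*')
  b_nums = b.unpack('C*')
  shortest = [a_nums.length, b_nums.length].min

  shortest.times do |i|
    next if a_nums[i] == b_nums[i]          # same number, keep going
    return a_nums[i] < b_nums[i] ? -1 : 1   # first difference decides
  end

  # No difference found: whichever ran out of indexes first comes first.
  a_nums.length <=> b_nums.length
end

p compare_by_index('Xeo', 'ball')
p compare_by_index('xeo', 'ball')
p compare_by_index('abc', 'abc')
p compare_by_index('ab', 'abc')
```

For these plain ASCII inputs the results should agree with what 'Xeo' <=> 'ball' and friends return.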

On Sat, Jun 19, 2010 at 3:43 AM, Josh C. wrote:

puts 'xeo' < 'ball'
abandoned, and I don’t know unicode, I have to resort to hacky things like
to_number_array 'Xeo'  # => [88, 101, 111]
to_number_array 'xeo'  # => [120, 101, 111]
to_number_array 'ball' # => [98, 97, 108, 108]
to_number_array 'ABC'  # => [65, 66, 67]
to_number_array 'abc'  # => [97, 98, 99]

%w[Xeo xeo ball ABC abc].sort.each{|word| p word => word.codepoints.to_a }
{"ABC"=>[65, 66, 67]}
{"Xeo"=>[88, 101, 111]}
{"abc"=>[97, 98, 99]}
{"ball"=>[98, 97, 108, 108]}
{"xeo"=>[120, 101, 111]}
=> ["ABC", "Xeo", "abc", "ball", "xeo"]

On 10-06-18 02:09 PM, Michael F. wrote:

puts 'xeo' < 'ball'

The second statement is clear, but why when I capitalize ‘Xeo’ I get
true?

That’s an artifact of the old ASCII encoding. Uppercase letters come
first, so they have a lower integer value than lowercase letters.

On Fri, Jun 18, 2010 at 4:09 PM, Michael F. wrote:

Well, this used to be easy to show, but apparently since ascii has been
end
{"Xeo"=>[88, 101, 111]}

Thanks, but it doesn’t seem to work on 1.8

RUBY_VERSION # => "1.8.7"

%w[Xeo xeo ball ABC abc].sort.each{|word| p word => word.codepoints.to_a } # =>

~> -:3: undefined method `codepoints' for "ABC":String (NoMethodError)
~> from -:3:in `each'
~> from -:3

And the 1.8 ways to get it don’t work on 1.9 (i.e. "a"[0])
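
For readers who have not seen the difference: on 1.8, "a"[0] returned the integer 97, whereas on 1.9 and later it returns a one-character string. A small sketch, assuming Ruby 1.9 or later (the variable names are only for illustration), of what indexing gives there next to one way of getting the number:

```ruby
# Assumes Ruby 1.9 or later; variable names are illustrative.
first_char = "a"[0]               # a one-character String here, not an Integer
first_byte = "a".unpack('C*')[0]  # the numeric value of the first byte

p first_char
p first_byte
```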

On Fri, Jun 18, 2010 at 11:46 AM, Abder-rahman Ali wrote:

true?
Uppercase letters come before lowercase letters.

You can look at the implementation in the source (start at
rb_str_cmp()), but if you dig deeply enough, it comes down to the way
the standard C library function memcmp() works. It compares bytes. And
an ASCII ‘X’ is represented by a smaller value (88) than an ASCII ‘b’
(98). So ‘Xeo’ is less than ‘ball’.

Kirk H.
Developer
Engine Y.
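
One way to observe the byte comparison described above from inside Ruby, without reading the C source: Array#<=> compares element by element, so comparing the byte arrays should give the same answer as comparing the strings. This is only a sketch, assuming Ruby 1.9 or later and strings that share one encoding; Array#<=> stands in for memcmp() here rather than calling it, and the sample pairs are arbitrary.

```ruby
# Sketch: Array#<=> on the byte arrays stands in for memcmp().
# Assumes Ruby 1.9+ and that both strings share one encoding.
pairs = [['Xeo', 'ball'], ['xeo', 'ball'], ['abc', 'abc'], ['ab', 'abc']]

results = pairs.map do |left, right|
  string_order = left <=> right                         # what Ruby reports
  byte_order   = left.bytes.to_a <=> right.bytes.to_a   # element-wise on bytes
  [left, right, string_order, byte_order]
end

results.each { |row| p row }
```

In each printed row the last two numbers should match.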

On 10-06-18 02:21 PM, Josh C. wrote:

On Fri, Jun 18, 2010 at 4:09 PM, Michael F. wrote:

I thought Unicode started with ASCII anyway, so I don’t think that
solves it…

Yes, here:

On Sat, Jun 19, 2010 at 6:21 AM, Josh C. wrote:

~> from -:3

And the 1.8 ways to get it don’t work on 1.9 (i.e. "a"[0])

%w[Xeo xeo ball ABC abc].sort.each{|word| p word => word.unpack('C*') }
{"ABC"=>[65, 66, 67]}
{"Xeo"=>[88, 101, 111]}
{"abc"=>[97, 98, 99]}
{"ball"=>[98, 97, 108, 108]}
{"xeo"=>[120, 101, 111]}
=> ["ABC", "Xeo", "abc", "ball", "xeo"]

There is always a way to make things work on both; it’s just that I
don’t care much about 1.8 anymore.

On Sat, Jun 19, 2010 at 2:04 AM, Michael F. wrote:

~> -:3: undefined method `codepoints' for "ABC":String (NoMethodError)

{"Xeo"=>[88, 101, 111]}
CTO, The Rubyists, LLC

Well, a lot of systems still ship with it. Snow Leopard, for example,
ships with 1.8.7, so I think that while this is a legitimate personal
decision, it is good to be aware of one’s audience. For example, since
Abder-rahman is having difficulty understanding String comparison, it is
probably fair to assume he isn’t initiated enough to understand why the
example that is supposed to help him understand ends up breaking (if he
is on 1.8). That could be very discouraging for someone new who comes to
the ML to get a better understanding, only to find that the answers
given by the people who know what they are doing won’t even run.

Anyway, I really do like your solution :) It is elegant and uniform;
thank you for providing it.

Josh C. wrote:

Well, this used to be easy to show, but apparently since ASCII has been
abandoned, and I don’t know Unicode, I have to resort to hacky things
like this to explain it.

$chars = (1...128).inject(Hash.new) { |chars, num| chars[num.chr] = num; chars }

def to_number_array(str)
  str.split(//).map { |char| $chars[char] }
end

to_number_array 'Xeo'  # => [88, 101, 111]
to_number_array 'xeo'  # => [120, 101, 111]
to_number_array 'ball' # => [98, 97, 108, 108]
to_number_array 'ABC'  # => [65, 66, 67]
to_number_array 'abc'  # => [97, 98, 99]

Except that this is irrelevant, because even Ruby 1.9 does not compare
strings by codepoints. It compares them byte by byte using memcmp. See
rb_str_cmp_m() and rb_str_cmp() in string.c.

It’s a designed-in side-effect of UTF-8 encoding that higher codepoints
sort after lower ones. There is a table in the Wikipedia article on
UTF-8, under “Description”, which illustrates this.
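
A rough way to check that property for a handful of characters, assuming Ruby 1.9 or later; the sample characters are arbitrary and this is a spot check, not a proof. It sorts the same UTF-8 strings once by their bytes and once by their codepoints and compares the two orders.

```ruby
# Assumes Ruby 1.9+; the sample characters are arbitrary.
# "\u00e9" is é, "\u0101" is ā, "\u20ac" is the euro sign (all UTF-8 here).
words = ["z", "\u00e9", "\u0101", "\u20ac", "a"]

by_bytes      = words.sort_by { |w| w.bytes.to_a }       # byte-wise order
by_codepoints = words.sort_by { |w| w.codepoints.to_a }  # codepoint order

p by_bytes == by_codepoints
p words.sort == by_codepoints
```

Both lines should print true for these UTF-8 strings.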

However this does not work for other encodings. Try this for size:

s1 = 97.chr("UTF-8")
=> "a"
s2 = 257.chr("UTF-8")
=> "ā"
s1 < s2
=> true

s1 = 97.chr("UTF-16LE")
=> "a\x00"
s2 = 257.chr("UTF-16LE")
=> "\x01\x01"
s1 < s2
=> false

Yes: that’s the same two Unicode codepoints, but sorting in a different
order. For encodings like UTF-16LE, where the least-significant byte
comes before the most-significant byte, you get an almost arbitrary
ordering.
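
Looking at the raw bytes makes the reason visible. The sketch below assumes Ruby 1.9 or later and uses String#encode instead of Integer#chr to build the strings; the UTF-16BE half is an extra comparison added here for contrast and covers only this one pair of characters.

```ruby
# Assumes Ruby 1.9+; "\u0101" is the character ā from the example above.
a_le     = "a".encode("UTF-16LE")
a_bar_le = "\u0101".encode("UTF-16LE")
a_be     = "a".encode("UTF-16BE")
a_bar_be = "\u0101".encode("UTF-16BE")

# Little-endian: the low byte comes first, so 97 is compared against 1.
p a_le.bytes.to_a
p a_bar_le.bytes.to_a
p a_le < a_bar_le

# Big-endian: the high byte comes first, so 0 is compared against 1.
p a_be.bytes.to_a
p a_bar_be.bytes.to_a
p a_be < a_bar_be
```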

Proviso: I tested this with
ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]

ruby 1.9.x string encoding rules are (a) undocumented, and (b) subject
to arbitrary changes between patchlevels, hence YMMV.

Michael F. wrote:

%w[Xeo xeo ball ABC abc].sort.each{|word| p word => word.unpack('C*') }
{"ABC"=>[65, 66, 67]}
{"Xeo"=>[88, 101, 111]}
{"abc"=>[97, 98, 99]}
{"ball"=>[98, 97, 108, 108]}
{"xeo"=>[120, 101, 111]}
=> ["ABC", "Xeo", "abc", "ball", "xeo"]

There is always a way to make things work on both, it’s just that I
don’t care much about 1.8 anymore.

That does work the same on both, but it doesn’t give codepoints.

$ irb --simple-prompt

"groß".unpack("C*")
=> [103, 114, 111, 195, 159]

RUBY_VERSION
=> "1.8.6"

$ irb19 --simple-prompt

"groß".unpack('C*')
=> [103, 114, 111, 195, 159]

"groß".codepoints.to_a
=> [103, 114, 111, 223]

RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"