String + Range = Strange

#1

How does this work?

irb(main):002:0> (“a”…“z”).to_a
=> [“a”, “b”, “c”, “d”, “e”, “f”, “g”, “h”, “i”, “j”, “k”, “l”, “m”,
“n”, “o”, “p”, “q”, “r”, “s”, “t”, “u”, “v”, “w”, “x”, “y”, “z”]
irb(main):003:0> (“aa”…“z”).to_a
=> []
irb(main):004:0> (“aa”…“1z”).to_a
=> []
irb(main):005:0> (“aa”…“b1”).to_a
[“aa”…“zz” elided]

What I mean is, what algorithm is used to determine that the stream of
successors is not going to hit the endpoint of the range? And could this
be used to do a faster-than-O(n) set memnbership test?

martin

#2

Exactly.

And the same holds true of many a sequence, like even numbers. Now how
can membership be optimized in a custom manner?

T.

#3

Martin wrote:

irb(main):005:0> (“aa”…“b1”).to_a
[“aa”…“zz” elided]
Then - Trans wrote:
Exactly.

And the same holds true of many a sequence, like even numbers. Now how can membership be optimized in a custom manner?

Maybe I’m about to say something silly (by not being up-to speed) but
why on earth could anyone consider the output from the above one-liner
to be sane? Not only is upper bound exceeded (under any sensible notion
of comparison) but many intermediate values are missing for example “b!”
and “a~” as well as more esoteric strings with control characters such
as “b\0” and all strings with capital letters in them?
To make matters worse, even if we accept that ranges have a peculiar
semantics over strings, they are inconsistent.

irb(main):001:0> (“aa”…“b1”)===“zz”
=> false
irb(main):002:0> (“aa”…“b1”).to_a.include? “zz”
=> true

The max method also behaves in a very counter intuitive way without
converting to an array:

irb(main):003:0> (“A”…“d”).max
=> “Z”

Establishing set membership is something for which (in other languages)
I’ve long had an active interest - which brings me to comment on the
even numbers question. A really powerful resolution of range would
support “modifiers”:

Projection- Applying a function to each element
Selection- Filtering values which do not exist
Intersection- Retaining only values present in both
Union - Retaining values present in either
Minus - Retaining values present in the first but not the second.

It is interesting to note that it is possible to efficiently compute
membership (without enumerating all the possible values) unless
Projection is applied. If projection is applied then it is necessary
that the projection function can be inverted - which can’t be done in
the general case. So, given an appropriate implementation of efficient

• the following generate the odd numbers below n.

efficient(0…n).filter { |x| x%2==0 }
efficient(0…(n/2)).project { |x| x=x*2 }
efficient(0…n).minus efficient(0…n).filter { |x| x%2==1 }
… and many, many more possibilities…

For each of these (except project, assuming no-one supplies the
hand-verified inverse to the supplied function) there are efficient
[i.e. far better than O(n)] ways to establish membership.

Does anyone not claim that ranges are broken with respect to strings?

#4

Bob S. wrote:

The point is that even though “aa”…“b1” is a valid range, calling
succ repeatedly from the starting point will never yield the endpoint.

This, I think is the crux of the matter. It seems to me that given that
if “aa”…“b1” is a valid range, that calling calling succ on “aa” should
logically eventually reach “b1”. I understand that in Ruby, today, it
doesn’t.

I suppose I can get the behaviour I expect by doing something like this.

class String
def succ
s=String.new self
(s.length-1).downto 0 do |i|
s[i]=(s[i]+1)&0xff
return s unless s[i]==0
end
1.chr.concat s
end
end

Is there any good reason that this isn’t the default behaviour for succ

• given that Ruby claims an objective is to have “sensible defaults”? I
can see a justification for having a method defined how String#succ is
defined today - but I don’t see why it has to be as the default succ
method… especially since it gives such counter-intuitive results when
used with ranges. Alternatively - I suppose - it might make sense if
ranges didn’t use succ?

Ranges in regexps don’t work the same “smart” way - wouldn’t consistent
default behaviour be far preferable?

#5

Steve [RubyTalk] wrote:

Bob S. wrote:

The point is that even though “aa”…“b1” is a valid range, calling
succ repeatedly from the starting point will never yield the endpoint.

This, I think is the crux of the matter. It seems to me that given that
if “aa”…“b1” is a valid range, that calling calling succ on “aa” should
logically eventually reach “b1”. I understand that in Ruby, today, it
doesn’t.

I think that String#succ is of limited usefulness, and will always be
a
source of strange edge cases like this.

Personally, I think that (perhaps) String ranges should be disallowed as
being mostly nonsensical (unless a rigid order, like character set
order,
were imposed).

Hal

#6

Steve [RubyTalk] wrote:

can membership be optimized in a custom manner?

Maybe I’m about to say something silly (by not being up-to speed) but
why on earth could anyone consider the output from the above one-liner
to be sane? Not only is upper bound exceeded (under any sensible notion
of comparison) but many intermediate values are missing for example “b!”
and “a~” as well as more esoteric strings with control characters such
as “b\0” and all strings with capital letters in them?
To make matters worse, even if we accept that ranges have a peculiar
semantics over strings, they are inconsistent.

I don’t know about being sane, but the behavior is explicable by the
fact that:

``````Range#to_a (Enumerable#to_a) calls
Range#each, which uses
String#step (for Strings, anyway), which uses
String#upto, which calls on
String#succ (to generate successive entries)
and String#<=> (to compare against the endpoint)
``````

The key is that String#upto stops iterating as soon as String#succ
generates either

a) a string lexically equal to the endpoint (or the endpoint’s
successor, if the range excludes the endpoint), or
b) a longer string.

(see rb_str_upto() in string.c in the Ruby source code)

It’s this second factor that’s coming into play here.

``````"aa".succ => "ab"
"az".succ => "ba"
...
"zy".succ => "zz"
"zz".succ => "aaa"  # stop iterating here, "aaa" is longer than "zz"
``````

The point is that even though “aa”…“b1” is a valid range, calling succ
repeatedly from the starting point will never yield the endpoint.

irb(main):001:0> (“aa”…“b1”)===“zz”
=> false

Right, because this is a check on the endpoints.

irb(main):002:0> (“aa”…“b1”).to_a.include? “zz”
=> true

Also right, because succ eventually generates ‘zz’

The max method also behaves in a very counter intuitive way without
converting to an array:

irb(main):003:0> (“A”…“d”).max
=> “Z”

Again, ‘Z’.succ is ‘AA’, so the iteration stops. ‘d’ is never reached,
and ‘Z’ is the last element generated.

Note that (‘A’…‘d’).last => ‘d’

Does anyone not claim that ranges are broken with respect to strings?

The point is that ‘A’…‘d’ is somewhat nonsensical given the default
implementation of String#succ. However, you could define String#succ in
such a way that ‘A’…‘d’ was logical.

(very) contrived example:

``````class String
def succ
self.downcase[0].succ.chr
end
end

('A'..'d').to_a => ["A", "b", "c", "d"]
('A'..'d').max => 'd'``````
#7

Hal F. wrote:

Steve [RubyTalk] wrote:

This, I think is the crux of the matter. It seems to me that given
that if “aa”…“b1” is a valid range, that calling calling succ on
“aa” should logically eventually reach “b1”. I understand that in
Ruby, today, it doesn’t.
I think that String#succ is of limited usefulness, and will always be a
source of strange edge cases like this.
I can see circumstances where it might be useful to have a method
defined as succ currently is (for example, maybe, when generating
human-readable indices where character-encoding-order makes no sense.
Ruby’s notion of String, is far more powerful than that suggests - it
allows an encoding of an arbitrary byte sequence - i.e. Ruby Strings
(unlike strings in most languages) are appropriate type for manipulating
binary “messages” - which is an extremely useful feature (at least it
is to me!)
Personally, I think that (perhaps) String ranges should be disallowed as
being mostly nonsensical (unless a rigid order, like character set order,
were imposed).
I hope that they are not. Obviously, the current mish-mash is far from
desirable but I, for one, will want to determine if a given binary byte
sequence is “between” two other byte sequences. The present situation
is that I can do this using ranges if I replace succ with a “more
sensible” (if likely less efficient) version to the default. I wouldn’t
want to loose the support for byte-sequence ranges using strings -
though I would be wholly in favour of any suggestion to rename succ
“alpha_num_succ” (or similar) and to implement succ as iterating of byte
sequence. I can’t see any reason that, by default, any standard type
should not support a consistent total ordering… I hope that, in
future, my wish might become reality.

Steve