#sort_by and #sort_obj

miracleofmagick · October 7, 2007, 11:22am

I haven’t seen this technique in the wild before.

If you have custom sorting using the spaceship:

class Something
def <=>(other)
[@name, @version] <=> [other.name, other.version]
end
end

You can’t easily use the more-efficient #sort_by because you’ll need
to know how to sort Somethings, which involves duplication:

somethings.sort_by { |s| [s.name, s.version] }

Instead you can add a method that returns the thing to sort by:

class Something
def sort_obj
[@name, @version]
end

def <=>(other)
sort_obj <=> other.sort_obj
end
end

So now using #sort_by is easy:

somethings.sort_by { |s| s.sort_obj }

I don’t know if the name #sort_obj is appropriate, but I’m not coming
up with something better.

See also:

http://blog.segment7.net/articles/2007/10/07/sort_by-and-sort_obj

miracleofmagick · October 7, 2007, 12:35pm

On Oct 7, 2:21 am, Eric H. [email protected] wrote:

You can’t easily use the more-efficient #sort_by because you’ll need

up with something better.

See also:

http://blog.segment7.net/articles/2007/10/07/sort_by-and-sort_obj

Nice. In fact, it seems this could dethrone <=> as the primary method
Sortable depends on.

Perhaps a better name is #indice or #sort_index, or something like
that.

T.

miracleofmagick · October 7, 2007, 1:32pm

Hi –

On Sun, 7 Oct 2007, Eric H. wrote:

You can’t easily use the more-efficient #sort_by because you’ll need to know

something better.
Maybe #sort_key.

David

miracleofmagick · October 7, 2007, 4:43pm

Hi –

On Sun, 7 Oct 2007, Trans wrote:

end
def sort_obj
somethings.sort_by { |s| s.sort_obj }

I don’t know if the name #sort_obj is appropriate, but I’m not coming
up with something better.

See also:

http://blog.segment7.net/articles/2007/10/07/sort_by-and-sort_obj

Nice. In fact, it seems this could dethrone <=> as the primary method
Sortable depends on.

I think Eric intends it to return a kind of sort key, not to actually
specify the sorting algorithm.

Perhaps a better name is #indice or #sort_index, or something like
that.

I’m afraid “indice” isn’t a word The singular of indices is index.

David

miracleofmagick · October 7, 2007, 7:13pm

On Oct 7, 7:42 am, “David A. Black” [email protected] wrote:

I think Eric intends it to return a kind of sort key, not to actually
specify the sorting algorithm.

I realize. But clearly a standard definition of <=> would be based on
it. So the “coupling” method of Sortable could be #sort_key instead of
#<=>, though of course one can still override #<=> if need be.

Perhaps a better name is #indice or #sort_index, or something like
that.

I’m afraid “indice” isn’t a word The singular of indices is index.

Ah, well. I’m an American; dropping ‘s’ is good enough for me
#sort_key is a perfectly good name.

T.

miracleofmagick · October 7, 2007, 8:02pm

On Oct 7, 2007, at 3:21 AM, Eric H. wrote:

I don’t know if the name #sort_obj is appropriate, but I’m not
coming up with something better.

i prefer #order, #order_by since the concept is extends to much more
that sorting, for instance iterating or returning elements from a
collection such as a priority queue. in any case the idea is a good
one.

cheers.

a @ http://codeforpeople.com/

miracleofmagick · October 7, 2007, 6:40pm

David A. Black wrote:

Hi –

On Sun, 7 Oct 2007, Eric H. wrote:

You can’t easily use the more-efficient #sort_by because you’ll need to know

something better.
Maybe #sort_key.

David

I’d support that. I wrote a nat_cmp for String and added
String#nat_cmp_key to use sort_by.
It would be nice if the object returned allowed #-@ to be called, to
enable reverse sorting by easily doing: enum.sort_by { |o| -o.sort_key }

Regards
Stefan

miracleofmagick · October 7, 2007, 8:23pm

On 10/7/07, ara.t.howard [email protected] wrote:

On Oct 7, 2007, at 3:21 AM, Eric H. wrote:

I don’t know if the name #sort_obj is appropriate, but I’m not
coming up with something better.

i prefer #order, #order_by since the concept is extends to much more
that sorting, for instance iterating or returning elements from a
collection such as a priority queue. in any case the idea is a good
one.

its a great idea and i have been using it for years in sql by the
name “sortkey” although after reading ara’s point i think i prefer the
term “orderkey” now.
another name in collation docs is “weight string”. i think i like
weight_string the best because it is less verby and confrontational
than order_by and order. both of these are sort of “in your face”
sounding. cheers,
-r.

miracleofmagick · October 7, 2007, 11:00pm

Eric H. wrote:

somethings.sort_by { |s| s.sort_obj }

Since this is more efficient than #sort, maybe the implementation of
#sort should first try to do this, and only if that fails (#sort_obj not
defined, or two sort_obj not comparable), revert to the normal <=>
algorithm.

This would make a difference when some library does a #sort on some
collection, and you can’t tell it to #sort_by instead (and you can’t
redefine #sort in the collection, either).

In the cases where you have a collection of objects that don’t respond
to #sort_obj, #sort will quickly hit a failure and use the fall-back.
(The worst case would be if, say, all but the last object in the
collection had #sort_obj.)

It should also fall back if <=> fails on the sort objs. I’m not sure how
to detect that reliably though.

Anyway, here’s a half-baked attempt:

class NoSortObjMethod < StandardError; end

class Object
def sort_obj
raise NoSortObjMethod
end
end

class Array
alias old_sort sort
def sort
if block_given?
old_sort {|*args| yield(*args)}
else
begin
rslt = sort_by {|s| s.sort_obj}
puts “DEBUG: used sort_by with sort_obj”
rslt
rescue NoSortObjMethod
puts “DEBUG: NoSortObjMethod, using old_sort”
old_sort
rescue TypeError => ex # ?
puts “DEBUG: Comparison failed (#{ex}), using old_sort”
old_sort
end
end
end
end

a = [1, 3, 2]
p a
p a.sort

class C
def initialize n
@n = n
end

def sort_obj
@n
end
end

b = [C.new(1), C.new(3), C.new(2)]
p b
p b.sort

END

Output:

[1, 3, 2]
DEBUG: NoSortObjMethod, using old_sort
[1, 2, 3]
[#<C:0xb7c3d248 @n=1>, #<C:0xb7c3d234 @n=3>, #<C:0xb7c3d220 @n=2>]
DEBUG: used sort_by with sort_obj
[#<C:0xb7c3d248 @n=1>, #<C:0xb7c3d220 @n=2>, #<C:0xb7c3d234 @n=3>]

miracleofmagick · October 7, 2007, 11:21pm

On 10/7/07, Joel VanderWerf [email protected] wrote:

Eric H. wrote:

somethings.sort_by { |s| s.sort_obj }

Since this is more efficient than #sort, maybe the implementation of
#sort should first try to do this, and only if that fails (#sort_obj not
defined, or two sort_obj not comparable), revert to the normal <=>
algorithm.

Yes, but sort_by is not always more efficient than sort:

$ qri sort_by
----------------------------------------------------- Enumerable#sort_by
enum.sort_by {| obj | block } => array

 Sorts enum using a set of keys generated by mapping the values in
 enum through the given block.

    %w{ apple pear fig }.sort_by {|word| word.length}
                 #=> ["fig", "pear", "apple"]

 The current implementation of sort_by generates an array of tuples
 containing the original collection element and the mapped value.
 This makes sort_by fairly expensive when the keysets are simple

    require 'benchmark'
    include Benchmark

    a = (1..100000).map {rand(100000)}

    bm(10) do |b|
      b.report("Sort")    { a.sort }
      b.report("Sort by") { a.sort_by {|a| a} }
    end

 produces:

    user     system      total        real
    Sort        0.180000   0.000000   0.180000 (  0.175469)
    Sort by     1.980000   0.040000   2.020000 (  2.013586)

Also adding the startup overhead to check for an implementation of
sort_obj is an expense.

There’s also the issue that you really need to check if ALL of the
elements being sorted respond to sort_obj.

It seems better to me to either reserve this technique as a pattern,
or at most add a new sort_by_sort_obj method or the like, and leave
sort as it is.

–
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

miracleofmagick · October 7, 2007, 11:44pm

Rick DeNatale wrote:

On 10/7/07, Joel VanderWerf [email protected] wrote:

Eric H. wrote:

somethings.sort_by { |s| s.sort_obj }
Since this is more efficient than #sort, maybe the implementation of
#sort should first try to do this, and only if that fails (#sort_obj not
defined, or two sort_obj not comparable), revert to the normal <=>
algorithm.

Yes, but sort_by is not always more efficient than sort:

Quite right, #sort should stay as it is, and #sort_by should only be
called explicitly.

miracleofmagick · October 8, 2007, 2:07am

On Oct 7, 2007, at 03:34 , Trans wrote:

end
I don’t know if the name #sort_obj is appropriate, but I’m not coming
up with something better.

See also:

http://blog.segment7.net/articles/2007/10/07/sort_by-and-sort_obj

Nice. In fact, it seems this could dethrone <=> as the primary method
Sortable depends on.

My primary goal was reducing the number of method calls and objects
created, boosting speed. The downside to this is that you can’t sort
heterogeneous collections using #sort_obj. When sorting using #<=>
this would be easier.

miracleofmagick · October 8, 2007, 2:14am

On Oct 7, 2007, at 11:22 , Rudi C. wrote:

its a great idea and i have been using it for years in sql by the
name “sortkey” although after reading ara’s point i think i prefer the
term “orderkey” now.
another name in collation docs is “weight string”. i think i like
weight_string the best because it is less verby and confrontational
than order_by and order. both of these are sort of “in your face”
sounding. cheers,

I like order_key too.

miracleofmagick · October 8, 2007, 11:52am

2007/10/8, Eric H. [email protected]:

end
[@name, @version]

My primary goal was reducing the number of method calls and objects
created, boosting speed.

I’m afraid the performance issue is not that easy: I can imagine where
the <=> approach is more efficient. Here’s the reason: with #sort_by
you need to keep all keys in memory for the whole sorting process
which can cost a lot of memory. Creating those for pairwise
comparison (as with <=>) is more CPU intensive because more temp
arrays have to be created but the overall memory usage might be lower
because they are kept during a single comparison operation only (see
also Rick’s benchmark). Allocating that much memory from the OS for
the #sort_by case might actually slow the sorting down dramatically.

The downside to this is that you can’t sort
heterogeneous collections using #sort_obj. When sorting using #<=>
this would be easier.

Not necessarily because the <=> operators might not be able to work
with elements from the other types and thus sorting might fail as
miserably.

Kind regards

robert

miracleofmagick · October 8, 2007, 6:31pm

On Oct 7, 2007, at 14:20 , Rick DeNatale wrote:

Yes, but sort_by is not always more efficient than sort:

elements being sorted respond to sort_obj.

It seems better to me to either reserve this technique as a pattern,
or at most add a new sort_by_sort_obj method or the like, and leave
sort as it is.

I’m actually impressed #sort_by does that well. Array#sort cheats.

miracleofmagick · October 8, 2007, 6:33pm

On Oct 8, 2007, at 02:51 , Robert K. wrote:

because they are kept during a single comparison operation only (see
also Rick’s benchmark). Allocating that much memory from the OS for
the #sort_by case might actually slow the sorting down dramatically.

Profiling indicated a significant amount of time was spent in #<=>
which created more objects and called more methods than #sort_by.
Replacing #sort with #sort_by and #sort_obj reduced both. Creating
fewer objects and using less memory aren’t the same thing.

(Also, Rick’s benchmark shows Array#sort cheating more than any other
effect.)

The downside to this is that you can’t sort heterogeneous
collections using #sort_obj. When sorting using #<=> this would
be easier.

Not necessarily because the <=> operators might not be able to work
with elements from the other types and thus sorting might fail as
miserably.

#<=> usually does the right thing, raising ArgumentError or returning
nil on incomparable objects, where if you use #sort_obj with
incomparable objects you may get bogus results.

miracleofmagick · October 8, 2007, 7:06pm

On 10/8/07, Eric H. [email protected] wrote:

Profiling indicated a significant amount of time was spent in #<=>
which created more objects and called more methods than #sort_by.
Replacing #sort with #sort_by and #sort_obj reduced both. Creating
fewer objects and using less memory aren’t the same thing.

(Also, Rick’s benchmark shows Array#sort cheating more than any other
effect.)

First of all it’s not my benchmark, it comes directly from the ri doc
for sort_by

Second, I don’t understand the concept of Array#sort “cheating” this
isn’t a contest.

Some things perform better with sort, others with sort_by. The
performance of sort is generally determined by the cost of doing the
<=> comparison, while the performace of sort_by is generally
determined by the cost in time and space of creating the side array.

As the English say, Horse for courses.

–
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

miracleofmagick · October 8, 2007, 8:06pm

On Oct 8, 2007, at 10:05 , Rick DeNatale wrote:

for sort_by

Second, I don’t understand the concept of Array#sort “cheating” this
isn’t a contest.

Array#sort doesn’t call Fixnum#<=> when comparing fixnums, even if
you redefine Fixnum#<=>. If you want a benchmark that shows when
#sort is faster than #sort_by in several general cases you should use
like implementations, not the special cases where Array#sort has been
optimized.

Some things perform better with sort, others with sort_by. The
performance of sort is generally determined by the cost of doing the
<=> comparison, while the performace of sort_by is generally
determined by the cost in time and space of creating the side array.

For sorting Fixnum there is much lower cost since you never call a #<=>.

While I didn’t mention it, I was profiling to speed up RubyGems. I
don’t blindly assume, and neither should anyone else.

#sort_by and #sort_obj

$ qri sort_by ----------------------------------------------------- Enumerable#sort_by enum.sort_by {| obj | block } => array

$ qri sort_by
----------------------------------------------------- Enumerable#sort_by
enum.sort_by {| obj | block } => array