Ruby1.9: lazy versions of Enumerator#select and friends?

stickstone · November 2, 2008, 5:42pm

I’ve been having a play with Enumerators in ruby 1.9, in particular
adding ‘lazy’ versions of those methods in Enumerable which normally
return arrays. Here’s a proof-of-concept implementation:

module Enumerable
def lmap(&blk)
Enumerator.new do |y|
each do |e|
y << blk[e]
end
end
end

def lselect(&blk)
Enumerator.new do |y|
each do |e|
y << e if blk[e]
end
end
end

def ltake(n)
Enumerator.new do |y|
count = 0
each do |e|
break if n <= count
y << e
count += 1
end
end
end

def lskip(n)
Enumerator.new do |y|
count = 0
each do |e|
y << e unless count <= n
count += 1
end
end
end
end

if FILE == $0
big = (1…1_000_000_000_000)
big.lselect { |i| i % 2 == 1 }.lmap { |i| i + 100 }.
lskip(5).ltake(10).each { |i| puts i }
end

So instead of generating an array containing all processed elements,
they return an Enumerator which processes the elements on demand. As a
result, you can easily chain them together, as shown in the example; you
can handle large sequences without creating large intermediate arrays;
and also handle infinite lists.

Ruby hides the internals of this very nicely, I presume using Fibers to
maintain the state inside each Enumerator.

My first question is, does ruby 1.9 have methods like this already, and
I’ve just overlooked them? If so, where should I be looking?

If not, then is there value in adding something like this to the
language? For example,

Put this stuff into an external library, Facets-like, with new
methods in Enumerable as above (the method names given are just
examples, there are probably better ones)
Redefine Enumerator#map, Enumerator#select etc, so that they return
Enumerators instead of arrays. It seems reasonable to me that once you
have an Enumerator at the base of the chain, you will likely want to
keep chaining them together. You can always collapse the result to a
real array using ‘to_a’.

You can always forcibly create an Enumerator using ‘each’ without a
block, e.g.

(1…10).map { … }.select { … } # normal evaluation
(1…10).each.map { … }.select { … } # this would be ‘lazy’

A more radical language change would be to have Enumerable#map,
#select etc return Enumerators always. Some existing code would need a
‘to_a’ stuck on the end. Also, it may not be as efficient:

a1 = (1…10).map { |i| i2 } # map generates an array
immediately
a2 = (1…10).lmap { |i| i2 }.to_a # Fibers are involved
If you accept that we should have Enumerators rather than Arrays as
intermediate elements in the chain, then I guess we have to consider
composing Enumerators (indeed, composing anything Enumerable):

module Enumerable
def +(other)
Enumerator.new do |y|
each { |e| y << e }
other.each { |e| y << e }
end
end
end

a = (1…3) + (10…12) + (17…20)
a.each { |i| puts i }

I’m not sure where following this path would end up. & and | might still
have to build intermediate arrays. (However, if the source Enumerables
were known to be in sorted order, there would be interesting and
efficient ways to merge and join on them)

Any comments?

Regards,

Brian.

stickstone · November 2, 2008, 6:00pm

HI –

On Mon, 3 Nov 2008, Brian C. wrote:

end
def ltake(n)
def lskip(n)
if FILE == $0

Ruby hides the internals of this very nicely, I presume using Fibers to
maintain the state inside each Enumerator.

My first question is, does ruby 1.9 have methods like this already, and
I’ve just overlooked them? If so, where should I be looking?

It did have something like this, but not any more. I’m not sure they
were identical to what you’ve written, but the basic idea of an
enumerator that bundled itself with a block did exist in 1.9 for a
while.

If not, then is there value in adding something like this to the
language? For example,

I think so, absolutely. I’m still struggling with trying to believe
that enumerators are more than marginally useful without the ability
to travel with knowledge of a block and still be lazy.

Put this stuff into an external library, Facets-like, with new
methods in Enumerable as above (the method names given are just
examples, there are probably better ones)

Redefine Enumerator#map, Enumerator#select etc, so that they return
Enumerators instead of arrays. It seems reasonable to me that once you
have an Enumerator at the base of the chain, you will likely want to
keep chaining them together. You can always collapse the result to a
real array using ‘to_a’.

I’d strongly go for #1. #2 would require not only massive code
overhaul, but change of habits, and lots of to_a is unsightly (and you
wouldn’t necessarily want to have to hard-code the class of what you
want back, anyway [i.e., the ‘a’ in ‘to_a’]).

David

stickstone · November 2, 2008, 7:05pm

On Nov 2, 2008, at 10:58 AM, David A. Black wrote:

My first question is, does ruby 1.9 have methods like this already,
and
I’ve just overlooked them? If so, where should I be looking?

It did have something like this, but not any more. I’m not sure they
were identical to what you’ve written, but the basic idea of an
enumerator that bundled itself with a block did exist in 1.9 for a
while.

I’ve seen you asking David, but did you ever get an answer of why this
feature was removed? If there’s not a great reason, I sure wish we
could have it back.

James Edward G. II

stickstone · November 2, 2008, 7:21pm

Hi –

On Mon, 3 Nov 2008, James G. wrote:

I’ve seen you asking David, but did you ever get an answer of why this
feature was removed? If there’s not a great reason, I sure wish we could
have it back.

Shugo answered on ruby-core, mainly on the subject of efficiency, and
also on the intention behind enumerators which he says doesn’t include
that kind of instantiation with block awareness. I guess I got used to
how they behaved when they were block-aware, so it’s been hard for me
not to see that as how they’re supposed to be.

David

stickstone · November 2, 2008, 8:15pm

On Nov 2, 11:41 am, Brian C. [email protected] wrote:

end
end

This doesn’t work. But I think I understand what you are after.

Wouldn’t it have to be something like:

def lmap(&blk)
this = self
Functor.new |op, &blk| # &blk in 1.9 only
this.map(&blk).send(op, &blk)
end
end

T.

stickstone · November 2, 2008, 7:39pm

David A. Black wrote:

It did have something like this, but not any more. I’m not sure they
were identical to what you’ve written, but the basic idea of an
enumerator that bundled itself with a block did exist in 1.9 for a
while.

I’d be interested to learn exactly what the dropped feature was.

Perhaps we’re talking at cross purposes, but I’m not sure I’m suggesting
“an enumerator that bundled itself with a block”. Rather, I’m suggesting
that certain methods in Enumerable which used to return an array,
instead return an Enumerator, so that they can be chained horizontally.

By this I mean:

a.meth1 { … }.meth2 { … }.meth3 { … }

currently runs “vertically” (meth1 enumerates all of ‘a’ and creates an
array; meth2 enumerates this array and creates another array; then meth3
enumerates this array and creates a final array)

By running “horizontally” I mean that the first element of ‘a’ is
processed all the way across; then the second element of ‘a’; and so on.
No intermediate arrays are created.

I’d strongly go for #1. #2 would require not only massive code
overhaul, but change of habits, and lots of to_a is unsightly (and you
wouldn’t necessarily want to have to hard-code the class of what you
want back, anyway [i.e., the ‘a’ in ‘to_a’]).

I don’t understand the last sentence - a whole bunch of Enumerable
methods like #map, #select etc are already hard-coded to return an
Array, so you currently have no choice.

With what I propose you can use any method you like to build the result.
If you don’t want an array, then don’t use to_a. You could use ‘inject’
or ‘each’ or ‘each_with_object’ to gather the resulting elements in
whatever way you like: e.g.

a.meth1 { … }.meth2 { …}.each { |x,y| h[x] = y }

Or even, as the original example I posted shows, you can just use the
results as they arrive and then discard them:

a.meth1 { … }.meth2 { …}.each { |x| puts x }

Now that Enumerators exist, the fact that Enumerable methods are
hard-coded always to create an Array seems very limiting. There are so
many other things you might want to create instead.

stickstone · November 2, 2008, 10:15pm

On Nov 2, 4:00 pm, Brian C. [email protected] wrote:

ruby 1.9.1 (2008-10-28 revision 19983) [i686-linux]
Interesting. Clearly there is something substantially different in
1.9’s version. I added

require ‘enumerator’

and using 1.8.6, I expected it would work the same. But get

:13:in initialize': wrong number of arguments (0 for 1) (ArgumentError) from t.rb:13:in new’
from t.rb:13:in `lselect’
from t.rb:44

Enumerator.new() wants an enumerable object as an argument --what is
1.9 doing without it?

T.

stickstone · November 2, 2008, 10:01pm

Thomas S. wrote:

On Nov 2, 11:41ï¿½am, Brian C. [email protected] wrote:

ï¿½ ï¿½ end
ï¿½ end

This doesn’t work.

Sorry, which bit doesn’t work?

$ ruby19 -v
ruby 1.9.1 (2008-10-28 revision 19983) [i686-linux]
$ ruby19 lazy.rb
113
115
117
119
121
123
125
127
129
131
1
2
3
10
11
12
17
18
19
20
$

stickstone · November 2, 2008, 10:26pm

Thomas S. wrote:

Enumerator.new() wants an enumerable object as an argument --what is
1.9 doing without it?

1.9 supports two completely different types of Enumerator objects. The
first is the simple 1.8.6 one which just maps method :each to arbitrary
method :foo on the upstream object. The second is the clever one.

-------------------------------------------------------- Enumerator::new
Enumerator.new(obj, method = :each, *args)
Enumerator.new { |y| … }

 From Ruby 1.9.1

 Creates a new Enumerator object, which is to be used as an
 Enumerable object iterating in a given way.

 In the first form, a generated Enumerator iterates over the given
 object using the given method with the given arguments passed. Use
 of this form is discouraged. Use Kernel#enum_for(), alias to_enum,
 instead.

   e = Enumerator.new(ObjectSpace, :each_object)
       #-> ObjectSpace.enum_for(:each_object)

   e.select { |obj| obj.is_a?(Class) }  #=> array of all classes

 In the second form, iteration is defined by the given block, in
 which a "yielder" object given as block parameter can be used to
 yield a value by calling the +yield+ method, alias +<<+.

   fib = Enumerator.new { |y|
     a = b = 1
     loop {
       y << a
       a, b = b, a + b
     }
   }

   p fib.take(10) #=> [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

stickstone · November 3, 2008, 4:10am

Hi –

On Mon, 3 Nov 2008, Brian C. wrote:

Thomas S. wrote:

Enumerator.new() wants an enumerable object as an argument --what is
1.9 doing without it?

1.9 supports two completely different types of Enumerator objects. The
first is the simple 1.8.6 one which just maps method :each to arbitrary
method :foo on the upstream object. The second is the clever one.

Although… they’re really the same if you think of it as an
enumerable object that has everything it needs except the knowledge of
what to iterate over (i.e., the “each” intelligence), and the block
and the method-attachment are just two different ways of teaching it
how it should iterate.

At least, that’s how I’m planning to present it in my book, so I
thought I’d try it out here first

David

stickstone · November 2, 2008, 11:05pm

Hi –

On Mon, 3 Nov 2008, Brian C. wrote:

that certain methods in Enumerable which used to return an array,
instead return an Enumerator, so that they can be chained horizontally.

This is the feature that disappeared – definitely not identical to
what you’re doing, but kind of germane:

a = [1,2,3,4].enum_for(:map, &lambda {|x| x * 10})
a.select {|x| x > 20 } # [30,40]

No intermediate arrays are created.
Right; that was clear from what you said. We’re talking in the same
problem domain (chainable, lazy enumerators), in somewhat different
terms.

I’d strongly go for #1. #2 would require not only massive code
overhaul, but change of habits, and lots of to_a is unsightly (and you
wouldn’t necessarily want to have to hard-code the class of what you
want back, anyway [i.e., the ‘a’ in ‘to_a’]).

I don’t understand the last sentence - a whole bunch of Enumerable
methods like #map, #select etc are already hard-coded to return an
Array, so you currently have no choice.

True, but that doesn’t take newly-written enumerable classes, and
there are a few overrides here and there (like Hash#select returning a
hash, in 1.9).

I wonder what role what I have dubbed the “un-overriding” of
enumerable methods would play. Here’s an example:

hash = { 1 => 2 }
=> {1=>2}

hash.select { true }
=> {1=>2}

hash.each.select { true }
=> [[1, 2]]

Since Enumerator doesn’t override select (which of course it can’t, in
a general way), calling select on an enumerator for the hash has a
different effect from calling select on the hash. I haven’t puzzled
through how this would play out with your construct, but it seems like
it might crop up (though maybe no more than with enumerators in
general).

a.meth1 { … }.meth2 { …}.each { |x| puts x }

Now that Enumerators exist, the fact that Enumerable methods are
hard-coded always to create an Array seems very limiting. There are so
many other things you might want to create instead.

I’m still not sure I’d like to have to do a kind of
enumerator-resolving last call to every map, select, inject, etc. It’s
different with each, because it exists only for its block side-effects
in the first place. The ones that return result sets that you care
about would all have to be “capped” with to_a or something, unless the
last method in the chain somehow knew not to return an enumerator.

David

stickstone · November 3, 2008, 9:57am

David A. Black wrote:

This is the feature that disappeared – definitely not identical to
what you’re doing, but kind of germane:

a = [1,2,3,4].enum_for(:map, &lambda {|x| x * 10})
a.select {|x| x > 20 } # [30,40]

OK, I think I see what you’re saying now, except I’m not sure exactly
how ‘:map’ interacts with this.

If this code calls the existing ‘map’ method, then does that mean that
‘map’ also behaved differently when invoked via an Enumerator? (Because
otherwise, ‘map’ normally builds the entire result array up-front). How
did ‘map’ know whether it was being called in this context or not?

Using current ruby-1.9 you would write something like

a = [1,2,3,4]
Enumerator.new do |y|
a.each { |e| y << e * 10 }
end.select { |x| x > 20 }

but you can see I’ve had to write the ‘map’ logic in-line.

True, but that doesn’t take newly-written enumerable classes, and
there are a few overrides here and there (like Hash#select returning a
hash, in 1.9).

Ah, well that’s a bit messy. Taking that mess a bit further, I guess you
could arrange for Array#select similarly to be hardcoded always to
return an Array? Hmm.

I’m still not sure I’d like to have to do a kind of
enumerator-resolving last call to every map, select, inject, etc. It’s
different with each, because it exists only for its block side-effects
in the first place. The ones that return result sets that you care
about would all have to be “capped” with to_a or something, unless the
last method in the chain somehow knew not to return an enumerator.

Maybe we’re actually discussing my option #3 - which was to change
Enumerable#select to return an Enumerator.

My option #2 was that Enumerator#select return an Enumerator, in the
same way that Hash#select returns a Hash as you just described.

That is:

class Enumerator
def select
… etc
end
end

Then foo.select { … } would always return an Array (via
Enumerable#select), unless foo was an Enumerator (in which case it would
return an Enumerator), or a Hash (in which case it would return a Hash)

The ‘capping’ would only be needed if you create an Enumerator in the
chain.

I wonder what role what I have dubbed the “un-overriding” of
enumerable methods would play. Here’s an example:

hash = { 1 => 2 }
=> {1=>2}

hash.select { true }
=> {1=>2}

hash.each.select { true }
=> [[1, 2]]

Since Enumerator doesn’t override select (which of course it can’t, in
a general way), calling select on an enumerator for the hash has a
different effect from calling select on the hash.

Yes, I see what you’re saying.

But actually I’m proposing that hash.each.select return another
Enumerator. So you can cap it with a ‘to_hash’ method at the end to get
the same result:

require ‘lazy’
class Enumerator
def to_hash
each_with_object({}) { |(k,v),o| o[k] = v }
end
end
hash = {1=>2}
p hash.lselect { true } #=> #Enumerator:0x8321f60
p hash.lselect { true }.to_hash #=> {1=>2}

As you say, arguably it’s tedious to have to stick .to_a or .to_hash at
the end, but once you have an Enumerator, you have (a) lost the memory
of what it is you created it from, and (b) quite probably don’t want to
create the same thing again anyway - for example, sticking each { … }
at the end of the chain so as to consume the items rather than build a
new object.

So I think I’m warming to the middle ground: Enumerator#foo returns an
Enumerator, so Enumerators chain together, and if at the end of it you
really want an array (or a hash or whatever) rather than an Enumerator,
then you add .to_a or .to_hash

But equally, if you call Array#select or Hash#select directly, without
first “priming” the chain with an Enumerator object, then you get back
another Array or Hash immediately.

It makes sense when you understand it, although someone who didn’t
understand this might have difficulty distinguishing

[10,20,30].select { … }.map { … }.each { … }
from
[10,20,30].each.select { … }.map { … }.each { … }

I think that ‘each without a block’ is not necessarily a good way of
flagging ‘make me an enumerator’; writing to_enum would be clearer.

Actually, I have to say that I can’t really see the purpose of ‘map
without a block’, ‘select without a block’. I saw a posting showing how
you could chain

foo.select.with_index { |x,i| … }

but surely that would be clearer as

foo.with_index.select { |x,i| … }

(where ‘with_index’ creates an Enumerator which adds the index for you)

Unless there’s a good use for this which I’ve overlooked, I’d be happy
for Ruby to go back to 1.8.6 behaviour of each/map/select requiring a
block. In that case, my earlier example would have to be written

[10,20,30].to_enum.select { … }.map { … }.each { … }

which is pretty clear (to me

Regards,

Brian.