Uniq with count; better way?

a = [4,5,6,4,5,6,6,7]

result = Hash.new(0)
a.each { |x| result[x] += 1 }

p result

The result I am getting
{4=>2, 5=>2, 6=>3, 7=>1}
is what I want.

Is there a better way; perhaps using uniq?

The first that came to my mind.

[4,5,6,4,5,6,6,7].inject(Hash.new(0)) {|res, x| res[x] += 1; res }

Is it good enough for you?

On Mon, Jan 16, 2012 at 16:00, Sigurd [email protected] wrote:

[4,5,6,4,5,6,6,7].inject(Hash.new(0)) {|res, x| res[x] += 1; res }

I think this is a misuse of inject, personally, every time I see it.
It’s
harder to read and it doesn’t give the feeling of actually “reducing”
(inject’s alias) the array down to one thing. The required ; res is a
sign of that. Compare:

[1, 2, 3, 4].inject(5) { |a, b| a + b }

Well it’s one of the possible solutions.
You example is not accurate though:

5 + [1, 2, 3, 4].reduce(&:+)

I like this

a = [4,5,6,4,5,6,6,7]

1

p Hash[a.group_by{|n|n}.map{|k, v|[k, v.size]}]

2

p Hash.new(0).tap{|h|a.each{|n|h[n] += 1}}

2012/1/17 Ralph S. [email protected]:

On Mon, Jan 16, 2012 at 17:04, Adam P. [email protected] wrote:

[1, 2, 3, 4].inject(5) { |a, b| a + b }
There’s always each_with_object, although it’s a little long:

[4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) { |x, res| res[x] += 1
}

On Jan 16, 2012 4:09 PM, “Sigurd” [email protected] wrote:

You example is not accurate though:

5 + [1, 2, 3, 4].reduce(&:+)

In what sense is that more “accurate”?

Well,

it seems not quite accurate to me because of block. inject uses
convention that the last statement in method is a return. The nature of
inject is to assign the last value to the memo that has not been used
ever in your case. Therefore it’s more natural to use short inject
method definitions: either a.inject(5, :+) either 5 + a.inject(:+). If
the memo return in proc would be unnatural the inject won’t pass it to
the proc explicitly.

On the other side I’m not a proponent of the crazy injects that could be
barely understood. I think in this case inject could be used easily as
well as the other solutions provided.

Please do not top post.

On Mon, Jan 16, 2012 at 7:08 PM, Sigurd [email protected] wrote:

it seems not quite accurate to me because of block. inject uses convention that
the last statement in method is a return.

??? Inject uses the convention that the block passes the “aggregator”
on which it has received as first argument. This does not have
anything to do with “return”.

The nature of inject is to assign the last value to the memo that has not been
used ever in your case.

What “memo”?

Therefore it’s more natural to use short inject method definitions: either
a.inject(5, :+) either 5 + a.inject(:+). If the memo return in proc would be
unnatural the inject won’t pass it to the proc explicitly.

???

On the other side I’m not a proponent of the crazy injects that could be barely
understood. I think in this case inject could be used easily as well as the other
solutions provided.

On Jan 16, 2012, at 6:14 PM, Adam P. wrote:

On Jan 16, 2012 4:09 PM, “Sigurd” [email protected] wrote:

You example is not accurate though:

5 + [1, 2, 3, 4].reduce(&:+)

Can you please again explain what is not accurate about Adam’s piece
of code? Are you aware of the two different behaviors of #inject?

irb(main):001:0> [[1,2,3],[1],[]].each do |a|
irb(main):002:1* p a, a.inject(&:+), a.inject(0, &:+)
irb(main):003:1> end
[1, 2, 3]
6
6
[1]
1
1
[]
nil
0
=> [[1, 2, 3], [1], []]

Both have their use and it depends on the use case which you pick.

Kind regards

robert

On Jan 16, 2012, at 10:15 PM, Robert K. wrote:

What “memo”?

Memo is the name of the first parameter that is passed to the block, or
at least it’s how it’s named in doc and how i’ve used to call it.

Therefore it’s more natural to use short inject method definitions: either
a.inject(5, :+) either 5 + a.inject(:+). If the memo return in proc would be
unnatural the inject won’t pass it to the proc explicitly.

???

The argue was about the explicit memo returns so on a given sample I
tried to show that that sample do not necessarily need to use blocks.
But I see no problem to explicitly return value if i was passed to that
block. Yes, sometimes it looks ugly, sometimes, especially with tap, it
looks pretty nice.

Can you please again explain what is not accurate about Adam’s piece
of code? Are you aware of the two different behaviors of #inject?

Technically it’s good. My argument was not that the code is not
functional or doing something in a wrong way or that syntax is
unacceptable. The point i have is that Adams code is not the best
example illustrating that the explicit inject block returns is the sign
of incorrect inject use or let’s call it “code smell”. I’m not against
that particular piece of code, it’s just someone is wrong in internet
again :slight_smile:

nil
0
=> [[1, 2, 3], [1], []]

Both have their use and it depends on the use case which you pick.

I don’t know can this be called different behavior. Inject use the first
value of collection if it’s not set explicitly so it’s like particular
case of inject, but i’m ok with the two different cases as well.

On Mon, Jan 16, 2012 at 1:48 PM, Magnus H. [email protected] wrote:

[1, 2, 3, 4].inject(5) { |a, b| a + b }

There’s always each_with_object, although it’s a little long:

[4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) { |x, res| res[x] += 1 }

I think Magnus H. is the clearest (IMHO, yes, it’s just a taste and
humble opinion.).

[4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) {|num, hsh| hsh[num] +=
1}

Another way (not better) I remember is…

Hash[ [4,5,6,4,5,6,6,7].sort.chunk {|n| n}.map {|ix, els| [ix, els.size]
} ]

See: http://ruby-doc.org/core-1.9.3/Enumerable.html#method-i-chunk

It also can be… clearer?!?

Hash[ [4,5,6,4,5,6,6,7].group_by {|n| n}.map {|ix, els| [ix, els.size] }
]

Perhaps something like this (same as Magnus H.) just hiding the
complexity into the method.

class Array
def totalize_to_hash
hsh = Hash.new(0)
self.each do |n|
hsh[n] += 1
end
hsh
end
end

[4,5,6,4,5,6,6,7].totalize_to_hash

Abinoam Jr.

On Mon, Jan 16, 2012 at 9:22 PM, Abinoam Jr. [email protected] wrote:

sign of that. Compare:

Hash[ [4,5,6,4,5,6,6,7].group_by {|n| n}.map {|ix, els| [ix, els.size] } ]
hsh
end
end

[4,5,6,4,5,6,6,7].totalize_to_hash

Abinoam Jr.

Some benchmark results…

n = 100_000
Benchmark.bm(15) do |b|
b.report(“Ralph Shneiver:”) { n.times { a = [4,5,6,4,5,6,6,7];
result = Hash.new(0); a.each { |x| result[x] += 1 }; result} }
b.report(“Sigurd:”) { n.times {
[4,5,6,4,5,6,6,7].inject(Hash.new(0)) {|res, x| res[x] += 1; res } } }
b.report(“Keinich #1”) { n.times { Hash[a.group_by{|n|n}.map{|k,
v|[k, v.size]}] } }
b.report(“Keinich #2”) { n.times {
Hash.new(0).tap{|h|a.each{|n|h[n] += 1}} } }
b.report(“Magnus H.:”) { n.times {
[4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) { |x, res| res[x] += 1
} } }
b.report(“Abinoam #1:”) { n.times { Hash[
[4,5,6,4,5,6,6,7].sort.chunk {|n| n}.map {|ix, els| [ix, els.size] } ]
} }
end

                 user     system      total        real

Ralph Shneiver: 0.290000 0.000000 0.290000 ( 0.259640)
Sigurd: 0.320000 0.000000 0.320000 ( 0.289873)
Keinich #1 0.560000 0.000000 0.560000 ( 0.497736)
Keinich #2 0.280000 0.000000 0.280000 ( 0.250843)
Magnus H.: 0.310000 0.000000 0.310000 ( 0.283344)
Abinoam #1: 1.140000 0.000000 1.140000 ( 1.042744)

Abinoam Jr.

Kenichi,

Monday, January 16, 2012, 9:21:51 AM, you wrote:

KK> 2012/1/17 Ralph S. [email protected]:

Abinoam,

Monday, January 16, 2012, 6:05:33 PM, you wrote:

On Mon, Jan 16, 2012 at 10:05 PM, Abinoam Jr. [email protected] wrote:

(inject’s alias) the array down to one thing. The required ; res is a
humble opinion.).

end
n = 100_000
[4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) { |x, res| res[x] += 1
Keinich #2 0.280000 0.000000 0.280000 ( 0.250843)
Magnus H.: 0.310000 0.000000 0.310000 ( 0.283344)
Abinoam #1: 1.140000 0.000000 1.140000 ( 1.042744)

Abinoam Jr.

Sorry for the mess… (some error in the bench code + horrible text
wrapping)

Apologizing with the gist…
https://gist.github.com/1624016

Abinoam Jr.

On Mon, Jan 16, 2012 at 10:03 PM, Sigurd [email protected] wrote:

On Jan 16, 2012, at 10:15 PM, Robert K. wrote:

Please do not top post.

On Mon, Jan 16, 2012 at 7:08 PM, Sigurd [email protected] wrote:

it seems not quite accurate to me because of block. inject uses convention
that the last statement in method is a return.

Therefore it’s more natural to use short inject method definitions: either
a.inject(5, :+) either 5 + a.inject(:+). If the memo return in proc would be
unnatural the inject won’t pass it to the proc explicitly.

???

The argue was about the explicit memo returns so on a given sample I tried to
show that that sample do not necessarily need to use blocks. But I see no problem
to explicitly return value if i was passed to that block. Yes, sometimes it looks
ugly, sometimes, especially with tap, it looks pretty nice.

Funny thing is, that when you do x.inject(&:+) #inject actually sees a
block. This is just a shorthand notation for doing x.inject {|a,b|
a+b}. See doc of #to_proc and here:

irb(main):028:0> o = Object.new
=> #Object:0x9f8966c
irb(main):029:0> def o.to_proc; lambda {|*a| printf “block %p\n”, a} end
=> nil
irb(main):030:0> def f;if block_given? then yield else puts “no block”
end end
=> nil
irb(main):031:0> f() { puts 123 }
123
=> nil
irb(main):032:0> f
no block
=> nil
irb(main):033:0> f &o
block []
=> nil
irb(main):034:0> f(&o)
block []
=> nil

Can you please again explain what is not accurate about Adam’s piece
of code? Are you aware of the two different behaviors of #inject?

Technically it’s good. My argument was not that the code is not functional or
doing something in a wrong way or that syntax is unacceptable. The point i have is
that Adams code is not the best example illustrating that the explicit inject
block returns is the sign of incorrect inject use or let’s call it “code smell”.
I’m not against that particular piece of code, it’s just someone is wrong in
internet again :slight_smile:

Aha. I would not have called that “not accurate”.

nil
0
=> [[1, 2, 3], [1], []]

Both have their use and it depends on the use case which you pick.

I don’t know can this be called different behavior. Inject use the first value
of collection if it’s not set explicitly so it’s like particular case of inject,
but i’m ok with the two different cases as well.

It is different behavior because the block is invoked different number
of times. And behavior is controlled by the fact whether an argument
is passed to #inject or not:

irb(main):016:0> [[1,2,3],[1],[]].each do |a|
irb(main):017:1* p a
irb(main):018:1> p a.inject {|x,y| printf “block 1: %p, %p\n”, x,
y;x+y},
irb(main):019:1* a.inject(0) {|x,y| printf “block 2: %p, %p\n”, x,
y;x+y},
irb(main):020:1* a.inject(nil) {|x,y| printf “block 3: %p, %p\n”, x,
y;x.to_i+y}
irb(main):021:1> end
[1, 2, 3]
block 1: 1, 2
block 1: 3, 3
block 2: 0, 1
block 2: 1, 2
block 2: 3, 3
block 3: nil, 1
block 3: 1, 2
block 3: 3, 3
6
6
6
[1]
block 2: 0, 1
block 3: nil, 1
1
1
1
[]
nil
0
nil
=> [[1, 2, 3], [1], []]

Cheers

robert

Abinoam,
thank you for get the benchmarks.
I guess the different of spped from generating objects.
They generate minimal objects which Object#tap and
Enumerable#each_with_object.

Ralph,
I like #1 for comprehensibility meaning

looks “construct Hash instance from inner subscript”

Hash[]

looks “collect the own values”

a.group_by{|n|n}

counting

map{|k, v|[k, v.size]}]

if Hash has own map

https://gist.github.com/1626034

I like #2 for comprehensibility, clear name-space, and speed meanings

comprehensibility

looks like “list comprehension”

tap{|h|a.each{|n|h[n] += 1}}

clear name-space

no make variables for out of block

speed meanings

look Abinoam’s benchmarks

if Enumerable has method for this case

https://gist.github.com/1625981

2012/1/17 Ralph S. [email protected]:

On Tue, Jan 17, 2012 at 2:44 AM, Abinoam Jr. [email protected] wrote:

Sorry for the mess… (some error in the bench code + horrible text
wrapping)

Apologizing with the gist…
https://gist.github.com/1624016

Thanks. Very interesting.

I assumed (incorrectly turns out for Ruby 1.9.3) that for large data
sets
there
would be a significant performance difference. Because, upon determining
the
“bins” of uniq values, that is essentially a form of “sorting”, which
can
be O(n^2)
if not careful.

Turns out I was wrong for ruby 1.9.3 (but right for some other rubies).

I rewrote the code to create large datasets (array up to 10_000_000),
but
with
the data inside a set of 0…100.

require ‘benchmark’

n = 1
#ar = [4,5,6,4,5,6,6,7]
ar = [].tap{|a| 10_000_000.times {a << rand(100)}} #1_000_000,
2_000_000,

puts “SAMPLE of ar”
puts ar[0…20]
puts “SIZE”
puts ar.size
Benchmark.bm(15) do |b|
b.report(“Ralph Shneiver:”){ n.times { result = Hash.new(0); ar.each {
|x|
result[x] += 1 }; result} }
b.report(“Sigurd:”) { n.times { ar.inject(Hash.new(0)) {|res, x| res[x]
+=
1; res } } }
b.report(“Keinich #1”) { n.times { Hash[ar.group_by{|n|n}.map{|k,v|[k,
v.size]}] } }
b.report(“Keinich #2”) { n.times { Hash.new(0).tap{|h|ar.each{|n|h[n]
+=
1}} } }
b.report(“Magnus H.:”) { n.times { ar.each_with_object(Hash.new(0)) {
|x, res| res[x] += 1 } } }
b.report(“Abinoam #1:”) { n.times { Hash[ar.sort.chunk {|n| n}.map
{|ix,
els| [ix, els.size] } ] } }
end

RUBY 1.9.3:

For ruby 1.9.3-p0 all solutions perform approx. the same (at least same
order)
and seem to stay quasi linear.

SIZE
1000000 #[1_000_000 that is ; n = 1]
user system total real
Ralph Shneiver: 0.180000 0.000000 0.180000 ( 0.174283)
Sigurd: 0.200000 0.000000 0.200000 ( 0.203652)
Keinich #1 0.140000 0.000000 0.140000 ( 0.142833)
Keinich #2 0.180000 0.000000 0.180000 ( 0.177456)
Magnus H.: 0.200000 0.000000 0.200000 ( 0.205895)
Abinoam #1: 0.260000 0.000000 0.260000 ( 0.254554)

SIZE
2000000 #[2_000_000 that is ; n = 1]
user system total real
Ralph Shneiver: 0.340000 0.010000 0.350000 ( 0.350032)
Sigurd: 0.410000 0.000000 0.410000 ( 0.406483)
Keinich #1 0.280000 0.000000 0.280000 ( 0.285213)
Keinich #2 0.350000 0.010000 0.360000 ( 0.354640)
Magnus H.: 0.410000 0.000000 0.410000 ( 0.411010)
Abinoam #1: 0.470000 0.030000 0.500000 ( 0.498782)

SIZE
10000000 #[10_000_000 that is; n = 1 here]

                  user     system      total        real

Ralph Shneiver: 1.710000 0.040000 1.750000 ( 1.748137)
Sigurd: 2.000000 0.010000 2.010000 ( 2.012496)
Keinich #1 1.380000 0.030000 1.410000 ( 1.409462)
Keinich #2 1.750000 0.010000 1.760000 ( 1.760997)
Magnus H.: 1.990000 0.020000 2.010000 ( 2.014282)
Abinoam #1: 2.500000 0.060000 2.560000 ( 2.562646)

All solutions in Ruby 1.9.3 are relatively “linear”.

Keinich #1 is the fastest.

I believe it is because I limited the dataset to 100 different values,
the
number of
bins is fixed and order O(n) can be achieved (not the O(n.log(n)) that I
had
expected, incorrectly).

RUBY 1.8.7:

For ruby 1.8.7-p I could not test the last 2 solutions (methods not
implemented).
For the first 3, significant differences arise.

SIZE
1000000 #[1_000_000 that is ; n = 1]

                 user     system      total        real

Ralph Shneiver: 0.370000 0.010000 0.380000 ( 0.369785)
Sigurd: 2.320000 0.030000 2.350000 ( 2.360403)
Keinich #1 1.520000 0.040000 1.560000 ( 1.562627)
Keinich #2 16.520000 0.100000 16.620000 ( 16.623032)

SIZE
2000000 #[2_000_000 that is ; n = 1]
user system total real
Ralph Shneiver: 0.720000 0.010000 0.730000 ( 0.737673)
Sigurd: 8.040000 0.110000 8.150000 ( 8.142827)
Keinich #1 5.670000 0.040000 5.710000 ( 5.716364)
Keinich #2 52.600000 0.480000 53.080000 ( 53.100823)

SIZE
10000000 #[10_000_000 that is ; n = 1]

                 user     system      total        real

Ralph Shneiver: 3.680000 0.000000 3.680000 ( 3.680230)

Striking: Only the solution of Ralph Shneiver remains linear in MRI
1.8.7

JRUBY (1.6.5.1)

$ ruby -v
jruby 1.6.5.1 (ruby-1.8.7-p330) (2011-12-27 1bf37c2) (OpenJDK Server VM
1.6.0_23) [linux-i386-java]

I also tested JRuby out of curiosity.

SIZE
1000000 #[1_000_000 that is ; n = 1]
user system total real
Ralph Shneiver: 0.244000 0.000000 0.244000 ( 0.220000)
Sigurd: 0.264000 0.000000 0.264000 ( 0.264000)
Keinich #1 0.112000 0.000000 0.112000 ( 0.112000)
Keinich #2 11.256000 0.000000 11.256000 ( 11.256000)

SIZE
2000000 #[2_000_000 that is ; n = 1]
user system total real
Ralph Shneiver: 0.392000 0.000000 0.392000 ( 0.368000)
Sigurd: 0.439000 0.000000 0.439000 ( 0.438000)
Keinich #1 0.196000 0.000000 0.196000 ( 0.196000)
Keinich #2 14.462000 0.000000 14.462000 ( 14.462000)

SIZE
10000000 #[10_000_000 that is ; n = 1]]

                 user     system      total        real

Ralph Shneiver: 1.604000 0.000000 1.604000 ( 1.581000)
Sigurd: 1.769000 0.000000 1.769000 ( 1.769000)
Keinich #1 0.967000 0.000000 0.967000 ( 0.967000)
Keinich #2 8.978000 0.000000 8.978000 ( 8.978000)

Interesting again.

Those 2 solutions where not yet available on JRuby 1.6.5.1:

Magnus H.: NoMethodError: undefined method each_with_object' for #<Array:0x18eb7b8> Abinoam #1: NoMethodError: undefined methodchunk’ for
#Array:0x1719f30

Ralph Shneiver remains linear again.

But then again … Keinich #1 is significantly faster on 10_000_000 than
the others.
Keinich #2 seems to be not very predictable?

JRUBY HEAD (in rvm):

$ ruby -v
jruby 1.7.0.dev (ruby-1.8.7-p357) (2012-01-17 7de254f) (Java HotSpot™
Server VM 1.6.0_26) [linux-i386-java]

1000000 #[1_000_000 that is ; n = 1]

                 user     system      total        real

Ralph Shneiver: 0.238000 0.000000 0.238000 ( 0.223000)
Sigurd: 0.286000 0.000000 0.286000 ( 0.286000)
Keinich #1 0.120000 0.000000 0.120000 ( 0.120000)
Keinich #2 7.841000 0.000000 7.841000 ( 7.841000)

SIZE
2000000 #[2_000_000 that is ; n = 1]

                 user     system      total        real

Ralph Shneiver: 0.426000 0.000000 0.426000 ( 0.410000)
Sigurd: 0.478000 0.000000 0.478000 ( 0.478000)
Keinich #1 0.213000 0.000000 0.213000 ( 0.213000)
Keinich #2 21.928000 0.000000 21.928000 ( 21.928000)

SIZE
10000000 #[10_000_000 that is ; n = 1]

                 user     system      total        real

Ralph Shneiver: 1.550000 0.000000 1.550000 ( 1.535000)
Sigurd: 1.795000 0.000000 1.795000 ( 1.795000)
Keinich #1 1.060000 0.000000 1.060000 ( 1.060000)
Keinich #2 116.826000 0.000000 116.826000 (116.826000)

Similar results to JRuby 1.6.5.1

Caveat:

I did not check the correctness of the result, only the timing.

Questions:

Why would ruby 1.9.3. be so much better at this than ruby 1.8.7 ?
Could it be because the Hash is now “ordered” so it can do an efficient
algorithm when adding an entry to a bin? Or is it object creation?

Why are certain methods leading to non-linear behavior?

My conclusions (?):

  • be careful about performance with large data sets

  • Fastest overall: Keinich #1 on JRuby 1.6.5.1

  • CRuby 1.9.3 seems to be the only on the remains linear for all
    solutions on large data sets.

  • “Ralph Shneiver” is the only solution that remains linear for
    all tested rubies (MRI 1.9.3, MRI 1.8.7, JRuby 1.6.5.1 JRuby head)

HTH,

Peter

On Tue, Jan 17, 2012 at 7:58 AM, Kenichi K. [email protected]
wrote:

if Enumerable has method for this case

https://gist.github.com/1625981

I would like to have it.

We can discuss it better here at ruby talk to see the pros and cons.
If somebody is able to do the C code of it…
Perhaps we could issue a feature request.

Abinoam Jr.

I found a below discussion in rails. (via google)
https://github.com/rails/rails/pull/1932

It has a same name and aims to same goal.
And it was estimated “too specific”.

I think too, but this name not recall other case for me.
uhh…

2012/1/17 Abinoam Jr. [email protected]:

This forum is not affiliated to the Ruby language, Ruby on Rails framework, nor any Ruby applications discussed here.

| Privacy Policy | Terms of Service | Remote Ruby Jobs