Heavy loop functions slow

Alright so I was playing with my large amounts of data and ran into yet
another problem with shoving it into a loop that requires a substantial
amount of memory.

dataArray = []
output = arrayOut.to_s.chop!.split(",")

output1 = output[0..356130]
output2 = output[356131..712260]
output3 = output[712261..1068390]
output4 = output[1068391..1424521]

count = 0
output1.each do |out|
  out = out.to_i
  push = hashRange[out]
  dataArray << push
  count += 1
  puts "#{push} - #{count}" # Testing purposes
end

I broke 'output' up into several blocks for purposes other than just
this loop, but also to see what the effect would be. As you can see,
we're talking about almost 1,500,000 array elements.
-> hashRange is a hash, obviously

Problem being: that test line I added, puts "#{push} - #{count}",
confirms that it moves through 1 element every 5-6 sec...
After doing the math, that's about 86 days to finish 1,500,000 elements :(

Any ideas that would speed this up are much appreciated!! Otherwise I'll
be back in 3 months IF I don't get an error :D

Thanks,

  • Mac

Try further benchmarking to find what causes the slowness. Isolate the
code that is responsible for it. Also, without knowing what hashRange and
output contain, it is not obvious where the slowness comes from. For
instance, if hashRange = {} and output = (0..1_000_000).to_a, this code
takes relatively little time to execute.
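For example, a minimal sketch of that kind of isolation with Ruby's Benchmark module (the hash and data below are invented stand-ins for hashRange and output):

```ruby
require 'benchmark'

# Invented stand-ins for hashRange and the parsed output.
hash_range = Hash[(0...1_000).map { |i| [i, i * 10] }]
output = (0...100_000).map(&:to_s)

# Time each suspect step separately to see which one dominates.
Benchmark.bm(12) do |bm|
  bm.report('to_i only')   { output.each { |o| o.to_i } }
  bm.report('hash lookup') { output.each { |o| hash_range[o.to_i] } }
end
```

If the lookup report dwarfs the conversion report, the hash access is the place to optimize.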

2008/4/8, Michael L. [email protected]:

output2 = output[356131..712260]
end
Any ideas that would speed this up are much appreciated!! Otherwise I'll
be back in 3 months IF I don't get an error :D

Obviously there is a lot of code missing from the piece above. Can
you explain what you are trying to achieve? What is your input file
format, and what kind of transformation do you want to do on it? I
looked through your other postings, but it did not become clear to me.

Cheers

robert

On Tue, Apr 8, 2008 at 7:17 AM, Michael L.
[email protected] wrote:

Alright so I was playing with my large amounts of data and ran into yet
another problem with shoving it into a loop that requires a substantial
amount of memory.

dataArray = []
output = arrayOut.to_s.chop!.split(",")

set arrayOut to nil if you don’t need it any more.

output1 = output[0..356130]
output2 = output[356131..712260]
output3 = output[712261..1068390]
output4 = output[1068391..1424521]

You dont need output here, set it to nil to allow for garbage collection

count = 0
output1.each do |out|
  out = out.to_i
  push = hashRange[out]
  dataArray << push
  count += 1
  puts "#{push} - #{count}" # Testing purposes
end

  1. You can convert the output to numbers in one pass, though use
     Benchmark to see the actual gain:

output = arrayOut.to_s.chop!.split(",").map {|out| out.to_i }

  2. If you are looking for numbers only, you can do something like

output = []
arrayOut.to_s.chop!.scan(/\d+/) {|out| output << out.to_i }
(You can count the items and switch to output2 when output1 has
enough, thus 1. creating smaller arrays, 2. doing two things in one
step.)

  3. Even in this case, you still have both the original arrayOut and
     the long string (.to_s) in memory.
     It might be faster if you could iterate through the array without
     creating the intermediate string. The question is: 1. will it help?
     2. Is it worth it?
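A hedged sketch of that last idea: stream the numbers straight from the file, so the huge intermediate string never exists (the tiny sample file here stands in for the real data):

```ruby
require 'tempfile'

# A tiny sample file standing in for the real input data.
sample = Tempfile.new('data')
sample.puts '12345,67423'
sample.puts '97567,45345'
sample.close

# Scan each line as it is read; no giant to_s string is ever built.
output = []
File.foreach(sample.path) do |line|
  line.scan(/\d+/) { |num| output << num.to_i }
end
```

Here output ends up as [12345, 67423, 97567, 45345] without arrayOut or the joined string ever being held in memory at the same time.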

Robert K. wrote:

2008/4/8, Michael L. [email protected]:

output2 = output[356131..712260]
end
Any ideas that would speed this up are much appreciated!! Otherwise I'll
be back in 3 months IF I don't get an error :D

Obviously there is a lot of code missing from the piece above. Can
you explain what you are trying to achieve? What is your input file
format, and what kind of transformation do you want to do on it? I
looked through your other postings, but it did not become clear to me.

Cheers

robert

Alright, here's the breakdown of everything.

dataArray = []

arrayOut consists of all the integer data stored in a text file.

It's called upon via IO.foreach("data.txt") {|x| dataArray << x}

dataArray being just a predefined array, ie: dataArray = []

output = arrayOut.to_s.chop!.split(",")

#Each of these outputs breaks down this huge array into 4 smaller arrays
output1 = output[0..356130]
output2 = output[356131..712260]
output3 = output[712261..1068390]
output4 = output[1068391..1424521]

#hashRange[out] is basically calling a hash in the following context.

hash = { 1 => { 20000..30000 => 12345 } }

#so 'out' is calling the range key which contains its defined value
#basically it's saying hashRange[25000] #=> 12345 as an example

#everything imported to dataArray is a string, so it must be converted
#to an integer to correctly match the range key

#after benchmarking some elements of the loop below, it's found that
#push = hashRange[out] is what's slowing everything down.
#every time a nil 'out' is shoved into the query it takes about 8 sec;
#when it's a correct number, about 5 sec

#the hashRange file is about 78 MB, which I had to load in as
#8 separate data files, then shove those into an eval to convert them
#to a hash

count = 0
output1.each do |out|
  out = out.to_i
  push = hashRange[out]
  dataArray << push
  count += 1
  puts "#{push} - #{count}" # Testing purposes
end

#I guess what I need now is a faster way to access this pre-defined
#hash.
#SQL is one possibility, but that could be considered a whole other
#forum post :)

Any other questions feel free to ask,
Your guy’s insight is much appreciated.

Thanks again,

  • Mac
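Worth noting about the hashRange description above: a plain Ruby Hash keyed by Range objects will not match a bare integer (hashRange[25000] returns nil unless 25000 is itself a key), so matching a number presumably means scanning the pairs. A minimal sketch of that linear lookup, with invented contents:

```ruby
# Invented Range-keyed table in the spirit of hashRange.
hash_range = { 12_000..15_000 => 100, 60_000..70_000 => 250 }

# A bare integer is not a key, so every pair must be scanned: O(n) per lookup.
def lookup(ranges, value)
  pair = ranges.find { |range, _| range.include?(value) }
  pair && pair.last
end
```

For example, lookup(hash_range, 12_345) returns 100, and values outside every range return nil. With ~1.5 million lookups against a 78 MB table, that per-lookup scan is a plausible culprit for the 5-8 seconds per element.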

2008/4/8, Michael L. [email protected]:

Let’s see whether I understood correctly: you have a file with
multiple integer numbers per line. You have defined a range mapping,
i.e. each interval an int can be in has a label. You want to read in
all ints and output their labels.

If this is correct, this is what I’d do:

$ ruby -e '20.times {|i| puts i}' >| x
14:54:37 /c/Temp
$ ./rl.rb x
low
low
medium
medium
medium
high
high
high
high
high
no label
no label
no label
no label
no label
no label
no label
no label
no label
no label
14:54:41 /c/Temp
$ cat rl.rb
#!/usr/bin/env ruby

class RangeLabels
  def initialize(labels)
    @labels = labels.sort_by {|key, lab| key}
  end

  def lookup(val)
    # slow, this can be improved by binary search!
    @labels.each do |key, lab|
      return lab if val < key
    end
    "no label"
  end
end

rl = RangeLabels.new [
  [2, "low"],
  [5, "medium"],
  [10, "high"],
]

ARGF.each do |line|
  first = true
  line.scan /\d+/ do |val|
    if first
      first = false
    else
      print ", "
    end

    print rl.lookup(val.to_i)
  end

  print "\n"
end
14:54:52 /c/Temp
$

Kind regards

robert
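On a modern Ruby (2.0+), the binary search hinted at in the comment above could be sketched with Array#bsearch in find-minimum mode; the behavior is intended to match the linear scan exactly:

```ruby
class RangeLabelsBsearch
  def initialize(labels)
    @labels = labels.sort_by { |key, _| key }
  end

  # bsearch returns the first pair whose key exceeds val: O(log n)
  # instead of the O(n) linear scan in the original lookup.
  def lookup(val)
    pair = @labels.bsearch { |key, _| val < key }
    pair ? pair.last : 'no label'
  end
end

rl = RangeLabelsBsearch.new [[2, 'low'], [5, 'medium'], [10, 'high']]
```

With the same table as above, rl.lookup(3) gives "medium" and rl.lookup(15) falls through to "no label", just like the each-based version.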

On 09.04.2008 00:30, Michael L. wrote:

That would work, but even with marshal dumping the data set is just too
large for memory to handle quickly.

Which data set - the range definitions or the output? I thought this is
a one off process that transforms a large input file into a large output
file.

I think I’m going to move the
project over to PostgreSQL and see if that doesn’t speed things up a
considerable amount, Thanks Robert.

That’s of course an option. But I still feel kind of at a loss about
what exactly you are doing. Is this just a single processing step in a
much larger application?

Cheers

robert

That would work, but even with marshal dumping the data set is just too
large for memory to handle quickly. I think I’m going to move the
project over to PostgreSQL and see if that doesn’t speed things up a
considerable amount, Thanks Robert.

  • Mac
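Since marshal dumping the table came up, a small sketch of that route (the hash contents are invented): Marshal.dump the prebuilt table once, then later runs reload it without the slow eval of 8 source files.

```ruby
require 'tempfile'

# Invented stand-in for the prebuilt range table.
hash_range = { 12_000..15_000 => 100, 60_000..70_000 => 250 }

# Dump the table once, in binary mode...
dump = Tempfile.new('hash_range')
dump.binmode
Marshal.dump(hash_range, dump)
dump.close

# ...and reload it on later runs, skipping eval entirely.
reloaded = File.open(dump.path, 'rb') { |f| Marshal.load(f) }
```

Loading a Marshal dump is generally much faster than eval-ing Ruby source, though as noted, a 78 MB table may still be too large to hold comfortably in memory.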

Robert K. wrote:

On 09.04.2008 00:30, Michael L. wrote:

That would work, but even with marshal dumping the data set is just too
large for memory to handle quickly.

Which data set - the range definitions or the output? I thought this is
a one off process that transforms a large input file into a large output
file.

I think I’m going to move the
project over to PostgreSQL and see if that doesn’t speed things up a
considerable amount, Thanks Robert.

That’s of course an option. But I still feel kind of at a loss about
what exactly you are doing. Is this just a single processing step in a
much larger application?

Cheers

robert

The dump would be of the pre-defined hash, in order to retrieve the
information faster.

To answer your 2nd question yes this is just a single step in a very
large 12 step application. I’m hoping to condense it down to about 8
steps when I finish. This step alone involves transforming a large
dataset into a smaller dataset.

I’m trying to extract all the numbers between ranges and push the keys
of the hash results into a file. This file will then be opened by
another part of the step process to be analyzed.

IE:
if the transformation involved the file of:
12345
67423
97567
45345
etc.
I would want to pull all of those numbers and get the keys for those
hash ranges
IE:
12000..15000 => 100
60000..70000 => 250
etc.

So 12345 would fall in the range of 12000..15000, so the output file would
get 100 added to it. Then the next step would be analyzing the results
(IE: 100).
Hope this explains things a bit better.

Thanks,

  • Mac

On 09.04.2008 19:53, Michael L. wrote:

considerable amount, Thanks Robert.
That’s of course an option. But I still feel kind of at a loss about
what exactly you are doing. Is this just a single processing step in a
much larger application?

The dump would be to the pre-defined hash, to hence retrieve the
information faster.

I would not use the term “hash” here because this is an implementation
detail. Basically what you want to store is the mapping from input
numbers to output numbers via ranges, don’t you?

if the transformation involved the file of:
etc.
How many of those ranges do you have? Is there any mathematical
relation between each input range and its output value?

So 12345 would fall in the range of 12000..15000, so the output file would
get 100 added to it. Then the next step would be analyzing the results
(IE: 100).

So let me rephrase it to make sure I understood properly: you are
reading a large amount of numbers and map each number to another one (via
ranges). Mapped numbers are input to the next processing stage. It
seems you would want to output each mapped value only once; this
immediately suggests set semantics.
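That set semantics, in a minimal sketch (labels invented):

```ruby
require 'set'

# Duplicate mapped values collapse automatically; each label survives once.
mapped = [100, 250, 100, 100, 250]
output = Set.new(mapped)
```

output then contains just the labels 100 and 250, ready for the next processing stage.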

Hope this explains things a bit better.

Yes, we’re getting there. :-) Actually I find this a nice exercise in
requirements extrapolation. In this case I try to extract the
requirements from you (aka the customer). :-)

Kind regards

robert

How about

#!/usr/bin/env ruby

require 'set'

class RangeLabels
  def initialize(labels, fallback = nil)
    @labels = labels.sort_by {|key, lab| key}
    @fallback = fallback
  end

  def lookup(val)
    # slow if there are many ranges
    # this can be improved by binary search!
    @labels.each do |key, lab|
      return lab if val < key
    end
    @fallback
  end

  alias [] lookup
end

rl = RangeLabels.new [
  [12000, 50],
  [15000, 100],
  [60000, nil],
  [70000, 250],
]

output = Set.new

ARGF.each do |line|
  line.scan /\d+/ do |val|
    x = rl[val.to_i] and output << x
  end
end

puts output.to_a
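As a sanity check, a condensed sketch of the same linear lookup run over the sample numbers from Michael's post:

```ruby
require 'set'

# The label table from above, in condensed [boundary, label] form.
labels = [[12_000, 50], [15_000, 100], [60_000, nil], [70_000, 250]]

# Linear find: the first boundary the value is below supplies its label.
lookup = lambda do |val|
  pair = labels.find { |key, _| val < key }
  pair && pair.last
end

# Michael's sample numbers; nil labels (unmapped gaps) are dropped.
output = Set.new
[12_345, 67_423, 97_567, 45_345].each do |n|
  mapped = lookup.call(n)
  output << mapped if mapped
end
```

output ends up holding 100 and 250: 12345 maps to 100 and 67423 to 250, while 97567 is past every boundary and 45345 falls in the nil gap, so both are dropped.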


cfp:~ > cat a.rb

# use narray for fast ruby numbers
require 'rubygems'
require 'narray'

# ton-o-data
huge = NArray.int(2 ** 25).indgen * 100 # 0, 100, 200, 300, etc

# bin data:
#   0...100   -> 0
#   100...200 -> 1
#   200...300 -> 2
#   etc...
a = Time.now.to_f

p huge

huge.div! 100 # 42 -> 0, 127 -> 1, 2227 -> 22

b = Time.now.to_f
elapsed = b - a
p elapsed

p huge

cfp:~ > ruby a.rb
NArray.int(33554432):
[ 0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200,
1300, … ]
0.202844142913818
NArray.int(33554432):
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, … ]

so that’s doing about 33 million elements in around 2/10ths of a
second…

a @ http://codeforpeople.com/

Better for whom?

Better for whom?

for my wife - obviously! ;-)

a @ http://codeforpeople.com/
