Forum: Ruby free RAM problem

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- and Ruby-related community platforms.
Edouard Dantes (edouard)
on 2009-03-17 13:13
Hi,

I am doing some text processing.

I clear every variable after each loop, and I also call
GC.start
ObjectSpace.garbage_collect
after each loop.
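A sketch of that pattern (the data and the processing step here are invented stand-ins for the real text processing):

```ruby
results = []
chunks  = ["alpha", "beta", "gamma"]   # stand-in input data

chunks.each do |chunk|
  line = chunk.upcase                  # stand-in for the real per-loop work
  results << line.reverse

  line = nil                           # drop the reference...
  GC.start                             # ...then force a collection,
  ObjectSpace.garbage_collect          # as described above (both trigger the GC)
end
```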

So I have three variables whose cumulative size always remains below
100,000 UTF-8 characters.

Nevertheless, my free RAM keeps falling the longer the program runs, as
if the data loaded on each iteration were added to RAM and never released.

How can I force Ruby to completely release the memory held by unused objects?

thanks
Eleanor McHugh (Guest)
on 2009-03-17 13:31
(Received via mailing list)
On 17 Mar 2009, at 12:10, Edouard Dantes wrote:
> size always remains below 100,000 UTF-8 characters.
>
> Nevertheless, my free RAM keeps falling the longer the program runs,
> as if the data loaded on each iteration were never released.
>
> How can I force Ruby to completely release the memory held by unused
> objects?

Without seeing code it's hard to comment, but two things spring
instantly to mind:

1. Are you sure your input data is no longer referenced elsewhere in
the program? Live objects are not garbage collected.
2. Are you generating symbols as part of processing the input data?
Symbols are not garbage collected (at least not in Ruby 1.8).

Also, in general, if you have enough memory, don't force garbage
collection: the Ruby runtime does this perfectly well on its own, and
the overall performance of your program will improve. 300k of UTF-8 is
such a trivial amount of memory consumption...
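The symbol issue from point 2 is easy to observe via Symbol.all_symbols. A minimal sketch (the generated names are invented; note that Ruby 2.2 and later can collect dynamically created symbols, whereas 1.8 never reclaimed them):

```ruby
before = Symbol.all_symbols.size

# Turning input-derived strings into symbols grows the global symbol
# table; holding references here makes the growth observable even on
# modern Rubies that can collect dynamic symbols.
generated = 100.times.map { |i| "leaked_symbol_example_#{i}".to_sym }

after = Symbol.all_symbols.size
puts "symbol table grew by #{after - before}"
```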


Ellie

Eleanor McHugh
Games With Brains
http://slides.games-with-brains.net
----
raise ArgumentError unless @reality.responds_to? :reason
Edouard Dantes (edouard)
on 2009-03-17 16:26
> Without seeing code it's hard to comment,

Thanks for the answer,

Well, the loops are too basic to create problems: they just fill an
array with values and write it to a file.

The RAM jumps are quite erratic (5 or 10% of 2 GB of RAM).

The only processor-hungry part (and possibly RAM-hungry too) is the
following parsing line, imported from another file:

Nokogiri::HTML(page).css('div a').map { |dts|
  dts.text.gsub(/\s+/, "").gsub(/[^A-Za-z0-9\s]/, 'Z') }

This line runs thousands of times; might the regexes involved keep
feeding RAM with data without cleaning up?

thanks
Edouard Dantes (edouard)
on 2009-03-17 16:28
Edouard Dantes wrote:

>
> RAM jumps are too erratic (5 or 10% for 2GB RAM)
>

Note that this RAM load 'noise' leads to an increasing average RAM load.
Eleanor McHugh (Guest)
on 2009-03-17 17:17
(Received via mailing list)
On 17 Mar 2009, at 15:23, Edouard Dantes wrote:
> parsing line, imported from another file.
>
> Nokogiri::HTML(page).css('div a').map{|dts|
> dts.text.gsub(/\s+/,"").gsub(/[^A-Za-z0-9\s]/,'Z') }
>
> This line runs thousands of times; might the regexes involved keep
> feeding RAM with data without cleaning up?

I'm not sure it'll make much difference, but moving the regex creation
into the enclosing scope seems neater:

  spaces = /\s+/
  chars = /[^A-Za-z0-9\s]/

  Nokogiri::HTML(page).css('div a').map { |dts|
    dts.text.gsub(spaces, "").gsub(chars, 'Z')
  }

Likewise, changing the logic to use gsub! will reduce the number of
transient string objects created, which will reduce memory usage.
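A minimal sketch of that gsub!-based variant (the sample string here is invented):

```ruby
text = "foo  bar! baz?"

cleaned = text.gsub(/\s+/, "")        # the first gsub must copy: it returns a new string
cleaned.gsub!(/[^A-Za-z0-9\s]/, 'Z')  # gsub! then mutates that copy in place

# Caveat: gsub! returns nil when nothing matched, so use the variable
# (cleaned), not the return value of gsub!, as the block's result.
puts cleaned
```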

I'd also be interested to see the code you're using for writing to
file, as the nature of the RAM usage jumps suggests you could be
experiencing a buffering artefact.


Ellie

Eleanor McHugh
Games With Brains
http://slides.games-with-brains.net
----
raise ArgumentError unless @reality.responds_to? :reason
Jano Svitok (Guest)
on 2009-03-17 17:47
(Received via mailing list)
On Tue, Mar 17, 2009 at 16:23, Edouard Dantes <edrd.dantes@gmail.com>
wrote:
> The only processor-hungry part (and possibly RAM-hungry too) is the
> following parsing line, imported from another file:
>
> Nokogiri::HTML(page).css('div a').map{|dts|
> dts.text.gsub(/\s+/,"").gsub(/[^A-Za-z0-9\s]/,'Z') }

Use the /o (once) flag on the regexes (gsub(/\s+/o, "")), or make them
constants outside the loop; you'll save two object creations on each
iteration.

Try replacing the block body with:

tmp = dts.text.gsub(/\s+/, "")   # creates a copy of the string, with whitespace removed
tmp.gsub!(/[^A-Za-z0-9\s]/, 'Z') # avoids creating another copy of the string
tmp                              # return the temporary string

Run your code under RubyMemoryValidator, or at least under ruby-prof
(or any similar tool), to see what kinds of objects get created and
where. If you post the code and some sample data, I might find some
time to run it under MemoryValidator for you.

Jano
Edouard Dantes (edouard)
on 2009-03-18 16:42
Hi.

thanks for your help,

Actually, after running one unit at a time, it appears the problem
comes from:

Nokogiri::HTML(page)

If I suppress everything else but keep it looping with a "while true",
my RAM gets eaten up.

I am giving Hpricot a try and will let you know if it solves the
problem.

NOTE: I run it on openSUSE 11.1 x86-64 and Ruby 1.8.6, with the
current nokogiri installed via gem --remote.
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2009-03-18 17:29
(Received via mailing list)
Attachment: mem-stat.rb (839 Bytes)
2009/3/17 Eleanor McHugh <eleanor@games-with-brains.com>:
> On 17 Mar 2009, at 15:23, Edouard Dantes wrote:

>>
>> This line runs thousands of times; might the regexes involved keep
>> feeding RAM with data without cleaning up?
>
> I'm not sure it'll make much difference but moving the regex creation into
> the enclosing scope seems neater:
>
>        spaces = /\s+/
>        chars = /[^A-Za-z0-9\s]/

This is generally less efficient, and AFAIK there are no memory leaks
attached to it.  It's a different story with varying regular
expressions (i.e. ones built from input or configuration).  But in this
case I'd leave them in place.

>        Nokogiri::HTML(page).css('div a').map { |dts|
>                dts.text.gsub(spaces, "").gsub(chars, 'Z')
>                }
>
> Likewise changing the logic to use gsub! will reduce the number of transient
> string objects being created and that will reduce memory usage.

True, though maybe the first gsub needs to stay in order to avoid
aliasing effects.

> I'd also be interested to see the code you're using for writing to file as
> the nature of the RAM usage jumps suggests you could be experiencing a
> buffering artefact.

There was also an issue in 1.8.* related to Array#shift or #unshift.

And I agree with your general advice not to force GC explicitly.
Rather, look for memory leaks.  Edouard, you might get a first idea of
where your memory is going by dumping instance counts at intervals
(sample attached).
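The attachment itself is not reproduced in this archive, but a minimal sketch of such an instance-count dump (the method name is invented) could look like:

```ruby
# Count live objects per class via ObjectSpace and list the busiest
# classes; run this at intervals to see which counts keep growing.
def dump_instance_counts(limit = 10)
  counts = Hash.new(0)
  ObjectSpace.each_object(Object) { |obj| counts[obj.class] += 1 }
  counts.sort_by { |_, n| -n }.first(limit)
end

dump_instance_counts.each { |klass, n| puts format("%-30s %d", klass, n) }
```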

Kind regards

robert
Eleanor McHugh (Guest)
on 2009-03-18 19:16
(Received via mailing list)
On 18 Mar 2009, at 16:25, Robert Klemme wrote:
> 2009/3/17 Eleanor McHugh <eleanor@games-with-brains.com>:
>>        spaces = /\s+/
>>        chars = /[^A-Za-z0-9\s]/
>
> This is generally less efficient and AFAIK there are no memory leaks
> attached to this.  It's a different story with varying regular
> expressions (i.e. based on input or configuration).  But in this case
> I'd leave them in place.

Interesting. I'll have to do some tests to see what the difference is
in practice, as this is a code pattern I use a lot.


Ellie

Eleanor McHugh
Games With Brains
http://slides.games-with-brains.net
----
raise ArgumentError unless @reality.responds_to? :reason
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2009-03-18 20:20
(Received via mailing list)
On 18.03.2009 19:12, Eleanor McHugh wrote:
> in practice as this is a code pattern I use a lot.
It seems the difference has decreased with 1.9, but you can at least
say that it's not slower to have a regular expression inline:

allruby rx.rb
ruby 1.8.5 (2006-08-25) [i386-linux]
Rehearsal -------------------------------------------------------
inline               15.300000   4.000000  19.300000 ( 19.354039)
separate             15.440000   3.940000  19.380000 ( 19.426682)
out of the loop      15.370000   4.000000  19.370000 ( 19.423543)
--------------------------------------------- total: 58.050000sec

                           user     system      total        real
inline               15.330000   3.990000  19.320000 ( 19.341197)
separate             15.400000   3.980000  19.380000 ( 19.426264)
out of the loop      15.410000   3.950000  19.360000 ( 19.426081)
ruby 1.9.1p0 (2009-01-30 revision 21907) [i686-linux]
Rehearsal -------------------------------------------------------
inline                4.470000   0.000000   4.470000 (  4.502066)
separate              4.450000   0.000000   4.450000 (  4.466547)
out of the loop       4.460000   0.000000   4.460000 (  4.473198)
--------------------------------------------- total: 13.380000sec

                           user     system      total        real
inline                4.470000   0.000000   4.470000 (  4.490634)
separate              4.470000   0.000000   4.470000 (  4.480428)
out of the loop       4.450000   0.000000   4.450000 (  4.506627)
[robert@ora01 ~]$ cat rx.rb

require 'benchmark'

REP = 10_000
text = ("foo bar baz " * 1_000).freeze

Benchmark.bmbm 20 do |bm|
   bm.report "inline" do
     REP.times do
       text.scan(/bar/) do |match|
         match.length
       end
     end
   end

   bm.report "separate" do
     REP.times do
       rx = /bar/
       text.scan(rx) do |match|
         match.length
       end
     end
   end

   bm.report "out of the loop" do
     rx = /bar/
     REP.times do
       text.scan(rx) do |match|
         match.length
       end
     end
   end
end
[robert@ora01 ~]$

Cheers

  robert
Edouard Dantes (edouard)
on 2009-03-18 22:07
Hi,

Well, I replaced Nokogiri::HTML with Hpricot, and now my RAM usage
remains below 2% during the whole process.
Tim Hunter (Guest)
on 2009-03-18 22:38
(Received via mailing list)
Edouard Dantes wrote:
> Hi,
>
> Well, I replaced Nokogiri::HTML with Hpricot, and now my RAM usage
> remains below 2% during the whole process.
>

I hope you'll report this issue to the Nokogiri developer(s) so that
they can investigate the problem. I'm sure they will want to hear from
you. If it's something they can fix, it will not only help you, it will
benefit the entire community.

Report it and then post here what happened.
Aaron Patterson (Guest)
on 2009-03-18 23:12
(Received via mailing list)
On Thu, Mar 19, 2009 at 06:04:51AM +0900, Edouard Dantes wrote:
> Hi,
>
> Well, I replaced Nokogiri::HTML with Hpricot, and now my RAM usage
> remains below 2% during the whole process.

I ran this:

  require 'nokogiri'

  html = File.read(ARGV[0])
  while true
    Nokogiri::HTML(html).css('div a').map{ |dts|
      dts.text.gsub(/\s+/,"").gsub(/[^A-Za-z0-9\s]/,'Z')
    }
  end

vs this:

  require 'hpricot'

  html = File.read(ARGV[0])
  while true
    Hpricot(html).search('div a').map{ |dts|
      next unless dts.respond_to?(:text)
      dts.text.gsub(/\s+/,"").gsub(/[^A-Za-z0-9\s]/,'Z')
    }
  end

As I expected, both were stable and neither leaked memory.  What
version of libxml2 are you running?  Also, what version of Ruby?  It
could also be the HTML I'm parsing...  :-\
Edouard Dantes (edouard)
on 2009-03-19 06:24
Aaron Patterson wrote:

> libxml2 are you running?  Also, what version of ruby?  It could also be
> the html I'm parsing...  :-\

Hi,

I run libxml2 2.7.1 and Ruby 1.8.7 patchlevel 72 (x86-64).

I've emailed my code to your Gmail address.

regards,
This topic is locked and cannot be replied to.