Forum: Ruby Nonlinear scaling - could symbols help solve this?

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
2a82a5c3ca7d809f687a562c4aea0e18?d=identicon&s=25 Jeremy Hanford (Guest)
on 2006-05-16 13:37
(Received via mailing list)
Hello List,

I have really just started dabbling in Ruby and have found my first
little task that I can use to try out some of what I've read.

Basically I'm reading in a CSV file and converting it into a proprietry
XML format. To facilitate in the XML generation, I store the items in a
Hash of Hashes of Arrays. Where the first level of hashing is based on
the type of data item, the second hash is based on the item's parent,
and the Array carries all the items for a specific parent.
eg. hash=Hash["class1" => {"parent1" =>
[class1("a"),class1("b"),class1("c")], "parent2" =>
[class1("e"),class1("f"),class1("g")], ...}, "class2" => [...], ...]

The code below does this in a somewhat clumsy way using Ruby. I find
that when running this on 10 000 lines -> execution time is 4.5 seconds,
however if I increase to 300 000 lines (30 times the data) -> execution
time is 14 min 21 sec (168 times the execution time). Is this nonlinear
scaling due to allocating memory to hold the data? Could I improve this
using symbols? Any other general remarks would be appreciated.

csv_file.each_line do |line| #FORMAT:
PARENT1,PARENT2,PARENT3,ITEM,ITEMR,ITEMTYPE,SUBITEM,ITEM_LAYER,PNAME,PVALUE,LEVEL
        sl=line.split(",")
        parent1=sl[0]; parent2=sl[1]; parent3=sl[2]; item=sl[3];
itemr=sl[4]; itemtype=sl[5]; subitem=sl[6]; item_layer=sl[7];
pname=sl[8]; pvalue=sl[9]; level=sl[10]
        if parent1 == "NONE" && parent2 != "NONE"
                parent_type="PARENT2"
                node=parent2
        elsif parent1 != "NONE" && parent2 == "NONE"
                parent_type="PARENT1"
                node=parent1
        end
        i = case level
                when "ITEM"     	  then Item.new(node,item)
                when "SUBITEM"  	  then Subitem.new(node,item,subitem)
                when "ITEM_LAYER"     then
Item_layer.new(node,item,subitem,item_layer)
                when "RELATION"    	  then Relation.new(node,item,itemr)
                when "PARENT3"    	  then Parent3.new(node,parent3)
                when "PARENT2"     	  then Parent2.new(parent2)
                when "PARENT1"     	  then Parent1.new(parent1)
        end
        if ! i.nil?
                hash[(i.class)]=Hash.new if hash[(i.class)].nil?
                hash[(i.class)][(i.parent)]=Array.new if
hash[(i.class)][(i.parent)].nil?
                ind = hash[(i.class)][(i.parent)].index(i)
                if ind.nil?
                        i.addSetting( Setting.new( pname, pvalue ) )
                        hash[(i.class)][(i.parent)].push(i)
                else
                        (hash[(i.class)][(i.parent)])[ind].addSetting(
Setting.new( pname, pvalue ) )
                end
        end
end

Thanks in advance,
Jeremy
3bb23e7770680ea44a2d79e6d10daaed?d=identicon&s=25 M. Edward (Ed) Borasky (Guest)
on 2006-05-16 15:05
(Received via mailing list)
A CSV file is (usually) very much like a database table. And an XML
document is (usually) a tree. A hash of hashes of arrays is an
inefficient way to store a tree.

You need an algorithm and a data structure tuned to the algorithm and
the data. I would recommend starting from the required output and
working backwards to the input data. Have a look at the Ruby libraries
devoted to dealing with XML documents, and the Ruby libraries devoted to
dealing with CSV files. Let the XML libraries build the data structure,
rather than creating your own.

Jeremy Hanford wrote:
> csv_file.each_line do |line| #FORMAT: 
PARENT1,PARENT2,PARENT3,ITEM,ITEMR,ITEMTYPE,SUBITEM,ITEM_LAYER,PNAME,PVALUE,LEVEL
>                 when "ITEM"     	  then Item.new(node,item)
>                 ind = hash[(i.class)][(i.parent)].index(i)
> Jeremy
>
>
>

--
M. Edward (Ed) Borasky

http://linuxcapacityplanning.com
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2006-05-16 16:52
(Received via mailing list)
2006/5/16, Jeremy Hanford <hanford_j@aircom.co.za>:

> The code below does this in a somewhat clumsy way using Ruby. I find that when running 
this on 10 000 lines -> execution time is 4.5 seconds, however if I increase to 300 000 
lines (30 times the data) -> execution time is 14 min 21 sec (168 times the execution 
time). Is this nonlinear scaling due to allocating memory to hold the data? Could I 
improve this using symbols? Any other general remarks would be appreciated.

You can verify that yourself.  Use your operating system's tools to
monitor memory usage and page faults of the Ruby process and you'll
soon know whether it's slow due to paging or not.

Symbols can help with repeated content but you could as well feed
strings you want to store through a hash like this in order to prevent
duplicates:

strings = Hash.new {|h,k| k=k.freeze; h[k]=k}

str = strings[str]

Kind regards

robert
This topic is locked and can not be replied to.