Nonlinear scaling - could symbols help solve this?

Hello List,

I have really just started dabbling in Ruby and have found my first
little task that I can use to try out some of what I’ve read.

Basically I’m reading in a CSV file and converting it into a proprietry
XML format. To facilitate in the XML generation, I store the items in a
Hash of Hashes of Arrays. Where the first level of hashing is based on
the type of data item, the second hash is based on the item’s parent,
and the Array carries all the items for a specific parent.
eg. hash=Hash[“class1” => {“parent1” =>
[class1(“a”),class1(“b”),class1(“c”)], “parent2” =>
[class1(“e”),class1(“f”),class1(“g”)], …}, “class2” => […], …]

The code below does this in a somewhat clumsy way using Ruby. I find
that when running this on 10 000 lines -> execution time is 4.5 seconds,
however if I increase to 300 000 lines (30 times the data) -> execution
time is 14 min 21 sec (168 times the execution time). Is this nonlinear
scaling due to allocating memory to hold the data? Could I improve this
using symbols? Any other general remarks would be appreciated.

csv_file.each_line do |line| #FORMAT:
PARENT1,PARENT2,PARENT3,ITEM,ITEMR,ITEMTYPE,SUBITEM,ITEM_LAYER,PNAME,PVALUE,LEVEL
sl=line.split(",")
parent1=sl[0]; parent2=sl[1]; parent3=sl[2]; item=sl[3];
itemr=sl[4]; itemtype=sl[5]; subitem=sl[6]; item_layer=sl[7];
pname=sl[8]; pvalue=sl[9]; level=sl[10]
if parent1 == “NONE” && parent2 != “NONE”
parent_type=“PARENT2”
node=parent2
elsif parent1 != “NONE” && parent2 == “NONE”
parent_type=“PARENT1”
node=parent1
end
i = case level
when “ITEM” then Item.new(node,item)
when “SUBITEM” then Subitem.new(node,item,subitem)
when “ITEM_LAYER” then
Item_layer.new(node,item,subitem,item_layer)
when “RELATION” then Relation.new(node,item,itemr)
when “PARENT3” then Parent3.new(node,parent3)
when “PARENT2” then Parent2.new(parent2)
when “PARENT1” then Parent1.new(parent1)
end
if ! i.nil?
hash[(i.class)]=Hash.new if hash[(i.class)].nil?
hash[(i.class)][(i.parent)]=Array.new if
hash[(i.class)][(i.parent)].nil?
ind = hash[(i.class)][(i.parent)].index(i)
if ind.nil?
i.addSetting( Setting.new( pname, pvalue ) )
hash[(i.class)][(i.parent)].push(i)
else
(hash[(i.class)][(i.parent)])[ind].addSetting(
Setting.new( pname, pvalue ) )
end
end
end

Thanks in advance,
Jeremy

A CSV file is (usually) very much like a database table. And an XML
document is (usually) a tree. A hash of hashes of arrays is an
inefficient way to store a tree.

You need an algorithm and a data structure tuned to the algorithm and
the data. I would recommend starting from the required output and
working backwards to the input data. Have a look at the Ruby libraries
devoted to dealing with XML documents, and the Ruby libraries devoted to
dealing with CSV files. Let the XML libraries build the data structure,
rather than creating your own.

Jeremy H. wrote:

csv_file.each_line do |line| #FORMAT: PARENT1,PARENT2,PARENT3,ITEM,ITEMR,ITEMTYPE,SUBITEM,ITEM_LAYER,PNAME,PVALUE,LEVEL
when “ITEM” then Item.new(node,item)
ind = hash[(i.class)][(i.parent)].index(i)
Jeremy


M. Edward (Ed) Borasky

2006/5/16, Jeremy H. [email protected]:

The code below does this in a somewhat clumsy way using Ruby. I find that when running this on 10 000 lines → execution time is 4.5 seconds, however if I increase to 300 000 lines (30 times the data) → execution time is 14 min 21 sec (168 times the execution time). Is this nonlinear scaling due to allocating memory to hold the data? Could I improve this using symbols? Any other general remarks would be appreciated.

You can verify that yourself. Use your operating system’s tools to
monitor memory usage and page faults of the Ruby process and you’ll
soon know whether it’s slow due to paging or not.

Symbols can help with repeated content but you could as well feed
strings you want to store through a hash like this in order to prevent
duplicates:

strings = Hash.new {|h,k| k=k.freeze; h[k]=k}

str = strings[str]

Kind regards

robert