What's the best way to approach reading and parsing large XLSX files?

Hello, I'm developing an app that receives an XLSX file of 10MB or less with 30,000+ rows or so, and another XLSX file with about 200 rows. I have to read one row of the smaller file, look it up in the larger file, and write data from both files to a new one.

I just did a test reading a few rows from the largest file using Roo (Spreadsheet doesn't support XLSX, and Creek looks good but I can't find a way to read row by row), and it basically made my computer crash: the server crashed, I tried rebooting it and it said it was already started. Anyway, it was a disaster.

So, my question is: is there a gem that works best with large XLSX files, or is there another way to approach this without crashing my computer?

This is what I had (it's very possible I'm doing it wrong; help is welcome). What I was trying to do here was to process the files and create the new XLS file after both of the XLSX files were uploaded:

require 'roo'
require 'spreadsheet'
require 'creek'

class UploadFiles < ActiveRecord::Base
  after_commit :process_files

  attr_accessible :inventory, :material_list
  has_one :inventory
  has_one :material_list

  has_attached_file :inventory, :url => "/:current_user/inventory",
    :path => ":rails_root/tmp/users/uploaded_files/inventory/inventory.:extension"
  has_attached_file :material_list, :url => "/:current_user/material_list",
    :path => ":rails_root/tmp/users/uploaded_files/material_list/material_list.:extension"

  validates_attachment_presence :material_list
  accepts_nested_attributes_for :material_list, :allow_destroy => true
  accepts_nested_attributes_for :inventory, :allow_destroy => true
  validates_attachment_content_type :inventory,
    :content_type => ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
    :message => "Only .XLSX files are accepted as Inventory"
  validates_attachment_content_type :material_list,
    :content_type => ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
    :message => "Only .XLSX files are accepted as Material List"

  def process_files
    inventory = Creek::Book.new(Rails.root.to_s +
      "/tmp/users/uploaded_files/inventory/inventory.xlsx")
    material_list = Creek::Book.new(Rails.root.to_s +
      "/tmp/users/uploaded_files/material_list/material_list.xlsx")
    inventory = inventory.sheets[0]

    scl = Spreadsheet::Workbook.new
    sheet1 = scl.create_worksheet

    # Write each inventory row to its own output row.
    inventory.rows.each_with_index do |row, i|
      sheet1.row(i).concat(row.values)
    end

    sheet1.name = "Site Configuration List"
    scl.write(Rails.root.to_s + "/tmp/users/generated/siteconfigurationlist.xls")
  end
end
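
For the lookup step itself, a rough, untested sketch of one way to keep memory down: load only the ~200-row material list into a hash, then stream the large inventory once with Creek and write matches out with Spreadsheet. The path variables and the assumption that column A holds the matching key are illustrative, not from the code above.

    material_list = Creek::Book.new(material_list_path).sheets[0]
    index = {}
    material_list.rows.each do |row|
      values = row.values            # Creek yields each row as a Hash of cell ref => value
      index[values[0]] = values      # ~200 entries, so holding this in memory is cheap
    end

    out = Spreadsheet::Workbook.new
    sheet = out.create_worksheet(:name => 'Site Configuration List')
    written = 0
    Creek::Book.new(inventory_path).sheets[0].rows.each do |row|
      values = row.values
      match = index[values[0]]
      next unless match              # skip inventory rows that are not on the material list
      sheet.row(written).concat(match + values)
      written += 1
    end
    out.write(output_path)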

On Oct 10, 2013, at 4:36 PM, Monserrat F. wrote:

Hello, I'm developing an app that receives an XLSX file of 10MB or less with 30,000+ rows or so, and another XLSX file with about 200 rows. I have to read one row of the smaller file, look it up in the larger file, and write data from both files to a new one.

Wow. Do you have to do all this in a single request?

You may want to look at Nokogiri and its SAX parser. SAX parsers don’t
care about the size of the document they operate on, because they work
one node at a time, and don’t load the whole thing into memory at once.
There are some limitations on what kind of work a SAX parser can
perform, because it isn’t able to see the entire document and “know”
where it is within the document at any point. But for certain kinds of
problems, it can be the only way to go. Sounds like you may need
something like this.
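
For example, here is a minimal sketch of a SAX handler for the worksheet XML inside an .xlsx. An .xlsx is a zip archive, so this assumes you have already pulled out xl/worksheets/sheet1.xml (for instance with rubyzip), and it skips the shared-strings lookup that real files use for text cells:

    require 'nokogiri'

    class RowHandler < Nokogiri::XML::SAX::Document
      def start_element(name, attrs = [])
        @row = [] if name == 'row'        # a <row> element starts a new spreadsheet row
        @in_value = (name == 'v')         # <v> wraps a cell's stored value
      end

      def characters(string)
        @row << string if @in_value && @row
      end

      def end_element(name)
        @in_value = false if name == 'v'
        if name == 'row'
          # Handle one row at a time here, instead of keeping the whole sheet in memory.
          puts @row.inspect
        end
      end
    end

    Nokogiri::XML::SAX::Parser.new(RowHandler.new).parse(File.open('sheet1.xml'))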

Walter

A coworker suggested I should use just basic OOP for this: create a class that reads the files, and then another to load the files into memory. Could you please point me in the right direction for this (where can I read about it)? I have no idea what he's talking about, as I've never done this before.

I'll look up Nokogiri and SAX.

On Oct 10, 2013, at 4:50 PM, Monserrat F. wrote:

A coworker suggested I should use just basic OOP for this: create a class that reads the files, and then another to load the files into memory. Could you please point me in the right direction for this (where can I read about it)? I have no idea what he's talking about, as I've never done this before.

How many of these files are you planning to parse at any one time? Do
you have the memory on your server to deal with this load? I can see
this approach working, but getting slow and process-bound very quickly.
Lots of edge cases to deal with when parsing big uploaded files.

Walter

One 30,000+ row file and another with just over 200. How much memory would I need for this not to take forever to parse? (I'm currently using my computer as the server, and I can see Ruby taking about 1GB in the Task Manager when processing this, and it takes forever.)

The 30,000+ row file is about 7MB, which is not that much (I think).

On Fri, Oct 11, 2013 at 10:30 AM, Monserrat F.
[email protected] wrote:

One 30,000+ row file and another with just over 200. How much memory would I need for this not to take forever to parse? (I'm currently using my computer as the server, and I can see Ruby taking about 1GB in the Task Manager when processing this, and it takes forever.)

The 30,000+ row file is about 7MB, which is not that much (I think).

Check for a memory leak.

On Oct 11, 2013, at 11:30 AM, Monserrat F. wrote:

One 30,000+ row file and another with just over 200. How much memory would I need for this not to take forever to parse? (I'm currently using my computer as the server, and I can see Ruby taking about 1GB in the Task Manager when processing this, and it takes forever.)

The 30,000+ row file is about 7MB, which is not that much (I think).

I have a collection of 1200 XML files, ranging in size from 3MB to 12MB
each (they’re books, in TEI encoding) that I parse with Nokogiri on a
2GB Joyent SmartMachine to convert them to XHTML and then on to Epub.
This process takes 17 minutes for the first pass, and 24 minutes for the
second pass. It does not crash, but the server is unable to do much of
anything else while the loop is running.

My question here is: is this a self-serve web service, or an admin-level (one privileged user, once in a while) sort of thing? In my case, there's one admin who adds maybe two or three books per month to the collection, and the 40-minute do-everything loop was
used only for development purposes – it was my test cycle as I checked
all of the titles against a validator to ensure that my adjustments to
the transcoding process didn’t result in invalid code. I would not
advise putting something like this live against the world, as the
potential for DOS is extremely great. Anything that can pull the kinds
of loads you get when you load a huge file into memory and start
fiddling with it should not be public!

Walter

Hi, the files automatically download in .XLSX format; I can't change that, and I can't force the users to convert them just to make my job easier. Thanks for the suggestion.

On 10/11/2013 11:30 AM, Monserrat F. wrote:

I use a rather indirect route that works fine for me with 15,000 lines and about 26 MB. I export the file from LibreOffice Calc as CSV (comma-separated values). Then, in the Rails controller, I use something like:

require 'csv'

class TheControllerController # ;')

  # other controller code

  def upload
    # params[:entries] is the uploaded file; parse it with Ruby's CSV class
    data = CSV.parse(params[:entries].tempfile.read)
    data.each do |line|
      logger.debug "line: #{line.inspect}"
      # Each line is an array of strings containing the columns of one row of
      # the CSV file. I use these data to populate the appropriate db table /
      # Rails model at this point.
    end
  end

end

Make sure that your routes.rb points to this:

  match 'the_controller/upload' => 'the_controller#upload'

From your client machine's command line:

curl -F [email protected] localhost:3000/the_controller/upload

Note that 'entries' in the curl command matches the 'entries' in params[:entries] in the controller.

If you want to do this from a rails gui form, look at

During testing on my 4-core, 8 GB laptop, processing the really big files takes several minutes. When I have the app on Heroku, this causes a timeout, so I break up the CSV file into multiple sections such that each section takes less than 30 seconds to upload. By leaving a little 'slack' in the size, I have this automated so it occurs in the background while I am doing other work.
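
A minimal sketch of that splitting step (file paths and the 2,000-row chunk size are placeholders; tune the size so one chunk uploads and processes comfortably inside Heroku's 30-second limit):

    require 'csv'

    # Split a large CSV into numbered parts, repeating the header row in each part.
    def split_csv(path, rows_per_chunk = 2000)
      rows = CSV.read(path)
      header = rows.shift
      rows.each_slice(rows_per_chunk).with_index do |chunk, i|
        CSV.open("#{path}.part#{i}.csv", 'w') do |out|
          out << header
          chunk.each { |row| out << row }
        end
      end
    end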

Hope these suggestions help.

Don Z.

I forgot to say that after it reads all the rows and writes the file, it throws:

  (600.1ms)  begin transaction
  (52.0ms)  commit transaction
failed to allocate memory
Redirected to http://localhost:3000/upload_files/110
Completed 406 Not Acceptable in 1207471ms (ActiveRecord: 693.1ms)

On Oct 11, 2013, at 4:33 PM, Monserrat F. wrote:

This is an everyday thing; initially maybe a couple of people at the same time uploading and parsing files to generate the new one, but eventually it will extend to other people, so…

I used a logger and it does retrieve and save the files using the comparison. But it takes forever, like 30 minutes or so, to generate the file. The process starts as soon as the files are uploaded, but it seems to spend most of the time opening the file; once it's opened, it takes maybe 5 minutes at most to generate the new file.

Do you know where I can find an example of how to read an XLSX file with Nokogiri? I can't seem to find one.

XLSX is just an Excel file expressed in XML. It's no different than parsing any other XML file. First, find a good basic example of file parsing with Nokogiri (the "Searching an XML/HTML Document" tutorial on the Nokogiri site is a good start). Next, open up your file in a text editor and look for the elements you want to access. You can use either XPath or CSS syntax to locate your elements, and Nokogiri allows you to access either the attributes or the content of any element you can locate. If you run into trouble with all the prefixes that Microsoft likes to litter their formats with, you can call remove_namespaces! on the parsed document to clean that right up.
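
As a concrete (hypothetical) starting point: an .xlsx file is really a zip archive, and the cell data for the first sheet lives in xl/worksheets/sheet1.xml, so something along these lines, using rubyzip plus Nokogiri, would get you to the rows. Note that cells with t="s" hold indexes into xl/sharedStrings.xml rather than literal text.

    require 'zip'       # rubyzip, to open the .xlsx container
    require 'nokogiri'

    Zip::File.open('inventory.xlsx') do |zip|
      xml = zip.read('xl/worksheets/sheet1.xml')
      doc = Nokogiri::XML(xml)
      doc.remove_namespaces!                  # strip the spreadsheetml namespace prefixes
      doc.xpath('//row').each do |row|
        cells = row.xpath('c/v').map(&:text)  # raw stored values, one per cell
        # look up shared strings / convert types here as needed
      end
    end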

Walter

This is an everyday thing; initially maybe a couple of people at the same time uploading and parsing files to generate the new one, but eventually it will extend to other people, so…

I used a logger and it does retrieve and save the files using the comparison. But it takes forever, like 30 minutes or so, to generate the file. The process starts as soon as the files are uploaded, but it seems to spend most of the time opening the file; once it's opened, it takes maybe 5 minutes at most to generate the new file.

Do you know where I can find an example of how to read an XLSX file with Nokogiri? I can't seem to find one.

Creek is good; I'd also recommend dullard, a gem that I wrote. Its output format may be more convenient for your case.

http://rubygems.org/gems/dullard

-Ted