Efficient parsing of large Excel documents in Ruby

All,

I am currently using the parseexcel gem to parse an Excel file so that I
can save it’s data into a database (in a Rails app.).

When I run it against an Excel file with ~42000 rows and 11 columns (the
spreadsheet is about 10MB in size), it takes the better part of 10
minutes to parse the file and uses about 200 MB of memory. I’m not
actually sure if the parsing completes successfully - the app. stops
right after the parsing appears to finish.

It would seem that I have two issues - memory usage and performance.

Is anyone aware of a much faster way to parse large Excel files?

Is anyone aware of a “windowing” scheme that uses a set amount of memory
to transfer portions of the data to a client requesting the data so as
to keep the memory required at a fixed level?

Should I look into using the Win32OLE libraries with the hope that it
would be more efficient?

I am transferring this data into a SQL Server database, so perhaps I
could look into some M$-native method of moving this data.

Any thoughts or advice is appreciated.

Thanks,
Wes G.

You should definitely use win32ole. This way, you can cut down memory
usage to the minimal needs by only storing the current row. I’ve been
using it for storing 11000 rows by 10 columns with great success.

However, there seems to be some kind of round trip delay in the OLE
communication: writing 11000 rows takes about 90 seconds, but that
should still be OK for you. I overcame this by simply storing an 11000
by 10 array, which took ~1 second. This then required the memory usage
of an 11000 by 10 array, but not 200MB…

See this link: http://wiki.rubygarden.org/Ruby/page/show/ScriptingExcel

So in the end, you should use win32ole and probably make an Excel
wrapper class for instantiating/closing Excel, appending rows etc.

-Christian

Wes G. skrev:

Christian M. [email protected] wrote:

See this link: http://wiki.rubygarden.org/Ruby/page/show/ScriptingExcel

Interesting. Just as a note, everything that happens on that page can
now be done on Mac using almost the same language, thanks to
rb-appscript (or RubyOSA, which I have not yet tried). It would be
interesting to write a Mac version of the page. m.

I was definitely thinking about using win32ole/Excel to convert this
spreadsheet to CSV and then process it with a CSV parser.

I’m currently running on Linux though, so now I have to figure out any
issues around a Windows production platform.

Thanks,
Wes

Wes G. wrote:

I was definitely thinking about using win32ole/Excel to convert this
spreadsheet to CSV and then process it with a CSV parser.

I’m currently running on Linux though, so now I have to figure out any
issues around a Windows production platform.

Thanks,
Wes

Have you tried unixODBC? It seems to work fairly well, although there
are some issues with most Linux/Unix tools when the spreadsheet has tabs
or column headers with spaces in them.

It’s fairly easy to leave the Excel spreadsheets on a Windows system and
hack together a simple “server” for them. I think all you have to do is
“publish them to the intranet”, but I’ve never done it. Then you can use
something like Hpricot to parse them.

I agree with Christian, but with a twist.

I like to use Excel Automation (w/ WIN32OLE) to convert the document to
a format that’s easier to work with. That way, after the initial
conversion, you can handle CSV, XML, etc, and be working with a fast
native library instead of the slow interop.

Another reason I prefer this method is that Excel interop is sometimes
hard to get “just right”. There’s processes I’ve written with it
running for over a year that I’ve never had to touch, but there’s
another that used to require I login to the server and kill orphaned
Excel processes. Someone changes a column, you don’t code defensively
enough, and BAM!

So the simplicity of the Load/Convert process, and getting out of Excel
as quickly as possible, conserving resources and avoiding potential
bugs that might result in orphaned processes is a big advantage IMO.

Plus it’s almost gauranteed to be the fastest, and simplest to develop.
(As long as you’re OK with having to run on Windows and having Excel
installed anyways.)

Wes G. wrote:

I am transferring this data into a SQL Server database, so perhaps I
could look into some M$-native method of moving this data.

Yes, there is a native way of doing this in VBA from a macro, which you
can store either in your personal macro workbook on in the spreadsheets
themselves. I have a colleague where I work that built an application
that does it.

But unfortunately, I have no idea what the technique is called. Almost
anyone who’s been through some advanced Microsoft training ought to be
able to help you find it, or you might search the Microsoft knowledge
base for “efficient transfer of data to SQL Server”.

If you don’t have an answer by Tuesday, let me know and I’ll ask her
what it’s called.

And no, I don’t think it involves installing CygWin. :slight_smile:

Wes G. wrote:

All,

I am currently using the parseexcel gem to parse an Excel file so that I
can save it’s data into a database (in a Rails app.).

Another rough option would be using JRuby or one of the Java bridges to
call out to POI, Apache’s Office-document library.

Ed,

Not a bad idea. I’m already using unixODBC to connect to SQL Server
from the Linux box. I totally forgot that you can get to Excel via
ODBC.

Don’t I need an ODBC driver for Excel though?

What about a DBI driver for Excel - does one exist?

Thanks,
Wes

Charles Oliver N. wrote:

Wes G. wrote:

All,

I am currently using the parseexcel gem to parse an Excel file so that I
can save it’s data into a database (in a Rails app.).

Another rough option would be using JRuby or one of the Java bridges to
call out to POI, Apache’s Office-document library.

Charles,

I would love to use JRuby for this app., but don’t feel it’s quite ready
yet (although you guys are moving it forward at an incredible rate and I
expect it will be a viable production option in less than a year - keep
up the good work).

Does anyone know if the POI Ruby bindings are a potential solution? I
see
“Implement support for reading Excel files (easy)” under the list of
TODOs so not sure if I could use POI-Ruby to read…

Anyone have any experience with POI-Ruby?

Wes

M. Edward (Ed) Borasky wrote:

Wes G. wrote:

I was definitely thinking about using win32ole/Excel to convert this
spreadsheet to CSV and then process it with a CSV parser.

I’m currently running on Linux though, so now I have to figure out any
issues around a Windows production platform.

Thanks,
Wes

Have you tried unixODBC? It seems to work fairly well, although there
are some issues with most Linux/Unix tools when the spreadsheet has tabs
or column headers with spaces in them.

It’s fairly easy to leave the Excel spreadsheets on a Windows system and
hack together a simple “server” for them. I think all you have to do is
“publish them to the intranet”, but I’ve never done it. Then you can use
something like Hpricot to parse them.

Hpricot? I thought that only parsed HTML? Can you say a little more
about this? Would the “server” you mention above parse the Excel into
some intermediate format that I would then process with something else
(like Hpricot or ???)?

Thanks,
Wes

Wes G. wrote:

Anyone have any experience with POI-Ruby?
We appreciate the vote of confidence, and you’re probably right about
JRuby not quite being production ready (“production ready” in the sense
of having something that won’t require a bit of tinkering…there are
people using it for production apps, but they’re comfortable on the
frontier :slight_smile:

Don’t forget the Java bridge options…from what I hear they work
reasonably well for neatly-encapsulated libraries like this.

RJB (Ruby-Java Bridge) is the usual recommendation and apparently more
active project, though YAJB (Yet Another Java Bridge) apparently works
too

And there’s RubyCLR, which can hook into the CLR and any capacity it has
for working with Excel.

Charles Oliver N. wrote:

RJB (Ruby-Java Bridge) is the usual recommendation and apparently more
active project, though YAJB (Yet Another Java Bridge) apparently works
too

I’m looking at RJB and it appears that it only loads the JVM into memory
once until it is explicitly unloaded.

Can anyone corroborate this (or correct me on it)?

Thanks,
Wes

Wes G. wrote:

M. Edward (Ed) Borasky wrote:

Wes G. wrote:

I was definitely thinking about using win32ole/Excel to convert this
spreadsheet to CSV and then process it with a CSV parser.

I’m currently running on Linux though, so now I have to figure out any
issues around a Windows production platform.

Thanks,
Wes

Have you tried unixODBC? It seems to work fairly well, although there
are some issues with most Linux/Unix tools when the spreadsheet has tabs
or column headers with spaces in them.

It’s fairly easy to leave the Excel spreadsheets on a Windows system and
hack together a simple “server” for them. I think all you have to do is
“publish them to the intranet”, but I’ve never done it. Then you can use
something like Hpricot to parse them.

Hpricot? I thought that only parsed HTML? Can you say a little more
about this? Would the “server” you mention above parse the Excel into
some intermediate format that I would then process with something else
(like Hpricot or ???)?

Thanks,
Wes

Ah, are you saying that I can “publish” the Excel file into HTML through
some IIS thing and then just parse that with Hpricot? That’s an
interesting solution.

Wes

Some more info:

Recently, I was confronted with a task in one of the apps. I’m building
that would allow the parsing of data in an Excel spreadsheet where the
number of rows could be on the order of 30000/40000/50000 or higher.

Originally, I was using the parseexcel gem to handle the parsing -
however, it proved to be fairly slow and consumed a lot of memory. When
I presented it with a > 42000 row spreadsheet, it basically cratered.
So I had to figure out another way to handle this problem. Someone
mentioned that there was a nice open source Java - based Excel parser
called JExcelAPI (http://jexcelapi.sourceforge.net/). A quick native
Java test showed that the performance and memory footprint would be much
much better.

In order to take advantage of JExcelAPI, I looked at JRuby briefly - but
still had problems implementing that (and I didn’t want to run this app.
on it yet since it’s still so young), so I took a look at some of the
Java - Ruby bridges. I gave one called Rjb
(http://arton.no-ip.info/collabo/backyard/?RubyJavaBridge) a shot. I
was very pleasantly surprised - it was really easy to use this to
integrate with the JExcelAPI.

If I understand correctly, Rjb uses JNI to start, and then interact with
an available JVM (a JDK, not a JRE). Works on Windows or UNIX. You
basically embed a JVM in your Ruby interpreter and then load classes
into it and start using them. Basic type casting to/from Java types is
done for you. The documentation is terrible but there’s just enough of
it to get you started.

Here’s what I did:

  1. Get the Rjb gem using “gem install rjb”
  2. Put the JAR file that I wanted to use - jxl.jar in my RAILS_ROOT/lib
    directory.
  3. Start the JVM using Rjb::load(“#{RAILS_ROOT}/lib/jxl.jar”,
    [‘-Xms256M’, ‘-Xmx512M’]) - the array is a set of parameters to send to
    the JVM for startup.
  4. Load classes using Rjb::import(classname)

Here’s an example of using it in my app.:

file_class = Rjb::import(‘java.io.File’)
workbook_class = Rjb::import(‘jxl.Workbook’)
workbook = workbook_class.getWorkbook(file_class.new(filename))

Some things to notice:

  • filename is a Ruby string - that’s being passed to the File.new() Java
    method.
  • The return of the call to file_class.new is a wrapped Java File object
    and can be immediately passed to the getWorkbook method.
  • workbook is a Java object that can then be used in other parts of the
    app.

The good news: Once you get past loading a class and/or instantiating an
object, doing method calls is as simple as just calling the methods on
the Java objects you’ve instantiated or received from other method
calls.

The bad news: This is so seamless, it would be very easy to forget that
some of the objects that you’re dealing with are effectively Java
objects, and then you might forget how to use them correctly.

For production for this app, I may need to change approaches since I’ll
prob. be running multiple Mongrel processes and I don’t know if I want
to have one embedded JVM per process (if I understand Mongrel deployment
correctly - currently I’m doing Apache/FastCGI so I know it’s a problem
there). That may force using DRb in a separate process to host this
Excel parser component and allow it to be used from anywhere (if that
happens - could also do a Web service-y thing on top of a JRuby process
or whatever).

Hope this is useful for someone.

Wes

Wes G. wrote:

mentioned that there was a nice open source Java - based Excel parser
integrate with the JExcelAPI.
0) Get the Rjb gem using “gem install rjb”
workbook_class = Rjb::import(‘jxl.Workbook’)
The good news: Once you get past loading a class and/or instantiating an
to have one embedded JVM per process (if I understand Mongrel deployment
correctly - currently I’m doing Apache/FastCGI so I know it’s a problem
there). That may force using DRb in a separate process to host this
Excel parser component and allow it to be used from anywhere (if that
happens - could also do a Web service-y thing on top of a JRuby process
or whatever).

Hope this is useful for someone.

Wes

That seems to me to be an awfully roundabout way of doing things. I
think you could hack something up with ODBC to just take your whole
spreadsheet and upload it to a database, then use Rails or DBI to read
the database.

Charles Oliver N. wrote:

And there’s RubyCLR, which can hook into the CLR and any capacity it has
for working with Excel.

Read: None. :slight_smile: You’d just reference the Excel COM object in .NET.
Using WIN32OLE is actually simpler than c# since you have access to the
VB-ish interfaces too (such as Range#Value).