Hi all,
So I have a project where I need to regularly import some very large files – 200-500 MB, with 100-1m records. All files are CSVs. Currently I have been using CSV.parse (the Rails 3 / Ruby 1.9 replacement for FasterCSV). I have things working, but the process takes a long time because right now I import record by record, and I am validating and normalizing data in the same pass (e.g. an account name comes in on the CSV, I look that name up in the AR model, and I normalize the field to the account id while importing).
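For context, my current loop looks roughly like this (Entry, Account, and the column names are simplified stand-ins for my real schema):

  require 'csv'

  file_path = '/path/to/import.csv'

  CSV.parse(File.read(file_path), :headers => true).each do |row|
    # look up the account by the name that came in on the CSV...
    account = Account.find_by_name(row['account_name'])
    # ...and store the normalized account id on the new record
    Entry.create!(
      :account_id => account && account.id,
      :amount     => row['amount'],
      :posted_on  => row['posted_on']
    )
  end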
I would welcome any suggestions on how to make this process faster and more solid. My thoughts so far:
- Separate the validation step from the import step: first import all the data, then, once everything is loaded and I have verified that the number of rows in my model matches the number of rows in the file, run validation. This makes the process more modular, and if validation fails it does not force me to reload all the data into the db just to re-validate once corrections have been made elsewhere in the system.
- Consider using tools to bulk-load the CSV data straight into my db (Postgres); rough sketches of both options are below:
  a) I see a project out there called activerecord-import, a library for bulk insertion of data into your database using ActiveRecord (https://github.com/zdennis/activerecord-import).
  b) I have found a COPY FROM-based approach for AR (http://www.misuse.org/science/2007/02/16/importing-to-postgresql-from-activerecord-amp-rails/).
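For (a), here is a rough sketch of what I have in mind with activerecord-import – accumulate rows and let it issue one multi-row INSERT per batch (again, Entry/Account and the columns are placeholders, and the 1000-row batch size is just a guess):

  require 'csv'  # plus the activerecord-import gem in the Gemfile

  file_path = '/path/to/import.csv'
  columns   = [:account_id, :amount, :posted_on]

  # build the account-name -> id lookup once instead of one query per row
  account_ids = Account.all.each_with_object({}) { |a, h| h[a.name] = a.id }

  rows = []
  CSV.foreach(file_path, :headers => true) do |row|
    rows << [account_ids[row['account_name']], row['amount'], row['posted_on']]
    if rows.size >= 1000
      Entry.import(columns, rows, :validate => false)  # one multi-row INSERT
      rows.clear
    end
  end
  Entry.import(columns, rows, :validate => false) unless rows.empty?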
I just want to see if anyone has dealt with data in these quantities and has any other recommendations. This project does not really need to be db agnostic.
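Since Postgres-only is fine, I imagine the COPY route would look something like this – stream the file into a staging table over the raw pg connection and leave the validation/normalization to a later SQL pass (import_staging and its columns are made-up names, and skipping the header this way assumes no embedded newlines inside quoted fields):

  file_path = '/path/to/import.csv'
  conn = ActiveRecord::Base.connection.raw_connection

  conn.exec("COPY import_staging (account_name, amount, posted_on) FROM STDIN WITH CSV")
  File.foreach(file_path).with_index do |line, index|
    next if index.zero?        # skip the CSV header row
    conn.put_copy_data(line)
  end
  conn.put_copy_end
  conn.get_last_result         # collect the COPY result so errors are not silently dropped

  # sanity check: row count in the staging table vs. rows in the file
  staged = ActiveRecord::Base.connection.select_value("SELECT count(*) FROM import_staging").to_i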
Running Rails 3 and Ruby 1.9.2, deployed on an Ubuntu 10.04 server, with PostgreSQL…
Best,
David