Slow database imports with active record

Hey Folks,

Im looking for some suggestions here. Each day I need to process 200 GZ
compressed files each day that contain comma delimited information in
the following format:
HostName, FileName, DirName, Modified_On

Each file contains upwards of 200K rows and works out to be about 20MB
when
uncompressed. I need to compare this
information to the information recieved the day before to look for
files that have changed.

My current plan was to:

do read file and uncompress line
do import line into Mysql using activerecord
end
end
run several comparison queries to find changes save changes to a table
automatically review changes based on rules, and those that are
leftover are unauthorized changes and require research by the network
admins.

I’ve written everything, and its working. However, the MYSQL Import is
SLOOOOW.
Its taking 10 minutes per file.
Extrapolating this, it will take 2000 minutes, or 33 hours each day to
do just the import. Unfortunately, earth days only have 24 hours.

So, I need some way to compare todays file, to yesterdays and see
changes. I really want akk the data to end up in MySQL or SQL server,
so I’m
hesitant to use any solution that requires direct comparios on text
files.
Ideally, I’d like a way to speed up the import. I saw another post
talking
about how ActiveRecord defaults to using transactions for insert in a
block, so
the only thing ive tried at this point to speed things up is to begin
the
transaction outside of the block and end it upon compleetion. This
didn’t seem
to help much…

Thanks :slight_smile:

Brian

Brian C. wrote:

Hey Folks,

Im looking for some suggestions here. Each day I need to process 200 GZ
compressed files each day that contain comma delimited information in
the following format:
HostName, FileName, DirName, Modified_On

Check out LOAD DATA INFILE for MySQL. It’s likely to be a lot faster
than any other route, except possibly pre-parsing your CSV into SQL
statements.

Brian C. wrote:

Hey Folks,

Im looking for some suggestions here. Each day I need to process 200 GZ
compressed files each day that contain comma delimited information in
the following format:
HostName, FileName, DirName, Modified_On

Hello,

hey I am having the same problem here, how did you work that out?