File Auditing - Fast DB import and data manipulation

Hey Folks,

I'm looking for some suggestions here. I need to process 200
gzip-compressed files each day that contain comma-delimited information
in the following format:
HostName, FileName, DirName, Modified_On

Each file contains upwards of 200K rows. I need to compare this
information to the information recieved the day before to look for
files that have changed.

My current plan was to:

for each compressed file:
    uncompress and read it line by line
    import each line into MySQL
then run several comparison queries to find the changes (sketched below),
save the changes to a table, and automatically review them against a set
of rules; whatever is left over is unauthorized.
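
Roughly, the comparison would be something like this, with today's and
yesterday's loads sitting in separate tables (the table names here are
just placeholders):

mysql audit_db -e "
    SELECT t.HostName, t.DirName, t.FileName, t.Modified_On
    FROM audit_today t
    LEFT JOIN audit_yesterday y
           ON  y.HostName    = t.HostName
           AND y.DirName     = t.DirName
           AND y.FileName    = t.FileName
           AND y.Modified_On = t.Modified_On
    WHERE y.FileName IS NULL;"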

The MySQL import is SLOOOOW. It's taking 10 minutes per file.
Extrapolating this, it will take 2000 minutes, or 33 hours, each day to
do just the import. Unfortunately, earth days only have 24 hours.

So, I need some way to compare today's files to yesterday's and see the
changes. Is there a good way to do this using the text files directly and
skip the import process? I'm worried that this will slow down the
comparison step, but I'd like to try it…

Thanks :)

Brian

[email protected] writes:

The MySQL import is SLOOOOW. It's taking 10 minutes per file.
Extrapolating this, it will take 2000 minutes, or 33 hours, each day to
do just the import. Unfortunately, earth days only have 24 hours.

How do you import data into MySQL? Do you have direct access to the
database server? At least in PostgreSQL, you can use COPY to import a
file directly… which ought to be fast.
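
If the slow part is doing one INSERT per row, MySQL's bulk loader, LOAD
DATA INFILE, should help a lot. A minimal sketch for a single file (the
file, database, table, and column names are only placeholders):

gunzip -c host01.gz > /tmp/audit.csv
mysql --local-infile=1 audit_db -e "
    LOAD DATA LOCAL INFILE '/tmp/audit.csv'
    INTO TABLE file_audit
    FIELDS TERMINATED BY ','
    (HostName, FileName, DirName, Modified_On);"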

Well, if you're doing it on *nix, use the 'diff' command. On Windows,
use 'fc', and write some other code to handle the differences. Loading
the files into the database is not the best way forward. Use 'gunzip'
on either system (on Windows you'll need to download gunzip from the
GnuWin32 packages; heck, might as well download the package with diff
and grep in it as well).

Wait, this has come up before, and sparked quite the flamefest each
time!

With moderately clever code, especially on *nix, you should even be
able to do this in a shell script with minimal lines of code,
something like this rough sh sketch:

ff=""
for f in *.gz; do
    # uncompress to a plain-text copy alongside the original
    gunzip -c "$f" > "$f.txt"
    if [ -n "$ff" ]; then
        # keep only the lines present in this file but not the previous one
        diff "$ff" "$f.txt" | grep '^>' > "$f.txt.diff"
    fi
    ff="$f.txt"
done

The databases I’ve dealt with (Oracle, SQL Server, Access, Postgres)
don’t have a good way to do this in SQL or stored procedure language,
because they don’t really support recursion at the SQL level. Plus,
it’s just too expensive for the database to allocate table space (temp
or not) to load a couple of files, run all the queries that you would
need to run to spot the differences and extract them meaningfully, and
then go on to the next file.

Otherwise, another quick-and-dirty way to do it is to set up the import
table with a multi-column PK or unique constraint, add a field to store
the file name (but keep it out of the constraint), and then just
bulk-import each file. If you updated the filename field
the last time, then all the records that were successfully added will
have NULL for a filename. Depending on the database, the rows that
failed to be added due to constraint violations might be in the error
logs, or they just may not really matter at all. Run any reports you
need to for the new records before updating the filename field.

Lather-rinse-repeat.
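
A rough sketch of that loop against MySQL (the database name, table
layout, and temp path are all invented for illustration, not taken from
Brian's setup):

# one-time setup: unique constraint over the data columns, plus a
# nullable marker column that stays out of the constraint
mysql audit_db -e "
    CREATE TABLE IF NOT EXISTS file_audit (
        HostName    VARCHAR(64)  NOT NULL,
        FileName    VARCHAR(255) NOT NULL,
        DirName     VARCHAR(255) NOT NULL,
        Modified_On VARCHAR(32)  NOT NULL,
        src_file    VARCHAR(255) NULL,
        UNIQUE KEY uk_audit (HostName, DirName, FileName, Modified_On)
    );"

# daily run: IGNORE silently skips rows whose key already exists, so
# only new or changed rows end up with src_file still NULL
for f in /data/incoming/*.gz; do
    gunzip -c "$f" > /tmp/audit.csv
    mysql --local-infile=1 audit_db -e "
        LOAD DATA LOCAL INFILE '/tmp/audit.csv'
        IGNORE INTO TABLE file_audit
        FIELDS TERMINATED BY ','
        (HostName, FileName, DirName, Modified_On);"
done

# report on the leftovers, then mark them as seen
mysql audit_db -e "SELECT * FROM file_audit WHERE src_file IS NULL;"
mysql audit_db -e "UPDATE file_audit SET src_file = CURDATE() WHERE src_file IS NULL;"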

For the SQL Server geeks, looping over the files in a directory is
possible via DTS (sqldts.com has sample VBScript code for it…), but
it's a sucky pain in the ass. Maybe it's easier in SS2005.

You can use the ‘mysqlimport’ utility to do a bulk load
of the data.
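
Something along these lines (mysqlimport picks the target table from the
file's base name, so the path and names below are only illustrative):

gunzip -c host01.gz > /tmp/file_audit.csv
mysqlimport --local --fields-terminated-by=',' \
    --columns='HostName,FileName,DirName,Modified_On' \
    audit_db /tmp/file_audit.csv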

Vance