Hey Folks, Im looking for some suggestions here. I need to process 200 GZ compressed files each day that contain comma delimited information in the following format: HostName, FileName, DirName, Modified_On Each file contains upwards of 200K rows. I need to compare this information to the information recieved the day before to look for files that have changed. My current plan was to: ## do read file and uncompress line do import line into Mysql end end run several comparison queries to find changes save changes to a table automatically review changes based on rules, and those that are leftover are unauthorized. ## The MYSQL Import is SLOOOOW. Its taking 10 minutes per file. Extrapolating this, it will take 2000 minutes, or 33 hours each day to do just the import. Unfortunately, earth days only have 24 hours. So, I need some way to compare todays file, to yesterdays and see changes. Any good way to do this using the text files and skip the import process? Im worried that this will slow down the comparison process, but I'd like to try it.... Thanks :) Brian
on 2006-03-20 23:38
on 2006-03-21 17:35
"email@example.com" <firstname.lastname@example.org> writes: > The MYSQL Import is SLOOOOW. Its taking 10 minutes per file. > Extrapolating this, it will take 2000 minutes, or 33 hours each day to > do just the import. Unfortunately, earth days only have 24 hours. How do you import data to MySQL? Do you have direct access to the database server? At least in PostgreSQL, you can use COPY FILE to directly import data... which ought to be fast.
on 2006-03-21 18:00
Well, if you're doing it on *nix, use the 'diff' command. On Windows, use 'fc', and write some other code to handle the differences. Loading the files into the database is not the best way forward. Use 'gunzip' on either system (you'll need to download gunzip from gnuwin32 (http://gnuwin32.sourceforge.net/packages.html. Heck, might as well download the package with diff and grep in it as well). Wait, this has come up before, and sparked quite the flamefest each time! With moderately clever code, especially on *nix, you should even be able to do this in a shell script with minimal lines of code, something like this really bad mix of CMD and sh: for each f in (*.gz) do if ff .neq. "" then #arg, gotta gunzip the file... gunzip %f %f.txt f = %f.txt diff %f %ff | grep -v '$>' > %f.diff end if ff = %f #next f The databases I've dealt with (Oracle, SQL Server, Access, Postgres) don't have a good way to do this in SQL or stored procedure language, because they don't really support recursion at the SQL level. Plus, it's just too expensive for the database to allocate table space (temp or not) to load a couple of files, run all the queries that you would need to run to spot the differences and extract them meaningfully, and then go on to the next file. Otherwise, another quick-and-dirty way to do it is to set up the table to import so that there is a multi-column PK or unique constraint, add a field to store the file name (but keep it out of the constraint), and then just bulk import each file. If you updated the filename field the last time, then all the records that were successfully added will have NULL for a filename. Depending on the database, the rows that failed to be added due to constraint violations might be in the error logs, or they just may not really matter at all. Run any reports you need to for the new records before updating the filename field. Lather-rinse-repeat. For the SQL Server geeks, doing the "loop over the files in a directory" is possible via DTS (sqldts.com has sample VBScript code to do it...), but it's a sucky pain in the ass to do. But maybe it's easier to do in SS2005.
on 2006-03-26 08:30
You can use the 'mysqlimport' utility to do a bulk load of the data. Vance