I've just released the latest version of the ActiveWarehouse ETL (Extract-Transform-Load) component, a gem from the ActiveWarehouse collection of tools that helps build ETL processes for extracting data from operational systems, transforming the data, and loading it into data warehouses. This release contains lots of nifty and useful enhancements. To get the latest version just do sudo gem install activewarehouse-etl. You can also get the latest release from the ActiveWarehouse RubyForge site ( http://rubyforge.org/frs/?group_id=2435&release_id=10966 ).

Job execution is now tracked in a database. This means that ActiveRecord is required regardless of the sources used in the ETL scripts. Currently AW-ETL records each job execution as a Job instance in the database and stores each record written along with its CRC.

The etl script now supports a variety of command line arguments:

* -h or --help: Prints the usage statement.
* -l or --limit: Specifies a limit for the number of source rows to read, useful for testing your control files before executing a full ETL process.
* -o or --offset: Specifies a start offset for reading from the source, also useful for testing your control files before executing a full ETL process.
* -c or --config: Specifies the database.yml file used to configure the ETL execution data store. By default the current working directory is searched for a file called database.yml.
* -n or --newlog: Writes to the logfile rather than appending to it.
* -s or --skip-bulk-import: Skips any bulk import post processors.
* --read-locally: Reads from the last local cache file for all sources.

The database source now supports specifying the select, join and order parts of the query, and it understands the limit and offset arguments specified on the etl command line.

Several new processors were added:

* CheckExistProcessor. Checks whether the record already exists in the target data store.
* CheckUniqueProcessor. Checks whether the record has already been written.
* SurrogateKeyProcessor.
The SurrogateKey processor should be used in conjunction with the CheckExistProcessor and CheckUniqueProcessor.
* SequenceProcessor. Generates context-sensitive sequences based on the fields in the source data.

There have also been some changes in the transforms:

* Added OrdinalizeTransform.
* Fixed a bug in the trim transform.

Sources now provide a trigger file which can be used to indicate that the original source data has been completely extracted to the local file system. This is useful if you need to recover from a failed ETL process.

Two shortcut methods have been added to the ETL control DSL: rename and copy. These methods delegate to the RenameProcessor and the CopyFieldProcessor respectively and are used to rename or copy a data field in memory.

I have also spent some time improving the documentation by updating the README and continuing to improve the inline rdoc. Finally, I have added some benchmarking to ETL::Engine to get an idea of how long the ETL engine spends in each of the various parts of the ETL pipeline.

More information on ActiveWarehouse can be found at http://activewarehouse.rubyforge.org/ .

Enjoy!

V/r
Anthony E.
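P.S. For readers curious what per-record CRC tracking amounts to, here is a plain-Ruby sketch using the standard library's Zlib. This is not AW-ETL's actual code, and the way the gem serializes a row before hashing may differ; it just shows that a CRC-32 gives a cheap fingerprint for each record written:

```ruby
require 'zlib'

# Serialize a row deterministically (sorted by field name), then
# compute a CRC-32 fingerprint over it. The "|" separator and the
# sort-based serialization are illustrative choices, not the gem's.
row = { id: 42, name: "widget" }
crc = Zlib.crc32(row.sort.join("|"))

puts crc  # an Integer fingerprint for this row
```

Comparing a newly computed CRC against the stored one is enough to tell whether a record changed since the last run, without keeping a full copy of the old row.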
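To illustrate the ideas behind CheckUniqueProcessor and SequenceProcessor, here is a small standalone Ruby sketch (not the gem's implementation): rows whose key fields have already been seen are dropped, and then a sequence number is assigned that restarts for each value of a context field:

```ruby
require 'set'

# Hypothetical sample rows; the field names are made up for the example.
rows = [
  { customer: "a", order: 1 },
  { customer: "a", order: 1 },  # duplicate on (customer, order), dropped
  { customer: "a", order: 2 },
  { customer: "b", order: 1 },
]

# Uniqueness check: Set#add? returns nil when the key was already
# present, so only first occurrences survive the select.
seen = Set.new
unique_rows = rows.select { |row| seen.add?([row[:customer], row[:order]]) }

# Context-sensitive sequence: one counter per customer value.
counters = Hash.new(0)
unique_rows.each { |row| row[:seq] = (counters[row[:customer]] += 1) }

unique_rows.each { |row| p row }
# seq values come out as 1, 2 for customer "a" and 1 for customer "b"
```

In the real pipeline these checks run as processors over the row stream, but the core mechanics are the same: a seen-set for uniqueness and per-context counters for sequencing.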
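The OrdinalizeTransform presumably turns integers into ordinal strings, in the spirit of ActiveSupport's ordinalize. A plain-Ruby function with that behavior (a sketch, not the gem's code) looks like:

```ruby
# Convert an integer to its ordinal string form: 1 -> "1st", 22 -> "22nd".
# The 11-13 special case must be checked before the last digit, since
# "11st" and "13rd" would be wrong.
def ordinalize(number)
  abs = number.abs
  suffix =
    if (11..13).include?(abs % 100)
      "th"
    else
      case abs % 10
      when 1 then "st"
      when 2 then "nd"
      when 3 then "rd"
      else "th"
      end
    end
  "#{number}#{suffix}"
end

puts ordinalize(1)   # => "1st"
puts ordinalize(12)  # => "12th"
puts ordinalize(23)  # => "23rd"
```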
on 2007-04-08 20:00