I’ve just released the latest version of the ActiveWarehouse ETL
(Extract-Transform-Load) component, a gem from the ActiveWarehouse
collection of tools that helps build ETL processes for extracting data
from operational systems, transforming the data and loading it into
data warehouses.
This release contains lots of nifty and useful enhancements. To get
the latest version just do sudo gem install activewarehouse-etl. You
can also get the latest release from the ActiveWarehouse RubyForge
site ( http://rubyforge.org/frs/?group_id=2435&release_id=10966 ).
Job execution is now tracked in a database. This means that
ActiveRecord is required regardless of the sources being used in the
ETL scripts. Currently AW-ETL records each job execution as a Job
instance in the database and stores each record written along with its
CRC.
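To illustrate the idea of keeping a CRC alongside each written record, here is a minimal sketch in plain Ruby. It is not the AW-ETL implementation: the record_crc helper and the way the fields are serialized are assumptions made for this example. Only Zlib.crc32 is a real standard-library call.

```ruby
require 'zlib'

# Hypothetical helper (not part of AW-ETL): compute a CRC-32 checksum for a
# row represented as a Hash. Keys are sorted so that the same data always
# yields the same checksum regardless of insertion order.
def record_crc(row)
  serialized = row.sort_by { |key, _value| key.to_s }
                  .map { |key, value| "#{key}=#{value}" }
                  .join('|')
  Zlib.crc32(serialized)
end

original = { :first_name => 'Bob', :last_name => 'Smith' }
changed  = { :first_name => 'Bob', :last_name => 'Smyth' }

# A differing checksum signals that the record changed since it was written.
puts record_crc(original) == record_crc(changed)
```

Comparing a stored checksum with a freshly computed one is a cheap way to detect that a source row has changed without comparing every field.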
The etl script now supports a variety of command line arguments:
* -h or --help: Prints the usage statement.
* -l or --limit: Specifies a limit for the number of source rows to
read, useful for testing your control files before executing a full
ETL process.
* -o or --offset: Specifies a start offset for reading from the
source, useful for testing your control files before executing a full
ETL process.
* -c or --config: Specifies the database.yml file to configure the ETL
execution data store. By default the etl script looks in the current
working directory for the file database.yml.
* -n or --newlog: Writes a new log file rather than appending to the
existing one.
* -s or --skip-bulk-import: Skips any bulk import post processors.
* --read-locally: Reads from the last local cache file for all
sources.
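As a rough picture of how flags like these map onto settings, the sketch below parses them with Ruby's standard OptionParser. It is only an illustration and not the etl script's own code: the parse_etl_args helper, the Hash keys, the defaults and the banner text are all assumptions. OptionParser supplies -h and --help on its own.

```ruby
require 'optparse'

# Hypothetical helper (not the real etl script): parse the flags listed above
# into an options Hash and return it with the remaining arguments, which are
# assumed here to be control file names.
def parse_etl_args(argv)
  options = {
    :config => 'database.yml',   # assumed default, per the description above
    :newlog => false,
    :skip_bulk_import => false,
    :read_locally => false
  }
  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: etl [options] control_file ...' # assumed wording
    opts.on('-l', '--limit N', Integer)  { |n| options[:limit] = n }
    opts.on('-o', '--offset N', Integer) { |n| options[:offset] = n }
    opts.on('-c', '--config FILE')       { |f| options[:config] = f }
    opts.on('-n', '--newlog')            { options[:newlog] = true }
    opts.on('-s', '--skip-bulk-import')  { options[:skip_bulk_import] = true }
    opts.on('--read-locally')            { options[:read_locally] = true }
  end
  control_files = parser.parse(argv) # returns the non-option arguments
  [options, control_files]
end

options, files = parse_etl_args(%w[--limit 100 --offset 50 people.ctl])
p options[:limit], options[:offset], files
```

A small limit combined with an offset like this is the kind of invocation you would use to try out a control file (people.ctl is a made-up name) on a slice of the source.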
The database source now supports specifying the select, join and order
parts of the query and understands the limit and offset arguments
specified on the etl command line.
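To show how select, join and order parts could combine with a limit and offset into one query, here is a small self-contained sketch. The build_query helper, its option names and the table and column names are hypothetical and are not taken from the AW-ETL source. SQL dialect differences are ignored.

```ruby
# Hypothetical helper (not AW-ETL code): assemble a SELECT statement from
# optional :select, :join and :order parts plus :limit and :offset.
def build_query(table, options = {})
  parts = ["SELECT #{options[:select] || '*'}", "FROM #{table}"]
  parts << options[:join] if options[:join]
  parts << "ORDER BY #{options[:order]}" if options[:order]
  # Integer() rejects anything that is not a whole number.
  parts << "LIMIT #{Integer(options[:limit])}" if options[:limit]
  parts << "OFFSET #{Integer(options[:offset])}" if options[:offset]
  parts.join(' ')
end

puts build_query('people',
                 :select => 'people.id, people.email',
                 :join   => 'JOIN addresses ON addresses.person_id = people.id',
                 :order  => 'people.id',
                 :limit  => 100,
                 :offset => 200)
```

Supplying an order matters once limit and offset are in play, because without a stable ordering the same slice of rows is not guaranteed between runs.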
Several new processors were added:
* CheckExistProcessor. Check if the record already exists in the
target data store.
* CheckUniqueProcessor. Check if the record has already been
written.
* SurrogateKeyProcessor. The SurrogateKey processor should be used
in conjunction with the CheckExistProcessor and CheckUniqueProcessor.
* SequenceProcessor. Generates context-sensitive sequences based
on the fields in the source data.
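The sketch below gives one plausible reading of two of these ideas in plain Ruby: a uniqueness check that drops rows already seen, and a sequence that counts separately per context. The UniqueFilter and ContextSequence classes are hypothetical stand-ins written for this example. They are not the AW-ETL processors, and the interpretation of "context-sensitive" as "one counter per combination of field values" is an assumption.

```ruby
require 'set'

# Hypothetical stand-in for a "check unique" step: returns the row the first
# time a given combination of key fields is seen, and nil afterwards.
class UniqueFilter
  def initialize(*key_fields)
    @key_fields = key_fields
    @seen = Set.new
  end

  def process(row)
    key = @key_fields.map { |field| row[field] }
    @seen.add?(key) ? row : nil # Set#add? returns nil if already present
  end
end

# Hypothetical stand-in for a context-sensitive sequence: keeps a separate
# counter for every distinct value of the context fields.
class ContextSequence
  def initialize(dest_field, *context_fields)
    @dest_field = dest_field
    @context_fields = context_fields
    @counters = Hash.new(0)
  end

  def process(row)
    context = @context_fields.map { |field| row[field] }
    @counters[context] += 1
    row.merge(@dest_field => @counters[context])
  end
end

sequence = ContextSequence.new(:line_number, :order_id)
p sequence.process(:order_id => 1, :sku => 'A')
p sequence.process(:order_id => 1, :sku => 'B')
p sequence.process(:order_id => 2, :sku => 'A')
```

In this reading, line numbers restart for each order_id, which is the kind of sequence that depends on the fields in the source data.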
There have also been some changes in the transforms:
* Added OrdinalizeTransform
* Fixed a bug in the trim transform
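As an illustration of what ordinalizing a value means, the following sketch converts integers to English ordinals. The release notes do not describe how OrdinalizeTransform works internally, so this ordinalize helper is an assumption written for the example and not the transform itself.

```ruby
# Hypothetical helper (not the AW-ETL transform): turn an integer into its
# English ordinal string, e.g. 1 becomes "1st" and 12 becomes "12th".
def ordinalize(number)
  n = number.to_i.abs
  suffix =
    if (11..13).include?(n % 100)
      'th' # 11th, 12th, 13th, 111th, ...
    else
      case n % 10
      when 1 then 'st'
      when 2 then 'nd'
      when 3 then 'rd'
      else 'th'
      end
    end
  "#{number}#{suffix}"
end

puts [1, 2, 3, 4, 11, 21, 112].map { |n| ordinalize(n) }.join(', ')
# prints "1st, 2nd, 3rd, 4th, 11th, 21st, 112th"
```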
Sources now provide a trigger file which can be used to indicate that
the original source data has been completely extracted to the local
file system. This is useful if you need to recover from a failed ETL
process.
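The trigger-file idea can be sketched as follows: write the extracted data first, then create an empty marker file, and treat the local copy as usable only when the marker exists. The helper names and the ".trig" suffix are assumptions for this example and are not confirmed by the release notes.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical naming convention: the trigger file sits next to the local
# copy and carries an assumed ".trig" suffix.
def trigger_path(local_file)
  "#{local_file}.trig"
end

# Write all rows to the local file, then create the trigger file. If the
# process dies part-way through writing, no trigger file will exist.
def extract_to_local(local_file, rows)
  File.open(local_file, 'w') { |file| rows.each { |row| file.puts(row) } }
  FileUtils.touch(trigger_path(local_file))
end

# The local copy is considered complete only when its trigger file exists.
def extraction_complete?(local_file)
  File.exist?(trigger_path(local_file))
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'people.csv')
  puts extraction_complete?(path) # false: nothing has been extracted yet
  extract_to_local(path, ['1,Bob', '2,Alice'])
  puts extraction_complete?(path) # true: the trigger file now exists
end
```

On recovery, a completed local copy can be reused instead of going back to the operational system, while an incomplete one is extracted again.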
Two shortcut methods have been added to the ETL control DSL: rename
and copy. These methods delegate to the RenameProcessor and the
CopyFieldProcessor respectively and are used to either rename or copy
a data field in memory.
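The effect of renaming and copying a field on a row held in memory can be pictured with plain Hash operations. The two helpers below are hypothetical stand-ins that show the idea only. They are not the RenameProcessor or CopyFieldProcessor, and the field names are made up.

```ruby
# Hypothetical stand-in: move the value stored under `source` to `dest`.
def rename_field(row, source, dest)
  row[dest] = row.delete(source)
  row
end

# Hypothetical stand-in: duplicate the value under `source` into `dest`,
# so later changes to one field's string do not affect the other.
def copy_field(row, source, dest)
  row[dest] = row[source].dup
  row
end

row = { :name => 'Bob', :id => 42 }
rename_field(row, :name, :full_name)
copy_field(row, :id, :natural_key)
p row
```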
I have also spent some time improving the documentation by updating
the README and continuing to improve the inline rdoc.
Finally, I have added some benchmarking to the ETL::Engine to get an
idea of how long the ETL engine spends in each of the various parts of
the ETL pipeline.
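One simple way to collect this kind of per-stage timing is to wrap each stage in a block and accumulate elapsed seconds under the stage name. The StageTimer class and the stage names below are hypothetical and are not how ETL::Engine does its benchmarking. The sketch uses Ruby's monotonic clock.

```ruby
# Hypothetical helper (not part of ETL::Engine): accumulate the time spent
# in each named stage of a pipeline.
class StageTimer
  attr_reader :totals

  def initialize
    @totals = Hash.new(0.0) # seconds per stage
  end

  # Run the block, add its elapsed time to the stage total, and return
  # whatever the block returned.
  def measure(stage)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    @totals[stage] += Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    result
  end
end

timer = StageTimer.new
rows = timer.measure(:read) { (1..1000).to_a }
timer.measure(:transform) { rows.map { |n| n * 2 } }
timer.totals.each { |stage, seconds| puts format('%-10s %.6fs', stage, seconds) }
```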
More information on ActiveWarehouse can be found at
http://activewarehouse.rubyforge.org/ . Enjoy!
V/r
Anthony E.