Hello,
On 29 Nov 2006, at 02:43, eDave wrote:
> What’s a good way to go about this? From the message board, it looks
> like the best route is to make a script that instantiates an
> ActiveRecord::Base connection, and then loads information out of the
> government database (exported to MySQL) into the appropriate
> ActiveRecord objects, and saves these objects to the new database. I’m
> unclear as to how to set this up though. Will Rails magically load the
> environment so I can just create a new Foo object and save it, or will
> I need to require some files? Where is an appropriate place to put
> this script? The lib folder, vendor folder, elsewhere?
You have two main options and your situation will determine which is
better for you.
In both options below, the first CSV step assumes you just do a
simple ‘export to CSV’ from Access rather than writing a script to
pull out only the data you want. If you do the latter instead, you
can skip the first CSV step.
Option 1
Legacy DB (Access) → CSV (legacy schema) → CSV (new schema) →
MySQL → validate with ActiveRecord[2]
You would write the script to convert the first set of CSV files
into the second set. You would then use MySQL’s batch loading
facility[1] (LOAD DATA INFILE) to lever the data into the database.
Finally, you would use a script like [2] to validate all the data
via ActiveRecord.
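For the CSV-to-CSV step, something along these lines is usually
enough. This is only a sketch, assuming Ruby's standard csv library
(FasterCSV would look much the same) and made-up file and column
names:

  require 'csv'

  # Hypothetical column layouts: the legacy export holds
  # [id, full_name], the new schema wants [id, first_name, last_name].
  CSV.open('people_new.csv', 'w') do |out|
    CSV.foreach('people_legacy.csv') do |row|
      id, full_name = row
      first, last = full_name.to_s.split(' ', 2)
      out << [id, first, last]
    end
  end

  # Then load the result with MySQL's batch loader [1], e.g.
  #   LOAD DATA INFILE '/path/to/people_new.csv' INTO TABLE people
  #   FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"';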
Option 2
Legacy DB (Access) → CSV (legacy schema) → ActiveRecord
You would write the script to read the legacy CSV files and create
ActiveRecord objects. The data are validated along the way. For
your script to work with the AR objects, include your app/models
directory on the load path, e.g.:
$ ruby -I lib:/path/to/your/app/models lib/data_migration.rb
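Note that running the script this way means Rails does not load the
environment for you: the script itself has to require ActiveRecord
and open a connection before it touches the models. A rough sketch
of lib/data_migration.rb, reusing the Foo model from your question;
the column names and connection details below are invented, so swap
in your own:

  require 'rubygems'
  require 'active_record'
  require 'csv'
  require 'foo'   # app/models/foo.rb, found via the -I flag above

  # Connection details are made up; copy them from config/database.yml.
  ActiveRecord::Base.establish_connection(
    :adapter  => 'mysql',
    :host     => 'localhost',
    :username => 'rails',
    :password => 'secret',
    :database => 'myapp_development'
  )

  CSV.foreach('legacy/foos.csv') do |row|
    foo = Foo.new(:name => row[1])   # row[0] is the legacy primary key
    foo.save!                        # raises if the record fails validation
  end

Alternatively, requiring your app's config/environment.rb at the top
of the script pulls in the whole Rails environment, models and
database connection included, at the cost of a slower start-up.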
Trade-Offs
Option 1 is far more scalable as your data volume grows. Option 2
can be sped up with ActiveRecord::Extensions[3], but Option 1 is
still faster.
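For reference, the bulk import in [3] looks roughly like the sketch
below. I'm writing this from memory, so treat the require line and
the import signature as assumptions and check the
ActiveRecord::Extensions docs for the real thing:

  require 'rubygems'
  require 'ar-extensions'   # assumed gem/require name; see [3]

  columns = [:name, :state]
  rows    = [['Alice', 'CA'],
             ['Bob',   'NY']]

  # One multi-row INSERT instead of one INSERT per object.
  Foo.import(columns, rows)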
Option 2 is DRY with respect to your new schemas. In option 1 you
specify your new schemas twice: once in the new CSV files and again
in your ActiveRecord model objects. In option 2 you only specify
them in your model objects.
Option 1 allows/requires you to set primary and foreign keys
explicitly, which enables you to preserve the original data’s keys.
Option 2 allows AR to set the keys automatically. This is simpler,
though it requires an extra field if you want to preserve the
original primary key of each record.
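In option 2 that extra field is just another column on the new table
(a hypothetical legacy_id here), set from the old record inside the
CSV loop above:

  # Assumes a legacy_id column added to the foos table for this purpose.
  foo = Foo.new(:name => row[1])
  foo.legacy_id = row[0]   # the old primary key
  foo.save!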
Option 2 catches invalid records much earlier.
Conclusion
Option 1 is the only realistic option for ‘large’ volumes of data.
Option 2 is more elegant and easier to use, so I would prefer it
where practical (‘small’ volumes of data).
The boundary between small and large volumes of data depends on the
speed of your CPU, your patience and the length of time you would
spend writing your migration script: I have found in the past that
option 2 loads data more slowly, but that is outweighed by the time
saved by having fewer hoops to jump through.
Tip
Assuming you’re cleaning your data along the way, write unit tests
for every cleaning method you write. It’ll save you time and give
you confidence. And write some unit tests for your post-migration
MySQL database: choose a few records before you start and assert
they’re present and correct (via ActiveRecord) in your new database.
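For those post-migration checks a small Test::Unit case is enough.
The key and values below are placeholders for ones you pick out of
the legacy data before you start, and it assumes you kept the
legacy_id column from the earlier sketch:

  require 'test/unit'
  # Require ActiveRecord, your models and a connection here, exactly
  # as in lib/data_migration.rb above.

  class MigrationTest < Test::Unit::TestCase
    def test_known_record_survived_the_migration
      foo = Foo.find_by_legacy_id(42)   # placeholder key
      assert_not_nil foo
      assert_equal 'Alice', foo.name    # placeholder value
    end
  end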
Good luck!
Andy S.
[1] http://dev.mysql.com/doc/refman/4.1/en/loading-tables.html
[2] http://blog.hasmanythrough.com/articles/2006/08/27/validate-all-your-records
[3] http://www.continuousthinking.com/are/activerecord-extensions-0-0-5