Forum: Ruby on Rails Commit every Nth INSERT in migration

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- and Ruby-related community platforms.
Jenda Krynicky (jendaperl)
on 2007-06-18 18:34
We are importing some data into the lookup tables in a MySQL database in
the migration and it's insanely slow. Even more than I'd expect from
Ruby. I extracted the import of a single table into a script and trimmed
the CSV down to some 5000 rows ... it still takes some 157 s. A raw
DBI+DBD::mysql import in Perl takes some 72 s with autocommit and 1-2 s
when committing only once every 100 rows.

I don't think I could match those numbers, since ActiveRecord does have
some overhead and especially since it doesn't prepare the INSERT
statement using placeholders, but I do think even Ruby can do better.

I tried the commit trick, but it seems Ruby doesn't like me the least
bit more than I like Ruby, and it doesn't work. What am I doing wrong?


require 'rubygems'
require 'active_record'
require 'yaml'
require 'logger'
require 'fastercsv'

dbconfig = YAML::load(File.open('M:\ecom\config\database.yml'))
ActiveRecord::Base.establish_connection(dbconfig['development'])
#ActiveRecord::Base.logger = Logger.new('zkSqlBatch2.log',
#  File::WRONLY | File::APPEND | File::CREAT)
#ActiveRecord::Base.logger.level = Logger::DEBUG

#ActiveRecord::Base.colorize_logging = false

class ZipCode < ActiveRecord::Base
end

ZipCode.delete_all

started = Time.new()

i = 0
ZipCode.connection.begin_db_transaction()

FasterCSV.foreach("M:/ecom/db/csv_lookup_data/postal_codes_some.csv",
:col_sep => "|") do |row|
  ZipCode.create(
    :postal_code => row[1] , :country_id => 159, :city => row[3],
:latitude => row[4],
    :longitude=> row[5],:state_id => row[6]
  )
  i+=1
  if (i > 100) then
    i = 0
    ZipCode.connection.commit_db_transaction
    ZipCode.connection.begin_db_transaction
  end
end
ZipCode.connection.commit_db_transaction

puts "Taken #{Time.new - started} s"



The Logger was there so that I could see what AR sends to the
database. BTW, I'd expect to be told that I have to set the logger
BEFORE AR connects to the database, since otherwise it'll be silently
ignored. Or maybe I would not. In either case I really wonder how I am
supposed to do that, e.g. in a migration.
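
For reference, a minimal sketch of the ordering in question: the same
standalone script as above, but with the logger assigned before
establish_connection (filename and level taken from the commented-out
lines; the File flags are dropped; untested):

require 'rubygems'
require 'active_record'
require 'yaml'
require 'logger'

dbconfig = YAML::load(File.open('M:\ecom\config\database.yml'))

# set the logger BEFORE establish_connection, otherwise it is silently ignored
ActiveRecord::Base.logger = Logger.new('zkSqlBatch2.log')
ActiveRecord::Base.logger.level = Logger::DEBUG

ActiveRecord::Base.establish_connection(dbconfig['development'])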

Jenda
Jenda Krynicky (jendaperl)
on 2007-06-19 15:18
Jenda Krynicky wrote:
> We are importing some data into the lookup tables in a MySQL database in
> the migration and it's insanely slow. Even more than I'd expect from
> Ruby. I extracted the import of a single table into a script and trimmed
> the CSV down to some 5000 rows ... it still takes some 157 s. A raw
> DBI+DBD::mysql import in Perl takes some 72 s with autocommit and 1-2 s
> when committing only once every 100 rows.

You'd think that the Rubyfolk would be eager to show me the light and
prove Ruby can do better, but it seems it can't. To give AR a fair
chance I tried to do the insert using DBIx::Class. Well ...

Importing a CSV with 2500 rows
Ruby&ActiveRecord: 137 s
Ruby&ActiveRecord + batches via object: 147 s
Ruby&ActiveRecord + batches directly: 148 s
Perl+DBIx::Class: 73 s
Perl+DBIx::Class + batches: 5 s
Perl+DBI: 62 s
Perl+DBI + batches: 1-2 s

Well ... I don't know. I heard Ruby is slow, but this slow?

Jenda
AB (Guest)
on 2007-06-19 16:13
Hey, here's your prize: you discovered that a fully fledged ORM written
in Ruby is slower than a lower-level DB access interface written in
Perl. Perhaps if you hadn't phrased your original question with things
like "the least bit more than I like Ruby", people would have been more
eager to help with your issue.

As it stands your post was petulant, trollish and ignores the
distinction between "Ruby" and "Rails". ActiveRecord is part of Rails.
If you find ActiveRecord to be slow I suggest the following libraries:

http://ruby-dbi.rubyforge.org/
http://tmtm.org/en/mysql/ruby/
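
For example, a raw insert loop with the MySQL/Ruby binding would look
roughly like this (a sketch only; the connection details, file name and
column order are placeholders, not tested):

require 'mysql'

my = Mysql.real_connect('localhost', 'user', 'password', 'ecom_development')

my.query("BEGIN")
i = 0
IO.foreach('postal_codes_some.csv') do |line|
  row = line.chomp.split('|')
  vals = [row[1], row[3], row[4], row[5], row[6]].map { |v|
    "'#{my.escape_string(v.to_s)}'"
  }
  my.query("INSERT INTO zip_codes
              (postal_code, city, latitude, longitude, state_id, country_id)
            VALUES (#{vals.join(',')}, 159)")
  if (i += 1) % 100 == 0     # commit every 100 rows, as in the Perl version
    my.query("COMMIT")
    my.query("BEGIN")
  end
end
my.query("COMMIT")
my.close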

Once you use the right tool for the right job and stop complaining that
your ride-on lawnmower keeps losing drag-races your life will be much
easier.

- AB

Jenda Krynicky (jendaperl)
on 2007-06-19 17:09
AB wrote:
> Hey, here's your prize: you discovered that a fully fledged ORM written
> in Ruby is slower than a lower-level DB access interface written in Perl.

Well, that'd be true for the Perl+DBI scripts, not the DBIx::Class ones.
DBIx::Class is an ORM for Perl, just like ActiveRecord is for Ruby.

Except that the authors of DBIx::Class cared about performance.

> Perhaps if you hadn't phrased your original question with things like
> "the least bit more than I like Ruby", people would have been more
> eager to help with your issue.
>
> As it stands your post was petulant, trollish and ignores the
> distinction between "Ruby" and "Rails". ActiveRecord is part of Rails.

Yep, that last point was really important. ActiveRecord is a library
that's part of a bigger bundle of libraries called Rails that is written
in a programming language called Ruby. What was your point again? And
how was I ignoring the "distinction"?

Sure, I'm an annoying troll. A troll that's backed up by numbers. You
know, that's the difference between the Perl and Ruby communities. The
Perl one would run to prove me wrong, not to help me, but to make sure
no one thinks Perl performs poorly, me being a troll or not. The Ruby
one ignores me. Maybe they don't feed the trolls. And maybe they CAN'T
prove me wrong.

I don't care that the import takes an hour, now that the PHB finally
understood that it's stupid to keep editing migrations, forcing everyone
to import and reimport daily. And I don't care that AR is dead-cow slow,
since I'm leaving the team in a month or so anyway. If you want to
continue using it, you'll have to force your companies to buy more and
better hardware to sustain Ruby and Rails.

> If you find ActiveRecord to be slow I suggest the following libraries:
>
> http://ruby-dbi.rubyforge.org/
> http://tmtm.org/en/mysql/ruby/
>
> Once you use the right tool for the right job and stop complaining that
> your ride-on lawnmower keeps losing drag-races your life will be much
> easier.
>
> - AB

If I could choose the right tool I would not be using Ruby.

Jenda
Hugh Sasse (Guest)
on 2007-06-19 17:45
(Received via mailing list)
On Mon, 18 Jun 2007, Jenda Krynicky wrote:

> Ruby. I extracted the import of a single table into a script and trimmed
> the CSV down to some 5000 rows ... it still takes some 157 s. A raw
> DBI+DBD::mysql import in Perl takes some 72 s with autocommit and 1-2 s
> when committing only once every 100 rows.

You may find that (Perl|Python|Lua|...) are quicker than Ruby for some
tasks. If performance is your biggest constraint then it may be worth
choosing the fastest tool for each bit, till the others catch up... But
is there anything else to be done?
        [...]
> dbconfig = YAML::load(File.open('M:\ecom\config\database.yml'))
[OT, but may be worth having a bit more error checking in there]

> ZipCode.delete_all
All that looks fairly normal to me...
>
> started = Time.new()
>
> i = 0
> ZipCode.connection.begin_db_transaction()

To squeeze this further, put the result of ZipCode.connection in
a variable.
>
> FasterCSV.foreach("M:/ecom/db/csv_lookup_data/postal_codes_some.csv",
> :col_sep => "|") do |row|

FasterCSV is certainly faster than CSV, but it may be quicker to just
split lines on /\|/, if they are sufficiently simple.

M: -- speed of access to that drive isn't a problem?  If you've tried
Perl etc then probably not.  (Rhetorical question, BYTK)

>   end
> end
> ZipCode.connection.commit_db_transaction
>
> puts "Taken #{Time.new - started} s"
>

As this is ruby code, I suggest you run it with -rprofile to find
out which bits are actually slow.  The profiler will slow things down
of course, but you'll at least see where the bottlenecks are so you
can work on those.

        Hugh
Jenda Krynicky (jendaperl)
on 2007-06-19 18:22
Hugh Sasse wrote:
> On Mon, 18 Jun 2007, Jenda Krynicky wrote:
>> i = 0
>> ZipCode.connection.begin_db_transaction()
>
> To squeeze this further, put the result of ZipCode.connection in
> a variable.

Well I could, but that's not the problem. The problem seems to be that
this is completely ignored. According to the docs, begin_db_transaction()
should turn autocommit off for the connection and run a "BEGIN
TRANSACTION" or whatever the correct syntax is for the database in
question. That doesn't seem to be the case, as the logfile looks like
this:

# Logfile created on Tue Jun 19 15:02:25 +0200 2007 by logger.rb/1.5.2.9
SQL (0.031000)  COMMIT
SQL (0.000000)  BEGIN
SQL (0.000000)  INSERT INTO zip_codes (`city`, `latitude`,
`postal_code`, `country_id`, `longitude`, `state_id`)
VALUES('Daysville', '41.99', '61061', 159, '-89.29', 13)
SQL (0.031000)  COMMIT
SQL (0.000000)  BEGIN
SQL (0.000000)  INSERT INTO zip_codes (`city`, `latitude`,
`postal_code`, `country_id`, `longitude`, `state_id`) VALUES(NULL, NULL,
NULL, 159, NULL, NULL)
SQL (0.016000)  COMMIT
...
SQL (0.000000)  BEGIN
SQL (0.000000)  INSERT INTO zip_codes (`city`, `latitude`,
`postal_code`, `country_id`, `longitude`, `state_id`) VALUES('De Land',
'29.05', '32724', 159, '-81.25', 9)
SQL (0.031000)  COMMIT
SQL (0.000000)  COMMIT
SQL (0.000000)  BEGIN
SQL (0.000000)  BEGIN
SQL (0.000000)  INSERT INTO zip_codes (`city`, `latitude`,
`postal_code`, `country_id`, `longitude`, `state_id`) VALUES(NULL, NULL,
NULL, 159, NULL, NULL)
...

Which, considering that MySQL (5.0.27) doesn't support nested
transactions, is the problem. Look at the numbers earlier in the thread
... committing only every 100 rows makes a huge difference. I don't
really mind if Ruby+ActiveRecord is twice as slow as Perl+DBIx::Class
(well, whatever), but I do mind that it's 30 times slower.


>> FasterCSV.foreach("M:/ecom/db/csv_lookup_data/postal_codes_some.csv",
>> :col_sep => "|") do |row|
>
> FasterCSV is certainly faster than CSV, but it may be quicker to just
> split lines on /\|/, if they are sufficiently simple.

Yep, that'd work in this case.

> M: -- speed of access to that drive isn't a problem?  If you've tried
> Perl etc then probably not.  (Rhetorical question, BYTK)

That's why I tried the raw low-level Perl+DBI. Which also makes clear
how huge a difference the batch commit makes.

>>   end
>> end
>> ZipCode.connection.commit_db_transaction
>>
>> puts "Taken #{Time.new - started} s"
>>
>
> As this is ruby code, I suggest you run it with -rprofile to find
> out which bits are actually slow.  The profiler will slow things down
> of course, but you'll at least see where the bottlenecks are so you
> can work on those.
>
>         Hugh

The bottleneck is in the way the database is used. Some things can't be
helped without making huge changes to ActiveRecord (building the
statements once with placeholders, preparing them and then executing
them, instead of building a SQL statement including the values every
time); others should be doable (controlling the commit).

While the second would help me in this case (and, if used, others that
need to import some more data), the first would help everyone using
Rails. Just think about the amount of work you force the database to do
by throwing a different SQL statement at it every time. All the
execution plan caching goes down the drain, the amount of data sent
between the application and the database is much higher, ... :-(
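
To illustrate what I mean, a rough sketch using the ruby-dbi library AB
linked above (the DSN, credentials and column order are assumptions, I
haven't benchmarked it, and with the MySQL driver the placeholders may
still be filled in on the client side, so the real win here is the
batched commit):

require 'dbi'

dbh = DBI.connect('DBI:Mysql:ecom_development:localhost', 'user', 'password')
dbh['AutoCommit'] = false

# prepare the statement once, then execute it with new values for every row
sth = dbh.prepare('INSERT INTO zip_codes
                     (postal_code, country_id, city, latitude, longitude, state_id)
                   VALUES (?, ?, ?, ?, ?, ?)')

i = 0
IO.foreach('postal_codes_some.csv') do |line|
  row = line.chomp.split('|')
  sth.execute(row[1], 159, row[3], row[4], row[5], row[6])
  dbh.commit if (i += 1) % 100 == 0      # commit every 100 rows
end
dbh.commit
sth.finish
dbh.disconnect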

Jenda
Trevor Squires (Guest)
on 2007-06-19 18:52
(Received via mailing list)
Hey,

Sorry if this has already been mentioned.

If you manually create a couple of zip codes like this:

ZipCode.transaction do
  ZipCode.create(:foo => 'data', :bar => 'data')
  ZipCode.create(:foo => 'data', :bar => 'data')
end

The logs will show you that only one transaction was issued.

This is (in Rails) by far the most common strategy for taking control
of your transactions: passing a block to transaction().

To take advantage of it you're going to have to partition your input
CSV rows into groups of 100, something like this:

all_your_csv_input_rows.each_slice(100) do |batch|
  ZipCode.transaction do
    batch.each do |row|
      ZipCode.create(:foo => row[1], :bar => row[2])
    end
  end
end
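
Wired into the script from the first post it would be roughly this
(untested; the column mapping is copied from that post, and each_slice
needs the enumerator library on Ruby before 1.8.7):

require 'enumerator'   # Enumerable#each_slice on 1.8.6
require 'fastercsv'

rows = FasterCSV.read("M:/ecom/db/csv_lookup_data/postal_codes_some.csv",
                      :col_sep => "|")

rows.each_slice(100) do |batch|
  ZipCode.transaction do                 # one BEGIN/COMMIT per 100 rows
    batch.each do |row|
      ZipCode.create(
        :postal_code => row[1], :country_id => 159, :city => row[3],
        :latitude => row[4], :longitude => row[5], :state_id => row[6]
      )
    end
  end
end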

HTH,
Trevor



--
Trevor Squires
http://somethinglearned.com
Bob Showalter (Guest)
on 2007-06-19 20:50
(Received via mailing list)
On 6/18/07, Jenda Krynicky <rails-mailing-list@andreas-s.net> wrote:
> statement using placeholders, but I do think even Ruby can do better.
Using AR create() is just horribly slow. You'll get a significant
improvement by batching like Trevor showed (not sure why the way you
were doing it didn't work).

But for batch loads, you don't need all the extra stuff AR is doing. I
would suggest building up your own SQL, which isn't hard to do. You
can still use the AR connection.

Try something along these lines:

  BATCH_SIZE = 250
  cols = "postal_code,country_id,city,latitude,longitude,state_id"
  conn = ZipCode.connection
  FasterCSV.read('input.csv').each_slice(BATCH_SIZE) do |batch|
    values = batch.map {|row| "(#{row.map {|col| conn.quote(col)}.join(',')})"}
    sql = "insert into zip_codes (#{cols}) values #{values.join(',')}"
    conn.execute sql
  end

I'm just building up a properly quoted insert statement (multiple rows
inserted in one statement, which is MySQL-specific, but hey...)

My own testing shows this to be about 12-15x faster than using AR
create() in batches of 250 rows, and 50-60x faster than AR create()
with commits after each row.

So AR sucks and Ruby sucks or whatever. But something like this will
get your data loaded quickly. If you need to load truly huge amounts
of data you're probably better off using database specific bulk
loading tools.
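
For MySQL that would be LOAD DATA INFILE, which you can still fire
through the AR connection. A rough sketch, assuming the file's column
order matches the column list (this particular CSV would probably need
user variables and a SET clause for the constant country_id) and that
LOCAL infile is enabled on both client and server:

  conn = ZipCode.connection
  conn.execute(<<-SQL)
    LOAD DATA LOCAL INFILE 'M:/ecom/db/csv_lookup_data/postal_codes_some.csv'
    INTO TABLE zip_codes
    FIELDS TERMINATED BY '|'
    (postal_code, country_id, city, latitude, longitude, state_id)
  SQL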
Jenda Krynicky (jendaperl)
on 2007-06-20 14:55
Bob Showalter wrote:
> My own testing shows this to be about 12-15x faster than using AR
> create() in batches of 250 rows, and 50-60x faster than AR create()
> with commits after each row.

Sweet :-)

Thanks, it made a huge difference!

Jenda