Forum: Ruby on Rails Commit every Nth INSERT in migration

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- and Ruby-related community platforms.
Jenda Krynicky (jendaperl)
on 2007-06-18 18:34
We are importing some data into the lookup tables in a MySQL database in
the migration and it's insanely slow. Even more than I'd expect from
Ruby. I extracted the import of a single table into a script and trimmed
the CSV down to some 5000 rows ... it still takes some 157 s. A raw
DBI+DBD::mysql import in Perl takes some 72 s with autocommit and 1-2 s
when committing only once every 100 rows.

I don't think I could match those numbers, since ActiveRecord does have
some overhead and especially since it doesn't prepare the INSERT
statement using placeholders, but I do think even Ruby can do better.

I tried the commit trick, but it seems Ruby doesn't like me the least
bit more than I like Ruby, and it doesn't work. What am I doing wrong?


require 'rubygems'
require 'active_record'
require 'yaml'
require 'logger'
require 'fastercsv'

dbconfig = YAML::load(File.open('M:\ecom\config\database.yml'))
ActiveRecord::Base.establish_connection(dbconfig['development'])
#ActiveRecord::Base.logger = Logger.new('zkSqlBatch2.log',
#  File::WRONLY | File::APPEND | File::CREAT)
#ActiveRecord::Base.logger.level = Logger::DEBUG

#ActiveRecord::Base.colorize_logging = false

class ZipCode < ActiveRecord::Base
end

ZipCode.delete_all

started = Time.new()

i = 0
ZipCode.connection.begin_db_transaction()

FasterCSV.foreach("M:/ecom/db/csv_lookup_data/postal_codes_some.csv",
:col_sep => "|") do |row|
  ZipCode.create(
    :postal_code => row[1] , :country_id => 159, :city => row[3],
:latitude => row[4],
    :longitude=> row[5],:state_id => row[6]
  )
  i+=1
  if (i > 100) then
    i = 0
    ZipCode.connection.commit_db_transaction
    ZipCode.connection.begin_db_transaction
  end
end
ZipCode.connection.commit_db_transaction

puts "Taken #{Time.new - started} s"



The Logger was there so that I could see what AR sends to the
database. BTW, I'd expect to be told that I have to set the logger
BEFORE AR connects to the database, since otherwise it'll be silently
ignored. Or maybe I would not. In either case I really wonder how I am
supposed to do that, e.g. in a migration.
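
For reference, a minimal sketch of the ordering in question: the same
standalone script as above, but with the logger assigned before
establish_connection (filename and level taken from the commented-out
lines; the File flags are dropped; untested):

require 'rubygems'
require 'active_record'
require 'yaml'
require 'logger'

dbconfig = YAML::load(File.open('M:\ecom\config\database.yml'))

# set the logger BEFORE establish_connection, otherwise it is silently ignored
ActiveRecord::Base.logger = Logger.new('zkSqlBatch2.log')
ActiveRecord::Base.logger.level = Logger::DEBUG

ActiveRecord::Base.establish_connection(dbconfig['development'])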

Jenda
Jenda Krynicky (jendaperl)
on 2007-06-19 15:18
Jenda Krynicky wrote:
> We are importing some data into the lookup tables in a MySQL database in
> the migration and it's insanely slow. Even more than I'd expect from
> Ruby. I extracted the import of a single table into a script and trimmed
> the CSV down to some 5000 rows ... it still takes some 157 s. A raw
> DBI+DBD::mysql import in Perl takes some 72 s with autocommit and 1-2 s
> when committing only once every 100 rows.

You'd think that the Rubyfolk would be eager to show me the light and
prove Ruby can do better, but it seems it can't. To give AR a fair
chance I tried to do the insert using DBIx::Class. Well ...

Importing a CSV with 2500 rows
Ruby&ActiveRecord: 137 s
Ruby&ActiveRecord + batches via object: 147 s
Ruby&ActiveRecord + batches directly: 148 s
Perl+DBIx::Class: 73 s
Perl+DBIx::Class + batches: 5 s
Perl+DBI: 62 s
Perl+DBI + batches: 1-2 s

Well ... I don't know. I heard Ruby is slow, but this slow?

Jenda
AB (Guest)
on 2007-06-19 16:13
Hey, here's your prize: you discovered that a fully fledged ORM written
in Ruby is slower than a lower-level DB access interface written in
Perl. Perhaps if you hadn't phrased your original question with things
like "the least bit more than I like Ruby", people would have been more
eager to help with your issue.

As it stands your post was petulant, trollish and ignores the
distinction between "Ruby" and "Rails". ActiveRecord is part of Rails.
If you find ActiveRecord to be slow I suggest the following libraries:

http://ruby-dbi.rubyforge.org/
http://tmtm.org/en/mysql/ruby/
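
For example, a raw insert loop with the MySQL/Ruby binding would look
roughly like this (a sketch only; the connection details, file name and
column order are placeholders, not tested):

require 'mysql'

my = Mysql.real_connect('localhost', 'user', 'password', 'ecom_development')

my.query("BEGIN")
i = 0
IO.foreach('postal_codes_some.csv') do |line|
  row = line.chomp.split('|')
  vals = [row[1], row[3], row[4], row[5], row[6]].map { |v|
    "'#{my.escape_string(v.to_s)}'"
  }
  my.query("INSERT INTO zip_codes
              (postal_code, city, latitude, longitude, state_id, country_id)
            VALUES (#{vals.join(',')}, 159)")
  if (i += 1) % 100 == 0     # commit every 100 rows, as in the Perl version
    my.query("COMMIT")
    my.query("BEGIN")
  end
end
my.query("COMMIT")
my.close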

Once you use the right tool for the right job and stop complaining that
your ride-on lawnmower keeps losing drag-races your life will be much
easier.

- AB

Jenda Krynicky (jendaperl)
on 2007-06-19 17:09
AB wrote:
> Hey, here's your prize: you discovered that a fully fledged ORM written
> in Ruby is slower than a lower-level DB access interface written in Perl.

Well, that'd be true for the Perl+DBI scripts, not the DBIx::Class ones.
DBIx::Class is an ORM for Perl, just like ActiveRecord is for Ruby.

Except that the authors of DBIx::Class cared about performance.

> Perhaps if you hadn't phrased your original question with things like
> "the least bit more than I like Ruby", people would have been more
> eager to help with your issue.
>
> As it stands your post was petulant, trollish and ignores the
> distinction between "Ruby" and "Rails". ActiveRecord is part of Rails.

Yep, that last point was really important. ActiveRecord is a library
that's part of a bigger bundle of libraries called Rails that is written
in a programming language called Ruby. What was your point again? And
how was I ignoring the "distinction"?

Sure, I'm an annoying troll. A troll that's backed up by numbers. You
know, that's the difference between the Perl and Ruby communities. The
Perl one would run to prove me wrong, not to help me, but to make sure
no one thinks Perl performs poorly, me being a troll or not. The Ruby
one ignores me. Maybe they don't feed the trolls. And maybe they CAN'T
prove me wrong.

I don't care that the import takes an hour, now that the PHB finally
understood that it's stupid to keep editing migrations, forcing everyone
to import and reimport daily. And I don't care that AR is dead-cow slow,
since I'm leaving the team in a month or so anyway. If you want to
continue using it, you'll have to force your companies to buy more and
better hardware to sustain Ruby and Rails.

> If you find ActiveRecord to be slow I suggest the following libraries:
>
> http://ruby-dbi.rubyforge.org/
> http://tmtm.org/en/mysql/ruby/
>
> Once you use the right tool for the right job and stop complaining that
> your ride-on lawnmower keeps losing drag-races your life will be much
> easier.
>
> - AB

If I could choose the right tool I would not be using Ruby.

Jenda
Hugh Sasse (Guest)
on 2007-06-19 17:45
(Received via mailing list)
On Mon, 18 Jun 2007, Jenda Krynicky wrote:

> Ruby. I extracted the import of a single table into a script and trimmed
> the CSV down to some 5000 rows ... it still takes some 157 s. A raw
> DBI+DBD::mysql import in Perl takes some 72 s with autocommit and 1-2 s
> when committing only once every 100 rows.

You may find that (Perl|Python|Lua|...) are quicker than Ruby for some
tasks. If performance is your biggest constraint then it may be worth
choosing the fastest tool for each bit, till the others catch up... But
is there anything else to be done?
        [...]
> dbconfig = YAML::load(File.open('M:\ecom\config\database.yml'))
[OT, but may be worth having a bit more error checking in there]

> ZipCode.delete_all
All that looks fairly normal to me...
>
> started = Time.new()
>
> i = 0
> ZipCode.connection.begin_db_transaction()

To squeeze this further, put the result of ZipCode.connection in
a variable.
>
> FasterCSV.foreach("M:/ecom/db/csv_lookup_data/postal_codes_some.csv",
> :col_sep => "|") do |row|

FasterCSV is certainly faster than CSV, but it may be quicker to just
split lines on /\|/, if they are sufficiently simple.

M: -- speed of access to that drive isn't a problem?  If you've tried
Perl etc then probably not.  (Rhetorical question, BYTK)

>   end
> end
> ZipCode.connection.commit_db_transaction
>
> puts "Taken #{Time.new - started} s"
>

As this is ruby code, I suggest you run it with -rprofile to find
out which bits are actually slow.  The profiler will slow things down
of course, but you'll at least see where the bottlenecks are so you
can work on those.

        Hugh
Jenda Krynicky (jendaperl)
on 2007-06-19 18:22
Hugh Sasse wrote:
> On Mon, 18 Jun 2007, Jenda Krynicky wrote:
>> i = 0
>> ZipCode.connection.begin_db_transaction()
>
> To squeeze this further, put the result of ZipCode.connection in
> a variable.

Well I could, but that's not the problem. The problem seems to be that
this is completely ignored. According to the docs, begin_db_transaction()
should turn autocommit off for the connection and run a "BEGIN
TRANSACTION" or whatever the correct syntax is for the database in
question. That doesn't seem to be the case, as the logfile looks like
this:

# Logfile created on Tue Jun 19 15:02:25 +0200 2007 by logger.rb/1.5.2.9
SQL (0.031000)  COMMIT
SQL (0.000000)  BEGIN
SQL (0.000000)  INSERT INTO zip_codes (`city`, `latitude`,
`postal_code`, `country_id`, `longitude`, `state_id`)
VALUES('Daysville', '41.99', '61061', 159, '-89.29', 13)
SQL (0.031000)  COMMIT
SQL (0.000000)  BEGIN
SQL (0.000000)  INSERT INTO zip_codes (`city`, `latitude`,
`postal_code`, `country_id`, `longitude`, `state_id`) VALUES(NULL, NULL,
NULL, 159, NULL, NULL)
SQL (0.016000)  COMMIT
...
SQL (0.000000)  BEGIN
SQL (0.000000)  INSERT INTO zip_codes (`city`, `latitude`,
`postal_code`, `country_id`, `longitude`, `state_id`) VALUES('De Land',
'29.05', '32724', 159, '-81.25', 9)
SQL (0.031000)  COMMIT
SQL (0.000000)  COMMIT
SQL (0.000000)  BEGIN
SQL (0.000000)  BEGIN
SQL (0.000000)  INSERT INTO zip_codes (`city`, `latitude`,
`postal_code`, `country_id`, `longitude`, `state_id`) VALUES(NULL, NULL,
NULL, 159, NULL, NULL)
...

Which, considering that MySQL (5.0.27) doesn't support nested
transactions, is the problem. Look at the numbers earlier in the thread
... committing only every 100 rows makes a huge difference. I don't
really mind if Ruby+ActiveRecord is twice as slow as Perl+DBIx::Class
(well, whatever), but I do mind that it's 30 times slower.


>> FasterCSV.foreach("M:/ecom/db/csv_lookup_data/postal_codes_some.csv",
>> :col_sep => "|") do |row|
>
> FasterCSV is certainly faster than CSV, but it may be quicker to just
> split lines on /\|/, if they are sufficiently simple.

Yep, that'd work in this case.

> M: -- speed of access to that drive isn't a problem?  If you've tried
> Perl etc then probably not.  (Rhetorical question, BYTK)

That's why I tried the raw low-level Perl+DBI. Which also makes clear
how huge a difference the batch commit makes.

>>   end
>> end
>> ZipCode.connection.commit_db_transaction
>>
>> puts "Taken #{Time.new - started} s"
>>
>
> As this is ruby code, I suggest you run it with -rprofile to find
> out which bits are actually slow.  The profiler will slow things down
> of course, but you'll at least see where the bottlenecks are so you
> can work on those.
>
>         Hugh

The bottleneck is in the way the database is used. Some things can't be
helped without making huge changes to ActiveRecord (building the
statements once with placeholders, preparing them and then executing
them, instead of building a SQL statement including the values every
time); others should be doable (controlling the commit).

While the second would help me in this case (and, if used, others that
need to import some more data), the first would help everyone using
Rails. Just think about the amount of work you force the database to do
by throwing a different SQL statement at it every time. All the
execution plan caching goes down the drain, the amount of data sent
between the application and the database is much higher, ... :-(
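
To illustrate what I mean, a rough sketch using the ruby-dbi library AB
linked above (the DSN, credentials and column order are assumptions, I
haven't benchmarked it, and with the MySQL driver the placeholders may
still be filled in on the client side, so the real win here is the
batched commit):

require 'dbi'

dbh = DBI.connect('DBI:Mysql:ecom_development:localhost', 'user', 'password')
dbh['AutoCommit'] = false

# prepare the statement once, then execute it with new values for every row
sth = dbh.prepare('INSERT INTO zip_codes
                     (postal_code, country_id, city, latitude, longitude, state_id)
                   VALUES (?, ?, ?, ?, ?, ?)')

i = 0
IO.foreach('postal_codes_some.csv') do |line|
  row = line.chomp.split('|')
  sth.execute(row[1], 159, row[3], row[4], row[5], row[6])
  dbh.commit if (i += 1) % 100 == 0      # commit every 100 rows
end
dbh.commit
sth.finish
dbh.disconnect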

Jenda
Trevor Squires (Guest)
on 2007-06-19 18:52
(Received via mailing list)
Hey,

Sorry if this has already been mentioned.

If you manually create a couple of zip codes like this:

ZipCode.transaction do
  ZipCode.create(:foo => 'data', :bar => 'data')
  ZipCode.create(:foo => 'data', :bar => 'data')
end

The logs will show you that only one transaction was issued.

This is (in Rails) by far the most common strategy for taking control
of your transactions: passing a block to transaction().

To take advantage of it you're going to have to partition your input
CSV rows into groups of 100, something like this:

all_your_csv_input_rows.each_slice(100) do |batch|
  ZipCode.transaction do
    batch.each do |row|
      ZipCode.create(:foo => row[1], :bar => row[2])
    end
  end
end
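
Wired into the script from the first post it would be roughly this
(untested; the column mapping is copied from that post, and each_slice
needs the enumerator library on Ruby before 1.8.7):

require 'enumerator'   # Enumerable#each_slice on 1.8.6
require 'fastercsv'

rows = FasterCSV.read("M:/ecom/db/csv_lookup_data/postal_codes_some.csv",
                      :col_sep => "|")

rows.each_slice(100) do |batch|
  ZipCode.transaction do                 # one BEGIN/COMMIT per 100 rows
    batch.each do |row|
      ZipCode.create(
        :postal_code => row[1], :country_id => 159, :city => row[3],
        :latitude => row[4], :longitude => row[5], :state_id => row[6]
      )
    end
  end
end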

HTH,
Trevor



--
Trevor Squires
http://somethinglearned.com
Bob Showalter (Guest)
on 2007-06-19 20:50
(Received via mailing list)
On 6/18/07, Jenda Krynicky <rails-mailing-list@andreas-s.net> wrote:
> statement using placeholders, but I do think even Ruby can do better.
Using AR create() is just horribly slow. You'll get a significant
improvement by batching like Trevor showed (not sure why the way you
were doing it didn't work).

But for batch loads, you don't need all the extra stuff AR is doing. I
would suggest building up your own SQL, which isn't hard to do. You
can still use the AR connection.

Try something along these lines:

  BATCH_SIZE = 250
  cols = "postal_code,country_id,city,latitude,longitude,state_id"
  conn = ZipCode.connection
  FasterCSV.read('input.csv').each_slice(BATCH_SIZE) do |batch|
    values = batch.map {|row| "(#{row.map {|col| conn.quote(col)}.join(',')})"}
    sql = "insert into zip_codes (#{cols}) values #{values.join(',')}"
    conn.execute sql
  end

I'm just building up a properly quoted insert statement (multiple rows
inserted in one statement, which is MySQL-specific, but hey...)

My own testing shows this to be about 12-15x faster than using AR
create() in batches of 250 rows, and 50-60x faster than AR create()
with commits after each row.

So AR sucks and Ruby sucks or whatever. But something like this will
get your data loaded quickly. If you need to load truly huge amounts
of data you're probably better off using database specific bulk
loading tools.
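
For MySQL that would be LOAD DATA INFILE, which you can still fire
through the AR connection. A rough sketch, assuming the file's column
order matches the column list (this particular CSV would probably need
user variables and a SET clause for the constant country_id) and that
LOCAL infile is enabled on both client and server:

  conn = ZipCode.connection
  conn.execute(<<-SQL)
    LOAD DATA LOCAL INFILE 'M:/ecom/db/csv_lookup_data/postal_codes_some.csv'
    INTO TABLE zip_codes
    FIELDS TERMINATED BY '|'
    (postal_code, country_id, city, latitude, longitude, state_id)
  SQL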
Jenda Krynicky (jendaperl)
on 2007-06-20 14:55
Bob Showalter wrote:
> My own testing shows this to be about 12-15x faster than using AR
> create() in batches of 250 rows, and 50-60x faster than AR create()
> with commits after each row.

Sweet :-)

Thanks, it made a huge difference!

Jenda