Can you explain this benchmark?

fxn · May 15, 2006, 5:13pm

I want to load about 14000 words (a subset of /usr/share/dict/words)
into a MySQL table in a migration:

class CreateWords < ActiveRecord::Migration
def self.up
create_table :words, :force => true do |t|
t.column :word, :string
end
say_with_time ‘loading words…’ do
words = File.join(RAILS_ROOT, ‘db’, ‘words’)
Word.transaction do
File.foreach(words) do |line|
line.chomp!
Word.create!(:word => line)
end
end
end
end

def self.down
drop_table :words
end
end

That takes about 45 seconds, and a one-liner shows the time needed by
the foreach loop itself is negligible. The analogous SQL file with
inserts takes about 3 seconds to load passed directly to the mysql
client.

I have no problem with those 45 seconds, but I would like to know why
that’s that way. Do you know why there is so much difference? Is it
the driver?

– fxn

fxn · May 15, 2006, 5:58pm

On 5/15/06, Xavier N. [email protected] wrote:

   Word.transaction do
end
end

That takes about 45 seconds, and a one-liner shows the time needed by
the foreach loop itself is negligible. The analogous SQL file with
inserts takes about 3 seconds to load passed directly to the mysql
client.

I have no problem with those 45 seconds, but I would like to know why
that’s that way. Do you know why there is so much difference? Is it
the driver?

The ActiveRecord approach is doing far more work simple sql insert
statements. As well as the (relatively) negligible for loop and file
access, you’re creating a fully fledged model object (composed of many
other underlying objects) for each of 14000 words, not to mention the
overhead of running 14000 db statements through ruby. This is the
main reason its so much slower, not any inefficiencies in the driver.

Tom

fxn · May 15, 2006, 6:23pm

On May 15, 2006, at 17:55, Tom W. wrote:

The ActiveRecord approach is doing far more work simple sql insert
statements. As well as the (relatively) negligible for loop and file
access, you’re creating a fully fledged model object (composed of many
other underlying objects) for each of 14000 words, not to mention the
overhead of running 14000 db statements through ruby. This is the
main reason its so much slower, not any inefficiencies in the driver.

OK, I discarded that 14000 db statement through ruby count, this
simple script

require ‘mysql’

my = Mysql.connect(…)
my.autocommit(false)
st = my.prepare(“insert into words (word) values (?)”)
File.foreach(‘words’) do |line|
line.chomp!
st.execute(line)
end
my.commit

runs at the same speed that the SQL file. We seem to conclude that
the overhead comes from AR alone then.

– fxn

fxn · May 15, 2006, 6:33pm

Generally OR mapping tools such as ActiveRecord (or Hibernate, OJB etc
in Java) shouldn’t be used for batch transactions such as data migration
or other large scale data processing. Direct SQL would be much more
efficient and logical. These tools should be used as they were intended
for – an abstraction of a relational database into a data model for
interactive usage. If you’re thinking there’s something wrong with
ActiveRecord, there’s not, it’s just not meant for such kind job tasks
you described below.

Best tool for the job, not the same tool for every job

Xavier N. wrote:

simple script
my.commit

–
Sau S.

http://blog.saush.com
http://www.projectible.com
http://jaccal.sourceforge.net

fxn · May 15, 2006, 6:33pm

On May 15, 2006, at 18:28, Philip H. wrote:

Just for kicks, remove your autocommit/commit lines and see what
happens if you commit per insert. My understanding is this is
what active record does…

Good. It takes about 10 seconds (versus about 3 with autocommit off).

In the migration, if the loading is not wrapped in a transaction it
takes about 80 seconds instead of 45. I use a transaction to have as
a side effect autocommit off, a trick I learned the hard way loading
data into SQLite :-).

– fxn

fxn · May 15, 2006, 6:39pm

On May 15, 2006, at 18:31, Chang Sau S. wrote:

Best tool for the job, not the same tool for every job

Yeah, yeah, no problem with AR whatsoever. Just technical curiosity
that may lead to a more appropriate way of doing this in any case.

How do you load data like that in a way that feels natural working
with migrations? I mean, linked to the migration itself somehow,
portable, etc.?

– fxn

fxn · May 15, 2006, 6:27pm

require ‘mysql’
runs at the same speed that the SQL file. We seem to conclude that the
overhead comes from AR alone then.

Just for kicks, remove your autocommit/commit lines and see what happens
if you commit per insert. My understanding is this is what active
record does…

-philip

fxn · May 15, 2006, 7:07pm

On May 15, 2006, at 18:47, Chang Sau S. wrote:

But if you’re talking abt product migration, you will want to keep the
code, in which case I would suggest that you keep it where it’s
closest
to, that is, in the DB itself. For DBs with stored procedures, it
might
be worthwhile to write migration stored procedures, otherwise some
shell
scripts (or Ruby scripts, which I’ll fully recommend ) to execute
some predefined SQL statements.

I see thank you.

I was talking indeed about migrations as in Rails jargon, what you
run with rake db:migrate. Switching to execute() we go from ~45 s
down to ~18 s, better! I wonder whether that can be yet improved.

 say_with_time 'loading words...' do
   words = File.join(RAILS_ROOT, 'db', 'words')
   Word.transaction do
     suppress_messages do
       File.foreach(words) do |line|
         line.chomp!
         execute "insert into words (word) values ('#{line}')"
       end
     end
   end
 end

– fxn

fxn · May 15, 2006, 7:36pm

On May 15, 2006, at 10:06 AM, Xavier N. wrote:

I was talking indeed about migrations as in Rails jargon, what you
run with rake db:migrate. Switching to execute() we go from ~45 s
down to ~18 s, better! I wonder whether that can be yet improved.

The timing is faster, but this is not an improvement: you’ve just
thrown quoting and type casting out the window.

Someone posted recently about having Active Record generate rather
than execute SQL. This way, generating the batch is still costly but
executing its result is as fast as you’d expect.

Best,
jeremy

fxn · May 15, 2006, 6:49pm

This is from personal experience and might or might not be applicable
for your case, but … most data migrations that I’ve done are one-shots
meaning it’s a migration from one system (mostly legacy) to another.
Nothing is really re-usable because they are not likely to migrate again
The old system is de-commissioned and that’s the end of the story
(putting it simply of course – usually they keep the data ‘just in
case’ for a while).

But if you’re talking abt product migration, you will want to keep the
code, in which case I would suggest that you keep it where it’s closest
to, that is, in the DB itself. For DBs with stored procedures, it might
be worthwhile to write migration stored procedures, otherwise some shell
scripts (or Ruby scripts, which I’ll fully recommend ) to execute
some predefined SQL statements.

Just my 2 cents.

Xavier N. wrote:

Rails mailing list
[email protected]
http://lists.rubyonrails.org/mailman/listinfo/rails

–
Sau S.

http://blog.saush.com
http://www.projectible.com
http://jaccal.sourceforge.net

fxn · May 15, 2006, 8:38pm

On May 15, 2006, at 19:33, Jeremy K. wrote:

On May 15, 2006, at 10:06 AM, Xavier N. wrote:

I was talking indeed about migrations as in Rails jargon, what you
run with rake db:migrate. Switching to execute() we go from ~45 s
down to ~18 s, better! I wonder whether that can be yet improved.

The timing is faster, but this is not an improvement: you’ve just
thrown quoting and type casting out the window.

Yeah, but that it is not a fault of the approach itself, AFAICT it is
a limitation of execute(), which does not accept placeholders.

Someone posted recently about having Active Record generate rather
than execute SQL. This way, generating the batch is still costly
but executing its result is as fast as you’d expect.

Interesting, but how do you execute de generated SQL from within the
migration in a portable way?

– fxn

fxn · May 15, 2006, 7:55pm

On 5/15/06, Xavier N. [email protected] wrote:

I was talking indeed about migrations as in Rails jargon, what you
run with rake db:migrate. Switching to execute() we go from ~45 s
down to ~18 s, better! I wonder whether that can be yet improved.

I had to solve a similar problem. First, I concatenated all the ‘words’
files I could find into one, not worrying about duplicates. Then using a
migration, I create the ‘word’ table, load the data into a temp table,
then
move it into the real table with a select that orders the words
alphabetically. Then I ‘optimize’ and ‘analyze’ the table in mysql to
speed
up access to the records. It takes ~17s on my Athlon XP 1800 Linux box,
and
I’m left with a table that includes 335,171 words.

Here’s my migration:

class WordTable < ActiveRecord::Migration
class Word < ActiveRecord::Base; end

def self.up

Table: word

create_table :word do |t|
  t.primary_key :id
  t.column :word, :string, :limit => 128, :null => false
end

Load words the slow way

# load_words_ar

Load words the fast way

load_words_sql

end

def self.load_words_ar
f = File.new(‘/path/words.txt’)
begin
while(line = f.readline)
line.chomp
w = Word.new
w.word = line
w.save
end
rescue EOFError
f.close
end
end

def self.load_words_sql
begin
db = Word.connection.current_database
%x{mysql #{db} < ‘db/sql/load_words.sql’}
rescue
puts “Execution of ‘db/sql/load_words.sql’ failed.”
end
end

def self.down
drop_table :word
end

end

And here’s the SQL that is referenced above:

create temporary table t_word (
word varchar(128) not null,
id integer auto_increment not null,
primary key(id),
key x_word (word)
);

load data infile ‘/path/words.txt’
replace into table t_word;

delete from word;
insert into
word
select
id, word
from
t_word
order by
t_word.id;

drop table t_word;

alter table word add key x_word (word);
optimize table word;
analyze table word;

The temporary table stuff isn’t really necessary. The point is that the
‘load data infile’ command in MySQL is fastest way to get data into your
tables that I know of.

Daniel Butler