Updating the db 6000 times will take a few minutes?


#21

Jian L. wrote:

seconds.
http://stackoverflow.com/questions/843524/updating-the-db-6000-times-will-take-few-minutes

but i just want to use the most basic way to do it… such as by turning
off the force-write of sqlite3

If you have made the change to count up words first and then insert
all the (word,count) records into the database in a single transaction
then it ought to take less than a second. I would expect that the total
number of (word,count) records is on the order of hundreds or perhaps a
thousand or so? Any decent DB ought to insert that in under a second as
a single transaction.
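A minimal sketch of that pattern (assuming a Word model with word and
count columns; the names are illustrative):

# Count occurrences in memory first...
counts = Hash.new(0)
words.each { |w| counts[w] += 1 }

# ...then write every (word, count) record inside one transaction,
# so the database only has to commit once.
Word.transaction do
  counts.each do |word, count|
    Word.create(:word => word, :count => count)
  end
end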

If it is still taking minutes then you are probably not doing the above
somehow. I think in that case you need to post your code again so we can
see what it is doing now.

Cheers,
Gary.


#22

Jian L. wrote:

    recordPhrase = Phrase.new(:s => phrase, :frequency =>
      frequencies[phrase], :length => lengths[phrase])
    recordPhrase.save
  end
end

i am using “Phrase” instead of “Word”… but it is the same thing…

all_phrases.length is about 34000 for all data…

I’m not sure how you get 34000 records from 6000 words. Surely you
should get fewer, not more.

it would run for at least a minute… that’s kind of weird…

Are you sure it is just the above code that is taking a minute and not
the bit of code that counts the words?

What hardware is this on exactly?

Cheers,
Gary.


#23

Jian L. wrote:

i was doing it like this:

all_phrases = frequencies.keys
Phrase.transaction do
  all_phrases.each do |phrase|
    recordPhrase = Phrase.new(:s => phrase, :frequency =>
      frequencies[phrase], :length => lengths[phrase])
    recordPhrase.save
  end
end

by the way, the table has indexes on all fields: s, frequency, and
length.


#24

Jian L. wrote:

    recordPhrase.save
  end
end

by the way, the table has indexes on all fields: s, frequency, and
length.

That’ll slow down inserts/updates, but not by a huge amount on such a
small table.


#25

Gary D. wrote:

I’m not sure how you get 34000 records from 6000 words. Surely you
should get fewer, not more.

it would run for at least a minute… that’s kind of weird…

Are you sure it is just the above code that is taking a minute and not
the bit of code that counts the words?

What hardware is this on exactly?

it is 34000 records because i am actually counting up phrases instead of
words… for example, for “a quick brown fox jumps over the lazy dog” i
will actually count phrases such as

“a quick”
“a quick brown”
“a quick brown fox”

etc… so the final result is a hash with 34000 entries, mapping phrases
to frequencies and word counts.
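for one sentence, the counting might look something like this (a sketch;
it only builds the leading prefixes, the way my example above does):

words = "a quick brown fox jumps over the lazy dog".split

frequencies = Hash.new(0)
lengths     = {}

# every prefix of two or more words: "a quick", "a quick brown", ...
2.upto(words.length) do |n|
  phrase = words[0, n].join(" ")
  frequencies[phrase] += 1
  lengths[phrase] = n   # length = number of words, not characters
end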

the hardware should be pretty good… it is an HP TouchSmart IQ804…
with a Core 2 Duo and an ST3320820AS hard drive, which is 7200rpm with
an 8MB buffer.

hm… that makes me wonder: if the db is just 4MB, how come the hard
drive’s 8MB buffer didn’t handle it entirely in RAM and make it super
fast? did it actually force-write to the physical disc?


#26

Michael S. wrote:

On Sunday 10 May 2009, Jian L. wrote:

Consider

Phrase.transaction do
  frequencies.each do |phrase, freq|
    Phrase.create!(:s => phrase, :frequency => freq)
  end
end

Hash#each passes keys and values to your block and avoids unnecessary
lookups.

My snippet above doesn’t take into account your :length =>
lengths[phrase] and that is as it should be. Presumably, lengths[phrase]
== phrase.length. Then

class Phrase < ActiveRecord::Base
  attr_protected :length

  def s=(value)
    self[:s] = value   # store the attribute itself before deriving length
    self.length = value.length
  end
end

would be much cleaner code because it puts responsibility for setting
the length attribute where it belongs.

i changed it to

time_start = Time.now
Phrase.transaction do
  all_phrases.each do |phrase|
    recordPhrase = Phrase.create!(:s => phrase, :frequency =>
      frequencies[phrase], :length => lengths[phrase])
  end
end
p "took ", Time.now - time_start, " seconds to finish"

but it is still the same: it took 75 seconds…

i wanted the length, which is the count of words, because in the future
i might want to run a query such as “select * where length > 3”, and if
i had to compute the word count from “s” each time, it would be really
slow, wouldn’t it? if length is stored and indexed, then “length > 3”
can be super fast?


#27

On Sunday 10 May 2009, Jian L. wrote:

Consider

Phrase.transaction do
  frequencies.each do |phrase, freq|
    Phrase.create!(:s => phrase, :frequency => freq)
  end
end

Hash#each passes keys and values to your block and avoids unnecessary
lookups.

My snippet above doesn’t take into account your :length =>
lengths[phrase] and that is as it should be. Presumably, lengths[phrase]
== phrase.length. Then

class Phrase < ActiveRecord::Base
  attr_protected :length

  def s=(value)
    self[:s] = value   # store the attribute itself before deriving length
    self.length = value.length
  end
end

would be much cleaner code because it puts responsibility for setting
the length attribute where it belongs.
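For example (a sketch, assuming the writer above):

ph = Phrase.new(:s => "a quick brown")
ph.length   # => 13; use value.split.length in the writer if you want a
            #    word count instead of a character count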

Michael


Michael S.
mailto:removed_email_address@domain.invalid
http://www.schuerig.de/michael/


#28

Gary D. wrote:

Jian L. wrote:

but it is still the same: it took 75 seconds…

That’s because it’s still really the same code!

i wanted the length, which is the count of words, because in the future
i might want to run a query such as “select * where length > 3”, and if
i had to compute the word count from “s” each time, it would be really
slow, wouldn’t it? if length is stored and indexed, then “length > 3”
can be super fast?

Just because it has an index on it doesn’t automatically make it super
fast. It depends largely on the selectivity of the length field. I
suspect that most of your phrases are larger than 3 characters and so
such a query will result in a full table scan anyway. In that case
making the table smaller by leaving out the length may actually make it
faster. However, selecting all the phrases where length < 5 would almost
certainly use the index and make it faster. Only you know what you are
likely to do in general here so you need to decide (and test) whether it
is better to have the index and length column or not.

You can always do “select * from phrases where length(s) > 3” or
something like that.

Are you sure your overall run time is not limited by CPU rather than IO?
How much CPU time is used to run your code?

oh sorry i should have clarified… the length is actually the count of
words… so length > 3 means at least 3 words. or i might do a query for
length > 5 or 7… so will having a column and an index make it fast if
i do a query such as length > 7? i intend, for example, to get back
only 5% or 3% of all records, so the index might make it fast.
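for example, i imagine the query would be something like this (a sketch
in Rails 2.x finder syntax; 7 is just an example cutoff):

# phrases of more than 7 words; if only 3-5% of rows qualify,
# the index on length should make this selective and fast
long_phrases = Phrase.find(:all, :conditions => ["length > ?", 7])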

by the way, the 75 seconds is measured from the moment the

Phrase.transaction do

to the end of it… so… i think during that time it is mostly IO
time…

let me actually write up a test case with some dummy data to simulate it
so that everyone is on the same ground…


#29

Jian L. wrote:

but it is still the same: it took 75 seconds…

That’s because it’s still really the same code!

i wanted the length, which is the count of words, because in the future
i might want to run a query such as “select * where length > 3”, and if
i had to compute the word count from “s” each time, it would be really
slow, wouldn’t it? if length is stored and indexed, then “length > 3”
can be super fast?

Just because it has an index on it doesn’t automatically make it super
fast. It depends largely on the selectivity of the length field. I
suspect that most of your phrases are larger than 3 characters and so
such a query will result in a full table scan anyway. In that case
making the table smaller by leaving out the length may actually make it
faster. However, selecting all the phrases where length < 5 would almost
certainly use the index and make it faster. Only you know what you are
likely to do in general here so you need to decide (and test) whether it
is better to have the index and length column or not.

You can always do “select * from phrases where length(s) > 3” or
something like that.

Are you sure your overall run time is not limited by CPU rather than IO?
How much CPU time is used to run your code?

Cheers,
Gary.


#30

ok, the following code doing 6000 record inserts takes 13.3 seconds to
finish (just for the db portion). if i change 6000 to 30000 then it
takes 67 seconds.

it is being run using

ruby script/runner foo.rb


code:

srand(123)

frequencies = {}
lengths = {}

6000.times do
  phrase = (65 + rand(26)).chr + "#{rand(100_000_000)}"
  frequencies[phrase] = rand(20)
  lengths[phrase] = rand(10)
end

#p frequencies

all_phrases = frequencies.keys

p "Starting inserting records"
time_start = Time.now

Phrase.transaction do
  all_phrases.each do |phrase|
    recordPhrase = Phrase.create(:s => phrase, :frequency =>
      frequencies[phrase], :length => lengths[phrase])
  end
end

p "took #{Time.now - time_start} seconds to finish"
p Phrase.count


the schema.rb for the table is

create_table "phrases", :force => true do |t|
  t.string   "s"
  t.integer  "frequency"
  t.integer  "length"
  t.datetime "created_at"
  t.datetime "updated_at"
end

add_index "phrases", ["frequency"], :name => "index_phrases_on_frequency"
add_index "phrases", ["length"],    :name => "index_phrases_on_length"
add_index "phrases", ["s"],         :name => "index_phrases_on_s"


#31

Jian L. wrote:

ok, the following code doing 6000 record inserts takes 13.3 seconds to
finish (just for the db portion). if i change 6000 to 30000 then it
takes 67 seconds.

OK, I created the table and the index and ran your code for 30000
records, but I wrapped the database part in a Benchmark.measure{}
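The wrapper was roughly this (a sketch; Benchmark prints user, system
and total CPU time, with elapsed time in parentheses):

require 'benchmark'

puts Benchmark.measure {
  Phrase.transaction do
    all_phrases.each do |phrase|
      Phrase.create(:s => phrase, :frequency => frequencies[phrase],
                    :length => lengths[phrase])
    end
  end
}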

Using MySQL:

For 30000 inserts with no indexes:

"Starting inserting records"
31.996000 0.639000 32.635000 ( 35.356000)
30000

For 30000 inserts with all indexes:

"Starting inserting records"
32.795000 0.982000 33.777000 ( 37.103000)
30000

That’s 33 seconds of CPU time with 37 seconds elapsed on a quad core 2.4
GHz with a Seagate Barracuda SATA drive with 32MB cache.

As you can see, it’s pretty much all in CPU time!!!

The result was essentially the same in Postgres and MS SQL server!

So you can forget about the database itself. None of the database
engines were unduly taxed by the test.

Just for fun I changed the program to output the data as SQL INSERT
statements and then run that (with 30000 inserts wrapped in a
transaction) against MySQL.

Imported in 1.2 seconds!!

I don’t know if it is the hash lookup code or ActiveRecord that is
gobbling up the time, but it certainly isn’t the database.

You’ll need to tinker with, or better, profile your code to find out what
is sucking up the time.
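The SQL-file trick was roughly this (a sketch; the file name and the
mysql invocation are illustrative, and phrases containing quotes would
need escaping first):

File.open("phrases.sql", "w") do |f|
  f.puts "BEGIN;"
  all_phrases.each do |phrase|
    f.puts "INSERT INTO phrases (s, frequency, length) " +
           "VALUES ('#{phrase}', #{frequencies[phrase]}, #{lengths[phrase]});"
  end
  f.puts "COMMIT;"
end

# then load it in one go, e.g.:  mysql -u user -p mydb < phrases.sql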

Cheers,
Gary.


#32

Gary D. wrote:

Jian L. wrote:

ok, the following code doing 6000 record inserts takes 13.3 seconds to
finish (just for the db portion). if i change 6000 to 30000 then it
takes 67 seconds.

OK, I created the table and the index and ran your code for 30000
records, but I wrapped the database part in a Benchmark.measure{}

Using MySQL:

For 30000 inserts with no indexes:

"Starting inserting records"
31.996000 0.639000 32.635000 ( 35.356000)
30000

For 30000 inserts with all indexes:

"Starting inserting records"
32.795000 0.982000 33.777000 ( 37.103000)
30000

That’s 33 seconds of CPU time with 37 seconds elapsed on a quad core 2.4
GHz with a Seagate Barracuda SATA drive with 32MB cache.

As you can see, it’s pretty much all in CPU time!!!

The result was essentially the same in Postgres and MS SQL server!

So you can forget about the database itself. None of the database
engines were unduly taxed by the test.

Just for fun I changed the program to output the data as SQL INSERT
statements and then run that (with 30000 inserts wrapped in a
transaction) against MySQL.

Imported in 1.2 seconds!!

I don’t know if it is the hash lookup code or ActiveRecord that is
gobbling up the time, but it certainly isn’t the database.

You’ll need to tinker with, or better, profile your code to find out what
is sucking up the time.

yeah that’s what i was going to say… the line

Phrase.transaction do

didn’t cause any “transaction” statement to show up in the
development.log

so is it supposed to have begun a transaction?

so the 1.2 second result you have is from collecting all those INSERT
statements, wrapping them in a begin transaction and a commit, and
running that in 1.2 seconds…

I wonder then, can’t this be achieved in ActiveRecord?


#33

Jian L. wrote:

oh sorry i should have clarified… the length is actually the count of
words… so length > 3 means at least 3 words. or i might do a query for
length > 5 or 7… so will having a column and an index make it fast if
i do a query such as length > 7? i intend, for example, to get back
only 5% or 3% of all records, so the index might make it fast.

Ah, yes, that makes it different :-)

by the way, the 75 seconds is measured from the moment the

Phrase.transaction do

to the end of it… so… i think during that time it is mostly IO
time…

OK, just need to look at the CPU time used then.

Is this Windows or Linux or something else?

let me actually write up a test case with some dummy data to simulate it
so that everyone is on the same ground…

That would be the best of all. I can run your code with your sample data
against MySQL and Postgres to give you some ideas. I don’t really know
much about sqlite, but I must admit I’m curious as to where the time has
gone in such an apparently simple situation.

Don’t forget to include the database definition and data as well as the
code.

Cheers,
Gary.


#34

Jian L. wrote:

hm… the Hash lookup time should be really small… for example, if i
just wrote all the data to a text file, it would be really fast… it
might be ActiveRecord, although I thought ActiveRecord merely translates
the method into a SQL statement and so shouldn’t be so CPU intensive.

Aha… It looks like ActiveRecord has an enormous overhead in
creating/saving records.

If you change the inserts to this…

puts Benchmark.measure {
  Phrase.transaction do
    all_phrases.each do |phrase|
      Phrase.connection.execute("insert into phrases(s,frequency,length) " +
        "values('#{phrase}',#{frequencies[phrase]},#{lengths[phrase]})")
    end
  end
}

you get this:

"Starting inserting records"
1.123000 0.686000 1.809000 ( 5.096000)
30000

Which is exactly what you want, I think :-)

This falls down of course if the phrases contain single quote characters
or are from an untrusted source. You will need to at least escape quotes
before putting them in the insert statement. This will slow it down a
bit, but not as much as before, I would think.
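One way to do that (a sketch, assuming the same models) is to let the
connection adapter do the quoting for you:

Phrase.transaction do
  all_phrases.each do |phrase|
    quoted = Phrase.connection.quote(phrase)   # escapes embedded quotes
    Phrase.connection.execute(
      "insert into phrases(s,frequency,length) " +
      "values(#{quoted},#{frequencies[phrase]},#{lengths[phrase]})")
  end
end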

Note to self: Don’t attempt to do lots of record creations with
standard AR code!

Cheers,
Gary.


#35

Gary D. wrote:

"Starting inserting records"
31.996000 0.639000 32.635000 ( 35.356000)
30000

For 30000 inserts with all indexes:

"Starting inserting records"
32.795000 0.982000 33.777000 ( 37.103000)
30000

I don’t know if it is the hash lookup code or ActiveRecord that is
gobbling up the time, but it certainly isn’t the database.

by the way… interesting to find out that so much of it is CPU time… i
thought this code was I/O bound at first…

by the way I am using Rails 2.3.2, Ruby 1.8.6 patchlevel 287… on
Windows 7.

hm… the Hash lookup time should be really small… for example, if i
just wrote all the data to a text file, it would be really fast… it
might be ActiveRecord, although I thought ActiveRecord merely translates
the method into a SQL statement and so shouldn’t be so CPU intensive.


#36

Gary D. wrote:

Aha… It looks like ActiveRecord has an enormous overhead in
creating/saving records.

If you change the inserts to this…

puts Benchmark.measure {
  Phrase.transaction do
    all_phrases.each do |phrase|
      Phrase.connection.execute("insert into phrases(s,frequency,length) " +
        "values('#{phrase}',#{frequencies[phrase]},#{lengths[phrase]})")
    end
  end
}

you get this:

"Starting inserting records"
1.123000 0.686000 1.809000 ( 5.096000)
30000

Which is exactly what you want, I think :-)

i was googling for “ActiveRecord transaction” and got this page

http://api.rubyonrails.org/classes/ActiveRecord/Transactions/ClassMethods.html

where it says:

Save and destroy are automatically wrapped in a transaction

so i think create is also invoking a save… do you think so? if it is,
then each create is its own transaction. that’s why it is so slow… it
can’t be doing lots of record creations in a single transaction. or
maybe there is a method we just don’t know yet (maybe a method using
ar-extensions, or why not have a standard ActiveRecord mechanism to do
huge updates/inserts, i wonder)


#37

Jian L. wrote:

Save and destroy are automatically wrapped in a transaction

so i think create is also invoking a save…

i just read in the book Learning Rails (p.50 if i remember correctly)
that

create is 3 operations in one: it does a new, an assignment of values,
and a save. so create involves a save, and is therefore in a transaction
by itself.
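in other words, roughly:

# Phrase.create(attrs) is roughly equivalent to:
phrase = Phrase.new(attrs)   # new + assignment of values
phrase.save                  # save, wrapped in a transaction by itself
phrase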


#38

On May 10, 10:41 am, Jian L. removed_email_address@domain.invalid
wrote:

did it actually force-write to the physical disc?

It’s never that simple (and yes, under the appropriate circumstances
the drive is supposed to flush to the actual disk). For example,
there’s a constant amount of overhead with each query (network latency
etc.): not a huge amount, probably less than a millisecond, but
multiply that by 34000 and it will add up. If you really need to
manipulate this much data you’d be well advised to do the inserts in
bulk.

Fred


#39

Frederick C. wrote:

On May 10, 10:41 am, Jian L. removed_email_address@domain.invalid
wrote:

did it actually force-write to the physical disc?

It’s never that simple (and yes, under the appropriate circumstances
the drive is supposed to flush to the actual disk). For example,
there’s a constant amount of overhead with each query (network latency
etc.): not a huge amount, probably less than a millisecond, but
multiply that by 34000 and it will add up. If you really need to
manipulate this much data you’d be well advised to do the inserts in
bulk.

so how do i do inserts in bulk? by using

Phrase.connection.execute("insert into phrases(s,frequency,length)
values('#{phrase}',#{frequencies[phrase]},#{lengths[phrase]})")

? or by using ar-extensions? can ActiveRecord have a mode for saving
without being in a transaction, or does ActiveRecord have some standard
method of doing bulk inserts?
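for example, i guess a hand-built multi-row insert might work (a sketch;
MySQL and Postgres support this syntax, though i believe sqlite only
added it in a later version):

conn = Phrase.connection
rows = all_phrases.map do |phrase|
  "(#{conn.quote(phrase)},#{frequencies[phrase]},#{lengths[phrase]})"
end

# one statement inserting all rows at once
conn.execute(
  "insert into phrases(s,frequency,length) values #{rows.join(',')}")

or maybe ar-extensions’ import method already wraps this kind of thing
up?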


#40

On Sunday 10 May 2009, Jian L. wrote:

so how to do inserts in bulk? by using

Phrase.connection.execute("insert into phrases(s,frequency,length)
values('#{phrase}',#{frequencies[phrase]},#{lengths[phrase]})")

You might be using the wrong tool and you might be reinventing existing
solutions. Did you have a look at existing text mining tools/frameworks
in Ruby as well as in other languages?

Michael


Michael S.
mailto:removed_email_address@domain.invalid
http://www.schuerig.de/michael/