Updating the db 6000 times will take few minutes?


#1

i am writing a test program for ActiveRecord, and it reads a document
which is like 6000 words long. And then i just tally up the words by

recordWord = Word.find_by_s(word);
if (recordWord.nil?)
  recordWord = Word.new
  recordWord.s = word
end
if recordWord.count.nil?
  recordWord.count = 1
else
  recordWord.count += 1
end
recordWord.save

and so this part loops for 6000 times… and it takes a few minutes to
run at least using sqlite3. Is it normal? I was expecting it could run
within a couple seconds… can MySQL speed it up a lot?


#2

Have you got an index on the s column?

Colin

2009/5/9 Jian L. removed_email_address@domain.invalid


#3

I’m glad you said it was a test because it wouldn’t be nice to do that
for real.

It looks like you’ve got at least 12000 transactions in there (select +
update). Any indexes would make it worse.

I’m guessing that this would take around 5 minutes on decent single disk
PC hardware.

No database engine (even oracle) is going to do that in 2 seconds on a
standard PC with a single sata/ide hard disk. You’re asking for 6000
read/write operations per second. You’ll need some mighty big hardware
to do that!

If you did the whole thing in a single transaction you might get it to
go quite a bit faster.

Cheers,
Gary.


#4

Gary D. wrote:

I’m glad you said it was a test because it wouldn’t be nice to do that
for real.

It looks like you’ve got at least 12000 transactions in there (select +
update). Any indexes would make it worse.

I’m guessing that this would take around 5 minutes on decent single disk
PC hardware.

No database engine (even oracle) is going to do that in 2 seconds on a
standard PC with a single sata/ide hard disk. You’re asking for 6000
read/write operations per second. You’ll need some mighty big hardware
to do that!

If you did the whole thing in a single transaction you might get it to
go quite a bit faster.

will record.save automatically make it a transaction? so how do i
make it into 1 transaction? I can’t make it a hash or an array of
records and at the end, do a loop of record.save too? because each will
be a transaction… ?

what about using memcache? the db seems like was about 45MB when i run
6000 words test on a few pages… so memcache of 64MB… maybe
everything will happen in RAM and it can be really fast?


#5

Colin L. wrote:

Would not the index make the lookup faster but the write slower? Is it
that
the cacheing would mean that the lookup would be pretty quick anyway,
even
without an index?

Colin

2009/5/9 Gary D. removed_email_address@domain.invalid

yeah, seems like the search will be a lot faster with the index… i
was also thinking of using production mode to run it, since the
development mode will write all the SQL log and slow down the
operation…


#6

Would not the index make the lookup faster but the write slower? Is it
that
the cacheing would mean that the lookup would be pretty quick anyway,
even
without an index?

Colin

2009/5/9 Gary D. removed_email_address@domain.invalid


#7

Is this a test or a real requirement? If it is a real requirement then
count the words in memory first and then update the db so each record is
only written once.

Colin

2009/5/9 Jian L. removed_email_address@domain.invalid


#8

Colin L. wrote:

Is this a test or a real requirement? If it is a real requirement then
count the words in memory first and then update the db so each record is
only written once.

Colin

2009/5/9 Jian L. removed_email_address@domain.invalid

it is a real requirement… so how do i update the db at the end in 1
operation? thanks.


#9

Yes, an index/caching will make the lookup faster, but even at the
optimum you are looking at 6000 transactions. Never in two seconds
unless you have some unsafe (buffered, no fsync) setting on your db
engine.

Cheers,
Gary.


#10

On May 9, 5:48 pm, Jian L. removed_email_address@domain.invalid wrote:

operation? thanks.
Well like others have said you can wrap everything in a single
transaction (look at the rails api docs for transactions) which at
least means that your database won’t have to flush to disk on each
write. There’s nothing built into rails for bulk inserts but there’s
nothing stopping you from building up the appropriate insert statement
yourself (there’s also a plugin (http://github.com/zdennis/ar-
extensions/tree/master) that does that sort of thing).

Fred


#11

great… will try it out now.

actually, i was thinking, that the computer sometimes has 1GB free RAM
or 3GB free RAM (of the 4GB of RAM). how come the OS doesn’t
automatically create a cache for the 45MB db file? If the OS creates
the cache, everything happens in memory, and it should be quite fast.

Is it true that sqlite3 actually flush the data into the hard drive?
but at least for the searching part, can it still happen just in RAM?
Can’t MySQL or Sqlite3 actually have a mode so that the db can be cached
in RAM as much as possible?


#12

Jian L. wrote:

in RAM as much as possible?
The database can be cached in ram by most database engines. However,
most database engines will also make sure that every transaction
(insert, update, delete) is committed to disk by forcing a write to the
physical disk for every transaction. If the database engine doesn’t do
this you risk losing part or all of your database if some kind of
failure happens part way though your updates.

If you don’t care if you lose part or all of your database, most
database engines also have a setting for this…

Cheers,
Gary.


#13

On Saturday 09 May 2009, Jian L. wrote:

else
  recordWord.count += 1
end
recordWord.save

and so this part loops for 6000 times…

Simply put, this is an inappropriate use of a relational database. As
the other respondents said, you’re executing 12000 transactions each
with a single database access. If you wrap the code above into a
transaction block

Word.transaction do

end

you’re down to one transaction, but still have 12000 accesses. IIUC, you
have a word list of 6000 words (or something) in memory already, why not
do the frequency counting where it’s easy?

frequencies = Hash.new(0)
wordlist.each { |word| frequencies[word] += 1 }

Word.transaction do
wordlist.each do |word, freq|
unless w = Word.find_by_s(word)
w = Word.new(:s => word, :count => 0)
end
w.freq += 1
w.save!
end
end

This way still incurs two database accesses per unique word in your
list. To cut this down without custom SQL ([1], [2]) you have to
know/decide whether there are mostly known or unknown words in every new
document.

Assuming most words are already in the database, you could do the update
like this

Word.transaction do
wordlist.each do |word, freq|
update_count = Word.update_all(
[“freq = freq + ?”, freq],
[“s = ?”, word])
if update_count == 0
Word.create!(:s => word, :count => freq)
end
end
end

Michael

[1] http://dev.mysql.com/doc/refman/5.1/en/insert-on-duplicate.html
[2] http://www.postgresql.org/docs/current/static/plpgsql-control-
structures.html#PLPGSQL-ERROR-TRAPPING


Michael S.
mailto:removed_email_address@domain.invalid
http://www.schuerig.de/michael/


#14

Gary D. wrote:

Jian L. wrote:

in RAM as much as possible?
The database can be cached in ram by most database engines. However,
most database engines will also make sure that every transaction
(insert, update, delete) is committed to disk by forcing a write to the
physical disk for every transaction. If the database engine doesn’t do
this you risk losing part or all of your database if some kind of
failure happens part way though your updates.

If you don’t care if you lose part or all of your database, most
database engines also have a setting for this…

so i re-ran the test and it worked quite quickly… down to 1 minute or
so instead of 30 minutes… (for several pages)… and the db size is
only 2.5MB (i created another rails project to start anew). So
if we are just write scripts, and running with script/runner test.rb,
how can we turn the “force write” feature of the sqlite3 off?


#15

Phlip wrote:

Jian L. wrote:

so i re-ran the test and it worked quite quickly… down to 1 minute or
so instead of 30 minutes… (for several pages)… and the db size is
only 2.5MB (i created another rails project to start anew). So
if we are just write scripts, and running with script/runner test.rb,
how can we turn the “force write” feature of the sqlite3 off?

Are you running inside a transaction? sqlite3 in transaction mode is
super-fast…

(Jumping into the thread late, why should a test push 6 000 records?
Such a test
is not exactly a “unit” test, and alternate strategies should work
better.)

i tried both record.save and record.save! and both needed a few
minutes at the end of the program to write the data to the db…

i am writing a program to tally up the common word or phrases (say,
either in English or Chinese) that appears often. So for example, at
the end of the program (i used ruby script/runner check.rb to run it
using Ruby with ActiveRecored support), when the results are in, it may
be a table of 6000 keys and 6000 frequency counts. So I need to write
the data into the db. That’s it. Wonder why it takes a few minutes
and how to make it just be a few seconds.

If I just write the results to a flat text file using CSV format, it
should take no more than a few seconds, i think.


#16

Jian L. wrote:

so i re-ran the test and it worked quite quickly… down to 1 minute or
so instead of 30 minutes… (for several pages)… and the db size is
only 2.5MB (i created another rails project to start anew). So
if we are just write scripts, and running with script/runner test.rb,
how can we turn the “force write” feature of the sqlite3 off?

Are you running inside a transaction? sqlite3 in transaction mode is
super-fast…

(Jumping into the thread late, why should a test push 6 000 records?
Such a test
is not exactly a “unit” test, and alternate strategies should work
better.)


Phlip
http://flea.sourceforge.net/resume.html


#17

Have you made the suggested change to count the words in the document
first
then update the db, so that you are not updating each record in the
database
many times (presumably hundreds of times for common words)? This will
change the number of db accesses from the total number of words in the
document to the number of unique words in the document.

2009/5/10 Jian L. removed_email_address@domain.invalid


#18

Colin L. wrote:

Have you made the suggested change to count the words in the document
first
then update the db, so that you are not updating each record in the
database
many times (presumably hundreds of times for common words)? This will
change the number of db accesses from the total number of words in the
document to the number of unique words in the document.

yes, so right now i first tally up the count by a Hash, and then at the
end, write all the data to the table. it is strange that writing all
this data take a minute or so… because the final db is only 2.5MB,
and that if i write all the data to a flat file, i think it is done in 2
seconds.

so right now, i want to find out

  1. can i tell sqlite3 not to write to the db every time i add a record
    (a row), but to do it all in memory and then finally, write to the table
    once at the end.

  2. use memCache to do it.

  3. some one at stackoverflow.com suggested using AR:Extensions
    http://stackoverflow.com/questions/843524/updating-the-db-6000-times-will-take-few-minutes

but i just want to use the most basic way to do it… such as by turning
off the force-write of sqlite3


#19

2009/5/10 Jian L. removed_email_address@domain.invalid

yes, so right now i first tally up the count by a Hash, and then at the
end, write all the data to the table. it is strange that writing all
this data take a minute or so… because the final db is only 2.5MB,
and that if i write all the data to a flat file, i think it is done in 2

seconds.

Have you checked the sql in the log to make sure that it is only
rewriting
each word once? Have you added the index to the word field (I think now
that each record is only being written once then the index should be
beneficial as it will speed up the find, someone correct me if I am
wrong)?

Another thing you could do is comment out the db access bit just to make
sure that the time is not going somewhere else. Initially comment out
the
save then comment out the find also.

Colin


#20

Gary D. wrote:

If you have made the change to count up words first and then insert
all the (word,count) records into the database in a single transaction
then it ought to take less than a second. I would expect that the total
number of (word,count) records is in the order or hundreds or perhaps a
thousand or so? Any decent DB ought to insert that in under a second as
a single transaction.

If it is still taking minutes then you are probably not doing the above
somehow. I think in that case you need to post your code again so we can
see what it is doing now.

i was doing it like this:

all_phrases = frequencies.keys
Phrase.transaction do
all_phrases.each do |phrase|
recordPhrase = Phrase.new(:s => phrase, :frequency =>
frequencies[phrase], :length => lengths[phrase])
recordPhrase.save
end
end

i am using “Phrase” instead of “Word”… but it is the same thing…

all_phrases.length is about 34000 for all data…

it would run for at least a minute… that’s kind of weird…