Database cleaning

indigoi · December 14, 2008, 7:25pm

Although i’m using validations in my models, i’ve managed to get a few
duplicate entries on my site. I’ve got roughly 300,000 entries, and my
administration only allows long running rake tasks of up to 5 hours. I
need to remove all entries that have the same :word and :language.

Is there an easy way to iterate through all entries, check to see if any
other entries have the same :word and :language, and delete those from
the DB?

indigoi · December 14, 2008, 8:04pm

Richard S. wrote:

Although i’m using validations in my models, i’ve managed to get a few
duplicate entries on my site. I’ve got roughly 300,000 entries, and my
administration only allows long running rake tasks of up to 5 hours. I
need to remove all entries that have the same :word and :language.

Is there an easy way to iterate through all entries, check to see if any
other entries have the same :word and :language, and delete those from
the DB?

Long term, I recommend adding a UNIQUE INDEX constraint to the database
table (preferably via a migration) in addition to a
validates_uniqueness_of in the model. The ActiveRecord validations are
susceptible to race conditions, but the database constraint should not
be.

Before you can add this index you’ll need to remove the duplicates.
There are a couple ways to do this, and the best choice depends on how
many dupes you have within the entire dataset. In most cases there are
relatively few dupes (say less than 1000 in your 300K), and here’s my
preference (performed in a rake task or a migration, or manually using
the SQL tool of your choice, whichever way fits your workflow best):

Fetch the set of distinct tuples that are duplicated (word + language
in your case). Something like this:

select word, language from entries group by word, language having
count(*) > 1

Once you know what your duplicates are, iterate over them and delete
all but one of the records:

dupes.each do |dupe|
# this algorithm retains only the most recently added dupe, modify
as needed
# make sure you have an index on (word, language), not yet unique
entries = Entry.all(
:conditions => {
:word => dupe.word,
:language => dupe.language},
:order => ‘id desc’)

entries.each_with_index do |e, index|
  e.destroy if index > 0
end

end

Re-run the query and confirm that you don’t have any lingering
duplicates.
Add the UNIQUE INDEX constraint to prevent more dupes from being
created. Make sure you have a validates_uniqueness_of in the model too.

Hope that helps!

Jeremy
http://jeronrails.blogspot.com