Forum: Ruby on Rails Database cleaning

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
71ee1809f53df0a2b4dac0530f630933?d=identicon&s=25 Richard Schneeman (snowmaninthesun)
on 2008-12-14 19:25
Although i'm using validations in my models, i've managed to get a few
duplicate entries on my site. I've got roughly 300,000 entries, and my
administration only allows long running rake tasks of up to 5 hours. I
need to remove all entries that have the same :word and :language.

Is there an easy way to iterate through all entries, check to see if any
other entries have the same :word and :language, and delete those from
the DB?
5170ca260dbd2cdfd5a887a4dba7636f?d=identicon&s=25 Jeremy Weiskotten (jeremy_weiskotten)
on 2008-12-14 20:04
Richard Schneeman wrote:
> Although i'm using validations in my models, i've managed to get a few
> duplicate entries on my site. I've got roughly 300,000 entries, and my
> administration only allows long running rake tasks of up to 5 hours. I
> need to remove all entries that have the same :word and :language.
>
> Is there an easy way to iterate through all entries, check to see if any
> other entries have the same :word and :language, and delete those from
> the DB?

Long term, I recommend adding a UNIQUE INDEX constraint to the database
table (preferably via a migration) in addition to a
validates_uniqueness_of in the model. The ActiveRecord validations are
susceptible to race conditions, but the database constraint should not
be.

Before you can add this index you'll need to remove the duplicates.
There are a couple ways to do this, and the best choice depends on how
many dupes you have within the entire dataset. In most cases there are
relatively few dupes (say less than 1000 in your 300K), and here's my
preference (performed in a rake task or a migration, or manually using
the SQL tool of your choice, whichever way fits your workflow best):

1. Fetch the set of distinct tuples that are duplicated (word + language
in your case). Something like this:

  select word, language from entries group by word, language having
count(*) > 1

2. Once you know what your duplicates are, iterate over them and delete
all but one of the records:

  dupes.each do |dupe|
    # this algorithm retains only the most recently added dupe, modify
as needed
    # make sure you have an index on (word, language), not yet unique
    entries = Entry.all(
        :conditions => {
            :word => dupe.word,
            :language => dupe.language},
        :order => 'id desc')

    entries.each_with_index do |e, index|
      e.destroy if index > 0
    end
  end

3. Re-run the query and confirm that you don't have any lingering
duplicates.

4. Add the UNIQUE INDEX constraint to prevent more dupes from being
created. Make sure you have a validates_uniqueness_of in the model too.

Hope that helps!

Jeremy
http://jeronrails.blogspot.com
This topic is locked and can not be replied to.