Finding duplicated lines in folder for refactoring purposes

Hello.

I am wondering if anyone knows of a good trick to identify duplicated
lines in a directory for refactoring purposes.

The idea is this: if I can get a listing of all lines, by file, in a
directory (recursively), then refactoring effort could be focused on
eliminating “repeated” lines.

The solution would involve

  • recursively searching all files in a given path
  • displaying a list of repeated lines (ignoring case and whitespace)
  • grouping results by path/filename combination
  • sorting results by line repeat count
  • ideally only displaying lines that are repeated at least once…

I’m thinking this should exist as an application, Ruby script, or
shell script.
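
For concreteness, here is a very rough, untested sketch of the shape I
am imagining (all the names in it are made up, and it assumes the
files are plain text):

require 'find'

# normalized line => list of the files it occurs in (one entry per occurrence)
occurrences = Hash.new { |h, k| h[k] = [] }

Find.find(ARGV[0] || '.') do |path|
  next unless File.file?(path)                   # Find yields directories too
  File.foreach(path) do |line|
    key = line.strip.gsub(/\s+/, ' ').downcase   # ignore case and whitespace
    occurrences[key] << path unless key.empty?
  end
end

# only lines that occur more than once, most-repeated first,
# grouped by the path/filename they came from
occurrences.select { |_, paths| paths.size > 1 }
           .sort_by { |_, paths| -paths.size }
           .each do |line, paths|
  puts "#{paths.size}x: #{line}"
  paths.uniq.sort.each { |p| puts "    #{p}" }
end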

Does anyone know of any ideas or solutions that are remotely close to
this?

-Shannon

Try simian [ http://www.redhillconsulting.com.au/products/simian/ ].
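If I remember correctly, it scans a source tree and reports duplicated
runs of lines grouped by file, which sounds close to what you are after.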

Craig

2008/6/9 [email protected]:

I am wondering if anyone knows of a good trick to identify duplicated
lines in a directory for refactoring purposes.

Do you mean “duplicated lines in files in a directory tree”? The
term “duplicated lines in a directory” does not make much sense to me,
as a directory does not have “lines”.

I’m thinking this should exist as an application, Ruby script, or
shell script.

Does anyone know of any ideas or solutions that are remotely close to
this?

If you have enough memory, or few enough files, you could do
something like the following (untested):

require 'find'
require 'set'

# Collapse case and whitespace so "equivalent" lines compare equal.
def normalize(line)
  l = line.strip
  l.gsub!(/\s+/, ' ')
  l.downcase!
  l
end

dir = ARGV[0] || '.'   # root of the tree to scan

# normalized line => set of files it appears in
duplicates = Hash.new { |h, k| h[k] = Set.new }

Find.find dir do |file|
  next unless File.file?(file)   # Find yields directories as well
  File.foreach file do |line|
    key = normalize(line)
    duplicates[key] << file unless key.empty?   # ignore blank lines
  end
end

duplicates.each do |line, files|
  puts line, files.sort.join(',') if files.size > 1
end
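
With dir taken from ARGV as above, you would run it as something like

  ruby dupes.rb path/to/project

(the script name is made up). Note that files is a Set, so a line is
flagged only when it occurs in more than one file; a line repeated
several times within a single file will not show up. Collecting into
an Array instead, and printing files.uniq.sort, would catch that case
too.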

Cheers

robert