Best way to compare lines in 2 files

dubstep · October 5, 2011, 2:34pm

Hi everybody,

I have a problem:

I have two lists in two separate files.
I’ll call the 1st file the old file and the 2nd file the new file.
They are similar, except that some lines have been added and some lines
removed in both files. I only want to keep the lines common to both
files and the lines that exist only in my 2nd file.

My first solution is to read the 2nd file and compare each of its lines
with every line of the 1st file to see whether I find it; if not, I save
it in a 3rd file that lists all the lines I have to remove from my 2nd file.

Then I will read my 2nd file and compare each line with every line of my
3rd file; if it does not exist there, I write the line to a 4th file (the
final one).

I’m not sure this is the quickest or best method; maybe something
faster exists.
I also assume there is a fast method with high memory usage and a
slower one with low memory usage.
My files are Excel sheets exported as CSV files, with something like
40k lines each.

So if someone has any clues to help me, I would be grateful.

gbenoit79 · October 5, 2011, 2:39pm

On Wed, Oct 5, 2011 at 2:34 PM, Guillaume B. wrote:

I’m not sure if this is the quickest and best method, maybe something
faster exist.

–
Phillip G.


gbenoit79 · October 5, 2011, 2:51pm

http://mywiki.wooledge.org/BashFAQ/036

gbenoit79 · October 5, 2011, 5:09pm

Guillaume B. wrote in post #1025125:

I have 2 list in 2 separate files
I call them 1st file = old file, 2nd file = new file
They are similar, except there is some lines added and some lines
removed in both files, and I only want to keep : lines common in both
files and line only existing in my 2nd file.

You want two separate output files, one containing lines common to both,
and one containing lines only in the 2nd file?

As long as preserving the order isn’t important, look at the manpages
for ‘sort’ and ‘join’.

Otherwise, since the files aren’t too big, you can read all of file1
into a Hash of {linedata=>true}. Then read through file2 and print a
line only if the corresponding Hash entry is true.

# Remember every line of file1 as a Hash key.
seen = {}
File.open("file1") do |f|
  f.each_line do |line|
    seen[line] = true
  end
end
# Label each line of file2 by whether file1 also contains it.
File.open("file2") do |f|
  f.each_line do |line|
    if seen[line]
      print "In both,#{line}"
    else
      print "Only in file 2,#{line}"
    end
  end
end
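
If two separate outputs are what is wanted, the same idea can be sketched with a Set instead of a Hash. This is only a minimal sketch: the method name `split_by_presence` and its return shape are made up for illustration, and it assumes that lines should be compared without their trailing newline.

```ruby
require "set"

# Hypothetical helper: splits the lines of new_path into those that also
# appear in old_path and those that appear only in new_path.
# The order of new_path is preserved in both result arrays.
def split_by_presence(old_path, new_path)
  # Only the old file is held in memory, as a Set of its lines.
  old_lines = Set.new
  File.foreach(old_path) { |line| old_lines << line.chomp }

  common = []
  only_new = []
  File.foreach(new_path) do |line|
    line = line.chomp
    (old_lines.include?(line) ? common : only_new) << line
  end
  [common, only_new]
end
```

The two arrays can then be written out with something like `File.write("common.csv", common.join("\n"))`; the output file name here is just a placeholder.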

gbenoit79 · October 6, 2011, 5:45pm

Thanks for your answers.

Phillip and Dwayne, your answers were very useful. I did not know that
such commands and software existed for this kind of task (but as I’m on
Windows, the diff command was not working because of my file size), and I
don’t only want to see the differences but to create a new file with some
specific things.

As Bartosz and Brian said, I’ll try your solutions; the files are not
so big (~300k of text each).

Just because I think it could be good to know: does someone have some
kind of solution for bigger files?

I was thinking of sorting the lines in the files before removing lines
from the 2nd file.

I’ll try your ideas and maybe think about an alternate method in
case of …

Thanks again for your help.
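
For the bigger-files question, the sorting idea could look roughly like the sketch below: once both files are sorted, they can be walked in step so that only one line of each is in memory at a time. The method name `merge_compare` and the `:both` / `:new_only` tags are invented for this example, and it assumes both files are already sorted in the byte order Ruby uses for `String#<=>` (for GNU `sort`, running it with `LC_ALL=C` should give that order).

```ruby
# Hypothetical helper: walks two line-sorted files in step and yields
# [tag, line] for every line of the new file, where tag is :both if the
# old file also contains the line and :new_only otherwise.
# Assumes both files are sorted in Ruby's String#<=> order.
def merge_compare(old_path, new_path)
  return enum_for(:merge_compare, old_path, new_path) unless block_given?

  File.open(old_path) do |old_file|
    old_line = old_file.gets
    File.foreach(new_path) do |new_line|
      # Skip old lines that sort before the current new line.
      old_line = old_file.gets while old_line && old_line.chomp < new_line.chomp
      tag = old_line && old_line.chomp == new_line.chomp ? :both : :new_only
      yield tag, new_line.chomp
    end
  end
end
```

The trade-off is that the original line order of the new file is lost by the sort, so this only fits when order does not matter.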

gbenoit79 · October 5, 2011, 6:58pm

And if the files are quite small and you couldn’t care less about
speed as long as it takes less than 5 seconds you could just do:

lines1 = File.readlines 'file1'
lines2 = File.readlines 'file2'

common = lines1 & lines2
second_only = lines2 - lines1

– Matma R.
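
One detail worth knowing about the array operators above: `&` drops duplicate lines and keeps the order of its left operand, while `-` keeps duplicates from its left operand. A small sketch with made-up arrays standing in for the `File.readlines` results:

```ruby
# Stand-ins for File.readlines output (each element keeps its trailing newline).
lines1 = ["a\n", "b\n", "b\n"]          # old file
lines2 = ["c\n", "b\n", "c\n", "a\n"]   # new file

common_old_order = lines1 & lines2   # unique lines, in lines1 order
common_new_order = lines2 & lines1   # unique lines, in lines2 order
second_only      = lines2 - lines1   # duplicates within lines2 are kept
```

So if the common lines should come out in the order of the new file, writing `lines2 & lines1` is probably the safer choice.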

gbenoit79 · October 6, 2011, 8:42pm

Guillaume B. wrote in post #1025377:

(but as I’m on
windows, the diff command was not working because of my file size)

Install Cygwin, and you should get decent GNU utils instead of any junk
ones which come with Windows.