Copy file into new file without dups, EOL problem

File.open("newf.txt", "w+") { |file| file.puts File.readlines("oldf.txt").uniq }

Hi,
I'm doing it like the above on the most recent 1.9.3, and I'm getting a
mess with the end-of-line characters. It's a simple ASCII file, and
everything runs on Windows:

oldf.txt:::=====================================
12/30/2011 02:12 AM 5,921,042 01
12/30/2011 02:12 AM 5,921,042 01
12/30/2011 02:12 AM 5,938,806 02
12/30/2011 02:12 AM 5,921,042 01
12 AM 5,921,042 01 - Andrea Bocel

newf.txt:::=====================================
12/30/2011 02:12 AM 5,921,042 01 12/30/2011 02:12 AM
5,938,806 02 12 AM 5,921,042 01 - Andrea Bocel

Is there any option I should check??

Tx
Mario

Hi,

Ruby uses the LF character ("\n") for line endings, while Windows uses
the CR LF combination ("\r\n").

It seems you have to open the file in binary mode so that the CR LF
pairs don't get converted into LF:

File.open 'C:/htdocs/new.txt', 'wb+' do |file|
  lines = File.open('C:/htdocs/old.txt', 'rb', &:readlines).uniq
  file.print *lines
end

Yeah, this is ugly. I wonder why Ruby cannot handle that itself.

On 03/28/2012 02:41 PM, Mario T. wrote:

12/30/2011 02:12 AM 5,938,806 02
12/30/2011 02:12 AM 5,921,042 01
12 AM 5,921,042 01 - Andrea Bocel

newf.txt:::=====================================
12/30/2011 02:12 AM 5,921,042 01 12/30/2011 02:12 AM
5,938,806 02 12 AM 5,921,042 01 - Andrea Bocel

Is there any option I should check??

File.readlines opens the file with mode 'r' by default. On Windows,
this means the file is opened in text read mode, which converts the
Windows line endings into Unix line endings for each line read in from
oldf.txt. When you write the array of filtered lines to newf.txt, the
lines are first joined together with an empty string between them and
the result is then written to the file. I think that prevents the
embedded Unix line endings from being converted back to Windows line
endings.

I don't have a convenient Windows system on which to try this, but I
think the easy solution, since you're on Ruby 1.9.3, would be to tell
File.readlines to open the file in binary read mode, AKA 'rb':

File.readlines("oldf.txt", mode: "rb")
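
Untested on my end, but combining this with Jan's binary-mode write, the
whole one-liner would look something like this sketch:

File.open("newf.txt", "wb") do |file|
  # read in binary so the CR LF endings survive, and write in binary so
  # they are not converted again on the way out
  file.puts File.readlines("oldf.txt", mode: "rb").uniq
end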

-Jeremy

Jeremy B. wrote in post #1053841:

On 03/28/2012 04:25 PM, Jan E. wrote:

file.print *lines
end

Yeah, this is ugly. I wonder why Ruby cannot handle that itself.

In Ruby 1.9, which the OP is using, File.readlines /can/ handle this
better. You can specify the mode in which to open the file directly as
a hash option.

Using #readlines to copy a file identically is the wrong tool IMHO.

Or is the solution “ugly” because you have to manually specify binary
mode when opening files?

I’d rather do it with blocks of fixed length for efficiency reasons:

File.open "oldf.txt", 'rb' do |io_in|
  File.open "newf.txt", 'wb' do |io_out|
    buffer = ""

    while io_in.read(1024, buffer)
      io_out.write(buffer)
    end
  end
end

But what about the dups? What constitutes a duplicate? If it is just
raw content, you could use “sort -u” (standalone command).
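
For instance, something like this sketch, assuming a Unix-style sort
(e.g. Cygwin's) is on the PATH; note that "sort -u" also reorders the
lines, unlike Array#uniq:

# shell out to sort; -u drops duplicates, -o writes the result to newf.txt
system("sort", "-u", "oldf.txt", "-o", "newf.txt")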

Kind regards

robert

On 03/29/2012 07:55 AM, Robert K. wrote:

Using #readlines to copy a file identically is the wrong tool IMHO.

From the OP’s example, it appears that copying the file identically is
not the desire.

      io_out.write(buffer)
    end
  end
end

But what about the dups? What constitutes a duplicate? If it is just
raw content, you could use “sort -u” (standalone command).

Again from the original example, the records to compare for uniqueness
are simple lines. Of course that simplicity belies the issue of line
endings. ;)

Also, the OP appears to be running on Windows, so “sort -u” is not
available out of the box.

-Jeremy

On 03/28/2012 04:25 PM, Jan E. wrote:

file.print *lines
end

Yeah, this is ugly. I wonder why Ruby cannot handle that itself.

In Ruby 1.9, which the OP is using, File.readlines /can/ handle this
better. You can specify the mode in which to open the file directly as
a hash option.

Or is the solution “ugly” because you have to manually specify binary
mode when opening files?

-Jeremy

Jeremy B. wrote in post #1053948:

On 03/29/2012 07:55 AM, Robert K. wrote:

But what about the dups? What constitutes a duplicate? If it is just
raw content, you could use “sort -u” (standalone command).

Again from the original example, the records to compare for uniqueness
are simple lines. Of course that simplicity belies the issue of line
endings. ;)

Ah, I overlooked the call to #uniq. I think we should be able to fix
the original with a small insertion:

File.open("newf.txt", "w+") { |file| file.puts File.readlines("oldf.txt").each(&:chomp!).uniq }

Although from an efficiency point of view another approach would be
preferable:

File.open "oldf.txt" do |io_in|
  File.open "newf.txt", "w" do |out|
    last = nil

    io_in.each_line do |line|
      line.chomp!
      # only write a line if it differs from the immediately preceding one
      out.puts line unless line == last
      last = line
    end
  end
end

Note that opening "newf.txt" before opening "oldf.txt" will lead to an
empty file being created if "oldf.txt" does not exist, even though an
exception is raised.
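
A quick way to see that effect (a sketch, using a deliberately missing
input file name):

begin
  File.open("newf.txt", "w+") do |file|
    # "w+" creates/truncates newf.txt before readlines ever runs
    file.puts File.readlines("no_such_file.txt").uniq
  end
rescue Errno::ENOENT => e
  puts e.message                # the missing input is reported ...
  puts File.exist?("newf.txt")  # ... but an empty newf.txt was still created
end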

Also, the OP appears to be running on Windows, so “sort -u” is not
available out of the box.

Right, I’m so used to cygwin that I keep forgetting not everybody has it
installed. :)

Cheers

robert

Jeremy B. wrote in post #1054030:

On 03/29/2012 11:11 AM, Robert K. wrote:

the original with a small insertion:

    io_in.each_line do |line|
      line.chomp!
      out.puts line unless line == last
      last = line
    end
  end
end

The limitation here is that duplicate lines must occur in consecutive
runs. If they are interleaved with different lines, this filter won’t
work. For the general case, loading all the lines and running #uniq
over the array is more likely to work. I admit that it’s not very
efficient for large files though.

The example had consecutive duplicate lines. But we do not have a clear
specification of the problem, namely the input and desired output. If
the identical behavior of the readlines-uniq approach is needed I would
do this:

require 'set'

File.open "oldf.txt" do |io_in|
  File.open "newf.txt", "w" do |out|
    seen = Set.new

    io_in.each_line do |line|
      line.chomp!
      # Set#add? returns nil if the line was already seen, so duplicates
      # are skipped no matter where they occur in the file
      out.puts line if seen.add? line
    end
  end
end

Kind regards

robert

On 03/29/2012 11:11 AM, Robert K. wrote:

the original with a small insertion:

    io_in.each_line do |line|
      line.chomp!
      out.puts line unless line == last
      last = line
    end
  end
end

The limitation here is that duplicate lines must occur in consecutive
runs. If they are interleaved with different lines, this filter won’t
work. For the general case, loading all the lines and running #uniq
over the array is more likely to work. I admit that it’s not very
efficient for large files though.
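
A toy illustration of the difference, with a hypothetical three-line
input:

lines = ["a", "b", "a"]

# consecutive-run filter: each line is only compared to its predecessor
last = nil
kept = []
lines.each do |line|
  kept << line unless line == last
  last = line
end

kept        # => ["a", "b", "a"]  (the interleaved duplicate survives)
lines.uniq  # => ["a", "b"]       (global de-duplication)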

Note that opening "newf.txt" before opening "oldf.txt" will lead to an
empty file being created if "oldf.txt" does not exist, even though an
exception is raised.

Good point.

Also, the OP appears to be running on Windows, so “sort -u” is not
available out of the box.

Right, I’m so used to cygwin that I keep forgetting not everybody has it
installed. :)

When stuck on Windows, Cygwin is definitely a must-have!

-Jeremy