Removing parts from a file


#1

Hello,

I’m relatively new to Ruby and I have a question:

Is it possible to remove one or more lines from a file, without reading
the whole file and writing it all out again? This is because I’ll have
to do this with files of 1 gigabyte and larger… Is there a
high-performance solution for this?

Thank you!


#2

Thomas D. wrote:

Hello,

I’m relatively new to Ruby and I have a question:

Is it possible to remove one or more lines from a file, without
reading the whole file and writing it all out again? This is because
I’ll have to do this with files of 1 gigabyte and larger… Is there a
high-performance solution for this?

Thank you!

Put like that, I don’t think it is possible, as it is a question not of
the language but of the file structure.
It depends on the programs which read the file: can they be changed to
accept lines which begin with something that says ‘skip me’, such as a
‘#’?


#3

Harpo wrote:

Thomas D. wrote:

Hello,

I’m relatively new to Ruby and I have a question:

Is it possible to remove one or more lines from a file, without
reading the whole file and writing it all out again? This is because
I’ll have to do this with files of 1 gigabyte and larger… Is there a
high-performance solution for this?

Thank you!

Put like that, I don’t think it is possible, as it is a question not of
the language but of the file structure.
It depends on the programs which read the file: can they be changed to
accept lines which begin with something that says ‘skip me’, such as a
‘#’?

No, not really… It’s a large amount of text, one line after another.
Another file contains an index of where a part of the file starts and
where it ends. The code should remove the part that starts at the
position specified in the second file and ends at the other specified
position.


#4

On Dec 18, 2005, at 12:32 PM, Thomas D. wrote:

Is it possible to remove one or more lines from a file, without
reading the whole file and writing it all out again? This is because
I’ll have to do this with files of 1 gigabyte and larger… Is there a
high-performance solution for this?

In general, no. I’m answering from the perspective of a typical
Unix/Posix file system. You can truncate a file to discard some
number of trailing bytes using the ftruncate system call.
In Ruby that system call is accessed via File.truncate.

There is no analogous function for removing bytes at the start
of a file. You can, of course, seek to any position in a file before
doing IO. In Ruby you want to look at IO#seek.
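
For example, a minimal sketch of both calls (“data.txt” and the
100-byte counts are just placeholders):

# drop the trailing 100 bytes of a file in place
File.truncate("data.txt", File.size("data.txt") - 100)

# skip the first 100 bytes when reading
File.open("data.txt", "rb") { |f|
  f.seek(100, IO::SEEK_SET)   # position past the unwanted prefix
  rest = f.read               # everything from byte 100 onward
}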

Hope this helps.

Gary W.


#5

On 12/18/05, Thomas D. removed_email_address@domain.invalid wrote:

Is it possible to remove one or more lines from a file, without
reading the whole file and writing it all out again? This is because
I’ll have to do this with files of 1 gigabyte and larger… Is there a
high-performance solution for this?

Sorry, I have to agree with the other respondents.
So you are looking for a high-performance rewrite of the file as the
next best thing.

For higher performance on large files, you’ll want to read into a
buffer. Something like this:

BUFFER_SIZE = 10000

# infile: input file name
# outfile: target file name
# omit_start, omit_end: skips the part between start and end;
#   start/end given as file positions (byte offsets)

def copy_file_except(infile, outfile, omit_start, omit_end)
  out = File.open(outfile, 'w')
  begin
    File.open(infile, 'r') { |input|
      buffer = ' ' * BUFFER_SIZE   # create a read buffer
      fpos = 0                     # current file position for end of buffer
      while input.read(BUFFER_SIZE, buffer)
        bstart = fpos                  # file position of buffer start
        bend = fpos + buffer.length    # file position of buffer end
        fpos += buffer.length
        if bend < omit_start or bstart > omit_end
          out.write(buffer)   # entire buffer outside 'omit' range
          next
        elsif bstart >= omit_start and bend <= omit_end
          next                # entire buffer inside 'omit' range; skip all of it
        end
        # first part of buffer outside omit range
        out.write(buffer[0, omit_start - bstart]) if bstart < omit_start
        # last part of buffer outside omit range
        out.write(buffer[(omit_end - bstart)..-1]) if bend > omit_end
      end
    }
  ensure
    out.close
  end
end
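
For instance, to drop the bytes between positions 1000000 and 2000000
(the file names here are just placeholders):

copy_file_except('big.log', 'big_trimmed.log', 1000000, 2000000)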

I don’t have any gigabyte files lying around, but a 25MB file took
~17s with File.each_line and ~2s using a buffer as above.

I also think I would try experimenting with the ‘IO#sysread/IO#syswrite’
methods.

I’m fairly new to Ruby as well (couple of weeks), so there may be
simpler and/or better performing solutions available.


#6

Hmmm… not a general solution, but depending on the specific
requirements there may be a higher-performance method.

IFF:
records are fixed size
record order isn’t fixed

Swap the record to be deleted with a valid record in the file (you
don’t even need to do a full swap; just overwrite the bad record) -
repeat until all records to be deleted are at the end of the file,
then truncate the file before them.

With proper seeking, this is going to have minimal reading and writing.
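
A minimal sketch of the idea, assuming fixed-size records (REC_SIZE,
the file name, and the record index are placeholders I made up):

REC_SIZE = 128   # assumed record size in bytes

def delete_record(path, index)
  File.open(path, 'r+') { |f|
    count = f.stat.size / REC_SIZE
    f.seek((count - 1) * REC_SIZE)      # read the last record
    last = f.read(REC_SIZE)
    f.seek(index * REC_SIZE)            # overwrite the doomed record with it
    f.write(last)
    f.truncate((count - 1) * REC_SIZE)  # chop off the now-duplicate tail
  }
end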

(if all deletions were block aligned, I’d start looking into direct
filesystem manipulation for pure performance, but I don’t know how
that would work – and I don’t think it would have nice effects on
fragmentation)

-Charlie


#7

On Dec 18, 2005, at 7:23 PM, Charles B. wrote:

Swap record to be deleted with valid record in file (don’t even need
to do a full swap, just overwrite the bad record) - repeat until all
records to be deleted are at the end of the file and truncate the file
before them.

Reasonable idea.

(if all deletions were block aligned, i’d start looking into direct
filesystem manipulation for pure performance, but I don’t know how
that would work – and i don’t think it would have nice effects on
fragmentation)

I hope you don’t mean reading/writing to the raw device. I would
do a whole lot of other things before I would drop down to
raw device access. Premature optimization is not a good thing.

Gary W.


#8

On 12/18/05, removed_email_address@domain.invalid removed_email_address@domain.invalid wrote:

I hope you don’t mean reading/writing to the raw device. I would
do a whole lot of other things before I would drop down to
raw device access. Premature optimization is not a good thing.

nah… it would require filesystem interfaces to the block mapping in
the inode - I don’t know if such things exist and they certainly
aren’t portable, but it seems like it could be a very efficient way to
drop data out of the middle of the file. Probably over-optimizing
though.

Of course, if the records were fixed size and block aligned, the
shuffling would be pretty efficient and the extra level of
optimization would likely be overkill.

-Charlie


#9

Thomas D. wrote:

Hello,

I’m relatively new to Ruby and I have a question:

Is it possible to remove one or more lines from a file, without
reading the whole file and writing it all out again? This is because
I’ll have to do this with files of 1 gigabyte and larger… Is there a
high-performance solution for this?

You could read the file in small blocks or line by line and only
process that small part at once. I must say I am not quite sure
whether Ruby internally buffers the file (though I assume not), so
maybe try:

File.open(fromfile, 'r') { |from|
  File.open(tofile, 'w') { |to|
    while (line = from.gets)
      to.puts line if line == some_condition   # write only the lines that pass the test
    end
  }
}


E


#10

On Dec 18, 2005, at 8:43 PM, Charles B. wrote:

Probably over-optimizing though.

uh, yep.

Just to be clear, when I say reading/writing to the raw device I mean
something like /dev/rdisk0, which presents the underlying media as a
single large file. This bypasses the standard filesystem so that
the media just looks like a huge array of blocks. I was not suggesting
actually writing a disk device driver to interface with the hardware
directly.

It is possible to query the filesystem to get the block size of the
device. You could then arrange for your I/O to be in multiples of
the native block size to improve performance. In Ruby:

File.stat("testfile").blksize
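
A minimal sketch of putting that to use (“testfile” as above; the
16-block read size is arbitrary, and the 4096 fallback covers systems
where blksize is nil):

blk = File.stat("testfile").blksize || 4096
File.open("testfile", "rb") { |f|
  while chunk = f.read(blk * 16)   # read whole multiples of the block size
    # ... process chunk ...
  end
}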

Note: this is all system programming stuff and has little if
anything to do with Ruby except for how Ruby abstracts the underlying
filesystem calls.

Disclaimer: I have no idea what the situation is on the Windows side of
the house.


#11

On Mon, 19 Dec 2005, Charles B. wrote:

Probably over-optimizing though.

Of course, if the records were fixed size and block aligned, the
shuffling would be pretty efficient and the extra level of
optimization would likely be overkill.

-Charlie

if your records are fixed size you’d be mad to tackle this application
without considering bdb (berkeley db) and using its record database
file format. this interface would make modifying the data extremely
quick. also, if the records are in fact fixed size, using mmap is the
cheapest way:

[ahoward@jib ahoward]$ cat a.rb
require "yaml"
require "mmap"

records =
  %w( a b c ),
  %w( 0 1 2 ),
  %w( x y z )
open("records", "w"){|f| f.write records.join}

y "records" => IO::read("records")

mmap = Mmap::new "records", "rw", Mmap::MAP_SHARED

record_0 = mmap[0,3]
record_1 = mmap[3,3]
record_2 = mmap[6,3]
mmap[3,3] = record_2   # move record down
mmap[6 .. -1] = ""     # truncate

mmap.msync
mmap.munmap

y "records" => IO::read("records")

[ahoward@jib ahoward]$ ruby a.rb

records: abc012xyz

records: abcxyz

it’s tough to do io better than the kernel…

regards.

-a


#12

Windows has block sizes too, but File.stat(…).blksize returns nil.
(With Win XP Pro, Ruby 1.8.2)

But you can hardcode block sizes: Find or create a small file (a few
hundred bytes or less) and select ‘properties’ in Windows Explorer. It
says something like “size: 112 bytes. size on disk: 4096 bytes”. ‘Size
on disk’ is the block size.

Sysread/write on my system seems to benefit from using a block-sized
buffer; it is slower with a buffer twice or half the size of the
block. Thanks for the tip :-)
Plain buffered read/write appears to be less sensitive to buffer size,
but performs best at about twice the block size. There seems to be
little to distinguish buffered standard read/write from buffered
sysread/write.
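
For reference, a minimal sketch of the kind of block-sized
sysread/syswrite copy loop timed below (file names are placeholders):

BLOCK = 4096
File.open('in.dat', 'rb') { |from|
  File.open('out.dat', 'wb') { |to|
    begin
      # sysread raises EOFError when the input is exhausted
      loop { to.syswrite(from.sysread(BLOCK)) }
    rescue EOFError
    end
  }
}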

Timings on a laptop for read/write of a 1.2 GB file:

  • 3.5 min: buffered plain read/write (buffer 8192), buffered
    sysread/write (buffer 4096)
  • 17 min: File#each

Just in case the records are not fixed size.

jf

BTW: There are two more scenarios:

  • If there is only one record to remove each time the file is opened,
    it may be possible to open the file in read/write mode (‘r+’) and
    update it in place: use IO#seek to go to the entry, and move all
    blocks following the deleted entry forward. On average you save the
    writing of half the file. If there are dozens of records, there is
    little gain, because the first one is likely to be relatively close
    to the start of the file.

  • Use ‘lazy delete’: merely overwrite the record(s) with blanks,
    nulls, newlines, whatever, or mark them as deleted in some other
    fashion. The file keeps the same size, and all entries keep the same
    file position as before. Repackage the file once in a while, removing
    the blank entries when they start to take up a significant proportion
    of the file size.
    This is clearly the best-performing solution by far, but other
    programs using the file may need to be updated to recognize the ‘this
    entry is deleted’ marking. (A minimal sketch of the overwrite follows
    below.)
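
A minimal sketch of the lazy-delete overwrite (the method name,
offsets, and file name are placeholders):

def lazy_delete(path, start, length)
  File.open(path, 'r+') { |f|
    f.seek(start)           # jump to the doomed entry
    f.write(' ' * length)   # blank it out in place; the file size is unchanged
  }
end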