Removing parts from a file (Forum: Ruby)

Thomas Dutch (thomas)
on 2005-12-18 18:31
Hello,

I'm relatively new to Ruby and I have a question:

Is it possible to remove one or more lines from a file, without reading
the whole file and writing it away again? This because I'll have to do
this with files of 1 gigabyte and larger... Is there a high performance
solution for this?

Thank you!
Harpo (Guest)
on 2005-12-18 18:59
(Received via mailing list)
Thomas Dutch wrote:

> Hello,
>
> I'm relatively new to Ruby and I have a question:
>
> Is it possible to remove one or more lines from a file, without
> reading the whole file and writing it away again? This because I'll
> have to do this with files of 1 gigabyte and larger... Is there a high
> performance solution for this?
>
> Thank you!

Put like that, I don't think it is possible: it is not a language
issue but a file-structure issue.
It depends on the programs which read the file: can they be changed
to accept lines which begin with something that says 'skip me', such
as a '#'?
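
For example, deleting a line would then just mean overwriting its first
character, and a reader would do something like this (a sketch; the
file name and handle_line are made up):

  # Reader that treats lines starting with '#' as deleted.
  File.foreach("data.txt") do |line|
    next if line =~ /\A#/    # skip records marked as deleted
    handle_line(line)        # stand-in for the reader's real work
  end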
Thomas Dutch (thomas)
on 2005-12-18 19:01
Harpo wrote:
> Thomas Dutch wrote:
>
>> Hello,
>>
>> I'm relatively new to Ruby and I have a question:
>>
>> Is it possible to remove one or more lines from a file, without
>> reading the whole file and writing it away again? This because I'll
>> have to do this with files of 1 gigabyte and larger... Is there a high
>> performance solution for this?
>>
>> Thank you!
>
> Put like that, I don't think it is possible: it is not a language
> issue but a file-structure issue.
> It depends on the programs which read the file: can they be changed
> to accept lines which begin with something that says 'skip me', such
> as a '#'?

No, not really... It's one large amount of text, line after line.
Another file contains an index saying where a part of the file starts
and where it ends. The code should remove the part that starts at the
position specified in the second file and ends at the other specified
position.
unknown (Guest)
on 2005-12-18 19:47
(Received via mailing list)
On Dec 18, 2005, at 12:32 PM, Thomas Dutch wrote:
> Is it possible to remove one or more lines from a file, without reading
> the whole file and writing it away again? This because I'll have to do
> this with files of 1 gigabyte and larger... Is there a high performance
> solution for this?

In general, no.  I'm answering from the perspective of a typical
Unix/Posix file system.  You can truncate a file to discard some
number of trailing bytes using the ftruncate system call.
In Ruby that system call is accessed via File.truncate.

There is no analogous function for removing bytes at the start
of a file.  You can, of course, seek to any position in a file before
doing I/O.  In Ruby, look at IO#seek.
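
A quick sketch of both (the file names, sizes, and helper name are
made up):

  # Drop the last n bytes of a file.
  def chop_tail(path, n)
    size = File.size(path)
    File.truncate(path, size - n) if size > n
  end

  # Seek past an unwanted prefix before reading.
  File.open("big.dat", "rb") do |f|
    f.seek(1024, IO::SEEK_SET)   # skip the first 1024 bytes
    chunk = f.read(4096)         # then read as usual
  end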

Hope this helps.


Gary Wright
Eero Saynatkari (rue)
on 2005-12-18 23:03
Thomas Dutch wrote:
> Hello,
>
> I'm relatively new to Ruby and I have a question:
>
> Is it possible to remove one or more lines from a file, without reading
> the whole file and writing it away again? This because I'll have to do
> this with files of 1 gigabyte and larger... Is there a high performance
> solution for this?

You could read the file in small blocks or line by line and process
only a small part at a time. I must say I am not quite sure whether
Ruby buffers the file internally (though I assume not), so maybe try:

  File.open(fromfile, 'r') {|from|
    File.open(tofile, 'w') {|to|
      while (line = from.gets)
        to.puts line unless line == some_condition  # skip the lines to remove
      end
    }
  }

> Thank you!


E
Johannes Friestad (Guest)
on 2005-12-18 23:20
(Received via mailing list)
On 12/18/05, Thomas Dutch <rubyforum@ikwisthet.net> wrote:
> Is it possible to remove one or more lines from a file, without reading
> the whole file and writing it away again? This because I'll have to do
> this with files of 1 gigabyte and larger... Is there a high performance
> solution for this?
>

Sorry, I have to agree with the other respondents.
So you are looking for a high-performance copy as the next best
thing.

For higher performance on large files, you'll want to read into a
buffer. Something like this:

BUFFER_SIZE = 10000

# infile: input file name.
# outfile: target file name.
# omit_start, omit_end: the part to skip, given as byte offsets into the file.
def copy_file_except(infile, outfile, omit_start, omit_end)
  out = File.open(outfile, 'wb')  # binary mode keeps byte offsets exact on Windows
  begin
    File.open(infile, 'rb') {|input|
      buffer = " " * BUFFER_SIZE  # create a read buffer
      fpos = 0                    # current file position for end of buffer
      while input.read(BUFFER_SIZE, buffer)
        bstart = fpos                # file position of buffer start
        bend = fpos + buffer.length  # file position of buffer end
        fpos += buffer.length
        if bend < omit_start or bstart > omit_end  # entire buffer outside 'omit' range
          out.write(buffer)
          next
        elsif bstart >= omit_start and bend <= omit_end  # entire buffer inside 'omit' range
          next  # skip all of it
        end
        # first part of buffer outside omit range
        out.write(buffer[0, omit_start - bstart]) if bstart < omit_start
        # last part of buffer outside omit range
        out.write(buffer[(omit_end - bstart)..-1]) if bend > omit_end
      end
    }
  ensure
    out.close
  end
end

I don't have any gigabyte files lying around, but a 25MB file took
~17s with File.each_line and ~2s using a buffer as above.

I would also experiment with the IO#sysread/IO#syswrite methods.

I'm fairly new to Ruby as well (couple of weeks), so there may be
simpler and/or better performing solutions available.
Charles Ballowe (Guest)
on 2005-12-19 01:23
(Received via mailing list)
Hmmm... not a general solution, but depending on the specific
requirements there may be a higher-performance method.

IFF:
records are fixed size,
record order isn't fixed:

Overwrite each record to be deleted with a valid record from the end
of the file (you don't even need a full swap, just overwrite the bad
record); repeat until all the records to be deleted are at the end of
the file, then truncate the file before them. (A sketch follows
below.)

With proper seeking, this involves minimal reading and writing.
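
Something like this for fixed-size records (a sketch; REC_SIZE, the
method name, and the index arithmetic are all assumptions):

  REC_SIZE = 128   # assumed fixed record size in bytes

  # Delete the record at index idx by overwriting it with the last
  # record, then truncating. Assumes record order doesn't matter.
  def delete_record(path, idx)
    File.open(path, 'r+b') do |f|
      last = f.stat.size / REC_SIZE - 1
      if idx < last
        f.seek(last * REC_SIZE)
        rec = f.read(REC_SIZE)      # grab the last record
        f.seek(idx * REC_SIZE)
        f.write(rec)                # overwrite the record being deleted
      end
      f.truncate(last * REC_SIZE)   # chop the now-redundant tail
    end
  end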

(if all deletions were block aligned, i'd start looking into direct
filesystem manipulation for pure performance, but I don't know how
that would work -- and i don't think it would have nice effects on
fragmentation)

-Charlie
unknown (Guest)
on 2005-12-19 02:17
(Received via mailing list)
On Dec 18, 2005, at 7:23 PM, Charles Ballowe wrote:
> Swap record to be deleted with valid record in file (don't even need
> to do a full swap, just overwrite the bad record) - repeat until all
> records to be deleted are at the end of the file and truncate the file
> before them.

Reasonable idea.

> (if all deletions were block aligned, i'd start looking into direct
> filesystem manipulation for pure performance, but I don't know how
> that would work -- and i don't think it would have nice effects on
> fragmentation)

I hope you don't mean reading/writing to the raw device.   I would
do a whole lot of other things before I would drop down to
raw device access. Premature optimization is not a good thing.

Gary Wright
Charles Ballowe (Guest)
on 2005-12-19 02:44
(Received via mailing list)
On 12/18/05, gwtmp01@mac.com <gwtmp01@mac.com> wrote:
>
> I hope you don't mean reading/writing to the raw device.   I would
> do a whole lot of other things before I would drop down to
> raw device access. Premature optimization is not a good thing.
>
nah... it would require filesystem interfaces to the block mapping in
the inode - I don't know if such things exist and they certainly
aren't portable, but it seems like it could be a very efficient way to
drop data out of the middle of the file. Probably over-optimizing
though.

Of course if the records were fixed sized and block aligned, the
shuffling would be pretty efficient and the extra level of
optimization would likely be overkill.

-Charlie
unknown (Guest)
on 2005-12-19 03:23
(Received via mailing list)
On Dec 18, 2005, at 8:43 PM, Charles Ballowe wrote:
> though.
uh, yep.

Just to be clear, when I say reading/writing to the raw device I mean
something like /dev/rdisk0, which presents the underlying media as a
single large file.  This bypasses the standard filesystem so that
the media just looks like a huge array of blocks.  I was not suggesting
actually writing a disk device driver to interface with the hardware
directly.

It is possible to query the filesystem to get the block size of the
device.  You could then arrange for your I/O to be in multiples of
the native block size to improve performance.  In Ruby:

	File.stat("testfile").blksize
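
You could then size a copy buffer from it, for instance (a sketch; the
16x multiple and the file names are arbitrary):

	blk = File.stat("testfile").blksize || 4096   # nil on some platforms
	buf_size = blk * 16
	File.open("testfile", "rb") do |src|
	  File.open("copy.out", "wb") do |dst|
	    while chunk = src.read(buf_size)
	      dst.write(chunk)
	    end
	  end
	end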

Note:  this is all system programming stuff and has little if
anything to do with Ruby except for how Ruby abstracts the underlying
filesystem calls.

Disclaimer:  I have no idea what the situation is on the Windows side of
the house.
unknown (Guest)
on 2005-12-19 03:44
(Received via mailing list)
On Mon, 19 Dec 2005, Charles Ballowe wrote:

> though.
>
> Of course if the records were fixed sized and block aligned, the
> shuffling would be pretty efficient and the extra level of
> optimization would likely be overkill.
>
> -Charlie

if your records are fixed size you'd be mad to tackle this application
without considering bdb (berkeley db) and its record database file
format.  that interface would make modifying the data extremely quick.
also, if the records are in fact fixed size, using mmap is the
cheapest way:

   [ahoward@jib ahoward]$ cat a.rb
   require "yaml"
   require "mmap"

   records =
     %w( a b c ),
     %w( 0 1 2 ),
     %w( x y z )
   open("records", "w"){|f| f.write records.join}

   y "records" => IO::read("records")

   mmap = Mmap::new "records", "rw", Mmap::MAP_SHARED

   record_0 = mmap[0,3]
   record_1 = mmap[3,3]
   record_2 =  mmap[6,3]
   mmap[3,3] = record_2  # move record down
   mmap[6 .. -1] = ""  # truncate

   mmap.msync
   mmap.munmap


   y "records" => IO::read("records")


   [ahoward@jib ahoward]$ ruby a.rb
   ---
   records: abc012xyz
   ---
   records: abcxyz


it's tough to do io better than the kernel...

regards.

-a
Johannes Friestad (Guest)
on 2005-12-19 12:08
(Received via mailing list)
Windows has block sizes too, but File.stat(...).blksize returns nil.
(With Win XP Pro, Ruby 1.8.2)

But you can hardcode block sizes: Find or create a small file (a few
hundred bytes or less) and select 'properties' in Windows Explorer. It
says something like "size: 112 bytes. size on disk: 4096 bytes". 'Size
on disk' is the block size.

Sysread/syswrite on my system seems to benefit from a block-sized
buffer; it is slower with a buffer twice or half the block size.
Thanks for the tip :)
Plain buffered read/write appears to be less sensitive to buffer size,
but performs best with a buffer of about twice the block size. There
seems to be little to distinguish buffered standard read/write from
buffered sysread/syswrite.

Timings on a laptop for read/write of a 1.2 GB file:
- 3.5 min: buffered plain read/write (8192-byte buffer), buffered
sysread/syswrite (4096-byte buffer)
- 17 min: File#each

Just in case the records are not fixed size.

jf

BTW: There are two more scenarios:

- If there is only one record to remove each time the file is opened,
it may be possible to open the file in read/write mode ('r+') and
update it in place: use IO#seek to go to the entry and move all blocks
following the deleted entry forward, then truncate. On average you
save writing half the file. If there are dozens of records, there is
little gain, because the first one is likely to be relatively close to
the start of the file. (A sketch follows below.)

- Use 'lazy delete': merely overwrite the record(s) with blanks,
nulls, newlines, or mark them as deleted in some other fashion. The
file keeps the same size, and every entry keeps its file position.
Repackage the file once in a while, removing the blank entries when
they start to take up a significant proportion of the file size.
This is clearly the best-performing solution by far, but other
programs using the file may need to be updated to recognize the 'this
entry is deleted' marking.
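
A sketch of the first scenario's in-place shift (BLOCK, the method
name, and the byte-offset arguments are all assumptions):

  BLOCK = 65536   # copy chunk size, arbitrary

  # Remove bytes [del_start, del_end) in place by shifting everything
  # after del_end forward, then truncating the leftover tail.
  def remove_span(path, del_start, del_end)
    File.open(path, 'r+b') do |f|
      rpos, wpos = del_end, del_start
      while true
        f.seek(rpos)
        chunk = f.read(BLOCK)
        break unless chunk
        f.seek(wpos)
        f.write(chunk)
        rpos += chunk.length
        wpos += chunk.length
      end
      f.truncate(wpos)
    end
  end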