Forum: Ruby how to quickly find a string towards the end of a large io object

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
883ee4d990ef9e57a79b5b0ff3b66241?d=identicon&s=25 bwv549 (Guest)
on 2008-11-06 07:06
(Received via mailing list)
How do I scan starting at the end of a big io object to find a string
that is always closer to then end?

here's very roughly how the contents of the IO object look:

Lots's of stuff... lot's of stuff....lot's of
stuff...."my_string_i_need"...lot's of stuff

In other words, I have a massive IO object that I cannot reverse and I
need to very quickly find a string towards the end (but still fairly
far from the end).  What's the fastest way to do this?

Thanks in advance!
1bac2e65d64faf472cf2ebc94f0f5ee0?d=identicon&s=25 Ara Howard (ahoward)
on 2008-11-06 07:17
(Received via mailing list)
On Nov 5, 2008, at 11:03 PM, bwv549 wrote:

> far from the end).  What's the fastest way to do this?
>
> Thanks in advance!

binary search using seek.  in all honestly though, it'd have to be
really big to beat a regex if the entire file contents.  define big?

a @ http://codeforpeople.com/
7a561ec0875fcbbe3066ea8fe288ec77?d=identicon&s=25 Sebastian Hungerecker (Guest)
on 2008-11-06 08:09
(Received via mailing list)
ara.t.howard wrote:
> > In other words, I have a massive IO object that I cannot reverse and I
> > need to very quickly find a string towards the end (but still fairly
> > far from the end).  What's the fastest way to do this?
> >
> > Thanks in advance!
>
> binary search using seek.

He didn't say his data was sorted, did he?
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2008-11-06 14:44
(Received via mailing list)
2008/11/6 bwv549 <jtprince@gmail.com>:
> How do I scan starting at the end of a big io object to find a string
> that is always closer to then end?
>
> here's very roughly how the contents of the IO object look:
>
> Lots's of stuff... lot's of stuff....lot's of
> stuff...."my_string_i_need"...lot's of stuff
>
> In other words, I have a massive IO object that I cannot reverse and I

What is a "massive IO object"?  Are we talking about a large file?  A
socket or pipe which yields a lot data?

> need to very quickly find a string towards the end (but still fairly
> far from the end).  What's the fastest way to do this?

Start scanning at the beginning and travel forward in time.  :-)

No seriously, we need a bit more detail.

Kind regards

robert
84dc575c33a123789521d53cad0f62ae?d=identicon&s=25 Lloyd Linklater (lloyd)
on 2008-11-06 15:04
bwv549 wrote:
> How do I scan starting at the end of a big io object to find a string
> that is always closer to then end?
>
> here's very roughly how the contents of the IO object look:
>
> Lots's of stuff... lot's of stuff....lot's of
> stuff...."my_string_i_need"...lot's of stuff
>
> In other words, I have a massive IO object that I cannot reverse and I
> need to very quickly find a string towards the end (but still fairly
> far from the end).  What's the fastest way to do this?
>
> Thanks in advance!

They are correct that more info is needed.  If it is just a huge amount
of text and it absolutely *had* to run faster, in other languages I
would break up the string and have separate threads search.
457cf540784a12ba2f30e06565a2c189?d=identicon&s=25 Hugh Sasse (Guest)
on 2008-11-06 15:55
(Received via mailing list)
On Thu, 6 Nov 2008, Lloyd Linklater wrote:

> bwv549 wrote:
> > How do I scan starting at the end of a big io object to find a string
> > that is always closer to then end?
        [...]
> > In other words, I have a massive IO object that I cannot reverse and I
> > need to very quickly find a string towards the end (but still fairly
> > far from the end).  What's the fastest way to do this?
> >
> > Thanks in advance!
>
> They are correct that more info is needed.  If it is just a huge amount
> of text and it absolutely *had* to run faster, in other languages I
> would break up the string and have separate threads search.

Other things to take into account would include:
 * Will be big string be kept for some time?
 * Will there be multiple searches on it?
 * Is it bigger than RAM + swap?

You might find that if you're doing multiple searches that a suffix
array
(see Programming Pearls) is useful. You may wish to look at Ferret. You
may want to call out to another program to do this if is is obtained
from
a file.

ri String#rindex gives:

---------------------------------------------------------- String#rindex
     str.rindex(substring [, fixnum])   => fixnum or nil
     str.rindex(fixnum [, fixnum])   => fixnum or nil
     str.rindex(regexp [, fixnum])   => fixnum or nil
------------------------------------------------------------------------
     Returns the index of the last occurrence of the given _substring_,
     character (_fixnum_), or pattern (_regexp_) in _str_. Returns +nil+
     if not found. If the second parameter is present, it specifies the
     position in the string to end the search---characters beyond this
     point will not be considered.

        "hello".rindex('e')             #=> 1
        "hello".rindex('l')             #=> 3
        "hello".rindex('a')             #=> nil
        "hello".rindex(101)             #=> 1
        "hello".rindex(/[aeiou]/, -2)   #=> 1

Does this actually search from the end backwards?  Not checked the
source
(recently enough || at all) to know, but I'd expect so.

        HTH
        Hugh
883ee4d990ef9e57a79b5b0ff3b66241?d=identicon&s=25 bwv549 (Guest)
on 2008-11-06 17:45
(Received via mailing list)
I really appreciate all the comments.  They are very helpful.  I
should have been more specific:

In this case, I am dealing with a file that will usually be >= 100
MB.  It may easily be up to 1GB or more.  The file is in binary form,
but about 9/10ths of the way there is a small text file embedded into
it.  I need to check a parameter in the file before processing the
complete binary format (which I've decoded).  At this point I only
expect to see files, but I was hoping to make it generic to any IO
object.  I would settle on a solution for a file (rather than an IO
object), but I would like it to be cross-platform and prefer the
solution in ruby if it is fast enough (who wouldn't).

Again, the basic layout of the file:

................................my_param = X........

I need the value 'X' following the 'my_para = '

Thanks again!
883ee4d990ef9e57a79b5b0ff3b66241?d=identicon&s=25 bwv549 (Guest)
on 2008-11-06 18:20
(Received via mailing list)
Here's roughly what I'm using right now (sort of an amalgamation of
everyone's thoughts):

File.open(filename) do |handle|
      halfway = handle.stat.size / 2
      handle.seek halfway
      last_half = handle.read
      params_start_index = last_half.rindex(my_substring) + halfway
end

using rindex this takes .17 sec on an 85MB file
using index it takes  .45 sec on the same file

Those numbers argue that rindex indeed seeks from the end.

Thanks for everyone's help and I will check back to see if there are
any more thoughts.
1bac2e65d64faf472cf2ebc94f0f5ee0?d=identicon&s=25 Ara Howard (ahoward)
on 2008-11-06 18:28
(Received via mailing list)
On Nov 6, 2008, at 10:18 AM, bwv549 wrote:

> using rindex this takes .17 sec on an 85MB file
> using index it takes  .45 sec on the same file
>
> Those numbers argue that rindex indeed seeks from the end.
>
> Thanks for everyone's help and I will check back to see if there are
> any more thoughts.
>



cfp:~ > cat a.rb
def tail_search io, re, options = {}
   io = open io unless io.respond_to?(:read)
   percent = Float(options['percent']||options[:percent]||0.10)

   buf = ''
   size = io.stat.size
   pagesize = Integer(size * percent)
   pos = 0

   loop do
     pos -= pagesize
     break if pos.abs > size
     io.seek(pos, IO::SEEK_END)
     buf = buf + io.read(pagesize)
     match = re.match(buf)
     return match if match
   end

   return nil
ensure
   io.close rescue nil
end


match = tail_search(__FILE__,  /(key)=(val)/, :percent => 0.02)
p match.to_a if match

p tail_search(__FILE__,  /(key)=(value)/)

__END__
key=val







cfp:~ > ruby a.rb
["key=val", "key", "val"]
nil




you may want to modify to use rindex instead of a regex - up to you.



a @ http://codeforpeople.com/
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2008-11-06 18:41
(Received via mailing list)
2008/11/6 bwv549 <jtprince@gmail.com>:
> object), but I would like it to be cross-platform and prefer the
> solution in ruby if it is fast enough (who wouldn't).
>
> Again, the basic layout of the file:
>
> ................................my_param = X........
>
> I need the value 'X' following the 'my_para = '

For simplicity reasons I'd start with this:

def get_param(file_name)
  size = File.size file_name
  File.open file_name do |io|
    io.seek(-(size * 0.1).to_i, IO::SEEK_END)
    io.read[%r{my_param\s*=\s*(\S+)}, 1]
  end
end

Kind regards

robert
1bac2e65d64faf472cf2ebc94f0f5ee0?d=identicon&s=25 Ara Howard (ahoward)
on 2008-11-06 18:45
(Received via mailing list)
On Nov 6, 2008, at 10:18 AM, bwv549 wrote:

> using rindex this takes .17 sec on an 85MB file
> using index it takes  .45 sec on the same file
>
> Those numbers argue that rindex indeed seeks from the end.
>
> Thanks for everyone's help and I will check back to see if there are
> any more thoughts.
>



this one never reads more than pagesize into memory and deals with the
fact that a needle could straddle two pages by keeping the current
page plus the previous page's first bit (only need maximum of needle
size bytes) as the search target.

the extra code is just showing you that it does, in fact, find it's
target.

you can up the percent for speed or crank it down to save on memory.


cfp:~ > cat a.rb
def tail_search io, needle, options = {}
   io = open io unless io.respond_to?(:read)
   percent = Float(options['percent']||options[:percent]||0.10)

   buf = ''
   size = io.stat.size
   pagesize = Integer(size * percent)
   pos = 0

   loop do
     pos -= pagesize
     break if pos.abs > size
     io.seek(pos, IO::SEEK_END)
     buf = io.read(pagesize) + buf[0, needle.size]
     relative_index = buf.index(needle)
     if relative_index
       absolute_index = size + pos + relative_index
       return absolute_index
     end
   end

   return nil
ensure
   io.close rescue nil
end


needle = 'key=val'
index = tail_search(__FILE__,  needle, :percent => 0.02)
if index
   open(__FILE__) do |fd|
     fd.seek index
     puts fd.read(needle.size)
   end
end


needle = 'io.close rescue nil'
index = tail_search(__FILE__,  needle, :percent => 0.02)
if index
   open(__FILE__) do |fd|
     fd.seek index
     puts fd.read(needle.size)
   end
end


__END__
key=val




cfp:~ > ruby a.rb
key=val
io.close rescue nil

a @ http://codeforpeople.com/
457cf540784a12ba2f30e06565a2c189?d=identicon&s=25 Hugh Sasse (Guest)
on 2008-11-07 17:54
(Received via mailing list)
On Fri, 7 Nov 2008, bwv549 wrote:

> I really appreciate all the comments.  They are very helpful.  I
> should have been more specific:
>
> In this case, I am dealing with a file that will usually be >= 100
> MB.  It may easily be up to 1GB or more.  The file is in binary form,
> but about 9/10ths of the way there is a small text file embedded into
> it.  I need to check a parameter in the file before processing the
> complete binary format (which I've decoded).  At this point I only
> expect to see files, but I was hoping to make it generic to any IO

You can do that with read.

> object.  I would settle on a solution for a file (rather than an IO
> object), but I would like it to be cross-platform and prefer the
> solution in ruby if it is fast enough (who wouldn't).

Other languages may be faster of course, but first make it work...
>
> Again, the basic layout of the file:
>
> ................................my_param = X........
let the length of your data       ^^^^^^^^^^^^ be k

First try reading the whole lot. If it's too big you'll get an
exception.
Otherwise use rindex on the buffer from the successful read.

If it's a stream you'll have to read it on order anyway,
so read it in chunks with a 2*k overlap so you can definitely
fit the whole string in a chunk, and boundaries between reads
don't matter.  That should work on 10GB files.  You might want
to read in multiples of 512 bytes on Unix, maybe 4k (cluster size??)
on Windows, not sure.

>
> I need the value 'X' following the 'my_para = '
>
> Thanks again!
>
        Hugh
This topic is locked and can not be replied to.