How to quickly find a string towards the end of a large io object


#1

How do I scan starting at the end of a big io object to find a string
that is always closer to then end?

here’s very roughly how the contents of the IO object look:

Lots’s of stuff… lot’s of stuff…lot’s of
stuff…“my_string_i_need”…lot’s of stuff

In other words, I have a massive IO object that I cannot reverse and I
need to very quickly find a string towards the end (but still fairly
far from the end). What’s the fastest way to do this?

Thanks in advance!


#2

On Nov 5, 2008, at 11:03 PM, bwv549 wrote:

far from the end). What’s the fastest way to do this?

Thanks in advance!

binary search using seek. in all honestly though, it’d have to be
really big to beat a regex if the entire file contents. define big?

a @ http://codeforpeople.com/


#3

ara.t.howard wrote:

In other words, I have a massive IO object that I cannot reverse and I
need to very quickly find a string towards the end (but still fairly
far from the end). What’s the fastest way to do this?

Thanks in advance!

binary search using seek.

He didn’t say his data was sorted, did he?


#4

2008/11/6 bwv549 removed_email_address@domain.invalid:

How do I scan starting at the end of a big io object to find a string
that is always closer to then end?

here’s very roughly how the contents of the IO object look:

Lots’s of stuff… lot’s of stuff…lot’s of
stuff…“my_string_i_need”…lot’s of stuff

In other words, I have a massive IO object that I cannot reverse and I

What is a “massive IO object”? Are we talking about a large file? A
socket or pipe which yields a lot data?

need to very quickly find a string towards the end (but still fairly
far from the end). What’s the fastest way to do this?

Start scanning at the beginning and travel forward in time. :slight_smile:

No seriously, we need a bit more detail.

Kind regards

robert


#5

bwv549 wrote:

How do I scan starting at the end of a big io object to find a string
that is always closer to then end?

here’s very roughly how the contents of the IO object look:

Lots’s of stuff… lot’s of stuff…lot’s of
stuff…“my_string_i_need”…lot’s of stuff

In other words, I have a massive IO object that I cannot reverse and I
need to very quickly find a string towards the end (but still fairly
far from the end). What’s the fastest way to do this?

Thanks in advance!

They are correct that more info is needed. If it is just a huge amount
of text and it absolutely had to run faster, in other languages I
would break up the string and have separate threads search.


#6

On Thu, 6 Nov 2008, Lloyd L. wrote:

bwv549 wrote:

How do I scan starting at the end of a big io object to find a string
that is always closer to then end?
[…]

In other words, I have a massive IO object that I cannot reverse and I
need to very quickly find a string towards the end (but still fairly
far from the end). What’s the fastest way to do this?

Thanks in advance!

They are correct that more info is needed. If it is just a huge amount
of text and it absolutely had to run faster, in other languages I
would break up the string and have separate threads search.

Other things to take into account would include:

  • Will be big string be kept for some time?
  • Will there be multiple searches on it?
  • Is it bigger than RAM + swap?

You might find that if you’re doing multiple searches that a suffix
array
(see Programming Pearls) is useful. You may wish to look at Ferret. You
may want to call out to another program to do this if is is obtained
from
a file.

ri String#rindex gives:

---------------------------------------------------------- String#rindex
str.rindex(substring [, fixnum]) => fixnum or nil
str.rindex(fixnum [, fixnum]) => fixnum or nil
str.rindex(regexp [, fixnum]) => fixnum or nil

 Returns the index of the last occurrence of the given _substring_,
 character (_fixnum_), or pattern (_regexp_) in _str_. Returns +nil+
 if not found. If the second parameter is present, it specifies the
 position in the string to end the search---characters beyond this
 point will not be considered.

    "hello".rindex('e')             #=> 1
    "hello".rindex('l')             #=> 3
    "hello".rindex('a')             #=> nil
    "hello".rindex(101)             #=> 1
    "hello".rindex(/[aeiou]/, -2)   #=> 1

Does this actually search from the end backwards? Not checked the
source
(recently enough || at all) to know, but I’d expect so.

    HTH
    Hugh

#7

Here’s roughly what I’m using right now (sort of an amalgamation of
everyone’s thoughts):

File.open(filename) do |handle|
halfway = handle.stat.size / 2
handle.seek halfway
last_half = handle.read
params_start_index = last_half.rindex(my_substring) + halfway
end

using rindex this takes .17 sec on an 85MB file
using index it takes .45 sec on the same file

Those numbers argue that rindex indeed seeks from the end.

Thanks for everyone’s help and I will check back to see if there are
any more thoughts.


#8

I really appreciate all the comments. They are very helpful. I
should have been more specific:

In this case, I am dealing with a file that will usually be >= 100
MB. It may easily be up to 1GB or more. The file is in binary form,
but about 9/10ths of the way there is a small text file embedded into
it. I need to check a parameter in the file before processing the
complete binary format (which I’ve decoded). At this point I only
expect to see files, but I was hoping to make it generic to any IO
object. I would settle on a solution for a file (rather than an IO
object), but I would like it to be cross-platform and prefer the
solution in ruby if it is fast enough (who wouldn’t).

Again, the basic layout of the file:

…my_param = X…

I need the value ‘X’ following the 'my_para = ’

Thanks again!


#9

2008/11/6 bwv549 removed_email_address@domain.invalid:

object), but I would like it to be cross-platform and prefer the
solution in ruby if it is fast enough (who wouldn’t).

Again, the basic layout of the file:

…my_param = X…

I need the value ‘X’ following the 'my_para = ’

For simplicity reasons I’d start with this:

def get_param(file_name)
size = File.size file_name
File.open file_name do |io|
io.seek(-(size * 0.1).to_i, IO::SEEK_END)
io.read[%r{my_param\s*=\s*(\S+)}, 1]
end
end

Kind regards

robert


#10

On Nov 6, 2008, at 10:18 AM, bwv549 wrote:

using rindex this takes .17 sec on an 85MB file
using index it takes .45 sec on the same file

Those numbers argue that rindex indeed seeks from the end.

Thanks for everyone’s help and I will check back to see if there are
any more thoughts.

cfp:~ > cat a.rb
def tail_search io, re, options = {}
io = open io unless io.respond_to?(:read)
percent = Float(options[‘percent’]||options[:percent]||0.10)

buf = ‘’
size = io.stat.size
pagesize = Integer(size * percent)
pos = 0

loop do
pos -= pagesize
break if pos.abs > size
io.seek(pos, IO::SEEK_END)
buf = buf + io.read(pagesize)
match = re.match(buf)
return match if match
end

return nil
ensure
io.close rescue nil
end

match = tail_search(FILE, /(key)=(val)/, :percent => 0.02)
p match.to_a if match

p tail_search(FILE, /(key)=(value)/)

END
key=val

cfp:~ > ruby a.rb
[“key=val”, “key”, “val”]
nil

you may want to modify to use rindex instead of a regex - up to you.

a @ http://codeforpeople.com/


#11

On Nov 6, 2008, at 10:18 AM, bwv549 wrote:

using rindex this takes .17 sec on an 85MB file
using index it takes .45 sec on the same file

Those numbers argue that rindex indeed seeks from the end.

Thanks for everyone’s help and I will check back to see if there are
any more thoughts.

this one never reads more than pagesize into memory and deals with the
fact that a needle could straddle two pages by keeping the current
page plus the previous page’s first bit (only need maximum of needle
size bytes) as the search target.

the extra code is just showing you that it does, in fact, find it’s
target.

you can up the percent for speed or crank it down to save on memory.

cfp:~ > cat a.rb
def tail_search io, needle, options = {}
io = open io unless io.respond_to?(:read)
percent = Float(options[‘percent’]||options[:percent]||0.10)

buf = ‘’
size = io.stat.size
pagesize = Integer(size * percent)
pos = 0

loop do
pos -= pagesize
break if pos.abs > size
io.seek(pos, IO::SEEK_END)
buf = io.read(pagesize) + buf[0, needle.size]
relative_index = buf.index(needle)
if relative_index
absolute_index = size + pos + relative_index
return absolute_index
end
end

return nil
ensure
io.close rescue nil
end

needle = ‘key=val’
index = tail_search(FILE, needle, :percent => 0.02)
if index
open(FILE) do |fd|
fd.seek index
puts fd.read(needle.size)
end
end

needle = ‘io.close rescue nil’
index = tail_search(FILE, needle, :percent => 0.02)
if index
open(FILE) do |fd|
fd.seek index
puts fd.read(needle.size)
end
end

END
key=val

cfp:~ > ruby a.rb
key=val
io.close rescue nil

a @ http://codeforpeople.com/


#12

On Fri, 7 Nov 2008, bwv549 wrote:

I really appreciate all the comments. They are very helpful. I
should have been more specific:

In this case, I am dealing with a file that will usually be >= 100
MB. It may easily be up to 1GB or more. The file is in binary form,
but about 9/10ths of the way there is a small text file embedded into
it. I need to check a parameter in the file before processing the
complete binary format (which I’ve decoded). At this point I only
expect to see files, but I was hoping to make it generic to any IO

You can do that with read.

object. I would settle on a solution for a file (rather than an IO
object), but I would like it to be cross-platform and prefer the
solution in ruby if it is fast enough (who wouldn’t).

Other languages may be faster of course, but first make it work…

Again, the basic layout of the file:

…my_param = X…
let the length of your data ^^^^^^^^^^^^ be k

First try reading the whole lot. If it’s too big you’ll get an
exception.
Otherwise use rindex on the buffer from the successful read.

If it’s a stream you’ll have to read it on order anyway,
so read it in chunks with a 2*k overlap so you can definitely
fit the whole string in a chunk, and boundaries between reads
don’t matter. That should work on 10GB files. You might want
to read in multiples of 512 bytes on Unix, maybe 4k (cluster size??)
on Windows, not sure.

I need the value ‘X’ following the 'my_para = ’

Thanks again!

    Hugh