Hey all,
Could anyone advise me on a fast way to search a single, but very large
file (1 GB) for a string of text? Also, is there a library that can
report the file offset at which the string was found?
Thanks
On Thu, Jul 1, 2010 at 6:47 PM, Stuart C.
[email protected] wrote:
Hey all,
Could anyone advise me on a fast way to search a single, but very large
file (1Gb) quickly for a string of text? Also, is there a library to
identify the file offset this string was found within the file?
You can use IO#grep like this:
File.open('qimo-2.0-desktop.iso', 'r:BINARY') { |io|
  io.grep(/apiKey/) { |m| p io.pos => m } }
The pos is the position where the match ended, so just subtract the string
length.
The above example was run on a 700 MB file; it took around 40s the first
time and 2s subsequently, so disk I/O is the limiting factor in terms of
speed (as usual).
Oh, and also don’t use binary encoding if your file is in another
encoding.
2010/7/1 Michael F. [email protected]:
io.grep(/apiKey/){|m| p io.pos => m } }
The pos is the position the match ended, so just subtract the string length.
The above example was a file with 700mb, took around 40s the first
time, 2s subsequently, so disk I/O is the limiting factor in terms of
speed (as usual).
If you only need to know whether the string occurs in the file you can
do
found = File.foreach("foo").any? { |line| /apiKey/ =~ line }
This will stop searching as soon as the sequence is found.
“fgrep -l foo” is likely faster.
Kind regards
robert
Thanks.
This seems to be pretty much the best logic for me; however, it takes a
good 20 minutes to scan a 2 GB file.
Any ideas?
Thanks
On Thu, Jul 1, 2010 at 7:03 AM, Robert K.
[email protected] wrote:
The pos is the position the match ended, so just subtract the string length.
The above example was a file with 700mb, took around 40s the first
time, 2s subsequently, so disk I/O is the limiting factor in terms of
speed (as usual).
If you only need to know whether the string occurs in the file you can do
found = File.foreach("foo").any? { |line| /apiKey/ =~ line }
This will stop searching as soon as the sequence is found.
"fgrep -l foo" is likely faster.
irb> `fgrep -l waters /usr/share/dict/words`.size > 0
=> true
irb> `fgrep -l watershed /usr/share/dict/words`.size > 0
=> true
irb> `fgrep -l watershedz /usr/share/dict/words`.size > 0
=> false
irb> `fgrep -ob waters /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> ["153088", "153102", "204143", "234643", "472357", "856441",
"913606", "913613", "913623", "913635", "913646", "913656", "913668",
"913679", "913690", "913703"]
irb> `fgrep -ob watershed /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> ["913613", "913623", "913635"]
irb> `fgrep -ob watershedz /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> []
Stuart C. wrote:
Hey all,
Could anyone advise me on a fast way to search a single, but very large
file (1Gb) quickly for a string of text? Also, is there a library to
identify the file offset this string was found within the file?
A fast way is to do it in C.
Here are a few other helpers, though:
Ruby 1.9 has faster regexes
boost regexes: michaeledgar/ruby-boost-regex on GitHub wraps Boost::Regex in a Ruby binding (you
could probably optimize it more than it currently is, as well…)
Rubinius also might help.
Also make sure to open your file in binary mode if you’re on 1.9. That
reads much faster. If that’s an option, anyway.
GL.
-rp
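As a rough sketch of the binary-mode suggestion: instead of going line by line, you can read the file in fixed-size binary chunks and search each chunk for a plain string, carrying over the last few bytes so a match that straddles two chunks is not missed. The method name `match_offsets_chunked` and the 1 MB default chunk size are made up for illustration, and whether this is actually faster on a given file and Ruby version is something you would have to measure.

```ruby
# Hypothetical helper (not from any library): returns the byte offsets of
# every occurrence of `needle` in the file at `path`, reading the file in
# fixed-size binary chunks rather than line by line.
def match_offsets_chunked(path, needle, chunk_size = 1 << 20)
  needle = needle.b                    # compare raw bytes
  return [] if needle.empty?

  overlap = needle.bytesize - 1        # bytes that may begin a split match
  offsets = []
  buffer = ''.b
  buffer_start = 0                     # file offset of buffer's first byte

  File.open(path, 'rb') do |io|
    while (chunk = io.read(chunk_size))
      buffer << chunk
      from = 0
      # On a binary string, String#index returns a byte index.
      while (i = buffer.index(needle, from))
        offsets << buffer_start + i
        from = i + 1                   # also report overlapping matches
      end
      # Keep only the tail that could still start a match completed by the
      # next chunk; it is too short to contain a whole match by itself.
      keep = [overlap, buffer.bytesize].min
      buffer_start += buffer.bytesize - keep
      buffer = buffer.byteslice(buffer.bytesize - keep, keep)
    end
  end
  offsets
end
```

Something like `match_offsets_chunked('qimo-2.0-desktop.iso', 'apiKey')` would then give you an array of integer offsets. It only handles literal strings, not regexes.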
Michael F. wrote:
io.grep(/apiKey/){|m| p io.pos => m } }
The pos is the position the match ended
Actually, pos will be the position of the end of the line on which the
match was found, because #grep works line by line.
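Given that, one way to get the real offset of each match while still reading line by line is to keep track of where each line starts and add the match's index within the line. This is only a sketch: `match_offsets_by_line` is a made-up name, it searches for a literal string rather than a regex, and it will not find a string that contains a newline.

```ruby
# Hypothetical helper (not from any library): returns the byte offsets of
# every occurrence of `needle` in the file at `path`, scanning line by line.
def match_offsets_by_line(path, needle)
  needle = needle.b                  # compare raw bytes
  return [] if needle.empty?

  offsets = []
  line_start = 0                     # byte offset where the current line begins

  File.open(path, 'rb') do |io|
    io.each_line do |line|
      from = 0
      # In binary mode, String#index returns a byte index within the line.
      while (i = line.index(needle, from))
        offsets << line_start + i
        from = i + 1                 # also report overlapping matches
      end
      line_start += line.bytesize    # includes the trailing newline, if any
    end
  end
  offsets
end
```

Note that a file with no newlines at all (quite possible for an ISO image) becomes one huge "line" here, so the whole thing may end up in memory.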