Regexing a file's contents without reading the whole thing?

rogerdpack · November 30, 2009, 9:32pm

I see that it is currently possible to parse through a file without
reading the whole thing into RAM, a la

a = File.open('a', 'r')
a.each_line { |line|
  if line =~ /some regex/
    …
  end
}

But what if I want to do something like

a = File.read('a').scan(/some regex/)

without reading the whole file into RAM? Is that possible?

Thanks.
-r

rogerdpack · November 30, 2009, 9:55pm

Roger P. wrote:

> But what if I want to do something like
> a = File.read('a').scan /some regex/
>
> is that possible?

File.open('/usr/share/dict/words').grep(/ruby/i)
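
One thing to keep in mind with grep: it returns the whole matching lines, not just the matched text the way String#scan does. A small sketch of the difference, using a StringIO with made-up contents as a stand-in for the file (the variable names are only for illustration):

```ruby
require 'stringio'

# Made-up sample data standing in for a real file.
words = "Ruby\nperl\nrubyist\n"

# grep walks the IO line by line and keeps every line that matches.
matching_lines = StringIO.new(words).grep(/ruby/i)

# To collect only the matched text, scan each line as it is read.
matched_text = StringIO.new(words).each_line.flat_map { |line| line.scan(/ruby/i) }

p matching_lines # whole lines, trailing newlines included
p matched_text   # only the text the regex matched
```

Either way only one line is held in memory at a time, so this only helps when matches never span lines.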

rogerdpack · December 1, 2009, 1:59pm

2009/11/30 Roger P.:

> But what if I want to do something like
> a = File.read('a').scan /some regex/
>
> is that possible?

If you know that matches will never cross line breaks you can do

a = []
File.foreach("a") do |line|
  line.scan(/regex/) do |m|
    a << m
  end

  # alternative to the inner block:
  # a.concat(line.scan(/regex/))
end

If matches can cross line breaks the whole story becomes more
complicated and your solution with File.read is probably the simplest
way to do it (if files aren’t too large).
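
For files that are too large for File.read, one possible approach is to read fixed-size chunks and keep an overlap buffer. The following is only a rough sketch, not a tested library: scan_in_chunks and its chunk_size and max_match_len parameters are hypothetical names, and it assumes that no match is longer than max_match_len and that the pattern does not rely on anchors or lookbehind. It reads in binary mode, so it is only safe as written for ASCII-compatible patterns, and it returns whole-match strings (capture groups are ignored, unlike String#scan).

```ruby
# Hypothetical helper: collect all matches of regex in the file at path
# while holding roughly chunk_size + max_match_len bytes in memory.
# Assumes every match is at most max_match_len bytes long.
def scan_in_chunks(path, regex, chunk_size: 64 * 1024, max_match_len: 4096)
  matches = []
  buffer = String.new # empty binary string
  File.open(path, 'rb') do |io|
    until io.eof?
      buffer << io.read(chunk_size)
      at_eof = io.eof?
      # A match starting before this index cannot be cut off by the
      # end of the buffer, given the max_match_len assumption.
      safe_limit = buffer.length - max_match_len
      pos = 0
      while (m = regex.match(buffer, pos))
        # Leave matches near the end for the next round; they may be incomplete.
        break if !at_eof && m.begin(0) >= safe_limit
        matches << m[0]
        # Step past empty matches so the loop always advances.
        pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
      end
      # Drop everything already handled, keep the unsafe tail as overlap.
      keep_from = at_eof ? buffer.length : [pos, safe_limit, 0].max
      buffer.slice!(0, keep_from)
    end
  end
  matches
end
```

Usage would look like scan_in_chunks('a', /some regex/). If matches can be arbitrarily long, this does not work and File.read is still the simpler choice.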

Kind regards

robert

rogerdpack · December 2, 2009, 2:33am

On 11/30/09, Roger P. wrote:

> But what if I want to do something like
> a = File.read('a').scan /some regex/
>
> is that possible?

The library which makes this possible is sequence. I’m coding this
from memory, so I’m likely to get something wrong, but the equivalent
in sequence looks more or less like this:

require 'rubygems'
require 'sequence'
require 'sequence/file'

seq = Sequence.new(File.open('a'))
seq.scan_until(/some regex/)

Keep the following in mind:

- Sequence#scan works like StringScanner#scan, not String#scan.
- The pattern to be matched must have a max length (4k by default, I
  think; it can be changed).
- If your pattern is guaranteed to not contain a newline, you’re better
  off with readline, as robert said.
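
For anyone unsure about the first point, here is a small sketch of the difference using the standard library's StringScanner on an in-memory string. It does not use sequence itself; that Sequence behaves the same way is an assumption based on the note above. The sample text is made up.

```ruby
require 'strscan'

text = 'one 22 three 4444'

# String#scan searches the whole string and returns every match.
all_numbers = text.scan(/\d+/)

scanner = StringScanner.new(text)

# StringScanner#scan only tries to match at the current position,
# so it returns nil here because the text starts with "one".
anchored_miss = scanner.scan(/\d+/)

# scan_until searches forward from the current position and returns
# everything up to and including the first match.
first_number = scanner.scan_until(/\d+/)

# The scan pointer now sits just after that match.
position = scanner.pos
```

So to collect every match with a scanner-style API you would call scan_until in a loop until it returns nil, rather than expecting one call to return an array.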