Regex. How to return multiple lines

ther · August 8, 2008, 12:56pm

I have a text file containing some data. Rows are normally delimited with
a newline character. I assume the file will be read into memory in one chunk.

Is it possible to return all lines containing some string with a regex,
and if so, how?

My first thought was to read the file into an array and process each line,
but I guess that, if possible, operating on a single string would be faster.

by
TheR

ther · August 8, 2008, 1:09pm

13:07:47 Temp$ ./l.rb
["aaa\n", "a\n", "aaa\n", "a\n"]
13:07:53 Temp$ cat x
aaa
b
c
dd
a
aaa
s
a
13:07:56 Temp$ cat l.rb
#!/bin/env ruby

c = File.read "x"

p c.grep /a+/

13:08:49 Temp$

There are of course other possible approaches and depending on what
you want to do they might be more efficient.

Kind regards

robert

ther · August 8, 2008, 1:11pm

2008/8/8 Robert K.:

PS: one bit of explanation: my piece of code works because String#each
returns each line individually.
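A note for readers on newer Ruby versions: from Ruby 1.9 onward String is no longer Enumerable, so `c.grep` raises NoMethodError there. A minimal sketch of the same idea, assuming the file content is already held in one string (the sample data mirrors the file `x` shown above):

```ruby
# Sample data mirroring the file "x" from the session above.
content = "aaa\nb\nc\ndd\na\naaa\ns\na\n"

# Called without a block, String#each_line returns an Enumerator,
# so Enumerable#grep can filter the lines with a regex.
matches = content.each_line.grep(/a+/)

p matches
```

With a real file, `content` would simply be the result of `File.read`.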

ther · August 8, 2008, 2:51pm

There are of course other possible approaches and depending on what
you want to do they might be more efficient.

I would basically like to have some kind of full-text search.

Your example finds a 4-letter word in a 7.5 MB file containing 65,000 lines
in 0.13 seconds on a Core 2 Duo at 2.4 GHz and 0.35 seconds on a Pentium III
at 1.3 GHz, which is quite good.
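For anyone who wants to repeat this kind of measurement, here is a rough sketch using the standard Benchmark module. The data is generated in memory and is purely hypothetical (65,000 short lines, far smaller than the 7.5 MB file mentioned above), so the timing will not be comparable to the figures quoted:

```ruby
require "benchmark"

# Hypothetical test data: 65,000 lines, every 1000th one containing "word".
lines = Array.new(65_000) do |i|
  i % 1000 == 0 ? "line #{i} has word in it\n" : "line #{i} filler text\n"
end
content = lines.join

matches = nil
# Benchmark.realtime returns the elapsed wall-clock time in seconds.
elapsed = Benchmark.realtime do
  matches = content.each_line.grep(/word/)
end

puts "Found %d matching lines in %.3f seconds" % [matches.size, elapsed]
```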

by
TheR

ther · August 8, 2008, 3:18pm

If you just want to find all lines with certain words and print them,
I'd prefer the streamed approach, since it works for arbitrarily large
files and does not require the whole file to be in memory:

File.foreach "x.dat" do |line|
  puts line if /a+/ =~ line
end

Of course, using grep or egrep would be even faster.
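If the matching lines are needed as an array rather than printed, the same streaming idea can be sketched with the Enumerator that `File.foreach` returns when called without a block. The temporary file and its contents are made up for the demonstration:

```ruby
require "tempfile"

# Hypothetical sample file, written to a temporary location for the demo.
sample = Tempfile.new("x.dat")
sample.write("aaa\nb\nc\ndd\na\n")
sample.close

# Without a block, File.foreach returns an Enumerator that reads one line
# at a time; only the matching lines end up in the result array.
matches = File.foreach(sample.path).grep(/a+/)
p matches

sample.unlink
```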

Kind regards

robert

ther · August 10, 2008, 3:52pm

Robert K. wrote:

c = File.read "x"

p c.grep /a+/

Why not:
p File.open("x") {|f| f.grep /a+/}
Avoids loading the whole file into memory.

HTH,
Sebastian

ther · August 10, 2008, 4:45pm

The original request stated “I assume that file would be read in one
chunk into memory.” which I choose to respect. But see my disclaimer at
the end and also my other reply that hinted at this.

Cheers

robert

ther · August 10, 2008, 5:06pm

Robert K. wrote:

The original request stated “I assume that file would be read in one
chunk into memory.” which I choose to respect.

Oh, sorry, I did not notice that.

ther · August 11, 2008, 8:55am

Thank you very much, guys.

As I wrote, I assume the data will reside in memory for the life of the
program. Since the search argument will be entered by the user, the final
result looks close to this:

what = get_from_input()
r = Regexp.new(what, true)
a = s.grep(r)
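One thing to watch with user-entered search terms: characters such as `.` or `(` are regex syntax, and a malformed pattern raises RegexpError. A possible sketch of how that could be handled, assuming a hypothetical helper named `build_search_regexp` (not part of the code above) and modern Ruby, where lines are obtained via `each_line`:

```ruby
# Hypothetical helper: build a case-insensitive Regexp from user input.
# With literal: true the input is escaped, so "." or "(" match as plain
# text instead of being interpreted as regex syntax.
def build_search_regexp(input, literal: true)
  source = literal ? Regexp.escape(input) : input
  Regexp.new(source, Regexp::IGNORECASE)
rescue RegexpError => e
  warn "Invalid search pattern: #{e.message}"
  nil
end

# Made-up data standing in for the in-memory file content.
data = "Price: 3.50\nprice: 3x50\nnothing here\n"

# Literal search: the dot only matches a real dot.
literal_hits = data.each_line.grep(build_search_regexp("3.50"))
p literal_hits

# Pattern search: the input is used as a regex, ignoring case.
pattern_hits = data.each_line.grep(build_search_regexp("PRICE.*50", literal: false))
p pattern_hits

# A malformed pattern is reported and yields nil instead of crashing.
bad = build_search_regexp("(unclosed", literal: false)
p bad
```

Whether to escape the input depends on whether users are expected to type plain words or real patterns.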

by
TheR

ther · August 8, 2008, 3:55pm

On Aug 8, 6:49 am, Damjan R. wrote:

I would basically like to have some kind of full-text search.

In case it helps, here is my own Ruby script for searching for files
with names and/or contents matching a particular regex:

Slim2:~ phrogz$ cat /usr/local/bin/findfile
#!/usr/bin/env ruby

USAGE = <<ENDUSAGE
Usage:
   findfile [-d max_depth] [-a] [-c] [-i] name_regexp [content_regexp]
   -d,--depth       the maximum depth to recurse to (defaults to no limit)
   -a,--showall     with content_regexp, show every match per file
                    (defaults to only show the first-match per file)
   -c,--usecase     with content_regexp, use case-sensitive matching
                    (defaults to case-insensitive)
   -i,--includedirs also find directories matching name_regexp
                    (defaults to files only; not with content_regexp)
   -h,--help        show some help examples
ENDUSAGE

EXAMPLES = <<ENDEXAMPLES

Examples:
  findfile foo
    Print the path to all files with 'foo' in the name

  findfile -i foo
    Print the path to all files and directories with 'foo' in the name

  findfile js$
    Print the path to all files whose name ends in "js"

  findfile js$ vector
    Print the path to all files ending in "js" with "Vector" or "vector"
    (or "vEcTOr", "VECTOR", etc.) in the contents, and print some of the
    first line that has that content.

  findfile js$ -c Vector
    Like above, but must match exactly "Vector"
    (not 'vector' or 'VECTOR').

  findfile . vector -a
    Print the path to every file with "Vector" (any case) in it somewhere
    printing every line (with line numbers) with that content.

  findfile -d 0 .
    Print the path to every file that is in the current directory.

  findfile -d 1 .
    Print the path to every file that is in the current directory or any
    of its child directories (but no subdirectories of the children).

ENDEXAMPLES

ARGS = {}
UNFLAGGED_ARGS = [ :name_regexp, :content_regexp ]
next_arg = UNFLAGGED_ARGS.first
ARGV.each{ |arg|
  case arg
    when '-d','--depth'
      next_arg = :max_depth
    when '-a','--showall'
      ARGS[:showall] = true
    when '-c','--usecase'
      ARGS[:usecase] = true
    when '-i','--includedirs'
      ARGS[:includedirs] = true
    when '-h','--help'
      ARGS[:help] = true
    else
      if next_arg
        if next_arg==:max_depth
          arg = arg.to_i + 1
        end
        ARGS[next_arg] = arg
        UNFLAGGED_ARGS.delete( next_arg )
      end
      next_arg = UNFLAGGED_ARGS.first
  end
}

if ARGS[:help] or !ARGS[:name_regexp]
  puts USAGE
  puts EXAMPLES if ARGS[:help]
  exit
end

class Dir
  def self.crawl(path,max_depth=nil,include_directories=false,depth=0,&blk)
    return if max_depth && depth > max_depth
    begin
      if File.directory?( path )
        yield( path, depth ) if include_directories
        # Skip the "." and ".." entries
        files = Dir.entries( path ).select{ |f| true unless f=~/^\.{1,2}$/ }
        unless files.empty?
          files.collect!{ |file_path|
            Dir.crawl( path+'/'+file_path, max_depth, include_directories, depth+1, &blk )
          }.flatten!
        end
        return files
      else
        yield( path, depth )
      end
    rescue SystemCallError => the_error
      warn "ERROR: #{the_error}"
    end
  end
end

start_time = Time.new
name_match = Regexp.new( ARGS[:name_regexp], true )
content_match = ARGS[:content_regexp] && Regexp.new( ".{0,20}#{ARGS[:content_regexp]}.{0,20}", !ARGS[:usecase] )

file_count = 0
matching_count = 0
Dir.crawl(
  '.',
  ARGS[:max_depth],
  ARGS[:includedirs] && !content_match
){ |file_path, depth|
  if File.split( file_path )[ 1 ] =~ name_match
    if content_match
      if ARGS[:showall]
        shown_file = false
        IO.readlines( file_path ).each_with_index{ |line_text,line_number|
          if match = line_text[content_match]
            unless shown_file
              puts file_path
              matching_count += 1
              shown_file = true
            end
            puts( ( "%5d: " % (line_number+1) ) + match )
          end
        }
        puts " " if shown_file
      elsif IO.read( file_path ) =~ content_match
        puts file_path, " #{$~}", " "
        matching_count += 1
      end
    else
      puts file_path
      matching_count += 1
    end
  end
  file_count += 1
}
elapsed = Time.new - start_time
puts "Found %d file%s (out of %d) in %.2f seconds" % [
  matching_count,
  matching_count==1 ? '' : 's',
  file_count,
  elapsed
]
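
For comparison, the name-matching part of such a crawl can also be sketched with Ruby's standard Find module instead of a hand-written recursion. This is only an illustration, not a replacement for the script above: the helper name `each_matching_file` is hypothetical, the depth is derived by counting "/" in the path (an assumption that holds for the simple paths used here), and the files are created in a temporary directory for the demo:

```ruby
require "find"
require "tmpdir"
require "fileutils"

# Hypothetical helper: yield every file under root whose basename matches
# name_regexp. With max_depth = 0 only files directly in root are visited;
# nil means no depth limit.
def each_matching_file(root, name_regexp, max_depth = nil)
  return enum_for(:each_matching_file, root, name_regexp, max_depth) unless block_given?

  root_depth = root.count("/")
  Find.find(root) do |path|
    depth = path.count("/") - root_depth
    if File.directory?(path)
      # Find.prune skips the current directory and everything below it.
      Find.prune if max_depth && depth > max_depth
    elsif File.basename(path) =~ name_regexp
      yield path
    end
  end
end

# Made-up directory layout for the demonstration.
root = Dir.mktmpdir
FileUtils.mkdir_p(File.join(root, "sub"))
File.write(File.join(root, "vector.js"), "var Vector = {};\n")
File.write(File.join(root, "notes.txt"), "nothing\n")
File.write(File.join(root, "sub", "matrix.js"), "var Matrix = {};\n")

top_only  = each_matching_file(root, /js$/i, 0).map { |f| File.basename(f) }
all_files = each_matching_file(root, /js$/i).map { |f| File.basename(f) }.sort
p top_only
p all_files

FileUtils.remove_entry(root)
```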