Pattern matching and array methods

dubstep · April 27, 2011, 6:07am

I have a text file that is structured like so:

1:1 abcdefg
1:2 abcdefg
1:3 abcdefg
1:4 abcdefg
1:5 abcdefg

I would like to be able to print out a subset of the file, i.e. print the
line beginning with 1:2 through the line beginning with 1:4.

So far, I’ve started with this:

lines = File.readlines("file.txt")

This puts each line of the text file into an array, so the lines array
looks like this:

lines[0] is 1:1 abcdefg
lines[1] is 1:2 abcdefg
lines[2] is 1:3 abcdefg
etc.

If I want to print out the lines that start with 1:2 through 1:4, how
should I proceed? Some of the text files won’t be “aligned”, in that
lines[0] won’t always be 1:1. If a user would like to print the lines
containing 3:9 - 3:31, how can I scan each line of the array and pattern
match the boundaries (3:9 - 3:31 in this example)? What array methods
are available to me?

Thanks in advance for any information that could point me in the right
direction.

send · April 27, 2011, 6:28am

On Tue, Apr 26, 2011 at 11:07 PM, Mfer D. wrote:

etc.

–
Posted via http://www.ruby-forum.com/.

I suppose you could use a flip flop… or awk

$ cat file.txt
1:1 abcdefg
1:2 abcdefg
1:3 abcdefg
1:4 abcdefg
1:5 abcdefg

$ ruby -e '
File.foreach ARGV.first do |line|
  puts line if line.start_with?("1:2")..line.start_with?("1:4")
end
' file.txt
1:2 abcdefg
1:3 abcdefg
1:4 abcdefg

$ awk '$1 == "1:2", $1 == "1:4"' file.txt
1:2 abcdefg
1:3 abcdefg
1:4 abcdefg

send · April 27, 2011, 8:33am

Something like this should do it:

lines.each {|l| puts l if l =~ /^(1:2)|(1:3)|(1:4)/}

–
Thanks & Regards,
Dhruva S. http://dhruvasagar.net

send · April 27, 2011, 9:57am

Dhruva S. wrote in post #995263:

Something like this should do it:

lines.each {|l| puts l if l =~ /^(1:2)|(1:3)|(1:4)/}

That’s a poor answer, because your regexp isn’t anchored properly. It
would match “5:6 abc1:3def” and “1:23 foobar” for example.
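
A quick way to check that claim, as a minimal sketch. The constant names LOOSE and STRICT, and the tighter pattern itself, are illustrative assumptions and not something posted in this thread:

```ruby
# The pattern from the quoted post: "^" binds only to the first alternative.
LOOSE = /^(1:2)|(1:3)|(1:4)/

# One possible tighter pattern (an assumption for illustration): the anchor
# applies to the whole alternation, and \b requires the minor number to end.
STRICT = /^1:(?:2|3|4)\b/

samples = ["1:2 abcdefg", "5:6 abc1:3def", "1:23 foobar"]

samples.each do |s|
  puts "#{s.inspect}: loose=#{LOOSE.match?(s)} strict=#{STRICT.match?(s)}"
end
```

Only the first sample should pass the stricter pattern, while the loose one accepts all three.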

I suggest using the regexp to parse the line, then using numeric
testing. This makes it easier to solve the other example of 3:9 to 3:31:

lines.each do |line|
  if line =~ /^(\d+):(\d+)/
    major, minor = $1.to_i, $2.to_i
    puts line if major == 3 and (9..31).include?(minor)
  end
end

Note that you don’t need to read the whole file in at once using
readlines; you can read and process it one line at a time. This lets it
work on huge files which are too big to fit into RAM.

File.open("…") do |file|
  file.each_line do |line|
    if line =~ … as before
      …
    end
  end
end
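
For reference, a self-contained sketch of the line-at-a-time version. The helper name each_tagged and the sample data are made up for illustration, and a StringIO stands in for the real file:

```ruby
require "stringio"

# Yields each line whose leading "major:minor" tag has the given major
# number and a minor number inside minor_range. Only one line is held in
# memory at a time. `io` can be anything with each_line (File, $stdin, StringIO).
def each_tagged(io, major, minor_range)
  io.each_line do |line|
    next unless line =~ /^(\d+):(\d+)/
    yield line if $1.to_i == major && minor_range.include?($2.to_i)
  end
end

# StringIO stands in for a real file so the sketch runs anywhere.
sample = StringIO.new("3:8 a\n3:9 b\n3:31 c\n3:32 d\n4:9 e\n")
each_tagged(sample, 3, 9..31) { |line| puts line }
```

With a file on disk, the same helper could be called inside File.open("file.txt") { |f| ... }, passing f in place of the StringIO.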

send · April 27, 2011, 7:20am

On Tue, Apr 26, 2011 at 9:07 PM, Mfer D. wrote:

etc.

If I want to print out the lines that start with 1:2 through 1:4, how
should I proceed? Some of the text files won’t be “aligned” in that
line[0] won’t always be 1:1. If a user would like to print the lines
containing 3:9 - 3:31, how can I scan each line of the array and pattern
match the boundaries (3:9 - 3:31 in this example)? What array methods
are available to me.

If you just want to print the selected lines, the array methods
available aren’t the interesting part (Array#each is probably enough).
The interesting part is parsing the tag at the start of each line, for
which you probably want to consider using regular expressions.

send · April 27, 2011, 10:52am

On Wed, Apr 27, 2011 at 9:57 AM, Brian C. wrote:

lines.each do |line|
  if line =~ /^(\d+):(\d+)/
    major, minor = $1.to_i, $2.to_i
    puts line if major == 3 and (9..31).include?(minor)
  end
end

Generalizing a bit more:

lower_major, lower_minor = "3:9".split(":").map {|x| x.to_i}
upper_major, upper_minor = "3:31".split(":").map {|x| x.to_i}
major_range = lower_major..upper_major
minor_range = lower_minor..upper_minor

lines.each do |line|
  if line =~ /^(\d+):(\d+)/
    major, minor = $1.to_i, $2.to_i
    puts line if major_range.include?(major) and
                 minor_range.include?(minor)
  end
end
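
One thing to watch with two independent ranges: a request that crosses a major boundary, say 2:31 - 3:5, gives a minor range of 31..5, which is empty, so nothing is printed. A possible variation, sketched here with made-up helper names (parse_tag, tag_between?, select_between), compares [major, minor] pairs as a whole instead:

```ruby
# Parse "3:9" (or a full line such as "3:9 abcdefg") into [3, 9]; nil if no tag.
def parse_tag(text)
  m = /\A(\d+):(\d+)\b/.match(text)
  m && [m[1].to_i, m[2].to_i]
end

# True when lower <= tag <= upper, comparing major first and then minor.
# Array#<=> compares element by element, which gives that ordering.
def tag_between?(tag, lower, upper)
  (tag <=> lower) >= 0 && (tag <=> upper) <= 0
end

# Returns the lines whose tag lies between the two boundary tags, inclusive.
def select_between(lines, from, to)
  lower, upper = parse_tag(from), parse_tag(to)
  lines.select do |line|
    tag = parse_tag(line)
    tag && tag_between?(tag, lower, upper)
  end
end

lines = ["2:30 a", "2:31 b", "3:1 c", "3:5 d", "3:6 e"]
puts select_between(lines, "2:31", "3:5")
```

This still does not depend on the lines being in order, since every line is tested on its own.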

Jesus.

send · April 27, 2011, 7:34pm

On Wed, Apr 27, 2011 at 05:52:17PM +0900, Jesús Gabriel y Galán wrote:

major, minor = $1.to_i, $2.to_i
puts line if major_range.include?(major) and minor_range.include?(minor)
end
end

This might be a bit overkill, but as an extension of that general idea:

class String
  def parse_labels
    self.split(':').collect {|n| n.to_i }
  end
end

class Array
  def inferior_range(records)
    records.each do |r|
      yield r if Range.new(*self).include? r.parse_labels[1]
    end
  end
end

foo = [
  '1:1 foo',
  '1:2 bar',
  '1:3 baz',
  '1:4 qux',
  '1:5 fee',
  '1:6 fie',
  '1:7 foe',
  '1:8 fum',
]

[2,5].inferior_range(foo) {|rec| puts rec }

. . . and the output of that should be:

1:2 bar
1:3 baz
1:4 qux
1:5 fee

Return value would be:

=> ["1:1 foo", "1:2 bar", "1:3 baz", "1:4 qux", "1:5 fee", "1:6 fie",
"1:7 foe", "1:8 fum"]

It’s a somewhat naive implementation of the idea: it does a little ugly
monkeypatching, fails to validate any data, tightly couples the Array
method with the String method, and so on. Maybe using modules to compose
this stuff would be better, or subclassing somewhere along the way. My
only defense is “This was a fun diversion for a few moments.”
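
For what it's worth, here is a rough sketch of the same idea without reopening String or Array. The module name LabelRange and the method select_minor are invented for this example, and it still does no input validation:

```ruby
module LabelRange
  module_function

  # "1:3 baz" -> [1, 3]; String#to_i ignores the text after the digits.
  def parse_labels(record)
    record.split(":").collect { |n| n.to_i }
  end

  # Returns the records whose minor label lies in low..high (inclusive).
  def select_minor(records, low, high)
    range = low..high
    records.select { |r| range.include?(parse_labels(r)[1]) }
  end
end

foo = ["1:1 foo", "1:2 bar", "1:3 baz", "1:4 qux", "1:5 fee", "1:6 fie"]
LabelRange.select_minor(foo, 2, 5).each { |rec| puts rec }
```

Returning the selected records, rather than yielding and returning the whole input, also makes the result easier to reuse.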

send · April 27, 2011, 8:01pm

On Tue, Apr 26, 2011 at 11:27 PM, Josh C. wrote:

1:5 abcdefg

Actually, you need a regex here, because start_with?("1:2") will also
match "1:23 abcdefg", for example. With a regex you can use \b to mark
the word boundary, and if there can be leading whitespace, a regex can
deal with that too.
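
A small illustration of the difference. The sample strings are made up, and the leading-whitespace pattern is just one way to write it:

```ruby
line = "1:23 abcdefg"

puts line.start_with?("1:2")              # true:  the prefix test accepts the 1:23 line
puts line.match?(/^1:2\b/)                # false: \b needs the number to end after "2"
puts "1:2 abcdefg".match?(/^1:2\b/)       # true
puts "  1:2 abcdefg".match?(/^\s*1:2\b/)  # true:  \s* tolerates leading whitespace
```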

$ ruby -e '(1..40).each { |big| (1..40).each { |small| puts "#{big}:#{small} blah" } }' |
  ruby -e '$stdin.each { |line| puts line if line[/^1:2\b/]..line[/^2:2\b/] }'
1:2 blah
1:3 blah
1:4 blah
1:5 blah
1:6 blah
1:7 blah
1:8 blah
1:9 blah
1:10 blah
1:11 blah
1:12 blah
1:13 blah
1:14 blah
1:15 blah
1:16 blah
1:17 blah
1:18 blah
1:19 blah
1:20 blah
1:21 blah
1:22 blah
1:23 blah
1:24 blah
1:25 blah
1:26 blah
1:27 blah
1:28 blah
1:29 blah
1:30 blah
1:31 blah
1:32 blah
1:33 blah
1:34 blah
1:35 blah
1:36 blah
1:37 blah
1:38 blah
1:39 blah
1:40 blah
2:1 blah
2:2 blah

I don’t really understand why everyone else is parsing the numbers.
Perhaps they assume these lines might not be in order?

On Tue, Apr 26, 2011 at 11:07 PM, Mfer D. wrote:

If a user would like to print the lines
containing 3:9 - 3:31, how can I scan each line of the array and pattern
match the boundaries (3:9 - 3:31 in this example)? What array methods
are available to me.

For custom boundaries, you can just interpolate them into the regex:

$ ruby -e '(1..40).each { |big| (1..40).each { |small| puts "#{big}:#{small} blah" } }' |
  ruby -e '$stdin.each { |line| puts line if line[/^#{ARGV[0]}\b/]..line[/^#{ARGV[1]}\b/] }' 3:9 3:31
3:9 blah
3:10 blah
3:11 blah
3:12 blah
3:13 blah
3:14 blah
3:15 blah
3:16 blah
3:17 blah
3:18 blah
3:19 blah
3:20 blah
3:21 blah
3:22 blah
3:23 blah
3:24 blah
3:25 blah
3:26 blah
3:27 blah
3:28 blah
3:29 blah
3:30 blah
3:31 blah
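
If the one-liner is hard to read, the same behaviour can be spelled out with an ordinary boolean in place of the flip-flop. This is only a sketch: slice_between is a made-up name, and the Regexp.escape call is an addition of mine so that a boundary string containing regex metacharacters is matched literally:

```ruby
# Collects lines from the first one tagged `first` through the next one
# tagged `last`, inclusive. The on/off state that the flip-flop keeps
# implicitly lives in the local variable `inside`.
def slice_between(lines, first, last)
  start_re = /^#{Regexp.escape(first)}\b/
  stop_re  = /^#{Regexp.escape(last)}\b/
  inside = false

  lines.each_with_object([]) do |line, out|
    inside = true if !inside && line.match?(start_re)
    next unless inside
    out << line
    inside = false if line.match?(stop_re)
  end
end

lines = (1..3).flat_map { |big| (1..12).map { |small| "#{big}:#{small} blah" } }
puts slice_between(lines, "1:11", "2:2")
```

Like the flip-flop, this relies on the lines being in order; it switches on at the first boundary and off at the second, without comparing numbers.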

I’m reading these line by line from the file, which is the most
efficient approach (if your file is enormous, do you really want to read
it all into an array?). The interface to an array is exactly the same,
though: instead of iterating over the file, you iterate over the array.
Just change $stdin.each to $stdin.readlines.each; everything works the
same, but uses an array now.

$ ruby -e '(1..40).each { |big| (1..40).each { |small| puts "#{big}:#{small} blah" } }' |
  ruby -e '$stdin.readlines.each { |line| puts line if line[/^#{ARGV[0]}\b/]..line[/^#{ARGV[1]}\b/] }' 3:9 3:31
3:9 blah
3:10 blah
3:11 blah
3:12 blah
3:13 blah
3:14 blah
3:15 blah
3:16 blah
3:17 blah
3:18 blah
3:19 blah
3:20 blah
3:21 blah
3:22 blah
3:23 blah
3:24 blah
3:25 blah
3:26 blah
3:27 blah
3:28 blah
3:29 blah
3:30 blah
3:31 blah

send · April 27, 2011, 8:30pm

Mfer D. wrote in post #995245:

I have a text file that is structured like so:

1:1 abcdefg
1:2 abcdefg
1:3 abcdefg
1:4 abcdefg
1:5 abcdefg

I would like to be able to print out a subset of the file ie: print the
line beginning with 1:2 through the line beginning with 1:4

If you do some work to save the lines in an easily accessible structure,
you can make the lookup much easier:

lines = [
  '1:1 xxxxxx',
  '1:2 xxxxxx',
  '1:3 xxxxxx',
  '1:4 xxxxxx',
  '1:5 xxxxxx',
  '2:1 xxxxxx',
  '2:2 xxxxxx',
  '2:3 xxxxxx',
  '2:4 xxxxxx',
  '2:5 xxxxxx'
]

#Create a hash whose non-existent keys
#are automatically assigned an empty array:
h = Hash.new {|hash, key| hash[key] = []}

lines.each do |line|
  numbers, str = line.split(' ', 2)
  key, index = numbers.split(':')
  h[key][index.to_i] = line
  #If h[key] does not exist it will automatically
  #be assigned an empty array, which you can then
  #index into.
end

target = '2:2 - 2:5'
start, stop = target.split(/\s* - \s*/xms)
key1, index1 = start.split(':')
key2, index2 = stop.split(':')

index1, index2 = index1.to_i, index2.to_i
p h[key1][index1..index2]

--output:--
["2:2 xxxxxx", "2:3 xxxxxx", "2:4 xxxxxx", "2:5 xxxxxx"]

Note that the code above will not cross key boundaries, for instance
‘2:2 - 3:5’. You can modify the code to do that.
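
One possible way to make that modification, as a sketch only. It assumes integer keys instead of the string keys used above, and the helper names build_index and lookup are invented here:

```ruby
# Index lines as table[major][minor] = line, with integer keys.
def build_index(lines)
  table = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    tag, _rest = line.split(" ", 2)
    major, minor = tag.split(":").map(&:to_i)
    table[major][minor] = line
  end
  table
end

# Look up "a:b - c:d", walking every major number from a to c.
def lookup(table, target)
  start, stop = target.split(/\s*-\s*/)
  major1, minor1 = start.split(":").map(&:to_i)
  major2, minor2 = stop.split(":").map(&:to_i)

  (major1..major2).flat_map do |major|
    next [] unless table.key?(major)     # do not create empty entries
    rows = table[major]
    from = major == major1 ? minor1 : 0
    to   = major == major2 ? minor2 : rows.length - 1
    (rows[from..to] || []).compact       # compact drops unused slots (nil)
  end
end

lines = (1..2).flat_map { |big| (1..5).map { |small| "#{big}:#{small} xxxxxx" } }
table = build_index(lines)
p lookup(table, "1:4 - 2:2")
```

The first and last major numbers are sliced at the requested minor numbers, and any major numbers in between contribute all of their lines.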

As others mentioned, if your file is 1 Terabyte in size, then you are
going to need at least that much RAM to read the whole file into
memory. On the other hand, if you have 4 GB of RAM, and your file is 1
GB in size, then you can easily read the whole file into memory. The
advantage of reading the whole file into memory is that the lookups will
be much faster.