Inconsistent behaviour when working with a string

Hi List,

I’m writing a supposedly simple program to help me sort some files into
a different data structure.
I have ~2 million files that look either like this
access1202241814.merged.log.bz2
or this
access_denied1202211457.merged.log.bz2

Though over 90% are of the first type, I’d like to be able to handle both
at the same time as this will eventually form part of a larger log
management program.

This is what I currently have

$months = ['01-January', '02-February', '03-March', '04-April',
           '05-May', '06-June', '07-July', '08-August', '09-September',
           '10-October', '11-November', '12-December']
Dir.chdir("/ruby-scripts/archive/logs/misc/")
Dir.foreach(".") do |file|
  fileint = file[/\d{10,}/]
  year = fileint[0,2]
  month = $months[fileint[2,2].to_i-1]
  puts "year = #{year} month = #{month} and file was #{file}"
end

this will return
./archive.rb:7:in `block in <main>': undefined method `[]' for
nil:NilClass (NoMethodError)

I don’t understand why though as the example below works fine, but will
only match access1202241814.merged.log.bz2

$months = ['01-January', '02-February', '03-March', '04-April',
           '05-May', '06-June', '07-July', '08-August', '09-September',
           '10-October', '11-November', '12-December']
Dir.chdir("/ruby-scripts/archive/logs/misc/")
Dir.foreach(".") do |file|
  year = file[0,2]
  month = $months[file[2,2].to_i-1]
  puts "year = #{year} month = #{month} and file was #{file}"
end

The only difference between the two is that in the first one I’ve used a
regex to strip out all the extraneous text.
I’ve tried this in Ruby 1.9.2 and 1.8.7.

Can anyone explain why the first one is breaking? And if this is expected
behaviour, is there a way I can do this so it will cope with both file
types at the same time?

Thanks for any pointers

Regards,

Tris


This email and any files transmitted with it are confidential
and intended solely for the use of the individual or entity
to whom they are addressed. If you have received this email
in error please notify [email protected]

The views expressed within this email are those of the
individual, and not necessarily those of the organisation


Hi,

The Dir.foreach iterator also yields the current directory ‘.’ and the
parent directory ‘..’, typically as the first entries. If you don’t skip
these cases, the block will raise an error on the first run:

If file is ‘.’, then file[/\d{10,}/] is nil (there aren’t any digits).
And if fileint is nil, then fileint[0,2] will fail, because nil doesn’t
have a [] method.
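
To see this without touching the file system, here is a minimal sketch that applies the same regex to '.', '..' and the two file names from the original question. The variable names are only illustrative.

```ruby
# String#[] with a regex returns the matched text, or nil when nothing matches.
names = [".", "..", "access1202241814.merged.log.bz2",
         "access_denied1202211457.merged.log.bz2"]

results = names.map { |name| name[/\d{10,}/] }

names.zip(results).each do |name, digits|
  puts "#{name.inspect} => #{digits.inspect}"
end
```

The first two entries give nil, which is exactly the value that fileint[0,2] is then called on.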

You should generally check the file parameter before processing it.
Otherwise, you will always run into trouble if there are any entries
that don’t match the pattern.

For example, you could write

Dir.foreach(".") do |file|
  if file =~ /access(?:_denied)?(\d+)\.merged\.log\.bz2/
    timestamp = $1
    year, month =
      timestamp[0, 2], timestamp[2, 2]
    month = $months[month.to_i - 1]
    puts "year = #{year} month = #{month} and file was #{file}"
  end
end

You only do the processing if the file parameter matches the pattern you
described earlier. While checking this, you save the timestamp substring
in the first regex group (accessible by $1). You can then extract the
year and month from the timestamp.
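
The same idea can be tried on plain strings, which makes it easy to check against both kinds of file name. This is only a sketch: parse_log_name and MONTHS are hypothetical names, and the pattern is anchored and requires at least ten digits, which is an assumption based on the two sample names in the question.

```ruby
MONTHS = %w[01-January 02-February 03-March 04-April 05-May 06-June
            07-July 08-August 09-September 10-October 11-November
            12-December]

# Hypothetical helper: returns [year, month_label], or nil when the
# name does not look like one of the log files.
def parse_log_name(name)
  match = /\Aaccess(?:_denied)?(\d{10,})\.merged\.log\.bz2\z/.match(name)
  return nil unless match

  timestamp = match[1]                   # first capture group, same as $1
  month_index = timestamp[2, 2].to_i - 1
  return nil unless (0..11).include?(month_index)

  [timestamp[0, 2], MONTHS[month_index]]
end

["access1202241814.merged.log.bz2",
 "access_denied1202211457.merged.log.bz2",
 "."].each do |name|
  puts "#{name} => #{parse_log_name(name).inspect}"
end
```

Returning nil for anything unexpected lets the caller decide whether to skip the entry or report it.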

Your second example I don’t really understand. You now assume that the
file names begin with the timestamp, although the names should begin
with “access” (that’s what you said earlier). While this example won’t
throw any errors, it will generate nonsense output like “year = ac …”
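
To make the nonsense output concrete, here is a small sketch of what the second example computes for one of the sample file names. The local variable months stands in for $months.

```ruby
months = %w[01-January 02-February 03-March 04-April 05-May 06-June
            07-July 08-August 09-September 10-October 11-November
            12-December]
file = "access1202241814.merged.log.bz2"

year = file[0, 2]                  # "ac": the first two characters, not digits
month_index = file[2, 2].to_i - 1  # "ce".to_i is 0, so the index becomes -1
month = months[month_index]        # a negative index counts from the end

puts "year = #{year} month = #{month} and file was #{file}"
```

No exception is raised because String#to_i returns 0 for non-numeric text and an index of -1 is valid for an array, so the result is silently wrong.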

Jacques

@ Bigmac T.:

Your code won’t work with the current version of Ruby (1.9), because the
String class no longer includes the Enumerable module. This makes
data.map fail.

But the data.map isn’t necessary anyway. In Ruby 1.8, you can simply
leave it out (the string itself already is enumerable). Or you can
replace the for loop with the each_line iterator to make the code work
in both Ruby 1.8 and Ruby 1.9:

data.each_line do |ln|
  # process ln here
end
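
A minimal sketch of the each_line approach, using a literal multi-line string in place of the output of a shell command (the sample names are made up):

```ruby
# Stand-in for text returned by a shell command, one name per line.
data = "a.conf\nb.conf\nc.conf\n"

names = []
data.each_line do |ln|
  names << ln.chomp  # each_line keeps the trailing newline, so strip it
end

puts names.inspect
```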

Jan E is on the right track–I’d suggest writing a simple script that
just processes the file names and spits out the bad names. Then proceed
from there:

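require 'date'

Dir.foreach(".") do |file|
  next if file == "." or file == ".."
  fileint = file[/\d{10,}/]
  if fileint.nil?
    puts "no timestamp in #{file}"
    next
  end
  year = 2000 + fileint[0,2].to_i  # two-digit year, assumed to mean 20xx
  month = fileint[2,2].to_i
  # a month outside 1..12 would make Date.new raise, so report it here
  if month < 1 or month > 12
    puts "bad month in #{file}"
    next
  end
  file_date = Date.new(year, month, 1)
  old = Date.new(2000, 1, 1)
  now = Date.today
  if file_date < old or file_date > now
    puts "year = #{year} month = #{month} and file was #{file}"
  end
end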

Best,
Eric

So, I want to share a quick script with you. I’m not sure if this will
help, but it might give you ideas… if you’re a Linux user

data=`ls /etc/ | grep conf`
for file_name in data.map
  # do something with file name
  puts file_name
  sleep 1

  # file name read, write, move, delete etc
end

or for you dir example

data=`ls /ruby-scripts/archive/logs/misc/ | grep access_`
for file_name in data.map
  # do something with file name
  puts file_name
  sleep 1
end
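
For comparison, roughly the same listing can be produced without shelling out, which also avoids depending on ls and grep being installed. This is only a sketch; access_files is a hypothetical helper name.

```ruby
# Hypothetical helper: entries of dir whose names contain "access_",
# roughly what `ls dir | grep access_` lists.
def access_files(dir)
  Dir.entries(dir).grep(/access_/).sort
end

access_files(".").each do |file_name|
  puts file_name
end
```

Dir.entries returns an Array, so this form does not rely on String being enumerable.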

Jan E. wrote in post #1050236:

Hi,

The Dir.foreach iterator also yields the current directory ‘.’ and the
parent directory ‘..’, typically as the first entries. If you don’t skip
these cases, the block will raise an error on the first run:

If file is ‘.’, then file[/\d{10,}/] is nil (there aren’t any digits).
And if fileint is nil, then fileint[0,2] will fail, because nil doesn’t
have a [] method.

You should generally check the file parameter before processing it.
Otherwise, you will always run into trouble if there are any entries
that don’t match the pattern.

For example, you could write

Dir.foreach(".") do |file|
  if file =~ /access(?:_denied)?(\d+)\.merged\.log\.bz2/
    timestamp = $1
    year, month =
      timestamp[0, 2], timestamp[2, 2]
    month = $months[month.to_i - 1]
    puts "year = #{year} month = #{month} and file was #{file}"
  end
end

Absolutely! I’d go just a bit further and make the matching more
rigorous and also extract all relevant data in one go:

Dir.foreach(dir) do |file|
  if file =~ /\Aaccess(?:_denied)?(\d{2})(\d{2})\d{6}\.merged\.log\.bz2\z/
    year = $1.to_i
    month = $2.to_i
    puts "year = #{year} month = #{$months[month - 1]} and file was #{file}"
  end
end

Note: if there is a hierarchy of folders then also Find.find or
Pathname.find could be used.
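
A rough sketch of the Find.find variant, assuming the same file name pattern as above. matching_logs and PATTERN are hypothetical names, and the root directory is whatever you pass in.

```ruby
require 'find'

PATTERN = /\Aaccess(?:_denied)?(\d{2})(\d{2})\d{6}\.merged\.log\.bz2\z/

# Hypothetical helper: walks root recursively and collects
# [year, month, path] for every file whose name matches PATTERN.
def matching_logs(root)
  found = []
  Find.find(root) do |path|
    next unless File.file?(path)  # skip directories and other entries
    if File.basename(path) =~ PATTERN
      found << [$1.to_i, $2.to_i, path]
    end
  end
  found
end

# Example call (the path is the one from the original question):
# matching_logs("/ruby-scripts/archive/logs/misc/").each { |y, m, p| puts "#{y} #{m} #{p}" }
```

Find.find yields full paths rather than bare names, which is why the pattern is matched against File.basename(path).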

Kind regards

robert

On 06/03/2012 09:42, Robert K. wrote:

   timestamp[0, 2], month[2, 2]

   /\Aaccess(?:_denied)?(\d{2})(\d{2})\d{6}\.merged\.log\.bz2\z/

Kind regards

robert

Thank you, that makes a lot of sense. I’d even noticed the . and ..
directory pointers earlier on and knew I’d need to do something about
them, but for some reason it did not occur to me that this was the cause
of the issue I was having at the outset.

Best regards,

Tris

