Extract date from filenames using regex

I have my code which looks like this:

delete= 5 + 2 #escape counting in weekends i.e Sat,Sun
folders = $del_path
puts delete_date = DateTime.now - delete

regexp = Regexp.compile(/(\d{4}\d{2}\d{2})/)

fileData = Struct.new(:name, :size)
deleted_files = []

folders.each do |folder|
  Dir.glob(folder+"/*") do |file|
    match = regexp.match(File.basename(file));
    if match
      file_date = DateTime.parse(match[1])

When my file name is in the format 20080331, for example, the script
runs successfully. However, if the filename has additional characters
added to it, say risk20080331, it raises an error. And I reckon the
line above is the cause.

      size = (File.size(file))/1024
      if delete_date > file_date
        deleted_files << fileData.new(file,size)
        FileUtils.rm_r file
        if File.exist?(file)==false
          puts "Files/Folders deleted: #{file} size: #{size} KB"
        end #if
      end #if
    end #if
  end #do

end #each
end #if

So is there any way I can extract the date, using a regex or whatever
is simpler, so I can compare it with the deletion date and execute the
rm_r command? Thanks!

Hi,

You can use File.mtime(file_name) which will return a Time object.

You can also match with /\d+/ (one or more digits):

("sdf555sadfsdfg")[/\d+/]
=> "555"

But watch out!:

("sdf555sadfs5867dfg")[/\d+/]
=> "555"
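
A minimal sketch of why /\d+/ can pick up the wrong number and how a fixed-width pattern avoids it. The file name is made up for illustration:

```ruby
# Hypothetical file name containing two separate digit runs.
name = "report5_20080331.dat"

# String#[] with a regexp returns only the first match.
first_run = name[/\d+/]      # the first run of digits, here "5"

# Requiring exactly eight digits skips the shorter run.
date_part = name[/\d{8}/]    # "20080331"

# String#scan lists every digit run if you need to inspect them all.
all_runs = name.scan(/\d+/)  # ["5", "20080331"]

puts first_run, date_part, all_runs.inspect
```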

For a nice, object-oriented approach to file manipulation in Ruby, you
might want to check out Pathname in the standard library:
http://www.ruby-doc.org/stdlib/libdoc/pathname/rdoc/index.html
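
If the modification time is an acceptable stand-in for the date in the name, the check could be sketched with Pathname roughly like this. The helper name older_than? and the 7 and 30 day cutoffs are assumptions for illustration, not part of the original script:

```ruby
require 'pathname'
require 'tmpdir'

# Hypothetical helper: true if the file was last modified more than
# `days` days before `now`.
def older_than?(path, days, now = Time.now)
  Pathname.new(path).mtime < now - days * 24 * 60 * 60
end

# Demo on a throwaway file so nothing real is touched.
past_week, past_month = Dir.mktmpdir do |dir|
  file = Pathname.new(dir) + "risk20080331.dat"
  file.open("w") { |f| f.write("x") }

  ten_days_ago = Time.now - 10 * 24 * 60 * 60
  File.utime(ten_days_ago, ten_days_ago, file.to_s)  # backdate the mtime

  # The cutoffs below are assumed values.
  [older_than?(file, 7), older_than?(file, 30)]
end

puts past_week, past_month
```

Note that this compares against when the file was last written, which may differ from the date embedded in its name.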

Dan

Thanks Daniel for your input. I tried using /\d+/, but it also matches
files that have as few as 2 digits in the name, so I decided to use
/(\d\d)(\d\d)(\d\d\d\d)/ instead. That let me run the command on some
files but not all of them, and the following error occurred:

Files/Folders deleted: //sins00114178/mad/Singapore/CubeMorningLDN/HKD_CUBE_risk report_09042008.dat size: 74 KB
Files/Folders deleted: //sins00114178/mad/Singapore/CubeMorningLDN/HKD_CUBE_risk report_10042008.dat size: 81 KB
Files/Folders deleted: //sins00114178/mad/Singapore/CubeMorningLDN/HKD_CUBE_risk report_11042008.dat size: 80 KB
Files/Folders deleted: //sins00114178/mad/Singapore/CubeMorningLDN/HKD_CUBE_risk report_14042008.dat size: 79 KB
Files/Folders deleted: //sins00114178/mad/Singapore/CubeMorningLDN/HKD_CUBE_risk report_15042008.dat size: 77 KB
Files/Folders deleted: //sins00114178/mad/Singapore/CubeMorningLDN/HKD_CUBE_risk report_16042008.dat size: 77 KB
c:/ruby/lib/ruby/1.8/date.rb:1536:in `new_by_frags': invalid date (ArgumentError)
    from c:/ruby/lib/ruby/1.8/date.rb:1583:in `parse'
    from testing.conf.rb:166:in `delFiles'
    from testing.conf.rb:163:in `glob'
    from testing.conf.rb:163:in `delFiles'
    from testing.conf.rb:162:in `each'
    from testing.conf.rb:162:in `delFiles'
    from testing.conf.rb:204

Is there anything wrong with my code that prevents it from deleting all
the files that I want?

OK, now I see the problem. The file that is failing has a number like
this: 16042008. The DateTime.parse method is trying to parse the date
as 1604-20-08, which is obviously an invalid date (month > 12).
There are two solutions to this problem:

1.- Change DateTime.parse to DateTime.strptime, passing a format
that describes where in the string you have the two digits of the day,
the two digits of the month and the four digits of the year. I haven't
been able to put together a quick example, because I can't find a
reference for the format string (any help here appreciated). The doc
refers me to date/format.rb for details and I don't see anything clear
there.

2.- Change the regexp a little bit so you capture the day, the month
and the year in separate groups and create the DateTime using the
three values:

irb(main):011:0> regexp = Regexp.compile(/(\d{4})(\d{2})(\d{2})/)
=> /(\d{4})(\d{2})(\d{2})/
irb(main):012:0> m = regexp.match("20080103asdfasdf")
=> #<MatchData:0xb7c11a6c>
irb(main):014:0> d = DateTime.civil m[1].to_i, m[2].to_i, m[3].to_i
=> #<DateTime: 4908937/2,0,2299161>
irb(main):015:0> d.to_s
=> "2008-01-03T00:00:00+00:00"
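
For names that carry the date as ddmmyyyy, the same idea applies with the groups reordered. A minimal sketch, assuming a file name like the ones in the error output above and plain Date instead of DateTime:

```ruby
require 'date'

# Day, month and year captured as separate groups (ddmmyyyy order).
DDMMYYYY = /(\d{2})(\d{2})(\d{4})/

name = "report_16042008.dat"
m = DDMMYYYY.match(name)

# Date.civil takes year, month, day, so the groups are passed in reverse.
file_date = Date.civil(m[3].to_i, m[2].to_i, m[1].to_i) if m

puts file_date   # 2008-04-16
```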

I think you can apply the above changes to the script and it will work.
Let me know,

Jesus.

Sorry, what is the error? Cause this works for me:

irb(main):001:0> regexp = Regexp.compile(/(\d{4}\d{2}\d{2})/)
=> /(\d{4}\d{2}\d{2})/
irb(main):002:0> match = regexp.match("risk20080331.log")
=> #<MatchData:0xb7ce20f4>
irb(main):003:0> match[1]
=> "20080331"
irb(main):005:0> require 'date'
=> true
irb(main):006:0> DateTime.parse(match[1])
=> #<DateTime: 4909113/2,0,2299161>

So any string that contains 4 digits followed by 2 digits followed by
2 digits will match that regexp, independently of what it has around
the numbers:

irb(main):007:0> regexp.match("12345678")[1]
=> "12345678"
irb(main):008:0> regexp.match("12345678asdfasdf")[1]
=> "12345678"
irb(main):009:0> regexp.match("asdfasdf12345678asdfasdf")[1]
=> "12345678"
irb(main):010:0> regexp.match("asdfasdf12345678")[1]
=> "12345678"
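
One consequence is that the pattern also matches eight digits inside a longer run of digits. If that matters, lookarounds can require that the eight digits stand alone. This is only a sketch: it assumes Ruby 1.9 or later (the 1.8 regexp engine has no lookbehind), and the sample names are made up:

```ruby
# Exactly eight digits, not preceded or followed by another digit.
STANDALONE_DATE = /(?<!\d)\d{8}(?!\d)/

samples = ["risk20080331.log", "id1234567890.log", "20080331"]
samples.each do |s|
  # String#[] returns nil when nothing matches.
  puts "#{s} -> #{s[STANDALONE_DATE].inspect}"
end
```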

Jesus.

require 'date'

str = 'sins00114178'
pattern = /(\d\d)(\d\d)(\d\d\d\d)/

match_obj = pattern.match(str)
puts match_obj[1]

file_date = DateTime.parse(match_obj[1])

--output:--
00
/usr/lib/ruby/1.8/date.rb:1214:in `new_with_hash': invalid date (ArgumentError)
    from /usr/lib/ruby/1.8/date.rb:1258:in `parse'
    from r1test.rb:9

You are right, I overlooked the fact that he had added more parens
in the regexp, so he was passing only two digits to DateTime.parse.
Anyway the changes I proposed should work for him.
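
To make the point concrete: once the pattern has three groups, match[1] holds only the first two digits, while match[0] still holds the whole eight-digit match. A small sketch using one of the file names from the error output:

```ruby
pattern = /(\d\d)(\d\d)(\d\d\d\d)/
match = pattern.match("report_16042008.dat")

whole = match[0]          # entire match: "16042008"
day   = match[1]          # first group only: "16"
parts = match.captures    # ["16", "04", "2008"]

puts whole, day, parts.inspect
```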

Jesus.

Regarding option 1 (DateTime.strptime): after a couple of trial/error
tests this seems to work:

DateTime.strptime "16042008", "%d%M%Y"

So either of the two solutions will work for you.

Jesus.

Hi Jesus,
First of all, thanks for your help!
However, despite using

d = DateTime.civil(match[1].to_i, match[2].to_i, match[3].to_i)
file_date = d.to_s

OR

file_date = DateTime.strptime(match[1], "%d%M%Y")

it still gives me "invalid date" as the error message. But when I run
it in fxri it seems to work fine. This only seems to happen when the
date format is ddmmyyyy; for yyyymmdd there are no problems. Any ideas,
anyone? I have racked my brain but to no avail.

Can you post the smallest example that fails? Do you have files with
different date formats?

Jesus.

delete_date = DateTime.now - delete

regexp = Regexp.compile(/(\d\d)(\d\d)(\d\d\d\d)/)

fileData = Struct.new(:name, :size)
deleted_files = []

folders.each do |folder|
  Dir.glob(folder+"/*") do |file|
    puts match = regexp.match(File.basename(file))
    if match
      file_date = DateTime.strptime(match[1] , fmt='%d%M%Y')
      size = (File.size(file))/1024
      if delete_date > file_date
        deleted_files << fileData.new(file,size)
        FileUtils.rm_r file
        if File.exist?(file)==false
          puts "Files/Folders deleted: #{file} size: #{size} KB"
        end #if
      end #if
    end #if
  end #do
end #each

end #if
end #delFiles

It'll show this error:
c:/ruby/lib/ruby/1.8/date.rb:1536:in `new_by_frags': invalid date (ArgumentError)
    from c:/ruby/lib/ruby/1.8/date.rb:1563:in `strptime'
    from testing.conf.rb:166:in `delFiles'

I do have files with different date formats, but the yyyymmdd format
works when I use DateTime.parse, maybe because DateTime accepts this
format? However, if I use strptime it doesn't work either. Any help
will be greatly appreciated =)

On Wed, Apr 23, 2008 at 11:01 AM, Clement Ow
[email protected] wrote:

I do have files with different date formats, but the format yyyymmdd works when i use DateTime.parse maybe
because DateTime accepts this format?

Yes, that’s exactly the issue.

however if i use strptime it also cant work. Any help will be greatly appreciated =)

If you have files with different formats, you will have to know which
format each file is, because DateTime.parse is expecting yyyymmdd,
while strptime is expecting whatever format you pass it, but only one
format. If the dates are current dates, and are only these two formats
(yyyymmdd or ddmmyyyy) I think this is safe:

regexp = /(\d{8})/
match = regexp.match(file_name)
file_date = nil
begin
  file_date = DateTime.parse(match[1])
rescue ArgumentError
  file_date = DateTime.strptime(match[1], "%d%M%Y")
end

However, if you have arbitrary dates, this can lead to unexpected
results. For example:

19011902

will result in 1901-19-02 while maybe you meant 19-01-1902.
Also, I think the above is safe because the century (20xx for the
year) is not a valid month, but there might be some corner case I
haven’t realized.
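
A variation on the same fallback, sketched with explicit strptime formats for both layouts and a plausibility check on the year. The helper name parse_file_date and the 2000..2099 year range are assumptions for illustration, and the format strings here use %m for the month:

```ruby
require 'date'

# Candidate layouts, tried in order. %m is the month directive.
FORMATS = ["%Y%m%d", "%d%m%Y"]

# Hypothetical helper: returns a Date, or nil if no layout gives a
# valid date with a plausible year (the range is an assumed default).
def parse_file_date(digits, years = 2000..2099)
  FORMATS.each do |fmt|
    begin
      date = Date.strptime(digits, fmt)
      return date if years.include?(date.year)
    rescue ArgumentError
      next  # invalid under this layout, try the next one
    end
  end
  nil
end

puts parse_file_date("20080331")          # read as yyyymmdd
puts parse_file_date("16042008")          # read as ddmmyyyy
puts parse_file_date("19011902").inspect  # year outside the range
```

Returning nil for implausible dates lets the caller skip the file instead of deleting it on a guess.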

Jesus.

Hey Jesus,

file_date = DateTime.strptime(match[1], "%d%M%Y")

With the above, the date is parsed as e.g. 28012008 even though the
date is 28032008. So I tried some trial and error and used this
instead:

file_date = DateTime.strptime(match[1], "%d%m%Y")

And bingo, it parses the date correctly, so the delete command can run.
Thanks a lot for your time and help! =)
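
The difference comes down to the directive: in strptime format strings %m is the month, while %M is the minute. With %M no month is parsed at all, which is presumably why January was filled in. A short sketch using Date._strptime, which returns the parsed fields as a hash:

```ruby
require 'date'

# %M reads "03" as minutes, so the result has no month field.
with_minutes = Date._strptime("28032008", "%d%M%Y")

# %m reads "03" as the month.
with_month = Date._strptime("28032008", "%d%m%Y")

p with_minutes
p with_month

puts Date.strptime("28032008", "%d%m%Y")   # 2008-03-28
```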

Cheers!