RegExp & File read help

Don_L · June 27, 2007, 2:53pm

Hi All,

I am trying to parse out a list of elements from a set of XML files
which match a given regular expression. I am sure there is probably a
way to do this using an xml parsing library, but I thought it might
be just as easy to do so with regular expressions.

My thought was to do the following:

Iterate through a set of files in a directory.
Search each file for a set of lines which match a given regular
expression.
Add the capture group in each match to an array.
Sort the array and remove any duplicate values.
Print the results.
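
The steps above could be sketched roughly as follows. This is only an illustration, not the poster's script: the helper name font_families, the temporary directory, and the sample file contents (including the attribute values) are all made up, and it assumes a current Ruby and plain ASCII/UTF-8 files.

```ruby
require 'tmpdir'

# Pattern from the post, written with %r{} so the "/" in the closing tag
# does not need escaping.
FONT_FAMILY_RE = %r{<Font-family codeSet="\w*" fontId="\d*">(\w*)</Font-family>}

# Hypothetical helper: scan every file matching +glob+, collect the capture
# group of each match, then drop duplicates and sort.
def font_families(glob)
  Dir.glob(glob)
     .flat_map { |path| File.read(path).scan(FONT_FAMILY_RE).flatten }
     .uniq
     .sort
end

# Made-up sample data so the sketch runs on its own.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'a.xml'),
             %(<Font-family codeSet="Roman" fontId="3">Helvetica</Font-family>\n))
  File.write(File.join(dir, 'b.xml'),
             %(<Font-family codeSet="Roman" fontId="4">Times</Font-family>\n) +
             %(<Font-family codeSet="Roman" fontId="3">Helvetica</Font-family>\n))
  puts font_families(File.join(dir, '*.xml'))
end
```

String#scan returns one array per match when the pattern has a capture group, which is why the result is flattened before uniq and sort.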

Here are the steps I have tried in building my script:

First, I tested to make sure my regular expression actually matched
against the pattern I was seeking. This seemed to work as expected.

regexp = Regexp.new(/<Font-family codeSet="\w*" fontId="\d*">(\w*)<\/Font-family>/m)
string = %q(Helvetica</Font-family>)
if string =~ regexp
  puts "yes, there is a match. #{$1}"
end

Returns >> yes, there is a match. Helvetica

Then, I tested a different method which would add the matches to an
array. This also seemed to work as expected.

regexp = Regexp.new(/<Font-family codeSet="\w*" fontId="\d*">(\w*)<\/Font-family>/m)
string = %q(Helvetica</Font-family>)

a = regexp.match(string)
puts a[1]

Returns >> Helvetica
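
Regexp#match returns a MatchData object for the first match only, so it does not by itself build an array of all matches. A small sketch of the difference, using String#scan to collect every capture group; the two-line input and its attribute values are invented for the example:

```ruby
FONT_TAG = %r{<Font-family codeSet="\w*" fontId="\d*">(\w*)</Font-family>}

# Made-up input with two matching elements.
text = <<~XML
  <Font-family codeSet="Roman" fontId="3">Helvetica</Font-family>
  <Font-family codeSet="Roman" fontId="4">Times</Font-family>
XML

first = FONT_TAG.match(text)          # MatchData for the first match only
all   = text.scan(FONT_TAG).flatten   # capture group of every match, as a flat array

puts first[1]
p all
```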

Next, I tested opening a file and returning all lines. This seemed to
work as well.

file = File.new('/Users/donlevan/Desktop/DDRs/Apple Dealer Price List.xml')

file.each do |line|
  puts line
end

Returns >> <?xml version="1.0" encoding="UTF-16"?>

… end of file
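
As a side note, File.new leaves the file handle open until it is closed or garbage collected. A minimal sketch of the same line-by-line read with File.foreach, which closes the file when iteration finishes; the read_lines helper and the temporary sample file are made up for the example:

```ruby
require 'tempfile'

# Hypothetical helper: return the lines of the file at +path+ without
# their trailing newlines. File.foreach closes the file when done.
def read_lines(path)
  File.foreach(path).map(&:chomp)
end

# Made-up sample file so the sketch is runnable.
Tempfile.create(['sample', '.xml']) do |f|
  f.write(%(<?xml version="1.0"?>\n<root/>\n))
  f.flush
  read_lines(f.path).each { |line| puts line }
end
```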

Where I am getting stuck is in the next code fragment, in which I am
testing each line to see if there is a match. There should be, as the
string I used above for testing was pulled directly from one line of
the file. Unfortunately, I get an error and no matches.

regexp = Regexp.new(/<Font-family codeSet="\w*" fontId="\d*">(\w*)<\/Font-family>/m)
file = File.new('/Users/donlevan/Desktop/DDRs/Apple Dealer Price List.xml')

file.each do |string|
  if string =~ regexp
    puts "yes, there is a match. #{$1}"
  end
end

Returns >>

RubyMate r6354 running Ruby r1.8.6 (/usr/local/bin/ruby)

untitled

/Users/donlevan/Library/Application Support/TextMate/Support/lib/scriptmate.rb:29: warning: Insecure world writable dir /Users/donlevan/Library/Application Support in PATH, mode 040706
Program exited.
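
One thing that may be worth checking (a guess, not something confirmed in this thread): the first line of the file declares encoding="UTF-16". If the file really is stored as UTF-16, every ASCII character is accompanied by a zero byte on disk, so a byte-oriented pattern such as <Font-family would not match the raw lines even though it matches a string typed into the script. The sketch below assumes little-endian UTF-16 with a byte order mark and a Ruby that has String#encode (1.9 or later, newer than the 1.8.6 shown above). The sample line, its attribute values, and the decode_utf16le helper are made up for illustration.

```ruby
FONT_RE = %r{<Font-family codeSet="\w*" fontId="\d*">(\w*)</Font-family>}

# Hypothetical helper: decode raw UTF-16LE bytes to UTF-8 and drop a
# leading byte order mark if there is one.
def decode_utf16le(bytes)
  text = bytes.dup.force_encoding('UTF-16LE').encode('UTF-8')
  text.sub(/\A\uFEFF/, '')
end

# Made-up sample line; the attribute values are placeholders.
sample = %(<Font-family codeSet="Roman" fontId="3">Helvetica</Font-family>\n)

# Bytes roughly as a UTF-16LE file would hold them on disk (assumption).
raw = ("\uFEFF" + sample).encode('UTF-16LE').force_encoding('BINARY')

puts "raw bytes match?    #{!(raw =~ FONT_RE).nil?}"
puts "decoded text match? #{!(decode_utf16le(raw) =~ FONT_RE).nil?}"
puts decode_utf16le(raw)[FONT_RE, 1]
```

If the real file turns out to be big-endian, or not UTF-16 at all, the decoding step would need to change accordingly.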

I would be grateful for any assistance. Thanks so much.

Don L.
Brooklyn, New York

Don_L · June 27, 2007, 4:52pm

Hi Don,

I am sure there is probably a
way to do this using an xml parsing library, but I thought it might
be just as easy to do so with regular expressions.

Hpricot is a good choice.
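
For illustration, here is a rough sketch of the parser-based approach. It uses REXML, which ships with Ruby, rather than Hpricot, so it is a swapped-in library and not Axel's suggestion verbatim. The font_family_names helper and the sample document (including the Fonts root element and the attribute values) are invented for the example.

```ruby
require 'rexml/document'

# Hypothetical helper: collect the text of every <Font-family> element,
# then drop duplicates and sort.
def font_family_names(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//Font-family').map(&:text).uniq.sort
end

# Made-up sample document.
xml = <<~XML
  <?xml version="1.0"?>
  <Fonts>
    <Font-family codeSet="Roman" fontId="3">Helvetica</Font-family>
    <Font-family codeSet="Roman" fontId="4">Times</Font-family>
    <Font-family codeSet="Roman" fontId="3">Helvetica</Font-family>
  </Fonts>
XML

puts font_family_names(xml)
```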

Where I am getting stuck is in the next code fragment, in which I am
testing each line to see if there is a match. There should be, as the
string I used above for testing was pulled directly from one line of
the file. Unfortunately, I get an error and no matches.

regexp = Regexp.new(/<Font-family codeSet="\w*" fontId="\d*">(\w*)

In the regexp, you also need to escape the minus sign; otherwise
it is interpreted as a range of characters, e.g. f-t = ['f', 'g', …, 't']

regexp = Regexp.new(/<Font\-family codeSet="\w*" fontId="\d*">(\w*)<\/Font\-family>/m)

Best regards,

Axel

Don_L · June 27, 2007, 5:31pm

Hi –

On Wed, 27 Jun 2007, Axel E. wrote:

string I used above for testing was pulled directly from one line of
the file. Unfortunately, I get an error and no matches.

regexp = Regexp.new(/<Font-family codeSet="\w*" fontId="\d*">(\w*)

In the regexp, you also need to escape the minus sign; otherwise
it is interpreted as a range of characters, e.g. f-t = ['f', 'g', …, 't']

Only inside a character class. Otherwise it’s just a minus sign:

irb(main):017:0> Regexp.new(/a-z/).match("a")
=> nil
irb(main):018:0> Regexp.new(/a-z/).match("literal a-z")
=> #<MatchData:0x312ce8>
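
The same point as a small standalone sketch, assuming Ruby 2.4 or later for Regexp#match?. The three patterns are chosen only to contrast a hyphen outside a character class, a range inside one, and an escaped hyphen inside one:

```ruby
literal = /a-z/     # outside a class: the three characters "a-z"
range   = /[a-z]/   # inside a class: any one lowercase ASCII letter
escaped = /[a\-z]/  # escaped inside a class: "a", "-" or "z"

p literal.match?('a')            # hyphen is literal here, so a lone "a" fails
p literal.match?('literal a-z')
p range.match?('m')
p range.match?('-')
p escaped.match?('m')
p escaped.match?('-')
```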

David

Don_L · June 27, 2007, 6:44pm

Hi David and Axel,

Thanks for your help. I don’t have it solved yet, but you both have
cleared up the confusion.

Thanks,

Don