RegExp & File read help

Hi All,

I am trying to parse out a list of elements from a set of XML files
which match a given regular expression. I am sure there is probably a
way to do this using an XML parsing library, but I thought it might
be just as easy to do so with regular expressions.

My thought was to do the following:

Iterate through a set of files in a directory.
Search each file for a set of lines which match a given regular
expression.
Add the capture group in each match to an array.
Sort the array and remove any duplicate values.
Print the results.
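
The steps above can be sketched roughly as follows. This is only a minimal sketch in a recent Ruby, not a tested solution for the files in question: the helper name `font_families`, the constant `FONT_FAMILY`, and the `*.xml` glob are assumptions made for illustration, and it assumes the files are in an encoding Ruby can read as text directly (for example UTF-8).

```ruby
# Matches one Font-family element and captures the text inside it.
FONT_FAMILY = /<Font-family codeSet="\w*" fontId="\d*">(\w*)<\/Font-family>/

# Hypothetical helper: gathers the captured names from every *.xml file in dir.
def font_families(dir, pattern = FONT_FAMILY)
  names = []
  Dir.glob(File.join(dir, "*.xml")).each do |path|
    # scan returns one array of captures per match; flatten them into names
    names.concat(File.read(path).scan(pattern).flatten)
  end
  # Remove duplicates, then sort alphabetically
  names.uniq.sort
end
```

Calling `puts font_families(some_directory)` would then print one name per line. Reading each file whole and using `scan` avoids the per-line loop and picks up several matches on one line.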

Here are the steps I have tried in building my script:

First, I tested to make sure my regular expression actually matched
against the pattern I was seeking. This seemed to work as expected.


# Capture the text between the opening and closing Font-family tags.
regexp = Regexp.new(/<Font-family codeSet="\w*" fontId="\d*">(\w*)<\/Font-family>/m)
string = %q(Helvetica</Font-family>)
if string =~ regexp
  puts "yes, there is a match. #{$1}"
end


Returns >> yes, there is a match. Helvetica

Then, I tested a different method which would add the matches to an
array. This also seemed to work as expected.


regexp = Regexp.new(/<Font-family codeSet="\w*" fontId="\d*">(\w*)<\/Font-family>/m)
string = %q(Helvetica</Font-family>)

# match returns a MatchData object; index 1 is the first capture group.
a = regexp.match(string)
puts a[1]


Returns >> Helvetica

Next, I tested opening a file and returning all lines. This seemed to
work as well.


file = File.new('/Users/donlevan/Desktop/DDRs/Apple Dealer Price List.xml')

file.each do |line|
  puts line
end


Returns >> <?xml version="1.0" encoding="UTF-16"?>




… end of file

Where I am getting stuck is in the next code fragment, in which I am
testing each line to see if there is a match. There should be one, as the
string I used above for testing was pulled directly from one line of
the file. Unfortunately, I get an error and no matches.


regexp = Regexp.new(/<Font-family codeSet="\w*" fontId="\d*">(\w*)<\/Font-family>/m)
file = File.new('/Users/donlevan/Desktop/DDRs/Apple Dealer Price List.xml')

# Test every line of the file against the pattern.
file.each do |string|
  if string =~ regexp
    puts "yes, there is a match. #{$1}"
  end
end


Returns >>

RubyMate r6354 running Ruby r1.8.6 (/usr/local/bin/ruby)

untitled

/Users/donlevan/Library/Application Support/TextMate/Support/lib/scriptmate.rb:29: warning: Insecure world writable dir /Users/donlevan/Library/Application Support in PATH, mode 040706
Program exited.
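
One possible cause worth checking, offered as an assumption since nothing in the thread confirms it: the XML declaration printed earlier says encoding="UTF-16". If the file really is stored that way, a byte-oriented read hands back a NUL byte next to every ASCII character, and a pattern written in plain ASCII would not match any line. The sketch below reproduces that effect in a recent Ruby (not the 1.8.6 shown above), using a made-up sample line whose attribute values are invented.

```ruby
# Hypothetical sample line; the attribute values are invented for illustration.
line = %q(<Font-family codeSet="Roman" fontId="0">Helvetica</Font-family>)
pattern = /<Font-family codeSet="\w*" fontId="\d*">(\w*)<\/Font-family>/

# Simulate the raw bytes a UTF-16LE file would contain for that line.
raw = line.encode("UTF-16LE").force_encoding(Encoding::BINARY)

# Each ASCII character is followed by a NUL byte, so the pattern finds nothing.
raw_match = raw =~ pattern

# Decoding to UTF-8 first restores ordinary text that the pattern can match.
decoded = raw.dup.force_encoding("UTF-16LE").encode("UTF-8")
decoded_family = decoded[pattern, 1]

puts raw_match.inspect
puts decoded_family
```

If this is what is happening, converting the file contents before matching would be the thing to try; Ruby 1.8 shipped the Iconv library for that kind of conversion.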

I would be grateful for any assistance. Thanks so much.

Don L.
Brooklyn, New York

Hi Don,

> I am sure there is probably a
> way to do this using an XML parsing library, but I thought it might
> be just as easy to do so with regular expressions.

Hpricot is a good choice.
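
Hpricot is a third-party gem. As a rough sketch of the same parser-based idea, the example below uses REXML instead, which has long shipped with Ruby (as a bundled gem on recent versions). The sample document, including the `Fonts` wrapper element and the attribute values, is made up for illustration and is not taken from the real files.

```ruby
require "rexml/document"

# Hypothetical sample document; the structure and values are invented.
xml = <<~XML
  <?xml version="1.0"?>
  <Fonts>
    <Font-family codeSet="Roman" fontId="0">Helvetica</Font-family>
    <Font-family codeSet="Roman" fontId="1">Times</Font-family>
    <Font-family codeSet="Roman" fontId="0">Helvetica</Font-family>
  </Fonts>
XML

doc = REXML::Document.new(xml)

# Select every Font-family element anywhere in the document and take its text.
families = REXML::XPath.match(doc, "//Font-family").map { |el| el.text }

# Remove duplicates and sort, as in the original plan.
families = families.uniq.sort
puts families
```

A parser also takes care of the encoding named in the XML declaration and of elements that span several lines, which a line-by-line regular expression does not.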

> Where I am getting stuck is in the next code fragment, in which I am
> testing each line to see if there is a match. There should be one, as the
> string I used above for testing was pulled directly from one line of
> the file. Unfortunately, I get an error and no matches.
>
> regexp = Regexp.new(/<Font-family codeSet="\w*" fontId="\d*">(\w*)

In the regexp, you also need to escape the minus sign; otherwise
it is interpreted as a range of characters, e.g. f-t = ['f', 'g', …, 't']

regexp = Regexp.new(/<Font\-family codeSet="\w*" fontId="\d*">(\w*)<\/Font\-family>/m)

Best regards,

Axel

Hi –

On Wed, 27 Jun 2007, Axel E. wrote:

>> string I used above for testing was pulled directly from one line of
>> the file. Unfortunately, I get an error and no matches.
>>
>> regexp = Regexp.new(/<Font-family codeSet="\w*" fontId="\d*">(\w*)
>
> In the regexp, you also need to escape the minus sign; otherwise
> it is interpreted as a range of characters, e.g. f-t = ['f', 'g', …, 't']

Only inside a character class. Otherwise it’s just a minus sign:

irb(main):017:0> Regexp.new(/a-z/).match("a")
=> nil
irb(main):018:0> Regexp.new(/a-z/).match("literal a-z")
=> #<MatchData:0x312ce8>
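
To put the two cases side by side, here is a small sketch; the variable names are made up for illustration. It compares a hyphen inside a character class with one outside it, and checks that escaping the hyphen in a pattern like the Font-family one makes no difference.

```ruby
range_pattern   = /[f-t]/  # inside brackets: any single character from "f" through "t"
literal_pattern = /f-t/    # outside brackets: the literal text "f-t"

in_range      = !!(range_pattern =~ "g")        # "g" lies between "f" and "t"
literal_on_g  = !!(literal_pattern =~ "g")      # "g" does not contain "f-t"
literal_on_ft = !!(literal_pattern =~ "f-t")    # the literal text is present

# Outside a character class, escaping the hyphen changes nothing.
plain   = /Font-family/  =~ "<Font-family>"
escaped = /Font\-family/ =~ "<Font-family>"
same_result = (plain == escaped)

puts in_range, literal_on_g, literal_on_ft, same_result
```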

David

Hi David and Axel,

Thanks for your help. I don’t have it solved yet, but you both have
cleared up the confusion.

Thanks,

Don
