Forum: Ruby RegExp & File read help

Don Levan (Guest)
on 2007-06-27 14:53
(Received via mailing list)
Hi All,

I am trying to parse out a list of elements from a set of XML files
which match a given regular expression. I am sure there is probably a
way to do this using an xml parsing library, but I thought it might
be just as easy to do so with regular expressions.

My thought was to do the following:

Iterate through a set of files in a directory.
Search each file for a set of lines which match a given regular
expression.
Add the capture group in each match to an array.
Sort the array and remove any duplicate values.
Print the results.
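
The steps above can be sketched roughly as follows. This is only a sketch under assumptions: the pattern is the one tested later in this post, the method name font_families and the relative directory "DDRs" are hypothetical, and the files are assumed to be readable as plain text in Ruby's default encoding.

```ruby
# Pattern taken from the post; the capture group holds the font name.
FONT_PATTERN = /<Font-family codeSet="\w*" fontId="\d*">(\w*)<\/Font-family>/

# Hypothetical helper: collects the capture group from every matching line,
# then removes duplicates and sorts the result.
def font_families(lines, pattern = FONT_PATTERN)
  lines.flat_map { |line| line.scan(pattern).flatten }.uniq.sort
end

# Iterate through every XML file in a (hypothetical) directory.
all_lines = Dir.glob("DDRs/*.xml").flat_map { |path| File.readlines(path) }
puts font_families(all_lines)
```

Keeping the matching logic in a method that takes any list of lines makes it easy to try on a few sample strings before pointing it at real files.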


Here are the steps I have tried in building my script:

First, I tested to make sure my regular expression actually matched
against the pattern I was seeking. This seemed to work as expected.

_________
regexp = Regexp.new(/<Font-family codeSet=\"\w*\" fontId=\"\d*\">(\w*)<\/Font-family>/m)
string = %q(<Font-family codeSet="Roman" fontId="0">Helvetica</Font-family>)
if string =~ regexp
puts "yes, there is a match. #{$1}"
end
_________

Returns >>  yes, there is a match. Helvetica



Then, I tested a different method which would add the matches to an
array. This also seemed to work as expected.
_________
regexp = Regexp.new(/<Font-family codeSet=\"\w*\" fontId=\"\d*\">(\w*)<\/Font-family>/m)
string = %q(<Font-family codeSet="Roman" fontId="0">Helvetica</Font-family>)

a = regexp.match(string)
puts a[1]
_________

Returns  >> Helvetica



Next, I tested opening a file and returning all lines. This seemed to
work as well.
_________
file = File.new('/Users/donlevan/Desktop/DDRs/Apple Dealer Price List.xml')

file.each do |line|
   puts line
end
_________
Returns >> <?xml version="1.0" encoding="UTF-16"?>
<FMPReport link="Summary.xml" type="Report" version="8.5v1"
creationDate="6/26/2007" creationTime="10:54:46 AM">
<File name="Apple Dealer Price List" path="10.100.0.10">
<BaseTableCatalog>
  <BaseTable id="32769" name="Apple Dealer Price List" records="235">
    <FieldCatalog> ... end of file




Where I am getting stuck is in the next code fragment, in which I am
testing each line to see if there is a match. There should be as the
string I used above for testing was pulled directly from one line of
the file. Unfortunately, I get an error and no matches.

_________
regexp = Regexp.new(/<Font-family codeSet=\"\w*\" fontId=\"\d*\">(\w*)<\/Font-family>/m)
file = File.new('/Users/donlevan/Desktop/DDRs/Apple Dealer Price List.xml')

file.each do |string|
   if string =~ regexp
      puts "yes, there is a match. #{$1}"
   end
end
_________
Returns >>

RubyMate r6354 running Ruby r1.8.6 (/usr/local/bin/ruby)
 >>> untitled

/Users/donlevan/Library/Application Support/TextMate/Support/lib/scriptmate.rb:29: warning: Insecure world writable dir /Users/donlevan/Library/Application Support in PATH, mode 040706
Program exited.
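
One possible explanation, which nobody in this thread confirms: the XML declaration printed earlier says encoding="UTF-16". If the file really is stored that way, each line read as raw bytes has a zero byte after every ASCII character, and a pattern written as plain text would not match. The sketch below illustrates the effect on an in-memory string using current Ruby's String#encode (not available in Ruby 1.8.6); the little-endian byte order is an assumption.

```ruby
# Assumption: the file is UTF-16 on disk, as its XML declaration claims.
# The little-endian byte order used here is a guess for illustration.
pattern = /<Font-family codeSet="\w*" fontId="\d*">(\w*)<\/Font-family>/
line = %q(<Font-family codeSet="Roman" fontId="0">Helvetica</Font-family>)

# The raw bytes such a line would occupy in a UTF-16LE file, viewed as binary.
raw_bytes = line.encode("UTF-16LE").b

# Every ASCII character is followed by a zero byte, so the pattern cannot match.
raw_match = raw_bytes =~ pattern
puts raw_match.inspect

# Decoding to UTF-8 first restores the text the pattern expects.
decoded = raw_bytes.dup.force_encoding("UTF-16LE").encode("UTF-8")
font_name = decoded[pattern, 1]
puts font_name
```

If this turns out to be the cause, current Ruby can also decode while reading by opening the file with a mode string such as "rb:UTF-16LE:UTF-8"; again, the byte order would need to be checked against the real file.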



I would be grateful for any assistance. Thanks so much.

Don Levan
Brooklyn, New York
Axel Etzold (Guest)
on 2007-06-27 16:52
(Received via mailing list)
Hi Don,

> I am sure there is probably a
> way to do this using an xml parsing library, but I thought it might
> be just as easy to do so with regular expressions.

Hpricot is a good choice.
>
> Where I am getting stuck is in the next code fragment, in which I am
> testing each line to see if there is a match. There should be as the
> string I used above for testing was pulled directly from one line of
> the file. Unfortunately, I get an error and no matches.
>
> _________
> regexp = Regexp.new(/<Font-family codeSet=\"\w*\" fontId=\"\d*\">(\w*)

In the regexp, you need to escape the minus sign also, otherwise,
it is interpreted as a range of signs, i.e. f-t =['f','g',...,'t']

> regexp = Regexp.new(/<Font\-family codeSet=\"\w*\" fontId=\"\d*\">(\w*)<\/Font\-family>/m)

Best regards,

Axel
unknown (Guest)
on 2007-06-27 17:31
(Received via mailing list)
Hi --

On Wed, 27 Jun 2007, Axel Etzold wrote:

>> string I used above for testing was pulled directly from one line of
>> the file. Unfortunately, I get an error and no matches.
>>
>> _________
>> regexp = Regexp.new(/<Font-family codeSet=\"\w*\" fontId=\"\d*\">(\w*)
>
> In the regexp, you need to escape the minus sign also, otherwise,
> it is interpreted as a range of signs, i.e. f-t =['f','g',...,'t']

Only inside a character class.  Otherwise it's just a minus sign:

irb(main):017:0> Regexp.new(/a-z/).match("a")
=> nil
irb(main):018:0> Regexp.new(/a-z/).match("literal a-z")
=> #<MatchData:0x312ce8>
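
The same point as a small script, for anyone who wants to run it. This is just an illustrative sketch; the sample strings and variable names are made up for the example.

```ruby
tag = "<Font-family>"

# Outside a character class the hyphen is literal, escaped or not.
plain = /Font-family/.match(tag)
escaped = /Font\-family/.match(tag)
puts plain[0] == escaped[0]

# Inside square brackets, a-z is a range and matches single letters.
in_class = "a-z q".scan(/[a-z]/)
# Without brackets, a-z only matches the three literal characters "a-z".
outside_class = "a-z q".scan(/a-z/)

puts in_class.inspect
puts outside_class.inspect
```

So escaping the hyphen in Font-family is harmless, but it does not change what the pattern matches.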


David
Don Levan (Guest)
on 2007-06-27 18:44
(Received via mailing list)
Hi David and Axel,

Thanks for your help. I don't have it solved yet, but you both have
cleared up the confusion.

Thanks,

Don