Forum: Ruby Question concerning multiline regexps and best practice

oliver.andrich (Guest)
on 2005-12-06 23:48
(Received via mailing list)
Hi,

I am currently moving an application from Python to Ruby for a training
purpose and to learn Ruby. Inside this application I am parsing text
files delivered by news agencies. These follow more or less a
specification developed by the IPTC consortium. But back to the
question. :)

At the moment I use a concatenation of strings as input for the
Regexp, but I asked myself whether I could use a HERE document instead,
as it would make things a lot clearer without all these single and
double quotes around the strings. Sadly, inside a HERE document the \n
at the end of each line becomes part of the regexp. Is it possible to
write a HERE document or something like it that spans several lines,
but where the \n from the source are skipped afterwards?

Or maybe there is an even better way to do it. I am even thinking about
writing a bunch of methods to parse all the stuff without a regex.

Best regards,
Oliver
akonsu (Guest)
on 2005-12-06 23:52
(Received via mailing list)
Unless I misunderstand the question, a regex option that allows
multiline matches might be used.

konstantin
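
A note on the multiline option suggested above: in Ruby, /m only makes "." match a newline (roughly what re.S does in Python). It does not make the engine skip newlines that are typed into the pattern itself. A minimal sketch, assuming Ruby 2.4 or later for match?, with constant names made up for illustration:

```ruby
# /m changes what "." matches; it does not ignore newlines in the pattern.
DOT_DEFAULT   = /a.b/
DOT_MULTILINE = /a.b/m

# A pattern spread over two source lines without /x contains a literal "\n".
TWO_LINES = %r{foo
bar}

p DOT_DEFAULT.match?("a\nb")    # false: "." stops at a newline by default
p DOT_MULTILINE.match?("a\nb")  # true: with /m, "." matches the newline
p TWO_LINES.match?("foobar")    # false: the pattern demands a newline
p TWO_LINES.match?("foo\nbar")  # true
```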
oliver.andrich (Guest)
on 2005-12-07 00:09
(Received via mailing list)
ako... wrote:
> unless i do not understand the question, a regex option that allows
> multiline matches might be used.

Well, to better describe what I am currently dealing with, here is a
snippet of Python code with the regex.

msg_rx = re.compile(
     "^\x01?" +
     "(?P<srcid>[a-zA-Z]{3,4})(?P<msgnum>\\d{3,4}) " +
     "(?P<prio>\\d) " +
     "(?P<department>[a-zA-Z]{1,3}) " +
     "(?P<wordcnt>\\d{1,4}) " +
     "(?P<optional>.*)\r\n*" +
     "(?P<keywords>.*)\r\n*" +
     "\x02" +
     "(?:(?P<headline>.*)=\\s*\r\n)?" +
     "(?P<text>.*)" +
     "\x03.*" +
     "(?P<day>\\d{2})(?P<hour>\\d{2})(?P<minute>\\d{2}) " +
     "(?P<mon>[a-zA-Z]{3}) " +
     "(?P<year>\\d{2})",
     re.S
)

This little "baby" does the job. As Ruby doesn't have named groups in
regexps, I have to add comment lines (?#...) to document the individual
groups. This would clutter the thing even more. Now Ruby has these nice
HERE documents, %r{...} and so on. I would be happy if I could achieve
something like this:

msg_rx = %r{
^\x01?
(?# comment for the line)
([a-zA-Z]{3,4})(\d{3,4})\s
(\d)\s
(?# comment for the line)
([a-zA-Z]{1,3})\s
(?# comment for the line)
(\d{1,4})\s
(.*)\r\n*
(.*)\r\n*
\x02
(?# comment for the line)
(?:(.*)=\s*\r\n)?
(?# comment for the line)
(.*)
\x03.*
(?# comment for the line)
(\d{2})(\d{2})(\d{2})\s
(?# comment for the line)
([a-zA-Z]{3})\s
(?# comment for the line)
(\d{2})
}

This looks a lot cleaner to me, but sadly the "\n" at the end of the
lines are in my way. :) I could strip them, but if it would just
"happen" it would be nicer.

Hopefully, this makes my question a little clearer.

Best regards, Oliver
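
On the named-groups point in this post: the statement held for the Ruby of 2005, but Ruby 1.9 and later do support named groups, written (?<name>...) without Python's "P". A small sketch, assuming a modern Ruby; the constant name and the sample timestamp are made up, and the group names are borrowed from the Python snippet:

```ruby
# Named groups in Ruby 1.9+ use (?<name>...), not Python's (?P<name>...).
STAMP_RX = /(?<day>\d{2})(?<hour>\d{2})(?<minute>\d{2}) (?<mon>[a-zA-Z]{3}) (?<year>\d{2})/

m = STAMP_RX.match("061830 Dec 05")
puts m[:day]   # groups can be read by name...
puts m[:mon]
p m.names      # ...and the names themselves are available on the MatchData
```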
ef (Guest)
on 2005-12-07 00:21
(Received via mailing list)
On Wed, Dec 07, 2005 at 07:47:34AM +0900, Oliver Andrich wrote:
>
> But sadly inside a HERE document, the \n at the end of a line are
> used by the regexp. Is it possible to write a HERE document or
> something like that with \n inside, but afterwards the \n the source
> are skipped?

The cleanest solution is to write a regular expression that works
regardless of the presence of newlines.  You probably want multiline
mode.  I'd need to see your specific example.

Or you could strip out the newlines with mystring.gsub("\n", "").

regards,
Ed
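
When stripping newlines from a pattern string, it matters which method is used: String#sub replaces only the first match, while String#gsub replaces all of them. A quick sketch with a made-up three-line pattern source (the constant names are for illustration only):

```ruby
# Hypothetical pattern source spread over three lines.
PATTERN_SOURCE = "foo\nbar\nbaz\n"

FIRST_ONLY = PATTERN_SOURCE.sub("\n", "")   # removes only the first newline
ALL_GONE   = PATTERN_SOURCE.gsub("\n", "")  # removes every newline

p FIRST_ONLY  # "foobar\nbaz\n"
p ALL_GONE    # "foobarbaz"
```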
akonsu (Guest)
on 2005-12-07 00:33
(Received via mailing list)
The best that I could come up with is to remove the newlines from your
regexp:

(ms_rx = <<END).gsub!(/\n/, '')
line one
line two
END
akonsu (Guest)
on 2005-12-07 01:10
(Received via mailing list)
The best that I could come up with:

(var = <<TARGET).gsub!(/\n/, '')
line one
line two
TARGET

puts var
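
To turn such a stripped HERE document into a Regexp, one option is Regexp.new. The sketch below is only an illustration with made-up names; it assumes a single-quoted heredoc (<<'END') so that backslashes such as \d survive, since a plain <<END heredoc processes escapes like a double-quoted string:

```ruby
# <<'END' keeps backslashes literal; the newlines are stripped afterwards.
SOURCE = <<'END'.gsub(/\n/, '')
(\d{2})
(\d{2})
(\d{2})
END

DATE_RX = Regexp.new(SOURCE)

p SOURCE                            # the three groups joined on one line
p DATE_RX.match("061830").captures  # ["06", "18", "30"]
```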
surrender_it (Guest)
on 2005-12-07 01:46
(Received via mailing list)
Oliver Andrich wrote:
> ako... wrote:

> This looks a lot cleaner to me, but sadly the "\n" at the end of the
> lines are in my way. :) I could strip them, but if it would just
> "happen" it would be nicer.

Use the /x switch; it should work even with %r literals:
 >> rgx=%r[
foo #foo
bar #bar
]x
=> /
foo #foo
bar #bar
/x
 >> m=rgx.match "foobar"
=> #<MatchData:0x29c9490>
 >> m[0]
=> "foobar"
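
One consequence of /x worth spelling out: because whitespace in the pattern is ignored, a space that should really be matched has to be written explicitly, for example as \s. A small sketch, assuming Ruby 2.4 or later for match?, with made-up constant names:

```ruby
LOOSE = %r{
  foo   # whitespace and comments in the pattern are ignored
  bar
}x

STRICT = %r{
  foo
  \s    # explicit whitespace escape still matches a space
  bar
}x

p LOOSE.match?("foobar")    # true
p LOOSE.match?("foo bar")   # false: the pattern is effectively /foobar/
p STRICT.match?("foo bar")  # true
```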
w_a_x_man (Guest)
on 2005-12-07 02:15
(Received via mailing list)
Oliver Andrich wrote:

> (\d)\s
> (.*)
> \x03.*
> (?# comment for the line)
> (\d{2})(\d{2})(\d{2})\s
> (?# comment for the line)
> ([a-zA-Z]{3})\s
> (?# comment for the line)
> (\d{2})
> }

Use extended mode:

msg_rx = %r{
  ^\x01?
  # comment for the line
  ([a-zA-Z]{3,4}) (\d{3,4})  \s
  (\d)  \s
}x
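
For completeness, here is how the whole Python pattern from earlier in the thread might look in extended mode. This is a sketch, not a tested parser for real agency feeds: the /m flag is assumed as the counterpart of Python's re.S, the group comments reuse the Python group names, and the sample message is invented purely for illustration.

```ruby
MSG_RX = %r{
  ^\x01?
  ([a-zA-Z]{3,4})(\d{3,4})\s   # source id, message number
  (\d)\s                       # priority
  ([a-zA-Z]{1,3})\s            # department
  (\d{1,4})\s                  # word count
  (.*)\r\n*                    # optional field
  (.*)\r\n*                    # keywords
  \x02
  (?:(.*)=\s*\r\n)?            # headline, if present
  (.*)                         # text
  \x03.*
  (\d{2})(\d{2})(\d{2})\s      # day, hour, minute
  ([a-zA-Z]{3})\s              # month
  (\d{2})                      # year
}mx

# Invented sample message, not real agency data.
SAMPLE = "\x01dpa0123 4 pl 150 optional stuff\r\nkeyword line\r\n" \
         "\x02Headline text =\r\nBody text here.\r\n" \
         "\x03\r\n061830 Dec 05"

match = MSG_RX.match(SAMPLE)
if match
  src_id, msg_num, prio = match.captures
  puts "source=#{src_id} number=#{msg_num} priority=#{prio}"
  puts "month=#{match[13]}"
end
```

Note that Ruby's ^ matches at the start of any line, whereas Python's ^ without re.M matches only at the start of the string; \A would be the stricter choice if that difference matters.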
oliver.andrich (Guest)
on 2005-12-07 08:20
(Received via mailing list)
Thanks, William and Gabriele! This is exactly what I have been looking
for. Now this part of my module looks nice, clean and uncluttered.

Best regards,
Oliver