Question concerning multiline regexps and best practice


#1

Hi,

I am currently porting an application from Python to Ruby as a training
exercise to learn Ruby. The application parses text files delivered by
news agencies, which more or less follow a specification developed by
the IPTC consortium. But back to the question. :)

At the moment I use a concatenation of strings as the input for the
Regexp, but I asked myself whether I could use a HERE document instead,
as it would make things a lot clearer without all the single and double
quotes around the strings. Sadly, inside a HERE document the \n at the
end of each line becomes part of the regexp. Is it possible to write a
HERE document, or something like it, with line breaks in the source
that are skipped afterwards?

Or maybe there is an even better way to do it. I am even thinking about
writing a bunch of methods to parse all the stuff without a regex.

Best regards,
Oliver


#2

Unless I misunderstand the question, a regex option that allows
multiline matches might be used.

konstantin


#3

ako… wrote:

Unless I misunderstand the question, a regex option that allows
multiline matches might be used.

Well, to better describe what I am currently dealing with, here is a
snippet of Python code with the regex.

msg_rx = re.compile(
    "^\x01?" +
    "(?P[a-zA-Z]{3,4})(?P\d{3,4}) " +
    "(?P\d) " +
    "(?P[a-zA-Z]{1,3}) " +
    "(?P\d{1,4}) " +
    "(?P.)\r\n" +
    "(?P.)\r\n" +
    "\x02" +
    "(?:(?P.)=\s\r\n)?" +
    "(?P.)" +
    "\x03." +
    "(?P\d{2})(?P\d{2})(?P\d{2}) " +
    "(?P[a-zA-Z]{3}) " +
    "(?P\d{2})",
    re.S
)

This little “baby” does the job. As Ruby doesn’t have named groups in
regexps, I have to add comment groups (?#…) to document the individual
groups, which would clutter the thing even more. Now Ruby has these nice
HERE documents, %r{…} and so on. I would be happy if I could achieve
something like this:

msg_rx = %r{
^\x01?
(?# comment for the line)
([a-zA-Z]{3,4})(\d{3,4})\s
(\d)\s
(?# comment for the line)
([a-zA-Z]{1,3})\s
(?# comment for the line)
(\d{1,4})\s
(.)\r\n
(.)\r\n
\x02
(?# comment for the line)
(?:(.)=\s\r\n)?
(?# comment for the line)
(.)
\x03.

(?# comment for the line)
(\d{2})(\d{2})(\d{2})\s
(?# comment for the line)
([a-zA-Z]{3})\s
(?# comment for the line)
(\d{2})
}

This looks a lot cleaner to me, but sadly the “\n” at the end of the
lines is in my way. :) I could strip them, but if it would just
“happen” it would be nicer.

Hopefully, this makes my question a little clearer.

Best regards, Oliver


#4

The best that I could come up with is to remove the newlines from your
regexp:

(ms_rx = <<END).gsub!(/\n/, '')
line one
line two
END


#5

The best that I could come up with:

(var = <<TARGET).gsub!(/\n/, '')
line one
line two
TARGET

puts var
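
To connect this to the original question: the stripped string still has to be turned into a Regexp. The following is only a minimal sketch, assuming a made-up three-group pattern and sample input rather than the real message format. It uses a single-quoted heredoc because an unquoted one processes backslash escapes like a double-quoted string, which would mangle sequences such as \d.

```ruby
# One pattern fragment per line; the newlines are stripped afterwards
# so they do not become part of the pattern.
# <<'END' (single-quoted heredoc) leaves backslashes untouched.
pattern = <<'END'.gsub(/\n/, '')
([a-zA-Z]{3,4})
(\d{3,4})\s
(\d)
END

header_rx = Regexp.new(pattern)

# Hypothetical sample input, not taken from a real agency message.
m = header_rx.match("abc123 4")
puts m[1]  # abc
puts m[2]  # 123
puts m[3]  # 4
```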


#6

On Wed, Dec 07, 2005 at 07:47:34AM +0900, Oliver A. wrote:

But sadly inside a HERE document, the \n at the end of each line becomes
part of the regexp. Is it possible to write a HERE document, or
something like it, with line breaks in the source that are skipped
afterwards?

The cleanest solution is to make a regular expression that can work
regardless of the presence of newlines. You probably want multiline
mode. I’d need to see your specific example.

Or you could strip out the newlines with mystring.gsub("\n", "").

regards,
Ed
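
Two details of these suggestions are easy to check in a short script. This is a sketch with made-up strings: String#sub replaces only the first newline while String#gsub replaces all of them, and multiline mode changes what "." matches in the input but does not make a newline inside the pattern disappear.

```ruby
source = "line one\nline two\nline three\n"

first_only = source.sub("\n", "")   # removes only the first newline
all_gone   = source.gsub("\n", "")  # removes every newline

puts first_only.inspect  # "line oneline two\nline three\n"
puts all_gone.inspect    # "line oneline twoline three"

# /m (Regexp::MULTILINE) only lets "." match a newline in the input;
# a newline written inside the pattern is still matched literally.
with_m = Regexp.new("foo\nbar", Regexp::MULTILINE)
puts with_m.match("foobar").inspect       # nil
puts with_m.match("foo\nbar").nil?        # false
```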


#7

Oliver A. wrote:

ako… wrote:

This looks a lot cleaner to me, but sadly the “\n” at the end of the
lines is in my way. :) I could strip them, but if it would just
“happen” it would be nicer.

Use the /x switch; it should work with %r as well:

rgx=%r[
foo #foo
bar #bar
]x
=> /
foo #foo
bar #bar
/x

m=rgx.match "foobar"
=> #<MatchData:0x29c9490>

m[0]
=> "foobar"
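
One consequence of /x is worth spelling out: because whitespace and "#" in the pattern are ignored or treated as comment starts, they have to be escaped when they are meant literally. A small sketch with made-up input:

```ruby
rx = %r{
  foo \s bar   # literal spaces here are ignored
  \#           # an escaped hash is a literal character, not a comment
  baz
}x

m = rx.match("foo bar#baz")
puts m[0]  # foo bar#baz

# Without an escape, the space in the pattern simply vanishes.
loose = %r{foo bar}x
puts loose.match("foo bar").inspect  # nil
puts loose.match("foobar")[0]        # foobar
```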


#8

Oliver A. wrote:

(\d)\s
(.)
\x03.

(?# comment for the line)
(\d{2})(\d{2})(\d{2})\s
(?# comment for the line)
([a-zA-Z]{3})\s
(?# comment for the line)
(\d{2})
}

Use extended mode:

msg_rx = %r{
^\x01?

# comment for the line

([a-zA-Z]{3,4}) (\d{3,4}) \s
(\d) \s
}x
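
As a sketch of how this might be used, the same fragment can be matched against a line and its captures unpacked. The sample input and the variable names (service, number, digit) are hypothetical labels chosen for illustration; the thread does not say what the fields mean.

```ruby
msg_rx = %r{
  ^\x01?
  # letters followed by digits, then one whitespace character
  ([a-zA-Z]{3,4}) (\d{3,4}) \s
  # a single digit, then one whitespace character
  (\d) \s
}x

# Made-up input; a real agency message will look different.
sample = "\x01abc0123 4 rest"

if (m = msg_rx.match(sample))
  service, number, digit = m.captures
  puts service  # abc
  puts number   # 0123
  puts digit    # 4
end
```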


#9

Thanks, William and Gabriele! This is exactly what I have been looking
for. Now this part of my module looks nice, clean and uncluttered.

Best regards,
Oliver
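
For readers on Ruby 1.9 or later: named groups of the form (?<name>…) are available there and combine with /x, which may remove the need for per-line comments altogether. A minimal sketch, with group names and input invented for illustration:

```ruby
msg_rx = %r{
  ^\x01?
  (?<service> [a-zA-Z]{3,4}) (?<number> \d{3,4}) \s
  (?<digit> \d) \s
}x

# Made-up input; the group names above are hypothetical labels.
m = msg_rx.match("\x01abc0123 4 rest")

puts m[:service]  # abc
puts m[:number]   # 0123
puts m[:digit]    # 4
```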