Newlines included in bracket negation

chrismalek · October 26, 2007, 11:13pm

(… that subject probably makes no sense …)

Anyway, I have some unexpected (to me) behavior in the following regexp.
This example is contrived, but based on a real need. Can anyone explain
why the result is multi-line, even though the re is not?

require 'test/unit'

class TestRE < Test::Unit::TestCase
  def test_newlines
    src = "happy\n\nbirthday"
    assert_equal("hday", src.scan(/h[^x]*?day/).to_s)
  end
end

produces

Finished in 0.031 seconds.

Failure:
test_newlines_consumed_in_not_section(TestRE) …
<"hday"> expected but was
<"happy\n\nbirthday">.

chrismalek · October 26, 2007, 11:31pm

Adding \n inside the brackets fixes it. I just wouldn’t expect to have
to do this, since I didn’t add the multiline mode option.

require 'test/unit'

class TestRE < Test::Unit::TestCase
  def test_newlines
    src = "happy\n\nbirthday"
    assert_equal("hday", src.scan(/h[^x\n]*?day/).to_s)
  end
end

chrismalek · October 26, 2007, 11:38pm

Chris M. wrote:

Can anyone explain why
the result is multi-line, even though the re is not?

It’s not a question of the re being multi-line or not, it’s a question
of the re being greedy v. non-greedy. But because there is only one
match for your regex, the issue of greedy v. non-greedy is irrelevant.

If you think about it, there is really no concept of ‘lines’ with
regard to text. There really is only one line: one long, continuous
line of characters. Some of those characters might be ‘\n’ characters,
and we may choose to interpret a ‘\n’ as a new line, but that doesn’t
change the fact that there is still just one continuous string of
characters. A regex has nothing inherently programmed into it that will
cause it to stop looking for matches when a ‘\n’ is encountered in the
sequence of characters. The regex character ‘.’ will stop searching
at a newline, but that is not true of regex’s generally. In any case,
you do not use the ‘.’ character in your regex, so that behavior is
irrelevant.
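
For anyone who wants to check this in irb, here is a minimal sketch using the string from the original post. The variable names are only illustrative. It suggests that a negated class accepts "\n" like any other unlisted character, and that listing "\n" in the class is what keeps the match on one line.

```ruby
src = "happy\n\nbirthday"

# A negated class matches any character it does not list, and "\n" is not listed.
negated_matches_newline = !!("\n" =~ /[^x]/)   # true
# Without the /m option, "." does not match "\n".
dot_matches_newline = !!("\n" =~ /./)          # false

# The match has to start at the first "h"; the lazy quantifier then grows
# across both newlines until "day" finally follows.
across_lines = src.scan(/h[^x]*?day/)          # ["happy\n\nbirthday"]

# Listing "\n" in the class stops the first attempt at the end of "happy",
# so the only match left is the "hday" inside "birthday".
single_line = src.scan(/h[^x\n]*?day/)         # ["hday"]
```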

chrismalek · October 26, 2007, 11:45pm

On 10/26/07, Chris M. wrote:

end
There’s also something I don’t understand, similar to the above.
I always thought that in a non-multiline regexp, the dot didn’t match
newlines (\n), so I don’t understand this:

irb(main):036:0> re = /(h)(.*)(day)/
=> /(h)(.*)(day)/
irb(main):037:0> "happy\n\nbirthday".match(re).captures
=> ["h", "", "day"]
irb(main):038:0> re = /(h)(.*)(day)/m
=> /(h)(.*)(day)/m
irb(main):039:0> "happy\n\nbirthday".match(re).captures
=> ["h", "appy\n\nbirth", "day"]

I thought the first case wouldn’t match.
Can anyone shed some light?

Jesus.

chrismalek · October 26, 2007, 11:46pm

On 10/26/07, 7stud – wrote:

The regex character ‘.’ will stop searching
at a newline, but that is not true of regex’s generally. In any case,
you do not use the ‘.’ character in your regex, so that behavior is
irrelevant.

Can you check my example above? I’m using a greedy match of .* which I
thought would match up to a \n in a non-multiline regexp, and would
include everything in a multiline one. I must be confused at some
point.

Jesus.

chrismalek · October 27, 2007, 12:35am

On 10/27/07, Phrogz wrote:

=> ["h", "", "day"]
h.+day/, which does not match.
I need more sleep, for sure. I was of course thinking of the first “h”
and the last “day”. That explains it.

irb(main):043:0> "happy\n\nday".match(re).captures
NoMethodError: undefined method `captures' for nil:NilClass

Thanks,

Jesus.

chrismalek · October 27, 2007, 12:20am

On Oct 26, 3:43 pm, “Jesús Gabriel y Galán” wrote:

irb(main):039:0> "happy\n\nbirthday".match(re).captures
=> ["h", "appy\n\nbirth", "day"]

I thought the first case wouldn’t match.
Can anyone shed some light?

The last four characters of the word “birthday” match the regexp
/h.*day/, without crossing any newlines. Perhaps you were thinking of
/h.+day/, which does not match.
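
A quick way to see the difference, as a rough sketch with illustrative variable names: `match` returns nil when nothing matches, so the captures are only read when a match exists. That nil is also what produces the NoMethodError shown earlier in the thread.

```ruby
src = "happy\n\nbirthday"

star = src.match(/(h)(.*)(day)/)   # "*" allows zero characters between "h" and "day"
plus = src.match(/(h)(.+)(day)/)   # "+" needs at least one character there

# Guard against nil before asking for captures.
star_captures = star ? star.captures : nil   # ["h", "", "day"]
plus_captures = plus ? plus.captures : nil   # nil

# With no "h" on the last line there is nothing to match at all.
no_match = "happy\n\nday".match(/(h)(.*)(day)/)   # nil
```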

chrismalek · October 27, 2007, 1:48am

On Oct 26, 2007, at 7:23 PM, 7stud – wrote:

irb(main):036:0> re = /(h)(.*)(day)/
=> /(h)(.*)(day)/
irb(main):037:0> "happy\n\nbirthday".match(re).captures
=> ["h", "", "day"]

The fact that the (.*) matched nothing was an indication that something
was amiss.

Nothing amiss there at all. The * means “match zero or more times”, so
it is perfectly fine to match zero occurrences of any character
(except newline) between the ‘h’ and the ‘day’.

-Rob

Rob B. http://agileconsultingllc.com

chrismalek · October 27, 2007, 1:24am

Jesús Gabriel y Galán wrote:

On 10/27/07, Phrogz wrote:

=> ["h", "", "day"]
h.+day/, which does not match.
I need more sleep, for sure. I was of course thinking on the first “h”
and the last “day”. That explains it

A clue was in the capture results:

irb(main):036:0> re = /(h)(.*)(day)/
=> /(h)(.*)(day)/
irb(main):037:0> "happy\n\nbirthday".match(re).captures
=> ["h", "", "day"]

The fact that the (.*) matched nothing was an indication that something
was amiss.

chrismalek · October 27, 2007, 4:34am

On Oct 26, 2007, at 3:30 PM, Chris M. wrote:

end
end

from memory, ‘multiline’ affects only the behavior of ‘.’ in regexps.
the re

[^x] => ‘not x’

simply matches any char that is not ‘x’ - including newline

it’s the same in perl and python iirc
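
a small sketch of that, assuming the string from the original post (variable names are just for illustration): the /m option changes what ‘.’ matches, while the negated class gives the same result with or without it.

```ruby
src = "happy\n\nbirthday"

# Without /m, "." stops at "\n", so only the "hday" in "birthday" matches.
dot_plain = src[/h.*day/]       # "hday"

# With /m, "." also matches "\n", so the greedy ".*" spans all three lines.
dot_multi = src[/h.*day/m]      # "happy\n\nbirthday"

# The negated class crosses newlines either way; /m makes no difference to it.
class_plain = src[/h[^x]*day/]  # "happy\n\nbirthday"
class_multi = src[/h[^x]*day/m] # "happy\n\nbirthday"
```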

cheers.

a @ http://codeforpeople.com/

chrismalek · October 27, 2007, 2:36am

Jesús Gabriel y Galán wrote:

I was of course thinking on the first “h”
and the last “day”.

Rob B. wrote:

Nothing amiss there at all.

Ok.

chrismalek · October 29, 2007, 2:52am

On 10/26/07, ara.t.howard wrote:

from memory, ‘multiline’ affects only the behavior of ‘.’ in res
the re

[^x] => ‘not x’

simply matches any char that is not ‘x’ - including newline

it’s the same in perl and python iirc

Yeah, it behaves that way. I guess I need to adjust my expectations.