Odd regexp behavior

I’m running 1.9.2-p180

I have the following regexp: /\s(".*?")(\s|$)/

For some reason it isn’t matching the end of the following line, or of any
line with a similar format. By "end" I mean the entire user-agent string
"msnbot…htm)"\n

"207.46.13.53 - - [21/Apr/2010:04:05:29 -0600] "GET
/dualcredit/courses/general.php HTTP/1.1" 200 27731 "-" "msnbot/2.0b
(+http://search.msn.com/msnbot.htm)"\n"

However it matches against the following:

" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"\n"

I am at a total loss as to why. I’m not too sure how to go about
debugging
it either.


“Hey brother Christian with your high and mighty errand, Your actions
speak
so loud, I cant hear a word youre saying.”

-Greg Graffin (Bad Religion)

On Aug 10, 2011, at 13:50 , Glen H. wrote:

(+http://search.msn.com/msnbot.htm)"\n"

However it matches against the following:

" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"\n"

I am at a total loss as to why. I’m not too sure how to go about debugging
it either.

Don’t match. You’re not trying to match, you’re trying to extract. I see
two easy ways to do this:

  1. split on /"/ and get the field you want (the last one, not counting
    the newline).
  2. scan for everything within quotes and get the last one.

I’m sure there are others, but those are my two favorite go-to string
methods.

examples:

s = "207.46.13.53 - - [21/Apr/2010:04:05:29 -0600] \"GET /dualcredit/courses/general.php HTTP/1.1\" 200 27731 \"-\" \"msnbot/2.0b (+http://search.msn.com/msnbot.htm)\"\n"

# split on the quote character; the last piece is the user agent
p s.strip.split(/"/).last

=> "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

# scan for every quoted run, take the last one, and drop its surrounding quotes
p s.scan(/".+?"/).last[1..-2]

=> "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

Glen H. wrote in post #1016060:

I’m running 1.9.2-p180

I have the following regexp: /\s(".*?")(\s|$)/

For some reason it isn’t matching the end of the following line or any
line
with a similar format. By end I mean the entire user-agent string

It’s pretty simple: $ matches before a newline–not the end of a string.

str = "207.46.13.53 - - [21/Apr/2010:04:05:29 -0600] \"GET
/dualcredit/courses/general.php HTTP/1.1\" 200 27731 \"-\" \"msnbot/2.0b
(+http://search.msn.com/msnbot.htm)\"\n"

# the literal above contains two embedded newlines, so it holds three lines
str.scan(/^ .* $/x) do |match|
  puts match
  puts '-' * 20
end

--output:--
207.46.13.53 - - [21/Apr/2010:04:05:29 -0600] "GET
--------------------
/dualcredit/courses/general.php HTTP/1.1" 200 27731 "-" "msnbot/2.0b
--------------------
(+http://search.msn.com/msnbot.htm)"
--------------------

7stud – wrote in post #1016116:

7stud – wrote in post #1016113:

It’s pretty simple: $ matches before a newline–not the end of a string.

Actually, $ matches after a newline.

Whoops.

def show_match(str, re)
  if str =~ re
    "#{$`}<<#{$&}>>#{$'}"
  else
    "no match"
  end
end

p show_match("hello\nhello", /llo$/)
p show_match("hello\nhello", /llo\z/)

--output:--
"he<<llo>>\nhello"
"hello\nhe<<llo>>"

7stud – wrote in post #1016113:

It’s pretty simple: $ matches before a newline–not the end of a string.

Actually, $ matches after a newline.

Also,

str = "hello\nhello"

str.scan(/^hell/) {|match| p match}
puts "-" * 20
str.scan(/\Ahell/) {|match| p match}

--output:--
"hell"
"hell"
--------------------
"hell"
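
For reference, here is a small sketch of how the anchors used in the examples above differ when the string ends in a newline, as a log line does. The sample string is made up for illustration, and \Z does not appear elsewhere in this thread; it is included only for comparison.

```ruby
# Made-up sample: two lines, with a trailing newline like a log entry.
str = "hello\nhello\n"

end_of_line   = str.index(/llo$/)     # $ matches before a newline, not only at the end
before_last   = str.index(/llo\Z/)    # \Z matches at the end, or just before a final newline
very_end      = str.index(/llo\z/)    # \z matches only at the very end of the string
start_of_line = str.rindex(/^hell/)   # ^ matches at the start of every line
start_of_str  = str.rindex(/\Ahell/)  # \A matches only at the start of the string

p end_of_line, before_last, very_end, start_of_line, start_of_str
```

This should print 2, 8, nil, 6 and 0: the first "llo" satisfies $, only the second satisfies \Z, and neither satisfies \z because of the trailing newline.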

On Thu, Aug 11, 2011 at 12:41 AM, 7stud – [email protected]
wrote:

“hell”
“hell”

“hell”



Hmmm… maybe I should have posted from the beginning rather than from where I
had gotten to in my attempt to solve my problem.

I am trying to parse log files and am pulling out encapsulated fields so
that I can split the line in a sane way. I have the following regex, which I
am using to do that:

/\s(".*?")(\s|$)/

Now as to why I’m looking for the \s before and the \s or $ after: it turns
out that some of the user agent strings are in a format like ""Custom
Agent"="Mozilla …""\n

I had been using the regex Ryan suggested earlier until I discovered the
nested quotes.

The expression above works for quoted strings surrounded by spaces but not
for the last one on the line. I’ve tried changing $ to \n and that didn’t make
any difference.

Here is the exact code I’m using:

x = "encapsulatorhere"
stash = {}

gen_encap_matches.each do |encapex|
  line.gsub!(encapex) do |match|
    x.next!
    stash[x] = $1
    @job.log_format.separator + x + @job.log_format.separator
  end
end

gen_encap_matches just creates the regexes from a list of encapsulation
characters.

I’m at a complete loss as to why it won’t grab the last quoted string in the
line.



On Thu, Aug 11, 2011 at 5:52 AM, Glen H. [email protected]
wrote:


So, after a little digging on Stack Overflow I decided to try an explicit
lookahead. For whatever reason it works.

/\s(".*?")(?=\s|$)/ matches where /\s(".*?")(\s|$)/ won’t.



On Aug 12, 2011, at 07:28 AM, Glen H. [email protected] wrote:
[…]

Now as to why I’m looking for the \s before and the \s or $ after. It
turns out that some of the user agent strings are in a format like ""Custom
Agent"="Mozilla …""\n

So, after a little digging on Stack Overflow I decided to try an explicit
lookahead. For whatever reason it works.

/\s(".*?")(?=\s|$)/ matches where /\s(".*?")(\s|$)/ won’t.

It sounds like you have a solution, but don’t understand it. I’d like to
help you understand it, but I don’t understand what you’re trying to
match. The sample string you provide above does not match your regex
(and obviously so, as there is never whitespace before a quote)

Could you please provide a single string that you’re matching against,
and describe what you are trying to match?

On Fri, Aug 12, 2011 at 9:27 AM, Gavin K. [email protected] wrote:

Sure,

What I’m trying to do is parse our Apache log files. A fairly standard
sample line is as follows:

10.132.18.15 - - [21/Apr/2010:12:22:36 -0600] "GET
/images/2010_front_sprite.jpg HTTP/1.1" 304 - "http://cnm.edu/"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648;
InfoPath.1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"

I’m pulling out encapsulated data, splitting the line on the separator, then
putting the encapsulated data back. I was using /".*?"/ to grab the quoted
strings, but I discovered lines with the following format in the log file:

12.172.30.9 - - [21/Apr/2010:13:21:04 -0600] "GET
/clickheat/click.php?s=&g=index&x=130&y=432&w=1009&b=safari&c=1&random=Wed%20Apr%2021%202010%2013:21:04%20GMT-0600%20(MDT)
HTTP/1.1" 200 100 "http://cnm.edu/" ""CustomUserAgent"="Mozilla/5.0
(Macintosh; U; Intel Mac OS X 10_6; en-us) AppleWebKit/531.21.8 (KHTML, like
Gecko) Version/4.0.4 Safari/531.21.10 FOH:R177";"

This broke my simple /".*?"/ expression. So I decided to include the
separator in the regex and tried the following expression:

/\s(".*?")(\s|$)/

I am using gsub to perform the replacement action.

In my gsub block this would get all the quoted strings except for the user
agent string which ends the entry. If I tried matching that regexp against
a quoted string with a preceding space and followed by a \n, it would work.
It just didn’t work inside my gsub block.

For example:

10.132.18.15 - - [21/Apr/2010:12:22:36 -0600] "GET
/images/2010_front_sprite.jpg HTTP/1.1" 304 - "http://cnm.edu/"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648;
InfoPath.1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"

would come out as

10.132.18.15 - - encapsulatorherf encapsulatorherg 304 -
encapsulatorherh "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
.NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR
3.0.04506.648; InfoPath.1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"

what I wanted and was expecting is

10.132.18.15 - - encapsulatorherf encapsulatorherg 304 -
encapsulatorherh encapsulatorheri

As soon as I changed my regexp to /\s(".*?")(?=\s|$)/ it worked.

I’m not sure why /\s(".*?")(\s|$)/ and /\s(".*?")(?=\s|$)/ are significantly
different.

Because, when the (\s|$) at the end matches \s (a space), this space
is no longer included in subsequent matches - as if that part of
string “disappeared” - and thus the \s at the beginning can’t match
it. You should use a regex tester for complex regexes (by complex, I
mean almost all), for example http://regexpal.com/. (Try inputting
your data and both of your regexes there.)
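
To make the "disappeared" space concrete, here is a rough sketch that looks at what is left of the string after the first match of each regex. The shortened line is made up for illustration; only the two regexes come from this thread.

```ruby
# Shortened, made-up tail of a log line with two quoted fields in a row.
line = %q{200 100 "http://cnm.edu/" "Mozilla/5.0 (Macintosh)"} + "\n"

consuming_re = /\s(".*?")(\s|$)/
lookahead_re = /\s(".*?")(?=\s|$)/

consuming = line.match(consuming_re)
lookahead = line.match(lookahead_re)

# What remains for the next match attempt after the first hit:
rest_after_consuming = consuming.post_match   # the separating space was consumed
rest_after_lookahead = lookahead.post_match   # the separating space is still there

# Trying each pattern again on its own remainder:
second_consuming = (rest_after_consuming =~ consuming_re)
second_lookahead = (rest_after_lookahead =~ lookahead_re)

p rest_after_consuming, rest_after_lookahead
p second_consuming, second_lookahead
```

With the consuming version the remainder starts directly at the opening quote, so the leading \s has nothing to match and the second attempt returns nil. With the lookahead version the remainder still starts with the space, and the second attempt matches at offset 0.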

I think there’s a similar tool that explicitly uses Ruby’s flavor of
regexp (regexpal uses browser-side JavaScript), but I can’t remember
the URL and AFAIR it sucked.

– Matma R.

2011/8/12 Bartosz Dziewoński [email protected]

– Matma R.

Actually I don’t think that was the case. Here is my gsub block

encapexs.each do |encapex|
  line.gsub!(encapex) do |match|
    x.next!
    stash[x] = $1
    separator + x + separator
  end
end

So, it should have been, and appeared to be, putting the space back into the
line. I really need to modify my regex and grab the preceding space in its
own group so I can clean the last line up some.
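
One possible reading of "grab the preceding space in its own group", as a rough sketch rather than the code actually used here. The FIELDn placeholder and the shortened log line are made up for illustration.

```ruby
# Shortened, made-up log tail that includes the nested-quote user agent format.
line = %q{200 100 "http://cnm.edu/" ""CustomUserAgent"="Mozilla/5.0 (Macintosh)";"}

fields = []
masked = line.gsub(/(\s)(".*?")(?=\s|$)/) do
  fields << Regexp.last_match(2)                 # the quoted field itself
  "#{Regexp.last_match(1)}FIELD#{fields.size}"   # put the captured space back unchanged
end

puts masked
p fields
```

Because the leading whitespace is captured separately, the block can return it as-is instead of guessing which separator was there, and the lookahead leaves the trailing whitespace available for the next field.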



2011/8/12 Bartosz Dziewoński [email protected]

=> ["", "", "", ""]

Note how the first scan returned two results, even though you can
clearly see “aba” appears in the string 4 times. Note how the second
returned 4 matches (even if they’re all empty). Once a character is
matched, regex engine moves forward, discarding everything up to the
end of match, inclusive.

– Matma R.

If that’s the case then I have absolutely no idea why either of my
expressions work.

In my initial testing I wasn’t returning the separator + match_group +
separator pattern from the gsub block and it was skipping the second
encapsulated string when there were two in a row.

At any rate, why does (\s|$) match differently from (?=\s|$)?

The first one, (\s|$), is simply a group, matching either whitespace
or a position (end-of-line). The other one is a lookahead group, which
matches a position (a zero-width string, if you like) if, at this place
in the string, either the next character is whitespace or the position
is also the end of a line. The following characters themselves are not
consumed.

You need to understand that once a character is matched, it’s gone -
this gsub/match/scan/whatever will not match it again in this run.
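
Here is a self-contained approximation of the gsub loop posted earlier, run with both regexes. The stash_fields helper, the single-space separator and the shortened line are assumptions standing in for @job.log_format.separator, gen_encap_matches and the real log data, so treat it as a sketch only.

```ruby
# Hypothetical helper: replaces each match with a generated key and remembers
# the captured field. `trailing` is whatever the block appends after the key.
def stash_fields(line, regex, trailing)
  key = "encapsulatorhere"
  stash = {}
  out = line.gsub(regex) do
    key = key.next                       # "encapsulatorherf", "encapsulatorherg", ...
    stash[key] = Regexp.last_match(1)
    " " + key + trailing
  end
  [out, stash]
end

# Shortened, made-up log tail with two quoted fields in a row.
line = %q{304 - "http://cnm.edu/" "Mozilla/4.0 (compatible)"}

# Consuming version: the block has to put the trailing space back itself.
consumed, _ = stash_fields(line, /\s(".*?")(\s|$)/, " ")
# Lookahead version: the trailing space was never part of the match.
kept, stash = stash_fields(line, /\s(".*?")(?=\s|$)/, "")

puts consumed
puts kept
p stash.keys
```

The consuming run leaves the user agent untouched, even though the block writes a space back into the result, because gsub keeps scanning the original string from the end of the previous match. The lookahead run replaces both fields.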

Have you tried inputting the data, and both regexes, in regexpal and
comparing the results? I think it really clearly shows graphically
what I mean.

– Matma R.

I don’t think I understand; “putting it back in” doesn’t matter here,
nor does using gsub instead of, say, scan.

irb(main):006:0> s = 'ababababa'
=> "ababababa"
irb(main):007:0> s.gsub(/aba/, 'aca')
=> "acabacaba"
irb(main):008:0> s.scan /aba/
=> ["aba", "aba"]
irb(main):011:0> s.scan /(?=aba)/
=> ["", "", "", ""]

Note how the first scan returned two results, even though you can
clearly see “aba” appears in the string 4 times. Note how the second
returned 4 matches (even if they’re all empty). Once a character is
matched, regex engine moves forward, discarding everything up to the
end of match, inclusive.

– Matma R.
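
As a follow-up sketch on the same string: putting a capture group inside the lookahead recovers the overlapping text itself, not just empty strings. This variation is not from the posts above and is only meant as an illustration.

```ruby
s = "ababababa"

# A normal scan consumes each match, so overlapping occurrences are skipped.
consuming = s.scan(/aba/)

# A capture group inside a lookahead records the text without consuming it.
overlapping = s.scan(/(?=(aba))/).flatten

# Start offsets of every position where the lookahead succeeds.
starts = []
s.scan(/(?=aba)/) { starts << Regexp.last_match.begin(0) }

p consuming
p overlapping
p starts
```

The consuming scan finds 2 matches, while the lookahead version finds all 4, starting at offsets 0, 2, 4 and 6.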