On Fri, Aug 12, 2011 at 9:27 AM, Gavin K. [email protected] wrote:
Sure,
What I’m trying to do is parse our Apache log files. A fairly standard
sample line is as follows:
10.132.18.15 - - [21/Apr/2010:12:22:36 -0600] "GET /images/2010_front_sprite.jpg HTTP/1.1" 304 - "http://cnm.edu/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648; InfoPath.1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
I’m pulling out encapsulated data, splitting the line on the separator,
then putting the encapsulated data back. I was using /".*?"/ to grab the
quoted strings, but I discovered lines with the following format in the
log file:
12.172.30.9 - - [21/Apr/2010:13:21:04 -0600] "GET /clickheat/click.php?s=&g=index&x=130&y=432&w=1009&b=safari&c=1&random=Wed%20Apr%2021%202010%2013:21:04%20GMT-0600%20(MDT) HTTP/1.1" 200 100 "http://cnm.edu/" ""CustomUserAgent"="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6; en-us) AppleWebKit/531.21.8 (KHTML, like Gecko) Version/4.0.4 Safari/531.21.10 FOH:R177";"
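To make the failure concrete, here is a small sketch that runs the simple lazy pattern over the last two fields of that entry. The variable names are only for illustration, and the inner quotes are written exactly as they appear in the entry above.

```ruby
# Last two fields of the entry above (referer and user agent).
tail = '"http://cnm.edu/" ""CustomUserAgent"="Mozilla/5.0 ' \
       '(Macintosh; U; Intel Mac OS X 10_6; en-us) AppleWebKit/531.21.8 ' \
       '(KHTML, like Gecko) Version/4.0.4 Safari/531.21.10 FOH:R177";"'

# The lazy pattern stops at the very next quote, so the embedded quotes
# chop the user-agent field into small fragments.
pieces = tail.scan(/".*?"/)

pieces.each { |piece| puts piece }
# "http://cnm.edu/"
# ""
# "="
# ";"
```

The referer comes out whole, but most of the user-agent text never lands inside any match.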
This broke my simple /".*?"/ expression. So I decided to include the
separator in the regex and tried the following expression:
/\s(".*?")(\s|$)/
I am using gsub to perform the replacement action.
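Roughly, the pull-out, split, put-back approach looks like the sketch below. The `__Qn__` token format and the variable names are made up for this illustration and are not taken from my actual script. The sample line is the first entry with the user-agent field left off for brevity, and the bracketed timestamp is not treated specially here, so it splits into two fields.

```ruby
# Truncated copy of the first sample entry (user-agent field left off).
line = '10.132.18.15 - - [21/Apr/2010:12:22:36 -0600] ' \
       '"GET /images/2010_front_sprite.jpg HTTP/1.1" 304 - "http://cnm.edu/"'

stash = [] # quoted strings pulled out of the line, in order

# Swap each quoted string for a space-free token. The pattern also consumes
# the separators around the quotes, so the block writes them back.
masked = line.gsub(/\s(".*?")(\s|$)/) do
  stash << $1
  " __Q#{stash.size - 1}__#{$2}"
end

# Split on the separator, then put the original quoted text back.
fields = masked.split(' ').map do |field|
  field =~ /\A__Q(\d+)__\z/ ? stash[$1.to_i] : field
end

puts masked
fields.each { |field| puts field }
```

On this truncated line the two quoted strings are not adjacent, so both are picked up and restored as single fields.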
In my gsub block this would get all the quoted strings except for the
user agent string, which ends the entry. If I tried matching that regexp
against a single quoted string with a preceding space and a trailing \n,
it would work. It just didn’t work inside my gsub block.
For example:
10.132.18.15 - - [21/Apr/2010:12:22:36 -0600] "GET /images/2010_front_sprite.jpg HTTP/1.1" 304 - "http://cnm.edu/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648; InfoPath.1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
would come out as
10.132.18.15 - - encapsulatorherf encapsulatorherg 304 - encapsulatorherh "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648; InfoPath.1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
What I wanted and was expecting is
10.132.18.15 - - encapsulatorherf encapsulatorherg 304 - encapsulatorherh encapsulatori
As soon as I changed my regexp to /\s(".*?")(?=\s|$)/ it worked.
I’m not sure why /\s(".*?")(\s|$)/ and /\s(".*?")(?=\s|$)/ behave so
differently.
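A stripped-down comparison of the two patterns, using the tail of the first sample entry with the user-agent string shortened for readability (the shortened string and the variable names are only for this illustration). The plain group consumes the space after the referer, so the next search starts at the opening quote of the user agent and finds no leading \s to match. The lookahead only checks for the space without consuming it, so that space is still available to the next match.

```ruby
# Tail of the first sample entry, with a shortened user-agent string.
tail = ' 304 - "http://cnm.edu/" "Mozilla/4.0 (compatible; MSIE 6.0)"'

consuming = /\s(".*?")(\s|$)/   # trailing separator becomes part of the match
lookahead = /\s(".*?")(?=\s|$)/ # trailing separator is checked but not consumed

# scan returns one array of captures per match; keep the quoted string only.
with_group     = tail.scan(consuming).map(&:first)
with_lookahead = tail.scan(lookahead).map(&:first)

puts with_group.inspect
puts with_lookahead.inspect
```

With the plain group only the referer is found; with the lookahead both quoted strings are found. scan and gsub both walk the string looking for non-overlapping matches, which is why a single match against one quoted string works while the second of two adjacent quoted strings gets skipped.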