Regex that works on rubular.com but not in my program

ghan · June 25, 2009, 11:26am

Hi,

I have some trouble with a regex.
It works on rubular.com but not in my program.
I’ve used the content of testfile.txt on rubular.com.

The regex finds an IP address, a flag and a username in a TCP packet (an
example: http://rubular.com/regexes/8389)

regex =
/(?:[I][P]\s)((?:[0-9A-Fa-f]{1,4}:){7}[0-9A-Fa-f]{1,4}|(?:\d{1,3}.){3}\d{1,3}).{0,}(?:[:][\s])([P])(?:.{0,})$\s(?:^[E].{5}[@].{9}[Q])(?:.{30,42}.)\W{0,}([\w\d]{1}(?:[\w\d-]+.){1,13}[\wA-Z0-9])/i

filename = 'testfile.txt'
file = File.open(filename).collect
j = file.length
i = 0
while i< j

a = file[i].to_s

b = a.scan(regex)
print b.length

i = i + 1
end

ghan · June 25, 2009, 1:08pm

2009/6/25 Andreas H.

i have some trouble with a regex.
it works on rubular.com but not in my program
ive used the content in testfile.txt on rubular.com

the regex finds a ip-address, a flag and a username in a TCP-packet(an
example: http://rubular.com/regexes/8389)

regex =
/(?:[I][P]\s)((?:[0-9A-Fa-f]{1,4}:){7}[0-9A-Fa-f]{1,4}|(?:\d{1,3}.){3}\d{1,3}).{0,}(?:[:][\s])([P])(?:.{0,})$\s(?:^[E].{5}[@].{9}[Q])(?:.{30,42}.)\W{0,}([\w\d]{1}(?:[\w\d-]+.){1,13}[\wA-Z0-9])/i

Ugh! This is completely unreadable. How about using switch /x and
embedding some comments? Constructing the large regexp from a few
smaller expressions might also help. For example, you could use
/[a-f0-9]/i for hex digits.
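
A minimal sketch of that suggestion, covering only the address part of the pattern. The constant names (HEX, IPV4, IPV6, ADDRESS) are made up for illustration, the dots in the IPv4 piece are assumed to be meant as literal dots, and the sample line is the first line of the packet dump quoted elsewhere in this thread:

```ruby
# Small building blocks; all names here are illustrative, not from the original program.
HEX  = /[0-9a-f]/i
IPV6 = /(?:#{HEX}{1,4}:){7}#{HEX}{1,4}/   # eight colon-separated hex groups
IPV4 = /(?:\d{1,3}\.){3}\d{1,3}/          # dotted quad, dots escaped

# With the x switch, whitespace in the pattern is ignored and comments are allowed.
ADDRESS = /
  IP\s               # the literal IP marker and one whitespace character
  (#{IPV6}|#{IPV4})  # capture an IPv6 or IPv4 literal
/x

line = "12:23:59.378678 IP 85.225.108.54.54707 > 81.227.132.223.6112: P"
puts ADDRESS.match(line)[1]   # prints 85.225.108.54
```

Each piece can be tried on its own in irb before it is combined with the others.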

Few notes on glancing over this

[\wA-Z0-9] → \w
\w already includes letters, digits and the underscore

[\s] → \s
[P] → P
etc.

[\w\d]{1} → \w

filename = 'testfile.txt'
file = File.open(filename).collect

Not closing the file descriptor properly…

j = file.length
i = 0
while i< j

a = file[i].to_s

b = a.scan(regex)
print b.length

You are not printing a newline here. Are you maybe missing the print
output?

i = i + 1
end

You can greatly simplify your code to

File.foreach filename do |line|
b = line.scan rx
puts b.length
end

Kind regards

robert

ghan · June 25, 2009, 1:15pm

Andreas H. wrote:

i have some trouble with a regex.
it works on rubular.com but not in my program

What regexp language is rubular.com using? Perl, ruby 1.8, ruby 1.9,
other?

regex =
/(?:[I][P]\s)((?:[0-9A-Fa-f]{1,4}:){7}[0-9A-Fa-f]{1,4}|(?:\d{1,3}.){3}\d{1,3}).{0,}(?:[:][\s])([P])(?:.{0,})$\s(?:^[E].{5}[@].{9}[Q])(?:.{30,42}.)\W{0,}([\w\d]{1}(?:[\w\d-]+.){1,13}[\wA-Z0-9])/i

This regexp must be machine-written, since it’s absolutely horrible.
You’d never write it that way by hand. For example:

(?:[I][P]\s)    => is just the same as =>    IP\s

(?:.{0,})$      => is just the same as =>    .*$

[\wA-Z0-9]      => is just the same as =>    \w
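
One way to convince yourself that these rewrites are safe is to run both forms over a few strings and compare what they match. This is only a sketch: the sample strings are invented, and a handful of samples is a sanity check, not a proof of equivalence.

```ruby
# Invented sample strings, just for a quick comparison.
samples = ["IP 1.2.3.4", "ip\t::1", "no match here", "xIP y"]

# verbose form => simplified form
pairs = {
  /(?:[I][P]\s)/i => /IP\s/i,
  /(?:.{0,})$/    => /.*$/,
  /[\wA-Z0-9]/    => /\w/,
}

# For every pair, both patterns must match the same text (or both fail).
same = pairs.all? do |verbose, short|
  samples.all? { |s| verbose.match(s).to_s == short.match(s).to_s }
end

puts same   # prints true
```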

I suggest you write your regexp by hand, one bit at a time. This is easy
in IRB as you can develop your regexp interactively.

irb(main):008:0> src
=> "12:23:59.378678 IP 85.225.108.54.54707 > 81.227.132.223.6112: P
590518027:590518071(44) ack 2582330461 win
64240\nE…[email protected]…U.l6Q…#2…<]P…,…wakko0…@…"
irb(main):009:0> /IP / =~ src
=> 16
irb(main):010:0> /IP ([0-9a-fA-F.]+)/ =~ src
=> 16
irb(main):011:0> $1
=> "85.225.108.54.54707"
irb(main):012:0> /IP ([0-9a-fA-F.]+)\.\d+/ =~ src
=> 16
irb(main):013:0> $1
=> "85.225.108.54"

This one isn’t quite as sophisticated for IP address matching as the one
rubular gave you, but that isn’t necessary here. If you really want
stricter matching of IPv4 and IPv6 literals, you can add it later.

ghan · June 25, 2009, 4:39pm

It’s the first time my friend and I have worked with regexes. My friend has
rewritten the regex a bit; maybe it makes more sense now:

(?#start: fetch the ip adress after IP)
IP\s((?:\d*.) {3}\d{1,3})
(?#end: fetch ip)

(?#start: flag).*:\s([PSF])(?#end:flag) (?#nothing more interesting on
this line)
.*$(?#end)

(?#start: look for index pattern)
\s^E.{5}@.{9}Q.{30,42}
(?#end: index)

(?#start: get the username which is surrounded by multiple dots, minimum
of 2 in the begining and 0+ after)
..{2,}(\w{1,13}\w).
(?#end: username)

Each of those expressions works individually and together (in rubular),
but when I combine them in my program it prints nothing, not even nil.
So I tried them individually in the program as well, and all of them work
except the index pattern, which prints nothing. If anyone can offer some
insight into why it’s not working, or knows a better way to do this, we’ll be
very happy :)

another thing:
some usernames are really hard to extract from the packets. an example:
G-eX.Dowden(http://rubular.com/regexes/8401)
any suggestions?

File.foreach filename do |line|
b = line.scan rx
puts b.length
end

was a nice solution, thank you:)

ghan · June 26, 2009, 3:41pm

Andreas H. wrote:

(?#start: flag).*:\s([PSF])(?#end:flag) (?#nothing more interesting on
this line)
.*$(?#end)

(?#start: look for index pattern)
\s^E.{5}@.{9}Q.{30,42}
(?#end: index)

You are looking for an end-of-line ($), followed by whitespace (\s),
followed by a start of line (^). This doesn’t look right to me. It might
work sometimes, depending on whether your end-of-line is \n or \r\n
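
A related thing worth checking: a pattern that runs from the end of one line into the start of the next can only match if the program hands it both lines at once. A loop that scans the file one line at a time never does that. The sketch below uses an invented two-line sample and a deliberately tiny pattern; for a real file, File.read(filename) returns the whole content as one string.

```ruby
# Invented two-line sample standing in for the packet dump.
text = "header: P 1\nEabc\n"

# Needs the end of one line, the line break, and an E starting the next line.
two_line = /P.*$\sE/

# Scanning line by line: each string holds one line, so the E is never in reach.
per_line = text.each_line.map { |line| line.scan(two_line).length }

# Scanning the whole text at once: the pattern can cross the newline.
whole = text.scan(two_line).length

p per_line   # prints [0, 0]
p whole      # prints 1
```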

(?#start: get the username which is surrounded by multiple dots, minimum
of 2 in the begining and 0+ after)
..{2,}(\w{1,13}\w).

That one makes little sense.

[\w] is the same as \w

(?:\w+.) means one or more word characters followed by any character;
this is then repeated between 1 and 13 times
\w must be followed by a word character

\.* this is superfluous, since it matches 0 or more dots,
it would therefore match regardless of what is next

Each of those expressions works individually and together(in rubular)

Don’t test them in rubular. Test them in irb or in ruby.

another thing:
some usernames are really hard to extract from the packets. an example:
G-eX.Dowden(http://rubular.com/regexes/8401)
any suggestions?

You’re using the wrong way to view the packets in the first place.

Using a ruby interface to libpcap would be the safest way - I think I
saw one, but I’ve never used it.

Otherwise, look at tcpdump -X for a proper hex packet dump.

Brian.

ghan · June 26, 2009, 3:49pm

Note also that ^ and $ don’t consume characters, and . doesn’t match
newlines without the /m flag.

irb(main):009:0> "abc\ndef" =~ /^a.*^d/
=> nil
irb(main):010:0> "abc\ndef" =~ /^a.*^d/m
=> 0
irb(main):011:0> "abc\ndef" =~ /^a.*\r?\n^d/
=> 0

re = %r{
(?#start: fetch the ip adress after IP)
IP\s((?:\d+.){3}\d{1,3})
(?#end: fetch ip)

(?#start: flag).*:\s([PSF])(?#end:flag)

(?#nothing more interesting on this line)
.*

(?#start: look for index pattern)
^E.{5}@.{8}Q.{30,}
(?#end: index)

(?#start: get the username which is surrounded by multiple dots, minimum
of 2 in the begining and 0+ after)
.{2,}(\w+)
(?#end: username)
}xm

src = "12:23:59.378678 IP 85.225.108.54.54707 > 81.227.132.223.6112: P
590518027:590518071(44) ack 2582330461 win
64240\nE…[email protected]…U.l6Q…#2…<]P…,…wakko0…@…"

p re =~ src
p $~.to_a

ghan · June 26, 2009, 3:43pm

Brian C. wrote:

Don’t test them in rubular. Test them in irb or in ruby.

In a ruby file, you can comment out bits of them until you make it work,
e.g.

re = %r{
(?#start: fetch the ip adress after IP)
IP\s((?:\d+.){3}\d{1,3})
(?#end: fetch ip)
}x

#(?#start: flag).*:\s([PSF])(?#end:flag)
#(?#nothing more interesting on this line)
#.*

#(?#start: look for index pattern)
#^E.{5}@.{9}Q.{30,}
#(?#end: index)

#(?#start: get the username which is surrounded by multiple dots,
#minimum of 2 in the beginning and 0+ after)
#.{2,}(\w+)
#(?#end: username)
#}x

src = "12:23:59.378678 IP 85.225.108.54.54707 > 81.227.132.223.6112: P
590518027:590518071(44) ack 2582330461 win
64240\nE…[email protected]…U.l6Q…#2…<]P…,…wakko0…@…"

p re =~ src
p $~.to_a

Then you move the }x end of the regular expression and start
uncommenting further bits until it starts to fail again, then you know
where the problem is.
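
The same idea can be automated, roughly like this: keep the pieces in an array and try ever-longer prefixes until one stops matching. Everything in the sketch is hypothetical (the sample string, the pieces, and the deliberately wrong third piece); it only illustrates the technique.

```ruby
# Hypothetical sample; the second line starts with E.
src = "x IP 1.2.3.4.80 > 5.6.7.8.90: P 1:2(3)\nE....."

# Pattern pieces as single-quoted strings so the backslashes reach Regexp.new intact.
pieces = [
  'IP\s((?:\d+\.){3}\d{1,3})',  # address
  '.*:\s([PSF])',               # flag
  '.*\n^Z',                     # deliberately wrong: expects Z where the sample has E
]

# Join the first n pieces and test them; the first n that fails points at the bad piece.
first_bad = (1..pieces.length).find do |n|
  Regexp.new(pieces.first(n).join).match(src).nil?
end

p first_bad   # prints 3
```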

ghan · June 26, 2009, 8:56pm

On Thu, Jun 25, 2009 at 10:39 AM, Andreas H. wrote:

its the first time me or my friend has worked with regex, my friend have
rewritten the regex a bit, maybe it makes more sense now …

another thing:
some usernames are really hard to extract from the packets. an example:
G-eX.Dowden

cat /tmp/z
p(("12:23:59.378678 IP 85.225.108.54.54707 > 81.227.132.223.6112: P " +
"5 90518027:590518071(44) ack 2582330461 win 64240\nE…[email protected]…U." +
"l6Q …#2…<]P…,…wakko0…@." +
"…\n" +
"12:23:59.378678 IP 85.225.108.55.54707 > 81.227.132.223.6112: P " +
"5 90518027:590518071(44) ack 2582330461 win 64240\nE…[email protected]…U." +
"l6Q …#2…<]P…,…wa-kk.o0…" +
"@…"
).scan(
%r{
# capture the address after "IP"
IP\s((?:\d{1,3}\.){3}\d{1,3})\.

  .+?  # skip (non-greedy)

  # capture the flag
  :\s([PSF])\s\d

  .+?                # skip (non-greedy)
  ^E.{5}@.{8}Q.{30}  # skip the index pattern
  .+?                # skip (non-greedy)

  # capture the username surrounded by dots: 2+ before, 0+ after
  \.{2,}(\w[-\w\.]+\w)\.?
}mx  # m: "make dot match newlines"

)
)

ruby /tmp/z
[["85.225.108.54", "P", "wakko0"], ["85.225.108.55", "P", "wa-kk.o0"]]
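
Since scan returns one array of captures per match, the result is easy to take apart afterwards. A small sketch, using the result above as literal data; the variable names and the report format are only illustrative:

```ruby
# Result copied from the output above: one [ip, flag, user] row per match.
matches = [["85.225.108.54", "P", "wakko0"], ["85.225.108.55", "P", "wa-kk.o0"]]

# Block parameters destructure each row into its three captures.
report = matches.map { |ip, flag, user| "#{user} from #{ip} (flag #{flag})" }

puts report
```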