Forum: Ruby Searching for a very fast string parser

|MKSM| (Guest)
on 2006-03-08 19:31
(Received via mailing list)
Hello,

I want to parse a log file containing many lines in the same format.
My log files are about 50 MB each (350k lines), so I need something
quite fast. The current (and fastest) solution I have come up with uses
StringScanner.

I save what I get into variables and then pass them all into a Struct
I created. Each new struct is then appended to an Array that holds all
the structs.
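
For readers following along, the Struct-and-Array bookkeeping described above might look roughly like this. It is only a sketch: the name LogEntry and the field names are assumptions taken from the variable names in the test code, not the actual Struct.

```ruby
# Hypothetical record type; the field names are guesses based on the
# variables used in the test code, not the poster's real Struct.
LogEntry = Struct.new(:time, :rule_no, :stat, :interface,
                      :out_ip, :out_port, :in_ip, :in_port, :proto)

entries = []

# In the real loop these values would come from the StringScanner calls.
entries << LogEntry.new("1140908573.050732", "19", "pass", "sis1:",
                        "80.202.226.15", "50000",
                        "192.168.0.6", "52525", "UDP")

puts entries.first.proto
puts entries.size
```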


Here's my test code:

require 'strscan'

a = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: " \
    "80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"

s = StringScanner.new(a)
time = s.scan(/\d+\.\d+/)
s.pos += 23
rule_no = s.scan(/\d+/)
s.skip(/[\d\D]*?\s/)
stat = s.scan(/\w+/)
s.skip(/.*on\s/)
interface = s.scan(/\w+\:/)
s.skip(/\D+?\s/)
out_ip = s.scan(/(\d+\.){3}\d{0,3}/)
s.pos += 1
out_port = s.scan(/\d+/)
s.skip(/\D+/)
in_ip = s.scan(/(\d+\.){3}\d{0,3}/)
s.pos += 1
in_port = s.scan(/\d+/)
s.pos += 2
proto = s.scan(/\w+/)
proto
s.pos += 1

Run in a loop 10,000 times, this takes about 0.6 seconds to
complete. Is there a better/faster way of doing it?

Regards,

Ricardo.
unknown (Guest)
on 2006-03-08 20:05
(Received via mailing list)
can you put a demo log file on the web somewhere?

-a
|MKSM| (Guest)
on 2006-03-08 20:11
(Received via mailing list)
I'm sorry, the log file I have comes from a live firewall. I'd rather
not release it.

The log consists entirely of lines like the one I used in the
code.

Regards,

Ricardo
James G. (Guest)
on 2006-03-08 21:29
(Received via mailing list)
On Mar 8, 2006, at 12:09 PM, |MKSM| wrote:

> I'm sorry, the log file i have comes from a live firewall. I'd rather
> not release it.

Would randomizing the data render it safe?

 >> "ABC 123".gsub(/[a-zA-Z0-9]/i) { |chr| ("0".."9").include?(chr) ? rand(10) : ("A".."Z").to_a[rand(26)] }
=> "HNQ 265"
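
Applied to a whole log line, the same idea could be wrapped in a small helper. This is a sketch only; the method name scramble is made up for illustration.

```ruby
# Replace every digit with a random digit and every letter with a random
# uppercase letter, leaving punctuation and spacing untouched.
def scramble(line)
  line.gsub(/[a-zA-Z0-9]/) do |chr|
    chr =~ /\d/ ? rand(10).to_s : ("A".."Z").to_a[rand(26)]
  end
end

sample = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: " \
         "80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"
puts scramble(sample)
```

Note that this also scrambles keywords such as "rule" and "on", so a parser that keys on those literal words would probably want a variant that only touches the digits.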

James Edward G. II
Caleb C. (Guest)
on 2006-03-08 22:48
(Received via mailing list)
OK, so first off, your sample implementation seemed to have several
bugs in it. After fixing those, I thought you might be able to save
some time by glomming all the regexps together, obviating the need
for StringScanner altogether. However, that doesn't seem to have
actually made any difference... if anything it seems to have been a
little slower. I don't know why. And the great big long Regexp is
considerably harder to read.

I tried to optimize some of your patterns to eliminate backtracking,
use noncapturing parens, etc. That also didn't seem to help much. So,
it looks like (sans bugs) your code is pretty much optimal. I'm
including my version below in case it might be useful anyway.

Some notes:

I got rid of silly stuff like \D+? after the interface name, since it
doesn't seem necessary. (It's not needed for the one line of data you
provided, anyway.)

The IP addresses will now both end with ".". So, chop it off if that's
a problem.

Looks like you inverted the source and destination address/port
fields? I didn't fix that...
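
To illustrate the trailing "." note above: one way to drop it is String#chomp with an argument, which removes the dot only if it is actually there. A small sketch, with a made-up helper name:

```ruby
require 'strscan'

# Hypothetical helper: split "a.b.c.d.port" into [ip, port], dropping the
# trailing "." that the (?:\d+\.){4} pattern leaves on the address.
def split_ip_port(text)
  s = StringScanner.new(text)
  ip_with_dot = s.scan(/(?:\d+\.){4}/)   # e.g. "80.202.226.15."
  return nil unless ip_with_dot
  port = s.scan(/\d+/)                   # e.g. "50000"
  [ip_with_dot.chomp("."), port]
end

p split_ip_port("80.202.226.15.50000")
```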


require 'strscan'

a = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: " \
    "80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"

10000.times do
  s = StringScanner.new(a)
  time = s.scan(/\d+\.\d+/)
  s.pos = 23                         # jump to the rule number
  rule_no = s.scan(/\d+/)
  s.skip(/\S+\s/)                    # skip "/0(match): "
  stat = s.scan(/\w+/)
  s.skip(/\s\S+\son\s/)              # skip " unkn(255) on "
  interface = s.scan(/\w+\:/)
  s.skip(/\s/)
  out_ip = s.scan(/(?:\d+\.){4}/)    # includes the trailing "."
  out_port = s.scan(/\d+/)
  s.skip(/ > /)
  in_ip = s.scan(/(?:\d+\.){4}/)     # includes the trailing "."
  in_port = s.scan(/\d+/)
  s.pos += 2                         # skip ": "
  proto = s.scan(/\w+/)
end
Robert K. (Guest)
on 2006-03-09 16:36
(Received via mailing list)
Caleb C. wrote:
> OK, so first off, your sample implementation seemed to have several
> bugs in it. After fixing those, I thought you might be able to save
> some time by glomming all the regexp's together, obviating the need
> for StringScanner altogether. However, that doesn't seem to have
> actually made any difference...

I don't buy this.  A single plain RX is usually faster than a more
complex solution.  Even on a machine with constant high load (I had no
other machine available at the moment) I get a significant difference
(north of 6%):

>> 15:22:14 [source]: /c/temp/ruby/logscan.rb
Rehearsal ------------------------------------------------
strscan        5.969000   0.000000   5.969000 (  6.095000)
rx             5.828000   0.000000   5.828000 (  5.951000)
rx with conv   5.860000   0.000000   5.860000 (  5.922000)
-------------------------------------- total: 17.657000sec

                   user     system      total        real
strscan        5.953000   0.000000   5.953000 (  6.043000)
rx             5.547000   0.000000   5.547000 (  5.747000)
rx with conv   5.765000   0.000000   5.765000 (  5.924000)

(script attached)
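
The attachment is not part of this archive. A comparison along these lines can be sketched with Benchmark.bmbm, which prints the Rehearsal block seen above. Everything here (method names, the single regexp, the iteration count) is an assumption for illustration and not the original logscan.rb.

```ruby
require 'benchmark'
require 'strscan'

LINE = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: " \
       "80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"

# One-shot pattern; the groups mirror the fields of the StringScanner version.
LINE_RX = /\A(\d+\.\d+) rule (\d+)\S* (\w+) \S+ on (\w+:) ((?:\d+\.){4})(\d+) > ((?:\d+\.){4})(\d+): (\w+)/

# Step-by-step scan, following the StringScanner code posted earlier.
def parse_strscan(line)
  s = StringScanner.new(line)
  time = s.scan(/\d+\.\d+/)
  s.pos = 23                      # fixed offset of the rule number
  rule_no = s.scan(/\d+/)
  s.skip(/\S+\s/)
  stat = s.scan(/\w+/)
  s.skip(/\s\S+\son\s/)
  interface = s.scan(/\w+:/)
  s.skip(/\s/)
  out_ip = s.scan(/(?:\d+\.){4}/)
  out_port = s.scan(/\d+/)
  s.skip(/ > /)
  in_ip = s.scan(/(?:\d+\.){4}/)
  in_port = s.scan(/\d+/)
  s.pos += 2
  proto = s.scan(/\w+/)
  [time, rule_no, stat, interface, out_ip, out_port, in_ip, in_port, proto]
end

# Single regexp match; returns the captured fields or nil.
def parse_rx(line)
  m = LINE_RX.match(line)
  m && m.captures
end

N = 10_000
Benchmark.bmbm do |bm|
  bm.report("strscan") { N.times { parse_strscan(LINE) } }
  bm.report("rx")      { N.times { parse_rx(LINE) } }
end
```

Timings will differ by machine and Ruby version, so treat any single run with caution.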

> if anything it seems to have been a
> little slower. I don't know why. And the great big long Regexp is
> considerably harder to read.

Using %r{} and /x makes a great deal of difference in readability (see script).
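
As a rough idea of what that can look like (the attached script is not in this archive, so this is a sketch, not the original): with /x, whitespace in the pattern is ignored and comments are allowed. The named groups below are an assumption and need Ruby 1.9 or later.

```ruby
# Hypothetical pattern; group names follow the variables used in the thread.
LOG_LINE = %r{
  \A
  (?<time>      \d+\.\d+ )        \s rule \s
  (?<rule_no>   \d+ ) \S*         \s        # "19", then skip "/0(match):"
  (?<stat>      \w+ )             \s
  \S+ \s on \s                              # skip "unkn(255) on"
  (?<interface> \w+ ) :           \s
  (?<out_ip>    \d+(?:\.\d+){3} ) \.
  (?<out_port>  \d+ )             \s > \s
  (?<in_ip>     \d+(?:\.\d+){3} ) \.
  (?<in_port>   \d+ ) :           \s
  (?<proto>     \w+ )
}x

# Returns a Hash of field name => text, or nil if the line does not match.
def parse_line(line)
  m = LOG_LINE.match(line)
  m && m.named_captures
end

SAMPLE = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: " \
         "80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"
p parse_line(SAMPLE)
```

Unlike the StringScanner version, this variant leaves the trailing "." off the addresses and the ":" off the interface name.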

Kind regards

    robert
Robert K. (Guest)
on 2006-03-09 16:52
(Received via mailing list)
I redid the test on an idle Linux machine with Ruby 1.8.1 and the
StringScanner is actually faster:

[root@fox tmp]# ./logscan.rb
Rehearsal ------------------------------------------------
strscan        2.990000   0.000000   2.990000 (  2.991096)
rx             4.870000   0.000000   4.870000 (  4.868536)
rx with        4.280000   0.010000   4.290000 (  4.284334)
rx with conv   5.240000   0.000000   5.240000 (  5.459702)
-------------------------------------- total: 17.390000sec

                   user     system      total        real
strscan        3.000000   0.000000   3.000000 (  2.999783)
rx             4.870000   0.000000   4.870000 (  4.899242)
rx with        4.300000   0.010000   4.310000 (  4.869835)
rx with conv   5.240000   0.000000   5.240000 (  5.442722)

Apparently I have to correct myself...

    robert