I’m looking for a near real-time script to parse log files and insert
interesting data into a db.
Does anyone know of an existing script to do this?
John
I believe the only way to do this is to custom-write a script that reads the log files and then issues a reload signal to nginx (something like logrotate does).
If anyone does know of such a script, I’d be interested to learn too, as I do this on my servers using a hacked-together PHP script and logrotate (run from crontab every minute).
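For illustration, a minimal Python sketch of that rotate-and-signal approach (the paths, regex and table schema here are assumptions, not an existing tool; USR1 is nginx’s standard “reopen logs” signal):

    import os
    import re
    import signal
    import sqlite3

    LOG = "/var/log/nginx/access.log"     # assumed log path
    ROTATED = LOG + ".1"
    PID_FILE = "/var/run/nginx.pid"       # assumed pid file path

    # Move the live log aside, then tell the nginx master process to
    # reopen its log files (USR1 is nginx's "reopen logs" signal).
    os.rename(LOG, ROTATED)
    with open(PID_FILE) as f:
        os.kill(int(f.read().strip()), signal.SIGUSR1)

    # Parse the rotated file at leisure and insert the interesting fields.
    line_re = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3})')
    db = sqlite3.connect("/var/lib/logstats.db")
    db.execute("CREATE TABLE IF NOT EXISTS hits "
               "(ip TEXT, ts TEXT, method TEXT, path TEXT, status INTEGER)")
    with open(ROTATED) as f:
        for line in f:
            m = line_re.match(line)
            if m:
                db.execute("INSERT INTO hits VALUES (?, ?, ?, ?, ?)", m.groups())
    db.commit()
    os.remove(ROTATED)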
You can check/try http://www.splunk.com
rr
My app has a request that opens the log file, fseeks to the end, backs up as many bytes as it takes to get to the size the log file was at the last similar request by that user, and runs a regex over the novel part to get interesting metrics before closing the file. Since this happens less than once per minute, I have not done anything fancy to optimize.
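Something like this minimal Python sketch, where the regex and the way the last offset is stored are illustrative assumptions:

    import re

    def read_new_entries(log_path, last_size):
        """Return (new_size, metrics) for bytes appended since last_size."""
        with open(log_path, "rb") as f:
            f.seek(0, 2)              # seek to the end to learn the current size
            size = f.tell()
            if size < last_size:      # file shrank: it was rotated, start over
                last_size = 0
            f.seek(last_size)         # back up to where the last request left off
            novel = f.read().decode("utf-8", errors="replace")
        # Pull interesting metrics (here: status code and bytes sent)
        # out of the novel part only.
        hits = [(int(s), int(b)) for s, b in re.findall(r'" (\d{3}) (\d+)', novel)]
        return size, hits

The caller stores the returned size (per user, as described above) and passes it back on the next request.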
You could use nginx-sflow-module (binary logging over UDP). The “sflowtool” program can convert this back into a continuous feed of Common Log Format entries at the collector.
This way you can receive from multiple servers all sending to the same UDP port. You can also apply random 1-in-N sampling at source as an efficient data-reduction measure if required.
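A rough sketch of the collector side in Python; the -H flag for CLF-style HTTP output is an assumption here, so check which options your sflowtool version supports:

    import subprocess

    # Read the reconstructed log lines from sflowtool's stdout. -p selects
    # the UDP port (6343 is the standard sFlow port); -H (assumed) asks for
    # HTTP samples rendered in common log format.
    proc = subprocess.Popen(["sflowtool", "-p", "6343", "-H"],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        print(line.rstrip())   # parse/insert like any other access-log line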
Neil M.
You could use something like syslog-ng, which can “tail” the log file and run it through a script.
--
Brian A.
An alternative is to tail -F (a.k.a. “--follow=name --retry”) the log file and pipe the output into a script. This allows you to parse the entries as they come in and rotate the log file as often as you want, independently of the parsing script.
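For example, a minimal Python sketch of the parsing end of that pipe (the regex, database path and schema are illustrative assumptions):

    # parse_access.py -- run as:
    #   tail -F /var/log/nginx/access.log | python3 parse_access.py
    import re
    import sqlite3
    import sys

    line_re = re.compile(
        r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
    )

    db = sqlite3.connect("hits.db")
    db.execute("CREATE TABLE IF NOT EXISTS hits "
               "(ip TEXT, ts TEXT, method TEXT, path TEXT, status INTEGER, size TEXT)")

    for line in sys.stdin:          # blocks until tail emits a new entry
        m = line_re.match(line)
        if not m:
            continue
        db.execute("INSERT INTO hits VALUES (?, ?, ?, ?, ?, ?)",
                   (m["ip"], m["ts"], m["method"], m["path"],
                    int(m["status"]), m["size"]))
        db.commit()                 # per-line commits keep the db near real-time

Committing per line keeps the database current at the cost of throughput; batch the commits if volume grows.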
Regards,
Dennis
I cobbled something like this together with open source tools and have been using it on hundreds of servers… please contact me offline if you’d like a copy.
-Harold
You can apply “split” to the output of “tail” to generate files with the same number of lines, which you would then process with the program that extracts the interesting data from them (see the sketch after the excerpt below):
NAME
       split - split a file into pieces

SYNOPSIS
       split [OPTION]… [INPUT [PREFIX]]

DESCRIPTION
       Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, …; default
       size is 1000 lines, and default PREFIX is `x’. With no INPUT, or when
       INPUT is -, read standard input.

       Mandatory arguments to long options are mandatory for short options too.

       -a, --suffix-length=N
              use suffixes of length N (default 2)

       -b, --bytes=SIZE
              put SIZE bytes per output file

       -C, --line-bytes=SIZE
              put at most SIZE bytes of lines per output file

       -d, --numeric-suffixes
              use numeric suffixes instead of alphabetic

       -l, --lines=NUMBER
              put NUMBER lines per output file

       --verbose
              print a diagnostic just before each output file is opened

       --help display this help and exit

       --version
              output version information and exit

       SIZE may have a multiplier suffix: b 512, kB 1000, K 1024, MB
       1000*1000, M 1024*1024, GB 1000*1000*1000, G 1024*1024*1024, and so
       on for T, P, E, Z, Y.
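For instance, with a pipeline like tail -F access.log | split -l 1000 -d - chunk- feeding a directory, a small Python loop could handle the completed chunks (the prefix and the delete-after-processing policy are assumptions):

    import glob
    import os
    import time

    def process(path):
        # Extract the interesting data here (regex match + db insert),
        # exactly as in the tail-based sketch earlier in the thread.
        with open(path) as f:
            for line in f:
                pass

    while True:
        chunks = sorted(glob.glob("chunk-*"))
        # The newest chunk may still be open for writing by split, so skip it.
        for path in chunks[:-1]:
            process(path)
            os.remove(path)
        time.sleep(5)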
On Tue, Aug 2, 2011 at 3:08 AM, Harold Sinclair wrote:
I cobbled something like this together with open source tools and have been using it on hundreds of servers… please contact me offline if you’d like a copy.
Or you can share it on GitHub to make it open to anyone.
sFlow would be great if it were open source and had an easily customizable server (Perl/Python/Bash or PHP).
Not sure what you mean about sFlow needing to be open source? Here are
links to the relevant open-source projects:
http://nginx-sflow-module.googlecode.com
http://host-sflow.sourceforge.net
http://www.inmon.com/technology/sflowTools.php
With a more complete “developer resources” description here:
If you use sflowtool to turn sFlow-HTTP into common-log format at the
collector, that opens up a whole ecosystem of open-source
perl/python/bash/PHP tools for the analysis, such as AWStats.
The sFlow-HTTP feed also sends performance counters every N seconds. I
don’t yet know of an open-source adaptor to feed that into something
like Nagios, Ganglia or Graphite, but I know there are options to do
that with the sFlow-HOST performance counters so it shouldn’t be hard
to add. In fact, Ganglia now has native support for the sFlow-HOST
counters.
Ganglia 3.2.0 Released « Ganglia Monitoring System
This sFlow-HOST (http://host-sflow.sourceforge.net) part is helpful because it provides telemetry on the underlying CPU/memory/disk/network stats in a lightweight and scalable way, and supports zero-config operation (DNS-SD) to make sFlow easier to roll out on a large cluster/farm.
Neil
On Aug 4, 2011, at 12:17 PM, Dennis J. wrote:
The problem is that sFlow currently lacks practical documentation and a library that can be used to develop agents and collectors.
The http://sflow.org website has some overview documentation, links to
developer tools, plus agent and collector source code.
It took me a while to realize why I couldn’t find a collector daemon that I could set up to use with sflowtool or sFlowTrend. These tools are the collectors. I would have expected to find some kind of management daemon akin to the SNMP world.
Open Source collectors include Ganglia, ntop and pmacct, but there are
so many possible applications for this data that no single
collector-package is going to encompass them all. I think there may be
some collectors that load data into relational database tables but that
may not always make sense for what you want to do. Hence the starting
point is usually a C, Perl or Java decoder that unpacks the data for you
as a real-time feed, then you are free to do whatever you want.
sFlow looks really interesting, but it is unnecessarily obscure, and the developer resources could use a facelift and present things in a way that introduces concepts, terms and methodology to newcomers.
Put out a C library with an agent and a collector API, throw the code on GitHub, and you will no doubt see a pickup in interest from developers.
Each agent is very different because it gets embedded in the device or
application that it is monitoring, so the best thing is probably just
to list some of the open-source projects:
http://nginx-sflow-module.googlecode.com
http://mod-sflow.googlecode.com
http://host-sflow.sourceforge.net
http://openvswitch.org
Perhaps there should be a page on sflow.org with this list? Would that
have been helpful for you?
If you have more questions about sFlow in general then it might make
sense to post them to the sFlow mailing list instead:
http://groups.google.com/group/sflow
But getting back to the question of real-time monitoring of nginx
servers, the nginx-sflow-module is a complete sFlow agent that offers
centralized, real-time monitoring of large clusters. sflowtool can turn
the feed into a piped stream of ASCII CLF data, so it represents one
way to avoid all that log-file tailing.
Neil
On 01/08/2011 14:53, John Macleod wrote:
I’m looking for a near real-time script to parse log files and insert interesting data into a db. Does anyone know of an existing script to do this?
I don’t think anyone has said rsyslog yet? It logs directly into a database if you want, optionally passing through some kind of parser first. It has spooling to disk in case the target can’t keep up, and semi-reliable network modes. I heard someone say that Fedora had switched their default syslog to rsyslog (confirmation?), so hopefully it’s not too niche for you… (obviously, I believe it can read directly from a file or pipe…)
Alternatively, ask nginx to log to a local fifo and write your own spooler, as in the sketch below? Beware blocking if you don’t keep up, though…
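A minimal Python sketch of such a spooler, assuming nginx is pointed at the fifo (e.g. access_log /var/log/nginx/access.fifo; both paths are assumptions):

    import os

    FIFO = "/var/log/nginx/access.fifo"      # assumed fifo path
    SPOOL = "/var/spool/nginx-access.spool"  # assumed spool file

    if not os.path.exists(FIFO):
        os.mkfifo(FIFO)

    # nginx blocks on writes once the fifo buffer fills, so drain it eagerly
    # and do nothing slower than appending to a spool file; parse the spool
    # asynchronously.
    while True:
        with open(FIFO) as fifo, open(SPOOL, "a") as spool:
            for line in fifo:    # EOF when nginx closes/reopens its logs
                spool.write(line)
                spool.flush()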
Finally, note that there have been previous patches to nginx to add syslog logging. I personally believe this would be useful for many classes of problem, but I believe Igor’s position is that syslog is too slow to keep up with nginx? (I think I have seen Thrift patches in the past also.)
I personally don’t use nginx to the limit and would love to see syslog logging in standard nginx, even if it limited maximum performance… Perhaps if there is a core of similarly interested users, we could interest (or pay) Igor to consider adding such a feature, with the limitations as a caveat?
Good luck
Ed W
P.S. This link has some suggestions on logging to a fifo and catching the output in syslog-ng:
Re: Remote-logging nginx? (or other non-syslog-enabled stuff) — CentOS