speed_shoot

Help, as usual, is obtained this way:

speed_shoot --help

It prints:

usage: speed_shoot [-h] [-c FILE] [-g GEOIP] [-q SILENT] [-cs CACHE_SIZE]
                   [-d DIAGNOSE [DIAGNOSE ...]] [-i INCLUDE] [--off OFF]
                   [-x EXCLUDE] [-f OUTPUT_FORMAT] [-lf LOG_FORMAT]
                   [-lp LOG_PATTERN] [-lpn LOG_PATTERN_NAME]
                   [-dp DATE_PATTERN] [-o OUTPUT_FILE]
                   ...

Utility for parsing logs in the apache/nginx combined log format
and output a json of various aggregatted metrics of frequentation :
     * by Geolocation (quite fuzzy but still);
     * by user agent;
     * by hour;
     * by day;
     * by browser;
     * by status code
     * of url by ip;
     * by ip;
     * by url;
     * and bandwidth by ip;

Example :
=========

from stdin (useful for using zcat)
**********************************
zcat /var/log/apache.log.1.gz | parse_log.py  > dat1.json

excluding IPs 192.168/16 and user agent containing Mozilla
**********************************************************
parse_log -o dat2.json -x '{ "ip" : "^192.168", "agent": "Mozill" }'  /var/log/apache*.log 

Since archery is cool here is a tip for aggregating data
>>> from archery.barrack import bowyer
>>> from archery.bow import mdict
>>> from json import load, dumps
>>> dumps(
        bowyer(mdict,load(file("dat1.json"))) + 
        bowyer(mdict,load(file("dat2.json")))
    )

Hence a usefull trick to merge your old stats with your new one
        

positional arguments:
  files

options:
  -h, --help            show this help message and exit
  -c FILE, --config FILE
                        specify a config file in json format for the command
                        line arguments any command line arguments will disable
                        values in the config
  -g GEOIP, --geoip GEOIP
                        specify a path to a geoIP directory with geoIP.dat and
                        geoIPv6.date default : ~/.yahi/
  -q SILENT, --silent SILENT
                        quietly discard errors
  -cs CACHE_SIZE, --cache-size CACHE_SIZE
                        in conjonction with cp=fixed chooses dict size
  -d DIAGNOSE [DIAGNOSE ...], --diagnose DIAGNOSE [DIAGNOSE ...]
                        diagnose **rejected** : will print on STDERR rejected
                        parsed line, **match** : will print on stderr data
                        filtered out
  -i INCLUDE, --include INCLUDE
                        include from extracted data with a json (string or
                        filename) in the form { "field" : "pattern" }
  --off OFF             turn off plugins : geo_ip to skip geoip, user_agent to
                        turn httpagentparser off
  -x EXCLUDE, --exclude EXCLUDE
                        exclude from extracted data with a json (string or
                        filename) in the form { "field" : "pattern" }
  -f OUTPUT_FORMAT, --output-format OUTPUT_FORMAT
                        decide if output is in a specified formater amongst :
                        csv, json
  -lf LOG_FORMAT, --log-format LOG_FORMAT
                        log format amongst apache_log_combined, lighttpd
  -lp LOG_PATTERN, --log-pattern LOG_PATTERN
                        add a custom named regexp for parsing log lines
  -lpn LOG_PATTERN_NAME, --log-pattern-name LOG_PATTERN_NAME
                        the name with witch you want to register the pattern
  -dp DATE_PATTERN, --date-pattern DATE_PATTERN
                        add a custom date format, usefull if and only if using
                        a custom log_pattern and date pattern differs from
                        apache.
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        output file
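
The help text above suggests merging old and new stats with archery (its snippet uses file(), which only exists in Python 2; open() is the Python 3 equivalent). As a library-free illustration of the same idea, here is a minimal sketch that adds two nested dicts of counters. The name merge_stats and the sample values are assumptions made for this example; they are not part of speed_shoot or archery.

```python
import json


def merge_stats(left, right):
    """Recursively add two nested dicts of counters (hypothetical helper)."""
    merged = dict(left)
    for key, value in right.items():
        if key not in merged:
            # Key only present on the right side: copy it over.
            merged[key] = value
        elif isinstance(value, dict) and isinstance(merged[key], dict):
            # Both sides are sub-dicts: merge them level by level.
            merged[key] = merge_stats(merged[key], value)
        else:
            # Both sides are plain counters: add them.
            merged[key] = merged[key] + value
    return merged


# Made-up sample data shaped like the speed_shoot JSON output.
old = {"by_ip": {"192.168.0.35": 7}, "total_line": 11}
new = {"by_ip": {"192.168.0.35": 2, "74.125.18.162": 1}, "total_line": 3}

print(json.dumps(merge_stats(old, new), sort_keys=True))
```

In practice the two dicts would come from json.load() on files such as dat1.json and dat2.json.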

Usage

Simplest usage is:

speed_shoot -g /usr/local/data/geoIP /var/www/apache/access*log

It returns JSON of the form:

{
    "by_date": {
        "2012-5-3": 11
    },
    "total_line": 11,
    "ip_by_url": {
        "/favicon.ico": {
            "192.168.0.254": 2,
            "192.168.0.35": 2
        },
        "/": {
            "74.125.18.162": 1,
            "192.168.0.254": 1,
            "192.168.0.35": 5
        }
    },
    "by_status": {
        "200": 7,
        "404": 4
    },
    "by_dist": {
        "unknown": 11
    },
    "bytes_by_ip": {
        "74.125.18.162": 151,
        "192.168.0.254": 489,
        "192.168.0.35": 1093
    },
    "by_url": {
        "/favicon.ico": 4,
        "/": 7
    },
    "by_os": {
        "unknown": 11
    },
    "week_browser": {
        "3": {
            "unknown": 11
        }
    },
    "by_referer": {
        "-": 11
    },
    "by_browser": {
        "unknown": 11
    },
    "by_ip": {
        "74.125.18.162": 1,
        "192.168.0.254": 3,
        "192.168.0.35": 7
    },
    "by_agent": {
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0,gzip(gfe) (via translate.google.com)": 1,
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0": 10
    },
    "by_hour": {
        "9": 3,
        "10": 4,
        "11": 1,
        "12": 3
    },
    "by_country": {
        "": 10,
        "US": 1
    }
}
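
Since the output is plain JSON, it is easy to post-process. The following is a minimal sketch, assuming the structure shown above, that lists the busiest entries of one section. The helper name top is hypothetical, and the sample string only reuses the by_ip values from the output above.

```python
import json

# A fragment of the JSON shown above, inlined so the example is self-contained.
sample = '{"by_ip": {"74.125.18.162": 1, "192.168.0.254": 3, "192.168.0.35": 7}}'


def top(stats, section, n=3):
    """Return the n (key, count) pairs with the highest count in a section."""
    counts = stats.get(section, {})
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


stats = json.loads(sample)
for ip, hits in top(stats, "by_ip", 2):
    print(ip, hits)
```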

If you use:

speed_shoot -f csv -g /usr/local/data/geoIP /var/www/apache/access*log

Your result is:

by_date,2012-5-3,11
total_line,11
ip_by_url,/favicon.ico,192.168.0.254,2
ip_by_url,/favicon.ico,192.168.0.35,2
ip_by_url,/,74.125.18.162,1
ip_by_url,/,192.168.0.254,1
ip_by_url,/,192.168.0.35,5
by_status,200,7
by_status,404,4
by_dist,unknown,11
bytes_by_ip,74.125.18.162,151
bytes_by_ip,192.168.0.254,489
bytes_by_ip,192.168.0.35,1093
by_url,/favicon.ico,4
by_url,/,7
by_os,unknown,11
week_browser,3,unknown,11
by_referer,-,11
by_browser,unknown,11
by_ip,74.125.18.162,1
by_ip,192.168.0.254,3
by_ip,192.168.0.35,7
by_agent,"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0,gzip(gfe) (via translate.google.com)",1
by_agent,Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0,10
by_hour,9,3
by_hour,10,4
by_hour,11,1
by_hour,12,3
by_country,,10
by_country,US,1
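
Each CSV row appears to be the path of keys from the JSON output followed by the counter, so rows have a varying number of columns. Here is a minimal sketch, under that assumption, that rebuilds the nested structure with the standard csv module (which also copes with the quoted user agent containing a comma). The function name csv_to_nested is hypothetical, and all values are assumed to be integers.

```python
import csv
import io


def csv_to_nested(lines):
    """Rebuild a nested dict from rows of the form key[,key...],value."""
    result = {}
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        *path, leaf, value = row
        node = result
        for key in path:
            # Walk down, creating intermediate dicts as needed.
            node = node.setdefault(key, {})
        node[leaf] = int(value)
    return result


# A few rows copied from the CSV output above.
sample = """by_date,2012-5-3,11
total_line,11
ip_by_url,/,192.168.0.35,5
by_country,,10
by_country,US,1
"""

nested = csv_to_nested(io.StringIO(sample))
print(nested["ip_by_url"]["/"]["192.168.0.35"])
```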

A commented jumbo command line example

The following command line:

speed_shoot -g data/GeoIP.dat -lf lighttpd \
    -x '{ "datetime" : "^01/May", "uri" : "(.*munin|.*(png|jpg))$"}' \
    -d rejected -d match -i '{ "_country" : "(DE|GB)"  }' \
    *log*  yahi/test/biggersample.log

does the following:

  • locates the geoIP file (-g) at data/GeoIP.dat;

  • sets the log format (-lf) to lighttpd;

  • excludes (-x) any line matching either
    • a URI containing munin or ending in jpg or png,

    • the first of May;

  • includes (-i) only lines matching
    • an IP that has been geolocated to DE or GB,

    • any non-authenticated user;

  • diagnoses (-d), that is prints on stderr, any line that does not match the log format regexp (rejected) and any line filtered out by -x or -i (match)

for all the given log files.

Using a config file

Well, not impressive:

speed_shoot -c config.json

If an option is specified in the config file, it overrides the value set on the command line.

Here is a sample of a config file:

{
    "exclude" : {
        "uri"  : ".*munin.*",
        "referer" : ".*(munin|php).*"
    },
    "include" : { "datetime" : "^04" },
    "silent" : "False",
    "files" : [ "yahi/test/biggersample.log" ]
}

Easter eggs or bad idea

The options -x, -i and -c each accept either a string or a filename, which makes debugging badly formatted JSON a pain.
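
One way to ease that pain is to check the argument before handing it to speed_shoot. The following is a minimal sketch, not part of speed_shoot: the helper load_json_arg is hypothetical, and it assumes the value is a filename when such a file exists and JSON text otherwise. On failure it reports where the JSON is broken.

```python
import json
import os


def load_json_arg(value):
    """Parse a 'string or filename' argument and report JSON errors precisely."""
    source = "string"
    text = value
    if os.path.isfile(value):
        # Assumption: an existing file takes precedence over inline JSON.
        source = "file " + value
        with open(value) as handle:
            text = handle.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(
            "bad JSON in %s at line %d column %d: %s"
            % (source, err.lineno, err.colno, err.msg)
        )


print(load_json_arg('{ "ip" : "^192.168", "agent": "Mozill" }'))

try:
    load_json_arg('{ "ip" : "^192.168", }')  # trailing comma is invalid JSON
except ValueError as error:
    print(error)
```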