speed_shoot
As usual, help is obtained this way:
speed_shoot --help
It prints:
usage: speed_shoot [-h] [-c FILE] [-g GEOIP] [-q SILENT] [-cs CACHE_SIZE]
[-d DIAGNOSE [DIAGNOSE ...]] [-i INCLUDE] [--off OFF]
[-x EXCLUDE] [-f OUTPUT_FORMAT] [-lf LOG_FORMAT]
[-lp LOG_PATTERN] [-lpn LOG_PATTERN_NAME]
[-dp DATE_PATTERN] [-o OUTPUT_FILE]
...
Utility for parsing logs in the apache/nginx combined log format
and output a json of various aggregatted metrics of frequentation :
* by Geolocation (quite fuzzy but still);
* by user agent;
* by hour;
* by day;
* by browser;
* by status code
* of url by ip;
* by ip;
* by url;
* and bandwidth by ip;
Example :
=========
from stdin (useful for using zcat)
**********************************
zcat /var/log/apache.log.1.gz | parse_log.py > dat1.json
excluding IPs 192.168/16 and user agent containing Mozilla
**********************************************************
parse_log -o dat2.json -x '{ "ip" : "^192.168", "agent": "Mozill" }' /var/log/apache*.log
Since archery is cool here is a tip for aggregating data
>>> from archery.barrack import bowyer
>>> from archery.bow import mdict
>>> from json import load, dumps
>>> dumps(
bowyer(mdict,load(file("dat1.json"))) +
bowyer(mdict,load(file("dat2.json")))
)
Hence a usefull trick to merge your old stats with your new one
positional arguments:
files
options:
-h, --help show this help message and exit
-c FILE, --config FILE
specify a config file in json format for the command
line arguments any command line arguments will disable
values in the config
-g GEOIP, --geoip GEOIP
specify a path to a geoIP directory with geoIP.dat and
geoIPv6.date default : ~/.yahi/
-q SILENT, --silent SILENT
quietly discard errors
-cs CACHE_SIZE, --cache-size CACHE_SIZE
in conjonction with cp=fixed chooses dict size
-d DIAGNOSE [DIAGNOSE ...], --diagnose DIAGNOSE [DIAGNOSE ...]
diagnose **rejected** : will print on STDERR rejected
parsed line, **match** : will print on stderr data
filtered out
-i INCLUDE, --include INCLUDE
include from extracted data with a json (string or
filename) in the form { "field" : "pattern" }
--off OFF turn off plugins : geo_ip to skip geoip, user_agent to
turn httpagentparser off
-x EXCLUDE, --exclude EXCLUDE
exclude from extracted data with a json (string or
filename) in the form { "field" : "pattern" }
-f OUTPUT_FORMAT, --output-format OUTPUT_FORMAT
decide if output is in a specified formater amongst :
csv, json
-lf LOG_FORMAT, --log-format LOG_FORMAT
log format amongst apache_log_combined, lighttpd
-lp LOG_PATTERN, --log-pattern LOG_PATTERN
add a custom named regexp for parsing log lines
-lpn LOG_PATTERN_NAME, --log-pattern-name LOG_PATTERN_NAME
the name with witch you want to register the pattern
-dp DATE_PATTERN, --date-pattern DATE_PATTERN
add a custom date format, usefull if and only if using
a custom log_pattern and date pattern differs from
apache.
-o OUTPUT_FILE, --output-file OUTPUT_FILE
output file
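The merging tip in the help text above relies on archery and on the Python 2 `file` builtin. As a dependency-free alternative, here is a minimal Python 3 sketch that adds two speed_shoot JSON outputs key by key. The name `merge_counts` is a hypothetical helper, not part of yahi or archery, and the sketch assumes every leaf value is a numeric counter.

```python
import json


def merge_counts(left, right):
    """Recursively add two nested dicts of counters (hypothetical helper)."""
    merged = dict(left)
    for key, value in right.items():
        if key not in merged:
            merged[key] = value
        elif isinstance(value, dict) and isinstance(merged[key], dict):
            # Both sides are sections: merge them one level deeper.
            merged[key] = merge_counts(merged[key], value)
        else:
            # Assumption: leaves are numeric counters, so they can be summed.
            merged[key] = merged[key] + value
    return merged


if __name__ == "__main__":
    # Tiny inline samples shaped like speed_shoot output.
    old = json.loads('{"by_ip": {"192.168.0.35": 7}, "total_line": 7}')
    new = json.loads(
        '{"by_ip": {"192.168.0.35": 1, "74.125.18.162": 1}, "total_line": 2}'
    )
    print(json.dumps(merge_counts(old, new), sort_keys=True))
```

In practice the two inputs would come from `json.load` on files such as dat1.json and dat2.json.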
Usage
Simplest usage is:
speed_shoot -g /usr/local/data/geoIP /var/www/apache/access*log
It returns JSON of the form:
{
"by_date": {
"2012-5-3": 11
},
"total_line": 11,
"ip_by_url": {
"/favicon.ico": {
"192.168.0.254": 2,
"192.168.0.35": 2
},
"/": {
"74.125.18.162": 1,
"192.168.0.254": 1,
"192.168.0.35": 5
}
},
"by_status": {
"200": 7,
"404": 4
},
"by_dist": {
"unknown": 11
},
"bytes_by_ip": {
"74.125.18.162": 151,
"192.168.0.254": 489,
"192.168.0.35": 1093
},
"by_url": {
"/favicon.ico": 4,
"/": 7
},
"by_os": {
"unknown": 11
},
"week_browser": {
"3": {
"unknown": 11
}
},
"by_referer": {
"-": 11
},
"by_browser": {
"unknown": 11
},
"by_ip": {
"74.125.18.162": 1,
"192.168.0.254": 3,
"192.168.0.35": 7
},
"by_agent": {
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0,gzip(gfe) (via translate.google.com)": 1,
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0": 10
},
"by_hour": {
"9": 3,
"10": 4,
"11": 1,
"12": 3
},
"by_country": {
"": 10,
"US": 1
}
}
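Since the output is plain JSON, it can be post-processed with the standard library alone. The sketch below ranks the entries of one flat section such as by_ip or by_url. The function name `top_entries` is hypothetical, and the inline data is a shortened copy of the sample above.

```python
import json
from collections import Counter


def top_entries(stats, section, n=3):
    """Return the n largest (key, count) pairs of one flat section."""
    return Counter(stats.get(section, {})).most_common(n)


if __name__ == "__main__":
    # Inline excerpt of the output above; in practice use json.load on a file.
    stats = json.loads(
        '{"by_ip": {"74.125.18.162": 1, "192.168.0.254": 3, "192.168.0.35": 7}}'
    )
    for ip, hits in top_entries(stats, "by_ip", n=2):
        print(ip, hits)
```

Nested sections such as ip_by_url need one extra lookup first, for example `top_entries(stats["ip_by_url"], "/")`.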
If you use:
speed_shoot -f csv -g /usr/local/data/geoIP /var/www/apache/access*log
Your result is:
by_date,2012-5-3,11
total_line,11
ip_by_url,/favicon.ico,192.168.0.254,2
ip_by_url,/favicon.ico,192.168.0.35,2
ip_by_url,/,74.125.18.162,1
ip_by_url,/,192.168.0.254,1
ip_by_url,/,192.168.0.35,5
by_status,200,7
by_status,404,4
by_dist,unknown,11
bytes_by_ip,74.125.18.162,151
bytes_by_ip,192.168.0.254,489
bytes_by_ip,192.168.0.35,1093
by_url,/favicon.ico,4
by_url,/,7
by_os,unknown,11
week_browser,3,unknown,11
by_referer,-,11
by_browser,unknown,11
by_ip,74.125.18.162,1
by_ip,192.168.0.254,3
by_ip,192.168.0.35,7
by_agent,"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0,gzip(gfe) (via translate.google.com)",1
by_agent,Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0,10
by_hour,9,3
by_hour,10,4
by_hour,11,1
by_hour,12,3
by_country,,10
by_country,US,1
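In the CSV above, each row appears to be the path of keys followed by the counter in the last column. Under that assumption, a minimal sketch can rebuild the nested structure with the standard csv module. The function name `csv_to_nested` is hypothetical and not part of yahi.

```python
import csv


def csv_to_nested(lines):
    """Rebuild a nested dict from rows of the form key,...,key,count."""
    result = {}
    for row in csv.reader(lines):
        if len(row) < 2:
            # Skip blank or malformed rows: a key and a count are required.
            continue
        *path, count = row
        node = result
        for key in path[:-1]:
            node = node.setdefault(key, {})
        # Assumption: the last column is always an integer counter.
        node[path[-1]] = int(count)
    return result


if __name__ == "__main__":
    sample = [
        "total_line,11",
        "ip_by_url,/favicon.ico,192.168.0.254,2",
        "by_country,US,1",
    ]
    print(csv_to_nested(sample))
```

The csv module handles the quoted by_agent rows, so user agents containing commas stay in one field.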
A commented jumbo command line example
The following command line:
speed_shoot -g data/GeoIP.dat -lf lighttpd \
-x '{ "datetime" : "^01/May", "uri" : "(.*munin|.*(png|jpg))$"}' \
-d rejected -d match -i '{ "_country" : "(DE|GB)" }' \
*log* yahi/test/biggersample.log
does the following:
- locates the geoIP file (-g) at data/GeoIP.dat;
- sets the log format (-lf) to lighttpd;
- excludes (-x) any line matching either:
  - a URI containing munin or ending in jpg or png,
  - a date of May the first;
- includes (-i) only lines matching:
  - any IP which has been geolocated,
  - any non-authenticated user;
- diagnoses (-d), that is prints on stderr, any line that does not match the log format regexp and any line rejected by -x or -i;
- does all of this for every given log file.
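To make the filter patterns easier to test before a long run, here is a small sketch of how { "field" : "pattern" } filters could be applied to one parsed record. The semantics are assumptions, not taken from the yahi source: a record is dropped if any exclude pattern matches, kept only if all include patterns match, and patterns are applied with `re.search`. The function name `keep` is hypothetical.

```python
import re


def keep(record, include=None, exclude=None):
    """Decide whether a parsed log record survives the filters.

    Assumed semantics (not confirmed by the yahi source): a record is
    dropped if ANY exclude pattern matches its field, and kept only if
    ALL include patterns match.
    """
    for field, pattern in (exclude or {}).items():
        if re.search(pattern, record.get(field, "")):
            return False
    for field, pattern in (include or {}).items():
        if not re.search(pattern, record.get(field, "")):
            return False
    return True


if __name__ == "__main__":
    # Patterns copied from the command line above.
    exclude = {"datetime": "^01/May", "uri": "(.*munin|.*(png|jpg))$"}
    include = {"_country": "(DE|GB)"}
    record = {"datetime": "02/May/2012", "uri": "/index.html", "_country": "DE"}
    print(keep(record, include, exclude))
```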
Using a config file
Well, not impressive:
speed_shoot -c config.json
If an option is specified in the config file, it overrides the value set on the command line.
Here is a sample of a config file:
{
"exclude" : {
"uri" : ".*munin.*",
"referer" : ".*(munin|php).*"
},
"include" : { "datetime" : "^04" },
"silent" : "False",
"files" : [ "yahi/test/biggersample.log" ]
}
Easter eggs or bad idea
The options -x, -i and -c accept either a string or a filename, which makes debugging badly formatted JSON a pain.
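One way to ease that pain is to check the argument before handing it to speed_shoot. The sketch below is a hypothetical pre-check, not a reproduction of how speed_shoot resolves the argument: it treats the value as a file if such a file exists, otherwise as a JSON string, and reports where parsing failed.

```python
import json
import os


def load_json_arg(value):
    """Load a JSON document from a filename or from the string itself.

    Hypothetical helper for pre-checking a -x, -i or -c argument.
    """
    if os.path.isfile(value):
        source = value
        with open(value, encoding="utf-8") as handle:
            text = handle.read()
    else:
        source = "<string>"
        text = value
    try:
        return json.loads(text)
    except json.JSONDecodeError as error:
        # Re-raise with the origin and the position of the syntax error.
        raise ValueError(
            f"{source}: invalid JSON at line {error.lineno}, "
            f"column {error.colno}: {error.msg}"
        ) from error


if __name__ == "__main__":
    print(load_json_arg('{ "ip" : "^192.168", "agent": "Mozill" }'))
```

A mistyped filename falls through to the string branch and fails as invalid JSON, which is at least reported with `<string>` as its origin.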