LOG-DIET-WITH-REPORT.PL

INTRODUCTION
-----------------------------------------------------------------
The access logs on the portal web servers fill up ridiculously
fast because the site receives about 20 million hits a day. I
created an access log report script "log-diet-report.pl" to put
these lardy logs on a diet.

It works on all uncompressed access logs in the directory
"/apps/logs". It takes a couple of minutes to read each one from
top to bottom. It creates a new stripped-down log, then deletes
the original. It prints a report on the number of lines that are
included in the stripped-down versions of the logs that go to the
DSS server.

BUT WHY?
-----------------------------------------------------------------
Alison will be making the occasional request to change the
filters. We can now keep tabs on how well these filters are
performing.

HOW IT WORKS
-----------------------------------------------------------------
This Perl script reads all the lines to match from the include
filter configuration file
"/export/home/fusion/scripts/log/include.conf". It then
translates them into regular expressions. The script then reads
all the lines in the web server access log "/apps/logs/access".
It keeps a count of any GET request that matches one of these
lines. It finally prints out a report like the one below.
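
The steps above can be sketched roughly as follows. This is an
illustrative Python sketch, not the Perl script itself. The
function names (load_filters, glob_to_regex, count_matches) are
made up for the example. It assumes that "*" is the only
wildcard, that the request is the first quoted field of each log
line, and that a request is counted once for every filter it
matches.

```python
import re
from collections import Counter

def load_filters(lines):
    """Return the filter strings, skipping blank lines and '#' comments."""
    filters = []
    for line in lines:
        text = line.strip()
        if text and not text.startswith("#"):
            filters.append(text)
    return filters

def glob_to_regex(pattern):
    """Translate a filter into a regex: '*' becomes '.*', the rest is literal."""
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(literal_parts))

# Assumption: the request looks like "GET /some/path HTTP/1.0" inside quotes.
REQUEST = re.compile(r'"GET\s+(\S+)')

def count_matches(filters, log_lines):
    """Count, per filter, the GET requests whose path contains that filter."""
    compiled = [(name, glob_to_regex(name)) for name in filters]
    counts = Counter()
    for line in log_lines:
        found = REQUEST.search(line)
        if not found:
            continue  # not a GET request, so it is never counted
        path = found.group(1)
        for name, regex in compiled:
            if regex.search(path):  # search() gives "includes" semantics
                counts[name] += 1
    return counts
```

Calling count_matches(load_filters(open(conf_path)), open(log_path))
would give a per-filter tally that a report could be built from.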

SCHEDULING
-----------------------------------------------------------------
The log stripping script runs at 13 minutes past each hour and
takes anything between a few seconds and a few minutes to run.

root crontab entry
---------------------
13 * * * * /export/home/fusion/bin/log/log-diet.pl >> /export/home/fusion/logs/log-diet.log 2>&1


LOGGING
-----------------------------------------------------------------
A one-line summary is printed every time the script runs. This is
appended to the file /export/home/fusion/logs/log-diet.log, e.g.
...
file: access.23Nov-07AM, fat lines: 117301, thin lines: 1886,
time: 33 wallclock secs (32.04 usr +  0.16 sys = 32.20 CPU)
file: access.23Nov-08AM, fat lines: 176484, thin lines: 3069,
time: 50 wallclock secs (48.38 usr +  0.40 sys = 48.78 CPU)
file: access.23Nov-09AM, fat lines: 267145, thin lines: 5030,
time: 77 wallclock secs (74.61 usr +  0.49 sys = 75.10 CPU)
...
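
To keep tabs on the filters, the summary lines can be turned into
a "fraction of lines kept" figure. The following is a hedged
Python sketch, not part of the script. The name parse_summary is
invented for the example, and it assumes each summary really is a
single line in the format shown above.

```python
import re

# Matches the leading part of a summary line; the timing text is ignored.
SUMMARY = re.compile(
    r"file: (?P<file>\S+), fat lines: (?P<fat>\d+), thin lines: (?P<thin>\d+)"
)

def parse_summary(line):
    """Return file name, line counts and kept fraction, or None if no match."""
    found = SUMMARY.search(line)
    if not found:
        return None
    fat = int(found.group("fat"))
    thin = int(found.group("thin"))
    return {
        "file": found.group("file"),
        "fat": fat,
        "thin": thin,
        # Fraction of the original lines that survived the filters.
        "kept": thin / fat if fat else 0.0,
    }
```

A sudden jump in the kept fraction after a filter change is a
hint that a new filter is matching more than was intended.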

FILTERS
-----------------------------------------------------------------
The bits of text to include are stored in the appallingly named
"include.conf" file.

$ cat /export/home/fusion/scripts/log/include.conf
# title: include text
# description:
#   Each line of text below is compared to each line in the web server
#   access log. If the "GET file" field in the log file includes one of
#   these pieces of text, it is kept. Otherwise, the line is removed.
# dos and don'ts
#   - Do NOT use a "\" at the end of a line. The program will think that this
#     is a "line continuation" character.
#   - You CAN use the glob-style wildcard "*" to match any run of characters.
#   - Do NOT use URLs such as "http://www.openworld.com". This only matches
#     the referrer field, NOT the "GET file" field. This will screw up your
#     statistics.
#
# last modified:
#---------#---------#---------#---------#---------#---------#---------
# wap tracking
/wap/

# additions 07/08/01
# corporate site tracking
/pzn/openworld/corporate
/pzn/corporate

# additions 30/04/01
# recruitment site referral tracking
/anon
/index.htm
/fast10
/FAST10
/office
/Office
/OFFICE
/order
/ORDER
/Order
/now
/NOW
/Now
/possibility
/POSSIBILITY
/Possibility
/blistering
/BLISTERING
/Blistering
/revolution
/REVOLUTION
/Revolution
/speed
/SPEED
/Speed
/rev
/REV
/Rev
/fly
/FLY
/Fly
/new
/NEW
/New
/fast
/FAST
/Fast
/customer
/CUSTOMER
/Customer
/priority
/PRIORITY
/Priority
/first
/FIRST
/First
/premier
/PREMIER
/Premier
/demos
/DEMOS
/Demos
/fieldmarketing
/FIELDMARKETING
/Fieldmarketing
/retail
/RETAIL
/Retail
/CHEERS
/cheers
/update1
/competition1
/update2
/competition2
/update3
/competition3
/update4
/competition4
/update5
/competition5
/update6
/competition6
/fm
/FM
/demo
/Demo
/DEMO
/register
/hp571
/HP571
/flash
/flash.html
/direct
/Direct
/DIRECT
/imail0201
/Imail0201
/IMAIL0201
/IMail0201
/imail0201
/iMail0201
/mail0201
/MAIL0201
/open
/Open
/OPEN
/Fast1
/fast1
/FAST1
/Fast2
/fast2
/FAST2
/Fast3
/fast3
/FAST3
/Fast4
/fast4
/FAST4
/Fast5
/fast5
/FAST5
/Fast6
/fast6
/FAST6
/Fast7
/fast7
/FAST7
/Fast8
/fast8
/FAST8
/Fast9
/fast9
/FAST9
/Fast10
/howfast
/HOWFAST
/Howfast


FILTER EXAMPLES
-----------------------------------------------------------------
An entry like this
   /directory/*.htm
would match pages like these.
   /directory/mypage1.htm
   /directory/mypage1.html
   /directory/index.htm
   /directory/index.html
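The "html" names match too, because a filter only has to appear
somewhere in the "GET file" field, not run to the end of it. The
only difference between "htm" and "html" is that Microsoft
products create names ending in "htm" and everything else creates
names ending in "html". It's best to catch both, if the page
names are what you are after.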

193.113.57.165 - - [09/May/2001:09:50:21 +0100]
"GET /internet/pop/cspopc/new_stylesheet.css/0,3795,,00.html HTTP/1.0"
200 1188 "http://www.openworld.com/internet/pop/cspopd/anon_index_d/"
"Mozilla/4.0 (compatible; MSIE 5.5; Windows 95)"
GET /internet/pop/cspopc/new_stylesheet.css/0,3795,,00.html - "HTTP/1.0"

This is being matched by the filter "/new", which catches the
"/new_stylesheet.css" part of the path.

213.122.97.158 - - [09/May/2001:09:50:31 +0100]
"GET /pzn/internet/pop/cspopd/anon_index_d/top/pixel.gif?180 HTTP/1.1"
200 43 "http://www.openworld.com/internet/pop/cspopd/anon_index_d_0800/"
"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; sureseeker.com;
internet CD v7.0)"
GET /pzn/internet/pop/cspopd/anon_index_d/top/pixel.gif 180 "HTTP/1.1"

This is being matched by the filter "/pzn/*/pixel.gif".
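
When a line is kept unexpectedly, it helps to ask which filters
match a given path. The helper below is a small Python sketch
written for this document, not part of the Perl script. The name
matching_filters is invented, and it assumes "*" is the only
wildcard and that a filter may match anywhere in the path.

```python
import re

def matching_filters(path, filters):
    """Return the filters that occur somewhere in the requested path."""
    hits = []
    for name in filters:
        # '*' matches any run of characters; everything else is literal.
        regex = ".*".join(re.escape(part) for part in name.split("*"))
        if re.search(regex, path):
            hits.append(name)
    return hits

filters = ["/directory/*.htm", "/new", "/pzn/*/pixel.gif"]
paths = [
    "/directory/index.html",
    "/internet/pop/cspopc/new_stylesheet.css/0,3795,,00.html",
    "/pzn/internet/pop/cspopd/anon_index_d/top/pixel.gif",
]
for path in paths:
    print(path, "->", matching_filters(path, filters))
```

Short filters such as "/new" are the ones most likely to catch
unrelated files, so they are worth checking this way before they
go into the configuration file.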


--
larg
--
