Return to Alan Ng's personal homepage, Technology area.

I've done two versions of this script:

For use with IIS version 6.0 default log format

# visits-IIS6.awk
#
# Generates two new files given an IIS 6.0-format server log of Web hits.
# 1) (sent to stdout for you to pipe whereever you like)
#      Assigns each visit a unique long integer as an additional, first field in the new log file
#      The integer is formed by concatenating the POSIX timecode of the visit's first hit
#      with the IP stripped of periods. This should work at least until 2038-01-19 03:14:07 UTC.
# 2) firsthits.txt (you can change this name in the BEGIN section)
#      Separate text file containing one line for the initial hit of each unique visit. The format
#	   of each line is identical to the format of the main log file.
#
# The first output file preserves all the data from the original IIS server log (excluding log remarks).
# The record order in that main output file is: all hits from one visit are adjacent in
# chronological order. The visits are grouped only roughly in chronological order. The file 
# firsthits.txt allows you to link the two tables (if you import this data into a database)
# to do sorting and grouping of visits.
#
# Example DOS command: c:\util\awk\gawk -f h:\data\visits.awk yourHits.log > yourVisits.txt
#
# Requires gawk from gnu.org
# Copyright (C) October 2002 Alan Ng nospamalan@alan-ng.net. Remove "nospam" to contact me.
# Revised January 2005 for IIS 6 format.
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
# You can view a copy of the GNU General Public License at:
# http://www.gnu.org/copyleft/gpl.html

BEGIN {
### definition of "visit" is that it expires upon no hit from that IP for timer seconds
### you may change this to any positive integer value
timer = 600

### you may change this to any valid path/filename
visitfile = "firsthits.txt"
}

# MAIN
{
### reject all IIS log remark lines, which start with '#' and may occur anywhere in a log file
if ($1 ~ /^#/) { next }

### parse timestamp
### assumes that first two fields of each log line are date and time
time = mktime(gensub(/-/, " ", "g", $1) " " gensub(/:/, " ", "g", $2))

### output and then delete visits that are now expired
for (ip in ids) {
	if ((time - last[ip]) > timer) { 
		print transcript[ip]
		print firsthit[ip] > visitfile
		delete transcript[ip]
		delete firsthit[ip]
		delete ids[ip]
		delete start[ip]
		delete end[ip]
		}
	}

### store the current hit
### assumes that visitor IP ("c-ip" in IIS log) is the 11th field
if (start[$11] == "") {
	### new visit
	ids[$11] = time gensub(/\./, "", "g", $11)
	start[$11] = time
	last[$11] = time
	firsthit[$11] = ids[$11] " " $0
	transcript[$11] = ids[$11] " " $0
	}
else {
	### build visit transcript
	last[$11] = time
	transcript[$11] = transcript[$11] "\n" ids[$11] " " $0
	}	
}

END {
### output all leftover (= unexpired) visits
for (ip in ids) {
	print transcript[ip]
	print firsthit[ip] > visitfile
	}
}

For use with IIS version 5.0 default log format

# visits.awk
#
# Generates two new files given an IIS 5.0-format server log of Web hits.
# 1) (sent to stdout for you to pipe whereever you like)
#      Assigns each visit a unique long integer as an additional, first field in the new log file
#      The integer is formed by concatenating the POSIX timecode of the visit's first hit
#      with the IP stripped of periods. This should work at least until 2038-01-19 03:14:07 UTC.
# 2) firsthits.txt (you can change this name in the BEGIN section)
#      Separate text file containing one line for the initial hit of each unique visit. The format
#	   of each line is identical to the format of the main log file.
#
# The first output file preserves all the data from the original IIS server log (excluding log remarks).
# The record order in that main output file is: all hits from one visit are adjacent in
# chronological order. The visits are grouped only roughly in chronological order. The file 
# firsthits.txt allows you to link the two tables (if you import this data into a database)
# to do sorting and grouping of visits.
#
# Example DOS command: c:\util\awk\gawk -f h:\data\visits.awk yourHits.log > yourVisits.txt
#
# Requires gawk from gnu.org
# Copyright (C) October 2002 Alan Ng nospamalan@alan-ng.net. Remove "nospam" to contact me.
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
# You can view a copy of the GNU General Public License at:
# http://www.gnu.org/copyleft/gpl.html

BEGIN {
### definition of "visit" is that it expires upon no hit from that IP for timer seconds
### you may change this to any positive integer value
timer = 600

### you may change this to any valid path/filename
visitfile = "firsthits.txt"
}

# MAIN
{
### reject all IIS log remark lines, which start with '#' and may occur anywhere in a log file
if ($1 ~ /^#/) { next }

### parse timestamp
time = mktime(gensub(/-/, " ", "g", $1) " " gensub(/:/, " ", "g", $2))

### output and then delete visits that are now expired
for (ip in ids) {
	if ((time - last[ip]) > timer) { 
		print transcript[ip]
		print firsthit[ip] > visitfile
		delete transcript[ip]
		delete firsthit[ip]
		delete ids[ip]
		delete start[ip]
		delete end[ip]
		}
	}

### store the current hit
if (start[$3] == "") {
	### new visit
	ids[$3] = time gensub(/\./, "", "g", $3)
	start[$3] = time
	last[$3] = time
	firsthit[$3] = ids[$3] " " $0
	transcript[$3] = ids[$3] " " $0
	}
else {
	### build visit transcript
	last[$3] = time
	transcript[$3] = transcript[$3] "\n" ids[$3] " " $0
	}	
}

END {
### output all leftover (= unexpired) visits
for (ip in ids) {
	print transcript[ip]
	print firsthit[ip] > visitfile
	}
}

If you appreciate this page, please consider pitching in a small donation to the costs of running this Web server. Thank you!