AWK for the Most Minimal of Performance Testing Toolsets!

Datetime: 2016-08-23 01:00:05          Topic: AWK, Performance Testing

awk is a traditional Unix utility that I never had much use for. So you say it processes data line-by-line? In its own unique domain-specific language? Even simple one-liners run from the shell tend to have quotes (""), square brackets ([]), and curly braces ({})? And have you heard how local variables are declared in functions?!? It seemed quaint, complex, and distinctly old-school—the kind of command you might use before piping output to your 9-pin dot-matrix printer with | pr | opr. Oh, and the latest edition of O’Reilly’s Effective awk Programming was published June 2, 2001. Ya.

All that was before I stumbled across Adam Drake’s post Command-line tools can be 235x faster than your Hadoop cluster, read Bruce Barnett’s easy-to-follow awk tutorial, and wrote my first real awk program—a quick script to get basic statistics from multiple runs of the time command. Together with the bash script below for timing multiple runs of a program, this gives us the start of a super-minimal performance testing toolset. Perfect for when you want to know, say, the average running time across 80 runs of a script.

Step 1: Timing Multiple Runs of a Script

The time command produces results in the format:

real    0m12.667s
user    0m8.240s
sys     0m3.411s

This gives us the total amount of time elapsed while running a command (‘real’), the amount of time spent in user-land code (‘user’), and the amount of time spent on system calls (‘sys’).

To get results over multiple runs, we have the super-simple bash script below. It takes the name of a command to run, runs it (sequentially) an arbitrary number of times, and appends the output to a specified text file:

#!/usr/bin/env bash
# Usage: ./tester [number of times to run] [utility to run] [output file]
# Ex: ./tester 20 ./test1 test1results
i=1
while [ "$i" -le "$1" ]; do
	echo "Run $i"
	echo "Run $i" >> "$3"
	# Discard the command's stdout; append time's output (written to stderr) to the results file
	{ time $2 > /dev/null; } 2>> "$3"
	let i=i+1
	echo >> "$3"
done
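
For example, to time 20 runs of a script called ./test1 (the same names used in the usage comment above) and collect the results in test1results:

./tester 20 ./test1 test1results

Each run appends a "Run N" header, the three lines of time output, and a blank line to the results file, which the AWK script in Step 2 can consume directly (it only looks at the real/user/sys lines).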

Step 2: Getting Some Basic Statistics with AWK: Count, Total, Average, Minimum, Maximum, and Sample Standard Deviation

The fun thing about AWK is that it makes it very easy to implement the three steps of a common data processing pattern:

  1. First you do some initial setup at the beginning.
  2. Then you do something to every data ‘record’ (every row in the table, every line in the file, etc.).
  3. Finally, when you’re done processing all the data, you print your results.

Depending on what you’re doing, steps 1 and 3 might be optional. And if you just want to run a program without processing data record-by-record, you could put it in step 1 (in AWK, the BEGIN {} section) or step 3 ( END {} ). Simple, right?
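
As a quick illustration of the pattern (a toy one-liner, not part of the script below), this counts how many 'real' lines appear in a results file: the counter is initialized in step 1, incremented for every matching record in step 2, and printed in step 3:

awk 'BEGIN { count = 0 } /^real/ { count++ } END { print "real lines:", count }' test1results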

The AWK script below processes every record and then prints some basic statistics.

#!/usr/bin/env awk -f
################################################################################
# Input: A file with output from multiple runs of the time command in a format such as:
#real	0m29.352s
#user	0m17.516s
#sys	0m8.003s
# Output: Basic statistics (count, total, average, minimum, maximum, sample standard deviation) across all runs
#
# Written after reading http://www.grymoire.com/Unix/Awk.html
#
# Changelog
# 2015-11-27 v5 Changed approach for determining  minimum and maximum values to further reduce memory usage and improve performance. Run time over a million records is currently 22 seconds real time with total memory usage of around 300 MB. Removed pre-declaration of other variables in BEGIN, as this no longer results in reduced memory usage.
# 2015-11-25 v4 Pre-declaring variables in BEGIN to reduce memory usage by between 43 and 48 MB over a million records with no other changes.
# 2015-11-21 v3 Changed calculation of SSD to use stored averages, improving run time over 1 million records by about 3.8 seconds (29.679 to 25.845 seconds real time).
# 2015-11-18 v2 Improved parsing of times to reduce memory usage by about 136 MB over 1 million sets of records (~3 million records total, 395 MB vs 531 MB as reported by top MEM column on OS X)
# 2015-10-12 v1 Initial version
################################################################################

BEGIN {
	# Pre-declare minimum and maximum values. 2^53 is the largest integer we can accurately represent with AWK's IEEE 64-bit doubles
	# Rather than pre-declaring fixed minimum and maximum values, a cleaner approach would simply add an inline test for whether min[$1] or max[$1] have already been set (e.g. min[$1] == ""). While I found that this produced accurate results, it seemed to introduce a memory leak.

        min["real"]=2^53;
        min["user"]=2^53;
        min["sys"]=2^53;
        max["real"]=0;
        max["user"]=0;
        max["sys"]=0;

}

# For lines with 2 fields where the first field contains real, user, or sys. (Note that the number-of-fields check is not strictly necessary.)
$1 ~ /(real|user|sys)/ && NF == 2 {
	# Extract times in seconds and add to total real/user/sys run time
	m_loc = index($2, "m");
	s_loc = index($2, "s");
	minutes_only = substr($2, 1, m_loc - 1);
	seconds_only = substr($2, m_loc + 1, s_loc - (m_loc + 1));
	seconds = (60 * minutes_only) + seconds_only;
	total_time[$1] += seconds;

	# Count the number of times we've encountered real/user/sys. If the numbers for each don't end up the same we may have a problem
	lines[$1]++;

	# Calculate max & min
	if (seconds < min[$1]){
		min[$1] = seconds;
	}
	if (seconds > max[$1]){
		max[$1] = seconds;
	}

	# Add number of seconds to a multi-dimensional array so we can calculate standard deviation later 
	all_times[$1 "," lines[$1]] = seconds;
}
END {
	print "-----Runs-----"
	for (i in lines) {
		print i, lines[i];
	}
	print "\n-----Total-----"
	for (i in total_time){
		print i, total_time[i];
	}
	print "\n-----Average-----"
	for (i in lines) {
		average[i] = total_time[i]/lines[i];
		print i, average[i];
	}
	print "\n-----Min/Max-----"
	for (i in lines) {
		print i, "\t" min[i], "\t", max[i];
	}
	print "\n-----Sample Standard Deviation-----"
	for (i in lines) {
		for (j = 1; j <= lines[i]; j++) {
			# SSD: Sum of Squared Differences
			ssd[i] += (all_times[i "," j] - average[i]) ^ 2
		}
	}
	for (i in lines) {
		# Sample rather than population standard deviation. Subtract 1 from number of samples ("Bessel's correction")
		print i, sqrt(ssd[i]/(lines[i] - 1));
	}
}
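
To run it against the results file from Step 1, save the script somewhere (I'll assume stats.awk here) and point awk at it:

awk -f stats.awk test1results

Alternatively, mark it executable (chmod +x stats.awk) and, on platforms where env passes the -f argument through the shebang line (it does on OS X), run it as ./stats.awk test1results.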

So how long does it take to run?

On my MacBook, about 22 seconds (46,004 records/second) on battery power or 25 seconds (39,765 records/second) on wall power. Since we're storing data in memory to compute the sample standard deviation, over a million records it uses around 300 MB. If we don't calculate the sample standard deviation, memory usage is constant at just 516 KB and the script runs in 14 seconds. While these numbers might not seem particularly fast, keep in mind that 1) they include the time to load and parse raw textual data and 2) as written, the computation is bound to a single CPU core. Altogether...

...pretty awking awesome.

~