
AWK scripting examples

Several kinds of tasks occur repeatedly when working with text files. You might want to extract certain lines and discard the rest. Or you may need to make changes wherever certain patterns appear, but leave the rest of the file alone. Awk scripting allows you to do this and much more.

Here is a list of useful awk scripts.

Calling external programs

External programs can be called from awk scripts by using the awk pipe mechanism. This is an example.

 /Condition/ {print $0 | "cut -d':' -f2";}

This pipes each matching line to the cut command, which extracts the second field using ':' as the field delimiter (useful when FS is set to a different character).

Another example, which appends every matching line to a file:

 /Condition/ {print $0 | "cat >> file";}
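The pipe mechanism can be tried end to end with a small, self-contained sketch. The sample records and the /^user/ pattern are made up for illustration. The delimiter is written as -d: without quotes so that the command can sit inside a single-quoted awk program, and close() is called with the same command string so that awk waits for cut to finish.

```shell
# Hypothetical colon-separated input; only lines starting with "user" match.
names=$(printf 'user:alice:100\nskip:bob:200\nuser:carol:300\n' | awk '
/^user/ { print $0 | "cut -d: -f2" }   # send each matching record to one cut process
END     { close("cut -d: -f2") }       # close the pipe so cut flushes its output
')

echo "$names"
```

With this input the sketch should print alice and carol on separate lines.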

Calculating mean values

This first script calculates the mean value of column #2 in a test data file.

cat data | awk '{TOTAL+=$2} END{if (NR > 0) printf("COUNT:%d, TOTAL:%.2f, MEAN:%.2f\n",NR,TOTAL,TOTAL/NR)}'

If the fields in the file are not separated by spaces (for example, they are separated by semicolons), specify the separator by setting the FS variable in a BEGIN block at the start of the awk script. The following command computes the mean of column 3 in a semicolon-separated file:

cat data | awk 'BEGIN{FS=";"} {TOTAL+=$3} END{if (NR > 0) printf("COUNT:%d, TOTAL:%.2f, MEAN:%.2f\n",NR,TOTAL,TOTAL/NR)}'
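The field separator can also be passed on the command line with the -F option instead of a BEGIN block. The sketch below runs both forms on the same made-up semicolon-separated sample and should give the same mean of column 3 each time.

```shell
# Hypothetical sample data: three semicolon-separated records.
sample=$(printf 'a;x;10\nb;y;20\nc;z;33\n')

# Form 1: set FS in a BEGIN block.
with_fs=$(printf '%s\n' "$sample" | awk 'BEGIN{FS=";"} {TOTAL+=$3} END{if (NR > 0) printf("%.2f\n", TOTAL/NR)}')

# Form 2: pass the separator with -F.
with_flag=$(printf '%s\n' "$sample" | awk -F';' '{TOTAL+=$3} END{if (NR > 0) printf("%.2f\n", TOTAL/NR)}')

echo "$with_fs $with_flag"
```

Both variables should hold 21.00, since (10 + 20 + 33) / 3 = 21.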

We can also compute mean values for three columns at the same time.

cat data | awk '{TOTAL1+=$1; TOTAL2+=$2; TOTAL3+=$3} END{if (NR > 0) printf("COUNT:%d, MEAN-1:%.2f, MEAN-2:%.2f, MEAN-3:%.2f\n",NR,TOTAL1/NR,TOTAL2/NR,TOTAL3/NR)}'
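When the number of columns is not fixed, the same idea can be written with an array and a loop over NF. This is a sketch rather than a replacement for the command above: the names SUM and MAXNF are chosen here for illustration, the sample data is made up, and a row with fewer fields simply contributes 0 to the missing columns.

```shell
# Hypothetical three-column sample; the loop works for any column count.
means=$(printf '1 10 100\n2 20 200\n3 30 300\n' | awk '
{
    for (i = 1; i <= NF; i++) SUM[i] += $i   # running total per column
    if (NF > MAXNF) MAXNF = NF               # remember the widest row
}
END {
    if (NR == 0) exit                        # avoid dividing by zero on empty input
    for (i = 1; i <= MAXNF; i++)
        printf("MEAN-%d:%.2f%s", i, SUM[i] / NR, (i < MAXNF ? " " : "\n"))
}')

echo "$means"
```

For this sample the output should be MEAN-1:2.00 MEAN-2:20.00 MEAN-3:200.00.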

Removing outliers

cat data | awk '{ROW[NR]=$0; DATA[NR]=$3; TOTAL+=$3} END{if (NR == 0) exit; MEAN=TOTAL/NR; LIMIT=(MEAN < 0 ? -MEAN : MEAN)*30/100; for (i = 1; i <= NR; i++) {DIFF=DATA[i]-MEAN; if (DIFF < 0) DIFF=-DIFF; if (DIFF < LIMIT) print ROW[i]}}'

Given a multi-column, multi-row data file, this command keeps only the rows whose third field lies within 30% of the third-column mean and removes the rest.
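A small worked example makes the 30% rule concrete. The six rows below are made up: the third column sums to 66, so the mean is 11 and the allowed deviation is 3.3. Only the row with the value 16 deviates by more than that (by 5), so it is the only one dropped. The sketch assumes a positive mean.

```shell
# Hypothetical sample: row label, a filler field, and the value to test.
kept=$(printf 'r1 x 10\nr2 x 11\nr3 x 9\nr4 x 10\nr5 x 10\nr6 x 16\n' | awk '
{ ROW[NR] = $0; DATA[NR] = $3; TOTAL += $3 }   # store every row and its third field
END {
    if (NR == 0) exit
    MEAN = TOTAL / NR                # 66 / 6 = 11
    LIMIT = MEAN * 30 / 100          # 3.3, assuming the mean is positive
    for (i = 1; i <= NR; i++) {
        DIFF = DATA[i] - MEAN
        if (DIFF < 0) DIFF = -DIFF   # absolute deviation from the mean
        if (DIFF < LIMIT) print ROW[i]
    }
}')

echo "$kept"
```

Rows r1 to r5 should be printed and r6 should be missing.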

This method only works well if your data set is large and contains relatively few outliers, because the outliers themselves shift the mean. You can check whether it works for your data by plotting the data before and after removing the outliers. Other methods for detecting outliers include the box plot, as described here: http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
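A box-plot style filter can also be sketched with sort and awk. The version below is only one possible reading of that approach. It assumes the common convention of fences at 1.5 times the interquartile range and picks the quartiles with a simple nearest-rank rule, where other quartile definitions would give slightly different fences. The sample data and the helper function ceil are made up for illustration, and the rows are printed in sorted order rather than input order.

```shell
# Hypothetical sample, sorted numerically on the third field before awk sees it.
kept_iqr=$(printf 'r1 x 10\nr2 x 11\nr3 x 9\nr4 x 10\nr5 x 10\nr6 x 16\n' | sort -n -k3,3 | awk '
function ceil(x) { return (x == int(x)) ? x : int(x) + 1 }   # valid for positive x
{ ROW[NR] = $0; DATA[NR] = $3 }
END {
    if (NR == 0) exit
    Q1 = DATA[ceil(0.25 * NR)]       # nearest-rank lower quartile (assumption)
    Q3 = DATA[ceil(0.75 * NR)]       # nearest-rank upper quartile (assumption)
    IQR = Q3 - Q1
    LOW = Q1 - 1.5 * IQR             # lower fence
    HIGH = Q3 + 1.5 * IQR            # upper fence
    for (i = 1; i <= NR; i++)
        if (DATA[i] >= LOW && DATA[i] <= HIGH) print ROW[i]
}')

echo "$kept_iqr"
```

For this sample the quartiles are 10 and 11, so the fences are 8.5 and 12.5 and only the row with the value 16 is dropped.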