Gawking Text Files

Datetime:2016-08-23 00:57:37          Topic: AWK           Share

Some tools in a toolbox are versatile. You can use a screwdriver as a pry bar to open a paint can, for example. I’ve even hammered a tack in with a screwdriver handle even though you probably shouldn’t. But a chainsaw isn’t that versatile. It only cuts. But man does it cut!

AWK is a chainsaw for processing text files line-by-line (and the GNU version is known as GAWK). That’s a pretty common case. It is even more common if you produce a text file from a spreadsheet or work with other kinds of text files. AWK has some serious limitations, but so do chainsaws. They are still super useful. Although AWK sounds like a penguin-like bird (see right), that’s an auk. Sounds the same, but spelled differently. AWK is actually an acronym of the original author’s names.

If you know C and you grok regular expressions, then you can learn AWK in about 5 minutes. If you only know C, go read up on regular expressions and come back. Five minutes later you will know AWK. If you are running Linux, you probably already have GAWK installed and can run it using the alias awk. If you are running Windows, you might consider installing Cygwin , although there are pure Windows versions available. If you just want to play in a browser, try webawk .

AWK Processing Loop

Every AWK program has an invisible main() built into it. In quasi-C code it looks like this:

int main(int argc, char *argv[]) {
   process(BEGIN);
   for (i=1;i<argc;i++) {
      FILENAME=argv[i];
      for each line in FILENAME {
         process(line);
      }
   }
   process(END);
}

In other words, AWK reads each file on its command line and processes each line it reads from them. A line is a bunch of characters that ends with whatever is in the RS variable (usually a newline; RS =Record Separator). Before it starts and after it is done it does processing for BEGIN and END (not exactly lines as you’ll see in a second).

Line Processing

Big deal, right? The trick is in the line processing. Here’s what AWK does:

  • It puts the whole line in a special variable called $0
  • It splits the line into fields $1 , $2 , $3 , etc.
  • Variable FS sets the regular expression to use for splitting. The default is any whitespace, but you can set it to be commas, commas and spaces, or even blank lines.
  • Variable NF gets the number of fields in the current line.
  • The line number for this file is in FNR and the line number overall is in NR .

Now the AWK script is what handles the processing. Comments start with the # character, so ignore those. They go to the end of the line. Anything in column 1 (except a brace) is a match. Think of it as an implied if statement. If the expression is true, the code associated with it executes. The BEGIN expression is true when the processing hasn’t started yet and END is for the processing after everything else is done. If there is no condition, the code always executes.

Example:

BEGIN { # brace has to be on this line!
   count=0
}
    { 
    count++
    }
END {
    print count “ “ NR
    }

You can put semicolons like you do in C if it makes you feel better. You can use parenthesis too (like print(…) ). What does the example mean? Well, at the start set count=0 (variables are made up as you need them and start out empty which means we don’t really need this since an empty variable will be zero if it is used like a number). There are no types. Everything is a string but can be a number when needed.

The brace set in the middle matches every line no matter what. It increases the count by 1. Then at the end, we print the count variable and NR which should match. When you write two strings (or string variables) next to each other in AWK they become one string. So that is really just printing one string made up of count , a space, and NR .

Something Practical

Here’s something more interesting:

BEGIN { # brace has to be on this line!
  count=0
  }
count==10 {
   print “The first word on line 10 is “ $1
}

This prints the first word ( $1 ) of the 10th line of input. If you don’t give any arguments to print, it prints $0 , by the way. Overall, this still isn’t too exciting. Maybe it would be more exciting to use $1==”Title:” or something as a condition. But the real power is that you can use regular expressions as a condition. For example:

/[tT]itle:/ {

This line would match any input line that had title: or Title: in it.

/^[ \t]*[tT]itle:/ {

This would match the same but only at the start of a line with 0 or more blanks ahead of it.You can match fields too:

$2 ~ /[mM]icrosoft/ # match Microsoft
$2 !~ /Linux[0-9]/ # must NOT match Linux0 Linux1 Linux2 etc.

Functions

AWK has functions for regular expressions in the code too such as sub , match , and gsub (see the manual ). You can also define your own functions. So:

/dog/ {
   gsub(/dog/,”cat”) # works on $0 by default
   print # print $0 by default
   next
}
{
   print
}

This transforms lines that have dog in them to read cat (including converting dogma to catma). The next keyword makes AWK stop processing the line and go to the next line of the input file. You could make this script shorter by omitting both the first print and the next. Then the last print will do all the printing.

Associative Arrays

The other thing that is super useful about AWK is that it has a little database built into it hiding as an array. These are associative arrays and they are just like C arrays with string indexes. Here’s a script to count the number of times a word appears in a text:

{
  for (i=1;i<=NF;i++) {
    gsub(/[-,.;:!@#$%^&*()+=]/,"",$i) # get rid of most punctuation
    word[$i]=word[$i]+1
     }
   }

END {
   for (w in word) {
     print w "-" word[w]
    }
}

Don’t forget you can use numbers too so foo[4] is also an array. However, you can also mix and match even though you probably shouldn’t. That is foo[4]=10 and foo[“Hello”]=9 are both legal and now the array has two elements, 4 and “Hello” !

Another note: $i and i are two different things! In the first pass of the above for loop, i is equal to 1 and $i is the content for the first field (because i is 1 ). You often see $NF used, for example. This is not the number of fields. It is the contents of the last field. NF is the number of fields.

You can run this script with:

awk –f word.awk

Or, if you are on a Unix-like system, you could start the file with “#!/bin/awk –f” instead and make it executable.

Hacker Use

There are lots of other things you can do with AWK. You can read lines from a file, read command line variables, use shell programs to filter input and output, change output formats, and more. Most C functions will work (even printf ). You can read the entire manual online and you’ll find examples everywhere.

You might wonder why hardware hackers need AWK. Next time, I’ll show you some practical uses that range from processing logged data, reading Intel hex files, and even compiling languages. Meanwhile, maybe you’d like to use crosswords to learn regular expressions .

Photo credit: [Michael Haferkamp] CC BY-SA 3.0





About List