Awk, Unix, and functional programming

Datetime:2016-08-23 00:59:10          Topic: AWK  Functional Program  Unix           Share

Awk, Unix, and functional programming

February 1, 2016  ∞

The functional programming community does a poor job of explaining itself, judging from this recent lecture by Brian Kernighan, How to succeed in language design without really trying . At 13:45 , Brian says that he doesn’t know much about functional languages, and that they are little used in the real world outside academia!

Brian is famous for inventing and explaining programming languages, and he spent many years at Bell Laboratories working around people like Dave MacQueen (Standard ML) and Phil Wadler (Haskell). How could this happen?

I guess I blame Wadler :-).

All is not lost, however. I think that Brian does understand functional programming, and he proves it in his own talk.

Two meanings of “functional”

Before getting to the example, I need to say that there are two very different meanings of “functional.” The “pure” functional languages, such as Haskell, use one meaning, and the “impure” or “imperative” functional languages, including LISP, Scheme, and ML, use another. This in itself is surely responsible for a great deal of confusion.

At any rate, in this post, I’m only going to be concerned with the meaning of functional as in LISP, Scheme, and ML. Haskell brings in additional issues (absence of side effects, lazy evaluation, monads) that would obscure my main point.

Awk is a functional program

Brian starts his talk with Awk, a programming language that he invented along with Al Aho and Peter Weinberger. At 3:23 he compares a C program with its Awk equivalent. The C program is about 16 lines long:

#include <stdio.h>
#include <string.h>

int main(void) {
  char line[1000], line2[1000];
  char *p;
  double mag;

  while (fgets(line, sizeof(line), stdin) != NULL) {
    strcpy(line2, line);
    p = strtok(line, "\t");
    p = strtok(NULL, "\t");
    p = strtok(NULL, "\t");
    scanf(p, "%lf", &mag);
    if (mag > 6)     /* $3 > 6 */
      printf("%s", line2);
  }
  return 0;
}

The Awk program, on the other hand, is a single line :

$3 > 6

Very nice. Awk in fact was designed to make many programs one-liners. Aho, Weinberger, and Kernighan noticed that many of the programs they were writing or saw others in the Unix world write had a similar structure, and they didn’t want programmers to have to write this “boilerplate.” So, in Awk, you would simply write the interesting part of the program, the action , which was often just a single line of code.

The boilerplate part was provided by Awk itself, and it has the following structure ( 21:20 ):

for each file
  for each input line
    for each pattern
      if pattern matches input line
        do the action

We can think of Awk as a program that takes as input one of these actions, builds a new program by combining the boilerplate with the action, and runs the new program to produce the output.

Any functional programmer will immediately tell you that this kind of situation is exactly what functional programming is for: Awk is a functional program . And in a functional programming language it would be written something like this:

Awk(action) =
    for each file
      for each input line
        for each pattern
          if pattern matches input line
            do action(fields)

That is, Awk is a function that takes a function (action) as an argument , and applies that function to the fields extracted from each matching line of input to produce each line of output.

This hinges on being able to separate out the action from the boilerplate, which requires functions to be able to accept functions as arguments—which is exactly what a functional programming language gives you. This is much harder to do in a language like C, which Awk is written in. So it is relatively hard to write Awk as, say, a library function in C; instead it is a compiled program of its own, that takes in an action as a text fragment, parses it, and so on.

(Notice that I’m not saying that Awk is a functional programming language ; that’s not the case. Awk does not give you first-class functions. I’m saying that Awk itself can be thought of as an example of a functional program .)

Functional programming is just programming (with first-class functions)

It’s instructive to think about how Al, Peter, and Brian came up with the structure of Awk. They noticed a programming idiom (the boilerplate) that turned up in many programs, and they built Awk as a way of refactoring those programs so that the idiom could be implemented once, robustly and correctly, and did not have to be repeated in all of the programs. What was left of the programs (one line each) was easier to write, read, and understand.

It’s hard to think of anything that is more central to the practice of programming than this. The only thing that is different in this case from what millions of non-functional programmers do every day is that pulling out the Awk idiom requires the use of first-class functions.

Gerald Sussman expressed this beautifully in his foreword to Friedman’s The Little Lisper (1974):

“In Lisp, procedures are first-class data, to be passed as arguments, returned as values, and stored in data structures. This flexibility is valuable, but most importantly, it provides mechanisms for formalizing, naming, and saving the idioms—the common patterns of usage that are essential to engineering design.”

If the main idea of functional programming is so simple, then why is it so hard to understand? I think that it is because what I have described is just the first step on a long path that ends up in a very different place than the starting point. Someone who has learned how to use first-class functions in this way starts to see uses for it (idioms) that they could not see before.

In the words of Benjamin Whorf, as quoted by Brian in his lecture (18:19) :

“Language shapes the way we think and determines what we can think about .”

(The emphasis is mine.)

Unix shell programming is functional programming

I want to close this post with one more example from Unix. In 1986, Jon Bentley wrote two Programming Pearls, here and here , in which he invited Don Knuth to help him introduce Knuth’s notion of Literate Programming . The second column presents Knuth’s literate programming solution to a challenge problem from Bentley:

Given a text file and an integer k , print the k most common words in the file (and the number of their occurrences) in decreasing frequency.

The first part of the column presented Knuth’s program, and the second part of the column consisted of a review of Knuth’s program by Doug McIlroy. (I always thought that this was an ambush, but I never did ask Bentley.)

McIlroy had some pointed criticisms of Knuth’s program, and offered up this 6-line alternative solution:

tr -cs A-Za-z '
' |
tr A-Z a-z |
sort |
sort -rn |
sed ${1}q

McIlroy explains this on page 480 for those unfamiliar with Unix shell programming. He says:

The utilities employed in this trivial solution are Unix staples. They grew up over the years as people noticed useful steps that tended to recur in real problems. Every one was written first for a particular need, but untangled from the specific application.

(The emphasis is again mine.)

McIlroy, like the creators of Awk, and like Sussman, recognizes the central importance of discovering and “untangling” common idioms in programming. And he uses functional programming throughout his solution.

For one thing, consider that sed, like Awk, can be thought of as a function that takes a function as argument (the command/procedure ${1}q ), and applies it to each line of the standard input.

And Unix pipes (McIlroy’s invention) are functional through and through. The pipe operator, | , is a function that takes two functions as arguments and returns a function, their functional composition . For example in

tr A-Z a-z | sort

the first function argument is tr A-Z a-z (the output of the function is the same as the input, except that upper case letters are replaced by their lower case equivalent), and the second function argument sorts its input onto its output. So the return value tr A-Z a-z | sort is a function that first transliterates and then sorts its input.

This is pretty obvious if you have a functional programming background. So, naturally, the idea was explored in detail long ago; see for example the Scheme Shell, Scsh , from the early 1990s. And I’d bet it was done even earlier, in LISP. Possibly before Unix!

What about the real world?

I hope you agree that I have shown that functional programming is in fact used pervasively in the real world, if you consider Unix to be part of the real world.

It’s not hard to find other examples. Every programmer today knows MapReduce, whose creators say

Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages.

Or consider that the most widely used language in the world is probably Javascript, and Javascript is a functional language—Brendan Eich originally intended it to be “Scheme in the browser.” (Brendan talks about the history of Javascript in this podcast ).

Nowadays, first-class functions are being added to many languages, so we may be seeing even more functional programming in the future. However, first-class functions alone do not make a programming language a functional language. I’ll leave that explanation for a later post.





About List