How to get things done with awk?

Author: Sakari Mattila
Updated: 30-Mar-1999
Updated: 22-Jan-1996

awk is a pattern matching program. It takes two inputs: a data file and a command file. The data file contains text, that is, lines containing words. The data need not be ordinary text; any data composed of character groups (words) and lines is suitable. The default character group separator is the space, but other separators may be defined on the awk command line. More on the awk command line is at the end of this document. The command file contains the pattern matching instructions; it is comparable to an ordinary computer program. Feel free to think of awk as an interpreter executing the commands from the command file on the data file.

awk comes with all Unix and Linux operating systems. It is a command line utility. awk is also included in several Unix-like utility packages for MS Windows 95/98, MS Windows NT and other operating systems. Cygwin is one source of these packages. awk source code in C is available with full Linux packages and GNU packages.

awk can be used to extract parts from a large text body, to format text and to extract information for other programs. It is a very versatile program, especially when used as a part of a pipe. It is good practice to filter all extra control characters out of the awk input text file, because non-printable control characters cause errors with some awk versions. In Unix systems, the tr (translate characters) program is suitable for removing offending characters. See the tr manual pages or the tr instructions at the end of the Short sed guide for more information.
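The filtering step mentioned above might look like the following sketch. The input text is made up, and the set of characters kept (tab, newline, carriage return and printable ASCII) is an assumption; adjust it to your data.

```shell
# Keep only tab (\11), newline (\12), carriage return (\15) and
# printable ASCII (\40-\176); -c complements the set, -d deletes.
clean=$(printf 'good\001 text\007\n' | tr -cd '\11\12\15\40-\176')
echo "$clean"
```

The cleaned text can then be piped straight into awk.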

Each command is usually on one line. There are three execution types of commands:
1. starting commands, first word BEGIN, which are executed only once, before the first line of input is read;
2. pattern matching commands, each of which is executed once for each line in the data file; and
3. ending commands, first word END, which are executed only once, after the last line of input has been read.
Pattern matching commands are executed in order from top to bottom, like reading the program. Lines are read from the data file one by one.
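A minimal sketch of the three command types, run from a shell on a made-up three-line input. The variable name count is only an illustration.

```shell
# BEGIN runs before the input, the middle command once per line,
# END after the last line.
order=$(printf 'alpha\nbeta\ngamma\n' | awk '
BEGIN { print "start" }
      { count = count + 1 }
END   { print "lines read:", count }')
echo "$order"
```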

All commands see only a single line from the data file at a time, plus all awk program variables. The whole line is subject to pattern matching. It is also automatically loaded into special variables: $0 is the whole line, $1 is the first word, $2 the second word and so on.

The awk matching command consists of two parts: patterns and commands. The pattern part holds zero or more patterns to be matched against the line from the data file. If the whole pattern part is missing, the command matches every line and is always executed. A pattern consists of the character "/", a regular expression and the character "/". More on regular expressions is below, but until then, imagine the regular expression being just a text string. A typical pattern may look like:

/Smith /

Several patterns may be put into the pattern part of a pattern matching command, combined with the logical operators "!" (logical negation), "&&" (logical and) and "||" (logical or). A typical combined pattern may look like:

/Smith / || /Jones /

That would match if either text string "Smith " or "Jones " is found on the line. A space between "/" and the text is significant. For example, the pattern

/Jon/

matches "Jon", "Jones" and any word starting with "Jon". Because there is no space between "/" and "J", it also matches the letters "Jon" within a word, like "Newton-Jones".
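The effect of spaces and logical operators can be tried on a small made-up list of names. The names and the words "paid" and "owes" are assumptions for illustration only.

```shell
people='Bob Smith paid
Amy Jones paid
Eve Newton-Jones owes
Jon Snow owes
Tim Smithson owes'

# "Smith " needs a space after the name, so "Smithson" is not matched.
either=$(printf '%s\n' "$people" | awk '/Smith / || /Jones /')
# "Jon" matches inside words too; "!" drops the lines containing "paid".
debtors=$(printf '%s\n' "$people" | awk '/Jon/ && !/paid/')
echo "$either"
echo "$debtors"
```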

There are more advanced forms of pattern; please see the Unix awk man page or an awk textbook.

The command may be missing; then the default command, print the line, is executed. The command is always between curly brackets "{" and "}". Brackets may be nested and may be used to extend the command over several lines. A multi-line command acts as if it were one long line. A complete awk command line may look like:

/Smith / || /Jones / { print NR, $0 }

NR is a predefined variable containing the line number and $0 is the contents of the whole line. Thus this command prints the line number and the original line.
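Run from a shell on a made-up three-line input, the command line above behaves as follows. Only the input text is an assumption.

```shell
# Lines 1 and 3 match, line 2 does not; NR still counts every line read.
numbered=$(printf 'Bob Smith paid\nJon Snow owes\nAmy Jones paid\n' |
    awk '/Smith / || /Jones / { print NR, $0 }')
echo "$numbered"
```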

awk is typeless, which means that you can put anything, a number or a character string, into its variables and awk tries to make some sense out of that. There are several built-in variables:

NF        number of words on this line
NR        number of the record, i.e. the line number
FILENAME  name of the input file

FS        input field separator (space or tab character)
RS        input record (line) separator (newline)
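A short sketch of NR, NF and the field separator in use. The colon-separated input is made up, and the -F option sets FS from the command line.

```shell
# With -F: the words are separated by ":" instead of spaces.
fields=$(printf 'root:x:0\ndaemon:x:1\n' | awk -F: '{ print NR, NF, $1 }')
echo "$fields"
```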

The command may contain several instructions separated by ";". The common awk instructions are:

variable1 = variable2 or constant
print (variables or constants to be printed)
printf ("format string", variables to be printed)
if (condition) command1 [ else command2 ]
while (condition) command
for (command1; (condition); command2) command3

These instructions behave like the similar instructions in the C language. Please be careful with "=", which is assignment, and "==", which is the equality operator in the condition parts of instructions. It is possible to refine pattern matching with conditional statements, or even to replace the whole pattern part. An if-statement can compare full words or the results of string functions, and it can also test a regular expression with the match operator "~", as in if ($1 ~ /Jon/). More on awk instructions is near the end of this text.

Comments in awk start with the character "#" and continue to the end of the line. If an awk version does not accept them, a comment can be imitated with an assignment of a character string:

{ comment = "This is a comment text" }

Such an assignment is executed for every input line, so "#" comments are preferable wherever they are available.

awk tries to execute commands as far as it is somehow possible, and the results may be astonishing. Syntactic checking in awk is minimal. In addition to errors in awk commands, unusual input characters or unexpected data may confuse an awk program.

The common awk functions are:

substr(string, first-char, number-of-chars)
int (numeric-variable)
exp (numeric-variable)
log (numeric-variable)
sqrt (numeric-variable)

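substr returns number-of-chars characters of string, starting at position first-char; the first character of a string is number 1. For the numeric functions, check C programming language manuals for details. Note that the C-like part of awk is a very small subset of C.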
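The functions can be tried on a single made-up input line, as in this sketch:

```shell
# substr counts characters from 1; int drops the fractional part.
funcs=$(echo 'awkward 16 7.9' |
    awk '{ print substr($1, 1, 3), sqrt($2), int($3) }')
echo "$funcs"
```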

awk is stateless, that is, it treats each new input line in the same way. However, you can use variables and conditional instructions to create states. It is important to know how, because most real tasks need states.

The most practical way to create a state system (state machine) is to reserve one variable as the state variable. It must be initialised in the BEGIN command, and its value then changes according to the patterns matched. Because all command lines, at least their pattern parts, are always executed for each input line, state dependent commands must be guarded with if-instructions.

Some awk implementations do not allow BEGIN or END commands. This is rarely a problem for initialisation, because awk sets all variables to zero or the empty string when the execution of the program starts.

An awk program performs the given operation and consists of one or more awk statements. The following awk program extracts letters sent by a given machine from a standard Unix mail file. The system has two states: state 0 is searching and state 1 is printing the selected mail. Variable p is the state variable. Note that the end pattern may be part of the start pattern when there are two letters from the selected machine immediately after each other. Thus the order of pattern matching statements is important. If several statements are expected to match the same line, they must be in order of selectivity, most selective last.

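Here is the sample program consisting of five awk statements. In the third statement, machine-name is a placeholder for the name of the sending machine:

BEGIN { print FILENAME; p = 0 }
/From / && / 199/ { p = 0 }
/From / && /machine-name/ { p = 1 }
{ if (p > 0) print $0 }
END { print "End ..." }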

The first line is executed only once, before the text file is read. Many awk versions have not yet set FILENAME at that point, so the first print may give an empty line. The second line is a conditional return to the search state. The third line is a conditional entry to the printing state. The fourth line is the guarded print statement, without a pattern part; it is active only when the program is in the printing state. The fifth line is executed only once, at the end of the text file. In practice the end condition above is not selective enough. It is likely that the text fragments "From " and " 199" also appear in a letter body, which would end the printing of the letter. The main problem in this case is the missing end-of-letter mark: the end of a letter is only known when the program finds the beginning of the next letter.
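The state machine can be tried on a made-up mailbox, as in the sketch below. The machine names alpha and beta, the addresses and the letter texts are all assumptions, and the BEGIN and END prints are left out to keep the output short.

```shell
# Made-up mailbox: one letter from machine "alpha", one from "beta".
mailbox='From ann@alpha Mon Jan 22 10:00:00 1996
Hello from alpha
From bob@beta Tue Jan 23 11:00:00 1996
Hello from beta'

selected=$(printf '%s\n' "$mailbox" | awk '
BEGIN                { p = 0 }
/From / && / 199/    { p = 0 }   # any letter header: back to searching
/From / && /@alpha / { p = 1 }   # header from the chosen machine: print
                     { if (p > 0) print $0 }')
echo "$selected"
```

Only the first letter is printed, because the header of the second letter returns the program to state 0 and does not match the start pattern.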

Regular expressions are a way to define conditional character strings. One regular expression may equal, that is match, several different character strings. A regular expression is a character string which contains ordinary characters and metacharacters denoting one or more real characters.

Common regular expressions are composed in the following way:

text      Text as written
.         Any one character
[string]  Any one character in the string "string"
[a-k]     Any one character from "a" to "k"

*         Zero or more repeats of the previous regular expression
^         At the beginning of a regular expression, limits it to the beginning of the line
$         At the end of a regular expression, limits it to the end of the line
\         Removes the metacharacter's special meaning
( )       Grouping brackets, as in mathematics

Please note that the pattern part of an awk command is the natural place for regular expressions, which makes it very flexible in finding character patterns inside words. In the command part, a regular expression can be tested with the match operator "~", as in if ($2 ~ /^Jon/); an if-instruction with the substring function may also imitate simple regular expressions to some degree. In practice regular expressions should be as simple as possible. Complex regular expressions are difficult to debug.
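A few of the metacharacters above, tried on a made-up word list. The labels printed before each word are only for illustration.

```shell
# Each input line is tested against all three patterns in order.
matches=$(printf 'apple\nbanana\ncherry\nfig.\n' | awk '
/^[a-c]/ { print "starts with a-c:", $0 }
/^b.*a$/ { print "b ... a:", $0 }
/\.$/    { print "ends with a dot:", $0 }')
echo "$matches"
```

Note that "banana" is reported twice, because it matches both the first and the second pattern.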

The details of awk instructions are the following. You separate instructions with a semicolon ";" and form one composite instruction from several instructions by putting them within curly brackets "{" and "}". See the examples for more details. The assignment statement is:

variable1 = variable2 or constant

The content of variable2 or the constant is copied into variable1; the old content of variable1 is lost. Variables and constants can contain numbers or strings. Some examples:
first = "first"
first = $1
first = 1
second = first
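A short sketch of assignments at work on a made-up input line. It also shows the typeless nature of awk: the same variable first holds a number and then a string.

```shell
# first starts as the number 12, is copied and incremented into second,
# and is then overwritten with a string.
typed=$(echo '12 apples' |
    awk '{ first = $1; second = first + 1; first = "text"; print first, second }')
echo "$typed"
```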

There are two print statements:

print (variables or constants to be printed)
printf ("format string", variables or constants to be printed)

print is the usual print command. Variables and constants to be printed are separated with commas ",", which appear as single spaces in the output, and each print statement prints one line. Some examples:
print ( $0 )
print ( "This was input line: ", $0 )
print ( "This was first word: ", $1, " and this second: ", $2 )
print ( "The content of variable _first_ is: ", first )
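Tried on a made-up two-word input line, two print statements give two output lines:

```shell
# Each print ends its output with a newline; commas become single spaces.
printed=$(echo 'hello world' |
    awk '{ print("first:", $1); print("second:", $2) }')
echo "$printed"
```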

You use printf for formatted printing. printf prints everything on one line if you don't put a newline character into the format string, into the constants to be printed or into the variables to be printed. The format string contains text and conversion specifications. A conversion specification consists of the percent character "%", zero or more flags, a field width, a precision, a conversion specifier and a conversion qualifier. The percent character and the conversion specifier must be there; the others are optional. Conversion specifications correspond one by one to the variables or constants to be printed. The most common conversion specifiers are:
%d signed integer
%e signed fractional with exponent
%f signed fractional without exponent
%s sequence of characters in corresponding variable or constant
%% percent character

Escape sequences are used to format printf output. Most common escape sequences are:
\f form feed, new page
\n new line (\012)
\r carriage return, overprint
\t horizontal tab
\v vertical tab
\' single quote
\" double quote
\\ backslash
\0 null (character value 000)
\a alert, bell
\b backspace
\040 space
\ddd octal notation
\xddd hexadecimal notation

Some printf examples:
printf("%d %e %f", A, B, C)
printf("%s %d %% %s", "It is", proof, " alcohol")
printf("First line, number %d \n and second line, number %d", n1, n2)
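A sketch of field widths and precision in a format string, using a made-up input line. The "-" flag left-justifies, the numbers 6 and 5 are field widths and ".2" is the precision.

```shell
# The "|" characters are ordinary text that makes the field widths visible.
formatted=$(echo 'beer 5 0.33' |
    awk '{ printf("%-6s|%5d|%6.2f\n", $1, $2, $3) }')
echo "$formatted"
```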

The conditional statement, also known as the if-else statement, is:

if (condition) command1 [ else command2 ]

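If the condition is true, command1 will be executed; if the condition is false, command2 will be executed. The whole else-part may be left out. When command1 is not enclosed in curly brackets and else follows on the same line, command1 must end with a semicolon. Some examples:
if ( a==0 ) print("Zero"); else print("Not zero")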
if ( a==0 ) { print("Zero") } else { print("Not zero")}
if ( a==0 ) print("Zero")
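The conditional statement can be tried on made-up numeric input, here with the first word of each line in place of the variable a:

```shell
# One input line is zero, the other is not.
signs=$(printf '0\n5\n' |
    awk '{ if ($1 == 0) print "Zero"; else print "Not zero" }')
echo "$signs"
```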

Repeating statements are:

while (condition) command
for (command1; (condition); command2) command3

while repeats command as long as condition is true. Something in command must make the condition false sooner or later, otherwise awk will keep repeating forever. If the condition is false when entering this instruction, command will not be executed at all.

The for statement sets initial values with command1, tests the terminating condition with condition and changes the values with command2. command3 is executed as many times as command1, condition and command2 allow. command3 should not change any values set in command1 or command2 or tested in condition. Please remember the brackets in the for-statement.

With the while statement you cannot set any initial values; with the for statement you can. You use while in indefinite cases and for when you know how many times the command shall be repeated. If that sounds complicated, see the examples:
while ( i < 10 ) i = i + 1
for ( i=0; ( i < 10 ); i=i+1 ) print i
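A sketch of both loops on a made-up input line. The for loop runs a known number of times (once per word, NF times); the while loop runs until its condition fails. The variable names sum and n are only illustrations.

```shell
# for: add up all words of the line.
# while: double n until it is no longer smaller than the sum.
loops=$(echo '3 4 5' | awk '{
    sum = 0
    for (i = 1; (i <= NF); i = i + 1) sum = sum + $i
    n = 1
    while (n < sum) n = n * 2
    print sum, n
}')
echo "$loops"
```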

Please refer to C-language books for more information on these commands. The commands are not exactly the same as in C, but the differences are small. A good reference to the C language is P. J. Plauger & Jim Brodie: Standard C, ISBN 1-55615-158-6.

A typical command line to run awk is:

awk -f program.AWK inputfile >outputfile

The option -F sets the character group separator, for example -F: for colon-separated data. There are several other ways to run awk programs; please see the awk manual page or awk help.
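The whole cycle, from command file to output file, might look like the sketch below. The file contents are made up, and the temporary directory created with mktemp is an assumption about the system at hand.

```shell
workdir=$(mktemp -d)

# Command file: print the line number and the second word of each line.
printf '%s\n' '{ print NR ": " $2 }' > "$workdir/program.awk"
# Data file with ":" as the word separator.
printf 'ann:42\nbob:17\n' > "$workdir/inputfile"

awk -F: -f "$workdir/program.awk" "$workdir/inputfile" > "$workdir/outputfile"
result=$(cat "$workdir/outputfile")
rm -r "$workdir"
echo "$result"
```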