Section 11.5
A Simple Recursive-descent Parser
I HAVE ALWAYS been fascinated by language --
both natural languages like English and the artificial languages
that are used by computers. There are many difficult questions about
how languages can convey information, how they are structured, and
how they can be processed. Natural and artificial languages are similar
enough that the study of programming languages, which are pretty well
understood, can give some insight into the much more complex and difficult
natural languages. And programming languages raise more than enough interesting
issues to make them worth studying in their own right. How can it be,
after all, that computers can be made to "understand" even the relatively
simple languages
that are used to write programs? Computers, after all, can only directly
use instructions expressed in very simple machine language. Higher level
languages must be translated into machine language. But the translation
is done by a compiler, which is just a program. How could such a translation
program be written?
Natural and artificial languages are similar in that they have
a structure known as grammar or syntax. Syntax can be expressed by
a set of rules that describe what it means to be a legal sentence or
program. For programming languages, syntax rules are often expressed
in BNF (Backus-Naur Form), a system that
was developed by computer scientists John Backus and Peter Naur in the
late 1950s. Interestingly, an equivalent system was developed independently
at about the same time by linguist Noam Chomsky to describe the grammar of
natural language. BNF cannot express all possible syntax rules. For
example, it can't express the fact that a variable must be defined before
it is used. Furthermore, it says nothing about the meaning or semantics
of the langauge. The problem of specifying the semantics of a language --
even of an artificial programming langauge -- is one that is still far from being
completely solved. However, BNF does express the basic structure of
the language, and it plays a central role in the design of translation programs.
In English, terms such as "noun", "transitive verb," and "propositional phrase" are
syntactic categories that describe building
blocks of sentences. Similarly, "statement", "number," and "while loop"
are syntactic categories that describe building blocks of Java programs.
In BNF, a syntactic category is written as a word enclosed between
"<" and ">". For example: <noun>,
<verb-phrase>, or <while-loop>.
A rule in BNF specifies the structure of
an item in a given syntactic category, in terms of other syntactic categories
and/or basic symbols of the language. For example, one BNF rule for the
English language might be
<sentence> ::= <noun-phrase> <verb-phrase>
The symbol "::=" is read "can be", so this rule says that
a <sentence> can be a <noun-phrase> followed
by a <verb-phrase>. (The term is "can be" rather than "is"
because there might be other rules that specify other possible forms for a
sentence.) This rule can be thought of as a recipe for a sentence:
If you want to make a sentence, make a noun-phrase and follow it by a
verb-phrase. Noun-phrase and verb-phrase must, in turn, be defined by
other BNF rules.
In BNF, a choice between alternatives is represented by the symbol "|",
which is read "or". For example, the rule
<verb-phrase> ::= <intransitive-verb> |
( <transitive-verb> <noun-phrase> )
says that a <verb-phrase> can be an <intransitive-verb>, or it can be
or a <transitive-verb> followed by a <noun-phrase>. Note also that
parentheses can be used for grouping. To express the fact that an item
is optional, it can be enclosed between "[" and "]".
An optional item that can be repeated one or more times is enclosed
between "[" and "]...". And a symbol that is an actual
part of the language that is being described is enclosed in quotes.
For example,
<noun-phrase> ::= <common-noun> [ "that" <verb-phrase> ] |
<common-noun> [ <propositional-phrase> ]...
says that a <noun-phrase> can be a <common-noun>,
optionally followed by the literal word "that" and a <verb-phrase>,
or it can be a <common-noun> followed by zero or more
<propositional-phrase>'s. Obviously, we can describe very complex
structures in this way. The real power comes from the fact that BNF rules
can be recursive. In fact, the two preceding rules, taken together, are recursive.
A <noun-phrase> is defined partly in terms of <verb-phrase>,
while <verb-phrase> is defined partly in terms of <noun-phrase>.
For example, a <noun-phrase> might be "the rat that ate
the cheese", since "ate the cheese" is a <verb-phrase>.
But then we can, recursively, make the more complex <noun-phrase>
"the cat that caught the rat that ate the cheese" out of the <common-noun> "the cat",
the word "that" and the <verb-phrase> "caught the rat that ate the cheese".
Building from there, we can make the <noun-phrase>
"the dog that chased the cat that caught the rat that ate the cheese".
The recursive structure of language is one of the most fundamental properties of language,
and the ability of BNF to express this recursive
structure is what makes it so useful.
BNF can be used to describe the syntax of a programming language such
as Java in a formal and precise way. For example, a <while-loop>
can be defined as
<while-loop> ::= "while" "(" <condition> ")" <statement>
This says that a <while-loop> consists of the
word "while", followed by a left parenthesis, followed by a <condition>,
followed by a right parenthesis, followed by a <statement>.
Of course, it still remains to define what is meant by a condition and
by a statement. Since a statement can be, among other things, a while
loop, we can already see the recursive structure of the Java language.
The exact specification of an if statement, which is hard to
express clearly in words, can be given as
<if-statement> ::=
"if" "(" <condition> ")" <statement>
[ "else" "if" "(" <condition> ")" <statement> ]...
[ "else" <statement> ]
This rule makes it clear that the "else" part is optional and
that there can be, optionally, one or more "else if" parts.
In the rest of this section, I will show how a BNF grammar for a language
can be used as a guide for constructing a parser. A parser is a program that
determines the grammatical structure of a phrase in the language. This is the
first step to determining the meaning of the phrase -- which for a programming
language means translating it into machine language. Although we will look
at only a simple example, I hope it will be enough to convince you that
compilers can in fact be written and understood by mortals and to give
you some idea of how that can be done.
The parsing method that we will use is called recursive
descent parsing. It is not the only possible parsing method, or the
most efficient, but it is the one most suited for writing compilers by hand
(rather than with the help of so called "parser generator" programs).
In a recursive descent parser, every rule of the BNF grammar is the model
for a subroutine. Not every BNF grammar is suitable for recursive descent parsing.
The grammar must satisfy a certain property. Essentially, while parsing a
phrase, it must be possible to tell what syntactic category is coming up next
just by looking at the next item in the input. Many grammars are designed with this
property in mind.
I should also mention that many variations of BNF are in use. The one that
I've described here is one that is well-suited for recursive descent parsing.
When we try to parse a phrase that contains a syntax error, we need some
way to respond to the error. A convenient way of doing this is to throw
an exception. I'll use an exception class called ParseError,
defined as follows:
class ParseError extends Exception {
// Represents a syntax error detected while parsing.
ParseError(String message) {
// Construct a ParseError object containing the
// given string as its error message.
super(message); // (Call constructor from superclass.)
}
}
Another general point is that our BNF rules don't say anything about
spaces between items, but in reality we want to be able to insert spaces
between items at will. To allow for this, I'll always call the following
routine before trying to look ahead to see what's coming up next in input:
static void skipBlanks() {
// Skip over blanks and tabs in standard input.
while ( TextIO.peek() == ' ' || TextIO.peek() == '\t' )
TextIO.getAnyChar();
}
Let's start with a very simple example. A "fully parenthesized expression"
can be specified in BNF by the rules
<expression> ::= <number> |
"(" <expression> <operator> <expression> ")"
<operator> ::= "+" | "-" | "*" | "/"
where <number> refers to any positive real number. An example of
a fully parenthesized expression is "(((34-17)*8)+(2*7))".
Since every operator corresponds to a pair of parentheses, there is no ambiguity
about the order in which the operators are to be applied. Suppose we want a program
that will read and evaluate such expressions.
We'll read the expressions from standard
input, using TextIO. To apply recursive descent parsing, we need a subroutine for each rule in the grammar.
Corresponding to the rule for <operator>, we get a subroutine
that reads an operator. The operator can be a choice of any of four things.
Any other input will be an error.
static char getOperator() throws ParseError {
// If the next character in input is one of the legal operators,
// read it and return it. Otherwise, throw a ParseError.
skipBlanks(); // Skip past any blanks and tabs.
char op = TextIO.peek(); // Look ahead at the next character.
if ( op == '+' || op == '-' || op == '*' || op == '/' ) {
TextIO.getAnyChar(); // Read the character.
return op;
}
else if (op == '\n')
throw new ParseError("Missing operator at end of line.");
else
throw new ParseError("Missing operator. Found \"" +
op + "\" instead of +, -, *, or /.");
} // end getOperator()
I've tried to give a reasonable error message, depending on whether the
next character is an end-of-line or something else. I use TextIO.peek()
to look ahead at the next character before I read it, and I call skipBlanks()
before testing TextIO.peek() in order to ignore any blanks that separate
items. I will follow this same pattern in every case.
When we come to the subroutine for <expression>, things are a little more
interesting. The rule says that an expression can be either a number or
an expression enclosed in parentheses. We can tell which it is by looking ahead
at the next character. If the character is a digit, we have to read a number.
If the character is a "(", we have to read the "(", followed by an expression,
followed by an operator, followed by another
expression, followed by a ")". If the next character is anything else, there is
an error. Note that we need recursion to read the nested expressions.
The routine doesn't just read the expression. It also computes and
returns its value. This requires semantical information that is not
specified in the BNF rule.
static double expressionValue() throws ParseError {
// Read an expression from the current line of input and
// return its value.
skipBlanks();
if ( Character.isDigit(TextIO.peek()) ) {
// The next item in input is a number, so the expression
// must consist of just that number. Read and return
// the number.
return TextIO.getDouble();
}
else if ( TextIO.peek() == '(' ) {
// The expression must be of the form
// "(" <expression> <operator> <expression> ")"
// Read all these items, perform the operation, and
// return the result.
TextIO.getAnyChar(); // Read the "("
double leftVal = expressionValue(); // First expression.
char op = getOperator(); // The operator.
double rightVal = expressionValue(); // Second expression.
skipBlanks();
if ( TextIO.peek() != ')' ) {
// According to the rule, there must be a ")" here.
// Since it's missing, throw a ParseError.
throw new ParseError("Missing right parenthesis.");
}
TextIO.getAnyChar(); // Read the ")"
switch (op) { // Apply the operator and return the result.
case '+': return leftVal + rightVal;
case '-': return leftVal - rightVal;
case '*': return leftVal * rightVal;
case '/': return leftVal / rightVal;
default: return 0; // (Actually, this can't occur.)
}
}
else {
throw new ParseError("Encountered unexpected character, \"" +
TextIO.peek() + "\" in input.");
}
} // end expressionValue()
I hope that you can see how this routine corresponds to the BNF rule.
Where the rule uses "|" to give a choice between alternatives, there
is an if statement in the routine to determine which choice to
take. Where the rule contains a sequence of items,
"(" <expression> <operator>
<expression> ")", there is a sequence of statements in
the subroutine to read each item in turn.
When expressionValue() is called to evaluate the
expression (((34-17)*8)+(2*7)), it sees the "(" at the
beginning of the input, so the else part of the if
statement is executed. The "(" is read. Then the first recursive
call to expressionValue() reads and evaluates the subexpression
((34-17)*8), the call to getOperator() reads the "+" operator,
and the second recursive call to expressionValue() reads and
evaluates the second subexpression (2*7). Finally, the ")"
at the end of the expression is read. Of course, reading the first
subexpression, ((34-17)*8), involves further recursive calls
to the expressionValue() routine, but it's better not to think
too deeply about that! Rely on the recursion to handle the details.
You'll find a complete program that uses these routines in the
file SimpleParser1.java.
Fully parenthesized expressions aren't very natural for people to use.
But with ordinary expressions, we have to worry about the question of
operator precedence, which tells us, for example, that the "*"
in the expression "5+3*7" is applied before the "+". The complex
expression "3*6+8*(7+1)/4-24" should be seen as made up of
three "terms", 3*6, 8*(7+1)/4, and 24,
combined with "+" and "-" operators. A term, on the other hand, can be
made up of several factors combined with "*" and "/"
operators. For example, 8*(7+1)/4 contains the factors
8, (7+1) and 4. This example also shows that
a factor can be either a number or an expression in parentheses.
To complicate things a bit more, we allow for leading minus signs in
expressions, as in "-(3+4)" or "-7". (Since a <number> is a positive number,
this is the only way we can get negative numbers. It's done this way to
avoid "3 * -7", for example.)
This structure can be expressed by the BNF rules
<expression> ::= [ "-" ] <term> [ ( "+" | "-" ) <term> ]...
<term> ::= <factor> [ ( "*" | "/" ) <factor> ]...
<factor> ::= <number> | "(" <expression> ")"
The first rule uses the "[ ]..." notation, which says that the
items that it encloses can occur zero, one, two, or more times. This means
that an <expression> can begin, optionally, with a "-".
Then there must be a <term>
which can optionally be followed by one of the operators "+" or "-" and
another <term>, optionally followed by another operator and
<term>, and so on. In a subroutine that reads and evaluates
expressions, this repetition is handled by a while loop. An if
statement is used at the beginning of the loop to test whether a
leading minus sign is present:
static double expressionValue() throws ParseError {
// Read an expression from the current line of input and
// return its value.
skipBlanks();
boolean negative; // True if there is a leading minus sign.
negative = false;
if (TextIO.peek() == '-') {
TextIO.getAnyChar(); // Read the "-"
negative = true;
}
double val; // Value of the expression.
val = termValue(); // Get the value of the first term.
if (negative)
val = -val; // Apply the leading "-" operator.
skipBlanks();
while ( TextIO.peek() == '+' || TextIO.peek() == '-' ) {
// There is a "+" or "-" followed by another term.
// Read the next term and add it to or subtract it from
// the value of previous terms in the expression.
char op = TextIO.getAnyChar(); // Read the operator.
double nextVal = termValue(); // Read the next term.
if (op == '+')
val += nextVal;
else
val -= nextVal;
skipBlanks();
}
return val;
} // end expressionValue()
The subroutine for <term> is very similar to this, and the
subroutine for <factor> is similar to the example given above
for fully parenthesized expressions. A complete program that reads and
evaluates expressions based on the above BNF rules
can be found in the file SimpleParser2.java.
Now, so far, we've only evaluated expressions. What does that have to
do with translating programs into machine language? Well, instead of
actually evaluating the expression, it would be almost as easy to generate
the machine language instructions that are needed to evaluate the expression.
If we are working with a "stack machine", these instructions would be
stack operations such as "push a number" or "apply a + operation".
The program SimpleParser3.java
can both evaluate the expression and print a list of stack machine
operations for evaluating the expression. Here is an applet that simulates
the program:
It's quite a jump from this program to a recursive descent parser that
can read a program written in Java and generate the equivalent machine
language code -- but the conceptual leap is not huge.
The SimpleParser3 program doesn't actually generate the stack
operations directly as it parses an expression. Instead, it builds
an expression tree, as discussed in the previous section,
to represent the expression. The expression tree is then used to find
the value and to generate the stack operations. The tree is made up of
nodes belonging to classes ConstNode and BinOpNode
that are similar to those given in the previous section. Another class,
UnaryMinusNode, has been introduced to represent the unary minus
operation. I've added a method, printStackCommands(), to each class.
This method is responsible for printing out the stack operations that are
necessary to evaluate an expression. Here for example is the new
BinOpNode class from SimpleParser3.java:
class BinOpNode extends ExpNode {
// An expression node representing a binary operator.
char op; // The operator.
ExpNode left; // The expression for its left operand.
ExpNode right; // The expression for its right operand.
BinOpNode(char op, ExpNode left, ExpNode right) {
// Construct a BinOpNode containing the specified data.
this.op = op;
this.left = left;
this.right = right;
}
double value() {
// The value is obtained by evaluating the left and right
// operands and combining the values with the operator.
double x = left.value();
double y = right.value();
switch (op) {
case '+': return x + y;
case '-': return x - y;
case '*': return x * y;
case '/': return x / y;
default: return Double.NaN; // Bad operator!
}
}
void printStackCommands() {
// To evaluate the expression on a stack machine, first do
// whatever is necessary to evaluate the left operand,
// leaving the answer on the stack. Then do the same thing
// for the right operand. Then apply the operator (which
// means popping the operands, applying the operator, and
// pushing the result).
left.printStackCommands();
right.printStackCommands();
TextIO.putln(" Operator " + op);
}
} // end class BinOpNode
It's also interesting to look at the new parsing subroutines. Instead of
computing a value, each subroutine builds an expression tree. For example, the subroutine
corresponding to the rule for <expression> becomes
static ExpNode expressionTree() throws ParseError {
// Read an expression from the current line of input and
// return an expression tree representing the expression.
skipBlanks();
boolean negative; // True if there is a leading minus sign.
negative = false;
if (TextIO.peek() == '-') {
TextIO.getAnyChar();
negative = true;
}
ExpNode exp; // The expression tree for the expression.
exp = termTree(); // Start with a tree for first term.
if (negative) {
// Build the tree that corresponds to applying a
// unary minus operator to the term we've
// just read.
exp = new UnaryMinusNode(exp);
}
skipBlanks();
while ( TextIO.peek() == '+' || TextIO.peek() == '-' ) {
// Read the next term and combine it with the
// previous terms into a bigger expression tree.
char op = TextIO.getAnyChar();
ExpNode nextTerm = termTree();
// Create a tree that applies the binary operator
// to the previous tree and the term we just read.
exp = new BinOpNode(op, exp, nextTerm);
skipBlanks();
}
return exp;
} // end expressionTree()
In some real compilers, the parser creates a tree to represent the
program that is being parsed. This tree is called a parse
tree. Parse trees are somewhat different in form from expression
trees, but the purpose is the same. Once you have the tree, there are a number
of things you can do with it. For one thing, it can be used to generate
machine language code. But there are also techniques for examining the
tree and detecting certain types of programming errors, such as an attempt
to reference a local variable before it has been assigned a value. (The
Java compiler, of course, will reject the program if it contains such an
error.) It's also possible to manipulate the tree to optimize
the program. In optimization, the tree is transformed to make the program
more efficient before the code is generated.
And so we wind up back where we started in Chapter 1, looking
at programming languages, compilers, and machine language. But looking at
them, I hope, with a lot more understanding and a much wider perspective.
End of Chapter 11