lex and
yacc each generate code for a single
function — respectively, “get a token from the input
stream” and “parse a sequence of tokens to see if it
matches a grammar”. Usually, the
yacc-generated parser function calls a
lex-generated tokenizer function each time it wants to get another
token. If there are no user-written C callbacks at all in the
yacc-generated parser, all it will do is a
syntax check; the value returned will tell the caller if the input
matched the grammar it was expecting.
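A minimal sketch of that pure syntax-check case might look like the following (the file name main.c and the wiring are illustrative; it assumes the generated parser follows the usual yyparse() convention of returning zero on success):

    /* main.c: drive a yacc-generated parser as a pure syntax checker.
       Assumes y.tab.c supplies yyparse() and a lex-generated scanner
       supplies yylex(); no user actions are attached to the grammar. */
    #include <stdio.h>

    extern int yyparse(void);   /* generated by yacc */

    int main(void)
    {
        /* yyparse() returns 0 if the input matched the grammar,
           nonzero if it stumbled over a syntax error. */
        if (yyparse() == 0) {
            printf("input is well-formed\n");
            return 0;
        }
        printf("syntax error\n");
        return 1;
    }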
More usually, the user's C code, embedded in the generated
parser, populates some runtime data structures as a side-effect of
parsing the input. If the minilanguage is declarative, your
application can use these runtime data structures directly. If the
minilanguage is imperative, the data structures might include a parse
tree that is immediately fed to some kind of evaluation function.
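To illustrate the declarative case, here is a hypothetical yacc fragment for a stream of name = value settings; the embedded C actions do nothing but record each pair through set_option(), an assumed application helper:

    /* settings.y: hypothetical grammar for lines of the form  name = value;
       The C code in braces runs as each rule is recognized and fills in
       application data structures as a side effect of parsing. */
    %{
    #include <stdio.h>
    void set_option(const char *name, int value);   /* assumed helper */
    int yylex(void);
    void yyerror(const char *msg);
    %}

    %union {
        char *str;
        int   num;
    }
    %token <str> NAME
    %token <num> NUMBER

    %%
    settings:   /* empty */
            |   settings setting
            ;

    setting:    NAME '=' NUMBER ';'   { set_option($1, $3); }
            ;
    %%

    void yyerror(const char *msg)
    {
        fprintf(stderr, "parse error: %s\n", msg);
    }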
yacc has a rather ugly interface,
through exported global variables with the name prefix
yy. This is because it predates structs in C; indeed,
yacc predates C itself, and the first
implementation was written in C's predecessor B. The crude
though effective algorithm yacc-generated
parsers use to try to recover from parse errors (pop tokens until an
explicit error production is matched) can also lead to
problems, including memory leaks.
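A few of the conventional yy-prefixed pieces through which the generated code and your code communicate are listed below; the exact set varies among yacc, lex, and their descendants bison and flex:

    /* The conventional yy interface; details differ by implementation. */
    #include <stdio.h>
    #include "y.tab.h"       /* from "yacc -d": token codes, YYSTYPE */

    extern FILE *yyin;       /* lex: stream the tokenizer reads from     */
    extern char *yytext;     /* lex: text of the most recent token       */
    extern YYSTYPE yylval;   /* shared: semantic value of that token     */
    extern int yynerrs;      /* yacc: number of syntax errors seen       */
    extern int yydebug;      /* yacc: set nonzero to get a parse trace   */

    int yylex(void);                /* generated by lex, called by yyparse() */
    int yyparse(void);              /* generated by yacc                     */
    void yyerror(const char *msg);  /* supplied by the user, called on error */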
If you are building parse trees, using malloc to make nodes, and
you start popping things off the stack in error recovery, you don't
get to recover (free) the storage. In general, Yacc can't do it,
since it doesn't know enough about what's on the stack. If the yacc
parser were in C++, it could assume that the values were classes and
“destruct” them. In “real” compilers, parse
tree nodes are generated using an arena-based allocator, so the nodes
don't leak, but there is a logical leak anyway that needs to be
thought about to make industrial-strength error recovery.
--
Steve Johnson
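The arena technique Johnson mentions is easy to sketch: parse-tree nodes are carved out of big blocks that are released in one shot after the parse, successful or not, so nothing has to be tracked node by node. A minimal, hypothetical version (not part of yacc itself):

    #include <stdlib.h>

    struct arena_block {
        struct arena_block *prev;   /* previously filled block */
        size_t used, size;
        char data[];                /* C99 flexible array member */
    };

    static struct arena_block *arena;

    /* Hand out n bytes from the current block, starting a new block
       when the current one is full. */
    void *arena_alloc(size_t n)
    {
        n = (n + 15) & ~(size_t)15;   /* round up so allocations stay aligned */
        if (arena == NULL || arena->used + n > arena->size) {
            size_t size = n > 4096 ? n : 4096;
            struct arena_block *a = malloc(sizeof *a + size);
            if (a == NULL)
                abort();
            a->prev = arena;
            a->used = 0;
            a->size = size;
            arena = a;
        }
        void *p = arena->data + arena->used;
        arena->used += n;
        return p;
    }

    /* Release every block at once, e.g. after the parse (or a failed one). */
    void arena_free_all(void)
    {
        while (arena != NULL) {
            struct arena_block *prev = arena->prev;
            free(arena);
            arena = prev;
        }
    }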
lex is a lexical analyzer generator.
It's a member of the same functional family as
grep(1)
and
awk(1),
but more powerful because it enables you to arrange for arbitrary C
code to be executed on each match. It accepts a declarative
minilanguage and emits skeleton C code.
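A minimal, hypothetical lex specification looks like this; each pattern is paired with a fragment of C that runs when the pattern matches (the token-code values are arbitrary):

    /* tokens.l: hypothetical lex input.  Each rule pairs a regular
       expression with C code to run on a match; here the code just
       hands a token code back to whatever called yylex(). */
    %{
    enum { NUMBER = 258, NAME };   /* token codes returned to the caller */
    %}

    %%
    [0-9]+                    { return NUMBER; }
    [A-Za-z_][A-Za-z0-9_]*    { return NAME; }
    [ \t\n]+                  { /* skip whitespace */ }
    .                         { return yytext[0]; /* punctuation passes through */ }
    %%

    int yywrap(void) { return 1; }   /* "no more input" hook expected by lex */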
A crude but useful way to think about what a
lex-generated tokenizer does is as a sort
of inverse
grep(1).
Where
grep(1)
takes a single regular expression and returns a list of matches in the
incoming data stream, a lex-generated tokenizer is built from a whole
list of regular expressions, and each call to it reports which of
those expressions occurs next in the data stream.
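Driving such a tokenizer standalone is a simple loop; each call to yylex() reports which pattern matched next (this sketch assumes a specification like the one above):

    /* Call the lex-generated tokenizer in a loop; yylex() returns the
       token code for the next match, and 0 at end of input. */
    #include <stdio.h>

    extern int yylex(void);
    extern char *yytext;

    int main(void)
    {
        int tok;
        while ((tok = yylex()) != 0)
            printf("token code %d, text \"%s\"\n", tok, yytext);
        return 0;
    }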
Splitting input analysis into tokenizing input and parsing
the token stream is a useful tactic even if you're not using Yacc and
Lex and your “tokens” are nothing like the usual ones in
a compiler. More than once I've found that splitting input handling
into two levels made the code much simpler and easier to understand,
despite the complexity added by the split itself.
--
Henry Spencer
If you are attacking any kind of pattern-recognition or
state-machine problem in which all the possible input stimuli will fit
in a byte, lex may enable you to generate
code that will be more efficient and reliable than a hand-crafted
state machine.
lex generates tokenizers that are up
to an order of magnitude slower than hand-coded ones. This is
not a good reason to hand-code, however; it's an argument for
prototyping with lex and hand-hacking
only if prototyping reveals an actual bottleneck.
yacc is a parser generator. It, too,
was written to automate part of the job of writing compilers. It
takes as input a grammar specification in a declarative minilanguage
resembling BNF (Backus-Naur Form) with C code associated with each
element of the grammar. It generates code for a parser function
that, when called, accepts text matching the grammar from an input
stream. As each grammar element is recognized, the parser function
runs the associated C code.
The combination of lex and
yacc is very effective for writing language
interpreters of all kinds. Though most Unix programmers never get to
do the kind of general-purpose compiler-building that these tools were
meant to assist, they're extremely useful for writing parsers for
run-control file syntaxes and domain-specific minilanguages.
lex-generated tokenizers are very
fast at recognizing low-level patterns in input streams, but the
regular-expression minilanguage that lex
knows is not good at counting things or at recognizing recursively
nested structures. For parsing those, you want
yacc. On the other hand, while you
theoretically could write a yacc grammar to
do its own token-gathering, the grammar to specify that would be
hugely bloated and the parser extremely slow. For tokenizing input,
you want lex. Thus, these tools are
symbiotic.
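In practice the symbiosis is wired up by letting yacc define the token codes and having the lex specification pick them up, while recursion on the yacc side handles the nesting that regular expressions cannot. A hypothetical pair of fragments:

    /* expr.y (yacc side): recursion in the grammar handles the nesting. */
    %{
    int yylex(void);
    void yyerror(const char *msg);
    %}
    %token NUMBER
    %left '+'
    %%
    expr:   NUMBER
        |   '(' expr ')'
        |   expr '+' expr
        ;
    %%

    /* expr.l (lex side): picks up the token codes that "yacc -d"
       wrote into y.tab.h, so both halves agree on what NUMBER means. */
    %{
    #include "y.tab.h"
    %}
    %%
    [0-9]+      { return NUMBER; }
    [ \t\n]+    { /* skip whitespace */ }
    .           { return yytext[0]; /* '(' ')' '+' pass through as themselves */ }
    %%

On traditional Unixes, yacc -d expr.y produces y.tab.c and y.tab.h, lex expr.l produces lex.yy.c, and linking the two with -ly -ll borrows a stock main(), yyerror(), and yywrap() from the yacc and lex support libraries.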