Unix Programming - Applying Minilanguages - Case Study: The Documenter's Workbench Tools
The
troff(1)
typesetting formatter was, as we noted in Chapter 2, Unix's original killer application.
troff is the center of a suite of
formatting tools (collectively called Documenter's Workbench or DWB),
all of which are domain-specific minilanguages of various kinds. Most
are either preprocessors or postprocessors for
troff markup. Open-source Unixes host an
enhanced implementation of Documenter's Workbench called
groff(1),
from the Free Software Foundation.
We'll examine troff in more detail in
Chapter 18; for now,
it's sufficient to note that it is a good example of an imperative
minilanguage that borders on being a full-fledged interpreter (it has
conditionals and recursion but not loops; it is accidentally
Turing-complete).
The postprocessors (‘drivers’ in DWB terminology)
are normally not visible to troff users.
The original troff emitted codes for the
particular typesetter the Unix development group had available in
1970; later in the 1970s these were cleaned up into a
device-independent minilanguage for placing text and simple graphics
on a page. The postprocessors translate this language (called
“ditroff” for “device-independent troff”)
into something modern imaging printers can actually accept — the
most important of these (and the modern default) is PostScript.
The preprocessors are more interesting, because they actually
add capabilities to the troff language.
There are three common ones:
tbl(1)
for making tables,
eqn(1)
for typesetting mathematical equations, and
pic(1)
for drawing diagrams. Less used, but still live, are
grn(1)
for graphics, and
refer(1)
and
bib(1)
for formatting bibliographies. Open-source equivalents of all of these
ship with groff. The
grap(1)
preprocessor provided a rather versatile plotting facility; there is
an open-source implementation separate from
groff.
Some other preprocessors have no open-source implementation and
are no longer in common use. Best known of these was
ideal(1),
for graphics. A younger sibling of the family,
chem(1),
draws chemical structural formulas; it is available as part of Bell
Labs's netlib code.[86]
Each of these preprocessors is a little program that accepts a
minilanguage and compiles it into troff requests. Each one recognizes
the markup it is supposed to interpret by looking for a unique start
and end request, and passes through unaltered any markup outside those
(tbl looks for
.TS/.TE,
pic looks for
.PS/.PE, etc.). Thus, most of the
preprocessors can normally be run in any order without stepping on
each other. There are some exceptions: in particular,
chem and grap
both issue pic commands, and so must come
before it in the pipeline.
cat thesis.ms | chem | tbl | refer | grap | pic | eqn \
| groff -Tps >thesis.ps
The preceding is a full-Monty example of a Documenter's Workbench
processing pipeline,
for a hypothetical thesis incorporating chemical formulas,
mathematical equations, tables, bibliographies, plots, and diagrams.
(The
cat(1)
command simply copies its input or a file argument to its output; we
use it here to emphasize the order of operations.) In practice modern
troff implementations tend to support
command-line options that can invoke at least
tbl(1),
eqn(1)
and
pic(1),
so it isn't necessary to write such an elaborate pipeline. Even if it
were, these sorts of build recipes are normally composed just once and
stashed away in a makefile or shellscript wrapper for repeated use.
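The start/end-request convention described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how tbl or pic are actually implemented: the request pairs `.XS`/`.XE` and `.YS`/`.YE` and the `upcase` and `reverse` transformations are hypothetical stand-ins for real minilanguage compilers. The toy also drops the delimiter lines, which real tools need not do. What it shows is only the pass-through discipline that lets independent preprocessors be chained in either order.

```python
from typing import Callable, Iterable, Iterator, List, Optional


def preprocess(lines: Iterable[str], start: str, end: str,
               compile_block: Callable[[List[str]], List[str]]) -> Iterator[str]:
    """Rewrite only the text between `start` and `end` requests.

    Lines outside those requests are passed through unaltered, so other
    preprocessors in the pipeline still see their own markup intact.
    """
    block: Optional[List[str]] = None       # None means "outside a block"
    for line in lines:
        if block is None:
            if line.startswith(start):
                block = []                  # begin collecting our minilanguage
            else:
                yield line                  # not ours: pass through untouched
        elif line.startswith(end):
            yield from compile_block(block) # emit the compiled replacement
            block = None
        else:
            block.append(line)
    if block is not None:
        raise ValueError("missing %s after %s" % (end, start))


def upcase(block: List[str]) -> List[str]:
    """Hypothetical 'compiler' for the .XS/.XE minilanguage."""
    return [text.upper() for text in block]


def reverse(block: List[str]) -> List[str]:
    """Hypothetical 'compiler' for the .YS/.YE minilanguage."""
    return [text[::-1] for text in block]


doc = [".PP", "intro", ".XS", "hello", ".XE", ".YS", "abc", ".YE"]
x_then_y = list(preprocess(preprocess(doc, ".XS", ".XE", upcase),
                           ".YS", ".YE", reverse))
y_then_x = list(preprocess(preprocess(doc, ".YS", ".YE", reverse),
                           ".XS", ".XE", upcase))
print(x_then_y)
print(x_then_y == y_then_x)
```

Because each stage touches only its own delimited blocks, the two orderings agree. The ordering constraint mentioned earlier arises only when one stage's output is another stage's input language.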
The document markup of Documenter's Workbench is in some ways
obsolete, but the range of problems these preprocessors address gives
some indication of the power of the minilanguage model — it
would be extremely difficult to embed equivalent knowledge into a
WYSIWYG word processor. There are some ways in which modern XML-based
document markups and toolchains are still, in 2003, playing
catch-up with capabilities that Documenter's Workbench had in 1979.
We'll discuss these issues in more detail in Chapter 18.
The design themes that gave Documenter's Workbench so much power
should by now be familiar ones; all the tools share a common
text-stream representation of documents, and the formatting system is
broken up into independent components that can be debugged and
improved separately. The pipeline architecture supports plugging in
new, experimental preprocessors and postprocessors without disturbing
old ones. It is modular and extensible.
The architecture of Documenter's Workbench as a whole teaches us
some things about how to fit multiple specialist minilanguages into a
cooperating system. One preprocessor can build on another. Indeed,
the Documenter's Workbench tools were an early exemplar of the power
of pipes, filtering, and minilanguages that influenced a lot of later
Unix design by example. The design of the individual preprocessors
has more lessons to teach about what effective minilanguage designs
look like.
One of these lessons is negative. Sometimes users writing
descriptions in the minilanguages do unclean things with low-level
troff markup inserted by hand. This can
produce interactions and bugs that are hard to diagnose, because the
generated troff coming out of the pipeline
is not visible — and would not be readable if it were. This is
analogous to the sorts of bugs that happen in code that mixes C with
snippets of in-line assembler. It might have been better to separate
the language layers more completely, if that were possible.
Minilanguage designers should take note of this.
All the preprocessor languages (though not troff markup itself)
have relatively clean, shell-like syntaxes that follow many of the
conventions we described in Chapter 5 for the design of data-file formats.
There are a few embarrassing exceptions; notably,
tbl(1)
defaults to using a tab as a field separator between table columns,
replicating an infamous botch in the design of
make(1)
and causing annoying bugs when editors or other tools invisibly change
the composition of whitespace.
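The failure mode is easy to reproduce. The sketch below is illustrative Python, not tbl itself, and the row contents are invented for the example. It splits a row on a tab, then repeats the split after the tab has been expanded to spaces, as an editor might do silently.

```python
def split_columns(row: str, sep: str = "\t") -> list:
    """Split one table row into fields on the separator character."""
    return row.split(sep)


row = "Ken Thompson\tBell Labs"
print(split_columns(row))                 # two fields

# Expanding tabs leaves the line looking the same on screen...
mangled = row.expandtabs(8)
print(split_columns(mangled))             # ...but the column boundary is gone

# A visible, non-whitespace separator survives the same treatment.
safe = "Ken Thompson;Bell Labs".expandtabs(8)
print(split_columns(safe, sep=";"))       # still two fields
```

tbl does let a table choose a different separator character through its `tab(x)` option, which is the usual defense against this problem.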
While troff itself is a specialized
imperative language, one theme that runs through at least three of the
Documenter's Workbench minilanguages is declarative semantics: doing
layout from constraints. This is an idea that shows up in modern GUI
toolkits as well — that, instead of giving pixel coordinates for
graphical objects, what you really want to do is declare spatial
relationships among them (“widget A is above widget B, which is
to the left of widget C”) and have your software compute a
best-fit layout for A, B, and C according to those constraints.
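To make the idea concrete, here is a deliberately small Python sketch of layout from declared relationships. It is an assumption-laden toy, not the algorithm of any real toolkit: the relation names `above` and `left-of`, the widget sizes, and the `layout` function are all invented for illustration, and it does no best-fit optimization or consistency checking. It simply propagates positions outward from one anchored widget.

```python
from typing import Dict, List, Tuple

Size = Tuple[int, int]              # (width, height)
Point = Tuple[int, int]             # top-left corner; y grows downward
Constraint = Tuple[str, str, str]   # e.g. ("A", "above", "B")


def layout(sizes: Dict[str, Size], constraints: List[Constraint],
           anchor: str) -> Dict[str, Point]:
    """Compute coordinates from declared spatial relationships."""
    pos: Dict[str, Point] = {anchor: (0, 0)}
    pending = list(constraints)
    while pending:
        remaining = []
        for a, rel, b in pending:
            if rel not in ("above", "left-of"):
                raise ValueError("unknown relation: " + rel)
            horizontal = rel == "left-of"
            if a in pos and b in pos:
                continue                        # toy: no consistency check
            if a in pos:                        # place b just after a
                x, y = pos[a]
                w, h = sizes[a]
                pos[b] = (x + w, y) if horizontal else (x, y + h)
            elif b in pos:                      # place a just before b
                x, y = pos[b]
                w, h = sizes[a]
                pos[a] = (x - w, y) if horizontal else (x, y - h)
            else:
                remaining.append((a, rel, b))   # neither placed yet; retry
        if len(remaining) == len(pending):
            raise ValueError("constraints do not connect to the anchor")
        pending = remaining
    return pos


sizes = {"A": (40, 20), "B": (40, 20), "C": (60, 20)}
rules = [("A", "above", "B"), ("B", "left-of", "C")]
print(layout(sizes, rules, anchor="B"))
```

The caller never writes a coordinate for A or C; changing a widget's size changes the computed positions without touching the declarations.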
The
pic(1)
program uses this approach to lay out elements for diagrams. The
language taxonomy diagram at Figure 8.1 was produced with
the pic source code in Example 8.4
[87] run through pic2graph, one of our case studies in Chapter 7.
Example 8.4. Taxonomy of languages: the pic source.
# Minilanguage taxonomy
#
# Base ellipses
define smallellipse {ellipse width 3.0 height 1.5}
M: ellipse width 3.0 height 1.8 fill 0.2
line from M.n to M.s dashed
D: smallellipse() with .e at M.w + (0.8, 0)
line from D.n to D.s dashed
I: smallellipse() with .w at M.e - (0.8, 0)
#
# Captions
"" "Data formats" at D.s
"" "Minilanguages" at M.s
"" "Interpreters" at I.s
#
# Heads
arrow from D.w + (0.4, 0.8) to D.e + (-0.4, 0.8)
"flat to structured" "" at last arrow.c
arrow from M.w + (0.4, 1.0) to M.e + (-0.4, 1.0)
"declarative to imperative" "" at last arrow.c
arrow from I.w + (0.4, 0.8) to I.e + (-0.4, 0.8)
"less to more general" "" at last arrow.c
#
# The arrow of loopiness
arrow from D.w + (0, 1.2) to I.e + (0, 1.2)
"increasing loopiness" "" at last arrow.c
#
# Flat data files
"/etc/passwd" ".newsrc" at 0.5 between D.c and D.w
# Structured data files
"SNG" at 0.5 between D.c and M.w
# Datafile/minilanguage borderline cases
"regexps" "Glade" at 0.5 between M.w and D.e
# Declarative minilanguages
"m4" "Yacc" "Lex" "make" "XSLT" "pic" "tbl" "eqn" \
at 0.5 between M.c and D.e
# Imperative minilanguages
"fetchmail" "awk" "troff" "Postscript" at 0.5 between M.c and I.w
# Minilanguage/interpreter borderline cases
"dc" "bc" at 0.5 between I.w and M.e
# Interpreters
"Emacs Lisp" "JavaScript" at 0.25 between M.e and I.e
"sh" "tcl" at 0.55 between M.e and I.e
"Perl" "Python" "Java" at 0.8 between M.e and I.e
This is a very typical Unix minilanguage design, and as such has
some points of interest even on the purely syntactic level. Notice
how much it looks like a shell program: # leads comments, and
the syntax is obviously token-oriented with the simplest possible
convention for strings. The designer of
pic(1)
knew that Unix programmers expect minilanguage syntaxes to look like
this unless there is a strong and specific reason they should not.
The Rule of Least Surprise is in full operation here.
It probably doesn't take a lot of effort to discern that the
first line of code is a macro definition; the later references to
smallellipse() encapsulate a repeated design
element of the diagram. Nor will it take much scrutiny to deduce that
box invis declares a box with invisible borders,
actually just a frame for text to be stacked inside. The
arrow command is equally obvious.
With these as clues and one eye on the actual diagram, the
meaning of the remaining pieces of the syntax (position references
like M.s and constructions like
last arrow or at
0.25 between M.e and I.e or the addition of vector offsets
to a location) should become rapidly apparent. As with
Glade markup and m4, an
example like this one can teach a good bit of the language without any
reference to a manual (a compactness property
troff(1)
markup, unfortunately, does
not
have).
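For readers who prefer arithmetic to inference, the position expressions in the example reduce to simple vector operations. The Python below is a sketch of that arithmetic only; the `Vec` and `Ellipse` classes and the `between` function are names invented here, and placing M's center at the origin is an assumption made for convenience (it does not affect relative positions). It mirrors three lines of the listing: the definition of M, the placement of I `with .w at M.e - (0.8, 0)`, and the expression `0.25 between M.e and I.e`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vec:
    x: float
    y: float

    def __add__(self, other: "Vec") -> "Vec":
        return Vec(self.x + other.x, self.y + other.y)

    def __sub__(self, other: "Vec") -> "Vec":
        return Vec(self.x - other.x, self.y - other.y)


@dataclass(frozen=True)
class Ellipse:
    c: Vec            # center
    width: float
    height: float

    def corner(self, name: str) -> Vec:
        """Compass point such as 'n', 's', 'e', or 'w'."""
        dx, dy = {"n": (0, 1), "s": (0, -1), "e": (1, 0), "w": (-1, 0)}[name]
        return self.c + Vec(dx * self.width / 2, dy * self.height / 2)

    @staticmethod
    def with_corner(name: str, at: Vec, width: float, height: float) -> "Ellipse":
        """Build an ellipse whose named compass point lies at `at`."""
        offset = Ellipse(Vec(0, 0), width, height).corner(name)
        return Ellipse(at - offset, width, height)


def between(frac: float, a: Vec, b: Vec) -> Vec:
    """Point the given fraction of the way from a to b."""
    return Vec(a.x + frac * (b.x - a.x), a.y + frac * (b.y - a.y))


# M: ellipse width 3.0 height 1.8
M = Ellipse(Vec(0, 0), 3.0, 1.8)
# I: smallellipse() with .w at M.e - (0.8, 0)
I = Ellipse.with_corner("w", M.corner("e") - Vec(0.8, 0), 3.0, 1.5)
# ... at 0.25 between M.e and I.e
label = between(0.25, M.corner("e"), I.corner("e"))
print(I.c, label)
```

The overlap of the ellipses falls out of the `- (0.8, 0)` offset: I's west point sits 0.8 units inside M's east edge.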
The example of
pic(1)
reflects a common design theme in minilanguages, which we also
saw reflected in Glade — the use of a
minilanguage interpreter to encapsulate some form of constraint-based
reasoning and turn it into actions. We could actually choose to view
pic(1)
as an imperative language rather than a declarative one; it has
elements of both, and the dispute would quickly grow
theological.
The combination of macros with constraint-based layout gives
pic(1)
the ability to express the structure of diagrams in a way that more
modern vector-based markups like SVG cannot. It is therefore
fortunate that one effect of the Documenter's Workbench design is to
make it relatively easy to keep
pic(1)
useful outside the DWB context. The pic2graph script we used as a case study in
Chapter 7 was an
ad-hoc way to accomplish this, using the retrofitted PostScript
capability of
groff(1)
as a half-way step to a modern bitmap format.
A cleaner solution is the
pic2plot(1)
utility distributed with the GNU plotutils
package, which
exploited the internal modularity of the GNU
pic(1)
code. The code was split into a parsing front end and a back end that
generated troff markup, the two communicating through a layer of
drawing primitives. Because this design obeyed the Rule of
Modularity,
pic2plot(1)
implementers were able to split off the GNU
pic parsing stage and reimplement the
drawing primitives using a modern plotting library. Their solution
has the disadvantage, however, that text in the output is generated
with fonts built into pic2plot that won't
match those of troff.
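The split described above, a parsing front end and interchangeable back ends joined by a layer of drawing primitives, can be sketched as follows. This Python is an illustration of the pattern only and makes no claim about the structure of the GNU pic or pic2plot sources: the one-statement input language (`line x1 y1 x2 y2`), the `Primitive` tuple, and the made-up `DRAW` text format are all assumptions invented for the example. The second back end emits ordinary SVG `<line>` elements.

```python
from typing import List, Tuple

Primitive = Tuple[str, float, float, float, float]   # ("line", x1, y1, x2, y2)


def parse(source: str) -> List[Primitive]:
    """Front end: turn 'line x1 y1 x2 y2' statements into primitives."""
    prims: List[Primitive] = []
    for raw in source.splitlines():
        text = raw.split("#", 1)[0].strip()           # '#' leads a comment
        if not text:
            continue
        op, *args = text.split()
        if op != "line" or len(args) != 4:
            raise ValueError("cannot parse: " + raw)
        x1, y1, x2, y2 = (float(a) for a in args)
        prims.append((op, x1, y1, x2, y2))
    return prims


def render_text(prims: List[Primitive]) -> List[str]:
    """Back end 1: a made-up textual drawing request per primitive."""
    return ["DRAW %s %g %g %g %g" % p for p in prims]


def render_svg(prims: List[Primitive]) -> List[str]:
    """Back end 2: the same primitives as SVG line elements."""
    return ['<line x1="%g" y1="%g" x2="%g" y2="%g" stroke="black"/>' % p[1:]
            for p in prims]


source = """
# a single diagonal stroke
line 0 0 1 1.5
"""
prims = parse(source)
print(render_text(prims))
print(render_svg(prims))
```

Neither back end knows anything about the input syntax, and the parser knows nothing about output formats, so either side can be replaced without touching the other.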