This is an often-overlooked strength of the Unix
tradition. Many of its most effective tools are thin wrappers around
a direct translation of some single powerful algorithm.
Perhaps the clearest example of this is
diff(1),
the Unix tool for reporting differences between related files. This
tool and its dual,
patch(1),
have become central to the network-distributed development style of
modern Unix. A valuable property of diff is that it seldom surprises
anyone. It doesn't have special cases or painful edge conditions,
because it uses a simple, mathematically sound method of sequence
comparison. This has consequences:
By virtue of a mathematical model and a solid
algorithm, Unix diff contrasts markedly with its
imitators. First, the central engine is solid, small, and
has never needed one line of maintenance. Second, the
results are clear and consistent, unmarred by surprises
where heuristics fail.
--Doug McIlroy
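The simple, mathematically sound method in question is the
longest-common-subsequence problem; the classic diff engine (the
Hunt-McIlroy algorithm) is an efficient refinement of it. The Python
sketch below is purely illustrative and is not diff's actual
implementation: it uses the textbook dynamic-programming formulation to
find a longest common subsequence of two line lists, then emits
everything outside that subsequence as insertions and deletions.

    def lcs_table(a, b):
        """Dynamic-programming table: L[i][j] = length of an LCS of a[:i] and b[:j]."""
        L = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a, 1):
            for j, y in enumerate(b, 1):
                L[i][j] = L[i-1][j-1] + 1 if x == y else max(L[i-1][j], L[i][j-1])
        return L

    def simple_diff(a, b):
        """Walk the table backwards, yielding a naive edit script
        ('+ ' insert, '- ' delete, '  ' keep)."""
        L, i, j, out = lcs_table(a, b), len(a), len(b), []
        while i > 0 or j > 0:
            if i > 0 and j > 0 and a[i-1] == b[j-1]:
                out.append("  " + a[i-1]); i -= 1; j -= 1
            elif j > 0 and (i == 0 or L[i][j-1] >= L[i-1][j]):
                out.append("+ " + b[j-1]); j -= 1
            else:
                out.append("- " + a[i-1]); i -= 1
        return list(reversed(out))

    if __name__ == "__main__":
        old = ["the quick", "brown fox", "jumps over", "the lazy dog"]
        new = ["the quick", "red fox", "jumps over", "the lazy dog", "again"]
        print("\n".join(simple_diff(old, new)))

Real diff replaces this quadratic table with a far more economical
computation, but the observable behavior is the same: the output falls
out of a well-defined optimization problem rather than a pile of
special cases.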
Thus, people who use diff can develop an intuitive feel for what
it will do in any given situation without necessarily understanding
the central algorithm perfectly. Other well-known examples of
this special kind of clarity achieved through a strong central
algorithm abound in Unix:
The yacc(1) utility for
generating language parsers is a thin wrapper around the formal theory
of LR(1) grammars. Its partner, the lexical analyzer generator
lex(1),
is a similarly thin wrapper around the theory of nondeterministic
finite-state automata.
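To make the lex(1) observation concrete, here is a toy illustration in
Python (a hand-rolled sketch, not anything lex itself generates; real
lex compiles its automata into C tables): a nondeterministic
finite-state automaton for the regular expression (a|b)*abb, simulated
by tracking the set of states the machine could be in after each
input character.

    # A hand-built NFA for (a|b)*abb, as a transition relation:
    # (state, symbol) -> set of possible next states.
    # State 0 is the start state; state 3 is the only accepting state.
    NFA = {
        (0, "a"): {0, 1},   # loop on (a|b)*, or commit to the trailing "abb"
        (0, "b"): {0},
        (1, "b"): {2},
        (2, "b"): {3},
    }
    START, ACCEPT = {0}, {3}

    def nfa_match(text):
        """Simulate the NFA by tracking every state it could be in at once."""
        states = set(START)
        for ch in text:
            states = set().union(*(NFA.get((s, ch), set()) for s in states))
            if not states:          # no live states left: reject early
                return False
        return bool(states & ACCEPT)

    if __name__ == "__main__":
        for s in ["abb", "babb", "aababb", "ab", "abba"]:
            print(s, nfa_match(s))

Production lex goes one step further and converts the nondeterministic
machine into a deterministic table before emitting code, but the formal
foundation is exactly this.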
All three of these programs are so bug-free that their correct
functioning is taken utterly for granted, and compact enough to fit
easily in a programmer's hand. Only part of these good qualities is due
to the polishing that comes with a long service life and frequent use;
most of it comes from the fact that, having been constructed around a
strong and provably correct algorithmic core, they never needed much
polishing in the first place.
The opposite of a formal approach is using
heuristics—rules of thumb leading
toward a solution that is probabilistically, but not certainly,
correct. Sometimes we use heuristics because a deterministically
correct solution is impossible. Think of spam filtering, for example;
an algorithmically perfect spam filter would need a full solution to
the problem of understanding natural language as a module. Other
times, we use heuristics because known formally correct methods are
impossibly expensive. Virtual-memory management is an example of
this; there are near-perfect solutions, but they require so much
runtime instrumentation that their overhead would swamp any
theoretical gain over heuristics.
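To give the virtual-memory example a little more shape, here is a
simplified Python sketch of one such heuristic, the classic "clock"
(second-chance) replacement policy; the class and method names are
hypothetical, and a real kernel works on page frames and hardware
reference bits rather than Python lists. Instead of tracking exact
recency, which is roughly what the near-perfect schemes demand, it
keeps one reference bit per frame and evicts the first page it finds
whose bit is clear.

    class ClockCache:
        """Second-chance ("clock") replacement: a cheap heuristic
        approximation of least-recently-used eviction."""

        def __init__(self, nframes):
            self.frames = [None] * nframes       # page held in each frame
            self.referenced = [False] * nframes  # one reference bit per frame
            self.hand = 0                        # clock hand position

        def access(self, page):
            if page in self.frames:              # hit: just set the reference bit
                self.referenced[self.frames.index(page)] = True
                return "hit"
            # Miss: sweep the hand, giving referenced pages a second chance.
            while self.referenced[self.hand]:
                self.referenced[self.hand] = False
                self.hand = (self.hand + 1) % len(self.frames)
            self.frames[self.hand] = page        # evict and replace
            self.referenced[self.hand] = True
            self.hand = (self.hand + 1) % len(self.frames)
            return "miss"

    if __name__ == "__main__":
        cache = ClockCache(3)
        for p in [1, 2, 3, 1, 4, 1, 5, 2]:
            print(p, cache.access(p))

The sketch is cheap to run but sometimes evicts the "wrong" page;
that imprecision is exactly the trade a heuristic makes.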
The trouble with heuristics is that they proliferate special
cases and edge cases. If nothing else, you usually have to backstop
a heuristic with some sort of recovery mechanism when it fails. All
the usual problems with escalating complexity follow. To manage the
resulting tradeoffs, you have to start by being aware of them. Always
ask if a heuristic actually pays off in performance what it costs
in code complexity — and don't guess at the performance
difference, actually measure it before making a decision.
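In the spirit of that last sentence, the measurement does not have to
be elaborate. A sketch like the following, using Python's standard
timeit module with two placeholder functions standing in for the formal
method and the heuristic being weighed against it, is often enough to
replace a guess with a number.

    import timeit

    def exact_version(data):
        """Placeholder for the formally correct implementation."""
        return sorted(data)

    def heuristic_version(data):
        """Placeholder for the heuristic shortcut being evaluated."""
        data = list(data)
        data.sort()
        return data

    if __name__ == "__main__":
        data = list(range(10_000, 0, -1))
        for fn in (exact_version, heuristic_version):
            t = timeit.timeit(lambda: fn(data), number=200)
            print(f"{fn.__name__:20s} {t:.3f}s for 200 runs")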