This is an often-overlooked strength of the Unix
tradition. Many of its most effective tools are thin wrappers around
a direct translation of some single powerful algorithm.
Perhaps the clearest example of this is
diff(1),
the Unix tool for reporting differences between related files. This
tool and its dual,
patch(1),
have become central to the network-distributed development style of
modern Unix. A valuable property of diff is that it seldom surprises
anyone. It doesn't have special cases or painful edge conditions,
because it uses a simple, mathematically sound method of sequence
comparison. This has consequences:
By virtue of a mathematical model and a solid
algorithm, Unix diff contrasts markedly with its
imitators. First, the central engine is solid, small, and
has never needed one line of maintenance. Second, the
results are clear and consistent, unmarred by surprises
where heuristics fail.
--Doug McIlroy
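The simple, mathematically sound method in question is the
longest-common-subsequence problem; the classic diff engine (the
Hunt-McIlroy algorithm) is an efficient refinement of it. The Python
sketch below is purely illustrative and is not diff's actual
implementation: it uses the textbook dynamic-programming formulation to
find a longest common subsequence of two line lists, then emits
everything outside that subsequence as insertions and deletions.

    def lcs_table(a, b):
        """Dynamic-programming table: L[i][j] = length of an LCS of a[:i] and b[:j]."""
        L = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a, 1):
            for j, y in enumerate(b, 1):
                L[i][j] = L[i-1][j-1] + 1 if x == y else max(L[i-1][j], L[i][j-1])
        return L

    def simple_diff(a, b):
        """Walk the table backwards, yielding a naive edit script
        ('+ ' insert, '- ' delete, '  ' keep)."""
        L, i, j, out = lcs_table(a, b), len(a), len(b), []
        while i > 0 or j > 0:
            if i > 0 and j > 0 and a[i-1] == b[j-1]:
                out.append("  " + a[i-1]); i -= 1; j -= 1
            elif j > 0 and (i == 0 or L[i][j-1] >= L[i-1][j]):
                out.append("+ " + b[j-1]); j -= 1
            else:
                out.append("- " + a[i-1]); i -= 1
        return list(reversed(out))

    if __name__ == "__main__":
        old = ["the quick", "brown fox", "jumps over", "the lazy dog"]
        new = ["the quick", "red fox", "jumps over", "the lazy dog", "again"]
        print("\n".join(simple_diff(old, new)))

Real diff replaces this quadratic table with a far more economical
computation, but the observable behavior is the same: the output falls
out of a well-defined optimization problem rather than a pile of
special cases.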
Thus, people who use diff can develop an intuitive feel for what
it will do in any given situation without necessarily understanding
the central algorithm perfectly. Other well-known examples of
this special kind of clarity achieved through a strong central
algorithm abound in Unix:
The yacc(1) utility for
generating language parsers is a thin wrapper around the formal theory
of LR(1) grammars. Its partner, the lexical analyzer generator
lex(1),
is a similarly thin wrapper around the theory of nondeterministic
finite-state automata.
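To make the lex(1) observation concrete, here is a toy illustration in
Python (a hand-rolled sketch, not anything lex itself generates; real
lex compiles its automata into C tables): a nondeterministic
finite-state automaton for the regular expression (a|b)*abb, simulated
by tracking the set of states the machine could be in after each
input character.

    # A hand-built NFA for (a|b)*abb, as a transition relation:
    # (state, symbol) -> set of possible next states.
    # State 0 is the start state; state 3 is the only accepting state.
    NFA = {
        (0, "a"): {0, 1},   # loop on (a|b)*, or commit to the trailing "abb"
        (0, "b"): {0},
        (1, "b"): {2},
        (2, "b"): {3},
    }
    START, ACCEPT = {0}, {3}

    def nfa_match(text):
        """Simulate the NFA by tracking every state it could be in at once."""
        states = set(START)
        for ch in text:
            states = set().union(*(NFA.get((s, ch), set()) for s in states))
            if not states:          # no live states left: reject early
                return False
        return bool(states & ACCEPT)

    if __name__ == "__main__":
        for s in ["abb", "babb", "aababb", "ab", "abba"]:
            print(s, nfa_match(s))

Production lex goes one step further and converts the nondeterministic
machine into a deterministic table before emitting code, but the formal
foundation is exactly this.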
All three of these programs are so bug-free that their correct
functioning is taken utterly for granted, and compact enough to fit
easily in a programmer's hand. Only part of these good qualities is due
to the polishing that comes with a long service life and frequent use;
most of it comes from the fact that, having been constructed around a
strong and provably correct algorithmic core, they never needed much
polishing in the first place.
The opposite of a formal approach is using
heuristics—rules of thumb leading
toward a solution that is probabilistically, but not certainly,
correct. Sometimes we use heuristics because a deterministically
correct solution is impossible. Think of spam filtering, for example;
an algorithmically perfect spam filter would need a full solution to
the problem of understanding natural language as a module. Other
times, we use heuristics because known formally correct methods are
impossibly expensive. Virtual-memory management is an example of
this; there are near-perfect solutions, but they require so much
runtime instrumentation that their overhead would swamp any
theoretical gain over heuristics.
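To give the virtual-memory example a little more shape, here is a
simplified Python sketch of one such heuristic, the classic "clock"
(second-chance) replacement policy; the class and method names are
hypothetical, and a real kernel works on page frames and hardware
reference bits rather than Python lists. Instead of tracking exact
recency, which is roughly what the near-perfect schemes demand, it
keeps one reference bit per frame and evicts the first page it finds
whose bit is clear.

    class ClockCache:
        """Second-chance ("clock") replacement: a cheap heuristic
        approximation of least-recently-used eviction."""

        def __init__(self, nframes):
            self.frames = [None] * nframes       # page held in each frame
            self.referenced = [False] * nframes  # one reference bit per frame
            self.hand = 0                        # clock hand position

        def access(self, page):
            if page in self.frames:              # hit: just set the reference bit
                self.referenced[self.frames.index(page)] = True
                return "hit"
            # Miss: sweep the hand, giving referenced pages a second chance.
            while self.referenced[self.hand]:
                self.referenced[self.hand] = False
                self.hand = (self.hand + 1) % len(self.frames)
            self.frames[self.hand] = page        # evict and replace
            self.referenced[self.hand] = True
            self.hand = (self.hand + 1) % len(self.frames)
            return "miss"

    if __name__ == "__main__":
        cache = ClockCache(3)
        for p in [1, 2, 3, 1, 4, 1, 5, 2]:
            print(p, cache.access(p))

The sketch is cheap to run but sometimes evicts the "wrong" page;
that imprecision is exactly the trade a heuristic makes.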
The trouble with heuristics is that they proliferate special
cases and edge cases. If nothing else, you usually have to backstop
a heuristic with some sort of recovery mechanism when it fails. All
the usual problems with escalating complexity follow. To manage the
resulting tradeoffs, you have to start by being aware of them. Always
ask if a heuristic actually pays off in performance what it costs
in code complexity — and don't guess at the performance
difference, actually measure it before making a decision.
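In the spirit of that last sentence, the measurement does not have to
be elaborate. A sketch like the following, using Python's standard
timeit module with two placeholder functions standing in for the formal
method and the heuristic being weighed against it, is often enough to
replace a guess with a number.

    import timeit

    def exact_version(data):
        """Placeholder for the formally correct implementation."""
        return sorted(data)

    def heuristic_version(data):
        """Placeholder for the heuristic shortcut being evaluated."""
        data = list(data)
        data.sort()
        return data

    if __name__ == "__main__":
        data = list(range(10_000, 0, -1))
        for fn in (exact_version, heuristic_version):
            t = timeit.timeit(lambda: fn(data), number=200)
            print(f"{fn.__name__:20s} {t:.3f}s for 200 runs")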