In Chapter10 we
will discuss a different set of conventions used for program
run-control files, but you should notice that it will share some of
these same rules (especially about the lexical level, the rules by
which characters are assembled into tokens).
One record per newline-terminated line, if
possible. This makes it easy to extract records with
text-stream tools. For data interchange with other operating systems,
it's wise to make your file-format parser indifferent to whether
the line ending is LF or CR-LF. It's also conventional to ignore
trailing whitespace in such formats; this protects against common
editor bobbles.
Less than 80 characters per line, if possible. This
makes the format browseable in an ordinary-sized terminal window.
If many records must be longer than 80 characters, consider a
stanza format (see below).
Use # as an introducer for comments.
It is good to have a way to embed annotations and comments in
data files. It's best if they're actually part of the file
structure, and so will be preserved by tools that know its format.
For comments that are not preserved during parsing, # is the
conventional start character.
Support the backslash convention. The least
surprising way to support embedding nonprintable control characters
is by parsing C-like backslash escapes — \n for a newline, \r for a carriage return, \t for a tab, \b
for backspace, \f for formfeed,
\e for ASCII escape (27), \nnn or \onnn or
\0nnn for the character with octal
value nnn, \xnn for the character with hexadecimal value
nn, \dnnn for the character with decimal value
nnn, \\
for a literal backslash. A newer convention, but one worth following,
is the use of \unnnn for a hexadecimal
Unicode literal.
In one-record-per-line formats, use colon or any run
of whitespace as a field separator. The colon convention
seems to have originated with the Unix password file. If your fields
must contain instances of the separator(s), use a backslash as the
prefix to escape them.
Do not allow the distinction between tab and
whitespace to be significant. This is a recipe for serious
headaches when the tab settings on your users' editors are different;
more generally, it's confusing to the eye. Using tab alone as a field
separator is especially likely to cause problems; allowing any run of
tabs and spaces to be a field separator, on the other hand, works
well.
Favor hex over octal. Hex-digit pairs and
quads are easier to eyeball-map into bytes and today's 32- and 64-bit
words than octal digits of three bits each; also marginally more
efficient. This rule needs emphasizing because some older Unix tools
such as
od(1)
violate it; that's a legacy from the instruction field sizes in the
machine languages of older PDP minicomputers.