DSV stands for Delimiter-Separated
Values. Our first case study in textual metaformats was
the /etc/passwd file, which is a DSV format with
colon as the value separator. Under Unix, colon is the default
separator for DSV formats in which the field values may contain
whitespace.
The /etc/passwd format (one record
per line, colon-separated fields) is very traditional under Unix
and frequently used for tabular data. Other classic examples
include the /etc/group file describing security
groups and the /etc/inittab file used to control
startup and shutdown of Unix service programs at different run levels
of the operating system.
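
As a rough illustration of the layout just described, the following Python sketch splits one passwd-style record into named fields. The sample line and the field labels are illustrative assumptions (the account is made up), and the naive split deliberately ignores backslash escaping.

```python
# Sample record in /etc/passwd layout. The account is made up for
# illustration and is not taken from any real system.
SAMPLE = "games:x:5:60:Games User:/usr/games:/usr/sbin/nologin"

# Conventional labels for the seven passwd fields (labels are our own choice).
FIELD_NAMES = ("name", "password", "uid", "gid", "gecos", "home", "shell")

def parse_passwd_line(line):
    """Split one colon-separated record into a dict keyed by field name."""
    fields = line.rstrip("\n").split(":")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(
            f"expected {len(FIELD_NAMES)} fields, got {len(fields)}"
        )
    return dict(zip(FIELD_NAMES, fields))

record = parse_passwd_line(SAMPLE)
print(record["name"], record["gecos"], record["shell"])
```

Note that the fifth field contains a space, which is exactly the case where a colon separator is more convenient than whitespace.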
Data files in this style are expected to support inclusion of
colons in the data fields by backslash escaping. More generally,
code that reads them is expected to support record continuation by
ignoring backslash-escaped newlines, and to allow embedding
nonprintable character data by C-style backslash escapes.
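
The reading conventions above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: the function name parse_dsv and the particular set of C-style escapes it recognizes are our own choices.

```python
# C-style escapes this sketch understands; the selection is an assumption.
C_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "0": "\0"}

def parse_dsv(text, sep=":"):
    """Parse backslash-escaped DSV text into a list of records."""
    records, fields, current = [], [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt != "\n":
                # A known C-style escape, otherwise the next character
                # taken literally (this covers "\:" and "\\").
                current.append(C_ESCAPES.get(nxt, nxt))
            # An escaped newline is dropped: the record continues
            # on the next physical line.
            i += 2
            continue
        if ch == sep:
            fields.append("".join(current))
            current = []
        elif ch == "\n":
            fields.append("".join(current))
            records.append(fields)
            fields, current = [], []
        else:
            current.append(ch)
        i += 1
    if current or fields:
        # Final record without a trailing newline.
        fields.append("".join(current))
        records.append(fields)
    return records

sample = "a\\:b:c\nlong:rec\\\nord\ntab\\there:x\\\\y\n"
for rec in parse_dsv(sample):
    print(rec)
```

The parser has one special case (the backslash) and one action for it (consume the next character), apart from the small escape table.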
This format is most appropriate when the data is tabular,
keyed by a name (in the first field), and records are typically
short (less than 80 characters long). It works well with
traditional Unix tools.
One occasionally sees field separators other than the colon,
such as the pipe character | or even an ASCII NUL. Old-school Unix
practice used to favor tabs, a preference reflected in the defaults
for cut(1) and paste(1); but this has gradually changed as format
designers became aware of the many small irritations that ensue from
the fact that tabs and spaces are not visually distinguishable.
This format is to Unix what CSV (comma-separated value) format
is under Microsoft Windows and elsewhere outside the Unix world.
CSV (fields separated by commas, double quotes used to escape
commas, no continuation lines) is rarely found under Unix.
In fact, the Microsoft version of CSV is a textbook example of
how not to design a textual file format. Its
problems begin with the case in which the separator character (in this
case, a comma) is found inside a field. The Unix way would be to
simply escape the separator with a backslash, and have a double escape
represent a literal backslash. This design gives us a single special case
(the escape character) to check for when parsing the file, and only a
single action when the escape is found (treat the following character
as a literal). The latter conveniently not only handles the separator
character, but gives us a way to handle the escape character and
newlines for free. CSV, on the other hand, encloses the entire field
in double quotes if it contains the separator. If the field contains
double quotes, it must also be enclosed in double quotes, and each
double quote inside the field must itself be doubled
to indicate that it does not end the field.
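
To make the contrast concrete, the sketch below writes the same three fields both ways. Python's standard csv module stands in for a CSV writer here; it follows the common quoting convention but should not be taken as identical to any particular Microsoft product. The helper names to_csv_line and to_escaped_line are hypothetical.

```python
import csv
import io

def to_csv_line(fields):
    """Write one record using the csv module's default quoting rules."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue().rstrip("\n")

def to_escaped_line(fields, sep=","):
    """Write one record Unix-style: backslash-escape backslash and separator."""
    escaped = []
    for field in fields:
        # Escape the escape character first, then the separator.
        field = field.replace("\\", "\\\\").replace(sep, "\\" + sep)
        escaped.append(field)
    return sep.join(escaped)

fields = ["plain", "a,b", 'say "hi"']
print(to_csv_line(fields))      # plain,"a,b","say ""hi"""
print(to_escaped_line(fields))  # plain,a\,b,say "hi"
```

In the backslash version a double quote is just another character, while the CSV version needs both the enclosing quotes and the doubling rule.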
The bad results of proliferating special cases are twofold.
First, the complexity of the parser (and its vulnerability to bugs) is
increased. Second, because the format rules are complex and
underspecified, different implementations diverge in their handling of
edge cases. Sometimes continuation lines are
supported, by starting the last field of the line with an unterminated
double quote, but only in some products! Microsoft has
incompatible versions of CSV files between its own applications, and
in some cases between different versions of the same application
(Excel being the obvious example here).
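
The continuation-line divergence is easy to reproduce. In this sketch, Python's csv module plays the part of a reader that honors a quoted field spanning two lines, and a naive line-by-line splitter (the hypothetical naive_read) plays the part of one that does not. Neither is meant to model a specific product.

```python
import csv
import io

# One logical record whose last field opens a quote that is only
# closed on the following physical line.
DATA = 'id,"first line\nsecond line"\n'

def quoted_aware_read(text):
    """Read with the csv module, which lets quoted fields span lines."""
    return list(csv.reader(io.StringIO(text)))

def naive_read(text):
    """Read by splitting on newlines, then commas, ignoring quotes."""
    return [line.split(",") for line in text.splitlines()]

print(quoted_aware_read(DATA))  # one record with two fields
print(naive_read(DATA))         # two records, stray quotes left in place
```

The same bytes yield one record under the first reader and two under the second, which is the kind of silent disagreement that an underspecified format invites.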