Many modern Unix projects, such as OpenOffice.org and AbiWord,
now use XML compressed with zip(1) or gzip(1)
as a data file format. Compressed XML combines space economy with some of the
advantages of a textual format — notably, it avoids the problem
that binary formats must often allocate space for information that may
not be used in particular cases (e.g., for unusual options or large
ranges). But there is some dispute about this, and it turns on some
of the central tradeoffs discussed in this chapter.
On the one hand, experiments have shown that documents in a
compressed XML file are usually significantly smaller than the same
documents in Microsoft Word's native file format, a binary format that
one might imagine would take less space. The reason relates to a
fundamental of the Unix philosophy: Do one thing well. Creating a
single tool to do the compression job well is more effective than
ad-hoc compression on parts of the file, because the tool can look
across all the data and exploit all repetition in the information.
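The effect is easy to observe on a small scale. The following Python sketch compares compressing a document as one stream against compressing each fragment separately; the sample paragraphs and the function name `whole_vs_piecewise` are made up for illustration, and zlib stands in for whatever compressor a real format would use.

```python
import zlib

def whole_vs_piecewise(chunks):
    """Return (whole, piecewise) compressed sizes in bytes.

    whole:     all chunks joined and compressed as a single stream
    piecewise: each chunk compressed on its own, sizes summed
    """
    whole = len(zlib.compress(b"".join(chunks)))
    piecewise = sum(len(zlib.compress(c)) for c in chunks)
    return whole, piecewise

# Hypothetical document: many small XML fragments with similar markup.
paragraphs = [
    ('<p style="body" lang="en">Paragraph number %d</p>\n' % i).encode("ascii")
    for i in range(200)
]

whole, piecewise = whole_vs_piecewise(paragraphs)
print("one stream:", whole, "bytes; fragment by fragment:", piecewise, "bytes")
```

The single stream comes out smaller because the compressor can refer back to markup it has already seen, while each separately compressed fragment starts from nothing and pays its own header overhead. One caveat: DEFLATE only looks back over a 32 KB window, so "all repetition" is an approximation for large files.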
Also, by separating the representation design from the
particular compression method used, you leave open the possibility of
using different compression methods in the future with no more than
minimal changes to the actual file parsing — perhaps, with no
changes at all.
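As a minimal sketch of that separation, the Python fragment below keeps the choice of compressor in a lookup table and leaves the XML parsing untouched. The names `OPENERS` and `load_document` are hypothetical, and the sketch assumes the caller already knows which compression was used.

```python
import bz2
import gzip
import lzma
import xml.etree.ElementTree as ET

# Each entry opens a file and yields a stream of uncompressed bytes.
# Supporting a new method means adding one entry here; the parsing
# code below does not change.
OPENERS = {
    "none": open,
    "gzip": gzip.open,
    "bz2": bz2.open,
    "xz": lzma.open,
}

def load_document(path, compression="gzip"):
    """Parse the XML document at `path` and return its root element."""
    with OPENERS[compression](path, "rb") as stream:
        return ET.parse(stream).getroot()
```

A call such as `load_document("report.xml.gz")` and a call such as `load_document("report.xml.xz", "xz")` go through exactly the same parser.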
On the other hand, compression does some damage to transparency.
While a human being can estimate from context whether uncompressing
the file is likely to show him anything useful, tools such as file(1)
cannot, as of mid-2003, see through the wrapping.
Some would advocate a less structured compression format: straight
gzip(1)-compressed XML data, say, without the internal structure and
self-identifying header chunk provided by zip(1). While using a format
similar to that of zip(1) solves the identification problem, it means
that decoding such files will be tricky for programs written in the
simpler scripting languages.
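The difference between the two wrappers can be sketched in Python. This is an illustration under stated assumptions, not a description of any particular application's format: the member names `mimetype` and `content.xml` and the type string are invented for the example. The zip container stores a small uncompressed identifying member first, so the type name sits in plain bytes near the start of the file; the gzip stream is simpler to decode but carries only its own magic number.

```python
import gzip
import io
import zipfile

XML = b'<?xml version="1.0"?>\n<doc><p>Hello, world</p></doc>\n'
MIMETYPE = b"application/x-example-doc"  # hypothetical type name

def make_zip_container(xml):
    """Build a zip archive: a stored type marker, then the compressed XML."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Stored (uncompressed), so the marker is readable in the raw bytes.
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        zf.writestr("content.xml", xml, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()

def read_zip_container(raw):
    """Decoding needs a library that understands the archive directory."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return zf.read("content.xml")

def make_gzip_stream(xml):
    """Build a bare gzip stream with no internal structure."""
    return gzip.compress(xml)

def read_gzip_stream(raw):
    """Decoding is a single pass through a decompression filter."""
    return gzip.decompress(raw)

zipped = make_zip_container(XML)
gzipped = make_gzip_stream(XML)
print(MIMETYPE in zipped[:100])   # type marker visible near the start
print(gzipped[:2] == b"\x1f\x8b")  # only the gzip magic number identifies it
```

In a language with no zip library at hand, the second pair of functions reduces to piping the file through gunzip(1); the first pair has no equally simple equivalent.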
Any of these solutions (straight text, straight binary, or
compressed text) may be optimal depending on the relative weight you
give to storage economy, discoverability, and making browsing tools
as simple as possible to write. The point of the preceding discussion
is not to advocate any one of these approaches over the others, but
rather to suggest how you can think about the options and design
tradeoffs clearly.
This having been said, the truly Unixy solution would probably
be to fix file(1) to see file prefixes through the compression and,
failing that, to write a shellscript wrapper around file(1) that would
interpret compression as a direction to apply gunzip(1) and take a
second look.
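Such a wrapper might look roughly like the following. It is written in Python rather than shell, and the function names `peek`, `run_file`, and `identify` are invented for this sketch. It assumes a file(1) that accepts `-b` for brief output and `-` to read standard input, and it peels off only a single gzip layer.

```python
import gzip
import subprocess

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def run_file(data):
    """Ask file(1) to identify bytes fed to it on standard input."""
    result = subprocess.run(
        ["file", "-b", "-"], input=data, capture_output=True, check=True
    )
    return result.stdout.decode().strip()

def peek(path, limit=4096):
    """Return (prefix, was_gzipped) for the file at `path`.

    If the file is gzip-compressed, the prefix is taken from the
    uncompressed data: this is the "second look".
    """
    with open(path, "rb") as f:
        head = f.read(limit)
    if head.startswith(GZIP_MAGIC):
        with gzip.open(path, "rb") as g:
            return g.read(limit), True
    return head, False

def identify(path, classify=run_file):
    """Describe a file, looking through one layer of gzip compression."""
    data, was_gzipped = peek(path)
    description = classify(data)
    if was_gzipped:
        return description + " (gzip compressed)"
    return description
```

The `classify` parameter exists so that the sniffing logic can be exercised without file(1) installed; in ordinary use, `identify("document.xml.gz")` hands the uncompressed prefix to file(1) and appends a note about the wrapping.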