Unix Programming - Data File Metaformats - The Pros and Cons of File Compression

On-line Guides

Eclipse Documentation

How To Guides

The Art of Unix Programming
Prev	Home	Next

Unix Programming - Data File Metaformats - The Pros and Cons of File Compression

The Pros and Cons of File Compression

Many modern Unix projects, such as OpenOffice.org and AbiWord, now use XML compressed with zip(1) or gzip(1) as a data file format. Compressed XML combines space economy with some of the advantages of a textual format — notably, it avoids the problem that binary formats must often allocate space for information that may not be used in particular cases (e.g., for unusual options or large ranges). But there is some dispute about this, dispute which turns on some of the central tradeoffs discussed in this chapter.

On the one hand, experiments have shown that documents in a compressed XML file are usually significantly smaller than the Microsoft Word's native file format, a binary format that one might imagine would take less space. The reason relates to a fundamental of the Unix philosophy: Do one thing well. Creating a single tool to do the compression job well is more effective than ad-hoc compression on parts of the file, because the tool can look across all the data and exploit all repetition in the information.

Also, by separating the representation design from the particular compression method used, you leave open the possibility of using different compression methods in the future with no more than minimal changes to the actual file parsing — perhaps, with no changes at all.

On the other hand, compression does some damage to transparency. While a human being can estimate from context whether uncompressing the file is likely to show him anything useful, tools such as file(1) cannot as of mid-2003 see through the wrapping.

Some would advocate a less structured compression format — straight gzip(1)-compressed XML data, say, without the internal structure and self-identifying header chunk provided by zip(1). While using a format similar to that of zip(1) solves the identification problem, it means that decoding such files will be tricky for programs written in the simpler scripting languages.

Any of these solutions (straight text, straight binary, or compressed text) may be optimal depending on the relative weight you give to storage economy, discoverability, or making browsing tools as simple as possible to write. The point of the preceding discussion is not to advocate any one of these approaches over the others, but rather to suggest how you can think about the options and design tradeoffs clearly.

This having been said, the truly Unixy solution would probably be to fix file(1) to see file prefixes through the compression — and, failing that, to write a shellscript wrapper around file(1) that would interpret compression as a direction to apply gunzip(1) and take a second look.

[an error occurred while processing this directive]

The Art of Unix Programming
Prev	Home	Next

Published under free license.

Design by Interspire