Unix Programming - Data File Metaformats

The Art of Unix Programming

XML

XML is a very simple syntax resembling HTML: angle-bracketed tags and ampersand-led literal sequences. It is about as simple as a plain-text markup can be while still expressing recursively nested data structures. XML is just a low-level syntax; it requires a document type definition (such as XHTML) and associated application logic to give it semantics.

XML is well suited for complex data formats (the sort of things for which the old-school Unix tradition would use an RFC-822-like stanza format) though overkill for simpler ones. It is especially appropriate for formats that have a complex nested or recursive structure of the sort that the RFC 822 metaformat does not handle well. For a good introduction to the format, see XML in a Nutshell [Harold-Means].

	Among the hardest things to get right in designing any text file format are issues of quoting, whitespace and other low-level syntax details. Custom file formats often suffer from slightly broken syntax that doesn't quite match other similar formats. Using a standard format such as XML, which is verifiable and parsed by a standard library, eliminates most of these issues.
-- Keith Packard
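
As a rough illustration of the quoting point, the sketch below lets Python's standard xml.etree.ElementTree module do the escaping instead of hand-written code. The helper name round_trip and the sample value are made up for this example; the element and attribute names are borrowed from Example 5.5.

```python
import xml.etree.ElementTree as ET

def round_trip(value):
    """Serialize a value as an XML attribute, then parse it back.

    Returns (serialized_text, recovered_value). The library, not the
    caller, decides how quotes, ampersands and angle brackets are escaped.
    """
    elem = ET.Element("filterarg", {"name": "file", "format": value})
    serialized = ET.tostring(elem, encoding="unicode")
    recovered = ET.fromstring(serialized).get("format")
    return serialized, recovered

if __name__ == "__main__":
    # A deliberately awkward value: quotes, an ampersand, angle brackets.
    text, value = round_trip('> "%out" & <more>')
    print(text)
    print(value)
```

The recovered value is identical to the input, and the caller never has to write an escaping rule by hand.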

Example 5.5 is a simple example of an XML-based configuration file. It is part of the kdeprint tool shipped with the open-source KDE office suite hosted under Linux. It describes options for an image-to-PostScript filtering operation, and how to map them into arguments for a filter command. For another instructive example, see the discussion of Glade in Chapter 8.

Example 5.5. An XML example.


<?xml version="1.0"?>
<kprintfilter name="imagetops">
    <filtercommand 
           data="imagetops %filterargs %filterinput %filteroutput" />
    <filterargs>
        <filterarg name="center" 
                   description="Image centering" 
                   format="-nocenter" type="bool" default="true">
            <value name="true" description="Yes" />
            <value name="false" description="No" />
        </filterarg>
        <filterarg name="turn" 
                   description="Image rotation" 
                   format="-%value" type="list" default="auto">
            <value name="auto" description="Automatic" />
            <value name="noturn" description="None" />
            <value name="turn" description="90 deg" />
        </filterarg>
        <filterarg name="scale" 
                   description="Image scale" 
                   format="-scale %value" 
                   type="float" 
                   min="0.0" max="1.0" default="1.000" />
        <filterarg name="dpi" 
                   description="Image resolution" 
                   format="-dpi %value" 
                   type="int" min="72" max="1200" default="300" />
    </filterargs>
    <filterinput>
        <filterarg name="file" format="%in" />
        <filterarg name="pipe" format="" />
    </filterinput>
    <filteroutput>
        <filterarg name="file" format="> %out" />
        <filterarg name="pipe" format="" />
    </filteroutput>
</kprintfilter>
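
To give a feel for the application logic that sits on top of such a file, here is a minimal sketch that reads an abbreviated copy of Example 5.5 with Python's standard xml.etree.ElementTree module and lists each option with its type, default, and allowed values. It is not how kdeprint itself processes the file; the function name list_options is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of Example 5.5 (only two of the four options are kept).
CONFIG = """<?xml version="1.0"?>
<kprintfilter name="imagetops">
    <filterargs>
        <filterarg name="center" description="Image centering"
                   format="-nocenter" type="bool" default="true">
            <value name="true" description="Yes" />
            <value name="false" description="No" />
        </filterarg>
        <filterarg name="dpi" description="Image resolution"
                   format="-dpi %value"
                   type="int" min="72" max="1200" default="300" />
    </filterargs>
</kprintfilter>
"""

def list_options(xml_text):
    """Return (name, type, default, [allowed values]) for each option."""
    root = ET.fromstring(xml_text)
    options = []
    for arg in root.findall("./filterargs/filterarg"):
        # Nested <value> elements enumerate the permitted settings, if any.
        choices = [v.get("name") for v in arg.findall("value")]
        options.append((arg.get("name"), arg.get("type"),
                        arg.get("default"), choices))
    return options

if __name__ == "__main__":
    for option in list_options(CONFIG):
        print(option)
```

The parser handles the nesting, the attribute quoting, and the whitespace; the application code only has to walk the resulting tree.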

One advantage of XML is that it is often possible to detect ill-formed, corrupted, or incorrectly generated data through a syntax check, without knowing the semantics of the data.
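
A minimal sketch of such a syntax check, using Python's standard xml.etree.ElementTree module, follows. The function name is_well_formed and the sample strings are illustrative. Note that this tests only well-formedness; it does not validate the data against a document type definition.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise.

    No knowledge of what the tags or attributes mean is needed.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

# A complete element, a truncated document, and a missing closing quote.
GOOD = '<filterarg name="pipe" format="" />'
TRUNCATED = '<filterinput><filterarg name="file" format="%in" />'
BAD_QUOTE = '<filterarg name="file format="%in" />'

if __name__ == "__main__":
    for sample in (GOOD, TRUNCATED, BAD_QUOTE):
        print(is_well_formed(sample), sample)
```

The truncated and misquoted samples are rejected even though the checker knows nothing about print filters.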

The most serious problem with XML is that it doesn't play well with traditional Unix tools. Software that wants to read an XML format needs an XML parser; this means bulky, complicated programs. Also, XML is itself rather bulky; it can be difficult to see the data amidst all the markup.

One application area in which XML is clearly winning is in markup formats for document files (we'll have more to say about this in Chapter 18). Tagging in such documents tends to be relatively sparse among large blocks of plain text; thus, traditional Unix tools still work fairly well for simple text searches and transformations.

One interesting bridge between these worlds is PYX format, a line-oriented translation of XML that can be hacked with traditional line-oriented Unix text tools and then losslessly translated back to XML. A Web search for “Pyxie” will turn up resources. The xmltk toolkit takes the opposite tack, providing stream-oriented tools analogous to grep(1) and sort(1) for filtering XML documents; a Web search for “xmltk” will find it.
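
To make the idea of a line-oriented translation concrete, the following sketch emits PYX-style lines from an XML fragment. It assumes the PYX notation as it is commonly described: a line starting with "(" opens an element, ")" closes it, "A" carries an attribute name and value, and "-" carries character data. It is a simplified illustration rather than the Pyxie implementation; processing instructions and comments are ignored, and the function names are made up for this example.

```python
import xml.etree.ElementTree as ET

def _escape(text):
    # Keep each PYX record on one line by escaping backslash, newline, tab.
    return (text.replace("\\", "\\\\")
                .replace("\n", "\\n")
                .replace("\t", "\\t"))

def to_pyx(elem):
    """Yield PYX-style lines for an element and all of its descendants."""
    yield "(" + elem.tag
    for name, value in elem.attrib.items():
        yield "A" + name + " " + _escape(value)
    if elem.text:
        yield "-" + _escape(elem.text)
    for child in elem:
        yield from to_pyx(child)
        if child.tail:
            # Character data that follows the child's closing tag.
            yield "-" + _escape(child.tail)
    yield ")" + elem.tag

# Fragment taken from Example 5.5, written on one line.
SAMPLE = ('<filterinput><filterarg name="file" format="%in" />'
          '<filterarg name="pipe" format="" /></filterinput>')

if __name__ == "__main__":
    for line in to_pyx(ET.fromstring(SAMPLE)):
        print(line)
```

Because every record sits on its own line and begins with a type character, a filter such as grep '^Aname' picks out every name attribute without an XML parser.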

XML can be a simplifying choice or a complicating one. There is a lot of hype surrounding it, but don't become a fashion victim by either adopting or rejecting it uncritically. Choose carefully and bear the KISS principle in mind.


Published under free license.
