Chapter 34. File Formats: CSV, Tab, XML, Logs and Others
We looked at general features of the file system in Chapter 19, Files.
In this chapter we'll look at Python techniques
for handling files in a few of the innumerable formats that are in common
use. Most file formats are relatively easy to handle with Python
techniques we've already seen. Comma-Separated Values (CSV) files, XML
files and packed binary files, however, are a little more
sophisticated.
This is only the tip of the iceberg in the far larger problem called
“persistence”. In addition to simple file system persistence,
we also have the possibility of object persistence using an object
database. In this case, the database processing lies between our program
and the file system on which the database resides. This area also includes
object-relational mapping, where our program relies on a mapper; the
mapper uses the database, and the database manages the file system. We
can't explore the whole persistence problem in this chapter.
In this chapter we'll present a conceptual overview of the various
approaches to reading and writing files in the section called “Overview”. We'll look at reading and writing CSV
files in the section called “Comma-Separated Values: The csv Module”, and at tab-delimited files in
the section called “Tab Files: Nothing Special”. We'll look at reading property files in
the section called “Property Files and Configuration (or .INI) Files: The ConfigParser Module”. We'll look at the
subtleties of processing legacy COBOL files in the section called “Fixed Format Files, A COBOL Legacy: The codecs Module”. We'll cover the basics of
reading XML files in the section called “XML Files: The xml.minidom and xml.sax Modules”.
Most programs need a way to write sophisticated, easy-to-control log
files that contain status and debugging information. For simple one-page
programs, the print statement is fine. As soon as we
have multiple modules that need more sophisticated debugging, we find
a need for the logging module. Of course, any
program that requires careful auditing will also benefit from the
logging module. We'll look at creating standard
logs in the section called “Log Files: The logging Module”.
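As a small preview of why the logging module scales better than printing, here is a minimal sketch. It is written for Python 3, and the logger name demo.parser, the message texts and the in-memory stream are assumptions made up for this illustration; a real program would more likely log to a file or to the console.

```python
import io
import logging

# Assumption: we capture log output in memory so the result is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

# Each module can ask for its own named logger ("demo.parser" is hypothetical).
log = logging.getLogger("demo.parser")
log.setLevel(logging.INFO)      # messages below INFO are filtered out
log.addHandler(handler)
log.propagate = False           # keep the example independent of the root logger

log.debug("row detail: %r", ["a", "b"])   # suppressed, because DEBUG < INFO
log.info("read %d rows", 2)
log.warning("skipped %d bad rows", 1)

output = stream.getvalue()
print(output)
```

The point of the sketch is that the level and the destination are set once, in one place, while the modules that write messages do not need to change.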
When we introduced the concept of a file we mentioned that we could
look at a file on two levels.
-
A file is a sequence of bytes. This is the OS's view of files,
as it is the lowest common denominator.
-
A file is a sequence of data objects, represented as sequences
of bytes.
A file format is the set of processing rules
required to translate between usable Python objects and sequences of
bytes. People have invented innumerable distinct file formats. We'll
look at some techniques which should cover most of the bases.
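The two views, and the role of a format as the translation between them, can be sketched as follows. This is an illustration only, written for Python 3: the sample records, the file name sample.txt and the tiny "name,count" line format are assumptions invented for the example, not a format discussed elsewhere in this chapter.

```python
import os
import tempfile

# Hypothetical sample data: two records of (name, count).
records = [("apple", 3), ("pear", 12)]

def to_bytes(rows):
    """Writing side of the format: one 'name,count' line per record, as UTF-8."""
    text = "".join("%s,%d\n" % (name, count) for name, count in rows)
    return text.encode("utf-8")

def from_bytes(data):
    """Reading side of the format: split lines, split fields, convert types."""
    rows = []
    for line in data.decode("utf-8").splitlines():
        name, count = line.split(",")
        rows.append((name, int(count)))
    return rows

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as target:
    target.write(to_bytes(records))

with open(path, "rb") as source:
    raw = source.read()        # view 1: the file as a sequence of bytes

restored = from_bytes(raw)     # view 2: the file as a sequence of data objects
print(raw)
print(restored)
```

The same file yields either a single bytes object or a list of tuples; only the format rules in to_bytes and from_bytes connect the two.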
We'll look at three broad families of files: text, binary and
pickled objects. Each has some advantages and processing
complexities.
-
Text files are designed so that a person can easily read and
write them. We'll look at several common text file formats,
including CSV, XML, Tab-delimited, property-format, and fixed
position. Since text files are intended for human consumption, their
records usually vary in length, which makes them difficult to update in place.
-
Binary files are designed to optimize processing speed or the
overall size of the file. Most databases use very complex binary
file formats for speed. A JPEG file, on the other hand, uses a
binary format to minimize the size of the file. A binary-format file
will typically place data at known offsets, making it possible to
access any particular byte directly using the seek
method of a Python file object.
-
Pickled objects are produced by Python's pickle or shelve
modules. There are several pickle protocols available, including
text and binary alternatives. More importantly, a pickled file is
not designed to be seen by people, nor have we spent any design
effort optimizing performance or size. In a sense, a pickled object
requires the least design effort.
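The contrast between the binary and pickled families can be sketched as follows, under stated assumptions. The code is written for Python 3; the 12-byte record layout (an 8-byte name followed by a 4-byte integer), the file names and the sample rows are all hypothetical. The struct module, which is not mentioned above, is used here only as one convenient way to build fixed-size records.

```python
import os
import pickle
import struct
import tempfile

# Hypothetical layout: 8-byte name, then a 4-byte little-endian signed integer.
RECORD = struct.Struct("<8si")

rows = [(b"apple", 3), (b"pear", 12), (b"plum", 7)]
workdir = tempfile.mkdtemp()

# Binary file: every record has the same size, so record n starts at n * size.
bin_path = os.path.join(workdir, "fruit.dat")
with open(bin_path, "wb") as target:
    for name, count in rows:
        target.write(RECORD.pack(name, count))   # short names are padded with NUL bytes

def read_record(path, index):
    """Jump straight to one record with seek instead of reading the whole file."""
    with open(path, "rb") as source:
        source.seek(index * RECORD.size)
        name, count = RECORD.unpack(source.read(RECORD.size))
    return name.rstrip(b"\x00"), count

third = read_record(bin_path, 2)

# Pickled file: no layout to design; we dump the object and load it back whole.
pickle_path = os.path.join(workdir, "fruit.pickle")
with open(pickle_path, "wb") as target:
    pickle.dump(rows, target)
with open(pickle_path, "rb") as source:
    unpickled = pickle.load(source)

print(third)
print(unpickled == rows)
```

The binary version required us to decide on field widths and byte order before writing anything, and in return gives direct access to any record. The pickled version required no such decisions, but the file must be loaded as a unit.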