Python - Chapter 34. File Formats: CSV, Tab, XML, Logs and Others

Chapter 34. File Formats: CSV, Tab, XML, Logs and Others
	Part IV. Components, Modules and Packages

Chapter 34. File Formats: CSV, Tab, XML, Logs and Others

Table of Contents

Overview

Comma-Separated Values: The csv Module

About CSV Files
The CSV Module
Basic CSV Reading
Consistent Columns as Dictionaries
Writing CSV Files

Tab Files: Nothing Special

Property Files and Configuration (or.INI) Files: The ConfigParser Module

Fixed Format Files, A COBOL Legacy: The codecs Module

XML Files: The xml.minidom and xml.sax Modules

Log Files: The logging Module

File Format Exercises

We looked at general features of the file system in Chapter 19, Files . In this chapter we'll look at Python techniques for handling files in a few of the innumeraable formats that are in common use. Most file formats are relatively easy to handle with Python techniques we've already seen. Comma-Separated Values (CSV) files, XML files and packed binary files, however, are a little more sophisticated.

This only the tip of the iceberg in the far larger problem called “persistence”. In addition to simple file system persistence, we also have the possibility of object persistence using an object database. In this case, the databse processing lies between our program and the file system on which the database resides. This area also includes object-relational mapping, where our program relies on a mapper; the mapper uses to database, and the database manages the file system. We can't explore the whole persistence problem in this chapter.

In this chapter we'll present a conceptual overview of the various approaches to reading and writing files in the section called “Overview”. We'll look at reading and writing CSV files in the section called “Comma-Separated Values: The csv Module”, tab-delimited files in the section called “Tab Files: Nothing Special”. We'll look reading property files in the section called “Property Files and Configuration (or.INI) Files: The ConfigParser Module”. We'll look at the subleties of processing legacy COBOL files in the section called “Fixed Format Files, A COBOL Legacy: The codecs Module”. We'll cover the basics of reading XML files in the section called “XML Files: The xml.minidom and xml.sax Modules”.

Most programs need a way to write sophisticated, easy-to-control log files what contain status and debugging information. For simple one-page programs, the print statement is fine. As soon as we have multiple modules, where we need more sophisticated debugging, we find a need for the logging module. Of course, any program that requires careful auditing will benefit from the logging module. We'll look at creating standard logs in the section called “Log Files: The logging Module”.

Overview

When we introduced the concept of file we mentioned that we could look at a file on two levels.

A file is a sequence of bytes. This is the OS's view of views, as it is the lowest-common denominator.
A file is a sequence of data objects, represented as sequences of bytes.

A file format is the processing rules required to translate between usable Python objects and sequences of bytes. People have invented innumerable distinct file formats. We'll look at some techniques which should cover most of the bases.

We'll look at three broad families of files: text, binary and pickled objects. Each has some advantages and processing complexities.

Text files are designed so that a person can easily read and write them. We'll look at several common text file formats, including CSV, XML, Tab-delimited, property-format, and fixed position. Since text files are intended for human consumption, they are difficult to update in place.
Binary files are designed to optimize processing speed or the overall size of the file. Most databases use very complex binary file formats for speed. A JPEG file, on the other hand, uses a binary format to minimize the size of the file. A binary-format file will typically place data at known offsets, making it possible to do direct access to any particular byte using the seek method of a Python file object.
Pickled Objects are produced by Python's pickle or shelve modules. There are several pickle protocols available, including text and binary alternatives. More importantly, a pickled file is not designed to be seen by people, nor have we spent any design effort optimizng performace or size. In a sense, a pickled object requires the least design effort.


File Module Exercises		Comma-Separated Values: The `csv` Module