Python - Chapter 33. File Handling Modules


Table of Contents

The os.path Module
The os Module
The fileinput Module
The tempfile Module
The glob and fnmatch Modules
The shutil Module
The File Archive Modules: tarfile and zipfile
The Data Compression Modules: zlib, gzip, bz2
The sys Module
Additional File-Processing Modules
File Module Exercises

There are a number of operations closely related to file processing. Deleting and renaming files are examples of operations that change the directory information that the operating system maintains to describe a file. Python provides numerous modules for these operating system operations.

We can't begin to cover all of the various ways in which Python supports file handling. However, we can identify the essential modules that may help you avoid reinventing the wheel. Further, these modules can provide you a view of the Pythonic way of working with data from files.

The following modules have features that are essential for supporting file processing. We'll cover selected features of each module that are directly relevant to file processing. We'll present these in the order you'd find them in the Python library documentation.

Chapter 11 - File and Directory Access. Chapter 11 of the Library reference covers many modules which are essential for reliable use of files and directories. We'll look closely at the following modules.

os.path: Common pathname manipulations. Use this to split and join full directory path names. This is operating-system neutral, with a correct implementation for all operating systems.
os: Miscellaneous OS interfaces. This includes parameters of the current process, additional file object creation, manipulations of file descriptors, managing directories and files, managing subprocesses, and additional details about the current operating system.
fileinput: This module has functions which will iterate over lines from multiple input streams. This allows you to write a single, simple loop that processes lines from any number of input files.
tempfile: Generate temporary files and temporary file names.
glob: UNIX shell style pathname pattern expansion. Unix shells translate name patterns like *.py into a list of files. This is called globbing. The glob module implements this within Python, which allows this feature to work even in Windows where it isn't supported by the OS itself.
fnmatch: UNIX shell style filename pattern matching. This implements the glob-style rules using *, ? and []. * matches any number of characters, ? matches any single character, [ chars ] encloses a list of allowed characters, [! chars ] encloses a list of disallowed characters.
shutil: High-level file operations, including copying and removal. The kinds of things that the shell handles with simple commands like cp or rm become available to a Python program, and are just as simple in Python as they are in the shell.
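The pattern rules listed for fnmatch can be tried without touching the file system. The following is a minimal sketch in Python 3; the file names are made up for illustration, and it uses fnmatch.fnmatchcase so that the comparison is case-sensitive on every platform.

```python
import fnmatch

# Hypothetical file names, used only for illustration.
names = ["setup.py", "README.txt", "test1.py", "test2.py", "testA.py", "notes.rst"]

def matching(pattern):
    """Return the names that match a shell-style pattern (case-sensitive)."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]

py_files = matching("*.py")             # * matches any number of characters
numbered = matching("test?.py")         # ? matches exactly one character
digits_only = matching("test[0-9].py")  # [chars] lists the allowed characters
no_digits = matching("test[!0-9].py")   # [!chars] lists the disallowed characters
```

The glob module applies the same rules to names that actually exist in a directory, so a pattern tested this way should behave the same when handed to glob.glob.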

Chapter 12 - Data Compression and Archiving. Data compression is covered in Chapter 12 of the Library reference. We'll look closely at the following modules.

tarfile, zipfile: These modules help you read and write archive files: files which are an archive of a complex directory structure. This includes GNU/Linux tape archive (.tar) files, compressed GZip tar files (.tgz or .tar.gz files), sometimes called tarballs, and ZIP files.
zlib, gzip, bz2: These modules are all variations on a common theme of reading and writing files which are compressed to remove redundant bytes of data. The zlib and bz2 modules have a more sophisticated interface, allowing you to use compression selectively within a more complex application. The gzip module has a different (and simpler) interface that applies only to complete files.
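The difference between the two styles of interface can be sketched briefly. This is an illustration in Python 3, not a complete treatment: the sample data and the file name sample.txt.gz are assumptions chosen for the example. zlib compresses a byte string in memory, while gzip reads and writes a whole compressed file.

```python
import gzip
import os
import tempfile
import zlib

data = b"spam and eggs, " * 100   # deliberately redundant sample data

# zlib works on byte strings in memory, so it can be applied selectively.
packed = zlib.compress(data)
restored = zlib.decompress(packed)

# gzip works on complete files; gzip.open() returns a file-like object.
with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "sample.txt.gz")   # hypothetical file name
    with gzip.open(gz_path, "wb") as target:
        target.write(data)
    with gzip.open(gz_path, "rb") as source:
        from_file = source.read()
```

Because the sample data is highly repetitive, the compressed form is much shorter than the original; less redundant data will compress less.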

Chapter 26 - Python Runtime Services. The modules described in Chapter 26 of the Library reference include some that are used for handling various kinds of files. We'll look closely at just one.

sys: This module has several system-specific parameters and functions, including definitions of the three standard files that are available to every program.

The `os.path` Module

The os.path module contains useful functions for managing path and directory names. A serious mistake is to use ordinary string functions with literal strings for the path separators. A Windows program using \ as the separator won't work anywhere else. A less serious mistake is to build paths by hand with os.sep, the path separator character, instead of using the routines in the os.path module.
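The portability problem can be seen by calling the two platform-specific implementations directly. On Windows, os.path is the ntpath module; on Linux and macOS it is posixpath. Both can be imported on any platform, which makes them handy for a sketch like the following (Python 3; the path components are hypothetical).

```python
import ntpath      # the os.path implementation used on Windows
import posixpath   # the os.path implementation used on Linux and macOS

parts = ("data", "reports", "summary.txt")   # hypothetical path components

# join() picks the separator that is correct for each platform.
windows_path = ntpath.join(*parts)
posix_path = posixpath.join(*parts)

# A path with a hard-coded backslash is not split at all by the POSIX rules:
# the whole string is treated as a single file name.
hard_coded = "data\\reports\\summary.txt"
posix_dir, posix_name = posixpath.split(hard_coded)
```

In ordinary programs you would simply call os.path.join and let Python choose the right implementation for the platform it is running on.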

The os.path module contains the following functions for completely portable path and filename manipulation.

os.path.basename ( path ) → fileName: Return the base filename, the second half of the result created by os.path.split( path )
os.path.dirname ( path ) → dirName: Return the directory name, the first half of the result created by os.path.split( path )
os.path.exists ( path ) → boolean: Return True if the pathname refers to an existing file or directory.
os.path.getatime ( path ) → time: Return the last access time of a file, reported by os.stat. See the time module for functions to process the time value.
os.path.getmtime ( path ) → time: Return the last modification time of a file, reported by os.stat. See the time module for functions to process the time value.
os.path.getsize ( path ) → int: Return the size of a file, in bytes, reported by os.stat.
os.path.isdir ( path ) → boolean: Return True if the pathname refers to an existing directory.
os.path.isfile ( path ) → boolean: Return True if the pathname refers to an existing regular file.
os.path.join ( string , ... ) → path: Join path components using the appropriate path separator.
os.path.split ( path ) → tuple: Split a pathname into two parts: the directory and the basename (the filename, without path separators, in that directory). The result (s, t) is such that os.path.join( s , t ) yields the original path.
os.path.splitdrive ( path ) → tuple: Split a pathname into a drive specification and the rest of the path. Useful on DOS/Windows/NT.
os.path.splitext ( path ) → tuple: Split a path into root and extension. The extension is everything starting at the last dot in the last component of the pathname; the root is everything before that. The result (r, e) is such that r+e yields the original path.
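The relationships among these functions are easiest to see on a concrete name. The sketch below (Python 3) builds a hypothetical path with os.path.join, so it behaves the same way on any platform, and then takes it apart again.

```python
import os.path

# A hypothetical path, built portably from its components.
path = os.path.join("docs", "chapter33", "files.tar.gz")

# split() separates the directory from the final component.
directory, filename = os.path.split(path)

# splitext() breaks the name at the last dot only.
root, ext = os.path.splitext(filename)

# basename() and dirname() are the two halves of split().
same_name = os.path.basename(path) == filename
same_dir = os.path.dirname(path) == directory

# join() and simple concatenation put the pieces back together.
rebuilt_path = os.path.join(directory, filename)
rebuilt_name = root + ext
```

Note that only .gz is treated as the extension here; the root keeps the earlier dot and is files.tar.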

The following example is typical of the manipulations done with os.path.

import sys, os.path

def process(oldName, newName):
    # Some Processing...
    pass

for oldFile in sys.argv[1:]:
    # Separate the directory from the file name, then the name from its extension.
    dirName, fileExt = os.path.split(oldFile)
    baseName, ext = os.path.splitext(fileExt)
    if ext.upper() == '.RST':
        newFile = os.path.join(dirName, baseName) + '.HTML'
        print(oldFile, '->', newFile)
        process(oldFile, newFile)

	This program imports the `sys` and `os.path` modules.
	The `process` function does something interesting and useful to the input file. It is the real heart of the program.
	The for statement sets the variable `oldFile` to each string (after the first) in the sequence `sys.argv`.
	Each file name is split into the path name and the base name. The base name is further split to separate the file name from the extension. The `os.path` module does this correctly for all operating systems, saving us from having to write platform-specific code. For example, `splitext` correctly handles the situation where a Linux file has multiple '.'s in the file name.
	The extension is compared, ignoring case, with '.RST'. A new file name is created from the path, base name and a new extension ('.HTML'). The old and new file names are printed and some processing, defined in the `process` function, uses the `oldFile` and `newFile` names.
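One way to make the name manipulation easier to check is to move it into a function of its own. The following is a sketch in Python 3, not part of the original program: the function name html_name and the choice to return None for other extensions are assumptions made for illustration.

```python
import os.path

def html_name(old_file):
    """Return the .HTML name for a .RST file, or None for any other extension.

    html_name is a hypothetical helper, introduced only for this sketch.
    """
    dir_name, file_ext = os.path.split(old_file)
    base, ext = os.path.splitext(file_ext)
    if ext.upper() != ".RST":
        return None
    return os.path.join(dir_name, base) + ".HTML"

# A name with several dots: only the final '.RST' is replaced.
dotted = html_name("my.notes.RST")

# A name with a different extension is skipped.
skipped = html_name("notes.txt")
```

The loop over `sys.argv` could then call this function and skip any file for which it returns None.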

