There are a number of operations closely related to file processing.
Deleting and renaming files are examples of operations that change the
directory information that the operating system maintains to describe a
file. Python provides numerous modules for these operating system
operations.
We can't begin to cover all of the various ways in which Python
supports file handling. However, we can identify the essential modules
that may help you avoid reinventing the wheel. These modules also
show you the Pythonic way of working with data from
files.
The following modules have features that are essential for
supporting file processing. We'll cover selected features of each module
that are directly relevant to file processing. We'll present these in the
order you'd find them in the Python library documentation.
Chapter 11 - File and Directory Access
Chapter 11 of the Library reference covers many modules that are
essential for reliable use of files and directories. We'll look closely
at the following modules.
os.path
Common pathname manipulations. Use this to split and join full
directory path names. This is operating-system neutral, with a
correct implementation for all operating systems.
os
Miscellaneous OS interfaces. This includes
parameters of the current process, additional file object creation,
manipulations of file descriptors, managing directories and files,
managing subprocesses, and additional details about the current
operating system.
fileinput
This module has functions which will iterate over lines from
multiple input streams. This allows you to write a single, simple
loop that processes lines from any number of input files.
tempfile
Generate temporary files and temporary file names.
glob
UNIX shell style pathname pattern expansion. Unix shells
translate name patterns like *.py into a
list of files. This is called
globbing. The glob
module implements this within Python, which allows this feature to
work even in Windows where it isn't supported by the
OS itself.
fnmatch
UNIX shell style filename pattern matching. This implements
the glob-style rules using *, ? and []:
* matches any number of characters,
? matches any single character,
[chars] matches any one character from the
list of allowed characters, and
[!chars] matches any one character that is not in the
list of disallowed characters.
shutil
High-level file operations, including copying and removal. The
kinds of things that the shell handles with simple commands like
cp or rm become available to a
Python program, and are just as simple in Python as they are in the
shell.
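The pattern rules listed for fnmatch and glob can be tried out with a short sketch. The file names and the temporary directory below are made up for illustration; only the standard-library calls are real.

```python
import fnmatch
import glob
import os
import tempfile

# Hypothetical file names, used only to exercise the pattern rules.
names = ["setup.py", "README.txt", "data1.csv", "data2.csv", "dataX.csv"]

py_files = fnmatch.filter(names, "*.py")                # * : any run of characters
any_single = fnmatch.filter(names, "data?.csv")         # ? : exactly one character
numbered = fnmatch.filter(names, "data[0-9].csv")       # [chars] : one allowed character
not_numbered = fnmatch.filter(names, "data[!0-9].csv")  # [!chars] : one disallowed character

# glob applies the same rules to the contents of a real directory.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.py", "b.py", "c.txt"):
        with open(os.path.join(folder, name), "w") as target:
            target.write("")
    matches = glob.glob(os.path.join(folder, "*.py"))
    found = sorted(os.path.basename(match) for match in matches)

print(py_files, any_single, numbered, not_numbered, found)
```

fnmatch works on strings you already have, while glob reads the directory for you. That is why the glob part of the sketch needs real files to match against.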
Chapter 12 - Data Compression and Archiving
Data Compression is covered in Chapter 12 of the Library reference.
We'll look closely at the following modules.
tarfile, zipfile
These modules help you read and write archive files: files
which hold an archive of a complex directory structure. This includes
GNU/Linux tape archive (.tar) files, compressed
GZip tar files (.tgz files or
.tar.gz files) sometimes called tarballs, and
ZIP files.
zlib, gzip, bz2
These modules are all variations on a common theme of reading
and writing files which are compressed to
remove redundant bytes of data. The zlib and
bz2 modules have a more sophisticated
interface, allowing you to use compression selectively within a more
complex application. The gzip module has a
different (and simpler) interface that applies only to complete
files.
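The difference between the two styles of interface can be sketched briefly. The sample data and the file name sample.txt.gz are illustrative assumptions. The zlib calls compress a byte string held in memory, while gzip.open reads and writes a complete compressed file.

```python
import gzip
import os
import tempfile
import zlib

# Highly repetitive sample data, so the compression is easy to see.
text = b"spam and eggs " * 100

# zlib: compress and restore a byte string inside the program.
packed = zlib.compress(text)
unpacked = zlib.decompress(packed)

# gzip: treat a whole compressed file like an ordinary file object.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt.gz")
    with gzip.open(path, "wb") as target:
        target.write(text)
    with gzip.open(path, "rb") as source:
        restored = source.read()

print(len(text), "bytes compressed to", len(packed))
```

The bz2 module offers compress and decompress functions that are used in the same way as the zlib calls shown here.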
Chapter 26 - Python Runtime Services
The modules described in Chapter 26 of the Library reference
include some that are used for handling various kinds of files. We'll
look closely at just one.
sys
This module has several system-specific parameters and
functions, including definitions of the three standard files that
are available to every program.
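The three standard files are sys.stdin, sys.stdout and sys.stderr. As a minimal sketch, the hypothetical report function below writes normal output to one file and warnings to another. It defaults to the standard files and accepts substitutes so that the output can be captured.

```python
import io
import sys


def report(count, out=None, err=None):
    """Write a summary to out and any warning to err (hypothetical helper)."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    print(f"{count} lines processed", file=out)
    if count == 0:
        print("warning: no input", file=err)


# Normal use: the summary goes to standard output.
report(3)

# The same function writing to in-memory files instead.
captured_out = io.StringIO()
captured_err = io.StringIO()
report(0, out=captured_out, err=captured_err)
```

Keeping warnings on sys.stderr means they still reach the terminal when standard output is redirected to a file.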
The os.path Module
The os.path module contains more useful
functions for managing path and directory names. A serious mistake is to
use ordinary string functions with literal
strings for the path separators. A Windows
program using \ as the separator won't work anywhere
else. A less serious mistake is to build paths by hand with os.sep
instead of using the routines in the os.path
module.
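The portability problem can be made concrete. This sketch assumes a made-up three-part path. It uses the standard ntpath and posixpath modules, which are the Windows and POSIX implementations behind os.path, so that both results can be seen on any machine.

```python
import ntpath
import os.path
import posixpath

# Hypothetical path components.
parts = ("reports", "2024", "summary.txt")

# Hard-coded Windows separator: only meaningful on Windows.
hard_coded = "\\".join(parts)

# os.path.join picks the separator for the platform the program runs on.
portable = os.path.join(*parts)

# What os.path.join produces on each family of operating system.
windows_style = ntpath.join(*parts)    # 'reports\\2024\\summary.txt'
posix_style = posixpath.join(*parts)   # 'reports/2024/summary.txt'

# On a POSIX system the hard-coded string is one long file name.
posix_view = posixpath.split(hard_coded)

print(portable, windows_style, posix_style, posix_view)
```

On a POSIX system the hard-coded string does not split into a directory and a file name at all, which is the failure the text warns about.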
The os.path module contains the following
functions for completely portable path and filename manipulation.
os.path.basename(path) → fileName
Return the base filename, the second half of the result
created by os.path.split(path).
os.path.dirname(path) → dirName
Return the directory name, the first half of the result
created by os.path.split(path).
os.path.exists(path) → boolean
Return True if the pathname refers to an existing file or
directory.
os.path.getatime(path) → time
Return the last access time of a file, reported by
os.stat. See the time
module for functions to process the time value.
os.path.getmtime(path) → time
Return the last modification time of a file, reported by
os.stat. See the time
module for functions to process the time value.
os.path.getsize(path) → int
Return the size of a file, in bytes, reported by
os.stat.
os.path.isdir(path) → boolean
Return True if the pathname refers to an existing
directory.
os.path.isfile(path) → boolean
Return True if the pathname refers to an existing regular
file.
os.path.join(string, ...) → path
Join path components using the appropriate path
separator.
os.path.split(path) → tuple
Split a pathname into two parts: the directory and the
basename (the filename, without path separators, in that
directory). The result (s, t) is such that
os.path.join(s, t) yields the original
path.
os.path.splitdrive(path) → tuple
Split a pathname into a drive specification and the rest of
the path. Useful on DOS/Windows/NT.
os.path.splitext(path) → tuple
Split a path into root and extension. The extension is
everything starting at the last dot in the last component of the
pathname; the root is everything before that. The result (r, e) is
such that r+e yields the original path.
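A short sketch can exercise most of the functions listed above. The path data/reports/summary.tar.gz and the temporary file sample.txt are illustrative assumptions. The path is built with os.path.join, so the results hold on any operating system.

```python
import os.path
import tempfile
import time

# Pure name manipulation: no file needs to exist.
path = os.path.join("data", "reports", "summary.tar.gz")
head, tail = os.path.split(path)        # directory part, base name
root, ext = os.path.splitext(path)      # everything before the last dot, '.gz'
same_tail = os.path.basename(path)      # second half of split()
same_head = os.path.dirname(path)       # first half of split()

# Queries about a real file, created in a temporary directory.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "sample.txt")
    with open(sample, "wb") as target:
        target.write(b"hello\n")
    is_file = os.path.isfile(sample)
    is_dir = os.path.isdir(folder)
    size = os.path.getsize(sample)
    modified = time.ctime(os.path.getmtime(sample))

# The temporary directory is gone once the with block ends.
still_there = os.path.exists(sample)

print(head, tail, root, ext, size, modified, still_there)
```

Note that splitext removes only the final extension, so summary.tar.gz splits into summary.tar and .gz.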
The following example is typical of the manipulations done with
os.path.
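A minimal sketch of such a program, written in Python 3 to match the notes that follow, might look like this. The body of process is a placeholder. The convert wrapper, the names dirName, baseName and fileName, and the case-insensitive extension test are assumptions made for illustration.

```python
import os.path
import sys


def process(oldFile, newFile):
    """Placeholder for the real work; here it only reports the two names."""
    print("processing", oldFile, "into", newFile)


def convert(names):
    """Process every .RST file in names and return the (old, new) name pairs."""
    converted = []
    for oldFile in names:
        # Separate the directory from the base name, then the extension.
        dirName, baseName = os.path.split(oldFile)
        fileName, ext = os.path.splitext(baseName)
        if ext.upper() == ".RST":
            # Rebuild the name from the path, base name and new extension.
            newFile = os.path.join(dirName, fileName + ".HTML")
            print(oldFile, "->", newFile)
            process(oldFile, newFile)
            converted.append((oldFile, newFile))
    return converted


if __name__ == "__main__":
    # sys.argv[0] is the script name, so the file names start at index 1.
    convert(sys.argv[1:])
```

Returning the list of name pairs is not needed for the conversion itself; it only makes the sketch easy to check without touching any files.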
The process function does something
interesting and useful to the input file. It is the real heart of
the program.
The for statement sets the variable
oldFile to each string
(after the first) in the sequence
sys.argv.
Each file name is split into the path name and the base
name. The base name is further split to separate the file name
from the extension. The os.path module does this
correctly for all operating systems, saving us from having to write
platform-specific code. For example, splitext
correctly handles the situation where a Linux file has multiple
'.'s in the file name.
The extension is tested to be '.RST'. A new file name is
created from the path, base name and a new extension ('.HTML').
The old and new file names are printed and some processing,
defined in the process function, uses the
oldFile and newFile
names.
Published under the terms of the Open Publication License