Programs often deal with external data, that is, data outside of volatile
primary memory. This external data could be persistent data on a file
system or transient data on an input-output device. Most operating systems
provide a simple, uniform interface to external data via files.
In the section called “File Semantics”,
we provide an overview of the semantics of files. We cover the most
important of Python's built-in functions for working with files in the section called “Built-in Functions”. In the section called “File Methods”,
we describe some method functions of file objects.
In one sense a file is a container for a sequence of bytes. A more
useful view, however, is that a file is a
container of data objects, encoded as a sequence of bytes. Files can be
kept on persistent but slow devices like disks. Files can also be
presented as a stream of bytes flowing through a network interface. Even
the user's keyboard can be processed as if it were a file; in this case
reading from the file makes our software wait until the person types
something.
Our operating systems use the abstraction of a file
as a way to unify access to a large number of
devices and operating system services. In the Linux world, all external
devices, plus a large number of in-memory data structures, are accessible
through the file interface. The wide variety of things with file-like
interfaces is a consequence of how Unix was originally designed. Since
the number and types of devices that can be connected to a computer are
effectively unbounded, device drivers were designed as simple, flexible
plug-ins to the operating system. For more information on the ubiquity of
files, see the section called “Additional Background”.
The file interface covers more than disk drives and network interfaces. Kernel
memory, random data generators, semaphores, shared memory blocks, and
other things have file interfaces, even though they aren't, strictly
speaking, devices. Our OS applies the file abstraction to many things.
Python, similarly, extends the file interface to include certain kinds
of in-memory buffers.
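As a small illustration of an in-memory buffer with a file interface, the sketch below uses the StringIO class from the standard io module. The variable names are only for illustration, and the u"" prefix is there so the same lines should run under both Python 2.6+ and Python 3.

```python
import io

# An in-memory text buffer that offers the same read interface as a file.
# The u"" prefix keeps the literal a text string under Python 2 as well.
memory_file = io.StringIO(u"first line\nsecond line\n")

# Iterate over the buffer exactly as we would over an open text file.
lines = [line.rstrip(u"\n") for line in memory_file]

memory_file.close()
```

Code that only reads lines does not need to know whether it was handed a disk file or a buffer like this one.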
All GNU/Linux operating systems make all devices available through
a standard file-oriented interface. Windows makes most devices available
through a reasonably consistent file interface. Python's
file class provides access to the OS file APIs,
giving our applications the same uniform access to a variety of
devices.
Important
The terminology is sometimes confusing. We have physical files
on our disk, the file abstraction in our operating system, and
file objects in our Python program. Our Python file
object makes use of the operating system
file APIs which, in turn, manipulate the files on a disk.
We'll try to be clear, but with only one overloaded word for
three different things, this chapter may sometimes be
confusing.
We rarely have a reason to talk about a physical file on a disk.
Generally we'll talk about the OS abstraction of file and the Python
class of file.
Standard Files. Consistent with POSIX standards, all Python programs have three
files available: sys.stdin, sys.stdout, and sys.stderr. These
files are used by certain built-in statements and functions. The
print statement, for example, writes to sys.stdout. The input and
raw_input functions both write their prompt to sys.stdout
and read their input from sys.stdin.
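The standard files can also be used explicitly, like any other file object. The sketch below assumes a hypothetical helper named report, which is not part of Python; it writes a line to whichever stream it is given and falls back to sys.stdout.

```python
import sys

def report(message, stream=None):
    """Write message plus a newline to stream (sys.stdout by default)."""
    if stream is None:
        # Looked up at call time so that a redirected sys.stdout is honored.
        stream = sys.stdout
    stream.write(message + "\n")

report("normal output goes to standard output")
report("diagnostics go to standard error", sys.stderr)
```

Keeping diagnostics on sys.stderr means they stay visible even when standard output is redirected to a file.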
These standard files are always available, and Python ensures that
they are handled consistently across all operating systems. The
sys module makes these files available for
explicit use. Newbies may want to check File Redirection for Newbies for some additional notes
on these standard files.
File Organization and Structure. Some operating systems provide support for a large variety of
file organizations. Different file organizations include different
record termination rules, possibly with keys, and possibly fixed-length
records. The POSIX standard, however, considers a file to be
nothing more than a sequence of bytes. It becomes entirely the job of
the application program, or of libraries outside the operating system, to
impose any organization on those bytes.
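To make this concrete, here is a minimal sketch of an application imposing a fixed-length record organization on a plain sequence of bytes. The record layout, the function names, and the sample data are all invented for this illustration; the standard struct module does the encoding, and an io.BytesIO buffer stands in for a file on disk.

```python
import io
import struct

# Hypothetical record layout: a 4-byte little-endian integer id followed by
# an 8-byte name field padded with NUL bytes. The layout is our choice; the
# operating system sees only bytes.
RECORD = struct.Struct("<i8s")

def write_records(stream, records):
    """Encode each (id, name) pair as one fixed-length record."""
    for record_id, name in records:
        stream.write(RECORD.pack(record_id, name))

def read_records(stream):
    """Decode fixed-length records until the stream runs out of bytes."""
    records = []
    while True:
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break
        record_id, name = RECORD.unpack(chunk)
        records.append((record_id, name.rstrip(b"\x00")))
    return records

raw = io.BytesIO()  # stands in for a binary file
write_records(raw, [(1, b"disk"), (2, b"keyboard")])
raw.seek(0)
records = read_records(raw)
```

Nothing in the stored bytes marks where one record ends and the next begins; only the program's knowledge of the 12-byte layout makes the data readable.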
The basic file objects in Python consider a
file to be a sequence of characters. (These can be ASCII or Unicode
characters.) The characters can be processed as a sequence of
variable-length lines, each line terminated with a newline character. Files moved
from a Windows environment may contain lines with an extraneous ASCII
carriage return character (\r), which is easily
removed with the string strip method.
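The following sketch shows the carriage return cleanup on a few sample lines. The line contents are made up for illustration. Note that strip removes whitespace from both ends of the string, so rstrip is the safer choice when leading indentation matters.

```python
# Sample lines as they might be read from a file created on Windows;
# the last line uses a bare newline for comparison.
raw_lines = ["alpha\r\n", "  indented beta\r\n", "gamma\n"]

# strip() removes all leading and trailing whitespace, including "\r" and "\n".
stripped = [line.strip() for line in raw_lines]

# rstrip("\r\n") removes only trailing carriage returns and newlines.
trimmed = [line.rstrip("\r\n") for line in raw_lines]
```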
Ordinary text files can be managed directly with the built-in
file objects and their methods for reading and
writing lines of data. We will cover this basic text file processing in
the rest of this chapter.