The GNU/Linux view of files can be surprising for programmers with
a background that focuses on mainframe Z/OS or Windows. This is
additional background information for programmers who are new to the
POSIX use of the file abstraction. This POSIX view informs how Python
works.
In the Z/OS world, files are called data
sets, and can be managed by the OS catalog or left
uncataloged. While this is also true in the GNU/Linux world, the catalog
(called a directory) is seamless, silent and automatic, making files far
easier to manage than they are in the Z/OS world. In the GNU/Linux
world, uncataloged, temporary files are atypical, rarely used, and
require special APIs.
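As a rough illustration of such a special API, the sketch below uses
Python's tempfile.TemporaryFile on a POSIX system. The helper name
scratch_roundtrip is made up for this example. It assumes a GNU/Linux
(or other POSIX) platform, where the temporary file has no directory
entry at all; other platforms may behave differently.

```python
import os
import tempfile


def scratch_roundtrip(payload):
    """Write payload to an anonymous temporary file and read it back.

    Returns the bytes read and the number of directory entries (hard
    links) that refer to the file. Hypothetical helper for illustration.
    """
    with tempfile.TemporaryFile() as scratch:
        scratch.write(payload)
        scratch.seek(0)  # rewind; this also flushes the write buffer
        links = os.fstat(scratch.fileno()).st_nlink
        return scratch.read(), links


if __name__ == "__main__":
    data, links = scratch_roundtrip(b"hello")
    # On POSIX the link count is expected to be 0: the file is usable
    # but does not appear in any directory.
    print(data, links)
```

The file vanishes when it is closed, which is why it never needs a name
in the directory.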
In the Z/OS world, files are generally limited to disk files and
nothing else. This is different from the GNU/Linux use of "file" to mean
almost any kind of external device or service.
Block Mode Files. File devices can be organized into two different kinds of
structures: block mode and character
mode. Block mode devices are exemplified by magnetic
disks: the data is structured into blocks of bytes that can be
accessed in any order. Both the media (disk) and read-write head can
move; the device can be repositioned to any block as often as
necessary. A disk provides direct (sometimes also called random)
access to each block of data.
Character mode devices are exemplified by network connections: the
bytes come pouring into the processor buffers. The stream cannot be
repositioned. If the buffer fills up and bytes are missed, the lost data
are gone forever.
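The non-repositionable nature of a stream can be seen with an ordinary
pipe, which behaves like a character mode device in this respect. The
following is a minimal sketch; the function name consume_pipe is
invented for the example, and the message is assumed to be small enough
to fit in the pipe's kernel buffer.

```python
import os


def consume_pipe(message):
    """Send message through a pipe; return (first_byte, rest, could_rewind).

    Assumes message is small (well under the kernel pipe buffer size),
    so the write completes before anything is read.
    """
    read_fd, write_fd = os.pipe()
    with os.fdopen(read_fd, "rb") as reader:
        with os.fdopen(write_fd, "wb") as writer:
            writer.write(message)
        # The writer is now closed, so the reader sees end-of-file
        # once the message has been consumed.
        first = reader.read(1)
        try:
            reader.seek(0)  # try to go back to the start
            could_rewind = True
        except OSError:  # io.UnsupportedOperation is a subclass of OSError
            could_rewind = False
        rest = reader.read()
    return first, rest, could_rewind


if __name__ == "__main__":
    print(consume_pipe(b"abc"))
```

Once the first byte has been read, there is no way to ask the pipe for
it again; only the remaining bytes are available.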
Operating system support for block mode devices includes file
directories and file management utilities for deleting, renaming and
copying files. Modern operating systems include file navigators (Finders
or Explorers), iconic representations of files, and standard GUI dialogs
for opening files from within application programs. The operating system
also handles moving data blocks from memory buffers to disk and from
disk to memory buffers. All of the device-specific vagaries are handled
by having a variety of device drivers so that a
range of physical devices can be supported in a uniform manner by a
single operating system software interface.
Files on block mode devices are sometimes called seekable. They support
the operating system seek function, which can begin reading from any byte
of the file. If the file is structured in fixed-size blocks or records,
this seek function can be very simple and effective. Typically, database
applications are designed to work with fixed-size blocks so that seeking
is always done to a block, from which database rows are
manipulated.
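Seeking to a fixed-size record can be sketched in a few lines of Python.
The record layout here (a 4-byte integer id followed by a 12-byte name)
and the function names are assumptions made up for the example, not a
format used by any particular database.

```python
import struct
import tempfile

# Hypothetical layout: little-endian unsigned 32-bit id, 12-byte name.
RECORD = struct.Struct("<I12s")


def write_records(f, rows):
    """Append each (id, name) pair to f as one fixed-size record."""
    for rec_id, name in rows:
        # struct pads a short name with zero bytes up to 12 bytes.
        f.write(RECORD.pack(rec_id, name.encode("ascii")))


def read_record(f, index):
    """Seek directly to record number index (0-based) and decode it."""
    f.seek(index * RECORD.size)
    rec_id, raw_name = RECORD.unpack(f.read(RECORD.size))
    return rec_id, raw_name.rstrip(b"\x00").decode("ascii")


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        write_records(f, [(10, "alice"), (20, "bob"), (30, "carol")])
        print(read_record(f, 2))  # jump straight to the third record
        print(read_record(f, 0))  # then back to the first
```

Because every record has the same size, the byte offset of any record is
simply its index multiplied by the record size.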
Character Mode Devices and Keyboards. Operating systems also provide rich support for character mode
devices like networks and keyboards. Typically, a network connection
requires a protocol stack that interprets the
bytes into packets, and handles the error correction, sequencing and
retransmission of the packets. One of the most famous protocol stacks
is the TCP/IP stack. TCP/IP can make a streaming device appear like a
sequential file of bytes. Most operating systems come with numerous
client programs that make heavy use of the network; examples include
sendmail, ftp, and a web browser.
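The idea that a stream connection can look like a sequential file can be
sketched with Python's socket module. To keep the example
self-contained, it uses socket.socketpair (a connected pair of local
stream sockets) as a stand-in for a real TCP connection; the function
name lines_over_socket is invented for the illustration.

```python
import socket


def lines_over_socket(lines):
    """Send text lines through a stream socket and read them back as a file.

    Assumes the total payload is small enough to fit in the socket
    buffer, so sendall() does not block waiting for a reader.
    """
    left, right = socket.socketpair()
    with left, right:
        payload = "".join(line + "\n" for line in lines)
        left.sendall(payload.encode("utf-8"))
        left.shutdown(socket.SHUT_WR)  # signal end-of-stream to the reader
        # makefile() wraps the socket in an ordinary file-like object.
        with right.makefile("r", encoding="utf-8") as stream:
            return [line.rstrip("\n") for line in stream]


if __name__ == "__main__":
    print(lines_over_socket(["HELO", "QUIT"]))
```

The receiving side iterates over lines exactly as it would for a text
file on disk, even though the bytes arrived over a socket.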
A special kind of character mode file is the
console; it usually provides input from the
keyboard. The POSIX standard allows a program to be run so that input
comes from files, pipes or the actual user. If the input file is a
TTY (teletype), this is the actual human user's
keyboard. If the file is a pipe, this is a connection to another process
running concurrently. The keyboard console or TTY is different from
ordinary character mode devices, pipes or files for two reasons. First,
the keyboard often needs to explicitly echo characters back so that a
person can see what they are typing. Second, pre-processing must often
be done to make backspaces work as expected by people.
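A program can ask the operating system which of these kinds of input it
has been given. The sketch below is one possible way to do that on a
POSIX system; the function names describe_fd and demo are invented for
the example.

```python
import os
import stat
import tempfile


def describe_fd(fd):
    """Classify an open file descriptor as a tty, pipe, file, or other."""
    if os.isatty(fd):
        return "tty"
    mode = os.fstat(fd).st_mode
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISREG(mode):
        return "file"
    if stat.S_ISCHR(mode):
        return "character device"
    return "other"


def demo():
    """Classify a freshly made pipe and a temporary disk file."""
    read_fd, write_fd = os.pipe()
    try:
        pipe_kind = describe_fd(read_fd)
    finally:
        os.close(read_fd)
        os.close(write_fd)
    with tempfile.TemporaryFile() as scratch:
        file_kind = describe_fd(scratch.fileno())
    return pipe_kind, file_kind


if __name__ == "__main__":
    print(demo())
```

Applied to standard input with describe_fd(sys.stdin.fileno()), the
same function would distinguish a program run at an interactive
terminal from one whose input was redirected from a file or another
process.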
The echo feature is enabled for entering ordinary data or disabled
for entering passwords. The echo feature is accomplished by having
keyboard events be queued up for the program to read as if from a file.
These same keyboard events are automatically sent to update the GUI if
echo is turned on.
The pre-processing feature is used to allow some standard edits of
the input before the application program receives the buffer of input. A
common example is handling the backspace character. Most experienced
computer users expect that the backspace key will remove the last
character typed. This is handled by the OS: it buffers ordinary
characters, removes characters from the buffer when backspace is
received, and provides the final buffer of characters to the application
when the user hits the Return key. This handling of backspaces can also
be disabled; the application would then see the keyboard events as
raw characters. The usual mode is for the OS to
provide cooked characters, with backspace
characters handled before the application sees any data.
Typically, this is all handled in a GUI in modern applications.
However, Python provides some functions to interact with Unix TTY
console software to enable and disable echo and process raw keyboard
input.
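On Unix, the relevant standard library modules include termios, tty and
getpass. The following is a minimal sketch of switching the echo and
line-editing ("cooked") settings using termios. The helper names are
invented for the example, and it runs against a pseudo-terminal created
with pty.openpty so that it does not disturb a real keyboard session; on
a real console the file descriptor would be sys.stdin.fileno().

```python
import os
import pty
import termios

LFLAG = 3  # index of the "local modes" field in the tcgetattr() list


def set_local_flag(fd, flag, enabled):
    """Turn one local-mode flag (e.g. ECHO or ICANON) on or off for a TTY."""
    attrs = termios.tcgetattr(fd)
    if enabled:
        attrs[LFLAG] |= flag
    else:
        attrs[LFLAG] &= ~flag
    termios.tcsetattr(fd, termios.TCSANOW, attrs)


def local_flag_is_set(fd, flag):
    """Report whether a local-mode flag is currently on."""
    return bool(termios.tcgetattr(fd)[LFLAG] & flag)


def demo():
    """Toggle echo and canonical (cooked) mode on a pseudo-terminal."""
    master, slave = pty.openpty()
    try:
        set_local_flag(slave, termios.ECHO, False)  # as for a password
        echo_after_disable = local_flag_is_set(slave, termios.ECHO)
        set_local_flag(slave, termios.ECHO, True)  # ordinary data entry
        echo_after_enable = local_flag_is_set(slave, termios.ECHO)
        set_local_flag(slave, termios.ICANON, False)  # raw-style input
        cooked_after_disable = local_flag_is_set(slave, termios.ICANON)
        return echo_after_disable, echo_after_enable, cooked_after_disable
    finally:
        os.close(master)
        os.close(slave)


if __name__ == "__main__":
    print(demo())
```

For the common case of reading a password without echo, getpass.getpass
wraps this kind of logic and restores the terminal settings afterwards.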
File Formats and Access Methods. In Z/OS (and OpenVMS, and a few other operating systems) files
have very specific formats, and data access is mediated by the
operating system. In Z/OS, they call these access methods, and they
have names like BDAM or VSAM. This
view is handy in some respects, but it tends to limit you to the
access methods supplied by the OS vendor.
The GNU/Linux view is that files should be managed minimally by
the operating system. At the OS level, files are just bytes. If you
would like to impose some organization on the bytes of the file, your
application should provide the access method. You can, for example, use
a database management system (DBMS) to structure your bytes into tables,
rows and columns.
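This division of labor can be illustrated with SQLite, which ships with
Python as the sqlite3 module. In the sketch below, the table name, the
file name and the function name are made up for the example. The
database library presents tables and rows, while the operating system
sees only a file of bytes, which the last step reads directly.

```python
import os
import sqlite3
import tempfile


def build_and_peek(rows):
    """Store rows in a SQLite file, then read the file's first raw bytes.

    Returns (names read back through SQL, first 16 bytes of the file).
    """
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "demo.db")  # hypothetical file name
        conn = sqlite3.connect(path)
        try:
            conn.execute(
                "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)"
            )
            conn.executemany(
                "INSERT INTO person (id, name) VALUES (?, ?)", rows
            )
            conn.commit()
            cursor = conn.execute("SELECT name FROM person ORDER BY id")
            names = [name for (name,) in cursor]
        finally:
            conn.close()
        # To the OS this is an ordinary file; open it as plain bytes.
        with open(path, "rb") as raw:
            header = raw.read(16)
    return names, header


if __name__ == "__main__":
    print(build_and_peek([(1, "alice"), (2, "bob")]))
```

The structure (tables, rows, columns) exists only because the SQLite
library interprets the bytes; the operating system stores and returns
them without knowing what they mean.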
The C-language standard I/O library (stdio) can access files as a
sequence of individual lines; each line is terminated by the newline
character, \n. Since Python is built on the C libraries, Python can also
read files as a sequence of lines.
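In Python, iterating over a text file object yields one line at a time,
each still carrying its trailing newline. The sketch below shows this
with an in-memory stream standing in for a file opened with open(); the
function name numbered_lines is invented for the example.

```python
import io


def numbered_lines(stream):
    """Return (line_number, text) pairs, with the trailing newline removed."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(stream, start=1)
    ]


if __name__ == "__main__":
    # io.StringIO behaves like a text file opened for reading.
    sample = io.StringIO("alpha\nbeta\n")
    print(numbered_lines(sample))
```

The same function works unchanged on a real file, for example inside a
with open(path) as stream block.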