This section surveys about 50 of the most useful library
modules. These modules are proven technology, widely used, heavily
tested and constantly improved. The time spent learning these modules
will reduce the time it takes you to build an application that does
useful work.
We'll dig more deeply into just a few of these modules in
subsequent chapters.
Lessons Learned
As consultants, we've seen far too many programmers writing
modules which overlap these. There are two causes: ignorance and
hubris. In this section, we hope to tackle ignorance.
Python includes a large number of pre-built modules. The more
you know about these, the less programming you have to do.
Hubris sometimes comes from the feeling that the library module
doesn't fit our unique problem well enough to justify studying it.
In many environments you can't read a library module to see
what it really does. In Python, the documentation
is only an introduction; you're encouraged to actually read the
library module.
We find that hubris is most closely associated with calendrical
calculations. It isn't clear why programmers invest so much time and
effort writing buggy calendrical calculations. Python provides many
modules for dealing with times, dates and the calendar.
4. String Services. The String Services modules contain string-related functions or
classes. See Chapter 12, Strings for more information
on strings.
re
The re module is the core of text
pattern recognition and processing. A regular
expression is a formula that specifies how to
recognize and parse strings. The re module
is described in detail in Chapter 31, Complex Strings: the re Module.
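As a minimal sketch of recognizing and parsing a string with a regular expression, the following picks apart a log line. The log format and the group names are assumptions made up for this illustration.

```python
import re

# Hypothetical log line, invented for this example.
line = "2007-03-15 ERROR disk full on /dev/sda1"

# Named groups recognize the date, the severity and the rest of the message.
pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<level>[A-Z]+)\s+(?P<text>.*)")

match = pattern.match(line)
fields = match.groupdict() if match else {}
print(fields)
```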
struct
The avowed purpose of the struct
module is to allow a Python program to access C-language APIs; it
packs and unpacks C-language struct objects. It turns out that this
module can also help you deal with files in packed binary
formats.
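A short sketch of packing and unpacking a binary record might look like this. The record layout (a 2-byte id, a 4-byte count and a 4-byte float, little-endian) is an assumption chosen for illustration, not a real file format.

```python
import struct

# Assumed layout: "<" little-endian with no padding, "H" unsigned 16-bit,
# "i" signed 32-bit, "f" 32-bit float.
RECORD = "<Hif"

packed = struct.pack(RECORD, 7, -42, 1.5)
record_id, count, ratio = struct.unpack(RECORD, packed)
size = struct.calcsize(RECORD)   # bytes per record
```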
difflib
The difflib module contains the
essential algorithms for comparing two sequences, usually
sequences of lines of text. This has algorithms similar to those
used by the Unix diff command (the Windows
COMP command).
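For example, comparing two small lists of lines could be sketched as follows. The file names passed to unified_diff are only labels for the output, and the contents are made up.

```python
import difflib

old = ["spam\n", "eggs\n", "ham\n"]
new = ["spam\n", "bacon\n", "ham\n"]

# unified_diff yields the header and change lines one at a time.
diff = list(difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt"))
print("".join(diff))
```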
StringIO, cStringIO
There are two variations on StringIO
which provide file-like objects that read from or write to a
string buffer. The StringIO module defines
the class StringIO, from which subclasses
can be derived. The cStringIO module
provides a high-speed C-language implementation that can't be
subclassed.
Note that these modules have atypical mixed-case
names.
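A minimal sketch of a string buffer used as a file follows. The fallback import is an assumption for readers on Python 3, where the StringIO class is found in the io module instead.

```python
try:
    from StringIO import StringIO   # the Python 2 module described above
except ImportError:
    from io import StringIO         # assumption: Python 3 location of the class

buffer = StringIO()
buffer.write("first line\n")
buffer.write("second line\n")
text = buffer.getvalue()            # everything written so far, as one string

reader = StringIO(text)             # a second buffer, read like a file
lines = reader.readlines()
```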
textwrap
This is a module to format plain text. While the
word-wrapping task is sometimes handled by word processors, you
may need this in other kinds of programs. Plain text files are
still the most portable, standard way to provide a
document.
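Word-wrapping a paragraph to a fixed width can be sketched like this; the 30-column width is an arbitrary choice for the example.

```python
import textwrap

paragraph = ("Plain text files are still the most portable, "
             "standard way to provide a document.")

# fill() returns one string with newlines inserted at word boundaries.
wrapped = textwrap.fill(paragraph, width=30)
print(wrapped)
```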
codecs
This module has hundreds of text encodings. This includes
the vast array of Windows code pages and the Macintosh code pages.
The most commonly used are the various Unicode schemes (utf-16 and
utf-8). However, there are also a number of codecs for translating
between strings of text and arrays of bytes. These schemes include
base-64, zip compression, bz2 compression, various quoting rules,
and even the simple rot_13 substitution cipher.
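The following sketch shows a Unicode string encoded two ways, plus the rot_13 codec. It assumes an interpreter that accepts both the u"" and b"" literal forms; the sample text is made up.

```python
import codecs

text = u"caf\u00e9"                       # four characters, the last one accented

utf8_bytes = text.encode("utf-8")         # the accented letter takes two bytes
utf16_bytes = text.encode("utf-16-le")    # two bytes per character here
round_trip = utf8_bytes.decode("utf-8")

secret = codecs.encode("Hello", "rot_13")
```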
5. Data Types. The Data Types modules implement a number of widely-used data
structures. These aren't as useful as sequences, dictionaries or
strings, which are built in to the language. These data types
include dates, general collections, arrays, and scheduled events. This
group also includes modules for searching lists, copying structures or
producing nicely formatted output for a complex structure.
datetime
The datetime module handles details of the
calendar, including dates and times. Additionally, the
time module provides some more basic
functions for time and date processing. We'll cover both modules
in detail in Chapter 32, Dates and Times: the time and
datetime Modules.
These modules mean that you never need to attempt your own
calendrical calculations. One of the important lessons learned in
the late 90's was that many programmers love to tackle calendrical
calculations, but their efforts had to be tested and reworked
prior to January 1, 2000, because of innumerable small
problems.
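As an illustration of letting the library do the calendrical work, this sketch computes an interval that spans a leap day and rolls a date over the end of 1999. The dates are chosen only for the example.

```python
from datetime import date, timedelta

start = date(2000, 2, 28)
end = date(2000, 3, 1)
gap = end - start                 # a timedelta; 2000 was a leap year

# Adding one day to the last day of 1999 rolls over the year correctly.
due = date(1999, 12, 31) + timedelta(days=1)
day_of_week = due.weekday()       # Monday is 0, Sunday is 6
```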
calendar
This module contains routines for displaying and working
with the calendar. This can help you determine the day of the week
on which a month starts and ends; it can count leap days in an
interval of years, etc.
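A brief sketch of those two questions, using February 2000 and an arbitrary range of years:

```python
import calendar

# monthrange gives the weekday of the first day (Monday is 0)
# and the number of days in the month.
first_weekday, days_in_month = calendar.monthrange(2000, 2)

# leapdays counts leap years from 2000 up to, but not including, 2021.
leap_days = calendar.leapdays(2000, 2021)

print(calendar.month(2000, 2))    # a plain-text calendar for the month
```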
collections
This package contains two data types, and is likely to grow
with future releases of Python. One type is the
deque -- a "double-ended queue" -- that can
be used as a stack (LIFO) or queue (FIFO). The other class is a
specialized dictionary, defaultdict, which
can return a default value instead of raising an exception for
missing keys.
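Both types can be sketched in a few lines. The sample data is invented for the illustration.

```python
from collections import defaultdict, deque

queue = deque([1, 2, 3])
queue.append(4)             # add on the right
first = queue.popleft()     # take from the left: FIFO behavior
last = queue.pop()          # take from the right: LIFO behavior

# int() returns 0, so a missing key starts counting from zero.
counts = defaultdict(int)
for word in "the quick the lazy the".split():
    counts[word] += 1
```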
bisect
The bisect module contains the
bisect function to search a sorted list for a
specific value. It also contains the insort
function to insert an item into a list maintaining the sorted
order. Inserting with insort is typically faster than appending a
value and calling the sort method of the list after each addition.
This module's source is instructive as a lesson in well-crafted
algorithms.
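A small sketch of both functions on an already sorted list:

```python
import bisect

scores = [10, 20, 30, 40]

position = bisect.bisect(scores, 25)   # index at which 25 would be inserted

bisect.insort(scores, 25)              # the list stays sorted after each insert
bisect.insort(scores, 5)
```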
array
The array module gives you a
high-performance, highly compact collection of values. It isn't as
flexible as a list or a tuple, but it is fast and takes up
relatively little memory. This is helpful for processing media
like image or sound files.
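The trade-off between compactness and flexibility can be sketched as follows. The sample values stand in for audio samples and are made up; the "h" type code selects signed short integers.

```python
from array import array

samples = array("h", [0, 1200, -1200, 32767])

loudest = max(samples)             # arrays support the usual sequence operations
bytes_per_item = samples.itemsize

# Unlike a list, an array rejects values of the wrong type.
rejected = False
try:
    samples.append("loud")
except TypeError:
    rejected = True
```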
sched
The sched module contains the
definition for the scheduler class that
builds a simple task scheduler. When a scheduler is constructed, it
is given two user-supplied functions: one returns the
“time” and the other executes a “delay”
waiting for the time to arrive. For real-time scheduling, the
time module time and
sleep functions can be used. The scheduler
has a main loop that calls the supplied time function and compares
the current time with the time for scheduled tasks; it then calls
the supplied delay function for the difference in time. It runs
the scheduled task, and calls the delay function with a duration
of zero to release any resources.
Clearly, this simple algorithm is very versatile. By
supplying custom time functions that work in minutes instead of
seconds, and a delay function that does additional background
processing while waiting for the scheduled time, a flexible task
manager can be constructed.
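The idea of supplying your own time and delay functions can be sketched with a simulated clock. The names fake_time, fake_delay and task are hypothetical helpers invented for this example; a real program would more likely pass time.time and time.sleep.

```python
import sched

clock = [0]     # simulated current time, in arbitrary units
log = []

def fake_time():
    return clock[0]

def fake_delay(duration):
    # A real delay function would sleep or do background work here.
    clock[0] += duration

def task(name):
    log.append((fake_time(), name))

scheduler = sched.scheduler(fake_time, fake_delay)
scheduler.enterabs(10, 1, task, ("second",))   # absolute time, priority, action, arguments
scheduler.enterabs(5, 1, task, ("first",))
scheduler.run()                                # runs tasks in time order
```

Because the clock is simulated, the whole schedule runs instantly, which is also a convenient way to test scheduled behavior.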
copy
The copy module contains functions
for making copies of complex objects. This module contains a
function to make a shallow copy of an
object, where any objects contained within the parent are not
copied, but references are inserted in the parent. It also
contains a function to make a deep copy of
an object, where all objects contained within the parent object
are duplicated.
Note that Python's simple assignment only creates a variable
which is a label (or reference) to an object, not a duplicate
copy. This module is the easiest way to create an independent
copy.
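The difference between assignment, a shallow copy and a deep copy can be sketched with a small nested structure. The dictionary contents are made up.

```python
import copy

original = {"name": "config", "ports": [80, 443]}

alias = original                   # just another label for the same object
shallow = copy.copy(original)      # new dict, but it shares the inner list
deep = copy.deepcopy(original)     # new dict and a new inner list

original["ports"].append(8080)     # visible through alias and shallow, not deep
```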
pprint
The pprint module contains some
useful functions like pprint.pprint for
printing easy-to-read representations of nested lists and
dictionaries. It also has a PrettyPrinter
class from which you can make subclasses to customize the way in
which lists or dictionaries or other objects are printed.
6. Numeric and Mathematical Modules. These modules include more specialized mathematical functions
and some additional numeric data types.
decimal
The decimal module provides decimal-based arithmetic which
correctly handles significant digits, rounding and other features
common to currency amounts.
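A brief sketch of why this matters for money: binary floating point can't represent 0.10 exactly, while Decimal can. The price and tax rate are invented for the example.

```python
from decimal import Decimal, ROUND_HALF_UP

float_sum = 0.10 + 0.20                        # not exactly 0.30 in binary floating point
exact_sum = Decimal("0.10") + Decimal("0.20")  # exactly 0.30, trailing zero preserved

price = Decimal("19.99")
# Round the computed tax to whole cents.
tax = (price * Decimal("0.0825")).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```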
7. Internet Data Handling. The Internet Data Handling modules contain a number of handy
algorithms. A great deal of data is defined by the Internet Request
for Comments (RFCs). Since these effectively
standardize data on the Internet, it helps to have modules already in
place to process this standardized data. Most of these modules are
specialized, but a few have much wider application.
mimify, base64, binascii, binhex, quopri, uu
These modules all provide various kinds of conversions,
escapes or quoting so that binary data can be manipulated as safe,
universal ASCII text. The number of these modules reflects the
number of different clever solutions to the problem of packing
binary data into ordinary email messages.
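Two of these conversions can be sketched on a few arbitrary bytes. The b"" literal assumes an interpreter recent enough to accept it.

```python
import base64
import binascii

raw = b"\x00\xffPNG\x89"             # made-up binary data

b64_text = base64.b64encode(raw)     # safe ASCII, about a third larger
hex_text = binascii.hexlify(raw)     # two hexadecimal digits per byte

restored = base64.b64decode(b64_text)
```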
8. Structured Markup Processing Tools. The following modules contain algorithms for working with
structured markup: Standard Generalized Markup Language (SGML), Hypertext
Markup Language (HTML) and Extensible Markup Language (XML). These
modules simplify the parsing and analysis of complex documents. In
addition to these modules, you may also need to use the CSV module for
processing files; that's in chapter 9, File Formats.
htmllib
Ordinary HTML documents can be examined with the
htmllib module. This module is based on the
sgmllib module. The basic
HTMLParser class definition is a
superclass; you will typically override the various functions to
do the appropriate processing for your application.
One problem with parsing HTML is that browsers, in order to
conform with the applicable standards, must accept incorrect
HTML. This means that many web sites publish HTML which is
tolerated by browsers, but can't easily be parsed by
htmllib. When confronted with serious
horrors, consider downloading the Beautiful Soup module. This
handles erroneous HTML more gracefully than
htmllib.
xml.sax, xml.dom, xml.dom.minidom
The xml.sax and
xml.dom modules provide the classes
necessary to conveniently read and process XML documents. A SAX
parser separates the various types of content and passes a series
of events to the handler objects attached to the parser. A DOM parser
decomposes the document into the Document Object Model
(DOM).
The xml.dom module contains the
classes which define an XML document's structure. The
xml.dom.minidom module contains a parser
which creates a DOM object.
Additionally, there is a Miscellaneous Module (in chapter 33) that
goes along with these.
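As a minimal sketch of the DOM approach, the xml.dom.minidom parser can turn a small document into a tree that is then searched. The XML text is invented for the example.

```python
from xml.dom import minidom

document = minidom.parseString(
    '<order id="42"><item qty="2">spam</item><item qty="1">eggs</item></order>')

root = document.documentElement

# Each item element has one text child; its qty attribute is a string.
items = [(node.firstChild.data, int(node.getAttribute("qty")))
         for node in root.getElementsByTagName("item")]
```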
formatter
The formatter module can be used in
conjunction with the HTML and XML parsers. A formatter instance
depends on a writer instance that produces the final (formatted)
output. It can also be used on its own to format text in different
ways.
9. File Formats. These are modules for reading and writing files in a few of the
amazing variety of file formats that are in common use. In addition to
these common formats, modules in chapter 8, Structured Markup
Processing Tools are also important.
csv
The csv module helps you parse and
create Comma-Separated Value (CSV) data files.
This helps you exchange data with many desktop tools that produce
or consume CSV files. We'll look at this in the section called “Comma-Separated Values: The csv
Module”.
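A short sketch of why a real CSV parser beats splitting on commas: quoted fields may contain commas themselves. The sample rows are made up, and a list of strings stands in for an open file.

```python
import csv

lines = ['name,quantity', '"Smith, John",3', 'Jones,5']

rows = list(csv.reader(lines))     # the reader accepts any iterable of lines
header, data = rows[0], rows[1:]
```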
ConfigParser
Configuration files can take a number of forms. The simplest
approach is to use a Python module as the configuration for a
large, complex program. Sometimes configurations are encoded in
XML. Many Windows legacy programs use .INI
files. The ConfigParser module can gracefully parse these files. We'll
look at this in the section called “Property Files and Configuration (or .INI)
Files: The ConfigParser Module”.
10. Cryptographic Services. These modules aren't specifically encryption modules. Many
popular encryption algorithms are protected by patents. Often,
encryption requires compiled modules for performance reasons. These
modules compute secure digests of messages using a variety of
algorithms.
hashlib, hmac, md5, sha
Compute a secure hash or digest of a message to ensure that
it was not tampered with. MD5, for example, is often used for
validating that a downloaded file was received correctly and
completely.
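Checking a download against a published digest could be sketched as follows. The byte strings stand in for file contents and are invented for the example.

```python
import hashlib

payload = b"contents of a downloaded file"
published = hashlib.md5(payload).hexdigest()   # digest the publisher would list

received = b"contents of a downloaded file"
damaged = b"contents of a downloaded fil3"     # one byte differs

intact = hashlib.md5(received).hexdigest() == published
corrupt = hashlib.md5(damaged).hexdigest() != published
```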
11. File and Directory Access. We'll look at many of these modules in Chapter 33, File Handling Modules. These are the modules which are essential
for handling data files.
os, os.path
The os and
os.path modules are critical for creating
portable Python programs. The popular operating systems (Linux,
Windows and MacOS) each have different approaches to the common
services provided by an operating system. A Python program can
depend on os and
os.path modules behaving consistently in
all environments.
One of the most obvious differences among operating systems
is the way that files are named. In particular, the
path separator can be either the POSIX
standard /, or the Windows \.
Additionally, the Mac OS Classic mode can also use :.
Rather than make each program aware of the operating system rules
for path construction, Python provides the
os.path module to make all of the common
filename manipulations completely consistent.
Programmers are faced with a dilemma between writing a
“simple” hack to strip paths or extensions from
file names and using the os.path module.
Some programmers argue that the os.path
module is too much overhead for such a simple problem as
removing the .html from a file name. Other
programmers recognize that most hacks are a false economy: in
the long run they do not save time, but rather lead to costly
maintenance when the program is expanded or modified.
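The contrast between a quick hack and os.path can be sketched like this. The file names are hypothetical and chosen to show where the hack goes wrong.

```python
import os.path

path = os.path.join("site", "docs", "index.html")   # uses the local separator
directory, filename = os.path.split(path)
stem, extension = os.path.splitext(filename)

# A "simple" hack removes every occurrence, not just the extension.
name = "my.html.page.html"
hack = name.replace(".html", "")
proper_stem, proper_ext = os.path.splitext(name)
```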
fileinput
The fileinput module helps your
program process a large number of files smoothly and simply.
glob, fnmatch
The glob and
fnmatch modules help a Windows program
handle wild-card file names in a manner consistent with other
operating systems.
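Wild-card matching can be sketched with fnmatch on a made-up list of names; glob.glob applies the same pattern rules to names actually found in a directory.

```python
import fnmatch

names = ["report.txt", "report.csv", "notes.txt", "image01.png"]

text_files = fnmatch.filter(names, "*.txt")        # * matches any run of characters
numbered = fnmatch.filter(names, "image??.png")    # ? matches exactly one character
```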
shutil
The shutil module provides shell-like
utilities for file copy, file rename, directory moves, etc. This
module lets you write short, effective Python programs that do
things that are typically done by shell scripts.
Why use Python instead of the shell? Python is far easier to
read, far more efficient, and far more capable of writing
moderately sophisticated programs. Using Python saves you from
having to write long, painful shell scripts.
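A few shell-like operations might be sketched as follows. Everything happens inside a temporary directory, and the file names are invented for the example.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
try:
    source = os.path.join(workdir, "notes.txt")
    with open(source, "w") as f:
        f.write("hello\n")

    backup = os.path.join(workdir, "notes.bak")
    shutil.copy(source, backup)                              # like cp

    archive = os.path.join(workdir, "archive")
    os.mkdir(archive)
    shutil.move(backup, os.path.join(archive, "notes.bak"))  # like mv

    found = sorted(os.listdir(archive))
finally:
    shutil.rmtree(workdir)                                   # like rm -r
```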
12. Data Compression and Archiving. These modules handle the various file compression algorithms
that are available. We'll look at these modules in Chapter 33, File Handling Modules.
tarfile, zipfile
These two modules create archive files, which contain a
number of files that are bound together. The TAR format is not
compressed, whereas the ZIP format is compressed. Often a TAR
archive is compressed using GZIP to create a .tar.gz
archive.
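Building and reading a ZIP archive can be sketched entirely in memory. The member names and contents are made up, and ZIP_DEFLATED is requested explicitly because zipfile stores members uncompressed unless asked otherwise.

```python
import io
import zipfile

buffer = io.BytesIO()
archive = zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED)
archive.writestr("readme.txt", "spam " * 200)      # repetitive text compresses well
archive.writestr("data/values.csv", "a,b\n1,2\n")
archive.close()

archive = zipfile.ZipFile(io.BytesIO(buffer.getvalue()))
names = archive.namelist()
info = archive.getinfo("readme.txt")
text = archive.read("data/values.csv")
archive.close()
```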
zlib, gzip, bz2
These modules implement different compression algorithms. They all
have similar features to compress or uncompress files.
13. Data Persistence. There are several issues related to making objects persistent.
In Chapter 9 of the Python Reference, there are several modules that
help deal with files in various kinds of formats. We'll talk about
these modules in detail in Chapter 34, File Formats: CSV, Tab, XML, Logs and Others.
There are several additional techniques for managing persistence.
We can "pickle" or "shelve" an object. In this case, we don't define our
file format in detail; instead, we leave it to Python to persist our
objects.
We can map our objects to a relational database. In this case,
we'll use the SQL language to define our storage, create and retrieve
our objects.
pickle, shelve
The pickle and
shelve modules are used to create
persistent objects; objects that persist beyond the one-time
execution of a Python program. The pickle
module produces a serial text representation of any object,
however complex; it can also reconstitute an object from its text
representation. The shelve module uses a
dbm database to store and retrieve objects.
The shelve module is not a complete
object-oriented database, as it lacks any transaction management
capabilities.
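The round trip through pickle can be sketched with a small nested structure. The data is invented; as a general precaution, only unpickle data that comes from a source you trust.

```python
import pickle

inventory = {"spam": [1, 2, 3], "eggs": {"fresh": True}, "total": 12.5}

saved = pickle.dumps(inventory)     # serialized form, ready to write to a file
restored = pickle.loads(saved)      # an equal but independent object
```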
sqlite3
This module provides access to the SQLite relational
database. This database provides a significant subset of SQL
language features, allowing us to build a relational database
that's compatible with products like MySQL or Postgres.
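A minimal sketch using an in-memory database follows. The table and column names are hypothetical, chosen only for this example.

```python
import sqlite3

connection = sqlite3.connect(":memory:")    # nothing is written to disk
connection.execute(
    "CREATE TABLE module (name TEXT PRIMARY KEY, category TEXT)")
connection.executemany(
    "INSERT INTO module (name, category) VALUES (?, ?)",
    [("re", "strings"), ("struct", "strings"), ("bisect", "data types")])
connection.commit()

# The ? placeholders let the driver handle quoting of values.
rows = connection.execute(
    "SELECT name FROM module WHERE category = ? ORDER BY name",
    ("strings",)).fetchall()
connection.close()
```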
14. Generic Operating System Services. The following modules contain basic features that are common to
all operating systems. Most of this commonality is achieved by using
the C standard libraries. By using these modules, you can be assured
that your Python application will be portable to almost any operating
system.
The time module provides basic
functions for time and date processing. Additionally
datetime handles details of the calendar
more gracefully than time does. We'll cover
both modules in detail in Chapter 32, Dates and Times: the time and
datetime Modules.
Having modules like datetime and
time means that you never need to attempt
your own calendrical calculations. One of the important lessons
learned in the late 90's was that many programmers love to tackle
calendrical calculations, but their efforts had to be tested and
reworked because of innumerable small problems.
getopt, optparse
A well-written program makes use of the command-line
interface. It is configured through options and arguments, as well
as properties files. We'll cover the
getopt, optparse and
glob modules in Chapter 35, Programs: Standing Alone.
18. Internet Protocols and Support. The following modules contain algorithms for responding to
several of the most common Internet protocols. These modules greatly
simplify developing applications based on these protocols.
cgi
The cgi module is used for web server
applications invoked as CGI scripts. This allows you to put Python
programming in the cgi-bin
directory. When the web server invokes the CGI script, the Python
interpreter is started and the Python script is executed.
urllib, urllib2, urlparse
These modules allow you to write relatively simple
application programs which open a URL as if it were a standard
Python file. The content can be read and perhaps parsed with the
HTML or XML parser modules, described above. The
urllib module depends on the
httplib, ftplib and
gopherlib modules. It will also open local
files when the scheme of the URL is file:. The
urlparse module includes the functions
necessary to parse or assemble URLs. The
urllib2 module handles more complex
situations where there is authentication or cookies
involved.
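Parsing and reassembling a URL can be sketched as follows. The URL is made up, and the fallback import is an assumption for readers on Python 3, where these functions are found in urllib.parse.

```python
try:
    from urlparse import urlparse, urlunparse        # the Python 2 module described above
except ImportError:
    from urllib.parse import urlparse, urlunparse    # assumption: Python 3 location

url = "http://www.example.com:8080/docs/index.html?lang=en#top"

parts = urlparse(url)           # scheme, netloc, path, params, query, fragment
rebuilt = urlunparse(parts)     # assemble the pieces back into a URL
```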
httplib, ftplib, gopherlib
The httplib,
ftplib and gopherlib
modules include relatively complete support for building client
applications that use these protocols. Between the
htmllib module and
httplib module, a simple character-oriented
web browser or web content crawler can be built.
poplib, imaplib
The poplib and
imaplib modules allow you to build mail
reader client applications. The poplib
module is for mail clients using the Post-Office Protocol, POP3
(RFC 1725), to extract mail from a mail server. The
imaplib module is for mail clients using
the Internet Message Access Protocol, IMAP4 (RFC 2060) to manage
mail on an IMAP server.
nntplib
The nntplib module allows you to
build a network news reader. The newsgroups, like
comp.lang.python, are processed by NNTP
servers. You can build special-purpose news readers with this
module.
SocketServer
The SocketServer module provides the
relatively advanced programming required to create TCP/IP or
UDP/IP server applications. This is typically the core of a
stand-alone application server.
SimpleHTTPServer, CGIHTTPServer, BaseHTTPServer
The SimpleHTTPServer and
CGIHTTPServer modules rely on the basic
BaseHTTPServer and
SocketServer modules to create a web
server. The SimpleHTTPServer module
provides the programming to handle basic URL requests. The
CGIHTTPServer module adds the capability
for running CGI scripts; it does this with the
fork and exec functions
of the os module, which are not necessarily
supported on all platforms.
asyncore, asynchat
The asyncore (and
asynchat) modules help to build a
time-sharing application server. When client requests can be
handled quickly by the server, complex multi-threading and
multi-processing aren't really necessary. Instead, this module
simply dispatches each client communication to an appropriate
handler function.
cmd
The cmd module contains a superclass
useful for building the main command-reading loop of an
interactive program. The standard features include printing a
prompt, reading commands, providing help and providing a command
history buffer. A subclass is expected to provide functions with
names of the form do_command. When the user
enters a line beginning with command, the
appropriate do_command function is
called.
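The do_command convention can be sketched with a tiny subclass. The Calculator class and its add and quit commands are hypothetical, invented for this example; onecmd is used so the sketch runs without interactive input.

```python
import cmd

class Calculator(cmd.Cmd):
    prompt = "calc> "

    def do_add(self, line):
        """add A B ... : print the sum of the numbers"""
        total = sum(float(word) for word in line.split())
        self.stdout.write("%g\n" % total)
        self.last = total

    def do_quit(self, line):
        """quit : leave the command loop"""
        return True     # a true result tells cmdloop() to stop

calculator = Calculator()
calculator.onecmd("add 1 2 3.5")     # dispatched to do_add
stop = calculator.onecmd("quit")     # dispatched to do_quit
# calculator.cmdloop() would instead prompt and read commands interactively.
```

The docstrings double as the text shown by the built-in help command.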
shlex
The shlex module can be used to
tokenize input in a simple language similar to the Linux shell
languages. This module defines a basic
shlex class with parsing methods that can
separate words, quoted strings and comments, and return them to
the requesting program.
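Tokenizing a shell-like line can be sketched with the module's split function. The command text is made up.

```python
import shlex

command = 'copy "my file.txt" /backup  # nightly job'

# Quotes group words into one token; with comments=True, text after # is dropped.
tokens = shlex.split(command, comments=True)
```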
26. Python Runtime Services. The Python Runtime Services modules support
the Python runtime environment. These can be divided into two groups:
those that are an interface into the Python interpreter, and those
that are generally useful for programming. The interpreter interface
allows us to peer under the hood at how Python works internally. The
programming category is more generally useful, and includes
sys, pickle, and
shelve.
sys
The sys module contains execution
context information. It has the command-line arguments (in
sys.argv) used to start the Python interpreter.
It has the standard input, output and error file definitions. It
has functions for retrieving exception information. It defines the
platform, byte order, module search path and other basic facts.
This is typically used by a main program to get run-time
environment information.
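A main program might gather this information roughly as follows; this is only a sketch, and the variable names are made up.

```python
import sys

script_name = sys.argv[0] if sys.argv else ""   # how the program was started
arguments = sys.argv[1:]                        # the remaining command-line arguments

# Diagnostics usually go to standard error rather than standard output.
sys.stderr.write("running on %s (%s-endian)\n" % (sys.platform, sys.byteorder))

search_path = sys.path                          # directories searched by import

try:
    1 / 0
except ZeroDivisionError:
    exc_type, exc_value, exc_traceback = sys.exc_info()
```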
Published under the terms of the Open Publication License