Chapter 31. Complex Strings: the re Module
There are a number of related problems when processing strings. When
we get strings as input from files, we need to recognize the input as
meaningful. Once we're sure it's in the right form, we need to parse the
input. Sometimes we'll also have to convert some parts into numbers (or
other objects) for further use.
For example, a file may contain lines which are supposed to be like
"Birth Date: 3/8/85". We may need to determine if a
given string has the right form. Then, we may need to break the string
into individual elements for date processing.
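The three steps for the birth date line can be sketched as follows. This is only an illustration under stated assumptions: the pattern, the parse_birth_date name, the month/day/year ordering, and the choice to place two-digit years in the 1900s are not taken from the text.

```python
import re
from datetime import date

# Hypothetical pattern for lines like "Birth Date: 3/8/85".
# Each parenthesized group captures one numeric field.
BIRTH_DATE = re.compile(r"Birth Date: (\d{1,2})/(\d{1,2})/(\d{2})")

def parse_birth_date(line):
    """Return a date if the line has the expected form, else None."""
    # Recognition: does the whole line have the right form?
    match = BIRTH_DATE.fullmatch(line.strip())
    if match is None:
        return None
    # Parsing and conversion: pull out the three fields as integers.
    # Assumption: the fields are month/day/year, in that order.
    month, day, year = (int(text) for text in match.groups())
    # Assumption: two-digit years belong to the 1900s.
    return date(1900 + year, month, day)

print(parse_birth_date("Birth Date: 3/8/85"))
print(parse_birth_date("Birth Date: unknown"))
```

A line that has the right form but an impossible date (month 13, for instance) would still raise ValueError from the date constructor, so a fuller version would probably handle that case as well.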
We can accomplish these recognition, parsing and conversion
operations with the re module in Python. A
regular expression (RE) is a rule
or pattern used for matching strings. It differs from the fairly simple
“wild-card” rules used by many operating systems for naming
files with a pattern. These simple operating system file-name matching
rules are embodied in two simpler modules: fnmatch and glob.
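To make the difference concrete, here is a small comparison of a wild-card rule and a regular expression applied to the same names. The file names and both patterns are made up for illustration.

```python
import fnmatch
import re

# Hypothetical file names, used only for this comparison.
names = ["report.txt", "report1.txt", "report22.txt", "notes.txt"]

# Wild-card rule: "?" stands for exactly one character of any kind.
wild = [n for n in names if fnmatch.fnmatchcase(n, "report?.txt")]

# Regular expression: "\d+" stands for one or more digits,
# and "\." stands for a literal period.
pattern = re.compile(r"report\d+\.txt")
regex = [n for n in names if pattern.fullmatch(n)]

print(wild)
print(regex)
```

The wild-card rule has no way to say "one or more digits", so it accepts only the name with a single character after "report". The regular expression accepts both numbered names.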
We'll look at the semantics of a regular expression in the section called “Semantics”. We'll look at the syntax for
defining an RE in the section called “Creating a Regular Expression”. In the section called “Using a Regular Expression” we'll put the regular expression to
use.
One way to look at regular expressions is as a production rule for
constructing strings. In principle, such a rule could describe an
infinite number of strings. The real purpose is not to enumerate all of
the strings described by the production rule, but to match a candidate
string against the production rule to see if the rule could have
constructed the given string.
For example, a rule could be "aba". All strings
of the form "aba" would match this simple rule. This
rule produces only a single string. Determining a match between a given
string and the one string produced by this rule is pretty simple.
A more complex rule could be "ab*a". The b* means zero or more copies of b.
This rule produces an infinite set of strings including
"aa", "aba", "abba", etc. It's a little more complex to see if a
given string could have been produced by this rule.
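As a quick check of the "ab*a" rule, the sketch below asks whether each candidate string could have been produced by it. The could_produce helper and the list of candidates are illustrative names, not part of the re module.

```python
import re

# The production rule from the text: a, zero or more b, then a.
rule = re.compile(r"ab*a")

def could_produce(candidate):
    """Return True if the rule could have constructed the whole candidate."""
    # fullmatch requires the entire string to fit the rule,
    # not just a prefix or a substring.
    return rule.fullmatch(candidate) is not None

for candidate in ["aa", "aba", "abba", "abbbbba", "ab", "abca"]:
    print(candidate, could_produce(candidate))
```

The first four candidates fit the rule. The string "ab" lacks the closing a, and "abca" contains a character the rule can never produce.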
The Python re module includes
constructs for creating regular expressions (REs), matching candidate
strings against REs, and examining the details of the substrings that
match. There is a lot of power and subtlety to this module. A complete
treatment is beyond the scope of this book.