There are a lot of options and clauses that can be used to create
regular expressions. We can't pretend to cover them all in a single
chapter. Instead, we'll cover the basics of creating and using RE's. The
full set of rules is given in section 4.2.1 Regular Expression Syntax of
the Python Library Reference document.
Additionally, there are many fine books devoted to this subject.
Any ordinary character, by itself, is an RE. Example: "a" is an RE that matches the
character a in the candidate string. While
trivial, it is critical to know that each ordinary character is a
stand-alone RE.
Some characters have special meanings. We can
escape that special meaning by using a
\ in front of them. For example, *
is a special character, but \* escapes the
special meaning and creates a single-character RE that matches the
character *.
Additionally, some ordinary characters can be made special
with \. For instance \d is any
digit, \s is any whitespace character.
\D is any non-digit, \S is any
non-whitespace character.
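These single-character RE's can be tried out with Python's re module. The following is only a minimal sketch; the sample strings are our own and are not part of any reference material.

```python
import re

# An ordinary character is an RE all by itself.
found_a = re.search(r"a", "cat")

# "\*" removes the special meaning of * and matches a literal asterisk.
found_star = re.search(r"\*", "2 * 3")

# \d matches digits; \S matches anything that is not whitespace.
digits = re.findall(r"\d", "a1 b22")
non_space = re.findall(r"\S", "a b")

print(found_a.group(), found_star.group(), digits, non_space)
```

The patterns are written as raw strings (r"...") so that Python leaves the \ alone and passes it through to the RE processor.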
The character . is an RE that matches
any single character. Example: "x.z" is an RE that matches
strings like xaz or xbz, but
doesn't match strings like xabz.
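A short sketch of the . RE at work, assuming we want the whole candidate string to match (hence fullmatch); the list of candidates is made up for illustration.

```python
import re

dot = re.compile(r"x.z")
candidates = ["xaz", "xbz", "xabz"]

# fullmatch() requires the whole string to match the RE.
matched = [s for s in candidates if dot.fullmatch(s)]
print(matched)
```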
The brackets, "[...]", create an RE that
matches any one of the characters between the [ ]'s. Example: "x[abc]z" matches any of
xaz, xbz or
xcz. A range of characters can be specified
using a -, for example "x[1-9]z". To include a literal
-, it must be first or last. A literal ^ cannot be first. Multiple ranges
are allowed, for example "x[A-Za-z]z". Here's a
common RE that matches a letter followed by a letter, digit or _:
"[A-Za-z][A-Za-z0-9_]"
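The bracket rules above can be checked with a few made-up strings. This is a sketch only; the digit range is written 0-9 here so that 0 counts as a digit.

```python
import re

one_of_abc = re.compile(r"x[abc]z")
abc_hits = [s for s in ("xaz", "xbz", "xcz", "xdz") if one_of_abc.fullmatch(s)]

# A - placed last inside the brackets is a literal hyphen, not a range.
a_or_dash = re.compile(r"x[a-]z")
dash_hits = [s for s in ("xaz", "x-z", "xbz") if a_or_dash.fullmatch(s)]

# A letter followed by a letter, digit or _.
letter_then_word = re.compile(r"[A-Za-z][A-Za-z0-9_]")
pair_hits = [s for s in ("a0", "b_", "9a") if letter_then_word.fullmatch(s)]

print(abc_hits, dash_hits, pair_hits)
```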
The modified brackets, "[^...]", create
a regular expression that matches any character
except those between the [ ]'s. Example: "a[^xyz]b" matches strings like
a9b and a$b, but doesn't match
axb. As with [ ], a range can be specified and
multiple ranges can be specified.
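A small sketch of the "[^...]" form, using the same three strings as the example above.

```python
import re

not_xyz = re.compile(r"a[^xyz]b")

# Map each candidate to True or False, depending on whether it matches.
results = {s: bool(not_xyz.fullmatch(s)) for s in ("a9b", "a$b", "axb")}
print(results)
```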
A regular expression can be formed from concatenating
regular expressions. Example: "a.b" is three regular
expressions, the first matches a, the second
matches any character, the third matches
b.
A regular expression can be a group of regular expressions,
formed with ()'s. Example: "(ab)c" is a regular expression
composed of two regular expressions: "(ab)"
(which, in turn, is composed of two RE's) and
"c". ()'s also group RE's for extraction
purposes. The elements matched within ()'s are remembered by the
regular expression processor and set aside in a
match object.
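To see what the RE processor sets aside, we can look at the match object that re.match returns. This is a minimal sketch; the second pattern, "(a)(.)b", is our own illustration of two groups in one RE.

```python
import re

match = re.match(r"(ab)c", "abc")
whole = match.group(0)   # everything the RE matched
inner = match.group(1)   # only what the first ()'s matched

# Each pair of ()'s is remembered separately, numbered from the left.
pieces = re.match(r"(a)(.)b", "axb").groups()
print(whole, inner, pieces)
```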
A regular expression can be repeated. Several repeat constructs are available:
"x*" repeats "x" zero or
more times; "x+" repeats "x"
one or more times; "x?" matches
"x" zero times or once. Example:
"1(abc)*2" matches 12 or
1abc2 or 1abcabc2, etc. The
first match, against 12, is often surprising;
but there are zero copies of abc between
1 and 2.
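The repeat constructs can be compared side by side. This sketch uses a few strings of our own, including the zero-copies case 12.

```python
import re

repeated = re.compile(r"1(abc)*2")
hits = [s for s in ("12", "1abc2", "1abcabc2", "1ab2") if repeated.fullmatch(s)]

# + needs at least one copy; ? and * are satisfied by none at all.
plus_on_empty = re.fullmatch(r"x+", "")
question_on_empty = re.fullmatch(r"x?", "")
star_on_three = re.fullmatch(r"x*", "xxx")

print(hits, plus_on_empty, question_on_empty, star_on_three)
```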
The character "^" is an RE that only
matches the beginning of the line, "$" is an RE
that only matches the end of the line. Example: "^$" matches a completely empty
line.
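Python's re module anchors ^ and $ to the whole candidate string unless the re.MULTILINE flag is given, so this sketch applies each RE to one line at a time. The three-line text is made up for illustration.

```python
import re

text = "first\n\nlast"
lines = text.splitlines()

# "^$" matches only a line with nothing in it.
blank_line_numbers = [
    n for n, line in enumerate(lines, start=1) if re.search(r"^$", line)
]

starts_with_l = [line for line in lines if re.search(r"^l", line)]
ends_with_t = [line for line in lines if re.search(r"t$", line)]

print(blank_line_numbers, starts_with_l, ends_with_t)
```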
Here are some examples.
"[_A-Za-z][_A-Za-z0-9]*"
Matches a Python identifier. This embodies the rule of
starting with a letter or _, and containing any
number of letters, digits or _'s. Note that any
number includes 0 occurrences, so a single letter or _
is a valid identifier.
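A quick sketch of the identifier RE against some made-up names. The digit range is written 0-9 so that names such as x0 are accepted.

```python
import re

identifier = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")
names = ["x", "_", "total_2", "x0", "2fast", "no-dash"]

# fullmatch() rejects names with a leading digit or a stray character.
valid = [n for n in names if identifier.fullmatch(n)]
print(valid)
```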
"^\s*import\s"
Matches a simple import statement. It
matches the beginning of the line with ^, zero
or more whitespace characters with \s*, the
sequence of letters import; and one more
whitespace character. This pattern will ignore the rest of the
line.
"^\s*from\s+[_A-Za-z][_A-Za-z0-9]*\s+import\s"
Matches a from module
import statement. As with the simple import, it matches the
beginning of the line (^), zero or more
whitespace characters (\s*), the sequence of
letters from, one or more whitespace characters (\s+), a Python module name, one or more
whitespace characters (\s+), the sequence
import, and one more whitespace
character.
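Both import patterns can be tried against a handful of source lines. This is a sketch under our own assumptions: the sample lines are invented, and the module-name part uses 0-9 for digits.

```python
import re

simple_import = re.compile(r"^\s*import\s")
from_import = re.compile(r"^\s*from\s+[_A-Za-z][_A-Za-z0-9]*\s+import\s")

source = [
    "import os",
    "    import sys  # indented",
    "important = 1",
    "from math import sqrt",
    "from os.path import join",
]

# match() starts at the beginning of each line; the rest of the line is ignored.
simple_hits = [line for line in source if simple_import.match(line)]
from_hits = [line for line in source if from_import.match(line)]

print(simple_hits)
print(from_hits)
```

Note that the last sample line is not matched by the from pattern, because the module-name part of the RE has no provision for the . in os.path.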
"(\d+):(\d+):(\d+\.?\d*)"
Matches one or more digits, a :, one or
more digits, a :, and digits followed by
an optional . and zero or more other digits. For
example 20:07:13.2 would match, as would
13:04:05. Further, the ()'s would allow
separating the digit strings for conversion and further
processing.
"def\s+([_A-Za-z][_A-Za-z0-9]*)\s*\([^)]*\):"
Matches Python function definition lines. It matches the
letters def; a string of 1 or more whitespace
characters (\s+); an identifier, surrounded by
()'s to capture the entire identifier as a match. It allows optional
whitespace (\s*) and then matches a
(; we've used \( to escape
the meaning of ( and make it an ordinary character. It matches a
string of non-) characters, which would be the
parameter list. The parameter list ends with a
); we've used \) to
escape the meaning of ) and make it an ordinary
character. Finally, we need to see the
:.
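The last two patterns both use ()'s to pull pieces out of the match, which the following sketch illustrates. Two details here are our own choices: digits are written 0-9, and the def pattern uses \s* between the name and the (, because the usual def f(x): has no space there. The sample lines are invented.

```python
import re

time_re = re.compile(r"(\d+):(\d+):(\d+\.?\d*)")
def_re = re.compile(r"def\s+([_A-Za-z][_A-Za-z0-9]*)\s*\([^)]*\):")

# The three groups come back as strings, ready for conversion.
hours, minutes, seconds = time_re.match("20:07:13.2").groups()
time_parts = (int(hours), int(minutes), float(seconds))

# The . and the digits after it are optional.
whole_seconds = time_re.match("13:04:05").group(3)

# group(1) is the identifier captured by the ()'s.
function_name = def_re.match("def area(width, height):").group(1)
spaced_name = def_re.match("def  main ():").group(1)
not_a_def = def_re.match("class Shape:")

print(time_parts, whole_seconds, function_name, spaced_name, not_a_def)
```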
Published under the terms of the Open Publication License