Using a Regular Expression
Several methods are commonly used with regular expressions. The most common first step is to compile the RE definition string to make a Pattern object. The resulting Pattern object can then be used to match or search candidate strings. A successful match returns a Match object with details of the matching substring.
The re module provides the compile function.
re.compile(expr) → Pattern
Create a Pattern object from an RE string. The Pattern is used for all subsequent searching or matching operations. A Pattern has several methods, including match and search.
Generally, raw string notation (r"pattern") is used to write an RE. This reduces the number of \ characters required. Without the raw notation, each \ in the string would have to be escaped by another \, making it \\. This rapidly gets cumbersome. There are some other options available for re.compile; see the Python Library Reference, section 4.2, for more information.
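As a small illustration of the raw string point, the sketch below writes the same RE twice, once in raw notation and once with doubled backslashes. The names raw_form, escaped_form, same_text and time_pat are made up for this example.

```python
import re

# The same RE written twice: as a raw string, and with each \ doubled.
raw_form = r"(\d+):(\d+)"
escaped_form = "(\\d+):(\\d+)"

# Both literals contain exactly the same characters.
same_text = (raw_form == escaped_form)   # True

# Either form can be compiled into a Pattern object.
time_pat = re.compile(raw_form)
```

The raw form is usually easier to read, since what is typed is what the RE engine sees.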
The following methods are part of a compiled Pattern. We'll use the name pat to refer to some Pattern object created by the re.compile function.
pat.match(string) → Match
Match the candidate string against the compiled regular expression, pat. Matching means that the regular expression must match starting at the beginning of the candidate string; it does not have to consume the entire string. A Match object is returned if there is a match, otherwise None is returned.
pat.search(string) → Match
Search a candidate string for the compiled regular expression, pat. Searching means that the regular expression must be found somewhere in the candidate string. A Match object is returned if the pattern is found, otherwise None is returned.
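The difference between the two methods can be seen with a short sketch. The pattern and the sample strings here are invented for illustration, as is the name num_pat.

```python
import re

num_pat = re.compile(r"\d+")

# match succeeds only when the pattern is found at the start of the string.
at_start = num_pat.match("42 is the answer")       # a Match object
not_at_start = num_pat.match("the answer is 42")   # None

# search looks through the whole string for the pattern.
found = num_pat.search("the answer is 42")         # a Match object
```

Because a failed operation returns None, the result can be tested directly in an if statement.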
If search or match finds the pattern in the candidate string, a Match object is created to describe the part of the candidate string which matched. The following methods are part of a Match object. We'll use the name match to refer to some Match object created by a successful search or match operation.
match.group(number) → string
Retrieve the string that matched a particular () grouping in the regular expression. Group zero is the entire string that matched. Group 1 is the material that matched the first set of ()'s. When several group numbers are given, a tuple of the corresponding strings is returned.
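A brief sketch of group numbering, using a made-up pattern with two () groupings; pair_pat and pair_match are names chosen only for this example.

```python
import re

pair_pat = re.compile(r"(\d+)-(\d+)")
pair_match = pair_pat.match("12-34 items")

whole = pair_match.group(0)     # '12-34', everything the pattern matched
first = pair_match.group(1)     # '12', matched by the first ()
second = pair_match.group(2)    # '34', matched by the second ()
both = pair_match.group(1, 2)   # ('12', '34'), a tuple of strings
```

Note that group zero does not include " items", since that text lies beyond what the pattern matched.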
Here's a more complete example.
>>> import re
>>> rawin = "20:07:13.2"
>>> hms_pat = re.compile(r'(\d+):(\d+):(\d+\.?\d*)')
>>> hms_match = hms_pat.match(rawin)
>>> print hms_match.group(0, 1, 2, 3)
('20:07:13.2', '20', '07', '13.2')
>>> h, m, s = map(float, hms_match.group(1, 2, 3))
>>> seconds = ((h*60)+m)*60+s
>>> print h, m, s, "=", seconds
20.0 7.0 13.2 = 72433.2
This sequence decodes a complex input value into fields and then computes a single result. The import statement incorporates the re module. The rawin variable is sample input, perhaps from a file, perhaps from raw_input. The hms_pat variable is the compiled regular expression object which matches three numbers, using "(\d+)", separated by :'s. The third number may also include a decimal fraction.
The digit-sequence RE's are surrounded by ()'s so that the material that matched is returned as a group. This leads to four groups: group 0 is everything that matched, and groups 1, 2, and 3 are the successive digit strings. The hms_match variable is a Match object that indicates success or failure in matching. If hms_match is None, no match occurred. Otherwise, the hms_match.group method will reveal the individually matched input items.
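One way to guard against a failed match is to test for None before calling group. The following is only a sketch; the function name split_time is hypothetical and not part of the re module.

```python
import re

hms_pat = re.compile(r'(\d+):(\d+):(\d+\.?\d*)')

def split_time(text):
    """Return the (hours, minutes, seconds) strings, or None if text does not match."""
    hms_match = hms_pat.match(text)
    if hms_match is None:
        # No match occurred, so there are no groups to retrieve.
        return None
    return hms_match.group(1, 2, 3)

good = split_time("20:07:13.2")   # ('20', '07', '13.2')
bad = split_time("not a time")    # None
```

Calling group on None would raise an AttributeError, which is why the test comes first.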
The statement that sets h, m, and s does three things. First, it uses hms_match.group to create a tuple of requested items. Each item in the tuple will be a string, so the map function is then used to apply the built-in float function to each string, creating a sequence of three numbers. Finally, the statement relies on the multiple-assignment feature to set all three variables at once. After that, seconds is computed as the number of seconds past midnight for the given time stamp.
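The three steps can be written out separately to make each intermediate value visible. This is a sketch only; the names fields and numbers are introduced here for illustration, and the list call is added so that the result of map is a concrete list under Python 3 as well as Python 2.

```python
import re

hms_pat = re.compile(r'(\d+):(\d+):(\d+\.?\d*)')
hms_match = hms_pat.match("20:07:13.2")

# Step 1: group returns a tuple of strings.
fields = hms_match.group(1, 2, 3)     # ('20', '07', '13.2')

# Step 2: map applies float to each string.
numbers = list(map(float, fields))    # [20.0, 7.0, 13.2]

# Step 3: multiple assignment sets all three variables at once.
h, m, s = numbers

# Seconds past midnight for the time stamp.
seconds = ((h * 60) + m) * 60 + s
```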