|
Unix Programming - Applying Minilanguages - Case Study: Regular Expressions
A kind of specification that turns up repeatedly in Unix tools
and scripting languages is the regular
expression (‘regexp’ for short). We consider
it here as a declarative minilanguage for describing text patterns; it
is often embedded in other minilanguages. Regexps are so ubiquitous
that the are hardly thought of as a minilanguage, but they replace
what would otherwise be huge volumes of code implementing different
(and incompatible) search capabilities.
This introduction skates over some details like POSIX
extensions,
back-references, and internationalization features; for a more
complete treatment, see Mastering Regular
Expressions [Friedl].
Regular expressions describe patterns that may either match or
fail to match against strings. The simplest regular-expression tool
is grep(1), a filter that passes through to its output every
line in its input matching a specified regexp.
Regexp notation is summarized in Table8.1.
Table8.1.Regular-expression examples.
Regexp |
Matches |
"x.y"
|
x followed by any character followed by y. |
"x\.y"
|
x followed by a literal period followed by y. |
"xz?y"
|
x followed by at most one z followed by y; thus, "xy" or
"xzy" but not "xz" or "xdy". |
"xz*y"
|
x followed by any number of instances of z, followed by y;
thus, "xy" or "xzy" or "xzzzy" but not "xz" or "xdy". |
"xz+y"
|
x followed by one or more instances of z, followed by y;
thus, "xzy" or "xzzy" but not "xy" or "xz" or "xdy". |
"s[xyz]t"
|
s followed by any of the characters x or y or z, followed by t;
thus, "sxt" or "syt" or "szt" but not "st" or "sat". |
"a[x0-9]b"
|
a followed by either x or characters in the range 0–9, followed by b;
thus, "axb" or "a0b" or "a4b" but not "ab" or "aab". |
"s[^xyz]t"
|
s followed by any character that is not x or y or z, followed by t;
thus, "sdt" or "set" but not "sxt" or "syt" or "szt". |
"s[^x0-9]t"
|
s followed by any character that is not x or in the range
0–9, followed by t; thus, "slt" or "smt" but not "sxt" or "s0t" or
"s4t". |
"^x"
|
x at the beginning of a string;
thus, "xzy" or "xzzy" but not "yzy" or "yxy". |
"x$"
|
x at the end of a string;
thus, "yzx" or "yx" but not "yxz" or "zxy". |
There are a number of minor variants of regexp notation:
Now that we've looked at some motivating examples, Table8.2 is a summary of the standard regular-expression
wildcards. Note: we're not including the glob variant in this table,
so a value of “All” implies only all three of the basic,
extended/Emacs, and Perl/Python variants.[81]
Table8.2.Introduction to regular-expression operations.
Wildcard |
Supported in |
Matches |
\
|
All |
Escape next character. Toggles whether following punctuation
is treated as a wildcard or not. Following letters or digits are interpreted
in various different ways depending on the program. |
.
|
All |
Any character. |
^
|
All |
Beginning of line |
$
|
All |
End of line |
[...]
|
All |
Any of the characters between the brackets |
[^...]
|
All |
Any characters
except those
between the brackets. |
*
|
All |
Accept any number of instances of the previous element. |
?
|
egrep/Emacs, Perl/Python |
Accept zero or one instances of the previous element. |
+
|
egrep/Emacs, Perl/Python |
Accept one or more instances of the previous element. |
{n}
|
egrep, Perl/Python; as\{n\} in Emacs |
Accept exactly n repetitions of the previous element.
Not supported by some older regexp engines. |
{n,}
|
egrep, Perl/Python; as\{n,\} in Emacs |
Accept n or more repetitions of the previous element.
Not supported by some older regexp engines. |
{m,n}
|
egrep, Perl/Python; as\{m,n\} in Emacs |
Accept at least m and at most n repetitions of the previous
element. Not supported by some older regexp engines. |
|
|
egrep, Perl/Python; as\| in Emacs |
Accept the element to the left or the element to the right.
This is usually used with some form of pattern-grouping delimiters. |
(...)
|
Perl/Python; as\(...\) in older versions. |
Treat this pattern as a group (in newer regexp engines like
Perl and Python's). Older regexp engines such as those in Emacs and grep
require \(...\). |
Design practice in new languages with regexp support has
stabilized on the Perl/Python variant. It is more transparent than the
others, notably because backlash before a non-alphanumeric character
always means that character as a literal, so there is much
less confusion about how to quote elements of regexps.
Regular expressions are an extreme example of how concise a
minilanguage can be. Simple regular expressions express recognition
behavior that would otherwise have to be implenented with hundreds of
lines of fussy, bug-prone code.
[an error occurred while processing this directive]
|
|