Unix Programming - Applying Minilanguages - Case Study: Regular Expressions

The Art of Unix Programming

Case Study: Regular Expressions

Regular expressions describe patterns that may either match or fail to match against strings. The simplest regular-expression tool is grep(1), a filter that passes through to its output every line in its input matching a specified regexp. Regexp notation is summarized in Table 8.1.
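The filtering behavior described above can be sketched in a few lines of Python. This is only an illustration, not how grep(1) is implemented: the function name grep_lines and the sample input are invented for the example, and Python's re module speaks the Perl flavor of regexp rather than grep's own notation.

```python
import re
from typing import Iterable, Iterator


def grep_lines(pattern: str, lines: Iterable[str]) -> Iterator[str]:
    """Yield every line in which the regexp matches somewhere (hypothetical helper)."""
    regexp = re.compile(pattern)
    for line in lines:
        # search() accepts a match at any position in the line, as grep does.
        if regexp.search(line):
            yield line


# Invented sample input; "x.y" means x, then any character, then y.
sample = ["axb", "x.y", "xzy", "no match here"]
matched = list(grep_lines(r"x.y", sample))
print(matched)  # ['x.y', 'xzy']
```

To filter real input the same way, pass sys.stdin (or an open file) as the lines argument.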

Regexp         Matches
"x.y"          x followed by any character followed by y.
"x\.y"         x followed by a literal period followed by y.
"xz?y"         x followed by at most one z followed by y; thus, "xy" or "xzy" but not "xz" or "xdy".
"xz*y"         x followed by any number of instances of z, followed by y; thus, "xy" or "xzy" or "xzzzy" but not "xz" or "xdy".
"xz+y"         x followed by one or more instances of z, followed by y; thus, "xzy" or "xzzy" but not "xy" or "xz" or "xdy".
"s[xyz]t"      s followed by any of the characters x or y or z, followed by t; thus, "sxt" or "syt" or "szt" but not "st" or "sat".
"a[x0-9]b"     a followed by either x or characters in the range 0–9, followed by b; thus, "axb" or "a0b" or "a4b" but not "ab" or "aab".
"s[^xyz]t"     s followed by any character that is not x or y or z, followed by t; thus, "sdt" or "set" but not "sxt" or "syt" or "szt".
"s[^x0-9]t"    s followed by any character that is not x or in the range 0–9, followed by t; thus, "slt" or "smt" but not "sxt" or "s0t" or "s4t".
"^x"           x at the beginning of a string; thus, "xzy" or "xzzy" but not "yzy" or "yxy".
"x$"           x at the end of a string; thus, "yzx" or "yx" but not "yxz" or "zxy".
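One way to get a feel for these examples is to run them. The sketch below checks the "thus ... but not ..." strings from the table against Python's re module, which is assumed here as a convenient stand-in for the engines under discussion; the EXAMPLES list and the agrees_with_table helper are names made up for the illustration.

```python
import re

# (pattern, strings the table says match, strings it says do not match)
EXAMPLES = [
    (r"xz?y", ["xy", "xzy"], ["xz", "xdy"]),
    (r"xz*y", ["xy", "xzy", "xzzzy"], ["xz", "xdy"]),
    (r"xz+y", ["xzy", "xzzy"], ["xy", "xz", "xdy"]),
    (r"s[xyz]t", ["sxt", "syt", "szt"], ["st", "sat"]),
    (r"a[x0-9]b", ["axb", "a0b", "a4b"], ["ab", "aab"]),
    (r"s[^xyz]t", ["sdt", "set"], ["sxt", "syt", "szt"]),
    (r"s[^x0-9]t", ["slt", "smt"], ["sxt", "s0t", "s4t"]),
    (r"^x", ["xzy", "xzzy"], ["yzy", "yxy"]),
    (r"x$", ["yzx", "yx"], ["yxz", "zxy"]),
]


def agrees_with_table(pattern, should_match, should_fail):
    """True if every listed string behaves as the table claims (hypothetical helper)."""
    regexp = re.compile(pattern)
    return (all(regexp.search(s) for s in should_match)
            and not any(regexp.search(s) for s in should_fail))


results = {p: agrees_with_table(p, yes, no) for p, yes, no in EXAMPLES}
for pattern, ok in results.items():
    print(f"{pattern!r}: {'agrees with the table' if ok else 'differs'}")
```

Every row should report agreement, since all of the notation in this table is common to the Perl/Python flavor.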

There are several minor variants of regexp notation:

Glob expressions. This is the limited set of wildcard conventions used by early Unix shells for filename matching. There are only three wildcards: *, which matches any sequence of characters (like .* in the other variants); ?, which matches any single character (like . in the other variants); and [...], which matches a character class just as in the other variants. Some shells (csh, bash, zsh) later added {} for alternation. Thus, x{a,b}c matches xac or xbc but not xc. Some shells further extend globs in the direction of extended regular expressions.

Basic regular expressions. This is the notation accepted by the original grep(1) utility for extracting lines matching a given regexp from a file. The line editor ed(1) and the stream editor sed(1) also use it. Old Unix hands think of this as the basic or ‘vanilla’ flavor of regexp; people first exposed to the more modern tools tend to assume the extended form described next.

Extended regular expressions. This is the notation accepted by the extended grep utility egrep(1) for extracting lines matching a given regexp from a file. Regular expressions in Lex and the Emacs editor are very close to the egrep flavor.

Perl regular expressions. This is the notation accepted by Perl and Python regexp functions. These are quite a bit more powerful than the egrep flavor.
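The gap between the glob and Perl/Python ends of this spectrum can be seen from Python itself, which ships a glob matcher (fnmatch) alongside its regexp module (re). The following is a rough sketch with invented file names. Note that fnmatch implements only *, ?, and [...], not the {} alternation some shells add, so the brace example is shown through its regexp equivalent instead. The non-greedy quantifier +? at the end is one example of a Perl-flavor extension that egrep does not provide.

```python
import fnmatch
import re

names = ["xac", "xbc", "xc", "notes.txt", "a.c"]  # invented sample names

# Glob ? is "any single character"; the regexp spelling is a dot.
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, "x?c")]
regexp_hits = [n for n in names if re.fullmatch(r"x.c", n)]
print(glob_hits, regexp_hits)  # ['xac', 'xbc'] ['xac', 'xbc']

# Glob * is "any sequence of characters"; the regexp spelling is .*
# (and the period must be escaped to be taken literally).
c_files = [n for n in names if re.fullmatch(r".*\.c", n)]
print(c_files)  # ['a.c']

# The shell glob x{a,b}c corresponds to grouping plus alternation.
brace_like = [n for n in names if re.fullmatch(r"x(a|b)c", n)]
print(brace_like)  # ['xac', 'xbc']

# Perl-flavor extension: +? matches as little as possible.
greedy = re.search(r"<.+>", "<b>bold</b>").group()
lazy = re.search(r"<.+?>", "<b>bold</b>").group()
print(greedy, lazy)  # <b>bold</b> <b>
```

A glob always has to match the whole name, which is why the regexp side uses fullmatch() rather than search() here.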

Now that we've looked at some motivating examples, Table 8.2 summarizes the standard regular-expression wildcards. Note that the glob variant is not included in this table, so a value of “All” means only the three remaining variants: basic, extended/Emacs, and Perl/Python.^[81]

Wildcard  Supported in                                 Matches
\         All                                          Escape next character. Toggles whether following punctuation is treated as a wildcard or not. Following letters or digits are interpreted in various different ways depending on the program.
.         All                                          Any character.
^         All                                          Beginning of line.
$         All                                          End of line.
[...]     All                                          Any of the characters between the brackets.
[^...]    All                                          Any character except those between the brackets.
*         All                                          Accept any number of instances of the previous element.
?         egrep/Emacs, Perl/Python                     Accept zero or one instances of the previous element.
+         egrep/Emacs, Perl/Python                     Accept one or more instances of the previous element.
{n}       egrep, Perl/Python; as \{n\} in Emacs        Accept exactly n repetitions of the previous element. Not supported by some older regexp engines.
{n,}      egrep, Perl/Python; as \{n,\} in Emacs       Accept n or more repetitions of the previous element. Not supported by some older regexp engines.
{m,n}     egrep, Perl/Python; as \{m,n\} in Emacs      Accept at least m and at most n repetitions of the previous element. Not supported by some older regexp engines.
|         egrep, Perl/Python; as \| in Emacs           Accept the element to the left or the element to the right. This is usually used with some form of pattern-grouping delimiters.
(...)     Perl/Python; as \(...\) in older versions    Treat this pattern as a group (in newer regexp engines like Perl's and Python's). Older regexp engines such as those in Emacs and grep require \(...\).
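The counted-repetition, alternation, and grouping rows are easier to absorb with concrete strings. Here is a minimal sketch in the Perl/Python flavor, using Python's re module; the matches helper and the test strings are made up for the illustration, and an Emacs user would need the backslashed spellings from the middle column instead.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


# {n}: exactly n repetitions of the previous element.
exactly_three = [s for s in ["xzzy", "xzzzy", "xzzzzy"] if matches(r"xz{3}y", s)]
print(exactly_three)  # ['xzzzy']

# {n,}: n or more repetitions.
at_least_two = [s for s in ["xzy", "xzzy", "xzzzzy"] if matches(r"xz{2,}y", s)]
print(at_least_two)  # ['xzzy', 'xzzzzy']

# {m,n}: at least m and at most n repetitions.
one_or_two = [s for s in ["xy", "xzy", "xzzy", "xzzzy"] if matches(r"xz{1,2}y", s)]
print(one_or_two)  # ['xzy', 'xzzy']

# | with grouping: the alternation applies only inside the parentheses.
grouped = [s for s in ["xac", "xbc", "xc", "xa", "bc"] if matches(r"x(a|b)c", s)]
print(grouped)  # ['xac', 'xbc']

# | without grouping: the alternatives are the whole of "xa" and the whole of "bc".
ungrouped = [s for s in ["xac", "xbc", "xc", "xa", "bc"] if matches(r"xa|bc", s)]
print(ungrouped)  # ['xa', 'bc']
```

The last two cases show why alternation is usually paired with grouping delimiters: without them, | splits the entire pattern rather than just the characters next to it.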
