12. Matching
Matching involves use of patterns called "regular
expressions". This, as you will see, leads to Perl Paradox Number Four:
Regular expressions aren't. See sections 13 and 14 of the Quick Reference.
The =~ operator performs pattern matching and substitution.
For example, if:
$s = 'One if by land and two if by sea';
then:
if ($s =~ /if by la/) {print "YES"}
else {print "NO"}
prints "YES", because the string $s matches the
simple constant pattern "if by la".
if ($s =~ /one/) {print "YES"}
else {print "NO"}
prints "NO", because matching is case-sensitive by default: the string contains
"One" but not "one". However, by adding the "i" option to ignore case, we would get a "YES"
from the following:
if ($s =~ /one/i) {print "YES"}
else {print "NO"}
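The three tests above can be gathered into one short script. This is only a sketch; the variable names $literal, $wrong_case, and $ignore_case are illustrative and not part of the original examples.

```perl
use strict;
use warnings;

my $s = 'One if by land and two if by sea';

# Each result is 1 for a match and 0 otherwise.
my $literal     = ($s =~ /if by la/) ? 1 : 0;   # simple constant pattern
my $wrong_case  = ($s =~ /one/)      ? 1 : 0;   # $s has "One", not "one"
my $ignore_case = ($s =~ /one/i)     ? 1 : 0;   # the i option ignores case

# prints "literal: 1, wrong case: 0, ignore case: 1"
print "literal: $literal, wrong case: $wrong_case, ignore case: $ignore_case\n";
```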
Patterns can contain a mind-boggling variety of special directions that
facilitate very general matching. See Perl Reference Guide section
13, Regular Expressions. For example, a period matches any character (except the
"newline" \n character).
if ($x =~ /l.mp/) {print "YES"}
would print "YES" for $x = "lamp",
"lump", "slumped", but not for $x = "lmp" or "less amperes".
Parentheses () group pattern elements. An asterisk * means that the preceding
character, element, or group of elements may occur zero times, one time, or many
times. Similarly, a plus + means that the preceding element or group of elements
must occur at least once. A question mark ? matches zero or one times. So:
/fr.*nd/ matches "frnd", "friend", "front and back"
/fr.+nd/ matches "frond", "friend", "front and back"
but not "frnd".
/10*1/ matches "11", "101", "1001", "100000001".
/b(an)*a/ matches "ba", "bana", "banana", "banananana"
/flo?at/ matches "flat" and "float"
but not "flooat"
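The repetition characters can be tried out side by side. The sketch below assumes a small helper, matching(), which is hypothetical and simply returns the strings that match a pattern compiled with qr//.

```perl
use strict;
use warnings;

# Hypothetical helper: return the strings that match the compiled pattern.
sub matching {
    my ($pattern, @strings) = @_;
    return grep { $_ =~ $pattern } @strings;
}

my @star     = matching(qr/fr.*nd/,  'frnd', 'friend', 'front and back');
my @plus     = matching(qr/fr.+nd/,  'frnd', 'friend', 'front and back');
my @group    = matching(qr/b(an)*a/, 'ba', 'banana', 'bnn');
my @question = matching(qr/flo?at/,  'flat', 'float', 'flooat');

print join(', ', @star), "\n";       # frnd, friend, front and back
print join(', ', @plus), "\n";       # friend, front and back
print join(', ', @group), "\n";      # ba, banana
print join(', ', @question), "\n";   # flat, float
```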
Square brackets [ ] match a class of single characters.
[0123456789] matches any single digit
[0-9] matches any single digit
[0-9]+ matches any sequence of one or more digits
[a-z]+ matches any lowercase word
[A-Z]+ matches any uppercase word
[ab n]* matches the null string "", "b",
any number of blanks, "nab a banana"
[^...] matches characters that are not "...":
[^0-9] matches any non-digit character.
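A short sketch of character classes at work, with made-up sample strings. Note that /[^0-9]/ asks whether the string contains at least one non-digit character, so "x9" passes both the digit test and the non-digit test.

```perl
use strict;
use warnings;

my @samples = ('7', '42', 'abc', 'ABC', 'x9');

my @has_digit     = grep { /[0-9]/  } @samples;   # contains a digit
my @has_non_digit = grep { /[^0-9]/ } @samples;   # contains a non-digit
my @has_lower     = grep { /[a-z]+/ } @samples;   # contains lowercase letters
my @has_upper     = grep { /[A-Z]+/ } @samples;   # contains uppercase letters

print "@has_digit\n";       # 7 42 x9
print "@has_non_digit\n";   # abc ABC x9
print "@has_lower\n";       # abc x9
print "@has_upper\n";       # ABC
```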
Curly braces allow more precise specification of repeated fields. For example
[0-9]{6}
matches any sequence of 6 digits, and
[0-9]{6,10}
matches any sequence of 6 to 10 digits.
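As a rough illustration with made-up data: the first test shows that a string with more than six digits in a row still matches [0-9]{6}, and the second uses capturing parentheses (a feature not described in this section) to show how many digits {6,10} actually takes.

```perl
use strict;
use warnings;

my @codes = ('12345', '123456', 'part 1234567 rev B');

# Any string containing six digits in a row matches.
my @six_or_more = grep { /[0-9]{6}/ } @codes;

# In list context the match returns the text captured by ( ).
# {6,10} takes as many digits as it can, up to ten.
my ($run) = '123456789012' =~ /([0-9]{6,10})/;

print join(', ', @six_or_more), "\n";   # 123456, part 1234567 rev B
print "$run\n";                         # 1234567890
```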
Patterns float, unless anchored. The caret ^ (outside [ ]) anchors a pattern
to the beginning, and dollar-sign $ anchors a pattern at the end, so:
/at/ matches "at", "attention", "flat", & "flatter"
/^at/ matches "at" & "attention" but not "flat"
/at$/ matches "at" & "flat", but not "attention"
/^at$/ matches "at" and nothing else.
/^at$/i matches "at", "At", "aT", and "AT".
/^[ \t]*$/ matches a "blank line", one that contains nothing
or any combination of blanks and tabs.
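The anchoring rules above can be checked with a few lines of Perl. This is a sketch only; the array names are illustrative.

```perl
use strict;
use warnings;

my @words = ('at', 'attention', 'flat', 'flatter');

my @anywhere = grep { /at/   } @words;   # floating pattern
my @at_start = grep { /^at/  } @words;   # anchored to the beginning
my @at_end   = grep { /at$/  } @words;   # anchored to the end
my @whole    = grep { /^at$/ } @words;   # anchored at both ends

# "Blank line" test: empty, or only blanks and tabs.
my @blank = grep { /^[ \t]*$/ } ('', "  \t ", ' x ');

print "@anywhere\n";           # at attention flat flatter
print "@at_start\n";           # at attention
print "@at_end\n";             # at flat
print "@whole\n";              # at
print scalar(@blank), "\n";    # 2
```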
The Backslash. Other characters simply match themselves, but the
characters +?.*^$()[]{}|\ and usually / must be escaped with a backslash \
to be taken literally. Thus:
/10.2/ matches "10Q2", "1052", and "10.2"
/10\.2/ matches "10.2" but not "10Q2" or "1052"
/\*+/ matches one or more asterisks
/A:\\DIR/ matches "A:\DIR"
/\/usr\/bin/ matches "/usr/bin"
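The sketch below runs a few of these escapes. It also shows one reason the slash only "usually" needs escaping: Perl's match operator accepts other delimiters when written as m{...}, a form not otherwise covered in this section. Variable names are illustrative.

```perl
use strict;
use warnings;

my @values = ('10Q2', '1052', '10.2');

my @loose  = grep { /10.2/  } @values;   # period matches any character
my @strict = grep { /10\.2/ } @values;   # escaped period matches only "."

# In a single-quoted string, \D is a literal backslash followed by D.
my $dir_ok = ('A:\DIR' =~ /A:\\DIR/) ? 1 : 0;

# The same path test, with and without escaped slashes.
my $escaped_slash = ('/usr/bin/perl' =~ /\/usr\/bin/) ? 1 : 0;
my $brace_delims  = ('/usr/bin/perl' =~ m{/usr/bin})  ? 1 : 0;

print "@loose\n";    # 10Q2 1052 10.2
print "@strict\n";   # 10.2
print "$dir_ok $escaped_slash $brace_delims\n";   # 1 1 1
```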
If a backslash precedes an
alphanumeric character, the sequence takes on a special meaning, typically a short
form of a [ ] character class. For example, \d is the same as the
[0-9] digits character class.
/[-+]?\d*\.?\d*/ is the same as
/[-+]?[0-9]*\.?[0-9]*/
Either of the above matches decimal numbers:
"-150", "-4.13", "3.1415", "+0000.00", etc.
A simple \s specifies "white space", the same as the character
class [ \t\n\r\f] (blank, tab, newline, carriage return, form-feed).
A character may be specified in hexadecimal as a \x followed by two
hexadecimal digits; \x1b is the ESC character.
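A small sketch, under the assumption that the sample characters below are representative: it compares \s with the explicit class on five white-space characters and one letter, then checks two \x escapes.

```perl
use strict;
use warnings;

my @chars = (' ', "\t", "\n", "\r", "\f", 'a');

my @by_escape = grep { /\s/          } @chars;   # short form
my @by_class  = grep { /[ \t\n\r\f]/ } @chars;   # explicit class

# \x1b is ESC (character code 27); \x41 is the letter "A".
my $esc      = "\x1b";
my $is_esc   = ($esc =~ /^\x1b$/) ? 1 : 0;
my $is_big_a = ('A' =~ /\x41/)    ? 1 : 0;

print scalar(@by_escape), ' ', scalar(@by_class), "\n";   # 5 5
print ord($esc), " $is_esc $is_big_a\n";                  # 27 1 1
```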
A vertical bar | specifies "or".
if ($answer =~ /^y|^yes|^yeah/i ) {
print "Affirmative!";
}
prints "Affirmative!" for $answer equal to "y" or "yes" or "yeah" (or
"Y", "YeS", or "yessireebob, that's right"). Because the pattern is anchored only at
the beginning, any answer that starts with "y" matches.