12. Matching

Matching uses patterns called "regular expressions". This, as you will see, leads to Perl Paradox Number Four: Regular expressions aren't. See sections 13 and 14 of the Quick Reference.

The =~ operator performs pattern matching and substitution. For example, if:

    $s = 'One if by land and two if by sea';

then:

    if ($s =~ /if by la/) {print "YES"}

    else {print "NO"}

prints "YES", because the string $s matches the simple constant pattern "if by la".

    if ($s =~ /one/) {print "YES"}

    else {print "NO"}

prints "NO", because the string does not match the pattern. However, by adding the "i" option to ignore case, we would get a "YES" from the following:

    if ($s =~ /one/i) {print "YES"}

    else {print "NO"}
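The three tests above can be gathered into one short script so the results sit side by side. This is only a sketch: the variable names $literal, $wrong_case, and $ignore_case are invented for the illustration.

```perl
use strict;
use warnings;

my $s = 'One if by land and two if by sea';

# Store each outcome so the three tests can be compared side by side.
my $literal     = ($s =~ /if by la/) ? "YES" : "NO";  # constant pattern occurs in $s
my $wrong_case  = ($s =~ /one/)      ? "YES" : "NO";  # $s contains "One", not "one"
my $ignore_case = ($s =~ /one/i)     ? "YES" : "NO";  # the i option ignores case

print "$literal $wrong_case $ignore_case\n";  # prints "YES NO YES"
```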

Patterns can contain a mind-boggling variety of special directions that facilitate very general matching. See Perl Reference Guide section 13, Regular Expressions. For example, a period matches any character (except the "newline" \n character).

    if ($x =~ /l.mp/) {print "YES"}

would print "YES" for $x = "lamp", "lump", "slumped", but not for $x = "lmp" or "less amperes".
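A small loop makes it easy to check the period against each of those strings. The sketch below assumes nothing beyond the pattern above; the names @candidates, %result, and $across_newline are made up for the example.

```perl
use strict;
use warnings;

my @candidates = ("lamp", "lump", "slumped", "lmp", "less amperes");
my %result;

for my $x (@candidates) {
    # The period must consume exactly one character between "l" and "mp".
    $result{$x} = ($x =~ /l.mp/) ? "YES" : "NO";
    print "$x: $result{$x}\n";
}

# A newline between "l" and "mp" is not accepted by the period.
my $across_newline = ("l\nmp" =~ /l.mp/) ? "YES" : "NO";
print "l\\nmp: $across_newline\n";
```

The first three strings report YES and the rest report NO, including the one with a newline in the middle.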

Parentheses () group pattern elements. An asterisk * means that the preceding character, element, or group of elements may occur zero times, one time, or many times. Similarly, a plus + means that the preceding element or group of elements must occur at least once. A question mark ? means that the preceding element or group may occur zero times or one time. So:

    /fr.*nd/  matches "frnd", "friend", "front and back"

    /fr.+nd/  matches "frond", "friend", "front and back"

                but not "frnd".

    /10*1/    matches "11", "101", "1001", "100000001".

    /b(an)*a/ matches "ba", "bana", "banana", "banananana"

    /flo?at/  matches "flat" and "float"

                but not "flooat"
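The quantifier examples above can be checked with a table-driven loop. In this sketch the patterns are kept as plain strings and interpolated into the match; the names @rows and @outcomes are illustrative only.

```perl
use strict;
use warnings;

# Each row pairs a pattern (as a string) with a string to test it against.
my @rows = (
    [ 'fr.*nd',  'frnd'           ],  # YES: .* may match nothing at all
    [ 'fr.+nd',  'frnd'           ],  # NO:  .+ needs at least one character
    [ 'fr.+nd',  'front and back' ],  # YES
    [ '10*1',    '11'             ],  # YES: zero occurrences of "0"
    [ 'b(an)*a', 'banananana'     ],  # YES: the group "an" repeats
    [ 'flo?at',  'flat'           ],  # YES: the "o" is optional
    [ 'flo?at',  'flooat'         ],  # NO:  at most one "o" is allowed
);

my @outcomes;
for my $row (@rows) {
    my ($pattern, $string) = @$row;
    my $outcome = ($string =~ /$pattern/) ? "YES" : "NO";
    push @outcomes, $outcome;
    print "/$pattern/ against \"$string\": $outcome\n";
}
```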

Square brackets [ ] match a class of single characters.

    [0123456789] matches any single digit

    [0-9]        matches any single digit

    [0-9]+       matches any sequence of one or more digits

    [a-z]+       matches any run of lowercase letters

    [A-Z]+       matches any run of uppercase letters

    [ab n]*      matches the null string "", "b",

                    any number of blanks, "nab a banana"

[^...] matches characters that are not "...":

    [^0-9]       matches any non-digit character.

Curly braces allow more precise specification of repeated fields. For example [0-9]{6} matches any sequence of 6 digits, and [0-9]{6,10} matches any sequence of 6 to 10 digits.
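Character classes and counted repetitions are easy to try out on a few strings. The sketch below uses sample strings chosen for the illustration; note that an unanchored {6} is satisfied by any string that contains six digits in a row, even if more digits follow.

```perl
use strict;
use warnings;

my $lower     = ("hello"   =~ /[a-z]+/)   ? "YES" : "NO";  # YES
my $no_lower  = ("HELLO"   =~ /[a-z]+/)   ? "YES" : "NO";  # NO: no lowercase letter
my $non_digit = ("2024"    =~ /[^0-9]/)   ? "YES" : "NO";  # NO: every character is a digit
my $five      = ("12345"   =~ /[0-9]{6}/) ? "YES" : "NO";  # NO: only five digits
my $six       = ("123456"  =~ /[0-9]{6}/) ? "YES" : "NO";  # YES
my $seven     = ("1234567" =~ /[0-9]{6}/) ? "YES" : "NO";  # YES: contains a run of six
my $range     = ("id 1234567 ok" =~ /[0-9]{6,10}/) ? "YES" : "NO";  # YES

print "$lower $no_lower $non_digit $five $six $seven $range\n";
# prints "YES NO NO NO YES YES YES"
```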

Patterns float, unless anchored. The caret ^ (outside [ ]) anchors a pattern to the beginning, and dollar-sign $ anchors a pattern at the end, so:

    /at/         matches "at", "attention", "flat", & "flatter"

    /^at/        matches "at" & "attention" but not "flat"

    /at$/        matches "at" & "flat", but not "attention"

    /^at$/       matches "at" and nothing else (apart from "at\n",

                          since $ also matches before a final newline).

    /^at$/i      matches "at", "At", "aT", and "AT".

    /^[ \t]*$/   matches a "blank line", one that contains nothing

                          or any combination of blanks and tabs.
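The anchored patterns above can be used to filter a list. The following sketch relies on Perl's built-in grep, which keeps the elements for which the match succeeds; the sample data and variable names are made up for the example.

```perl
use strict;
use warnings;

my @words = ("at", "attention", "flat", "flatter");

# grep keeps the elements for which the match succeeds.
my @anywhere = grep { /at/ }   @words;  # at attention flat flatter
my @starts   = grep { /^at/ }  @words;  # at attention
my @ends     = grep { /at$/ }  @words;  # at flat
my @exact    = grep { /^at$/ } @words;  # at

print "anywhere: @anywhere\n";
print "starts:   @starts\n";
print "ends:     @ends\n";
print "exact:    @exact\n";

# Count the "blank" lines: empty, or only blanks and tabs.
my @lines = ("first", "", " \t ", "  x  ");
my $blank_count = grep { /^[ \t]*$/ } @lines;  # grep in scalar context counts
print "blank lines: $blank_count\n";           # prints "blank lines: 2"
```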

The Backslash. Other characters simply match themselves, but the characters +?.*^$()[]{}|\ and usually / must be escaped with a backslash \ to be taken literally. Thus:

    /10.2/       matches "10Q2", "1052", and "10.2"

    /10\.2/      matches "10.2" but not "10Q2" or "1052"

    /\*+/        matches one or more asterisks

    /A:\\DIR/    matches "A:\DIR"

    /\/usr\/bin/ matches "/usr/bin"
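The escaped patterns above can be exercised as follows. This is a sketch with invented sample strings. The last line shows why / only "usually" needs a backslash: Perl lets you pick other delimiters with the m operator, such as m{...}, and then the slash is an ordinary character.

```perl
use strict;
use warnings;

my $any_char  = ("10Q2"  =~ /10.2/)  ? "YES" : "NO";  # YES: bare period matches "Q"
my $lit_dot   = ("10Q2"  =~ /10\.2/) ? "YES" : "NO";  # NO:  escaped period is literal
my $lit_dot2  = ("10.2"  =~ /10\.2/) ? "YES" : "NO";  # YES
my $asterisks = ("a***b" =~ /\*+/)   ? "YES" : "NO";  # YES: one or more literal "*"

# In a single-quoted string, \D stays as a backslash followed by "D".
my $dos_path  = ('A:\DIR' =~ /A:\\DIR/) ? "YES" : "NO";  # YES

my $unix_path = ("/usr/bin/perl" =~ /\/usr\/bin/) ? "YES" : "NO";  # YES
my $unix_alt  = ("/usr/bin/perl" =~ m{/usr/bin})  ? "YES" : "NO";  # YES: no escapes needed

print "$any_char $lit_dot $lit_dot2 $asterisks $dos_path $unix_path $unix_alt\n";
# prints "YES NO YES YES YES YES YES"
```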

If a backslash precedes an alphanumeric character, the sequence takes on a special meaning, typically a short form of a [ ] character class. For example, \d is the same as the [0-9] digits character class.

    /[-+]?\d*\.?\d*/      is the same as

    /[-+]?[0-9]*\.?[0-9]*/

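Either of the above matches decimal numbers: "-150", "-4.13", "3.1415", "+0000.00", etc. Note that every element of the pattern is optional, so it also matches the empty string and, being unanchored, succeeds on any input at all.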
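To see what the decimal-number pattern really accepts, it helps to run it beside a stricter one. In the sketch below the stricter pattern is an assumption of this example, not part of the text: it adds the ^ and $ anchors and insists on at least one digit before the optional decimal point (so it would reject ".5").

```perl
use strict;
use warnings;

my @inputs = ("-150", "-4.13", "3.1415", "+0000.00", "abc", "");

my (@loose, @strict);
for my $in (@inputs) {
    # The pattern from the text: every element is optional.
    my $loose_hit  = ($in =~ /[-+]?\d*\.?\d*/)   ? "YES" : "NO";
    # Hypothetical stricter variant: anchored, at least one leading digit.
    my $strict_hit = ($in =~ /^[-+]?\d+\.?\d*$/) ? "YES" : "NO";
    push @loose,  $loose_hit;
    push @strict, $strict_hit;
    print "\"$in\": loose $loose_hit, strict $strict_hit\n";
}
```

The loose pattern reports YES for all six inputs, including "abc" and the empty string; the stricter variant reports YES only for the four numbers.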

A simple \s specifies "white space", the same as the character class [ \t\n\r\f] (blank, tab, newline, carriage return, form feed). A character may be specified in hexadecimal as a \x followed by two hexadecimal digits; \x1b is the ESC character.
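A quick check of \s and \x, as a sketch with made-up sample strings. chr(27) builds the ESC character from its decimal code, which is 1b in hexadecimal.

```perl
use strict;
use warnings;

my $tab_between     = ("a\tb" =~ /a\sb/) ? "YES" : "NO";  # YES: tab is white space
my $newline_between = ("a\nb" =~ /a\sb/) ? "YES" : "NO";  # YES: so is newline
my $nothing_between = ("ab"   =~ /a\sb/) ? "YES" : "NO";  # NO:  \s needs one character

# An ESC character followed by "[0m".
my $text    = chr(27) . "[0m";
my $has_esc = ($text =~ /\x1b/) ? "YES" : "NO";           # YES

print "$tab_between $newline_between $nothing_between $has_esc\n";
# prints "YES YES NO YES"
```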

A vertical bar | specifies "or".

    if ($answer =~ /^y|^yes|^yeah/i ) {

         print "Affirmative!";

    }

prints "Affirmative!" for $answer equal to "y" or "yes" or "yeah" (or "Y", "YeS", or "yessireebob, that's right").
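Because ^y already matches anything that begins with "y", the other two alternatives in that pattern never change the outcome, and "yellow" counts as affirmative too. The sketch below compares it with a stricter variant; that variant, which groups the alternatives in parentheses and anchors both ends, is an assumption of this example rather than something the text prescribes.

```perl
use strict;
use warnings;

my @answers = ("y", "YeS", "yessireebob, that's right",
               "yellow", "no", "maybe yes");

my (@loose, @strict);
for my $answer (@answers) {
    # The pattern from the text: anything beginning with "y" passes.
    my $loose_hit  = ($answer =~ /^y|^yes|^yeah/i)   ? "YES" : "NO";
    # Hypothetical stricter variant: the whole answer must be one of the words.
    my $strict_hit = ($answer =~ /^(y|yes|yeah)$/i) ? "YES" : "NO";
    push @loose,  $loose_hit;
    push @strict, $strict_hit;
    print "\"$answer\": loose $loose_hit, strict $strict_hit\n";
}
```

The loose pattern accepts the first four answers; the stricter one accepts only "y" and "YeS".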
