5.4. My RE isn't matching/deleting what I want it to. (Or, "Greedy vs. stingy pattern matching")
The two most common causes for this problem are: (1) misusing the
'.' metacharacter, and (2) misusing the '*' metacharacter. The RE
'.*' is designed to be "greedy" (i.e., matching as many characters
as possible). However, sometimes users need an expression which is
"stingy," matching the shortest possible string.
(1) On single-line patterns, the '.' metacharacter matches any
single character on the line. ('.' cannot match the newline at the
end of the line because the newline is removed when the line is put
into the pattern space; sed adds a newline automatically when the
pattern space is printed.) On multi-line patterns obtained with the
'N' or 'G' commands, '.' will match a newline in the middle of the
pattern space. If there are 3 lines in the pattern space, "s/.*//"
will delete all 3 lines, not just the first one (leaving 1 blank
line, since the trailing newline is added to the output).
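The multi-line case above can be checked with a small sketch. It
assumes a POSIX shell with printf; the three input lines and the
variable name "result" are made up for illustration.

```shell
# Pull three lines into the pattern space with two 'N' commands,
# then let '.*' run across the embedded newlines.
result=$(printf 'one\ntwo\nthree\n' | sed -e 'N' -e 'N' -e 's/.*//' | wc -l | tr -d ' ')
echo "$result"     # number of lines sed printed
```

If '.' stopped at the first embedded newline, sed would print three
lines (one blank, then "two" and "three") rather than a single
blank line.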
A common misuse of '.' occurs when trying to match a word or
bounded field and forgetting that '.' will also cross the field
limits. Suppose you want to delete the first word in braces:
echo '{one} {two} {three}' | sed 's/{.*}/{}/'      # fails: prints {}
echo '{one} {two} {three}' | sed 's/{[^}]*}/{}/'   # succeeds: prints {} {two} {three}
's/{.*}/{}/' is not the solution, since the '.' will match any
character, including the close braces, and the greedy '*' extends
the match to the last '}' on the line. Replace the '.' with
'[^}]', a negated character set '[^...]' which matches any
character other than a right brace. FWIW, we know that 's/{one}/{}/'
would also solve this particular problem, but we're trying to
illustrate the use of the negated character set: [^anything-but-this].
A negated character set should be used for matching words between
quote marks, for fields separated by commas, and so on. See also
section 4.12 ("How do I parse a comma-delimited data file?").
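As a rough sketch of the same idea applied to quote marks and
commas (the sample strings and variable names are invented for this
illustration):

```shell
# Empty the first double-quoted string only; [^"]* cannot cross a quote mark.
quoted=$(echo 'say "hello" and "goodbye"' | sed 's/"[^"]*"/""/')
echo "$quoted"     # say "" and "goodbye"

# Delete the first comma-separated field; [^,]* stops at the first comma.
fields=$(echo 'alpha,beta,gamma' | sed 's/^[^,]*,//')
echo "$fields"     # beta,gamma
```

With '.*' in place of the negated sets, the first command would
swallow everything through the last quote mark, and the second
would delete everything through the last comma.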
(2) The '*' metacharacter represents zero or more instances of the
previous expression. The '*' metacharacter looks for the leftmost
possible match first, and zero instances counts as a match. Thus,
echo foo | sed 's/o*/EEE/'
will generate 'EEEfoo', not 'fEEE' as one might expect. This is
because /o*/ matches the null string at the beginning of the word.
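One way to get the result most people expect is to require at least
one 'o' by writing 'oo*'. A minimal sketch comparing the two forms
(the variable names are only for illustration):

```shell
# 'o*' is satisfied by the empty string in front of the 'f'.
zero_ok=$(echo foo | sed 's/o*/EEE/')
# 'oo*' needs at least one 'o', so the match starts at the first 'o'.
one_or_more=$(echo foo | sed 's/oo*/EEE/')
echo "$zero_ok $one_or_more"     # EEEfoo fEEE
```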
After finding the leftmost possible match, the '*' is GREEDY; it
always tries to match the longest possible string. When two or
three instances of '.*' occur in the same RE, the leftmost instance
will grab the most characters. Consider this example, which uses
grouping '\(...\)' to save patterns:
echo bar bat bay bet bit | sed 's/^.*\(b.*\)/\1/'
What will be displayed is 'bit', never anything longer, because the
leftmost '.*' took the longest possible match. Remember this rule:
"leftmost match, longest possible string, zero also matches."