Creating regular expressions
You can begin learning regular expressions with a useful subset of the possible constructs. The complete list of constructs for building regular expressions is in the Javadoc for the Pattern class in package java.util.regex.
Characters
| B | The specific character B |
| \xhh | Character with hex value 0xhh |
| \uhhhh | The Unicode character with hex representation 0xhhhh |
| \t | Tab |
| \n | Newline |
| \r | Carriage return |
| \f | Form feed |
| \e | Escape |
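The character constructs listed above can be tried directly with Pattern.matches(). The following is a minimal sketch; the class name CharacterEscapes and its matches() helper are made up for illustration and are not part of the JDK.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: checks a few character constructs against single characters.
class CharacterEscapes {
    // Returns true if the entire input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("B", "B"));            // literal character: true
        System.out.println(matches("\\x42", "B"));        // hex value 0x42 is 'B': true
        System.out.println(matches("\\u0042", "B"));      // Unicode value 0x0042 is 'B': true
        System.out.println(matches("\\t", "\t"));         // tab: true
        System.out.println(matches("\\e", "\u001B"));     // escape character: true
        System.out.println(matches("\\n", "\r"));         // newline is not carriage return: false
    }
}
```

Each regular expression is written with a doubled backslash so that the regex engine, rather than the Java compiler, interprets the escape.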
The power of regular expressions begins to appear when defining character classes. Here are some typical ways to create character classes, and some predefined classes:
Character Classes
| . | Any character |
| [abc] | Any of the characters a, b, or c (same as a|b|c) |
| [^abc] | Any character except a, b, and c (negation) |
| [a-zA-Z] | Any character a through z or A through Z (range) |
| [abc[hij]] | Any of a, b, c, h, i, j (same as a|b|c|h|i|j) (union) |
| [a-z&&[hij]] | Either h, i, or j (intersection) |
| \s | A whitespace character (space, tab, newline, form feed, carriage return) |
| \S | A non-whitespace character ([^\s]) |
| \d | A numeric digit [0-9] |
| \D | A non-digit [^0-9] |
| \w | A word character [a-zA-Z_0-9] |
| \W | A non-word character [^\w] |
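To see how these classes behave, here is a small sketch that tests several of them with String.matches(), which succeeds only when the whole string matches. The class name CharacterClasses and the sample inputs are illustrative assumptions.

```java
// Hypothetical demo class: exercises the character classes from the table.
class CharacterClasses {
    // Returns true if the entire input matches the regular expression.
    static boolean matches(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(matches("[abc]", "b"));          // true
        System.out.println(matches("[^abc]", "b"));         // false: b is excluded
        System.out.println(matches("[a-zA-Z]", "Q"));       // true: within the range
        System.out.println(matches("[abc[hij]]", "i"));     // true: union
        System.out.println(matches("[a-z&&[hij]]", "a"));   // false: a is not in the intersection
        System.out.println(matches("\\d\\s\\w", "7 x"));    // true: digit, whitespace, word character
        System.out.println(matches("\\D\\S\\W", "x7!"));    // true: non-digit, non-whitespace, non-word
    }
}
```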
If you have any experience with regular expressions in other languages, you’ll immediately notice a difference in the way backslashes are handled. In other languages, “\\” means “I want to insert a plain old (literal) backslash in the regular expression. Don’t give it any special meaning.” In Java, “\\” means “I’m inserting a regular expression backslash, so the following character has special meaning.” For example, if you want to indicate one or more word characters, your regular expression string will be “\\w+”. If you want to insert a literal backslash, you say “\\\\”. However, things like newlines and tabs just use a single backslash: “\n\t”.
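The backslash rules described above can be checked with a short sketch. The class name BackslashDemo and its constants are hypothetical names chosen for this illustration.

```java
// Hypothetical demo class: shows how Java string escapes relate to regex escapes.
class BackslashDemo {
    static final String WORDS = "\\w+";        // the regex sees \w+ (one or more word characters)
    static final String BACKSLASH = "\\\\";    // the regex sees \\ (one literal backslash)
    static final String NEWLINE_TAB = "\n\t";  // the regex sees an actual newline and tab

    public static void main(String[] args) {
        System.out.println("hello_42".matches(WORDS));      // true
        System.out.println("\\".matches(BACKSLASH));        // true: the input is a single backslash
        System.out.println("\n\t".matches(NEWLINE_TAB));    // true
        System.out.println(WORDS.length());                 // 3: backslash, w, plus
        System.out.println(BACKSLASH.length());             // 2: two backslash characters
    }
}
```

The length() calls show what actually reaches the regex engine once the Java compiler has processed the string literal.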
What’s shown here is only a sampling; you’ll want to have the java.util.regex.Pattern JDK documentation page bookmarked or on your “Start” menu so you can easily access all the possible regular expression patterns.
Logical Operators
| XY | X followed by Y |
| X|Y | X or Y |
| (X) | A capturing group. You can refer to the ith captured group later in the expression with \i |
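As an illustration of the logical operators, the sketch below uses concatenation, alternation, and a capturing group with a back reference. The class name LogicalOperators, the helper firstDoubledLetter(), and the sample strings are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: concatenation, alternation, and back references.
class LogicalOperators {
    // Returns the first letter that appears twice in a row, or null if there is none.
    static String firstDoubledLetter(String input) {
        // (\w) captures one word character; \1 requires the same character again.
        Matcher m = Pattern.compile("(\\w)\\1").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println("ab".matches("ab"));           // true: a followed by b
        System.out.println("b".matches("a|b"));           // true: a or b
        System.out.println("abab".matches("(ab)\\1"));    // true: group 1 repeated
        System.out.println("abba".matches("(ab)\\1"));    // false: "ba" is not group 1
        System.out.println(firstDoubledLetter("hello"));  // l
    }
}
```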
Boundary Matchers
| ^ | Beginning of a line |
| $ | End of a line |
| \b | Word boundary |
| \B | Non-word boundary |
| \G | End of the previous match |
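The boundary matchers are easiest to understand by counting matches. The following sketch assumes a hypothetical helper count() that counts how many times find() succeeds. Note that ^ and $ match at line boundaries inside the input only when the Pattern.MULTILINE flag is set; by default they match only at the beginning and end of the entire input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: counts matches to show where boundary matchers apply.
class BoundaryMatchers {
    // Counts successive successful find() calls for the regex on the input.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // With MULTILINE, ^ and $ match at the start and end of each line.
        System.out.println(count("^\\w+$", Pattern.MULTILINE, "one\ntwo")); // 2
        // Without it, they match only at the start and end of the whole input.
        System.out.println(count("^\\w+$", 0, "one\ntwo"));                 // 0
        // \b requires a word boundary on each side of "cat".
        System.out.println(count("\\bcat\\b", 0, "a cat sat"));             // 1
        System.out.println(count("\\bcat\\b", 0, "concatenate"));           // 0
        // \B requires that "cat" is not at a word boundary.
        System.out.println(count("\\Bcat\\B", 0, "concatenate"));           // 1
        // \G forces each match to start where the previous one ended.
        System.out.println(count("\\G\\d", 0, "123a45"));                   // 3
    }
}
```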
As an example, each of the following is a valid regular expression, and all of them successfully match the character sequence "Rudolph":
Rudolph
[rR]udolph
[rR][aeiou][a-z]ol.*
R.*
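A quick way to confirm this is to run each expression against the string. This is a minimal sketch; the class name RudolphCheck and its allMatch() helper are invented for the example.

```java
// Hypothetical demo class: checks every pattern from the list against an input.
class RudolphCheck {
    static final String[] PATTERNS = {
        "Rudolph",
        "[rR]udolph",
        "[rR][aeiou][a-z]ol.*",
        "R.*"
    };

    // Returns true only if the whole input matches every pattern in PATTERNS.
    static boolean allMatch(String input) {
        for (String pattern : PATTERNS) {
            if (!input.matches(pattern)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String pattern : PATTERNS) {
            // Prints each pattern followed by whether it matches "Rudolph".
            System.out.println(pattern + " : " + "Rudolph".matches(pattern));
        }
        System.out.println(allMatch("Rudolph")); // true
    }
}
```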