Java Regular Expressions
Regular expressions (Regex) are a powerful string processing tool. They are special character sequences used to define search patterns. In Java, the core functionality for handling regular expressions is located in the java.util.regex package.
What are Regular Expressions?
Regular expressions can be used to:
- Validate: Check if a string conforms to a certain format (such as email, phone number).
- Search: Find all substrings in a text that match a specific pattern.
- Replace: Find matching substrings and replace them with other content.
- Split: Split a string based on a pattern.
Core Classes in the java.util.regex Package
- `Pattern` class: Represents a compiled regular expression. A `Pattern` object has no public constructor and must be created through the static method `Pattern.compile()`.
- `Matcher` class: A regex matching engine. It performs matching operations on an input string by interpreting a `Pattern`. A `Matcher` object is obtained through the `pattern.matcher(inputString)` method.
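As a minimal sketch of how the two classes divide the work, the example below compiles a pattern once and creates a fresh matcher for each input. The class name `PatternReuseExample`, the constant `DIGITS`, and the helper `containsDigits` are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternReuseExample {
    // Compiled once and shared: Pattern instances are immutable and thread-safe.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Hypothetical helper: true if the input contains at least one run of digits.
    static boolean containsDigits(String input) {
        // A Matcher holds per-input state, so a new one is created for each input.
        Matcher matcher = DIGITS.matcher(input);
        return matcher.find();
    }

    public static void main(String[] args) {
        System.out.println(containsDigits("order 42"));   // true
        System.out.println(containsDigits("no numbers")); // false
    }
}
```

Keeping the compiled `Pattern` in a constant avoids recompiling the same expression on every call, which can matter when the check runs in a loop.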
Basic Matching Process
Using regular expressions typically follows these three steps:
- Create a `Pattern` object using `Pattern.compile(regex)`.
- Create a `Matcher` object using `pattern.matcher(input)`.
- Use methods of the `Matcher` object (such as `find()`, `matches()`) to perform matching.
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExample {
    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog.";
        String regex = "\\b[a-zA-Z]{3}\\b"; // Match all 3-letter words

        // 1. Compile the regular expression
        Pattern pattern = Pattern.compile(regex);

        // 2. Create a matcher
        Matcher matcher = pattern.matcher(text);

        // 3. Find matches
        System.out.println("Finding all 3-letter words in the text:");
        while (matcher.find()) {
            // find() tries to find the next match
            // group() returns the currently found matching substring
            System.out.println("Found: '" + matcher.group() + "' at index " + matcher.start());
        }
    }
}
// Output:
// Finding all 3-letter words in the text:
// Found: 'The' at index 0
// Found: 'fox' at index 16
// Found: 'the' at index 31
// Found: 'dog' at index 40
```

Note: In Java string literals, the backslash `\` is an escape character, so to use a `\` in a regular expression, you need to write `\\` in the string.
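The escaping rule can be illustrated with a short sketch. The class name `EscapeExample` and the helpers `splitOnBackslash` and `matchesLiterally` are hypothetical; `Pattern.quote()` is the standard way to make a string match literally without escaping each metacharacter by hand.

```java
import java.util.regex.Pattern;

public class EscapeExample {
    // Hypothetical helper: splits on a single literal backslash.
    // The regex is \\ (an escaped backslash), written "\\\\" in Java source.
    static String[] splitOnBackslash(String input) {
        return input.split("\\\\");
    }

    // Hypothetical helper: true only if input is exactly the literal text,
    // with no regex metacharacters interpreted.
    static boolean matchesLiterally(String input, String literal) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // The regex \d+ (one or more digits) is written "\\d+" in Java source.
        System.out.println("2024".matches("\\d+")); // true

        String path = "C:\\temp"; // the actual string content is C:\temp
        System.out.println(splitOnBackslash(path).length); // 2

        // With quoting, '.' is a plain dot rather than a wildcard.
        System.out.println(matchesLiterally("1.5", "1.5")); // true
        System.out.println(matchesLiterally("1x5", "1.5")); // false
    }
}
```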
Common Matcher Methods
- `matches()`: Attempts to match the entire input string against the pattern. Returns `true` only if the entire string matches.
- `find()`: Attempts to find the next subsequence of the input string that matches the pattern. Each call continues searching from where the last match ended.
- `lookingAt()`: Attempts to match the pattern from the beginning of the input string. Returns `true` if the beginning matches, without requiring the entire string to match.
- `group()`: Returns the substring matched by the last matching operation (such as `find()`).
- `start()` / `end()`: Return the start index and end index (exclusive) of the last matched substring.
- `replaceAll(replacement)`: Replaces all matching substrings.
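The difference between these methods is easiest to see when they run against the same input. The following is a small sketch; the class name `MatcherMethodsExample`, the constant `NUMBER`, and the helper `findNumbers` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherMethodsExample {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Hypothetical helper: records each match as "text@start-end" (end is exclusive).
    static List<String> findNumbers(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            found.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "42 apples and 7 pears";

        // matches(): the whole input must be digits, so this is false.
        System.out.println(NUMBER.matcher(input).matches());   // false

        // lookingAt(): only the beginning must match, so this is true.
        System.out.println(NUMBER.matcher(input).lookingAt()); // true

        // find() with group(), start(), and end().
        System.out.println(findNumbers(input)); // [42@0-2, 7@14-15]

        // replaceAll(): every match is replaced.
        System.out.println(NUMBER.matcher(input).replaceAll("#")); // # apples and # pears
    }
}
```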
Regex Methods in the String Class
For convenience, the String class also has some built-in methods that directly support regular expressions.
- `boolean matches(String regex)`: Determines if the entire string matches the given regular expression. Equivalent to `Pattern.matches(regex, this)`.

  ```java
  String email = "test@example.com";
  // A simple email format validation
  boolean isValid = email.matches("^[\\w.-]+@[\\w.-]+\\.[a-zA-Z]{2,}$");
  System.out.println("Is email format valid: " + isValid); // true
  ```

- `String[] split(String regex)`: Splits the string based on a regular expression.

  ```java
  String text = "apple, banana; orange";
  String[] fruits = text.split("[,;\\s]+"); // Split by comma, semicolon, or whitespace
  // fruits -> ["apple", "banana", "orange"]
  ```

- `String replaceAll(String regex, String replacement)`: Replaces all substrings matching the regular expression with the specified string.

  ```java
  String text = "My phone number is 123-456-7890.";
  // Replace all digits with 'X'
  String censored = text.replaceAll("\\d", "X");
  // censored -> "My phone number is XXX-XXX-XXXX."
  ```
Common Regex Metacharacters
| Metacharacter | Description |
|---|---|
| `.` | Matches any single character except line terminators (by default) |
| `\d` | Matches a digit, equivalent to `[0-9]` |
| `\D` | Matches a non-digit character |
| `\s` | Matches any whitespace character (space, tab, newline, etc.) |
| `\S` | Matches any non-whitespace character |
| `\w` | Matches any word character (letter, digit, underscore), equivalent to `[a-zA-Z_0-9]` |
| `\W` | Matches any non-word character |
| `\b` | Matches a word boundary |
| `^` | Matches the beginning of input |
| `$` | Matches the end of input |
| `*` | Matches the preceding element zero or more times |
| `+` | Matches the preceding element one or more times |
| `?` | Matches the preceding element zero or one time |
| `{n}` | Matches the preceding element exactly n times |
| `{n,}` | Matches the preceding element at least n times |
| `{n,m}` | Matches the preceding element at least n times, but no more than m times |
| `[]` | Character set, matches any one character in the brackets. For example, `[abc]` matches 'a', 'b', or 'c' |
| `()` | Grouping, treats multiple characters as a single unit and captures the matched text |
| `\|` | OR operator, matches the expression on either side of `\|`. For example, `cat\|dog` matches "cat" or "dog" |
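
To show several of these metacharacters working together, here is a minimal sketch that combines anchors, quantifiers, capturing groups, alternation, and word boundaries. The class name `MetacharacterExample` and the helpers `extractYear` and `countPets` are hypothetical. The date pattern only checks the shape of the text, not whether the date is valid.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetacharacterExample {
    // ^ and $ anchor the whole input; three groups capture year, month, and day.
    private static final Pattern DATE = Pattern.compile("^(\\d{4})-(\\d{2})-(\\d{2})$");

    // Alternation inside a group, an optional plural 's', and word boundaries on both sides.
    private static final Pattern PET = Pattern.compile("\\b(cat|dog)s?\\b");

    // Hypothetical helper: returns the first captured group (the year),
    // or null if the input does not have the dddd-dd-dd shape.
    static String extractYear(String input) {
        Matcher m = DATE.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // Hypothetical helper: counts "cat", "cats", "dog", or "dogs" as whole words.
    static int countPets(String input) {
        Matcher m = PET.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(extractYear("2024-03-15")); // 2024
        System.out.println(extractYear("15/03/2024")); // null

        // "catalog" is not counted: there is no word boundary after "cat" in it.
        System.out.println(countPets("two cats, one dog, and a catalog")); // 2
    }
}
```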