Python Regular Expressions
Regular expressions (regex) use a specialized syntax to search, match, and manipulate patterns in strings. Python supports regular expressions through the built-in re module.
What is a Regular Expression?
Imagine you want to find all phone numbers or email addresses in a long text. These patterns are hard to describe using simple string methods (like find() or startswith()). Regular expressions allow you to define a "pattern" and then use this pattern to match any complex strings you want.
Core Functions of the re Module
re.search(pattern, string)
Scans the string for the first location where the pattern matches. Returns a match object if one is found, otherwise None.
import re
text = "The rain in Spain falls mainly in the plain."
# Find the 'ai' pattern
match = re.search(r"ai", text)
if match:
    print("Match found!")
    print(f"Span: {match.span()}")  # Start and end positions of the match: (5, 7)
    print(f"String: {match.string}")  # The original string
    print(f"Group: {match.group()}")  # The matched text: 'ai'
else:
    print("No match found.")
What is r"..."? The r prefix marks a "raw string". In regular expressions, the backslash (\) has special meaning (for example, \d matches a digit). A raw string stops the Python interpreter from treating backslashes as its own escape sequences, so the pattern reaches the re module exactly as written.
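The difference between a raw and a normal string can be seen in a small sketch. The sample text and variable names here are illustrative assumptions, not part of the original example; the pattern uses \b, which Python reads as a backspace character in a normal string but which the re module reads as a word boundary.

```python
import re

# In a normal string, "\b" is converted by Python into one backspace character.
normal = "\bcat\b"
# In a raw string, the backslash is kept, so the re module receives \b
# and treats it as a word boundary.
raw = r"\bcat\b"

print(len(normal))  # 5
print(len(raw))     # 7

text = "A cat sat on the catalog."
raw_matches = re.findall(raw, text)
normal_matches = re.findall(normal, text)
print(raw_matches)     # ['cat']
print(normal_matches)  # []
```

The normal-string pattern finds nothing because it searches for literal backspace characters, which the text does not contain.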
re.findall(pattern, string)
Finds all non-overlapping substrings in the string that match the pattern and returns them as a list.
import re
text = "The rain in Spain falls mainly in the plain."
# Find all instances of 'ai'
all_matches = re.findall(r"ai", text)
print(all_matches)  # Output: ['ai', 'ai', 'ai', 'ai']
re.sub(pattern, repl, string)
Replaces every non-overlapping match of the pattern in the string with repl and returns the resulting string. If nothing matches, the string is returned unchanged.
import re
text = "My phone number is 123-456-7890."
# Replace phone number with [REDACTED]
redacted_text = re.sub(r"\d{3}-\d{3}-\d{4}", "[REDACTED]", text)
print(redacted_text)  # Output: My phone number is [REDACTED].
Common Metacharacters
Metacharacters are characters with special meanings in regular expressions.
| Metacharacter | Description | Example | Matches |
|---|---|---|---|
| . | Matches any single character except a newline | a.b | acb, a_b |
| ^ | Matches at the beginning of the string | ^Hello | Hello World |
| $ | Matches at the end of the string | World$ | Hello World |
| * | Matches the preceding element 0 or more times | ab*c | ac, abc, abbbc |
| + | Matches the preceding element 1 or more times | ab+c | abc, abbbc (but not ac) |
| ? | Matches the preceding element 0 or 1 time | ab?c | ac, abc |
| {m,n} | Matches the preceding element m to n times | a{2,4} | aa, aaa, aaaa |
| [] | Character set; matches any one character listed in the brackets | [aeiou] | a, e, i, o, u |
| \ | Escapes a special character or introduces a special sequence | \. | . (a literal period) |
| \d | Matches any digit (equivalent to [0-9] for ASCII text) | \d+ | 123, 45 |
| \D | Matches any non-digit character | \D+ | abc |
| \s | Matches any whitespace character (space, tab, newline) | a\sb | a b |
| \S | Matches any non-whitespace character | \S+ | abc, 42 |
| \w | Matches any letter, digit, or underscore (equivalent to [a-zA-Z0-9_] for ASCII text) | \w+ | jane_doe |
| \W | Matches any character that is not a letter, digit, or underscore | \W | @, - |
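Several of the metacharacters in the table can be tried against one short string. This is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

sample = "cat cot cut ct coat"

# . stands for exactly one character, so "ct" and "coat" do not match
any_char = re.findall(r"c.t", sample)
# [ao] accepts only 'a' or 'o' in the middle position
char_set = re.findall(r"c[ao]t", sample)
# o* allows zero or more 'o' characters between 'c' and 't'
star = re.findall(r"co*t", sample)
# \w{2,3} requires two or three word characters between 'c' and 't'
counted = re.findall(r"c\w{2,3}t", sample)

print(any_char)  # ['cat', 'cot', 'cut']
print(char_set)  # ['cat', 'cot']
print(star)      # ['cot', 'ct']
print(counted)   # ['coat']

# ^ and $ tie the pattern to the start and end of the string
starts_with_cat = bool(re.search(r"^cat", sample))
starts_with_cot = bool(re.search(r"^cot", sample))
ends_with_coat = bool(re.search(r"coat$", sample))
print(starts_with_cat, starts_with_cot, ends_with_coat)  # True False True

# \. is a literal period; \s+ is a run of one or more whitespace characters
dots = re.findall(r"\.", "v1.2.3")
squeezed = re.sub(r"\s+", " ", "a  \t b\nc")
print(dots)      # ['.', '.']
print(squeezed)  # a b c
```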
Grouping
Parentheses () group parts of a pattern. This has two main purposes:
- Apply quantifiers (such as *, +, ?) to multiple characters as a whole.
- Capture the matched content for later reference.
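The first purpose, applying a quantifier to several characters at once, can be sketched as follows. The input strings and variable names are illustrative assumptions.

```python
import re

# Without a group, + applies only to the 'b' directly before it.
single = re.search(r"ab+", "ababab").group()
# With a group, + repeats the whole unit "ab".
grouped = re.search(r"(ab)+", "ababab").group()
print(single)   # ab
print(grouped)  # ababab

# A group followed by ? makes the whole "ember" suffix optional.
short = re.search(r"Nov(ember)?", "Due Nov 5").group()
full = re.search(r"Nov(ember)?", "Due November 5").group()
print(short)  # Nov
print(full)   # November
```

The second purpose, capturing, is what the email example uses.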
import re
text = "Email: john.doe@example.com, User: jane_doe"
# Pattern matches a simple email address of the form name.name@domain.tld
# The first (\w+\.\w+) captures the username part
# The second (\w+\.\w+) captures the domain part
match = re.search(r"(\w+\.\w+)@(\w+\.\w+)", text)
if match:
    print(f"Full match: {match.group(0)}")  # group(0) or group() is the entire match
    print(f"Username: {match.group(1)}")  # group(1) is the text captured by the first pair of parentheses
    print(f"Domain: {match.group(2)}")  # group(2) is the text captured by the second pair of parentheses
Regular expressions are a broad and powerful topic, and mastering them takes regular practice. Online tools such as regex101.com are useful for testing and learning patterns.