
Python Regular Expressions

Regular expressions (regex) are a specialized syntax for searching, matching, and manipulating patterns in strings. Python supports them through the re module in its standard library.

What is a Regular Expression?

Imagine you want to find all phone numbers or email addresses in a long text. Such patterns are hard to describe with simple string methods like find() or startswith(). With a regular expression, you define a "pattern" once and then use it to match every string that fits that description.

Core Functions of the re Module

re.search(pattern, string)

Scans the string from left to right and returns a match object for the first location where the pattern matches, or None if there is no match.

python
import re

text = "The rain in Spain falls mainly in the plain."

# Find the 'ai' pattern
match = re.search(r"ai", text)

if match:
    print("Match found!")
    print(f"Span: {match.span()}")   # Start and end positions of match: (5, 7)
    print(f"String: {match.string}") # Original string
    print(f"Group: {match.group()}")   # Matched string: 'ai'
else:
    print("No match found.")

What is r"..."? The r prefix marks a "raw string". Regular expressions use the backslash \ to introduce special sequences (such as \d for a digit), and Python string literals also use it for escapes (such as \n for a newline). In a raw string, Python leaves backslashes untouched, so the pattern reaches the re module exactly as written.
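The difference is easiest to see with \b, which Python treats as a backspace character in a normal string but which the regex engine reads as a word boundary. The following is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

# "\b" in a normal string is a single backspace character, while r"\b"
# is two characters: a backslash followed by 'b'.
print(len("\b"))   # 1
print(len(r"\b"))  # 2

text = "cat concatenate cat."  # illustrative sample text

# Raw string: \b reaches the regex engine and means "word boundary".
raw_hits = re.findall(r"\bcat\b", text)
print(raw_hits)      # ['cat', 'cat']

# Normal string: Python has already turned \b into a backspace, so the
# pattern looks for backspace + "cat" + backspace and finds nothing.
plain_hits = re.findall("\bcat\b", text)
print(plain_hits)    # []

# Doubling each backslash also works, but is harder to read.
escaped_hits = re.findall("\\bcat\\b", text)
print(escaped_hits)  # ['cat', 'cat']
```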

re.findall(pattern, string)

Finds all non-overlapping substrings in the string that match the pattern and returns them as a list.

python
import re

text = "The rain in Spain falls mainly in the plain."

# Find all instances of 'ai'
all_matches = re.findall(r"ai", text)

print(all_matches) # Output: ['ai', 'ai', 'ai', 'ai']
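To make "non-overlapping" concrete, here is a small sketch with made-up input strings. After each match, scanning resumes at the end of that match, so no character is counted twice.

```python
import re

# "aaaa" contains three overlapping positions for "aa", but findall
# reports only the two that do not share characters.
pairs = re.findall(r"aa", "aaaa")
print(pairs)    # ['aa', 'aa']

# When nothing matches, the result is an empty list rather than None,
# so it is always safe to loop over or take len() of the result.
missing = re.findall(r"xyz", "The rain in Spain")
print(missing)  # []
```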

re.sub(pattern, repl, string)

Replaces every non-overlapping match of the pattern in string with repl and returns the resulting string. If nothing matches, the string is returned unchanged.

python
import re

text = "My phone number is 123-456-7890."

# Replace phone number with [REDACTED]
redacted_text = re.sub(r"\d{3}-\d{3}-\d{4}", "[REDACTED]", text)

print(redacted_text) # Output: My phone number is [REDACTED].
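re.sub also accepts an optional count argument, and repl can be a function instead of a string. The sketch below shows both; the sample text and the mask helper are hypothetical and only serve the illustration.

```python
import re

text = "Call 123-456-7890 or 987-654-3210."  # illustrative sample text
pattern = r"\d{3}-\d{3}-\d{4}"

# By default every match is replaced.
all_redacted = re.sub(pattern, "[REDACTED]", text)
print(all_redacted)  # Call [REDACTED] or [REDACTED].

# count=1 limits the replacement to the first match.
first_only = re.sub(pattern, "[REDACTED]", text, count=1)
print(first_only)    # Call [REDACTED] or 987-654-3210.

# repl may be a function; it receives each match object and returns
# the replacement text for that match.
def mask(match):
    # Hypothetical helper: keep the last four digits, hide the rest.
    return "***-***-" + match.group()[-4:]

masked = re.sub(pattern, mask, text)
print(masked)        # Call ***-***-7890 or ***-***-3210.
```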

Common Metacharacters

Metacharacters are characters with special meanings in regular expressions.

| Metacharacter | Description | Example | Matches |
| --- | --- | --- | --- |
| . | Matches any single character except a newline | a.b | acb, a_b |
| ^ | Matches the beginning of the string | ^Hello | Hello World |
| $ | Matches the end of the string | World$ | Hello World |
| * | Matches the preceding element 0 or more times | ab*c | ac, abc, abbbc |
| + | Matches the preceding element 1 or more times | ab+c | abc, abbbc (doesn't match ac) |
| ? | Matches the preceding element 0 or 1 time | ab?c | ac, abc |
| {m,n} | Matches the preceding element m to n times | a{2,4} | aa, aaa, aaaa |
| [] | Character set, matches any one character in the brackets | [aeiou] | a, e, i, o, u |
| \ | Escapes special characters or introduces special sequences | \. | . (the character itself) |
| \d | Matches any digit ([0-9]; for str patterns, other Unicode decimal digits too unless re.ASCII is set) | \d+ | 123, 45 |
| \D | Matches any non-digit character | | |
| \s | Matches any whitespace character (such as space, tab, newline) | | |
| \S | Matches any non-whitespace character | | |
| \w | Matches any letter, digit, or underscore ([a-zA-Z0-9_] with re.ASCII; Unicode-aware by default for str patterns) | | |
| \W | Matches any character that is not a letter, digit, or underscore | | |
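The rows above can be checked directly in Python. The following sketch uses re.fullmatch (which requires the whole string to match) alongside re.search and re.findall; all input strings are made up for illustration.

```python
import re

# Quantifiers: *, +, ? and {m,n}
print(bool(re.fullmatch(r"ab*c", "ac")))       # True: zero 'b's allowed
print(bool(re.fullmatch(r"ab+c", "ac")))       # False: at least one 'b' required
print(bool(re.fullmatch(r"ab?c", "abbc")))     # False: at most one 'b'
print(bool(re.fullmatch(r"a{2,4}", "aaaaa")))  # False: five 'a's is too many

# Anchors and the dot
print(bool(re.search(r"^Hello", "Hello World")))  # True
print(bool(re.search(r"World$", "Hello World")))  # True
print(bool(re.search(r"a.b", "a\nb")))            # False: '.' skips newline

# Character sets and special sequences
print(re.findall(r"[aeiou]", "regex"))           # ['e', 'e']
print(re.findall(r"\d+", "Order 123, item 45"))  # ['123', '45']
print(re.findall(r"\S+", "two  words"))          # ['two', 'words']

# \w is Unicode-aware for str patterns; re.ASCII narrows it to [a-zA-Z0-9_].
word = "caf\u00e9"  # "café" written with an escape for the accented letter
print(re.findall(r"\w+", word))                  # ['café']
print(re.findall(r"\w+", word, flags=re.ASCII))  # ['caf']
```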

Grouping

Parentheses () group part of a pattern. Grouping serves two main purposes:

  1. Apply a quantifier (such as *, +, ?) to several characters as a unit.
  2. Capture the matched content so it can be referenced later.

python
import re

text = "Email: john.doe@example.com, User: jane_doe"

# Simplified email pattern: it only handles the name.name@domain.tld shape
# The first (\w+\.\w+) captures the username part (john.doe)
# The second (\w+\.\w+) captures the domain part (example.com)
match = re.search(r"(\w+\.\w+)@(\w+\.\w+)", text)

if match:
    print(f"Full match: {match.group(0)}") # group(0) or group() is the entire match
    print(f"Username: {match.group(1)}")   # group(1) is the content captured by the first parentheses
    print(f"Domain: {match.group(2)}")     # group(2) is the content captured by the second parentheses
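The first purpose, applying a quantifier to a whole group, and the effect of capture groups on re.findall can be sketched as follows. The input strings are made up for illustration.

```python
import re

# Purpose 1: a quantifier after a group repeats the whole group.
print(re.fullmatch(r"(ab)+", "ababab") is not None)  # True
print(re.fullmatch(r"ab+", "ababab") is not None)    # False: only 'b' repeats

# Purpose 2: captured groups can be read back together with groups().
match = re.search(r"(\d{3})-(\d{3})-(\d{4})", "My phone number is 123-456-7890.")
if match:
    print(match.groups())  # ('123', '456', '7890')

# When the pattern has capture groups, findall returns the groups
# (as tuples when there is more than one) instead of the full match.
pairs = re.findall(r"(\w+)@(\w+\.\w+)", "a@x.com, b@y.org")
print(pairs)  # [('a', 'x.com'), ('b', 'y.org')]
```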

Regular expressions are a large topic, and fluency comes with practice. Online tools such as regex101.com are useful for testing patterns and seeing how they match.
