Regular Expressions

Overview

Regular expressions are a powerful tool for matching text patterns. PHP supports them through the PCRE (Perl Compatible Regular Expressions) library. This chapter covers how to use regular expressions for text matching, replacement, splitting, and validation.

Basic Syntax

PCRE Functions Introduction

php
<?php
// Main PCRE functions
// preg_match() - Execute a match
// preg_match_all() - Execute a global match
// preg_replace() - Perform search and replace
// preg_split() - Split string using regular expression

// Basic matching example
$text = "Hello World 2024";
$pattern = '/World/';

if (preg_match($pattern, $text)) {
    echo "Match found!\n";
}

// Get match results
if (preg_match('/(\d+)/', $text, $matches)) {
    echo "Found number: " . $matches[1] . "\n";
}
?>
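
The other three functions listed above follow the same calling style. The sketch below uses a made-up sample string and is only meant to show what each call returns:

```php
<?php
// Illustrative sample data, not taken from the examples above.
$text = "2024-01-15, 2025-06-30";

// preg_match_all() returns the number of full matches and fills $matches[0] with them.
$count = preg_match_all('/\d{4}/', $text, $matches);
echo "Years found: $count (" . implode(', ', $matches[0]) . ")\n";

// preg_replace() returns the subject with every match replaced.
$masked = preg_replace('/\d/', '#', $text);
echo "Masked: $masked\n";

// preg_split() returns the pieces that lie between matches of the pattern.
$parts = preg_split('/,\s*/', $text);
echo "Parts: " . implode(' | ', $parts) . "\n";

// preg_match() returns 1 on a match, 0 on no match, and false on error.
$noMatch = preg_match('/\d+/', 'no digits here');
var_dump($noMatch);
```

Because `preg_match()` can return either `0` or `false`, comparing its result with `=== 1` is the safest way to test for a match.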

Basic Metacharacters

php
<?php
// Single quotes keep PHP from ever treating "$" as the start of a variable
$text = 'The price is $25.99 for item #123';

// . - Match any character (except newline)
preg_match('/p.ice/', $text, $matches);
echo "Match '.': " . ($matches[0] ?? 'None') . "\n";

// * - Match preceding character 0 or more times
// (a bare '/\d*/' would succeed with an empty match at the very start of $text)
preg_match('/#\d*/', $text, $matches);
echo "Match '*': " . ($matches[0] ?? 'None') . "\n";

// + - Match preceding character 1 or more times
preg_match('/\d+/', $text, $matches);
echo "Match '+': " . ($matches[0] ?? 'None') . "\n";

// ? - Match preceding character 0 or 1 time
preg_match('/\$?\d+/', $text, $matches);
echo "Match '?': " . ($matches[0] ?? 'None') . "\n";
?>

Character Classes and Predefined Character Classes

php
<?php
$text = "User ID: A123, Age: 25, Email: user@example.com";

// [abc] - Match any character in the character set
preg_match('/[AEI]/', $text, $matches);
echo "Character class [AEI]: " . ($matches[0] ?? 'None') . "\n";

// [a-z] - Match characters in range
preg_match('/[a-z]+/', $text, $matches);
echo "Character class [a-z]: " . ($matches[0] ?? 'None') . "\n";

// Predefined character classes
// \d - Match digits [0-9]
preg_match_all('/\d/', $text, $matches);
echo "Digit characters: " . implode(', ', $matches[0]) . "\n";

// \w - Match word characters [a-zA-Z0-9_]
preg_match_all('/\w+/', $text, $matches);
echo "Word characters: " . implode(', ', $matches[0]) . "\n";

// \s - Match whitespace characters
preg_match_all('/\s/', $text, $matches);
echo "Whitespace character count: " . count($matches[0]) . "\n";
?>
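
Each class above also has a negated form: `[^...]` matches any character that is not listed, and `\D`, `\W`, and `\S` are the complements of `\d`, `\w`, and `\s`. A minimal sketch, using a shortened version of the sample string:

```php
<?php
$text = "User ID: A123, Age: 25";

// [^...] matches any single character NOT listed in the class.
preg_match('/[^A-Za-z ]+/', $text, $m);
echo "First run of non-letters: " . $m[0] . "\n";

// \D is the complement of \d: remove everything that is not a digit.
$digitsOnly = preg_replace('/\D/', '', $text);
echo "Digits only: $digitsOnly\n";

// \S+ matches runs of non-whitespace characters.
preg_match_all('/\S+/', $text, $tokens);
echo "Token count: " . count($tokens[0]) . "\n";
```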

Common Validation Patterns

Data Validation Class

php
<?php
class Validator {
    // Email validation
    public static function validateEmail($email) {
        $pattern = '/^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/';
        return preg_match($pattern, $email) === 1;
    }
    
    // Phone number validation (Mainland China)
    public static function validatePhone($phone) {
        $pattern = '/^1[3-9]\d{9}$/';
        return preg_match($pattern, $phone) === 1;
    }
    
    // ID card validation (simplified)
    public static function validateIdCard($idCard) {
        $pattern = '/^[1-9]\d{5}(19|20)\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])\d{3}[\dX]$/';
        return preg_match($pattern, $idCard) === 1;
    }
    
    // Password strength validation
    public static function validatePassword($password) {
        // At least 8 characters, containing uppercase, lowercase, digits, and special characters
        $pattern = '/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$/';
        return preg_match($pattern, $password) === 1;
    }
    
    // URL validation
    public static function validateUrl($url) {
        $pattern = '/^https?:\/\/(?:[-\w.])+(?:\:[0-9]+)?(?:\/(?:[\w\/_.])*(?:\?(?:[\w&=%.])*)?(?:\#(?:[\w.])*)?)?$/';
        return preg_match($pattern, $url) === 1;
    }
    
    // IP address validation
    public static function validateIP($ip) {
        $pattern = '/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/';
        return preg_match($pattern, $ip) === 1;
    }
}

// Test validator
$testData = [
    'email' => 'user@example.com',
    'phone' => '13800138000',
    'password' => 'StrongP@ss123',
    'url' => 'https://www.example.com',
    'ip' => '192.168.1.1'
];

foreach ($testData as $type => $value) {
    // PHP method names are case-insensitive, so 'validateIp' resolves to validateIP()
    $method = 'validate' . ucfirst($type);
    if (method_exists('Validator', $method)) {
        $isValid = Validator::$method($value);
        echo "$type ($value): " . ($isValid ? "Valid" : "Invalid") . "\n";
    }
}
?>

Text Processing and Replacement

Search and Replace

php
<?php
$text = "Contact us: Phone 010-12345678, Mobile 138-0013-8000, Email contact@example.com";

// Basic replacement
$result = preg_replace('/\d{3}-\d{4}-\d{4}/', '***-****-****', $text);
echo "Hide phone number: $result\n";

// Replace using callback function
$result = preg_replace_callback('/(\w+)@([\w.-]+)/', function($matches) {
    return $matches[1] . '@***';
}, $text);
echo "Hide email domain: $result\n";

// Advanced replacement using groups
$html = '<img src="image1.jpg" alt="Image 1"><img src="image2.png" alt="Image 2">';
$result = preg_replace('/<img src="([^"]+)" alt="([^"]+)">/', '<figure><img src="$1"><figcaption>$2</figcaption></figure>', $html);
echo "HTML conversion: $result\n";
?>
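
`preg_replace()` also accepts two optional arguments: a limit on the number of replacements and a variable that receives the number of replacements made. The sketch below uses a made-up string with two mobile numbers:

```php
<?php
$text = "Call 138-0013-8000 or 139-0013-9000";
$pattern = '/\d{3}-\d{4}-\d{4}/';

// 4th argument: maximum number of replacements (-1 means no limit).
// 5th argument: filled with the number of replacements actually performed.
$masked = preg_replace($pattern, '***-****-****', $text, -1, $count);
echo "$masked ($count replaced)\n";

// With a limit of 1, only the first match is replaced.
$firstOnly = preg_replace($pattern, '***-****-****', $text, 1);
echo "$firstOnly\n";
```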

Text Splitting and Extraction

php
<?php
// Split string
$text = "Apple,Banana;Orange|Grape Strawberry";
$fruits = preg_split('/[,;|\s]+/', $text);
print_r($fruits);

// Extract log information
$log = "2024-01-15 10:30:45 [ERROR] Database connection failed";
$pattern = '/(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) \[(\w+)\] (.+)/';
if (preg_match($pattern, $log, $matches)) {
    echo "Date: " . $matches[1] . "\n";
    echo "Time: " . $matches[2] . "\n";
    echo "Level: " . $matches[3] . "\n";
    echo "Message: " . $matches[4] . "\n";
}

// Extract all email addresses
$text = "Contact: admin@example.com, support@test.org, info@company.net";
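// Note: inside a character class "|" is a literal pipe, so [A-Za-z] is the correct way to write "any letter"
preg_match_all('/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/', $text, $matches);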
echo "Found email addresses:\n";
foreach ($matches[0] as $email) {
    echo "- $email\n";
}
?>
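
Numeric indexes such as `$matches[3]` become hard to follow as a pattern grows. One possible alternative, sketched here on the same log line, is to use named groups so that each capture is also available under a string key:

```php
<?php
$log = "2024-01-15 10:30:45 [ERROR] Database connection failed";

// (?<name>...) stores the capture under a string key as well as its numeric index.
$pattern = '/(?<date>\d{4}-\d{2}-\d{2}) (?<time>\d{2}:\d{2}:\d{2}) \[(?<level>\w+)\] (?<message>.+)/';

if (preg_match($pattern, $log, $m)) {
    echo "Date: {$m['date']}\n";
    echo "Time: {$m['time']}\n";
    echo "Level: {$m['level']}\n";
    echo "Message: {$m['message']}\n";
}
```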

Advanced Features

Lookahead and Lookbehind Assertions

php
<?php
$text = "password123, admin456, user789, guest000";

// Positive lookahead (?=...)
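// Match the letters of a word only when a digit follows
// (\w also matches digits, so '/\w+(?=\d+)/' would return "password12", "admin45", ...)
preg_match_all('/[a-z]+(?=\d)/', $text, $matches);
echo "Words followed by digits: " . implode(', ', $matches[0]) . "\n";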

// Positive lookbehind (?<=...)
// Match content preceded by specific pattern
$text3 = "Price: $100, Fee: $50, Tax: $10";
preg_match_all('/(?<=\$)\d+/', $text3, $matches);
echo "Price numbers: " . implode(', ', $matches[0]) . "\n";
?>
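
Both assertions also have negative forms, `(?!...)` and `(?<!...)`, which succeed only when the inner pattern does not match. The following sketch uses made-up sample strings:

```php
<?php
// Negative lookahead (?!...): file names whose extension is not "tmp".
$files = 'notes.txt cache.tmp index.php';
preg_match_all('/\b\w+\.(?!tmp\b)\w+/', $files, $kept);
echo "Non-temporary files: " . implode(', ', $kept[0]) . "\n";

// Negative lookbehind (?<!...): numbers that are not part of a dollar amount.
// The class also excludes a preceding digit; otherwise the "00" in "$100" would match.
$text = 'Price: $100, Qty: 3, Fee: $50, Code: 42';
preg_match_all('/(?<![\$\d])\d+/', $text, $plain);
echo "Numbers without a dollar sign: " . implode(', ', $plain[0]) . "\n";
```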

Practical Application: Log Parser

php
<?php
class LogParser {
    private $patterns = [
        'apache' => '/^(\S+) \S+ \S+ \[([\w:\/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)/',
        'custom' => '/^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (\w+): (.+)/'
    ];
    
    public function parseLogLine($line, $type = 'custom') {
        if (!isset($this->patterns[$type])) {
            throw new InvalidArgumentException("Unsupported log type: $type");
        }
        
        $pattern = $this->patterns[$type];
        
        if (preg_match($pattern, trim($line), $matches)) {
            return $this->formatLogEntry($matches, $type);
        }
        
        return null;
    }
    
    private function formatLogEntry($matches, $type) {
        switch ($type) {
            case 'apache':
                return [
                    'ip' => $matches[1],
                    'timestamp' => $matches[2],
                    'method' => $matches[3],
                    'url' => $matches[4],
                    'status' => $matches[6],
                    'size' => $matches[7]
                ];
            case 'custom':
                return [
                    'timestamp' => $matches[1],
                    'level' => $matches[2],
                    'message' => $matches[3]
                ];
            default:
                return $matches;
        }
    }
}

// Usage example
$parser = new LogParser();
$logLine = "[2024-01-15 10:30:45] ERROR: Database connection failed";
$parsed = $parser->parseLogLine($logLine);

if ($parsed) {
    echo "Time: {$parsed['timestamp']}\n";
    echo "Level: {$parsed['level']}\n";
    echo "Message: {$parsed['message']}\n";
}
?>
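
The usage example above only exercises the `custom` format. To see what the `apache` pattern captures, the sketch below applies the same expression to a made-up access-log line. The pattern is repeated so that the snippet runs on its own:

```php
<?php
// Same expression as the 'apache' entry in LogParser, copied for a standalone run.
$apachePattern = '/^(\S+) \S+ \S+ \[([\w:\/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)/';

// Hypothetical sample line in Common Log Format.
$line = '192.0.2.10 - - [15/Jan/2024:10:30:45 +0800] "GET /index.php HTTP/1.1" 200 1532';

if (preg_match($apachePattern, $line, $m)) {
    echo "IP: {$m[1]}\n";
    echo "Timestamp: {$m[2]}\n";
    echo "Request: {$m[3]} {$m[4]} ({$m[5]})\n";
    echo "Status: {$m[6]}, Size: {$m[7]}\n";
}
```

Group 5 holds the protocol (`HTTP/1.1` here), which `formatLogEntry()` does not include in its result. That is why the class jumps from `$matches[4]` to `$matches[6]`.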

Best Practices and Performance Optimization

Error Handling and Safe Usage

php
<?php
// Safe regular expression usage
function safeRegexMatch($pattern, $subject) {
    $result = preg_match($pattern, $subject, $matches);
    
    if ($result === false) {
        $error = preg_last_error();
        $errorMessages = [
            PREG_NO_ERROR => 'No error',
            PREG_INTERNAL_ERROR => 'Internal error',
            PREG_BACKTRACK_LIMIT_ERROR => 'Backtrack limit error',
            PREG_RECURSION_LIMIT_ERROR => 'Recursion limit error',
            PREG_BAD_UTF8_ERROR => 'UTF-8 error',
            PREG_BAD_UTF8_OFFSET_ERROR => 'UTF-8 offset error',
            PREG_JIT_STACKLIMIT_ERROR => 'JIT stack limit error'
        ];
        
        throw new RuntimeException('Regular expression error: ' . ($errorMessages[$error] ?? 'Unknown error'));
    }
    
    return [$result, $matches ?? []];
}

// Validate user input
function validateUserInput($input, $type) {
    $patterns = [
        'username' => '/^[a-zA-Z0-9_]{3,20}$/',
        'email' => '/^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/',
        'phone' => '/^1[3-9]\d{9}$/'
    ];
    
    if (!isset($patterns[$type])) {
        throw new InvalidArgumentException("Unsupported validation type: $type");
    }
    
    return preg_match($patterns[$type], $input) === 1;
}

// Usage example
try {
    list($result, $matches) = safeRegexMatch('/\d+/', 'test123');
    if ($result) {
        echo "Found number: " . $matches[0] . "\n";
    }
} catch (RuntimeException $e) {
    echo "Error: " . $e->getMessage() . "\n";
}
?>
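
On PHP 8.0 and later, `preg_last_error_msg()` returns a readable message directly, which may remove the need for a hand-written lookup table. The sketch below forces a failure with a contrived invalid UTF-8 byte and falls back to the numeric code on older versions:

```php
<?php
// "\xFF" is not valid UTF-8, so matching it with the u modifier fails.
$result = preg_match('/./u', "\xFF");

if ($result === false) {
    // preg_last_error_msg() exists from PHP 8.0; otherwise report the numeric code.
    $message = function_exists('preg_last_error_msg')
        ? preg_last_error_msg()
        : 'PCRE error code ' . preg_last_error();
    echo "Regex failed: $message\n";
}
```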

Performance Optimization Tips

php
<?php
// 1. Use character classes instead of alternation
// Slow: (a|b|c|d|e)
// Fast: [a-e]

// 2. Avoid unnecessary backtracking
function optimizedEmailValidation($email) {
    // Use possessive quantifiers (++ and *+) so the engine never backtracks into these parts
    $pattern = '/^[a-zA-Z0-9]++(?:\.[a-zA-Z0-9]++)*+@[a-zA-Z0-9]++(?:\.[a-zA-Z0-9]++)*+$/';
    return preg_match($pattern, $email);
}

// 3. Handle UTF-8 characters
$text = "中文测试123";

// Wrong: u modifier not specified
$wrong = '/\w+/';

// Correct: Use u modifier for Unicode support
$correct = '/\w+/u';

preg_match_all($wrong, $text, $matches1);
preg_match_all($correct, $text, $matches2);

echo "Without u modifier: " . implode(', ', $matches1[0]) . "\n";
echo "With u modifier: " . implode(', ', $matches2[0]) . "\n";
?>

Common Errors and Solutions

Escaping Character Issues

php
<?php
// Wrong: Not properly escaped
$wrong = '/\d+.\d+/'; // . in regex means any character

// Correct: Properly escaped
$correct = '/\d+\.\d+/'; // \. means literal dot

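// "3.14" matches both patterns; "3x14" shows the difference
foreach (["3.14", "3x14"] as $number) {
    echo "$number, wrong pattern: " . (preg_match($wrong, $number) ? "Match" : "No match") . "\n";
    echo "$number, correct pattern: " . (preg_match($correct, $number) ? "Match" : "No match") . "\n";
}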
?>
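
When part of a pattern comes from a variable, escaping by hand is error-prone. `preg_quote()` escapes every regex metacharacter in a string. The search term in this sketch is a hypothetical user input:

```php
<?php
// Hypothetical user-supplied search term containing regex metacharacters.
$term = '3.14 (approx)';

// The second argument makes preg_quote() escape the pattern delimiter as well.
$pattern = '/' . preg_quote($term, '/') . '/';
echo "Pattern: $pattern\n";

$literalHit = preg_match($pattern, 'pi is 3.14 (approx)');
$nearMiss   = preg_match($pattern, 'pi is 3x14 (approx)');

echo "Literal text: " . ($literalHit ? "Match" : "No match") . "\n";
echo "Similar text: " . ($nearMiss ? "Match" : "No match") . "\n";
```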

Greedy Matching Issues

php
<?php
// Problem: Greedy matching causes unexpected results
$html = '<div>content1</div><div>content2</div>';
$greedy = '/<div>.*<\/div>/';
$nonGreedy = '/<div>.*?<\/div>/';

preg_match($greedy, $html, $matches1);
preg_match($nonGreedy, $html, $matches2);

echo "Greedy match: " . $matches1[0] . "\n";
echo "Non-greedy match: " . $matches2[0] . "\n";
?>
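
A lazy quantifier is not the only fix. When the content cannot contain the closing delimiter, a negated character class gives the same result and leaves the engine nothing to backtrack over. A minimal sketch, assuming the `<div>` elements contain plain text only:

```php
<?php
$html = '<div>content1</div><div>content2</div>';

// [^<]* cannot run past the next "<", so each match stops at its own closing tag.
preg_match_all('/<div>([^<]*)<\/div>/', $html, $m);
echo "Contents: " . implode(', ', $m[1]) . "\n";
```

This approach breaks down as soon as tags are nested, so it suits simple, well-known input rather than arbitrary HTML.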

Summary

This chapter introduced the use of regular expressions in PHP:

  • Basic Syntax: Metacharacters, character classes, quantifiers
  • Advanced Features: Groups, assertions, lookahead/lookbehind
  • Practical Applications: Data validation, text processing, log parsing
  • Performance Optimization: Avoid backtracking, use appropriate patterns
  • Error Handling: Safe usage, exception handling

Mastering regular expressions can greatly improve the efficiency and accuracy of text processing. In the next chapter, we will learn about PHP's standard library and built-in functions.
