Ruby Regular Expressions
Regular expressions (regex or regexp) are powerful tools for text processing. In Ruby, they are widely used for string validation, text search, data extraction, and formatting. Ruby supports regular expressions natively through literal syntax and the Regexp class, which makes it well suited to handling text data. This chapter introduces regular expression syntax, methods, and practical applications in Ruby.
🎯 Regular Expression Basics
What is a Regular Expression
A regular expression is a special text string used to describe search patterns. It consists of ordinary characters and special characters (metacharacters) that can be used to match, find, and replace specific patterns in text.
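As a quick orientation, the sketch below applies one pattern to the three tasks named above: matching, finding, and replacing. The sample string and variable names are made up for illustration.

```ruby
# One pattern, three tasks: match, find, replace.
date_pattern = /\d{4}-\d{2}-\d{2}/
text = "Order shipped on 2023-12-25"

matched = text.match?(date_pattern)          # does the pattern occur anywhere?
found = text[date_pattern]                   # first matching substring, or nil
replaced = text.sub(date_pattern, "<date>")  # replace the first occurrence

puts matched   # true
puts found     # 2023-12-25
puts replaced  # Order shipped on <date>
```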
ruby
# Basic regular expression matching
# Check if string contains numbers
text = "My phone number is 13812345678"
if text.match?(/\d+/)
puts "Text contains numbers"
end
# Find matched content
match = text.match(/\d+/)
if match
puts "Found number: #{match[0]}" # 13812345678
end
# Using string methods
puts "hello".match?(/h/) # true
puts "hello".match?(/H/) # false
puts "hello".match?(/H/i) # true (case insensitive)
Regular Expression Literals
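Ruby offers three ways to build a Regexp: slash literals, %r literals, and Regexp.new. The sketch below, with made-up sample strings, suggests that the three forms carry the same pattern source and shows what the m and x options change in practice.

```ruby
# Three spellings of the same pattern.
slash_form = /\d+/
percent_form = %r{\d+}
built_form = Regexp.new('\d+')

same_literals = (slash_form == percent_form)
same_source = (slash_form.source == built_form.source)

# m option: the dot also matches a newline.
dot_plain = "a\nb".match?(/a.b/)
dot_multiline = "a\nb".match?(/a.b/m)

# x option: literal whitespace in the pattern is ignored,
# so a space must be written as \s (or escaped) to be matched.
spaced_ok = "hello world".match?(/hello \s world/x)
spaced_lost = "hello world".match?(/hello world/x)

puts same_literals   # true
puts same_source     # true
puts dot_plain       # false
puts dot_multiline   # true
puts spaced_ok       # true
puts spaced_lost     # false
```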
ruby
# Using slashes to define regular expressions
regex1 = /hello/
regex2 = /\d+/ # Match one or more digits
regex3 = /[a-z]+/ # Match one or more lowercase letters
regex4 = /hello/i # Case insensitive
# Using %r to define regular expressions (avoid escaping slashes)
url_pattern = %r{https?://[\w\-\.]+\.[a-zA-Z]{2,}(/\S*)?}
email_pattern = %r{[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+}
# Using Regexp.new to create
pattern = Regexp.new('\d+', Regexp::IGNORECASE)
# Regular expression options
# i - Case insensitive
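# m - Multiline mode (the dot also matches newline)
# x - Extended mode (ignores whitespace and comments in the pattern)
regex_with_options = /hello world/imx # with x the space is ignored, so this matches "helloworld"
🔤 Basic Metacharacters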
Character Matching
ruby
# Ordinary characters
/hello/.match("hello world") # Match
# Dot - matches any character (except newline)
/a.c/.match("abc") # Match
/a.c/.match("a c") # Match
/a.c/.match("axc") # Match
/a.c/.match("ac") # No match
# Escape characters
/\./.match("a.b") # Match dot
/\*/.match("a*b") # Match asterisk
/\?/.match("a?b") # Match question mark
# Character class
/[abc]/.match("a") # Match
/[abc]/.match("b") # Match
/[abc]/.match("d") # No match
# Range
/[a-z]/.match("m") # Match lowercase letter
/[A-Z]/.match("M") # Match uppercase letter
/[0-9]/.match("5") # Match digit
/[a-zA-Z0-9]/.match("K") # Match letter or digit
# Predefined character classes
/\d/.match("5") # Match digit [0-9]
/\D/.match("a") # Match non-digit [^0-9]
/\w/.match("a") # Match word character [a-zA-Z0-9_]
/\W/.match("@") # Match non-word character [^a-zA-Z0-9_]
/\s/.match(" ") # Match whitespace character [ \t\r\n\f\v]
/\S/.match("a") # Match non-whitespace character
Quantifiers
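A quantifier decides how much text a match consumes, which is easiest to see by printing the matched substring instead of only checking for success. A small sketch with made-up inputs:

```ruby
# String#[] with a regex returns the matched substring (or nil).
star = "baaa"[/a*/]         # zero a's match at the very start, before "b"
plus = "baaa"[/a+/]         # skips ahead to the first run of a's
greedy = "aaaaa"[/a{2,4}/]  # takes as many as allowed
lazy = "aaaaa"[/a{2,4}?/]   # takes as few as allowed
optional = "color colour".scan(/colou?r/)

puts star.inspect      # ""
puts plus.inspect      # "aaa"
puts greedy.inspect    # "aaaa"
puts lazy.inspect      # "aa"
puts optional.inspect  # ["color", "colour"]
```

The empty result for `a*` is a common surprise: a pattern that can match nothing succeeds at the first position it is tried.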
ruby
# Basic quantifiers
/a*/.match("") # Match zero or more a
/a*/.match("a") # Match
/a*/.match("aaa") # Match
/a+/.match("a") # Match one or more a
/a+/.match("aaa") # Match
/a+/.match("") # No match
/a?/.match("") # Match zero or one a
/a?/.match("a") # Match
/a?/.match("aa") # Matches only first a
# Exact quantifiers
/a{3}/.match("aaa") # Match exactly 3 a
/a{2,4}/.match("aa") # Match 2 to 4 a
/a{2,}/.match("aaaa") # Match 2 or more a
# Greedy vs non-greedy matching
/<.+>/.match("<b>bold</b>") # Greedy match: <b>bold</b>
/<.+?>/.match("<b>bold</b>") # Non-greedy match: <b>
# Practical application example
text = "Mobile: 13812345678, Landline: 010-12345678"
# Extract mobile number
phone_match = text.match(/1[3-9]\d{9}/)
puts phone_match[0] if phone_match # 13812345678
# Extract landline number
landline_match = text.match(/0\d{2,3}-?\d{7,8}/)
puts landline_match[0] if landline_match # 010-12345678
🏗️ Advanced Regular Expressions
Grouping and Capturing
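Besides indexing a match by group number or name, the MatchData object exposes the captures as collections and the text around the match. The sketch below uses a made-up sentence; it also shows that a regex literal with named groups on the left of `=~` assigns the groups to local variables.

```ruby
date_pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
m = date_pattern.match("Due 2023-12-25 at noon")

captures = m.captures        # all groups as an array
named = m.named_captures     # hash of group name => captured text
before = m.pre_match         # text before the match
after = m.post_match         # text after the match

p captures  # ["2023", "12", "25"]
p named
p before    # "Due "
p after     # " at noon"

# With a regex literal on the left-hand side, named groups become locals.
clock = nil
if /(?<hour>\d{2}):(?<minute>\d{2})/ =~ "Meeting at 14:30"
  clock = "#{hour}h #{minute}m"
end
puts clock  # 14h 30m
```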
ruby
# Basic grouping
/(ab)+/.match("ababab") # Match repeated ab
# Use grouping for capture
match = /(\d{4})-(\d{2})-(\d{2})/.match("2023-12-25")
if match
puts "Year: #{match[1]}" # 2023
puts "Month: #{match[2]}" # 12
puts "Day: #{match[3]}" # 25
puts "Full match: #{match[0]}" # 2023-12-25
end
# Named groups
match = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/.match("2023-12-25")
if match
puts "Year: #{match[:year]}" # 2023
puts "Month: #{match[:month]}" # 12
puts "Day: #{match[:day]}" # 25
end
# Non-capturing grouping
/(?:abc)+/.match("abcabc") # Match but don't capture
# Group reference (backreference)
/(.)\1/.match("aa") # Match repeated character
/(..)\1/.match("abab") # Match repeated two characters
# Practical application: Parse URL
url = "https://www.example.com:8080/path/to/page?param1=value1&param2=value2#section"
url_pattern = %r{(?<protocol>https?)://(?<host>[\w\-\.]+)(?::(?<port>\d+))?(?<path>/[^?#]*)?(?:\?(?<query>[^#]*))?(?:#(?<fragment>.*))?}
match = url_pattern.match(url)
if match
puts "Protocol: #{match[:protocol]}" # https
puts "Host: #{match[:host]}" # www.example.com
puts "Port: #{match[:port]}" # 8080
puts "Path: #{match[:path]}" # /path/to/page
puts "Query: #{match[:query]}" # param1=value1&param2=value2
puts "Fragment: #{match[:fragment]}" # section
end
Anchors and Boundaries
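In Ruby, ^ and $ anchor to line boundaries, while \A and \z anchor to the whole string. That difference matters for validation, as the sketch below illustrates with a made-up input that contains a newline.

```ruby
input = "admin\n<script>"

# ^ and $ are satisfied by the first line alone.
line_anchored = input.match?(/^[a-z]+$/)
# \A and \z require the entire string to match.
string_anchored = input.match?(/\A[a-z]+\z/)

# \Z (capital) tolerates one trailing newline; \z does not.
capital_z = "admin\n".match?(/\A[a-z]+\Z/)
small_z = "admin\n".match?(/\A[a-z]+\z/)

puts line_anchored    # true
puts string_anchored  # false
puts capital_z        # true
puts small_z          # false
```

For input validation, \A and \z are usually the safer choice because a multi-line value cannot slip past them.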
ruby
# Line start and end anchors
/^hello/.match("hello world") # Match at line start
/world$/.match("hello world") # Match at line end
# Word boundaries
/\bword\b/.match("word") # Match whole word
/\bword\b/.match("wording") # No match
/\bword/.match("wording") # Match word start
/word\b/.match("reword") # Match word end
# String boundaries
/\Ahello/.match("hello world") # Match string start
/world\z/.match("hello world") # Match string end
# Practical application: Validate input
def valid_username?(username)
  # Username can only contain letters, numbers, underscores, length 3-16
  username.match?(/\A[a-zA-Z0-9_]{3,16}\z/)
end
def valid_email?(email)
  # Simple email validation
  email.match?(/\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i)
end
puts valid_username?("user_123") # true
puts valid_username?("us") # false
puts valid_email?("user@example.com") # true
puts valid_email?("invalid.email") # false
Alternation and Conditions
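Lookahead and lookbehind assertions check the text next to a match without consuming it. The sketch below runs the four variants on made-up strings; the word boundaries keep the negative forms from matching part of a number.

```ruby
prices = "100 USD, 200 EUR"
# Positive lookahead: digits followed by " USD"
usd = prices.scan(/\d+(?= USD)/)
# Negative lookahead: whole numbers not followed by " USD"
not_usd = prices.scan(/\b\d+\b(?! USD)/)

text = "cost $42 or 17"
# Positive lookbehind: digits preceded by "$"
after_dollar = text.scan(/(?<=\$)\d+/)
# Negative lookbehind: whole numbers not preceded by "$"
no_dollar = text.scan(/(?<!\$)\b\d+/)

p usd           # ["100"]
p not_usd       # ["200"]
p after_dollar  # ["42"]
p no_dollar     # ["17"]
```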
ruby
# OR operator
/(cat|dog)/.match("I have a cat") # Match cat
/(cat|dog)/.match("I have a dog") # Match dog
# Complex alternation
/(https?|ftp):\/\//.match("https://example.com") # Match protocol
/(https?|ftp):\/\//.match("ftp://example.com") # Match protocol
# Lookbehind assertions
# Positive lookbehind: match only if the preceding text matches the assertion
/(?<=\d{3})\d{4}/.match("13812345678") # Match 1234 (preceded by 3 digits)
# Negative lookbehind: match only if the preceding text does not match the assertion
/(?<!\d{3})\d{4}/.match("12345678") # Match 1234 (not preceded by 3 digits)
# Practical application: Extract multiple date formats
text = "Today is 2023-12-25, tomorrow is 2023/12/26, the day after is Dec 27, 2023"
date_patterns = [
  /(\d{4})-(\d{2})-(\d{2})/, # 2023-12-25
  /(\d{4})\/(\d{2})\/(\d{2})/, # 2023/12/26
  /(\w{3})\s+(\d{1,2}),\s+(\d{4})/ # Dec 27, 2023
]
# The union has nine groups; compact drops the nil groups of the alternatives that did not match
text.scan(Regexp.union(date_patterns)) do |match|
  puts "Found date: #{match.compact}"
end
🔧 Regular Expression Methods
Regular Expression Methods in Strings
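Two behaviors of these string methods are easy to trip over: scan returns capture groups instead of whole matches as soon as the pattern contains groups, and gsub accepts a block or a hash in place of a replacement string. A short sketch with made-up data:

```ruby
text = "a@x.com, b@y.org"

# No groups: scan returns the full matches.
whole = text.scan(/\w+@\w+\.\w+/)
# With groups: scan returns one array of captures per match.
parts = text.scan(/(\w+)@(\w+\.\w+)/)

# gsub with a block: the block's return value replaces each match.
shouted = text.gsub(/\w+@/) { |m| m.upcase }
# gsub with a hash: each match is looked up as a key.
renamed = "cat dog".gsub(/cat|dog/, "cat" => "feline", "dog" => "canine")

p whole    # ["a@x.com", "b@y.org"]
p parts    # [["a", "x.com"], ["b", "y.org"]]
p shouted  # "A@x.com, B@y.org"
p renamed  # "feline canine"
```

When whole matches are wanted from a pattern that needs grouping, non-capturing groups `(?:...)` keep scan returning strings.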
ruby
text = "Contact: phone 13812345678, email user@example.com"
# match method - returns a MatchData object (or nil)
match = text.match(/1[3-9]\d{9}/)
if match
  puts "Phone: #{match[0]}"
end
# match? method - returns boolean
puts text.match?(/1[3-9]\d{9}/) # true
# scan method - find all matches
price_text = "Prices: 100 yuan, discount: 80 yuan, shipping: 10 yuan"
prices = price_text.scan(/\d+/)
puts prices.inspect # ["100", "80", "10"]
# gsub method - global replace
masked_text = text.gsub(/1[3-9]\d{9}/, "****")
puts masked_text # Contact: phone ****, email user@example.com
# sub method - replace first match
first_replaced = price_text.sub(/\d+/, "***")
puts first_replaced # Prices: *** yuan, discount: 80 yuan, shipping: 10 yuan
# split method - split using regular expression
data = "apple,banana;orange:grape"
fruits = data.split(/[,;:]/)
puts fruits.inspect # ["apple", "banana", "orange", "grape"]
# Practical application: Data cleaning
def clean_phone_numbers(text)
  # Insert hyphens into 11-digit phone numbers
  text.gsub(/(\d{3})(\d{4})(\d{4})/) do
    "#{$1}-#{$2}-#{$3}"
  end
end
text = "Contact: 13812345678, alternate: 13987654321"
puts clean_phone_numbers(text)
# Contact: 138-1234-5678, alternate: 139-8765-4321
Regexp Class Methods
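Regexp.union also accepts plain strings and escapes them, which makes it a convenient way to search for literal terms that contain metacharacters. The sketch below assumes a made-up list of search terms.

```ruby
# Literal search terms that contain regex metacharacters.
terms = ["C++", "a.b"]

# Strings passed to Regexp.union are escaped before being joined with |.
pattern = Regexp.union(terms)
found = "I like C++ and a.b but not axb".scan(pattern)

puts pattern.source  # C\+\+|a\.b
p found              # ["C++", "a.b"]
```

Because the dot is escaped, "axb" is not reported, which would not be the case if the terms were interpolated into a pattern unescaped.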
ruby
# Compile regular expression
pattern = Regexp.new('\d+', Regexp::IGNORECASE)
puts pattern.match?("123abc") # true
# Escape special characters
special_chars = "hello*world?.+"
escaped = Regexp.escape(special_chars)
puts escaped # hello\*world\?\.\+
# Union multiple regular expressions
patterns = [/cat/, /dog/, /bird/]
union_pattern = Regexp.union(patterns)
puts "I have a cat".match?(union_pattern) # true
# Get regular expression options
regex = /hello/i
puts regex.options # 1 (corresponds to Regexp::IGNORECASE)
# Check if regular expressions are equal
regex1 = /hello/
regex2 = /hello/
puts regex1 == regex2 # true
🎯 Practical Regular Expression Examples
Data Validation
ruby
class DataValidator
# Email validation
EMAIL_PATTERN = /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i
# Phone number validation
PHONE_PATTERN = /\A1[3-9]\d{9}\z/
# ID card validation
ID_CARD_PATTERN = /\A\d{17}[\dXx]\z/
# Postal code validation
POSTAL_CODE_PATTERN = /\A\d{6}\z/
# URL validation (%r{} avoids having to escape the slashes)
URL_PATTERN = %r{\Ahttps?://[\w\-\.]+\.[a-zA-Z]{2,}(/\S*)?\z}
# Password strength validation
PASSWORD_PATTERN = /\A(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*(),.?":{}|<>]).{8,}\z/
def self.valid_email?(email)
email.match?(EMAIL_PATTERN)
end
def self.valid_phone?(phone)
phone.match?(PHONE_PATTERN)
end
def self.valid_id_card?(id_card)
id_card.match?(ID_CARD_PATTERN)
end
def self.valid_postal_code?(code)
code.match?(POSTAL_CODE_PATTERN)
end
def self.valid_url?(url)
url.match?(URL_PATTERN)
end
def self.strong_password?(password)
password.match?(PASSWORD_PATTERN)
end
end
# Using data validator
puts DataValidator.valid_email?("user@example.com") # true
puts DataValidator.valid_phone?("13812345678") # true
puts DataValidator.valid_id_card?("110101199001011234") # true
puts DataValidator.strong_password?("Password123!") # true
Text Processing and Extraction
ruby
class TextProcessor
  # Extract all email addresses
  def self.extract_emails(text)
    # Non-capturing group, so scan returns whole matches rather than group captures
    email_pattern = /[\w+\-.]+@[a-z\d\-]+(?:\.[a-z\d\-]+)*\.[a-z]+/i
    text.scan(email_pattern)
  end
  # Extract all phone numbers
  def self.extract_phones(text)
    phone_pattern = /1[3-9]\d{9}|\d{3,4}-?\d{7,8}/
    text.scan(phone_pattern)
  end
  # Extract all URLs
  def self.extract_urls(text)
    # Non-capturing group, so scan returns whole URLs rather than only the path capture
    url_pattern = %r{https?://[\w\-\.]+\.[a-zA-Z]{2,}(?:/\S*)?}
    text.scan(url_pattern)
  end
  # Strip HTML tags, leaving only the text content
  def self.extract_html_content(html)
    html.gsub(/<[^>]*>/, "")
  end
  # Clean excess whitespace
  def self.clean_whitespace(text)
    text.gsub(/\s+/, " ").strip
  end
  # Format phone numbers
  def self.format_phone(phone)
    phone.gsub(/(\d{3})(\d{4})(\d{4})/, '\1-\2-\3')
  end
  # Extract quoted text
  def self.extract_quoted_text(text)
    text.scan(/["']([^"']*)["']/).flatten
  end
end
# Using text processor
text = <<~TEXT
  Contact us:
  Email: contact@example.com, support@company.org
  Phone: 13812345678, 010-12345678
  Website: https://www.example.com
  "This is a quoted text"
  'This is another quoted text'
TEXT
puts "Email addresses:"
TextProcessor.extract_emails(text).each { |email| puts " #{email}" }
puts "Phone numbers:"
TextProcessor.extract_phones(text).each { |phone| puts " #{phone}" }
puts "URLs:"
TextProcessor.extract_urls(text).each { |url| puts " #{url}" }
puts "Formatted phone: #{TextProcessor.format_phone('13812345678')}"
puts "Cleaned whitespace: '#{TextProcessor.clean_whitespace("  Multiple   whitespace   characters  ")}'"
Log Analysis
ruby
class LogAnalyzer
# Apache access log pattern
APACHE_LOG_PATTERN = /
^(\S+)\s+ # IP address
(\S+)\s+ # Identifier
(\S+)\s+ # User ID
\[([^\]]+)\]\s+ # Timestamp
"(\S+)\s+(\S+)\s+(\S+)"\s+ # Request line
(\d{3})\s+ # Status code
(\d+|-)\s+ # Response size
"(.*?)"\s+ # Referrer
"(.*?)" # User agent
/x
# Error log pattern
ERROR_LOG_PATTERN = /
^\[([^\]]+)\]\s+ # Timestamp
\[([^\]]+)\]\s+ # Log level
\[([^\]]+)\]\s+ # Process ID
(.*) # Error message
/x
def self.parse_apache_log(line)
match = line.match(APACHE_LOG_PATTERN)
return nil unless match
{
ip: match[1],
timestamp: match[4],
method: match[5],
path: match[6],
protocol: match[7],
status: match[8].to_i,
size: match[9] == '-' ? 0 : match[9].to_i,
referrer: match[10],
user_agent: match[11]
}
end
def self.parse_error_log(line)
match = line.match(ERROR_LOG_PATTERN)
return nil unless match
{
timestamp: match[1],
level: match[2],
pid: match[3],
message: match[4]
}
end
# Count visits per IP address
def self.count_visits(log_lines)
visits = Hash.new(0)
log_lines.each do |line|
record = parse_apache_log(line)
next unless record
visits[record[:ip]] += 1
end
visits
end
# Count status codes
def self.count_status_codes(log_lines)
status_codes = Hash.new(0)
log_lines.each do |line|
record = parse_apache_log(line)
next unless record
status_codes[record[:status]] += 1
end
status_codes
end
end
# Using log analyzer
apache_log_line = '127.0.0.1 - - [25/Dec/2023:14:30:45 +0800] "GET /index.html HTTP/1.1" 200 1234 "http://example.com" "Mozilla/5.0"'
error_log_line = '[2023-12-25 14:30:45] [ERROR] [12345] Database connection failed'
apache_record = LogAnalyzer.parse_apache_log(apache_log_line)
error_record = LogAnalyzer.parse_error_log(error_log_line)
puts "Apache log record:"
puts apache_record.inspect if apache_record
puts "Error log record:"
puts error_record.inspect if error_record
📊 Performance Optimization
Regular Expression Performance Considerations
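Possessive quantifiers and atomic groups limit backtracking by never giving back what they have consumed, which also changes what a pattern is able to match. A small sketch with made-up inputs shows the difference:

```ruby
# Greedy: a+ gives back one "a" so the trailing "a" can match.
greedy = "aaa".match?(/a+a/)
# Possessive: a++ keeps all three, leaving nothing for the trailing "a".
possessive = "aaa".match?(/a++a/)
# An atomic group behaves the same way as the possessive quantifier here.
atomic = "aaa".match?(/(?>a+)a/)
# When no backtracking is needed, the possessive form matches as usual.
possessive_ok = "aaab".match?(/a++b/)

puts greedy         # true
puts possessive     # false
puts atomic         # false
puts possessive_ok  # true
```

Swapping a greedy quantifier for a possessive one is therefore only safe when the rest of the pattern never relies on the quantifier giving characters back.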
ruby
# Avoid catastrophic backtracking
# Bad pattern - nested quantifiers can backtrack exponentially when the match fails
bad_pattern = /(a+)+b/
# Better pattern - a possessive quantifier (or an atomic group) never gives characters back
good_pattern = /a++b/
# Pre-compile commonly used regular expressions
class RegexCache
  @@cache = {}
  def self.get(pattern)
    @@cache[pattern] ||= Regexp.new(pattern)
  end
end
# Using cached regular expressions
def find_emails_cached(text)
  # Non-capturing group, so scan returns whole matches
  email_pattern = RegexCache.get('[\w+\-.]+@[a-z\d\-]+(?:\.[a-z\d\-]+)*\.[a-z]+')
  text.scan(email_pattern)
end
# Avoid unnecessary captures
# Use non-capturing groups when capture is not needed
non_capturing = /(?:abc)+/
capturing = /(abc)+/
# Use anchors to improve matching efficiency
# Good practice
/^start.*end$/.match("start something end")
# Limit backtracking in large text
# Use possessive quantifiers or more precise patterns
text = "a" * 1000 + "b"
# Greedy: may backtrack through the run of a's when the rest of the pattern fails
slow_pattern = /a+b/
# Possessive: the run of a's is never re-examined
fast_pattern = /a++b/
# Batch processing optimization
def process_large_text(text)
  # Process large text in chunks (a word that straddles a chunk boundary is split in two)
  chunk_size = 10000
  results = []
  0.step(text.length, chunk_size) do |i|
    chunk = text[i, chunk_size]
    results.concat(chunk.scan(/\b\w{4,}\b/))
  end
  results
end
Regular Expression Debugging
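When timing a pattern, a monotonic clock avoids distortions from system clock adjustments. The helper below is a hypothetical sketch (the name `time_regex` and its `iterations` parameter are made up for illustration), built on `Process.clock_gettime`.

```ruby
# Hypothetical helper: time repeated matching with a monotonic clock.
def time_regex(regex, text, iterations: 10_000)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { text.match?(regex) }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

elapsed = time_regex(/\d+/, "abc123def456ghi789")
puts format("10,000 matches took %.4f seconds", elapsed)
```

Absolute numbers vary by machine and Ruby version, so such timings are mainly useful for comparing two patterns on the same input.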
ruby
# Debug regular expression
def debug_regex(pattern, text)
regex = Regexp.new(pattern)
match = text.match(regex)
if match
puts "Match successful!"
puts "Full match: #{match[0]}"
puts "Match position: #{match.begin(0)}-#{match.end(0)}"
match.captures.each_with_index do |capture, index|
puts "Group #{index + 1}: #{capture} (position: #{match.begin(index + 1)}-#{match.end(index + 1)})"
end
else
puts "Match failed"
end
end
# Using debug function
debug_regex(/(\d{4})-(\d{2})-(\d{2})/, "Today is 2023-12-25")
# Performance testing
def benchmark_regex(pattern, text, iterations = 1000)
regex = Regexp.new(pattern)
start_time = Time.now
iterations.times do
text.match?(regex)
end
end_time = Time.now
puts "Executed #{iterations} times in #{end_time - start_time} seconds"
end
# benchmark_regex('\d+', 'abc123def456ghi789')
🛡️ Regular Expression Best Practices
1. Writing Maintainable Regular Expressions
ruby
# Use extended mode for readability
complex_pattern = /
  ^                        # Line start
  (?<protocol>https?)      # Protocol group
  :\/\/                    # Separator
  (?<host>                 # Host group
    [\w\-\.]+              # Domain characters
    \.                     # Dot
    [a-zA-Z]{2,}           # Top-level domain
  )
  (?::(?<port>\d+))?       # Optional port group
  (?<path>\/[^?#]*)?       # Optional path group
  (?:\?(?<query>[^#]*))?   # Optional query group
  (?:\#(?<fragment>.*))?   # Optional fragment group
  $                        # Line end
/x
# Break down complex regular expressions
class EmailValidator
LOCAL_PART = '[a-zA-Z0-9.!#$%&\'*+\/=?^_`{|}~-]+'
DOMAIN_PART = '[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?'
DOMAIN = "#{DOMAIN_PART}(?:\\.#{DOMAIN_PART})*"
TLD = '[a-zA-Z]{2,}'
EMAIL_PATTERN = /\A#{LOCAL_PART}@#{DOMAIN}\.#{TLD}\z/
def self.valid?(email)
email.match?(EMAIL_PATTERN)
end
end
# Use named constants
PHONE_PATTERN = /\A1[3-9]\d{9}\z/
ID_CARD_PATTERN = /\A\d{17}[\dXx]\z/
def valid_phone?(phone)
phone.match?(PHONE_PATTERN)
end
def valid_id_card?(id_card)
id_card.match?(ID_CARD_PATTERN)
end
2. Security Considerations
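On Ruby 3.2 and newer, a time limit can also be attached to a pattern directly through the `timeout:` keyword of Regexp.new, with Regexp::TimeoutError raised when the limit is exceeded. The helper below is a hypothetical sketch (the name `bounded_match?` and its return symbols are made up for illustration); it falls back to `:unsupported` on older Ruby versions instead of attempting the match.

```ruby
# Hypothetical helper: match with a per-pattern time limit (Ruby 3.2+).
def bounded_match?(source, string, seconds: 0.5)
  # Older Ruby versions have no Regexp.timeout support.
  return :unsupported unless Regexp.respond_to?(:timeout=)

  begin
    Regexp.new(source, timeout: seconds).match?(string)
  rescue Regexp::TimeoutError
    :timeout
  end
end

result = bounded_match?('\A\d+\z', "12345")
puts result.inspect
```

A process-wide default can be set with `Regexp.timeout = seconds` on the same Ruby versions; which limit is appropriate depends on the application.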
ruby
# Prevent Regular Expression Denial of Service (ReDoS)
class SafeRegex
# Bound matching time by running the match in a separate thread
# (Ruby 3.2+ also provides a built-in Regexp.timeout setting)
def self.safe_match?(pattern, string, timeout: 1)
regex = Regexp.new(pattern)
# Execute matching in new thread
thread = Thread.new { string.match?(regex) }
result = thread.join(timeout)
if result
thread.value
else
thread.kill
raise "Regular expression matching timeout"
end
end
# Validate user-provided regular expression
def self.validate_user_regex(pattern)
begin
Regexp.new(pattern)
true
rescue RegexpError
false
end
end
# Sanitize user input
def self.sanitize_input(input)
# Remove potentially dangerous characters
input.gsub(/[<>'"&]/, '')
end
end
# Using safe regular expression
begin
result = SafeRegex.safe_match?('\d+', '12345')
puts "Match result: #{result}"
rescue => e
puts "Error: #{e.message}"
end
3. Testing Regular Expressions
ruby
# Regular expression test framework
class RegexTester
def initialize(pattern)
@pattern = pattern
@regex = Regexp.new(pattern)
end
def test_positive(*strings)
strings.each do |string|
unless string.match?(@regex)
puts "Failed: '#{string}' should match pattern '#{@pattern}'"
return false
end
end
true
end
def test_negative(*strings)
strings.each do |string|
if string.match?(@regex)
puts "Failed: '#{string}' should not match pattern '#{@pattern}'"
return false
end
end
true
end
def test_capture(string, expected_captures)
match = string.match(@regex)
if match
actual_captures = match.captures
if actual_captures == expected_captures
true
else
puts "Capture failed: expected #{expected_captures}, actual #{actual_captures}"
false
end
else
puts "Match failed: '#{string}' did not match"
false
end
end
end
# Using test framework
email_tester = RegexTester.new('[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+')
# Test positive cases
email_tester.test_positive(
'user@example.com',
'test.email@domain.org',
'user123@test-domain.co.uk'
)
# Test negative cases
email_tester.test_negative(
'invalid.email',
'@example.com',
'user@',
'user@.com'
)
# Test capture groups
date_tester = RegexTester.new('(\d{4})-(\d{2})-(\d{2})')
date_tester.test_capture('2023-12-25', ['2023', '12', '25'])
4. Real Application Scenarios
ruby
# Configuration file parser
class ConfigParser
SECTION_PATTERN = /^\[(.+)\]$/
KEY_VALUE_PATTERN = /^([^=]+)=(.*)$/
COMMENT_PATTERN = /^\s*[#;]/
def self.parse(config_text)
config = {}
current_section = 'default'
config_text.each_line do |line|
line = line.strip
next if line.empty? || line.match?(COMMENT_PATTERN)
if section_match = line.match(SECTION_PATTERN)
current_section = section_match[1]
config[current_section] = {}
elsif kv_match = line.match(KEY_VALUE_PATTERN)
key = kv_match[1].strip
value = kv_match[2].strip
config[current_section] ||= {}
config[current_section][key] = value
end
end
config
end
end
# Using configuration parser
config_text = <<~CONFIG
# Application configuration
[database]
host = localhost
port = 5432
username = admin
[logging]
level = info
file = app.log
CONFIG
config = ConfigParser.parse(config_text)
puts config.inspect
# CSV parser
class CSVParser
CSV_LINE_PATTERN = /(?<=^|,)(?:"([^"]*)"|([^",]*))(?=,|$)/
def self.parse_line(line)
matches = line.scan(CSV_LINE_PATTERN)
matches.map { |match| match[0] || match[1] || '' }
end
def self.parse(csv_text)
csv_text.each_line.map { |line| parse_line(line.chomp) }
end
end
# Using CSV parser
csv_text = <<~CSV
"Name","Age","City"
"Alice","25","New York"
"Bob","30","Los Angeles"
CSV
data = CSVParser.parse(csv_text)
data.each { |row| puts row.inspect }
📚 Next Steps
After mastering Ruby regular expressions, continue learning:
- Ruby Database Access - Learn database operations
- Ruby Network Programming - Learn Socket programming
- Ruby JSON - Process JSON data
- Ruby Web Services - Learn web service development
Continue your Ruby learning journey!