Ruby Chinese Encoding
Handling character encoding correctly is essential when processing Chinese text. This chapter explains how to work with character encodings in Ruby so that you can avoid garbled Chinese text (mojibake).
📚 Character Encoding Basics
What is Character Encoding?
Character encoding is a system that maps characters to numbers, allowing computers to store and process text data. Common encoding formats include:
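- ASCII: American Standard Code for Information Interchange, a 7-bit encoding that covers only English letters, digits, and basic symbols
- UTF-8: a variable-length encoding of Unicode that can represent characters from every language, including Chinese
- GBK: a Chinese encoding standard that extends GB2312 with additional characters, used mainly in mainland China
- GB2312: the earlier simplified Chinese character set on which GBK is based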
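To make the difference between these encodings concrete, the following minimal sketch encodes the same two-character string in several of them and reports the byte count. The helper name `byte_sizes` is illustrative and not part of Ruby.

```ruby
# Hypothetical helper: byte length of the same text in several encodings.
def byte_sizes(text, encodings = %w[UTF-8 GBK GB2312])
  encodings.map { |name| [name, text.encode(name).bytesize] }.to_h
end

byte_sizes("中文").each do |name, size|
  puts "#{name}: #{size} bytes"
end

# ASCII cannot represent Chinese characters at all.
puts "ASCII only? #{"中文".ascii_only?}"
```

Each Chinese character here takes three bytes in UTF-8 and two bytes in GBK or GB2312, which is why the same text produces different byte counts.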
Encoding Support in Ruby
Since version 1.9, every Ruby string carries its own encoding, so Ruby can correctly handle multi-byte characters, including Chinese.
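One practical consequence is that two strings with the same characters but different encodings are not interchangeable. The sketch below illustrates this; `same_text?` is a hypothetical helper that normalizes both sides to UTF-8 before comparing.

```ruby
utf8 = "你好"
gbk  = utf8.encode("GBK")

# Same characters, different bytes: a direct comparison is false.
puts utf8 == gbk                 # false

# Hypothetical helper: compare two strings by content after converting both to UTF-8.
def same_text?(a, b)
  a.encode("UTF-8") == b.encode("UTF-8")
end
puts same_text?(utf8, gbk)       # true

# Concatenating non-ASCII strings with incompatible encodings raises an error.
begin
  utf8 + gbk
rescue Encoding::CompatibilityError => e
  puts "Cannot concatenate: #{e.message}"
end
```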
ruby
# Check the Ruby version
puts RUBY_VERSION
# Check the default internal encoding (nil unless one has been set, so use p to show it)
p Encoding.default_internal
# Check the default external encoding
puts Encoding.default_external
🔤 String Encoding Handling
Checking String Encoding
ruby
# Create a string containing Chinese
text = "你好,世界!"
puts text.encoding # Output: UTF-8
# Create strings with different encodings
gbk_text = "你好".encode("GBK")
puts gbk_text.encoding # Output: GBK
Encoding Conversion
ruby
# UTF-8 to GBK
utf8_text = "中文测试"
gbk_text = utf8_text.encode("GBK")
puts gbk_text.encoding # Output: GBK
# GBK to UTF-8
back_to_utf8 = gbk_text.encode("UTF-8")
puts back_to_utf8.encoding # Output: UTF-8
# Handling encoding errors
begin
  problematic_text = utf8_text.encode("ASCII")
rescue Encoding::UndefinedConversionError => e
  puts "Encoding conversion error: #{e.message}"
end
Force Encoding
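`force_encoding` only changes the label attached to a string's bytes, whereas `encode` rewrites the bytes themselves. A common repair for mislabeled text is therefore to relabel first and transcode second. The following is a minimal sketch of that idea; `repair_mislabeled` is a hypothetical helper name.

```ruby
# Simulate bytes that came from a GBK source but were labeled UTF-8 by mistake.
mislabeled = "中文".encode("GBK").force_encoding("UTF-8")
puts mislabeled.valid_encoding?   # false

# Hypothetical helper: relabel the bytes with their true encoding, then transcode.
def repair_mislabeled(text, actual_encoding)
  text.dup.force_encoding(actual_encoding).encode("UTF-8")
end

repaired = repair_mislabeled(mislabeled, "GBK")
puts repaired                     # 中文
puts repaired.encoding            # UTF-8
```

The helper works on a copy (`dup`) so that the caller's string keeps its original label.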
ruby
# Force set string encoding (without converting actual bytes)
text = "中文"
text.force_encoding("ASCII-8BIT")
puts text.encoding # Output: ASCII-8BIT
# Restore correct encoding
text.force_encoding("UTF-8")
puts text # Output: 中文
📄 Encoding Handling in File Read/Write
Reading Chinese Files
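UTF-8 files produced by some Windows editors begin with a byte order mark (BOM), which otherwise shows up as an invisible first character. As a hedged sketch, the `bom|utf-8` encoding option asks Ruby to strip it while reading; the helper and file names are illustrative.

```ruby
require "tmpdir"

# Hypothetical helper: read a UTF-8 file, dropping a leading BOM if one is present.
def read_utf8_without_bom(path)
  File.read(path, encoding: "bom|utf-8")
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "bom_sample.txt") # illustrative file name
  File.write(path, "\uFEFF中文内容", encoding: "UTF-8")

  plain = File.read(path, encoding: "UTF-8")
  puts plain.start_with?("\uFEFF")                       # true: the BOM is kept as a character
  puts read_utf8_without_bom(path).start_with?("\uFEFF") # false: the BOM is stripped
end
```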
ruby
# Create a test file containing Chinese
File.open("chinese_text.txt", "w:utf-8") do |file|
  file.write("This is a test file containing Chinese.\n")
  file.write("Second line content: Chinese encoding test.\n")
end
# Read UTF-8 encoded file
File.open("chinese_text.txt", "r:utf-8") do |file|
  file.each_line do |line|
    puts line.chomp
    puts "Encoding: #{line.encoding}"
  end
end
# Auto-detect encoding (requires charlock_holmes gem)
# require 'charlock_holmes'
# content = File.read("chinese_text.txt")
# detection = CharlockHolmes::EncodingDetector.detect(content)
# puts "Detected encoding: #{detection[:encoding]}"
Writing Chinese Files
ruby
# Write UTF-8 encoded file
File.open("output_utf8.txt", "w:utf-8") do |file|
  file.puts "This is a UTF-8 encoded file"
  file.puts "Contains Chinese characters"
end
# Write GBK encoded file
File.open("output_gbk.txt", "w:GBK") do |file|
  file.puts "This is a GBK encoded file".encode("GBK")
  file.puts "Contains Chinese characters".encode("GBK")
end
# Append mode writing Chinese
File.open("append_test.txt", "a:utf-8") do |file|
  file.puts "Appended Chinese content"
end
Handling Files with Different Encodings
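Ruby can also transcode while reading if the file mode names both an external and an internal encoding, as in `r:GBK:UTF-8`. The sketch below assumes a GBK file and uses a hypothetical helper, `read_as_utf8`, plus an illustrative file name.

```ruby
require "tmpdir"

# Hypothetical helper: read a file stored in `external` and return UTF-8 text.
def read_as_utf8(path, external)
  File.open(path, "r:#{external}:UTF-8", &:read)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "gbk_sample.txt") # illustrative file name
  File.open(path, "w:GBK") { |file| file.write("中文内容") }

  content = read_as_utf8(path, "GBK")
  puts content            # 中文内容
  puts content.encoding   # UTF-8
  puts File.size(path)    # 8 (four characters, two bytes each in GBK)
end
```

Letting the file mode do the conversion keeps the rest of the program working with UTF-8 strings only.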
ruby
# Read and convert files with different encodings
def read_and_convert_file(filename, from_encoding, to_encoding = "UTF-8")
  content = File.read(filename, encoding: from_encoding)
  content.encode(to_encoding)
rescue Encoding::InvalidByteSequenceError => e
  puts "Invalid byte sequence: #{e.message}"
  nil
rescue Encoding::UndefinedConversionError => e
  puts "Undefined conversion: #{e.message}"
  nil
end
# Usage example
# converted_content = read_and_convert_file("gbk_file.txt", "GBK")
# puts converted_content if converted_content
🌐 Encoding Handling in Network Requests
HTTP Request Processing
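Response bodies arrive as raw bytes, and the server does not always send UTF-8. As a minimal sketch, the hypothetical helper below picks the charset out of a Content-Type header value and decodes accordingly, falling back to UTF-8 when no usable charset is given.

```ruby
# Hypothetical helper: decode a raw response body using the charset named in a
# Content-Type header value, falling back to UTF-8 when none is usable.
def decode_body(body, content_type)
  charset = content_type.to_s[/charset=["']?([^;"'\s]+)/i, 1] || "UTF-8"
  encoding =
    begin
      Encoding.find(charset)
    rescue ArgumentError
      Encoding::UTF_8 # unknown charset name
    end
  text = body.dup.force_encoding(encoding)
  # Transcode to UTF-8, then scrub any invalid bytes that remain.
  text.encode("UTF-8", invalid: :replace, undef: :replace).scrub
end

raw = "中文".encode("GBK").b # bytes as they might arrive over the wire
puts decode_body(raw, "text/html; charset=GBK") # 中文
```

With Net::HTTP, the header value would come from `response["Content-Type"]` and the bytes from `response.body`.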
ruby
require 'net/http'
require 'uri'
require 'erb'
# Handle URLs containing Chinese
def fetch_with_chinese_url(base_url, chinese_path)
  # Percent-encode each path segment (URI.encode was removed in Ruby 3.0)
  encoded_path = chinese_path.split("/", -1).map { |segment| ERB::Util.url_encode(segment) }.join("/")
  full_url = "#{base_url}#{encoded_path}"
  uri = URI(full_url)
  response = Net::HTTP.get_response(uri)
  # Label the response bytes as UTF-8 (assumes the server actually sends UTF-8)
  content = response.body
  content.force_encoding("UTF-8")
  content
end
# Send POST request with Chinese data
def post_with_chinese_data(url, data)
  uri = URI(url)
  http = Net::HTTP.new(uri.host, uri.port)
  # Enable TLS when the URL uses https
  http.use_ssl = (uri.scheme == "https")
  # Set request headers
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/x-www-form-urlencoded; charset=utf-8"
  # Encode POST data
  encoded_data = URI.encode_www_form(data)
  request.body = encoded_data
  response = http.request(request)
  response.body.force_encoding("UTF-8")
end
Encoding in JSON Processing
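By default the `json` library keeps Chinese characters as UTF-8 text in its output. When a consumer can only handle ASCII, the `ascii_only: true` generator option escapes them as `\uXXXX` sequences instead. A minimal sketch, with `to_ascii_json` as a hypothetical wrapper:

```ruby
require "json"

# Hypothetical wrapper: generate JSON in which every non-ASCII character is escaped.
def to_ascii_json(object)
  JSON.generate(object, ascii_only: true)
end

data = { "name" => "张三", "city" => "北京" }

puts JSON.generate(data)   # Chinese characters appear as-is
ascii_json = to_ascii_json(data)
puts ascii_json            # Chinese characters appear as \u escapes
puts ascii_json.ascii_only?            # true
puts JSON.parse(ascii_json) == data    # true: parsing restores the original text
```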
ruby
require 'json'
# Create JSON data containing Chinese
data = {
  name: "张三",
  city: "北京",
  hobbies: ["读书", "游泳", "编程"]
}
# Generate JSON string
json_string = JSON.generate(data)
puts json_string
# Parse JSON string
parsed_data = JSON.parse(json_string)
puts "Name: #{parsed_data['name']}"
puts "City: #{parsed_data['city']}"
# Handle special characters
special_data = {
  message: "Contains special characters:\nNewline\tTab\"Quotes"
}
json_with_special = JSON.generate(special_data)
puts json_with_special
🛠️ Encoding-Related Methods
Using the Encoding Class
ruby
# Get all supported encodings
puts "Supported encoding formats:"
Encoding.list.each do |encoding|
  puts "- #{encoding.name}"
end
# Find specific encoding
utf8_encoding = Encoding.find("UTF-8")
puts "UTF-8 encoding: #{utf8_encoding}"
# Check if encoding is valid
puts Encoding.name_list.include?("GBK") # true or false
String Encoding Methods
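Because a Chinese character occupies several bytes in UTF-8, cutting a string at a byte limit can split a character in half. The following sketch shows one way to truncate safely; `truncate_bytes` is a hypothetical helper name.

```ruby
# Hypothetical helper: cut text to at most max_bytes without splitting a character.
def truncate_bytes(text, max_bytes)
  # byteslice may cut through a multi-byte character; scrub("") drops the broken tail.
  text.byteslice(0, max_bytes).scrub("")
end

text = "中文测试"
puts truncate_bytes(text, 7)           # 中文
puts truncate_bytes(text, 7).bytesize  # 6
```

This is useful when a database column or protocol field limits length in bytes rather than characters.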
ruby
text = "中文测试"
# Check encoding validity
puts text.valid_encoding? # true
# Get byte representation (p prints the array on one line)
p text.bytes # [228, 184, 173, 230, 150, 135, 230, 181, 139, 232, 175, 149]
# Get character count
puts text.length # 4
puts text.size # 4
puts text.bytesize # 12 (Each Chinese character takes 3 bytes in UTF-8)
# Encoding names and aliases (Encoding.aliases maps alias => name, so look them up from the encoding)
p Encoding::UTF_8.names # includes "UTF-8" and aliases such as "CP65001"
⚠️ Common Encoding Problems and Solutions
Garbled Text Problems
ruby
# Problem: Garbled text when reading file
# Solution: Explicitly specify file encoding
File.open("chinese_file.txt", "r:GBK") do |file|
  content = file.read
  # Convert to UTF-8
  utf8_content = content.encode("UTF-8", "GBK")
  puts utf8_content
end
# Problem: Console output garbled
# Solution: Make the console encoding match the encoding Ruby writes
puts "Default external encoding: #{Encoding.default_external}"
if RUBY_PLATFORM =~ /mingw|mswin/
  # Windows system: switch the console code page to UTF-8 before running the script
  puts "Run `chcp 65001` if Chinese output appears garbled"
else
  # Unix/Linux system: check that the locale uses UTF-8
  puts "LANG: #{ENV['LANG']}"
end
Encoding Conversion Error Handling
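Besides rescuing exceptions or replacing characters with `?`, `String#encode` accepts a `fallback:` option that supplies a substitute for each character the target encoding cannot represent. The sketch below, using the hypothetical helper `ascii_with_escapes`, writes such characters as `\uXXXX` escapes so that no information is lost.

```ruby
# Hypothetical helper: convert to ASCII, writing each unconvertible character
# as a \uXXXX escape instead of raising or replacing it with "?".
def ascii_with_escapes(text)
  text.encode("ASCII", fallback: ->(char) { format("\\u%04X", char.ord) })
end

puts ascii_with_escapes("Ruby 中文") # Ruby \u4E2D\u6587
```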
ruby
def safe_encode(text, to_encoding, from_encoding = nil)
  from_encoding ||= text.encoding
  text.encode(to_encoding, from_encoding)
rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError => e
  puts "Encoding conversion failed: #{e.message}"
  # Fall back to replacing the characters that cannot be converted,
  # keeping the same source encoding as the first attempt
  text.encode(to_encoding, from_encoding, invalid: :replace, undef: :replace)
end
# Usage example
problematic_text = "有问题的文本"
safe_result = safe_encode(problematic_text, "ASCII")
puts safe_result if safe_result # Output: ??????
Database Encoding Problems
ruby
# Set encoding when connecting to MySQL database
require 'mysql2'
client = Mysql2::Client.new(
  host: "localhost",
  username: "user",
  password: "password",
  database: "mydb",
  encoding: "utf8mb4" # Recommended to use utf8mb4 instead of utf8
)
# Query data containing Chinese
results = client.query("SELECT * FROM users WHERE name LIKE '%张%'")
results.each do |row|
  puts "User: #{row['name']}"
end
🎯 Best Practices
1. Unified Use of UTF-8 Encoding
ruby
# Set default encoding at the beginning of the program
Encoding.default_internal = "UTF-8"
Encoding.default_external = "UTF-8"
# Or add encoding declaration at the beginning of the file
# -*- coding: utf-8 -*-2. Explicitly Specify File Encoding
ruby
# Explicitly specify encoding when reading files
File.open("data.txt", "r:utf-8") do |file|
  # Process file content
end
# Explicitly specify encoding when writing files
File.open("output.txt", "w:utf-8") do |file|
  file.puts "中文内容"
end
3. Validate Encoding When Processing External Data
ruby
def process_external_data(data)
  # Validate encoding validity
  unless data.valid_encoding?
    puts "Warning: Data contains invalid encoding"
    # Replace invalid byte sequences (String#scrub, available since Ruby 2.1)
    data = data.scrub
  end
  # Process data
  data
end
4. Error Handling and Logging
ruby
# Hypothetical logging helper; replace it with your application's logger
def log_error(type, message)
  warn "[#{type}] #{message}"
end
def handle_encoding_errors
  yield
rescue Encoding::InvalidByteSequenceError => e
  puts "Encoding error - Invalid byte sequence: #{e.message}"
  # Log error
  log_error("InvalidByteSequenceError", e.message)
rescue Encoding::UndefinedConversionError => e
  puts "Encoding error - Undefined conversion: #{e.message}"
  # Log error
  log_error("UndefinedConversionError", e.message)
end
# Usage example
handle_encoding_errors do
  # Code that might have encoding errors
  "中文".encode("ASCII")
end
🧪 Encoding Test Examples
Encoding Detection Tool
ruby
class EncodingDetector
  # Candidate encodings, ordered from strictest to most permissive
  CANDIDATES = ["ASCII", "UTF-8", "GB2312", "GBK"].freeze

  # Heuristic: trust a valid, non-binary label; otherwise return the first
  # candidate under which the raw bytes form a valid string
  def self.detect_encoding(text)
    if text.encoding != Encoding::BINARY && text.valid_encoding?
      return text.encoding.name
    end
    CANDIDATES.each do |encoding|
      return encoding if text.dup.force_encoding(encoding).valid_encoding?
    end
    "Unknown"
  end

  def self.convert_to_utf8(text, from_encoding = nil)
    from_encoding ||= detect_encoding(text)
    return text if from_encoding == "Unknown"
    text.encode("UTF-8", from_encoding)
  end
end
# Test encoding detection
test_strings = [
  "UTF-8中文测试",
  "GBK中文测试".encode("GBK"),
  "ASCII text"
]
test_strings.each do |str|
  puts "Original encoding: #{str.encoding}"
  detected = EncodingDetector.detect_encoding(str)
  puts "Detected encoding: #{detected}"
  if detected != "Unknown"
    converted = EncodingDetector.convert_to_utf8(str, detected)
    puts "After conversion: #{converted}"
  end
  puts "---"
end
📚 Next Steps
After mastering Ruby Chinese encoding handling, we recommend continuing to learn:
- Ruby Strings - Learn more about string operations
- Ruby File Processing and I/O - Learn file read/write operations
- Ruby Regular Expressions - Master text pattern matching
- Ruby Database Access - Learn encoding handling in database operations
Continue your Ruby learning journey!