Ruby Chinese Encoding

When processing Chinese text, handling character encoding correctly is essential. This chapter explains how Ruby handles character encoding and how to avoid garbled Chinese text (mojibake).

📚 Character Encoding Basics

What is Character Encoding?

A character encoding maps characters to numbers, and ultimately to bytes, so that computers can store and process text. Common encodings include:

  • ASCII: American Standard Code for Information Interchange; a 7-bit encoding that covers English letters, digits, and basic punctuation only
  • UTF-8: a variable-length encoding of Unicode; covers the characters of virtually all languages, including Chinese
  • GBK: an extension of GB2312 that adds Traditional Chinese and other characters; widely used in mainland China
  • GB2312: the original national standard character set for Simplified Chinese
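
The practical difference between these encodings shows up in the bytes they produce. The following is a minimal sketch, assuming the source file is UTF-8 (the default since Ruby 2.0); the sample strings are illustrative only.

```ruby
# The same text occupies a different number of bytes in each encoding.
text = "中文"

utf8_size = text.encode("UTF-8").bytesize  # 3 bytes per character here
gbk_size  = text.encode("GBK").bytesize    # 2 bytes per character

# ASCII characters are stored as the same single byte in both encodings.
ascii_same = "abc".encode("GBK").bytes == "abc".encode("UTF-8").bytes

puts "UTF-8: #{utf8_size} bytes, GBK: #{gbk_size} bytes"
puts "ASCII bytes identical: #{ascii_same}"
```

This is why a file written in one encoding and read as another turns into garbage: the byte boundaries between characters no longer line up.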

Encoding Support in Ruby

Since version 1.9, every Ruby string carries its own encoding, so multi-byte characters such as Chinese are handled correctly.

ruby
# Check Ruby version
puts RUBY_VERSION

# Check default internal encoding (nil unless explicitly configured, so use p rather than puts)
p Encoding.default_internal

# Check default external encoding (usually taken from the locale)
puts Encoding.default_external

🔤 String Encoding Handling

Checking String Encoding

ruby
# Create a string containing Chinese
text = "你好,世界!"
puts text.encoding  # Output: UTF-8

# Create strings with different encodings
gbk_text = "你好".encode("GBK")
puts gbk_text.encoding  # Output: GBK

Encoding Conversion

ruby
# UTF-8 to GBK
utf8_text = "中文测试"
gbk_text = utf8_text.encode("GBK")
puts gbk_text.encoding  # Output: GBK

# GBK to UTF-8
back_to_utf8 = gbk_text.encode("UTF-8")
puts back_to_utf8.encoding  # Output: UTF-8

# Handling encoding errors
begin
  problematic_text = utf8_text.encode("ASCII")
rescue Encoding::UndefinedConversionError => e
  puts "Encoding conversion error: #{e.message}"
end

Force Encoding

ruby
# Force set string encoding (without converting actual bytes)
text = "中文"
text.force_encoding("ASCII-8BIT")
puts text.encoding  # Output: ASCII-8BIT

# Restore correct encoding
text.force_encoding("UTF-8")
puts text  # Output: 中文
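
The difference between `encode` and `force_encoding` is easy to confuse, so here is a small side-by-side sketch. The variable names are illustrative; the point is that `encode` rewrites the bytes while `force_encoding` only changes the label attached to them.

```ruby
utf8 = "中文"

converted = utf8.encode("GBK")              # new bytes, same characters
relabeled = utf8.dup.force_encoding("GBK")  # same bytes, new label

puts converted.bytes == utf8.bytes   # false: encode rewrote the bytes
puts relabeled.bytes == utf8.bytes   # true: only the label changed

# Only the real conversion can be reversed back to the original text.
round_trip = converted.encode("UTF-8")
puts round_trip == utf8              # true
```

As a rule of thumb, use `force_encoding` only when the label is wrong and the bytes are right, and `encode` when the bytes themselves must change.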

📄 Encoding Handling in File Read/Write

Reading Chinese Files

ruby
# Create a test file containing Chinese
File.open("chinese_text.txt", "w:utf-8") do |file|
  file.write("This is a test file containing Chinese.\n")
  file.write("Second line content: Chinese encoding test.\n")
end

# Read UTF-8 encoded file
File.open("chinese_text.txt", "r:utf-8") do |file|
  file.each_line do |line|
    puts line.chomp
    puts "Encoding: #{line.encoding}"
  end
end

# Auto-detect encoding (requires charlock_holmes gem)
# require 'charlock_holmes'
# content = File.read("chinese_text.txt")
# detection = CharlockHolmes::EncodingDetector.detect(content)
# puts "Detected encoding: #{detection[:encoding]}"
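
UTF-8 files produced by some Windows editors start with a byte order mark (BOM), which otherwise ends up as an invisible character at the start of the first line. The sketch below assumes that situation and uses Ruby's `BOM|UTF-8` mode to drop it; `read_utf8_without_bom` is a hypothetical helper name, not part of any library.

```ruby
require 'tempfile'

# Hypothetical helper: read a UTF-8 file, dropping a leading BOM if present.
def read_utf8_without_bom(path)
  File.read(path, mode: "r:BOM|UTF-8")
end

Tempfile.create("bom_demo") do |tmp|
  # Write a UTF-8 byte order mark (EF BB BF) followed by the text
  tmp.binmode
  tmp.write("\xEF\xBB\xBF".b + "你好".b)
  tmp.flush

  with_bom    = File.read(tmp.path, mode: "r:UTF-8")
  without_bom = read_utf8_without_bom(tmp.path)

  puts with_bom.bytesize      # 9: 3 BOM bytes + 6 text bytes
  puts without_bom.bytesize   # 6
  puts without_bom == "你好"   # true
end
```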

Writing Chinese Files

ruby
# Write UTF-8 encoded file
File.open("output_utf8.txt", "w:utf-8") do |file|
  file.puts "This is a UTF-8 encoded file"
  file.puts "Contains Chinese characters"
end

# Write GBK encoded file
File.open("output_gbk.txt", "w:GBK") do |file|
  file.puts "This is a GBK encoded file".encode("GBK")
  file.puts "Contains Chinese characters".encode("GBK")
end

# Append mode writing Chinese
File.open("append_test.txt", "a:utf-8") do |file|
  file.puts "Appended Chinese content"
end
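
The mode string can also name two encodings, `external:internal`, so that Ruby transcodes while reading. The following sketch writes a GBK file and reads it back as UTF-8; `gbk_round_trip` and the file name are hypothetical and exist only for this demonstration.

```ruby
require 'tmpdir'

# Hypothetical helper: write text as GBK, then return the raw bytes on disk
# and the text as decoded back into UTF-8.
def gbk_round_trip(text)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "gbk_demo.txt")
    # "w:GBK": strings are transcoded to GBK as they are written
    File.open(path, "w:GBK") { |file| file.write(text) }
    raw = File.binread(path)
    # "r:GBK:UTF-8": read GBK from disk, hand UTF-8 strings to the program
    decoded = File.open(path, "r:GBK:UTF-8") { |file| file.read }
    [raw, decoded]
  end
end

raw, decoded = gbk_round_trip("你好世界")
puts raw.bytesize          # 8: two bytes per character on disk
puts decoded.encoding      # UTF-8
puts decoded == "你好世界"  # true
```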

Handling Files with Different Encodings

ruby
# Read and convert files with different encodings
def read_and_convert_file(filename, from_encoding, to_encoding="UTF-8")
  content = File.read(filename, encoding: from_encoding)
  content.encode(to_encoding)
rescue Encoding::InvalidByteSequenceError => e
  puts "Invalid byte sequence: #{e.message}"
  nil
rescue Encoding::UndefinedConversionError => e
  puts "Undefined conversion: #{e.message}"
  nil
end

# Usage example
# converted_content = read_and_convert_file("gbk_file.txt", "GBK")
# puts converted_content if converted_content

🌐 Encoding Handling in Network Requests

HTTP Request Processing

ruby
require 'net/http'
require 'uri'
require 'erb'

# Handle URLs containing Chinese
def fetch_with_chinese_url(base_url, chinese_path)
  # Percent-encode each path segment (URI.encode was removed in Ruby 3.0)
  encoded_path = chinese_path.split("/", -1).map { |segment| ERB::Util.url_encode(segment) }.join("/")
  full_url = "#{base_url}#{encoded_path}"

  uri = URI(full_url)
  response = Net::HTTP.get_response(uri)

  # Label the body as UTF-8; this assumes the server actually sent UTF-8
  content = response.body
  content.force_encoding("UTF-8")
  content
end

# Send POST request with Chinese data
def post_with_chinese_data(url, data)
  uri = URI(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")  # Enable TLS for https URLs

  # Set request headers
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/x-www-form-urlencoded; charset=utf-8"

  # Encode POST data
  encoded_data = URI.encode_www_form(data)
  request.body = encoded_data

  response = http.request(request)
  response.body.force_encoding("UTF-8")
end
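
The examples above label every response body as UTF-8, which is only correct when the server really sends UTF-8. A more defensive sketch reads the charset from the `Content-Type` header and converts from that. `decode_body` is a hypothetical helper, and the header values below are made-up samples; with Net::HTTP you would pass `response.body` and `response["Content-Type"]`.

```ruby
# Hypothetical helper: convert a raw response body to UTF-8 using the
# charset named in the Content-Type header, falling back to UTF-8.
def decode_body(raw_body, content_type, fallback: "UTF-8")
  charset = content_type.to_s[/charset\s*=\s*"?([^\s;"]+)/i, 1] || fallback
  encoding = begin
    Encoding.find(charset)
  rescue ArgumentError
    Encoding.find(fallback)  # unknown charset name: assume the fallback
  end
  raw_body.dup.force_encoding(encoding)
          .encode("UTF-8", invalid: :replace, undef: :replace)
end

gbk_body = "你好".encode("GBK").b  # simulate bytes received from a GBK server
puts decode_body(gbk_body, "text/html; charset=GBK")  # 你好
puts decode_body("你好".b, "text/html")                # 你好
```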

Encoding in JSON Processing

ruby
require 'json'

# Create JSON data containing Chinese
data = {
  name: "张三",
  city: "北京",
  hobbies: ["读书", "游泳", "编程"]
}

# Generate JSON string
json_string = JSON.generate(data)
puts json_string

# Parse JSON string
parsed_data = JSON.parse(json_string)
puts "Name: #{parsed_data['name']}"
puts "City: #{parsed_data['city']}"

# Handle special characters
special_data = {
  message: "Contains special characters:\nNewline\tTab\"Quotes"
}
json_with_special = JSON.generate(special_data)
puts json_with_special
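
By default `JSON.generate` keeps Chinese characters as literal UTF-8. When the JSON has to pass through a channel that is not UTF-8 safe, the generator's `ascii_only` option escapes them as `\uXXXX` sequences instead. A brief sketch with illustrative data:

```ruby
require 'json'

data = { "name" => "张三" }

utf8_json  = JSON.generate(data)                    # keeps Chinese characters as-is
ascii_json = JSON.generate(data, ascii_only: true)  # escapes them as \uXXXX

puts utf8_json
puts ascii_json
puts ascii_json.ascii_only?                           # true
puts JSON.parse(ascii_json) == JSON.parse(utf8_json)  # true
```

Both forms parse back to the same data, so the choice only affects how the text travels.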

Using the Encoding Class

ruby
# Get all supported encodings
puts "Supported encoding formats:"
Encoding.list.each do |encoding|
  puts "- #{encoding.name}"
end

# Find specific encoding
utf8_encoding = Encoding.find("UTF-8")
puts "UTF-8 encoding: #{utf8_encoding}"

# Check if encoding is valid
puts Encoding.name_list.include?("GBK")  # true or false
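
The Encoding class can also tell you whether two strings can be combined safely. The sketch below uses `Encoding.compatible?` and shows the `Encoding::CompatibilityError` raised when non-ASCII strings in different encodings are concatenated; the sample strings are illustrative.

```ruby
utf8 = "中文"
gbk  = "测试".encode("GBK")

# nil means the two strings cannot be combined directly
compatibility = Encoding.compatible?(utf8, gbk)
p compatibility

begin
  utf8 + gbk
rescue Encoding::CompatibilityError => e
  puts "Cannot concatenate: #{e.message}"
end

# Converting to a common encoding first avoids the error
combined = utf8 + gbk.encode("UTF-8")
puts combined   # 中文测试
```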

String Encoding Methods

ruby
text = "中文测试"

# Check encoding validity
puts text.valid_encoding?  # true

# Get byte representation (p prints the array on one line; puts would print one byte per line)
p text.bytes  # [228, 184, 173, 230, 150, 135, 230, 181, 139, 232, 175, 149]

# Get character count
puts text.length  # 4
puts text.size    # 4
puts text.bytesize # 12 (Each Chinese character takes 3 bytes in UTF-8)

# Encoding aliases (the hash maps alias name => canonical name)
puts Encoding.aliases["CP65001"]  # UTF-8
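
Because characters and bytes are counted differently, cutting a string by byte count can split a Chinese character in half. A minimal sketch of the problem and one way to clean it up with `String#scrub` (available since Ruby 2.1):

```ruby
text = "中文测试"

cut = text.byteslice(0, 4)   # "中" plus the first byte of "文"
puts cut.valid_encoding?     # false

trimmed = cut.scrub("")      # drop the incomplete trailing byte
puts trimmed                 # 中

puts text[0, 2]              # 中文 (String#[] counts characters, not bytes)
```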

⚠️ Common Encoding Problems and Solutions

Garbled Text Problems

ruby
# Problem: Garbled text when reading file
# Solution: Explicitly specify file encoding
File.open("chinese_file.txt", "r:GBK") do |file|
  content = file.read
  # Convert to UTF-8
  utf8_content = content.encode("UTF-8", "GBK")
  puts utf8_content
end

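# Problem: Console output garbled
# Solution: Check which encodings Ruby assumes, then make the console match
puts "Locale encoding: #{Encoding.find('locale')}"
puts "Default external encoding: #{Encoding.default_external}"
if RUBY_PLATFORM =~ /mingw|mswin/
  # Windows: legacy consoles often use a GBK code page (CP936);
  # running `chcp 65001` switches the console to UTF-8
  puts "Console output encoding: #{$stdout.external_encoding.inspect}"
end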

Encoding Conversion Error Handling

ruby
def safe_encode(text, to_encoding, from_encoding=nil)
  # Pass the source encoding only when the caller supplied one
  args = from_encoding ? [to_encoding, from_encoding] : [to_encoding]
  text.encode(*args)
rescue Encoding::InvalidByteSequenceError => e
  puts "Invalid byte sequence: #{e.message}"
  # Retry, replacing anything that cannot be converted
  text.encode(*args, invalid: :replace, undef: :replace)
rescue Encoding::UndefinedConversionError => e
  puts "Undefined conversion: #{e.message}"
  # Retry, replacing anything that cannot be converted
  text.encode(*args, invalid: :replace, undef: :replace)
end

# Usage example
problematic_text = "有问题的文本"
safe_result = safe_encode(problematic_text, "ASCII")
puts safe_result if safe_result
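
Replacing every unconvertible character with `?` loses information. When that matters, `String#encode` accepts a `fallback:` option that computes a substitute per character. The sketch below writes a `\uXXXX` escape instead; the lambda name and the escape format are choices made for this example, not a standard.

```ruby
# Turn each character that ASCII cannot represent into a \uXXXX escape.
to_escape = ->(char) { format("\\u%04X", char.ord) }

escaped = "中文 ok".encode("US-ASCII", fallback: to_escape)
puts escaped              # \u4E2D\u6587 ok
puts escaped.ascii_only?  # true
```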

Database Encoding Problems

ruby
# Set encoding when connecting to MySQL database
require 'mysql2'

client = Mysql2::Client.new(
  host: "localhost",
  username: "user",
  password: "password",
  database: "mydb",
  encoding: "utf8mb4"  # Recommended to use utf8mb4 instead of utf8
)

# Query data containing Chinese
results = client.query("SELECT * FROM users WHERE name LIKE '%张%'")
results.each do |row|
  puts "User: #{row['name']}"
end

🎯 Best Practices

1. Use UTF-8 Consistently

ruby
# Set default encoding at the beginning of the program
Encoding.default_internal = "UTF-8"
Encoding.default_external = "UTF-8"

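# Or add an encoding declaration (magic comment) on the first line of the file.
# Since Ruby 2.0, source files default to UTF-8, so this is only needed for other encodings.
# -*- coding: utf-8 -*-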

2. Explicitly Specify File Encoding

ruby
# Explicitly specify encoding when reading files
File.open("data.txt", "r:utf-8") do |file|
  # Process file content
end

# Explicitly specify encoding when writing files
File.open("output.txt", "w:utf-8") do |file|
  file.puts "中文内容"
end

3. Validate Encoding When Processing External Data

ruby
def process_external_data(data)
  # Validate encoding validity
  unless data.valid_encoding?
    puts "Warning: Data contains invalid encoding"
    # Try to fix encoding
    data = data.encode("UTF-8", invalid: :replace, undef: :replace)
  end

  # Process data
  data
end

4. Error Handling and Logging

ruby
require 'logger'

LOGGER = Logger.new($stderr)

def handle_encoding_errors
  yield
rescue Encoding::InvalidByteSequenceError => e
  # Raised when the source bytes are not valid in the source encoding
  LOGGER.error("Encoding error - Invalid byte sequence: #{e.message}")
  nil
rescue Encoding::UndefinedConversionError => e
  # Raised when a character has no equivalent in the target encoding
  LOGGER.error("Encoding error - Undefined conversion: #{e.message}")
  nil
end

# Usage example
handle_encoding_errors do
  # Code that might raise an encoding error
  "中文".encode("ASCII")
end

🧪 Encoding Test Examples

Encoding Detection Tool

ruby
class EncodingDetector
  def self.detect_encoding(text)
    # Check UTF-8 validity
    if text.encoding.name == "UTF-8" && text.valid_encoding?
      return "UTF-8"
    end

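    # Try other common encodings. Order matters: plain ASCII bytes also
    # decode as GBK, so this is a heuristic, not a reliable detector.
    ["GBK", "GB2312", "ASCII"].each do |encoding|
      begin
        text.encode("UTF-8", encoding)
        return encoding
      rescue EncodingError
        next
      end
    end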

    "Unknown"
  end

  def self.convert_to_utf8(text, from_encoding=nil)
    if from_encoding
      text.encode("UTF-8", from_encoding)
    else
      detected = detect_encoding(text)
      if detected != "Unknown"
        text.encode("UTF-8", detected)
      else
        text
      end
    end
  end
end

# Test encoding detection
test_strings = [
  "UTF-8中文测试",
  "GBK中文测试".encode("GBK"),
  "ASCII text"
]

test_strings.each do |str|
  puts "Original encoding: #{str.encoding}"
  detected = EncodingDetector.detect_encoding(str)
  puts "Detected encoding: #{detected}"

  if detected != "Unknown"
    converted = EncodingDetector.convert_to_utf8(str, detected)
    puts "After conversion: #{converted}"
  end
  puts "---"
end
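
Trial decoding like this depends heavily on the order of the candidates, because many byte sequences are structurally valid in more than one encoding. A common heuristic is to test the strictest encoding first. The sketch below applies that idea to raw bytes, such as those returned by `File.binread`; `guess_encoding` and `CANDIDATES` are hypothetical names, and the result is a guess rather than a guarantee.

```ruby
# Candidate order matters: UTF-8 is strict, so it is checked before GBK.
CANDIDATES = ["UTF-8", "GBK"].freeze

# Returns the first candidate under which the bytes form a valid string,
# or nil when none of the candidates fit.
def guess_encoding(raw_bytes)
  CANDIDATES.find do |name|
    raw_bytes.dup.force_encoding(name).valid_encoding?
  end
end

utf8_bytes = "中文测试".b
gbk_bytes  = "中文测试".encode("GBK").b

puts guess_encoding(utf8_bytes)   # UTF-8
puts guess_encoding(gbk_bytes)    # GBK
```

For anything beyond short, known inputs, a statistical detector such as the charlock_holmes gem mentioned earlier is likely to be more reliable.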

📚 Next Steps

After mastering Chinese encoding handling in Ruby, continue your Ruby learning journey!
