
Pandas JSON Processing

JSON (JavaScript Object Notation) is one of the most widely used data exchange formats in modern web applications and APIs. Pandas provides powerful JSON processing capabilities for reading, writing, and manipulating JSON data. This chapter details how to use Pandas to handle the most common JSON data formats.

1. JSON Data Format Overview

1.1 JSON Basics

JSON is a lightweight data interchange format with the following characteristics:

  • Easy to read and write for humans
  • Easy to parse and generate for machines
  • Based on JavaScript syntax but language-independent
  • Supports objects, arrays, strings, numbers, booleans, and null
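The type mapping above can be demonstrated with a quick round trip through the standard-library json module:

```python
import json

# Python objects map directly onto the JSON types listed above:
# dict -> object, list -> array, str -> string, int/float -> number,
# True/False -> true/false, None -> null
record = {"name": "Alice", "scores": [85, 92], "active": True, "note": None}

text = json.dumps(record)      # serialize to a JSON string
restored = json.loads(text)    # parse back into Python objects

print(text)
print(restored == record)      # the round trip preserves the data
```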

1.2 Common JSON Data Structures

python
import pandas as pd
import json
import numpy as np
from datetime import datetime

# Simple JSON object
simple_json = {
    "name": "Alice",
    "age": 30,
    "city": "Beijing",
    "married": True
}

# JSON array
json_array = [
    {"id": 1, "name": "Alice", "score": 85},
    {"id": 2, "name": "Bob", "score": 92},
    {"id": 3, "name": "Charlie", "score": 78}
]

# Nested JSON
nested_json = {
    "user": {
        "id": 1,
        "profile": {
            "name": "Alice",
            "age": 30,
            "address": {
                "city": "Beijing",
                "district": "Chaoyang"
            }
        },
        "orders": [
            {"order_id": "ORD001", "amount": 299.99, "date": "2023-01-15"},
            {"order_id": "ORD002", "amount": 159.50, "date": "2023-02-20"}
        ]
    }
}

print("Simple JSON:", simple_json)
print("JSON array:", json_array)
print("Nested JSON:", nested_json)

2. Reading JSON Data

2.1 Reading JSON from String

python
# Create DataFrame from JSON string
json_string = '''
[
    {"name": "Alice", "age": 25, "department": "IT", "salary": 8000},
    {"name": "Bob", "age": 30, "department": "HR", "salary": 7000},
    {"name": "Charlie", "age": 28, "department": "Finance", "salary": 9000}
]
'''

# Read using pd.read_json(); wrap the string in StringIO, since
# pandas 2.1+ deprecates passing a literal JSON string directly
from io import StringIO

df_from_string = pd.read_json(StringIO(json_string))
print("DataFrame created from JSON string:")
print(df_from_string)
print("\nData types:")
print(df_from_string.dtypes)

2.2 Reading JSON from File

python
# Create sample JSON file
employees_data = [
    {"id": 1, "name": "Alice", "age": 25, "department": "IT", "salary": 8000, "join_date": "2020-01-15"},
    {"id": 2, "name": "Bob", "age": 30, "department": "HR", "salary": 7000, "join_date": "2019-03-20"},
    {"id": 3, "name": "Charlie", "age": 28, "department": "Finance", "salary": 9000, "join_date": "2021-05-10"},
    {"id": 4, "name": "David", "age": 32, "department": "IT", "salary": 9500, "join_date": "2018-12-01"}
]

# Save as JSON file
with open('employees.json', 'w', encoding='utf-8') as f:
    json.dump(employees_data, f, ensure_ascii=False, indent=2)

# Read from file
df_from_file = pd.read_json('employees.json')
print("DataFrame read from JSON file:")
print(df_from_file)

# Process date column
df_from_file['join_date'] = pd.to_datetime(df_from_file['join_date'])
print("\nData types after processing dates:")
print(df_from_file.dtypes)

2.3 Reading Different JSON Formats

python
# Different JSON formats

# 1. Records format - each row is a record
records_format = '''
[
    {"name": "Alice", "age": 25, "city": "Beijing"},
    {"name": "Bob", "age": 30, "city": "Shanghai"}
]
'''

# 2. Index format - indexed by key
index_format = '''
{
    "0": {"name": "Alice", "age": 25, "city": "Beijing"},
    "1": {"name": "Bob", "age": 30, "city": "Shanghai"}
}
'''

# 3. Columns format - keyed by column name
columns_format = '''
{
    "name": {"0": "Alice", "1": "Bob"},
    "age": {"0": 25, "1": 30},
    "city": {"0": "Beijing", "1": "Shanghai"}
}
'''

# 4. Values format - array of values only
values_format = '''
[
    ["Alice", 25, "Beijing"],
    ["Bob", 30, "Shanghai"]
]
'''

# Read different formats (StringIO avoids the literal-string deprecation)
from io import StringIO

df_records = pd.read_json(StringIO(records_format), orient='records')
df_index = pd.read_json(StringIO(index_format), orient='index')
df_columns = pd.read_json(StringIO(columns_format), orient='columns')
df_values = pd.read_json(StringIO(values_format), orient='values')

print("Records format:")
print(df_records)
print("\nIndex format:")
print(df_index)
print("\nColumns format:")
print(df_columns)
print("\nValues format (need to set column names manually):")
df_values.columns = ['name', 'age', 'city']
print(df_values)
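Besides the four formats above, `orient='split'` stores index, columns, and data under separate keys, which makes column order explicit. A minimal sketch:

```python
from io import StringIO
import pandas as pd

# 'split' format keeps index, columns, and data in separate keys
split_format = '''
{
    "columns": ["name", "age", "city"],
    "index": [0, 1],
    "data": [["Alice", 25, "Beijing"], ["Bob", 30, "Shanghai"]]
}
'''

df_split = pd.read_json(StringIO(split_format), orient='split')
print(df_split)
```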

2.4 Processing Nested JSON

python
# Process nested JSON data
nested_data = '''
[
    {
        "id": 1,
        "name": "Alice",
        "profile": {
            "age": 25,
            "address": {
                "city": "Beijing",
                "district": "Chaoyang"
            }
        },
        "skills": ["Python", "SQL", "Machine Learning"]
    },
    {
        "id": 2,
        "name": "Bob",
        "profile": {
            "age": 30,
            "address": {
                "city": "Shanghai",
                "district": "Pudong"
            }
        },
        "skills": ["Java", "Spring", "MySQL"]
    }
]
'''

# Use json_normalize to process nested data
from pandas import json_normalize

# Parse JSON string
data = json.loads(nested_data)

# Normalize nested JSON
df_normalized = json_normalize(data)
print("Normalized DataFrame:")
print(df_normalized)
print("\nColumn names:")
print(df_normalized.columns.tolist())

# Specify separator
df_custom_sep = json_normalize(data, sep='_')
print("\nUsing custom separator:")
print(df_custom_sep.columns.tolist())

# Process array fields
df_with_meta = json_normalize(data, record_path='skills', 
                              meta=['id', 'name', ['profile', 'age']], 
                              meta_prefix='user_')
print("\nExpanded skills array:")
print(df_with_meta)
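json_normalize flattens every level by default; the `max_level` parameter caps the depth, leaving deeper dictionaries as raw objects:

```python
import pandas as pd

nested = [{"id": 1, "profile": {"name": "Alice", "address": {"city": "Beijing"}}}]

# Default: flatten all levels into dotted column names
full = pd.json_normalize(nested)
print(full.columns.tolist())

# max_level=1: stop one level down; 'profile.address' stays a dict
shallow = pd.json_normalize(nested, max_level=1)
print(shallow.columns.tolist())
```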

3. Writing JSON Data

3.1 Basic Writing Operations

python
# Create sample DataFrame
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 28, 32],
    'department': ['IT', 'HR', 'Finance', 'IT'],
    'salary': [8000, 7000, 9000, 9500],
    'join_date': pd.to_datetime(['2020-01-15', '2019-03-20', '2021-05-10', '2018-12-01'])
})

print("Original DataFrame:")
print(df)

# Convert to JSON string
json_string = df.to_json(orient='records', force_ascii=False, indent=2)
print("\nJSON string:")
print(json_string)

# Save to file
df.to_json('output.json', orient='records', force_ascii=False, indent=2)
print("\nSaved to output.json file")
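For log-style data, `to_json` also supports newline-delimited JSON (JSON Lines) via `lines=True`, which requires `orient='records'`. A round-trip sketch:

```python
from io import StringIO
import pandas as pd

df_small = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})

# One JSON object per line -- convenient for logs and streaming pipelines
jsonl = df_small.to_json(orient='records', lines=True, force_ascii=False)
print(jsonl)

# Read it back with the same options
df_back = pd.read_json(StringIO(jsonl), orient='records', lines=True)
print(df_back)
```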

3.2 Different Output Formats

python
# Demonstrate different orient parameters
print("1. Records format:")
print(df.to_json(orient='records', force_ascii=False, indent=2)[:200] + "...")

print("\n2. Index format:")
print(df.to_json(orient='index', force_ascii=False, indent=2)[:200] + "...")

print("\n3. Columns format:")
print(df.to_json(orient='columns', force_ascii=False, indent=2)[:200] + "...")

print("\n4. Values format:")
print(df.to_json(orient='values', force_ascii=False, indent=2)[:200] + "...")

print("\n5. Split format:")
print(df.to_json(orient='split', force_ascii=False, indent=2)[:200] + "...")

print("\n6. Table format:")
print(df.to_json(orient='table', force_ascii=False, indent=2)[:200] + "...")

3.3 Handling Dates and Times

python
# Date handling options
print("Default date format (epoch milliseconds):")
print(df[['name', 'join_date']].to_json(orient='records'))

print("\nISO format dates:")
print(df[['name', 'join_date']].to_json(orient='records', date_format='iso'))

print("\nCustom date format (to_json only supports 'epoch' and 'iso'; format the column as strings first):")
print(df.assign(join_date=df['join_date'].dt.strftime('%Y-%m-%d'))[['name', 'join_date']].to_json(orient='records'))

# Handle NaN values
df_with_nan = df.copy()
df_with_nan.loc[0, 'salary'] = np.nan

print("\nData with NaN (default null):")
print(df_with_nan[['name', 'salary']].to_json(orient='records'))

print("\nCustom NaN value (to_json has no na_value option; substitute with fillna first):")
print(df_with_nan[['name', 'salary']].fillna('N/A').to_json(orient='records'))

4. JSON Data Operations and Transformation

4.1 JSON Data Cleaning and Preprocessing

python
# Process irregular JSON data
irregular_json = '''
[
    {"name": "Alice", "age": "25", "salary": "8000", "active": "true"},
    {"name": "Bob", "age": "30", "salary": "7000", "active": "false"},
    {"name": "Charlie", "age": null, "salary": "9000", "active": "true"},
    {"name": "David", "age": "32", "salary": null, "active": "1"}
]
'''

from io import StringIO

df_irregular = pd.read_json(StringIO(irregular_json))
print("Original irregular data:")
print(df_irregular)
print("\nData types:")
print(df_irregular.dtypes)

# Data type conversion and cleaning
df_cleaned = df_irregular.copy()

# Convert numeric types
df_cleaned['age'] = pd.to_numeric(df_cleaned['age'], errors='coerce')
df_cleaned['salary'] = pd.to_numeric(df_cleaned['salary'], errors='coerce')

# Convert boolean type
def convert_to_bool(value):
    if pd.isna(value):
        return False
    if isinstance(value, str):
        return value.lower() in ['true', '1', 'yes']
    return bool(value)

df_cleaned['active'] = df_cleaned['active'].apply(convert_to_bool)

print("\nCleaned data:")
print(df_cleaned)
print("\nCleaned data types:")
print(df_cleaned.dtypes)

4.2 JSON Data Merging and Splitting

python
# Merge data from multiple JSON files
json_file1 = '''
[
    {"id": 1, "name": "Alice", "department": "IT"},
    {"id": 2, "name": "Bob", "department": "HR"}
]
'''

json_file2 = '''
[
    {"id": 1, "salary": 8000, "join_date": "2020-01-15"},
    {"id": 2, "salary": 7000, "join_date": "2019-03-20"}
]
'''

from io import StringIO

df1 = pd.read_json(StringIO(json_file1))
df2 = pd.read_json(StringIO(json_file2))

# Merge data
df_merged = pd.merge(df1, df2, on='id')
print("Merged data:")
print(df_merged)

# Split complex fields
complex_json = '''
[
    {"id": 1, "name": "Alice", "contact": "email:alice@example.com;phone:13800138000"},
    {"id": 2, "name": "Bob", "contact": "email:bob@example.com;phone:13900139000"}
]
'''

from io import StringIO

df_complex = pd.read_json(StringIO(complex_json))
print("\nOriginal complex data:")
print(df_complex)

# Split contact field
def parse_contact(contact_str):
    parts = contact_str.split(';')
    result = {}
    for part in parts:
        key, value = part.split(':')
        result[key] = value
    return result

contact_data = df_complex['contact'].apply(parse_contact)
contact_df = pd.json_normalize(contact_data)

# Combine split data
df_expanded = pd.concat([df_complex[['id', 'name']], contact_df], axis=1)
print("\nSplit data:")
print(df_expanded)

4.3 Processing JSON Array Fields

python
# Process JSON data with arrays
array_json = '''
[
    {
        "id": 1,
        "name": "Alice",
        "skills": ["Python", "SQL", "Machine Learning"],
        "projects": [
            {"name": "Project A", "status": "Completed"},
            {"name": "Project B", "status": "In Progress"}
        ]
    },
    {
        "id": 2,
        "name": "Bob",
        "skills": ["Java", "Spring", "MySQL"],
        "projects": [
            {"name": "Project C", "status": "Completed"}
        ]
    }
]
'''

data = json.loads(array_json)

# Expand skills array
skills_df = json_normalize(data, record_path='skills', 
                          meta=['id', 'name'], 
                          record_prefix='skill_')
print("Expanded skills data:")
print(skills_df)

# Expand projects array
projects_df = json_normalize(data, record_path='projects', 
                            meta=['id', 'name'])
print("\nExpanded projects data:")
print(projects_df)

# Count skills per person
skills_count = skills_df.groupby(['id', 'name']).size().reset_index(name='skills_count')
print("\nSkills count statistics:")
print(skills_count)
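When the list column already lives in a DataFrame, `DataFrame.explode` is a lighter alternative to json_normalize for expanding it into one row per element:

```python
import pandas as pd

df_people = pd.DataFrame({
    'id': [1, 2],
    'name': ['Alice', 'Bob'],
    'skills': [['Python', 'SQL', 'Machine Learning'], ['Java', 'Spring', 'MySQL']]
})

# One row per skill; scalar columns are repeated for each element
exploded = df_people.explode('skills', ignore_index=True)
print(exploded)

# Per-person skill counts match the groupby approach above
print(exploded.groupby('name')['skills'].count())
```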

5. Processing API Response Data

5.1 Simulated API Response Processing

python
# Simulated API response data
api_response = '''
{
    "status": "success",
    "message": "Data retrieved successfully",
    "data": {
        "total": 100,
        "page": 1,
        "per_page": 10,
        "users": [
            {
                "id": 1,
                "username": "alice_smith",
                "profile": {
                    "name": "Alice",
                    "age": 25,
                    "email": "alice@example.com"
                },
                "last_login": "2023-12-01T10:30:00Z",
                "is_active": true
            },
            {
                "id": 2,
                "username": "bob_jones",
                "profile": {
                    "name": "Bob",
                    "age": 30,
                    "email": "bob@example.com"
                },
                "last_login": "2023-12-02T14:20:00Z",
                "is_active": false
            }
        ]
    }
}
'''

# Parse API response
response_data = json.loads(api_response)

# Extract user data
if response_data['status'] == 'success':
    users_data = response_data['data']['users']
    df_users = json_normalize(users_data)
    
    print("API user data:")
    print(df_users)
    
    # Process timestamp
    df_users['last_login'] = pd.to_datetime(df_users['last_login'])
    
    print("\nAfter processing timestamp:")
    print(df_users[['username', 'last_login', 'is_active']])
else:
    print(f"API request failed: {response_data['message']}")

5.2 Batch Processing API Data

python
# Simulated batch API data processing
def process_api_batch(api_responses):
    """
    Batch process API response data
    """
    all_data = []
    
    for i, response in enumerate(api_responses):
        try:
            data = json.loads(response)
            if data['status'] == 'success':
                # Add batch info
                batch_data = data['data']['users']
                for item in batch_data:
                    item['batch_id'] = i + 1
                    item['processed_at'] = datetime.now().isoformat()
                
                all_data.extend(batch_data)
            else:
                print(f"Batch {i+1} processing failed: {data['message']}")
        except json.JSONDecodeError as e:
            print(f"Batch {i+1} JSON parsing error: {e}")
        except Exception as e:
            print(f"Batch {i+1} processing error: {e}")
    
    if all_data:
        df_result = json_normalize(all_data)
        return df_result
    else:
        return pd.DataFrame()

# Simulate multiple API responses
api_responses = [api_response, api_response.replace('alice_smith', 'charlie_brown')]
df_batch = process_api_batch(api_responses)

print("Batch processing results:")
print(df_batch[['username', 'profile.name', 'batch_id']])

6. JSON Data Analysis Cases

6.1 E-commerce Order Data Analysis

python
# E-commerce order JSON data
ecommerce_data = '''
[
    {
        "order_id": "ORD001",
        "customer": {
            "id": 1001,
            "name": "Alice",
            "city": "Beijing"
        },
        "items": [
            {"product_id": "P001", "name": "Laptop", "price": 5999, "quantity": 1},
            {"product_id": "P002", "name": "Mouse", "price": 99, "quantity": 2}
        ],
        "order_date": "2023-12-01",
        "status": "Completed"
    },
    {
        "order_id": "ORD002",
        "customer": {
            "id": 1002,
            "name": "Bob",
            "city": "Shanghai"
        },
        "items": [
            {"product_id": "P003", "name": "Phone", "price": 3999, "quantity": 1}
        ],
        "order_date": "2023-12-02",
        "status": "Shipped"
    }
]
'''

orders_data = json.loads(ecommerce_data)

# Analyze order data
# 1. Basic order info
orders_df = json_normalize(orders_data)
print("Order basic info:")
print(orders_df[['order_id', 'customer.name', 'customer.city', 'order_date', 'status']])

# 2. Expand item details
items_df = json_normalize(orders_data, record_path='items', 
                         meta=['order_id', ['customer', 'name'], ['customer', 'city']])
print("\nItem details:")
print(items_df)

# 3. Calculate order total
items_df['total_price'] = items_df['price'] * items_df['quantity']
order_totals = items_df.groupby('order_id')['total_price'].sum().reset_index()
print("\nOrder totals:")
print(order_totals)

# 4. City sales statistics
city_sales = items_df.groupby('customer.city')['total_price'].sum().reset_index()
print("\nCity sales statistics:")
print(city_sales)

6.2 Social Media Data Analysis

python
# Social media post data
social_media_data = '''
[
    {
        "post_id": "POST001",
        "author": {
            "username": "user123",
            "followers": 1500,
            "verified": true
        },
        "content": "Beautiful weather today!",
        "timestamp": "2023-12-01T10:30:00Z",
        "engagement": {
            "likes": 45,
            "comments": 12,
            "shares": 8
        },
        "hashtags": ["#weather", "#mood"],
        "mentions": ["@friend1", "@friend2"]
    },
    {
        "post_id": "POST002",
        "author": {
            "username": "user456",
            "followers": 3200,
            "verified": false
        },
        "content": "Sharing an interesting tech article",
        "timestamp": "2023-12-01T15:45:00Z",
        "engagement": {
            "likes": 89,
            "comments": 23,
            "shares": 15
        },
        "hashtags": ["#tech", "#sharing", "#learning"],
        "mentions": []
    }
]
'''

social_data = json.loads(social_media_data)

# Analyze social media data
posts_df = json_normalize(social_data)
print("Post data:")
print(posts_df[['post_id', 'author.username', 'author.followers', 'content']])

# Calculate engagement metrics
posts_df['total_engagement'] = (posts_df['engagement.likes'] + 
                               posts_df['engagement.comments'] + 
                               posts_df['engagement.shares'])

posts_df['engagement_rate'] = posts_df['total_engagement'] / posts_df['author.followers']

print("\nEngagement analysis:")
print(posts_df[['post_id', 'author.username', 'total_engagement', 'engagement_rate']])

# Expand hashtag data
# Nested meta keys must be given as lists, e.g. ['author', 'username']
hashtags_df = json_normalize(social_data, record_path='hashtags',
                            meta=['post_id', ['author', 'username']],
                            record_prefix='hashtag_')
print("\nHashtag usage statistics:")
hashtag_counts = hashtags_df['hashtag_0'].value_counts()
print(hashtag_counts)

7. Performance Optimization and Best Practices

7.1 Large JSON File Processing

python
# Strategies for processing large JSON files
def process_large_json(file_path, chunk_size=1000):
    """
    Stream a large JSON file (an array of flat objects) in chunks.
    Note: this simple event handler does not descend into nested objects.
    """
    import ijson  # Need to install: pip install ijson
    
    results = []
    current_chunk = []
    current_item = {}
    current_key = None
    
    with open(file_path, 'rb') as file:
        # Use ijson for streaming parsing
        for prefix, event, value in ijson.parse(file):
            if event == 'start_map':
                current_item = {}
            elif event == 'map_key':
                current_key = value
            elif event in ('string', 'number', 'boolean', 'null'):
                current_item[current_key] = value
            elif event == 'end_map':
                current_chunk.append(current_item)
                
                if len(current_chunk) >= chunk_size:
                    # Process current chunk
                    results.append(pd.DataFrame(current_chunk))
                    current_chunk = []
        
        # Process last chunk
        if current_chunk:
            results.append(pd.DataFrame(current_chunk))
    
    return pd.concat(results, ignore_index=True)

# Memory optimization tips
def optimize_json_dataframe(df):
    """
    Optimize JSON DataFrame memory usage
    """
    # Convert data types
    for col in df.columns:
        if df[col].dtype == 'object':
            # Try converting to numeric (errors='ignore' is deprecated,
            # so catch the failure explicitly)
            try:
                df[col] = pd.to_numeric(df[col])
            except (ValueError, TypeError):
                # Convert to category type (if many duplicates)
                if df[col].nunique() / len(df) < 0.5:
                    df[col] = df[col].astype('category')
    
    return df
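If the large file can be stored as JSON Lines, pandas itself supports streaming: `read_json` with `lines=True` and `chunksize` returns an iterator of DataFrames, with no extra dependency. A sketch (using an in-memory string; in practice, pass a file path):

```python
from io import StringIO
import pandas as pd

# Simulate a JSON Lines source: one object per line
jsonl = '\n'.join('{"id": %d, "value": %d}' % (i, i * 10) for i in range(10))

chunks = []
reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=4)
for chunk in reader:
    # Each chunk is a small DataFrame; filter or aggregate here
    chunks.append(chunk)

df_stream = pd.concat(chunks, ignore_index=True)
print(len(df_stream), df_stream['value'].sum())
```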

7.2 JSON Data Validation

python
# JSON data validation
def validate_json_data(df, schema):
    """
    Validate JSON data against expected format
    """
    validation_errors = []
    
    # Check required fields
    for field in schema.get('required', []):
        if field not in df.columns:
            validation_errors.append(f"Missing required field: {field}")
        elif df[field].isnull().any():
            validation_errors.append(f"Field {field} contains null values")
    
    # Check data types
    for field, expected_type in schema.get('types', {}).items():
        if field in df.columns:
            if expected_type == 'numeric':
                if not pd.api.types.is_numeric_dtype(df[field]):
                    validation_errors.append(f"Field {field} should be numeric type")
            elif expected_type == 'datetime':
                try:
                    pd.to_datetime(df[field])
                except (ValueError, TypeError):
                    validation_errors.append(f"Field {field} should be datetime type")
    
    # Check numeric ranges
    for field, range_config in schema.get('ranges', {}).items():
        if field in df.columns:
            min_val, max_val = range_config
            if (df[field] < min_val).any() or (df[field] > max_val).any():
                validation_errors.append(f"Field {field} values out of range [{min_val}, {max_val}]")
    
    return validation_errors

# Define data schema
data_schema = {
    'required': ['id', 'name', 'age'],
    'types': {
        'id': 'numeric',
        'age': 'numeric',
        'join_date': 'datetime'
    },
    'ranges': {
        'age': (0, 120),
        'salary': (0, 1000000)
    }
}

# Validate data
validation_errors = validate_json_data(df, data_schema)
if validation_errors:
    print("Data validation errors:")
    for error in validation_errors:
        print(f"- {error}")
else:
    print("Data validation passed")

8. Practical Application Scenarios

8.1 Configuration File Management

python
# Application configuration management
config_json = '''
{
    "database": {
        "host": "localhost",
        "port": 5432,
        "name": "myapp",
        "credentials": {
            "username": "admin",
            "password": "secret123"
        }
    },
    "api": {
        "base_url": "https://api.example.com",
        "timeout": 30,
        "retry_count": 3
    },
    "features": {
        "enable_logging": true,
        "debug_mode": false,
        "max_file_size": 10485760
    }
}
'''

config_data = json.loads(config_json)
config_df = json_normalize(config_data)
print("Configuration data:")
print(config_df)

# Configuration item management
def get_config_value(config_df, key_path):
    """
    Get nested configuration value
    """
    if key_path in config_df.columns:
        return config_df[key_path].iloc[0]
    return None

db_host = get_config_value(config_df, 'database.host')
api_timeout = get_config_value(config_df, 'api.timeout')
print(f"\nDatabase host: {db_host}")
print(f"API timeout: {api_timeout}")

8.2 Log Data Analysis

python
# Application log data
log_data = '''
[
    {
        "timestamp": "2023-12-01T10:30:15.123Z",
        "level": "INFO",
        "message": "User login successful",
        "user_id": 1001,
        "ip_address": "192.168.1.100",
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "response_time": 245
    },
    {
        "timestamp": "2023-12-01T10:31:22.456Z",
        "level": "ERROR",
        "message": "Database connection failed",
        "error_code": "DB_CONNECTION_ERROR",
        "ip_address": "192.168.1.101",
        "response_time": 5000
    },
    {
        "timestamp": "2023-12-01T10:32:10.789Z",
        "level": "WARN",
        "message": "API call rate too high",
        "user_id": 1002,
        "ip_address": "192.168.1.102",
        "api_endpoint": "/api/users",
        "response_time": 1200
    }
]
'''

from io import StringIO

logs_df = pd.read_json(StringIO(log_data))
logs_df['timestamp'] = pd.to_datetime(logs_df['timestamp'])

print("Log data analysis:")
print(logs_df[['timestamp', 'level', 'message', 'response_time']])

# Log level statistics
log_level_counts = logs_df['level'].value_counts()
print("\nLog level statistics:")
print(log_level_counts)

# Response time analysis
print("\nResponse time statistics:")
print(logs_df['response_time'].describe())

# Error log analysis
error_logs = logs_df[logs_df['level'] == 'ERROR']
print("\nError logs:")
print(error_logs[['timestamp', 'message', 'error_code']])

9. Common Issues and Solutions

9.1 Encoding Issues

python
# Handle encoding issues
def safe_read_json(file_path, encodings=('utf-8', 'gbk', 'gb2312')):
    """
    Safely read JSON files with potential encoding issues
    """
    for encoding in encodings:
        try:
            with open(file_path, 'r', encoding=encoding) as f:
                data = json.load(f)
            return pd.json_normalize(data)
        except UnicodeDecodeError:
            continue
        except json.JSONDecodeError as e:
            print(f"JSON parsing error ({encoding}): {e}")
            continue
    
    raise ValueError(f"Unable to read file with any encoding: {file_path}")

# Handle special characters
def clean_json_string(json_str):
    """
    Clean special characters from JSON string
    """
    # Remove BOM mark
    if json_str.startswith('\ufeff'):
        json_str = json_str[1:]
    
    # Replace common problematic characters
    replacements = {
        '\u0000': '',  # NULL character
        '\u000b': ' ',  # Vertical tab
        '\u000c': ' ',  # Form feed
    }
    
    for old, new in replacements.items():
        json_str = json_str.replace(old, new)
    
    return json_str

9.2 Performance Issues

python
# Performance optimization tips
import time

def benchmark_json_methods(data, iterations=1000):
    """
    Compare performance of different JSON processing methods
    """
    json_str = json.dumps(data)
    from io import StringIO  # pd.read_json needs a file-like object for strings
    
    # Method 1: pd.read_json
    start_time = time.time()
    for _ in range(iterations):
        df1 = pd.read_json(StringIO(json_str))
    time1 = time.time() - start_time
    
    # Method 2: json.loads + DataFrame
    start_time = time.time()
    for _ in range(iterations):
        data_parsed = json.loads(json_str)
        df2 = pd.DataFrame(data_parsed)
    time2 = time.time() - start_time
    
    # Method 3: json_normalize
    start_time = time.time()
    for _ in range(iterations):
        data_parsed = json.loads(json_str)
        df3 = json_normalize(data_parsed)
    time3 = time.time() - start_time
    
    print(f"pd.read_json: {time1:.4f} seconds")
    print(f"json.loads + DataFrame: {time2:.4f} seconds")
    print(f"json_normalize: {time3:.4f} seconds")

# Test performance
test_data = [{'id': i, 'name': f'User{i}', 'value': i*10} for i in range(100)]
benchmark_json_methods(test_data, 100)

Chapter Summary

This chapter covered various methods and techniques for handling JSON data with Pandas:

  1. JSON Basics: Understanding JSON format and common data structures
  2. Reading JSON: From strings, files, and different formats
  3. Processing Nested Data: Using json_normalize for complex nested JSON
  4. Writing JSON: Exporting DataFrame to different JSON formats
  5. Data Operations: Cleaning, transforming, and analyzing JSON data
  6. API Data Processing: Handling API responses and batch data
  7. Practical Applications: E-commerce, social media, configuration management, etc.
  8. Performance Optimization: Large file processing and memory optimization
  9. Problem Solving: Encoding and performance issue solutions

JSON is the standard format for modern data exchange, and mastering its processing skills is crucial for data analysis work. Through this chapter, you should be able to proficiently handle various JSON data formats and apply these skills in real projects.

Exercises

  1. Create JSON data with nested structures and convert it to DataFrame using different methods
  2. Write a function to handle paginated API data and merge results from multiple pages
  3. Implement a JSON data validator to check data completeness and format correctness
  4. Process a JSON file containing time series data and perform time analysis
  5. Optimize the processing workflow for a large JSON file to improve efficiency

In the next chapter, we will learn about Pandas common functions that will help us process and analyze data more efficiently.
