
Pandas CSV and Excel Handling

CSV and Excel files are the most common data formats in data analysis. This chapter will detail how to use Pandas to read, process, and save data in these file formats.

📚 File Format Overview

CSV File Characteristics

  • Comma-Separated Values: Standard text format
  • Plain text format: Can be opened with any text editor
  • Cross-platform compatible: Supported by almost all systems
  • Small file size: More space-efficient than Excel files
  • Fast processing: High read/write efficiency

Excel File Characteristics

  • Multiple worksheets: One file can contain multiple sheets
  • Rich formatting: Supports fonts, colors, formulas, etc.
  • Data type preservation: Automatically recognizes numbers, dates, etc.
  • Business standard: Widely used in business environments
  • Larger file size: Contains formatting information
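To make the size trade-off above concrete, here is a minimal sketch comparing the same table saved in both formats. The file names are illustrative, the Excel write assumes openpyxl is installed, and exact sizes vary by library version and data:

```python
import os
import tempfile

import pandas as pd

# The same 1,000-row table written in both formats
df = pd.DataFrame({
    'ID': range(1000),
    'Value': [x * 1.5 for x in range(1000)],
})

tmp = tempfile.mkdtemp()
csv_path = os.path.join(tmp, 'sample.csv')
xlsx_path = os.path.join(tmp, 'sample.xlsx')

df.to_csv(csv_path, index=False)
csv_size = os.path.getsize(csv_path)
print(f"CSV:   {csv_size} bytes")

try:
    df.to_excel(xlsx_path, index=False)  # needs openpyxl
    print(f"Excel: {os.path.getsize(xlsx_path)} bytes")
except ImportError:
    print("openpyxl not installed; skipping the Excel write")
```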

📖 CSV File Processing

Reading CSV Files

Basic Reading

python
import pandas as pd
import numpy as np

# Create sample CSV data (simulating file content)
csv_content = """Name,Age,City,Salary
Alice,25,Beijing,8000
Bob,30,Shanghai,12000
Charlie,35,Guangzhou,15000
David,28,Shenzhen,9500
Eve,32,Hangzhou,11000"""

# Write content to file (in actual use, you already have a CSV file)
with open('employees.csv', 'w', encoding='utf-8') as f:
    f.write(csv_content)

# Read CSV file
df = pd.read_csv('employees.csv')
print("Basic read result:")
print(df)
print(f"\nData types:\n{df.dtypes}")

Reading with Parameters

python
# Specify encoding
df_encoding = pd.read_csv('employees.csv', encoding='utf-8')

# Specify delimiter (if not comma)
# df_tab = pd.read_csv('data.txt', sep='\t')  # Tab-separated
# df_semicolon = pd.read_csv('data.csv', sep=';')  # Semicolon-separated

# Specify index column
df_index = pd.read_csv('employees.csv', index_col='Name')
print("With specified index column:")
print(df_index)

# Select specific columns
df_selected = pd.read_csv('employees.csv', usecols=['Name', 'Salary'])
print("\nSelected columns:")
print(df_selected)

# Skip rows
df_skip = pd.read_csv('employees.csv', skiprows=1)  # Skips the header row, so the first data row is used as the header
print("\nAfter skipping the first row:")
print(df_skip)
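One related case the parameters above don't cover: a file with no header row at all. A small sketch (the file name and column names are made up for illustration):

```python
import pandas as pd

# A headerless CSV: by default read_csv would treat the first data row as the header
raw = "Alice,25,Beijing\nBob,30,Shanghai\n"
with open('no_header.csv', 'w', encoding='utf-8') as f:
    f.write(raw)

# header=None tells pandas there is no header row; names= supplies column labels
df = pd.read_csv('no_header.csv', header=None, names=['Name', 'Age', 'City'])
print(df)
print(list(df.columns))  # ['Name', 'Age', 'City']
```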

Handling Missing Values and Special Cases

python
# Create CSV with missing values
csv_with_nan = """Name,Age,City,Salary
Alice,25,Beijing,8000
Bob,,Shanghai,12000
Charlie,35,,15000
David,28,Shenzhen,
Eve,32,Hangzhou,11000"""

with open('employees_nan.csv', 'w', encoding='utf-8') as f:
    f.write(csv_with_nan)

# Handle missing values when reading
df_nan = pd.read_csv('employees_nan.csv')
print("Data with missing values:")
print(df_nan)
print(f"\nMissing value count:\n{df_nan.isnull().sum()}")

# Specify missing value indicators
df_na_values = pd.read_csv('employees_nan.csv', na_values=['', 'NULL', 'N/A'])
print("\nAfter specifying missing value indicators:")
print(df_na_values)

Specifying Data Types

python
# Specify data types
dtype_dict = {
    'Name': 'string',
    'Age': 'Int64',  # Nullable integer type
    'City': 'category',
    'Salary': 'float64'
}

df_typed = pd.read_csv('employees.csv', dtype=dtype_dict)
print("After specifying data types:")
print(df_typed.dtypes)
df_typed.info()  # info() prints its report directly and returns None

Writing CSV Files

Basic Writing

python
# Create sample data
data = {
    'Product': ['Laptop', 'Desktop', 'Tablet', 'Smartphone'],
    'Price': [5999, 3999, 2999, 1999],
    'Stock': [50, 30, 80, 120],
    'Category': ['Computer', 'Computer', 'Computer', 'Phone']
}

products_df = pd.DataFrame(data)

# Save as CSV
products_df.to_csv('products.csv', index=False, encoding='utf-8')
print("Data saved to products.csv")

# Verify saved result
saved_df = pd.read_csv('products.csv')
print("\nSaved data:")
print(saved_df)

Advanced Writing Options

python
# Keep index
products_df.to_csv('products_with_index.csv', encoding='utf-8')

# Specify delimiter
products_df.to_csv('products_tab.txt', sep='\t', index=False)

# Select specific columns
products_df[['Product', 'Price']].to_csv('products_simple.csv', index=False)

# Append mode: mode='a' creates the file if it does not exist,
# but header=False assumes the file already has a header row
new_product = pd.DataFrame({
    'Product': ['Smartwatch'],
    'Price': [1299],
    'Stock': [60],
    'Category': ['Electronics']
})

# Append to existing file
new_product.to_csv('products.csv', mode='a', header=False, index=False)

print("\nAfter appending data:")
print(pd.read_csv('products.csv'))
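The append step above can be wrapped in a small helper that writes the header only when the file is new, which avoids both the missing-header and the duplicated-header pitfalls of `mode='a'`. A sketch, with a hypothetical file name and products:

```python
import os

import pandas as pd

def append_rows(df, path):
    """Append df to a CSV, writing the header only when the file is new."""
    write_header = not os.path.exists(path)
    df.to_csv(path, mode='a', header=write_header, index=False)

part1 = pd.DataFrame({'Product': ['Monitor'], 'Price': [899]})
part2 = pd.DataFrame({'Product': ['Keyboard'], 'Price': [199]})

path = 'append_demo.csv'
if os.path.exists(path):
    os.remove(path)  # start from a clean file for the demo

append_rows(part1, path)  # file created, header written once
append_rows(part2, path)  # appended without a second header

print(pd.read_csv(path))
```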

📊 Excel File Processing

Installing Dependencies

python
# Additional libraries needed for Excel files
# pip install openpyxl  # For .xlsx files
# pip install xlrd      # For .xls files
# pip install xlsxwriter # For writing Excel files

Reading Excel Files

Basic Reading

python
# First create an Excel file for demonstration
with pd.ExcelWriter('company_data.xlsx', engine='openpyxl') as writer:
    # Employee info sheet
    employees = pd.DataFrame({
        'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Department': ['Tech', 'Sales', 'Tech', 'HR', 'Finance'],
        'Salary': [8000, 12000, 15000, 9500, 11000],
        'JoinDate': pd.date_range('2020-01-01', periods=5, freq='3M')
    })
    
    # Department info sheet
    departments = pd.DataFrame({
        'DeptName': ['Tech', 'Sales', 'HR', 'Finance'],
        'Manager': ['Manager A', 'Manager B', 'Manager C', 'Manager D'],
        'HeadCount': [15, 8, 5, 6]
    })
    
    employees.to_excel(writer, sheet_name='Employees', index=False)
    departments.to_excel(writer, sheet_name='Departments', index=False)

print("Excel file created")

# Read Excel file (default reads first sheet)
df_excel = pd.read_excel('company_data.xlsx')
print("\nRead first worksheet:")
print(df_excel)

Reading Specific Worksheets

python
# Read specified worksheet
employees_sheet = pd.read_excel('company_data.xlsx', sheet_name='Employees')
departments_sheet = pd.read_excel('company_data.xlsx', sheet_name='Departments')

print("Employees sheet:")
print(employees_sheet)
print("\nDepartments sheet:")
print(departments_sheet)

# Read multiple worksheets
all_sheets = pd.read_excel('company_data.xlsx', sheet_name=None)
print(f"\nWorksheet names: {list(all_sheets.keys())}")

# Read specific multiple worksheets
selected_sheets = pd.read_excel('company_data.xlsx', 
                               sheet_name=['Employees', 'Departments'])
print(f"\nSelected worksheets: {list(selected_sheets.keys())}")
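Since `sheet_name=None` returns a dict mapping sheet names to DataFrames, a common follow-up is stacking all sheets into one frame while tagging each row with its sheet of origin. A sketch using a dict shaped like `read_excel`'s return value (the sheet names and data here are made up):

```python
import pandas as pd

# pd.read_excel(path, sheet_name=None) returns {sheet_name: DataFrame};
# mimic that shape here, then stack the sheets into one frame
sheets = {
    'North': pd.DataFrame({'Revenue': [100, 200]}),
    'South': pd.DataFrame({'Revenue': [300]}),
}

# Tag each frame with its sheet name, then concatenate
combined = pd.concat(
    [df.assign(Region=name) for name, df in sheets.items()],
    ignore_index=True,
)
print(combined)
```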

Advanced Reading Options

python
# Specify reading range
df_range = pd.read_excel('company_data.xlsx', 
                        sheet_name='Employees',
                        usecols='A:D',  # Only read columns A to D
                        nrows=3)        # Only read first 3 rows
print("Reading specified range:")
print(df_range)

# Skip rows and specify header row
df_skip = pd.read_excel('company_data.xlsx',
                       sheet_name='Employees',
                       skiprows=1,     # Skip first row
                       header=0)       # Second row as header
print("\nAfter skipping rows:")
print(df_skip)

Writing Excel Files

Basic Writing

python
# Create sales data
sales_data = pd.DataFrame({
    'Month': pd.date_range('2024-01-01', periods=12, freq='M'),
    'Revenue': [120000, 135000, 142000, 158000, 163000, 171000,
              185000, 192000, 178000, 165000, 155000, 148000],
    'Cost': [80000, 85000, 90000, 95000, 98000, 102000,
            108000, 115000, 110000, 105000, 98000, 92000]
})

sales_data['Profit'] = sales_data['Revenue'] - sales_data['Cost']
sales_data['ProfitMargin'] = (sales_data['Profit'] / sales_data['Revenue'] * 100).round(2)

# Save to Excel
sales_data.to_excel('sales_report.xlsx', sheet_name='Monthly Sales', index=False)
print("Sales report saved")

Multiple Worksheet Writing

python
# Create multiple data tables
q1_data = sales_data.iloc[:3].copy()
q2_data = sales_data.iloc[3:6].copy()
q3_data = sales_data.iloc[6:9].copy()
q4_data = sales_data.iloc[9:12].copy()

# Write to multiple worksheets
with pd.ExcelWriter('quarterly_report.xlsx', engine='openpyxl') as writer:
    q1_data.to_excel(writer, sheet_name='Q1', index=False)
    q2_data.to_excel(writer, sheet_name='Q2', index=False)
    q3_data.to_excel(writer, sheet_name='Q3', index=False)
    q4_data.to_excel(writer, sheet_name='Q4', index=False)
    
    # Annual summary
    annual_summary = pd.DataFrame({
        'Quarter': ['Q1', 'Q2', 'Q3', 'Q4'],
        'Revenue': [q1_data['Revenue'].sum(), q2_data['Revenue'].sum(),
                 q3_data['Revenue'].sum(), q4_data['Revenue'].sum()],
        'Profit': [q1_data['Profit'].sum(), q2_data['Profit'].sum(),
               q3_data['Profit'].sum(), q4_data['Profit'].sum()]
    })
    
    annual_summary.to_excel(writer, sheet_name='Annual Summary', index=False)

print("Quarterly report saved")

Formatting Excel Output

python
# Use xlsxwriter engine for formatting
with pd.ExcelWriter('formatted_report.xlsx', engine='xlsxwriter') as writer:
    sales_data.to_excel(writer, sheet_name='Sales Data', index=False)
    
    # Get workbook and worksheet objects
    workbook = writer.book
    worksheet = writer.sheets['Sales Data']
    
    # Define formats
    header_format = workbook.add_format({
        'bold': True,
        'text_wrap': True,
        'valign': 'top',
        'fg_color': '#D7E4BC',
        'border': 1
    })
    
    currency_format = workbook.add_format({'num_format': '$#,##0'})
    # ProfitMargin already holds values like 12.34, so use a literal "%" suffix;
    # Excel's '0.00%' format would multiply by 100 again
    percent_format = workbook.add_format({'num_format': '0.00"%"'})
    date_format = workbook.add_format({'num_format': 'yyyy-mm-dd'})
    
    # Apply the header format
    for col_num, value in enumerate(sales_data.columns.values):
        worksheet.write(0, col_num, value, header_format)
    
    # Set column widths and number formats
    worksheet.set_column('A:A', 12, date_format)      # Month column
    worksheet.set_column('B:D', 15, currency_format)  # Revenue, Cost, Profit
    worksheet.set_column('E:E', 15, percent_format)   # ProfitMargin
    
print("Formatted report saved")

🔧 Advanced File Processing Techniques

Large File Processing

Chunked Reading

python
# Create large file example
large_data = pd.DataFrame({
    'ID': range(1, 10001),
    'Value1': np.random.randn(10000),
    'Value2': np.random.randn(10000),
    'Category': np.random.choice(['A', 'B', 'C'], 10000)
})

large_data.to_csv('large_file.csv', index=False)

# Read large file in chunks
chunk_size = 1000
results = []

for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk
    processed_chunk = chunk[chunk['Value1'] > 0]  # Example: filter positive values
    results.append(processed_chunk)

# Combine results
final_result = pd.concat(results, ignore_index=True)
print(f"Original data: {len(large_data)} rows")
print(f"Processed data: {len(final_result)} rows")
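The same chunked pattern also works for aggregation: instead of collecting filtered chunks, accumulate per-group sums chunk by chunk, so the full file never needs to fit in memory. A sketch using a hypothetical demo file:

```python
import numpy as np
import pandas as pd

# Create a demo file to aggregate
pd.DataFrame({
    'Category': np.random.choice(['A', 'B', 'C'], 10_000),
    'Value': np.random.randn(10_000),
}).to_csv('chunk_agg_demo.csv', index=False)

totals = pd.Series(dtype='float64')
row_count = 0

for chunk in pd.read_csv('chunk_agg_demo.csv', chunksize=1_000):
    # Add this chunk's per-category sums into the running totals;
    # fill_value=0 handles categories not yet seen
    totals = totals.add(chunk.groupby('Category')['Value'].sum(), fill_value=0)
    row_count += len(chunk)

print(totals)
print(f"Rows processed: {row_count}")
```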

Memory-Optimized Reading

python
# Optimize data types to save memory
dtype_optimization = {
    'ID': 'int32',
    'Value1': 'float32',
    'Value2': 'float32',
    'Category': 'category'
}

# Compare memory usage
df_default = pd.read_csv('large_file.csv')
df_optimized = pd.read_csv('large_file.csv', dtype=dtype_optimization)

print(f"Default type memory usage: {df_default.memory_usage(deep=True).sum()} bytes")
print(f"Optimized type memory usage: {df_optimized.memory_usage(deep=True).sum()} bytes")
print(f"Memory savings: {(1 - df_optimized.memory_usage(deep=True).sum()/df_default.memory_usage(deep=True).sum())*100:.1f}%")

File Format Conversion

CSV and Excel Conversion

python
# CSV to Excel
csv_df = pd.read_csv('employees.csv')
csv_df.to_excel('employees_from_csv.xlsx', index=False)

# Excel to CSV
excel_df = pd.read_excel('company_data.xlsx', sheet_name='Employees')
excel_df.to_csv('employees_from_excel.csv', index=False)

print("File format conversion complete")

Batch Processing Multiple Files

python
import os
import glob

# Create multiple CSV files for demonstration
for i in range(3):
    sample_data = pd.DataFrame({
        'Date': pd.date_range(f'2024-0{i+1}-01', periods=5, freq='D'),
        'Quantity': np.random.randint(50, 200, 5),
        'Revenue': np.random.randint(1000, 5000, 5)
    })
    sample_data.to_csv(f'sales_month_{i+1}.csv', index=False)

# Batch read and combine
csv_files = glob.glob('sales_month_*.csv')
all_data = []

for file in csv_files:
    df = pd.read_csv(file)
    df['SourceFile'] = file
    all_data.append(df)

combined_data = pd.concat(all_data, ignore_index=True)
print("Combined data:")
print(combined_data.head(10))

# Save combined result
combined_data.to_excel('combined_sales.xlsx', index=False)
print("\nCombined data saved to Excel")

🎨 Practical Application Examples

Example 1: Financial Report Processing

python
# Create financial data
financial_data = {
    'Account': ['Revenue', 'Cost of Sales', 'Sales Expenses', 'Admin Expenses', 'Finance Expenses', 'Operating Profit'],
    'CurrentMonth': [1000000, 600000, 80000, 120000, 15000, 185000],
    'LastMonth': [950000, 570000, 75000, 115000, 12000, 178000],
    'LastYear': [900000, 540000, 70000, 110000, 10000, 170000]
}

financial_df = pd.DataFrame(financial_data)

# Calculate growth rates
financial_df['MoMGrowth'] = ((financial_df['CurrentMonth'] - financial_df['LastMonth']) / financial_df['LastMonth'] * 100).round(2)
financial_df['YoYGrowth'] = ((financial_df['CurrentMonth'] - financial_df['LastYear']) / financial_df['LastYear'] * 100).round(2)

print("Financial Report:")
print(financial_df)

# Save to formatted Excel
with pd.ExcelWriter('financial_report.xlsx', engine='xlsxwriter') as writer:
    financial_df.to_excel(writer, sheet_name='Financial Report', index=False)
    
    workbook = writer.book
    worksheet = writer.sheets['Financial Report']
    
    # Define formats
    header_format = workbook.add_format({
        'bold': True, 'bg_color': '#4F81BD', 'font_color': 'white'
    })
    
    currency_format = workbook.add_format({'num_format': '$#,##0'})
    percent_format = workbook.add_format({'num_format': '0.00"%"'})  # growth columns already hold percentage values
    
    # Apply formats
    for col_num, value in enumerate(financial_df.columns.values):
        worksheet.write(0, col_num, value, header_format)
    
    # Format numeric columns
    worksheet.set_column('B:D', 15, currency_format)
    worksheet.set_column('E:F', 12, percent_format)

print("\nFinancial report saved")

Example 2: Sales Data Analysis

python
# Create sales detail data
sales_detail = pd.DataFrame({
    'OrderID': [f'ORD{str(i).zfill(4)}' for i in range(1, 101)],
    'Customer': np.random.choice(['Customer A', 'Customer B', 'Customer C', 'Customer D', 'Customer E'], 100),
    'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3', 'Product 4'], 100),
    'Quantity': np.random.randint(1, 20, 100),
    'UnitPrice': np.random.choice([100, 150, 200, 250, 300], 100),
    'Date': pd.date_range('2024-01-01', periods=100, freq='D')
})

sales_detail['Amount'] = sales_detail['Quantity'] * sales_detail['UnitPrice']

# Data analysis
customer_summary = sales_detail.groupby('Customer').agg({
    'OrderID': 'count',
    'Quantity': 'sum',
    'Amount': 'sum'
}).rename(columns={'OrderID': 'OrderCount'})

product_summary = sales_detail.groupby('Product').agg({
    'OrderID': 'count',
    'Quantity': 'sum',
    'Amount': 'sum'
}).rename(columns={'OrderID': 'OrderCount'})

# Monthly summary
sales_detail['Month'] = sales_detail['Date'].dt.to_period('M')
monthly_summary = sales_detail.groupby('Month').agg({
    'OrderID': 'count',
    'Amount': 'sum'
}).rename(columns={'OrderID': 'OrderCount'})

# Save analysis results
with pd.ExcelWriter('sales_analysis.xlsx') as writer:
    sales_detail.to_excel(writer, sheet_name='Sales Detail', index=False)
    customer_summary.to_excel(writer, sheet_name='Customer Analysis')
    product_summary.to_excel(writer, sheet_name='Product Analysis')
    monthly_summary.to_excel(writer, sheet_name='Monthly Summary')

print("Sales analysis report generated")
print("\nCustomer Analysis:")
print(customer_summary.sort_values('Amount', ascending=False))

Example 3: Data Cleaning and Standardization

python
# Create dirty data
dirty_data = pd.DataFrame({
    'Name': ['Alice', ' Bob ', 'CHARLIE', 'David', 'Eve'],
    'Phone': ['138-0000-0000', '139 0000 0000', '(140)0000-0000', '141-0000-0000', '142.0000.0000'],
    'Email': ['alice@email.com', 'BOB@EMAIL.COM', 'charlie@Email.Com', 'david@email.com', 'eve@email.com'],
    'Age': ['25', '30 years', '35', 'unknown', '28'],
    'Salary': ['8000', '12,000', '15000 USD', 'N/A', '9500']
})

print("Original data:")
print(dirty_data)

# Data cleaning
cleaned_data = dirty_data.copy()

# Clean names: strip whitespace, standardize case
cleaned_data['Name'] = cleaned_data['Name'].str.strip().str.title()

# Clean phone: standardize format
cleaned_data['Phone'] = cleaned_data['Phone'].str.replace(r'[^\d]', '', regex=True)
cleaned_data['Phone'] = cleaned_data['Phone'].str.replace(r'(\d{3})(\d{4})(\d{4})', r'\1-\2-\3', regex=True)

# Clean email: lowercase
cleaned_data['Email'] = cleaned_data['Email'].str.lower()

# Clean age: extract numbers (expand=False returns a Series rather than a DataFrame)
cleaned_data['Age'] = cleaned_data['Age'].str.extract(r'(\d+)', expand=False).astype(float)

# Clean salary: extract numbers, handle thousands separator
cleaned_data['Salary'] = cleaned_data['Salary'].str.replace(r'[^\d,]', '', regex=True)
cleaned_data['Salary'] = cleaned_data['Salary'].str.replace(',', '')
cleaned_data['Salary'] = pd.to_numeric(cleaned_data['Salary'], errors='coerce')

print("\nCleaned data:")
print(cleaned_data)

# Save cleaning results
with pd.ExcelWriter('data_cleaning_result.xlsx') as writer:
    dirty_data.to_excel(writer, sheet_name='Original Data', index=False)
    cleaned_data.to_excel(writer, sheet_name='Cleaned Data', index=False)
    
    # Data quality report
    quality_report = pd.DataFrame({
        'Field': cleaned_data.columns,
        'MissingCount': [cleaned_data[col].isnull().sum() for col in cleaned_data.columns],
        'MissingRate': [f"{cleaned_data[col].isnull().sum()/len(cleaned_data)*100:.1f}%" for col in cleaned_data.columns]
    })
    
    quality_report.to_excel(writer, sheet_name='Data Quality Report', index=False)

print("\nData cleaning report saved")

🚀 Performance Optimization Tips

Reading Performance Optimization

python
import time

# Create test data
test_data = pd.DataFrame({
    'col1': range(100000),
    'col2': np.random.randn(100000),
    'col3': np.random.choice(['A', 'B', 'C'], 100000)
})

test_data.to_csv('performance_test.csv', index=False)

# Method 1: Default reading
start_time = time.time()
df1 = pd.read_csv('performance_test.csv')
default_time = time.time() - start_time

# Method 2: Specify data types
start_time = time.time()
df2 = pd.read_csv('performance_test.csv', dtype={
    'col1': 'int32',
    'col2': 'float32',
    'col3': 'category'
})
typed_time = time.time() - start_time

# Method 3: Using pyarrow engine (if installed)
try:
    start_time = time.time()
    df3 = pd.read_csv('performance_test.csv', engine='pyarrow')
    pyarrow_time = time.time() - start_time
except (ImportError, ValueError):  # pyarrow not installed or engine unavailable
    pyarrow_time = None

print(f"Default read time: {default_time:.4f} seconds")
print(f"Typed read time: {typed_time:.4f} seconds")
if pyarrow_time:
    print(f"PyArrow engine read time: {pyarrow_time:.4f} seconds")

print(f"\nMemory usage comparison:")
print(f"Default types: {df1.memory_usage(deep=True).sum()} bytes")
print(f"Optimized types: {df2.memory_usage(deep=True).sum()} bytes")

Writing Performance Optimization

python
# Compare different writing methods
large_df = pd.DataFrame({
    'A': range(50000),
    'B': np.random.randn(50000),
    'C': np.random.choice(['X', 'Y', 'Z'], 50000)
})

# Method 1: Standard CSV write
start_time = time.time()
large_df.to_csv('standard_write.csv', index=False)
standard_time = time.time() - start_time

# Method 2: Compressed write
start_time = time.time()
large_df.to_csv('compressed_write.csv.gz', index=False, compression='gzip')
compressed_time = time.time() - start_time

# Method 3: Parquet format (requires pyarrow or fastparquet; well suited to repeated reads)
start_time = time.time()
large_df.to_parquet('data.parquet', index=False)
parquet_time = time.time() - start_time

print(f"Standard CSV write time: {standard_time:.4f} seconds")
print(f"Compressed CSV write time: {compressed_time:.4f} seconds")
print(f"Parquet write time: {parquet_time:.4f} seconds")

# File size comparison
import os
print(f"\nFile size comparison:")
print(f"Standard CSV: {os.path.getsize('standard_write.csv')} bytes")
print(f"Compressed CSV: {os.path.getsize('compressed_write.csv.gz')} bytes")
print(f"Parquet: {os.path.getsize('data.parquet')} bytes")

🔍 Error Handling and Debugging

Common Error Handling

python
# Handle encoding errors
def safe_read_csv(filename, encodings=['utf-8', 'gbk', 'gb2312', 'latin1']):
    """Safely read CSV file, automatically try different encodings"""
    for encoding in encodings:
        try:
            df = pd.read_csv(filename, encoding=encoding)
            print(f"Successfully read file with {encoding} encoding")
            return df
        except UnicodeDecodeError:
            print(f"{encoding} encoding failed, trying next...")
            continue
    
    raise ValueError("All encodings failed")

# Handle file not found error
def safe_read_file(filename):
    """Safely read file"""
    try:
        if filename.endswith('.csv'):
            return pd.read_csv(filename)
        elif filename.endswith(('.xlsx', '.xls')):
            return pd.read_excel(filename)
        else:
            raise ValueError("Unsupported file format")
    except FileNotFoundError:
        print(f"File {filename} not found")
        return None
    except Exception as e:
        print(f"Error reading file: {e}")
        return None

# Test error handling
result = safe_read_file('nonexistent.csv')
if result is not None:
    print(result.head())
else:
    print("File read failed")
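Another frequent failure mode is a row with the wrong number of fields, which makes `read_csv` raise a `ParserError`. Since pandas 1.3, the `on_bad_lines` parameter lets you skip such rows instead. A small sketch with a made-up file:

```python
import pandas as pd

# A CSV where one row has too many fields
bad_csv = "A,B,C\n1,2,3\n4,5,6,7\n8,9,10\n"
with open('bad_rows.csv', 'w', encoding='utf-8') as f:
    f.write(bad_csv)

# on_bad_lines='skip' (pandas >= 1.3) drops malformed rows instead of raising
df = pd.read_csv('bad_rows.csv', on_bad_lines='skip')
print(df)  # the 4-field row is dropped, leaving 2 data rows
```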

📝 Chapter Summary

Through this chapter, you should have mastered:

  • CSV File Processing: Reading, writing, and various parameter settings
  • Excel File Processing: Multiple worksheet operations and formatting
  • Large File Processing: Chunked reading and memory optimization
  • File Format Conversion: Converting between CSV and Excel
  • Batch Processing: Reading and combining multiple files
  • Data Cleaning: Standardizing dirty data
  • Performance Optimization: Improving read/write efficiency
  • Error Handling: Handling common file operation errors

Key Points

  1. Choose the Right File Format: CSV for simple data, Excel for complex reports
  2. Optimize Data Types: Can significantly save memory and improve performance
  3. Handle Large Files: Use chunked reading to avoid memory overflow
  4. Importance of Data Cleaning: Data quality is the foundation of analysis
  5. Error Handling: Write robust code to handle various exceptions

Best Practices

  • Understand the data structure before reading files
  • Specify appropriate data types to optimize performance
  • Handle missing values and anomalous data
  • Use compression formats to save storage space
  • Add error handling for important operations

Next Steps

Now that you've mastered file processing skills, next we'll learn about Pandas data cleaning techniques.


Next Chapter: Pandas Data Cleaning
