Pandas CSV and Excel Handling
CSV and Excel files are the most common data formats in data analysis. This chapter will detail how to use Pandas to read, process, and save data in these file formats.
📚 File Format Overview
CSV File Characteristics
- Comma-Separated Values: Standard text format
- Plain text format: Can be opened with any text editor
- Cross-platform compatible: Supported by almost all systems
- Small file size: More space-efficient than Excel files
- Fast processing: High read/write efficiency
Excel File Characteristics
- Multiple worksheets: One file can contain multiple sheets
- Rich formatting: Supports fonts, colors, formulas, etc.
- Data type preservation: Automatically recognizes numbers, dates, etc.
- Business standard: Widely used in business environments
- Larger file size: Contains formatting information
📖 CSV File Processing
Reading CSV Files
Basic Reading
import pandas as pd
import numpy as np
# Create sample CSV data (simulating file content)
csv_content = """Name,Age,City,Salary
Alice,25,Beijing,8000
Bob,30,Shanghai,12000
Charlie,35,Guangzhou,15000
David,28,Shenzhen,9500
Eve,32,Hangzhou,11000"""
# Write content to file (in actual use, you already have a CSV file)
with open('employees.csv', 'w', encoding='utf-8') as f:
f.write(csv_content)
# Read CSV file
df = pd.read_csv('employees.csv')
print("Basic read result:")
print(df)
print(f"\nData types:\n{df.dtypes}")Reading with Parameters
# Specify encoding
df_encoding = pd.read_csv('employees.csv', encoding='utf-8')
# Specify delimiter (if not comma)
# df_tab = pd.read_csv('data.txt', sep='\t') # Tab-separated
# df_semicolon = pd.read_csv('data.csv', sep=';') # Semicolon-separated
# Specify index column
df_index = pd.read_csv('employees.csv', index_col='Name')
print("With specified index column:")
print(df_index)
# Select specific columns
df_selected = pd.read_csv('employees.csv', usecols=['Name', 'Salary'])
print("\nSelected columns:")
print(df_selected)
# Skip rows
df_skip = pd.read_csv('employees.csv', skiprows=1) # Skip first row
print("\nAfter skipping first row:")
print(df_skip)Handling Missing Values and Special Cases
# Create CSV with missing values
csv_with_nan = """Name,Age,City,Salary
Alice,25,Beijing,8000
Bob,,Shanghai,12000
Charlie,35,,15000
David,28,Shenzhen,
Eve,32,Hangzhou,11000"""
with open('employees_nan.csv', 'w', encoding='utf-8') as f:
f.write(csv_with_nan)
# Handle missing values when reading
df_nan = pd.read_csv('employees_nan.csv')
print("Data with missing values:")
print(df_nan)
print(f"\nMissing value count:\n{df_nan.isnull().sum()}")
# Specify missing value indicators
df_na_values = pd.read_csv('employees_nan.csv', na_values=['', 'NULL', 'N/A'])
print("\nAfter specifying missing value indicators:")
print(df_na_values)Specifying Data Types
# Specify data types
dtype_dict = {
'Name': 'string',
'Age': 'Int64', # Nullable integer type
'City': 'category',
'Salary': 'float64'
}
df_typed = pd.read_csv('employees.csv', dtype=dtype_dict)
print("After specifying data types:")
print(df_typed.dtypes)
print(df_typed.info())Writing CSV Files
Basic Writing
# Create sample data
data = {
'Product': ['Laptop', 'Desktop', 'Tablet', 'Smartphone'],
'Price': [5999, 3999, 2999, 1999],
'Stock': [50, 30, 80, 120],
'Category': ['Computer', 'Computer', 'Computer', 'Phone']
}
products_df = pd.DataFrame(data)
# Save as CSV
products_df.to_csv('products.csv', index=False, encoding='utf-8')
print("Data saved to products.csv")
# Verify saved result
saved_df = pd.read_csv('products.csv')
print("\nSaved data:")
print(saved_df)Advanced Writing Options
# Keep index
products_df.to_csv('products_with_index.csv', encoding='utf-8')
# Specify delimiter
products_df.to_csv('products_tab.txt', sep='\t', index=False)
# Select specific columns
products_df[['Product', 'Price']].to_csv('products_simple.csv', index=False)
# Append mode (note: file must exist first)
new_product = pd.DataFrame({
'Product': ['Smartwatch'],
'Price': [1299],
'Stock': [60],
'Category': ['Electronics']
})
# Append to existing file
new_product.to_csv('products.csv', mode='a', header=False, index=False)
print("\nAfter appending data:")
print(pd.read_csv('products.csv'))📊 Excel File Processing
Installing Dependencies
# Additional libraries needed for Excel files
# pip install openpyxl # For .xlsx files
# pip install xlrd # For .xls files
# pip install xlsxwriter # For writing Excel filesReading Excel Files
Basic Reading
# First create an Excel file for demonstration
with pd.ExcelWriter('company_data.xlsx', engine='openpyxl') as writer:
# Employee info sheet
employees = pd.DataFrame({
'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Department': ['Tech', 'Sales', 'Tech', 'HR', 'Finance'],
'Salary': [8000, 12000, 15000, 9500, 11000],
'JoinDate': pd.date_range('2020-01-01', periods=5, freq='3M')
})
# Department info sheet
departments = pd.DataFrame({
'DeptName': ['Tech', 'Sales', 'HR', 'Finance'],
'Manager': ['Manager A', 'Manager B', 'Manager C', 'Manager D'],
'HeadCount': [15, 8, 5, 6]
})
employees.to_excel(writer, sheet_name='Employees', index=False)
departments.to_excel(writer, sheet_name='Departments', index=False)
print("Excel file created")
# Read Excel file (default reads first sheet)
df_excel = pd.read_excel('company_data.xlsx')
print("\nRead first worksheet:")
print(df_excel)Reading Specific Worksheets
# Read specified worksheet
employees_sheet = pd.read_excel('company_data.xlsx', sheet_name='Employees')
departments_sheet = pd.read_excel('company_data.xlsx', sheet_name='Departments')
print("Employees sheet:")
print(employees_sheet)
print("\nDepartments sheet:")
print(departments_sheet)
# Read multiple worksheets
all_sheets = pd.read_excel('company_data.xlsx', sheet_name=None)
print(f"\nWorksheet names: {list(all_sheets.keys())}")
# Read specific multiple worksheets
selected_sheets = pd.read_excel('company_data.xlsx',
sheet_name=['Employees', 'Departments'])
print(f"\nSelected worksheets: {list(selected_sheets.keys())}")Advanced Reading Options
# Specify reading range
df_range = pd.read_excel('company_data.xlsx',
sheet_name='Employees',
usecols='A:D', # Only read columns A to D
nrows=3) # Only read first 3 rows
print("Reading specified range:")
print(df_range)
# Skip rows and specify header row
df_skip = pd.read_excel('company_data.xlsx',
sheet_name='Employees',
skiprows=1, # Skip first row
header=0) # Second row as header
print("\nAfter skipping rows:")
print(df_skip)Writing Excel Files
Basic Writing
# Create sales data
sales_data = pd.DataFrame({
'Month': pd.date_range('2024-01-01', periods=12, freq='M'),
'Revenue': [120000, 135000, 142000, 158000, 163000, 171000,
185000, 192000, 178000, 165000, 155000, 148000],
'Cost': [80000, 85000, 90000, 95000, 98000, 102000,
108000, 115000, 110000, 105000, 98000, 92000]
})
sales_data['Profit'] = sales_data['Revenue'] - sales_data['Cost']
sales_data['ProfitMargin'] = (sales_data['Profit'] / sales_data['Revenue'] * 100).round(2)
# Save to Excel
sales_data.to_excel('sales_report.xlsx', sheet_name='Monthly Sales', index=False)
print("Sales report saved")Multiple Worksheet Writing
# Create multiple data tables
q1_data = sales_data.iloc[:3].copy()
q2_data = sales_data.iloc[3:6].copy()
q3_data = sales_data.iloc[6:9].copy()
q4_data = sales_data.iloc[9:12].copy()
# Write to multiple worksheets
with pd.ExcelWriter('quarterly_report.xlsx', engine='openpyxl') as writer:
q1_data.to_excel(writer, sheet_name='Q1', index=False)
q2_data.to_excel(writer, sheet_name='Q2', index=False)
q3_data.to_excel(writer, sheet_name='Q3', index=False)
q4_data.to_excel(writer, sheet_name='Q4', index=False)
# Annual summary
annual_summary = pd.DataFrame({
'Quarter': ['Q1', 'Q2', 'Q3', 'Q4'],
'Revenue': [q1_data['Revenue'].sum(), q2_data['Revenue'].sum(),
q3_data['Revenue'].sum(), q4_data['Revenue'].sum()],
'Profit': [q1_data['Profit'].sum(), q2_data['Profit'].sum(),
q3_data['Profit'].sum(), q4_data['Profit'].sum()]
})
annual_summary.to_excel(writer, sheet_name='Annual Summary', index=False)
print("Quarterly report saved")Formatting Excel Output
# Use xlsxwriter engine for formatting
with pd.ExcelWriter('formatted_report.xlsx', engine='xlsxwriter') as writer:
sales_data.to_excel(writer, sheet_name='Sales Data', index=False)
# Get workbook and worksheet objects
workbook = writer.book
worksheet = writer.sheets['Sales Data']
# Define formats
header_format = workbook.add_format({
'bold': True,
'text_wrap': True,
'valign': 'top',
'fg_color': '#D7E4BC',
'border': 1
})
currency_format = workbook.add_format({'num_format': '$#,##0'})
percent_format = workbook.add_format({'num_format': '0.00%'})
date_format = workbook.add_format({'num_format': 'yyyy-mm-dd'})
# Apply formats
for col_num, value in enumerate(sales_data.columns.values):
worksheet.write(0, col_num, value, header_format)
# Set column widths
worksheet.set_column('A:A', 12) # Month column
worksheet.set_column('B:D', 15) # Numeric columns
# Format data columns
worksheet.set_column('B:C', 15, currency_format) # Revenue and Cost
worksheet.set_column('D:D', 15, currency_format) # Profit
print("Formatted report saved")🔧 Advanced File Processing Techniques
Large File Processing
Chunked Reading
# Create large file example
large_data = pd.DataFrame({
'ID': range(1, 10001),
'Value1': np.random.randn(10000),
'Value2': np.random.randn(10000),
'Category': np.random.choice(['A', 'B', 'C'], 10000)
})
large_data.to_csv('large_file.csv', index=False)
# Read large file in chunks
chunk_size = 1000
results = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
# Process each chunk
processed_chunk = chunk[chunk['Value1'] > 0] # Example: filter positive values
results.append(processed_chunk)
# Combine results
final_result = pd.concat(results, ignore_index=True)
print(f"Original data: {len(large_data)} rows")
print(f"Processed data: {len(final_result)} rows")Memory-Optimized Reading
# Optimize data types to save memory
dtype_optimization = {
'ID': 'int32',
'Value1': 'float32',
'Value2': 'float32',
'Category': 'category'
}
# Compare memory usage
df_default = pd.read_csv('large_file.csv')
df_optimized = pd.read_csv('large_file.csv', dtype=dtype_optimization)
print(f"Default type memory usage: {df_default.memory_usage(deep=True).sum()} bytes")
print(f"Optimized type memory usage: {df_optimized.memory_usage(deep=True).sum()} bytes")
print(f"Memory savings: {(1 - df_optimized.memory_usage(deep=True).sum()/df_default.memory_usage(deep=True).sum())*100:.1f}%")File Format Conversion
CSV and Excel Conversion
# CSV to Excel
csv_df = pd.read_csv('employees.csv')
csv_df.to_excel('employees_from_csv.xlsx', index=False)
# Excel to CSV
excel_df = pd.read_excel('company_data.xlsx', sheet_name='Employees')
excel_df.to_csv('employees_from_excel.csv', index=False)
print("File format conversion complete")Batch Processing Multiple Files
import os
import glob
# Create multiple CSV files for demonstration
for i in range(3):
sample_data = pd.DataFrame({
'Date': pd.date_range(f'2024-0{i+1}-01', periods=5, freq='D'),
'Quantity': np.random.randint(50, 200, 5),
'Revenue': np.random.randint(1000, 5000, 5)
})
sample_data.to_csv(f'sales_month_{i+1}.csv', index=False)
# Batch read and combine
csv_files = glob.glob('sales_month_*.csv')
all_data = []
for file in csv_files:
df = pd.read_csv(file)
df['SourceFile'] = file
all_data.append(df)
combined_data = pd.concat(all_data, ignore_index=True)
print("Combined data:")
print(combined_data.head(10))
# Save combined result
combined_data.to_excel('combined_sales.xlsx', index=False)
print("\nCombined data saved to Excel")🎨 Practical Application Examples
Example 1: Financial Report Processing
# Create financial data
financial_data = {
'Account': ['Revenue', 'Cost of Sales', 'Sales Expenses', 'Admin Expenses', 'Finance Expenses', 'Operating Profit'],
'CurrentMonth': [1000000, 600000, 80000, 120000, 15000, 185000],
'LastMonth': [950000, 570000, 75000, 115000, 12000, 178000],
'LastYear': [900000, 540000, 70000, 110000, 10000, 170000]
}
financial_df = pd.DataFrame(financial_data)
# Calculate growth rates
financial_df['MoMGrowth'] = ((financial_df['CurrentMonth'] - financial_df['LastMonth']) / financial_df['LastMonth'] * 100).round(2)
financial_df['YoYGrowth'] = ((financial_df['CurrentMonth'] - financial_df['LastYear']) / financial_df['LastYear'] * 100).round(2)
print("Financial Report:")
print(financial_df)
# Save to formatted Excel
with pd.ExcelWriter('financial_report.xlsx', engine='xlsxwriter') as writer:
financial_df.to_excel(writer, sheet_name='Financial Report', index=False)
workbook = writer.book
worksheet = writer.sheets['Financial Report']
# Define formats
header_format = workbook.add_format({
'bold': True, 'bg_color': '#4F81BD', 'font_color': 'white'
})
currency_format = workbook.add_format({'num_format': '$#,##0'})
percent_format = workbook.add_format({'num_format': '0.00%'})
# Apply formats
for col_num, value in enumerate(financial_df.columns.values):
worksheet.write(0, col_num, value, header_format)
# Format numeric columns
worksheet.set_column('B:D', 15, currency_format)
worksheet.set_column('E:F', 12, percent_format)
print("\nFinancial report saved")Example 2: Sales Data Analysis
# Create sales detail data
sales_detail = pd.DataFrame({
'OrderID': [f'ORD{str(i).zfill(4)}' for i in range(1, 101)],
'Customer': np.random.choice(['Customer A', 'Customer B', 'Customer C', 'Customer D', 'Customer E'], 100),
'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3', 'Product 4'], 100),
'Quantity': np.random.randint(1, 20, 100),
'UnitPrice': np.random.choice([100, 150, 200, 250, 300], 100),
'Date': pd.date_range('2024-01-01', periods=100, freq='D')
})
sales_detail['Amount'] = sales_detail['Quantity'] * sales_detail['UnitPrice']
# Data analysis
customer_summary = sales_detail.groupby('Customer').agg({
'OrderID': 'count',
'Quantity': 'sum',
'Amount': 'sum'
}).rename(columns={'OrderID': 'OrderCount'})
product_summary = sales_detail.groupby('Product').agg({
'OrderID': 'count',
'Quantity': 'sum',
'Amount': 'sum'
}).rename(columns={'OrderID': 'OrderCount'})
# Monthly summary
sales_detail['Month'] = sales_detail['Date'].dt.to_period('M')
monthly_summary = sales_detail.groupby('Month').agg({
'OrderID': 'count',
'Amount': 'sum'
}).rename(columns={'OrderID': 'OrderCount'})
# Save analysis results
with pd.ExcelWriter('sales_analysis.xlsx') as writer:
sales_detail.to_excel(writer, sheet_name='Sales Detail', index=False)
customer_summary.to_excel(writer, sheet_name='Customer Analysis')
product_summary.to_excel(writer, sheet_name='Product Analysis')
monthly_summary.to_excel(writer, sheet_name='Monthly Summary')
print("Sales analysis report generated")
print("\nCustomer Analysis:")
print(customer_summary.sort_values('Amount', ascending=False))Example 3: Data Cleaning and Standardization
# Create dirty data
dirty_data = pd.DataFrame({
'Name': ['Alice', ' Bob ', 'CHARLIE', 'David', 'Eve'],
'Phone': ['138-0000-0000', '139 0000 0000', '(140)0000-0000', '141-0000-0000', '142.0000.0000'],
'Email': ['alice@email.com', 'BOB@EMAIL.COM', 'charlie@Email.Com', 'david@email.com', 'eve@email.com'],
'Age': ['25', '30 years', '35', 'unknown', '28'],
'Salary': ['8000', '12,000', '15000 USD', 'N/A', '9500']
})
print("Original data:")
print(dirty_data)
# Data cleaning
cleaned_data = dirty_data.copy()
# Clean names: strip whitespace, standardize case
cleaned_data['Name'] = cleaned_data['Name'].str.strip().str.title()
# Clean phone: standardize format
cleaned_data['Phone'] = cleaned_data['Phone'].str.replace(r'[^\d]', '', regex=True)
cleaned_data['Phone'] = cleaned_data['Phone'].str.replace(r'(\d{3})(\d{4})(\d{4})', r'\1-\2-\3', regex=True)
# Clean email: lowercase
cleaned_data['Email'] = cleaned_data['Email'].str.lower()
# Clean age: extract numbers
cleaned_data['Age'] = cleaned_data['Age'].str.extract(r'(\d+)').astype(float)
# Clean salary: extract numbers, handle thousands separator
cleaned_data['Salary'] = cleaned_data['Salary'].str.replace(r'[^\d,]', '', regex=True)
cleaned_data['Salary'] = cleaned_data['Salary'].str.replace(',', '')
cleaned_data['Salary'] = pd.to_numeric(cleaned_data['Salary'], errors='coerce')
print("\nCleaned data:")
print(cleaned_data)
# Save cleaning results
with pd.ExcelWriter('data_cleaning_result.xlsx') as writer:
dirty_data.to_excel(writer, sheet_name='Original Data', index=False)
cleaned_data.to_excel(writer, sheet_name='Cleaned Data', index=False)
# Data quality report
quality_report = pd.DataFrame({
'Field': cleaned_data.columns,
'MissingCount': [cleaned_data[col].isnull().sum() for col in cleaned_data.columns],
'MissingRate': [f"{cleaned_data[col].isnull().sum()/len(cleaned_data)*100:.1f}%" for col in cleaned_data.columns]
})
quality_report.to_excel(writer, sheet_name='Data Quality Report', index=False)
print("\nData cleaning report saved")🚀 Performance Optimization Tips
Reading Performance Optimization
import time
# Create test data
test_data = pd.DataFrame({
'col1': range(100000),
'col2': np.random.randn(100000),
'col3': np.random.choice(['A', 'B', 'C'], 100000)
})
test_data.to_csv('performance_test.csv', index=False)
# Method 1: Default reading
start_time = time.time()
df1 = pd.read_csv('performance_test.csv')
default_time = time.time() - start_time
# Method 2: Specify data types
start_time = time.time()
df2 = pd.read_csv('performance_test.csv', dtype={
'col1': 'int32',
'col2': 'float32',
'col3': 'category'
})
typed_time = time.time() - start_time
# Method 3: Using pyarrow engine (if installed)
try:
start_time = time.time()
df3 = pd.read_csv('performance_test.csv', engine='pyarrow')
pyarrow_time = time.time() - start_time
except:
pyarrow_time = None
print(f"Default read time: {default_time:.4f} seconds")
print(f"Typed read time: {typed_time:.4f} seconds")
if pyarrow_time:
print(f"PyArrow engine read time: {pyarrow_time:.4f} seconds")
print(f"\nMemory usage comparison:")
print(f"Default types: {df1.memory_usage(deep=True).sum()} bytes")
print(f"Optimized types: {df2.memory_usage(deep=True).sum()} bytes")Writing Performance Optimization
# Compare different writing methods
large_df = pd.DataFrame({
'A': range(50000),
'B': np.random.randn(50000),
'C': np.random.choice(['X', 'Y', 'Z'], 50000)
})
# Method 1: Standard CSV write
start_time = time.time()
large_df.to_csv('standard_write.csv', index=False)
standard_time = time.time() - start_time
# Method 2: Compressed write
start_time = time.time()
large_df.to_csv('compressed_write.csv.gz', index=False, compression='gzip')
compressed_time = time.time() - start_time
# Method 3: Parquet format (for frequent reads)
start_time = time.time()
large_df.to_parquet('data.parquet', index=False)
parquet_time = time.time() - start_time
print(f"Standard CSV write time: {standard_time:.4f} seconds")
print(f"Compressed CSV write time: {compressed_time:.4f} seconds")
print(f"Parquet write time: {parquet_time:.4f} seconds")
# File size comparison
import os
print(f"\nFile size comparison:")
print(f"Standard CSV: {os.path.getsize('standard_write.csv')} bytes")
print(f"Compressed CSV: {os.path.getsize('compressed_write.csv.gz')} bytes")
print(f"Parquet: {os.path.getsize('data.parquet')} bytes")🔍 Error Handling and Debugging
Common Error Handling
# Handle encoding errors
def safe_read_csv(filename, encodings=['utf-8', 'gbk', 'gb2312', 'latin1']):
"""Safely read CSV file, automatically try different encodings"""
for encoding in encodings:
try:
df = pd.read_csv(filename, encoding=encoding)
print(f"Successfully read file with {encoding} encoding")
return df
except UnicodeDecodeError:
print(f"{encoding} encoding failed, trying next...")
continue
raise ValueError("All encodings failed")
# Handle file not found error
def safe_read_file(filename):
"""Safely read file"""
try:
if filename.endswith('.csv'):
return pd.read_csv(filename)
elif filename.endswith(('.xlsx', '.xls')):
return pd.read_excel(filename)
else:
raise ValueError("Unsupported file format")
except FileNotFoundError:
print(f"File {filename} not found")
return None
except Exception as e:
print(f"Error reading file: {e}")
return None
# Test error handling
result = safe_read_file('nonexistent.csv')
if result is not None:
print(result.head())
else:
print("File read failed")📝 Chapter Summary
Through this chapter, you should have mastered:
✅ CSV File Processing: Reading, writing, and various parameter settings
✅ Excel File Processing: Multiple worksheet operations and formatting
✅ Large File Processing: Chunked reading and memory optimization
✅ File Format Conversion: CSV and Excel conversion
✅ Batch Processing: Processing multiple files
✅ Data Cleaning: Processing dirty data and standardization
✅ Performance Optimization: Improving read/write efficiency
✅ Error Handling: Handling common file operation errors
Key Points
- Choose the Right File Format: CSV for simple data, Excel for complex reports
- Optimize Data Types: Can significantly save memory and improve performance
- Handle Large Files: Use chunked reading to avoid memory overflow
- Importance of Data Cleaning: Data quality is the foundation of analysis
- Error Handling: Write robust code to handle various exceptions
Best Practices
- Understand the data structure before reading files
- Specify appropriate data types to optimize performance
- Handle missing values and anomalous data
- Use compression formats to save storage space
- Add error handling for important operations
Next Steps
Now that you've mastered file processing skills, next we'll learn about Pandas data cleaning techniques.
Next Chapter: Pandas Data Cleaning