Pandas Data Cleaning

Data cleaning is one of the most important steps in the data analysis process. In real-world work, raw data often contains missing values, duplicate values, outliers, and other issues that need to be cleaned and preprocessed before analysis. This chapter will detail how to use Pandas for data cleaning.

1. Data Cleaning Overview

1.1 What is Data Cleaning

Data cleaning refers to the process of identifying and correcting errors, inconsistencies, and incompleteness in data. It mainly includes:

  • Missing Value Handling: Identifying and processing null values
  • Duplicate Value Handling: Finding and removing duplicate records
  • Outlier Handling: Identifying and processing outliers
  • Data Type Conversion: Ensuring correct data types
  • Data Format Standardization: Unifying data formats
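Each of these categories maps to a quick one-line Pandas inspection. A minimal audit sketch (the sample columns here are illustrative only):

```python
import pandas as pd

# A tiny frame exhibiting three of the issue types above
df = pd.DataFrame({'name': ['Alice', None, 'Alice'],
                   'age': ['25', '30', '25']})  # age stored as strings

print(df.isnull().sum())           # missing values per column
print(df.duplicated().sum())       # fully duplicated rows
print(df.dtypes)                   # data types (age should be numeric)
print(df.describe(include='all'))  # quick look for outliers / odd values
```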

1.2 Common Types of Data Quality Issues

python
import pandas as pd
import numpy as np

# Create sample data with various data quality issues
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', None, ''],
    'age': [25, 30, np.nan, 25, 35, 28, -5],
    'salary': [5000, 6000, 7000, 5000, 8000, 9000, 999999],
    'department': ['IT', 'HR', 'it', 'IT', 'Finance', 'hr', 'IT'],
    'join_date': ['2020-01-15', '2019-03-20', '2021/05/10', '2020-01-15', 
                  '2018-12-01', 'invalid date', '2022-02-28']
}

df = pd.DataFrame(data)
print("Original data:")
print(df)
print("\nData info:")
print(df.info())

2. Missing Value Handling

2.1 Identifying Missing Values

python
# Check for missing values
print("Missing value statistics:")
print(df.isnull().sum())

# View missing value distribution
print("\nMissing value percentage:")
print(df.isnull().mean() * 100)

# View rows containing missing values
print("\nRows with missing values:")
print(df[df.isnull().any(axis=1)])

# Visualize missing value patterns
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap='viridis')
plt.title('Missing Value Patterns')
plt.show()

2.2 Strategies for Handling Missing Values

Dropping Missing Values

python
# Drop rows with any missing values
df_drop_any = df.dropna()
print("Data after dropping any missing values:")
print(df_drop_any)

# Drop rows where all values are missing
df_drop_all = df.dropna(how='all')

# Drop missing values in specific columns
df_drop_subset = df.dropna(subset=['age'])

# Drop columns with missing values
df_drop_columns = df.dropna(axis=1)

Filling Missing Values

python
# Fill with fixed value
df_filled = df.copy()
df_filled['age'] = df_filled['age'].fillna(df['age'].mean())  # Fill with mean
df_filled['name'] = df_filled['name'].fillna('Unknown')  # Fill with fixed value

# Forward fill and backward fill
df_ffill = df.ffill()  # Forward fill (fillna(method=...) is deprecated)
df_bfill = df.bfill()  # Backward fill

# Fill numeric data with interpolation
df_interpolate = df.copy()
df_interpolate['age'] = df_interpolate['age'].interpolate()

# Fill categorical data with mode
mode_department = df['department'].mode()[0]
df_filled['department'] = df_filled['department'].fillna(mode_department)

2.3 Advanced Missing Value Handling

python
# Conditional filling
def fill_age_by_department(row):
    if pd.isna(row['age']):
        if row['department'] == 'IT':
            return 28  # IT department average age
        elif row['department'] == 'HR':
            return 32  # HR department average age
        else:
            return 30  # Other departments average age
    return row['age']

df_conditional = df.copy()
df_conditional['age'] = df_conditional.apply(fill_age_by_department, axis=1)

# Using machine learning to fill missing values
from sklearn.impute import KNNImputer

# Prepare numeric data
numeric_cols = ['age', 'salary']
imputer = KNNImputer(n_neighbors=3)
df_knn = df.copy()
df_knn[numeric_cols] = imputer.fit_transform(df[numeric_cols])

3. Duplicate Value Handling

3.1 Identifying Duplicates

python
# Check for completely duplicate rows
print("Number of duplicate rows:", df.duplicated().sum())
print("\nDuplicate rows:")
print(df[df.duplicated()])

# Check duplicates based on specific columns
print("\nDuplicates based on name:")
print(df[df.duplicated(subset=['name'])])

# View all duplicates (including first occurrence)
print("\nAll duplicate items:")
print(df[df.duplicated(subset=['name'], keep=False)])

3.2 Handling Duplicates

python
# Remove completely duplicate rows
df_no_duplicates = df.drop_duplicates()

# Remove duplicates based on specific columns, keep first
df_unique_names = df.drop_duplicates(subset=['name'], keep='first')

# Keep last duplicate
df_keep_last = df.drop_duplicates(subset=['name'], keep='last')

# Remove all duplicates
df_remove_all_duplicates = df.drop_duplicates(subset=['name'], keep=False)

print("Data after handling duplicates:")
print(df_unique_names)

4. Outlier Handling

4.1 Identifying Outliers

python
# Identify outliers using statistical methods
def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (data < lower_bound) | (data > upper_bound)

# Check age outliers (reindex so the boolean mask aligns with the full DataFrame)
age_outliers = detect_outliers_iqr(df['age'].dropna()).reindex(df.index, fill_value=False)
print("Age outliers:")
print(df.loc[age_outliers, 'age'])

# Using Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['salary'].dropna()))
threshold = 3
salary_outliers = z_scores > threshold
print("\nSalary outliers (Z-score > 3):")
print(df.loc[df['salary'].dropna().index[salary_outliers], 'salary'])

4.2 Handling Outliers

python
# Remove outliers
df_no_outliers = df.copy()
age_mask = ~detect_outliers_iqr(df_no_outliers['age'].dropna())
df_no_outliers = df_no_outliers.loc[df_no_outliers['age'].dropna().index[age_mask]]

# Replace outliers with boundary values (winsorizing)
def cap_outliers(data, method='iqr'):
    if method != 'iqr':
        raise ValueError(f"Unsupported method: {method}")
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data.clip(lower=lower_bound, upper=upper_bound)

df_capped = df.copy()
df_capped['salary'] = cap_outliers(df_capped['salary'])

# Replace outliers with median
df_median_replace = df.copy()
outlier_mask = detect_outliers_iqr(df_median_replace['age'].dropna()).reindex(df_median_replace.index, fill_value=False)
median_age = df_median_replace['age'].median()
df_median_replace.loc[outlier_mask, 'age'] = median_age

5. Data Type Conversion

5.1 Checking and Converting Data Types

python
# Check current data types
print("Current data types:")
print(df.dtypes)

# Convert data types
df_converted = df.copy()

# Convert to numeric
df_converted['age'] = pd.to_numeric(df_converted['age'], errors='coerce')
df_converted['salary'] = pd.to_numeric(df_converted['salary'], errors='coerce')

# Convert to datetime (format='mixed' parses each value individually; pandas >= 2.0.
# infer_datetime_format is deprecated and was removed)
df_converted['join_date'] = pd.to_datetime(df_converted['join_date'],
                                           errors='coerce', format='mixed')

# Convert to categorical
df_converted['department'] = df_converted['department'].astype('category')

print("\nConverted data types:")
print(df_converted.dtypes)

5.2 Processing String Data

python
# String cleaning
df_string_clean = df.copy()

# Strip whitespace
df_string_clean['name'] = df_string_clean['name'].str.strip()
df_string_clean['department'] = df_string_clean['department'].str.strip()

# Standardize case
df_string_clean['department'] = df_string_clean['department'].str.upper()

# Replace empty strings with NaN
df_string_clean['name'] = df_string_clean['name'].replace('', np.nan)

# Standardize text
df_string_clean['department'] = df_string_clean['department'].replace({
    'IT': 'Information Technology',
    'HR': 'Human Resources'
})

print("String cleaned data:")
print(df_string_clean)

6. Data Validation and Quality Checks

6.1 Data Integrity Checks

python
def data_quality_report(df):
    """
    Generate data quality report
    """
    report = {}
    
    # Basic information
    report['Total rows'] = len(df)
    report['Total columns'] = len(df.columns)
    
    # Missing value statistics
    missing_stats = df.isnull().sum()
    report['Missing value stats'] = missing_stats[missing_stats > 0].to_dict()
    
    # Duplicate value statistics
    report['Duplicate rows'] = df.duplicated().sum()
    
    # Data type statistics
    report['Data types'] = df.dtypes.value_counts().to_dict()
    
    # Statistics for numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        report['Numeric statistics'] = df[numeric_cols].describe().to_dict()
    
    return report

# Generate data quality report
quality_report = data_quality_report(df)
for key, value in quality_report.items():
    print(f"{key}: {value}")
    print("-" * 50)

6.2 Data Constraint Checks

python
def validate_data(df):
    """
    Data validation function
    """
    issues = []
    
    # Check age range
    if 'age' in df.columns:
        invalid_age = df[(df['age'] < 0) | (df['age'] > 120)]
        if not invalid_age.empty:
            issues.append(f"Found {len(invalid_age)} invalid age values")
    
    # Check salary range
    if 'salary' in df.columns:
        invalid_salary = df[(df['salary'] < 0) | (df['salary'] > 1000000)]
        if not invalid_salary.empty:
            issues.append(f"Found {len(invalid_salary)} suspicious salary values")
    
    # Check required fields
    required_fields = ['name', 'department']
    for field in required_fields:
        if field in df.columns:
            missing_count = df[field].isnull().sum()
            if missing_count > 0:
                issues.append(f"Required field {field} has {missing_count} missing values")
    
    return issues

# Execute data validation
validation_issues = validate_data(df)
if validation_issues:
    print("Data validation issues found:")
    for issue in validation_issues:
        print(f"- {issue}")
else:
    print("Data validation passed")

7. Comprehensive Data Cleaning Pipeline

7.1 Complete Data Cleaning Pipeline

python
class DataCleaner:
    def __init__(self):
        self.cleaning_log = []
    
    def log_action(self, action):
        self.cleaning_log.append(action)
        print(f"Executing: {action}")
    
    def clean_data(self, df):
        """
        Complete data cleaning process
        """
        df_clean = df.copy()
        original_shape = df_clean.shape
        
        # 1. Process string data
        self.log_action("Cleaning string data")
        string_cols = df_clean.select_dtypes(include=['object']).columns
        for col in string_cols:
            if col in df_clean.columns:
                df_clean[col] = df_clean[col].astype(str).str.strip()
                df_clean[col] = df_clean[col].replace(['', 'nan', 'None'], np.nan)
        
        # 2. Data type conversion
        self.log_action("Converting data types")
        if 'age' in df_clean.columns:
            df_clean['age'] = pd.to_numeric(df_clean['age'], errors='coerce')
        if 'salary' in df_clean.columns:
            df_clean['salary'] = pd.to_numeric(df_clean['salary'], errors='coerce')
        if 'join_date' in df_clean.columns:
            df_clean['join_date'] = pd.to_datetime(df_clean['join_date'], errors='coerce')
        
        # 3. Handle duplicates
        self.log_action("Removing duplicates")
        rows_before = len(df_clean)
        df_clean = df_clean.drop_duplicates()
        duplicates_removed = rows_before - len(df_clean)
        
        # 4. Handle outliers
        self.log_action("Processing outliers")
        numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
        for col in numeric_cols:
            df_clean[col] = cap_outliers(df_clean[col])
        
        # 5. Handle missing values
        self.log_action("Processing missing values")
        # Fill numeric with median
        for col in numeric_cols:
            if df_clean[col].isnull().any():
                median_val = df_clean[col].median()
                df_clean[col] = df_clean[col].fillna(median_val)
        
        # Fill categorical with mode
        categorical_cols = df_clean.select_dtypes(include=['object', 'category']).columns
        for col in categorical_cols:
            if df_clean[col].isnull().any():
                mode_val = df_clean[col].mode().iloc[0] if not df_clean[col].mode().empty else 'Unknown'
                df_clean[col] = df_clean[col].fillna(mode_val)
        
        # 6. Final validation
        self.log_action("Executing final validation")
        final_shape = df_clean.shape
        
        # Generate cleaning report
        report = {
            'Original data shape': original_shape,
            'Cleaned data shape': final_shape,
            'Duplicate rows removed': duplicates_removed,
            'Remaining missing values': df_clean.isnull().sum().sum(),
            'Cleaning steps': self.cleaning_log
        }
        
        return df_clean, report

# Use data cleaner
cleaner = DataCleaner()
df_cleaned, cleaning_report = cleaner.clean_data(df)

print("\nCleaning Report:")
for key, value in cleaning_report.items():
    print(f"{key}: {value}")

print("\nCleaned data:")
print(df_cleaned)

8. Practical Application Cases

8.1 Sales Data Cleaning

python
# Simulated sales data
sales_data = {
    'order_id': ['ORD001', 'ORD002', 'ORD003', 'ORD001', 'ORD004', None],
    'customer_name': ['Alice', 'Bob', '', 'Alice', 'Charlie', 'David'],
    'product': ['Laptop', 'LAPTOP', 'Laptop', 'Laptop', 'Phone', 'Tablet'],
    'quantity': [1, 2, -1, 1, 3, 0],
    'price': [5000, 6000, 5000, 5000, 2000, 3000],
    'order_date': ['2023-01-15', '2023/02/20', '2023-03-10', '2023-01-15', 
                   '2023-04-05', 'invalid']
}

sales_df = pd.DataFrame(sales_data)
print("Original sales data:")
print(sales_df)

# Apply cleaning process
sales_cleaner = DataCleaner()
sales_cleaned, sales_report = sales_cleaner.clean_data(sales_df)

# Additional business rule cleaning
sales_cleaned = sales_cleaned[sales_cleaned['quantity'] > 0]  # Remove invalid quantity
sales_cleaned['product'] = sales_cleaned['product'].str.upper()  # Standardize product names

print("\nCleaned sales data:")
print(sales_cleaned)

9. Performance Optimization Tips

9.1 Large Dataset Cleaning Strategy

python
# Chunk processing for large files
def clean_large_dataset(file_path, chunk_size=10000):
    """
    Process large dataset in chunks
    """
    cleaned_chunks = []
    
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        # Apply cleaning logic
        chunk_cleaned = clean_data_chunk(chunk)
        cleaned_chunks.append(chunk_cleaned)
    
    # Combine all cleaned chunks
    return pd.concat(cleaned_chunks, ignore_index=True)

def clean_data_chunk(chunk):
    """
    Clean a single data chunk
    """
    # Basic cleaning operations
    chunk = chunk.dropna(how='all')  # Drop all-empty rows
    chunk = chunk.drop_duplicates()  # Drop duplicate rows
    
    # Data type optimization
    for col in chunk.select_dtypes(include=['object']).columns:
        if chunk[col].nunique() / len(chunk) < 0.5:  # If unique ratio < 50%
            chunk[col] = chunk[col].astype('category')
    
    return chunk

9.2 Memory Optimization

python
# Optimize data types to save memory
def optimize_dtypes(df):
    """
    Optimize DataFrame data types
    """
    df_optimized = df.copy()
    
    # Optimize integer types
    for col in df_optimized.select_dtypes(include=['int64']).columns:
        col_min = df_optimized[col].min()
        col_max = df_optimized[col].max()
        
        if col_min >= 0:
            if col_max <= np.iinfo(np.uint8).max:
                df_optimized[col] = df_optimized[col].astype('uint8')
            elif col_max <= np.iinfo(np.uint16).max:
                df_optimized[col] = df_optimized[col].astype('uint16')
            elif col_max <= np.iinfo(np.uint32).max:
                df_optimized[col] = df_optimized[col].astype('uint32')
        else:
            if col_min >= np.iinfo(np.int8).min and col_max <= np.iinfo(np.int8).max:
                df_optimized[col] = df_optimized[col].astype('int8')
            elif col_min >= np.iinfo(np.int16).min and col_max <= np.iinfo(np.int16).max:
                df_optimized[col] = df_optimized[col].astype('int16')
            elif col_min >= np.iinfo(np.int32).min and col_max <= np.iinfo(np.int32).max:
                df_optimized[col] = df_optimized[col].astype('int32')
    
    # Optimize float types
    for col in df_optimized.select_dtypes(include=['float64']).columns:
        df_optimized[col] = pd.to_numeric(df_optimized[col], downcast='float')
    
    return df_optimized

# Apply optimization
df_optimized = optimize_dtypes(df_cleaned)
print(f"Original memory usage: {df_cleaned.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"Optimized memory usage: {df_optimized.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

10. Best Practices and Considerations

10.1 Data Cleaning Best Practices

  1. Preserve Original Data: Always keep a backup of the original data
  2. Document Cleaning Steps: Record each cleaning step and decision rationale
  3. Validate Cleaning Results: Verify data integrity and correctness after cleaning
  4. Incremental Cleaning: Clean step by step, checking results at each stage
  5. Business Understanding: Cleaning decisions should be based on business understanding
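The first three practices can be combined into one small pattern: work on a copy, record each step, and verify the result against the untouched original. A minimal sketch (the function name and log format are illustrative, not part of Pandas):

```python
import pandas as pd

def clean_with_log(df):
    original = df.copy()   # 1. preserve the original data
    log = []               # 2. document each cleaning step

    cleaned = df.drop_duplicates()
    log.append(f"drop_duplicates: {len(df) - len(cleaned)} rows removed")

    rows_before = len(cleaned)
    cleaned = cleaned.dropna(subset=['name'])
    log.append(f"dropna(name): {rows_before - len(cleaned)} rows removed")

    # 3. validate the cleaning result against the untouched original
    assert len(cleaned) <= len(original)
    return cleaned, log

df = pd.DataFrame({'name': ['Alice', 'Bob', None, 'Alice'],
                   'age': [25, 30, 28, 25]})
cleaned, log = clean_with_log(df)
for entry in log:
    print(entry)
```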

10.2 Common Pitfalls and Considerations

python
# Common mistakes and correct approaches

# Wrong: Dropping all missing values directly
# df_wrong = df.dropna()  # May delete too much useful data

# Correct: Selectively process based on business logic
df_correct = df.copy()
# Only drop missing values in key fields
df_correct = df_correct.dropna(subset=['name', 'department'])
# Fill other fields with appropriate methods
df_correct['age'] = df_correct['age'].fillna(df_correct['age'].median())

# Wrong: Blindly removing outliers
# df_wrong = df[df['salary'] < df['salary'].quantile(0.95)]  # May remove valid high salaries

# Correct: Combine with business logic
# Executive salaries may genuinely be high, so flag suspicious values
# for review instead of dropping them outright
salary_review_threshold = 50000
suspicious_salaries = df[df['salary'] > salary_review_threshold]

Chapter Summary

Data cleaning is the foundation of data analysis. This chapter covered:

  1. Missing Value Handling: Various methods for identifying, dropping, and filling missing values
  2. Duplicate Value Handling: Finding and processing duplicate data
  3. Outlier Handling: Using statistical methods to identify and process outliers
  4. Data Type Conversion: Ensuring data type correctness
  5. Data Validation: Establishing data quality check mechanisms
  6. Comprehensive Cleaning Pipeline: Building complete data cleaning pipelines
  7. Performance Optimization: Strategies for handling large datasets and memory optimization

Mastering these data cleaning skills will help you handle real-world dirty data and lay a solid foundation for subsequent data analysis. In practical applications, choose appropriate cleaning strategies based on specific business scenarios and data characteristics.

Exercises

  1. Create a DataFrame with various data quality issues and implement a complete cleaning process
  2. Write a function to automatically detect and report data quality issues
  3. Implement a configurable data cleaning pipeline that supports different cleaning strategies
  4. Process a real CSV file and document the cleaning process and results
  5. Compare the impact of different missing value filling methods on subsequent analysis results

In the next chapter, we will learn about Pandas common functions that will help us process and analyze data more efficiently.
