Chapter 3: Data Preprocessing Basics
Data preprocessing is one of the most important steps in any machine learning project. Real-world data is often "dirty": it contains missing values, outliers, features on very different scales, and other issues. This chapter provides a detailed introduction to data preprocessing with Scikit-learn.
3.1 Why Do We Need Data Preprocessing?
In real projects, raw data typically has the following problems:
- Missing Values: Omissions during data collection
- Outliers: Measurement errors or extreme cases
- Different Scales: Large differences in value ranges between different features
- Inconsistent Data Types: Mix of numeric and categorical data
- Duplicate Data: Same records appearing multiple times
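Each of these problems can be surfaced with a few lines of pandas before any modeling starts. Below is a minimal health-check sketch on a tiny made-up table (all column names and values here are purely illustrative):

```python
import numpy as np
import pandas as pd

# A tiny made-up frame exhibiting the problems listed above
raw = pd.DataFrame({
    'age': [25, 30, np.nan, 30, 150],               # missing value + outlier
    'income': [40000, 52000, 48000, 52000, 45000],  # scale differs from 'age'
    'city': ['Beijing', 'Shanghai', 'Beijing', 'Shanghai', 'Beijing'],  # categorical
})

print(raw.isnull().sum())        # missing values per column
print(raw.duplicated().sum())    # fully duplicated rows (row 3 repeats row 1)
print(raw.dtypes)                # mixed numeric and categorical types
print(raw.describe())            # min/max hint at outliers and scale differences
```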
3.2 Create Example Dataset
First, let's create an example dataset containing various problems:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    LabelEncoder, OneHotEncoder
)
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Set random seed
np.random.seed(42)
# Create example data
n_samples = 1000
# Generate base data
data = {
    'age': np.random.normal(35, 10, n_samples),
    'income': np.random.exponential(50000, n_samples),
    'education_years': np.random.normal(14, 3, n_samples),
    'credit_score': np.random.normal(650, 100, n_samples),
    'gender': np.random.choice(['Male', 'Female'], n_samples),
    'city': np.random.choice(['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen'], n_samples),
    'loan_approved': np.random.choice([0, 1], n_samples, p=[0.3, 0.7])
}
# Create DataFrame
df = pd.DataFrame(data)
# Artificially introduce some problems
# 1. Missing values
missing_indices = np.random.choice(df.index, size=int(0.1 * n_samples), replace=False)
df.loc[missing_indices[:50], 'income'] = np.nan
df.loc[missing_indices[50:], 'education_years'] = np.nan
# 2. Outliers
outlier_indices = np.random.choice(df.index, size=20, replace=False)
df.loc[outlier_indices, 'age'] = np.random.uniform(100, 120, 20)
# 3. Negative values (unreasonable data)
negative_indices = np.random.choice(df.index, size=10, replace=False)
df.loc[negative_indices, 'credit_score'] = np.random.uniform(-100, 0, 10)
print("Original Dataset Information:")
print(df.info())
print("\nFirst 5 Rows of Data:")
print(df.head())
```

3.3 Data Exploration and Problem Identification
3.3.1 Basic Statistics
```python
# View data statistics
print("Data Statistics Summary:")
print(df.describe())
# View missing values
print("\nMissing Value Statistics:")
missing_stats = df.isnull().sum()
missing_percent = (missing_stats / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_stats,
    'Missing Percentage(%)': missing_percent
})
print(missing_df[missing_df['Missing Count'] > 0])
# View data types
print("\nData Types:")
print(df.dtypes)
```

3.3.2 Visualize Data Distribution
```python
# Create visualization charts
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Data Distribution Visualization', fontsize=16)
# Distribution of numeric features
numeric_cols = ['age', 'income', 'education_years', 'credit_score']
for i, col in enumerate(numeric_cols):
    row, col_idx = i // 2, i % 2
    # Histogram
    axes[row, col_idx].hist(df[col].dropna(), bins=30, alpha=0.7, edgecolor='black')
    axes[row, col_idx].set_title(f'{col} Distribution')
    axes[row, col_idx].set_xlabel(col)
    axes[row, col_idx].set_ylabel('Frequency')
# Distribution of categorical features
axes[0, 2].pie(df['gender'].value_counts(), labels=df['gender'].value_counts().index, autopct='%1.1f%%')
axes[0, 2].set_title('Gender Distribution')
# Target variable distribution (a 2x3 grid has no index [1, 3]; use the last free panel)
df['loan_approved'].value_counts().plot(kind='bar', ax=axes[1, 2])
axes[1, 2].set_title('Loan Approval Status')
axes[1, 2].set_xlabel('Loan Approved (0=No, 1=Yes)')
plt.tight_layout()
plt.show()
```

3.3.3 Outlier Detection
```python
# Use boxplot to detect outliers
plt.figure(figsize=(12, 8))
numeric_cols = ['age', 'income', 'education_years', 'credit_score']
for i, col in enumerate(numeric_cols, 1):
    plt.subplot(2, 2, i)
    plt.boxplot(df[col].dropna())
    plt.title(f'{col} Boxplot')
    plt.ylabel(col)
plt.tight_layout()
plt.show()
# Use IQR method to identify outliers
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound
# Detect outliers in each column
for col in numeric_cols:
    outliers, lower, upper = detect_outliers_iqr(df, col)
    print(f"\n{col} Outlier Detection:")
    print(f"Normal Range: [{lower:.2f}, {upper:.2f}]")
    print(f"Number of Outliers: {len(outliers)}")
    if len(outliers) > 0:
        print(f"Outlier Examples: {outliers[col].head().tolist()}")
```

3.4 Handling Missing Values
3.4.1 Simple Imputation Strategies
```python
# Create a copy of data for processing
df_processed = df.copy()
# Method 1: Use mean to impute numeric features
print("Missing Values Before Processing:")
print(df_processed.isnull().sum())
# Use SimpleImputer
numeric_imputer = SimpleImputer(strategy='mean')
numeric_cols_with_missing = ['income', 'education_years']
df_processed[numeric_cols_with_missing] = numeric_imputer.fit_transform(
    df_processed[numeric_cols_with_missing]
)
print("\nAfter Mean Imputation:")
print(df_processed.isnull().sum())
# Method 2: Use mode to impute categorical features
categorical_imputer = SimpleImputer(strategy='most_frequent')
categorical_cols = ['gender', 'city']
# If categorical features have missing values
if df_processed[categorical_cols].isnull().sum().sum() > 0:
    df_processed[categorical_cols] = categorical_imputer.fit_transform(
        df_processed[categorical_cols]
    )
```

3.4.2 Advanced Imputation Strategies
```python
# Method 3: Use KNN imputation
df_knn = df.copy()
# First encode categorical variables
le_gender = LabelEncoder()
le_city = LabelEncoder()
df_knn['gender_encoded'] = le_gender.fit_transform(df_knn['gender'])
df_knn['city_encoded'] = le_city.fit_transform(df_knn['city'])
# Select numeric features for KNN imputation
numeric_features = ['age', 'income', 'education_years', 'credit_score', 'gender_encoded', 'city_encoded']
knn_imputer = KNNImputer(n_neighbors=5)
df_knn[numeric_features] = knn_imputer.fit_transform(df_knn[numeric_features])
print("Statistics After KNN Imputation:")
print(df_knn[['income', 'education_years']].describe())
```

3.4.3 Imputation Effect Comparison
```python
# Compare effects of different imputation methods
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Original data (removing missing values)
axes[0].hist(df['income'].dropna(), bins=30, alpha=0.7, label='Original Data')
axes[0].set_title('Original Income Distribution')
axes[0].set_xlabel('Income')
axes[0].set_ylabel('Frequency')
# Mean imputation
axes[1].hist(df_processed['income'], bins=30, alpha=0.7, label='Mean Imputation', color='orange')
axes[1].set_title('Income Distribution After Mean Imputation')
axes[1].set_xlabel('Income')
# KNN imputation
axes[2].hist(df_knn['income'], bins=30, alpha=0.7, label='KNN Imputation', color='green')
axes[2].set_title('Income Distribution After KNN Imputation')
axes[2].set_xlabel('Income')
plt.tight_layout()
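# A quick numeric check (illustrative): mean imputation leaves the column mean
# unchanged but shrinks its variance, whereas KNN imputation tends to preserve
# the spread better.
print(f"Income std, original (missing dropped): {df['income'].std():.0f}")
print(f"Income std, mean-imputed: {df_processed['income'].std():.0f}")
print(f"Income std, KNN-imputed: {df_knn['income'].std():.0f}")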
plt.show()
```

3.5 Handling Outliers
3.5.1 Remove Outliers
```python
# Method 1: Directly remove outliers
def remove_outliers_iqr(data, columns):
    """Remove outliers using IQR method"""
    data_clean = data.copy()
    for col in columns:
        Q1 = data_clean[col].quantile(0.25)
        Q3 = data_clean[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Remove outliers
        data_clean = data_clean[
            (data_clean[col] >= lower_bound) &
            (data_clean[col] <= upper_bound)
        ]
    return data_clean
# Remove age outliers
df_no_outliers = remove_outliers_iqr(df_processed, ['age'])
print(f"Number of samples before removing outliers: {len(df_processed)}")
print(f"Number of samples after removing outliers: {len(df_no_outliers)}")3.5.2 Cap Outliers
```python
# Method 2: Cap outliers to reasonable range
def cap_outliers(data, column, lower_percentile=5, upper_percentile=95):
    """Cap outliers to specified percentile range (returns a modified copy)"""
    data = data.copy()  # avoid mutating the caller's DataFrame in place
    lower_bound = data[column].quantile(lower_percentile / 100)
    upper_bound = data[column].quantile(upper_percentile / 100)
    data[column] = np.clip(data[column], lower_bound, upper_bound)
    return data
# Handle negative credit scores
df_capped = df_processed.copy()
df_capped = cap_outliers(df_capped, 'credit_score', 1, 99)
print("Outlier Handling Comparison:")
print(f"Before credit score range: [{df_processed['credit_score'].min():.2f}, {df_processed['credit_score'].max():.2f}]")
print(f"After credit score range: [{df_capped['credit_score'].min():.2f}, {df_capped['credit_score'].max():.2f}]")3.6 Feature Scaling
3.6.1 Standardization (StandardScaler)
```python
# Select features to scale
numeric_features = ['age', 'income', 'education_years', 'credit_score']
# Standardization: mean=0, std=1
scaler_standard = StandardScaler()
df_standard = df_capped.copy()
df_standard[numeric_features] = scaler_standard.fit_transform(df_capped[numeric_features])
print("Statistics After Standardization:")
print(df_standard[numeric_features].describe())
```

3.6.2 Min-Max Scaling (MinMaxScaler)
```python
# Min-Max scaling: scale to [0,1] range
scaler_minmax = MinMaxScaler()
df_minmax = df_capped.copy()
df_minmax[numeric_features] = scaler_minmax.fit_transform(df_capped[numeric_features])
print("Statistics After Min-Max Scaling:")
print(df_minmax[numeric_features].describe())
```

3.6.3 Robust Scaling (RobustScaler)
```python
# Robust scaling: use median and IQR, insensitive to outliers
scaler_robust = RobustScaler()
df_robust = df_capped.copy()
df_robust[numeric_features] = scaler_robust.fit_transform(df_capped[numeric_features])
print("Statistics After Robust Scaling:")
print(df_robust[numeric_features].describe())
```

3.6.4 Scaling Method Comparison
```python
# Visualize effects of different scaling methods
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Comparison of Different Scaling Methods', fontsize=16)
# Original data
axes[0, 0].boxplot([df_capped[col] for col in numeric_features], labels=numeric_features)
axes[0, 0].set_title('Original Data')
axes[0, 0].tick_params(axis='x', rotation=45)
# Standardization
axes[0, 1].boxplot([df_standard[col] for col in numeric_features], labels=numeric_features)
axes[0, 1].set_title('Standardization')
axes[0, 1].tick_params(axis='x', rotation=45)
# Min-Max scaling
axes[1, 0].boxplot([df_minmax[col] for col in numeric_features], labels=numeric_features)
axes[1, 0].set_title('Min-Max Scaling')
axes[1, 0].tick_params(axis='x', rotation=45)
# Robust scaling
axes[1, 1].boxplot([df_robust[col] for col in numeric_features], labels=numeric_features)
axes[1, 1].set_title('Robust Scaling')
axes[1, 1].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
```

3.7 Categorical Feature Encoding
3.7.1 Label Encoding
```python
# Label encoding: convert categories to numbers
df_encoded = df_standard.copy()
# Encode gender
le_gender = LabelEncoder()
df_encoded['gender_encoded'] = le_gender.fit_transform(df_encoded['gender'])
print("Gender Label Encoding:")
gender_mapping = dict(zip(le_gender.classes_, le_gender.transform(le_gender.classes_)))
print(gender_mapping)
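# Note: LabelEncoder is designed for encoding *target* labels; for input features,
# scikit-learn provides OrdinalEncoder, which works directly on 2-D feature arrays.
# A minimal alternative sketch ('gender_ordinal' is an illustrative new column):
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
df_encoded['gender_ordinal'] = ordinal_encoder.fit_transform(df_encoded[['gender']]).ravel()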
# Encode city
le_city = LabelEncoder()
df_encoded['city_encoded'] = le_city.fit_transform(df_encoded['city'])
print("\nCity Label Encoding:")
city_mapping = dict(zip(le_city.classes_, le_city.transform(le_city.classes_)))
print(city_mapping)
```

3.7.2 One-Hot Encoding
```python
# One-hot encoding: create binary features for each category
df_onehot = df_standard.copy()
# Use pandas for one-hot encoding
df_onehot = pd.get_dummies(df_onehot, columns=['gender', 'city'], prefix=['gender', 'city'])
print("Features After One-Hot Encoding:")
print(df_onehot.columns.tolist())
print(f"Number of Features: {len(df_onehot.columns)}")
# View one-hot encoding results
print("\nOne-Hot Encoding Examples (First 5 rows):")
onehot_cols = [col for col in df_onehot.columns if col.startswith(('gender_', 'city_'))]
print(df_onehot[onehot_cols].head())
```

3.7.3 Encoding Method Comparison
```python
# Compare effects of different encoding methods on model performance
def compare_encoding_methods():
    """Compare effects of label encoding and one-hot encoding"""
    # Prepare data
    X_label = df_encoded[['age', 'income', 'education_years', 'credit_score', 'gender_encoded', 'city_encoded']]
    X_onehot = df_onehot.drop(['gender', 'city', 'loan_approved'], axis=1, errors='ignore')
    y = df_encoded['loan_approved']
    results = {}
    # Label encoding
    X_train, X_test, y_train, y_test = train_test_split(X_label, y, test_size=0.2, random_state=42)
    model_label = RandomForestClassifier(random_state=42)
    model_label.fit(X_train, y_train)
    acc_label = model_label.score(X_test, y_test)
    results['Label Encoding'] = acc_label
    # One-hot encoding
    X_train, X_test, y_train, y_test = train_test_split(X_onehot, y, test_size=0.2, random_state=42)
    model_onehot = RandomForestClassifier(random_state=42)
    model_onehot.fit(X_train, y_train)
    acc_onehot = model_onehot.score(X_test, y_test)
    results['One-Hot Encoding'] = acc_onehot
    return results
encoding_results = compare_encoding_methods()
print("Encoding Method Performance Comparison:")
for method, accuracy in encoding_results.items():
    print(f"{method}: {accuracy:.4f}")
```

3.8 Complete Preprocessing Pipeline
3.8.1 Create Preprocessing Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
def create_preprocessing_pipeline():
    """Create complete data preprocessing pipeline"""
    # Define numeric and categorical features
    numeric_features = ['age', 'income', 'education_years', 'credit_score']
    categorical_features = ['gender', 'city']
    # Numeric feature preprocessing pipeline
    numeric_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),  # Fill missing values
        ('scaler', StandardScaler())  # Standardization
    ])
    # Categorical feature preprocessing pipeline
    categorical_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing values
        ('onehot', OneHotEncoder(drop='first', sparse_output=False))  # One-hot encoding
    ])
    # Combined preprocessor
    preprocessor = ColumnTransformer([
        ('num', numeric_pipeline, numeric_features),
        ('cat', categorical_pipeline, categorical_features)
    ])
    return preprocessor
# Create and use preprocessing pipeline
preprocessor = create_preprocessing_pipeline()
# Prepare data
X = df[['age', 'income', 'education_years', 'credit_score', 'gender', 'city']]
y = df['loan_approved']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Apply preprocessing
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
print(f"Processed Training Set Shape: {X_train_processed.shape}")
print(f"Processed Test Set Shape: {X_test_processed.shape}")3.8.2 Complete Machine Learning Pipeline
```python
# Create complete pipeline including preprocessing and model
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
# Train model
full_pipeline.fit(X_train, y_train)
# Predict and evaluate
y_pred = full_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Complete Pipeline Accuracy: {accuracy:.4f}")
# Get feature names (the [1:] slices account for drop='first' in the one-hot encoder)
feature_names = (
    ['age', 'income', 'education_years', 'credit_score'] +  # Numeric features
    [f'gender_{cat}' for cat in preprocessor.named_transformers_['cat']['onehot'].categories_[0][1:]] +  # Gender features
    [f'city_{cat}' for cat in preprocessor.named_transformers_['cat']['onehot'].categories_[1][1:]]  # City features
)
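# Alternative (newer scikit-learn releases): a fitted ColumnTransformer can
# generate the output feature names itself:
# feature_names = preprocessor.get_feature_names_out()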
print(f"\nProcessed Features: {feature_names}")3.9 Data Preprocessing Best Practices
3.9.1 Processing Order
```python
def preprocessing_best_practices():
    """Data preprocessing best practices example"""
    print("Data Preprocessing Best Practices:")
    print("1. Data Exploration and Understanding")
    print("2. Handle Duplicate Values")
    print("3. Handle Missing Values")
    print("4. Handle Outliers")
    print("5. Feature Encoding")
    print("6. Feature Scaling")
    print("7. Feature Selection (Optional)")
    # Example: Complete preprocessing workflow
    df_clean = df.copy()
    # 1. Remove duplicate values
    df_clean = df_clean.drop_duplicates()
    print(f"\nNumber of samples after removing duplicates: {len(df_clean)}")
    # 2. Handle obviously erroneous data
    df_clean = df_clean[df_clean['age'] > 0]  # Age must be positive
    df_clean = df_clean[df_clean['credit_score'] >= 300]  # Minimum credit score is 300
    print(f"Number of samples after removing erroneous data: {len(df_clean)}")
    # 3. Apply preprocessing pipeline
    X_clean = df_clean[['age', 'income', 'education_years', 'credit_score', 'gender', 'city']]
    y_clean = df_clean['loan_approved']
    return X_clean, y_clean
X_clean, y_clean = preprocessing_best_practices()
```

3.9.2 Avoid Data Leakage
```python
def avoid_data_leakage_example():
    """Correct practices for avoiding data leakage"""
    # Wrong approach: preprocess before splitting the data
    print("❌ Wrong Approach:")
    X_wrong = df[['age', 'income', 'education_years', 'credit_score', 'gender', 'city']]
    y_wrong = df['loan_approved']
    # Preprocess the entire dataset first (wrong!)
    # (Scikit-learn's scalers disregard NaN when computing statistics, so this runs
    # even though 'income' and 'education_years' still contain missing values.)
    scaler_wrong = StandardScaler()
    X_wrong_scaled = scaler_wrong.fit_transform(X_wrong.select_dtypes(include=[np.number]))
    # Then split the data
    X_train_wrong, X_test_wrong, y_train_wrong, y_test_wrong = train_test_split(
        X_wrong_scaled, y_wrong, test_size=0.2, random_state=42
    )
    print("This approach causes data leakage: statistics from the test rows influence the preprocessing of the training rows")
    # Correct approach: split the data first, then preprocess separately
    print("\n✅ Correct Approach:")
    X_correct = df[['age', 'income', 'education_years', 'credit_score', 'gender', 'city']]
    y_correct = df['loan_approved']
    # Split the data first
    X_train_correct, X_test_correct, y_train_correct, y_test_correct = train_test_split(
        X_correct, y_correct, test_size=0.2, random_state=42
    )
    # Fit the preprocessor on the training set only
    preprocessor_correct = create_preprocessing_pipeline()
    X_train_processed = preprocessor_correct.fit_transform(X_train_correct)
    # Apply the fitted preprocessor to the test set (without refitting)
    X_test_processed = preprocessor_correct.transform(X_test_correct)
    print("This approach avoids data leakage and keeps the test set completely independent")
avoid_data_leakage_example()
```

3.10 Exercises
Exercise 1: Missing Value Handling
- Create a dataset with 30% missing values
- Compare effects of mean, median, mode, and KNN imputation
- Analyze which method is most suitable for your data (a starter sketch for injecting the missing values follows below)
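One possible starting point is to mask roughly 30% of all cells at random; a minimal sketch (`df_exercise` and the seed are illustrative choices):

```python
rng = np.random.default_rng(0)
df_exercise = df[['age', 'income', 'education_years', 'credit_score']].copy()
# Randomly mask ~30% of all cells as missing
mask = rng.random(df_exercise.shape) < 0.30
df_exercise = df_exercise.mask(mask)
print(df_exercise.isnull().mean())  # should be roughly 0.3 per column
```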
Exercise 2: Outlier Detection
- Implement Z-score method for outlier detection
- Compare differences between IQR and Z-score methods
- Visualize the outlier detection results (a Z-score starter sketch follows below)
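A minimal Z-score starter sketch (the threshold of 3.0 is a common convention, not a fixed rule):

```python
def detect_outliers_zscore(data, column, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the column mean."""
    values = data[column].dropna()
    z_scores = (values - values.mean()) / values.std()
    return data.loc[z_scores[np.abs(z_scores) > threshold].index]

# Compare with the IQR results from Section 3.3.3
age_outliers_z = detect_outliers_zscore(df, 'age')
print(f"Z-score outliers in 'age': {len(age_outliers_z)}")
```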
Exercise 3: Feature Scaling
- Create a dataset with features of different scales
- Compare model performance with and without scaling, and with different scaling methods
- Analyze which scaling method is most suitable for different algorithms (a comparison sketch follows below)
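As a hint: tree-based models such as the RandomForest used in this chapter are largely insensitive to feature scales, while distance-based models are not. Below is a minimal sketch comparing a KNN classifier with and without standardization, reusing `df_processed` from Section 3.4 (since the synthetic labels are random, the absolute scores mainly reflect the 70/30 class balance; the pattern matters more on real data):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Numeric-only, already-imputed data from Section 3.4
X_num = df_processed[['age', 'income', 'education_years', 'credit_score']]
y_num = df_processed['loan_approved']
X_tr, X_te, y_tr, y_te = train_test_split(X_num, y_num, test_size=0.2, random_state=42)

# Without scaling, distances are dominated by the large-scale 'income' feature
knn_raw = KNeighborsClassifier().fit(X_tr, y_tr)
# With standardization applied inside a pipeline (fitted on training data only)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

print(f"KNN without scaling: {knn_raw.score(X_te, y_te):.4f}")
print(f"KNN with scaling: {knn_scaled.score(X_te, y_te):.4f}")
```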
Exercise 4: Encoding Methods
- Create a dataset with high-cardinality categorical features
- Compare effects of label encoding, one-hot encoding, and target encoding
- Analyze the impact of different encoding methods on model performance and training time (a target-encoding starter sketch follows below)
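Target encoding is not demonstrated in this chapter's main text; here is a minimal mean-target-encoding sketch with smoothing as a starting point (the smoothing weight `m` is a tunable assumption, and in a real project the encoding must be fitted on training data only, as discussed in Section 3.9.2):

```python
def target_encode(train, column, target, m=10.0):
    """Replace each category with a smoothed mean of the target."""
    global_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(['mean', 'count'])
    # Shrink rare categories toward the global mean to limit overfitting
    smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
    return train[column].map(smoothed)

df_te = df.copy()
df_te['city_target_encoded'] = target_encode(df_te, 'city', 'loan_approved')
print(df_te[['city', 'city_target_encoded']].head())
```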
3.11 Summary
In this chapter, we learned:
Core Concepts
- Data Quality Issues: Missing values, outliers, scale differences
- Importance of Preprocessing: Improve model performance and stability
- Avoiding Data Leakage: Correct timing of preprocessing
Main Techniques
- Missing Value Handling: SimpleImputer, KNNImputer
- Outlier Handling: IQR method, capping method
- Feature Scaling: StandardScaler, MinMaxScaler, RobustScaler
- Category Encoding: LabelEncoder, OneHotEncoder
- Pipeline Building: Pipeline, ColumnTransformer
Best Practices
- Split data first, then preprocess
- Choose appropriate imputation strategy
- Choose scaling method based on algorithm
- Consider cardinality of categorical features
Key Points
- Data preprocessing is key to machine learning success
- Different preprocessing methods are suitable for different scenarios
- Pipelines can avoid data leakage and improve code reusability
- Preprocessing decisions should be based on data characteristics and business understanding
3.12 Next Steps
Now you have mastered the core skills of data preprocessing! In the next chapter, Linear Regression Explained, we will study our first major machine learning algorithm, linear regression, and learn how to predict continuous values.
Chapter Key Points Review:
- ✅ Mastered methods to identify and handle data quality issues
- ✅ Learned to use Scikit-learn's preprocessing tools
- ✅ Understood applicable scenarios for different preprocessing methods
- ✅ Mastered skills to build preprocessing pipelines
- ✅ Learned best practices to avoid data leakage