Chapter 3: Data Preprocessing Basics
Data preprocessing is one of the most important steps in a machine learning project. Real-world data is often "dirty": it contains missing values, outliers, features on different scales, and other issues. This chapter shows how to use Scikit-learn to preprocess data.
3.1 Why Do We Need Data Preprocessing?
In real projects, raw data typically has the following problems:
- Missing Values: Omissions during data collection
- Outliers: Measurement errors or extreme cases
- Different Scales: Large differences in value ranges between different features
- Inconsistent Data Types: Mix of numeric and categorical data
- Duplicate Data: Same records appearing multiple times
3.2 Create Example Dataset
First, let's create an example dataset that contains several of these problems.
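The chapter's original dataset code is not reproduced here, so the following is only a minimal sketch of what such a dataset might look like. The column names (age, income, score, city), the sizes, and the injected problems are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200  # hypothetical number of rows

# Hypothetical columns: two numeric features on very different scales,
# one feature already in [0, 1], and one categorical feature.
df = pd.DataFrame({
    "age": rng.normal(35, 10, n).round(),
    "income": rng.normal(50_000, 15_000, n).round(2),
    "score": rng.uniform(0, 1, n),
    "city": rng.choice(["Beijing", "Shanghai", "Shenzhen"], n),
})

# Outliers: overwrite three incomes with extreme values.
df.loc[[0, 1, 2], "income"] = [500_000.0, 750_000.0, 1_000_000.0]

# Missing values: blank out 20 ages and 20 incomes (about 10% each).
# Income rows are drawn from index 3 onward so the outliers are kept.
missing_age = rng.choice(n, size=20, replace=False)
missing_income = rng.choice(np.arange(3, n), size=20, replace=False)
df.loc[missing_age, "age"] = np.nan
df.loc[missing_income, "income"] = np.nan

# Duplicates: append copies of the first five rows.
df = pd.concat([df, df.iloc[:5]], ignore_index=True)

print(df.head())
```

The dataset now mixes numeric and categorical columns and contains missing values, outliers, scale differences, and duplicate rows, which covers the problems listed in section 3.1.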
3.3 Data Exploration and Problem Identification
3.3.1 Basic Statistics
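A first pass over a dataset usually checks column types, summary statistics, missing counts, and duplicate rows. The sketch below uses a small hypothetical DataFrame (the columns and values are assumptions, not the chapter's data) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Small hypothetical dataset with one missing age, one missing income,
# one extreme income, and one duplicated row.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 32, 29],
    "income": [42_000, 58_000, 61_000, np.nan, 58_000, 900_000],
    "city": ["A", "B", "B", "C", "B", "A"],
})

print(df.dtypes)       # data type of each column
print(df.describe())   # count, mean, std, min, quartiles, max for numeric columns

missing_counts = df.isna().sum()       # missing values per column
duplicate_count = df.duplicated().sum()  # rows identical to an earlier row

print(missing_counts)
print("Duplicate rows:", duplicate_count)
```

A large gap between the 75th percentile and the maximum in the describe() output, as with income here, is often the first hint of an outlier.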
3.3.2 Visualize Data Distribution
3.3.3 Outlier Detection
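The IQR method referred to in this chapter flags a value as an outlier when it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile. The following is a minimal sketch on a hypothetical array; the 1.5 multiplier is the conventional choice, not a value taken from the chapter's own code.

```python
import numpy as np

# Hypothetical measurements with one obvious extreme value.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 14.0, 95.0])

# First and third quartiles, and the interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Conventional IQR fences.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask marking values outside the fences.
is_outlier = (values < lower) | (values > upper)

print("Bounds:", lower, upper)
print("Outliers:", values[is_outlier])
```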
3.4 Handling Missing Values
3.4.1 Simple Imputation Strategies
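SimpleImputer replaces each missing value with a per-column statistic. The sketch below compares the mean, median, and most_frequent strategies on a small hypothetical array; the values are chosen only so that the three strategies give visibly different results.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: one missing value in each column.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [2.0, 80.0],
])

# Each imputer learns its statistic per column in fit() and fills in transform().
X_mean = SimpleImputer(strategy="mean").fit_transform(X)
X_median = SimpleImputer(strategy="median").fit_transform(X)
X_mode = SimpleImputer(strategy="most_frequent").fit_transform(X)

print("Mean imputation:\n", X_mean)
print("Median imputation:\n", X_median)
print("Most-frequent imputation:\n", X_mode)
```

In the second column the observed values are 10, 30, and 80, so the mean fills in 40 while the median fills in 30. The median is less sensitive to extreme values, which matters when outliers have not been handled yet.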
3.4.2 Advanced Imputation Strategies
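KNNImputer fills a missing value with the average of that feature over the nearest rows, where distance is computed on the features that are present. The sketch below is a minimal illustration on hypothetical data; n_neighbors=2 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the second feature is twice the first,
# plus one distant row that a global mean would be pulled toward.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [100.0, 200.0],
])

# Use the 2 nearest rows (by the observed features) to fill each gap.
knn = KNNImputer(n_neighbors=2)
X_knn = knn.fit_transform(X)

print(X_knn)
```

The two rows closest to the incomplete row are [3, 6] and [2, 4], so the missing value is filled with their average, 5. Mean imputation would instead use the column mean of 53, which is dominated by the distant row.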
3.4.3 Imputation Effect Comparison
3.5 Handling Outliers
3.5.1 Remove Outliers
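One way to remove outliers is to keep only the rows that fall inside the IQR fences. The sketch below assumes a hypothetical income column; whether removal is appropriate depends on whether the extreme values are errors or genuine observations.

```python
import pandas as pd

# Hypothetical incomes with one extreme value.
df = pd.DataFrame({
    "income": [42_000, 48_000, 51_000, 55_000, 58_000, 61_000, 900_000],
})

# IQR fences for the column.
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside the fences (between() is inclusive on both ends).
mask = df["income"].between(lower, upper)
df_clean = df[mask].reset_index(drop=True)

print(f"Removed {len(df) - len(df_clean)} row(s)")
print(df_clean)
```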
3.5.2 Cap Outliers
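Capping keeps every row but pulls extreme values back to a boundary. The sketch below caps at the IQR fences using np.clip; the choice of fences is an assumption, and percentile-based limits are another common option.

```python
import numpy as np

# Hypothetical measurements with one extreme value.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 14.0, 95.0])

# IQR fences.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values below `lower` become `lower`; values above `upper` become `upper`.
capped = np.clip(values, lower, upper)

print("Before:", values)
print("After: ", capped)
```

Unlike removal, capping preserves the sample size, at the cost of distorting the tail of the distribution.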
3.6 Feature Scaling
3.6.1 Standardization (StandardScaler)
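StandardScaler subtracts each feature's mean and divides by its standard deviation, so every feature ends up with mean 0 and unit variance on the data it was fitted on. The sketch below uses hypothetical train and test arrays and fits the scaler on the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[4.0, 400.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std from train, then scale
X_test_std = scaler.transform(X_test)        # reuse the training statistics

print("Learned means:", scaler.mean_)
print("Learned stds: ", scaler.scale_)
print("Scaled train:\n", X_train_std)
print("Scaled test:\n", X_test_std)
```

After scaling, both columns contain identical values, because they differed only by a constant factor before.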
3.6.2 Min-Max Scaling (MinMaxScaler)
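MinMaxScaler maps each feature linearly so that the training minimum becomes 0 and the training maximum becomes 1. The sketch below uses hypothetical arrays and also shows that unseen data can land outside that range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and test data.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[4.0, 400.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
X_train_mm = scaler.fit_transform(X_train)
X_test_mm = scaler.transform(X_test)

print("Scaled train:\n", X_train_mm)
print("Scaled test:\n", X_test_mm)
```

The test row lies above the training maximum, so it is mapped to 1.5 rather than to a value inside [0, 1]. This is expected behavior when the scaler is fitted on the training data only.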
3.6.3 Robust Scaling (RobustScaler)
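RobustScaler centers each feature on its median and divides by its interquartile range, so a few extreme values have little effect on the scaling parameters. The sketch below is a minimal illustration on a hypothetical single-feature array with one outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one extreme value.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler()  # defaults: center on the median, scale by the 25th-75th percentile range
X_robust = scaler.fit_transform(X)

print("Median (center_):", scaler.center_)
print("IQR (scale_):    ", scaler.scale_)
print(X_robust.ravel())
```

The four typical values end up between -1 and 0.5, while the outlier stays far away at 48.5. The outlier is not removed, but it no longer compresses the other values the way it would under min-max scaling.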
3.6.4 Scaling Method Comparison
3.7 Categorical Feature Encoding
3.7.1 Label Encoding
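Label encoding maps each category to an integer. In Scikit-learn, LabelEncoder is documented as a tool for target labels, while OrdinalEncoder is the counterpart for feature columns and lets you state the category order explicitly. The sketch below shows both on hypothetical data; the category names are assumptions.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Target labels: LabelEncoder sorts the classes and numbers them from 0.
y = ["cat", "dog", "bird", "dog"]
le = LabelEncoder()
y_encoded = le.fit_transform(y)
print(le.classes_)   # learned classes in sorted order
print(y_encoded)     # integer code for each label

# Feature column with a natural order: pass the order explicitly.
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])
oe = OrdinalEncoder(categories=[["small", "medium", "large"]])
sizes_encoded = oe.fit_transform(sizes)
print(sizes_encoded.ravel())
```

Integer codes imply an order. That is appropriate for small/medium/large, but for unordered categories a linear model may read meaning into the arbitrary numbering.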
3.7.2 One-Hot Encoding
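One-hot encoding creates one binary column per category, which avoids implying any order between categories. The sketch below uses a hypothetical city column and sets handle_unknown="ignore" so that a category not seen during fitting is encoded as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature.
cities = np.array([["Beijing"], ["Shanghai"], ["Beijing"], ["Shenzhen"]])

encoder = OneHotEncoder(handle_unknown="ignore")
# The encoder returns a sparse matrix by default; convert to dense for display.
onehot = encoder.fit_transform(cities).toarray()

print(encoder.categories_[0])  # column order of the encoded output
print(onehot)

# A category that was not present during fit becomes a row of zeros.
unseen = encoder.transform(np.array([["Guangzhou"]])).toarray()
print(unseen)
```

The number of output columns equals the number of distinct categories, so one-hot encoding can become expensive for high-cardinality features.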
3.7.3 Encoding Method Comparison
3.8 Complete Preprocessing Pipeline
3.8.1 Create Preprocessing Pipeline
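A preprocessing pipeline can apply different steps to numeric and categorical columns by combining Pipeline and ColumnTransformer. The sketch below is one possible arrangement on a hypothetical DataFrame; the column names, the median strategy, and the step names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with missing numeric values and one categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 41.0, 33.0],
    "income": [42_000.0, 58_000.0, np.nan, 61_000.0],
    "city": ["A", "B", "A", "C"],
})

numeric_features = ["age", "income"]
categorical_features = ["city"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical columns: one-hot encode, ignoring unseen categories.
categorical_transformer = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each column group to its transformer.
# sparse_threshold=0 forces a dense array so the result is easy to inspect.
preprocessor = ColumnTransformer(
    [
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    sparse_threshold=0,
)

X_processed = preprocessor.fit_transform(df)
print(X_processed.shape)  # 2 numeric columns + 3 one-hot columns
print(X_processed)
```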
3.8.2 Complete Machine Learning Pipeline
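Putting the preprocessor and an estimator into a single Pipeline means that fit() learns the preprocessing parameters from the training data only, and predict() applies the same transformations automatically. The sketch below is a minimal example on synthetic data; the columns, the target rule, and the choice of LogisticRegression are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic, hypothetical data.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(35, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
# Hypothetical target: 1 if income is above 50,000 (computed before adding gaps).
y = (df["income"] > 50_000).astype(int)
# Blank out 10% of incomes to simulate missing values.
df.loc[rng.choice(n, size=20, replace=False), "income"] = np.nan

# Split first, so the test set plays no part in fitting.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0, stratify=y
)

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Preprocessing and model in one estimator.
model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Because the whole pipeline is a single estimator, it can be passed directly to tools such as cross_val_score or GridSearchCV.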
3.9 Data Preprocessing Best Practices
3.9.1 Processing Order
3.9.2 Avoid Data Leakage
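Data leakage occurs when information from the test set influences preprocessing or training, for example by fitting a scaler on the full dataset before splitting. The sketch below shows two leak-free patterns on synthetic data: fitting the scaler on the training split only, and wrapping the scaler in a pipeline so that cross-validation refits it on each training fold. The dataset and model are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Step 1: split before any preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Leaky (avoid): StandardScaler().fit(X) would use test rows to compute mean/std.

# Pattern 1: fit on the training split only, then reuse for the test split.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Pattern 2: put the scaler in a pipeline, so each cross-validation fold
# fits the scaler on that fold's training portion only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_train, y_train, cv=5)

print("CV accuracy per fold:", scores)
```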
3.10 Exercises
Exercise 1: Missing Value Handling
- Create a dataset with 30% missing values
- Compare effects of mean, median, mode, and KNN imputation
- Analyze which method is most suitable for your data
Exercise 2: Outlier Detection
- Implement Z-score method for outlier detection
- Compare differences between IQR and Z-score methods
- Visualize outlier detection results
Exercise 3: Feature Scaling
- Create a dataset with features of different scales
- Compare model performance with and without scaling, and with different scaling methods
- Analyze which scaling method is most suitable for different algorithms
Exercise 4: Encoding Methods
- Create a dataset with high-cardinality categorical features
- Compare effects of label encoding, one-hot encoding, and target encoding
- Analyze impacts of different encoding methods on model performance and training time
3.11 Summary
In this chapter, we learned:
Core Concepts
- Data Quality Issues: Missing values, outliers, scale differences
- Importance of Preprocessing: Improve model performance and stability
- Avoiding Data Leakage: Correct timing of preprocessing
Main Techniques
- Missing Value Handling: SimpleImputer, KNNImputer
- Outlier Handling: IQR method, capping method
- Feature Scaling: StandardScaler, MinMaxScaler, RobustScaler
- Category Encoding: LabelEncoder, OneHotEncoder
- Pipeline Building: Pipeline, ColumnTransformer
Best Practices
- Split data first, then fit preprocessing on the training set only
- Choose appropriate imputation strategy
- Choose scaling method based on algorithm
- Consider cardinality of categorical features
Key Points
- Data preprocessing is key to machine learning success
- Different preprocessing methods are suitable for different scenarios
- Pipelines can avoid data leakage and improve code reusability
- Preprocessing decisions should be based on data characteristics and business understanding
3.12 Next Steps
You have now covered the core skills of data preprocessing. The next chapter, Linear Regression Explained, takes a close look at the first major machine learning algorithm, linear regression, and shows how to use it to predict continuous values.
Chapter Key Points Review:
- ✅ Mastered methods to identify and handle data quality issues
- ✅ Learned to use Scikit-learn's preprocessing tools
- ✅ Understood applicable scenarios for different preprocessing methods
- ✅ Mastered skills to build preprocessing pipelines
- ✅ Learned best practices to avoid data leakage