Chapter 12: Principal Component Analysis
Principal Component Analysis (PCA) is one of the most important dimensionality reduction techniques. It uses a linear transformation to project high-dimensional data into a low-dimensional space, reducing the number of features while preserving the main information in the data. This chapter provides a detailed introduction to the principles, implementation, and applications of PCA.
12.1 What is Principal Component Analysis?
PCA is an unsupervised dimensionality reduction technique that finds the directions of maximum variance in the data (principal components) and projects the data onto these directions, thereby achieving dimensionality reduction.
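In matrix form, this objective can be stated precisely. For a centered data matrix $X$ with covariance matrix $\Sigma$, the first principal component direction solves

$$w_1 = \arg\max_{\lVert w \rVert = 1} \operatorname{Var}(Xw) = \arg\max_{\lVert w \rVert = 1} w^{\top} \Sigma w,$$

and each subsequent component maximizes the same objective while remaining orthogonal to all earlier components; the solutions are the eigenvectors of $\Sigma$, ordered by decreasing eigenvalue.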
12.1.1 Goals of PCA
- Dimensionality Reduction: Reduce the number of features and simplify data structure
- Decorrelation: Eliminate linear correlations between features
- Information Preservation: Preserve as much information from the original data as possible
- Visualization: Project high-dimensional data into 2D or 3D space
12.1.2 Applications of PCA
- Data Compression: Reduce storage space and computational complexity
- Noise Removal: Remove noise by keeping main components
- Feature Extraction: Extract the most important feature combinations
- Data Visualization: Visual display of high-dimensional data
- Preprocessing: Prepare data for other machine learning algorithms
12.1.3 Mathematical Principles of PCA
PCA achieves dimensionality reduction through the following steps (sketched in NumPy code after the list):
- Data Standardization: Make all features have the same scale
- Covariance Matrix Calculation: Measure correlations between features
- Eigenvalue Decomposition: Find eigenvectors and eigenvalues of the covariance matrix
- Principal Component Selection: Select main directions based on eigenvalue size
- Data Projection: Project original data onto principal component space
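These steps can be implemented directly with NumPy. The following is a minimal, self-contained sketch on randomly generated toy data (the variable names are illustrative and not part of any library); the rest of this chapter uses sklearn.decomposition.PCA, which performs the same computation.
import numpy as np
np.random.seed(0)
X_toy = np.random.normal(size=(100, 3))  # toy data: 100 samples, 3 features
# 1. Standardization: zero mean and unit variance per feature
X_toy_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
# 2. Covariance matrix of the standardized features
cov = np.cov(X_toy_std, rowvar=False)
# 3. Eigenvalue decomposition (eigh is appropriate for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# 4. Principal component selection: sort by descending eigenvalue, keep the top k
order = np.argsort(eigenvalues)[::-1]
k = 2
components = eigenvectors[:, order[:k]]  # shape (3, k)
# 5. Data projection onto the principal component space
X_toy_projected = X_toy_std @ components  # shape (100, k)
explained_ratio = eigenvalues[order[:k]] / eigenvalues.sum()
print("Explained variance ratio:", explained_ratio)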
12.2 Environment and Data Preparation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, load_wine, load_breast_cancer, make_classification
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from mpl_toolkits.mplot3d import Axes3D
import time
import warnings
warnings.filterwarnings('ignore')
# Set random seed
np.random.seed(42)
# Set figure style
plt.style.use('seaborn-v0_8')
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
12.3 PCA Basic Principles Demonstration
12.3.1 PCA of Two-dimensional Data
def demonstrate_pca_2d():
"""Demonstrate PCA process for two-dimensional data"""
# Create correlated two-dimensional data
np.random.seed(42)
n_samples = 200
# Generate correlated data
x1 = np.random.normal(0, 1, n_samples)
x2 = 0.8 * x1 + 0.6 * np.random.normal(0, 1, n_samples)
X_2d = np.column_stack([x1, x2])
print("PCA basic principles demonstration:")
print("1. Original data has correlations")
print("2. PCA finds direction of maximum variance as first principal component")
print("3. Second principal component is orthogonal to first principal component")
print("4. Data projected onto principal component space achieves dimensionality reduction")
# Standardize data
scaler = StandardScaler()
X_2d_scaled = scaler.fit_transform(X_2d)
# Apply PCA
pca_2d = PCA(n_components=2)
X_2d_pca = pca_2d.fit_transform(X_2d_scaled)
# Get principal components
components = pca_2d.components_
explained_variance_ratio = pca_2d.explained_variance_ratio_
print(f"\nPrincipal component analysis results:")
print(f"First principal component explained variance ratio: {explained_variance_ratio[0]:.3f}")
print(f"Second principal component explained variance ratio: {explained_variance_ratio[1]:.3f}")
print(f"Cumulative explained variance ratio: {np.cumsum(explained_variance_ratio)}")
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('PCA Principle Demonstration', fontsize=16)
# Original data
axes[0, 0].scatter(X_2d_scaled[:, 0], X_2d_scaled[:, 1], alpha=0.6, color='blue')
axes[0, 0].set_title('Standardized Original Data')
axes[0, 0].set_xlabel('Feature 1 (Standardized)')
axes[0, 0].set_ylabel('Feature 2 (Standardized)')
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].axis('equal')
# Display principal component directions
axes[0, 1].scatter(X_2d_scaled[:, 0], X_2d_scaled[:, 1], alpha=0.6, color='blue')
# Draw principal component vectors
mean_point = np.mean(X_2d_scaled, axis=0)
for i, (component, variance_ratio) in enumerate(zip(components, explained_variance_ratio)):
# Scale vector for visualization
vector = component * np.sqrt(pca_2d.explained_variance_[i]) * 2
axes[0, 1].arrow(mean_point[0], mean_point[1], vector[0], vector[1],
head_width=0.1, head_length=0.1, fc=f'C{i}', ec=f'C{i}',
linewidth=3, label=f'PC{i+1} ({variance_ratio:.2f})')
axes[0, 1].set_title('Principal Component Directions')
axes[0, 1].set_xlabel('Feature 1 (Standardized)')
axes[0, 1].set_ylabel('Feature 2 (Standardized)')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].axis('equal')
# PCA transformed data
axes[1, 0].scatter(X_2d_pca[:, 0], X_2d_pca[:, 1], alpha=0.6, color='red')
axes[1, 0].set_title('PCA Transformed Data')
axes[1, 0].set_xlabel('First Principal Component')
axes[1, 0].set_ylabel('Second Principal Component')
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].axis('equal')
# Reconstruction keeping only first principal component
X_1d_pca = pca_2d.transform(X_2d_scaled)
X_1d_pca[:, 1] = 0 # Set second principal component to 0
X_reconstructed = pca_2d.inverse_transform(X_1d_pca)
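# Optional check (illustrative): the information lost by discarding PC2 can be
# quantified as the mean squared reconstruction error, e.g.
#   reconstruction_error = np.mean((X_2d_scaled - X_reconstructed) ** 2)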
axes[1, 1].scatter(X_2d_scaled[:, 0], X_2d_scaled[:, 1], alpha=0.3, color='blue', label='Original Data')
axes[1, 1].scatter(X_reconstructed[:, 0], X_reconstructed[:, 1], alpha=0.6, color='red', label='Reconstructed Data')
axes[1, 1].set_title('Dimensionality Reduction Reconstruction (Keeping Only First Principal Component)')
axes[1, 1].set_xlabel('Feature 1 (Standardized)')
axes[1, 1].set_ylabel('Feature 2 (Standardized)')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].axis('equal')
plt.tight_layout()
plt.show()
return X_2d_scaled, X_2d_pca, pca_2d
X_2d_scaled, X_2d_pca, pca_2d = demonstrate_pca_2d()
12.3.2 Covariance Matrix and Eigenvalue Decomposition
def analyze_covariance_and_eigenvalues():
"""Analyze covariance matrix and eigenvalue decomposition"""
print("Covariance matrix and eigenvalue decomposition analysis:")
print("=" * 40)
# Calculate covariance matrix
cov_matrix = np.cov(X_2d_scaled.T)
print("Covariance matrix:")
print(cov_matrix)
# Manual eigenvalue decomposition
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
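# Note: for a symmetric matrix such as a covariance matrix, np.linalg.eigh is
# generally preferred over np.linalg.eig -- it guarantees real eigenvalues and
# returns them in ascending order, e.g.:
#   eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Here np.linalg.eig is used and the results are sorted explicitly below.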
# Sort by eigenvalue size
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
print(f"\nEigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}")
# Calculate explained variance ratio
explained_variance_ratio_manual = eigenvalues / np.sum(eigenvalues)
print(f"\nManually calculated explained variance ratio: {explained_variance_ratio_manual}")
print(f"PCA calculated explained variance ratio: {pca_2d.explained_variance_ratio_}")
# Visualize eigenvalues and eigenvectors
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# Covariance matrix heatmap
sns.heatmap(cov_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[0])
axes[0].set_title('Covariance Matrix')
# Eigenvalues
axes[1].bar(range(1, len(eigenvalues) + 1), eigenvalues, color='skyblue', alpha=0.7)
axes[1].set_title('Eigenvalues')
axes[1].set_xlabel('Principal Component')
axes[1].set_ylabel('Eigenvalue')
axes[1].grid(True, alpha=0.3)
# Explained variance ratio
axes[2].bar(range(1, len(explained_variance_ratio_manual) + 1),
explained_variance_ratio_manual, color='lightgreen', alpha=0.7)
axes[2].set_title('Explained Variance Ratio')
axes[2].set_xlabel('Principal Component')
axes[2].set_ylabel('Explained Variance Ratio')
axes[2].grid(True, alpha=0.3)
# Add numeric labels
for i, (val, ratio) in enumerate(zip(eigenvalues, explained_variance_ratio_manual)):
axes[1].text(i + 1, val + 0.01, f'{val:.3f}', ha='center')
axes[2].text(i + 1, ratio + 0.01, f'{ratio:.3f}', ha='center')
plt.tight_layout()
plt.show()
return cov_matrix, eigenvalues, eigenvectors
cov_matrix, eigenvalues, eigenvectors = analyze_covariance_and_eigenvalues()
12.4 PCA for High-dimensional Data
12.4.1 Iris Dataset PCA
def pca_iris_analysis():
"""PCA analysis of Iris dataset"""
# Load Iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print("Iris dataset PCA analysis:")
print(f"Original data shape: {X_iris.shape}")
print(f"Feature names: {feature_names}")
# Standardize data
scaler = StandardScaler()
X_iris_scaled = scaler.fit_transform(X_iris)
# Apply PCA
pca_iris = PCA()
X_iris_pca = pca_iris.fit_transform(X_iris_scaled)
# Analyze results
explained_variance_ratio = pca_iris.explained_variance_ratio_
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
print(f"\nExplained variance ratio for each principal component:")
for i, ratio in enumerate(explained_variance_ratio):
print(f"PC{i+1}: {ratio:.4f}")
print(f"\nCumulative explained variance ratio:")
for i, cum_ratio in enumerate(cumulative_variance_ratio):
print(f"First {i+1} principal components: {cum_ratio:.4f}")
# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Iris Dataset PCA Analysis', fontsize=16)
# Explained variance ratio
axes[0, 0].bar(range(1, len(explained_variance_ratio) + 1),
explained_variance_ratio, color='skyblue', alpha=0.7)
axes[0, 0].set_title('Explained Variance Ratio for Each Principal Component')
axes[0, 0].set_xlabel('Principal Component')
axes[0, 0].set_ylabel('Explained Variance Ratio')
axes[0, 0].grid(True, alpha=0.3)
# Cumulative explained variance ratio
axes[0, 1].plot(range(1, len(cumulative_variance_ratio) + 1),
cumulative_variance_ratio, 'ro-', linewidth=2, markersize=8)
axes[0, 1].axhline(y=0.95, color='red', linestyle='--', alpha=0.7, label='95% Threshold')
axes[0, 1].set_title('Cumulative Explained Variance Ratio')
axes[0, 1].set_xlabel('Number of Principal Components')
axes[0, 1].set_ylabel('Cumulative Explained Variance Ratio')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Principal component loading plot
components_df = pd.DataFrame(
pca_iris.components_[:2].T,
columns=['PC1', 'PC2'],
index=feature_names
)
axes[0, 2].scatter(components_df['PC1'], components_df['PC2'], s=100)
for i, feature in enumerate(feature_names):
axes[0, 2].annotate(feature, (components_df.iloc[i, 0], components_df.iloc[i, 1]),
xytext=(5, 5), textcoords='offset points')
axes[0, 2].set_title('Principal Component Loading Plot')
axes[0, 2].set_xlabel('PC1')
axes[0, 2].set_ylabel('PC2')
axes[0, 2].grid(True, alpha=0.3)
# 2D projection
colors = ['red', 'blue', 'green']
for i, target_name in enumerate(target_names):
mask = y_iris == i
axes[1, 0].scatter(X_iris_pca[mask, 0], X_iris_pca[mask, 1],
c=colors[i], alpha=0.7, label=target_name)
axes[1, 0].set_title('2D Projection of First Two Principal Components')
axes[1, 0].set_xlabel(f'PC1 ({explained_variance_ratio[0]:.2%})')
axes[1, 0].set_ylabel(f'PC2 ({explained_variance_ratio[1]:.2%})')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# 3D projection
ax_3d = fig.add_subplot(2, 3, 5, projection='3d')
for i, target_name in enumerate(target_names):
mask = y_iris == i
ax_3d.scatter(X_iris_pca[mask, 0], X_iris_pca[mask, 1], X_iris_pca[mask, 2],
c=colors[i], alpha=0.7, label=target_name)
ax_3d.set_title('3D Projection of First Three Principal Components')
ax_3d.set_xlabel(f'PC1 ({explained_variance_ratio[0]:.2%})')
ax_3d.set_ylabel(f'PC2 ({explained_variance_ratio[1]:.2%})')
ax_3d.set_zlabel(f'PC3 ({explained_variance_ratio[2]:.2%})')
ax_3d.legend()
# Original feature vs principal component
feature_comparison = pd.DataFrame({
'Sepal Length': X_iris_scaled[:, 0],
'PC1': X_iris_pca[:, 0],
'Class': y_iris
})
for i, target_name in enumerate(target_names):
mask = y_iris == i
axes[1, 2].scatter(feature_comparison.loc[mask, 'Sepal Length'],
feature_comparison.loc[mask, 'PC1'],
c=colors[i], alpha=0.7, label=target_name)
axes[1, 2].set_title('Original Feature vs First Principal Component')
axes[1, 2].set_xlabel('Sepal Length (Standardized)')
axes[1, 2].set_ylabel('PC1')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return X_iris_scaled, X_iris_pca, pca_iris
X_iris_scaled, X_iris_pca, pca_iris = pca_iris_analysis()
12.4.2 Determining the Number of Principal Components
def determine_n_components():
"""Methods for determining optimal number of principal components"""
# Use wine dataset
wine = load_wine()
X_wine = wine.data
y_wine = wine.target
# Standardize
scaler = StandardScaler()
X_wine_scaled = scaler.fit_transform(X_wine)
# Calculate all principal components
pca_full = PCA()
pca_full.fit(X_wine_scaled)
explained_variance_ratio = pca_full.explained_variance_ratio_
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
print("Methods for determining number of principal components:")
print("=" * 30)
# Method 1: Cumulative explained variance ratio threshold
threshold_95 = np.argmax(cumulative_variance_ratio >= 0.95) + 1
threshold_90 = np.argmax(cumulative_variance_ratio >= 0.90) + 1
print(f"Method 1 - Cumulative explained variance ratio:")
print(f" Need {threshold_90} principal components to retain 90% variance")
print(f" Need {threshold_95} principal components to retain 95% variance")
# Method 2: Kaiser criterion (eigenvalues > 1)
eigenvalues = pca_full.explained_variance_
kaiser_components = np.sum(eigenvalues > 1)
print(f"\nMethod 2 - Kaiser criterion (eigenvalues > 1): {kaiser_components} principal components")
# Method 3: Scree plot analysis
print(f"\nMethod 3 - Scree plot: Observe the 'elbow' of eigenvalues")
# Visualize different methods
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Methods for Determining Number of Principal Components', fontsize=16)
# Explained variance ratio
axes[0, 0].bar(range(1, len(explained_variance_ratio) + 1),
explained_variance_ratio, alpha=0.7, color='skyblue')
axes[0, 0].set_title('Explained Variance Ratio for Each Principal Component')
axes[0, 0].set_xlabel('Principal Component')
axes[0, 0].set_ylabel('Explained Variance Ratio')
axes[0, 0].grid(True, alpha=0.3)
# Cumulative explained variance ratio
axes[0, 1].plot(range(1, len(cumulative_variance_ratio) + 1),
cumulative_variance_ratio, 'bo-', linewidth=2, markersize=6)
axes[0, 1].axhline(y=0.90, color='red', linestyle='--', alpha=0.7, label='90%')
axes[0, 1].axhline(y=0.95, color='orange', linestyle='--', alpha=0.7, label='95%')
axes[0, 1].axvline(x=threshold_90, color='red', linestyle=':', alpha=0.7)
axes[0, 1].axvline(x=threshold_95, color='orange', linestyle=':', alpha=0.7)
axes[0, 1].set_title('Cumulative Explained Variance Ratio')
axes[0, 1].set_xlabel('Number of Principal Components')
axes[0, 1].set_ylabel('Cumulative Explained Variance Ratio')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Eigenvalues (Kaiser criterion)
axes[1, 0].bar(range(1, len(eigenvalues) + 1), eigenvalues, alpha=0.7, color='lightgreen')
axes[1, 0].axhline(y=1, color='red', linestyle='--', alpha=0.7, label='Kaiser Threshold')
axes[1, 0].set_title('Eigenvalues (Kaiser Criterion)')
axes[1, 0].set_xlabel('Principal Component')
axes[1, 0].set_ylabel('Eigenvalue')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# Scree plot
axes[1, 1].plot(range(1, len(eigenvalues) + 1), eigenvalues, 'ro-', linewidth=2, markersize=8)
axes[1, 1].set_title('Scree Plot')
axes[1, 1].set_xlabel('Principal Component')
axes[1, 1].set_ylabel('Eigenvalue')
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Compare classification performance with different number of principal components
print(f"\nImpact of different number of principal components on classification performance:")
print("PC Count\tCumulative Var Ratio\tClassification Accuracy")
print("-" * 60)
X_train, X_test, y_train, y_test = train_test_split(
X_wine_scaled, y_wine, test_size=0.2, random_state=42
)
n_components_list = [2, 3, 5, 8, 10, 13] # 13 is total number of features
for n_comp in n_components_list:
# PCA dimensionality reduction
pca_temp = PCA(n_components=n_comp)
X_train_pca = pca_temp.fit_transform(X_train)
X_test_pca = pca_temp.transform(X_test)
# Classification
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)
accuracy = accuracy_score(y_test, y_pred)
cum_var_ratio = np.sum(pca_temp.explained_variance_ratio_)
print(f"{n_comp}\t\t{cum_var_ratio:.4f}\t\t{accuracy:.4f}")
return threshold_90, threshold_95, kaiser_components
threshold_90, threshold_95, kaiser_components = determine_n_components()
12.5 PCA Variants
12.5.1 Incremental PCA
def incremental_pca_demo():
"""Demonstrate Incremental PCA"""
print("Incremental PCA:")
print("Suitable for large datasets, can process data in batches")
# Create large dataset
X_large, y_large = make_classification(
n_samples=5000, n_features=50, n_informative=30,
n_redundant=10, random_state=42
)
# Standardize
scaler = StandardScaler()
X_large_scaled = scaler.fit_transform(X_large)
# Compare standard PCA and incremental PCA
import time
# Standard PCA
start_time = time.time()
pca_standard = PCA(n_components=10)
X_pca_standard = pca_standard.fit_transform(X_large_scaled)
time_standard = time.time() - start_time
# Incremental PCA
start_time = time.time()
pca_incremental = IncrementalPCA(n_components=10, batch_size=500)
X_pca_incremental = pca_incremental.fit_transform(X_large_scaled)
time_incremental = time.time() - start_time
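# Note: when the data set does not fit in memory, IncrementalPCA can also be
# trained batch by batch with partial_fit instead of fit_transform, e.g.
# (illustrative sketch):
#   ipca = IncrementalPCA(n_components=10)
#   for batch in np.array_split(X_large_scaled, 10):
#       ipca.partial_fit(batch)
#   X_reduced = ipca.transform(X_large_scaled)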
print(f"\nPerformance comparison:")
print(f"Standard PCA time: {time_standard:.4f} seconds")
print(f"Incremental PCA time: {time_incremental:.4f} seconds")
# Compare result similarity
# Since signs may differ, compare absolute values
correlation = np.corrcoef(np.abs(X_pca_standard[:, 0]), np.abs(X_pca_incremental[:, 0]))[0, 1]
print(f"First principal component correlation: {correlation:.4f}")
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# Standard PCA results
axes[0].scatter(X_pca_standard[:, 0], X_pca_standard[:, 1],
c=y_large, cmap='viridis', alpha=0.6, s=10)
axes[0].set_title('Standard PCA')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')
# Incremental PCA results
axes[1].scatter(X_pca_incremental[:, 0], X_pca_incremental[:, 1],
c=y_large, cmap='viridis', alpha=0.6, s=10)
axes[1].set_title('Incremental PCA')
axes[1].set_xlabel('PC1')
axes[1].set_ylabel('PC2')
# Explained variance ratio comparison
x_pos = np.arange(10)
width = 0.35
axes[2].bar(x_pos - width/2, pca_standard.explained_variance_ratio_,
width, label='Standard PCA', alpha=0.7)
axes[2].bar(x_pos + width/2, pca_incremental.explained_variance_ratio_,
width, label='Incremental PCA', alpha=0.7)
axes[2].set_title('Explained Variance Ratio Comparison')
axes[2].set_xlabel('Principal Component')
axes[2].set_ylabel('Explained Variance Ratio')
axes[2].legend()
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return pca_standard, pca_incremental
pca_standard, pca_incremental = incremental_pca_demo()
12.5.2 Kernel PCA
def kernel_pca_demo():
"""Demonstrate Kernel PCA"""
print("Kernel PCA (Kernel PCA):")
print("Use kernel trick to handle nonlinear data")
# Create nonlinear data
from sklearn.datasets import make_circles, make_moons
# Concentric circles data
X_circles, y_circles = make_circles(n_samples=400, noise=0.1, factor=0.3, random_state=42)
# Crescent moon data
X_moons, y_moons = make_moons(n_samples=400, noise=0.1, random_state=42)
datasets = [
('Concentric Circles', X_circles, y_circles),
('Crescent Moon', X_moons, y_moons)
]
# Different kernel functions
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
for dataset_name, X, y in datasets:
print(f"\n{dataset_name} dataset:")
fig, axes = plt.subplots(1, len(kernels) + 1, figsize=(20, 4))
fig.suptitle(f'Kernel PCA of {dataset_name} Data', fontsize=16)
# Original data
axes[0].scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', alpha=0.7)
axes[0].set_title('Original Data')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
for i, kernel in enumerate(kernels):
try:
# Kernel PCA
if kernel == 'poly':
kpca = KernelPCA(n_components=2, kernel=kernel, degree=3, random_state=42)
elif kernel == 'rbf':
kpca = KernelPCA(n_components=2, kernel=kernel, gamma=1, random_state=42)
else:
kpca = KernelPCA(n_components=2, kernel=kernel, random_state=42)
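# Note: degree=3 and gamma=1 above are illustrative choices rather than tuned
# values; kernel PCA results depend strongly on these hyperparameters, and in
# practice they are selected (e.g. by grid search over gamma for the RBF kernel)
# against a downstream task or a reconstruction criterion.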
X_kpca = kpca.fit_transform(X)
# Visualization
axes[i + 1].scatter(X_kpca[:, 0], X_kpca[:, 1], c=y, cmap='viridis', alpha=0.7)
axes[i + 1].set_title(f'{kernel.upper()} Kernel')
axes[i + 1].set_xlabel('First Kernel Principal Component')
axes[i + 1].set_ylabel('Second Kernel Principal Component')
print(f" {kernel} kernel: success")
except Exception as e:
axes[i + 1].text(0.5, 0.5, f'Error:\n{str(e)[:30]}...',
ha='center', va='center', transform=axes[i + 1].transAxes)
axes[i + 1].set_title(f'{kernel.upper()} Kernel (Failed)')
print(f" {kernel} kernel: failed - {str(e)[:50]}")
plt.tight_layout()
plt.show()
kernel_pca_demo()
12.6 Applications of PCA in Machine Learning
12.6.1 Impact of Dimensionality Reduction on Classification Performance
def pca_classification_performance():
"""Analyze the impact of PCA dimensionality reduction on classification performance"""
# Use breast cancer dataset
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target
print("Analysis of impact of PCA dimensionality reduction on classification performance:")
print(f"Original data shape: {X_cancer.shape}")
# Standardize
scaler = StandardScaler()
X_cancer_scaled = scaler.fit_transform(X_cancer)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X_cancer_scaled, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)
# Test different numbers of principal components
n_components_list = [2, 5, 10, 15, 20, 25, 30] # Original 30 features
results = {
'n_components': [],
'variance_ratio': [],
'train_accuracy': [],
'test_accuracy': [],
'train_time': []
}
print("\nPC Count\tCumulative Var Ratio\tTrain Accuracy\tTest Accuracy\tTrain Time")
print("-" * 70)
for n_comp in n_components_list:
# PCA dimensionality reduction
pca = PCA(n_components=n_comp)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
# Train classifier
start_time = time.time()
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_train_pca, y_train)
train_time = time.time() - start_time
# Predict
y_train_pred = clf.predict(X_train_pca)
y_test_pred = clf.predict(X_test_pca)
# Calculate accuracy
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)
# Cumulative variance ratio
cum_var_ratio = np.sum(pca.explained_variance_ratio_)
# Save results
results['n_components'].append(n_comp)
results['variance_ratio'].append(cum_var_ratio)
results['train_accuracy'].append(train_acc)
results['test_accuracy'].append(test_acc)
results['train_time'].append(train_time)
print(f"{n_comp}\t\t{cum_var_ratio:.4f}\t\t{train_acc:.4f}\t\t{test_acc:.4f}\t\t{train_time:.4f}s")
# Add results for original data
start_time = time.time()
clf_original = LogisticRegression(random_state=42, max_iter=1000)
clf_original.fit(X_train, y_train)
train_time_original = time.time() - start_time
y_train_pred_original = clf_original.predict(X_train)
y_test_pred_original = clf_original.predict(X_test)
train_acc_original = accuracy_score(y_train, y_train_pred_original)
test_acc_original = accuracy_score(y_test, y_test_pred_original)
print(f"Original data\t1.0000\t\t{train_acc_original:.4f}\t\t{test_acc_original:.4f}\t\t{train_time_original:.4f}s")
# Visualize results
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Impact of PCA Dimensionality Reduction on Classification Performance', fontsize=16)
# Accuracy vs number of principal components
axes[0, 0].plot(results['n_components'], results['train_accuracy'],
'bo-', label='Training Accuracy', linewidth=2, markersize=6)
axes[0, 0].plot(results['n_components'], results['test_accuracy'],
'ro-', label='Testing Accuracy', linewidth=2, markersize=6)
axes[0, 0].axhline(y=test_acc_original, color='green', linestyle='--',
alpha=0.7, label='Original Data Test Accuracy')
axes[0, 0].set_title('Accuracy vs Number of Principal Components')
axes[0, 0].set_xlabel('Number of Principal Components')
axes[0, 0].set_ylabel('Accuracy')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Cumulative variance ratio vs number of principal components
axes[0, 1].plot(results['n_components'], results['variance_ratio'],
'go-', linewidth=2, markersize=6)
axes[0, 1].axhline(y=0.95, color='red', linestyle='--', alpha=0.7, label='95% Threshold')
axes[0, 1].set_title('Cumulative Variance Ratio vs Number of Principal Components')
axes[0, 1].set_xlabel('Number of Principal Components')
axes[0, 1].set_ylabel('Cumulative Variance Ratio')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Training time vs number of principal components
axes[1, 0].plot(results['n_components'], results['train_time'],
'mo-', linewidth=2, markersize=6)
axes[1, 0].axhline(y=train_time_original, color='orange', linestyle='--',
alpha=0.7, label='Original Data Training Time')
axes[1, 0].set_title('Training Time vs Number of Principal Components')
axes[1, 0].set_xlabel('Number of Principal Components')
axes[1, 0].set_ylabel('Training Time (seconds)')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# Accuracy vs variance ratio
scatter = axes[1, 1].scatter(results['variance_ratio'], results['test_accuracy'],
c=results['n_components'], cmap='viridis', s=100, alpha=0.7)
axes[1, 1].scatter(1.0, test_acc_original, c='red', s=150, marker='*',
label='Original Data')
# Add colorbar (reuse the scatter handle rather than plotting the points a second time)
plt.colorbar(scatter, ax=axes[1, 1], label='Number of Principal Components')
axes[1, 1].set_title('Test Accuracy vs Cumulative Variance Ratio')
axes[1, 1].set_xlabel('Cumulative Variance Ratio')
axes[1, 1].set_ylabel('Test Accuracy')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Find optimal number of principal components
best_idx = np.argmax(results['test_accuracy'])
best_n_components = results['n_components'][best_idx]
best_test_acc = results['test_accuracy'][best_idx]
best_var_ratio = results['variance_ratio'][best_idx]
print(f"\nOptimal configuration:")
print(f"Number of principal components: {best_n_components}")
print(f"Test accuracy: {best_test_acc:.4f}")
print(f"Cumulative variance ratio: {best_var_ratio:.4f}")
print(f"Performance change compared to original data: {(best_test_acc - test_acc_original):.4f}")
return results
classification_results = pca_classification_performance()
12.6.2 PCA for Data Visualization
def pca_visualization_example():
"""Use PCA for high-dimensional data visualization"""
# Use wine dataset
wine = load_wine()
X_wine = wine.data
y_wine = wine.target
feature_names = wine.feature_names
target_names = wine.target_names
print("Use PCA for high-dimensional data visualization:")
print(f"Original data dimension: {X_wine.shape[1]}")
print(f"Number of classes: {len(target_names)}")
# Standardize
scaler = StandardScaler()
X_wine_scaled = scaler.fit_transform(X_wine)
# PCA to 2D and 3D
pca_2d = PCA(n_components=2)
pca_3d = PCA(n_components=3)
X_wine_2d = pca_2d.fit_transform(X_wine_scaled)
X_wine_3d = pca_3d.fit_transform(X_wine_scaled)
print(f"\n2D PCA explained variance ratio: {pca_2d.explained_variance_ratio_}")
print(f"2D PCA cumulative variance ratio: {np.sum(pca_2d.explained_variance_ratio_):.4f}")
print(f"3D PCA cumulative variance ratio: {np.sum(pca_3d.explained_variance_ratio_):.4f}")
# Visualization
fig = plt.figure(figsize=(20, 15))
# Original feature correlation matrix
plt.subplot(3, 3, 1)
correlation_matrix = np.corrcoef(X_wine_scaled.T)
sns.heatmap(correlation_matrix, cmap='coolwarm', center=0, square=True,
xticklabels=False, yticklabels=False)
plt.title('Original Feature Correlation Matrix')
# Principal component loading plot
plt.subplot(3, 3, 2)
loadings = pca_2d.components_.T
plt.scatter(loadings[:, 0], loadings[:, 1], alpha=0.7)
# Annotate important features
for i, feature in enumerate(feature_names):
if abs(loadings[i, 0]) > 0.3 or abs(loadings[i, 1]) > 0.3:
plt.annotate(feature[:10], (loadings[i, 0], loadings[i, 1]),
xytext=(2, 2), textcoords='offset points', fontsize=8)
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.2%})')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.2%})')
plt.title('Principal Component Loading Plot')
plt.grid(True, alpha=0.3)
# 2D PCA visualization
plt.subplot(3, 3, 3)
colors = ['red', 'blue', 'green']
for i, (target_name, color) in enumerate(zip(target_names, colors)):
mask = y_wine == i
plt.scatter(X_wine_2d[mask, 0], X_wine_2d[mask, 1],
c=color, alpha=0.7, label=target_name, s=50)
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.2%})')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.2%})')
plt.title('2D PCA Visualization')
plt.legend()
plt.grid(True, alpha=0.3)
# 3D PCA visualization
ax_3d = fig.add_subplot(3, 3, 4, projection='3d')
for i, (target_name, color) in enumerate(zip(target_names, colors)):
mask = y_wine == i
ax_3d.scatter(X_wine_3d[mask, 0], X_wine_3d[mask, 1], X_wine_3d[mask, 2],
c=color, alpha=0.7, label=target_name, s=30)
ax_3d.set_xlabel(f'PC1 ({pca_3d.explained_variance_ratio_[0]:.2%})')
ax_3d.set_ylabel(f'PC2 ({pca_3d.explained_variance_ratio_[1]:.2%})')
ax_3d.set_zlabel(f'PC3 ({pca_3d.explained_variance_ratio_[2]:.2%})')
ax_3d.set_title('3D PCA Visualization')
ax_3d.legend()
# Original feature distribution (select several important features)
important_features_idx = [0, 6, 9, 12] # Select several representative features
for idx, feature_idx in enumerate(important_features_idx):
plt.subplot(3, 3, 5 + idx)
for i, (target_name, color) in enumerate(zip(target_names, colors)):
mask = y_wine == i
plt.hist(X_wine_scaled[mask, feature_idx], alpha=0.6,
color=color, label=target_name, bins=15)
plt.xlabel(feature_names[feature_idx][:15])
plt.ylabel('Frequency')
plt.title(f'Feature Distribution: {feature_names[feature_idx][:15]}')
plt.legend()
plt.grid(True, alpha=0.3)
# Explained variance ratio
plt.subplot(3, 3, 9)
all_pca = PCA()
all_pca.fit(X_wine_scaled)
plt.bar(range(1, len(all_pca.explained_variance_ratio_) + 1),
all_pca.explained_variance_ratio_, alpha=0.7, color='skyblue')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance Ratio of All Principal Components')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Analyze meaning of principal components
print(f"\nPrincipal component analysis:")
print("Major contributing features for first two principal components:")
# Get loading matrix
loadings_df = pd.DataFrame(
pca_2d.components_.T,
columns=['PC1', 'PC2'],
index=feature_names
)
print("\nMajor contributing features for PC1:")
pc1_contributions = loadings_df['PC1'].abs().sort_values(ascending=False)
for i, (feature, contribution) in enumerate(pc1_contributions.head(5).items()):
print(f" {i+1}. {feature}: {contribution:.3f}")
print("\nMajor contributing features for PC2:")
pc2_contributions = loadings_df['PC2'].abs().sort_values(ascending=False)
for i, (feature, contribution) in enumerate(pc2_contributions.head(5).items()):
print(f" {i+1}. {feature}: {contribution:.3f}")
return X_wine_2d, X_wine_3d, pca_2d, pca_3d
X_wine_2d, X_wine_3d, pca_2d, pca_3d = pca_visualization_example()
12.7 PCA Limitations and Considerations
12.7.1 PCA Assumptions and Limitations
def pca_limitations_demo():
"""Demonstrate PCA limitations"""
print("PCA limitations and considerations:")
print("=" * 30)
# 1. Linearity assumption limitations
print("1. Linearity assumption limitations:")
# Create nonlinear data
np.random.seed(42)
t = np.linspace(0, 4*np.pi, 300)
x1 = t * np.cos(t) + 0.5 * np.random.randn(300)
x2 = t * np.sin(t) + 0.5 * np.random.randn(300)
X_spiral = np.column_stack([x1, x2])
# Standardize
scaler = StandardScaler()
X_spiral_scaled = scaler.fit_transform(X_spiral)
# Apply PCA
pca_spiral = PCA(n_components=2)
X_spiral_pca = pca_spiral.fit_transform(X_spiral_scaled)
# 2. Variance does not equal importance
print("2. Variance does not equal importance:")
# Create data: one feature has high variance but is useless for classification
np.random.seed(42)
n_samples = 500
# Useful feature (low variance but has classification information)
useful_feature = np.random.normal(0, 0.1, n_samples)
# Useless feature (high variance but no classification information)
useless_feature = np.random.normal(0, 5, n_samples)
# Labels based on useful feature
y_synthetic = (useful_feature > 0).astype(int)
X_synthetic = np.column_stack([useful_feature, useless_feature])
# Deliberately skip standardization in this demo: the point is that PCA follows
# raw variance, so the scale difference between the two features must be kept
# (the variable keeps its name so the plotting code below is unchanged)
X_synthetic_scaled = X_synthetic
# PCA
pca_synthetic = PCA(n_components=2)
X_synthetic_pca = pca_synthetic.fit_transform(X_synthetic_scaled)
# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('PCA Limitations Demonstration', fontsize=16)
# Nonlinear data
axes[0, 0].scatter(X_spiral_scaled[:, 0], X_spiral_scaled[:, 1],
c=t, cmap='viridis', alpha=0.7, s=20)
axes[0, 0].set_title('Nonlinear Spiral Data')
axes[0, 0].set_xlabel('Feature 1')
axes[0, 0].set_ylabel('Feature 2')
# PCA transformed spiral data
axes[0, 1].scatter(X_spiral_pca[:, 0], X_spiral_pca[:, 1],
c=t, cmap='viridis', alpha=0.7, s=20)
axes[0, 1].set_title('PCA Transformed Spiral Data')
axes[0, 1].set_xlabel('PC1')
axes[0, 1].set_ylabel('PC2')
# Principal component directions for spiral data
axes[0, 2].scatter(X_spiral_scaled[:, 0], X_spiral_scaled[:, 1],
c=t, cmap='viridis', alpha=0.5, s=20)
# Draw principal component directions
mean_point = np.mean(X_spiral_scaled, axis=0)
for i, component in enumerate(pca_spiral.components_):
vector = component * np.sqrt(pca_spiral.explained_variance_[i]) * 2
axes[0, 2].arrow(mean_point[0], mean_point[1], vector[0], vector[1],
head_width=0.1, head_length=0.1, fc=f'C{i}', ec=f'C{i}',
linewidth=3, label=f'PC{i+1}')
axes[0, 2].set_title('Principal Component Directions (Spiral Data)')
axes[0, 2].set_xlabel('Feature 1')
axes[0, 2].set_ylabel('Feature 2')
axes[0, 2].legend()
# Variance vs importance problem
axes[1, 0].scatter(X_synthetic_scaled[:, 0], X_synthetic_scaled[:, 1],
c=y_synthetic, cmap='RdYlBu', alpha=0.7, s=30)
axes[1, 0].set_title('Original Data (color = class)')
axes[1, 0].set_xlabel('Useful Feature (Low Variance)')
axes[1, 0].set_ylabel('Useless Feature (High Variance)')
# PCA transformed data
axes[1, 1].scatter(X_synthetic_pca[:, 0], X_synthetic_pca[:, 1],
c=y_synthetic, cmap='RdYlBu', alpha=0.7, s=30)
axes[1, 1].set_title('PCA Transformed Data')
axes[1, 1].set_xlabel(f'PC1 ({pca_synthetic.explained_variance_ratio_[0]:.2%})')
axes[1, 1].set_ylabel(f'PC2 ({pca_synthetic.explained_variance_ratio_[1]:.2%})')
# Feature variance comparison
original_var = np.var(X_synthetic_scaled, axis=0)
axes[1, 2].bar(['Useful Feature', 'Useless Feature'], original_var,
color=['green', 'red'], alpha=0.7)
axes[1, 2].set_title('Original Feature Variance')
axes[1, 2].set_ylabel('Variance')
# Add numeric labels
for i, var in enumerate(original_var):
axes[1, 2].text(i, var + 0.01, f'{var:.3f}', ha='center')
plt.tight_layout()
plt.show()
print(f" Spiral data PCA explained variance ratio: {pca_spiral.explained_variance_ratio_}")
print(f" Synthetic data PCA explained variance ratio: {pca_synthetic.explained_variance_ratio_}")
print(f" Note: PCA selected high variance but useless feature as first principal component")
# 3. Interpretability problem
print(f"\n3. Interpretability problem:")
print(f" Principal components are linear combinations of original features, difficult to interpret directly")
print(f" For example: PC1 = {pca_synthetic.components_[0][0]:.3f} * useful feature + {pca_synthetic.components_[0][1]:.3f} * useless feature")
return X_spiral_scaled, X_synthetic_scaled, pca_spiral, pca_synthetic
X_spiral_scaled, X_synthetic_scaled, pca_spiral, pca_synthetic = pca_limitations_demo()
12.7.2 PCA Best Practices
def pca_best_practices():
"""PCA best practices guide"""
print("PCA best practices guide:")
print("=" * 25)
practices = {
"Data Preprocessing": [
"Always standardize data (unless features are already on same scale)",
"Check and handle missing values",
"Consider impact of outliers",
"Ensure data quality"
],
"Principal Component Selection": [
"Use cumulative explained variance ratio (usually 85%-95%)",
"Apply Kaiser criterion (eigenvalues > 1)",
"Observe elbow of scree plot",
"Consider downstream task performance"
],
"Model Validation": [
"Use cross-validation to evaluate dimensionality reduction effect",
"Compare model performance before and after dimensionality reduction",
"Check stability of principal components",
"Analyze interpretability of principal components"
],
"Application Scenarios": [
"Data visualization (reduce to 2D/3D)",
"Noise removal and data compression",
"Feature preprocessing (reduce curse of dimensionality)",
"Exploratory data analysis"
],
"Considerations": [
"PCA assumes linear relationships",
"High variance does not equal high importance",
"Principal components are difficult to interpret",
"Sensitive to outliers"
]
}
for category, items in practices.items():
print(f"\n{category}:")
for i, item in enumerate(items, 1):
print(f" {i}. {item}")
# Practice example: complete PCA workflow
print(f"\nComplete PCA workflow example:")
print("-" * 30)
# Use breast cancer dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
print(f"1. Data overview: {X.shape}")
# 2. Data preprocessing
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f"2. Data standardization completed")
# 3. Preliminary PCA analysis
pca_full = PCA()
pca_full.fit(X_scaled)
# 4. Select number of principal components
cumsum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_components_95 = np.argmax(cumsum_var >= 0.95) + 1
print(f"3. Need {n_components_95} principal components to retain 95% variance")
# 5. Apply PCA
pca_final = PCA(n_components=n_components_95)
X_pca = pca_final.fit_transform(X_scaled)
print(f"4. Data shape after dimensionality reduction: {X_pca.shape}")
# 6. Validate effect
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, random_state=42
)
X_train_pca, X_test_pca, _, _ = train_test_split(
X_pca, y, test_size=0.2, random_state=42
)
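# Note: in this walkthrough the scaler and PCA were fit on the full dataset
# before splitting, so information from the test rows leaks into the transform.
# A cleaner pattern is to fit every step on the training data only, e.g. with a
# Pipeline (illustrative sketch; the variable names are hypothetical):
#   from sklearn.pipeline import Pipeline
#   pipe = Pipeline([('scale', StandardScaler()),
#                    ('pca', PCA(n_components=n_components_95)),
#                    ('clf', LogisticRegression(max_iter=1000))])
#   X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
#   pipe.fit(X_tr, y_tr); print(pipe.score(X_te, y_te))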
# Original data classification
clf_original = LogisticRegression(random_state=42, max_iter=1000)
clf_original.fit(X_train, y_train)
acc_original = clf_original.score(X_test, y_test)
# PCA data classification
clf_pca = LogisticRegression(random_state=42, max_iter=1000)
clf_pca.fit(X_train_pca, y_train)
acc_pca = clf_pca.score(X_test_pca, y_test)
print(f"5. Performance comparison:")
print(f" Original data accuracy: {acc_original:.4f}")
print(f" PCA data accuracy: {acc_pca:.4f}")
print(f" Dimensionality reduced: {X.shape[1]} → {X_pca.shape[1]} ({X_pca.shape[1]/X.shape[1]*100:.1f}%)")
# Visualize workflow results
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# Cumulative explained variance ratio
axes[0].plot(range(1, len(cumsum_var) + 1), cumsum_var, 'bo-', linewidth=2)
axes[0].axhline(y=0.95, color='red', linestyle='--', alpha=0.7, label='95% Threshold')
axes[0].axvline(x=n_components_95, color='red', linestyle=':', alpha=0.7)
axes[0].set_title('Cumulative Explained Variance Ratio')
axes[0].set_xlabel('Number of Principal Components')
axes[0].set_ylabel('Cumulative Explained Variance Ratio')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# 2D visualization of first two principal components
axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='RdYlBu', alpha=0.7)
axes[1].set_title('Visualization of First Two Principal Components')
axes[1].set_xlabel(f'PC1 ({pca_final.explained_variance_ratio_[0]:.2%})')
axes[1].set_ylabel(f'PC2 ({pca_final.explained_variance_ratio_[1]:.2%})')
# Performance comparison
methods = ['Original Data', 'PCA Data']
accuracies = [acc_original, acc_pca]
colors = ['blue', 'orange']
bars = axes[2].bar(methods, accuracies, color=colors, alpha=0.7)
axes[2].set_title('Classification Performance Comparison')
axes[2].set_ylabel('Accuracy')
axes[2].set_ylim(0.9, 1.0)
# Add numeric labels
for bar, acc in zip(bars, accuracies):
height = bar.get_height()
axes[2].text(bar.get_x() + bar.get_width()/2., height + 0.005,
f'{acc:.4f}', ha='center', va='bottom')
plt.tight_layout()
plt.show()
return X_scaled, X_pca, pca_final
X_scaled, X_pca, pca_final = pca_best_practices()
12.8 Exercises
Exercise 1: Basic PCA
- Perform PCA analysis using Iris dataset
- Determine number of principal components needed to retain 90% and 95% variance
- Visualize first two principal components and analyze their meaning
Exercise 2: PCA and Classification
- Use wine dataset, compare impact of different number of principal components on classification performance
- Plot accuracy vs number of principal components curve
- Find optimal number of principal components
Exercise 3: PCA Variant Comparison
- Create a large dataset (samples > 5000)
- Compare performance and results of standard PCA and incremental PCA
- Analyze pros and cons of both methods
Exercise 4: Nonlinear Data
- Create a nonlinear dataset (e.g., spiral or S-shape)
- Compare effect of linear PCA and kernel PCA
- Try different kernel functions
Exercise 5: Real Application
- Use handwritten digit dataset (load_digits)
- Apply PCA for dimensionality reduction and visualization
- Analyze impact of dimensionality reduction on digit recognition performance
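A possible starting point for Exercise 5 (a minimal sketch, assuming the imports from section 12.2 have been run; the variable names are only illustrative):
from sklearn.datasets import load_digits
digits = load_digits()
X_digits, y_digits = digits.data, digits.target  # 1797 samples, 64 pixel features
X_digits_scaled = StandardScaler().fit_transform(X_digits)
pca_digits = PCA(n_components=2)
X_digits_2d = pca_digits.fit_transform(X_digits_scaled)
plt.scatter(X_digits_2d[:, 0], X_digits_2d[:, 1], c=y_digits, cmap='tab10', alpha=0.6, s=10)
plt.colorbar(label='Digit')
plt.xlabel(f'PC1 ({pca_digits.explained_variance_ratio_[0]:.2%})')
plt.ylabel(f'PC2 ({pca_digits.explained_variance_ratio_[1]:.2%})')
plt.show()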
12.9 Summary
In this chapter, we have learned various aspects of principal component analysis in depth:
Core Concepts
- Dimensionality Reduction Technology: Reduce feature dimensions while preserving main information
- Variance Maximization: Find directions of maximum variance in data
- Linear Transformation: Achieve dimensionality reduction through orthogonal transformation
Main Techniques
- Standard PCA: Classic principal component analysis method
- Incremental PCA: Batch processing suitable for large datasets
- Kernel PCA: Handle nonlinear data using kernel trick
- Principal Component Selection: Multiple methods to determine optimal number of principal components
Practical Skills
- Data Preprocessing: Importance of standardization
- Visualization Techniques: 2D/3D projection of high-dimensional data
- Performance Evaluation: Impact of dimensionality reduction on downstream tasks
- Parameter Tuning: Strategies for selecting number of principal components
Key Points
- PCA is an unsupervised linear dimensionality reduction technique
- Data standardization is critical for PCA results
- Number of principal components needs to balance information retention and dimensionality reduction
- PCA assumes linear relationships, has limited effectiveness on nonlinear data
Application Scenarios
PCA is suitable when:
- Data dimensions are very high and need dimensionality reduction
- Linear correlations exist between features
- Need data visualization
- Want to perform noise removal
- Limited storage space or computational resources
PCA is less suitable when:
- Data dimensions themselves are low
- Weak correlations between features
- Need to maintain feature interpretability
- Data has strong nonlinear relationships
- All features are very important
Best Practices Summary
Data Preprocessing
- Always standardize data
- Handle missing values and outliers
- Check data quality
Principal Component Selection
- Use multiple methods to determine number of principal components
- Consider downstream task performance
- Balance information retention and computational efficiency
Result Validation
- Analyze interpretability of principal components
- Validate impact of dimensionality reduction on task performance
- Check stability of results
Practical Application
- Choose appropriate PCA variant based on specific problem
- Combine domain knowledge to interpret principal components
- Consider limitations of PCA
12.10 Next Steps
Now you have mastered principal component analysis, an essential dimensionality reduction technique! In the next chapter, Anomaly Detection, we will learn how to identify anomalies and outliers in data, which has important applications in data quality control and fraud detection.
Chapter Points Review:
- ✅ Understood mathematical principles and geometric intuition of PCA
- ✅ Mastered implementation and parameter tuning of PCA
- ✅ Learned methods for determining optimal number of principal components
- ✅ Understood variants and applicable scenarios of PCA
- ✅ Mastered application of PCA in data visualization
- ✅ Recognized limitations and best practices of PCA
- ✅ Able to reasonably use PCA technology in real-world projects