
Chapter 12: Principal Component Analysis

Principal Component Analysis (PCA) is one of the most important dimensionality reduction techniques. It projects high-dimensional data into a low-dimensional space through a linear transformation, reducing the number of features while preserving the main information in the data. This chapter introduces the principles, implementation, and applications of PCA in detail.

12.1 What is Principal Component Analysis?

PCA is an unsupervised dimensionality reduction technique that finds the directions of maximum variance in the data (principal components) and projects the data onto these directions, thereby achieving dimensionality reduction.

12.1.1 Goals of PCA

  • Dimensionality Reduction: Reduce the number of features and simplify data structure
  • Decorrelation: Eliminate linear correlations between features
  • Information Preservation: Preserve as much information from the original data as possible
  • Visualization: Project high-dimensional data into 2D or 3D space

12.1.2 Applications of PCA

  • Data Compression: Reduce storage space and computational complexity
  • Noise Removal: Remove noise by keeping main components
  • Feature Extraction: Extract the most important feature combinations
  • Data Visualization: Visual display of high-dimensional data
  • Preprocessing: Prepare data for other machine learning algorithms

12.1.3 Mathematical Principles of PCA

PCA achieves dimensionality reduction through the following steps (a minimal NumPy sketch follows this list):

  1. Data Standardization: Make all features have the same scale
  2. Covariance Matrix Calculation: Measure correlations between features
  3. Eigenvalue Decomposition: Find eigenvectors and eigenvalues of the covariance matrix
  4. Principal Component Selection: Select main directions based on eigenvalue size
  5. Data Projection: Project original data onto principal component space
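
The listing below is a minimal, self-contained NumPy sketch of these five steps on a toy random matrix. It is illustrative only; the variable names and the choice of keeping two components are arbitrary and not tied to any dataset used later in the chapter.

python
import numpy as np

# Step-by-step PCA on a toy matrix (n_samples x n_features)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# 1. Standardize each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvalue decomposition of the (symmetric) covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort components by decreasing eigenvalue and keep the top k
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
k = 2
W = eigenvectors[:, :k]

# 5. Project the original (standardized) data onto the principal component space
X_projected = X_std @ W
print(X_projected.shape)                 # (100, 2)
print(eigenvalues / eigenvalues.sum())   # explained variance ratio of each component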

12.2 Environment and Data Preparation

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, load_wine, load_breast_cancer, make_classification
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from mpl_toolkits.mplot3d import Axes3D
import time
import warnings
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)

# Set figure style
plt.style.use('seaborn-v0_8')
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

12.3 PCA Basic Principles Demonstration

12.3.1 PCA of Two-dimensional Data

python
def demonstrate_pca_2d():
    """Demonstrate PCA process for two-dimensional data"""

    # Create correlated two-dimensional data
    np.random.seed(42)
    n_samples = 200

    # Generate correlated data
    x1 = np.random.normal(0, 1, n_samples)
    x2 = 0.8 * x1 + 0.6 * np.random.normal(0, 1, n_samples)

    X_2d = np.column_stack([x1, x2])

    print("PCA basic principles demonstration:")
    print("1. Original data has correlations")
    print("2. PCA finds direction of maximum variance as first principal component")
    print("3. Second principal component is orthogonal to first principal component")
    print("4. Data projected onto principal component space achieves dimensionality reduction")

    # Standardize data
    scaler = StandardScaler()
    X_2d_scaled = scaler.fit_transform(X_2d)

    # Apply PCA
    pca_2d = PCA(n_components=2)
    X_2d_pca = pca_2d.fit_transform(X_2d_scaled)

    # Get principal components
    components = pca_2d.components_
    explained_variance_ratio = pca_2d.explained_variance_ratio_

    print(f"\nPrincipal component analysis results:")
    print(f"First principal component explained variance ratio: {explained_variance_ratio[0]:.3f}")
    print(f"Second principal component explained variance ratio: {explained_variance_ratio[1]:.3f}")
    print(f"Cumulative explained variance ratio: {np.cumsum(explained_variance_ratio)}")

    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('PCA Principle Demonstration', fontsize=16)

    # Original data
    axes[0, 0].scatter(X_2d_scaled[:, 0], X_2d_scaled[:, 1], alpha=0.6, color='blue')
    axes[0, 0].set_title('Standardized Original Data')
    axes[0, 0].set_xlabel('Feature 1 (Standardized)')
    axes[0, 0].set_ylabel('Feature 2 (Standardized)')
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].axis('equal')

    # Display principal component directions
    axes[0, 1].scatter(X_2d_scaled[:, 0], X_2d_scaled[:, 1], alpha=0.6, color='blue')

    # Draw principal component vectors
    mean_point = np.mean(X_2d_scaled, axis=0)
    for i, (component, variance_ratio) in enumerate(zip(components, explained_variance_ratio)):
        # Scale vector for visualization
        vector = component * np.sqrt(pca_2d.explained_variance_[i]) * 2
        axes[0, 1].arrow(mean_point[0], mean_point[1], vector[0], vector[1],
                        head_width=0.1, head_length=0.1, fc=f'C{i}', ec=f'C{i}',
                        linewidth=3, label=f'PC{i+1} ({variance_ratio:.2f})')

    axes[0, 1].set_title('Principal Component Directions')
    axes[0, 1].set_xlabel('Feature 1 (Standardized)')
    axes[0, 1].set_ylabel('Feature 2 (Standardized)')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    axes[0, 1].axis('equal')

    # PCA transformed data
    axes[1, 0].scatter(X_2d_pca[:, 0], X_2d_pca[:, 1], alpha=0.6, color='red')
    axes[1, 0].set_title('PCA Transformed Data')
    axes[1, 0].set_xlabel('First Principal Component')
    axes[1, 0].set_ylabel('Second Principal Component')
    axes[1, 0].grid(True, alpha=0.3)
    axes[1, 0].axis('equal')

    # Reconstruction keeping only first principal component
    X_1d_pca = pca_2d.transform(X_2d_scaled)
    X_1d_pca[:, 1] = 0  # Set second principal component to 0
    X_reconstructed = pca_2d.inverse_transform(X_1d_pca)

    axes[1, 1].scatter(X_2d_scaled[:, 0], X_2d_scaled[:, 1], alpha=0.3, color='blue', label='Original Data')
    axes[1, 1].scatter(X_reconstructed[:, 0], X_reconstructed[:, 1], alpha=0.6, color='red', label='Reconstructed Data')
    axes[1, 1].set_title('Dimensionality Reduction Reconstruction (Keeping Only First Principal Component)')
    axes[1, 1].set_xlabel('Feature 1 (Standardized)')
    axes[1, 1].set_ylabel('Feature 2 (Standardized)')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    axes[1, 1].axis('equal')

    plt.tight_layout()
    plt.show()

    return X_2d_scaled, X_2d_pca, pca_2d

X_2d_scaled, X_2d_pca, pca_2d = demonstrate_pca_2d()
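
As a quick follow-up (not part of the function above), one way to quantify how much information the one-component reconstruction loses is the mean squared reconstruction error. This short sketch reuses the objects returned by `demonstrate_pca_2d()`:

python
# Reconstruct from the first principal component only and measure the error
X_1d = X_2d_pca.copy()
X_1d[:, 1] = 0                        # drop the second component
X_rec = pca_2d.inverse_transform(X_1d)

mse = np.mean((X_2d_scaled - X_rec) ** 2)
print(f"Mean squared reconstruction error (1 component): {mse:.4f}")
print(f"Discarded variance ratio: {pca_2d.explained_variance_ratio_[1]:.4f}")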

12.3.2 Covariance Matrix and Eigenvalue Decomposition

python
def analyze_covariance_and_eigenvalues():
    """Analyze covariance matrix and eigenvalue decomposition"""

    print("Covariance matrix and eigenvalue decomposition analysis:")
    print("=" * 40)

    # Calculate covariance matrix
    cov_matrix = np.cov(X_2d_scaled.T)
    print("Covariance matrix:")
    print(cov_matrix)

    # Manual eigenvalue decomposition
    eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

    # Sort by eigenvalue size
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    print(f"\nEigenvalues: {eigenvalues}")
    print(f"Eigenvectors:\n{eigenvectors}")

    # Calculate explained variance ratio
    explained_variance_ratio_manual = eigenvalues / np.sum(eigenvalues)
    print(f"\nManually calculated explained variance ratio: {explained_variance_ratio_manual}")
    print(f"PCA calculated explained variance ratio: {pca_2d.explained_variance_ratio_}")

    # Visualize eigenvalues and eigenvectors
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))

    # Covariance matrix heatmap
    sns.heatmap(cov_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[0])
    axes[0].set_title('Covariance Matrix')

    # Eigenvalues
    axes[1].bar(range(1, len(eigenvalues) + 1), eigenvalues, color='skyblue', alpha=0.7)
    axes[1].set_title('Eigenvalues')
    axes[1].set_xlabel('Principal Component')
    axes[1].set_ylabel('Eigenvalue')
    axes[1].grid(True, alpha=0.3)

    # Explained variance ratio
    axes[2].bar(range(1, len(explained_variance_ratio_manual) + 1),
               explained_variance_ratio_manual, color='lightgreen', alpha=0.7)
    axes[2].set_title('Explained Variance Ratio')
    axes[2].set_xlabel('Principal Component')
    axes[2].set_ylabel('Explained Variance Ratio')
    axes[2].grid(True, alpha=0.3)

    # Add numeric labels
    for i, (val, ratio) in enumerate(zip(eigenvalues, explained_variance_ratio_manual)):
        axes[1].text(i, val + 0.01, f'{val:.3f}', ha='center')
        axes[2].text(i, ratio + 0.01, f'{ratio:.3f}', ha='center')

    plt.tight_layout()
    plt.show()

    return cov_matrix, eigenvalues, eigenvectors

cov_matrix, eigenvalues, eigenvectors = analyze_covariance_and_eigenvalues()
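
The eigen-decomposition above can also be obtained from the singular value decomposition of the centered data matrix, which is what scikit-learn's PCA uses internally. The sketch below verifies the equivalence on the same 2D data, using the relation that each eigenvalue equals the squared singular value divided by (n - 1):

python
# PCA via SVD of the centered data matrix: X = U * S * Vt
X_centered = X_2d_scaled - X_2d_scaled.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions; eigenvalues follow from the singular values
eigenvalues_svd = S**2 / (X_centered.shape[0] - 1)
print("Eigenvalues from SVD:      ", eigenvalues_svd)
print("Eigenvalues from np.linalg:", eigenvalues)
print("Explained variance ratio:  ", eigenvalues_svd / eigenvalues_svd.sum())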

12.4 PCA for High-dimensional Data

12.4.1 Iris Dataset PCA

python
def pca_iris_analysis():
    """PCA analysis of Iris dataset"""

    # Load Iris dataset
    iris = load_iris()
    X_iris = iris.data
    y_iris = iris.target
    feature_names = iris.feature_names
    target_names = iris.target_names

    print("Iris dataset PCA analysis:")
    print(f"Original data shape: {X_iris.shape}")
    print(f"Feature names: {feature_names}")

    # Standardize data
    scaler = StandardScaler()
    X_iris_scaled = scaler.fit_transform(X_iris)

    # Apply PCA
    pca_iris = PCA()
    X_iris_pca = pca_iris.fit_transform(X_iris_scaled)

    # Analyze results
    explained_variance_ratio = pca_iris.explained_variance_ratio_
    cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

    print(f"\nExplained variance ratio for each principal component:")
    for i, ratio in enumerate(explained_variance_ratio):
        print(f"PC{i+1}: {ratio:.4f}")

    print(f"\nCumulative explained variance ratio:")
    for i, cum_ratio in enumerate(cumulative_variance_ratio):
        print(f"First {i+1} principal components: {cum_ratio:.4f}")

    # Visualization
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    fig.suptitle('Iris Dataset PCA Analysis', fontsize=16)

    # Explained variance ratio
    axes[0, 0].bar(range(1, len(explained_variance_ratio) + 1),
                   explained_variance_ratio, color='skyblue', alpha=0.7)
    axes[0, 0].set_title('Explained Variance Ratio for Each Principal Component')
    axes[0, 0].set_xlabel('Principal Component')
    axes[0, 0].set_ylabel('Explained Variance Ratio')
    axes[0, 0].grid(True, alpha=0.3)

    # Cumulative explained variance ratio
    axes[0, 1].plot(range(1, len(cumulative_variance_ratio) + 1),
                    cumulative_variance_ratio, 'ro-', linewidth=2, markersize=8)
    axes[0, 1].axhline(y=0.95, color='red', linestyle='--', alpha=0.7, label='95% Threshold')
    axes[0, 1].set_title('Cumulative Explained Variance Ratio')
    axes[0, 1].set_xlabel('Number of Principal Components')
    axes[0, 1].set_ylabel('Cumulative Explained Variance Ratio')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # Principal component loading plot
    components_df = pd.DataFrame(
        pca_iris.components_[:2].T,
        columns=['PC1', 'PC2'],
        index=feature_names
    )

    axes[0, 2].scatter(components_df['PC1'], components_df['PC2'], s=100)
    for i, feature in enumerate(feature_names):
        axes[0, 2].annotate(feature, (components_df.iloc[i, 0], components_df.iloc[i, 1]),
                           xytext=(5, 5), textcoords='offset points')
    axes[0, 2].set_title('Principal Component Loading Plot')
    axes[0, 2].set_xlabel('PC1')
    axes[0, 2].set_ylabel('PC2')
    axes[0, 2].grid(True, alpha=0.3)

    # 2D projection
    colors = ['red', 'blue', 'green']
    for i, target_name in enumerate(target_names):
        mask = y_iris == i
        axes[1, 0].scatter(X_iris_pca[mask, 0], X_iris_pca[mask, 1],
                          c=colors[i], alpha=0.7, label=target_name)
    axes[1, 0].set_title('2D Projection of First Two Principal Components')
    axes[1, 0].set_xlabel(f'PC1 ({explained_variance_ratio[0]:.2%})')
    axes[1, 0].set_ylabel(f'PC2 ({explained_variance_ratio[1]:.2%})')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # 3D projection (remove the unused 2D axes at this grid position first)
    axes[1, 1].remove()
    ax_3d = fig.add_subplot(2, 3, 5, projection='3d')
    for i, target_name in enumerate(target_names):
        mask = y_iris == i
        ax_3d.scatter(X_iris_pca[mask, 0], X_iris_pca[mask, 1], X_iris_pca[mask, 2],
                     c=colors[i], alpha=0.7, label=target_name)
    ax_3d.set_title('3D Projection of First Three Principal Components')
    ax_3d.set_xlabel(f'PC1 ({explained_variance_ratio[0]:.2%})')
    ax_3d.set_ylabel(f'PC2 ({explained_variance_ratio[1]:.2%})')
    ax_3d.set_zlabel(f'PC3 ({explained_variance_ratio[2]:.2%})')
    ax_3d.legend()

    # Original feature vs principal component
    feature_comparison = pd.DataFrame({
        'Sepal Length': X_iris_scaled[:, 0],
        'PC1': X_iris_pca[:, 0],
        'Class': y_iris
    })

    for i, target_name in enumerate(target_names):
        mask = y_iris == i
        axes[1, 2].scatter(feature_comparison.loc[mask, 'Sepal Length'],
                          feature_comparison.loc[mask, 'PC1'],
                          c=colors[i], alpha=0.7, label=target_name)
    axes[1, 2].set_title('Original Feature vs First Principal Component')
    axes[1, 2].set_xlabel('Sepal Length (Standardized)')
    axes[1, 2].set_ylabel('PC1')
    axes[1, 2].legend()
    axes[1, 2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    return X_iris_scaled, X_iris_pca, pca_iris

X_iris_scaled, X_iris_pca, pca_iris = pca_iris_analysis()

12.4.2 Determining the Number of Principal Components

python
def determine_n_components():
    """Methods for determining optimal number of principal components"""

    # Use wine dataset
    wine = load_wine()
    X_wine = wine.data
    y_wine = wine.target

    # Standardize
    scaler = StandardScaler()
    X_wine_scaled = scaler.fit_transform(X_wine)

    # Calculate all principal components
    pca_full = PCA()
    pca_full.fit(X_wine_scaled)

    explained_variance_ratio = pca_full.explained_variance_ratio_
    cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

    print("Methods for determining number of principal components:")
    print("=" * 30)

    # Method 1: Cumulative explained variance ratio threshold
    threshold_95 = np.argmax(cumulative_variance_ratio >= 0.95) + 1
    threshold_90 = np.argmax(cumulative_variance_ratio >= 0.90) + 1

    print(f"Method 1 - Cumulative explained variance ratio:")
    print(f"  Need {threshold_90} principal components to retain 90% variance")
    print(f"  Need {threshold_95} principal components to retain 95% variance")

    # Method 2: Kaiser criterion (eigenvalues > 1)
    eigenvalues = pca_full.explained_variance_
    kaiser_components = np.sum(eigenvalues > 1)
    print(f"\nMethod 2 - Kaiser criterion (eigenvalues > 1): {kaiser_components} principal components")

    # Method 3: Scree plot analysis
    print(f"\nMethod 3 - Scree plot: Observe the 'elbow' of eigenvalues")

    # Visualize different methods
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('Methods for Determining Number of Principal Components', fontsize=16)

    # Explained variance ratio
    axes[0, 0].bar(range(1, len(explained_variance_ratio) + 1),
                   explained_variance_ratio, alpha=0.7, color='skyblue')
    axes[0, 0].set_title('Explained Variance Ratio for Each Principal Component')
    axes[0, 0].set_xlabel('Principal Component')
    axes[0, 0].set_ylabel('Explained Variance Ratio')
    axes[0, 0].grid(True, alpha=0.3)

    # Cumulative explained variance ratio
    axes[0, 1].plot(range(1, len(cumulative_variance_ratio) + 1),
                    cumulative_variance_ratio, 'bo-', linewidth=2, markersize=6)
    axes[0, 1].axhline(y=0.90, color='red', linestyle='--', alpha=0.7, label='90%')
    axes[0, 1].axhline(y=0.95, color='orange', linestyle='--', alpha=0.7, label='95%')
    axes[0, 1].axvline(x=threshold_90, color='red', linestyle=':', alpha=0.7)
    axes[0, 1].axvline(x=threshold_95, color='orange', linestyle=':', alpha=0.7)
    axes[0, 1].set_title('Cumulative Explained Variance Ratio')
    axes[0, 1].set_xlabel('Number of Principal Components')
    axes[0, 1].set_ylabel('Cumulative Explained Variance Ratio')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # Eigenvalues (Kaiser criterion)
    axes[1, 0].bar(range(1, len(eigenvalues) + 1), eigenvalues, alpha=0.7, color='lightgreen')
    axes[1, 0].axhline(y=1, color='red', linestyle='--', alpha=0.7, label='Kaiser Threshold')
    axes[1, 0].set_title('Eigenvalues (Kaiser Criterion)')
    axes[1, 0].set_xlabel('Principal Component')
    axes[1, 0].set_ylabel('Eigenvalue')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # Scree plot
    axes[1, 1].plot(range(1, len(eigenvalues) + 1), eigenvalues, 'ro-', linewidth=2, markersize=8)
    axes[1, 1].set_title('Scree Plot')
    axes[1, 1].set_xlabel('Principal Component')
    axes[1, 1].set_ylabel('Eigenvalue')
    axes[1, 1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Compare classification performance with different number of principal components
    print(f"\nImpact of different number of principal components on classification performance:")
    print("PC Count\tCumulative Var Ratio\tClassification Accuracy")
    print("-" * 60)

    X_train, X_test, y_train, y_test = train_test_split(
        X_wine_scaled, y_wine, test_size=0.2, random_state=42
    )

    n_components_list = [2, 3, 5, 8, 10, 13]  # 13 is total number of features

    for n_comp in n_components_list:
        # PCA dimensionality reduction
        pca_temp = PCA(n_components=n_comp)
        X_train_pca = pca_temp.fit_transform(X_train)
        X_test_pca = pca_temp.transform(X_test)

        # Classification
        clf = LogisticRegression(random_state=42, max_iter=1000)
        clf.fit(X_train_pca, y_train)
        y_pred = clf.predict(X_test_pca)

        accuracy = accuracy_score(y_test, y_pred)
        cum_var_ratio = np.sum(pca_temp.explained_variance_ratio_)

        print(f"{n_comp}\t\t{cum_var_ratio:.4f}\t\t{accuracy:.4f}")

    return threshold_90, threshold_95, kaiser_components

threshold_90, threshold_95, kaiser_components = determine_n_components()
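
Besides the methods above, scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`; it then keeps the smallest number of components whose cumulative explained variance reaches that fraction. A short sketch on the standardized wine data (re-created here so the snippet is self-contained):

python
# Let PCA choose the number of components that retains 95% of the variance
wine = load_wine()
X_wine_scaled = StandardScaler().fit_transform(wine.data)

pca_auto = PCA(n_components=0.95)
X_wine_reduced = pca_auto.fit_transform(X_wine_scaled)

print(f"Components kept automatically: {pca_auto.n_components_}")
print(f"Cumulative explained variance: {pca_auto.explained_variance_ratio_.sum():.4f}")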

12.5 PCA Variants

12.5.1 Incremental PCA

python
def incremental_pca_demo():
    """Demonstrate Incremental PCA"""

    print("Incremental PCA:")
    print("Suitable for large datasets, can process data in batches")

    # Create large dataset
    X_large, y_large = make_classification(
        n_samples=5000, n_features=50, n_informative=30,
        n_redundant=10, random_state=42
    )

    # Standardize
    scaler = StandardScaler()
    X_large_scaled = scaler.fit_transform(X_large)

    # Compare standard PCA and incremental PCA
    import time

    # Standard PCA
    start_time = time.time()
    pca_standard = PCA(n_components=10)
    X_pca_standard = pca_standard.fit_transform(X_large_scaled)
    time_standard = time.time() - start_time

    # Incremental PCA
    start_time = time.time()
    pca_incremental = IncrementalPCA(n_components=10, batch_size=500)
    X_pca_incremental = pca_incremental.fit_transform(X_large_scaled)
    time_incremental = time.time() - start_time

    print(f"\nPerformance comparison:")
    print(f"Standard PCA time: {time_standard:.4f} seconds")
    print(f"Incremental PCA time: {time_incremental:.4f} seconds")

    # Compare result similarity
    # Since signs may differ, compare absolute values
    correlation = np.corrcoef(np.abs(X_pca_standard[:, 0]), np.abs(X_pca_incremental[:, 0]))[0, 1]
    print(f"First principal component correlation: {correlation:.4f}")

    # Visualize comparison
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))

    # Standard PCA results
    axes[0].scatter(X_pca_standard[:, 0], X_pca_standard[:, 1],
                   c=y_large, cmap='viridis', alpha=0.6, s=10)
    axes[0].set_title('Standard PCA')
    axes[0].set_xlabel('PC1')
    axes[0].set_ylabel('PC2')

    # Incremental PCA results
    axes[1].scatter(X_pca_incremental[:, 0], X_pca_incremental[:, 1],
                   c=y_large, cmap='viridis', alpha=0.6, s=10)
    axes[1].set_title('Incremental PCA')
    axes[1].set_xlabel('PC1')
    axes[1].set_ylabel('PC2')

    # Explained variance ratio comparison
    x_pos = np.arange(10)
    width = 0.35

    axes[2].bar(x_pos - width/2, pca_standard.explained_variance_ratio_,
               width, label='Standard PCA', alpha=0.7)
    axes[2].bar(x_pos + width/2, pca_incremental.explained_variance_ratio_,
               width, label='Incremental PCA', alpha=0.7)
    axes[2].set_title('Explained Variance Ratio Comparison')
    axes[2].set_xlabel('Principal Component')
    axes[2].set_ylabel('Explained Variance Ratio')
    axes[2].legend()
    axes[2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    return pca_standard, pca_incremental

pca_standard, pca_incremental = incremental_pca_demo()
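
When the data do not fit in memory at all, `IncrementalPCA` can also be fed batch by batch via `partial_fit` instead of a single `fit_transform`. Below is a minimal sketch; the in-memory batch loop stands in for reading chunks from disk or a generator, and the sizes are arbitrary illustrative choices:

python
# Streaming-style Incremental PCA: feed the data in batches via partial_fit
X_stream, _ = make_classification(n_samples=5000, n_features=50,
                                  n_informative=30, random_state=42)
X_stream = StandardScaler().fit_transform(X_stream)

ipca_stream = IncrementalPCA(n_components=10)
batch_size = 500
for start in range(0, X_stream.shape[0], batch_size):
    ipca_stream.partial_fit(X_stream[start:start + batch_size])

# Transformation can then also be applied batch by batch
X_stream_reduced = ipca_stream.transform(X_stream[:batch_size])
print(X_stream_reduced.shape)  # (500, 10)
print(f"Cumulative variance kept: {ipca_stream.explained_variance_ratio_.sum():.4f}")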

12.5.2 Kernel PCA

python
def kernel_pca_demo():
    """Demonstrate Kernel PCA"""

    print("Kernel PCA (Kernel PCA):")
    print("Use kernel trick to handle nonlinear data")

    # Create nonlinear data
    from sklearn.datasets import make_circles, make_moons

    # Concentric circles data
    X_circles, y_circles = make_circles(n_samples=400, noise=0.1, factor=0.3, random_state=42)

    # Crescent moon data
    X_moons, y_moons = make_moons(n_samples=400, noise=0.1, random_state=42)

    datasets = [
        ('Concentric Circles', X_circles, y_circles),
        ('Crescent Moon', X_moons, y_moons)
    ]

    # Different kernel functions
    kernels = ['linear', 'poly', 'rbf', 'sigmoid']

    for dataset_name, X, y in datasets:
        print(f"\n{dataset_name} dataset:")

        fig, axes = plt.subplots(1, len(kernels) + 1, figsize=(20, 4))
        fig.suptitle(f'Kernel PCA of {dataset_name} Data', fontsize=16)

        # Original data
        axes[0].scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', alpha=0.7)
        axes[0].set_title('Original Data')
        axes[0].set_xlabel('Feature 1')
        axes[0].set_ylabel('Feature 2')

        for i, kernel in enumerate(kernels):
            try:
                # Kernel PCA
                if kernel == 'poly':
                    kpca = KernelPCA(n_components=2, kernel=kernel, degree=3, random_state=42)
                elif kernel == 'rbf':
                    kpca = KernelPCA(n_components=2, kernel=kernel, gamma=1, random_state=42)
                else:
                    kpca = KernelPCA(n_components=2, kernel=kernel, random_state=42)

                X_kpca = kpca.fit_transform(X)

                # Visualization
                axes[i + 1].scatter(X_kpca[:, 0], X_kpca[:, 1], c=y, cmap='viridis', alpha=0.7)
                axes[i + 1].set_title(f'{kernel.upper()} Kernel')
                axes[i + 1].set_xlabel('First Kernel Principal Component')
                axes[i + 1].set_ylabel('Second Kernel Principal Component')

                print(f"  {kernel} kernel: success")

            except Exception as e:
                axes[i + 1].text(0.5, 0.5, f'Error:\n{str(e)[:30]}...',
                                ha='center', va='center', transform=axes[i + 1].transAxes)
                axes[i + 1].set_title(f'{kernel.upper()} Kernel (Failed)')
                print(f"  {kernel} kernel: failed - {str(e)[:50]}")

        plt.tight_layout()
        plt.show()

kernel_pca_demo()
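
Unlike standard PCA, kernel PCA has no exact inverse mapping, but scikit-learn can learn an approximate pre-image when `fit_inverse_transform=True`. The sketch below shows the idea on the concentric-circles data; the `gamma` and `alpha` values are just illustrative choices:

python
from sklearn.datasets import make_circles

# Approximate reconstruction from kernel PCA via a learned inverse mapping
X_circles, y_circles = make_circles(n_samples=400, noise=0.1, factor=0.3, random_state=42)

kpca_rbf = KernelPCA(n_components=2, kernel='rbf', gamma=10,
                     fit_inverse_transform=True, alpha=0.1)
X_kpca = kpca_rbf.fit_transform(X_circles)
X_back = kpca_rbf.inverse_transform(X_kpca)

recon_error = np.mean((X_circles - X_back) ** 2)
print(f"Mean squared pre-image reconstruction error: {recon_error:.4f}")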

12.6 Applications of PCA in Machine Learning

12.6.1 Impact of Dimensionality Reduction on Classification Performance

python
def pca_classification_performance():
    """Analyze the impact of PCA dimensionality reduction on classification performance"""

    # Use breast cancer dataset
    cancer = load_breast_cancer()
    X_cancer = cancer.data
    y_cancer = cancer.target

    print("Analysis of impact of PCA dimensionality reduction on classification performance:")
    print(f"Original data shape: {X_cancer.shape}")

    # Standardize
    scaler = StandardScaler()
    X_cancer_scaled = scaler.fit_transform(X_cancer)

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X_cancer_scaled, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
    )

    # Test different numbers of principal components
    n_components_list = [2, 5, 10, 15, 20, 25, 30]  # Original 30 features

    results = {
        'n_components': [],
        'variance_ratio': [],
        'train_accuracy': [],
        'test_accuracy': [],
        'train_time': []
    }

    print("\nPC Count\tCumulative Var Ratio\tTrain Accuracy\tTest Accuracy\tTrain Time")
    print("-" * 70)

    for n_comp in n_components_list:
        # PCA dimensionality reduction
        pca = PCA(n_components=n_comp)
        X_train_pca = pca.fit_transform(X_train)
        X_test_pca = pca.transform(X_test)

        # Train classifier
        start_time = time.time()
        clf = LogisticRegression(random_state=42, max_iter=1000)
        clf.fit(X_train_pca, y_train)
        train_time = time.time() - start_time

        # Predict
        y_train_pred = clf.predict(X_train_pca)
        y_test_pred = clf.predict(X_test_pca)

        # Calculate accuracy
        train_acc = accuracy_score(y_train, y_train_pred)
        test_acc = accuracy_score(y_test, y_test_pred)

        # Cumulative variance ratio
        cum_var_ratio = np.sum(pca.explained_variance_ratio_)

        # Save results
        results['n_components'].append(n_comp)
        results['variance_ratio'].append(cum_var_ratio)
        results['train_accuracy'].append(train_acc)
        results['test_accuracy'].append(test_acc)
        results['train_time'].append(train_time)

        print(f"{n_comp}\t\t{cum_var_ratio:.4f}\t\t{train_acc:.4f}\t\t{test_acc:.4f}\t\t{train_time:.4f}s")

    # Add results for original data
    start_time = time.time()
    clf_original = LogisticRegression(random_state=42, max_iter=1000)
    clf_original.fit(X_train, y_train)
    train_time_original = time.time() - start_time

    y_train_pred_original = clf_original.predict(X_train)
    y_test_pred_original = clf_original.predict(X_test)

    train_acc_original = accuracy_score(y_train, y_train_pred_original)
    test_acc_original = accuracy_score(y_test, y_test_pred_original)

    print(f"Original data\t1.0000\t\t{train_acc_original:.4f}\t\t{test_acc_original:.4f}\t\t{train_time_original:.4f}s")

    # Visualize results
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('Impact of PCA Dimensionality Reduction on Classification Performance', fontsize=16)

    # Accuracy vs number of principal components
    axes[0, 0].plot(results['n_components'], results['train_accuracy'],
                   'bo-', label='Training Accuracy', linewidth=2, markersize=6)
    axes[0, 0].plot(results['n_components'], results['test_accuracy'],
                   'ro-', label='Testing Accuracy', linewidth=2, markersize=6)
    axes[0, 0].axhline(y=test_acc_original, color='green', linestyle='--',
                       alpha=0.7, label='Original Data Test Accuracy')
    axes[0, 0].set_title('Accuracy vs Number of Principal Components')
    axes[0, 0].set_xlabel('Number of Principal Components')
    axes[0, 0].set_ylabel('Accuracy')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

    # Cumulative variance ratio vs number of principal components
    axes[0, 1].plot(results['n_components'], results['variance_ratio'],
                   'go-', linewidth=2, markersize=6)
    axes[0, 1].axhline(y=0.95, color='red', linestyle='--', alpha=0.7, label='95% Threshold')
    axes[0, 1].set_title('Cumulative Variance Ratio vs Number of Principal Components')
    axes[0, 1].set_xlabel('Number of Principal Components')
    axes[0, 1].set_ylabel('Cumulative Variance Ratio')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # Training time vs number of principal components
    axes[1, 0].plot(results['n_components'], results['train_time'],
                   'mo-', linewidth=2, markersize=6)
    axes[1, 0].axhline(y=train_time_original, color='orange', linestyle='--',
                       alpha=0.7, label='Original Data Training Time')
    axes[1, 0].set_title('Training Time vs Number of Principal Components')
    axes[1, 0].set_xlabel('Number of Principal Components')
    axes[1, 0].set_ylabel('Training Time (seconds)')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # Accuracy vs variance ratio
    scatter = axes[1, 1].scatter(results['variance_ratio'], results['test_accuracy'],
                                 c=results['n_components'], cmap='viridis', s=100, alpha=0.7)
    axes[1, 1].scatter(1.0, test_acc_original, c='red', s=150, marker='*',
                      label='Original Data')

    # Add colorbar
    plt.colorbar(scatter, ax=axes[1, 1], label='Number of Principal Components')

    axes[1, 1].set_title('Test Accuracy vs Cumulative Variance Ratio')
    axes[1, 1].set_xlabel('Cumulative Variance Ratio')
    axes[1, 1].set_ylabel('Test Accuracy')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Find optimal number of principal components
    best_idx = np.argmax(results['test_accuracy'])
    best_n_components = results['n_components'][best_idx]
    best_test_acc = results['test_accuracy'][best_idx]
    best_var_ratio = results['variance_ratio'][best_idx]

    print(f"\nOptimal configuration:")
    print(f"Number of principal components: {best_n_components}")
    print(f"Test accuracy: {best_test_acc:.4f}")
    print(f"Cumulative variance ratio: {best_var_ratio:.4f}")
    print(f"Performance change compared to original data: {(best_test_acc - test_acc_original):.4f}")

    return results

classification_results = pca_classification_performance()
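
Rather than looping over component counts by hand as above, the same search can be expressed as a scikit-learn `Pipeline` with cross-validation, which also ensures the scaler and PCA are fit only on each training fold. A compact sketch on the breast-cancer data; the candidate grid and `cv=5` are arbitrary illustrative choices:

python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Scaler -> PCA -> classifier, with the number of components chosen by cross-validation
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA()),
    ('clf', LogisticRegression(max_iter=1000, random_state=42)),
])

param_grid = {'pca__n_components': [2, 5, 10, 15, 20, 25, 30]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')

cancer = load_breast_cancer()
search.fit(cancer.data, cancer.target)

print(f"Best number of components: {search.best_params_['pca__n_components']}")
print(f"Cross-validated accuracy:  {search.best_score_:.4f}")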

12.6.2 PCA for Data Visualization

python
def pca_visualization_example():
    """Use PCA for high-dimensional data visualization"""

    # Use wine dataset
    wine = load_wine()
    X_wine = wine.data
    y_wine = wine.target
    feature_names = wine.feature_names
    target_names = wine.target_names

    print("Use PCA for high-dimensional data visualization:")
    print(f"Original data dimension: {X_wine.shape[1]}")
    print(f"Number of classes: {len(target_names)}")

    # Standardize
    scaler = StandardScaler()
    X_wine_scaled = scaler.fit_transform(X_wine)

    # PCA to 2D and 3D
    pca_2d = PCA(n_components=2)
    pca_3d = PCA(n_components=3)

    X_wine_2d = pca_2d.fit_transform(X_wine_scaled)
    X_wine_3d = pca_3d.fit_transform(X_wine_scaled)

    print(f"\n2D PCA explained variance ratio: {pca_2d.explained_variance_ratio_}")
    print(f"2D PCA cumulative variance ratio: {np.sum(pca_2d.explained_variance_ratio_):.4f}")
    print(f"3D PCA cumulative variance ratio: {np.sum(pca_3d.explained_variance_ratio_):.4f}")

    # Visualization
    fig = plt.figure(figsize=(20, 15))

    # Original feature correlation matrix
    plt.subplot(3, 3, 1)
    correlation_matrix = np.corrcoef(X_wine_scaled.T)
    sns.heatmap(correlation_matrix, cmap='coolwarm', center=0, square=True,
                xticklabels=False, yticklabels=False)
    plt.title('Original Feature Correlation Matrix')

    # Principal component loading plot
    plt.subplot(3, 3, 2)
    loadings = pca_2d.components_.T
    plt.scatter(loadings[:, 0], loadings[:, 1], alpha=0.7)

    # Annotate important features
    for i, feature in enumerate(feature_names):
        if abs(loadings[i, 0]) > 0.3 or abs(loadings[i, 1]) > 0.3:
            plt.annotate(feature[:10], (loadings[i, 0], loadings[i, 1]),
                        xytext=(2, 2), textcoords='offset points', fontsize=8)

    plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.2%})')
    plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.2%})')
    plt.title('Principal Component Loading Plot')
    plt.grid(True, alpha=0.3)

    # 2D PCA visualization
    plt.subplot(3, 3, 3)
    colors = ['red', 'blue', 'green']
    for i, (target_name, color) in enumerate(zip(target_names, colors)):
        mask = y_wine == i
        plt.scatter(X_wine_2d[mask, 0], X_wine_2d[mask, 1],
                   c=color, alpha=0.7, label=target_name, s=50)

    plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.2%})')
    plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.2%})')
    plt.title('2D PCA Visualization')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # 3D PCA visualization
    ax_3d = fig.add_subplot(3, 3, 4, projection='3d')
    for i, (target_name, color) in enumerate(zip(target_names, colors)):
        mask = y_wine == i
        ax_3d.scatter(X_wine_3d[mask, 0], X_wine_3d[mask, 1], X_wine_3d[mask, 2],
                     c=color, alpha=0.7, label=target_name, s=30)

    ax_3d.set_xlabel(f'PC1 ({pca_3d.explained_variance_ratio_[0]:.2%})')
    ax_3d.set_ylabel(f'PC2 ({pca_3d.explained_variance_ratio_[1]:.2%})')
    ax_3d.set_zlabel(f'PC3 ({pca_3d.explained_variance_ratio_[2]:.2%})')
    ax_3d.set_title('3D PCA Visualization')
    ax_3d.legend()

    # Original feature distribution (select several important features)
    important_features_idx = [0, 6, 9, 12]  # Select several representative features

    for idx, feature_idx in enumerate(important_features_idx):
        plt.subplot(3, 3, 5 + idx)

        for i, (target_name, color) in enumerate(zip(target_names, colors)):
            mask = y_wine == i
            plt.hist(X_wine_scaled[mask, feature_idx], alpha=0.6,
                    color=color, label=target_name, bins=15)

        plt.xlabel(feature_names[feature_idx][:15])
        plt.ylabel('Frequency')
        plt.title(f'Feature Distribution: {feature_names[feature_idx][:15]}')
        plt.legend()
        plt.grid(True, alpha=0.3)

    # Explained variance ratio
    plt.subplot(3, 3, 9)
    all_pca = PCA()
    all_pca.fit(X_wine_scaled)

    plt.bar(range(1, len(all_pca.explained_variance_ratio_) + 1),
           all_pca.explained_variance_ratio_, alpha=0.7, color='skyblue')
    plt.xlabel('Principal Component')
    plt.ylabel('Explained Variance Ratio')
    plt.title('Explained Variance Ratio of All Principal Components')
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Analyze meaning of principal components
    print(f"\nPrincipal component analysis:")
    print("Major contributing features for first two principal components:")

    # Get loading matrix
    loadings_df = pd.DataFrame(
        pca_2d.components_.T,
        columns=['PC1', 'PC2'],
        index=feature_names
    )

    print("\nMajor contributing features for PC1:")
    pc1_contributions = loadings_df['PC1'].abs().sort_values(ascending=False)
    for i, (feature, contribution) in enumerate(pc1_contributions.head(5).items()):
        print(f"  {i+1}. {feature}: {contribution:.3f}")

    print("\nMajor contributing features for PC2:")
    pc2_contributions = loadings_df['PC2'].abs().sort_values(ascending=False)
    for i, (feature, contribution) in enumerate(pc2_contributions.head(5).items()):
        print(f"  {i+1}. {feature}: {contribution:.3f}")

    return X_wine_2d, X_wine_3d, pca_2d, pca_3d

X_wine_2d, X_wine_3d, pca_2d, pca_3d = pca_visualization_example()
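
A common complement to the loading plot above is a biplot, which overlays the projected samples and the feature loading vectors in the same axes. Below is a small helper sketched against the 2D wine projection returned by `pca_visualization_example()`; the arrow scaling factor is purely cosmetic:

python
def plot_biplot(scores, pca, feature_names, labels, scale=3.0):
    """Overlay PCA scores and feature loading vectors in one plot."""
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(scores[:, 0], scores[:, 1], c=labels, cmap='viridis', alpha=0.6)
    for name, (lx, ly) in zip(feature_names, pca.components_.T):
        ax.arrow(0, 0, lx * scale, ly * scale, color='red', alpha=0.6, head_width=0.08)
        ax.text(lx * scale * 1.1, ly * scale * 1.1, name[:12], fontsize=8)
    ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%})')
    ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%})')
    ax.set_title('PCA Biplot')
    plt.tight_layout()
    plt.show()

plot_biplot(X_wine_2d, pca_2d, load_wine().feature_names, load_wine().target)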

12.7 PCA Limitations and Considerations

12.7.1 PCA Assumptions and Limitations

python
def pca_limitations_demo():
    """Demonstrate PCA limitations"""

    print("PCA limitations and considerations:")
    print("=" * 30)

    # 1. Linearity assumption limitations
    print("1. Linearity assumption limitations:")

    # Create nonlinear data
    np.random.seed(42)
    t = np.linspace(0, 4*np.pi, 300)
    x1 = t * np.cos(t) + 0.5 * np.random.randn(300)
    x2 = t * np.sin(t) + 0.5 * np.random.randn(300)
    X_spiral = np.column_stack([x1, x2])

    # Standardize
    scaler = StandardScaler()
    X_spiral_scaled = scaler.fit_transform(X_spiral)

    # Apply PCA
    pca_spiral = PCA(n_components=2)
    X_spiral_pca = pca_spiral.fit_transform(X_spiral_scaled)

    # 2. Variance does not equal importance
    print("2. Variance does not equal importance:")

    # Create data: one feature has high variance but is useless for classification
    np.random.seed(42)
    n_samples = 500

    # Useful feature (low variance but has classification information)
    useful_feature = np.random.normal(0, 0.1, n_samples)

    # Useless feature (high variance but no classification information)
    useless_feature = np.random.normal(0, 5, n_samples)

    # Labels based on useful feature
    y_synthetic = (useful_feature > 0).astype(int)

    X_synthetic = np.column_stack([useful_feature, useless_feature])

    # Deliberately skip standardization here: the raw variance difference between
    # the two features is exactly what this demo needs to expose, since after
    # standardization both features would have unit variance and the effect vanishes
    pca_synthetic = PCA(n_components=2)
    X_synthetic_pca = pca_synthetic.fit_transform(X_synthetic)

    # Visualization
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    fig.suptitle('PCA Limitations Demonstration', fontsize=16)

    # Nonlinear data
    axes[0, 0].scatter(X_spiral_scaled[:, 0], X_spiral_scaled[:, 1],
                      c=t, cmap='viridis', alpha=0.7, s=20)
    axes[0, 0].set_title('Nonlinear Spiral Data')
    axes[0, 0].set_xlabel('Feature 1')
    axes[0, 0].set_ylabel('Feature 2')

    # PCA transformed spiral data
    axes[0, 1].scatter(X_spiral_pca[:, 0], X_spiral_pca[:, 1],
                      c=t, cmap='viridis', alpha=0.7, s=20)
    axes[0, 1].set_title('PCA Transformed Spiral Data')
    axes[0, 1].set_xlabel('PC1')
    axes[0, 1].set_ylabel('PC2')

    # Principal component directions for spiral data
    axes[0, 2].scatter(X_spiral_scaled[:, 0], X_spiral_scaled[:, 1],
                      c=t, cmap='viridis', alpha=0.5, s=20)

    # Draw principal component directions
    mean_point = np.mean(X_spiral_scaled, axis=0)
    for i, component in enumerate(pca_spiral.components_):
        vector = component * np.sqrt(pca_spiral.explained_variance_[i]) * 2
        axes[0, 2].arrow(mean_point[0], mean_point[1], vector[0], vector[1],
                        head_width=0.1, head_length=0.1, fc=f'C{i}', ec=f'C{i}',
                        linewidth=3, label=f'PC{i+1}')

    axes[0, 2].set_title('Principal Component Directions (Spiral Data)')
    axes[0, 2].set_xlabel('Feature 1')
    axes[0, 2].set_ylabel('Feature 2')
    axes[0, 2].legend()

    # Variance vs importance problem
    axes[1, 0].scatter(X_synthetic[:, 0], X_synthetic[:, 1],
                      c=y_synthetic, cmap='RdYlBu', alpha=0.7, s=30)
    axes[1, 0].set_title('Original Data (color = class)')
    axes[1, 0].set_xlabel('Useful Feature (Low Variance)')
    axes[1, 0].set_ylabel('Useless Feature (High Variance)')

    # PCA transformed data
    axes[1, 1].scatter(X_synthetic_pca[:, 0], X_synthetic_pca[:, 1],
                      c=y_synthetic, cmap='RdYlBu', alpha=0.7, s=30)
    axes[1, 1].set_title('PCA Transformed Data')
    axes[1, 1].set_xlabel(f'PC1 ({pca_synthetic.explained_variance_ratio_[0]:.2%})')
    axes[1, 1].set_ylabel(f'PC2 ({pca_synthetic.explained_variance_ratio_[1]:.2%})')

    # Feature variance comparison (on the unscaled data)
    original_var = np.var(X_synthetic, axis=0)
    axes[1, 2].bar(['Useful Feature', 'Useless Feature'], original_var,
                  color=['green', 'red'], alpha=0.7)
    axes[1, 2].set_title('Original Feature Variance')
    axes[1, 2].set_ylabel('Variance')

    # Add numeric labels
    for i, var in enumerate(original_var):
        axes[1, 2].text(i, var + 0.01, f'{var:.3f}', ha='center')

    plt.tight_layout()
    plt.show()

    print(f"   Spiral data PCA explained variance ratio: {pca_spiral.explained_variance_ratio_}")
    print(f"   Synthetic data PCA explained variance ratio: {pca_synthetic.explained_variance_ratio_}")
    print(f"   Note: PCA selected high variance but useless feature as first principal component")

    # 3. Interpretability problem
    print(f"\n3. Interpretability problem:")
    print(f"   Principal components are linear combinations of original features, difficult to interpret directly")
    print(f"   For example: PC1 = {pca_synthetic.components_[0][0]:.3f} * useful feature + {pca_synthetic.components_[0][1]:.3f} * useless feature")

    return X_spiral_scaled, X_synthetic, pca_spiral, pca_synthetic

X_spiral_scaled, X_synthetic, pca_spiral, pca_synthetic = pca_limitations_demo()

12.7.2 PCA Best Practices

python
def pca_best_practices():
    """PCA best practices guide"""

    print("PCA best practices guide:")
    print("=" * 25)

    practices = {
        "Data Preprocessing": [
            "Always standardize data (unless features are already on same scale)",
            "Check and handle missing values",
            "Consider impact of outliers",
            "Ensure data quality"
        ],

        "Principal Component Selection": [
            "Use cumulative explained variance ratio (usually 85%-95%)",
            "Apply Kaiser criterion (eigenvalues > 1)",
            "Observe elbow of scree plot",
            "Consider downstream task performance"
        ],

        "Model Validation": [
            "Use cross-validation to evaluate dimensionality reduction effect",
            "Compare model performance before and after dimensionality reduction",
            "Check stability of principal components",
            "Analyze interpretability of principal components"
        ],

        "Application Scenarios": [
            "Data visualization (reduce to 2D/3D)",
            "Noise removal and data compression",
            "Feature preprocessing (reduce curse of dimensionality)",
            "Exploratory data analysis"
        ],

        "Considerations": [
            "PCA assumes linear relationships",
            "High variance does not equal high importance",
            "Principal components are difficult to interpret",
            "Sensitive to outliers"
        ]
    }

    for category, items in practices.items():
        print(f"\n{category}:")
        for i, item in enumerate(items, 1):
            print(f"  {i}. {item}")

    # Practice example: complete PCA workflow
    print(f"\nComplete PCA workflow example:")
    print("-" * 30)

    # Use breast cancer dataset
    cancer = load_breast_cancer()
    X, y = cancer.data, cancer.target

    print(f"1. Data overview: {X.shape}")

    # 2. Data preprocessing
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    print(f"2. Data standardization completed")

    # 3. Preliminary PCA analysis
    pca_full = PCA()
    pca_full.fit(X_scaled)

    # 4. Select number of principal components
    cumsum_var = np.cumsum(pca_full.explained_variance_ratio_)
    n_components_95 = np.argmax(cumsum_var >= 0.95) + 1

    print(f"3. Need {n_components_95} principal components to retain 95% variance")

    # 5. Apply PCA
    pca_final = PCA(n_components=n_components_95)
    X_pca = pca_final.fit_transform(X_scaled)

    print(f"4. Data shape after dimensionality reduction: {X_pca.shape}")

    # 6. Validate effect
    # Note: for simplicity, PCA above was fit on the full dataset; in a production
    # workflow, fit the scaler and PCA on the training split only to avoid leakage.
    # The two splits below use the same random_state, so the row indices match.
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.2, random_state=42
    )

    X_train_pca, X_test_pca, _, _ = train_test_split(
        X_pca, y, test_size=0.2, random_state=42
    )

    # Original data classification
    clf_original = LogisticRegression(random_state=42, max_iter=1000)
    clf_original.fit(X_train, y_train)
    acc_original = clf_original.score(X_test, y_test)

    # PCA data classification
    clf_pca = LogisticRegression(random_state=42, max_iter=1000)
    clf_pca.fit(X_train_pca, y_train)
    acc_pca = clf_pca.score(X_test_pca, y_test)

    print(f"5. Performance comparison:")
    print(f"   Original data accuracy: {acc_original:.4f}")
    print(f"   PCA data accuracy: {acc_pca:.4f}")
    print(f"   Dimensionality reduced: {X.shape[1]}{X_pca.shape[1]} ({X_pca.shape[1]/X.shape[1]*100:.1f}%)")

    # Visualize workflow results
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))

    # Cumulative explained variance ratio
    axes[0].plot(range(1, len(cumsum_var) + 1), cumsum_var, 'bo-', linewidth=2)
    axes[0].axhline(y=0.95, color='red', linestyle='--', alpha=0.7, label='95% Threshold')
    axes[0].axvline(x=n_components_95, color='red', linestyle=':', alpha=0.7)
    axes[0].set_title('Cumulative Explained Variance Ratio')
    axes[0].set_xlabel('Number of Principal Components')
    axes[0].set_ylabel('Cumulative Explained Variance Ratio')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # 2D visualization of first two principal components
    axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='RdYlBu', alpha=0.7)
    axes[1].set_title('Visualization of First Two Principal Components')
    axes[1].set_xlabel(f'PC1 ({pca_final.explained_variance_ratio_[0]:.2%})')
    axes[1].set_ylabel(f'PC2 ({pca_final.explained_variance_ratio_[1]:.2%})')

    # Performance comparison
    methods = ['Original Data', 'PCA Data']
    accuracies = [acc_original, acc_pca]
    colors = ['blue', 'orange']

    bars = axes[2].bar(methods, accuracies, color=colors, alpha=0.7)
    axes[2].set_title('Classification Performance Comparison')
    axes[2].set_ylabel('Accuracy')
    axes[2].set_ylim(0.9, 1.0)

    # Add numeric labels
    for bar, acc in zip(bars, accuracies):
        height = bar.get_height()
        axes[2].text(bar.get_x() + bar.get_width()/2., height + 0.005,
                    f'{acc:.4f}', ha='center', va='bottom')

    plt.tight_layout()
    plt.show()

    return X_scaled, X_pca, pca_final

X_scaled, X_pca, pca_final = pca_best_practices()
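
To avoid the information leakage noted in step 6 above (PCA fit on the full dataset before splitting), the whole workflow can be wrapped in a pipeline so that standardization and PCA are fit on the training split only. A brief sketch, reusing the 95%-variance criterion via a float `n_components`:

python
from sklearn.pipeline import make_pipeline

# Leakage-free version of the workflow: scaler and PCA are fit on the training split only
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42, stratify=cancer.target
)

pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000, random_state=42),
)
pipe.fit(X_train, y_train)

print(f"Components kept: {pipe.named_steps['pca'].n_components_}")
print(f"Test accuracy:   {pipe.score(X_test, y_test):.4f}")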

12.8 Exercises

Exercise 1: Basic PCA

  1. Perform PCA analysis using Iris dataset
  2. Determine number of principal components needed to retain 90% and 95% variance
  3. Visualize first two principal components and analyze their meaning

Exercise 2: PCA and Classification

  1. Use wine dataset, compare impact of different number of principal components on classification performance
  2. Plot accuracy vs number of principal components curve
  3. Find optimal number of principal components

Exercise 3: PCA Variant Comparison

  1. Create a large dataset (samples > 5000)
  2. Compare performance and results of standard PCA and incremental PCA
  3. Analyze pros and cons of both methods

Exercise 4: Nonlinear Data

  1. Create a nonlinear dataset (e.g., spiral or S-shape)
  2. Compare effect of linear PCA and kernel PCA
  3. Try different kernel functions

Exercise 5: Real Application

  1. Use handwritten digit dataset (load_digits)
  2. Apply PCA for dimensionality reduction and visualization
  3. Analyze impact of dimensionality reduction on digit recognition performance

12.9 Summary

In this chapter, we have learned various aspects of principal component analysis in depth:

Core Concepts

  • Dimensionality Reduction Technique: Reduce feature dimensions while preserving main information
  • Variance Maximization: Find directions of maximum variance in data
  • Linear Transformation: Achieve dimensionality reduction through orthogonal transformation

Main Techniques

  • Standard PCA: Classic principal component analysis method
  • Incremental PCA: Batch processing suitable for large datasets
  • Kernel PCA: Handle nonlinear data using kernel trick
  • Principal Component Selection: Multiple methods to determine optimal number of principal components

Practical Skills

  • Data Preprocessing: Importance of standardization
  • Visualization Techniques: 2D/3D projection of high-dimensional data
  • Performance Evaluation: Impact of dimensionality reduction on downstream tasks
  • Parameter Tuning: Strategies for selecting number of principal components

Key Points

  • PCA is an unsupervised linear dimensionality reduction technique
  • Data standardization is critical for PCA results
  • Number of principal components needs to balance information retention and dimensionality reduction
  • PCA assumes linear relationships, has limited effectiveness on nonlinear data

Application Scenarios

PCA is suitable when:

  • Data dimensions are very high and need dimensionality reduction
  • Linear correlations exist between features
  • Need data visualization
  • Want to perform noise removal
  • Limited storage space or computational resources

PCA is not suitable when:

  • Data dimensions themselves are low
  • Weak correlations between features
  • Need to maintain feature interpretability
  • Data has strong nonlinear relationships
  • All features are very important

Best Practices Summary

  1. Data Preprocessing

    • Always standardize data
    • Handle missing values and outliers
    • Check data quality
  2. Principal Component Selection

    • Use multiple methods to determine number of principal components
    • Consider downstream task performance
    • Balance information retention and computational efficiency
  3. Result Validation

    • Analyze interpretability of principal components
    • Validate impact of dimensionality reduction on task performance
    • Check stability of results
  4. Practical Application

    • Choose appropriate PCA variant based on specific problem
    • Combine domain knowledge to interpret principal components
    • Consider limitations of PCA

12.10 Next Steps

Now you have mastered principal component analysis, an important dimensionality reduction technique! In the next chapter, Anomaly Detection, we will learn how to identify anomalies and outliers in data, which has important applications in data quality control and fraud detection.


Chapter Points Review:

  • ✅ Understood mathematical principles and geometric intuition of PCA
  • ✅ Mastered implementation and parameter tuning of PCA
  • ✅ Learned methods for determining optimal number of principal components
  • ✅ Understood variants and applicable scenarios of PCA
  • ✅ Mastered application of PCA in data visualization
  • ✅ Recognized limitations and best practices of PCA
  • ✅ Able to apply PCA appropriately in real-world projects

Content is for learning and research only.