
Chapter 14: Cross-Validation

Cross-validation is an important technique for evaluating model performance in machine learning. By splitting the data multiple times, it obtains more reliable performance estimates and helps us select the best model and parameters. This chapter introduces the various cross-validation methods and their applications in detail.

14.1 What is Cross-Validation?

Cross-Validation is a statistical method used to evaluate the generalization ability of machine learning models. It evaluates model performance by dividing data into multiple subsets and using different subsets as training and validation sets in turn.

14.1.1 Why Do We Need Cross-Validation?

  • Avoid Misleading Estimates: A single data split may yield an overly optimistic (or pessimistic) performance estimate (see the sketch below)
  • Fully Utilize Data: Every sample is used for both training and validation, just in different folds
  • Obtain Stable Estimates: Averaging the results of multiple validation rounds is more reliable than any single split
  • Model Selection: Compare the performance of different models and parameter settings on an equal footing
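
The first point is easy to demonstrate. Below is a minimal sketch (using scikit-learn, the same library as the rest of this chapter) contrasting the estimate from a single split with a 5-fold average; the exact numbers will vary, but the single-split score is far more sensitive to which samples happen to land in the test set.

python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# One split: the estimate depends heavily on the random split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
single_split_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# Five splits: averaging smooths out the luck of any one split
cv_scores = cross_val_score(clf, X, y, cv=5)
print(f"Single split: {single_split_acc:.3f}")
print(f"5-fold CV:    {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")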

14.1.2 Advantages of Cross-Validation

  • More Reliable Performance Estimates: Reduces the impact of any single random split
  • Better Model Selection: Decisions rest on the results of multiple validation rounds
  • High Data Utilization: Data held out for validation in one fold still trains the model in the others
  • Overfitting Detection: The gap between training and validation performance reveals overfitting

14.1.3 Types of Cross-Validation

  • K-Fold Cross-Validation: Most commonly used method
  • Leave-One-Out Cross-Validation: Leave one sample for validation each time
  • Stratified Cross-Validation: Maintain class proportions
  • Time Series Cross-Validation: Consider time order
  • Group Cross-Validation: Consider data grouping

14.2 Preparing Environment and Data

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, load_wine, load_breast_cancer, make_classification
from sklearn.model_selection import (
    cross_val_score, cross_validate, KFold, StratifiedKFold, 
    LeaveOneOut, LeavePOut, ShuffleSplit, StratifiedShuffleSplit,
    TimeSeriesSplit, GroupKFold, train_test_split, GridSearchCV,
    learning_curve, validation_curve
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_curve, auc  # needed for Section 14.8
)
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)

# Set plot style
plt.style.use('seaborn-v0_8')
plt.rcParams['font.sans-serif'] = ['SimHei']  # CJK font; safe to remove for English-only labels
plt.rcParams['axes.unicode_minus'] = False

14.3 K-Fold Cross-Validation

14.3.1 Basic K-Fold Cross-Validation

python
def demonstrate_k_fold_cv():
    """Demonstrate the basic principles of K-fold cross-validation"""
    
    print("K-fold Cross-Validation Basic Principles:")
    print("=" * 25)
    
    # Load iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target
    
    print(f"Dataset size: {X.shape}")
    print(f"Class distribution: {np.bincount(y)}")
    
    # Create classifier
    clf = LogisticRegression(random_state=42, max_iter=1000)
    
    # Different K values
    k_values = [3, 5, 10]
    
    print(f"\nCross-validation results for different K values:")
    print("K Value\tMean Accuracy\tStd Dev\t\tFold Scores")
    print("-" * 60)
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('K-Fold Cross-Validation Demonstration', fontsize=16)
    
    for i, k in enumerate(k_values):
        # K-fold cross-validation
        kfold = KFold(n_splits=k, shuffle=True, random_state=42)
        cv_scores = cross_val_score(clf, X, y, cv=kfold, scoring='accuracy')
        
        mean_score = np.mean(cv_scores)
        std_score = np.std(cv_scores)
        
        print(f"{k}\t{mean_score:.4f}\t\t{std_score:.4f}\t\t{cv_scores}")
        
        # Visualize scores for each fold
        if i < 3:
            row = i // 2
            col = i % 2
            
            axes[row, col].bar(range(1, k+1), cv_scores, alpha=0.7, color='skyblue')
            axes[row, col].axhline(y=mean_score, color='red', linestyle='--', 
                                  label=f'Mean: {mean_score:.3f}')
            axes[row, col].set_title(f'{k}-Fold Cross-Validation')
            axes[row, col].set_xlabel('Fold Number')
            axes[row, col].set_ylabel('Accuracy')
            axes[row, col].legend()
            axes[row, col].grid(True, alpha=0.3)
            axes[row, col].set_ylim(0.8, 1.0)
    
    # Visualize data splitting process
    axes[1, 1].remove()
    ax_split = fig.add_subplot(2, 2, 4)
    
    # Demonstrate 5-fold cross-validation splitting
    kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)
    
    fold_colors = ['red', 'blue', 'green', 'orange', 'purple']
    y_pos = 0
    
    for fold, (train_idx, val_idx) in enumerate(kfold_demo.split(X)):
        # Draw the training portion (light) and validation portion (dark) of each fold
        ax_split.barh(y_pos, len(train_idx), left=0, height=0.8,
                     color=fold_colors[fold], alpha=0.3, label='Training Set' if fold == 0 else "")
        ax_split.barh(y_pos, len(val_idx), left=len(train_idx), height=0.8,
                     color=fold_colors[fold], alpha=0.8, label='Validation Set' if fold == 0 else "")
        
        y_pos += 1
    
    ax_split.set_title('5-Fold Cross-Validation Data Splitting')
    ax_split.set_xlabel('Number of Samples')
    ax_split.set_ylabel('Fold Number')
    ax_split.set_yticks(range(5))
    ax_split.set_yticklabels([f'Fold{i+1}' for i in range(5)])
    ax_split.legend(loc='lower right')
    
    plt.tight_layout()
    plt.show()
    
    return cv_scores

cv_scores = demonstrate_k_fold_cv()

14.3.2 Stratified K-Fold Cross-Validation

python
def stratified_k_fold_demo():
    """Demonstrate stratified K-fold cross-validation"""
    
    print("Stratified K-fold Cross-Validation:")
    print("Maintains the proportion of each class in each fold consistent with the original dataset")
    
    # Create imbalanced dataset
    X_imbalanced, y_imbalanced = make_classification(
        n_samples=1000, n_features=20, n_informative=10,
        n_classes=3, weights=[0.6, 0.3, 0.1],  # Imbalanced classes
        random_state=42
    )
    
    print(f"\nImbalanced Dataset:")
    print(f"Total samples: {len(X_imbalanced)}")
    unique, counts = np.unique(y_imbalanced, return_counts=True)
    for cls, count in zip(unique, counts):
        print(f"Class {cls}: {count} samples ({count/len(y_imbalanced)*100:.1f}%)")
    
    # Compare regular K-fold and stratified K-fold
    k = 5
    
    # Regular K-fold
    kfold = KFold(n_splits=k, shuffle=True, random_state=42)
    
    # Stratified K-fold
    stratified_kfold = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    
    # Analyze class distribution for each fold
    print(f"\n{k}-Fold Cross-Validation Class Distribution Comparison:")
    print("Method\t\tFold\tClass0\tClass1\tClass2")
    print("-" * 50)
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Regular K-Fold vs Stratified K-Fold Cross-Validation', fontsize=16)
    
    # Regular K-fold distribution
    fold_distributions_normal = []
    for fold, (train_idx, val_idx) in enumerate(kfold.split(X_imbalanced, y_imbalanced)):
        val_y = y_imbalanced[val_idx]
        unique_val, counts_val = np.unique(val_y, return_counts=True)
        
        # Ensure all classes have counts
        distribution = np.zeros(3)
        for cls, count in zip(unique_val, counts_val):
            distribution[cls] = count
        
        fold_distributions_normal.append(distribution)
        
        print(f"Regular K-fold\t{fold+1}\t{int(distribution[0])}\t{int(distribution[1])}\t{int(distribution[2])}")
    
    # Stratified K-fold distribution
    fold_distributions_stratified = []
    for fold, (train_idx, val_idx) in enumerate(stratified_kfold.split(X_imbalanced, y_imbalanced)):
        val_y = y_imbalanced[val_idx]
        unique_val, counts_val = np.unique(val_y, return_counts=True)
        
        distribution = np.zeros(3)
        for cls, count in zip(unique_val, counts_val):
            distribution[cls] = count
        
        fold_distributions_stratified.append(distribution)
        
        print(f"Stratified K-fold\t{fold+1}\t{int(distribution[0])}\t{int(distribution[1])}\t{int(distribution[2])}")
    
    # Visualize distribution comparison
    fold_distributions_normal = np.array(fold_distributions_normal)
    fold_distributions_stratified = np.array(fold_distributions_stratified)
    
    # Regular K-fold
    x = np.arange(k)
    width = 0.35
    
    axes[0, 0].bar(x - width/2, fold_distributions_normal[:, 0], width, label='Class 0', color='blue')
    axes[0, 0].bar(x - width/2, fold_distributions_normal[:, 1], width, bottom=fold_distributions_normal[:, 0], label='Class 1', color='green')
    axes[0, 0].bar(x - width/2, fold_distributions_normal[:, 2], width, 
                  bottom=fold_distributions_normal[:, 0] + fold_distributions_normal[:, 1], label='Class 2', color='red')
    axes[0, 0].set_title('Regular K-Fold Distribution')
    axes[0, 0].set_xlabel('Fold Number')
    axes[0, 0].set_ylabel('Sample Count')
    axes[0, 0].set_xticks(x)
    axes[0, 0].set_xticklabels([f'Fold{i+1}' for i in range(k)])
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # Stratified K-fold
    axes[0, 1].bar(x - width/2, fold_distributions_stratified[:, 0], width, label='Class 0', color='blue')
    axes[0, 1].bar(x - width/2, fold_distributions_stratified[:, 1], width, bottom=fold_distributions_stratified[:, 0], label='Class 1', color='green')
    axes[0, 1].bar(x - width/2, fold_distributions_stratified[:, 2], width, 
                  bottom=fold_distributions_stratified[:, 0] + fold_distributions_stratified[:, 1], label='Class 2', color='red')
    axes[0, 1].set_title('Stratified K-Fold Distribution')
    axes[0, 1].set_xlabel('Fold Number')
    axes[0, 1].set_ylabel('Sample Count')
    axes[0, 1].set_xticks(x)
    axes[0, 1].set_xticklabels([f'Fold{i+1}' for i in range(k)])
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    
    # Compare performance
    clf = LogisticRegression(random_state=42, max_iter=1000)
    
    scores_normal = cross_val_score(clf, X_imbalanced, y_imbalanced, cv=kfold, scoring='accuracy')
    scores_stratified = cross_val_score(clf, X_imbalanced, y_imbalanced, cv=stratified_kfold, scoring='accuracy')
    
    print(f"\nPerformance Comparison:")
    print(f"Regular K-fold: {np.mean(scores_normal):.4f} ± {np.std(scores_normal):.4f}")
    print(f"Stratified K-fold: {np.mean(scores_stratified):.4f} ± {np.std(scores_stratified):.4f}")
    
    # Visualize performance comparison
    methods = ['Regular K-Fold', 'Stratified K-Fold']
    means = [np.mean(scores_normal), np.mean(scores_stratified)]
    stds = [np.std(scores_normal), np.std(scores_stratified)]
    
    axes[1, 0].bar(methods, means, yerr=stds, capsize=5, color=['skyblue', 'lightcoral'], alpha=0.7)
    axes[1, 0].set_title('Accuracy Comparison')
    axes[1, 0].set_ylabel('Accuracy')
    axes[1, 0].grid(True, alpha=0.3)
    axes[1, 0].set_ylim(0.6, 0.9)
    
    # Score distribution
    axes[1, 1].boxplot([scores_normal, scores_stratified], labels=methods)
    axes[1, 1].set_title('Score Distribution')
    axes[1, 1].set_ylabel('Accuracy')
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return scores_normal, scores_stratified

scores_normal, scores_stratified = stratified_k_fold_demo()

14.3.3 Leave-One-Out Cross-Validation (LOOCV)

python
def loocv_demo():
    """Demonstrate Leave-One-Out Cross-Validation"""
    
    print("Leave-One-Out Cross-Validation (LOOCV):")
    print("Each time, leave one sample as the validation set, use the remaining samples as the training set")
    
    # Use a smaller dataset for demonstration
    X_small, y_small = make_classification(
        n_samples=100, n_features=5, n_informative=3,
        n_classes=2, random_state=42
    )
    
    print(f"Dataset size: {X_small.shape}")
    
    # LOOCV
    loo = LeaveOneOut()
    clf = LogisticRegression(random_state=42, max_iter=1000)
    
    # Perform LOOCV
    scores_loo = cross_val_score(clf, X_small, y_small, cv=loo, scoring='accuracy')
    
    print(f"\nLOOCV Results:")
    print(f"Number of iterations: {len(scores_loo)}")
    print(f"Mean Accuracy: {np.mean(scores_loo):.4f}")
    print(f"Std Dev: {np.std(scores_loo):.4f}")
    
    # Compare with 10-fold
    kfold_10 = KFold(n_splits=10, shuffle=True, random_state=42)
    scores_10fold = cross_val_score(clf, X_small, y_small, cv=kfold_10, scoring='accuracy')
    
    print(f"\n10-Fold Cross-Validation Comparison:")
    print(f"Mean Accuracy: {np.mean(scores_10fold):.4f}")
    print(f"Std Dev: {np.std(scores_10fold):.4f}")
    
    # Visualize
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    fig.suptitle('LOOCV vs 10-Fold Cross-Validation', fontsize=16)
    
    # LOOCV scores distribution
    axes[0].hist(scores_loo, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
    axes[0].axvline(x=np.mean(scores_loo), color='red', linestyle='--', label=f'Mean: {np.mean(scores_loo):.3f}')
    axes[0].set_title('LOOCV Score Distribution')
    axes[0].set_xlabel('Accuracy')
    axes[0].set_ylabel('Frequency')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # 10-fold scores
    axes[1].bar(range(1, 11), scores_10fold, alpha=0.7, color='lightcoral')
    axes[1].axhline(y=np.mean(scores_10fold), color='blue', linestyle='--', label=f'Mean: {np.mean(scores_10fold):.3f}')
    axes[1].set_title('10-Fold Cross-Validation Scores')
    axes[1].set_xlabel('Fold Number')
    axes[1].set_ylabel('Accuracy')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    axes[1].set_ylim(0.5, 1.0)
    
    # Comparison boxplot
    axes[2].boxplot([scores_loo, scores_10fold], labels=['LOOCV', '10-Fold'])
    axes[2].set_title('Method Comparison')
    axes[2].set_ylabel('Accuracy')
    axes[2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Advantages and disadvantages of LOOCV
    print("\nLOOCV Advantages:")
    print("✓ Maximum use of data")
    print("✓ Least bias")
    print("✓ Deterministic result")
    
    print("\nLOOCV Disadvantages:")
    print("✗ Computationally expensive")
    print("✗ High variance in estimates")
    print("✗ Cannot be used with some models")
    
    return scores_loo, scores_10fold

scores_loo, scores_10fold = loocv_demo()
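
A related method is Leave-P-Out (LeavePOut, imported in Section 14.2 but not used above), which holds out every possible subset of p samples. The sketch below only counts the required model fits, which is enough to show why it is rarely practical beyond tiny datasets:

python
from math import comb
from sklearn.model_selection import LeavePOut

# Leave-P-Out generalizes LOOCV: every subset of p samples serves as the
# validation set exactly once, for a total of C(n, p) model fits.
lpo = LeavePOut(p=2)
print(f"Model fits for n=100, p=2: {comb(100, 2)}")  # 4950
print(f"Model fits for n=100, p=3: {comb(100, 3)}")  # 161700
# Usage mirrors LOOCV, e.g. cross_val_score(clf, X, y, cv=lpo)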

14.3.4 Repeated K-Fold Cross-Validation

python
def repeated_kfold_demo():
    """Demonstrate Repeated K-Fold Cross-Validation"""
    
    print("Repeated K-Fold Cross-Validation:")
    print("Repeat K-fold cross-validation multiple times with different splits")
    
    # Create dataset
    X, y = load_iris(return_X_y=True)
    
    print(f"Dataset size: {X.shape}")
    
    # Basic 5-fold
    kfold_5 = KFold(n_splits=5, shuffle=True, random_state=42)
    clf = LogisticRegression(random_state=42, max_iter=1000)
    
    scores_single = cross_val_score(clf, X, y, cv=kfold_5, scoring='accuracy')
    
    # Repeated 5-fold (repeat 10 times)
    from sklearn.model_selection import RepeatedKFold
    
    rkfold = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
    scores_repeated = cross_val_score(clf, X, y, cv=rkfold, scoring='accuracy')
    
    print(f"\nSingle 5-Fold CV: {np.mean(scores_single):.4f} ± {np.std(scores_single):.4f}")
    print(f"Repeated 5-Fold (10 repeats): {np.mean(scores_repeated):.4f} ± {np.std(scores_repeated):.4f}")
    print(f"Total evaluations: {len(scores_repeated)}")
    
    # Visualize
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    fig.suptitle('Single vs Repeated K-Fold Cross-Validation', fontsize=16)
    
    # Score distribution
    axes[0].hist(scores_single, bins=10, alpha=0.7, color='skyblue', label='Single 5-Fold', edgecolor='black')
    axes[0].axvline(x=np.mean(scores_single), color='blue', linestyle='--', linewidth=2)
    axes[0].set_title('Single 5-Fold CV Score Distribution')
    axes[0].set_xlabel('Accuracy')
    axes[0].set_ylabel('Frequency')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Repeated scores (show first 50 for clarity)
    axes[1].plot(scores_repeated[:50], 'o-', alpha=0.7, color='lightcoral', markersize=4)
    axes[1].axhline(y=np.mean(scores_repeated), color='red', linestyle='--', linewidth=2, label=f'Mean: {np.mean(scores_repeated):.3f}')
    axes[1].fill_between(range(50), 
                        np.mean(scores_repeated) - np.std(scores_repeated),
                        np.mean(scores_repeated) + np.std(scores_repeated),
                        alpha=0.2, color='red')
    axes[1].set_title('Repeated 5-Fold CV Scores (First 50)')
    axes[1].set_xlabel('Evaluation Number')
    axes[1].set_ylabel('Accuracy')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return scores_single, scores_repeated

scores_single, scores_repeated = repeated_kfold_demo()
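
For classification tasks, scikit-learn also provides RepeatedStratifiedKFold, which combines repetition with stratification. A minimal sketch on the iris data:

python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=42, max_iter=1000)

# Same interface as RepeatedKFold, but every repeat uses stratified folds,
# preserving class proportions in each of the 5 x 10 = 50 evaluations
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(clf, X, y, cv=rskf, scoring='accuracy')
print(f"Repeated Stratified 5-Fold: {scores.mean():.4f} ± {scores.std():.4f}")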

14.4 Special Cross-Validation Methods

14.4.1 Time Series Cross-Validation

python
def time_series_cv_demo():
    """Demonstrate Time Series Cross-Validation"""
    
    print("Time Series Cross-Validation:")
    print("Uses expanding window or sliding window to maintain temporal order")
    
    # Create simulated time series data
    np.random.seed(42)
    n_samples = 200
    
    # Create time series with trend and seasonality
    time = np.arange(n_samples)
    trend = 0.01 * time
    seasonality = 0.3 * np.sin(2 * np.pi * time / 50)
    noise = np.random.normal(0, 0.1, n_samples)
    
    # Target with some relationship to features
    X_ts = np.column_stack([
        trend + noise,
        seasonality + noise,
        np.random.normal(0, 0.2, n_samples),
        np.random.normal(0, 0.2, n_samples)
    ])
    
    # Binary target based on value
    y_ts = (trend + seasonality + noise > 0).astype(int)
    
    print(f"Time series dataset: {X_ts.shape}")
    print(f"Class distribution: {np.bincount(y_ts)}")
    
    # Time series split
    tscv = TimeSeriesSplit(n_splits=5)
    
    print(f"\nTime Series Split Information:")
    for i, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
        print(f"Fold {i+1}: Train {len(train_idx)} samples, Test {len(test_idx)} samples")
    
    # Visualize time series split
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Time Series Cross-Validation', fontsize=16)
    
    # Show data and splits
    axes[0, 0].plot(time, y_ts, 'b-', alpha=0.7)
    axes[0, 0].set_title('Time Series Target Variable')
    axes[0, 0].set_xlabel('Time')
    axes[0, 0].set_ylabel('Target')
    axes[0, 0].grid(True, alpha=0.3)
    
    # Show splits
    colors = ['red', 'blue', 'green', 'orange', 'purple']
    for i, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
        axes[0, 1].axvspan(train_idx[0], train_idx[-1], alpha=0.2, color=colors[i], label=f'Fold {i+1} Train' if i == 0 else "")
        axes[0, 1].axvspan(test_idx[0], test_idx[-1], alpha=0.6, color=colors[i], label=f'Fold {i+1} Test' if i == 0 else "")
    
    axes[0, 1].set_title('Time Series Split Visualization')
    axes[0, 1].set_xlabel('Time Index')
    axes[0, 1].set_ylabel('Fold')
    axes[0, 1].legend(loc='upper left', fontsize=8)
    axes[0, 1].grid(True, alpha=0.3)
    
    # Perform time series cross-validation
    clf = LogisticRegression(random_state=42, max_iter=1000)
    
    scores_ts = []
    fold_sizes = []
    
    for i, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
        X_train, X_test = X_ts[train_idx], X_ts[test_idx]
        y_train, y_test = y_ts[train_idx], y_ts[test_idx]
        
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        scores_ts.append(score)
        fold_sizes.append(len(test_idx))
        
        print(f"Fold {i+1}: Test size={len(test_idx)}, Accuracy={score:.4f}")
    
    # Visualize results
    axes[1, 0].bar(range(1, len(scores_ts) + 1), scores_ts, alpha=0.7, color='skyblue')
    axes[1, 0].set_title('Time Series CV Accuracy by Fold')
    axes[1, 0].set_xlabel('Fold Number')
    axes[1, 0].set_ylabel('Accuracy')
    axes[1, 0].grid(True, alpha=0.3)
    
    # Size vs accuracy
    axes[1, 1].scatter(fold_sizes, scores_ts, s=100, alpha=0.7, color='lightcoral')
    axes[1, 1].set_title('Test Set Size vs Accuracy')
    axes[1, 1].set_xlabel('Test Set Size')
    axes[1, 1].set_ylabel('Accuracy')
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print(f"\nOverall Time Series CV: {np.mean(scores_ts):.4f} ± {np.std(scores_ts):.4f}")
    
    return scores_ts

scores_ts = time_series_cv_demo()
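
TimeSeriesSplit grows the training window by default (an expanding window). A brief sketch of the sliding-window variant, obtained by capping the window with the max_train_size parameter:

python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(200).reshape(-1, 1)

# Capping the training window with max_train_size turns the default
# expanding-window scheme into a sliding window of fixed length
tscv_sliding = TimeSeriesSplit(n_splits=5, max_train_size=50)
for i, (train_idx, test_idx) in enumerate(tscv_sliding.split(X_demo)):
    print(f"Fold {i+1}: train [{train_idx[0]}..{train_idx[-1]}] "
          f"({len(train_idx)} samples), test [{test_idx[0]}..{test_idx[-1]}]")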

14.4.2 Group K-Fold Cross-Validation

python
def group_kfold_demo():
    """Demonstrate Group K-Fold Cross-Validation"""
    
    print("Group K-Fold Cross-Validation:")
    print("Ensures that samples from the same group are not in both training and validation sets")
    
    # Create dataset with groups
    np.random.seed(42)
    n_groups = 10
    samples_per_group = 20
    n_samples = n_groups * samples_per_group
    
    # Create group labels
    groups = np.repeat(np.arange(n_groups), samples_per_group)
    
    # Create features and target with group structure
    X = np.random.randn(n_samples, 5)
    y = np.random.randint(0, 2, n_samples)
    
    # Add group effect to features
    for g in range(n_groups):
        group_mask = groups == g
        X[group_mask] += np.random.randn(1, 5) * 0.5
    
    print(f"Dataset size: {X.shape}")
    print(f"Number of groups: {n_groups}")
    print(f"Samples per group: {samples_per_group}")
    
    # Compare regular K-fold and group K-fold
    k = 5
    
    # Regular K-fold (may put same group in both train and test)
    kfold = KFold(n_splits=k, shuffle=True, random_state=42)
    
    # Group K-fold (ensures groups don't leak)
    group_kfold = GroupKFold(n_splits=k)
    
    clf = LogisticRegression(random_state=42, max_iter=1000)
    
    # Check for group leakage in regular K-fold
    print(f"\nChecking for group leakage in regular K-fold:")
    leakage_count = 0
    for fold, (train_idx, test_idx) in enumerate(kfold.split(X, y, groups)):
        train_groups = set(groups[train_idx])
        test_groups = set(groups[test_idx])
        overlap = train_groups & test_groups
        if overlap:
            leakage_count += len(overlap)
            print(f"Fold {fold+1}: Group leakage detected - {len(overlap)} groups appear in both train and test")
    
    if leakage_count == 0:
        print("No group leakage detected")
    
    # Group K-fold ensures no leakage
    print(f"\nGroup K-fold ensures no group appears in both train and test sets")
    
    # Perform cross-validation
    scores_regular = cross_val_score(clf, X, y, cv=kfold, scoring='accuracy')
    scores_group = cross_val_score(clf, X, y, cv=group_kfold, groups=groups, scoring='accuracy')
    
    print(f"\nPerformance Comparison:")
    print(f"Regular K-fold: {np.mean(scores_regular):.4f} ± {np.std(scores_regular):.4f}")
    print(f"Group K-fold: {np.mean(scores_group):.4f} ± {np.std(scores_group):.4f}")
    
    # Visualize
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    fig.suptitle('Regular K-Fold vs Group K-Fold Cross-Validation', fontsize=16)
    
    # Show group structure
    group_colors = plt.cm.tab10(np.arange(n_groups))
    for g in range(n_groups):
        mask = groups == g
        axes[0].scatter(X[mask, 0], X[mask, 1], c=[group_colors[g]], label=f'Group {g}' if g < 5 else "", alpha=0.7)
    
    axes[0].set_title('Data Points Colored by Group')
    axes[0].set_xlabel('Feature 1')
    axes[0].set_ylabel('Feature 2')
    axes[0].legend(loc='upper right', fontsize=8)
    axes[0].grid(True, alpha=0.3)
    
    # Performance comparison
    methods = ['Regular K-Fold', 'Group K-Fold']
    means = [np.mean(scores_regular), np.mean(scores_group)]
    stds = [np.std(scores_regular), np.std(scores_group)]
    
    axes[1].bar(methods, means, yerr=stds, capsize=5, color=['skyblue', 'lightcoral'], alpha=0.7)
    axes[1].set_title('Accuracy Comparison')
    axes[1].set_ylabel('Accuracy')
    axes[1].grid(True, alpha=0.3)
    axes[1].set_ylim(0.3, 0.8)
    
    plt.tight_layout()
    plt.show()
    
    return scores_regular, scores_group

scores_regular, scores_group = group_kfold_demo()
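
When the data has both group structure and class imbalance, StratifiedGroupKFold (available in scikit-learn 1.0 and later) keeps groups intact while approximately preserving class proportions. A minimal sketch on synthetic grouped data:

python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

rng = np.random.RandomState(42)
X_g = rng.randn(200, 5)
y_g = rng.randint(0, 2, 200)
groups_g = np.repeat(np.arange(10), 20)  # 10 groups of 20 samples

sgkf = StratifiedGroupKFold(n_splits=5)
for train_idx, test_idx in sgkf.split(X_g, y_g, groups_g):
    # No group appears in both sets, and class proportions in each fold
    # stay close to the overall distribution
    assert not set(groups_g[train_idx]) & set(groups_g[test_idx])
print("All folds keep groups intact.")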

14.4.3 Shuffle Split Cross-Validation

python
def shuffle_split_demo():
    """Demonstrate Shuffle Split Cross-Validation"""
    
    print("Shuffle Split Cross-Validation:")
    print("Randomly shuffle data and split into train/test sets multiple times")
    
    # Load data
    X, y = load_iris(return_X_y=True)
    
    print(f"Dataset size: {X.shape}")
    
    # ShuffleSplit
    ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
    
    # Stratified ShuffleSplit
    stratified_ss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
    
    clf = LogisticRegression(random_state=42, max_iter=1000)
    
    # Perform cross-validation
    scores_shuffle = cross_val_score(clf, X, y, cv=ss, scoring='accuracy')
    scores_stratified_shuffle = cross_val_score(clf, X, y, cv=stratified_ss, scoring='accuracy')
    
    print(f"\nShuffle Split Results:")
    print(f"Mean Accuracy: {np.mean(scores_shuffle):.4f} ± {np.std(scores_shuffle):.4f}")
    print(f"Individual scores: {scores_shuffle}")
    
    print(f"\nStratified Shuffle Split Results:")
    print(f"Mean Accuracy: {np.mean(scores_stratified_shuffle):.4f} ± {np.std(scores_stratified_shuffle):.4f}")
    print(f"Individual scores: {scores_stratified_shuffle}")
    
    # Visualize
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    fig.suptitle('Shuffle Split Cross-Validation', fontsize=16)
    
    # Score distribution
    axes[0].hist(scores_shuffle, bins=10, alpha=0.7, color='skyblue', edgecolor='black')
    axes[0].axvline(x=np.mean(scores_shuffle), color='red', linestyle='--', label=f'Mean: {np.mean(scores_shuffle):.3f}')
    axes[0].set_title('Shuffle Split Score Distribution')
    axes[0].set_xlabel('Accuracy')
    axes[0].set_ylabel('Frequency')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Show different splits
    for i, (train_idx, test_idx) in enumerate(ss.split(X)):
        if i < 5:  # Show first 5 splits
            axes[1].scatter(train_idx, [i] * len(train_idx), c='blue', alpha=0.3, s=5)
            axes[1].scatter(test_idx, [i] * len(test_idx), c='red', alpha=0.3, s=5)
    
    axes[1].set_title('First 5 Shuffle Splits')
    axes[1].set_xlabel('Sample Index')
    axes[1].set_ylabel('Split Number')
    axes[1].legend(handles=[plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='blue', markersize=10, label='Train'),
                       plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='red', markersize=10, label='Test')],
                      loc='upper right')
    axes[1].grid(True, alpha=0.3)
    
    # Comparison
    axes[2].boxplot([scores_shuffle, scores_stratified_shuffle], labels=['Shuffle Split', 'Stratified Shuffle'])
    axes[2].set_title('Method Comparison')
    axes[2].set_ylabel('Accuracy')
    axes[2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Advantages of Shuffle Split
    print("\nShuffle Split Advantages:")
    print("✓ Control over train/test size")
    print("✓ Can set number of iterations independently")
    print("✓ Some samples may be used multiple times")
    print("✓ Can be faster than K-fold for large datasets")
    
    return scores_shuffle, scores_stratified_shuffle

scores_shuffle, scores_stratified_shuffle = shuffle_split_demo()

14.5 Cross-Validation for Model Selection

14.5.1 Comparing Multiple Models

python
def compare_models_cv():
    """Compare multiple models using cross-validation"""
    
    print("Comparing Multiple Models with Cross-Validation:")
    
    # Load data
    X, y = load_iris(return_X_y=True)
    
    print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
    
    # Define models to compare
    models = {
        'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'SVM (RBF)': SVC(kernel='rbf', random_state=42),
        'KNN': KNeighborsClassifier(n_neighbors=5)
    }
    
    # Cross-validation settings
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    
    # Evaluate each model
    results = {}
    
    print(f"\nModel Comparison Results:")
    print("-" * 60)
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    fig.suptitle('Model Comparison with Cross-Validation', fontsize=16)
    
    for i, (name, model) in enumerate(models.items()):
        # Perform cross-validation (accuracy scoring)
        scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
        
        results[name] = {
            'mean': np.mean(scores),
            'std': np.std(scores),
            'scores': scores
        }
        
        print(f"{name}:")
        print(f"  Accuracy: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
        print(f"  Individual folds: {scores}")
        
        # Visualize
        row = i // 2
        col = i % 2
        axes[row, col].bar(range(1, 6), scores, alpha=0.7, color=plt.cm.tab10(i))
        axes[row, col].axhline(y=np.mean(scores), color='red', linestyle='--', 
                              label=f'Mean: {np.mean(scores):.3f}')
        axes[row, col].set_title(name)
        axes[row, col].set_xlabel('Fold')
        axes[row, col].set_ylabel('Accuracy')
        axes[row, col].legend()
        axes[row, col].grid(True, alpha=0.3)
        axes[row, col].set_ylim(0.8, 1.05)
    
    plt.tight_layout()
    plt.show()
    
    # Summary comparison
    print("\n" + "=" * 60)
    print("Summary:")
    print("-" * 60)
    
    # Sort by mean accuracy
    sorted_results = sorted(results.items(), key=lambda x: x[1]['mean'], reverse=True)
    
    for rank, (name, result) in enumerate(sorted_results, 1):
        print(f"{rank}. {name}: {result['mean']:.4f} ± {result['std']:.4f}")
    
    # Visualize summary
    names = [r[0] for r in sorted_results]
    means = [r[1]['mean'] for r in sorted_results]
    stds = [r[1]['std'] for r in sorted_results]
    
    plt.figure(figsize=(10, 6))
    bars = plt.barh(names, means, xerr=stds, capsize=5, color='skyblue', alpha=0.7)
    plt.xlabel('Accuracy')
    plt.title('Model Comparison Summary')
    plt.xlim(0.8, 1.0)
    plt.grid(True, alpha=0.3)
    
    # Highlight best model
    bars[0].set_color('gold')
    
    plt.tight_layout()
    plt.show()
    
    return results

results = compare_models_cv()

14.5.2 Hyperparameter Tuning with Cross-Validation

python
def hyperparameter_tuning_cv():
    """Demonstrate hyperparameter tuning with cross-validation"""
    
    print("Hyperparameter Tuning with Cross-Validation:")
    
    # Load data
    X, y = load_iris(return_X_y=True)
    
    print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
    
    # Define parameter grid for SVM
    param_grid = {
        'C': [0.1, 1, 10, 100],
        'gamma': ['scale', 'auto', 0.01, 0.1, 1],
        'kernel': ['rbf', 'linear']
    }
    
    # Create SVM classifier
    svc = SVC(random_state=42)
    
    # Grid search with cross-validation
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    
    grid_search = GridSearchCV(
        svc, param_grid, cv=cv, scoring='accuracy', 
        n_jobs=-1, verbose=1
    )
    
    grid_search.fit(X, y)
    
    print(f"\nBest Parameters: {grid_search.best_params_}")
    print(f"Best Cross-Validation Score: {grid_search.best_score_:.4f}")
    
    # Show top 5 parameter combinations
    print(f"\nTop 5 Parameter Combinations:")
    results_df = pd.DataFrame(grid_search.cv_results_)
    results_df = results_df.sort_values('rank_test_score')
    
    for i, (_, row) in enumerate(results_df.head(5).iterrows()):
        print(f"{i+1}. Score: {row['mean_test_score']:.4f} ± {row['std_test_score']:.4f}")
        print(f"   Parameters: {row['params']}")
    
    # Visualize results
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    fig.suptitle('Hyperparameter Tuning Results', fontsize=16)
    
    # Heatmap for RBF kernel
    rbf_results = results_df[results_df['param_kernel'] == 'rbf']
    pivot_table = rbf_results.pivot_table(
        values='mean_test_score',
        index='param_gamma',
        columns='param_C'
    )
    
    sns.heatmap(pivot_table, annot=True, cmap='viridis', fmt='.4f', ax=axes[0])
    axes[0].set_title('SVM RBF Kernel: C vs Gamma')
    axes[0].set_xlabel('C')
    axes[0].set_ylabel('Gamma')
    
    # Performance comparison
    axes[1].plot(range(len(results_df)), results_df['mean_test_score'].values, 'o-', alpha=0.7)
    axes[1].fill_between(range(len(results_df)),
                        results_df['mean_test_score'] - results_df['std_test_score'],
                        results_df['mean_test_score'] + results_df['std_test_score'],
                        alpha=0.2)
    axes[1].set_title('All Parameter Combinations Performance')
    axes[1].set_xlabel('Parameter Combination Index')
    axes[1].set_ylabel('Accuracy')
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return grid_search

grid_search = hyperparameter_tuning_cv()

14.5.3 Learning Curves

python
def learning_curves_cv():
    """Analyze learning curves using cross-validation"""
    
    print("Learning Curves Analysis:")
    
    # Load data
    X, y = load_iris(return_X_y=True)
    
    print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
    
    # Define models with different complexities
    models = {
        'Simple (KNN, k=20)': KNeighborsClassifier(n_neighbors=20),
        'Medium (KNN, k=5)': KNeighborsClassifier(n_neighbors=5),
        'Complex (KNN, k=1)': KNeighborsClassifier(n_neighbors=1)
    }
    
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    fig.suptitle('Learning Curves for Different Model Complexities', fontsize=16)
    
    train_sizes = np.linspace(0.1, 1.0, 10)
    
    for i, (name, model) in enumerate(models.items()):
        train_sizes_abs, train_scores, val_scores = learning_curve(
            model, X, y, cv=cv, n_jobs=-1,
            train_sizes=train_sizes, scoring='accuracy'
        )
        
        train_mean = np.mean(train_scores, axis=1)
        train_std = np.std(train_scores, axis=1)
        val_mean = np.mean(val_scores, axis=1)
        val_std = np.std(val_scores, axis=1)
        
        axes[i].plot(train_sizes_abs, train_mean, 'o-', color='blue', label='Training Score')
        axes[i].fill_between(train_sizes_abs, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
        
        axes[i].plot(train_sizes_abs, val_mean, 'o-', color='red', label='Validation Score')
        axes[i].fill_between(train_sizes_abs, val_mean - val_std, val_mean + val_std, alpha=0.1, color='red')
        
        axes[i].set_title(name)
        axes[i].set_xlabel('Training Set Size')
        axes[i].set_ylabel('Accuracy')
        axes[i].legend(loc='lower right')
        axes[i].grid(True, alpha=0.3)
        axes[i].set_ylim(0.5, 1.05)
    
    plt.tight_layout()
    plt.show()
    
    # Interpretation
    print("\nLearning Curve Interpretation:")
    print("High Training Score + Low Validation Score → Overfitting")
    print("Both Scores Low → Underfitting")
    print("Training Score Improves with More Data → Model Benefits from More Data")
    
    return None

learning_curves_cv()
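
learning_curve varies the amount of training data; its companion validation_curve (imported in Section 14.2 but not used above) varies a single hyperparameter instead, making the underfitting-to-overfitting transition directly visible. A sketch for the KNN models above, reusing the chapter's imports:

python
X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Vary n_neighbors from complex (k=1) to simple (k=20) and record
# training vs. validation accuracy at each setting
param_range = np.arange(1, 21)
train_scores, val_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name='n_neighbors', param_range=param_range,
    cv=cv, scoring='accuracy'
)

plt.figure(figsize=(8, 5))
plt.plot(param_range, train_scores.mean(axis=1), 'o-', color='blue', label='Training Score')
plt.plot(param_range, val_scores.mean(axis=1), 'o-', color='red', label='Validation Score')
plt.xlabel('n_neighbors (larger = simpler model)')
plt.ylabel('Accuracy')
plt.title('Validation Curve for KNN')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()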

14.6 Cross-Validation Metrics

14.6.1 Multiple Metrics Evaluation

python
def multiple_metrics_cv():
    """Evaluate models using multiple metrics with cross-validation"""
    
    print("Cross-Validation with Multiple Metrics:")
    
    # Load data
    X, y = load_iris(return_X_y=True)
    
    print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
    
    # Use cross_validate for multiple metrics
    clf = LogisticRegression(random_state=42, max_iter=1000)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    
    scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
    
    results = cross_validate(clf, X, y, cv=cv, scoring=scoring, return_train_score=True)
    
    print(f"\nCross-Validation Results:")
    print("-" * 60)
    
    # Extract results
    metrics = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
    
    for metric in metrics:
        test_key = f'test_{metric}'
        train_key = f'train_{metric}'
        
        test_scores = results[test_key]
        train_scores = results[train_key]
        
        print(f"{metric}:")
        print(f"  Test: {np.mean(test_scores):.4f} ± {np.std(test_scores):.4f}")
        print(f"  Train: {np.mean(train_scores):.4f} ± {np.std(train_scores):.4f}")
        print(f"  Gap: {np.mean(train_scores) - np.mean(test_scores):.4f}")
    
    # Visualize
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    fig.suptitle('Multiple Metrics Cross-Validation Results', fontsize=16)
    
    for i, metric in enumerate(metrics):
        row = i // 2
        col = i % 2
        
        test_key = f'test_{metric}'
        train_key = f'train_{metric}'
        
        test_scores = results[test_key]
        train_scores = results[train_key]
        
        x = np.arange(5)
        width = 0.35
        
        axes[row, col].bar(x - width/2, train_scores, width, label='Training', color='skyblue')
        axes[row, col].bar(x + width/2, test_scores, width, label='Test', color='lightcoral')
        
        axes[row, col].set_title(metric.replace('_', ' ').title())
        axes[row, col].set_xlabel('Fold')
        axes[row, col].set_ylabel('Score')
        axes[row, col].set_xticks(x)
        axes[row, col].set_xticklabels([f'Fold {i+1}' for i in range(5)])
        axes[row, col].legend()
        axes[row, col].grid(True, alpha=0.3)
        axes[row, col].set_ylim(0.5, 1.1)
    
    plt.tight_layout()
    plt.show()
    
    return results

results = multiple_metrics_cv()
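
Beyond aggregate scores, cross_val_predict collects one out-of-fold prediction per sample, which is handy for building a confusion matrix from validation-time predictions only. A minimal sketch (note that scikit-learn cautions against reducing these pooled predictions to a single performance number):

python
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=42, max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each sample is predicted by the model that did NOT see it during
# training, so the pooled confusion matrix reflects validation behavior
y_oof = cross_val_predict(clf, X, y, cv=cv)
print(confusion_matrix(y, y_oof))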

14.6.2 Confidence Intervals

python
def confidence_intervals_cv():
    """Calculate confidence intervals for cross-validation results"""
    
    print("Confidence Intervals for Cross-Validation:")
    
    # Load data
    X, y = load_iris(return_X_y=True)
    
    # Perform cross-validation
    clf = LogisticRegression(random_state=42, max_iter=1000)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
    
    print(f"Cross-validation scores: {scores}")
    print(f"Number of folds: {len(scores)}")
    
    # Calculate statistics (sample standard deviation, ddof=1, since we
    # estimate the spread from a small sample of fold scores)
    mean_score = np.mean(scores)
    std_score = np.std(scores, ddof=1)
    n = len(scores)
    
    # Calculate 95% confidence interval using t-distribution
    from scipy import stats
    
    confidence = 0.95
    t_value = stats.t.ppf((1 + confidence) / 2, n - 1)
    margin_of_error = t_value * std_score / np.sqrt(n)
    
    ci_lower = mean_score - margin_of_error
    ci_upper = mean_score + margin_of_error
    
    print(f"\nResults:")
    print(f"Mean Accuracy: {mean_score:.4f}")
    print(f"Standard Deviation: {std_score:.4f}")
    print(f"95% Confidence Interval: [{ci_lower:.4f}, {ci_upper:.4f}]")
    
    # Visualize
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    fig.suptitle('Cross-Validation with Confidence Intervals', fontsize=16)
    
    # Score distribution with CI
    axes[0].hist(scores, bins=10, alpha=0.7, color='skyblue', edgecolor='black')
    axes[0].axvline(x=mean_score, color='red', linestyle='-', linewidth=2, label=f'Mean: {mean_score:.3f}')
    axes[0].axvline(x=ci_lower, color='orange', linestyle='--', linewidth=2, label=f'95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]')
    axes[0].axvline(x=ci_upper, color='orange', linestyle='--', linewidth=2)
    axes[0].set_title('Score Distribution')
    axes[0].set_xlabel('Accuracy')
    axes[0].set_ylabel('Frequency')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Individual fold scores with error bars
    fold_numbers = np.arange(1, n + 1)
    axes[1].errorbar(fold_numbers, scores, yerr=std_score, fmt='o', capsize=5, color='lightcoral', markersize=8)
    axes[1].axhline(y=mean_score, color='blue', linestyle='-', linewidth=2, label=f'Mean: {mean_score:.3f}')
    axes[1].fill_between([0.5, n + 0.5], ci_lower, ci_upper, alpha=0.2, color='blue', label='95% CI')
    axes[1].set_title('Fold Scores with Confidence Interval')
    axes[1].set_xlabel('Fold Number')
    axes[1].set_ylabel('Accuracy')
    axes[1].set_xlim(0.5, n + 0.5)
    axes[1].set_ylim(mean_score - 0.1, mean_score + 0.1)
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return scores, (ci_lower, ci_upper)

scores, ci = confidence_intervals_cv()

14.7 Best Practices and Tips

14.7.1 Choosing the Right Cross-Validation Method

python
def cv_method_selection_guide():
    """Guide for choosing the right cross-validation method"""
    
    print("Cross-Validation Method Selection Guide:")
    print("=" * 60)
    
    guidelines = {
        "Small Datasets (< 1000 samples)": "Use LOOCV or stratified K-fold (k=5 or 10)",
        "Large Datasets (> 10000 samples)": "Use K-fold (k=5) or hold-out validation",
        "Imbalanced Classification": "Use stratified K-fold",
        "Time Series Data": "Use TimeSeriesSplit",
        "Groups in Data": "Use GroupKFold",
        "Quick Evaluation": "Use ShuffleSplit with fewer iterations",
        "Stable Results Needed": "Use RepeatedKFold",
        "Feature Selection": "Use nested cross-validation"
    }
    
    print("\nGuidelines:")
    print("-" * 60)
    for scenario, recommendation in guidelines.items():
        print(f"{scenario}:")
        print(f"  → {recommendation}")
        print()
    
    # Summary table
    print("\nMethod Comparison Summary:")
    print("-" * 60)
    comparison = pd.DataFrame({
        'Method': ['KFold', 'StratifiedKFold', 'LOOCV', 'TimeSeriesSplit', 'GroupKFold', 'ShuffleSplit'],
        'Use Case': ['General', 'Classification with imbalance', 'Small datasets', 'Time series', 'Grouped data', 'Large datasets'],
        'Bias': ['Low', 'Low', 'Very Low', 'Medium', 'Low', 'Medium'],
        'Variance': ['Medium', 'Medium', 'High', 'Medium', 'Medium', 'Low'],
        'Speed': ['Fast', 'Fast', 'Slow', 'Fast', 'Fast', 'Very Fast']
    })
    
    print(comparison.to_string(index=False))
    
    return guidelines

guidelines = cv_method_selection_guide()
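
Several guidelines above reference nested cross-validation, which this chapter has not yet shown in code. A minimal sketch on the iris data: the inner loop tunes hyperparameters, the outer loop scores the whole tuning procedure, so the final estimate is not biased by the search.

python
X, y = load_iris(return_X_y=True)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score treats the entire GridSearchCV object as the estimator,
# so tuning happens independently inside each outer training fold
tuned_svc = GridSearchCV(SVC(random_state=42), {'C': [0.1, 1, 10]}, cv=inner_cv)
nested_scores = cross_val_score(tuned_svc, X, y, cv=outer_cv)
print(f"Nested CV accuracy: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")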

14.7.2 Common Pitfalls and Solutions

python
def cv_pitfalls_and_solutions():
    """Common pitfalls in cross-validation and their solutions"""
    
    print("Common Cross-Validation Pitfalls and Solutions:")
    print("=" * 60)
    
    pitfalls = [
        {
            'pitfall': 'Data Leakage in Preprocessing',
            'description': 'Applying scaling or other transformations before splitting data',
            'solution': 'Use Pipeline to encapsulate preprocessing and model',
            'example': 'Always use Pipeline with cross_val_score or make_pipeline'
        },
        {
            'pitfall': 'Feature Selection Leakage',
            'description': 'Selecting features using all data before cross-validation',
            'solution': 'Perform feature selection within each fold',
            'example': 'Use sklearn.feature_selection.SelectFromModel inside Pipeline'
        },
        {
            'pitfall': 'Improper Stratification',
            'description': 'Not using stratified split for classification',
            'solution': 'Use StratifiedKFold for classification tasks',
            'example': 'Always use StratifiedKFold for classification'
        },
        {
            'pitfall': 'Not Accounting for Groups',
            'description': 'Having related samples in both train and test',
            'solution': 'Use GroupKFold when data has natural groups',
            'example': 'Use GroupKFold for patient data, time series, etc.'
        },
        {
            'pitfall': 'Incorrect Metric Selection',
            'description': 'Using inappropriate evaluation metrics',
            'solution': 'Choose metrics based on problem type and business goals',
            'example': 'Use F1 for imbalanced, ROC-AUC for ranking'
        },
        {
            'pitfall': 'Data Leakage in Hyperparameter Tuning',
            'description': 'Tuning hyperparameters on test set',
            'solution': 'Use nested cross-validation',
            'example': 'Inner loop for tuning, outer loop for evaluation'
        }
    ]
    
    for i, item in enumerate(pitfalls, 1):
        print(f"\n{i}. {item['pitfall']}")
        print(f"   Problem: {item['description']}")
        print(f"   Solution: {item['solution']}")
        print(f"   Example: {item['example']}")
    
    return pitfalls

pitfalls = cv_pitfalls_and_solutions()
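
The first pitfall deserves a concrete illustration. The sketch below (reusing the chapter's imports) contrasts the leaky pattern, where the scaler sees all data before splitting, with the safe Pipeline pattern; on this dataset the difference may be small, but the leaky pattern is biased in principle.

python
X, y = load_breast_cancer(return_X_y=True)

# Leaky: the scaler is fit on ALL data, so statistics of the validation
# folds leak into training and scores can be optimistically biased
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_scaled, y, cv=5)

# Safe: inside cross_val_score, the Pipeline re-fits the scaler on each
# fold's training portion only
safe_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])
safe_scores = cross_val_score(safe_pipeline, X, y, cv=5)

print(f"Leaky scaling: {leaky_scores.mean():.4f}")
print(f"Pipeline:      {safe_scores.mean():.4f}")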

14.8 Practical Application Examples

14.8.1 Complete Model Evaluation Pipeline

python
def complete_model_evaluation():
    """Demonstrate a complete model evaluation pipeline"""
    
    print("Complete Model Evaluation Pipeline:")
    
    # Step 1: Load and prepare data
    print("\nStep 1: Loading Data...")
    X, y = load_breast_cancer(return_X_y=True)
    print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
    print(f"Class distribution: {np.bincount(y)}")
    
    # Step 2: Split data
    print("\nStep 2: Splitting Data...")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    print(f"Training set: {X_train.shape[0]} samples")
    print(f"Test set: {X_test.shape[0]} samples")
    
    # Step 3: Create pipeline
    print("\nStep 3: Creating Pipeline...")
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(random_state=42, max_iter=1000))
    ])
    
    # Step 4: Cross-validation on training set
    print("\nStep 4: Cross-Validation on Training Set...")
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    
    cv_results = cross_validate(
        pipeline, X_train, y_train,
        cv=cv,
        scoring=['accuracy', 'precision', 'recall', 'f1'],
        return_train_score=True
    )
    
    print("Cross-Validation Results:")
    for metric in ['accuracy', 'precision', 'recall', 'f1']:
        test_key = f'test_{metric}'
        train_key = f'train_{metric}'
        print(f"  {metric}: {np.mean(cv_results[test_key]):.4f} ± {np.std(cv_results[test_key])}")
    
    # Step 5: Hyperparameter tuning
    print("\nStep 5: Hyperparameter Tuning...")
    param_grid = {
        'classifier__C': [0.01, 0.1, 1, 10],
        'classifier__penalty': ['l1', 'l2'],
        # liblinear supports both l1 and l2; the default lbfgs solver
        # would raise an error for penalty='l1'
        'classifier__solver': ['liblinear']
    }
    
    grid_search = GridSearchCV(
        pipeline, param_grid, cv=cv, scoring='accuracy', n_jobs=-1
    )
    grid_search.fit(X_train, y_train)
    
    print(f"Best Parameters: {grid_search.best_params_}")
    print(f"Best CV Score: {grid_search.best_score_:.4f}")
    
    # Step 6: Final evaluation on test set
    print("\nStep 6: Final Evaluation on Test Set...")
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)
    y_pred_proba = best_model.predict_proba(X_test)[:, 1]
    
    print(f"Test Set Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(f"Test Set Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Test Set Recall: {recall_score(y_test, y_pred):.4f}")
    print(f"Test Set F1: {f1_score(y_test, y_pred):.4f}")
    
    # Visualize results
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    fig.suptitle('Complete Model Evaluation Pipeline', fontsize=16)
    
    # Cross-validation results
    metrics = ['accuracy', 'precision', 'recall', 'f1']
    cv_means = [np.mean(cv_results[f'test_{m}']) for m in metrics]
    cv_stds = [np.std(cv_results[f'test_{m}']) for m in metrics]
    
    axes[0].bar(metrics, cv_means, yerr=cv_stds, capsize=5, color='skyblue', alpha=0.7)
    axes[0].set_title('Cross-Validation Results')
    axes[0].set_ylabel('Score')
    axes[0].grid(True, alpha=0.3)
    axes[0].set_ylim(0.8, 1.0)
    
    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1])
    axes[1].set_title('Confusion Matrix')
    axes[1].set_xlabel('Predicted')
    axes[1].set_ylabel('Actual')
    
    # ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    axes[2].plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
    axes[2].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    axes[2].set_title('ROC Curve')
    axes[2].set_xlabel('False Positive Rate')
    axes[2].set_ylabel('True Positive Rate')
    axes[2].legend(loc="lower right")
    axes[2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return grid_search, best_model

grid_search, best_model = complete_model_evaluation()

14.9 Exercises

Exercise 1: Basic Cross-Validation

  1. Use the iris dataset to perform K-fold cross-validation with different K values
  2. Compare the results and analyze how K affects the mean and variance of scores
  3. Visualize the cross-validation process

Exercise 2: Stratified vs Regular

  1. Create an imbalanced dataset with 3 classes
  2. Compare stratified K-fold with regular K-fold
  3. Analyze the class distribution in each fold

Exercise 3: Time Series CV

  1. Create a simulated time series dataset with trend and seasonality
  2. Use TimeSeriesSplit for cross-validation
  3. Compare with regular K-fold

Exercise 4: Model Selection

  1. Compare 5 different classifiers using cross-validation
  2. Perform hyperparameter tuning with GridSearchCV
  3. Build a complete evaluation pipeline

Exercise 5: Advanced Topics

  1. Implement nested cross-validation for feature selection
  2. Calculate confidence intervals for cross-validation results
  3. Analyze learning curves to diagnose bias and variance

14.10 Summary

In this chapter, we took an in-depth look at cross-validation:

Core Concepts

  • Cross-Validation Principles: Why we need it, types, advantages
  • K-Fold Methods: Basic K-fold, Stratified K-fold, Repeated K-fold
  • Special Methods: LOOCV, TimeSeriesSplit, GroupKFold, ShuffleSplit

Main Techniques

  • Model Evaluation: Multiple models, multiple metrics
  • Hyperparameter Tuning: GridSearchCV, nested CV
  • Visualization: Learning curves, performance plots

Practical Skills

  • Method Selection: Choosing the right CV method
  • Data Leakage Prevention: Pipelines, proper splitting
  • Result Interpretation: Confidence intervals, bias-variance

Key Points

  • Cross-validation provides more reliable performance estimates
  • Choose the appropriate CV method based on your data and problem
  • Always use pipelines to prevent data leakage
  • Use multiple metrics for comprehensive evaluation

14.11 Next Steps

You have now mastered cross-validation techniques! In the next chapter, Support Vector Machines, we will study another powerful classification algorithm that works well with high-dimensional data.


Chapter Key Points Review:

  • ✓ Understood the principles and importance of cross-validation
  • ✓ Mastered various cross-validation methods (K-fold, LOOCV, TimeSeries, etc.)
  • ✓ Learned to compare multiple models and select the best one
  • ✓ Understood hyperparameter tuning with cross-validation
  • ✓ Learned to analyze learning curves and diagnose model issues
  • ✓ Able to build complete model evaluation pipelines

Content is for learning and research only.