Chapter 14: Cross-Validation
Cross-validation is an essential technique for evaluating model performance in machine learning. By splitting the data multiple times, it produces more reliable performance estimates and helps us select the best model and parameters. This chapter introduces the main cross-validation methods and their applications in detail.
14.1 What is Cross-Validation?
Cross-Validation is a statistical method used to evaluate the generalization ability of machine learning models. It evaluates model performance by dividing data into multiple subsets and using different subsets as training and validation sets in turn.
14.1.1 Why Do We Need Cross-Validation?
- Avoid Misleading Estimates: a single train/test split can produce an overly optimistic or pessimistic result
- Fully Utilize Data: every sample is used for both training and validation across the folds
- Obtain Stable Estimates: averaging over multiple validation rounds gives a more reliable result
- Model Selection: Compare performance of different models and parameters
14.1.2 Advantages of Cross-Validation
- More Reliable Performance Estimates: reduces the impact of a single random split
- Better Model Selection: decisions rest on results from multiple validation rounds
- High Data Utilization: no data is permanently set aside for validation
- Overfitting Detection: the gap between training and validation performance exposes overfitting
14.1.3 Types of Cross-Validation
- K-Fold Cross-Validation: Most commonly used method
- Leave-One-Out Cross-Validation: Leave one sample for validation each time
- Stratified Cross-Validation: Maintain class proportions
- Time Series Cross-Validation: Consider time order
- Group Cross-Validation: Consider data grouping
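Before examining each of these methods in detail, the core API is worth seeing once in isolation. The following minimal sketch scores a classifier with 5-fold cross-validation in scikit-learn; it is self-contained and uses only calls that are imported again in 14.2:
# Minimal sketch: the basic cross-validation API
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)
# cv=5 runs 5-fold CV: train on four folds, score the held-out fold, repeat five times
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(scores)                          # one accuracy value per fold
print(scores.mean(), scores.std())    # aggregate estimate and its spread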
14.2 Preparing Environment and Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, load_wine, load_breast_cancer, make_classification
from sklearn.model_selection import (
cross_val_score, cross_validate, KFold, StratifiedKFold,
LeaveOneOut, LeavePOut, ShuffleSplit, StratifiedShuffleSplit,
TimeSeriesSplit, GroupKFold, train_test_split, GridSearchCV,
learning_curve, validation_curve
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
confusion_matrix, roc_curve, auc  # confusion_matrix, roc_curve, auc are used in 14.8
)
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')
# Set random seed
np.random.seed(42)
# Set plot style
plt.style.use('seaborn-v0_8')
plt.rcParams['axes.unicode_minus'] = False
14.3 K-Fold Cross-Validation
14.3.1 Basic K-Fold Cross-Validation
def demonstrate_k_fold_cv():
"""Demonstrate the basic principles of K-fold cross-validation"""
print("K-fold Cross-Validation Basic Principles:")
print("=" * 25)
# Load iris dataset
iris = load_iris()
X, y = iris.data, iris.target
print(f"Dataset size: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")
# Create classifier
clf = LogisticRegression(random_state=42, max_iter=1000)
# Different K values
k_values = [3, 5, 10]
print(f"\nCross-validation results for different K values:")
print("K Value\tMean Accuracy\tStd Dev\t\tFold Scores")
print("-" * 60)
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('K-Fold Cross-Validation Demonstration', fontsize=16)
for i, k in enumerate(k_values):
# K-fold cross-validation
kfold = KFold(n_splits=k, shuffle=True, random_state=42)
cv_scores = cross_val_score(clf, X, y, cv=kfold, scoring='accuracy')
mean_score = np.mean(cv_scores)
std_score = np.std(cv_scores)
print(f"{k}\t{mean_score:.4f}\t\t{std_score:.4f}\t\t{cv_scores}")
# Visualize scores for each fold
if i < 3:
row = i // 2
col = i % 2
axes[row, col].bar(range(1, k+1), cv_scores, alpha=0.7, color='skyblue')
axes[row, col].axhline(y=mean_score, color='red', linestyle='--',
label=f'Mean: {mean_score:.3f}')
axes[row, col].set_title(f'{k}-Fold Cross-Validation')
axes[row, col].set_xlabel('Fold Number')
axes[row, col].set_ylabel('Accuracy')
axes[row, col].legend()
axes[row, col].grid(True, alpha=0.3)
axes[row, col].set_ylim(0.8, 1.0)
# Visualize data splitting process
axes[1, 1].remove()
ax_split = fig.add_subplot(2, 2, 4)
# Demonstrate 5-fold cross-validation splitting
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)
fold_colors = ['red', 'blue', 'green', 'orange', 'purple']
y_pos = 0
for fold, (train_idx, val_idx) in enumerate(kfold_demo.split(X)):
# Draw training set
ax_split.barh(y_pos, len(train_idx), left=0, height=0.8,
color=fold_colors[fold], alpha=0.3, label=f'Fold{fold+1} Training Set' if fold == 0 else "")
# Draw validation set
ax_split.barh(y_pos, len(val_idx), left=len(train_idx), height=0.8,
color=fold_colors[fold], alpha=0.8, label=f'Fold{fold+1} Validation Set' if fold == 0 else "")
y_pos += 1
ax_split.set_title('5-Fold Cross-Validation Data Splitting')
ax_split.set_xlabel('Sample Index')
ax_split.set_ylabel('Fold Number')
ax_split.set_yticks(range(5))
ax_split.set_yticklabels([f'Fold{i+1}' for i in range(5)])
plt.tight_layout()
plt.show()
return cv_scores
cv_scores = demonstrate_k_fold_cv()
14.3.2 Stratified K-Fold Cross-Validation
def stratified_k_fold_demo():
"""Demonstrate stratified K-fold cross-validation"""
print("Stratified K-fold Cross-Validation:")
print("Maintains the proportion of each class in each fold consistent with the original dataset")
# Create imbalanced dataset
X_imbalanced, y_imbalanced = make_classification(
n_samples=1000, n_features=20, n_informative=10,
n_classes=3, weights=[0.6, 0.3, 0.1], # Imbalanced classes
random_state=42
)
print(f"\nImbalanced Dataset:")
print(f"Total samples: {len(X_imbalanced)}")
unique, counts = np.unique(y_imbalanced, return_counts=True)
for cls, count in zip(unique, counts):
print(f"Class {cls}: {count} samples ({count/len(y_imbalanced)*100:.1f}%)")
# Compare regular K-fold and stratified K-fold
k = 5
# Regular K-fold
kfold = KFold(n_splits=k, shuffle=True, random_state=42)
# Stratified K-fold
stratified_kfold = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
# Analyze class distribution for each fold
print(f"\n{k}-Fold Cross-Validation Class Distribution Comparison:")
print("Method\t\tFold\tClass0\tClass1\tClass2")
print("-" * 50)
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Regular K-Fold vs Stratified K-Fold Cross-Validation', fontsize=16)
# Regular K-fold distribution
fold_distributions_normal = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_imbalanced, y_imbalanced)):
val_y = y_imbalanced[val_idx]
unique_val, counts_val = np.unique(val_y, return_counts=True)
# Ensure all classes have counts
distribution = np.zeros(3)
for cls, count in zip(unique_val, counts_val):
distribution[cls] = count
fold_distributions_normal.append(distribution)
print(f"Regular K-fold\t{fold+1}\t{int(distribution[0])}\t{int(distribution[1])}\t{int(distribution[2])}")
# Stratified K-fold distribution
fold_distributions_stratified = []
for fold, (train_idx, val_idx) in enumerate(stratified_kfold.split(X_imbalanced, y_imbalanced)):
val_y = y_imbalanced[val_idx]
unique_val, counts_val = np.unique(val_y, return_counts=True)
distribution = np.zeros(3)
for cls, count in zip(unique_val, counts_val):
distribution[cls] = count
fold_distributions_stratified.append(distribution)
print(f"Stratified K-fold\t{fold+1}\t{int(distribution[0])}\t{int(distribution[1])}\t{int(distribution[2])}")
# Visualize distribution comparison
fold_distributions_normal = np.array(fold_distributions_normal)
fold_distributions_stratified = np.array(fold_distributions_stratified)
# Regular K-fold
x = np.arange(k)
width = 0.35
axes[0, 0].bar(x - width/2, fold_distributions_normal[:, 0], width, label='Class 0', color='blue')
axes[0, 0].bar(x - width/2, fold_distributions_normal[:, 1], width, bottom=fold_distributions_normal[:, 0], label='Class 1', color='green')
axes[0, 0].bar(x - width/2, fold_distributions_normal[:, 2], width,
bottom=fold_distributions_normal[:, 0] + fold_distributions_normal[:, 1], label='Class 2', color='red')
axes[0, 0].set_title('Regular K-Fold Distribution')
axes[0, 0].set_xlabel('Fold Number')
axes[0, 0].set_ylabel('Sample Count')
axes[0, 0].set_xticks(x)
axes[0, 0].set_xticklabels([f'Fold{i+1}' for i in range(k)])
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Stratified K-fold
axes[0, 1].bar(x - width/2, fold_distributions_stratified[:, 0], width, label='Class 0', color='blue')
axes[0, 1].bar(x - width/2, fold_distributions_stratified[:, 1], width, bottom=fold_distributions_stratified[:, 0], label='Class 1', color='green')
axes[0, 1].bar(x - width/2, fold_distributions_stratified[:, 2], width,
bottom=fold_distributions_stratified[:, 0] + fold_distributions_stratified[:, 1], label='Class 2', color='red')
axes[0, 1].set_title('Stratified K-Fold Distribution')
axes[0, 1].set_xlabel('Fold Number')
axes[0, 1].set_ylabel('Sample Count')
axes[0, 1].set_xticks(x)
axes[0, 1].set_xticklabels([f'Fold{i+1}' for i in range(k)])
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Compare performance
clf = LogisticRegression(random_state=42, max_iter=1000)
scores_normal = cross_val_score(clf, X_imbalanced, y_imbalanced, cv=kfold, scoring='accuracy')
scores_stratified = cross_val_score(clf, X_imbalanced, y_imbalanced, cv=stratified_kfold, scoring='accuracy')
print(f"\nPerformance Comparison:")
print(f"Regular K-fold: {np.mean(scores_normal):.4f} ± {np.std(scores_normal):.4f}")
print(f"Stratified K-fold: {np.mean(scores_stratified):.4f} ± {np.std(scores_stratified):.4f}")
# Visualize performance comparison
methods = ['Regular K-Fold', 'Stratified K-Fold']
means = [np.mean(scores_normal), np.mean(scores_stratified)]
stds = [np.std(scores_normal), np.std(scores_stratified)]
axes[1, 0].bar(methods, means, yerr=stds, capsize=5, color=['skyblue', 'lightcoral'], alpha=0.7)
axes[1, 0].set_title('Accuracy Comparison')
axes[1, 0].set_ylabel('Accuracy')
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].set_ylim(0.6, 0.9)
# Score distribution
axes[1, 1].boxplot([scores_normal, scores_stratified], labels=methods)
axes[1, 1].set_title('Score Distribution')
axes[1, 1].set_ylabel('Accuracy')
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return scores_normal, scores_stratified
scores_normal, scores_stratified = stratified_k_fold_demo()
14.3.3 Leave-One-Out Cross-Validation (LOOCV)
def loocv_demo():
"""Demonstrate Leave-One-Out Cross-Validation"""
print("Leave-One-Out Cross-Validation (LOOCV):")
print("Each time, leave one sample as the validation set, use the remaining samples as the training set")
# Use a smaller dataset for demonstration
X_small, y_small = make_classification(
n_samples=100, n_features=5, n_informative=3,
n_classes=2, random_state=42
)
print(f"Dataset size: {X_small.shape}")
# LOOCV
loo = LeaveOneOut()
clf = LogisticRegression(random_state=42, max_iter=1000)
# Perform LOOCV
scores_loo = cross_val_score(clf, X_small, y_small, cv=loo, scoring='accuracy')
print(f"\nLOOCV Results:")
print(f"Number of iterations: {len(scores_loo)}")
print(f"Mean Accuracy: {np.mean(scores_loo):.4f}")
print(f"Std Dev: {np.std(scores_loo):.4f}")
# Compare with 10-fold
kfold_10 = KFold(n_splits=10, shuffle=True, random_state=42)
scores_10fold = cross_val_score(clf, X_small, y_small, cv=kfold_10, scoring='accuracy')
print(f"\n10-Fold Cross-Validation Comparison:")
print(f"Mean Accuracy: {np.mean(scores_10fold):.4f}")
print(f"Std Dev: {np.std(scores_10fold):.4f}")
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle('LOOCV vs 10-Fold Cross-Validation', fontsize=16)
# LOOCV scores distribution
axes[0].hist(scores_loo, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].axvline(x=np.mean(scores_loo), color='red', linestyle='--', label=f'Mean: {np.mean(scores_loo):.3f}')
axes[0].set_title('LOOCV Score Distribution')
axes[0].set_xlabel('Accuracy')
axes[0].set_ylabel('Frequency')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# 10-fold scores
axes[1].bar(range(1, 11), scores_10fold, alpha=0.7, color='lightcoral')
axes[1].axhline(y=np.mean(scores_10fold), color='blue', linestyle='--', label=f'Mean: {np.mean(scores_10fold):.3f}')
axes[1].set_title('10-Fold Cross-Validation Scores')
axes[1].set_xlabel('Fold Number')
axes[1].set_ylabel('Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim(0.5, 1.0)
# Comparison boxplot
axes[2].boxplot([scores_loo, scores_10fold], labels=['LOOCV', '10-Fold'])
axes[2].set_title('Method Comparison')
axes[2].set_ylabel('Accuracy')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Advantages and disadvantages of LOOCV
print("\nLOOCV Advantages:")
print("✓ Maximum use of data")
print("✓ Least bias")
print("✓ Deterministic result")
print("\nLOOCV Disadvantages:")
print("✗ Computationally expensive")
print("✗ High variance in estimates")
print("✗ Cannot be used with some models")
return scores_loo, scores_10fold
scores_loo, scores_10fold = loocv_demo()
14.3.4 Repeated K-Fold Cross-Validation
def repeated_kfold_demo():
"""Demonstrate Repeated K-Fold Cross-Validation"""
print("Repeated K-Fold Cross-Validation:")
print("Repeat K-fold cross-validation multiple times with different splits")
# Create dataset
X, y = load_iris(return_X_y=True)
print(f"Dataset size: {X.shape}")
# Basic 5-fold
kfold_5 = KFold(n_splits=5, shuffle=True, random_state=42)
clf = LogisticRegression(random_state=42, max_iter=1000)
scores_single = cross_val_score(clf, X, y, cv=kfold_5, scoring='accuracy')
# Repeated 5-fold (repeat 10 times)
from sklearn.model_selection import RepeatedKFold
rkfold = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
scores_repeated = cross_val_score(clf, X, y, cv=rkfold, scoring='accuracy')
print(f"\nSingle 5-Fold CV: {np.mean(scores_single):.4f} ± {np.std(scores_single):.4f}")
print(f"Repeated 5-Fold (10 repeats): {np.mean(scores_repeated):.4f} ± {np.std(scores_repeated):.4f}")
print(f"Total evaluations: {len(scores_repeated)}")
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle('Single vs Repeated K-Fold Cross-Validation', fontsize=16)
# Score distribution
axes[0].hist(scores_single, bins=10, alpha=0.7, color='skyblue', label='Single 5-Fold', edgecolor='black')
axes[0].axvline(x=np.mean(scores_single), color='blue', linestyle='--', linewidth=2)
axes[0].set_title('Single 5-Fold CV Score Distribution')
axes[0].set_xlabel('Accuracy')
axes[0].set_ylabel('Frequency')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Repeated scores (show first 50 for clarity)
axes[1].plot(scores_repeated[:50], 'o-', alpha=0.7, color='lightcoral', markersize=4)
axes[1].axhline(y=np.mean(scores_repeated), color='red', linestyle='--', linewidth=2, label=f'Mean: {np.mean(scores_repeated):.3f}')
axes[1].fill_between(range(50),
np.mean(scores_repeated) - np.std(scores_repeated),
np.mean(scores_repeated) + np.std(scores_repeated),
alpha=0.2, color='red')
axes[1].set_title('Repeated 5-Fold CV Scores (First 50)')
axes[1].set_xlabel('Evaluation Number')
axes[1].set_ylabel('Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return scores_single, scores_repeated
scores_single, scores_repeated = repeated_kfold_demo()
14.4 Special Cross-Validation Methods
14.4.1 Time Series Cross-Validation
def time_series_cv_demo():
"""Demonstrate Time Series Cross-Validation"""
print("Time Series Cross-Validation:")
print("Uses expanding window or sliding window to maintain temporal order")
# Create simulated time series data
np.random.seed(42)
n_samples = 200
# Create time series with trend and seasonality
time = np.arange(n_samples)
trend = 0.01 * time
seasonality = 0.3 * np.sin(2 * np.pi * time / 50)
noise = np.random.normal(0, 0.1, n_samples)
# Target with some relationship to features
X_ts = np.column_stack([
trend + noise,
seasonality + noise,
np.random.normal(0, 0.2, n_samples),
np.random.normal(0, 0.2, n_samples)
])
# Binary target based on value
y_ts = (trend + seasonality + noise > 0).astype(int)
print(f"Time series dataset: {X_ts.shape}")
print(f"Class distribution: {np.bincount(y_ts)}")
# Time series split
tscv = TimeSeriesSplit(n_splits=5)
print(f"\nTime Series Split Information:")
for i, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
print(f"Fold {i+1}: Train {len(train_idx)} samples, Test {len(test_idx)} samples")
# Visualize time series split
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Time Series Cross-Validation', fontsize=16)
# Show data and splits
axes[0, 0].plot(time, y_ts, 'b-', alpha=0.7)
axes[0, 0].set_title('Time Series Target Variable')
axes[0, 0].set_xlabel('Time')
axes[0, 0].set_ylabel('Target')
axes[0, 0].grid(True, alpha=0.3)
# Show splits
colors = ['red', 'blue', 'green', 'orange', 'purple']
for i, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
axes[0, 1].axvspan(train_idx[0], train_idx[-1], alpha=0.2, color=colors[i], label=f'Fold {i+1} Train' if i == 0 else "")
axes[0, 1].axvspan(test_idx[0], test_idx[-1], alpha=0.6, color=colors[i], label=f'Fold {i+1} Test' if i == 0 else "")
axes[0, 1].set_title('Time Series Split Visualization')
axes[0, 1].set_xlabel('Time Index')
axes[0, 1].set_ylabel('Fold')
axes[0, 1].legend(loc='upper left', fontsize=8)
axes[0, 1].grid(True, alpha=0.3)
# Perform time series cross-validation
clf = LogisticRegression(random_state=42, max_iter=1000)
scores_ts = []
fold_sizes = []
for i, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
X_train, X_test = X_ts[train_idx], X_ts[test_idx]
y_train, y_test = y_ts[train_idx], y_ts[test_idx]
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
scores_ts.append(score)
fold_sizes.append(len(test_idx))
print(f"Fold {i+1}: Test size={len(test_idx)}, Accuracy={score:.4f}")
# Visualize results
axes[1, 0].bar(range(1, len(scores_ts) + 1), scores_ts, alpha=0.7, color='skyblue')
axes[1, 0].set_title('Time Series CV Accuracy by Fold')
axes[1, 0].set_xlabel('Fold Number')
axes[1, 0].set_ylabel('Accuracy')
axes[1, 0].grid(True, alpha=0.3)
# Size vs accuracy
axes[1, 1].scatter(fold_sizes, scores_ts, s=100, alpha=0.7, color='lightcoral')
axes[1, 1].set_title('Test Set Size vs Accuracy')
axes[1, 1].set_xlabel('Test Set Size')
axes[1, 1].set_ylabel('Accuracy')
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\nOverall Time Series CV: {np.mean(scores_ts):.4f} ± {np.std(scores_ts):.4f}")
return scores_ts
scores_ts = time_series_cv_demo()
14.4.2 Group K-Fold Cross-Validation
def group_kfold_demo():
"""Demonstrate Group K-Fold Cross-Validation"""
print("Group K-Fold Cross-Validation:")
print("Ensures that samples from the same group are not in both training and validation sets")
# Create dataset with groups
np.random.seed(42)
n_groups = 10
samples_per_group = 20
n_samples = n_groups * samples_per_group
# Create group labels
groups = np.repeat(np.arange(n_groups), samples_per_group)
# Create features and target with group structure
X = np.random.randn(n_samples, 5)
y = np.random.randint(0, 2, n_samples)
# Add group effect to features
for g in range(n_groups):
group_mask = groups == g
X[group_mask] += np.random.randn(1, 5) * 0.5
print(f"Dataset size: {X.shape}")
print(f"Number of groups: {n_groups}")
print(f"Samples per group: {samples_per_group}")
# Compare regular K-fold and group K-fold
k = 5
# Regular K-fold (may put same group in both train and test)
kfold = KFold(n_splits=k, shuffle=True, random_state=42)
# Group K-fold (ensures groups don't leak)
group_kfold = GroupKFold(n_splits=k)
clf = LogisticRegression(random_state=42, max_iter=1000)
# Check for group leakage in regular K-fold
print(f"\nChecking for group leakage in regular K-fold:")
leakage_count = 0
for fold, (train_idx, test_idx) in enumerate(kfold.split(X, y, groups)):
train_groups = set(groups[train_idx])
test_groups = set(groups[test_idx])
overlap = train_groups & test_groups
if overlap:
leakage_count += len(overlap)
print(f"Fold {fold+1}: Group leakage detected - {len(overlap)} groups appear in both train and test")
if leakage_count == 0:
print("No group leakage detected")
# Group K-fold ensures no leakage
print(f"\nGroup K-fold ensures no group appears in both train and test sets")
# Perform cross-validation
scores_regular = cross_val_score(clf, X, y, cv=kfold, scoring='accuracy')
scores_group = cross_val_score(clf, X, y, cv=group_kfold, groups=groups, scoring='accuracy')
print(f"\nPerformance Comparison:")
print(f"Regular K-fold: {np.mean(scores_regular):.4f} ± {np.std(scores_regular):.4f}")
print(f"Group K-fold: {np.mean(scores_group):.4f} ± {np.std(scores_group):.4f}")
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle('Regular K-Fold vs Group K-Fold Cross-Validation', fontsize=16)
# Show group structure
group_colors = plt.cm.tab10(np.arange(n_groups))
for g in range(n_groups):
mask = groups == g
axes[0].scatter(X[mask, 0], X[mask, 1], c=[group_colors[g]], label=f'Group {g}' if g < 5 else "", alpha=0.7)
axes[0].set_title('Data Points Colored by Group')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].legend(loc='upper right', fontsize=8)
axes[0].grid(True, alpha=0.3)
# Performance comparison
methods = ['Regular K-Fold', 'Group K-Fold']
means = [np.mean(scores_regular), np.mean(scores_group)]
stds = [np.std(scores_regular), np.std(scores_group)]
axes[1].bar(methods, means, yerr=stds, capsize=5, color=['skyblue', 'lightcoral'], alpha=0.7)
axes[1].set_title('Accuracy Comparison')
axes[1].set_ylabel('Accuracy')
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim(0.3, 0.8)
plt.tight_layout()
plt.show()
return scores_regular, scores_group
scores_regular, scores_group = group_kfold_demo()
14.4.3 Shuffle Split Cross-Validation
def shuffle_split_demo():
"""Demonstrate Shuffle Split Cross-Validation"""
print("Shuffle Split Cross-Validation:")
print("Randomly shuffle data and split into train/test sets multiple times")
# Load data
X, y = load_iris(return_X_y=True)
print(f"Dataset size: {X.shape}")
# ShuffleSplit
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
# Stratified ShuffleSplit
stratified_ss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
clf = LogisticRegression(random_state=42, max_iter=1000)
# Perform cross-validation
scores_shuffle = cross_val_score(clf, X, y, cv=ss, scoring='accuracy')
scores_stratified_shuffle = cross_val_score(clf, X, y, cv=stratified_ss, scoring='accuracy')
print(f"\nShuffle Split Results:")
print(f"Mean Accuracy: {np.mean(scores_shuffle):.4f} ± {np.std(scores_shuffle):.4f}")
print(f"Individual scores: {scores_shuffle}")
print(f"\nStratified Shuffle Split Results:")
print(f"Mean Accuracy: {np.mean(scores_stratified_shuffle):.4f} ± {np.std(scores_stratified_shuffle):.4f}")
print(f"Individual scores: {scores_stratified_shuffle}")
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle('Shuffle Split Cross-Validation', fontsize=16)
# Score distribution
axes[0].hist(scores_shuffle, bins=10, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].axvline(x=np.mean(scores_shuffle), color='red', linestyle='--', label=f'Mean: {np.mean(scores_shuffle):.3f}')
axes[0].set_title('Shuffle Split Score Distribution')
axes[0].set_xlabel('Accuracy')
axes[0].set_ylabel('Frequency')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Show different splits
for i, (train_idx, test_idx) in enumerate(ss.split(X)):
if i < 5: # Show first 5 splits
axes[1].scatter(train_idx, [i] * len(train_idx), c='blue', alpha=0.3, s=5)
axes[1].scatter(test_idx, [i] * len(test_idx), c='red', alpha=0.3, s=5)
axes[1].set_title('First 5 Shuffle Splits')
axes[1].set_xlabel('Sample Index')
axes[1].set_ylabel('Split Number')
axes[1].legend(handles=[plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='blue', markersize=10, label='Train'),
plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='red', markersize=10, label='Test')],
loc='upper right')
axes[1].grid(True, alpha=0.3)
# Comparison
axes[2].boxplot([scores_shuffle, scores_stratified_shuffle], labels=['Shuffle Split', 'Stratified Shuffle'])
axes[2].set_title('Method Comparison')
axes[2].set_ylabel('Accuracy')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Advantages of Shuffle Split
print("\nShuffle Split Advantages:")
print("✓ Direct control over train/test sizes")
print("✓ Number of iterations can be set independently of the split ratio")
print("✓ Can be faster than K-fold for large datasets")
print("Note: sampling is random, so some samples may be tested several times and others never")
return scores_shuffle, scores_stratified_shuffle
scores_shuffle, scores_stratified_shuffle = shuffle_split_demo()
14.5 Cross-Validation for Model Selection
14.5.1 Comparing Multiple Models
def compare_models_cv():
"""Compare multiple models using cross-validation"""
print("Comparing Multiple Models with Cross-Validation:")
# Load data
X, y = load_iris(return_X_y=True)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
# Define models to compare
models = {
'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'SVM (RBF)': SVC(kernel='rbf', random_state=42),
'KNN': KNeighborsClassifier(n_neighbors=5)
}
# Cross-validation settings
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Evaluate each model
results = {}
print(f"\nModel Comparison Results:")
print("-" * 60)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Model Comparison with Cross-Validation', fontsize=16)
for i, (name, model) in enumerate(models.items()):
# Perform cross-validation with multiple metrics
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
results[name] = {
'mean': np.mean(scores),
'std': np.std(scores),
'scores': scores
}
print(f"{name}:")
print(f" Accuracy: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
print(f" Individual folds: {scores}")
# Visualize
row = i // 2
col = i % 2
axes[row, col].bar(range(1, 6), scores, alpha=0.7, color=plt.cm.tab10(i))
axes[row, col].axhline(y=np.mean(scores), color='red', linestyle='--',
label=f'Mean: {np.mean(scores):.3f}')
axes[row, col].set_title(name)
axes[row, col].set_xlabel('Fold')
axes[row, col].set_ylabel('Accuracy')
axes[row, col].legend()
axes[row, col].grid(True, alpha=0.3)
axes[row, col].set_ylim(0.8, 1.05)
plt.tight_layout()
plt.show()
# Summary comparison
print("\n" + "=" * 60)
print("Summary:")
print("-" * 60)
# Sort by mean accuracy
sorted_results = sorted(results.items(), key=lambda x: x[1]['mean'], reverse=True)
for rank, (name, result) in enumerate(sorted_results, 1):
print(f"{rank}. {name}: {result['mean']:.4f} ± {result['std']:.4f}")
# Visualize summary
names = [r[0] for r in sorted_results]
means = [r[1]['mean'] for r in sorted_results]
stds = [r[1]['std'] for r in sorted_results]
plt.figure(figsize=(10, 6))
bars = plt.barh(names, means, xerr=stds, capsize=5, color='skyblue', alpha=0.7)
plt.xlabel('Accuracy')
plt.title('Model Comparison Summary')
plt.xlim(0.8, 1.0)
plt.grid(True, alpha=0.3)
# Highlight best model
bars[0].set_color('gold')
plt.tight_layout()
plt.show()
return results
results = compare_models_cv()
14.5.2 Hyperparameter Tuning with Cross-Validation
def hyperparameter_tuning_cv():
"""Demonstrate hyperparameter tuning with cross-validation"""
print("Hyperparameter Tuning with Cross-Validation:")
# Load data
X, y = load_iris(return_X_y=True)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
# Define parameter grid for SVM
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': ['scale', 'auto', 0.01, 0.1, 1],
'kernel': ['rbf', 'linear']
}
# Create SVM classifier
svc = SVC(random_state=42)
# Grid search with cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(
svc, param_grid, cv=cv, scoring='accuracy',
n_jobs=-1, verbose=1
)
grid_search.fit(X, y)
print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Score: {grid_search.best_score_:.4f}")
# Show top 5 parameter combinations
print(f"\nTop 5 Parameter Combinations:")
results_df = pd.DataFrame(grid_search.cv_results_)
results_df = results_df.sort_values('rank_test_score')
for i, (_, row) in enumerate(results_df.head(5).iterrows()):
print(f"{i+1}. Score: {row['mean_test_score']:.4f} ± {row['std_test_score']:.4f}")
print(f" Parameters: {row['params']}")
# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle('Hyperparameter Tuning Results', fontsize=16)
# Heatmap for RBF kernel
# Keep numeric gamma values only so the pivot table sorts cleanly (a mixed str/float index fails to sort)
rbf_results = results_df[(results_df['param_kernel'] == 'rbf') &
results_df['param_gamma'].apply(lambda g: isinstance(g, (int, float)))]
pivot_table = rbf_results.pivot_table(
values='mean_test_score',
index='param_gamma',
columns='param_C'
)
sns.heatmap(pivot_table, annot=True, cmap='viridis', fmt='.4f', ax=axes[0])
axes[0].set_title('SVM RBF Kernel: C vs Gamma')
axes[0].set_xlabel('C')
axes[0].set_ylabel('Gamma')
# Performance comparison
axes[1].plot(range(len(results_df)), results_df['mean_test_score'].values, 'o-', alpha=0.7)
axes[1].fill_between(range(len(results_df)),
results_df['mean_test_score'] - results_df['std_test_score'],
results_df['mean_test_score'] + results_df['std_test_score'],
alpha=0.2)
axes[1].set_title('All Parameter Combinations Performance')
axes[1].set_xlabel('Parameter Combination Index')
axes[1].set_ylabel('Accuracy')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return grid_search
grid_search = hyperparameter_tuning_cv()
14.5.3 Learning Curves
def learning_curves_cv():
"""Analyze learning curves using cross-validation"""
print("Learning Curves Analysis:")
# Load data
X, y = load_iris(return_X_y=True)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
# Define models with different complexities
models = {
'Simple (KNN, k=20)': KNeighborsClassifier(n_neighbors=20),
'Medium (KNN, k=5)': KNeighborsClassifier(n_neighbors=5),
'Complex (KNN, k=1)': KNeighborsClassifier(n_neighbors=1)
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Learning Curves for Different Model Complexities', fontsize=16)
train_sizes = np.linspace(0.2, 1.0, 9)  # start at 20% so even KNN with k=20 has enough training samples
for i, (name, model) in enumerate(models.items()):
train_sizes_abs, train_scores, val_scores = learning_curve(
model, X, y, cv=cv, n_jobs=-1,
train_sizes=train_sizes, scoring='accuracy'
)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
val_mean = np.mean(val_scores, axis=1)
val_std = np.std(val_scores, axis=1)
axes[i].plot(train_sizes_abs, train_mean, 'o-', color='blue', label='Training Score')
axes[i].fill_between(train_sizes_abs, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
axes[i].plot(train_sizes_abs, val_mean, 'o-', color='red', label='Validation Score')
axes[i].fill_between(train_sizes_abs, val_mean - val_std, val_mean + val_std, alpha=0.1, color='red')
axes[i].set_title(name)
axes[i].set_xlabel('Training Set Size')
axes[i].set_ylabel('Accuracy')
axes[i].legend(loc='lower right')
axes[i].grid(True, alpha=0.3)
axes[i].set_ylim(0.5, 1.05)
plt.tight_layout()
plt.show()
# Interpretation
print("\nLearning Curve Interpretation:")
print("High Training Score + Low Validation Score → Overfitting")
print("Both Scores Low → Underfitting")
print("Training Score Improves with More Data → Model Benefits from More Data")
return None
learning_curves_cv()
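learning_curve varies the training set size; its companion validation_curve (imported in 14.2 but not used so far) varies a single hyperparameter instead, making the transition from underfitting to overfitting visible. A minimal sketch, reusing the chapter's imports:
# Sketch: validation curve over KNN's n_neighbors (reuses the imports from 14.2)
X, y = load_iris(return_X_y=True)
param_range = np.arange(1, 21)
train_scores, val_scores = validation_curve(
KNeighborsClassifier(), X, y,
param_name='n_neighbors', param_range=param_range,
cv=5, scoring='accuracy'
)
plt.plot(param_range, train_scores.mean(axis=1), 'o-', label='Training')
plt.plot(param_range, val_scores.mean(axis=1), 'o-', label='Validation')
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
At n_neighbors=1 the training score is perfect while the validation score lags (overfitting); very large values pull both curves down (underfitting).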
14.6 Cross-Validation Metrics
14.6.1 Multiple Metrics Evaluation
def multiple_metrics_cv():
"""Evaluate models using multiple metrics with cross-validation"""
print("Cross-Validation with Multiple Metrics:")
# Load data
X, y = load_iris(return_X_y=True)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
# Use cross_validate for multiple metrics
clf = LogisticRegression(random_state=42, max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
results = cross_validate(clf, X, y, cv=cv, scoring=scoring, return_train_score=True)
print(f"\nCross-Validation Results:")
print("-" * 60)
# Extract results
metrics = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
for metric in metrics:
test_key = f'test_{metric}'
train_key = f'train_{metric}'
test_scores = results[test_key]
train_scores = results[train_key]
print(f"{metric}:")
print(f" Test: {np.mean(test_scores):.4f} ± {np.std(test_scores):.4f}")
print(f" Train: {np.mean(train_scores):.4f} ± {np.std(train_scores):.4f}")
print(f" Gap: {np.mean(train_scores) - np.mean(test_scores):.4f}")
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Multiple Metrics Cross-Validation Results', fontsize=16)
for i, metric in enumerate(metrics):
row = i // 2
col = i % 2
test_key = f'test_{metric}'
train_key = f'train_{metric}'
test_scores = results[test_key]
train_scores = results[train_key]
x = np.arange(5)
width = 0.35
axes[row, col].bar(x - width/2, train_scores, width, label='Training', color='skyblue')
axes[row, col].bar(x + width/2, test_scores, width, label='Test', color='lightcoral')
axes[row, col].set_title(metric.replace('_', ' ').title())
axes[row, col].set_xlabel('Fold')
axes[row, col].set_ylabel('Score')
axes[row, col].set_xticks(x)
axes[row, col].set_xticklabels([f'Fold {i+1}' for i in range(5)])
axes[row, col].legend()
axes[row, col].grid(True, alpha=0.3)
axes[row, col].set_ylim(0.5, 1.1)
plt.tight_layout()
plt.show()
return results
results = multiple_metrics_cv()
14.6.2 Confidence Intervals
def confidence_intervals_cv():
"""Calculate confidence intervals for cross-validation results"""
print("Confidence Intervals for Cross-Validation:")
# Load data
X, y = load_iris(return_X_y=True)
# Perform cross-validation
clf = LogisticRegression(random_state=42, max_iter=1000)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
print(f"Cross-validation scores: {scores}")
print(f"Number of folds: {len(scores)}")
# Calculate statistics
mean_score = np.mean(scores)
std_score = np.std(scores)
n = len(scores)
# Calculate 95% confidence interval using t-distribution
from scipy import stats
confidence = 0.95
t_value = stats.t.ppf((1 + confidence) / 2, n - 1)
margin_of_error = t_value * std_score / np.sqrt(n)
ci_lower = mean_score - margin_of_error
ci_upper = mean_score + margin_of_error
print(f"\nResults:")
print(f"Mean Accuracy: {mean_score:.4f}")
print(f"Standard Deviation: {std_score:.4f}")
print(f"95% Confidence Interval: [{ci_lower:.4f}, {ci_upper:.4f}]")
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle('Cross-Validation with Confidence Intervals', fontsize=16)
# Score distribution with CI
axes[0].hist(scores, bins=10, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].axvline(x=mean_score, color='red', linestyle='-', linewidth=2, label=f'Mean: {mean_score:.3f}')
axes[0].axvline(x=ci_lower, color='orange', linestyle='--', linewidth=2, label=f'95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]')
axes[0].axvline(x=ci_upper, color='orange', linestyle='--', linewidth=2)
axes[0].set_title('Score Distribution')
axes[0].set_xlabel('Accuracy')
axes[0].set_ylabel('Frequency')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Individual fold scores with error bars
fold_numbers = np.arange(1, n + 1)
axes[1].errorbar(fold_numbers, scores, yerr=std_score, fmt='o', capsize=5, color='lightcoral', markersize=8)
axes[1].axhline(y=mean_score, color='blue', linestyle='-', linewidth=2, label=f'Mean: {mean_score:.3f}')
axes[1].fill_between([0.5, n + 0.5], ci_lower, ci_upper, alpha=0.2, color='blue', label='95% CI')
axes[1].set_title('Fold Scores with Confidence Interval')
axes[1].set_xlabel('Fold Number')
axes[1].set_ylabel('Accuracy')
axes[1].set_xlim(0.5, n + 0.5)
axes[1].set_ylim(mean_score - 0.1, mean_score + 0.1)
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return scores, (ci_lower, ci_upper)
scores, ci = confidence_intervals_cv()
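As an aside, scipy can compute the same interval in one call; note that stats.sem uses the sample standard deviation (ddof=1), so the result can differ slightly from the np.std-based calculation above:
# One-line alternative using scipy (reuses the scores from above)
from scipy import stats
ci_lower, ci_upper = stats.t.interval(0.95, len(scores) - 1,
loc=np.mean(scores), scale=stats.sem(scores))
print(f"95% Confidence Interval (scipy): [{ci_lower:.4f}, {ci_upper:.4f}]")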
14.7 Best Practices and Tips
14.7.1 Choosing the Right Cross-Validation Method
def cv_method_selection_guide():
"""Guide for choosing the right cross-validation method"""
print("Cross-Validation Method Selection Guide:")
print("=" * 60)
guidelines = {
"Small Datasets (< 1000 samples)": "Use LOOCV or stratified K-fold (k=5 or 10)",
"Large Datasets (> 10000 samples)": "Use K-fold (k=5) or hold-out validation",
"Imbalanced Classification": "Use stratified K-fold",
"Time Series Data": "Use TimeSeriesSplit",
"Groups in Data": "Use GroupKFold",
"Quick Evaluation": "Use ShuffleSplit with fewer iterations",
"Stable Results Needed": "Use RepeatedKFold",
"Feature Selection": "Use nested cross-validation"
}
print("\nGuidelines:")
print("-" * 60)
for scenario, recommendation in guidelines.items():
print(f"{scenario}:")
print(f" → {recommendation}")
print()
# Summary table
print("\nMethod Comparison Summary:")
print("-" * 60)
comparison = pd.DataFrame({
'Method': ['KFold', 'StratifiedKFold', 'LOOCV', 'TimeSeriesSplit', 'GroupKFold', 'ShuffleSplit'],
'Use Case': ['General', 'Classification with imbalance', 'Small datasets', 'Time series', 'Grouped data', 'Large datasets'],
'Bias': ['Low', 'Low', 'Very Low', 'Medium', 'Low', 'Medium'],
'Variance': ['Medium', 'Medium', 'High', 'Medium', 'Medium', 'Low'],
'Speed': ['Fast', 'Fast', 'Slow', 'Fast', 'Fast', 'Very Fast']
})
print(comparison.to_string(index=False))
return guidelines
guidelines = cv_method_selection_guide()
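The last guideline recommends nested cross-validation, which the chapter has not demonstrated yet. A minimal sketch using the imports from 14.2: the inner GridSearchCV tunes the hyperparameters, while the outer loop estimates the performance of the whole tuning procedure, avoiding the optimistic bias of scoring tuned models on the data used to tune them.
# Sketch: nested cross-validation (inner loop tunes, outer loop evaluates)
X, y = load_iris(return_X_y=True)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
tuned_clf = GridSearchCV(
LogisticRegression(max_iter=1000),
param_grid={'C': [0.1, 1, 10]},
cv=inner_cv, scoring='accuracy'
)
# Each outer fold refits the entire grid search on its own training portion
nested_scores = cross_val_score(tuned_clf, X, y, cv=outer_cv, scoring='accuracy')
print(f"Nested CV accuracy: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")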
14.7.2 Common Pitfalls and Solutions
def cv_pitfalls_and_solutions():
"""Common pitfalls in cross-validation and their solutions"""
print("Common Cross-Validation Pitfalls and Solutions:")
print("=" * 60)
pitfalls = [
{
'pitfall': 'Data Leakage in Preprocessing',
'description': 'Applying scaling or other transformations before splitting data',
'solution': 'Use Pipeline to encapsulate preprocessing and model',
'example': 'Always use Pipeline with cross_val_score or make_pipeline'
},
{
'pitfall': 'Feature Selection Leakage',
'description': 'Selecting features using all data before cross-validation',
'solution': 'Perform feature selection within each fold',
'example': 'Use sklearn.feature_selection.SelectFromModel inside Pipeline'
},
{
'pitfall': 'Improper Stratification',
'description': 'Not using stratified split for classification',
'solution': 'Use StratifiedKFold for classification tasks',
'example': 'Always use StratifiedKFold for classification'
},
{
'pitfall': 'Not Accounting for Groups',
'description': 'Having related samples in both train and test',
'solution': 'Use GroupKFold when data has natural groups',
'example': 'Use GroupKFold for patient data, time series, etc.'
},
{
'pitfall': 'Incorrect Metric Selection',
'description': 'Using inappropriate evaluation metrics',
'solution': 'Choose metrics based on problem type and business goals',
'example': 'Use F1 for imbalanced, ROC-AUC for ranking'
},
{
'pitfall': 'Data Leakage in Hyperparameter Tuning',
'description': 'Tuning hyperparameters on test set',
'solution': 'Use nested cross-validation',
'example': 'Inner loop for tuning, outer loop for evaluation'
}
]
for i, item in enumerate(pitfalls, 1):
print(f"\n{i}. {item['pitfall']}")
print(f" Problem: {item['description']}")
print(f" Solution: {item['solution']}")
print(f" Example: {item['example']}")
return pitfalls
pitfalls = cv_pitfalls_and_solutions()
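The first pitfall above (preprocessing leakage) deserves a concrete illustration. In the leakage-safe pattern sketched below, the whole Pipeline is passed to cross_val_score, so the scaler is refit on each fold's training portion only; this is a minimal sketch built from the chapter's imports:
# Sketch: leakage-safe preprocessing inside cross-validation
X, y = load_breast_cancer(return_X_y=True)
leak_free = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression(max_iter=1000))
])
# The scaler's mean and std are computed from the training folds only
scores = cross_val_score(leak_free, X, y, cv=5, scoring='accuracy')
print(f"Leakage-safe CV accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
# By contrast, running StandardScaler().fit_transform(X) on the full dataset first
# would leak validation-fold statistics into training.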
14.8 Practical Application Examples
14.8.1 Complete Model Evaluation Pipeline
def complete_model_evaluation():
"""Demonstrate a complete model evaluation pipeline"""
print("Complete Model Evaluation Pipeline:")
# Step 1: Load and prepare data
print("\nStep 1: Loading Data...")
X, y = load_breast_cancer(return_X_y=True)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Class distribution: {np.bincount(y)}")
# Step 2: Split data
print("\nStep 2: Splitting Data...")
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
# Step 3: Create pipeline
print("\nStep 3: Creating Pipeline...")
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression(random_state=42, max_iter=1000))
])
# Step 4: Cross-validation on training set
print("\nStep 4: Cross-Validation on Training Set...")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
pipeline, X_train, y_train,
cv=cv,
scoring=['accuracy', 'precision', 'recall', 'f1'],
return_train_score=True
)
print("Cross-Validation Results:")
for metric in ['accuracy', 'precision', 'recall', 'f1']:
test_key = f'test_{metric}'
train_key = f'train_{metric}'
print(f" {metric}: {np.mean(cv_results[test_key]):.4f} ± {np.std(cv_results[test_key])}")
# Step 5: Hyperparameter tuning
print("\nStep 5: Hyperparameter Tuning...")
param_grid = {
'classifier__C': [0.01, 0.1, 1, 10],
'classifier__penalty': ['l1', 'l2'],
'classifier__solver': ['liblinear']  # liblinear supports both l1 and l2 penalties
}
grid_search = GridSearchCV(
pipeline, param_grid, cv=cv, scoring='accuracy', n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")
# Step 6: Final evaluation on test set
print("\nStep 6: Final Evaluation on Test Set...")
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]
print(f"Test Set Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Test Set Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Test Set Recall: {recall_score(y_test, y_pred):.4f}")
print(f"Test Set F1: {f1_score(y_test, y_pred):.4f}")
# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle('Complete Model Evaluation Pipeline', fontsize=16)
# Cross-validation results
metrics = ['accuracy', 'precision', 'recall', 'f1']
cv_means = [np.mean(cv_results[f'test_{m}']) for m in metrics]
cv_stds = [np.std(cv_results[f'test_{m}']) for m in metrics]
axes[0].bar(metrics, cv_means, yerr=cv_stds, capsize=5, color='skyblue', alpha=0.7)
axes[0].set_title('Cross-Validation Results')
axes[0].set_ylabel('Score')
axes[0].grid(True, alpha=0.3)
axes[0].set_ylim(0.8, 1.0)
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1])
axes[1].set_title('Confusion Matrix')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
axes[2].plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
axes[2].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[2].set_title('ROC Curve')
axes[2].set_xlabel('False Positive Rate')
axes[2].set_ylabel('True Positive Rate')
axes[2].legend(loc="lower right")
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return grid_search, best_model
grid_search, best_model = complete_model_evaluation()
14.9 Exercises
Exercise 1: Basic Cross-Validation
- Use the iris dataset to perform K-fold cross-validation with different K values
- Compare the results and analyze how K affects the mean and variance of scores
- Visualize the cross-validation process
Exercise 2: Stratified vs Regular
- Create an imbalanced dataset with 3 classes
- Compare stratified K-fold with regular K-fold
- Analyze the class distribution in each fold
Exercise 3: Time Series CV
- Create a simulated time series dataset with trend and seasonality
- Use TimeSeriesSplit for cross-validation
- Compare with regular K-fold
Exercise 4: Model Selection
- Compare 5 different classifiers using cross-validation
- Perform hyperparameter tuning with GridSearchCV
- Build a complete evaluation pipeline
Exercise 5: Advanced Topics
- Implement nested cross-validation for feature selection
- Calculate confidence intervals for cross-validation results
- Analyze learning curves to diagnose bias and variance
14.10 Summary
In this chapter, we explored cross-validation in depth:
Core Concepts
- Cross-Validation Principles: Why we need it, types, advantages
- K-Fold Methods: Basic K-fold, Stratified K-fold, Repeated K-fold
- Special Methods: LOOCV, TimeSeriesSplit, GroupKFold, ShuffleSplit
Main Techniques
- Model Evaluation: Multiple models, multiple metrics
- Hyperparameter Tuning: GridSearchCV, nested CV
- Visualization: Learning curves, performance plots
Practical Skills
- Method Selection: Choosing the right CV method
- Data Leakage Prevention: Pipelines, proper splitting
- Result Interpretation: Confidence intervals, bias-variance
Key Points
- Cross-validation provides more reliable performance estimates
- Choose the appropriate CV method based on your data and problem
- Always use pipelines to prevent data leakage
- Use multiple metrics for comprehensive evaluation
14.11 Next Steps
You have now mastered cross-validation techniques! In the next chapter, Support Vector Machines, we will learn about another powerful classification algorithm that works well with high-dimensional data.
Chapter Key Points Review:
- ✓ Understood the principles and importance of cross-validation
- ✓ Mastered various cross-validation methods (K-fold, LOOCV, TimeSeries, etc.)
- ✓ Learned to compare multiple models and select the best one
- ✓ Understood hyperparameter tuning with cross-validation
- ✓ Learned to analyze learning curves and diagnose model issues
- ✓ Able to build complete model evaluation pipelines