Chapter 7: Random Forest and Ensemble Methods

Ensemble methods build more powerful predictive models by combining multiple base learners, and Random Forest is one of the most popular of them. This chapter covers the main ensemble techniques and their applications in detail.

7.1 What is Ensemble Learning?

The core idea of ensemble learning is that "many hands make light work": by combining multiple weak learners, we can obtain a single learner that is stronger and more stable.

7.1.1 Advantages of Ensemble Learning

  • Higher prediction accuracy: a combination of models is usually more accurate than any single model
  • Less overfitting: averaging across models reduces variance
  • Greater stability: ensembles are more robust to changes in the data
  • Complex problems: ensembles can learn more complex decision boundaries
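
The variance claim deserves a quick check. A minimal simulation (not part of the chapter's code, numbers are illustrative) of averaging 25 independent noisy predictors:

```python
import numpy as np

rng = np.random.default_rng(42)

# Each "model" predicts the true value 5.0 plus independent noise (std = 1.0)
true_value, noise_std, n_trials = 5.0, 1.0, 10_000

# A single noisy model
single_preds = true_value + noise_std * rng.standard_normal(n_trials)

# An ensemble that averages 25 independent noisy models per trial
ensemble_preds = true_value + noise_std * rng.standard_normal((n_trials, 25)).mean(axis=1)

print(f"Variance of a single model:   {single_preds.var():.4f}")
print(f"Variance of 25-model average: {ensemble_preds.var():.4f}")
```

With fully independent errors the ensemble variance shrinks by roughly a factor of 25; correlated models shrink less, which is exactly why Random Forest works to decorrelate its trees.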

7.1.2 Classification of Ensemble Methods

  1. Bagging (bootstrap aggregating): Train multiple models in parallel on resampled data, e.g., Random Forest
  2. Boosting: Train models sequentially, each correcting its predecessors' errors, e.g., AdaBoost, Gradient Boosting
  3. Stacking: Use a meta-learner to combine the base learners' predictions
  4. Voting: Combine predictions by majority vote or averaging
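
All four families share scikit-learn's fit/predict interface. A minimal sketch pairing each family with one representative estimator (the particular model choices here are illustrative, all are covered later in this chapter):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier, RandomForestClassifier,
    StackingClassifier, VotingClassifier,
)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

# One representative estimator per family; all share the same fit/predict API
pair = [('rf', RandomForestClassifier(n_estimators=10, random_state=0)),
        ('lr', LogisticRegression(max_iter=1000))]
families = {
    'Bagging (parallel)':    RandomForestClassifier(n_estimators=10, random_state=0),
    'Boosting (sequential)': AdaBoostClassifier(n_estimators=10, random_state=0),
    'Stacking (layered)':    StackingClassifier(estimators=pair),
    'Voting':                VotingClassifier(estimators=pair),
}
for family, est in families.items():
    print(f"{family:22s} train accuracy = {est.fit(X, y).score(X, y):.3f}")
```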

7.2 Environment and Data Preparation

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_regression, load_breast_cancer, load_wine
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, learning_curve
from sklearn.ensemble import (
    RandomForestClassifier, RandomForestRegressor,
    AdaBoostClassifier, AdaBoostRegressor,
    GradientBoostingClassifier, GradientBoostingRegressor,
    VotingClassifier, VotingRegressor,
    BaggingClassifier, BaggingRegressor,
    ExtraTreesClassifier, ExtraTreesRegressor
)
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import SVC, SVR
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    mean_squared_error, r2_score, roc_auc_score
)
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)

# Set figure style
plt.style.use('seaborn-v0_8')

7.3 Random Forest

7.3.1 Random Forest Principles

Random Forest builds diverse decision trees through two randomization processes:

  1. Bootstrap sampling: Each tree uses a different training subset
  2. Random feature selection: Only consider a random subset of features for each split

python
def demonstrate_bootstrap_sampling():
    """Demonstrate the Bootstrap sampling process"""
    # Create sample data
    original_data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    n_samples = len(original_data)

    print("Original data:", original_data)
    print("\nBootstrap sampling example:")

    # Generate 5 Bootstrap samples
    for i in range(5):
        bootstrap_indices = np.random.choice(n_samples, size=n_samples, replace=True)
        bootstrap_sample = original_data[bootstrap_indices]
        unique_samples = len(np.unique(bootstrap_sample))

        print(f"Sample {i+1}: {bootstrap_sample} (unique values: {unique_samples})")

    # Visualize the diversity of Bootstrap sampling
    fig, axes = plt.subplots(2, 3, figsize=(15, 8))
    fig.suptitle('Diversity of Bootstrap Sampling', fontsize=16)

    # Original data
    axes[0, 0].bar(range(len(original_data)), original_data, color='blue', alpha=0.7)
    axes[0, 0].set_title('Original Data')
    axes[0, 0].set_xlabel('Index')
    axes[0, 0].set_ylabel('Value')

    # Bootstrap samples
    for i in range(5):
        row = (i + 1) // 3
        col = (i + 1) % 3

        bootstrap_indices = np.random.choice(n_samples, size=n_samples, replace=True)
        bootstrap_sample = original_data[bootstrap_indices]

        axes[row, col].bar(range(len(bootstrap_sample)), bootstrap_sample,
                          color='orange', alpha=0.7)
        axes[row, col].set_title(f'Bootstrap Sample {i+1}')
        axes[row, col].set_xlabel('Index')
        axes[row, col].set_ylabel('Value')

    plt.tight_layout()
    plt.show()

demonstrate_bootstrap_sampling()
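
Bootstrap sampling has a useful by-product: each tree sees only about 63.2% of the distinct training rows, and the left-out "out-of-bag" (OOB) rows provide a built-in validation estimate. A sketch using scikit-learn's oob_score option (synthetic data generated the same way as in the next subsection):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# oob_score=True scores every row using only the trees that did NOT
# see it in their bootstrap sample -- a free validation estimate
rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            random_state=42, n_jobs=-1)
rf.fit(X, y)
print(f"OOB accuracy estimate: {rf.oob_score_:.4f}")

# Sanity check: one bootstrap sample keeps about 1 - 1/e ~ 63.2% unique rows
rng = np.random.default_rng(42)
sample = rng.choice(1000, size=1000, replace=True)
print(f"Unique fraction in a bootstrap sample: {len(np.unique(sample)) / 1000:.3f}")
```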

7.3.2 Random Forest Classification

python
# Create classification dataset
X_class, y_class = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    n_clusters_per_class=1,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42
)

# Create Random Forest classifier
rf_classifier = RandomForestClassifier(
    n_estimators=100,    # Number of trees
    max_depth=10,        # Maximum depth
    min_samples_split=5, # Minimum samples required to split
    min_samples_leaf=2,  # Minimum samples in leaf node
    max_features='sqrt', # Number of features to consider for each split
    random_state=42,
    n_jobs=-1           # Parallel processing
)

# Train model
rf_classifier.fit(X_train, y_train)

# Predict
y_pred_rf = rf_classifier.predict(X_test)
y_pred_proba_rf = rf_classifier.predict_proba(X_test)

# Evaluate
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest classification accuracy: {accuracy_rf:.4f}")

print("\nDetailed classification report:")
print(classification_report(y_test, y_pred_rf))

# Compare with single decision tree
dt_classifier = DecisionTreeClassifier(max_depth=10, random_state=42)
dt_classifier.fit(X_train, y_train)
y_pred_dt = dt_classifier.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)

print(f"\nPerformance comparison:")
print(f"Single decision tree accuracy: {accuracy_dt:.4f}")
print(f"Random Forest accuracy: {accuracy_rf:.4f}")
print(f"Performance improvement: {((accuracy_rf - accuracy_dt) / accuracy_dt * 100):.2f}%")

7.3.3 Feature Importance Analysis

python
# Feature importance
feature_importance = rf_classifier.feature_importances_
feature_names = [f'feature_{i+1}' for i in range(X_class.shape[1])]

# Create feature importance DataFrame
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': feature_importance
}).sort_values('importance', ascending=False)

# Visualize feature importance
plt.figure(figsize=(12, 8))
top_features = importance_df.head(15)
plt.barh(range(len(top_features)), top_features['importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance (Top 15 Features)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("Top 10 most important features:")
for i, (_, row) in enumerate(top_features.head(10).iterrows()):
    print(f"{i+1:2d}. {row['feature']}: {row['importance']:.4f}")

# Compare feature importance between single tree and random forest
dt_importance = dt_classifier.feature_importances_

plt.figure(figsize=(12, 6))
x = np.arange(len(feature_names))
width = 0.35

plt.bar(x - width/2, feature_importance, width, label='Random Forest', alpha=0.8)
plt.bar(x + width/2, dt_importance, width, label='Single Decision Tree', alpha=0.8)

plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Random Forest vs Single Decision Tree Feature Importance Comparison')
plt.xticks(x, [f'F{i+1}' for i in range(len(feature_names))], rotation=45)
plt.legend()
plt.tight_layout()
plt.show()
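
Impurity-based importances such as feature_importances_ can overstate features with many possible split points; permutation importance on held-out data is a common cross-check. A sketch using sklearn.inspection.permutation_importance on data generated the same way as above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_clusters_per_class=1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_tr, y_tr)

# Shuffle one feature at a time on the TEST set and record the accuracy drop;
# a large drop means the model genuinely relies on that feature
result = permutation_importance(rf, X_te, y_te, n_repeats=10,
                                random_state=42, n_jobs=-1)

for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature_{idx + 1}: {result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```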

7.3.4 Random Forest Regression

python
# Create regression dataset
X_reg, y_reg = make_regression(
    n_samples=500,
    n_features=10,
    n_informative=5,
    noise=0.1,
    random_state=42
)

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Random Forest regressor
rf_regressor = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)

rf_regressor.fit(X_train_reg, y_train_reg)

# Predict
y_pred_rf_reg = rf_regressor.predict(X_test_reg)

# Evaluate
r2_rf = r2_score(y_test_reg, y_pred_rf_reg)
rmse_rf = np.sqrt(mean_squared_error(y_test_reg, y_pred_rf_reg))

print(f"Random Forest regression performance:")
print(f"R² score: {r2_rf:.4f}")
print(f"RMSE: {rmse_rf:.4f}")

# Compare with single decision tree
dt_regressor = DecisionTreeRegressor(max_depth=10, random_state=42)
dt_regressor.fit(X_train_reg, y_train_reg)
y_pred_dt_reg = dt_regressor.predict(X_test_reg)

r2_dt = r2_score(y_test_reg, y_pred_dt_reg)
rmse_dt = np.sqrt(mean_squared_error(y_test_reg, y_pred_dt_reg))

print(f"\nSingle decision tree regression performance:")
print(f"R² score: {r2_dt:.4f}")
print(f"RMSE: {rmse_dt:.4f}")

# Visualize prediction results
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(y_test_reg, y_pred_dt_reg, alpha=0.6, color='red')
plt.plot([y_test_reg.min(), y_test_reg.max()], [y_test_reg.min(), y_test_reg.max()], 'k--', lw=2)
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.title(f'Single Decision Tree (R² = {r2_dt:.3f})')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.scatter(y_test_reg, y_pred_rf_reg, alpha=0.6, color='blue')
plt.plot([y_test_reg.min(), y_test_reg.max()], [y_test_reg.min(), y_test_reg.max()], 'k--', lw=2)
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.title(f'Random Forest (R² = {r2_rf:.3f})')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
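
Because a forest prediction is just the mean over its trees, the per-tree spread gives a rough (uncalibrated) uncertainty estimate for each regression prediction. A sketch using the estimators_ attribute, on data generated as above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, n_informative=5,
                       noise=0.1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Stack every individual tree's prediction: shape (n_trees, n_test_rows)
per_tree = np.stack([tree.predict(X_te) for tree in rf.estimators_])

mean_pred = per_tree.mean(axis=0)   # identical to rf.predict(X_te)
spread = per_tree.std(axis=0)       # disagreement between trees

print(f"First test point: {mean_pred[0]:.2f} +/- {spread[0]:.2f} (per-tree std)")
```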

7.4 Hyperparameter Tuning

7.4.1 Grid Search Optimization

python
# Random Forest hyperparameter grid search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

# Use smaller parameter grid for demonstration
small_param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10],
    'min_samples_split': [2, 5],
    'max_features': ['sqrt', 'log2']
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    small_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

print("Performing grid search...")
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Test best model
best_rf = grid_search.best_estimator_
y_pred_best = best_rf.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"Test set accuracy: {accuracy_best:.4f}")

# Visualize grid search results
results_df = pd.DataFrame(grid_search.cv_results_)

# Select several important parameters for visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# n_estimators vs performance
estimators_results = results_df.groupby('param_n_estimators')['mean_test_score'].mean()
axes[0].bar(estimators_results.index.astype(str), estimators_results.values)
axes[0].set_xlabel('Number of Trees')
axes[0].set_ylabel('Mean Cross-Validation Score')
axes[0].set_title('Impact of Number of Trees on Performance')

# max_depth vs performance
depth_results = results_df.groupby('param_max_depth')['mean_test_score'].mean()
axes[1].bar(depth_results.index.astype(str), depth_results.values, color='orange')
axes[1].set_xlabel('Maximum Depth')
axes[1].set_ylabel('Mean Cross-Validation Score')
axes[1].set_title('Impact of Maximum Depth on Performance')

plt.tight_layout()
plt.show()
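
When even the reduced grid is too expensive, RandomizedSearchCV samples a fixed number of candidate settings instead of enumerating every combination. A sketch with a budget of 10 sampled candidates (the distributions below are illustrative):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Distributions are sampled, so the cost is fixed by n_iter, not by grid size
param_distributions = {
    'n_estimators': randint(50, 201),
    'max_depth': [5, 10, 15, None],
    'min_samples_split': randint(2, 11),
    'max_features': ['sqrt', 'log2'],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,          # only 10 sampled parameter combinations
    cv=3,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,
)
search.fit(X_tr, y_tr)

print(f"Best parameters: {search.best_params_}")
print(f"Test set accuracy: {search.score(X_te, y_te):.4f}")
```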

7.4.2 Learning Curve Analysis

python
def plot_learning_curve_ensemble(estimator, X, y, title="Learning Curve"):
    """Plot learning curve for ensemble models"""
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy'
    )

    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)

    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
                     alpha=0.1, color='blue')

    plt.plot(train_sizes, val_mean, 'o-', color='red', label='Validation Score')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std,
                     alpha=0.1, color='red')

    plt.xlabel('Number of Training Samples')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Plot learning curves for different models
plot_learning_curve_ensemble(
    DecisionTreeClassifier(max_depth=10, random_state=42),
    X_train, y_train, "Single Decision Tree Learning Curve"
)

plot_learning_curve_ensemble(
    best_rf, X_train, y_train, "Random Forest Learning Curve"
)

7.5 Other Bagging Methods

7.5.1 Extra Trees (Extremely Randomized Trees)

python
# Extra Trees classifier
et_classifier = ExtraTreesClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)

et_classifier.fit(X_train, y_train)
y_pred_et = et_classifier.predict(X_test)
accuracy_et = accuracy_score(y_test, y_pred_et)

print(f"Extra Trees accuracy: {accuracy_et:.4f}")

# Compare Random Forest and Extra Trees
models = {
    'Random Forest': rf_classifier,
    'Extra Trees': et_classifier,
    'Single Decision Tree': dt_classifier
}

results = {}
for name, model in models.items():
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = accuracy

print("\nModel performance comparison:")
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.4f}")

# Visualize comparison
plt.figure(figsize=(10, 6))
plt.bar(results.keys(), results.values(), color=['blue', 'green', 'red'], alpha=0.7)
plt.title('Accuracy Comparison of Different Models')
plt.ylabel('Accuracy')
plt.ylim(0.8, 1.0)
for i, (name, acc) in enumerate(results.items()):
    plt.text(i, acc + 0.005, f'{acc:.4f}', ha='center')
plt.show()

7.5.2 Bagging Classifier

python
# Bagging with different base learners
base_estimators = {
    'Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42),
    'SVM': SVC(random_state=42, probability=True)
}

bagging_results = {}

for name, base_estimator in base_estimators.items():
    bagging_clf = BaggingClassifier(
        estimator=base_estimator,  # named base_estimator before scikit-learn 1.2
        n_estimators=50,
        random_state=42,
        n_jobs=-1
    )

    bagging_clf.fit(X_train, y_train)
    y_pred_bag = bagging_clf.predict(X_test)
    accuracy_bag = accuracy_score(y_test, y_pred_bag)

    bagging_results[f'Bagging + {name}'] = accuracy_bag

    # Compare with single base learner
    base_estimator.fit(X_train, y_train)
    y_pred_base = base_estimator.predict(X_test)
    accuracy_base = accuracy_score(y_test, y_pred_base)

    bagging_results[f'Single {name}'] = accuracy_base

print("Bagging method performance comparison:")
for name, accuracy in bagging_results.items():
    print(f"{name}: {accuracy:.4f}")

# Visualize Bagging effect
fig, ax = plt.subplots(figsize=(12, 6))
names = list(bagging_results.keys())
accuracies = list(bagging_results.values())

colors = ['blue' if 'Bagging' in name else 'orange' for name in names]
bars = ax.bar(names, accuracies, color=colors, alpha=0.7)

ax.set_title('Bagging Method vs Single Learner Performance Comparison')
ax.set_ylabel('Accuracy')
ax.tick_params(axis='x', rotation=45)

# Add numeric labels
for bar, acc in zip(bars, accuracies):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 0.005,
            f'{acc:.4f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

7.6 Boosting Methods

7.6.1 AdaBoost

python
# AdaBoost classifier
ada_classifier = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Weak learner (named base_estimator before scikit-learn 1.2)
    n_estimators=100,
    learning_rate=1.0,
    random_state=42
)

ada_classifier.fit(X_train, y_train)
y_pred_ada = ada_classifier.predict(X_test)
accuracy_ada = accuracy_score(y_test, y_pred_ada)

print(f"AdaBoost accuracy: {accuracy_ada:.4f}")

# Visualize AdaBoost learning process
def plot_adaboost_learning_process():
    """Visualize the AdaBoost learning process"""
    # Use simple 2D data for visualization
    X_simple, y_simple = make_classification(
        n_samples=200, n_features=2, n_redundant=0, n_informative=2,
        n_clusters_per_class=1, random_state=42
    )

    # Train AdaBoost
    ada_simple = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=5,
        learning_rate=1.0,
        random_state=42
    )

    ada_simple.fit(X_simple, y_simple)

    # Visualize each weak learner
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    fig.suptitle('AdaBoost Learning Process', fontsize=16)

    # Original data
    axes[0, 0].scatter(X_simple[y_simple==0, 0], X_simple[y_simple==0, 1],
                      c='red', alpha=0.6, label='Class 0')
    axes[0, 0].scatter(X_simple[y_simple==1, 0], X_simple[y_simple==1, 1],
                      c='blue', alpha=0.6, label='Class 1')
    axes[0, 0].set_title('Original Data')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

    # Decision boundary of each weak learner
    for i in range(5):
        row = (i + 1) // 3
        col = (i + 1) % 3

        # Create AdaBoost with only first i+1 weak learners
        ada_partial = AdaBoostClassifier(
            estimator=DecisionTreeClassifier(max_depth=1),
            n_estimators=i+1,
            learning_rate=1.0,
            random_state=42
        )
        ada_partial.fit(X_simple, y_simple)

        # Plot decision boundary
        h = 0.02
        x_min, x_max = X_simple[:, 0].min() - 1, X_simple[:, 0].max() + 1
        y_min, y_max = X_simple[:, 1].min() - 1, X_simple[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                           np.arange(y_min, y_max, h))

        Z = ada_partial.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        axes[row, col].contourf(xx, yy, Z, alpha=0.8, cmap='RdYlBu')
        axes[row, col].scatter(X_simple[y_simple==0, 0], X_simple[y_simple==0, 1],
                             c='red', alpha=0.6)
        axes[row, col].scatter(X_simple[y_simple==1, 0], X_simple[y_simple==1, 1],
                             c='blue', alpha=0.6)
        axes[row, col].set_title(f'First {i+1} Weak Learners')
        axes[row, col].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

plot_adaboost_learning_process()

7.6.2 Gradient Boosting

python
# Gradient Boosting classifier
gb_classifier = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

gb_classifier.fit(X_train, y_train)
y_pred_gb = gb_classifier.predict(X_test)
accuracy_gb = accuracy_score(y_test, y_pred_gb)

print(f"Gradient Boosting accuracy: {accuracy_gb:.4f}")

# Visualize loss during training
train_scores = gb_classifier.train_score_
test_scores = []

# Calculate loss on test set
for i, pred in enumerate(gb_classifier.staged_predict_proba(X_test)):
    test_loss = -np.mean(y_test * np.log(pred[:, 1] + 1e-15) +
                        (1 - y_test) * np.log(pred[:, 0] + 1e-15))
    test_scores.append(test_loss)

plt.figure(figsize=(10, 6))
plt.plot(range(1, len(train_scores) + 1), train_scores, label='Training Loss', color='blue')
plt.plot(range(1, len(test_scores) + 1), test_scores, label='Test Loss', color='red')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Gradient Boosting Training Process')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Feature importance
gb_importance = gb_classifier.feature_importances_
rf_importance = rf_classifier.feature_importances_

plt.figure(figsize=(12, 6))
x = np.arange(len(feature_names))
width = 0.35

plt.bar(x - width/2, gb_importance, width, label='Gradient Boosting', alpha=0.8)
plt.bar(x + width/2, rf_importance, width, label='Random Forest', alpha=0.8)

plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Gradient Boosting vs Random Forest Feature Importance Comparison')
plt.xticks(x, [f'F{i+1}' for i in range(len(feature_names))], rotation=45)
plt.legend()
plt.tight_layout()
plt.show()
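
A widening gap between training and test loss is the classic boosting overfitting signal. GradientBoostingClassifier can stop training automatically once an internally held-out validation score stops improving; a sketch of the validation_fraction and n_iter_no_change options, on data generated as above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# validation_fraction is held out internally; training stops once the
# validation score fails to improve for n_iter_no_change consecutive rounds
gb = GradientBoostingClassifier(
    n_estimators=500,            # upper bound, usually not reached
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=42,
)
gb.fit(X_tr, y_tr)

print(f"Boosting rounds actually fitted: {gb.n_estimators_} (cap was 500)")
print(f"Test set accuracy: {gb.score(X_te, y_te):.4f}")
```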

7.7 Voting and Stacking Methods

7.7.1 Voting Classifier

python
# Create base learners
lr_clf = LogisticRegression(random_state=42)
rf_clf = RandomForestClassifier(n_estimators=50, random_state=42)
svm_clf = SVC(random_state=42, probability=True)

# Hard voting
hard_voting_clf = VotingClassifier(
    estimators=[('lr', lr_clf), ('rf', rf_clf), ('svm', svm_clf)],
    voting='hard'
)

# Soft voting
soft_voting_clf = VotingClassifier(
    estimators=[('lr', lr_clf), ('rf', rf_clf), ('svm', svm_clf)],
    voting='soft'
)

# Train all models
models = {
    'Logistic Regression': lr_clf,
    'Random Forest': rf_clf,
    'SVM': svm_clf,
    'Hard Voting': hard_voting_clf,
    'Soft Voting': soft_voting_clf
}

voting_results = {}

print("Voting method performance comparison:")
print("Model\t\tAccuracy")
print("-" * 30)

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    voting_results[name] = accuracy
    print(f"{name}\t{accuracy:.4f}")

# Visualize voting method effects
plt.figure(figsize=(10, 6))
names = list(voting_results.keys())
accuracies = list(voting_results.values())

colors = ['blue', 'green', 'red', 'orange', 'purple']
bars = plt.bar(names, accuracies, color=colors, alpha=0.7)

plt.title('Voting Method Performance Comparison')
plt.ylabel('Accuracy')
plt.xticks(rotation=45)

# Add numeric labels
for bar, acc in zip(bars, accuracies):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.005,
             f'{acc:.4f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

7.7.2 Simple Stacking Example

python
from sklearn.model_selection import cross_val_predict

def simple_stacking_classifier(base_models, meta_model, X_train, y_train, X_test):
    """Simple stacking classifier implementation"""

    # First layer: cross-validation predictions of base learners
    meta_features_train = np.zeros((X_train.shape[0], len(base_models)))
    meta_features_test = np.zeros((X_test.shape[0], len(base_models)))

    for i, (name, model) in enumerate(base_models.items()):
        # Cross-validation predict training set
        meta_features_train[:, i] = cross_val_predict(
            model, X_train, y_train, cv=5, method='predict_proba'
        )[:, 1]

        # Train model and predict test set
        model.fit(X_train, y_train)
        meta_features_test[:, i] = model.predict_proba(X_test)[:, 1]

    # Second layer: meta-learner
    meta_model.fit(meta_features_train, y_train)
    stacking_pred = meta_model.predict(meta_features_test)

    return stacking_pred, meta_features_train, meta_features_test

# Define base learners and meta-learner
base_models = {
    'lr': LogisticRegression(random_state=42),
    'rf': RandomForestClassifier(n_estimators=50, random_state=42),
    'svm': SVC(random_state=42, probability=True)
}

meta_model = LogisticRegression(random_state=42)

# Execute stacking
stacking_pred, meta_train, meta_test = simple_stacking_classifier(
    base_models, meta_model, X_train, y_train, X_test
)

stacking_accuracy = accuracy_score(y_test, stacking_pred)
print(f"Stacking method accuracy: {stacking_accuracy:.4f}")

# Visualize meta-features
plt.figure(figsize=(12, 8))

# Distribution of meta-features
for i, (name, _) in enumerate(base_models.items()):
    plt.subplot(2, 2, i+1)
    plt.hist(meta_train[y_train==0, i], alpha=0.6, label='Class 0', bins=20)
    plt.hist(meta_train[y_train==1, i], alpha=0.6, label='Class 1', bins=20)
    plt.title(f'Meta-feature Distribution of {name}')
    plt.xlabel('Predicted Probability')
    plt.ylabel('Frequency')
    plt.legend()

# Correlation of meta-features
plt.subplot(2, 2, 4)
meta_df = pd.DataFrame(meta_train, columns=list(base_models.keys()))
correlation_matrix = meta_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation of Base Learner Predictions')

plt.tight_layout()
plt.show()

# Final performance comparison
final_results = voting_results.copy()
final_results['Stacking Method'] = stacking_accuracy

print("\nAll ensemble methods performance summary:")
for name, accuracy in sorted(final_results.items(), key=lambda x: x[1], reverse=True):
    print(f"{name}: {accuracy:.4f}")
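
scikit-learn packages this pattern as StackingClassifier, which builds the cross-validated meta-features internally. A sketch that mirrors the manual implementation above (same base learners, logistic-regression meta-learner) on comparable synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# cv=5 reproduces the manual scheme: out-of-fold predictions of the base
# learners become the training features of final_estimator
stacking = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42, max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('svm', SVC(random_state=42, probability=True)),
    ],
    final_estimator=LogisticRegression(random_state=42),
    cv=5,
)
stacking.fit(X_tr, y_tr)

print(f"StackingClassifier test accuracy: {stacking.score(X_te, y_te):.4f}")
```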

7.8 Real-world Application Cases

7.8.1 Breast Cancer Diagnosis

python
# Load breast cancer dataset
cancer = load_breast_cancer()
X_cancer, y_cancer = cancer.data, cancer.target

X_train_cancer, X_test_cancer, y_train_cancer, y_test_cancer = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

# Feature standardization
scaler = StandardScaler()
X_train_cancer_scaled = scaler.fit_transform(X_train_cancer)
X_test_cancer_scaled = scaler.transform(X_test_cancer)

# Build multiple ensemble models
ensemble_models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Extra Trees': ExtraTreesClassifier(n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'Voting Classifier': VotingClassifier([
        ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('ada', AdaBoostClassifier(n_estimators=50, random_state=42)),
        ('gb', GradientBoostingClassifier(n_estimators=50, random_state=42))
    ], voting='soft')
}

cancer_results = {}

print("Breast cancer diagnosis model performance comparison:")
print("Model\t\t\tAccuracy\t\tAUC Score")
print("-" * 50)

for name, model in ensemble_models.items():
    # Train model. All of these ensembles are tree-based, so feature scaling
    # is not strictly required; we use the scaled data throughout for consistency
    model.fit(X_train_cancer_scaled, y_train_cancer)
    y_pred = model.predict(X_test_cancer_scaled)
    y_pred_proba = model.predict_proba(X_test_cancer_scaled)[:, 1]

    # Evaluate
    accuracy = accuracy_score(y_test_cancer, y_pred)
    auc_score = roc_auc_score(y_test_cancer, y_pred_proba)

    cancer_results[name] = {'accuracy': accuracy, 'auc': auc_score}
    print(f"{name:<20}\t{accuracy:.4f}\t\t{auc_score:.4f}")

# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Accuracy comparison
names = list(cancer_results.keys())
accuracies = [cancer_results[name]['accuracy'] for name in names]
aucs = [cancer_results[name]['auc'] for name in names]

axes[0].bar(names, accuracies, color='skyblue', alpha=0.7)
axes[0].set_title('Breast Cancer Diagnosis Model Accuracy Comparison')
axes[0].set_ylabel('Accuracy')
axes[0].tick_params(axis='x', rotation=45)
axes[0].set_ylim(0.9, 1.0)

# AUC comparison
axes[1].bar(names, aucs, color='lightcoral', alpha=0.7)
axes[1].set_title('Breast Cancer Diagnosis Model AUC Comparison')
axes[1].set_ylabel('AUC Score')
axes[1].tick_params(axis='x', rotation=45)
axes[1].set_ylim(0.9, 1.0)

plt.tight_layout()
plt.show()

7.8.2 Comprehensive Feature Importance Analysis

python
# Get feature importance from different models
feature_names_cancer = cancer.feature_names

# Only analyze models with feature_importances_ attribute
importance_models = {
    'Random Forest': ensemble_models['Random Forest'],
    'Extra Trees': ensemble_models['Extra Trees'],
    'Gradient Boosting': ensemble_models['Gradient Boosting']
}

# Collect feature importance
importance_data = {}
for name, model in importance_models.items():
    importance_data[name] = model.feature_importances_

# Create feature importance DataFrame
importance_df = pd.DataFrame(importance_data, index=feature_names_cancer)

# Calculate average importance
importance_df['Average Importance'] = importance_df.mean(axis=1)
importance_df = importance_df.sort_values('Average Importance', ascending=False)

# Visualize top 15 most important features
plt.figure(figsize=(15, 10))

# Heatmap
plt.subplot(2, 1, 1)
top_features = importance_df.head(15)
sns.heatmap(top_features[['Random Forest', 'Extra Trees', 'Gradient Boosting']],
            annot=True, cmap='YlOrRd', fmt='.3f')
plt.title('Feature Importance Heatmap of Different Ensemble Models (Top 15 Features)')

# Average importance bar chart
plt.subplot(2, 1, 2)
plt.barh(range(len(top_features)), top_features['Average Importance'])
plt.yticks(range(len(top_features)), top_features.index)
plt.xlabel('Average Feature Importance')
plt.title('Feature Importance Ranking (Top 15 Features)')
plt.gca().invert_yaxis()

plt.tight_layout()
plt.show()

print("Top 10 most important features:")
for i, (feature, row) in enumerate(top_features.head(10).iterrows()):
    print(f"{i+1:2d}. {feature}: {row['Average Importance']:.4f}")

7.9 Ensemble Method Selection Guide

7.9.1 Characteristics Comparison of Different Methods

python
def compare_ensemble_methods():
    """Compare characteristics of different ensemble methods"""

    comparison_data = {
        'Method': ['Random Forest', 'Extra Trees', 'AdaBoost', 'Gradient Boosting', 'Voting Method', 'Stacking Method'],
        'Training Style': ['Parallel', 'Parallel', 'Sequential', 'Sequential', 'Parallel', 'Layered'],
        'Base Learner': ['Decision Tree', 'Decision Tree', 'Weak Learner', 'Decision Tree', 'Multiple', 'Multiple'],
        'Overfitting Risk': ['Low', 'Low', 'Medium', 'High', 'Low', 'Medium'],
        'Training Speed': ['Fast', 'Fast', 'Medium', 'Slow', 'Medium', 'Slow'],
        'Prediction Speed': ['Fast', 'Fast', 'Fast', 'Fast', 'Medium', 'Slow'],
        'Parameter Sensitivity': ['Low', 'Low', 'Medium', 'High', 'Low', 'Medium'],
        'Interpretability': ['Medium', 'Medium', 'Low', 'Low', 'Low', 'Low']
    }

    comparison_df = pd.DataFrame(comparison_data)
    print("Comparison of ensemble method characteristics:")
    print(comparison_df.to_string(index=False))

    return comparison_df

comparison_df = compare_ensemble_methods()

# Visualize the comparison (convert the qualitative ratings to 1-5 numeric scores)
numeric_comparison = {
    'Random Forest': {'Overfitting Risk': 2, 'Training Speed': 4, 'Prediction Speed': 4, 'Parameter Sensitivity': 2, 'Interpretability': 3},
    'Extra Trees': {'Overfitting Risk': 2, 'Training Speed': 4, 'Prediction Speed': 4, 'Parameter Sensitivity': 2, 'Interpretability': 3},
    'AdaBoost': {'Overfitting Risk': 3, 'Training Speed': 3, 'Prediction Speed': 4, 'Parameter Sensitivity': 3, 'Interpretability': 2},
    'Gradient Boosting': {'Overfitting Risk': 4, 'Training Speed': 2, 'Prediction Speed': 4, 'Parameter Sensitivity': 4, 'Interpretability': 2},
    'Voting Method': {'Overfitting Risk': 2, 'Training Speed': 3, 'Prediction Speed': 3, 'Parameter Sensitivity': 2, 'Interpretability': 2},
    'Stacking Method': {'Overfitting Risk': 3, 'Training Speed': 2, 'Prediction Speed': 2, 'Parameter Sensitivity': 3, 'Interpretability': 2}
}

# Radar chart comparison
from math import pi

fig, axes = plt.subplots(2, 3, figsize=(18, 12), subplot_kw=dict(projection='polar'))
fig.suptitle('Ensemble Method Characteristics Radar Chart Comparison', fontsize=16)

categories = list(list(numeric_comparison.values())[0].keys())
N = len(categories)

angles = [n / float(N) * 2 * pi for n in range(N)]
angles += angles[:1]

for i, (method, values) in enumerate(numeric_comparison.items()):
    row = i // 3
    col = i % 3

    values_list = list(values.values())
    values_list += values_list[:1]

    axes[row, col].plot(angles, values_list, 'o-', linewidth=2, label=method)
    axes[row, col].fill(angles, values_list, alpha=0.25)
    axes[row, col].set_xticks(angles[:-1])
    axes[row, col].set_xticklabels(categories)
    axes[row, col].set_ylim(0, 5)
    axes[row, col].set_title(method)
    axes[row, col].grid(True)

plt.tight_layout()
plt.show()

7.9.2 Selection Recommendations

python
def ensemble_selection_guide():
    """Ensemble method selection guide"""

    print("Ensemble method selection guide:")
    print("=" * 50)

    scenarios = {
        "Large dataset, need fast training": {
            "Recommended": ["Random Forest", "Extra Trees"],
            "Reason": "Trees train in parallel, so these methods are fast and scale well to large datasets"
        },
        "Small dataset, pursue highest accuracy": {
            "Recommended": ["Gradient Boosting", "Stacking Method"],
            "Reason": "Make the most of limited data and usually achieve higher accuracy"
        },
        "Need model interpretability": {
            "Recommended": ["Random Forest", "Extra Trees"],
            "Reason": "Provide feature importances, and the decision process is relatively transparent"
        },
        "Noisy data": {
            "Recommended": ["Random Forest", "Voting Method"],
            "Reason": "Robust to noise, not prone to overfitting"
        },
        "Limited computational resources": {
            "Recommended": ["Random Forest", "Voting Method"],
            "Reason": "Fast training and prediction, low resource consumption"
        },
        "Imbalanced dataset": {
            "Recommended": ["Random Forest (adjust class_weight)", "AdaBoost"],
            "Reason": "Can handle class imbalance issues"
        }
    }

    for scenario, info in scenarios.items():
        print(f"\nScenario: {scenario}")
        print(f"Recommended methods: {', '.join(info['Recommended'])}")
        print(f"Reason: {info['Reason']}")

    print("\n" + "=" * 50)
    print("General recommendations:")
    print("1. First try Random Forest - usually a good baseline")
    print("2. If you need higher accuracy, try Gradient Boosting")
    print("3. If training time is a bottleneck, consider Extra Trees")
    print("4. For complex problems, try voting or stacking methods")
    print("5. Always use cross-validation to evaluate and compare different methods")

ensemble_selection_guide()
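
Recommendation 5 above (always compare methods with cross-validation) can be sketched as follows. The dataset, candidate models, and parameter settings here are illustrative choices, not prescriptions:

```python
# Hypothetical comparison of ensemble methods via 5-fold cross-validation
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier,
                              ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=10, random_state=42)

candidates = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Extra Trees': ExtraTreesClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
}

cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
    cv_results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The same loop works for any estimator that follows the scikit-learn API, so you can drop in voting or stacking ensembles without changing the comparison code.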

7.10 Exercises

Exercise 1: Random Forest Tuning

  1. Train a Random Forest classifier using the wine dataset
  2. Optimize hyperparameters through grid search
  3. Analyze the impact of different parameters on performance
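
One possible starting point for this exercise (a sketch, not a reference solution; the parameter grid here is deliberately small and should be expanded):

```python
# Grid search over a few RandomForest hyperparameters on the wine dataset
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 5],
    'max_features': ['sqrt', None],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Test accuracy: %.3f" % grid.best_estimator_.score(X_test, y_test))
```

For step 3, `grid.cv_results_` contains the mean score for every parameter combination, which you can pivot by parameter to see each one's effect.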

Exercise 2: Ensemble Method Comparison

  1. Select a regression dataset
  2. Implement and compare Random Forest, AdaBoost, and Gradient Boosting regressors
  3. Analyze their performance under different noise levels
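
A hedged sketch of the noise-level comparison (the synthetic dataset and the two noise levels are illustrative assumptions; AdaBoost is omitted here for brevity but slots into the same loop):

```python
# Compare two ensemble regressors at two noise levels using 5-fold CV R^2
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

results = {}
for noise in [5.0, 30.0]:
    X, y = make_regression(n_samples=300, n_features=10,
                           noise=noise, random_state=42)
    for name, model in [
        ('Random Forest', RandomForestRegressor(n_estimators=100, random_state=42)),
        ('Gradient Boosting', GradientBoostingRegressor(random_state=42)),
    ]:
        r2 = cross_val_score(model, X, y, cv=5, scoring='r2').mean()
        results[(name, noise)] = r2
        print(f"noise={noise:5.1f}  {name}: R^2 = {r2:.3f}")
```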

Exercise 3: Voting Classifier

  1. Build a voting classifier with at least 4 different base learners
  2. Compare the performance of hard voting and soft voting
  3. Analyze the impact of base learner diversity on ensemble performance
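
A possible starting point for steps 1 and 2 (the choice of base learners and dataset is illustrative; note that soft voting requires every base learner to expose `predict_proba`, which is why `SVC` is built with `probability=True`):

```python
# Hard vs. soft voting with four different base learners
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scale-sensitive learners are wrapped in a scaling pipeline
base_learners = [
    ('lr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('nb', GaussianNB()),
    ('svc', make_pipeline(StandardScaler(),
                          SVC(probability=True, random_state=42))),
]

accuracy = {}
for voting in ['hard', 'soft']:
    clf = VotingClassifier(estimators=base_learners, voting=voting)
    accuracy[voting] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{voting} voting accuracy: {accuracy[voting]:.3f}")
```

For step 3, try swapping one base learner for a near-duplicate of another (e.g. two Random Forests with different seeds) and observe how reduced diversity affects the ensemble score.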

Exercise 4: Feature Importance Analysis

  1. Use multiple ensemble methods to analyze feature importance on the same dataset
  2. Compare the feature importance rankings given by different methods
  3. Perform feature selection based on feature importance and evaluate the effect
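
A minimal sketch for steps 1 and 2, assuming the wine dataset and two of the ensemble methods from this chapter (extend the `models` dict with others as needed):

```python
# Compare feature-importance rankings from two ensemble methods
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

data = load_wine()
X, y = data.data, data.target

models = {
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
}

rankings = {}
for name, model in models.items():
    model.fit(X, y)
    # Sort feature indices by importance, descending, and keep the top 5
    order = np.argsort(model.feature_importances_)[::-1]
    rankings[name] = [data.feature_names[i] for i in order[:5]]
    print(f"{name} top-5 features: {rankings[name]}")
```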

7.11 Summary

In this chapter, we explored ensemble learning in depth:

Core Concepts

  • Ensemble Learning Principles: Improve performance by combining multiple learners
  • Bagging Methods: Train in parallel, reduce variance
  • Boosting Methods: Train sequentially, reduce bias
  • Voting and Stacking: Different combination strategies

Main Techniques

  • Random Forest: The most popular ensemble method, balancing performance and efficiency
  • Extra Trees: A more randomized variant of Random Forest that also randomizes split thresholds
  • AdaBoost: Adaptive boosting that reweights misclassified samples at each round
  • Gradient Boosting: Sequentially fits new learners to the current model's errors
  • Voting Methods: Simple and effective combination strategy

Practical Skills

  • Hyperparameter Tuning: Grid search, cross-validation
  • Feature Importance: Analyze feature contributions from multiple perspectives
  • Model Selection: Choose appropriate ensemble methods according to scenarios
  • Performance Evaluation: Comprehensive model comparison and analysis

Key Points

  • Ensemble methods usually perform better than single learners
  • Different ensemble methods are suitable for different scenarios
  • Diversity of base learners is key to ensemble success
  • Need to find a balance between performance and computational complexity

7.12 Next Steps

Now you have mastered powerful ensemble learning methods! In the next chapter, Support Vector Machines, we will study another important machine learning algorithm, one with unique strengths for handling high-dimensional data and nonlinear problems.


Chapter Points Review:

  • ✅ Understood the basic principles and advantages of ensemble learning
  • ✅ Mastered Random Forest and various Bagging methods
  • ✅ Learned the principles and applications of Boosting methods
  • ✅ Understood voting and stacking combination strategies
  • ✅ Mastered ensemble method hyperparameter tuning
  • ✅ Able to choose appropriate ensemble methods for real-world scenarios
