Chapter 5: Logistic Regression in Practice
Logistic regression is one of the most widely used classification algorithms in machine learning. Despite the "regression" in its name, it is a classification algorithm: it fits a linear model and passes the result through the logistic function to produce class probabilities. This chapter covers the principles, implementation, and applications of logistic regression.
5.1 What is Logistic Regression?
Logistic regression models class probabilities in binary classification problems using the logistic (sigmoid) function. Rather than predicting a class label directly, it predicts the probability that a sample belongs to a given class.
5.1.1 Mathematical Principles
Sigmoid Function:
σ(z) = 1 / (1 + e^(-z)), where z = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
Probability Prediction:
P(y=1|x) = σ(β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ)
P(y=0|x) = 1 - P(y=1|x)
Decision Boundary:
- When P(y=1|x) ≥ 0.5, predict class 1
- When P(y=1|x) < 0.5, predict class 0
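The path from linear score to probability to label can be sketched in a few lines. This is a minimal standalone example with made-up coefficients (β₀, β₁, β₂ here are hypothetical, not values fitted by the scikit-learn model used later in the chapter):

```python
import numpy as np

def sigmoid(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for a two-feature model (illustration only)
beta0 = 0.5
beta = np.array([-1.2, 2.0])   # [β₁, β₂]
x = np.array([1.0, 0.8])       # one sample

z = beta0 + beta @ x           # linear combination: 0.5 - 1.2 + 1.6 = 0.9
p = sigmoid(z)                 # P(y=1|x)
label = int(p >= 0.5)          # decision rule at the 0.5 threshold
print(f"z={z:.2f}, P(y=1|x)={p:.3f}, predicted class={label}")
```

Since z > 0, the probability lands above 0.5 and the sample is assigned to class 1 — exactly the decision rule stated above.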
5.1.2 Differences from Linear Regression
| Feature | Linear Regression | Logistic Regression |
|---|---|---|
| Target | Predict continuous values | Predict probability/classification |
| Output Range | (-∞, +∞) | (0, 1) |
| Activation Function | None | Sigmoid |
| Loss Function | Mean Squared Error | Log Loss (negative log-likelihood) |
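To make the loss-function row concrete: logistic regression is fit by minimizing the log loss, i.e. the average of -[y·log(p) + (1-y)·log(1-p)] over samples. A quick sanity check that this formula matches scikit-learn's `log_loss` (the labels and probabilities below are toy values chosen only for illustration):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.2, 0.6, 0.8])  # predicted P(y=1|x)

# Negative log-likelihood averaged over samples
manual = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))
print(f"manual log loss:  {manual:.4f}")
print(f"sklearn log_loss: {log_loss(y_true, p_hat):.4f}")  # same value
```

Note how a confident wrong probability (e.g. p close to 0 for a true positive) would be penalized heavily, which is why log loss, not mean squared error, is the natural training objective here.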
5.2 Preparing Environment and Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, load_breast_cancer, load_wine
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, auc,
    precision_recall_curve, log_loss
)
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')
# Set random seed
np.random.seed(42)
# Set plot style
plt.style.use('seaborn-v0_8')
plt.rcParams['font.sans-serif'] = ['SimHei']  # CJK-capable font; optional when all labels are English
plt.rcParams['axes.unicode_minus'] = False
5.3 Binary Classification Logistic Regression
5.3.1 Generate Binary Classification Data
# Generate binary classification dataset
X_binary, y_binary = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    random_state=42
)
# Create DataFrame for analysis
df_binary = pd.DataFrame(X_binary, columns=['Feature1', 'Feature2'])
df_binary['Label'] = y_binary
print("Binary Classification Dataset Info:")
print(df_binary.info())
print("\nClass Distribution:")
print(df_binary['Label'].value_counts())
# Visualize data distribution
plt.figure(figsize=(10, 8))
colors = ['red', 'blue']
for i, label in enumerate([0, 1]):
    mask = y_binary == label
    plt.scatter(X_binary[mask, 0], X_binary[mask, 1],
                c=colors[i], label=f'Class {label}', alpha=0.7)
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.title('Binary Classification Data Distribution')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
5.3.2 Train Binary Classification Logistic Regression Model
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_binary, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)
# Feature standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Create and train logistic regression model
logistic_model = LogisticRegression(random_state=42)
logistic_model.fit(X_train_scaled, y_train)
# View model parameters
print("Logistic Regression Model Parameters:")
print(f"Intercept: {logistic_model.intercept_[0]:.4f}")
print(f"Coefficients: {logistic_model.coef_[0]}")
# Predict probabilities and classes
y_pred_proba = logistic_model.predict_proba(X_test_scaled)
y_pred = logistic_model.predict(X_test_scaled)
print(f"\nPrediction Examples (first 5 samples):")
for i in range(5):
    print(f"Sample {i+1}: Actual={y_test[i]}, Predicted={y_pred[i]}, "
          f"Probability=[{y_pred_proba[i][0]:.3f}, {y_pred_proba[i][1]:.3f}]")
5.3.3 Decision Boundary Visualization
def plot_decision_boundary(X, y, model, scaler=None, title="Decision Boundary"):
    """Plot decision boundary"""
    plt.figure(figsize=(10, 8))
    # Create grid
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    # Predict grid points
    grid_points = np.c_[xx.ravel(), yy.ravel()]
    if scaler:
        grid_points = scaler.transform(grid_points)
    Z = model.predict_proba(grid_points)[:, 1]
    Z = Z.reshape(xx.shape)
    # Plot contours
    plt.contourf(xx, yy, Z, levels=50, alpha=0.8, cmap='RdYlBu')
    plt.colorbar(label='P(y=1)')
    # Plot decision boundary
    plt.contour(xx, yy, Z, levels=[0.5], colors='black', linestyles='--', linewidths=2)
    # Plot data points
    colors = ['red', 'blue']
    for i, label in enumerate([0, 1]):
        mask = y == label
        plt.scatter(X[mask, 0], X[mask, 1],
                    c=colors[i], label=f'Class {label}', alpha=0.7, edgecolors='black')
    plt.xlabel('Feature1')
    plt.ylabel('Feature2')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
# Plot decision boundary
plot_decision_boundary(X_train, y_train, logistic_model, scaler, "Logistic Regression Decision Boundary")
5.3.4 Sigmoid Function Visualization
# Visualize Sigmoid function
z = np.linspace(-10, 10, 100)
sigmoid = 1 / (1 + np.exp(-z))
plt.figure(figsize=(10, 6))
plt.plot(z, sigmoid, 'b-', linewidth=2, label='Sigmoid Function')
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.7, label='Decision Threshold')
plt.axvline(x=0, color='g', linestyle='--', alpha=0.7, label='z=0')
plt.xlabel('z = β₀ + β₁x₁ + β₂x₂')
plt.ylabel('P(y=1|x)')
plt.title('Sigmoid Function')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Demonstrate conversion from linear combination to probability
sample_features = X_test_scaled[:10]
linear_combination = logistic_model.decision_function(sample_features)
probabilities = logistic_model.predict_proba(sample_features)[:, 1]
print("Linear Combination to Probability Conversion Example:")
print("Linear Combination(z)\tProbability P(y=1)\tPredicted Class")
print("-" * 40)
for i in range(len(sample_features)):
    pred_class = 1 if probabilities[i] >= 0.5 else 0
    print(f"{linear_combination[i]:8.3f}\t{probabilities[i]:8.3f}\t{pred_class:8d}")
5.4 Model Evaluation
5.4.1 Basic Evaluation Metrics
def evaluate_classification_model(y_true, y_pred, y_pred_proba=None, model_name="Model"):
    """Evaluate classification model performance"""
    print(f"{model_name} Evaluation Results:")
    print("-" * 50)
    # Basic metrics
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted')
    recall = recall_score(y_true, y_pred, average='weighted')
    f1 = f1_score(y_true, y_pred, average='weighted')
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    # Log loss (only when probabilities are provided)
    logloss = None
    if y_pred_proba is not None:
        logloss = log_loss(y_true, y_pred_proba)
        print(f"Log Loss: {logloss:.4f}")
    print("\nDetailed Classification Report:")
    print(classification_report(y_true, y_pred))
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'log_loss': logloss
    }
# Evaluate model
metrics = evaluate_classification_model(
    y_test, y_pred, y_pred_proba, "Logistic Regression"
)
5.4.2 Confusion Matrix
# Calculate and visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Class 0', 'Class 1'],
yticklabels=['Class 0', 'Class 1'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
# Calculate metrics from confusion matrix
tn, fp, fn, tp = cm.ravel()
print("Confusion Matrix Analysis:")
print(f"True Negative (TN): {tn}")
print(f"False Positive (FP): {fp}")
print(f"False Negative (FN): {fn}")
print(f"True Positive (TP): {tp}")
print(f"\nManually Calculated Metrics:")
print(f"Accuracy: {(tp + tn) / (tp + tn + fp + fn):.4f}")
print(f"Precision: {tp / (tp + fp):.4f}")
print(f"Recall: {tp / (tp + fn):.4f}")
print(f"Specificity: {tn / (tn + fp):.4f}")
5.4.3 ROC Curve and AUC
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:, 1])
roc_auc = auc(fpr, tpr)
# Plot ROC curve
plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2,
label=f'ROC Curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--',
label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()
print(f"AUC Score: {roc_auc:.4f}")
# Performance at different thresholds
print("\nPerformance at Different Thresholds:")
print("Threshold\t\tFPR\t\tTPR\t\tPrecision\t\tRecall")
print("-" * 60)
step = max(1, len(thresholds) // 10)  # guard against step 0 for short threshold arrays
for i in range(0, len(thresholds), step):
    threshold = thresholds[i]
    y_pred_threshold = (y_pred_proba[:, 1] >= threshold).astype(int)
    if len(np.unique(y_pred_threshold)) > 1:  # Avoid division by zero
        precision_thresh = precision_score(y_test, y_pred_threshold)
        recall_thresh = recall_score(y_test, y_pred_threshold)
        print(f"{threshold:.3f}\t\t{fpr[i]:.3f}\t\t{tpr[i]:.3f}\t\t{precision_thresh:.3f}\t\t{recall_thresh:.3f}")
5.4.4 Precision-Recall Curve
# Calculate precision-recall curve
precision_curve, recall_curve, pr_thresholds = precision_recall_curve(
    y_test, y_pred_proba[:, 1]
)
pr_auc = auc(recall_curve, precision_curve)
# Plot PR curve
plt.figure(figsize=(10, 8))
plt.plot(recall_curve, precision_curve, color='blue', lw=2,
label=f'PR Curve (AUC = {pr_auc:.3f})')
# Baseline (random classifier)
baseline = np.sum(y_test) / len(y_test)
plt.axhline(y=baseline, color='red', linestyle='--',
label=f'Random Classifier (Precision = {baseline:.3f})')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print(f"PR-AUC Score: {pr_auc:.4f}")
5.5 Multiclass Logistic Regression
5.5.1 Load Multiclass Data
# Use wine dataset (3 classes)
wine_data = load_wine()
X_wine = wine_data.data
y_wine = wine_data.target
feature_names_wine = wine_data.feature_names
target_names_wine = wine_data.target_names
print("Wine Dataset Info:")
print(f"Sample Count: {X_wine.shape[0]}")
print(f"Feature Count: {X_wine.shape[1]}")
print(f"Class Count: {len(np.unique(y_wine))}")
print(f"Class Names: {target_names_wine}")
# View class distribution
unique, counts = np.unique(y_wine, return_counts=True)
plt.figure(figsize=(8, 6))
plt.bar(target_names_wine, counts, color=['red', 'green', 'blue'], alpha=0.7)
plt.title('Wine Dataset Class Distribution')
plt.xlabel('Wine Type')
plt.ylabel('Sample Count')
plt.show()
for i, name in enumerate(target_names_wine):
    print(f"{name}: {counts[i]} samples")
5.5.2 Feature Analysis
# Create DataFrame for analysis
df_wine = pd.DataFrame(X_wine, columns=feature_names_wine)
df_wine['wine_type'] = y_wine
# Select several important features for visualization
important_features = ['alcohol', 'flavanoids', 'color_intensity', 'proline']
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Distribution of Important Features', fontsize=16)
for i, feature in enumerate(important_features):
    row = i // 2
    col = i % 2
    for wine_type in range(3):
        data = df_wine[df_wine['wine_type'] == wine_type][feature]
        axes[row, col].hist(data, alpha=0.6, label=target_names_wine[wine_type], bins=15)
    axes[row, col].set_title(feature)
    axes[row, col].set_xlabel(feature)
    axes[row, col].set_ylabel('Frequency')
    axes[row, col].legend()
    axes[row, col].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Feature correlation analysis
plt.figure(figsize=(12, 10))
correlation_matrix = df_wine[important_features + ['wine_type']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
square=True, linewidths=0.5)
plt.title('Important Features Correlation Matrix')
plt.tight_layout()
plt.show()
5.5.3 Train Multiclass Logistic Regression
# Split data
X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(
X_wine, y_wine, test_size=0.2, random_state=42, stratify=y_wine
)
# Feature standardization
scaler_wine = StandardScaler()
X_train_wine_scaled = scaler_wine.fit_transform(X_train_wine)
X_test_wine_scaled = scaler_wine.transform(X_test_wine)
# Train multiclass logistic regression
# multi_class='ovr': One-vs-Rest strategy
# multi_class='multinomial': Multinomial logistic regression
# (note: recent scikit-learn versions deprecate multi_class; multinomial is the default)
logistic_multi = LogisticRegression(
    multi_class='multinomial',
    solver='lbfgs',
    random_state=42,
    max_iter=1000
)
logistic_multi.fit(X_train_wine_scaled, y_train_wine)
print("Multiclass Logistic Regression Model Info:")
print(f"Class Count: {len(logistic_multi.classes_)}")
print(f"Coefficient Matrix Shape: {logistic_multi.coef_.shape}")
print(f"Intercepts: {logistic_multi.intercept_}")
# Predict
y_pred_wine = logistic_multi.predict(X_test_wine_scaled)
y_pred_proba_wine = logistic_multi.predict_proba(X_test_wine_scaled)
# Evaluate
wine_metrics = evaluate_classification_model(
    y_test_wine, y_pred_wine, y_pred_proba_wine, "Multiclass Logistic Regression"
)
5.5.4 Multiclass Confusion Matrix
# Multiclass confusion matrix
cm_wine = confusion_matrix(y_test_wine, y_pred_wine)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_wine, annot=True, fmt='d', cmap='Blues',
xticklabels=target_names_wine,
yticklabels=target_names_wine)
plt.title('Multiclass Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
# Performance for each class
print("Detailed Performance for Each Class:")
for i, class_name in enumerate(target_names_wine):
    class_precision = precision_score(y_test_wine, y_pred_wine,
                                      labels=[i], average=None)[0]
    class_recall = recall_score(y_test_wine, y_pred_wine,
                                labels=[i], average=None)[0]
    class_f1 = f1_score(y_test_wine, y_pred_wine,
                        labels=[i], average=None)[0]
    print(f"{class_name}:")
    print(f"  Precision: {class_precision:.4f}")
    print(f"  Recall: {class_recall:.4f}")
    print(f"  F1 Score: {class_f1:.4f}")
5.5.5 One-vs-Rest vs Multinomial Comparison
# Compare different multiclass strategies
strategies = ['ovr', 'multinomial']
strategy_results = {}
for strategy in strategies:
    model = LogisticRegression(
        multi_class=strategy,
        solver='lbfgs',
        random_state=42,
        max_iter=1000
    )
    model.fit(X_train_wine_scaled, y_train_wine)
    y_pred = model.predict(X_test_wine_scaled)
    accuracy = accuracy_score(y_test_wine, y_pred)
    f1 = f1_score(y_test_wine, y_pred, average='weighted')
    strategy_results[strategy] = {'accuracy': accuracy, 'f1': f1}
    print(f"{strategy.upper()} Strategy:")
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  F1 Score: {f1:.4f}")
    print()
# Visualize comparison
strategies_df = pd.DataFrame(strategy_results).T
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
strategies_df['accuracy'].plot(kind='bar', ax=axes[0], color='skyblue')
axes[0].set_title('Accuracy Comparison')
axes[0].set_ylabel('Accuracy')
axes[0].tick_params(axis='x', rotation=0)
strategies_df['f1'].plot(kind='bar', ax=axes[1], color='lightcoral')
axes[1].set_title('F1 Score Comparison')
axes[1].set_ylabel('F1 Score')
axes[1].tick_params(axis='x', rotation=0)
plt.tight_layout()
plt.show()
5.6 Regularized Logistic Regression
5.6.1 L1 and L2 Regularization
# Create high-dimensional dataset to test regularization effects
X_high_dim, y_high_dim = make_classification(
n_samples=500,
n_features=50,
n_informative=10,
n_redundant=10,
n_clusters_per_class=1,
random_state=42
)
X_train_hd, X_test_hd, y_train_hd, y_test_hd = train_test_split(
X_high_dim, y_high_dim, test_size=0.2, random_state=42
)
# Standardization
scaler_hd = StandardScaler()
X_train_hd_scaled = scaler_hd.fit_transform(X_train_hd)
X_test_hd_scaled = scaler_hd.transform(X_test_hd)
# Compare different regularization methods
penalties = ['none', 'l1', 'l2', 'elasticnet']
C_values = [0.01, 0.1, 1, 10, 100]
results = {}
for penalty in penalties:
    if penalty == 'none':
        # Newer scikit-learn versions use penalty=None instead of the string 'none'
        model = LogisticRegression(penalty=None, solver='lbfgs',
                                   random_state=42, max_iter=1000)
        model.fit(X_train_hd_scaled, y_train_hd)
        y_pred = model.predict(X_test_hd_scaled)
        results[penalty] = accuracy_score(y_test_hd, y_pred)
    elif penalty == 'elasticnet':
        model = LogisticRegression(penalty=penalty, solver='saga',
                                   C=1.0, l1_ratio=0.5,
                                   random_state=42, max_iter=1000)
        model.fit(X_train_hd_scaled, y_train_hd)
        y_pred = model.predict(X_test_hd_scaled)
        results[penalty] = accuracy_score(y_test_hd, y_pred)
    else:
        best_accuracy = 0
        best_C = None
        for C in C_values:
            solver = 'liblinear' if penalty == 'l1' else 'lbfgs'
            model = LogisticRegression(penalty=penalty, C=C, solver=solver,
                                       random_state=42, max_iter=1000)
            model.fit(X_train_hd_scaled, y_train_hd)
            y_pred = model.predict(X_test_hd_scaled)
            accuracy = accuracy_score(y_test_hd, y_pred)
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_C = C
        results[f'{penalty} (C={best_C})'] = best_accuracy
print("Regularization Method Comparison:")
for method, accuracy in results.items():
    print(f"{method}: {accuracy:.4f}")
5.6.2 Regularization Path Visualization
from sklearn.linear_model import LogisticRegressionCV
# L1 regularization path
l1_model = LogisticRegressionCV(
penalty='l1',
solver='liblinear',
Cs=np.logspace(-4, 2, 20),
cv=5,
random_state=42
)
l1_model.fit(X_train_hd_scaled, y_train_hd)
# L2 regularization path
l2_model = LogisticRegressionCV(
penalty='l2',
solver='lbfgs',
Cs=np.logspace(-4, 2, 20),
cv=5,
random_state=42
)
l2_model.fit(X_train_hd_scaled, y_train_hd)
print(f"L1 Best C: {l1_model.C_[0]:.4f}")
print(f"L2 Best C: {l2_model.C_[0]:.4f}")
# Visualize coefficient paths
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# L1 path
C_range = np.logspace(-4, 2, 20)
coefs_l1 = []
for C in C_range:
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear',
                               random_state=42, max_iter=1000)
    model.fit(X_train_hd_scaled, y_train_hd)
    coefs_l1.append(model.coef_[0])
coefs_l1 = np.array(coefs_l1)
for i in range(min(10, coefs_l1.shape[1])):  # Only show first 10 features
    axes[0].plot(C_range, coefs_l1[:, i], label=f'Feature{i+1}')
axes[0].set_xscale('log')
axes[0].set_xlabel('C (Inverse of Regularization Strength)')
axes[0].set_ylabel('Coefficient Value')
axes[0].set_title('L1 Regularization Path')
axes[0].grid(True, alpha=0.3)
axes[0].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
# L2 path
coefs_l2 = []
for C in C_range:
    model = LogisticRegression(penalty='l2', C=C, solver='lbfgs',
                               random_state=42, max_iter=1000)
    model.fit(X_train_hd_scaled, y_train_hd)
    coefs_l2.append(model.coef_[0])
coefs_l2 = np.array(coefs_l2)
for i in range(min(10, coefs_l2.shape[1])):  # Only show first 10 features
    axes[1].plot(C_range, coefs_l2[:, i], label=f'Feature{i+1}')
axes[1].set_xscale('log')
axes[1].set_xlabel('C (Inverse of Regularization Strength)')
axes[1].set_ylabel('Coefficient Value')
axes[1].set_title('L2 Regularization Path')
axes[1].grid(True, alpha=0.3)
axes[1].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
# Feature selection effect comparison
l1_final = LogisticRegression(penalty='l1', C=l1_model.C_[0],
                              solver='liblinear', random_state=42, max_iter=1000)
l1_final.fit(X_train_hd_scaled, y_train_hd)
l2_final = LogisticRegression(penalty='l2', C=l2_model.C_[0],
                              solver='lbfgs', random_state=42, max_iter=1000)
l2_final.fit(X_train_hd_scaled, y_train_hd)
print(f"L1 Regularization Non-zero Coefficients: {np.sum(l1_final.coef_[0] != 0)}/{len(l1_final.coef_[0])}")
print(f"L2 Regularization Non-zero Coefficients: {np.sum(l2_final.coef_[0] != 0)}/{len(l2_final.coef_[0])}")
5.7 Hyperparameter Tuning
5.7.1 Grid Search
# Use grid search to optimize hyperparameters
param_grid = {
'C': [0.01, 0.1, 1, 10, 100],
'penalty': ['l1', 'l2'],
'solver': ['liblinear'] # Supports l1 and l2
}
grid_search = GridSearchCV(
LogisticRegression(random_state=42, max_iter=1000),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X_train_hd_scaled, y_train_hd)
print("Grid Search Results:")
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Score: {grid_search.best_score_:.4f}")
# Test set performance
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test_hd_scaled)
test_accuracy = accuracy_score(y_test_hd, y_pred_best)
print(f"Test Set Accuracy: {test_accuracy:.4f}")
# Visualize grid search results
results_df = pd.DataFrame(grid_search.cv_results_)
plt.figure(figsize=(10, 8))
pivot_table = results_df.pivot_table(
values='mean_test_score',
index='param_penalty',
columns='param_C'
)
sns.heatmap(pivot_table, annot=True, cmap='viridis', fmt='.4f')
plt.title('Grid Search Results Heatmap')
plt.xlabel('C Value')
plt.ylabel('Regularization Type')
plt.show()
5.7.2 Learning Curve Analysis
from sklearn.model_selection import learning_curve
def plot_learning_curve_classification(estimator, X, y, title="Learning Curve"):
    """Plot learning curve for classification model"""
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy'
    )
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
                     alpha=0.1, color='blue')
    plt.plot(train_sizes, val_mean, 'o-', color='red', label='Validation Score')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std,
                     alpha=0.1, color='red')
    plt.xlabel('Number of Training Samples')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
# Plot learning curve for best model
plot_learning_curve_classification(
    best_model, X_train_hd_scaled, y_train_hd,
    "Best Logistic Regression Model Learning Curve"
)
5.8 Practical Application Cases
5.8.1 Breast Cancer Diagnosis Case
# Load breast cancer dataset
cancer_data = load_breast_cancer()
X_cancer = cancer_data.data
y_cancer = cancer_data.target
feature_names_cancer = cancer_data.feature_names
target_names_cancer = cancer_data.target_names
print("Breast Cancer Dataset Info:")
print(f"Sample Count: {X_cancer.shape[0]}")
print(f"Feature Count: {X_cancer.shape[1]}")
print(f"Classes: {target_names_cancer}")
# View class distribution
unique, counts = np.unique(y_cancer, return_counts=True)
print(f"Benign: {counts[1]} samples")
print(f"Malignant: {counts[0]} samples")
# Split data
X_train_cancer, X_test_cancer, y_train_cancer, y_test_cancer = train_test_split(
X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)
# Create complete preprocessing and modeling pipeline
cancer_pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression(random_state=42, max_iter=1000))
])
# Train model
cancer_pipeline.fit(X_train_cancer, y_train_cancer)
# Predict and evaluate
y_pred_cancer = cancer_pipeline.predict(X_test_cancer)
y_pred_proba_cancer = cancer_pipeline.predict_proba(X_test_cancer)
print("\nBreast Cancer Diagnosis Model Evaluation:")
cancer_metrics = evaluate_classification_model(
    y_test_cancer, y_pred_cancer, y_pred_proba_cancer, "Breast Cancer Diagnosis Model"
)
5.8.2 Feature Importance Analysis
# Get feature importance (based on absolute coefficient values)
classifier = cancer_pipeline.named_steps['classifier']
feature_importance = np.abs(classifier.coef_[0])
# Create feature importance DataFrame
importance_df = pd.DataFrame({
'feature': feature_names_cancer,
'importance': feature_importance
}).sort_values('importance', ascending=False)
# Visualize top 15 most important features
plt.figure(figsize=(10, 8))
top_features = importance_df.head(15)
plt.barh(range(len(top_features)), top_features['importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Feature Importance (Absolute Coefficient Value)')
plt.title('Breast Cancer Diagnosis Model - Top 15 Important Features')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
print("Top 10 Most Important Features:")
for i, (_, row) in enumerate(top_features.head(10).iterrows()):
    print(f"{i+1:2d}. {row['feature']}: {row['importance']:.4f}")
5.8.3 Model Interpretation and Prediction Examples
# Predict new samples
def predict_cancer_diagnosis(model, scaler, sample_features, feature_names):
    """Predict breast cancer diagnosis result"""
    # Standardize features (used only for the coefficient-contribution analysis below)
    sample_scaled = scaler.transform(sample_features.reshape(1, -1))
    # Predict with the full pipeline on the raw features
    # (the pipeline applies its own scaling, so passing pre-scaled data would scale twice)
    proba = model.predict_proba(sample_features.reshape(1, -1))[0]
    prediction = model.predict(sample_features.reshape(1, -1))[0]
    print("Breast Cancer Diagnosis Prediction Result:")
    print(f"Predicted Class: {'Benign' if prediction == 1 else 'Malignant'}")
    print(f"Malignant Probability: {proba[0]:.3f}")
    print(f"Benign Probability: {proba[1]:.3f}")
    # Display contribution of most important features
    classifier = model.named_steps['classifier']
    coefficients = classifier.coef_[0]
    print("\nImportant Feature Contribution Analysis:")
    feature_contributions = sample_scaled[0] * coefficients
    # Get features with largest contribution
    top_indices = np.argsort(np.abs(feature_contributions))[-5:]
    for idx in reversed(top_indices):
        contribution = feature_contributions[idx]
        direction = "Supports Malignant" if contribution < 0 else "Supports Benign"
        print(f"{feature_names[idx]}: {contribution:.3f} ({direction})")
# Use a sample from test set for demonstration
sample_idx = 0
sample_features = X_test_cancer[sample_idx]
true_label = y_test_cancer[sample_idx]
print(f"True Label: {'Benign' if true_label == 1 else 'Malignant'}")
predict_cancer_diagnosis(cancer_pipeline,
                         cancer_pipeline.named_steps['scaler'],
                         sample_features,
                         feature_names_cancer)
5.9 Exercises
Exercise 1: Basic Logistic Regression
- Use make_classification to generate a binary classification dataset
- Train a logistic regression model and draw the decision boundary
- Analyze the impact of different thresholds on classification results
Exercise 2: Multiclass Problems
- Use the iris dataset to train a multiclass logistic regression model
- Compare the performance of the One-vs-Rest and Multinomial strategies
- Analyze the classification difficulty for each class
Exercise 3: Imbalanced Data Handling
- Create an imbalanced binary classification dataset (ratio 1:9)
- Use different evaluation metrics to assess model performance
- Try using the class_weight='balanced' parameter to improve performance
Exercise 4: Feature Selection
- Use a high-dimensional dataset (more than 100 features)
- Compare the feature selection effects of L1 and L2 regularization
- Analyze the impact of regularization strength on model performance
5.10 Summary
In this chapter, we studied logistic regression in depth:
Core Concepts
- Logistic Regression Principles: Sigmoid function, probability prediction, decision boundary
- Multiclass Strategies: One-vs-Rest, Multinomial
- Regularization Methods: L1, L2, ElasticNet
Main Techniques
- Model Training: Binary and multiclass logistic regression
- Performance Evaluation: Accuracy, Precision, Recall, F1, AUC
- Visualization Techniques: ROC curve, PR curve, decision boundary
- Hyperparameter Tuning: Grid search, cross-validation
Practical Skills
- Data Preprocessing: Standardization, feature selection
- Model Interpretation: Coefficient analysis, feature importance
- Real Applications: Medical diagnosis, classification prediction
- Performance Optimization: Regularization, threshold adjustment
Key Points
- Logistic regression is a linear classifier suitable for linearly separable problems
- The Sigmoid function maps linear combinations to probability space
- Regularization can prevent overfitting and perform feature selection
- The choice of evaluation metrics depends on specific business requirements
5.11 Next Steps
Now you have mastered logistic regression, a cornerstone classification algorithm! In the next chapter, Decision Tree Algorithm, we will learn a completely different kind of model: the decision tree, which offers excellent interpretability and is the foundation for understanding more complex ensemble methods.
Chapter Key Points Review:
- ✓ Understood the mathematical principles of logistic regression and Sigmoid function
- ✓ Mastered the implementation of binary and multiclass logistic regression
- ✓ Learned to use various evaluation metrics for classification models
- ✓ Understood the application of regularization in logistic regression
- ✓ Mastered the drawing and interpretation of ROC curves and PR curves
- ✓ Able to build a complete classification prediction system