
Chapter 4: Linear Regression in Detail

Linear regression is one of the most fundamental algorithms in machine learning: it is simple, interpretable, and lays the groundwork for understanding more complex methods. This chapter examines the principles, implementation, and applications of linear regression in detail.

4.1 What is Linear Regression?

Linear regression is a supervised learning algorithm used for predicting continuous numerical values. It assumes a linear relationship between the target variable and feature variables.

4.1.1 Mathematical Principles

For simple linear regression (one feature):

y = β₀ + β₁x + ε

For multiple linear regression (multiple features):

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

Where:

  • y: Target variable (dependent variable)
  • x: Feature variable (independent variable)
  • β₀: Intercept (bias term)
  • β₁, β₂, ..., βₙ: Regression coefficients (weights)
  • ε: Error term
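The coefficients are typically estimated by ordinary least squares (OLS), which minimizes the sum of squared errors and has the closed-form solution β̂ = (XᵀX)⁻¹Xᵀy. As a minimal sketch on a small synthetic dataset (the data here is hypothetical, chosen with known true parameters β₀ = 10, β₁ = 2), the solution can be computed directly with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 10 + 2 * x + rng.normal(0, 1, n)  # true intercept 10, slope 2, plus noise

# Design matrix with a leading column of ones for the intercept term
X = np.column_stack([np.ones(n), x])

# Solve the least-squares problem; lstsq is numerically safer than an explicit inverse
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to [10, 2]
```

The same estimates are what `sklearn.linear_model.LinearRegression` returns as `intercept_` and `coef_`.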

4.1.2 Core Assumptions

Linear regression is based on the following assumptions:

  1. Linearity: There is a linear relationship between features and target variable
  2. Independence: Observations are independent of each other
  3. Homoscedasticity: The variance of error terms is constant
  4. Normality: Error terms follow a normal distribution
  5. No Multicollinearity: No feature is an exact linear combination of the other features
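Assumption 5 can be checked with the variance inflation factor (VIF): regress each feature on the others and compute VIF = 1 / (1 − R²); values far above 10 signal problematic collinearity. Below is a NumPy-only sketch on a hypothetical dataset in which the third feature is nearly a copy of the first:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + rng.normal(scale=0.05, size=n)  # almost collinear with x1
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing X[:, j] on the remaining columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

for j in range(X.shape[1]):
    print(f"VIF of feature {j + 1}: {vif(X, j):.1f}")  # features 1 and 3 are far above 10
```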

4.2 Preparing Data and Environment

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_regression  # load_boston was removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)

# Set plot style
plt.style.use('seaborn-v0_8')

4.3 Simple Linear Regression

4.3.1 Generate Example Data

python
# Generate simple linear data
def generate_simple_data(n_samples=100, noise=10):
    """Generate simple linear regression data"""
    np.random.seed(42)
    X = np.random.uniform(0, 100, n_samples)
    y = 2 * X + 10 + np.random.normal(0, noise, n_samples)  # y = 2x + 10 + noise
    return X.reshape(-1, 1), y

# Generate data
X_simple, y_simple = generate_simple_data(100, 15)

# Visualize data
plt.figure(figsize=(10, 6))
plt.scatter(X_simple, y_simple, alpha=0.6, color='blue')
plt.xlabel('Feature X')
plt.ylabel('Target Variable y')
plt.title('Simple Linear Regression Data')
plt.grid(True, alpha=0.3)
plt.show()

print(f"Data shape: X={X_simple.shape}, y={y_simple.shape}")

4.3.2 Train Simple Linear Regression Model

python
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_simple, y_simple, test_size=0.2, random_state=42
)

# Create and train model
model_simple = LinearRegression()
model_simple.fit(X_train, y_train)

# View model parameters
print("Model parameters:")
print(f"Intercept (β₀): {model_simple.intercept_:.4f}")
print(f"Slope (β₁): {model_simple.coef_[0]:.4f}")
print(f"True parameters: Intercept=10, Slope=2")

# Make predictions
y_pred_train = model_simple.predict(X_train)
y_pred_test = model_simple.predict(X_test)

# Visualize results
plt.figure(figsize=(12, 5))

# Training set results
plt.subplot(1, 2, 1)
plt.scatter(X_train, y_train, alpha=0.6, label='Training Data')
plt.plot(X_train, y_pred_train, color='red', linewidth=2, label='Fitted Line')
plt.xlabel('Feature X')
plt.ylabel('Target Variable y')
plt.title('Training Set Fitting Results')
plt.legend()
plt.grid(True, alpha=0.3)

# Test set results
plt.subplot(1, 2, 2)
plt.scatter(X_test, y_test, alpha=0.6, label='Test Data', color='green')
plt.plot(X_test, y_pred_test, color='red', linewidth=2, label='Predicted Line')
plt.xlabel('Feature X')
plt.ylabel('Target Variable y')
plt.title('Test Set Prediction Results')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

4.3.3 Model Evaluation

python
# Calculate evaluation metrics
def evaluate_regression_model(y_true, y_pred, model_name="Model"):
    """Calculate evaluation metrics for regression model"""
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    
    print(f"{model_name} Evaluation Results:")
    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"Coefficient of Determination (R²): {r2:.4f}")
    print("-" * 40)
    
    return {'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'R2': r2}

# Evaluate training and test set performance
train_metrics = evaluate_regression_model(y_train, y_pred_train, "Training Set")
test_metrics = evaluate_regression_model(y_test, y_pred_test, "Test Set")
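These metrics follow directly from their definitions, e.g. R² = 1 − SS_res / SS_tot. A quick sketch on a toy pair of arrays (hypothetical values, not the chapter's data) confirms the manual formulas agree with scikit-learn:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot
mse_manual = ss_res / len(y_true)

print(np.isclose(r2_manual, r2_score(y_true, y_pred)))             # True
print(np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))  # True
```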

4.4 Multiple Linear Regression

4.4.1 Using Real Dataset

python
# Create more complex dataset
X_multi, y_multi = make_regression(
    n_samples=500,
    n_features=5,
    n_informative=3,
    noise=10,
    random_state=42
)

# Create feature names
feature_names = [f'Feature_{i+1}' for i in range(X_multi.shape[1])]

# Convert to DataFrame for analysis
df_multi = pd.DataFrame(X_multi, columns=feature_names)
df_multi['Target Variable'] = y_multi

print("Multiple Regression Dataset Info:")
print(df_multi.info())
print("\nData Statistics Summary:")
print(df_multi.describe())

4.4.2 Exploratory Data Analysis

python
# Correlation analysis
plt.figure(figsize=(10, 8))
correlation_matrix = df_multi.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()

# Relationship between features and target variable
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Relationship Between Features and Target Variable', fontsize=16)

for i, feature in enumerate(feature_names):
    row, col = divmod(i, 3)
    axes[row, col].scatter(df_multi[feature], df_multi['Target Variable'], alpha=0.6)
    axes[row, col].set_xlabel(feature)
    axes[row, col].set_ylabel('Target Variable')
    axes[row, col].set_title(f'{feature} vs Target Variable')

    # Add trend line
    z = np.polyfit(df_multi[feature], df_multi['Target Variable'], 1)
    p = np.poly1d(z)
    axes[row, col].plot(df_multi[feature], p(df_multi[feature]), "r--", alpha=0.8)

# Remove the unused sixth subplot (5 features, 6 axes)
axes[1, 2].remove()

plt.tight_layout()
plt.show()

4.4.3 Train Multiple Linear Regression Model

python
# Prepare data
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(
    X_multi, y_multi, test_size=0.2, random_state=42
)

# Feature standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_multi)
X_test_scaled = scaler.transform(X_test_multi)

# Train model
model_multi = LinearRegression()
model_multi.fit(X_train_scaled, y_train_multi)

# View model parameters
print("Multiple Linear Regression Model Parameters:")
print(f"Intercept: {model_multi.intercept_:.4f}")
print("Regression Coefficients:")
for i, coef in enumerate(model_multi.coef_):
    print(f"  {feature_names[i]}: {coef:.4f}")

# Feature importance visualization
plt.figure(figsize=(10, 6))
feature_importance = np.abs(model_multi.coef_)
plt.barh(feature_names, feature_importance)
plt.xlabel('Absolute Coefficient Value')
plt.title('Feature Importance (Based on Regression Coefficients)')
plt.tight_layout()
plt.show()

4.4.4 Model Prediction and Evaluation

python
# Make predictions
y_pred_train_multi = model_multi.predict(X_train_scaled)
y_pred_test_multi = model_multi.predict(X_test_scaled)

# Evaluate model performance
print("Multiple Linear Regression Model Evaluation:")
train_metrics_multi = evaluate_regression_model(y_train_multi, y_pred_train_multi, "Training Set")
test_metrics_multi = evaluate_regression_model(y_test_multi, y_pred_test_multi, "Test Set")

# Predicted vs actual visualization
plt.figure(figsize=(12, 5))

# Training set
plt.subplot(1, 2, 1)
plt.scatter(y_train_multi, y_pred_train_multi, alpha=0.6)
plt.plot([y_train_multi.min(), y_train_multi.max()], 
         [y_train_multi.min(), y_train_multi.max()], 'r--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title(f'Training Set: Actual vs Predicted (R² = {train_metrics_multi["R2"]:.3f})')
plt.grid(True, alpha=0.3)

# Test set
plt.subplot(1, 2, 2)
plt.scatter(y_test_multi, y_pred_test_multi, alpha=0.6, color='green')
plt.plot([y_test_multi.min(), y_test_multi.max()], 
         [y_test_multi.min(), y_test_multi.max()], 'r--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title(f'Test Set: Actual vs Predicted (R² = {test_metrics_multi["R2"]:.3f})')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

4.5 Residual Analysis

4.5.1 Residual Plots

python
# Calculate residuals
residuals_train = y_train_multi - y_pred_train_multi
residuals_test = y_test_multi - y_pred_test_multi

# Residual analysis plot
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Residual Analysis', fontsize=16)

# Residuals vs predicted values
axes[0, 0].scatter(y_pred_train_multi, residuals_train, alpha=0.6)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Predicted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Predicted Values')
axes[0, 0].grid(True, alpha=0.3)

# Residual histogram
axes[0, 1].hist(residuals_train, bins=30, alpha=0.7, edgecolor='black')
axes[0, 1].set_xlabel('Residuals')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Residual Distribution')
axes[0, 1].grid(True, alpha=0.3)

# Q-Q plot (normality test)
from scipy import stats
stats.probplot(residuals_train, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Residual Q-Q Plot')
axes[1, 0].grid(True, alpha=0.3)

# Standardized residuals
standardized_residuals = residuals_train / np.std(residuals_train)
axes[1, 1].scatter(y_pred_train_multi, standardized_residuals, alpha=0.6)
axes[1, 1].axhline(y=0, color='r', linestyle='--')
axes[1, 1].axhline(y=2, color='r', linestyle=':', alpha=0.7)
axes[1, 1].axhline(y=-2, color='r', linestyle=':', alpha=0.7)
axes[1, 1].set_xlabel('Predicted Values')
axes[1, 1].set_ylabel('Standardized Residuals')
axes[1, 1].set_title('Standardized Residual Plot')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

4.5.2 Residual Statistical Analysis

python
# Residual statistical analysis
def analyze_residuals(residuals, name="Residuals"):
    """Analyze statistical properties of residuals"""
    print(f"{name} Statistical Analysis:")
    print(f"Mean: {np.mean(residuals):.6f}")
    print(f"Standard Deviation: {np.std(residuals):.4f}")
    print(f"Skewness: {stats.skew(residuals):.4f}")
    print(f"Kurtosis: {stats.kurtosis(residuals):.4f}")
    
    # Normality test
    shapiro_stat, shapiro_p = stats.shapiro(residuals)
    print(f"Shapiro-Wilk Normality Test: statistic={shapiro_stat:.4f}, p-value={shapiro_p:.4f}")
    
    if shapiro_p > 0.05:
        print("✓ Fail to reject normality: residuals are consistent with a normal distribution")
    else:
        print("✗ Normality rejected: residuals deviate from a normal distribution")
    
    print("-" * 50)

analyze_residuals(residuals_train, "Training Set Residuals")
analyze_residuals(residuals_test, "Test Set Residuals")

4.6 Regularized Linear Regression

4.6.1 Ridge Regression (L2 Regularization)

python
# Ridge regression
from sklearn.linear_model import RidgeCV

# Use cross-validation to select best alpha
ridge_alphas = np.logspace(-3, 2, 50)
ridge_model = RidgeCV(alphas=ridge_alphas, cv=5)
ridge_model.fit(X_train_scaled, y_train_multi)

print(f"Ridge regression best alpha: {ridge_model.alpha_:.4f}")

# Predict and evaluate
y_pred_ridge = ridge_model.predict(X_test_scaled)
ridge_metrics = evaluate_regression_model(y_test_multi, y_pred_ridge, "Ridge Regression")
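Ridge adds an L2 penalty α‖β‖² to the least-squares objective, which preserves a closed-form solution: β̂ = (XᵀX + αI)⁻¹Xᵀy. As a self-contained sketch on synthetic data (intercept disabled via `fit_intercept=False` so both sides solve the identical problem), the formula reproduces scikit-learn's result:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

alpha = 1.0
# Closed-form ridge solution: (X^T X + alpha * I)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# scikit-learn solves the same penalized objective
beta_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

print(np.allclose(beta_closed, beta_sklearn, atol=1e-6))  # True
```

Note how the αI term makes XᵀX + αI invertible even when features are collinear, which is why Ridge is stable where plain OLS is not.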

4.6.2 Lasso Regression (L1 Regularization)

python
# Lasso regression
from sklearn.linear_model import LassoCV

# Use cross-validation to select best alpha
lasso_alphas = np.logspace(-4, 1, 50)
lasso_model = LassoCV(alphas=lasso_alphas, cv=5, random_state=42)
lasso_model.fit(X_train_scaled, y_train_multi)

print(f"Lasso regression best alpha: {lasso_model.alpha_:.4f}")

# Predict and evaluate
y_pred_lasso = lasso_model.predict(X_test_scaled)
lasso_metrics = evaluate_regression_model(y_test_multi, y_pred_lasso, "Lasso Regression")

# View feature selection results
print("Lasso regression coefficients:")
for i, coef in enumerate(lasso_model.coef_):
    print(f"  {feature_names[i]}: {coef:.4f}")

# Count non-zero coefficients
non_zero_coefs = np.sum(lasso_model.coef_ != 0)
print(f"Number of non-zero coefficients: {non_zero_coefs}/{len(feature_names)}")
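The sparsity comes from the L1 penalty: in the orthonormal-design case, the Lasso solution is the soft-thresholding of the OLS coefficients, S(β, α) = sign(β) · max(|β| − α, 0), which shrinks large coefficients and sets small ones exactly to zero. A minimal sketch of the operator (the coefficient values below are hypothetical):

```python
import numpy as np

def soft_threshold(beta, alpha):
    """Soft-thresholding operator used inside Lasso coordinate descent."""
    return np.sign(beta) * np.maximum(np.abs(beta) - alpha, 0.0)

beta_ols = np.array([2.5, -0.3, 0.05, -1.2])
# Small coefficients are zeroed out; large ones are shrunk toward zero
print(soft_threshold(beta_ols, 0.5))
```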

4.6.3 ElasticNet Regression (L1+L2 Regularization)

python
# ElasticNet regression
from sklearn.linear_model import ElasticNetCV

# Use cross-validation to select best parameters
elasticnet_model = ElasticNetCV(
    alphas=np.logspace(-4, 1, 20),
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9],
    cv=5,
    random_state=42
)
elasticnet_model.fit(X_train_scaled, y_train_multi)

print(f"ElasticNet best alpha: {elasticnet_model.alpha_:.4f}")
print(f"ElasticNet best l1_ratio: {elasticnet_model.l1_ratio_:.4f}")

# Predict and evaluate
y_pred_elasticnet = elasticnet_model.predict(X_test_scaled)
elasticnet_metrics = evaluate_regression_model(y_test_multi, y_pred_elasticnet, "ElasticNet Regression")

4.6.4 Regularization Method Comparison

python
# Compare different regularization methods
models = {
    'Linear Regression': model_multi,
    'Ridge Regression': ridge_model,
    'Lasso Regression': lasso_model,
    'ElasticNet Regression': elasticnet_model
}

# Coefficient comparison
plt.figure(figsize=(12, 8))
x_pos = np.arange(len(feature_names))
width = 0.2

for i, (name, model) in enumerate(models.items()):
    plt.bar(x_pos + i * width, model.coef_, width, label=name, alpha=0.8)

plt.xlabel('Features')
plt.ylabel('Regression Coefficients')
plt.title('Coefficient Comparison of Different Regularization Methods')
plt.xticks(x_pos + width * 1.5, feature_names, rotation=45)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Performance comparison
performance_comparison = pd.DataFrame({
    'Linear Regression': test_metrics_multi,
    'Ridge Regression': ridge_metrics,
    'Lasso Regression': lasso_metrics,
    'ElasticNet Regression': elasticnet_metrics
}).T

print("Model Performance Comparison:")
print(performance_comparison.round(4))

# Visualize performance comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# R² comparison
performance_comparison['R2'].plot(kind='bar', ax=axes[0], color='skyblue')
axes[0].set_title('R² Score Comparison')
axes[0].set_ylabel('R²')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(True, alpha=0.3)

# RMSE comparison
performance_comparison['RMSE'].plot(kind='bar', ax=axes[1], color='lightcoral')
axes[1].set_title('RMSE Comparison')
axes[1].set_ylabel('RMSE')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

4.7 Polynomial Regression

4.7.1 Generate Nonlinear Data

python
# Generate nonlinear data
def generate_polynomial_data(n_samples=100):
    """Generate polynomial data"""
    np.random.seed(42)
    X = np.random.uniform(-2, 2, n_samples)
    y = 0.5 * X**3 - 2 * X**2 + X + 1 + np.random.normal(0, 0.5, n_samples)
    return X.reshape(-1, 1), y

X_poly, y_poly = generate_polynomial_data(150)

# Visualize nonlinear data
plt.figure(figsize=(10, 6))
plt.scatter(X_poly, y_poly, alpha=0.6, color='blue')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Nonlinear Data')
plt.grid(True, alpha=0.3)
plt.show()

4.7.2 Polynomial Feature Transformation

python
# Split data
X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(
    X_poly, y_poly, test_size=0.2, random_state=42
)

# Compare polynomial regression of different degrees
degrees = [1, 2, 3, 4, 5]
poly_results = {}

plt.figure(figsize=(15, 10))

for i, degree in enumerate(degrees):
    # Create polynomial features
    poly_features = PolynomialFeatures(degree=degree)
    X_train_poly_features = poly_features.fit_transform(X_train_poly)
    X_test_poly_features = poly_features.transform(X_test_poly)
    
    # Train model
    poly_model = LinearRegression()
    poly_model.fit(X_train_poly_features, y_train_poly)
    
    # Predict
    y_pred_poly = poly_model.predict(X_test_poly_features)
    
    # Evaluate
    r2 = r2_score(y_test_poly, y_pred_poly)
    rmse = np.sqrt(mean_squared_error(y_test_poly, y_pred_poly))
    
    poly_results[degree] = {'R2': r2, 'RMSE': rmse}
    
    # Visualize
    plt.subplot(2, 3, i + 1)
    
    # Plot data points
    plt.scatter(X_train_poly, y_train_poly, alpha=0.6, label='Training Data')
    plt.scatter(X_test_poly, y_test_poly, alpha=0.6, color='green', label='Test Data')
    
    # Plot fitted curve
    X_plot = np.linspace(X_poly.min(), X_poly.max(), 100).reshape(-1, 1)
    X_plot_poly = poly_features.transform(X_plot)
    y_plot = poly_model.predict(X_plot_poly)
    plt.plot(X_plot, y_plot, color='red', linewidth=2, label=f'Polynomial Fit (degree={degree})')
    
    plt.xlabel('X')
    plt.ylabel('y')
    plt.title(f'Degree={degree}, R²={r2:.3f}, RMSE={rmse:.3f}')
    plt.legend()
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Performance summary
poly_df = pd.DataFrame(poly_results).T
print("Polynomial Regression Performance Comparison:")
print(poly_df.round(4))
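Comparing degrees on a single train/test split can be noisy; wrapping PolynomialFeatures and LinearRegression in a Pipeline and scoring with cross-validation gives a more reliable degree selection. A sketch on freshly generated cubic data (mirroring generate_polynomial_data above):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, 150).reshape(-1, 1)
y = 0.5 * X[:, 0]**3 - 2 * X[:, 0]**2 + X[:, 0] + 1 + rng.normal(0, 0.5, 150)

scores = {}
for degree in range(1, 7):
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('reg', LinearRegression()),
    ])
    # Mean R^2 over 5 cross-validation folds
    scores[degree] = cross_val_score(pipe, X, y, cv=5, scoring='r2').mean()

best_degree = max(scores, key=scores.get)
print(f"Best degree by 5-fold CV: {best_degree}")
```

Because the Pipeline refits the polynomial transform inside each fold, there is no leakage of test-fold information into the feature construction.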

4.7.3 Regularized Polynomial Regression

python
# Use Ridge regularization to prevent overfitting
degree = 5  # Use high-degree polynomial
poly_features = PolynomialFeatures(degree=degree)
X_train_poly_high = poly_features.fit_transform(X_train_poly)
X_test_poly_high = poly_features.transform(X_test_poly)

print(f"Number of polynomial features: {X_train_poly_high.shape[1]}")

# Compare different regularization strengths
alphas = [0, 0.1, 1, 10, 100]
plt.figure(figsize=(15, 10))

for i, alpha in enumerate(alphas):
    if alpha == 0:
        model = LinearRegression()
        model_name = "No Regularization"
    else:
        model = Ridge(alpha=alpha)
        model_name = f"Ridge (α={alpha})"
    
    model.fit(X_train_poly_high, y_train_poly)
    
    # Predict
    y_pred = model.predict(X_test_poly_high)
    r2 = r2_score(y_test_poly, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test_poly, y_pred))
    
    # Visualize
    plt.subplot(2, 3, i + 1)
    plt.scatter(X_train_poly, y_train_poly, alpha=0.6, label='Training Data')
    plt.scatter(X_test_poly, y_test_poly, alpha=0.6, color='green', label='Test Data')
    
    # Plot fitted curve
    X_plot = np.linspace(X_poly.min(), X_poly.max(), 100).reshape(-1, 1)
    X_plot_poly = poly_features.transform(X_plot)
    y_plot = model.predict(X_plot_poly)
    plt.plot(X_plot, y_plot, color='red', linewidth=2, label=model_name)
    
    plt.xlabel('X')
    plt.ylabel('y')
    plt.title(f'{model_name}\nR²={r2:.3f}, RMSE={rmse:.3f}')
    plt.legend()
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

4.8 Cross-Validation and Model Selection

4.8.1 Learning Curves

python
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, X, y, title="Learning Curve"):
    """Plot learning curve"""
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='r2'
    )
    
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
    
    plt.plot(train_sizes, val_mean, 'o-', color='red', label='Validation Score')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1, color='red')
    
    plt.xlabel('Number of Training Samples')
    plt.ylabel('R² Score')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Plot learning curves for different models
plot_learning_curve(LinearRegression(), X_train_scaled, y_train_multi, "Linear Regression Learning Curve")
plot_learning_curve(Ridge(alpha=1.0), X_train_scaled, y_train_multi, "Ridge Regression Learning Curve")
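Learning curves summarize many train/validation splits; when only a single headline number is needed, `cross_val_score` (imported in section 4.2) performs the same resampling in one call. A sketch on synthetic regression data standing in for X_train_scaled and y_train_multi:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=5, n_informative=3,
                       noise=10, random_state=42)

# One R^2 score per fold; the spread indicates how stable the model is
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring='r2')
print(f"R² per fold: {np.round(scores, 3)}")
print(f"Mean ± std: {scores.mean():.3f} ± {scores.std():.3f}")
```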

4.8.2 Validation Curves

python
from sklearn.model_selection import validation_curve

def plot_validation_curve(estimator, X, y, param_name, param_range, title="Validation Curve"):
    """Plot validation curve"""
    train_scores, val_scores = validation_curve(
        estimator, X, y, param_name=param_name, param_range=param_range,
        cv=5, scoring='r2', n_jobs=-1
    )
    
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.semilogx(param_range, train_mean, 'o-', color='blue', label='Training Score')
    plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
    
    plt.semilogx(param_range, val_mean, 'o-', color='red', label='Validation Score')
    plt.fill_between(param_range, val_mean - val_std, val_mean + val_std, alpha=0.1, color='red')
    
    plt.xlabel(param_name)
    plt.ylabel('R² Score')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Validation curve for Ridge regression
alpha_range = np.logspace(-4, 2, 20)
plot_validation_curve(
    Ridge(), X_train_scaled, y_train_multi,
    'alpha', alpha_range, "Ridge Regression Validation Curve"
)

4.9 Practical Application Cases

4.9.1 House Price Prediction Case

python
# Create house price prediction dataset
def create_house_price_dataset():
    """Create house price prediction dataset"""
    np.random.seed(42)
    n_samples = 1000
    
    # Generate features
    area = np.random.normal(150, 50, n_samples)  # Area
    bedrooms = np.random.poisson(3, n_samples)  # Number of bedrooms
    age = np.random.exponential(10, n_samples)  # House age
    distance_to_center = np.random.exponential(5, n_samples)  # Distance to city center
    
    # Generate target variable (house price)
    price = (
        area * 500 +  # Area effect
        bedrooms * 10000 +  # Number of bedrooms effect
        -age * 1000 +  # Negative effect of age
        -distance_to_center * 2000 +  # Negative effect of distance
        np.random.normal(0, 20000, n_samples)  # Noise
    )
    
    # Ensure price is positive
    price = np.maximum(price, 50000)
    
    data = pd.DataFrame({
        'Area': area,
        'Bedrooms': bedrooms,
        'Age': age,
        'Distance to Center': distance_to_center,
        'Price': price
    })
    
    return data

# Create dataset
house_data = create_house_price_dataset()

print("House Price Dataset Info:")
print(house_data.describe())

# Data visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('House Price Dataset Feature Analysis', fontsize=16)

features = ['Area', 'Bedrooms', 'Age', 'Distance to Center']
for i, feature in enumerate(features):
    row = i // 2
    col = i % 2
    
    axes[row, col].scatter(house_data[feature], house_data['Price'], alpha=0.6)
    axes[row, col].set_xlabel(feature)
    axes[row, col].set_ylabel('Price')
    axes[row, col].set_title(f'{feature} vs Price')
    axes[row, col].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

4.9.2 Build House Price Prediction Model

python
# Prepare data
X_house = house_data[features]
y_house = house_data['Price']

# Split data
X_train_house, X_test_house, y_train_house, y_test_house = train_test_split(
    X_house, y_house, test_size=0.2, random_state=42
)

# Create preprocessing and modeling pipeline
house_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', Ridge(alpha=1.0))
])

# Train model
house_pipeline.fit(X_train_house, y_train_house)

# Predict
y_pred_house = house_pipeline.predict(X_test_house)

# Evaluate
house_metrics = evaluate_regression_model(y_test_house, y_pred_house, "House Price Prediction Model")

# Feature importance
regressor = house_pipeline.named_steps['regressor']
feature_importance = np.abs(regressor.coef_)

plt.figure(figsize=(10, 6))
plt.barh(features, feature_importance)
plt.xlabel('Absolute Coefficient Value')
plt.title('House Price Prediction Model Feature Importance')
plt.tight_layout()
plt.show()

# Prediction examples
sample_houses = pd.DataFrame({
    'Area': [120, 200, 80],
    'Bedrooms': [2, 4, 1],
    'Age': [5, 15, 25],
    'Distance to Center': [3, 8, 1]
})

predicted_prices = house_pipeline.predict(sample_houses)

print("House Price Prediction Examples:")
for i, (_, house) in enumerate(sample_houses.iterrows()):
    print(f"House {i+1}:")
    print(f"  Area: {house['Area']} sqm, Bedrooms: {house['Bedrooms']}, Age: {house['Age']} years, Distance: {house['Distance to Center']} km")
    print(f"  Predicted Price: ¥{predicted_prices[i]:,.0f}")
    print()

4.10 Exercises

Exercise 1: Basic Linear Regression

  1. Use make_regression to generate a dataset with noise
  2. Train a linear regression model and evaluate performance
  3. Analyze if residuals satisfy the normal distribution assumption

Exercise 2: Feature Engineering

  1. Create a dataset with categorical features
  2. Use one-hot encoding to process categorical features
  3. Compare model performance before and after processing

Exercise 3: Regularization Comparison

  1. Generate a high-dimensional dataset (features > samples)
  2. Compare performance of Linear Regression, Ridge, Lasso, and ElasticNet
  3. Analyze feature selection effects of different regularization methods

Exercise 4: Polynomial Regression

  1. Generate a complex nonlinear dataset
  2. Use polynomial regression with different degrees to fit data
  3. Use cross-validation to select the optimal polynomial degree

4.11 Summary

In this chapter, we studied linear regression from several angles:

Core Concepts

  • Linear Regression Principles: Assumptions, mathematical formulas, geometric interpretation
  • Model Evaluation: Metrics such as MSE, RMSE, MAE, R²
  • Residual Analysis: Validating the effectiveness of model assumptions

Main Techniques

  • Simple Linear Regression: Single feature prediction
  • Multiple Linear Regression: Multi-feature prediction
  • Regularization Methods: Ridge, Lasso, ElasticNet
  • Polynomial Regression: Handling nonlinear relationships

Practical Skills

  • Data Preprocessing: Standardization, feature engineering
  • Model Selection: Cross-validation, learning curves
  • Performance Evaluation: Use of multiple evaluation metrics
  • Result Interpretation: Coefficient interpretation, feature importance

Key Points

  • Linear regression is the foundation for understanding machine learning
  • Regularization can prevent overfitting and perform feature selection
  • Residual analysis is an important tool for validating model effectiveness
  • Feature engineering has a significant impact on model performance

4.12 Next Steps

You have now mastered the core ideas of linear regression. In the next chapter, Logistic Regression in Practice, we will learn how to handle classification problems using logistic regression.


Chapter Key Points Review:

  • ✓ Understood the mathematical principles and assumptions of linear regression
  • ✓ Mastered the implementation of simple and multiple linear regression
  • ✓ Learned to use regularization methods to prevent overfitting
  • ✓ Understood polynomial regression for handling nonlinear relationships
  • ✓ Mastered model evaluation and residual analysis methods
  • ✓ Able to build a complete regression prediction system

Content is for learning and research only.