
Chapter 10: K-Nearest Neighbors Algorithm

The K-Nearest Neighbors (KNN) algorithm is one of the simplest and most intuitive algorithms in machine learning. It is based on the idea that "birds of a feather flock together," making predictions by finding the K most similar neighbors.

10.1 What is K-Nearest Neighbors?

KNN is an instance-based learning method, also known as "lazy learning," because it doesn't build an explicit model during the training phase but performs calculations only during prediction.

10.1.1 Core Concepts

  • Similarity Assumption: Similar samples have similar labels
  • Locality Principle: Nearby samples are more important than distant samples
  • Non-parametric Method: Makes no assumptions about data distribution

10.1.2 Algorithm Steps

  1. Store Training Data: Save all training samples
  2. Calculate Distances: Calculate distances between test samples and all training samples
  3. Find K Nearest Neighbors: Select the K samples with the smallest distances
  4. Make Predictions:
    • Classification: Majority voting
    • Regression: Average or weighted average
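
The four steps above can be sketched directly in NumPy. This is a minimal illustration of the algorithm (the helper `knn_predict` and the toy dataset are invented here for demonstration), not a substitute for the optimized scikit-learn implementation used later in the chapter:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify a single query point by majority vote among its k nearest neighbors."""
    # Step 2: Euclidean distances from the query to every training sample
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Tiny toy dataset: class 0 near the origin, class 1 near (5, 5)
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.5, 0.5]), k=3))  # → 0
print(knn_predict(X, y, np.array([5.5, 5.5]), k=3))  # → 1
```

With K=3, a query near the origin is voted into class 0 and a query near (5, 5) into class 1. Note that all the work happens inside `knn_predict`; "training" is just keeping `X` and `y` around.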

10.1.3 Advantages of KNN

  • Simple and Intuitive: Algorithm logic is straightforward
  • No Explicit Training: The model simply stores the data; no parameters are fitted
  • Strong Adaptability: Can handle multi-class problems
  • Locally Sensitive: Can capture local patterns

10.1.4 Disadvantages of KNN

  • High Computational Complexity: Need to calculate all distances during prediction
  • Large Storage Requirements: Need to save all training data
  • Sensitive to Dimensionality: Performance degrades with high-dimensional data
  • Sensitive to Noise: Outliers affect prediction results
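
The prediction cost can be mitigated with spatial index structures such as KD-trees and ball trees, which scikit-learn (used throughout this chapter) exposes via the `algorithm` parameter. A hedged sketch on invented random data; absolute timings vary by machine, but all three settings perform exact search and return identical predictions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))       # low-dimensional data, where trees help most
y = (X[:, 0] > 0).astype(int)

# 'brute' computes every pairwise distance; tree indexes prune most of the search
for algo in ['brute', 'kd_tree', 'ball_tree']:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
    knn.fit(X, y)
    print(f"{algo}: accuracy on a subsample = {knn.score(X[:500], y[:500]):.3f}")
```

Tree-based search pays off mainly in low dimensions; as dimensionality grows it degrades back toward brute force, which connects to the curse of dimensionality discussed in Section 10.10.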

10.2 Setting Up Environment and Data

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_regression, load_iris, load_wine
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)

# Set plot style
plt.style.use('seaborn-v0_8')

10.3 Distance Metrics

10.3.1 Common Distance Metrics

The core of KNN is calculating distances between samples. Common distance metrics include:

python
def demonstrate_distance_metrics():
    """Demonstrate calculation of different distance metrics"""
    
    # Create two example points
    point1 = np.array([1, 2])
    point2 = np.array([4, 6])
    
    print("Coordinates of two points:")
    print(f"Point 1: {point1}")
    print(f"Point 2: {point2}")
    print()
    
    # 1. Euclidean distance (L2)
    euclidean_dist = np.sqrt(np.sum((point1 - point2)**2))
    print(f"Euclidean distance: {euclidean_dist:.4f}")
    
    # 2. Manhattan distance (L1)
    manhattan_dist = np.sum(np.abs(point1 - point2))
    print(f"Manhattan distance: {manhattan_dist:.4f}")
    
    # 3. Chebyshev distance (L∞)
    chebyshev_dist = np.max(np.abs(point1 - point2))
    print(f"Chebyshev distance: {chebyshev_dist:.4f}")
    
    # 4. Minkowski distance (general form)
    def minkowski_distance(p1, p2, p):
        return np.sum(np.abs(p1 - p2)**p)**(1/p)
    
    print(f"Minkowski distance (p=1): {minkowski_distance(point1, point2, 1):.4f}")
    print(f"Minkowski distance (p=2): {minkowski_distance(point1, point2, 2):.4f}")
    print(f"Minkowski distance (p=3): {minkowski_distance(point1, point2, 3):.4f}")

demonstrate_distance_metrics()

10.3.2 Visualizing Different Distance Metrics

python
def visualize_distance_metrics():
    """Visualize equidistant lines for different distance metrics"""
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    # Center point
    center = np.array([0, 0])
    
    # Create grid
    x = np.linspace(-2, 2, 100)
    y = np.linspace(-2, 2, 100)
    X, Y = np.meshgrid(x, y)
    
    # Calculate different distances
    # Euclidean distance
    euclidean = np.sqrt(X**2 + Y**2)
    axes[0].contour(X, Y, euclidean, levels=[0.5, 1.0, 1.5], colors=['red', 'blue', 'green'])
    axes[0].set_title('Euclidean Distance Contours')
    axes[0].set_xlabel('X')
    axes[0].set_ylabel('Y')
    axes[0].grid(True, alpha=0.3)
    axes[0].plot(0, 0, 'ko', markersize=8)
    
    # Manhattan distance
    manhattan = np.abs(X) + np.abs(Y)
    axes[1].contour(X, Y, manhattan, levels=[0.5, 1.0, 1.5], colors=['red', 'blue', 'green'])
    axes[1].set_title('Manhattan Distance Contours')
    axes[1].set_xlabel('X')
    axes[1].set_ylabel('Y')
    axes[1].grid(True, alpha=0.3)
    axes[1].plot(0, 0, 'ko', markersize=8)
    
    # Chebyshev distance
    chebyshev = np.maximum(np.abs(X), np.abs(Y))
    axes[2].contour(X, Y, chebyshev, levels=[0.5, 1.0, 1.5], colors=['red', 'blue', 'green'])
    axes[2].set_title('Chebyshev Distance Contours')
    axes[2].set_xlabel('X')
    axes[2].set_ylabel('Y')
    axes[2].grid(True, alpha=0.3)
    axes[2].plot(0, 0, 'ko', markersize=8)
    
    plt.tight_layout()
    plt.show()

visualize_distance_metrics()

10.4 KNN Classification

10.4.1 Basic KNN Classification

python
# Create binary classification data
X_class, y_class = make_classification(
    n_samples=200,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42, stratify=y_class
)

# Create KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)

# Predict
y_pred = knn_classifier.predict(X_test)
y_pred_proba = knn_classifier.predict_proba(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"KNN Classification Accuracy: {accuracy:.4f}")

print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred))

# Visualize data distribution
plt.figure(figsize=(12, 5))

# Training data
plt.subplot(1, 2, 1)
colors = ['red', 'blue']
for i, color in enumerate(colors):
    mask = y_train == i
    plt.scatter(X_train[mask, 0], X_train[mask, 1], 
                c=color, label=f'Class {i}', alpha=0.7)

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Training Data Distribution')
plt.legend()
plt.grid(True, alpha=0.3)

# Test data and predictions
plt.subplot(1, 2, 2)
for i, color in enumerate(colors):
    mask = y_test == i
    plt.scatter(X_test[mask, 0], X_test[mask, 1], 
                c=color, label=f'True Class {i}', alpha=0.7, s=100)

# Mark wrong predictions
wrong_predictions = y_test != y_pred
plt.scatter(X_test[wrong_predictions, 0], X_test[wrong_predictions, 1], 
            facecolors='none', edgecolors='black', s=200, linewidths=2, 
            label='Wrong Prediction')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Test Results')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

10.4.2 Decision Boundary Visualization

python
def plot_knn_decision_boundary(X, y, k_values, title="KNN Decision Boundary"):
    """Plot KNN decision boundaries for different K values"""
    
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle(title, fontsize=16)
    
    for i, k in enumerate(k_values):
        row = i // 2
        col = i % 2
        
        # Train KNN
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X, y)
        
        # Create grid
        h = 0.02
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                           np.arange(y_min, y_max, h))
        
        # Predict grid points
        Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        
        # Plot decision boundary
        axes[row, col].contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
        
        # Plot data points
        colors = ['red', 'blue']
        for j, color in enumerate(colors):
            mask = y == j
            axes[row, col].scatter(X[mask, 0], X[mask, 1], 
                                 c=color, alpha=0.7, s=50)
        
        axes[row, col].set_title(f'K = {k}')
        axes[row, col].set_xlabel('Feature 1')
        axes[row, col].set_ylabel('Feature 2')
        axes[row, col].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Compare decision boundaries for different K values
k_values = [1, 3, 5, 15]
plot_knn_decision_boundary(X_train, y_train, k_values, "KNN Decision Boundaries for Different K Values")

10.4.3 K Value Impact Analysis

python
# Analyze the impact of different K values on performance
k_range = range(1, 31)
train_scores = []
test_scores = []

print("Impact of K value on KNN performance:")
print("K Value\tTraining Accuracy\tTest Accuracy")
print("-" * 35)

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    
    train_pred = knn.predict(X_train)
    test_pred = knn.predict(X_test)
    
    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, test_pred)
    
    train_scores.append(train_acc)
    test_scores.append(test_acc)
    
    if k <= 10 or k % 5 == 0:
        print(f"{k}\t{train_acc:.4f}\t\t{test_acc:.4f}")

# Visualize K value impact
plt.figure(figsize=(10, 6))
plt.plot(k_range, train_scores, 'o-', label='Training Accuracy', linewidth=2, markersize=6)
plt.plot(k_range, test_scores, 's-', label='Test Accuracy', linewidth=2, markersize=6)

plt.xlabel('K Value')
plt.ylabel('Accuracy')
plt.title('Impact of K Value on KNN Performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(range(1, 31, 2))
plt.show()

# Find best K value
best_k = k_range[np.argmax(test_scores)]
best_test_score = max(test_scores)
print(f"\nBest K value: {best_k}")
print(f"Best test accuracy: {best_test_score:.4f}")
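
Note that picking K by test-set accuracy, as above, quietly tunes on the test set. A more robust approach selects K by cross-validation and touches the test set only once at the end. A self-contained sketch on the same kind of data (regenerated here so the snippet runs on its own):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1, random_state=42)

# Average 5-fold CV accuracy for each candidate K, then pick the best
k_candidates = range(1, 31)
cv_means = []
for k in k_candidates:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    cv_means.append(scores.mean())

best_k_cv = list(k_candidates)[int(np.argmax(cv_means))]
print(f"Best K by cross-validation: {best_k_cv}")
```

The CV-selected K may differ from the test-set-selected K; the point is that the choice no longer depends on the held-out data used for the final evaluation.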

10.5 Importance of Feature Scaling

10.5.1 Comparison Before and After Scaling

python
# Create data with features of different scales
np.random.seed(42)
n_samples = 200

# Feature 1: Small range [0, 1]
feature1 = np.random.uniform(0, 1, n_samples)

# Feature 2: Large range [0, 1000]
feature2 = np.random.uniform(0, 1000, n_samples)

# Create labels (based on combination of both features)
labels = ((feature1 > 0.5) & (feature2 > 500)).astype(int)

X_scale = np.column_stack([feature1, feature2])
y_scale = labels

print("Data statistics before feature scaling:")
print(f"Feature 1 range: [{X_scale[:, 0].min():.3f}, {X_scale[:, 0].max():.3f}]")
print(f"Feature 2 range: [{X_scale[:, 1].min():.3f}, {X_scale[:, 1].max():.3f}]")

# Split data
X_train_scale, X_test_scale, y_train_scale, y_test_scale = train_test_split(
    X_scale, y_scale, test_size=0.2, random_state=42, stratify=y_scale
)

# 1. KNN without scaling
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train_scale, y_train_scale)
y_pred_no_scale = knn_no_scale.predict(X_test_scale)
acc_no_scale = accuracy_score(y_test_scale, y_pred_no_scale)

# 2. KNN after standardization
scaler_std = StandardScaler()
X_train_std = scaler_std.fit_transform(X_train_scale)
X_test_std = scaler_std.transform(X_test_scale)

knn_std = KNeighborsClassifier(n_neighbors=5)
knn_std.fit(X_train_std, y_train_scale)
y_pred_std = knn_std.predict(X_test_std)
acc_std = accuracy_score(y_test_scale, y_pred_std)

# 3. KNN after min-max scaling
scaler_minmax = MinMaxScaler()
X_train_minmax = scaler_minmax.fit_transform(X_train_scale)
X_test_minmax = scaler_minmax.transform(X_test_scale)

knn_minmax = KNeighborsClassifier(n_neighbors=5)
knn_minmax.fit(X_train_minmax, y_train_scale)
y_pred_minmax = knn_minmax.predict(X_test_minmax)
acc_minmax = accuracy_score(y_test_scale, y_pred_minmax)

print(f"\nImpact of feature scaling on KNN performance:")
print(f"No scaling: {acc_no_scale:.4f}")
print(f"Standardization: {acc_std:.4f}")
print(f"Min-Max scaling: {acc_minmax:.4f}")

# Visualize scaling effects
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Impact of Feature Scaling on KNN', fontsize=16)

# Original data
axes[0, 0].scatter(X_train_scale[y_train_scale==0, 0], X_train_scale[y_train_scale==0, 1], 
                  c='red', alpha=0.6, label='Class 0')
axes[0, 0].scatter(X_train_scale[y_train_scale==1, 0], X_train_scale[y_train_scale==1, 1], 
                  c='blue', alpha=0.6, label='Class 1')
axes[0, 0].set_title(f'Original Data (Accuracy: {acc_no_scale:.3f})')
axes[0, 0].set_xlabel('Feature 1')
axes[0, 0].set_ylabel('Feature 2')
axes[0, 0].legend()

# Standardized data
axes[0, 1].scatter(X_train_std[y_train_scale==0, 0], X_train_std[y_train_scale==0, 1], 
                  c='red', alpha=0.6, label='Class 0')
axes[0, 1].scatter(X_train_std[y_train_scale==1, 0], X_train_std[y_train_scale==1, 1], 
                  c='blue', alpha=0.6, label='Class 1')
axes[0, 1].set_title(f'Standardized Data (Accuracy: {acc_std:.3f})')
axes[0, 1].set_xlabel('Feature 1 (Standardized)')
axes[0, 1].set_ylabel('Feature 2 (Standardized)')
axes[0, 1].legend()

# Min-Max scaled data
axes[1, 0].scatter(X_train_minmax[y_train_scale==0, 0], X_train_minmax[y_train_scale==0, 1], 
                  c='red', alpha=0.6, label='Class 0')
axes[1, 0].scatter(X_train_minmax[y_train_scale==1, 0], X_train_minmax[y_train_scale==1, 1], 
                  c='blue', alpha=0.6, label='Class 1')
axes[1, 0].set_title(f'Min-Max Scaling (Accuracy: {acc_minmax:.3f})')
axes[1, 0].set_xlabel('Feature 1 (Scaled)')
axes[1, 0].set_ylabel('Feature 2 (Scaled)')
axes[1, 0].legend()

# Performance comparison bar chart
methods = ['No Scaling', 'Standardization', 'Min-Max Scaling']
accuracies = [acc_no_scale, acc_std, acc_minmax]
axes[1, 1].bar(methods, accuracies, color=['red', 'green', 'blue'], alpha=0.7)
axes[1, 1].set_title('Scaling Method Performance Comparison')
axes[1, 1].set_ylabel('Accuracy')
axes[1, 1].set_ylim(0, 1)

# Add value labels
for i, acc in enumerate(accuracies):
    axes[1, 1].text(i, acc + 0.02, f'{acc:.3f}', ha='center')

plt.tight_layout()
plt.show()
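
In practice, the scaler and the classifier are usually chained in a `Pipeline` (already imported in the setup section) so that the scaler is re-fit on each training fold during cross-validation, preventing test-fold statistics from leaking into the scaling step. A minimal sketch with invented toy data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=4, random_state=42)

# The scaler is re-fit on each training fold inside cross_val_score
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])

scores = cross_val_score(knn_pipeline, X, y, cv=5)
print(f"CV accuracy with scaling pipeline: {scores.mean():.4f}")
```

The same pipeline object can be passed to `GridSearchCV` (Section 10.8) with parameter names like `knn__n_neighbors`, tuning scaling and K together without leakage.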

10.6 KNN Regression

10.6.1 Basic KNN Regression

python
# Create regression data
X_reg, y_reg = make_regression(
    n_samples=200,
    n_features=1,
    noise=10,
    random_state=42
)

# Add some nonlinear relationship
X_reg = X_reg.flatten()
y_reg = y_reg + 0.1 * X_reg**2

# Split data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg.reshape(-1, 1), y_reg, test_size=0.2, random_state=42
)

# Compare KNN regression with different K values
k_values_reg = [1, 3, 5, 10, 20]
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('KNN Regression with Different K Values', fontsize=16)

from sklearn.metrics import mean_squared_error, r2_score

print("KNN regression performance with different K values:")
print("K Value\tR² Score\t\tRMSE")
print("-" * 30)

for i, k in enumerate(k_values_reg):
    row = i // 3
    col = i % 3
    
    # Train KNN regressor
    knn_reg = KNeighborsRegressor(n_neighbors=k)
    knn_reg.fit(X_train_reg, y_train_reg)
    
    # Predict
    y_pred_reg = knn_reg.predict(X_test_reg)
    
    # Evaluate
    r2 = r2_score(y_test_reg, y_pred_reg)
    rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))
    
    print(f"{k}\t{r2:.4f}\t\t{rmse:.2f}")
    
    # Visualize
    axes[row, col].scatter(X_train_reg, y_train_reg, alpha=0.6, label='Training Data', s=20)
    axes[row, col].scatter(X_test_reg, y_test_reg, alpha=0.6, color='green', label='Test Data', s=20)
    
    # Plot prediction curve
    X_plot = np.linspace(X_reg.min(), X_reg.max(), 100).reshape(-1, 1)
    y_plot = knn_reg.predict(X_plot)
    axes[row, col].plot(X_plot, y_plot, color='red', linewidth=2, label='KNN Prediction')
    
    axes[row, col].set_title(f'K={k}, R²={r2:.3f}, RMSE={rmse:.1f}')
    axes[row, col].set_xlabel('X')
    axes[row, col].set_ylabel('y')
    axes[row, col].legend()
    axes[row, col].grid(True, alpha=0.3)

# Remove empty subplot
if len(k_values_reg) < 6:
    axes[1, 2].remove()

plt.tight_layout()
plt.show()

10.6.2 Weighted KNN Regression

python
# Compare ordinary KNN and weighted KNN
def compare_weighted_knn():
    """Compare ordinary KNN and weighted KNN regression"""
    
    # Use the same regression data
    k = 5
    
    # Ordinary KNN regression
    knn_uniform = KNeighborsRegressor(n_neighbors=k, weights='uniform')
    knn_uniform.fit(X_train_reg, y_train_reg)
    y_pred_uniform = knn_uniform.predict(X_test_reg)
    
    # Distance-weighted KNN regression
    knn_distance = KNeighborsRegressor(n_neighbors=k, weights='distance')
    knn_distance.fit(X_train_reg, y_train_reg)
    y_pred_distance = knn_distance.predict(X_test_reg)
    
    # Evaluate
    r2_uniform = r2_score(y_test_reg, y_pred_uniform)
    r2_distance = r2_score(y_test_reg, y_pred_distance)
    rmse_uniform = np.sqrt(mean_squared_error(y_test_reg, y_pred_uniform))
    rmse_distance = np.sqrt(mean_squared_error(y_test_reg, y_pred_distance))
    
    print(f"Impact of weighting method on KNN regression (K={k}):")
    print(f"Ordinary KNN:   R²={r2_uniform:.4f}, RMSE={rmse_uniform:.2f}")
    print(f"Distance-weighted:  R²={r2_distance:.4f}, RMSE={rmse_distance:.2f}")
    
    # Visualize comparison
    plt.figure(figsize=(15, 5))
    
    X_plot = np.linspace(X_reg.min(), X_reg.max(), 100).reshape(-1, 1)
    y_plot_uniform = knn_uniform.predict(X_plot)
    y_plot_distance = knn_distance.predict(X_plot)
    
    # Ordinary KNN
    plt.subplot(1, 3, 1)
    plt.scatter(X_train_reg, y_train_reg, alpha=0.6, label='Training Data', s=20)
    plt.scatter(X_test_reg, y_test_reg, alpha=0.6, color='green', label='Test Data', s=20)
    plt.plot(X_plot, y_plot_uniform, color='red', linewidth=2, label='Ordinary KNN')
    plt.title(f'Ordinary KNN (R²={r2_uniform:.3f})')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Distance-weighted KNN
    plt.subplot(1, 3, 2)
    plt.scatter(X_train_reg, y_train_reg, alpha=0.6, label='Training Data', s=20)
    plt.scatter(X_test_reg, y_test_reg, alpha=0.6, color='green', label='Test Data', s=20)
    plt.plot(X_plot, y_plot_distance, color='blue', linewidth=2, label='Distance-weighted KNN')
    plt.title(f'Distance-weighted KNN (R²={r2_distance:.3f})')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Comparison
    plt.subplot(1, 3, 3)
    plt.scatter(X_train_reg, y_train_reg, alpha=0.4, label='Training Data', s=20, color='gray')
    plt.plot(X_plot, y_plot_uniform, color='red', linewidth=2, label='Ordinary KNN', linestyle='--')
    plt.plot(X_plot, y_plot_distance, color='blue', linewidth=2, label='Distance-weighted KNN')
    plt.title('Comparison of Both Methods')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

compare_weighted_knn()

10.7 Impact of Distance Metrics

10.7.1 Comparison of Different Distance Metrics

python
# Compare the impact of different distance metrics on KNN

# Use iris dataset
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# Standardize data
scaler = StandardScaler()
X_train_iris_scaled = scaler.fit_transform(X_train_iris)
X_test_iris_scaled = scaler.transform(X_test_iris)

# Different distance metrics
distance_metrics = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']
metric_results = {}

print("Impact of different distance metrics on KNN classification:")
print("Distance Metric\t\tAccuracy\t\tCross-validation Score")
print("-" * 50)

for metric in distance_metrics:
    if metric == 'minkowski':
        # Minkowski distance requires specifying p parameter
        knn_metric = KNeighborsClassifier(n_neighbors=5, metric=metric, p=3)
    else:
        knn_metric = KNeighborsClassifier(n_neighbors=5, metric=metric)
    
    knn_metric.fit(X_train_iris_scaled, y_train_iris)
    y_pred_metric = knn_metric.predict(X_test_iris_scaled)
    
    accuracy_metric = accuracy_score(y_test_iris, y_pred_metric)
    cv_scores = cross_val_score(knn_metric, X_iris, y_iris, cv=5)
    cv_mean = np.mean(cv_scores)
    
    metric_results[metric] = {'accuracy': accuracy_metric, 'cv_score': cv_mean}
    
    print(f"{metric}\t\t{accuracy_metric:.4f}\t\t{cv_mean:.4f}")

# Visualize impact of distance metrics
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

metrics = list(metric_results.keys())
accuracies = [metric_results[m]['accuracy'] for m in metrics]
cv_scores = [metric_results[m]['cv_score'] for m in metrics]

# Test accuracy
axes[0].bar(metrics, accuracies, color='skyblue', alpha=0.7)
axes[0].set_title('Test Accuracy for Different Distance Metrics')
axes[0].set_ylabel('Accuracy')
axes[0].tick_params(axis='x', rotation=45)
axes[0].set_ylim(0.8, 1.0)

# Cross-validation scores
axes[1].bar(metrics, cv_scores, color='lightgreen', alpha=0.7)
axes[1].set_title('Cross-validation Scores for Different Distance Metrics')
axes[1].set_ylabel('CV Score')
axes[1].tick_params(axis='x', rotation=45)
axes[1].set_ylim(0.8, 1.0)

plt.tight_layout()
plt.show()

10.8 Hyperparameter Tuning

10.8.1 Grid Search Optimization

python
# Use grid search to optimize KNN hyperparameters
param_grid = {
    'n_neighbors': range(1, 21),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski'],
    'p': [1, 2, 3]  # Only effective for minkowski distance
}

# Create KNN classifier
knn_grid = KNeighborsClassifier()

# Grid search
print("Performing KNN hyperparameter grid search...")
grid_search = GridSearchCV(
    knn_grid,
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train_iris_scaled, y_train_iris)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Test best model
best_knn = grid_search.best_estimator_
y_pred_best = best_knn.predict(X_test_iris_scaled)
test_accuracy = accuracy_score(y_test_iris, y_pred_best)
print(f"Test set accuracy: {test_accuracy:.4f}")

# Detailed evaluation
print("\nDetailed evaluation of best KNN model:")
print(classification_report(y_test_iris, y_pred_best, 
                          target_names=iris.target_names))

10.8.2 Learning Curve Analysis

python
from sklearn.model_selection import learning_curve

def plot_learning_curve_knn(estimator, X, y, title="KNN Learning Curve"):
    """Plot learning curve for KNN"""
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy'
    )
    
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, 
                     alpha=0.1, color='blue')
    
    plt.plot(train_sizes, val_mean, 'o-', color='red', label='Validation Score')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, 
                     alpha=0.1, color='red')
    
    plt.xlabel('Number of Training Samples')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Plot learning curve for best KNN model
plot_learning_curve_knn(best_knn, X_iris, y_iris, "Best KNN Model Learning Curve")

10.9 Practical Application Examples

10.9.1 Handwritten Digit Recognition

python
# Create simplified handwritten digit data
from sklearn.datasets import load_digits

# Load digit dataset
digits = load_digits()
X_digits, y_digits = digits.data, digits.target

print(f"Handwritten digit dataset information:")
print(f"Number of samples: {X_digits.shape[0]}")
print(f"Number of features: {X_digits.shape[1]}")
print(f"Number of classes: {len(np.unique(y_digits))}")

# Visualize some digit samples
fig, axes = plt.subplots(2, 5, figsize=(12, 6))
fig.suptitle('Handwritten Digit Samples', fontsize=16)

for i in range(10):
    row = i // 5
    col = i % 5
    
    # Find first sample of this digit
    idx = np.where(y_digits == i)[0][0]
    digit_image = X_digits[idx].reshape(8, 8)
    
    axes[row, col].imshow(digit_image, cmap='gray')
    axes[row, col].set_title(f'Digit {i}')
    axes[row, col].axis('off')

plt.tight_layout()
plt.show()

# Split data
X_train_digits, X_test_digits, y_train_digits, y_test_digits = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=42, stratify=y_digits
)

# Standardize features
scaler_digits = StandardScaler()
X_train_digits_scaled = scaler_digits.fit_transform(X_train_digits)
X_test_digits_scaled = scaler_digits.transform(X_test_digits)

# Train KNN classifier
knn_digits = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn_digits.fit(X_train_digits_scaled, y_train_digits)

# Predict
y_pred_digits = knn_digits.predict(X_test_digits_scaled)
accuracy_digits = accuracy_score(y_test_digits, y_pred_digits)

print(f"\nKNN accuracy for handwritten digit recognition: {accuracy_digits:.4f}")

# Detailed classification report
print("\nDetailed classification report:")
print(classification_report(y_test_digits, y_pred_digits))

# Confusion matrix
cm_digits = confusion_matrix(y_test_digits, y_pred_digits)
plt.figure(figsize=(10, 8))
sns.heatmap(cm_digits, annot=True, fmt='d', cmap='Blues',
            xticklabels=range(10), yticklabels=range(10))
plt.title('Confusion Matrix for Handwritten Digit Recognition')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# Analyze misclassified samples
wrong_predictions = y_test_digits != y_pred_digits
wrong_indices = np.where(wrong_predictions)[0]

if len(wrong_indices) > 0:
    print(f"\nNumber of misclassified samples: {len(wrong_indices)}")
    
    # Display some misclassified samples
    fig, axes = plt.subplots(2, 5, figsize=(12, 6))
    fig.suptitle('Misclassified Samples', fontsize=16)
    
    for i in range(min(10, len(wrong_indices))):
        row = i // 5
        col = i % 5
        
        idx = wrong_indices[i]
        digit_image = X_test_digits[idx].reshape(8, 8)
        true_label = y_test_digits[idx]
        pred_label = y_pred_digits[idx]
        
        axes[row, col].imshow(digit_image, cmap='gray')
        axes[row, col].set_title(f'True: {true_label}, Pred: {pred_label}')
        axes[row, col].axis('off')
    
    # Remove empty subplots
    for i in range(min(10, len(wrong_indices)), 10):
        row = i // 5
        col = i % 5
        axes[row, col].remove()
    
    plt.tight_layout()
    plt.show()

10.9.2 Recommendation System Basics

python
# Create simple recommendation system example
def create_recommendation_system():
    """Create simple recommendation system based on KNN"""
    
    # Create user-item rating matrix
    np.random.seed(42)
    n_users = 100
    n_items = 20
    
    # Generate sparse rating matrix
    ratings = np.zeros((n_users, n_items))
    
    # Randomly rate some items for each user
    for user in range(n_users):
        # Each user rates 5-15 items
        n_ratings = np.random.randint(5, 16)
        items_to_rate = np.random.choice(n_items, n_ratings, replace=False)
        
        for item in items_to_rate:
            # Rating 1-5
            ratings[user, item] = np.random.randint(1, 6)
    
    print(f"User-item rating matrix:")
    print(f"Number of users: {n_users}")
    print(f"Number of items: {n_items}")
    print(f"Rating density: {np.count_nonzero(ratings) / (n_users * n_items):.2%}")
    
    # Use KNN to find similar users
    # Only use users with ratings for training
    user_profiles = ratings[np.sum(ratings, axis=1) > 0]  # Filter out users without ratings
    
    # Cosine distance works well for sparse rating vectors;
    # NearestNeighbors performs unsupervised neighbor search (no labels needed)
    from sklearn.neighbors import NearestNeighbors
    knn_recommender = NearestNeighbors(
        n_neighbors=5,
        metric='cosine',
        algorithm='brute'  # brute force is required for the cosine metric
    )
    knn_recommender.fit(user_profiles)
    
    # Recommend for new user
    new_user_ratings = np.zeros(n_items)
    # New user has ratings for some items
    rated_items = [0, 2, 5, 8, 12]
    new_user_ratings[rated_items] = [4, 5, 3, 4, 5]
    
    print(f"\nNew user's ratings:")
    for i, item in enumerate(rated_items):
        print(f"Item {item}: {new_user_ratings[item]}")
    
    # Find similar users
    distances, indices = knn_recommender.kneighbors([new_user_ratings], n_neighbors=5)
    
    print(f"\nTop 5 most similar users:")
    for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
        print(f"User {idx}: Distance = {dist:.3f}")
    
    # Recommend items based on similar users
    similar_users = user_profiles[indices[0]]
    
    # Calculate recommendation scores (average rating of similar users)
    recommendation_scores = np.mean(similar_users, axis=0)
    
    # Exclude already rated items
    recommendation_scores[rated_items] = 0
    
    # Get recommended items
    recommended_items = np.argsort(recommendation_scores)[::-1][:5]
    
    print(f"\nRecommended items:")
    for i, item in enumerate(recommended_items):
        if recommendation_scores[item] > 0:
            print(f"{i+1}. Item {item}: Recommendation score = {recommendation_scores[item]:.2f}")
    
    return ratings, recommendation_scores

ratings_matrix, rec_scores = create_recommendation_system()

# Visualize rating matrix
plt.figure(figsize=(12, 8))
plt.imshow(ratings_matrix[:20, :], cmap='YlOrRd', aspect='auto')
plt.colorbar(label='Rating')
plt.title('User-Item Rating Matrix (First 20 Users)')
plt.xlabel('Item ID')
plt.ylabel('User ID')
plt.show()

10.10 KNN Optimization Techniques

10.10.1 Curse of Dimensionality Problem

python
# Demonstrate the impact of curse of dimensionality on KNN
def demonstrate_curse_of_dimensionality():
    """Demonstrate the impact of curse of dimensionality on KNN performance"""
    
    dimensions = [2, 5, 10, 20, 50, 100]
    accuracies = []
    
    print("Impact of curse of dimensionality on KNN:")
    print("Dimension\tSamples\tAccuracy\tTraining Time")
    print("-" * 45)
    
    import time
    
    for dim in dimensions:
        # Create data with different dimensions
        X_dim, y_dim = make_classification(
            n_samples=500,
            n_features=dim,
            n_informative=min(dim, 10),
            n_redundant=0,
            n_clusters_per_class=1,
            random_state=42
        )
        
        # Split data
        X_train_dim, X_test_dim, y_train_dim, y_test_dim = train_test_split(
            X_dim, y_dim, test_size=0.2, random_state=42, stratify=y_dim
        )
        
        # Standardize
        scaler_dim = StandardScaler()
        X_train_dim_scaled = scaler_dim.fit_transform(X_train_dim)
        X_test_dim_scaled = scaler_dim.transform(X_test_dim)
        
        # Train KNN
        start_time = time.time()
        knn_dim = KNeighborsClassifier(n_neighbors=5)
        knn_dim.fit(X_train_dim_scaled, y_train_dim)
        
        # Predict
        y_pred_dim = knn_dim.predict(X_test_dim_scaled)
        train_time = time.time() - start_time
        
        # Evaluate
        accuracy_dim = accuracy_score(y_test_dim, y_pred_dim)
        accuracies.append(accuracy_dim)
        
        print(f"{dim}\t{len(X_dim)}\t{accuracy_dim:.4f}\t\t{train_time:.4f}s")
    
    # Visualize impact of dimensions on performance
    plt.figure(figsize=(10, 6))
    plt.plot(dimensions, accuracies, 'o-', linewidth=2, markersize=8)
    plt.xlabel('Feature Dimensions')
    plt.ylabel('Accuracy')
    plt.title('Impact of Curse of Dimensionality on KNN Performance')
    plt.grid(True, alpha=0.3)
    plt.show()
    
    return dimensions, accuracies

dims, accs = demonstrate_curse_of_dimensionality()
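
The accuracy decline above has a geometric cause: in high dimensions, distances between random points concentrate, so the "nearest" neighbor is barely closer than the farthest and the neighborhood loses meaning. A small standalone demonstration (the helper function is new, not from the chapter code):

```python
import numpy as np

rng = np.random.default_rng(42)

def distance_contrast(dim, n_points=500):
    """Relative spread (max - min) / min of distances from one query point
    to a random point cloud in the unit hypercube."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The relative spread collapses as dimensionality grows
for dim in [2, 10, 100, 1000]:
    print(f"dim={dim:5d}  relative spread={distance_contrast(dim):.2f}")
```

When the spread approaches zero, "nearest" and "farthest" are nearly synonymous, which is exactly why KNN degrades in high dimensions.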

10.10.2 Feature Selection

python
# Use feature selection to improve KNN performance
from sklearn.feature_selection import SelectKBest, f_classif

# Use high-dimensional dataset
X_high_dim, y_high_dim = make_classification(
    n_samples=1000,
    n_features=100,
    n_informative=20,
    n_redundant=10,
    random_state=42
)

X_train_hd, X_test_hd, y_train_hd, y_test_hd = train_test_split(
    X_high_dim, y_high_dim, test_size=0.2, random_state=42, stratify=y_high_dim
)

# Standardize
scaler_hd = StandardScaler()
X_train_hd_scaled = scaler_hd.fit_transform(X_train_hd)
X_test_hd_scaled = scaler_hd.transform(X_test_hd)

# Compare impact of different feature counts
k_features = [5, 10, 20, 30, 50, 100]
feature_selection_results = {}

print("Impact of feature selection on KNN performance:")
print("Features\tAccuracy\tTraining Time")
print("-" * 35)

for k in k_features:
    if k < X_high_dim.shape[1]:
        # Feature selection
        selector = SelectKBest(f_classif, k=k)
        X_train_selected = selector.fit_transform(X_train_hd_scaled, y_train_hd)
        X_test_selected = selector.transform(X_test_hd_scaled)
    else:
        X_train_selected = X_train_hd_scaled
        X_test_selected = X_test_hd_scaled
    
    # Train KNN
    start_time = time.time()
    knn_fs = KNeighborsClassifier(n_neighbors=5)
    knn_fs.fit(X_train_selected, y_train_hd)
    train_time = time.time() - start_time
    
    y_pred_fs = knn_fs.predict(X_test_selected)
    
    accuracy_fs = accuracy_score(y_test_hd, y_pred_fs)
    feature_selection_results[k] = {'accuracy': accuracy_fs, 'time': train_time}
    
    print(f"{k}\t{accuracy_fs:.4f}\t\t{train_time:.4f}s")

# Visualize effect of feature selection
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

features = list(feature_selection_results.keys())
accuracies_fs = [feature_selection_results[k]['accuracy'] for k in features]
times_fs = [feature_selection_results[k]['time'] for k in features]

# Accuracy
axes[0].plot(features, accuracies_fs, 'o-', linewidth=2, markersize=8, color='blue')
axes[0].set_xlabel('Number of Features')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Impact of Feature Count on Accuracy')
axes[0].grid(True, alpha=0.3)

# Training time
axes[1].plot(features, times_fs, 's-', linewidth=2, markersize=8, color='red')
axes[1].set_xlabel('Number of Features')
axes[1].set_ylabel('Training Time (seconds)')
axes[1].set_title('Impact of Feature Count on Training Time')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
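
SelectKBest keeps a subset of the original features; an alternative is to project them with dimensionality reduction such as PCA. A minimal sketch on fresh synthetic data (variable names here are new, not the chapter's):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=100,
                           n_informative=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42, stratify=y)

# Scale, project to 20 components, then classify
pca_knn = make_pipeline(StandardScaler(), PCA(n_components=20),
                        KNeighborsClassifier(n_neighbors=5))
pca_knn.fit(X_tr, y_tr)
pca_acc = accuracy_score(y_te, pca_knn.predict(X_te))

# Baseline: the same pipeline without PCA
plain_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
plain_knn.fit(X_tr, y_tr)
plain_acc = accuracy_score(y_te, plain_knn.predict(X_te))

print(f"KNN on all 100 features:    {plain_acc:.4f}")
print(f"KNN on 20 PCA components:   {pca_acc:.4f}")
```

Unlike SelectKBest, PCA is unsupervised, so it can discard class-relevant directions; trying both on your data is usually worthwhile.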

10.10.3 Search Algorithm Selection

python
# Compare brute-force and tree-based neighbor search strategies
# (all of scikit-learn's neighbor searches are exact; true approximate
# nearest neighbor search needs external libraries such as Annoy or FAISS)
from sklearn.neighbors import NearestNeighbors

def compare_exact_vs_approximate_search():
    """Compare brute-force and tree-based neighbor search strategies"""
    
    # Create large-scale dataset
    X_large, y_large = make_classification(
        n_samples=5000,
        n_features=50,
        n_informative=30,
        random_state=42
    )
    
    # Standardize
    scaler_large = StandardScaler()
    X_large_scaled = scaler_large.fit_transform(X_large)
    
    # Split data
    X_train_large, X_test_large, y_train_large, y_test_large = train_test_split(
        X_large_scaled, y_large, test_size=0.2, random_state=42
    )
    
    # Different algorithms
    algorithms = ['auto', 'ball_tree', 'kd_tree', 'brute']
    algorithm_results = {}
    
    print("Performance comparison of different search algorithms:")
    print("Algorithm\t\tTraining Time\tPrediction Time\tAccuracy")
    print("-" * 50)
    
    for algorithm in algorithms:
        # Training time
        start_time = time.time()
        knn_alg = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
        knn_alg.fit(X_train_large, y_train_large)
        train_time = time.time() - start_time
        
        # Prediction time
        start_time = time.time()
        y_pred_alg = knn_alg.predict(X_test_large)
        pred_time = time.time() - start_time
        
        # Accuracy
        accuracy_alg = accuracy_score(y_test_large, y_pred_alg)
        
        algorithm_results[algorithm] = {
            'train_time': train_time,
            'pred_time': pred_time,
            'accuracy': accuracy_alg
        }
        
        print(f"{algorithm}\t\t{train_time:.4f}s\t\t{pred_time:.4f}s\t\t{accuracy_alg:.4f}")
    
    # Visualize algorithm comparison
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    
    algs = list(algorithm_results.keys())
    train_times = [algorithm_results[alg]['train_time'] for alg in algs]
    pred_times = [algorithm_results[alg]['pred_time'] for alg in algs]
    accuracies_alg = [algorithm_results[alg]['accuracy'] for alg in algs]
    
    # Training time
    axes[0].bar(algs, train_times, color='skyblue', alpha=0.7)
    axes[0].set_title('Training Time Comparison')
    axes[0].set_ylabel('Time (seconds)')
    axes[0].tick_params(axis='x', rotation=45)
    
    # Prediction time
    axes[1].bar(algs, pred_times, color='lightgreen', alpha=0.7)
    axes[1].set_title('Prediction Time Comparison')
    axes[1].set_ylabel('Time (seconds)')
    axes[1].tick_params(axis='x', rotation=45)
    
    # Accuracy
    axes[2].bar(algs, accuracies_alg, color='lightcoral', alpha=0.7)
    axes[2].set_title('Accuracy Comparison')
    axes[2].set_ylabel('Accuracy')
    axes[2].tick_params(axis='x', rotation=45)
    axes[2].set_ylim(0.8, 1.0)
    
    plt.tight_layout()
    plt.show()
    
    return algorithm_results

alg_results = compare_exact_vs_approximate_search()
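
Beyond classifiers, the NearestNeighbors class imported above can be queried directly for neighbor indices and distances — useful when you want the neighbors themselves rather than a vote. A minimal sketch on random data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((100, 5))  # 100 training points in 5 dimensions

# Build a ball tree index over the training points
nn = NearestNeighbors(n_neighbors=3, algorithm='ball_tree')
nn.fit(X)

# Query the 3 nearest training points for one new sample;
# kneighbors returns distances and indices, both sorted nearest-first
query = rng.random((1, 5))
distances, indices = nn.kneighbors(query)
print("neighbor indices:", indices[0])
print("distances:", np.round(distances[0], 3))
```

This is the same machinery KNeighborsClassifier uses internally; querying it directly is the basis for the recommendation-system example earlier in the chapter.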

10.11 KNN vs Other Algorithms

10.11.1 Comprehensive Performance Comparison

python
# Compare KNN performance with other algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def comprehensive_algorithm_comparison():
    """Comprehensive comparison of KNN with other algorithms"""
    
    # Use wine dataset
    wine = load_wine()
    X_wine, y_wine = wine.data, wine.target
    
    X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(
        X_wine, y_wine, test_size=0.2, random_state=42, stratify=y_wine
    )
    
    # Standardize (important for some algorithms)
    scaler_wine = StandardScaler()
    X_train_wine_scaled = scaler_wine.fit_transform(X_train_wine)
    X_test_wine_scaled = scaler_wine.transform(X_test_wine)
    
    # Define algorithms
    algorithms = {
        'KNN': KNeighborsClassifier(n_neighbors=5),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'SVM': SVC(random_state=42),
        'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
        'Naive Bayes': GaussianNB()
    }
    
    from sklearn.pipeline import make_pipeline
    
    results = {}
    
    print("Comprehensive algorithm performance comparison:")
    print("Algorithm\t\tTraining Time\tPrediction Time\tAccuracy\t\tCV Score")
    print("-" * 65)
    
    for name, algorithm in algorithms.items():
        # Distance- and gradient-based models need scaled features;
        # for cross-validation, a pipeline scales inside each fold,
        # avoiding leakage from fitting the scaler on the full dataset
        if name in ['KNN', 'SVM', 'Logistic Regression']:
            X_train_alg = X_train_wine_scaled
            X_test_alg = X_test_wine_scaled
            cv_estimator = make_pipeline(StandardScaler(), algorithm)
        else:
            X_train_alg = X_train_wine
            X_test_alg = X_test_wine
            cv_estimator = algorithm
        
        # Training time
        start_time = time.time()
        algorithm.fit(X_train_alg, y_train_wine)
        train_time = time.time() - start_time
        
        # Prediction time
        start_time = time.time()
        y_pred_alg = algorithm.predict(X_test_alg)
        pred_time = time.time() - start_time
        
        # Accuracy
        accuracy_alg = accuracy_score(y_test_wine, y_pred_alg)
        
        # Cross-validation (scaling happens inside each fold where needed)
        cv_scores = cross_val_score(cv_estimator, X_wine, y_wine, cv=5)
        cv_mean = np.mean(cv_scores)
        
        results[name] = {
            'train_time': train_time,
            'pred_time': pred_time,
            'accuracy': accuracy_alg,
            'cv_score': cv_mean
        }
        
        print(f"{name}\t{train_time:.4f}s\t\t{pred_time:.4f}s\t\t{accuracy_alg:.4f}\t\t{cv_mean:.4f}")
    
    return results

comparison_results = comprehensive_algorithm_comparison()

# Visualize comprehensive comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Comprehensive Algorithm Performance Comparison', fontsize=16)

names = list(comparison_results.keys())
train_times = [comparison_results[name]['train_time'] for name in names]
pred_times = [comparison_results[name]['pred_time'] for name in names]
accuracies = [comparison_results[name]['accuracy'] for name in names]
cv_scores = [comparison_results[name]['cv_score'] for name in names]

# Training time
axes[0, 0].bar(names, train_times, color='skyblue', alpha=0.7)
axes[0, 0].set_title('Training Time')
axes[0, 0].set_ylabel('Time (seconds)')
axes[0, 0].tick_params(axis='x', rotation=45)

# Prediction time
axes[0, 1].bar(names, pred_times, color='lightgreen', alpha=0.7)
axes[0, 1].set_title('Prediction Time')
axes[0, 1].set_ylabel('Time (seconds)')
axes[0, 1].tick_params(axis='x', rotation=45)

# Test accuracy
axes[1, 0].bar(names, accuracies, color='lightcoral', alpha=0.7)
axes[1, 0].set_title('Test Accuracy')
axes[1, 0].set_ylabel('Accuracy')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].set_ylim(0.8, 1.0)

# Cross-validation scores
axes[1, 1].bar(names, cv_scores, color='gold', alpha=0.7)
axes[1, 1].set_title('Cross-validation Scores')
axes[1, 1].set_ylabel('CV Score')
axes[1, 1].tick_params(axis='x', rotation=45)
axes[1, 1].set_ylim(0.8, 1.0)

plt.tight_layout()
plt.show()

10.12 Exercises

Exercise 1: Basic KNN

  1. Train a KNN classifier using the iris dataset
  2. Compare the impact of different K values (1, 3, 5, 7, 9) on performance
  3. Visualize decision boundaries (select two features)
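
A possible starting point for Exercise 1 (a sketch covering steps 1 and 2; the decision-boundary plot in step 3 is left to you):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42,
    stratify=iris.target)

# Scale features before fitting, as the chapter recommends
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

# Compare the requested K values
results = {}
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr_s, y_tr)
    results[k] = accuracy_score(y_te, knn.predict(X_te_s))
    print(f"K={k}: accuracy={results[k]:.4f}")
```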

Exercise 2: Distance Metric Comparison

  1. Create a 2D classification dataset
  2. Compare the effects of Euclidean, Manhattan, and Chebyshev distances
  3. Analyze which distance metric works best for your data

Exercise 3: Feature Scaling Experiment

  1. Create a dataset with features of different scales
  2. Compare KNN performance with standardization, min-max scaling, and no scaling
  3. Explain why feature scaling is so important for KNN

Exercise 4: KNN Regression

  1. Train a KNN regressor using the California housing dataset (fetch_california_housing; the Boston housing dataset has been removed from scikit-learn)
  2. Compare the effects of different K values and weighting methods
  3. Compare performance with linear regression

Exercise 5: High-dimensional Data Processing

  1. Create a high-dimensional dataset (number of features > 50)
  2. Use feature selection techniques to improve KNN performance
  3. Analyze the impact of the curse of dimensionality on KNN

10.13 Summary

In this chapter, we have thoroughly learned various aspects of the K-Nearest Neighbors algorithm:

Core Concepts

  • Instance-based Learning: Making predictions through similar samples
  • Distance Metrics: Euclidean, Manhattan, Chebyshev, and other distances
  • K Value Selection: Important parameter balancing bias and variance

Main Techniques

  • KNN Classification: Majority voting decision
  • KNN Regression: Average or weighted average prediction
  • Distance Weighting: Greater weight for nearest neighbors
  • Different Algorithms: Brute force, KD-tree, Ball tree
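
The distance weighting listed above is a single scikit-learn parameter; a quick sketch comparing uniform and distance weights (synthetic data, for illustration only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# weights='uniform' counts every neighbor equally;
# weights='distance' weights each vote by 1/distance
scores = {}
for w in ['uniform', 'distance']:
    model = make_pipeline(StandardScaler(),
                          KNeighborsClassifier(n_neighbors=5, weights=w))
    scores[w] = np.mean(cross_val_score(model, X, y, cv=5))
    print(f"weights='{w}': CV accuracy = {scores[w]:.4f}")
```

Which weighting wins depends on the data; distance weighting tends to help when classes overlap locally, but it can also amplify noise from very close outliers.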

Practical Skills

  • Feature Scaling: Importance of standardization and normalization
  • Hyperparameter Tuning: K value, distance metrics, weighting methods
  • Performance Optimization: Feature selection, algorithm selection
  • Practical Applications: Recommendation systems, image recognition

Key Points

  • KNN is simple and intuitive with no training process
  • Sensitive to feature scaling and distance metrics
  • High computational complexity, not suitable for large-scale data
  • Performs well on low-dimensional data, requires feature selection for high-dimensional data

Advantages and Disadvantages

Advantages:

  • Simple algorithm, easy to understand and implement
  • No assumptions about data distribution
  • Suitable for multi-class problems
  • Can capture local patterns

Disadvantages:

  • High computational and storage overhead
  • Sensitive to curse of dimensionality
  • Sensitive to noise and outliers
  • Requires appropriate K value and distance metric selection

10.14 Usage Recommendations

When to Use KNN

  • Small to medium-sized datasets
  • Nonlinear decision boundaries
  • Need for simple baseline models
  • Problems where local patterns are important

When to Avoid KNN

  • Large-scale datasets
  • High-dimensional data (without feature selection)
  • Real-time prediction requirements
  • Limited storage space

Best Practices

  1. Must perform feature scaling
  2. Use cross-validation to select K value
  3. Consider feature selection or dimensionality reduction
  4. Choose distance metrics based on data characteristics
  5. Consider approximate methods for large datasets
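
Practices 1, 2, and 4 combine naturally into one pipeline plus grid search; a minimal sketch on the wine dataset:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling inside the pipeline keeps it within each CV fold
pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# Search K, weighting, and distance metric jointly
param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9, 11],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['euclidean', 'manhattan'],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print("Best params:", search.best_params_)
print(f"Best CV score: {search.best_score_:.4f}")
```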

10.15 Next Steps

Now you have mastered the intuitive and powerful K-Nearest Neighbors algorithm! In the next chapter Cluster Analysis, we will enter the world of unsupervised learning and learn how to discover hidden patterns in data without labels.


Chapter Key Points Review:

  • ✅ Understood the basic principles and algorithm steps of KNN
  • ✅ Mastered the selection and application of different distance metrics
  • ✅ Learned the implementation of KNN classification and regression
  • ✅ Understood the importance of feature scaling for KNN
  • ✅ Mastered KNN hyperparameter tuning methods
  • ✅ Able to handle curse of dimensionality and performance optimization issues
  • ✅ Understood the advantages and limitations of KNN in practical applications

Content is for learning and research only.