Chapter 10: K-Nearest Neighbors Algorithm
The K-Nearest Neighbors (KNN) algorithm is one of the simplest and most intuitive algorithms in machine learning. It embodies the idea that "birds of a feather flock together": to predict a new sample, it finds the K most similar training samples and lets them decide.
10.1 What is K-Nearest Neighbors?
KNN is an instance-based learning method, also known as "lazy learning": it builds no explicit model during the training phase and defers all real computation to prediction time.
10.1.1 Core Concepts
- Similarity Assumption: Similar samples have similar labels
- Locality Principle: Nearby samples are more important than distant samples
- Non-parametric Method: Makes no assumptions about data distribution
10.1.2 Algorithm Steps
- Store Training Data: Save all training samples (this is the entire "training" phase)
- Calculate Distances: Compute the distance between the test sample and every training sample
- Find K Nearest Neighbors: Select the K samples with the smallest distances
- Make Predictions (a minimal from-scratch sketch follows this list):
  - Classification: Majority voting
  - Regression: Average or weighted average
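These steps translate almost line for line into code. A minimal from-scratch classification sketch (illustrative only; the scikit-learn implementation used throughout this chapter is faster and more robust):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Step 2: Euclidean distance from the query to every training sample
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Step 1 is simply keeping the training data in memory ("lazy learning")
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3))  # -> 0
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3))  # -> 1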
10.1.3 Advantages of KNN
- Simple and Intuitive: Algorithm logic is straightforward
- No Training Phase: There are no parameters to learn; fitting just stores the data
- Strong Adaptability: Can handle multi-class problems
- Locally Sensitive: Can capture local patterns
10.1.4 Disadvantages of KNN
- High Computational Cost: Every prediction computes distances to all training samples (O(n·d) per query with brute-force search)
- Large Storage Requirements: The entire training set must be kept in memory
- Sensitive to Dimensionality: Performance degrades on high-dimensional data (the "curse of dimensionality"; see Section 10.10.1)
- Sensitive to Noise: Outliers can sway the vote, especially for small K
10.2 Setting Up Environment and Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_regression, load_iris, load_wine
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
import time
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Set plot style
plt.style.use('seaborn-v0_8')
# Optional: a CJK-capable font and minus-sign fix, only needed for non-ASCII labels
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

10.3 Distance Metrics
10.3.1 Common Distance Metrics
The core of KNN is calculating distances between samples. Common distance metrics include:
def demonstrate_distance_metrics():
    """Demonstrate the calculation of common distance metrics."""
    # Two example points
    point1 = np.array([1, 2])
    point2 = np.array([4, 6])
    print("Coordinates of the two points:")
    print(f"Point 1: {point1}")
    print(f"Point 2: {point2}")
    print()
    # 1. Euclidean distance (L2)
    euclidean_dist = np.sqrt(np.sum((point1 - point2)**2))
    print(f"Euclidean distance: {euclidean_dist:.4f}")
    # 2. Manhattan distance (L1)
    manhattan_dist = np.sum(np.abs(point1 - point2))
    print(f"Manhattan distance: {manhattan_dist:.4f}")
    # 3. Chebyshev distance (L∞)
    chebyshev_dist = np.max(np.abs(point1 - point2))
    print(f"Chebyshev distance: {chebyshev_dist:.4f}")
    # 4. Minkowski distance (general form: p=1 is Manhattan, p=2 is Euclidean)
    def minkowski_distance(p1, p2, p):
        return np.sum(np.abs(p1 - p2)**p)**(1/p)
    print(f"Minkowski distance (p=1): {minkowski_distance(point1, point2, 1):.4f}")
    print(f"Minkowski distance (p=2): {minkowski_distance(point1, point2, 2):.4f}")
    print(f"Minkowski distance (p=3): {minkowski_distance(point1, point2, 3):.4f}")

demonstrate_distance_metrics()
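These hand-rolled formulas can be cross-checked against SciPy's reference implementations; a minimal sketch, assuming SciPy is available (it is a scikit-learn dependency):

from scipy.spatial.distance import euclidean, cityblock, chebyshev, minkowski

p1, p2 = np.array([1, 2]), np.array([4, 6])
print(f"SciPy Euclidean: {euclidean(p1, p2):.4f}")   # 5.0000
print(f"SciPy Manhattan: {cityblock(p1, p2):.4f}")   # 7.0000
print(f"SciPy Chebyshev: {chebyshev(p1, p2):.4f}")   # 4.0000
print(f"SciPy Minkowski (p=3): {minkowski(p1, p2, 3):.4f}")

10.3.2 Visualizing Different Distance Metrics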
def visualize_distance_metrics():
    """Visualize equidistant contours for different distance metrics."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    # Grid around the origin, which serves as the reference point
    x = np.linspace(-2, 2, 100)
    y = np.linspace(-2, 2, 100)
    X, Y = np.meshgrid(x, y)
    # Distance from the origin under each metric
    metrics = [
        ('Euclidean Distance Contours', np.sqrt(X**2 + Y**2)),
        ('Manhattan Distance Contours', np.abs(X) + np.abs(Y)),
        ('Chebyshev Distance Contours', np.maximum(np.abs(X), np.abs(Y))),
    ]
    for ax, (title, dist) in zip(axes, metrics):
        ax.contour(X, Y, dist, levels=[0.5, 1.0, 1.5], colors=['red', 'blue', 'green'])
        ax.plot(0, 0, 'ko', markersize=8)  # reference point
        ax.set_title(title)
        ax.set_xlabel('X')
        ax.set_ylabel('Y')
        ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

visualize_distance_metrics()

10.4 KNN Classification
10.4.1 Basic KNN Classification
# Create binary classification data
X_class, y_class = make_classification(
    n_samples=200,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42, stratify=y_class
)

# Create and fit the KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)

# Predict
y_pred = knn_classifier.predict(X_test)
y_pred_proba = knn_classifier.predict_proba(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"KNN Classification Accuracy: {accuracy:.4f}")
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred))

# Visualize data distribution
plt.figure(figsize=(12, 5))

# Training data
plt.subplot(1, 2, 1)
colors = ['red', 'blue']
for i, color in enumerate(colors):
    mask = y_train == i
    plt.scatter(X_train[mask, 0], X_train[mask, 1],
                c=color, label=f'Class {i}', alpha=0.7)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Training Data Distribution')
plt.legend()
plt.grid(True, alpha=0.3)

# Test data and predictions
plt.subplot(1, 2, 2)
for i, color in enumerate(colors):
    mask = y_test == i
    plt.scatter(X_test[mask, 0], X_test[mask, 1],
                c=color, label=f'True Class {i}', alpha=0.7, s=100)
# Circle the wrong predictions
wrong_predictions = y_test != y_pred
plt.scatter(X_test[wrong_predictions, 0], X_test[wrong_predictions, 1],
            facecolors='none', edgecolors='black', s=200, linewidths=2,
            label='Wrong Prediction')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Test Results')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

10.4.2 Decision Boundary Visualization
def plot_knn_decision_boundary(X, y, k_values, title="KNN Decision Boundary"):
    """Plot KNN decision boundaries for different K values."""
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle(title, fontsize=16)
    for i, k in enumerate(k_values):
        row, col = i // 2, i % 2
        # Train KNN
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X, y)
        # Create grid
        h = 0.02
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))
        # Predict grid points
        Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        # Plot decision boundary
        axes[row, col].contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
        # Plot data points
        colors = ['red', 'blue']
        for j, color in enumerate(colors):
            mask = y == j
            axes[row, col].scatter(X[mask, 0], X[mask, 1],
                                   c=color, alpha=0.7, s=50)
        axes[row, col].set_title(f'K = {k}')
        axes[row, col].set_xlabel('Feature 1')
        axes[row, col].set_ylabel('Feature 2')
        axes[row, col].grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

# Compare decision boundaries for different K values
k_values = [1, 3, 5, 15]
plot_knn_decision_boundary(X_train, y_train, k_values, "KNN Decision Boundaries for Different K Values")

10.4.3 K Value Impact Analysis
# Analyze the impact of different K values on performance
k_range = range(1, 31)
train_scores = []
test_scores = []

print("Impact of K value on KNN performance:")
print("K Value\tTraining Accuracy\tTest Accuracy")
print("-" * 35)
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_pred = knn.predict(X_train)
    test_pred = knn.predict(X_test)
    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, test_pred)
    train_scores.append(train_acc)
    test_scores.append(test_acc)
    if k <= 10 or k % 5 == 0:
        print(f"{k}\t{train_acc:.4f}\t\t{test_acc:.4f}")

# Visualize K value impact
plt.figure(figsize=(10, 6))
plt.plot(k_range, train_scores, 'o-', label='Training Accuracy', linewidth=2, markersize=6)
plt.plot(k_range, test_scores, 's-', label='Test Accuracy', linewidth=2, markersize=6)
plt.xlabel('K Value')
plt.ylabel('Accuracy')
plt.title('Impact of K Value on KNN Performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(range(1, 31, 2))
plt.show()

# Find the K value with the highest test accuracy
best_k = k_range[np.argmax(test_scores)]
best_test_score = max(test_scores)
print(f"\nBest K value: {best_k}")
print(f"Best test accuracy: {best_test_score:.4f}")
10.5.1 Comparison Before and After Scaling
# Create data with features of different scales
np.random.seed(42)
n_samples = 200
# Feature 1: Small range [0, 1]
feature1 = np.random.uniform(0, 1, n_samples)
# Feature 2: Large range [0, 1000]
feature2 = np.random.uniform(0, 1000, n_samples)
# Create labels (based on combination of both features)
labels = ((feature1 > 0.5) & (feature2 > 500)).astype(int)
X_scale = np.column_stack([feature1, feature2])
y_scale = labels
print("Data statistics before feature scaling:")
print(f"Feature 1 range: [{X_scale[:, 0].min():.3f}, {X_scale[:, 0].max():.3f}]")
print(f"Feature 2 range: [{X_scale[:, 1].min():.3f}, {X_scale[:, 1].max():.3f}]")
# Split data
X_train_scale, X_test_scale, y_train_scale, y_test_scale = train_test_split(
    X_scale, y_scale, test_size=0.2, random_state=42, stratify=y_scale
)
# 1. KNN without scaling
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train_scale, y_train_scale)
y_pred_no_scale = knn_no_scale.predict(X_test_scale)
acc_no_scale = accuracy_score(y_test_scale, y_pred_no_scale)
# 2. KNN after standardization
scaler_std = StandardScaler()
X_train_std = scaler_std.fit_transform(X_train_scale)
X_test_std = scaler_std.transform(X_test_scale)
knn_std = KNeighborsClassifier(n_neighbors=5)
knn_std.fit(X_train_std, y_train_scale)
y_pred_std = knn_std.predict(X_test_std)
acc_std = accuracy_score(y_test_scale, y_pred_std)
# 3. KNN after min-max scaling
scaler_minmax = MinMaxScaler()
X_train_minmax = scaler_minmax.fit_transform(X_train_scale)
X_test_minmax = scaler_minmax.transform(X_test_scale)
knn_minmax = KNeighborsClassifier(n_neighbors=5)
knn_minmax.fit(X_train_minmax, y_train_scale)
y_pred_minmax = knn_minmax.predict(X_test_minmax)
acc_minmax = accuracy_score(y_test_scale, y_pred_minmax)
print(f"\nImpact of feature scaling on KNN performance:")
print(f"No scaling: {acc_no_scale:.4f}")
print(f"Standardization: {acc_std:.4f}")
print(f"Min-Max scaling: {acc_minmax:.4f}")
# Visualize scaling effects
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Impact of Feature Scaling on KNN', fontsize=16)
# Original data
axes[0, 0].scatter(X_train_scale[y_train_scale == 0, 0], X_train_scale[y_train_scale == 0, 1],
                   c='red', alpha=0.6, label='Class 0')
axes[0, 0].scatter(X_train_scale[y_train_scale == 1, 0], X_train_scale[y_train_scale == 1, 1],
                   c='blue', alpha=0.6, label='Class 1')
axes[0, 0].set_title(f'Original Data (Accuracy: {acc_no_scale:.3f})')
axes[0, 0].set_xlabel('Feature 1')
axes[0, 0].set_ylabel('Feature 2')
axes[0, 0].legend()

# Standardized data
axes[0, 1].scatter(X_train_std[y_train_scale == 0, 0], X_train_std[y_train_scale == 0, 1],
                   c='red', alpha=0.6, label='Class 0')
axes[0, 1].scatter(X_train_std[y_train_scale == 1, 0], X_train_std[y_train_scale == 1, 1],
                   c='blue', alpha=0.6, label='Class 1')
axes[0, 1].set_title(f'Standardized Data (Accuracy: {acc_std:.3f})')
axes[0, 1].set_xlabel('Feature 1 (Standardized)')
axes[0, 1].set_ylabel('Feature 2 (Standardized)')
axes[0, 1].legend()

# Min-Max scaled data
axes[1, 0].scatter(X_train_minmax[y_train_scale == 0, 0], X_train_minmax[y_train_scale == 0, 1],
                   c='red', alpha=0.6, label='Class 0')
axes[1, 0].scatter(X_train_minmax[y_train_scale == 1, 0], X_train_minmax[y_train_scale == 1, 1],
                   c='blue', alpha=0.6, label='Class 1')
axes[1, 0].set_title(f'Min-Max Scaling (Accuracy: {acc_minmax:.3f})')
axes[1, 0].set_xlabel('Feature 1 (Scaled)')
axes[1, 0].set_ylabel('Feature 2 (Scaled)')
axes[1, 0].legend()

# Performance comparison bar chart
methods = ['No Scaling', 'Standardization', 'Min-Max Scaling']
accuracies = [acc_no_scale, acc_std, acc_minmax]
axes[1, 1].bar(methods, accuracies, color=['red', 'green', 'blue'], alpha=0.7)
axes[1, 1].set_title('Scaling Method Performance Comparison')
axes[1, 1].set_ylabel('Accuracy')
axes[1, 1].set_ylim(0, 1)
# Add value labels above the bars
for i, acc in enumerate(accuracies):
    axes[1, 1].text(i, acc + 0.02, f'{acc:.3f}', ha='center')

plt.tight_layout()
plt.show()

10.6 KNN Regression
10.6.1 Basic KNN Regression
# Create regression data
X_reg, y_reg = make_regression(
    n_samples=200,
    n_features=1,
    noise=10,
    random_state=42
)
# Add a nonlinear component
X_reg = X_reg.flatten()
y_reg = y_reg + 0.1 * X_reg**2

# Split data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg.reshape(-1, 1), y_reg, test_size=0.2, random_state=42
)

# Compare KNN regression with different K values
from sklearn.metrics import mean_squared_error, r2_score

k_values_reg = [1, 3, 5, 10, 20]
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('KNN Regression with Different K Values', fontsize=16)

print("KNN regression performance with different K values:")
print("K Value\tR² Score\t\tRMSE")
print("-" * 30)
for i, k in enumerate(k_values_reg):
    row, col = i // 3, i % 3
    # Train KNN regressor
    knn_reg = KNeighborsRegressor(n_neighbors=k)
    knn_reg.fit(X_train_reg, y_train_reg)
    # Predict
    y_pred_reg = knn_reg.predict(X_test_reg)
    # Evaluate
    r2 = r2_score(y_test_reg, y_pred_reg)
    rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))
    print(f"{k}\t{r2:.4f}\t\t{rmse:.2f}")
    # Visualize
    axes[row, col].scatter(X_train_reg, y_train_reg, alpha=0.6, label='Training Data', s=20)
    axes[row, col].scatter(X_test_reg, y_test_reg, alpha=0.6, color='green', label='Test Data', s=20)
    # Plot the prediction curve
    X_plot = np.linspace(X_reg.min(), X_reg.max(), 100).reshape(-1, 1)
    y_plot = knn_reg.predict(X_plot)
    axes[row, col].plot(X_plot, y_plot, color='red', linewidth=2, label='KNN Prediction')
    axes[row, col].set_title(f'K={k}, R²={r2:.3f}, RMSE={rmse:.1f}')
    axes[row, col].set_xlabel('X')
    axes[row, col].set_ylabel('y')
    axes[row, col].legend()
    axes[row, col].grid(True, alpha=0.3)

# Remove the unused sixth subplot
if len(k_values_reg) < 6:
    axes[1, 2].remove()
plt.tight_layout()
plt.show()

10.6.2 Weighted KNN Regression
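With weights='distance', scikit-learn weights each neighbor by the inverse of its distance, so the prediction becomes y_hat = sum(y_i / d_i) / sum(1 / d_i); an exact match (d_i = 0) simply returns that neighbor's target. A minimal sketch that reproduces this by hand for one query point, reusing the regression split above:

from sklearn.neighbors import NearestNeighbors

# Distances and indices of the 5 nearest training points to one test point
nn = NearestNeighbors(n_neighbors=5).fit(X_train_reg)
query = X_test_reg[:1]
distances, indices = nn.kneighbors(query)

# Inverse-distance weighted average (with a guard against zero distance)
weights = 1.0 / np.maximum(distances[0], 1e-12)
manual_pred = np.sum(weights * y_train_reg[indices[0]]) / np.sum(weights)

sk_pred = KNeighborsRegressor(n_neighbors=5, weights='distance') \
    .fit(X_train_reg, y_train_reg).predict(query)
print(f"Manual weighted prediction: {manual_pred:.3f}, sklearn: {sk_pred[0]:.3f}")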
# Compare ordinary KNN and weighted KNN
def compare_weighted_knn():
    """Compare uniform-weighted and distance-weighted KNN regression."""
    k = 5
    # Uniform-weighted (ordinary) KNN regression
    knn_uniform = KNeighborsRegressor(n_neighbors=k, weights='uniform')
    knn_uniform.fit(X_train_reg, y_train_reg)
    y_pred_uniform = knn_uniform.predict(X_test_reg)
    # Distance-weighted KNN regression
    knn_distance = KNeighborsRegressor(n_neighbors=k, weights='distance')
    knn_distance.fit(X_train_reg, y_train_reg)
    y_pred_distance = knn_distance.predict(X_test_reg)
    # Evaluate
    r2_uniform = r2_score(y_test_reg, y_pred_uniform)
    r2_distance = r2_score(y_test_reg, y_pred_distance)
    rmse_uniform = np.sqrt(mean_squared_error(y_test_reg, y_pred_uniform))
    rmse_distance = np.sqrt(mean_squared_error(y_test_reg, y_pred_distance))
    print(f"Impact of weighting method on KNN regression (K={k}):")
    print(f"Ordinary KNN: R²={r2_uniform:.4f}, RMSE={rmse_uniform:.2f}")
    print(f"Distance-weighted: R²={r2_distance:.4f}, RMSE={rmse_distance:.2f}")

    # Visualize the comparison
    plt.figure(figsize=(15, 5))
    X_plot = np.linspace(X_reg.min(), X_reg.max(), 100).reshape(-1, 1)
    y_plot_uniform = knn_uniform.predict(X_plot)
    y_plot_distance = knn_distance.predict(X_plot)

    # Ordinary KNN
    plt.subplot(1, 3, 1)
    plt.scatter(X_train_reg, y_train_reg, alpha=0.6, label='Training Data', s=20)
    plt.scatter(X_test_reg, y_test_reg, alpha=0.6, color='green', label='Test Data', s=20)
    plt.plot(X_plot, y_plot_uniform, color='red', linewidth=2, label='Ordinary KNN')
    plt.title(f'Ordinary KNN (R²={r2_uniform:.3f})')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # Distance-weighted KNN
    plt.subplot(1, 3, 2)
    plt.scatter(X_train_reg, y_train_reg, alpha=0.6, label='Training Data', s=20)
    plt.scatter(X_test_reg, y_test_reg, alpha=0.6, color='green', label='Test Data', s=20)
    plt.plot(X_plot, y_plot_distance, color='blue', linewidth=2, label='Distance-weighted KNN')
    plt.title(f'Distance-weighted KNN (R²={r2_distance:.3f})')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # Both prediction curves together
    plt.subplot(1, 3, 3)
    plt.scatter(X_train_reg, y_train_reg, alpha=0.4, label='Training Data', s=20, color='gray')
    plt.plot(X_plot, y_plot_uniform, color='red', linewidth=2, label='Ordinary KNN', linestyle='--')
    plt.plot(X_plot, y_plot_distance, color='blue', linewidth=2, label='Distance-weighted KNN')
    plt.title('Comparison of Both Methods')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

compare_weighted_knn()

10.7 Impact of Distance Metrics
10.7.1 Comparison of Different Distance Metrics
# Compare the impact of different distance metrics on KNN
# Use the iris dataset
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# Standardize data
scaler = StandardScaler()
X_train_iris_scaled = scaler.fit_transform(X_train_iris)
X_test_iris_scaled = scaler.transform(X_test_iris)

# Different distance metrics
distance_metrics = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']
metric_results = {}

print("Impact of different distance metrics on KNN classification:")
print("Distance Metric\t\tAccuracy\t\tCross-validation Score")
print("-" * 50)
for metric in distance_metrics:
    if metric == 'minkowski':
        # Minkowski distance requires the p parameter (p=3 here for illustration)
        knn_metric = KNeighborsClassifier(n_neighbors=5, metric=metric, p=3)
    else:
        knn_metric = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn_metric.fit(X_train_iris_scaled, y_train_iris)
    y_pred_metric = knn_metric.predict(X_test_iris_scaled)
    accuracy_metric = accuracy_score(y_test_iris, y_pred_metric)
    # Cross-validate with scaling inside each fold to avoid data leakage
    pipeline = Pipeline([('scaler', StandardScaler()), ('knn', knn_metric)])
    cv_scores = cross_val_score(pipeline, X_iris, y_iris, cv=5)
    cv_mean = np.mean(cv_scores)
    metric_results[metric] = {'accuracy': accuracy_metric, 'cv_score': cv_mean}
    print(f"{metric}\t\t{accuracy_metric:.4f}\t\t{cv_mean:.4f}")
# Visualize impact of distance metrics
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
metrics = list(metric_results.keys())
accuracies = [metric_results[m]['accuracy'] for m in metrics]
cv_scores = [metric_results[m]['cv_score'] for m in metrics]
# Test accuracy
axes[0].bar(metrics, accuracies, color='skyblue', alpha=0.7)
axes[0].set_title('Test Accuracy for Different Distance Metrics')
axes[0].set_ylabel('Accuracy')
axes[0].tick_params(axis='x', rotation=45)
axes[0].set_ylim(0.8, 1.0)
# Cross-validation scores
axes[1].bar(metrics, cv_scores, color='lightgreen', alpha=0.7)
axes[1].set_title('Cross-validation Scores for Different Distance Metrics')
axes[1].set_ylabel('CV Score')
axes[1].tick_params(axis='x', rotation=45)
axes[1].set_ylim(0.8, 1.0)
plt.tight_layout()
plt.show()

10.8 Hyperparameter Tuning
10.8.1 Grid Search Optimization
# Use grid search to optimize KNN hyperparameters.
# p only applies to the minkowski metric, so use two separate sub-grids.
param_grid = [
    {'n_neighbors': list(range(1, 21)),
     'weights': ['uniform', 'distance'],
     'metric': ['euclidean', 'manhattan']},
    {'n_neighbors': list(range(1, 21)),
     'weights': ['uniform', 'distance'],
     'metric': ['minkowski'],
     'p': [1, 2, 3]},
]

# Create KNN classifier
knn_grid = KNeighborsClassifier()

# Grid search
print("Performing KNN hyperparameter grid search...")
grid_search = GridSearchCV(
    knn_grid,
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train_iris_scaled, y_train_iris)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Evaluate the best model on the held-out test set
best_knn = grid_search.best_estimator_
y_pred_best = best_knn.predict(X_test_iris_scaled)
test_accuracy = accuracy_score(y_test_iris, y_pred_best)
print(f"Test set accuracy: {test_accuracy:.4f}")

# Detailed evaluation
print("\nDetailed evaluation of best KNN model:")
print(classification_report(y_test_iris, y_pred_best,
                            target_names=iris.target_names))

10.8.2 Learning Curve Analysis
from sklearn.model_selection import learning_curve

def plot_learning_curve_knn(estimator, X, y, title="KNN Learning Curve"):
    """Plot the learning curve for a KNN estimator."""
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy'
    )
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)

    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
                     alpha=0.1, color='blue')
    plt.plot(train_sizes, val_mean, 'o-', color='red', label='Validation Score')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std,
                     alpha=0.1, color='red')
    plt.xlabel('Number of Training Samples')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Plot learning curve for the best KNN model
plot_learning_curve_knn(best_knn, X_iris, y_iris, "Best KNN Model Learning Curve")

10.9 Practical Application Examples
10.9.1 Handwritten Digit Recognition
# Load the handwritten digits dataset (8x8 grayscale images)
from sklearn.datasets import load_digits

digits = load_digits()
X_digits, y_digits = digits.data, digits.target

print("Handwritten digit dataset information:")
print(f"Number of samples: {X_digits.shape[0]}")
print(f"Number of features: {X_digits.shape[1]}")
print(f"Number of classes: {len(np.unique(y_digits))}")

# Visualize one sample of each digit
fig, axes = plt.subplots(2, 5, figsize=(12, 6))
fig.suptitle('Handwritten Digit Samples', fontsize=16)
for i in range(10):
    row, col = i // 5, i % 5
    # Find the first sample of this digit
    idx = np.where(y_digits == i)[0][0]
    digit_image = X_digits[idx].reshape(8, 8)
    axes[row, col].imshow(digit_image, cmap='gray')
    axes[row, col].set_title(f'Digit {i}')
    axes[row, col].axis('off')
plt.tight_layout()
plt.show()

# Split data
X_train_digits, X_test_digits, y_train_digits, y_test_digits = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=42, stratify=y_digits
)

# Standardize features
scaler_digits = StandardScaler()
X_train_digits_scaled = scaler_digits.fit_transform(X_train_digits)
X_test_digits_scaled = scaler_digits.transform(X_test_digits)

# Train KNN classifier
knn_digits = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn_digits.fit(X_train_digits_scaled, y_train_digits)

# Predict
y_pred_digits = knn_digits.predict(X_test_digits_scaled)
accuracy_digits = accuracy_score(y_test_digits, y_pred_digits)
print(f"\nKNN accuracy for handwritten digit recognition: {accuracy_digits:.4f}")

# Detailed classification report
print("\nDetailed classification report:")
print(classification_report(y_test_digits, y_pred_digits))

# Confusion matrix
cm_digits = confusion_matrix(y_test_digits, y_pred_digits)
plt.figure(figsize=(10, 8))
sns.heatmap(cm_digits, annot=True, fmt='d', cmap='Blues',
            xticklabels=range(10), yticklabels=range(10))
plt.title('Confusion Matrix for Handwritten Digit Recognition')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# Analyze misclassified samples
wrong_predictions = y_test_digits != y_pred_digits
wrong_indices = np.where(wrong_predictions)[0]
if len(wrong_indices) > 0:
    print(f"\nNumber of misclassified samples: {len(wrong_indices)}")
    # Display some misclassified samples
    fig, axes = plt.subplots(2, 5, figsize=(12, 6))
    fig.suptitle('Misclassified Samples', fontsize=16)
    for i in range(min(10, len(wrong_indices))):
        row, col = i // 5, i % 5
        idx = wrong_indices[i]
        digit_image = X_test_digits[idx].reshape(8, 8)
        true_label = y_test_digits[idx]
        pred_label = y_pred_digits[idx]
        axes[row, col].imshow(digit_image, cmap='gray')
        axes[row, col].set_title(f'True: {true_label}, Pred: {pred_label}')
        axes[row, col].axis('off')
    # Remove empty subplots
    for i in range(min(10, len(wrong_indices)), 10):
        axes[i // 5, i % 5].remove()
    plt.tight_layout()
    plt.show()

10.9.2 Recommendation System Basics
# A simple KNN-based recommendation example
def create_recommendation_system():
    """Create a simple recommendation system based on KNN."""
    # Create a user-item rating matrix
    np.random.seed(42)
    n_users = 100
    n_items = 20

    # Generate a sparse rating matrix
    ratings = np.zeros((n_users, n_items))
    for user in range(n_users):
        # Each user rates 5-15 items
        n_ratings = np.random.randint(5, 16)
        items_to_rate = np.random.choice(n_items, n_ratings, replace=False)
        for item in items_to_rate:
            # Ratings are integers from 1 to 5
            ratings[user, item] = np.random.randint(1, 6)

    print("User-item rating matrix:")
    print(f"Number of users: {n_users}")
    print(f"Number of items: {n_items}")
    print(f"Rating density: {np.count_nonzero(ratings) / (n_users * n_items):.2%}")

    # Use KNN to find similar users; keep only users who rated something
    user_profiles = ratings[np.sum(ratings, axis=1) > 0]

    # Use cosine similarity (brute-force search is required for the cosine metric)
    knn_recommender = KNeighborsClassifier(
        n_neighbors=5,
        metric='cosine',
        algorithm='brute'
    )
    # For demonstration only: invent dummy labels (user types), since
    # KNeighborsClassifier requires a target to fit
    user_types = np.random.randint(0, 3, len(user_profiles))
    knn_recommender.fit(user_profiles, user_types)

    # Recommend for a new user who has rated a few items
    new_user_ratings = np.zeros(n_items)
    rated_items = [0, 2, 5, 8, 12]
    new_user_ratings[rated_items] = [4, 5, 3, 4, 5]

    print("\nNew user's ratings:")
    for item in rated_items:
        print(f"Item {item}: {new_user_ratings[item]}")

    # Find similar users
    distances, indices = knn_recommender.kneighbors([new_user_ratings], n_neighbors=5)
    print("\nTop 5 most similar users:")
    for dist, idx in zip(distances[0], indices[0]):
        print(f"User {idx}: Distance = {dist:.3f}")

    # Score items by the average rating of the similar users
    similar_users = user_profiles[indices[0]]
    recommendation_scores = np.mean(similar_users, axis=0)
    # Exclude items the new user has already rated
    recommendation_scores[rated_items] = 0

    # Top 5 recommended items
    recommended_items = np.argsort(recommendation_scores)[::-1][:5]
    print("\nRecommended items:")
    for i, item in enumerate(recommended_items):
        if recommendation_scores[item] > 0:
            print(f"{i+1}. Item {item}: Recommendation score = {recommendation_scores[item]:.2f}")

    return ratings, recommendation_scores

ratings_matrix, rec_scores = create_recommendation_system()

# Visualize the rating matrix
plt.figure(figsize=(12, 8))
plt.imshow(ratings_matrix[:20, :], cmap='YlOrRd', aspect='auto')
plt.colorbar(label='Rating')
plt.title('User-Item Rating Matrix (First 20 Users)')
plt.xlabel('Item ID')
plt.ylabel('User ID')
plt.show()
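A note on the design above: KNeighborsClassifier needs labels to fit, which is why dummy user types were invented. For pure neighbor lookup, scikit-learn's unsupervised NearestNeighbors is the more direct tool; a minimal sketch reusing the rating matrix:

from sklearn.neighbors import NearestNeighbors

# Unsupervised neighbor search: no labels required
nn_model = NearestNeighbors(n_neighbors=5, metric='cosine', algorithm='brute')
nn_model.fit(ratings_matrix)

# Same query as before: a new user with a handful of ratings
new_user = np.zeros(ratings_matrix.shape[1])
new_user[[0, 2, 5, 8, 12]] = [4, 5, 3, 4, 5]
distances, indices = nn_model.kneighbors([new_user])
print("Nearest users:", indices[0])
print("Cosine distances:", np.round(distances[0], 3))

10.10 KNN Optimization Techniques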
10.10.1 Curse of Dimensionality Problem
# Demonstrate the impact of the curse of dimensionality on KNN
def demonstrate_curse_of_dimensionality():
    """Measure KNN performance as the feature dimension grows."""
    dimensions = [2, 5, 10, 20, 50, 100]
    accuracies = []

    print("Impact of curse of dimensionality on KNN:")
    print("Dimension\tSamples\tAccuracy\tFit+Predict Time")
    print("-" * 45)
    for dim in dimensions:
        # Create data with the given dimensionality
        X_dim, y_dim = make_classification(
            n_samples=500,
            n_features=dim,
            n_informative=min(dim, 10),
            n_redundant=0,
            n_clusters_per_class=1,
            random_state=42
        )
        # Split data
        X_train_dim, X_test_dim, y_train_dim, y_test_dim = train_test_split(
            X_dim, y_dim, test_size=0.2, random_state=42, stratify=y_dim
        )
        # Standardize
        scaler_dim = StandardScaler()
        X_train_dim_scaled = scaler_dim.fit_transform(X_train_dim)
        X_test_dim_scaled = scaler_dim.transform(X_test_dim)
        # Fit and predict, timing both (prediction dominates for KNN)
        start_time = time.time()
        knn_dim = KNeighborsClassifier(n_neighbors=5)
        knn_dim.fit(X_train_dim_scaled, y_train_dim)
        y_pred_dim = knn_dim.predict(X_test_dim_scaled)
        elapsed = time.time() - start_time
        # Evaluate
        accuracy_dim = accuracy_score(y_test_dim, y_pred_dim)
        accuracies.append(accuracy_dim)
        print(f"{dim}\t{len(X_dim)}\t{accuracy_dim:.4f}\t\t{elapsed:.4f}s")

    # Visualize accuracy versus dimensionality
    plt.figure(figsize=(10, 6))
    plt.plot(dimensions, accuracies, 'o-', linewidth=2, markersize=8)
    plt.xlabel('Feature Dimensions')
    plt.ylabel('Accuracy')
    plt.title('Impact of Curse of Dimensionality on KNN Performance')
    plt.grid(True, alpha=0.3)
    plt.show()

    return dimensions, accuracies

dims, accs = demonstrate_curse_of_dimensionality()
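The accuracy drop has a geometric explanation: as dimensionality grows, distances concentrate, so the nearest neighbor is barely closer than the farthest one and the locality that KNN relies on evaporates. A small illustration on uniform random data:

# The nearest/farthest distance ratio approaches 1 as dimensionality grows
rng = np.random.default_rng(42)
for dim in [2, 10, 100, 1000]:
    points = rng.random((1000, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"dim={dim:4d}  nearest/farthest distance ratio: {dists.min() / dists.max():.3f}")

10.10.2 Feature Selection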
# Use feature selection to improve KNN performance
from sklearn.feature_selection import SelectKBest, f_classif
# Use high-dimensional dataset
X_high_dim, y_high_dim = make_classification(
    n_samples=1000,
    n_features=100,
    n_informative=20,
    n_redundant=10,
    random_state=42
)
X_train_hd, X_test_hd, y_train_hd, y_test_hd = train_test_split(
    X_high_dim, y_high_dim, test_size=0.2, random_state=42, stratify=y_high_dim
)
# Standardize
scaler_hd = StandardScaler()
X_train_hd_scaled = scaler_hd.fit_transform(X_train_hd)
X_test_hd_scaled = scaler_hd.transform(X_test_hd)
# Compare impact of different feature counts
k_features = [5, 10, 20, 30, 50, 100]
feature_selection_results = {}

print("Impact of feature selection on KNN performance:")
print("Features\tAccuracy\tFit+Predict Time")
print("-" * 35)
for k in k_features:
    if k < X_high_dim.shape[1]:
        # Select the k best features by ANOVA F-score
        selector = SelectKBest(f_classif, k=k)
        X_train_selected = selector.fit_transform(X_train_hd_scaled, y_train_hd)
        X_test_selected = selector.transform(X_test_hd_scaled)
    else:
        # k equals the full feature count: use all features
        X_train_selected = X_train_hd_scaled
        X_test_selected = X_test_hd_scaled
    # Fit and predict, timing both
    start_time = time.time()
    knn_fs = KNeighborsClassifier(n_neighbors=5)
    knn_fs.fit(X_train_selected, y_train_hd)
    y_pred_fs = knn_fs.predict(X_test_selected)
    elapsed = time.time() - start_time

    accuracy_fs = accuracy_score(y_test_hd, y_pred_fs)
    feature_selection_results[k] = {'accuracy': accuracy_fs, 'time': elapsed}
    print(f"{k}\t{accuracy_fs:.4f}\t\t{elapsed:.4f}s")
# Visualize effect of feature selection
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
features = list(feature_selection_results.keys())
accuracies_fs = [feature_selection_results[k]['accuracy'] for k in features]
times_fs = [feature_selection_results[k]['time'] for k in features]
# Accuracy
axes[0].plot(features, accuracies_fs, 'o-', linewidth=2, markersize=8, color='blue')
axes[0].set_xlabel('Number of Features')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Impact of Feature Count on Accuracy')
axes[0].grid(True, alpha=0.3)
# Fit+predict time
axes[1].plot(features, times_fs, 's-', linewidth=2, markersize=8, color='red')
axes[1].set_xlabel('Number of Features')
axes[1].set_ylabel('Fit+Predict Time (seconds)')
axes[1].set_title('Impact of Feature Count on Fit+Predict Time')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

10.10.3 Approximate Nearest Neighbor Search
# scikit-learn's built-in neighbor searches (brute, kd_tree, ball_tree) are all
# exact: the tree structures accelerate the search without changing the answer.
# Truly approximate nearest-neighbor search (e.g., LSH or HNSW) comes from
# external libraries and trades a little accuracy for large speedups.
def compare_search_algorithms():
    """Compare the exact neighbor-search algorithms built into scikit-learn."""
    # Create a larger dataset
    X_large, y_large = make_classification(
        n_samples=5000,
        n_features=50,
        n_informative=30,
        random_state=42
    )
    # Standardize
    scaler_large = StandardScaler()
    X_large_scaled = scaler_large.fit_transform(X_large)
    # Split data
    X_train_large, X_test_large, y_train_large, y_test_large = train_test_split(
        X_large_scaled, y_large, test_size=0.2, random_state=42
    )

    # Available search algorithms
    algorithms = ['auto', 'ball_tree', 'kd_tree', 'brute']
    algorithm_results = {}

    print("Performance comparison of different search algorithms:")
    print("Algorithm\t\tTraining Time\tPrediction Time\tAccuracy")
    print("-" * 50)
    for algorithm in algorithms:
        # Training time (index construction for the tree-based methods)
        start_time = time.time()
        knn_alg = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
        knn_alg.fit(X_train_large, y_train_large)
        train_time = time.time() - start_time
        # Prediction time
        start_time = time.time()
        y_pred_alg = knn_alg.predict(X_test_large)
        pred_time = time.time() - start_time
        # Accuracy (identical across algorithms, since the search is exact)
        accuracy_alg = accuracy_score(y_test_large, y_pred_alg)
        algorithm_results[algorithm] = {
            'train_time': train_time,
            'pred_time': pred_time,
            'accuracy': accuracy_alg
        }
        print(f"{algorithm}\t\t{train_time:.4f}s\t\t{pred_time:.4f}s\t\t{accuracy_alg:.4f}")

    # Visualize the comparison
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    algs = list(algorithm_results.keys())
    train_times = [algorithm_results[alg]['train_time'] for alg in algs]
    pred_times = [algorithm_results[alg]['pred_time'] for alg in algs]
    accuracies_alg = [algorithm_results[alg]['accuracy'] for alg in algs]
    # Training time
    axes[0].bar(algs, train_times, color='skyblue', alpha=0.7)
    axes[0].set_title('Training Time Comparison')
    axes[0].set_ylabel('Time (seconds)')
    axes[0].tick_params(axis='x', rotation=45)
    # Prediction time
    axes[1].bar(algs, pred_times, color='lightgreen', alpha=0.7)
    axes[1].set_title('Prediction Time Comparison')
    axes[1].set_ylabel('Time (seconds)')
    axes[1].tick_params(axis='x', rotation=45)
    # Accuracy
    axes[2].bar(algs, accuracies_alg, color='lightcoral', alpha=0.7)
    axes[2].set_title('Accuracy Comparison')
    axes[2].set_ylabel('Accuracy')
    axes[2].tick_params(axis='x', rotation=45)
    axes[2].set_ylim(0.8, 1.0)
    plt.tight_layout()
    plt.show()

    return algorithm_results

alg_results = compare_search_algorithms()

10.11 KNN vs Other Algorithms
10.11.1 Comprehensive Performance Comparison
# Compare KNN with other common classifiers
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def comprehensive_algorithm_comparison():
    """Compare KNN with other algorithms on the wine dataset."""
    wine = load_wine()
    X_wine, y_wine = wine.data, wine.target
    X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(
        X_wine, y_wine, test_size=0.2, random_state=42, stratify=y_wine
    )
    # Standardize (important for the distance- and margin-based algorithms)
    scaler_wine = StandardScaler()
    X_train_wine_scaled = scaler_wine.fit_transform(X_train_wine)
    X_test_wine_scaled = scaler_wine.transform(X_test_wine)

    # Define algorithms
    algorithms = {
        'KNN': KNeighborsClassifier(n_neighbors=5),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'SVM': SVC(random_state=42),
        'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
        'Naive Bayes': GaussianNB()
    }

    results = {}
    print("Comprehensive algorithm performance comparison:")
    print("Algorithm\t\tTraining Time\tPrediction Time\tAccuracy\t\tCV Score")
    print("-" * 65)
    for name, algorithm in algorithms.items():
        # Scale-sensitive algorithms get the standardized data; for
        # cross-validation, put the scaler inside a pipeline so each fold
        # is scaled independently (no leakage)
        if name in ['KNN', 'SVM', 'Logistic Regression']:
            X_train_alg = X_train_wine_scaled
            X_test_alg = X_test_wine_scaled
            cv_estimator = Pipeline([('scaler', StandardScaler()), ('model', algorithm)])
        else:
            X_train_alg = X_train_wine
            X_test_alg = X_test_wine
            cv_estimator = algorithm
        # Training time
        start_time = time.time()
        algorithm.fit(X_train_alg, y_train_wine)
        train_time = time.time() - start_time
        # Prediction time
        start_time = time.time()
        y_pred_alg = algorithm.predict(X_test_alg)
        pred_time = time.time() - start_time
        # Accuracy
        accuracy_alg = accuracy_score(y_test_wine, y_pred_alg)
        # Cross-validation
        cv_scores = cross_val_score(cv_estimator, X_wine, y_wine, cv=5)
        cv_mean = np.mean(cv_scores)
        results[name] = {
            'train_time': train_time,
            'pred_time': pred_time,
            'accuracy': accuracy_alg,
            'cv_score': cv_mean
        }
        print(f"{name}\t{train_time:.4f}s\t\t{pred_time:.4f}s\t\t{accuracy_alg:.4f}\t\t{cv_mean:.4f}")

    return results

comparison_results = comprehensive_algorithm_comparison()
# Visualize comprehensive comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Comprehensive Algorithm Performance Comparison', fontsize=16)
names = list(comparison_results.keys())
train_times = [comparison_results[name]['train_time'] for name in names]
pred_times = [comparison_results[name]['pred_time'] for name in names]
accuracies = [comparison_results[name]['accuracy'] for name in names]
cv_scores = [comparison_results[name]['cv_score'] for name in names]
# Training time
axes[0, 0].bar(names, train_times, color='skyblue', alpha=0.7)
axes[0, 0].set_title('Training Time')
axes[0, 0].set_ylabel('Time (seconds)')
axes[0, 0].tick_params(axis='x', rotation=45)
# Prediction time
axes[0, 1].bar(names, pred_times, color='lightgreen', alpha=0.7)
axes[0, 1].set_title('Prediction Time')
axes[0, 1].set_ylabel('Time (seconds)')
axes[0, 1].tick_params(axis='x', rotation=45)
# Test accuracy
axes[1, 0].bar(names, accuracies, color='lightcoral', alpha=0.7)
axes[1, 0].set_title('Test Accuracy')
axes[1, 0].set_ylabel('Accuracy')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].set_ylim(0.8, 1.0)
# Cross-validation scores
axes[1, 1].bar(names, cv_scores, color='gold', alpha=0.7)
axes[1, 1].set_title('Cross-validation Scores')
axes[1, 1].set_ylabel('CV Score')
axes[1, 1].tick_params(axis='x', rotation=45)
axes[1, 1].set_ylim(0.8, 1.0)
plt.tight_layout()
plt.show()

10.12 Exercises
Exercise 1: Basic KNN
- Train a KNN classifier using the iris dataset
- Compare the impact of different K values (1, 3, 5, 7, 9) on performance
- Visualize decision boundaries (select two features)
Exercise 2: Distance Metric Comparison
- Create a 2D classification dataset
- Compare the effects of Euclidean, Manhattan, and Chebyshev distances
- Analyze which distance metric works best for your data
Exercise 3: Feature Scaling Experiment
- Create a dataset with features of different scales
- Compare KNN performance with standardization, min-max scaling, and no scaling
- Explain why feature scaling is so important for KNN
Exercise 4: KNN Regression
- Train a KNN regressor using the California housing dataset (fetch_california_housing; the older Boston dataset has been removed from scikit-learn)
- Compare the effects of different K values and weighting methods
- Compare performance with linear regression
Exercise 5: High-dimensional Data Processing
- Create a high-dimensional dataset (number of features > 50)
- Use feature selection techniques to improve KNN performance
- Analyze the impact of the curse of dimensionality on KNN
10.13 Summary
In this chapter, we have thoroughly learned various aspects of the K-Nearest Neighbors algorithm:
Core Concepts
- Instance-based Learning: Making predictions through similar samples
- Distance Metrics: Euclidean, Manhattan, Chebyshev, and other distances
- K Value Selection: The key parameter balancing bias and variance (small K: low bias, high variance; large K: the reverse)
Main Techniques
- KNN Classification: Majority voting decision
- KNN Regression: Average or weighted average prediction
- Distance Weighting: Greater weight for nearest neighbors
- Different Algorithms: Brute force, KD-tree, Ball tree
Practical Skills
- Feature Scaling: Importance of standardization and normalization
- Hyperparameter Tuning: K value, distance metrics, weighting methods
- Performance Optimization: Feature selection, algorithm selection
- Practical Applications: Recommendation systems, image recognition
Key Points
- KNN is simple and intuitive with no training process
- Sensitive to feature scaling and distance metrics
- High computational complexity, not suitable for large-scale data
- Performs well on low-dimensional data, requires feature selection for high-dimensional data
Advantages and Disadvantages
Advantages:
- Simple algorithm, easy to understand and implement
- No assumptions about data distribution
- Suitable for multi-class problems
- Can capture local patterns
Disadvantages:
- High computational and storage overhead
- Sensitive to curse of dimensionality
- Sensitive to noise and outliers
- Requires appropriate K value and distance metric selection
10.14 Usage Recommendations
When to Use KNN
- Small to medium-sized datasets
- Nonlinear decision boundaries
- Need for simple baseline models
- Problems where local patterns are important
When to Avoid KNN
- Large-scale datasets
- High-dimensional data (without feature selection)
- Real-time prediction requirements
- Limited storage space
Best Practices
- Always perform feature scaling (a pipeline sketch follows this list)
- Use cross-validation to select K value
- Consider feature selection or dimensionality reduction
- Choose distance metrics based on data characteristics
- Consider approximate methods for large datasets
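Putting the first two recommendations together: do the scaling inside a Pipeline, so that cross-validation fits the scaler on each training fold only. A minimal sketch, reusing the iris data loaded earlier:

# Scaling inside the pipeline keeps each CV fold free of test-fold statistics
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
search = GridSearchCV(
    knn_pipeline,
    {'knn__n_neighbors': list(range(1, 21)), 'knn__weights': ['uniform', 'distance']},
    cv=5, scoring='accuracy'
)
search.fit(X_iris, y_iris)
print(f"Best parameters: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.4f}")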
10.15 Next Steps
Now you have mastered the intuitive and powerful K-Nearest Neighbors algorithm! In the next chapter Cluster Analysis, we will enter the world of unsupervised learning and learn how to discover hidden patterns in data without labels.
Chapter Key Points Review:
- ✅ Understood the basic principles and algorithm steps of KNN
- ✅ Mastered the selection and application of different distance metrics
- ✅ Learned the implementation of KNN classification and regression
- ✅ Understood the importance of feature scaling for KNN
- ✅ Mastered KNN hyperparameter tuning methods
- ✅ Able to handle curse of dimensionality and performance optimization issues
- ✅ Understood the advantages and limitations of KNN in practical applications