Chapter 11: Clustering Analysis
Clustering analysis is an important branch of unsupervised learning that groups samples based on data similarity, without label information. This chapter introduces various clustering algorithms in detail, covering their principles, implementation, and applications.
11.1 What is Clustering Analysis?
Clustering analysis is an exploratory data analysis technique that aims to group similar data points together and dissimilar points into different groups. Unlike classification, clustering is unsupervised learning and does not require prior knowledge of category labels.
11.1.1 Goals of Clustering
- Maximize intra-cluster similarity: Samples within the same cluster should be as similar as possible
- Minimize inter-cluster similarity: Samples between different clusters should be as different as possible
- Discover data structure: Reveal hidden patterns in the data
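The first two goals can be made concrete with a small sketch: on assumed toy data, the mean pairwise distance inside a cluster should be much smaller than the mean distance between clusters. The arrays `cluster_a` and `cluster_b` below are illustrative, not from a real dataset.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two assumed toy clusters, well separated in 2D
cluster_a = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]])
cluster_b = np.array([[5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# Mean pairwise distance within cluster A (intra-cluster dissimilarity);
# the full distance matrix has n*(n-1) off-diagonal entries
n = len(cluster_a)
intra = cdist(cluster_a, cluster_a).sum() / (n * (n - 1))

# Mean distance between points of A and points of B (inter-cluster dissimilarity)
inter = cdist(cluster_a, cluster_b).mean()

print(f"intra: {intra:.2f}, inter: {inter:.2f}")
```

For a good clustering, `intra` is small and `inter` is large; many evaluation metrics introduced later (such as the silhouette coefficient) are built on exactly this contrast.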
11.1.2 Applications of Clustering
- Market segmentation: Group customers based on consumption behavior
- Image segmentation: Divide images into different regions
- Gene analysis: Classify genes based on expression patterns
- Recommendation systems: Discover user groups and item categories
- Anomaly detection: Identify outliers that don't belong to any cluster
11.1.3 Classification of Clustering Algorithms
- Distance-based clustering: K-means, K-medoids
- Hierarchical clustering: Agglomerative clustering, divisive clustering
- Density-based clustering: DBSCAN, OPTICS
- Distribution-based clustering: Gaussian Mixture Models
- Grid-based clustering: Suitable for large datasets
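The taxonomy above maps fairly directly onto scikit-learn estimators, as the sketch below shows (the parameter values are placeholders; grid-based methods such as STING or CLIQUE have no standard scikit-learn implementation and are omitted).

```python
# Hedged sketch: one representative scikit-learn estimator per family above
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

family_to_estimator = {
    "distance-based": KMeans(n_clusters=3, n_init=10),
    "hierarchical": AgglomerativeClustering(n_clusters=3),
    "density-based": DBSCAN(eps=0.5, min_samples=5),
    "distribution-based": GaussianMixture(n_components=3),
}
for family, estimator in family_to_estimator.items():
    print(f"{family}: {type(estimator).__name__}")
```

All of these share the `fit_predict` interface used throughout this chapter, which is what makes the side-by-side comparisons in Section 11.7 possible.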
11.2 Preparing Environment and Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, make_circles, make_moons, load_iris, load_wine
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, SpectralClustering
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score, silhouette_score, calinski_harabasz_score
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform
import warnings
warnings.filterwarnings('ignore')
# Set random seed
np.random.seed(42)
# Set plot style
plt.style.use('seaborn-v0_8')
plt.rcParams['font.sans-serif'] = ['SimHei']  # CJK-capable font; remove if unavailable on your system
plt.rcParams['axes.unicode_minus'] = False
11.3 K-means Clustering
11.3.1 K-means Algorithm Principle
K-means is the most classic clustering algorithm, which minimizes the within-cluster sum of squares through iterative optimization.
def demonstrate_kmeans_principle():
    """Demonstrate the basic principle of the K-means algorithm"""
    # Create simple clustering data
    X_demo, y_true = make_blobs(n_samples=300, centers=3, n_features=2,
                                random_state=42, cluster_std=1.5)
    print("K-means algorithm steps:")
    print("1. Randomly initialize K cluster centers")
    print("2. Assign each point to the nearest cluster center")
    print("3. Update each cluster center to the mean of its assigned points")
    print("4. Repeat steps 2-3 until convergence")
    # Visualize data
    plt.figure(figsize=(15, 10))
    # Original data
    plt.subplot(2, 3, 1)
    plt.scatter(X_demo[:, 0], X_demo[:, 1], c='gray', alpha=0.6)
    plt.title('Original Data')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.grid(True, alpha=0.3)
    # Manually demonstrate the K-means iteration process
    k = 3
    colors = ['red', 'blue', 'green']
    # Initialize centers from randomly chosen data points (avoids empty clusters)
    centers = X_demo[np.random.choice(len(X_demo), k, replace=False)]
    for iteration in range(5):
        # Calculate the distance from each point to every cluster center
        distances = np.sqrt(((X_demo - centers[:, np.newaxis])**2).sum(axis=2))
        labels = np.argmin(distances, axis=0)
        if iteration < 4:  # Only show the first 4 iterations
            plt.subplot(2, 3, iteration + 2)
            # Plot data points and cluster centers
            for i in range(k):
                mask = labels == i
                plt.scatter(X_demo[mask, 0], X_demo[mask, 1],
                            c=colors[i], alpha=0.6, s=30)
                plt.scatter(centers[i, 0], centers[i, 1],
                            c='black', marker='x', s=200, linewidths=3)
            plt.title(f'Iteration {iteration + 1}')
            plt.xlabel('Feature 1')
            plt.ylabel('Feature 2')
            plt.grid(True, alpha=0.3)
        # Update cluster centers to the mean of their assigned points
        centers = np.array([X_demo[labels == i].mean(axis=0) for i in range(k)])
    # Final result
    plt.subplot(2, 3, 6)
    for i in range(k):
        mask = labels == i
        plt.scatter(X_demo[mask, 0], X_demo[mask, 1],
                    c=colors[i], alpha=0.6, s=30)
        plt.scatter(centers[i, 0], centers[i, 1],
                    c='black', marker='x', s=200, linewidths=3)
    plt.title('Final Result')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    return X_demo, y_true

X_demo, y_true = demonstrate_kmeans_principle()
11.3.2 Implementing K-means with Scikit-learn
# Use Scikit-learn's K-means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X_demo)
# Get cluster centers
centers = kmeans.cluster_centers_
print("K-means clustering results:")
print(f"Cluster centers:\n{centers}")
print(f"Inertia (WCSS): {kmeans.inertia_:.2f}")
# Visualize results
plt.figure(figsize=(15, 5))
# True labels
plt.subplot(1, 3, 1)
colors = ['red', 'blue', 'green']
for i in range(3):
    mask = y_true == i
    plt.scatter(X_demo[mask, 0], X_demo[mask, 1], c=colors[i], alpha=0.6, label=f'True Cluster {i}')
plt.title('True Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, alpha=0.3)
# K-means results
plt.subplot(1, 3, 2)
for i in range(3):
    mask = y_pred == i
    plt.scatter(X_demo[mask, 0], X_demo[mask, 1], c=colors[i], alpha=0.6, label=f'Predicted Cluster {i}')
plt.scatter(centers[:, 0], centers[:, 1], c='black', marker='x', s=200, linewidths=3, label='Cluster Centers')
plt.title('K-means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, alpha=0.3)
# Comparison
plt.subplot(1, 3, 3)
plt.scatter(X_demo[:, 0], X_demo[:, 1], c=y_pred, cmap='viridis', alpha=0.6)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x', s=200, linewidths=3)
plt.title('Clustering Results (Color Coded)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
11.3.3 Selecting the Optimal K Value
def find_optimal_k(X, k_range):
    """Use the elbow method and silhouette coefficient to select the optimal K"""
    inertias = []
    silhouette_scores = []
    print("Clustering performance for different K values:")
    print("K Value\tInertia (WCSS)\tSilhouette Coefficient")
    print("-" * 45)
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        y_pred = kmeans.fit_predict(X)
        inertia = kmeans.inertia_
        if k > 1:  # The silhouette coefficient requires at least 2 clusters
            silhouette = silhouette_score(X, y_pred)
        else:
            silhouette = 0
        inertias.append(inertia)
        silhouette_scores.append(silhouette)
        print(f"{k}\t{inertia:.2f}\t\t{silhouette:.4f}")
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    # Elbow method
    axes[0].plot(k_range, inertias, 'bo-', linewidth=2, markersize=8)
    axes[0].set_xlabel('K Value')
    axes[0].set_ylabel('Inertia (WCSS)')
    axes[0].set_title('Elbow Method')
    axes[0].grid(True, alpha=0.3)
    # Silhouette coefficient (skip K=1, which has no silhouette)
    axes[1].plot(k_range[1:], silhouette_scores[1:], 'ro-', linewidth=2, markersize=8)
    axes[1].set_xlabel('K Value')
    axes[1].set_ylabel('Silhouette Coefficient')
    axes[1].set_title('Silhouette Coefficient Method')
    axes[1].grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    # Find the best K value
    best_k_silhouette = k_range[1:][np.argmax(silhouette_scores[1:])]
    print(f"\nBest K value based on silhouette coefficient: {best_k_silhouette}")
    return inertias, silhouette_scores

# Find the optimal K value
k_range = range(1, 11)
inertias, silhouette_scores = find_optimal_k(X_demo, k_range)
11.3.4 K-means Variants
# Compare different K-means variants
import time
from sklearn.cluster import MiniBatchKMeans

def compare_kmeans_variants():
    """Compare different K-means variants"""
    # Create a larger dataset
    X_large, y_large = make_blobs(n_samples=2000, centers=4, n_features=2,
                                  random_state=42, cluster_std=2.0)
    # Different K-means algorithms
    algorithms = {
        'K-means': KMeans(n_clusters=4, random_state=42, n_init=10),
        'Mini-batch K-means': MiniBatchKMeans(n_clusters=4, random_state=42, batch_size=100)
    }
    results = {}
    print("K-means variant performance comparison:")
    print("Algorithm\t\tTraining Time\tInertia\t\tSilhouette Coefficient")
    print("-" * 60)
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    for i, (name, algorithm) in enumerate(algorithms.items()):
        # Training time
        start_time = time.time()
        y_pred = algorithm.fit_predict(X_large)
        train_time = time.time() - start_time
        # Performance metrics
        inertia = algorithm.inertia_
        silhouette = silhouette_score(X_large, y_pred)
        results[name] = {
            'time': train_time,
            'inertia': inertia,
            'silhouette': silhouette
        }
        print(f"{name}\t{train_time:.4f}s\t\t{inertia:.2f}\t\t{silhouette:.4f}")
        # Visualization
        axes[i].scatter(X_large[:, 0], X_large[:, 1], c=y_pred, cmap='viridis', alpha=0.6, s=10)
        axes[i].scatter(algorithm.cluster_centers_[:, 0], algorithm.cluster_centers_[:, 1],
                        c='red', marker='x', s=200, linewidths=3)
        axes[i].set_title(f'{name}\nSilhouette Coefficient: {silhouette:.3f}')
        axes[i].set_xlabel('Feature 1')
        axes[i].set_ylabel('Feature 2')
        axes[i].grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    return results

kmeans_results = compare_kmeans_variants()
11.4 Hierarchical Clustering
11.4.1 Agglomerative Hierarchical Clustering
def demonstrate_hierarchical_clustering():
    """Demonstrate hierarchical clustering"""
    # Create small-scale data for demonstration
    X_hier, y_hier = make_blobs(n_samples=50, centers=3, n_features=2,
                                random_state=42, cluster_std=1.0)
    # Agglomerative hierarchical clustering
    hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
    y_pred_hier = hierarchical.fit_predict(X_hier)
    print("Hierarchical clustering results:")
    print(f"Number of clusters: {len(np.unique(y_pred_hier))}")
    # Visualize results
    plt.figure(figsize=(15, 5))
    # Original data
    plt.subplot(1, 3, 1)
    plt.scatter(X_hier[:, 0], X_hier[:, 1], c='gray', alpha=0.6)
    plt.title('Original Data')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.grid(True, alpha=0.3)
    # Hierarchical clustering results
    plt.subplot(1, 3, 2)
    colors = ['red', 'blue', 'green']
    for i in range(3):
        mask = y_pred_hier == i
        plt.scatter(X_hier[mask, 0], X_hier[mask, 1], c=colors[i], alpha=0.6, label=f'Cluster {i}')
    plt.title('Hierarchical Clustering Results')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.grid(True, alpha=0.3)
    # Dendrogram
    plt.subplot(1, 3, 3)
    # Calculate the linkage matrix
    linkage_matrix = linkage(X_hier, method='ward')
    dendrogram(linkage_matrix, truncate_mode='level', p=5)
    plt.title('Dendrogram')
    plt.xlabel('Sample Index')
    plt.ylabel('Distance')
    plt.tight_layout()
    plt.show()
    return X_hier, y_pred_hier

X_hier, y_pred_hier = demonstrate_hierarchical_clustering()
11.4.2 Comparison of Different Linkage Methods
def compare_linkage_methods():
    """Compare different linkage methods"""
    # Create clustering data with different shapes
    X_shapes = []
    # Circular clusters
    X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=42)
    X_shapes.append(('Circular', X_circles, y_circles))
    # Moon-shaped clusters
    X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=42)
    X_shapes.append(('Moon-shaped', X_moons, y_moons))
    # Spherical clusters
    X_blobs, y_blobs = make_blobs(n_samples=200, centers=3, random_state=42)
    X_shapes.append(('Spherical', X_blobs, y_blobs))
    # Different linkage methods ('ward' is restricted to Euclidean distance)
    linkage_methods = ['ward', 'complete', 'average', 'single']
    fig, axes = plt.subplots(len(X_shapes), len(linkage_methods) + 1, figsize=(20, 12))
    fig.suptitle('Hierarchical Clustering Comparison with Different Linkage Methods', fontsize=16)
    for i, (shape_name, X, y_true) in enumerate(X_shapes):
        # Original data
        axes[i, 0].scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', alpha=0.6)
        axes[i, 0].set_title(f'{shape_name} Data')
        axes[i, 0].set_xlabel('Feature 1')
        axes[i, 0].set_ylabel('Feature 2')
        # Apply each linkage method
        for j, linkage_method in enumerate(linkage_methods):
            n_clusters = len(np.unique(y_true))
            hierarchical = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage_method)
            y_pred = hierarchical.fit_predict(X)
            axes[i, j + 1].scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', alpha=0.6)
            axes[i, j + 1].set_title(f'{linkage_method} Linkage')
            axes[i, j + 1].set_xlabel('Feature 1')
            axes[i, j + 1].set_ylabel('Feature 2')
    plt.tight_layout()
    plt.show()

compare_linkage_methods()
11.5 Density-Based Clustering (DBSCAN)
11.5.1 DBSCAN Algorithm Principle
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that can discover clusters of arbitrary shapes and identify noise points.
def demonstrate_dbscan_principle():
    """Demonstrate the DBSCAN algorithm principle"""
    print("DBSCAN core concepts:")
    print("1. Core points: points with at least min_samples points in their ε-neighborhood")
    print("2. Border points: points that are not core points but lie in the ε-neighborhood of a core point")
    print("3. Noise points: points that are neither core points nor border points")
    print("4. Density-reachable: points connected through a chain of core points")
    # Create data containing noise
    X_noise, _ = make_blobs(n_samples=200, centers=3, n_features=2,
                            random_state=42, cluster_std=1.0)
    # Add noise points
    noise_points = np.random.uniform(-8, 8, (20, 2))
    X_dbscan = np.vstack([X_noise, noise_points])
    # DBSCAN clustering
    dbscan = DBSCAN(eps=1.5, min_samples=5)
    y_pred_dbscan = dbscan.fit_predict(X_dbscan)
    # Statistics (label -1 marks noise)
    n_clusters = len(set(y_pred_dbscan)) - (1 if -1 in y_pred_dbscan else 0)
    n_noise = list(y_pred_dbscan).count(-1)
    print(f"\nDBSCAN clustering results:")
    print(f"Number of clusters: {n_clusters}")
    print(f"Number of noise points: {n_noise}")
    print(f"Number of core points: {len(dbscan.core_sample_indices_)}")
    # Visualize results
    plt.figure(figsize=(15, 5))
    # Original data
    plt.subplot(1, 3, 1)
    plt.scatter(X_dbscan[:, 0], X_dbscan[:, 1], c='gray', alpha=0.6)
    plt.title('Original Data (With Noise)')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.grid(True, alpha=0.3)
    # DBSCAN results
    plt.subplot(1, 3, 2)
    unique_labels = set(y_pred_dbscan)
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Noise points in black
            col = 'black'
            marker = 'x'
            alpha = 0.5
        else:
            marker = 'o'
            alpha = 0.7
        class_member_mask = (y_pred_dbscan == k)
        xy = X_dbscan[class_member_mask]
        plt.scatter(xy[:, 0], xy[:, 1], c=[col], marker=marker, alpha=alpha, s=50)
    plt.title('DBSCAN Clustering Results')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.grid(True, alpha=0.3)
    # Core point markers
    plt.subplot(1, 3, 3)
    # Plot all points
    plt.scatter(X_dbscan[:, 0], X_dbscan[:, 1], c=y_pred_dbscan,
                cmap='Spectral', alpha=0.6, s=50)
    # Mark core points
    core_samples = X_dbscan[dbscan.core_sample_indices_]
    plt.scatter(core_samples[:, 0], core_samples[:, 1],
                facecolors='none', edgecolors='red', s=100, linewidths=2, label='Core Points')
    plt.title('Core Point Markers')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    return X_dbscan, y_pred_dbscan

X_dbscan, y_pred_dbscan = demonstrate_dbscan_principle()
11.5.2 DBSCAN Parameter Tuning
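Before grid-searching eps and min_samples, a common heuristic (not part of this chapter's code) is the k-distance curve: sort every point's distance to its k-th nearest neighbor and read eps off the "knee" of the curve. A minimal sketch, assuming the same moon-shaped data used below:

```python
# Hedged sketch of the k-distance heuristic for choosing DBSCAN's eps
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
min_samples = 5
# Note: each query point is its own first neighbor (distance 0), so
# n_neighbors=min_samples yields the (min_samples - 1)-th true neighbor
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])  # ascending; the knee suggests eps

print(f"median k-distance: {np.median(k_distances):.3f}")
```

Plotting `k_distances` and picking the value where the curve bends sharply gives a starting eps, which the grid search below can then refine.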
def tune_dbscan_parameters():
    """Tune DBSCAN parameters"""
    # Use moon-shaped data to test DBSCAN
    X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=42)
    # Different eps and min_samples combinations
    eps_values = [0.1, 0.2, 0.3, 0.5]
    min_samples_values = [3, 5, 10, 15]
    fig, axes = plt.subplots(len(eps_values), len(min_samples_values), figsize=(16, 12))
    fig.suptitle('DBSCAN Parameter Tuning', fontsize=16)
    print("DBSCAN parameter tuning results:")
    print("eps\tmin_samples\tClusters\tNoise Points\tSilhouette Coefficient")
    print("-" * 55)
    best_score = -1
    best_params = None
    for i, eps in enumerate(eps_values):
        for j, min_samples in enumerate(min_samples_values):
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            y_pred = dbscan.fit_predict(X_moons)
            n_clusters = len(set(y_pred)) - (1 if -1 in y_pred else 0)
            n_noise = list(y_pred).count(-1)
            # The silhouette coefficient requires at least 2 clusters;
            # noise points are excluded from its computation
            if n_clusters >= 2:
                mask = y_pred != -1
                silhouette = silhouette_score(X_moons[mask], y_pred[mask])
            else:
                silhouette = -1
            print(f"{eps}\t{min_samples}\t\t{n_clusters}\t{n_noise}\t\t{silhouette:.4f}")
            # Record the best parameters
            if silhouette > best_score:
                best_score = silhouette
                best_params = (eps, min_samples)
            # Visualization
            axes[i, j].scatter(X_moons[:, 0], X_moons[:, 1], c=y_pred,
                               cmap='Spectral', alpha=0.7, s=30)
            axes[i, j].set_title(f'eps={eps}, min_samples={min_samples}\n'
                                 f'Clusters={n_clusters}, Silhouette={silhouette:.3f}')
            axes[i, j].set_xlabel('Feature 1')
            axes[i, j].set_ylabel('Feature 2')
    plt.tight_layout()
    plt.show()
    print(f"\nBest parameters: eps={best_params[0]}, min_samples={best_params[1]}")
    print(f"Best silhouette coefficient: {best_score:.4f}")
    return best_params

best_dbscan_params = tune_dbscan_parameters()
11.6 Gaussian Mixture Models (GMM)
11.6.1 GMM Basic Principles
def demonstrate_gmm():
    """Demonstrate Gaussian Mixture Models"""
    print("Gaussian Mixture Model (GMM) characteristics:")
    print("1. Assumes data comes from a mixture of multiple Gaussian distributions")
    print("2. Uses the EM algorithm for parameter estimation")
    print("3. Provides soft clustering (probability assignments)")
    print("4. Can handle elliptical clusters")
    # Create elliptical cluster data
    X_ellipse, y_ellipse = make_blobs(n_samples=300, centers=3, n_features=2,
                                      random_state=42, cluster_std=2.0)
    # Apply a linear transformation to make the clusters elliptical
    transformation_matrix = np.array([[1, 0.5], [0.5, 1]])
    X_ellipse = X_ellipse.dot(transformation_matrix)
    # Gaussian Mixture Model
    gmm = GaussianMixture(n_components=3, random_state=42)
    y_pred_gmm = gmm.fit_predict(X_ellipse)
    y_proba_gmm = gmm.predict_proba(X_ellipse)
    print(f"\nGMM clustering results:")
    print(f"Number of components: {gmm.n_components}")
    print(f"Converged: {gmm.converged_}")
    print(f"Number of iterations: {gmm.n_iter_}")
    print(f"Average log-likelihood per sample: {gmm.score(X_ellipse):.2f}")
    # Visualize results
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle('Gaussian Mixture Model Clustering', fontsize=16)
    # Original data
    axes[0, 0].scatter(X_ellipse[:, 0], X_ellipse[:, 1], c='gray', alpha=0.6)
    axes[0, 0].set_title('Original Data')
    axes[0, 0].set_xlabel('Feature 1')
    axes[0, 0].set_ylabel('Feature 2')
    axes[0, 0].grid(True, alpha=0.3)
    # GMM clustering results
    colors = ['red', 'blue', 'green']
    for i in range(3):
        mask = y_pred_gmm == i
        axes[0, 1].scatter(X_ellipse[mask, 0], X_ellipse[mask, 1],
                           c=colors[i], alpha=0.6, label=f'Cluster {i}')
    axes[0, 1].set_title('GMM Clustering Results')
    axes[0, 1].set_xlabel('Feature 1')
    axes[0, 1].set_ylabel('Feature 2')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    # Probability assignment (soft clustering):
    # probability of each point belonging to cluster 0
    scatter = axes[1, 0].scatter(X_ellipse[:, 0], X_ellipse[:, 1],
                                 c=y_proba_gmm[:, 0], cmap='Reds', alpha=0.7)
    axes[1, 0].set_title('Probability of Belonging to Cluster 0')
    axes[1, 0].set_xlabel('Feature 1')
    axes[1, 0].set_ylabel('Feature 2')
    plt.colorbar(scatter, ax=axes[1, 0])
    # Log-density contours of the fitted mixture
    axes[1, 1].scatter(X_ellipse[:, 0], X_ellipse[:, 1], c=y_pred_gmm,
                       cmap='viridis', alpha=0.6)
    x_min, x_max = X_ellipse[:, 0].min() - 2, X_ellipse[:, 0].max() + 2
    y_min, y_max = X_ellipse[:, 1].min() - 2, X_ellipse[:, 1].max() + 2
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    mesh_points = np.c_[xx.ravel(), yy.ravel()]
    Z = gmm.score_samples(mesh_points).reshape(xx.shape)
    axes[1, 1].contour(xx, yy, Z, levels=10, alpha=0.5)
    axes[1, 1].set_title('GMM Contours')
    axes[1, 1].set_xlabel('Feature 1')
    axes[1, 1].set_ylabel('Feature 2')
    plt.tight_layout()
    plt.show()
    return X_ellipse, y_pred_gmm, gmm

X_ellipse, y_pred_gmm, gmm = demonstrate_gmm()
11.6.2 Model Selection and Information Criteria
def gmm_model_selection():
    """Use information criteria to select the optimal number of GMM components"""
    n_components_range = range(1, 11)
    aic_scores = []
    bic_scores = []
    print("GMM model selection:")
    print("Components\tAIC\t\tBIC")
    print("-" * 30)
    for n_components in n_components_range:
        gmm = GaussianMixture(n_components=n_components, random_state=42)
        gmm.fit(X_ellipse)
        aic = gmm.aic(X_ellipse)
        bic = gmm.bic(X_ellipse)
        aic_scores.append(aic)
        bic_scores.append(bic)
        print(f"{n_components}\t{aic:.2f}\t\t{bic:.2f}")
    # Visualize information criteria (lower is better for both)
    plt.figure(figsize=(10, 6))
    plt.plot(n_components_range, aic_scores, 'bo-', label='AIC', linewidth=2, markersize=8)
    plt.plot(n_components_range, bic_scores, 'ro-', label='BIC', linewidth=2, markersize=8)
    plt.xlabel('Number of Components')
    plt.ylabel('Information Criterion Value')
    plt.title('GMM Model Selection - AIC vs BIC')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
    # Find the optimal number of components
    best_n_aic = n_components_range[np.argmin(aic_scores)]
    best_n_bic = n_components_range[np.argmin(bic_scores)]
    print(f"\nBest number of components based on AIC: {best_n_aic}")
    print(f"Best number of components based on BIC: {best_n_bic}")
    return best_n_aic, best_n_bic

best_n_aic, best_n_bic = gmm_model_selection()
11.7 Clustering Algorithm Comparison
11.7.1 Performance on Different Data Types
def comprehensive_clustering_comparison():
    """Comprehensively compare different clustering algorithms"""
    # Create different types of datasets
    datasets = []
    # 1. Spherical clusters
    X_blobs, y_blobs = make_blobs(n_samples=200, centers=3, random_state=42)
    datasets.append(('Spherical Clusters', X_blobs, y_blobs))
    # 2. Moon-shaped
    X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=42)
    datasets.append(('Moon-shaped', X_moons, y_moons))
    # 3. Circular
    X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=42)
    datasets.append(('Circular', X_circles, y_circles))
    # 4. Different densities
    X_varied, y_varied = make_blobs(n_samples=200, centers=3,
                                    cluster_std=[1.0, 2.5, 0.5], random_state=42)
    datasets.append(('Different Densities', X_varied, y_varied))
    # Clustering algorithms (cluster counts are adjusted per dataset below)
    algorithms = {
        'K-means': KMeans(n_clusters=2, random_state=42),
        'Hierarchical Clustering': AgglomerativeClustering(n_clusters=2, linkage='ward'),
        'DBSCAN': DBSCAN(eps=0.3, min_samples=5),
        'GMM': GaussianMixture(n_components=2, random_state=42),
        'Spectral Clustering': SpectralClustering(n_clusters=2, random_state=42)
    }
    # Create comparison plot
    fig, axes = plt.subplots(len(datasets), len(algorithms) + 1, figsize=(20, 16))
    fig.suptitle('Comparison of Different Clustering Algorithms', fontsize=16)
    print("Clustering algorithm performance comparison:")
    print("Dataset\t\tAlgorithm\t\t\tSilhouette\tARI")
    print("-" * 60)
    for i, (dataset_name, X, y_true) in enumerate(datasets):
        # Standardize data
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        # Original data
        axes[i, 0].scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', alpha=0.7)
        axes[i, 0].set_title(f'{dataset_name}\n(True Labels)')
        axes[i, 0].set_xlabel('Feature 1')
        axes[i, 0].set_ylabel('Feature 2')
        for j, (alg_name, algorithm) in enumerate(algorithms.items()):
            # Adjust the number of clusters to match the dataset
            if hasattr(algorithm, 'n_clusters'):
                algorithm.n_clusters = len(np.unique(y_true))
            elif hasattr(algorithm, 'n_components'):
                algorithm.n_components = len(np.unique(y_true))
            # Use standardized data for some algorithms
            if alg_name in ['Spectral Clustering']:
                X_input = X_scaled
            else:
                X_input = X
            try:
                y_pred = algorithm.fit_predict(X_input)
                # Calculate evaluation metrics
                if len(set(y_pred)) > 1:
                    silhouette = silhouette_score(X_input, y_pred)
                    ari = adjusted_rand_score(y_true, y_pred)
                else:
                    silhouette = -1
                    ari = -1
                print(f"{dataset_name}\t{alg_name}\t\t{silhouette:.4f}\t\t{ari:.4f}")
                # Visualize results
                axes[i, j + 1].scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', alpha=0.7)
                axes[i, j + 1].set_title(f'{alg_name}\nSilhouette: {silhouette:.3f}')
                axes[i, j + 1].set_xlabel('Feature 1')
                axes[i, j + 1].set_ylabel('Feature 2')
            except Exception as e:
                print(f"{dataset_name}\t{alg_name}\t\tError: {str(e)[:20]}")
                axes[i, j + 1].text(0.5, 0.5, 'Algorithm Failed', ha='center', va='center',
                                    transform=axes[i, j + 1].transAxes)
                axes[i, j + 1].set_title(f'{alg_name}\nFailed')
    plt.tight_layout()
    plt.show()

comprehensive_clustering_comparison()
11.8 Clustering Evaluation
11.8.1 Internal Evaluation Metrics
from sklearn.metrics import adjusted_mutual_info_score

def clustering_evaluation_metrics():
    """Demonstrate clustering evaluation metrics"""
    # Use the iris dataset
    iris = load_iris()
    X_iris = iris.data
    y_true_iris = iris.target
    # Standardize the data
    scaler = StandardScaler()
    X_iris_scaled = scaler.fit_transform(X_iris)
    # Different clustering algorithms
    algorithms = {
        'K-means': KMeans(n_clusters=3, random_state=42, n_init=10),
        'Hierarchical Clustering': AgglomerativeClustering(n_clusters=3),
        'GMM': GaussianMixture(n_components=3, random_state=42)
    }
    print("Clustering evaluation metrics comparison:")
    print("Algorithm\t\tSilhouette\tCalinski-Harabasz\tARI\t\tAMI")
    print("-" * 70)
    results = {}
    for name, algorithm in algorithms.items():
        # Clustering
        y_pred = algorithm.fit_predict(X_iris_scaled)
        # Internal evaluation metrics (no true labels needed)
        silhouette = silhouette_score(X_iris_scaled, y_pred)
        calinski_harabasz = calinski_harabasz_score(X_iris_scaled, y_pred)
        # External evaluation metrics (true labels needed)
        ari = adjusted_rand_score(y_true_iris, y_pred)
        ami = adjusted_mutual_info_score(y_true_iris, y_pred)
        results[name] = {
            'silhouette': silhouette,
            'calinski_harabasz': calinski_harabasz,
            'ari': ari,
            'ami': ami,
            'y_pred': y_pred
        }
        print(f"{name}\t{silhouette:.4f}\t\t{calinski_harabasz:.2f}\t\t{ari:.4f}\t\t{ami:.4f}")
    # Visualize evaluation results
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('Clustering Evaluation Metrics Comparison', fontsize=16)
    metrics = ['silhouette', 'calinski_harabasz', 'ari', 'ami']
    metric_names = ['Silhouette Coefficient', 'Calinski-Harabasz Index',
                    'Adjusted Rand Index', 'Adjusted Mutual Information']
    for i, (metric, metric_name) in enumerate(zip(metrics, metric_names)):
        row = i // 2
        col = i % 2
        names = list(results.keys())
        values = [results[name][metric] for name in names]
        bars = axes[row, col].bar(names, values, alpha=0.7,
                                  color=['skyblue', 'lightgreen', 'lightcoral'])
        axes[row, col].set_title(metric_name)
        axes[row, col].set_ylabel('Score')
        axes[row, col].tick_params(axis='x', rotation=45)
        # Add value labels
        for bar, value in zip(bars, values):
            height = bar.get_height()
            axes[row, col].text(bar.get_x() + bar.get_width() / 2., height + height * 0.01,
                                f'{value:.3f}', ha='center', va='bottom')
    plt.tight_layout()
    plt.show()
    return results

evaluation_results = clustering_evaluation_metrics()
11.8.2 Clustering Stability Analysis
def clustering_stability_analysis():
    """Analyze the stability of clustering algorithms"""
    # Create dataset
    X_stability, y_stability = make_blobs(n_samples=300, centers=3,
                                          n_features=2, random_state=42)
    # Run each clustering algorithm multiple times
    n_runs = 20
    algorithms = {
        'K-means': KMeans(n_clusters=3),
        'Hierarchical Clustering': AgglomerativeClustering(n_clusters=3),
        'GMM': GaussianMixture(n_components=3)
    }
    stability_results = {}
    print("Clustering stability analysis:")
    print("Algorithm\t\tMean Silhouette\tStd Dev\t\tCoefficient of Variation")
    print("-" * 55)
    for name, algorithm in algorithms.items():
        silhouette_scores = []
        for run in range(n_runs):
            # Use a different random seed each run to test stability
            # (deterministic algorithms such as hierarchical clustering are unaffected)
            if hasattr(algorithm, 'random_state'):
                algorithm.random_state = run
            y_pred = algorithm.fit_predict(X_stability)
            silhouette = silhouette_score(X_stability, y_pred)
            silhouette_scores.append(silhouette)
        mean_silhouette = np.mean(silhouette_scores)
        std_silhouette = np.std(silhouette_scores)
        cv_silhouette = std_silhouette / mean_silhouette if mean_silhouette != 0 else 0
        stability_results[name] = {
            'scores': silhouette_scores,
            'mean': mean_silhouette,
            'std': std_silhouette,
            'cv': cv_silhouette
        }
        print(f"{name}\t{mean_silhouette:.4f}\t\t{std_silhouette:.4f}\t\t{cv_silhouette:.4f}")
    # Visualize stability
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    # Box plot
    data_for_boxplot = [stability_results[name]['scores'] for name in algorithms.keys()]
    axes[0].boxplot(data_for_boxplot, labels=algorithms.keys())
    axes[0].set_title('Clustering Algorithm Stability (Box Plot)')
    axes[0].set_ylabel('Silhouette Coefficient')
    axes[0].tick_params(axis='x', rotation=45)
    axes[0].grid(True, alpha=0.3)
    # Coefficient of variation comparison (lower means more stable)
    names = list(stability_results.keys())
    cvs = [stability_results[name]['cv'] for name in names]
    bars = axes[1].bar(names, cvs, color=['skyblue', 'lightgreen', 'lightcoral'], alpha=0.7)
    axes[1].set_title('Coefficient of Variation Comparison')
    axes[1].set_ylabel('Coefficient of Variation')
    axes[1].tick_params(axis='x', rotation=45)
    axes[1].grid(True, alpha=0.3)
    # Add value labels
    for bar, cv in zip(bars, cvs):
        height = bar.get_height()
        axes[1].text(bar.get_x() + bar.get_width() / 2., height + height * 0.01,
                     f'{cv:.3f}', ha='center', va='bottom')
    plt.tight_layout()
    plt.show()
    return stability_results

stability_results = clustering_stability_analysis()
11.9 Practice Exercises
Exercise 1: Basic Clustering
- Use make_blobs to generate a clustering dataset
- Train K-means and visualize the clustering results
- Analyze the impact of different K values on model performance
Exercise 2: Hierarchical Clustering
- Create a dataset with hierarchical structure
- Apply hierarchical clustering and visualize the dendrogram
- Compare different linkage methods
Exercise 3: Density-Based Clustering
- Generate data with complex shapes (circles, moons)
- Apply DBSCAN and compare with K-means
- Analyze the impact of eps and min_samples parameters
Exercise 4: Model Selection
- Use multiple clustering algorithms on the same dataset
- Compare their performance using different evaluation metrics
- Select the best algorithm and justify your choice
11.10 Summary
In this chapter, we explored clustering analysis from several angles:
Core Concepts
- Clustering principles: Similarity-based grouping, unsupervised learning
- Algorithm types: Distance-based, hierarchical, density-based, distribution-based
- Evaluation metrics: Internal and external evaluation methods
Main Techniques
- K-means: Classic centroid-based clustering
- Hierarchical clustering: Tree-structured clustering
- DBSCAN: Density-based clustering for arbitrary shapes
- GMM: Probabilistic clustering with soft assignments
Practical Skills
- Parameter selection: Elbow method, silhouette coefficient
- Algorithm comparison: Performance analysis on different data types
- Stability analysis: Evaluating algorithm robustness
- Real applications: Customer segmentation, anomaly detection
Key Points
- Clustering is unsupervised learning that discovers data structure
- Different algorithms suit different data characteristics
- Evaluation metrics help select optimal parameters and algorithms
- Visual inspection is important for understanding clustering results
- Real-world applications require considering business context
11.11 Next Steps
Now you have mastered clustering analysis! In the next chapter Decision Trees, we will learn about supervised learning algorithms that use tree structures for classification and regression.
Chapter Highlights:
- ✅ Understood the principles of various clustering algorithms
- ✅ Mastered K-means, hierarchical clustering, DBSCAN, and GMM
- ✅ Learned parameter selection and evaluation methods
- ✅ Understood the advantages and limitations of different algorithms
- ✅ Mastered clustering evaluation metrics
- ✅ Can apply clustering to real-world problems