Chapter 11: Clustering Analysis

Clustering analysis is an important branch of unsupervised learning that groups samples based on data similarity without label information. This chapter introduces various clustering algorithms in detail, covering their principles, implementation, and applications.

11.1 What is Clustering Analysis?

Clustering analysis is an exploratory data analysis technique that aims to group similar data points together and dissimilar points into different groups. Unlike classification, clustering is unsupervised learning and does not require prior knowledge of category labels.

11.1.1 Goals of Clustering

  • Maximize intra-cluster similarity: Samples within the same cluster should be as similar as possible
  • Minimize inter-cluster similarity: Samples between different clusters should be as different as possible
  • Discover data structure: Reveal hidden patterns in the data

11.1.2 Applications of Clustering

  • Market segmentation: Group customers based on consumption behavior
  • Image segmentation: Divide images into different regions
  • Gene analysis: Classify genes based on expression patterns
  • Recommendation systems: Discover user groups and item categories
  • Anomaly detection: Identify outliers that don't belong to any cluster

11.1.3 Classification of Clustering Algorithms

  • Distance-based clustering: K-means, K-medoids
  • Hierarchical clustering: Agglomerative clustering, divisive clustering
  • Density-based clustering: DBSCAN, OPTICS
  • Distribution-based clustering: Gaussian Mixture Models
  • Grid-based clustering: STING, CLIQUE; suitable for large datasets

11.2 Preparing Environment and Data

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, make_circles, make_moons, load_iris, load_wine
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, SpectralClustering
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score, silhouette_score, calinski_harabasz_score
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform
import warnings
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)

# Set plot style
plt.style.use('seaborn-v0_8')
plt.rcParams['font.sans-serif'] = ['SimHei']  # CJK-capable font; optional if all labels are English
plt.rcParams['axes.unicode_minus'] = False    # Render minus signs correctly with this font

11.3 K-means Clustering

11.3.1 K-means Algorithm Principle

K-means is the most classic clustering algorithm, which minimizes the within-cluster sum of squares through iterative optimization.

python
def demonstrate_kmeans_principle():
    """Demonstrate the basic principle of K-means algorithm"""
    
    # Create simple clustering data
    X_demo, y_true = make_blobs(n_samples=300, centers=3, n_features=2, 
                                random_state=42, cluster_std=1.5)
    
    print("K-means algorithm steps demonstration:")
    print("1. Randomly initialize K cluster centers")
    print("2. Assign each point to the nearest cluster center")
    print("3. Update cluster centers to the mean of points in the cluster")
    print("4. Repeat steps 2-3 until convergence")
    
    # Visualize data
    plt.figure(figsize=(15, 10))
    
    # Original data
    plt.subplot(2, 3, 1)
    plt.scatter(X_demo[:, 0], X_demo[:, 1], c='gray', alpha=0.6)
    plt.title('Original Data')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.grid(True, alpha=0.3)
    
    # Manually demonstrate K-means iteration process
    k = 3
    # Initialize cluster centers from randomly chosen data points
    # (avoids the empty clusters that purely random coordinates can produce)
    centers = X_demo[np.random.choice(len(X_demo), k, replace=False)]
    
    for iteration in range(5):
        if iteration < 4:  # Only show first 4 iterations
            plt.subplot(2, 3, iteration + 2)
            
            # Calculate distance from each point to cluster centers
            distances = np.sqrt(((X_demo - centers[:, np.newaxis])**2).sum(axis=2))
            labels = np.argmin(distances, axis=0)
            
            # Plot data points and cluster centers
            colors = ['red', 'blue', 'green']
            for i in range(k):
                mask = labels == i
                plt.scatter(X_demo[mask, 0], X_demo[mask, 1], 
                           c=colors[i], alpha=0.6, s=30)
                plt.scatter(centers[i, 0], centers[i, 1], 
                           c='black', marker='x', s=200, linewidths=3)
            
            plt.title(f'Iteration {iteration + 1}')
            plt.xlabel('Feature 1')
            plt.ylabel('Feature 2')
            plt.grid(True, alpha=0.3)
            
            # Update cluster centers
            new_centers = np.array([X_demo[labels == i].mean(axis=0) for i in range(k)])
            centers = new_centers
    
    # Final result (reassign labels using the final centers)
    plt.subplot(2, 3, 6)
    distances = np.sqrt(((X_demo - centers[:, np.newaxis])**2).sum(axis=2))
    labels = np.argmin(distances, axis=0)
    for i in range(k):
        mask = labels == i
        plt.scatter(X_demo[mask, 0], X_demo[mask, 1], 
                   c=colors[i], alpha=0.6, s=30)
        plt.scatter(centers[i, 0], centers[i, 1], 
                   c='black', marker='x', s=200, linewidths=3)
    plt.title('Final Result')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return X_demo, y_true

X_demo, y_true = demonstrate_kmeans_principle()
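The objective that the steps above minimize, the within-cluster sum of squares (WCSS), can be computed directly. The following is a minimal sketch; `wcss` is a hypothetical helper name, not part of Scikit-learn:

```python
import numpy as np

def wcss(X, labels, centers):
    """Within-cluster sum of squares: the quantity K-means minimizes."""
    return sum(((X[labels == i] - c) ** 2).sum()
               for i, c in enumerate(centers))

# Tiny hand-checkable example: each point is distance 1 from its center
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[1.0, 0.0], [11.0, 0.0]])
print(wcss(X, labels, centers))  # 1 + 1 + 1 + 1 = 4.0
```

This is the same value Scikit-learn reports as `inertia_` after fitting.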

11.3.2 Implementing K-means with Scikit-learn

python
# Use Scikit-learn's K-means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X_demo)

# Get cluster centers
centers = kmeans.cluster_centers_

print("K-means clustering results:")
print(f"Cluster centers:\n{centers}")
print(f"Inertia (WCSS): {kmeans.inertia_:.2f}")

# Visualize results
plt.figure(figsize=(15, 5))

# True labels
plt.subplot(1, 3, 1)
colors = ['red', 'blue', 'green']
for i in range(3):
    mask = y_true == i
    plt.scatter(X_demo[mask, 0], X_demo[mask, 1], c=colors[i], alpha=0.6, label=f'True Cluster {i}')
plt.title('True Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, alpha=0.3)

# K-means results
plt.subplot(1, 3, 2)
for i in range(3):
    mask = y_pred == i
    plt.scatter(X_demo[mask, 0], X_demo[mask, 1], c=colors[i], alpha=0.6, label=f'Predicted Cluster {i}')
plt.scatter(centers[:, 0], centers[:, 1], c='black', marker='x', s=200, linewidths=3, label='Cluster Centers')
plt.title('K-means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, alpha=0.3)

# Comparison
plt.subplot(1, 3, 3)
plt.scatter(X_demo[:, 0], X_demo[:, 1], c=y_pred, cmap='viridis', alpha=0.6)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x', s=200, linewidths=3)
plt.title('Clustering Results (Color Coded)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
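A fitted `KMeans` model can also assign previously unseen points, which the demo above does not show. A small self-contained sketch (the data generation mirrors the earlier example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42, cluster_std=1.5)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

# predict() assigns each new point to its nearest learned centroid;
# transform() returns the distance from each point to every centroid
new_points = np.array([[0.0, 0.0], [5.0, 5.0]])
print(kmeans.predict(new_points))
print(kmeans.transform(new_points))
```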

11.3.3 Selecting Optimal K Value

python
def find_optimal_k(X, k_range):
    """Use elbow method and silhouette coefficient to select optimal K value"""
    
    inertias = []
    silhouette_scores = []
    
    print("Clustering performance for different K values:")
    print("K Value\tInertia(WCSS)\tSilhouette Coefficient")
    print("-" * 35)
    
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        y_pred = kmeans.fit_predict(X)
        
        inertia = kmeans.inertia_
        if k > 1:  # Silhouette coefficient requires at least 2 clusters
            silhouette = silhouette_score(X, y_pred)
        else:
            silhouette = 0
        
        inertias.append(inertia)
        silhouette_scores.append(silhouette)
        
        print(f"{k}\t{inertia:.2f}\t\t{silhouette:.4f}")
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # Elbow method
    axes[0].plot(k_range, inertias, 'bo-', linewidth=2, markersize=8)
    axes[0].set_xlabel('K Value')
    axes[0].set_ylabel('Inertia (WCSS)')
    axes[0].set_title('Elbow Method')
    axes[0].grid(True, alpha=0.3)
    
    # Silhouette coefficient
    axes[1].plot(k_range[1:], silhouette_scores[1:], 'ro-', linewidth=2, markersize=8)
    axes[1].set_xlabel('K Value')
    axes[1].set_ylabel('Silhouette Coefficient')
    axes[1].set_title('Silhouette Coefficient Method')
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Find best K value
    best_k_silhouette = k_range[1:][np.argmax(silhouette_scores[1:])]
    print(f"\nBest K value based on silhouette coefficient: {best_k_silhouette}")
    
    return inertias, silhouette_scores

# Find optimal K value
k_range = range(1, 11)
inertias, silhouette_scores = find_optimal_k(X_demo, k_range)
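The elbow is usually judged by eye, but a rough programmatic stand-in is to pick the K where the inertia curve bends most sharply, i.e. where the second difference of the inertias is largest. This is only a heuristic sketch (`elbow_k` is a hypothetical helper), and a visual check of the plot above remains the safer guide:

```python
import numpy as np

def elbow_k(k_values, inertias):
    """Crude elbow heuristic: K at the largest second difference
    of the inertia curve. A visual inspection is still advised."""
    second_diff = np.diff(inertias, n=2)            # length len(inertias) - 2
    return k_values[int(np.argmax(second_diff)) + 1]

# Synthetic inertia curve with an obvious bend at K = 3
ks = list(range(1, 8))
inertias = [1000, 950, 300, 250, 220, 200, 190]
print(elbow_k(ks, inertias))  # 3
```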

11.3.4 K-means Variants

python
# Compare different K-means variants
from sklearn.cluster import MiniBatchKMeans

def compare_kmeans_variants():
    """Compare different K-means variants"""
    
    # Create larger dataset
    X_large, y_large = make_blobs(n_samples=2000, centers=4, n_features=2, 
                                  random_state=42, cluster_std=2.0)
    
    # Different K-means algorithms
    algorithms = {
        'K-means': KMeans(n_clusters=4, random_state=42),
        'Mini-batch K-means': MiniBatchKMeans(n_clusters=4, random_state=42, batch_size=100)
    }
    
    results = {}
    
    print("K-means variant performance comparison:")
    print("Algorithm\t\tTraining Time\tInertia\t\tSilhouette Coefficient")
    print("-" * 60)
    
    import time
    
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    for i, (name, algorithm) in enumerate(algorithms.items()):
        # Training time
        start_time = time.time()
        y_pred = algorithm.fit_predict(X_large)
        train_time = time.time() - start_time
        
        # Performance metrics
        inertia = algorithm.inertia_
        silhouette = silhouette_score(X_large, y_pred)
        
        results[name] = {
            'time': train_time,
            'inertia': inertia,
            'silhouette': silhouette
        }
        
        print(f"{name}\t{train_time:.4f}s\t\t{inertia:.2f}\t\t{silhouette:.4f}")
        
        # Visualization
        axes[i].scatter(X_large[:, 0], X_large[:, 1], c=y_pred, cmap='viridis', alpha=0.6, s=10)
        axes[i].scatter(algorithm.cluster_centers_[:, 0], algorithm.cluster_centers_[:, 1], 
                       c='red', marker='x', s=200, linewidths=3)
        axes[i].set_title(f'{name}\nSilhouette Coefficient: {silhouette:.3f}')
        axes[i].set_xlabel('Feature 1')
        axes[i].set_ylabel('Feature 2')
        axes[i].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return results

kmeans_results = compare_kmeans_variants()
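Another variant worth knowing is the initialization strategy: Scikit-learn's default `init='k-means++'` spreads the starting centers apart, which typically converges faster and to a lower inertia than `init='random'`. A short sketch comparing the two with a single initialization each (`n_init=1`) so the effect is not averaged away:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

for init in ('k-means++', 'random'):
    # n_init=1 exposes the quality of a single initialization
    km = KMeans(n_clusters=5, init=init, n_init=1, random_state=0).fit(X)
    print(f"{init:10s}  inertia={km.inertia_:.1f}")
```

The exact numbers depend on the random seed; in practice `n_init > 1` restores robustness for either strategy.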

11.4 Hierarchical Clustering

11.4.1 Agglomerative Hierarchical Clustering

python
def demonstrate_hierarchical_clustering():
    """Demonstrate hierarchical clustering"""
    
    # Create small-scale data for demonstration
    X_hier, y_hier = make_blobs(n_samples=50, centers=3, n_features=2, 
                                random_state=42, cluster_std=1.0)
    
    # Agglomerative hierarchical clustering
    hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
    y_pred_hier = hierarchical.fit_predict(X_hier)
    
    print("Hierarchical clustering results:")
    print(f"Number of clusters: {len(np.unique(y_pred_hier))}")
    
    # Visualize results
    plt.figure(figsize=(15, 5))
    
    # Original data
    plt.subplot(1, 3, 1)
    plt.scatter(X_hier[:, 0], X_hier[:, 1], c='gray', alpha=0.6)
    plt.title('Original Data')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.grid(True, alpha=0.3)
    
    # Hierarchical clustering results
    plt.subplot(1, 3, 2)
    colors = ['red', 'blue', 'green']
    for i in range(3):
        mask = y_pred_hier == i
        plt.scatter(X_hier[mask, 0], X_hier[mask, 1], c=colors[i], alpha=0.6, label=f'Cluster {i}')
    plt.title('Hierarchical Clustering Results')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Dendrogram
    plt.subplot(1, 3, 3)
    # Calculate linkage matrix
    linkage_matrix = linkage(X_hier, method='ward')
    dendrogram(linkage_matrix, truncate_mode='level', p=5)
    plt.title('Dendrogram')
    plt.xlabel('Sample Index')
    plt.ylabel('Distance')
    
    plt.tight_layout()
    plt.show()
    
    return X_hier, y_pred_hier

X_hier, y_pred_hier = demonstrate_hierarchical_clustering()
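The linkage matrix computed for the dendrogram can also be cut directly with SciPy's `fcluster`, either at a fixed number of clusters or at a distance threshold read off the dendrogram, without re-fitting `AgglomerativeClustering`. A self-contained sketch (the threshold `t=10.0` is an illustrative value, not tuned):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42, cluster_std=1.0)
Z = linkage(X, method='ward')

# Cut the tree into (at most) a fixed number of flat clusters...
labels_k = fcluster(Z, t=3, criterion='maxclust')
# ...or cut at a merge-distance threshold taken from the dendrogram
labels_d = fcluster(Z, t=10.0, criterion='distance')

print("clusters (maxclust):", len(np.unique(labels_k)))
print("clusters (distance):", len(np.unique(labels_d)))
```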

11.4.2 Comparison of Different Linkage Methods

python
def compare_linkage_methods():
    """Compare different linkage methods"""
    
    # Create clustering data with different shapes
    X_shapes = []
    y_shapes = []
    
    # Circular clusters
    X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=42)
    X_shapes.append(('Circular', X_circles, y_circles))
    
    # Moon-shaped clusters
    X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=42)
    X_shapes.append(('Moon-shaped', X_moons, y_moons))
    
    # Spherical clusters
    X_blobs, y_blobs = make_blobs(n_samples=200, centers=3, random_state=42)
    X_shapes.append(('Spherical', X_blobs, y_blobs))
    
    # Different linkage methods
    linkage_methods = ['ward', 'complete', 'average', 'single']
    
    fig, axes = plt.subplots(len(X_shapes), len(linkage_methods) + 1, figsize=(20, 12))
    fig.suptitle('Hierarchical Clustering Comparison with Different Linkage Methods', fontsize=16)
    
    for i, (shape_name, X, y_true) in enumerate(X_shapes):
        # Original data
        axes[i, 0].scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', alpha=0.6)
        axes[i, 0].set_title(f'{shape_name} Data')
        axes[i, 0].set_xlabel('Feature 1')
        axes[i, 0].set_ylabel('Feature 2')
        
        # Different linkage methods
        for j, linkage_method in enumerate(linkage_methods):
            n_clusters = len(np.unique(y_true))
            
            # Note: 'ward' linkage requires Euclidean distance (the default metric)
            hierarchical = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage_method)
            
            y_pred = hierarchical.fit_predict(X)
            
            axes[i, j + 1].scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', alpha=0.6)
            axes[i, j + 1].set_title(f'{linkage_method} Linkage')
            axes[i, j + 1].set_xlabel('Feature 1')
            axes[i, j + 1].set_ylabel('Feature 2')
    
    plt.tight_layout()
    plt.show()

compare_linkage_methods()

11.5 Density-Based Clustering (DBSCAN)

11.5.1 DBSCAN Algorithm Principle

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that can discover clusters of arbitrary shapes and identify noise points.

python
def demonstrate_dbscan_principle():
    """Demonstrate DBSCAN algorithm principle"""
    
    print("DBSCAN algorithm core concepts:")
    print("1. Core points: Points with at least min_samples points in their ε-neighborhood")
    print("2. Border points: Points that are not core points but are in the ε-neighborhood of a core point")
    print("3. Noise points: Points that are neither core points nor border points")
    print("4. Density reachable: Points connected through a sequence of core points")
    
    # Create data containing noise
    X_noise, _ = make_blobs(n_samples=200, centers=3, n_features=2, 
                           random_state=42, cluster_std=1.0)
    
    # Add noise points
    noise_points = np.random.uniform(-8, 8, (20, 2))
    X_dbscan = np.vstack([X_noise, noise_points])
    
    # DBSCAN clustering
    dbscan = DBSCAN(eps=1.5, min_samples=5)
    y_pred_dbscan = dbscan.fit_predict(X_dbscan)
    
    # Statistics
    n_clusters = len(set(y_pred_dbscan)) - (1 if -1 in y_pred_dbscan else 0)
    n_noise = list(y_pred_dbscan).count(-1)
    
    print(f"\nDBSCAN clustering results:")
    print(f"Number of clusters: {n_clusters}")
    print(f"Number of noise points: {n_noise}")
    print(f"Number of core points: {len(dbscan.core_sample_indices_)}")
    
    # Visualize results
    plt.figure(figsize=(15, 5))
    
    # Original data
    plt.subplot(1, 3, 1)
    plt.scatter(X_dbscan[:, 0], X_dbscan[:, 1], c='gray', alpha=0.6)
    plt.title('Original Data (With Noise)')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.grid(True, alpha=0.3)
    
    # DBSCAN results
    plt.subplot(1, 3, 2)
    unique_labels = sorted(set(y_pred_dbscan))  # sorted for a deterministic color/label order
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Noise points in black
            col = 'black'
            marker = 'x'
            alpha = 0.5
            label = 'Noise Points'
        else:
            marker = 'o'
            alpha = 0.7
            label = f'Cluster {k}'
        
        class_member_mask = (y_pred_dbscan == k)
        xy = X_dbscan[class_member_mask]
        plt.scatter(xy[:, 0], xy[:, 1], c=[col], marker=marker, alpha=alpha, s=50, label=label)
    
    plt.title('DBSCAN Clustering Results')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Core point markers
    plt.subplot(1, 3, 3)
    # Plot all points
    plt.scatter(X_dbscan[:, 0], X_dbscan[:, 1], c=y_pred_dbscan, 
               cmap='Spectral', alpha=0.6, s=50)
    
    # Mark core points
    core_samples = X_dbscan[dbscan.core_sample_indices_]
    plt.scatter(core_samples[:, 0], core_samples[:, 1], 
               facecolors='none', edgecolors='red', s=100, linewidths=2, label='Core Points')
    
    plt.title('Core Point Markers')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return X_dbscan, y_pred_dbscan

X_dbscan, y_pred_dbscan = demonstrate_dbscan_principle()

11.5.2 DBSCAN Parameter Tuning

python
def tune_dbscan_parameters():
    """Tune DBSCAN parameters"""
    
    # Use moon-shaped data to test DBSCAN
    X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=42)
    
    # Different eps and min_samples combinations
    eps_values = [0.1, 0.2, 0.3, 0.5]
    min_samples_values = [3, 5, 10, 15]
    
    fig, axes = plt.subplots(len(eps_values), len(min_samples_values), figsize=(16, 12))
    fig.suptitle('DBSCAN Parameter Tuning', fontsize=16)
    
    print("DBSCAN parameter tuning results:")
    print("eps\tmin_samples\tClusters\tNoise Points\tSilhouette Coefficient")
    print("-" * 55)
    
    best_score = -1
    best_params = None
    
    for i, eps in enumerate(eps_values):
        for j, min_samples in enumerate(min_samples_values):
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            y_pred = dbscan.fit_predict(X_moons)
            
            n_clusters = len(set(y_pred)) - (1 if -1 in y_pred else 0)
            n_noise = list(y_pred).count(-1)
            
            # Calculate silhouette coefficient (requires at least 2 clusters and not all noise)
            if n_clusters >= 2 and len(set(y_pred)) > 1:
                # Exclude noise points when calculating silhouette coefficient
                mask = y_pred != -1
                if np.sum(mask) > 0:
                    silhouette = silhouette_score(X_moons[mask], y_pred[mask])
                else:
                    silhouette = -1
            else:
                silhouette = -1
            
            print(f"{eps}\t{min_samples}\t\t{n_clusters}\t{n_noise}\t\t{silhouette:.4f}")
            
            # Record best parameters
            if silhouette > best_score:
                best_score = silhouette
                best_params = (eps, min_samples)
            
            # Visualization
            axes[i, j].scatter(X_moons[:, 0], X_moons[:, 1], c=y_pred, 
                             cmap='Spectral', alpha=0.7, s=30)
            axes[i, j].set_title(f'eps={eps}, min_samples={min_samples}\n'
                                f'Clusters={n_clusters}, Silhouette={silhouette:.3f}')
            axes[i, j].set_xlabel('Feature 1')
            axes[i, j].set_ylabel('Feature 2')
    
    plt.tight_layout()
    plt.show()
    
    print(f"\nBest parameters: eps={best_params[0]}, min_samples={best_params[1]}")
    print(f"Best silhouette coefficient: {best_score:.4f}")
    
    return best_params

best_dbscan_params = tune_dbscan_parameters()
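A common complement to the grid search above is the k-distance heuristic: sort every point's distance to its `min_samples`-th nearest neighbor and look for the knee of the curve, which suggests a value for `eps`. A sketch using `NearestNeighbors` (the 90th-percentile printout is only a crude stand-in for visually locating the knee):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)

min_samples = 5
# Distance from each point to its min_samples-th nearest neighbor
# (the query set is the training set, so each point counts itself,
# matching how DBSCAN's min_samples includes the point)
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# Plotting k_dist and reading off the knee is the usual practice;
# a high quantile gives a rough numeric proxy
print(f"90th percentile k-distance: {np.quantile(k_dist, 0.9):.3f}")
```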

11.6 Gaussian Mixture Models (GMM)

11.6.1 GMM Basic Principles

python
def demonstrate_gmm():
    """Demonstrate Gaussian Mixture Models"""
    
    print("Gaussian Mixture Model (GMM) characteristics:")
    print("1. Assumes data comes from a mixture of multiple Gaussian distributions")
    print("2. Uses EM algorithm for parameter estimation")
    print("3. Provides soft clustering (probability assignment)")
    print("4. Can handle elliptical clusters")
    
    # Create elliptical cluster data
    X_ellipse, y_ellipse = make_blobs(n_samples=300, centers=3, n_features=2, 
                                     random_state=42, cluster_std=2.0)
    
    # Transform data to make it elliptical
    transformation_matrix = np.array([[1, 0.5], [0.5, 1]])
    X_ellipse = X_ellipse.dot(transformation_matrix)
    
    # Gaussian Mixture Model
    gmm = GaussianMixture(n_components=3, random_state=42)
    y_pred_gmm = gmm.fit_predict(X_ellipse)
    y_proba_gmm = gmm.predict_proba(X_ellipse)
    
    print(f"\nGMM clustering results:")
    print(f"Number of clusters: {gmm.n_components}")
    print(f"Converged: {gmm.converged_}")
    print(f"Number of iterations: {gmm.n_iter_}")
    print(f"Log likelihood: {gmm.score(X_ellipse):.2f}")
    
    # Visualize results
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle('Gaussian Mixture Model Clustering', fontsize=16)
    
    # Original data
    axes[0, 0].scatter(X_ellipse[:, 0], X_ellipse[:, 1], c='gray', alpha=0.6)
    axes[0, 0].set_title('Original Data')
    axes[0, 0].set_xlabel('Feature 1')
    axes[0, 0].set_ylabel('Feature 2')
    axes[0, 0].grid(True, alpha=0.3)
    
    # GMM clustering results
    colors = ['red', 'blue', 'green']
    for i in range(3):
        mask = y_pred_gmm == i
        axes[0, 1].scatter(X_ellipse[mask, 0], X_ellipse[mask, 1], 
                          c=colors[i], alpha=0.6, label=f'Cluster {i}')
    axes[0, 1].set_title('GMM Clustering Results')
    axes[0, 1].set_xlabel('Feature 1')
    axes[0, 1].set_ylabel('Feature 2')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    
    # Probability assignment (soft clustering)
    # Show probability of each point belonging to the first cluster
    scatter = axes[1, 0].scatter(X_ellipse[:, 0], X_ellipse[:, 1], 
                                c=y_proba_gmm[:, 0], cmap='Reds', alpha=0.7)
    axes[1, 0].set_title('Probability of Belonging to Cluster 0')
    axes[1, 0].set_xlabel('Feature 1')
    axes[1, 0].set_ylabel('Feature 2')
    plt.colorbar(scatter, ax=axes[1, 0])
    
    # Gaussian distribution contours
    axes[1, 1].scatter(X_ellipse[:, 0], X_ellipse[:, 1], c=y_pred_gmm, 
                      cmap='viridis', alpha=0.6)
    
    # Draw Gaussian distribution contours
    x_min, x_max = X_ellipse[:, 0].min() - 2, X_ellipse[:, 0].max() + 2
    y_min, y_max = X_ellipse[:, 1].min() - 2, X_ellipse[:, 1].max() + 2
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                        np.linspace(y_min, y_max, 100))
    
    mesh_points = np.c_[xx.ravel(), yy.ravel()]
    Z = gmm.score_samples(mesh_points)
    Z = Z.reshape(xx.shape)
    
    axes[1, 1].contour(xx, yy, Z, levels=10, alpha=0.5)
    axes[1, 1].set_title('GMM Contours')
    axes[1, 1].set_xlabel('Feature 1')
    axes[1, 1].set_ylabel('Feature 2')
    
    plt.tight_layout()
    plt.show()
    
    return X_ellipse, y_pred_gmm, gmm

X_ellipse, y_pred_gmm, gmm = demonstrate_gmm()
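Two GMM capabilities touched on above deserve a concrete look: the soft assignments from `predict_proba` sum to 1 across components, and because the model is generative, `sample` can draw new synthetic points. A self-contained sketch:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)

# Soft assignment: each row of probabilities sums to 1
proba = gmm.predict_proba(X[:3])
print(proba.round(3), proba.sum(axis=1))

# Generative use: draw new points, along with the component each came from
X_new, components = gmm.sample(5)
print(X_new.shape, components)
```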

11.6.2 Model Selection and Information Criteria

python
def gmm_model_selection():
    """Use information criteria to select the optimal number of GMM components"""
    
    n_components_range = range(1, 11)
    aic_scores = []
    bic_scores = []
    
    print("GMM model selection:")
    print("Components\tAIC\t\tBIC")
    print("-" * 30)
    
    for n_components in n_components_range:
        gmm = GaussianMixture(n_components=n_components, random_state=42)
        gmm.fit(X_ellipse)
        
        aic = gmm.aic(X_ellipse)
        bic = gmm.bic(X_ellipse)
        
        aic_scores.append(aic)
        bic_scores.append(bic)
        
        print(f"{n_components}\t{aic:.2f}\t\t{bic:.2f}")
    
    # Visualize information criteria
    plt.figure(figsize=(10, 6))
    plt.plot(n_components_range, aic_scores, 'bo-', label='AIC', linewidth=2, markersize=8)
    plt.plot(n_components_range, bic_scores, 'ro-', label='BIC', linewidth=2, markersize=8)
    
    plt.xlabel('Number of Components')
    plt.ylabel('Information Criterion Value')
    plt.title('GMM Model Selection - AIC vs BIC')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
    
    # Find optimal number of components
    best_n_aic = n_components_range[np.argmin(aic_scores)]
    best_n_bic = n_components_range[np.argmin(bic_scores)]
    
    print(f"\nBest number of components based on AIC: {best_n_aic}")
    print(f"Best number of components based on BIC: {best_n_bic}")
    
    return best_n_aic, best_n_bic

best_n_aic, best_n_bic = gmm_model_selection()
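Model selection for GMM is not only about the number of components: the `covariance_type` parameter trades flexibility against parameter count, and can be compared with BIC in the same spirit as above. A sketch over the four supported settings:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42, cluster_std=2.0)

# 'full' fits one full covariance matrix per component; 'tied' shares one
# across components; 'diag' and 'spherical' are progressively more constrained
for cov_type in ('full', 'tied', 'diag', 'spherical'):
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          random_state=42).fit(X)
    print(f"{cov_type:10s}  BIC={gmm.bic(X):.1f}")
```

Lower BIC is better; more flexible covariance types only win when the extra parameters are justified by the data.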

11.7 Clustering Algorithm Comparison

11.7.1 Performance on Different Data Types

python
def comprehensive_clustering_comparison():
    """Comprehensively compare different clustering algorithms"""
    
    # Create different types of datasets
    datasets = []
    
    # 1. Spherical clusters
    X_blobs, y_blobs = make_blobs(n_samples=200, centers=3, random_state=42)
    datasets.append(('Spherical Clusters', X_blobs, y_blobs))
    
    # 2. Moon-shaped
    X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=42)
    datasets.append(('Moon-shaped', X_moons, y_moons))
    
    # 3. Circular
    X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=42)
    datasets.append(('Circular', X_circles, y_circles))
    
    # 4. Different densities
    X_varied, y_varied = make_blobs(n_samples=200, centers=3, 
                                   cluster_std=[1.0, 2.5, 0.5], random_state=42)
    datasets.append(('Different Densities', X_varied, y_varied))
    
    # Clustering algorithms
    algorithms = {
        'K-means': KMeans(n_clusters=2, random_state=42),
        'Hierarchical Clustering': AgglomerativeClustering(n_clusters=2, linkage='ward'),
        'DBSCAN': DBSCAN(eps=0.3, min_samples=5),
        'GMM': GaussianMixture(n_components=2, random_state=42),
        'Spectral Clustering': SpectralClustering(n_clusters=2, random_state=42)
    }
    
    # Create comparison plot
    fig, axes = plt.subplots(len(datasets), len(algorithms) + 1, figsize=(20, 16))
    fig.suptitle('Comparison of Different Clustering Algorithms', fontsize=16)
    
    print("Clustering algorithm performance comparison:")
    print("Dataset\t\tAlgorithm\t\t\tSilhouette\tARI")
    print("-" * 60)
    
    for i, (dataset_name, X, y_true) in enumerate(datasets):
        # Standardize data
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        # Original data
        axes[i, 0].scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', alpha=0.7)
        axes[i, 0].set_title(f'{dataset_name}\n(True Labels)')
        axes[i, 0].set_xlabel('Feature 1')
        axes[i, 0].set_ylabel('Feature 2')
        
        for j, (alg_name, algorithm) in enumerate(algorithms.items()):
            # Adjust number of clusters
            if hasattr(algorithm, 'n_clusters'):
                algorithm.n_clusters = len(np.unique(y_true))
            elif hasattr(algorithm, 'n_components'):
                algorithm.n_components = len(np.unique(y_true))
            
            # Use standardized data for some algorithms
            if alg_name in ['Spectral Clustering']:
                X_input = X_scaled
            else:
                X_input = X
            
            try:
                y_pred = algorithm.fit_predict(X_input)
                
                # Calculate evaluation metrics
                if len(set(y_pred)) > 1:
                    silhouette = silhouette_score(X_input, y_pred)
                    ari = adjusted_rand_score(y_true, y_pred)
                else:
                    silhouette = -1
                    ari = -1
                
                print(f"{dataset_name}\t{alg_name}\t\t{silhouette:.4f}\t\t{ari:.4f}")
                
                # Visualize results
                axes[i, j + 1].scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', alpha=0.7)
                axes[i, j + 1].set_title(f'{alg_name}\nSilhouette: {silhouette:.3f}')
                axes[i, j + 1].set_xlabel('Feature 1')
                axes[i, j + 1].set_ylabel('Feature 2')
                
            except Exception as e:
                print(f"{dataset_name}\t{alg_name}\t\tError: {str(e)[:20]}")
                axes[i, j + 1].text(0.5, 0.5, 'Algorithm Failed', ha='center', va='center', 
                                   transform=axes[i, j + 1].transAxes)
                axes[i, j + 1].set_title(f'{alg_name}\nFailed')
    
    plt.tight_layout()
    plt.show()

comprehensive_clustering_comparison()

11.8 Clustering Evaluation

11.8.1 Internal Evaluation Metrics

python
def clustering_evaluation_metrics():
    """Demonstrate clustering evaluation metrics"""
    
    # Use iris dataset
    iris = load_iris()
    X_iris = iris.data
    y_true_iris = iris.target
    
    # Standardize data
    scaler = StandardScaler()
    X_iris_scaled = scaler.fit_transform(X_iris)
    
    # Different clustering algorithms
    algorithms = {
        'K-means': KMeans(n_clusters=3, random_state=42),
        'Hierarchical Clustering': AgglomerativeClustering(n_clusters=3),
        'GMM': GaussianMixture(n_components=3, random_state=42)
    }
    
    print("Clustering evaluation metrics comparison:")
    print("Algorithm\t\tSilhouette\tCalinski-Harabasz\tARI\t\tMutual Information")
    print("-" * 70)
    
    results = {}
    
    for name, algorithm in algorithms.items():
        # Clustering
        y_pred = algorithm.fit_predict(X_iris_scaled)
        
        # Internal evaluation metrics (no true labels needed)
        silhouette = silhouette_score(X_iris_scaled, y_pred)
        calinski_harabasz = calinski_harabasz_score(X_iris_scaled, y_pred)
        
        # External evaluation metrics (true labels needed)
        from sklearn.metrics import adjusted_mutual_info_score
        ari = adjusted_rand_score(y_true_iris, y_pred)
        ami = adjusted_mutual_info_score(y_true_iris, y_pred)
        
        results[name] = {
            'silhouette': silhouette,
            'calinski_harabasz': calinski_harabasz,
            'ari': ari,
            'ami': ami,
            'y_pred': y_pred
        }
        
        print(f"{name}\t{silhouette:.4f}\t\t{calinski_harabasz:.2f}\t\t{ari:.4f}\t\t{ami:.4f}")
    
    # Visualize evaluation results
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('Clustering Evaluation Metrics Comparison', fontsize=16)
    
    metrics = ['silhouette', 'calinski_harabasz', 'ari', 'ami']
    metric_names = ['Silhouette Coefficient', 'Calinski-Harabasz Index', 'Adjusted Rand Index', 'Adjusted Mutual Information']
    
    for i, (metric, metric_name) in enumerate(zip(metrics, metric_names)):
        row = i // 2
        col = i % 2
        
        names = list(results.keys())
        values = [results[name][metric] for name in names]
        
        bars = axes[row, col].bar(names, values, alpha=0.7, 
                                 color=['skyblue', 'lightgreen', 'lightcoral'])
        axes[row, col].set_title(metric_name)
        axes[row, col].set_ylabel('Score')
        axes[row, col].tick_params(axis='x', rotation=45)
        
        # Add value labels
        for bar, value in zip(bars, values):
            height = bar.get_height()
            axes[row, col].text(bar.get_x() + bar.get_width()/2., height + height*0.01,
                               f'{value:.3f}', ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()
    
    return results

evaluation_results = clustering_evaluation_metrics()

11.8.2 Clustering Stability Analysis

python
def clustering_stability_analysis():
    """Analyze the stability of clustering algorithms"""
    
    # Create dataset
    X_stability, y_stability = make_blobs(n_samples=300, centers=3, 
                                         n_features=2, random_state=42)
    
    # Run clustering algorithms multiple times
    n_runs = 20
    algorithms = {
        'K-means': KMeans(n_clusters=3),
        'Hierarchical Clustering': AgglomerativeClustering(n_clusters=3),
        'GMM': GaussianMixture(n_components=3)
    }
    
    stability_results = {}
    
    print("Clustering stability analysis:")
    print("Algorithm\t\tMean Silhouette\tStd Dev\t\tCoefficient of Variation")
    print("-" * 55)
    
    for name, algorithm in algorithms.items():
        silhouette_scores = []
        
        for run in range(n_runs):
            # Vary the random seed between runs to test initialization stability
            # (AgglomerativeClustering has no random_state and is deterministic,
            # so its scores will be identical across runs)
            if hasattr(algorithm, 'random_state'):
                algorithm.random_state = run
            
            y_pred = algorithm.fit_predict(X_stability)
            silhouette = silhouette_score(X_stability, y_pred)
            silhouette_scores.append(silhouette)
        
        mean_silhouette = np.mean(silhouette_scores)
        std_silhouette = np.std(silhouette_scores)
        cv_silhouette = std_silhouette / mean_silhouette if mean_silhouette != 0 else 0
        
        stability_results[name] = {
            'scores': silhouette_scores,
            'mean': mean_silhouette,
            'std': std_silhouette,
            'cv': cv_silhouette
        }
        
        print(f"{name}\t{mean_silhouette:.4f}\t\t{std_silhouette:.4f}\t\t{cv_silhouette:.4f}")
    
    # Visualize stability
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # Box plot
    data_for_boxplot = [stability_results[name]['scores'] for name in algorithms.keys()]
    axes[0].boxplot(data_for_boxplot, labels=algorithms.keys())
    axes[0].set_title('Clustering Algorithm Stability (Box Plot)')
    axes[0].set_ylabel('Silhouette Coefficient')
    axes[0].tick_params(axis='x', rotation=45)
    axes[0].grid(True, alpha=0.3)
    
    # Coefficient of variation comparison
    names = list(stability_results.keys())
    cvs = [stability_results[name]['cv'] for name in names]
    
    bars = axes[1].bar(names, cvs, color=['skyblue', 'lightgreen', 'lightcoral'], alpha=0.7)
    axes[1].set_title('Coefficient of Variation Comparison')
    axes[1].set_ylabel('Coefficient of Variation')
    axes[1].tick_params(axis='x', rotation=45)
    axes[1].grid(True, alpha=0.3)
    
    # Add value labels
    for bar, cv in zip(bars, cvs):
        height = bar.get_height()
        axes[1].text(bar.get_x() + bar.get_width()/2., height + height*0.01,
                   f'{cv:.3f}', ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()
    
    return stability_results

stability_results = clustering_stability_analysis()
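The seed-based test above only probes sensitivity to initialization. A complementary check, sketched below with our own function and parameter names (`bootstrap_stability`, `subsample`), refits the model on random subsamples of the data and measures agreement between runs with the Adjusted Rand Index: a stable algorithm should produce nearly identical partitions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

def bootstrap_stability(X, n_clusters=3, n_pairs=10, subsample=0.8, seed=0):
    """Mean pairwise ARI between clusterings fit on random subsamples."""
    rng = np.random.default_rng(seed)
    n, m = len(X), int(subsample * len(X))
    aris = []
    for i in range(n_pairs):
        # Fit two models on different subsamples, then label the full dataset
        idx_a = rng.choice(n, size=m, replace=False)
        idx_b = rng.choice(n, size=m, replace=False)
        km_a = KMeans(n_clusters=n_clusters, n_init=10, random_state=i).fit(X[idx_a])
        km_b = KMeans(n_clusters=n_clusters, n_init=10, random_state=i + 1).fit(X[idx_b])
        # ARI is invariant to label permutations, so no cluster matching is needed
        aris.append(adjusted_rand_score(km_a.predict(X), km_b.predict(X)))
    return float(np.mean(aris))

X_boot, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)
print(f"Mean pairwise ARI over subsample pairs: {bootstrap_stability(X_boot):.3f}")
```

Values near 1 indicate the partition barely changes when the data are perturbed; values well below 1 suggest the cluster structure is fragile or the chosen number of clusters is wrong.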

11.9 Practice Exercises

Exercise 1: Basic Clustering

  1. Use make_blobs to generate a clustering dataset
  2. Train K-means and visualize the clustering results
  3. Analyze the impact of different K values on model performance
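A minimal starting point for Exercise 1 (variable names are our own); it generates the data and prints the inertia for a range of K values, which is exactly the quantity the elbow method plots:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)

# Inertia = within-cluster sum of squared distances; it always decreases
# as K grows, and the "elbow" where the curve flattens suggests a good K.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
    print(f"K={k}: inertia={km.inertia_:.1f}")
```

Plotting `inertias` as a line chart and drawing a scatter plot of `X` colored by `km.labels_` completes steps 2 and 3.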

Exercise 2: Hierarchical Clustering

  1. Create a dataset with hierarchical structure
  2. Apply hierarchical clustering and visualize the dendrogram
  3. Compare different linkage methods
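For Exercise 2, SciPy's `scipy.cluster.hierarchy` module (not among the chapter's imports, so assumed available) provides both the linkage matrix and the dendrogram plot. A sketch comparing linkage methods:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, n_features=2, random_state=42)

for method in ['ward', 'complete', 'average', 'single']:
    Z = linkage(X, method=method)                     # (n-1) x 4 merge history
    labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters
    sizes = np.bincount(labels)[1:]                   # fcluster labels start at 1
    print(f"{method:>8}: cluster sizes = {sizes.tolist()}")
```

Calling `scipy.cluster.hierarchy.dendrogram(Z)` inside a matplotlib figure draws the tree itself. On well-separated blobs all four methods usually agree; the differences show up on elongated or noisy clusters.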

Exercise 3: Density-Based Clustering

  1. Generate data with complex shapes (circles, moons)
  2. Apply DBSCAN and compare with K-means
  3. Analyze the impact of eps and min_samples parameters
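A sketch for Exercise 3 (the `eps` value is our choice, not prescribed by the exercise): on the two-moons data, DBSCAN follows the curved shapes while K-means, which assumes roughly convex clusters, cuts across them.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=42)

db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# ARI against the generating labels: 1.0 is a perfect match
print(f"DBSCAN  ARI: {adjusted_rand_score(y, db_labels):.3f}")
print(f"K-means ARI: {adjusted_rand_score(y, km_labels):.3f}")
```

Re-running with a larger `eps` (e.g. 0.5) merges the two moons into one cluster, and a larger `min_samples` marks sparse edge points as noise (label -1), which is a good way to explore step 3.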

Exercise 4: Model Selection

  1. Use multiple clustering algorithms on the same dataset
  2. Compare their performance using different evaluation metrics
  3. Select the best algorithm and justify your choice

11.10 Summary

In this chapter, we explored clustering analysis in depth:

Core Concepts

  • Clustering principles: Similarity-based grouping, unsupervised learning
  • Algorithm types: Distance-based, hierarchical, density-based, distribution-based
  • Evaluation metrics: Internal and external evaluation methods

Main Techniques

  • K-means: Classic centroid-based clustering
  • Hierarchical clustering: Tree-structured clustering
  • DBSCAN: Density-based clustering for arbitrary shapes
  • GMM: Probabilistic clustering with soft assignments

Practical Skills

  • Parameter selection: Elbow method, silhouette coefficient
  • Algorithm comparison: Performance analysis on different data types
  • Stability analysis: Evaluating algorithm robustness
  • Real applications: Customer segmentation, anomaly detection

Key Points

  • Clustering is unsupervised learning that discovers data structure
  • Different algorithms suit different data characteristics
  • Evaluation metrics help select optimal parameters and algorithms
  • Visual inspection is important for understanding clustering results
  • Real-world applications require considering business context

11.11 Next Steps

Now you have mastered clustering analysis! In the next chapter Decision Trees, we will learn about supervised learning algorithms that use tree structures for classification and regression.


Chapter Highlights:

  • ✅ Understood the principles of various clustering algorithms
  • ✅ Mastered K-means, hierarchical clustering, DBSCAN, and GMM
  • ✅ Learned parameter selection and evaluation methods
  • ✅ Understood the advantages and limitations of different algorithms
  • ✅ Mastered clustering evaluation metrics
  • ✅ Can apply clustering to real-world problems

Content is for learning and research only.